VibeVoice
'''VibeVoice''' is an experimental [[text-to-speech]] (TTS) framework developed by [[Microsoft Research]] for generating long-form, multi-speaker conversational audio such as podcasts and audiobooks. Released in August 2025, it supports up to four speakers and voice cloning.<ref>https://github.com/microsoft/VibeVoice</ref>

== Development and Release ==
VibeVoice was developed by a team at Microsoft Research led by Zhiliang Peng, Jianwei Yu, and others; the technical report was published on [[arXiv]] in August 2025.<ref>https://arxiv.org/abs/2508.19205</ref> The project was initially released as open-source software on GitHub and [[Hugging Face]], with model weights made publicly available under the MIT license.

In September 2025, Microsoft removed the official repository and model weights from public access. According to Microsoft's statement, this action was taken after discovering "instances where the tool was used in ways inconsistent with the stated intent" and amid concerns about responsible AI use.<ref>https://github.com/microsoft/VibeVoice</ref>

The repository was later restored without implementation code. The 1.5B pretrained model remains available on Microsoft's Hugging Face page, but the 7B model was taken down. Community-maintained forks have preserved backups of the original code and model weights, including both the 1.5B and 7B models.<ref>https://github.com/vibevoice-community/VibeVoice</ref><ref>https://huggingface.co/aoi-ot/VibeVoice-Large</ref>

== Technical Architecture ==
VibeVoice uses a hybrid architecture combining large language models with diffusion-based audio generation.
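As a rough illustration of why a low frame rate matters for long-form generation, the following sketch works out how many frames the language model must handle for a full-length output. The 7.5 Hz frame rate and the 90-minute maximum are the figures given in this article. The baseline codec values (75 frames per second, 8 codebooks) are assumptions chosen only for comparison and are not taken from the article. With these assumed values the ratio comes out to 80, but the article does not state how the researchers derived their own figure.

```python
# Figures stated in the article.
FRAME_RATE_HZ = 7.5   # tokenizer frame rate
MAX_MINUTES = 90      # maximum generation length

# Hypothetical baseline codec, assumed for illustration only
# (these two values do not come from the article).
ASSUMED_BASELINE_FRAME_RATE_HZ = 75.0
ASSUMED_BASELINE_CODEBOOKS = 8


def frames_for(minutes: float, frame_rate_hz: float) -> int:
    """Number of frames needed to cover `minutes` of audio."""
    return int(minutes * 60 * frame_rate_hz)


# Frames the model must process for a maximum-length output.
vibevoice_frames = frames_for(MAX_MINUTES, FRAME_RATE_HZ)

# Tokens the assumed baseline would need: one token per codebook per frame.
baseline_tokens = (
    frames_for(MAX_MINUTES, ASSUMED_BASELINE_FRAME_RATE_HZ)
    * ASSUMED_BASELINE_CODEBOOKS
)

ratio = baseline_tokens / vibevoice_frames

print(f"Frames at 7.5 Hz for 90 minutes: {vibevoice_frames}")
print(f"Tokens for the assumed baseline: {baseline_tokens}")
print(f"Sequence-length ratio: {ratio:.0f}x")
```

At 7.5 frames per second, 90 minutes of audio corresponds to 40,500 frames, a sequence length that fits in the context window of a large language model.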
The system uses two specialized tokenizers operating at an ultra-low 7.5 Hz frame rate:
* '''Acoustic Tokenizer''': A [[variational autoencoder]] (VAE) based encoder-decoder that compresses audio signals while preserving fidelity
* '''Semantic Tokenizer''': A content-focused encoder trained using [[automatic speech recognition]] as a proxy task

The core model uses [[Qwen2.5]] as its base large language model (available in 1.5B and 7B parameter variants), integrated with a lightweight diffusion head for generating acoustic features. The researchers claim that this design achieves an 80-fold improvement in data compression compared to the [[Encodec]] model while maintaining audio quality.<ref>https://arxiv.org/abs/2508.19205</ref>

== Capabilities and Limitations ==
VibeVoice can generate speech sequences up to 90 minutes in length with support for up to four distinct speakers. The model demonstrates several emergent capabilities not explicitly trained for, including:
* Cross-lingual speech synthesis
* Spontaneous singing (though often off-key)
* Contextual background music generation
* Voice cloning from short prompts

However, the system has notable limitations:
* Language support restricted to English and Chinese
* No explicit modeling of overlapping speech
* Occasional instability, particularly with Chinese text synthesis
* Uncontrolled generation of background sounds and music
* Limited commercial viability due to various technical constraints<ref>https://github.com/microsoft/VibeVoice</ref>

== Performance and Evaluation ==
In comparative evaluations against contemporary TTS systems including [[ElevenLabs]], [[Google]]'s Gemini 2.5 Pro TTS, and others, VibeVoice reportedly achieved superior scores in subjective metrics of realism, richness, and user preference.
The model also demonstrated competitive [[word error rate]]s when evaluated using speech recognition systems.<ref>https://huggingface.co/microsoft/VibeVoice-1.5B</ref> However, these evaluations were conducted on a limited test set of eight conversational transcripts totaling approximately one hour of audio, raising questions about how well the results generalize to broader use cases.

== Controversies and Concerns ==
The temporary removal of VibeVoice from public access highlighted ongoing concerns about the misuse of high-quality synthetic speech technology. Microsoft explicitly warned about the potential for creating [[deepfake]] audio content for impersonation, fraud, or disinformation purposes. The model's ability to generate convincing speech from minimal voice prompts, combined with its long-form generation capabilities, raised particular concerns among AI safety researchers about the creation of fake audio content at scale.

== Community Response ==
Following Microsoft's temporary withdrawal of the official release, the open-source community launched several preservation efforts:
* '''[https://github.com/vibevoice-community/VibeVoice vibevoice-community/VibeVoice]''': A community-maintained fork preserving the original codebase and model weights
* '''[https://github.com/voicepowered-ai/VibeVoice-finetuning VibeVoice-finetuning]''': Unofficial tools for fine-tuning the models using [[Low-Rank Adaptation]] (LoRA) techniques

These community efforts have enabled continued research and development despite the official restrictions, though they operate independently of Microsoft's oversight.

== External Links ==
* [https://github.com/vibevoice-community/VibeVoice Community-maintained VibeVoice repository]
* [https://arxiv.org/abs/2508.19205 Original technical report on arXiv]

[[Category:Artificial intelligence]]
[[Category:Speech synthesis]]
[[Category:Microsoft Research]]
[[Category:Open-source software]]