
{{Infobox TTS model
| name = VibeVoice
| developer = [[Microsoft Research]]
| release_date = August 26, 2025
| latest_version = 7B
| architecture = [[Qwen]] 2.5 + Diffusion
| parameters = 1.5B / 7B
| training_data = Proprietary dataset
| languages = English, Chinese
| voices = 4 speakers maximum
| voice_cloning = Yes
| emotion_control = Limited
| streaming = Yes
| license = MIT
| open_source = Limited (code removed)
| code_repository = [https://github.com/vibevoice-community/VibeVoice Community fork]
| model_weights = [https://huggingface.co/vibevoice Community backup]
| website = [https://aka.ms/VibeVoice Microsoft]
}}

'''VibeVoice''' is an experimental [[text-to-speech]] (TTS) framework developed by [[Microsoft Research]] for generating long-form, multi-speaker conversational audio. Released in August 2025, it is designed to synthesize content such as podcasts and audiobooks with up to four speakers, and it supports voice cloning.<ref>https://github.com/microsoft/VibeVoice</ref>

== Development and Release ==

VibeVoice was developed by a team at Microsoft Research led by Zhiliang Peng, Jianwei Yu, and others, with the technical report published on [[arXiv]] in August 2025.<ref>https://arxiv.org/abs/2508.19205</ref> The project was initially released as open-source software on GitHub and [[Hugging Face]], with model weights made publicly available under the MIT license.

However, the release was disrupted in September 2025, when Microsoft removed the official repository and model weights from public access. According to Microsoft's statement, the action was taken after the company discovered "instances where the tool was used in ways inconsistent with the stated intent" and in response to concerns about responsible AI use.<ref>https://github.com/microsoft/VibeVoice</ref> The repository was later restored without the implementation code. The 1.5B pretrained model remains available on Microsoft's Hugging Face page, but the 7B model was taken down.

Community forks have preserved backups of the code and model weights, including both the 1.5B and 7B models.<ref>https://github.com/vibevoice-community/VibeVoice</ref><ref>https://huggingface.co/aoi-ot/VibeVoice-Large</ref>

== Technical Architecture ==

VibeVoice uses a hybrid architecture that combines a large language model with diffusion-based audio generation. Two specialized tokenizers operate at a very low frame rate of 7.5 Hz:

* '''Acoustic Tokenizer''': A [[variational autoencoder]] (VAE) based encoder-decoder that compresses audio signals while preserving fidelity
* '''Semantic Tokenizer''': A content-focused encoder trained using [[automatic speech recognition]] as a proxy task

The core model uses [[Qwen2.5]] as its base large language model (available in 1.5B and 7B parameter variants), integrated with a lightweight diffusion head that generates acoustic features. The researchers claim that this design achieves an 80-fold improvement in data compression over the [[Encodec]] model while maintaining audio quality.<ref>https://arxiv.org/abs/2508.19205</ref>
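The effect of the 7.5 Hz frame rate on sequence length can be illustrated with simple arithmetic. The following is a minimal sketch, not code from VibeVoice: the function names are hypothetical, and the 24 kHz sample rate is an assumption that is not stated in this article.

```python
FRAME_RATE_HZ = 7.5  # tokenizer frame rate stated in this article

# Assumption: 24 kHz input audio. This value is NOT given in the article.
ASSUMED_SAMPLE_RATE_HZ = 24_000


def frames_for_duration(seconds: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Return the number of tokenizer frames that cover `seconds` of audio."""
    return round(seconds * frame_rate_hz)


def samples_per_frame(sample_rate_hz: int, frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Return how many raw audio samples a single frame summarizes."""
    return sample_rate_hz / frame_rate_hz


if __name__ == "__main__":
    ninety_minutes = 90 * 60  # seconds
    print(frames_for_duration(ninety_minutes))        # 40500
    print(samples_per_frame(ASSUMED_SAMPLE_RATE_HZ))  # 3200.0
```

Under these assumptions, a 90-minute recording corresponds to roughly 40,500 frames per tokenizer, which is short enough to fit within the context window of a large language model.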

== Capabilities and Limitations ==

VibeVoice can generate speech sequences up to 90 minutes long with up to four distinct speakers. The model also demonstrates several emergent capabilities for which it was not explicitly trained, including:

* Cross-lingual speech synthesis
* Spontaneous singing (though often off-key)
* Contextual background music generation
* Voice cloning from short prompts

However, the system has notable limitations:

* Language support restricted to English and Chinese
* No explicit modeling of overlapping speech
* Occasional instability, particularly with Chinese text synthesis
* Uncontrolled generation of background sounds and music
* Not recommended by Microsoft for commercial or real-world applications without further testing and development<ref>https://github.com/microsoft/VibeVoice</ref>

== Performance and Evaluation ==

In comparative evaluations against contemporary TTS systems such as [[ElevenLabs]] and [[Google]]'s Gemini 2.5 Pro TTS, VibeVoice reportedly scored higher on subjective ratings of realism, richness, and listener preference. The model also achieved competitive [[word error rate]]s when its output was transcribed by speech recognition systems.<ref>https://huggingface.co/microsoft/VibeVoice-1.5B</ref>

However, these evaluations were conducted on a limited test set of eight conversational transcripts totaling approximately one hour of audio, raising questions about the generalizability of the results to broader use cases.
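For readers unfamiliar with the word error rate metric mentioned above, the sketch below shows the standard definition: the word-level edit distance between a reference transcript and a recognizer's output, divided by the number of reference words. It is a generic illustration with a hypothetical function name, not the evaluation pipeline used for VibeVoice, and practical evaluations usually apply more careful text normalization than the simple lowercasing used here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One of four reference words is missing from the hypothesis.
    print(word_error_rate("welcome to the show", "welcome to show"))  # 0.25
```

Because the denominator is the reference length, a hypothesis with many inserted words can produce a word error rate above 1.0.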

== Controversies and Concerns ==

The temporary removal of VibeVoice from public access highlighted ongoing concerns about the potential misuse of high-quality synthetic speech technology. Microsoft explicitly warned about the potential for creating [[deepfake]] audio content for impersonation, fraud, or disinformation purposes.

The model's ability to generate convincing speech from minimal voice prompts, combined with its long-form generation capabilities, raised particular concerns among AI safety researchers about potential misuse for creating fake audio content at scale.

== Fine-Tuning ==
After the community backup was published, a community member released fine-tuning scripts on GitHub. LoRA fine-tuning is supported; full fine-tuning is not yet supported.

Members of the Discord community have reported success both in extending VibeVoice to new languages and in training on specific voices to improve voice cloning; however, few LoRA adapters are publicly available.
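The appeal of LoRA over full fine-tuning can be seen by counting trainable parameters. LoRA freezes a weight matrix of shape ''d_out'' × ''d_in'' and trains two low-rank factors of shapes ''d_out'' × ''r'' and ''r'' × ''d_in'' instead. The sketch below is illustrative only: the layer dimensions and rank are hypothetical values chosen for the example, not the configuration used by the community fine-tuning scripts.

```python
def full_param_count(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for LoRA factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


if __name__ == "__main__":
    # Hypothetical layer size and rank, chosen only for illustration.
    d_out, d_in, rank = 1536, 1536, 8

    full = full_param_count(d_out, d_in)
    lora = lora_param_count(d_out, d_in, rank)

    print(full)         # 2359296
    print(lora)         # 24576
    print(full // lora) # 96
```

For this hypothetical layer, LoRA trains about 1% of the parameters that full fine-tuning would update, which is why adapters are comparatively cheap to train and distribute.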

== Community Response ==

Following Microsoft's temporary withdrawal of the official release, the open-source community launched several preservation and tooling projects:

* '''[https://github.com/vibevoice-community/VibeVoice vibevoice-community/VibeVoice]''': A community-maintained fork preserving the original codebase and model weights
* '''[https://github.com/voicepowered-ai/VibeVoice-finetuning VibeVoice-finetuning]''': Unofficial tools for fine-tuning the models using [[Low-Rank Adaptation]] (LoRA) techniques

These community efforts have enabled continued research and development despite the official restrictions, though they operate independently of Microsoft's oversight.

== External Links ==

* [https://github.com/vibevoice-community/VibeVoice Community-maintained VibeVoice repository]
* [https://discord.com/invite/ZDEYTTRxWG VibeVoice Discord server (unofficial)]
* [https://arxiv.org/abs/2508.19205 Original technical report on arXiv]

[[Category:Artificial intelligence]]
[[Category:Speech synthesis]]
[[Category:Microsoft Research]]
[[Category:Open-source software]]