Chatterbox

Model Information
Developer: Resemble AI
Release date: May 2025
Latest version: Multilingual 2.0
Architecture: CosyVoice 2.0-based
Parameters: 500 million
Training data: 500,000 hours cleaned data
Capabilities
Languages: 23 languages (multilingual version)
Voices: Zero-shot voice cloning
Voice cloning: Yes (5-second reference)
Emotion control: Yes (exaggeration parameter)
Streaming: Yes
Latency: Sub-200ms
Availability
License: MIT License
Open source: Yes
Repository: GitHub
Model weights: Hugging Face
Demo: HF Spaces
Website: resemble.ai/chatterbox

Chatterbox is an open-source text-to-speech (TTS) model developed by Resemble AI and released in May 2025. Built on a CosyVoice 2.0-style modified Llama architecture with 500M parameters, it is marketed as the first open-source TTS model to include controllable emotion exaggeration, and it has gained attention for claims of outperforming established commercial systems in user preference evaluations.

Development and Release

Chatterbox was developed by a three-person team at Resemble AI, a voice technology company founded by Zohaib Ahmed and Saqib Muhammad.[1] The initial English-only version was released in May 2025 under the MIT License, followed by a multilingual version supporting 23 languages in September 2025.[2]

The project quickly gained popularity in the open-source community, accumulating over 1 million downloads on Hugging Face and more than 11,000 stars on GitHub within weeks of release.[3]

Technical Architecture

Chatterbox uses a 500-million-parameter model based on a CosyVoice-style modified Llama architecture, significantly smaller than many contemporary TTS systems. The model was trained on approximately 500,000 hours of cleaned audio data and employs what the developers term "alignment-informed inference" to improve stability during generation.
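
The model is distributed as a pip-installable Python package. The following is a minimal synthesis sketch modeled on the quickstart published in the project's repository; the package name (chatterbox-tts), class names, and method signatures reflect the initial release and may differ in later versions.

  # Minimal sketch, assuming the quickstart API from the project's README.
  # Install with: pip install chatterbox-tts
  import torchaudio as ta
  from chatterbox.tts import ChatterboxTTS

  # Fetch the MIT-licensed weights from Hugging Face and load them onto a GPU.
  model = ChatterboxTTS.from_pretrained(device="cuda")

  text = "Chatterbox is an open-source text-to-speech model."
  wav = model.generate(text)  # waveform tensor in the model's default voice

  # model.sr holds the output sample rate of the generated audio.
  ta.save("chatterbox-out.wav", wav, model.sr)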

Key technical features include:

  • Zero-shot voice cloning: Ability to clone voices from as little as 5 seconds of reference audio
  • Emotion exaggeration control: A novel parameter allowing users to adjust emotional intensity from monotone to dramatically expressive (both features are demonstrated in the sketch after this list)
  • Fast inference: Sub-200ms latency for real-time applications
  • Multilingual support: The updated version supports 23 languages including Arabic, Chinese, Hindi, and major European languages
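
Voice cloning and emotion exaggeration are exposed as optional arguments to the same generate call. The sketch below assumes the audio_prompt_path and exaggeration parameters shown in the repository's example code; the reference WAV path is illustrative.

  import torchaudio as ta
  from chatterbox.tts import ChatterboxTTS

  model = ChatterboxTTS.from_pretrained(device="cuda")

  # A short clip (roughly 5 seconds) of the target voice; path is illustrative.
  REFERENCE_WAV = "reference_speaker.wav"

  # exaggeration defaults to a neutral ~0.5 in the published examples;
  # higher values push the delivery toward more dramatic speech.
  wav = model.generate(
      "This line is rendered in the cloned voice with heightened emotion.",
      audio_prompt_path=REFERENCE_WAV,
      exaggeration=0.7,
  )
  ta.save("chatterbox-cloned.wav", wav, model.sr)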

Performance Claims and Evaluation

Resemble AI commissioned a comparative evaluation through Podonos, a third-party evaluation service, testing Chatterbox against ElevenLabs, a leading commercial TTS system. In blind A/B testing, 63.75% of evaluators reportedly preferred Chatterbox's output over ElevenLabs's.[4][5]

However, these results should be interpreted with caution, as the evaluation was limited in scope and conducted by a single third-party service. The testing methodology, sample size, and demographic composition of evaluators have not been independently verified. Additionally, the comparison was limited to a single competitor rather than a comprehensive benchmark against multiple state-of-the-art systems.

Commercial and Research Impact

The release of Chatterbox has been significant for the open-source TTS community, representing one of the first production-grade systems to be freely available under a permissive license. This has enabled developers to integrate high-quality TTS capabilities into applications without licensing costs or vendor dependencies.

The system has found applications in various domains including:

  • Audiobook generation and voice narration
  • Game development for non-player character dialogue
  • Educational content creation
  • Accessibility tools for visually impaired users
  • Research and development in speech synthesis

Resemble AI also offers a commercial "Pro" version with enhanced features, service-level agreements, and custom fine-tuning capabilities for enterprise customers requiring guaranteed performance and support. This version is available through their inference partners, such as FAL.

External Links