IndexTTS2

IndexTTS2


Model Information
Developer: Bilibili AI Platform Department
Release date: September 2025 (paper)
Latest version: 2.0
Architecture: Autoregressive Transformer
Parameters: Undisclosed
Training data: 55,000 hours multilingual
Capabilities
Languages: Chinese, English, Japanese
Voices: Zero-shot voice cloning
Voice cloning: Yes (emotion-timbre disentangled)
Emotion control: Yes (multimodal input)
Streaming: Yes
Latency: Not specified
Availability
License: Custom/restrictive (commercial license available)
Open source: Yes
Repository: https://github.com/index-tts/index-tts
Model weights: https://huggingface.co/IndexTeam/IndexTTS-2
Demo: https://index-tts.github.io/index-tts2.github.io/
Website: https://indextts2.org

IndexTTS2 is an open-source text-to-speech model developed by Bilibili's AI Platform Department and loosely based on Tortoise TTS. Released in September 2025, it addresses key limitations of traditional TTS models by introducing precise duration control and advanced emotional expression capabilities while maintaining the naturalness advantages of autoregressive generation.

Development and Background[edit | edit source]

IndexTTS2 was developed by a team led by Siyi Zhou at Bilibili's Artificial Intelligence Platform Department in China. The project builds upon the earlier IndexTTS model, incorporating substantial improvements in duration control, emotional modeling, and speech stability. The research was published on arXiv in September 2025, with the paper titled "IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech."[1]

The development was motivated by specific limitations in existing autoregressive TTS models, particularly their inability to precisely control speech duration - a critical requirement for applications such as video dubbing that demand strict audio-visual synchronization. Additionally, the team sought to address the limited emotional expressiveness of existing systems, which are often constrained by scarce high-quality emotional training data.

Technical Architecture[edit | edit source]

IndexTTS2 employs a three-module cascaded architecture:

Text-to-Semantic (T2S) Module[edit | edit source]

The T2S module serves as the core component, utilizing an autoregressive Transformer framework to generate semantic tokens from input text, timbre prompts, style prompts, and optional speech token counts. Key innovations include:

Duration Control Mechanism: A duration encoding scheme in which the duration information p is computed from the target semantic token length T as p = W_num h(T), where W_num is an embedding table and h(T) is a one-hot vector (see the sketch after this list)

Emotion-Speaker Disentanglement: Implementation of a Gradient Reversal Layer (GRL) to separate emotional features from speaker-specific characteristics, enabling independent control over timbre and emotion

Three-Stage Training Strategy: A progressive training approach designed to overcome the scarcity of high-quality emotional data while maintaining model stability
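
A minimal PyTorch sketch of the duration encoding is given below. The maximum token count and embedding width are illustrative assumptions; the paper only states that p is obtained by multiplying the embedding table W_num with the one-hot vector h(T), which is equivalent to an embedding lookup at index T.

```python
# Illustrative sketch of the duration encoding p = W_num h(T).
# max_tokens and dim are assumptions chosen for demonstration only.
import torch
import torch.nn as nn

class DurationEncoder(nn.Module):
    def __init__(self, max_tokens: int = 2048, dim: int = 1024):
        super().__init__()
        # W_num: one row per possible target semantic-token count T.
        self.w_num = nn.Embedding(max_tokens + 1, dim)

    def forward(self, target_len: torch.Tensor) -> torch.Tensor:
        # Multiplying W_num by the one-hot vector h(T) reduces to an
        # embedding lookup at index T.
        return self.w_num(target_len)

encoder = DurationEncoder()
p = encoder(torch.tensor([625]))   # duration embedding for a 625-token target
print(p.shape)                     # torch.Size([1, 1024])
```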

Semantic-to-Mel (S2M) Module[edit | edit source]

The S2M module employs a non-autoregressive architecture based on flow matching to convert semantic tokens into mel-spectrograms. Notable features include:

GPT Latent Enhancement: Integration of latent features from the T2S module's final transformer layer to improve speech clarity, particularly during emotionally expressive synthesis

Speaker Embedding Integration: Concatenation of speaker embeddings with semantic features to ensure timbre consistency
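
The conditioning described above can be pictured at the tensor level as follows. This is a shape-only sketch under assumed dimensions: semantic features, T2S (GPT) latents, and a broadcast speaker embedding are concatenated along the channel axis before the flow-matching decoder predicts the mel-spectrogram.

```python
# Schematic sketch of S2M conditioning; all dimensions are illustrative assumptions.
import torch

batch, frames = 1, 200
semantic = torch.randn(batch, frames, 1024)     # semantic-token features
gpt_latent = torch.randn(batch, frames, 1024)   # latents from the T2S final transformer layer
speaker = torch.randn(batch, 512)               # global speaker embedding

# Broadcast the utterance-level speaker embedding to every frame, then concatenate.
speaker_per_frame = speaker.unsqueeze(1).expand(-1, frames, -1)
s2m_input = torch.cat([semantic, gpt_latent, speaker_per_frame], dim=-1)
print(s2m_input.shape)  # torch.Size([1, 200, 2560]) -> input to the flow-matching decoder
```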

Vocoder[edit | edit source]

The system utilizes BigVGANv2 as its vocoder to convert mel-spectrograms into final audio waveforms, chosen for its superior audio quality and stability compared to previous vocoders.
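
At the interface level, the vocoder stage maps a mel-spectrogram [batch, n_mels, frames] to a waveform [batch, frames × hop_length]. The module below is a stand-in with that same input/output contract, not BigVGANv2 itself; the mel-channel count, hop length, and sample rate are assumptions for illustration rather than IndexTTS2's published configuration.

```python
# Shape-level sketch of the mel-to-waveform step; ToyVocoder is a placeholder.
import torch
import torch.nn as nn

n_mels, hop_length = 100, 256

class ToyVocoder(nn.Module):
    """Placeholder with the same input/output contract as a neural vocoder."""
    def __init__(self):
        super().__init__()
        self.proj = nn.ConvTranspose1d(n_mels, 1, kernel_size=hop_length, stride=hop_length)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        return self.proj(mel).squeeze(1)   # [batch, frames * hop_length]

mel = torch.randn(1, n_mels, 200)          # ~2.1 s of audio at 24 kHz with hop 256
audio = ToyVocoder()(mel)
print(audio.shape)                         # torch.Size([1, 51200])
```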

Key Features and Capabilities[edit | edit source]

Precise Duration Control[edit | edit source]

IndexTTS2 is claimed to be the first autoregressive zero-shot TTS model to achieve precise duration control. The model supports two generation modes:

Specified Duration Mode: Users can explicitly specify the number of generated tokens to control speech duration with millisecond-level precision

Natural Duration Mode: Free-form generation that faithfully reproduces prosodic features from input prompts without duration constraints

This capability addresses critical requirements for applications like video dubbing, where precise synchronization between audio and visual content is essential.
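
The difference between the two modes can be illustrated at the call site. The function and parameter names below are hypothetical and do not reflect the project's actual API; they only show that the specified duration mode fixes the semantic-token count while the natural mode leaves it to the model.

```python
# Hypothetical interface contrasting the two generation modes (names are assumptions).
from typing import Optional

def synthesize(text: str, timbre_prompt: str, num_tokens: Optional[int] = None) -> str:
    """num_tokens=None -> natural duration mode: the model stops on its own and
    reproduces the prompt's prosody freely.
    num_tokens=N    -> specified duration mode: exactly N semantic tokens are
    generated, fixing the output length for tasks such as dubbing."""
    mode = "natural" if num_tokens is None else f"fixed ({num_tokens} tokens)"
    return f"synthesizing '{text[:20]}...' with prompt {timbre_prompt} in {mode} mode"

print(synthesize("The quick brown fox jumps over the lazy dog.", "speaker.wav"))
print(synthesize("The quick brown fox jumps over the lazy dog.", "speaker.wav", num_tokens=625))
```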

Fine-Grained Emotional Control[edit | edit source]

The system offers multiple methods for emotional control:

Reference Audio Emotion: Extraction of emotional characteristics from style prompt audio

Natural Language Descriptions: Text-based emotion control using a specialized Text-to-Emotion (T2E) module

Emotion Vector Input: Direct specification of emotional states through numerical vectors

Cross-Speaker Emotion Transfer: Ability to apply emotional characteristics from one speaker to the voice of another

Text-to-Emotion (T2E) Module[edit | edit source]

A specialized component that enables natural language-based emotion control through:

  • Knowledge distillation from DeepSeek-R1 to Qwen3-1.7B
  • Support for seven basic emotions: Anger, Happiness, Fear, Disgust, Sadness, Surprise, and Neutral
  • Generation of emotion probability distributions that are combined with precomputed emotion embeddings
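
The last step above amounts to mixing precomputed emotion embeddings with the predicted probabilities. The sketch below shows that combination; the probability values and embedding width are invented for illustration, and in IndexTTS2 the distribution would come from the distilled Qwen3-1.7B model interpreting a natural-language description.

```python
# Sketch of combining an emotion probability distribution with precomputed embeddings.
import torch

emotions = ["anger", "happiness", "fear", "disgust", "sadness", "surprise", "neutral"]
emotion_bank = torch.randn(len(emotions), 512)   # precomputed emotion embeddings (assumed width)

# Example T2E output for a prompt like "speak with quiet excitement" (values invented).
probs = torch.tensor([0.02, 0.55, 0.03, 0.02, 0.03, 0.30, 0.05])

emotion_vector = probs @ emotion_bank            # probability-weighted sum
print(emotion_vector.shape)                      # torch.Size([512])
```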

Training and Dataset[edit | edit source]

IndexTTS2 was trained on a substantial multilingual corpus:

Total Training Data: 55,000 hours comprising 30,000 hours of Chinese data and 25,000 hours of English data

Emotional Data: 135 hours of specialized emotional speech from 361 speakers

Training Infrastructure: 8 NVIDIA A100 80 GB GPUs, using the AdamW optimizer with a learning rate of 2e-4

Training Duration: Three weeks total training time

Data Sources: Primarily from the Emilia dataset, supplemented with audiobooks and commercial data

The three-stage training methodology includes:

  1. Foundation training on the full dataset with duration control capabilities
  2. Emotional control refinement using curated emotional data with GRL-based disentanglement
  3. Robustness improvement through fine-tuning on the complete dataset
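
The GRL-based disentanglement used in stage two relies on a standard gradient reversal layer: the forward pass is the identity, while the backward pass flips (and scales) gradients so that a speaker classifier attached to the emotion features pushes those features to become speaker-independent. The sketch below shows a generic PyTorch implementation of that layer; the scale value and feature sizes are assumptions.

```python
# Minimal gradient reversal layer (GRL) of the kind used for emotion-speaker disentanglement.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale: float = 1.0):
        ctx.scale = scale
        return x.clone()           # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the emotion encoder.
        return -ctx.scale * grad_output, None

def grad_reverse(x: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, scale)

# Usage: emotion features pass through the GRL before a speaker classifier, so
# minimizing the speaker-classification loss removes speaker cues from the features.
emotion_feat = torch.randn(4, 512, requires_grad=True)
reversed_feat = grad_reverse(emotion_feat, scale=0.5)
```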

Performance Evaluation[edit | edit source]

Objective Metrics[edit | edit source]

Based on evaluation across multiple datasets (LibriSpeech-test-clean, SeedTTS test-zh/en, AIShell-1), IndexTTS2 demonstrates:

Superior Word Error Rates: Outperforms baseline models including MaskGCT, F5-TTS, CosyVoice2, and SparkTTS across most test sets

Strong Speaker Similarity: Achieves competitive speaker similarity scores while maintaining improved speech clarity

Emotional Fidelity: Highest emotion similarity scores among evaluated models
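
Two of the metrics named above can be reproduced with standard tooling: word error rate via the jiwer package, and speaker similarity as cosine similarity between speaker embeddings. The embeddings below are random stand-ins; published evaluations typically extract them with a pretrained speaker-verification model, and the sentences are arbitrary examples.

```python
# Sketch of the WER and speaker-similarity computations (inputs are stand-ins).
import torch
import torch.nn.functional as F
from jiwer import wer

reference = "the birch canoe slid on the smooth planks"
hypothesis = "the birch canoe slid on smooth planks"
print(f"WER: {wer(reference, hypothesis):.3f}")

prompt_emb = torch.randn(1, 256)   # speaker embedding of the voice prompt
synth_emb = torch.randn(1, 256)    # speaker embedding of the synthesized audio
print(f"Speaker similarity: {F.cosine_similarity(prompt_emb, synth_emb).item():.3f}")
```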

Subjective Evaluation[edit | edit source]

Human evaluation using Mean Opinion Scores (MOS) across multiple dimensions shows:

Quality MOS: Consistent superiority in perceived audio quality

Similarity MOS: Strong performance in perceived speaker similarity

Prosody MOS: Enhanced prosodic naturalness compared to baseline models

Emotion MOS: Significant improvements in emotional expressiveness

Duration Control Accuracy[edit | edit source]

Precision testing reveals minimal token number error rates:

Original duration: <0.02% error rate

Scaled durations (0.875×-1.125×): <0.03% error rate

Larger scaling factors: Maximum 0.067% error rate
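
The token number error rate reported above can be computed as the relative deviation between the requested and generated semantic-token counts. The counts below are invented purely to show the calculation.

```python
# Relative error between requested and generated token counts (example values invented).
def token_error_rate(requested: int, generated: int) -> float:
    return abs(generated - requested) / requested

print(f"{token_error_rate(1000, 1000):.4%}")   # 0.0000% - exact match at the original duration
print(f"{token_error_rate(875, 875):.4%}")     # 0.0000% - scaled target hit exactly
print(f"{token_error_rate(1500, 1501):.4%}")   # 0.0667% - a one-token miss at a larger scale factor
```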

Comparison with Existing Systems[edit | edit source]

IndexTTS2 distinguishes itself from contemporary TTS systems through:

Versus ElevenLabs: Open-source nature and precise duration control capabilities

Versus Traditional TTS: Enhanced emotional expressiveness and zero-shot voice cloning

Versus Other Open-Source Systems: First autoregressive model with precise duration control

Versus Non-Autoregressive Models: Maintains naturalness advantages while adding duration precision

External Links[edit | edit source]

Official IndexTTS Repository: https://github.com/index-tts/index-tts

IndexTTS2 Demo Page: https://index-tts.github.io/index-tts2.github.io/

IndexTTS2 Models on Hugging Face: https://huggingface.co/IndexTeam/IndexTTS-2

IndexTTS2 Research Paper