{{Infobox TTS model|name=IndexTTS2|developer=Bilibili AI Platform Department|release_date=September 2025 (paper)|latest_version=2.0|architecture=Autoregressive Transformer|parameters=Undisclosed|training_data=55,000 hours multilingual|languages=Chinese, English, Japanese|voices=Zero-shot voice cloning|voice_cloning=Yes (emotion-timbre disentangled)|emotion_control=Yes (multimodal input)|streaming=Yes|latency=Not specified|license=Custom/restrictive (commercial license available)|open_source=Yes|code_repository=https://github.com/index-tts/index-tts|model_weights=https://huggingface.co/IndexTeam/IndexTTS-2|demo=https://index-tts.github.io/index-tts2.github.io/|website=https://indextts2.org}}

'''IndexTTS2''' is an open-source text-to-speech model developed by Bilibili's AI Platform Department, loosely based on [[Tortoise TTS]]. Released in September 2025, it addresses key limitations of traditional TTS models by introducing precise duration control and advanced emotional expression while retaining the naturalness advantages of autoregressive generation.

== Development and Background ==
IndexTTS2 was developed by a team led by Siyi Zhou at Bilibili's Artificial Intelligence Platform Department in China. The project builds upon the earlier IndexTTS model, incorporating substantial improvements in duration control, emotional modeling, and speech stability. The research was published on arXiv in September 2025 under the title "IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech".<ref>[https://arxiv.org/abs/2506.21619 IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech]. arXiv:2506.21619, 2025.</ref>

The development was motivated by specific limitations of existing autoregressive TTS models, particularly their inability to precisely control speech duration, a critical requirement for applications such as video dubbing that demand strict audio-visual synchronization. The team also sought to address the limited emotional expressiveness of existing systems, which is often constrained by the scarcity of high-quality emotional training data.

== Technical Architecture ==
IndexTTS2 employs a three-module cascaded architecture.

=== Text-to-Semantic (T2S) Module ===
The T2S module serves as the core component, using an autoregressive Transformer to generate semantic tokens from the input text, a timbre prompt, a style prompt, and an optional speech token count. Key innovations include:
* '''Duration Control Mechanism''': a duration encoding scheme in which the duration prompt <math>p</math> is computed from the target semantic token length <math>T</math> as <math>p = W_{\mathrm{num}}\, h(T)</math>, where <math>W_{\mathrm{num}}</math> is an embedding table and <math>h(T)</math> is a one-hot vector over possible token counts (see the first sketch after this list)
* '''Emotion-Speaker Disentanglement''': a Gradient Reversal Layer (GRL) separates emotional features from speaker-specific characteristics, enabling independent control over timbre and emotion (see the second sketch after this list)
* '''Three-Stage Training Strategy''': a progressive training approach designed to overcome the scarcity of high-quality emotional data while maintaining model stability
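The following minimal sketch illustrates how the duration prompt <math>p = W_{\mathrm{num}}\, h(T)</math> can be realized in practice: because <math>h(T)</math> is one-hot, the matrix product reduces to an embedding-table lookup of row <math>T</math>. The class name, table size, and dimensionality here are illustrative assumptions, not taken from the IndexTTS2 codebase.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class DurationEncoder(nn.Module):
    """Illustrative duration prompt: p = W_num · h(T).

    h(T) is a one-hot vector over possible token counts, so multiplying by
    W_num is equivalent to looking up row T of the embedding table.
    Names and sizes are assumptions for this sketch.
    """

    def __init__(self, max_tokens: int = 2048, dim: int = 1024):
        super().__init__()
        # W_num: one embedding row per possible target token count
        self.w_num = nn.Embedding(max_tokens + 1, dim)

    def forward(self, target_token_count: torch.Tensor) -> torch.Tensor:
        # Equivalent to W_num @ one_hot(T); nn.Embedding performs the lookup directly.
        return self.w_num(target_token_count)

# Example: request a 375-token utterance (duration-specified mode).
encoder = DurationEncoder()
p = encoder(torch.tensor([375]))  # shape (1, 1024); conditioning input to the T2S Transformer
</syntaxhighlight>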
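The gradient reversal layer used for emotion-speaker disentanglement follows the standard domain-adversarial training construction. The sketch below shows a generic GRL rather than the IndexTTS2 implementation; the modules referenced in the trailing comments (emotion encoder, speaker classifier) are hypothetical placeholders.

<syntaxhighlight lang="python">
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales gradients by -lambda on the way back."""

    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed gradient discourages the upstream encoder from keeping speaker cues.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)

# Sketch of the adversarial objective: a speaker classifier is trained on emotion
# features passed through the GRL, so minimizing its loss pushes the emotion
# encoder to remove speaker identity from the emotion representation.
# emotion_feat   = emotion_encoder(style_prompt)                    # hypothetical module
# speaker_logits = speaker_classifier(grad_reverse(emotion_feat))   # hypothetical module
# adv_loss = torch.nn.functional.cross_entropy(speaker_logits, speaker_ids)
</syntaxhighlight>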
=== Semantic-to-Mel (S2M) Module ===
The S2M module employs a non-autoregressive architecture based on flow matching to convert semantic tokens into mel-spectrograms. Notable features include:
* '''GPT Latent Enhancement''': latent features from the final Transformer layer of the T2S module are integrated to improve speech clarity, particularly during emotionally expressive synthesis
* '''Speaker Embedding Integration''': speaker embeddings are concatenated with the semantic features to ensure timbre consistency

=== Vocoder ===
The system uses [[BigVGANv2]] as its vocoder to convert mel-spectrograms into the final audio waveform, chosen for its audio quality and stability compared to earlier vocoders.

== Key Features and Capabilities ==

=== Precise Duration Control ===
IndexTTS2 is claimed to be the first autoregressive zero-shot TTS model to achieve precise duration control. The model supports two generation modes (see the usage sketch at the end of this section):
* '''Specified Duration Mode''': the user explicitly specifies the number of generated semantic tokens, giving precise control over speech duration
* '''Natural Duration Mode''': free-form generation that faithfully reproduces the prosodic features of the input prompts, without duration constraints
This capability addresses requirements of applications such as video dubbing, where precise synchronization between audio and visual content is essential.

=== Fine-Grained Emotional Control ===
The system offers multiple methods of emotional control:
* '''Reference Audio Emotion''': emotional characteristics are extracted from a style prompt audio clip
* '''Natural Language Descriptions''': text-based emotion control using a specialized Text-to-Emotion (T2E) module
* '''Emotion Vector Input''': direct specification of emotional states through numerical vectors
* '''Cross-Speaker Emotion Transfer''': emotional characteristics from one speaker can be applied to the voice of another

=== Text-to-Emotion (T2E) Module ===
A specialized component that enables natural-language emotion control through:
* Knowledge distillation from DeepSeek-R1 to Qwen3-1.7B
* Support for seven basic emotions: Anger, Happiness, Fear, Disgust, Sadness, Surprise, and Neutral
* Generation of emotion probability distributions that are combined with precomputed emotion embeddings (as sketched below)
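One plausible reading of the last point is a probability-weighted sum over a table of precomputed emotion embeddings. The sketch below illustrates that interpretation; the table shape, dimensionality, and variable names are assumptions and not the exact IndexTTS2 implementation.

<syntaxhighlight lang="python">
import torch

# Seven basic emotions supported by the T2E module.
EMOTIONS = ["anger", "happiness", "fear", "disgust", "sadness", "surprise", "neutral"]

# Precomputed emotion embeddings, one vector per basic emotion
# (random here for illustration; a trained model would supply these).
emotion_table = torch.randn(len(EMOTIONS), 512)

def emotion_embedding_from_probs(probs: torch.Tensor) -> torch.Tensor:
    """Combine a probability distribution over the seven emotions with the
    precomputed embeddings via a probability-weighted sum."""
    assert probs.shape == (len(EMOTIONS),) and torch.isclose(probs.sum(), torch.tensor(1.0))
    return probs @ emotion_table  # shape (512,)

# Example: a text description that the T2E language model maps to
# "mostly happy, slightly surprised".
probs = torch.tensor([0.0, 0.7, 0.0, 0.0, 0.0, 0.3, 0.0])
style_embedding = emotion_embedding_from_probs(probs)
</syntaxhighlight>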
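The following stub shows how the two duration modes and the emotion-control inputs described above fit together from a user's perspective. The function <code>synthesize</code> and all of its parameter names are invented for this sketch and are not the actual index-tts API.

<syntaxhighlight lang="python">
def synthesize(text: str,
               timbre_prompt: str,                          # reference WAV for voice cloning
               style_prompt: str | None = None,             # reference WAV supplying the emotion
               emotion_vector: list[float] | None = None,   # direct 7-dim emotion input
               num_tokens: int | None = None):              # None => natural-duration mode
    """Hypothetical wrapper: specified-duration mode when num_tokens is given;
    otherwise the model generates freely and follows the prompts' prosody."""
    ...

# Natural-duration mode with cross-speaker emotion transfer:
# timbre and emotion come from different reference recordings.
synthesize("I can't believe we actually won!",
           timbre_prompt="speaker_a.wav",
           style_prompt="excited_speaker_b.wav")

# Duration-specified mode for dubbing: force the line to exactly 375 semantic tokens
# and set the emotion directly via a numerical vector.
synthesize("That result took everyone by surprise.",
           timbre_prompt="speaker_a.wav",
           emotion_vector=[0.0, 0.2, 0.0, 0.0, 0.0, 0.8, 0.0],
           num_tokens=375)
</syntaxhighlight>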
== Training and Dataset ==
IndexTTS2 was trained on a substantial multilingual corpus:
* '''Total Training Data''': 55,000 hours, comprising 30,000 hours of Chinese and 25,000 hours of English data
* '''Emotional Data''': 135 hours of specialized emotional speech from 361 speakers
* '''Training Infrastructure''': 8 NVIDIA A100 80 GB GPUs, using the AdamW optimizer with a 2e-4 learning rate
* '''Training Duration''': three weeks of total training time
* '''Data Sources''': primarily the Emilia dataset, supplemented with audiobooks and commercial data

The three-stage training methodology comprises:
# Foundation training on the full dataset with duration control capabilities
# Emotional control refinement using curated emotional data with GRL-based disentanglement
# Robustness improvement through fine-tuning on the complete dataset

== Performance Evaluation ==

=== Objective Metrics ===
In evaluations across multiple datasets (LibriSpeech-test-clean, SeedTTS test-zh/en, AIShell-1), IndexTTS2 demonstrates:
* '''Superior Word Error Rates''': outperforms baseline models including MaskGCT, F5-TTS, CosyVoice2, and SparkTTS on most test sets
* '''Strong Speaker Similarity''': achieves competitive speaker similarity scores while maintaining improved speech clarity
* '''Emotional Fidelity''': the highest emotion similarity scores among the evaluated models

=== Subjective Evaluation ===
Human evaluation using Mean Opinion Scores (MOS) across multiple dimensions shows:
* '''Quality MOS''': consistent superiority in perceived audio quality
* '''Similarity MOS''': strong performance in perceived speaker similarity
* '''Prosody MOS''': enhanced prosodic naturalness compared to baseline models
* '''Emotion MOS''': significant improvements in emotional expressiveness

=== Duration Control Accuracy ===
Precision testing shows minimal token-count error rates:
* Original duration: under 0.02% error rate
* Scaled durations (0.875× to 1.125×): under 0.03% error rate
* Larger scaling factors: at most 0.067% error rate

== Comparison with Existing Systems ==
IndexTTS2 distinguishes itself from contemporary TTS systems as follows:
* '''Versus ElevenLabs''': open-source release and precise duration control
* '''Versus Traditional TTS''': enhanced emotional expressiveness and zero-shot voice cloning
* '''Versus Other Open-Source Systems''': the first autoregressive model with precise duration control
* '''Versus Non-Autoregressive Models''': retains the naturalness advantages of autoregressive generation while adding duration precision

== References ==
<references />

== External Links ==
* [https://github.com/index-tts/index-tts Official IndexTTS Repository]
* [https://index-tts.github.io/index-tts2.github.io/ IndexTTS2 Demo Page]
* [https://huggingface.co/IndexTeam IndexTTS2 Models on Hugging Face]
* [https://arxiv.org/abs/2506.21619 IndexTTS2 Research Paper]