== Training and Dataset ==
IndexTTS2 was trained on a large bilingual (Chinese and English) corpus:

'''Total Training Data''': 55,000 hours comprising 30,000 hours of Chinese data and 25,000 hours of English data

'''Emotional Data''': 135 hours of specialized emotional speech from 361 speakers

'''Training Infrastructure''': 8 NVIDIA A100 80GB GPUs, using the AdamW optimizer with a learning rate of 2e-4

'''Training Duration''': Three weeks in total

'''Data Sources''': Primarily from the Emilia dataset, supplemented with audiobooks and commercial data
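
To make the optimizer setting listed above concrete, the following is a minimal scalar sketch of one AdamW update using the reported learning rate of 2e-4. Only the learning rate comes from the figures above. The function name <code>adamw_step</code> and the values of <code>beta1</code>, <code>beta2</code>, <code>eps</code>, and <code>weight_decay</code> are assumptions (common defaults), not settings confirmed for IndexTTS2.

```python
import math


def adamw_step(param, grad, m, v, t,
               lr=2e-4,            # learning rate reported for IndexTTS2
               beta1=0.9,          # assumed: common default
               beta2=0.999,        # assumed: common default
               eps=1e-8,           # assumed: common default
               weight_decay=0.01): # assumed: common default
    """Apply one AdamW update to a single scalar parameter.

    t is the 1-based step count; m and v are the running first and
    second moment estimates from the previous step.
    """
    # Decoupled weight decay: shrink the parameter directly,
    # independent of the gradient-based update.
    param = param * (1.0 - lr * weight_decay)

    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad

    # Bias correction for the zero-initialised moments.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)

    # Adam update scaled by the learning rate.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Illustrative first step with made-up parameter and gradient values.
new_param, new_m, new_v = adamw_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(new_param, new_m, new_v)
```

On the first step the bias-corrected ratio <code>m_hat / sqrt(v_hat)</code> has magnitude close to 1, so the parameter moves by roughly the learning rate regardless of the gradient's scale.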

Training proceeds in three stages:

# Foundation training on the full dataset with duration control capabilities
# Emotional control refinement using curated emotional data, with disentanglement based on a gradient reversal layer (GRL)
# Robustness improvement through fine-tuning on the complete dataset
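
The three stages above can be sketched as a simple schedule, together with a toy gradient reversal layer (GRL). A GRL passes values through unchanged in the forward direction and multiplies gradients by a negative factor in the backward direction, so that an upstream encoder is pushed away from features an auxiliary classifier could exploit. This is a minimal illustration in plain Python, not the IndexTTS2 implementation: the names <code>Stage</code>, <code>STAGES</code>, <code>GradientReversal</code>, and the scaling factor <code>lam</code> are hypothetical, and where the GRL sits in the actual model is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One training stage; field names are illustrative only."""
    name: str
    data: str
    uses_grl: bool


# Schedule mirroring the three stages listed above.
STAGES = [
    Stage("foundation", "full dataset", uses_grl=False),
    Stage("emotion_control", "curated emotional data", uses_grl=True),
    Stage("robustness", "full dataset", uses_grl=False),
]


class GradientReversal:
    """Toy GRL: identity in the forward pass, gradient scaled by -lam
    in the backward pass. lam is a hypothetical hyperparameter."""

    def __init__(self, lam: float = 1.0):
        self.lam = lam

    def forward(self, x):
        # Values pass through unchanged.
        return list(x)

    def backward(self, grad_output):
        # Gradients flowing back are negated and scaled.
        return [-self.lam * g for g in grad_output]


for stage in STAGES:
    marker = "with GRL" if stage.uses_grl else "without GRL"
    print(f"{stage.name}: {stage.data} ({marker})")
```

In an automatic-differentiation framework the same behaviour is usually written as a custom operation whose backward rule negates the incoming gradient. The explicit <code>backward</code> method here only stands in for that mechanism.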