VITS

VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) and VITS2 are neural text-to-speech synthesis models that generate speech directly from text input using end-to-end training. VITS was first introduced by researchers at Kakao Enterprise in June 2021, while VITS2 was developed by SK Telecom and published in July 2023 as an improvement over the original model.

Overview

Traditional text-to-speech systems typically employ a two-stage pipeline: first converting text to intermediate representations like mel-spectrograms, then generating audio waveforms from these representations. VITS introduced a single-stage approach that generates natural-sounding audio directly from text using variational inference augmented with normalizing flows and adversarial training.

The models are notable for achieving quality comparable to human speech while maintaining parallel generation capabilities, making them significantly faster than autoregressive alternatives. Human evaluation on the LJ Speech dataset showed that VITS outperformed the best publicly available TTS systems at the time and achieved a mean opinion score (MOS) comparable to ground truth recordings.

Technical Architecture

VITS (2021)

VITS employs a conditional variational autoencoder (VAE) framework combined with several advanced techniques (a toy code sketch of how the components fit together follows the list below):

Core Components:

  • Posterior Encoder: Processes linear-scale spectrograms during training to learn latent representations
  • Prior Encoder: Contains a text encoder and normalizing flows to model the prior distribution of latent variables
  • Decoder: Based on HiFi-GAN V1 generator, converts latent variables to raw waveforms
  • Discriminator: Multi-period discriminator from HiFi-GAN for adversarial training
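
Taken together, these components form a single training-time data flow: the posterior encoder samples frame-level latents from the linear spectrogram, the flow maps them toward the text-conditioned prior, and the decoder and discriminator handle waveform generation and adversarial feedback. The toy sketch below wires up stand-in modules in that arrangement, assuming PyTorch; every layer is a placeholder (embeddings and linear maps rather than the real posterior encoder, normalizing flows, HiFi-GAN generator, or multi-period discriminator), and the MAS alignment and loss terms are omitted.

  # Toy wiring of the VITS components during training (placeholder layers only).
  import torch
  import torch.nn as nn

  class ToyVITS(nn.Module):
      def __init__(self, n_symbols=100, spec_bins=513, latent_dim=192, wav_segment=256):
          super().__init__()
          self.text_encoder = nn.Embedding(n_symbols, 2 * latent_dim)    # prior stats (mu_p, logs_p) per symbol
          self.posterior_encoder = nn.Linear(spec_bins, 2 * latent_dim)  # posterior stats from the linear spectrogram
          self.flow = nn.Linear(latent_dim, latent_dim)                  # stand-in for the normalizing flows
          self.decoder = nn.Linear(latent_dim, wav_segment)              # stand-in for the HiFi-GAN V1 generator
          self.discriminator = nn.Linear(wav_segment, 1)                 # stand-in for the multi-period discriminator

      def forward(self, symbol_ids, lin_spec):
          # Posterior encoder: sample a latent z per spectrogram frame (training only).
          mu_q, logs_q = self.posterior_encoder(lin_spec).chunk(2, dim=-1)
          z = mu_q + torch.randn_like(mu_q) * logs_q.exp()
          # Prior side: the text encoder gives a text-conditioned Gaussian; the flow
          # maps z toward it (the KL term and MAS-based alignment are omitted here).
          mu_p, logs_p = self.text_encoder(symbol_ids).chunk(2, dim=-1)
          z_p = self.flow(z)
          # Decoder maps latent frames to a waveform segment; the discriminator scores
          # it for the adversarial loss.
          wav = self.decoder(z)
          realness = self.discriminator(wav)
          return z_p, (mu_p, logs_p), wav, realness

  model = ToyVITS()
  symbols = torch.randint(0, 100, (1, 12))  # a batch of 12 phoneme/character ids
  spec = torch.randn(1, 40, 513)            # 40 frames of a linear spectrogram
  z_p, prior, wav, realness = model(symbols, spec)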

Key Innovations:

  • Monotonic Alignment Search (MAS): Automatically learns alignments between text and speech without external annotations by finding alignments that maximize the likelihood of target speech (a minimal sketch of this search appears after this list).
  • Stochastic Duration Predictor: Uses normalizing flows to model the distribution of phoneme durations, enabling synthesis of speech with diverse rhythms from the same text input.
  • Adversarial Training: Improves waveform quality through generator-discriminator competition.
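
To make the alignment search concrete, the following is a minimal NumPy sketch of the dynamic program MAS relies on: given per-(token, frame) log-likelihoods, it returns the monotonic, non-skipping alignment with the highest total likelihood. This is an illustrative re-derivation from the description above, not the authors' implementation.

  import numpy as np

  def monotonic_alignment_search(log_likelihood):
      """Illustrative MAS: pick the monotonic alignment maximizing total log-likelihood.

      log_likelihood: array of shape [T_text, T_mel]; requires T_mel >= T_text.
      Returns a 0/1 matrix where entry (i, j) = 1 means frame j is assigned to token i.
      """
      T_text, T_mel = log_likelihood.shape
      assert T_mel >= T_text, "need at least one frame per text token"

      # Forward pass: Q[i, j] = best cumulative score with frame j assigned to token i.
      Q = np.full((T_text, T_mel), -np.inf)
      Q[0, 0] = log_likelihood[0, 0]
      for j in range(1, T_mel):
          for i in range(min(j + 1, T_text)):
              stay = Q[i, j - 1]                               # frame j-1 used the same token
              advance = Q[i - 1, j - 1] if i > 0 else -np.inf  # frame j-1 used the previous token
              Q[i, j] = log_likelihood[i, j] + max(stay, advance)

      # Backtrack from the last (token, frame) pair.
      alignment = np.zeros((T_text, T_mel), dtype=np.int64)
      i = T_text - 1
      for j in range(T_mel - 1, -1, -1):
          alignment[i, j] = 1
          if i > 0 and (i == j or Q[i - 1, j - 1] > Q[i, j - 1]):
              i -= 1
      return alignment

During training, the alignment found this way is treated as a fixed target for the current step and is recomputed as the model's likelihood estimates improve, which is what lets VITS learn alignments without external annotations.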

The model addresses the one-to-many relationship in speech synthesis, where a single text input can be spoken in multiple ways with different pitches, rhythms, and prosodic patterns.

VITS2 (2023)

VITS2 introduced several improvements over the original model to address its remaining shortcomings, including intermittent unnaturalness, limited computational efficiency, and strong dependence on phoneme conversion.

Major Improvements:

  • Adversarial Duration Predictor: Replaced the flow-based stochastic duration predictor with one trained through adversarial learning, using a time step-wise conditional discriminator to improve efficiency and naturalness.
  • Enhanced Normalizing Flows: Added transformer blocks to normalizing flows to capture long-term dependencies when transforming distributions, addressing limitations of convolution-only approaches.
  • Improved Alignment Search: Modified Monotonic Alignment Search by adding Gaussian noise to the calculated probabilities, giving the model additional opportunities to explore alternative alignments during early training (see the sketch after this list).
  • Speaker-Conditioned Text Encoder: For multi-speaker models, conditioned the text encoder on the speaker vector by feeding it into the encoder's third transformer block, better capturing speaker-specific pronunciation and intonation characteristics.
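
The modified alignment search in particular is easy to illustrate: the sketch below adds annealed Gaussian noise to the alignment scores before running the same monotonic_alignment_search routine sketched earlier. The noise scale and decay values are placeholder assumptions for illustration, not the settings used in the paper.

  import numpy as np

  def noisy_alignment_search(log_likelihood, global_step,
                             initial_noise=0.01, decay_per_step=1e-6):
      # Placeholder schedule: the noise starts at initial_noise and decays linearly
      # to zero, so early training explores alternative alignments while later
      # training settles on the maximum-likelihood one.
      noise_scale = max(0.0, initial_noise - decay_per_step * global_step)
      noisy_scores = log_likelihood + noise_scale * np.random.randn(*log_likelihood.shape)
      return monotonic_alignment_search(noisy_scores)  # DP routine from the earlier sketch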

Performance and Evaluation

VITS Results

On the LJ Speech dataset, VITS achieved a MOS of 4.43 (±0.06), compared to 4.46 (±0.06) for ground truth recordings. This outperformed Tacotron 2 + HiFi-GAN at 4.25 (±0.07) and Glow-TTS + HiFi-GAN at 4.32 (±0.07). The model was also fast at inference, synthesizing speech 67.12 times faster than real time compared to 27.48× for Glow-TTS + HiFi-GAN.
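
For reference, the "times faster than real time" figures quoted here are simply the ratio of audio duration to synthesis wall-clock time; the sketch below uses placeholder timings, not measurements from either paper.

  # Real-time speedup: seconds of audio produced per second of synthesis time.
  audio_seconds = 6.0       # placeholder: duration of the generated utterance
  synthesis_seconds = 0.09  # placeholder: wall-clock time taken to generate it
  speedup = audio_seconds / synthesis_seconds
  print(f"{speedup:.2f}x faster than real time")  # ~66.67x with these placeholder numbers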

VITS2 Results

VITS2 showed further improvements, with a MOS of 4.47 (±0.06) on LJ Speech, a 0.09 point increase over VITS as re-measured in the same evaluation. In direct comparison, VITS2 achieved a CMOS of 0.201 (±0.105) over VITS. The model also raised synthesis speed to 97.25 times faster than real time and reduced training time by approximately 22.7%.

Multi-Speaker Capabilities

Both models support multi-speaker synthesis. On the VCTK dataset of 109 speakers, VITS achieved a MOS of 4.38 (±0.06) compared to 3.82 (±0.07) for Glow-TTS + HiFi-GAN. In speaker-similarity evaluations, VITS2 scored 3.99 (±0.08) compared to 3.79 (±0.09) for VITS.

End-to-End Capabilities

A significant contribution of VITS2 was reducing the dependence on phoneme conversion. In a character error rate (CER) evaluation using automatic speech recognition, VITS2 achieved a 4.01% CER with normalized text input versus 3.92% with phoneme sequences, a gap small enough to demonstrate that fully end-to-end training without explicit phoneme preprocessing is feasible.
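
CER itself is straightforward to compute: the character-level edit distance between the reference transcript and the ASR output of the synthesized speech, divided by the reference length. The sketch below is illustrative only and is not the evaluation script used in the VITS2 paper.

  def character_error_rate(reference: str, hypothesis: str) -> float:
      # Levenshtein edit distance between the two character sequences.
      ref, hyp = list(reference), list(hypothesis)
      dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
      for i in range(len(ref) + 1):
          dp[i][0] = i                      # delete all remaining reference characters
      for j in range(len(hyp) + 1):
          dp[0][j] = j                      # insert all hypothesis characters
      for i in range(1, len(ref) + 1):
          for j in range(1, len(hyp) + 1):
              cost = 0 if ref[i - 1] == hyp[j - 1] else 1
              dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                             dp[i][j - 1] + 1,         # insertion
                             dp[i - 1][j - 1] + cost)  # substitution
      return dp[len(ref)][len(hyp)] / max(len(ref), 1)

  # One substituted character out of 26 gives a CER of about 3.8%.
  print(character_error_rate("printing in the only sense", "printing in the omly sense"))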

External Links