VITS
== Technical Architecture ==

=== VITS (2021) ===
VITS employs a conditional variational autoencoder (VAE) framework combined with several advanced techniques.

'''Core Components:'''
* '''Posterior Encoder:''' Processes linear-scale spectrograms during training to learn latent representations
* '''Prior Encoder:''' Contains a text encoder and normalizing flows to model the prior distribution of the latent variables
* '''Decoder:''' Based on the HiFi-GAN V1 generator; converts latent variables to raw waveforms
* '''Discriminator:''' Multi-period discriminator from HiFi-GAN for adversarial training

'''Key Innovations:'''
* '''Monotonic Alignment Search (MAS):''' Automatically learns alignments between text and speech, without external annotations, by finding the alignment that maximizes the likelihood of the target speech.
* '''Stochastic Duration Predictor:''' Uses normalizing flows to model the distribution of phoneme durations, enabling synthesis of speech with diverse rhythms from the same text input.
* '''Adversarial Training:''' Improves waveform quality through generator-discriminator competition.

The model addresses the one-to-many relationship in speech synthesis, where a single text input can be spoken in many ways, with different pitches, rhythms, and prosodic patterns.

=== VITS2 (2023) ===
VITS2 introduced several improvements over the original model to address intermittent unnaturalness, computational inefficiency, and strong dependence on phoneme conversion.

'''Major Improvements:'''
* '''Adversarial Duration Predictor:''' Replaced the flow-based stochastic duration predictor with one trained through adversarial learning, using a time-step-wise conditional discriminator to improve efficiency and naturalness.
* '''Enhanced Normalizing Flows:''' Added transformer blocks to the normalizing flows to capture long-term dependencies when transforming distributions, addressing a limitation of convolution-only approaches.
* '''Improved Alignment Search:''' Modified Monotonic Alignment Search by adding Gaussian noise to the calculated probabilities, giving the model additional opportunities to explore alternative alignments during early training.
* '''Speaker-Conditioned Text Encoder:''' For multi-speaker models, conditioned the text encoder on the speaker vector at its third transformer block, to better capture speaker-specific pronunciation and intonation characteristics.
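The core of Monotonic Alignment Search described above is a short dynamic program. The following is a minimal NumPy sketch, not the official VITS implementation: it takes a matrix of per-frame log-likelihoods (one row per text token) and returns the hard monotonic alignment that maximizes their sum. The function name and the dense 0/1 output format are illustrative choices.

```python
import numpy as np

def monotonic_alignment_search(log_p):
    """Find the monotonic alignment between text tokens and speech frames
    that maximizes total log-likelihood, via dynamic programming.

    log_p: (n_text, n_frames) array; log_p[i, j] is the log-likelihood of
           speech frame j under the prior distribution of text token i.
    Returns a (n_text, n_frames) 0/1 matrix in which each frame is assigned
    to exactly one token and the token index never decreases over time.
    (VITS2's variant would add Gaussian noise to log_p before this search.)
    """
    n_text, n_frames = log_p.shape
    neg_inf = -1e9
    # Q[i, j] = best total log-likelihood of any monotonic path that ends
    # with frame j assigned to token i.
    Q = np.full((n_text, n_frames), neg_inf)
    Q[0, 0] = log_p[0, 0]
    for j in range(1, n_frames):
        for i in range(n_text):
            stay = Q[i, j - 1]                             # frame j repeats token i
            move = Q[i - 1, j - 1] if i > 0 else neg_inf   # frame j advances one token
            Q[i, j] = max(stay, move) + log_p[i, j]
    # Backtrack from the corner (last token, last frame), so every token
    # receives at least one frame.
    align = np.zeros((n_text, n_frames), dtype=np.int64)
    i = n_text - 1
    for j in range(n_frames - 1, -1, -1):
        align[i, j] = 1
        if j > 0 and i > 0 and Q[i - 1, j - 1] >= Q[i, j - 1]:
            i -= 1
    return align
```

Because the path can only stay on the current token or advance by one token per frame, the recovered alignment is monotonic and surjective by construction, which is exactly the property that lets VITS train without external alignment annotations.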