X-Codec

X-Codec is a neural audio codec designed to enhance semantic understanding in audio large language models (LLMs). It was introduced in the paper "Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model," published at AAAI 2025.

Background

Traditional audio codecs such as EnCodec were designed primarily for audio compression, which can lead to suboptimal performance when their tokens are used in audio LLMs. The paper's analysis found that methods like VALL-E, which condition acoustic token generation on text transcriptions, often suffer from content inaccuracies and elevated word error rates (WER): because the acoustic tokens carry little semantic information, the model misinterprets them, producing skipped or incorrect words.

Architecture

X-Codec addresses these limitations through a dual-encoder design that incorporates semantic features from a pre-trained semantic encoder before the Residual Vector Quantization (RVQ) stage and introduces a semantic reconstruction loss after RVQ.

The architecture consists of:

  • Acoustic Encoder/Decoder: Convolutional encoder and decoder with a Residual Vector Quantizer (RVQ)
  • Semantic Module: A pre-trained self-supervised model such as HuBERT or WavLM
  • Projectors: Linear layers that combine and process the acoustic and semantic features

The acoustic and semantic features are concatenated, transformed, and then quantized together. After quantization, separate post-processing layers reconstruct both semantic and acoustic representations.
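
The overall data flow can be summarized with a PyTorch-style sketch. This is a conceptual illustration rather than the released implementation: the stand-in modules, dimensions, and toy residual quantizer are assumptions, and the frozen self-supervised encoder (HuBERT/WavLM in the real model) is replaced by a plain convolution for brevity.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn
import torch.nn.functional as F

class XCodecSketch(nn.Module):
    """Conceptual sketch of the X-Codec data flow (not the official implementation)."""

    def __init__(self, acoustic_dim=256, semantic_dim=768, code_dim=256,
                 n_quantizers=8, codebook_size=1024):
        super().__init__()
        # Acoustic path: the real model uses a deep convolutional encoder/decoder;
        # a single strided conv (and its transpose) stands in for it here.
        self.acoustic_encoder = nn.Conv1d(1, acoustic_dim, kernel_size=640, stride=320)
        self.acoustic_decoder = nn.ConvTranspose1d(acoustic_dim, 1, kernel_size=640, stride=320)
        # Semantic path: the real model uses a frozen self-supervised encoder
        # such as HuBERT or WavLM; a conv with matching stride stands in for it.
        self.semantic_encoder = nn.Conv1d(1, semantic_dim, kernel_size=640, stride=320)
        # Projectors that fuse features before quantization and split them after.
        self.pre_proj = nn.Linear(acoustic_dim + semantic_dim, code_dim)
        self.post_proj_acoustic = nn.Linear(code_dim, acoustic_dim)
        self.post_proj_semantic = nn.Linear(code_dim, semantic_dim)
        # Toy residual VQ: each codebook quantizes the residual left by the previous one.
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, code_dim) for _ in range(n_quantizers))

    def residual_vq(self, x):
        residual, quantized, codes = x, torch.zeros_like(x), []
        for cb in self.codebooks:
            dist = (residual.unsqueeze(-2) - cb.weight).pow(2).sum(-1)  # (B, T, codebook_size)
            idx = dist.argmin(dim=-1)                                   # nearest code per frame
            q = cb(idx)
            quantized, residual = quantized + q, residual - q
            codes.append(idx)
        # Straight-through estimator so gradients flow back into the encoders.
        quantized = x + (quantized - x).detach()
        return quantized, torch.stack(codes, dim=-1)

    def forward(self, wav):                                   # wav: (B, 1, samples)
        a = self.acoustic_encoder(wav).transpose(1, 2)        # (B, T, acoustic_dim)
        s = self.semantic_encoder(wav).transpose(1, 2)        # (B, T, semantic_dim)
        fused = self.pre_proj(torch.cat([a, s], dim=-1))      # concatenate, then project
        q, codes = self.residual_vq(fused)                    # quantize the fused features
        wav_hat = self.acoustic_decoder(self.post_proj_acoustic(q).transpose(1, 2))
        sem_hat = self.post_proj_semantic(q)                  # semantic reconstruction branch
        semantic_loss = F.mse_loss(sem_hat, s.detach())       # loss added after quantization
        return wav_hat, codes, semantic_loss

wav = torch.randn(1, 1, 16000)                 # one second of dummy 16 kHz audio
wav_hat, codes, semantic_loss = XCodecSketch()(wav)
print(wav_hat.shape, codes.shape)              # reconstructed waveform, (B, T, n_quantizers) token ids
</syntaxhighlight>

The key point is that the quantizer operates on the fused acoustic-plus-semantic features, and the semantic branch adds a reconstruction loss after quantization alongside the usual acoustic reconstruction objective.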

Applications

X-Codec demonstrated improvements across multiple audio tasks, including text-to-speech synthesis, music continuation, and general audio classification. By enhancing the semantic ability of the codec, X-Codec significantly reduces WER in speech synthesis and extends these benefits to non-speech applications, including music and sound generation.

X-Codec 2.0

X-Codec 2.0 (also written as XCodec2) is a successor to X-Codec, introduced alongside the LLaSA text-to-speech system in the paper "LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis."

Key Differences from X-Codec

X-Codec2 extends the original X-Codec by refining how semantic and acoustic information is integrated and tokenized. Major architectural changes include:

  • Unified Semantic-Acoustic Tokenization: X-Codec2 fuses outputs from a semantic encoder (e.g., Wav2Vec2-BERT) and an acoustic encoder into a single embedding, capturing both high-level meaning (e.g., text content, emotion) and low-level audio details (e.g., timbre).
  • Single-Stage Vector Quantization: Unlike the multi-layer residual VQ used in most codecs (e.g., X-Codec, DAC, EnCodec), X-Codec2 uses a single-layer quantizer based on Finite Scalar Quantization (FSQ) for stability and compatibility with causal language models.
  • Large Codebook: A 65,536-entry codebook realized with Finite Scalar Quantization, achieving 99% codebook usage; this vocabulary size is comparable to that of text tokenizers (LLaMA 3 uses 128,256 tokens). A minimal FSQ sketch follows this list.
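
The single-stage quantization can be illustrated with a short, generic FSQ sketch in PyTorch. It is not the X-Codec 2.0 quantizer: the function name and the level configuration (chosen so the levels multiply to 65,536 codes) are illustrative assumptions.

<syntaxhighlight lang="python">
import torch

def fsq_quantize(z, levels=(8, 8, 8, 8, 4, 4)):
    """Generic Finite Scalar Quantization: squash each latent dimension into a
    bounded range, round it to one of levels[i] values, and pack the per-dimension
    digits into a single integer token. The level configuration here
    (8*8*8*8*4*4 = 65,536 codes) is an illustrative assumption."""
    levels = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (levels - 1) / 2
    bounded = torch.tanh(z) * half + half               # each dim now lies in [0, levels_i - 1]
    digits = torch.round(bounded)
    # Straight-through estimator: rounding has zero gradient, so gradients are
    # copied from the unrounded values. No codebook parameters, no commitment loss.
    digits = bounded + (digits - bounded).detach()
    # Mixed-radix packing of the per-dimension digits into one code index.
    radix = torch.cumprod(
        torch.cat([torch.ones(1, dtype=z.dtype, device=z.device), levels[:-1]]), dim=0)
    index = (digits * radix).sum(dim=-1).long()
    return digits, index

z = torch.randn(2, 50, 6)                        # (batch, frames, quantizer dimensions)
digits, tokens = fsq_quantize(z)
print(tokens.shape, int(tokens.max()) < 65_536)  # one token per frame, from 65,536 possible codes
</syntaxhighlight>

Because the "codebook" is implicit in the rounding grid, there are no codebook parameters to learn and no commitment loss, and every code index is reachable, which is consistent with the near-complete codebook usage reported above.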

Technical Specifications

  • Semantic Encoder: Wav2Vec2-BERT, pre-trained on 4.5M hours of unlabeled audio data covering more than 143 languages.
  • Training Data: Codec trained on 150k hours of multilingual speech data, including Emilia (En/Zh/De/Fr/Ja/Ko) and MLS (En/Fr/De/Nl/Es/It/Pt/Pl).
  • Quantization: Finite Scalar Quantization (FSQ), which does not require an explicit VQ objective term (e.g., codebook commitment loss), simplifying optimization during training.

Derivatives

X-Codec 2.0 has been extended in several ways:

  • NeuCodec: Neuphonic's codec for on-device TTS, built largely as an extension of X-Codec 2.0
  • XCodec2-Streaming: A streaming variant that adopts a causal decoder restricted to historical context, enabling streaming waveform reconstruction (see the sketch below).
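
The "historical context only" property can be illustrated with a left-padded (causal) convolution that is run chunk by chunk using a small cache. This is a generic sketch of the technique, not the XCodec2-Streaming code; the class name and dimensions are hypothetical.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """Convolution padded only on the left, so each output frame depends solely on
    current and past input frames. A small cache of past frames makes it streamable."""

    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        self.pad = kernel_size - 1
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size)

    def forward(self, x, cache=None):
        # `cache` holds the last (kernel_size - 1) input frames of the previous
        # chunk; the very first chunk is padded with zeros instead.
        if cache is None:
            cache = x.new_zeros(x.size(0), x.size(1), self.pad)
        x = torch.cat([cache, x], dim=-1)
        return self.conv(x), x[..., -self.pad:]   # (output, cache for the next chunk)

# Chunk-by-chunk streaming matches a single full-signal pass.
layer = CausalConv1d(1, 1, kernel_size=5)
signal = torch.randn(1, 1, 40)
full_out, _ = layer(signal)
cache, pieces = None, []
for chunk in signal.split(10, dim=-1):
    out, cache = layer(chunk, cache)
    pieces.append(out)
print(torch.allclose(full_out, torch.cat(pieces, dim=-1), atol=1e-6))  # True
</syntaxhighlight>

Stacking such layers (and their causal transposed counterparts) yields a decoder whose output at time t never depends on future frames, which is what makes streaming waveform reconstruction possible.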

Availability

X-Codec is available on GitHub and integrated into Hugging Face's Transformers library. X-Codec 2.0 is available via the xcodec2 Python package and on Hugging Face.
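
For orientation, an encode/decode round trip with the xcodec2 package might look like the sketch below. It follows the usage pattern from the project's documentation as recalled here; the import path, class name, method names (encode_code, decode_code), and checkpoint identifier are assumptions that should be checked against the current release.

<syntaxhighlight lang="python">
import torch
import soundfile as sf
from xcodec2.modeling_xcodec2 import XCodec2Model   # assumed import path

# Assumed checkpoint identifier on Hugging Face; verify against the current release.
model = XCodec2Model.from_pretrained("HKUST-Audio/xcodec2").eval()

wav, sr = sf.read("speech_16khz.wav")                # 16 kHz mono input (hypothetical file)
wav = torch.from_numpy(wav).float().unsqueeze(0)     # shape (1, samples)

with torch.no_grad():
    codes = model.encode_code(input_waveform=wav)    # discrete token ids (assumed method name)
    recon = model.decode_code(codes)                 # reconstructed waveform (assumed method name)

sf.write("reconstruction.wav", recon[0, 0].cpu().numpy(), 16000)
</syntaxhighlight>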