SNAC

SNAC (Multi-Scale Neural Audio Codec) is a neural audio codec that introduces multi-scale temporal quantization for efficient audio compression. It was presented at the NeurIPS 2024 Workshop on AI-Driven Speech, Music, and Sound Generation by researchers from Papla Media and ETH Zurich.

Overview

Neural audio codecs have recently gained popularity because they can represent audio signals with high fidelity at very low bitrates, making it feasible to use language modeling approaches for audio generation and understanding. While Residual Vector Quantization (RVQ) has become the standard technique for neural audio compression using a cascade of VQ codebooks, SNAC proposes a simple extension of RVQ where the quantizers can operate at different temporal resolutions.

Architecture

SNAC encodes audio into hierarchical tokens similarly to SoundStream, EnCodec, and DAC. However, SNAC introduces a simple change where coarse tokens are sampled less frequently, covering a broader time span.

The architecture includes several key innovations:

Multi-Scale Quantization: A hierarchy of quantizers operates at different frame rates, so the codec adapts to audio structure across multiple timescales.
Noise Blocks: Inject input-dependent Gaussian noise for enhanced expressiveness.
Depthwise Convolutions: Provide efficient computation and training stability.
Local Windowed Attention: Attention layers at the lowest temporal resolution capture contextual relationships.
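
The multi-scale quantization idea can be illustrated with a toy sketch. This is not the SNAC implementation: it quantizes scalar frames instead of latent vectors, and the average-pool downsampling, repeat upsampling, codebook values, and strides are all assumptions chosen for illustration. It only shows how each residual quantizer can run at its own temporal resolution, with coarser levels emitting fewer tokens.

```python
def avg_pool(x, stride):
    """Downsample by averaging consecutive groups of `stride` frames."""
    return [sum(x[i:i + stride]) / stride for i in range(0, len(x), stride)]


def repeat(x, stride):
    """Upsample by repeating each frame `stride` times."""
    return [v for v in x for _ in range(stride)]


def nearest(codebook, value):
    """Index of the codebook entry closest to `value`."""
    return min(range(len(codebook)), key=lambda k: abs(codebook[k] - value))


def multiscale_rvq(frames, codebooks, strides):
    """Toy residual quantization where level k works at stride strides[k].

    Assumes len(frames) is divisible by every stride.
    Returns (codes, residual): one list of token indices per level, and
    whatever part of the signal the quantizers did not capture.
    """
    residual = list(frames)
    codes = []
    for codebook, stride in zip(codebooks, strides):
        pooled = avg_pool(residual, stride)            # coarser time grid
        idx = [nearest(codebook, v) for v in pooled]   # one token per coarse frame
        quantized = repeat([codebook[k] for k in idx], stride)
        residual = [r - q for r, q in zip(residual, quantized)]
        codes.append(idx)
    return codes, residual


# Hypothetical input and codebooks, coarse to fine.
frames = [0.9, 1.1, 1.0, 1.2, -0.8, -1.0, -1.1, -0.9]
codebooks = [[-1.0, 1.0], [-0.15, 0.0, 0.15], [-0.1, 0.0, 0.1]]
strides = [4, 2, 1]

codes, residual = multiscale_rvq(frames, codebooks, strides)
for stride, level_codes in zip(strides, codes):
    print(f"stride {stride}: {len(level_codes)} tokens")
```

With eight input frames, the three levels emit 2, 4, and 8 tokens, so the coarsest level covers four times the time span per token of the finest one.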

Model Variants

SNAC offers several pretrained models optimized for different use cases:

Model	Sample Rate	Bitrate	RVQ Levels	Token Rates	Parameters	Use Case
snac_24khz	24 kHz	0.98 kbps	3	12, 23, and 47 Hz	~20M	Speech
snac_32khz	32 kHz	1.9 kbps	4	10, 21, 42, and 83 Hz	~55M	General audio
snac_44khz	44 kHz	2.6 kbps	4	14, 29, 57, and 115 Hz	~55M	Music/SFX

Each codebook holds 4096 entries (12 bits per token). The general audio model has about 16M parameters in the encoder and 38.3M in the decoder, for a total of 54.5M parameters.
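
The bitrates in the table follow from the token rates and the codebook size: every token carries log2(4096) = 12 bits, so the bitrate is the sum of the per-level token rates times 12. A minimal check, assuming the rounded token rates listed in the table (the function and dictionary names are illustrative only):

```python
import math

CODEBOOK_SIZE = 4096
BITS_PER_TOKEN = int(math.log2(CODEBOOK_SIZE))  # 12 bits per token

# Rounded per-level token rates in Hz, copied from the table above.
TOKEN_RATES_HZ = {
    "snac_24khz": [12, 23, 47],
    "snac_32khz": [10, 21, 42, 83],
    "snac_44khz": [14, 29, 57, 115],
}


def bitrate_kbps(rates_hz, bits_per_token=BITS_PER_TOKEN):
    """Total bitrate in kbps for a set of per-level token rates."""
    return sum(rates_hz) * bits_per_token / 1000


for name, rates in TOKEN_RATES_HZ.items():
    print(f"{name}: {sum(rates)} tokens/s, {bitrate_kbps(rates):.2f} kbps")
```

Because the token rates are rounded, the results (0.98, 1.87, and 2.58 kbps) agree with the table's 0.98, 1.9, and 2.6 kbps only to rounding.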

Performance

In the reported speech evaluations, SNAC outperforms the other codecs tested. Notably, even at bitrates below 1 kbit/s, its audio quality closely approaches the reference signal. SNAC also outperformed competing codecs such as EnCodec and DAC at comparable bitrates, and matched the quality of systems operating at twice its bitrate.

Applications

SNAC has been adopted in several text-to-speech systems:

Orpheus TTS: Orpheus uses SNAC, which creates tokens at four levels of hierarchy. The SNAC model is relatively lightweight and fast, making it suitable for real-time decoding.

With coarse tokens at ~10 Hz and a context window of 2048 tokens, a language model can capture the consistent structure of an audio track over roughly 3 minutes (2048 tokens / 10 Hz ≈ 205 s).
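
The 3-minute figure is the context length divided by the token rate. The sketch below reproduces it and, as a contrast, shows the span when every level of the 32 kHz model shares one context window; that second scenario is an assumption for illustration and is not described in the text above.

```python
def context_span_seconds(context_tokens, token_rate_hz):
    """Seconds of audio covered by a context window at a given token rate."""
    return context_tokens / token_rate_hz


# Coarse tokens only: ~10 Hz with a 2048-token context.
coarse = context_span_seconds(2048, 10)
print(f"coarse only: {coarse:.1f} s = {coarse / 60:.1f} min")  # 204.8 s = 3.4 min

# Assumed contrast: all four levels of snac_32khz flattened into one stream.
all_levels = context_span_seconds(2048, 10 + 21 + 42 + 83)
print(f"all levels:  {all_levels:.1f} s")  # 13.1 s
```

The gap between the two numbers is why modeling only the coarse level is attractive for long-range structure.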

Comparison with Other Codecs

As used in Orpheus, SNAC produces 83 tokens per second, compared with 50 tokens/s for X-Codec 2.0 and 25 tokens/s for CosyVoice's codec. Rather than running several codebooks at a single frame rate, as codecs like Mimi do, SNAC creates tokens for each level of downsampling.
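
Because tokens exist at every level of downsampling, a language model that consumes a single stream has to interleave the levels. The sketch below shows one possible coarse-to-fine ordering and its inverse. It assumes each level runs at twice the rate of the previous one, and the ordering and function names are hypothetical: they are not taken from Orpheus or the SNAC library.

```python
def flatten_codes(codes):
    """Interleave multi-scale codes into one stream, one coarse frame at a time.

    codes[level] is a list of token ids; level k is assumed to hold
    2**k tokens for every token at level 0.
    """
    flat = []
    for i in range(len(codes[0])):
        for level, level_codes in enumerate(codes):
            per_frame = 2 ** level
            flat.extend(level_codes[i * per_frame:(i + 1) * per_frame])
    return flat


def unflatten_codes(flat, n_levels):
    """Invert flatten_codes for a stream with `n_levels` levels."""
    per_frame = [2 ** level for level in range(n_levels)]
    frame_len = sum(per_frame)
    codes = [[] for _ in range(n_levels)]
    for start in range(0, len(flat), frame_len):
        pos = start
        for level, n in enumerate(per_frame):
            codes[level].extend(flat[pos:pos + n])
            pos += n
    return codes


# Hypothetical token ids for three levels (coarse to fine).
codes = [[10, 11], [20, 21, 22, 23], [30, 31, 32, 33, 34, 35, 36, 37]]
flat = flatten_codes(codes)
print(flat)
assert unflatten_codes(flat, 3) == codes
```

With three levels, each coarse frame contributes 1 + 2 + 4 = 7 tokens to the flattened stream.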
