
== Technical Architecture ==

Orpheus TTS differs from conventional text-to-speech systems by using a modified version of Meta's Llama-3.2-3B language model as its foundation. It takes a text prompt as input and generates audio tokens using the 24 kHz variant of the [[SNAC]] audio tokenizer. The system was trained on over 100,000 hours of English speech data combined with billions of tokens of textual question-answer (QA) pairs, a hybrid approach designed to maintain linguistic understanding while adding speech synthesis capabilities.
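
Because the language model emits audio tokens as one flat sequence, a decoder has to regroup them into the hierarchical code levels that a SNAC-style tokenizer expects. The following is a minimal sketch of that regrouping, not the project's actual implementation. It assumes a frame of seven tokens (one coarse, two mid, four fine), a codebook size of 4096 per level, and a per-position offset of <code>position * 4096</code>; none of these values are stated in this article, and the function name <code>split_frames</code> is hypothetical.

```python
from typing import List, Tuple

TOKENS_PER_FRAME = 7   # assumption: 1 coarse + 2 mid + 4 fine codes per frame
CODEBOOK_SIZE = 4096   # assumption: entries per codebook level


def split_frames(codes: List[int]) -> Tuple[List[int], List[int], List[int]]:
    """Regroup a flat list of audio codes into three hierarchical levels."""
    # Ignore a trailing partial frame, which can occur mid-stream.
    usable = len(codes) - len(codes) % TOKENS_PER_FRAME
    coarse: List[int] = []
    mid: List[int] = []
    fine: List[int] = []
    for start in range(0, usable, TOKENS_PER_FRAME):
        frame = codes[start:start + TOKENS_PER_FRAME]
        # Assumed layout: the token at position i is shifted by i * CODEBOOK_SIZE.
        local = [code - i * CODEBOOK_SIZE for i, code in enumerate(frame)]
        coarse.append(local[0])
        mid.extend([local[1], local[4]])
        fine.extend([local[2], local[3], local[5], local[6]])
    return coarse, mid, fine


if __name__ == "__main__":
    # One synthetic frame whose local codes are 0..6.
    demo = [i * CODEBOOK_SIZE + i for i in range(TOKENS_PER_FRAME)]
    print(split_frames(demo))
```

Under these assumptions, each frame contributes one coarse code, two mid codes, and four fine codes, so the three output lists grow in a 1:2:4 ratio.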

The model supports streaming out of the box and can achieve roughly 200 milliseconds of streaming latency.<ref>https://www.baseten.co/blog/canopy-labs-selects-baseten-as-preferred-inference-provider-for-orpheus-tts-model/</ref>
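
Streaming latency is usually understood as the time from sending a request until the first audio chunk arrives. The sketch below shows one way such a figure could be measured against any iterable of audio chunks. The generator <code>fake_tts_stream</code> is a hypothetical stand-in that only sleeps and yields silence; it is not the Orpheus TTS API, and its chunk size and delay are arbitrary.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def measure_stream(chunks: Iterable[bytes]) -> Tuple[Optional[float], int]:
    """Return (seconds until the first chunk, total bytes received)."""
    start = time.perf_counter()
    first_chunk_latency: Optional[float] = None
    total_bytes = 0
    for chunk in chunks:
        if first_chunk_latency is None:
            first_chunk_latency = time.perf_counter() - start
        total_bytes += len(chunk)
    return first_chunk_latency, total_bytes


def fake_tts_stream(n_chunks: int = 3, delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS client; yields silent chunks."""
    for _ in range(n_chunks):
        time.sleep(delay_s)      # simulated generation time per chunk
        yield b"\x00" * 480      # arbitrary chunk size for illustration


if __name__ == "__main__":
    latency, size = measure_stream(fake_tts_stream())
    print(f"first chunk after {latency * 1000:.1f} ms, {size} bytes total")
```

Replacing <code>fake_tts_stream()</code> with a real streaming client's chunk iterator would give a comparable time-to-first-chunk figure, although the result also depends on network and hardware conditions.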