Orpheus TTS
Orpheus TTS is an open-source text-to-speech (TTS) system developed by Canopy Labs and released in March 2025. Built on the Llama-3.2-3B architecture, it generates speech by having a large language model predict audio tokens, rather than relying on a traditional TTS-specific architecture.
Development and Release
Orpheus TTS was developed by Canopy Labs, an artificial intelligence startup founded with the stated mission of creating "digital humans that are indistinguishable from real humans."[1] The model weights and training code were publicly released on March 18, 2025, under the Apache License 2.0.[2]
Technical Architecture
Orpheus TTS differs from conventional text-to-speech systems by using a modified version of Meta's Llama-3.2-3B language model as its foundation. It takes a text prompt as input and generates audio tokens for the 24 kHz variant of the SNAC audio tokenizer, which are then decoded into a waveform. The system was trained on a dataset comprising over 100,000 hours of English speech data combined with billions of tokens of text question-answer pairs, a hybrid approach designed to maintain linguistic understanding while adding speech synthesis capabilities.
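For a language model to emit codec tokens as one sequence, the parallel code streams of a multi-scale tokenizer have to be interleaved into a flat list and separated again before decoding. The following is a minimal sketch of that idea, not the project's actual implementation. The three-level layout (1 coarse, 2 mid, and 4 fine codes per frame), the codebook size of 4096, the per-slot offset scheme, and the function names are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# All constants below are illustrative assumptions, not values taken from the source.
CODEBOOK_SIZE = 4096            # assumed number of entries per codebook
CODES_PER_FRAME = (1, 2, 4)     # assumed coarse / mid / fine codes per frame
FRAME_LEN = sum(CODES_PER_FRAME)  # 7 flat tokens per frame


def flatten_frames(coarse: Sequence[int], mid: Sequence[int],
                   fine: Sequence[int]) -> List[int]:
    """Interleave three code streams into one flat token sequence.

    Each of the 7 slots in a frame gets its own id range (slot * CODEBOOK_SIZE),
    so a token id identifies both the code and its position in the frame.
    """
    n = len(coarse)
    if len(mid) != 2 * n or len(fine) != 4 * n:
        raise ValueError("stream lengths must be in a 1:2:4 ratio")
    flat: List[int] = []
    for i in range(n):
        frame = [coarse[i], mid[2 * i], mid[2 * i + 1], *fine[4 * i:4 * i + 4]]
        flat.extend(code + slot * CODEBOOK_SIZE for slot, code in enumerate(frame))
    return flat


def unflatten_frames(flat: Sequence[int]) -> Tuple[List[int], List[int], List[int]]:
    """Invert flatten_frames: recover the three code streams from flat tokens."""
    if len(flat) % FRAME_LEN != 0:
        raise ValueError("flat sequence must contain whole frames")
    coarse: List[int] = []
    mid: List[int] = []
    fine: List[int] = []
    for start in range(0, len(flat), FRAME_LEN):
        frame = [tok - slot * CODEBOOK_SIZE
                 for slot, tok in enumerate(flat[start:start + FRAME_LEN])]
        coarse.append(frame[0])
        mid.extend(frame[1:3])
        fine.extend(frame[3:7])
    return coarse, mid, fine
```

Giving each slot its own id range is one way to let a single vocabulary carry several codebooks. In a real system the flat ids would also be shifted past the text vocabulary, which this sketch leaves out.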
The model supports streaming out of the box and can achieve approximately 200 milliseconds of streaming latency.[3]
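Streaming latency is usually understood as the time from request to the first audible chunk, and it can be measured independently of any particular TTS library. The sketch below times the first chunk from an arbitrary iterator of audio bytes. `measure_stream` and `fake_tts_stream` are hypothetical names, and the stand-in generator only simulates delays; it is not the Orpheus API. The 16-bit mono sample format is also an assumption; only the 24 kHz sample rate comes from the text above.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple

SAMPLE_RATE_HZ = 24_000   # matches the 24 kHz codec variant mentioned above
BYTES_PER_SAMPLE = 2      # assumption: 16-bit mono PCM


def measure_stream(chunks: Iterable[bytes]) -> Tuple[Optional[float], float]:
    """Consume an audio stream and report timing.

    Returns (seconds until the first chunk arrived, seconds of audio received).
    The first value is None if the stream produced no chunks.
    """
    start = time.perf_counter()
    first_chunk_latency: Optional[float] = None
    total_bytes = 0
    for chunk in chunks:
        if first_chunk_latency is None:
            first_chunk_latency = time.perf_counter() - start
        total_bytes += len(chunk)
    audio_seconds = total_bytes / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
    return first_chunk_latency, audio_seconds


def fake_tts_stream(n_chunks: int = 3, delay_s: float = 0.05,
                    chunk_bytes: int = 960) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS call: yields silent PCM chunks."""
    for _ in range(n_chunks):
        time.sleep(delay_s)          # simulate generation time per chunk
        yield b"\x00" * chunk_bytes


if __name__ == "__main__":
    latency, seconds = measure_stream(fake_tts_stream())
    print(f"first chunk after {latency * 1000:.0f} ms, {seconds:.3f} s of audio")
```

Replacing `fake_tts_stream()` with a real streaming generator would give a figure comparable to the latency claim, though the result depends heavily on hardware and serving setup.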
Performance Claims and Evaluation
Canopy Labs claims that Orpheus TTS delivers "natural intonation, emotion, and rhythm that is superior to SOTA closed source models," positioning it as competitive with established commercial systems such as ElevenLabs and other proprietary text-to-speech services.[4]
However, these performance assertions are based primarily on internal evaluations and subjective assessments rather than standardized benchmarks or peer-reviewed studies. The lack of comprehensive comparative analysis with established TTS systems has led to some skepticism within the research community about the extent of its claimed superiority.
Model Variants
The model is available in three variants:
- Pretrained model: The base model trained on the full dataset, suitable for research and custom fine-tuning
- Fine-tuned production model: An optimized version designed for everyday TTS applications, fine-tuned on several voices
- Multilingual model: A family of fine-tuned models with support for other languages