Latest revision as of 16:06, 20 September 2025
Orpheus TTS
| Field | Value |
|---|---|
| Model Information | |
| Developer | Canopy Labs |
| Release date | March 18, 2025 |
| Latest version | 3B 0.1 |
| Architecture | LLM-Based |
| Parameters | 3 billion |
| Training data | 100k+ hours (English) |
| Capabilities | |
| Languages | English (multilingual in preview) |
| Voices | 8 distinct voices |
| Voice cloning | Yes (zero-shot) |
| Emotion control | Yes (tag-based) |
| Streaming | Yes |
| Latency | ~200 ms |
| Availability | |
| License | Apache License 2.0 |
| Open source | Yes |
| Repository | GitHub (https://github.com/canopyai/Orpheus-TTS) |
| Model weights | Hugging Face (https://huggingface.co/canopylabs/orpheus-3b-0.1-ft) |
| Demo | HF Spaces (https://huggingface.co/spaces/MohamedRashad/Orpheus-TTS) |
| Website | Canopy Labs (https://canopylabs.ai/releases/towards_human_sounding_tts) |
Orpheus TTS is an open-source text-to-speech (TTS) system developed by Canopy Labs and released in March 2025. Built on the Llama-3.2-3B architecture, it adapts a large language model to generate audio tokens instead of relying on a traditional TTS-specific architecture.
Development and Release
Orpheus TTS was developed by Canopy Labs, an artificial intelligence startup founded with the stated mission of creating "digital humans that are indistinguishable from real humans."[1] The model weights and training code were publicly released on March 18, 2025, under the Apache License 2.0.[2]
Technical Architecture
Orpheus TTS differs from conventional text-to-speech systems by using a modified version of Meta's Llama-3.2-3B language model as its foundation. It takes a text prompt as input and generates audio tokens, which are decoded with the 24 kHz variant of the SNAC audio tokenizer. The system was trained on over 100,000 hours of English speech data combined with billions of tokens of textual QA pairs, a hybrid approach designed to preserve linguistic understanding while adding speech synthesis capabilities.
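The step from a flat stream of language-model audio tokens to the multi-level codes that a hierarchical codec expects can be sketched as below. The layout is an assumption made for illustration only and is not stated in this article: seven tokens per frame (one coarse, two mid, four fine), each codebook holding 4096 entries, and each position within a frame shifted by a multiple of that size. The function name `split_frames` is hypothetical.

```python
from typing import List, Tuple

CODEBOOK_SIZE = 4096   # assumed number of entries per codebook
TOKENS_PER_FRAME = 7   # assumed: 1 coarse + 2 mid + 4 fine codes per frame


def split_frames(flat_tokens: List[int]) -> Tuple[List[int], List[int], List[int]]:
    """De-interleave a flat audio-token stream into three codebook levels.

    Assumes position p inside a frame was offset by p * CODEBOOK_SIZE and that
    positions map to levels as [coarse, mid, fine, fine, mid, fine, fine].
    Trailing tokens that do not fill a whole frame are ignored.
    """
    usable = len(flat_tokens) - len(flat_tokens) % TOKENS_PER_FRAME
    coarse: List[int] = []
    mid: List[int] = []
    fine: List[int] = []
    level_for_position = [coarse, mid, fine, fine, mid, fine, fine]

    for start in range(0, usable, TOKENS_PER_FRAME):
        for pos in range(TOKENS_PER_FRAME):
            # Undo the assumed per-position offset to recover the raw code.
            code = flat_tokens[start + pos] - pos * CODEBOOK_SIZE
            if not 0 <= code < CODEBOOK_SIZE:
                raise ValueError(f"token at index {start + pos} is out of range")
            level_for_position[pos].append(code)
    return coarse, mid, fine


if __name__ == "__main__":
    frame = [5 + 0 * CODEBOOK_SIZE, 6 + 1 * CODEBOOK_SIZE, 7 + 2 * CODEBOOK_SIZE,
             8 + 3 * CODEBOOK_SIZE, 9 + 4 * CODEBOOK_SIZE, 10 + 5 * CODEBOOK_SIZE,
             11 + 6 * CODEBOOK_SIZE]
    print(split_frames(frame))
```

The three returned lists would then be handed to the codec's decoder to reconstruct a waveform; that decoding call is left out because its exact interface is not covered here.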
The model supports streaming out of the box and can reach a streaming latency of about 200 milliseconds.[3]
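Streaming latency is usually understood as the time until the first audio chunk arrives rather than the time to synthesize the whole utterance. A minimal sketch of how that figure could be measured is shown below. `fake_tts_stream` is a hypothetical stand-in for a real streaming call, and the chunk size and delay are illustrative assumptions, not measurements of Orpheus TTS.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def measure_stream(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], int]:
    """Consume an audio stream; return (seconds to first chunk, total bytes).

    The first value is None when the stream yields no chunks at all.
    """
    start = clock()
    first_latency: Optional[float] = None
    total_bytes = 0
    for chunk in chunks:
        if first_latency is None:
            # Time to first chunk is the latency a listener perceives.
            first_latency = clock() - start
        total_bytes += len(chunk)
    return first_latency, total_bytes


def fake_tts_stream(n_chunks: int = 3, delay_s: float = 0.05,
                    chunk_bytes: int = 4800) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS call (not a real API).

    4800 bytes would be 0.1 s of 16-bit mono audio at 24 kHz.
    """
    for _ in range(n_chunks):
        time.sleep(delay_s)          # simulate generation time per chunk
        yield b"\x00" * chunk_bytes  # silent PCM placeholder


if __name__ == "__main__":
    latency, size = measure_stream(fake_tts_stream())
    print(f"first chunk after {latency * 1000:.0f} ms, {size} bytes total")
```

Passing the clock as a parameter keeps the measurement testable with a fake timer, and the same function can wrap any iterator of audio chunks.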
Performance Claims and Evaluation
Canopy Labs claims that Orpheus TTS delivers "natural intonation, emotion, and rhythm that is superior to SOTA closed source models," positioning it as competitive with established commercial systems such as ElevenLabs and other proprietary text-to-speech services.[4]
However, these claims rest primarily on internal evaluations and subjective assessments rather than standardized benchmarks or peer-reviewed studies. The lack of a comprehensive comparison against established TTS systems has led to some skepticism within the research community about the extent of the claimed superiority.
Model Variants
The model is available in three variants:
- Pretrained model: The base model trained on the full dataset, suitable for research and custom fine-tuning
- Fine-tuned production model: An optimized version designed for everyday TTS applications, fine-tuned on several voices
- Multilingual model: A family of fine-tuned models with support for other languages