ZONOS2: real-time TTS with 8B params, 900M active, and high-fidelity voice cloning
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Video: https://reddit.com/link/1u4lk5c/video/kyhdw0uog07h1/player

Zyphra has released ZONOS2, its next-generation real-time text-to-speech model focused on expressive, high-fidelity voice cloning. It is open-source under Apache 2.0 and also available on Zyphra Cloud on AMD hardware.

The model is designed to solve the usual TTS tradeoff between quality and speed. Zyphra says ZONOS2 is the first sparse MoE TTS model released open-source, with 8B total parameters and 900M active parameters at inference. The goal is straightforward: fast, efficient, and expressive speech synthesis without the usual compromise pileup.

A major focus is voice cloning. Zyphra claims ZONOS2 is especially strong at capturing the distinctive characteristics of a speaker, producing more natural-sounding clones across a wide range of voices. The cloning is zero-shot, so no fine-tuning is needed.

On the audio side, ZONOS2 predicts Descript Audio Codec (DAC) tokens for 44.1 kHz studio-quality audio. That gives better fidelity, but is harder to model than lower-quality codec setups. Zyphra says it closes that gap by scaling up both the model and the training data.

For text handling, ZONOS2 does not use a phonemizer. Instead, it reads raw UTF-8 bytes, which Zyphra says improves coverage for lower-resource languages, boosts performance on Chinese, Korean, and Japanese, and supports native code-switching mid-sentence.

Training also scaled heavily, from roughly 200K hours to 6M+ hours of audio. Zyphra says it used staged data filtering with increasing transcript-agreement strictness across pretraining, midtraining, and annealing. The intended result is fewer hallucinations, mispronunciations, and repetitions.

Zyphra is also releasing ZTTS1-Eval, a new benchmark for TTS evaluation. It includes clean and in-the-wild datasets across up to 17 languages, with newer evaluation models such as Qwen3-ASR, ReDimNet, and MSR-UTMOS, plus prosody metrics.

That is the gist. Big model, open weights, Apache 2.0, voice cloning, and enough infrastructure behind it to make the old TTS baseline look like scrap metal.
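
For anyone wondering what "reads raw UTF-8 bytes" looks like in practice, here is a minimal Python sketch. It is not ZONOS2's actual input pipeline, which may add special tokens or other preprocessing; the function names `text_to_byte_ids` and `byte_ids_to_text` are made up for illustration. The point is only that every string, in any script, maps to a sequence over the same 256-value vocabulary, so a code-switched sentence needs no per-language phonemizer.

```python
def text_to_byte_ids(text: str) -> list[int]:
    """Encode text as a list of UTF-8 byte values (each in 0..255)."""
    return list(text.encode("utf-8"))


def byte_ids_to_text(ids: list[int]) -> str:
    """Invert the encoding; raises UnicodeDecodeError on invalid byte sequences."""
    return bytes(ids).decode("utf-8")


# A code-switched sentence mixing Latin, CJK, and Hangul characters.
sentence = "Hello, 世界 and 안녕"
ids = text_to_byte_ids(sentence)

# ASCII characters take one byte each; these CJK and Hangul characters take three.
print(len(sentence), "characters ->", len(ids), "bytes")  # prints: 16 characters -> 24 bytes

# The mapping is lossless, and every ID fits in a 256-entry vocabulary.
assert byte_ids_to_text(ids) == sentence
assert all(0 <= i <= 255 for i in ids)
```

The tradeoff is sequence length: non-Latin scripts expand to several bytes per character, so the model sees longer inputs than it would with a phoneme or subword vocabulary.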