r/LocalLLaMA · June 13, 2026 · 2 min read

ZONOS2: real-time TTS with 8B params, 900M active, and high-fidelity voice cloning

https://reddit.com/link/1u4lk5c/video/kyhdw0uog07h1/player

Links:

Blog: https://zyphra.com/our-work/zonos2
Weights: https://huggingface.co/Zyphra/ZONOS2
Inference code: https://github.com/Zyphra/ZONOS2
Eval code: https://github.com/Zyphra/ZTTS1-Eval

Model	TTSDS Prosody Score ↑
ZONOS2 8B	88.7
Qwen 3 TTS 1.7B	87.6
Inworld TTS 2	87.5
Cartesia Sonic 3.5	87.1
Fish S2 Pro	86.6
VoxCPM 2	86.3
Gemini 3.1 Flash	85.7
ZONOS2 8B (Quality Mode)	85.6
ElevenLabs V3	83.2

Zyphra has released ZONOS2, its next-generation real-time text-to-speech model focused on expressive, high-fidelity voice cloning. It is open-source under Apache 2.0 and also available on Zyphra Cloud on AMD hardware.

The model targets the usual TTS tradeoff between quality and speed. Zyphra says ZONOS2 is the first open-source sparse mixture-of-experts (MoE) TTS model, with 8B total parameters and 900M active parameters at inference. The goal is fast, efficient, expressive speech synthesis without trading quality for latency.
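To put the two parameter counts in proportion, here is a small arithmetic sketch. Only the 8B and 900M figures come from the post; the function name is illustrative, and nothing here reflects how ZONOS2 routes tokens internally.

```python
# Figures quoted in the post; everything else is plain arithmetic.
TOTAL_PARAMS = 8_000_000_000   # 8B total parameters
ACTIVE_PARAMS = 900_000_000    # 900M active parameters at inference


def active_fraction(total: int, active: int) -> float:
    """Share of the weights that take part in a single forward step."""
    return active / total


fraction = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
ratio = TOTAL_PARAMS / ACTIVE_PARAMS

print(f"active share: {fraction:.2%}")            # 11.25%
print(f"total-to-active ratio: {ratio:.1f}x")     # 8.9x
```

In a sparse MoE, per-step compute scales with the active count, but all 8B weights still have to be stored somewhere, so memory requirements follow the total rather than the active figure.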

A major focus is voice cloning. Zyphra claims ZONOS2 is especially strong at capturing the distinctive characteristics of a speaker, producing more natural-sounding clones across a wide range of voices. The cloning is zero-shot, so no fine-tuning is needed.

On the audio side, ZONOS2 predicts Descript Audio Codec (DAC) tokens for 44.1 kHz studio-quality audio. That gives better fidelity, but it is harder to model than lower-quality codec setups. Zyphra says it closes that gap by scaling up both the model and the training data.
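A rough sense of why high-rate codec tokens are harder to model comes from counting them. The sketch below uses the configuration of the publicly released 44.1 kHz DAC checkpoint as I understand it (hop length 512, 9 residual codebooks of 1024 entries each). Those three numbers are assumptions here: the post does not say which DAC configuration ZONOS2 uses or how many codebooks it predicts.

```python
import math

SAMPLE_RATE_HZ = 44_100   # from the post
HOP_LENGTH = 512          # assumption: public 44.1 kHz DAC checkpoint
NUM_CODEBOOKS = 9         # assumption: public 44.1 kHz DAC checkpoint
CODEBOOK_SIZE = 1024      # assumption: public 44.1 kHz DAC checkpoint

# One frame of codes is emitted per hop of input samples.
frames_per_second = SAMPLE_RATE_HZ / HOP_LENGTH

# Each frame carries one code per residual codebook.
tokens_per_second = frames_per_second * NUM_CODEBOOKS

# Each code indexes a 1024-entry codebook, i.e. 10 bits.
bits_per_token = math.log2(CODEBOOK_SIZE)
bitrate_kbps = tokens_per_second * bits_per_token / 1000

print(f"{frames_per_second:.2f} frames/s")
print(f"{tokens_per_second:.0f} codec tokens/s")
print(f"{bitrate_kbps:.2f} kbps")
```

Under these assumptions the model has to account for several hundred codec tokens per second of audio, which is the modeling cost the post refers to.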

For text handling, ZONOS2 does not use a phonemizer. Instead, it reads raw UTF-8 bytes, which Zyphra says improves coverage for lower-resource languages, boosts performance on Chinese, Korean, and Japanese, and supports native code-switching mid-sentence.
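As a minimal sketch of what "raw UTF-8 bytes" means as an input representation, the snippet below turns a code-switched sentence into byte IDs using only the Python standard library. The sample sentence and variable names are illustrative; how ZONOS2 embeds or batches these bytes is not described in the post.

```python
# A sentence that switches between English, Chinese, and Korean.
text = "Hello 世界, 안녕!"

# Every string maps to a sequence of integers in 0..255, so no
# language-specific phonemizer or vocabulary is needed.
byte_ids = list(text.encode("utf-8"))

print(len(text), "characters")     # 13 characters
print(len(byte_ids), "byte IDs")   # 21 byte IDs
print(byte_ids[:6])                # ASCII letters take one byte each

# The mapping is lossless: the original text is always recoverable.
assert bytes(byte_ids).decode("utf-8") == text
```

The trade-off is sequence length: each Chinese or Korean character in this example takes three bytes, so byte-level input is longer than a phoneme or subword sequence for the same sentence.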

Training data also scaled heavily, from roughly 200K hours to 6M+ hours of audio. Zyphra says it used staged data filtering, applying increasingly strict transcript-agreement thresholds across pretraining, midtraining, and annealing. The intended result is fewer hallucinations, mispronunciations, and repetitions.
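One way to picture staged filtering is a single agreement score per clip with a threshold that tightens at each stage. The sketch below is hypothetical throughout: the `Clip` fields, the agreement metric (1 minus word error rate between the paired transcript and an ASR transcript), and the thresholds are all assumptions, not Zyphra's published pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    clip_id: str
    transcript: str  # transcript paired with the audio
    asr_text: str    # hypothetical second transcript from an ASR model


def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over words (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def agreement(clip: Clip) -> float:
    """1 - WER, clipped to [0, 1]; higher means the transcripts agree more."""
    ref = clip.transcript.lower().split()
    hyp = clip.asr_text.lower().split()
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - word_edit_distance(ref, hyp) / len(ref))


# Hypothetical thresholds: later stages demand closer agreement.
STAGE_THRESHOLDS = {"pretraining": 0.70, "midtraining": 0.85, "annealing": 0.95}


def filter_for_stage(clips: list[Clip], stage: str) -> list[Clip]:
    threshold = STAGE_THRESHOLDS[stage]
    return [clip for clip in clips if agreement(clip) >= threshold]


clips = [
    Clip("a", "the quick brown fox jumps", "the quick brown fox jumps"),
    Clip("b", "one two three four five six seven eight nine ten",
              "one two three four five six seven eight nine then"),
    Clip("c", "hello there general kenobi", "hello their general kenobi"),
]

for stage in STAGE_THRESHOLDS:
    kept = [clip.clip_id for clip in filter_for_stage(clips, stage)]
    print(stage, kept)
```

With these made-up clips, all three survive pretraining, clip `c` drops out at midtraining, and only the exact match `a` reaches annealing, so the cleanest data is what the model sees last.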

Zyphra is also releasing ZTTS1-Eval, a new benchmark for TTS evaluation. It includes clean and in-the-wild datasets across up to 17 languages, with newer evaluation models such as Qwen3-ASR, ReDimNet, and MSR-UTMOS, plus prosody metrics.

That is the gist. Big model, open weights, Apache 2.0, voice cloning, and enough infrastructure behind it to make the old TTS baseline look like scrap metal.

submitted by /u/KokaOP