r/LocalLLaMA · · 3 min read

audio.cpp: 12 audio models (Qwen3-TTS, PocketTTS, VeVo2 etc) in 1 C++/ggml runtime — TTS up to 5x faster than Python on CUDA

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

I’ve been working on audio.cpp, a native C++ inference framework for audio models built on top of ggml.

The framework currently has 25 model families, but I want to be precise about its state: 12 are released in the repo now and ready for normal use. I’m not counting anything still in integration or optimization as released.

The released set already covers quite a bit:

TTS / voice cloning / voice design: Chatterbox, MioTTS, OmniVoice, PocketTTS, Qwen3-TTS and VoxCPM2

ASR / alignment / VAD: Qwen3-ASR, Qwen3 Forced Aligner and Silero VAD

Voice conversion / codec / editing: Seed-VC, MioCodec and Vevo2

Vevo2 also handles TTS, singing generation, singing conversion and editing, so this has grown beyond a collection of TTS ports.

The point isn’t to build a model zoo.

It’s to stop treating every audio model as its own island with a separate Python environment, dependency tree, CLI, batching logic and deployment setup. I want these models to share the same runtime, session handling, CLI, server, audio utilities and eventually the same higher-level workflows.

The performance is where the project started to feel genuinely useful rather than just easier to deploy.

These results were measured on Ubuntu/CUDA using the original weights without quantization. The figures compare audio.cpp wall time against the matching Python reference path:

PocketTTS: 3.68× faster on a 1-shot run, 3.22× in a warm session and 3.15× on long-form

Qwen3-TTS: 1.83× on a 1-shot run, 2.74× in a warm session and 3.06× on long-form

Vevo2: 5.03× on a 1-shot run, 1.75× in a warm session and 1.77× on long-form

MioTTS: 2.73× on a 1-shot run and 2.28× in a warm session

Chatterbox: 1.58× on long-form

The long-form throughput makes those numbers easier to picture. Using the same 1,028-word input:

PocketTTS: generated 5m 53.12s of audio in 7.30s (48.40× real time)

OmniVoice: generated 5m 57.00s in 17.77s (20.09× real time)

Vevo2: generated 7m 37.68s in 52.47s (8.72× real time)

Every released TTS family included in that benchmark ran faster than real time, ranging from 4.34× to 48.40×.
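
For anyone who wants to check or reproduce the real-time figures, the arithmetic is just seconds of audio generated divided by seconds of wall time. Here is a minimal Python sketch using the numbers quoted above. The helper names `parse_duration` and `real_time_factor` are made up for illustration and are not part of audio.cpp.

```python
def parse_duration(text: str) -> float:
    """Convert a duration such as '5m 53.12s' or '7.30s' to seconds."""
    total = 0.0
    for part in text.split():
        if part.endswith("m"):
            total += 60.0 * float(part[:-1])
        elif part.endswith("s"):
            total += float(part[:-1])
        else:
            raise ValueError(f"unrecognized duration part: {part!r}")
    return total


def real_time_factor(audio: str, wall: str) -> float:
    """Seconds of audio produced per second of wall-clock time."""
    return parse_duration(audio) / parse_duration(wall)


# Figures quoted in the post: (audio generated, wall time).
runs = {
    "PocketTTS": ("5m 53.12s", "7.30s"),
    "OmniVoice": ("5m 57.00s", "17.77s"),
    "Vevo2": ("7m 37.68s", "52.47s"),
}

for name, (audio, wall) in runs.items():
    print(f"{name}: {real_time_factor(audio, wall):.2f}x real time")
```

Recomputing from the rounded values can differ slightly in the last digits from the quoted factors (PocketTTS comes out near 48.37 rather than 48.40), which is what you would expect if the displayed wall time was rounded to two decimals.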

I don’t want to oversell it: not every path beats Python yet, and the README keeps the weaker results visible. But the warm-session numbers are the ones I care about most. They are closer to a real service setting, where the model is loaded once and reused across many requests.
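
If you want to send comparable numbers from your own hardware, here is one rough way to separate the two cases in Python. It is a sketch under assumptions, not the project's benchmark harness: it assumes a 1-shot run means a cold request that includes model loading (the post does not spell this out), and `fake_load` and `fake_run` are stand-ins you would replace with whatever actually loads the model and generates audio.

```python
import time
from typing import Callable, Sequence


def time_one_shot(load: Callable[[], object],
                  run: Callable[[object, str], object],
                  text: str) -> float:
    """Wall time for a cold request: model load plus one generation."""
    start = time.perf_counter()
    model = load()
    run(model, text)
    return time.perf_counter() - start


def time_warm(load: Callable[[], object],
              run: Callable[[object, str], object],
              texts: Sequence[str]) -> float:
    """Mean wall time per request with the model loaded once and reused."""
    if not texts:
        raise ValueError("need at least one request")
    model = load()          # load cost is paid outside the timed region
    run(model, texts[0])    # untimed warm-up request
    start = time.perf_counter()
    for text in texts:
        run(model, text)
    return (time.perf_counter() - start) / len(texts)


def speedup(reference_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the reference."""
    return reference_seconds / candidate_seconds


# Stand-ins (hypothetical): sleep instead of doing real work.
def fake_load() -> object:
    time.sleep(0.05)
    return object()


def fake_run(model: object, text: str) -> None:
    time.sleep(0.005)


cold = time_one_shot(fake_load, fake_run, "hello")
warm = time_warm(fake_load, fake_run, ["a", "b", "c"])
print(f"cold: {cold:.3f}s, warm per request: {warm:.3f}s")
```

Running the same harness against both the Python reference and audio.cpp, then passing the two wall times to `speedup`, gives a ratio in the same form as the figures above.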

The shared runtime is the bigger bet.

The current same-language redubbing pipeline takes a 418s recording, splits it into manageable chunks, transcribes it with Qwen3-ASR, merges the transcript and regenerates the speech in a target reference voice with Qwen3-TTS—all behind 1 CLI command.
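
To make the shape of that pipeline concrete, here is a small Python sketch of the four steps: split, transcribe, merge, regenerate. It only illustrates the control flow and is not how audio.cpp implements it. The fixed 30-second chunk length is an assumption (the post does not say how chunks are chosen), and `transcribe` and `synthesize` are hypothetical callables standing in for the ASR and TTS stages.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Chunk:
    start: float  # seconds from the beginning of the recording
    end: float


def split_into_chunks(duration_s: float, max_chunk_s: float = 30.0) -> List[Chunk]:
    """Cut a recording into consecutive chunks of at most max_chunk_s seconds."""
    if duration_s <= 0 or max_chunk_s <= 0:
        raise ValueError("durations must be positive")
    chunks = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_chunk_s, duration_s)
        chunks.append(Chunk(start, end))
        start = end
    return chunks


def merge_transcripts(pieces: Iterable[str]) -> str:
    """Join per-chunk transcripts into one text, skipping empty pieces."""
    return " ".join(p.strip() for p in pieces if p.strip())


def redub(duration_s: float,
          transcribe: Callable[[Chunk], str],
          synthesize: Callable[[str], object],
          max_chunk_s: float = 30.0) -> object:
    """Split, transcribe each chunk, merge the text, then regenerate speech."""
    chunks = split_into_chunks(duration_s, max_chunk_s)
    transcript = merge_transcripts(transcribe(c) for c in chunks)
    return synthesize(transcript)


# With the assumed 30 s chunk length, the 418 s recording yields 14 chunks.
chunks = split_into_chunks(418.0)
print(len(chunks), chunks[-1])
```

Fixed-length cuts are the simplest option and can land in the middle of a word, so treat the splitter as a placeholder for whatever boundary logic you prefer.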

The inference and server paths are native C++. There is a Python utility for downloading and converting model packages, but Python isn’t part of the actual inference path.

It’s still early. Backend coverage depends on the model, and framework-wide streaming isn’t generally supported yet, so the current paths should still be treated as offline. The framework can target CPU, CUDA, Vulkan and Metal where the model supports them.

Repo:

https://github.com/0xShug0/audio.cpp

I’d really value benchmarks from other hardware, failing cases, API feedback and PRs.

submitted by /u/Acceptable-Cycle4645