I ported NVIDIA Parakeet (speech-to-text) to ggml: same output as NeMo, faster, GGUF-quantized, no Python
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I ported NVIDIA's Parakeet speech-to-text models to pure C++/ggml (the engine behind llama.cpp and whisper.cpp). It runs the FastConformer TDT / CTC / RNNT / hybrid models with no Python and no PyTorch, on CPU and GPU (CUDA, HIP, Vulkan, Metal). The goal was to match NeMo exactly, then make it deployable anywhere. Where it landed:
It also does cache-aware streaming with real-time end-of-utterance detection, produces word-level timestamps with confidence, and exposes a small, flat C API so you can embed it pretty much everywhere. The GGUF is self-contained: the tokenizer/vocab is baked into the model file, so no external files are needed.

It ships as a backend in LocalAI too, so you get an OpenAI-compatible /v1/audio/transcriptions endpoint running fully locally. (Disclosure: I work on LocalAI.)

https://reddit.com/link/1tt6oja/video/nxngb7x1aj4h1/player

Links:
All credit to NVIDIA for the Parakeet models and to ggml for the runtime. Benchmarks, methodology, and per-model plots are in the repo. Happy to answer questions about the port, the decoders, or the numbers.
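
For anyone who wants to try the LocalAI route mentioned above, here is a minimal sketch of a client for an OpenAI-compatible /v1/audio/transcriptions endpoint, using only the Python standard library. The base URL `http://localhost:8080` and the model name `parakeet` are assumptions for illustration, not values taken from the post; substitute whatever your own server exposes. The request shape (multipart form with `file` and `model` fields, JSON response with a `text` key) follows the OpenAI transcription API convention.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(fields, file_field, filename, file_bytes, boundary=None):
    """Encode text fields plus one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(f"--{boundary}\r\n".encode())
        parts.append(
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode()
        )
        parts.append(f"{value}\r\n".encode())
    parts.append(f"--{boundary}\r\n".encode())
    parts.append(
        (
            f'Content-Disposition: form-data; name="{file_field}"; '
            f'filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode()
    )
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path, base_url="http://localhost:8080", model="parakeet"):
    """POST an audio file and return the transcript text.

    base_url and model are hypothetical defaults; adjust to your setup.
    """
    with open(path, "rb") as f:
        audio = f.read()
    body, content_type = build_multipart(
        {"model": model}, "file", os.path.basename(path), audio
    )
    req = urllib.request.Request(
        base_url + "/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text"]
```

With a server running, `transcribe("sample.wav")` would return the transcript as a string. Any existing OpenAI client library pointed at the local base URL should work the same way, since the endpoint is described as OpenAI-compatible.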