r/MachineLearning · · 1 min read

Benchmark: ONNX Runtime vs HF Transformers vs GGUF for Parakeet TDT 0.6B on CPU-only hardware [D]

Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.

Sharing a small CPU inference benchmark for nvidia/parakeet-tdt-0.6b-v3 that turned up a result I didn't expect going in.

Setup: 2 x86-64 vCPUs (AVX2/FMA), 7.7GB RAM, no GPU. Test audio: 16.78s Harvard sentences at 16kHz mono.

Results:

Inference path RTF Peak Memory CPU utilization
HF Transformers bfloat16 0.519 ~430MB delta
ONNX Runtime FP32 (onnx-asr) 0.328 2,667MB 49.9%
GGUF Q6_K (parakeet.cpp) 0.708 928MB 99.8%

ONNX Runtime is 37% faster than HF Transformers bfloat16 on this hardware. The gap comes from operator fusion and AVX2-optimized execution providers in ONNX Runtime that the PyTorch CPU path doesn't exploit as aggressively. Memory cost is the tradeoff — FP32 weights load at ~2.7GB peak.

GGUF Q6_K trades throughput for memory efficiency. 928MB peak vs 2.7GB, but RTF doubles and CPU utilization hits 99.8%. For memory-constrained deployments it's the right call. For sustained throughput on a box with headroom, ONNX wins.

One methodological note worth flagging for anyone doing ASR benchmarking with synthetic audio: espeak-ng inflated WER to 20.9% on a sentence set where gTTS got 4.65%. Both runtimes got identical WER within each run, confirming it's the TTS distribution mismatch rather than model or quantization quality. NVIDIA reports 1.93% on LibriSpeech — the gTTS number is a much more honest CPU-only proxy.

Github repo with code, raw results, and evaluation scripts in comments below.

Disclosure: benchmark was run using Neo, a local AI engineering agent inside Claude Code using its MCP. Mentioning because the runtime and audio choices came from its research phase, not prior knowledge on my end.

submitted by /u/gvij
[link] [comments]

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from r/MachineLearning