r/LocalLLaMA · June 9, 2026 · 4 min read

I fine-tuned Parakeet 0.6B for medical ASR — open weights, local Mac/CUDA/CPU

#voice #benchmark #open-source #gpu #developer-tool

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

I fine-tuned NVIDIA's Parakeet TDT 0.6B v2 for clinical speech and am releasing the weights as Omi Med STT v1 (CC-BY-4.0).

Disclosure: I'm the founder of Omi Health and built this. Happy to dig into the training mix, benchmark, failure cases, quantization, or anything else.

The goal was simple: get a small local ASR model close enough to the strong cloud systems that patient audio doesn't have to leave the device for transcription.

There's also a runtime for Mac, Windows and Linux. Install + run:

pip install omi-med-stt
omi-med-stt consultation.wav

It auto-picks a backend per machine (MLX on Apple Silicon, NeMo on CUDA, GGUF/parakeet.cpp on CPU). q8 is the default; I also built a q4, benchmarked it, and didn't ship it — drug-name accuracy regressed too much.
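A minimal sketch of what per-machine backend selection can look like, for readers curious about the idea. This is not the actual omi-med-stt logic: the function name `pick_backend`, the returned labels, and the `nvidia-smi` check are all assumptions made for illustration.

```python
import platform
import shutil


def pick_backend(system=None, machine=None, has_cuda=None):
    """Return a backend label for this machine (hypothetical helper)."""
    system = system or platform.system()
    machine = machine or platform.machine()
    if has_cuda is None:
        # Heuristic assumption: an NVIDIA driver tool on PATH implies usable CUDA.
        has_cuda = shutil.which("nvidia-smi") is not None

    if system == "Darwin" and machine == "arm64":
        return "mlx"   # Apple Silicon
    if has_cuda:
        return "nemo"  # NVIDIA GPU
    return "gguf"      # CPU fallback (GGUF / parakeet.cpp)


if __name__ == "__main__":
    print(pick_backend())
```

Passing the three arguments explicitly makes the decision easy to test without the matching hardware, e.g. `pick_backend("Linux", "x86_64", True)`.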

Benchmark: 1,513 clips / 7.18 h of held-out medical audio, with the same audio and scorer for every model. Models are ranked by medical WER (M-WER: errors on clinical terms only), since that's what matters for a scribe. Drug is the same metric restricted to drug names, and speed is RTFx (× realtime; higher is faster).
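To make the two accuracy metrics concrete, here is a rough sketch of WER and a term-restricted variant. The post does not give the exact scorer, so everything here is an assumption: the lowercase whitespace tokenization, the `lexicon` set of clinical terms, and the choice to count only substitutions and deletions of reference terms toward M-WER.

```python
def align(ref, hyp):
    """Levenshtein alignment; returns a list of (op, ref_word, hyp_word)."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match / substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion
    # Walk back from the corner to recover one optimal alignment.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            op = "ok" if ref[i - 1] == hyp[j - 1] else "sub"
            ops.append((op, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    return ops[::-1]


def wer(ref_text, hyp_text):
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = ref_text.lower().split(), hyp_text.lower().split()
    if not ref:
        return 0.0
    return sum(op != "ok" for op, _, _ in align(ref, hyp)) / len(ref)


def medical_wer(ref_text, hyp_text, lexicon):
    """Assumed M-WER: error rate over reference words that appear in `lexicon`."""
    ref, hyp = ref_text.lower().split(), hyp_text.lower().split()
    term_ops = [op for op, r, _ in align(ref, hyp) if r in lexicon]
    if not term_ops:
        return 0.0
    return sum(op != "ok" for op in term_ops) / len(term_ops)


if __name__ == "__main__":
    ref = "start metoprolol 25 mg twice daily"
    terms = {"metoprolol", "mg"}
    print(wer(ref, "start metoprolol 25 mg twice a day"),
          medical_wer(ref, "start metoprolol 25 mg twice a day", terms))
    print(wer(ref, "start metropolol 25 mg twice daily"),
          medical_wer(ref, "start metropolol 25 mg twice daily", terms))
```

The two example hypotheses show why the metrics can diverge: rephrasing "daily" as "a day" costs WER but leaves the clinical terms intact, while a single misspelled drug name barely moves WER yet counts as an error on half of the clinical terms.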

vs other open / local models:

Model	M-WER	WER	Drug	RTFx
VibeVoice-ASR 9B	1.78%	11.10%	1.36%	11×
Omi Med STT v1 (0.6B)	2.37%	8.30%	4.75%	145×
Qwen3 ASR 1.7B	3.13%	10.72%	6.11%	81×
Qwen3 ASR 0.6B	3.38%	11.11%	7.92%	110×
Whisper Large v3 Turbo	3.93%	11.98%	5.88%	46×
Voxtral Mini Transcribe V1	4.53%	13.53%	6.33%	78×
Cohere Transcribe 03-2026	5.05%	14.88%	11.09%	143×
Parakeet TDT 0.6B v3	8.01%	15.26%	9.50%	160×
NVIDIA Canary 1B Flash	8.04%	17.26%	13.12%	61×
Parakeet TDT 0.6B v2 (the base)	8.36%	16.45%	8.60%	154×
Google MedASR	13.86%	35.94%	14.48%	86×

Only VibeVoice edges it on M-WER — but it's a 9B model (~15× the size), slower in my runs, and worse on overall WER (11.10% vs 8.30%). In my eval setup VibeVoice ran on an H100; Omi ran on an A10 (145× RTFx there, ~68× on an Apple-Silicon Mac). And vs the Parakeet base I started from: M-WER cut ~3.5× (8.36 → 2.37), WER roughly halved, and spurious drug mentions dropped from 131 to 9 — adapting a small base goes a long way.
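For a feel of what the RTFx figures mean in wall-clock terms, a small sketch follows. It assumes RTFx is simply audio duration divided by processing time and that the quoted speeds hold across the whole 7.18 h set; both helper names are illustrative.

```python
def rtfx(audio_seconds, processing_seconds):
    """Realtime factor: seconds of audio transcribed per second of compute."""
    return audio_seconds / processing_seconds


def wall_clock_seconds(audio_hours, speed_rtfx):
    """Time needed to transcribe `audio_hours` of audio at a given RTFx."""
    return audio_hours * 3600 / speed_rtfx


if __name__ == "__main__":
    benchmark_hours = 7.18  # size of the held-out set quoted in the post
    for label, speed in [("Omi on A10", 145), ("Omi on Apple Silicon", 68),
                         ("VibeVoice on H100", 11)]:
        minutes = wall_clock_seconds(benchmark_hours, speed) / 60
        print(f"{label}: {minutes:.1f} min")

    # Improvement factors over the Parakeet v2 base, from the table above.
    print(f"M-WER reduction: {8.36 / 2.37:.2f}x")
    print(f"WER reduction:   {16.45 / 8.30:.2f}x")
```

Under those assumptions the full benchmark takes roughly 3 minutes at 145× and a little over 6 minutes at 68×.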

vs general-purpose cloud APIs:

Model	M-WER	WER	Drug	RTFx
ElevenLabs Scribe v2	1.39%	6.53%	0.23%	7.8×
Gemini 3.1 Pro Preview †	1.65%	7.13%	0.23%	1.4×
Soniox STT Async v4	1.95%	6.99%	3.39%	1.8×
Omi Med STT v1	2.37%	8.30%	4.75%	145× ‡
Gemini 3.5 Flash †	2.39%	7.99%	0.45%	3.1×
Reson8 Prerecorded	2.58%	6.69%	6.56%	7.4×
Voxtral Mini Transcribe v2	2.79%	8.12%	5.66%	15×
OpenAI GPT-4o Mini Transcribe	3.55%	10.26%	3.39%	12×

‡ Omi's RTFx is local on-device compute (A10); the cloud figures are per-request round-trips with network and queue time included, so this is not a like-for-like compute race. Omi simply has a structural latency edge from running locally.

† Gemini is shown with its hallucinations excluded. Both Gemini models have a failure mode no other system showed: on a stress lane of 420 benign, non-diagnostic clips, they sometimes ignore the audio and fabricate entire fake consultations, with invented symptoms, histories, and management plans (3.1 Pro on 33/420 clips, 3.5 Flash on 87/420; every other dedicated ASR model: 0). Count that lane and their real WER is ~14% / ~24%. They are fine transcribers otherwise, but "fluently invents clinical detail that was never said" is quite a nasty failure if you ask me.

vs medically-specific cloud vendors:

Model	M-WER	WER	Drug	RTFx
AssemblyAI Universal-3 Pro Medical	1.81%	6.94%	1.36%	2.1×
Omi Med STT v1	2.37%	8.30%	4.75%	145× ‡
Deepgram Nova-3 Medical	2.44%	7.33%	2.26%	7.7×
Corti Transcripts	5.12%	9.60%	11.31%	0.9×

‡ Again, Omi's RTFx is on-device local compute; the cloud APIs are network round-trips (see note above).

Omi is the challenger here: ahead of Deepgram and Corti on M-WER, behind AssemblyAI (and the strongest general scribes). Drug names are the weakest axis (4.75% drug M-WER) and the #1 thing I'm fixing for v2.

Overall: the lowest WER of any open model on this set and second only to the 9B VibeVoice on M-WER, and competitive with the cloud, all while keeping audio on the device.

More on training and evaluation: ~127 h of training audio, roughly 71% real / 29% synthetic. It's a mix of licensed data, openly available data, and my own synthetic set tailored for hard-to-source medical speech. The benchmark is a locked split that was never touched during training (0 train/test overlap), made of unpublished audio that's diverse across medical settings (GP dialogue, dictation, medication review, radiology, procedures, long-form).
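One simple way to sanity-check a "0 train/test overlap" claim on your own data is to hash every audio file and look for collisions. This is only a sketch, not how the split above was verified: the directory layout and the `*.wav` pattern are assumptions, and it catches byte-identical files only, not re-encoded copies or shared speakers.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """SHA-256 of a file's bytes, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def find_overlap(train_dir, test_dir, pattern="*.wav"):
    """Return test files whose bytes also appear somewhere in the training set."""
    train_hashes = {file_digest(p) for p in Path(train_dir).rglob(pattern)}
    return sorted(str(p) for p in Path(test_dir).rglob(pattern)
                  if file_digest(p) in train_hashes)
```

An empty list from `find_overlap("train/", "test/")` (hypothetical paths) is a necessary check rather than a sufficient one; near-duplicate detection would need audio fingerprinting or transcript comparison.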

Curious whether real-world use matches the benchmark — would genuinely value the feedback. Next up: a streaming version and a multilingual one. Which languages would you actually want? Drop them in the comments.

submitted by /u/MajesticAd2862