I fine-tuned Parakeet 0.6B for medical ASR — open weights, local Mac/CUDA/CPU

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

I fine-tuned NVIDIA's Parakeet TDT 0.6B v2 for clinical speech and am releasing the weights as Omi Med STT v1 (CC-BY-4.0).

Disclosure: I'm the founder of Omi Health and built this. Happy to dig into the training mix, benchmark, failure cases, quantization, or anything else.

The goal was simple: get a small local ASR model close enough to the strong cloud systems that patient audio doesn't have to leave the device for transcription.

There's also a runtime for Mac, Windows and Linux. Install + run:

pip install omi-med-stt
omi-med-stt consultation.wav

It auto-picks a backend per machine (MLX on Apple Silicon, NeMo on CUDA, GGUF/parakeet.cpp on CPU). q8 is the default; I also built a q4, benchmarked it, and didn't ship it — drug-name accuracy regressed too much.
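To make the selection order concrete, here is a minimal sketch of how a per-machine backend pick could work. This is not the omi-med-stt source: `pick_backend` and `_cuda_available` are hypothetical names, and using PyTorch to detect CUDA is an assumption.

```python
import importlib.util
import platform


def _cuda_available():
    """Best-effort CUDA check. Assumes PyTorch is the detection route."""
    if importlib.util.find_spec("torch") is None:
        return False  # torch not installed, so treat the machine as CPU-only
    import torch
    return torch.cuda.is_available()


def pick_backend(system=None, machine=None, has_cuda=None):
    """Return "mlx", "nemo", or "gguf" in the order the post describes.

    All three arguments are optional overrides (handy for testing);
    when omitted they are detected from the current machine.
    """
    system = system or platform.system()
    machine = machine or platform.machine()

    # Apple Silicon reports Darwin + arm64
    if system == "Darwin" and machine == "arm64":
        return "mlx"

    if has_cuda is None:
        has_cuda = _cuda_available()
    if has_cuda:
        return "nemo"

    # Everything else falls back to the CPU path (GGUF / parakeet.cpp)
    return "gguf"
```

Calling `pick_backend()` with no arguments probes the current machine; passing explicit values, e.g. `pick_backend("Linux", "x86_64", True)`, lets you check the mapping without the hardware.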

Benchmark: 1,513 clips / 7.18 h of held-out medical audio, same audio + scorer for every model, ranked by medical-WER (M-WER = errors on clinical terms only) since that's what matters for a scribe. Speed is RTFx (× realtime).
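For readers unfamiliar with the metrics, a rough sketch of how WER, a term-restricted WER, and RTFx can be computed. The post does not publish the scorer, so the M-WER rule here (score only reference words found in a clinical term list) is an assumption, and the function names, sample sentence, and term list are made up for illustration.

```python
def align(ref, hyp):
    """Word-level Levenshtein alignment.

    Returns (ref_word, hyp_word) pairs; None on the hyp side marks a
    deletion, None on the ref side marks an insertion.
    """
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back from the corner to recover which words were paired up
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((ref[i - 1], None))
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))
            j -= 1
    return pairs[::-1]


def wer(pairs):
    """Substitutions + deletions + insertions over reference length."""
    errors = sum(r != h for r, h in pairs)
    ref_len = sum(r is not None for r, _ in pairs)
    return errors / ref_len


def medical_wer(pairs, terms):
    """ASSUMED definition: error rate over reference words in `terms` only."""
    scored = [(r, h) for r, h in pairs if r in terms]
    return sum(r != h for r, h in scored) / len(scored)


def rtfx(audio_seconds, processing_seconds):
    """Realtime factor: seconds of audio transcribed per second of compute."""
    return audio_seconds / processing_seconds


# Illustrative sentence and term list (not from the benchmark)
ref = "start metoprolol 25 mg twice daily".split()
hyp = "start metropolol 25 mg daily".split()
pairs = align(ref, hyp)

print(f"WER   = {wer(pairs):.3f}")                               # 2 errors / 6 words
print(f"M-WER = {medical_wer(pairs, {'metoprolol', 'mg'}):.3f}")  # 1 error / 2 terms
```

On RTFx: if the figure is aggregated over the whole set, 7.18 h of audio at 145× works out to about 7.18 × 3600 / 145 ≈ 178 seconds of compute.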

vs other open / local models:

| Model | M-WER | WER | Drug M-WER | RTFx |
|---|---|---|---|---|
| VibeVoice-ASR 9B | 1.78% | 11.10% | 1.36% | 11× |
| Omi Med STT v1 (0.6B) | 2.37% | 8.30% | 4.75% | 145× |
| Qwen3 ASR 1.7B | 3.13% | 10.72% | 6.11% | 81× |
| Qwen3 ASR 0.6B | 3.38% | 11.11% | 7.92% | 110× |
| Whisper Large v3 Turbo | 3.93% | 11.98% | 5.88% | 46× |
| Voxtral Mini Transcribe V1 | 4.53% | 13.53% | 6.33% | 78× |
| Cohere Transcribe 03-2026 | 5.05% | 14.88% | 11.09% | 143× |
| Parakeet TDT 0.6B v3 | 8.01% | 15.26% | 9.50% | 160× |
| NVIDIA Canary 1B Flash | 8.04% | 17.26% | 13.12% | 61× |
| Parakeet TDT 0.6B v2 (the base) | 8.36% | 16.45% | 8.60% | 154× |
| Google MedASR | 13.86% | 35.94% | 14.48% | 86× |

Only VibeVoice edges it on M-WER — but it's a 9B model (~15× the size), slower in my runs, and worse on overall WER (11.10% vs 8.30%). In my eval setup VibeVoice ran on an H100; Omi ran on an A10 (145× RTFx there, ~68× on an Apple-Silicon Mac). And vs the Parakeet base I started from: M-WER cut ~3.5× (8.36 → 2.37), WER roughly halved, and spurious drug mentions dropped from 131 to 9 — adapting a small base goes a long way.
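The ratios in that paragraph can be checked directly from the numbers in the post. A small sketch (`fold` is just a helper name for this example):

```python
def fold(before, after):
    """How many times smaller `after` is than `before`."""
    return before / after


mwer_fold = fold(8.36, 2.37)    # base M-WER vs fine-tuned M-WER
wer_fold = fold(16.45, 8.30)    # base WER vs fine-tuned WER
drug_fold = fold(131, 9)        # spurious drug mentions, base vs fine-tuned
size_ratio = fold(9.0, 0.6)     # VibeVoice 9B vs Omi 0.6B parameter count

print(f"M-WER reduction:        {mwer_fold:.2f}x")   # 3.53x
print(f"WER reduction:          {wer_fold:.2f}x")    # 1.98x
print(f"Spurious drug mentions: {drug_fold:.1f}x")   # 14.6x
print(f"Model size ratio:       {size_ratio:.0f}x")  # 15x
```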

vs general-purpose cloud APIs:

| Model | M-WER | WER | Drug M-WER | RTFx |
|---|---|---|---|---|
| ElevenLabs Scribe v2 | 1.39% | 6.53% | 0.23% | 7.8× |
| Gemini 3.1 Pro Preview † | 1.65% | 7.13% | 0.23% | 1.4× |
| Soniox STT Async v4 | 1.95% | 6.99% | 3.39% | 1.8× |
| Omi Med STT v1 | 2.37% | 8.30% | 4.75% | 145× ‡ |
| Gemini 3.5 Flash † | 2.39% | 7.99% | 0.45% | 3.1× |
| Reson8 Prerecorded | 2.58% | 6.69% | 6.56% | 7.4× |
| Voxtral Mini Transcribe v2 | 2.79% | 8.12% | 5.66% | 15× |
| OpenAI GPT-4o Mini Transcribe | 3.55% | 10.26% | 3.39% | 12× |

‡ Omi's RTFx is local on-device compute (A10); the cloud figures are per-request round-trips with network and queue time included. So it's not a like-for-like compute race: Omi just has a structural latency edge from running locally.

† Gemini shown with its hallucinations excluded. Both Gemini models have a failure mode no other system did: on a stress lane of 420 benign, non-diagnostic clips, they ignore the audio and fabricate entire fake consultations, with invented symptoms, histories, and management plans (3.1 Pro on 33/420, 3.5 Flash on 87/420; every other dedicated ASR model: 0). Count that lane and their real WER is ~14% and ~24% respectively. Fine transcribers otherwise, but "fluently invents clinical detail that was never said" is quite a nasty failure if you ask me.

vs medically-specific cloud vendors:

| Model | M-WER | WER | Drug M-WER | RTFx |
|---|---|---|---|---|
| AssemblyAI Universal-3 Pro Medical | 1.81% | 6.94% | 1.36% | 2.1× |
| Omi Med STT v1 | 2.37% | 8.30% | 4.75% | 145× ‡ |
| Deepgram Nova-3 Medical | 2.44% | 7.33% | 2.26% | 7.7× |
| Corti Transcripts | 5.12% | 9.60% | 11.31% | 0.9× |

‡ Again, Omi's RTFx is on-device local compute; the cloud APIs are network round-trips (see note above).

Challenger here — ahead of Deepgram and Corti on M-WER, behind AssemblyAI (and the strongest general scribes). Drug names are the weakest axis (4.75% drug M-WER) and the #1 thing I'm fixing for v2.

Overall: best locally-running open model on this set, and competitive with the cloud — while keeping audio on the device.

More on training and evaluation: ~127 h of training audio, roughly 71% real / 29% synthetic. It's a mix of licensed data, openly available data, and my own synthetic set tailored for hard-to-source medical speech. The benchmark is a locked split that was never touched during training (0 train/test overlap), made of unpublished audio that's diverse across medical settings (GP dialogue, dictation, medication review, radiology, procedures, long-form).

Curious whether real-world use matches the benchmark — would genuinely value the feedback. Next up: a streaming version and a multilingual one. Which languages would you actually want? Drop them in the comments.

submitted by /u/MajesticAd2862