I fine-tuned Parakeet 0.6B for medical ASR — open weights, local Mac/CUDA/CPU
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I fine-tuned NVIDIA's Parakeet TDT 0.6B v2 for clinical speech and am releasing the weights as Omi Med STT v1 (CC-BY-4.0). Disclosure: I'm the founder of Omi Health and built this. Happy to dig into the training mix, benchmark, failure cases, quantization, or anything else.

The goal was simple: get a small local ASR model close enough to the strong cloud systems that patient audio doesn't have to leave the device for transcription. There's also a runtime for Mac, Windows and Linux.

Install + run:

It auto-picks a backend per machine (MLX on Apple Silicon, NeMo on CUDA, GGUF/parakeet.cpp on CPU). q8 is the default; I also built a q4, benchmarked it, and didn't ship it because drug-name accuracy regressed too much.

Benchmark: 1,513 clips / 7.18 h of held-out medical audio, same audio + scorer for every model, ranked by medical WER (M-WER = errors on clinical terms only), since that's what matters for a scribe. Speed is RTFx (× realtime).

vs other open / local models:
Only VibeVoice edges it on M-WER, but it's a 9B model (~15× the size), slower in my runs, and worse on overall WER (11.10% vs 8.30%). In my eval setup VibeVoice ran on an H100; Omi ran on an A10 (145× RTFx there, ~68× on an Apple Silicon Mac).

And vs the Parakeet base I started from: M-WER cut ~3.5× (8.36% → 2.37%), WER roughly halved, and spurious drug mentions dropped from 131 to 9. Adapting a small base goes a long way.

vs general-purpose cloud APIs:
‡ Omi's RTFx is local on-device compute (A10); the cloud figures are per-request round-trips with network + queue included, so it's not a like-for-like compute race. Omi just has a structural latency edge from running locally.

† Gemini shown with its hallucinations excluded. Both Gemini models have a failure mode no other system did: on a stress lane of 420 benign, non-diagnostic clips, they ignore the audio and fabricate entire fake consultations, with invented symptoms, histories, and management plans (3.1 Pro on 33/420, 3.5 Flash on 87/420; every dedicated ASR model: 0). Count that lane and their real WER is ~14% / 24%. Fine transcribers otherwise, but "fluently invents clinical detail that was never said" is quite a nasty failure if you ask me.

vs medically-specific cloud vendors:
‡ Again, Omi's RTFx is on-device local compute; the cloud APIs are network round-trips (see note above).

Omi is the challenger here: ahead of Deepgram and Corti on M-WER, behind AssemblyAI (and the strongest general scribes). Drug names are the weakest axis (4.75% drug M-WER) and the #1 thing I'm fixing for v2.

Overall: best locally-running open model on this set, and competitive with the cloud, while keeping audio on the device.

More on training and evaluation: ~127 h of training audio, roughly 71% real / 29% synthetic, a mix of licensed, openly-available, and my own synthetic set tailored for hard-to-source medical speech. The benchmark is a locked split that was never touched during training (0 train/test overlap), made of unpublished audio that's diverse across medical settings (GP dialogue, dictation, medication review, radiology, procedures, long-form).

Curious whether real-world use matches the benchmark; I would genuinely value the feedback.

Next up: a streaming version and a multilingual one. Which languages would you actually want? Drop them in the comments.
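For anyone who wants to sanity-check numbers like these on their own audio, here is a minimal sketch of how WER, a medical-term WER, and RTFx could be computed. It is not the scorer used for the benchmark above. The function names, the lowercase/whitespace tokenization, the hand-written term set, and the choice to count only substituted or deleted reference terms as M-WER errors are all assumptions made for illustration.

```python
def align(ref, hyp):
    """Levenshtein-align two token lists.

    Returns a list of (op, ref_token, hyp_token) with op in
    {"ok", "sub", "del", "ins"}.
    """
    n, m = len(ref), len(hyp)
    # dp[i][j] = word-level edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match / substitution
                           dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1)         # insertion
    # Walk back from the bottom-right corner to recover one optimal alignment.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            op = "ok" if ref[i - 1] == hyp[j - 1] else "sub"
            ops.append((op, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def wer(ref_text, hyp_text):
    """Standard WER: (substitutions + deletions + insertions) / reference words."""
    ops = align(ref_text.lower().split(), hyp_text.lower().split())
    errors = sum(op != "ok" for op, _, _ in ops)
    n_ref = sum(r is not None for _, r, _ in ops)
    return errors / n_ref


def medical_wer(ref_text, hyp_text, medical_terms):
    """Assumed M-WER: share of reference medical terms that were not
    transcribed exactly (substituted or deleted)."""
    ops = align(ref_text.lower().split(), hyp_text.lower().split())
    med = [op for op, r, _ in ops if r in medical_terms]
    return sum(op != "ok" for op in med) / len(med) if med else 0.0


def rtfx(audio_seconds, processing_seconds):
    """Real-time factor: seconds of audio transcribed per second of wall time."""
    return audio_seconds / processing_seconds


# Made-up example sentence and term list, not benchmark data.
ref = "patient takes metformin and lisinopril daily"
hyp = "patient takes metformin and listen april daily"
terms = {"metformin", "lisinopril"}
print(f"WER   = {wer(ref, hyp):.3f}")
print(f"M-WER = {medical_wer(ref, hyp, terms):.3f}")
```

One thing this sketch leaves out: an M-WER defined this way never sees spurious medical terms the model inserts, which is why a separate count such as the "spurious drug mentions" figure in the post is useful alongside it.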