I benchmarked 8 LLMs for medical scribing. Hallucinations were rare; omissions need attention.
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I ran a small benchmark on LLMs for medical scribing.

Reason: most discussion around AI scribe safety focuses on hallucinations. That matters, but in notes I kept seeing another problem: models often leave out clinically relevant details from the conversation.

So I evaluated 8 frontier models on 300 synthetic doctor-patient dialogues. Each model wrote a SOAP note for every dialogue. Then I used a 4-model judge panel to score the notes for:
The main result: across 2,400 generated notes (8 models × 300 dialogues), the models produced:
So in this benchmark, omissions were much more common than hallucinations. Some other things that stood out:
The repo includes the transcripts, outputs, scoring scripts, and leaderboard (see the comments for the link). The next thing I’m interested in is running the same evaluation on models that can run locally.

Separately, we also used this benchmark internally for product development. The obvious follow-up was: if a cheap/open model writes well but misses safety facts, can a transcript-grounded wrapper recover those omissions and flag unsupported claims? That direction looks promising. In particular, it makes models like DeepSeek much more interesting: strong prose, low cost, and potentially usable in safer clinical-note pipelines when paired with a safety layer.

The earlier evaluation (V1) post can be found here.
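To make the transcript-grounded wrapper idea concrete, here is a minimal sketch of what such a check could look like. It is not the wrapper or the scoring code from the repo: the function names, the token-overlap heuristic, the 0.6 threshold, and the idea of a pre-extracted `key_facts` list are all assumptions made for illustration. A real pipeline would likely use clinical entity extraction or an LLM judge instead of word overlap.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase word tokens; a crude stand-in for clinical entity matching."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_supported(statement: str, source: str, threshold: float = 0.6) -> bool:
    """True if at least `threshold` of the statement's tokens occur in the source.

    The threshold value is an arbitrary assumption, not a tuned parameter.
    """
    stmt = tokens(statement)
    if not stmt:
        return True  # nothing to verify
    return len(stmt & tokens(source)) / len(stmt) >= threshold


def check_note(transcript: str, note_sentences: list[str], key_facts: list[str]) -> dict:
    """Flag omissions (key facts absent from the note) and unsupported
    claims (note sentences not grounded in the transcript)."""
    note_text = " ".join(note_sentences)
    return {
        "omitted": [f for f in key_facts if not is_supported(f, note_text)],
        "unsupported": [s for s in note_sentences if not is_supported(s, transcript)],
    }


# Hypothetical toy example (not taken from the benchmark data).
transcript = (
    "Patient reports chest pain for two days. "
    "She is allergic to penicillin. She takes lisinopril daily."
)
key_facts = ["allergic to penicillin", "takes lisinopril"]
note_sentences = [
    "Patient reports chest pain for two days.",
    "Takes lisinopril daily.",
    "History of diabetes.",
]

result = check_note(transcript, note_sentences, key_facts)
print(result)
```

In this toy case the allergy is reported as omitted and the diabetes history as unsupported. The two checks run in opposite directions: omissions compare transcript facts against the note, while unsupported claims compare note sentences against the transcript.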