Hugging Face Daily Papers · June 4, 2026 · 4 min read

Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Here we introduce MedSP1000, an SP-derived interactive benchmark for clinical-agent evaluation, including 1,638 SP cases with 24,602 trajectory-level peer-reviewed rubrics. MedSP1000 converts peer-reviewed SP teaching cases into executable scenarios with defined SP case scripts, clinical environment contexts, and human-validated structured rubrics.

GitHub repo: https://github.com/MAGIC-AI4Med/MedSP1000

Dataset repo: https://huggingface.co/datasets/byrLLCC/MedSP1000

arxiv:2606.05112

Published on Jun 3

Submitted by liang cheng on Jun 4

Authors: Cheng Liang, Pengcheng Qiu, Ya Zhang, Yanfeng Wang, Chaoyi Wu, Weidi Xie

Abstract

MedSP1000 introduces an interactive benchmark derived from standardized patients to evaluate clinical agents' dynamic performance across encounters, revealing limitations of current large language models in medical applications.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Large language models (LLMs) are increasingly proposed as clinical agents, yet static, single-turn benchmarks cannot capture how a model dynamically delivers care across an encounter: gathering information, planning treatment, and adapting longitudinal management across successive patient states. Medical education has long addressed an analogous challenge through standardized patients (SPs): trained actors who consistently portray clinical cases, enabling realistic practice and objective, scripted assessment. Here we introduce MedSP1000, an SP-derived interactive benchmark for clinical-agent evaluation, including 1,638 SP cases with 24,602 trajectory-level peer-reviewed rubrics. MedSP1000 converts peer-reviewed SP teaching cases into executable scenarios with defined SP case scripts, clinical environment contexts, and human-validated structured rubrics. In each simulated evaluation run, a clinical agent interacts in a closed loop with a patient agent and an environment controller, and its behaviour is scored throughout the encounter against expert criteria specified in the original materials. Applying MedSP1000 to a range of general-purpose and medically specialized LLMs, we find that performance on static benchmarks does not reliably translate to such educational scenarios. The best-performing model, GPT-5.5, completes only 60.4% of expert-defined rubric items, whereas the strongest medically specialized model reaches 40.0%; increasing test-time compute produces no measurable gain. These results suggest that current LLMs, including agentic systems tuned for medicine, are not yet reliable enough to be safely integrated into actual clinical practice. More broadly, MedSP1000 shows how process-level, SP-style evaluation can reveal clinically relevant failure modes that single-turn benchmarks miss.
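The closed-loop setup described in the abstract (a clinical agent acting against a patient agent and an environment controller, then scored against rubric items) can be illustrated with a toy sketch. Everything below is hypothetical and is not the MedSP1000 code or API: the class names, the `ask:`/`order:` action prefixes, the scripted policy, and the lambda-based rubric checks are assumptions chosen only to show the shape of the loop and how a rubric completion rate such as the reported 60.4% could be computed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class RubricItem:
    """One expert-style criterion (hypothetical); `check` inspects the whole trajectory."""
    description: str
    check: Callable[[List[str]], bool]


@dataclass
class PatientAgent:
    """Hypothetical scripted patient that answers from a fixed case script."""
    script: Dict[str, str]

    def reply(self, action: str) -> str:
        return self.script.get(action, "I'm not sure what you mean.")


@dataclass
class EnvironmentController:
    """Hypothetical environment that returns results for ordered tests."""
    results: Dict[str, str]

    def execute(self, action: str) -> str:
        return self.results.get(action, "No result available.")


def run_encounter(policy, patient, env, max_turns: int = 10) -> List[str]:
    """Run the closed loop: each action is routed to the environment or the patient,
    and the resulting observation is fed back to the policy on the next turn."""
    trajectory: List[str] = []
    observation = "Patient enters the room."
    for _ in range(max_turns):
        action: Optional[str] = policy(observation, trajectory)
        if action is None:  # the agent ends the encounter
            break
        trajectory.append(action)
        if action.startswith("order:"):
            observation = env.execute(action)
        else:
            observation = patient.reply(action)
    return trajectory


def rubric_completion(trajectory: List[str], rubric: List[RubricItem]) -> float:
    """Fraction of rubric items satisfied by the trajectory."""
    if not rubric:
        return 0.0
    met = sum(1 for item in rubric if item.check(trajectory))
    return met / len(rubric)


def toy_policy(observation: str, trajectory: List[str]) -> Optional[str]:
    """Stand-in for an LLM agent: replays a fixed plan and ignores observations."""
    plan = ["ask: chief complaint", "order: ecg", "plan: follow-up"]
    return plan[len(trajectory)] if len(trajectory) < len(plan) else None


patient = PatientAgent({"ask: chief complaint": "I have chest pain."})
env = EnvironmentController({"order: ecg": "ECG recorded."})
rubric = [
    RubricItem("Asks about the chief complaint", lambda t: "ask: chief complaint" in t),
    RubricItem("Orders an ECG", lambda t: "order: ecg" in t),
    RubricItem("Asks about allergies", lambda t: "ask: allergies" in t),
]

trajectory = run_encounter(toy_policy, patient, env)
score = rubric_completion(trajectory, rubric)
print(f"Rubric completion: {score:.1%}")
```

In this toy run the policy satisfies two of the three rubric items, so the completion rate is two thirds. A real evaluation would replace `toy_policy` with a model call and the lambda checks with the benchmark's own scoring procedure, neither of which is specified on this page.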
