Hugging Face Daily Papers · June 24, 2026 · 14 min read

LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis


arxiv:2602.09379

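Published on Jun 11 · Submitted by Xu Shihao (Lyncia) on Jun 24

Authors: Shihao Xu, Tiancheng Zhou, Jiatong Ma, Yanli Ding, Yiming Yan, Ming Xiao, Guoyi Li, Haiyang Geng, Yunyun Han, Jianhua Chen, Yafeng Deng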

Abstract

A large-scale multi-agent benchmark for evaluating LLMs in Chinese psychiatric diagnosis is introduced, highlighting challenges in dynamic consultation and the gap between consultation quality and diagnostic accuracy.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Through extensive experiments across state-of-the-art LLMs, we establish key findings: (1) although LLMs achieve high accuracy on binary depression-anxiety classification (up to 92.3%), performance deteriorates substantially for depression-anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions. We release LingxiDiag-16K and the full evaluation framework to support reproducible research at https://github.com/Lingxi-mental-health/LingxiDiagBench.

Community

XuShihao6715 (paper submitter)

LingxiDiagBench: Benchmarking LLMs for Chinese Psychiatric Consultation and Diagnosis [Accepted by KDD 2026]

TL;DR: A large-scale multi-agent benchmark revealing that while LLMs can distinguish depression from anxiety with 92.3% accuracy, they struggle badly at 12-way differential diagnosis (28.5%) — and better conversational quality doesn't guarantee better diagnosis.

Dataset Link: https://huggingface.co/datasets/XuShihao6715/LingxiDiag-16K

The Problem

Mental health care faces a global workforce crisis. Psychiatric diagnosis depends on nuanced, multi-turn clinical interviews, yet existing AI benchmarks fall short in three key ways: they use template-based synthetic dialogues with little variability, omit the information needed for differential diagnosis, and rarely support dynamic multi-turn consultation evaluation.

What's New

This paper introduces LingxiDiagBench, the first large-scale, real-data-driven, multi-disease diagnostic benchmark for Chinese psychiatric AI. At its core is LingxiDiag-16K — 16,000 synthetic consultation dialogues generated from 1,709 real outpatient EMRs collected at Shanghai Mental Health Center, carefully preserving real clinical demographic and diagnostic distributions across 12 ICD-10 categories.

The benchmark covers two evaluation paradigms:

Static: Fixed dialogue transcripts for reproducible diagnosis and next-question prediction tasks
Dynamic: Real-time multi-turn consultation where LLMs act as Doctor Agents interviewing LLM-powered Patient Agents

Four doctor consultation strategies are compared: Free-form, Symptom-Tree, APA-Guided, and APA-Guided + MRD-RAG.
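The dynamic paradigm can be pictured as a simple turn-taking loop. The sketch below is only an illustration under stated assumptions: `run_consultation`, the scripted agents, and the keyword-based diagnoser are hypothetical stand-ins for LLM-backed Doctor and Patient Agents, and none of these names come from the LingxiDiagBench code.

```python
from typing import Callable, List, Optional, Tuple

Turn = Tuple[str, str]  # (speaker, utterance)


def run_consultation(
    doctor: Callable[[List[Turn]], Optional[str]],
    patient: Callable[[List[Turn]], str],
    diagnose: Callable[[List[Turn]], str],
    max_turns: int = 10,
) -> Tuple[List[Turn], str]:
    """Alternate doctor questions and patient answers, then diagnose."""
    history: List[Turn] = []
    for _ in range(max_turns):
        question = doctor(history)
        if question is None:  # the doctor decides it has enough information
            break
        history.append(("doctor", question))
        history.append(("patient", patient(history)))
    return history, diagnose(history)


# Scripted stand-ins for the LLM-backed agents (hypothetical, for illustration).
QUESTIONS = ["How has your mood been lately?", "How are you sleeping?"]
ANSWERS = {
    QUESTIONS[0]: "Low most days for about a month.",
    QUESTIONS[1]: "I wake up very early and cannot fall back asleep.",
}


def scripted_doctor(history: List[Turn]) -> Optional[str]:
    # Ask the next unasked question, or stop when the script is exhausted.
    asked = sum(1 for speaker, _ in history if speaker == "doctor")
    return QUESTIONS[asked] if asked < len(QUESTIONS) else None


def scripted_patient(history: List[Turn]) -> str:
    # Answer the most recent doctor question from the fixed script.
    return ANSWERS[history[-1][1]]


def keyword_diagnoser(history: List[Turn]) -> str:
    # Toy rule, not a clinical method: look for one keyword in patient turns.
    text = " ".join(u for speaker, u in history if speaker == "patient")
    return "depression" if "Low" in text else "other"


transcript, label = run_consultation(scripted_doctor, scripted_patient, keyword_diagnoser)
```

In the static paradigm the transcript is fixed in advance and only the final diagnosis step is evaluated; here the doctor's questions determine what information ever reaches that step.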

Key Findings

🟢 Binary classification (depression vs. anxiety) is largely solved — top models hit 92.3% accuracy
🟡 4-way classification (including comorbidity) drops to 43.0% — comorbidity recognition remains hard
🔴 12-way differential diagnosis hits only 28.5% — a substantial open challenge
⚠️ Dynamic < Static: Interactive consultation often underperforms static evaluation, suggesting poor information-gathering strategies hurt downstream reasoning
🔍 Consultation quality ≠ Diagnostic accuracy: LLM-as-a-Judge scores correlate only moderately with diagnostic accuracy (r = 0.43), showing that asking good questions does not by itself ensure a correct diagnosis
✅ RAG helps: APA-Guided + MRD-RAG improves overall classification by ~5% over APA-Guided alone
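For readers who want to reproduce the kind of statistic behind the r = 0.43 finding, here is a minimal sketch of a Pearson correlation between judge scores and accuracies. The numbers in `judge_scores` and `accuracies` are made up for illustration and are not taken from the paper; whether the authors used Pearson's r specifically is an assumption.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-model values: judge score (1-5 scale) and 12-class accuracy.
judge_scores = [3.1, 4.2, 3.8, 4.6, 2.9]
accuracies = [0.30, 0.28, 0.41, 0.35, 0.25]
r = pearson_r(judge_scores, accuracies)
```

A value near 1 would mean that models rated as better interviewers are also the better diagnosticians; a moderate value means the two rankings only partly agree.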

Why It Matters

LingxiDiagBench provides a standardized, reproducible platform to systematically evaluate and improve AI psychiatric diagnosis — something the field has been missing. The benchmark design is language-agnostic and grounded in international clinical standards (DSM-5/ICD-10), making it extensible beyond Chinese.

Benchmark Results Takeaways

📊 Static Evaluation — Best Model per Task

Performance on fixed consultation transcripts across both the synthetic (LingxiDiag-16K) and real clinical (LingxiDiag-Clinical) test sets:

Task	Best Model (Synthetic)	Acc (Synthetic)	Best Model (Real)	Acc (Real)
2-class (Depression vs. Anxiety)	Gemini-3-Flash	0.854	Qwen3-4B	0.887
4-class (+ Comorbidity + Others)	Grok-4.1-Fast	0.470	Qwen3-32B	0.524
12-class (Full ICD-10 Differential)	GPT-5-Mini	0.409	TF-IDF + SVM	0.320
12-class Top-3 Accuracy	TF-IDF + LR	0.645	Qwen3-4B	0.698
Overall Score	TF-IDF + LR	0.533	Qwen3-32B	0.548
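The gap between 12-class accuracy and 12-class Top-3 accuracy comes from how a hit is counted. The following sketch shows the usual definition of top-k accuracy; the function name and the ICD-10 codes in the toy data are illustrative assumptions, not the benchmark's actual label set or evaluation code.

```python
from typing import List, Sequence


def top_k_accuracy(ranked_predictions: Sequence[List[str]],
                   gold_labels: Sequence[str],
                   k: int = 3) -> float:
    """Fraction of cases whose gold label appears in the first k predictions."""
    if len(ranked_predictions) != len(gold_labels) or not gold_labels:
        raise ValueError("need one ranked list per gold label")
    hits = sum(
        1 for ranked, gold in zip(ranked_predictions, gold_labels)
        if gold in ranked[:k]
    )
    return hits / len(gold_labels)


# Toy data: each inner list is a model's ranked diagnoses for one case.
ranked = [
    ["F32", "F41", "F31"],
    ["F41", "F32", "F42"],
    ["F20", "F31", "F32"],
    ["F32", "F31", "F20"],
]
gold = ["F32", "F32", "F41", "F20"]

top1 = top_k_accuracy(ranked, gold, k=1)  # 0.25: only the first case is right
top3 = top_k_accuracy(ranked, gold, k=3)  # 0.75: three gold labels are in the top 3
```

A model can therefore score much higher on Top-3 than on strict accuracy when the correct diagnosis is usually among its leading candidates but not ranked first.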

🤖 Dynamic Evaluation — Best Strategy per Dataset

Performance of the end-to-end consultation pipeline (Doctor Agent → Patient Agent → Diagnosis), across both data settings:

Strategy	Best Model	2-class Acc	4-class Acc	12-class Acc	Clf-Ovl
Synthetic (LingxiDiag-16K)
Free-form	Grok-4.1-Fast	88.6%	34.0%	25.5%	40.1%
Symptom-Tree	DeepSeek-V3.2	86.5%	31.0%	21.5%	38.0%
APA-Guided	DeepSeek-V3.2	88.5%	31.5%	23.0%	41.2%
APA-Guided + MRD-RAG	Grok-4.1-Fast	88.5%	43.0%	28.5%	45.4%
Real (LingxiDiag-Clinical)
Free-form	Qwen3-8B	88.8%	40.0%	43.0%	49.0%
Symptom-Tree	GPT-OSS-20B	91.2%	43.0%	44.5%	50.0%
APA-Guided	Qwen3-32B	80.0%	36.0%	46.5%	48.3%
APA-Guided + MRD-RAG	GPT-OSS-20B	78.8%	37.5%	45.5%	47.2%
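To see what retrieval adds in each setting, the Clf-Ovl column can be compared directly. This is a small sketch using only the values from the table above; note that each row reports the best model for that strategy, so the two rows being compared may come from different models, and `rag_delta` is a hypothetical helper name.

```python
# Clf-Ovl (%) from the dynamic evaluation table, best model per strategy.
CLF_OVERALL = {
    "synthetic": {"APA-Guided": 41.2, "APA-Guided + MRD-RAG": 45.4},
    "real": {"APA-Guided": 48.3, "APA-Guided + MRD-RAG": 47.2},
}


def rag_delta(dataset: str) -> float:
    """Change in Clf-Ovl, in percentage points, when MRD-RAG is added."""
    scores = CLF_OVERALL[dataset]
    return round(scores["APA-Guided + MRD-RAG"] - scores["APA-Guided"], 1)


synthetic_delta = rag_delta("synthetic")  # 4.2
real_delta = rag_delta("real")            # -1.1
```

On these best-model rows the gain from retrieval shows up on LingxiDiag-16K, while on LingxiDiag-Clinical the APA-Guided row is slightly ahead.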

🔁 Cross-Dataset Transfer — Does Synthetic Training Generalize to Real Data?

To validate that LingxiDiag-16K encodes clinically meaningful knowledge (not just surface statistics), models fine-tuned on synthetic data were evaluated on real clinical cases:

Model	12-class Acc (Real, Zero-shot)	12-class Acc (Real, +LoRA SFT)	Gain (percentage points)
Qwen3-8B	4.1%	41.4%	+37.3
Qwen3-32B	20.4%	39.7%	+19.3
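The Gain column is an absolute difference between two accuracies, which is easy to confuse with a relative improvement. The sketch below recomputes both from the table values; `gains` is a hypothetical helper written for this illustration.

```python
from typing import Dict, Tuple

# (zero-shot accuracy %, accuracy % after LoRA SFT on synthetic data), from the table.
RESULTS: Dict[str, Tuple[float, float]] = {
    "Qwen3-8B": (4.1, 41.4),
    "Qwen3-32B": (20.4, 39.7),
}


def gains(zero_shot: float, fine_tuned: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, fine-tuned / zero-shot ratio)."""
    return round(fine_tuned - zero_shot, 1), round(fine_tuned / zero_shot, 2)


summary = {model: gains(*scores) for model, scores in RESULTS.items()}
# summary["Qwen3-8B"]  -> (37.3, 10.1)
# summary["Qwen3-32B"] -> (19.3, 1.95)
```

Read this way, the smaller model starts from a very low zero-shot baseline and improves roughly tenfold, while the larger model roughly doubles; both end up near 40% on real clinical cases.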

The authors emphasize that this benchmark is for research purposes only and must not be deployed in clinical settings without rigorous validation and human oversight.
