\n\t<a id=\"lingxidiagbench-benchmarking-llms-for-chinese-psychiatric-consultation-and-diagnosis-accepted-by-kdd-2026\" class=\"block pr-1.5 text-lg md:absolute md:p-1.5 md:opacity-0 md:group-hover:opacity-100 md:right-full\" href=\"#lingxidiagbench-benchmarking-llms-for-chinese-psychiatric-consultation-and-diagnosis-accepted-by-kdd-2026\" rel=\"nofollow\">\n\t\t<span class=\"header-link\"><svg class=\"text-gray-500 hover:text-black dark:hover:text-gray-200 w-4\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" aria-hidden=\"true\" role=\"img\" width=\"1em\" height=\"1em\" preserveAspectRatio=\"xMidYMid meet\" viewBox=\"0 0 256 256\"><path d=\"M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z\" fill=\"currentColor\"></path></svg></span>\n\t</a>\n\t<span>\n\t\tLingxiDiagBench: Benchmarking LLMs for Chinese Psychiatric Consultation and Diagnosis [Accepted by KDD 2026]\n\t</span>\n</h2>\n<p><strong>TL;DR:</strong> A large-scale multi-agent benchmark revealing that while LLMs can distinguish depression from anxiety with 92.3% accuracy, they struggle badly at 12-way differential diagnosis (28.5%) — and better conversational quality doesn't guarantee better diagnosis.</p>\n<p><a href=\"https://cdn-uploads.huggingface.co/production/uploads/6488a18de22a0081a550c514/6p4JZmA4ojnV8JrTYvmLQ.png\" rel=\"nofollow\"><img src=\"https://cdn-uploads.huggingface.co/production/uploads/6488a18de22a0081a550c514/6p4JZmA4ojnV8JrTYvmLQ.png\" alt=\"image\"></a></p>\n<h2 class=\"relative group flex 
items-baseline\">\n\t<a id=\"dataset-link-httpshuggingfacecodatasetsxushihao6715lingxidiag-16k\" class=\"block pr-1.5 text-lg md:absolute md:p-1.5 md:opacity-0 md:group-hover:opacity-100 md:right-full\" href=\"#dataset-link-httpshuggingfacecodatasetsxushihao6715lingxidiag-16k\" rel=\"nofollow\">\n\t\t<span class=\"header-link\"><svg class=\"text-gray-500 hover:text-black dark:hover:text-gray-200 w-4\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" aria-hidden=\"true\" role=\"img\" width=\"1em\" height=\"1em\" preserveAspectRatio=\"xMidYMid meet\" viewBox=\"0 0 256 256\"><path d=\"M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z\" fill=\"currentColor\"></path></svg></span>\n\t</a>\n\t<span>\n\t\tDataset Link: <a href=\"https://huggingface.co/datasets/XuShihao6715/LingxiDiag-16K\">https://huggingface.co/datasets/XuShihao6715/LingxiDiag-16K</a>\n\t</span>\n</h2>\n<h3 class=\"relative group flex items-baseline\">\n\t<a id=\"the-problem\" class=\"block pr-1.5 text-lg md:absolute md:p-1.5 md:opacity-0 md:group-hover:opacity-100 md:right-full\" href=\"#the-problem\" rel=\"nofollow\">\n\t\t<span class=\"header-link\"><svg class=\"text-gray-500 hover:text-black dark:hover:text-gray-200 w-4\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" aria-hidden=\"true\" role=\"img\" width=\"1em\" height=\"1em\" preserveAspectRatio=\"xMidYMid meet\" viewBox=\"0 0 256 256\"><path d=\"M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 
1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z\" fill=\"currentColor\"></path></svg></span>\n\t</a>\n\t<span>\n\t\tThe Problem\n\t</span>\n</h3>\n<p>Mental health care faces a global workforce crisis. Psychiatric diagnosis depends on nuanced, multi-turn clinical interviews, yet existing AI benchmarks fall short in three key ways: they use template-based synthetic dialogues with little variability, omit the information needed for differential diagnosis, and rarely support dynamic multi-turn consultation evaluation.</p>\n<h3 class=\"relative group flex items-baseline\">\n\t<a id=\"whats-new\" class=\"block pr-1.5 text-lg md:absolute md:p-1.5 md:opacity-0 md:group-hover:opacity-100 md:right-full\" href=\"#whats-new\" rel=\"nofollow\">\n\t\t<span class=\"header-link\"><svg class=\"text-gray-500 hover:text-black dark:hover:text-gray-200 w-4\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" aria-hidden=\"true\" role=\"img\" width=\"1em\" height=\"1em\" preserveAspectRatio=\"xMidYMid meet\" viewBox=\"0 0 256 256\"><path d=\"M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z\" 
fill=\"currentColor\"></path></svg></span>\n\t</a>\n\t<span>\n\t\tWhat's New\n\t</span>\n</h3>\n<p>This paper introduces <strong>LingxiDiagBench</strong>, the first large-scale, real-data-driven, multi-disease diagnostic benchmark for Chinese psychiatric AI. At its core is <strong>LingxiDiag-16K</strong> — 16,000 synthetic consultation dialogues generated from 1,709 real outpatient EMRs collected at Shanghai Mental Health Center, carefully preserving real clinical demographic and diagnostic distributions across 12 ICD-10 categories.</p>\n<p>The benchmark covers <strong>two evaluation paradigms</strong>:</p>\n<ul>\n<li><strong>Static:</strong> Fixed dialogue transcripts for reproducible diagnosis and next-question prediction tasks</li>\n<li><strong>Dynamic:</strong> Real-time multi-turn consultation where LLMs act as Doctor Agents interviewing LLM-powered Patient Agents</li>\n</ul>\n<p>Four doctor consultation strategies are compared: <em>Free-form</em>, <em>Symptom-Tree</em>, <em>APA-Guided</em>, and <em>APA-Guided + MRD-RAG</em>.</p>\n<h3 class=\"relative group flex items-baseline\">\n\t<a id=\"key-findings\" class=\"block pr-1.5 text-lg md:absolute md:p-1.5 md:opacity-0 md:group-hover:opacity-100 md:right-full\" href=\"#key-findings\" rel=\"nofollow\">\n\t\t<span class=\"header-link\"><svg class=\"text-gray-500 hover:text-black dark:hover:text-gray-200 w-4\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" aria-hidden=\"true\" role=\"img\" width=\"1em\" height=\"1em\" preserveAspectRatio=\"xMidYMid meet\" viewBox=\"0 0 256 256\"><path d=\"M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 
0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z\" fill=\"currentColor\"></path></svg></span>\n\t</a>\n\t<span>\n\t\tKey Findings\n\t</span>\n</h3>\n<ul>\n<li>🟢 <strong>Binary classification</strong> (depression vs. anxiety) is largely solved — top models hit <strong>92.3% accuracy</strong></li>\n<li>🟡 <strong>4-way classification</strong> (including comorbidity) drops to <strong>43.0%</strong> — comorbidity recognition remains hard</li>\n<li>🔴 <strong>12-way differential diagnosis</strong> hits only <strong>28.5%</strong> — a substantial open challenge</li>\n<li>⚠️ <strong>Dynamic < Static:</strong> Interactive consultation consistently underperforms static evaluation, suggesting poor information-gathering strategies hurt downstream reasoning</li>\n<li>🔍 <strong>Consultation quality ≠ Diagnostic accuracy:</strong> LLM-as-a-Judge scores correlate with diagnostic accuracy at only <strong>r = 0.43</strong>, showing that asking good questions and reaching correct diagnoses are decoupled skills</li>\n<li>✅ <strong>RAG helps:</strong> APA-Guided + MRD-RAG improves overall classification by ~5% over APA-Guided alone</li>\n</ul>\n<h3 class=\"relative group flex items-baseline\">\n\t<a id=\"why-it-matters\" class=\"block pr-1.5 text-lg md:absolute md:p-1.5 md:opacity-0 md:group-hover:opacity-100 md:right-full\" href=\"#why-it-matters\" rel=\"nofollow\">\n\t\t<span class=\"header-link\"><svg class=\"text-gray-500 hover:text-black dark:hover:text-gray-200 w-4\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" aria-hidden=\"true\" role=\"img\" width=\"1em\" height=\"1em\" preserveAspectRatio=\"xMidYMid meet\" viewBox=\"0 0 256 256\"><path d=\"M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 
79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z\" fill=\"currentColor\"></path></svg></span>\n\t</a>\n\t<span>\n\t\tWhy It Matters\n\t</span>\n</h3>\n<p>LingxiDiagBench provides a standardized, reproducible platform to systematically evaluate and improve AI psychiatric diagnosis — something the field has been missing. The benchmark design is language-agnostic and grounded in international clinical standards (DSM-5/ICD-10), making it extensible beyond Chinese.</p>\n<h3 class=\"relative group flex items-baseline\">\n\t<a id=\"benchmark-results-takeways\" class=\"block pr-1.5 text-lg md:absolute md:p-1.5 md:opacity-0 md:group-hover:opacity-100 md:right-full\" href=\"#benchmark-results-takeways\" rel=\"nofollow\">\n\t\t<span class=\"header-link\"><svg class=\"text-gray-500 hover:text-black dark:hover:text-gray-200 w-4\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" aria-hidden=\"true\" role=\"img\" width=\"1em\" height=\"1em\" preserveAspectRatio=\"xMidYMid meet\" viewBox=\"0 0 256 256\"><path d=\"M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z\" fill=\"currentColor\"></path></svg></span>\n\t</a>\n\t<span>\n\t\tBenchmark Results Takeways\n\t</span>\n</h3>\n<h4 class=\"relative group flex items-baseline\">\n\t<a id=\"📊-static-evaluation--best-model-per-task\" class=\"block pr-1.5 text-lg 
md:absolute md:p-1.5 md:opacity-0 md:group-hover:opacity-100 md:right-full\" href=\"#📊-static-evaluation--best-model-per-task\" rel=\"nofollow\">\n\t\t<span class=\"header-link\"><svg class=\"text-gray-500 hover:text-black dark:hover:text-gray-200 w-4\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" aria-hidden=\"true\" role=\"img\" width=\"1em\" height=\"1em\" preserveAspectRatio=\"xMidYMid meet\" viewBox=\"0 0 256 256\"><path d=\"M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z\" fill=\"currentColor\"></path></svg></span>\n\t</a>\n\t<span>\n\t\t📊 Static Evaluation — Best Model per Task\n\t</span>\n</h4>\n<p>Performance on fixed consultation transcripts across both the synthetic (LingxiDiag-16K) and real clinical (LingxiDiag-Clinical) test sets:</p>\n<div class=\"max-w-full overflow-auto\">\n\t<table>\n\t\t<thead><tr>\n<th>Task</th>\n<th>Best Model (Synthetic)</th>\n<th>Acc (Synthetic)</th>\n<th>Best Model (Real)</th>\n<th>Acc (Real)</th>\n</tr>\n\n\t\t</thead><tbody><tr>\n<td>2-class (Depression vs. 
Anxiety)</td>\n<td>Gemini-3-Flash</td>\n<td>0.854</td>\n<td>Qwen3-4B</td>\n<td><strong>0.887</strong></td>\n</tr>\n<tr>\n<td>4-class (+ Comorbidity + Others)</td>\n<td>Grok-4.1-Fast</td>\n<td>0.470</td>\n<td>Qwen3-32B</td>\n<td><strong>0.524</strong></td>\n</tr>\n<tr>\n<td>12-class (Full ICD-10 Differential)</td>\n<td>GPT-5-Mini</td>\n<td>0.409</td>\n<td>TF-IDF + SVM</td>\n<td><strong>0.320</strong></td>\n</tr>\n<tr>\n<td>12-class Top-3 Accuracy</td>\n<td>TF-IDF + LR</td>\n<td>0.645</td>\n<td>Qwen3-4B</td>\n<td><strong>0.698</strong></td>\n</tr>\n<tr>\n<td>Overall Score</td>\n<td>TF-IDF + LR</td>\n<td>0.533</td>\n<td>Qwen3-32B</td>\n<td><strong>0.548</strong></td>\n</tr>\n</tbody>\n\t</table>\n</div>\n<hr>\n<h4 class=\"relative group flex items-baseline\">\n\t<a id=\"🤖-dynamic-evaluation--best-strategy-per-dataset\" class=\"block pr-1.5 text-lg md:absolute md:p-1.5 md:opacity-0 md:group-hover:opacity-100 md:right-full\" href=\"#🤖-dynamic-evaluation--best-strategy-per-dataset\" rel=\"nofollow\">\n\t\t<span class=\"header-link\"><svg class=\"text-gray-500 hover:text-black dark:hover:text-gray-200 w-4\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" aria-hidden=\"true\" role=\"img\" width=\"1em\" height=\"1em\" preserveAspectRatio=\"xMidYMid meet\" viewBox=\"0 0 256 256\"><path d=\"M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z\" fill=\"currentColor\"></path></svg></span>\n\t</a>\n\t<span>\n\t\t🤖 Dynamic Evaluation — Best Strategy per Dataset\n\t</span>\n</h4>\n<p>Performance of the 
end-to-end consultation pipeline (Doctor Agent → Patient Agent → Diagnosis), across both data settings:</p>\n<div class=\"max-w-full overflow-auto\">\n\t<table>\n\t\t<thead><tr>\n<th>Strategy</th>\n<th>Best Model</th>\n<th>2-class Acc</th>\n<th>4-class Acc</th>\n<th>12-class Acc</th>\n<th>Clf-Ovl</th>\n</tr>\n\n\t\t</thead><tbody><tr>\n<td><strong>Synthetic (LingxiDiag-16K)</strong></td>\n<td></td>\n<td></td>\n<td></td>\n<td></td>\n<td></td>\n</tr>\n<tr>\n<td>Free-form</td>\n<td>Grok-4.1-Fast</td>\n<td>88.6%</td>\n<td>34.0%</td>\n<td>25.5%</td>\n<td>40.1%</td>\n</tr>\n<tr>\n<td>Symptom-Tree</td>\n<td>DeepSeek-V3.2</td>\n<td>86.5%</td>\n<td>31.0%</td>\n<td>21.5%</td>\n<td>38.0%</td>\n</tr>\n<tr>\n<td>APA-Guided</td>\n<td>DeepSeek-V3.2</td>\n<td>88.5%</td>\n<td>31.5%</td>\n<td>23.0%</td>\n<td>41.2%</td>\n</tr>\n<tr>\n<td>APA-Guided + MRD-RAG</td>\n<td><strong>Grok-4.1-Fast</strong></td>\n<td><strong>88.5%</strong></td>\n<td><strong>43.0%</strong></td>\n<td><strong>28.5%</strong></td>\n<td><strong>45.4%</strong></td>\n</tr>\n<tr>\n<td><strong>Real (LingxiDiag-Clinical)</strong></td>\n<td></td>\n<td></td>\n<td></td>\n<td></td>\n<td></td>\n</tr>\n<tr>\n<td>Free-form</td>\n<td>Qwen3-8B</td>\n<td>88.8%</td>\n<td>40.0%</td>\n<td>43.0%</td>\n<td>49.0%</td>\n</tr>\n<tr>\n<td>Symptom-Tree</td>\n<td><strong>GPT-OSS-20B</strong></td>\n<td><strong>91.2%</strong></td>\n<td><strong>43.0%</strong></td>\n<td><strong>44.5%</strong></td>\n<td><strong>50.0%</strong></td>\n</tr>\n<tr>\n<td>APA-Guided</td>\n<td>Qwen3-32B</td>\n<td>80.0%</td>\n<td>36.0%</td>\n<td>46.5%</td>\n<td>48.3%</td>\n</tr>\n<tr>\n<td>APA-Guided + MRD-RAG</td>\n<td>GPT-OSS-20B</td>\n<td>78.8%</td>\n<td>37.5%</td>\n<td>45.5%</td>\n<td>47.2%</td>\n</tr>\n</tbody>\n\t</table>\n</div>\n<hr>\n<h4 class=\"relative group flex items-baseline\">\n\t<a id=\"🔁-cross-dataset-transfer--does-synthetic-training-generalize-to-real-data\" class=\"block pr-1.5 text-lg md:absolute md:p-1.5 md:opacity-0 md:group-hover:opacity-100 
md:right-full\" href=\"#🔁-cross-dataset-transfer--does-synthetic-training-generalize-to-real-data\" rel=\"nofollow\">\n\t\t<span class=\"header-link\"><svg class=\"text-gray-500 hover:text-black dark:hover:text-gray-200 w-4\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" aria-hidden=\"true\" role=\"img\" width=\"1em\" height=\"1em\" preserveAspectRatio=\"xMidYMid meet\" viewBox=\"0 0 256 256\"><path d=\"M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z\" fill=\"currentColor\"></path></svg></span>\n\t</a>\n\t<span>\n\t\t🔁 Cross-Dataset Transfer — Does Synthetic Training Generalize to Real Data?\n\t</span>\n</h4>\n<p>To validate that LingxiDiag-16K encodes clinically meaningful knowledge (not just surface statistics), models fine-tuned on synthetic data were evaluated on real clinical cases:</p>\n<div class=\"max-w-full overflow-auto\">\n\t<table>\n\t\t<thead><tr>\n<th>Model</th>\n<th>12-class Acc (Real, Zero-shot)</th>\n<th>12-class Acc (Real, +LoRA SFT)</th>\n<th>Gain</th>\n</tr>\n\n\t\t</thead><tbody><tr>\n<td>Qwen3-8B</td>\n<td>4.1%</td>\n<td><strong>41.4%</strong></td>\n<td>+37.3%</td>\n</tr>\n<tr>\n<td>Qwen3-32B</td>\n<td>20.4%</td>\n<td><strong>39.7%</strong></td>\n<td>+19.3%</td>\n</tr>\n</tbody>\n\t</table>\n</div>\n<p><em>The authors emphasize that this benchmark is for research purposes only and must not be deployed in clinical settings without rigorous validation and human 
oversight.</em></p>\n","updatedAt":"2026-06-24T06:15:38.236Z","author":{"_id":"6488a18de22a0081a550c514","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6488a18de22a0081a550c514/OtegVrprR4lPIX0hBCaZ4.png","fullname":"Xu Shihao","name":"XuShihao6715","type":"user","isPro":true,"isHf":false,"isHfAdmin":false,"isMod":false,"isUserFollowing":false}},"numEdits":2,"identifiedLanguage":{"language":"en","probability":0.7653796076774597},"editors":["XuShihao6715"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/6488a18de22a0081a550c514/OtegVrprR4lPIX0hBCaZ4.png"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2602.09379","authors":[{"_id":"6a3b71040a86ac3098d5d7d3","name":"Shihao Xu","hidden":false},{"_id":"6a3b71040a86ac3098d5d7d4","name":"Tiancheng Zhou","hidden":false},{"_id":"6a3b71040a86ac3098d5d7d5","name":"Jiatong Ma","hidden":false},{"_id":"6a3b71040a86ac3098d5d7d6","name":"Yanli Ding","hidden":false},{"_id":"6a3b71040a86ac3098d5d7d7","name":"Yiming Yan","hidden":false},{"_id":"6a3b71040a86ac3098d5d7d8","name":"Ming Xiao","hidden":false},{"_id":"6a3b71040a86ac3098d5d7d9","name":"Guoyi Li","hidden":false},{"_id":"6a3b71040a86ac3098d5d7da","name":"Haiyang Geng","hidden":false},{"_id":"6a3b71040a86ac3098d5d7db","name":"Yunyun Han","hidden":false},{"_id":"6a3b71040a86ac3098d5d7dc","name":"Jianhua Chen","hidden":false},{"_id":"6a3b71040a86ac3098d5d7dd","name":"Yafeng Deng","hidden":false}],"publishedAt":"2026-06-11T00:00:00.000Z","submittedOnDailyAt":"2026-06-24T00:00:00.000Z","title":"LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis","submittedOnDailyBy":{"_id":"6488a18de22a0081a550c514","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6488a18de22a0081a550c514/OtegVrprR4lPIX0hBCaZ4.png","isPro":true,"fullname":"Xu Shihao","user":"XuShihao6715","type":"user","name":"XuShihao6715"},"summary":"Mental 
disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Through extensive experiments across state-of-the-art LLMs, we establish key findings: (1) although LLMs achieve high accuracy on binary depression--anxiety classification (up to 92.3%), performance deteriorates substantially for depression--anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions. 
We release LingxiDiag-16K and the full evaluation framework to support reproducible research at https://github.com/Lingxi-mental-health/LingxiDiagBench.","upvotes":17,"discussionId":"6a3b71050a86ac3098d5d7de","projectPage":"https://huggingface.co/datasets/XuShihao6715/LingxiDiag-16K","githubRepo":"https://github.com/Lingxi-mental-health/LingxiDiagBench","githubRepoAddedBy":"user","ai_summary":"A large-scale multi-agent benchmark for evaluating LLMs in Chinese psychiatric diagnosis is introduced, highlighting challenges in dynamic consultation and the gap between consultation quality and diagnostic accuracy.","ai_keywords":["LLMs","psychiatric diagnosis","multi-agent benchmark","synthetic consultation dialogues","EMR-aligned","ICD-10","differential diagnosis","diagnostic accuracy","consultation quality","LLM-as-a-Judge"],"ai_summary_model":"Qwen/Qwen2.5-Coder-32B-Instruct","githubStars":9,"organization":{"_id":"6a3b71d72bafdc24d5f41775","name":"Lyncia","fullname":"Lyncia","avatar":"https://cdn-avatars.huggingface.co/v1/production/uploads/6488a18de22a0081a550c514/PVYYq18Hgibe92SFJSu_l.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"64b7774e54b25908c3c98e82","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/29Rta4ox3eDm0h1GVp6kP.png","isPro":false,"fullname":"Ruonan Wu","user":"kusunoki","type":"user"},{"_id":"6445f702e1fd8d65b27dcc10","avatarUrl":"/avatars/406eb535635df69b23ddc973c394a74a.svg","isPro":false,"fullname":"Nothing","user":"ParaNoth","type":"user"},{"_id":"68672521ce001f45e5f70e68","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/vV8PfA9AVZhFefefIv0uD.png","isPro":false,"fullname":"youmin 
wu","user":"popowu80s","type":"user"},{"_id":"6a3b7c28797012f01bc812cc","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/YRSQ7x133B43RXiHZSRa9.jpeg","isPro":false,"fullname":"Wang","user":"MinghaoDaniel","type":"user"},{"_id":"64660bcda1a19b0623fcf6c8","avatarUrl":"/avatars/9c1c8202d25f395f763dae19a6c6f948.svg","isPro":false,"fullname":"DongjieTao","user":"Walageguaiguai","type":"user"},{"_id":"6a3b7d7e2bafdc24d5f4ce9d","avatarUrl":"/avatars/def96e377c3847d45556d2636b7456a0.svg","isPro":false,"fullname":"sun","user":"xunyang1998","type":"user"},{"_id":"6a3b7d94e3f9850051703767","avatarUrl":"/avatars/bee01bf1cd65c7b97704059fd75a8563.svg","isPro":false,"fullname":"Xinyi Zhang","user":"Zinnia001","type":"user"},{"_id":"6902dd7dd642fc67f25dbe43","avatarUrl":"/avatars/2ba8415eddb5071e6b707376889d00eb.svg","isPro":false,"fullname":"xm","user":"ethus","type":"user"},{"_id":"666abd809a3e3ce05a6fc0bb","avatarUrl":"/avatars/bc760cbce451dbafaa9f36cac7d8928e.svg","isPro":false,"fullname":"Anonymous","user":"anon-meddial-2026","type":"user"},{"_id":"6459bfe7a82daa98729e898a","avatarUrl":"/avatars/3706278d3fe51d03b79532353c44e2ac.svg","isPro":false,"fullname":"Jason Jarvan","user":"jasonjarvan","type":"user"},{"_id":"67ba9e0617af640b2e622178","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/r9yX_W3Mi9301tZyKmDjI.png","isPro":false,"fullname":"Liu Chang","user":"Uncle226","type":"user"},{"_id":"69cbc607d045760f771d4e1c","avatarUrl":"/avatars/8c3f464437bc54393a60f996e98dc1a8.svg","isPro":false,"fullname":"w","user":"wz2026","type":"user"}],"acceptLanguages":["en"],"dailyPaperRank":0,"organization":{"_id":"6a3b71d72bafdc24d5f41775","name":"Lyncia","fullname":"Lyncia","avatar":"https://cdn-avatars.huggingface.co/v1/production/uploads/6488a18de22a0081a550c514/PVYYq18Hgibe92SFJSu_l.png"},"markdownContentUrl":"https://huggingface.co/buckets/huggingchat/papers-content/resolve/2602/2602.09379.md","query":{}}">
LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis
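Authors: Shihao Xu, Tiancheng Zhou, Jiatong Ma, Yanli Ding, Yiming Yan, Ming Xiao, Guoyi Li, Haiyang Geng, Yunyun Han, Jianhua Chen, Yafeng Deng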
Abstract
A large-scale multi-agent benchmark for evaluating LLMs in Chinese psychiatric diagnosis is introduced, highlighting challenges in dynamic consultation and the gap between consultation quality and diagnostic accuracy.
Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Through extensive experiments across state-of-the-art LLMs, we establish key findings: (1) although LLMs achieve high accuracy on binary depression-anxiety classification (up to 92.3%), performance deteriorates substantially for depression-anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions. We release LingxiDiag-16K and the full evaluation framework to support reproducible research at https://github.com/Lingxi-mental-health/LingxiDiagBench.
Community
LingxiDiagBench: Benchmarking LLMs for Chinese Psychiatric Consultation and Diagnosis [Accepted by KDD 2026]
TL;DR: A large-scale multi-agent benchmark revealing that while LLMs can distinguish depression from anxiety with 92.3% accuracy, they struggle badly at 12-way differential diagnosis (28.5%) — and better conversational quality doesn't guarantee better diagnosis.

The Problem
Mental health care faces a global workforce crisis. Psychiatric diagnosis depends on nuanced, multi-turn clinical interviews, yet existing AI benchmarks fall short in three key ways: they use template-based synthetic dialogues with little variability, omit the information needed for differential diagnosis, and rarely support dynamic multi-turn consultation evaluation.
What's New
This paper introduces LingxiDiagBench, the first large-scale, real-data-driven, multi-disease diagnostic benchmark for Chinese psychiatric AI. At its core is LingxiDiag-16K — 16,000 synthetic consultation dialogues generated from 1,709 real outpatient EMRs collected at Shanghai Mental Health Center, carefully preserving real clinical demographic and diagnostic distributions across 12 ICD-10 categories.
The benchmark covers two evaluation paradigms:
- Static: Fixed dialogue transcripts for reproducible diagnosis and next-question prediction tasks
- Dynamic: Real-time multi-turn consultation where LLMs act as Doctor Agents interviewing LLM-powered Patient Agents
Four doctor consultation strategies are compared: Free-form, Symptom-Tree, APA-Guided, and APA-Guided + MRD-RAG.
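The dynamic paradigm can be pictured as a turn-taking loop between the two agents. The sketch below is illustrative only: `DoctorAgent`, `PatientAgent`, `run_consultation`, the turn cap, the scripted replies, and the toy diagnosis rule are all hypothetical stand-ins, not the benchmark's actual classes or prompts. In the real framework both roles would presumably be backed by LLM calls.

```python
from dataclasses import dataclass


@dataclass
class PatientAgent:
    """Hypothetical stand-in for an LLM-powered patient; replies come from a script."""
    scripted_replies: list
    _turn: int = 0

    def reply(self, question: str) -> str:
        # A real patient agent would condition on the question and a case profile.
        index = min(self._turn, len(self.scripted_replies) - 1)
        self._turn += 1
        return self.scripted_replies[index]


@dataclass
class DoctorAgent:
    """Hypothetical stand-in for an LLM doctor that walks a fixed question list."""
    questions: list

    def next_question(self, history):
        # Return None once the question list is exhausted, ending the interview.
        asked = len(history)
        return self.questions[asked] if asked < len(self.questions) else None

    def diagnose(self, history):
        # Toy keyword rule (an assumption for illustration, not a clinical method).
        text = " ".join(answer for _, answer in history).lower()
        return "anxiety" if "worry" in text else "depression"


def run_consultation(doctor, patient, max_turns=10):
    """Alternate doctor questions and patient answers, then ask for a diagnosis."""
    history = []
    for _ in range(max_turns):
        question = doctor.next_question(history)
        if question is None:
            break
        history.append((question, patient.reply(question)))
    return history, doctor.diagnose(history)


doctor = DoctorAgent(questions=["What brings you in today?", "How has your sleep been?"])
patient = PatientAgent(scripted_replies=["I worry constantly about work.", "I sleep poorly."])
history, diagnosis = run_consultation(doctor, patient)
print(len(history), diagnosis)
```

In this framing, the four consultation strategies would differ mainly in how `next_question` chooses what to ask, while the static paradigm skips the loop and passes a fixed transcript straight to the diagnosis step.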
Key Findings
- 🟢 Binary classification (depression vs. anxiety) is largely solved — top models hit 92.3% accuracy
- 🟡 4-way classification (including comorbidity) drops to 43.0% — comorbidity recognition remains hard
- 🔴 12-way differential diagnosis hits only 28.5% — a substantial open challenge
- ⚠️ Dynamic < Static: Interactive consultation often underperforms static evaluation, suggesting that ineffective information-gathering strategies hurt downstream diagnostic reasoning
- 🔍 Consultation quality ≠ Diagnostic accuracy: LLM-as-a-Judge scores correlate only moderately with diagnostic accuracy (r = 0.43), so asking well-structured questions does not by itself ensure a correct diagnosis
- ✅ RAG helps: APA-Guided + MRD-RAG improves overall classification by ~5% over APA-Guided alone
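To make the r = 0.43 figure concrete, here is a minimal sketch of how a Pearson correlation between judge scores and diagnostic accuracy could be computed. The function name `pearson_r` and the five score pairs are invented for illustration; they are not the paper's data, and the paper may use a different correlation procedure.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up toy values: one judge score and one accuracy-like score per system.
judge_scores = [1, 2, 3, 4, 5]
accuracy_scores = [2, 1, 4, 3, 5]
r_toy = pearson_r(judge_scores, accuracy_scores)

# Share of variance explained by a linear fit at the reported r = 0.43.
shared_variance = 0.43 ** 2
print(round(r_toy, 2), round(shared_variance, 4))
```

Squaring the reported coefficient gives roughly 0.18, meaning that under a linear fit less than a fifth of the variance in diagnostic accuracy is accounted for by the judge score.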
Why It Matters
LingxiDiagBench provides a standardized, reproducible platform to systematically evaluate and improve AI psychiatric diagnosis — something the field has been missing. The benchmark design is language-agnostic and grounded in international clinical standards (DSM-5/ICD-10), making it extensible beyond Chinese.
Benchmark Results Takeaways
📊 Static Evaluation — Best Model per Task
Performance on fixed consultation transcripts across both the synthetic (LingxiDiag-16K) and real clinical (LingxiDiag-Clinical) test sets:
| Task | Best Model (Synthetic) | Acc (Synthetic) | Best Model (Real) | Acc (Real) |
| --- | --- | --- | --- | --- |
| 2-class (Depression vs. Anxiety) | Gemini-3-Flash | 0.854 | Qwen3-4B | 0.887 |
| 4-class (+ Comorbidity + Others) | Grok-4.1-Fast | 0.470 | Qwen3-32B | 0.524 |
| 12-class (Full ICD-10 Differential) | GPT-5-Mini | 0.409 | TF-IDF + SVM | 0.320 |
| 12-class Top-3 Accuracy | TF-IDF + LR | 0.645 | Qwen3-4B | 0.698 |
| Overall Score | TF-IDF + LR | 0.533 | Qwen3-32B | 0.548 |
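The table reports both plain and Top-3 accuracy for the 12-class task. As a rough sketch of the difference, assuming each model outputs a ranked list of candidate diagnoses per case, top-k accuracy counts a case as correct when the gold label appears among the first k candidates. The function name `top_k_accuracy` and the toy labels below are illustrative assumptions, not the benchmark's evaluation code or label set.

```python
def top_k_accuracy(ranked_predictions, gold_labels, k=3):
    """Fraction of cases whose gold label is among the first k ranked predictions."""
    if len(ranked_predictions) != len(gold_labels) or not gold_labels:
        raise ValueError("need one ranked list per gold label")
    hits = sum(
        1
        for ranked, gold in zip(ranked_predictions, gold_labels)
        if gold in ranked[:k]
    )
    return hits / len(gold_labels)


# Toy ranked outputs for three cases (labels are placeholders).
ranked = [
    ["depression", "anxiety", "bipolar", "ocd"],
    ["anxiety", "ocd", "depression", "bipolar"],
    ["bipolar", "ocd", "anxiety", "depression"],
]
gold = ["depression", "depression", "depression"]

top1 = top_k_accuracy(ranked, gold, k=1)
top3 = top_k_accuracy(ranked, gold, k=3)
print(round(top1, 3), round(top3, 3))
```

Top-3 accuracy can never be lower than top-1 on the same predictions, which is consistent with the Top-3 row scoring well above the plain 12-class row.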
🤖 Dynamic Evaluation — Best Strategy per Dataset
Performance of the end-to-end consultation pipeline (Doctor Agent → Patient Agent → Diagnosis), across both data settings:
| Strategy | Best Model | 2-class Acc | 4-class Acc | 12-class Acc | Clf-Ovl |
| --- | --- | --- | --- | --- | --- |
| Synthetic (LingxiDiag-16K) | | | | | |
| Free-form | Grok-4.1-Fast | 88.6% | 34.0% | 25.5% | 40.1% |
| Symptom-Tree | DeepSeek-V3.2 | 86.5% | 31.0% | 21.5% | 38.0% |
| APA-Guided | DeepSeek-V3.2 | 88.5% | 31.5% | 23.0% | 41.2% |
| APA-Guided + MRD-RAG | Grok-4.1-Fast | 88.5% | 43.0% | 28.5% | 45.4% |
| Real (LingxiDiag-Clinical) | | | | | |
| Free-form | Qwen3-8B | 88.8% | 40.0% | 43.0% | 49.0% |
| Symptom-Tree | GPT-OSS-20B | 91.2% | 43.0% | 44.5% | 50.0% |
| APA-Guided | Qwen3-32B | 80.0% | 36.0% | 46.5% | 48.3% |
| APA-Guided + MRD-RAG | GPT-OSS-20B | 78.8% | 37.5% | 45.5% | 47.2% |
🔁 Cross-Dataset Transfer — Does Synthetic Training Generalize to Real Data?
To validate that LingxiDiag-16K encodes clinically meaningful knowledge (not just surface statistics), models fine-tuned on synthetic data were evaluated on real clinical cases:
| Model | 12-class Acc (Real, Zero-shot) | 12-class Acc (Real, +LoRA SFT) | Gain |
| --- | --- | --- | --- |
| Qwen3-8B | 4.1% | 41.4% | +37.3% |
| Qwen3-32B | 20.4% | 39.7% | +19.3% |
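The Gain column reads as an absolute difference in percentage points rather than a relative improvement. A short check of that arithmetic, using only the numbers from the table (the variable names are illustrative):

```python
# (zero-shot accuracy %, accuracy % after LoRA SFT) on the real clinical set.
results = {
    "Qwen3-8B": (4.1, 41.4),
    "Qwen3-32B": (20.4, 39.7),
}

gains = {}
for model, (zero_shot, fine_tuned) in results.items():
    # Absolute gain in percentage points, matching the table's Gain column.
    gains[model] = round(fine_tuned - zero_shot, 1)
    # Relative factor, shown only for contrast with the absolute gain.
    factor = fine_tuned / zero_shot
    print(f"{model}: +{gains[model]} points ({factor:.1f}x zero-shot)")
```

Read this way, the smaller model gains more in absolute terms largely because its zero-shot starting point is much lower.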
The authors emphasize that this benchmark is for research purposes only and must not be deployed in clinical settings without rigorous validation and human oversight.