IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs
Authors: Songlin Bai, Xintong Wang, Linlin Yu, Bin Chen, Zhiang Xu, Yuyang Sheng, Changtong Zan, Xiaofeng Zhu, Yizhe Zhang, Jiru Li, Mingze Guo, Ling Zou, Yalong Li, Chengfu Huo, Liang Ding
Abstract
AI-generated summary
IndustryBench evaluates industrial procurement question answering systems in Chinese against national standards, revealing significant gaps in safety compliance and highlighting the need for safety-aware assessment beyond standard accuracy metrics.
In industrial procurement, an LLM answer is useful only if it survives a standards check: recommended material must match operating condition, every parameter must respect a regulated threshold, and no procedure may contradict a safety clause. Partial correctness can mask safety-critical contradictions that aggregate LLM benchmarks rarely capture. We introduce IndustryBench, a 2,049-item benchmark for industrial procurement QA in Chinese, grounded in Chinese national standards (GB/T) and structured industrial product records, organized by seven capability dimensions, ten industry categories, and panel-derived difficulty tiers, with item-aligned English, Russian, and Vietnamese renderings. Our construction pipeline rejects 70.3% of LLM-generated candidates at a search-based external-verification stage, calibrating how unreliable industrial QA remains after LLM-only filtering. Our evaluation decouples raw correctness, scored by a Qwen3-Max judge validated at κ_w = 0.798 against a domain expert, from a separate safety-violation (SV) check against source texts. Across 17 models in Chinese and an 8-model intersection over four languages, we find: (i) the best system reaches only 2.083 on the 0–3 rubric, leaving substantial headroom; (ii) Standards & Terminology is the most persistent capability weakness and survives item-aligned translation; (iii) extended reasoning lowers safety-adjusted scores for 12 of 13 models, primarily by introducing unsupported safety-critical details into longer final answers; and (iv) safety-violation rates reshuffle the leaderboard: GPT-5.4 climbs from rank 6 to rank 3 after SV adjustment, while Kimi-k2.5-1T-A32B drops seven positions. Industrial LLM evaluation therefore requires source-grounded, safety-aware diagnosis rather than aggregate accuracy. We release IndustryBench with all prompts, scoring scripts, and dataset documentation.
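To illustrate the judge-validation step described above, the sketch below computes a weighted Cohen's kappa between an LLM judge and a human expert on the 0–3 rubric. It is a minimal illustration only: the quadratic weighting, the scikit-learn call, and the toy score lists are assumptions made here for demonstration, not the paper's actual validation script (which reports κ_w = 0.798 against a domain expert).

```python
# Hedged sketch: validating an LLM judge against a human expert with weighted kappa.
# The quadratic weighting and the toy score lists are illustrative assumptions;
# the abstract reports kappa_w = 0.798 but does not specify the weighting scheme.
from sklearn.metrics import cohen_kappa_score

# Per-item scores on the 0-3 correctness rubric (toy data, not from the benchmark).
judge_scores  = [3, 2, 2, 0, 1, 3, 2, 1, 0, 2]
expert_scores = [3, 2, 1, 0, 1, 3, 3, 1, 0, 2]

# Weighted kappa penalizes large disagreements (e.g. 0 vs 3) more than adjacent ones.
kappa_w = cohen_kappa_score(judge_scores, expert_scores, weights="quadratic")
print(f"Weighted kappa between judge and expert: {kappa_w:.3f}")
```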
Community
We are excited to share IndustryBench, a new benchmark designed by the Multimodal and Industrial AI team at Alibaba to test the true industrial knowledge boundaries of LLMs.
While models have become highly capable at general-purpose QA, B2B industrial procurement demands strict adherence to safety requirements and national standards. An LLM can give a highly fluent, partially correct answer that still recommends the wrong material grade or violates a national standard, a critical failure in the real world.
Key highlights from our research:
- 🏭 Grounded in Reality: 2,049 items based on Chinese national standards (GB/T) and real industrial product records (evaluated across ZH, EN, RU, and VI).
- ⚠️ The "Overthinking" Trap: Surprisingly, we found that enabling extended reasoning (thinking mode) actually lowers safety-adjusted scores for 12 out of 13 tested models! Longer answers tend to introduce unsupported, safety-critical hallucinations.
- 📏 Standards & Terminology: This remains the most persistent weakness across all 17 evaluated models (including frontier models from Google, OpenAI, Anthropic, and the Qwen family).
- ⚖️ New Evaluation Paradigm: We decouple raw correctness from strict safety-violation (SV) checks to give a much clearer picture of actual deployability (a toy sketch of this adjustment follows below).
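To make the decoupled scoring concrete, here is a minimal sketch of how per-item rubric scores and safety-violation flags could be combined into a safety-adjusted leaderboard. The zero-credit-on-violation rule and the toy numbers below are illustrative assumptions, not the official IndustryBench scoring logic.

```python
# Hedged sketch of SV-adjusted ranking. The zero-credit-on-violation rule and the
# toy numbers are illustrative assumptions, not the official scoring script.
from statistics import mean

# Per-model results: each item has a 0-3 rubric score and a safety-violation flag.
results = {
    "model_a": [(3, False), (2, False), (3, True),  (2, False)],
    "model_b": [(2, False), (2, False), (2, False), (2, False)],
}

def raw_score(items):
    # Average rubric score, ignoring safety violations.
    return mean(score for score, _ in items)

def sv_adjusted_score(items):
    # Assumption: an answer that violates a safety clause gets no credit.
    return mean(0 if violated else score for score, violated in items)

for name, items in results.items():
    print(f"{name}: raw={raw_score(items):.2f}  sv-adjusted={sv_adjusted_score(items):.2f}")
```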
Raw accuracy isn't enough when safety is on the line. We invite the community to explore the dataset and see how current models handle strict industrial constraints! Dataset: https://huggingface.co/datasets/alibaba-multimodal-industrial-ai/IndustryBench · Code: https://github.com/alibaba-multimodal-industrial-ai/IndustryBench