IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs
Authors: Songlin Bai, Xintong Wang, Linlin Yu, Bin Chen, Zhiang Xu, Yuyang Sheng, Changtong Zan, Xiaofeng Zhu, Yizhe Zhang, Jiru Li, Mingze Guo, Ling Zou, Yalong Li, Chengfu Huo, Liang Ding
Abstract
AI-generated summary
IndustryBench evaluates industrial procurement question answering systems in Chinese against national standards, revealing significant gaps in safety compliance and highlighting the need for safety-aware assessment beyond standard accuracy metrics.
In industrial procurement, an LLM answer is useful only if it survives a standards check: recommended material must match operating condition, every parameter must respect a regulated threshold, and no procedure may contradict a safety clause. Partial correctness can mask safety-critical contradictions that aggregate LLM benchmarks rarely capture. We introduce IndustryBench, a 2,049-item benchmark for industrial procurement QA in Chinese, grounded in Chinese national standards (GB/T) and structured industrial product records, organized by seven capability dimensions, ten industry categories, and panel-derived difficulty tiers, with item-aligned English, Russian, and Vietnamese renderings. Our construction pipeline rejects 70.3% of LLM-generated candidates at a search-based external-verification stage, calibrating how unreliable industrial QA remains after LLM-only filtering. Our evaluation decouples raw correctness, scored by a Qwen3-Max judge validated at κ_w = 0.798 against a domain expert, from a separate safety-violation (SV) check against source texts. Across 17 models in Chinese and an 8-model intersection over four languages, we find: (i) the best system reaches only 2.083 on the 0–3 rubric, leaving substantial headroom; (ii) Standards & Terminology is the most persistent capability weakness and survives item-aligned translation; (iii) extended reasoning lowers safety-adjusted scores for 12 of 13 models, primarily by introducing unsupported safety-critical details into longer final answers; and (iv) safety-violation rates reshuffle the leaderboard: GPT-5.4 climbs from rank 6 to rank 3 after SV adjustment, while Kimi-k2.5-1T-A32B drops seven positions. Industrial LLM evaluation therefore requires source-grounded, safety-aware diagnosis rather than aggregate accuracy. We release IndustryBench with all prompts, scoring scripts, and dataset documentation.
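To illustrate the judge-validation step described above, the sketch below computes a weighted Cohen's kappa between an LLM judge and a human expert on the 0–3 rubric. It is a minimal illustration only: the quadratic weighting, the scikit-learn call, and the toy score lists are assumptions made here for demonstration, not the paper's actual validation script (which reports κ_w = 0.798 against a domain expert).

```python
# Hedged sketch: validating an LLM judge against a human expert with weighted kappa.
# The quadratic weighting and the toy score lists are illustrative assumptions;
# the abstract reports kappa_w = 0.798 but does not specify the weighting scheme.
from sklearn.metrics import cohen_kappa_score

# Per-item scores on the 0-3 correctness rubric (toy data, not from the benchmark).
judge_scores  = [3, 2, 2, 0, 1, 3, 2, 1, 0, 2]
expert_scores = [3, 2, 1, 0, 1, 3, 3, 1, 0, 2]

# Weighted kappa penalizes large disagreements (e.g. 0 vs 3) more than adjacent ones.
kappa_w = cohen_kappa_score(judge_scores, expert_scores, weights="quadratic")
print(f"Weighted kappa between judge and expert: {kappa_w:.3f}")
```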
Community
We are excited to share IndustryBench, a new benchmark designed by the Multimodal and Industrial AI team at Alibaba to test the true industrial knowledge boundaries of LLMs.
While models have become highly capable at general-purpose QA, B2B industrial procurement demands strict adherence to safety requirements and national standards. An LLM can give a highly fluent, partially correct answer that still recommends the wrong material grade or violates a national standard, a critical failure in the real world.
Key highlights from our research:
- 🏭 Grounded in Reality: 2,049 items based on Chinese national standards (GB/T) and real industrial product records (evaluated across ZH, EN, RU, and VI).
- ⚠️ The "Overthinking" Trap: Surprisingly, we found that enabling extended reasoning (thinking mode) actually lowers safety-adjusted scores for 12 out of 13 tested models! Longer answers tend to introduce unsupported, safety-critical hallucinations.
- 📏 Standards & Terminology: This remains the most persistent weakness across all 17 evaluated models (including frontier models from Google, OpenAI, Anthropic, and the Qwen family).
- ⚖️ New Evaluation Paradigm: We decouple raw correctness from strict safety-violation (SV) checks to give a much clearer picture of actual deployability (a toy sketch of this adjustment follows below).
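To make the decoupled scoring concrete, here is a minimal sketch of how per-item rubric scores and safety-violation flags could be combined into a safety-adjusted leaderboard. The zero-credit-on-violation rule and the toy numbers below are illustrative assumptions, not the official IndustryBench scoring logic.

```python
# Hedged sketch of SV-adjusted ranking. The zero-credit-on-violation rule and the
# toy numbers are illustrative assumptions, not the official scoring script.
from statistics import mean

# Per-model results: each item has a 0-3 rubric score and a safety-violation flag.
results = {
    "model_a": [(3, False), (2, False), (3, True),  (2, False)],
    "model_b": [(2, False), (2, False), (2, False), (2, False)],
}

def raw_score(items):
    # Average rubric score, ignoring safety violations.
    return mean(score for score, _ in items)

def sv_adjusted_score(items):
    # Assumption: an answer that violates a safety clause gets no credit.
    return mean(0 if violated else score for score, violated in items)

for name, items in results.items():
    print(f"{name}: raw={raw_score(items):.2f}  sv-adjusted={sv_adjusted_score(items):.2f}")
```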
Raw accuracy isn't enough when safety is on the line. We invite the community to explore the dataset and see how current models handle strict industrial constraints! Dataset: https://huggingface.co/datasets/alibaba-multimodal-industrial-ai/IndustryBench · Code: https://github.com/alibaba-multimodal-industrial-ai/IndustryBench