Hugging Face Daily Papers · 8 min read

DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.08614

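Published on May 9, 2026 · Submitted by Dhaval Patel on May 18, 2026
Authors: Devin Yasith De Silva, Dhaval Patel, Christodoulos Constantinides, Shuxin Lin, Nianjun Zhou, Paul J Adams, Sal Rosato, Nicolas Constantinides, Deborah L. McGuinness, Jayant Kalagnanam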

Abstract

AI-generated summary: Large language models struggle to translate industrial monitoring rules into maintenance actions due to brittleness and pattern-matching behaviors, despite achieving high performance on structured benchmarks.

Monitoring complex industrial assets relies on engineer-authored symbolic rules that trigger based on sensor conditions and prompt technicians to perform corrective actions. The bottleneck is not detection but response: translating rules into maintenance steps requires asset-specific knowledge gained through years of practice. We investigate whether LLMs can serve as decision support for this rule-to-action step and introduce DiagnosticIQ, a benchmark of 6,690 expert-validated multiple-choice questions from 118 rule-action pairs across 16 asset types. We contribute (i) a symbolic-to-MCQA pipeline normalizing rules to Disjunctive Normal Form with embedding-based distractor sampling, (ii) five variants probing distinct failure modes (Pro, Pert, Verbose, Aug, Rationale), and (iii) a benchmark of 29 LLMs and 4 embedding baselines. A human evaluation (9 practitioners, mean 45.0%) confirms that DiagnosticIQ requires specialist knowledge beyond operational experience. Three findings stand out. The frontier has closed: the top three LLMs lie within one Macro point, with Bradley-Terry Elo placing claude-opus-4-6 30 points above the next model. Yet DiagnosticIQ Pro exposes brittleness, with every model losing 13–60% relative accuracy under distractor expansion. DiagnosticIQ Aug exposes pattern-matching: under condition inversion, frontier models still select the original answer 49–63% of the time. The deployment bottleneck is not capability but calibration: frontier models handle template-style fault detection but break under structural perturbation.
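The abstract names Disjunctive Normal Form (DNF) normalization but does not describe how it is implemented, so the following is only a minimal sketch of the general technique, not the authors' pipeline. The nested-tuple rule encoding, the function names `to_nnf` and `to_dnf`, and the sensor conditions in the example are all hypothetical and chosen purely for illustration.

```python
from itertools import product

# Hypothetical rule encoding (an assumption, not taken from the paper):
#   ("atom", name), ("not", rule), ("and", r1, r2, ...), ("or", r1, r2, ...)

def to_nnf(rule, negate=False):
    """Push negations down to the atoms (negation normal form)."""
    kind = rule[0]
    if kind == "atom":
        return ("not", rule) if negate else rule
    if kind == "not":
        return to_nnf(rule[1], not negate)
    if negate:
        # De Morgan: negating a conjunction gives a disjunction and vice versa.
        kind = "or" if kind == "and" else "and"
    return (kind, *[to_nnf(child, negate) for child in rule[1:]])

def to_dnf(rule):
    """Return the rule as a list of conjunctions.

    Each conjunction is a frozenset of literals (name, polarity).
    """
    def dnf(r):
        kind = r[0]
        if kind == "atom":
            return [frozenset({(r[1], True)})]
        if kind == "not":  # after to_nnf, only atoms are negated
            return [frozenset({(r[1][1], False)})]
        parts = [dnf(child) for child in r[1:]]
        if kind == "or":
            return [conj for part in parts for conj in part]
        # "and": distribute the conjunction over the children's disjunctions.
        return [frozenset().union(*combo) for combo in product(*parts)]

    result, seen = [], set()
    for conj in dnf(to_nnf(rule)):
        names = [name for name, _ in conj]
        if len(names) != len(set(names)):
            continue  # contradictory clause such as (x AND NOT x)
        if conj not in seen:
            seen.add(conj)
            result.append(conj)
    return result

# Illustrative rule: temp > 80 AND (pressure < 2 OR NOT fan_on)
rule = ("and",
        ("atom", "temp > 80"),
        ("or", ("atom", "pressure < 2"), ("not", ("atom", "fan_on"))))

for conj in to_dnf(rule):
    print(" AND ".join(sorted(n if pos else f"NOT {n}" for n, pos in conj)))
```

Once a rule is flattened this way, each conjunction is an independent trigger condition, which is one plausible reason a DNF form would be convenient for generating one question per condition set.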
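As a rough aid for reading the headline numbers, the sketch below converts an Elo gap into an expected head-to-head win probability and computes a relative accuracy drop. It assumes the conventional logistic Elo scale (base 10, 400-point divisor), which the abstract does not confirm, and the accuracy values passed to `relative_accuracy_drop` are made up for illustration rather than taken from the paper.

```python
def elo_win_probability(rating_gap, scale=400.0):
    """Expected win probability of the higher-rated model.

    Assumes the standard logistic Elo formula; the paper's exact
    Bradley-Terry-to-Elo scaling is not stated in the abstract.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))

def relative_accuracy_drop(base_acc, perturbed_acc):
    """Fraction of the baseline accuracy lost under a perturbation."""
    if base_acc <= 0:
        raise ValueError("base_acc must be positive")
    return (base_acc - perturbed_acc) / base_acc

# A 30-point gap under the assumed scale is only a slight edge.
print(f"{elo_win_probability(30):.3f}")

# Hypothetical accuracies: 0.80 on the base set, 0.60 after distractor expansion.
print(f"{relative_accuracy_drop(0.80, 0.60):.2%}")
```

Under these assumptions, a 30-point lead corresponds to winning roughly 54% of pairwise comparisons, which is consistent with the abstract's claim that the top models are closely matched.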

Community

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

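The following papers were recommended by the Semantic Scholar API:

* IndustryAssetEQA: A Neurosymbolic Operational Intelligence System for Embodied Question Answering in Industrial Asset Maintenance (2026), https://huggingface.co/papers/2604.23446
* PHMForge: Evaluating LLM Agents on Industrial Prognostics through MCP-Native, Algorithm-Grounded Tools (2026), https://huggingface.co/papers/2604.01532
* OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation (2026), https://huggingface.co/papers/2604.10866
* FactoryBench: Evaluating Industrial Machine Understanding (2026), https://huggingface.co/papers/2605.07675
* Kintsugi: Learning Policies by Repairing Executable Knowledge Bases (2026), https://huggingface.co/papers/2605.09487
* Towards Neuro-symbolic Causal Rule Synthesis, Verification, and Evaluation Grounded in Legal and Safety Principles (2026), https://huggingface.co/papers/2604.28087
* MonitoringBench: Semi-Automated Red-Teaming for Agent Monitoring (2026), https://huggingface.co/papers/2605.09684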
