Hugging Face Daily Papers · May 21, 2026 · 5 min read

LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

LLMEval-Logic is a forward-authored Chinese logical reasoning benchmark, Z3-audited end-to-end. Items are written from real situational scenarios (not back-templated from formulas), every Base item is paired with a gold first-order-logic formalization that is double-checked by the Z3 SMT solver plus an expert rubric, and selected items are then adversarially hardened by a closed-loop agent workflow.

We evaluate 14 frontier LLMs across 7 families under thinking / no-thinking. Headline result: the strongest model (Gemini 3.1 Pro, thinking) reaches only 37.5% Item Accuracy on the Hard subset, with substantial Sub-Q → Item gaps showing models can solve individual sub-questions but fail to maintain a coherent closed candidate space across chained queries. We also find a clear Base ↔ Hard rank inversion among thinking variants (Spearman ρ = −0.61).

Released as 80% public + 20% private contamination-resistant holdout, with code, dataset, and rubrics:

Dataset: https://huggingface.co/datasets/llmeval-fdu/LLMEval-Logic
Code: https://github.com/llmeval/LLMEval-Logic
Project: https://llmeval.com/
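The two quantities behind the headline numbers, the Sub-Q → Item gap and the Spearman rank correlation, can be illustrated with a small sketch. It assumes that an item counts as correct only when every one of its sub-questions is answered correctly; the source does not spell out the scoring rule, so treat that as an assumption. All function names and the toy data are hypothetical and are not taken from the released code.

```python
from typing import Dict, List, Sequence


def sub_question_accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of individual sub-questions answered correctly."""
    flat = [ok for answers in results.values() for ok in answers]
    return sum(flat) / len(flat)


def item_accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of items whose sub-questions are ALL correct.

    Assumption: all-or-nothing scoring per item (not confirmed by the source).
    """
    return sum(all(answers) for answers in results.values()) / len(results)


def spearman_rho(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman rank correlation for score lists without ties."""
    def ranks(values: Sequence[float]) -> List[int]:
        order = sorted(range(len(values)), key=lambda i: values[i])
        out = [0] * len(values)
        for rank, index in enumerate(order, start=1):
            out[index] = rank
        return out

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Toy data (made up): per-item correctness of each chained sub-question.
toy_results = {
    "item_1": [True, True, True, False, True],
    "item_2": [True, True, True, True],
    "item_3": [True, False, True, True, True],
}

if __name__ == "__main__":
    print(f"Sub-Q accuracy: {sub_question_accuracy(toy_results):.3f}")
    print(f"Item accuracy:  {item_accuracy(toy_results):.3f}")
```

On the toy data, 12 of 14 sub-questions are correct but only 1 of 3 items is fully correct, which is the kind of gap the comment describes. A negative `spearman_rho` between Base and Hard scores would mean that models ranked higher on one subset tend to rank lower on the other.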

arxiv:2605.19597

Published on May 19 · Submitted by Ming Zhang on May 21 · LLMEval Official Team

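Authors: Ming Zhang, Qiyuan Peng, Yinxi Wei, Yujiong Shen, Kexin Tan, Yuhui Wang, Zhenghao Xiang, Junjie Ye, Zhangyue Yin, Zhiheng Xi, Shihan Dou, Tao Gui, Maxm Pan, Ruizhi Yang, Qi Zhang, Xuanjing Huang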

AI-generated summary

A Chinese logical reasoning benchmark for large language models is introduced, featuring expert-verified natural-language items with formal annotations and adversarial hardening to better evaluate rule-governed reasoning capabilities.

Abstract

Evaluating large language models (LLMs) on natural-language logical reasoning is essential because rule-governed tasks require conclusions to follow strictly from stated premises. Many existing logical-reasoning benchmarks are generated by templating natural-language items from sampled formulas, provide only coarse or unaudited formal annotations, and are now quickly saturated by frontier reasoning models. We present LLMEval-Logic, a Chinese logical reasoning benchmark built from realistic situational scenarios. Its pipeline forward-authors and expert-audits natural-language items together with their reference formalizations, verifies annotated answers with Z3, constructs expert rubrics for natural-to-formal grading, and hardens selected items through a closed-loop adversarial workflow. The benchmark is released in two paired subsets: a 246-item Base subset shipped with 1,400 expert-developed rubric atoms, and a 190-item Hard subset with 938 multi-step sub-questions over closed model spaces. Evaluating 14 frontier LLMs on LLMEval-Logic reveals substantial gaps in current models: the best model reaches only 37.5% Hard Item Accuracy, and even with reference symbols the highest joint Z3+Rubric formalization score among evaluated models reaches only 60.16%. Our benchmark is publicly available at https://github.com/llmeval/LLMEval-Logic.
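To make the idea of verifying an annotated answer against a formalization concrete, here is a minimal sketch. It does not use Z3: brute-force enumeration over a small propositional space stands in for the solver, which would typically establish entailment by checking that the premises together with the negated conclusion are unsatisfiable. The symbols, premises, and answer labels below are hypothetical and are not drawn from the benchmark.

```python
from itertools import product
from typing import Callable, Dict, List

Model = Dict[str, bool]
Formula = Callable[[Model], bool]


def models(symbols: List[str], premises: List[Formula]) -> List[Model]:
    """Enumerate every truth assignment that satisfies all premises."""
    satisfying = []
    for values in product([False, True], repeat=len(symbols)):
        candidate = dict(zip(symbols, values))
        if all(premise(candidate) for premise in premises):
            satisfying.append(candidate)
    return satisfying


def classify(symbols: List[str], premises: List[Formula], conclusion: Formula) -> str:
    """Label a conclusion relative to the closed space of premise models."""
    candidates = models(symbols, premises)
    if not candidates:
        return "inconsistent"      # premises have no model at all
    truth_values = {conclusion(m) for m in candidates}
    if truth_values == {True}:
        return "entailed"          # true in every model
    if truth_values == {False}:
        return "contradicted"      # false in every model
    return "undetermined"          # true in some models, false in others


# Hypothetical scenario: a, b, c stand for three staff members being on duty.
SYMBOLS = ["a", "b", "c"]
PREMISES: List[Formula] = [
    lambda m: (not m["a"]) or m["b"],   # if a is on duty, so is b
    lambda m: not (m["b"] and m["c"]),  # b and c are never on duty together
    lambda m: m["a"] or m["c"],         # at least one of a, c is on duty
]

if __name__ == "__main__":
    print(models(SYMBOLS, PREMISES))
    print(classify(SYMBOLS, PREMISES, lambda m: m["a"] != m["c"]))
```

In this toy scenario exactly two assignments survive the premises, so "exactly one of a and c is on duty" is entailed while "a is on duty" is undetermined. An annotated gold answer would pass the check only if it matches the label computed from the formalization.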