LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening
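Authors: Ming Zhang, Qiyuan Peng, Yinxi Wei, Yujiong Shen, Kexin Tan, Yuhui Wang, Zhenghao Xiang, Junjie Ye, Zhangyue Yin, Zhiheng Xi, Shihan Dou, Tao Gui, Maxm Pan, Ruizhi Yang, Qi Zhang, Xuanjing Huang
Published on May 19, 2026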
AI-generated summary
A Chinese logical reasoning benchmark for large language models is introduced, featuring expert-verified natural-language items with formal annotations and adversarial hardening to better evaluate rule-governed reasoning capabilities.
Abstract
Evaluating large language models (LLMs) on natural-language logical reasoning is essential because rule-governed tasks require conclusions to follow strictly from stated premises. Many existing logical-reasoning benchmarks are generated by templating natural-language items from sampled formulas, provide only coarse or unaudited formal annotations, and are now quickly saturated by frontier reasoning models. We present LLMEval-Logic, a Chinese logical reasoning benchmark built from realistic situational scenarios. Its pipeline forward-authors and expert-audits natural-language items together with their reference formalizations, verifies annotated answers with Z3, constructs expert rubrics for natural-to-formal grading, and hardens selected items through a closed-loop adversarial workflow. The benchmark is released in two paired subsets: a 246-item Base subset shipped with 1,400 expert-developed rubric atoms, and a 190-item Hard subset with 938 multi-step sub-questions over closed model spaces. Evaluating 14 frontier LLMs on LLMEval-Logic reveals substantial gaps in current models: the best model reaches only 37.5% Hard Item Accuracy, and even with reference symbols the highest joint Z3+Rubric formalization score among evaluated models reaches only 60.16%. Our benchmark is publicly available at https://github.com/llmeval/LLMEval-Logic.
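To make the idea of a solver-verified answer concrete, here is a minimal sketch in plain Python. It is not the benchmark's pipeline and it does not call Z3: exhaustive truth-table enumeration stands in for the SMT check, which is only feasible for tiny propositional toy cases. The scenario, the variable names, and the helpers `entails` and `label` are all hypothetical.

```python
from itertools import product


def entails(variables, premises, conclusion):
    """True if every assignment satisfying all premises also satisfies the conclusion.

    Note: if the premises are jointly unsatisfiable, this returns True for any
    conclusion (vacuous entailment), so premises should be checked for
    consistency first in a real audit.
    """
    for values in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, values))
        if all(p(world) for p in premises) and not conclusion(world):
            return False  # found a counterexample world
    return True


def label(variables, premises, claim):
    """Classify a claim as 'entailed', 'contradicted', or 'undetermined'."""
    if entails(variables, premises, claim):
        return "entailed"
    if entails(variables, premises, lambda w: not claim(w)):
        return "contradicted"
    return "undetermined"


# Hypothetical toy scenario: "If it rains, the match is cancelled. It rains."
variables = ["rain", "cancelled", "refund"]
premises = [
    lambda w: (not w["rain"]) or w["cancelled"],  # rain -> cancelled
    lambda w: w["rain"],
]

print(label(variables, premises, lambda w: w["cancelled"]))      # entailed
print(label(variables, premises, lambda w: not w["cancelled"]))  # contradicted
print(label(variables, premises, lambda w: w["refund"]))         # undetermined
```

An annotated answer would pass this kind of check only if its label matches what the premises actually force. An SMT solver such as Z3 plays the same role for first-order formalizations, where enumeration is not an option.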
Community
LLMEval-Logic is a forward-authored Chinese logical reasoning benchmark, Z3-audited end-to-end. Items are written from real situational scenarios (not back-templated from formulas), every Base item is paired with a gold first-order-logic formalization that is double-checked by the Z3 SMT solver plus an expert rubric, and selected items are then adversarially hardened by a closed-loop agent workflow.
We evaluate 14 frontier LLMs across 7 families in thinking and no-thinking modes. Headline result: the strongest model (Gemini 3.1 Pro, thinking) reaches only 37.5% Item Accuracy on the Hard subset. The substantial gaps between sub-question (Sub-Q) accuracy and Item Accuracy show that models can solve individual sub-questions but fail to maintain a coherent closed candidate space across chained queries. We also find a clear Base ↔ Hard rank inversion among thinking variants (Spearman ρ = −0.61).
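The Sub-Q → Item gap can be illustrated with a small sketch. It assumes, as a labeled assumption rather than the paper's stated definition, that an item counts as correct only when every one of its sub-questions is answered correctly. The `results` data is made up for illustration and the function names are hypothetical.

```python
def sub_question_accuracy(results):
    """Fraction of individual sub-questions answered correctly."""
    total = sum(len(flags) for flags in results.values())
    correct = sum(sum(flags) for flags in results.values())
    return correct / total


def item_accuracy(results):
    """Fraction of items whose sub-questions are ALL correct (assumed definition)."""
    return sum(all(flags) for flags in results.values()) / len(results)


# Made-up per-item correctness flags, one boolean per sub-question.
results = {
    "item_1": [True, True, True, True],
    "item_2": [True, True, False, True],
    "item_3": [True, False, True, True],
    "item_4": [True, True, True, False],
}

print(sub_question_accuracy(results))  # 0.8125
print(item_accuracy(results))          # 0.25
```

Under this assumed all-or-nothing scoring, a single slip anywhere in a chain of queries zeroes out the item, so a model can look strong per sub-question and still score low per item.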
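Released as an 80% public split plus a 20% private, contamination-resistant holdout, with code, dataset, and rubrics:
Dataset: https://huggingface.co/datasets/llmeval-fdu/LLMEval-Logic
Code: https://github.com/llmeval/LLMEval-Logic
Project: https://llmeval.com/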