Hugging Face Daily Papers · June 11, 2026 · 5 min read

ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


Combinatorics is central to Olympiad-level mathematical problem solving, requiring deep discrete reasoning, creative constructions, and rigorous structural insight. Recent evidence suggests that even today's strongest frontier models remain uneven on Olympiad combinatorics, revealing a gap in creative mathematical reasoning. We introduce ComBench, an Olympiad-level combinatorics benchmark for evaluating and diagnosing the combinatorial reasoning capabilities of large language models. ComBench contains 100 human-annotated competition-level problems organized around two complementary settings: analysis-centric problems, which primarily require rigorous mathematical arguments, and construction-centric problems, which require explicit constructions in addition to correctness justifications. The evaluation protocol combines rubric-guided proof grading with deterministic construction verification, exposing cases where proof quality and construction validity diverge. Experiments on frontier open- and closed-source models show that ComBench is far from saturated: the strongest model reaches 65.4% overall Avg. and 75.3% overall Best@4. 
We further find that Rigorous Proof Reasoning and Constructive Realization are distinct capabilities: Kimi-K2.6 trails GPT-5.5 on analysis-centric proof grading but surpasses it on construction-centric Best@4, while Existence and Construction problems remain consistently hardest across representative frontier models.

Project page: https://simplified-reasoning.github.io/ComBench
Code: https://github.com/Simplified-Reasoning/ComBench
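The abstract reports two aggregate numbers, Avg. and Best@4, without defining them here. The sketch below shows one plausible reading, offered as an assumption rather than the authors' actual scoring code: each problem is attempted four times, every attempt receives a score in [0, 1], Avg. is the mean score per problem averaged over problems, and Best@4 is the best of the four scores averaged over problems. The function name `avg_and_best_at_k` and the example scores are hypothetical.

```python
from statistics import mean


def avg_and_best_at_k(scores_per_problem, k=4):
    """Return (Avg, Best@k) as percentages.

    scores_per_problem: one inner list per problem, holding the scores
    (assumed to lie in [0, 1]) of k independent attempts at that problem.
    This aggregation rule is an assumption, not the paper's definition.
    """
    per_problem_avg = []
    per_problem_best = []
    for scores in scores_per_problem:
        if len(scores) != k:
            raise ValueError(f"expected {k} attempts, got {len(scores)}")
        per_problem_avg.append(mean(scores))   # average over the k attempts
        per_problem_best.append(max(scores))   # best of the k attempts
    # Average the per-problem values over the whole problem set.
    return 100 * mean(per_problem_avg), 100 * mean(per_problem_best)


# Hypothetical scores for three problems, four attempts each.
example_scores = [
    [1.0, 0.5, 0.0, 1.0],
    [0.0, 0.0, 0.5, 0.0],
    [1.0, 1.0, 1.0, 1.0],
]
avg, best = avg_and_best_at_k(example_scores)
print(f"Avg: {avg:.1f}%  Best@4: {best:.1f}%")
```

Under this reading, Best@4 can never fall below Avg., which is consistent with the reported 75.3% versus 65.4%; the gap between the two indicates how much attempts at the same problem vary.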


arxiv:2606.10479


Published on Jun 9 · Submitted by Haoran Zhang on Jun 11 · Simplified Reasoning


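Authors:

Shunkai Zhang, Haoran Zhang, Yun Luo, Qianjia Cheng, Haodi Lei, Yizhuo Li, Runzhe Zhan, Zhilin Wang, Bangjie Xu, Yucheng Su, Xinmiao Han, Xiaoye Qu, Dongrui Liu, Zhouchen Lin, Yu Qiao, Ning Ding, Yafu Li, Yu Cheng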

Abstract

A new benchmark called ComBench is introduced to evaluate large language models' combinatorial reasoning abilities through Olympiad-level problems that test both proof construction and explicit mathematical constructions.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct