ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics
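Authors: Shunkai Zhang, Haoran Zhang, Yun Luo, Qianjia Cheng, Haodi Lei, Yizhuo Li, Runzhe Zhan, Zhilin Wang, Bangjie Xu, Yucheng Su, Xinmiao Han, Xiaoye Qu, Dongrui Liu, Zhouchen Lin, Yu Qiao, Ning Ding, Yafu Li, Yu Cheng
Published: 2026-06-09
Project page: https://simplified-reasoning.github.io/ComBench
Code: https://github.com/Simplified-Reasoning/ComBench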
Abstract
ComBench is a new benchmark that evaluates large language models' combinatorial reasoning on Olympiad-level problems, testing both rigorous proof writing and explicit mathematical constructions.
Combinatorics is central to Olympiad-level mathematical problem solving, requiring deep discrete reasoning, creative constructions, and rigorous structural insight. Recent evidence suggests that even today's strongest frontier models remain uneven on Olympiad combinatorics, revealing a gap in creative mathematical reasoning. We introduce ComBench, an Olympiad-level combinatorics benchmark for evaluating and diagnosing the combinatorial reasoning capabilities of large language models. ComBench contains 100 human-annotated competition-level problems organized around two complementary settings: analysis-centric problems, which primarily require rigorous mathematical arguments, and construction-centric problems, which require explicit constructions in addition to correctness justifications. The evaluation protocol combines rubric-guided proof grading with deterministic construction verification, exposing cases where proof quality and construction validity diverge. Experiments on frontier open- and closed-source models show that ComBench is far from saturated: the strongest model reaches 65.4% overall Avg. and 75.3% overall Best@4. We further find that Rigorous Proof Reasoning and Constructive Realization are distinct capabilities: Kimi-K2.6 trails GPT-5.5 on analysis-centric proof grading but surpasses it on construction-centric Best@4, while Existence and Construction problems remain consistently hardest across representative frontier models.
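To make the two evaluation ingredients concrete, here is a minimal sketch of what deterministic construction verification and the Avg. / Best@4 aggregation could look like. It is not ComBench's actual code: the toy problem (a subset of {1, ..., n} with no three-term arithmetic progression), the function names `verify_ap_free` and `avg_and_best_at_k`, and the reading of Best@4 as "the best score among four attempts per problem, averaged over problems" are all assumptions made for illustration.

```python
from itertools import combinations
from statistics import mean


def verify_ap_free(candidate, n, min_size):
    """Deterministically check a claimed construction (hypothetical toy task).

    The claim: `candidate` is a subset of {1, ..., n} with at least
    `min_size` elements and no three-term arithmetic progression.
    """
    elems = set(candidate)
    if len(elems) != len(candidate):
        return False  # repeated elements are not a valid set
    if any(not isinstance(x, int) or x < 1 or x > n for x in elems):
        return False  # element outside the allowed range
    if len(elems) < min_size:
        return False  # construction is too small
    # For every pair a < c, a progression a, b, c exists iff the
    # midpoint b = (a + c) / 2 is an integer that is also in the set.
    for a, c in combinations(sorted(elems), 2):
        if (a + c) % 2 == 0 and (a + c) // 2 in elems:
            return False
    return True


def avg_and_best_at_k(scores_per_problem):
    """Aggregate per-attempt scores (assumed in [0, 1]) over problems.

    Avg.   : mean over problems of the mean attempt score.
    Best@k : mean over problems of the best attempt score
             (assumed interpretation; k is the number of attempts given).
    """
    avg = mean(mean(attempts) for attempts in scores_per_problem)
    best = mean(max(attempts) for attempts in scores_per_problem)
    return avg, best


if __name__ == "__main__":
    print(verify_ap_free([1, 2, 4, 5], n=5, min_size=4))
    print(verify_ap_free([1, 2, 3], n=5, min_size=3))
    print(avg_and_best_at_k([[0, 1, 0, 1], [0, 0, 0, 0]]))
```

A checker of this kind returns the same verdict on every run and does not depend on how convincing the accompanying proof text is, which is how a protocol can surface cases where proof quality and construction validity diverge.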
arXiv: arxiv.org/abs/2606.10479