Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks
Authors: Mengyu Zheng, Kai Han, Boxun Li, Haiyang Xu, Yuchuan Tian, Wei He, Hang Zhou, Jianyuan Guo, Hailin Hu, Lin Ma, Chao Xu, Guohao Dai, Lixue Xia, Yunchao Wei, Yunhe Wang, Yu Wang
Published: 2026-06-10
Abstract
A new benchmark and adapter protocol called Claw-SWE-Bench enables fair comparison of diverse coding agents by standardizing evaluation conditions and revealing the importance of adapter design for effective code generation.
General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. We introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous agent harnesses, or claws, comparable under shared conditions: a fixed prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator. The full benchmark contains 350 GitHub issue-resolution instances across 8 languages and 43 repositories, drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini after future-commit cleanup. We also release Claw-SWE-Bench Lite, an 80-instance subset for faster validation, selected by a cost-aware, rank-aware procedure over 17 calibration columns. On the full benchmark, OpenClaw with a minimal direct-diff adapter scores only 19.1% Pass@1, whereas the full adapter reaches 73.4% with the same GLM 5.1 backbone, showing that adapter design is essential for enabling OpenClaw-style harnesses to perform coding tasks effectively. Across an OpenClaw × nine-model sweep and a five-claw × two-model sweep, model choice changes Pass@1 by 29.4 pp and harness choice by 27.4 pp under fixed models; systems with similar accuracy can differ substantially in total API cost. Claw-SWE-Bench therefore treats harness and cost accounting as first-class axes of SWE-style coding-agent evaluation, providing both a full benchmark and a low-cost reference set for reproducible comparison. The data is available at https://github.com/opensquilla/claw-swe-bench and https://huggingface.co/datasets/TokenRhythm/Claw-SWE-Bench.
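To make the scoring vocabulary in the abstract concrete, here is a minimal Python sketch of what a harness adapter must ultimately hand to an evaluator, and how Pass@1, percentage-point gaps, and total API cost could be tallied. It is not the Claw-SWE-Bench implementation. The prediction keys (`instance_id`, `model_name_or_path`, `model_patch`) follow the standard SWE-bench predictions format; the `RunResult` class, its `api_cost_usd` field, and all helper names are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunResult:
    """One harness run on one benchmark instance (hypothetical structure)."""
    instance_id: str
    patch: str            # unified diff extracted from the workspace
    resolved: bool        # verdict returned by the evaluator
    api_cost_usd: float   # assumed cost field, not taken from the paper


def to_prediction(result: RunResult, model_name: str) -> Dict[str, str]:
    """Build a record in the standard SWE-bench predictions format."""
    return {
        "instance_id": result.instance_id,
        "model_name_or_path": model_name,
        "model_patch": result.patch,
    }


def pass_at_1(results: List[RunResult]) -> float:
    """Percentage of instances resolved with a single attempt each."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(r.resolved for r in results) / len(results)


def pp_gap(low: float, high: float) -> float:
    """Difference between two percentages, in percentage points."""
    return round(high - low, 1)


def total_cost(results: List[RunResult]) -> float:
    """Total API spend across all runs, resolved or not."""
    return sum(r.api_cost_usd for r in results)


# Hypothetical toy data: three of four instances resolved.
demo = [
    RunResult("repo__a-1", "diff --git a/x b/x\n", True, 0.40),
    RunResult("repo__a-2", "diff --git a/y b/y\n", True, 0.25),
    RunResult("repo__b-1", "", False, 0.90),
    RunResult("repo__b-2", "diff --git a/z b/z\n", True, 0.45),
]
```

On the toy data, `pass_at_1(demo)` is 75.0 and `total_cost(demo)` is about 2.0. Applying `pp_gap` to the two adapter scores quoted in the abstract, `pp_gap(19.1, 73.4)`, gives 54.3 pp. Keeping cost next to the resolved flag in the same record is one simple way to compare systems whose accuracy is similar but whose spend is not.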
Paper: arxiv.org/abs/2606.12344