Hugging Face Daily Papers · 5 min read

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agents can complete realistic long-horizon work in the runtimes where they are deployed. This work presents WildClawBench, a native-runtime benchmark of 60 human-authored, bilingual, multimodal tasks spanning six thematic categories. Each task averages roughly 8 minutes of wall-clock time and over 20 tool calls, and runs inside a reproducible Docker container hosting an actual CLI agent harness (OpenClaw, Claude Code, Codex, or Hermes Agent) with access to real tools rather than mock services. Grading is hybrid, combining deterministic rule-based checks, environment-state auditing of side effects, and an LLM/VLM judge for semantic verification. Across 19 frontier models, the best, Claude Opus 4.7, reaches only 62.2% overall under OpenClaw, while every other model stays below 60%, and switching harness alone shifts a single model by up to 18 points. These results show that long-horizon, native-runtime agent evaluation remains a far-from-resolved task for current frontier models. 
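The hybrid grading described above can be sketched in a few lines. This is a hypothetical illustration, not WildClawBench's actual API: the function and parameter names (`grade_task`, `rule_check`, `expected_files`, `judge`) are invented for clarity, but the three-way combination of a deterministic rule check, an environment-state audit of side effects, and an LLM/VLM-judge verdict mirrors the scheme the abstract outlines.

```python
# Hypothetical sketch of a hybrid grader: all names are illustrative,
# not WildClawBench's real interface.
from dataclasses import dataclass
from pathlib import Path
from typing import Callable

@dataclass
class GradeReport:
    rule_pass: bool   # deterministic rule-based check
    state_pass: bool  # environment-state audit of side effects
    judge_pass: bool  # LLM/VLM judge for semantic verification

    @property
    def passed(self) -> bool:
        # A task counts as solved only if every grader agrees.
        return self.rule_pass and self.state_pass and self.judge_pass

def grade_task(
    workdir: Path,
    rule_check: Callable[[Path], bool],
    expected_files: list[str],
    judge: Callable[[str], bool],
    transcript: str,
) -> GradeReport:
    # 1. Deterministic check, e.g. a regex or exact match over an output file.
    rule_pass = rule_check(workdir)
    # 2. State audit: verify the side effects the task should have produced.
    state_pass = all((workdir / f).exists() for f in expected_files)
    # 3. Semantic verification of the agent transcript by a judge model
    #    (stubbed here as a plain callable).
    judge_pass = judge(transcript)
    return GradeReport(rule_pass, state_pass, judge_pass)
```

In a real harness the `judge` callable would wrap an LLM/VLM API call over the agent's transcript and any screenshots; a conjunction of all three signals is one plausible aggregation, though the paper's exact scoring rule may differ.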
We release the tasks, code, and containerized tooling to support reproducible evaluation.
arxiv:2605.10912


Published on May 11 · Submitted by Shuangrui Ding on May 15
Authors: Shuangrui Ding, Xuanlang Dai, Long Xing, Shengyuan Ding, Ziyu Liu, Yang JingYi, Penghui Yang, Zhixiong Zhang, Xilin Wei, Xinyu Fang, Yubo Ma, Haodong Duan, Jing Shao, Jiaqi Wang, Dahua Lin, Kai Chen, Yuhang Zang

AI-generated summary

WildClawBench evaluates language and vision-language models on realistic long-horizon tasks using actual CLI environments with real tools instead of synthetic sandboxes.

Community

Paper submitter · about 21 hours ago

GitHub repo: https://github.com/InternLM/WildClawBench
Leaderboard: https://internlm.github.io/WildClawBench/


Get this paper in your agent:

hf papers read 2605.10912
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash


