WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces
Authors: Wanli Li, Bowen Zhou, Yunyao Yu, Zhou Xu, Yifan Yang, Dongsheng Li, Caihua Shan
Published: 2026-06-08 (arXiv 2606.09426)
Code: https://github.com/weavebench/WeaveBench
Abstract
WeaveBench is a 114-task benchmark for computer-use agents that must combine GUI, CLI, and code operations within a single long-horizon trajectory. The best model-runtime pairing reaches only 41.2% PassRate, and a trajectory-aware judge shows that outcome-only grading overestimates agent performance.
Computer-use agents (CUAs) increasingly operate in runtimes that combine visual desktop control, command-line execution, code editing, browsers, and external tools. Existing benchmarks, however, often evaluate these interfaces as separable capabilities, leaving long-horizon cross-interface orchestration under-tested. Thus, we introduce WeaveBench, a long-horizon hybrid-interface benchmark with 114 tasks across 8 real-world work domains, grounded in real user requests and publicly verifiable artifacts. Each task requires agents to combine GUI observations/actions with CLI/code operations within a single trajectory. We evaluate these tasks on a real Ubuntu desktop inside deployed CLI-agent runtimes, augmented with a minimal desktop-control plugin. We also propose a companion trajectory-aware judge that inspects deliverables, files, screenshots, logs, and action traces, while detecting shortcut behaviors such as fabricated visual evidence or hard-coded metrics. Across frontier model-runtime pairings, the best PassRate reaches only 41.2%, showing the benchmark remains far from saturated. The trajectory-aware judge further reveals that outcome-only grading substantially overestimates agent performance. Overall, WeaveBench exposes a critical gap in CUA evaluation and provides an effective testbed to measure whether agents can orchestrate GUI, CLI, and code operations across long-horizon real-world tasks.
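The difference between outcome-only grading and a trajectory-aware judge can be illustrated with a minimal sketch. This is not the authors' implementation: the `TaskResult` record, the flag names, and the rule "pass only if the outcome is correct and no shortcut was detected" are assumptions made for illustration, and the four results are toy data rather than benchmark output.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TaskResult:
    """Hypothetical per-task record; field names are illustrative only."""
    task_id: str
    outcome_ok: bool  # final deliverable passes the outcome checks
    # Shortcut behaviors found in the trajectory (assumed labels).
    shortcut_flags: List[str] = field(default_factory=list)


def outcome_only_pass(result: TaskResult) -> bool:
    # Looks only at the final deliverable.
    return result.outcome_ok


def trajectory_aware_pass(result: TaskResult) -> bool:
    # Assumed rule: a correct outcome counts only if no shortcut was detected.
    return result.outcome_ok and not result.shortcut_flags


def pass_rate(results: List[TaskResult],
              judge: Callable[[TaskResult], bool]) -> float:
    """Percentage of tasks the given judge accepts."""
    if not results:
        return 0.0
    return 100.0 * sum(judge(r) for r in results) / len(results)


# Toy data, not taken from the paper.
results = [
    TaskResult("t1", outcome_ok=True),
    TaskResult("t2", outcome_ok=True, shortcut_flags=["hard_coded_metric"]),
    TaskResult("t3", outcome_ok=True),
    TaskResult("t4", outcome_ok=False),
]

print(f"{pass_rate(results, outcome_only_pass):.1f}")      # 75.0
print(f"{pass_rate(results, trajectory_aware_pass):.1f}")  # 50.0
```

In this toy set, task `t2` produces an acceptable deliverable by hard-coding a metric, so outcome-only grading counts it as a pass while the trajectory-aware rule rejects it. That is the kind of overestimate the abstract describes.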
Community
We introduce WeaveBench, a long-horizon benchmark with 114 tasks across 8 real-world domains, where agents must interleave GUI and CLI/code operations in a single trajectory on a real Ubuntu desktop. The best frontier model reaches only 41.2% PassRate, and our trajectory-aware judge shows outcome-only grading greatly overestimates agent performance. Project: https://weavebench.github.io