Hugging Face Daily Papers · May 26, 2026 · 4 min read

Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.



arxiv:2605.26086


Published on May 25, 2026 · Submitted by Haiyang Wang on May 26, 2026


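Authors: Yusong Lin, Xinyuan Liang, Haiyang Wang, Qipeng Gu, Siqi Cheng, Jiangui Chen, Shuzhe Wu, Feiyang Pan, Lue Fan, Sanyuan Zhao, Dandan Tu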

AI-generated summary

The Claw-Anything benchmark evaluates large language model agents on comprehensive user activity contexts spanning extended timeframes, multiple services, and diverse device interactions, to assess true always-on personal assistance capabilities.

Abstract

Large language model agents are increasingly envisioned as always-on personal assistants with access to anything relevant in the user's digital world. Yet current systems operate over only narrow slices of that world, limiting context-sensitive reasoning and effective assistance. Existing benchmarks similarly provide only partial user state and therefore fail to capture performance in such a broad, always-on setting. To address this gap, we introduce Claw-Anything, a benchmark that expands agent context along three dimensions: long-horizon activity histories, interdependent backend services, and integrated GUI and CLI interaction across multiple devices. To instantiate this setting, we simulate months of user activity through multi-round event injection, producing complex world states and realistic noise, including irrelevant events and conflicting signals. Agents must reason over rich contextual environments while remaining robust to such noise. This expanded scope also enables the evaluation of proactive assistance, requiring agents to anticipate user needs and deliver timely recommendations. Experiments show that GPT-5.5 achieves only 34.5% pass@1, substantially below prior benchmarks, underscoring a gap between current agent capabilities and the demands of always-on personal assistance. Alongside the benchmark, we release an automated data-generation pipeline that yields 2,000 training environments and improves the base model by 23.7%, demonstrating the utility of scalable data infrastructure.
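The abstract reports results as pass@1 but does not describe the scoring protocol, so the following is only a minimal sketch of how such a number is commonly computed, not the authors' evaluation code. It assumes each task is attempted n times with c passing attempts and uses the standard unbiased pass@k estimator, which reduces to c/n for k = 1. The function names, task ids, and outcomes are hypothetical and are not data from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n attempts sampled, c of them passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include at least one pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: dict[str, list[bool]]) -> float:
    """Mean pass@1 over tasks; results maps a task id to per-attempt pass/fail flags."""
    scores = [pass_at_k(len(flags), sum(flags), 1) for flags in results.values()]
    return sum(scores) / len(scores)


# Hypothetical outcomes for illustration only (assumption, not from the paper).
example = {
    "task-a": [True, False, False, False],
    "task-b": [False, False, False, False],
    "task-c": [True, True, False, False],
}
print(f"pass@1 = {benchmark_pass_at_1(example):.3f}")
```

Averaging per task rather than pooling all attempts keeps tasks with more attempts from dominating the score; whether Claw-Anything aggregates this way is not stated in the abstract.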

GitHub: https://github.com/LiberCoders/CLaw-Anything

Community

Haiyang-W

Paper submitter about 5 hours ago

Claw-Anything: See anything, and do anything. Scaling Agent Context.

We believe the next leap for always-on LLM agents lies in scaling agent context — expanding the slice of the user's digital world an assistant can continuously perceive, reason over, and act on.


Get this paper in your agent:

hf papers read 2605.26086

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash


Datasets citing this paper 1

