Hugging Face Daily Papers · 3 min read

Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.12501

Published on May 12 · Submitted by Miaosen Zhang on May 13
Authors: Miaosen Zhang, Xiaohan Zhao, Zhihong Tan, Zhou Huoshen, Yijia Fan, Yifan Yang, Kai Qiu, Bei Liu, Justin Wagle, Chenzhong Yin, Mingxi Cheng, Ji Li, Qi Dai, Chong Luo, Xu Yang, Xin Geng, Baining Guo

AI-generated summary

Computer-use agents face reliability challenges with complex GUI interactions due to data scarcity, addressed through a multi-modal benchmark and synthetic data generation pipeline.

Abstract

Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions is still poor, limiting user trust. Our analysis of failure cases from advanced models suggests a long-tail pattern in GUI operations, where a relatively small fraction of complex and diverse interactions accounts for a disproportionate share of task failures. We hypothesize that this issue largely stems from the scarcity of data for complex interactions. To address this problem, we propose CUActSpot, a new benchmark for evaluating models' capabilities on complex interactions across five modalities (GUI, text, table, canvas, and natural image) and a variety of actions (click, drag, draw, etc.), covering a broader range of interaction types than prior click-centric benchmarks that focus mainly on GUI widgets. We also design a renderer-based data-synthesis pipeline: scenes are automatically generated for each modality, screenshots and element coordinates are recorded, and an LLM produces matching instructions and action traces. After training on this corpus, our Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters. We will release our benchmark, data, code, and models at https://github.com/microsoft/Phi-Ground.git.
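
The abstract describes the synthesis pipeline only at a high level: render a scene for each modality, record the screenshot together with the ground-truth element coordinates, then have an LLM write a matching instruction and action trace. The sketch below is a minimal, self-contained illustration of that loop; every name in it (Element, Sample, fake_renderer, the action-trace fields) is a hypothetical stand-in rather than the released Phi-Ground code, and the LLM step is replaced with a template so the example runs on its own.

    # Illustrative sketch only; names and fields are assumptions, not the Phi-Ground release.
    from dataclasses import dataclass, field
    from typing import List, Tuple
    import json
    import random

    MODALITIES = ["gui", "text", "table", "canvas", "natural_image"]

    @dataclass
    class Element:
        label: str
        bbox: Tuple[int, int, int, int]   # (x0, y0, x1, y1) in screenshot pixels

    @dataclass
    class Sample:
        modality: str
        screenshot: str                   # path of the rendered image
        elements: List[Element]
        instruction: str                  # in the paper, written by an LLM
        action_trace: List[dict] = field(default_factory=list)

    def fake_renderer(modality: str, seed: int) -> Tuple[str, List[Element]]:
        """Stand-in for the real renderer: lays out one target element and
        returns the screenshot path plus its ground-truth coordinates."""
        rng = random.Random(seed)
        x, y = rng.randrange(0, 1800), rng.randrange(0, 1000)
        return f"{modality}_{seed}.png", [Element("target", (x, y, x + 120, y + 40))]

    def synthesize(modality: str, seed: int) -> Sample:
        shot, elems = fake_renderer(modality, seed)
        tgt = elems[0]
        cx = (tgt.bbox[0] + tgt.bbox[2]) // 2
        cy = (tgt.bbox[1] + tgt.bbox[3]) // 2
        # Templated here so the sketch is runnable; the paper uses an LLM for this step.
        return Sample(
            modality, shot, elems,
            instruction=f"Drag the {tgt.label} element to the top-left corner.",
            action_trace=[{"action": "drag", "from": [cx, cy], "to": [10, 10]}],
        )

    if __name__ == "__main__":
        corpus = [synthesize(m, s) for m in MODALITIES for s in range(2)]
        print(json.dumps(corpus[0].__dict__, default=lambda o: o.__dict__, indent=2))

Because the renderer knows the ground-truth geometry at generation time, low-frequency actions such as drags or free-form drawing can be labeled at scale without human annotation, which is the data-scarcity gap the paper targets.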

Community

Paper submitter about 1 hour ago

Grounding anything, any action, on screen!

Get this paper in your agent:

hf papers read 2605.12501
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1
