Hugging Face Daily Papers · 4 min read

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.12481

Published on May 12, 2026 · Submitted by XuHao Hu on May 13, 2026
Authors: Xuhao Hu, Xi Zhang, Haiyang Xu, Kyle Qiao, Jingyi Yang, Xuanjing Huang, Jing Shao, Ming Yan, Jieping Ye

Abstract

AI-generated summary: ToolCUA is an end-to-end agent that learns optimal GUI-tool path selection through staged training, achieving superior performance in hybrid action space environments.

Computer Use Agents (CUAs) can act through both atomic GUI actions, such as click and type, and high-level tool calls, such as API-based file operations, but this hybrid action space often leaves them uncertain about when to continue with GUI actions or switch to tools, leading to suboptimal execution paths. This difficulty stems from the scarcity of high-quality interleaved GUI-Tool trajectories, the cost and brittleness of collecting real tool trajectories, and the lack of trajectory-level supervision for GUI-Tool path selection. In this paper, we propose ToolCUA, an end-to-end agent designed to learn optimal GUI-Tool path selection through a staged training paradigm. We first introduce an Interleaved GUI-Tool Trajectory Scaling Pipeline that repurposes abundant static GUI trajectories and synthesizes a grounded tool library, enabling diverse GUI-Tool trajectories without manual engineering or real tool-trajectory collection. We then perform Tool-Bootstrapped GUI RFT, combining warmup SFT with single-turn RL to improve decisions at critical GUI-Tool switching points. Finally, we optimize ToolCUA with Online Agentic RL in a high-fidelity GUI-Tool environment, guided by a Tool-Efficient Path Reward that encourages appropriate tool use and shorter execution paths. Experiments on OSWorld-MCP show that ToolCUA achieves 46.85% accuracy, a relative improvement of approximately 66% over the baseline, establishing a new state of the art among models of comparable scale. It also improves by 3.9% over GUI-only settings, demonstrating effective GUI-Tool orchestration. The results further suggest that training in a hybrid action space is a promising paradigm for real-world digital agents. Open-sourced here: https://x-plug.github.io/ToolCUA/
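The abstract names a Tool-Efficient Path Reward but does not give its formula. As a rough illustration of the idea it describes (reward success, prefer shorter paths, favor appropriate tool calls), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Step` type, the `appropriate` flag, the weights, and the `max_steps` cap are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One action in a trajectory (hypothetical structure, not from the paper)."""
    kind: str                 # "gui" (click, type, ...) or "tool" (API-style call)
    appropriate: bool = True  # assumed label: was a tool call suitable at this point?


def path_reward(
    steps: List[Step],
    success: bool,
    max_steps: int = 30,        # assumed step budget
    length_weight: float = 0.2,  # assumed weight of the shortness bonus
    tool_weight: float = 0.1,    # assumed weight of the tool-use term
) -> float:
    """Illustrative trajectory-level reward: success first, then path shape."""
    if not success:
        # Failed trajectories get nothing, so shaping cannot outweigh the task.
        return 0.0

    # Shorter paths earn a larger bonus, scaled into [0, length_weight].
    used = min(len(steps), max_steps)
    length_term = length_weight * (1.0 - used / max_steps)

    # Appropriate tool calls add reward, inappropriate ones subtract it.
    tool_calls = [s for s in steps if s.kind == "tool"]
    tool_term = 0.0
    if tool_calls:
        good = sum(1 for s in tool_calls if s.appropriate)
        bad = len(tool_calls) - good
        tool_term = tool_weight * (good - bad) / len(tool_calls)

    return 1.0 + length_term + tool_term


if __name__ == "__main__":
    gui_only = [Step("gui") for _ in range(12)]
    mixed = [Step("gui"), Step("tool"), Step("gui")]
    print(path_reward(gui_only, success=True))
    print(path_reward(mixed, success=True))
```

Under these assumed weights, a short successful path that uses one suitable tool call scores higher than a longer GUI-only path that also succeeds, which is the behavior the abstract attributes to the reward. The actual reward used for ToolCUA may differ in form and scale.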

Community

XuHao Hu (paper author, paper submitter) · May 13, 2026

We introduce ToolCUA, an end-to-end agent for optimal GUI-Tool path orchestration. It leverages an automated interleaved trajectory scaling pipeline and a two-stage training paradigm: reinforcement fine-tuning followed by online agentic RL in a sandbox. We open-source the following:

Website: https://x-plug.github.io/ToolCUA/
Code: https://github.com/X-PLUG/ToolCUA
Model: https://huggingface.co/mPLUG/ToolCUA-8B



