CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents
Authors: Bowen Wang, Dunjie Lu, Junli Wang, Tianyi Bai, Shixuan Liu, Zhipeng Zhang, Haiquan Wang, Hao Hu, Tianbao Xie, Shuai Bai, Dayiheng Liu, Que Shen, Junyang Lin, Tao Yu
Published: 2026-05-25
Project page: https://cua-gym.xlang.ai
Code: https://github.com/xlang-ai/CUA-Gym
Abstract
An RLVR framework for computer-use agents addresses data scarcity through a scalable generation pipeline and synthetic environments, achieving strong results on OSWorld-Verified and transferring to the held-out WebArena benchmark.
AI-generated summary
Reinforcement learning with verifiable rewards (RLVR) has driven breakthroughs in domains such as math, tool use, and software engineering, yet its extension to computer-use agents (CUAs) has been bottlenecked by the scarcity of scalable training data with deterministic rewards. Constructing such data for CUAs requires a task instruction, an executable environment, and a verifiable reward that are mutually consistent. Hand-curated benchmarks achieve high reward fidelity but cover few applications, whereas LLM-as-judge-based datasets scale broadly but lack reliable verification. We present CUA-Gym, a scalable pipeline that co-generates task instructions, environment states, and reward functions. Concretely, a Generator agent constructs the initial and golden environment states, and a separate Discriminator agent writes the reward function from the task specification. An orchestrator agent drives the two through iterative, execution-based rounds. Generated tuples then pass a final filter combining LLM majority voting and agent rollouts, ensuring quality beyond the per-task adversarial loop. To address the scarcity of training environments, we further synthesize CUA-Gym-Hub, a broad suite of high-fidelity mock web applications grounded in real-world software-use distributions, expanding the scale of CUA RLVR data by an order of magnitude. Using this pipeline, we construct CUA-Gym, a dataset of 32,112 verified RLVR training tuples grounded in 110 environments. Trained with GSPO on CUA-Gym, our CUA-Gym-A3B and CUA-Gym-A17B models achieve 62.1% and 72.6%, respectively, on OSWorld-Verified, outperforming prior open-source CUAs at comparable scales, with performance scaling smoothly in both data volume and environment diversity. The same checkpoints also improve on the held-out WebArena benchmark, indicating transfer beyond the training environments. We will open-source the full synthesis pipeline, dataset, CUA-Gym-Hub environments, and models.
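To make the tuple structure concrete, here is a minimal Python sketch of a task tuple with a deterministic reward and a final filter. It is an illustration under stated assumptions, not the released CUA-Gym code: the names `TaskTuple`, `execution_check`, `majority_vote`, and `final_filter`, the dict-based state, and the rule that a task is kept only if at least one rollout succeeds are all hypothetical, since the abstract does not specify these details.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: an environment state is modeled as a plain dictionary.
State = Dict[str, object]


@dataclass
class TaskTuple:
    """Hypothetical container for one (instruction, states, reward) tuple."""
    instruction: str
    initial_state: State
    golden_state: State
    reward_fn: Callable[[State], float]  # deterministic check on a final state


def execution_check(task: TaskTuple) -> bool:
    """The reward should accept the golden state and reject the initial state."""
    accepts_golden = task.reward_fn(task.golden_state) == 1.0
    rejects_initial = task.reward_fn(task.initial_state) == 0.0
    return accepts_golden and rejects_initial


def majority_vote(votes: List[bool]) -> bool:
    """True when strictly more than half of the judges approve."""
    return sum(votes) * 2 > len(votes)


def final_filter(task: TaskTuple, judge_votes: List[bool],
                 rollout_rewards: List[float]) -> bool:
    """Assumed rule: keep a task that passes the execution check, wins the
    judge majority, and is solved by at least one agent rollout."""
    solvable = any(r > 0.0 for r in rollout_rewards)
    return execution_check(task) and majority_vote(judge_votes) and solvable


def example_reward(state: State) -> float:
    # Deterministic reward: the cart must contain exactly one notebook.
    return 1.0 if state.get("cart_items") == ["notebook"] else 0.0


task = TaskTuple(
    instruction="Add a notebook to the shopping cart.",
    initial_state={"cart_items": []},
    golden_state={"cart_items": ["notebook"]},
    reward_fn=example_reward,
)

keep = final_filter(task, judge_votes=[True, True, False],
                    rollout_rewards=[0.0, 1.0, 0.0])
print(keep)
```

The point of the sketch is that the reward is a pure function of environment state, so it can be checked by execution against both the initial and golden states before any LLM judgment is involved.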
Paper: arxiv.org/abs/2605.25624