Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents
Authors: Tianpeng Bu, Xin Liu, Qihua Chen, Hao Jiang, Shurui Li, Hongtao Duan, Lu Jiang, Lulu Hu, Bin Yang, Minying Zhang
Published: 2026-05-28 (arXiv:2605.29447)
Abstract
AI-generated summary: GUI agents lack robust error recovery capabilities, which this work addresses through GUI-RobustEval and Robustness-driven Trajectory Synthesis, demonstrating improved performance on real-world benchmarks.
While GUI agents have advanced rapidly, they often lack the robustness to recover from their own errors, which hinders real-world deployment. To bridge this gap at both the evaluation and data levels, we introduce GUI-RobustEval and propose Robustness-driven Trajectory Synthesis (RoTS). At the evaluation level, GUI-RobustEval contains 1,216 executable test cases that systematically measure error recovery capabilities across a broad and realistic spectrum of error modes. At the data level, RoTS is a scalable synthesis framework that creates 800K high-quality data samples via a tree-based pipeline that proactively discovers diverse error modes and synthesizes the corresponding recovery steps. Our two models, RoTS-7B and RoTS-32B, are fine-tuned on this dataset, and both demonstrate significant gains on GUI-RobustEval and traditional GUI benchmarks. Notably, RoTS-32B achieves state-of-the-art performance on OSWorld, with a 47.4% success rate and a 33.8% All-Pass@4 score, suggesting that improved long-horizon error recovery contributes to both robustness and overall performance. Our code is available at https://github.com/AlibabaResearch/RoTS.
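The abstract reports two OSWorld numbers, a success rate and an All-Pass@4 score, without defining the latter here. The sketch below shows one plausible reading, under the assumption that All-Pass@k counts a task as passed only if all k independent attempts succeed. That definition, the function names, and the toy outcomes are illustrative assumptions, not taken from the paper.

```python
from typing import Dict, List


def success_rate(runs: Dict[str, List[bool]]) -> float:
    """Mean success over every attempt of every task."""
    outcomes = [ok for attempts in runs.values() for ok in attempts]
    if not outcomes:
        raise ValueError("no attempts recorded")
    return sum(outcomes) / len(outcomes)


def all_pass_at_k(runs: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks whose first k attempts ALL succeed.

    Assumed definition: the source names the metric but does not spell it out.
    """
    if not runs:
        raise ValueError("no tasks recorded")
    for task, attempts in runs.items():
        if len(attempts) < k:
            raise ValueError(f"task {task!r} has fewer than {k} attempts")
    passed = sum(all(attempts[:k]) for attempts in runs.values())
    return passed / len(runs)


# Hypothetical outcomes: four tasks, four attempts each.
runs = {
    "task_a": [True, True, True, True],
    "task_b": [True, False, True, True],
    "task_c": [False, False, False, False],
    "task_d": [True, True, True, True],
}

print(success_rate(runs))      # 11 of 16 attempts succeed
print(all_pass_at_k(runs, 4))  # 2 of 4 tasks succeed on every attempt
```

Under this reading, All-Pass@k can never exceed the plain success rate, and the gap between the two indicates how inconsistent an agent is across repeated attempts at the same task.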
Community
We observe that existing GUI agent benchmarks focus on overall success rates or robustness to external perturbations, whereas real-world failures are predominantly caused by the agent's own policy. Current training data also significantly underrepresents the planning and progress-perception errors that dominate these failures. Motivated by this gap, we propose GUI-RobustEval, a fine-grained benchmark that measures error awareness and recovery across 11 error types and 4 error depths, and RoTS, a tree-based online synthesis framework that generates 800K error-recovery trajectories by actively exploring failure modes and synthesizing long-horizon recovery data. RoTS-trained models achieve strong performance on OSWorld and degrade substantially less under compounding errors than existing baselines, demonstrating improved robustness to long-horizon policy-induced failures.
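The comment describes RoTS only at a high level, so the following is a toy sketch of the general idea of tree-based error-recovery synthesis, not the authors' pipeline. It branches off a known-good ("golden") action sequence at every step, stacks one or more erroneous actions, and appends recovery steps before finishing the task. The `Node` class, the action names, the `undo` recovery model, and the reading of "error depth" as the number of consecutive erroneous steps are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Step = Tuple[str, str]  # (action, kind); kind is "correct", "error", or "recovery"


@dataclass
class Node:
    action: str
    kind: str  # "root" for the tree root, otherwise one of the Step kinds
    children: List["Node"] = field(default_factory=list)

    def add(self, action: str, kind: str) -> "Node":
        child = Node(action, kind)
        self.children.append(child)
        return child


def build_tree(golden: List[str], max_depth: int) -> Node:
    """Branch off the golden path at every step with 1..max_depth stacked errors."""
    root = Node("", "root")
    node = root
    for i, action in enumerate(golden):
        err = node
        for depth in range(1, max_depth + 1):
            err = err.add(f"wrong_{action}", "error")  # one more compounding error
            cur = err
            for _ in range(depth):                     # undo every stacked error
                cur = cur.add("undo", "recovery")
            for rest in golden[i:]:                    # then finish the task
                cur = cur.add(rest, "correct")
        node = node.add(action, "correct")             # continue along the golden path
    return root


def leaf_paths(node: Node, prefix: Tuple[Step, ...] = ()) -> List[List[Step]]:
    """Collect every root-to-leaf path as one trajectory."""
    if node.kind != "root":
        prefix = prefix + ((node.action, node.kind),)
    if not node.children:
        return [list(prefix)]
    paths: List[List[Step]] = []
    for child in node.children:
        paths.extend(leaf_paths(child, prefix))
    return paths


def reaches_goal(traj: List[Step], golden: List[str]) -> bool:
    """Replay a trajectory: errors must be undone before progress resumes."""
    pending, done = 0, 0
    for action, kind in traj:
        if kind == "error":
            pending += 1
        elif kind == "recovery" and pending > 0:
            pending -= 1
        elif kind == "correct" and pending == 0 and done < len(golden) and action == golden[done]:
            done += 1
        else:
            return False
    return pending == 0 and done == len(golden)


# Hypothetical three-step task.
golden = ["open_menu", "click_save", "confirm"]
trajectories = leaf_paths(build_tree(golden, max_depth=2))
recovery_only = [t for t in trajectories if any(kind == "error" for _, kind in t)]
print(len(trajectories), len(recovery_only))
```

Sharing the error prefix between depths keeps the structure a tree rather than a flat list of scripts: a depth-2 branch extends the depth-1 error node, so deeper compounding errors reuse the states already explored. A real system would replace the fixed `wrong_*` and `undo` actions with errors discovered from an actual policy and recovery steps verified in a live environment.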