PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control
Authors: Jingxuan Wei, Xi Bai, Shan Liu, Caijun Jia, Zheng Sun, Xinglong Xu, Siyuan Li, Linzhuang Sun, Bihui Yu, Conghui He, Cheng Tan
Published: 2026-05-15
Project page: https://openraiser.github.io/Pager-webpage/
Code: https://github.com/OpenRaiser/Pager
Abstract
AI-generated summary: Vision-language GUI agents struggle on precision-sensitive tasks that demand point-level accuracy and geometric awareness. PAGER, a topology-aware agent, improves task success on such tasks through structured planning and pixel-level execution.
Large vision-language models have significantly advanced GUI agents, enabling executable interaction across web, mobile, and desktop interfaces. Yet these gains largely rely on a forgiving region-tolerant paradigm, where many nearby pixels inside the same component remain valid. Precise geometric construction breaks this assumption: actions must land on points in continuous canvas space rather than tolerant regions. Because geometric primitives carry ontological dependencies, a local coordinate error can induce cascading topological failures that distort downstream objects and invalidate the final construction. We identify this regime as precision-sensitive GUI tasks, requiring point-level accuracy, geometry-aware verification, and robustness to dependency-driven error propagation. To benchmark it, we introduce PAGE Bench, with 4,906 problems and over 224K process-supervised, pixel-level GUI actions. We further propose PAGER, a topology-aware agent that decomposes construction into dependency-structured planning and pixel-level execution. Pixel-grounded supervised tuning establishes executable action grammar, while precision-aligned reinforcement learning mitigates rollout-induced exposure bias through state-conditioned geometric feedback. Experiments reveal a pronounced Semantic-Execution Gap: general multimodal models can exceed 88% action type accuracy yet remain below 6% task success. PAGER closes this gap, delivering 4.1x higher task success than the strongest evaluated general baseline and raising step success rate from below 9% for GUI-specialized agents to over 62%, establishing a new state of the art for point-precise GUI control.
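To make the contrast between region-tolerant and point-precise actions concrete, the following is a minimal Python sketch, not the authors' implementation. The function names (`region_hit`, `point_hit`, `midpoint`), the 2-pixel tolerance, the bounding box, and all coordinates are assumptions chosen for illustration. It shows a click that would pass a bounding-box check yet fail a point-level check, and how that single coordinate error shifts a dependent object (here, a midpoint built on the misplaced point).

```python
import math

# Hypothetical types and helpers for illustration only; none of these
# names come from the PAGER codebase or PAGE Bench.
Point = tuple[float, float]
Box = tuple[float, float, float, float]  # left, top, right, bottom


def region_hit(click: Point, box: Box) -> bool:
    """Region-tolerant check: any pixel inside the component box is valid."""
    left, top, right, bottom = box
    return left <= click[0] <= right and top <= click[1] <= bottom


def point_hit(click: Point, target: Point, tol_px: float = 2.0) -> bool:
    """Point-precise check: the click must land within tol_px of the target."""
    return math.dist(click, target) <= tol_px


def midpoint(a: Point, b: Point) -> Point:
    """A dependent primitive: its position is fully determined by a and b."""
    return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)


# Intended construction: points A and B, then their midpoint M.
A: Point = (100.0, 200.0)
B_target: Point = (300.0, 200.0)
B_click: Point = (312.0, 209.0)  # assumed agent click, 15 px off target

# A button-sized box around B would accept this click...
button_box: Box = (290.0, 180.0, 330.0, 220.0)
region_ok = region_hit(B_click, button_box)

# ...but a point-level check on the canvas rejects it.
point_ok = point_hit(B_click, B_target)

# The error propagates to every object that depends on B.
M_target = midpoint(A, B_target)
M_actual = midpoint(A, B_click)
drift = math.dist(M_target, M_actual)

print(f"region check: {region_ok}, point check: {point_ok}, midpoint drift: {drift} px")
```

In this toy setup the click is accepted by the region check, rejected by the point check, and moves the midpoint by 7.5 px. Any object later built on that midpoint inherits the error, which is the dependency-driven propagation the abstract describes.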
Paper: arxiv.org/abs/2605.15963