Hugging Face Daily Papers · May 18, 2026 · 6 min read

PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


Large vision-language models have significantly advanced GUI agents, enabling executable interaction across web, mobile, and desktop interfaces. Yet these gains largely rely on a forgiving region-tolerant paradigm, where many nearby pixels inside the same component remain valid. Precise geometric construction breaks this assumption: actions must land on points in continuous canvas space rather than tolerant regions. Because geometric primitives carry ontological dependencies, a local coordinate error can induce cascading topological failures that distort downstream objects and invalidate the final construction. We identify this regime as precision-sensitive GUI tasks, requiring point-level accuracy, geometry-aware verification, and robustness to dependency-driven error propagation. To benchmark it, we introduce PAGE Bench, with 4,906 problems and over 224K process-supervised, pixel-level GUI actions. We further propose PAGER, a topology-aware agent that decomposes construction into dependency-structured planning and pixel-level execution. Pixel-grounded supervised tuning establishes executable action grammar, while precision-aligned reinforcement learning mitigates rollout-induced exposure bias through state-conditioned geometric feedback. Experiments reveal a pronounced Semantic-Execution Gap: general multimodal models can exceed 88% action type accuracy yet remain below 6% task success. PAGER closes this gap, delivering 4.1x higher task success than the strongest evaluated general baseline and raising step success rate from below 9% for GUI-specialized agents to over 62%, establishing a new state of the art for point-precise GUI control.
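The contrast between region-tolerant and point-precise actions, and the way one coordinate error carries into dependent objects, can be illustrated with a minimal Python sketch. Everything here is hypothetical and chosen for illustration only: the function names, the 2-pixel tolerance, and the coordinates are assumptions, not PAGE Bench's actual evaluation protocol or PAGER's implementation.

```python
import math


def region_hit(click, box):
    """Region-tolerant check: any pixel inside the component's box is valid."""
    x, y = click
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def point_hit(click, target, tol=2.0):
    """Point-precise check: the click must land within `tol` pixels of the target.

    The 2-pixel default is an assumed value for this sketch.
    """
    return math.dist(click, target) <= tol


def midpoint(p, q):
    """Construct the midpoint of segment pq (a dependent geometric object)."""
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)


# A click that is 8 px off-center still hits a 40x40 toolbar button.
button_box = (100, 100, 140, 140)
button_ok = region_hit((128, 112), button_box)

# The same 8 px offset on a canvas point fails the point-level check.
A_true, B = (200.0, 300.0), (400.0, 300.0)
A_clicked = (208.0, 300.0)
point_ok = point_hit(A_clicked, A_true)  # False: 8 px exceeds the 2 px tolerance

# Objects built on the misplaced point inherit its error.
M_true = midpoint(A_true, B)      # intended midpoint of AB
M_built = midpoint(A_clicked, B)  # midpoint actually constructed
drift = math.dist(M_built, M_true)

print(button_ok, point_ok, M_built, drift)
```

In this toy case the misplaced endpoint shifts the constructed midpoint by 4 pixels, and any later object defined through that midpoint would be shifted as well, which is the kind of dependency-driven propagation the abstract describes.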


arxiv:2605.15963


Published on May 15, 2026 · Submitted by Cheng Tan on May 18, 2026 · OpenDataLab

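Authors:

Jingxuan Wei, Xi Bai, Shan Liu, Caijun Jia, Zheng Sun, Xinglong Xu, Siyuan Li, Linzhuang Sun, Bihui Yu, Conghui He, Cheng Tan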

AI-generated summary

Advanced vision-language models for GUI agents struggle on precision-sensitive tasks that require point-level accuracy and geometric awareness. The paper addresses this with a topology-aware agent that improves task success through structured planning and pixel-level execution.

arXiv: https://arxiv.org/abs/2605.15963 · Project page: https://openraiser.github.io/Pager-webpage/ · GitHub: https://github.com/OpenRaiser/Pager


Get this paper in your agent:

hf papers read 2605.15963

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
