Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding
Authors: Jingyuan Huang, Zuming Huang, Yucheng Shi, Tianze Yang, Xiaoming Zhai, Wei Chu, Ninghao Liu
Organization: University of Georgia (UGA-AI)
Published: 2026-06-16 · arXiv: 2606.18101
Model: GUI-RD, publicly released at https://huggingface.co/JingyuanHuang/GUI-RD-9B
Abstract
AI-generated summary: Quality-aware self-distillation improves vision-language model performance for GUI grounding by enhancing coordinate-token teacher signals through correctness-aware gating and probability scaling.
Graphical user interface (GUI) grounding requires vision-language models (VLMs) to identify small target elements in high-resolution screenshots and predict precise screen coordinates. On-policy self-distillation (OPSD) is a promising post-training approach for this coordinate-sensitive task, since it provides dense token-level teacher signals beyond hard coordinate labels. However, naive OPSD is not well suited to GUI grounding. Because OPSD evaluates the teacher on student-generated prefixes, the quality of coordinate-token teacher signals can degrade once the prefix has already deviated from the target coordinate, which makes the teacher signal unreliable. To mitigate this, we propose quality-aware self-distillation for VLM-based GUI grounding, which improves coordinate-token teacher-signal quality through soft correctness-aware gating and teacher-probability scaling. The soft correctness-aware gate checks whether the teacher's current coordinate-token prediction can still be completed into the ground-truth box under the student-generated prefix. If it cannot, the corresponding teacher signal is down-weighted. Teacher-probability scaling then uses the teacher's confidence as a lightweight factor to further calibrate the strength of the gated supervision. A key empirical finding is that neither component alone improves overall performance, whereas combining them does so consistently. This suggests that the two mechanisms play complementary roles: correctness-aware gating suppresses unreliable coordinate-token supervision, while teacher-probability scaling calibrates the strength of the remaining signals. Experiments across six GUI grounding benchmarks show that our method consistently improves the base model and outperforms strong baselines.
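The gating and scaling steps described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes coordinates are emitted as fixed-width, zero-padded digit strings with one digit per token, and the names `can_complete`, `token_weight`, `width`, and `soft_floor` are hypothetical. The gate keeps full weight when the student prefix plus the teacher's predicted digit can still be completed into a value inside the ground-truth box along one axis, and otherwise falls back to a small floor; the result is then multiplied by the teacher's probability for that digit.

```python
def can_complete(prefix_digits: str, lo: int, hi: int, width: int = 4) -> bool:
    """Return True if some `width`-digit number starting with `prefix_digits`
    falls inside the ground-truth interval [lo, hi] (one box axis).

    Assumption: coordinates are zero-padded to `width` digits, one digit per token.
    """
    remaining = width - len(prefix_digits)
    if remaining < 0:
        return False  # prefix is already longer than a full coordinate
    base = int(prefix_digits) * 10 ** remaining if prefix_digits else 0
    smallest = base                          # remaining digits all 0
    largest = base + 10 ** remaining - 1     # remaining digits all 9
    return smallest <= hi and largest >= lo  # interval overlap test


def token_weight(student_prefix: str, teacher_token: str, teacher_prob: float,
                 lo: int, hi: int, width: int = 4, soft_floor: float = 0.1) -> float:
    """Weight for one coordinate token's distillation term.

    gate: 1.0 if the teacher's digit keeps the coordinate completable into
          [lo, hi] under the student prefix, else `soft_floor` (hypothetical value).
    scale: the teacher's probability for its predicted digit.
    """
    feasible = can_complete(student_prefix + teacher_token, lo, hi, width)
    gate = 1.0 if feasible else soft_floor
    return gate * teacher_prob


# Ground-truth box spans x in [412, 468]; coordinates are 4 digits wide.
w_good = token_weight("04", "3", teacher_prob=0.9, lo=412, hi=468)  # "043x" can land in the box
w_bad = token_weight("05", "1", teacher_prob=0.8, lo=412, hi=468)   # "051x" cannot
print(w_good, w_bad)
```

In a training loop, a weight like this would multiply the per-token distillation loss at coordinate positions. How the paper defines the soft gate and the scaling factor exactly is not stated in the abstract, so the floor value and the product form here are placeholders.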
Community
We use GUI grounding as a directly verifiable testbed to study a broader question in on-policy self-distillation: when the student-generated prefix is already wrong, are the teacher's next-token signals still reliable? Because GUI grounding answers can be checked against ground-truth boxes, we propose a solution tailored to this verifiable structure. Our experiments show that teacher signals produced after a wrong student prefix can still provide useful supervision, but they require special handling. Our training method yields consistent improvements across six GUI grounding benchmarks.