Hugging Face Daily Papers · June 18, 2026 · 5 min read

Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

We use GUI grounding as a directly verifiable testbed to study a broader question in on-policy self-distillation: when the student-generated prefix is already wrong, are the teacher's next-token signals still reliable? Because GUI grounding answers can be checked against ground-truth boxes, we propose a solution tailored to this verifiable structure. Our experiments show that teacher signals after a wrong student prefix can still provide useful supervision, but they require special handling. Our training method achieves significant improvements across multiple GUI grounding benchmarks.

Our main experimental model, GUI-RD, is now publicly released: https://huggingface.co/JingyuanHuang/GUI-RD-9B

arxiv:2606.18101

Published on Jun 16 · Submitted by Huang Jingyuan on Jun 18

University of Georgia

Authors: Jingyuan Huang, Zuming Huang, Yucheng Shi, Tianze Yang, Xiaoming Zhai, Wei Chu, Ninghao Liu

Abstract

Quality-aware self-distillation improves vision-language model performance for GUI grounding by enhancing coordinate-token teacher signals through correctness-aware gating and probability scaling.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Graphical user interface (GUI) grounding requires vision-language models (VLMs) to identify small target elements in high-resolution screenshots and predict precise screen coordinates. On-policy self-distillation (OPSD) is a promising post-training approach for this coordinate-sensitive task, since it provides dense token-level teacher signals beyond hard coordinate labels. However, naive OPSD is not well suited to GUI grounding: because OPSD evaluates the teacher on student-generated prefixes, the quality of coordinate-token teacher signals can degrade once the prefix has deviated from the target coordinate, making the supervision unreliable. To mitigate this, we propose quality-aware self-distillation for VLM-based GUI grounding, which improves coordinate-token teacher-signal quality through soft correctness-aware gating and teacher-probability scaling. The soft correctness-aware gate checks whether the teacher's current coordinate-token prediction can still be completed into the ground-truth box under the student-generated prefix. If not, the corresponding teacher signal is down-weighted. Teacher-probability scaling then uses the teacher's confidence as a lightweight factor to further calibrate the strength of the gated supervision. A key empirical finding is that neither component alone improves overall performance, whereas combining them does so consistently. This suggests that the two mechanisms play complementary roles: correctness-aware gating suppresses unreliable coordinate-token supervision, while teacher-probability scaling calibrates the strength of the remaining signals. Experiments across six GUI grounding benchmarks show that our method consistently improves the base model and outperforms strong baselines.
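
To make the two mechanisms concrete, here is a minimal sketch of how a per-token distillation weight could be computed. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact formulas, so the fixed-width digit tokenization, the `soft_floor` value, the function names, and the choice of a KL term are all hypothetical.

```python
import math

# Assumption: a coordinate is emitted as `width` digit tokens, e.g. "0540".
def can_complete(prefix: str, digit: str, lo: int, hi: int, width: int = 4) -> bool:
    """Return True if prefix+digit can still be extended to a number in [lo, hi]."""
    head = prefix + digit
    remaining = width - len(head)
    if remaining < 0:
        return False
    smallest = int(head) * 10 ** remaining       # pad the rest with 0s
    largest = smallest + 10 ** remaining - 1     # pad the rest with 9s
    return smallest <= hi and largest >= lo      # ranges overlap

def token_weight(teacher_probs: dict, prefix: str, lo: int, hi: int,
                 soft_floor: float = 0.1, width: int = 4) -> float:
    """Gate (soft, hypothetical floor) times the teacher's top-token probability."""
    top_digit = max(teacher_probs, key=teacher_probs.get)
    # Soft gate: down-weight instead of dropping when the box is unreachable.
    gate = 1.0 if can_complete(prefix, top_digit, lo, hi, width) else soft_floor
    # Teacher-probability scaling: use teacher confidence as a second factor.
    scale = teacher_probs[top_digit]
    return gate * scale

def kl(teacher: dict, student: dict, eps: float = 1e-12) -> float:
    """KL(teacher || student) over a shared token vocabulary."""
    return sum(p * math.log((p + eps) / (student[t] + eps))
               for t, p in teacher.items() if p > 0)

def weighted_token_loss(teacher: dict, student: dict, prefix: str,
                        lo: int, hi: int) -> float:
    """Distillation loss for one coordinate token, scaled by its quality weight."""
    return token_weight(teacher, prefix, lo, hi) * kl(teacher, student)

# Ground-truth x-range of the target box, in pixels (made-up numbers).
x_lo, x_hi = 530, 560
on_track = token_weight({"4": 0.8, "3": 0.2}, "05", x_lo, x_hi)   # 054x can reach the box
off_track = token_weight({"3": 0.6, "4": 0.4}, "07", x_lo, x_hi)  # 073x cannot
print(on_track, off_track)
```

In this toy setup, a student prefix of `05` leaves the box reachable, so the teacher signal keeps its confidence-scaled weight, while a prefix of `07` has already left the box and the signal is shrunk by the soft floor rather than discarded.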