Hugging Face Daily Papers · 4 min read

Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.26844

Published on May 26 · Submitted by Yuanyi Wang on Jun 1
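Authors: Yuanyi Wang, Su Lu, Yanggan Gu, Pengkai Wang, Yifan Yang, Zhaoyi Yan, Congkai Xie, Jianmin Wu, Hongxia Yang
Code: https://github.com/wyy-code/TA-OPD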

AI-generated summary

The learning value of token-level teacher signals in on-policy distillation is better predicted by teachability, which measures local compatibility between the teacher and student distributions, than by raw KL disagreement alone.

Abstract

On-policy distillation (OPD) trains a student on its own rollouts with token-level teacher supervision. Recent selective OPD methods exploit the non-uniformity of OPD signals by prioritizing high-entropy or high-disagreement tokens. We revisit this principle and ask: which token-level teacher signals are actually learnable? Using a fixed-context diagnostic that measures same-context teacher-student KL reduction, we show that raw KL disagreement is a coarse proxy for learning value. It conflates learnable disagreement, where the teacher assigns corrective mass to the student's top-K candidates, with incompatible disagreement, where the teacher places mass mostly off the student's current support. We formalize this local compatibility as token teachability and show that it better predicts fixed-context improvement than raw KL alone. Motivated by this finding, we propose Teachability-Aware OPD (TA-OPD), a lightweight token-position selection method that applies OPD loss to high-teachability positions without reward models or verifiers. Across Qwen2.5 and Qwen3 teacher-student settings, TA-OPD often surpasses full-token OPD with only 5% retained tokens and improves over entropy- and divergence-based baselines. Our results reframe selective OPD as selecting learnable teacher signals rather than merely salient tokens.
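The distinction between learnable and incompatible disagreement can be illustrated with a small sketch. This is not the paper's formula: the abstract does not give the exact teachability definition, so the score below (`teachability_proxy`, raw KL multiplied by the teacher mass on the student's top-K tokens), the KL direction, the toy distributions, and all function names are assumptions made for illustration only.

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def teacher_mass_on_student_topk(teacher, student, k):
    """Teacher probability mass that falls on the student's top-k tokens."""
    topk = sorted(range(len(student)), key=lambda i: student[i], reverse=True)[:k]
    return sum(teacher[i] for i in topk)

def teachability_proxy(teacher, student, k):
    """Hypothetical score (an assumption, not the paper's definition):
    raw disagreement weighted by how much teacher mass is on the student's top-k."""
    return kl(student, teacher) * teacher_mass_on_student_topk(teacher, student, k)

def select_positions(scores, keep_ratio=0.05):
    """Indices of the highest-scoring positions, keeping roughly keep_ratio of them."""
    n_keep = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:n_keep])

# Toy next-token distributions over a 4-token vocabulary at a single position.
student = [0.60, 0.30, 0.05, 0.05]
learnable_teacher = [0.30, 0.60, 0.05, 0.05]     # reorders the student's top-2
incompatible_teacher = [0.02, 0.03, 0.05, 0.90]  # mass sits off the student's top-2

for name, teacher in [("learnable", learnable_teacher),
                      ("incompatible", incompatible_teacher)]:
    print(name,
          "raw KL:", round(kl(student, teacher), 3),
          "mass on top-2:", round(teacher_mass_on_student_topk(teacher, student, 2), 3),
          "proxy:", round(teachability_proxy(teacher, student, 2), 3))
```

In this toy case the incompatible teacher produces the larger raw KL, yet the proxy ranks the learnable position higher, which is the kind of reordering the abstract describes. A selective method would then pass per-position scores to something like `select_positions(scores, keep_ratio=0.05)` and apply the OPD loss only at the returned indices.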

Community

Paper author Paper submitter about 8 hours ago

This work answers the question: "which token-level teacher signals in OPD are actually learnable?" Our fixed-context KL-reduction diagnostic shows that high token-level disagreement conflates two cases: learnable disagreement, where the teacher assigns corrective mass to the student’s top-K candidates, and incompatible disagreement, where the teacher places mass mostly off the student’s current support. We formalize this as Token Teachability and propose TA-OPD, which selects only high-teachability positions for OPD. Across Qwen2.5/Qwen3 settings, TA-OPD often matches or surpasses full-token OPD with only 5% retained tokens, without reward models or verifiers.

In summary, this work establishes a fine-grained view of OPD: not every token-level teacher–student disagreement is worth learning, and Token Teachability identifies which signals are actually learnable.
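For readers who prefer symbols, the two quantities mentioned in the comment above could be written as follows. The notation ($\pi_T$, $\pi_S^{\theta}$, $c_t$, $\theta'$, $\Delta_t$, $m_t$) and the direction of the KL divergence are assumptions made here for illustration; the page does not state the paper's exact definitions.

```latex
% c_t: the fixed context at position t of a student rollout
% \theta, \theta': student parameters before and after an update
\Delta_t \;=\; D_{\mathrm{KL}}\!\left(\pi_S^{\theta}(\cdot \mid c_t)\,\big\|\,\pi_T(\cdot \mid c_t)\right)
        \;-\; D_{\mathrm{KL}}\!\left(\pi_S^{\theta'}(\cdot \mid c_t)\,\big\|\,\pi_T(\cdot \mid c_t)\right),
\qquad
m_t \;=\; \sum_{v \,\in\, \mathrm{TopK}\left(\pi_S^{\theta}(\cdot \mid c_t)\right)} \pi_T(v \mid c_t)
```

Under this reading, $\Delta_t > 0$ means the teacher-student disagreement shrank at the same context, and $m_t$ is the teacher mass on the student's top-K candidates: close to 1 for learnable disagreement, close to 0 for incompatible disagreement.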


Get this paper in your agent:

hf papers read 2605.26844
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
