Hugging Face Daily Papers · June 1, 2026 · 4 min read

Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


arxiv:2605.26844

Published on May 26, 2026 · Submitted by Yuanyi Wang on Jun 1

The Hong Kong Polytechnic University

Authors: Yuanyi Wang, Su Lu, Yanggan Gu, Pengkai Wang, Yifan Yang, Zhaoyi Yan, Congkai Xie, Jianmin Wu, Hongxia Yang

Abstract

AI-generated summary: In on-policy distillation, which token-level teacher signals are learnable is better predicted by teachability, a measure of local compatibility between the teacher and student distributions, than by raw KL disagreement alone.

On-policy distillation (OPD) trains a student on its own rollouts with token-level teacher supervision. Recent selective OPD methods exploit the non-uniformity of OPD signals by prioritizing high-entropy or high-disagreement tokens. We revisit this principle and ask: which token-level teacher signals are actually learnable? Using a fixed-context diagnostic that measures same-context teacher-student KL reduction, we show that raw KL disagreement is a coarse proxy for learning value. It conflates learnable disagreement, where the teacher assigns corrective mass to the student's top-K candidates, with incompatible disagreement, where the teacher places mass mostly off the student's current support. We formalize this local compatibility as token teachability and show that it better predicts fixed-context improvement than raw KL alone. Motivated by this finding, we propose Teachability-Aware OPD (TA-OPD), a lightweight token-position selection method that applies OPD loss to high-teachability positions without reward models or verifiers. Across Qwen2.5 and Qwen3 teacher-student settings, TA-OPD often surpasses full-token OPD with only 5% retained tokens and improves over entropy- and divergence-based baselines. Our results reframe selective OPD as selecting learnable teacher signals rather than merely salient tokens.
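The distinction between learnable and incompatible disagreement can be illustrated with a toy example. The sketch below is not the paper's definition of teachability: it assumes, purely for illustration, that local compatibility can be approximated by the teacher probability mass that falls on the student's top-K tokens, and it assumes the KL direction KL(teacher || student). The function names, the value K = 3, and the five-token distributions are all hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q) over a shared vocabulary; the direction is an assumption."""
    return sum(pi * math.log(pi / max(qi, 1e-12)) for pi, qi in zip(p, q) if pi > 0)

def teacher_mass_on_student_topk(teacher, student, k):
    """Teacher probability mass on the student's k most likely tokens.

    Used here as a hypothetical stand-in for token teachability.
    """
    topk = sorted(range(len(student)), key=lambda i: student[i], reverse=True)[:k]
    return sum(teacher[i] for i in topk)

# Hypothetical next-token distributions at one position (5-token vocabulary).
student = [0.50, 0.30, 0.15, 0.03, 0.02]

# Learnable disagreement: the teacher prefers the student's second candidate.
teacher_learnable = [0.10, 0.80, 0.05, 0.03, 0.02]

# Incompatible disagreement: the teacher prefers a token the student barely considers.
teacher_incompatible = [0.05, 0.05, 0.03, 0.02, 0.85]

for name, teacher in [("learnable", teacher_learnable),
                      ("incompatible", teacher_incompatible)]:
    print(name,
          "KL:", round(kl(teacher, student), 3),
          "mass on student top-3:", round(teacher_mass_on_student_topk(teacher, student, 3), 3))
```

In this toy setup the incompatible position has the larger raw KL, so a selector that ranks by KL alone would prefer it, while the top-K mass proxy ranks the learnable position higher.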

Code: https://github.com/wyy-code/TA-OPD

Community

wyy-code

Paper author Paper submitter about 8 hours ago

This work answers the question: "Which token-level teacher signals in OPD are actually learnable?" Our fixed-context KL-reduction diagnostic shows that high token-level disagreement conflates learnable disagreement, where the teacher assigns corrective mass to the student’s top-K candidates, with incompatible disagreement, where the teacher places mass mostly off the student’s current support. We formalize this as Token Teachability and propose TA-OPD, which selects only high-teachability positions for OPD. Across Qwen2.5/Qwen3 settings, TA-OPD often matches or surpasses full-token OPD with only 5% retained tokens, without reward models or verifiers.

In summary, this work establishes a fine-grained view of OPD: not every token-level teacher–student disagreement is worth learning, and Token Teachability identifies which signals are actually learnable.
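The position-selection step described above might look roughly like the following. This is a sketch under stated assumptions, not the TA-OPD implementation: it assumes a per-position score is already available (however teachability is actually computed), keeps the top 5% of positions in a rollout, and averages a per-token loss over only those positions. The helper names and the example scores and losses are hypothetical.

```python
def select_top_fraction(scores, keep_fraction=0.05):
    """Return a 0/1 mask that keeps the highest-scoring positions."""
    n_keep = max(1, round(keep_fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [1 if i in keep else 0 for i in range(len(scores))]

def masked_mean(losses, mask):
    """Average the per-token loss over the retained positions only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(loss * m for loss, m in zip(losses, mask)) / kept

# Hypothetical rollout of 40 positions: two positions have high scores.
scores = [0.1] * 40
scores[7] = 0.9
scores[23] = 0.8

# Hypothetical per-token distillation losses (e.g. a per-position KL term).
losses = [float(i) for i in range(40)]

mask = select_top_fraction(scores, keep_fraction=0.05)
loss = masked_mean(losses, mask)
print("retained positions:", [i for i, m in enumerate(mask) if m])
print("masked loss:", loss)
```

Only the retained positions contribute to the loss, so the other 95% of positions receive no distillation signal from this step.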

pengkaiScience

about 8 hours ago

cool~