OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification
Published on May 31, 2026 · Submitted by Yuhang Zhou (zyhang1998) on June 3, 2026 · arXiv: 2606.01476
Authors: Yuhang Zhou, Lizhu Zhang, Yifan Wu, Mingyi Wang, Peng Bo, Jiayi Liu, Xiangjun Fan, Zhuokai Zhao
Organization: Meta Research
Abstract
OmniOPD addresses limitations of standard On-Policy Distillation by using chunk-level semantic similarity instead of token-level logits, improving learning reliability and performance with black-box teachers.
On-Policy Distillation (OPD) trains a student model on its own generative trajectories under dense token-level feedback from a stronger teacher, mitigating both the off-policy distribution shift of Supervised Fine-Tuning (SFT) and the sparse credit assignment of Reinforcement Learning (RL). However, standard OPD faces two coupled limitations. First, it requires direct access to the teacher's token-level logits, excluding a broad class of capable proprietary models from serving as teachers. Second, the token-level logit signal itself is brittle, depending on a narrow overlap of plausible next tokens between teacher and student, and prone to amplifying degenerate patterns such as repetition loops. In this paper, we introduce OmniOPD, a novel framework that addresses both limitations through a logit-free, chunk-level supervision signal. OmniOPD replaces deterministic logit matching with Monte Carlo rollouts that approximate the teacher's local preferences through a continuous semantic similarity metric over multi-token chunks, and concentrates this supervision via a peak-entropy scheduler that audits the student only at its high-uncertainty reasoning forks. A Dirichlet-Multinomial Bayesian prior and a base-model KL anchor further bound the variance of discrete sampling and prevent policy collapse across unaudited tokens. Across competitive benchmarks, OmniOPD surpasses the standard OPD approach by up to +28.64% on math, confirming that chunk-level semantic verification extracts a more reliable learning signal than token-level logit matching, whose high information density is offset by significant noise and brittleness. Furthermore, when paired with stronger black-box teachers such as Claude-4.5-Haiku and Gemini-2.5-Flash, OmniOPD achieves an additional +9.54% relative gain on math over its open-weight teacher counterpart, advancing the student past the performance of self-exploratory RL.
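To make the ingredients named in the abstract more concrete, here is a minimal Python sketch of three of them: picking the student's highest-entropy positions to audit, scoring a student chunk against sampled teacher rollouts, and smoothing the resulting scores with a symmetric Dirichlet prior. This is not the paper's implementation. The abstract gives no formulas, so every function name, the Jaccard token overlap used as a stand-in for the "continuous semantic similarity metric", the top-k selection rule, and the use of similarity scores as soft pseudo-counts are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def peak_entropy_positions(step_probs: Sequence[Sequence[float]], k: int) -> List[int]:
    """Hypothetical scheduler: indices of the k highest-entropy steps, in sequence order."""
    entropies = [token_entropy(p) for p in step_probs]
    top = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)[:k]
    return sorted(top)


def chunk_score(student_chunk: Sequence[str], teacher_rollouts: Sequence[Sequence[str]]) -> float:
    """Mean Jaccard token overlap between a student chunk and teacher rollouts.

    Assumption: Jaccard overlap is only a toy stand-in for a semantic similarity metric.
    """
    if not teacher_rollouts:
        return 0.0
    student = set(student_chunk)
    total = 0.0
    for rollout in teacher_rollouts:
        teacher = set(rollout)
        union = student | teacher
        total += len(student & teacher) / len(union) if union else 0.0
    return total / len(teacher_rollouts)


def dirichlet_smoothed_preference(scores: Sequence[float], alpha: float = 1.0) -> List[float]:
    """Posterior mean of a Dirichlet-Multinomial model with a symmetric prior.

    Assumption: non-negative scores are treated as soft pseudo-counts n_i, giving
    (n_i + alpha) / (sum_j n_j + K * alpha) for K candidates.
    """
    total = sum(scores) + alpha * len(scores)
    return [(s + alpha) / total for s in scores]


# Usage: four decoding steps over a toy 4-token vocabulary.
step_probs = [
    [0.97, 0.01, 0.01, 0.01],  # confident step
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain step
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
]
print(peak_entropy_positions(step_probs, k=2))          # [1, 3]
print(chunk_score(["x", "=", "2"], [["x", "=", "2"], ["x", "=", "3"]]))  # 0.75
print(dirichlet_smoothed_preference([3.0, 1.0, 0.0]))   # three values summing to 1
```

The prior pulls every candidate's preference away from zero, which is one simple way a Bayesian prior can damp the variance of a small number of discrete samples; a candidate that no rollout supports still keeps a small share of probability instead of being eliminated outright.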
Community
Sharing OmniOPD, our latest work that does on-policy distillation without requiring the teacher's logits — while outperforming methods that need them.