Hugging Face Daily Papers · 4 min read

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.01476

Published on May 31, 2026 · Submitted by Yuhang Zhou on Jun 3, 2026
Authors: Yuhang Zhou, Lizhu Zhang, Yifan Wu, Mingyi Wang, Peng Bo, Jiayi Liu, Xiangjun Fan, Zhuokai Zhao

Abstract

OmniOPD addresses limitations of standard On-Policy Distillation by using chunk-level semantic similarity instead of token-level logits, improving learning reliability and performance with black-box teachers.

On-Policy Distillation (OPD) trains a student model on its own generative trajectories under dense token-level feedback from a stronger teacher, mitigating both the off-policy distribution shift of Supervised Fine-Tuning (SFT) and the sparse credit assignment of Reinforcement Learning (RL). However, standard OPD faces two coupled limitations. First, it requires direct access to the teacher's token-level logits, which excludes a broad class of capable proprietary models from serving as teachers. Second, the token-level logit signal itself is brittle: it depends on a narrow overlap of plausible next tokens between teacher and student, and it is prone to amplifying degenerate patterns such as repetition loops.

In this paper, we introduce OmniOPD, a novel framework that addresses both limitations through a logit-free, chunk-level supervision signal. OmniOPD replaces deterministic logit matching with Monte Carlo rollouts that approximate the teacher's local preferences through a continuous semantic similarity metric over multi-token chunks, and concentrates this supervision via a peak-entropy scheduler that audits the student only at its high-uncertainty reasoning forks. A Dirichlet-Multinomial Bayesian prior and a base-model KL anchor further bound the variance of discrete sampling and prevent policy collapse across unaudited tokens.

Across competitive benchmarks, OmniOPD surpasses the standard OPD approach by up to +28.64% on math, confirming that chunk-level semantic verification extracts a more reliable learning signal than token-level logit matching, whose high information density is offset by significant noise and brittleness. Furthermore, when paired with stronger black-box teachers such as Claude-4.5-Haiku and Gemini-2.5-Flash, OmniOPD achieves an additional +9.54% relative improvement on math over its open-weight teacher counterpart, advancing the student past the performance of self-exploratory RL.
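
The abstract names a peak-entropy scheduler and a Dirichlet-Multinomial prior but does not spell out either one. The Python sketch below shows one plausible reading and should not be taken as the authors' implementation. Every function name is hypothetical. It assumes that "peak entropy" means a plain top-k over per-position entropies, that each teacher rollout casts one hard vote for the student candidate chunk it most resembles, that word-set Jaccard overlap can stand in as a toy replacement for the paper's semantic similarity metric, and that the prior is a symmetric Dirichlet with concentration `alpha`.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def peak_entropy_positions(step_probs: Sequence[Sequence[float]], k: int) -> List[int]:
    """Indices of the k highest-entropy positions, returned in sequence order.

    Assumption: "peak entropy" is read here as a plain top-k over positions.
    """
    entropies = [token_entropy(p) for p in step_probs]
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


def jaccard(a: str, b: str) -> float:
    """Toy stand-in for a semantic similarity metric: word-set overlap in [0, 1]."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def vote_counts(candidates: Sequence[str], rollouts: Sequence[str]) -> List[int]:
    """Each teacher rollout votes for the student candidate chunk it most resembles."""
    counts = [0] * len(candidates)
    for rollout in rollouts:
        best = max(range(len(candidates)), key=lambda i: jaccard(candidates[i], rollout))
        counts[best] += 1  # ties go to the lowest candidate index
    return counts


def dirichlet_posterior_mean(counts: Sequence[int], alpha: float = 1.0) -> List[float]:
    """Posterior mean of a symmetric Dirichlet(alpha) prior given multinomial counts.

    The pseudo-count alpha keeps every candidate's weight away from 0 and 1,
    which damps the noise of a small number of rollouts.
    """
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


if __name__ == "__main__":
    # Hypothetical per-position student distributions over a tiny vocabulary.
    step_probs = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1], [0.25, 0.25, 0.25, 0.25]]
    print(peak_entropy_positions(step_probs, k=2))

    # Hypothetical student candidate chunks and teacher text rollouts at one fork.
    candidates = ["factor the quadratic", "guess the answer"]
    rollouts = ["factor the quadratic first", "factor the expression", "guess and check"]
    print(dirichlet_posterior_mean(vote_counts(candidates, rollouts)))
```

Only text rollouts from the teacher are consumed here, which is the property that would let a black-box model act as teacher. In a training loop the smoothed weights would presumably feed a chunk-level loss alongside the base-model KL term, but that step is not described in the abstract and is left out.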

Community

Paper submitter about 2 hours ago

Sharing OmniOPD, our latest work that does on-policy distillation without requiring the teacher's logits — while outperforming methods that need them.
