Hugging Face Daily Papers · June 12, 2026 · 3 min read

SG-OPD: Sign-Gated On-Policy Distillation via Sign-Consistency Gating and Phased Teacher Sampling

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


arxiv:2606.09304

Published on Jun 8, 2026 · Submitted by haoran xu on Jun 12

Authors: Haoran Xu, Hongyu Wang, Yifei Gao, Jiaze Li, Xiaofeng Zhang, Xiaosong Yuan

Abstract

Sign-Gated On-Policy Distillation improves upon standard on-policy distillation by incorporating a binary verifier to filter teacher signals, resulting in better performance on mathematical reasoning tasks.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

On-policy distillation (OPD) trains a student on its own trajectories with dense per-token supervision from a stronger teacher, and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its effectiveness implicitly relies on two assumptions that frequently break in practice: trajectory-level alignment between the student and the teacher, and uniform token-level reliability of the teacher's preferences. We therefore propose Sign-Gated On-Policy Distillation (SG-OPD), which uses a binary verifier as a trust signal for the teacher at two complementary granularities: phased teacher sampling mixes in verifier-endorsed teacher rollouts at cold-start, and a sign-consistency gate extrapolates the distillation update on tokens where the teacher agrees with the verifier-correct direction and interpolates it where it disagrees. Experiments on competition-level mathematical reasoning benchmarks show that SG-OPD consistently outperforms standard OPD, with average gains of 1.98 and 7.50 at the per-sample and per-question levels, respectively.
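As a rough illustration of the two mechanisms the abstract names, the Python sketch below shows one way a sign-consistency gate and a phased teacher-sampling schedule could look. The abstract gives no formulas, so everything concrete here is an assumption rather than the authors' method: the per-token signal is taken to be the teacher-minus-student log-probability, the "verifier-correct direction" is read as "raise token likelihood on verified-correct rollouts, lower it on incorrect ones", and the names `sign_gated_weights`, `teacher_mix_prob`, `pick_rollout`, `alpha`, `beta`, `p0`, and `warmup_steps` are hypothetical.

```python
import random
from typing import List, Sequence


def sign_gated_weights(
    teacher_logp: Sequence[float],
    student_logp: Sequence[float],
    verifier_correct: bool,
    alpha: float = 0.5,  # hypothetical extrapolation strength
    beta: float = 0.5,   # hypothetical interpolation strength
) -> List[float]:
    """Per-token multipliers for the distillation update (assumed form)."""
    # Assumed reading: a verified-correct rollout should become more likely (+1),
    # an incorrect one less likely (-1).
    direction = 1.0 if verifier_correct else -1.0
    weights = []
    for t_lp, s_lp in zip(teacher_logp, student_logp):
        signal = t_lp - s_lp  # > 0 means the teacher prefers this token more than the student
        if signal * direction > 0:
            weights.append(1.0 + alpha)  # teacher agrees with verifier: extrapolate
        elif signal * direction < 0:
            weights.append(1.0 - beta)   # teacher disagrees: interpolate (shrink)
        else:
            weights.append(1.0)          # no preference either way: leave unchanged
    return weights


def teacher_mix_prob(step: int, warmup_steps: int, p0: float = 0.5) -> float:
    """Probability of using a teacher rollout; decays linearly to 0 (assumed schedule)."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return 0.0
    return p0 * (1.0 - step / warmup_steps)


def pick_rollout(
    step: int,
    warmup_steps: int,
    student_rollout: str,
    teacher_rollouts: Sequence[str],
    teacher_verified: Sequence[bool],
    rng: random.Random,
    p0: float = 0.5,
) -> str:
    """During cold-start, sometimes swap in a verifier-endorsed teacher rollout."""
    endorsed = [r for r, ok in zip(teacher_rollouts, teacher_verified) if ok]
    if endorsed and rng.random() < teacher_mix_prob(step, warmup_steps, p0):
        return rng.choice(endorsed)
    return student_rollout


if __name__ == "__main__":
    # Token signals are +1.0, -2.0, 0.0 for a verified-correct rollout.
    print(sign_gated_weights([-1.0, -3.0, -2.0], [-2.0, -1.0, -2.0], True))
    print(teacher_mix_prob(5, 10))
```

In this sketch the gate only rescales an existing per-token update, so setting `alpha = beta = 0` leaves every weight at 1, and the teacher mix probability reaches 0 once `step` reaches `warmup_steps`, after which training is purely on-policy.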


Community

pianzhikuang

Paper submitter about 6 hours ago

SG-OPD


Get this paper in your agent:

hf papers read 2606.09304

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

