Hugging Face Daily Papers · June 5, 2026 · 3 min read

OPRD: On-Policy Representation Distillation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.




arxiv:2606.06021


Published on Jun 4 · Submitted by Shenzhi Yang on Jun 5


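Authors: Shenzhi Yang, Guangcheng Zhu, Bowen Song, Haobo Wang, Mingxuan Xia, Xing Zheng, Yingfan Ma, Zhongqi Chen, Weiqiang Wang, Gang Chen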

Abstract

On-Policy Representation Distillation (OPRD) improves upon traditional on-policy distillation by aligning student and teacher representations in hidden-state space rather than just output space, resulting in reduced variance and improved training efficiency.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

On-policy distillation (OPD) supervises the student only in output space by matching next-token probabilities. This output-only paradigm has two limits: (1) sampling variance from Monte Carlo KL estimates over large vocabularies (e.g., Qwen's ~150k tokens) persists throughout training, and (2) it treats the teacher as a black-box, discarding all intermediate hidden states after the LM head. We propose On-Policy Representation Distillation (OPRD), which lifts distillation into hidden-state space by aligning student and teacher representations across selected layers on the same rollouts, bypassing the LM head entirely. Theoretically, OPRD eliminates sampling variance and provides richer per-layer structural information. Empirically, OPRD closes the student-teacher gap on AIME 2024/2025 and AIMO, while output-space OPD baselines plateau below the teacher. OPRD also trains 1.44x faster and uses 54% less memory than top-k OPD. Code: https://github.com/ShenzhiYang2000/OPRD.
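To make the contrast in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation, since the code had not been released when this page was captured. It rests on several assumptions: the output-space baseline is modelled as a sampled reverse KL estimate (the abstract says only "Monte Carlo KL estimates"), the hidden-state objective is a mean squared error (the abstract does not name the alignment loss), student and teacher hidden states are taken to have the same width, and the toy vocabulary has 1,000 tokens rather than ~150k. All function names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_reverse_kl(student_probs, teacher_probs):
    """Full-vocabulary KL(student || teacher): sums over every token."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(student_probs, teacher_probs)
        if p > 0
    )


def sampled_reverse_kl(student_probs, teacher_probs, rng, num_samples=1):
    """Monte Carlo estimate of the same KL from tokens sampled from the student.

    Assumption: this stands in for the output-space OPD estimator.
    """
    tokens = rng.choices(
        range(len(student_probs)), weights=student_probs, k=num_samples
    )
    log_ratios = [math.log(student_probs[t] / teacher_probs[t]) for t in tokens]
    return sum(log_ratios) / num_samples


def hidden_state_loss(student_hidden, teacher_hidden, layers):
    """Mean squared error between hidden states at the selected layers.

    Assumption: MSE and equal hidden widths are illustrative choices only.
    Each argument maps a layer index to a flat list of floats.
    """
    total, count = 0.0, 0
    for layer in layers:
        for s, t in zip(student_hidden[layer], teacher_hidden[layer]):
            total += (s - t) ** 2
            count += 1
    return total / count


# Toy next-token distributions for one position of a rollout.
rng = random.Random(0)
vocab_size = 1000
student = softmax([rng.gauss(0, 1) for _ in range(vocab_size)])
teacher = softmax([rng.gauss(0, 1) for _ in range(vocab_size)])

exact = exact_reverse_kl(student, teacher)
estimates = [sampled_reverse_kl(student, teacher, rng) for _ in range(2000)]
mean_estimate = sum(estimates) / len(estimates)
variance = sum((e - mean_estimate) ** 2 for e in estimates) / len(estimates)

# Toy hidden states for two layers; only layer 1 is selected for alignment.
student_hidden = {0: [0.1, 0.2], 1: [1.0, 2.0]}
teacher_hidden = {0: [0.5, 0.5], 1: [1.0, 4.0]}
loss_first = hidden_state_loss(student_hidden, teacher_hidden, layers=[1])
loss_second = hidden_state_loss(student_hidden, teacher_hidden, layers=[1])

print(f"exact KL {exact:.3f}, mean of estimates {mean_estimate:.3f}")
print(f"variance of single-sample estimates {variance:.3f}")
print(f"hidden-state loss on two calls: {loss_first}, {loss_second}")
```

In this sketch the sampled estimator is unbiased but differs from draw to draw, whereas the hidden-state loss is a deterministic function of the rollout, so repeated evaluation returns the same value. That is the sense in which an objective of this kind carries no sampling variance over the vocabulary. Whether OPRD uses this exact loss, or a projection between models of different widths, is not stated in the abstract.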


Community

Shenzhi (paper submitter), Jun 5, 2026:

Due to the company's open-source process, the code will be released within the next week~


Get this paper in your agent:

hf papers read 2606.06021

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
