Hugging Face Daily Papers · June 17, 2026 · 4 min read

Learning from the Self-future: On-policy Self-distillation for dLLMs

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.18195

Published on Jun 16 · Submitted by Liu on Jun 17

Max Planck Institute for Intelligent Systems

Authors: Yifu Luo, Zeyu Chen, Haoyu Wang, Xinhao Hu, Yuxuan Zhang, Zhizhou Sha, Shiwei Liu

Abstract

d-OPSD introduces a novel on-policy self-distillation framework for diffusion language models by adapting self-teacher construction and supervision mechanisms to match the non-autoregressive nature of diffusion models.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric: they inject privileged information via left-to-right prefix conditioning with token-level divergence supervision, a design that fundamentally conflicts with the arbitrary-order generation of dLLMs. We introduce d-OPSD, the first OPSD framework tailored for dLLMs. Our approach makes two core contributions. First, we reframe self-teacher construction by using self-generated answers as suffix conditioning, enabling the student model to learn from "self future-experience" rather than privileged prefixes. Second, we shift supervision from token-level to step-level, aligning training with the iterative denoising process of dLLMs. Experiments across four reasoning benchmarks show that d-OPSD consistently outperforms RLVR and SFT baselines with superior sample efficiency, requiring only around 10% of the optimization steps needed by RLVR and opening a promising pathway for dLLM post-training. The code is available at https://github.com/xingzhejun/d-OPSD.
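The two ideas in the abstract can be made concrete with a toy sketch. This is only one plausible reading, not the authors' implementation: the abstract does not give the loss, so the choice of KL(teacher || student), the averaging over all positions still masked at a denoising step, and every name below (`MASK`, `build_inputs`, `kl`, `step_level_loss`) are assumptions made for illustration. The sketch shows a student that sees the prompt and a partially masked response, and a self-teacher that sees the same tokens plus the model's own answer appended as a suffix.

```python
import math

MASK = "<mask>"  # hypothetical mask token, chosen for this illustration


def build_inputs(prompt, response, answer, masked_positions):
    """Return (student_input, teacher_input) as token lists.

    Student: prompt + partially masked response.
    Teacher: the same sequence with the self-generated answer appended
    as a suffix (assumed form of "suffix conditioning").
    """
    noisy = [MASK if i in masked_positions else tok
             for i, tok in enumerate(response)]
    student = prompt + noisy
    teacher = prompt + noisy + answer
    return student, teacher


def kl(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def step_level_loss(teacher_probs, student_probs, masked_positions):
    """Average KL(teacher || student) over every position still masked
    at this denoising step (an assumed form of step-level supervision).

    teacher_probs / student_probs map a response position to a
    probability distribution over the vocabulary.
    """
    if not masked_positions:
        return 0.0
    per_pos = [kl(teacher_probs[i], student_probs[i])
               for i in sorted(masked_positions)]
    return sum(per_pos) / len(per_pos)


if __name__ == "__main__":
    masked = {1, 2}
    student, teacher = build_inputs(["Q"], ["a", "b", "c"], ["42"], masked)
    print(student)
    print(teacher)
    t = {1: [0.9, 0.1], 2: [0.5, 0.5]}
    s = {1: [0.5, 0.5], 2: [0.5, 0.5]}
    print(step_level_loss(t, s, masked))
```

In a real setup both distributions would come from the same dLLM run on the two inputs, with gradients flowing only through the student side; the dictionaries here stand in for those model outputs.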

Community

Shiweiliuiiiiiii (paper submitter), about 18 hours ago

Simple self-distillation method for DLMs.

noahml, about 7 hours ago

Neat paper. It is interesting to see someone finally tackling the autoregressive bias in self-distillation. Most OPSD work feels so tied to left-to-right generation, so reframing the teacher construction around suffix conditioning for dLLMs makes a lot of sense.

How much does the performance start to drop off if the model's self-generated answers are low quality? I wonder if the iterative denoising process is robust to that early noise.

I made a podcast on it with ResearchPod, which makes it easy to get the key concepts on the go:
https://researchpod.app/episode/bc8b4b71-4946-4027-aca9-54baf889e33c

Get this paper in your agent:

hf papers read 2606.18195

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
