GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models
Abstract
Guided Denoiser Self-Distillation (GDSD) improves diffusion large language models by directly distilling denoisers from advantage-guided self-teachers, avoiding biases introduced by ELBO likelihood surrogates and achieving superior performance on benchmark tasks.
AI-generated summary
Reinforcement learning (RL) can improve the policy (denoiser) of diffusion large language models (dLLMs), but it is hindered by the intractability of the policy likelihood. A dominant and efficient family of methods replaces the likelihood in standard RL with its evidence lower bound (ELBO), estimated from randomly masked sequences. Although well aligned with pre-training, these approaches use the ELBO as a likelihood surrogate and thereby introduce bias through training-inference mismatch (TIM), which can degrade performance. In this work, we propose Guided Denoiser Self-Distillation (GDSD), which directly distills the denoiser of a dLLM from an advantage-guided self-teacher derived from the closed-form optimum of reverse-KL regularized RL. GDSD matches the dLLM's denoiser logits to the teacher's via a normalization-free objective, which reduces RL to likelihood-free self-distillation and thus bypasses the TIM biases. Recent ELBO-based methods emerge as instances that apply different distillation divergences, but with diagnosable pathologies that GDSD avoids. On planning, math, and coding benchmarks with LLaDA-8B and Dream-7B, GDSD consistently outperforms prior state-of-the-art ELBO-based methods and shows more stable training reward dynamics, achieving test-accuracy improvements of up to +19.6%. These results suggest that direct denoiser self-distillation, without relying on an ELBO likelihood surrogate, can provide a more stable and effective RL procedure for dLLMs. Code is available at https://github.com/GaryBall/GDSD.
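The abstract refers to the closed-form optimum of reverse-KL regularized RL without stating it. As a rough illustration only, the standard result is that the maximizer of E_π[A] − β·KL(π ‖ π_ref) over a finite set of candidates is π*(y) ∝ π_ref(y)·exp(A(y)/β). The sketch below checks this on a three-outcome toy problem. It is not the GDSD objective and does not operate on denoiser logits; the function names, the toy distribution, the advantages, and the value of β are all assumptions made for illustration.

```python
import math

def tilted_teacher(ref_probs, advantages, beta):
    """Toy advantage-tilted distribution: pi*(y) ∝ pi_ref(y) * exp(A(y) / beta).

    Hypothetical helper for illustration; computed in log space for stability.
    """
    logits = [math.log(p) + a / beta for p, a in zip(ref_probs, advantages)]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]

def regularized_objective(probs, ref_probs, advantages, beta):
    """Reverse-KL regularized objective: E_pi[A] - beta * KL(pi || pi_ref)."""
    expected_adv = sum(p * a for p, a in zip(probs, advantages))
    kl = sum(p * math.log(p / q) for p, q in zip(probs, ref_probs) if p > 0)
    return expected_adv - beta * kl

# Assumed toy inputs: a reference policy over three candidates and their advantages.
ref = [0.5, 0.3, 0.2]
adv = [-1.0, 0.0, 1.0]
beta = 0.5

teacher = tilted_teacher(ref, adv, beta)

# The optimal value of the objective equals beta * log(sum_y pi_ref(y) * exp(A(y) / beta)).
log_partition = math.log(sum(q * math.exp(a / beta) for q, a in zip(ref, adv)))
optimum = beta * log_partition

print("teacher:", [round(p, 4) for p in teacher])
print("objective at teacher:", regularized_objective(teacher, ref, adv, beta))
print("objective at reference:", regularized_objective(ref, ref, adv, beta))
```

In this toy setting the tilted distribution shifts mass toward the high-advantage candidate, and smaller β makes the shift sharper. How GDSD turns this optimum into a per-token teacher for the denoiser is described in the paper and repository, not here.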
arXiv: 2605.29398 (arxiv.org/abs/2605.29398)