Post-training for reasoning models typically combines supervised fine-tuning with reinforcement learning from verifiable rewards, most commonly with GRPO. However, this algorithm suffers from sparse rewards, limited exploration, and mode collapse. Building upon recent works on self-distillation, we propose Feedback Distillation, a training method where the model is trained to match, at the token level, its own distribution conditioned on privileged feedback produced by a language model. Feedback Distillation offers token-level supervision and can inject external knowledge. Evaluating our method for Lean4 theorem-proving, we find that Feedback Distillation maintains greater diversity in generated trajectories than GRPO, yielding higher policy entropy and better pass@k scaling. The two methods are complementary: initializing GRPO from a Feedback Distillation checkpoint outperforms either method alone. All in all, our results suggest a promising avenue to improve post-training for complex reasoning.</p>\n","updatedAt":"2026-06-11T06:57:42.380Z","author":{"_id":"680a2f6202f8f5eddd0f0873","avatarUrl":"/avatars/d71093b223d532599387287a20d15c52.svg","fullname":"N","name":"Gaetan10","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.9127647280693054},"editors":["Gaetan10"],"editorAvatarUrls":["/avatars/d71093b223d532599387287a20d15c52.svg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2605.30861","authors":[{"_id":"6a26d2a2da05d61ad5d10d48","user":{"_id":"680a2f6202f8f5eddd0f0873","avatarUrl":"/avatars/d71093b223d532599387287a20d15c52.svg","isPro":false,"fullname":"N","user":"Gaetan10","type":"user","name":"Gaetan10"},"name":"Gaetan Narozniak","status":"claimed_verified","statusLastChangedAt":"2026-06-09T12:44:48.007Z","hidden":false},{"_id":"6a26d2a2da05d61ad5d10d49","name":"Gérard Biau","hidden":false},{"_id":"6a26d2a2da05d61ad5d10d4a","name":"Rémi Munos","hidden":false},{"_id":"6a26d2a2da05d61ad5d10d4b","name":"Ahmad Rammal","hidden":false},{"_id":"6a26d2a2da05d61ad5d10d4c","name":"Pierre Marion","hidden":false}],"publishedAt":"2026-05-29T00:00:00.000Z","submittedOnDailyAt":"2026-06-11T00:00:00.000Z","title":"Distilling LLM Feedback for Lean Theorem Proving","submittedOnDailyBy":{"_id":"680a2f6202f8f5eddd0f0873","avatarUrl":"/avatars/d71093b223d532599387287a20d15c52.svg","isPro":false,"fullname":"N","user":"Gaetan10","type":"user","name":"Gaetan10"},"summary":"Post-training for reasoning models typically combines supervised fine-tuning with reinforcement learning from verifiable rewards, most commonly with GRPO. However, this algorithm suffers from sparse rewards, limited exploration, and mode collapse. Building upon recent works on self-distillation, we propose Feedback Distillation, a training method where the model is trained to match, at the token level, its own distribution conditioned on privileged feedback produced by a language model. Feedback Distillation offers token-level supervision and can inject external knowledge. Evaluating our method for Lean4 theorem-proving, we find that Feedback Distillation maintains greater diversity in generated trajectories than GRPO, yielding higher policy entropy and better pass@k scaling. The two methods are complementary: initializing GRPO from a Feedback Distillation checkpoint outperforms either method alone. All in all, our results suggest a promising avenue to improve post-training for complex reasoning.","upvotes":1,"discussionId":"6a26d2a2da05d61ad5d10d4d","ai_summary":"Feedback Distillation improves post-training of reasoning models by using self-distillation with token-level supervision and privileged feedback from language models, offering better diversity and complementary benefits when combined with GRPO.","ai_keywords":["supervised fine-tuning","reinforcement learning","verifiable rewards","GRPO","self-distillation","token-level supervision","privileged feedback","language model","Lean4 theorem-proving","policy entropy","pass@k scaling"],"ai_summary_model":"Qwen/Qwen2.5-Coder-32B-Instruct","organization":{"_id":"66b54027408752ae16404b05","name":"metaresearch","fullname":"Meta Research","avatar":"https://cdn-avatars.huggingface.co/v1/production/uploads/66b25f3f58babfaeb76112dc/2GmiaF075AZ7BcE538oPk.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6a2ae6c2e36bc84d91b6e7cc","avatarUrl":"/avatars/abf4b4c0020f9332b6827952cc53163e.svg","isPro":false,"fullname":"mmgood","user":"mmgood","type":"user"}],"acceptLanguages":["en"],"dailyPaperRank":0,"organization":{"_id":"66b54027408752ae16404b05","name":"metaresearch","fullname":"Meta Research","avatar":"https://cdn-avatars.huggingface.co/v1/production/uploads/66b25f3f58babfaeb76112dc/2GmiaF075AZ7BcE538oPk.png"},"markdownContentUrl":"https://huggingface.co/buckets/huggingchat/papers-content/resolve/2605/2605.30861.md"}">
Distilling LLM Feedback for Lean Theorem Proving
Published on May 29
· Submitted by N on Jun 11 Abstract
Feedback Distillation improves post-training of reasoning models by using self-distillation with token-level supervision and privileged feedback from language models, offering better diversity and complementary benefits when combined with GRPO.
Post-training for reasoning models typically combines supervised fine-tuning with reinforcement learning from verifiable rewards, most commonly with GRPO. However, this algorithm suffers from sparse rewards, limited exploration, and mode collapse. Building upon recent works on self-distillation, we propose Feedback Distillation, a training method where the model is trained to match, at the token level, its own distribution conditioned on privileged feedback produced by a language model. Feedback Distillation offers token-level supervision and can inject external knowledge. Evaluating our method for Lean4 theorem-proving, we find that Feedback Distillation maintains greater diversity in generated trajectories than GRPO, yielding higher policy entropy and better pass@k scaling. The two methods are complementary: initializing GRPO from a Feedback Distillation checkpoint outperforms either method alone. All in all, our results suggest a promising avenue to improve post-training for complex reasoning.
Community
Post-training for reasoning models typically combines supervised fine-tuning with reinforcement learning from verifiable rewards, most commonly with GRPO. However, this algorithm suffers from sparse rewards, limited exploration, and mode collapse. Building upon recent works on self-distillation, we propose Feedback Distillation, a training method where the model is trained to match, at the token level, its own distribution conditioned on privileged feedback produced by a language model. Feedback Distillation offers token-level supervision and can inject external knowledge. Evaluating our method for Lean4 theorem-proving, we find that Feedback Distillation maintains greater diversity in generated trajectories than GRPO, yielding higher policy entropy and better pass@k scaling. The two methods are complementary: initializing GRPO from a Feedback Distillation checkpoint outperforms either method alone. All in all, our results suggest a promising avenue to improve post-training for complex reasoning.
Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.
Tap or paste here to upload images
Cite arxiv.org/abs/2605.30861 in a model README.md to link it from this page.
Cite arxiv.org/abs/2605.30861 in a dataset README.md to link it from this page.
Cite arxiv.org/abs/2605.30861 in a Space README.md to link it from this page.
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.