Hugging Face Daily Papers · June 11, 2026 · 4 min read

Distilling LLM Feedback for Lean Theorem Proving

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Like Read original ↗

Post-training for reasoning models typically combines supervised fine-tuning with reinforcement learning from verifiable rewards, most commonly with GRPO. However, this algorithm suffers from sparse rewards, limited exploration, and mode collapse. Building upon recent works on self-distillation, we propose Feedback Distillation, a training method where the model is trained to match, at the token level, its own distribution conditioned on privileged feedback produced by a language model. Feedback Distillation offers token-level supervision and can inject external knowledge. Evaluating our method for Lean4 theorem-proving, we find that Feedback Distillation maintains greater diversity in generated trajectories than GRPO, yielding higher policy entropy and better pass@k scaling. The two methods are complementary: initializing GRPO from a Feedback Distillation checkpoint outperforms either method alone. All in all, our results suggest a promising avenue to improve post-training for complex reasoning.</p>\n","updatedAt":"2026-06-11T06:57:42.380Z","author":{"_id":"680a2f6202f8f5eddd0f0873","avatarUrl":"/avatars/d71093b223d532599387287a20d15c52.svg","fullname":"N","name":"Gaetan10","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.9127647280693054},"editors":["Gaetan10"],"editorAvatarUrls":["/avatars/d71093b223d532599387287a20d15c52.svg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2605.30861","authors":[{"_id":"6a26d2a2da05d61ad5d10d48","user":{"_id":"680a2f6202f8f5eddd0f0873","avatarUrl":"/avatars/d71093b223d532599387287a20d15c52.svg","isPro":false,"fullname":"N","user":"Gaetan10","type":"user","name":"Gaetan10"},"name":"Gaetan Narozniak","status":"claimed_verified","statusLastChangedAt":"2026-06-09T12:44:48.007Z","hidden":false},{"_id":"6a26d2a2da05d61ad5d10d49","name":"Gérard Biau","hidden":false},{"_id":"6a26d2a2da05d61ad5d10d4a","name":"Rémi Munos","hidden":false},{"_id":"6a26d2a2da05d61ad5d10d4b","name":"Ahmad Rammal","hidden":false},{"_id":"6a26d2a2da05d61ad5d10d4c","name":"Pierre Marion","hidden":false}],"publishedAt":"2026-05-29T00:00:00.000Z","submittedOnDailyAt":"2026-06-11T00:00:00.000Z","title":"Distilling LLM Feedback for Lean Theorem Proving","submittedOnDailyBy":{"_id":"680a2f6202f8f5eddd0f0873","avatarUrl":"/avatars/d71093b223d532599387287a20d15c52.svg","isPro":false,"fullname":"N","user":"Gaetan10","type":"user","name":"Gaetan10"},"summary":"Post-training for reasoning models typically combines supervised fine-tuning with reinforcement learning from verifiable rewards, most commonly with GRPO. However, this algorithm suffers from sparse rewards, limited exploration, and mode collapse. Building upon recent works on self-distillation, we propose Feedback Distillation, a training method where the model is trained to match, at the token level, its own distribution conditioned on privileged feedback produced by a language model. Feedback Distillation offers token-level supervision and can inject external knowledge. Evaluating our method for Lean4 theorem-proving, we find that Feedback Distillation maintains greater diversity in generated trajectories than GRPO, yielding higher policy entropy and better pass@k scaling. The two methods are complementary: initializing GRPO from a Feedback Distillation checkpoint outperforms either method alone. All in all, our results suggest a promising avenue to improve post-training for complex reasoning.","upvotes":1,"discussionId":"6a26d2a2da05d61ad5d10d4d","ai_summary":"Feedback Distillation improves post-training of reasoning models by using self-distillation with token-level supervision and privileged feedback from language models, offering better diversity and complementary benefits when combined with GRPO.","ai_keywords":["supervised fine-tuning","reinforcement learning","verifiable rewards","GRPO","self-distillation","token-level supervision","privileged feedback","language model","Lean4 theorem-proving","policy entropy","pass@k scaling"],"ai_summary_model":"Qwen/Qwen2.5-Coder-32B-Instruct","organization":{"_id":"66b54027408752ae16404b05","name":"metaresearch","fullname":"Meta Research","avatar":"https://cdn-avatars.huggingface.co/v1/production/uploads/66b25f3f58babfaeb76112dc/2GmiaF075AZ7BcE538oPk.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6a2ae6c2e36bc84d91b6e7cc","avatarUrl":"/avatars/abf4b4c0020f9332b6827952cc53163e.svg","isPro":false,"fullname":"mmgood","user":"mmgood","type":"user"}],"acceptLanguages":["en"],"dailyPaperRank":0,"organization":{"_id":"66b54027408752ae16404b05","name":"metaresearch","fullname":"Meta Research","avatar":"https://cdn-avatars.huggingface.co/v1/production/uploads/66b25f3f58babfaeb76112dc/2GmiaF075AZ7BcE538oPk.png"},"markdownContentUrl":"https://huggingface.co/buckets/huggingchat/papers-content/resolve/2605/2605.30861.md"}">

Papers

arxiv:2605.30861

Distilling LLM Feedback for Lean Theorem Proving

Published on May 29

· Submitted by

N on Jun 11

Meta Research

Upvote

Authors:

Gaetan Narozniak ,

Abstract

Feedback Distillation improves post-training of reasoning models by using self-distillation with token-level supervision and privileged feedback from language models, offering better diversity and complementary benefits when combined with GRPO.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

View arXiv page View PDF Add to collection

Community

Gaetan10

Paper author Paper submitter about 13 hours ago

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Get this paper in your agent:

hf papers read 2605.30861

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2605.30861 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2605.30861 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2605.30861 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.

Discussion (0)

No comments yet. Sign in and be the first to say something.

Distilling LLM Feedback for Lean Theorem Proving

Abstract

Community

Models citing this paper 0

Datasets citing this paper 0

Spaces citing this paper 0

Collections including this paper 0

Discussion (0)

More from Hugging Face Daily Papers