Hugging Face Daily Papers · 4 min read

Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Authors: Guobin Shen, Xiang Cheng, Chenxiao Zhao, Lei Huang, Jindong Li, Dongcheng Zhao, Xing Yu (rednote-hilab)
arxiv:2605.11609

Published on May 12, 2026 · Submitted by floyed shen on May 20, 2026 · #2 Paper of the day

AI-generated summary: Anti-Self-Distillation reverses the direction of knowledge transfer in self-distillation to improve math reasoning efficiency and accuracy.

Abstract

On-policy self-distillation, where a student is pulled toward a copy of itself conditioned on privileged context (e.g., a verified solution or feedback), offers a promising direction for advancing reasoning capability without a stronger external teacher. Yet in math reasoning the gains are inconsistent, even when the same approach succeeds elsewhere. A pointwise mutual information analysis traces the failure to the privileged context itself: it inflates the teacher's confidence on tokens already implied by the solution (structural connectives, verifiable claims) and deflates it on deliberation tokens ("Wait", "Let", "Maybe") that drive multi-step search. We propose Anti-Self-Distillation (AntiSD), which ascends a divergence between student and teacher rather than descending it: this reverses the per-token sign and yields a naturally bounded advantage in one step. An entropy-triggered gate disables the term once the teacher entropy collapses, completing a drop-in replacement for default self-distillation. Across five models from 4B to 30B parameters on math reasoning benchmarks, AntiSD reaches the GRPO baseline's accuracy in 2 to 10x fewer training steps and improves final accuracy by up to 11.5 points. AntiSD opens a path to scalable self-improvement, where a language model bootstraps its own reasoning through its training signal.
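
As a reading aid, the pointwise mutual information between a response token and the privileged context can be written as below. The notation is assumed here for illustration and may differ from the paper's: $x$ is the prompt, $c$ the privileged context, $y_t$ the token at position $t$, and $\pi_\theta$ the shared model, so the numerator is the teacher's probability and the denominator the student's.

```latex
\mathrm{PMI}(y_t ; c \mid x, y_{<t})
  = \log \frac{\pi_\theta(y_t \mid x, c, y_{<t})}{\pi_\theta(y_t \mid x, y_{<t})}
```

Under this reading, a positive value marks a token that the context makes more likely (teacher more confident than student), and a negative value marks a token that it makes less likely, which is the pattern the abstract reports for deliberation tokens.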

Community

Paper submitter about 11 hours ago

AntiSD reaches GRPO's accuracy in 2–10× fewer training steps and improves final accuracy by up to +11.5 points on AIME 2024/2025, HMMT 2025, and BeyondAIME — consistent across 4B–30B dense and MoE models.

Standard self-distillation in reasoning RL pulls the student toward a teacher conditioned on a verified solution. The privileged context makes the teacher sharp on template tokens but unsure about the deliberation tokens ("Wait", "Let", "Maybe") that drive multi-step search, so descending the student-teacher divergence reinforces templates at the cost of reasoning.
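
The asymmetry described here can be illustrated with a toy per-token comparison. This is a minimal sketch, not the authors' analysis code: the function name `token_pmi` is hypothetical and the probabilities are made up purely to show the sign pattern.

```python
import math

def token_pmi(teacher_prob: float, student_prob: float) -> float:
    """Log-ratio of the teacher's to the student's probability for one token."""
    return math.log(teacher_prob) - math.log(student_prob)

# Made-up probabilities for illustration only (not taken from the paper).
# Each entry maps a token to (teacher probability, student probability).
toy = {
    "Therefore": (0.90, 0.60),  # template-like token: teacher is sharper
    "Wait": (0.05, 0.20),       # deliberation token: teacher is less sure
}

pmi = {tok: token_pmi(t, s) for tok, (t, s) in toy.items()}

for tok, value in pmi.items():
    print(f"{tok}: {value:+.3f}")
```

With these toy numbers the template-like token gets a positive log-ratio and the deliberation token a negative one, so pulling the student toward the teacher would raise the former and suppress the latter.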

AntiSD flips the sign: instead of descending the divergence, we ascend a bounded Jensen–Shannon divergence between student and teacher, with an entropy-triggered gate that disables the term once the teacher's entropy collapses. No token-level reward shaping, no length normalization, no schedule heuristics.
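
To make the two ingredients concrete, here is a minimal sketch of a Jensen–Shannon divergence between two next-token distributions together with an entropy gate. It is an illustration under stated assumptions, not the implementation in the linked repository: the names `js_divergence` and `gated_js`, the threshold `entropy_floor`, and the hard on/off form of the gate are all hypothetical, and how the term enters the per-token advantage is not shown.

```python
import math

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    """KL(p || q); terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence, which lies in [0, ln 2]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def gated_js(student, teacher, entropy_floor=0.1):
    """Assumed gate: drop the term when teacher entropy is below entropy_floor."""
    if entropy(teacher) < entropy_floor:
        return 0.0
    return js_divergence(student, teacher)

# Toy two-token distributions, chosen only for illustration.
print(gated_js([0.9, 0.1], [0.5, 0.5]))  # gate open: positive, at most ln 2
print(gated_js([0.5, 0.5], [1.0, 0.0]))  # teacher entropy is 0: term disabled
```

The bound is what makes ascent safe in this sketch: unlike a KL term, the quantity being increased cannot exceed ln 2, and the gate switches it off where the teacher has become deterministic.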

Code: https://github.com/FloyedShen/AntiSD
Paper: https://www.alphaxiv.org/abs/2605.11609

Comment from urroxyz: Innovative and useful.


Get this paper in your agent:

hf papers read 2605.11609
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
