Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
Abstract
AI-generated summary: Anti-Self-Distillation reverses the direction of knowledge transfer in self-distillation to improve math reasoning efficiency and accuracy.
On-policy self-distillation, where a student is pulled toward a copy of itself conditioned on privileged context (e.g., a verified solution or feedback), offers a promising direction for advancing reasoning capability without a stronger external teacher. Yet in math reasoning the gains are inconsistent, even when the same approach succeeds elsewhere. A pointwise mutual information analysis traces the failure to the privileged context itself: it inflates the teacher's confidence on tokens already implied by the solution (structural connectives, verifiable claims) and deflates it on deliberation tokens ("Wait", "Let", "Maybe") that drive multi-step search. We propose Anti-Self-Distillation (AntiSD), which ascends a divergence between student and teacher rather than descending it: this reverses the per-token sign and yields a naturally bounded advantage in one step. An entropy-triggered gate disables the term once the teacher entropy collapses, completing a drop-in replacement for default self-distillation. Across five models from 4B to 30B parameters on math reasoning benchmarks, AntiSD reaches the GRPO baseline's accuracy in 2 to 10x fewer training steps and improves final accuracy by up to 11.5 points. AntiSD opens a path to scalable self-improvement, where a language model bootstraps its own reasoning through its training signal.
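To make the pointwise mutual information argument concrete, here is a minimal sketch. It assumes the per-token PMI between a sampled token and the privileged context is the log-ratio of the teacher's probability (model conditioned on the context) to the student's probability (same model without it). That reading follows from the abstract's description of the teacher as a context-conditioned copy of the student, but the function name and the toy probabilities are hypothetical and not taken from the paper or its repository.

```python
import math


def pointwise_mutual_information(teacher_logprobs, student_logprobs):
    """Per-token PMI between each token and the privileged context.

    Assumed definition: log p_teacher(token) - log p_student(token),
    where the teacher sees the privileged context and the student does not.
    """
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]


# Toy, made-up probabilities for two tokens (illustrative only).
tokens = ["Therefore", "Wait"]
teacher_p = [0.90, 0.05]  # teacher is sharper on the structural connective
student_p = [0.60, 0.20]  # student keeps more mass on the deliberation token

pmi = pointwise_mutual_information(
    [math.log(p) for p in teacher_p],
    [math.log(p) for p in student_p],
)

for token, value in zip(tokens, pmi):
    # Positive PMI: the context inflates the token; negative: it deflates it.
    print(f"{token!r}: PMI = {value:+.3f}")
```

Under these toy numbers the connective gets a positive PMI and the deliberation token a negative one, which is the pattern the abstract attributes to the privileged context.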
Community
AntiSD reaches GRPO's accuracy in 2–10× fewer training steps and improves final accuracy by up to +11.5 points on AIME 2024/2025, HMMT 2025, and BeyondAIME — consistent across 4B–30B dense and MoE models.
Standard self-distillation in reasoning RL pulls the student toward a teacher conditioned on a verified solution. The privileged context makes the teacher sharp on template tokens but unsure on the deliberation tokens — "Wait", "Let", "Maybe" — that drive multi-step search; descending its divergence reinforces templates at the cost of reasoning.
AntiSD flips the sign: instead of descending the divergence, we ascend a bounded Jensen–Shannon divergence between student and teacher, with an entropy-triggered gate that disables the term once the teacher's entropy collapses. No token-level reward shaping, no length normalization, no schedule heuristics.
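The two ingredients named above, a bounded Jensen–Shannon divergence and an entropy-triggered gate, can be sketched for a single token position as follows. This is an illustration under stated assumptions, not the released implementation: the function names, the `entropy_floor` threshold and its default value are hypothetical, and the sketch does not show how the term enters the RL objective.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl(p, q):
    """KL(p || q); assumes q > 0 wherever p > 0."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence, bounded above by log(2) in nats."""
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def gated_js_term(student, teacher, entropy_floor=0.05):
    """Per-token JS term, switched off when the teacher has collapsed.

    `entropy_floor` is a hypothetical threshold chosen for illustration.
    """
    if entropy(teacher) < entropy_floor:
        return 0.0  # gate closed: teacher entropy has collapsed
    return js_divergence(student, teacher)


# Toy next-token distributions over a 3-token vocabulary (made up).
student = [0.5, 0.3, 0.2]
teacher = [0.8, 0.1, 0.1]
collapsed_teacher = [1.0, 0.0, 0.0]

print(gated_js_term(student, teacher))            # small positive value
print(gated_js_term(student, collapsed_teacher))  # gate closed
```

Because the Jensen–Shannon divergence never exceeds log 2, the term stays bounded regardless of how far student and teacher drift apart, which is one way to read the "naturally bounded" claim in the abstract.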
Code: https://github.com/FloyedShen/AntiSD
Paper: https://www.alphaxiv.org/abs/2605.11609