Hugging Face Daily Papers · 4 min read

ReNIO: Reweighting Negative Trajectory Importance for LLM On-Policy Distillation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.23104

Published on Jun 22, 2026 · Submitted by Chen Lin on Jun 25, 2026
Authors: Chen Lin, Kedi Chen, Wei Zhang

Abstract

ReNIO enhances on-policy distillation for language models by reweighting negative trajectories based on token-level probability ratios, improving reasoning performance in mathematical and code generation tasks.

On-policy distillation (OPD) improves LLM reasoning by training a student model on its own generated outputs, but standard OPD treats all student-generated outputs (SGOs) equally regardless of their informativeness. We observe a consistent asymmetry in controlled filtering experiments: in both OPD and on-policy self-distillation (OPSD), training only on incorrect SGOs outperforms training only on correct ones. Our further analysis suggests that models trained on correct-only SGOs tend to generate shorter reasoning traces and show weaker reflection behavior, while incorrect SGOs better preserve exploratory reasoning near the model's capability boundary. To exploit this signal without requiring full answer-containing rollouts, we introduce ReNIO, which Reweights Negative trajectory Importance for LLM On-policy distillation. Using the student-to-teacher probability ratio, ReNIO identifies pivotal tokens that lead to wrong reasoning traces and aggregates their information into a normalized sample weight, inherently assigning larger weights to likely negative trajectories without observing the correctness of the final answer. Since ReNIO only uses prefix-conditioned token probabilities, it preserves OPD's prefix training advantage over full-rollout reinforcement learning. Across both mathematical reasoning and code generation tasks, ReNIO improves both OPD and OPSD, with representative relative gains of up to 8.90% for Qwen3-1.7B and 10.00% for R1-Distill-Qwen-7B on mathematical reasoning benchmarks. Code repo: https://github.com/BDML-lab/ReNIO.
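
The reweighting idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation, and the abstract does not give the exact formulas. Everything below is an assumption made for illustration: the function names, the rule that a token is "pivotal" when the student-minus-teacher log-probability gap exceeds `pivot_threshold`, the mean over pivotal tokens as the aggregation, and the batch softmax (rescaled to average 1) as the normalization. The only parts taken from the source are the ingredients themselves: a student-to-teacher probability ratio per token, an aggregation into one normalized weight per sample, and the use of prefix tokens only, with no correctness label.

```python
import math
from typing import List, Sequence


def sample_scores(
    student_logps: Sequence[Sequence[float]],
    teacher_logps: Sequence[Sequence[float]],
    pivot_threshold: float = 1.0,  # hypothetical hyperparameter
) -> List[float]:
    """Score each student-generated prefix from token-level log-ratios.

    Inputs are log-probabilities that the student and the teacher assign to
    the tokens the student actually sampled. No final answer is needed.
    Assumption: a token is "pivotal" when log p_student - log p_teacher is
    above pivot_threshold, i.e. the student commits to a token the teacher
    finds unlikely.
    """
    scores = []
    for s_seq, t_seq in zip(student_logps, teacher_logps):
        if len(s_seq) != len(t_seq):
            raise ValueError("student and teacher sequences must align")
        log_ratios = [s - t for s, t in zip(s_seq, t_seq)]
        pivotal = [r for r in log_ratios if r > pivot_threshold]
        # Assumed aggregation: mean log-ratio over pivotal tokens, 0 if none.
        scores.append(sum(pivotal) / len(pivotal) if pivotal else 0.0)
    return scores


def normalized_weights(scores: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Assumed normalization: softmax over the batch, rescaled to mean 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [len(scores) * e / total for e in exps]


def weighted_loss(per_sample_losses: Sequence[float], weights: Sequence[float]) -> float:
    """Batch loss where each sample's distillation loss is scaled by its weight."""
    return sum(w * l for w, l in zip(weights, per_sample_losses)) / len(weights)


# Toy batch of two prefixes with three tokens each (made-up numbers).
student = [[-0.1, -0.2, -0.1], [-0.1, -0.3, -0.2]]
teacher = [[-0.2, -0.3, -0.1], [-0.1, -3.3, -2.7]]  # teacher rejects prefix 2
scores = sample_scores(student, teacher)
weights = normalized_weights(scores)
loss = weighted_loss([0.5, 0.5], weights)
```

In this toy batch the second prefix contains tokens the teacher considers unlikely, so it receives the larger weight. Under these assumptions, that is how a likely negative trajectory would be emphasized without checking its final answer.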

Community

Chen Lin (paper author, paper submitter) · Jun 25, 2026

We find that negative trajectories are more important for LLM on-policy distillation than positive ones, and propose ReNIO, a simple reweighting method to exploit them. Since ReNIO does not rely on trajectory correctness signals, it naturally supports efficient prefix-level distillation, preserving the efficiency advantage of OPD over RL.
