Hugging Face Daily Papers · 5 min read

Trajectory-Refined Distillation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.08432

Published on Jun 7, 2026 · Submitted by Jiang on Jun 9, 2026
Authors: Li Jiang, Haoran Xu, Yichuan Ding, Amy Zhang

Abstract

On-policy distillation suffers from prefix failure, in which dense token-level supervision creates fragmented gradients; trajectory-refined distillation addresses this by correcting student rollouts at the trajectory level before distillation.

On-policy distillation (OPD) has become a central post-training tool for large language models (LLMs), providing dense per-token teacher supervision along the student's own rollouts. In this work, we identify a common structural cause of failures in OPD, which we call prefix failure. Under prefix failure, dense per-token supervision induces a bimodal teacher mixture and fragmented gradients that token-level loss truncation or reweighting fails to address. This observation motivates us to move beyond token-level loss interventions toward trajectory-level output corrections. We thus propose Trajectory-Refined Distillation (TRD), a trajectory-level correction method that revises the student's rollout under teacher guidance while remaining within on-policy support. By correcting problematic prefixes before distillation, TRD mitigates prefix failure at its source. Moreover, TRD improves exploration by exposing the student to alternative valid derivations under teacher guidance, even when the original rollouts are already correct. TRD can also be applied to on-policy self-distillation (OPSD), a parameter-sharing variant that uses the student model conditioned on privileged information as the teacher. Across a wide range of benchmarks and base models at multiple scales, TRD consistently outperforms prior baselines, improving single-attempt accuracy and broadening reasoning coverage. Code is available at https://github.com/louieworth/trd.
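To make "dense per-token teacher supervision along the student's own rollouts" concrete, here is a minimal sketch of a per-token distillation signal. It assumes the loss is a reverse KL divergence, KL(student || teacher), computed at every position of a student-sampled sequence; that choice is a common one for on-policy distillation, but the abstract does not state the paper's exact objective. The function names and the toy logits are hypothetical and are not taken from the authors' repository.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def per_token_reverse_kl(student_logits, teacher_logits):
    """Return KL(student || teacher) at each position of one student rollout.

    Both arguments are lists of per-position logit lists. Assumption: both
    models were scored on the same student-sampled prefix, so position t is
    conditioned on the student's own tokens before t.
    """
    losses = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        s_logp = log_softmax(s_row)
        t_logp = log_softmax(t_row)
        # Expectation under the student distribution of the log-ratio.
        kl = sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s_logp, t_logp))
        losses.append(kl)
    return losses


# Toy rollout with made-up numbers: 3 positions, vocabulary of 3 tokens.
# The teacher agrees with the student at positions 0 and 2 and disagrees at 1.
student = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [1.0, 1.0, 1.0]]
teacher = [[2.0, 0.5, 0.1], [1.8, 0.1, 0.2], [1.0, 1.0, 1.0]]

token_losses = per_token_reverse_kl(student, teacher)
sequence_loss = sum(token_losses) / len(token_losses)
```

Every position contributes its own loss term, and each term is conditioned on whatever prefix the student happened to sample. This is the setting in which the abstract locates prefix failure: the teacher is asked to supervise continuations of a prefix that may already be problematic.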


