Hugging Face Daily Papers · 3 min read

HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

We propose HINT-SD, a targeted hindsight self-distillation framework for long-horizon agents that improves learning by identifying and correcting only the actions responsible for task failure. Instead of distilling entire trajectories, HINT-SD performs hindsight analysis to isolate failure-critical decisions and conducts self-distillation on each targeted turn, with the teacher conditioned on generated hindsight feedback.
arXiv:2605.17873

Published on May 18 · Submitted by Woongyeong Yeo on May 25
Authors: Woongyeong Yeo, Yumin Choi, Taekyung Ki, Sung Ju Hwang

Abstract

AI-generated summary: HINT-SD is a targeted self-distillation framework that selects failure-relevant actions from full trajectories to improve long-horizon LLM agent training efficiency and effectiveness.

Training long-horizon LLM agents with reinforcement learning is challenging because sparse outcome rewards reveal whether a task succeeds, but not which intermediate actions caused the outcome or how they should be corrected. Recent methods alleviate this issue by generating rewards or textual hints from turn-level action-output signals, or by using feedback-conditioned self-distillation. However, generating feedback at every turn is inefficient when many intermediate turns are already successful or neutral, and applying feedback at a fixed or misaligned turn often fails to supervise the actions that contributed to the failure. To bridge this gap, we propose HINT-SD, a targeted self-distillation framework that uses full-trajectory hindsight to select failure-relevant actions and applies feedback-conditioned distillation only on targeted action spans. Experiments on BFCL v3 and AppWorld show that our method improves over the dense per-turn feedback baseline by up to 18.80% while achieving 2.26× lower time per training step, suggesting that selecting where to distill is a key factor for both effective and efficient long-horizon agent training.
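
To make "distillation only on targeted action spans" concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the toy distributions, and the choice of KL(teacher || student) as the per-token distillation loss are all assumptions made for illustration. The abstract does not specify the loss. The point is only that tokens outside the selected turns contribute nothing to the objective.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def targeted_distillation_loss(teacher_probs, student_probs, turn_ids, targeted_turns):
    """Mean per-token KL(teacher || student), restricted to targeted turns.

    teacher_probs / student_probs: one next-token distribution per token position.
    turn_ids: the turn index each token position belongs to.
    targeted_turns: set of turn indices selected for distillation (hypothetical
    output of a hindsight step; chosen by hand in this sketch).
    """
    selected = [
        kl_divergence(t, s)
        for t, s, turn in zip(teacher_probs, student_probs, turn_ids)
        if turn in targeted_turns
    ]
    # With no targeted turn there is nothing to distill, so the loss is zero.
    return sum(selected) / len(selected) if selected else 0.0


# Toy trajectory: 4 token positions over 2 turns, vocabulary of size 2.
# Teacher and student agree on turn 0 and disagree on turn 1.
teacher_probs = [[0.9, 0.1], [0.9, 0.1], [0.2, 0.8], [0.2, 0.8]]
student_probs = [[0.9, 0.1], [0.9, 0.1], [0.7, 0.3], [0.6, 0.4]]
turn_ids = [0, 0, 1, 1]

loss_targeted = targeted_distillation_loss(teacher_probs, student_probs, turn_ids, {1})
loss_none = targeted_distillation_loss(teacher_probs, student_probs, turn_ids, set())
print(loss_targeted, loss_none)
```

In a real training loop the same idea would typically be a boolean mask over token positions applied to a batched loss tensor. The list-based version above just keeps the example dependency-free.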

Community

Paper submitter about 8 hours ago

We propose HINT-SD, a targeted hindsight self-distillation framework for long-horizon agents that improves learning by identifying and correcting only the actions responsible for task failure. Instead of distilling entire trajectories, HINT-SD performs hindsight analysis to isolate failure-critical decisions and conducts self-distillation on each targeted turn, with the teacher conditioned on generated hindsight feedback.
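
The data flow described in this comment can be sketched roughly as follows. Everything here is hypothetical: the `Turn` and `DistillExample` types, the `build_targeted_examples` helper, and the prompt layout are illustrative assumptions. The `hindsight` dictionary is a hand-written stand-in for whatever the hindsight analysis actually produces (presumably a model-generated judgment over the full failed trajectory). The sketch only shows that the teacher sees the original context plus feedback, the student sees the original context alone, and unflagged turns yield no training example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Turn:
    context: str  # everything the agent saw before acting at this turn
    action: str   # the action the agent emitted


@dataclass
class DistillExample:
    turn_index: int
    student_input: str  # original context, without any hint
    teacher_input: str  # same context with hindsight feedback appended


def build_targeted_examples(
    trajectory: List[Turn], hindsight: Dict[int, str]
) -> List[DistillExample]:
    """Create one distillation example per turn flagged by hindsight.

    hindsight maps a turn index to feedback text for that turn; turns that
    are absent from the mapping are skipped entirely.
    """
    examples = []
    for i, turn in enumerate(trajectory):
        feedback = hindsight.get(i)
        if feedback is None:
            continue  # not failure-relevant: no feedback, no distillation
        teacher_input = f"{turn.context}\n[Hindsight feedback] {feedback}"
        examples.append(DistillExample(i, turn.context, teacher_input))
    return examples


# Toy failed trajectory with three turns; only turn 1 is flagged.
trajectory = [
    Turn("User: book a flight to Seoul", "search_flights(dest='ICN')"),
    Turn("Tool: 3 flights found", "book_flight(id=None)"),
    Turn("Tool: error, missing flight id", "finish()"),
]
hindsight = {1: "The booking call omitted the flight id returned by the search."}

examples = build_targeted_examples(trajectory, hindsight)
for ex in examples:
    print(ex.turn_index, ex.teacher_input, sep="\n")
```

A dense per-turn baseline would instead produce feedback and a teacher input for all three turns, which is the cost the targeted variant is meant to avoid.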


Get this paper in your agent:

hf papers read 2605.17873
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
