Hugging Face Daily Papers · May 25, 2026 · 3 min read

HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


We propose HINT-SD, a targeted hindsight self-distillation framework for long-horizon agents that improves learning by identifying and correcting only the actions responsible for task failure. Instead of distilling entire trajectories, HINT-SD performs hindsight analysis to isolate failure-critical decisions and conducts self-distillation on each targeted turn, with the teacher conditioned on generated hindsight feedback.


arxiv:2605.17873


Published on May 18 · Submitted by Woongyeong Yeo on May 25 · KAIST AI


Authors: Woongyeong Yeo, Yumin Choi, Taekyung Ki, Sung Ju Hwang

Abstract

AI-generated summary: HINT-SD is a targeted self-distillation framework that selects failure-relevant actions from full trajectories to improve long-horizon LLM agent training efficiency and effectiveness.

Training long-horizon LLM agents with reinforcement learning is challenging because sparse outcome rewards reveal whether a task succeeds, but not which intermediate actions caused the outcome or how they should be corrected. Recent methods alleviate this issue by generating rewards or textual hints from turn-level action-output signals, or by using feedback-conditioned self-distillation. However, generating feedback at every turn is inefficient when many intermediate turns are already successful or neutral, and applying feedback at a fixed or misaligned turn often fails to supervise the actions that contributed to the failure. To bridge this gap, we propose HINT-SD, a targeted self-distillation framework that uses full-trajectory hindsight to select failure-relevant actions and applies feedback-conditioned distillation only on targeted action spans. Experiments on BFCL v3 and AppWorld show that our method improves over the dense per-turn feedback baseline by up to 18.80% while achieving 2.26× lower time per training step, suggesting that selecting where to distill is a key factor for both effective and efficient long-horizon agent training.
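The idea of distilling only on failure-relevant turns can be illustrated with a small sketch. This is not the authors' implementation: the `Turn` type, the keyword-based `select_failure_turns` heuristic (a stand-in for the paper's full-trajectory hindsight analysis, which would presumably be model-generated), the `teacher_probs_fn` callback, and the choice of KL(teacher || student) as the loss are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Turn:
    action: str       # what the agent did at this turn
    observation: str  # what the environment returned


def select_failure_turns(trajectory, succeeded):
    """Toy stand-in for hindsight analysis (hypothetical heuristic).

    Looks at the whole finished trajectory and returns
    {turn_index: feedback_text} for turns judged failure-relevant.
    Here a turn is flagged if its observation mentions an error.
    """
    if succeeded:
        return {}
    return {
        i: f"In hindsight, turn {i} failed with: {turn.observation}"
        for i, turn in enumerate(trajectory)
        if "error" in turn.observation.lower()
    }


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def targeted_distillation_loss(trajectory, succeeded, student_probs, teacher_probs_fn):
    """Average KL(teacher || student) over the targeted turns only.

    student_probs[i] is the student's action distribution at turn i.
    teacher_probs_fn(i, feedback) is a hypothetical callback returning the
    same model's distribution at turn i when conditioned on the feedback.
    Returns (loss, sorted list of targeted turn indices).
    """
    targets = select_failure_turns(trajectory, succeeded)
    if not targets:
        return 0.0, []
    selected = sorted(targets)
    losses = [
        kl_divergence(teacher_probs_fn(i, targets[i]), student_probs[i])
        for i in selected
    ]
    return sum(losses) / len(losses), selected


# Toy usage: a three-turn failed episode where only the middle turn errs.
trajectory = [
    Turn("list_files()", "ok: 3 files"),
    Turn("open_file('reprot.txt')", "Error: file not found"),
    Turn("summarize()", "ok: empty summary"),
]
student_probs = [[0.5, 0.5]] * 3  # student is undecided at every turn


def toy_teacher(turn_index, feedback):
    # With hindsight feedback, the teacher prefers the first candidate action.
    return [0.9, 0.1]


loss, selected = targeted_distillation_loss(
    trajectory, False, student_probs, toy_teacher
)
print(selected, round(loss, 4))
```

In this sketch the teacher callback is invoked only for the flagged turn, whereas a dense per-turn scheme would generate feedback and a teacher distribution at all three turns. That difference is the kind of saving the abstract attributes to selecting where to distill.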
