Hugging Face Daily Papers · 5 min read

PBSD: Privileged Bayesian Self-Distillation for Long-Horizon Credit Assignment

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.09348


Published on Jun 8 · Submitted by Yang Tian on Jun 9
Authors: Yang Tian, Rui Wang, Xumeng Wen, Junjie Li, Shizhao Sun, Lei Song, Jiang Bian, Bo Zhao

Abstract

Privileged Bayesian Self-Distillation enables fine-grained credit assignment in long-horizon tasks by converting sparse outcome rewards into calibrated turn-level signals through Bayesian evidence scoring and autoregressive decomposition.

Long-horizon agentic tasks pose a fundamental credit assignment challenge for outcome-based reinforcement learning: trajectory-level rewards verify final correctness but provide limited guidance on which intermediate reasoning steps or tool interactions contribute to the outcome. The difficulty is especially pronounced in multi-turn search agents, where successful trajectories may contain misleading actions and failed trajectories may contain valuable evidence-gathering steps. We propose PBSD (Privileged Bayesian Self-Distillation), a Bayes-calibrated self-distillation method for fine-grained credit assignment under sparse final rewards. PBSD measures trajectory quality through the posterior-to-prior probability ratio of the verified answer and applies Bayes' rule to convert this hard-to-estimate answer-side ratio into a tractable likelihood ratio between a standard student model and a privileged answer-conditioned teacher model. Autoregressive decomposition of this Bayesian evidence score yields turn-level signals that identify whether each intermediate turn supports or undermines the verified outcome. Consequently, PBSD provides a principled and elegant reweighting scheme that transforms sparse outcome supervision into Bayes-calibrated turn-level credit signals, while remaining fully compatible with standard policy optimization. Experiments demonstrate that PBSD consistently enhances performance across both in-domain and out-of-domain settings, and effectively transfers knowledge from short-context training to long-context inference, suggesting that its fine-grained credit assignment mechanism facilitates more effective policy learning and yields improved generalization.
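
The Bayes step described in the abstract can be checked on a toy example. The sketch below is an illustration under stated assumptions, not the paper's implementation: it uses a small random joint table p(a, y1, y2) over a hypothetical answer a and two hypothetical turns y1, y2, omits the question and any tool observations, and treats the exact conditionals p(y_t | y_<t, a) and p(y_t | y_<t) as stand-ins for the answer-conditioned teacher and the standard student. All names (joint, turn_scores, answer_side_log_ratio) are made up for this example. It shows that the answer-side log-ratio log p(a | y1, y2) - log p(a) equals the sum of per-turn teacher-minus-student log-ratios.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint distribution p(a, y1, y2): 3 candidate answers, 2 turns, 4 actions per turn.
# This table is an assumption made purely for illustration.
joint = rng.random((3, 4, 4))
joint /= joint.sum()


def turn_scores(joint, a, y1, y2):
    """Per-turn log-ratios: log p(y_t | y_<t, a) - log p(y_t | y_<t)."""
    p_a = joint.sum(axis=(1, 2))      # p(a)
    p_ay1 = joint.sum(axis=2)         # p(a, y1)
    p_y1 = joint.sum(axis=(0, 2))     # p(y1)
    p_y1y2 = joint.sum(axis=0)        # p(y1, y2)

    # "Teacher" conditionals: privileged access to the answer a.
    t1 = p_ay1[a, y1] / p_a[a]            # p(y1 | a)
    t2 = joint[a, y1, y2] / p_ay1[a, y1]  # p(y2 | y1, a)

    # "Student" conditionals: no access to the answer.
    s1 = p_y1[y1]                         # p(y1)
    s2 = p_y1y2[y1, y2] / p_y1[y1]        # p(y2 | y1)

    return np.log(t1) - np.log(s1), np.log(t2) - np.log(s2)


def answer_side_log_ratio(joint, a, y1, y2):
    """Posterior-to-prior log-ratio of the answer: log p(a | y1, y2) - log p(a)."""
    p_a = joint.sum(axis=(1, 2))
    posterior = joint[a, y1, y2] / joint.sum(axis=0)[y1, y2]
    return np.log(posterior) - np.log(p_a[a])


a, y1, y2 = 1, 2, 3
scores = turn_scores(joint, a, y1, y2)
total = sum(scores)
answer_side = answer_side_log_ratio(joint, a, y1, y2)

# A positive per-turn score means the turn is more likely once the answer is known.
print("turn-level scores:", scores)
print("sum of turn scores:", total)
print("answer-side log-ratio:", answer_side)
```

In this toy setting the two printed totals agree up to floating-point error, because Bayes' rule and the chain rule are exact on a known joint table. With learned models, the student and teacher only approximate these conditionals, and how the per-turn scores are turned into training weights is specified in the paper rather than here.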
