Hugging Face Daily Papers · June 9, 2026 · 5 min read

PBSD: Privileged Bayesian Self-Distillation for Long-Horizon Credit Assignment

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Long-horizon agentic tasks pose a fundamental credit assignment challenge for outcome-based reinforcement learning: trajectory-level rewards verify final correctness but provide limited guidance on which intermediate reasoning steps or tool interactions contribute to the outcome. The difficulty is especially pronounced in multi-turn search agents, where successful trajectories may contain misleading actions and failed trajectories may contain valuable evidence-gathering steps. We propose PBSD (Privileged Bayesian Self-Distillation), a Bayes-calibrated self-distillation method for fine-grained credit assignment under sparse final rewards. PBSD measures trajectory quality through the posterior-to-prior probability ratio of the verified answer and applies Bayes' rule to convert this hard-to-estimate answer-side ratio into a tractable likelihood ratio between a standard student model and a privileged answer-conditioned teacher model. Autoregressive decomposition of this Bayesian evidence score yields turn-level signals that identify whether each intermediate turn supports or undermines the verified outcome. Consequently, PBSD provides a principled and elegant reweighting scheme that transforms sparse outcome supervision into Bayes-calibrated turn-level credit signals, while remaining fully compatible with standard policy optimization. Experiments demonstrate that PBSD consistently enhances performance across both in-domain and out-of-domain settings, and effectively transfers knowledge from short-context training to long-context inference, suggesting that its fine-grained credit assignment mechanism facilitates more effective policy learning and yields improved generalization.
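The Bayes'-rule step described in the abstract can be written out as a short derivation. This is a sketch under assumed notation, not the paper's own formulation: q is the question, a the verified answer, and τ = (τ_1, ..., τ_T) a trajectory split into T turns. None of these symbols appear on this page. The abstract also does not say how tool observations are treated or how the resulting scores enter the policy update, so both are left open here.

```latex
% Bayes' rule, p(a | q, tau) = p(tau | q, a) p(a | q) / p(tau | q), rearranged:
\[
  \underbrace{\frac{p(a \mid q, \tau)}{p(a \mid q)}}_{\text{posterior-to-prior ratio of the answer}}
  \;=\;
  \underbrace{\frac{p(\tau \mid q, a)}{p(\tau \mid q)}}_{\text{teacher-to-student likelihood ratio}}
\]

% Chain rule over turns applied to numerator and denominator, then logs:
\[
  \log \frac{p(\tau \mid q, a)}{p(\tau \mid q)}
  \;=\;
  \sum_{t=1}^{T} s_t,
  \qquad
  s_t \;=\; \log \frac{p(\tau_t \mid q, a, \tau_{<t})}{p(\tau_t \mid q, \tau_{<t})}
\]
```

Under this reading, s_t > 0 means the answer-conditioned teacher finds turn t more likely than the student does, so the turn supports the verified answer; s_t < 0 means the turn undermines it.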

arxiv:2606.09348

Published on Jun 8 · Submitted by Yang Tian on Jun 9

Shanghai Jiao Tong University

Authors: Yang Tian, Rui Wang, Xumeng Wen, Junjie Li, Shizhao Sun, Lei Song, Jiang Bian, Bo Zhao

Abstract

Privileged Bayesian Self-Distillation enables fine-grained credit assignment in long-horizon tasks by converting sparse outcome rewards into calibrated turn-level signals through Bayesian evidence scoring and autoregressive decomposition.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct
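To make the "autoregressive decomposition" in the summary concrete, here is a minimal Python sketch of computing turn-level evidence scores from token log-probabilities. It is an illustration under assumptions, not the authors' implementation: the function name turn_evidence_scores, the input layout (one list of token log-probs per turn), and the numbers are all hypothetical, and how PBSD actually uses such scores to reweight the policy objective is not described on this page.

```python
from typing import List, Sequence, Tuple


def turn_evidence_scores(
    teacher_logprobs: Sequence[Sequence[float]],
    student_logprobs: Sequence[Sequence[float]],
) -> Tuple[List[float], float]:
    """Return per-turn log-likelihood ratios and their trajectory-level sum.

    teacher_logprobs[t][i]: log-prob of token i of turn t under the
        answer-conditioned (privileged) teacher. Hypothetical input layout.
    student_logprobs[t][i]: log-prob of the same token under the student,
        which does not see the verified answer.
    """
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("teacher and student must cover the same turns")

    scores: List[float] = []
    for teacher_turn, student_turn in zip(teacher_logprobs, student_logprobs):
        if len(teacher_turn) != len(student_turn):
            raise ValueError("each turn must have the same number of tokens")
        # Log-likelihood ratio of this turn: sum of token-level differences.
        scores.append(sum(teacher_turn) - sum(student_turn))

    # By the chain rule, the turn scores add up to the trajectory-level score.
    return scores, sum(scores)


# Made-up log-probs for a three-turn trajectory (illustration only).
teacher = [[-0.5, -0.25], [-2.0], [-0.125, -0.125]]
student = [[-1.0, -0.75], [-1.0], [-0.25, -0.5]]
scores, total = turn_evidence_scores(teacher, student)
print(scores, total)  # scores == [1.0, -1.0, 0.5], total == 0.5
```

In this toy trajectory the first and third turns get positive scores (they look more plausible once the answer is known) and the second gets a negative one, even though the trajectory-level score is positive overall.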

Get this paper in your agent:

hf papers read 2606.09348

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
