Hugging Face Daily Papers · 5 min read

Process Rewards with Learned Reliability

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Process Reward Models (PRMs) provide step-level feedback for reasoning, but current PRMs usually output only a single reward score for each step. Downstream methods must therefore treat imperfect step-level reward predictions as reliable decision signals, with no indication of when these predictions should be trusted. We propose BetaPRM, a distributional PRM that predicts both a step-level success probability and the reliability of that prediction. Given step-success supervision from Monte Carlo continuations, BetaPRM learns a Beta belief that explains the observed number of successful continuations through a Beta-Binomial likelihood, rather than regressing to the finite-sample success ratio as a point target. This learned reliability signal indicates when a step reward should be trusted, enabling downstream applications to distinguish reliable rewards from uncertain ones. As one application, we introduce Adaptive Computation Allocation (ACA) for PRM-guided Best-of-N reasoning. ACA uses the learned reliability signal to stop when a high-reward solution is reliable and to spend additional computation on uncertain candidate prefixes. Experiments across four backbones and four reasoning benchmarks show that BetaPRM improves PRM-guided Best-of-N selection while preserving standard step-level error detection. Built on this signal, ACA improves the accuracy-token tradeoff over fixed-budget Best-of-16, reducing token usage by up to 33.57% while improving final-answer accuracy.

Code: https://github.com/JinYuanLi0012/Beta-Binomial-PRM
arxiv:2605.15529

Process Rewards with Learned Reliability

Published on May 15 · Submitted by Jinyuan Li on May 20
Authors: Jinyuan Li, Langlin Huang, Chengsong Huang, Shaoyang Xu, Donghong Cai, Yuyi Yang, Wenxuan Zhang, Jiaxin Huang

Abstract

AI-generated summary: BetaPRM introduces a distributional approach to process reward models that predicts both success probabilities and prediction reliability, enabling adaptive computation allocation that reduces token usage while maintaining accuracy.

Process Reward Models (PRMs) provide step-level feedback for reasoning, but current PRMs usually output only a single reward score for each step. Downstream methods must therefore treat imperfect step-level reward predictions as reliable decision signals, with no indication of when these predictions should be trusted. We propose BetaPRM, a distributional PRM that predicts both a step-level success probability and the reliability of that prediction. Given step-success supervision from Monte Carlo continuations, BetaPRM learns a Beta belief that explains the observed number of successful continuations through a Beta-Binomial likelihood, rather than regressing to the finite-sample success ratio as a point target. This learned reliability signal indicates when a step reward should be trusted, enabling downstream applications to distinguish reliable rewards from uncertain ones. As one application, we introduce Adaptive Computation Allocation (ACA) for PRM-guided Best-of-N reasoning. ACA uses the learned reliability signal to stop when a high-reward solution is reliable and to spend additional computation on uncertain candidate prefixes. Experiments across four backbones and four reasoning benchmarks show that BetaPRM improves PRM-guided Best-of-N selection while preserving standard step-level error detection. Built on this signal, ACA improves the accuracy-token tradeoff over fixed-budget Best-of-16, reducing token usage by up to 33.57% while improving final-answer accuracy.
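The Beta-Binomial objective mentioned in the abstract can be illustrated with a small sketch. It assumes the model emits two positive parameters, alpha and beta, for each step, and that supervision is the count k of successful continuations out of n Monte Carlo rollouts. The function names, the use of alpha + beta as a reliability proxy, and the thresholds in the stopping check are all hypothetical choices made for illustration; they are not taken from the paper or its repository.

```python
import math


def log_beta(a: float, b: float) -> float:
    """Log of the Beta function B(a, b), computed via log-gamma for stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_nll(k: int, n: int, alpha: float, beta: float) -> float:
    """Negative log-likelihood of k successes in n rollouts under a Beta(alpha, beta) belief.

    P(k | n, alpha, beta) = C(n, k) * B(k + alpha, n - k + beta) / B(alpha, beta)
    """
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_lik = log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)
    return -log_lik


def belief_summary(alpha: float, beta: float) -> tuple[float, float]:
    """Return (mean success probability, concentration) of the Beta belief.

    Assumption: concentration alpha + beta is used here as a stand-in for
    "reliability"; the paper may define its reliability signal differently.
    """
    return alpha / (alpha + beta), alpha + beta


def is_reliable_high_reward(
    alpha: float,
    beta: float,
    reward_threshold: float = 0.8,    # hypothetical threshold
    min_concentration: float = 10.0,  # hypothetical threshold
) -> bool:
    """Toy stopping check: accept a candidate only if its reward is high AND trusted."""
    mean, concentration = belief_summary(alpha, beta)
    return mean >= reward_threshold and concentration >= min_concentration
```

Two beliefs with the same mean, such as Beta(36, 4) and Beta(0.9, 0.1), both predict a success probability of 0.9, but only the first passes the toy check above, because its concentration is much higher. A point-valued PRM could not tell those two cases apart, which is the distinction the abstract draws.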

Get this paper in your agent:

hf papers read 2605.15529
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
