Process Rewards with Learned Reliability
Authors: Jinyuan Li, Langlin Huang, Chengsong Huang, Shaoyang Xu, Donghong Cai, Yuyi Yang, Wenxuan Zhang, Jiaxin Huang (HINT-lab, Huang's INTelligence lab)
Published: 2026-05-15
arXiv: 2605.15529
Code: https://github.com/JinYuanLi0012/Beta-Binomial-PRM
Abstract
AI-generated summary: BetaPRM is a distributional process reward model that predicts both a step-level success probability and the reliability of that prediction, enabling adaptive computation allocation that reduces token usage while improving final-answer accuracy.
Process Reward Models (PRMs) provide step-level feedback for reasoning, but current PRMs usually output only a single reward score per step. Downstream methods must therefore treat imperfect step-level reward predictions as reliable decision signals, with no indication of when those predictions can be trusted. We propose BetaPRM, a distributional PRM that predicts both a step-level success probability and the reliability of that prediction. Given step-success supervision from Monte Carlo continuations, BetaPRM learns a Beta belief that explains the observed number of successful continuations through a Beta-Binomial likelihood, rather than regressing to the finite-sample success ratio as a point target. The learned reliability signal indicates when a step reward should be trusted, so downstream applications can distinguish reliable rewards from uncertain ones. As one application, we introduce Adaptive Computation Allocation (ACA) for PRM-guided Best-of-N reasoning. ACA uses the reliability signal to stop early when a high-reward solution is reliable and to spend additional computation on uncertain candidate prefixes. Experiments across four backbones and four reasoning benchmarks show that BetaPRM improves PRM-guided Best-of-N selection while preserving standard step-level error detection. Built on this signal, ACA improves the accuracy-token tradeoff over fixed-budget Best-of-16, reducing token usage by up to 33.57% while improving final-answer accuracy.
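The Beta-Binomial objective described in the abstract can be illustrated with a minimal sketch. It assumes the model emits two positive parameters, here called `alpha` and `beta`, for each step, and that supervision is `k` successful continuations out of `n` Monte Carlo rollouts. The function names, the parameterization, and the use of `alpha + beta` as a reliability proxy are assumptions made for illustration; the abstract does not specify how BetaPRM defines or computes its reliability signal.

```python
import math


def beta_binomial_nll(alpha: float, beta: float, k: int, n: int) -> float:
    """Negative log-likelihood of k successes in n rollouts under
    BetaBinomial(n, alpha, beta). Hypothetical helper, not the paper's code."""
    # log C(n, k)
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    # log B(k + alpha, n - k + beta)
    log_beta_post = (
        math.lgamma(k + alpha)
        + math.lgamma(n - k + beta)
        - math.lgamma(n + alpha + beta)
    )
    # log B(alpha, beta)
    log_beta_prior = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return -(log_choose + log_beta_post - log_beta_prior)


def beta_summary(alpha: float, beta: float) -> tuple[float, float, float]:
    """Return (mean, concentration, variance) of Beta(alpha, beta).
    Treating concentration as 'reliability' is an assumption of this sketch."""
    concentration = alpha + beta
    mean = alpha / concentration
    variance = mean * (1.0 - mean) / (concentration + 1.0)
    return mean, concentration, variance


if __name__ == "__main__":
    # Two beliefs with the same mean reward (0.25) but different concentration.
    for a, b in [(0.5, 1.5), (10.0, 30.0)]:
        mean, conc, var = beta_summary(a, b)
        nll = beta_binomial_nll(a, b, k=2, n=8)
        print(f"alpha={a}, beta={b}: mean={mean:.3f}, "
              f"concentration={conc:.1f}, variance={var:.4f}, nll={nll:.4f}")
```

Both beliefs in the usage example share the same mean, so a point-estimate PRM would score the two steps identically. The Beta belief separates them through its spread, which is the kind of information a downstream procedure such as ACA could use to decide whether to trust a reward or spend more computation.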