Hugging Face Daily Papers · 5 min read

Unsupervised Process Reward Models

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.10158

Published on May 11, 2026 · Submitted by Siba Smarak Panigrahi on May 22, 2026
Authors: Artyom Gadetsky, Maxim Kodryan, Siba Smarak Panigrahi, Hang Guo, Maria Brbic

Abstract

AI-generated summary: Unsupervised reward models eliminate the need for human annotations in training by leveraging language model next-token probabilities to identify erroneous reasoning steps and improve policy optimization in reinforcement learning.

Process Reward Models (PRMs) are a powerful mechanism for steering large language model reasoning by providing fine-grained, step-level supervision. However, this effectiveness comes at a significant cost: PRMs require expert annotations for every reasoning step, making them costly and difficult to scale. Here, we propose a method for training unsupervised PRMs (uPRM) that requires no human supervision, neither at the level of step-by-step annotations nor through ground-truth verification of final answers. The key idea behind our approach is to define a scoring function, derived from LLM next-token probabilities, that jointly assesses candidate positions of first erroneous steps across a batch of reasoning trajectories. We demonstrate the effectiveness of uPRM across diverse scenarios: (i) uPRM achieves up to 15% absolute accuracy improvements over the LLM-as-a-Judge in identifying first erroneous steps on the ProcessBench dataset; (ii) as a verifier for test-time scaling, uPRM performs comparably to supervised PRMs and outperforms the majority voting baseline by up to 6.9%; and (iii) when used as a reward signal in reinforcement learning, uPRM enables more robust policy optimization throughout training compared to a supervised PRM trained using ground-truth labels. Overall, our results open a path toward scalable reward modeling for complex reasoning tasks.
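To make the two evaluation settings in the abstract more concrete, here is a minimal Python sketch of how step-level scores from any PRM are commonly consumed: locating the first erroneous step, and re-ranking sampled answers against a majority-voting baseline. This is not the paper's method. The abstract does not specify the uPRM scoring function, so the step scores below are made-up numbers, and the threshold of 0.5, the min aggregation, the `-1` "no error" convention, and all function names are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Sequence


def first_error_index(step_scores: Sequence[float], threshold: float = 0.5) -> int:
    """Return the index of the first step scored below `threshold`.

    Returns -1 when every step passes (an assumed "no error" convention).
    """
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return -1


def trajectory_score(step_scores: Sequence[float]) -> float:
    """Collapse step scores into one number using min (an assumed choice)."""
    return min(step_scores) if step_scores else 0.0


def majority_vote(answers: Sequence[str]) -> str:
    """Baseline: pick the most frequent final answer among the samples."""
    return Counter(answers).most_common(1)[0][0]


def verifier_weighted_vote(
    answers: Sequence[str], all_step_scores: Sequence[Sequence[float]]
) -> str:
    """Pick the answer whose trajectories have the highest total score."""
    totals: Dict[str, float] = defaultdict(float)
    for answer, step_scores in zip(answers, all_step_scores):
        totals[answer] += trajectory_score(step_scores)
    return max(totals, key=totals.get)


# Hypothetical data: three sampled solutions to one problem.
answers = ["12", "12", "15"]
scores = [
    [0.9, 0.3, 0.8],    # second step looks wrong
    [0.8, 0.2],         # second step looks wrong
    [0.9, 0.95, 0.9],   # no step falls below the threshold
]

print([first_error_index(s) for s in scores])
print(majority_vote(answers))
print(verifier_weighted_vote(answers, scores))
```

In this toy data, majority voting picks the answer that two flawed trajectories agree on, while the score-weighted vote prefers the single trajectory with no low-scoring step. That difference is the kind of effect a verifier is meant to provide in test-time scaling, though the actual gains reported in the abstract come from the authors' experiments, not from this sketch.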

Get this paper in your agent:

hf papers read 2605.10158
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
