Hugging Face Daily Papers · 5 min read

Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.07579

Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

Published on May 8 · Submitted by Jongwon Lim on May 13
Authors: Yunho Choi, Jongwon Lim, Woojin Ahn, Minjae Oh, Jeonghoon Shim, Yohan Jo

Abstract

POISE enables stable and efficient policy optimization for large reasoning models by estimating baselines using internal model signals, reducing computational overhead while maintaining performance comparable to existing methods.

AI-generated summary

Reinforcement learning with verifiable rewards (RLVR) for Large Reasoning Models hinges on baseline estimation for variance reduction, but existing approaches pay a heavy price: PPO requires a critic at the scale of the policy model, while GRPO needs multiple rollouts per prompt to keep its empirical group mean stable. We introduce Policy Optimization with Internal State Value Estimation (POISE), which obtains a baseline at negligible cost by using the policy model's internal signals already computed during the policy forward pass. A lightweight probe predicts the expected verifiable reward from the hidden states of the prompt and generated trajectory, as well as token-entropy statistics, and is trained online alongside the policy. To preserve gradient unbiasedness despite using trajectory-conditioned features, we introduce a cross-rollout construction that predicts each rollout's value from an independent rollout's internal states. Because POISE estimates prompt value using only a single rollout, it enables higher prompt diversity for a fixed compute budget during training. This reduces gradient variance for more stable learning and eliminates the sampling cost of detecting zero-advantage prompts. On Qwen3-4B and DeepSeek-R1-Distill-Qwen-1.5B across math reasoning benchmarks, POISE matches DAPO while requiring less compute. Moreover, its value estimator performs comparably to a separate LLM-scale value model and generalizes to various verifiable tasks. By leveraging the model's own internal representations, POISE enables more stable and efficient policy optimization.
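
To make the mechanism concrete, here is a minimal PyTorch sketch of how a hidden-state value probe and the cross-rollout baseline described above could look. Everything in it is an editor's illustration of the abstract, not the authors' implementation: the names (ValueProbe, cross_rollout_advantages), the MLP architecture, the pooling of hidden states, and the choice of entropy statistics are all assumptions; see the official repository for the actual code.

```python
# Editor's sketch (not the paper's code): a lightweight value probe over the
# actor's own hidden states, plus a cross-rollout baseline as described above.
import torch
import torch.nn as nn

class ValueProbe(nn.Module):
    """Predicts expected verifiable reward from pooled hidden states + entropy stats."""
    def __init__(self, hidden_dim: int, num_entropy_feats: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim + num_entropy_feats, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, pooled_hidden: torch.Tensor, entropy_stats: torch.Tensor) -> torch.Tensor:
        # pooled_hidden: [batch, hidden_dim], e.g. a mean over prompt + trajectory tokens
        # entropy_stats: [batch, num_entropy_feats], e.g. mean/std/max/final token entropy
        return self.mlp(torch.cat([pooled_hidden, entropy_stats], dim=-1)).squeeze(-1)

def cross_rollout_advantages(probe, feats_a, ent_a, reward_a, feats_b, ent_b, reward_b):
    """Advantages for two independent rollouts of the same prompt.

    Each rollout's baseline is the value predicted from the *other* rollout's
    internal states, so the baseline is independent of the trajectory being
    updated (the property the abstract uses to keep the gradient unbiased).
    """
    value_a = probe(feats_a, ent_a)
    value_b = probe(feats_b, ent_b)
    adv_a = reward_a - value_b.detach()   # A's baseline comes from B
    adv_b = reward_b - value_a.detach()   # B's baseline comes from A
    # The probe is trained online: regress each predicted value toward that
    # rollout's own verifiable reward.
    probe_loss = ((value_a - reward_a) ** 2 + (value_b - reward_b) ** 2).mean()
    return adv_a, adv_b, probe_loss
```

Detaching the baseline keeps probe gradients out of the policy update, while the regression term lets the probe track the policy as it improves; GRPO would instead use the mean reward over a group of rollouts of the same prompt as its baseline, which is why it needs several samples per prompt.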

Community

Paper author · Paper submitter · about 10 hours ago

POISE asks how the actor's internal representations can be folded back into RL training: it turns hidden states from the model's own generation process into a baseline function for RL updates, without a separate critic or many extra samples.

Urro

Really cool stuff!

I think results could be improved with some additional refinement, i.e., taking advantage of more signals. But this concept is important for the future of LLM training.

Get this paper in your agent:

hf papers read 2605.07579
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2605.07579 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2605.07579 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2605.07579 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.
