Hugging Face Daily Papers · · 5 min read

Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.20164


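Published on May 19 · Submitted by Utkarsh Tyagi on May 20
Authors: Utkarsh Tyagi, Xingang Guo, MohammadHossein Rezaei, Daniel George, Anas Mahmoud, Jackson Lee, Bing Liu, Yunzhong He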

Abstract

AI-generated summary: POW3R is a policy-aware framework for reinforcement learning with rubric-based rewards that adapts criterion weights during training to improve policy optimization while preserving human-defined criteria importance.

Reinforcement learning with verifiable rewards has made post-training highly effective when correctness can be checked automatically. However, many important model behaviors require satisfying several qualitative criteria at once. Rubric-based rewards address this setting by grading prompt-specific criteria and aggregating them into a scalar reward. Yet standard static aggregations conflate a criterion's human-assigned importance with its current usefulness as an optimization signal. We show that this assumption breaks down in rubric RL: many important criteria are already saturated or currently unreachable, while criteria that distinguish rollouts are not necessarily those with the largest human weights. We introduce POW3R, a policy-aware rubric reward framework that preserves human weights and category balance as the rubric objective while adapting criterion-level reward weights during training. POW3R uses rollout-level contrast to emphasize criteria that currently separate the policy's outputs, making the GRPO reward more informative without changing the underlying evaluation target. Across three base policies on two datasets spanning multimodal and text-only settings, POW3R wins 24 of 30 base-policy/metric comparisons, improving both mean rubric reward and strict completion (the fraction of prompts whose response satisfies every required rubric criterion) over vanilla GRPO with rubric rewards, and reaches the same plateau in 2.5–4× fewer training steps. Rubric rewards should therefore distinguish what should matter in the final answer from what can teach the current policy.
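To make the abstract's claim about saturated and unreachable criteria concrete, here is a minimal Python sketch of static rubric aggregation followed by GRPO-style group standardization. The rollout grades, the weights, and the function names `static_reward` and `group_relative_advantages` are illustrative assumptions and are not taken from the paper.

```python
from statistics import mean, pstdev

def static_reward(passes, weights):
    """Weighted fraction of rubric criteria that one rollout satisfies."""
    return sum(w for p, w in zip(passes, weights) if p) / sum(weights)

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of 4 rollouts graded on 3 criteria (1 = pass, 0 = fail).
# Criterion 0 is passed by every rollout (saturated);
# criterion 2 is failed by every rollout (currently unreachable).
group = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
]
weights = [5.0, 1.0, 5.0]  # assumed human-assigned weights

rewards = [static_reward(p, weights) for p in group]
advantages = group_relative_advantages(rewards)

# Recompute using only criterion 1, the single criterion that separates rollouts.
rewards_c1 = [static_reward([p[1]], [weights[1]]) for p in group]
advantages_c1 = group_relative_advantages(rewards_c1)

# The two advantage vectors agree up to the eps term: the heavily weighted
# criteria 0 and 2 shift and rescale the reward but add no learning signal.
print(advantages)
print(advantages_c1)
```

In this toy group the two criteria carrying 10 of the 11 weight units leave the advantages unchanged, because a constant offset and a positive rescaling of the reward both cancel under within-group standardization.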

Community

Comment from the paper author and submitter:

“Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR”

As RL post-training expands beyond fully verifiable domains, rubrics, or checklists, are becoming a common reward interface for open-ended and multimodal tasks.

The question we study is: Should the same rubric weights that define final answer quality also determine what the current policy learns from during RL?

Our finding is no. A criterion can be important for the final response, but if all sampled rollouts pass it or all sampled rollouts fail it, it provides no group-relative learning signal. Across our multimodal setting and HealthBench, roughly half of rubric criteria are non-contrastive for a fresh policy, and static aggregation routes 45–51% of within-category training pressure to such criteria.
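As a rough illustration of what "non-contrastive" means here, the following Python sketch flags criteria that every rollout in a group passes or every rollout fails, and measures how much static weight sits on them. The helper names, the grades, and the weights are hypothetical, and this simple per-prompt share is only a stand-in for the paper's within-category measure, whose exact definition is not given on this page.

```python
def is_contrastive(column):
    """A criterion is contrastive if the group contains both passes and fails."""
    return 0 < sum(column) < len(column)

def noncontrastive_weight_share(group, weights):
    """Fraction of static weight placed on criteria with no within-group contrast."""
    columns = list(zip(*group))  # one tuple of grades per criterion
    dead = sum(w for col, w in zip(columns, weights) if not is_contrastive(col))
    return dead / sum(weights)

# Hypothetical grades: rows are rollouts, columns are criteria (1 = pass).
group = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
]
weights = [3.0, 2.0, 3.0, 2.0]  # assumed human weights

share = noncontrastive_weight_share(group, weights)
print(f"weight on non-contrastive criteria: {share:.0%}")
```

Criteria 0 (all pass) and 2 (all fail) hold 6 of the 10 weight units in this made-up group, so a static aggregation would spend most of its weight where the rollouts cannot be told apart.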

In this work, we:
• diagnose how static rubric aggregation misallocates learning signal,
• show that human importance and policy-dependent usefulness can decouple, and
• introduce POW3R, a policy-aware rubric reward framework that preserves the evaluation target while adapting criterion-level reward weights during training.
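One way to picture the idea of adapting criterion-level weights is the Python sketch below. It is not POW3R's actual rule, which this page does not spell out. As an assumption made purely for illustration, it scales each human weight by the within-group pass-rate variance `p * (1 - p)` and renormalizes. The function name `contrast_weights` and the data are hypothetical.

```python
def contrast_weights(group, human_weights):
    """Illustrative reweighting (an assumption, not the paper's formula):
    scale each human weight by the pass-rate variance p * (1 - p) within the
    group, then renormalize so the total weight is unchanged."""
    n = len(group)
    columns = list(zip(*group))  # one tuple of grades per criterion
    scaled = []
    for col, w in zip(columns, human_weights):
        p = sum(col) / n
        scaled.append(w * p * (1.0 - p))
    total = sum(scaled)
    if total == 0.0:
        # No criterion separates the rollouts: fall back to the human weights.
        return list(human_weights)
    factor = sum(human_weights) / total
    return [s * factor for s in scaled]

# Hypothetical grades: rows are rollouts, columns are criteria (1 = pass).
group = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
]
human_weights = [5.0, 1.0, 5.0]  # assumed; still used unchanged for evaluation

train_weights = contrast_weights(group, human_weights)
print(train_weights)
```

In this sketch the saturated criterion 0 receives zero training weight, while criterion 2, which is both contrastive and highly weighted by humans, receives the most. Only the training reward would use `train_weights`; evaluation would still score responses with `human_weights`, in line with the stated goal of leaving the evaluation target unchanged.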

Across three base policies and multimodal/text-only settings, POW3R wins 24/30 base-policy/metric comparisons and reaches the same plateau in 2.5–4× fewer training steps.


