RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains
Authors: Haoxiang Jiang, Zihan Dong, Tianci Liu, Wanying Wang, Ran Xu, Tony Yu, Linjun Zhang, Haoyu Wang
Organization: OpenRubrics
Published: 2026-05-27
arXiv: 2605.29156
Abstract
AI-generated summary: RUBRIC-ARROW presents an alternating framework for reward modeling that improves upon rubric-based methods by reducing ties and leveraging pairwise preference data for training.
Pointwise reward modeling offers critical signals for LLM post-training, yet struggles with absolute scoring in subjective, non-verifiable settings. Rubric-based methods address this by decomposing evaluation into explicit criteria, but existing approaches typically depend on frontier LLMs and suffer from ties caused by hard Boolean aggregation. We present RUBRIC-ARROW, an alternating framework that jointly trains a rubric generator and a rubric-conditioned judge, with its RL stage using only pairwise preference data. Our method couples a probability-based scoring rule that reduces ties with phase-specific preference-based rewards and an alternating GRPO scheme that together train the pointwise evaluator. Extensive experiments show that RUBRIC-ARROW achieves competitive reward-modeling accuracy and yields consistent gains for downstream policy post-training.
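To make the tie problem concrete, here is a minimal sketch of how hard Boolean aggregation can tie two responses that a probability-based score separates. The abstract does not specify the actual scoring rule, so the mean-of-probabilities rule below is an assumption for illustration only, and the function names (`boolean_score`, `probability_score`, `preference_reward`), the 0.5 threshold, and the example probabilities are all hypothetical.

```python
from typing import Sequence


def boolean_score(probs: Sequence[float], threshold: float = 0.5) -> float:
    """Hard aggregation: fraction of rubric criteria judged satisfied.

    Each per-criterion probability is first collapsed to a yes/no verdict.
    """
    return sum(p >= threshold for p in probs) / len(probs)


def probability_score(probs: Sequence[float]) -> float:
    """Soft aggregation (assumed rule): mean probability that each criterion is met."""
    return sum(probs) / len(probs)


def preference_reward(score_chosen: float, score_rejected: float) -> float:
    """Pairwise check on pointwise scores: 1.0 only if the preferred response
    receives a strictly higher score, so a tie earns no reward."""
    return 1.0 if score_chosen > score_rejected else 0.0


# Hypothetical judge probabilities for three rubric criteria.
chosen = [0.9, 0.8, 0.6]     # human-preferred response
rejected = [0.7, 0.6, 0.55]  # dispreferred response

# Every criterion clears the threshold for both responses, so the hard scores tie.
hard_reward = preference_reward(boolean_score(chosen), boolean_score(rejected))

# The soft scores keep the confidence gap, so the pair is ranked correctly.
soft_reward = preference_reward(probability_score(chosen), probability_score(rejected))

print(hard_reward, soft_reward)
```

Under these assumptions the pointwise scores only need pairwise preference labels to be checked, which is consistent with the abstract's description of an RL stage trained on preference data alone.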
Community
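This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
* C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences (2026), huggingface.co/papers/2604.13618
* Prompt-Level Reward Specifications for Open-Ended Post-Training (2026), huggingface.co/papers/2605.29275
* Focal Reward: Balanced Reinforcement Learning under Rubric-Based Rewards (2026), huggingface.co/papers/2605.26579
* EvoLM: Self-Evolving Language Models through Co-Evolved Discriminative Rubrics (2026), huggingface.co/papers/2605.03871
* Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning (2026), huggingface.co/papers/2605.08061
* Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria (2026), huggingface.co/papers/2605.08354
* Tournament-GRPO: Group-Wise Tournament Rewards for Reinforcement Learning in Open-Ended Long-Form Generation (2026), huggingface.co/papers/2605.26958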