RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards
Authors: Gaotang Li, Bhavana Dalvi Mishra, Zifeng Wang, Jun Yan, Yanfei Chen, Chun-Liang Li, Long T. Le, Rujun Han, George Lee, Hanghang Tong, Chen-Yu Lee, Tomas Pfister (Google)
Abstract
AI-generated summary
Deep research agents trained with the RubricEM framework demonstrate superior performance on long-form research tasks through rubric-guided reinforcement learning with stage-aware planning and reflection-based meta-policy evolution.
Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard post-training offers little mechanism for turning past attempts into reusable experience. In this work, we argue that rubrics should serve not merely as final-answer evaluators, but as the shared interface that structures policy execution, judge feedback, and agent memory. Based on this view, we introduce RubricEM, a rubric-guided reinforcement learning framework that combines stagewise policy decomposition with reflection-based meta-policy evolution. RubricEM first makes research trajectories stage-aware by conditioning planning, evidence gathering, review, and synthesis on self-generated rubrics. It then assigns credit with Stage-Structured GRPO, which uses stagewise rubric judgments to provide denser semantic feedback for long-horizon optimization. In parallel, RubricEM trains a shared-backbone reflection meta-policy that distills judged trajectories into reusable rubric-grounded guidance for future attempts. The resulting RubricEM-8B achieves strong performance across four long-form research benchmarks, outperforming comparable open models and approaching proprietary deep-research systems. Beyond final performance, we perform thorough analyses to understand the key ingredients of RubricEM.
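The stage-structured credit assignment is easiest to see as group-relative normalization applied per stage rather than once per trajectory. Below is a minimal sketch of that idea, assuming each of G sampled rollouts for the same prompt receives a rubric-judge score in [0, 1] for each of the four stages (plan, evidence, review, synthesis); the function name, array shapes, and toy scores are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

STAGES = ["plan", "evidence", "review", "synthesis"]

def stagewise_grpo_advantages(stage_scores: np.ndarray) -> np.ndarray:
    """Group-relative advantages computed per stage (GRPO-style).

    stage_scores: shape (G, S) -- rubric-judge scores in [0, 1] for
    G rollouts of the same prompt, each split into S stages. Each
    stage's score is normalized against the *same stage* across the
    group, giving one advantage per stage instead of one per rollout.
    """
    mean = stage_scores.mean(axis=0, keepdims=True)  # per-stage group mean
    std = stage_scores.std(axis=0, keepdims=True)    # per-stage group std
    return (stage_scores - mean) / (std + 1e-6)      # avoid divide-by-zero

# Toy example: 4 rollouts x 4 stages of rubric-judge scores.
scores = np.array([
    [0.9, 0.6, 0.7, 0.8],  # rollout 1: strong plan, weaker evidence
    [0.5, 0.8, 0.6, 0.7],
    [0.7, 0.7, 0.9, 0.6],
    [0.6, 0.5, 0.5, 0.9],
])
adv = stagewise_grpo_advantages(scores)
print(dict(zip(STAGES, adv[0])))  # per-stage credit for rollout 1
```

The point of normalizing within each stage is that a rollout with a strong plan but weak evidence gathering receives positive credit on the planning stage and negative credit on the evidence stage, rather than a single trajectory-level scalar that blurs the two; this is one plausible reading of the "denser semantic feedback" the abstract describes.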
Community
TLDR: RubricEM introduces a rubric-guided reinforcement learning framework for training long-form deep research agents, enabling finer-grained stagewise credit assignment and reflection meta-policy training beyond verifiable rewards.