GD^2PO: Mitigating Multi-Reward Conflicts via Group-Dynamic reward-Decoupled Policy Optimization
Authors: Haotian Liu, Yihao Liu, Jingwei Ni, Siyuan Huang, Xinpeng Liu, Pengyu Cheng, Jiajun Song, Ruijin Ding, Junfeng Li, Zhechao Yu, Mengyu Zhou, Hongteng Xu, Xiaoxi Jiang, Guanjun Jiang
Published: 2026-06-15
Abstract
Multi-dimensional reward optimization in large language models is enhanced through a conflict-aware filtering mechanism that prevents signal cancellation and accelerates reinforcement learning efficiency.
As LLMs advance, post-training reinforcement learning (RL) increasingly relies on multi-dimensional rewards to cultivate comprehensive capabilities. This shift demands new algorithms capable of optimizing diverse and potentially competing objectives simultaneously. To address this, existing methods such as Group reward-Decoupled Policy Optimization (GDPO) decompose the overall score into independent reward groups, then compute the RL loss separately within each group. However, this strategy still encounters multi-reward conflicts: a single rollout can yield positive advantages on certain reward dimensions but negative ones on others, so the opposing signals cancel each other out during aggregation, which hinders RL training efficiency. Inspired by Dynamic sAmpling Policy Optimization (DAPO), which improves RL training efficiency by filtering out ineffective rollouts with near-zero advantages, we propose Group-Dynamic reward-Decoupled Policy Optimization (GD^2PO). Specifically, GD^2PO employs a conflict-aware filtering mechanism to mask out rollouts suffering from severe reward-wise disagreement. By preventing conflicting signals from canceling each other out, this masking strategy preserves and enhances the magnitude of effective RL advantages, thereby significantly improving learning efficiency. Furthermore, we introduce query-level reweighting to dynamically adjust the update intensity of each query based on its overall reward consensus. Experiments on various multi-reward scenarios, including tool calling and human preference alignment, demonstrate that GD^2PO consistently and significantly outperforms existing baselines. The code is available at https://github.com/Qwen-Applications/GD2PO.
Community
GD²PO tackles an important failure mode in multi-reward RL training: when a single rollout gets a positive advantage on one reward dimension but a negative one on another, aggregating them just cancels the signals out, leaving the model with a near-zero, uninformative update.
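To make the cancellation concrete, here is a minimal NumPy sketch. It assumes each reward dimension is standardized within the rollout group and the per-reward advantages are then simply summed; the reward values, the column names, and the helper `group_normalize` are illustrative and not taken from the paper or its repository.

```python
import numpy as np

def group_normalize(rewards, eps=1e-8):
    """Standardize each reward column across the rollouts of one query."""
    mean = rewards.mean(axis=0, keepdims=True)
    std = rewards.std(axis=0, keepdims=True)
    return (rewards - mean) / (std + eps)

# Toy group: 4 rollouts for one query, 2 reward dimensions (hypothetical values).
# Columns: [correctness, format]
rewards = np.array([
    [1.0, 0.0],  # correct but badly formatted -> conflicting signals
    [0.0, 1.0],  # wrong but well formatted    -> conflicting signals
    [1.0, 1.0],  # good on both dimensions
    [0.0, 0.0],  # bad on both dimensions
])

per_reward_adv = group_normalize(rewards)  # shape (4, 2), one advantage per reward
aggregated = per_reward_adv.sum(axis=1)    # shape (4,), plain sum over rewards

# The two conflicted rollouts end up with an aggregated advantage of about 0,
# while the two consistent rollouts keep a clear signal of about +2 and -2.
print(np.round(aggregated, 3))
```

In this toy group, half of the rollouts contribute essentially nothing to the update even though each carries a strong per-reward signal.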
The fix is elegant: before aggregating, filter out rollouts where reward-wise advantages point in conflicting directions (either by sign, or via an SNR-style soft score). Then, reweight each query based on how many of its rollouts survived filtering -- queries with mostly conflicted rollouts get a smaller update, since their supervision is noisier.
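The two steps described above can be sketched roughly as follows. This is a sketch under stated assumptions, not the authors' implementation: the SNR-style score (absolute mean over standard deviation across reward dimensions), the threshold of 1.0, the use of the surviving fraction as the query weight, and all function names are hypothetical choices made for illustration.

```python
import numpy as np

def sign_consistent_mask(adv):
    """Hard filter: keep a rollout unless it has both a positive and a
    negative reward-wise advantage (zeros are treated as neutral)."""
    has_pos = (adv > 0).any(axis=1)
    has_neg = (adv < 0).any(axis=1)
    return ~(has_pos & has_neg)

def snr_mask(adv, threshold=1.0, eps=1e-8):
    """Soft filter (assumed score): |mean| / std across reward dimensions.
    Rollouts whose advantages mostly agree get a high score and are kept."""
    snr = np.abs(adv.mean(axis=1)) / (adv.std(axis=1) + eps)
    return snr >= threshold

def filtered_advantages(adv, mask):
    """Sum advantages over rewards, zero out masked rollouts, and scale the
    whole query by the fraction of rollouts that survived filtering."""
    query_weight = mask.mean()
    aggregated = adv.sum(axis=1) * mask
    return query_weight * aggregated, query_weight

# adv has shape (num_rollouts, num_rewards) for a single query.
adv = np.array([[1.0, -1.0], [-1.0, 1.0], [1.0, 1.0], [-1.0, -1.0]])
mask = sign_consistent_mask(adv)            # only the last two rollouts survive
weighted, weight = filtered_advantages(adv, mask)
print(mask, weight, weighted)

# A mildly conflicted 3-reward rollout: rejected by sign, kept by the soft score.
mild = np.array([[0.9, 0.8, -0.05]])
print(sign_consistent_mask(mild), snr_mask(mild))
```

Scaling by the surviving fraction means a query whose rollouts are mostly conflicted still contributes, but with a proportionally smaller step.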
The method builds cleanly on top of GRPO + GDPO and consistently outperforms both on tool calling and helpfulness-safety alignment, across Qwen2.5 and Llama3 backbones. The SNR-based variant is especially useful when there are 3+ reward dimensions, where hard sign-based filtering becomes too aggressive.
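A back-of-the-envelope calculation suggests why hard sign filtering gets aggressive as dimensions are added. It rests on an idealized assumption that is not from the paper: the signs of a rollout's K reward-wise advantages are independent and each equally likely to be positive or negative. A rollout survives only if all K signs agree, which happens with probability

```latex
P(\text{all } K \text{ signs agree})
  = 2 \cdot \left(\tfrac{1}{2}\right)^{K}
  = 2^{1-K},
\qquad
K = 2:\ \tfrac{1}{2}, \quad
K = 3:\ \tfrac{1}{4}, \quad
K = 4:\ \tfrac{1}{8}.
```

Real rewards are correlated, so actual survival rates will differ, but the trend is the same: each extra reward dimension gives a rollout one more chance to disagree with itself, and a soft score can keep rollouts whose disagreement is only marginal.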
arXiv: arxiv.org/abs/2606.16771