DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning
Authors: Guochao Jiang, Jingyi Song, Guofeng Quan, Chuzhan Hao, Guohua Liu, Yuewei Zhang
Published: 2026-05-25 (arXiv:2605.25604)
Abstract
AI-generated summary: Dynamic Variance-adaptive Advantage Optimization (DVAO) addresses training instability in multi-reward reinforcement learning by adaptively weighting objectives based on empirical reward variance, maintaining bounded advantage magnitudes and improving multi-objective performance.
Reinforcement Learning has become a standard paradigm for aligning Large Language Models with human intent and task requirements. While Group Relative Policy Optimization offers an efficient, value-model-free alternative to Proximal Policy Optimization, adapting it to real-world multi-reward settings remains challenging. Standard scalarization practices, such as Reward Combination and Advantage Combination, suffer from significant drawbacks: Reward Combination frequently generates advantages with excessively large squared magnitudes that lead to training instability, while Advantage Combination relies on static hyperparameters and ignores cross-objective correlations. To address these limitations, we propose Dynamic Variance-adaptive Advantage Optimization (DVAO), which dynamically adjusts combination weights based on the empirical reward variance of each objective within a rollout group, effectively up-weighting objectives with a stronger learning signal while suppressing noisy ones. We mathematically prove that DVAO maintains bounded advantage magnitudes for stable training and introduces a self-adaptive cross-objective regularization mechanism. Extensive experiments on mathematical reasoning and tool-use benchmarks using Qwen3 and Qwen2.5 models demonstrate that DVAO significantly outperforms baseline methods, achieving a superior multi-objective Pareto frontier and robust training stability.
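The weighting idea in the abstract can be illustrated with a small NumPy sketch. The abstract does not state the exact weighting formula, so everything below is an assumption made for illustration: weights are taken proportional to each objective's within-group reward variance and normalized to sum to one, and per-objective advantages are GRPO-style z-scores. The function names (`group_advantages`, `variance_adaptive_weights`, `combine_advantages`) and the `eps` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np


def group_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Z-score each objective's rewards within one rollout group.

    rewards has shape (G, K): G rollouts for one prompt, K reward objectives.
    """
    mean = rewards.mean(axis=0, keepdims=True)
    std = rewards.std(axis=0, keepdims=True)
    return (rewards - mean) / (std + eps)


def variance_adaptive_weights(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """ASSUMPTION: weight each objective by its share of the total group variance.

    An objective whose rewards are identical across the group (zero variance)
    carries no relative signal and receives zero weight.
    """
    var = rewards.var(axis=0)
    total = var.sum()
    if total < eps:
        # No objective separates the rollouts; fall back to uniform weights.
        return np.full(rewards.shape[1], 1.0 / rewards.shape[1])
    return var / total


def combine_advantages(rewards: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Return the weighted per-rollout advantage (shape (G,)) and the weights (shape (K,))."""
    adv = group_advantages(rewards)
    weights = variance_adaptive_weights(rewards)
    return adv @ weights, weights


# Toy group: 4 rollouts, 2 objectives.
# Objective 0 separates the rollouts; objective 1 is constant within the group.
toy_rewards = np.array([[1.0, 0.5],
                        [0.0, 0.5],
                        [1.0, 0.5],
                        [0.0, 0.5]])
toy_advantages, toy_weights = combine_advantages(toy_rewards)
```

In this toy group the constant objective receives zero weight, so the combined advantage reduces to the z-scored advantage of the first objective. Because the assumed weights are non-negative and sum to one, the combined advantage of any rollout cannot exceed the largest per-objective advantage magnitude for that rollout. This is a property of the sketch, not a restatement of the paper's bound.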