Hölder Policy Optimisation
Authors: Yuxiang Chen, Dingli Liang, Yihang Chen, Ziqin Gong, Chenyang Le, Zhaokai Wang, Jiachen Zhu, Lingyu Yang, Jianghao Lin, Weinan Zhang, Jun Wang
Published: 12 May 2026 (arXiv: 2605.12058)
Abstract
Group Relative Policy Optimisation (GRPO) enhances large language models by estimating advantages across a group of sampled trajectories. However, mapping these trajectory-level advantages to policy updates requires aggregating token-level probabilities within each sequence. Relying on a fixed aggregation mechanism for this step fundamentally limits the algorithm's adaptability. Empirically, we observe a critical trade-off: certain fixed aggregations frequently suffer from training collapse, while others fail to yield satisfactory performance. To resolve this, we propose HölderPO, a generalised policy optimisation framework that unifies token-level probability aggregation via the Hölder mean. By explicitly modulating the parameter p, our framework provides continuous control over the trade-off between gradient concentration and variance bounds. Theoretically, we prove that a larger p concentrates the gradient to amplify sparse learning signals, whereas a smaller p strictly bounds gradient variance. Because no static configuration can universally resolve this concentration-stability trade-off, we instantiate the framework with a dynamic annealing algorithm that progressively schedules p across the training lifecycle. Extensive evaluations demonstrate superior stability and convergence over existing baselines. Specifically, our approach achieves a state-of-the-art average accuracy of 54.9% across multiple mathematical benchmarks, a 7.2% relative gain over standard GRPO, and secures a 93.8% success rate on ALFWorld.
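To make the aggregation idea concrete, here is a minimal sketch of the Hölder (power) mean, M_p(x) = ((1/n) Σ x_i^p)^(1/p), applied to a handful of token-level probabilities, together with a schedule for p. The abstract does not give the exact objective or the annealing rule, so the function names `holder_mean` and `linear_p_schedule`, the linear shape of the schedule, its large-to-small direction, and the sample probabilities are all assumptions made for illustration, not the authors' implementation.

```python
import math


def holder_mean(values, p):
    """Hölder (power) mean of strictly positive values.

    p = 1 gives the arithmetic mean, p -> 0 the geometric mean,
    and p = -1 the harmonic mean. Larger p weights large values more.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if any(v <= 0 for v in values):
        raise ValueError("values must be strictly positive")
    n = len(values)
    logs = [math.log(v) for v in values]
    if abs(p) < 1e-8:
        # Limit p -> 0: geometric mean, computed in log space.
        return math.exp(sum(logs) / n)
    # Log-sum-exp keeps x**p from overflowing or underflowing.
    scaled = [p * l for l in logs]
    m = max(scaled)
    log_mean = m + math.log(sum(math.exp(s - m) for s in scaled) / n)
    return math.exp(log_mean / p)


def linear_p_schedule(step, total_steps, p_start, p_end):
    """Hypothetical linear annealing of p from p_start to p_end."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return p_start + frac * (p_end - p_start)


if __name__ == "__main__":
    # Made-up token probabilities for one sampled sequence.
    token_probs = [0.9, 0.5, 0.1]
    for step in (0, 50, 100):
        # Assumed direction: start with a larger p, anneal to a smaller one.
        p = linear_p_schedule(step, 100, p_start=1.0, p_end=0.0)
        print(step, p, holder_mean(token_probs, p))
```

Because the power mean is non-decreasing in p, a single scalar moves the aggregate smoothly between a harmonic-like mean dominated by the smallest token probabilities and an arithmetic-like mean dominated by the largest, which is the kind of continuous control the abstract describes.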
Community
Why do fixed rewards always break your LLM training at the worst time?
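This is an automated message from the Librarian Bot (https://huggingface.co/librarian-bots). I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning (2026): https://huggingface.co/papers/2605.04066
- Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Sequence-Level Likelihood (2026): https://huggingface.co/papers/2604.12736
- SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks (2026): https://huggingface.co/papers/2604.08865
- expo: Exploration-prioritized policy optimization via adaptive kl regulation and gaussian curriculum sampling (2026): https://huggingface.co/papers/2605.09923
- On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation (2026): https://huggingface.co/papers/2603.22117
- DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment (2026): https://huggingface.co/papers/2605.03327
- fg-expo: Frontier-guided exploration-prioritized policy optimization via adaptive kl and gaussian curriculum (2026): https://huggingface.co/papers/2605.11403
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend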
Cite arxiv.org/abs/2605.12058 in a model, dataset, or Space README.md to link it from this page.