Hugging Face Daily Papers · May 18, 2026 · 6 min read

Hölder Policy Optimisation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.




arXiv:2605.12058


Published on May 12 · Submitted by Yihang Chen on May 18


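Authors: Yuxiang Chen, Dingli Liang, Yihang Chen, Ziqin Gong, Chenyang Le, Zhaokai Wang, Jiachen Zhu, Lingyu Yang, Jianghao Lin, Weinan Zhang, Jun Wang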

Abstract

Group Relative Policy Optimisation (GRPO) enhances large language models by estimating advantages across a group of sampled trajectories. However, mapping these trajectory-level advantages to policy updates requires aggregating token-level probabilities within each sequence. Relying on a fixed aggregation mechanism for this step fundamentally limits the algorithm's adaptability. Empirically, we observe a critical trade-off: certain fixed aggregations frequently suffer from training collapse, while others fail to yield satisfactory performance. To resolve this, we propose HölderPO, a generalised policy optimisation framework that unifies token-level probability aggregation via the Hölder mean. By explicitly modulating the parameter p, our framework provides continuous control over the trade-off between gradient concentration and variance bounds. Theoretically, we prove that a larger p concentrates the gradient to amplify sparse learning signals, whereas a smaller p strictly bounds gradient variance. Because no static configuration can universally resolve this concentration-stability trade-off, we instantiate the framework with a dynamic annealing algorithm that progressively schedules p across the training lifecycle. Extensive evaluations demonstrate superior stability and convergence over existing baselines. Specifically, our approach achieves a state-of-the-art average accuracy of 54.9% across multiple mathematical benchmarks, a 7.2% relative gain over standard GRPO, and secures a 93.8% success rate on ALFWorld.
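To make the role of p concrete, here is a minimal sketch of the Hölder (power) mean itself, M_p(x) = ((1/n) Σ x_i^p)^(1/p), with the geometric mean as the p → 0 limit. It is not the authors' implementation: the abstract does not say how the mean enters the HölderPO objective, nor what shape or direction the annealing schedule takes. The function names `holder_mean` and `linear_anneal`, the linear schedule, and the endpoint values are all assumptions made for illustration.

```python
import math
from typing import Sequence


def holder_mean(values: Sequence[float], p: float) -> float:
    """Hölder (power) mean of strictly positive values.

    p = 1 gives the arithmetic mean, p = 0 the geometric mean (the limit
    as p -> 0), and p = -1 the harmonic mean. Larger p weights the largest
    entries more heavily; smaller p weights the smallest entries.
    """
    if len(values) == 0:
        raise ValueError("values must be non-empty")
    if any(v <= 0.0 for v in values):
        raise ValueError("values must be strictly positive")
    n = len(values)
    if p == 0:
        return math.exp(sum(math.log(v) for v in values) / n)
    # Work in log space (log-sum-exp) so large |p| does not overflow.
    scaled = [p * math.log(v) for v in values]
    peak = max(scaled)
    log_mean = peak + math.log(sum(math.exp(s - peak) for s in scaled) / n)
    return math.exp(log_mean / p)


def linear_anneal(step: int, total_steps: int, p_start: float, p_end: float) -> float:
    """Hypothetical linear schedule for p; the paper's schedule may differ."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + frac * (p_end - p_start)


if __name__ == "__main__":
    # Made-up token-level probabilities for one sampled sequence.
    token_probs = [0.9, 0.8, 0.05, 0.7]
    for step in (0, 50, 100):
        # Placeholder endpoints, not values taken from the paper.
        p = linear_anneal(step, total_steps=100, p_start=2.0, p_end=-2.0)
        print(f"step={step:3d}  p={p:+.1f}  mean={holder_mean(token_probs, p):.4f}")
```

Because the Hölder mean is non-decreasing in p, sweeping p moves the aggregate continuously between emphasising the highest-probability tokens and emphasising the lowest-probability ones, which is the single knob the abstract describes.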
