Hugging Face Daily Papers · 4 min read

GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Papers
arxiv:2605.19577

Published on May 19, 2026 · Submitted by suu on May 20, 2026 · #3 Paper of the day
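Authors: Minxuan Lv, Tiehua Mei, Tanlong Du, Junmin Chen, Zhenpeng Su, Ziyang Chen, Ziqi Wang, Zhennan Wu, Ruotong Pan, Jian Liang, Ruiming Tang, Han Li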

Abstract

AI-generated summary: GoLongRL presents an open-source approach for long-context reinforcement learning with diverse reward optimization through capability-oriented data construction and TMN-Reweight methodology.

We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter of designing increasingly complex retrieval paths, leading to homogeneous task coverage and reward formulations that inadequately reflect practical long-context requirements. Our work offers two contributions.

(1) Capability-oriented data construction with full open release. We openly release a dataset of 23K RLVR samples, the complete construction pipeline, and all training code. Guided by a taxonomy of long-context capabilities, the dataset spans 9 task types, each paired with its natural evaluation metric. It comprises curated open-source samples from established corpora and synthetic samples whose QA pairs are generated from real source documents such as books, academic papers, and multi-turn dialogues. Under the same vanilla GRPO setup, our dataset alone outperforms the closed-source QwenLong-L1.5 dataset. Moreover, our Qwen3-30B-A3B model trained on this data delivers long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507, suggesting that broader coverage and greater reward diversity substantially benefit long-context capability improvement.

(2) TMN-Reweight for heterogeneous multitask optimization. To address optimization challenges from heterogeneous rewards, we propose TMN-Reweight, which combines task-level mean normalization for cross-task reward scale alignment with difficulty-adaptive weighting for more reliable advantage estimation. TMN-Reweight further improves average performance over vanilla GRPO, with general capabilities preserved or improved across reported evaluations.
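The abstract names the two ingredients of TMN-Reweight but does not give formulas, so the following is only a rough sketch of how such a scheme could look next to vanilla GRPO's group-standardized advantage. Everything beyond the GRPO baseline is an assumption made for illustration: the function names, dividing by the per-task mean reward, the `1 + alpha * (1 - mean reward)` difficulty weight, and the assumption that raw rewards lie in [0, 1]. None of it is taken from the paper or its released code.

```python
from collections import defaultdict
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Vanilla GRPO baseline: standardize rewards within one prompt's rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def task_mean_normalize(groups, eps=1e-6):
    """Hypothetical task-level mean normalization.

    groups: list of (task_name, rewards) pairs, one pair per prompt.
    Each reward is divided by the mean reward of its task in the batch,
    so tasks scored with differently scaled metrics become comparable.
    """
    per_task = defaultdict(list)
    for task, rewards in groups:
        per_task[task].extend(rewards)
    task_mean = {task: mean(rs) for task, rs in per_task.items()}
    return [
        (task, [r / (task_mean[task] + eps) for r in rewards])
        for task, rewards in groups
    ]


def tmn_reweight_advantages(groups, alpha=1.0, eps=1e-6):
    """Hypothetical combination: normalize per task, center per group,
    then up-weight groups whose raw mean reward is low (harder prompts).

    Assumes raw rewards lie in [0, 1]; alpha is an illustrative knob.
    """
    normalized = task_mean_normalize(groups, eps)
    advantages = []
    for (_, raw), (_, scaled) in zip(groups, normalized):
        center = mean(scaled)
        difficulty = 1.0 - min(1.0, max(0.0, mean(raw)))
        weight = 1.0 + alpha * difficulty
        advantages.append([weight * (s - center) for s in scaled])
    return advantages
```

Calling `tmn_reweight_advantages([("qa", [0.2, 0.4]), ("count", [0.5, 1.0])])` returns one list of advantages per prompt group. In this sketch both tasks end up with a normalized mean reward close to 1, and the lower-scoring `qa` group receives the larger weight.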

Community

We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR) 🚀

GoLongRL-30B-A3B achieves long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507—while using a significantly smaller activated parameter budget ⚡

Under the same vanilla GRPO setup, our dataset alone outperforms the QwenLong-L1.5 dataset by +6.1pp at the 4B scale and +2.6pp at 30B 📈

🔥 Actually open-source — not “weights-only”, not partial releases:
we release training data, full code, and models. No hidden pipelines, no inaccessible datasets, no black boxes.
If you want to reproduce, audit, or build on top of it—you can. End to end.

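Resources:

📄 Paper: https://arxiv.org/pdf/2605.19577
🧠 GitHub project: https://github.com/xiaoxuanNLP/GoLongRL
📊 Data: https://huggingface.co/datasets/Kwai-Klear/GoLongRL
🤖 Models:
GoLongRL-30B-A3B: https://huggingface.co/Kwai-Klear/GoLongRL-30B-A3B
GoLongRL-4B: https://huggingface.co/Kwai-Klear/GoLongRL-4B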


