Hugging Face Daily Papers · June 24, 2026 · 5 min read

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


Long-context reasoning is an essential capability for large language models, particularly when they are deployed as autonomous agents that must reason over lengthy trajectories. Reinforcement learning (RL) has recently emerged as a dominant paradigm for improving this ability, yet existing work largely focuses on reward engineering while diverse training data remains scarce. We revisit this problem from a data-centric perspective and show that a simple yet effective data recipe alone, paired with a minimal outcome-based GRPO setup, suffices to substantially improve long-context reasoning. Our recipe targets three complementary task families – retrieval, multi-evidence synthesis, and reasoning – for which we construct and curate eight datasets totaling ~14K examples. Experiments on three models (Qwen3-4B/8B/30B-A3B) yield average gains of +7.2/+3.2/+6.4 points across seven long-context benchmarks, surpassing prior RL training sets. We further demonstrate that these gains transfer to agentic tasks, where continuing RL training on an agent-tuned model with our data recipe improves GAIA by +4.8 and BrowseComp by +7.0 points. We will release our datasets to facilitate future research.
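The abstract mentions a "minimal outcome-based GRPO setup" without spelling it out on this page. As a rough illustration only, the sketch below shows the group-relative advantage computation commonly associated with GRPO, driven by a binary outcome reward. The exact-match scoring rule, the function names, the sample rollouts, and the `eps` value are all assumptions made for the example; they are not taken from the paper, whose actual reward and hyperparameters may differ.

```python
from statistics import mean, pstdev


def outcome_reward(prediction: str, reference: str) -> float:
    """Binary outcome reward: 1.0 on a case-insensitive exact match, else 0.0.

    Hypothetical scoring rule, chosen only to keep the example small.
    """
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one group of rollouts sampled for the same prompt.

    Each rollout's advantage is its reward minus the group mean, divided by
    the group (population) standard deviation plus a small epsilon.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts answering one long-context question.
rollouts = ["Paris", "paris ", "Lyon", "Marseille"]
rewards = [outcome_reward(r, "Paris") for r in rollouts]
advantages = group_relative_advantages(rewards)
```

Under this formulation, a group in which every rollout receives the same reward yields zero advantages and therefore no learning signal. That is one plausible reason why the difficulty and diversity of the training data could matter as much as the reward design.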


arxiv:2606.18831


Published on Jun 17

Submitted by Chaojun XIAO on Jun 23 · OpenBMB


Authors: Xiaoyue Xu, Sikui Zhang, Xiaorong Wang, Xu Han, Chaojun Xiao

Abstract

A data-centric approach using curated datasets and a minimal GRPO setup significantly improves long-context reasoning in large language models, outperforming prior reinforcement learning training sets.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct


