Hugging Face Daily Papers · 3 min read

LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Paper: https://arxiv.org/abs/2605.31584
Code: https://github.com/THU-KEG/LongTraceRL
Authors: Nianyi Lin, Jiajie Zhang, Lei Hou, Juanzi Li
Organization: THU-KEG (Knowledge Engineer Group @ Tsinghua University)
Published: 2026-05-29

AI-generated summary

LongTraceRL addresses long-context reasoning challenges in large language models through tiered distractor construction and rubric reward design for improved reasoning quality.

Abstract

Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information in extensive distracting content. Reinforcement learning with verifiable rewards (RLVR) has shown promise for this task, yet existing methods are limited by low-confusability distractors and sparse, outcome-only reward signals that cannot supervise intermediate reasoning steps. To address these issues, we introduce LongTraceRL. For data construction, we generate multi-hop questions via knowledge graph random walks and leverage search agent trajectories to build tiered distractors: documents the agent read but did not cite (high confusability) and documents that appeared in search results but were never opened (low confusability), producing training contexts that are far more challenging than those built by random sampling or one-shot search. For reward design, we propose a rubric reward that uses the gold entities along each reasoning chain as fine-grained, entity-level process supervision. This rubric reward is applied only to responses with correct final answers (positive-only strategy), distinguishing the reasoning quality among correct responses and preventing reward hacking. Experiments on three reasoning LLMs (4B to 30B) across five long-context benchmarks demonstrate that LongTraceRL consistently outperforms strong baselines and encourages comprehensive, evidence-grounded reasoning. Code, datasets, and models are available at https://github.com/THU-KEG/LongTraceRL.
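
The two mechanisms named in the abstract, tiered distractors and the positive-only rubric reward, can be illustrated with a minimal Python sketch. This is not the authors' implementation; see the linked repository for that. The `Trajectory` fields, the function names, the additive reward form, the `weight` value, and the use of exact-match answer checking and substring entity matching are all assumptions made for illustration, since the abstract does not specify them.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical record of one search-agent run (field names are assumptions)."""
    retrieved: list[str]  # doc ids returned by any search call
    opened: list[str]     # doc ids the agent actually read
    cited: list[str]      # doc ids the agent cited in its final answer


def tier_distractors(traj: Trajectory) -> dict[str, list[str]]:
    """Split non-cited documents into the two tiers described in the abstract."""
    cited, opened = set(traj.cited), set(traj.opened)
    # High confusability: read by the agent but not cited.
    high = [d for d in dict.fromkeys(traj.opened) if d not in cited]
    # Low confusability: appeared in search results but never opened.
    low = [d for d in dict.fromkeys(traj.retrieved)
           if d not in opened and d not in cited]
    return {"high": high, "low": low}


def rubric_reward(response: str, predicted: str, gold_answer: str,
                  gold_entities: list[str], weight: float = 0.5) -> float:
    """Outcome reward plus an entity-coverage bonus, for correct answers only."""
    # Assumption: normalised exact match stands in for the answer verifier.
    if predicted.strip().lower() != gold_answer.strip().lower():
        return 0.0  # positive-only: incorrect responses earn no rubric credit
    if not gold_entities:
        return 1.0
    text = response.lower()
    # Assumption: substring match stands in for entity detection.
    hits = sum(1 for e in gold_entities if e.lower() in text)
    # Assumption: additive combination of outcome and rubric terms.
    return 1.0 + weight * hits / len(gold_entities)


if __name__ == "__main__":
    traj = Trajectory(retrieved=["d1", "d2", "d3", "d4", "d5"],
                      opened=["d1", "d2", "d3"], cited=["d1"])
    print(tier_distractors(traj))
    chain = ["Marie Curie", "Warsaw"]
    print(rubric_reward("Marie Curie was born in Warsaw, in Poland.",
                        "Poland", "Poland", chain))
    print(rubric_reward("The answer is Poland.", "Poland", "Poland", chain))
```

In this sketch, two correct responses receive different rewards depending on how many gold entities from the reasoning chain they mention, while an incorrect response receives zero regardless of entity coverage. That gating is the property the abstract credits with preventing reward hacking: entities cannot be listed to collect credit without also reaching the correct answer.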

Get this paper in your agent:

hf papers read 2605.31584
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 3

Datasets citing this paper 1

Spaces citing this paper 0

Collections including this paper 3
