TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning
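Authors: Heming Zou, Qi Wang, Yun Qu, Yuhang Jiang, Lizhou Cai, Yixiu Mao, Ru Peng, Xin Xu, Weijie Liu, Kai Yang, Saiyong Yang, Xiangyang Ji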
Abstract
TRACE is a rollout allocation framework that improves reward contrast in multi-turn agentic reinforcement learning by dynamically distributing resources across tree-structured rollouts based on prefix-level informativeness.
Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward contrast, arising when overly simple or complex prompts generate low-variance feedback and when outcome-only rewards assign the same terminal assessment to every decision in a multi-turn rollout. Past efforts have focused on allocating available rollout resources to promising prompts, yet they only leverage sample informativeness at the prompt level and neglect variation in prefix-level informativeness across turns within the same rollout. This work targets multi-turn agentic RL by modeling each ReAct-style thought-action-observation turn as a semantically distinct node, allowing budget allocation to extend from prompt roots to turn-level prefixes with further continuations, which naturally forms tree-structured rollouts. We introduce Tree Rollout Allocation for Contrastive Exploration (TRACE), a unified rollout allocation framework that enhances reward contrast within a fixed sampling budget. Technically, TRACE allocates rollout budget to both prompt roots and intermediate prefixes that are most likely to yield mixed terminal rewards. A shared generalizable predictor estimates conditional success probability at these anchors from prefix histories to guide this allocation. The resulting adaptive tree structure enriches outcome-only feedback and amplifies the policy-update signal. Empirically, TRACE achieves competitive performance and efficiency gains on typical agentic benchmarks, e.g., improving Qwen3-14B Multi-Hop QA average accuracy by 2.8 points over competitive baselines at equal sampling cost.
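As a worked illustration of what "most likely to yield mixed terminal rewards" can mean, consider a simplifying assumption that the abstract does not state: rewards are binary, and the n continuations sampled from an anchor succeed independently with the same conditional success probability p. The symbols n and p are introduced here for illustration only.

```latex
\Pr[\text{mixed outcomes among } n \text{ continuations}]
  = 1 - p^{n} - (1-p)^{n},
\qquad
\operatorname{Var}[r] = p\,(1-p).
```

Both quantities are zero at p = 0 and p = 1 and peak at p = 1/2, so under this assumption an anchor whose predicted success probability is near one half is the most promising place to spend rollouts, while anchors that almost always succeed or almost always fail contribute little contrast.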
Community
TL;DR: We make agentic RLVR more sample-efficient by allocating a fixed rollout budget not just across prompts, but across turns within a rollout.
RLVR training is bottlenecked by reward contrast: overly easy/hard prompts give low-variance feedback, and a single outcome-only reward leaves almost no local signal for credit assignment across a long multi-turn rollout. Prior work only allocates budget at the prompt level — once a prompt is picked, each rollout is still an atomic trajectory.
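The low-contrast problem can be seen in a small sketch. It assumes binary outcome rewards and a group-mean baseline for the advantage, which is one common choice and not necessarily the estimator used in the paper; the function name `group_advantages` is hypothetical.

```python
def group_advantages(rewards: list[float]) -> list[float]:
    """Center each rollout's outcome reward on the mean of its group.

    Assumption: a plain group-mean baseline, used only for illustration.
    """
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# Four rollouts per prompt, binary terminal reward.
too_easy = group_advantages([1.0, 1.0, 1.0, 1.0])  # every rollout succeeds
too_hard = group_advantages([0.0, 0.0, 0.0, 0.0])  # every rollout fails
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])     # successes and failures

# With an outcome-only reward, every turn of a rollout inherits the same
# advantage value, so a group with identical rewards gives no update signal
# to any turn of any rollout in it.
has_signal = {
    "too_easy": any(a != 0.0 for a in too_easy),
    "too_hard": any(a != 0.0 for a in too_hard),
    "mixed": any(a != 0.0 for a in mixed),
}
```

Only the mixed group produces nonzero advantages, which is why the budget is better spent where successes and failures are both likely.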
We observe that ReAct-style agentic interaction naturally packages each thought–action–observation step as a node, turning flat rollouts into tree-structured rollouts. This lets us unify prompt filtering, rollout-count allocation, and turn-level branching under one principle — mixed-reward contrast construction: spend budget on anchors (roots and intermediate prefixes) whose descendants are most likely to contain both successes and failures. A single shared predictor estimates conditional success probability from prefix histories to guide a two-stage scheme (global root allocation → local prefix expansion).
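The two-stage idea described above can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the scoring rule p(1-p), the filtering threshold `min_score`, the proportional split with largest-remainder rounding, and all names and probabilities are assumptions. The predicted success probabilities stand in for the output of the shared predictor.

```python
def mixed_score(p: float) -> float:
    """Bernoulli variance p(1-p): largest at p = 0.5, zero at p in {0, 1}."""
    return p * (1.0 - p)


def allocate(probs: dict[str, float], budget: int,
             min_score: float = 0.05) -> dict[str, int]:
    """Split an integer rollout budget across anchors by mixed-reward score.

    Anchors scoring below min_score are filtered out and receive 0.
    Largest-remainder rounding makes the allocation sum exactly to budget.
    """
    alloc = {name: 0 for name in probs}
    kept = {n: mixed_score(p) for n, p in probs.items()
            if mixed_score(p) >= min_score}
    total = sum(kept.values())
    if total == 0 or budget <= 0:
        return alloc
    shares = {n: budget * s / total for n, s in kept.items()}
    for n, x in shares.items():
        alloc[n] = int(x)  # floor of the proportional share
    leftover = budget - sum(alloc.values())
    by_remainder = sorted(kept, key=lambda n: shares[n] - int(shares[n]),
                          reverse=True)
    for n in by_remainder[:leftover]:
        alloc[n] += 1
    return alloc


# Stage 1 (global): distribute rollouts across prompt roots.
# Hypothetical predicted success probabilities per prompt.
root_probs = {"q_easy": 0.98, "q_mid": 0.5, "q_hard": 0.02, "q_ok": 0.8}
root_alloc = allocate(root_probs, budget=8)

# Stage 2 (local): within one rollout of "q_mid", choose which turn-level
# prefixes to branch from. Hypothetical conditional success probabilities
# given the history up to each turn.
prefix_probs = {"turn1": 0.55, "turn2": 0.9, "turn3": 0.99}
branch_alloc = allocate(prefix_probs, budget=4)
```

Using the same scoring function at both stages mirrors the claim that one principle covers prompt filtering (anchors dropped by the threshold), rollout-count allocation (the proportional split over roots), and turn-level branching (the same split over prefixes).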
Across Mathematical Reasoning, Multi-Hop QA, and Function Calling, TRACE improves accuracy at equal sampling cost — e.g., +2.8 points on Qwen3-14B Multi-Hop QA over competitive baselines.
Cite arxiv.org/abs/2606.11119 in a model README.md to link it from this page.