Hugging Face Daily Papers · June 11, 2026 · 6 min read

TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.



arxiv:2606.11119

Published on Jun 9 · Submitted by Heming Zou on Jun 11 · Tencent

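Authors: Heming Zou, Qi Wang, Yun Qu, Yuhang Jiang, Lizhou Cai, Yixiu Mao, Ru Peng, Xin Xu, Weijie Liu, Kai Yang, Saiyong Yang, Xiangyang Ji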

Abstract

TRACE is a rollout allocation framework that improves reward contrast in multi-turn agentic reinforcement learning by dynamically distributing resources across tree-structured rollouts based on prefix-level informativeness.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward contrast, arising when overly simple or complex prompts generate low-variance feedback and when outcome-only rewards assign the same terminal assessment to every decision in a multi-turn rollout. Past efforts have focused on allocating available rollout resources to promising prompts, yet they only leverage sample informativeness at the prompt level and neglect variation in prefix-level informativeness across turns within the same rollout. This work targets multi-turn agentic RL by modeling each ReAct-style thought-action-observation turn as a semantically distinct node, allowing budget allocation to extend from prompt roots to turn-level prefixes with further continuations, which naturally forms tree-structured rollouts. We introduce Tree Rollout Allocation for Contrastive Exploration (TRACE), a unified rollout allocation framework that enhances reward contrast within a fixed sampling budget. Technically, TRACE allocates rollout budget to both prompt roots and intermediate prefixes that are most likely to yield mixed terminal rewards. A shared generalizable predictor estimates conditional success probability at these anchors from prefix histories to guide this allocation. The resulting adaptive tree structure enriches outcome-only feedback and amplifies the policy-update signal. Empirically, TRACE achieves competitive performance and efficiency gains on typical agentic benchmarks, e.g., improving Qwen3-14B Multi-Hop QA average accuracy by 2.8 points over competitive baselines at equal sampling cost.
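One way to make "most likely to yield mixed terminal rewards" concrete is sketched below. The abstract does not state the scoring rule, so this is an illustrative assumption rather than the paper's formula: treat the n continuations sampled from an anchor as independent trials with binary reward, each succeeding with probability p (the predictor's conditional success estimate). The chance that the group contains at least one success and at least one failure is then:

```latex
\Pr[\text{mixed rewards}] \;=\; 1 - p^{n} - (1-p)^{n}, \qquad 0 \le p \le 1,\; n \ge 2 .
```

Under these assumptions the quantity vanishes at p = 0 and p = 1, peaks at p = 1/2 with value 1 - 2^{1-n}, and reduces to 2p(1-p) for n = 2. For example, with n = 4 it is 0.875 at p = 0.5 but only about 0.078 at p = 0.98, which is why anchors with near-certain outcomes are poor places to spend rollouts.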


Community

gfyddha · Paper author · Paper submitter · about 10 hours ago

TL;DR: We make agentic RLVR more sample-efficient by allocating a fixed rollout budget not just across prompts, but across turns within a rollout.

RLVR training is bottlenecked by reward contrast: overly easy/hard prompts give low-variance feedback, and a single outcome-only reward leaves almost no local signal for credit assignment across a long multi-turn rollout. Prior work only allocates budget at the prompt level — once a prompt is picked, each rollout is still an atomic trajectory.
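A small sketch of why uniform rewards carry no learning signal. It assumes advantages are computed by subtracting a per-prompt group mean from each rollout's reward, a common choice in group-based policy optimization; the comment above does not say which estimator the paper uses, and the function name is hypothetical.

```python
from statistics import mean


def centered_advantages(rewards):
    """Subtract the group mean reward from each rollout's reward."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# Binary outcome rewards for four rollouts of the same prompt.
all_fail = centered_advantages([0, 0, 0, 0])  # overly hard prompt
all_pass = centered_advantages([1, 1, 1, 1])  # overly easy prompt
mixed = centered_advantages([1, 0, 0, 1])     # prompt with reward contrast

print(all_fail, all_pass, mixed)
```

With this estimator, the all-fail and all-pass groups produce only zero advantages, so those rollouts contribute nothing to the update; only the mixed group yields nonzero values. Every turn inside a given rollout also shares that rollout's single advantage, which is the credit-assignment gap the comment describes.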

We observe that ReAct-style agentic interaction naturally packages each thought–action–observation step as a node, turning flat rollouts into tree-structured rollouts. This lets us unify prompt filtering, rollout-count allocation, and turn-level branching under one principle — mixed-reward contrast construction: spend budget on anchors (roots and intermediate prefixes) whose descendants are most likely to contain both successes and failures. A single shared predictor estimates conditional success probability from prefix histories to guide a two-stage scheme (global root allocation → local prefix expansion).
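The two-stage scheme can be sketched roughly as follows. This is not the authors' algorithm: the scoring rule (probability that a group of independent rollouts has mixed outcomes), the proportional split, the `floor` threshold, and all function names are assumptions for illustration, and the predictor is stood in for by hard-coded probabilities.

```python
def mixed_prob(p: float, n: int) -> float:
    """Chance that n independent rollouts with success rate p are neither all-pass nor all-fail."""
    return 1.0 - p**n - (1.0 - p) ** n


def allocate_roots(root_probs, budget, group=4, floor=0.1):
    """Stage 1 (global): filter out low-contrast prompts, split the budget over the rest."""
    scores = {k: mixed_prob(p, group) for k, p in root_probs.items()}
    kept = {k: s for k, s in scores.items() if s >= floor}
    if not kept:
        return {}
    total = sum(kept.values())
    # Proportional share, rounded down.
    alloc = {k: int(budget * s / total) for k, s in kept.items()}
    # Hand rollouts lost to rounding to the highest-scoring prompts.
    leftover = budget - sum(alloc.values())
    for k in sorted(kept, key=kept.get, reverse=True)[:leftover]:
        alloc[k] += 1
    return alloc


def pick_branch_points(prefix_probs, k):
    """Stage 2 (local): indices of the k turn-level prefixes whose predicted success is closest to 0.5."""
    order = sorted(
        range(len(prefix_probs)),
        key=lambda i: prefix_probs[i] * (1.0 - prefix_probs[i]),
        reverse=True,
    )
    return sorted(order[:k])


# Hypothetical predictor outputs for four prompts (roots).
root_probs = {"easy": 0.98, "medium": 0.5, "hardish": 0.2, "hard": 0.02}
alloc = allocate_roots(root_probs, budget=16)

# Hypothetical predictor outputs after turns 0..3 of one rollout.
branch_at = pick_branch_points([0.9, 0.55, 0.3, 0.05], k=2)

print(alloc, branch_at)
```

In this toy run the near-certain prompts ("easy", "hard") are filtered out, the remaining budget concentrates on prompts whose outcome is uncertain, and branching happens at the turns where the predicted success probability is still far from 0 or 1. Sibling continuations from such a prefix can then end in different rewards, giving a turn-level contrast that a flat rollout lacks.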

Across Mathematical Reasoning, Multi-Hop QA, and Function Calling, TRACE improves accuracy at equal sampling cost — e.g., +2.8 points on Qwen3-14B Multi-Hop QA over competitive baselines.
