Hugging Face Daily Papers · 6 min read

When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Code: https://github.com/XHMY/marl-llm-workflows · Organization: Oregon State University
arxiv:2605.24202

Published on May 22 · Submitted by Yifan Zeng on Jun 2
Authors: Yifan Zeng, Yiran Wu, Yaolun Zhang, Wentian Zhao, Kun Wan, Qingyun Wu, Huazheng Wang

Abstract

AI-generated summary: Multi-agent large language model workflows trained with reinforcement learning show improved accuracy over base models, but performance varies significantly based on workflow type, task, and model scale, with isolated and shared policy training exhibiting distinct failure patterns due to gradient dynamics and role interactions.

Multi-agent LLM workflows route inference through specialized roles to lift end-task accuracy, but jointly training those roles with reinforcement learning is unstable in ways that are poorly understood. We study when end-to-end RL training of multi-agent LLM workflows improves over their base models, comparing Shared-Policy training, where all roles update one policy, with Isolated-Policy training, where each role has its own parameters. Our experimental matrix spans Eval-Opt, Voting, and Orch-Workers workflows, math and code tasks, and three model scales (0.6B, 1.7B, 4B). We find that multi-agent RL usually improves over base models, but gains depend jointly on workflow, task, and scale, not on policy sharing alone. Isolated-Policy tends to reach higher peak accuracy yet more often falls off a terminal accuracy cliff, while Shared-Policy training does not eliminate failure; it redistributes failure into qualitatively different patterns. We then explain the strongest of these patterns through role-level gradient dynamics induced by workflow topology and policy routing: under Isolated-Policy, parallel same-role agents on shared prompts amplify per-role gradients and drive terminal degradation in Voting and Orch-Workers workflows; under Shared-Policy, asymmetric per-step gradient mass causes the shared policy to be captured by the dominant role, producing different failure signatures by task and workflow. Together, the empirical map and its underlying mechanisms show that policy sharing routes training pressure through different channels rather than offering uniform stability, making it a design choice with workflow- and task-conditional tradeoffs.
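To make the size of the experimental matrix concrete, here is a minimal sketch that enumerates the cross product of the workflows, tasks, scales, and training regimes named in the abstract. The function name `experiment_grid` and the dictionary keys are illustrative assumptions, and the abstract does not say whether every cell of this cross product was actually run.

```python
from itertools import product

# Axis values are taken from the abstract; the variable names are illustrative.
WORKFLOWS = ["Eval-Opt", "Voting", "Orch-Workers"]
TASKS = ["math", "code"]
SCALES = ["0.6B", "1.7B", "4B"]
REGIMES = ["Shared-Policy", "Isolated-Policy"]


def experiment_grid():
    """Return every (workflow, task, scale, regime) combination as a dict."""
    return [
        {"workflow": w, "task": t, "scale": s, "regime": r}
        for w, t, s, r in product(WORKFLOWS, TASKS, SCALES, REGIMES)
    ]


if __name__ == "__main__":
    grid = experiment_grid()
    # 3 workflows x 2 tasks x 3 scales x 2 regimes
    print(len(grid))
```

The full cross product has 3 × 2 × 3 × 2 = 36 cells, which is why the findings are reported as conditional on workflow, task, and scale rather than as a single ranking of the two regimes.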

Community

🤔 When does training a multi-agent LLM workflow end-to-end with RL actually help, and when does it quietly break?

We run a controlled study across three workflows (Eval-Opt, Voting, Orch-Workers), two tasks (math + code), and three scales (Qwen3 0.6B → 4B), comparing two training regimes:

  • 🔗 Shared-Policy (SP): all roles update one policy
  • 🧩 Isolated-Policy (IP): each role keeps its own
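
The two regimes differ only in which parameter set each role's updates are routed to. A minimal sketch of that routing, assuming hypothetical role names (`orchestrator`, `worker`) and a hypothetical helper `route_roles` that is not taken from the paper's code:

```python
def route_roles(roles, regime):
    """Map each role name to the id of the parameter set it updates.

    regime == "shared":   every role updates the same policy (SP).
    regime == "isolated": every role updates its own policy (IP).
    """
    if regime == "shared":
        return {role: "policy_shared" for role in roles}
    if regime == "isolated":
        return {role: f"policy_{role}" for role in roles}
    raise ValueError(f"unknown regime: {regime!r}")


if __name__ == "__main__":
    roles = ["orchestrator", "worker"]  # illustrative role names
    print(route_roles(roles, "shared"))
    print(route_roles(roles, "isolated"))
```

Under this view, parallel copies of the same role (for example several workers) still map to a single entry: in either regime they all update the same parameters, which matters for the failure modes discussed in this post.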

✅ The good news: Multi-agent RL usually beats the base model. But how much you gain depends jointly on the workflow, the task, and the scale, not on policy sharing alone.

⚖️ The tradeoff: Isolated-Policy tends to reach higher peak accuracy but trains less stably. When a role has parallel copies that see similar input patterns, their updates pile up in the same gradient direction, driving over-optimization and late-training collapse.
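
A toy illustration of why same-direction updates pile up: summing three aligned gradient vectors gives a norm three times that of a single copy, while three mutually orthogonal vectors of the same size give only √3 times. This is a sketch with made-up vectors and hypothetical helper names, not the paper's analysis, and it assumes per-copy gradients are summed into the role's update.

```python
import math


def sum_vectors(vectors):
    """Element-wise sum of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]


def norm(vector):
    """Euclidean norm."""
    return math.sqrt(sum(x * x for x in vector))


# Three parallel copies whose gradients point the same way (toy values).
aligned = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
# Three copies whose gradients are mutually orthogonal (toy values).
orthogonal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

if __name__ == "__main__":
    print(norm(sum_vectors(aligned)))     # grows linearly with the number of copies
    print(norm(sum_vectors(orthogonal)))  # grows with the square root of the number of copies
```

If the copies were averaged instead of summed, both norms would shrink by the same factor, so the ratio between the aligned and orthogonal cases would be unchanged.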

⚠️ Shared-Policy isn't the safe default either. It just relocates the failure. When one role contributes asymmetric gradient mass, the shared policy gets captured by the dominant role, and that role's distribution leaks into the other roles' outputs.
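
One rough way to picture "asymmetric gradient mass" is as each role's share of the total update magnitude flowing into the shared policy. The sketch below computes those shares from per-role magnitudes. The role names (`evaluator`, `optimizer`), the numbers, and both helper functions are illustrative assumptions, not quantities or code from the paper.

```python
def gradient_mass_share(role_mass):
    """Normalize per-role gradient magnitudes into shares that sum to 1."""
    total = sum(role_mass.values())
    if total <= 0:
        raise ValueError("total gradient mass must be positive")
    return {role: mass / total for role, mass in role_mass.items()}


def dominant_role(role_mass):
    """Return the role holding the largest share of gradient mass."""
    shares = gradient_mass_share(role_mass)
    return max(shares, key=shares.get)


if __name__ == "__main__":
    # Hypothetical magnitudes: one role contributes three times as much as the other.
    mass = {"evaluator": 1.0, "optimizer": 3.0}
    print(gradient_mass_share(mass))
    print(dominant_role(mass))
```

With a split like this, most of each shared update reflects one role's objective, which is the setting in which the post describes the shared policy being captured.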

🎯 Takeaway: There's no universally safe choice, but the failure modes are predictable. Isolated-Policy tends to break on roles that are duplicated in the workflow (the 3 voters, the 3 workers), since the copies stack up same-direction updates and over-train that role into a late-stage collapse. Shared-Policy instead breaks when one role dominates the gradients and drags the shared policy toward its behavior. Match the policy to your workflow's structure rather than defaulting to either.
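
The takeaway can be read as a rule of thumb for inspecting a workflow before choosing a regime. The sketch below encodes it as a simple check. The function `failure_risks`, the role name `aggregator`, and the 0.6 dominance threshold are all hypothetical choices for illustration; the paper does not prescribe a threshold.

```python
def failure_risks(role_copies, role_mass_share, regime, dominance_threshold=0.6):
    """List the failure modes the takeaway associates with a regime.

    role_copies:      role -> number of parallel copies in the workflow
    role_mass_share:  role -> estimated share of gradient mass (sums to 1)
    regime:           "isolated" or "shared"
    """
    risks = []
    if regime == "isolated":
        # Duplicated roles stack same-direction updates on one policy.
        for role, copies in role_copies.items():
            if copies > 1:
                risks.append(f"{role}: {copies} parallel copies may over-train this role")
    elif regime == "shared":
        # A dominant role can drag the shared policy toward its behavior.
        for role, share in role_mass_share.items():
            if share > dominance_threshold:
                risks.append(f"{role}: {share:.0%} of gradient mass may capture the shared policy")
    else:
        raise ValueError(f"unknown regime: {regime!r}")
    return risks


if __name__ == "__main__":
    copies = {"voter": 3, "aggregator": 1}         # illustrative workflow
    shares = {"voter": 0.75, "aggregator": 0.25}   # illustrative estimates
    print(failure_risks(copies, shares, "isolated"))
    print(failure_risks(copies, shares, "shared"))
```

A check like this only flags where to look. Per the post, the actual outcome also depends on the task and the model scale.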
