Hugging Face Daily Papers · 5 min read

EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Code: https://github.com/LARK-AI-Lab/EnvFactory
Organization: LARK Lab@HKUST (GZ)

arxiv:2605.18703

Published on May 18, 2026 · Submitted by shawnxzhu on May 20, 2026
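Authors: Minrui Xu, Zilin Wang, Mengyi DENG, Zhiwei Li, Zhicheng Yang, Xiao Zhu, Yinhong Liu, Boyu Zhu, Baiyu Huang, Chao Chen, Heyuan Deng, Fei Mi, Lifeng Shang, Xingshan Zeng, Zhijiang Guo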

Abstract

AI-generated summary: EnvFactory automates the creation of executable tool environments and natural multi-turn trajectories for training LLMs with agentic reinforcement learning, achieving superior performance with fewer resources.

Equipping LLMs with tool-use capabilities via Agentic Reinforcement Learning (Agentic RL) is bottlenecked by two challenges: the lack of scalable, robust execution environments and the scarcity of realistic training data that captures implicit human reasoning. Existing approaches depend on costly real-world APIs, hallucination-prone LLM simulators, or synthetic environments that are often single-turn or depend on pre-collected documents. Moreover, synthetic trajectories are frequently over-specified, resembling instruction sequences rather than natural human intents, which reduces their effectiveness for RL training. We introduce EnvFactory, a fully automated framework that addresses both challenges. EnvFactory autonomously explores and verifies stateful, executable tool environments from authentic resources, and synthesizes natural multi-turn trajectories through topology-aware sampling and calibrated refinement, producing grounded queries with implicit intents. Using only 85 verified environments across 7 domains, EnvFactory generates 2,575 SFT and RL trajectories. Although prior work often uses about five times as many environments, EnvFactory achieves superior training efficiency and downstream performance, improving Qwen3-series models by up to +15% on BFCLv3, +8.6% on MCP-Atlas, and +6% on conversational benchmarks including τ²-Bench and VitaBench. By fully automating both environment construction and trajectory synthesis, EnvFactory provides a scalable, extensible, and robust foundation for Agentic RL.
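
The abstract does not spell out how a stateful environment or topology-aware sampling is implemented, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes that "topology" can be modeled as a tool dependency graph and that sampling is a random walk along its edges. The names `TOOL_GRAPH`, `BookingEnv`, and `sample_tool_chain`, and the flight-booking tools themselves, are hypothetical.

```python
import random
from dataclasses import dataclass, field

# Hypothetical tool dependency graph (an assumption, not from the paper):
# an edge A -> B means a call to B can consume something A produced.
TOOL_GRAPH = {
    "search_flights": ["book_flight"],
    "book_flight": ["get_booking", "cancel_booking"],
    "get_booking": ["cancel_booking"],
    "cancel_booking": [],
}


@dataclass
class BookingEnv:
    """Toy stateful environment: tool calls read and mutate `bookings`."""

    bookings: dict = field(default_factory=dict)
    next_id: int = 1

    def search_flights(self, origin, dest):
        # Returns a fixed, deterministic result instead of calling a real API.
        return [{"flight": f"{origin}-{dest}-001", "price": 120}]

    def book_flight(self, flight):
        booking_id = self.next_id
        self.next_id += 1
        self.bookings[booking_id] = {"flight": flight, "status": "confirmed"}
        return booking_id

    def get_booking(self, booking_id):
        return self.bookings[booking_id]

    def cancel_booking(self, booking_id):
        self.bookings[booking_id]["status"] = "cancelled"
        return self.bookings[booking_id]


def sample_tool_chain(graph, start, max_len, rng):
    """Random walk along dependency edges, starting from `start`.

    Every tool after the first is a successor of the tool before it,
    so the chain respects the graph's dependency structure.
    """
    chain = [start]
    while len(chain) < max_len and graph[chain[-1]]:
        chain.append(rng.choice(graph[chain[-1]]))
    return chain


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_tool_chain(TOOL_GRAPH, "search_flights", 4, rng))

    # Executing calls against the environment shows that state persists
    # across turns, which is what makes multi-turn trajectories checkable.
    env = BookingEnv()
    flight = env.search_flights("HKG", "CAN")[0]["flight"]
    booking_id = env.book_flight(flight)
    env.cancel_booking(booking_id)
    print(env.get_booking(booking_id))
```

Because the environment is ordinary executable code, a sampled chain can be replayed and its final state inspected, rather than trusting an LLM simulator's description of what happened.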

Get this paper in your agent:

hf papers read 2605.18703
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 3

Datasets citing this paper 3

Spaces citing this paper 0

Collections including this paper 1
