Hugging Face Daily Papers · · 5 min read

AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Training vision-language web agents with multi-step RL is compute-intensive, with two dominant forms of inefficiency: idle GPUs in synchronous RL, and trajectories that use more steps and tokens than necessary. We present AsyncWebRL, which addresses both. On the system side, an asynchronous design overlaps rollout, gradient update, and policy refresh across iterations, paired with two web-agent-specific adaptations, namely an everlasting rollout pool and lightweight screenshot handling, that together deliver up to a 2.9x end-to-end training-throughput speedup over the previously fastest open synchronous pipeline (WebGym). On the algorithmic side, we identify the per-trajectory normalizer 1/|tau_i| in multi-step GRPO as the root cause of trajectory-level and token-level inefficiency: because failures are systematically longer than successes, it down-weights the negative gradient on failed tokens, so the policy keeps producing verbose memory schemas. Replacing 1/|tau_i| with a constant 1/k breaks this coupling, contracting trajectories while preserving aggregate success. Together, these contributions set a new open-source state of the art on the WebGym out-of-distribution test split (+5.8% relative over the 42.9% prior best), with the largest gains on the harder slices (+42% relative on Medium, +48% relative on Hard).</p>\n","updatedAt":"2026-06-09T18:13:37.055Z","author":{"_id":"62927c2e56fedc76e396b3ca","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62927c2e56fedc76e396b3ca/v-jgUS5GEC7jXygP1Edb8.jpeg","fullname":"HAO BAI","name":"JackBAI","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":3,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.8600284457206726},"editors":["JackBAI"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/62927c2e56fedc76e396b3ca/v-jgUS5GEC7jXygP1Edb8.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2606.05597","authors":[{"_id":"6a2856ede7d78ea7587e5151","name":"Hao Bai","hidden":false},{"_id":"6a2856ede7d78ea7587e5152","name":"Rui Yang","hidden":false},{"_id":"6a2856ede7d78ea7587e5153","name":"Chenlu Ye","hidden":false},{"_id":"6a2856ede7d78ea7587e5154","name":"Spencer Whitehead","hidden":false},{"_id":"6a2856ede7d78ea7587e5155","name":"Aviral Kumar","hidden":false},{"_id":"6a2856ede7d78ea7587e5156","name":"Tong Zhang","hidden":false}],"mediaUrls":["https://cdn-uploads.huggingface.co/production/uploads/62927c2e56fedc76e396b3ca/vEpLJEOB6MuyD921duZ2c.mp4"],"publishedAt":"2026-06-04T00:00:00.000Z","submittedOnDailyAt":"2026-06-09T00:00:00.000Z","title":"AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents","submittedOnDailyBy":{"_id":"62927c2e56fedc76e396b3ca","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62927c2e56fedc76e396b3ca/v-jgUS5GEC7jXygP1Edb8.jpeg","isPro":false,"fullname":"HAO BAI","user":"JackBAI","type":"user","name":"JackBAI"},"summary":"Training vision-language web agents with multi-step RL is compute-intensive, with two dominant forms of inefficiency: idle GPUs in synchronous RL, and trajectories that use more steps and tokens than necessary. We present AsyncWebRL, which addresses both. On the system side, an asynchronous design overlaps rollout, gradient update, and policy refresh across iterations, paired with two web-agent-specific adaptations, namely an everlasting rollout pool and lightweight screenshot handling, that together deliver up to a 2.9times end-to-end training-throughput speedup over the previously fastest open synchronous pipeline (WebGym). On the algorithmic side, we identify the per-trajectory normalizer 1/|τ_i| in multi-step GRPO as the root cause of trajectory-level and token-level inefficiency: because failures are systematically longer than successes, it down-weights the negative gradient on failed tokens, so the policy keeps producing verbose memory schemas. Replacing 1/|τ_i| with a constant 1/k breaks this coupling, contracting trajectories while preserving aggregate success. Together, these contributions set a new open-source state of the art on the WebGym out-of-distribution test split (+5.8% relative over the 42.9% prior best), with the largest gains on the harder slices (+42% relative on Medium, +48% relative on Hard).","upvotes":2,"discussionId":"6a2856ede7d78ea7587e5157","projectPage":"https://asyncwebrl-website.github.io/","githubRepo":"https://github.com/microsoft/webgym","githubRepoAddedBy":"user","ai_summary":"AsyncWebRL improves vision-language web agent training through asynchronous reinforcement learning and trajectory normalization modifications, achieving faster throughput and better performance on challenging tasks.","ai_keywords":["multi-step RL","synchronous RL","asynchronous design","rollout","gradient update","policy refresh","everlasting rollout pool","lightweight screenshot handling","multi-step GRPO","trajectory-level inefficiency","token-level inefficiency","per-trajectory normalizer","aggregate success"],"ai_summary_model":"Qwen/Qwen2.5-Coder-32B-Instruct","githubStars":39,"organization":{"_id":"5e6485f787403103f9f1055e","name":"microsoft","fullname":"Microsoft","avatar":"https://cdn-avatars.huggingface.co/v1/production/uploads/1583646260758-5e64858c87403103f9f1055d.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"62927c2e56fedc76e396b3ca","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62927c2e56fedc76e396b3ca/v-jgUS5GEC7jXygP1Edb8.jpeg","isPro":false,"fullname":"HAO BAI","user":"JackBAI","type":"user"},{"_id":"64d45451c34a346181b130dd","avatarUrl":"/avatars/9bb8205b889337df5d321539c9b5d69d.svg","isPro":true,"fullname":"Rui Yang","user":"Ray2333","type":"user"}],"acceptLanguages":["en"],"dailyPaperRank":0,"organization":{"_id":"5e6485f787403103f9f1055e","name":"microsoft","fullname":"Microsoft","avatar":"https://cdn-avatars.huggingface.co/v1/production/uploads/1583646260758-5e64858c87403103f9f1055d.png"},"markdownContentUrl":"https://huggingface.co/buckets/huggingchat/papers-content/resolve/2606/2606.05597.md"}">
Papers
arxiv:2606.05597

AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents

Published on Jun 4
· Submitted by
HAO BAI
on Jun 9
Authors:
,
,
,
,
,

Abstract

AsyncWebRL improves vision-language web agent training through asynchronous reinforcement learning and trajectory normalization modifications, achieving faster throughput and better performance on challenging tasks.

Training vision-language web agents with multi-step RL is compute-intensive, with two dominant forms of inefficiency: idle GPUs in synchronous RL, and trajectories that use more steps and tokens than necessary. We present AsyncWebRL, which addresses both. On the system side, an asynchronous design overlaps rollout, gradient update, and policy refresh across iterations, paired with two web-agent-specific adaptations, namely an everlasting rollout pool and lightweight screenshot handling, that together deliver up to a 2.9times end-to-end training-throughput speedup over the previously fastest open synchronous pipeline (WebGym). On the algorithmic side, we identify the per-trajectory normalizer 1/|τ_i| in multi-step GRPO as the root cause of trajectory-level and token-level inefficiency: because failures are systematically longer than successes, it down-weights the negative gradient on failed tokens, so the policy keeps producing verbose memory schemas. Replacing 1/|τ_i| with a constant 1/k breaks this coupling, contracting trajectories while preserving aggregate success. Together, these contributions set a new open-source state of the art on the WebGym out-of-distribution test split (+5.8% relative over the 42.9% prior best), with the largest gains on the harder slices (+42% relative on Medium, +48% relative on Hard).

Community

Paper submitter about 2 hours ago

Training vision-language web agents with multi-step RL is compute-intensive, with two dominant forms of inefficiency: idle GPUs in synchronous RL, and trajectories that use more steps and tokens than necessary. We present AsyncWebRL, which addresses both. On the system side, an asynchronous design overlaps rollout, gradient update, and policy refresh across iterations, paired with two web-agent-specific adaptations, namely an everlasting rollout pool and lightweight screenshot handling, that together deliver up to a 2.9x end-to-end training-throughput speedup over the previously fastest open synchronous pipeline (WebGym). On the algorithmic side, we identify the per-trajectory normalizer 1/|tau_i| in multi-step GRPO as the root cause of trajectory-level and token-level inefficiency: because failures are systematically longer than successes, it down-weights the negative gradient on failed tokens, so the policy keeps producing verbose memory schemas. Replacing 1/|tau_i| with a constant 1/k breaks this coupling, contracting trajectories while preserving aggregate success. Together, these contributions set a new open-source state of the art on the WebGym out-of-distribution test split (+5.8% relative over the 42.9% prior best), with the largest gains on the harder slices (+42% relative on Medium, +48% relative on Hard).

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.
Tap or paste here to upload images

· Sign up or log in to comment

Get this paper in your agent:

hf papers read 2606.05597
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2606.05597 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2606.05597 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2606.05597 in a Space README.md to link it from this page.

Collections including this paper 1

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from Hugging Face Daily Papers