Hugging Face Daily Papers · June 16, 2026 · 4 min read

Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.



arxiv:2606.17043

Published on Jun 15 · Submitted by Siyuan on Jun 16

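Authors: Tongyan Fang, Siyuan Huang, Naiyu Fang, Ganlong Zhao, Zhongjin Luo, Jianbo Liu, Xiaogang Wang, Ying Dong, Hongsheng Li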

Abstract

Hierarchical Advantage-Weighted Behavior Cloning (HABC) addresses sparse reward challenges in robot learning by separately optimizing viability and efficiency objectives through adaptive critic heads and intervention-aware credit assignment, significantly improving success rates in contact-rich manipulation tasks.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

When pretrained vision-language-action (VLA) policies are fine-tuned through online RL, each rollout episode produces only a single binary outcome (success or failure), yet the actor update requires per-transition supervision. Existing approaches commonly reduce this sparse outcome to a single scalar reward or advantage signal, which causes two problems. First, a single scalar conflates the two objectives of viability and efficiency; once basic success is achievable, the binary label provides no gradient to distinguish efficient completions from slow ones. Second, real-world rollouts mix autonomous and intervention segments; naively assigning episode outcomes across these boundaries misassigns credit. To address these issues, we propose Hierarchical Advantage-Weighted Behavior Cloning (HABC), which trains separate critic heads for the two objectives on different data subsets. A state-adaptive gate g_t merges their one-step advantages, prioritizing viability when success is uncertain and shifting to efficiency only when viability is high, and the merged advantage is converted into per-transition weights on the actor loss. Intervention-aware credit assignment further restricts outcome labels to segments executed by the current policy, preventing supervision from leaking across intervention boundaries. In real-robot experiments on three contact-rich bimanual tasks, HABC raises success rates from supervised fine-tuning (SFT) baselines of 36%, 44%, and 12% to 92%, 88%, and 38%, respectively.
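The abstract gives no formulas, so the following is only a minimal sketch of how a gated advantage and intervention-aware labels might be wired together. Everything specific here is an assumption for illustration, not the authors' implementation: the sigmoid form of the gate, its `threshold` and `sharpness` parameters, the convex combination of the two advantages, the exponential weight with temperature `beta` and cap `max_weight`, and the choice to label only the final policy-executed segment of an episode.

```python
import math
from typing import List, Optional


def gate(v_viab: float, threshold: float = 0.8, sharpness: float = 20.0) -> float:
    """Hypothetical gate g_t in (0, 1).

    Close to 0 while the viability estimate is below `threshold`
    (success uncertain), close to 1 once viability is high.
    """
    return 1.0 / (1.0 + math.exp(-sharpness * (v_viab - threshold)))


def combined_advantage(v_viab: float, adv_viab: float, adv_eff: float) -> float:
    """Assumed merge rule: convex combination of the two one-step advantages."""
    g = gate(v_viab)
    return (1.0 - g) * adv_viab + g * adv_eff


def actor_weight(advantage: float, beta: float = 1.0, max_weight: float = 20.0) -> float:
    """Assumed exponential advantage weight on the behavior-cloning loss, capped."""
    scaled = advantage / beta
    if scaled >= math.log(max_weight):
        return max_weight
    return math.exp(scaled)


def outcome_labels(by_policy: List[bool], success: bool) -> List[Optional[float]]:
    """One conservative reading of intervention-aware credit assignment.

    Walk backward from the episode end and label only the trailing run of
    policy-executed steps; stop at the first intervention step so the
    outcome does not cross that boundary. Unlabeled steps stay None.
    """
    labels: List[Optional[float]] = [None] * len(by_policy)
    t = len(by_policy) - 1
    while t >= 0 and by_policy[t]:
        labels[t] = 1.0 if success else 0.0
        t -= 1
    return labels


if __name__ == "__main__":
    # Low viability: the viability advantage dominates the merged signal.
    print(combined_advantage(v_viab=0.3, adv_viab=1.0, adv_eff=-1.0))
    # High viability: the efficiency advantage dominates instead.
    print(combined_advantage(v_viab=0.99, adv_viab=1.0, adv_eff=-1.0))
    # Steps before and during the intervention receive no outcome label.
    print(outcome_labels([True, True, False, True, True], success=True))
```

In this sketch the per-transition actor loss would be the usual behavior-cloning loss multiplied by `actor_weight(combined_advantage(...))`. The cap on the weight is a common stabilizing choice in advantage-weighted methods and is assumed here rather than taken from the paper.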

Get this paper in your agent:

hf papers read 2606.17043

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
