Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes
Published on Jun 15 · Submitted by Siyuan on Jun 16
Abstract
Hierarchical Advantage-Weighted Behavior Cloning (HABC) addresses sparse reward challenges in robot learning by separately optimizing viability and efficiency objectives through adaptive critic heads and intervention-aware credit assignment, significantly improving success rates in contact-rich manipulation tasks.
When pretrained vision-language-action (VLA) policies are fine-tuned through online RL, each rollout episode produces only a single binary outcome (success or failure), yet the actor update requires per-transition supervision. Existing approaches commonly reduce this sparse outcome to a single scalar reward or advantage signal, which causes two problems. First, a single scalar signal conflates the two objectives of viability and efficiency; once basic success is achieved, the binary label provides no gradient to distinguish efficient completions from slow ones. Second, real-world rollouts mix autonomous and intervention segments; naively assigning episode outcomes across these boundaries introduces incorrect credit assignment. To address these issues, we propose Hierarchical Advantage-Weighted Behavior Cloning (HABC), which trains separate critic heads for the two objectives on different data subsets. A state-adaptive gate g_t merges their one-step advantages, prioritizing viability when success is uncertain and shifting to efficiency only when viability is high, and converts the result into per-transition weights on the actor loss. Intervention-aware credit assignment further restricts outcome labels to segments executed by the current policy, preventing supervision from leaking across intervention boundaries. In real-robot experiments on three contact-rich bimanual tasks, HABC raises success from supervised fine-tuning (SFT) baselines of 36%, 44%, and 12% to 92%, 88%, and 38%, respectively.
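The abstract describes the gating and weighting only in words, so the following is a minimal sketch of how such a scheme could look, not the authors' implementation. Everything concrete here is an assumption made for illustration: the sigmoid form of the gate, the `threshold` and `temperature` values, the convex combination of the two advantages, the clipped exponential weight with scale `beta`, and the rule that intervention steps simply receive no outcome label. All function names are hypothetical.

```python
import math


def gate(p_viable, threshold=0.8, temperature=0.05):
    """Hypothetical gate g_t in (0, 1).

    Close to 0 when the estimated success probability is low (viability
    dominates), close to 1 when it is high (efficiency dominates).
    The sigmoid shape and both constants are assumptions.
    """
    return 1.0 / (1.0 + math.exp(-(p_viable - threshold) / temperature))


def combined_advantage(adv_viability, adv_efficiency, p_viable):
    """Blend the two one-step advantages with the gate (assumed convex mix)."""
    g = gate(p_viable)
    return (1.0 - g) * adv_viability + g * adv_efficiency


def transition_weight(advantage, beta=1.0, max_weight=20.0):
    """Turn an advantage into a non-negative weight on the actor loss.

    Uses a clipped exponential, as in common advantage-weighted
    behavior cloning; the source does not state the exact transform.
    """
    return min(math.exp(advantage / beta), max_weight)


def policy_segment_labels(is_intervention, episode_outcome):
    """Assign the episode outcome only to steps executed by the policy.

    Steps flagged as intervention get None, i.e. no outcome label.
    This is one simple reading of intervention-aware credit assignment.
    """
    return [None if flag else episode_outcome for flag in is_intervention]


if __name__ == "__main__":
    # Uncertain state: the viability advantage dominates the blend.
    a_low = combined_advantage(adv_viability=1.0, adv_efficiency=-1.0, p_viable=0.2)
    # Confident state: the efficiency advantage dominates the blend.
    a_high = combined_advantage(adv_viability=1.0, adv_efficiency=-1.0, p_viable=1.0)
    print(a_low, transition_weight(a_low))
    print(a_high, transition_weight(a_high))
    print(policy_segment_labels([False, True, False], episode_outcome=1))
```

Under these assumptions, a transition in a state where success is still uncertain is weighted almost entirely by the viability advantage, while the same transition in a high-viability state is weighted by how much it helps efficiency.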
arXiv: arxiv.org/abs/2606.17043