ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining
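Authors: Hao Li, Ganlong Zhao, Yufei Liu, Haotian Hou, Guoquan Ye, Tongyan Fang, Chunxiao Liu, Siyuan Huang, Jianbo Liu, Xiaogang Wang, Hongsheng Li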
Abstract
A unified Vision-Language-Action pretraining framework leverages heterogeneous data sources including human egocentric videos and robot trajectories through a reliability-aware training approach that improves performance on embodied AI tasks.
Vision-Language-Action (VLA) models benefit from large-scale and diverse embodied data, yet scaling robot trajectory collection is costly and labor-intensive. Recent advances show that large-scale egocentric human videos provide complementary real-world supervision in pretraining. However, joint training on human and robot data remains challenging due to divergences in action spaces, embodiment structures, temporal dynamics, and supervision quality. We introduce ACE-EGO-0, a unified VLA pretraining framework jointly leveraging heterogeneous data sources. To extract large-scale pretraining supervision from egocentric human videos, we build a scalable egocentric video-to-action pipeline that converts raw human videos into robot-format pseudo-action trajectories. To make these labels comparable with robot demonstrations, ACE-EGO-0 uses a unified action representation based on camera-space actions, morphology conditioning, and time-aligned action chunking. To robustly leverage noisy pseudo-action supervision from egocentric human videos, we formulate a reliability-aware training objective with a human auxiliary loss that concentrates supervision on reliable signals. We instantiate ACE-EGO-0 on 4.53K hours of robot and simulation data, together with 1.48K hours of pseudo-action-labeled egocentric human data. Experiments show that incorporating large-scale human supervision under reliability-aware weighting consistently improves both unified joint pretraining and supervised fine-tuning. ACE-EGO-0 achieves state-of-the-art performance on RoboCasa GR1 TableTop and RoboTwin 2.0, while demonstrating strong transfer to real-world bimanual manipulation.
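As a rough illustration of what reliability-aware weighting could look like, here is a minimal Python sketch. It is not the paper's objective, which the abstract does not spell out. The function name, the per-sample `reliability` score in [0, 1], the `human_weight` factor, and the choice of mean squared error are all assumptions made for illustration: robot samples get full weight, while human pseudo-action samples are scaled by their reliability so that unreliable labels contribute little to the loss.

```python
from typing import Sequence


def reliability_weighted_loss(
    pred: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
    reliability: Sequence[float],
    is_human: Sequence[bool],
    human_weight: float = 0.5,
) -> float:
    """Weighted mean of per-sample MSE over a batch of flattened action chunks.

    Hypothetical weighting scheme (an assumption, not the paper's formula):
      - robot samples:  weight 1.0
      - human samples:  weight human_weight * reliability, reliability clipped to [0, 1]
    """
    total = 0.0
    weight_sum = 0.0
    for p, t, r, human in zip(pred, target, reliability, is_human):
        # Per-sample mean squared error between predicted and target actions.
        mse = sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)
        # Down-weight noisy human pseudo-labels; keep robot labels at full weight.
        w = human_weight * min(max(r, 0.0), 1.0) if human else 1.0
        total += w * mse
        weight_sum += w
    # Normalize by the total weight so the loss scale does not depend on
    # how many human samples were down-weighted in this batch.
    return total / weight_sum if weight_sum > 0 else 0.0


if __name__ == "__main__":
    pred = [[0.0, 0.0], [0.0, 0.0]]
    target = [[1.0, 1.0], [2.0, 2.0]]
    # One robot sample and one human sample with a fully unreliable label.
    print(reliability_weighted_loss(pred, target, [1.0, 0.0], [False, True]))
```

Normalizing by the sum of weights is one design choice among several. Dividing by the batch size instead would let batches dominated by unreliable human samples produce smaller gradients overall.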
Community
The unified camera-space action space is slick, and the reliability-aware loss is a clean guardrail for noisy signals. I'm curious how this plays out when egocentric reconstructions have systematic noise like occlusions or timing jitter: does the weighting auto-correct, or does it still need tuning? It'd be great to see an ablation varying the noise profile or dropping some channels to stress-test the auto-weighting during pretraining. By the way, the arxivlens breakdown helped me parse the method details:
https://arxivlens.com/PaperView/Details/ace-ego-0-unifying-egocentric-human-and-robotic-data-for-vla-pretraining-8825-7e203c20
Neat paper. The bridge between human video and robot actions has always been a pain point, so I'm interested to see how that reliability-aware training objective actually performs in practice. It makes a lot of sense to filter out the noise from those pseudo-labels rather than just throwing everything into the mix.
How does the model handle the inherent differences in temporal dynamics between human movement and robot trajectories when aligning these action chunks?
I made a podcast on it with ResearchPod, it makes it easy to get the key concepts on the go:
https://researchpod.app/episode/4526ec0a-584b-4dea-9efd-d759ba040fd8
Cite arxiv.org/abs/2606.17200 in a model, dataset, or Space README.md to link it from this page.