Hugging Face Daily Papers · 6 min read

ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Authors: Hao Li, Ganlong Zhao, Yufei Liu, Haotian Hou, Guoquan Ye, Tongyan Fang, Chunxiao Liu, Siyuan Huang, Jianbo Liu, Xiaogang Wang, Hongsheng Li
Organization: CUHK
Project page: https://acerobotics-vla.github.io/ACE-Ego/
Code: https://github.com/ACERobotics-VLA/ACE-Ego-0

arxiv:2606.17200

Published on Jun 15, 2026 · Submitted by Siyuan (SiyuanH) on Jun 17, 2026 · #3 Paper of the day
Abstract

A unified Vision-Language-Action pretraining framework leverages heterogeneous data sources including human egocentric videos and robot trajectories through a reliability-aware training approach that improves performance on embodied AI tasks.

Vision-Language-Action (VLA) models benefit from large-scale and diverse embodied data, yet scaling robot trajectory collection is costly and labor-intensive. Recent advances show that large-scale egocentric human videos provide complementary real-world supervision in pretraining. However, joint training on human and robot data remains challenging due to divergences in action spaces, embodiment structures, temporal dynamics, and supervision quality. We introduce ACE-EGO-0, a unified VLA pretraining framework jointly leveraging heterogeneous data sources. To extract large-scale pretraining supervision from egocentric human videos, we build a scalable egocentric video-to-action pipeline that converts raw human videos into robot-format pseudo-action trajectories. To make these labels comparable with robot demonstrations, ACE-EGO-0 uses a unified action representation based on camera-space actions, morphology conditioning, and time-aligned action chunking. To robustly leverage noisy pseudo-action supervision from egocentric human videos, we formulate a reliability-aware training objective with a human auxiliary loss that concentrates supervision on reliable signals. We instantiate ACE-EGO-0 on 4.53K hours of robot and simulation data, together with 1.48K hours of pseudo-action-labeled egocentric human data. Experiments show that incorporating large-scale human supervision under reliability-aware weighting consistently improves both unified joint pretraining and supervised fine-tuning. ACE-EGO-0 achieves state-of-the-art performance on RoboCasa GR1 TableTop and RoboTwin 2.0, while demonstrating strong transfer to real-world bimanual manipulation.
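
The abstract describes the reliability-aware objective only at a high level, so the following is a minimal sketch of one way such a weighting could look, not the paper's actual loss. It assumes each human pseudo-action sample carries a reliability score in [0, 1] and that the human term is added to the robot term as an auxiliary loss. The function name `reliability_weighted_loss`, the `human_weight` coefficient, and the normalisation by total reliability are all hypothetical choices made for illustration.

```python
from typing import Sequence


def reliability_weighted_loss(
    robot_errors: Sequence[float],
    human_errors: Sequence[float],
    human_reliability: Sequence[float],
    human_weight: float = 0.5,  # hypothetical auxiliary-loss coefficient
) -> float:
    """Combine robot and human per-sample action errors into one scalar.

    Robot errors are averaged uniformly. Human pseudo-action errors are
    averaged with per-sample reliability weights, so samples with
    reliability 0 contribute nothing. This is an illustrative assumption,
    not the objective defined in the paper.
    """
    if len(human_errors) != len(human_reliability):
        raise ValueError("human_errors and human_reliability must align")
    if any(r < 0.0 or r > 1.0 for r in human_reliability):
        raise ValueError("reliability scores must lie in [0, 1]")

    # Uniform mean over robot demonstrations (treated as clean labels).
    robot_term = sum(robot_errors) / len(robot_errors) if robot_errors else 0.0

    # Reliability-weighted mean over human pseudo-labels.
    total_reliability = sum(human_reliability)
    if total_reliability > 0.0:
        weighted = sum(r * e for r, e in zip(human_reliability, human_errors))
        human_term = weighted / total_reliability
    else:
        human_term = 0.0  # no trustworthy human sample in this batch

    return robot_term + human_weight * human_term


if __name__ == "__main__":
    # The second human sample has a large error but zero reliability,
    # so it is ignored by the weighted mean.
    loss = reliability_weighted_loss(
        robot_errors=[0.2, 0.4],
        human_errors=[0.1, 2.0],
        human_reliability=[1.0, 0.0],
    )
    print(round(loss, 6))
```

Normalising by the summed reliability keeps the human term on the same scale regardless of how many samples are down-weighted. Whether the paper normalises this way is not stated in the abstract.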

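The abstract also mentions a unified representation based on camera-space actions. As a rough illustration of the general idea (expressing a 3D target in the egocentric camera frame rather than a world or robot-base frame), here is a small sketch of the standard rigid-body change of frame, p_cam = Rᵀ (p_world − t). The function name `world_to_camera` and the convention that the rotation maps camera coordinates to world coordinates are assumptions for this example. The paper's exact action parameterisation is not given in the abstract.

```python
from typing import List, Sequence


def world_to_camera(
    point_world: Sequence[float],
    rotation_cam_to_world: Sequence[Sequence[float]],
    camera_position_world: Sequence[float],
) -> List[float]:
    """Express a world-frame 3D point in the camera frame.

    rotation_cam_to_world is a 3x3 rotation matrix whose columns are the
    camera axes written in world coordinates (an assumed convention).
    The inverse of a rotation is its transpose, so p_cam = R^T (p_world - t).
    """
    # Offset of the point from the camera origin, still in world axes.
    delta = [p - t for p, t in zip(point_world, camera_position_world)]
    # Multiply by R^T: component c is the dot product of column c of R with delta.
    return [
        sum(rotation_cam_to_world[r][c] * delta[r] for r in range(3))
        for c in range(3)
    ]


if __name__ == "__main__":
    # Camera rotated 90 degrees about the world z-axis: its x-axis points
    # along world y. A point one unit along world y from the camera
    # therefore lies one unit along the camera's x-axis.
    rotation = [
        [0.0, -1.0, 0.0],
        [1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0],
    ]
    print(world_to_camera([1.0, 3.0, 3.0], rotation, [1.0, 2.0, 3.0]))
```

Expressing both human hand targets and robot end-effector targets in the same camera frame is one plausible reason a camera-space representation helps make the two data sources comparable.
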
Community

SiyuanH (paper submitter) · Jun 17

Video: https://cdn-uploads.huggingface.co/production/uploads/634e4120038b5879133552f5/WHpefueER5NtjE00pF-fh.mp4

avahal · Jun 17

the unified camera-space action space is slick, and the reliability-aware loss is a clean guardrail for noisy signals. i’m curious how this plays out when egocentric reconstructions have systematic noise like occlusions or timing jitter: does the weighting auto-correct or still need tuning? it’d be great to see an ablation varying the noise profile or dropping some channels to stress-test the auto-weighting during pretraining. btw the arxivlens breakdown helped me parse the method details: https://arxivlens.com/PaperView/Details/ace-ego-0-unifying-egocentric-human-and-robotic-data-for-vla-pretraining-8825-7e203c20

noahml · Jun 17

Neat paper. The bridge between human video and robot actions has always been a pain point, so I'm interested to see how that reliability-aware training objective actually performs in practice. It makes a lot of sense to filter out the noise from those pseudo-labels rather than just throwing everything into the mix.

How does the model handle the inherent differences in temporal dynamics between human movement and robot trajectories when aligning these action chunks?

I made a podcast on it with ResearchPod, it makes it easy to get the key concepts on the go:
https://researchpod.app/episode/4526ec0a-584b-4dea-9efd-d759ba040fd8

Get this paper in your agent:

hf papers read 2606.17200
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
