Hugging Face Daily Papers · 4 min read

HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Papers
arxiv:2606.20521

Published on Jun 18, 2026 · Submitted by yfdeng (yfdeng10) on Jun 19, 2026

Authors: Juncheng Ma, Jianxin Bi, Yufan Deng, Xuanran Zhai, Kewei Zhang, Ye Huang, Bo Liang, Shukai Gong, Jiankai Tu, Xiaotian Tang, Jiaxin Li, Kaiqi Chen, Duomin Wang, Yuqi Wang, Bingyi Kang, Eric Huang, Zhiyang Dou, Zhen Dong, Enze Xie, Wojciech Matusik, Tat-Seng Chua, Daquan Zhou

Abstract

AI-generated summary (Qwen/Qwen2.5-Coder-32B-Instruct): Egocentric human video can effectively replace teleoperated robot trajectories for embodied model pretraining, achieving better performance with reduced data collection costs.

Embodied foundation models are expected to benefit from data scaling like large language models, but face a much tighter data bottleneck. Teleoperated real-robot trajectories remain the dominant pretraining source due to their precise action supervision and embodiment alignment, yet their scalability is limited by high collection cost, acquisition difficulty, and low behavioral and environmental diversity. These limitations have sparked interest in egocentric human video as a scalable, substantially lower-cost, and more diverse alternative for embodied model pretraining. However, its effectiveness compared to teleoperated real-robot data remains underexplored. To address this question, we conduct a systematic study comparing egocentric human video and teleoperated real-robot trajectories as pretraining data sources for embodied foundation models, under fixed post-training and validation protocols. Surprisingly, we find that egocentric data, when processed through a carefully designed filtering and labeling pipeline, is not merely a viable substitute for model pretraining but can lead to superior performance. With the same amount of pretraining data, models pretrained on egocentric data achieve a 24% lower validation loss on real-robot action prediction, as well as 52.5% and 90% higher success rates on in-distribution and out-of-distribution real-robot task execution, respectively. This finding verifies a scalable paradigm for embodied foundation models: pretrain on egocentric human video to learn diverse world representations, then adapt with a small amount of labeled real-robot data for action-space alignment. We hope this study encourages broader exploration of egocentric data and offers guidance for data quality assessment before costly robot data collection.
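
The abstract reports its gains as percentages ("24% lower validation loss", "52.5% and 90% higher success rates") without stating whether the success-rate figures are relative improvements or percentage-point differences. The sketch below only illustrates how the two readings differ. All input numbers are hypothetical placeholders chosen for the arithmetic, not results from the paper, and the function names are illustrative.

```python
def relative_change(baseline: float, candidate: float) -> float:
    """Signed change of `candidate` relative to `baseline` (negative means lower)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero for a relative change")
    return (candidate - baseline) / baseline


def absolute_change(baseline: float, candidate: float) -> float:
    """Plain difference; for success rates this is a percentage-point gap."""
    return candidate - baseline


# HYPOTHETICAL values, not taken from the paper.
robot_pretrain_loss = 0.50   # validation loss after robot-data pretraining
ego_pretrain_loss = 0.38     # validation loss after egocentric pretraining

robot_success = 0.40         # task success rate, robot-data pretraining
ego_success = 0.61           # task success rate, egocentric pretraining

loss_rel = relative_change(robot_pretrain_loss, ego_pretrain_loss)
succ_rel = relative_change(robot_success, ego_success)
succ_abs = absolute_change(robot_success, ego_success)

print(f"validation loss, relative change: {loss_rel:+.1%}")
print(f"success rate, relative change:    {succ_rel:+.1%}")
print(f"success rate, absolute change:    {succ_abs * 100:+.1f} points")
```

With these placeholder inputs, the same pair of success rates yields a much larger number under the relative reading than under the percentage-point reading, so the full paper should be checked for which convention its figures use.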

