Hugging Face Daily Papers · May 18, 2026 · 7 min read

OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.




arxiv:2605.12038

OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation

Published on May 12 · Submitted by QuanjianSong on May 18


Authors: Yiren Song, Xiyao Deng, Pei Yang, Yihan Wang, Mike Zheng Shou

Abstract

OmniHumanoid enables cross-embodiment video generation by factorizing motion transfer and embodiment-specific adaptation, allowing scalable adaptation to new humanoid embodiments using unpaired data.

AI-generated summary
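One way to picture the factorization in this summary is a shared layer whose weights stay frozen, plus a small adapter owned by each embodiment. The sketch below is illustrative only: the low-rank adapter form, the names `FactorizedLayer` and `LowRankAdapter`, the square shared matrix, and the hand-set weights that stand in for training on unpaired videos are all assumptions, not the released OmniHumanoid code.

```python
from dataclasses import dataclass, field
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


@dataclass
class LowRankAdapter:
    """Hypothetical per-embodiment adapter: delta(x) = up @ (down @ x)."""
    down: Matrix  # shape (rank, dim)
    up: Matrix    # shape (dim, rank)

    def delta(self, x: Vector) -> Vector:
        return matvec(self.up, matvec(self.down, x))


@dataclass
class FactorizedLayer:
    """Shared weights are never modified; each embodiment owns one adapter."""
    shared: Matrix  # assumed square (dim, dim) for simplicity
    adapters: Dict[str, LowRankAdapter] = field(default_factory=dict)

    def add_embodiment(self, name: str, rank: int = 1) -> None:
        dim = len(self.shared)
        # A zero-initialised "up" matrix makes a fresh adapter a no-op.
        down = [[0.1] * dim for _ in range(rank)]
        up = [[0.0] * rank for _ in range(dim)]
        self.adapters[name] = LowRankAdapter(down, up)

    def forward(self, x: Vector, embodiment: str) -> Vector:
        base = matvec(self.shared, x)
        delta = self.adapters[embodiment].delta(x)
        return [b + d for b, d in zip(base, delta)]


layer = FactorizedLayer(shared=[[1.0, 0.0], [0.0, 2.0]])
layer.add_embodiment("robot_a")
layer.add_embodiment("robot_b")

x = [1.0, 1.0]
before_a = layer.forward(x, "robot_a")

# Stand-in for adapting to robot_b from unpaired videos:
# only robot_b's adapter weights change, the shared matrix does not.
layer.adapters["robot_b"].up = [[0.5], [-0.5]]

after_a = layer.forward(x, "robot_a")
after_b = layer.forward(x, "robot_b")
```

In this toy setup, updating the adapter for `robot_b` leaves both the shared matrix and the output for `robot_a` unchanged, which is the property that would let a new embodiment be added without retraining the shared model.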

Cross-embodiment video generation aims to transfer motions across different humanoid embodiments, such as human-to-robot and robot-to-robot, enabling scalable data generation for embodied intelligence. A major challenge in this setting is that motion dynamics are partly transferable across embodiments, whereas appearance and morphology remain embodiment-specific. Existing approaches often entangle these factors, and many require paired data for every target embodiment, which limits scalability to new robots. We present OmniHumanoid, a framework that factorizes transferable motion learning and embodiment-specific adaptation. Our method learns a shared motion transfer model from motion-aligned paired videos spanning multiple embodiments, while adapting to a new embodiment using only unpaired videos through lightweight embodiment-specific adapters. To reduce interference between motion transfer and embodiment adaptation, we further introduce a branch-isolated attention design that separates motion conditioning from embodiment-specific modulation. In addition, we construct a synthetic cross-embodiment dataset with motion-aligned paired videos rendered across diverse humanoid assets, scenes, and viewpoints. Experiments on both synthetic and real-world benchmarks show that OmniHumanoid achieves strong motion fidelity and embodiment consistency, while enabling scalable adaptation to unseen humanoid embodiments without retraining the shared motion model.
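The abstract does not spell out how the branch-isolated attention is wired. As a hedged sketch, one plausible reading is an attention mask in which video tokens attend to both conditioning branches, while motion tokens and embodiment tokens never attend to each other. The group labels, the `allowed` rule, and the function name `branch_isolated_attention` below are hypothetical and chosen for illustration; the paper's actual design may differ.

```python
import math
from typing import List, Tuple

Vector = List[float]


def allowed(query_group: str, key_group: str) -> bool:
    """Assumed isolation rule: only video tokens may read from every branch."""
    if query_group == "video":
        return True
    return query_group == key_group


def branch_isolated_attention(
    queries: List[Vector],
    keys: List[Vector],
    values: List[Vector],
    groups: List[str],
) -> Tuple[List[Vector], List[Vector]]:
    """Scaled dot-product attention with a group-based mask.

    Returns the attended outputs and the attention weights for each query.
    """
    dim = len(queries[0])
    outputs: List[Vector] = []
    all_weights: List[Vector] = []
    for q, q_group in zip(queries, groups):
        # Disallowed pairs get a score of -inf, so their softmax weight is 0.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            if allowed(q_group, k_group)
            else -math.inf
            for k, k_group in zip(keys, groups)
        ]
        peak = max(scores)  # finite, because every token may attend to itself
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Two video tokens, one motion-condition token, one embodiment token.
groups = ["video", "video", "motion", "embodiment"]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
outputs, weights = branch_isolated_attention(tokens, tokens, tokens, groups)
```

Here `weights[2][3]` and `weights[3][2]` are exactly zero, so the motion and embodiment branches only interact through the video tokens, which still receive nonzero weight from every group.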


Community

QuanjianSong

Paper submitter about 23 hours ago

Cross-embodiment video generation aims to transfer motions across different humanoid embodiments, such as human-to-robot and robot-to-robot, enabling scalable data generation for embodied intelligence. A major challenge in this setting is that motion dynamics are partly transferable across embodiments, whereas appearance and morphology remain embodiment-specific. Existing approaches often entangle these factors, and many require paired data for every target embodiment, which limits scalability to new robots. We present OmniHumanoid, a framework that factorizes transferable motion learning and embodiment-specific adaptation. Our method learns a shared motion transfer model from motion-aligned paired videos spanning multiple embodiments, while adapting to a new embodiment using only unpaired videos through lightweight embodiment-specific adapters. To reduce interference between motion transfer and embodiment adaptation, we further introduce a branch-isolated attention design that separates motion conditioning from embodiment-specific modulation. In addition, we construct a synthetic cross-embodiment dataset with motion-aligned paired videos rendered across diverse humanoid assets, scenes, and viewpoints. Experiments on both synthetic and real-world benchmarks show that OmniHumanoid achieves strong motion fidelity and embodiment consistency, while enabling scalable adaptation to unseen humanoid embodiments without retraining the shared motion model.

Code is released at https://github.com/showlab/OmniHumanoid

librarian-bot

14 minutes ago

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

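The following papers were recommended by the Semantic Scholar API:

* Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing (2026), https://huggingface.co/papers/2605.03637
* Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation (2026), https://huggingface.co/papers/2604.24681
* HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation (2026), https://huggingface.co/papers/2604.07993
* LIDEA: Human-to-Robot Imitation Learning via Implicit Feature Distillation and Explicit Geometry Alignment (2026), https://huggingface.co/papers/2604.10677
* UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling (2026), https://huggingface.co/papers/2604.19734
* ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis (2026), https://huggingface.co/papers/2604.19720
* EgoSim: Egocentric World Simulator for Embodied Interaction Generation (2026), https://huggingface.co/papers/2604.01001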

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


