AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling
Authors: Yiheng Li, Zhuo Li, Ruibing Hou, Yingjie Chen, Hong Chang, Hao Liu, Shiguang Shan
Published: 2026-05-28 (arXiv:2605.29488)
Project page: https://huggingface.co/datasets/L-yiheng/OmniHuMo
Abstract
AI-generated summary: A unified multimodal framework for human motion generation that combines a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer to enable high-quality synthesis across arbitrary modality combinations.
Conditional human motion generation remains a fundamental challenge in computer vision and robotics. Despite significant progress, current methods are often constrained by fixed modality configurations and task-specific architectures, leaving cross-modal interactions and the scaling laws of multimodal-conditioned synthesis largely underexplored. A key bottleneck is the scarcity of large-scale modality-aligned motion data, limiting generalization across diverse control signals. In this work, we introduce OmniHuMo, a large-scale, high-quality dataset comprising over 5,000 hours of motion and 3.2 million sequences with precisely aligned multimodal annotations (e.g., text, speech, music, and trajectory). Leveraging OmniHuMo, we propose AnyMo, a unified multimodal framework combining a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, enabling high-quality motion synthesis under arbitrary modality combinations. Extensive experiments show that AnyMo achieves high-fidelity synthesis while offering flexible control over both spatial and stylistic attributes.
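The abstract names a Residual FSQ-based motion tokenizer without giving details. As a rough illustration of the general idea only, the sketch below applies residual finite scalar quantization to a single latent vector: each stage snaps the current residual onto a small uniform grid, and the next stage quantizes what is left over at a finer scale. The function names, the clamp-based bounding (in place of the tanh bounding commonly used for FSQ), the per-stage scale factor, and the choice of 5 levels and 3 stages are all assumptions made for this example, not the configuration used by AnyMo.

```python
def fsq_quantize(values, num_levels):
    """Snap each value to the nearest of `num_levels` evenly spaced points in [-1, 1].

    Returns (indices, quantized), where each index lies in [0, num_levels - 1].
    Assumes an odd `num_levels` so that 0 lies on the grid.
    """
    half = (num_levels - 1) // 2
    indices, quantized = [], []
    for value in values:
        bounded = max(-1.0, min(1.0, value))   # simplification: clamp instead of tanh
        step = round(bounded * half)           # integer grid step in [-half, half]
        indices.append(step + half)            # shift to a non-negative code index
        quantized.append(step / half)
    return indices, quantized


def residual_fsq(latent, num_levels=5, num_stages=3):
    """Quantize `latent` in several stages, each refining the previous residual.

    Hypothetical helper: returns one list of code indices per stage plus the
    reconstruction obtained by summing the rescaled stage outputs.
    """
    residual = list(latent)
    reconstruction = [0.0] * len(latent)
    codes = []
    scale = 1.0
    for _ in range(num_stages):
        indices, quantized = fsq_quantize([r / scale for r in residual], num_levels)
        contribution = [scale * q for q in quantized]
        codes.append(indices)
        reconstruction = [a + c for a, c in zip(reconstruction, contribution)]
        residual = [r - c for r, c in zip(residual, contribution)]
        scale /= (num_levels - 1)              # later stages work on a finer grid
    return codes, reconstruction


# Usage: three stages of 5-level codes for a toy 3-dimensional latent.
codes, reconstruction = residual_fsq([0.9, -0.33, 0.05])
print(codes)
print(reconstruction)
```

For inputs in [-1, 1], each stage in this sketch reduces the worst-case per-coordinate error by a factor of `num_levels - 1`, which is the usual motivation for stacking residual stages rather than enlarging a single codebook.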