Hugging Face Daily Papers · June 1, 2026 · 3 min read

AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Teaser figure: https://cdn-uploads.huggingface.co/production/uploads/66afcc9b3cbe4ea9a4b57df6/80mvSZqWErZ76SUGpz6Bb.png

Authors: Yiheng Li, Zhuo Li, Ruibing Hou, Yingjie Chen, Hong Chang, Hao Liu, Shiguang Shan

Project page: https://huggingface.co/datasets/L-yiheng/OmniHuMo

arxiv:2605.29488

Published on May 28

Submitted by Li YiHeng on Jun 1

Abstract

AI-generated summary: A unified multimodal framework for human motion generation that combines a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer to enable high-quality synthesis across arbitrary modality combinations.

Conditional human motion generation remains a fundamental challenge in computer vision and robotics. Despite significant progress, current methods are often constrained by fixed modality configurations and task-specific architectures, leaving cross-modal interactions and the scaling laws of multimodal-conditioned synthesis largely underexplored. A key bottleneck is the scarcity of large-scale modality-aligned motion data, limiting generalization across diverse control signals. In this work, we introduce OmniHuMo, a large-scale, high-quality dataset comprising over 5,000 hours of motion and 3.2 million sequences with precisely aligned multimodal annotations (e.g., text, speech, music, and trajectory). Leveraging OmniHuMo, we propose AnyMo, a unified multimodal framework combining a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, enabling high-quality motion synthesis under arbitrary modality combinations. Extensive experiments show that AnyMo achieves high-fidelity synthesis while offering flexible control over both spatial and stylistic attributes.
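The abstract names a Residual FSQ-based motion tokenizer but gives no implementation details. As a rough illustration of the general idea only, the sketch below shows a generic residual finite scalar quantizer in NumPy: each stage rounds the remaining residual onto a small uniform grid, and later stages use a finer scale. The function names, the number of stages, the level count, and the use of clipping to bound values are all assumptions made for this example; none of them come from the paper, and this is not AnyMo's tokenizer.

```python
import numpy as np

def fsq(x, levels=5):
    """Quantize each value to one of `levels` uniform grid points in [-1, 1].

    Assumption: values are bounded by clipping (a simplification chosen here).
    Returns integer codes and their dequantized values.
    """
    half = (levels - 1) / 2                      # odd `levels` gives a grid containing 0
    codes = np.round(np.clip(x, -1.0, 1.0) * half).astype(np.int64)
    return codes, codes / half

def residual_fsq(z, num_stages=3, levels=5):
    """Hypothetical residual FSQ: quantize, subtract, then repeat at a finer scale."""
    residual = np.asarray(z, dtype=np.float64)
    recon = np.zeros_like(residual)
    all_codes = []
    scale = 1.0
    for _ in range(num_stages):
        codes, dequant = fsq(residual / scale, levels)
        step = dequant * scale                   # this stage's contribution
        recon = recon + step
        residual = residual - step
        all_codes.append(codes)
        scale /= (levels - 1)                    # next stage covers the remaining error range
    return recon, all_codes

# Toy "motion latents": 4 frames with 8 channels each, values in [-1, 1].
z = np.random.default_rng(0).uniform(-1.0, 1.0, size=(4, 8))
recon, codes = residual_fsq(z, num_stages=3, levels=5)
max_err = np.max(np.abs(z - recon))
```

For inputs in [-1, 1], each stage in this sketch shrinks the worst-case error by a factor of `levels - 1`, so three stages with five levels leave an error of at most (1/4)^3. The integer codes per stage are what a masked modeling transformer would predict as discrete tokens, under the assumptions above.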

Get this paper in your agent:

hf papers read 2605.29488

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
