MotionVLA: Vision-Language-Action Model for Humanoid Motion
Authors: Nonghai Zhang, Siyu Zhai, Yanjun Li, Zeyu Zhang, Zhihan Yin, Yandong Guo, Boxin Shi, Hao Tang
Organization: Peking University
Published: 2026-06-13
Abstract
A dual-stream frequency tokenizer and autoregressive model are proposed to improve humanoid motion generation by separately encoding pose and physical dynamics, achieving better diversity and consistency compared to single-codebook approaches.
Generating realistic humanoid motion from scene images and text involves both low-frequency pose semantics and high-frequency physical dynamics. However, many existing methods tokenize motion with a single shared codebook, forcing heterogeneous motion signals into the same quantization space. Our frequency-domain analysis of human motion data reveals a clear mismatch between single-codebook quantization and motion statistics: five DCT coefficients capture 93% of joint-position energy but only 37% of joint-velocity energy, which can bias quantization toward pose statistics and under-represent high-frequency velocity components. A second challenge lies in adapting a standard autoregressive model so that it captures high-frequency physical signals in motion sequences. To address both issues, we propose DSFT, a dual-stream frequency tokenizer that separates motion into Base and Phys (physical) streams and compresses them independently with DCT truncation and BPE. Furthermore, we present MotionVLA, a Qwen3.5-based model that arranges Base and Phys tokens in a unified sequence, where Phys tokens are predicted after Base tokens. Experiments on HumanML3D and MBench show that, despite using a lightweight 2B backbone, MotionVLA reduces the Diversity gap to real data by over 50% on HumanML3D and improves Motion-Condition Consistency by 3.8% on MBench, supporting frequency-aware dual-stream decoupling as an effective formulation for autoregressive motion generation. Code: https://github.com/AIGeeksGroup/MotionVLA. Website: https://aigeeksgroup.github.io/MotionVLA.
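The energy mismatch described in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' analysis code: the trajectory is synthetic (a slow sine plus a small fast jitter, with made-up amplitudes), velocity is taken as a simple finite difference, and "five DCT coefficients" is assumed here to mean the five lowest-frequency coefficients of an orthonormal DCT-II. The helper names `dct_ortho` and `energy_fraction` are hypothetical.

```python
import math

def dct_ortho(x):
    """Orthonormal DCT-II of a 1-D sequence (preserves total energy)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def energy_fraction(x, k):
    """Share of total signal energy held by the first k DCT coefficients."""
    coeffs = dct_ortho(x)
    total = sum(c * c for c in coeffs)
    return sum(c * c for c in coeffs[:k]) / total if total > 0 else 0.0

# Synthetic single-joint trajectory (assumed values, for illustration only):
# a slow pose component plus a small high-frequency jitter.
n = 64
position = [
    math.sin(2 * math.pi * t / n) + 0.05 * math.sin(2 * math.pi * 12 * t / n)
    for t in range(n)
]
# Finite-difference velocity: differentiation amplifies the fast component.
velocity = [position[t + 1] - position[t] for t in range(n - 1)]

pos_frac = energy_fraction(position, 5)
vel_frac = energy_fraction(velocity, 5)
print(f"first 5 DCT coefficients: position {pos_frac:.2%}, velocity {vel_frac:.2%}")
```

On this toy signal the low-frequency coefficients hold a larger share of position energy than of velocity energy, because differencing scales each frequency component roughly in proportion to its frequency. The exact percentages depend entirely on the synthetic amplitudes and should not be compared with the 93% and 37% figures reported for real motion data.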