Hugging Face Daily Papers · 4 min read

SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.15178

Published on May 14 · Submitted by Haoyi Zhu on May 15
Authors: Haoyi Zhu, Haozhe Liu, Yuyang Zhao, Tian Ye, Junsong Chen, Jincheng Yu, Tong He, Song Han, Enze Xie

Abstract

SANA-WM is an efficient 2.6B-parameter world model that generates high-fidelity 720p videos with precise camera control, achieving industrial-level quality while significantly reducing computational requirements through hybrid linear attention, dual-branch camera control, two-stage generation, and a robust annotation pipeline.

AI-generated summary

We introduce SANA-WM, an efficient 2.6B-parameter open-source world model natively trained for one-minute generation, synthesizing high-fidelity, 720p, minute-scale videos with precise camera control. SANA-WM achieves visual quality comparable to large-scale industrial baselines such as LingBot-World and HY-WorldPlay, while significantly improving efficiency. Four core designs drive our architecture: (1) Hybrid Linear Attention combines frame-wise Gated DeltaNet (GDN) with softmax attention for memory-efficient long-context modeling. (2) Dual-Branch Camera Control ensures precise 6-DoF trajectory adherence. (3) Two-Stage Generation Pipeline applies a long-video refiner to stage-1 outputs, improving quality and consistency across sequences. (4) Robust Annotation Pipeline extracts accurate metric-scale 6-DoF camera poses from public videos to yield high-quality, spatiotemporally consistent action labels. Driven by these designs, SANA-WM demonstrates remarkable efficiency across data, training compute, and inference hardware: it uses only ~213K public video clips with metric-scale pose supervision, completes training in 15 days on 64 H100s, and generates each 60s clip on a single GPU; its distilled variant can be deployed on a single RTX 5090 with NVFP4 quantization to denoise a 60s 720p clip in 34s. On our one-minute world-model benchmark, SANA-WM demonstrates stronger action-following accuracy than prior open-source baselines and achieves comparable visual quality at 36× higher throughput for scalable world modeling.
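To make the Hybrid Linear Attention idea concrete, here is a minimal PyTorch sketch of a block that pairs a gated delta-rule recurrence (linear in sequence length, constant-size state) with softmax attention restricted to short local windows. This is only an illustration of the general technique, not the released SANA-WM code: the module names, gating parameterization, window size, and tensor shapes are all assumptions made for the example.

# Minimal sketch (assumed design, not the SANA-WM implementation):
# a linear-time gated delta-rule memory for global context, followed by
# softmax attention inside non-overlapping local windows.
import torch
import torch.nn.functional as F
from torch import nn


class GatedDeltaRecurrence(nn.Module):
    """Simplified gated delta-rule memory: O(T) time, O(1) state per sequence."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        # Per-step forget gate (alpha) and write strength (beta), both in (0, 1).
        self.alpha = nn.Linear(dim, 1)
        self.beta = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        B, T, D = x.shape
        q, k, v = self.q(x), F.normalize(self.k(x), dim=-1), self.v(x)
        alpha = torch.sigmoid(self.alpha(x))   # (B, T, 1)
        beta = torch.sigmoid(self.beta(x))     # (B, T, 1)
        S = x.new_zeros(B, D, D)               # running associative memory
        outs = []
        for t in range(T):
            k_t, v_t = k[:, t], v[:, t]
            S = alpha[:, t, :, None] * S                         # decay old memory
            pred = torch.einsum("bde,bd->be", S, k_t)            # value stored under k_t
            # Delta rule: overwrite only the part of the memory addressed by k_t.
            S = S + beta[:, t, :, None] * torch.einsum("bd,be->bde", k_t, v_t - pred)
            outs.append(torch.einsum("bde,bd->be", S, q[:, t]))  # read with the query
        return torch.stack(outs, dim=1)                           # (B, T, D)


class HybridBlock(nn.Module):
    """Linear-attention memory for long-range context + windowed softmax attention."""

    def __init__(self, dim: int, heads: int = 8, window: int = 16):
        super().__init__()
        self.window = window
        self.linear_attn = GatedDeltaRecurrence(dim)
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.linear_attn(self.norm1(x))
        # Softmax attention restricted to non-overlapping local windows.
        h = self.norm2(x)
        B, T, D = h.shape
        pad = (-T) % self.window
        h = F.pad(h, (0, 0, 0, pad)).reshape(-1, self.window, D)
        h, _ = self.local_attn(h, h, h, need_weights=False)
        h = h.reshape(B, T + pad, D)[:, :T]
        return x + h


if __name__ == "__main__":
    block = HybridBlock(dim=64)
    tokens = torch.randn(2, 48, 64)   # (batch, frame tokens, channels)
    print(block(tokens).shape)        # torch.Size([2, 48, 64])

The intent of such a split is that the recurrent branch carries information across the whole minute-long sequence at linear cost with a fixed-size state, while the windowed softmax branch keeps the quadratic attention term bounded by the window size rather than the full sequence length.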

Community

Paper submitter

A 2.6B open-source world model that turns one image and a camera trajectory into 720p, minute-long, controllable video on a single GPU. Project Page: https://nvlabs.github.io/Sana/WM/ Code: https://github.com/NVlabs/Sana/

Get this paper in your agent:

hf papers read 2605.15178
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2605.15178 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2605.15178 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2605.15178 in a Space README.md to link it from this page.

Collections including this paper 1
