SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

A 2.6B open-source world model that turns one image and a camera trajectory into 720p, minute-long, controllable video on a single GPU.

Project Page: https://nvlabs.github.io/Sana/WM/
Code: https://github.com/NVlabs/Sana/
Paper: https://arxiv.org/abs/2605.15178
Abstract
We introduce SANA-WM, an efficient 2.6B-parameter open-source world model natively trained for one-minute generation, synthesizing high-fidelity, 720p, minute-scale videos with precise camera control. SANA-WM achieves visual quality comparable to large-scale industrial baselines such as LingBot-World and HY-WorldPlay, while significantly improving efficiency. Four core designs drive our architecture: (1) Hybrid Linear Attention combines frame-wise Gated DeltaNet (GDN) with softmax attention for memory-efficient long-context modeling. (2) Dual-Branch Camera Control ensures precise 6-DoF trajectory adherence. (3) A Two-Stage Generation Pipeline applies a long-video refiner to stage-1 outputs, improving quality and consistency across sequences. (4) A Robust Annotation Pipeline extracts accurate metric-scale 6-DoF camera poses from public videos to yield high-quality, spatiotemporally consistent action labels. Driven by these designs, SANA-WM demonstrates remarkable efficiency across data, training compute, and inference hardware: it uses only ~213K public video clips with metric-scale pose supervision, completes training in 15 days on 64 H100s, and generates each 60s clip on a single GPU; its distilled variant can be deployed on a single RTX 5090 with NVFP4 quantization to denoise a 60s 720p clip in 34s. On our one-minute world-model benchmark, SANA-WM demonstrates stronger action-following accuracy than prior open-source baselines and achieves comparable visual quality at 36× higher throughput for scalable world modeling.

AI-generated summary

SANA-WM is an efficient 2.6B-parameter world model that generates high-fidelity 720p videos with precise camera control, achieving industrial-level quality while significantly reducing computational requirements through hybrid linear attention, dual-branch camera control, a two-stage generation pipeline, and a robust annotation pipeline.
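The first design, Hybrid Linear Attention, pairs a linear-complexity gated delta-rule recurrence (Gated DeltaNet) with standard softmax attention for memory-efficient long contexts. As a rough intuition pump, here is a minimal single-head PyTorch sketch of one plausible hybrid block: softmax attention restricted to tokens within a frame, plus a GDN-style state carried across frames. The module name, gating scheme, and fusion by addition are illustrative assumptions, not SANA-WM's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridLinearAttentionSketch(nn.Module):
    """Hypothetical hybrid block: softmax attention within each frame plus a
    gated delta-rule (DeltaNet-style) recurrence whose O(d^2) state carries
    compressed context across frames. Single-head, for illustration only."""

    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.gates = nn.Linear(dim, 2)  # per-token decay (alpha) and write (beta) gates
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, frame_len: int) -> torch.Tensor:
        # x: (T, dim), where T = num_frames * frame_len tokens.
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        alpha, beta = torch.sigmoid(self.gates(x)).unbind(-1)  # each (T,)

        d = x.shape[-1]
        S = x.new_zeros(d, d)  # cross-frame linear-attention state, fixed size
        outs = []
        for f in range(x.shape[0] // frame_len):
            sl = slice(f * frame_len, (f + 1) * frame_len)
            qf, kf, vf = q[sl], k[sl], v[sl]

            # 1) Softmax attention over this frame's tokens only (short, cheap).
            attn = F.scaled_dot_product_attention(
                qf.unsqueeze(0), kf.unsqueeze(0), vf.unsqueeze(0)
            ).squeeze(0)

            # 2) Gated delta-rule read/update against the cross-frame state.
            lin = []
            for t in range(frame_len):
                i = f * frame_len + t
                pred = S.T @ kf[t]  # value the state currently predicts for this key
                # Decay the state, erase the old association, write the new one.
                S = alpha[i] * (S - beta[i] * torch.outer(kf[t], pred)) \
                    + beta[i] * torch.outer(kf[t], vf[t])
                lin.append(S.T @ qf[t])  # linear-attention read with the query
            lin = torch.stack(lin)

            outs.append(attn + lin)  # fuse the local and global paths
        return self.proj(torch.cat(outs, dim=0))

# Toy usage: 4 frames of 16 tokens each, model width 64.
block = HybridLinearAttentionSketch(dim=64)
x = torch.randn(4 * 16, 64)
y = block(x, frame_len=16)  # (64, 64)
```

The appeal of this shape is that memory for cross-frame context stays constant (one d×d state) no matter how many frames are generated, which is what makes minute-scale contexts tractable.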
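The second design, Dual-Branch Camera Control, conditions generation on metric-scale 6-DoF camera poses. A hypothetical sketch of the most basic ingredient, turning each frame's pose (rotation plus translation) into one conditioning token, might look like the following; the encoder architecture and all names here are assumptions, and the dual-branch mechanism itself is not reproduced.

```python
import torch
import torch.nn as nn

class CameraPoseEncoderSketch(nn.Module):
    """Hypothetical encoder mapping a metric-scale 6-DoF camera trajectory
    (per-frame rotation matrix + translation) to one conditioning token per
    frame. Illustrative only; SANA-WM's camera branches may differ."""

    def __init__(self, dim: int):
        super().__init__()
        # 9 rotation entries + 3 translation entries per frame.
        self.mlp = nn.Sequential(
            nn.Linear(12, dim), nn.SiLU(), nn.Linear(dim, dim)
        )

    def forward(self, rot: torch.Tensor, trans: torch.Tensor) -> torch.Tensor:
        # rot: (F, 3, 3) per-frame rotations; trans: (F, 3) in meters.
        pose = torch.cat([rot.flatten(1), trans], dim=-1)  # (F, 12)
        return self.mlp(pose)                              # (F, dim) tokens

# Toy usage: 60 frames with identity rotation, dollying 5 m along +z.
enc = CameraPoseEncoderSketch(dim=64)
rot = torch.eye(3).expand(60, 3, 3)
trans = torch.stack([torch.zeros(60), torch.zeros(60),
                     torch.linspace(0.0, 5.0, 60)], dim=-1)
tokens = enc(rot, trans)  # (60, 64), one conditioning token per frame
```

Metric-scale poses (as opposed to arbitrarily scaled ones) matter here because the annotation pipeline's labels and the user's trajectories must share units for the model to follow speeds and distances faithfully.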