Hugging Face Daily Papers · 4 min read

Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.15141


Published on May 14
· Submitted by
Zhu Hongzhou
on May 15
#2 Paper of the day
Authors: Min Zhao, Hongzhou Zhu, Kaiwen Zheng, Zihan Zhou, Bokai Yan, Xinyuan Li, Xiao Yang, Chongxuan Li, Jun Zhu

Abstract

A novel causal consistency distillation method enables efficient frame-wise video generation with reduced latency and improved quality compared to existing chunk-wise approaches.

AI-generated summary

Real-time interactive video generation requires low-latency, streaming, and controllable rollout. Existing autoregressive (AR) diffusion distillation methods have achieved strong results in the chunk-wise 4-step regime by distilling bidirectional base models into few-step AR students, but they remain limited by coarse response granularity and non-negligible sampling latency. In this paper, we study a more aggressive setting: frame-wise autoregression with only 1–2 sampling steps. In this regime, we identify the initialization of a few-step AR student as the key bottleneck: existing strategies are either target-misaligned, incapable of few-step generation, or too costly to scale. We propose Causal Forcing++, a principled and scalable pipeline that uses causal consistency distillation (causal CD) for few-step AR initialization. The core idea is that causal CD learns the same AR-conditional flow map as causal ODE distillation, but obtains supervision from a single online teacher ODE step between adjacent timesteps, avoiding the need to precompute and store full PF-ODE trajectories. This makes the initialization both more efficient and easier to optimize. The resulting pipeline, Causal Forcing++, surpasses the SOTA 4-step chunk-wise Causal Forcing under the frame-wise 2-step setting by 0.1 in VBench Total, 0.3 in VBench Quality, and 0.335 in VisionReward, while reducing first-frame latency by 50% and Stage 2 training cost by roughly 4×. We further extend the pipeline to action-conditioned world model generation in the spirit of Genie3. Project Page: https://github.com/thu-ml/Causal-Forcing and https://github.com/shengshu-ai/minWM .
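The supervision signal the abstract describes (one online teacher ODE step between adjacent timesteps, with no precomputed PF-ODE trajectories) can be illustrated with a toy 1-D consistency-distillation loop. Everything below — the linear flow, the single data point, the scalar student parameter, the step sizes — is an illustrative assumption for the sketch, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
x0, eps = 2.0, -1.0            # toy 1-D data point and its noise endpoint

def teacher_velocity(x, t):
    # PF-ODE velocity for a linear flow with a single data point at x0:
    # x_t = (1 - t) * x0 + t * eps  =>  dx/dt = (x_t - x0) / t
    return (x - x0) / t

def student(x, t, g):
    # Few-step student flow map f(x_t, t) -> x0, parameterised by a scalar g.
    return x - t * g

g, lr, d = 0.0, 0.5, 0.01      # student parameter, learning rate, timestep gap
for _ in range(2000):
    t = rng.uniform(0.1, 1.0)
    x_t = (1 - t) * x0 + t * eps            # a point on the teacher trajectory
    # ONE online teacher Euler step t -> t - d; nothing is stored offline:
    x_prev = x_t - d * teacher_velocity(x_t, t)
    target = student(x_prev, t - d, g)      # stop-gradient consistency target
    pred = student(x_t, t, g)
    # Gradient of the squared consistency loss (pred - target)^2 w.r.t. g
    grad = 2.0 * (pred - target) * (-t)
    g -= lr * grad

# The learned flow map recovers x0 from any point on the trajectory.
print(round(g, 3))                                             # -> -3.0
print(round(student((1 - 0.7) * x0 + 0.7 * eps, 0.7, g), 3))   # -> 2.0
```

The student never sees a full trajectory: each update compares its output at `t` against its own output one teacher Euler step earlier, which is why the abstract contrasts this with methods that must precompute and store full PF-ODE trajectories.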

Community

Paper submitter

Try our model here: https://huggingface.co/zhuhz22/Causal-Forcing
And the full-stack open-source code: https://github.com/thu-ml/Causal-Forcing
We release a 2-step frame-wise AR model with 50% of the latency and even better quality compared to 4-step chunk-wise models!
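The 50% latency claim follows from a simple step count, under the assumption (mine, not stated in the comment) that per-step network cost is comparable for the two distilled students. A back-of-the-envelope sketch with a hypothetical per-step time:

```python
# First-frame latency scales with the number of denoising steps that must
# finish before the first frame can be shown. The per-step time below is a
# made-up placeholder; only the ratio matters.
step_time_ms = 25.0

chunk_wise_steps = 4   # 4-step chunk-wise: a whole chunk denoised before frame 1
frame_wise_steps = 2   # 2-step frame-wise: one frame, two steps

chunk_latency = chunk_wise_steps * step_time_ms
frame_latency = frame_wise_steps * step_time_ms

print(frame_latency / chunk_latency)   # -> 0.5, the reported 50% reduction
```

In practice a chunk-wise step processes more frames per forward pass than a frame-wise step, so the real ratio depends on the architecture; the abstract's measured number is the 50% first-frame latency reduction.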


Get this paper in your agent:

hf papers read 2605.15141
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2605.15141 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2605.15141 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2605.15141 in a Space README.md to link it from this page.

Collections including this paper 1

Discussion (0)

