Author note (zhuhz22): Try our model here: https://huggingface.co/zhuhz22/Causal-Forcing
And the full-stack open-source code: https://github.com/thu-ml/Causal-Forcing
We release a 2-step frame-wise AR model with 50% lower latency and even better quality compared to 4-step chunk-wise models!
Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation
Abstract
AI-generated summary: A novel causal consistency distillation method enables efficient frame-wise video generation with reduced latency and improved quality compared to existing chunk-wise approaches.
Real-time interactive video generation requires low-latency, streaming, and controllable rollout. Existing autoregressive (AR) diffusion distillation methods have achieved strong results in the chunk-wise 4-step regime by distilling bidirectional base models into few-step AR students, but they remain limited by coarse response granularity and non-negligible sampling latency. In this paper, we study a more aggressive setting: frame-wise autoregression with only 1–2 sampling steps. In this regime, we identify the initialization of a few-step AR student as the key bottleneck: existing strategies are either target-misaligned, incapable of few-step generation, or too costly to scale. We propose Causal Forcing++, a principled and scalable pipeline that uses causal consistency distillation (causal CD) for few-step AR initialization. The core idea is that causal CD learns the same AR-conditional flow map as causal ODE distillation, but obtains supervision from a single online teacher ODE step between adjacent timesteps, avoiding the need to precompute and store full PF-ODE trajectories. This makes the initialization both more efficient and easier to optimize. The resulting pipeline, Causal Forcing++, surpasses the SOTA 4-step chunk-wise Causal Forcing under the frame-wise 2-step setting by 0.1 in VBench Total, 0.3 in VBench Quality, and 0.335 in VisionReward, while reducing first-frame latency by 50% and Stage 2 training cost by about 4×. We further extend the pipeline to action-conditioned world model generation in the spirit of Genie3. Project Page: https://github.com/thu-ml/Causal-Forcing and https://github.com/shengshu-ai/minWM
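The contrast the abstract draws — causal ODE distillation needs full precomputed PF-ODE teacher trajectories, while causal CD gets its supervision from a single online teacher step between adjacent timesteps — can be illustrated with a toy one-dimensional sketch. This is a minimal illustration, not the paper's implementation: the names `teacher_v`, `student_f`, and `euler_step` are assumptions, the real method operates on AR-conditional video latents, and a real teacher step would be a learned diffusion model call.

```python
def euler_step(v, x, t, t_next):
    """One explicit Euler step of the probability-flow ODE dx/dt = v(x, t),
    moving from timestep t to the adjacent timestep t_next (t_next < t)."""
    return x + (t_next - t) * v(x, t)

def ode_distillation_target(teacher_v, x, ts):
    """Causal ODE distillation (the costly baseline): integrate the FULL
    teacher trajectory along the timestep grid ts (noise -> data) and
    supervise the student on the endpoint. The whole trajectory must be
    precomputed and stored, which is what causal CD avoids."""
    for t, t_next in zip(ts[:-1], ts[1:]):
        x = euler_step(teacher_v, x, t, t_next)
    return x

def causal_cd_targets(teacher_v, student_f, x_t, t, t_next):
    """Causal CD: a single ONLINE teacher step between adjacent timesteps.
    The student flow map student_f(x, t) should be self-consistent:
    student_f(x_t, t) == student_f(x_next, t_next), where x_next comes from
    one teacher Euler step. Training would penalize the gap between the two
    returned values (with a stop-gradient on one side); no trajectory storage
    is needed."""
    x_next = euler_step(teacher_v, x_t, t, t_next)
    return student_f(x_t, t), student_f(x_next, t_next)
```

With a constant toy velocity field `teacher_v = lambda x, t: 2.0`, the exact flow map to t = 0 is `student_f = lambda x, t: x - 2.0 * t`; plugging these in, `causal_cd_targets` returns two equal values, showing that a perfectly trained student makes the consistency residual vanish using only one adjacent-timestep teacher step.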
arXiv: https://arxiv.org/abs/2605.15141