Hugging Face Daily Papers · June 10, 2026 · 4 min read

Next Forcing: Causal World Modeling with Multi-Chunk Prediction

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Authors: Gangwei Xu, Qihang Zhang, Jiaming Zhou, Xing Zhu, Yujun Shen, Xin Yang, Yinghao Xu

Project page: https://gangweix.github.io/next-forcing/
GitHub: https://github.com/gangweix/next-forcing

arxiv:2606.11187

Published on Jun 9

Submitted by Gangwei Xu on Jun 10

Abstract

Next Forcing introduces a multi-chunk prediction framework that accelerates training and inference for autoregressive video generation while improving accuracy and physical law adherence.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Autoregressive video generation has emerged as a powerful paradigm for World Action Models (WAMs). However, existing approaches suffer from slow training convergence and limited converged accuracy, particularly at high frame rates, as the training supervision is confined to the current chunk without explicit signals about future dynamics; they also suffer from slow inference due to iterative video denoising. In this paper, we present Next Forcing, a multi-chunk prediction (MCP) framework for causal world modeling that enables faster training, higher accuracy, and accelerated inference. Inspired by multi-token prediction in large language models, Next Forcing introduces an MCP training objective that augments the main model with lightweight auxiliary MCP modules to simultaneously denoise video chunks at multiple future temporal horizons (next^1, next^2, next^3 chunks). These MCP modules form a causal chain across prediction depths, where intermediate features fused from multiple layers of the main model are leveraged to predict future dynamics, allowing near-future predictions to inform farther-future ones and providing dense multi-scale temporal supervision back to the main model. During training, the MCP modules significantly accelerate convergence and improve converged accuracy, especially at high frame rates: at 50 fps, Next Forcing achieves a 93.1% relative improvement over LingBot-VA at 5k training steps and 2.3x faster convergence, and establishes new state-of-the-art results on the RoboTwin benchmark (94.1/93.5% on Clean/Random). At inference, the MCP modules can be retained to predict the next video chunk in parallel with the current one, achieving 2x inference acceleration. Next Forcing also demonstrates significant improvements on PhyWorld, a benchmark evaluating adherence to physical laws in video generation, and over 50% FVD reduction on general video pretraining.
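
The causal chain of MCP modules described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the feature width, the mean-based layer fusion, the `tanh` update, the mean-squared-error loss, and the auxiliary weight `AUX_WEIGHT` are all assumptions chosen only to make the data flow concrete. What it tries to show is the part the abstract does state: features fused from several main-model layers enter the first MCP module, each deeper module receives the previous module's hidden state, and every depth contributes its own denoising loss.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16            # hypothetical feature width
TOKENS = 4        # hypothetical tokens per video chunk
NUM_DEPTHS = 3    # next^1, next^2, next^3 chunks, as in the abstract
AUX_WEIGHT = 0.3  # hypothetical weight on the auxiliary MCP losses


def linear(d_in, d_out):
    """Random matrix standing in for a learned linear layer."""
    return rng.normal(scale=d_in ** -0.5, size=(d_in, d_out))


def fuse_layers(layer_feats):
    """Fuse features from several main-model layers (here a plain mean)."""
    return np.mean(np.stack(layer_feats, axis=0), axis=0)


def mcp_losses(layer_feats, noisy_future, targets, mcp_weights):
    """Run the chain of MCP modules; return one denoising loss per depth."""
    hidden = fuse_layers(layer_feats)
    losses = []
    for (w_in, w_out), noisy, target in zip(mcp_weights, noisy_future, targets):
        # Each depth consumes the previous depth's hidden state together with
        # the noisy chunk it must denoise, so near predictions feed far ones.
        hidden = np.tanh(np.concatenate([hidden, noisy], axis=-1) @ w_in)
        pred = hidden @ w_out
        losses.append(float(np.mean((pred - target) ** 2)))
    return losses


# Random stand-ins for main-model features and for future-chunk data.
layer_feats = [rng.normal(size=(TOKENS, D)) for _ in range(3)]
noisy_future = [rng.normal(size=(TOKENS, D)) for _ in range(NUM_DEPTHS)]
targets = [rng.normal(size=(TOKENS, D)) for _ in range(NUM_DEPTHS)]
mcp_weights = [(linear(2 * D, D), linear(D, D)) for _ in range(NUM_DEPTHS)]

main_loss = 1.0  # placeholder for the main model's current-chunk loss
aux_losses = mcp_losses(layer_feats, noisy_future, targets, mcp_weights)
total_loss = main_loss + AUX_WEIGHT * sum(aux_losses)
print(aux_losses, total_loss)
```

In a real training setup the auxiliary losses would be backpropagated through the fused features into the main model, which is presumably how the "dense multi-scale temporal supervision" reaches it; the sketch only computes the forward pass.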

Community

gangweix

Paper submitter about 4 hours ago

We propose Next Forcing, a multi-chunk prediction framework that overcomes the myopic supervision problem of autoregressive video world models. Next Forcing establishes new state-of-the-art results on the RoboTwin benchmark (94.1/93.5% on Clean/Random) with significantly faster training convergence.
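
One way to picture the "myopic supervision" mentioned above is to count how many target chunks each position in a clip is trained against. The sketch below is a reading of the abstract rather than the paper's code: it assumes position `t` is supervised by chunk `t` alone in the baseline, and by chunks `t` through `t + 3` (clipped at the end of the clip) under multi-chunk prediction. The clip length `NUM_CHUNKS` and both helper functions are hypothetical.

```python
def supervised_chunks(t, num_chunks, num_future):
    """Indices of the chunks that provide a training target at position t."""
    last = min(t + num_future, num_chunks - 1)
    return list(range(t, last + 1))


def count_signals(num_chunks, num_future):
    """Total number of (position, target chunk) supervision pairs."""
    return sum(len(supervised_chunks(t, num_chunks, num_future))
               for t in range(num_chunks))


NUM_CHUNKS = 8  # hypothetical clip length in chunks

single = count_signals(NUM_CHUNKS, num_future=0)  # current chunk only
multi = count_signals(NUM_CHUNKS, num_future=3)   # plus next^1..next^3
print(single, multi)  # prints: 8 26
```

Under these assumptions an 8-chunk clip yields 26 supervision pairs instead of 8, which gives a rough sense of how much denser the training signal becomes; the actual gain reported in the paper depends on details this count ignores.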

Get this paper in your agent:

hf papers read 2606.11187

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
