MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation
Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.
Abstract
Video generative models achieve improved long-range consistency through coarse-to-fine token generation using a multi-scale autoencoder and diffusion model architecture.
Video generative models have become increasingly powerful, but long-range consistency remains challenging to achieve because even a few dozen frames require impractically long transformer sequence lengths. We show that this issue can be mitigated by generating video using coarse-to-fine rollout within a multi-scale token space. Our approach is simple: first, we pre-train an autoencoder that compresses each frame into a hierarchy of tokens, with levels ranging from the typical latent resolution to only a handful of tokens per frame. The coarsest levels capture the most consequential information, such as scene layout and semantics, while finer levels add high-frequency appearance and texture. Then, we train a video diffusion model to generate these tokens using coarse-to-fine rollout. By carefully controlling the level of detail at which frames are generated and used as context during each rollout step, we are able to preserve long-range consistency in geometry and object permanence while spending less compute on the long-range consistency of less perceptually relevant details. We validate this approach using a custom dataset of long Minecraft videos, where it produces substantially more consistent rollouts compared to existing baselines.
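To make the sequence-length argument concrete, the following minimal sketch compares the context size when every past frame is kept at full detail with the size when distant frames are kept only at coarse levels. It is an illustration under stated assumptions, not the authors' implementation: the per-level token counts, the distance-based schedule in `level_for_distance`, and the choice to keep all levels up to the selected one are hypothetical and do not come from the paper.

```python
# Hypothetical token counts per frame at each hierarchy level, coarsest first.
# These numbers are illustrative assumptions, not values from the paper.
TOKENS_PER_LEVEL = [4, 16, 64, 256]


def level_for_distance(distance: int) -> int:
    """Finest level kept for a context frame `distance` frames in the past.

    Assumed schedule: nearby frames keep full detail, distant frames keep
    only coarse tokens.
    """
    if distance <= 2:
        return 3
    if distance <= 8:
        return 2
    if distance <= 32:
        return 1
    return 0


def frame_tokens(finest_level: int) -> int:
    """Tokens for one frame when levels 0..finest_level are all kept (assumption)."""
    return sum(TOKENS_PER_LEVEL[: finest_level + 1])


def context_length(num_frames: int, hierarchical: bool) -> int:
    """Total context tokens contributed by `num_frames` past frames."""
    finest = len(TOKENS_PER_LEVEL) - 1
    total = 0
    for distance in range(1, num_frames + 1):
        level = level_for_distance(distance) if hierarchical else finest
        total += frame_tokens(level)
    return total


flat = context_length(64, hierarchical=False)
mixed = context_length(64, hierarchical=True)
print(flat, mixed)  # 21760 1792
```

Under these assumed numbers, a 64-frame context shrinks from 21,760 tokens to 1,792, while every frame still contributes its coarsest tokens, the level the abstract associates with scene layout and semantics.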