Hugging Face Daily Papers · June 5, 2026 · 3 min read

Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Authors: Tianxiang Jiang, Linquan Wu, Sheng Xia, Songze Li, Ziang Yan, Haoyu Yang, Yu Qiao, Yi Wang

Code: https://github.com/OpenGVLab/Future-L1

arxiv:2606.05769

Published on Jun 4 · Submitted by Eurayka on Jun 5 · OpenGVLab

Abstract

AI-generated summary (Qwen/Qwen2.5-Coder-32B-Instruct): Future-L1, an interleaved latent visual reasoning framework, improves video event prediction by maintaining visual semantics in latent space during autoregressive decoding, achieving state-of-the-art results on FutureBench and TwiFF-Bench.

Video event prediction (VEP) requires models to infer unobserved future states from partial video evidence. Existing video MLLMs usually verbalize intermediate future reasoning in text space: once visual evidence is verbalized, fine-grained motion, geometry, and interaction cues can be lost, leading to plausible but visually ungrounded hallucinations. We introduce Future-L1, an interleaved latent visual reasoning framework that lets an MLLM alternate between language tokens and continuous latent visual spans during autoregressive decoding. To train this capability, we construct Future-L1-50K by selecting examples where future visual hints help prediction, align latent states to future-frame embeddings, and then further optimize sampled latent trajectories with LA-DAPO, a latent-aware RL objective with outcome-contrastive and temporal-diversity rewards. Future-L1 achieves new state-of-the-art results on both benchmarks: on FutureBench, it improves Qwen3-VL-8B from 61.0 to 85.4 and exceeds the previous best Video-CoE by 10.4 points; on TwiFF-Bench, it improves the average score from 2.44 to 3.04. These results suggest that future-oriented video reasoning benefits from preserving intermediate visual semantics in latent space rather than translating every reasoning step into text.
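To make the idea of an interleaved trace concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the class names (TextToken, LatentStep), the two-dimensional vectors, and the choice of mean cosine distance as the alignment loss are all assumptions made for illustration. The abstract only states that latent states are aligned to future-frame embeddings, without specifying the loss.

```python
import math
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextToken:
    """A discrete language token in the reasoning trace (hypothetical type)."""
    text: str


@dataclass
class LatentStep:
    """A continuous latent vector that is never decoded to a word (hypothetical type)."""
    vector: List[float]


Step = Union[TextToken, LatentStep]


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def alignment_loss(trace: List[Step], future_frame_embeddings: List[List[float]]) -> float:
    """Mean (1 - cosine similarity) between latent steps and target embeddings.

    ASSUMPTION: cosine distance is used here only as a stand-in; the source
    does not say which alignment loss Future-L1 uses.
    """
    latents = [step.vector for step in trace if isinstance(step, LatentStep)]
    if not latents:
        raise ValueError("trace contains no latent steps")
    if len(latents) != len(future_frame_embeddings):
        raise ValueError("need one target embedding per latent step")
    distances = [
        1.0 - cosine_similarity(z, e)
        for z, e in zip(latents, future_frame_embeddings)
    ]
    return sum(distances) / len(distances)


# A toy trace: text tokens, then a two-step latent span, then text again.
trace: List[Step] = [
    TextToken("The"), TextToken("cup"), TextToken("tips"),
    LatentStep([1.0, 0.0]), LatentStep([0.0, 1.0]),
    TextToken("so"), TextToken("it"), TextToken("spills"),
]

# Made-up stand-ins for embeddings of two future frames.
targets = [[1.0, 0.0], [1.0, 1.0]]

loss = alignment_loss(trace, targets)
print(f"alignment loss: {loss:.4f}")
```

The point of the sketch is only the data layout: text tokens and latent vectors share one sequence, and only the latent positions are supervised against future-frame targets. The reinforcement-learning stage (LA-DAPO) is not modeled here.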

Community

Eurayka · Paper submitter · about 9 hours ago

Future-L1 nails video event prediction by letting MLLMs interleave text reasoning with latent visual "imagination" of future frames.

Get this paper in your agent:

hf papers read 2606.05769

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
