Hugging Face Daily Papers · 4 min read

LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.25979

Published on May 25, 2026 · Submitted by xiangan on May 27, 2026

Authors: Xiang An, Yin Xie, Feilong Tang, Yunyao Yan, Huajie Tan, Didi Zhu, Changrui Chen, Xiuwei Zhao, Bin Qin, Kaicheng Yang, Yifei Shen, Yuanhan Zhang, Kaichen Zhang, Wenkang Zhang, Zheng Cheng, Nansen Zhang, Chunsheng Wu, Chunjiang Ge, Zimin Ran, Dehua Song, Chunyuan Li, Shikun Feng, Ming Hu, Zhangquan Chen, Junbo Niu, Bo Li, Ziyong Feng, Ziwei Liu, Zongyuan Ge, Jiankang Deng

Project page: https://evolvinglmms-lab.github.io/LLaVA-OneVision-2/
Code: https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-2

AI-generated summary

LLaVA-OneVision-2 achieves superior multimodal performance through codec-stream tokenization, windowed attention, and large-scale open supervision across video understanding, temporal grounding, and tracking tasks.

Abstract

We introduce LLaVA-OneVision-2 (LLaVA-OV-2), the most capable vision-language model in the LLaVA-OneVision series to date, achieving superior performance across a broad range of multimodal benchmarks. The model builds on a native OneVision-Encoder and incorporates Windowed Attention for efficient local computation while maintaining native resolution. Its key advance is codec-stream tokenization: it treats compressed video as a continuous bit-cost stream, where bit-cost dynamics determine adaptive temporal groups, and motion-residual cues select salient spatial evidence into compact visual canvases. This allocation concentrates a limited token budget on event-bearing content, enabling more stable long-video token compression than fixed groups of pictures. A shared 3D RoPE further places codec canvases, sampled frames, and images in a unified spatiotemporal coordinate system. Furthermore, we build the LLaVA-OV-2 data and training stack around large-scale open supervision: approximately 8M re-captioned video samples for pretraining and a 4M-sample spatial corpus for fine-tuning. We also introduce JumpScore, a temporal-localization benchmark targeting fine-grained grounding in high-frequency, densely repeated motion, a regime underrepresented by existing video evaluations. A standout capability of LLaVA-OV-2 is its unified perception across video understanding, temporal grounding, spatial grounding, and manipulation-trace reasoning. On JumpScore, LLaVA-OneVision-2-8B reaches 74.9 mAP, surpassing Qwen3-VL-8B (30.1) by +44.8 points; under matched visual-token budgets on the same benchmark, codec-stream inputs improve temporal grounding over frame sampling by +9.7 points. Across standard benchmarks, LLaVA-OneVision-2-8B further outperforms Qwen3-VL-8B by +4.3 average points on video tasks, +5.3 on spatial tasks, and +15.6 average J&F on tracking tasks.
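The abstract does not spell out how bit-cost dynamics are turned into adaptive temporal groups, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a per-frame bit cost is already available (for example, the compressed packet size of each frame) and splits the frames into contiguous groups that each carry a roughly equal share of the total bit cost. The function name `adaptive_groups`, the midpoint binning rule, and the sample numbers are all hypothetical.

```python
from typing import List, Tuple


def adaptive_groups(bit_costs: List[float], num_groups: int) -> List[Tuple[int, int]]:
    """Partition frames into contiguous groups of roughly equal bit cost.

    Hypothetical illustration only: each frame is assigned to the bin that
    contains the midpoint of its span on the cumulative bit-cost axis, and
    consecutive frames sharing a bin form one group.
    Returns half-open (start, end) frame-index ranges.
    """
    if num_groups <= 0:
        raise ValueError("num_groups must be positive")
    if not bit_costs:
        return []
    total = float(sum(bit_costs))
    if total <= 0:
        # No bit-cost signal at all: fall back to a single group.
        return [(0, len(bit_costs))]

    target = total / num_groups  # bit-cost share each group should carry
    groups: List[Tuple[int, int]] = []
    start = 0
    prev_bin = None
    cumulative = 0.0
    for i, cost in enumerate(bit_costs):
        midpoint = cumulative + cost / 2.0
        bin_idx = min(int(midpoint // target), num_groups - 1)
        if prev_bin is not None and bin_idx != prev_bin:
            groups.append((start, i))  # close the previous group
            start = i
        prev_bin = bin_idx
        cumulative += cost
    groups.append((start, len(bit_costs)))
    return groups


# Made-up costs: a quiet stretch, two expensive (event-bearing) frames, quiet again.
costs = [10, 10, 10, 10, 200, 200, 10, 10, 10, 10]
print(adaptive_groups(costs, num_groups=4))
```

With these made-up costs, the quiet stretches collapse into long groups while each expensive frame gets a group of its own, which is the qualitative behavior the abstract contrasts with fixed groups of pictures. With uniform costs the same function degenerates to evenly sized groups.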

Community

Paper submitter about 7 hours ago

The next generation of fully-open multimodal training — pushing the boundary of recipe transparency, native-resolution understanding, and end-to-end reproducibility.


Get this paper in your agent:

hf papers read 2605.25979
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
