Hugging Face Daily Papers · June 12, 2026 · 5 min read

HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Authors: Guozhen Zhang, Xuerui Qiu, Yutao Cui, Tianhui Song, Changlin Li, Junzhe Li, Tao Huang, Xiao Zhang, Yang Li, Jianbing Wu, Miles Yang, Zhao Zhong, Liefeng Bo, Limin Wang

arxiv:2606.13289

Published on Jun 11 · Submitted by Guozhen Zhang on Jun 12 · Nanjing University

Abstract

AI-generated summary (Qwen/Qwen2.5-Coder-32B-Instruct): HYDRA-X presents a unified multimodal model that integrates image and video tokenization within a single Vision Transformer, addressing spatiotemporal reconstruction and semantic awareness through causal temporal attention and hierarchical compression.

Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) as they map diverse visual inputs into a unified representation space. In this paper, we present HYDRA-X, the first UMM that unifies image and video tokenization within a single Vision Transformer (ViT). Our design is driven by two core challenges: efficiently injecting spatiotemporal reconstruction capability into a native ViT, and embedding image- and video-level semantic awareness into the latent space. To address the first, comprehensive ablations reveal two key findings: (1) frame-level causal temporal attention suffices for visual reconstruction, whereas full spatiotemporal attention degrades it; and (2) hierarchical temporal compression substantially outperforms single-step alternatives. To tackle the second, we propose a lightweight decompressor that upsamples temporally compressed features under joint image-video teacher supervision, thereby enforcing complementary semantic structures within the compact latent space. Building on this holistic tokenizer, we further propose a principled improvement of the editing pipeline: source-target interaction should occur at the latent level inside the tokenizer rather than at the semantic level inside the LLM, substantially improving editing consistency and accelerating convergence. Instantiated as a 7B dense model, HYDRA-X achieves strong performance across image and video understanding and generation tasks, paving the way for future unified-tokenizer UMMs.
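To make the first ablation finding concrete, here is a minimal sketch of what a frame-level causal attention mask could look like. It is an illustration under assumptions, not the authors' implementation: the function name `frame_causal_mask`, the toy sizes, and the choice of full attention within each frame are all hypothetical, and the abstract does not specify the exact mask HYDRA-X uses.

```python
from typing import List


def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> List[List[bool]]:
    """Build a block-causal mask over a flattened (frame, token) sequence.

    mask[q][k] is True when query token q may attend to key token k, i.e.
    when k belongs to the same frame as q or to an earlier frame.
    """
    total = num_frames * tokens_per_frame
    mask = []
    for q in range(total):
        q_frame = q // tokens_per_frame
        mask.append([(k // tokens_per_frame) <= q_frame for k in range(total)])
    return mask


# Toy sizes (assumed for illustration): 3 frames, 2 tokens per frame.
mask = frame_causal_mask(num_frames=3, tokens_per_frame=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# Full spatiotemporal attention would allow all 36 query-key pairs;
# this frame-causal mask allows 24 of them.
allowed_pairs = sum(sum(row) for row in mask)
```

With a single frame the mask is all True, so an image passes through exactly as it would in an ordinary ViT. That is one plausible reason a causal design keeps an image-pretrained prior intact, although the abstract itself only reports the empirical result.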

Community

zgzaacm · Paper submitter · Jun 12

🔱 Meet HYDRA-X — a 7B native unified multimodal model where one ViT-based tokenizer drives 5 tasks: image/video understanding, image/video generation, and image editing.

Three core contributions:

🎯 Less attention is more. Local causal tubelet attention + hierarchical temporal patchify preserve the image-pretrained prior far better than full spatiotemporal mixing or single-step compression.

🌉 Compressed latents, full-rate semantics. A lightweight training-time Decompressor lets image+video teachers supervise temporally compressed latents — no extra cost at inference.

✨ Editing as length-2 video. Source–target alignment happens inside the tokenizer via the same causal pathway used for video — no extra modules, no extra parameters.
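The "length-2 video" idea can be illustrated with a toy example. The sketch below assumes the source image is placed first and the target second, and it replaces learned softmax attention with a uniform average over visible tokens. The names `edit_pair_mask` and `masked_mean` are hypothetical, and none of this is taken from the HYDRA-X code. It only shows the property a causal pathway provides: source tokens are encoded independently of the target, while target tokens can read the source.

```python
from typing import List


def edit_pair_mask(tokens_per_image: int) -> List[List[bool]]:
    """Causal mask for a 2-frame sequence: frame 0 = source, frame 1 = target."""
    n = tokens_per_image
    mask = []
    for q in range(2 * n):
        q_is_target = q >= n
        # Source keys (k < n) are visible to everyone; target keys only to target queries.
        mask.append([(k < n) or q_is_target for k in range(2 * n)])
    return mask


def masked_mean(values: List[float], mask: List[List[bool]]) -> List[float]:
    """Uniform-attention stand-in: each query averages the values it may see."""
    out = []
    for row in mask:
        visible = [v for v, ok in zip(values, row) if ok]
        out.append(sum(visible) / len(visible))
    return out


n = 2
mask = edit_pair_mask(n)
source = [1.0, 3.0]

# Same source, two different targets (toy scalar "tokens").
out_a = masked_mean(source + [10.0, 20.0], mask)
out_b = masked_mean(source + [-5.0, 7.0], mask)

print(out_a[:n] == out_b[:n])  # source outputs are unaffected by the target
print(out_a[n:], out_b[n:])    # target outputs mix in the source tokens
```

Because the mask has the same shape as a two-frame video mask, no editing-specific module is needed in this toy setup, which matches the "no extra modules, no extra parameters" claim in spirit.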

Across all 5 tasks, HYDRA-X delivers strong results at the 7B dense scale — laying a solid foundation and offering practical insights for future unified-tokenizer UMM research. 🚀
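For readers unfamiliar with the distinction in the first contribution, the following sketch contrasts a hierarchical temporal schedule with a single-step one by tracking how many frames remain after each stage. The frame count and the per-stage factors are assumed values chosen for illustration, since the post does not state the ones HYDRA-X uses, and `temporal_schedule` is a hypothetical helper.

```python
from typing import List


def temporal_schedule(num_frames: int, factors: List[int]) -> List[int]:
    """Return the temporal length after each compression stage.

    The first entry is the input length; each factor divides the current length.
    """
    counts = [num_frames]
    for factor in factors:
        if factor < 1 or counts[-1] % factor != 0:
            raise ValueError(f"cannot compress {counts[-1]} frames by {factor}")
        counts.append(counts[-1] // factor)
    return counts


# Assumed example: 16 input frames, 4x total temporal compression.
hierarchical = temporal_schedule(16, [2, 2])  # two 2x stages
single_step = temporal_schedule(16, [4])      # one 4x stage

print(hierarchical)
print(single_step)
```

Both schedules end at the same temporal length. The difference is that the hierarchical one passes through an intermediate resolution, so each stage has to merge fewer frames at once.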
