Hugging Face Daily Papers · June 2, 2026 · 4 min read

EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Authors: Zongyuan Yang, Mingjing Yi, Wanli Ma, Chenzhuo Fan, Bocheng Li, Baolin Liu, Yuke Lou, Yingde Song, Yongping Xiong, Zhengdong Guo, Shimu Wang

arxiv:2605.16745

Published on May 16

Submitted by SeeleAI on Jun 2 · SEELE AI

Abstract

AI-generated summary: EVA01 enables native 3D mesh integration in multimodal language models through a Mixture-of-Transformers architecture that aligns semantic and geometric manifolds for improved generation and editing capabilities.

This paper addresses the challenge of integrating 3D meshes as a native modality within Multimodal Large Language Models (MLLMs). Diffusion-based large reconstruction models decouple semantic understanding from geometric reasoning, operating as stateless reconstructors conditioned on dense 2D pixel priors. Recent MLLM-based methods treat the 3D modality as an external output rather than a native component of the multimodal sequence, making incremental adaptations without a systematic analysis of how geometric manifolds align with MLLM feature spaces. We introduce EVA01, a unified framework that extends the modality boundary of MLLMs to natively incorporate 3D mesh understanding, generation, and context-aware editing. Built upon a Mixture-of-Transformers (MoT) architecture, EVA01 decouples the model into a pre-trained Understanding Expert ($E_{und}$) and a structurally mirrored Generation Expert ($E_{gen}$), coupled through shared global self-attention with hard modality routing. This design aligns the semantic latent space of the MLLM backbone with the geometric manifold, enabling direct transfer of multimodal priors without intermediate 2D representations. Results show that EVA01 achieves state-of-the-art native text-to-3D generation fidelity and unlocks robust long-context multi-turn geometric editing with identity preservation, a capability fundamentally inaccessible to stateless reconstruction pipelines. Our findings further offer architectural insights for integrating 2D foundation models with 3D tasks, informing the design of 3D-native multimodal systems.

Project Page: https://www.seeles.ai/research/pages/EVA01
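To make "shared global self-attention with hard modality routing" concrete, here is a minimal NumPy sketch of one generic Mixture-of-Transformers layer. It is not the EVA01 implementation: the abstract gives no layer details, so the single attention head, the absence of normalization and masking, the ReLU feed-forward block, the random weights, and every name (`init_expert`, `route`, `mot_layer`) are assumptions made for illustration. Expert 0 stands in for the Understanding Expert and expert 1 for the Generation Expert.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_expert(rng, d_model, d_ff):
    """Random weights for one expert (hypothetical shapes, no biases)."""
    s = 1.0 / np.sqrt(d_model)
    return {
        "wq": rng.normal(0.0, s, (d_model, d_model)),
        "wk": rng.normal(0.0, s, (d_model, d_model)),
        "wv": rng.normal(0.0, s, (d_model, d_model)),
        "wo": rng.normal(0.0, s, (d_model, d_model)),
        "w1": rng.normal(0.0, s, (d_model, d_ff)),
        "w2": rng.normal(0.0, 1.0 / np.sqrt(d_ff), (d_ff, d_model)),
    }


def route(x, modality, experts, key):
    """Hard routing: each token is projected only by its own expert's weights."""
    out = np.zeros((x.shape[0], experts[0][key].shape[1]))
    for idx, expert in enumerate(experts):
        mask = modality == idx
        out[mask] = x[mask] @ expert[key]
    return out


def mot_layer(x, modality, experts):
    """One layer: per-expert projections, attention over the whole sequence."""
    d_model = x.shape[1]
    q = route(x, modality, experts, "wq")
    k = route(x, modality, experts, "wk")
    v = route(x, modality, experts, "wv")
    # Shared global self-attention: every token attends to every other token,
    # regardless of which expert produced its query, key, and value.
    attn = softmax(q @ k.T / np.sqrt(d_model)) @ v
    h = x + route(attn, modality, experts, "wo")
    # Feed-forward block, again routed per modality.
    hidden = np.maximum(route(h, modality, experts, "w1"), 0.0)
    return h + route(hidden, modality, experts, "w2")


rng = np.random.default_rng(0)
experts = [init_expert(rng, 8, 16), init_expert(rng, 8, 16)]  # [E_und, E_gen]
x = rng.normal(size=(5, 8))            # 5 tokens, model width 8
modality = np.array([0, 0, 1, 1, 0])   # 0 = text token, 1 = mesh token
y = mot_layer(x, modality, experts)
print(y.shape)
```

In this sketch the two experts never share weights, so information crosses between text and mesh tokens only inside the attention step. That is one plausible reading of how a pre-trained understanding model could condition a separate generation expert without an intermediate 2D representation.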

Community

Asukakoko

Paper submitter about 3 hours ago

EVA01 is a unified framework that extends the modality boundary of MLLMs to natively incorporate 3D mesh understanding, generation, and context-aware editing.

Get this paper in your agent:

hf papers read 2605.16745

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
