Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification
Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.
Abstract
UniAR presents a unified autoregressive framework that uses a single discrete visual tokenizer to bridge visual understanding and generation, achieving state-of-the-art results in image generation and editing through multi-level feature fusion, bitwise quantization, and parallel prediction.
Unified Multimodal Modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two disparate visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressive framework where a single discrete visual tokenizer serves as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this, the unified autoregressive model adopts parallel-bitwise-prediction to jointly predict spatially grouped, multi-level visual codes, substantially reducing visual sequence length and accelerating generation. Finally, a diffusion-based visual decoder operates on discrete visual tokens to decode high-fidelity images. Through large-scale pre-training, followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks. The project page is available at https://sharelab-sii.github.io/uniar-web.
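The abstract does not spell out the quantizer or the grouping scheme, so the following is only a minimal sketch of the two general ideas it names, under stated assumptions. It assumes that "lookup-free bitwise quantization" means thresholding each latent channel by sign, as in generic lookup-free quantizers, and that "spatially grouped" means non-overlapping square blocks of the token grid. The function names (quantize_bits, bits_to_index, group_positions) and all sizes are hypothetical and are not taken from UniAR.

```python
from typing import List, Tuple


def quantize_bits(latent: List[float]) -> List[int]:
    """Binarize each channel by sign (assumed scheme): 1 if positive, else 0.

    No codebook is searched, which is what makes the scheme "lookup-free".
    """
    return [1 if value > 0 else 0 for value in latent]


def bits_to_index(bits: List[int]) -> int:
    """Read the bit vector as an integer token id (first bit is least significant)."""
    return sum(bit << position for position, bit in enumerate(bits))


def group_positions(height: int, width: int, group: int) -> List[List[Tuple[int, int]]]:
    """Split an H x W token grid into non-overlapping group x group blocks.

    Blocks are listed in row-major order. If a model predicts one block per
    step, the number of steps equals the number of blocks.
    """
    if height % group or width % group:
        raise ValueError("grid size must be divisible by the group size")
    blocks = []
    for row in range(0, height, group):
        for col in range(0, width, group):
            blocks.append(
                [(row + dr, col + dc) for dr in range(group) for dc in range(group)]
            )
    return blocks


# Toy example with a 4-channel latent (illustrative values only).
latent = [0.7, -1.2, 0.1, -0.3]
bits = quantize_bits(latent)
token_id = bits_to_index(bits)

# With d binary channels the implicit vocabulary has 2**d entries,
# yet no table of that size is ever stored.
vocab_size = 2 ** len(latent)

# A 4 x 4 grid grouped into 2 x 2 blocks: 16 positions become 4 steps.
blocks = group_positions(4, 4, 2)
steps_without_grouping = 4 * 4
steps_with_grouping = len(blocks)
```

In this toy setting, adding one channel doubles the implicit vocabulary while adding only one bit to predict, and grouping cuts the number of autoregressive steps by the square of the group size. How UniAR actually combines multi-level codes within a group is not described in the abstract and is not modeled here.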