Hugging Face Daily Papers · June 10, 2026 · 3 min read

ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.




arxiv:2606.11188


Published on Jun 9 · Submitted by Junke Wang on Jun 10


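Authors: Junke Wang, Xiao Wang, Jiacheng Pan, Xuefeng Hu, Feng Li, Jingxiang Sun, Chaorui Deng, Zilong Chen, Yunpeng Chen, Kaibin Tian, Matthew Gwilliam, Hao Chen, Danhui Guan, Kun Xu, Weilin Huang, Zuxuan Wu, Haoqi Fan, Yu-Gang Jiang, Zhenheng Yang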

Abstract

ARM demonstrates a unified autoregressive framework for image understanding, generation, and editing through discrete semantic tokenization and reinforcement learning optimization.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

This paper introduces ARM, a discrete representation-based AutoRegressive Model that unifies image understanding, generation, and editing within a next-token prediction framework. ARM is built on three efforts: first, we train a discrete semantic visual tokenizer that maps images into compact token sequences. Our tokenizer is supervised with multiple objectives that jointly promote semantic discriminability, language alignment and faithful reconstruction, thereby supporting diverse tasks in a shared latent space. With this, we train a 7B autoregressive model over large-scale text and image token sequences, seamlessly developing vision-language perception and generation capabilities. Finally, to further improve preference-aligned behavior for text-to-image generation and instruction-guided editing, ARM applies reinforcement learning (RL) to optimize task-level objectives such as visual quality, instruction adherence, and edit consistency. Surprisingly, the results show that RL not only substantially improves performance on the target tasks (e.g., raising WISE overall from 0.50 to 0.56, GEdit-Bench-EN G_O from 5.75 to 6.68), but also induces cross-task synergy between text-to-image generation and editing. Collectively, these findings highlight autoregressive modeling, when paired with strong representations and preference optimization, as a scalable foundation for multimodal intelligence. Code: https://github.com/wdrink/ARM.
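The abstract describes mapping images to discrete tokens and training one autoregressive model over mixed text and image token sequences. As a rough illustration of that general idea only, and not the authors' implementation, the sketch below quantizes toy patch features against a toy codebook (plain nearest-neighbor lookup, swapped in here for the paper's learned semantic tokenizer), places the resulting codes in a shared id space, and lists the (context, target) pairs a next-token objective would train on. Every name, vocabulary size, and marker id is a hypothetical choice made for this example.

```python
from typing import List, Sequence, Tuple

# All sizes and token ids below are hypothetical, chosen only for illustration.
TEXT_VOCAB_SIZE = 1000                               # assumed text vocabulary size
BOI, EOI = TEXT_VOCAB_SIZE, TEXT_VOCAB_SIZE + 1      # assumed image boundary markers
IMAGE_OFFSET = TEXT_VOCAB_SIZE + 2                   # image codes sit after text + markers


def quantize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
            for f in features]


def build_sequence(text_ids: List[int], image_codes: List[int]) -> List[int]:
    """Concatenate text ids and offset image codes into one token stream."""
    return text_ids + [BOI] + [IMAGE_OFFSET + c for c in image_codes] + [EOI]


def next_token_pairs(seq: List[int]) -> List[Tuple[List[int], int]]:
    """List the (context, target) pairs a next-token loss would be computed on."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]


# Toy data: three 2-D "patch features" and a three-entry codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
patch_features = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8]]

codes = quantize(patch_features, codebook)
sequence = build_sequence([5, 17, 42], codes)
pairs = next_token_pairs(sequence)

print(codes)
print(sequence)
print(len(pairs))
```

Because text and image tokens share one id space in this sketch, the same prediction loop covers both directions: text followed by image tokens resembles generation, and image tokens followed by text resembles understanding. How ARM actually lays out its vocabulary and sequences is not stated in the abstract.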


Community

Daniel0724

Paper submitter about 15 hours ago

Technical report: a discrete autoregressive model that unifies image generation, editing, and understanding.


Get this paper in your agent:

hf papers read 2606.11188

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
