Hugging Face Daily Papers · 3 min read

Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.21487

Published on May 20, 2026 · Submitted by taesiri on May 21, 2026
Authors: Dian Zheng, Manyuan Zhang, Hongyu Li, Hongbo Liu, Kai Zou, Kaituo Feng, Hongsheng Li

Project page: https://zhengdian1.github.io/Uni-Edit-proj/
Code: https://github.com/zhengdian1/Uni-Edit

AI-generated summary

Uni-Edit introduces an intelligent image editing task that simultaneously enhances unified multimodal models' understanding, generation, and editing capabilities through a single training stage and dataset, utilizing an automated data synthesis pipeline for complex editing instructions.

Abstract

Currently, enhancing Unified Multimodal Models (UMMs) with image understanding, generation, and editing capabilities mainly relies on mixed multi-task training. Due to inherent task conflicts, such a strategy requires complex multi-stage pipelines, massive data mixing, and balancing tricks, and yields only a performance trade-off rather than true mutual reinforcement. To break this paradigm, we propose Uni-Edit, an intelligent image editing task that serves as the first general task for UMM tuning. Unlike complex mixed pipelines, Uni-Edit improves performance across all three abilities at once using only one task, one training stage, and one dataset. Specifically, we first identify image editing as an inherently ideal general task, as it naturally demands both visual understanding and generation. However, existing editing data relies on simplistic instructions that severely underutilize a model's understanding capacity. To address this, we introduce the first automated and scalable data synthesis pipeline for intelligent editing, transforming diverse VQA data into complex and effective editing instructions with embedded questions and nested logic. This yields Uni-Edit-148k, which pairs diverse reasoning-intensive instructions with high-quality edited images. Extensive experiments on BAGEL and Janus-Pro demonstrate that tuning solely on Uni-Edit achieves comprehensive enhancements across all three capabilities without any auxiliary operations.
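
To make the idea of an editing instruction with an embedded question more concrete, here is a minimal Python sketch. It is not the authors' pipeline: the record types `VQASample` and `EditSample`, the function `vqa_to_edit_instruction`, the `{target}` template placeholder, and the example data are all hypothetical names chosen for illustration. The sketch only shows one plausible way a VQA question could be folded into an instruction so that the edit target has to be inferred from the image rather than read off the prompt.

```python
from dataclasses import dataclass


@dataclass
class VQASample:
    """Hypothetical VQA record: an image, a question about it, and the answer."""
    image_path: str
    question: str
    answer: str


@dataclass
class EditSample:
    """Hypothetical editing record derived from a VQA sample."""
    image_path: str
    instruction: str
    resolved_target: str  # the answer the model must infer before it can edit


def vqa_to_edit_instruction(sample: VQASample, edit_template: str) -> EditSample:
    """Embed the VQA question in an editing instruction.

    `edit_template` must contain a `{target}` placeholder. The answer is kept
    out of the instruction text, so the edit target is only known once the
    question has been answered from the image.
    """
    # Normalize the question so it ends with exactly one question mark.
    question = sample.question.strip().rstrip("?") + "?"
    target = f'the object that answers the question "{question}"'
    instruction = edit_template.format(target=target)
    return EditSample(sample.image_path, instruction, sample.answer)


# Illustrative usage with made-up data.
sample = VQASample(
    image_path="kitchen.jpg",
    question="Which appliance keeps food cold?",
    answer="refrigerator",
)
edit = vqa_to_edit_instruction(sample, "Change the color of {target} to red.")
print(edit.instruction)
```

In this sketch the answer is stored alongside the instruction rather than inside it, which is one way to keep the reasoning step implicit in the prompt while still recording what a correct edit should act on. The nested logic mentioned in the abstract would need more than a single template substitution.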

Community

Paper submitter about 11 hours ago

Uni-Edit is a novel training task for Unified Multimodal Models that integrates image understanding and generation into a single pipeline to achieve simultaneous performance improvements.


Models citing this paper 1

Datasets citing this paper 1
