Hugging Face Daily Papers · 5 min read

Semantic Generative Tuning for Unified Multimodal Models

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Organization: Shanghai Jiao Tong University (SJTU) · Project page: https://song2yu.github.io/SGT/ · Code: https://github.com/song2yu/SGT
arxiv:2605.18714

Published on May 18, 2026 · Submitted by Songsong Yu on May 20, 2026
Authors: Songsong Yu, Yuxin Chen, Ying Shan, Yanwei Li

Abstract

AI-generated summary: Generative post-training with semantic segmentation as a proxy enhances multimodal alignment and performance in unified models.

Unified multimodal models (UMMs) strive to consolidate visual understanding and visual generation within a single architecture. However, prevailing training paradigms optimize the two independently: understanding via sparse text signals and generation through dense pixel objectives. Such a decoupled strategy yields misaligned representation spaces, isolating visual understanding from generation and hindering their mutual reinforcement. This work presents the first systematic investigation into generative post-training, where we formulate hierarchical visual tasks as generative proxies to bridge this isolation in UMMs. Our empirical investigation reveals that high-level semantic tasks, particularly image segmentation, serve as optimal proxies. Unlike low-level tasks that distract models with texture details, segmentation provides structural semantics that significantly enhance both vision-centric perception and generative layout fidelity. Building upon these insights, we introduce Semantic Generative Tuning (SGT), a novel paradigm that leverages segmentation as a generative proxy to align and synergize multimodal capabilities. Mechanistic analyses further demonstrate that SGT fundamentally improves the linear separability of features and optimizes visual-textual attention allocation patterns. Extensive evaluations show that SGT consistently improves both multimodal comprehension and generative fidelity across mainstream benchmarks. Our code is available at https://song2yu.github.io/SGT/.
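
The abstract's claim about linear separability of features is the kind of property usually checked with a linear probe: freeze the features, fit a linear classifier on top, and read its held-out accuracy as a separability score. The sketch below illustrates that general technique only. It is not the authors' evaluation code, the abstract does not say how separability was measured, and the synthetic two-cluster features and every function name here (`linear_probe_accuracy`, `synthetic_features`, `probe`) are hypothetical stand-ins for real model activations.

```python
import numpy as np


def linear_probe_accuracy(train_x, train_y, test_x, test_y, num_classes, ridge=1e-3):
    """Fit a ridge-regularised linear classifier on frozen features; return test accuracy."""

    def with_bias(x):
        # Append a constant column so the probe can learn an offset.
        return np.hstack([x, np.ones((x.shape[0], 1))])

    design = with_bias(train_x)
    one_hot = np.eye(num_classes)[train_y]
    # Closed-form ridge regression: W = (A^T A + ridge * I)^-1 A^T Y
    gram = design.T @ design + ridge * np.eye(design.shape[1])
    weights = np.linalg.solve(gram, design.T @ one_hot)
    predictions = (with_bias(test_x) @ weights).argmax(axis=1)
    return float((predictions == test_y).mean())


def synthetic_features(rng, n, separation):
    """Hypothetical stand-in for model features: two 2-D Gaussian clusters."""
    labels = rng.integers(0, 2, size=n)
    centers = np.array([[-separation, 0.0], [separation, 0.0]])
    return centers[labels] + rng.normal(size=(n, 2)), labels


def probe(separation, seed=0):
    """Train and test a linear probe on synthetic features with the given cluster gap."""
    rng = np.random.default_rng(seed)
    train_x, train_y = synthetic_features(rng, 1000, separation)
    test_x, test_y = synthetic_features(rng, 1000, separation)
    return linear_probe_accuracy(train_x, train_y, test_x, test_y, num_classes=2)


if __name__ == "__main__":
    print("well-separated features:", probe(separation=3.0))
    print("overlapping features:   ", probe(separation=0.1))
```

In practice one would replace `synthetic_features` with activations extracted from the same layer of a model before and after post-training, and compare the two probe accuracies on identical labels; a higher score after tuning would be consistent with improved linear separability.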


