Hugging Face Daily Papers · 7 min read

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.30265

Published on May 28, 2026 · Submitted by Zhixiong Zhang (SII) on May 29, 2026
Authors: Feng Han, Zhixiong Zhang, Zheming Liang, Yibin Wang, Jiaqi Wang
Organization: Fudan University

Abstract

AI-generated summary: Vision-language models suffer from modality sensitivity due to training data bias, but a new data curation approach called Local Modality Substitution improves cross-modal representation alignment and reasoning performance.

Vision-Language Models (VLMs) have achieved substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-text training aimed at multimodal fusion. Ideally, replacing a textual question with its rendered-image counterpart should leave model performance essentially unaffected. In practice, however, such modality substitution induces dramatic performance degradation. We attribute this "carrier sensitivity" issue to an inherent bias in current training corpora. Across prevalent datasets such as image captioning, VQA, OCR, and web-sourced interleaved data, text and images are typically organized into distinct and asymmetric roles, with text serving as linguistic queries and images as visual references. Such data bias leads VLMs to exhibit distinct preferences for information acquisition across different modalities. Consequently, VLMs fail to align representations of semantically equivalent content across textual and visual carriers, making model reasoning fragile under modality substitution. To address this, we propose Local Modality Substitution (LoMo), a lightweight, architecture-agnostic data curation paradigm designed to provide supervision for cross-modal representational invariance between semantically equivalent text and image carriers. LoMo achieves this by reformulating single-modality prompts into seamlessly interleaved multimodal sequences. It dynamically selects target text spans and recasts them as rendered images, thereby preserving the same semantics across "text, visual, text" carriers. Extensive experiments across 13 diverse multimodal benchmarks demonstrate that LoMo significantly improves overall multimodal reasoning and yields deeper cross-modal fusion. Specifically, it delivers consistent gains across foundational models, improving over standard SFT by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B.
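
The substitution step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how spans are selected or rendered, so the sketch assumes a random contiguous word span, a hypothetical span_ratio parameter, and a placeholder render function that stands in for real text rasterization. The names substitute_span, ImageSegment, and carried_text are illustrative only.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Union


@dataclass
class ImageSegment:
    """Placeholder for a rendered image; keeps the source text for inspection."""
    source_text: str


Segment = Union[str, ImageSegment]


def placeholder_render(text: str) -> ImageSegment:
    # Hypothetical stand-in: a real pipeline would rasterize `text` to pixels here.
    return ImageSegment(source_text=text)


def substitute_span(
    prompt: str,
    span_ratio: float = 0.3,  # assumed fraction of words to move into the image carrier
    rng: Optional[random.Random] = None,
    render: Callable[[str], ImageSegment] = placeholder_render,
) -> List[Segment]:
    """Replace one contiguous word span of `prompt` with a rendered-image segment."""
    rng = rng or random.Random()
    words = prompt.split()
    if not words:
        return []
    # Clamp the span length to [1, len(words)].
    span_len = max(1, min(len(words), round(len(words) * span_ratio)))
    start = rng.randint(0, len(words) - span_len)
    end = start + span_len

    # Build the interleaved "text, visual, text" sequence, skipping empty text parts.
    segments: List[Segment] = []
    if start > 0:
        segments.append(" ".join(words[:start]))
    segments.append(render(" ".join(words[start:end])))
    if end < len(words):
        segments.append(" ".join(words[end:]))
    return segments


def carried_text(segments: List[Segment]) -> str:
    """Recover the full prompt text, regardless of which carrier holds each part."""
    return " ".join(
        s.source_text if isinstance(s, ImageSegment) else s for s in segments
    )


if __name__ == "__main__":
    question = "How many apples remain if Tom eats three of the eight apples?"
    mixed = substitute_span(question, rng=random.Random(0))
    print(mixed)
    # The semantics are unchanged; only the carrier of one span differs.
    assert carried_text(mixed) == question
```

In this sketch the supervised target would stay the same as in the text-only sample, which matches the abstract's claim that only the data changes and not the architecture. Swapping placeholder_render for a function that draws the span onto an image would turn the sequence into an actual interleaved multimodal input.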

Community

Zhixiong Zhang (SII) (paper author, paper submitter) · May 29, 2026

LoMo is a lightweight, architecture-agnostic data curation paradigm that mitigates the cross-carrier modality gap in VLMs by recasting selected text spans into rendered images, turning standard SFT into an implicit cross-modal alignment objective without architectural changes or inference overhead.

Project Page: https://maplebb.github.io/LoMo/page/
GitHub: https://github.com/Maplebb/LoMo

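This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings (2026): https://huggingface.co/papers/2604.19902
* Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models (2026): https://huggingface.co/papers/2605.18160
* Bridging the Missing-Modality Gap: Improving Text-Only Calibration of Vision Language Models (2026): https://huggingface.co/papers/2605.12517
* From Pixels to Words -- Towards Native One-Vision Models at Scale (2026): https://huggingface.co/papers/2605.28820
* UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs (2026): https://huggingface.co/papers/2605.11856
* MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models (2026): https://huggingface.co/papers/2604.12537
* Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap (2026): https://huggingface.co/papers/2604.16256

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can ask Librarian Bot for paper recommendations directly by tagging it in a comment: @librarian-bot recommend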

Get this paper in your agent:

hf papers read 2605.30265
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
