Hugging Face Daily Papers · May 25, 2026 · 3 min read

See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.




arxiv:2605.18018


Published on May 18

Submitted by Boyuan Sun on May 25

TongyiLab


Authors:

Boyuan Sun, Bowen Yin, Yuanming Li, Xihan Wei, Qibin Hou

Abstract

SWIM is a training approach that aligns vision and language representations for fine-grained object understanding using only textual prompts by addressing cross-modal attention misalignment through mask supervision and a new dataset.

AI-generated summary

We present SWIM (See What I Mean), a novel training strategy that aligns vision and language representations to enable fine-grained object understanding solely from textual prompts. Unlike existing approaches that require explicit visual prompts, such as masks or points, SWIM uses mask supervision only during training to guide cross-modal attention, allowing the model to automatically attend to the user-specified object at inference. Our cross-attention analysis of pretrained multimodal large language models (MLLMs) reveals a systematic discrepancy: attribute words produce sharp, localized activations in the visual modality, whereas object nouns yield diffuse and scattered patterns due to semantic reference bias and distributed high-level representations. To address this misalignment, we construct NL-Refer, an enriched dataset in which each object mask is paired with a precise natural language referring expression. SWIM extracts multi-layer cross-attention maps from object nouns and enforces spatial consistency with ground-truth masks. Experimental results demonstrate that SWIM substantially improves text-visual alignment and achieves superior performance over visual-prompt-based methods on fine-grained object understanding benchmarks. The code and data are available at https://github.com/HumanMLLM/SWIM.
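The abstract does not spell out the loss SWIM uses, so the following is only a minimal sketch of what "enforcing spatial consistency" between an object noun's cross-attention and a ground-truth mask could look like. It assumes each layer yields one flattened attention map over the visual tokens for the noun token, averages those maps across layers, and penalizes attention mass that falls outside the mask. The function names `average_layers` and `mask_alignment_loss`, the layer averaging, and the negative-log form are all illustrative assumptions, not the authors' implementation.

```python
import math
from typing import List, Sequence


def average_layers(attn_maps: Sequence[Sequence[float]]) -> List[float]:
    """Average per-layer attention maps (one flattened map per layer)."""
    n_layers = len(attn_maps)
    n_tokens = len(attn_maps[0])
    return [sum(layer[i] for layer in attn_maps) / n_layers
            for i in range(n_tokens)]


def mask_alignment_loss(
    attn_maps: Sequence[Sequence[float]],
    mask: Sequence[int],
    eps: float = 1e-8,
) -> float:
    """Hypothetical loss: negative log of the share of attention inside the mask.

    attn_maps: per-layer attention weights of the object-noun token over
               the visual tokens (assumed layout, one list per layer).
    mask:      binary ground-truth mask over the same visual tokens.
    """
    avg = average_layers(attn_maps)
    total = sum(avg) + eps                       # guard against all-zero maps
    inside = sum(a for a, m in zip(avg, mask) if m)
    return -math.log(inside / total + eps)       # near 0 when attention is on the object


# Toy example with 4 visual tokens; the object occupies token 1.
mask = [0, 1, 0, 0]
focused = mask_alignment_loss([[0.0, 1.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]], mask)
diffuse = mask_alignment_loss([[0.25] * 4, [0.25] * 4], mask)
print(f"focused={focused:.4f} diffuse={diffuse:.4f}")
```

In this toy setup the diffuse map, which mirrors the scattered object-noun activations described above, receives a larger loss than the focused one, so minimizing it would push the noun's attention toward the masked region. Because the mask appears only in the loss, it is needed during training but not at inference, matching the text-only prompting described in the abstract.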
