InstructSAM: Segment Any Instance with Any Instructions
Authors: Yuqian Yuan, Wentong Li, Zhaocheng Li, Yutong Lin, Juncheng Li, Siliang Tang, Jun Xiao, Yueting Zhuang, Wenqiao Zhang
Published: 2026-05-25
arXiv: 2605.26102
Code: https://github.com/DCDmllm/InstructSAM
Abstract
AI-generated summary: InstructSAM presents a unified framework for multi-instance segmentation using instruction-driven queries that bridge vision-language models and SAM3 through learnable instance queries and hybrid attention mechanisms.
In this paper, we introduce InstructSAM, a unified and streamlined framework designed for multi-instance segmentation under arbitrary instructions. We formulate instruction-driven instance segmentation as a set-structured query prediction problem and propose an explicit reasoning-to-instance query interface that bridges a vision-language model (VLM) and SAM3. Specifically, a bank of learnable instance queries is injected into the VLM and contextualized with instruction and visual information, enabling each query to serve as an instance-aware slot. A hybrid-attention mechanism further promotes interaction among these queries, visual tokens, and instruction tokens, improving instance enumeration and reducing duplicate predictions. The resulting LLM-conditioned queries are projected into SAM3's detector query space to drive accurate multi-instance segmentation in a single forward pass. This design equips SAM3 with high-level instruction understanding, compositional reasoning, and instance-level set prediction without modifying its core architecture. To support training and evaluation, we further construct Inst2Seg, a high-quality, large-scale instruction-based instance segmentation dataset and benchmark that couples free-form instructions with instance-level masks. Extensive experiments show that InstructSAM, at only the 2B scale, achieves strong results across complex instruction-driven and phrase-level referring segmentation benchmarks, outperforming prior end-to-end methods and SAM3's agentic pipeline while enabling efficient single-pass multi-instance prediction.
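The abstract does not spell out how the hybrid-attention mechanism is masked, so the following is only a rough sketch of one plausible reading, not the authors' implementation. It assumes a token layout of visual tokens, then instruction tokens, then instance queries; it assumes the context tokens keep ordinary causal attention while each instance query attends to all context tokens and to every other query. The function name `hybrid_attention_mask` and its parameters are hypothetical.

```python
from typing import List


def hybrid_attention_mask(
    num_visual: int, num_text: int, num_queries: int
) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Assumed layout: [visual tokens | instruction tokens | instance queries].
    Assumed rule (not confirmed by the paper): context tokens are causal,
    instance queries see the full context and all other instance queries.
    """
    num_context = num_visual + num_text
    total = num_context + num_queries
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < num_context:
                # Context token: causal attention, so it never sees the queries.
                mask[i][j] = j <= i
            else:
                # Instance query: bidirectional over context and query slots.
                mask[i][j] = True
    return mask


if __name__ == "__main__":
    # Print the mask as a grid: "x" means attention is allowed.
    for row in hybrid_attention_mask(num_visual=2, num_text=2, num_queries=3):
        print("".join("x" if allowed else "." for allowed in row))
```

Letting the query slots see one another is what would allow them to coordinate on which instance each slot claims, which is consistent with the abstract's stated goal of reducing duplicate predictions.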
Community
InstructSAM is an instruction-driven multi-instance segmentation framework designed to segment arbitrary target instances from natural-language instructions.
