Hugging Face Daily Papers · May 27, 2026 · 3 min read

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


GitHub: https://github.com/NVlabs/Eagle

Project page: https://research.nvidia.com/labs/lpr/locate-anything/

arXiv:2605.27365

Published on May 26 · Submitted by taesiri on May 27

#1 Paper of the day

NVIDIA

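Authors: Shihao Wang, Shilong Liu, Yuanguo Kuang, Xinyu Wei, Yangzhou Liu, Zhiqi Li, Yunze Man, Guo Chen, Andrew Tao, Guilin Liu, Jan Kautz, Lei Zhang, Zhiding Yu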

Abstract

AI-generated summary: Parallel Box Decoding enables efficient and accurate unified visual grounding and detection by decoding geometric elements as atomic units, improving both throughput and localization quality.

Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, serializing each 2D box into multiple 1D tokens that are learned and decoded largely independently. This token-by-token decoding mismatches the coupled structure of box geometry and creates a practical inference bottleneck due to strictly sequential generation. We introduce LocateAnything, a unified generative grounding and detection framework based on Parallel Box Decoding (PBD). By decoding geometric elements such as bounding boxes and points as atomic units in a single step, LocateAnything preserves intra-box geometric coherence and unlocks substantial parallelism. We show that PBD improves both decoding throughput and localization accuracy. We further develop a scalable data engine and curate LocateAnything-Data, a large-scale dataset with more than 138 million training samples, substantially increasing data diversity for high-precision localization. Extensive evaluations show that LocateAnything advances the speed-accuracy frontier, achieving significantly higher decoding throughput while improving high-IoU localization quality across diverse benchmarks. The results highlight the complementary benefits of Parallel Box Decoding and large-scale training data in enabling efficient and precise unified visual grounding and detection.
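To make the contrast in the abstract concrete, here is a minimal Python sketch of the two ideas it relies on: serializing a box into coordinate tokens, and measuring localization quality with IoU. It is an illustration under stated assumptions, not the paper's implementation. The bin count `NUM_BINS`, the four-tokens-per-box layout, and all function names are hypothetical, and the step counts ignore separator or label tokens that a real model may also emit.

```python
from typing import List, Tuple

# (x1, y1, x2, y2) with coordinates normalized to [0, 1]
Box = Tuple[float, float, float, float]

NUM_BINS = 1000  # hypothetical quantization resolution, not taken from the paper


def box_to_tokens(box: Box, num_bins: int = NUM_BINS) -> List[int]:
    """Serialize one 2D box into four 1D coordinate tokens (bin indices)."""
    return [min(num_bins - 1, max(0, int(c * num_bins))) for c in box]


def sequential_steps(num_boxes: int, tokens_per_box: int = 4) -> int:
    """Decoder steps when each coordinate token is generated one at a time."""
    return num_boxes * tokens_per_box


def atomic_steps(num_boxes: int) -> int:
    """Decoder steps when each box is emitted as a single atomic unit."""
    return num_boxes


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(box_to_tokens((0.25, 0.125, 0.75, 0.5)))  # [250, 125, 750, 500]
print(sequential_steps(10), atomic_steps(10))   # 40 10
```

Under these assumptions, ten boxes cost 40 sequential decoding steps with coordinate tokens but only 10 when each box is a single unit, which is the kind of parallelism the abstract refers to. The `iou` helper shows the metric behind "high-IoU localization quality": a prediction counts as accurate at a strict threshold only when all four coordinates are jointly close to the target.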