Hugging Face Daily Papers · 3 min read

From Web to Pixels: Bringing Agentic Search into Visual Perception

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Project page: https://pixel-searcher.github.io/

Code: https://github.com/yangbokang/pixel-searcher
Papers
arxiv:2605.12497

From Web to Pixels: Bringing Agentic Search into Visual Perception

Published on May 12 · Submitted by taesiri on May 13
Authors: Bokang Yang, Xinyi Sun, Kaituo Feng, Xingping Dong, Dongming Wu, Xiangyu Yue

Abstract

Researchers introduce WebEye, a benchmark for object localization requiring external knowledge resolution, and Pixel-Searcher, an agent-based approach that connects hidden target identities to visual annotations through search and reasoning.

AI-generated summary

Visual perception connects high-level semantic understanding to pixel-level perception, but most existing settings assume that the decisive evidence for identifying a target is already in the image or frozen model knowledge. We study a more practical yet harder open-world case where a visible object must first be resolved from external facts, recent events, long-tail entities, or multi-hop relations before it can be localized. We formalize this challenge as Perception Deep Research and introduce WebEye, an object-anchored benchmark with verifiable evidence, knowledge-intensive queries, precise box/mask annotations, and three task views: Search-based Grounding, Search-based Segmentation, and Search-based VQA. WebEye contains 120 images, 473 annotated object instances, 645 unique QA pairs, and 1,927 task samples. We further propose Pixel-Searcher, an agentic search-to-pixel workflow that resolves hidden target identities and binds them to boxes, masks, or grounded answers. Experiments show that Pixel-Searcher achieves the strongest open-source performance across all three task views, while failures mainly arise from evidence acquisition, identity resolution, and visual instance binding.
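The search-to-pixel workflow described above can be sketched as a two-stage loop: first resolve the hidden identity behind a knowledge-intensive query via external search, then bind the resolved entity to a pixel-level prediction. The sketch below is purely illustrative and uses hypothetical names (`search`, `resolve_identity`, `bind_to_pixels`, the toy `EVIDENCE` and `DETECTIONS` tables); it is not the paper's actual API, and both the search tool and the detector are stubbed out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Box:
    """A pixel-space bounding box (x1, y1, x2, y2)."""
    x1: int
    y1: int
    x2: int
    y2: int


# Toy stand-ins for an external search index and a visual detector.
# In a real system these would be a web-search tool call and a
# grounding/segmentation model, respectively.
EVIDENCE = {"mascot of the 2024 games": "Phryge"}
DETECTIONS = {"Phryge": Box(10, 20, 110, 220), "dog": Box(0, 0, 50, 50)}


def search(query: str) -> Optional[str]:
    """Stand-in for an external web-search tool call."""
    return EVIDENCE.get(query.lower())


def resolve_identity(query: str) -> Optional[str]:
    """Stage 1: map a knowledge-intensive query to a concrete entity name."""
    return search(query)


def bind_to_pixels(entity: str) -> Optional[Box]:
    """Stage 2: ground the resolved entity to a box via the stubbed detector."""
    return DETECTIONS.get(entity)


def pixel_search(query: str) -> Optional[Box]:
    entity = resolve_identity(query)
    if entity is None:
        return None  # failure mode: evidence acquisition
    # Returning None here corresponds to the instance-binding failure mode.
    return bind_to_pixels(entity)


box = pixel_search("Mascot of the 2024 games")
```

Note how the two `None` paths mirror the failure modes the abstract names (evidence acquisition and visual instance binding); identity-resolution errors would show up as `search` returning the wrong entity string.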

Community


Get this paper in your agent:

hf papers read 2605.12497
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2605.12497 in a model README.md to link it from this page.

Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2605.12497 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.

Discussion (0)

No comments yet.
