From Web to Pixels: Bringing Agentic Search into Visual Perception
Abstract
Visual perception connects high-level semantic understanding to pixel-level evidence, but most existing settings assume that the decisive evidence for identifying a target is already present in the image or in frozen model knowledge. We study a more practical yet harder open-world setting in which a visible object must first be resolved from external facts, recent events, long-tail entities, or multi-hop relations before it can be localized. We formalize this challenge as Perception Deep Research and introduce WebEye, an object-anchored benchmark with verifiable evidence, knowledge-intensive queries, precise box/mask annotations, and three task views: Search-based Grounding, Search-based Segmentation, and Search-based VQA. WebEye contains 120 images, 473 annotated object instances, 645 unique QA pairs, and 1,927 task samples. We further propose Pixel-Searcher, an agentic search-to-pixel workflow that resolves hidden target identities and binds them to boxes, masks, or grounded answers. Experiments show that Pixel-Searcher achieves the strongest open-source performance across all three task views, while failures mainly arise from evidence acquisition, identity resolution, and visual instance binding.
AI-generated summary
Researchers introduce WebEye, a benchmark for object localization requiring external knowledge resolution, and Pixel-Searcher, an agent-based approach that connects hidden target identities to visual annotations through search and reasoning.
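The abstract describes the agentic workflow only at a high level: gather external evidence, resolve the hidden identity of the visible target, then bind that identity to pixels. As a purely illustrative sketch, not the paper's implementation, the Python below shows one way such a search-to-pixel loop could be organized. Every name here (search_to_pixel, web_search, resolve_identity, ground) is hypothetical and stands in for whatever search tool, language model, and grounding detector the real system uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class GroundedResult:
    identity: str                            # resolved name of the hidden target
    box: Tuple[float, float, float, float]   # (x1, y1, x2, y2) in pixel coordinates

def search_to_pixel(
    image,                                            # decoded image (e.g. a numpy array)
    query: str,                                       # knowledge-intensive query about a visible object
    web_search: Callable[[str], List[str]],           # returns text snippets of external evidence
    resolve_identity: Callable[[str, List[str]], str],# LLM call: query + evidence -> entity name ("" if unsure)
    ground: Callable[[object, str], Tuple[float, float, float, float]],  # detector: image + phrase -> box
    max_rounds: int = 3,
) -> GroundedResult:
    """Illustrative search-then-ground loop: collect external evidence,
    resolve the target's identity, then bind it to a bounding box."""
    evidence: List[str] = []
    identity = ""
    for _ in range(max_rounds):
        # Refine the search once a candidate identity is available.
        evidence += web_search(query if not identity else f"{query} {identity}")
        identity = resolve_identity(query, evidence)
        if identity:   # stop once the entity is confidently named
            break
    # Fall back to the raw query phrase if the identity was never resolved.
    box = ground(image, identity or query)
    return GroundedResult(identity=identity, box=box)
```

The authors' actual implementation and tooling are available via the repository linked in the Community section below.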
Community
Project page: https://pixel-searcher.github.io/
Code: https://github.com/yangbokang/pixel-searcher