Hugging Face Daily Papers · 6 min read

Count Anything

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.30846

Published on May 29 · Submitted by Mengqi Lei on Jun 1
Authors: Mengqi Lei, Shuokun Cheng, Wei Bao, Shaoyi Du, Jun-Hai Yong, Siqi Li, Yue Gao

Abstract

AI-generated summary: A generalist model for text-guided object counting across multiple domains, using dual-granularity instance enumeration and Complementary Count Fusion to improve accuracy and cross-domain generalization.

Object counting remains fragmented across domain-specific datasets and task formulations, despite rapid progress in generalist vision models. Existing counting models are often tailored to scenarios such as crowds, vehicles, cells, crops, or remote-sensing objects, and thus struggle to generalize across categories, visual domains, object scales, and density distributions. In this paper, we study text-guided object counting across domains, where a model takes an image and a natural-language query as input and returns an instance-grounded set of target points whose cardinality gives the count. This formulation unifies category-conditioned counting with interpretable spatial localization. To support this setting, we construct CLOC, a Cross-domain Large-scale Object Counting dataset that reorganizes diverse public data sources into a unified benchmark. CLOC covers six visual domains: General Scene, Remote Sensing, Histopathology, Cellular Microscopy, Agriculture, and Microbiology, with about 220K images, 619 categories, and 15M object instances. Based on CLOC, we propose Count Anything, a generalist model for text-guided object counting. Unlike density-map-based methods, which dominate counting models, Count Anything adopts discrete instance points and performs dual-granularity instance enumeration. A Region-level Sparse Counter provides object-level anchors for large and sparse targets, while a Pixel-level Dense Counter handles small, crowded, and weakly bounded targets via dense point prediction. A point-centric supervision strategy enables learning from heterogeneous annotations, and Complementary Count Fusion combines both counters in a parameter-free manner. Extensive experiments show that Count Anything achieves strong accuracy and multi-domain generalization, outperforming existing open-world counting methods. Code is available at: https://github.com/Mengqi-Lei/count-anything.
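
The point-set formulation in the abstract can be illustrated with a small Python sketch. The abstract does not say how Complementary Count Fusion merges the two counters, so the merge rule below is an assumption made purely for illustration: keep one anchor per region-level detection, then add any dense point that no region box covers. The names `Region`, `fuse_points`, and the toy coordinates are hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates


@dataclass
class Region:
    """Hypothetical region-level detection: an axis-aligned box (x0, y0, x1, y1)."""
    box: Tuple[float, float, float, float]

    @property
    def anchor(self) -> Point:
        # Use the box centre as the object-level anchor point.
        x0, y0, x1, y1 = self.box
        return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

    def contains(self, p: Point) -> bool:
        x0, y0, x1, y1 = self.box
        return x0 <= p[0] <= x1 and y0 <= p[1] <= y1


def fuse_points(regions: List[Region], dense_points: List[Point]) -> List[Point]:
    """Illustrative merge with no learned parameters (an assumption, not the
    paper's rule): one anchor per region, plus dense points outside all regions."""
    fused = [r.anchor for r in regions]
    fused += [p for p in dense_points if not any(r.contains(p) for r in regions)]
    return fused


# Toy example: two large objects from the region branch and five dense points,
# two of which fall inside a region box and are treated as duplicates.
regions = [Region((0, 0, 10, 10)), Region((20, 20, 30, 30))]
dense_points = [(5, 5), (25, 24), (50, 50), (52, 51), (60, 40)]

points = fuse_points(regions, dense_points)
count = len(points)  # the count is the cardinality of the predicted point set
print(count, points)
```

Whatever the actual fusion rule is, the output contract matches the abstract: the model returns a set of points, and the count is simply the size of that set, so every counted instance can be inspected at its location.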


