Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception
Yiwei Ou, Chung Ching Cheung, Jun Yang Ang, Xiaobin Ren, Ronggui Sun, Guansong Gao, Kaiqi Zhao, Manfredo Manfredini (University of Auckland)
AI-generated summary
Urban-ImageNet is a large-scale multi-modal dataset and evaluation benchmark for urban space perception built from social media imagery, organized under a hierarchical taxonomy that supports scene classification, cross-modal retrieval, and instance segmentation.
Abstract
We present Urban-ImageNet, a large-scale multi-modal dataset and evaluation benchmark for urban space perception from user-generated social media imagery. The corpus contains over 2 million public social media images and paired textual posts collected from Weibo across 61 urban sites in 24 Chinese cities from 2019 to 2025, with controlled benchmark subsets at 1K, 10K, and 100K scale and a full 2M corpus for large-scale training and evaluation. Urban-ImageNet is organized by HUSIC, a Hierarchical Urban Space Image Classification framework that defines a 10-class taxonomy grounded in urban theory. The taxonomy distinguishes activated and non-activated public spaces, exterior and interior urban environments, accommodation spaces, consumption content, portraits, and non-spatial social-media content. Rather than treating urban imagery as generic scene data, Urban-ImageNet evaluates whether machine perception models can capture spatial, social, and functional distinctions that are central to urban studies. The benchmark supports three tasks within one standardized library: (T1) urban scene semantic classification, (T2) cross-modal image–text retrieval, and (T3) instance segmentation. Our experiments evaluate representative vision, vision-language, and segmentation models, revealing strong performance on supervised scene classification but markedly weaker results in cross-modal retrieval and instance-level urban object segmentation. A multi-scale study further examines how model performance changes as balanced training data grows from 1K to 10K to 100K images. Urban-ImageNet provides a unified, theory-grounded, multi-city benchmark for evaluating how AI systems perceive and interpret contemporary urban spaces across modalities, scales, and task formulations. Dataset and benchmark are available at huggingface.co/datasets/Yiwei-Ou/Urban-ImageNet and github.com/yiasun/dataset-2.
Community
We introduce Urban-ImageNet, a large-scale multimodal benchmark for urban space perception built from 2M+ public Weibo image–text pairs collected across 61 commercial sites in 24 Chinese cities from 2019–2025.
General-purpose benchmarks like ImageNet and Places365 identify what is visible in a scene. Urban-ImageNet asks how people inhabit, experience, and socially activate urban space. The dataset is organized by HUSIC, a 10-class taxonomy grounded in the urban theories of Lefebvre, Gehl, and Newman, distinguishing socially activated vs. unoccupied spaces, exterior vs. interior environments, consumption content, and social portraits.
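The two-level structure described above can be sketched as a simple label map. The group and class names below are placeholders inferred from the category descriptions, not the official HUSIC label set:

```python
# Illustrative sketch of a hierarchical label map in the spirit of HUSIC.
# Group and class names are placeholders inferred from the paper's category
# descriptions, NOT the official 10-class taxonomy.
HUSIC_SKETCH = {
    "public_space": ["activated_exterior", "non_activated_exterior",
                     "activated_interior", "non_activated_interior"],
    "private_space": ["accommodation"],
    "content": ["consumption", "portrait", "non_spatial"],
}

def coarse_label(fine_label: str) -> str:
    """Map a fine-grained class back to its coarse group."""
    for group, classes in HUSIC_SKETCH.items():
        if fine_label in classes:
            return group
    raise KeyError(fine_label)
```

A hierarchy like this lets evaluation report accuracy at both the fine and coarse level, which matters when classes such as activated and non-activated space are visually close.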
The benchmark supports three unified tasks within one standardized library:
🏷️ T1 Urban scene semantic classification
🔍 T2 Cross-modal image–text retrieval
🎯 T3 Instance segmentation
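T2 is typically scored with Recall@K over an image–text similarity matrix. A minimal sketch of that metric (illustrative, not the benchmark's official evaluation code), assuming row i scores image i against every caption and the matched caption shares index i:

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Fraction of queries whose ground-truth match (same row/column
    index, by assumption) appears among the top-k scored candidates."""
    topk = np.argsort(-sim, axis=1)[:, :k]  # top-k candidate indices per row
    hits = (topk == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())

# Toy 3x3 similarity matrix: queries 0 and 2 rank their match first,
# query 1 only recovers its match within the top 2.
sim = np.array([[0.9, 0.1, 0.2],
                [0.8, 0.4, 0.3],
                [0.1, 0.2, 0.7]])
```

The same matrix scores both directions: transpose `sim` to evaluate text-to-image retrieval.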
Balanced 1K / 10K / 100K subsets support controlled scaling-behavior studies, alongside a full 2M-scale corpus for large-scale training. Dataset and code are publicly available on Hugging Face and GitHub.
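Building a class-balanced subset like the 1K / 10K / 100K splits amounts to stratified sampling over class labels. A minimal sketch under that assumption (illustrative; not the authors' construction procedure):

```python
import random
from collections import defaultdict

def balanced_subset(items, labels, total_size, seed=0):
    """Draw an equal number of items per class, totalling ~total_size.

    Illustrative stratified sampler, not the official subset-building code.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in zip(items, labels):
        by_class[label].append(item)
    per_class = total_size // len(by_class)  # equal quota per class
    subset = []
    for label, pool in sorted(by_class.items()):
        rng.shuffle(pool)                    # random draw within each class
        subset.extend(pool[:per_class])
    return subset
```

Fixing the seed makes the 1K / 10K / 100K draws reproducible, which is what makes controlled scaling comparisons across models meaningful.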



Paper: arxiv.org/abs/2605.09936