Hugging Face Daily Papers · June 8, 2026 · 3 min read

WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark

#model-release #multimodal #reasoning #benchmark

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


WorldBench is a challenging and visually diverse benchmark designed to evaluate Multimodal Large Language Models, addressing gaps in visual concept representation found in existing multimodal evaluation frameworks.


arxiv:2606.06538


Published on Jun 4, 2026 · Submitted by taesiri on Jun 8, 2026


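Authors: Yida Yin, Harish Krishnakumar, Chung Peng Lee, Boya Zeng, Wenhao Chai, Shengbang Tong, Wenhu Chen, Hu Xu, Xingyu Fu, Gabriel Sarch, Aleksandra Korolova, Zhuang Liu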

Abstract

WorldBench is introduced as a visually diverse reasoning benchmark for evaluating multimodal large language models, revealing significant limitations in current models' visual understanding capabilities.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

In real-world applications, models are expected to perform reliably across diverse settings. Yet many existing multimodal benchmarks expand task types without capturing the visual diversity needed to handle open-ended visual inputs. We present WorldBench, a challenging and visually diverse reasoning benchmark to evaluate Multimodal Large Language Models (MLLMs). We build a taxonomy of thousands of visual concepts across multiple domains (e.g., living things). Guided by this taxonomy, we curate a broad collection of images from search engines and existing datasets to comprehensively represent the visual world. Through structured trial-and-error, we manually design challenging questions that frontier MLLMs fail to answer. On quantitative and human evaluations, WorldBench achieves higher visual diversity than any existing diverse benchmark. Evaluating 15 MLLMs on WorldBench reveals weaknesses in visual understanding: even the strongest model reaches only 64.0% accuracy, while some models perform marginally above chance level. We hope our work highlights the importance of visual diversity in building multimodal benchmarks.
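To make the "accuracy versus chance level" comparison in the abstract concrete, here is a minimal sketch of how such numbers are commonly computed. It assumes a multiple-choice format, which this page does not confirm for WorldBench; the `Item` record, its fields, and the function names are hypothetical and are not taken from the paper or its released code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Item:
    """One benchmark question (hypothetical schema, assumed multiple-choice)."""
    num_options: int  # number of answer choices offered
    gold: str         # correct choice label, e.g. "B"
    predicted: str    # label the model returned


def accuracy(items: Sequence[Item]) -> float:
    """Fraction of items where the prediction matches the gold label."""
    if not items:
        raise ValueError("need at least one item")
    return sum(i.predicted == i.gold for i in items) / len(items)


def chance_accuracy(items: Sequence[Item]) -> float:
    """Expected accuracy of uniform random guessing over each item's options."""
    if not items:
        raise ValueError("need at least one item")
    return sum(1.0 / i.num_options for i in items) / len(items)


def margin_over_chance(items: Sequence[Item]) -> float:
    """How far the observed accuracy sits above the random-guess baseline."""
    return accuracy(items) - chance_accuracy(items)


# Toy data only; these are not WorldBench results.
toy_items = [
    Item(num_options=4, gold="A", predicted="A"),
    Item(num_options=4, gold="B", predicted="B"),
    Item(num_options=4, gold="C", predicted="D"),
    Item(num_options=4, gold="D", predicted="D"),
]
print(f"accuracy={accuracy(toy_items):.2f}, "
      f"chance={chance_accuracy(toy_items):.2f}, "
      f"margin={margin_over_chance(toy_items):.2f}")
```

Averaging `1 / num_options` per item, rather than assuming a fixed number of choices, keeps the baseline meaningful if questions offer different numbers of options. Under this reading, a model "marginally above chance level" is one whose margin is close to zero.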

arXiv: https://arxiv.org/abs/2606.06538 · Project page: https://worldbench-vl.github.io/


Datasets citing this paper 1
