Hugging Face Daily Papers · May 25, 2026 · 4 min read

The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


We argue that many vision-language models don't truly see: they exploit language priors to mask a broken visual pipeline. We propose the Expense of Seeing, a set of metrics that measure this by translating a sample across modalities rather than ablating it, and find that no released benchmark can yet compute them.


arxiv:2604.20665


Published on May 21

Submitted by Karan Goyal on May 25


Authors: Karan Goyal

Abstract

AI-generated summary: Vision-Language Models often fail to faithfully synthesize multimodal data due to reliance on language priors over visual representation, necessitating new evaluation frameworks that prioritize semantic sufficiency over traditional multimodal gain metrics.

The rapid proliferation of Vision-Language Models (VLMs) is often framed as enabling unified multimodal knowledge discovery but rests on an under-examined assumption: that current VLMs faithfully synthesise multimodal data. We argue they often do not, and this gap reflects a trustworthiness problem in the dominant Vision Encoder-Projector-LLM paradigm. Rather than extracting grounded knowledge from visual inputs, state-of-the-art models frequently exhibit functional blindness, i.e., exploiting strong language priors to bypass severe visual representation bottlenecks. In this work, we challenge the conventional methodology of multimodal evaluation, which relies on data ablation or new dataset creation and therefore conflates dataset biases with architectural incapacity. We propose an information-theoretic departure: the Modality Translation Protocol, designed to quantify what we call the Expense of Seeing. By translating semantic payloads rather than ablating them, we formulate three novel metrics -- the Toll (ToS), Curse (CoS), and Fallacy (FoS) of Seeing -- culminating in the Semantic Sufficiency Criterion (SSC). Furthermore, we hypothesise a Divergence Law of Multimodal Scaling: as the underlying language engines scale to unprecedented reasoning capabilities, the penalty of the visual knowledge bottleneck may increase rather than diminish. We argue the community should move beyond "multimodal gain" as a primary evaluation target. By elevating the SSC from a passive diagnostic constraint to an active architectural blueprint, we provide a foundation for guiding the next generation of AI systems toward genuine multimodal reasoning.
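
The abstract contrasts translating a semantic payload across modalities with ablating it, but it does not give formulas for ToS, CoS, FoS, or the SSC. The sketch below is therefore only an illustration of the general idea, not the paper's definitions: it assumes a hypothetical `TranslatedSample` that holds the same content in a visual and a textual rendering, and it reports a simple accuracy gap between the two. Every name in it (`TranslatedSample`, `accuracy`, `modality_gap`, the toy predictors) is an assumption introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class TranslatedSample:
    """One semantic payload in two modalities (hypothetical structure)."""
    visual_input: str  # stand-in for an image, e.g. a file path
    text_input: str    # the same content translated into text
    answer: str        # ground-truth label shared by both renderings


def accuracy(predict: Callable[[str], str],
             inputs: Sequence[str],
             answers: Sequence[str]) -> float:
    """Fraction of inputs for which predict() returns the expected answer."""
    if not inputs:
        raise ValueError("need at least one sample")
    correct = sum(predict(x) == y for x, y in zip(inputs, answers))
    return correct / len(inputs)


def modality_gap(predict_visual: Callable[[str], str],
                 predict_text: Callable[[str], str],
                 samples: Sequence[TranslatedSample]) -> float:
    """Accuracy on the text rendering minus accuracy on the visual rendering.

    Illustrative quantity only; it is NOT the paper's ToS, CoS, or FoS.
    """
    answers = [s.answer for s in samples]
    acc_text = accuracy(predict_text, [s.text_input for s in samples], answers)
    acc_visual = accuracy(predict_visual, [s.visual_input for s in samples], answers)
    return acc_text - acc_visual


# Toy data: the same four payloads, each rendered as an "image" and as text.
samples = [
    TranslatedSample("img_0.png", "a red cube", "red"),
    TranslatedSample("img_1.png", "a blue ball", "blue"),
    TranslatedSample("img_2.png", "a red cone", "red"),
    TranslatedSample("img_3.png", "a green ring", "green"),
]

# Toy text pathway: reads the colour word directly from the description.
def predict_text(description: str) -> str:
    return description.split()[1]

# Toy "functionally blind" visual pathway: ignores the image and
# always answers with the most common colour, mimicking a language prior.
def predict_visual(image_path: str) -> str:
    return "red"

gap = modality_gap(predict_visual, predict_text, samples)
print(f"text-minus-visual accuracy gap: {gap:.2f}")
```

In this toy setup both pathways receive the same content, so a positive gap comes from the visual pathway rather than from missing information. That is the distinction the abstract draws against ablation-based evaluation.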


Community

goyalkaraniit

Paper author Paper submitter about 7 hours ago


Get this paper in your agent:

hf papers read 2604.20665

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
