Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents
Authors: XiuYu Zhang, Junfeng Fang, Zhenkai Liang
Organization: National University of Singapore
Published: 2026-06-04 (arXiv: 2606.05753)
Code: https://github.com/xiuyuz/cosine-misleads
Abstract
This work challenges the conventional wisdom in latent visual reasoning: cosine alignment between supervised latents and their visual targets correlates negatively with model accuracy, and the answer is decodable downstream of the latents rather than at them.
Latent visual reasoning (LVR) inserts supervised latent tokens between perception and answer generation in vision-language models (VLMs). The field uses alignment between these latents and their visual targets, i.e., cosine similarity or mean squared error (MSE), as both the training loss and the quality metric, assuming that better alignment yields a better answer. We test this with a designed matrix of five LVR variants and find the assumption inverted: cosine alignment is negatively correlated with accuracy across all five (r=-0.94). To explain this, we introduce PRISM, a pair of inference-time diagnostics: a linear probe that asks where the answer is decodable, and a corruption test that asks whether the latent is load-bearing. The supervised latents are largely bypassed. Corrupting them shifts accuracy by at most four points. The answer is decodable downstream of the latent but not at it, and the size of this decodability gap predicts how much each variant relies on its latent under perturbation. Consistent with an Information Bottleneck reading of the loss, the auxiliary objective reshapes the language model via shared parameters rather than via the latent variable it nominally optimizes.
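The quantities named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' PRISM code: the function names, the ridge-regularized least-squares probe, and the norm-matched Gaussian noise used for corruption are all assumptions chosen for illustration, and `predict_fn` is a hypothetical stand-in for a model that maps latents to predicted answers.

```python
import numpy as np


def cosine_alignment(latents, targets):
    """Mean cosine similarity between row-aligned latent and target vectors."""
    num = np.sum(latents * targets, axis=1)
    den = np.linalg.norm(latents, axis=1) * np.linalg.norm(targets, axis=1)
    return float(np.mean(num / np.maximum(den, 1e-12)))


def mse_alignment(latents, targets):
    """Mean squared error between latents and their targets."""
    return float(np.mean((latents - targets) ** 2))


def pearson_r(x, y):
    """Pearson correlation, e.g. between per-variant alignment and accuracy."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))


def _with_bias(features):
    """Append a constant column so the probe can learn an intercept."""
    return np.hstack([features, np.ones((len(features), 1))])


def fit_linear_probe(features, labels, n_classes, ridge=1e-3):
    """Fit a linear probe by ridge regression onto one-hot answer labels."""
    X = _with_bias(features)
    Y = np.eye(n_classes)[labels]
    return np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ Y)


def probe_accuracy(W, features, labels):
    """Fraction of examples whose answer the probe decodes correctly."""
    return float(np.mean(np.argmax(_with_bias(features) @ W, axis=1) == labels))


def corruption_shift(predict_fn, latents, labels, rng):
    """Accuracy change when latents are swapped for norm-matched Gaussian noise.

    predict_fn is a hypothetical callable: latents -> predicted labels.
    A shift near zero suggests the latent is not load-bearing.
    """
    clean = np.mean(predict_fn(latents) == labels)
    noise = rng.standard_normal(latents.shape)
    scale = np.linalg.norm(latents, axis=1, keepdims=True)
    noise *= scale / np.linalg.norm(noise, axis=1, keepdims=True)
    corrupted = np.mean(predict_fn(noise) == labels)
    return float(corrupted - clean)
```

Under these assumptions, a decodability gap for one variant would be `probe_accuracy` on hidden states taken downstream of the latent minus `probe_accuracy` on the latent itself, and the cross-variant relationship reported in the abstract would be `pearson_r` applied to the five per-variant alignment and accuracy values. In practice the probe should be fitted and scored on separate splits.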