Hugging Face Daily Papers · 4 min read

Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Code: https://github.com/ModalityDance/Optical-Reasoning
arxiv:2606.09585

Published on Jun 8 · Submitted by charlie on Jun 9
Authors: Yutong Bian, Dongjie Cheng, Heming Xia, Yongqi Li, Wenjie Li

Abstract

Optical reasoning uses images as a standalone reasoning medium for language and multimodal tasks, achieving higher token efficiency than traditional text-based approaches.

Chain-of-Thought (CoT) improves the performance of Large Language Models (LLMs) and has been extended to Multimodal Large Language Models (MLLMs). More recent work further moves from text-based multimodal reasoning toward interleaved-modal reasoning, where intermediate steps can incorporate both textual rationales and visual evidence. In this work, we propose a bolder and more ambitious idea: could images alone serve as the reasoning medium for both language and multimodal tasks? To explore this, we propose optical reasoning, which treats images as a standalone reasoning medium. We instantiate this concept with two variants: typographic-based optical reasoning, which optimizes visual layouts for compact rationale rendering, and graphical-based optical reasoning, which composes text and graphical elements into structured visual rationales. Across mathematical, scientific, and interleaved-modal reasoning benchmarks, optical reasoning can match or even exceed traditional text reasoning while reducing reasoning tokens by an average of 28.57% on language tasks and 16% on multimodal tasks, achieving 1.96 times the token efficiency of text reasoning. These results show that images can effectively and efficiently encode rationales while providing a unified visual canvas for reasoning.
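
To make the reported reductions concrete, here is a minimal arithmetic sketch. It only converts the two average percentages quoted in the abstract into remaining-token counts and plain compression ratios. The 1,000-token baseline and the function names are assumptions chosen for illustration, not figures or code from the paper. The abstract does not define how the 1.96x token-efficiency figure is computed, so the sketch does not try to reproduce it.

```python
def tokens_after_reduction(baseline_tokens: float, reduction_pct: float) -> float:
    """Tokens remaining after cutting `reduction_pct` percent of the baseline."""
    return baseline_tokens * (1.0 - reduction_pct / 100.0)


def compression_ratio(reduction_pct: float) -> float:
    """Baseline tokens divided by remaining tokens."""
    return 1.0 / (1.0 - reduction_pct / 100.0)


# Average reasoning-token reductions quoted in the abstract.
REDUCTIONS = {"language": 28.57, "multimodal": 16.0}

# Hypothetical baseline budget, chosen only for illustration.
BASELINE_TOKENS = 1000

for task, pct in REDUCTIONS.items():
    remaining = tokens_after_reduction(BASELINE_TOKENS, pct)
    ratio = compression_ratio(pct)
    print(f"{task}: {remaining:.1f} tokens remain, {ratio:.2f}x fewer tokens")
```

Under these assumptions, a 28.57% reduction leaves roughly 714 of every 1,000 reasoning tokens (about 1.40x fewer), and a 16% reduction leaves 840 (about 1.19x fewer). Both ratios are below 1.96, which suggests the paper's efficiency metric accounts for more than raw token counts; see the paper for its definition.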

Community

Paper submitter about 12 hours ago

Optical reasoning introduces a new paradigm that uses images alone as an expressive reasoning medium for both language and multimodal tasks, moving beyond text-based and interleaved-modal Chain-of-Thought reasoning. With typographic and graphical visual rationales, it matches or surpasses text reasoning while substantially reducing reasoning tokens, highlighting the potential of images alone to serve as a unified, efficient, and general-purpose medium for reasoning.


Get this paper in your agent:

hf papers read 2606.09585
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 0
Datasets citing this paper: 1
Spaces citing this paper: 0
Collections including this paper: 0
