Hugging Face Daily Papers · June 9, 2026 · 4 min read

Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Optical reasoning introduces a new paradigm that uses images alone as an expressive reasoning medium for both language and multimodal tasks, moving beyond text-based and interleaved-modal Chain-of-Thought reasoning. With typographic and graphical visual rationales, it matches or surpasses text reasoning while substantially reducing reasoning tokens, highlighting the potential of images alone to serve as a unified, efficient, and general-purpose medium for reasoning.

Authors: Yutong Bian, Dongjie Cheng, Heming Xia, Yongqi Li, Wenjie Li

Code: https://github.com/ModalityDance/Optical-Reasoning

arxiv:2606.09585

Published on Jun 8 · Submitted by charlie on Jun 9 · ModalityDance

Abstract

Optical reasoning uses images as a standalone reasoning medium for language and multimodal tasks, achieving higher token efficiency than traditional text-based approaches.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Chain-of-Thought (CoT) improves the performance of Large Language Models (LLMs) and has been extended to Multimodal Large Language Models (MLLMs). More recent work further moves from text-based multimodal reasoning toward interleaved-modal reasoning, where intermediate steps can incorporate both textual rationales and visual evidence. In this work, we propose a bolder and more ambitious idea: could images alone serve as the reasoning medium for both language and multimodal tasks? To explore this, we propose optical reasoning, which treats images as a standalone reasoning medium. We instantiate this concept with two variants: typographic-based optical reasoning, which optimizes visual layouts for compact rationale rendering, and graphical-based optical reasoning, which composes text and graphical elements into structured visual rationales. Across mathematical, scientific, and interleaved-modal reasoning benchmarks, optical reasoning can match or even exceed traditional text reasoning while reducing reasoning tokens by an average of 28.57% on language tasks and 16% on multimodal tasks, achieving 1.96 times the token efficiency of text reasoning. These results show that images can effectively and efficiently encode rationales while providing a unified visual canvas for reasoning.
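To make the reported reductions concrete, the sketch below converts a percentage cut in reasoning tokens into a rationale-length ratio. It is only arithmetic on the figures quoted in the abstract: the abstract does not define "token efficiency", so the 1.96 figure is not reproduced here. The helper `visual_tokens` is a hypothetical cost model (one token per square pixel tile, with an assumed tile size of 28) and is not taken from the paper.

```python
import math


def tokens_after_reduction(text_tokens: int, reduction: float) -> float:
    """Tokens left after removing `reduction` (a fraction in [0, 1)) of them."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return text_tokens * (1.0 - reduction)


def length_ratio(reduction: float) -> float:
    """How many times longer the text rationale is than the reduced one."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def visual_tokens(width: int, height: int, patch: int = 28) -> int:
    """Hypothetical cost model: one token per patch x patch pixel tile."""
    return math.ceil(width / patch) * math.ceil(height / patch)


# Average reductions quoted in the abstract.
for task, reduction in [("language", 0.2857), ("multimodal", 0.16)]:
    remaining = tokens_after_reduction(1000, reduction)
    print(f"{task}: 1000 text tokens -> {remaining:.0f} "
          f"({length_ratio(reduction):.2f}x shorter)")

# Under the assumed tile size, a 448 x 448 rationale image costs 16 * 16 tokens.
print(visual_tokens(448, 448))
```

Under this arithmetic, a 28.57% reduction corresponds to a text rationale about 1.40 times as long as the optical one, and a 16% reduction to about 1.19 times. Real visual token counts depend on the specific model's image tokenizer, so the tile-based estimate should be read as an illustration only.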



