VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies
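Authors: Mingjian Gao, Wenqiao Zhang, Yuqian Yuan, Yang Dai, Binhe Yu, Zheqi Lv, Haoyu Zheng, Jiaqi Zhu, Zhiqi Ge, Zixuan Wan, Siliang Tang, Yueting Zhuang
Published: 2026-05-28
Code: https://github.com/DCDmllm/VisualThink-VLA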
Abstract
AI-generated summary: VisualThink-VLA enables fast, accurate vision-language-action policies through visual reasoning that preserves spatial precision and reduces latency compared to text-based approaches.
Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irrelevant or weakly textual information can interfere with action prediction, while autoregressive text decoding adds too much latency for real-time closed-loop execution. We present VISUALTHINK-VLA, a visual intermediate-reasoning framework for accurate, low-latency VLA policies. Our bootstrapping philosophy is to guide action with effective visual thinking: VISUALTHINK-VLA bootstraps action prediction through a compact visual-evidence interface that preserves spatial precision while avoiding decoding overhead. To further improve performance and efficiency, VISUALTHINK-VLA adopts a tailored selective routing mechanism to learn the visual evidence tokens, enabling low-latency inference while preserving high-capacity specialization. We also introduce VisualEvidence-Kit, a supervision-and-audit resource centered on a VisualEvidence-Agent that constructs VisualEvidence-Set, a collection of 754.7k VLA instructions for route supervision and counterfactual faithfulness tests. Across multiple benchmarks and real-robot evaluation, VISUALTHINK-VLA achieves the highest success rate on most benchmarks while reducing the multi-second latency of reasoning-augmented baselines to the sub-second regime. For example, on BridgeData V2, it reduces step latency from 8.377 s with ECoT to 0.367 s, a 22.8× speedup.
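The speedup figure follows directly from the two step latencies quoted for BridgeData V2. The short check below is only an illustrative sketch: the helper names `speedup` and `steps_per_second` are hypothetical, and the control-rate numbers assume that per-step latency is the sole bottleneck in the loop, which the abstract does not claim.

```python
def speedup(baseline_s: float, faster_s: float) -> float:
    """Ratio of the baseline step latency to the faster step latency."""
    if baseline_s <= 0 or faster_s <= 0:
        raise ValueError("latencies must be positive")
    return baseline_s / faster_s


def steps_per_second(step_latency_s: float) -> float:
    """Upper bound on the closed-loop rate if each step takes this long.

    Assumption: nothing else in the control loop adds delay.
    """
    if step_latency_s <= 0:
        raise ValueError("latency must be positive")
    return 1.0 / step_latency_s


# Step latencies quoted in the abstract (BridgeData V2), in seconds.
ECOT_LATENCY_S = 8.377
VISUALTHINK_LATENCY_S = 0.367

ratio = speedup(ECOT_LATENCY_S, VISUALTHINK_LATENCY_S)
print(f"speedup: {ratio:.1f}x")  # speedup: 22.8x
print(f"ECoT: {steps_per_second(ECOT_LATENCY_S):.2f} steps/s")  # 0.12
print(f"VisualThink-VLA: {steps_per_second(VISUALTHINK_LATENCY_S):.2f} steps/s")  # 2.72
```

Under that assumption, the reported latencies correspond to roughly one action every eight seconds for ECoT versus between two and three actions per second for VISUALTHINK-VLA.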
arXiv: arxiv.org/abs/2605.30011