Visual Para-Thinker++: A Single-Policy Multi-Agent Framework for Visual Reasoning
Authors: Haoran Xu, Hongyu Wang, Yifei Gao, Jiaze Li, Zizhao Tong, Xiaofeng Zhang, Xiaosong Yuan
Published: 2026-06-08 (arXiv 2606.09290)
Abstract
A multi-agent framework with a shared MLLM policy and role-specific training methods improves visual reasoning by reducing hallucinations and enabling efficient parallel processing.
Visual reasoning requires integrating evidence distributed across regions, attributes, and relations, making single-chain reasoning prone to early perceptual commitment and hallucination. We propose Visual Para-Thinker++, a single-policy multi-agent framework in which one shared MLLM policy is instantiated as role-conditioned Main, Worker, and Summary Agents. The Main Agent decomposes the task with fixed allocation patterns; Worker Agents reason in parallel under context isolation; and the Summary Agent reconciles full Worker reasoning traces rather than majority-voting on final labels. The shared policy is trained by Multi-Agent Capability Injection and Role-Decoupled Multi-Agent Optimization, which assign role-specific rewards and advantages to corresponding token segments to reduce gradient conflict among collaborative roles. A native inference engine enables efficient multi-agent rollout through shared visual prefix and KV cache reuse. Across V*, CountBench, the RefCOCO family, and HallusionBench, Visual Para-Thinker++ consistently outperforms single-trajectory and inference-time parallel baselines, with especially strong gains on hallucination-sensitive visual reasoning.
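The Main, Worker, and Summary flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Policy` callable, the prompt strings, the `WorkerTrace` type, and the `toy_policy` stub are all hypothetical names introduced here, and a thread pool stands in for the paper's native inference engine (no shared visual prefix or KV cache reuse is modeled). It only shows one shared policy called under three roles, Workers running in isolation from each other, and the Summary step receiving full traces rather than final labels.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical signature: (role, prompt) -> generated text. A real system would
# call one shared MLLM here; this sketch never touches an actual model.
Policy = Callable[[str, str], str]


@dataclass
class WorkerTrace:
    subtask: str
    reasoning: str  # the full reasoning trace, not just a final label


def run_pipeline(policy: Policy, question: str, num_workers: int = 3) -> str:
    # Main Agent: decompose the task into at most `num_workers` subtasks
    # (an assumed stand-in for the paper's fixed allocation patterns).
    plan = policy("main", f"Split into {num_workers} subtasks: {question}")
    subtasks = [s.strip() for s in plan.split("\n") if s.strip()][:num_workers]

    # Worker Agents: each sees only the question and its own subtask
    # (context isolation); no Worker sees another Worker's output.
    def work(subtask: str) -> WorkerTrace:
        prompt = f"{question}\nSubtask: {subtask}"
        return WorkerTrace(subtask, policy("worker", prompt))

    with ThreadPoolExecutor(max_workers=max(1, len(subtasks))) as pool:
        traces: List[WorkerTrace] = list(pool.map(work, subtasks))

    # Summary Agent: reads every full trace instead of voting on final labels.
    joined = "\n\n".join(f"[{t.subtask}]\n{t.reasoning}" for t in traces)
    return policy("summary", f"{question}\nWorker traces:\n{joined}")


def toy_policy(role: str, prompt: str) -> str:
    # Stub standing in for the shared policy, so the sketch runs end to end.
    if role == "main":
        return "count red objects\ncount blue objects\ncheck occlusions"
    if role == "worker":
        return "trace for: " + prompt.split("Subtask: ")[-1]
    return f"summary over {prompt.count('trace for:')} traces"


if __name__ == "__main__":
    print(run_pipeline(toy_policy, "How many balls are in the image?"))
```

Passing the same `policy` callable for all three roles mirrors the single-policy idea: only the role argument changes between calls. The role-specific rewards and advantages used in training are outside the scope of this sketch.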