Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?
Authors: Jiaqi Tang, Jianmin Chen, Youyang Zhai, Wei Wei, Runtao Liu, Mengjie Zhao, Xiangyu Wu, Qingfa Xiao, Qifeng Chen
Published: 2026-06-06 · arXiv: 2606.08063
Project page: https://huggingface.co/spaces/Jiaqi-hkust/Robust-U1
Abstract
Robust-U1 enhances multimodal large language models' robustness against visual corruptions through self-recovery capabilities that improve both visual quality and reasoning performance.
Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in visual understanding, yet their performance degrades significantly under real-world visual corruptions. Existing robustness enhancement approaches are limited: black-box feature alignment lacks interpretability, and white-box text-based reasoning cannot restore lost pixel-level details. This work investigates a fundamental research question: Can MLLMs recover corrupted visual content by themselves? To address this, we propose Robust-U1, a novel framework that equips MLLMs with an explicit visual self-recovery capability for robust understanding. The approach comprises three core stages: supervised fine-tuning for initial reconstruction, reinforcement learning with dual rewards (pixel-level SSIM and semantic-level CLIP similarity) to push the recovered images toward high visual quality, and multimodal reasoning that jointly considers both the corrupted input and the recovered image. Extensive experiments demonstrate that Robust-U1 achieves state-of-the-art robustness on the real-world corruption benchmark and maintains superior performance under adversarial corruptions on general VQA benchmarks. Analysis confirms that high-quality visual recovery directly enhances reasoning performance, establishing self-recovery as a critical mechanism for robust visual understanding. The source code is available at https://github.com/jqtangust/Robust-U1.
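The dual reward mentioned in the abstract pairs a pixel-level score with a semantic-level score. The sketch below is only an illustration of that idea and is not taken from the Robust-U1 code: it assumes a simplified single-window SSIM over flattened grayscale pixels in [0, 1], uses cosine similarity on precomputed embedding vectors as a stand-in for CLIP similarity, and combines the two with a hypothetical `pixel_weight`. The function names and the weighted-sum form are assumptions.

```python
import math
from typing import Sequence


def global_ssim(x: Sequence[float], y: Sequence[float], data_range: float = 1.0) -> float:
    """Single-window SSIM over two flattened grayscale images.

    Simplification (assumption): standard SSIM averages over local windows;
    here the whole image is treated as one window.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("images must be non-empty and the same size")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Stabilising constants from the usual SSIM definition.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def dual_reward(
    recovered: Sequence[float],
    reference: Sequence[float],
    recovered_emb: Sequence[float],
    reference_emb: Sequence[float],
    pixel_weight: float = 0.5,  # hypothetical weighting, not from the paper
) -> float:
    """Weighted sum of a pixel-level and a semantic-level similarity score."""
    pixel_score = global_ssim(recovered, reference)
    semantic_score = cosine_similarity(recovered_emb, reference_emb)
    return pixel_weight * pixel_score + (1.0 - pixel_weight) * semantic_score


if __name__ == "__main__":
    clean = [0.1, 0.5, 0.9, 0.3]
    poor_recovery = [0.9, 0.1, 0.2, 0.8]
    # Toy 2-D vectors standing in for image embeddings.
    print(dual_reward(clean, clean, [1.0, 0.0], [1.0, 0.0]))
    print(dual_reward(poor_recovery, clean, [0.0, 1.0], [1.0, 0.0]))
```

In practice the embeddings would come from an image encoder such as CLIP, and the pixel term would typically use a windowed SSIM from an image-processing library. A faithful recovery scores near 1 on both terms, while a recovery that drifts in either pixels or semantics is penalised.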