Hugging Face Daily Papers · 5 min read

Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.08063

Published on Jun 6 · Submitted by Jiaqi Tang on Jun 12
Authors: Jiaqi Tang, Jianmin Chen, Youyang Zhai, Wei Wei, Runtao Liu, Mengjie Zhao, Xiangyu Wu, Qingfa Xiao, Qifeng Chen

Abstract

Robust-U1 enhances multimodal large language models' robustness against visual corruptions through self-recovery capabilities that improve both visual quality and reasoning performance.

Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in visual understanding, yet their performance degrades significantly under real-world visual corruptions. Existing robustness-enhancement approaches are limited: black-box feature alignment lacks interpretability, and white-box text-based reasoning cannot restore lost pixel-level details. This work investigates a fundamental research question: can MLLMs recover corrupted visual content by themselves? To address it, we propose Robust-U1, a framework that equips MLLMs with an explicit visual self-recovery capability for robust understanding. The approach comprises three core stages: supervised fine-tuning for initial reconstruction; reinforcement learning with dual rewards (pixel-level SSIM and semantic-level CLIP similarity) to steer recovery toward high visual quality; and multimodal reasoning that jointly considers the corrupted input and the recovered image. Extensive experiments demonstrate that Robust-U1 achieves state-of-the-art robustness on the real-world corruption benchmark and maintains superior performance under adversarial corruptions on general VQA benchmarks. Analysis confirms that high-quality visual recovery directly enhances reasoning performance, establishing self-recovery as a critical mechanism for robust visual understanding. The source code is available at https://github.com/jqtangust/Robust-U1.
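
The dual-reward idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the two rewards are combined, so the weighted sum, the weights, and every function name below are assumptions made for illustration. SSIM is computed over the whole image as a single window (standard SSIM averages over local windows), and the embeddings are plain vectors standing in for the output of a CLIP image encoder.

```python
import math
from typing import Sequence


def global_ssim(x: Sequence[float], y: Sequence[float],
                dynamic_range: float = 1.0) -> float:
    """Single-window SSIM for two flattened grayscale images of equal size.

    Simplification: the whole image is treated as one window, whereas
    standard SSIM averages the same expression over local windows.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("images must be non-empty and the same size")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) * (a - mu_x) for a in x) / n
    var_y = sum((b - mu_y) * (b - mu_y) for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Conventional stabilising constants from the SSIM definition.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x * mu_x + mu_y * mu_y + c1) * (var_x + var_y + c2)
    return numerator / denominator


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def dual_reward(recovered: Sequence[float], reference: Sequence[float],
                recovered_emb: Sequence[float], reference_emb: Sequence[float],
                w_pixel: float = 0.5, w_semantic: float = 0.5) -> float:
    """Hypothetical combined reward: weighted sum of pixel and semantic terms.

    The equal weights are an assumption, not a value from the paper.
    """
    pixel_term = global_ssim(recovered, reference)
    semantic_term = cosine_similarity(recovered_emb, reference_emb)
    return w_pixel * pixel_term + w_semantic * semantic_term


if __name__ == "__main__":
    clean = [0.1, 0.5, 0.9, 0.3]
    poor_recovery = [0.9, 0.1, 0.2, 0.8]
    print(dual_reward(clean, clean, [1.0, 2.0], [1.0, 2.0]))
    print(dual_reward(poor_recovery, clean, [1.0, 0.0], [0.0, 1.0]))
```

A recovery that matches the clean reference in both pixels and embedding scores close to 1, while one that diverges on either axis scores lower. In practice the pixel term would typically come from a windowed SSIM implementation and the embeddings from a real CLIP encoder.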


Get this paper in your agent:

hf papers read 2606.08063
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

