Hugging Face Daily Papers · 7 min read

RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Vision-Language Models (VLMs) have shown strong visual understanding and are increasingly deployed in embodied AI systems, where reliable perception under real conditions is essential. However, existing benchmarks assess VLMs using clean images or isolated perturbations rather than stresses caused by physical scene formation. This design has two limitations: it covers only a narrow subset of everyday visual stresses, and some perturbations rarely appear in realistic embodied scenes. This gap raises a fundamental question: how can we define visual stress in a principled way that captures the diverse factors encountered in physical environments? To address this question, we formulate visual perception from an inverse graphics perspective and introduce RoboStressBench, a benchmark for evaluating VLM robustness to physical visual stress in embodied scenes. Inspired by the physical rendering equation, RoboStressBench decomposes visual stress into four physically grounded dimensions: Material (M), Viewpoint (V), Lighting (L), and Geometry (G). This design enables RoboStressBench to cover a broad range of visual stresses in real-world environments, while allowing controlled analysis of their effects on VLM capabilities such as visual recognition, reasoning, and planning. Through comprehensive evaluations of state-of-the-art VLMs, we identify stress-specific failure modes and reveal that different physical factors degrade different embodied capabilities, which are often obscured by aggregate accuracy. We further introduce a stress-aware agentic solver that detects visual stressors and invokes visual-editing skills before reasoning, improving robustness in high-stress scenarios. Overall, RoboStressBench provides a principled evaluation framework for diagnosing and improving VLM perception under real-world physical stress, supporting the development of more reliable embodied AI systems.
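The abstract's point that aggregate accuracy can hide stress-specific failures can be illustrated with a small sketch. The records below are invented for illustration and are not results from the paper; only the dimension codes (M, V, L, G) and the capability names come from the abstract, and the helper `accuracy_by` is a hypothetical name.

```python
from collections import defaultdict

# Hypothetical evaluation records: (stress dimension, capability, answered correctly).
# These outcomes are made up for illustration; they are NOT the paper's results.
records = [
    ("M", "recognition", True), ("M", "reasoning", True),
    ("M", "planning", True), ("M", "recognition", True),
    ("V", "recognition", True), ("V", "reasoning", True),
    ("V", "planning", False), ("V", "recognition", True),
    ("L", "recognition", False), ("L", "reasoning", False),
    ("L", "planning", False), ("L", "recognition", True),
    ("G", "recognition", True), ("G", "reasoning", True),
    ("G", "planning", True), ("G", "recognition", True),
]


def accuracy_by(rows, key_index):
    """Group rows by the field at key_index and return accuracy per group."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        key = row[key_index]
        totals[key] += 1
        hits[key] += int(row[2])
    return {key: hits[key] / totals[key] for key in totals}


# One number for the whole benchmark versus a breakdown per stress dimension
# and per capability.
aggregate = sum(int(row[2]) for row in records) / len(records)
per_dimension = accuracy_by(records, 0)
per_capability = accuracy_by(records, 1)

print(f"aggregate accuracy: {aggregate:.2f}")
for name, acc in sorted(per_dimension.items()):
    print(f"  dimension {name}: {acc:.2f}")
for name, acc in sorted(per_capability.items()):
    print(f"  capability {name}: {acc:.2f}")
```

In this toy data the single aggregate score looks acceptable, while the breakdown shows that almost all errors come from one stress dimension and fall mostly on one capability, which is the kind of pattern a per-dimension report is meant to surface.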
arxiv:2606.00828

Published on May 30 · Submitted by LeyiWu on Jun 2
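Authors: Leyi Wu, Yifan Zhao, Jinjie Zhang, Suzeyu Chen, Wosong Chen, Zhifei Chen, Tianshuo Xu, Qingchun He, Hongxin Hu, Haojian Huang, Yangkai Wei, Wenqian Li, Yinchuan Li, Ying-Cong Chen
Project page: https://yuevii.github.io/robostressbench-page/
Code: https://github.com/YUEVII/RoboStressBench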

Abstract

AI-generated summary: RoboStressBench presents a principled benchmark for evaluating vision-language model robustness to physical visual stress in embodied AI, decomposing visual stress into material, viewpoint, lighting, and geometry dimensions.
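For reference, the rendering equation that the paper cites as inspiration is commonly written in the form below. The correspondence to the four stress dimensions noted afterwards is one plausible reading and an assumption here, not the paper's stated formulation.

```latex
L_o(\mathbf{x}, \omega_o) = L_e(\mathbf{x}, \omega_o)
  + \int_{\Omega} f_r(\mathbf{x}, \omega_i, \omega_o)\,
    L_i(\mathbf{x}, \omega_i)\,
    (\omega_i \cdot \mathbf{n})\, \mathrm{d}\omega_i
```

Here $L_o$ is the radiance leaving surface point $\mathbf{x}$ in direction $\omega_o$, $L_e$ is emitted radiance, $f_r$ is the BRDF, $L_i$ is incoming radiance from direction $\omega_i$, and $\mathbf{n}$ is the surface normal. Under the assumed reading, $f_r$ relates to Material, $\omega_o$ to Viewpoint, $L_i$ to Lighting, and $\mathbf{x}$ together with $\mathbf{n}$ to Geometry.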


Get this paper in your agent:

hf papers read 2606.00828
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
