Hugging Face Daily Papers · 5 min read

SleepWalk: A Three-Tier Benchmark for Stress-Testing Instruction-Guided Vision-Language Navigation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.10376


Published on May 11 · Submitted by Aman Chadha on May 13
Authors: Niyati Rawal, Sushant Ravva, Shah Alam Abir, Saksham Jain, Aman Chadha, Vinija Jain, Suranjana Trivedy, Amitava Das

Abstract

SleepWalk is a benchmark for evaluating vision-language models' ability to predict spatially coherent, executable trajectories in 3D environments based on textual instructions and visual observations.

AI-generated summary

Vision-Language Models (VLMs) have advanced rapidly in multimodal perception and language understanding, yet it remains unclear whether they can reliably ground language into spatially coherent, plausibly executable actions in 3D digital environments. We introduce SleepWalk, a benchmark for evaluating instruction-grounded trajectory prediction in single-scene 3D worlds generated from textual scene descriptions and filtered for navigability. Unlike prior navigation benchmarks centered on long-range exploration across rooms, SleepWalk targets localized, interaction-centric embodied reasoning: given rendered visual observations and a natural-language instruction, a model must predict a trajectory that respects scene geometry, avoids collisions, and terminates at an action-compatible location. The benchmark covers diverse indoor and outdoor environments and organizes tasks into three tiers of spatial and temporal difficulty, enabling fine-grained analysis of grounding under increasing compositional complexity. Using a standardized pointwise judge-based evaluation protocol, we evaluate three frontier VLMs on 2,472 curated 3D environments with nine instructions per scene. Results reveal systematic failures in grounded spatial reasoning, especially under occlusion, interaction constraints, and multi-step instructions: performance drops as task difficulty increases. In general, current VLMs are only partially able to produce trajectories that are simultaneously spatially coherent, plausibly executable, and aligned with intended actions. By exposing failures in a controlled yet scalable setting, SleepWalk provides a critical benchmark for advancing grounded multimodal reasoning, embodied planning, vision-language navigation, and action-capable agents in 3D environments.
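
To make the task interface concrete, here is a minimal sketch of what one SleepWalk-style evaluation episode might look like. The dataclass fields and the `generate_waypoints` call are illustrative assumptions, not the paper's actual data format or API.

```python
from dataclasses import dataclass

# Hypothetical episode structure: the model receives rendered views of a
# single scene plus one natural-language instruction, and must output a
# sequence of (x, y, z) waypoints that respects geometry and ends at an
# action-compatible location.
@dataclass
class Episode:
    scene_id: str
    instruction: str
    tier: str                 # "easy" | "medium" | "hard"
    observations: list        # rendered RGB frames of the scene
    start_pose: tuple         # (x, y, z) starting location

def predict_trajectory(model, episode: Episode) -> list[tuple[float, float, float]]:
    """Illustrative stub: query a VLM for a waypoint list given views + text."""
    prompt = (
        f"Instruction: {episode.instruction}\n"
        f"Start at {episode.start_pose}. "
        "Return waypoints as a JSON list of [x, y, z] triples."
    )
    # `generate_waypoints` is an assumed model interface for this sketch.
    return model.generate_waypoints(episode.observations, prompt)
```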

Community

Paper submitter

SleepWalk introduces a scalable single-scene 3D benchmark that tests whether VLMs can translate natural-language instructions into continuous, collision-aware, interaction-compatible trajectories, revealing that current frontier models still fail at precise spatial grounding despite producing plausible-looking paths.

➡️ 𝐊𝐞𝐲 𝐇𝐢𝐠𝐡𝐥𝐢𝐠𝐡𝐭𝐬 𝐨𝐟 𝐒𝐥𝐞𝐞𝐩𝐖𝐚𝐥𝐤:

🧭 𝑺𝒊𝒏𝒈𝒍𝒆-𝑺𝒄𝒆𝒏𝒆 𝟑𝑫 𝑻𝒓𝒂𝒋𝒆𝒄𝒕𝒐𝒓𝒚 𝑩𝒆𝒏𝒄𝒉𝒎𝒂𝒓𝒌: Introduces SleepWalk, built from 2,472 curated 3D environments reconstructed from text using Hunyuan3D-3.0, targeting localized, object-centric embodied reasoning rather than long-horizon room-to-room VLN.
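
The abstract notes that the generated scenes are "filtered for navigability". As a purely illustrative sketch of what such a filter could check (an assumption, not the paper's actual pipeline), one might test whether randomly sampled free cells in an occupancy grid can be connected by collision-free segments:

```python
import numpy as np

def is_navigable(occupancy: np.ndarray, n_samples: int = 50) -> bool:
    """Toy navigability check on a 2D occupancy grid (True = blocked).

    Accept the scene only if most pairs of sampled free cells can be
    joined by a straight, collision-free segment. This is an assumed
    stand-in for the paper's filtering step.
    """
    free = np.argwhere(~occupancy)
    if len(free) < 2:
        return False
    rng = np.random.default_rng(0)
    hits = 0
    for _ in range(n_samples):
        a, b = free[rng.integers(len(free), size=2)]
        # Sample points along the segment and test each against the grid.
        steps = int(np.linalg.norm(b - a)) + 1
        points = np.linspace(a, b, steps).round().astype(int)
        if not occupancy[points[:, 0], points[:, 1]].any():
            hits += 1
    return hits / n_samples > 0.5  # threshold chosen arbitrarily for the sketch
```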

🧩 𝑻𝒉𝒓𝒆𝒆-𝑻𝒊𝒆𝒓 𝑰𝒏𝒔𝒕𝒓𝒖𝒄𝒕𝒊𝒐𝒏 𝑺𝒕𝒓𝒆𝒔𝒔 𝑻𝒆𝒔𝒕: Generates nine instructions per scene with Qwen3-8B-VL, split into easy, medium, and hard tiers that progressively test single-goal localization, compositional spatial grounding, and multi-step interaction-aware planning.
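
For illustration only, the tier split described above might be encoded as a simple schema; the example instructions and the prompt template below are invented for this sketch, not taken from the paper.

```python
# Hypothetical tier definitions mirroring the easy/medium/hard split above.
TIERS = {
    "easy":   "Single-goal localization, e.g. 'Walk to the sofa.'",
    "medium": "Compositional spatial grounding, e.g. 'Stop between the desk "
              "and the lamp, facing the window.'",
    "hard":   "Multi-step interaction-aware planning, e.g. 'Go to the table, "
              "then carry the mug to the sink without crossing the rug.'",
}

def build_generation_prompt(scene_description: str, tier: str) -> str:
    """Assumed prompt template for producing one instruction at a given tier."""
    return (
        f"Scene: {scene_description}\n"
        f"Write one navigation instruction of this difficulty: {TIERS[tier]}"
    )
```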

⚖️ 𝑷𝒐𝒊𝒏𝒕𝒘𝒊𝒔𝒆 𝑱𝒖𝒅𝒈𝒆-𝑩𝒂𝒔𝒆𝒅 𝑬𝒗𝒂𝒍𝒖𝒂𝒕𝒊𝒐𝒏: Proposes a standardized GPT-5-mini judge protocol scoring trajectories on start-location consistency, goal satisfaction, obstacle avoidance, and trajectory efficiency, showing GPT-5-mini leads among tested models but still degrades sharply on harder interaction-heavy tasks.
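
A hedged sketch of how a pointwise judge call over the four criteria could be structured follows; the prompt wording, 0-to-1 scale, and mean aggregation are assumptions, and `judge_model.complete` is a hypothetical interface.

```python
import json

# The four scoring axes named in the highlight above.
CRITERIA = [
    "start-location consistency",
    "goal satisfaction",
    "obstacle avoidance",
    "trajectory efficiency",
]

def judge_trajectory(judge_model, instruction: str, trajectory: list) -> float:
    """Ask a judge model to score one trajectory per criterion on [0, 1].

    The real protocol's prompt, scale, and aggregation may differ; this
    sketch simply averages the per-criterion scores.
    """
    prompt = (
        f"Instruction: {instruction}\n"
        f"Predicted waypoints: {json.dumps(trajectory)}\n"
        "Score each criterion from 0 to 1 and reply as a JSON object "
        f"with these keys: {CRITERIA}."
    )
    scores = json.loads(judge_model.complete(prompt))
    return sum(scores[c] for c in CRITERIA) / len(CRITERIA)
```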



