Hugging Face Daily Papers · 5 min read

Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Repo: https://github.com/showlab/Dream.exe
arxiv:2606.04811

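Published on Jun 4 · Submitted by Rui Zhao on Jun 5
Authors: Rui Zhao, Kaiming Yang, Jifeng Zhu, Siyang Chen, Ziqi Wang, Weijia Wu, Kevin Qinghong Lin, Heng Wang, Mike Zheng Shou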

Abstract

Video generation models were evaluated through robotic manipulation tasks to assess their ability to reflect physical reality, revealing that visual quality does not predict executable motion accuracy.

Video generation models have made impressive strides in synthesizing visually compelling content, yet their outputs remain confined to the virtual domain. A natural question follows: how well do these models reflect the physical world when their generated videos leave the screen and enter reality? We propose robotic manipulation as a concrete, measurable window onto this question: if a model has truly internalized physical laws, the motion it depicts should translate into executable robot behavior. We introduce Dream.exe, an evaluation framework that operationalizes this criterion through a video-to-execution pipeline. Given a scene image and a task description, Dream.exe synthesizes a manipulation video, converts the generated motion into robot trajectories, and executes them in a physics simulator, yielding a grounding signal that purely visual metrics cannot offer. Using this pipeline, we evaluate 8 models spanning frontier closed-source generators, open-source generators, and robot-specific models. Our benchmark covers 101 manually curated manipulation tasks at three levels of physical complexity, measured across visual quality, trajectory fidelity, and execution success. Encouragingly, several models achieve measurable execution success, suggesting that generative priors learned from internet-scale data already encode meaningful physical knowledge. Yet visual quality proves a poor predictor of executability, exposing a dimension of model capability that standard visual evaluations do not capture. Dream.exe will be open-sourced at https://github.com/showlab/Dream.exe.
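
The video-to-execution pipeline described in the abstract can be pictured as three stages chained per task. The sketch below is only an illustration of that control flow under stated assumptions: every name in it (`Task`, `Result`, `evaluate`, `success_rate`, the three stage callables, and the `Waypoint` representation) is hypothetical and is not taken from the Dream.exe repository, and the abstract does not say how trajectories are represented or how success is scored.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# All names here are hypothetical; none come from the Dream.exe codebase.
# Assumption: a trajectory is a list of end-effector positions (x, y, z).
Waypoint = Tuple[float, float, float]


@dataclass
class Task:
    scene_image: str   # path to the initial scene image
    instruction: str   # natural-language task description


@dataclass
class Result:
    task: Task
    success: bool      # did the executed trajectory complete the task?


def evaluate(
    tasks: Sequence[Task],
    generate_video: Callable[[Task], List[bytes]],
    extract_trajectory: Callable[[List[bytes]], List[Waypoint]],
    execute: Callable[[Task, List[Waypoint]], bool],
) -> List[Result]:
    """Run each task through generation, trajectory extraction, and execution."""
    results = []
    for task in tasks:
        frames = generate_video(task)            # stage 1: synthesize a video
        trajectory = extract_trajectory(frames)  # stage 2: motion -> trajectory
        results.append(Result(task, execute(task, trajectory)))  # stage 3
    return results


def success_rate(results: Sequence[Result]) -> float:
    """Fraction of tasks whose executed trajectory succeeded."""
    return sum(r.success for r in results) / len(results) if results else 0.0
```

Passing the stages in as callables keeps the execution backend swappable: the same loop would work with a simulator-backed `execute` or with any other backend that reports success for a task and trajectory.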

Community

One observation is that the framework reveals visual quality to be a poor predictor of executability: internet-scale generative priors already encode some physical knowledge, yet standard visual metrics miss this dimension.

How might the video-to-execution pipeline change if the physics simulator were replaced by real-robot deployment, particularly for tasks where sim-to-real gaps could alter the measured success rates?

I made a podcast on it with ResearchPod, which makes it easy to get the key concepts on the go:
https://researchpod.app/episode/d276b4bc-9df3-4822-be21-f1e126663737


