MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold
Authors: Yang Zhou, Ziheng Wang, Yuqin Lu, Haofeng Liu, Jun Liang, Shengfeng He, Jing Li
Published: 2026-06-11 (arXiv:2606.13376)
Project page: https://orange-3dv-team.github.io/MoVerse/
Code: https://github.com/Orange-3DV-Team/MoVerse
Abstract
MoVerse generates real-time interactive video from single images by creating 360° panoramas and 3D Gaussian scaffolds, enabling efficient rendering through diffusion-based techniques.
We present MoVerse, a real-time video world model that creates an interactively navigable scene from a single narrow-field-of-view image. This setting is challenging because the input observes only a small fraction of the environment, while interactive roaming requires a complete surrounding world, persistent geometry, controllable camera motion, and temporally coherent high-fidelity observations. MoVerse addresses this problem by separating world construction from observation rendering. It first expands the input into a gravity-aligned 360° panorama with topology-aware diffusion, closing the missing field of view before 3D reasoning. It then lifts the panorama into a persistent 3D Gaussian scaffold using panoramic geometry-aware residual prediction, yielding a dense and directly renderable spatial memory. Finally, a Gaussian-conditioned video renderer translates scaffold renderings along user-specified camera trajectories into photorealistic video. To make this renderer practical for interaction, we train a bidirectional diffusion teacher for high-quality conditional rendering and distill it into a causal autoregressive student for bounded-latency streaming. This design combines the controllability and long-range consistency of explicit 3D representations with the perceptual quality of generative video models. MoVerse supports real-time scene roaming at 8 FPS on a single NVIDIA RTX 4090 GPU, demonstrating a practical path toward single-image world creation with interactive video output.
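The abstract does not spell out how the panorama is lifted into 3D, so the following is only a minimal sketch of the standard geometric step such a lift usually starts from: back-projecting each pixel of an equirectangular panorama along its viewing ray by a per-pixel radial depth to obtain candidate Gaussian centers. It assumes an equirectangular projection, a y-up (gravity-aligned) frame, and a known depth map; the function names `equirect_rays` and `lift_panorama` are hypothetical, and the paper's geometry-aware residual prediction is not modeled here.

```python
import numpy as np


def equirect_rays(height: int, width: int) -> np.ndarray:
    """Return unit ray directions, shape (H, W, 3), for an equirectangular image.

    Assumed convention (not taken from the paper): y points up,
    longitude 0 looks along +z, and longitude increases toward +x.
    """
    # Pixel centers mapped to normalized coordinates in (0, 1).
    u = (np.arange(width) + 0.5) / width
    v = (np.arange(height) + 0.5) / height
    # Longitude spans [-pi, pi); latitude spans (-pi/2, pi/2), top row is up.
    lon = (u - 0.5) * 2.0 * np.pi
    lat = (0.5 - v) * np.pi
    lon, lat = np.meshgrid(lon, lat)  # both have shape (H, W)
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1)


def lift_panorama(depth: np.ndarray) -> np.ndarray:
    """Back-project an (H, W) radial depth map to (H*W, 3) points.

    Each point could serve as the center of one Gaussian in a scaffold;
    scale, rotation, opacity, and color would have to come from elsewhere.
    """
    h, w = depth.shape
    rays = equirect_rays(h, w)
    return (rays * depth[..., None]).reshape(-1, 3)


# Toy example: a constant depth of 2 places every point on a sphere of radius 2.
depth = np.full((4, 8), 2.0)
centers = lift_panorama(depth)
```

Because the panorama already covers the full sphere, every viewing direction receives a point, which is one reason closing the field of view before 3D reasoning gives a dense scaffold with no unobserved directions.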
Community
MoVerse is a real-time video world model that transforms a single narrow-field-of-view image into an interactively navigable environment by lifting a topology-aware 360° panorama into a persistent 3D Gaussian scaffold, achieving high-fidelity scene roaming at 8 FPS on a single NVIDIA RTX 4090 GPU.