World Model Self-Distillation: Training World Models to Solve General Tasks
Authors: Sebastian Stapf, Pablo Acuaviva Huertos, Aram Davtyan, Paolo Favaro
Organization: Computer Vision Group @ University of Bern
Published: 2026-06-10
arXiv: 2606.12072
Project page: https://sebastian-stapf.github.io/world-model-self-distillation/
Code: https://github.com/sebastian-stapf/world-model-self-distillation
Abstract
A scalable framework combines self-distillation with reinforcement learning from VLM feedback to elicit task-solving ability in pretrained video diffusion models, without requiring curated task-video data.
Pretrained video generators are promising visual world models that exhibit emergent task-solving abilities; however, their reliance on detailed textual descriptions limits their direct use for planning and decision-making. Existing approaches either outsource this reasoning to language or vision-language models, or rely on supervised fine-tuning with paired task-execution videos, which are costly to collect and difficult to scale. We propose a scalable framework that elicits task-solving ability in such models by combining self-distillation with reinforcement learning. Given an unlabeled scene image, a vision-language model generates a candidate task and a detailed step-by-step solution. The solution conditions a pretrained video diffusion model, the Demonstrator; we distill its behavior into an Executor conditioned only on the image and a short task prompt. This transfers execution knowledge from caption-guided generation to instruction-conditioned task solving without curated task-video supervision. We further improve the Executor with reinforcement learning from VLM feedback, exploiting the asymmetry between judging whether a sampled video satisfies a task and generating the solution. Experiments on our proposed WorldTasks-Benchmark and the DreamGen robotics benchmark show that the Executor surpasses the Demonstrator under our VLM-based evaluation protocol and transfers competitively to robotic tasks.
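The data flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: every name here (`Proposal`, `DistillExample`, `build_distillation_set`, `collect_rewards`, and the `toy_*` stand-ins) is hypothetical, the models are replaced by string-producing stubs, and the RL update itself is omitted because the abstract does not specify the algorithm.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Proposal:
    task: str      # short task prompt; the only text the Executor sees
    solution: str  # detailed step-by-step solution; seen only by the Demonstrator


@dataclass
class DistillExample:
    image: str
    task: str
    video: str     # Demonstrator output used as the distillation target


def build_distillation_set(
    images: List[str],
    propose: Callable[[str], Proposal],
    demonstrate: Callable[[str, str], str],
) -> List[DistillExample]:
    """Stage 1 (self-distillation data): the VLM proposes a task and solution,
    the Demonstrator renders the solution, and the pair (image, short task)
    is stored with the resulting video as the Executor's training target."""
    examples = []
    for image in images:
        proposal = propose(image)
        video = demonstrate(image, proposal.solution)
        # The detailed solution is deliberately dropped here.
        examples.append(DistillExample(image, proposal.task, video))
    return examples


def collect_rewards(
    image: str,
    task: str,
    execute: Callable[[str, str, random.Random], str],
    judge: Callable[[str, str, str], float],
    num_samples: int,
    rng: random.Random,
) -> List[Tuple[str, float]]:
    """Stage 2 (RL signal): sample several Executor videos for one task and
    let a VLM judge score whether each one satisfies the task."""
    samples = [execute(image, task, rng) for _ in range(num_samples)]
    return [(video, judge(image, task, video)) for video in samples]


# Toy stand-ins (assumptions) so the sketch runs without any real model.
def toy_propose(image: str) -> Proposal:
    return Proposal(task=f"tidy {image}",
                    solution=f"step 1: inspect {image}; step 2: tidy it")


def toy_demonstrate(image: str, solution: str) -> str:
    return f"video[{image} | {solution}]"


def toy_execute(image: str, task: str, rng: random.Random) -> str:
    return f"video[{image} | {task} | seed={rng.randint(0, 9)}]"


def toy_judge(image: str, task: str, video: str) -> float:
    return 1.0 if task in video else 0.0


dataset = build_distillation_set(["kitchen.png"], toy_propose, toy_demonstrate)
scored = collect_rewards("kitchen.png", dataset[0].task,
                         toy_execute, toy_judge, 4, random.Random(0))
```

The point of the split is the asymmetry the abstract mentions: `toy_judge` only has to check a finished video against the task, which is easier than producing the step-by-step solution that the Demonstrator needs. In a real system the scored samples would feed a policy-update step for the Executor, which is left out of this sketch.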