Hugging Face Daily Papers · June 11, 2026 · 3 min read

World Model Self-Distillation: Training World Models to Solve General Tasks

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

A scalable framework that trains world models to solve tasks via self-distillation and RL from VLM feedback.

arxiv:2606.12072

Published on Jun 10 · Submitted by taesiri on Jun 11

Computer Vision Group @ University of Bern

Authors: Sebastian Stapf, Pablo Acuaviva Huertos, Aram Davtyan, Paolo Favaro

Abstract

AI-generated summary (Qwen/Qwen2.5-Coder-32B-Instruct): A scalable framework combines self-distillation and reinforcement learning to transfer task-solving abilities from vision-language models to video diffusion models without requiring labeled task-video data.

Pretrained video generators are promising visual world models that exhibit emergent task-solving abilities; however, their reliance on detailed textual descriptions limits their direct use for planning and decision-making. Existing approaches either outsource this reasoning to language or vision-language models, or rely on supervised fine-tuning with paired task-execution videos, which are costly to collect and difficult to scale. We propose a scalable framework that elicits task-solving ability in such models by combining self-distillation with reinforcement learning. Given an unlabeled scene image, a vision-language model generates a candidate task and a detailed step-by-step solution. The solution conditions a pretrained video diffusion model, the Demonstrator; we distill its behavior into an Executor conditioned only on the image and a short task prompt. This transfers execution knowledge from caption-guided generation to instruction-conditioned task solving without curated task-video supervision. We further improve the Executor with reinforcement learning from VLM feedback, exploiting the asymmetry between judging whether a sampled video satisfies a task and generating the solution. Experiments on our proposed WorldTasks-Benchmark and the DreamGen robotics benchmark show that the Executor surpasses the Demonstrator under our VLM-based evaluation protocol and transfers competitively to robotic tasks.
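The data flow described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: every name here (`Proposal`, `DistillExample`, `build_distillation_set`, `score_executor_samples`, and the `toy_*` stand-ins) is hypothetical, images and videos are represented as plain strings, and the mean-reward baseline used to turn VLM judgments into advantages is an assumption, since the abstract does not name the RL algorithm.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Proposal:
    task: str      # short instruction the Executor will be conditioned on
    solution: str  # detailed step-by-step text that conditions the Demonstrator


@dataclass
class DistillExample:
    image: str         # stand-in for an unlabeled scene image
    task: str          # short task prompt only, never the detailed solution
    target_video: str  # Demonstrator output used as the distillation target


def build_distillation_set(
    images: List[str],
    propose: Callable[[str], Proposal],
    demonstrator: Callable[[str, str], str],
) -> List[DistillExample]:
    """Pair each image and short task with a Demonstrator video."""
    examples = []
    for image in images:
        proposal = propose(image)                       # VLM: task + solution
        video = demonstrator(image, proposal.solution)  # caption-guided generation
        examples.append(DistillExample(image, proposal.task, video))
    return examples


def score_executor_samples(
    image: str,
    task: str,
    executor: Callable[[str, str, int], str],
    judge: Callable[[str, str, str], bool],
    num_samples: int,
) -> List[Tuple[str, float]]:
    """Sample Executor videos and convert VLM judgments into advantages.

    Assumption: binary rewards with a mean baseline; the paper may differ.
    """
    videos = [executor(image, task, seed) for seed in range(num_samples)]
    rewards = [1.0 if judge(image, task, v) else 0.0 for v in videos]
    baseline = sum(rewards) / len(rewards)
    return [(v, r - baseline) for v, r in zip(videos, rewards)]


# Toy stand-ins (hypothetical) so the sketch runs without any model.
def toy_propose(image: str) -> Proposal:
    return Proposal(
        task=f"tidy {image}",
        solution=f"step 1: scan {image}; step 2: put items away",
    )


def toy_demonstrator(image: str, solution: str) -> str:
    return f"video[{image} | {solution}]"


def toy_executor(image: str, task: str, seed: int) -> str:
    return f"video[{image} | {task} | seed={seed}]"


def toy_judge(image: str, task: str, video: str) -> bool:
    # Pretend the VLM accepts videos sampled with an even seed.
    seed = int(video.rsplit("seed=", 1)[1].rstrip("]"))
    return seed % 2 == 0


if __name__ == "__main__":
    for example in build_distillation_set(["kitchen"], toy_propose, toy_demonstrator):
        print(example)
    for video, advantage in score_executor_samples(
        "kitchen", "tidy kitchen", toy_executor, toy_judge, num_samples=4
    ):
        print(video, advantage)
```

The point the sketch tries to make visible is the asymmetry in conditioning: the Demonstrator sees the detailed solution, while each stored example keeps only the image and the short task, so an Executor trained on these pairs has to produce the behavior without the step-by-step text. The scoring step likewise only requires the judge to decide whether a sampled video satisfies the task, which is the easier direction the abstract refers to.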

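arXiv: https://arxiv.org/abs/2606.12072 · Project page: https://sebastian-stapf.github.io/world-model-self-distillation/ · GitHub: https://github.com/sebastian-stapf/world-model-self-distillation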

Community

taesiri

Paper submitter about 17 hours ago

A scalable framework that trains world models to solve tasks via self-distillation and RL from VLM feedback.

Get this paper in your agent:

hf papers read 2606.12072

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
