Hugging Face Daily Papers · 4 min read

World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.03603

Published on Jun 2 · Submitted by Yucheng Zhou on Jun 3
Authors: Yucheng Zhou, Wei Tao, Yiwen Guo, Jianbing Shen

Abstract

Controlled concrete reasoning combines visual simulation with abstract reasoning through a training method that uses privileged future information to improve prediction accuracy and robustness.

World models and multimodal large language models (MLLMs) provide complementary capabilities for predicting future outcomes from static visual observations. World models can generate concrete visual rollouts of possible futures, while MLLMs can reason abstractly over questions, goals, and rules. However, generated rollouts are stochastic and may be visually plausible but task-incorrect, making it necessary to determine when visual simulation is useful, whether a rollout is credible, and how it should influence the final answer. We formulate this problem as controlled concrete reasoning, where a model learns to invoke, verify, and integrate visual future simulation alongside abstract reasoning. To study this setting, we construct two human-verified benchmarks, VRQABench for controllable spatial lookahead and OpenWorldQA for open-domain physical prediction, and propose Privileged-Future On-Policy Self-Distillation (PF-OPSD). During training, PF-OPSD uses ground-truth future videos and answers only as teacher-side privileged context to evaluate on-policy concrete-reasoning trajectories, while the deployable student never observes true futures at test time. Experimental results show that PF-OPSD outperforms the baseline by 10.6% and 10.9% on VRQABench and OpenWorldQA, respectively, while increasing robustness to noisy or conflicting rollouts. Our code and dataset are available at https://github.com/yczhou001/PF-OPSD.
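
The invoke, verify, and integrate steps named in the abstract can be pictured as a small control-flow sketch. This is not the authors' implementation and does not use the PF-OPSD code: every name below (`Rollout`, `controlled_concrete_reasoning`, the toy helper functions, and the fixed `credibility_threshold`) is a hypothetical stand-in, and a hand-set threshold replaces whatever learned decision the trained model actually makes.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Rollout:
    """Hypothetical stand-in for a generated visual future."""
    description: str    # what the simulated future shows
    credibility: float  # assumed verifier score in [0, 1]


def controlled_concrete_reasoning(
    question: str,
    abstract_answer: Callable[[str], str],
    needs_simulation: Callable[[str], bool],
    simulate: Callable[[str], Rollout],
    answer_with_rollout: Callable[[str, Rollout], str],
    credibility_threshold: float = 0.5,  # assumption: a fixed cutoff
) -> Tuple[str, str]:
    """Return (answer, route), where route records which path was taken."""
    # Invoke: decide whether visual simulation is worth calling at all.
    if not needs_simulation(question):
        return abstract_answer(question), "abstract_only"

    rollout = simulate(question)

    # Verify: a rollout can look plausible yet be wrong for the task,
    # so a low-credibility rollout is discarded instead of trusted.
    if rollout.credibility < credibility_threshold:
        return abstract_answer(question), "rollout_rejected"

    # Integrate: let the accepted rollout inform the final answer.
    return answer_with_rollout(question, rollout), "rollout_used"


# Toy components, purely illustrative.
def toy_needs_simulation(question: str) -> bool:
    return "will" in question.lower()


def toy_abstract_answer(question: str) -> str:
    return "unknown"


def make_toy_simulator(credibility: float) -> Callable[[str], Rollout]:
    return lambda question: Rollout("ball rolls left", credibility)


def toy_answer_with_rollout(question: str, rollout: Rollout) -> str:
    return rollout.description


if __name__ == "__main__":
    for cred in (0.9, 0.2):
        print(controlled_concrete_reasoning(
            "Where will the ball go?",
            toy_abstract_answer,
            toy_needs_simulation,
            make_toy_simulator(cred),
            toy_answer_with_rollout,
        ))
```

The returned route label makes it easy to check how often simulation is skipped, rejected, or used, which mirrors the three questions the abstract raises about rollouts.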

Community

Paper submitter about 9 hours ago

World Models Meet Language Models: Bridging Concrete and Abstract Reasoning

Can we bridge the gap between physical intuition and logical thought? We explore the powerful synergy between World Models (concrete reasoning) and Large Language Models (abstract reasoning). By integrating these two paradigms, we unlock a new level of multimodal AI capabilities.

Get this paper in your agent:

hf papers read 2606.03603
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Datasets citing this paper 2
