Pretrained to Imagine, Fine-Tuned to Act: The Rise of World-Action Models
Mirrored from NVIDIA Developer Blog for archival readability. Support the source by reading on the original site.
Background: two building blocks. A visuomotor policy maps current observations plus a goal or instruction to robot actions. A world model predicts future visual or latent states from the current state plus an action or goal abstraction. A WAM sits at the overlap: it leverages a pretrained video/world-model backbone as a prior and predicts both future states and robot actions.
Visuomotor policy: language instruction and current observation in, action sequence out.
World model: current world state plus an action abstraction in, future image or latent out.
Introduction
Last year, my Scholar Inbox digest was dominated almost every day by new VLA papers. In recent months, a different keyword has started showing up almost daily as well: WAM, short for World-Action Model. In October 2025, I wrote in my State of VLA post that WAMs were a small subfield within VLA research and far less popular than VLAs initialized from VLMs [60]. That has changed fast, and my wish to see more work in this direction has already become reality.
So what changed, and why now? Maybe it is because WAMs are the shiny new thing everyone wants to work on, or VLA authors ran out of new names for their own VLAs, since basically all “-VLA” names like “X-VLA” and “Ego-VLA” are already used. So now we can recycle them for the WAM area. But more likely it has something to do with VLM-based VLAs getting stuck. Modern VLAs benefited from massive vision-language pretraining, but they still hit a language-to-action grounding wall. The problem of mapping language and pixels into behavior still has to be learned from robot data. WAMs offer a different starting point. They use pretrained video or world-model backbones that already model how scene dynamics change under language conditioning. If that prior transfers to behavior generation, the remaining video-to-action gap may be smaller than learning language-to-action grounding directly.
But the ideas behind WAMs are not new. Early WAMs like UniPi [10] proposed essentially this approach back in 2023. So why did it take several years for the paradigm to enter the robot foundation model mainstream, and where does it actually stand today? This post takes a closer look at the modern WAM landscape to answer the central question:
Central question: Is this a real paradigm shift in research and industry, or just a short hype cycle? And if the recipe works so well, why did it take several years after early papers like UniPi for WAMs to become so popular?
My take: WAMs will become the second major recipe for robot foundation models, alongside VLM-based VLAs. The open questions are which formulation of them wins, and which parts of the model architecture and pipeline actually matter. It is likely that the winner is neither pure VLA nor pure WAM, but a hybrid of both.
This is my map of the modern WAM space: how to categorize and understand WAMs, what changed since the early models, and how current results compare to VLAs. For a broader survey, see the recent NTU survey “World Model for Robot Learning: A Comprehensive Survey” [57], which maps world models for robot learning across simulation, evaluation, navigation, and autonomous driving.
The two representation bets for generalist policies
The field currently has two major representation bets for robot foundation models in both research and industry. Many teams are building on the traditional VLA recipe established by Pi-0 [2] and later refined by Pi-0.5 [4], using VLM backbones as the starting point for policy learning. This VLM-backbone recipe appears in public work from teams including NVIDIA GR00T [5], Xiaomi Robotics [27], Being-H0.5 [28], and others.
More recently, a different paradigm has emerged: using pretrained video backbones as an alternative path toward generalist manipulation. Public examples now span NVIDIA’s DreamZero [8] and Cosmos Policy [13], Ant Group’s LingBot-VA [9], Rhoda AI’s DVA [40], Sereact’s Cortex 2.0 [45], and Mimic Robotics with mimic-video [14]. At the same time, many university labs and open research groups are also pushing the frontier with new ideas, including Video Prediction Policy [24], Unified Video Action Model [39], and Fast-WAM [23]. We discuss these in more detail below.
The choice of backbone impacts the full training and evaluation pipeline, from training recipe and data mixture to inference optimizations. Given the cost of running these models at scale, most teams will likely have to prioritize one direction (VLA or WAM) first rather than fully pursuing both in parallel. Which path proves out, or whether the two converge, is still open. Which one would you bet on today? In the following sections, we dive deeper into both sides of this decision.
Why World-Action Models? Our hypotheses
Before we dive deeper into current models, let’s review why WAMs are attractive as an alternative to VLM-based VLAs. It also helps to place WAMs inside the broader landscape of world models in robotics.
The grounding gap
To understand why WAMs are attractive, it helps to understand the core challenge of “classical” VLAs built on VLM backbones. The motivation for the first VLAs was to leverage the internet-scale knowledge of VLMs for robotics. VLMs are trained on massive amounts of vision-text data and show notable zero-shot performance on many vision tasks. The VLA recipe then adapts these pretrained representations for action generation.
However, there is a major domain gap between VLM pretraining and embodied manipulation. Several VLA papers either observe degradation of pretrained VLM capabilities or design around it, particularly when the action-learning objective diverges sharply from the original VLM objective. VLM2VLA frames this directly as catastrophic forgetting during the VLM-to-VLA transition [55]. Knowledge Insulation reports similar findings and makes the concern architectural: it isolates the gradients of the flow-matching action expert from the VLM backbone to preserve pretrained language/vision knowledge, improving training convergence, task performance, and language following [20]. Recent solutions like VLM co-training and discrete action tokenizers have helped, but the core challenge remains: grounding language into physical action from limited robot data. We cover these solutions in the modern VLA baseline section below.
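To make the gradient-isolation idea concrete, here is a minimal toy sketch in plain Python. It is not the Knowledge Insulation implementation: the scalar "backbone" and "action head", the squared-error loss, and every name below are assumptions chosen only to show what blocking the action-loss gradient at the backbone boundary means.

```python
# Toy scalar model (hypothetical, for illustration only):
#   feature = w_backbone * x        (stand-in for the pretrained backbone)
#   action  = w_head * feature      (stand-in for the action expert)
#   loss    = (action - target_action) ** 2

def action_loss_grads(w_backbone, w_head, x, target_action, insulate):
    """Return (dL/dw_backbone, dL/dw_head) for the action loss."""
    feature = w_backbone * x
    action = w_head * feature
    residual = action - target_action

    # The action head always receives its gradient.
    grad_head = 2.0 * residual * feature

    if insulate:
        # Stop-gradient: the feature is treated as a constant, so the
        # action loss contributes nothing to the backbone update.
        grad_backbone = 0.0
    else:
        # Without insulation, the action loss flows back into the backbone.
        grad_backbone = 2.0 * residual * w_head * x

    return grad_backbone, grad_head


shared = action_loss_grads(0.5, 2.0, x=3.0, target_action=1.0, insulate=False)
insulated = action_loss_grads(0.5, 2.0, x=3.0, target_action=1.0, insulate=True)
print(shared)     # backbone is pulled by the action loss
print(insulated)  # backbone gradient from the action loss is zero
```

In the insulated case the action head still learns, while the backbone would only be updated by whatever other objectives are applied to it.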
This naturally raises the question: what if we started from a backbone that already represents how language maps to visual change in the world?
Core hypotheses for WAMs as policy representations
The core idea is simple: instead of using a VLM backbone to jump-start imitation learning, use a pretrained video backbone. Current video models are trained on large video corpora and learn spatiotemporal representations of how visual scenes evolve. Crucially, current video models are often text-conditioned: they are trained to generate videos from precise language descriptions, sometimes with a reference frame and sometimes from text alone. Many of these videos contain intentional behavior: hands reaching, tools moving, objects being manipulated, and scenes changing because someone or something acted. That makes video backbones attractive as a model prior for generalist manipulation. Before seeing any robot actions, the backbone already encodes useful links between language, visual change, and plausible object interactions. The Veo 3.1 demonstration below is a quick illustration.
I would treat the next three points as hypotheses, not conclusions. They are recurring claims across papers, discussions with peers, and my own read of the field, supported by qualitative intuition, simulation evidence, and a few early real-world signals, but not by clean matched comparisons yet:
- Predicting future world changes correlates with generating the necessary actions. Inverse dynamics prediction is often easier than pure action generation [26]. If the desired outcome is known, inferring the action that produced it is usually simpler than predicting the action directly from the instruction and current observation. Pi-0.7’s visual-subgoal results point in the same direction: when the policy is given a desired future image, action prediction becomes more direct and training converges faster [43].
- Video pretraining provides grounding between language and physical change. Video models learn to map text descriptions to visual outcomes. If this transfers to robotics, it could reduce the amount of grounding that has to be learned from robot demonstrations alone.
- Video data regularizes robot policies. Robot datasets are small relative to web-scale video. Either through pretraining on video first or through co-training on video alongside robot data, the broader visual prior can reduce overfitting; the benefit depends on the dataset, objective, and architecture. DreamZero [8] and Fast-WAM [23] both show that, during robot fine-tuning, WAMs perform best when action learning is co-trained with a video-prediction objective.
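The co-training idea in the last hypothesis can be sketched as a weighted sum of an action loss and a video-prediction loss. This is a minimal illustration under stated assumptions, not the objective from DreamZero or Fast-WAM: mean squared error stands in for whatever losses those papers actually use, and `video_weight` is a hypothetical hyperparameter.

```python
def mse(pred, target):
    """Mean squared error between two equal-length lists of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cotraining_loss(pred_actions, true_actions,
                    pred_frames, true_frames,
                    video_weight=1.0):
    """Combine action and video-prediction losses into one training signal.

    Returns (total, action_term, video_term) so each term can be logged.
    video_weight is an assumed knob; 0.0 recovers action-only training.
    """
    action_term = mse(pred_actions, true_actions)
    video_term = mse(pred_frames, true_frames)
    total = action_term + video_weight * video_term
    return total, action_term, video_term


# Flattened toy "actions" and "future frame pixels" (made-up numbers).
total, action_term, video_term = cotraining_loss(
    pred_actions=[0.0, 1.0], true_actions=[0.0, 0.0],
    pred_frames=[1.0, 1.0, 1.0, 1.0], true_frames=[1.0, 1.0, 1.0, 3.0],
    video_weight=0.5,
)
print(total, action_term, video_term)
```

Setting `video_weight` to zero gives the action-only baseline, which is the kind of comparison the regularization hypothesis would have to be tested against.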
A quick experiment: how much does a frontier video model already “understand” about robot manipulation?
How much do modern video models already capture before any robotics-specific action head is added? We ran a simple experiment with Google’s Veo 3.1, a frontier video generation model. Given a single context frame from an original RoboArena rollout of a toaster task in the DROID setup, we prompted Veo to push the toaster lever (the reference task, matching the original DROID demonstration) and then pick up an orange sitting to the left (the composed extension, beyond the demonstration). This video is very unlikely to be part of Veo’s pretraining data, but we cannot verify the training set directly; treat this as a qualitative check of the prior, not a controlled probe of training-set membership. One-shot attempt, no prompt optimization.
The prompt used was:
“Given this initial frame, generate a video of the robot arm pushing the toaster lever. After finishing that task, the robot should pick up the orange on the left side of the toaster and stop after it has picked it up.”
Context frame and ground-truth rollout:
Veo 3.1 generated rollouts (zero-shot, no robotics fine-tuning):
The generated rollout is surprisingly good for a model that was not explicitly trained as a robot policy. The generated motions are smooth, the background remains stable and consistent, and the robot follows a plausible trajectory toward both target objects. Even the sequencing is respected: finish the lever, then move to the orange.
The limitations are equally visible: the model does not fully push the toaster lever down and at points appears to attempt the opposite motion (pulling it up). More visibly, the pinch gripper from the original DROID setup morphs into a four-fingered hand. The fixed-base robot arm is reimagined, almost instantly after the context frame, as a different robot with fewer degrees of freedom. These artifacts are consistent with the model using broad visual priors rather than faithfully modeling the specific hardware.
Still, the result illustrates why video backbones are attractive for robotics: the model has a useful prior for what robot-object interaction should look like, even though it is not yet reliable enough for control. WAM fine-tuning is the attempt to turn that zero-shot imagination into reliable control.
Understanding modern WAMs: Core formulations
After establishing the core motivation, we can now focus on the current WAM research. In contrast to VLM-based VLAs, where the training recipe has largely converged around VLM co-training with a flow transformer for action generation, WAMs are still splitting into several active formulations. This is exactly what makes the area interesting right now: the field does not yet know which combination of design choices will win, or whether the best systems will merge parts of several.
To make the design space readable, we organize WAMs along three axes (which are not fully independent):
- Paradigm: what does the model predict, and how is the predicted video used to generate actions? (inverse dynamics vs joint prediction vs representation-only)
- Action integration: how do actions actually enter the model? (default action tokens vs action-as-image vs latent actions/plans)
- Architecture: how are the components composed? (Mixture-of-Transformers vs monolithic vs hierarchical)
Some WAMs do not fit well into a single category, so I would not treat this as a perfect taxonomy. It is meant as a practical map for reading the current papers without getting lost in naming choices. For each axis, I present the idea with an older paper and then a modern scaled-up version of the same rough recipe.
Paradigm: What the model predicts
The first axis is the policy formulation: what the model predicts, and how the predicted video is used to generate actions. Across modern WAMs, we see three directions that differ at the inference boundary: inverse dynamics, joint prediction, and representation-only.
Inverse dynamics: Predict the future, then infer the action
The inverse-dynamics setup is the easiest WAM recipe to understand: first imagine the future, then predict the most likely action from the video. This shifts the hard language-grounding problem into the video stage: translate the command into a plausible visual change. The bet is that video pretraining has already learned a useful part of this language-to-visual-change mapping, so the action head does not have to learn everything from robot demos and can focus on the inverse-dynamics problem instead.
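The two-stage structure described above can be sketched in a few lines of Python. Everything here is a toy assumption: the state is a 2D gripper position instead of video frames, `GOALS` is a hypothetical lookup standing in for language grounding, `imagine_future` stands in for the video model, and `inverse_dynamics` stands in for the learned action head. It only shows the control flow, not how any cited model is implemented.

```python
from typing import Dict, List, Tuple

State = Tuple[float, float]   # toy stand-in for a video frame or latent
Action = Tuple[float, float]  # toy stand-in for a robot action (dx, dy)

# Hypothetical instruction-to-goal table; a real video backbone would
# have to learn this language-to-visual-change mapping.
GOALS: Dict[str, State] = {"reach the toaster lever": (4.0, 2.0)}


def imagine_future(state: State, instruction: str, horizon: int) -> List[State]:
    """Stage 1 (video model stand-in): imagine future states toward the goal."""
    goal = GOALS[instruction]
    return [
        (state[0] + (goal[0] - state[0]) * k / horizon,
         state[1] + (goal[1] - state[1]) * k / horizon)
        for k in range(1, horizon + 1)
    ]


def inverse_dynamics(current: State, nxt: State) -> Action:
    """Stage 2 (action head stand-in): infer the action between two states."""
    return (nxt[0] - current[0], nxt[1] - current[1])


def plan_actions(state: State, instruction: str, horizon: int = 4) -> List[Action]:
    """First imagine the future, then read actions off consecutive states."""
    actions = []
    previous = state
    for nxt in imagine_future(state, instruction, horizon):
        actions.append(inverse_dynamics(previous, nxt))
        previous = nxt
    return actions


plan = plan_actions((0.0, 0.0), "reach the toaster lever")
print(plan)
```

Note that the instruction only enters `imagine_future`; the inverse-dynamics step never sees language, which is the sense in which the grounding problem is shifted into the video stage.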