NVIDIA Developer Blog · 17 min read

Pretrained to Imagine, Fine-Tuned to Act: The Rise of World-Action Models

Quick glossary for readers new to VLA/WAM terminology
VLA Vision-Language-Action model: a robot policy that starts from a pretrained VLM backbone and adapts it to generate actions from visual observations and language instructions. Large-scale VLM pretraining is a core part of the recipe. See Pi-0 and GR00T N1.
WAM World-Action Model: a policy that starts from a pretrained world-model or video backbone and adapts it to represent or predict how the scene changes over time and emit corresponding actions. We use WAM as the term throughout this post.
VLM Vision-Language Model: a model pretrained on image-text or video-text data to produce language outputs grounded in visual inputs, usually before being adapted for robot control.
Video backbone A pretrained video model reused as the central representation or generator inside a robot policy.
World model A model that predicts a future world state, conditioned on some action abstraction such as language, robot actions, or latent actions. The predicted state may be represented as images, video, point tracks, object states, or latent features. See the classic World Models paper and NVIDIA’s Cosmos world foundation model paper.
Grounding Connecting symbols (e.g. words in a language instruction) to the perceptual and motor referents that satisfy them. Language-to-action grounding in particular means turning an instruction like “pick up the red mug” into the visual percepts and motor commands that actually accomplish it. The grounding gap is the persistent shortfall between what a model knows about language and what it can reliably cause to happen in the physical world.
Inverse dynamics Given a current observation o_t and a future observation o_{t+k}, infer the most plausible action or action sequence that would produce the transition.
Joint prediction Given o_t and language l_t, train one policy π(o_t, l_t) to predict both future observations o_{t+1:t+k} and actions a_{t:t+k}.
Action chunk A short-horizon action sequence a_{t:t+k} (the k actions a_t, a_{t+1}, …, a_{t+k−1}), such as joint commands, end-effector deltas, and gripper states, predicted in one policy call. See ACT and Diffusion Policy.
Mixture-of-Transformers (MoT) Several modality-specific transformers or experts, such as a video transformer and an action transformer, connected through shared attention while keeping separate weights. See the related Transfusion paper.
Diffusion Transformer (DiT) A transformer backbone used inside diffusion or flow-matching models to denoise image, video, or action tokens over multiple steps. DiT commonly uses adaptive layer normalization (adaLN) to inject timestep conditioning into transformer blocks. See the Peebles and Xie DiT paper.
VAE Variational Autoencoder: in this post, mainly image and video VAEs that compress high-resolution images or videos into latent representations before generation or policy learning. This reduces token count substantially; for example, Wan 2.1’s VAE uses 4× temporal and 8×8 spatial compression, while Wan 2.2-5B uses a higher-compression 4× temporal and 16×16 spatial interface. See the original VAE paper, Rombach et al.’s latent diffusion paper, the Wan paper, and the Wan 2.2 release.
Wan A family of large pretrained video-generation models often used as the video backbone in recent WAMs. See the Wan paper.
Cosmos NVIDIA’s world foundation model family for physical AI, including video prediction models that can be adapted for robotics and policy learning. See the Cosmos paper.
DROID Distributed Robot Interaction Dataset: a large real-world manipulation dataset with more than 50k demonstrations across varied tasks, collected using Franka Panda robot arms. See the DROID paper.
RoboArena A distributed real-world benchmark for evaluating generalist robot policies on open-ended language-conditioned tasks. See the RoboArena paper.
RoboLab A high-fidelity simulation benchmark for analyzing task-generalist robot policies across visual, relational, and procedural competencies. See the RoboLab paper.
CALVIN A language-conditioned manipulation benchmark focused on long-horizon task sequences in simulation. See the CALVIN paper.
LIBERO A robot-learning benchmark for studying knowledge transfer, lifelong learning, and generalization in manipulation. See the LIBERO paper.
RoboTwin A simulation data generator and benchmark for robust bimanual robotic manipulation under domain randomization. See the RoboTwin 2.0 paper.
FAST / BEAST Discrete action-tokenization methods that turn continuous robot actions into token sequences, making action learning more compatible with VLM-style training. See the FAST paper and BEAST paper.
VPP Video Prediction Policy: a WAM-style method that uses predictive visual representations from a video model to condition robot actions. See the VPP paper.
LAPA Latent Action Pretraining from Videos: a method for learning action-like latent variables from videos without ground-truth robot action labels. See the LAPA paper.
OOD Out-of-distribution: a task, object, environment, or instruction outside the examples used during training or demonstration.
FLOP / ZFLOP Floating-point operations measure training compute. 1 ZFLOP equals 10^21 FLOPs.
H100 / GPU-hour H100 is a high-end NVIDIA training GPU. A GPU-hour means one GPU running for one hour, a rough unit for comparing training cost.
BF16 Brain floating point 16-bit: a lower-precision number format commonly used to train large neural networks efficiently.
I2V Image-to-video: a video-generation setup conditioned on an initial image or frame.
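
To make the VAE compression factors from the glossary concrete, here is a small worked sketch in Python. The clip size (81 frames at 480×832) is an assumption chosen only for illustration, and the helpers `latent_grid` and `num_positions` are hypothetical. The sketch uses plain ceiling division on every axis; real video VAEs may treat the first frame or padding differently, and a transformer may patchify the latents further, so read the results as rough latent-position counts rather than exact token counts for any Wan release.

```python
import math


def latent_grid(frames, height, width, t_factor, s_factor):
    """Return (latent_frames, latent_height, latent_width).

    Assumption: each axis is simply divided by its compression factor
    and rounded up. Real video VAEs may handle the first frame or
    padding differently.
    """
    return (
        math.ceil(frames / t_factor),
        math.ceil(height / s_factor),
        math.ceil(width / s_factor),
    )


def num_positions(grid):
    """Number of spatiotemporal latent positions in a grid."""
    t, h, w = grid
    return t * h * w


# Hypothetical clip: 81 frames at 480x832 pixels.
clip = (81, 480, 832)

# 4x temporal with 8x8 spatial compression, as quoted for Wan 2.1's VAE.
grid_8 = latent_grid(*clip, t_factor=4, s_factor=8)
# 4x temporal with 16x16 spatial compression, as quoted for Wan 2.2-5B.
grid_16 = latent_grid(*clip, t_factor=4, s_factor=16)

print("8x8 spatial:  ", grid_8, num_positions(grid_8))
print("16x16 spatial:", grid_16, num_positions(grid_16))
```

Under these assumptions, doubling the spatial compression per axis cuts the number of latent positions by a factor of four at the same temporal factor, which is why the compression interface matters so much for the cost of running a video backbone inside a policy.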

Background: two building blocks. A visuomotor policy maps current observations plus a goal or instruction to robot actions. A world model predicts future visual or latent states from the current state plus an action or goal abstraction. A WAM sits at the overlap: it leverages a pretrained video/world-model backbone as a prior and predicts both future states and robot actions.

Visuomotor policy: language instruction and current observation in, action sequence out.
World model: current world state plus an action abstraction in, future image or latent out.
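
The two building blocks and their overlap can also be read as type signatures. The following Python sketch is only an illustration: the names `VisuomotorPolicy`, `WorldModel`, `WorldActionModel`, and `ToyWAM` are hypothetical, observations and actions are plain lists of floats standing in for images, latents, and robot commands, and the toy model ignores the instruction entirely.

```python
from dataclasses import dataclass
from typing import List, Protocol, Sequence, Tuple

Observation = Sequence[float]  # stand-in for an image or latent state
Action = Sequence[float]       # stand-in for joint or end-effector commands


class VisuomotorPolicy(Protocol):
    def act(self, obs: Observation, instruction: str, k: int) -> List[Action]:
        """Return an action chunk a_t, ..., a_{t+k-1}."""


class WorldModel(Protocol):
    def predict(self, obs: Observation, condition: str, k: int) -> List[Observation]:
        """Return predicted future states o_{t+1}, ..., o_{t+k}."""


class WorldActionModel(Protocol):
    def rollout(
        self, obs: Observation, instruction: str, k: int
    ) -> Tuple[List[Observation], List[Action]]:
        """Return both predicted future states and an action chunk."""


@dataclass
class ToyWAM:
    """Toy model: every action moves each state dimension by a fixed step."""

    step: float = 0.1  # hypothetical constant motion per time step

    def rollout(self, obs, instruction, k):
        # The instruction is unused here; a real WAM would condition on it.
        futures, actions = [], []
        current = list(obs)
        for _ in range(k):
            action = [self.step] * len(current)
            current = [c + a for c, a in zip(current, action)]
            futures.append(current)
            actions.append(action)
        return futures, actions


futures, actions = ToyWAM(step=0.5).rollout([0.0, 1.0], "push the lever", k=3)
print(futures[-1], len(actions))
```

The point of the sketch is the return types: a policy returns only actions, a world model returns only future states, and a WAM returns both from a single call.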

Introduction

Last year, my Scholar Inbox digest was dominated almost every day by new VLA papers. Over the last few months that has shifted, and a different keyword now comes up almost daily as well: WAM, short for World-Action Model. In October 2025, I wrote in my State of VLA post that WAMs were a small subfield within VLA research and far less popular than VLAs initialized from VLMs [60]. That has changed fast, and my wish to see more work in this direction has already become reality.

So what changed, and why now? Maybe it is because WAMs are the shiny new thing everyone wants to work on, or because VLA authors ran out of names for their own VLAs: basically all “-VLA” names like “X-VLA” and “Ego-VLA” are already taken, so the prefixes can now be recycled for WAMs. But more likely it has something to do with VLM-based VLAs getting stuck. Modern VLAs benefited from massive vision-language pretraining, but they still hit a language-to-action grounding wall. The mapping from language and pixels to behavior still has to be learned from robot data. WAMs offer a different starting point. They use pretrained video or world-model backbones that already model how scenes change under language conditioning. If that prior transfers to behavior generation, the remaining video-to-action gap may be smaller than learning language-to-action grounding directly.

But the ideas behind WAMs are not new. Early WAMs like UniPi [10] proposed essentially this approach back in 2023. So why did it take several years for the paradigm to enter the robot foundation model mainstream, and where does it actually stand today? This post takes a closer look at the modern WAM landscape to answer the central question:

Central question: Is this a real paradigm shift in research and industry, or just a short hype cycle? And if the recipe works so well, why did it take several years after early papers like UniPi for WAMs to become so popular?

My take: WAMs will become the second major recipe for robot foundation models, alongside VLM-based VLAs. The open questions are which formulation of them wins, and which parts of the model architecture and pipeline actually matter. It is likely that the winner is neither pure VLA nor pure WAM, but a hybrid of both.

This is my map of the modern WAM space: how to categorize and understand WAMs, what changed since the early models, and how current results compare to VLAs. For a broader survey, see the recent NTU survey “World Model for Robot Learning: A Comprehensive Survey” [57], which maps world models for robot learning across simulation, evaluation, navigation, and autonomous driving.

The two representation bets for generalist policies

Figure 1. The two current bets for generalist manipulation policies: VLM-based VLAs vs video-backbone WAMs.

The field currently has two major representation bets for robot foundation models in both research and industry. Many teams are building on the traditional VLA recipe established by Pi-0 [2] and later refined by Pi-0.5 [4], using VLM backbones as the starting point for policy learning. This VLM-backbone recipe appears in public work from teams including NVIDIA GR00T [5], Xiaomi Robotics [27], Being-H0.5 [28], and others.

More recently, a different paradigm has emerged: using pretrained video backbones as an alternative path toward generalist manipulation. Public examples now span NVIDIA’s DreamZero [8] and Cosmos Policy [13], Ant Group’s LingBot-VA [9], Rhoda AI’s DVA [40], Sereact’s Cortex 2.0 [45], and Mimic Robotics with mimic-video [14]. At the same time, many university labs and open research groups are also pushing the frontier with new ideas, including Video Prediction Policy [24], Unified Video Action Model [39], and Fast-WAM [23]. We discuss these in more detail below.

The choice of backbone impacts the full training and evaluation pipeline, from training recipe and data mixture to inference optimizations. Given the cost of running these models at scale, most teams will likely have to prioritize one direction (VLA or WAM) first rather than fully pursuing both in parallel. Which path proves out, or whether the two converge, is still open. Which one would you bet on today? In the following sections, we dive deeper into both sides of this decision.

Why World-Action Models? Our hypotheses

Before we dive deeper into current models, let’s review why WAMs are attractive as an alternative to VLM-based VLAs. It also helps to place WAMs inside the broader landscape of world models in robotics.

Figure 2. World models in robotics. Action-conditioned world models (DreamDojo, Genie, JEPA-WM) predict future states from a learned action abstraction. Video world models (Cosmos-3, Wan, Veo, LTX-Video) predict future video conditioned on language and a reference frame. World-Action Models (WAM) like DreamZero, LingBot-VA, UniPi, and mimic-video sit at the intersection: they reuse a video or world-model backbone inside a robot policy that emits actions.

The grounding gap

To understand why WAMs are attractive, it helps to understand the core challenge of “classical” VLAs built on VLM backbones. The motivation for the first VLAs was to leverage the internet-scale knowledge of VLMs for robotics. VLMs are trained on massive amounts of vision-text data and show notable zero-shot performance on many vision tasks. The VLA recipe then adapts these pretrained representations for action generation.

However, there is a major domain gap between VLM pretraining and embodied manipulation. Several VLA papers either observe degradation of pretrained VLM capabilities or design around it, particularly when the action-learning objective diverges sharply from the original VLM objective. VLM2VLA frames this directly as catastrophic forgetting during the VLM-to-VLA transition [55]. Knowledge Insulation reports similar findings and makes the concern architectural: it isolates the gradients of the flow-matching action expert from the VLM backbone to preserve pretrained language/vision knowledge, improving training convergence, task performance, and language following [20]. Recent solutions like VLM co-training and discrete action tokenizers have helped, but the core challenge remains: grounding language into physical action from limited robot data. We cover these solutions in the modern VLA baseline section below.

This naturally raises the question: what if we started from a backbone that already represents how language maps to visual change in the world?

Core hypotheses for WAMs as policy representations

The core idea is simple: instead of using a VLM backbone to jump-start imitation learning, use a pretrained video backbone. Current video models are trained on large video corpora and learn spatiotemporal representations of how visual scenes evolve. Crucially, current video models are often text-conditioned: they are trained to generate videos from precise language descriptions, sometimes with a reference frame and sometimes from text alone. Many of these videos contain intentional behavior: hands reaching, tools moving, objects being manipulated, and scenes changing because someone or something acted. That makes video backbones attractive as a model prior for generalist manipulation. Before seeing any robot actions, the backbone already encodes useful links between language, visual change, and plausible object interactions. The Veo 3.1 demonstration below is a quick illustration.

I would treat the next three points as hypotheses, not conclusions. They are recurring claims across papers, discussions with peers, and my own read of the field, supported by qualitative intuition, simulation evidence, and a few early real-world signals, but not by clean matched comparisons yet:

  1. Predicting future world changes correlates with generating the necessary actions. Inverse dynamics prediction is often easier than pure action generation [26]. If the desired outcome is known, inferring the action that produced it is usually simpler than predicting the action directly from the instruction and current observation. Pi-0.7’s visual-subgoal results point in the same direction: when the policy is given a desired future image, action prediction becomes more direct and training converges faster [43].
  2. Video pretraining provides grounding between language and physical change. Video models learn to map text descriptions to visual outcomes. If this transfers to robotics, it could reduce the amount of grounding that has to be learned from robot demonstrations alone.
  3. Video data regularizes robot policies. Robot datasets are small relative to web-scale video. Either through pretraining on video first or through co-training on video alongside robot data, the broader visual prior can reduce overfitting; the benefit depends on the dataset, objective, and architecture. DreamZero [8] and Fast-WAM [23] both show that, during robot fine-tuning, WAMs perform best when action learning is co-trained with a video-prediction objective.
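
Hypothesis 3 mentions co-training action learning with a video-prediction objective. A minimal sketch of what such a combined objective could look like is a weighted sum of an action term and a video term. This post does not specify the losses the cited systems use, so the mean squared error, the `video_weight` parameter, and the function names below are placeholders for illustration only.

```python
def mse(pred, target):
    """Mean squared error between two equal-length lists of floats."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cotraining_loss(pred_actions, true_actions,
                    pred_latents, true_latents,
                    video_weight=1.0):
    """Weighted sum of an action loss and a video-prediction loss.

    video_weight is a hypothetical knob: 0.0 recovers an action-only
    objective, larger values put more weight on predicting the future.
    """
    action_loss = mse(pred_actions, true_actions)
    video_loss = mse(pred_latents, true_latents)
    total = action_loss + video_weight * video_loss
    return total, action_loss, video_loss


# Toy numbers: two action dimensions, four latent values.
total, action_loss, video_loss = cotraining_loss(
    pred_actions=[0.0, 1.0], true_actions=[0.0, 2.0],
    pred_latents=[1.0, 1.0, 1.0, 1.0], true_latents=[1.0, 1.0, 1.0, 3.0],
    video_weight=0.5,
)
print(total, action_loss, video_loss)
```

Setting `video_weight` to zero gives the action-only baseline that the co-training claim would be compared against.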

A quick experiment: how much does a frontier video model already “understand” about robot manipulation?

How much do modern video models already capture before any robotics-specific action head is added? We ran a simple experiment with Google’s Veo 3.1, a frontier video generation model. Given a single context frame from an original RoboArena rollout of a toaster task in the DROID setup, we prompted Veo to push the toaster lever (the reference task, matching the original DROID demonstration) and then pick up an orange sitting to the left (the composed extension, beyond the demonstration). This video is very unlikely to be part of Veo’s pretraining data, but we cannot verify the training set directly; treat this as a qualitative check of the prior, not a controlled probe of training-set membership. This was a single attempt with no prompt optimization.

The prompt used was:

“Given this initial frame, generate a video of the robot arm pushing the toaster lever. After finishing that task, the robot should pick up the orange on the left side of the toaster and stop after it has picked it up.”

Context frame and ground-truth rollout:

Figure 3. Context frame from a RoboArena toaster task in the DROID setup.
Figure 4. Ground-truth rollout: robot pushes the toaster lever.

Veo 3.1 generated rollouts (zero-shot, no robotics fine-tuning):

Figure 5. Veo 3.1 rollout for the reference task (pushing the toaster lever).
Figure 6. Veo 3.1 rollout for the composed extension (lever push followed by orange pickup).
Figure 7. Animated rollout of the full composed-extension sequence: lever push followed by orange pickup.

The generated rollout is surprisingly good for a model that was not explicitly trained as a robot policy. The generated motions are smooth, the background remains stable and consistent, and the robot follows a plausible trajectory toward both target objects. Even the sequencing is respected: finish the lever, then move to the orange.

The limitations are equally visible. The model does not fully push the toaster lever down and at points appears to attempt the opposite motion (pulling it up). More strikingly, the pinch gripper from the original DROID setup morphs into a four-fingered hand, and the fixed-base robot arm is reimagined, almost instantly after the context frame, as a different robot with fewer degrees of freedom. These artifacts are consistent with the model using broad visual priors rather than faithfully modeling the specific hardware.

Still, the result illustrates why video backbones are attractive for robotics: the model has a useful prior for what robot-object interaction should look like, even though it is not yet reliable enough for control. WAM fine-tuning is the attempt to turn that zero-shot imagination into reliable control.

Understanding modern WAMs: Core formulations

After establishing the core motivation, we can now focus on the current WAM research. In contrast to VLM-based VLAs, where the training recipe has largely converged around VLM co-training with a flow transformer for action generation, WAMs are still splitting into several active formulations. This is exactly what makes the area interesting right now: the field does not yet know which combination of design choices will win, or whether the best systems will merge parts of several.

To make the design space readable, we organize WAMs along three axes (which are not fully independent):

  1. Paradigm: what does the model predict, and how is the predicted video used to generate actions? (inverse dynamics vs joint prediction vs representation-only)
  2. Action integration: how do actions actually enter the model? (default action tokens vs action-as-image vs latent actions/plans)
  3. Architecture: how are the components composed? (Mixture-of-Transformers vs monolithic vs hierarchical)
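
One way to keep the three axes straight while reading papers is to write them down as a small data structure. The sketch below is a hypothetical bookkeeping aid in Python, not a taxonomy taken from any paper: the enum and class names are made up for illustration, the option labels are copied from the list above, and the example entry is a fictional model.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Paradigm(Enum):
    INVERSE_DYNAMICS = "inverse dynamics"
    JOINT_PREDICTION = "joint prediction"
    REPRESENTATION_ONLY = "representation-only"


class ActionIntegration(Enum):
    ACTION_TOKENS = "default action tokens"
    ACTION_AS_IMAGE = "action-as-image"
    LATENT_ACTIONS = "latent actions/plans"


class Architecture(Enum):
    MIXTURE_OF_TRANSFORMERS = "Mixture-of-Transformers"
    MONOLITHIC = "monolithic"
    HIERARCHICAL = "hierarchical"


@dataclass(frozen=True)
class WAMDesign:
    """One point in the design space: a choice on each of the three axes."""

    name: str
    paradigm: Paradigm
    action_integration: ActionIntegration
    architecture: Architecture

    def describe(self):
        return (
            f"{self.name}: {self.paradigm.value} / "
            f"{self.action_integration.value} / {self.architecture.value}"
        )


# Fictional example entry, not a real system.
example = WAMDesign(
    name="hypothetical-wam",
    paradigm=Paradigm.JOINT_PREDICTION,
    action_integration=ActionIntegration.ACTION_TOKENS,
    architecture=Architecture.MIXTURE_OF_TRANSFORMERS,
)
print(example.describe())

# Upper bound on distinct combinations if the axes were fully independent.
all_combinations = list(product(Paradigm, ActionIntegration, Architecture))
print(len(all_combinations))
```

The count of 27 combinations is only an upper bound: since the axes are not fully independent, some combinations are unlikely to make sense in practice.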

Some WAMs do not fit well into a single category, so I would not treat this as a perfect taxonomy. It is meant as a practical map for reading the current papers without getting lost in naming choices. For each axis, I present the idea with an older paper and then a modern scaled-up version of the same rough recipe.

Figure 8. The WAM design space at a glance. Left: The three paradigms differ in what the model predicts. An inverse-dynamics WAM generates future video and then derives actions from it. A joint-prediction WAM emits video and actions together. A representation-only WAM uses the video backbone purely as a representation and skips video generation at inference. Middle: The three action-integration choices differ in how actions enter the model. Actions can be standalone tokens. They can be image-shaped targets the video model natively denoises. Or they can be compressed latent actions and plans. Right: The three architecture styles differ in how the components are composed. A monolithic transformer handles everything in one stack. Modality-specific experts coupled by shared attention (MoT) keep separate weights but share information. A hierarchical pipeline runs a video module before an action module. The rest of this section walks through each axis in turn.

Paradigm: What the model predicts

The first axis is the policy formulation: what the model predicts, and how the predicted video is used to generate actions. Across modern WAMs, we see three directions that differ at the inference boundary: inverse dynamics, joint prediction, and representation-only.

Inverse dynamics: Predict the future, then infer the action

Figure 9. Inverse-Dynamics WAM (abstract). A video model first produces future frames or latents from the language instruction and current observation; an inverse-dynamics head then maps the predicted transition into a sequence of actions. Specific systems differ in whether they use full RGB futures (LingBot-VA, DVA), latent video features (VPP, mimic-video), or only intermediate features.

The inverse-dynamics setup is the easiest WAM recipe to understand: first imagine the future, then predict the most likely action from the video. This shifts the hard language-grounding problem into the video stage: translate the command into a plausible visual change. The bet is that video pretraining has already learned a useful part of this language-to-visual-change mapping, so the action head does not have to learn everything from robot demos and can focus on the inverse-dynamics problem instead.
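
The two-stage structure described above can be sketched with a toy example. In a real system the first stage is a pretrained video model conditioned on language and the second stage is a learned inverse-dynamics head; here both are replaced by hand-written Python functions, the "observation" is a 2D position, and the goal position is passed in directly as a stand-in for language conditioning. All function names are hypothetical.

```python
def imagine_future(obs, goal, k):
    """Stand-in for the video model.

    Returns k future states, linearly interpolated from obs to goal.
    """
    return [
        [o + (g - o) * (i / k) for o, g in zip(obs, goal)]
        for i in range(1, k + 1)
    ]


def inverse_dynamics(state, next_state):
    """Toy inverse dynamics: the action is the displacement between states."""
    return [n - s for s, n in zip(state, next_state)]


def inverse_dynamics_policy(obs, goal, k):
    """Stage 1: imagine the future. Stage 2: infer actions from transitions."""
    futures = imagine_future(obs, goal, k)
    states = [list(obs)] + futures
    actions = [inverse_dynamics(s, n) for s, n in zip(states, states[1:])]
    return futures, actions


futures, actions = inverse_dynamics_policy(obs=[0.0, 0.0], goal=[1.0, 2.0], k=4)
print(futures[-1])
print(actions)
```

In this toy setting the action head never sees the goal or the instruction; it only looks at consecutive imagined states. That is the division of labor the inverse-dynamics recipe bets on.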

UniPi overview. A text-conditioned video generator produces a future image sequence from the current frame and language instruction; a separate inverse-dynamics module then extracts actions from consecutive frames.
