NVIDIA Developer Blog · · 8 min read

How to Post-Train Autonomous Vehicle Models in Closed-Loop with NVIDIA Alpamayo

Developing autonomous vehicle (AV) policies requires bridging an important gap between training and deployment. Vision-language-action (VLA) models that can reason over more complex driving scenes and produce richer intermediate reasoning are predominantly trained in open-loop, where model outputs are directly compared to ground-truth behaviors without considering their effect on the environment.

In deployment, however, a driving policy runs in closed-loop, where every braking, steering, and navigation decision affects the environment, and small errors can compound over time.

NVIDIA Alpamayo, an open portfolio of AI models, simulation frameworks, and physical AI datasets for AV development, provides a systematic way to address this challenge. Alpamayo includes the AlpaSim AV simulation platform and the AlpaGym closed-loop training framework (coming soon).

This post explains how to train AV models in closed-loop with NVIDIA Alpamayo. Specifically, it walks through how to:

  • Install and configure AlpaGym 
  • Define closed-loop rewards
  • Launch closed-loop training
  • Export the post-trained checkpoint for downstream use

Closed-loop post-training with AlpaGym extends AV training workflows by turning AlpaSim rollouts into training experience. Rather than treating simulation only as a final evaluation stage, AlpaGym connects simulator feedback directly to the policy training loop.

Workflow diagram showing a driving model (such as Alpamayo) undergoing reinforcement learning post-training in AlpaGym, including Data Collection, Closed-Loop Simulation, Driving Model, Policy Training and Orchestration.
Figure 1. End-to-end workflow for post-training a driving model such as Alpamayo using AlpaGym

How to use AlpaGym for closed-loop reinforcement learning

Reinforcement learning (RL) can be used to improve a policy that was initially trained in open-loop. Instead of optimizing only against logged expert trajectories, the model can now learn from the consequences of its own actions in simulation.

This shift is critical for AV development, where small prediction or planning errors can compound over time. In closed-loop training, each braking, steering, and navigation decision affects the next state of the environment, revealing failure modes that static datasets or open-loop evaluation may miss.
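
To make the compounding effect concrete, here is a minimal toy sketch. It is not part of AlpaGym: the one-dimensional lateral-offset dynamics, the constant `bias`, and both function names are illustrative assumptions. In the open-loop case every prediction starts from the logged expert state, so the error stays at the size of the bias. In the closed-loop case the policy's own output becomes the next state, so the same bias accumulates step after step.

```python
def open_loop_errors(expert_offsets, bias):
    """Per-step error when every prediction starts from the logged expert state."""
    # The toy policy predicts the expert's next offset plus a fixed bias.
    return [abs((nxt + bias) - nxt) for nxt in expert_offsets[1:]]


def closed_loop_errors(expert_offsets, bias):
    """Per-step error when the policy's own output becomes the next state."""
    errors = []
    state = expert_offsets[0]
    for prev, nxt in zip(expert_offsets, expert_offsets[1:]):
        # The policy applies the expert's step change to its own state, plus the bias.
        state = state + (nxt - prev) + bias
        errors.append(abs(state - nxt))
    return errors


expert = [0.0] * 11  # the expert stays centered in the lane for 10 steps
open_err = open_loop_errors(expert, bias=0.02)
closed_err = closed_loop_errors(expert, bias=0.02)
```

With these toy numbers the open-loop error never exceeds the bias, while the closed-loop error after 10 steps is roughly ten times larger. An open-loop metric would not reveal that drift.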

However, enabling closed-loop RL comes with its own challenges. Running model inference, simulation, and training in parallel, while also syncing weight updates, communicating across instances, and moving data, is complex. Doing this robustly yet flexibly requires orchestration and efficient use of compute resources.

Perspective grid of driving-scene clips showing many AlpaSim closed-loop rollout instances running in parallel across different road scenarios for AlpaGym reinforcement learning.
Figure 2. AlpaGym enables large-scale closed-loop training, where driving models learn from the consequences of their own actions across a wide variety of simulated scenarios, greatly reducing the difference between training and deployment

To address these challenges, AlpaGym connects policy training to AlpaSim closed-loop rollouts and provides an open source, high-throughput framework for closed-loop RL. The system combines AlpaSim simulator microservices, NVIDIA Physical AI Open Datasets, and the distributed NVIDIA Cosmos-RL training framework into a scalable post-training pipeline.

Built to scale from a single GPU to multi-node GPU clusters, AlpaGym supports efficient large-scale training through an asynchronous and stable distributed RL pipeline, without requiring changes to user code. It integrates AlpaSim and Cosmos-RL as its runtime and orchestration layer, uses GRPO as the default algorithm, and includes reference reward functions tested with Alpamayo models and the Physical AI AV NuRec dataset.
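
For readers new to GRPO (Group Relative Policy Optimization), its core idea is to score each rollout relative to the other rollouts sampled for the same input, rather than against a learned value function. The sketch below shows only that normalization step. Grouping rollouts per scene, using the population standard deviation, and the `eps` value are assumptions made for illustration; the actual AlpaGym and Cosmos-RL implementation may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts that share the same scene."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every rollout in the group scores the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


scene_rewards = [4.0, 6.0, 5.0, 1.0]  # hypothetical rewards for four rollouts of one scene
advantages = group_relative_advantages(scene_rewards)
```

Rollouts that beat the group mean get a positive advantage and are reinforced; those below it get a negative one. A group where every rollout scores the same contributes no learning signal.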

To get started with AlpaGym post-training, follow the steps outlined below.

Step 1: Install and configure AlpaGym

To install AlpaGym from the Alpamayo checkout, install the native CUDA dependencies, Redis, and Git LFS on the host, authenticate with Hugging Face, then sync the uv workspace:

sudo apt-get update
sudo apt-get install -y libcudnn9-dev-cuda-12 \
  libnccl-dev=2.26.2-1+cuda12.8 libnccl2=2.26.2-1+cuda12.8 \
  redis-server git-lfs

git lfs install
git lfs pull

huggingface-cli login
# Or export HF_TOKEN=...

uv sync --all-packages

The Python environment is managed by uv, but cuDNN, NCCL, and the redis-server binary are host dependencies used by the CUDA model stack and Cosmos-RL. Alternatively, a suitable Dockerfile is provided. Hugging Face authentication is required to download the scene artifacts.

An AlpaGym run is a Hydra configuration. It specifies the policy checkpoint, the AlpaSim scene set, rollout parallelism, reward function, and Cosmos-RL training parameters. In this workflow, the starting checkpoint is an Alpamayo model.

Architecture diagram of AlpaGym closed-loop post-training, showing AlpaSim simulator sessions sending sensor data and receiving driving actions through rollout workers, while a policy trainer and orchestrator update the model and coordinate data flow.
Figure 3. In AlpaGym closed-loop post-training, the host process starts AlpaSim, rollout workers expose policy drivers, AlpaSim executes simulator sessions, and AlpaGym returns rollout artifacts and rewards to the trainer

Step 2: Define the closed-loop reward

The reward should match the behavior you want to improve in closed-loop. For trajectory-quality post-training, common reward terms include progress, lane keeping, collision avoidance, offroad rate, comfort, and distance to a reference trajectory.
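
As one example of such a term, distance to a reference trajectory can be measured as the mean Euclidean distance between matching waypoints. The sketch below is an assumption-laden illustration, not an AlpaSim metric: the function name is hypothetical, and it assumes both trajectories are lists of (x, y) points sampled at the same timestamps.

```python
import math


def mean_distance_to_reference(trajectory, reference):
    """Average Euclidean distance between matching (x, y) waypoints."""
    if not trajectory or len(trajectory) != len(reference):
        raise ValueError("trajectories must be non-empty and the same length")
    total = sum(math.dist(p, q) for p, q in zip(trajectory, reference))
    return total / len(trajectory)


# Two waypoints that are 3 and 4 units away from their reference points.
deviation = mean_distance_to_reference([(0.0, 0.0), (1.0, 0.0)], [(0.0, 3.0), (1.0, 4.0)])
```

A term like this would typically enter the reward with a negative scale, so that larger deviations lower the reward.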

A practical first reward is intentionally simple: combine progress with penalties for safety-critical failures. In AlpaGym, this can be expressed as a small sum of terms, using AlpaSim metrics where possible:

# reward/progress_safety.yaml
terms:
  - kind: metric
    metric_name: progress
    scale: 1.0
  - kind: metric
    metric_name: collision_any
    scale: -10.0
  - kind: metric
    metric_name: offroad
    scale: -5.0

Once the pipeline is stable, add more targeted terms for the failure modes observed in AlpaSim videos and metrics.
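
The weighted sum expressed by the config above can be sketched in a few lines of Python. This shows the arithmetic only, not AlpaGym's reward code. It assumes each metric arrives as one scalar per episode, with `progress` as a fraction and `collision_any` and `offroad` as 0/1 flags; the function name and the example values are hypothetical.

```python
# Mirrors the three terms in reward/progress_safety.yaml.
TERMS = [
    {"metric_name": "progress", "scale": 1.0},
    {"metric_name": "collision_any", "scale": -10.0},
    {"metric_name": "offroad", "scale": -5.0},
]


def episode_reward(metrics, terms=TERMS):
    """Sum scale * metric over all terms; a missing metric raises KeyError."""
    return sum(term["scale"] * float(metrics[term["metric_name"]]) for term in terms)


clean = episode_reward({"progress": 0.8, "collision_any": 0, "offroad": 0})
crash = episode_reward({"progress": 0.9, "collision_any": 1, "offroad": 0})
```

With these scales a single collision outweighs any achievable progress, so the crashed episode scores far below the clean one even though it travelled further. That ordering is what makes the penalties dominate.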

Step 3: Launch closed-loop post-training

Start AlpaGym training from your model checkpoint. Alpamayo serves as an example model here.

uv run -m alpagym_host.cli \
  policy=alpamayo \
  policy.model.kind=alpamayo_r1 \
  policy.model.path=/path/to/checkpoint \
  reward=progress_safety

This will bring up AlpaGym with AlpaSim on a single GPU. Stay tuned for detailed instructions on how to use your own AV model.
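
The `key=value` arguments in the launch command follow Hydra's override syntax. An argument such as `policy=alpamayo` typically selects a named option from a config group, while a dotted key such as `policy.model.path=...` overrides a single value inside the composed config. The toy below imitates that behavior with plain dictionaries to show how the four arguments combine. It is not Hydra's implementation, and the contents of `GROUPS` are hypothetical stand-ins for AlpaGym's YAML files.

```python
import copy

# Hypothetical stand-ins for the YAML files each config group would contain.
GROUPS = {
    "policy": {"alpamayo": {"model": {"kind": None, "path": None}}},
    "reward": {"progress_safety": {"terms": ["progress", "collision_any", "offroad"]}},
}


def compose(overrides, groups=GROUPS):
    """Build a nested config from `group=option` and `dotted.key=value` overrides."""
    config = {}
    for item in overrides:
        key, _, value = item.partition("=")
        if key in groups:
            # Group selection: load the whole named option under that key.
            config[key] = copy.deepcopy(groups[key][value])
        else:
            # Value override: walk the dotted path and set the leaf.
            *parents, leaf = key.split(".")
            node = config
            for name in parents:
                node = node.setdefault(name, {})
            node[leaf] = value
    return config


cfg = compose([
    "policy=alpamayo",
    "policy.model.kind=alpamayo_r1",
    "policy.model.path=/path/to/checkpoint",
    "reward=progress_safety",
])
```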

During training, AlpaGym requests scene rollouts from AlpaSim, collects per-episode artifacts, computes rewards, and updates the policy. Useful training signals include mean reward, reward variance, failure rates, policy loss, rollout throughput, and rollout staleness, meaning the lag between the policy weights that generated a rollout and the latest trained weights.

In this recipe, these rollout artifacts and training signals are the primary outputs of the post-training run. They help you confirm that closed-loop learning is running correctly and select checkpoints for downstream evaluation on your own held-out AlpaSim scenario suites.
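
Several of these signals are simple aggregates over a batch of episodes. The sketch below shows one way to compute them. The record fields (`reward`, `collided`, `policy_version`) and the function name are hypothetical and do not reflect AlpaGym's actual artifact schema.

```python
from statistics import mean, pvariance


def summarize_rollouts(rollouts, trainer_version):
    """Aggregate per-episode records into batch-level training signals."""
    rewards = [r["reward"] for r in rollouts]
    return {
        "mean_reward": mean(rewards),
        "reward_variance": pvariance(rewards),
        "collision_rate": mean(1.0 if r["collided"] else 0.0 for r in rollouts),
        # Staleness: how many weight updates behind the trainer each rollout was.
        "mean_staleness": mean(trainer_version - r["policy_version"] for r in rollouts),
    }


batch = [
    {"reward": 1.0, "collided": False, "policy_version": 10},
    {"reward": 3.0, "collided": True, "policy_version": 8},
]
signals = summarize_rollouts(batch, trainer_version=10)
```

A reward variance near zero is worth watching when training with a group-relative algorithm such as GRPO: if every rollout in a group earns the same reward, the policy gets little to learn from.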

Step 4: Export the post-trained checkpoint

After training, place the AlpaGym-produced checkpoint and config files in a folder that the AlpaSim driver can access (your Hugging Face model cache, for example). Then create a new driver config (called alpamayo1_CLRL here) that points to that folder. The following excerpt shows which fields to edit in the driver YAML config to specify custom paths. This makes the AlpaGym post-trained policy runnable inside AlpaSim for closed-loop rollouts.

...
model:
  model_type: alpamayo1
  checkpoint_path: "/root/.cache/huggingface/alpasim_models/alpamayo1_CLRL/step_NNNNNN"
  device: "cuda"
...
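
The `step_NNNNNN` suffix in `checkpoint_path` is a placeholder for a concrete training step. If checkpoints are written as numbered `step_<number>` subdirectories (an assumption about the output layout, not something confirmed here), a small helper like the hypothetical one below can find the most recent one to paste into the config.

```python
import re
from pathlib import Path

STEP_DIR = re.compile(r"^step_(\d+)$")


def latest_step_dir(model_root):
    """Return the `step_<number>` subdirectory with the highest step, or None."""
    candidates = []
    for child in Path(model_root).iterdir():
        match = STEP_DIR.match(child.name)
        # Ignore files and directories that do not follow the naming pattern.
        if match and child.is_dir():
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]
```

Steps are compared as integers rather than as strings, so the helper still picks the right directory if the zero-padding width varies.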

Next, run the exported model on a representative scenario to verify that the policy, driver, and simulation loop are connected correctly. At this stage, you can inspect how the policy behaves when its own actions affect the next state of the environment.

uv run alpasim_wizard deploy=local topology=1gpu \
  driver=alpamayo1_CLRL wizard.log_dir=$PWD/tutorial_alpamayo_CLRL \
  scenes.scene_ids=[clipgt-9ea70552-6dcb-4ee8-a368-9a906a333f6e]

A closed-loop rollout provides useful qualitative signals: whether the model produces stable trajectories and remains within the drivable area, how it reacts to nearby traffic agents, and which failure modes should be targeted during post-training.

Video 1. AlpaSim closed-loop rollout of an AV model, including the rendered camera view, predicted trajectory, and rollout-level diagnostics

With this checkpoint, teams can inspect rollout videos, per-episode metrics, reward traces, and failure cases collected during training. These artifacts are useful for debugging reward design, checking rollout stability, and selecting checkpoints for later held-out evaluation in AlpaSim.
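
Checkpoint selection can start from a simple rule, such as picking the highest mean reward among checkpoints whose failure rate stays within a budget. The sketch below illustrates that rule only. The summary fields, the function name, and the 5% collision budget are hypothetical, and the final choice should still be confirmed on held-out scenarios.

```python
def select_checkpoint(summaries, max_collision_rate=0.05):
    """Pick the highest-mean-reward step among checkpoints under the collision budget."""
    eligible = [s for s in summaries if s["collision_rate"] <= max_collision_rate]
    if not eligible:
        return None
    return max(eligible, key=lambda s: s["mean_reward"])["step"]


# Hypothetical per-checkpoint summaries collected during training.
history = [
    {"step": 100, "mean_reward": 0.61, "collision_rate": 0.04},
    {"step": 200, "mean_reward": 0.74, "collision_rate": 0.09},
    {"step": 300, "mean_reward": 0.70, "collision_rate": 0.03},
]
best_step = select_checkpoint(history)
```

Here step 200 has the highest reward but exceeds the collision budget, so step 300 is selected instead.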

Get started post-training AV models 

Closed-loop post-training provides a practical path for iterating on end-to-end driving policies. In this case, AlpaGym uses closed-loop rollouts to post-train AV policies in simulation, enabling them to learn from the consequences of their actions.

You can use these tools together with the other components of the NVIDIA Alpamayo Open Platform to develop reasoning models that can be run, inspected, and post-trained in a closed-loop simulation workflow. Extend this same recipe more broadly with your own rewards, scenarios, and evaluation suites.

Ready to get started? Check out the NVlabs/alpamayo-recipes GitHub repo to adapt the recipe in this post for your own use cases. 

To evaluate your model on a public leaderboard, see the two open AV challenges NVIDIA launched at CVPR 2026.

To learn more, see Expanding the Alpamayo Open Platform for Developing Reasoning AVs Across Models, Data, and Simulation.

About the Authors

Boris Ivanovic
Boris Ivanovic is a senior research scientist and manager in the NVIDIA Autonomous Vehicle Research Group. His research interests include AV foundation models, simulation, and AI safety. Prior to joining NVIDIA, he received his Ph.D. in Aeronautics and Astronautics in 2021 and an M.S. in Computer Science in 2018, both from Stanford University. His work has been recognized with a number of awards, including a Best Paper Award Finalist at CVPR 2025 as well as a Computex 2026 Best Choice Award.
Marco Pavone
Dr. Marco Pavone is senior director of Autonomous Vehicle Research at NVIDIA and an associate professor of Aeronautics and Astronautics at Stanford University, where he directs the Autonomous Systems Laboratory. He earned his Ph.D. in Aeronautics and Astronautics from Massachusetts Institute of Technology in 2010. His research focuses on physical AI—the development of AI systems grounded in physics, perception, and control that can operate robustly in the real world. His work spans a range of applications, including autonomous vehicles, aerospace systems, and general-purpose robotics. He has received numerous honors, including the Presidential Early Career Award for Scientists and Engineers from the White House.
