NVIDIA Developer Blog · June 1, 2026 · 7 min read

NVIDIA DSX OS Delivers Open, Modular Software for Operating AI Factories at Scale


Mirrored from NVIDIA Developer Blog for archival readability. Support the source by reading on the original site.



May 31, 2026

By Warren Barkley


AI is now essential infrastructure, powered by AI factories that generate intelligence in the form of tokens. As demand grows, these factories must scale faster, operate more efficiently, and lower the cost of intelligence across the five-layer stack: energy, chips, infrastructure, models, and applications.

The NVIDIA DSX platform provides the complete playbook for designing, simulating, building, and operating AI factories, aligning every layer of the stack across compute, software, facilities, and partner technologies through a common co-designed architecture.

The DSX platform now includes DSX OS software to accelerate AI factory deployments and improve operational efficiency. DSX OS includes open source, modular software components and related NVIDIA technologies purpose-built for operating and scaling multi-tenant AI factories.

Together, DSX OS components enable NVIDIA DSX’s AI factory ecosystem to adopt the latest in agentic AI infrastructure software across the full stack, improving tokens per watt and lowering token cost, accelerating deployment, and strengthening operational reliability and resiliency.

Figure 1. NVIDIA DSX OS software in the DSX platform. DSX OS provides the open source software for AI factory operations. (Architecture diagram showing NVIDIA DSX OS within the larger NVIDIA DSX platform across hardware, facilities, software, simulation, resiliency, and security layers.)

Why DSX OS matters to the AI factory ecosystem

To deliver real value to their operators, AI factories must maximize the number of tokens they produce relative to the watts they consume.

Achieving this requires the complex network of components that run AI workloads at scale across data centers to work in close coordination. That network spans chips; systems; facilities infrastructure such as building management controls, cooling, and power distribution units; the power grid; the software and partner technologies running all of these; and the AI platforms and services running on top.

DSX OS software is designed for this entire ecosystem of components. It provides a comprehensive set of open, extensible technologies and capabilities that can be integrated into existing platforms and software.

These capabilities have been designed and optimized around a common architecture, enabling all of the components involved to work together to deliver on three main outcomes that drive AI factory economics:

1) Faster time to revenue

NVIDIA builds and operates infrastructure and platform software on NVIDIA DGX Cloud, and now this software is being released as open source. NVIDIA ecosystem partners can leverage these components to deliver AI services rather than rebuild from scratch, eliminating months of custom development.

2) Better efficiency

Power is the limiting factor in an AI factory. DSX treats power and grid behavior as part of the platform rather than as a facilities concern separated from the rest of the AI infrastructure. With DSX software, AI factories can run up to 40% more GPUs at peak energy efficiency within a fixed power budget, with minimal impact on inference workload performance.

3) Higher reliability and resiliency

AI factories run continuous large-scale workloads through hardware faults, grid events, and operational changes. DSX OS shifts cluster operations from reactive alerting to automated remediation, keeps runtime versions consistent across regions, and gives operators fleet-wide visibility.
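The fixed-power-budget point under "Better efficiency" can be made concrete with a little arithmetic. The sketch below is only an illustration: the budget, the per-GPU power figures, and the relative performance at the cap are hypothetical values chosen to make the numbers easy to follow, not measurements from DSX software or any NVIDIA GPU.

```python
import math


def gpus_within_budget(budget_watts: float, watts_per_gpu: float) -> int:
    """Number of whole GPUs that fit inside a fixed power budget."""
    if budget_watts < 0 or watts_per_gpu <= 0:
        raise ValueError("budget must be >= 0 and per-GPU power must be > 0")
    return math.floor(budget_watts / watts_per_gpu)


# All values below are hypothetical, picked only for illustration.
BUDGET_W = 1_000_000         # fixed power budget available to GPUs (1 MW)
UNCAPPED_W = 1_000           # per-GPU power when provisioned for peak draw
CAPPED_W = 714               # per-GPU power at an assumed efficiency-optimal cap
RELATIVE_PERF_AT_CAP = 0.95  # assumed per-GPU throughput at the cap vs. uncapped

baseline_gpus = gpus_within_budget(BUDGET_W, UNCAPPED_W)
capped_gpus = gpus_within_budget(BUDGET_W, CAPPED_W)

# Percentage of additional GPUs that fit once each GPU is capped.
extra_gpus_pct = 100 * (capped_gpus - baseline_gpus) / baseline_gpus

# Aggregate throughput relative to the uncapped fleet, under the same budget.
relative_throughput = capped_gpus * RELATIVE_PERF_AT_CAP / baseline_gpus

print(f"{baseline_gpus} GPUs uncapped, {capped_gpus} GPUs capped")
print(f"{extra_gpus_pct:.1f}% more GPUs, {relative_throughput:.2f}x throughput")
```

Under these assumed inputs, capping each GPU lets more of them share the same budget, and the fleet produces more tokens per watt as long as the per-GPU slowdown is smaller than the gain in GPU count.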

How DSX OS enables gigawatt-scale AI factories

The open source, modular components in DSX OS provide the foundational technologies for building and operating AI factories, and are designed to solve challenges unique to operating AI workloads efficiently and reliably at gigawatt scale.

They do so by providing a co-designed set of core capabilities, including (but not limited to) standardized communication, power and efficiency optimization, provisioning and lifecycle operations, health monitoring and remediation, and intelligent platform services.

More details about how DSX OS provides these capabilities follow:

Standardized communication across the data center, enabled for agentic interfaces

An AI factory spans compute, networking, power, and cooling systems that all need to interoperate seamlessly. DSX Exchange bridges these components with an MQTT-based IT/OT communication hub. The hub makes facility-level signals, such as grid events, thermal data, and power anomalies, visible to the software managing the rest of the AI factory. Components such as DSX Flex, MaxLPS, and partner software can then react to each other’s state in real time, improving coordination and efficiency.
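To illustrate how an MQTT-style hub lets independent components react to facility signals, here is a minimal sketch of MQTT topic-filter matching with an in-process stand-in for a broker. The wildcard rules (`+` for one level, `#` for all remaining levels) follow the MQTT standard; the `LocalBus` class and every topic name are assumptions made for this example and do not reflect the actual DSX Exchange schema or API.

```python
from typing import Callable

Handler = Callable[[str, dict], None]


def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if an MQTT topic filter matches a concrete topic name.

    '+' matches exactly one level; '#' matches all remaining levels and is
    only valid as the last level of the filter.
    """
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            return i == len(filter_levels) - 1
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


class LocalBus:
    """In-process stand-in for an MQTT broker (illustration only)."""

    def __init__(self) -> None:
        self._subscriptions: list[tuple[str, Handler]] = []

    def subscribe(self, topic_filter: str, handler: Handler) -> None:
        self._subscriptions.append((topic_filter, handler))

    def publish(self, topic: str, payload: dict) -> int:
        """Deliver the payload to every matching subscriber; return the count."""
        delivered = 0
        for topic_filter, handler in self._subscriptions:
            if topic_matches(topic_filter, topic):
                handler(topic, payload)
                delivered += 1
        return delivered


# Hypothetical topic names; the real DSX Exchange schema may differ.
bus = LocalBus()
power_events: list[tuple[str, dict]] = []
thermal_events: list[tuple[str, dict]] = []

# A power-policy component listens to every grid signal in any hall.
bus.subscribe("facility/+/grid/#", lambda t, p: power_events.append((t, p)))
# A cooling component listens to inlet temperature readings from any hall.
bus.subscribe("facility/+/thermal/inlet_temp_c",
              lambda t, p: thermal_events.append((t, p)))

bus.publish("facility/hall-1/grid/demand_response", {"reduce_kw": 500})
bus.publish("facility/hall-2/thermal/inlet_temp_c", {"value": 31.5})
```

The design point is that publishers and subscribers only agree on topic names, so a new consumer of grid or thermal signals can be added without changing the systems that emit them.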

DSX OS software components across the full DSX stack will also provide MCP servers for provisioning, networking, observability, and more. Using these MCP servers, AI agents can discover the entire operational surface of the factory as a unified tool catalog, enabling them to interface across every system and perform cross-domain correlation. With an agentic AI factory, operators can readily connect a GPU health event to a thermal anomaly, or a network issue to a performance issue, among other scenarios.
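The idea of a unified tool catalog plus cross-domain correlation can be sketched in a few lines. This example does not use any MCP SDK: the server names, tool names, event fields, and the 60-second window are all hypothetical, and the correlation rule (same rack, different source, close in time) is just one simple heuristic an agent might apply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str       # domain that reported the event, e.g. "gpu-health"
    rack: str         # rack identifier shared across domains
    kind: str         # short event label
    timestamp: float  # seconds since an arbitrary epoch


def build_catalog(servers: dict[str, list[str]]) -> dict[str, str]:
    """Flatten per-server tool lists into one catalog keyed by 'server.tool'."""
    return {
        f"{server}.{tool}": server
        for server, tools in servers.items()
        for tool in tools
    }


def correlate(events: list[Event], window_s: float = 60.0) -> list[tuple[Event, Event]]:
    """Pair events from different sources on the same rack within window_s."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    pairs = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second.timestamp - first.timestamp > window_s:
                break  # later events are even further away in time
            if first.rack == second.rack and first.source != second.source:
                pairs.append((first, second))
    return pairs


# Hypothetical servers and tools, standing in for what an agent might discover.
catalog = build_catalog({
    "provisioning": ["list_nodes", "reassign_node"],
    "observability": ["query_metrics"],
})

# Hypothetical events from three domains.
events = [
    Event("gpu-health", "rack-12", "ecc_error", 100.0),
    Event("network", "rack-07", "link_flap", 120.0),
    Event("facility", "rack-12", "thermal_anomaly", 130.0),
]
related = correlate(events)
```

Here the GPU health event and the thermal anomaly on the same rack are paired, while the unrelated network event on another rack is left alone.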

Figure 2. DSX Exchange coordinates communication within the AI factory, including grid signals from DSX Flex, facilities-level signals, power policies to and from DSX MaxLPS, provisioning systems like NVIDIA Infra Controller, and more. (Simplified diagram showing the connections between DSX Exchange, DSX Flex, DSX MaxLPS, provisioning systems such as NVIDIA Infra Controller, facilities components such as Building Management Systems, the power grid, third-party and partner software such as Emerald AI and Phaidra, and the Vera Rubin NVL72 hardware.)

Power and efficiency optimization

Static power allocation strands capacity, reactive cooling creates thermal oscillations, and disconnected IT/OT systems make grid events a manual fire drill. DSX MaxLPS includes software that treats power as a programmable resource by dynamically enforcing policies at the GPU, rack, cooling, and workload level, enabling AI factories to recover stranded power to run additional compute at optimal utilization. DSX Flex extends this beyond the factory walls, with libraries for connecting workloads to grid services so AI factories can automatically adapt to demand response, load shedding, and renewable energy availability.
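As a rough illustration of treating power as a programmable resource, the sketch below allocates a facility budget across racks based on current demand rather than a static split, and then re-runs the allocation when a grid event shrinks the budget. The function name, the floor-plus-proportional policy, and all kilowatt values are assumptions for this example; they are not the MaxLPS or DSX Flex algorithms.

```python
def allocate_power(
    budget_kw: float,
    demands_kw: dict[str, float],
    floors_kw: dict[str, float],
) -> dict[str, float]:
    """Give each rack its floor, then share the rest in proportion to unmet demand."""
    total_floor = sum(floors_kw[rack] for rack in demands_kw)
    if total_floor > budget_kw:
        raise ValueError("budget cannot cover the minimum floors")

    remaining = budget_kw - total_floor
    unmet = {
        rack: max(demands_kw[rack] - floors_kw[rack], 0.0) for rack in demands_kw
    }
    total_unmet = sum(unmet.values())

    if total_unmet <= remaining:
        # Every rack can have everything it asks for.
        scale = 1.0
    else:
        # Not enough headroom: scale all unmet demand by the same factor.
        scale = remaining / total_unmet
    return {rack: floors_kw[rack] + unmet[rack] * scale for rack in demands_kw}


# Hypothetical racks and values, for illustration only.
floors = {"rack-a": 20.0, "rack-b": 20.0}

# Uneven demand: a static 50/50 split of 100 kW would strand 20 kW on rack-a
# while throttling rack-b. A demand-based allocation serves both in full.
normal = allocate_power(100.0, {"rack-a": 30.0, "rack-b": 70.0}, floors)

# A demand-response event reduces the budget to 80 kW; both racks want 60 kW,
# so each keeps its floor and the remaining 40 kW is shared equally.
during_grid_event = allocate_power(80.0, {"rack-a": 60.0, "rack-b": 60.0}, floors)
```

Re-running the same policy with a smaller budget is what turns a grid event from a manual intervention into a routine recomputation.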

Partners including CoreWeave, Firmus, Lambda, Nscale, and Phaidra are deploying MaxLPS, while Emerald AI, ENGIE, Silicon Valley Power, and UK National Grid are leveraging DSX Flex.

Provisioning and multi-tenant lifecycle operations

At scale, provisioning is a continuous workflow: nodes cycle through tenant assignments, hardware is replaced, and every transition must be auditable and secure. NVIDIA Infra Controller (NICo) makes this programmable with API-driven bare-metal lifecycle management and hardware-enforced tenant isolation through NVIDIA BlueField DPUs and the NVIDIA DOCA Platform Framework. NVIDIA AI Cluster Runtime (AICR) complements this by capturing validated runtime configurations as version-locked recipes, eliminating the configuration drift that causes silent failures across large fleets.
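The idea behind version-locked recipes can be shown with a small drift check: compare what each node reports against the pinned versions and list any mismatch. This is a sketch under stated assumptions. The recipe format, component names, and version strings are placeholders invented for the example, not the AICR recipe schema.

```python
from typing import Optional

Drift = dict[str, tuple[str, Optional[str]]]


def find_drift(recipe: dict[str, str], node_versions: dict[str, str]) -> Drift:
    """Return {component: (pinned, actual)} for every component that differs.

    A component missing from the node is reported with actual=None.
    """
    drift: Drift = {}
    for component, pinned in recipe.items():
        actual = node_versions.get(component)
        if actual != pinned:
            drift[component] = (pinned, actual)
    return drift


def fleet_drift(recipe: dict[str, str], fleet: dict[str, dict[str, str]]) -> dict[str, Drift]:
    """Run the drift check on every node and keep only nodes that drifted."""
    report = {}
    for node, versions in fleet.items():
        drift = find_drift(recipe, versions)
        if drift:
            report[node] = drift
    return report


# Placeholder component names and versions, for illustration only.
recipe = {"gpu_driver": "1.2.3", "container_runtime": "4.5.6", "kernel": "7.8.9"}
fleet = {
    "node-a": {"gpu_driver": "1.2.3", "container_runtime": "4.5.6", "kernel": "7.8.9"},
    "node-b": {"gpu_driver": "1.2.2", "container_runtime": "4.5.6"},
}
report = fleet_drift(recipe, fleet)
```

In this example only `node-b` appears in the report, with an out-of-date driver and a missing kernel entry, which is the kind of silent divergence a pinned recipe is meant to surface.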

IREN, OpenNebula Systems, Mirantis, Rafay, Red Hat, and Supermicro are among the partners integrating these components.

Health monitoring and automation tooling

In a large GPU fleet, hardware degradation is a daily occurrence, and the traditional alert-page-investigate cycle is too slow and manual to minimize the impact on workloads. NVIDIA NVSentinel provides Kubernetes-native GPU fault detection and automated remediation, cordoning unhealthy compute nodes and draining workloads in seconds rather than minutes or hours. NVIDIA Fleet Intelligence provides fleet-wide visibility, integrity verification, and health monitoring across global deployments.
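The cordon-and-drain pattern described above can be simulated without a cluster. The sketch below is a simplified model, not NVSentinel code or the Kubernetes API: the `Node` class and `remediate` function are hypothetical, and workloads are moved directly to the least-loaded healthy node, whereas in a real cluster the scheduler decides where evicted pods land.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    schedulable: bool = True
    workloads: list[str] = field(default_factory=list)


def remediate(nodes: dict[str, Node], unhealthy: list[str]) -> dict[str, str]:
    """Cordon every unhealthy node, then drain its workloads to healthy nodes.

    Returns a mapping of workload name to the node it was moved to.
    """
    # Cordon first, so drained workloads never land on another failing node.
    for name in unhealthy:
        nodes[name].schedulable = False

    moved: dict[str, str] = {}
    for name in unhealthy:
        node = nodes[name]
        while node.workloads:
            targets = [n for n in nodes.values() if n.schedulable]
            if not targets:
                break  # nowhere to go; leave the rest in place for an operator
            # Simplification: pick the least-loaded node, ties broken by name.
            target = min(targets, key=lambda n: (len(n.workloads), n.name))
            workload = node.workloads.pop(0)
            target.workloads.append(workload)
            moved[workload] = target.name
    return moved


# Hypothetical three-node cluster where n1 reports a GPU fault.
cluster = {
    "n1": Node("n1", workloads=["train-a", "train-b"]),
    "n2": Node("n2", workloads=["serve-c"]),
    "n3": Node("n3"),
}
moved = remediate(cluster, unhealthy=["n1"])
```

Automating these two steps is what removes the human paging loop from the critical path: the node stops accepting work immediately, and existing workloads are relocated without waiting for an investigation.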

Lambda is an early adopter of Fleet Intelligence.

Figure 3. The NVIDIA Fleet Intelligence dashboard summarizes fleet-wide aggregations of data such as GPU and memory utilization as well as total GPUs in an up state.

Intelligent AI workload scheduling and platform services

AI workloads need more than GPU access; they need topology-aware intelligent scheduling, distributed inference, and production APIs. KAI Scheduler and NVIDIA Run:ai provide GPU-aware workload placement with fractional allocation and hierarchical quotas. NVIDIA Dynamo and NVIDIA Grove deliver distributed inference serving with disaggregated prefill/decode and per-stage autoscaling. NVIDIA Cloud Functions (NVCF) ties it together with unified APIs across inference, fine-tuning, and batch workloads with built-in multi-tenancy.
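Fractional allocation and hierarchical quotas can be illustrated together with a small admission sketch. Everything here is an assumption made for the example: the `QuotaNode` class, the best-fit placement rule, and the team names are hypothetical and do not represent KAI Scheduler or NVIDIA Run:ai behavior. Topology awareness is left out to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuotaNode:
    name: str
    limit: float                          # max GPUs for this node and its children
    parent: Optional["QuotaNode"] = None
    used: float = 0.0


def within_quota(queue: QuotaNode, gpus: float) -> bool:
    """Check the request against the queue and every ancestor above it."""
    node: Optional[QuotaNode] = queue
    while node is not None:
        if node.used + gpus > node.limit:
            return False
        node = node.parent
    return True


def place(gpu_free: dict[str, float], gpus: float) -> Optional[str]:
    """Best fit: pick the GPU with the least free capacity that still fits."""
    candidates = [(free, gpu_id) for gpu_id, free in gpu_free.items() if free >= gpus]
    if not candidates:
        return None
    free, gpu_id = min(candidates)
    gpu_free[gpu_id] = free - gpus
    return gpu_id


def schedule(queue: QuotaNode, gpus: float, gpu_free: dict[str, float]) -> Optional[str]:
    """Admit a fractional GPU request if quota and capacity both allow it."""
    if not within_quota(queue, gpus):
        return None
    gpu_id = place(gpu_free, gpus)
    if gpu_id is None:
        return None
    node: Optional[QuotaNode] = queue
    while node is not None:  # charge the request to the whole ancestor chain
        node.used += gpus
        node = node.parent
    return gpu_id


# Hypothetical hierarchy: two teams share an organization-level quota of 1 GPU.
org = QuotaNode("org", limit=1.0)
team_a = QuotaNode("team-a", limit=0.75, parent=org)
team_b = QuotaNode("team-b", limit=0.75, parent=org)
gpu_free = {"gpu0": 1.0, "gpu1": 0.5}

first = schedule(team_a, 0.5, gpu_free)    # fits the half-free GPU exactly
second = schedule(team_b, 0.5, gpu_free)   # shares gpu0
third = schedule(team_a, 0.25, gpu_free)   # within team quota, over org quota
```

The third request is rejected even though `gpu0` still has free capacity, because the parent quota is already exhausted. That is the property hierarchical quotas add on top of per-team limits.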

Partners including Aible, Beyond AI, Bhashini, Crusoe, DCAI, Mirantis, Nebius, Rafay, Sarvam, Simplismart, Spectro Cloud, vCluster, Vultr, and Yotta are using many of these components in production.

Getting started

DSX OS components are available on GitHub and designed for incremental adoption and integration with existing software stacks.

Start with the component that addresses your most immediate requirement and build from there, using the provided capabilities and technologies to accelerate your AI factory deployment and improve operational efficiency.

Some examples are provided below:

IT/OT communications: DSX Exchange
Bare-metal lifecycle management and tenant isolation: NVIDIA Infra Controller and DOCA Platform Framework
Fleet visibility, health, and integrity: NVIDIA Fleet Intelligence
Unified AI inference APIs: NVIDIA Cloud Functions

Review the NVIDIA DSX documentation for more details about all of the components of DSX OS, implementation and reference design guides, quickstarts, and integration guidance.


About the Authors

About Warren Barkley
Warren Barkley is vice president of product management of DGX Cloud at NVIDIA. Prior to NVIDIA, Warren led core AI/ML platform teams in Google Cloud. Before Google, he held executive roles managing AI and machine learning product groups at Amazon Web Services and at Microsoft, where he helped drive several major cloud and communications products.
