NVIDIA DSX OS Delivers Open, Modular Software for Operating AI Factories at Scale
Mirrored from NVIDIA Developer Blog for archival readability. Support the source by reading on the original site.
AI is now essential infrastructure, powered by AI factories that generate intelligence in the form of tokens. As demand grows, these factories must scale faster, operate more efficiently, and lower the cost of intelligence across the five-layer stack: energy, chips, infrastructure, models, and applications.
The NVIDIA DSX platform provides the complete playbook for designing, simulating, building, and operating AI factories. It aligns every layer of the stack across compute, software, facilities, and partner technologies through a common co-designed architecture.
The DSX platform now includes DSX OS software to accelerate AI factory deployments and improve operational efficiency. DSX OS includes open source, modular software components and related NVIDIA technologies purpose-built for operating and scaling multi-tenant AI factories.
Together, DSX OS components enable NVIDIA DSX’s AI factory ecosystem to adopt the latest in agentic AI infrastructure software across the full stack, improving tokens per watt and lowering token cost, accelerating deployment, and strengthening operational reliability and resiliency.
Why DSX OS matters to the AI factory ecosystem
To bring real value to their operators, AI factories must maximize the number of tokens they produce relative to the watts they consume.
Achieving this requires the complex network of components that run AI workloads at scale across data centers to work in close coordination. That network spans chips; systems; facilities infrastructure such as building management controls, cooling, and power distribution units; the power grid; the software and partner technologies that run all of these; and the AI platforms and services that run on top.
DSX OS software is designed for this entire ecosystem of components and provides a comprehensive set of open and extensible technologies and capabilities that can be integrated and adopted into existing platforms and software.
These capabilities have been designed and optimized around a common architecture, enabling all of the components involved to work together to deliver on three main outcomes that drive AI factory economics:
1) Faster time to revenue
NVIDIA builds and operates infrastructure and platform software on NVIDIA DGX Cloud, and now this software is being released as open source. NVIDIA ecosystem partners can leverage these components to deliver AI services rather than rebuild from scratch, eliminating months of custom development.
2) Better efficiency
Power is the limiting factor in an AI factory, and DSX treats power and grid behavior as part of the platform rather than as a facilities concern separate from the rest of the AI infrastructure. With DSX software, AI factories can run up to 40% more GPUs at peak energy efficiency within a fixed power budget, with minimal impact on inference workload performance.
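To make the fixed-budget arithmetic concrete, here is a minimal sketch. The wattages are illustrative assumptions, not NVIDIA figures, and the helper `gpus_within_budget` is hypothetical. It ignores cooling and networking overhead and only shows why a lower per-GPU power draw lets more GPUs fit under the same budget: fitting 40% more GPUs requires each GPU to draw roughly 1/1.4, or about 71%, of its original power.

```python
def gpus_within_budget(budget_w: int, per_gpu_w: int) -> int:
    """Number of GPUs that fit under a fixed power budget (hypothetical helper)."""
    if per_gpu_w <= 0:
        raise ValueError("per_gpu_w must be positive")
    return budget_w // per_gpu_w


# All three values are assumptions chosen for illustration only.
BUDGET_W = 1_000_000  # assumed 1 MW IT power budget
UNCAPPED_W = 1_000    # assumed per-GPU draw without power management
CAPPED_W = 714        # assumed per-GPU draw at a more efficient operating point

baseline = gpus_within_budget(BUDGET_W, UNCAPPED_W)
managed = gpus_within_budget(BUDGET_W, CAPPED_W)
gain = managed / baseline - 1
print(f"{baseline} -> {managed} GPUs ({gain:.0%} more)")
```

Whether a given workload tolerates the lower operating point is an empirical question that this arithmetic does not answer.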
3) Higher reliability and resiliency
AI factories run continuous large-scale workloads through hardware faults, grid events, and operational changes. DSX OS shifts cluster operations from reactive alerting to automated remediation, keeps runtime versions consistent across regions, and gives operators fleet-wide visibility.
How DSX OS enables gigawatt-scale AI factories
The open source, modular components in DSX OS provide the foundational technologies for building and operating AI factories, and are designed to solve challenges unique to operating AI workloads efficiently and reliably at gigawatt scale.
They do so by providing a co-designed set of core capabilities, including (but not limited to) standardized communication, power and efficiency optimization, provisioning and lifecycle operations, health monitoring and remediation, and intelligent platform services.
More details about how DSX OS provides these capabilities follow:
Standardized communication across the data center, enabled for agentic interfaces
An AI factory spans compute, networking, power, and cooling systems that all need to interoperate seamlessly. DSX Exchange bridges these components with an MQTT-based IT/OT communication hub that makes facility-level signals, such as grid events, thermal data, and power anomalies, visible to the software managing the rest of the AI factory. This enables components such as DSX Flex, MaxLPS, and partner software to react to each other’s state in real time, improving coordination and efficiency.
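The source does not describe the DSX Exchange topic layout, so the topic and subscriber names below are hypothetical. As a minimal sketch of how an MQTT-based hub lets software subscribe to facility signals, this pure-Python function implements MQTT's standard topic-filter wildcards (`+` for exactly one level, `#` for all remaining levels). It omits the special handling of `$`-prefixed topics and does not use a real MQTT client.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if an MQTT topic filter matches a concrete topic name."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            # Multi-level wildcard: only valid as the last filter level,
            # and it matches the parent level plus everything below it.
            return i == len(filter_levels) - 1
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


# Hypothetical subscribers and topic names, for illustration only.
subscriptions = {
    "power-manager": "facility/+/power/#",
    "thermal-watch": "facility/hall1/thermal/+",
}
message_topic = "facility/hall1/power/rack12/anomaly"
receivers = [
    name for name, topic_filter in subscriptions.items()
    if topic_matches(topic_filter, message_topic)
]
print(receivers)
```

In this sketch only the subscriber whose filter covers power topics receives the anomaly message, which is the decoupling a publish/subscribe hub provides.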
DSX OS software components across the full DSX stack will also provide MCP servers for provisioning, networking, observability, and more. Using these MCP servers, AI agents can discover the entire operational surface of the factory as a unified tool catalog, enabling them to interface across every system and perform cross-domain correlation. With an agentic AI factory, operators can more easily connect, for example, a GPU health event with a thermal anomaly, or a network issue with a performance issue.
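Cross-domain correlation can be illustrated with a small sketch. The `Event` record, its fields, and the `correlate` function are hypothetical and are not part of any DSX OS or MCP interface. The sketch assumes events from different domains are related when they occur at the same location within a short time window.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Event:
    domain: str    # hypothetical domain label, e.g. "gpu", "thermal", "network"
    location: str  # hypothetical location key, e.g. a rack identifier
    kind: str
    ts: float      # timestamp in seconds


def correlate(events, window_s=60.0):
    """Pair events from different domains at one location within window_s seconds."""
    ordered = sorted(events, key=lambda e: e.ts)
    pairs = []
    for earlier, later in combinations(ordered, 2):
        same_place = earlier.location == later.location
        cross_domain = earlier.domain != later.domain
        if same_place and cross_domain and later.ts - earlier.ts <= window_s:
            pairs.append((earlier, later))
    return pairs


events = [
    Event("gpu", "rack-7", "ecc_error", ts=130.0),
    Event("thermal", "rack-7", "inlet_temp_high", ts=100.0),
    Event("network", "rack-9", "link_flap", ts=120.0),
]
for earlier, later in correlate(events):
    print(f"{earlier.kind} -> {later.kind} at {later.location}")
```

A production system would likely use richer topology than a single location key, but the pairing logic shows what an agent gains from seeing all domains in one place.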
Power and efficiency optimization
Static power allocation strands capacity, reactive cooling creates thermal oscillations, and disconnected IT/OT systems make grid events a manual fire drill. DSX MaxLPS includes software that treats power as a programmable resource by dynamically enforcing policies at the GPU, rack, cooling, and workload level, enabling AI factories to recover stranded power to run additional compute at optimal utilization. DSX Flex extends this beyond the factory walls, with libraries for connecting workloads to grid services so AI factories can automatically adapt to demand response, load shedding, and renewable energy availability.
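As a rough sketch of what treating power as a programmable resource can mean during a grid event, the function below scales per-rack power caps proportionally so that their sum fits a reduced budget. The function name, rack names, and wattages are assumptions for illustration. The source does not describe the policies MaxLPS or DSX Flex actually enforce, and a real policy would also account for workload priority and minimum operating power.

```python
def scale_caps_to_budget(caps_w: dict, budget_w: float) -> dict:
    """Proportionally reduce power caps so their total does not exceed budget_w."""
    total = sum(caps_w.values())
    if total <= budget_w:
        return dict(caps_w)  # already within budget, leave caps unchanged
    factor = budget_w / total
    return {name: cap * factor for name, cap in caps_w.items()}


# Hypothetical racks and a hypothetical demand-response request.
rack_caps = {"rack-a": 120_000.0, "rack-b": 80_000.0}
curtailed = scale_caps_to_budget(rack_caps, budget_w=150_000.0)
print(curtailed)
```

Proportional scaling is the simplest choice here; it keeps each rack's share of the budget constant while the total shrinks.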
Partners including CoreWeave, Firmus, Lambda, Nscale, and Phaidra are deploying MaxLPS, while Emerald AI, ENGIE, Silicon Valley Power, and UK National Grid are leveraging DSX Flex.
Provisioning and multi-tenant lifecycle operations
At scale, provisioning is a continuous workflow: nodes cycle through tenant assignments, hardware is replaced, and every transition must be auditable and secure. NVIDIA Infra Controller (NICo) makes this programmable with API-driven bare-metal lifecycle management and hardware-enforced tenant isolation through NVIDIA BlueField DPUs and the NVIDIA DOCA Platform Framework. NVIDIA AI Cluster Runtime (AICR) complements this by capturing validated runtime configurations as version-locked recipes, eliminating the configuration drift that causes silent failures across large fleets.
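The idea of a version-locked recipe can be sketched as a mapping from component to pinned version, with drift defined as any mismatch between the recipe and what a node reports. The component names, version strings, and the `detect_drift` helper are hypothetical; the source does not describe the AICR recipe format.

```python
from typing import Optional


def detect_drift(recipe: dict, observed: dict) -> dict:
    """Map each drifted component to (pinned_version, observed_version_or_None)."""
    drift = {}
    for component, pinned in recipe.items():
        actual: Optional[str] = observed.get(component)
        if actual != pinned:
            drift[component] = (pinned, actual)
    return drift


# Hypothetical recipe and node report, for illustration only.
recipe = {"gpu-driver": "1.2.3", "container-runtime": "4.5.6", "fabric-tools": "7.8.9"}
node_report = {"gpu-driver": "1.2.3", "container-runtime": "4.5.7"}

for component, (pinned, actual) in detect_drift(recipe, node_report).items():
    print(f"{component}: expected {pinned}, found {actual}")
```

Running a check like this on every node before it receives work is one way to surface mismatches that would otherwise fail silently.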
IREN, OpenNebula Systems, Mirantis, Rafay, Red Hat, and Supermicro are among the partners integrating these components.
Health monitoring and automation tooling
In a large GPU fleet, hardware degradation is a daily occurrence, and the traditional alert-page-investigate cycle is too slow and manual to minimize the impact on workloads. NVIDIA NVSentinel provides Kubernetes-native GPU fault detection and automated remediation, cordoning unhealthy compute nodes and draining workloads in seconds rather than minutes or hours. NVIDIA Fleet Intelligence provides fleet-wide visibility, integrity verification, and health monitoring across global deployments.
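The cordon-then-drain sequence can be modeled in a few lines. This is an in-memory sketch of the semantics only: cordoning marks a node unschedulable so no new workloads land on it, and draining evicts the workloads already there so they can be rescheduled. The `Node` class and `remediate` function are hypothetical and do not call Kubernetes or reflect NVSentinel's actual logic, which the source does not describe.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    schedulable: bool = True
    pods: list = field(default_factory=list)


def remediate(node: Node, fault_detected: bool) -> list:
    """Cordon and drain a faulty node; return the pods that need rescheduling."""
    if not fault_detected:
        return []
    node.schedulable = False            # cordon: stop new placements
    evicted, node.pods = node.pods, []  # drain: evict existing workloads
    return evicted


# Hypothetical node and pod names.
node = Node("gpu-node-17", pods=["train-job-0", "infer-svc-3"])
to_reschedule = remediate(node, fault_detected=True)
print(node.schedulable, to_reschedule)
```

Cordoning before draining matters: if the order were reversed, the scheduler could place evicted workloads back onto the faulty node.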
Lambda is an early adopter of Fleet Intelligence.
Intelligent AI workload scheduling and platform services
AI workloads need more than GPU access; they need topology-aware intelligent scheduling, distributed inference, and production APIs. KAI Scheduler and NVIDIA Run:ai provide GPU-aware workload placement with fractional allocation and hierarchical quotas. NVIDIA Dynamo and NVIDIA Grove deliver distributed inference serving with disaggregated prefill/decode and per-stage autoscaling. NVIDIA Cloud Functions (NVCF) ties it together with unified APIs across inference, fine-tuning, and batch workloads with built-in multi-tenancy.
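Fractional allocation under hierarchical quotas can be sketched as a tree in which a request succeeds only if it fits within the limit of the requesting team and of every ancestor. The `QuotaNode` class, the organization layout, and the limits are hypothetical; this is not the KAI Scheduler or Run:ai API.

```python
class QuotaNode:
    """A node in a hypothetical quota tree; limits and usage are in GPUs."""

    def __init__(self, name, limit, parent=None):
        self.name = name
        self.limit = limit
        self.parent = parent
        self.used = 0.0

    def _chain(self):
        node = self
        while node is not None:
            yield node
            node = node.parent

    def try_allocate(self, gpus: float) -> bool:
        """Allocate a (possibly fractional) GPU amount if every level has room."""
        if any(n.used + gpus > n.limit + 1e-9 for n in self._chain()):
            return False
        for n in self._chain():
            n.used += gpus
        return True


org = QuotaNode("org", limit=4.0)
team_a = QuotaNode("team-a", limit=3.0, parent=org)
team_b = QuotaNode("team-b", limit=3.0, parent=org)

print(team_a.try_allocate(2.5))  # fits team-a and org
print(team_b.try_allocate(2.0))  # fits team-b alone, but would push org past 4.0
print(team_b.try_allocate(1.5))  # fits both levels
```

Checking the whole chain before committing keeps the tree consistent: a rejected request leaves no partial usage behind at any level.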
Partners including Aible, Beyond AI, Bhashini, Crusoe, DCAI, Mirantis, Nebius, Rafay, Sarvam, Simplismart, Spectro Cloud, vCluster, Vultr, and Yotta are using many of these components in production.
Getting started
DSX OS components are available on GitHub and designed for incremental adoption and integration with existing software stacks.
Start with the component that addresses your most immediate requirements, and build from there, leveraging the capabilities and technologies provided to accelerate your AI factory deployment and improve operational efficiency.
Some examples are provided below:
- IT/OT communications: DSX Exchange
- Bare-metal lifecycle management and tenant isolation: NVIDIA Infra Controller and DOCA Platform Framework
- Fleet visibility, health, and integrity: NVIDIA Fleet Intelligence
- Unified AI inference APIs: NVIDIA Cloud Functions
Review the NVIDIA DSX documentation for more details about all of the components of DSX OS, implementation and reference design guides, quickstarts, and integration guidance.
About the Authors
Warren Barkley is vice president of product management of DGX Cloud at NVIDIA. Prior to NVIDIA, Warren led core AI/ML platform teams in Google Cloud. Before Google, he held executive roles managing AI and machine learning product groups at Amazon Web Services and at Microsoft, where he helped drive several major cloud and communications products.