Unlock Exascale Performance on NVIDIA GB200 NVL72 with Slurm Topology-Aware Job Scheduling
Mirrored from NVIDIA Developer Blog for archival readability. Support the source by reading on the original site.
AI-Generated Summary
- NVIDIA GB200 NVL72 delivers exascale compute in a single rack with 72 Blackwell GPUs interconnected by NVLink, providing 130 TB/s of low-latency bandwidth, enabling real-time trillion-parameter AI models and significant performance gains across AI training and inference workloads.
- Slurm's new topology/block plugin, co-developed by NVIDIA and SchedMD for version 23.11, enables topology-aware job scheduling that aligns jobs with NVL72 domain boundaries, minimizing fragmentation and optimizing GPU occupancy in cluster environments.
- Larger job segment sizes (up to 18 nodes) on GB200 NVL72 allow efficient grouping of GPUs communicating entirely over NVLink, benefiting high I/O workloads like mixture-of-experts training, while smaller jobs use smaller segment sizes to avoid scheduler constraints.
- Scheduling simulations using a 5,000-node GB200 NVL72 cluster model showed that topology-aware scheduling places small jobs strategically within domains to reduce fragmentation and achieves GPU occupancy within 1% of theoretical maximum, maintaining high utilization without performance loss.
- Recommended scheduling policies prioritize large jobs (64 GPUs) with segment sizes of 16 nodes to maximize NVLink domain usage, while smaller jobs use segment sizes of 2 to 8 nodes, enabling efficient resource alignment and cluster utilization.
- Continuous monitoring of fragmentation and segment size adjustment, supported by simulation tools, is essential to sustain optimal performance and utilization in NVIDIA GB200 NVL72 clusters over time.
As AI models grow in scale and complexity, realizing the full performance of modern accelerated infrastructure depends as much on how workloads are placed as on the hardware itself. NVIDIA GB200 NVL72 delivers exascale compute in a single rack, unlocking real-time trillion-parameter models. Yet capturing that performance in a shared cluster requires schedulers that understand the system architecture and align jobs with its network topology.
This post explains how Slurm topology-aware job scheduling works on NVIDIA GB200 NVL72, and provides scheduling recommendations for optimal GPU occupancy.
How does NVIDIA GB200 NVL72 deliver exascale compute?
NVIDIA GB200 NVL72 is an exascale computer in a single rack. Its 72 NVIDIA Blackwell GPUs are interconnected by NVIDIA NVLink, the largest production scale-up compute fabric, which provides 130 terabytes per second (TB/s) of low-latency GPU communication bandwidth for AI and high-performance computing (HPC) workloads. Multiple GB200 NVL72 systems combined in a cluster create a hierarchical network topology with large domains of very high networking bandwidth.
An AI training job can benefit greatly from the abundant networking bandwidth of GB200 NVL72 when it is scheduled to maximize use of the NVLink fabric. Recent results show that GB200 NVL72 delivers significant performance improvements across AI workloads, including training (>2.6x in recent MLPerf Training results), a range of inference use cases (real-time inference for trillion-parameter models, >1.5 million tokens/second for the OpenAI gpt-oss model, and state-of-the-art disaggregated serving), and reasoning.
In a shared cluster running multiple training jobs, a resource-efficient scheduler must account for varying network bandwidth requirements.
What is topology-aware job scheduling?
Topology-aware job scheduling allows a job scheduler such as Slurm to make resource allocation decisions based on the cluster’s physical network layout, such as the hierarchy of switches and racks. The scheduler should preserve locality, keeping workloads within the same NVLink domain whenever possible. In addition, because multiple training or inference jobs can fit in a group of NVL72 racks, the scheduler must provide efficient bin-packing to avoid resource fragmentation.
The longstanding Slurm topology/tree plugin provides topology-aware scheduling for large clusters, but its best-effort approach often fragments jobs across leaf switches to reduce queue time. While this compromise between start time and performance was acceptable for traditional InfiniBand fabrics, the advent of rack-scale systems like GB200 NVL72 and GB300 NVL72 necessitated a change. In response, NVIDIA and SchedMD collaborated to launch the new topology/block plugin in Slurm 23.11, specifically designed for these modern architectures.
This topology plugin configuration provides information about groups of nodes belonging to the same NVL72 domain, which enables algorithms that can align Slurm jobs with NVL72 domain boundaries. To learn more about the block topology plugin and how segment sizes are scheduled, see Achieving Peak System and Workload Efficiency on NVIDIA GB200 NVL72 with Slurm Block Scheduling.
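To make the idea of domain alignment concrete, here is a minimal Python sketch of the information a block topology carries: which nodes share an NVL72 domain, and how many domains a given allocation touches. It is an illustration only, not the plugin's algorithm. The node names, the 36-node cluster, and the helper names `build_blocks` and `domains_spanned` are hypothetical.

```python
from typing import Dict, List

NODES_PER_DOMAIN = 18  # one NVL72 rack: 72 GPUs at 4 GPUs per node


def build_blocks(node_names: List[str],
                 block_size: int = NODES_PER_DOMAIN) -> Dict[str, List[str]]:
    """Group consecutive node names into blocks of `block_size` nodes."""
    blocks: Dict[str, List[str]] = {}
    for start in range(0, len(node_names), block_size):
        name = f"block{start // block_size + 1}"
        blocks[name] = node_names[start:start + block_size]
    return blocks


def domains_spanned(allocation: List[str],
                    blocks: Dict[str, List[str]]) -> int:
    """Count how many blocks (NVLink domains) an allocation touches."""
    owner = {node: name for name, members in blocks.items() for node in members}
    return len({owner[node] for node in allocation})


# Hypothetical two-rack cluster of 36 nodes.
nodes = [f"node{i:03d}" for i in range(1, 37)]
blocks = build_blocks(nodes)

aligned = nodes[0:16]      # 16 nodes entirely inside the first domain
straddling = nodes[10:26]  # 16 nodes crossing the domain boundary

print(domains_spanned(aligned, blocks))     # 1
print(domains_spanned(straddling, blocks))  # 2
```

Both allocations have the same node count, but only the first keeps all traffic inside a single NVLink domain, which is the property a block-aware scheduler tries to preserve.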
How do cluster segmentation and job scheduling work on GB200 NVL72?
As clusters grow in scale and complexity, managing GPU resources becomes critical for achieving both high utilization and predictable performance. The GB200 NVL72 system introduces larger AI job segment sizes and fine-grained scheduling control, enabling operators to align segment configurations with workload needs. Together with GB200 NVL72-aware scheduling extensions in the Slurm workload manager, this approach balances large and small jobs to maximize efficiency even in the presence of hardware faults.
How does GB200 NVL72 enable larger segment sizes?
In multi-GPU workloads, the job segment size defines the subunit made of nodes that can communicate with each other entirely over NVLink. Figure 1 illustrates how segment number (Y) and segment size (S) are used to define the GPUs assigned to a specific job. GPUs per node (G) is always four for GB200 and GB300.
In prior systems, such as NVIDIA HGX H100, jobs were limited to a segment size of one node. The GB200 NVL72 system supports much larger segment sizes (up to 18 nodes) while still efficiently supporting segments as small as a single node.
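The relationship between Y, S, and G can be written as a small Python sketch. It assumes the job's GPU count is simply the product Y × S × G, which is the natural reading of the definitions above. The function names are hypothetical.

```python
GPUS_PER_NODE = 4  # G: fixed at four for GB200 and GB300, per the text


def job_gpus(num_segments: int, segment_size: int,
             gpus_per_node: int = GPUS_PER_NODE) -> int:
    """Total GPUs for a job with Y segments of S nodes each (assumed Y * S * G)."""
    return num_segments * segment_size * gpus_per_node


def segments_needed(total_gpus: int, segment_size: int,
                    gpus_per_node: int = GPUS_PER_NODE) -> int:
    """Number of segments (Y) for a job of `total_gpus` with segment size S."""
    gpus_per_segment = segment_size * gpus_per_node
    if total_gpus % gpus_per_segment != 0:
        raise ValueError("job size is not a whole number of segments")
    return total_gpus // gpus_per_segment


print(job_gpus(num_segments=8, segment_size=16))        # 512
print(segments_needed(total_gpus=16, segment_size=1))   # 4
```

For instance, a 512-GPU job with 16-node segments needs eight segments, while a 16-GPU job with single-node segments needs four.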
The optimal segment size for a given application is determined by factors such as model type and the combination of parallelism types used for training. Generally, larger jobs (those utilizing more GPUs) and those with high I/O bandwidth requirements—mixture-of-experts (MoE) training, for example—benefit from larger segment sizes. Conversely, smaller jobs typically have lower I/O bandwidth needs and should use a smaller segment size to prevent over-constraining the cluster scheduler. Users should validate this guidance for their specific workloads if unsure, as performance effects can be workload-specific.
What are best practices for GB200 NVL72 segment sizing?
In modeling, our team found a few general guidelines for maximizing GB200 NVL72 cluster utilization. As a rule of thumb, choose the critical job size (the size at which jobs start using the “large” segment size of 16 nodes) so that those jobs account for no more than 90% of the GPU hours in the cluster. This gives the scheduler the flexibility to fully utilize the cluster with a good mix of segment sizes. Table 1 summarizes some of the recommended configurations.
| Job size | Segment size | Example workloads |
| --- | --- | --- |
| 128 | 16 | MoE model training |
| 32 – 64 | 4 | Large dense model training |
| Less than 32 | 1 | Smaller model training |
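One way to apply the 90% rule of thumb is to check it against a job history. The Python sketch below assumes that jobs at or above a GPU-count threshold use the large segment size, and it picks the smallest candidate threshold that keeps their share of GPU hours at or below 90%. The job list, the candidate thresholds, and the function names are hypothetical illustrations, not data from the post.

```python
from typing import List, Optional, Tuple

Job = Tuple[int, float]  # (GPUs requested, runtime in hours)


def large_segment_share(jobs: List[Job], threshold_gpus: int) -> float:
    """Fraction of total GPU hours consumed by jobs at or above the threshold."""
    total = sum(gpus * hours for gpus, hours in jobs)
    large = sum(gpus * hours for gpus, hours in jobs if gpus >= threshold_gpus)
    return large / total if total else 0.0


def pick_threshold(jobs: List[Job], candidates: List[int],
                   max_share: float = 0.90) -> Optional[int]:
    """Smallest candidate job size whose large-segment share is <= max_share."""
    for threshold in sorted(candidates):
        if large_segment_share(jobs, threshold) <= max_share:
            return threshold
    return None


# Hypothetical workload, for illustration only.
history = [(512, 100.0), (128, 50.0), (64, 40.0), (16, 200.0), (8, 100.0)]

print(pick_threshold(history, candidates=[8, 16, 64, 128, 512]))  # 128
```

With this made-up history, a threshold of 64 GPUs would put about 94% of GPU hours in large segments, so the sketch moves up to 128 GPUs, where the share drops just under 90%.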
Note that, for the purposes of this post, we assume user jobs prefer to run with power-of-two segment sizes (for example, 4 nodes = 16 GPUs). It is also possible to choose other segment sizes (12, 36, or 72 GPUs per segment, for example). To decide whether an alternate approach makes sense, study the efficiency of your jobs when mapped across a non-power-of-two segment size, and the effect on overall cluster utilization for jobs of different sizes.
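The trade-off between power-of-two and other segment sizes comes down to simple arithmetic within an 18-node domain. The Python sketch below counts how many whole segments of a given size fit in one domain and how many nodes are left over. It only shows the packing arithmetic and says nothing about job efficiency. The function name `packing` is hypothetical.

```python
from typing import Tuple

NODES_PER_DOMAIN = 18  # one NVL72 rack: 72 GPUs at 4 GPUs per node
GPUS_PER_NODE = 4


def packing(segment_nodes: int) -> Tuple[int, int]:
    """Return (whole segments per domain, leftover nodes) for a segment size."""
    if not 1 <= segment_nodes <= NODES_PER_DOMAIN:
        raise ValueError("segment size must be between 1 and 18 nodes")
    fit = NODES_PER_DOMAIN // segment_nodes
    leftover = NODES_PER_DOMAIN - fit * segment_nodes
    return fit, leftover


# Power-of-two sizes first, then the 12-, 36-, and 72-GPU alternatives.
for seg in (1, 2, 4, 8, 16, 3, 9, 18):
    fit, leftover = packing(seg)
    print(f"{seg:>2} nodes ({seg * GPUS_PER_NODE:>2} GPUs): "
          f"{fit} per domain, {leftover} nodes left over")
```

Segment sizes of 4, 8, and 16 nodes each leave two nodes per domain for smaller jobs, while sizes of 3, 9, and 18 nodes (12, 36, and 72 GPUs) divide the domain evenly.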
How to schedule jobs on GB200 NVL72 systems
NVIDIA and SchedMD have developed block scheduling extensions built on Slurm that enable GB200 NVL72-aware job placement for high utilization.
With power-of-two segment sizes, a GB200 NVL72 cluster can run large and small jobs side by side: for example, one 512-GPU job using 16-node segments alongside several 16-GPU jobs using single-node segments. These scheduling policies minimize fragmentation while maintaining high efficiency across the cluster.
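As a rough illustration of how a user might request these shapes, the Python sketch below assembles an `sbatch` command line that uses the `--segment` option, which Slurm documents for block topology. Availability depends on the Slurm version and on the cluster being configured with the block topology plugin. The script name `train.sh`, the helper name, and the requirement that the node count be a multiple of the segment size are assumptions of this sketch.

```python
import shlex


def sbatch_command(total_gpus: int, segment_nodes: int, script: str,
                   gpus_per_node: int = 4) -> str:
    """Build an sbatch command line for a segmented job (illustrative only)."""
    if total_gpus % gpus_per_node != 0:
        raise ValueError("GPU count must be a whole number of nodes")
    nodes = total_gpus // gpus_per_node
    # Assumption of this sketch: request only whole segments.
    if nodes % segment_nodes != 0:
        raise ValueError("node count must be a multiple of the segment size")
    args = [
        "sbatch",
        f"--nodes={nodes}",
        f"--gpus-per-node={gpus_per_node}",
        f"--segment={segment_nodes}",
        script,
    ]
    return shlex.join(args)


# train.sh is a hypothetical batch script.
print(sbatch_command(512, 16, "train.sh"))
print(sbatch_command(16, 1, "train.sh"))
```

The first call yields a 128-node request split into 16-node segments, and the second a 4-node request with single-node segments, matching the mix described above.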
What is the GB200 NVL72 scheduling simulation framework?
To evaluate scheduling strategies at scale, we developed a standalone Slurm simulator that runs on a virtual machine and enables time-accelerated workload simulation. As shown in Figure 2, this simulator provides accurate and repeatable results by:
- Running the Slurm code
- Replaying production workloads or generating synthetic workloads
- Simulating real-world conditions, including node failures and recoveries
- Integrating with the metrics system for direct comparison of results
This setup provides significant leverage to test, compare, and confidently roll out new scheduling policies before deploying them in production.
Simulation parameters
Parameters of the simulation environment the team modeled include:
- Cluster capacity: 5,000 GB200 NVL72 nodes (20,000 GPUs)
- Workload: 15,000 jobs over a seven-day period
- Reliability: Average of 2.5% of nodes down at any given time
The team evaluated performance using a Large_Perf_Custom policy, designed to balance utilization and large job performance:
- Jobs with 32 nodes or more ran with a segment size of 16
- Smaller jobs ran with a segment size of two
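The parameters and policy above can be restated as a short Python sketch. The segment rule follows the two bullets directly. The post does not say how single-node jobs are handled, so capping the segment size at the job size is an assumption here, as is the function name.

```python
GPUS_PER_NODE = 4
CLUSTER_NODES = 5_000
DOWN_FRACTION = 0.025  # average share of nodes down at any given time


def large_perf_custom_segment(job_nodes: int) -> int:
    """Segment size (in nodes) under the Large_Perf_Custom policy as described."""
    if job_nodes < 1:
        raise ValueError("a job needs at least one node")
    if job_nodes >= 32:
        return 16
    # Assumption: a one-node job cannot use a two-node segment, so cap it.
    return min(2, job_nodes)


cluster_gpus = CLUSTER_NODES * GPUS_PER_NODE        # 20,000 GPUs
nodes_down = round(CLUSTER_NODES * DOWN_FRACTION)   # 125 nodes on average

print(cluster_gpus, nodes_down)
print([large_perf_custom_segment(n) for n in (1, 8, 31, 32, 128)])
```

On average, about 125 nodes (500 GPUs) are unavailable at any time, which the scheduler has to work around when assembling 16-node segments.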
What do the simulation results show?
To evaluate the performance of the new scheduling strategies, we focused on two primary cluster metrics: fragmentation of blocks and overall GPU occupancy.
Fragmentation analysis
A key metric for GB200 NVL72 scheduling is how small jobs impact NVLink domain availability for large jobs. The simulator tracked how small jobs (1-18 nodes) were placed within each NVLink domain.
The key finding was that the topology plugin effectively placed small jobs on the last two nodes of each domain, minimizing fragmentation and preserving capacity for larger jobs.
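A simple way to see why this placement matters is to count how many domains can still host a 16-node segment after small jobs have been placed. The Python sketch below is one possible fragmentation measure, not the metric used in the simulator. It assumes any free nodes within a domain can form a segment, and the two placements are hypothetical.

```python
from typing import List

NODES_PER_DOMAIN = 18
LARGE_SEGMENT = 16  # nodes


def domains_available(small_nodes_per_domain: List[int],
                      segment: int = LARGE_SEGMENT) -> int:
    """Count domains that still have room for one segment of `segment` nodes."""
    return sum(
        1 for used in small_nodes_per_domain
        if NODES_PER_DOMAIN - used >= segment
    )


# Four 2-node jobs across four domains, placed two different ways.
one_per_domain = [2, 2, 2, 2]  # each job takes the two spare nodes of a domain
stacked = [4, 4, 0, 0]         # jobs doubled up in the first two domains

print(domains_available(one_per_domain))  # 4
print(domains_available(stacked))         # 2
```

Using only the two nodes that a 16-node segment leaves over keeps every domain available for a large segment, whereas stacking small jobs in a domain takes it out of the pool.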
Occupancy metrics
While topology-aware scheduling introduces constraints, our results showed that an optimal topology-aware scheduling implementation can almost entirely eliminate their impact on overall occupancy. Figure 5 shows only a ~1% difference between Large_Perf_Custom and noTopo, and adding more small jobs can narrow the gap further.
We compared occupancy under the Large_Perf_Custom policy we developed against a noTopo policy. The noTopo configuration represents the best theoretical occupancy possible given the job size distribution, ignoring the large runtime penalties that its poor placement would cause. The practical goal is to get as close as possible to noTopo occupancy while avoiding the performance penalties of topology-naive scheduling.
Results show that our simulation achieved occupancy within roughly 1% of noTopo, demonstrating that topology-aware scheduling can deliver high utilization without sacrificing performance.
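For readers who want to reproduce this kind of comparison on their own data, the Python sketch below computes occupancy and the gap between two policies. It assumes occupancy means allocated GPU hours divided by available GPU hours, which the post does not define explicitly. The capacity figures follow from the simulation parameters, but the two busy-hour values are placeholders chosen for illustration, not simulation results.

```python
def occupancy(busy_gpu_hours: float, available_gpu_hours: float) -> float:
    """Assumed definition: allocated GPU hours over available GPU hours."""
    if available_gpu_hours <= 0:
        raise ValueError("available GPU hours must be positive")
    return busy_gpu_hours / available_gpu_hours


def gap_points(topo: float, no_topo: float) -> float:
    """Occupancy gap between two policies, in percentage points."""
    return (no_topo - topo) * 100


HOURS = 7 * 24                           # seven-day window
capacity = 20_000 * HOURS                # 3,360,000 GPU hours
available = capacity * (1 - 0.025)       # minus the average 2.5% of nodes down

# Placeholder busy hours, for illustration only (not measured values).
no_topo_busy = 3_112_200.0
topo_busy = 3_079_440.0

gap = gap_points(occupancy(topo_busy, available),
                 occupancy(no_topo_busy, available))
print(f"gap: {gap:.2f} percentage points")
```

Tracking this gap over time is one way to tell whether segment-size policies need adjusting as the job mix changes.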
What is the best job scheduling approach for GB200 NVL72?
Based on our simulation results and performance testing, we recommend a scheduling approach for NVIDIA GB200 NVL72 clusters that prioritizes large job performance while maintaining high utilization. Large jobs of 64 GPUs or more should be given access to the maximum number of NVLink domains, using segment sizing to ensure proportional GPU allocation across domains. Segment-based scheduling is essential for aligning resources with workload patterns. For jobs of 32 nodes or more, a segment size of 16 is recommended if the application can benefit from it, while smaller jobs are better suited to segment sizes of two to eight, depending on workload characteristics.
To maintain efficiency over time, it is important to monitor and optimize continuously. Tracking fragmentation metrics, adjusting segment sizes as workload patterns evolve, and validating changes with simulation tools before production deployment can help sustain high utilization without sacrificing performance. While block topology can introduce constraints that reduce occupancy, applying strategic scheduling policies can mitigate this effect and preserve performance benefits.
Get started with NVIDIA GB200 NVL72
The NVIDIA GB200 NVL72 system represents a major advancement in AI and HPC computing, and unlocking its full potential requires topology-aware scheduling. Our modeling demonstrates that, with simple configuration and segment-based scheduling, it is possible to achieve optimal performance while maintaining high cluster utilization. The ability to simulate different scheduling scenarios further enables confident deployment of new policies without risking production workloads. Learn more about NVIDIA GB200 NVL72.
About the Authors
Sachin Lakharia is a principal software engineer at NVIDIA, where he leads multiple projects focused on scheduling, resource management, and data management for large-scale GPU infrastructure. His work supports the efficient operation of critical ML workloads across high-performance computing environments. Previously he has held several senior engineering roles at Meta Platforms (Facebook), including leading AI infrastructure resource management and data infrastructure initiatives. With over a decade of experience in building and scaling distributed systems, Sachin brings deep expertise in infrastructure, machine learning platforms, and resource optimization at hyperscale.
Vipin Sirohi is a principal HPC architect at NVIDIA with over a decade of experience in HPC and EDA infrastructure. He has deep expertise in large-scale workload management and has successfully managed enterprise scheduling platforms including LSF and Slurm. In his current role, he plays a pivotal role in architecting and overseeing scheduler operations for NVIDIA internal supercomputing environments, helping optimize performance, reliability, and utilization at scale.
Petr Lapukhov works in the GPU architecture and engineering team at NVIDIA with a focus on system and network evolution in the AI era.
Dheevatsa Mudigere is a senior distinguished engineer in the NVIDIA Compute Architecture group, focusing on the application-driven co-design of large-scale AI systems. He and his team work on understanding current and future AI applications and developing HW/SW technology to enable more capable and efficient AI systems. Before NVIDIA, he worked on designing, building, and deploying production hyperscale AI systems.
Eduardo Alvarez is a senior technical lead at NVIDIA, where he focuses on AI inference at scale, performance optimization, workload economic analysis, and application enablement. He has a deep background in AI systems engineering, workload optimization, and accelerated computing—focused on translating innovations into real-world applications. Before NVIDIA, Eduardo held engineering roles at various semiconductor and energy tech companies.
As senior director of Engineering at NVIDIA, Mohamed Fawzy brings over nine years of experience leading large-scale machine learning and data infrastructure initiatives. His current role focuses on advancing AI platforms and infrastructure to optimize machine learning pipelines, improve developer productivity, and support innovative AI solutions. His expertise includes managing geo-distributed teams and scaling systems for complex AI and data needs.