NVIDIA Developer Blog · May 21, 2026 · 12 min read

Unlock Exascale Performance on NVIDIA GB200 NVL72 with Slurm Topology-Aware Job Scheduling


By Sachin Lakharia, Vipin Sirohi, Petr Lapukhov, Dheevatsa Mudigere, Eduardo Alvarez and Mohamed Fawzy

AI-Generated Summary

NVIDIA GB200 NVL72 delivers exascale compute in a single rack with 72 Blackwell GPUs interconnected by NVLink, providing 130 TB/s of low-latency bandwidth, enabling real-time trillion-parameter AI models and significant performance gains across AI training and inference workloads.
Slurm's new topology/block plugin, co-developed by NVIDIA and SchedMD for version 23.11, enables topology-aware job scheduling that aligns jobs with NVL72 domain boundaries, minimizing fragmentation and optimizing GPU occupancy in cluster environments.
Larger job segment sizes (up to 18 nodes) on GB200 NVL72 allow efficient grouping of GPUs communicating entirely over NVLink, benefiting high I/O workloads like mixture-of-experts training, while smaller jobs use smaller segment sizes to avoid scheduler constraints.
Scheduling simulations using a 5,000-node GB200 NVL72 cluster model showed that topology-aware scheduling places small jobs strategically within domains to reduce fragmentation and achieves GPU occupancy within 1% of theoretical maximum, maintaining high utilization without performance loss.
Recommended scheduling policies prioritize large jobs (64 GPUs) with segment sizes of 16 nodes to maximize NVLink domain usage, while smaller jobs use segment sizes of 2 to 8 nodes, enabling efficient resource alignment and cluster utilization.
Continuous monitoring of fragmentation and segment size adjustment, supported by simulation tools, is essential to sustain optimal performance and utilization in NVIDIA GB200 NVL72 clusters over time.

AI-generated content may summarize information incompletely. Verify important information.

As AI models grow in scale and complexity, realizing the full performance of modern accelerated infrastructure depends as much on how workloads are placed as on the hardware itself. NVIDIA GB200 NVL72 delivers exascale compute in a single rack, unlocking real-time trillion-parameter models. Yet capturing that performance in a shared cluster requires schedulers that understand the system architecture and align jobs with its network topology.

This post explains how Slurm topology-aware job scheduling works on NVIDIA GB200 NVL72, and provides scheduling recommendations for optimal GPU occupancy.

How does NVIDIA GB200 NVL72 deliver exascale compute?

NVIDIA GB200 NVL72 is an exascale computer in a single rack. With 72 NVIDIA Blackwell GPUs interconnected by the largest production scale-up compute fabric, NVIDIA NVLink provides 130 terabytes per second (TB/s) of low-latency GPU communication bandwidth for AI and high-performance computing (HPC) workloads. Multiple GB200 NVL72 systems combined in a cluster create a hierarchical network topology with large domains of very high networking bandwidth.

An AI training job can benefit greatly from the networking bandwidth of GB200 NVL72 when it is scheduled to maximize use of the NVLink fabric. Recent results show that GB200 NVL72 delivers significant performance improvements across AI workloads, including training (>2.6x in recent MLPerf Training results), inference (real-time inference for trillion-parameter models, >1.5 million tokens/second for the OpenAI gpt-oss model, and state-of-the-art disaggregated serving), and reasoning.

In a shared cluster running multiple training jobs, a resource-efficient scheduler must account for varying network bandwidth requirements.

What is topology-aware job scheduling?

Topology-aware job scheduling allows a job scheduler such as Slurm to make resource allocation decisions based on the cluster’s physical network layout, such as the hierarchy of switches and racks. The scheduler should preserve locality, keeping workloads within the same NVLink domain whenever possible. In addition, because multiple training or inference jobs can fit in a group of NVL72 racks, the scheduler must provide efficient bin-packing to avoid resource fragmentation.
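To make the locality goal concrete, here is a minimal Python sketch that counts how many NVLink domains a candidate allocation touches. It assumes consecutively numbered nodes packed 18 per domain (18 nodes × 4 GPUs = 72 GPUs); the function names and the numbering scheme are illustrative and are not part of Slurm.

```python
NODES_PER_DOMAIN = 18  # one GB200 NVL72 rack: 18 nodes x 4 GPUs = 72 GPUs


def domain_of(node_index: int) -> int:
    """Map a 0-based node index to its NVLink domain (assumes contiguous numbering)."""
    return node_index // NODES_PER_DOMAIN


def domains_spanned(node_indices: list) -> int:
    """Count the distinct NVLink domains touched by an allocation."""
    return len({domain_of(n) for n in node_indices})


local_job = list(range(0, 8))    # nodes 0-7, all inside domain 0
split_job = list(range(14, 22))  # nodes 14-21, straddling domains 0 and 1

print(domains_spanned(local_job))  # 1: traffic can stay on NVLink
print(domains_spanned(split_job))  # 2: some traffic must leave the NVLink domain
```

An eight-node job fits in one domain, so a topology-aware scheduler would prefer the first placement over the second.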

The longstanding Slurm topology/tree plugin provides topology-aware scheduling for large clusters, but its best-effort approach often fragments jobs across leaf switches to reduce queue time. While this compromise between start time and performance was acceptable for traditional InfiniBand fabrics, the advent of rack-scale systems like GB200 NVL72 and GB300 NVL72 necessitated a change. In response, NVIDIA and SchedMD collaborated to launch the new topology/block plugin in Slurm 23.11, specifically designed for these modern architectures.

This topology plugin configuration provides information about groups of nodes belonging to the same NVL72 domain, which enables algorithms that can align Slurm jobs with NVL72 domain boundaries. To learn more about the block topology plugin and how segment sizes are scheduled, see Achieving Peak System and Workload Efficiency on NVIDIA GB200 NVL72 with Slurm Block Scheduling.
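As an illustration of what such a configuration might contain, the following Python sketch generates block definitions for a set of NVL72 domains. The `BlockName`, `Nodes`, and `BlockSizes` keys reflect our understanding of the `topology.conf` format used by the block plugin; the node-name prefix, zero padding, and block names are hypothetical, so verify the output against the documentation for your Slurm version before using it.

```python
def block_topology_lines(num_blocks: int, nodes_per_block: int = 18,
                         prefix: str = "node") -> list:
    """Return topology.conf-style lines, one block per NVL72 domain.

    Node names such as node[001-018] are placeholders for illustration.
    """
    lines = []
    for b in range(num_blocks):
        first = b * nodes_per_block + 1
        last = first + nodes_per_block - 1
        lines.append(
            f"BlockName=block{b + 1:02d} Nodes={prefix}[{first:03d}-{last:03d}]"
        )
    # Base block size: the number of nodes in one NVLink domain.
    lines.append(f"BlockSizes={nodes_per_block}")
    return lines


for line in block_topology_lines(2):
    print(line)
```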

How do cluster segmentation and job scheduling work on GB200 NVL72?

As clusters grow in scale and complexity, managing GPU resources becomes critical for achieving both high utilization and predictable performance. The GB200 NVL72 system introduces larger AI job segment sizes and fine-grained scheduling control, enabling operators to align segment configurations with workload needs. Together with GB200 NVL72-aware scheduling extensions in the Slurm workload manager, this approach balances large and small jobs to maximize efficiency even in the presence of hardware faults.

How does GB200 NVL72 enable larger segment sizes?

In multi-GPU workloads, the job segment size defines the subunit made of nodes that can communicate with each other entirely over NVLink. Figure 1 illustrates how segment number (Y) and segment size (S) are used to define the GPUs assigned to a specific job. GPUs per node (G) is always four for GB200 and GB300.

A diagram visually explains how GPUs are allocated to a job within the NVIDIA GB200 NVL72 system. The left side shows a 4x4 grid of green GPU blocks representing the total GPUs assigned to a job (X). This is broken down into an equation: 2 segments (Y) × 2 nodes per segment (S) × 4 GPUs per node (G), totaling 16 GPUs. The green blocks are grouped and labeled in a way that helps illustrate how scaling works across segments and nodes, all interconnected over NVLink for high-speed communication. The design emphasizes modular scalability within GB200 and GB300 architectures. — *Figure 1. GB200 NVL72 job size enables larger, scalable GPU groupings over NVLink*

In prior systems, such as NVIDIA HGX H100, jobs were limited to a segment size of one node. The GB200 NVL72 system supports much larger segment sizes (up to 18 nodes) while still efficiently supporting segments as small as a single node.
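The relationship between job size, segment count, and segment size can be written as X = Y × S × G. A short Python sketch of that arithmetic follows; the function name and the range check are illustrative.

```python
GPUS_PER_NODE = 4       # G: fixed for GB200 and GB300
MAX_SEGMENT_NODES = 18  # a segment cannot exceed one NVL72 domain


def job_gpus(segments: int, segment_size: int) -> int:
    """Total GPUs in a job: X = Y * S * G."""
    if not 1 <= segment_size <= MAX_SEGMENT_NODES:
        raise ValueError("segment size must be between 1 and 18 nodes")
    return segments * segment_size * GPUS_PER_NODE


print(job_gpus(2, 2))   # Figure 1 example: 2 x 2 x 4 = 16 GPUs
print(job_gpus(8, 16))  # 8 segments of 16 nodes = 512 GPUs
```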

The optimal segment size for a given application is determined by factors such as model type and the combination of parallelism types used for training. Generally, larger jobs (those utilizing more GPUs) and those with high I/O bandwidth requirements—mixture-of-experts (MoE) training, for example—benefit from larger segment sizes. Conversely, smaller jobs typically have lower I/O bandwidth needs and should use a smaller segment size to prevent over-constraining the cluster scheduler. Users should validate this guidance for their specific workloads if unsure, as performance effects can be workload-specific.

What are best practices for GB200 NVL72 segment sizing?

In modeling, our team found a few general guidelines for maximizing GB200 NVL72 cluster utilization. A rule of thumb is to set the job-size threshold above which jobs use the "large" segment size of 16 nodes so that those jobs account for no more than 90% of the GPU hours in the cluster. This gives the scheduler the flexibility to fully utilize the cluster with a good mix of segment sizes. Table 1 summarizes some of the recommended configurations.

Job size	Segment size (nodes)	Example workloads
128	16	MoE model training
32 – 64	4	Large dense model training
Less than 32	1	Smaller model training

Table 1. Recommended GB200 NVL72 segment sizes by job size and workload type
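One way to check the 90% rule of thumb against a workload is to compute the share of GPU hours consumed by jobs that request the large segment size. The sketch below does this for a small made-up job list; the job records and helper name are hypothetical and only illustrate the arithmetic.

```python
GPUS_PER_NODE = 4
LARGE_SEGMENT = 16      # nodes
MAX_LARGE_SHARE = 0.90  # rule-of-thumb ceiling on GPU hours


def large_segment_share(jobs: list) -> float:
    """jobs: list of (nodes, hours, segment_size) tuples.

    Returns the fraction of GPU hours used by jobs with the large segment size.
    """
    total = sum(n * GPUS_PER_NODE * h for n, h, _ in jobs)
    large = sum(n * GPUS_PER_NODE * h for n, h, s in jobs if s == LARGE_SEGMENT)
    return large / total if total else 0.0


# Hypothetical workload mix, not taken from the simulation.
jobs = [
    (128, 10.0, 16),  # 5,120 GPU hours
    (32, 5.0, 16),    #   640 GPU hours
    (8, 20.0, 4),     #   640 GPU hours
    (2, 50.0, 1),     #   400 GPU hours
]

share = large_segment_share(jobs)
print(f"{share:.1%} of GPU hours use 16-node segments")
print("within rule of thumb:", share <= MAX_LARGE_SHARE)
```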

Note that, for the purposes of this post, we assume user jobs prefer to run with power-of-two segment sizes (for example, 4 nodes = 16 GPUs). It is also possible to choose other segment sizes (12, 36, or 72 GPUs per segment, for example). To decide whether an alternate approach makes sense, study the efficiency of your jobs when mapped across a non-power-of-two segment size, and the effect on overall cluster utilization for jobs of different sizes.
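As a starting point for that comparison, the following sketch shows how many segments of a given size fit in one 18-node domain and how many nodes are left over. It considers only the packing arithmetic, not application efficiency, and the helper name is illustrative.

```python
NODES_PER_DOMAIN = 18
GPUS_PER_NODE = 4


def domain_fit(segment_nodes: int) -> tuple:
    """Return (segments per domain, leftover nodes) for one NVL72 domain."""
    return divmod(NODES_PER_DOMAIN, segment_nodes)


for segment_nodes in (4, 8, 16, 3, 9, 18):
    segments, leftover = domain_fit(segment_nodes)
    print(
        f"{segment_nodes * GPUS_PER_NODE:>2} GPUs/segment: "
        f"{segments} segment(s) per domain, {leftover} node(s) left over"
    )
```

Under these assumptions, segment sizes of 4, 8, and 16 nodes each leave two nodes per domain for smaller segments to fill, while 3, 9, and 18 nodes divide the domain evenly.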

How to schedule jobs on GB200 NVL72 systems

NVIDIA and SchedMD have developed block scheduling extensions built on Slurm that enable GB200 NVL72-aware job placement for high utilization.

With power-of-two segment sizes, a GB200 NVL72 cluster can run large and small jobs side by side. For example, one 512-GPU job using 16-node segments can run alongside several 16-GPU jobs using single-node segments. These scheduling policies minimize fragmentation while maintaining high efficiency across the cluster.
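The side-by-side example can be worked through numerically. This sketch assumes empty, fully healthy 18-node domains and job sizes that divide evenly; the function name and returned fields are hypothetical.

```python
import math

GPUS_PER_NODE = 4
NODES_PER_DOMAIN = 18


def side_by_side(large_gpus: int, segment_nodes: int, small_gpus: int) -> dict:
    """Count the small jobs that fit in the nodes a large job leaves unused."""
    large_nodes = large_gpus // GPUS_PER_NODE
    segments = large_nodes // segment_nodes
    segments_per_domain = NODES_PER_DOMAIN // segment_nodes
    domains = math.ceil(segments / segments_per_domain)
    spare_nodes = domains * NODES_PER_DOMAIN - large_nodes
    small_nodes = small_gpus // GPUS_PER_NODE
    return {
        "segments": segments,
        "domains": domains,
        "spare_nodes": spare_nodes,
        "small_jobs": spare_nodes // small_nodes,
    }


plan = side_by_side(large_gpus=512, segment_nodes=16, small_gpus=16)
print(plan)
```

In this case the 512-GPU job occupies eight 16-node segments in eight domains, leaving two nodes per domain. Those 16 spare nodes can host four 16-GPU jobs when the small jobs use single-node segments.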

What is the GB200 NVL72 scheduling simulation framework?

To evaluate scheduling strategies at scale, we developed a standalone Slurm simulator that runs on a virtual machine and enables time-accelerated workload simulation. As shown in Figure 2, this simulator provides accurate and repeatable results by:

Running the Slurm code
Replaying production workloads or generating synthetic workloads
Simulating real-world conditions, including node failures and recoveries
Integrating with the metrics system for direct comparison of results

This setup provides significant leverage to test, compare, and confidently roll out new scheduling policies before deploying them in production.

Diagram showing a flowchart consisting of six labeled boxes connected by green arrows. From left to right, the first box is "Production Cluster" which flows into a second box labeled "Data." The "Data" box then connects to a third box labeled with “Slurm Versions” and “Configs.” Below, two boxes labeled “Production Metrics” and “Simulator Metrics” are positioned under the first and third boxes respectively. A horizontal green arrow labeled “Compare” links these two metric boxes, indicating the comparison step in the simulation process. — *Figure 2. Real and simulated metrics are compared across production and test environments in the Slurm simulator flow*

Simulation parameters

Parameters of the simulation environment the team modeled include:

Cluster capacity: 5,000 GB200 NVL72 nodes (20,000 GPUs)
Workload: 15,000 jobs over a seven-day period
Reliability: Average of 2.5% of nodes down at any given time
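For a sense of scale, these parameters imply the following average capacity. This is a back-of-the-envelope estimate derived from the stated averages, not a result reported by the simulator.

```python
GPUS_PER_NODE = 4
TOTAL_NODES = 5_000
AVG_DOWN_PER_MILLE = 25  # 2.5% of nodes down on average
SIM_DAYS = 7

down_nodes = TOTAL_NODES * AVG_DOWN_PER_MILLE // 1000
healthy_nodes = TOTAL_NODES - down_nodes
healthy_gpus = healthy_nodes * GPUS_PER_NODE
available_gpu_hours = healthy_gpus * 24 * SIM_DAYS

print(down_nodes)           # 125 nodes down on average
print(healthy_gpus)         # 19,500 GPUs available on average
print(available_gpu_hours)  # 3,276,000 GPU hours over seven days
```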

The bar chart titled "Simulation Job Distribution" shows five job size categories on the x-axis labeled as Small (≤2), Medium (3–15), Large (16–64), XLarge (65–256), and XXLarge (>256). Each category has two bars: a gray bar for percentage of jobs and a green bar for percentage of node hours. The Large category has the highest values for both bars, with the green bar (node hours) slightly exceeding the gray one. XLarge also has a high node-hour percentage but a lower job count. Small and Medium categories show high job count but low node hours. The y-axis represents the percentage from 0 to 50. — *Figure 3. Job distribution across node count buckets showing percentage of total jobs versus percentage of total node hours*

The team evaluated performance using a Large_Perf_Custom policy, designed to balance utilization and large job performance:

Jobs with 32 nodes or more ran with a segment size of 16
Smaller jobs ran with a segment size of two
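The two rules above can be expressed as a small helper that maps a job's node count to a segment size and builds a matching submission command. This is a sketch: it assumes the `--segment` option of `sbatch` as we understand it for block topology (verify it against your Slurm version), the script name is hypothetical, and capping the segment at the job size for one-node jobs is our own assumption rather than part of the policy.

```python
import shlex

LARGE_JOB_NODES = 32  # threshold used by the Large_Perf_Custom policy
LARGE_SEGMENT = 16
SMALL_SEGMENT = 2


def policy_segment_size(job_nodes: int) -> int:
    """Segment size (in nodes) for a job under the simulated policy."""
    if job_nodes < 1:
        raise ValueError("job must use at least one node")
    if job_nodes >= LARGE_JOB_NODES:
        return LARGE_SEGMENT
    # Assumption: a segment cannot be larger than the job itself.
    return min(SMALL_SEGMENT, job_nodes)


def sbatch_command(job_nodes: int, script: str) -> str:
    """Build an sbatch command line requesting the policy's segment size."""
    segment = policy_segment_size(job_nodes)
    return f"sbatch --nodes={job_nodes} --segment={segment} {shlex.quote(script)}"


print(sbatch_command(64, "train.sh"))
print(sbatch_command(8, "train.sh"))
```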

What do the simulation results show?

To evaluate the performance of the new scheduling strategies, we focused on two primary cluster metrics: fragmentation of blocks and overall GPU occupancy.

Fragmentation analysis

A key metric for GB200 NVL72 scheduling is how small jobs impact NVLink domain availability for large jobs. The simulator tracked how small jobs (1-18 nodes) were placed within each NVLink domain.

The key finding was that the topology plugin effectively placed small jobs on the last two nodes of each domain, minimizing fragmentation and preserving capacity for larger jobs.

The heat map titled "Heat map: Large_Perf Job Distribution" displays the percentage distribution of small jobs across nodes. The x-axis lists node indices from N1 to N18, and the y-axis lists job sizes including 2, 4, and 8 nodes. The color scale ranges from dark purple (low percentage) to bright yellow (high percentage), with the brightest concentrations located at nodes N17 and N18. This indicates that most small jobs were placed at the last two nodes of each domain, helping reduce fragmentation. — *Figure 4. Heat map showing concentrated placement of small jobs on the last two nodes of each domain to minimize fragmentation*
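To see why using the last two nodes of each domain preserves capacity, consider a toy placement model. The sketch below compares a topology-naive first-fit placer with a simple heuristic that avoids consuming a domain's 16-node segment when it can. The heuristic is a stand-in written for illustration; it is not the algorithm used by the topology/block plugin.

```python
SEGMENT = 16  # large-job segment size in nodes


def large_capable(free: list) -> int:
    """Number of domains that can still host one 16-node segment."""
    return sum(1 for f in free if f >= SEGMENT)


def place_first_fit(free: list, size: int):
    """Topology-naive: use the first domain with enough free nodes."""
    for i, f in enumerate(free):
        if f >= size:
            free[i] -= size
            return i
    return None


def place_segment_aware(free: list, size: int):
    """Prefer the domain where placement destroys the fewest 16-node segments."""
    best_key, best_idx = None, None
    for i, f in enumerate(free):
        if f < size:
            continue
        loss = f // SEGMENT - (f - size) // SEGMENT
        key = (loss, f, i)  # tie-break: fullest domain first, then lowest index
        if best_key is None or key < best_key:
            best_key, best_idx = key, i
    if best_idx is not None:
        free[best_idx] -= size
    return best_idx


def run(placer, jobs: list, domains: int = 4, nodes_per_domain: int = 18) -> list:
    """Place each job in order and return the free nodes left per domain."""
    free = [nodes_per_domain] * domains
    for size in jobs:
        placer(free, size)
    return free


small_jobs = [2, 2, 2, 2]  # four 2-node jobs
naive = run(place_first_fit, small_jobs)
aware = run(place_segment_aware, small_jobs)
print(naive, large_capable(naive))  # [10, 18, 18, 18] 3
print(aware, large_capable(aware))  # [16, 16, 16, 16] 4
```

In this toy case, the segment-aware placer puts one small job in the two spare nodes of each domain, so all four domains can still accept a 16-node segment.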

Occupancy metrics

While topology-aware scheduling introduces constraints, our results showed that its impact on overall occupancy can be almost entirely eliminated through an optimal topology-aware scheduling implementation. Figure 5 shows only a ~1% difference between Large_Perf_Custom and NoTopo. The gap can be narrowed further by adding more small jobs to the workload.

Bar chart comparing two scheduling strategies: "Large_Perf_Custom" and "NoTopo." The Large_Perf_Custom bar is gray and shows 94.2% occupancy, while the NoTopo bar is green and shows 95.5% occupancy. Both bars stretch horizontally across a percentage scale from 0% to 100%, indicating high cluster utilization under both policies, with a ~1% difference favoring NoTopo. — *Figure 5. Simulation results show that occupancy increases with flexible segment sizes*

We compared occupancy under the Large_Perf_Custom policy we developed against a NoTopo policy. The NoTopo configuration represents the best theoretical occupancy possible given the job size distribution, ignoring the large runtime penalties that would result from its poor placement. The practical goal is to get as close as possible to NoTopo occupancy while avoiding the performance penalties of topology-naive scheduling.

Results show that our simulation achieved occupancy within roughly 1% of NoTopo, demonstrating that topology-aware scheduling can deliver high utilization without sacrificing performance.
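Using the occupancy values shown in Figure 5, the gap can be computed directly. The variable names below are illustrative.

```python
large_perf_custom = 94.2  # % occupancy, from Figure 5
no_topo = 95.5            # % occupancy, from Figure 5

gap_points = round(no_topo - large_perf_custom, 1)
relative_gap = round(100 * (no_topo - large_perf_custom) / no_topo, 2)

print(gap_points)    # 1.3 percentage points
print(relative_gap)  # 1.36 (% of NoTopo occupancy)
```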

What is the best job scheduling approach for GB200 NVL72?

Based on our simulation results and performance testing, we recommend a scheduling approach for NVIDIA GB200 NVL72 clusters that prioritizes large job performance while maintaining high utilization. Large jobs of 64 GPUs or more should be given access to the maximum number of NVLink domains, using segment sizing to ensure proportional GPU allocation across domains. Segment-based scheduling is essential for aligning resources with workload patterns. For jobs of 32 nodes or more, a segment size of 16 is recommended if the application can benefit from it, while smaller jobs are better suited to segment sizes of two to eight, depending on workload characteristics.

To maintain efficiency over time, it is important to monitor and optimize continuously. Tracking fragmentation metrics, adjusting segment sizes as workload patterns evolve, and validating changes with simulation tools before production deployment can help sustain high utilization without sacrificing performance. While block topology can introduce constraints that reduce occupancy, applying strategic scheduling policies can mitigate this effect and preserve performance benefits.

Get started with NVIDIA GB200 NVL72

The NVIDIA GB200 NVL72 system represents a major advancement in AI and HPC computing, and unlocking its full potential requires topology-aware scheduling. Our modeling demonstrates that, with simple configuration and segment-based scheduling, it is possible to achieve optimal performance while maintaining high cluster utilization. The ability to simulate different scheduling scenarios further enables confident deployment of new policies without risking production workloads. Learn more about NVIDIA GB200 NVL72.

About the Authors

About Sachin Lakharia
Sachin Lakharia is a principal software engineer at NVIDIA, where he leads multiple projects focused on scheduling, resource management, and data management for large-scale GPU infrastructure. His work supports the efficient operation of critical ML workloads across high-performance computing environments. Previously, he held several senior engineering roles at Meta Platforms (Facebook), including leading AI infrastructure resource management and data infrastructure initiatives. With over a decade of experience in building and scaling distributed systems, Sachin brings deep expertise in infrastructure, machine learning platforms, and resource optimization at hyperscale.

About Vipin Sirohi
Vipin Sirohi is a principal HPC architect at NVIDIA with over a decade of experience in HPC and EDA infrastructure. He has deep expertise in large-scale workload management and has successfully managed enterprise scheduling platforms including LSF and Slurm. In his current role, he plays a pivotal role in architecting and overseeing scheduler operations for NVIDIA internal supercomputing environments, helping optimize performance, reliability, and utilization at scale.

About Petr Lapukhov
Petr Lapukhov works in the GPU architecture and engineering team at NVIDIA with a focus on system and network evolution in the AI era.

About Dheevatsa Mudigere
Dheevatsa Mudigere is a senior distinguished engineer in the NVIDIA Compute Architecture group, focusing on the application-driven co-design of large-scale AI systems. He and his team work on understanding current and future AI applications and developing HW/SW technology to enable more capable and efficient AI systems. Before NVIDIA, he worked on designing, building, and deploying production hyperscale AI systems.

About Eduardo Alvarez
Eduardo Alvarez is a senior technical lead at NVIDIA, where he focuses on AI inference at scale, performance optimization, workload economic analysis, and application enablement. He has a deep background in AI systems engineering, workload optimization, and accelerated computing—focused on translating innovations into real-world applications. Before NVIDIA, Eduardo held engineering roles at various semiconductor and energy tech companies.

About Mohamed Fawzy
As senior director of Engineering at NVIDIA, Mohamed Fawzy brings over nine years of experience leading large-scale machine learning and data infrastructure initiatives. His current role focuses on advancing AI platforms and infrastructure to optimize machine learning pipelines, improve developer productivity, and support innovative AI solutions. His expertise includes managing geo-distributed teams and scaling systems for complex AI and data needs.
