NVIDIA CUDA 13.3 Enhances GPU Development with Tile Programming in C++, Compiler Autotuning, and Python Updates
Mirrored from NVIDIA Developer Blog for archival readability. Support the source by reading on the original site.
NVIDIA CUDA 13.3 brings new capabilities and performance optimizations to developers across the CUDA ecosystem. The launch of NVIDIA CUDA Tile programming in C++ enables high-level, tile-based kernel development that automatically manages complex low-level GPU details for optimal performance and portability. CUDA Tile programming is also now supported on Compute Capability 9.0 (NVIDIA Hopper) GPUs, in addition to all other supported GPU architectures.
We are also releasing CUDA Python 1.0, solidifying the support and stability of the CUDA Python software ecosystem and introducing critical features like green contexts and process checkpointing.
For performance enthusiasts, the newly launched NVIDIA CompileIQ compiler auto-tuning framework delivers up to a 15% speedup on critical kernels like GEMM and attention. This release also features official C++23 support in NVCC, expanded tensor interoperability with DLPack/mdspan in CCCL 3.3, and numerous updates to the math libraries (cuBLAS, cuSPARSE, cuSOLVER) and profiling tools (Nsight Compute and Nsight Systems).
Release of CUDA Tile C++
With the release of CUDA 13.3, CUDA Tile support is extended to C++, enabling the large existing C++ codebase and developer base to create highly optimized GPU tile kernels. This model automates parallelism, memory movement, asynchrony, and other low-level details, resulting in C++ code that is portable across NVIDIA GPU architectures. For more information, check out our blog post.
Release of CUDA Python 1.0
CUDA Python is a set of libraries that expose CUDA to the Python programming language. By providing the 1.0 release, we are committing to semantic versioning: breaking API changes occur only in major-version releases, minor releases add features, and patch releases contain bug fixes. Any public API scheduled for removal is first deprecated in a minor release with a clear replacement path.
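The versioning policy above can be expressed as a small upgrade check. The sketch below is illustrative only: parse_version, classify_upgrade, and is_safe_upgrade are hypothetical helpers, not part of CUDA Python, and the code assumes plain MAJOR.MINOR.PATCH strings without pre-release suffixes.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into integers (missing parts default to 0)."""
    parts = [int(p) for p in text.split(".")]
    while len(parts) < 3:
        parts.append(0)
    return parts[0], parts[1], parts[2]


def classify_upgrade(old: str, new: str) -> str:
    """Name the kind of release that separates two versions."""
    o, n = parse_version(old), parse_version(new)
    if n[0] != o[0]:
        return "major"  # breaking API changes are allowed here
    if n[1] != o[1]:
        return "minor"  # new features and deprecations, no removals
    if n[2] != o[2]:
        return "patch"  # bug fixes only
    return "same"


def is_safe_upgrade(old: str, new: str) -> bool:
    """True when 'new' is the same or a later release within one major version."""
    o, n = parse_version(old), parse_version(new)
    return n[0] == o[0] and n >= o


print(classify_upgrade("1.0.0", "1.1.0"))  # minor
print(is_safe_upgrade("1.0.0", "2.0.0"))   # False
```

Under this policy, a dependency pin that allows any later release within the same major version should not encounter removed APIs.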
The following is more information on the software components included in CUDA Python 1.0.
| library | description | next major version |
|---|---|---|
| cuda.bindings | Low-level Python bindings to CUDA C APIs. | 13.3.0 |
| cuda.core | Pythonic access to CUDA Runtime and other core functionality. | 1.0.0 |
| cuda-cccl | Pythonic access to CCCL’s highly efficient and customizable parallel algorithms. | 1.0.0 |
| cuda-pathfinder | Utilities for locating CUDA components installed in the user’s Python environment. | 1.6 |
cuda.coop is also available in the cuda-cccl package under the _experimental namespace, which is subject to API changes. cuda.coop provides reusable block-wide and warp-wide device primitives for use within Numba CUDA kernels.
cuda.core is now stable
cuda.core provides a Pythonic interface to the CUDA runtime, including devices, streams, programs, linkers, memory resources, and graphs. Version 1.0 consolidates APIs that have been stabilizing over the previous release cycles into a single supported surface. At the same time, we added support for green contexts, CUDA checkpointing, and more.
- Green contexts: Split a GPU’s SMs into disjoint partitions, each with its own context and streams, so latency-sensitive kernels are shielded from long-running throughput kernels in the same process.
- Process checkpointing: Snapshot the full CUDA state of a running process (including device allocations, streams, and context) and restore it later. This unlocks CRIU-style workflows for GPU processes: fault-tolerant long jobs, preemption and migration on shared clusters, and fast warm-start of inference workers. Available only on Linux.
- Inter-process sharing (IPC): Share GPU memory across Python processes without copying through the host. One process allocates, and others map the same physical VRAM into their own address space. Ideal for multi-process ML serving and zero-copy producer/consumer pipelines.
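To make the green-context idea concrete, here is a minimal pure-Python model of splitting a device's SMs into disjoint groups. It does not call cuda.core: split_sms is a hypothetical helper, the 132-SM device is an assumed example, and real hardware may impose granularity or minimum-size rules that this sketch ignores.

```python
from typing import List, Sequence


def split_sms(total_sms: int, counts: Sequence[int]) -> List[range]:
    """Carve [0, total_sms) into consecutive, non-overlapping SM ranges."""
    if any(c <= 0 for c in counts):
        raise ValueError("every partition needs at least one SM")
    if sum(counts) > total_sms:
        raise ValueError("partitions request more SMs than the device has")
    groups, start = [], 0
    for c in counts:
        groups.append(range(start, start + c))
        start += c
    return groups


# Hypothetical 132-SM device: 16 SMs reserved for latency-critical work,
# the remainder left for long-running throughput kernels.
total = 132
throughput, critical = split_sms(total, (total - 16, 16))

# Disjointness is what shields one group from the other.
assert set(throughput).isdisjoint(critical)
print(len(throughput), len(critical))  # 116 16
```

Because the two groups never share an SM, a long-running kernel in the throughput group cannot occupy the SMs reserved for the critical group.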
The following are quick examples of how to use cuda.core APIs.
from cuda.core import Device, Stream, Program, ProgramOptions, LaunchConfig, launch
# pick and activate a GPU
dev = Device()
dev.set_current()
# create a CUDA stream
stream = dev.create_stream()
# NVRTC compile + lookup
prog = Program(src, code_type="c++", options=ProgramOptions(arch=f"sm_{dev.arch}"))
kernel = prog.compile("cubin").get_kernel("my_kernel")
# launch a kernel
launch(stream, LaunchConfig(grid=64, block=256), kernel, *args)
# JIT-LTO linking
from cuda.core import Linker, LinkerOptions
module = Linker(
    [obj1, obj2],
    options=LinkerOptions(arch=f"sm_{dev.arch}")
).link("cubin")
# NVRTC precompiled headers
from cuda.core import ProgramOptions
opts = ProgramOptions(std="c++17", arch=f"sm_{dev.arch}", create_pch=True, pch_dir="/tmp/pch")
# Memory resources, incl. NUMA-aware pools
from cuda.core import DeviceMemoryResource, PinnedMemoryResource, PinnedMemoryResourceOptions, ManagedMemoryResource, ManagedMemoryResourceOptions
# NUMA-pinned host memory
pinned = PinnedMemoryResource(PinnedMemoryResourceOptions(numa_id=0))
# CUDA graphs: stream capture and explicit construction
from cuda.core.graph import GraphBuilder, GraphDef
gb = stream.create_graph_builder()
gb.begin_building()
graph = gb.end_building().complete()
graph.launch(stream)
gdef = GraphDef()
gdef.add_kernel_node(kernel, LaunchConfig(grid=64, block=256), args=args)
# IPC: share GPU memory across Python processes
from cuda.core import DeviceMemoryResource, DeviceMemoryResourceOptions
mr = DeviceMemoryResource(
    dev,
    options=DeviceMemoryResourceOptions(max_size=1 << 20, ipc_enabled=True),
)
buffer = mr.allocate(nbytes)  # buffer is picklable and can be sent over mp.Queue
# Green contexts: partition SMs into disjoint groups
from cuda.core import ContextOptions, SMResourceOptions
sm = dev.resources.sm
long_grp, crit_grp = sm.split(SMResourceOptions(count=(sm.sm_count - 16, 16)))[0]
ctx_crit = dev.create_context(ContextOptions(resources=[crit_grp]))
s_crit = ctx_crit.create_stream()
# Process checkpoint / restore (Linux)
import os
from cuda.core import checkpoint
proc = checkpoint.Process(os.getpid())
proc.lock(timeout_ms=5000)
proc.checkpoint()
proc.restore()
proc.unlock()
# device allocations and context are restored
# TMA / TensorMapDescriptor
from cuda.core import StridedMemoryView, TensorMapDescriptor
tmap = StridedMemoryView(tensor).as_tensor_map(box_shape=(128,))
# DLPack-friendly strided views
from cuda.core.utils import StridedMemoryView
view = StridedMemoryView(torch_tensor)
capsule = view.__dlpack__()
# System info (NVML)
from cuda.core import system
print(system.num_devices, system.driver_version)
# cuda.bindings.nvml
from cuda.bindings import nvml
nvml.init()
name = nvml.device_get_name(nvml.device_get_handle_by_index_v2(0))
# cuda.bindings.nvfatbin
from cuda.bindings import nvfatbin
handle = nvfatbin.create()
CCCL Python release 1.0.0: cuda.compute
cuda.compute brings the CUDA Core Compute Libraries (CCCL)’s highly tuned parallel algorithms—sort, scan, reduce, transform, unique, histogram, top-k, and more—to Python as host-callable building blocks. Changes since the last release include:
- Python lambdas can be used as algorithm operators, reducing boilerplate for simple reductions, scans, transforms, and predicates.
- Algorithms support operators with side effects (state), enabling use cases like running accumulators and conditional transforms.
- New cuda.compute.upper_bound and cuda.compute.lower_bound APIs expose CUB’s parallel binary search to Python.
- Consolidated caching across all algorithms for faster repeated invocations.
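import cupy as cp
import numpy as np
import cuda.compute
from cuda.compute import OpKind

# int64 so that the sum 1 + 2 + ... + 1,000,000 = 500000500000 does not overflow
d_input = cp.arange(1, 1_000_001, dtype=cp.int64)
d_output = cp.empty(1, dtype=cp.int64)
h_init = np.array([0], dtype=np.int64)

# Sum reduction with a built-in operator
cuda.compute.reduce_into(
    d_input, d_output, OpKind.PLUS, d_input.size, h_init
)

# Max reduction with a Python lambda as the operator
cuda.compute.reduce_into(
    d_input, d_output,
    lambda a, b: a if a > b else b,
    d_input.size, h_init,
)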
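For the new upper_bound and lower_bound APIs, the results can be described with Python's standard bisect module. The sketch below is a sequential reference on the host, not a call into cuda.compute, and it assumes the APIs follow the usual C++ lower_bound/upper_bound semantics.

```python
from bisect import bisect_left, bisect_right
from typing import List, Sequence


def lower_bound(haystack: Sequence[int], needles: Sequence[int]) -> List[int]:
    """For each needle, the first index whose element is >= the needle."""
    return [bisect_left(haystack, n) for n in needles]


def upper_bound(haystack: Sequence[int], needles: Sequence[int]) -> List[int]:
    """For each needle, the first index whose element is > the needle."""
    return [bisect_right(haystack, n) for n in needles]


sorted_values = [1, 3, 3, 3, 7, 9]
queries = [0, 3, 8, 10]

print(lower_bound(sorted_values, queries))  # [0, 1, 5, 6]
print(upper_bound(sorted_values, queries))  # [0, 4, 5, 6]
```

The two results differ only for values that are present in the sorted sequence, so upper_bound minus lower_bound gives the number of occurrences of each query.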
cuda.coop exposes CCCL’s warp-wide and block-wide cooperative primitives for use inside Numba CUDA kernels. At the moment, this module is under the _experimental namespace and may have API changes that don’t follow semantic versioning.
import numba
import numpy as np
from numba import cuda
from cuda.coop._experimental import block, warp

THREADS = 128
block_sum = block.make_sum(numba.int32, THREADS)

@cuda.jit(link=block_sum.files)
def reduce_kernel(data, out):
    # Each thread contributes one element to the block-wide reduction
    total = block_sum(data[cuda.threadIdx.x])
    if cuda.threadIdx.x == 0:
        out[0] = total

h_in = np.ones(THREADS, dtype=np.int32)
d_in = cuda.to_device(h_in)
d_out = cuda.device_array(1, dtype=np.int32)
reduce_kernel[1, THREADS](d_in, d_out)
assert d_out.copy_to_host()[0] == THREADS  # 128
New Numba CUDA MLIR backend
Numba CUDA MLIR is a new Numba-compatible kernel generator for Python, written from the ground up on top of MLIR and the modern NVVM toolchain. It preserves the familiar @cuda.jit programming model from Numba-CUDA while delivering lower compile latency, better diagnostics, and a cleaner path to target new GPU architectures and features as they land in the NVVM stack. Numba CUDA MLIR can be used as a drop-in replacement for numba.cuda by simply replacing the import statement:
# Before
from numba import cuda
# After
from numba_cuda_mlir import cuda

@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)
    if i < out.shape[0]:
        out[i] = a[i] + b[i]
Beyond existing Numba-CUDA compatibility, Numba CUDA MLIR also features:
- Faster JIT compile. Across a suite of real kernels (vector add, softmax, Cholesky, attention, Black-Scholes, FFT, matmul), warm JIT compile times are ~1.4x faster on geomean and up to ~2x faster on individual kernels versus Numba-CUDA.
- Lower launch latency. Host-side kernel dispatch overhead drops by roughly 2-3.5x for typical kernels and up to ~17x for kernels with many scalar arguments, where argument packing previously dominated.
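A "geomean" speedup is the geometric mean of per-kernel speedup ratios. The sketch below shows how such a figure is computed. The per-kernel numbers are invented for illustration and are not measurements from Numba CUDA MLIR.

```python
from statistics import geometric_mean

# Hypothetical per-kernel speedups (old compile time / new compile time).
# These values are made up for illustration only.
speedups = {
    "vector_add": 1.2,
    "softmax": 1.3,
    "matmul": 2.0,
    "fft": 1.25,
}

geomean = geometric_mean(list(speedups.values()))
best = max(speedups.values())

print(f"geomean speedup: {geomean:.2f}x, best single kernel: {best:.1f}x")
```

The geometric mean is the conventional way to average ratios, because a 2x speedup and a 0.5x slowdown cancel to exactly 1x, which an arithmetic mean would not report.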
You can test Numba CUDA MLIR 0.3 by installing numba-cuda-mlir[cu13] from PyPI, and you can follow its development on GitHub.
Try CUDA Python today
Install the CUDA Python stack directly from PyPI:
pip install cuda-python cuda-cccl numba-cuda-mlir[cu13]
This pulls in cuda.bindings 13.3.0, cuda.core 1.0.0, and cuda.compute 1.0.0, along with cuda-pathfinder for library discovery.
CompileIQ launched
CompileIQ, a new compiler auto-tuning framework for maximizing GPU kernel performance, launches with CUDA 13.3. GPU compilers apply generic optimization heuristics that are broadly effective but aren’t necessarily optimal for specific kernels. CompileIQ flips this dynamic by using evolutionary and genetic algorithms to generate specialized compiler configurations custom-tailored to each kernel.
This unlocks extra performance. For example, for critical kernels like GEMM and attention, which account for over 90% of LLM inference compute, CompileIQ delivers up to a 15% speedup on already-optimized Triton attention and CUTLASS GEMM kernels. Read more about CompileIQ, including how it works and how to use it, in this blog post.
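To give a feel for evolutionary tuning in general, here is a toy genetic search over a made-up configuration space. It is not CompileIQ's algorithm or API: the knob names, their values, and the synthetic_runtime cost function are all hypothetical stand-ins for "compile the kernel with this configuration and time it".

```python
import random
from typing import Dict, List

# Hypothetical tuning knobs; real compiler options will differ.
SEARCH_SPACE: Dict[str, List[int]] = {
    "unroll": [1, 2, 4, 8],
    "max_registers": [64, 96, 128, 255],
    "pipeline_stages": [1, 2, 3, 4],
}
Config = Dict[str, int]


def synthetic_runtime(cfg: Config) -> float:
    """Stand-in for 'compile the kernel and time it' (lower is better)."""
    return (abs(cfg["unroll"] - 4) + abs(cfg["max_registers"] - 128) / 32
            + abs(cfg["pipeline_stages"] - 3) + 1.0)


def random_config(rng: random.Random) -> Config:
    return {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}


def crossover(a: Config, b: Config, rng: random.Random) -> Config:
    """Take each knob from one of the two parents."""
    return {k: rng.choice((a[k], b[k])) for k in SEARCH_SPACE}


def mutate(cfg: Config, rng: random.Random, rate: float = 0.2) -> Config:
    """Occasionally replace a knob with a random legal value."""
    return {k: (rng.choice(SEARCH_SPACE[k]) if rng.random() < rate else v)
            for k, v in cfg.items()}


def evolve(generations: int = 20, pop_size: int = 12, seed: int = 0) -> Config:
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=synthetic_runtime)
        parents = population[: pop_size // 2]  # keep the fastest half
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return min(population, key=synthetic_runtime)


best = evolve()
print(best, synthetic_runtime(best))
```

Because the fastest half of each generation is carried over unchanged, the best configuration found never gets worse from one generation to the next.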
Math libraries
Core CUDA math libraries in CUDA 13.3 add several new features and notable performance improvements, including:
- cuSPARSE:
  - Support for CSC format in SpSV and SpSM.
  - Support for mixed precision in SpMVOp.
  - Support for mixed index type (64-bit offset, 32-bit index) CSR matrices in SpMVOp computation.
  - Improved cusparseSpMVOp_createDescr() performance by 2.5x.
  - Introduced new API SPMVOP_ALG1, which supports:
    - Updating matrix values while maintaining the same sparsity pattern.
    - Optimized buffer size.
    - Reduced preprocess overhead.
- cuBLAS:
  - CUDA green context support.
  - Performance improvement to FP4 matmuls on NVIDIA Blackwell Ultra.
  - Performance improvement to TF32 matmuls on NVIDIA Blackwell and Blackwell Ultra.
  - SYMV performance improvements for NVIDIA Hopper, Blackwell, and Blackwell Ultra.
  - Improved user experience for FP64 emulated matmuls by enforcing a fixed workspace size that is constant across the problem space.
- cuSOLVER:
  - Public 64-bit interface cusolverDnXpolar, which exposes the QDWH algorithm implementation for polar decomposition in cuSOLVERDn (available in 13.2 U1).
  - Public 64-bit interface cusolverDnXstedc, which computes the eigenvalues and, optionally, eigenvectors of a symmetric tridiagonal matrix using the divide and conquer method (available in 13.2 U1).
  - Performance improvements for cusolverDnXgeev with eigenvectors by moving the eigenvector post-processing from the host to the device.
  - cusolverDn[D,Z]syevj uses low-precision preconditioning, which typically improves the time-to-solution by 20% for mid-sized and large matrices on B200, and by even more on GPUs with a large FP32:FP64 ratio.
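The mixed index type mentioned for cuSPARSE (64-bit offsets, 32-bit indices) is easier to see on a small CSR example. The sketch below is a pure-Python reference SpMV, not a cuSPARSE call; csr_spmv is a hypothetical helper, and the array type codes are only a host-side stand-in for the two integer widths.

```python
from array import array
from typing import List, Sequence


def csr_spmv(offsets: Sequence[int], columns: Sequence[int],
             values: Sequence[float], x: Sequence[float]) -> List[float]:
    """y = A @ x for a matrix A stored in CSR form."""
    y = []
    for row in range(len(offsets) - 1):
        start, end = offsets[row], offsets[row + 1]
        y.append(sum(values[k] * x[columns[k]] for k in range(start, end)))
    return y


# 3x4 matrix:
# [[10,  0,  0, 20],
#  [ 0,  0, 30,  0],
#  [40, 50,  0, 60]]
offsets = array("q", [0, 2, 3, 6])        # 64-bit: running count of nonzeros
columns = array("i", [0, 3, 2, 0, 1, 3])  # narrower type: each entry < number of columns
values = array("d", [10, 20, 30, 40, 50, 60])

x = [1.0, 2.0, 3.0, 4.0]
print(csr_spmv(offsets, columns, values, x))  # [90.0, 90.0, 380.0]
```

Row offsets grow with the total number of nonzeros, which can exceed the 32-bit range for very large matrices, while column indices are bounded by the column count. Keeping only the offsets at 64 bits avoids doubling the size of the much longer index array.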
CCCL
CUDA 13.3 ships with CCCL 3.3. Highlights include DLPack/mdspan interoperability, a comprehensive random number distribution library, new search and segmented scan algorithms, and a flexible N-to-M transform.
Tensor interoperability
Deep learning frameworks speak in tensors, but CUDA C++ code often has to work one level lower—raw pointers, shapes, strides, and hand-written indexing. CCCL makes it easier to preserve that tensor structure across the boundary between Python frameworks and CUDA C++. With DLPack interoperability, tensors from frameworks such as PyTorch, JAX, and CuPy can be converted into cuda::std::mdspan views with cuda::to_device_mdspan for use in C++ kernels, and cuda::std::mdspan views can be converted back to DLPack with cuda::to_dlpack_tensor.
CCCL also extends this tensor-view model inside kernels with cuda::shared_memory_mdspan. Instead of treating shared memory as a flat buffer, developers can create multi-dimensional views over shared-memory tiles, making indexing clearer and less error-prone. The shared-memory specialization also provides address-space safety checks and guarantees shared-memory load/store instructions.
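The idea behind a multi-dimensional view over flat storage can be sketched in a few lines of Python. This is a conceptual model of strided indexing, not the mdspan implementation: StridedView is a hypothetical class, and it omits layout policies, accessors, and the address-space checks described above.

```python
from typing import List, Sequence, Tuple


class StridedView:
    """A minimal multi-dimensional view over a flat buffer (no copying)."""

    def __init__(self, data: List[float], shape: Sequence[int],
                 strides: Sequence[int], offset: int = 0) -> None:
        self.data, self.shape = data, tuple(shape)
        self.strides, self.offset = tuple(strides), offset

    def _flat(self, index: Tuple[int, ...]) -> int:
        if len(index) != len(self.shape):
            raise IndexError("wrong number of indices")
        for i, extent in zip(index, self.shape):
            if not 0 <= i < extent:
                raise IndexError("index out of range")
        # Flat position = offset + sum(index[d] * stride[d])
        return self.offset + sum(i * s for i, s in zip(index, self.strides))

    def __getitem__(self, index: Tuple[int, ...]) -> float:
        return self.data[self._flat(index)]

    def __setitem__(self, index: Tuple[int, ...], value: float) -> None:
        self.data[self._flat(index)] = value


buffer = [float(i) for i in range(12)]                        # flat storage
tile = StridedView(buffer, shape=(3, 4), strides=(4, 1))      # row-major 3x4
transposed = StridedView(buffer, shape=(4, 3), strides=(1, 4))

print(tile[1, 2], transposed[2, 1])  # 6.0 6.0
```

Both views address the same twelve elements. Only the shape and strides differ, which is why a tensor can cross the framework boundary without its data being copied or its indexing being rewritten by hand.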
Random number distributions
CCCL 3.3 adds a comprehensive set of 17 device-compatible random distributions, including uniform, normal, Poisson, and Bernoulli, to <cuda/std/random>, bringing libcu++ to near-parity with the C++ standard library’s <random> header. In addition, CCCL 3.3 backports the cuda::std::philox4x32 and cuda::std::philox4x64 engines from C++26 to C++17 and adds cuda::pcg64 as an extension in <cuda/random>. PCG64 is the default PRNG in NumPy and provides a good balance between quality and performance.
#include <cuda/random>
#include <cuda/std/random>

__global__ void sample_kernel() {
  cuda::pcg64 rng(threadIdx.x);
  cuda::std::normal_distribution<float> dist(0.0f, 1.0f);
  float sample = dist(rng);
}
Search: cub::DeviceFind::FindIf
CCCL 3.3 adds cub::DeviceFind::FindIf, a new speed-of-light device-wide search algorithm for finding the first element that satisfies a predicate.
cub::DeviceFind::FindIf(
  d_temp, temp_bytes, input, output,
  [] __device__ (int value) {
    return value > 42;
  }, num_items);
This algorithm delivers up to 7x speedup compared to the search implementation used in CCCL 3.2 and accelerates Thrust’s search and predicate-query algorithms, including thrust::find_if, thrust::all_of, thrust::any_of, thrust::none_of, thrust::equal, thrust::mismatch, thrust::is_sorted, thrust::partition_point, and more.
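A sequential Python reference helps show why one fast find-if primitive speeds up so many predicate queries: each of them can be phrased as "find the first element that breaks the condition". The sketch below is illustrative only. It assumes "not found" is reported as the sequence length (mirroring an end iterator), and it does not reflect how Thrust or CUB implement these algorithms internally.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def find_if(items: Sequence[T], pred: Callable[[T], bool]) -> int:
    """Index of the first element satisfying pred, or len(items) if none does."""
    for i, item in enumerate(items):
        if pred(item):
            return i
    return len(items)


def any_of(items, pred) -> bool:
    return find_if(items, pred) != len(items)


def none_of(items, pred) -> bool:
    return find_if(items, pred) == len(items)


def all_of(items, pred) -> bool:
    # "all satisfy pred" is the same as "no element violates pred"
    return find_if(items, lambda v: not pred(v)) == len(items)


def is_sorted(items) -> bool:
    # Search the adjacent pairs for the first out-of-order one
    pairs = list(zip(items, items[1:]))
    return find_if(pairs, lambda p: p[1] < p[0]) == len(pairs)


data = [3, 17, 42, 99, 7]
print(find_if(data, lambda v: v > 42))  # 3
```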
[Figure: thrust::find_if in CCCL 3.2 and CCCL 3.3]
More new algorithms in CCCL 3.3 include:
- Segmented scan: cub::DeviceSegmentedScan provides a segmented version of a parallel scan that efficiently computes a scan operation over multiple independent segments.
- Binary search: cub::DeviceFind::LowerBound/UpperBound performs a parallel search for multiple values in an ordered sequence.
- Transform: cub::DeviceTransform now supports transforming N input sequences into M output sequences.
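What a segmented scan computes can be shown with a short sequential reference. The sketch below is not the CUB API: segmented_inclusive_scan is a hypothetical helper, and describing segments with an offsets array is an assumption made for this example.

```python
from itertools import accumulate
from operator import add
from typing import Callable, List, Sequence


def segmented_inclusive_scan(values: Sequence[int], offsets: Sequence[int],
                             op: Callable[[int, int], int] = add) -> List[int]:
    """Scan each segment values[offsets[i]:offsets[i+1]] independently."""
    out: List[int] = []
    for begin, end in zip(offsets, offsets[1:]):
        out.extend(accumulate(values[begin:end], op))
    return out


values = [1, 2, 3, 4, 5, 6, 7]
offsets = [0, 3, 3, 7]  # segments: [1, 2, 3], [], [4, 5, 6, 7]

print(segmented_inclusive_scan(values, offsets))  # [1, 3, 6, 4, 9, 15, 22]
```

The running total restarts at every segment boundary, so many small, independent prefix sums can be expressed as a single call instead of one launch per segment.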
Compilers/NVCC
- C++23 support: Full C++23 integration in nvcc and nvrtc empowers developers to use the latest language standard. This release modernizes the CUDA development experience, ensuring codebase consistency with modern standards while significantly improving cross-platform portability.
- Enhanced nvrtc out-of-the-box experience: By bundling standard CUDA C++ headers, NVRTC streamlines the runtime compilation process and reduces prerequisite setup. This update simplifies include-path management, enabling faster implementation of portable and robust runtime compilation workflows.
- Integrated nvprune in nvcc: The inclusion of pruning capabilities directly within the compiler allows for more efficient artifact management and simplified multi-arch deployment.
More CUDA 13.3 enhancements
More enhancements in CUDA 13.3 are detailed in this section.
MPS partial error isolation
MPS has added support for partial error isolation. When using this feature, the CUDA driver can attribute the error to the faulting partition/client and terminate that client’s work, while other clients in other partitions that did not cause the fault won’t be terminated. For more info on how to use this feature, see the release notes.
Enable graph recapture to an existing graph
In CUDA graphs, a new API, cudaStreamBeginRecaptureToGraph(), lets you start a stream capture into an existing source graph. As the graph is recaptured, any changed node parameters are applied to the corresponding existing nodes.
Default stream creation is optional in green contexts
Green Contexts used in the CUDA Driver API no longer require the creation of the default (NULL) stream via the CU_GREEN_CTX_DEFAULT_STREAM flag. Creation of this stream is now optional.
NVML reports inactive remapped rows
A new NVML API, nvmlDeviceGetRemappedRows_v2, can acquire the number of inactive row remappings while the old API, nvmlDeviceGetRemappedRows, now returns only the number of active row remappings.
Added mmap() support
This release extends mmap() support, providing a low-latency CPU mapping of discrete GPU memory in environments where it may be disadvantageous to install GDRCopy kernel drivers.
Get started
Download CUDA Toolkit 13.3 and get started today.
Acknowledgments
Thanks to NVIDIA contributors Andy Terrel, Rob Armstrong, Jackson Marusarz, Becca Zandstein, Mridula Prakash, Daniel Rodriguez, and Georgii Evtushenko.
About the Authors
Jonathan Bentz leads the CUDA technical marketing engineering team at NVIDIA, where his team focuses on creating and delivering engaging content and connecting with CUDA developers. Jonathan holds a PhD in Chemistry and a master’s degree in Computer Science from Iowa State University.