NVIDIA Developer Blog · · 16 min read

Develop High-Performance GPU Kernels in C++ with NVIDIA CUDA Tile

Mirrored from NVIDIA Developer Blog for archival readability. Support the source by reading on the original site.

Develop High-Performance GPU Kernels in C++ with NVIDIA CUDA Tile

CUDA Tile example.

AI-Generated Summary

Like
Dislike
  • NVIDIA CUDA Tile C++ enables tile-based GPU kernel programming within existing C++ codebases, abstracting low-level GPU parallelism, memory movement, and hardware features across NVIDIA architectures without explicit thread management.
  • The CUDA Tile C++ programming model uses multi-dimensional tensor spans and partition views to operate on fixed-size array tiles, allowing developers to express parallel computations like vector addition and matrix multiplication more declaratively and efficiently than traditional SIMT kernels.
  • Optimizations such as __restrict__ pointer qualifiers, 16-byte alignment assumptions, and masked load/store operations improve performance and memory efficiency; tile kernels are launched with a single thread per block, letting the compiler handle thread execution details.
  • CUDA Tile C++ kernels support profiling with NVIDIA Nsight Compute, providing detailed tile-specific statistics and source-level performance metrics similar to traditional CUDA C++ kernels.
  • Matrix multiplication kernels leverage tile partitions and NVIDIAs matrix multiply-accumulate (mma) operations for efficient accumulation of partial results, demonstrating CUDA Tiles capability for complex linear algebra workloads.
  • CUDA Tile C++ requires GPUs with compute capability 8.x or newer, NVIDIA Driver R580 or later, and CUDA Toolkit 13.3 or newer to utilize the tile programming model fully.

AI-generated content may summarize information incompletely. Verify important information. Learn more

Developers can now use NVIDIA CUDA Tile programming within large existing C++  GPU codebases to develop highly optimized GPU kernels using tile-based abstractions. 

NVIDIA CUDA Tile, launched with NVIDIA CUDA 13.1, introduced tile-based programming for GPUs. Designed with a top-level language layer and another intermediate layer that any high-level programming language can target, CUDA Tile automatically makes use of the advanced capabilities of NVIDIA hardware—including tensor cores, shared memory, and tensor memory accelerators—without requiring the application to target them directly. 

Python was the first language supported for tile-based GPU applications. The newly released CUDA 13.3 adds support for writing tile kernels in C++, enabling developers to build highly optimized GPU kernels.  

What is CUDA Tile C++?

CUDA Tile C++ is an expression of the CUDA Tile programming model in C++, built on top of the CUDA Tile IR specification. It enables developers to write tile kernels in C++ and express GPU kernels using a tile-based model, rather than or in addition to a single instruction, multiple threads (SIMT) model. 

As a refresher, in the tile model:

  • Multi-dimensional arrays are the primary data storage.
  • Tiles are portions of arrays that kernels operate on.
  • Kernels are functions that are executed in parallel by blocks.
  • Blocks are subsets of the GPU; operations on tiles are parallelized across all the threads in each block.

CUDA Tile C++ automates parallelism within blocks, along with asynchrony, memory movement, and other low-level details of GPU programming. CUDA Tile C++ is portable across different NVIDIA GPU architectures, enabling developers to use the latest hardware features without having to rewrite code.

CUDA Tile C++ vector add example

Developers familiar with CUDA C++ for SIMT have likely encountered the canonical vector addition kernel. Assuming the data is already on the GPU, a vector add kernel in CUDA SIMT takes two vectors and adds them together element-wise to produce a third vector. This is one of the simplest CUDA kernels to write. It looks as follows.

__global__ void vecAdd(float* A, float* B, float* C, int vectorLength)
{
 /* calculate my thread index */
 int workIndex = threadIdx.x + blockIdx.x*blockDim.x;

 if(workIndex < vectorLength)
 {
  /* perform the vector addition */
  C[workIndex] = A[workIndex] + B[workIndex];
 }
}

In this kernel, each thread’s work is explicitly specified, and the programmer, when launching this kernel, will specify the number of blocks and threads to be launched.

Looking at the equivalent code written in CUDA Tile C++, there’s no need to specify what each thread does. Just break the data into tiles and specify the mathematical operations for these tiles. Everything else is handled.

The CUDA Tile C++ kernel looks like the following:

#include "cuda_tile.h"
__tile_global__ void vectorAdd(float* a, float* b, float* out, size_t n) {

/* set up the namespace */
  namespace ct = cuda::tiles;
  using namespace ct::literals;

/* attach shape to raw pointers */
  auto aSpan = ct::tensor_span{a,   ct::extents{n}};
  auto bSpan = ct::tensor_span{b,   ct::extents{n}};
  auto oSpan = ct::tensor_span{out, ct::extents{n}};

/* partition each span into tiles of size 8 */
  auto aView = ct::partition_view{aSpan, ct::shape{8_ic}};
  auto bView = ct::partition_view{bSpan, ct::shape{8_ic}};
  auto oView = ct::partition_view{oSpan, ct::shape{8_ic}};

/* load the a and b tiles from global memory */
  int bx = ct::bid().x;
  auto aTile = aView.load(bx);          // load bx-th tile
  auto bTile = bView.load(bx);

/* add the two tiles together, elementwise */
  auto oTile = aTile + bTile;

/* store the result tile to the output partition. */
  oView.store(oTile, bx); 
} 

This looks like a lot of code for a simple vectorAdd kernel. Don’t be alarmed. This overly verbose kernel is used to show all the steps in order. A simplified version doing the same thing with fewer lines of code follows. 

  1. The first difference is using __tile_global__ to signify to the compiler that this is a tile kernel. The array pointers and the array size are passed in as arguments, just as they are in the SIMT kernel.
__tile_global__ void vectorAdd(float* a, float* b, float* out, std::size_t n) {
  1. Then, set up the namespace for cuda::tiles and ct::literals.
  namespace ct = cuda::tiles;  
  using namespace ct::literals; 
  1. Create a tensor span, using this code ct::tensor_span for each of the three arrays. A tensor span is essentially a pointer to a multi-dimensional array in memory, similar to a C++23 std::mdspan.

    The tensor span carries information about the shape (extents) of the array as well as the layout of the array elements (for example, row major or column major).

    The ct::extents{} tells the tensor span what the dimensions of the array are. A 1D array uses n .
auto aSpan = ct::tensor_span{a,   ct::extents{n}};
auto bSpan = ct::tensor_span{b,   ct::extents{n}};
auto oSpan = ct::tensor_span{out, ct::extents{n}}
  1. Now create a partition view from a tensor span and a tile shape. A partition view is a wrapper around a tensor span that presents the array as a series of non-overlapping, fixed-sized partitions.

    The size of each partition is specified by the shape argument, which must be a compile-time argument.

    In this example, 8_ic is an integer constant that is defined by ct::literals.ct::shape<8>{} and ct::shape{8_ic} are equivalent in this context. The partition view that’s created is essentially the original array, sliced into chunks of 8, which is the tile size.
  auto aView = ct::partition_view{aSpan, ct::shape{8_ic}};
  auto bView = ct::partition_view{bSpan, ct::shape{8_ic}};
  auto oView = ct::partition_view{oSpan, ct::shape{8_ic}};
  1. Load input tiles by obtaining the block index in the X dimension with ct::bid().x. If working with multi-dimensional blocks, use the Y and Z dimensions as well.

    Then load the a and b tiles. Use auto for convenience, but to be explicit, aTile and bTile are of type ct::tile<float, ct::shape<8>>>. They’re 1D tiles of size 8, with elements of type float. With the partition view, it’s easy to pass in the block index. The load function automatically fetches the correct chunk of the array and loads it into a tile.
int bx = ct::bid().x;
auto aTile = aView.load(bx);         
auto bTile = bView.load(bx);
  1. Adding and store the results. This is one line of code does element-wise addition on input tiles and stores them in an output tile. Store that output tile to the oView partition view, indexed by the same block index in the X dimension, bx.
/* add the two tiles together, elementwise. */
auto oTile = aTile + bTile;

/* store the result tile to the output partition. */
oView.store(oTile, bx);

Complete vector add example

The following example show how to call this vector add kernel in C++ through a complete, runnable piece of code.

There are a few things to note to help the compiler make optimizations.

First, for best performance, the input and output arrays should only be accessed through their respective pointers while the kernel is running. There is no aliasing—access using another pointer or symbol—to the arrays when this is true. Labeling array pointers with the __restrict__ decorator conveys this to the compiler.

Using arrays with base pointers aligned to 16-byte boundaries helps the compiler generate more efficient memory access patterns. Tell the compiler that pointers are aligned by calling ct::assume_aligned<16> on each of the kernel arguments. Use the return values of these calls for the compiler to take advantage of this alignment. Pointers returned by cudaMalloc or similar CUDA APIs always fulfill this, as they have 256-byte alignment.

Finally, use a tile size much larger than 8. Make these adjustments to the runnable code that follows and add the use of load_masked and store_masked, which handles data that might not be divisible by the tile size.

The following is the complete code, including the kernel and main function. Notice the applied optimizations and reduced verbosity.

#include <cstdio>
#include <cstdlib>
#include "cuda_tile.h"

__tile_global__ void vectorAdd(float* __restrict__ a, float* __restrict__ b, float* __restrict__ out, size_t n) {
  namespace ct = cuda::tiles;
  using namespace ct::literals;

  a   = ct::assume_aligned(a,   16_ic);
  b   = ct::assume_aligned(b,   16_ic);
  out = ct::assume_aligned(out, 16_ic);  

  int bx     = ct::bid().x;
  
/* create partition views for the input tiles and load them */
  auto aTile = ct::partition_view{ct::tensor_span{a,   ct::extents{n}}, ct::shape{1024_ic}}.load_masked(bx);
  auto bTile = ct::partition_view{ct::tensor_span{b,   ct::extents{n}}, ct::shape{1024_ic}}.load_masked(bx);
  
/* add the two tiles together, elementwise. */
  auto oTile = aTile + bTile;

/* create the partition view for the output tile and then store the output tile*/  
  auto oView = ct::partition_view{ct::tensor_span{out, ct::extents{n}}, ct::shape{1024_ic}};  
  oView.store_masked(oTile, bx);
}  

/* define a macro to check for CUDA errors */
#define checkCudaError(X) do {\
  auto ret = X;\
  if (ret != cudaSuccess) {\
    printf("\n error on line %d, CUDART error string : %s", __LINE__, cudaGetErrorString(ret));\
    exit(1);\
  }\
} while (0)

int main() {
  constexpr size_t N = 2ULL << 25;
  constexpr int TILE_SIZE = 1024;
  constexpr int BLOCKS = (N + TILE_SIZE - 1) / TILE_SIZE;

/* declare and allocate the host arrays */
  float* h_a   = (float*)malloc(sizeof(float) * N);
  float* h_b   = (float*)malloc(sizeof(float) * N);
  float* h_out = (float*)malloc(sizeof(float) * N);

/* initialize the host arrays */
  for (size_t idx = 0; idx < N; ++idx) {
    h_a[idx] = (float)rand() / RAND_MAX;
    h_b[idx] = (float)rand() / RAND_MAX;
    h_out[idx] = -1.0f;
  }

/* declare the device arrays */
  float* d_a{nullptr};
  float* d_b{nullptr};
  float* d_out{nullptr};

/* allocate the device arrays */
  checkCudaError(cudaMalloc(&d_a, sizeof(float) * N));
  checkCudaError(cudaMalloc(&d_b, sizeof(float) * N));
  checkCudaError(cudaMalloc(&d_out, sizeof(float) * N));

/* copy the host arrays to the device arrays */
  checkCudaError(cudaMemcpy(d_a, h_a, sizeof(float) * N, cudaMemcpyHostToDevice));
  checkCudaError(cudaMemcpy(d_b, h_b, sizeof(float) * N, cudaMemcpyHostToDevice));

/* initialize the device output array to 0 */
  checkCudaError(cudaMemset(d_out, -1, sizeof(float) * N));

/* launch the kernel */
  vectorAdd<<<BLOCKS, 1>>>(d_a, d_b, d_out, N);

/* synchronize the device and check for errors */
  checkCudaError(cudaDeviceSynchronize());

/* copy the device array out back to the host */
  checkCudaError(cudaMemcpy(h_out, d_out, sizeof(float) * N, cudaMemcpyDeviceToHost));

/* compare the results to host results */

  float max_err = 0.0f;
  for (size_t idx = 0; idx < N; ++idx) {
    float expected = h_a[idx] + h_b[idx];
    max_err = fmaxf(max_err, fabsf(h_out[idx] - expected));
  }

  printf("N: %zu\n", N);
  printf("Max error: %e\n", max_err);

  checkCudaError(cudaFree(d_a));
  checkCudaError(cudaFree(d_b));
  checkCudaError(cudaFree(d_out));

  free(h_a);
  free(h_b);
  free(h_out);
}

If familiar with launching SIMT kernels, the process is similar, but requires a specific modification. This kernel was launched with:

vectorAdd<<<BLOCKS, 1>>>(d_a, d_b, d_out, N);

When launching a tile kernel, the first argument in the <<<>>> is the number of tile blocks (in SIMT, this would be the number of thread blocks). The second argument must be 1. The number of threads used to execute the kernel is determined by the compiler; always put 1 as this argument when launching a tile kernel.

Running CUDA 13.3 or later on compute capability 8.0 with NVIDIA Ampere architecture or newer GPU, these commands create the following output.

Adjust the -arch sm_120 command to match the architecture, include -std=c++20 when cuda_tile.h is used, and the --enable-tile option to compile tile kernels.

$ nvcc -std=c++20 --enable-tile -arch sm_120 -o vectorAdd vectorAdd.cu
$ ./vectorAdd
N: 67108864
Max error: 0.000000e+00

This completes the first CUDA Tile C++ program.

Developer tools

Tile C++ kernels can be profiled with NVIDIA Nsight Compute in the same way as SIMT kernels. The following command shows how to create a profile using Nsight Compute.

$ ncu -o VecAddProfile --set detailed ./vectorAdd

Once created and opened with the graphical version of Nsight Compute:

  • Select the vectorAdd kernel from the dropdown menu.
  • Choose the Details tab
  • Expand the Tile Statistics report section

Figure 1 shows the profile generated from Nsight Compute.

Image of the profile generated from Nsight Compute, showing the tile statistics for the vectorAdd kernel.
Figure 1. Profile generated from Nsight Compute, showing the tile statistics for the vectorAdd kernel

Notice the Tile Statistics report section includes the number of tile blocks specified, block size (chosen by compiler), and other tile-specific information.

The source page also supports tile kernels and performance metrics at the source-line level, just like CUDA C++ kernels. 

Matrix multiply

An earlier example showed vectorAdd with the details of loading and storing partition views. This matrix multiply example illustrates how to express matrix multiply using very simple code.

This kernel executes an MxK by KxN matrix multiply to compute an MxN matrix. In this kernel, M=8, N=16, and K can be variable, provided it’s a multiple of 8. Set K=24. These very small sizes are used to illustrate the concepts only.

The complete kernel follows, along with an overview of the high points.

#include "cuda_tile.h"

/* this kernel multiplies MxK and KxN matrices, where M=8 and N=16.  K is variable but must be divisible by 8.*/
__tile_global__ void kernel(float* __restrict__ a, float* __restrict__ b, size_t length, float* __restrict__ c) {
    namespace ct = cuda::tiles;
    using namespace ct::literals;

    a = ct::assume_aligned(a, 16_ic);
    b = ct::assume_aligned(b, 16_ic);
    c = ct::assume_aligned(c, 16_ic);

    auto aShape = ct::extents{8_ic, length};
    auto bShape = ct::extents{length, 16_ic};
    auto cShape = ct::extents{8_ic, 16_ic};

    auto aSpan = ct::tensor_span{a, aShape};
    auto bSpan = ct::tensor_span{b, bShape};
    auto cSpan = ct::tensor_span{c, cShape};

    auto aView = ct::partition_view{aSpan, ct::shape{4_ic, 8_ic}};
    auto bView = ct::partition_view{bSpan, ct::shape{8_ic, 4_ic}};
    auto cView = ct::partition_view{cSpan, ct::shape{4_ic, 4_ic}};
    
    using f32x4x4 = ct::tile<float, ct::shape<4, 4>>;
    auto accTile = ct::full<f32x4x4>(0);

    auto [xBlock, yBlock, dummy] = ct::bid();
    for (auto idx : ct::irange(0, 1 + int(length - 1) / 8)) {
        auto aTile = aView.load_masked(xBlock, idx);
        auto bTile = bView.load_masked(idx, yBlock);
        accTile = ct::mma(aTile, bTile, accTile);
    }

    cView.store_masked(accTile, xBlock, yBlock);
}
  1. Create extents with ct::extents objects for the a, b, and c matrices. Use either compile or runtime values. M=8 and N=16, but K is variable. These are used to create the tensor spans in the next step.
 auto aShape = ct::extents{8_ic, length};
 auto bShape = ct::extents{length, 16_ic};
 auto cShape = ct::extents{8_ic, 16_ic};
  1. Create tensor spans. This carries information about a, b, and c to create the partition views.
    auto aSpan = ct::tensor_span{a, aShape};
    auto bSpan = ct::tensor_span{b, bShape};
    auto cSpan = ct::tensor_span{c, cShape};
  1. Create partition views of a, b, and c with a partitioned as a 4×8 and the view b as an 8×4 view. Adjustments can be made, provided they divide properly into a and b values. These dimensions also determine that the c view is 4×4.
  auto aView = ct::partition_view{aSpan, ct::shape{4_ic, 8_ic}};
  auto bView = ct::partition_view{bSpan, ct::shape{8_ic, 4_ic}};
  auto cView = ct::partition_view{cSpan, ct::shape{4_ic, 4_ic}};

The 2D partitions are indexed in 2 dimensions. The a matrix is 8×24, and the partition view is 4×8, as shown in Figure 2.

The partition view sizes of aView and bView also determine the shape of accTile, the tile used to accumulate results during matrix multiplication. In this example, accTile is a 4×4 tile, matching the shape of cView.

A two-dimensional partition view of 8 x 24 partitioned into views of 4 x 8, and indexed by two-dimensional coordinates x and y. The first partition is (0,0), and the last partition is (1,2).
Figure 2. Two-dimensional partition view of an 8 x 24 matrix, partitioned into views of size 4 x 8
    using f32x4x4 = ct::tile<float, ct::shape<4, 4>>;
    auto accTile = ct::full<f32x4x4>(0);
  1. Execute the loop with ct::bid to obtain the block indices in the three dimensions. The loop iterates from 0 to the length / 8, corresponding to the overall K dimension divided by 8. The division by 8 matches the K-dimension of aView and bView is 8. Inside the loop, tiles from a and bare loaded using load_masked, and call ct::mma performs the matrix multiply, accumulating the result in accTile.
    auto [xBlock, yBlock, dummy] = ct::bid();

    for (auto idx : ct::irange(0, int(length / 8))) {
        auto aTile = aView.load_masked(xBlock, idx);
        auto bTile = bView.load_masked(idx, yBlock);
        accTile = ct::mma(aTile, bTile, accTile);
    }
  1. Store the value of the accTile into the partition view of c, the cView. And that’s it. Most of the kernel code is involved with setting up views for the data and loading/storing the data. The compute portion of the kernel is simple.
  cView.store_masked(accTile, xBlock, yBlock);
  1. Launch the kernel. Use dim3(2,4) because of the dimensions of cView.

    cView is 4×4, meaning each block is computing a 4×4 chunk of the C matrix. Since C is 8×16, divide the cView dimensions into the C matrix dimensions. Since 8/4=2, and 16/4=4 launch the kernel with dim3(2,4).
  kernel<<<dim3(2, 4), 1>>>(d_a, d_b, K, d_c);

Get started today with CUDA Tile C++

The following are required to run CUDA Tile C++ programs:

  • A GPU with compute capability 8.x or newer.
  • NVIDIA Driver R580 or later. If JIT compilation is required for tile kernels, the NVIDIA driver version must be equivalent to or newer than the version associated with the CUDA Toolkit used to generate the code. For example, CUDA Toolkit 13.3 requires an R610 driver or newer.
  • CUDA Toolkit 13.3

The power of tile-based programming is now available to C++ developers. Check out the documentation, the API reference manual, and CUDA Toolkit 13.3 today to start writing tile C++ kernels and experience the new standard for accelerated computing.

Acknowledgements

Thanks to NVIDIA contributors Jaydeep Marathe and Ezra Stein.

Discuss (0)

Tags

Data Science | Developer Tools & Techniques | Simulation / Modeling / Design | General | CUDA | Intermediate Technical | Deep dive | C++ | CUDA Tile

About the Authors

Avatar photo
About Jonathan Bentz
Jonathan Bentz leads the CUDA technical marketing engineering team at NVIDIA, where his team focuses on creating and delivering engaging content and connecting with CUDA developers. Jonathan holds a PhD in Chemistry and a master’s degree in Computer Science from Iowa State University.
Avatar photo
About Tony Scudiero
Tony Scudiero is a technical marketing engineer for the CUDA platform. He works to bring CUDA to developers of every type and ability. He has worked with large HPC systems and applications, real-time acoustic simulations (VRWorks Audio), and the Omniverse RTX Renderer during his tenure at NVIDIA.

Comments

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from NVIDIA Developer Blog