Inside CUDA: Performance Engineering
Dive deeper into CUDA to uncover the principles and practices behind high-performance GPU computing.
Motivation & Recap#
In the previous article — Hello CUDA! — we got acquainted with CUDA and explored the big picture of GPU architecture. Continuing that journey, in this article, we will discuss a core topic that constitutes the power (and complexity) of GPUs: Performance Engineering.
GPUs have long been known for their superior parallel computing capabilities. From training deep learning models with billions of parameters and simulating molecular dynamics to financial risk analysis, many demanding workloads rely on them.
In CUDA, the performance difference between unoptimized and optimized code can be tens, or even hundreds, of times. Achieving this speedup isn't just about "writing GPU code"; we must learn to think in parallel. This requires a deep understanding of both parallel algorithms and how they interact directly with the hardware. 1
In this article, we will learn how to keep the GPU cores busy and harness the full power of the hardware. Instead of diving deep into parallel algorithms, we will focus on the fundamental techniques that determine the actual performance of each kernel.
Hardware & Experimental Setup#
Performance is always limited by hardware. An optimization technique might work effectively on one architecture but be inefficient on another.
To measure objectively, we will use two different hardware platforms:
- NVIDIA GeForce MX330: An old, weak laptop GPU, representing low-end hardware (Pascal architecture).
- NVIDIA RTX A4000: A modern, powerful workstation GPU based on the Ampere architecture.
Measuring on both will show us which techniques provide general benefits (effective on both) and which are only effective on a specific architecture.
Below is a comparison table of the key specifications for these two GPUs:
| Specification | NVIDIA GeForce MX330 | NVIDIA RTX A4000 |
|---|---|---|
| Architecture | Pascal | Ampere |
| Compute Capability | 6.1 | 8.6 |
| CUDA Cores | 384 | 6144 |
| SM Count | 3 | 48 |
| VRAM | 2 GB GDDR5 | 16 GB GDDR6 (ECC) |
| Memory Bandwidth | ~56.1 GB/s | 448 GB/s |
| L2 Cache | 512 KB | 4 MB |
| Registers / SM | 65,536 (32-bit) | 65,536 (32-bit) |
Quick Analysis: The difference is enormous. The A4000 (Ampere) not only has 16 times more SMs (48 vs 3) but also nearly 8 times more memory bandwidth (448 vs 56.1 GB/s). This suggests that memory-bound kernels will benefit significantly on the A4000, while compute-bound kernels will benefit from the massive number of SMs/CUDA cores.
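As a side note, these specifications can be checked directly from code. Below is a minimal sketch (assuming device 0) that uses cudaGetDeviceProperties to print the properties relevant to this article; the bandwidth figure is a rough theoretical estimate, since the exact formula depends on the memory type.
#include <cstdio>
#include <cuda_runtime.h>
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // query device 0
    // Rough estimate: 2 * memory clock (kHz -> Hz) * bus width (bits -> bytes)
    double bandwidthGBs = 2.0 * prop.memoryClockRate * 1e3 *
                          (prop.memoryBusWidth / 8.0) / 1e9;
    printf("Device             : %s\n", prop.name);
    printf("Compute Capability : %d.%d\n", prop.major, prop.minor);
    printf("SM count           : %d\n", prop.multiProcessorCount);
    printf("Registers / SM     : %d\n", prop.regsPerMultiprocessor);
    printf("Shared memory / SM : %zu KB\n", prop.sharedMemPerMultiprocessor / 1024);
    printf("L2 cache           : %d KB\n", prop.l2CacheSize / 1024);
    printf("Memory bandwidth   : ~%.1f GB/s\n", bandwidthGBs);
    return 0;
}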
Performance Optimization Techniques#
We will explore some of the most important optimization techniques in CUDA, starting with the most powerful tool.
1. Shared Memory (SMEM)#
Accessing Global Memory (the GPU's VRAM) is one of the most expensive operations in a CUDA kernel. It has very high latency, potentially hundreds of clock cycles. If your kernel constantly reads from and writes to global memory, performance will be severely "bottlenecked".
Shared Memory (SMEM) is an on-chip memory region (located right inside the SM), small in size (typically 48KB to 128KB per SM) but with extremely high access speed and bandwidth. SMEM's latency is lower than L2 cache and only slightly higher than registers, making it an effective tool for data reuse within a block. 1
The general strategy is:
- Load a block of data from Global Memory into Shared Memory (only once, and try to load it in a coalesced manner - see section 2).
- Have the threads in the block perform many computations by accessing data in Shared Memory.
- Write the final result back to Global Memory.
Example 1: Matrix Multiplication
Let's consider the problem of multiplying two matrices, A of size M × N and B of size N × K, to produce C of size M × K.
Naive Kernel (No SMEM)
This kernel is simple; each thread calculates one element of the C matrix. However, each thread must perform N reads from matrix A and N reads from matrix B—all from Global Memory.
__global__ void matrix_multiplication_naive(
const float *A, const float *B, float *C,
int M, int N, int K)
{
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (row < M && col < K)
{
float sum = 0.0f;
// Loop N times, accessing Global Memory each time
for (int k = 0; k < N; ++k)
{
sum += A[row * N + k] * B[k * K + col];
}
C[row * K + col] = sum;
}
}
Optimized Kernel (Tiling & SMEM)
We use a technique called "tiling". Each block will be responsible for computing one tile of the C matrix, e.g., 16 × 16 elements (TILE_SIZE × TILE_SIZE). To do this, it will loop and load the corresponding tiles from A and B into Shared Memory.
Execution idea for each thread in the block:
- Loop (over the number of tiles):
  - Load: Load one element of tile A and one element of tile B (from Global Memory) into SMEM.
  - Synchronize: Wait for all threads in the block to finish loading (__syncthreads();).
  - Compute: Calculate the dot product of the sub-tiles (currently in SMEM).
  - Synchronize: Wait for all threads to finish computing before loading the next tile (__syncthreads();).
- Write: Write the final sum to the C matrix in Global Memory (only once at the end).
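For completeness, here is a hypothetical host-side fragment showing how the kernel defined next would be launched: each block has TILE_SIZE × TILE_SIZE threads, and the grid covers the M × K output matrix (x over columns, y over rows). The device buffers d_A, d_B, d_C are assumed to have been allocated and filled elsewhere.
// Hypothetical host-side fragment; d_A, d_B, d_C are device buffers allocated elsewhere.
dim3 block(TILE_SIZE, TILE_SIZE);
dim3 grid((K + TILE_SIZE - 1) / TILE_SIZE,   // enough blocks to cover the columns of C
          (M + TILE_SIZE - 1) / TILE_SIZE);  // enough blocks to cover the rows of C
matrix_multiplication_smem<<<grid, block>>>(d_A, d_B, d_C, M, N, K);
cudaDeviceSynchronize();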
#define TILE_SIZE 16
__global__ void matrix_multiplication_smem(
const float *A, const float *B, float *C,
int M, int N, int K)
{
// Declare Shared Memory for tiles of A and B
__shared__ float As[TILE_SIZE][TILE_SIZE];
__shared__ float Bs[TILE_SIZE][TILE_SIZE];
int row = blockIdx.y * TILE_SIZE + threadIdx.y;
int col = blockIdx.x * TILE_SIZE + threadIdx.x;
int tx = threadIdx.x; // x-coordinate within the tile
int ty = threadIdx.y; // y-coordinate within the tile
float sum = 0.0f;
// Loop over all necessary tiles
for (int t = 0; t < (N + TILE_SIZE - 1) / TILE_SIZE; ++t)
{
// 1. Load tile from Global Memory into Shared Memory
// Calculate the column index 'A' and row index 'B' in global memory
int A_col = t * TILE_SIZE + tx;
int B_row = t * TILE_SIZE + ty;
// Each thread loads 1 element of A into As
if (row < M && A_col < N) {
As[ty][tx] = A[row * N + A_col]; // ty, tx are coordinates in SMEM
} else {
As[ty][tx] = 0.0f; // Padding if out of bounds
}
// Each thread loads 1 element of B into Bs
if (B_row < N && col < K) {
Bs[ty][tx] = B[B_row * K + col];
} else {
Bs[ty][tx] = 0.0f; // Padding if out of bounds
}
// 2. Synchronize to ensure all threads have finished loading
__syncthreads();
// 3. Compute from Shared Memory
#pragma unroll // Hint for the compiler to unroll this loop
for (int i = 0; i < TILE_SIZE; ++i)
{
sum += As[ty][i] * Bs[i][tx];
}
// 4. Synchronize before moving to the next tile
__syncthreads();
}
// 5. Write the final result to Global Memory
if (row < M && col < K)
{
C[row * K + col] = sum;
}
}
Results (Example 1: Matrix Multiplication)
| Method | MX330 (ms) | A4000 (ms) |
|---|---|---|
| Naive | 37.240 ms | 1.861 ms |
| Tiling & SMEM | 15.846 ms | 2.077 ms |
| cuBLAS | 2.423 ms | 0.206 ms |
Analysis: This measurement data teaches us an extremely important lesson: Optimization is highly architecture-dependent.
- On MX330 (Pascal): As predicted, the Tiling & SMEM technique brought a clear benefit, speeding up by about 2.35x (37.240 / 15.846) compared to the naive kernel. On older hardware with a small L2 cache and low memory bandwidth, reducing global memory accesses through SMEM provides a significant performance improvement.
- On RTX A4000 (Ampere): A surprising result emerges. Our Tiling & SMEM kernel (2.077 ms) is actually about 1.1x slower than the naive kernel (1.861 ms).
  - This doesn't mean SMEM is useless. It shows that the naive kernel, despite its simplicity, benefits greatly from the Ampere architecture's hardware mechanisms. With a large L2 cache (4 MB) and extremely high memory bandwidth (448 GB/s), the A4000 may hide most of the global memory access latency without our manual intervention.
  - Meanwhile, the Tiling & SMEM kernel introduces new overhead: more complex index calculation logic and, especially, the __syncthreads() barriers. On a powerful architecture like Ampere, the cost of these synchronization barriers can become greater than the benefit gained from using SMEM, especially when the naive kernel is already running very fast.
- cuBLAS library: In both cases, cuBLAS is superior. On the A4000 in particular, it is roughly 9x faster than the naive kernel and 10x faster than our SMEM kernel. This is because cuBLAS is not only optimized at the assembly level but is also designed to fully exploit Tensor Cores (available on the Ampere architecture), specialized hardware for matrix multiplication that our basic float kernel does not use. (A minimal call sketch follows below.)
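For reference, the cuBLAS call for this problem might look like the sketch below (simplified, no error checking; the function and buffer names are our own). The key detail is that cuBLAS assumes column-major storage, so a common trick for row-major data is to compute C^T = B^T · A^T on the same buffers.
#include <cublas_v2.h>
// Hypothetical sketch: d_A (M x N), d_B (N x K), d_C (M x K) are row-major device arrays.
void gemm_with_cublas(const float *d_A, const float *d_B, float *d_C,
                      int M, int N, int K)
{
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // cuBLAS is column-major: computing C^T = B^T * A^T on the same row-major
    // buffers yields a row-major C without any explicit transpose.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                K, M, N,
                &alpha,
                d_B, K,   // "A" argument of SGEMM = our B, leading dimension K
                d_A, N,   // "B" argument of SGEMM = our A, leading dimension N
                &beta,
                d_C, K);  // our C, leading dimension K
    cublasDestroy(handle);
}
Depending on the library version and math-mode settings, cuBLAS may also route FP32 GEMMs through TF32 Tensor Core paths on Ampere, which is part of why it is so far ahead here.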
Example 2: Optimizing Atomic Operations
atomicAdd() is a necessary function to avoid race conditions when multiple threads update the same variable in global memory. 1
Problem: Count Array Elements
Given an input array input (size N), we need to count the total number of elements in the array equal to a given constant K. This total count must be accumulated and stored in a single variable output in global memory.
Naive Kernel (Naive Atomic)
The simplest solution is for each thread that finds a valid element to call atomicAdd() directly on the output variable in global memory.
__global__ void count_equal_naive(
const int *input, int *output, int N, int K)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N && input[idx] == K)
{
// Many atomic operations, causing a "bottleneck"
atomicAdd(output, 1);
}
}
Problem: If input has many elements equal to K, thousands of threads will simultaneously "contend" to call atomicAdd on the same address. These accesses will be serialized, completely losing parallelism.
Optimized Kernel (SMEM & Parallel Reduction)
A better strategy is to minimize the number of atomicAdd calls to global memory. We only call it once per block.
- Grid-Stride Loop: Each thread will process multiple elements (separated by a stride). This is a powerful pattern for handling arrays of any size.
- Each thread locally counts the elements it finds.
- This local count is stored in an array in Shared Memory (I[]).
- Parallel Reduction: We perform a parallel reduction sum directly in shared memory to add up all the block's local counts. This reduction uses SMEM and threads within the block, completely avoiding the contention of atomicAdd to global memory. It is extremely efficient.
- Finally, only one thread (e.g., threadIdx.x == 0) in the block calls atomicAdd a single time to add the block's total to output.
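Because the kernel below declares its shared memory buffer with extern __shared__, the host must pass the buffer size as the third launch parameter. A minimal, hypothetical launch fragment (sizes chosen for illustration; d_input and d_output are device pointers allocated elsewhere):
// Hypothetical host-side fragment.
int threads = 256;                        // must be a power of two for the reduction below
int blocks  = 128;                        // the grid-stride loop covers any N
size_t smemBytes = threads * sizeof(int); // one int of SMEM per thread (the I[] array)
count_equal_optimized<<<blocks, threads, smemBytes>>>(d_input, d_output, N, K);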
__global__ void count_equal_optimized(
const int *input, int *output, int N, int K)
{
// Use 'extern' to dynamically allocate SMEM when launching the kernel
extern __shared__ int I[];
int g_idx = blockIdx.x * blockDim.x + threadIdx.x;
int l_idx = threadIdx.x;
// Stride is the total number of threads in the grid
int stride = gridDim.x * blockDim.x;
// 1. Each thread counts locally (Grid-Stride Loop)
int count = 0;
while (g_idx < N)
{
if (input[g_idx] == K)
{
count++;
}
g_idx += stride; // Jump to the next element
}
// 2. Store local result in SMEM
I[l_idx] = count;
__syncthreads();
// 3. Perform Parallel Reduction in SMEM
for (int s = blockDim.x / 2; s > 0; s >>= 1)
{
if (l_idx < s)
{
I[l_idx] += I[l_idx + s];
}
__syncthreads();
}
// 4. Only thread 0 writes the block's total to Global Memory
if (l_idx == 0)
{
atomicAdd(output, I[0]);
}
}
Results (Example 2: Optimizing Atomic Operations)
| Method | MX330 (ms) | A4000 (ms) |
|---|---|---|
| Naive atomicAdd() | 12.271 ms | 1.000 ms |
| Opt (SMEM + Reduction) | 9.123 ms | 0.990 ms |
Analysis: This result further reinforces the observation about the importance of hardware architecture.
- On MX330 (Pascal): The optimization (offloading global memory atomics) provided a 1.34x benefit (12.271 / 9.123). This is reasonable: on the older Pascal architecture, atomicAdd contention on global memory is an expensive bottleneck, and switching to a reduction in SMEM significantly reduced that contention.
- On RTX A4000 (Ampere): The speedup is almost non-existent, only about a 1% improvement (1.000 / 0.990).
  - Modern GPU architectures like Ampere (Compute Capability 8.6) have much more efficient atomic processing than Pascal. Global memory atomicAdd operations are often processed and coalesced efficiently right at the L2 cache, significantly reducing contention costs.
  - Our "optimized" kernel replaced a bottleneck (global atomics) that Ampere's hardware had already almost completely solved with a series of other operations: writes to SMEM, multiple __syncthreads() barriers, and more complex logic.
  - As a result, the total overhead of the parallel reduction in SMEM is almost equal to the time saved by offloading the global atomics. In other words, we "optimized" a problem that is no longer a major issue on modern hardware.
General Conclusion: Classic optimization techniques (like SMEM tiling, reducing atomics) are still extremely valuable, but their effectiveness is no longer absolute. A technique might be a "lifesaver" on one hardware generation, but become inefficient (or even counter-productive) on the next, due to improvements in caches, bandwidth, and specialized processing units (like L2 cache atomics, Tensor Cores).
2. Memory Coalescing#
This is one of the most important concepts when optimizing Global Memory access.
When the 32 threads in a warp (a group of 32 synchronously executing threads) access global memory, the GPU hardware tries to "coalesce" these 32 requests into as few memory transactions as possible.
- Ideal Case (Coalesced): 32 threads access 32 contiguous locations in memory (e.g., A[idx], A[idx+1], ..., A[idx+31]). The GPU hardware reads memory in 32-byte or 128-byte segments. When the access is coalesced, a warp (32 threads) reading 32 float values (32 * 4 = 128 bytes) requires only a single 128-byte transaction.
- Worst Case (Uncoalesced): 32 threads access 32 random or non-contiguous locations (strided access). For example, if 32 threads access A[idx * 100], they may fall into 32 different memory segments. The GPU will have to perform 32 separate 128-byte transactions, wasting 31/32 of the bandwidth!
In the naive matrix multiplication kernel above, adjacent threads in a warp have consecutive col values. The access B[k * K + col] is therefore coalesced (consecutive threads read consecutive elements of B), while A[row * N + k] resolves to the same address for those threads and is served as a cheap broadcast. If the index mapping were swapped so that adjacent threads had consecutive row values instead, the load A[row * N + k] would become a strided access (stride of N elements), scattering each warp's requests across many segments and causing severe bandwidth waste. 1
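To make the contrast concrete, here is a toy pair of copy kernels (an illustrative sketch, not part of the article's benchmarks): the first lets each warp read 32 consecutive floats per load, the second reads with a large stride so that each thread's load lands in a different memory segment.
// Coalesced: consecutive threads read consecutive addresses.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}
// Strided: consecutive threads read addresses 'stride' elements apart,
// scattering each warp's requests across many memory segments.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    long j = (long)i * stride;
    if (j < n)
        out[i] = in[j];
}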
3. Bank Conflicts#
This problem occurs in Shared Memory. SMEM is not a monolithic block; it is divided into 32 memory "banks". Threads in a warp can access SMEM in parallel if they access different banks.
Bank Conflict: Occurs when two or more threads in a warp try to access addresses located in the same bank. These accesses will be serialized, negating the speed of SMEM.
Rule: Specifically, for SMEM with a 4-byte word size (for float or int), bank b holds the 4-byte words whose byte address addr satisfies (addr / 4) mod 32 = b. For example, the float at byte offset 260 belongs to bank (260 / 4) mod 32 = 65 mod 32 = 1.
Classic Example:
__shared__ float A[32][32];
- Row-wise access: A[my_row][threadIdx.x]
  - 32 threads access A[r][0], A[r][1], ..., A[r][31].
  - These addresses are contiguous, so they fall into 32 different banks (Bank 0, 1, 2, ...).
  - Result: Very fast, no conflict.
- Column-wise access: A[threadIdx.x][my_col]
  - Thread 0 accesses A[0][c] (e.g., Bank 1).
  - Thread 1 accesses A[1][c] (32 elements away from A[0][c]). Its address mod 32 also lands in Bank 1.
  - Thread 2 accesses A[2][c] (64 elements away from A[0][c]). Its address mod 32 also lands in Bank 1.
  - Result: Disaster! A 32-way bank conflict! 32 threads access Bank 1 and are serialized.
Imagine the 32 banks as 32 cashier checkout lanes:
- Row-wise access: 32 people (threads) go to 32 different lanes (Bank 0...31). All are served in parallel.
- Column-wise access: All 32 people (threads) line up at a single lane (e.g., Bank 1). 31 people must wait.
How to avoid: Padding. We break the access pattern by changing the array dimension:
// Add 1 padding column
__shared__ float A[32][33];
Now, when accessing column-wise A[threadIdx.x][my_col]:
- Thread 0 accesses A[0][c] (Bank c % 32).
- Thread 1 accesses A[1][c] (33 elements away): Bank (c + 33) % 32 = Bank (c + 1) % 32.
- Thread 2 accesses A[2][c] (66 elements away): Bank (c + 66) % 32 = Bank (c + 2) % 32.
- Result: No more conflict! We "sacrificed" a little SMEM in exchange for parallel access speed.
4. Occupancy#
Occupancy is the ratio of active warps on an SM to the maximum number of warps that SM can support (e.g., 32 active warps / 64 max warps = 50% occupancy).
Occupancy is a core factor in Latency Hiding. When a warp has to stop (stall)—for example, waiting for data from global memory—the SM's scheduler can immediately switch to executing another warp that is in a ready state.
- Low Occupancy: The SM doesn't have enough ready warps to switch to. When the only active warp stalls, the entire SM becomes "idle", wasting computational resources.
- High Occupancy: The SM has many warps to choose from, helping to hide memory latency and keep the compute cores busy.
This is the fundamental difference: CPUs are latency-oriented (try to complete 1 task very quickly) while GPUs are throughput-oriented (try to complete the most tasks in a unit of time). High occupancy is the key to this throughput model.
Occupancy is limited by whichever resource on the SM is exhausted first:
- Registers: If each thread uses too many registers, the SM won't have enough registers to keep many warps resident, so occupancy decreases.
- Shared Memory: If each block uses too much SMEM, the SM won't have enough SMEM to hold many blocks, so occupancy decreases.
- Threads/block count: If the number of threads per block is too low (e.g., 64, which is only 2 warps), the per-SM limit on resident blocks is reached long before the warp limit, so high occupancy becomes unreachable.
- Hardware Limits: Each SM has a maximum number of blocks and threads it can manage. For example, the A4000 (CC 8.6) can hold at most 1536 resident threads per SM (equivalent to 48 warps) and 16 resident blocks per SM. 1
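You rarely have to work these limits out by hand: the runtime can report how many blocks of a given kernel fit on one SM. A minimal sketch (assuming the count_equal_naive kernel from earlier is visible in the same translation unit, and device 0):
#include <cstdio>
#include <cuda_runtime.h>
void report_occupancy()
{
    int blockSize = 256;
    int maxBlocksPerSM = 0;
    // Ask the runtime how many blocks of this kernel can be resident on one SM.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxBlocksPerSM, count_equal_naive, blockSize, 0 /* dynamic SMEM bytes */);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int activeWarps = maxBlocksPerSM * blockSize / 32;
    int maxWarps    = prop.maxThreadsPerMultiProcessor / 32;
    printf("Theoretical occupancy: %.0f%% (%d/%d warps per SM)\n",
           100.0 * activeWarps / maxWarps, activeWarps, maxWarps);
}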
5. Kernel Fusion#
This is a technique of merging multiple kernels (that run sequentially) into a single kernel.
Example 1: Instead of running:
- kernel_add(A, B, C); // C = A + B
- kernel_scale(C, D); // D = C * alpha
We merge them into:
- kernel_add_and_scale(A, B, D); // D = (A + B) * alpha
Example 2: A SAXPY-style operation. Classic SAXPY is Y = a*X + Y; here we use the variant Y = a*X + b.
- Unoptimized:
  - kernel_scale(X, a, Temp); // Temp = a*X (writes Temp to Global Mem)
  - kernel_add(Temp, b, Y); // Y = Temp + b (reads Temp back from Global Mem)
- Optimized (Fusion):
  - kernel_fused(X, a, b, Y); // A single kernel computes Y[i] = a*X[i] + b. Temp only exists in a register, completely eliminating the read and write of Temp in global memory. (A sketch of such a kernel follows below.)
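As a concrete illustration, the fused kernel named above could look like this minimal sketch (hypothetical signature, scalar b):
__global__ void kernel_fused(const float *X, float a, float b, float *Y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        float temp = a * X[i]; // the intermediate value lives only in a register
        Y[i] = temp + b;       // no extra round trip through global memory
    }
}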
Benefits:
- Reduces kernel launch overhead: Every __global__ kernel launch has a small overhead.
- Eliminates Global Memory access: This is the biggest benefit. In the examples above, the intermediate variable (C or Temp) must be written to global memory, and the next kernel must read it back from global memory. A fused kernel can keep this intermediate value in a register, completely eliminating two expensive global memory accesses.
Trade-offs:
- Increased complexity: Fused kernels are harder to write and debug.
- Increased resource pressure: The new kernel does more work and may require more registers (to hold intermediate variables), which can reduce occupancy.
Kernel fusion is a trade-off: you reduce memory bandwidth requirements, but may increase computational resource requirements on the SM. You need to measure to see if the trade-off is worth it.
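A common way to do that measurement is with CUDA events, which time the work on the GPU itself. A minimal, hypothetical fragment, reusing the kernel_fused sketch and assuming device buffers d_X, d_Y and scalars a, b, n exist:
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
kernel_fused<<<(n + 255) / 256, 256>>>(d_X, a, b, d_Y, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);              // wait for the kernel to finish
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
printf("Fused kernel: %.3f ms\n", ms);
cudaEventDestroy(start);
cudaEventDestroy(stop);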
Conclusion#
GPU optimization is a process that requires both theoretical understanding and practical experience. It forces us to revisit seemingly familiar problems—like matrix multiplication or array summation—from a completely new parallel perspective.
The core of this process is identifying the performance bottleneck in the kernel: whether it is memory-bound, compute-bound, or latency-bound. Using tools like NVIDIA Nsight is crucial to 'profile' the kernel and find exactly where to focus efforts.
We have gone through basic but extremely important techniques. In future articles, we may explore more advanced topics such as Dynamic Parallelism, the power of Tensor Cores (available from the Volta architecture, CC ≥ 7.0, onwards) for deep learning, and Asynchronous Operations with streams.
Nguyen Xuan Hoa
nguyenxuanhoakhtn@gmail.com