Hello CUDA!
Learn about GPU architecture and the CUDA model to understand how parallel hardware operates and optimize performance.

Introduction#
If you work in AI, Data Science, or High-Performance Computing (HPC), you've undoubtedly heard of CUDA. It's NVIDIA's platform and programming model that allows us to harness the power of thousands of "cores" inside a GPU.
The explosive rise of AI, especially Deep Learning, is intrinsically linked to this massive parallel processing capability.
But CUDA isn't just an API. It's a new perspective on algorithm design. A classic example: thrust::sort (from Thrust, the parallel algorithms library that ships with CUDA) can, for large workloads with linear memory access, be tens to hundreds of times faster than the traditional std::sort running on a CPU, depending on the architecture and the data.
Why such a massive difference? Because we are redesigning the algorithm to fit how the hardware operates.
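To make that concrete, here is a minimal sketch of my own (assuming the Thrust headers that ship with the CUDA Toolkit) comparing the two calls on the same data:

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

int main() {
    const int N = 1 << 24;  // ~16M elements
    std::vector<float> h(N);
    for (int i = 0; i < N; ++i) h[i] = static_cast<float>(rand()) / RAND_MAX;

    // CPU: classic single-threaded sort
    std::vector<float> cpu = h;
    std::sort(cpu.begin(), cpu.end());

    // GPU: copy to device memory, then sort with Thrust
    thrust::device_vector<float> gpu(h.begin(), h.end());  // host-to-device copy
    thrust::sort(gpu.begin(), gpu.end());                  // runs as CUDA kernels under the hood
    return 0;
}
```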
Inspiration from FlashAttention#
One of the biggest recent inspirations comes from Tri Dao and the FlashAttention paper. Before it, the AI community largely believed that Transformers (like GPT) were compute-bound, i.e., limited by the arithmetic of self-attention.
Tri Dao showed that the real bottleneck was actually memory bandwidth, not compute as many had thought.
The problem wasn't that the GPU was slow at calculating matrix multiplications (FLOPs). The problem was that reading and writing the enormous Q, K, V matrices constantly from Global Memory (HBM) to SRAM (super-fast on-chip memory) took too much time.
FlashAttention's solution is Kernel Fusion. Instead of running multiple separate kernels (one for S = QKᵀ, one for P = softmax(S), one for O = PV) and writing intermediate results to Global Memory each time, FlashAttention performs the entire chain of operations in a single kernel. It uses SRAM/Shared Memory as a cache to hold data, minimizing Global Memory accesses.
The result was a 2-3x speedup in GPT-2 training.
This story is the perfect testament: to optimize GPU performance, you must understand its hardware and memory system.
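To make the kernel fusion idea concrete before moving on, here is a toy sketch of my own (not FlashAttention itself): fusing a scale and an add into one kernel, so the intermediate value lives in a register instead of making a round trip to Global Memory.

```cpp
// Unfused version: one kernel computes y = a*x, a second kernel computes z = y + b,
// and the intermediate y is written to and re-read from Global Memory in between.
// The fused version below does both steps in one kernel.
__global__ void scale_then_add_fused(const float *x, const float *b,
                                     float *z, float a, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        float y = a * x[idx];   // intermediate stays in a register
        z[idx] = y + b[idx];    // one read of x, one read of b, one write of z
    }
}
```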
What is CUDA?#
CUDA is a programming model that enables developers to harness and manage the massive parallelism of modern hardware. To think and program effectively in CUDA, one needs to understand three key aspects: Hardware, Execution Model, and Memory.
1. Hardware#

What are we programming for?
- CPU (Central Processing Unit): Designed for low latency. It has a few "cores" that are super-smart, powerful, and can handle complex tasks sequentially or with small-scale parallelism.
- GPU (Graphics Processing Unit): Designed for high throughput. It has thousands of simpler "CUDA cores" specialized for performing the same operation on thousands of different pieces of data at the same time.
The main execution unit on the GPU is the SM (Streaming Multiprocessor). This is the "heart" of the GPU. Each SM contains a number of CUDA cores (e.g., 128 cores).
So where does the parallelism come from?
- Warp: A group of 32 threads. This is the scheduling unit of the SM.
- Latency Hiding: An SM simultaneously manages many warps (e.g., 64 resident warps). At each clock cycle, the SM's Warp Scheduler selects one (or several) warps that are "ready" to be executed by the CUDA cores.
- Scenario: Warp 1 performs an operation, then needs to read data from Global Memory (very slow, hundreds of cycles). Instead of "sitting and waiting", the SM immediately (with a zero-cost context switch) switches to executing Warp 2 (which is doing an addition). The next cycle, if Warp 2 also has to wait, it switches to Warp 3... And so on, the SM cycles through dozens of warps. By the time it gets back to Warp 1, the data from Global Memory has likely arrived.
Note: Most CUDA performance optimizations revolve around the behavior of the 32 threads within a warp.
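You can query these numbers for your own card with the CUDA runtime API. A minimal sketch (device 0 assumed):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("GPU: %s\n", prop.name);
    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("Warp size: %d threads\n", prop.warpSize);
    printf("Max threads per SM: %d (%d warps)\n",
           prop.maxThreadsPerMultiProcessor,
           prop.maxThreadsPerMultiProcessor / prop.warpSize);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```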
2. Execution Model: Grid, Block, Thread#

How does a programmer organize and manage millions of threads? CUDA provides an excellent abstraction model:
- Thread: The smallest unit, executes one copy of the kernel function (the function running on the GPU).
- Block: A group of threads (e.g., 128, 256, up to 1024 threads). Important: Threads within the same block can communicate and cooperate via Shared Memory.
- Grid: A group of blocks.
When you launch a kernel, you are telling the GPU:
```cpp
// Kernel launch syntax
kernel_function<<<GridSize, BlockSize>>>(parameters...);
```

You are saying: "Create a Grid of GridSize blocks, with each block having BlockSize threads."
Logical → Physical Mapping: Block is a logical concept for the programmer. When a block is scheduled onto an SM, the SM divides that block into physical Warps. For example, a block of 1024 threads will be divided into 32 warps.
To let each thread know who and where it is, CUDA provides built-in variables: blockIdx, threadIdx, blockDim, gridDim.
Below is the "Hello World" example of CUDA: vector addition.
```cpp
// Kernel to add vectors A and B, store the result in C
__global__ void vectorAdd(float *A, float *B, float *C, int N) {
    // Calculate the global index for this thread
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Ensure the thread does not access out of bounds
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    // Skip data initialization, cudaMalloc, cudaMemcpy...
    int N = 1000000; // 1 million elements

    // Use 256 threads per block
    int threadsPerBlock = 256;

    // Calculate the number of blocks needed: (N + 255) / 256
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    // Launch the kernel!
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, N);
    // ...
}
```
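The listing above skips the host-side plumbing, so here is a minimal sketch of what it typically looks like (error checking omitted; the h_/d_ variable names are my own):

```cpp
int main() {
    int N = 1000000;
    size_t bytes = N * sizeof(float);

    // Allocate and fill host arrays
    float *h_A = (float*)malloc(bytes);
    float *h_B = (float*)malloc(bytes);
    float *h_C = (float*)malloc(bytes);
    for (int i = 0; i < N; ++i) { h_A[i] = 1.0f; h_B[i] = 2.0f; }

    // Allocate device (Global Memory) buffers
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);

    // Copy inputs host -> device
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    // Launch the kernel, then copy the result back
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

    // Free device and host memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}
```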
3. The Memory System#
This is the most important part, deciding 90% of performance. As seen with FlashAttention, memory management is everything.
The GPU has a hierarchical memory system. The faster the speed, the smaller the capacity.
| Memory Type | Location | Speed | Scope | Managed By? |
|---|---|---|---|---|
| Registers | On-chip (on SM) | Fastest (~1 cycle) | Per Thread | Compiler (automatic) |
| Shared Memory / L1 | On-chip (on SM) | Very Fast (few cycles) | Per Block | Programmer (__shared__) |
| L2 Cache | On-chip (GPU-wide) | Fast | Entire Grid (GPU) | Hardware (automatic) |
| Global Memory (HBM/VRAM) | Off-chip (on card) | Very Slow (~400-800 cycles) | Entire Grid (GPU) | Programmer (cudaMalloc) |
| Constant Memory | Off-chip (cached on-chip) | Fast (if cache hit) | Entire Grid (GPU) | Programmer (__constant__) |
Most kernels start by reading data from Global Memory (slow) and end by writing data back to it. Optimization is about minimizing Global Memory access by maximizing the use of Shared Memory and Registers.
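As a simplified illustration of that idea, the sketch below (my own, not from the article) computes a block-wide 3-point moving average: each block loads its tile of the input into Shared Memory once, and every thread then reads its neighbors from the fast on-chip copy instead of going back to Global Memory.

```cpp
#define TILE 256  // this sketch assumes the kernel is launched with TILE threads per block

// 3-point moving average with zero padding at the array ends.
__global__ void movingAvg3(const float *in, float *out, int N) {
    __shared__ float tile[TILE + 2];              // block's tile plus a 1-element halo on each side
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;                    // local index, shifted past the left halo

    tile[lid] = (gid < N) ? in[gid] : 0.0f;       // each thread loads one element into Shared Memory
    if (threadIdx.x == 0)                         // first thread also fetches the left halo
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)            // last thread fetches the right halo
        tile[lid + 1] = (gid + 1 < N) ? in[gid + 1] : 0.0f;
    __syncthreads();                              // wait until the whole tile is loaded

    // Three reads from fast Shared Memory instead of three from Global Memory
    if (gid < N)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}
```

Launched as movingAvg3<<<(N + TILE - 1) / TILE, TILE>>>(d_in, d_out, N), each input element is fetched from Global Memory roughly once instead of three times.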
Two core memory optimization concepts:
- Memory Coalescing: Occurs on Global Memory. When the 32 threads in a warp simultaneously access 32 contiguous memory locations (e.g., A[idx], A[idx+1], ..., A[idx+31]), the GPU merges these 32 requests into a single memory transaction, as sketched below. This is the ideal scenario, achieving maximum bandwidth.
- Bank Conflict: Occurs in Shared Memory. Shared Memory is divided into "banks" (usually 32). If multiple threads in a warp (e.g., 2 threads) simultaneously access different addresses that fall on the same bank, these accesses are serialized. This kills parallel performance.
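Here is a minimal sketch of the two access patterns (illustrative kernels of my own): both copy floats, but in the first the 32 threads of a warp touch a contiguous range, while in the second each thread strides far from its neighbors and coalescing breaks down.

```cpp
// Coalesced: consecutive threads read consecutive addresses,
// so a warp's 32 loads collapse into very few memory transactions.
__global__ void copyCoalesced(const float *in, float *out, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) out[idx] = in[idx];
}

// Strided: consecutive threads read addresses `stride` elements apart.
// The warp's 32 loads now touch widely separated cache lines,
// so the hardware needs many more memory transactions for the same data.
__global__ void copyStrided(const float *in, float *out, int N, int stride) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int j = idx * stride;
    if (j < N) out[idx] = in[j];
}
```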
PGO: A Methodology for Optimization#
That’s the theory—but how do we actually know where our kernel is slowing down?
We use Profiling-Guided Optimization (PGO): a scientific methodology consisting of Measurement → Analysis → Hypothesis → Modification → Re-measurement.
Before diving into tools, we need a mental model for analysis. The most powerful and intuitive model is the Roofline Model.
Roofline Model (GPU)#
The Roofline Model is a chart that shows the theoretical upper limit of performance your kernel can achieve on a given hardware. It helps you immediately answer the question: “Is my kernel limited by computation or by memory bandwidth?”

This model has two main axes:
- Y-axis (GFLOPS): Indicates how many billions of floating-point operations per second your kernel performs. Higher is better.
- X-axis (Arithmetic Intensity, AI): The core metric, measured in FLOPs/Byte. It answers: “For every byte of data you read/write from Global Memory, how many operations (FLOPs) do you perform?”
The roof has two parts:
- Horizontal line (Flat Roof): The computational limit (Peak GFLOPS) of your GPU. Your kernel cannot run faster than this limit.
- Diagonal line (Slanted Roof): The memory bandwidth limit (Peak Memory Bandwidth). The kernel’s speed is constrained by how fast it can feed data for computation.
When profiling, your kernel will appear as a point beneath this roof.
- If the point lies below the diagonal line (Memory-bound region): Your bottleneck resembles FlashAttention’s issue—you’re limited by excessive memory reads/writes. To optimize, you must increase Arithmetic Intensity (e.g., using Shared Memory or Kernel Fusion) to move the point rightward.
- If the point lies below the horizontal line (Compute-bound region): You’re using memory efficiently but are now limited by compute speed. To optimize, increase GFLOPS (e.g., leverage Tensor Cores or use half instead of float operations).
The Roofline Model gives you a clear compass for identifying what to optimize.
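For a rough intuition (my own back-of-the-envelope numbers, not profiler output): the vectorAdd kernel above does about 1 FLOP per element while moving about 12 bytes of Global Memory traffic (two 4-byte reads and one 4-byte write), so its Arithmetic Intensity is roughly 1/12 ≈ 0.08 FLOPs/Byte. That puts it far to the left, under the slanted roof: its attainable performance is approximately Peak Bandwidth × 0.08, no matter how many CUDA cores the GPU has. A dense matrix multiplication, by contrast, can reuse each loaded value many times and climb toward the flat roof.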
NVIDIA Nsight#
To gather this data and plot the Roofline chart, we use the NVIDIA Nsight suite:
- Nsight Systems: Provides the macro view. It shows a complete timeline of your application: what the CPU and GPU are doing, how long cudaMemcpy (data transfer) takes, and when each kernel runs. It helps identify which kernel to optimize, or whether you're actually I/O-bound.
- Nsight Compute: Provides the micro view. It dives inside a specific kernel, telling you precisely whether it's memory-bound or compute-bound, its "occupancy" (latency-hiding capability), cache hit/miss rates, uncoalesced accesses, and more.
After profiling, you can start forming hypotheses:

- If Memory-bound (bandwidth-limited):
  - Hypothesis: I have uncoalesced memory access.
    Action: Adjust idx so that the 32 threads of a warp access contiguous addresses.
  - Hypothesis: I repeatedly load the same data from Global Memory.
    Action: Load that data into Shared Memory once and let the entire block reuse it (as FlashAttention does!).
- If Compute-bound (computation-limited):
  - Hypothesis: I'm using unnecessary double (64-bit) operations.
    Action: Switch to float (32-bit).
  - Hypothesis: I'm performing matrix multiplications without Tensor Cores.
    Action: Use cuBLAS, or program with half (16-bit) / tf32 to activate Tensor Cores.
- If Latency-bound (low occupancy):
  - Hypothesis: My kernel doesn't have enough resident warps to hide latency, possibly due to excessive register or shared memory usage per block.
    Action: Reduce resource usage or adjust BlockSize to increase the number of active blocks per SM (see the sketch below).
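For that last case, the CUDA runtime can suggest a launch configuration. A minimal sketch using cudaOccupancyMaxPotentialBlockSize, applied to the vectorAdd kernel from earlier:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vectorAdd(float *A, float *B, float *C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) C[idx] = A[idx] + B[idx];
}

int main() {
    int minGridSize = 0, blockSize = 0;

    // Ask the runtime for the block size that maximizes occupancy
    // for this particular kernel (0 bytes of dynamic shared memory, no block size limit).
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, vectorAdd, 0, 0);

    printf("Suggested block size: %d threads\n", blockSize);
    printf("Minimum grid size for full occupancy: %d blocks\n", minGridSize);

    // You would then launch with:
    //   int gridSize = (N + blockSize - 1) / blockSize;
    //   vectorAdd<<<gridSize, blockSize>>>(d_A, d_B, d_C, N);
    return 0;
}
```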
Conclusion#
Learning CUDA is a mindset-changing journey. It's not just about learning a new API, but learning to design algorithms for parallel hardware.
From the slow std::sort to the speedy thrust::sort, from standard self-attention to FlashAttention, it's all a result of understanding the "machine" you're running on and respecting its rules—especially the rules of memory.
Hopefully, this article has given you an overview of the exciting world of GPU computing.
Nguyen Xuan Hoa
nguyenxuanhoakhtn@gmail.com