By the end of this chapter you will understand enough about GPU programming to solve your first ranked problem, GELU.

Introduction to GPUs

A GPU is a large cluster of small cores, each significantly weaker on its own than a CPU core. Modern GPUs can typically run hundreds of thousands of threads, as opposed to the 16 or 32 typical of CPUs. With this many threads we gain massive parallelism: the ability to execute thousands of instructions at the same time. For example, below is the Mandelbrot set computed on a CPU (left) and a GPU (right).

Both sides compute the same fractal: z = z² + c iterated up to 15 times per pixel. The GPU advances every pixel one pass per tick; the CPU runs the same passes serially, sweeping the grid pixel by pixel. Both are fast, but the GPU is faster.
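To make the "one thread per pixel" idea concrete, here is a minimal sketch of that iteration as a CUDA kernel. The 48 × 48 image size, the viewport mapping, and the helper name `mandelbrot_host` are assumptions made for illustration; only the z = z² + c rule and the cap of 15 iterations come from the text. The index arithmetic on the first two lines of the kernel is covered in the Blocks section, and error checking is left out for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int WIDTH = 48;     // hypothetical image size
constexpr int HEIGHT = 48;
constexpr int MAX_ITER = 15;  // iteration cap quoted in the text

// One thread per pixel: iterate z = z^2 + c until |z| > 2 or the cap is hit,
// and store the number of iterations performed.
__global__ void mandelbrot(int* iters) {
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= WIDTH || py >= HEIGHT) return;

    // Assumed viewport: real part in [-2, 1), imaginary part in [-1, 1).
    float cr = -2.0f + 3.0f * px / WIDTH;
    float ci = -1.0f + 2.0f * py / HEIGHT;

    float zr = 0.0f, zi = 0.0f;
    int i = 0;
    while (i < MAX_ITER && zr * zr + zi * zi <= 4.0f) {
        float next_zr = zr * zr - zi * zi + cr;
        zi = 2.0f * zr * zi + ci;
        zr = next_zr;
        ++i;
    }
    iters[py * WIDTH + px] = i;
}

// Hypothetical host helper: launches the kernel and copies the result back.
std::vector<int> mandelbrot_host() {
    int* d_iters = nullptr;
    cudaMalloc(&d_iters, WIDTH * HEIGHT * sizeof(int));

    dim3 block(16, 16);
    dim3 grid((WIDTH + block.x - 1) / block.x, (HEIGHT + block.y - 1) / block.y);
    mandelbrot<<<grid, block>>>(d_iters);

    std::vector<int> iters(WIDTH * HEIGHT);
    cudaMemcpy(iters.data(), d_iters, iters.size() * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_iters);
    return iters;
}

int main() {
    std::vector<int> iters = mandelbrot_host();
    // Crude ASCII rendering: '#' marks pixels that never escaped.
    for (int y = 0; y < HEIGHT; ++y) {
        for (int x = 0; x < WIDTH; ++x)
            putchar(iters[y * WIDTH + x] == MAX_ITER ? '#' : '.');
        putchar('\n');
    }
    return 0;
}
```

A CPU version would wrap the same loop body in two nested `for` loops over `px` and `py`; on the GPU those loops disappear because each thread handles exactly one pixel.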

The Programming Model

To program the threads, you write a program as a chunk of code and then set it to execute across the threads simultaneously. To grasp this concept, use the visualization below. Each pixel represents a thread that you can hover over to see the computation it performs.

[Interactive visualization: kernel(x, y) evaluated for x, y ∈ 0..47, one thread per pixel. The expression may use x, y, numbers, the operators + - * / % ** ^ | &, and sin, cos, sqrt, abs, min, max, PI, etc.]

This idea of letting threads execute the same instruction on different data is what Nvidia calls the SIMT (Single Instruction, Multiple Threads) model. How do we get different results if the instructions are the same? Each thread has its own threadIdx, which it uses as the offset for accessing memory. You can see it in the visualization below.

[Interactive visualization: kernel(threadIdx), where memory[threadIdx] is this thread's slot and threadIdx is this thread's ID. Threads t10 to t15 are shown next to their memory cells; every thread runs in one tick.]

The same kernel runs on every thread. Each one uses its threadIdx to write into its own slot of memory[]. Hover any thread or memory cell to trace the pair and see what that thread computes.
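In real CUDA the same pattern might look like the sketch below. The kernel name `fill_squares`, the helper `run_fill_squares`, and the choice of writing the square of the index are illustrative assumptions. The launch `<<<1, num_threads>>>` starts a single group of `num_threads` threads; how threads are grouped is the topic of the Blocks section. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Every thread executes this same body; only the value of threadIdx.x differs,
// so every thread writes a different value into a different slot.
__global__ void fill_squares(int* memory) {
    memory[threadIdx.x] = threadIdx.x * threadIdx.x;
}

// Hypothetical host helper: runs the kernel on num_threads threads
// (num_threads must not exceed the per-block limit, 1024 on current GPUs).
std::vector<int> run_fill_squares(int num_threads) {
    int* d_memory = nullptr;
    cudaMalloc(&d_memory, num_threads * sizeof(int));

    fill_squares<<<1, num_threads>>>(d_memory);

    std::vector<int> host(num_threads);
    cudaMemcpy(host.data(), d_memory, num_threads * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_memory);
    return host;
}

int main() {
    for (int value : run_fill_squares(16)) printf("%d ", value);
    printf("\n");
    return 0;
}
```

Note that the kernel contains no loop over the array: the "loop" is the set of threads itself.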

However, the data we are trying to address is often multi-dimensional: a 2D matrix or a 3D tensor, for example. To accommodate that, CUDA stores threadIdx as a 3-component ID with .x, .y and .z addressable separately.
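As a rough illustration of the 2D case, the sketch below lets each thread address one cell of a small row-major matrix through threadIdx.x (column) and threadIdx.y (row). The matrix size, the names `label_cells` and `run_label_cells`, and the value written are assumptions for the example; `dim3` is CUDA's built-in type for describing a shape of up to three dimensions.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int ROWS = 4;  // hypothetical matrix shape
constexpr int COLS = 8;

// Each thread owns one matrix cell, identified by a (row, col) pair
// taken from the .y and .x components of threadIdx.
__global__ void label_cells(int* matrix) {
    int col = threadIdx.x;
    int row = threadIdx.y;
    // Row-major layout: cells of one row sit next to each other in memory.
    matrix[row * COLS + col] = row * 10 + col;
}

// Hypothetical host helper: launches COLS x ROWS threads and returns the matrix.
std::vector<int> run_label_cells() {
    int* d_matrix = nullptr;
    cudaMalloc(&d_matrix, ROWS * COLS * sizeof(int));

    dim3 threads(COLS, ROWS);  // .x = COLS, .y = ROWS, .z defaults to 1
    label_cells<<<1, threads>>>(d_matrix);

    std::vector<int> host(ROWS * COLS);
    cudaMemcpy(host.data(), d_matrix, host.size() * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_matrix);
    return host;
}

int main() {
    std::vector<int> matrix = run_label_cells();
    for (int row = 0; row < ROWS; ++row) {
        for (int col = 0; col < COLS; ++col)
            printf("%3d", matrix[row * COLS + col]);
        printf("\n");
    }
    return 0;
}
```

The multi-component index is a convenience: memory is still a flat array, and the kernel converts (row, col) back into a single offset.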

Blocks

There is an abstraction on top of threads called blocks. A block is a 3D group of threads, and the blocks themselves are arranged in a 3D grid. Together they form a finite 3D grid of finite 3D grids! The dimensions of both are controlled by you, the programmer. Blocks are also addressable by an identifier similar to threadIdx: blockIdx, which also has 3 components. Note that threadIdx is only unique within its own block, so a thread's global position combines blockIdx, the block dimensions, and threadIdx. All the same principles for indexing memory at an offset apply. Below is a visualization to get you intuitively acquainted with the idea.

[Interactive visualization: launch shape with gridDim = (3, 3, 1) and blockDim = (4, 2, 2), giving 9 blocks × 16 threads = 144 threads. Click a block to focus, drag to orbit, and hover a thread to see its block and thread index.]
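The launch shape from the visualization, a 3 × 3 × 1 grid of 4 × 2 × 2 blocks, can be reproduced with a sketch like the one below. Each thread flattens its (blockIdx, threadIdx) pair into one global ID and records which block it belongs to. The names `record_block` and `run_launch_shape` are hypothetical, and the x-fastest flattening order is one common convention, not the only possible one.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Flatten the 3D block index and 3D thread index into one global ID,
// then store the owning block's number in that slot.
__global__ void record_block(int* out) {
    int block_id = blockIdx.x
                 + blockIdx.y * gridDim.x
                 + blockIdx.z * gridDim.x * gridDim.y;
    int thread_in_block = threadIdx.x
                        + threadIdx.y * blockDim.x
                        + threadIdx.z * blockDim.x * blockDim.y;
    int threads_per_block = blockDim.x * blockDim.y * blockDim.z;
    int global_id = block_id * threads_per_block + thread_in_block;
    out[global_id] = block_id;
}

// Hypothetical host helper using the shape shown in the visualization.
std::vector<int> run_launch_shape() {
    dim3 grid(3, 3, 1);   // gridDim: 9 blocks
    dim3 block(4, 2, 2);  // blockDim: 16 threads per block
    int total = static_cast<int>(grid.x * grid.y * grid.z *
                                 block.x * block.y * block.z);

    int* d_out = nullptr;
    cudaMalloc(&d_out, total * sizeof(int));
    record_block<<<grid, block>>>(d_out);

    std::vector<int> host(total);
    cudaMemcpy(host.data(), d_out, total * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return host;
}

int main() {
    std::vector<int> owner = run_launch_shape();
    printf("total threads: %zu\n", owner.size());
    printf("slot 0 -> block %d, slot 16 -> block %d, slot 143 -> block %d\n",
           owner[0], owner[16], owner[143]);
    return 0;
}
```

Inside the kernel, gridDim and blockDim hold the two shapes passed between the `<<<` and `>>>`, so the same kernel keeps working if the launch shape changes.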

Programming in CUDA

The chunks of code that GPU threads run are called kernels. You can see what they look like in the panel below; they are written in a C++-like language with some extra syntax and built-in variables.

vector_add.cu

__global__ void vector_add(
    const float* a,
    const float* b,
    float* c
) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    c[tid] = a[tid] + b[tid];
}

Kernel declaration

CUDA kernels are normal functions with the __global__ prefix, which tells the compiler that the function should be executed on the GPU.

Our parameters here consist of two input pointers, a and b, which are the vectors we're trying to add, and an output pointer c, to which we write the result.

A complete CUDA kernel that adds two vectors together. Hover any line to see what it's doing.
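A kernel only runs once host code allocates GPU memory, copies the inputs over, and launches it. The sketch below shows one way that might look for vector addition. It uses a variant of the kernel with an extra length parameter `n` and a bounds check; both are additions for this example (not part of the listing above) so that the vector length need not be a multiple of the block size. The names `vector_add_checked` and `add_on_gpu` and the block size of 256 are assumptions, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// vector_add with an added length parameter and bounds check, so that
// surplus threads in the last block do not touch memory past the end.
__global__ void vector_add_checked(const float* a, const float* b,
                                   float* c, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) c[tid] = a[tid] + b[tid];
}

// Hypothetical host helper; assumes a and b are non-empty and equally long.
std::vector<float> add_on_gpu(const std::vector<float>& a,
                              const std::vector<float>& b) {
    int n = static_cast<int>(a.size());
    size_t bytes = n * sizeof(float);

    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;                         // threads per block (assumed)
    int blocks = (n + threads - 1) / threads;  // round up to cover all n
    vector_add_checked<<<blocks, threads>>>(d_a, d_b, d_c, n);

    std::vector<float> c(n);
    // This copy waits for the kernel to finish before reading the result.
    cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return c;
}

int main() {
    std::vector<float> a(1000), b(1000);
    for (int i = 0; i < 1000; ++i) {
        a[i] = static_cast<float>(i);
        b[i] = 2.0f * i;
    }
    std::vector<float> c = add_on_gpu(a, b);
    printf("c[10] = %.1f\n", c[10]);
    return 0;
}
```

With 1000 elements and 256 threads per block, 4 blocks are launched, which is 1024 threads in total; the bounds check makes the last 24 threads do nothing.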

Below is the final visualization for this chapter. While using it, try to be creative; you have a lot of freedom to create interesting things. The more time you spend playing around with it, the easier the upcoming chapters will feel.

[Interactive visualization: output(threadIdx, blockIdx) for 64 threads, with gridDim = (2, 2, 1) and blockDim = (4, 4, 1), shaded by output from 0 to 1. Available variables: tx, ty, tz for threadIdx; bx, by, bz for blockIdx; bdx… for blockDim; gdx… for gridDim. Click a thread to pin it, or zoom in for values.]

Every thread runs the same kernel on its own coordinates and is shaded by the scalar result (dark = low, bright = high). Hover a thread to read its computation, click a block to dive in, or zoom in to see each output value in place.