Fundamentals
Introduction to GPUs
By the end of this chapter you will understand enough about GPU programming to solve your first ranked problem, GELU.
Introduction to GPUs
A GPU can be pictured as a large cluster of small threads, each significantly less powerful than a CPU thread. Modern GPUs can keep hundreds of thousands of threads in flight, as opposed to a CPU's typical 16 or 32. With this many threads we gain massive parallelism: the ability to execute thousands of instructions at the same time. For example, below is the Mandelbrot set computed on a CPU (left) and a GPU (right).
z = z² + c iterated up to 15 times per pixel. The GPU advances every pixel one pass per tick; the CPU runs the same passes serially, sweeping the grid pixel by pixel. Both are fast, but the GPU is faster.
The Programming model
To program the threads, you define a program as a chunk of code and then set it to execute across all the threads simultaneously. To grasp this concept, use the visualization below. Each pixel represents a thread; hover over one to see the computation it performs.
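As a rough sketch of what "the same chunk of code on every thread" could look like for the Mandelbrot example above, the CUDA program below assigns one thread to each pixel. The names `mandelbrot` and `render_mandelbrot`, the 16 x 16 block size, and the view window are assumptions made for illustration, and error checking is left out for brevity. The built-in index variables it uses (`threadIdx`, `blockIdx`, `blockDim`) are introduced later in this chapter.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Every thread runs this same function, but for a different pixel.
__global__ void mandelbrot(int* iters, int width, int height, int max_iter) {
    // Which pixel is this thread responsible for?
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= width || py >= height) return;  // surplus threads do nothing

    // Assumed view window: real axis [-2, 1], imaginary axis [-1.5, 1.5].
    float cr = -2.0f + 3.0f * px / width;
    float ci = -1.5f + 3.0f * py / height;

    // Iterate z = z^2 + c until z escapes or max_iter is reached.
    float zr = 0.0f, zi = 0.0f;
    int i = 0;
    while (i < max_iter && zr * zr + zi * zi <= 4.0f) {
        float next_zr = zr * zr - zi * zi + cr;
        zi = 2.0f * zr * zi + ci;
        zr = next_zr;
        ++i;
    }
    iters[py * width + px] = i;
}

// Hypothetical host helper: returns the iteration count of every pixel.
std::vector<int> render_mandelbrot(int width, int height, int max_iter) {
    int* d_iters = nullptr;
    cudaMalloc(&d_iters, width * height * sizeof(int));

    dim3 threads(16, 16);  // 256 threads per block (an arbitrary choice)
    dim3 blocks((width + 15) / 16, (height + 15) / 16);
    mandelbrot<<<blocks, threads>>>(d_iters, width, height, max_iter);

    // Copying back waits for the kernel to finish.
    std::vector<int> iters(width * height);
    cudaMemcpy(iters.data(), d_iters, iters.size() * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_iters);
    return iters;
}

int main() {
    std::vector<int> iters = render_mandelbrot(64, 64, 15);
    printf("iterations at the centre pixel: %d\n", iters[32 * 64 + 32]);
    return 0;
}
```

A CPU version would wrap the same loop body in two nested `for` loops over `px` and `py`; here those loops disappear because each thread handles exactly one pixel.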
This idea of letting threads execute the same instruction on different data is something NVIDIA calls the SIMT (Single Instruction, Multiple Threads) model. How do we get different results if the instructions are the same? Each thread has its own threadIdx, which it uses as an offset when accessing memory. You can see it in the visualization below.
memory[threadIdx] = this thread's slot, threadIdx = this thread's ID.
Each thread uses its threadIdx to write into its own slot of memory[]. Hover any thread or memory cell to trace the pair and see what that thread computes.
However, the data we are trying to address is often multi-dimensional: a 2D matrix, a 3D tensor. To accommodate that, CUDA stores threadIdx as a 3-component ID with .x, .y and .z addressable separately.
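To make the multi-dimensional index concrete, here is a minimal sketch that launches a single block of 8 x 4 threads and lets each thread fill one cell of a small row-major matrix. The names `fill_2d` and `run_fill_2d`, the sizes, and the stored value (`row * 10 + col`) are assumptions chosen only so the result is easy to read; error checking is omitted.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical matrix size for illustration: one thread per cell.
constexpr int WIDTH = 8;
constexpr int HEIGHT = 4;

__global__ void fill_2d(int* matrix) {
    int col = threadIdx.x;  // position along x inside the block
    int row = threadIdx.y;  // position along y inside the block
    // Row-major layout: a cell's slot is row * WIDTH + col.
    matrix[row * WIDTH + col] = row * 10 + col;
}

// Hypothetical host helper: runs the kernel and copies the matrix back.
std::vector<int> run_fill_2d() {
    int* d_matrix = nullptr;
    cudaMalloc(&d_matrix, WIDTH * HEIGHT * sizeof(int));

    dim3 threads(WIDTH, HEIGHT);  // 2D thread layout; z defaults to 1
    fill_2d<<<1, threads>>>(d_matrix);

    std::vector<int> matrix(WIDTH * HEIGHT);
    cudaMemcpy(matrix.data(), d_matrix, matrix.size() * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_matrix);
    return matrix;
}

int main() {
    std::vector<int> matrix = run_fill_2d();
    for (int row = 0; row < HEIGHT; ++row) {
        for (int col = 0; col < WIDTH; ++col) {
            printf("%3d", matrix[row * WIDTH + col]);
        }
        printf("\n");
    }
    return 0;
}
```

The memory itself is still a flat array; the 2D thread index just saves each thread from having to recover its row and column out of a single number.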
Blocks
There is an abstraction on top of threads called blocks. A block is a 3D group of threads, and the blocks themselves are arranged in a 3D grid. Together they form a finite 3D grid of finite 3D grids! The dimensions of both are controlled by you, the programmer. Blocks are also addressable by a unique identifier similar to threadIdx: blockIdx, which also has 3 components. All the same principles for indexing memory at an offset apply. Below is a visualization to get you intuitively acquainted with the idea.
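The sketch below shows one way to see the two levels of indexing together, using only the x component for simplicity. Each thread writes `blockIdx.x * 100 + threadIdx.x` into its own slot, so the hundreds digit shows the block and the remainder shows the thread inside that block. The names `record_ids` and `run_record_ids` and the encoding are assumptions for illustration; `blockDim.x` is the built-in variable holding the number of threads per block along x.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void record_ids(int* ids) {
    // Skip past the slots owned by earlier blocks, then add the
    // thread's offset inside its own block.
    int slot = blockIdx.x * blockDim.x + threadIdx.x;
    ids[slot] = blockIdx.x * 100 + threadIdx.x;
}

// Hypothetical host helper: launches num_blocks blocks of
// threads_per_block threads and returns what each slot received.
std::vector<int> run_record_ids(int num_blocks, int threads_per_block) {
    int total = num_blocks * threads_per_block;
    int* d_ids = nullptr;
    cudaMalloc(&d_ids, total * sizeof(int));

    record_ids<<<num_blocks, threads_per_block>>>(d_ids);

    std::vector<int> ids(total);
    cudaMemcpy(ids.data(), d_ids, total * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_ids);
    return ids;
}

int main() {
    // 3 blocks of 4 threads each: 12 slots in total.
    std::vector<int> ids = run_record_ids(3, 4);
    for (size_t slot = 0; slot < ids.size(); ++slot) {
        printf("slot %2zu <- block %d, thread %d\n",
               slot, ids[slot] / 100, ids[slot] % 100);
    }
    return 0;
}
```

The .y and .z components of blockIdx and threadIdx work the same way along the other two axes.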
Programming in CUDA
The chunks of code that GPU threads run are called kernels. You can see what they look like in the panel below; they are written in C++-like code with some extra syntax and built-in variables.
__global__ void vector_add(const float* a, const float* b, float* c) {
    // Global index of this thread across all blocks.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Each thread adds one pair of elements.
    c[tid] = a[tid] + b[tid];
}
Kernel declaration
CUDA kernels are normal functions marked with the __global__ qualifier, which tells the compiler that the function runs on the GPU and is launched from CPU code.
Our parameters here consist of two input pointers, a and b, which are the vectors we're trying to add, and an output pointer c, to which we write the result.
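The kernel only describes what a single thread does; something on the CPU side still has to allocate GPU memory, copy the inputs over, and launch the kernel. Below is a minimal sketch of that host code. The helper name `add_on_gpu` and the block size of 256 are assumptions, and because the kernel has no bounds check the sketch assumes the vector length is an exact multiple of the block size. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void vector_add(const float* a, const float* b, float* c) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    c[tid] = a[tid] + b[tid];
}

// Assumed block size; any size the GPU supports would do.
constexpr int THREADS_PER_BLOCK = 256;

// Hypothetical host helper. Assumes a.size() == b.size() and that the
// size is a multiple of THREADS_PER_BLOCK, so every thread has a valid slot.
std::vector<float> add_on_gpu(const std::vector<float>& a,
                              const std::vector<float>& b) {
    size_t n = a.size();
    size_t bytes = n * sizeof(float);

    // Allocate GPU memory and copy the inputs to it.
    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    // Launch one thread per element: <<<number of blocks, threads per block>>>.
    int blocks = static_cast<int>(n / THREADS_PER_BLOCK);
    vector_add<<<blocks, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

    // Copy the result back; this waits for the kernel to finish.
    std::vector<float> c(n);
    cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return c;
}

int main() {
    const int n = 1024;  // 4 blocks of 256 threads
    std::vector<float> a(n), b(n);
    for (int i = 0; i < n; ++i) {
        a[i] = static_cast<float>(i);
        b[i] = 2.0f * i;
    }
    std::vector<float> c = add_on_gpu(a, b);
    printf("c[10] = %.1f\n", c[10]);
    return 0;
}
```

For lengths that are not a multiple of the block size, a common approach is to pass the length as an extra kernel parameter, round the block count up, and have each thread return early when its tid is out of range.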
Below is the final visualization for this chapter. While using it, try to be creative; you have a lot of freedom to create interesting things. The more time you spend playing around with it, the easier the upcoming chapters will feel to you.