SIMD Explained: Hands-on Hardware Hacking for Software Developers
As software engineers, we often treat the GPU as a "black box" that runs shaders or CUDA kernels. tiny-gpu, a minimal GPU written in Verilog, peels back that layer. Here is a breakdown of why it's cool, how to get started, and what the code looks like.
Most of us are used to the von Neumann model of sequential CPU execution. Diving into a Verilog-based GPU helps you understand:
SIMD (Single Instruction, Multiple Data): You'll see how a single instruction is physically broadcast to multiple execution units.
Parallelism Constraints: You'll understand why branching (if/else) is expensive on a GPU. When threads that share one instruction take different paths (branch divergence), the hardware needs extra logic to manage them, and in a hardware description you can see what that logic costs.
Memory Bandwidth: You'll realize why data movement, not the computation itself, is usually the bottleneck.
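To make the branching cost concrete, here is a minimal sketch of one common way to handle divergence: run both sides of the if/else one after the other, and enable only the lanes whose condition matches the current pass. The module name divergent_lanes, the pass input, and the arithmetic are all hypothetical and chosen for illustration; this is not a description of how tiny-gpu works.

```verilog
// Hypothetical 4-lane unit computing, per lane:
//   out = (x > threshold) ? x - threshold : x + 1
// Pass 0 executes the "then" side, pass 1 executes the "else" side.
module divergent_lanes #(
    parameter LANES = 4
) (
    input                     clk,
    input                     pass,        // 0 = "then" pass, 1 = "else" pass
    input      [7:0]          threshold,   // shared by all lanes
    input      [8*LANES-1:0]  x_flat,      // LANES x 8-bit inputs, packed
    output reg [8*LANES-1:0]  out_flat     // LANES x 8-bit results, packed
);
    genvar i;
    generate
        for (i = 0; i < LANES; i = i + 1) begin : lane
            wire [7:0] x     = x_flat[8*i +: 8];
            wire       taken = (x > threshold);  // this lane's branch condition

            always @(posedge clk) begin
                if (pass == 1'b0 && taken)
                    out_flat[8*i +: 8] <= x - threshold;  // "then" lanes work
                else if (pass == 1'b1 && !taken)
                    out_flat[8*i +: 8] <= x + 8'd1;       // "else" lanes work
                // otherwise this lane is masked off and does nothing
            end
        end
    endgenerate
endmodule
```

Every lane sits idle during one of the two passes, so the if/else costs two clock cycles here even though each lane only needs one of the two results.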
The tiny-gpu project focuses on a minimal implementation. Instead of the thousands of cores in an NVIDIA RTX card, this gives you a handful of cores so you can actually trace the signal from the instruction fetch to the register file.
The Dispatcher: Distributes groups of threads (blocks) to the available cores.
The Cores: Each contains simple Arithmetic Logic Units (ALUs) that perform calculations in parallel.
The Register File: Stores the local variables for each "thread."
Since this is hardware description code (Verilog), you don't "run" it like Python. You simulate it or synthesize it for an FPGA.
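Clone the repo:
    git clone https://github.com/adam-maj/tiny-gpu.git
Install Tools: You'll need Icarus Verilog (for simulation) and GTKWave (to view the waveform diagrams).
Run a Simulation: For a generic Verilog project, you would typically compile and run with the commands below. Here testbench.v and gpu.v are placeholder file names, so check the repo's README for its actual build steps.
    iverilog -o gpu_sim testbench.v gpu.v
    vvp gpu_sim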
In Verilog, we aren't writing lines of code that execute one after another; we are describing hardware connections. Here's a simplified conceptual example of how a GPU core might look in a project like this:
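// A simplified "Core" module
module gpu_core (
    input             clk,
    input      [31:0] instruction,
    input      [31:0] data_in,
    output reg [31:0] data_out
);
    // Internal registers (like a tiny scratchpad for this thread).
    // 16 entries, so every 4-bit register field in the instruction is a valid index.
    reg [31:0] regs [0:15];

    always @(posedge clk) begin
        // Decode and Execute (Simplified): the top 4 bits select the operation
        case (instruction[31:28])
            4'b0001: regs[instruction[11:8]] <= regs[instruction[7:4]] + regs[instruction[3:0]]; // ADD
            4'b0010: data_out <= regs[instruction[7:4]]; // OUTPUT
            4'b0011: regs[instruction[11:8]] <= data_in; // LOAD: write data_in into a register
            default: ; // any other opcode is a no-op
        endcase
    end
endmodule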
always @(posedge clk): The logic inside this block updates its registers on every rising edge of the clock, that is, each time the clock "ticks."
Parallelism: In the main GPU module, you would instantiate multiple copies of this gpu_core, all receiving the same instruction at the same time. That's SIMD in action!
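The broadcast described above can be sketched with a generate loop. This is a self-contained illustration under stated assumptions: simd_lane, simd_top, the LANES parameter, and the three-opcode encoding are hypothetical names made up for this example, not code from tiny-gpu.

```verilog
// Hypothetical lane with one private accumulator.
module simd_lane (
    input             clk,
    input      [31:0] instruction,  // the same word is wired to every lane
    input      [31:0] data_in,      // each lane gets its own operand
    output reg [31:0] data_out
);
    reg [31:0] acc;

    always @(posedge clk) begin
        case (instruction[31:28])
            4'b0001: acc      <= data_in;        // LOAD: acc = data_in
            4'b0010: acc      <= acc + data_in;  // ADD:  acc = acc + data_in
            4'b0011: data_out <= acc;            // OUT:  expose acc
            default: ;                           // anything else is a no-op
        endcase
    end
endmodule

// Top level: one instruction in, LANES independent data paths.
module simd_top #(
    parameter LANES = 4
) (
    input                  clk,
    input  [31:0]          instruction,    // broadcast to all lanes
    input  [32*LANES-1:0]  data_in_flat,   // LANES x 32-bit operands, packed
    output [32*LANES-1:0]  data_out_flat   // LANES x 32-bit results, packed
);
    genvar i;
    generate
        for (i = 0; i < LANES; i = i + 1) begin : lane
            simd_lane u_lane (
                .clk        (clk),
                .instruction(instruction),               // identical for every lane
                .data_in    (data_in_flat[32*i +: 32]),  // lane i's own slice
                .data_out   (data_out_flat[32*i +: 32])
            );
        end
    endgenerate
endmodule
```

The instruction port is a single set of wires fanned out to every lane, while the data ports are sliced per lane. Changing LANES changes how much hardware is stamped out, not how many instructions are issued.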
When you run the simulation, you'll get a .vcd file, provided the testbench dumps one with $dumpfile and $dumpvars. Open it in GTKWave and you see the "heartbeat" of the GPU.
Instead of a console log, you see lines jumping from 0 to 1. This shows exactly when a register gets its value and how many clock cycles a calculation takes.
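For reference, here is a minimal sketch of a testbench that produces such a waveform file. The accumulator module, the tb module, and the file name wave.vcd are hypothetical and stand in for whatever design you are simulating; $dumpfile and $dumpvars are the standard Verilog system tasks for writing a VCD.

```verilog
`timescale 1ns/1ps

// Hypothetical device under test: adds data_in to sum on every clock edge.
module accumulator (
    input            clk,
    input            rst,
    input      [7:0] data_in,
    output reg [7:0] sum
);
    always @(posedge clk) begin
        if (rst) sum <= 8'd0;
        else     sum <= sum + data_in;
    end
endmodule

module tb;
    reg        clk     = 0;
    reg        rst     = 1;
    reg  [7:0] data_in = 8'd0;
    wire [7:0] sum;

    accumulator dut (.clk(clk), .rst(rst), .data_in(data_in), .sum(sum));

    always #5 clk = ~clk;            // 10 ns clock period

    initial begin
        $dumpfile("wave.vcd");       // the file GTKWave will open
        $dumpvars(0, tb);            // record every signal under tb

        @(posedge clk);              // first edge: reset clears sum
        #1 rst = 0; data_in = 8'd3;  // release reset, feed a constant 3
        repeat (4) @(posedge clk);   // four accumulating edges
        #1 $display("sum = %0d", sum);  // prints "sum = 12"
        $finish;
    end
endmodule
```

Assuming you save this as tb.v, you can compile it with iverilog -o sim tb.v, run it with vvp sim, and then open the result with gtkwave wave.vcd. In the waveform you should see sum step up by 3 on each rising clock edge after reset is released.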
Exploring tiny-gpu is like learning how to build a combustion engine when you’ve only ever driven a car. It makes you a much better "driver" (programmer) because you’ll know exactly why certain coding patterns are faster than others.