SIMD Explained: Hands-on Hardware Hacking for Software Developers

2026-01-18

As software engineers, we often treat the GPU as a "black box" that runs shaders or CUDA kernels. The tiny-gpu project, a minimal GPU described in Verilog, peels back that layer. Here is a breakdown of why it's cool, how to get started, and what the code looks like.

Most of us are used to thinking in the von Neumann style: one instruction stream, executed step by step on a CPU. Diving into a Verilog-based GPU helps you understand:

SIMD (Single Instruction, Multiple Data)
You’ll see how a single instruction is physically broadcast to multiple execution units.

Parallelism Constraints
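You'll understand why branching (if/else) is expensive on a GPU: when threads that share one instruction stream take different paths (branch divergence), the hardware has to run each path in turn while the other lanes sit idle. Working at the hardware level makes the cost of managing that visible.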

Memory Bandwidth
You'll realize why data movement is usually the bottleneck, not the computation itself.
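The branch-divergence point can be sketched in Verilog. This is a hypothetical illustration, not code from tiny-gpu: the module name divergent_lanes, its ports, and the two-pass scheme (a "then" pass followed by an "else" pass) are assumptions chosen to keep the example small. Each lane has a predicate bit, and a lane only updates during the pass that matches its side of the branch.

```verilog
// Hypothetical sketch: 4 lanes share one instruction stream.
// An if/else costs two passes; in each pass some lanes are masked off.
module divergent_lanes #(parameter LANES = 4) (
    input                    clk,
    input                    pass_else, // 0: run the "then" path, 1: run the "else" path
    input      [LANES-1:0]   pred,      // per-lane branch condition
    input      [LANES*8-1:0] x_flat,    // one 8-bit input per lane, packed
    output reg [LANES*8-1:0] y_flat     // one 8-bit result per lane, packed
);
    integer i;
    always @(posedge clk) begin
        for (i = 0; i < LANES; i = i + 1) begin
            if (!pass_else && pred[i])
                y_flat[i*8 +: 8] <= x_flat[i*8 +: 8] + 8'd1; // then: y = x + 1
            else if (pass_else && !pred[i])
                y_flat[i*8 +: 8] <= x_flat[i*8 +: 8] << 1;   // else: y = x * 2
            // otherwise this lane is masked off and idles for the cycle
        end
    end
endmodule

// Small testbench: lanes 0 and 2 take the "then" path, lanes 1 and 3 the "else" path.
module divergent_lanes_tb;
    reg         clk       = 0;
    reg         pass_else = 0;
    reg  [3:0]  pred      = 4'b0101;
    reg  [31:0] x         = {8'd40, 8'd30, 8'd20, 8'd10}; // lane 3 .. lane 0
    wire [31:0] y;

    divergent_lanes #(.LANES(4)) dut (
        .clk(clk), .pass_else(pass_else), .pred(pred), .x_flat(x), .y_flat(y)
    );

    always #5 clk = ~clk;

    initial begin
        @(posedge clk);        // pass 1: only lanes 0 and 2 do work
        #1 pass_else = 1;
        @(posedge clk);        // pass 2: only lanes 1 and 3 do work
        #1 $display("%0d %0d %0d %0d", y[7:0], y[15:8], y[23:16], y[31:24]);
        // prints "11 40 31 80"
        $finish;
    end
endmodule
```

Without the branch, all four lanes would finish in one clock cycle. With it, the same work takes two cycles, and half the lanes are idle in each one.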

The tiny-gpu project focuses on a minimal implementation. Instead of the thousands of cores in an NVIDIA RTX card, this gives you a handful of cores so you can actually trace a signal from the instruction fetch to the register file. The main pieces are:

The Dispatcher
Sends instructions to the available cores.

The Cores
Simple Arithmetic Logic Units (ALUs) that perform calculations in parallel.

The Register File
Where local variables for each "thread" are stored.
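To make the last of these components concrete, here is a minimal sketch of a per-thread register file. It is an illustration under assumptions, not the tiny-gpu source: the module name thread_regfile, the port names, and the sizes (16 registers of 8 bits) are all hypothetical. It has one synchronous write port and two combinational read ports, which is enough to feed a two-operand ALU.

```verilog
// Hypothetical per-thread register file: 16 registers, 8 bits each (assumed sizes).
module thread_regfile (
    input        clk,
    input        we,       // write enable
    input  [3:0] rd,       // destination register index
    input  [3:0] rs,       // first source register index
    input  [3:0] rt,       // second source register index
    input  [7:0] wdata,    // value to write into regs[rd]
    output [7:0] rs_data,
    output [7:0] rt_data
);
    reg [7:0] regs [0:15];

    // Reads are combinational: the outputs follow rs and rt immediately.
    assign rs_data = regs[rs];
    assign rt_data = regs[rt];

    // Writes are synchronous: they land on the rising clock edge.
    always @(posedge clk)
        if (we) regs[rd] <= wdata;
endmodule

// Small testbench: write two registers, then read both back.
module thread_regfile_tb;
    reg        clk = 0;
    reg        we;
    reg  [3:0] rd, rs, rt;
    reg  [7:0] wdata;
    wire [7:0] rs_data, rt_data;

    thread_regfile dut (
        .clk(clk), .we(we), .rd(rd), .rs(rs), .rt(rt),
        .wdata(wdata), .rs_data(rs_data), .rt_data(rt_data)
    );

    always #5 clk = ~clk;

    initial begin
        we = 1; rd = 4'd1; wdata = 8'd5;   // r1 <= 5
        @(posedge clk); #1;
        rd = 4'd2; wdata = 8'd7;           // r2 <= 7
        @(posedge clk); #1;
        we = 0; rs = 4'd1; rt = 4'd2;      // read r1 and r2
        #1 $display("r1=%0d r2=%0d", rs_data, rt_data); // prints "r1=5 r2=7"
        $finish;
    end
endmodule
```

In a real design each thread gets its own copy of a structure like this, which is why "local variables" on a GPU are cheap to access but limited in number.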

Since this is hardware description code (Verilog), you don't "run" it like Python. You simulate it or synthesize it for an FPGA.

Clone the repo

git clone https://github.com/adam-maj/tiny-gpu.git

Install Tools
You'll need Icarus Verilog (for simulation) and GTKWave (to see the waveform diagrams).

Run a Simulation
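The exact commands depend on how the repository is set up, so check its README first. For a generic Icarus Verilog project with the testbench in testbench.v and the design in gpu.v (placeholder file names), you would run: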

iverilog -o gpu_sim testbench.v gpu.v
vvp gpu_sim

In Verilog, we aren't writing lines of code that execute one after another; we are describing hardware connections. Here's a simplified conceptual example of what a GPU core might look like in a project like this:

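// A simplified "core" module (conceptual, not taken from the tiny-gpu source)
module gpu_core (
    input             clk,
    input      [31:0] instruction,
    input      [31:0] data_in,
    output reg [31:0] data_out
);
    // Internal registers (like a tiny scratchpad for this thread).
    // 16 entries, so every 4-bit register field below is a valid index.
    reg [31:0] regs [0:15];

    always @(posedge clk) begin
        // Decode and execute (simplified): the opcode is the top 4 bits
        case (instruction[31:28])
            4'b0001: regs[instruction[11:8]] <= regs[instruction[7:4]] + regs[instruction[3:0]]; // ADD
            4'b0010: data_out <= regs[instruction[7:4]];                                         // OUTPUT
            4'b0011: regs[instruction[11:8]] <= data_in;                                         // LOAD from data_in
            default: ;                                                                           // unknown opcode: do nothing
        endcase
    end
endmodule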

always @(posedge clk)
This describes logic that updates on every rising edge of the clock (each "tick"). The non-blocking assignments (<=) inside the block all take effect together at that edge.

Parallelism
In the main GPU module, you would instantiate multiple copies of this gpu_core, all receiving the same instruction at the same time, each working on its own data. That's SIMD in action!
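The instantiation described above can be sketched with a generate loop. Everything here is an assumption made for illustration: tiny_lane is a stripped-down stand-in with the same ports as gpu_core (a single accumulator instead of a register bank), and simd_array, LANES, and the packed data buses are hypothetical names. The point to notice is that instruction is one shared bus, while each lane gets its own 32-bit slice of the data.

```verilog
// Stand-in for gpu_core with the same ports (hypothetical, accumulator only).
module tiny_lane (
    input             clk,
    input      [31:0] instruction,
    input      [31:0] data_in,
    output reg [31:0] data_out
);
    reg [31:0] acc;
    always @(posedge clk) begin
        case (instruction[31:28])
            4'b0011: acc      <= data_in;       // LOAD
            4'b0001: acc      <= acc + data_in; // ADD
            4'b0010: data_out <= acc;           // OUTPUT
            default: ;                          // unknown opcode: do nothing
        endcase
    end
endmodule

// One instruction bus fans out to every lane; only the data differs.
module simd_array #(parameter LANES = 4) (
    input                 clk,
    input  [31:0]         instruction,
    input  [LANES*32-1:0] data_in_flat,
    output [LANES*32-1:0] data_out_flat
);
    genvar g;
    generate
        for (g = 0; g < LANES; g = g + 1) begin : lane
            tiny_lane core (
                .clk(clk),
                .instruction(instruction),              // same for all lanes
                .data_in(data_in_flat[g*32 +: 32]),     // per-lane slice
                .data_out(data_out_flat[g*32 +: 32])
            );
        end
    endgenerate
endmodule

// Small testbench: three instructions, four results.
module simd_array_tb;
    reg          clk = 0;
    reg  [31:0]  instruction;
    reg  [127:0] data_in;
    wire [127:0] data_out;

    simd_array #(.LANES(4)) dut (
        .clk(clk), .instruction(instruction),
        .data_in_flat(data_in), .data_out_flat(data_out)
    );

    always #5 clk = ~clk;

    initial begin
        instruction = 32'h3000_0000;                 // LOAD
        data_in     = {32'd4, 32'd3, 32'd2, 32'd1};  // lane 3 .. lane 0
        @(posedge clk); #1;
        instruction = 32'h1000_0000;                 // ADD
        data_in     = {32'd40, 32'd30, 32'd20, 32'd10};
        @(posedge clk); #1;
        instruction = 32'h2000_0000;                 // OUTPUT
        @(posedge clk); #1;
        $display("%0d %0d %0d %0d",
                 data_out[31:0], data_out[63:32], data_out[95:64], data_out[127:96]);
        // prints "11 22 33 44"
        $finish;
    end
endmodule
```

Three instructions produce four sums. Changing LANES to 8 would produce eight sums in the same three cycles, at the cost of more hardware.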

If the testbench calls $dumpfile and $dumpvars, the simulation writes a .vcd (value change dump) file. Open it in GTKWave and you see the "heartbeat" of the GPU.

Instead of a console log, you see lines jumping from 0 to 1. This shows exactly when a register gets its value and how many clock cycles a calculation takes.
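Here is a minimal sketch of a testbench that produces such a waveform file and counts clock cycles. The design under test, slow_mul, is a hypothetical module invented for this example (it multiplies by repeated addition so that the operation visibly takes several cycles), and wave.vcd is an arbitrary file name. $dumpfile and $dumpvars are standard Verilog system tasks.

```verilog
// Hypothetical DUT: multiply by repeated addition, so it takes several cycles.
module slow_mul (
    input             clk,
    input             start,
    input      [7:0]  a,
    input      [7:0]  b,
    output reg [15:0] product,
    output reg        done
);
    reg [7:0] remaining;
    always @(posedge clk) begin
        if (start) begin
            product   <= 16'd0;
            remaining <= b;
            done      <= 1'b0;
        end else if (remaining != 0) begin
            product   <= product + a;       // one addition per clock cycle
            remaining <= remaining - 8'd1;
        end else begin
            done <= 1'b1;
        end
    end
endmodule

module slow_mul_tb;
    reg         clk = 0;
    reg         start;
    wire [15:0] product;
    wire        done;
    integer     cycles;

    slow_mul dut (
        .clk(clk), .start(start), .a(8'd6), .b(8'd7),
        .product(product), .done(done)
    );

    always #5 clk = ~clk;

    initial begin
        $dumpfile("wave.vcd");      // the file GTKWave will open
        $dumpvars(0, slow_mul_tb);  // record every signal under this testbench
        cycles = 0;
        start  = 1;
        @(posedge clk); #1 start = 0;
        while (!done) begin         // count clock edges until the DUT reports done
            @(posedge clk); #1 cycles = cycles + 1;
        end
        $display("product=%0d after %0d cycles", product, cycles);
        $finish;
    end
endmodule
```

After compiling with iverilog and running with vvp, opening wave.vcd in GTKWave shows product climbing in steps of 6 until done goes high, and the cycle count printed to the console should match what you count on the clk trace.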

Exploring tiny-gpu is like learning how to build a combustion engine when you've only ever driven a car. It makes you a better "driver" (programmer) because you'll have a much clearer sense of why certain coding patterns are faster than others.