The Whole Algorithm
This page explains a complete GPT implementation in ~150 lines of pure Python. No libraries beyond math and random. It trains on a list of names and learns to generate new, plausible-sounding ones.
The code has five parts, each building on the last: a tokenizer, a tiny scalar autograd engine, the model parameters, the transformer forward pass, and the training and inference loops. First, the hyperparameters that define the model:
| Hyperparameter | Value | Meaning |
|---|---|---|
| n_layer | 1 | One transformer block |
| n_embd | 16 | Embedding dimension |
| n_head | 4 | Attention heads |
| block_size | 16 | Max context length |
| vocab_size | 27 | 26 letters + BOS token |
| num_steps | 1000 | Training iterations |
Total parameters: 7,633. That's it — a GPT you could write on paper.
The Tokenizer
Neural networks operate on numbers, not characters. The tokenizer converts each character to an integer ID using a simple lookup table. Every unique character in the dataset gets an ID from 0 to 25 (for a–z), and a special BOS (Beginning of Sequence) token gets ID 26.
# Build vocabulary from all unique characters
uchars = sorted(set(''.join(docs)))   # ['a', 'b', ..., 'z']
BOS = len(uchars)                     # 26
vocab_size = len(uchars) + 1          # 27
Each training example is a name surrounded by BOS tokens. The model learns to predict each next character, including when to stop (by predicting BOS at the end).
The opening BOS gives the model a starting signal: "begin generating a name." The closing BOS signals "this name is done." During inference, the model generates characters until it outputs BOS, then stops.
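As a concrete illustration (assuming the uchars and BOS definitions above), a four-letter name becomes a six-token sequence:

# Tokenizing the name "emma" (hypothetical example)
name = "emma"
tokens = [BOS] + [uchars.index(ch) for ch in name] + [BOS]
# 'e' -> 4, 'm' -> 12, 'a' -> 0
# tokens == [26, 4, 12, 12, 0, 26]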
Autograd: The Value Class
The Value class is the engine that makes learning possible. It wraps a single floating-point number and records how it was computed, building a computation graph. When we call .backward(), it walks this graph in reverse to compute gradients — the derivative of the final loss with respect to every intermediate value.
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data                   # the scalar value
        self.grad = 0                      # d(loss)/d(this)
        self._children = children          # inputs to the op
        self._local_grads = local_grads    # d(this)/d(each child)
Each arithmetic operation creates a new Value whose children are the inputs. The local_grads tuple stores the partial derivative of the output with respect to each child — computed by the rules of calculus at the time of the forward pass.
Forward: building the graph
When you write c = a + b, Python calls a.__add__(b), which returns a new Value with data = a.data + b.data, children = (a, b), and local_grads = (1, 1) — because d(a+b)/da = 1 and d(a+b)/db = 1.
For multiplication c = a * b: local_grads = (b.data, a.data) — because d(ab)/da = b and d(ab)/db = a.
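As a sketch of how these two operators could be defined on the Value class above (the actual implementation may differ in details such as how raw numbers are wrapped):

# Methods on the Value class (illustrative sketch)
def __add__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    # d(a+b)/da = 1 and d(a+b)/db = 1
    return Value(self.data + other.data, (self, other), (1, 1))

def __mul__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    # d(a*b)/da = b and d(a*b)/db = a
    return Value(self.data * other.data, (self, other), (other.data, self.data))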
Backward: the chain rule
The backward pass applies the chain rule recursively. For each node, it multiplies the incoming gradient by each local gradient and accumulates into the children:
child.grad += local_grad * v.grad
This single line is the entire backpropagation algorithm.
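A minimal sketch of what backward() can look like, assuming the fields shown earlier; a topological sort guarantees each node's gradient is complete before it is pushed to its children:

def backward(self):
    # Order nodes so every node comes after all nodes that depend on it.
    topo, visited = [], set()
    def build(v):
        if v not in visited:
            visited.add(v)
            for child in v._children:
                build(child)
            topo.append(v)
    build(self)

    self.grad = 1          # d(loss)/d(loss) = 1
    for v in reversed(topo):
        for child, local_grad in zip(v._children, v._local_grads):
            child.grad += local_grad * v.grad   # the chain rule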
Every operation in the entire GPT — matrix multiplies, softmax, RMSNorm, attention — decomposes into additions, multiplications, exp, log, pow, and relu. The Value class handles all of them, automatically computing gradients through thousands of operations.
The derivative rules
| Operation | Forward | Local gradients |
|---|---|---|
| a + b | a.data + b.data | (1, 1) |
| a * b | a.data * b.data | (b.data, a.data) |
| a ** n | a.data ** n | (n * a.data ** (n-1),) |
| a.exp() | e^a.data | (e^a.data,) |
| a.log() | ln(a.data) | (1/a.data,) |
| a.relu() | max(0, a.data) | (1 if a > 0 else 0,) |
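One way to convince yourself these rules compose correctly is a finite-difference check. This standalone sketch assumes the Value class above and checks d/da of exp(a*a + a) at a = 0.5:

import math

a = Value(0.5)
f = (a * a + a).exp()
f.backward()

eps = 1e-6
g = lambda x: math.exp(x * x + x)
numeric = (g(0.5 + eps) - g(0.5 - eps)) / (2 * eps)
print(a.grad, numeric)   # the two values should agree to several decimal places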
Parameters: The Model's Knowledge
All of the model's knowledge lives in its parameters — a collection of randomly initialized Value objects arranged as matrices. Training adjusts these numbers to minimize the loss.
state_dict = {
'wte': matrix(vocab_size, n_embd), # token embeddings (27 x 16)
'wpe': matrix(block_size, n_embd), # position embeddings (16 x 16)
    'ln_f': [Value(1.0) for _ in range(n_embd)],   # final layer norm gain (16 independent values)
'lm_head': matrix(vocab_size, n_embd), # output projection (27 x 16)
# Per layer:
'layer0.attn_wq': matrix(n_embd, n_embd), # (16 x 16)
'layer0.attn_wk': matrix(n_embd, n_embd),
'layer0.attn_wv': matrix(n_embd, n_embd),
'layer0.attn_wo': matrix(n_embd, n_embd),
'layer0.mlp_fc1': matrix(4*n_embd, n_embd), # (64 x 16)
'layer0.mlp_fc2': matrix(n_embd, 4*n_embd), # (16 x 64)
}
Each cell in each matrix is a Value object — a trainable number with a gradient. Total: 7,633 parameters. For comparison, GPT-3 has 175 billion.
The token embedding matrix wte has shape (27, 16). When we feed token ID 5 ('f'), we grab row 5 — a vector of 16 numbers that represent the character 'f'. These representations are learned during training. Characters that behave similarly end up with similar vectors.
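In code, the lookup is just an index into the matrix (assuming the state_dict above):

token_id = uchars.index('f')            # 5
tok_emb = state_dict['wte'][token_id]   # row 5: a list of 16 Value objects for 'f'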
Linear Layers & Softmax
The two most fundamental operations in a transformer are linear transformations (matrix-vector multiply) and softmax (converting scores to probabilities).
Linear: matrix-vector multiplication
def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(wo, x)) for wo in w]
Each row of the weight matrix w is dot-producted with the input x to produce one output value.
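For example, projecting an activation vector onto the output vocabulary is one call; the number of rows in the weight matrix sets the output size (a sketch assuming the helpers and state_dict above):

x = [Value(0.1 * i) for i in range(n_embd)]   # a 16-dimensional input (made-up values)
logits = linear(x, state_dict['lm_head'])     # 27 outputs, one per row of lm_head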
Softmax: logits → probabilities
def softmax(logits):
    max_val = max(val.data for val in logits)
    exps = [(val - max_val).exp() for val in logits]   # numerical stability
    total = sum(exps)
    return [e / total for e in exps]
Softmax exponentiates each value and normalizes so they sum to 1. Subtracting max_val first prevents overflow, and the math is identical because e^(a-c) / Σ e^(b-c) = e^a / Σ e^b.
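A quick numeric check of the stability trick, using plain floats rather than Value objects:

import math

logits = [10.0, 11.0, 12.0]
shifted = [l - max(logits) for l in logits]     # [-2.0, -1.0, 0.0]
exps = [math.exp(s) for s in shifted]
probs = [e / sum(exps) for e in exps]
# probs ~ [0.090, 0.245, 0.665], identical to softmax on the raw logits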
RMSNorm: keeping activations stable
def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + 1e-5) ** -0.5
    return [xi * scale for xi in x]
RMSNorm divides each element by the root-mean-square of the vector, ensuring activations don't explode or vanish as they flow through layers. Simpler than LayerNorm (no mean subtraction, no learned parameters) but equally effective.
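A small check of that claim, assuming the Value class supports the arithmetic used in rmsnorm: whatever the input scale, the output's root-mean-square comes out at (almost exactly) 1.

x = [Value(v) for v in (3.0, -4.0, 12.0, 0.5)]
y = rmsnorm(x)
rms = (sum(yi.data * yi.data for yi in y) / len(y)) ** 0.5
print(rms)   # ~1.0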
Self-Attention
Attention is the mechanism that lets tokens communicate with each other. Each token creates three vectors:
- Query (Q) — "What am I looking for?"
- Key (K) — "What do I contain?"
- Value (V) — "What information do I provide?"
The attention score between two tokens is the dot product of one token's Query with the other's Key. High score means "these tokens are relevant to each other." The scores become weights (via softmax) that determine how much of each token's Value vector to mix in.
# For each attention head:
q_h = q[hs:hs+head_dim]                      # this token's query
k_h = [ki[hs:hs+head_dim] for ki in keys]    # all tokens' keys
v_h = [vi[hs:hs+head_dim] for vi in values]  # all tokens' values

# Score = Q . K / sqrt(d)
attn_logits = [sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim**0.5
               for t in range(len(k_h))]
attn_weights = softmax(attn_logits)

# Output = weighted sum of Values
head_out = [sum(attn_weights[t] * v_h[t][j] for t in range(len(v_h)))
            for j in range(head_dim)]
Multi-head attention
With 4 heads and n_embd=16, each head operates on a 4-dimensional slice of the embedding. Different heads learn different types of relationships — one might track vowel patterns, another consonant clusters, another positional proximity.
Causal masking is implicit here. Unlike batched implementations that need an explicit mask matrix, this code processes tokens one at a time, appending each key/value to a growing cache. Token at position 3 can only attend to positions 0–3 because those are the only keys that exist when it's processed.
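A sketch of that flow (a hypothetical driver loop; the gpt() function is assumed to append the current token's key and value to the caches before attending over them):

tokens = [BOS] + [uchars.index(ch) for ch in "emma"] + [BOS]
keys, values = [], []                    # empty key/value caches
for pos_id, token_id in enumerate(tokens):
    # When position t is processed, the caches hold keys/values only for
    # positions 0..t, so attention is causal without an explicit mask.
    logits = gpt(token_id, pos_id, keys, values)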
The Full Transformer
Each transformer layer is: Attention (tokens talk to each other) then MLP (each token thinks individually), both with residual connections and RMSNorm.
def gpt(token_id, pos_id, keys, values):
    tok_emb = state_dict['wte'][token_id]          # look up token embedding
    pos_emb = state_dict['wpe'][pos_id]            # look up position embedding
    x = [t + p for t, p in zip(tok_emb, pos_emb)]

    for li in range(n_layer):
        x_residual = x
        x = rmsnorm(x)                             # pre-norm before attention
        # ... Q, K, V projection, attention, output projection ...
        x = [a + b for a, b in zip(x, x_residual)]

        x_residual = x
        x = rmsnorm(x)
        x = linear(x, mlp_fc1)                     # expand 16 → 64
        x = [xi.relu() for xi in x]
        x = linear(x, mlp_fc2)                     # compress 64 → 16
        x = [a + b for a, b in zip(x, x_residual)]

    x = rmsnorm(x)                                 # final layer norm (ln_f)
    logits = linear(x, state_dict['lm_head'])      # → 27 scores
    return logits
The MLP expands the 16-dimensional vector to 64 dimensions, applies ReLU, then compresses back to 16. This "expand-activate-compress" pattern gives the model capacity to learn non-linear transformations that attention alone cannot express.
Training Loop
Training repeats three steps: forward pass (compute predictions and loss), backward pass (compute gradients), and parameter update (nudge parameters to reduce loss).
for step in range(1000):
    doc = docs[step % len(docs)]
    tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
    n = len(tokens) - 1            # number of next-token predictions
    keys, values = [], []          # fresh key/value cache for this document

    # Forward: predict each next token
    losses = []
    for pos_id in range(n):
        logits = gpt(tokens[pos_id], pos_id, keys, values)
        probs = softmax(logits)
        target_id = tokens[pos_id + 1]       # the correct next token
        loss_t = -probs[target_id].log()     # cross-entropy loss
        losses.append(loss_t)

    loss = (1 / n) * sum(losses)
    loss.backward()                          # compute all gradients
    # ... Adam update (see next section) ...
For each position, we compute -log(prob of correct token). If the model assigns probability 1.0 to the right answer, loss is 0. If it assigns 0.01, loss is 4.6. The log penalizes confident wrong answers harshly.
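The numbers above come straight from the negative log:

import math

for p in (1.0, 0.5, 0.1, 0.01):
    print(p, -math.log(p))
# 1.0 -> 0.0, 0.5 -> 0.69, 0.1 -> 2.3, 0.01 -> 4.6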
Unlike batched training (which processes many examples at once), this code trains on a single name per step. No batching, no data loaders, no shuffled epochs. Just: pick a name, compute loss, update weights, repeat.
The Adam Optimizer
Adam maintains two running averages per parameter:
- m (first moment) — exponential moving average of the gradient → "which direction?"
- v (second moment) — exponential moving average of the squared gradient → "how big?"
# Adam update for each parameter
m[i] = beta1 * m[i] + (1 - beta1) * p.grad
v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
m_hat = m[i] / (1 - beta1 ** (step + 1))    # bias correction
v_hat = v[i] / (1 - beta2 ** (step + 1))
p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps)
Compared to vanilla SGD, which uses one global step size for everything, Adam adapts its step size per parameter.
The learning rate also decays linearly: lr_t = 0.01 * (1 - step/1000). Large adjustments early, fine-tuning later.
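The schedule in numbers (a sketch of the decay formula above):

base_lr, num_steps = 0.01, 1000
for step in (0, 250, 500, 999):
    lr_t = base_lr * (1 - step / num_steps)
    print(step, lr_t)   # 0.01, 0.0075, 0.005, 1e-05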
Inference: Generating Names
After training, we generate names by repeatedly feeding the model its own output: start with BOS, get probabilities, sample a character, feed it back, repeat until BOS appears again.
token_id = BOS
sample = []
keys, values = [], []          # fresh key/value cache for this sample
for pos_id in range(block_size):
    logits = gpt(token_id, pos_id, keys, values)
    probs = softmax([l / temperature for l in logits])
    token_id = random.choices(range(vocab_size), weights=[p.data for p in probs])[0]
    if token_id == BOS:
        break
    sample.append(uchars[token_id])
Temperature
Dividing logits by temperature before softmax controls randomness. Low temperature makes the distribution sharp (more deterministic); high temperature makes it flat (more creative).
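A plain-float illustration of the effect (the exact probabilities depend on the logits, which are made up here):

import math

def softmax_floats(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    return [e / sum(exps) for e in exps]

logits = [2.0, 1.0, 0.0]
print(softmax_floats([l / 0.5 for l in logits]))   # T=0.5, sharp:  ~[0.87, 0.12, 0.02]
print(softmax_floats([l / 1.0 for l in logits]))   # T=1.0, normal: ~[0.67, 0.24, 0.09]
print(softmax_floats([l / 2.0 for l in logits]))   # T=2.0, flat:   ~[0.51, 0.31, 0.19]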
The generation loop
The model generates a name character by character; each step feeds the previous output back as input.
This is how every LLM works. GPT-4, Claude, Gemini — they all generate text one token at a time, feeding their own output back as input. The only differences are scale (billions of parameters), tokenization (subwords instead of characters), training data (the internet instead of 32K names), and engineering (GPUs instead of pure Python).
Understanding one layer means understanding them all.