The Whole Algorithm
This page explains a complete GPT implementation in ~150 lines of pure Python. No libraries beyond math and random. It trains on a list of names and learns to generate new, plausible-sounding ones.
The code has five parts, each building on the last: a tokenizer, a tiny scalar autograd engine, the model parameters, the transformer forward pass, and the training and inference loops. First, the hyperparameters that define the model:
| Hyperparameter | Value | Meaning |
|---|---|---|
| n_layer | 1 | One transformer block |
| n_embd | 16 | Embedding dimension |
| n_head | 4 | Attention heads |
| block_size | 16 | Max context length |
| vocab_size | 27 | 26 letters + BOS token |
| num_steps | 1000 | Training iterations |
Total parameters: 7,633. That's it — a GPT you could write on paper.
The Tokenizer
Neural networks operate on numbers, not characters. The tokenizer converts each character to an integer ID using a simple lookup table. Every unique character in the dataset gets an ID from 0 to 25 (for a–z), and a special BOS (Beginning of Sequence) token gets ID 26.
# Build vocabulary from all unique characters
uchars = sorted(set(''.join(docs)))   # ['a', 'b', ..., 'z']
BOS = len(uchars)                     # 26
vocab_size = len(uchars) + 1          # 27
Each training example is a name surrounded by BOS tokens. The model learns to predict each next character, including when to stop (by predicting BOS at the end).
The opening BOS gives the model a starting signal: "begin generating a name." The closing BOS signals "this name is done." During inference, the model generates characters until it outputs BOS, then stops.
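As a concrete illustration (assuming the uchars and BOS definitions above), a four-letter name becomes a six-token sequence:

# Tokenizing the name "emma" (hypothetical example)
name = "emma"
tokens = [BOS] + [uchars.index(ch) for ch in name] + [BOS]
# 'e' -> 4, 'm' -> 12, 'a' -> 0
# tokens == [26, 4, 12, 12, 0, 26]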
Autograd: The Value Class
The Value class is the engine that makes learning possible. It wraps a single floating-point number and records how it was computed, building a computation graph. When we call .backward(), it walks this graph in reverse to compute gradients — the derivative of the final loss with respect to every intermediate value.
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data                   # the scalar value
        self.grad = 0                      # d(loss)/d(this)
        self._children = children          # inputs to the op
        self._local_grads = local_grads    # d(this)/d(each child)
Each arithmetic operation creates a new Value whose children are the inputs. The local_grads tuple stores the partial derivative of the output with respect to each child — computed by the rules of calculus at the time of the forward pass.
Forward: building the graph
When you write c = a + b, Python calls a.__add__(b), which returns a new Value with data = a.data + b.data, children = (a, b), and local_grads = (1, 1) — because d(a+b)/da = 1 and d(a+b)/db = 1.
For multiplication c = a * b: local_grads = (b.data, a.data) — because d(ab)/da = b and d(ab)/db = a.
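As a sketch of how these two operators could be defined on the Value class above (the actual implementation may differ in details such as how raw numbers are wrapped):

# Methods on the Value class (illustrative sketch)
def __add__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    # d(a+b)/da = 1 and d(a+b)/db = 1
    return Value(self.data + other.data, (self, other), (1, 1))

def __mul__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    # d(a*b)/da = b and d(a*b)/db = a
    return Value(self.data * other.data, (self, other), (other.data, self.data))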
Backward: the chain rule
The backward pass applies the chain rule recursively. For each node, it multiplies the incoming gradient by each local gradient and accumulates into the children:
child.grad += local_grad * v.grad
This single line is the entire backpropagation algorithm.
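A minimal sketch of what backward() can look like, assuming the fields shown earlier; a topological sort guarantees each node's gradient is complete before it is pushed to its children:

def backward(self):
    # Order nodes so every node comes after all nodes that depend on it.
    topo, visited = [], set()
    def build(v):
        if v not in visited:
            visited.add(v)
            for child in v._children:
                build(child)
            topo.append(v)
    build(self)

    self.grad = 1          # d(loss)/d(loss) = 1
    for v in reversed(topo):
        for child, local_grad in zip(v._children, v._local_grads):
            child.grad += local_grad * v.grad   # the chain rule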
Every operation in the entire GPT — matrix multiplies, softmax, RMSNorm, attention — decomposes into additions, multiplications, exp, log, pow, and relu. The Value class handles all of them, automatically computing gradients through thousands of operations.
The derivative rules
| Operation | Forward | Local gradients |
|---|---|---|
| a + b | a.data + b.data | (1, 1) |
| a * b | a.data * b.data | (b.data, a.data) |
| a ** n | a.data ** n | (n * a.data ** (n-1),) |
| a.exp() | e^a.data | (e^a.data,) |
| a.log() | ln(a.data) | (1/a.data,) |
| a.relu() | max(0, a.data) | (1 if a > 0 else 0,) |
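One way to convince yourself these rules compose correctly is a finite-difference check. This standalone sketch assumes the Value class above and checks d/da of exp(a*a + a) at a = 0.5:

import math

a = Value(0.5)
f = (a * a + a).exp()
f.backward()

eps = 1e-6
g = lambda x: math.exp(x * x + x)
numeric = (g(0.5 + eps) - g(0.5 - eps)) / (2 * eps)
print(a.grad, numeric)   # the two values should agree to several decimal places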
Parameters: The Model's Knowledge
All of the model's knowledge lives in its parameters — a collection of randomly initialized Value objects arranged as matrices. Training adjusts these numbers to minimize the loss.
state_dict = {
'wte': matrix(vocab_size, n_embd), # token embeddings (27 x 16)
'wpe': matrix(block_size, n_embd), # position embeddings (16 x 16)
    'ln_f': [Value(1.0) for _ in range(n_embd)],   # final layer norm gain (16 independent values)
'lm_head': matrix(vocab_size, n_embd), # output projection (27 x 16)
# Per layer:
'layer0.attn_wq': matrix(n_embd, n_embd), # (16 x 16)
'layer0.attn_wk': matrix(n_embd, n_embd),
'layer0.attn_wv': matrix(n_embd, n_embd),
'layer0.attn_wo': matrix(n_embd, n_embd),
'layer0.mlp_fc1': matrix(4*n_embd, n_embd), # (64 x 16)
'layer0.mlp_fc2': matrix(n_embd, 4*n_embd), # (16 x 64)
}
Each cell in each matrix is a Value object — a trainable number with a gradient. Total: 7,633 parameters. For comparison, GPT-3 has 175 billion.
The token embedding matrix wte has shape (27, 16). When we feed token ID 5 ('f'), we grab row 5 — a vector of 16 numbers that represent the character 'f'. These representations are learned during training. Characters that behave similarly end up with similar vectors.
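In code, the lookup is just an index into the matrix (assuming the state_dict above):

token_id = uchars.index('f')            # 5
tok_emb = state_dict['wte'][token_id]   # row 5: a list of 16 Value objects for 'f'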
Linear Layers & Softmax
The two most fundamental operations in a transformer are linear transformations (matrix-vector multiply) and softmax (converting scores to probabilities).
Linear: matrix-vector multiplication
def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(wo, x)) for wo in w]
Each row of the weight matrix w is dot-producted with the input x to produce one output value.
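For example, projecting an activation vector onto the output vocabulary is one call; the number of rows in the weight matrix sets the output size (a sketch assuming the helpers and state_dict above):

x = [Value(0.1 * i) for i in range(n_embd)]   # a 16-dimensional input (made-up values)
logits = linear(x, state_dict['lm_head'])     # 27 outputs, one per row of lm_head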
Softmax: logits → probabilities
def softmax(logits):
    max_val = max(val.data for val in logits)
    exps = [(val - max_val).exp() for val in logits]   # numerical stability
    total = sum(exps)
    return [e / total for e in exps]
Softmax exponentiates each value and normalizes so they sum to 1. Subtracting max_val first prevents overflow, and the math is identical because e^(a-c) / Σ e^(b-c) = e^a / Σ e^b.
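A quick numeric check of the stability trick, using plain floats rather than Value objects:

import math

logits = [10.0, 11.0, 12.0]
shifted = [l - max(logits) for l in logits]     # [-2.0, -1.0, 0.0]
exps = [math.exp(s) for s in shifted]
probs = [e / sum(exps) for e in exps]
# probs ~ [0.090, 0.245, 0.665], identical to softmax on the raw logits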
RMSNorm: keeping activations stable
def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + 1e-5) ** -0.5
    return [xi * scale for xi in x]
RMSNorm divides each element by the root-mean-square of the vector, ensuring activations don't explode or vanish as they flow through layers. Simpler than LayerNorm (no mean subtraction, no learned parameters) but equally effective.
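A small check of that claim, assuming the Value class supports the arithmetic used in rmsnorm: whatever the input scale, the output's root-mean-square comes out at (almost exactly) 1.

x = [Value(v) for v in (3.0, -4.0, 12.0, 0.5)]
y = rmsnorm(x)
rms = (sum(yi.data * yi.data for yi in y) / len(y)) ** 0.5
print(rms)   # ~1.0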
Self-Attention
Attention is the mechanism that lets tokens communicate with each other. Each token creates three vectors:
- Query (Q) — "What am I looking for?"
- Key (K) — "What do I contain?"
- Value (V) — "What information do I provide?"
The attention score between two tokens is the dot product of one token's Query with the other's Key. High score means "these tokens are relevant to each other." The scores become weights (via softmax) that determine how much of each token's Value vector to mix in.
# For each attention head:
q_h = q[hs:hs+head_dim]                      # this token's query
k_h = [ki[hs:hs+head_dim] for ki in keys]    # all tokens' keys
v_h = [vi[hs:hs+head_dim] for vi in values]  # all tokens' values

# Score = Q . K / sqrt(d)
attn_logits = [sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim**0.5
               for t in range(len(k_h))]
attn_weights = softmax(attn_logits)

# Output = weighted sum of Values
head_out = [sum(attn_weights[t] * v_h[t][j] for t in range(len(v_h)))
            for j in range(head_dim)]
Multi-head attention
With 4 heads and n_embd=16, each head operates on a 4-dimensional slice of the embedding. Different heads learn different types of relationships — one might track vowel patterns, another consonant clusters, another positional proximity.
Causal masking is implicit here. Unlike batched implementations that need an explicit mask matrix, this code processes tokens one at a time, appending each key/value to a growing cache. Token at position 3 can only attend to positions 0–3 because those are the only keys that exist when it's processed.
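A sketch of that flow (a hypothetical driver loop; the gpt() function is assumed to append the current token's key and value to the caches before attending over them):

tokens = [BOS] + [uchars.index(ch) for ch in "emma"] + [BOS]
keys, values = [], []                    # empty key/value caches
for pos_id, token_id in enumerate(tokens):
    # When position t is processed, the caches hold keys/values only for
    # positions 0..t, so attention is causal without an explicit mask.
    logits = gpt(token_id, pos_id, keys, values)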
The Full Transformer
Each transformer layer is: Attention (tokens talk to each other) then MLP (each token thinks individually), both with residual connections and RMSNorm.
def gpt(token_id, pos_id, keys, values):
    tok_emb = state_dict['wte'][token_id]          # look up token embedding
    pos_emb = state_dict['wpe'][pos_id]            # look up position embedding
    x = [t + p for t, p in zip(tok_emb, pos_emb)]

    for li in range(n_layer):
        x_residual = x
        x = rmsnorm(x)                             # pre-norm before attention
        # ... Q, K, V projection, attention, output projection ...
        x = [a + b for a, b in zip(x, x_residual)]

        x_residual = x
        x = rmsnorm(x)
        x = linear(x, mlp_fc1)                     # expand 16 → 64
        x = [xi.relu() for xi in x]
        x = linear(x, mlp_fc2)                     # compress 64 → 16
        x = [a + b for a, b in zip(x, x_residual)]

    x = rmsnorm(x)                                 # final layer norm (ln_f)
    logits = linear(x, state_dict['lm_head'])      # → 27 scores
    return logits
The MLP expands the 16-dimensional vector to 64 dimensions, applies ReLU, then compresses back to 16. This "expand-activate-compress" pattern gives the model capacity to learn non-linear transformations that attention alone cannot express.
Training Loop
Training repeats three steps: forward pass (compute predictions and loss), backward pass (compute gradients), and parameter update (nudge parameters to reduce loss).
for step in range(1000):
    doc = docs[step % len(docs)]
    tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
    n = len(tokens) - 1            # number of next-token predictions
    keys, values = [], []          # fresh key/value cache for this document

    # Forward: predict each next token
    losses = []
    for pos_id in range(n):
        logits = gpt(tokens[pos_id], pos_id, keys, values)
        probs = softmax(logits)
        target_id = tokens[pos_id + 1]       # the correct next token
        loss_t = -probs[target_id].log()     # cross-entropy loss
        losses.append(loss_t)

    loss = (1 / n) * sum(losses)
    loss.backward()                          # compute all gradients
    # ... Adam update (see next section) ...
For each position, we compute -log(prob of correct token). If the model assigns probability 1.0 to the right answer, loss is 0. If it assigns 0.01, loss is 4.6. The log penalizes confident wrong answers harshly.
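The numbers above come straight from the negative log:

import math

for p in (1.0, 0.5, 0.1, 0.01):
    print(p, -math.log(p))
# 1.0 -> 0.0, 0.5 -> 0.69, 0.1 -> 2.3, 0.01 -> 4.6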
Unlike batched training (which processes many examples at once), this code trains on a single name per step. No batching, no data loaders, no shuffled epochs. Just: pick a name, compute loss, update weights, repeat.
The Adam Optimizer
Adam maintains two running averages per parameter:
- m (first moment) — exponential moving average of the gradient → "which direction?"
- v (second moment) — exponential moving average of the squared gradient → "how big?"
# Adam update for each parameter
m[i] = beta1 * m[i] + (1 - beta1) * p.grad
v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
m_hat = m[i] / (1 - beta1 ** (step + 1))    # bias correction
v_hat = v[i] / (1 - beta2 ** (step + 1))
p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps)
Compared to vanilla SGD, which uses one global step size for everything, Adam adapts its step size per parameter.
The learning rate also decays linearly: lr_t = 0.01 * (1 - step/1000). Large adjustments early, fine-tuning later.
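The schedule in numbers (a sketch of the decay formula above):

base_lr, num_steps = 0.01, 1000
for step in (0, 250, 500, 999):
    lr_t = base_lr * (1 - step / num_steps)
    print(step, lr_t)   # 0.01, 0.0075, 0.005, 1e-05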
Inference: Generating Names
After training, we generate names by repeatedly feeding the model its own output: start with BOS, get probabilities, sample a character, feed it back, repeat until BOS appears again.
token_id = BOS
sample = []
keys, values = [], []          # fresh key/value cache for this sample
for pos_id in range(block_size):
    logits = gpt(token_id, pos_id, keys, values)
    probs = softmax([l / temperature for l in logits])
    token_id = random.choices(range(vocab_size), weights=[p.data for p in probs])[0]
    if token_id == BOS:
        break
    sample.append(uchars[token_id])
Temperature
Dividing logits by temperature before softmax controls randomness. Low temperature makes the distribution sharp (more deterministic); high temperature makes it flat (more creative).
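A plain-float illustration of the effect (the exact probabilities depend on the logits, which are made up here):

import math

def softmax_floats(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    return [e / sum(exps) for e in exps]

logits = [2.0, 1.0, 0.0]
print(softmax_floats([l / 0.5 for l in logits]))   # T=0.5, sharp:  ~[0.87, 0.12, 0.02]
print(softmax_floats([l / 1.0 for l in logits]))   # T=1.0, normal: ~[0.67, 0.24, 0.09]
print(softmax_floats([l / 2.0 for l in logits]))   # T=2.0, flat:   ~[0.51, 0.31, 0.19]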
The generation loop
The model generates a name character by character; each step feeds the previous output back as input.
This is how every LLM works. GPT-4, Claude, Gemini — they all generate text one token at a time, feeding their own output back as input. The only differences are scale (billions of parameters), tokenization (subwords instead of characters), training data (the internet instead of 32K names), and engineering (GPUs instead of pure Python).
Understanding one layer means understanding them all.