
GPT in Pure Python

An interactive walkthrough of Karpathy's dependency-free GPT — every operation visible, from raw characters to generated text. No PyTorch, no numpy, just math.

GPT Inference Pipeline
1. Input Text
Characters appear one by one, forming the prompt the model will complete.
2. Tokenization
Each character maps to an integer ID from the vocabulary lookup table.
3. Token + Position Embeddings
Token embeddings and position embeddings combine into a single vector per token.
4. Self-Attention
Each token creates Q, K, V vectors. Attention scores show which tokens are relevant to each other.
5. Feed-Forward Network
Each token passes through a small neural network: expand, activate, compress.
6. Residual + Layer Norm
The block output is added back to the original input, then normalized to keep activations stable.
7. Repeat x N Layers
The attention + FFN block repeats N times. Each layer refines the representation further.
8. Logits → Softmax
The final hidden state is projected to vocabulary-sized logits, then softmax produces probabilities.
9. Sample Next Token
The next token is sampled from the probability distribution and appended. The loop continues autoregressively.
📄 Full source — complete LLM from scratch (pure Python, no libraries)

Karpathy's character-level GPT — ~150 lines. Autograd engine, multi-head attention, layer norm, training loop. Zero dependencies beyond math and random.

"""
Karpathy's character-level GPT — pure Python, no dependencies.
Based on makemore / nanoGPT (github.com/karpathy/makemore)
"""
import math, random

# ── Autograd engine ──────────────────────────────────────────────────
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad  += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad  += other.data * out.grad
            other.grad += self.data  * out.grad
        out._backward = _backward
        return out

    def __pow__(self, other):
        out = Value(self.data**other, (self,), f'**{other}')
        def _backward():
            self.grad += other * (self.data**(other-1)) * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,), 'tanh')
        def _backward():
            self.grad += (1 - t**2) * out.grad
        out._backward = _backward
        return out

    def exp(self):
        out = Value(math.exp(self.data), (self,), 'exp')
        def _backward():
            self.grad += out.data * out.grad
        out._backward = _backward
        return out

    def log(self):
        out = Value(math.log(self.data), (self,), 'log')
        def _backward():
            self.grad += (1.0 / self.data) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev: build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo): v._backward()

    __radd__ = __add__; __rmul__ = __mul__
    __neg__  = lambda self: self * -1
    __sub__  = lambda self, other: self + (-other)
    __truediv__ = lambda self, other: self * other**-1
    __repr__ = lambda self: f'Value(data={self.data:.4f}, grad={self.grad:.4f})'

# ── Model ────────────────────────────────────────────────────────────
class GPT:
    def __init__(self, vocab_size, block_size, n_embd, n_head, n_layer):
        self.block_size = block_size
        self.n_head = n_head
        self.n_embd = n_embd
        head_size = n_embd // n_head

        def randn(): return Value(random.gauss(0, 0.02))
        def matrix(r, c): return [[randn() for _ in range(c)] for _ in range(r)]
        def zeros(n): return [Value(0.0) for _ in range(n)]

        self.wte = matrix(vocab_size, n_embd)   # token embeddings
        self.wpe = matrix(block_size, n_embd)   # position embeddings
        self.blocks = []
        for _ in range(n_layer):
            self.blocks.append({
                'ln1_w': [Value(1.0) for _ in range(n_embd)],
                'ln1_b': zeros(n_embd),
                'attn_q': matrix(n_embd, n_embd),
                'attn_k': matrix(n_embd, n_embd),
                'attn_v': matrix(n_embd, n_embd),
                'attn_proj': matrix(n_embd, n_embd),
                'ln2_w': [Value(1.0) for _ in range(n_embd)],
                'ln2_b': zeros(n_embd),
                'mlp_fc': matrix(n_embd, 4*n_embd),
                'mlp_fc_b': zeros(4*n_embd),
                'mlp_proj': matrix(4*n_embd, n_embd),
                'mlp_proj_b': zeros(n_embd),
            })
        self.ln_f_w  = [Value(1.0) for _ in range(n_embd)]
        self.ln_f_b  = zeros(n_embd)
        self.lm_head = matrix(n_embd, vocab_size)

    def _matvec(self, W, x):
        return [sum(W[i][j]*x[j] for j in range(len(x))) for i in range(len(W))]

    def _layernorm(self, x, w, b):
        n = len(x)
        mean = sum(x) * (1.0/n)
        var  = sum((xi - mean)**2 for xi in x) * (1.0/n)
        std  = (var + Value(1e-5))**0.5
        return [(xi - mean) / std * wi + bi for xi, wi, bi in zip(x, w, b)]

    def _softmax(self, logits):
        m = max(v.data for v in logits)
        exps = [(v - Value(m)).exp() for v in logits]
        s = sum(exps)
        return [e / s for e in exps]

    def _gelu(self, x):
        # tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
        c = Value(math.sqrt(2.0 / math.pi))
        return [xi * Value(0.5) * (Value(1.0) + (c * (xi + Value(0.044715) * xi * xi * xi)).tanh())
                for xi in x]

    def forward(self, tokens):
        T = len(tokens)
        # Token + position embeddings
        x = [[self.wte[tokens[t]][i] + self.wpe[t][i] for i in range(self.n_embd)] for t in range(T)]

        for blk in self.blocks:
            # Multi-head self-attention
            xn = [self._layernorm(x[t], blk['ln1_w'], blk['ln1_b']) for t in range(T)]
            head_size = self.n_embd // self.n_head
            attn_out = [[Value(0.0) for _ in range(self.n_embd)] for _ in range(T)]
            for h in range(self.n_head):
                hs = h * head_size
                q = [self._matvec(blk['attn_q'][hs:hs+head_size], xn[t]) for t in range(T)]
                k = [self._matvec(blk['attn_k'][hs:hs+head_size], xn[t]) for t in range(T)]
                v = [self._matvec(blk['attn_v'][hs:hs+head_size], xn[t]) for t in range(T)]
                scale = Value(head_size**-0.5)
                for t in range(T):
                    raw = [sum(q[t][i]*k[s][i] for i in range(head_size)) * scale for s in range(t+1)]
                    raw += [Value(-1e9) for _ in range(T - t - 1)]  # causal mask
                    probs = self._softmax(raw)
                    for i in range(head_size):
                        attn_out[t][hs+i] = sum(probs[s]*v[s][i] for s in range(T))
            attn_proj = [self._matvec(blk['attn_proj'], attn_out[t]) for t in range(T)]
            x = [[x[t][i] + attn_proj[t][i] for i in range(self.n_embd)] for t in range(T)]

            # MLP
            xn2 = [self._layernorm(x[t], blk['ln2_w'], blk['ln2_b']) for t in range(T)]
            for t in range(T):
                h = self._gelu([sum(blk['mlp_fc'][i][j]*xn2[t][j] for j in range(self.n_embd)) + blk['mlp_fc_b'][i] for i in range(4*self.n_embd)])
                mlp = [sum(blk['mlp_proj'][i][j]*h[j] for j in range(4*self.n_embd)) + blk['mlp_proj_b'][i] for i in range(self.n_embd)]
                x[t] = [x[t][i] + mlp[i] for i in range(self.n_embd)]

        # Final norm + project to logits
        x = [self._layernorm(x[t], self.ln_f_w, self.ln_f_b) for t in range(T)]
        logits = [self._matvec(self.lm_head, x[t]) for t in range(T)]
        return logits

# ── Training loop ────────────────────────────────────────────────────
def train(names_file='names.txt'):
    words = open(names_file).read().splitlines()
    chars = sorted(set(''.join(words)))
    stoi = {s:i+1 for i,s in enumerate(chars)}; stoi['.'] = 0
    itos = {i:s for s,i in stoi.items()}
    vocab_size = len(stoi)

    block_size, n_embd, n_head, n_layer = 16, 16, 4, 1
    model = GPT(vocab_size, block_size, n_embd, n_head, n_layer)

    def get_params(m):
        params = []
        for mat in [m.wte, m.wpe, m.lm_head]:
            for row in mat:
                for v in row: params.append(v)
        for v in m.ln_f_w + m.ln_f_b:   # don't forget the final layer norm
            params.append(v)
        for blk in m.blocks:
            for val in blk.values():
                for item in val:
                    if isinstance(item, list):
                        for v in item: params.append(v)
                    else: params.append(item)
        return params

    lr = 0.01
    for step in range(1000):
        w = random.choice(words)
        tokens = [0] + [stoi[c] for c in w] + [0]
        tokens = tokens[-block_size:] if len(tokens) > block_size else tokens

        logits = model.forward(tokens[:-1])
        targets = tokens[1:]

        # Cross-entropy loss: average of -log(p of correct next token)
        loss = Value(0.0)
        for t, tgt in enumerate(targets):
            probs = model._softmax(logits[t])
            loss = loss - probs[tgt].log()
        loss = loss * Value(1.0 / len(targets))

        params = get_params(model)
        for p in params: p.grad = 0.0
        loss.backward()

        # Vanilla SGD update (the walkthrough below uses full Adam)
        for p in params:
            p.data -= lr * p.grad

        if step % 100 == 0:
            print(f'step {step}: loss={loss.data:.4f}')

The Whole Algorithm

This page explains a complete GPT implementation in ~150 lines of pure Python. No libraries beyond math and random. It trains on a list of names and learns to generate new, plausible-sounding ones.

"This file is the complete algorithm. Everything else is just efficiency." The architecture is identical to GPT-2/3/4. PyTorch, CUDA, and distributed training are optimizations on top of these exact operations.

The code has five parts, each building on the last:

1. Tokenizer — text → integers
2. Autograd — automatic derivatives
3. Model — transformer architecture
4. Training — forward + backward + Adam
5. Inference — generate new names
Hyperparameter   Value   Meaning
n_layer          1       One transformer block
n_embd           16      Embedding dimension
n_head           4       Attention heads
block_size       16      Max context length
vocab_size       27      26 letters + BOS token
num_steps        1000    Training iterations

Total parameters: 7,633. That's it — a GPT you could write on paper.

The Tokenizer

Neural networks operate on numbers, not characters. The tokenizer converts each character to an integer ID using a simple lookup table. Every unique character in the dataset gets an ID from 0 to 25 (for a–z), and a special BOS (Beginning of Sequence) token gets ID 26.

# Build vocabulary from all unique characters
uchars = sorted(set(''.join(docs)))  # ['a', 'b', ..., 'z']
BOS = len(uchars)                     # 26
vocab_size = len(uchars) + 1          # 27

Each training example is a name surrounded by BOS tokens. The model learns to predict each next character, including when to stop (by predicting BOS at the end).

[Interactive demo: Tokenizer]

The opening BOS gives the model a starting signal: "begin generating a name." The closing BOS signals "this name is done." During inference, the model generates characters until it outputs BOS, then stops.
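
To make this concrete, here is a minimal round-trip sketch (it rebuilds the same vocabulary; the name "emma" is just an example):

uchars = [chr(ord('a') + i) for i in range(26)]   # ['a', ..., 'z']
BOS = len(uchars)                                 # 26

def encode(name):
    return [BOS] + [uchars.index(ch) for ch in name] + [BOS]

def decode(ids):
    return ''.join(uchars[i] for i in ids if i != BOS)

print(encode('emma'))            # [26, 4, 12, 12, 0, 26]
print(decode(encode('emma')))    # emma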

Autograd: The Value Class

The Value class is the engine that makes learning possible. It wraps a single floating-point number and records how it was computed, building a computation graph. When we call .backward(), it walks this graph in reverse to compute gradients — the derivative of the final loss with respect to every intermediate value.

class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data               # the scalar value
        self.grad = 0                  # d(loss)/d(this)
        self._children = children       # inputs to the op
        self._local_grads = local_grads # d(this)/d(each child)

Each arithmetic operation creates a new Value whose children are the inputs. The local_grads tuple stores the partial derivative of the output with respect to each child — computed by the rules of calculus at the time of the forward pass.

Forward: building the graph

When you write c = a + b, Python calls a.__add__(b), which returns a new Value with data = a.data + b.data, children = (a, b), and local_grads = (1, 1) — because d(a+b)/da = 1 and d(a+b)/db = 1.

For multiplication c = a * b: local_grads = (b.data, a.data) — because d(ab)/da = b and d(ab)/db = a.
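
A minimal sketch of those two operators in this children/local_grads style (assuming other is already a Value; the full code also wraps raw numbers):

    def __add__(self, other):
        # d(a+b)/da = 1, d(a+b)/db = 1
        return Value(self.data + other.data,
                     children=(self, other), local_grads=(1.0, 1.0))

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data,
                     children=(self, other), local_grads=(other.data, self.data))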

Backward: the chain rule

The backward pass applies the chain rule recursively. For each node, it multiplies the incoming gradient by each local gradient and accumulates into the children:

child.grad += local_grad * v.grad

This single line is the entire backpropagation algorithm.
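
Here is roughly what backward() looks like in this style (a sketch consistent with the __init__ above, not a verbatim excerpt):

    def backward(self):
        # topological sort: a node is processed only after every node
        # that depends on it has already passed its gradient down
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0                     # d(loss)/d(loss) = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad   # the chain rule

For c = a * b + a with a = 2 and b = 3, this yields a.grad = 4.0 (b from the multiply plus 1 from the add) and b.grad = 2.0.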

[Interactive demo: Computation Graph]

Every operation in the entire GPT — matrix multiplies, softmax, RMSNorm, attention — decomposes into additions, multiplications, exp, log, pow, and relu. The Value class handles all of them, automatically computing gradients through thousands of operations.

The derivative rules

Operation    Forward            Local gradients
a + b        a.data + b.data    (1, 1)
a * b        a.data * b.data    (b.data, a.data)
a ** n       a.data ** n        (n * a.data ** (n-1),)
a.exp()      e ** a.data        (e ** a.data,)
a.log()      ln(a.data)         (1 / a.data,)
a.relu()     max(0, a.data)     (1 if a > 0 else 0,)
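
The last two rows translate to one-liner methods. As a sketch in the same children/local_grads style (illustrative, mirroring the table; math is already imported by the source):

    def log(self):
        # d(ln a)/da = 1/a
        return Value(math.log(self.data),
                     children=(self,), local_grads=(1.0 / self.data,))

    def relu(self):
        # gradient flows only where the input is positive
        return Value(max(0.0, self.data),
                     children=(self,), local_grads=(1.0 if self.data > 0 else 0.0,))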

Parameters: The Model's Knowledge

All of the model's knowledge lives in its parameters — a collection of randomly initialized Value objects arranged as matrices. Training adjusts these numbers to minimize the loss.

state_dict = {
    'wte':   matrix(vocab_size, n_embd),  # token embeddings   (27 x 16)
    'wpe':   matrix(block_size, n_embd),  # position embeddings (16 x 16)
    'ln_f':  [Value(1.0) for _ in range(n_embd)],  # final layer norm gain (16)
    'lm_head': matrix(vocab_size, n_embd), # output projection  (27 x 16)
    # Per layer:
    'layer0.attn_wq': matrix(n_embd, n_embd),  # (16 x 16)
    'layer0.attn_wk': matrix(n_embd, n_embd),
    'layer0.attn_wv': matrix(n_embd, n_embd),
    'layer0.attn_wo': matrix(n_embd, n_embd),
    'layer0.mlp_fc1': matrix(4*n_embd, n_embd), # (64 x 16)
    'layer0.mlp_fc2': matrix(n_embd, 4*n_embd), # (16 x 64)
}
[Interactive demo: All 7,633 Parameters]

Each cell in each matrix is a Value object — a trainable number with a gradient. Total: 7,633 parameters. For comparison, GPT-3 has 175 billion.

The token embedding matrix wte has shape (27, 16). When we feed token ID 5 ('f'), we grab row 5 — a vector of 16 numbers that represent the character 'f'. These representations are learned during training. Characters that behave similarly end up with similar vectors.
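
The lookup itself is just row indexing; no matrix multiply is involved. A sketch using the names from the state_dict above:

token_id = uchars.index('f')                      # 5
tok_emb = state_dict['wte'][token_id]             # row 5: 16 Value objects
pos_emb = state_dict['wpe'][0]                    # embedding for position 0
x = [t + p for t, p in zip(tok_emb, pos_emb)]     # the input to layer 0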

Linear Layers & Softmax

The two most fundamental operations in a transformer are linear transformations (matrix-vector multiply) and softmax (converting scores to probabilities).

Linear: matrix-vector multiplication

def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(wo, x)) for wo in w]

Each row of the weight matrix w is dot-producted with input x to produce one output value. Hover over outputs to see which weights contribute.

[Interactive demo: Linear Layer]
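
linear() doesn't care whether the entries are Value objects or plain floats. A tiny made-up example:

x = [1.0, 2.0]
w = [[0.5, -1.0],      # row 0 -> output 0: 1.0*0.5 + 2.0*(-1.0) = -1.5
     [2.0,  0.0]]      # row 1 -> output 1: 1.0*2.0 + 2.0*0.0   =  2.0
print(linear(x, w))    # [-1.5, 2.0]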

Softmax: logits → probabilities

def softmax(logits):
    max_val = max(val.data for val in logits)
    exps = [(val - max_val).exp() for val in logits]  # numerical stability
    total = sum(exps)
    return [e / total for e in exps]

Softmax exponentiates each value and normalizes so they sum to 1. Subtracting max_val first prevents overflow — the math is identical because e^(a-c) / Σe^(b-c) = e^a / Σe^b.

[Interactive demo: Softmax, with example logits A=2.0, B=1.0, C=-1.0, D=0.5]
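
Running the demo's logits through the same math with plain floats (values rounded):

import math

logits = [2.0, 1.0, -1.0, 0.5]                 # A, B, C, D
m = max(logits)
exps = [math.exp(v - m) for v in logits]
total = sum(exps)
print([round(e / total, 3) for e in exps])     # ≈ [0.609, 0.224, 0.03, 0.136]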

RMSNorm: keeping activations stable

def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + 1e-5) ** -0.5
    return [xi * scale for xi in x]

RMSNorm divides each element by the root-mean-square of the vector, ensuring activations don't explode or vanish as they flow through layers. Simpler than LayerNorm (no mean subtraction, no learned parameters) but equally effective.
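
A quick numeric check with plain floats (rmsnorm works the same on Value objects):

x = [3.0, -4.0]
y = rmsnorm(x)                             # ms = (9 + 16) / 2 = 12.5
print(y)                                   # ≈ [0.849, -1.131]
print(sum(yi * yi for yi in y) / len(y))   # ≈ 1.0: unit RMS after normalizing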

Self-Attention

Attention is the mechanism that lets tokens communicate with each other. Each token creates three vectors:

Query: what this token is looking for.
Key: what this token advertises about itself.
Value: what this token hands over if attended to.

The attention score between two tokens is the dot product of one token's Query with the other's Key. High score means "these tokens are relevant to each other." The scores become weights (via softmax) that determine how much of each token's Value vector to mix in.

# For each attention head:
q_h = q[hs:hs+head_dim]                     # this token's query
k_h = [ki[hs:hs+head_dim] for ki in keys]   # all tokens' keys
v_h = [vi[hs:hs+head_dim] for vi in values] # all tokens' values

# Score = Q . K / sqrt(d)
attn_logits = [sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim**0.5
               for t in range(len(k_h))]
attn_weights = softmax(attn_logits)

# Output = weighted sum of Values
head_out = [sum(attn_weights[t] * v_h[t][j] for t in range(len(v_h)))
            for j in range(head_dim)]
[Interactive demo: Attention Weights per head]

Multi-head attention

With 4 heads and n_embd=16, each head operates on a 4-dimensional slice of the embedding. Different heads learn different types of relationships — one might track vowel patterns, another consonant clusters, another positional proximity.

Causal masking is implicit here. Unlike batched implementations that need an explicit mask matrix, this code processes tokens one at a time, appending each key/value to a growing cache. Token at position 3 can only attend to positions 0–3 because those are the only keys that exist when it's processed.
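
Sketched at the call site (the names match the gpt() signature shown in the next section):

keys, values = [], []                 # empty cache: nothing to attend to yet
for pos_id, token_id in enumerate(tokens):
    # inside gpt(), each layer/head appends this token's k and v:
    #     keys.append(k); values.append(v)
    # so when position pos_id computes attention, only entries 0..pos_id
    # exist, and the future is unreachable without any explicit mask
    logits = gpt(token_id, pos_id, keys, values)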

The Full Transformer

Each transformer layer is: Attention (tokens talk to each other) then MLP (each token thinks individually), both with residual connections and RMSNorm.

[Interactive demo: Forward Pass]
def gpt(token_id, pos_id, keys, values):
    tok_emb = state_dict['wte'][token_id]   # look up token embedding
    pos_emb = state_dict['wpe'][pos_id]     # look up position embedding
    x = [t + p for t, p in zip(tok_emb, pos_emb)]

    for li in range(n_layer):
        x_residual = x
        x = rmsnorm(x)                      # pre-norm before attention
        # ... Q, K, V projection, attention, output projection ...
        x = [a + b for a, b in zip(x, x_residual)]

        x_residual = x
        x = rmsnorm(x)
        x = linear(x, mlp_fc1)           # expand 16 → 64
        x = [xi.relu() for xi in x]
        x = linear(x, mlp_fc2)           # compress 64 → 16
        x = [a + b for a, b in zip(x, x_residual)]

    x = rmsnorm(x)                               # final layer norm (ln_f)
    logits = linear(x, state_dict['lm_head'])  # → 27 scores
    return logits

The MLP expands the 16-dimensional vector to 64 dimensions, applies ReLU, then compresses back to 16. This "expand-activate-compress" pattern gives the model capacity to learn non-linear transformations that attention alone cannot express.

Differences from standard GPT-2: RMSNorm instead of LayerNorm, no biases anywhere, ReLU instead of GeLU. These simplifications barely affect quality but dramatically simplify the code.

Training Loop

Training repeats three steps: forward pass (compute predictions and loss), backward pass (compute gradients), and parameter update (nudge parameters to reduce loss).

for step in range(1000):
    doc = docs[step % len(docs)]
    tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
    n = len(tokens) - 1                      # number of next-token predictions

    # Forward: predict each next token (keys/values is the growing KV cache)
    keys, values = [], []
    losses = []
    for pos_id in range(n):
        logits = gpt(tokens[pos_id], pos_id, keys, values)
        probs = softmax(logits)
        target_id = tokens[pos_id + 1]       # the actual next token
        loss_t = -probs[target_id].log()     # cross-entropy loss
        losses.append(loss_t)
    loss = (1 / n) * sum(losses)

    loss.backward()                          # compute all gradients
    # ... Adam update (see next section) ...

For each position, we compute -log(prob of correct token). If the model assigns probability 1.0 to the right answer, loss is 0. If it assigns 0.01, loss is 4.6. The log penalizes confident wrong answers harshly.
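
A few values make the penalty curve concrete (plain floats, using -log(p) = log(1/p)):

import math

for p in (1.0, 0.5, 0.1, 0.01):
    print(f'p={p}: loss={math.log(1 / p):.3f}')
# p=1.0:  loss=0.000   (perfectly confident and correct)
# p=0.5:  loss=0.693
# p=0.1:  loss=2.303
# p=0.01: loss=4.605   (barely any probability on the right answer)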

[Interactive demo: Training Loss over 1000 steps]

Unlike batched training (which processes many examples at once), this code trains on a single name per step. No batching, no data loaders, no shuffled epochs. Just: pick a name, compute loss, update weights, repeat.

The Adam Optimizer

Adam maintains two running averages per parameter: m, an exponential moving average of the gradient (momentum), and v, an exponential moving average of the squared gradient, which adapts the step size per parameter:

# Adam update for each parameter
m[i] = beta1 * m[i] + (1 - beta1) * p.grad
v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
m_hat = m[i] / (1 - beta1 ** (step + 1))          # bias correction
v_hat = v[i] / (1 - beta2 ** (step + 1))
p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps)

Watch how Adam navigates a loss landscape compared to vanilla SGD. Adam adapts its step size per-parameter.

[Interactive demo: Adam vs SGD, LR 0.010]

The learning rate also decays linearly: lr_t = 0.01 * (1 - step/1000). Large adjustments early, fine-tuning later.
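
Putting the pieces together, a self-contained sketch of the optimizer loop (beta1, beta2, and eps use the common Adam defaults, assumed here rather than taken from the source; params is the flat list of all Value parameters):

beta1, beta2, eps = 0.9, 0.999, 1e-8
m = [0.0] * len(params)                  # first moment per parameter
v = [0.0] * len(params)                  # second moment per parameter

for step in range(1000):
    # ... forward pass + loss.backward() as above ...
    lr_t = 0.01 * (1 - step / 1000)      # linearly decayed learning rate
    for i, p in enumerate(params):
        m[i] = beta1 * m[i] + (1 - beta1) * p.grad
        v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
        m_hat = m[i] / (1 - beta1 ** (step + 1))   # bias correction
        v_hat = v[i] / (1 - beta2 ** (step + 1))
        p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps)
        p.grad = 0.0                     # zero the gradient for the next step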

Inference: Generating Names

After training, we generate names by repeatedly feeding the model its own output: start with BOS, get probabilities, sample a character, feed it back, repeat until BOS appears again.

temperature = 0.5              # sampling temperature (see below)
keys, values = [], []          # fresh KV cache for this sample
token_id = BOS
sample = []
for pos_id in range(block_size):
    logits = gpt(token_id, pos_id, keys, values)
    probs = softmax([l / temperature for l in logits])
    token_id = random.choices(range(vocab_size),
                              weights=[p.data for p in probs])[0]
    if token_id == BOS: break
    sample.append(uchars[token_id])
print(''.join(sample))

Temperature

Dividing logits by temperature before softmax controls randomness. Low temperature makes the distribution sharp (deterministic), high temperature makes it flat (creative).
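
A float-only illustration with the four logits used earlier (values approximate):

import math

def softmax_f(xs):
    mx = max(xs)
    es = [math.exp(x - mx) for x in xs]
    return [e / sum(es) for e in es]

logits = [2.0, 1.0, -1.0, 0.5]
for T in (0.2, 1.0, 5.0):
    print(T, [round(p, 3) for p in softmax_f([l / T for l in logits])])
# 0.2  [0.993, 0.007, 0.0, 0.001]   sharp: almost always the top choice
# 1.0  [0.609, 0.224, 0.03, 0.136]
# 5.0  [0.322, 0.263, 0.177, 0.238] flat: much more random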

[Interactive demo: Temperature Effect]

The generation loop

Watch the model generate a name character by character. Each step feeds the previous output back as input.

[Interactive demo: Inference, temperature 0.50]

This is how every LLM works. GPT-4, Claude, Gemini — they all generate text one token at a time, feeding their own output back as input. The only differences are scale (billions of parameters), tokenization (subwords instead of characters), training data (the internet instead of 32K names), and engineering (GPUs instead of pure Python).

The entire model is just arithmetic on floating-point numbers.
Understanding one layer means understanding them all.