01 · TOKENIZE
Words become numbers.
Before anything can be reasoned about, your input is shattered into tokens — sub-word units drawn from a vocabulary of around 100k–200k pieces. "unforgettable" typically becomes un · forget · table.
Each token is looked up in an embedding table — a wide vector of learned numbers. From the model's point of view, your prompt now lives in a high-dimensional space where related ideas are geometrically close.
VOCAB · 100k–200k
DIM · 4096–18432
02 · ATTEND
Every token reads every other.
The transformer's defining trick: self-attention. Each token, in parallel, asks "which other tokens should I be paying attention to?" — and answers itself. Multiple attention heads ask this same question along different axes.
One head might track syntax, another follows entities across paragraphs, a third matches code brackets. The model isn't taught which head does what — that emerges from training.
HEADS · 32–128
CTX · up to 2M tokens
03 · TRANSFORM
Many layers, narrow signals.
An attention layer is paired with a feed-forward network — a dense MLP that reshapes the signal token-by-token. Stack 40–120 of these blocks and you have a transformer.
In mixture-of-experts models the feed-forward layer is replaced by many parallel experts; a router picks ~2 per token. That's how 671B-parameter DeepSeek runs with only 37B active per forward pass.
LAYERS · 40–120
EXPERTS · 8–256 (MoE)
04 · TRAIN
Three teachers in sequence.
Pretraining — trillions of tokens of text and code. The model only ever learns "predict the next token", but at that scale, learning to predict becomes learning to compress, and compression starts to look like understanding.
Post-training — supervised fine-tuning on curated demonstrations, then preference optimisation (RLHF, DPO). This is where the helpful assistant persona appears.
RLVR — reinforcement learning from verifiable rewards. Used heavily for reasoning models: the model gets rewarded when its chain-of-thought leads to a correct, checkable answer.
PRETRAIN · 15T+ tokens
POST · SFT → DPO → RLVR
05 · HARNESS
The model is the engine. The harness is the car.
What you call "GPT-5" or "Claude" in product is always a harness wrapped around model weights: a system prompt that defines its persona; a context window holding the conversation; a tool sheet listing functions it can call; a memory store; a safety filter on both sides.
Two products on the same weights can feel like different models. Two harnesses on the same task — agentic versus single-turn — can score 20 points apart on the same benchmark.
SYSTEM PROMPT
TOOLS
MEMORY
EVAL HARNESS