1.4 — The Transformer Revolution & How LLMs Work

Learning Objectives

Explain how attention mechanisms allow LLMs to understand context
Describe the three-stage training pipeline: pretraining, supervised fine-tuning, and RLHF
Understand tokenization and why context windows matter

From Text to Tokens

Before an LLM can process language, it converts text into tokens — small chunks that might be words, word fragments, or characters.

For example, the word "tokenization" might become ["token", "ization"], which is two tokens. The sentence "Hello, world!" might become ["Hello", ",", " world", "!"], which is four tokens. The exact split depends on the tokenizer each model uses.
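To make the splitting concrete, here is a minimal sketch of a greedy longest-match tokenizer. The `tokenize` function and the tiny `VOCAB` set are hypothetical, hand-picked to reproduce the two examples above. Production tokenizers learn vocabularies of tens of thousands of entries from data and use more elaborate algorithms, so treat this only as an illustration of the idea.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization over a fixed vocabulary (toy sketch)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocab entry matches.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # Nothing matched: fall back to a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical toy vocabulary (an assumption, not a real model's vocabulary).
VOCAB = {"token", "ization", "Hello", ",", " world", "!"}

print(tokenize("tokenization", VOCAB))   # ['token', 'ization']
print(tokenize("Hello, world!", VOCAB))  # ['Hello', ',', ' world', '!']
```

Note that the space travels with " world" rather than being its own token, which is why the sentence comes out as four tokens and not five.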

Why tokens matter:

Context window: LLMs can only process a limited number of tokens at once. Claude Opus 4.7's 1 million token context window can hold roughly 750,000 words (at about 0.75 words per token), which is about 2,500 pages of text at 300 words per page, or the contents of several full-length books.
Embeddings: Each token is converted into a high-dimensional vector (embedding) that captures its meaning and relationships to other tokens.
Cost: API pricing is typically per token. Understanding token counts helps you estimate costs.
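The context-window and cost points above come down to simple arithmetic. The sketch below uses the ratios implied by the figures in this lesson (1,000,000 tokens to 750,000 words to 2,500 pages). The function names are illustrative, and the per-million-token prices in the last line are made-up placeholders, not any provider's real pricing.

```python
# Ratios implied by the lesson's figures; real text varies by language and style.
WORDS_PER_TOKEN = 0.75   # 750,000 words / 1,000,000 tokens
WORDS_PER_PAGE = 300     # 750,000 words / 2,500 pages


def tokens_to_words(n_tokens):
    """Rough word count that fits in n_tokens."""
    return n_tokens * WORDS_PER_TOKEN


def tokens_to_pages(n_tokens):
    """Rough page count that fits in n_tokens."""
    return tokens_to_words(n_tokens) / WORDS_PER_PAGE


def estimate_cost(input_tokens, output_tokens, usd_per_million_in, usd_per_million_out):
    """Estimated cost in USD when input and output tokens are priced separately."""
    total = input_tokens * usd_per_million_in + output_tokens * usd_per_million_out
    return total / 1_000_000


print(tokens_to_words(1_000_000))  # 750000.0
print(tokens_to_pages(1_000_000))  # 2500.0

# Hypothetical prices (USD per million tokens), for illustration only.
print(estimate_cost(200_000, 2_000, usd_per_million_in=3.0, usd_per_million_out=15.0))
```

Input and output tokens are kept as separate arguments because many APIs price them differently; check your provider's current price list before relying on any estimate.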

Attention: The Core Innovation

The key innovation in the Transformer is self-attention — the ability to relate every token in a sequence to every other token simultaneously.

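When processing the sentence "The cat sat on the mat because it was tired," attention allows the model to connect "it" to "cat" even though several words separate them and "mat" sits closer. Recurrent neural networks (RNNs), which pass information along one token at a time, tended to lose such links as the distance grew; Transformers relate the two tokens directly.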
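A minimal sketch of the mechanism, in plain Python: each token's vector is compared with every other token's vector (a scaled dot product), the scores are turned into weights with a softmax, and the output is a weighted mix of all vectors. As a simplifying assumption, the same vectors serve as queries, keys, and values; real Transformers first apply learned projection matrices. The 2-dimensional embeddings are hand-picked so that "it" resembles "cat", purely for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values (simplified)."""
    d = len(vectors[0])
    outputs, all_weights = [], []
    for q in vectors:
        # Score this token against every token in the sequence, itself included.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        # Output is the attention-weighted average of all token vectors.
        out = [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Hand-picked toy embeddings (an assumption, not learned values).
tokens = ["cat", "mat", "it"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

outputs, weights = self_attention(embeddings)
it_weights = dict(zip(tokens, weights[2]))
print(it_weights)  # "it" puts more weight on "cat" than on "mat"
```

Every token is scored against every other token in one pass, which is what lets the model link "it" to "cat" regardless of how far apart they are.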

Multi-head attention runs multiple attention calculations in parallel, each focusing on different types of relationships (syntax, semantics, long-range dependencies). The results are combined to produce a rich representation.
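The "run in parallel, then combine" idea can be sketched as follows. Each head here attends over its own slice of the embedding dimensions, and the per-head results are concatenated. This is a deliberately reduced version: real multi-head attention also gives each head learned query, key, and value projections and applies a learned output projection after concatenation, all of which are omitted. The function names and the toy embeddings are assumptions for illustration.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(vectors):
    """Single-head scaled dot-product self-attention (queries = keys = values)."""
    d = len(vectors[0])
    outputs = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)])
    return outputs


def multi_head_attention(vectors, num_heads):
    """Split each vector into num_heads slices, attend per slice, concatenate."""
    d_model = len(vectors[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        start, end = h * d_head, (h + 1) * d_head
        # Each head sees only its own slice, so it can weight tokens differently.
        head_outputs.append(attend([v[start:end] for v in vectors]))
    # Concatenate the heads back into one vector per token.
    return [[x for head in head_outputs for x in head[t]] for t in range(len(vectors))]


# Three tokens with 4-dimensional toy embeddings (hypothetical values).
embeddings = [
    [1.0, 0.0, 0.2, 0.8],
    [0.0, 1.0, 0.9, 0.1],
    [0.9, 0.1, 0.3, 0.7],
]
mixed = multi_head_attention(embeddings, num_heads=2)
print(len(mixed), len(mixed[0]))  # 3 4
```

The output keeps the input's shape (one vector per token, same width), which is what allows attention layers to be stacked.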

💡Key Concept

Decoder-only architecture: Most modern LLMs (GPT, Claude, Gemini, Llama) use a decoder-only Transformer. They predict the next token given all previous tokens. This simple objective — predicting what comes next — when applied at massive scale, produces systems with remarkable general capabilities.
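The generation loop itself is simple: predict one token, append it, and repeat. The sketch below shows that loop with a toy stand-in for the model, a bigram counter that looks only at the last token. That is a loud simplification: a decoder-only Transformer conditions on all previous tokens, not just one. The function names and the tiny corpus are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count which token follows which (a toy stand-in for a trained model)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, prompt, max_new_tokens):
    """Autoregressive greedy decoding: append the most likely next token, repeat."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        followers = counts.get(out[-1])
        if not followers:
            break  # nothing ever followed this token in the training text
        out.append(followers.most_common(1)[0][0])
    return out


corpus = "the cat sat on the mat . the cat sat on the sofa .".split()
model = train_bigram(corpus)

print(generate(model, ["the"], max_new_tokens=3))  # ['the', 'cat', 'sat', 'on']
```

Each generated token becomes part of the input for the next step, which is why output length counts against the context window too.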

The Three-Stage Training Pipeline

Modern LLMs are not just trained once — they go through three distinct phases:

Stage 1: Pretraining

The model is trained on a massive corpus of text (trillions of tokens scraped from the internet, books, code, and other sources) with one objective: predict the next token.
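"Predict the next token" is usually turned into a number to minimize via cross-entropy: the average negative log of the probability the model assigned to the token that actually came next. The sketch below computes that loss for two hypothetical models. The probability tables are invented for illustration, and `next_token_loss` is an illustrative name, not a library function.

```python
import math


def next_token_loss(predicted_probs, targets):
    """Average negative log-likelihood of the true next tokens (cross-entropy)."""
    total = 0.0
    for probs, target in zip(predicted_probs, targets):
        # Clamp to avoid log(0) if the model gave the true token no probability.
        p = max(probs.get(target, 0.0), 1e-12)
        total += -math.log(p)
    return total / len(targets)


# True next tokens at two positions: after "the" comes "cat", after "the cat" comes "sat".
targets = ["cat", "sat"]

# Invented predictions from two hypothetical models.
confident = [{"cat": 0.9, "dog": 0.1}, {"sat": 0.8, "ran": 0.2}]
unsure = [{"cat": 0.5, "dog": 0.5}, {"sat": 0.5, "ran": 0.5}]

print(next_token_loss(confident, targets))  # lower loss
print(next_token_loss(unsure, targets))     # higher loss, equal to log(2)
```

Training nudges the model's parameters to push this number down across the whole corpus. No human labeling is needed because the text itself supplies the correct next token.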

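This is self-supervised learning (often loosely called unsupervised) at enormous scale: no human labels are needed, because the text itself supplies the correct next token. Through this process, the model develops: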

Grammar and syntax
World knowledge
Reasoning patterns
Code understanding

Pretraining is the most computationally expensive phase: frontier-model training runs (GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro) are estimated to cost hundreds of millions to billions of dollars.

Stage 2: Supervised Fine-Tuning (SFT)

The pretrained model is powerful but not necessarily helpful or safe in conversation. SFT adapts it to be an assistant by training on curated examples of high-quality conversations.

Human trainers write example dialogues demonstrating how the model should respond to instructions, answer questions, and decline harmful requests.

Stage 3: RLHF — Reinforcement Learning from Human Feedback

In RLHF, human raters compare pairs of model responses and indicate which they prefer. A separate "reward model" is trained on these preferences. Then the LLM is fine-tuned using reinforcement learning to produce responses that score highly on the reward model.
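One common way to train a reward model on pairwise preferences is a Bradley-Terry style loss: the loss is small when the reward for the preferred response exceeds the reward for the rejected one, and large when the order is reversed. The sketch below assumes that formulation; this lesson does not say which loss any particular lab uses, and the reward scores are invented numbers.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Invented reward-model scores for a (preferred, rejected) pair of responses.
print(preference_loss(2.0, 0.0))  # small: the model agrees with the human rater
print(preference_loss(0.0, 2.0))  # large: the model ranks the pair the wrong way
print(preference_loss(1.0, 1.0))  # log(2): the model cannot tell them apart
```

Minimizing this loss over many rated pairs teaches the reward model to score responses the way raters would. The reinforcement learning step then uses those scores as its training signal.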

RLHF is a key technique for steering LLMs toward helpful, harmless, and honest behavior, rather than leaving them as statistically plausible text generators.

📝Note

Constitutional AI (Anthropic): Anthropic developed an extension of RLHF where the model critiques and revises its own outputs according to a set of principles ("the constitution") — reducing reliance on large volumes of human preference data.

Why LLMs Are So Powerful

Large language models exhibit emergent capabilities: abilities that appear at scale even though the model was never explicitly trained for them. As models get larger, they tend to develop stronger reasoning, better code generation, and more reliable instruction-following.

This scaling property is what has driven the rapid investment in larger and larger models. The capability gap between a small model and a frontier model is often qualitative, not just quantitative.

Key Takeaways

LLMs tokenize text into subword units and convert them to numerical embeddings
Self-attention allows the model to consider all tokens in context simultaneously — the key innovation of Transformers
Training has three stages: pretraining (predict next token at scale), SFT (learn to be helpful), RLHF (learn human preferences)
Context window size determines how much text an LLM can process at once
Emergent capabilities arise at scale — the leap from a small to a frontier model is qualitative
