Learning Objectives
- Explain how attention mechanisms allow LLMs to understand context
- Describe the three-stage training pipeline: pretraining, supervised fine-tuning, and RLHF
- Understand tokenization and why context windows matter
From Text to Tokens
Before an LLM can process language, it converts text into tokens — small chunks that might be words, word fragments, or characters.
For example, the word "tokenization" might become ["token", "ization"] — two tokens. The sentence "Hello, world!" becomes ["Hello", ",", " world", "!"] — four tokens.
Why tokens matter:
- Context window: LLMs can only process a limited number of tokens at once. Claude Opus 4.7's 1 million token context window can hold roughly 750,000 words — about 2,500 pages of text, or the contents of several full-length books.
- Embeddings: Each token is converted into a high-dimensional vector (embedding) that captures its meaning and relationships to other tokens.
- Cost: API pricing is typically per token. Understanding token counts helps you estimate costs.
Attention: The Core Innovation
The key innovation in the Transformer is self-attention — the ability to relate every token in a sequence to every other token simultaneously.
When processing the sentence "The cat sat on the mat because it was tired," attention allows the model to connect "it" to "cat" — even though they are far apart. Traditional RNNs struggled with this; Transformers handle it naturally.
Multi-head attention runs multiple attention calculations in parallel, each focusing on different types of relationships (syntax, semantics, long-range dependencies). The results are combined to produce a rich representation.
💡Key Concept
Decoder-only architecture: Most modern LLMs (GPT, Claude, Gemini, Llama) use a decoder-only Transformer. They predict the next token given all previous tokens. This simple objective — predicting what comes next — when applied at massive scale, produces systems with remarkable general capabilities.
The Three-Stage Training Pipeline
Modern LLMs are not just trained once — they go through three distinct phases:
Stage 1: Pretraining
The model is trained on a massive corpus of text (trillions of tokens scraped from the internet, books, code, and other sources) with one objective: predict the next token.
This is unsupervised learning at enormous scale. Through this process, the model develops:
- Grammar and syntax
- World knowledge
- Reasoning patterns
- Code understanding
Pretraining is the most computationally expensive phase — frontier-model training runs (GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro) cost hundreds of millions to billions of dollars.
Stage 2: Supervised Fine-Tuning (SFT)
The pretrained model is powerful but not necessarily helpful or safe in conversation. SFT adapts it to be an assistant by training on curated examples of high-quality conversations.
Human trainers write example dialogues demonstrating how the model should respond to instructions, answer questions, and decline harmful requests.
Stage 3: RLHF — Reinforcement Learning from Human Feedback
In RLHF, human raters compare pairs of model responses and indicate which they prefer. A separate "reward model" is trained on these preferences. Then the LLM is fine-tuned using reinforcement learning to produce responses that score highly on the reward model.
RLHF is the key technique that makes LLMs helpful, harmless, and honest — rather than just statistically plausible text generators.
📝Note
Constitutional AI (Anthropic): Anthropic developed an extension of RLHF where the model critiques and revises its own outputs according to a set of principles ("the constitution") — reducing reliance on large volumes of human preference data.
Why LLMs Are So Powerful
Large language models exhibit emergent capabilities — abilities that appear at scale that were not explicitly trained for. As models get larger, they develop stronger reasoning, better code generation, and more reliable instruction-following.
This scaling property is what has driven the rapid investment in larger and larger models. The capability gap between a small model and a frontier model is often qualitative, not just quantitative.
Key Takeaways
- LLMs tokenize text into subword units and convert them to numerical embeddings
- Self-attention allows the model to consider all tokens in context simultaneously — the key innovation of Transformers
- Training has three stages: pretraining (predict next token at scale), SFT (learn to be helpful), RLHF (learn human preferences)
- Context window size determines how much text an LLM can process at once
- Emergent capabilities arise at scale — the leap from a small to a frontier model is qualitative