Free to read. Sign up to save your progress and take knowledge-check quizzes.

Sign up free
12 min read·Updated April 28, 2026

The Transformer Revolution & How LLMs Work

How attention mechanisms, tokenization, pretraining, fine-tuning, and RLHF combine to create the large language models powering modern AI.

Listen to this lesson

Free preview · first 0:30
0:00 / 0:30

Audio & video lessons are paid features

Plus unlocks audio streaming. Pro adds downloadable audio, video, certificates, and more.

Plus adds:
  • Audio streaming
  • Downloadable PDFs
  • All AI Playbooks
  • Personalized content
Pro also adds:
  • Certificates of completion
  • Audio MP3 downloads
  • Video lessonssoon
  • & More…soon

Watch this lesson

Video coming soon

Learning Objectives

  • Explain how attention mechanisms allow LLMs to understand context
  • Describe the three-stage training pipeline: pretraining, supervised fine-tuning, and RLHF
  • Understand tokenization and why context windows matter

From Text to Tokens

Before an LLM can process language, it converts text into tokens — small chunks that might be words, word fragments, or characters.

For example, the word "tokenization" might become ["token", "ization"] — two tokens. The sentence "Hello, world!" becomes ["Hello", ",", " world", "!"] — four tokens.

Why tokens matter:

  • Context window: LLMs can only process a limited number of tokens at once. Claude Opus 4.7's 1 million token context window can hold roughly 750,000 words — about 2,500 pages of text, or the contents of several full-length books.
  • Embeddings: Each token is converted into a high-dimensional vector (embedding) that captures its meaning and relationships to other tokens.
  • Cost: API pricing is typically per token. Understanding token counts helps you estimate costs.

Attention: The Core Innovation

The key innovation in the Transformer is self-attention — the ability to relate every token in a sequence to every other token simultaneously.

When processing the sentence "The cat sat on the mat because it was tired," attention allows the model to connect "it" to "cat" — even though they are far apart. Traditional RNNs struggled with this; Transformers handle it naturally.

Multi-head attention runs multiple attention calculations in parallel, each focusing on different types of relationships (syntax, semantics, long-range dependencies). The results are combined to produce a rich representation.

💡Key Concept

Decoder-only architecture: Most modern LLMs (GPT, Claude, Gemini, Llama) use a decoder-only Transformer. They predict the next token given all previous tokens. This simple objective — predicting what comes next — when applied at massive scale, produces systems with remarkable general capabilities.

The Three-Stage Training Pipeline

Modern LLMs are not just trained once — they go through three distinct phases:

Stage 1: Pretraining

The model is trained on a massive corpus of text (trillions of tokens scraped from the internet, books, code, and other sources) with one objective: predict the next token.

This is unsupervised learning at enormous scale. Through this process, the model develops:

  • Grammar and syntax
  • World knowledge
  • Reasoning patterns
  • Code understanding

Pretraining is the most computationally expensive phase — frontier-model training runs (GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro) cost hundreds of millions to billions of dollars.

Stage 2: Supervised Fine-Tuning (SFT)

The pretrained model is powerful but not necessarily helpful or safe in conversation. SFT adapts it to be an assistant by training on curated examples of high-quality conversations.

Human trainers write example dialogues demonstrating how the model should respond to instructions, answer questions, and decline harmful requests.

Stage 3: RLHF — Reinforcement Learning from Human Feedback

In RLHF, human raters compare pairs of model responses and indicate which they prefer. A separate "reward model" is trained on these preferences. Then the LLM is fine-tuned using reinforcement learning to produce responses that score highly on the reward model.

RLHF is the key technique that makes LLMs helpful, harmless, and honest — rather than just statistically plausible text generators.

📝Note

Constitutional AI (Anthropic): Anthropic developed an extension of RLHF where the model critiques and revises its own outputs according to a set of principles ("the constitution") — reducing reliance on large volumes of human preference data.

Why LLMs Are So Powerful

Large language models exhibit emergent capabilities — abilities that appear at scale that were not explicitly trained for. As models get larger, they develop stronger reasoning, better code generation, and more reliable instruction-following.

This scaling property is what has driven the rapid investment in larger and larger models. The capability gap between a small model and a frontier model is often qualitative, not just quantitative.

Key Takeaways

  • LLMs tokenize text into subword units and convert them to numerical embeddings
  • Self-attention allows the model to consider all tokens in context simultaneously — the key innovation of Transformers
  • Training has three stages: pretraining (predict next token at scale), SFT (learn to be helpful), RLHF (learn human preferences)
  • Context window size determines how much text an LLM can process at once
  • Emergent capabilities arise at scale — the leap from a small to a frontier model is qualitative

Save your progress & take the quiz

Sign up free to bookmark lessons, track which modules you've completed, and lock in what you learned with a quick knowledge-check quiz at the end of each lesson.

🧭Recommended for you