Chapter 1. What Is a Language Model, Really?

Every time you type a message and an AI responds with something that sounds eerily human, the same trick is happening behind the scenes: the model is predicting the next word. That’s it. The entire field of large language models (the technology behind ChatGPT, Claude, Gemini, and every other AI assistant) comes down to one deceptively simple idea: given some text, what word comes next?

This chapter explains what that actually means, why it works so well, and how we got from crude statistical counting in the 1990s to trillion-parameter models that can write code, pass medical exams, and hold conversations that feel genuinely intelligent.


The Core Idea: Predicting the Next Token

A language model is a system that takes a sequence of text and assigns probabilities to what comes next.

That’s the entire definition. Everything else (the billions of parameters, the massive GPU clusters, the months of training) exists to make that prediction as accurate as possible.

Here’s a concrete example. Suppose you give a language model this text:

The capital of France is

The model doesn’t “know” geography. It doesn’t have a map stored inside it. What it does is assign a probability to every possible next word in its vocabulary. A well-trained model might produce something like:

| Next token | Probability |
|---|---|
| Paris | 0.92 |
| the | 0.03 |
| a | 0.01 |
| located | 0.01 |
| known | 0.005 |
| … (thousands more) | |

The model assigns 92% probability to “Paris” and spreads the remaining 8% across thousands of other possibilities. It then picks one, usually the most likely, though there are ways to add randomness that we’ll cover in Chapter 17.

But here’s the key insight: the model doesn’t stop after one word. Once it picks “Paris,” it appends that to the input and runs the whole process again:

The capital of France is Paris

Now it predicts the next token after “Paris.” Maybe it picks “.” with high probability. Then it predicts what comes after the period. And so on, one token at a time, until it decides to stop.

This is called autoregressive generation: the model generates text by repeatedly predicting the next token, feeding each prediction back in as input for the next step. Every AI chatbot you’ve ever used works this way. Every single one.
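The loop described above can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not how any real model is implemented: `TOY_MODEL` is a hypothetical lookup table with made-up probabilities, it looks only at the last token, and `<end>` is an assumed stop token.

```python
# Toy autoregressive generation. The "model" is a hypothetical lookup table
# with made-up probabilities; a real model computes them with a neural network
# and conditions on the whole sequence, not just the last token.
TOY_MODEL = {
    "is": {"Paris": 0.92, "the": 0.03, "a": 0.01},
    "Paris": {".": 0.85, ",": 0.10, "<end>": 0.05},
    ".": {"<end>": 0.90, "The": 0.10},
}


def toy_next_token_probs(tokens):
    """Return a next-token distribution for the sequence (toy: last token only)."""
    return TOY_MODEL.get(tokens[-1], {"<end>": 1.0})


def generate(prompt_tokens, max_new_tokens=10):
    """Greedy decoding: always append the most probable next token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = toy_next_token_probs(tokens)
        next_token = max(probs, key=probs.get)
        if next_token == "<end>":  # the model "decides to stop"
            break
        tokens.append(next_token)  # feed the prediction back in as input
    return tokens


result = generate(["The", "capital", "of", "France", "is"])
print(" ".join(result))
```

Swapping the `max` call for a random draw weighted by `probs` would turn this greedy loop into a sampling loop, which is the kind of randomness mentioned earlier.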

What’s a Token?

You’ll notice I keep saying “token” instead of “word.” That’s because language models don’t actually work with words. They work with tokens, which are chunks of text that may or may not align with whole words.

The word “understanding” might be split into two tokens: “understand” and “ing.” The word “cat” is probably one token. A rare word like “defenestration” might be split into “def,” “en,” “est,” and “ration.”

Why? Because there are too many possible words in all the world’s languages to give each one its own entry. Instead, models use a fixed vocabulary of roughly 100,000 to 200,000 tokens that can be combined to represent any text. We’ll cover exactly how this works in Chapter 4, but for now, just think of tokens as the atomic units that the model reads and writes.
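To make the idea of subword pieces concrete, here is a toy tokenizer that splits words by greedy longest match. The vocabulary is hypothetical and hand-written to reproduce the splits mentioned above; real tokenizers learn their vocabularies from data and use different matching rules, so treat this purely as a sketch.

```python
# Toy subword tokenizer: greedy longest-match against a tiny, hypothetical
# vocabulary. Real tokenizers (for example byte-pair encoding) learn their
# vocabulary from data and use different matching rules.
TOY_VOCAB = ["understand", "ing", "cat", "def", "en", "est", "ration"]
TOKEN_TO_ID = {token: i for i, token in enumerate(TOY_VOCAB)}


def tokenize(word):
    """Split a word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest possible piece first, then shrink.
        for end in range(len(word), start, -1):
            if word[start:end] in TOKEN_TO_ID:
                pieces.append(word[start:end])
                start = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches {word[start:]!r}")
    return pieces


for word in ["understanding", "cat", "defenestration"]:
    pieces = tokenize(word)
    ids = [TOKEN_TO_ID[p] for p in pieces]
    print(word, "->", pieces, ids)
```

The integer IDs are what the model actually sees; the text pieces exist only for our benefit.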


Why Does Prediction Lead to Understanding?

This is the question that trips people up. If all the model does is predict the next word, how can it write poetry, debug code, or explain quantum physics?

The answer is that to predict the next word well, you need to understand an enormous amount about the world.

Consider this prompt:

A patient presents with sudden onset chest pain radiating to the left arm,
shortness of breath, and diaphoresis. The most likely diagnosis is

To predict that the next words are “acute myocardial infarction” (a heart attack), the model needs to have learned, during training, the relationships between symptoms and diagnoses. It needs to understand medical terminology, the concept of differential diagnosis, and the statistical patterns of how doctors write about these cases.

Or consider:

def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) +

To predict that the next tokens spell out “fibonacci(n-2)”, the model needs to understand recursion, function definitions, the Fibonacci sequence, and Python syntax.

The key realization is this: predicting the next token on trillions of examples forces the model to build internal representations of grammar, facts, reasoning patterns, and even something that looks like common sense. The model was never explicitly taught any of these things. They emerged as a side effect of getting better at prediction.

This is both remarkable and important to understand honestly. The model doesn’t “know” things the way you know things. It has learned statistical patterns from text. When those patterns are strong and consistent, like the capital of France, the model is extremely reliable. When the patterns are weak, ambiguous, or absent from the training data, the model can confidently produce complete nonsense. This is why language models hallucinate: they generate plausible-sounding text that is factually wrong, because their internal probability distribution favors fluency over truth.

There’s a genuine debate among researchers about whether what these models do constitutes “real” understanding or a very convincing imitation. Some argue that a system which can pass medical licensing exams, write working software, and explain complex physics has clearly learned something meaningful about the world. Others counter that the model is doing sophisticated pattern completion: it has seen millions of examples of how doctors write about symptoms, how programmers write recursive functions, and how physicists explain their field, and it’s interpolating between those examples without any deeper comprehension.

For the purposes of this book, we’ll take the practical view: what matters is what the model can do, and understanding how it does it. Whether that constitutes “true” understanding is a philosophical question we’ll leave to the philosophers. What we can say with certainty is that the mechanism (next-token prediction trained on trillions of tokens) produces capabilities that no one fully predicted, and that continue to surprise even the researchers who build these systems.

We’ll dig deeper into hallucinations and the limits of this approach in Chapter 26. For now, the takeaway is: next-token prediction produces something that looks like understanding, and in many practical cases functions like understanding, but it is fundamentally a different mechanism than human comprehension.


A Brief History: How We Got Here

The idea of predicting the next word in a sequence is not new. What changed is how well we can do it. Let’s trace the key milestones.

The Statistical Era: N-Grams (1950s–2000s)

The concept of modeling language as a statistical process goes back to Claude Shannon, the father of information theory. In his landmark 1948 paper “A Mathematical Theory of Communication,” Shannon demonstrated that you could model English text by looking at the probability of each letter (or word) given the ones that came before it.

This idea was formalized into n-gram models, which became the workhorses of natural language processing from the 1980s through the 2000s. An n-gram model predicts the next word based on the previous n-1 words.

A bigram model (n=2) looks at only the previous word:

P("Paris" | "is") = 0.003

It asks: of all the times the word “is” appeared in the training data, how often was it followed by “Paris”?

A trigram model (n=3) looks at the previous two words:

P("Paris" | "France is") = 0.15

Better. Now we have more context. But there’s a fundamental problem: as you increase n, the number of possible combinations explodes. A trigram model over a vocabulary of 50,000 words needs to store probabilities for up to 50,000³ = 125 trillion combinations. Most of those combinations never appear in any training data, so the model has no information about them.
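The counting approach is simple enough to sketch directly. The corpus below is made up and tiny, so the probabilities it produces are illustrative only, and the function name `bigram_prob` is our own.

```python
from collections import Counter, defaultdict

# A tiny, made-up corpus; real n-gram models were trained on millions of words.
corpus = (
    "the capital of france is paris . "
    "the capital of germany is berlin . "
    "the weather is mild ."
).split()

# Count how often each word follows each one-word context (bigram counts).
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1


def bigram_prob(nxt, prev):
    """P(nxt | prev), estimated as count(prev, nxt) / count(prev, anything)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0


print(bigram_prob("paris", "is"))  # "is" is followed by paris, berlin, mild once each
print(bigram_prob("tokyo", "is"))  # never seen in the corpus, so the estimate is zero

# The table-size problem: possible trigrams over a 50,000-word vocabulary.
vocab_size = 50_000
print(f"{vocab_size ** 3:,}")
```

The zero for an unseen pair is the sparsity problem in miniature: a counting model has nothing to say about any combination it has not observed.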

This is called the curse of dimensionality, and it meant that n-gram models were stuck looking at very short windows of context, typically 3 to 5 words. They could capture local patterns like “New York City” or “United States of America,” but they couldn’t handle long-range dependencies like:

The cat, which had been sitting on the mat in the living room all afternoon
while the rain poured outside, suddenly ___

An n-gram model looking at the last 3–5 words would see “poured outside, suddenly” and have no idea that the subject of the sentence is “the cat.” It would need to look back 20+ words, which is hopeless for a counting-based approach: almost no 20-word sequence ever appears twice in the training data, so there is nothing to count.

N-gram models were useful. They powered spell checkers, basic machine translation, and speech recognition for decades. But they couldn’t truly understand language.

Neural Networks Enter the Scene (2003–2013)

In 2003, Yoshua Bengio and his colleagues published “A Neural Probabilistic Language Model,” which proposed using a neural network instead of counting to predict the next word. The key innovation was word embeddings: representing each word as a list of numbers (a vector) rather than a discrete symbol.

This was revolutionary because it offered a way around the curse of dimensionality. Instead of needing to see the exact phrase “France is” in the training data, the model could generalize: if it knew that “France” was similar to “Germany” and “Spain” (because their vectors were close together), then knowing that “The capital of Germany is Berlin” helped it predict that “The capital of France is Paris.”

In 2013, Tomáš Mikolov and his team at Google published Word2Vec, a method for efficiently learning these word embeddings from large amounts of text. Word2Vec showed that the learned vectors captured meaningful relationships, the famous example being that the vector for “King” minus “Man” plus “Woman” produced a vector close to “Queen.” This demonstrated that neural networks could learn something resembling semantic meaning from raw text.
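The vector arithmetic behind the “King” and “Queen” example can be sketched with toy numbers. The three-dimensional vectors below are hand-picked for illustration, not learned, and the helper `cosine` is our own; real embeddings have hundreds or thousands of dimensions and the analogy holds only approximately.

```python
import math

# Hand-picked 3-dimensional toy vectors (hypothetical values). Real embeddings
# have hundreds or thousands of dimensions and are learned from text.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(u, v):
    """Cosine similarity: 1.0 means the vectors point in the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# king - man + woman, computed component by component.
target = [
    k - m + w
    for k, m, w in zip(embeddings["king"], embeddings["man"], embeddings["woman"])
]

# Rank the remaining words by similarity to the target vector.
candidates = {w: v for w, v in embeddings.items() if w not in ("king", "man", "woman")}
nearest = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(nearest, round(cosine(target, embeddings["queen"]), 3))
```

Excluding the three input words from the candidates follows the usual convention for this kind of analogy test.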

Recurrent Neural Networks and LSTMs (2014–2016)

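The next breakthrough came from Recurrent Neural Networks (RNNs) and their more powerful variant, Long Short-Term Memory networks (LSTMs). Neither idea was new (the LSTM was introduced in 1997), but it took the data and compute of the mid-2010s to make them the state of the art for language.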

The idea behind an RNN is simple: process text one word at a time, and at each step, maintain a hidden state, a compressed summary of everything the model has seen so far. When the model reads the word “cat” at position 1, it updates its hidden state. When it reads “sat” at position 2, it updates the state again, incorporating both the new word and the memory of “cat.”
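Here is a rough sketch of that update rule. The weights and word vectors are hypothetical, hand-set values chosen only so the example runs; a trained RNN learns them from data and uses far larger vectors.

```python
import math

# A toy recurrent step with hypothetical, hand-set weights. A real RNN learns
# these weights during training and uses much larger vectors.
W_HIDDEN = [[0.5, -0.3], [0.2, 0.4]]   # how the old hidden state feeds forward
W_INPUT = [[0.7, 0.1], [-0.2, 0.6]]    # how the new word vector is mixed in
WORD_VECTORS = {"the": [0.1, 0.0], "cat": [1.0, 0.2], "sat": [0.3, 0.9]}


def matvec(matrix, vector):
    """Multiply a small matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rnn_step(hidden, word):
    """New hidden state = tanh(W_HIDDEN @ hidden + W_INPUT @ word_vector)."""
    from_memory = matvec(W_HIDDEN, hidden)
    from_input = matvec(W_INPUT, WORD_VECTORS[word])
    return [math.tanh(m + i) for m, i in zip(from_memory, from_input)]


def encode(words):
    """Read words strictly left to right, carrying one hidden state along."""
    hidden = [0.0, 0.0]
    for word in words:
        hidden = rnn_step(hidden, word)
    return hidden


print(encode(["cat", "sat"]))
print(encode(["sat", "cat"]))  # same words, different order, different state
```

Note that `encode` cannot compute step two before step one has finished. That dependency is the sequential bottleneck discussed below.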

In 2014, Ilya Sutskever, Oriol Vinyals, and Quoc Le at Google published “Sequence to Sequence Learning with Neural Networks,” which showed that RNNs could translate between languages by encoding an entire sentence into a single vector and then decoding it into another language. That same year, Dzmitry Bahdanau and colleagues introduced the attention mechanism, a way for the decoder to look back at specific parts of the input sentence rather than relying on a single compressed vector.

LSTMs improved on basic RNNs by adding gates that controlled what information to keep and what to forget, allowing them to handle longer sequences. They powered Google Translate starting in 2016 and were the state of the art for several years.

But RNNs and LSTMs had a critical limitation: they processed text sequentially, one word at a time. This meant:

  1. Training was slow: you couldn’t parallelize across the sequence because each step depended on the previous one.
  2. Long-range dependencies were still hard: even with LSTMs, information from the beginning of a long document would fade by the time the model reached the end.

The Transformer Revolution (2017)

On June 12, 2017, a team of eight researchers at Google (Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, and Illia Polosukhin) submitted a paper to arXiv titled “Attention Is All You Need.” It was later presented at the NeurIPS 2017 conference. It introduced the Transformer architecture, and it changed everything.

The Transformer’s key innovation was replacing sequential processing with self-attention, a mechanism that lets every word in a sequence look at every other word simultaneously. Instead of reading a sentence left to right and trying to remember what came before, the Transformer processes all positions in parallel and learns which words are relevant to which other words.

This solved both problems with RNNs:

  1. Training became massively parallel. Since every position is processed simultaneously, you can use thousands of GPUs efficiently. This made it practical to train on much larger datasets.
  2. Long-range dependencies became easy. A word at position 1 can directly attend to a word at position 1,000; there’s no information bottleneck.
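A stripped-down sketch shows why both points hold. It makes several simplifying assumptions that a real Transformer does not: the queries, keys, and values are the input vectors themselves (no learned projections), there is a single attention head, and there is no causal mask. The token vectors are made up.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Simplified self-attention: queries, keys, and values are the inputs
    themselves. Real Transformers first apply learned projection matrices."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this position against every position (scaled dot product).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Output is a weighted average of all the value vectors.
        mixed = [
            sum(w * value[d] for w, value in zip(weights, vectors))
            for d in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Three hypothetical 2-dimensional token vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 2) for w in row])
```

Each pass through the outer loop depends only on the inputs, never on another position's result, so all positions can be computed at once. And every position scores every other position directly, no matter how far apart they are.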

We’ll cover exactly how self-attention works in Chapter 7. For now, the important thing is that the Transformer made it possible to scale language models to sizes that were previously unimaginable.

The name of the paper, “Attention Is All You Need,” turned out to be prophetic. Within two years, virtually every state-of-the-art NLP system was based on the Transformer. Within five years, it had expanded beyond language into images, audio, video, protein folding, and weather prediction. As of 2026, the Transformer remains the foundation of every frontier language model. No alternative architecture has displaced it.

The Scaling Era: GPT, BERT, and Beyond (2018–2023)

Once the Transformer architecture existed, the race to scale began.

GPT-1 (June 2018, OpenAI): The first Generative Pre-trained Transformer. It had 117 million parameters and was trained on about 7,000 books. It demonstrated that pre-training a Transformer on a large text corpus and then fine-tuning it on specific tasks could achieve strong results. It was a proof of concept.

BERT (October 2018, Google): Bidirectional Encoder Representations from Transformers. Unlike GPT, which only looked at text from left to right, BERT looked in both directions: it could see words both before and after a given position. BERT was designed for understanding text (answering questions, classifying sentiment) rather than generating it. It dominated NLP benchmarks for years.

GPT-2 (February 2019, OpenAI): 1.5 billion parameters, about 13 times larger than GPT-1. OpenAI initially refused to release the full model, citing concerns about misuse. It could generate surprisingly coherent paragraphs of text, which was alarming at the time and seems quaint now.

GPT-3 (June 2020, OpenAI): 175 billion parameters, over 100 times larger than GPT-2. This was the model that made the world pay attention. GPT-3 could write essays, answer questions, translate languages, and even write basic code, all without being specifically trained for those tasks. It demonstrated few-shot learning: give it a few examples of what you want, and it could generalize to new inputs.

ChatGPT (November 2022, OpenAI): Built on GPT-3.5, this was the product that brought language models to the mainstream. It wasn’t a fundamentally new model; it was GPT-3.5 fine-tuned with human feedback to be conversational and helpful. Within two months, it had over 100 million users.

GPT-4 (March 2023, OpenAI): A major leap in capability. OpenAI did not disclose the parameter count, but it was widely reported to use a Mixture-of-Experts architecture. GPT-4 could pass the bar exam, score in the 90th percentile on the SAT, and handle images as input. At launch it offered context windows of 8,192 and 32,768 tokens; the 128,000-token window arrived with GPT-4 Turbo in November 2023.

The Open-Source Explosion and MoE (2024–2025)

Two major shifts defined this period: the rise of open-weight models and the widespread adoption of Mixture-of-Experts (MoE) architectures.

Mixture-of-Experts is an architectural approach where the model contains many separate “expert” sub-networks, but only activates a small fraction of them for each token. This means a model can have hundreds of billions of total parameters (giving it enormous capacity to store knowledge) while only using a fraction of them for any given prediction (keeping inference fast and efficient).
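The routing step can be sketched as follows. Everything here is a toy under stated assumptions: eight hypothetical “experts” that merely scale their input, a top-2 choice, and hand-written router scores. In a real model the experts are feed-forward networks and a learned layer produces the scores.

```python
import math

NUM_EXPERTS = 8  # hypothetical sizes; production models use many more experts
TOP_K = 2

# Each toy "expert" just scales its input by a different constant.
experts = [lambda x, s=scale: [s * v for v in x] for scale in range(1, NUM_EXPERTS + 1)]


def route(router_scores, k=TOP_K):
    """Pick the k highest-scoring experts and softmax their scores into weights."""
    ranked = sorted(range(len(router_scores)), key=lambda i: router_scores[i], reverse=True)
    chosen = ranked[:k]
    exps = [math.exp(router_scores[i]) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_layer(token_vector, router_scores):
    """Run only the chosen experts and blend their outputs by router weight."""
    output = [0.0] * len(token_vector)
    for expert_index, weight in route(router_scores):
        expert_out = experts[expert_index](token_vector)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output


# Made-up router scores for one token; a learned layer would produce these.
scores = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]
print(route(scores))
print(moe_layer([1.0, -1.0], scores))
```

Only two of the eight experts run for this token, which is the whole point: the parameter count grows with the number of experts, but the work per token grows only with `TOP_K`.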

In December 2024, the Chinese AI lab DeepSeek released DeepSeek-V3, an open-weight MoE model with 671 billion total parameters but only 37 billion active per token. It used 256 routed experts plus 1 shared expert per MoE layer, with a router that selected the top 8 routed experts for each token. Despite being trained at a fraction of the cost of Western frontier models, it matched or exceeded many of them on benchmarks. This sent shockwaves through the industry and demonstrated that efficient architecture design could compensate for raw compute.

Source: DeepSeek-V3 Technical Report, arXiv:2412.19437, December 2024.

On April 5, 2025, Meta released LLaMA 4, its first MoE model family. LLaMA 4 Scout had 109 billion total parameters (17 billion active, 16 experts), while LLaMA 4 Maverick had 400 billion total parameters (17 billion active, 128 experts). Both were natively multimodal (they could process images and text together) and LLaMA 4 Scout supported a context window of up to 10 million tokens, the largest of any model at the time. These were released as open weights, meaning anyone could download and run them.

Sources: Meta AI blog (April 5, 2025); Hugging Face blog, “Welcome Llama 4 Maverick & Scout” (April 5, 2025).

On August 7, 2025, OpenAI released GPT-5, its first confirmed MoE model, with a 400,000-token total context window (272,000 input tokens plus up to 128,000 output tokens) and native multimodal capabilities. OpenAI did not publish the exact parameter count, but industry estimates ranged from 1.8 trillion to several trillion total parameters. GPT-5 was followed by GPT-5.2 on December 11, 2025, which maintained the 400,000-token context window and introduced three variants: Instant, Thinking, and Pro.

Sources: OpenAI blog, “Introducing GPT-5” (August 7, 2025); OpenAI documentation; Ars Technica (December 11, 2025).

The Frontier in March 2026

As of this writing, the frontier of language models looks like this:

| Model | Developer | Released | Context Window | Architecture | Key Feature |
|---|---|---|---|---|---|
| GPT-5.4 | OpenAI | Mar 5, 2026 | 1M tokens | MoE | Native computer use, 3 variants (Standard, Thinking, Pro) |
| Claude Opus 4.6 | Anthropic | Feb 5, 2026 | 1M tokens (GA Mar 13) | Not disclosed | Agent teams, 128K output tokens |
| Claude Sonnet 4.6 | Anthropic | Feb 17, 2026 | 1M tokens (GA Mar 13) | Not disclosed | Near-Opus performance at $3/$15 per M tokens |
| Gemini 3.1 Pro | Google | Feb 19, 2026 | 1M tokens | Not disclosed | 77.1% on ARC-AGI-2, 2× predecessor |
| Qwen 3.5 | Alibaba | Feb 16, 2026 | 256K tokens | MoE (397B total, 17B active) | Open weights, 201 languages |
| LLaMA 4 Maverick | Meta | Apr 5, 2025 | 1M tokens | MoE (400B total, 17B active) | Open weights, 128 experts |
| LLaMA 4 Scout | Meta | Apr 5, 2025 | 10M tokens | MoE (109B total, 17B active) | Open weights, largest context window |
| DeepSeek-V3 | DeepSeek | Dec 26, 2024 | 128K tokens | MoE (671B total, 37B active) | Open weights, trained at low cost |
| Grok 3 | xAI | Feb 17, 2025 | 131K tokens | Not disclosed | Real-time X (Twitter) integration |

Sources: OpenAI blog (March 5, 2026); Anthropic blog (February 5 and 17, 2026); Anthropic blog, “1M context is now generally available” (March 13, 2026); Google DeepMind (February 19, 2026); Alibaba Qwen blog (February 16, 2026); Meta AI blog (April 5, 2025); DeepSeek technical report (December 2024); xAI documentation (February 2025).

Several things stand out about this table:

  1. Context windows have converged around 1 million tokens. In early 2023, 4,000 tokens was standard. By early 2024, 128,000 was the frontier. Now, 1 million tokens is the norm for top-tier models, with LLaMA 4 Scout pushing to 10 million. One million tokens is roughly 750,000 words, about 10 full-length novels.

  2. MoE is the dominant architecture. Every model with published architecture details uses Mixture-of-Experts. The pattern is consistent: large total parameter counts (hundreds of billions to trillions) with small active parameter counts (17–37 billion per token). This gives models the knowledge capacity of a massive model with the inference speed of a much smaller one.

  3. Most frontier labs don’t publish parameter counts. OpenAI, Anthropic, and Google do not disclose exact architecture details for their flagship models. The open-weight models from Meta, DeepSeek, and Alibaba are the ones where we know the precise specifications.

  4. Open-weight models are competitive with closed ones. LLaMA 4, DeepSeek-V3, and Qwen 3.5 compete directly with GPT-5 and Claude on many benchmarks, and they’re free to download and run.

  5. The cost of building these models is staggering. Training a frontier model from scratch costs tens to hundreds of millions of dollars in compute alone. DeepSeek-V3 was notable for being trained on only 2.788 million H800 GPU hours (roughly $5.6 million in compute, at the $2 per GPU hour rental price its technical report assumes), a fraction of what Western labs spend. Most frontier labs spend $100 million to $500 million or more on a single training run, not counting the cost of the research team, data preparation, and infrastructure. We’ll cover training costs in detail in Chapter 14.
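The DeepSeek-V3 figure in the last point is simple arithmetic and worth checking by hand. The $2 per GPU hour price is the rental rate assumed in the DeepSeek-V3 technical report, not a measured cost, and the comparison range is the one quoted above.

```python
# Figures from the DeepSeek-V3 technical report; the $2 per GPU hour rental
# price is the report's own assumption, not a measured cost.
gpu_hours = 2_788_000
price_per_gpu_hour = 2.00

deepseek_cost = gpu_hours * price_per_gpu_hour
print(f"${deepseek_cost:,.0f}")

# Compare with the $100 million to $500 million range quoted above.
for frontier_cost in (100e6, 500e6):
    print(f"{frontier_cost / deepseek_cost:.0f}x")
```

The compute bill covers only the final training run; like the other figures in this section, it leaves out research staff, failed experiments, and data preparation.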


Walking Through a Prediction: Real Numbers

Let’s make this concrete by tracing what actually happens when a model predicts a single token. We’ll use simplified but realistic numbers based on a model similar in scale to LLaMA 4 Maverick.

Step 1: Tokenize the Input

The input text:

The weather in Tokyo is usually

The tokenizer breaks this into tokens. Using a typical tokenizer (like the one used by LLaMA), this becomes:

["The", " weather", " in", " Tokyo", " is", " usually"]

That’s 6 tokens. Each token gets mapped to an integer ID from the vocabulary:

[The → 450, weather → 9235, in → 304, Tokyo → 27856, is → 338, usually → 6892]

Step 2: Look Up Embeddings

Each token ID is used to look up a row in the embedding table, a giant matrix with one row per vocabulary entry. If the vocabulary has 128,000 tokens and the model’s hidden dimension is 5,120, the embedding table is a matrix of size 128,000 × 5,120. That’s 655 million numbers just for this one table.

The token “Tokyo” (ID 27856) gets mapped to a vector of 5,120 numbers:

Tokyo → [0.023, -0.041, 0.118, ..., -0.007]  (5,120 values)

These numbers aren’t random; they were learned during training. The vector for “Tokyo” is close to the vectors for “Osaka,” “Japan,” and “Kyoto” in this high-dimensional space, because those words appear in similar contexts in the training data.
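An embedding lookup is nothing more than selecting rows from a table. The sketch below uses a tiny table filled with random placeholder values (a trained model would have learned them) and then checks the size arithmetic for the full-scale table described above. The names `embedding_table` and `embed` are our own.

```python
import random

# Toy embedding table: one row of numbers per token ID. Sizes and values are
# hypothetical placeholders; a trained model learns the values.
random.seed(0)
TOY_VOCAB_SIZE, TOY_HIDDEN_DIM = 10, 4
embedding_table = [
    [round(random.uniform(-0.1, 0.1), 3) for _ in range(TOY_HIDDEN_DIM)]
    for _ in range(TOY_VOCAB_SIZE)
]


def embed(token_ids):
    """Look up one row per token ID; no arithmetic is involved."""
    return [embedding_table[token_id] for token_id in token_ids]


vectors = embed([4, 7, 2])
print(len(vectors), "vectors of", len(vectors[0]), "numbers each")

# Size of the full-scale table described above.
vocab_size, hidden_dim = 128_000, 5_120
print(f"{vocab_size * hidden_dim:,} numbers")
```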

Step 3: Add Position Information

The model needs to know that “The” is the first token and “usually” is the sixth. It gets this from positional encodings, which inject each token’s position into the computation. The original Transformer added a position vector to each embedding; most modern models instead use Rotary Position Embeddings (RoPE), which rotate the query and key vectors inside each attention layer by an angle that depends on position. We’ll cover RoPE in Chapter 6.

Either way, from here on the model works with 6 vectors of 5,120 dimensions each, and it can tell both what each token means and where it sits in the sequence.
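As a preview, here is a rough sketch of the rotary idea. Rather than adding a position vector, RoPE rotates pairs of dimensions in the query and key vectors that attention compares, by an angle proportional to the token's position. The function name `rotate`, the 4-dimensional vectors, and the base of 10,000 are illustrative choices for this sketch; real implementations work on much larger vectors inside each attention head.

```python
import math


def rotate(vector, position, base=10000.0):
    """Rotate each consecutive pair of dimensions by an angle that grows with
    position. The pair starting at dimension i uses frequency base ** (-i / dim)."""
    dim = len(vector)
    out = []
    for i in range(0, dim, 2):
        angle = position * base ** (-i / dim)
        x, y = vector[i], vector[i + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out


def dot(u, v):
    """Plain dot product, the similarity score attention uses."""
    return sum(a * b for a, b in zip(u, v))


query = [0.3, -1.2, 0.8, 0.5]  # hypothetical 4-dimensional vectors
key = [1.0, 0.4, -0.6, 0.9]

# Same distance apart (3 positions), very different absolute positions.
near_start = dot(rotate(query, 5), rotate(key, 2))
far_along = dot(rotate(query, 105), rotate(key, 102))
print(round(near_start, 6), round(far_along, 6))
```

The two scores agree because rotating both vectors by the same extra angle leaves their dot product unchanged. The attention score therefore depends on how far apart two tokens are, not on where they sit in absolute terms.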

Step 4: Pass Through Transformer Layers

This is where the heavy computation happens. The 6 vectors pass through a stack of Transformer layers. For this walkthrough we’ll assume roughly 100 of them; real models at this scale use anywhere from a few dozen to more than a hundred.

Each layer does two things:

  1. Self-attention: Each token looks at all other tokens and decides which ones are relevant. The token “usually” might attend strongly to “weather” and “Tokyo” because they’re relevant to predicting what kind of weather Tokyo usually has. This produces updated vectors that incorporate information from across the entire sequence.

  2. Feed-forward network: Each token’s vector is passed through a neural network that transforms it further. In an MoE model, this is where the routing happens: the model picks which expert sub-networks to use for each token.

After passing through all ~100 layers, each of our 6 token vectors has been transformed from a simple word embedding into a rich representation that encodes the full context of the sentence.

Step 5: Predict the Next Token

We only care about the last token’s output, the vector at position 6 (after “usually”). This vector is multiplied by another large matrix (the “language model head”) that maps it back to vocabulary size: 5,120 dimensions → 128,000 scores, one for each possible next token.

These raw scores (called logits) are then converted to probabilities using a function called softmax (covered in Chapter 2). The result might look like:

| Token | Logit (raw score) | Probability |
|---|---|---|
| mild | 4.2 | 0.18 |
| warm | 3.9 | 0.13 |
| hot | 3.7 | 0.11 |
| humid | 3.5 | 0.09 |
| pleasant | 3.3 | 0.07 |
| cool | 3.1 | 0.06 |

The model assigns the highest probability to “mild”, a reasonable prediction for Tokyo’s weather. If we’re generating text, we pick a token (maybe “mild,” maybe “warm” depending on the sampling strategy) and repeat the entire process with the new, longer sequence.
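The relationship between the logit and probability columns can be checked with a short calculation. Softmax divides each exponentiated logit by a normalizer summed over the whole vocabulary, which we cannot compute from six rows. As a workaround (our own, for illustration), the sketch infers the normalizer from the “mild” row and checks that the other rows are consistent with it.

```python
import math

# Logits and the probability of "mild" are taken from the table above.
logits = {"mild": 4.2, "warm": 3.9, "hot": 3.7, "humid": 3.5, "pleasant": 3.3, "cool": 3.1}
p_mild = 0.18

# Softmax: p(token) = exp(logit) / Z, where Z sums exp(logit) over the whole
# vocabulary. We cannot compute Z from six rows, but we can infer log(Z) from
# one known (logit, probability) pair and then check the other rows against it.
log_z = logits["mild"] - math.log(p_mild)

probabilities = {token: math.exp(logit - log_z) for token, logit in logits.items()}
for token, p in probabilities.items():
    print(f"{token:>10s}  {p:.2f}")

# Whatever is left over is spread across the rest of the vocabulary.
print(f"remaining mass: {1 - sum(probabilities.values()):.2f}")
```

A useful rule of thumb falls out of this: a gap of 0.3 between two logits means the higher-scoring token is about 1.35 times as likely, whatever the rest of the vocabulary looks like.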

See It Yourself: Next-Token Prediction in 10 Lines of Python

You don’t need a data center to see next-token prediction in action. Here’s a working example using a small open model. Install the transformers library (pip install transformers torch), and run this:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "gpt2"  # 124M parameters, small enough to run on a laptop
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

text = "The capital of France is"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1, :]  # scores for the LAST token position

probs = torch.softmax(logits, dim=0)
top5 = torch.topk(probs, 5)

for prob, idx in zip(top5.values, top5.indices):
    print(f"{tokenizer.decode(idx):>15s}  {prob:.4f}")

On a laptop, this runs in a few seconds and prints something like:

          Paris  0.0930
            the  0.0566
            one  0.0349
              a  0.0294
            not  0.0242

Even GPT-2, a model from 2019 with only 124 million parameters, puts “Paris” at the top. A frontier model with hundreds of billions of parameters would assign “Paris” a probability above 0.90. The mechanism is identical; the only difference is scale.

The Scale of This Computation

Let’s put some numbers on what just happened:

  • Embedding lookup: 6 tokens × 5,120 dimensions = 30,720 numbers retrieved
  • Each attention layer: involves multiplying matrices of size [6 × 5,120] by [5,120 × 5,120], multiple times
  • Each feed-forward layer: involves multiplying by matrices of size [5,120 × 20,480] and back
  • Total layers: ~100
  • Total multiply-add operations for this single prediction: roughly 190 billion, counting only these matrix multiplications (6 tokens × ~100 layers × about 315 million multiply-adds per token per layer)

That’s roughly 190 billion multiply-adds, or about 380 billion individual arithmetic operations, to predict one token. A modern GPU like NVIDIA’s H100 can perform about 1,000 trillion operations per second (roughly 1 petaFLOPS of dense half-precision math; the 2 petaFLOPS figure often quoted assumes structured sparsity), so at peak throughput it can produce this prediction in under a millisecond. But when you’re generating a 1,000-token response, every new token needs its own pass through the model, which adds up to tens of trillions of operations, and that’s for a single user’s request.

This is why running frontier language models requires data centers full of GPUs, and why inference costs billions of dollars per year across the industry.
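As a rough cross-check, the matrix sizes listed above can be multiplied out directly. This sketch counts only the large matrix multiplications and assumes four attention projections and two feed-forward matrices per layer. It ignores the attention-score computation, normalization, and any expert routing, so treat the result as an order-of-magnitude estimate that lands in the hundreds of billions.

```python
# Dimensions from the simplified walkthrough above (not any real model's config).
num_tokens = 6
hidden_dim = 5_120
ffn_dim = 20_480
num_layers = 100
vocab_size = 128_000

# Assumption: attention uses four hidden x hidden projections (query, key,
# value, output) and the feed-forward block uses two matrices (up and down).
attention_macs = 4 * hidden_dim * hidden_dim
ffn_macs = 2 * hidden_dim * ffn_dim
macs_per_token_per_layer = attention_macs + ffn_macs

layer_macs = num_tokens * num_layers * macs_per_token_per_layer
head_macs = hidden_dim * vocab_size  # only the last position needs the LM head
total_macs = layer_macs + head_macs

print(f"per token per layer: {macs_per_token_per_layer:,}")
print(f"total multiply-adds: {total_macs:,}")
```

Each multiply-add is two arithmetic operations, so doubling the total gives the figure to compare against a GPU's quoted operations per second.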


A Timeline of Key Milestones

Here’s a condensed timeline of the most important moments in the development of language models:

| Year | Milestone |
|---|---|
| 1948 | Claude Shannon publishes “A Mathematical Theory of Communication,” laying the groundwork for statistical language modeling |
| 1980s–2000s | N-gram models dominate NLP; used in spell checkers, speech recognition, and early machine translation |
| 2003 | Bengio et al. publish “A Neural Probabilistic Language Model,” introducing neural network-based language modeling with word embeddings |
| 2013 | Mikolov et al. at Google publish Word2Vec, enabling efficient learning of word embeddings that capture semantic relationships |
| 2014 | Sutskever et al. publish Seq2Seq; Bahdanau et al. introduce the attention mechanism for neural machine translation |
| Jun 2017 | Vaswani et al. at Google submit “Attention Is All You Need” to arXiv, introducing the Transformer architecture |
| Jun 2018 | OpenAI releases GPT-1 (117M parameters): first Generative Pre-trained Transformer |
| Oct 2018 | Google releases BERT: bidirectional Transformer that dominates NLP benchmarks |
| Feb 2019 | OpenAI releases GPT-2 (1.5B parameters): generates coherent paragraphs |
| Jun 2020 | OpenAI releases GPT-3 (175B parameters): demonstrates few-shot learning |
| Nov 2022 | OpenAI launches ChatGPT: reaches 100M users in two months |
| Mar 2023 | OpenAI releases GPT-4: passes bar exam, handles images, rumored MoE architecture |
| Dec 2024 | DeepSeek releases V3 (671B total, 37B active): open-weight MoE model matches frontier performance |
| Apr 2025 | Meta releases LLaMA 4 (up to 10M token context): first open-weight natively multimodal MoE models |
| Aug 2025 | OpenAI releases GPT-5: confirmed MoE, 400K context, native multimodal |
| Dec 2025 | OpenAI releases GPT-5.2: 400K context, 128K output, three variants (Instant, Thinking, Pro) |
| Feb 2026 | Google releases Gemini 3.1 Pro (1M context); Alibaba releases Qwen 3.5 (397B total, 17B active); Anthropic releases Claude Opus 4.6 and Sonnet 4.6 |
| Mar 2026 | OpenAI releases GPT-5.4 (1M context, native computer use); Anthropic makes Claude 4.6 1M context GA |

The pattern is clear: models have gotten larger, faster, and more capable at an extraordinary rate. The Transformer architecture from 2017 is still the foundation of every model on this list; what’s changed is the scale, the training data, the fine-tuning techniques, and architectural refinements like MoE.


Key Takeaways

  • A language model predicts the next token given a sequence of text. That’s the core idea behind every LLM.
  • Tokens are chunks of text (not always whole words) from a fixed vocabulary of ~100K–200K entries.
  • Autoregressive generation means the model predicts one token at a time, feeding each prediction back as input.
  • Next-token prediction forces the model to learn grammar, facts, reasoning patterns, and more, but this is statistical pattern matching, not human-like understanding. Models can and do hallucinate.
  • N-gram models (1980s–2000s) predicted the next word by counting occurrences in training data, but were limited to short context windows of 3–5 words.
  • Neural language models (2003+) replaced counting with learned representations (embeddings), enabling generalization to unseen word combinations.
  • RNNs and LSTMs (2014–2016) processed text sequentially and introduced the attention mechanism, but were slow to train and struggled with long sequences.
  • The Transformer (2017) replaced sequential processing with self-attention, enabling massive parallelism and direct long-range connections between any two positions.
  • The scaling era (2018–2023) saw models grow from 117 million parameters (GPT-1) to 175 billion (GPT-3) and beyond, with each jump bringing qualitatively new capabilities.
  • Mixture-of-Experts (2024–2026) became the dominant architecture, allowing models with hundreds of billions or trillions of total parameters to run efficiently by activating only a small fraction per token.
  • As of March 2026, frontier models from OpenAI, Anthropic, Google, and Meta support context windows of 1 million tokens or more, and open-weight models from Meta, DeepSeek, and Alibaba compete directly with closed ones.

What’s Next

Now that you understand what a language model does (predict the next token), the natural question is: how does it actually do the math? In Chapter 2, we’ll cover the essential mathematical operations that make all of this work: vectors, matrix multiplication, dot products, and softmax. No more and no less than what you need to understand the rest of this book.