Chapter 13. Scaling Laws, Why Bigger = Smarter

Part 4. Scaling, From Toy Models to Frontier

In Chapters 11 and 12, you saw that frontier language models have grown from millions to hundreds of billions of parameters, and that the industry has converged on specific design points like the ~400B total / 17B active MoE architecture. But this raises a fundamental question: how do researchers decide how big to make a model, and how much data to train it on? The answer lies in scaling laws, a set of empirical relationships that predict, with remarkable precision, how model performance improves as you increase model size, training data, and compute. These laws have guided every major training decision in the industry since 2020, and understanding them is essential to understanding why modern LLMs are built the way they are.

The Discovery: Performance Is Predictable

Before 2020, training a large language model was largely a matter of intuition and trial-and-error. Researchers would pick a model size, pick a dataset, train for a while, and see how it performed. There was a general sense that “bigger is better,” but no precise formula for how much better, or how to allocate a fixed budget between making the model bigger versus training it on more data.

That changed in January 2020, when a team at OpenAI led by Jared Kaplan (with co-authors including Dario Amodei, who later co-founded Anthropic) published “Scaling Laws for Neural Language Models.” This paper demonstrated that language model performance follows power-law relationships with three variables: the number of parameters (N), the amount of training data in tokens (D), and the total compute used for training (C). These relationships held across seven orders of magnitude in compute, six orders of magnitude in model size, and two orders of magnitude in data.

A power law is a mathematical relationship where one quantity varies as a fixed power of another. In plain terms: if you double the input, the output changes by a fixed multiplier, and this pattern repeats no matter how large the input gets. The key property of a power law is that it appears as a straight line on a log-log plot (where both axes use logarithmic scales).
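To see the straight-line property concretely, the short sketch below generates points from a made-up power law and fits a line in log-log space. The constants a and b are illustrative assumptions, not values taken from any paper.

```python
import numpy as np

# Hypothetical power law y = a * x^(-b); a and b are illustrative values only.
a, b = 5.0, 0.076
x = np.logspace(6, 12, num=7)   # inputs spanning six orders of magnitude
y = a * x ** (-b)

# On log-log axes the relationship is linear: log y = log a - b * log x.
slope, intercept = np.polyfit(np.log10(x), np.log10(y), deg=1)
print(f"Fitted slope: {slope:.3f} (expected {-b})")
print(f"Fitted intercept: {intercept:.3f} (expected log10(a) = {np.log10(a):.3f})")

# Doubling x multiplies y by the same factor, 2^(-b), at every scale.
ratios = a * (2 * x) ** (-b) / y
print(f"Effect of doubling x at each scale: {ratios.round(4)}")
```

The fitted slope recovers the exponent, which is essentially how scaling-law exponents are read off experimental curves.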

Source: Kaplan et al., “Scaling Laws for Neural Language Models,” arXiv:2001.08361, January 2020. OpenAI and Johns Hopkins University.

The Kaplan Scaling Laws (2020)

Kaplan et al. found three core power-law relationships. Each one describes how the cross-entropy loss (the model’s prediction error, as defined in Chapter 3) decreases as you scale up one variable while keeping the others sufficiently large.

Loss vs. Model Size

When you train models of different sizes on enough data that they are not data-limited:

L(N) = (N_c / N) ^ alpha_N

Where:

L(N) is the cross-entropy loss
N is the number of non-embedding parameters
N_c is a constant (approximately 8.8 x 10^13)
alpha_N is approximately 0.076

This means that every time you increase the model size by 10x, the loss is multiplied by the same fixed factor. Specifically, a 10x increase in parameters divides the loss by 10^0.076 ≈ 1.19, which is a reduction of about 16% (1 / 1.19 ≈ 0.84). The improvement is real but diminishing in absolute terms: going from 1 billion to 10 billion parameters gives the same proportional improvement as going from 10 billion to 100 billion, even though the second step costs ten times as many additional parameters.
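The formula can be evaluated directly. The sketch below plugs the constants quoted above into L(N) and checks what a 10x step does to the predicted loss; the function name is our own, and the model sizes are arbitrary examples.

```python
# Kaplan et al. fitted constants for the model-size law (values quoted above).
N_C = 8.8e13
ALPHA_N = 0.076

def kaplan_loss_from_params(n_params):
    """L(N) = (N_c / N)^alpha_N, assuming data is not the bottleneck."""
    return (N_C / n_params) ** ALPHA_N

for n in (1e9, 1e10, 1e11):
    print(f"N = {n:.0e}: predicted loss = {kaplan_loss_from_params(n):.3f}")

# Each 10x step multiplies the loss by the same factor, 10^(-alpha_N).
step_factor = kaplan_loss_from_params(1e10) / kaplan_loss_from_params(1e9)
next_step_factor = kaplan_loss_from_params(1e11) / kaplan_loss_from_params(1e10)
print(f"Loss multiplier per 10x parameters: {step_factor:.3f}")
```

The multiplier is the same for every decade of model size, which is exactly the "fixed factor" behavior of a power law.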

Loss vs. Dataset Size

When you train a sufficiently large model on datasets of different sizes:

L(D) = (D_c / D) ^ alpha_D

Where:

D is the number of training tokens
D_c is a constant (approximately 5.4 x 10^13)
alpha_D is approximately 0.095

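The data exponent is slightly larger than the parameter exponent (alpha_D > alpha_N), so a 10x increase in data divides the loss by 10^0.095 ≈ 1.24, compared with 1.19 for a 10x increase in parameters. Both follow the same power-law pattern.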

Loss vs. Compute

When you train with an optimal balance of model size and data for a given compute budget:

L(C) = (C_c / C) ^ alpha_C

Where:

C is the total compute in floating-point operations (FLOPs)
C_c is a fitted constant, analogous to N_c and D_c
alpha_C is approximately 0.050

What These Numbers Mean in Practice

Let’s make this concrete. Suppose you have a model achieving a cross-entropy loss of 3.0, and you want to reduce it to 2.8. The scaling laws tell you roughly how much you need to scale:

import numpy as np

alpha_N = 0.076  # Kaplan exponent for model size
alpha_D = 0.095  # Kaplan exponent for dataset size

# To reduce loss by a factor of (2.8/3.0), how much do we need to scale?
loss_ratio = 2.8 / 3.0  # ~0.933

# N_new / N_old = (loss_ratio) ^ (-1/alpha_N)
scale_N = loss_ratio ** (-1 / alpha_N)
print(f"Scale model size by: {scale_N:.1f}x")

# D_new / D_old = (loss_ratio) ^ (-1/alpha_D)
scale_D = loss_ratio ** (-1 / alpha_D)
print(f"Scale dataset size by: {scale_D:.1f}x")

Running this code shows that to reduce loss by just 7% (from 3.0 to 2.8), you would need to increase model size by roughly 2.5x or increase dataset size by roughly 2.1x. This illustrates a key property of power laws: every further 7% reduction costs the same multiplier again, so the absolute resources required compound quickly. Two such steps need about 6x the parameters, and three need about 15x.

The Key Insight from Kaplan

The most influential finding from Kaplan et al. was about how to allocate compute. They found that when you have a fixed compute budget, you should spend most of it on making the model bigger, not on training longer. Specifically, they found that optimal model size should scale as N proportional to C^0.73, while the number of training steps should scale as S proportional to C^0.03. This meant that as compute budgets grew, models should get much bigger but train for only slightly longer.
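A quick calculation shows how lopsided this allocation is. The sketch below applies the exponents to a 100x increase in compute. The batch-size exponent (0.24) also comes from Kaplan et al. but is not discussed in the text above; it is included only so the three exponents account for all of the compute.

```python
# Exponents reported by Kaplan et al. for compute-optimal allocation.
# The batch-size entry is an addition for completeness (see lead-in).
KAPLAN_EXPONENTS = {
    "model size N": 0.73,
    "batch size B": 0.24,
    "training steps S": 0.03,
}

def growth_factor(compute_multiplier, exponent):
    """How much a quantity grows when compute grows by compute_multiplier."""
    return compute_multiplier ** exponent

compute_multiplier = 100
for name, exponent in KAPLAN_EXPONENTS.items():
    factor = growth_factor(compute_multiplier, exponent)
    print(f"{name}: grows {factor:.2f}x for {compute_multiplier}x compute")
```

Under these exponents, almost all of a larger budget goes into parameters, while the number of optimizer steps barely moves.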

This finding directly influenced the design of GPT-3 (175 billion parameters, trained on 300 billion tokens). Under Kaplan’s framework, GPT-3 was roughly compute-optimal: a very large model trained for a relatively small number of tokens.

Source: Kaplan et al. (arXiv:2001.08361). GPT-3 from Brown et al., “Language Models are Few-Shot Learners,” NeurIPS 2020. GPT-3 had 175B parameters and was trained on 300B tokens.

The Chinchilla Revolution (2022)

Two years after Kaplan’s paper, a team at DeepMind led by Jordan Hoffmann published a paper that fundamentally changed how the industry trains language models. The paper, “Training Compute-Optimal Large Language Models” (March 2022), is universally known by the name of the model they trained to validate their findings: Chinchilla.

The Core Finding

Hoffmann et al. trained over 400 language models ranging from 70 million to over 16 billion parameters, on datasets ranging from 5 billion to 500 billion tokens. Their central finding was that Kaplan’s recommendation to prioritize model size over data was wrong. Instead, they found that for compute-optimal training, model size and training tokens should be scaled at equal rates: for every doubling of model size, the number of training tokens should also be doubled.

This translates to a simple rule of thumb: the optimal number of training tokens is approximately 20 times the number of parameters. A 10 billion parameter model should be trained on about 200 billion tokens. A 70 billion parameter model should be trained on about 1.4 trillion tokens.

Why Kaplan Was Wrong

The disagreement between Kaplan and Chinchilla comes down to methodology. Kaplan et al. used non-embedding parameters in their analysis and performed their experiments at relatively small scale, which magnified the bias from excluding embedding parameters. They also did not adjust the learning rate schedule to match the training duration, which biased their results toward favoring larger models trained for fewer steps. Additionally, Kaplan’s formulation assumed the irreducible loss was zero, rather than fitting it as a free parameter. Hoffmann et al. corrected these issues by using total parameters, training at larger scale, properly adjusting the learning rate schedule for each training run, and fitting the irreducible loss as part of their model.

The practical consequence was enormous. Kaplan’s framework put almost three quarters of every compute increase (in log terms) into model size: N proportional to C^0.73, leaving roughly C^0.27 for data, a ratio of about 3:1 between the exponents. Chinchilla showed the correct split was closer to 1:1, with N and D each proportional to roughly C^0.5. This meant that many existing models were severely undertrained: they had too many parameters relative to the amount of data they were trained on.

Source: Hoffmann et al., “Training Compute-Optimal Large Language Models,” arXiv:2203.15556, March 2022. DeepMind. Published in NeurIPS 2022. For a detailed analysis of the methodological differences between Kaplan and Chinchilla, see Pearce and Song, “Reconciling Kaplan and Chinchilla Scaling Laws,” arXiv:2406.12907, June 2024.

The Chinchilla Model

To prove their point, Hoffmann et al. trained Chinchilla: a 70 billion parameter model trained on 1.4 trillion tokens. This used the same compute budget as Gopher, DeepMind’s previous flagship model, which had 280 billion parameters but was trained on only 300 billion tokens.

The result was decisive: Chinchilla outperformed Gopher on virtually every benchmark, despite having 4x fewer parameters. It also outperformed GPT-3 (175B parameters, 300B tokens) and other larger models. The lesson was clear: a smaller model trained on more data beats a larger model trained on less data, when the total compute is the same.

Model	Parameters	Training Tokens	Tokens/Param Ratio	Compute Budget
GPT-3	175B	300B	1.7	~3.14 x 10^23 FLOPs
Gopher	280B	300B	1.1	~5.0 x 10^23 FLOPs
Chinchilla	70B	1.4T	20.0	~5.0 x 10^23 FLOPs

GPT-3 and Gopher were both massively undertrained by Chinchilla standards. GPT-3’s ratio of 1.7 tokens per parameter was roughly 12x below the Chinchilla-optimal ratio of 20. Gopher’s ratio of 1.1 was even worse: nearly 18x below optimal.
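The same comparison can be read through the parametric loss that Hoffmann et al. fitted in their paper, L(N, D) = E + A / N^alpha + B / D^beta. The sketch below uses the constants reported for that fit (their third estimation approach). Treat the outputs as rough predictions from one fitted curve rather than measured losses; the function name is our own.

```python
# Parametric fit reported by Hoffmann et al.:
#   L(N, D) = E + A / N^alpha + B / D^beta
# E is the irreducible loss; the other terms shrink with model size and data.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def chinchilla_loss(n_params, n_tokens):
    """Predicted loss for a model with n_params trained on n_tokens."""
    return E + A / n_params ** ALPHA + B / n_tokens ** BETA

gopher_loss = chinchilla_loss(280e9, 300e9)         # 280B params, 300B tokens
chinchilla_model_loss = chinchilla_loss(70e9, 1.4e12)  # 70B params, 1.4T tokens

print(f"Gopher-shaped run:     predicted loss {gopher_loss:.3f}")
print(f"Chinchilla-shaped run: predicted loss {chinchilla_model_loss:.3f}")
```

The smaller, longer-trained configuration gets the lower predicted loss, because at Gopher's shape the data term dominates the reducible loss.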

Sources: GPT-3 from Brown et al. (NeurIPS 2020): 175B parameters, 300B tokens, ~3.14 x 10^23 FLOPs (per Patterson et al.). Gopher from Rae et al., “Scaling Language Models: Methods, Analysis & Insights from Training Gopher,” arXiv:2112.11446, December 2021. DeepMind. 280B parameters, 300B tokens, trained on MassiveText corpus. Chinchilla from Hoffmann et al. (arXiv:2203.15556): 70B parameters, 1.4T tokens.

Visualizing the Chinchilla Scaling Law

Let’s compute the Chinchilla-optimal training tokens for a range of model sizes, and compare them to what real models actually used:

import numpy as np

def chinchilla_optimal_tokens(params_billions):
    """Chinchilla-optimal training tokens: ~20 tokens per parameter."""
    return params_billions * 20

models = [
    # (name, params_B, actual_tokens_T, year)
    ("GPT-3",           175,   0.3,  2020),
    ("Gopher",          280,   0.3,  2021),
    ("Chinchilla",       70,   1.4,  2022),
    ("LLaMA 1 65B",      65,   1.4,  2023),
    ("LLaMA 2 70B",      70,   2.0,  2023),
    ("LLaMA 3 8B",        8,  15.0,  2024),
    ("LLaMA 3 70B",      70,  15.0,  2024),
    ("LLaMA 3.1 405B",  405,  15.6,  2024),
]

print(f"{'Model':<20} {'Params':>7} {'Actual':>8} {'Optimal':>8} {'Ratio':>7} {'Status'}")
print(f"{'':20} {'(B)':>7} {'(T tok)':>8} {'(T tok)':>8} {'T/P':>7}")
print("-" * 70)

for name, params, actual_T, year in models:
    optimal_T = chinchilla_optimal_tokens(params) / 1000  # convert B to T
    ratio = (actual_T * 1000) / params  # tokens per parameter
    if ratio < 15:
        status = "Undertrained"
    elif ratio < 25:
        status = "~Optimal"
    else:
        status = f"Overtrained {ratio/20:.1f}x"
    print(f"{name:<20} {params:>6}B {actual_T:>7.1f}T {optimal_T:>7.1f}T {ratio:>6.1f} {status}")

print("\nChinchilla-optimal ratio: ~20 tokens per parameter")

This table reveals a dramatic shift in the industry. Pre-Chinchilla models (GPT-3, Gopher) had token-to-parameter ratios far below 20, meaning they were severely undertrained. Post-Chinchilla models like LLaMA 1 65B and Chinchilla itself hit the optimal ratio, and LLaMA 2 70B sits only modestly above it (about 29 tokens per parameter). But then something interesting happened: starting with LLaMA 3 in 2024, models began to be deliberately overtrained, with token-to-parameter ratios far exceeding the Chinchilla-optimal 20.

Beyond Chinchilla: The Overtraining Era

The Chinchilla scaling law tells you how to minimize loss for a given compute budget during training. But training compute is only half the story. Once a model is trained, it must be deployed for inference (generating responses to user queries), and inference compute depends on the model’s active parameter count, not on how much data it was trained on.

This creates a different optimization problem. If you plan to serve a model to millions of users, the total inference compute over the model’s lifetime may far exceed the training compute. In that case, it makes economic sense to spend extra compute during training (by training a smaller model on more data) to reduce the per-query inference cost.

Sardana et al. formalized this intuition in “Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws” (ICML 2024). They modified the Chinchilla scaling laws to include inference demand as a variable. Their key finding: when a model will serve a large number of inference requests over its lifetime, the optimal strategy is to train a smaller model on significantly more data than Chinchilla recommends. The higher the expected inference demand, the smaller the model should be (and the more it should be overtrained). This provides the mathematical justification for the overtraining trend that the industry had already adopted empirically.
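The trade-off can be sketched with a few lines of arithmetic. The example below is a simplified illustration in the spirit of that analysis, not a reproduction of it. It assumes the parametric loss fit from Hoffmann et al., training cost of about 6ND FLOPs, and inference cost of about 2N FLOPs per token. The 10B "small" model is a hypothetical choice.

```python
# Chinchilla parametric fit (Hoffmann et al.): L = E + A/N^alpha + B/D^beta
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params, n_tokens):
    return E + A / n_params ** ALPHA + B / n_tokens ** BETA

def tokens_to_match(target_loss, n_params):
    """Training tokens a model with n_params needs to reach target_loss."""
    residual = target_loss - E - A / n_params ** ALPHA
    if residual <= 0:
        raise ValueError("model too small to reach this loss at any data size")
    return (B / residual) ** (1 / BETA)

def lifetime_flops(n_params, train_tokens, inference_tokens):
    """Training (~6ND) plus inference (~2N per token served)."""
    return 6 * n_params * train_tokens + 2 * n_params * inference_tokens

large_n, large_d = 70e9, 1.4e12   # Chinchilla-style reference model
small_n = 10e9                    # hypothetical smaller model
small_d = tokens_to_match(loss(large_n, large_d), small_n)

# Inference volume at which both options cost the same over their lifetime.
break_even = 3 * (small_n * small_d - large_n * large_d) / (large_n - small_n)

print(f"Small model needs {small_d / 1e12:.1f}T training tokens to match the loss")
print(f"Extra training cost: {small_n * small_d / (large_n * large_d):.2f}x")
print(f"Break-even inference volume: {break_even / 1e12:.1f}T tokens")
```

Below the break-even volume the larger, Chinchilla-style model is cheaper overall; above it, the overtrained small model wins, and the gap widens with every additional token served.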

Source: Sardana, Portes, Doubov, and Frankle, “Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws,” arXiv:2401.00448, ICML 2024 (PMLR Vol. 235). Databricks / MosaicML.

This is exactly what Meta did with LLaMA 3. The 8B parameter model was trained on 15 trillion tokens, giving it a token-to-parameter ratio of approximately 1,875:1, nearly 94x the Chinchilla-optimal ratio. The 70B model was trained on the same 15 trillion tokens, giving it a ratio of about 214:1, roughly 11x Chinchilla-optimal. Meta’s Llama 3 announcement explicitly acknowledged this: they observed that both models “continued to improve log-linearly after we trained them on up to 15T tokens,” meaning the models were still getting better even at these extreme overtraining ratios.

The result justified the approach: LLaMA 3 8B outperformed LLaMA 2 70B on many benchmarks, despite being nearly 9x smaller. By investing heavily in training data (which is a one-time cost), Meta produced a model that was dramatically cheaper to serve (an ongoing cost) while delivering comparable or better quality.

Source: Meta, “The Llama 3 Herd of Models,” arXiv:2407.21783, July 2024. LLaMA 3 8B and 70B both trained on over 15 trillion tokens. LLaMA 3.1 405B trained on 15.6 trillion tokens using 16,000 NVIDIA H100 GPUs, requiring approximately 30.84 million GPU hours.

The Modern Training Data Landscape

By March 2026, the amount of training data used by frontier models has grown enormously:

Model	Year	Parameters	Training Tokens	Tokens/Param
GPT-3	2020	175B	300B	1.7
Chinchilla	2022	70B	1.4T	20
LLaMA 3.1 405B	2024	405B	15.6T	38.5
DeepSeek-V3	2024	671B (37B active)	14.8T	22.1 (total) / 400 (active)
Qwen 3 235B	2025	235B (22B active)	36T	153 (total) / 1,636 (active)
Qwen 3-Max	2025	1T+ (MoE)	36T	~36 (total)

The trend is clear: training datasets have grown from hundreds of billions of tokens to tens of trillions. Qwen 3 was trained on approximately 36 trillion tokens spanning 119 languages, double the corpus used for Qwen 2.5. DeepSeek-V3 was pre-trained on 14.8 trillion tokens. Qwen 3-Max, Alibaba’s closed-weight flagship (preview released September 5, 2025; full release at the Yunqi Conference on September 24, 2025), scales to over 1 trillion total parameters while using the same 36 trillion token training corpus as the rest of the Qwen 3 family.

For MoE models, the token-to-parameter ratio can be calculated two ways: relative to total parameters or relative to active parameters. DeepSeek-V3’s ratio of 22 tokens per total parameter is close to Chinchilla-optimal, but its ratio of 400 tokens per active parameter means each active parameter has been trained on far more data than a Chinchilla-optimal dense model would see. This is one reason MoE models can achieve strong performance with relatively few active parameters: each active parameter is extremely well-trained.
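The two ways of counting are easy to compute side by side. The sketch below uses the parameter and token counts from the table above; the helper name is our own.

```python
def tokens_per_param(train_tokens, n_params):
    """Training tokens seen per parameter."""
    return train_tokens / n_params

# (training tokens, total parameters, active parameters per token)
moe_models = {
    "DeepSeek-V3": (14.8e12, 671e9, 37e9),
    "Qwen 3 235B": (36e12, 235e9, 22e9),
}

ratios = {}
for name, (tokens, total, active) in moe_models.items():
    ratios[name] = (tokens_per_param(tokens, total), tokens_per_param(tokens, active))
    per_total, per_active = ratios[name]
    print(f"{name}: {per_total:.1f} tokens per total parameter, "
          f"{per_active:.0f} tokens per active parameter")
```

Which ratio is the better guide depends on the question: total parameters govern memory and storage, while active parameters govern per-token compute.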

Sources: DeepSeek-V3 technical report (arXiv:2412.19437, December 2024): 14.8T training tokens. Qwen 3 from Alibaba (April 29, 2025): approximately 36T training tokens spanning 119 languages. Qwen 3-Max from Alibaba (preview September 5, 2025; full release September 24, 2025 at Yunqi Conference): 1T+ total parameters, MoE architecture, 36T training tokens (Alibaba Cloud blog, “Qwen3-Max: Just Scale it”).

Compute-Optimal Training: A Worked Example

Let’s walk through the compute-optimal training calculation step by step, so you can see exactly how researchers decide on model size and training data.

The total compute for training a dense Transformer is approximately:

C ≈ 6 * N * D

Where C is in FLOPs, N is the number of parameters, and D is the number of training tokens. The factor of 6 comes from the forward pass (about 2 FLOPs per parameter per token: one multiply and one add) plus the backward pass (about 4 FLOPs per parameter per token, since computing gradients costs roughly twice as much as the forward pass).
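As a sanity check on this approximation, the sketch below applies it to GPT-3 and Gopher and compares the results with the compute figures quoted earlier in the chapter. The function name is our own, and the 2 + 4 split is the usual rule of thumb rather than an exact operation count.

```python
def training_flops(n_params, n_tokens):
    """Approximate dense-Transformer training compute, C ~ 6 * N * D."""
    forward = 2 * n_params * n_tokens    # forward pass: ~2 FLOPs per parameter per token
    backward = 4 * n_params * n_tokens   # backward pass: roughly twice the forward cost
    return forward + backward

gpt3_flops = training_flops(175e9, 300e9)
gopher_flops = training_flops(280e9, 300e9)

print(f"GPT-3:  {gpt3_flops:.2e} FLOPs")
print(f"Gopher: {gopher_flops:.2e} FLOPs")
```

The GPT-3 estimate lands very close to the ~3.14 x 10^23 FLOPs figure cited above, which is why this simple formula is so widely used for planning.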

Under the Chinchilla scaling law, the optimal allocation is D ≈ 20 * N. Substituting:

C ≈ 6 * N * (20 * N) = 120 * N^2

This means compute scales as the square of the model size for Chinchilla-optimal training. Let’s compute this for several model sizes:

import numpy as np

def compute_optimal_training(params_B):
    """Calculate Chinchilla-optimal training requirements."""
    N = params_B * 1e9
    D = 20 * N  # Chinchilla-optimal tokens
    C = 6 * N * D  # Total FLOPs

    # H100 GPU: ~990 TFLOPS for FP16 (theoretical peak)
    # Practical utilization: ~40-50% for large training runs
    h100_practical_tflops = 990 * 0.45  # ~445 TFLOPS
    h100_flops_per_second = h100_practical_tflops * 1e12

    gpu_seconds = C / h100_flops_per_second
    gpu_hours = gpu_seconds / 3600

    return {
        'params_B': params_B,
        'tokens_T': D / 1e12,
        'flops': C,
        'gpu_hours': gpu_hours,
    }

sizes = [1, 7, 70, 405]
print(f"{'Model Size':>12} {'Tokens':>10} {'FLOPs':>14} {'H100 GPU-hrs':>14}")
print("-" * 55)

for size in sizes:
    r = compute_optimal_training(size)
    print(f"{r['params_B']:>10.0f}B {r['tokens_T']:>8.1f}T "
          f"{r['flops']:>13.2e} {r['gpu_hours']:>13,.0f}")

print(f"\nAssumptions: Chinchilla-optimal (20 tokens/param)")
print(f"H100 at 45% MFU (~445 TFLOPS effective)")
print(f"\nNote: Real training runs use thousands of GPUs in parallel.")
print(f"A 70B model needing ~0.37M GPU-hours on 2,000 H100s takes ~8 days.")

This calculation shows why training frontier models is so expensive. A Chinchilla-optimal 405B model requires approximately 2 x 10^25 FLOPs, which translates to roughly 12 million H100 GPU-hours at the 45% utilization assumed above. At $2-3 per GPU-hour (typical cloud pricing), this is tens of millions of dollars in compute alone, not counting data preparation, engineering, failed experiments, or hardware costs.
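In practice the question usually runs the other way: given a compute budget, how big should the model be? Inverting C ≈ 120 * N^2 answers it. The sketch below is a minimal version that assumes the 20 tokens-per-parameter rule of thumb; the budgets in the loop are arbitrary examples.

```python
import math

def chinchilla_allocation(compute_flops, tokens_per_param=20):
    """Invert C = 6 * N * D with D = tokens_per_param * N.

    Returns (parameters, training tokens) for a given FLOP budget.
    """
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for budget in (1e21, 1e23, 5.88e23, 1e25):
    n, d = chinchilla_allocation(budget)
    print(f"C = {budget:.2e} FLOPs -> N = {n / 1e9:7.1f}B params, "
          f"D = {d / 1e12:6.2f}T tokens")
```

Because compute grows with the square of model size, a 100x larger budget buys only a 10x larger compute-optimal model.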

Emergent Abilities: Capabilities That Appear Suddenly at Scale

One of the most debated phenomena in the scaling laws literature is the concept of emergent abilities: capabilities that appear to be absent in smaller models but suddenly manifest in larger ones. This idea was formalized by Jason Wei et al. in their August 2022 paper “Emergent Abilities of Large Language Models,” published in Transactions on Machine Learning Research (TMLR).

The Original Claim

Wei et al. defined an ability as emergent if it is “not present in smaller models but is present in larger models.” They documented dozens of tasks from the BIG-bench benchmark suite where model performance was near-random (essentially guessing) for models below a certain size threshold, then jumped sharply to well-above-random performance once the model crossed that threshold. The transition appeared sudden and unpredictable: there was no way to forecast when a particular ability would emerge by extrapolating from smaller models.

Examples of tasks that showed this pattern included multi-step arithmetic, word unscrambling, and certain types of logical reasoning. For instance, a model with 10 billion parameters might score 0% on three-digit addition, while a model with 100 billion parameters might score 80%. The jump appeared discontinuous, like a phase transition in physics.

This finding had profound implications. If capabilities truly emerge unpredictably at scale, then the only way to discover what a model can do is to build it and test it. You cannot predict the capabilities of a 1-trillion-parameter model by studying 10-billion-parameter models. This made scaling both exciting (what new abilities might appear?) and concerning (what dangerous capabilities might emerge without warning?).

Source: Wei et al., “Emergent Abilities of Large Language Models,” Transactions on Machine Learning Research (TMLR), August 2022. Google Research and Stanford University.

The Counterargument: Are Emergent Abilities a Mirage?

In 2023, Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo published “Are Emergent Abilities of Large Language Models a Mirage?” at NeurIPS 2023 (the Conference on Neural Information Processing Systems). Their paper argued that the apparent emergence of abilities is an artifact of the evaluation metrics used, not a fundamental property of scaling.

Their key insight was that the tasks showing “emergent” behavior were overwhelmingly evaluated using nonlinear or discontinuous metrics like exact-match accuracy (either the answer is exactly right or it scores 0). When you use such metrics, a model that is gradually improving its internal representation of a task will show no improvement in the metric until it crosses the threshold of getting the answer exactly right. The jump from 0% to 80% accuracy does not mean the model suddenly acquired a new ability; it means the model’s gradually improving capability finally crossed the threshold where the metric could detect it.

Schaeffer et al. demonstrated this by re-evaluating the same tasks using continuous metrics like token-level log-likelihood or partial credit scoring. When they did this, the sharp transitions disappeared. Performance improved smoothly and predictably with scale, following the same power-law patterns as the loss function. There was no phase transition, just a measurement artifact.
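A toy calculation makes the argument tangible. The sketch below assumes per-token accuracy improves smoothly across eight model sizes, and that an answer counts as correct only if every one of its tokens is right. All numbers are illustrative assumptions, including the simplification that token errors are independent.

```python
import numpy as np

answer_length = 10                           # hypothetical: tokens that must all be correct
per_token_acc = np.linspace(0.50, 0.99, 8)   # illustrative smooth gains across 8 model sizes

# Exact match requires every token to be right (assuming independent errors),
# so it equals the per-token accuracy raised to the answer length.
exact_match = per_token_acc ** answer_length

for step, (p, em) in enumerate(zip(per_token_acc, exact_match), start=1):
    print(f"model size step {step}: per-token {p:.2f}  exact-match {em:.3f}")
```

The per-token column rises in even steps, while the exact-match column stays near zero for most of the range and then climbs steeply: the same underlying improvement, viewed through two different metrics.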

This does not mean that larger models are not more capable. They clearly are. The scaling laws show that loss decreases smoothly with scale, and lower loss translates to better performance on downstream tasks. The debate is about whether the improvement is smooth and predictable (as the scaling laws suggest) or sudden and unpredictable (as the emergence narrative claims). The current evidence leans toward smooth improvement that can appear sudden when measured with coarse metrics.

Source: Schaeffer, Miranda, and Koyejo, “Are Emergent Abilities of Large Language Models a Mirage?”, NeurIPS 2023 (arXiv:2304.15004). Presented at the 37th Conference on Neural Information Processing Systems, New Orleans, December 2023.

What This Means for Practitioners

The practical takeaway is nuanced. On one hand, the scaling laws are remarkably predictive: if you know the loss at one scale, you can predict the loss at a larger scale with good accuracy. On the other hand, translating loss improvements into specific task performance is harder, because task performance depends on the evaluation metric, the difficulty distribution of the task, and the specific capabilities required.

For model developers, this means:

Scaling laws can reliably predict training loss, which is useful for planning compute budgets
Predicting specific benchmark scores from scaling laws is less reliable, especially for tasks with sharp pass/fail criteria
The safest assumption is that capabilities improve smoothly with scale, but the practical impact of those improvements may appear sudden when measured with binary metrics

Diminishing Returns and the Limits of Scaling

The power-law nature of scaling laws has a sobering implication: improvements get progressively more expensive. Because the exponents are small (alpha_N ≈ 0.076, alpha_C ≈ 0.050), each additional unit of improvement requires a multiplicative increase in resources.

The Math of Diminishing Returns

Let’s quantify this. Under the Kaplan scaling law for compute, L(C) = (C_c / C)^0.050. This formula describes the reducible portion of the loss (the part above the irreducible floor). To halve this reducible loss, you need to increase compute by:

import numpy as np

alpha_C = 0.050

# To halve the loss: L_new / L_old = 0.5
# (C_c/C_new)^alpha / (C_c/C_old)^alpha = 0.5
# (C_old/C_new)^alpha = 0.5
# C_new/C_old = 0.5^(-1/alpha) = 2^(1/alpha)

scale_factor = 2 ** (1 / alpha_C)
print(f"To halve the loss, multiply compute by: {scale_factor:,.0f}x")
print(f"That's {np.log10(scale_factor):.1f} orders of magnitude")

# To reduce loss by 10%:
scale_10pct = (0.9) ** (-1 / alpha_C)
print(f"\nTo reduce loss by 10%, multiply compute by: {scale_10pct:,.0f}x")

# To reduce loss by 1%:
scale_1pct = (0.99) ** (-1 / alpha_C)
print(f"To reduce loss by 1%, multiply compute by: {scale_1pct:.1f}x")

The numbers are striking. To halve the loss requires increasing compute by over a million times. Even a modest 10% reduction in loss requires roughly an 8x increase in compute. This is the fundamental challenge of scaling: the low-hanging fruit has been picked, and each subsequent improvement demands exponentially more resources.

This does not mean scaling is useless. A 10% reduction in loss can translate to meaningful improvements in real-world task performance, especially for tasks near the model’s capability boundary. But it does mean that the era of dramatic improvements from simply making models bigger is approaching its limits, at least for pre-training scaling alone.

The Irreducible Loss

There is also a theoretical floor: the irreducible loss (also called the entropy of natural language). No model, no matter how large, can predict natural language perfectly, because language contains genuine randomness and ambiguity. If someone writes “I went to the ___,” the next word could be “store,” “park,” “doctor,” or countless other options, and no amount of training data can resolve this ambiguity.

The irreducible loss for English text is estimated to be somewhere around 1.0-1.7 nats (natural units of information), depending on the text domain and tokenizer. For web text with BPE tokenization (the standard used by most modern models), estimates cluster around 1.5-1.7 nats. Current frontier models achieve losses in the range of 1.7-2.5 nats on typical web text, meaning they are approaching but have not yet reached the theoretical floor. As models get closer to this floor, the returns from scaling will diminish even further.
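The effect of a floor can be illustrated with a small calculation. The sketch below assumes an irreducible loss of 1.69 nats (the value fitted by Hoffmann et al., within the range quoted above) and applies the Kaplan compute exponent to the reducible part only. Combining those two numbers is a simplification for illustration, not a fitted model.

```python
IRREDUCIBLE = 1.69   # assumed floor in nats (illustrative choice, see lead-in)
ALPHA_C = 0.050      # Kaplan compute exponent, applied to the reducible part only

def total_loss_after_scaling(current_loss, compute_multiplier):
    """Shrink only the reducible part of the loss by the power-law factor."""
    reducible = current_loss - IRREDUCIBLE
    return IRREDUCIBLE + reducible * compute_multiplier ** (-ALPHA_C)

relative_drops = {}
for start in (2.5, 2.0, 1.8):
    after = total_loss_after_scaling(start, 10)
    relative_drops[start] = (start - after) / start
    print(f"loss {start:.2f} -> {after:.3f} after 10x compute "
          f"({relative_drops[start]:.1%} lower)")
```

The closer the starting loss is to the floor, the less a 10x increase in compute moves the total loss.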

The Data Wall

The scaling laws assume you can always get more training data. But what if you cannot? This is the data wall: the point at which the demand for training data exceeds the available supply of high-quality human-generated text.

How Much Text Exists?

Villalobos et al. from Epoch AI published “Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data” (originally arXiv:2211.04325, November 2022; updated and published in Proceedings of Machine Learning Research, Vol. 235, at ICML 2024). They estimated the total stock of public human-generated text on the indexed web at approximately 510 trillion tokens (median estimate for the raw indexed web, with a 95% confidence interval of 130 trillion to 2.1 quadrillion tokens). However, not all of this text is usable for training. After adjusting for data quality (filtering out low-quality content, which removes 60-90% of raw text) and accounting for the benefits of multi-epoch training (training on the same data up to 3-5 times), the effective usable stock is substantially smaller. The paper’s updated abstract estimates the stock of public human-generated text at approximately 300 trillion tokens.

Their projection: if current trends in training dataset growth continue, models will be trained on datasets roughly equal in size to the available stock of public human text data between 2026 and 2032, with a median exhaustion year of 2028. If models are overtrained (as is the current trend), this timeline moves earlier.
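The shape of this projection is simple exponential arithmetic. The sketch below is a back-of-the-envelope version, not the paper's model: it starts from the 36 trillion token corpus mentioned earlier in this chapter, uses the ~300 trillion token stock estimate, and tries a few hypothetical annual growth factors. The growth factors are assumptions chosen for illustration, not measured values.

```python
import math

STOCK_TOKENS = 300e12                     # effective stock estimate (Villalobos et al.)
START_YEAR, START_TOKENS = 2025, 36e12    # largest corpus cited in this chapter

def years_until_exhaustion(growth_per_year):
    """Years until the largest training set equals the stock, at constant growth."""
    return math.log(STOCK_TOKENS / START_TOKENS) / math.log(growth_per_year)

for growth in (1.5, 2.0, 3.0):            # hypothetical annual growth factors
    years = years_until_exhaustion(growth)
    print(f"{growth:.1f}x per year -> stock reached around {START_YEAR + years:.0f}")
```

Even large differences in the assumed growth rate shift the crossover by only a few years, because the remaining headroom is less than one order of magnitude.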

Source: Villalobos et al., “Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data,” Proceedings of Machine Learning Research (PMLR), Vol. 235, pp. 49523-49544, 2024 (ICML 2024; arXiv:2211.04325). Epoch AI. Raw indexed web stock: ~510T tokens (median). Effective stock of public human text: ~300T tokens. Median data exhaustion year: 2028.

The Evidence Is Already Visible

We can already see the data wall’s effects. Consider the training data sizes of recent models:

Model	Training Tokens	Year
GPT-3	300B	2020
Chinchilla	1.4T	2022
LLaMA 3.1 405B	15.6T	2024
DeepSeek-V3	14.8T	2024
Qwen 3	36T	2025

Qwen 3’s 36 trillion tokens already amounts to roughly 12% of the estimated effective stock of high-quality public text (approximately 300 trillion tokens per Villalobos et al.). The Qwen team explicitly noted that their training data “includes web data, books, PDFs, and synthetic code/math content generated by earlier Qwen models.” That last part is crucial: they are already supplementing human-generated data with synthetic data generated by their own models.

Sources: Qwen 3 from Alibaba (April 29, 2025): approximately 36T training tokens spanning 119 languages, including synthetic data generated by earlier Qwen models.

Synthetic Data: Models Training Models

When human-generated data runs out, the obvious question is: can models generate their own training data? This approach, called synthetic data generation, has become one of the most important techniques in modern LLM development.

The Phi Series: Proof of Concept

The most compelling demonstration of synthetic data’s potential came from Microsoft Research’s Phi series of models. In June 2023, Gunasekar et al. published “Textbooks Are All You Need” (arXiv:2306.11644), introducing phi-1, a 1.3 billion parameter model specialized for Python code. phi-1 was trained on a carefully curated mixture of “textbook quality” data from the web (6 billion tokens) and synthetically generated textbooks and exercises produced by GPT-3.5 (1 billion tokens). Despite its tiny size, phi-1 achieved 50.6% pass@1 accuracy on HumanEval (a standard code generation benchmark), competitive with models 10-100x larger.

The key insight was not just that synthetic data works, but that the quality and structure of the data matters more than the quantity. By generating data in a “textbook” format (clear explanations, worked examples, exercises with solutions), the Phi team created training data that was more information-dense than typical web scrapes.

This approach was extended with phi-1.5, phi-2, phi-3, and phi-4. By December 2024, phi-4 (14 billion parameters) used synthetic data as the bulk of its training corpus, generated using “a diverse array of techniques, including multi-agent prompting, self-revision workflows, and instruction reversal.” phi-4 surpassed its teacher model (GPT-4o) on STEM-focused question-answering tasks, demonstrating that a student model trained on carefully generated synthetic data can exceed the capabilities of the model that generated that data.

Source: Gunasekar et al., “Textbooks Are All You Need,” arXiv:2306.11644, June 2023. Microsoft Research. phi-1: 1.3B parameters, trained on 6B web tokens + 1B synthetic tokens. phi-4 technical report: arXiv:2412.08905, December 2024. 14B parameters, synthetic data constitutes the bulk of training data.

The Model Collapse Problem

Synthetic data is not a free lunch. In July 2024, Shumailov et al. published “AI Models Collapse When Trained on Recursively Generated Data” in Nature, demonstrating a fundamental risk: when models are trained on data generated by other models (or by themselves), and this process is repeated over multiple generations, the resulting models progressively lose the ability to represent the tails of the original data distribution. Rare patterns, minority viewpoints, and unusual linguistic constructions gradually disappear, and the model’s outputs become increasingly homogeneous and generic.

The mechanism is straightforward. A model trained on human data learns to approximate the full distribution of human language, including rare patterns. When this model generates synthetic data, it tends to over-represent common patterns and under-represent rare ones (because common patterns have higher probability). A second model trained on this synthetic data learns an even more skewed distribution. After several generations, the distribution collapses to a narrow set of high-probability outputs.
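The tail-loss mechanism can be reproduced in a toy setting. The sketch below is not the experimental setup from Shumailov et al.; it assumes a hypothetical 1,000-token vocabulary with Zipf-like frequencies, and each "generation" simply re-estimates token probabilities from a finite sample drawn from the previous generation. The function name and all parameter values are illustrative assumptions.

```python
import numpy as np

def simulate_collapse(vocab_size=1000, samples_per_gen=5000, generations=10, seed=0):
    """Toy model collapse: repeatedly re-fit a categorical distribution
    to a finite sample drawn from the previous generation's fit."""
    rng = np.random.default_rng(seed)

    # Generation 0: a Zipf-like "human" distribution with a long tail.
    ranks = np.arange(1, vocab_size + 1)
    probs = 1.0 / ranks
    probs /= probs.sum()

    support = [int(np.count_nonzero(probs))]
    for _ in range(generations):
        # "Generate synthetic data" by sampling from the current model.
        counts = rng.multinomial(samples_per_gen, probs)
        # "Train the next model" by taking empirical frequencies.
        probs = counts / counts.sum()
        # Tokens with zero samples now have probability zero forever.
        support.append(int(np.count_nonzero(probs)))
    return support

support_sizes = simulate_collapse()
for gen, size in enumerate(support_sizes):
    print(f"generation {gen:>2}: {size} tokens with non-zero probability")
```

A token that receives zero samples in one generation can never be sampled again, so the number of representable tokens can only shrink. Real language models are far more complex than a frequency table, but the same finite-sampling effect is what erodes the tails.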

This is called model collapse, and it is a serious concern as AI-generated content becomes an increasingly large fraction of the text on the internet. If future models are trained on web data that is heavily contaminated with AI-generated text, they may suffer from the same degenerative process.

Source: Shumailov et al., “AI Models Collapse When Trained on Recursively Generated Data,” Nature, Vol. 631, pp. 755-759, July 24, 2024.

How the Industry Is Handling Synthetic Data

As of March 2026, the practical approach to synthetic data involves several strategies to avoid model collapse:

Mixing synthetic and human data: Rather than training exclusively on synthetic data, labs mix it with human-generated data. The optimal ratio of synthetic to human data varies by domain and model size, but the general principle is to ensure that human-generated data remains a substantial portion of the training mixture to preserve distributional diversity.
Using stronger models to generate data for weaker ones: The Phi approach uses a more capable model (GPT-3.5 or GPT-4) to generate training data for a smaller model. This avoids the recursive generation problem because the data source is always a more capable model, not the model being trained.
Targeted synthetic data for specific domains: Rather than generating generic text, labs generate synthetic data for specific domains where human data is scarce, such as mathematical reasoning, code, and scientific problem-solving. Qwen 3’s training data explicitly includes “synthetic code/math content generated by earlier Qwen models.”
Quality filtering of synthetic data: Not all synthetic data is equally useful. Labs apply the same quality filtering techniques to synthetic data that they apply to web-scraped data, removing low-quality, repetitive, or incorrect examples.
Verification and self-correction: For domains like math and code where correctness can be verified (by checking the answer or running the code), synthetic data can be filtered by correctness. This produces training data whose answers have passed an automated check, a level of assurance that is much harder to achieve with web-scraped data.
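The verification strategy can be sketched for the simplest possible case, arithmetic problems whose answers can be recomputed. This is a minimal illustration, not any lab's actual pipeline: the sample records, field names, and the `is_verified` helper are all hypothetical.

```python
import operator

# Operators the hypothetical checker knows how to recompute.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

# Hypothetical synthetic samples: operands, an operator, and the answer
# claimed by the generating model. One claim is deliberately wrong.
synthetic_samples = [
    {"a": 12, "op": "+", "b": 30, "claimed": 42},
    {"a": 7, "op": "*", "b": 8, "claimed": 54},   # wrong: 7 * 8 = 56
    {"a": 100, "op": "-", "b": 58, "claimed": 42},
]

def is_verified(sample):
    """Recompute the answer independently and compare it with the claim."""
    expected = OPS[sample["op"]](sample["a"], sample["b"])
    return expected == sample["claimed"]

# Keep only the samples whose claimed answer survives the check.
verified = [s for s in synthetic_samples if is_verified(s)]
print(f"kept {len(verified)} of {len(synthetic_samples)} samples")
```

The same pattern extends to code (run the unit tests) or formal math (run a proof checker). A filter like this only guarantees what the checker actually checks: a correct final answer does not prove the intermediate reasoning was sound.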

Scaling Laws for MoE Models

The original Kaplan and Chinchilla scaling laws were derived for dense Transformer models, where every parameter is active for every token. How do scaling laws apply to Mixture of Experts models, where only a fraction of parameters are active?

This is an active area of research, and the picture is not yet as clean as for dense models. However, several key observations have emerged:

Total Parameters vs. Active Parameters

For MoE models, there are two ways to think about “model size”: total parameters (all experts) and active parameters (the experts selected for each token). The scaling laws for loss appear to depend primarily on the active parameter count, not the total parameter count. This makes intuitive sense: the loss for a given token depends on the computation performed for that token, which is determined by the active parameters.

However, the total parameter count matters for the model’s knowledge capacity. A model with 671B total parameters (DeepSeek-V3) has access to more stored knowledge than a model with 37B total parameters, even if both use 37B active parameters per token. The additional experts provide specialized knowledge that is activated when relevant.

The Practical Implication

For MoE models, the Chinchilla-optimal training token count should be calculated relative to the active parameter count, not the total parameter count. DeepSeek-V3 has 37B active parameters and was trained on 14.8T tokens, giving a ratio of approximately 400 tokens per active parameter. This is about 20 times the Chinchilla-optimal ratio of 20, meaning DeepSeek-V3 is heavily overtrained relative to its active parameters. This is deliberate. One rationale is that each expert is activated for only a subset of tokens, so a larger total token budget helps ensure that every expert is well trained.
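The arithmetic is easy to check. The short sketch below uses only the figures quoted above (14.8T tokens, 37B active and 671B total parameters); the helper name `tokens_per_param` is just for illustration.

```python
CHINCHILLA_RATIO = 20  # tokens per parameter, the Chinchilla rule of thumb

def tokens_per_param(tokens, params):
    """Training tokens seen per parameter."""
    return tokens / params

# DeepSeek-V3 figures as quoted in the text.
train_tokens = 14.8e12
active_params = 37e9
total_params = 671e9

ratio_active = tokens_per_param(train_tokens, active_params)
ratio_total = tokens_per_param(train_tokens, total_params)

print(f"Tokens per active parameter: {ratio_active:.0f} "
      f"({ratio_active / CHINCHILLA_RATIO:.1f}x Chinchilla)")
print(f"Tokens per total parameter:  {ratio_total:.1f} "
      f"({ratio_total / CHINCHILLA_RATIO:.1f}x Chinchilla)")
```

Which denominator you choose changes the conclusion dramatically: roughly 400 tokens per active parameter versus roughly 22 per total parameter. This is why it matters to state which parameter count a tokens-per-parameter figure refers to.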

The New Scaling Frontier: Test-Time Compute

By late 2024, a new dimension of scaling emerged that does not involve making models bigger or training them on more data. Instead, it involves spending more compute at inference time (when the model is generating a response) to improve the quality of that response. This is called test-time compute scaling or inference-time scaling.

The landmark moment was OpenAI’s release of o1-preview on September 12, 2024, the first widely available “reasoning model.” Rather than answering immediately, o1 generates an extended chain-of-thought (a long sequence of reasoning steps) before producing its final answer. The more tokens it generates in this thinking process, the better its answer tends to be.

OpenAI demonstrated that o1’s accuracy on the AIME (American Invitational Mathematics Examination) increased at a constant rate with the logarithm of test-time compute. This is a scaling law for inference, analogous to the scaling laws for training: spending 10x more compute at inference time yields a predictable improvement in accuracy.
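"A constant rate with the logarithm of compute" means the trend is a straight line when compute is on a log axis, so every 10x increase buys the same additive gain in accuracy. The sketch below illustrates only that functional form. The intercept and slope are hypothetical values chosen for readability, not numbers fitted to any published o1 results.

```python
import math

def accuracy_vs_test_compute(compute, intercept, slope_per_decade):
    """Log-linear trend: accuracy = intercept + slope * log10(compute),
    clipped to [0, 1] because accuracy cannot leave that range."""
    raw = intercept + slope_per_decade * math.log10(compute)
    return min(1.0, max(0.0, raw))

# Hypothetical coefficients, NOT fitted to real benchmark data.
INTERCEPT = 0.20
SLOPE_PER_DECADE = 0.15

budgets = [1, 10, 100, 1000]  # relative test-time compute
accuracies = [accuracy_vs_test_compute(c, INTERCEPT, SLOPE_PER_DECADE)
              for c in budgets]
gains = [later - earlier for earlier, later in zip(accuracies, accuracies[1:])]

for c, acc in zip(budgets, accuracies):
    print(f"{c:>5}x compute -> accuracy {acc:.2f}")
print("Gain per 10x step:", [f"{g:.2f}" for g in gains])
```

The clipping step is a reminder that a log-linear accuracy trend cannot continue indefinitely: it must flatten as accuracy approaches 100%.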

This opened a new axis for scaling. Instead of only scaling pre-training compute (bigger models, more data), labs can now also scale inference compute (more thinking tokens per query). The two axes are complementary: a model that is both well-trained and given time to think can outperform a model that is only scaled on one axis.

By March 2026, test-time compute scaling had become a standard technique. DeepSeek-R1 (January 20, 2025) showed that large-scale reinforcement learning could produce strong reasoning behavior in an open-weight model; its R1-Zero variant was trained with reinforcement learning alone, with no supervised fine-tuning stage. OpenAI’s o3 (April 16, 2025) extended the o1 approach with full tool access. Anthropic introduced extended thinking in Claude 3.7 Sonnet (February 24, 2025), and later in the Claude 4 family (May 22, 2025). Google released Gemini 2.5 Deep Think (August 1, 2025), which explores multiple reasoning paths in parallel. All of these models use some form of inference-time reasoning, spending additional compute per query to improve answer quality. This represents a fundamental shift in how the industry thinks about scaling: the question is no longer just “how big should the model be?” but also “how much should the model think about each query?”

We will cover extended thinking in detail in Chapter 16.

Sources: OpenAI, “Learning to Reason with LLMs,” September 12, 2024 (openai.com/index/learning-to-reason-with-llms). o1-preview released September 12, 2024. OpenAI o3 released April 16, 2025 (openai.com/index/introducing-o3-and-o4-mini). DeepSeek-R1 released January 20, 2025 (arXiv:2501.12948). Anthropic, “Claude 3.7 Sonnet and Claude Code,” February 24, 2025 (anthropic.com/news/claude-3-7-sonnet). Claude 4 (Opus 4 and Sonnet 4) released May 22, 2025 (anthropic.com/news/claude-4). Google Gemini 2.5 Deep Think released August 1, 2025.

Putting It All Together: The Three Eras of Scaling

The history of LLM scaling can be divided into three eras, each defined by a different understanding of how to allocate resources:

Era 1: Bigger Models (2018-2022)

Guided by the Kaplan scaling laws, the industry focused on making models bigger. GPT-2 (1.5B, 2019) gave way to GPT-3 (175B, 2020), which gave way to Gopher (280B, 2021). The dominant strategy was to increase parameter count while keeping training data relatively modest (300B tokens was standard). Models were severely undertrained by modern standards.

Era 2: More Data (2022-2024)

The Chinchilla paper revealed that the industry had been allocating compute incorrectly. The focus shifted to training data: LLaMA 1 (2023) used 1.4T tokens, LLaMA 2 (2023) used 2T tokens, and LLaMA 3 (2024) used 15T tokens. Models began to be deliberately overtrained to optimize for inference efficiency. The MoE architecture emerged as the dominant design for frontier models, decoupling total parameters (knowledge capacity) from active parameters (inference cost).

Era 3: Smarter Compute (2024-present)

With pre-training data approaching the data wall and model architectures converging on similar designs, the industry discovered a new scaling axis: inference-time compute. Reasoning models like o1 demonstrated that spending more compute per query (through extended chain-of-thought) could yield improvements comparable to scaling up the model itself. Synthetic data generation became essential for continued pre-training scaling. The focus shifted from “how big?” to “how can the same resources be used more intelligently?”

# Illustrate the three eras with representative models
eras = [
    # Era 1: Bigger Models
    ("GPT-2",       2019, 1.5,    10,     "Dense", "Era 1: Bigger Models"),
    ("GPT-3",       2020, 175,    300,    "Dense", "Era 1: Bigger Models"),
    ("Gopher",      2021, 280,    300,    "Dense", "Era 1: Bigger Models"),

    # Era 2: More Data
    ("Chinchilla",  2022, 70,     1400,   "Dense", "Era 2: More Data"),
    ("LLaMA 3 8B",  2024, 8,      15000,  "Dense", "Era 2: More Data"),
    ("LLaMA 3.1 405B", 2024, 405, 15600,  "Dense", "Era 2: More Data"),
    ("DeepSeek-V3", 2024, 671,    14800,  "MoE",   "Era 2: More Data"),

    # Era 3: Smarter Compute
    ("Qwen 3 235B", 2025, 235,    36000,  "MoE",   "Era 3: Smarter Compute"),
    ("Qwen 3-Max",  2025, 1000,   36000,  "MoE",   "Era 3: Smarter Compute"),
    ("Qwen 3.5",    2026, 397,    None,   "MoE",   "Era 3: Smarter Compute"),
]

print(f"{'Model':<18} {'Year':>5} {'Params':>8} {'Tokens':>10} {'Type':>6} {'Era'}")
print("-" * 72)
current_era = ""
for name, year, params, tokens, mtype, era in eras:
    if era != current_era:
        if current_era:
            print()
        current_era = era
    tok_str = f"{tokens/1000:.1f}T" if tokens and tokens >= 1000 else \
              f"{tokens}B" if tokens else "N/A"
    print(f"{name:<18} {year:>5} {params:>7.1f}B {tok_str:>10} {mtype:>6} {era}")

Hands-On: Exploring Scaling Laws

Let’s implement the Kaplan scaling laws, tabulate how loss changes with model size, and compare the compute cost of Chinchilla-optimal and overtrained runs:

import numpy as np

# Kaplan scaling law exponents (from arXiv:2001.08361)
alpha_N = 0.076   # exponent for model size
alpha_D = 0.095   # exponent for dataset size (listed for reference)
alpha_C = 0.050   # exponent for compute

# Note: Kaplan's fits are pure power laws with no additive floor.
# The irreducible-loss term (E ≈ 1.69 nats) belongs to the Chinchilla
# parametric fit (arXiv:2203.15556), so it is not added to these curves.

# Loss as a function of parameters (Kaplan)
def loss_vs_params(N, Nc=8.8e13, alpha=alpha_N):
    """Kaplan scaling law: loss in nats vs non-embedding parameters."""
    return (Nc / N) ** alpha

# Chinchilla-optimal compute for a given model size
def chinchilla_compute(N):
    """C = 6 * N * D, where D = 20 * N."""
    return 6 * N * (20 * N)

# Loss vs compute (Kaplan)
# C_c = 3.1e8 PetaFLOP-days in the paper.
# 1 PetaFLOP-day = 1e15 * 86400 = 8.64e19 FLOPs.
# So C_c in FLOPs = 3.1e8 * 8.64e19 ≈ 2.68e28 FLOPs.
def loss_vs_compute(C, Cc=2.68e28, alpha=alpha_C):
    """Kaplan scaling law: loss in nats vs compute (C in FLOPs)."""
    return (Cc / C) ** alpha

# Generate curve data (NumPy evaluates each whole range at once)
param_range = np.logspace(6, 12, 100)  # 1M to 1T parameters
compute_range = np.logspace(15, 27, 100)  # 10^15 to 10^27 FLOPs

loss_N = loss_vs_params(param_range)
loss_C = loss_vs_compute(compute_range)

# Print key points
print("Loss vs. Model Size (Kaplan scaling law)")
print(f"{'Parameters':>15} {'Loss (nats)':>12} {'Improvement':>12}")
print("-" * 42)
prev_loss = None
for n in [1e7, 1e8, 1e9, 1e10, 1e11, 1e12]:
    loss = loss_vs_params(n)
    imp = f"{(prev_loss - loss):.4f}" if prev_loss is not None else "---"
    prev_loss = loss
    label = f"{n:.0e}"
    print(f"{label:>15} {loss:>11.4f} {imp:>12}")

print("\nNote: Kaplan's fit is a pure power law with no explicit loss floor.")
print("\nNotice: each 10x increase in parameters gives a smaller")
print("absolute improvement in loss. This is diminishing returns.")

# Compare Chinchilla-optimal vs overtrained
print(f"\n\nChinchilla-Optimal vs. Overtrained Training")
print(f"{'Model':>15} {'Params':>8} {'Tokens':>10} {'T/P Ratio':>10} {'Compute':>14}")
print("-" * 62)

configs = [
    ("Chinchilla-opt", 70e9, 20),
    ("2x overtrained", 70e9, 40),
    ("10x overtrained", 70e9, 200),
    ("LLaMA 3 70B", 70e9, 214),
]

for name, N, ratio in configs:
    D = ratio * N
    C = 6 * N * D
    print(f"{name:>15} {N/1e9:>6.0f}B {D/1e12:>8.1f}T {ratio:>9.0f} {C:>13.2e}")

This code demonstrates two key properties of scaling laws. First, the diminishing returns: each 10x increase in model size yields a progressively smaller improvement in loss. Second, the compute cost of overtraining: LLaMA 3 70B’s 214 tokens-per-parameter ratio requires roughly 10x more compute than Chinchilla-optimal training for the same model size, but the resulting model is much cheaper to serve because it is smaller than a Chinchilla-optimal model that would achieve similar quality.

Scaling Laws in Practice: How Labs Make Decisions

Scaling laws are not just academic curiosities. They are the primary tool that AI labs use to plan their training runs. Here is how the process typically works:

Step 1: Small-Scale Experiments

Before committing to a multi-million-dollar training run, labs train a series of small models (typically 100M to 1B parameters) on varying amounts of data. They measure the loss for each configuration and fit the scaling law parameters to their specific data and architecture.

Step 2: Extrapolation

Using the fitted scaling laws, labs extrapolate to predict the loss of much larger models. If the scaling laws predict that a 70B model trained on 15T tokens will achieve a loss of X, and that loss corresponds to the desired benchmark performance, the lab proceeds with the training run.
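Steps 1 and 2 can be sketched as a straight-line fit in log-log space followed by extrapolation. The "small-scale runs" below are not real measurements: they are synthetic points generated from Kaplan's published constants (N_c = 8.8e13, exponent 0.076) so that the fit can be checked against a known answer. The model sizes and the 70B target are illustrative assumptions.

```python
import numpy as np

# Synthetic stand-ins for small-scale runs (NOT real measurements):
# losses are generated from Kaplan's L(N) = (Nc / N) ** alpha.
small_sizes = np.array([1e8, 2e8, 4e8, 8e8, 1e9])
small_losses = (8.8e13 / small_sizes) ** 0.076

# A power law is linear in log-log space:
#   log L = alpha * log Nc - alpha * log N
# so a degree-1 polynomial fit recovers the exponent from the slope.
slope, intercept = np.polyfit(np.log(small_sizes), np.log(small_losses), deg=1)
alpha_fit = -slope
Nc_fit = np.exp(intercept / alpha_fit)

def predict_loss(n_params):
    """Extrapolate the fitted power law to a new model size."""
    return (Nc_fit / n_params) ** alpha_fit

target_size = 70e9  # extrapolate 70x beyond the largest fitted model
predicted = predict_loss(target_size)

print(f"Fitted exponent: {alpha_fit:.4f}")
print(f"Predicted loss at {target_size:.0e} params: {predicted:.4f} nats")
```

With noise-free synthetic data the fit recovers the exponent almost exactly. Real runs are noisy, and small errors in the fitted exponent compound over a large extrapolation range, which is one reason labs fit many small models rather than a handful.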

Step 3: Compute Budget Allocation

Given a fixed compute budget (determined by the number of GPUs available and the time allocated), the lab uses the scaling laws to determine the optimal split between model size and training data. If the goal is to minimize training loss, they follow the Chinchilla ratio. If the goal is to minimize inference cost for a target quality level, they overtrain a smaller model.
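The allocation in Step 3 follows directly from two approximations used throughout this chapter: training compute C ≈ 6 · N · D, and the Chinchilla ratio D ≈ 20 · N. Substituting gives C ≈ 120 · N², so N ≈ sqrt(C / 120). The sketch below applies this to the budget implied by Chinchilla itself (6 × 70B × 1.4T FLOPs). The 8B "small model" is a hypothetical choice used to show the overtraining alternative.

```python
import math

TOKENS_PER_PARAM = 20        # Chinchilla rule of thumb
FLOPS_PER_PARAM_TOKEN = 6    # C ≈ 6 * N * D approximation

def chinchilla_allocation(compute_flops):
    """Split a compute budget into (parameters, tokens) using D = 20 * N."""
    n_params = math.sqrt(
        compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM)
    )
    return n_params, TOKENS_PER_PARAM * n_params

def tokens_for_fixed_model(compute_flops, n_params):
    """Tokens affordable if the model size is fixed in advance."""
    return compute_flops / (FLOPS_PER_PARAM_TOKEN * n_params)

budget = 6 * 70e9 * 1.4e12   # ≈ 5.88e23 FLOPs

n_opt, d_opt = chinchilla_allocation(budget)
print(f"Chinchilla-optimal: {n_opt / 1e9:.0f}B params, {d_opt / 1e12:.2f}T tokens")

# Inference-oriented alternative: spend the same budget on a smaller model.
small_model = 8e9  # hypothetical size chosen for cheap serving
d_small = tokens_for_fixed_model(budget, small_model)
print(f"Same budget, 8B model: {d_small / 1e12:.2f}T tokens "
      f"({d_small / small_model:.0f} tokens/param)")
```

The first result recovers Chinchilla's own configuration (70B parameters, 1.4T tokens), which is a useful sanity check on the formula. The second shows what overtraining means in practice: the same compute buys a much longer training run for a model that is far cheaper to serve.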

Step 4: Monitoring During Training

During the actual training run, the lab monitors the loss curve and compares it to the scaling law predictions. If the loss is tracking the prediction, the run is on track. If the loss is higher than predicted, something may be wrong (data quality issues, hyperparameter problems, hardware failures). If the loss is lower than predicted, the model may be benefiting from architectural improvements or data quality that the scaling laws did not account for.
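The monitoring logic in Step 4 amounts to comparing each observed loss against the predicted value and flagging deviations beyond some tolerance. The sketch below is a minimal illustration: the function name, the 2% tolerance, and the loss values are all hypothetical, and a real monitoring setup would also account for noise and smoothing.

```python
def classify_loss(observed, predicted, rel_tol=0.02):
    """Compare an observed loss with the scaling-law prediction.

    Returns "above" (possible data, hyperparameter, or hardware problem),
    "below" (doing better than the fitted law expected), or "on_track".
    The 2% default tolerance is an arbitrary illustrative choice.
    """
    deviation = (observed - predicted) / predicted
    if deviation > rel_tol:
        return "above"
    if deviation < -rel_tol:
        return "below"
    return "on_track"

# Hypothetical (predicted, observed) loss pairs at three checkpoints.
checkpoints = [(2.50, 2.51), (2.20, 2.35), (2.00, 1.90)]
statuses = [classify_loss(obs, pred) for pred, obs in checkpoints]

for (pred, obs), status in zip(checkpoints, statuses):
    print(f"predicted {pred:.2f}, observed {obs:.2f} -> {status}")
```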

Meta’s LLaMA 3 technical report explicitly describes this process: they trained a sweep of much smaller models (from 40M to 16B parameters) at modest compute budgets to fit their scaling laws, then used those laws to predict the performance of the full-scale 405B model before committing to the training run.

Key Takeaways

Scaling laws are empirical power-law relationships between model performance (measured by cross-entropy loss) and three variables: model size (parameters), training data (tokens), and compute (FLOPs). They were discovered by Kaplan et al. at OpenAI in January 2020 (arXiv:2001.08361) and refined by Hoffmann et al. at DeepMind in March 2022 (arXiv:2203.15556, the “Chinchilla” paper).
The Kaplan scaling laws found that loss scales as a power law with exponents of approximately 0.076 for model size, 0.095 for dataset size, and 0.050 for compute. These small exponents mean improvements are real but diminishing: each 10x increase in resources yields a progressively smaller improvement in loss.
The Chinchilla scaling law corrected Kaplan’s recommendation about compute allocation. Chinchilla showed that model size and training tokens should be scaled at equal rates, with an optimal ratio of approximately 20 tokens per parameter. This revealed that pre-Chinchilla models like GPT-3 (1.7 tokens/param) and Gopher (1.1 tokens/param) were severely undertrained.
Chinchilla (70B parameters, 1.4T tokens) outperformed Gopher (280B parameters, 300B tokens) despite having 4x fewer parameters, because it was trained on 4.7x more data with the same compute budget. This result reshaped the entire industry’s approach to training.
Modern models are deliberately overtrained beyond the Chinchilla-optimal ratio to reduce inference costs. LLaMA 3 8B was trained on 15T tokens (1,875 tokens/param, ~94x Chinchilla-optimal). The logic: training is a one-time cost, but inference is an ongoing cost that scales with the number of users. Sardana et al. formalized this in “Beyond Chinchilla-Optimal” (ICML 2024, arXiv:2401.00448), showing that when inference demand is high, the optimal strategy is to train a smaller model on more data than Chinchilla recommends.
Emergent abilities, capabilities that appear to suddenly manifest at scale, were documented by Wei et al. (TMLR, 2022) but challenged by Schaeffer et al. (NeurIPS 2023), who argued the apparent emergence is an artifact of discontinuous evaluation metrics. When continuous metrics are used, performance improves smoothly with scale, consistent with the scaling laws.
The data wall is the point at which demand for training data exceeds the available supply of high-quality human-generated text. Villalobos et al. (Epoch AI, ICML 2024) estimated the raw stock of indexed web text at approximately 510 trillion tokens, with an effective stock of public human text at approximately 300 trillion tokens after quality filtering. They projected exhaustion between 2026 and 2032. Qwen 3’s 36 trillion token training set already represents a significant fraction of this stock.
Synthetic data (AI-generated training data) is the primary strategy for pushing past the data wall. Microsoft’s Phi series demonstrated that small models trained on high-quality synthetic data can match or exceed much larger models. However, Shumailov et al. (Nature, July 2024) showed that recursive training on synthetic data causes model collapse, where the model progressively loses the ability to represent rare patterns. The practical solution is to mix synthetic and human data, use stronger models to generate data for weaker ones, and apply quality filtering.
Test-time compute scaling (inference-time reasoning) emerged in late 2024 as a new scaling axis. OpenAI’s o1 (released in preview on September 12, 2024) demonstrated that spending more compute during inference (through extended chain-of-thought reasoning) yields predictable improvements in accuracy, following a scaling law analogous to the training scaling laws. By early 2026, every major lab had adopted this approach: DeepSeek-R1 (January 20, 2025), Claude 3.7 Sonnet with extended thinking (February 24, 2025), OpenAI o3 (April 16, 2025), Claude 4 (May 22, 2025), and Gemini 2.5 Deep Think (August 1, 2025). This opened a new dimension for improving model capabilities without increasing model size or training data.
The history of LLM scaling falls into three eras: Era 1 (2018-2022) focused on bigger models with modest data; Era 2 (2022-2024) shifted to more data with deliberate overtraining and MoE architectures; Era 3 (2024-present) adds inference-time compute scaling and synthetic data generation as the data wall approaches.

What’s Next

You now understand the scaling laws that govern how language model performance improves with size, data, and compute, including the Chinchilla revolution, the data wall, synthetic data, and the emergence of test-time compute scaling. In Chapter 14, we will dive into the pre-training process itself: how models actually learn from trillions of tokens, what the training data looks like, how it is filtered and prepared, and what it costs to train a frontier model from scratch.
