Chapter 3. Neural Networks Without the Mysticism

In Chapter 2, you learned the four mathematical operations that power every language model: vectors, dot products, matrix multiplication, and softmax. But knowing the operations isn’t enough. You need to understand how they’re organized into a system that can learn. That’s what a neural network is: a specific arrangement of multiplications, additions, and simple nonlinear functions that starts out knowing nothing and gradually improves by adjusting its numbers based on feedback. This chapter shows you exactly how that works, from a single neuron to a complete network that learns from scratch.


A Single Neuron: The Building Block

A neuron in a neural network does three things:

  1. Multiply each input by a weight
  2. Add a bias
  3. Apply an activation function

That’s the entire operation. There’s no magic, no mystery, no hidden complexity. Let’s walk through each step.

Step 1: Multiply Inputs by Weights

Suppose a neuron receives two inputs: x₁ = 0.5 and x₂ = 0.8. The neuron has a weight for each input (numbers that were either randomly initialized or learned during training). Let’s say the weights are w₁ = 0.6 and w₂ = -0.3.

The neuron multiplies each input by its corresponding weight and sums the results:

weighted_sum = (x₁ × w₁) + (x₂ × w₂)
             = (0.5 × 0.6) + (0.8 × -0.3)
             = 0.30 + (-0.24)
             = 0.06

If this looks familiar, it should. This is a dot product between the input vector [0.5, 0.8] and the weight vector [0.6, -0.3]. Everything you learned in Chapter 2 applies directly here.

Step 2: Add a Bias

The neuron adds a single number called the bias to the weighted sum. The bias shifts the output up or down, giving the neuron flexibility to activate even when all inputs are zero. If the bias is b = 0.1:

z = weighted_sum + bias
  = 0.06 + 0.1
  = 0.16

The value z = 0.16 is called the pre-activation value. It’s the raw output before the activation function is applied.

Step 3: Apply an Activation Function

The final step is to pass z through an activation function, a simple mathematical function that introduces nonlinearity. Without an activation function, stacking multiple layers of neurons would be equivalent to a single layer (because multiplying matrices together just gives you another matrix). The activation function is what gives neural networks the ability to learn complex, nonlinear patterns.
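The claim that stacked layers collapse into one without an activation can be checked numerically. The following is a minimal sketch: the matrices W1 and W2 and the input x are made-up values chosen for illustration, biases are left out for brevity, and the nonlinearity used at the end is ReLU, which is described just below.

```python
import numpy as np

# Hypothetical weights for two tiny layers (illustrative values only)
W1 = np.array([[1.0, -2.0],
               [-1.0, 0.5]])   # first layer: 2 inputs -> 2 outputs
W2 = np.array([[0.5, 1.0]])    # second layer: 2 inputs -> 1 output
x = np.array([1.0, 2.0])

# Two layers applied in sequence with no activation in between
two_layers = W2 @ (W1 @ x)

# A single layer whose weight matrix is the product of the two
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))        # True

# Insert a nonlinearity between the layers and the shortcut stops working
with_activation = W2 @ np.maximum(0.0, W1 @ x)
print(np.allclose(with_activation, one_layer))   # False
```

With these particular numbers the purely linear stack gives -1.5 both ways, while the version with the nonlinearity gives 0.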

The most common activation functions are:

Sigmoid: Squashes any number into the range (0, 1).

sigmoid(z) = 1 / (1 + e^(-z))

For z = 0.16:

sigmoid(0.16) = 1 / (1 + e^(-0.16))
              = 1 / (1 + 0.852)
              = 1 / 1.852
              = 0.540

The sigmoid function was the standard activation for decades. It has an intuitive interpretation: the output can be read as a probability (a value between 0 and 1). But it has a serious problem for deep networks that we’ll discuss later in this chapter.

ReLU (Rectified Linear Unit): Returns the input if it’s positive, and zero otherwise.

ReLU(z) = max(0, z)

For z = 0.16:

ReLU(0.16) = max(0, 0.16) = 0.16

For z = -0.5:

ReLU(-0.5) = max(0, -0.5) = 0

ReLU is dead simple. If the number is positive, pass it through unchanged; if it’s negative, output zero. Despite its simplicity, ReLU became the dominant activation function in deep learning after Nair and Hinton demonstrated its effectiveness in 2010, and it was one of the ingredients in AlexNet’s breakthrough victory in the 2012 ImageNet competition. ReLU greatly reduces a critical training problem called the vanishing gradient problem, which we’ll cover later in this chapter.

Source: Nair and Hinton, “Rectified Linear Units Improve Restricted Boltzmann Machines,” ICML 2010; Krizhevsky, Sutskever, and Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” NeurIPS 2012.
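Both activation functions are one-liners in NumPy. This is a small sketch that reproduces the worked values above; the function names sigmoid and relu are simply the names chosen here.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Pass positive values through unchanged; clamp negatives to zero."""
    return np.maximum(0.0, z)

print(round(float(sigmoid(0.16)), 3))  # 0.54
print(float(relu(0.16)))               # 0.16
print(float(relu(-0.5)))               # 0.0

# Both functions also work element-wise on arrays
z = np.array([-2.0, 0.0, 2.0])
print(relu(z))                         # [0. 0. 2.]
```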

The Complete Neuron in One Line

Putting all three steps together, a single neuron computes:

output = activation(w₁·x₁ + w₂·x₂ + ... + wₙ·xₙ + b)

Or in vector notation (using the dot product from Chapter 2):

output = activation(w · x + b)

Where w is the weight vector, x is the input vector, b is the bias, and activation is the chosen activation function.

That’s a neuron. It takes a list of numbers, computes a weighted sum, adds a bias, and applies a nonlinear function. Every neural network, from a tiny 2-neuron toy to a frontier language model with hundreds of billions of parameters, is built from this same building block.
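Here is the whole neuron as a short sketch, using the inputs, weights, and bias from the walkthrough above. The function name neuron and its signature are choices made for this example, not a standard API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, activation):
    """Compute activation(w · x + b) for a single neuron."""
    z = np.dot(w, x) + b      # weighted sum plus bias (the pre-activation)
    return activation(z)

x = np.array([0.5, 0.8])      # inputs from the example
w = np.array([0.6, -0.3])     # weights from the example
b = 0.1                       # bias from the example

z = np.dot(w, x) + b
print(round(float(z), 2))                         # 0.16
print(round(float(neuron(x, w, b, sigmoid)), 3))  # 0.54
```

Passing the activation in as an argument keeps the neuron itself unchanged when you swap sigmoid for ReLU.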

A Brief History: The Perceptron

The idea of an artificial neuron goes back to McCulloch and Pitts in 1943, but the first version that could learn arrived in 1958, when psychologist Frank Rosenblatt at the Cornell Aeronautical Laboratory introduced the perceptron. The Mark I Perceptron was a physical machine (a room-sized contraption of wires and motors) that could learn to classify simple visual patterns. It used a step function as its activation (output 1 if the weighted sum exceeds a threshold, 0 otherwise) and could learn by adjusting its weights based on whether it got the right answer.

Source: Rosenblatt, “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain,” Psychological Review, 1958.

The perceptron generated enormous excitement. The New York Times reported that the Navy expected the technology to one day “walk, talk, see, write, reproduce itself and be conscious of its existence.” But in 1969, Marvin Minsky and Seymour Papert published Perceptrons, a book that mathematically proved single-layer perceptrons couldn’t solve certain simple problems, most famously the XOR problem, which we’ll tackle later in this chapter. This contributed to the first “AI winter,” a period of reduced funding and interest in neural network research.

Source: Minsky and Papert, Perceptrons: An Introduction to Computational Geometry, MIT Press, 1969.

The solution to the XOR problem (and the key to making neural networks powerful) turned out to be stacking neurons into layers.


Layers: Stacking Neurons Together

A single neuron can only learn simple, linear relationships. To learn complex patterns, like the relationship between words in a sentence, or the features in an image, you need to combine many neurons into layers, and stack multiple layers on top of each other.

What Is a Layer?

A layer is a group of neurons that all receive the same inputs and produce their outputs simultaneously. If a layer has 4 neurons and each receives 3 inputs, then the layer has 4 × 3 = 12 weights (each neuron has its own weight for each input) plus 4 biases (one per neuron).

Here’s what a layer with 3 inputs and 4 neurons looks like:

Input [x₁, x₂, x₃]
         ↓
   ┌─────┼─────┬─────┐
   ↓     ↓     ↓     ↓
 [n₁]  [n₂]  [n₃]  [n₄]    ← 4 neurons, each with 3 weights + 1 bias
   ↓     ↓     ↓     ↓
Output [y₁, y₂, y₃, y₄]

Each neuron independently computes its weighted sum, adds its bias, and applies the activation function. The 4 outputs become the input to the next layer.

Here’s the key insight: a layer is just a matrix multiplication followed by an activation function. If you stack the weights of all 4 neurons into a matrix:

W = | w₁₁  w₁₂  w₁₃ |    (4 rows × 3 columns)
    | w₂₁  w₂₂  w₂₃ |    each row = one neuron's weights
    | w₃₁  w₃₂  w₃₃ |
    | w₄₁  w₄₂  w₄₃ |

Then the entire layer’s computation is:

output = activation(W · x + b)

Where W is the weight matrix, x is the input vector, b is the bias vector, and the activation is applied element-wise. This is exactly the matrix multiplication from Chapter 2, followed by a vector addition and a nonlinear function.
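To see that the matrix form and the neuron-by-neuron form really are the same computation, here is a small sketch of a layer with 3 inputs and 4 neurons. All the numbers in W, b, and x are made up for illustration, and ReLU is used as the activation.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Hypothetical weights: 4 neurons (rows) x 3 inputs (columns)
W = np.array([[ 0.2, -0.5,  0.1],
              [ 0.7,  0.3, -0.2],
              [-0.4,  0.6,  0.9],
              [ 0.1,  0.1,  0.1]])
b = np.array([0.0, 0.1, -0.1, 0.2])   # one bias per neuron
x = np.array([1.0, 2.0, 3.0])         # one input vector

# The whole layer in one line: matrix multiply, add biases, activate
layer_output = relu(W @ x + b)

# The same thing computed one neuron at a time
per_neuron = np.array([relu(np.dot(W[i], x) + b[i]) for i in range(4)])

print(layer_output.shape)                      # (4,)
print(np.allclose(layer_output, per_neuron))   # True
```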

Deep Networks: Stacking Layers

A deep neural network is simply multiple layers stacked on top of each other. The output of one layer becomes the input to the next:

Input → [Layer 1] → [Layer 2] → [Layer 3] → Output

Each layer transforms its input, and the transformations build on each other:

  • Layer 1 might learn to detect simple features, basic patterns in the input data.
  • Layer 2 takes those simple features and combines them into more complex patterns.
  • Layer 3 combines those complex patterns into even higher-level representations.

In a language model like LLaMA 4 Maverick with 48 Transformer layers, the early layers tend to capture syntactic patterns (grammar, word order), the middle layers capture semantic relationships (meaning, context), and the later layers handle higher-level reasoning and prediction. We’ll explore this in detail in Chapter 10.

The term deep learning simply means using neural networks with many layers. There’s no formal threshold for “deep.” A network with 3 layers is technically deep, but modern language models have 48 to 120+ layers.

Why Depth Matters: The Power of Composition

Why not just use one giant layer with thousands of neurons instead of many smaller layers? Because depth gives you something width cannot: compositional learning.

Consider how you understand the sentence “The bank by the river flooded.” To correctly interpret “bank” as a riverbank (not a financial institution), you need to:

  1. First recognize the individual words
  2. Then notice the phrase “by the river”
  3. Then use that context to disambiguate “bank”
  4. Then understand the full sentence meaning

Each step builds on the previous one. A deep network mirrors this process. Each layer builds a more refined representation on top of what the previous layer computed. A single wide layer would have to do all of this in one step, which is much harder.

This is why the Transformer architecture (Chapter 7) uses dozens of layers. Each layer refines the representation of every token, gradually building up from raw word meanings to full contextual understanding.


Loss Functions: Measuring “How Wrong Was I?”

Before a neural network can learn, it needs a way to measure how wrong its predictions are. This measurement is called the loss (also called the cost or error). The loss function is the mathematical formula that computes this number.

The core idea is simple: compare what the network predicted to what the correct answer actually is, and produce a single number that says how far off the prediction was. A loss of 0 means the prediction was perfect. A larger loss means the prediction was worse.

Mean Squared Error (MSE)

The simplest loss function is Mean Squared Error. For each prediction, you compute the difference between the predicted value and the true value, square it, and average across all examples.

MSE = (1/n) × Σ(predicted - actual)²

Suppose a network predicts [0.7, 0.2, 0.1] for a problem where the correct answer is [1.0, 0.0, 0.0]:

MSE = (1/3) × [(0.7 - 1.0)² + (0.2 - 0.0)² + (0.1 - 0.0)²]
    = (1/3) × [0.09 + 0.04 + 0.01]
    = (1/3) × 0.14
    = 0.0467

MSE works well for regression problems (predicting continuous numbers), but for classification problems, like predicting the next token from a vocabulary of 128,000 options, there’s a better choice.
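The MSE calculation above takes one line of NumPy. This sketch just reproduces the worked example:

```python
import numpy as np

predicted = np.array([0.7, 0.2, 0.1])
actual = np.array([1.0, 0.0, 0.0])

# Difference, squared, then averaged over the three values
mse = np.mean((predicted - actual) ** 2)
print(round(float(mse), 4))  # 0.0467
```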

Cross-Entropy Loss: The Standard for Language Models

Cross-entropy loss is the loss function used by virtually every language model. It measures the difference between two probability distributions: the model’s predicted probabilities and the true answer.

Here’s the formula for a single prediction:

Loss = -log(predicted probability of the correct answer)

That’s it. You take the probability the model assigned to the correct token, take its natural logarithm (base e, which is what log means throughout this chapter), and negate it.

Why does this work? Let’s see with examples:

  • If the model assigns probability 0.99 to the correct token: Loss = -log(0.99) = 0.01 (very small, good prediction)
  • If the model assigns probability 0.5 to the correct token: Loss = -log(0.5) = 0.69 (moderate, uncertain prediction)
  • If the model assigns probability 0.01 to the correct token: Loss = -log(0.01) = 4.61 (very large, terrible prediction)

The logarithm has a useful property here: it penalizes confident wrong answers much more harshly than uncertain ones. If the model is 99% sure of the wrong answer (meaning it assigned only 1% to the correct answer), the loss is 4.61. If it’s 50/50, the loss is only 0.69. This harsh penalty for confident mistakes is exactly what you want. It forces the model to be well-calibrated, not just accurate.

In the general form for multiple classes (like a vocabulary of tokens), cross-entropy loss is:

Loss = -Σ(true_i × log(predicted_i))

Where the sum is over all classes. Since the true distribution is typically a one-hot vector (1 for the correct class, 0 for everything else), this simplifies to just -log(predicted probability of the correct class).
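The following sketch checks that the general formula and the shortcut agree, reusing the three-class prediction from the MSE example. The function name cross_entropy is chosen for this example; it assumes every predicted probability is greater than zero, since log(0) is undefined.

```python
import numpy as np

def cross_entropy(true_dist, predicted):
    """-sum(true_i * log(predicted_i)), using the natural logarithm."""
    return -np.sum(true_dist * np.log(predicted))

predicted = np.array([0.7, 0.2, 0.1])
one_hot = np.array([1.0, 0.0, 0.0])    # class 0 is the correct answer

print(round(float(cross_entropy(one_hot, predicted)), 4))  # 0.3567
print(round(float(-np.log(0.7)), 4))                       # 0.3567

# The three probabilities from the bullet list above
for p in (0.99, 0.5, 0.01):
    print(p, round(float(-np.log(p)), 2))
# 0.99 0.01
# 0.5 0.69
# 0.01 4.61
```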

Cross-Entropy in Language Models

When a language model processes “The capital of France is” and needs to predict the next token, it produces a probability distribution over its entire vocabulary, say, 128,000 tokens. The correct answer is “Paris.” Cross-entropy loss looks at the probability the model assigned to “Paris” and computes -log of that probability.

  • If the model assigned 0.92 to “Paris”: Loss = -log(0.92) = 0.083
  • If the model assigned 0.01 to “Paris”: Loss = -log(0.01) = 4.605

During training, the model processes trillions of tokens, and the loss is averaged across all of them. The goal of training is to minimize this average loss, to make the model assign high probability to the correct next token as often as possible.

This is the number you see in training logs when researchers report “training loss.” A lower number means the model is making better predictions. When people say a model was “trained to convergence,” they mean the loss stopped decreasing meaningfully.
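To connect this with the softmax from Chapter 2, here is a toy sketch of the full path from logits to loss. The four-token vocabulary and the logit values are invented for illustration; a real model does the same thing over its whole vocabulary.

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Hypothetical tiny vocabulary and made-up logits for "The capital of France is"
vocab = ["Paris", "London", "Rome", "banana"]
logits = np.array([4.0, 1.5, 1.0, -2.0])

probs = softmax(logits)
correct = vocab.index("Paris")
loss = -np.log(probs[correct])          # cross-entropy for this one prediction

for token, p in zip(vocab, probs):
    print(f"{token:>7s}  {p:.3f}")
print(f"loss = {loss:.3f}")
```

During training, this per-token loss would be computed at every position and averaged.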


Backpropagation: How the Network Learns from Mistakes

Now we get to the heart of how neural networks learn. The network makes a prediction, the loss function measures how wrong it was, and then the network needs to figure out: which weights should I adjust, and by how much, to make the prediction less wrong next time?

This is the problem that backpropagation solves. It’s the algorithm that computes how much each weight in the network contributed to the error, so that each weight can be nudged in the right direction.

The Key Idea: The Chain Rule

Backpropagation is based on a concept from calculus called the chain rule. You don’t need to have studied calculus to understand the intuition.

The chain rule says: if A affects B, and B affects C, then you can figure out how A affects C by multiplying “how A affects B” by “how B affects C.”

Here’s a concrete example using a language model. Suppose:

  • Increasing a particular weight by 0.01 increases the logit for “Paris” by 0.5.
  • Increasing the logit for “Paris” by 1.0 increases the softmax probability for “Paris” by 0.02.

How much does that weight affect the probability of “Paris”?

Effect of weight on probability = (effect of weight on logit) × (effect of logit on probability)
                                = 0.5 × 0.02
                                = 0.01

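That’s the chain rule. You multiply the effects along the chain: nudging the weight up by 0.01 raises the logit by 0.5, and a logit change of 0.5 raises the probability by about 0.5 × 0.02 = 0.01.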

In a neural network, the chain works like this:

weights → neuron outputs → next layer outputs → ... → final prediction → loss

Each weight affects the neuron’s output, which affects the next layer, which eventually affects the loss. Backpropagation uses the chain rule to trace this path backward, from the loss all the way back to each individual weight, computing how much each weight contributed to the error.

The Gradient: Which Direction to Adjust

The result of backpropagation is a gradient for each weight. The gradient is a number that tells you two things:

  1. Direction: Should this weight increase or decrease to reduce the loss? (The sign of the gradient tells you this.)
  2. Magnitude: How much does this weight affect the loss? (The size of the gradient tells you this.)

If the gradient of a weight is +0.3, it means: increasing this weight would increase the loss (make predictions worse), so we should decrease it. If the gradient is -0.5, it means: increasing this weight would decrease the loss (make predictions better), so we should increase it.

Walking Through Backpropagation Step by Step

Let’s trace backpropagation through a tiny network with one neuron, one input, one weight, and one bias. This is the simplest possible case, but the principle scales to networks with billions of parameters.

Setup:

  • Input: x = 2.0
  • Weight: w = 0.5
  • Bias: b = 0.1
  • Activation: sigmoid
  • True answer: y = 1.0

Forward pass (compute the prediction):

Step 1: z = w × x + b = 0.5 × 2.0 + 0.1 = 1.1
Step 2: prediction = sigmoid(1.1) = 1 / (1 + e^(-1.1)) = 0.750
Step 3: loss = (prediction - y)² = (0.750 - 1.0)² = 0.0625

The network predicted 0.750, but the true answer is 1.0. The loss is 0.0625.

Backward pass (compute gradients using the chain rule):

We need to find: how does changing w affect the loss? We trace backward through each step.

Step 3 (backward): How does the prediction affect the loss?
  d(loss)/d(prediction) = 2 × (prediction - y) = 2 × (0.750 - 1.0) = -0.500

Step 2 (backward): How does z affect the prediction?
  d(prediction)/d(z) = sigmoid(z) × (1 - sigmoid(z)) = 0.750 × 0.250 = 0.1875

Step 1 (backward): How does w affect z?
  d(z)/d(w) = x = 2.0

Now apply the chain rule. Multiply all the effects together:

d(loss)/d(w) = d(loss)/d(prediction) × d(prediction)/d(z) × d(z)/d(w)
             = -0.500 × 0.1875 × 2.0
             = -0.1875

The gradient of w is -0.1875. This means: increasing w would decrease the loss (because the gradient is negative). So we should increase w to make the prediction closer to 1.0. That makes sense: the prediction was 0.750, which is too low, and because the input is positive, increasing the weight increases the output.

For the bias:

d(z)/d(b) = 1  (the bias is just added, so its effect is 1)

d(loss)/d(b) = -0.500 × 0.1875 × 1.0 = -0.09375

The bias gradient is -0.09375, also negative, meaning we should increase the bias too.
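A useful sanity check on any hand-derived gradient is to compare it with a numerical estimate: nudge the weight slightly, see how the loss changes, and divide. The sketch below does that for the one-neuron example. The helper name loss_fn and the step size eps are choices made here. Note that the code carries full precision, so it gives about -0.1872 rather than the -0.1875 obtained above with the prediction rounded to 0.750.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_fn(w, b, x=2.0, y=1.0):
    """Squared error of a one-input sigmoid neuron."""
    return (sigmoid(w * x + b) - y) ** 2

x, y, w, b = 2.0, 1.0, 0.5, 0.1

# Forward pass
z = w * x + b
prediction = sigmoid(z)

# Backward pass: one local derivative per step, multiplied together
d_loss_d_pred = 2 * (prediction - y)
d_pred_d_z = prediction * (1 - prediction)
grad_w = d_loss_d_pred * d_pred_d_z * x
grad_b = d_loss_d_pred * d_pred_d_z * 1.0

# Numerical check: central finite differences
eps = 1e-6
num_grad_w = (loss_fn(w + eps, b) - loss_fn(w - eps, b)) / (2 * eps)
num_grad_b = (loss_fn(w, b + eps) - loss_fn(w, b - eps)) / (2 * eps)

print(round(grad_w, 4), round(num_grad_w, 4))  # -0.1872 -0.1872
print(round(grad_b, 4), round(num_grad_b, 4))  # -0.0936 -0.0936
```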

Why “Backpropagation”?

The name comes from the direction of computation. The forward pass goes from inputs to outputs (left to right). The backward pass goes from the loss back to the weights (right to left), propagating the error signal backward through the network. At each layer, the algorithm computes the local gradient and passes it to the previous layer, which multiplies it by its own local gradient and passes it further back. This chain of multiplications is the chain rule in action.

In 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published “Learning Representations by Back-Propagating Errors” in Nature, which provided a clear, general framework for training neural networks with hidden layers using this algorithm. While the mathematical idea had been explored earlier by others, this paper made backpropagation practical and widely understood, and it remains the foundation of how every neural network is trained today, including every language model.

Source: Rumelhart, Hinton, and Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, pp. 533–536, October 1986.

Backpropagation in Deep Networks

In a network with many layers, backpropagation works the same way. It just chains together more multiplications. If a network has 48 layers (like LLaMA 4 Maverick), the gradient for a weight in layer 1 involves multiplying together the local gradients from all 48 layers. This is computationally intensive but straightforward. It’s the same chain rule applied repeatedly.

The computational cost of backpropagation is roughly 2–3 times the cost of the forward pass. If a forward pass through a model takes 1 millisecond, the backward pass takes about 2–3 milliseconds. This is one reason training is much more expensive than inference: during training, you need both the forward pass (to make predictions) and the backward pass (to compute gradients) for every single training example.


Gradient Descent: Nudging Weights to Be Less Wrong

Backpropagation tells you the gradient, the direction and magnitude of each weight’s effect on the loss. Gradient descent is the algorithm that uses those gradients to actually update the weights.

The update rule is simple:

new_weight = old_weight - learning_rate × gradient

That’s the entire algorithm. Subtract a fraction of the gradient from each weight. Let’s break down each piece.

The Learning Rate

The learning rate is a small positive number (typically between 0.0001 and 0.01) that controls how big each update step is. It’s one of the most important settings in training a neural network.

  • Too large (e.g., 0.1): The weights change too much each step. The network overshoots the optimal values and bounces around wildly, never settling on a good solution. The loss might actually increase instead of decreasing.
  • Too small (e.g., 0.0000001): The weights barely change each step. Training takes forever, and the network might get stuck in a bad solution because it can’t take big enough steps to escape.
  • Just right (e.g., 0.001): The weights change enough to make steady progress but not so much that they overshoot. The loss decreases smoothly over time.
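You can watch all three behaviors on a toy problem. The sketch below minimizes the made-up loss w², whose gradient is 2w, starting from w = 1. The specific values that count as too large or too small depend entirely on the problem; on this toy loss 0.1 happens to work well, while a real network would usually need something much smaller.

```python
def run_gradient_descent(learning_rate, steps=20, w=1.0):
    """Minimize the toy loss w**2 (gradient 2*w) and return the final loss."""
    for _ in range(steps):
        gradient = 2 * w
        w = w - learning_rate * gradient
    return w ** 2

# Too large, too small, and reasonable for this particular toy loss
for lr in (1.1, 0.0000001, 0.1):
    print(f"learning rate {lr}: final loss {run_gradient_descent(lr):.6f}")
```

With the largest rate the loss ends up far above its starting value of 1.0, with the smallest it has barely moved, and with 0.1 it is close to zero after 20 steps.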

Finding the right learning rate is one of the key challenges in training neural networks. Modern training runs typically start with a higher learning rate and gradually decrease it over time, a technique called learning rate scheduling. This lets the network make big adjustments early in training (when it’s far from a good solution) and fine adjustments later (when it’s close to optimal).

Applying Gradient Descent to Our Example

Let’s continue from our backpropagation example. We had:

  • Weight w = 0.5, gradient = -0.1875
  • Bias b = 0.1, gradient = -0.09375
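  • Learning rate = 0.1 (far larger than you’d use for a real network, chosen here so that a single step produces a visible change)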

Applying the update rule:

new_w = 0.5 - 0.1 × (-0.1875) = 0.5 + 0.01875 = 0.51875
new_b = 0.1 - 0.1 × (-0.09375) = 0.1 + 0.009375 = 0.109375

The weight increased from 0.5 to 0.51875, and the bias increased from 0.1 to 0.109375. Both moved in the direction that reduces the loss (the gradients were negative, so subtracting a negative number means adding).

Let’s verify this helped. With the new weights:

z = 0.51875 × 2.0 + 0.109375 = 1.14688
prediction = sigmoid(1.14688) = 0.759
loss = (0.759 - 1.0)² = 0.0581

The loss decreased from 0.0625 to 0.0581. The prediction moved from 0.750 to 0.759, closer to the target of 1.0. One step of gradient descent made the network slightly less wrong.

Repeat this process thousands or millions of times, and the network gradually converges on weights that produce accurate predictions. That’s training.
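Here is that repetition as a short sketch, continuing the same one-neuron example for 1,000 steps. Because the code carries full precision rather than rounding the prediction to 0.750, the first loss prints as 0.0624 instead of 0.0625.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y = 2.0, 1.0          # input and target from the example
w, b = 0.5, 0.1          # starting weight and bias
learning_rate = 0.1
losses = []

for step in range(1000):
    # Forward pass and loss
    prediction = sigmoid(w * x + b)
    losses.append((prediction - y) ** 2)

    # Backward pass: d(loss)/d(z), shared by both gradients
    d_z = 2 * (prediction - y) * prediction * (1 - prediction)

    # Gradient descent update
    w -= learning_rate * d_z * x
    b -= learning_rate * d_z

print(round(losses[0], 4), round(losses[1], 4))  # 0.0624 0.0581
print(losses[-1] < losses[1] < losses[0])        # True
```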

Why “Descent”?

The name “gradient descent” comes from the idea of descending a hill. Imagine the loss as a landscape, a surface with hills and valleys. The height at any point represents the loss for a particular set of weights. The gradient points uphill (in the direction of steepest increase). By moving in the opposite direction (subtracting the gradient), you move downhill, toward lower loss.

Each step of gradient descent moves you a little bit downhill. Over many steps, you reach a valley, a set of weights where the loss is low and the predictions are good.

In reality, the “landscape” isn’t 2D or 3D. It has as many dimensions as there are weights in the network. LLaMA 4 Maverick has 400 billion parameters, so its loss landscape has 400 billion dimensions. You can’t visualize this, but the math works the same way: the gradient points uphill in this high-dimensional space, and you move in the opposite direction.

Stochastic Gradient Descent (SGD) and Mini-Batches

In the description above, we computed the gradient using a single training example. In practice, training data contains billions of examples. Computing the gradient on all of them at once would be impossibly slow.

The solution is Stochastic Gradient Descent (SGD): instead of computing the gradient on the entire dataset, compute it on a small random subset called a mini-batch, typically 32 to 4,096 examples. The gradient from a mini-batch is a noisy approximation of the true gradient, but it’s good enough to make progress, and it’s much faster to compute.

Modern language model training uses mini-batches of thousands of sequences. For example, a training step might process 2,048 sequences of 4,096 tokens each, about 8 million tokens per step. The gradients are averaged across all tokens in the mini-batch, and the weights are updated once per step.
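The mini-batch idea is easy to see on a problem far smaller than a language model. The sketch below fits a single weight to a made-up dataset where y is roughly 3x; the dataset, the batch size of 32, and the learning rate are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dataset: y = 3x plus a little noise
X = rng.normal(size=1000)
y = 3.0 * X + rng.normal(scale=0.1, size=1000)

w = 0.0
learning_rate = 0.1
batch_size = 32

for step in range(200):
    # Draw a random mini-batch instead of using all 1,000 examples
    idx = rng.choice(len(X), size=batch_size, replace=False)
    xb, yb = X[idx], y[idx]

    # Gradient of the MSE loss with respect to w, averaged over the batch
    gradient = np.mean(2 * (w * xb - yb) * xb)

    # One weight update per mini-batch
    w -= learning_rate * gradient

print(round(float(w), 2))  # should land close to 3
```

Each step sees only 32 of the 1,000 examples, so the individual gradients are noisy, yet the weight still settles near the value that generated the data.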

Modern Optimizers: Adam

Plain gradient descent (subtract learning_rate × gradient) works but is slow. Modern neural networks use more sophisticated optimizers that adapt the learning rate for each weight individually. The most widely used is Adam (Adaptive Moment Estimation), published by Kingma and Ba in 2015.

Adam keeps track of two things for each weight:

  1. The running average of the gradient (which direction has the gradient been pointing recently?)
  2. The running average of the squared gradient (how large have the gradients been recently?)

It uses these to adjust the effective learning rate for each weight. Weights with consistently large gradients get smaller learning rates (to avoid overshooting), and weights with small gradients get larger learning rates (to speed up learning). Nearly every language model training run uses Adam or a variant of it.

Source: Kingma and Ba, “Adam: A Method for Stochastic Optimization,” ICLR 2015.
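A single Adam update fits in a few lines. This is a sketch rather than a production optimizer: the function name adam_step is made up for this example, and the default values shown are the ones suggested in the Kingma and Ba paper. It also includes the paper’s bias-correction step, which compensates for the running averages starting at zero.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the step count, starting at 1."""
    m = beta1 * m + (1 - beta1) * grad          # running average of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2     # running average of the squared gradient
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, 1.0])
m = np.zeros_like(w)
v = np.zeros_like(w)

# Made-up gradients that differ by a factor of 10,000
grad = np.array([200.0, 0.02])

w_new, m, v = adam_step(w, grad, m, v, t=1)
print(w - w_new)   # both entries are about 0.001
```

Plain gradient descent would move the first weight 10,000 times farther than the second. Adam moves both by roughly the learning rate, which is what adapting the step size per weight means in practice.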


The Vanishing Gradient Problem

There’s a critical issue that held back neural network research for decades, and understanding it explains why modern networks use ReLU instead of sigmoid.

Remember that backpropagation works by multiplying gradients together along the chain from the loss back to each weight. In a network with many layers, the gradient for an early layer is the product of many numbers, one from each layer in between.

Here’s the problem with sigmoid: the derivative of the sigmoid function is always between 0 and 0.25. The maximum value of sigmoid’(z) is 0.25, occurring at z = 0. For large positive or negative z, the derivative approaches 0.

Source: The derivative of sigmoid(z) is sigmoid(z) × (1 - sigmoid(z)), which has a maximum of 0.25 at z = 0.

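When you multiply many numbers that are each at most 0.25, the result shrinks exponentially. Even in the best case, where every layer contributes the full 0.25 (and setting aside the weights, which also enter the product):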

Layer 10 gradient: 0.25^10 = 0.00000095
Layer 20 gradient: 0.25^20 = 0.0000000000009
Layer 48 gradient: 0.25^48 ≈ 0  (effectively zero)

By the time the gradient signal reaches the early layers, it has been multiplied by so many small numbers that it’s essentially zero. The early layers receive no useful learning signal. Their weights barely change during training. This is the vanishing gradient problem, and it’s why deep networks with sigmoid activations were nearly impossible to train.

This problem is a major reason why deep networks stayed out of reach for so long. Even after backpropagation was popularized in 1986, training networks deeper than a few layers remained extremely difficult.

How ReLU Solves It

ReLU’s derivative is either 0 (for negative inputs) or 1 (for positive inputs). When the input is positive, the gradient passes through the activation unchanged, with no shrinking. This means that in a deep network using ReLU, the activation function itself no longer shrinks the gradient on its way from the loss back to the first layer, at least along paths where the neurons are active.

Sigmoid derivative: always ≤ 0.25 → gradients shrink exponentially
ReLU derivative:    0 or 1        → gradients pass through unchanged (when active)

This is why the shift from sigmoid to ReLU around 2010–2012 was so important. It didn’t change the fundamental architecture of neural networks. It just changed one function. But it made much deeper networks practical to train. Modern language models with 48 to 120+ layers would be extremely difficult to train with sigmoid activations.
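The contrast is easy to reproduce. The sketch below multiplies together only the activation derivatives along a chain of layers, ignoring the weights (a simplification; in a real network the weight matrices also enter the product). The second part uses z = 2 at every layer as a made-up example of a less favorable case for sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for depth in (10, 20, 48):
    sigmoid_best_case = 0.25 ** depth   # every layer sits exactly at z = 0
    relu_active = 1.0 ** depth          # every ReLU on the path is active
    print(f"{depth:>2d} layers  sigmoid: {sigmoid_best_case:.2e}  ReLU: {relu_active:.0f}")

# Away from z = 0 the sigmoid derivative is smaller still
s = sigmoid(2.0)
print(f"sigmoid derivative at z = 2: {s * (1 - s):.3f}")
print(f"after 10 such layers: {(s * (1 - s)) ** 10:.2e}")
```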

Modern Transformer models use smoother, gated relatives of ReLU such as SwiGLU (used in LLaMA, Mistral, and other recent models), which we’ll cover in Chapter 9. These variants maintain the gradient-friendly properties of ReLU while adding some additional flexibility.


The Complete Training Loop

Let’s put everything together. Training a neural network follows this loop, repeated millions or billions of times:

  1. Forward pass: Feed input through the network to get a prediction.
  2. Compute loss: Compare the prediction to the correct answer using the loss function.
  3. Backward pass (backpropagation): Compute the gradient of the loss with respect to every weight.
  4. Update weights (gradient descent): Adjust each weight by subtracting learning_rate × gradient.
  5. Repeat with the next batch of training data.

Each complete pass through the entire training dataset is called an epoch. Training a small model might take 10–100 epochs. Training a frontier language model involves processing trillions of tokens, a dataset so large that scale is usually described in tokens processed rather than in epochs.

The loss typically decreases rapidly at first (the network is learning the most obvious patterns) and then more slowly (it’s refining subtle details). When the loss stops decreasing meaningfully, training is complete. The model has converged.


Real Example: Training a Network to Learn XOR from Scratch

Now let’s put everything together with a real, runnable example. We’ll train a neural network to solve the XOR problem, the same problem that Minsky and Papert proved a single-layer perceptron couldn’t solve in 1969.

What Is XOR?

XOR (exclusive or) is a logical operation that returns 1 when exactly one of its two inputs is 1, and 0 otherwise:

Input 1   Input 2   XOR Output
   0         0          0
   0         1          1
   1         0          1
   1         1          0

This is trivial for a human but impossible for a single neuron. Why? Because XOR is not linearly separable. You can’t draw a single straight line on a 2D plot that separates the 0 outputs from the 1 outputs. Plot the four points: (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0. The 1s are on opposite corners. No single line can separate them.

A single neuron computes a weighted sum plus bias, which defines a straight line (or hyperplane in higher dimensions). It can learn AND, OR, and NOT (all linearly separable) but not XOR. This is exactly what Minsky and Papert proved.

The solution: add a hidden layer. With two layers of neurons, the network can learn to first transform the inputs into a space where XOR is linearly separable, and then classify them. Let’s build this.
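Before training anything, it helps to see that such a two-layer solution exists. The sketch below wires one up by hand using the perceptron’s step activation. The weights are hand-picked for illustration, not learned: one hidden neuron computes OR, the other computes NAND, and the output neuron computes AND of the two.

```python
import numpy as np

def step(z):
    """Perceptron-style activation: 1 if z is positive, else 0."""
    return (z > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hand-picked (not learned) weights
W1 = np.array([[1.0, -1.0],
               [1.0, -1.0]])     # column 0: OR neuron, column 1: NAND neuron
b1 = np.array([-0.5, 1.5])
W2 = np.array([1.0, 1.0])        # output neuron: AND of the two hidden neurons
b2 = -1.5

hidden = step(X @ W1 + b1)
output = step(hidden @ W2 + b2)

print(hidden.tolist())   # [[0, 1], [1, 1], [1, 1], [1, 0]]
print(output.tolist())   # [0, 1, 1, 0]
```

In the hidden layer’s output, both inputs that should produce 1 land on the same point, (1, 1), while the two that should produce 0 land on (0, 1) and (1, 0). A single straight line now separates them, which is exactly the transformation the hidden layer is there to provide.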

The Network Architecture

We’ll build a network with:

  • Input layer: 2 inputs (the two XOR inputs)
  • Hidden layer: 2 neurons with sigmoid activation
  • Output layer: 1 neuron with sigmoid activation

Total parameters: (2×2 + 2) + (2×1 + 1) = 4 + 2 + 2 + 1 = 9 weights and biases.

This is the smallest network that can solve XOR. Let’s implement it from scratch in Python, no PyTorch, no TensorFlow, just NumPy.

The Complete Code

import numpy as np

# --- Activation functions ---
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(a):
    """Derivative of sigmoid, given the sigmoid output a."""
    return a * (1 - a)

# --- Training data: XOR ---
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # 4 examples, 2 inputs each
y = np.array([[0], [1], [1], [0]])                 # 4 targets

# --- Initialize weights randomly ---
np.random.seed(0)
W1 = np.random.randn(2, 2) * 0.5   # input→hidden: 2 inputs, 2 neurons
b1 = np.zeros((1, 2))               # hidden biases
W2 = np.random.randn(2, 1) * 0.5   # hidden→output: 2 inputs, 1 neuron
b2 = np.zeros((1, 1))               # output bias

learning_rate = 2.0   # large by everyday standards; this tiny problem tolerates it
losses = []

for epoch in range(10000):
    # --- Forward pass ---
    z1 = X @ W1 + b1              # hidden pre-activation: [4×2] @ [2×2] = [4×2]
    a1 = sigmoid(z1)              # hidden activation
    z2 = a1 @ W2 + b2             # output pre-activation: [4×2] @ [2×1] = [4×1]
    a2 = sigmoid(z2)              # output (prediction)

    # --- Compute loss (MSE) ---
    loss = np.mean((a2 - y) ** 2)
    losses.append(loss)

    # --- Backward pass ---
    # Output layer gradients
    d_a2 = 2 * (a2 - y) / 4                  # d(loss)/d(a2)
    d_z2 = d_a2 * sigmoid_derivative(a2)      # d(loss)/d(z2)
    d_W2 = a1.T @ d_z2                        # d(loss)/d(W2)
    d_b2 = np.sum(d_z2, axis=0, keepdims=True)

    # Hidden layer gradients (chain rule continues backward)
    d_a1 = d_z2 @ W2.T                        # d(loss)/d(a1)
    d_z1 = d_a1 * sigmoid_derivative(a1)      # d(loss)/d(z1)
    d_W1 = X.T @ d_z1                         # d(loss)/d(W1)
    d_b1 = np.sum(d_z1, axis=0, keepdims=True)

    # --- Update weights (gradient descent) ---
    W2 -= learning_rate * d_W2
    b2 -= learning_rate * d_b2
    W1 -= learning_rate * d_W1
    b1 -= learning_rate * d_b1

    if epoch % 2000 == 0:
        print(f"Epoch {epoch:>5d}  Loss: {loss:.6f}")

# --- Test the trained network ---
print("\nTrained predictions:")
for i in range(4):
    z1 = X[i:i+1] @ W1 + b1
    a1 = sigmoid(z1)
    z2 = a1 @ W2 + b2
    a2 = sigmoid(z2)
    print(f"  Input: {X[i]}  Predicted: {a2[0,0]:.4f}  Target: {y[i,0]}")

print(f"\nFinal loss: {losses[-1]:.6f}")
print(f"\nLearned weights:")
print(f"  W1 (input→hidden):\n{W1}")
print(f"  b1 (hidden biases): {b1}")
print(f"  W2 (hidden→output):\n{W2}")
print(f"  b2 (output bias): {b2}")

Running the Code

Save this as xor_network.py and run it with python xor_network.py. You’ll see output like:

Epoch     0  Loss: 0.254547
Epoch  2000  Loss: 0.000924
Epoch  4000  Loss: 0.000383
Epoch  6000  Loss: 0.000239
Epoch  8000  Loss: 0.000173

Trained predictions:
  Input: [0 0]  Predicted: 0.0129  Target: 0
  Input: [0 1]  Predicted: 0.9890  Target: 1
  Input: [1 0]  Predicted: 0.9889  Target: 1
  Input: [1 1]  Predicted: 0.0114  Target: 0

Final loss: 0.000135

The network learned XOR. The predictions are close to the targets: inputs [0,0] and [1,1] produce values near 0 (0.013 and 0.011), while inputs [0,1] and [1,0] produce values near 1 (0.989). The loss dropped from 0.25 to 0.00014 over 10,000 epochs.

(Note: The exact numbers may differ slightly on your machine because of floating-point differences across platforms and library builds. The pattern of decreasing loss and predictions converging to the correct values should be the same.)

Walking Through the Code

Let’s trace exactly what each part does.

Initialization:

W1 = np.random.randn(2, 2) * 0.5

This creates a 2×2 matrix of random numbers drawn from a standard normal distribution, scaled by 0.5. These are the initial weights for the hidden layer: 2 inputs connected to 2 neurons. The * 0.5 keeps the initial values small, which helps training start smoothly. The biases start at zero.

Forward pass:

z1 = X @ W1 + b1    # matrix multiply: [4×2] @ [2×2] = [4×2]
a1 = sigmoid(z1)     # apply activation element-wise
z2 = a1 @ W2 + b2   # matrix multiply: [4×2] @ [2×1] = [4×1]
a2 = sigmoid(z2)     # final prediction

This processes all 4 training examples simultaneously using matrix multiplication, exactly as described in Chapter 2. The input matrix X has 4 rows (one per example) and 2 columns (one per input). After multiplying by W1 (2×2), we get a 4×2 matrix: 4 examples, each with 2 hidden neuron values. After the second layer, we get a 4×1 matrix: one prediction per example.

Loss computation:

loss = np.mean((a2 - y) ** 2)

Mean squared error: subtract the target from the prediction, square it, and average across all 4 examples.
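As a quick check of the formula, the sketch below recomputes the loss by hand from the four rounded predictions printed in the sample output under “Running the Code”. Because those predictions are rounded to four decimals, the result only matches the reported final loss to about six decimal places.

```python
# Predictions and targets copied from the sample output above.
predictions = [0.0129, 0.9890, 0.9889, 0.0114]
targets = [0, 1, 1, 0]

# Subtract the target, square, then average across the 4 examples.
squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
mse = sum(squared_errors) / len(squared_errors)

for p, t, se in zip(predictions, targets, squared_errors):
    print(f"prediction {p:.4f}  target {t}  squared error {se:.8f}")
print(f"MSE: {mse:.6f}")  # 0.000135
```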

Backward pass:

d_a2 = 2 * (a2 - y) / 4                    # gradient of MSE w.r.t. predictions
d_z2 = d_a2 * sigmoid_derivative(a2)        # chain rule through sigmoid
d_W2 = a1.T @ d_z2                          # chain rule through matrix multiply

This is backpropagation in action. Each line applies the chain rule to compute the gradient of the loss with respect to one variable, working backward from the loss to the weights. The a1.T @ d_z2 line is a matrix multiplication that computes the gradient for all weights in W2 simultaneously. It is the same operation you saw in Chapter 2, used here for gradient computation instead of forward computation. The hidden-layer lines in the full listing (d_a1, d_z1, d_W1) repeat the same pattern one layer further back.
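One common way to gain confidence in hand-written gradient formulas is a numerical gradient check: nudge each weight by a tiny amount, measure how the loss changes, and compare that with the backpropagation result. The sketch below is not part of the training script. The helper numerical_gradient, the seed, and the step size eps are our own illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)  # illustrative seed, not the one used above
W1, b1 = rng.normal(size=(2, 2)) * 0.5, np.zeros((1, 2))
W2, b2 = rng.normal(size=(2, 1)) * 0.5, np.zeros((1, 1))

def loss_fn():
    """Forward pass plus MSE, reading the current W1, b1, W2, b2."""
    a1 = sigmoid(X @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)
    return np.mean((a2 - y) ** 2)

def numerical_gradient(param, eps=1e-6):
    """Central-difference estimate of d(loss)/d(param), one entry at a time."""
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        original = param[idx]
        param[idx] = original + eps
        loss_plus = loss_fn()
        param[idx] = original - eps
        loss_minus = loss_fn()
        param[idx] = original  # restore the weight before moving on
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

# Analytic gradients, using the same formulas as the training loop.
a1 = sigmoid(X @ W1 + b1)
a2 = sigmoid(a1 @ W2 + b2)
d_z2 = (2 * (a2 - y) / len(X)) * a2 * (1 - a2)
d_W2 = a1.T @ d_z2
d_z1 = (d_z2 @ W2.T) * a1 * (1 - a1)
d_W1 = X.T @ d_z1

num_W2 = numerical_gradient(W2)
num_W1 = numerical_gradient(W1)
print("max difference for W2:", np.max(np.abs(d_W2 - num_W2)))
print("max difference for W1:", np.max(np.abs(d_W1 - num_W1)))
```

If the backpropagation formulas are right, both differences should be tiny, limited only by floating-point precision. A large difference usually points to a missing factor or a transposed matrix.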

Weight update:

W2 -= learning_rate * d_W2

Gradient descent: subtract the gradient (scaled by the learning rate) from each weight. This is the update rule we discussed earlier.

What the Hidden Layer Learned

The hidden layer is the key to solving XOR. After training, the two hidden neurons have learned to transform the inputs into a new representation where XOR becomes linearly separable.

You can see this by looking at the hidden layer activations for each input:

for i in range(4):
    z1 = X[i:i+1] @ W1 + b1
    a1 = sigmoid(z1)
    print(f"  Input: {X[i]}  Hidden: [{a1[0,0]:.3f}, {a1[0,1]:.3f}]")

This will show something like:

  Input: [0 0]  Hidden: [0.044, 0.001]
  Input: [0 1]  Hidden: [0.979, 0.074]
  Input: [1 0]  Hidden: [0.976, 0.071]
  Input: [1 1]  Hidden: [1.000, 0.902]

In this transformed space, the XOR-1 inputs ([0,1] and [1,0]) have similar hidden values (both around [0.98, 0.07]), while the XOR-0 inputs land far away on either side: [0,0] maps to [0.044, 0.001] and [1,1] maps to [1.000, 0.902]. The output neuron can now draw a single line that separates the two classes, which is exactly what a single neuron can do.
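To see that one line really is enough, the sketch below takes the hidden values printed above and tests them against a hand-picked line, h1 - h2 = 0.5. The line is a hypothetical choice for illustration, not the W2 and b2 the network actually learned; any line that separates the points would do.

```python
# Hidden activations copied from the sample output above, with XOR targets.
hidden = {
    (0, 0): (0.044, 0.001),
    (0, 1): (0.979, 0.074),
    (1, 0): (0.976, 0.071),
    (1, 1): (1.000, 0.902),
}
targets = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

# Hypothetical hand-picked line in hidden space: h1 - h2 - 0.5 = 0.
weight_h1, weight_h2, bias = 1.0, -1.0, -0.5

predictions = {}
for inputs, (h1, h2) in hidden.items():
    score = weight_h1 * h1 + weight_h2 * h2 + bias
    predictions[inputs] = 1 if score > 0 else 0  # which side of the line?
    print(f"Input {inputs}  score {score:+.3f}  "
          f"class {predictions[inputs]}  target {targets[inputs]}")
```

All four points fall on the correct side of the line, which is not possible with any single line drawn through the original inputs.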

This is the fundamental power of deep networks: each layer transforms the data into a representation that makes the next layer’s job easier. The hidden layer doesn’t solve XOR directly. It transforms the problem into one that a single neuron can solve. In a language model with 48 layers, each layer similarly transforms the token representations to make the final prediction easier.


From XOR to Language Models: Connecting the Dots

The XOR network we just built has 9 parameters. The smallest GPT-2 model has 124 million. LLaMA 4 Maverick has 400 billion. But the training process is fundamentally the same:

  1. Forward pass: feed input through layers of matrix multiplications and activations.
  2. Compute loss: cross-entropy loss on the predicted next token vs. the actual next token.
  3. Backward pass: backpropagation computes gradients for every weight.
  4. Update weights: an optimizer (Adam, not plain gradient descent) adjusts each weight.
  5. Repeat on the next batch of training data.
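Step 2 swaps MSE for cross-entropy. As a minimal sketch of what that loss looks like for a single next-token prediction, the code below applies softmax (from Chapter 2) to some scores and takes -log of the probability assigned to the correct token. The four-word vocabulary and the scores are made up for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(v - largest) for v in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probabilities, correct_index):
    """Loss is -log of the probability assigned to the correct token."""
    return -math.log(probabilities[correct_index])

# Hypothetical 4-token vocabulary and made-up scores, for illustration only.
vocab = ["cat", "dog", "sat", "mat"]
logits = [1.0, 0.5, 3.0, 0.2]
probs = softmax(logits)

loss_if_sat = cross_entropy(probs, vocab.index("sat"))  # token the model favored
loss_if_mat = cross_entropy(probs, vocab.index("mat"))  # token the model doubted
print(f"loss when the actual next token is 'sat': {loss_if_sat:.3f}")
print(f"loss when the actual next token is 'mat': {loss_if_mat:.3f}")
```

The loss is small when the model put high probability on the token that actually came next, and large when it did not.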

The differences are in scale, not in kind:

| Property | XOR Network | GPT-2 | LLaMA 4 Maverick |
|---|---|---|---|
| Parameters | 9 | 124 million | 400 billion |
| Training examples | 4 | ~8 billion tokens | Trillions of tokens |
| Layers | 2 | 12 | 48 |
| Activation | Sigmoid | GELU | SwiGLU |
| Optimizer | Plain SGD | Adam | AdamW |
| Loss function | MSE | Cross-entropy | Cross-entropy |
| Training time | < 1 second | Days on 8 GPUs | Months on thousands of GPUs |
| Training cost | Free | ~$50,000 (2019) | Hundreds of millions of dollars |

Sources: GPT-2 architecture from OpenAI (2019); LLaMA 4 Maverick from Meta AI (April 2025). GPT-2 training cost estimated at ~$43,000 in compute based on 32 TPU v3 chips for 168 hours (Karpathy, 2024); frontier model training costs from industry reports.

The activation functions have evolved (sigmoid → ReLU → GELU → SwiGLU), the optimizers have gotten smarter (SGD → Adam → AdamW), and the parameter count has grown by a factor of more than 40 billion (400 billion ÷ 9). But the core loop (forward, loss, backward, update) is identical to what we just implemented.
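To give a feel for what “smarter” means, here is a minimal sketch of one Adam update for a single scalar weight, using the commonly cited default decay rates (0.9 and 0.999). The toy objective (w - 3)² and the learning rate are assumptions chosen for illustration. Real training applies the same arithmetic to every weight at once.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar weight."""
    m = beta1 * m + (1 - beta1) * grad        # running average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Toy problem (an assumption for illustration): minimize (w - 3)^2.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 1001):
    grad = 2 * (w - 3)  # derivative of (w - 3)^2
    w, m, v = adam_step(w, grad, m, v, t)

print(f"w after 1000 Adam steps: {w:.4f}")  # should end up close to 3
```

Compared with the plain update W -= learning_rate * gradient, the step is divided by a running estimate of the gradient’s typical size, so each weight effectively gets its own step size.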

Every weight in GPT-5, Claude, Gemini, and every other language model was set by this same process: make a prediction, measure how wrong it was, compute gradients via backpropagation, and nudge the weights to be less wrong, repeated across trillions of training tokens.


Key Takeaways

  • A neuron computes three things: multiply inputs by weights, add a bias, and apply an activation function. The formula is: output = activation(w · x + b). This is the building block of every neural network.

  • Activation functions introduce nonlinearity, which is what allows neural networks to learn complex patterns. Sigmoid squashes values to (0, 1) but causes the vanishing gradient problem in deep networks. ReLU (max(0, x)) solved this by letting gradients flow unchanged through positive activations, enabling the training of deep networks.

  • A layer is a group of neurons that process inputs simultaneously. Mathematically, a layer is a matrix multiplication followed by an activation function. Deep networks stack many layers, with each layer building more complex representations on top of the previous layer’s output.

  • The loss function measures how wrong the network’s prediction is. Mean Squared Error works for regression. Cross-entropy loss is the standard for classification and language models. It computes -log(probability assigned to the correct answer), penalizing confident wrong predictions harshly.

  • Backpropagation computes the gradient of the loss with respect to every weight by applying the chain rule backward through the network. The gradient tells you which direction to adjust each weight and by how much.

  • Gradient descent updates weights by subtracting learning_rate × gradient. The learning rate controls step size: a rate that is too large overshoots the minimum, and one that is too small makes training slow. Modern models use the Adam optimizer, which adapts the learning rate per weight.

  • The vanishing gradient problem occurs when gradients shrink exponentially in deep networks using sigmoid (whose derivative is always ≤ 0.25). ReLU solved this with a derivative of 0 or 1, enabling networks with dozens or hundreds of layers.

  • The XOR problem (unsolvable by a single neuron) is solved by adding a hidden layer that transforms the inputs into a linearly separable representation. This demonstrates the fundamental power of deep networks: each layer transforms data to make the next layer’s job easier.

  • The training loop for every neural network, from our 9-parameter XOR network to trillion-parameter language models, is the same: forward pass → compute loss → backpropagation → weight update → repeat.
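To make the vanishing-gradient takeaway concrete, the sketch below multiplies the best-case sigmoid derivative (0.25) across several layers and compares it with ReLU’s derivative of 1 for positive inputs. It ignores the weights entirely, which is a simplifying assumption, so it only tracks the factor contributed by the activation functions.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def sigmoid_derivative(z):
    a = sigmoid(z)
    return a * (1 - a)

# The derivative peaks at z = 0, where it equals 0.25.
peak = sigmoid_derivative(0.0)
print(f"sigmoid derivative at z = 0: {peak}")

# Best case for sigmoid: every layer sits exactly at that peak.
for depth in (2, 10, 50):
    sigmoid_factor = peak ** depth
    relu_factor = 1.0 ** depth  # ReLU derivative is 1 for positive inputs
    print(f"{depth:>2} layers  sigmoid: {sigmoid_factor:.3e}  ReLU: {relu_factor:.0f}")
```

Even in this best case, ten sigmoid layers shrink the gradient to less than a millionth of its original size, while the ReLU factor stays at 1.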


What’s Next

You now understand the building blocks: neurons, layers, loss functions, backpropagation, and gradient descent. These are the components that make up every neural network. But language models don’t work on raw text. They need to convert words into numbers first. In Chapter 4, we’ll cover tokenization: how text is broken into tokens, how Byte Pair Encoding works step by step, and why the choice of tokenizer affects everything from model performance to how much your API calls cost.