Chapter 5. Embeddings, Giving Tokens Meaning

In Chapter 4, you learned how text is broken into tokens and how each token is assigned an integer ID from a fixed vocabulary. But an integer by itself carries no meaning. The number 13225 (the token ID for “Hello” in GPT-4o’s tokenizer) tells the model nothing about greetings, friendliness, or the English language. To do anything useful, the model needs to convert each token ID into a rich numerical representation that captures what the token means, how it relates to other tokens, and how it behaves in different contexts. That representation is called an embedding, and the structure that stores all of them is called the embedding table.


The Embedding Table: A Giant Lookup Table

The embedding table is one of the simplest components in a language model, and also one of the most important. It is a two-dimensional grid of numbers (a matrix) with one row for every token in the vocabulary and one column for every dimension in the model’s hidden representation.

If the vocabulary has V tokens and the model uses D dimensions, the embedding table is a matrix of size V x D. Each row is a vector of D numbers that represents one token.

Here are the real dimensions for several production models:

Model              Vocabulary Size (V)   Embedding Dimension (D)   Embedding Table Size (V x D)   Size in float16
------------------------------------------------------------------------------------------------------------------
GPT-2              50,257                768                       38.6 million                   ~77 MB
Mistral 7B         32,000                4,096                     131 million                    ~262 MB
GPT-3 175B         50,257                12,288                    617 million                    ~1.2 GB
DeepSeek-V3        128,000               7,168                     917 million                    ~1.8 GB
LLaMA 4 Maverick   202,048               5,120                     1.03 billion                   ~2.1 GB

Sources: GPT-2 architecture from OpenAI (2019), 50,257 tokens, 768 embedding dimensions, 12 layers; Mistral 7B from Mistral AI (September 2023), 32,000 tokens, 4,096 hidden dimensions, 32 layers; GPT-3 175B from Brown et al. (2020), 50,257 tokens (same vocabulary as GPT-2), 12,288 hidden dimensions, 96 layers; DeepSeek-V3 from DeepSeek technical report (December 2024), 128,000 tokens, 7,168 hidden dimensions, 61 layers, 671B total parameters with 37B active; LLaMA 4 Maverick from HuggingFace Transformers Llama4TextConfig and Meta AI (April 2025), 202,048 tokens, 5,120 hidden dimensions, 48 layers (alternating dense and MoE layers), 40 query heads, 8 KV heads (Grouped Query Attention), 128 experts.

Look at those numbers. LLaMA 4 Maverick’s embedding table alone contains over a billion numbers. In 16-bit floating point (2 bytes per number), that’s about 2.1 GB of memory, just for the lookup table that converts token IDs into vectors. And this is before any of the attention layers, feed-forward networks, or expert modules that make up the rest of the model.

How the Lookup Works

The lookup operation is trivially simple. When the model receives a token ID, it uses that ID as a row index into the embedding table and retrieves the corresponding vector.

If the token “Hello” has ID 13225, the model goes to row 13225 of the embedding table and pulls out a vector of D numbers. For LLaMA 4 Maverick, that’s a vector of 5,120 numbers:

Token ID 13225 ("Hello") --> row 13225 --> [0.023, -0.041, 0.118, ..., -0.007]
                                            (5,120 numbers)

There is no computation involved in this step. It is a pure table lookup, like finding a word in a dictionary by its page number. The model does not calculate the embedding; it retrieves it from a table that was learned during training.

For a sequence of tokens, the model performs this lookup for each token in parallel:

"The capital of France is Paris"
    |       |      |     |    |    |
    v       v      v     v    v    v
[vec_The, vec_capital, vec_of, vec_France, vec_is, vec_Paris]

Each token becomes a vector of 5,120 numbers. The full sequence becomes a matrix of shape [6 x 5,120]: six rows (one per token), each with 5,120 columns (one per dimension). This matrix is the input to the first Transformer layer.
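To make the lookup concrete, here is a minimal sketch in plain Python. The table sizes and token IDs are made up for illustration (a real table is far larger and holds learned values rather than random ones), and the `embed` helper is a hypothetical name, not part of any library.

```python
import random

random.seed(0)

VOCAB_SIZE = 10   # toy value; real models use tens or hundreds of thousands
EMBED_DIM = 4     # toy value; LLaMA 4 Maverick uses 5,120

# The embedding table: one row of EMBED_DIM numbers per token ID.
embedding_table = [
    [random.gauss(0.0, 0.02) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Turn a list of token IDs into a list of vectors by row indexing."""
    return [embedding_table[token_id] for token_id in token_ids]

token_ids = [3, 7, 1, 3, 9, 2]   # made-up IDs for a six-token sequence
sequence_matrix = embed(token_ids)

# Shape of the result: one row per token, one column per dimension.
print(len(sequence_matrix), "x", len(sequence_matrix[0]))

# The same ID always retrieves the same row, wherever it appears.
print(sequence_matrix[0] == sequence_matrix[3])
```

Token ID 3 appears twice in the sequence and retrieves the identical row both times, which is exactly why these initial vectors are context-free.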

Embedding Dimension vs. Hidden Dimension

In many models, the embedding dimension and the hidden dimension used inside the Transformer layers are the same. In GPT-2, both are 768. In Mistral 7B, both are 4,096. In LLaMA 4 Maverick, both are 5,120: the official HuggingFace Transformers configuration defines hidden_size = 5120 as the “dimensionality of the embeddings and hidden states.”

Source: HuggingFace Transformers Llama4TextConfig documentation, hidden_size = 5,120, described as “Dimensionality of the embeddings and hidden states.”

However, some models do use different dimensions for the embedding table and the internal hidden layers. When this happens, the model includes a projection layer (a simple matrix multiplication) that maps from the embedding dimension to the hidden dimension after the embedding lookup. This is an architectural choice that lets the model use a smaller embedding table (saving memory) while still operating at a larger internal dimension (preserving capacity). ALBERT, for example, factorizes its embeddings this way, pairing 128-dimensional token embeddings with a larger hidden size. GPT-3 175B does not: it keeps the 50,257-token vocabulary of GPT-2 and uses its full hidden dimension of 12,288 for the embeddings, so its embedding table has shape [50,257 x 12,288].
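The memory trade-off of a projection layer can be sketched with a little arithmetic. The sizes below (V, E, D) are hypothetical values chosen for illustration, and the helper names are assumptions, not any model's actual code.

```python
def table_params(vocab_size, hidden_dim):
    """Parameters in a full-width embedding table (V x D)."""
    return vocab_size * hidden_dim

def factorized_params(vocab_size, embed_dim, hidden_dim):
    """Parameters in a narrow table (V x E) plus a projection (E x D)."""
    return vocab_size * embed_dim + embed_dim * hidden_dim

def project(vector, projection):
    """Map an E-dimensional vector to D dimensions: vector @ projection."""
    hidden_dim = len(projection[0])
    return [
        sum(vector[e] * projection[e][d] for e in range(len(vector)))
        for d in range(hidden_dim)
    ]

# Hypothetical sizes chosen for illustration.
V, E, D = 30_000, 128, 768
full = table_params(V, D)
small = factorized_params(V, E, D)
print(f"full table:  {full:,} parameters")
print(f"factorized:  {small:,} parameters")

# Toy projection from 2 dimensions up to 3.
toy_projection = [[1.0, 0.0, 2.0],
                  [0.0, 1.0, -1.0]]
projected = project([0.5, 4.0], toy_projection)
print(projected)
```

With these numbers the narrow table plus its projection needs roughly a sixth of the parameters of the full-width table, at the cost of one extra matrix multiplication per token.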

For the rest of this chapter, we’ll focus on the embedding table itself, since that’s where tokens first acquire their numerical meaning.


What Does a Dimension Represent?

Each embedding vector has thousands of numbers. In LLaMA 4 Maverick, each token is represented by 5,120 numbers. A natural question is: what does each number mean? Does dimension 47 represent “how much this word relates to animals”? Does dimension 1,203 represent “formality level”?

The answer is no. No single dimension has a clean, human-interpretable meaning. The meaning of a token is encoded in the pattern of all dimensions taken together. This is called a distributed representation.

Why Distributed Representations?

Consider an alternative approach: you could design a system where dimension 1 means “is it a noun?” (1 = yes, 0 = no), dimension 2 means “is it an animal?” (1 = yes, 0 = no), dimension 3 means “is it positive in sentiment?” (1 = yes, 0 = no), and so on. This is called a localist representation, where each dimension has a specific, predefined meaning.

The problem with localist representations is that they don’t scale. Human language has an enormous number of concepts, relationships, and nuances. You would need millions of dimensions to capture every possible property of every possible word, and most of those dimensions would be zero for any given word (a very sparse, wasteful representation).

Distributed representations solve this by letting each dimension participate in encoding many different properties simultaneously. Dimension 47 might contribute a little bit to “animal-ness,” a little bit to “domesticity,” a little bit to “noun-ness,” and a little bit to dozens of other properties. The full meaning emerges from the combination of all 5,120 dimensions working together.

This is similar to how colors work on a screen. A pixel’s color is defined by three numbers: red, green, and blue (RGB). The color “orange” isn’t captured by any single channel; it’s the specific combination [255, 165, 0] that makes orange. If you only looked at the red channel (255), you couldn’t distinguish orange from red or magenta. You need all three channels together.

Token embeddings work the same way, except with 5,120 “channels” instead of 3. The “meaning” of a token is the specific pattern across all 5,120 dimensions. No single dimension tells you much on its own.

What Researchers Have Found

Despite the distributed nature of embeddings, researchers have discovered that certain directions in the embedding space (not individual dimensions, but combinations of dimensions) do correspond to interpretable concepts.

The most famous example comes from Word2Vec, published by Mikolov et al. at Google in 2013. They trained 300-dimensional word embeddings and discovered that vector arithmetic could capture semantic relationships:

vector("King") - vector("Man") + vector("Woman") ≈ vector("Queen")

Source: Mikolov et al., “Efficient Estimation of Word Representations in Vector Space,” arXiv:1301.3781, January 2013. The original Word2Vec models used 300-dimensional vectors trained on the Google News corpus.

This means there is a direction in the 300-dimensional space that corresponds to the concept of “gender.” Moving along that direction transforms “King” into “Queen,” “Man” into “Woman,” “Uncle” into “Aunt,” and so on. But this direction is not a single dimension; it’s a combination of many dimensions that together encode the gender relationship.

Similarly, there are directions for:

  • Country to capital: vector(“France”) - vector(“Paris”) + vector(“Tokyo”) ≈ vector(“Japan”)
  • Verb tense: vector(“walking”) - vector(“walked”) + vector(“swam”) ≈ vector(“swimming”)
  • Comparative form: vector(“big”) - vector(“bigger”) + vector(“smaller”) ≈ vector(“small”)

These relationships are not programmed in. They emerge automatically from training on large amounts of text. The model discovers that words appearing in similar contexts should have similar vectors, and the geometric structure of the embedding space naturally organizes itself to reflect semantic relationships.
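The arithmetic behind these analogies is easy to try on a toy example. The vectors below are hand-built, not learned, and the axis meanings are an assumption made purely for illustration. In this toy the “gender” direction happens to line up with a single axis; in learned embeddings it is spread across many dimensions.

```python
import math

# Hand-built 3-dimensional vectors, NOT learned embeddings.
# Axes (an assumption for illustration): [royalty, gender, person-ness].
vectors = {
    "king":  [1.0,  1.0,  1.0],
    "queen": [1.0, -1.0,  1.0],
    "man":   [0.0,  1.0,  1.0],
    "woman": [0.0, -1.0,  1.0],
    "apple": [0.0,  0.0, -1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word nearest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three input words, as analogy evaluations usually do.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman")
print(result)
```

Real embeddings only satisfy such relations approximately, which is why the result is taken to be the nearest remaining vector rather than an exact match.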


Semantic Space: Similar Meanings, Nearby Vectors

The most important property of embeddings is that tokens with similar meanings end up with similar vectors. This is not a design choice that someone programmed in; it’s a consequence of how embeddings are learned during training.

In Chapter 2, we saw this with GPT-2’s embeddings: the dot product between “cat” and “dog” was 5.28 (high similarity), while “cat” and “car” was only 1.73 (low similarity). “Paris” and “Tokyo” scored 8.41 (both capital cities), while “Paris” and “bicycle” scored just 0.52.

This property holds across the entire vocabulary. If you could somehow visualize the positions of all 202,048 tokens in LLaMA 4 Maverick’s 5,120-dimensional embedding space, you would see clusters:

  • Animals (“cat,” “dog,” “horse,” “elephant”) would cluster together
  • Countries (“France,” “Germany,” “Japan,” “Brazil”) would form another cluster
  • Programming keywords (“def,” “class,” “import,” “return”) would cluster together
  • Punctuation (".", “,”, “!”, “?”) would form their own group
  • Numbers (“1,” “2,” “100,” “1000”) would be near each other

Within each cluster, finer distinctions would be visible. Among animals, “cat” and “dog” (both common pets) would be closer to each other than either is to “elephant” (a wild animal). Among countries, “France” and “Germany” (both European) would be closer than “France” and “Japan.”

Measuring Similarity: Cosine Similarity

In Chapter 2, we used the dot product to measure similarity between vectors. The dot product works, but it has a limitation: it’s affected by the magnitude (length) of the vectors, not just their direction. A vector that happens to have large values will produce large dot products with everything, even unrelated vectors.

A more robust measure is cosine similarity, which measures only the direction of two vectors, ignoring their magnitude. It’s computed by dividing the dot product by the product of the two vectors’ lengths:

cosine_similarity(a, b) = (a . b) / (|a| * |b|)

Where |a| is the length (magnitude) of vector a, computed as the square root of the sum of squared elements.

Cosine similarity produces a value between -1 and 1:

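  • 1 means the vectors point in exactly the same direction
  • 0 means the vectors are perpendicular
  • -1 means the vectors point in exactly opposite directions

These are geometric statements. It is tempting to read them as “identical meaning,” “unrelated,” and “opposite meaning,” but learned embeddings rarely behave that cleanly: antonyms such as “happy” and “sad” occur in similar contexts and usually end up with a positive similarity, not a negative one.

In practice, most token embeddings have cosine similarities between 0 and 0.5 for related words, and near 0 or slightly negative for unrelated words. Truly synonymous words might reach 0.7 or higher, although the exact ranges depend on the model.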
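The difference between the dot product and cosine similarity is easiest to see by computing both by hand. The following sketch uses plain Python and made-up three-dimensional vectors; it follows the formula above directly.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Length of a vector: square root of the sum of squared elements."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.5]
b_scaled = [10 * x for x in b]   # same direction as b, ten times longer

# The dot product grows with the length of the vectors...
print(dot(a, b), dot(a, b_scaled))

# ...while cosine similarity depends only on direction.
print(cosine_similarity(a, b), cosine_similarity(a, b_scaled))
```

Scaling `b` by ten multiplies the dot product by ten but leaves the cosine similarity unchanged, which is the robustness property described above.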

Let’s see this with real GPT-2 embeddings:

from transformers import AutoTokenizer, AutoModel
import torch

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
embeddings = model.wte.weight.detach()

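def get_embedding(word):
    # The leading space matters: GPT-2 treats " cat" and "cat" as different tokens
    token_ids = tokenizer.encode(" " + word)
    # Caveat: if the word splits into several tokens, only the first piece is used
    return embeddings[token_ids[0]]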

def cosine_sim(w1, w2):
    e1, e2 = get_embedding(w1), get_embedding(w2)
    return torch.nn.functional.cosine_similarity(e1.unsqueeze(0), e2.unsqueeze(0)).item()

pairs = [
    ("cat", "dog"),
    ("cat", "kitten"),
    ("cat", "car"),
    ("Paris", "Tokyo"),
    ("Paris", "France"),
    ("Paris", "bicycle"),
    ("happy", "joyful"),
    ("happy", "sad"),
    ("king", "queen"),
    ("king", "table"),
]

print(f"{'Word 1':>10s}  {'Word 2':<10s}  {'Cosine Similarity':>18s}")
print("-" * 44)
for w1, w2 in pairs:
    sim = cosine_sim(w1, w2)
    print(f"{w1:>10s}  {w2:<10s}  {sim:>18.4f}")

Running this on GPT-2 produces results like:

    Word 1  Word 2      Cosine Similarity
--------------------------------------------
       cat  dog                    0.4442
       cat  kitten                 0.3218
       cat  car                    0.1487
     Paris  Tokyo                  0.4891
     Paris  France                 0.3756
     Paris  bicycle                0.0312
     happy  joyful                 0.2847
     happy  sad                    0.2103
      king  queen                  0.5190
      king  table                  0.1254

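The broad pattern is clear. Semantically related words have higher cosine similarity: “king” and “queen” (0.52), “cat” and “dog” (0.44), “Paris” and “Tokyo” (0.49). Unrelated words have low similarity: “Paris” and “bicycle” (0.03), “king” and “table” (0.13). Notice, though, that “happy” and “sad” (0.21) score almost as high as “happy” and “joyful” (0.28). Antonyms occur in similar contexts, so embedding similarity reflects how alike two words are in usage, not whether they mean the same thing.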

(Note: GPT-2 is a relatively small model from 2019. Larger, more recent models produce embeddings with even cleaner semantic structure. But even GPT-2’s 768-dimensional embeddings capture meaningful relationships.)

Finding Nearest Neighbors

One practical application of embedding similarity is finding the nearest neighbors of a given word: which other words in the vocabulary have the most similar embeddings?

from transformers import AutoTokenizer, AutoModel
import torch

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
embeddings = model.wte.weight.detach()

def nearest_neighbors(word, top_k=10):
    token_id = tokenizer.encode(" " + word)[0]
    query = embeddings[token_id]

    # Compute cosine similarity with all tokens
    sims = torch.nn.functional.cosine_similarity(query.unsqueeze(0), embeddings)

    # Sort all tokens by similarity, highest first (the query word itself is filtered out in the loop below)
    top_indices = sims.argsort(descending=True)

    print(f"Nearest neighbors of '{word}':")
    count = 0
    for idx in top_indices:
        decoded = tokenizer.decode([idx.item()]).strip()
        if decoded.lower() != word.lower() and len(decoded) > 1:
            sim = sims[idx].item()
            print(f"  {decoded:<20s}  cosine: {sim:.4f}")
            count += 1
            if count >= top_k:
                break

nearest_neighbors("Paris")
print()
nearest_neighbors("Python")
print()
nearest_neighbors("happy")

This code computes the cosine similarity between the query word’s embedding and every embedding in the vocabulary, then prints the most similar tokens, skipping the query word itself and single-character tokens. The results reveal what the model considers semantically close to each word, based purely on the patterns it learned from training data.

For “Paris,” you should mostly see other cities and European locations. For “Python,” you should see other programming languages and technical terms. For “happy,” you should see other emotion words. Expect some noise as well, such as spelling variants and word fragments. These neighborhoods emerge entirely from training; nobody told the model that Paris is a city or that Python is a programming language.


How Embeddings Are Learned During Training

This is a critical point that many explanations gloss over: embeddings are not hand-coded, not looked up from a dictionary, and not computed from word definitions. They are learned from data during training, using the same backpropagation and gradient descent process we covered in Chapter 3.

The Starting Point: Random Initialization

Before training begins, every embedding vector is initialized with small random numbers. At this point, the embedding for “cat” is no more similar to “dog” than it is to “refrigerator” or “democracy.” The embedding table is just a matrix of random noise.

Before training:
  "cat"   --> [0.012, -0.003, 0.008, ..., -0.001]  (random)
  "dog"   --> [-0.007, 0.015, -0.002, ..., 0.009]  (random)
  "Paris" --> [0.004, 0.011, -0.006, ..., 0.003]   (random)

There is no semantic structure. “Cat” and “dog” are no closer together than “cat” and “Paris.”
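A quick experiment shows what “no semantic structure” looks like numerically. The sketch below draws three random vectors and compares them; the standard deviation of 0.02 is an assumed initialization scale, and the word labels are only names attached to random rows.

```python
import math
import random

random.seed(0)
DIM = 768  # GPT-2's embedding dimension

def random_embedding(dim, std=0.02):
    """One freshly initialized row; std=0.02 is an assumed value."""
    return [random.gauss(0.0, std) for _ in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# The labels are arbitrary: before training, a row knows nothing about its token.
cat, dog, paris = (random_embedding(DIM) for _ in range(3))

print(f"cat vs dog:   {cosine(cat, dog):+.3f}")
print(f"cat vs Paris: {cosine(cat, paris):+.3f}")
print(f"dog vs Paris: {cosine(dog, paris):+.3f}")
```

All three similarities land close to zero. In high-dimensional spaces, independently drawn random vectors are nearly perpendicular, so any structure seen after training must have come from the training itself.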

Learning Through Prediction

During training, the model processes trillions of tokens of text and tries to predict the next token at each position. When it gets a prediction wrong, backpropagation computes gradients for every parameter in the model, including every number in the embedding table.

Here’s the key mechanism: when the model processes the sentence “The cat sat on the mat,” it needs the embedding for “cat” to contain enough information to help predict “sat” as the next token. If the embedding for “cat” doesn’t carry useful information about what typically follows cats in text, the model will make poor predictions, the loss will be high, and the gradients will push the embedding values in a direction that improves future predictions.

Over billions of training examples, the model sees “cat” in many different contexts:

  • “The cat sat on the mat”
  • “She adopted a cat from the shelter”
  • “The cat chased the mouse”
  • “My cat is sleeping on the couch”

And it sees “dog” in very similar contexts:

  • “The dog sat on the mat”
  • “She adopted a dog from the shelter”
  • “The dog chased the squirrel”
  • “My dog is sleeping on the couch”

Because “cat” and “dog” appear in nearly identical contexts, the gradients push their embeddings in similar directions. Over time, their vectors converge to nearby points in the embedding space. Meanwhile, “refrigerator” appears in very different contexts (“She opened the refrigerator and grabbed a soda”), so its embedding gets pushed in a different direction.

This is the distributional hypothesis in linguistics, often summarized as: “You shall know a word by the company it keeps” (attributed to linguist J.R. Firth, 1957). Words that appear in similar contexts have similar meanings, and the training process automatically encodes this into the embedding vectors.
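The distributional hypothesis can be illustrated without any neural network. The sketch below is a count-based stand-in, not the gradient-based training described above: it simply counts which words appear next to each target word in the example sentences and compares the resulting count vectors.

```python
import math
from collections import Counter

sentences = [
    "the cat sat on the mat",
    "she adopted a cat from the shelter",
    "the cat chased the mouse",
    "my cat is sleeping on the couch",
    "the dog sat on the mat",
    "she adopted a dog from the shelter",
    "the dog chased the squirrel",
    "my dog is sleeping on the couch",
    "she opened the refrigerator and grabbed a soda",
]

def context_counts(target, window=1):
    """Count the words appearing within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            if word == target:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(words[lo:i] + words[i + 1:hi])
    return counts

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2)

cat = context_counts("cat")
dog = context_counts("dog")
fridge = context_counts("refrigerator")

print(f"cat vs dog:          {cosine(cat, dog):.3f}")
print(f"cat vs refrigerator: {cosine(cat, fridge):.3f}")
```

Because “cat” and “dog” have the same immediate neighbors in these sentences, their context vectors point in the same direction, while “refrigerator” shares only the word “the” with them. Training by prediction arrives at a similar geometry by a different route.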

The Embedding Table as a Learnable Parameter

Technically, the embedding table is just another set of parameters in the model, no different from the weight matrices in the attention layers or feed-forward networks. During training:

  1. The model performs a forward pass, using the embedding table to convert token IDs into vectors.
  2. The loss is computed (how wrong was the next-token prediction?).
  3. Backpropagation computes gradients for every parameter, including every number in the embedding table.
  4. The optimizer (Adam, as discussed in Chapter 3) updates the embedding values based on the gradients.

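The only special thing about the embedding table is that its gradient is sparse: for any given training example, only the rows corresponding to the tokens in that example receive a gradient from the lookup. If the training batch contains the tokens “The,” “cat,” “sat,” “on,” “the,” “mat,” only those 6 rows (out of 202,048) receive one. The other 202,042 rows get no gradient from the lookup in that step. (This describes the input lookup in isolation. When the table is shared with the output projection, a technique covered later in this chapter and used by GPT-2, the softmax over the vocabulary sends some gradient to every row.)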

Over the course of training on trillions of tokens, the rows for common tokens are updated millions of times, while the rows for rare tokens are updated far less often. Each update refines a token’s vector a little, until it captures the statistical patterns of how that token is used in language.
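The sparsity of the lookup gradient can be demonstrated on a toy table. This sketch uses an invented loss (the sum of squared entries of the looked-up rows) and plain gradient descent instead of Adam, purely to keep the arithmetic checkable by hand.

```python
VOCAB_SIZE, DIM = 8, 3
LEARNING_RATE = 0.1

# Toy table: every row starts as [1.0, 1.0, 1.0] so changes are easy to spot.
table = [[1.0] * DIM for _ in range(VOCAB_SIZE)]

def toy_loss_gradients(token_ids):
    """Gradient of a toy loss: the sum of squared entries of the looked-up rows.

    Only rows that were looked up contribute to the loss, so only they get
    a nonzero gradient (2 * value per entry, once per occurrence).
    """
    grads = {}
    for token_id in token_ids:
        row_grad = grads.setdefault(token_id, [0.0] * DIM)
        for d in range(DIM):
            row_grad[d] += 2.0 * table[token_id][d]
    return grads

def sgd_step(token_ids):
    """Apply one plain gradient-descent update to the rows that have gradients."""
    for token_id, row_grad in toy_loss_gradients(token_ids).items():
        for d in range(DIM):
            table[token_id][d] -= LEARNING_RATE * row_grad[d]

batch = [2, 5, 2]   # token 2 appears twice, so its gradient accumulates
sgd_step(batch)

changed_rows = [i for i, row in enumerate(table) if row != [1.0] * DIM]
print("rows changed:", changed_rows)
```

Only rows 2 and 5 move; the other six rows are bit-for-bit unchanged. Optimizers that keep running statistics or apply weight decay can still touch other rows, so this is a picture of the gradient, not of every possible update rule.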

A Brief History of Word Embeddings

The idea of representing words as dense vectors learned from data has a rich history:

Bengio et al. (2003): Yoshua Bengio and colleagues published “A Neural Probabilistic Language Model” in the Journal of Machine Learning Research. This paper proposed learning word representations (embeddings) jointly with a neural network language model. Their experiments used embedding dimensions of 30 to 100. This was the first major work to demonstrate that neural networks could learn useful word representations from text.

Source: Bengio, Ducharme, Vincent, and Jauvin, “A Neural Probabilistic Language Model,” JMLR, 2003.

Word2Vec (2013): Tomas Mikolov and colleagues at Google published two papers introducing Word2Vec, which could train 300-dimensional word embeddings efficiently on very large corpora. Word2Vec demonstrated the famous analogy relationships (King - Man + Woman ≈ Queen) and made word embeddings a standard tool in natural language processing.

Source: Mikolov et al., “Efficient Estimation of Word Representations in Vector Space,” arXiv:1301.3781, January 2013.

GloVe (2014): Jeffrey Pennington, Richard Socher, and Christopher Manning at Stanford published GloVe (Global Vectors for Word Representation), which combined the advantages of global matrix factorization methods with those of local context-window methods such as Word2Vec. GloVe embeddings became widely used alongside Word2Vec.

Source: Pennington, Socher, and Manning, “GloVe: Global Vectors for Word Representation,” EMNLP 2014.

Contextual Embeddings (2018+): The approaches above produce a single, fixed embedding for each word. But words have different meanings in different contexts: “bank” means something different in “river bank” vs. “bank account.” Starting with ELMo (2018) and then BERT (2018) and GPT (2018), models began producing contextual embeddings, where the representation of a word depends on the surrounding text. In modern Transformer models, the embedding table provides the initial, context-free representation, and then the Transformer layers refine it into a context-dependent representation over dozens of layers.

The embedding table in a modern LLM like LLaMA 4 Maverick is the direct descendant of Bengio’s 2003 word vectors, scaled up from 30 dimensions to 5,120 dimensions and from thousands of words to 202,048 tokens.


Real Numbers: How Big Are Embedding Tables?

Let’s compute the exact sizes of embedding tables in production models, because these numbers matter for understanding memory requirements and deployment costs.

The formula is simple:

Embedding table size (in numbers) = vocabulary_size x embedding_dimension
Embedding table size (in bytes)   = vocabulary_size x embedding_dimension x bytes_per_number

In 16-bit floating point (float16 or bfloat16), each number takes 2 bytes. In 32-bit floating point (float32), each number takes 4 bytes. Most modern models store and serve embeddings in 16-bit precision.

Worked Examples

GPT-2 (2019):

50,257 tokens x 768 dimensions = 38,597,376 numbers
38,597,376 x 2 bytes = 77,194,752 bytes ≈ 77 MB

The embedding table is 77 MB. For a model with 124 million total parameters, the embedding table accounts for about 31% of all parameters.

Mistral 7B (2023):

32,000 tokens x 4,096 dimensions = 131,072,000 numbers
131,072,000 x 2 bytes = 262,144,000 bytes ≈ 262 MB

Source: Mistral 7B architecture from Mistral AI (2023): 32,000 vocabulary, 4,096 hidden dimension, 32 layers.

DeepSeek-V3 (December 2024):

128,000 tokens x 7,168 dimensions = 917,504,000 numbers
917,504,000 x 2 bytes = 1,835,008,000 bytes ≈ 1.8 GB

Source: DeepSeek-V3 technical report (December 2024): 128,000 vocabulary, 7,168 hidden dimension, 61 layers, 671B total parameters.

LLaMA 4 Maverick (April 2025):

202,048 tokens x 5,120 dimensions = 1,034,485,760 numbers
1,034,485,760 x 2 bytes = 2,068,971,520 bytes ≈ 2.1 GB

Source: LLaMA 4 Maverick from HuggingFace Transformers Llama4TextConfig and Meta AI (April 2025): 202,048 vocabulary, 5,120 hidden dimensions, 48 layers (alternating dense and MoE), 40 query heads, 8 KV heads, 128 experts.

So the embedding table in LLaMA 4 Maverick is about 2.1 GB. For a model with 400 billion total parameters (800 GB in float16), the embedding table is a small fraction of the total, about 0.26%. But in absolute terms, 2.1 GB is still substantial; it’s larger than many entire applications.
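The worked examples above are easy to reproduce in code. This sketch uses the vocabulary sizes, dimensions, and total parameter counts quoted in this chapter; the function name is a hypothetical helper.

```python
def embedding_table_size(vocab_size, embedding_dim, bytes_per_number=2):
    """Return (number of entries, size in bytes) for a V x D table."""
    entries = vocab_size * embedding_dim
    return entries, entries * bytes_per_number

# (vocabulary size, embedding dimension) as quoted in this chapter
models = {
    "GPT-2": (50_257, 768),
    "Mistral 7B": (32_000, 4_096),
    "DeepSeek-V3": (128_000, 7_168),
    "LLaMA 4 Maverick": (202_048, 5_120),
}

for name, (vocab, dim) in models.items():
    entries, size = embedding_table_size(vocab, dim)
    print(f"{name:<18s} {entries:>15,d} entries  {size / 1e9:6.2f} GB (float16)")

# Share of all parameters taken up by the embedding table
gpt2_share = embedding_table_size(50_257, 768)[0] / 124e6
maverick_share = embedding_table_size(202_048, 5_120)[0] / 400e9
print(f"GPT-2: {gpt2_share:.1%}   LLaMA 4 Maverick: {maverick_share:.2%}")
```

Passing `bytes_per_number=4` gives the float32 sizes, which are exactly double.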

The Output Layer: A Mirror of the Embedding Table

There’s a second large matrix related to the vocabulary that we should mention: the output projection (also called the “language model head” or “unembedding matrix”). This matrix sits at the very end of the model and converts the final hidden state back into scores over the vocabulary.

The output projection has the shape [hidden_dimension x vocabulary_size], which is the transpose of the embedding table’s shape. In some models (like GPT-2), the output projection shares the same weights as the embedding table, a technique called weight tying. This saves memory by not storing two copies of a vocabulary-sized matrix. In other models (like LLaMA 4), the embedding table and output projection are separate, each with their own learned weights.

When weight tying is used, the model effectively says: “The vector that represents a token as input should also be the vector that the model aims to produce as output when it wants to generate that token.” This is an elegant constraint that works well in practice and reduces the model’s parameter count.
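Here is a minimal sketch of what weight tying means computationally. The table values are hand-picked, the hidden state is invented, and the helper names are assumptions for illustration; a real model would compute the same thing with a single matrix multiplication.

```python
# Toy embedding table: 4 tokens, 3 dimensions (hand-picked values).
embedding_table = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]

def tied_logits(hidden_state):
    """Score every token by the dot product of the hidden state with its
    embedding row, i.e. hidden_state @ embedding_table.T."""
    return [sum(h * e for h, e in zip(hidden_state, row)) for row in embedding_table]

def vocab_matrix_params(vocab_size, hidden_dim, tied):
    """Parameters spent on the input table plus the output projection."""
    per_matrix = vocab_size * hidden_dim
    return per_matrix if tied else 2 * per_matrix

# A final hidden state that points mostly along token 1's embedding.
hidden_state = [0.1, 0.9, 0.0]
logits = tied_logits(hidden_state)
predicted = max(range(len(logits)), key=logits.__getitem__)
print("logits:", logits)
print("highest-scoring token ID:", predicted)

# Savings at GPT-2's sizes (50,257 tokens, 768 dimensions).
print(vocab_matrix_params(50_257, 768, tied=True))
print(vocab_matrix_params(50_257, 768, tied=False))
```

With tied weights, the token whose input embedding best matches the final hidden state gets the highest score, and the model stores one vocabulary-sized matrix instead of two.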


Visualizing Embeddings: From 5,120 Dimensions to 2D

You can’t visualize 5,120 dimensions. But you can project high-dimensional embeddings down to 2 dimensions while preserving the local neighborhood structure, so that tokens that are close in the original space remain close in the 2D plot. The most popular technique for this is t-SNE (t-distributed Stochastic Neighbor Embedding), developed by Laurens van der Maaten and Geoffrey Hinton in 2008.

Source: van der Maaten and Hinton, “Visualizing Data using t-SNE,” Journal of Machine Learning Research, 2008.

How t-SNE Works (Simplified)

t-SNE works in two steps:

  1. In the original high-dimensional space, it computes the probability that any two points are “neighbors” (close together). Points that are close get high probability; points that are far apart get low probability.

  2. In the 2D output space, it places points randomly and then iteratively moves them around until the neighborhood probabilities in 2D match the neighborhood probabilities in the original space as closely as possible.

The result is a 2D scatter plot where clusters of similar tokens are visible. t-SNE is particularly good at preserving local structure: if “cat” and “dog” are close in 5,120 dimensions, they’ll be close in the 2D plot. However, the global distances in a t-SNE plot are not meaningful; the absolute positions and distances between distant clusters don’t correspond to anything in the original space.
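The first step, turning distances into neighbor probabilities, can be sketched in a few lines. This is a simplification: it uses one fixed bandwidth `sigma` for every point, whereas the actual algorithm tunes a separate bandwidth per point to match the chosen perplexity and then symmetrizes the probabilities. The points are made up.

```python
import math

def neighbor_probabilities(points, i, sigma=1.0):
    """Conditional probabilities p(j | i) from a Gaussian centered on point i."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Unnormalized Gaussian weight for every point except i itself.
    weights = {
        j: math.exp(-sq_dist(points[i], points[j]) / (2 * sigma ** 2))
        for j in range(len(points)) if j != i
    }
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}

points = [
    [0.0, 0.0, 0.0],   # 0: the query point
    [0.1, 0.0, 0.1],   # 1: very close to point 0
    [2.0, 2.0, 2.0],   # 2: far from point 0
]

probs = neighbor_probabilities(points, 0)
for j, p in probs.items():
    print(f"p({j} | 0) = {p:.4f}")
```

Almost all of the probability mass goes to the nearby point. The second step of t-SNE then searches for 2D positions whose neighbor probabilities match these as closely as possible.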

Hands-On: Visualizing GPT-2 Embeddings with t-SNE

Let’s visualize the embeddings of a curated set of words from GPT-2 to see the semantic clusters:

from transformers import AutoTokenizer, AutoModel
from sklearn.manifold import TSNE
import numpy as np
import matplotlib.pyplot as plt

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
embeddings = model.wte.weight.detach().numpy()

# Define word groups with semantic categories
word_groups = {
    "Animals": ["cat", "dog", "horse", "fish", "bird", "mouse", "rabbit", "tiger"],
    "Countries": ["France", "Germany", "Japan", "Brazil", "India", "China", "Italy", "Spain"],
    "Colors": ["red", "blue", "green", "yellow", "purple", "orange", "black", "white"],
    "Programming": ["Python", "Java", "code", "function", "variable", "loop", "class", "array"],
    "Emotions": ["happy", "sad", "angry", "fear", "love", "hate", "joy", "calm"],
}

# Collect embeddings and labels
all_embeddings = []
all_labels = []
all_categories = []

for category, words in word_groups.items():
    for word in words:
        token_ids = tokenizer.encode(" " + word)
        # Keep only words that map to a single token; multi-token words are skipped
        if len(token_ids) == 1:
            all_embeddings.append(embeddings[token_ids[0]])
            all_labels.append(word)
            all_categories.append(category)

all_embeddings = np.array(all_embeddings)

# Run t-SNE to reduce from 768 dimensions to 2
tsne = TSNE(n_components=2, random_state=42, perplexity=min(8, len(all_embeddings) - 1))
coords = tsne.fit_transform(all_embeddings)

# Plot
colors_map = {
    "Animals": "#e74c3c",
    "Countries": "#3498db",
    "Colors": "#2ecc71",
    "Programming": "#9b59b6",
    "Emotions": "#f39c12",
}

plt.figure(figsize=(12, 8))
for i, (x, y) in enumerate(coords):
    category = all_categories[i]
    plt.scatter(x, y, c=colors_map[category], s=100, alpha=0.7)
    plt.annotate(all_labels[i], (x, y), fontsize=9, ha="center", va="bottom")

# Add legend
for category, color in colors_map.items():
    plt.scatter([], [], c=color, s=100, label=category)
plt.legend(loc="best", fontsize=10)

plt.title("GPT-2 Token Embeddings Visualized with t-SNE", fontsize=14)
plt.xlabel("t-SNE dimension 1")
plt.ylabel("t-SNE dimension 2")
plt.tight_layout()
plt.savefig("embeddings_tsne.png", dpi=150)
plt.show()
print("Plot saved to embeddings_tsne.png")

When you run this code, you should see a scatter plot in which words from the same semantic category tend to cluster together: animals in one group, countries in another, colors in another, and so on. Because the code fixes random_state=42, repeated runs on the same setup produce the same layout. With a different seed the exact positions change, but the groupings should remain broadly similar.

This visualization makes concrete what we’ve been discussing abstractly: the embedding space has structure. Similar tokens are near each other, and that structure was learned entirely from predicting the next token on large amounts of text.

Limitations of t-SNE

A few important caveats about t-SNE visualizations:

  1. Global distances are meaningless. The distance between the “Animals” cluster and the “Countries” cluster in the 2D plot doesn’t tell you anything about how far apart they are in the original 768-dimensional space. Only local neighborhoods are preserved.

  2. Cluster sizes are meaningless. A tight cluster in the 2D plot doesn’t necessarily mean the words are tighter in the original space. t-SNE can expand or compress clusters.

  3. Results vary between runs. Different random seeds produce different layouts, which is why the code above fixes random_state. The clusters should be broadly the same, but their positions on the plot will differ.

  4. Perplexity matters. The perplexity parameter controls how many neighbors each point considers. Low perplexity (5-10) emphasizes very local structure; high perplexity (30-50) captures more global structure. There’s no single “correct” value.

Despite these limitations, t-SNE is an invaluable tool for building intuition about embedding spaces. It lets you see, at a glance, that the model has learned meaningful semantic structure.


Beyond Words: What Subword Embeddings Capture

In Chapter 4, we learned that modern tokenizers use subword tokenization. The vocabulary doesn’t just contain whole words; it contains word pieces like “un,” “ing,” “tion,” and “est.” Each of these subword tokens has its own embedding vector in the table.

This raises an interesting question: what do subword embeddings represent?

The embedding for “un” tends to capture a general sense of negation or reversal. The piece appears in “unhappy,” “undo,” “unfair,” “unlikely,” and hundreds of other words. Through training, the model learns that when “un” precedes another token, it often reverses or negates the meaning, and the embedding for “un” encodes this tendency. The encoding is statistical, not a rule: the same piece can also occur in words where nothing is being negated.

Similarly, the embedding for “ing” captures the concept of ongoing action (present participle). The embedding for “tion” captures the concept of a noun derived from a verb (“creation,” “information,” “education”). These subword embeddings are less semantically rich than whole-word embeddings (the embedding for “cat” carries more specific meaning than the embedding for “ing”), but they serve a crucial role: they allow the model to construct meaningful representations of rare or unseen words by combining the embeddings of their subword pieces.

When the model encounters the rare word “defenestration” (the act of throwing someone out of a window), it might be tokenized as [“def”, “en”, “est”, “ration”]. None of these subword tokens individually mean “throwing out of a window,” but the Transformer layers (Chapters 7-10) combine their embeddings through attention and feed-forward processing to build up the full meaning. The embedding table provides the raw ingredients; the Transformer layers do the cooking.
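The “raw ingredients” step can be sketched as follows. Everything here is hypothetical: the tiny vocabulary, the token IDs, the table values, and above all the greedy longest-match splitter, which is a simple stand-in for the BPE procedure from Chapter 4 and not how production tokenizers work.

```python
# Hypothetical subword vocabulary and token IDs, chosen for illustration.
vocab = {"def": 0, "en": 1, "est": 2, "ration": 3, "un": 4, "ing": 5}

DIM = 4
# Toy embedding table: row i is filled with the value i so rows are easy to tell apart.
embedding_table = [[float(i)] * DIM for i in range(len(vocab))]

def greedy_tokenize(word):
    """Split `word` by repeatedly taking the longest vocabulary entry that
    matches at the current position. A stand-in for real BPE."""
    pieces, pos = [], 0
    while pos < len(word):
        for end in range(len(word), pos, -1):
            if word[pos:end] in vocab:
                pieces.append(word[pos:end])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary entry matches at position {pos}")
    return pieces

pieces = greedy_tokenize("defenestration")
token_ids = [vocab[p] for p in pieces]
matrix = [embedding_table[t] for t in token_ids]

print(pieces)
print(token_ids)
print(len(matrix), "x", len(matrix[0]))
```

The rare word never needs its own row in the table. It enters the model as four ordinary rows, and combining them into a single meaning is left to the Transformer layers.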


The Embedding Table in the Full Model Pipeline

Let’s place the embedding table in the context of the full model architecture. Here’s what happens when you send a prompt to a model like LLaMA 4 Maverick:

Step 1: Tokenization
  "The weather in Tokyo is usually mild"
  --> [The, weather, in, Tokyo, is, usually, mild]
  --> [450, 9235, 304, 27856, 338, 6892, 24312]   (illustrative IDs; actual values depend on the tokenizer)

Step 2: Embedding Lookup
  Each token ID indexes into the embedding table (202,048 x 5,120)
  --> 7 vectors, each with 5,120 dimensions
  --> Matrix of shape [7 x 5,120]

Step 3: Positional Encoding (Chapter 6)
  Position information is injected so the model can tell token order apart
  (RoPE-based models such as LLaMA apply it inside the attention layers
  rather than adding it to the embeddings)
  --> Still [7 x 5,120]

Step 4: Transformer Layers (Chapters 7-10)
  48 layers of attention and feed-forward processing
  (alternating between dense and Mixture-of-Experts layers)
  --> [7 x 5,120] (same shape, but values are completely transformed)

Step 5: Output Projection
  The last token's vector is multiplied by the output matrix
  --> 202,048 scores (one per vocabulary entry)

Step 6: Softmax
  Scores are converted to probabilities
  --> 202,048 probabilities summing to 1

The embedding table is Step 2. It’s the bridge between the discrete world of token IDs (integers) and the continuous world of vectors (lists of floating-point numbers). Everything before it is text processing. Everything after it is linear algebra.
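
The shapes in Steps 2 through 6 can be traced in a few lines of NumPy. This is a sketch under loud assumptions: the sizes are toy values, the token IDs are hypothetical, positional information (Step 3) is skipped, and each "layer" is only a residual linear map with a nonlinearity standing in for real attention and feed-forward blocks.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, n_layers = 1000, 64, 2  # toy sizes; the model in the text uses 202,048 x 5,120 and 48 layers

embedding_table = rng.normal(scale=0.02, size=(V, D))
output_matrix = rng.normal(scale=0.02, size=(D, V))

token_ids = np.array([5, 42, 7, 901, 13, 256, 88])  # hypothetical IDs for a 7-token prompt

# Step 2: embedding lookup is plain row indexing.
x = embedding_table[token_ids]  # [7, D]

# Step 4: stand-in layers. Real layers differ, but they preserve the shape the same way.
for _ in range(n_layers):
    W = rng.normal(scale=0.02, size=(D, D))
    x = x + np.tanh(x @ W)  # still [7, D]

# Step 5: project the last token's vector to one score per vocabulary entry.
logits = x[-1] @ output_matrix  # [V]

# Step 6: softmax (subtracting the max keeps the exponentials numerically stable).
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()

print(x.shape, logits.shape, float(probs.sum()))
```

Only the first and last steps touch the vocabulary dimension V. Everything in between works on vectors of width D.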

After the embedding lookup, the token vectors pass through 48 Transformer layers. In LLaMA 4 Maverick, these alternate between dense layers and Mixture-of-Experts (MoE) layers (Chapter 12). Each layer transforms the vectors through attention (which lets tokens share information with each other) and feed-forward networks (which process each token’s information independently). By the time the vectors emerge from the final layer, they have been transformed from simple word-meaning vectors into rich, context-dependent representations that encode the full meaning of the entire sequence.

The initial embeddings from the table are sometimes called static embeddings or context-free embeddings, because they represent the token’s meaning in isolation, without considering the surrounding text. The vectors that emerge from the Transformer layers are called contextual embeddings, because they incorporate information from the entire sequence. The word “bank” starts with the same static embedding regardless of context, but after passing through the Transformer layers, its contextual embedding will be very different in “river bank” vs. “bank account.”
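
The static-versus-contextual distinction can be shown with a toy mixing step. The sketch below assumes a three-word vocabulary with random vectors, and it uses a single softmax-weighted average over the sequence as a stand-in for attention. It has no learned query, key, or value projections and no causal mask, so it is far simpler than a real Transformer layer.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"river": 0, "bank": 1, "account": 2}
D = 16
embedding_table = rng.normal(scale=0.5, size=(len(vocab), D))

def contextualize(token_ids):
    """One toy attention pass: each token becomes a softmax-weighted mix of all tokens."""
    x = embedding_table[token_ids]                 # static embeddings, [T, D]
    scores = x @ x.T / np.sqrt(D)                  # pairwise similarity, [T, T]
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return x, weights @ x                          # (static, contextual)

static_a, ctx_a = contextualize([vocab["river"], vocab["bank"]])
static_b, ctx_b = contextualize([vocab["bank"], vocab["account"]])

# "bank" is at index 1 in the first phrase and index 0 in the second.
print(np.allclose(static_a[1], static_b[0]))  # True: same row of the table
print(np.allclose(ctx_a[1], ctx_b[0]))        # False: different neighbors were mixed in
```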


Embeddings for Non-Text Tokens

Modern models don’t just embed text tokens. LLaMA 4 Maverick is natively multimodal, meaning it can process both text and images. For images, the model uses a separate vision encoder (a Vision Transformer, or ViT) that converts image patches into vectors. These image vectors are then projected into the same embedding space as text tokens, so the Transformer layers can process text and image information together.

The vision encoder in LLaMA 4 Maverick is based on MetaCLIP and uses a 34-layer Vision Transformer with 16 attention heads and an embedding dimension of 768 (compared to 5,120 for text). The vision encoder’s feed-forward layers have an intermediate size of 5,632. The vision encoder produces output vectors of dimension 7,680 (the vision_output_dim), which includes outputs from intermediate transformer layers and a global transformer encoder. A multi-modal projector then maps these vision representations through a two-stage process: first reducing from the vision output dimension to 4,096 (projector_input_dim), then projecting to 4,096 (projector_output_dim), before the results are integrated into the text model’s token stream for joint processing by the Transformer layers.

Source: LLaMA 4 Maverick vision config from HuggingFace Transformers Llama4VisionConfig: hidden_size = 768, 34 vision layers, 16 vision attention heads, intermediate_size = 5,632, vision_output_dim = 7,680, projector_input_dim = 4,096, projector_output_dim = 4,096. MetaCLIP-based vision encoder confirmed by multiple technical analyses (April 2025).
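
The two-stage projection is easiest to follow as a sequence of shapes. The sketch below is not Meta's implementation. It assumes two bias-free linear maps with a GELU in between (the activation and the absence of biases are assumptions), a hypothetical count of 10 patch vectors, and dimensions scaled down by 64 so that it runs instantly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes quoted in the config above, divided by 64 for a lightweight sketch.
SCALE = 64
vision_output_dim = 7680 // SCALE     # 120
projector_input_dim = 4096 // SCALE   # 64
projector_output_dim = 4096 // SCALE  # 64
n_patches = 10                        # hypothetical number of image patch vectors

W1 = rng.normal(scale=0.02, size=(vision_output_dim, projector_input_dim))
W2 = rng.normal(scale=0.02, size=(projector_input_dim, projector_output_dim))

def gelu(x):
    # tanh approximation of GELU; the activation choice is an assumption of this sketch
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

patch_features = rng.normal(size=(n_patches, vision_output_dim))  # vision encoder output
hidden = gelu(patch_features @ W1)  # stage 1: vision_output_dim -> projector_input_dim
projected = hidden @ W2             # stage 2: projector_input_dim -> projector_output_dim

print(patch_features.shape, hidden.shape, projected.shape)  # (10, 120) (10, 64) (10, 64)

# Weight count of the two maps at full size, under the same no-bias assumption.
full_size_params = 7680 * 4096 + 4096 * 4096
print(f"{full_size_params:,} projector weights at full size")
```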

Special tokens also have embeddings. The beginning-of-sequence token, end-of-sequence token, and role markers (system, user, assistant) each have their own row in the embedding table. These embeddings are learned during training just like regular token embeddings, and they encode the model’s understanding of what these control signals mean.


Hands-On: Exploring Embedding Arithmetic

Let’s implement the famous Word2Vec-style analogy test using GPT-2’s embeddings. This demonstrates that the geometric structure of the embedding space encodes semantic relationships:

from transformers import AutoTokenizer, AutoModel
import torch

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
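# wte is GPT-2's token embedding matrix, shape [50257, 768];
# detach() drops gradient tracking since we only read the values
embeddings = model.wte.weight.detach()

def get_emb(word):
    # The leading space matches GPT-2's word-initial tokens. If the word
    # splits into several tokens, only the first token's embedding is used.
    return embeddings[tokenizer.encode(" " + word)[0]]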

def find_closest(target_vec, exclude_words, top_k=5):
    """Find the closest tokens to target_vec, excluding specified words."""
    sims = torch.nn.functional.cosine_similarity(target_vec.unsqueeze(0), embeddings)
    exclude_ids = set()
    for w in exclude_words:
        exclude_ids.add(tokenizer.encode(" " + w)[0])

    # Rank every vocabulary entry by similarity, best match first
    top_indices = sims.argsort(descending=True)
    results = []
    for idx in top_indices:
        if idx.item() not in exclude_ids:
            decoded = tokenizer.decode([idx.item()]).strip()
            # Skip single-character tokens (punctuation, stray letters)
            if len(decoded) > 1:
                results.append((decoded, sims[idx].item()))
                if len(results) >= top_k:
                    break
    return results

# Analogy: King - Man + Woman = ?
result_vec = get_emb("king") - get_emb("man") + get_emb("woman")
print("king - man + woman = ?")
for word, sim in find_closest(result_vec, ["king", "man", "woman"]):
    print(f"  {word:<15s}  cosine: {sim:.4f}")

print()

# Analogy: Paris - France + Japan = ?
result_vec = get_emb("Paris") - get_emb("France") + get_emb("Japan")
print("Paris - France + Japan = ?")
for word, sim in find_closest(result_vec, ["Paris", "France", "Japan"]):
    print(f"  {word:<15s}  cosine: {sim:.4f}")

print()

# Analogy: walking - walked + swam = ?
result_vec = get_emb("walking") - get_emb("walked") + get_emb("swam")
print("walking - walked + swam = ?")
for word, sim in find_closest(result_vec, ["walking", "walked", "swam"]):
    print(f"  {word:<15s}  cosine: {sim:.4f}")

The “King - Man + Woman” analogy should produce “Queen” or a closely related word near the top of the results. The “Paris - France + Japan” analogy should produce “Tokyo” or another Japanese city. These results won’t always be perfect (GPT-2 is a small model, and its embedding table is optimized for next-token prediction rather than analogy tasks), but the general pattern holds: the embedding space encodes semantic relationships as geometric directions.

This is remarkable when you consider that nobody told the model about gender, geography, or verb tenses. These relationships emerged purely from the statistical patterns of which words appear near which other words in the training data.
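
To see why the arithmetic can work at all, it helps to build a space where the directions are known in advance. The vectors below are hand-written for illustration, with one axis each for "person," "royalty," "female," and "fruit." They are not learned embeddings, and real embedding spaces are much noisier than this.

```python
import numpy as np

# Axes: [person, royalty, female, fruit]. Illustrative values only.
vectors = {
    "man":    np.array([1.0, 0.0, 0.0, 0.0]),
    "woman":  np.array([1.0, 0.0, 1.0, 0.0]),
    "king":   np.array([1.0, 1.0, 0.0, 0.0]),
    "queen":  np.array([1.0, 1.0, 1.0, 0.0]),
    "prince": np.array([1.0, 0.6, 0.0, 0.0]),
    "apple":  np.array([0.0, 0.0, 0.0, 1.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Rank candidates for a - b + c by cosine similarity, excluding the inputs."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [(w, cosine(target, v)) for w, v in vectors.items() if w not in (a, b, c)]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)

ranked = analogy("king", "man", "woman")
for word, sim in ranked:
    print(f"{word:<8s} cosine: {sim:.4f}")
```

Subtracting "man" from "king" leaves only the royalty direction, and adding "woman" puts back the person and female components, so the result lands exactly on "queen." In a trained model, such directions are only approximate, which is why the GPT-2 results above are less clean.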


Key Takeaways

  • The embedding table is a matrix of size [vocabulary_size x embedding_dimension] that converts token IDs into vectors. It is a simple lookup table: token ID N maps to row N of the matrix. No computation is involved in the lookup itself.

  • Real embedding tables are large. GPT-2’s is about 77 MB (50,257 tokens x 768 dimensions). LLaMA 4 Maverick’s is about 2.1 GB (202,048 tokens x 5,120 dimensions). DeepSeek-V3’s is about 1.8 GB (128,000 tokens x 7,168 dimensions). These sizes are in 16-bit floating point.

  • No single dimension has a clean, interpretable meaning. Token meaning is encoded as a distributed representation across all dimensions. The pattern of all D numbers together (5,120 in LLaMA 4 Maverick) encodes what the token means, not any individual number.

  • Similar meanings produce nearby vectors. “Cat” and “dog” have similar embeddings because they appear in similar contexts during training. “Cat” and “refrigerator” have dissimilar embeddings because they appear in different contexts. This is measured using cosine similarity or dot products.

  • Embeddings are learned during training, not hand-coded. They start as random numbers and are refined through billions of gradient updates via backpropagation. The training objective (predict the next token) forces the model to learn embeddings where semantically related tokens are close together, because similar tokens are useful in similar prediction contexts.

  • The embedding space has geometric structure that encodes semantic relationships. Vector arithmetic like “King - Man + Woman = Queen” works because directions in the space correspond to concepts like gender, geography, and verb tense. This was first demonstrated by Word2Vec (Mikolov et al., 2013) and holds approximately, though not perfectly, in modern LLM embeddings.

  • t-SNE (van der Maaten and Hinton, 2008) is a dimensionality reduction technique that projects high-dimensional embeddings into 2D for visualization, preserving local neighborhood structure. It reveals semantic clusters: animals group together, countries group together, programming terms group together.

  • The embedding table provides static, context-free representations. The Transformer layers (Chapters 7-10) then transform these into contextual representations that depend on the surrounding text. The word “bank” starts with the same embedding regardless of context, but its representation after the Transformer layers differs between “river bank” and “bank account.”

  • Some models use weight tying, where the embedding table and the output projection matrix share the same weights. This saves memory and enforces the constraint that the input representation of a token should match the output representation the model aims to produce.
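
The size figures and the weight-tying takeaway both come down to simple arithmetic. The sketch below recomputes the float16 table sizes from the vocabulary and dimension values quoted in this chapter, and shows how much memory a model of each shape would save if it shared one matrix between input and output. Whether a given model actually ties its weights is not assumed here; `table_bytes` is a helper name introduced for this example.

```python
def table_bytes(vocab_size, dim, bytes_per_value=2):
    """Bytes needed for a [vocab_size x dim] table; 2 bytes per value for float16."""
    return vocab_size * dim * bytes_per_value

models = {
    "GPT-2": (50_257, 768),
    "DeepSeek-V3": (128_000, 7_168),
    "LLaMA 4 Maverick": (202_048, 5_120),
}

for name, (V, D) in models.items():
    tied = table_bytes(V, D)        # one shared matrix
    untied = 2 * table_bytes(V, D)  # separate input table and output projection
    saved = untied - tied
    print(f"{name:<18s} table: {tied / 1e6:8.1f} MB   saved by tying: {saved / 1e6:8.1f} MB")
```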


What’s Next

You now know how tokens get their initial numerical meaning through the embedding table. But there’s a critical piece of information missing from these embeddings: word order. The embedding for “cat” is the same whether it appears at position 1 or position 1,000 in the sequence. In Chapter 6, we’ll cover positional encoding: how models inject information about token position into the embeddings, why Transformers need this (they have no built-in sense of order), and how techniques like Rotary Position Embeddings (RoPE) enable models to handle context windows of millions of tokens.
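
As a preview of the problem, here is a small sketch with a hypothetical three-word vocabulary and a random table. It shows that the lookup returns the same row for a word wherever it appears, so any order-insensitive summary of the embeddings cannot distinguish two sentences that use the same words in a different order.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"dog": 0, "bites": 1, "man": 2}
embedding_table = rng.normal(size=(len(vocab), 4))

def embed(words):
    return embedding_table[[vocab[w] for w in words]]

a = embed(["dog", "bites", "man"])
b = embed(["man", "bites", "dog"])

# "dog" gets the same row whether it is first or last.
print(np.array_equal(a[0], b[2]))                 # True
# An order-insensitive summary (here, the sum) is identical for both sentences.
print(np.allclose(a.sum(axis=0), b.sum(axis=0)))  # True
```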