Skip to content
Chapter 21. Vision, How Models See Images

Chapter 21. Vision, How Models See Images

Every frontier language model released in 2025 and 2026 can process images. You can paste a screenshot of a spreadsheet into ChatGPT and ask it to summarize the data. You can upload a photo of a whiteboard to Claude and have it extract the equations. You can send Gemini a chart and ask it to explain the trend. But how does a model that was built to predict the next text token suddenly “see” an image? The answer involves a separate neural network called a vision encoder that converts pixels into the same kind of numerical representations (vectors) that the language model already knows how to process. This chapter explains exactly how that works: how images become tokens, how those tokens connect to the language model, and where the whole system still falls short.


From Pixels to Patches: The Core Idea

A language model processes text as a sequence of tokens, where each token is represented by a vector (Chapter 5). To make a language model process images, you need to convert an image into a sequence of vectors that look, to the language model, like token embeddings. The model does not need to know whether a vector came from a word or from a piece of an image. It just processes vectors.

The question is: how do you turn an image into a sequence of vectors?

The naive approach would be to treat each pixel as a token. A 224 x 224 pixel image has 50,176 pixels. Each pixel has 3 color channels (red, green, blue), so you could represent each pixel as a 3-dimensional vector. But 50,176 tokens is far too many. The attention mechanism (Chapter 7) computes scores between every pair of tokens, which means O(n^2) computation. For 50,176 tokens, that is over 2.5 billion attention score computations per layer. And real images are much larger than 224 x 224.

The solution, introduced by the Vision Transformer (ViT), is to divide the image into patches and treat each patch as a token. Instead of one pixel = one token, one patch of pixels = one token. A 16 x 16 pixel patch contains 256 pixels (768 values when you include all 3 color channels). A 224 x 224 image divided into 16 x 16 patches produces (224 / 16) x (224 / 16) = 14 x 14 = 196 patches. That is 196 tokens instead of 50,176. The attention computation drops from 2.5 billion to about 38,000 score computations per layer.

import numpy as np

def image_to_patches(image_height, image_width, patch_size, channels=3):
    """
    Calculate how an image is divided into patches for a Vision Transformer.
    Returns the number of patches and the dimension of each patch vector.
    """
    patches_h = image_height // patch_size
    patches_w = image_width // patch_size
    num_patches = patches_h * patches_w
    patch_dim = patch_size * patch_size * channels  # Flattened patch
    
    return {
        "image_size": f"{image_height}x{image_width}",
        "patch_size": f"{patch_size}x{patch_size}",
        "grid": f"{patches_h}x{patches_w}",
        "num_patches": num_patches,
        "patch_dim": patch_dim,
        "attention_scores": num_patches * num_patches,
    }

configs = [
    (224, 224, 16, "ViT standard (224px, 16x16 patches)"),
    (224, 224, 32, "ViT with larger patches (224px, 32x32)"),
    (384, 384, 14, "SigLIP So400m (384px, 14x14 patches)"),
    (560, 560, 14, "LLaMA 3.2 90B vision (560px, 14x14 patches)"),
]

print(f"{'Config':<45} {'Patches':>8} {'Patch Dim':>10} {'Attn Scores':>12}")
print("-" * 80)
for h, w, p, label in configs:
    info = image_to_patches(h, w, p)
    print(f"{label:<45} {info['num_patches']:>8,} {info['patch_dim']:>10,} "
          f"{info['attention_scores']:>12,}")

The tradeoff is clear: smaller patches preserve more detail but create more tokens (and more computation). Larger patches are cheaper but lose fine-grained information. The original ViT paper used 16 x 16 patches as the default, which is why the paper is titled “An Image is Worth 16x16 Words.”


The Vision Transformer (ViT)

The Vision Transformer was introduced by Dosovitskiy et al. at Google Brain in October 2020 (arXiv:2010.11929, ICLR 2021). The key insight was that you do not need convolutional neural networks (CNNs) for computer vision. You can use the same Transformer architecture from NLP, with minimal modifications, if you just convert image patches into token-like embeddings.

How ViT Works, Step by Step

  1. Split the image into patches. A 224 x 224 RGB image is divided into a grid of non-overlapping patches. With 16 x 16 patches, this produces 196 patches, each containing 16 x 16 x 3 = 768 pixel values.

  2. Flatten and project each patch. Each patch is flattened into a 768-dimensional vector (for 16 x 16 x 3 pixels). This vector is then multiplied by a learned linear projection matrix to produce a patch embedding of the model’s hidden dimension (e.g., 768 for ViT-Base). This is exactly analogous to the token embedding lookup in a text Transformer (Chapter 5), except instead of looking up a token ID in a table, you multiply the raw pixel values by a weight matrix.

  3. Add position embeddings. Just like text tokens need positional information (Chapter 6), image patches need to know where they are in the image. ViT uses learned position embeddings: one embedding vector per patch position, added to the patch embedding.

  4. Prepend a [CLS] token. A special learnable vector is prepended to the sequence of patch embeddings. After processing through the Transformer, the output corresponding to this [CLS] token is used as the representation of the entire image (for classification tasks).

  5. Process through Transformer encoder layers. The sequence of patch embeddings (plus the [CLS] token) is fed through standard Transformer encoder blocks: multi-head self-attention, layer normalization, feed-forward networks, and residual connections. These are the same components you learned about in Chapters 7 through 10.

  6. Read the output. For classification, the output vector at the [CLS] position is fed through a small MLP head that produces class probabilities.

import numpy as np

def vit_forward_shapes(image_size=224, patch_size=16, hidden_dim=768,
                       num_layers=12, num_heads=12, mlp_ratio=4):
    """
    Trace the tensor shapes through a ViT forward pass.
    """
    channels = 3
    num_patches = (image_size // patch_size) ** 2
    patch_dim = patch_size * patch_size * channels
    
    steps = []
    
    # Step 1: Image to patches
    steps.append(("Input image", f"({image_size}, {image_size}, {channels})"))
    steps.append(("Patches (flattened)", f"({num_patches}, {patch_dim})"))
    
    # Step 2: Linear projection
    steps.append(("Patch projection weight", f"({patch_dim}, {hidden_dim})"))
    steps.append(("Patch embeddings", f"({num_patches}, {hidden_dim})"))
    
    # Step 3: Add position embeddings
    steps.append(("Position embeddings", f"({num_patches + 1}, {hidden_dim})"))
    
    # Step 4: Prepend [CLS] token
    seq_len = num_patches + 1  # patches + CLS
    steps.append(("Input to Transformer", f"({seq_len}, {hidden_dim})"))
    
    # Step 5: Transformer layers
    head_dim = hidden_dim // num_heads
    steps.append(("Per-head Q, K, V", f"({seq_len}, {head_dim}) x {num_heads} heads"))
    steps.append(("Attention scores (per head)", f"({seq_len}, {seq_len})"))
    steps.append(("FFN intermediate", f"({seq_len}, {hidden_dim * mlp_ratio})"))
    steps.append(("Transformer output", f"({seq_len}, {hidden_dim})"))
    
    # Step 6: CLS output
    steps.append(("[CLS] output vector", f"({hidden_dim},)"))
    
    print(f"ViT Forward Pass: {image_size}px image, {patch_size}x{patch_size} patches")
    print(f"Hidden dim: {hidden_dim}, Layers: {num_layers}, Heads: {num_heads}")
    print("=" * 55)
    for name, shape in steps:
        print(f"  {name:<30} {shape}")

vit_forward_shapes()

ViT Model Sizes

The original paper defined three model sizes, following the naming convention from BERT:

ModelLayersHidden DimMLP DimHeadsPatch SizeParams
ViT-Base (ViT-B/16)127683,0721216 x 16~86M
ViT-Large (ViT-L/16)241,0244,0961616 x 16~307M
ViT-Huge (ViT-H/14)321,2805,1201614 x 14~632M

The “/16” or “/14” suffix indicates the patch size. ViT-H/14 uses 14 x 14 patches, which produces more patches per image (256 instead of 196 for a 224 x 224 image) and therefore captures finer detail at the cost of more computation.

These sizes may seem small compared to the language models discussed in Chapter 11 (where 8B is considered “small”). That is because vision encoders do not need to store the same kind of factual knowledge that language models do. Their job is to extract visual features: edges, textures, shapes, objects, spatial relationships. The heavy lifting of reasoning and language generation happens in the language model that receives the vision encoder’s output.

Source: Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” arXiv:2010.11929, October 2020. ICLR 2021. ViT-Base: 12 layers, 768 hidden, 3072 MLP, 12 heads, ~86M params. ViT-Large: 24 layers, 1024 hidden, 4096 MLP, 16 heads, ~307M params. ViT-Huge: 32 layers, 1280 hidden, 5120 MLP, 16 heads, ~632M params. Confirmed from huggingface.co/google/vit-base-patch16-224 and the original paper Table 1.


From Classification to Language: CLIP and Contrastive Learning

The original ViT was trained for image classification: given an image, predict which of 1,000 ImageNet categories it belongs to. This is useful for computer vision research, but it does not help a language model understand images. A language model needs visual representations that are aligned with text, so that the concept of “a dog sitting on a couch” in an image maps to something close to the text description “a dog sitting on a couch.”

CLIP (Contrastive Language-Image Pre-training), introduced by Radford et al. at OpenAI in January 2021 (arXiv:2103.00020), solved this problem. CLIP trains two encoders simultaneously: a vision encoder (a ViT or ResNet) and a text encoder (a Transformer). Both encoders map their inputs into the same vector space. During training, CLIP is given batches of image-text pairs and learns to make the vector for an image close to the vector for its matching text description, and far from the vectors for non-matching descriptions.

The training objective is contrastive: given a batch of N image-text pairs, CLIP computes the similarity (dot product) between every image vector and every text vector, producing an N x N similarity matrix. The correct pairings lie on the diagonal. CLIP maximizes the similarity scores on the diagonal and minimizes the off-diagonal scores.

import numpy as np

def clip_contrastive_loss_demo(batch_size=4, embed_dim=512):
    """
    Demonstrate CLIP's contrastive learning objective.
    Shows how image and text embeddings are aligned.
    """
    np.random.seed(42)
    
    # Simulate image and text embeddings (normally from ViT and text Transformer)
    image_embeddings = np.random.randn(batch_size, embed_dim)
    text_embeddings = np.random.randn(batch_size, embed_dim)
    
    # Normalize to unit vectors (CLIP uses cosine similarity)
    image_embeddings /= np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    text_embeddings /= np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    
    # Compute similarity matrix (N x N)
    temperature = 0.07  # Learned temperature parameter
    similarity = (image_embeddings @ text_embeddings.T) / temperature
    
    print(f"Batch size: {batch_size}, Embedding dim: {embed_dim}")
    print(f"\nSimilarity matrix (scaled by temperature={temperature}):")
    print(f"Rows = images, Columns = texts")
    print(f"Diagonal = correct pairs (should be HIGH)")
    print(f"Off-diagonal = wrong pairs (should be LOW)\n")
    
    for i in range(batch_size):
        row = "  ".join(f"{similarity[i, j]:>7.2f}" for j in range(batch_size))
        marker = "  <-- correct pair" if True else ""
        print(f"  Image {i}: [{row}]  (correct: Text {i})")
    
    # The loss pushes diagonal values up and off-diagonal values down
    # This is a symmetric cross-entropy loss over both rows and columns
    
    return similarity

clip_contrastive_loss_demo()

After training on 400 million image-text pairs scraped from the internet, CLIP’s vision encoder produces vectors that are semantically meaningful in a way that aligns with language. The vector for a photo of a golden retriever is close to the text vector for “golden retriever,” “dog,” and “pet,” and far from “car” or “building.”

This is why CLIP (and its successors) became the foundation for vision-language models. The vision encoder does not just recognize objects; it produces representations that live in the same semantic space as text.

SigLIP: The Modern Successor

SigLIP (Sigmoid Loss for Language Image Pre-training), introduced by Zhai et al. at Google in 2023 (arXiv:2303.15343), improved on CLIP by replacing the softmax-based contrastive loss with a simpler sigmoid loss. Instead of computing a full N x N similarity matrix and normalizing across the batch, SigLIP treats each image-text pair independently with a binary classification: “does this image match this text?” This allows training with larger batch sizes and produces better-calibrated similarity scores.

SigLIP’s most widely used variant is SigLIP-So400m (Shape-optimized, 400 million parameters), which uses a ViT with 14 x 14 patches at 384 x 384 resolution. This encoder is used in Google’s PaliGemma family of vision-language models.

In February 2025, Google released SigLIP 2 (arXiv:2502.14786), which added captioning-based pretraining, self-supervised learning (self-distillation and masked prediction), and online data curation to the original SigLIP recipe. SigLIP 2 produces vision encoders with improved semantic understanding, better localization, and richer dense features.

Source: Radford et al., “Learning Transferable Visual Models From Natural Language Supervision,” arXiv:2103.00020, February 2021. CLIP trained on 400M image-text pairs. Zhai et al., “Sigmoid Loss for Language Image Pre-Training,” arXiv:2303.15343, 2023. SigLIP-So400m: ~400M params, ViT with 14x14 patches at 384x384 resolution, used in PaliGemma (confirmed from huggingface.co/google/siglip-so400m-patch14-384 and arxiv.org/abs/2310.09199). SigLIP 2: arXiv:2502.14786, February 2025 (confirmed from arxiv.org/html/2502.14786).


Connecting Vision to Language: Three Architectures

Having a vision encoder that produces good image representations is only half the problem. You also need a way to feed those representations into a language model so it can reason about the image and generate text responses. There are three main approaches, and understanding the differences between them is important because they have different tradeoffs in quality, efficiency, and flexibility.

Approach 1: Linear Projection (LLaVA-style)

The simplest approach, pioneered by LLaVA (Large Language and Vision Assistant, Liu et al., arXiv:2304.08485, NeurIPS 2023 Oral), is to use a linear projection layer (or a small MLP) to map the vision encoder’s output vectors directly into the language model’s embedding space.

Here is how it works:

  1. The vision encoder (e.g., CLIP ViT-L/14) processes the image and produces a sequence of patch vectors, one per image patch.
  2. A learned projection layer maps each patch vector from the vision encoder’s dimension to the language model’s hidden dimension.
  3. These projected vectors are inserted into the language model’s input sequence alongside the text token embeddings.
  4. The language model processes the combined sequence of text and image tokens using its standard Transformer layers.
import numpy as np

def llava_projection_demo():
    """
    Demonstrate how LLaVA connects a vision encoder to a language model
    using a simple linear projection.
    """
    # Vision encoder output: 576 patches, each 1024-dim (CLIP ViT-L/14 at 336px)
    vision_dim = 1024
    num_patches = 576  # (336/14)^2 = 24x24 = 576
    
    # Language model hidden dimension (e.g., Vicuna 7B)
    language_dim = 4096
    
    # The projection is a simple linear layer (or 2-layer MLP in LLaVA-1.5)
    # W: (vision_dim, language_dim)
    projection_params = vision_dim * language_dim
    
    # Text tokens
    num_text_tokens = 50  # "Describe what you see in this image."
    
    # Combined sequence
    total_tokens = num_patches + num_text_tokens
    
    print("LLaVA Architecture (Projection-based)")
    print("=" * 55)
    print(f"  Vision encoder output:  ({num_patches}, {vision_dim})")
    print(f"  Projection weight:      ({vision_dim}, {language_dim})")
    print(f"  Projected image tokens: ({num_patches}, {language_dim})")
    print(f"  Text token embeddings:  ({num_text_tokens}, {language_dim})")
    print(f"  Combined input to LLM:  ({total_tokens}, {language_dim})")
    print(f"  Projection parameters:  {projection_params:,}")
    print(f"\n  The LLM sees {total_tokens} tokens total:")
    print(f"    {num_patches} from the image + {num_text_tokens} from the text")
    print(f"    It processes them all with the same attention mechanism.")

llava_projection_demo()

The elegance of this approach is its simplicity. The vision encoder and language model are both pretrained separately. The only new component is the projection layer, which has relatively few parameters. LLaVA’s original projection was a single linear layer; LLaVA-1.5 upgraded to a two-layer MLP, which improved performance.

The downside is that every image patch becomes a token in the language model’s input sequence. A 336 x 336 image with 14 x 14 patches produces 576 image tokens. A 672 x 672 image produces 2,304 tokens. These image tokens consume context window space and add to the attention computation cost.

Approach 2: Cross-Attention (Flamingo-style)

Flamingo, introduced by Alayrac et al. at DeepMind in April 2022 (arXiv:2204.14198, NeurIPS 2022), took a different approach. Instead of inserting image tokens directly into the language model’s input sequence, Flamingo adds cross-attention layers that allow the language model to attend to the image representations at specific points in the network.

The architecture has three key components:

  1. A frozen vision encoder (a pretrained NFNet) processes the image and produces visual features.
  2. A Perceiver Resampler takes the variable-length visual features and compresses them into a fixed number of “visual tokens” (typically 64). This is important because it means the number of visual tokens does not depend on the image resolution.
  3. Gated cross-attention layers are inserted between the frozen language model layers. In these layers, the language model’s hidden states serve as queries, and the visual tokens from the Perceiver Resampler serve as keys and values. A learned gating parameter controls how much visual information flows into the language model.
def flamingo_architecture_demo():
    """
    Demonstrate how Flamingo connects vision to language
    using cross-attention with a Perceiver Resampler.
    """
    # Vision encoder output (variable length depending on image)
    vision_features = 196  # e.g., 14x14 patches
    vision_dim = 1024
    
    # Perceiver Resampler compresses to fixed length
    num_latent_queries = 64  # Fixed, regardless of image size
    resampler_dim = 1024
    
    # Language model
    language_dim = 4096
    num_lm_layers = 32
    cross_attn_every_n = 4  # Cross-attention every 4th layer
    num_cross_attn_layers = num_lm_layers // cross_attn_every_n
    
    print("Flamingo Architecture (Cross-Attention-based)")
    print("=" * 55)
    print(f"  Vision encoder output:    ({vision_features}, {vision_dim})")
    print(f"  Perceiver Resampler output: ({num_latent_queries}, {resampler_dim})")
    print(f"  Language model layers:    {num_lm_layers}")
    print(f"  Cross-attention layers:   {num_cross_attn_layers} "
          f"(every {cross_attn_every_n}th layer)")
    print(f"\n  In cross-attention layers:")
    print(f"    Query: language hidden states ({language_dim}-dim)")
    print(f"    Key/Value: visual tokens ({num_latent_queries} tokens, "
          f"{resampler_dim}-dim)")
    print(f"\n  Key advantage: image tokens do NOT consume context window.")
    print(f"  The {num_latent_queries} visual tokens are only accessed via "
          f"cross-attention,")
    print(f"  not concatenated into the main sequence.")

flamingo_architecture_demo()

The key advantage of cross-attention is that image tokens do not consume the language model’s context window. The visual information is accessed through a side channel (the cross-attention layers), leaving the full context window available for text. The Perceiver Resampler also provides a fixed-size bottleneck, so the cost does not scale with image resolution.

The downside is added architectural complexity: you need to add new cross-attention layers to the language model, which means more parameters to train and a more complex training pipeline.

Approach 3: Early Fusion (Gemini-style)

The third approach, used by Google’s Gemini family, is early fusion: images and text are tokenized into the same sequence from the very beginning, and the entire model is trained on mixed-modality data from scratch. There is no separate “vision encoder” that was pretrained independently. Instead, the model learns to process both visual and textual tokens through the same Transformer layers.

In early fusion, images are converted into discrete tokens (similar to how text is tokenized) or into continuous embeddings that are interleaved with text tokens. The model’s attention mechanism processes all tokens uniformly, whether they came from text or images.

The advantage is that the model can learn deep cross-modal interactions from the ground up, rather than trying to bridge two separately trained models. The disadvantage is that it requires training the entire model from scratch on massive multimodal datasets, which is significantly more expensive than the other approaches.

Comparison

FeatureProjection (LLaVA)Cross-Attention (Flamingo)Early Fusion (Gemini)
Vision encoderPretrained, frozen or fine-tunedPretrained, frozenTrained jointly
Connection methodLinear/MLP projectionPerceiver + cross-attentionUnified tokenization
Image tokens in contextYes (consumes context)No (side channel)Yes (consumes context)
Training costLow (only projection + LLM fine-tune)Medium (cross-attention layers)Very high (full model)
FlexibilityEasy to swap encodersModerateTightly coupled
Used byLLaVA, Qwen2.5-VL, Mistral Small 4 (Pixtral-based), many open modelsFlamingo, LLaMA 3.2Gemini, GPT-4o, Claude, Qwen 3.5

Source: Liu et al., “Visual Instruction Tuning,” arXiv:2304.08485, April 2023. NeurIPS 2023 Oral. LLaVA connects a frozen CLIP ViT-L/14 vision encoder to a Vicuna LLM through a linear projection layer (confirmed from github.com/haotian-liu/LLaVA and abhik.xyz/papers/visual-instruction-tuning). Alayrac et al., “Flamingo: a Visual Language Model for Few-Shot Learning,” arXiv:2204.14198, April 2022. NeurIPS 2022. Uses Perceiver Resampler (64 learned latent queries) and gated cross-attention layers inserted into a frozen language model (confirmed from deepmind.google/discover/blog/tackling-multiple-tasks-with-a-single-visual-language-model and arxiv.org/abs/2204.14198).


How Real Models Handle Vision in March 2026

Now that you understand the three architectural approaches, let’s look at how the major models actually implement vision.

LLaMA 4 Maverick: Projection with MetaCLIP

LLaMA 4 Maverick (Meta, April 2025) uses a projection-based approach. Its vision encoder is based on MetaCLIP, a variant of CLIP trained on Meta’s curated dataset. The vision encoder is a 34-layer Vision Transformer with the following configuration:

  • Hidden size: 768
  • Attention heads: 16
  • FFN intermediate size: 5,632
  • Vision output dimension: 7,680 (includes outputs from intermediate transformer layers and a global transformer encoder)

The vision encoder’s output is mapped to the language model’s token space through a two-stage multi-modal projector: first reducing from the vision output dimension (7,680) to 4,096, then projecting to 4,096 (matching the text model’s hidden dimension). The projected image vectors are then interleaved with text token embeddings and processed by the same MoE Transformer layers that handle text.

This is the same architecture described in Chapter 5 when we discussed multimodal embeddings. The key point is that once the image patches are projected into the language model’s embedding space, the Transformer treats them identically to text tokens. The attention mechanism (Chapter 7) computes attention scores between image tokens and text tokens, allowing the model to relate visual content to language.

Source: LLaMA 4 Maverick vision config from HuggingFace Transformers Llama4VisionConfig: hidden_size=768, 34 vision layers, 16 vision attention heads, intermediate_size=5,632, vision_output_dim=7,680, projector_input_dim=4,096, projector_output_dim=4,096. MetaCLIP-based vision encoder confirmed from changyulee.oopy.io, medium.com/@ashishchadha11944, and multiple technical analyses (April 2025).

LLaMA 3.2 Vision: Cross-Attention

LLaMA 3.2 Vision (Meta, September 2024) took a different approach from LLaMA 4. It uses cross-attention layers, similar to Flamingo. The 11B model builds on the LLaMA 3 8B text model, and the 90B model builds on the LLaMA 3 70B text model. A pretrained image encoder processes the image, and cross-attention layers are inserted at specific intervals in the language model to allow text tokens to attend to visual features.

The vision encoder is the same for both the 11B and 90B models, but the chunk size differs: 448 pixels for the 11B model and 560 pixels for the 90B model. This means the 90B model processes images at higher resolution, producing more visual tokens per image.

The cross-attention approach means that image tokens do not consume the language model’s context window. The visual information is accessed through dedicated cross-attention layers, leaving the full context available for text.

Source: LLaMA 3.2 Vision architecture confirmed from cglagovich.github.io/2024/10/12/llama-vision-compute.html (11B builds on 8B text model, 90B builds on 70B text model, same vision encoder for both, chunk size 448 for 11B and 560 for 90B), hf.co/blog/llama32, and huggingface.co/meta-llama/Llama-3.2-11B-Vision.

GPT-4o and GPT-5.4: Unified Multimodal

OpenAI’s GPT-4o (May 2024) and GPT-5.4 (March 2026) use an early fusion approach. GPT-4o processes text, images, and audio through a unified decoder-only Transformer architecture. All modalities are encoded into a shared embedding space with modality-agnostic weights and a common sequence processing stack. (Note: GPT-4o was retired from ChatGPT on February 13, 2026, but remains available through the API.)

For image input, GPT-4o uses a tiling system. When you send an image with detail: "high", the API:

  1. Scales the image so its longest side is at most 2,048 pixels.
  2. Scales it down further so the shortest side is at most 768 pixels.
  3. Divides the image into 512 x 512 pixel tiles.
  4. Each tile costs 170 tokens. An additional 85 base tokens are always added.

For detail: "low", the image is resized to 512 x 512 and represented as just 85 tokens, regardless of the original size.

def gpt4o_image_tokens(width, height, detail="high"):
    """
    Calculate the number of tokens an image consumes in GPT-4o.
    Based on OpenAI's documented tiling system.
    """
    if detail == "low":
        return 85
    
    # Step 1: Scale so longest side <= 2048
    max_side = max(width, height)
    if max_side > 2048:
        scale = 2048 / max_side
        width = int(width * scale)
        height = int(height * scale)
    
    # Step 2: Scale so shortest side <= 768
    min_side = min(width, height)
    if min_side > 768:
        scale = 768 / min_side
        width = int(width * scale)
        height = int(height * scale)
    
    # Step 3: Count 512x512 tiles
    import math
    tiles_w = math.ceil(width / 512)
    tiles_h = math.ceil(height / 512)
    total_tiles = tiles_w * tiles_h
    
    # Step 4: 170 tokens per tile + 85 base tokens
    tokens = total_tiles * 170 + 85
    
    return tokens

examples = [
    (512, 512, "low", "Small image, low detail"),
    (512, 512, "high", "Small image, high detail"),
    (1024, 1024, "high", "Medium image"),
    (1920, 1080, "high", "Full HD screenshot"),
    (4096, 3072, "high", "High-res photo"),
]

print(f"{'Description':<30} {'Size':>12} {'Detail':>7} {'Tokens':>8}")
print("-" * 62)
for w, h, detail, desc in examples:
    tokens = gpt4o_image_tokens(w, h, detail)
    print(f"{desc:<30} {f'{w}x{h}':>12} {detail:>7} {tokens:>8,}")

GPT-5.4, released on March 5, 2026, introduced full-resolution vision processing. Pre-release code commits in OpenAI’s Codex repository revealed a version check for “gpt-5.4” alongside a capability label for full-resolution image input. This means GPT-5.4 can process images at their native resolution without the downscaling and tiling that GPT-4o requires, potentially producing more accurate results on tasks that depend on fine visual detail. GPT-5.4 supports a 1.05 million token context window (as discussed in Chapter 20), and its vision capabilities contributed to its 75% score on OSWorld-Verified, surpassing the 72.4% human baseline for autonomous computer use.

Source: GPT-4o image tiling system: 170 tokens per 512x512 tile plus 85 base tokens (confirmed from community.openai.com/t/how-do-i-calculate-image-tokens-in-gpt4-vision/492318 and OpenAI vision documentation). GPT-4o unified architecture: “all modalities encoded into a single shared embedding space, with modality-agnostic weights” (confirmed from emergentmind.com/topics/gpt-4o-language-model). GPT-4o retired from ChatGPT on February 13, 2026; remains available via API (confirmed from openai.com/index/retiring-gpt-4o-and-older-models, help.openai.com/en/articles/20001051). GPT-5.4 full-resolution vision: pre-release Codex code commits, confirmed from nxcode.io/resources/news/gpt-5-4-leaked, thenextgentechinsider.com, piunikaweb.com. GPT-5.4 released March 5, 2026, 1.05M context window, 75% OSWorld-Verified surpassing 72.4% human baseline (confirmed from openai.com/index/introducing-gpt-5-4, computertech.co, aihaven.com, letsdatascience.com, automatio.ai/models/gpt-5-4). GPT-5.4 mini and nano released March 17, 2026, with multimodal capabilities; mini supports 400K context, text and image inputs, 72.1% OSWorld-Verified, 76.6% MMMU-Pro, $0.75/MTok input, $4.50/MTok output; nano is API-only, $0.20/MTok input, $1.25/MTok output, 66.1% MMMU-Pro (confirmed from openai.com/index/introducing-gpt-5-4-mini-and-nano, thenextgentechinsider.com, buildfastwithai.com).

Gemini: Native Multimodal from the Start

Google’s Gemini models (starting with Gemini 1.0 in December 2023) were designed as natively multimodal from the beginning. The architecture uses a Transformer decoder backbone that tokenizes images and video frames into discrete units similar to text tokens, enabling direct fusion across modalities. This is the early fusion approach: rather than bolting a vision encoder onto a text model, the entire model is trained on interleaved text, image, audio, and video data.

Gemini 2.5 Pro (March 2025) and Gemini 3.1 Pro (February 2026) continue this approach, with enhanced cross-modal attention mechanisms and support for up to 1 million tokens of context that can include a mix of text, images, and video.

Source: Gemini native multimodal architecture confirmed from emergentmind.com/topics/gemini-models (“tokenizing images and video frames into discrete units akin to text tokens, enabling direct fusion across modalities”), emergentmind.com/topics/gemini-foundation-model, and arxiv.org/html/2507.06261.

Claude: Vision Through the Same Reasoning Architecture

Anthropic’s Claude models have supported vision since the Claude 3 family launch on March 4, 2024. Claude processes images by converting them into tokens that are combined with text tokens inside its context window. Anthropic has not published detailed architectural papers about Claude’s vision system, but the API documentation reveals how images are tokenized.

Claude’s image token formula is straightforward:

tokens = (width_px * height_px) / 750

If an image’s long edge exceeds 1,568 pixels, or the image would exceed roughly 1,600 tokens, Claude scales it down (preserving aspect ratio) before processing. A 1,000 x 1,000 pixel image uses approximately 1,333 tokens. A 200 x 200 pixel image uses only about 53 tokens.

def claude_image_tokens(width, height, max_long_edge=1568):
    """
    Calculate the number of tokens an image consumes in Claude.
    Based on Anthropic's documented formula: tokens = (width * height) / 750.
    Images are scaled down if the long edge exceeds 1,568 pixels.
    """
    # Scale down if needed
    long_edge = max(width, height)
    if long_edge > max_long_edge:
        scale = max_long_edge / long_edge
        width = int(width * scale)
        height = int(height * scale)
    
    tokens = (width * height) / 750
    return int(tokens)

examples = [
    (200, 200, "Small thumbnail"),
    (1000, 1000, "Medium image"),
    (1092, 1092, "Standard photo"),
    (1920, 1080, "Full HD screenshot"),
    (4096, 3072, "High-res photo (scaled down)"),
]

print(f"{'Description':<30} {'Original':>12} {'Tokens':>8}")
print("-" * 55)
for w, h, desc in examples:
    tokens = claude_image_tokens(w, h)
    print(f"{desc:<30} {f'{w}x{h}':>12} {tokens:>8,}")

Claude processes visual content through the same reasoning architecture it uses for text, rather than using a separate specialized vision pipeline. This means Claude excels at tasks requiring interpretation and explanation (analyzing charts, reading documents, understanding diagrams) rather than pure perceptual tasks like precise object localization.

All Claude models from Claude 3 onward support vision: Claude 3 Opus, Sonnet, and Haiku (March 2024), through to Claude Opus 4.6 (February 5, 2026) and Claude Sonnet 4.6 (February 17, 2026). Supported image formats include JPEG, PNG, GIF, and WebP, with a maximum file size of 5 MB per image.

Source: Claude image token formula tokens = (width * height) / 750 and 1,568-pixel long edge limit confirmed from docs.anthropic.com/en/docs/build-with-claude/vision and platform.claude.com/docs/en/build-with-claude/vision. Claude 3 family launched March 4, 2024 (confirmed from claudefa.st/blog/models/claude-3). Claude processes visual content through the same reasoning architecture as text (confirmed from getstream.io/blog/anthropic-claude-visual-reasoning).

Qwen2.5-VL: Dynamic Resolution Without Downscaling

Alibaba’s Qwen2.5-VL (released January 2025, arXiv:2502.13923) introduced a significant architectural innovation: a native dynamic-resolution Vision Transformer that processes images at their original resolution instead of forcing them into a fixed size. Most vision encoders (including CLIP and SigLIP) resize all images to a fixed resolution (224 x 224, 384 x 384, etc.) before processing. Qwen2.5-VL skips this step entirely.

The vision encoder divides the input image into 14 x 14 patches at whatever resolution the image happens to be, up to a maximum of 1,280 x 28 x 28 patches. This means a small icon and a large photograph produce different numbers of tokens, proportional to their actual pixel content. The model also uses 2D rotary position embeddings (extending the RoPE concept from Chapter 6 to two spatial dimensions) so the vision encoder understands the spatial layout of patches regardless of image size.

Qwen2.5-VL is available in four sizes (3B, 7B, 32B, and 72B parameters) under permissive licenses. The 72B model matches or exceeds GPT-4o on several vision benchmarks, including DocVQA (document understanding) and ChartQA (chart interpretation). The dynamic resolution approach is particularly effective for document understanding, where preserving the original layout and text size is critical for accurate OCR and table extraction.

Source: Qwen2.5-VL released January 2025 (confirmed from roboflow.com/fine-tune-deploy-qwen2-5-vl, c-sharpcorner.com January 27 2025, llm-stats.com January 26 2025). Native dynamic-resolution ViT, 2D rotary position embeddings, and window attention confirmed from arxiv.org/abs/2502.13923 and huggingface.co/papers/2502.13923. Available in 3B, 7B, 32B, and 72B sizes (confirmed from huggingface.co/collections/Qwen/qwen25-vl and qwen-ai.com/qwen-vision).

Qwen3-VL and Qwen 3.5: The Next Generation

Alibaba continued to push the Qwen vision line with two major releases. Qwen3-VL (November 26, 2025, arXiv:2511.21631) expanded the family to include both dense variants (2B, 4B, 8B, 32B) and MoE variants (30B-A3B, 235B-A22B). It natively supports interleaved contexts of up to 256K tokens, integrating text, images, and video. Qwen3-VL introduced DeepStack feature fusion and enhanced spatial perception, including 3D grounding for embodied AI applications.

Then in February 2026, Alibaba released Qwen 3.5 (February 16, 2026), which represents a fundamental architectural shift: it uses early fusion training on multimodal tokens from the beginning, rather than attaching a separate vision encoder to a text model. The flagship Qwen3.5-397B-A17B has 397 billion total parameters with only 17 billion active per forward pass (MoE architecture). It supports 201 languages, processes images up to 1,344 x 1,344 pixels, and handles video clips up to 60 seconds at the base 262K context window (extensible to roughly two hours of video at the full 1 million token extended context). The 262K native context window is extensible to over 1 million tokens.

Alibaba then expanded the family with a medium series (February 24, 2026: Qwen3.5-27B dense, Qwen3.5-35B-A3B, and Qwen3.5-122B-A10B) and a small series (March 2, 2026: Qwen3.5-0.8B, 2B, 4B, and 9B). The 4B and 9B small models include native multimodal support with early fusion, bringing vision capabilities down to models that can run on consumer hardware. All models are released under the Apache 2.0 license.

This progression from Qwen2.5-VL (projection-based, January 2025) to Qwen 3.5 (early fusion, February 2026) mirrors the broader industry trend: open-source models are moving from bolting vision encoders onto text models toward training unified multimodal architectures from scratch.

Source: Qwen3-VL released November 26, 2025 (confirmed from emergentmind.com/topics/qwen3-vl-model citing Bai et al., 26 Nov 2025, and huggingface.co/papers/2511.21631). Dense variants 2B/4B/8B/32B and MoE variants 30B-A3B/235B-A22B (confirmed from liner.com/review/qwen3vl-technical-report and huggingface.co/Qwen/Qwen3-VL-2B-Instruct). Qwen 3.5 flagship released February 16, 2026, 397B total/17B active, early fusion, 201 languages, 262K native context extensible to 1M (confirmed from qwen-ai.com/limitations, techcommunity.microsoft.com/blog/azure-ai-foundry-blog/now-in-foundry-qwen3-5-medium-model-series/4498640, blockchain.news, nvidia.com/blog/develop-native-multimodal-agents-with-qwen3-5-vlm). Medium series released February 24, 2026: Qwen3.5-27B, 35B-A3B, 122B-A10B (confirmed from the-decoder.com, techie007.substack.com, techcommunity.microsoft.com). Small series released March 2, 2026: 0.8B, 2B, 4B, 9B with native multimodal in 4B/9B (confirmed from officechai.com, indiatoday.in, blockchain.news, oflight.co.jp). Video: up to 60 seconds at base context, approximately 2 hours at 1M extended context (confirmed from thenextgentechinsider.com, blockchain.news).

Mistral Small 4: Multimodal MoE

Mistral Small 4 (Mistral AI, March 16, 2026) is the latest example of vision becoming a default capability in open-source models. It is a 119B-parameter MoE model with 128 experts and approximately 6B active parameters per token (top-4 routing; the official blog states 6B while the HuggingFace model card lists 6.5B). It accepts both text and image inputs, supports a 256K context window, and ships under the Apache 2.0 license. The vision system builds on Mistral’s Pixtral architecture (the model uses the mistral3 model type internally), which processes images at their natural resolution and aspect ratio through a dedicated vision encoder connected to the language model via a two-layer fully connected projection network. Mistral Small 4 unifies instruct, reasoning, and multimodal capabilities into a single model, eliminating the need to choose between separate specialized models for different tasks.

Source: Mistral Small 4 released March 16, 2026, 119B total/~6B active (6B per mistral.ai blog, 6.5B per HuggingFace model card), 128 experts, top-4 routing, 256K context, Apache 2.0, multimodal text+image input, unifies instruct/reasoning/coding (confirmed from mistral.ai/news/mistral-small-4, huggingface.co/mistralai/Mistral-Small-4-119B-2603, testingcatalog.com, the-decoder.com, siliconrepublic.com). 40% reduction in end-to-end completion time and 3x more requests per second vs Mistral Small 3 (confirmed from huggingface.co model card). Uses mistral3 model type (Pixtral-based vision architecture) with two-layer FC projection from vision encoder to decoder (confirmed from huggingface.co/mistralai/Mistral-Small-4-119B-2603 config and ritvik19.medium.com/papers-explained-219-pixtral).


Real Use Cases: What Vision Models Can Do

Vision-language models are not just a research curiosity. They are used in production for tasks that would have been impossible or extremely expensive with traditional computer vision approaches just a few years ago.

Reading Screenshots and Documents

One of the most practical applications is document understanding. You can send a screenshot of a web page, a PDF, a spreadsheet, or a scanned document to a vision-language model and ask questions about it. The model can:

  • Extract text from images (OCR-like functionality, but with understanding)
  • Summarize the content of a document
  • Answer specific questions about data in tables or charts
  • Compare information across multiple pages

This works because the vision encoder converts the image into patch tokens that capture both the visual layout and the text content. The language model then reasons over these tokens to generate answers.

Analyzing Charts and Graphs

Vision-language models can interpret charts, graphs, and data visualizations. Given a bar chart, the model can identify which category has the highest value, describe trends, and even extract approximate numerical values. Given a line graph, it can describe whether a trend is increasing or decreasing, identify inflection points, and compare multiple series.

This capability is tested by benchmarks like MMMU (Massive Multi-discipline Multimodal Understanding), which includes college-level questions that require interpreting images alongside text. As of March 2026, the top-performing models on MMMU are:

ModelMMMU ScoreNotes
Gemini 3 Flash87.63%Released December 2025
Gemini 3 Pro87.51%Released November 2025
GPT-5.286.67%Released December 2025
Human experts (best)88.6%Upper bound
Human experts (worst)76.2%Lower bound

The leading models have surpassed the lower bound of human expert performance (76.2%) and are approaching the upper bound (88.6%). This benchmark includes questions from 30 subjects across 6 disciplines, requiring both visual perception and domain-specific knowledge.

OpenAI also publishes scores on MMMU-Pro, a harder variant of MMMU with augmented answer options and more complex visual reasoning. On MMMU-Pro, GPT-5.4 scores 81.2%, GPT-5.4 mini scores 76.6%, and GPT-5.4 nano scores 66.1%. These scores confirm that vision capabilities scale across the GPT-5.4 family: even the smallest nano model (priced at just $0.20 per million input tokens) can process and reason about images, though with lower accuracy than the flagship.

Source: MMMU benchmark scores from vals.ai/benchmarks/mmmu (updated March 17, 2026): Gemini 3 Flash 87.63%, Gemini 3 Pro 87.51%, GPT-5.2 86.67%. Human expert range 76.2% to 88.6% from the original MMMU paper (mmmu-benchmark.github.io). MMMU includes over 1,000 tasks spanning 30 subjects in 6 disciplines (confirmed from vals.ai). MMMU-Pro scores from OpenAI official benchmarks (openai.com/index/introducing-gpt-5-4-mini-and-nano): GPT-5.4 81.2%, GPT-5.4 mini 76.6%, GPT-5.4 nano 66.1% (all at xhigh reasoning effort).

Understanding UI Layouts

Vision-language models can analyze user interface screenshots and understand the spatial layout of elements. This is the foundation for computer use capabilities: models that can look at a screen, understand what they see, and decide where to click or what to type. GPT-5.4 scores 75% on OSWorld-Verified (a benchmark for autonomous computer use), above the human baseline of 72.4%. GPT-5.4 mini (released March 17, 2026) scores 72.1% on the same benchmark, nearly matching the flagship while running over 2x faster. This means even the smaller, cheaper model in the GPT-5.4 family can operate a desktop computer at roughly human-level proficiency.

Code from Screenshots

Developers use vision models to convert UI mockups and screenshots into working code. You can paste a screenshot of a website design and ask the model to generate the HTML and CSS that reproduces it. The model interprets the visual layout, identifies components (buttons, text fields, navigation bars), and generates structured code.


Limitations: Where Vision Models Still Fail

Despite impressive capabilities, vision-language models have systematic weaknesses that are important to understand. These are not edge cases; they are fundamental limitations of how current architectures process visual information.

Spatial Reasoning

Vision-language models struggle with precise spatial relationships. A study by researchers at Auburn University and the University of Alberta, titled “Vision language models are blind” (Rahmanzadehgervi et al., arXiv:2407.06581, July 2024), tested VLMs on seven visual tasks that are trivially easy for humans:

  • Whether two circles overlap
  • Whether two lines intersect
  • Which letter is circled in a word
  • Counting shapes in a logo

The models failed on most of these tasks. Across four state-of-the-art VLMs, the average accuracy was only 58.07%. Even the best-performing model (Claude 3.5 Sonnet) achieved only 77.84%, far from the human expected accuracy of 100%. The v6 revision of the paper (March 2025) confirmed that even slow-thinking reasoning models (like o1 and o3) consistently struggle with these tasks when geometric primitives overlap or are close together, though all VLMs perform at near-100% accuracy when shapes are well separated. A separate study (arXiv:2601.11644) found that CLIP achieves only 49% accuracy and BLIP-2 only 54% accuracy on basic directional relationships (above, below, left, right). These are tasks where random guessing would score 50%, meaning the models have essentially no spatial reasoning ability in these specific tests.

The root cause is the patch-based representation. When an image is divided into 14 x 14 or 16 x 16 patches, fine-grained spatial information within and across patch boundaries is lost. The model sees a grid of patch-level features, not individual pixels. Two lines that intersect at a single point may fall within the same patch, making the intersection invisible at the feature level.

Counting

Counting objects in images is a persistent failure mode. The GroundCount study (arXiv:2603.10978, March 2026) found that VLMs exhibit “persistent hallucinations in counting tasks, with accuracy substantially lower than other visual reasoning tasks.” Even the best-performing model (Ovis2) achieved only 81.3% counting accuracy when augmented with an external object detection model. Without augmentation, counting accuracy is significantly worse.

The problem is that counting requires precise spatial localization of individual objects, which conflicts with the patch-based representation that aggregates information within each patch. If two objects fall within the same patch, the model may count them as one.

Fine-Grained Detail

Vision models process images at a fixed resolution determined by the vision encoder’s training. ViT-B/16 at 224 x 224 resolution means each patch covers a 16 x 16 pixel area. For a high-resolution photograph, this means significant downscaling before the model ever sees the image. Fine details like small text, thin lines, or subtle textures may be lost in this downscaling.

Higher-resolution vision encoders (like SigLIP at 384 x 384, or LLaMA 3.2’s 560 x 560 chunks) help, but they produce more tokens and increase computation. The tiling approach used by GPT-4o (processing the image as multiple 512 x 512 tiles) is a practical compromise: it preserves detail by processing the image in pieces, but each tile is processed independently before being combined, which can lose information about relationships that span tile boundaries.

Hallucination

Vision-language models hallucinate visual content. They may describe objects that are not in the image, misidentify colors, or invent text that does not appear in a screenshot. This is the visual equivalent of the text hallucination problem discussed in Chapter 26. The model generates plausible-sounding descriptions based on its training data rather than accurately reporting what is in the image.

A survey of hallucination in large vision-language models (arXiv:2402.00253) categorizes visual hallucinations into object hallucination (describing objects not present), attribute hallucination (wrong colors, sizes, or positions), and relation hallucination (incorrect spatial or logical relationships between objects).

Source: “Vision language models are blind,” Rahmanzadehgervi et al., arXiv:2407.06581, July 2024 (revised March 27, 2025, v6). Tested GPT-4o, Gemini, Claude, and Sonnet on 7 basic visual tasks; four VLMs averaged 58.07% accuracy, best model Claude 3.5 Sonnet at 77.84%; v6 adds slow-thinking model results confirming persistent failures when shapes overlap (confirmed from arxiv.org/abs/2407.06581 v6 abstract). CLIP 49% and BLIP-2 54% on directional relationships from arXiv:2601.11644 (confirmed from arxiv.org/html/2601.11644v1). GroundCount: arXiv:2603.10978, March 2026, 81.3% counting accuracy with Ovis2 using object detection augmentation (confirmed from arxiv.org/abs/2603.10978). Hallucination survey: arXiv:2402.00253 (confirmed from arxiv.org/abs/2402.00253).


The Token Cost of Vision

Processing images is expensive in terms of tokens. Every image patch that enters the language model consumes context window space and adds to the computation cost. This has direct implications for cost and latency.

def vision_token_cost():
    """
    Compare the token cost of processing images across different models.
    """
    print("Token Cost of a Single Image")
    print("=" * 70)
    print(f"{'Model/Config':<35} {'Resolution':>12} {'Image Tokens':>13}")
    print("-" * 70)
    
    configs = [
        ("GPT-4o (detail: low)", "512x512", 85),
        ("GPT-4o (detail: high, 1024x1024)", "1024x1024", 765),
        ("GPT-4o (detail: high, 1920x1080)", "1920x1080", 1_105),
        ("Claude (1000x1000)", "1000x1000", 1_333),
        ("Claude (1920x1080, scaled)", "1920x1080", 1_843),
        ("LLaVA (CLIP ViT-L/14, 336px)", "336x336", 576),
        ("LLaMA 3.2 11B (448px chunks)", "448x448", 1_024),
        ("LLaMA 3.2 90B (560px chunks)", "560x560", 1_600),
        ("LLaMA 4 Maverick", "variable", "variable"),
    ]
    
    for model, res, tokens in configs:
        print(f"  {model:<35} {res:>12} {str(tokens):>13}")
    
    print(f"\nFor comparison:")
    print(f"  A typical text prompt:           ~100-500 tokens")
    print(f"  A full page of text:             ~500-800 tokens")
    print(f"  One high-res image (GPT-4o):     ~765-1,105 tokens")
    print(f"  One high-res image (Claude):     ~1,333-1,843 tokens")
    print(f"\n  A single image can cost as much as 1-2 pages of text.")

vision_token_cost()

The token cost of images means that multi-image conversations consume context window space quickly. If you send 10 high-resolution images to GPT-4o, that is roughly 7,650 to 11,050 tokens just for the images, before any text. This is why the cross-attention approach (used by LLaMA 3.2 and Flamingo) is attractive for applications that need to process many images: the visual tokens do not consume the main context window.


How Training Works for Vision-Language Models

Training a vision-language model typically happens in stages, not all at once. The most common approach (used by LLaVA and many open-source models) has two stages:

Stage 1: Alignment pre-training. The vision encoder and language model are both frozen (their weights do not change). Only the projection layer is trained. The training data consists of image-caption pairs: the model learns to generate a caption given an image. This stage teaches the projection layer to map visual features into the language model’s embedding space. It is cheap and fast because only the small projection layer is being updated.

Stage 2: Visual instruction tuning. The projection layer and the language model are both unfrozen and fine-tuned together (the vision encoder may remain frozen or be fine-tuned with a lower learning rate). The training data consists of visual instruction-following examples: conversations where a user asks questions about an image and the model provides detailed answers. This stage teaches the model to follow instructions, reason about images, and generate helpful responses.

def training_stages():
    """
    Illustrate the two-stage training process for a LLaVA-style model.
    LLaVA-1.5 13B completes full training in ~1 day on a single 8-A100 node.
    """
    print("Two-Stage Training for Vision-Language Models (LLaVA-1.5 13B)")
    print("=" * 65)
    
    print("\nStage 1: Alignment Pre-training")
    print("-" * 65)
    print("  Frozen:    Vision encoder (CLIP ViT-L, ~307M params)")
    print("  Frozen:    Language model (e.g., Vicuna 13B)")
    print("  Trained:   Projection layer (2-layer MLP, ~8M params)")
    print("  Data:      ~558K image-caption pairs (CC3M filtered)")
    print("  Goal:      Align visual features with text embedding space")
    print("  Time:      ~6 hours on 8x A100 (48 GPU-hours)")
    
    print("\nStage 2: Visual Instruction Tuning")
    print("-" * 65)
    print("  Frozen:    Vision encoder (or fine-tuned with low LR)")
    print("  Trained:   Language model (~13B params)")
    print("  Trained:   Projection layer (~8M params)")
    print("  Data:      ~665K visual instruction-following examples")
    print("  Goal:      Follow instructions, reason about images")
    print("  Time:      ~20 hours on 8x A100 (160 GPU-hours)")
    
    print("\n  Total: ~1 day wall-clock on a single 8x A100 node")
    print("  Total: ~208 GPU-hours")
    print("  Compare to pre-training LLaMA 3 8B: ~1.3M GPU-hours")
    print("  Visual instruction tuning is ~6,250x cheaper than pre-training.")

training_stages()

This two-stage approach is why vision-language models have proliferated so rapidly. You do not need to train a model from scratch. You take an existing pretrained vision encoder (CLIP, SigLIP) and an existing pretrained language model (LLaMA, Mistral, Qwen), connect them with a small projection layer, and fine-tune. LLaVA-1.5’s 13B model completes both stages in roughly one day on a single 8-GPU node, using only publicly available data. The total training cost is a tiny fraction of what it costs to pre-train either component.

Source: LLaVA-1.5 training time: “~6 hours of pretraining and ~20 hours of visual instruction tuning, using 8x A100s” for the 13B model, with 558K pretrain and 665K finetune samples (confirmed from Liu et al., “Improved Baselines with Visual Instruction Tuning,” arXiv:2310.03744, Section 3.3 “Computational cost” and Table 7). LLaVA-1.5 achieves SoTA on 11 benchmarks using only publicly available data (confirmed from llava-vl.github.io and arxiv.org/abs/2310.03744).

For early fusion models like Gemini, the training process is fundamentally different. The entire model is trained from scratch on multimodal data, which means the vision and language capabilities are learned jointly. This is much more expensive but can produce deeper cross-modal understanding.


The Evolution of Vision in LLMs: A Timeline

YearMilestoneApproach
2020ViT (Dosovitskiy et al., arXiv:2010.11929)Patches as tokens, Transformer encoder
2021CLIP (Radford et al., arXiv:2103.00020)Contrastive image-text alignment
2022Flamingo (Alayrac et al., arXiv:2204.14198)Cross-attention, Perceiver Resampler
2023LLaVA (Liu et al., arXiv:2304.08485)Simple projection, visual instruction tuning
2023GPT-4V (OpenAI, September 25, 2023)Closed-source, tiling system
2023SigLIP (Zhai et al., arXiv:2303.15343)Improved contrastive learning
2023Gemini 1.0 (Google, December 2023)Native multimodal, early fusion
2024Claude 3 (Anthropic, March 4, 2024)Vision via same reasoning architecture
2024GPT-4o (OpenAI, May 2024)Unified omni-modal architecture
2024LLaMA 3.2 Vision (Meta, September 2024)Cross-attention with frozen text model
2025Qwen2.5-VL (Alibaba, January 2025)Dynamic resolution, native-scale ViT
2025SigLIP 2 (Google, February 2025)Enhanced contrastive + self-supervised
2025LLaMA 4 Maverick (Meta, April 2025)MetaCLIP + projection, natively multimodal
2025Gemini 3 Pro (Google, November 18, 2025)Early fusion, 1M context
2025Qwen3-VL (Alibaba, November 26, 2025)Dense + MoE variants, 256K context, DeepStack
2025Gemini 3 Flash (Google, December 17, 2025)Early fusion, optimized for speed
2026Qwen 3.5 (Alibaba, February 16, 2026)Early fusion, 397B MoE, native VLM
2026Qwen 3.5 medium (Alibaba, February 24, 2026)27B dense, 35B-A3B, 122B-A10B, early fusion
2026Qwen 3.5 small (Alibaba, March 2, 2026)0.8B to 9B, native multimodal in 4B/9B
2026GPT-5.4 (OpenAI, March 5, 2026)Full-resolution vision, unified architecture
2026Mistral Small 4 (Mistral, March 16, 2026)MoE (119B/~6B active), multimodal, 256K context
2026GPT-5.4 mini/nano (OpenAI, March 17, 2026)Multimodal mini (400K ctx, 72.1% OSWorld) and nano (API-only)

The trend is clear: vision capabilities have moved from specialized research models to a standard feature of every frontier language model. The architectural approaches have converged toward either projection-based methods (for open models that want to leverage existing pretrained components), dynamic resolution approaches (for models prioritizing document and fine-detail understanding), or early fusion (for closed models with the resources to train from scratch, and increasingly for open models like Qwen 3.5). By March 2026, every major model family (GPT, Claude, Gemini, LLaMA, Qwen, Mistral) supports image input as a baseline capability, and the newest open-source models (Qwen 3.5, Mistral Small 4) ship with native multimodal support from day one. Vision is also pushing down to smaller models: Qwen3.5-4B and 9B include native multimodal support, GPT-5.4 mini (released March 17, 2026) brings multimodal understanding to a faster, cheaper model with a 400K-token context window and 72.1% OSWorld-Verified (nearly matching the flagship’s 75%), and even GPT-5.4 nano scores 66.1% on MMMU-Pro at just $0.20 per million input tokens.


Key Takeaways

  • Vision Transformers (ViT) convert images into sequences of patch embeddings by dividing the image into a grid of patches (typically 14 x 14 or 16 x 16 pixels each), flattening each patch, and projecting it into a vector. A 224 x 224 image with 16 x 16 patches produces 196 tokens. This is the foundation of how all modern vision-language models process images.

  • CLIP (Radford et al., arXiv:2103.00020, 2021) and its successor SigLIP (Zhai et al., arXiv:2303.15343, 2023) train vision encoders whose output vectors are aligned with text in a shared embedding space. This alignment is what allows a language model to “understand” image content: the visual representations are in the same semantic space as words.

  • There are three main architectures for connecting vision encoders to language models: (1) projection-based (LLaVA, LLaMA 4), where image tokens are projected into the LLM’s embedding space and concatenated with text tokens; (2) cross-attention (Flamingo, LLaMA 3.2), where dedicated cross-attention layers allow the LLM to attend to visual features without consuming context window space; and (3) early fusion (Gemini, GPT-4o), where the model is trained from scratch on interleaved multimodal data.

  • LLaMA 4 Maverick uses a 34-layer MetaCLIP-based vision encoder (hidden_size=768, 16 heads, vision_output_dim=7,680) with a two-stage projector that maps visual features to the language model’s 4,096-dimensional embedding space. LLaMA 3.2 Vision uses cross-attention layers inserted into the text model. GPT-4o uses a tiling system (170 tokens per 512 x 512 tile plus 85 base tokens). GPT-5.4 introduced full-resolution vision processing.

  • On the MMMU benchmark (multimodal understanding across 30 subjects), the top models as of March 2026 are Gemini 3 Flash (87.63%), Gemini 3 Pro (87.51%), and GPT-5.2 (86.67%). These scores surpass the lower bound of human expert performance (76.2%) but remain below the upper bound (88.6%). On the harder MMMU-Pro variant, GPT-5.4 scores 81.2% and GPT-5.4 mini scores 76.6%, confirming that vision capabilities scale across model sizes within the same family.

  • Vision-language models have systematic limitations in spatial reasoning (CLIP: 49% on directional relationships, near random chance; four VLMs averaged only 58.07% on basic visual tasks per arXiv:2407.06581v6), counting (persistent hallucinations, best augmented accuracy 81.3% per GroundCount, arXiv:2603.10978), and fine-grained detail (due to patch-based downscaling). These are not edge cases but fundamental consequences of how images are tokenized into patches. Even slow-thinking reasoning models fail on these tasks when shapes overlap or are close together (arXiv:2407.06581v6).

  • Training a vision-language model is dramatically cheaper than pre-training a language model from scratch. LLaVA-1.5’s two-stage approach (alignment pre-training + visual instruction tuning) completes in roughly one day on a single 8x A100 node (~208 GPU-hours), compared to over 1 million GPU-hours for pre-training the underlying language model. That is roughly 6,250x cheaper. This is why vision capabilities have proliferated so rapidly across open-source models.

  • Images are expensive in tokens. A single high-resolution image in GPT-4o costs 765 to 1,105 tokens; in Claude, a 1,000 x 1,000 pixel image costs roughly 1,333 tokens (calculated as width x height / 750). Multi-image conversations consume context window space quickly, which is one reason cross-attention architectures (where image tokens do not consume the main context) are attractive for image-heavy applications.

  • Dynamic resolution is an emerging trend. Qwen2.5-VL (Alibaba, January 2025) processes images at their native resolution instead of forcing them into a fixed size, using a native dynamic-resolution ViT with 2D rotary position embeddings. This preserves fine detail for document understanding and OCR tasks without the information loss from downscaling.

  • Early fusion is spreading to open-source models. Qwen 3.5 (Alibaba, February 16, 2026) trains on text and image tokens together from the beginning, rather than attaching a separate vision encoder. Mistral Small 4 (March 16, 2026) ships with native multimodal support in a 119B MoE model with only ~6B active parameters. Vision is no longer an add-on; it is a default capability.

  • Vision is pushing to smaller models. Alibaba’s Qwen3.5-4B and 9B (March 2, 2026) include native multimodal support with early fusion, bringing vision capabilities to models that run on consumer hardware. OpenAI’s GPT-5.4 mini (March 17, 2026) brings multimodal understanding to a faster, cheaper model with a 400K-token context window and 72.1% OSWorld-Verified (nearly matching the flagship’s 75% and exceeding the 72.4% human baseline). Even GPT-5.4 nano, priced at just $0.20 per million input tokens, scores 66.1% on MMMU-Pro. The era of vision as a premium feature reserved for frontier models is ending.


What’s Next

You now understand how language models process images: the vision encoders that convert pixels into patch tokens, the architectural approaches that connect visual and textual representations, and the limitations that remain. But vision is just one modality. In Chapter 22, we will explore native multimodal models that go beyond bolting a vision encoder onto a text model, examining unified architectures that process text, images, audio, and video from scratch, the emerging capability of generating images directly within language models, and how video understanding extends the patch-based approach into the temporal dimension.