Chapter 24. Serving Infrastructure, From GPU to API Response

Part 8. The Ecosystem, How It All Connects

A model that exists only as a file of weights on disk is useless. The infrastructure that loads those weights onto hardware, processes incoming requests, and returns responses at scale is what turns a trained model into a product. This chapter explains exactly how that infrastructure works: the hardware that runs frontier models, the parallelism strategies that split one model across dozens of GPUs, the batching and scheduling techniques that serve millions of users simultaneously, the quantization methods that shrink models to fit smaller hardware, and the economics that determine what inference actually costs.

Why Serving Is the Hard Part

Training a model is expensive, but it happens once (or a few times). Serving that model to users happens continuously, millions of times per day, for months or years. For most AI companies, inference costs dwarf training costs over the lifetime of a model.

The challenge is that large language models are enormous. A 70-billion-parameter model stored in 16-bit precision (2 bytes per parameter) requires 140 GB just for the weights. That exceeds the memory of any single GPU. A frontier model like GPT-5.4 or Claude Opus 4.6, with hundreds of billions or trillions of parameters, requires an entire cluster of GPUs working together to serve a single request.

On top of the model weights, you need memory for the KV cache (the stored attention keys and values from Chapter 20 that grow with every token generated), memory for intermediate computations (activations), and memory for the serving framework itself. The total memory footprint during inference is significantly larger than the model weights alone.

And all of this must happen fast. Users expect responses to start appearing within a second or two. A voice assistant like GPT-4o needs to respond in under 320 milliseconds. Serving infrastructure must balance three competing demands: low latency (fast responses), high throughput (many simultaneous users), and low cost (affordable per-token pricing).

The Hardware: What Runs Frontier Models

Three families of hardware dominate LLM serving in March 2026: NVIDIA GPUs, Google TPUs, and (to a lesser extent) AMD GPUs. Each has different strengths.

NVIDIA H100: The Workhorse

The NVIDIA H100, built on the Hopper architecture and launched in late 2022, has been the dominant GPU for LLM inference for the past three years. Its key specifications:

Memory: 80 GB HBM3 (High Bandwidth Memory)
Memory bandwidth: 3.35 TB/s (SXM variant)
FP8 Tensor Core performance: 1,979 TFLOPS (trillions of floating-point operations per second)
FP16/BF16 Tensor Core performance: 989 TFLOPS (dense); 1,979 TFLOPS with 2:4 structured sparsity
NVLink bandwidth: 900 GB/s bidirectional (for GPU-to-GPU communication within a node)
Power: 700W TDP
Cloud pricing: approximately $1.25 to $3.90 per GPU-hour depending on provider (as of March 2026, after AWS cut prices 44% in June 2025)

The H100 introduced the Transformer Engine, which automatically switches between FP8 and FP16 precision during computation to maximize throughput without sacrificing accuracy. This was specifically designed for Transformer-based models. It also supports Multi-Instance GPU (MIG), which lets you partition a single H100 into up to seven isolated instances for serving smaller models.

The H100 can hold a 70B-parameter model in FP16 on two GPUs (140 GB of weights across 160 GB of total memory), or a quantized version on a single GPU. For frontier models, you need 8 or more H100s working together.

Source: H100 SXM specifications: 80 GB HBM3, 3.35 TB/s bandwidth, FP8 dense 1,979 TFLOPS, BF16 dense 989 TFLOPS, 700W TDP, NVLink 900 GB/s (confirmed from glennklockwood.com/garden/processors/H100, gmicloud.ai GPU comparison table, rightnowai.co, gpus.io). Cloud pricing $1.25-$3.90/hr after AWS price cuts (confirmed from getdeploying.com/reference/cloud-gpu showing $1.25 lowest across 37+ providers, intuitionlabs.ai, introl.com, datacenterdynamics.com).

NVIDIA B200: The New Frontier

The NVIDIA B200, built on the Blackwell architecture, began shipping to cloud providers in late 2025 and represents a generational leap:

Memory: 192 GB HBM3e
Memory bandwidth: 8 TB/s
FP8 dense Tensor Core performance: 4,500 TFLOPS (with 2:4 structured sparsity: 9,000 TFLOPS)
FP4 dense Tensor Core performance: 9,000 TFLOPS (with sparsity: 18,000 TFLOPS)
BF16 dense Tensor Core performance: 2,250 TFLOPS (with sparsity: 4,500 TFLOPS)
NVLink bandwidth: 1,800 GB/s (2x the H100)
Power: 1,000W TDP (HGX variant; the GB200 variant runs at 1,200W with higher clocks)
Cloud pricing: approximately $2.25 to $6.50 per GPU-hour (as of March 2026)

The B200 delivers roughly 4x the training performance and up to 30x the inference performance compared to the H100 for LLM workloads, according to NVIDIA. The 192 GB of memory means a 70B-parameter model fits on a single B200 in FP16 with room to spare for the KV cache. The native FP4 support reduces model weight memory by 4x compared to FP16, enabling even larger models or larger batch sizes on the same hardware.

The B200 also introduces a second-generation Transformer Engine with improved mixed-precision handling. For inference specifically, the combination of more memory (2.4x the H100), more bandwidth (2.4x), and more compute (2.3x in BF16 dense, or 2.3x in FP8 dense) means the B200 can serve the same model to significantly more users simultaneously. The real inference gains are even larger because the B200’s native FP4 and FP8 support lets you run at lower precision than the H100 typically uses, and structured sparsity can double effective throughput for models that support it.

Source: B200 HGX specifications (1,000W variant): 192 GB HBM3e (8 stacks × 24 GB; ~180-186 GB usable after ECC overhead in production systems), 8 TB/s bandwidth, BF16 dense 2,250 TFLOPS, FP8 dense 4,500 TFLOPS, FP4 dense 9,000 TFLOPS, 1,800 GB/s NVLink, 1,000W TDP (confirmed from glennklockwood.com/garden/processors/B200 detailed performance tables, techpowerup.com, rightnowai.co). The GB200 variant at 1,200W achieves BF16 2,500 TFLOPS, FP8 5,000 TFLOPS, FP4 10,000 TFLOPS (confirmed from glennklockwood.com). Cloud pricing $2.25-$6.50/hr (confirmed from getdeploying.com showing $2.25/hr lowest across 21 providers, lyceum.technology, nodepedia.com). 4x training, 30x inference vs H100 (confirmed from clarifai.com, nvidia.com DGX B200 page).

NVIDIA B300 (Blackwell Ultra): The Latest

The NVIDIA B300, also known as Blackwell Ultra, shipped in January 2026. It increases HBM capacity from 192 GB to 288 GB of HBM3e (using 12-high stacks instead of 8-high), with the same 8 TB/s bandwidth as the B200. The key improvement is compute: 15 PFLOPS of dense FP4 performance per chip (1.5x the B200’s 9,000 TFLOPS FP4) and 2x the attention performance of the B200, according to NVIDIA. The B300 is designed specifically for the long-context, reasoning-heavy workloads that dominate inference in 2026. Cloud pricing for the B300 ranges from approximately $5.65 to $18.00 per GPU-hour as of March 2026, reflecting its premium positioning. The extra 96 GB of HBM (288 GB vs. 192 GB) is particularly valuable for serving models with very large KV caches at long context lengths.

Source: B300 (Blackwell Ultra): 288 GB HBM3e, 8 TB/s bandwidth, 15 PFLOPS dense FP4, 1.5x dense FP4 and 2x attention performance over B200, shipped January 2026 (confirmed from spheron.network, novacorein.com, nvidia.com/en-us/data-center/dgx-b300, server-parts.eu, techspot.com). Cloud pricing $5.65-$18.00/hr (confirmed from getdeploying.com/reference/cloud-gpu showing Verda, Enverge, DigitalOcean as providers).

Google TPUs: The Vertical Alternative

Google’s Tensor Processing Units (TPUs) take a fundamentally different approach. Instead of selling individual chips, Google designs TPUs as components of a vertically integrated system where chips, interconnects, and software are co-designed.

TPU v5p (announced December 2023):

BF16 performance: 459 TFLOPS per chip
INT8 performance: 918 TOPS per chip
HBM capacity: 95 GB per chip
HBM bandwidth: 2.76 TB/s per chip
Pod size: up to 8,960 chips interconnected via high-bandwidth ICI (Inter-Chip Interconnect)

TPU v6e (Trillium) (announced May 2024, generally available late 2024):

BF16 performance: 918 TFLOPS per chip (4.7x over TPU v5e)
INT8 performance: 1,836 TOPS per chip
HBM capacity: 32 GB per chip
HBM bandwidth: 1.6 TB/s per chip (doubled from v5e)
Pod size: up to 256 chips per pod, delivering 235 PFLOPS aggregate
Energy efficiency: 67% more efficient than TPU v5e

TPU v7 (Ironwood) (announced April 2025, in preview on Google Cloud as of March 2026):

BF16 performance: 2,307 TFLOPS per chip (5x over TPU v5p)
FP8 performance: 4,614 TFLOPS per chip
HBM capacity: 192 GB HBM3e per chip (matching the B200)
HBM bandwidth: 7.38 TB/s per chip
ICI bandwidth: 1,200 GB/s bidirectional per chip
Pod size: up to 9,216 chips per pod, delivering 42.5 ExaFLOPS aggregate (FP8)
Dual-chiplet architecture: each chip exposes two devices (2 TensorCores, 4 SparseCores)

Ironwood represents a generational leap for Google’s TPU line. At 4,614 FP8 TFLOPS per chip, it slightly exceeds the B200 HGX’s 4,500 FP8 TFLOPS while consuming roughly 600W (compared to the B200’s 1,000W). The 192 GB of HBM3e per chip matches the B200 exactly, and the 7.38 TB/s bandwidth is close to the B200’s 8 TB/s. The real differentiator is pod scale: an Ironwood pod with 9,216 chips delivers 42.5 ExaFLOPS of FP8 compute and 1.77 PB of aggregate HBM, far exceeding what any single GPU cluster can achieve. Ironwood is currently in preview and requires contacting Google Cloud’s account team for access.

The key advantage of TPUs is not just the per-chip specs but the interconnect. TPU pods connect chips via a custom high-bandwidth mesh network (ICI) that enables efficient communication across thousands of chips. This makes TPUs particularly strong for serving very large models that must be split across many chips.

Google uses TPUs internally to serve Gemini, and they are available to external customers through Google Cloud. The per-chip comparison with NVIDIA GPUs can be misleading; the better comparison is between a TPU pod and an equivalent GPU cluster, where the TPU’s tighter integration often wins on total cost of ownership for large-scale serving.

Source: TPU v5p: 459 TFLOPS BF16, 918 TOPS INT8, 95 GB HBM, 2.76 TB/s bandwidth, up to 8,960 chips per pod (confirmed from cloud.google.com/tpu/docs/v5p, theregister.com, tweaktown.com). TPU v6e (Trillium): 918 TFLOPS BF16, 1,836 TOPS INT8, 32 GB HBM, 1.6 TB/s bandwidth, 4.7x compute over v5e, 67% more energy efficient, 256 chips per pod, 235 PFLOPS aggregate (confirmed from cosmo-edge.com, cloud.google.com/tpu/docs/v6e, heise.de, cloud.google.com/blog introducing-trillium). TPU v7 (Ironwood): 2,307 TFLOPS BF16, 4,614 TFLOPS FP8, 192 GB HBM3e, 7.38 TB/s bandwidth, 1,200 GB/s ICI, up to 9,216 chips per pod, 42.5 ExaFLOPS FP8 aggregate, dual-chiplet architecture, ~600W per chip (confirmed from cloud.google.com/tpu/docs/tpu7x official specification table, theregister.com, theoutpost.ai, ctol.digital, cloud.google.com/blog/products/compute/ironwood-tpus-and-new-axion-based-vms-for-your-ai-workloads).

Hardware Comparison at a Glance

def gpu_comparison_table():
    """
    Compare the key specifications of hardware used for LLM serving in 2026.
    """
    print("LLM Serving Hardware Comparison (March 2026)")
    print("=" * 100)
    
    headers = ["Spec", "H100 SXM", "B200 (HGX)", "TPU v6e", "TPU v7 (Ironwood)"]
    rows = [
        ("Memory (GB)", "80", "192", "32", "192"),
        ("Mem BW (TB/s)", "3.35", "8.0", "1.6", "7.38"),
        ("BF16 dense TFLOPS", "989", "2,250", "918", "2,307"),
        ("FP8 dense TFLOPS", "1,979", "4,500", "918*", "4,614"),
        ("Interconnect", "NVLink 900GB/s", "NVLink 1.8TB/s", "ICI 800GB/s", "ICI 1.2TB/s"),
        ("Power (W)", "700", "1,000", "N/A (pod)", "~600"),
        ("Cloud $/hr", "$1.25-3.90", "$2.25-6.50", "GCP only", "GCP (preview)"),
        ("Best for", "General LLM", "Frontier LLM", "Cost-efficient", "Large-scale inference"),
    ]
    
    col_widths = [20, 16, 16, 12, 20]
    header_line = ""
    for h, w in zip(headers, col_widths):
        header_line += f"{h:<{w}}"
    print(header_line)
    print("-" * 90)
    
    for row in rows:
        line = ""
        for val, w in zip(row, col_widths):
            line += f"{val:<{w}}"
        print(line)
    
    print("\n  Note: TPU pricing is bundled into Google Cloud; direct per-chip")
    print("  comparison with NVIDIA GPUs is not straightforward.")
    print("  *TPU v6e FP8 figure is INT8 TOPS (918), not native FP8.")
    print("  TPU v7 (Ironwood) is in preview as of March 2026.")
    print("  B200 with 2:4 structured sparsity doubles the dense TFLOPS shown.")
    print("  Many sources cite sparsity numbers; this table shows dense (worst-case).")

gpu_comparison_table()

The Two Phases of Inference

Before we discuss how to split models across GPUs, you need to understand what happens when a model processes a request. LLM inference has two distinct phases with very different computational characteristics.

Phase 1: Prefill (Processing the Input)

When a request arrives, the model must process the entire input prompt at once. This is the prefill phase. Every token in the prompt is processed in parallel through all the Transformer layers. The model computes the attention keys and values for every input token and stores them in the KV cache for later use.

The prefill phase is compute-bound: the bottleneck is how fast the GPU can perform matrix multiplications. The input might be thousands of tokens, and each token requires a full forward pass through the model’s attention and feed-forward layers. The GPU’s Tensor Cores are fully utilized during this phase.

The time spent in prefill determines the Time to First Token (TTFT): how long the user waits before seeing the first word of the response. For a short prompt, TTFT might be 100-200 milliseconds. For a long prompt (say, 100,000 tokens), TTFT can be several seconds because the model must process all those tokens before generating anything.

Phase 2: Decode (Generating the Output)

After prefill, the model generates output tokens one at a time. Each new token requires a forward pass through the entire model, but this time only a single token is being processed (the most recently generated one). The model reads the stored KV cache to compute attention over all previous tokens, generates one new token, appends its key and value to the KV cache, and repeats.

The decode phase is memory-bandwidth-bound: the bottleneck is how fast the GPU can read the model weights and KV cache from memory. Each decode step processes only one token, so the matrix multiplications are tiny (a vector times a matrix, not a matrix times a matrix). The GPU’s compute cores are mostly idle, waiting for data to arrive from memory.

This is why memory bandwidth matters so much for inference. The B200’s 8 TB/s bandwidth (vs. the H100’s 3.35 TB/s) translates almost directly into faster token generation during the decode phase.

def inference_phases_demo():
    """
    Illustrate the two phases of LLM inference and their characteristics.
    """
    print("The Two Phases of LLM Inference")
    print("=" * 70)
    
    print("\n  Phase 1: PREFILL (processing the input prompt)")
    print("  " + "-" * 55)
    print("  Input:     [all prompt tokens processed in parallel]")
    print("  Bottleneck: Compute (matrix multiplications)")
    print("  GPU usage:  Tensor Cores fully utilized")
    print("  Metric:     Time to First Token (TTFT)")
    print("  Example:    1,000-token prompt -> ~150ms on H100")
    print("              100,000-token prompt -> ~3-5s on H100")
    
    print("\n  Phase 2: DECODE (generating output tokens)")
    print("  " + "-" * 55)
    print("  Input:     [one new token at a time]")
    print("  Bottleneck: Memory bandwidth (reading weights + KV cache)")
    print("  GPU usage:  Compute cores mostly idle, waiting for data")
    print("  Metric:     Time Per Output Token (TPOT)")
    print("  Example:    ~10-30ms per token on H100 (30-100 tokens/sec)")
    
    print("\n  Key insight: Prefill is compute-bound, decode is memory-bound.")
    print("  Different optimizations target each phase.")

inference_phases_demo()

The Arithmetic Intensity Gap

The difference between the two phases can be quantified using arithmetic intensity: the ratio of compute operations to bytes of data moved. High arithmetic intensity means the workload is compute-bound; low arithmetic intensity means it is memory-bound.

During prefill with a large batch, arithmetic intensity is high: you are performing large matrix multiplications where many operations share the same weight data. The GPU’s compute capacity is the limiting factor.

During decode with a single request, arithmetic intensity is extremely low: for each token, you must read the entire model’s weights from memory (hundreds of gigabytes) to perform a relatively small computation. The GPU spends most of its time waiting for data to arrive from HBM.

This is why the naive approach of “just buy a faster GPU” does not always help for inference. If your workload is memory-bandwidth-bound (as decode usually is), a GPU with 2x more compute but the same bandwidth will not generate tokens any faster. You need more bandwidth, which is exactly what the B200 provides (8 TB/s vs. 3.35 TB/s for the H100).

Model Parallelism: Splitting One Model Across Multiple GPUs

When a model is too large to fit on a single GPU, you must split it across multiple GPUs. There are two primary strategies, and most production systems use both simultaneously.

Tensor Parallelism: Splitting Each Layer

Tensor parallelism (TP) splits the matrix operations within each Transformer layer across multiple GPUs. Instead of one GPU performing a full matrix multiplication, the matrix is divided into chunks, and each GPU computes its chunk in parallel. The partial results are then combined via an all-reduce communication step.

For example, consider a feed-forward layer with a weight matrix of shape [8192, 32768]. With tensor parallelism across 4 GPUs:

GPU 0 gets columns [0:8192] of the weight matrix
GPU 1 gets columns [8192:16384]
GPU 2 gets columns [16384:24576]
GPU 3 gets columns [24576:32768]

Each GPU multiplies the input by its chunk of the weight matrix, producing a partial result. The GPUs then communicate to combine these partial results (via all-reduce) before proceeding to the next operation.

def tensor_parallelism_demo():
    """
    Demonstrate how tensor parallelism splits a matrix multiplication
    across multiple GPUs.
    """
    import numpy as np
    
    # Simulated weight matrix: hidden_dim x ffn_dim
    hidden_dim = 8192
    ffn_dim = 32768
    num_gpus = 4
    chunk_size = ffn_dim // num_gpus  # 8192 columns per GPU
    
    print("Tensor Parallelism: Splitting a Feed-Forward Layer")
    print("=" * 60)
    print(f"\n  Full weight matrix: [{hidden_dim} x {ffn_dim}]")
    print(f"  Split across {num_gpus} GPUs: each gets [{hidden_dim} x {chunk_size}]")
    print(f"\n  Memory per GPU for this layer:")
    
    full_size_gb = (hidden_dim * ffn_dim * 2) / (1024**3)  # FP16
    per_gpu_gb = full_size_gb / num_gpus
    print(f"    Full layer (FP16):  {full_size_gb:.2f} GB")
    print(f"    Per GPU (FP16):     {per_gpu_gb:.2f} GB")
    
    print(f"\n  Computation flow:")
    print(f"    1. Input vector x (shape [{hidden_dim}]) broadcast to all GPUs")
    print(f"    2. Each GPU computes: partial = x @ W_chunk  (shape [{chunk_size}])")
    print(f"    3. All-reduce: combine partial results across GPUs")
    print(f"    4. Result: full output vector (shape [{ffn_dim}])")
    
    print(f"\n  Requires high-bandwidth GPU interconnect (NVLink).")
    print(f"  H100 NVLink: 900 GB/s | B200 NVLink: 1,800 GB/s")

tensor_parallelism_demo()

Tensor parallelism requires extremely fast communication between GPUs because the all-reduce step happens after every layer operation. This is why NVLink matters: at 900 GB/s (H100) or 1,800 GB/s (B200), NVLink is 14-28x faster than PCIe Gen5 (64 GB/s). Tensor parallelism across GPUs connected only by PCIe would be impractically slow.

In practice, tensor parallelism is used within a single server node (typically 8 GPUs connected via NVLink). The degree of tensor parallelism (TP=2, TP=4, TP=8) depends on the model size and the available GPU memory.

Pipeline Parallelism: Splitting by Layers

Pipeline parallelism (PP) takes a different approach: instead of splitting each layer across GPUs, it assigns entire layers to different GPUs. If a model has 80 Transformer layers and you have 4 GPUs:

GPU 0 runs layers 1-20
GPU 1 runs layers 21-40
GPU 2 runs layers 41-60
GPU 3 runs layers 61-80

Data flows through the GPUs sequentially, like an assembly line. GPU 0 processes the input through its layers, sends the result to GPU 1, which processes through its layers, and so on.

def pipeline_parallelism_demo():
    """
    Demonstrate how pipeline parallelism assigns layers to GPUs.
    """
    print("Pipeline Parallelism: Splitting by Layers")
    print("=" * 60)
    
    total_layers = 80
    num_gpus = 4
    layers_per_gpu = total_layers // num_gpus
    
    print(f"\n  Model: {total_layers} Transformer layers")
    print(f"  GPUs:  {num_gpus}")
    print(f"  Layers per GPU: {layers_per_gpu}")
    
    print(f"\n  Assignment:")
    for i in range(num_gpus):
        start = i * layers_per_gpu + 1
        end = (i + 1) * layers_per_gpu
        print(f"    GPU {i}: Layers {start:>2}-{end:>2}")
    
    print(f"\n  Data flow:")
    print(f"    Input -> GPU 0 (layers 1-20)")
    print(f"         -> GPU 1 (layers 21-40)")
    print(f"         -> GPU 2 (layers 41-60)")
    print(f"         -> GPU 3 (layers 61-80) -> Output")
    
    print(f"\n  Communication: only between adjacent GPUs (point-to-point)")
    print(f"  Works across nodes (servers) connected by InfiniBand")
    print(f"  Lower bandwidth requirement than tensor parallelism")

pipeline_parallelism_demo()

Pipeline parallelism has a lower communication requirement than tensor parallelism: data only flows between adjacent GPUs (point-to-point), not between all GPUs (all-reduce). This makes it suitable for splitting across server nodes connected by InfiniBand networking (typically 400 Gb/s or 800 Gb/s), which is much slower than NVLink but sufficient for the smaller data transfers pipeline parallelism requires.

The downside is pipeline bubbles: when GPU 0 finishes processing a request and sends the result to GPU 1, GPU 0 sits idle until the next request arrives. With a single request, only one GPU is active at a time. The solution is to overlap multiple requests: while GPU 1 processes request A through layers 21-40, GPU 0 can start processing request B through layers 1-20. This is called micro-batching, and it is essential for making pipeline parallelism efficient.

Expert Parallelism: For MoE Models

Mixture-of-Experts (MoE) models like LLaMA 4 Maverick (400B total parameters, 128 experts, 17B active per token) introduce a third type of parallelism: expert parallelism (EP). Each expert is placed on a different GPU (or group of GPUs). When a token is routed to a specific expert, the computation happens on the GPU that holds that expert.

Expert parallelism is particularly efficient because only a small fraction of experts are active for any given token. LLaMA 4 Maverick activates only 2 experts per token (1 shared + 1 routed) out of 128 total. This means the active computation per token is only 17B parameters, even though the full model has 400B parameters. The total model weights (~800 GB in FP16) must be distributed across GPUs, but the per-token compute cost is equivalent to a much smaller dense model.

Source: LLaMA 4 Maverick: 400B total parameters, 17B active per token, 128 experts, FP16 model size ~800 GB, INT4 quantized ~200 GB (confirmed from spheron.network, neuralstackly.com, apxml.com, dasroot.net). Inference cost approximately $0.15-$0.27 per million input tokens depending on provider (confirmed from groq.com/pricing, inworld.ai, aicostcheck.com).

Combining Parallelism Strategies

Production serving systems typically combine all three strategies. A common configuration for serving a large model on a cluster of 8-GPU servers:

Tensor parallelism (TP=8) within each server: split each layer across all 8 GPUs connected by NVLink
Pipeline parallelism (PP=2-4) across servers: split the model’s layers across 2-4 servers connected by InfiniBand
Expert parallelism for MoE models: distribute experts across GPUs, with routing handled by the serving framework

def combined_parallelism():
    """
    Show how parallelism strategies combine in a real deployment.
    """
    print("Combined Parallelism: Real-World Deployment Example")
    print("=" * 65)
    
    print("\n  Scenario: Serving a ~400B MoE model (e.g., LLaMA 4 Maverick)")
    print("  Hardware: 4 servers, each with 8x H100 GPUs (32 GPUs total)")
    
    print("\n  Parallelism strategy:")
    print("    Tensor Parallelism (TP=8):  Each layer split across 8 GPUs")
    print("                                within one server (via NVLink)")
    print("    Pipeline Parallelism (PP=4): Layers distributed across 4 servers")
    print("                                (via InfiniBand)")
    print("    Expert Parallelism:          128 experts spread across 32 GPUs")
    print("                                (~4 experts per GPU)")
    
    print("\n  Memory budget per GPU (H100, 80 GB):")
    print("    Model weights (FP16): ~800 GB / 32 GPUs = ~25 GB per GPU")
    print("    KV cache:             ~20-40 GB (depends on batch size)")
    print("    Activations + overhead: ~10-15 GB")
    print("    Total:                ~55-80 GB per GPU")
    
    print("\n  With B200 (192 GB per GPU), the same model could be served")
    print("  on fewer GPUs, reducing cost and communication overhead.")

combined_parallelism()

Batching: Serving Many Users at Once

A single decode step for one user barely utilizes the GPU’s compute capacity. The model weights must be read from memory regardless of whether you are generating one token for one user or one token for 100 users. Batching multiple requests together amortizes the cost of reading model weights across many users, dramatically improving throughput.

Static Batching: The Naive Approach

The simplest approach is static batching: collect a fixed number of requests, process them together as a batch, and return all results when the longest request finishes. This is how early serving systems worked.

The problem is obvious: if one request in a batch generates 500 tokens and another generates 10 tokens, the short request must wait for the long request to finish before its result is returned. GPU utilization drops as requests complete at different times, because the batch cannot accept new requests until all current requests are done.

Continuous Batching: The Modern Approach

Continuous batching (also called iteration-level scheduling), introduced by the Orca system in 2022, solves this problem. Instead of treating a batch as a fixed group, continuous batching treats each decode iteration as an opportunity to adjust the batch. When a request finishes generating (it produces an end-of-sequence token), it is immediately removed from the batch, and a waiting request takes its place.

This means the GPU is always working on a full batch. Short requests exit quickly and are replaced by new requests. Long requests continue generating without blocking others. The result is dramatically higher throughput and lower average latency.

def batching_comparison():
    """
    Compare static batching vs continuous batching.
    """
    print("Static vs Continuous Batching")
    print("=" * 65)
    
    print("\n  Static Batching:")
    print("  " + "-" * 50)
    print("  Time ->  [====Request A (500 tokens)================]")
    print("           [==Req B (10)==                   IDLE      ]")
    print("           [=====Req C (200)======           IDLE      ]")
    print("           [===Req D (50)===                 IDLE      ]")
    print("  Problem: B, C, D finish early but must wait for A.")
    print("           GPU sits idle. No new requests until A finishes.")
    
    print("\n  Continuous Batching:")
    print("  " + "-" * 50)
    print("  Time ->  [====Request A (500 tokens)================]")
    print("           [==B==][==E==][===F===][==G==][===H===][=I=]")
    print("           [=====C======][====J=====][====K=====]")
    print("           [===D===][=====L======][===M===][==N==]")
    print("  Solution: As B finishes, E joins. As D finishes, L joins.")
    print("            GPU always has a full batch. Much higher throughput.")

batching_comparison()

Every major serving framework in 2026 uses continuous batching. It is the single most important optimization for serving throughput.

Source: Orca introduced iteration-level scheduling (continuous batching) for LLM serving, scheduling at the granularity of individual iterations rather than entire requests (confirmed from friendli.ai/research/orca, insujang.github.io, medium.com/@martiniglesiasgo). The original Orca paper (OSDI 2022) demonstrated up to 36.9x throughput improvement over FasterTransformer at the same latency level (confirmed from usenix.org/conference/osdi22/presentation/yu). In practice, continuous batching typically achieves 2-10x improvement depending on workload characteristics.

Prefill-Decode Disaggregation

A more advanced optimization separates the prefill and decode phases onto different hardware. Since prefill is compute-bound and decode is memory-bandwidth-bound, they benefit from different GPU configurations:

Prefill GPUs: Optimized for compute throughput. Can use fewer GPUs with high compute density.
Decode GPUs: Optimized for memory bandwidth. Benefit from more GPUs with high bandwidth.

Some production systems route incoming requests to prefill-specialized GPUs first, then transfer the computed KV cache to decode-specialized GPUs for token generation. This avoids the problem of prefill operations (which are compute-heavy) interfering with decode operations (which are latency-sensitive) on the same GPU.

Two research systems formalized this idea. DistServe (OSDI 2024, arXiv:2401.09670) demonstrated that disaggregating prefill and decode onto separate GPU pools achieves up to 4.48x higher goodput (useful throughput meeting latency targets) or 10.2x tighter SLO compliance compared to colocated systems like vLLM. Splitwise (ISCA 2024, arXiv:2311.18677), from Microsoft Research, showed that splitting phases across homogeneous or heterogeneous hardware achieves up to 1.4x higher throughput at 20% lower cost, or 2.35x more throughput under the same power and cost budgets.

This approach is moving from research into production. On March 13, 2026, AWS and Cerebras Systems announced a collaboration to build disaggregated inference infrastructure, pairing AWS Trainium3 chips (optimized for the compute-heavy prefill phase) with Cerebras Wafer-Scale Engines (optimized for the memory-bound decode phase). AWS also announced llm-d, an open-source disaggregated inference framework. The trend is clear: as inference workloads grow, treating prefill and decode as separate engineering problems with separate hardware is becoming the standard approach.

Source: DistServe: up to 4.48x goodput or 10.2x tighter SLO vs state-of-the-art, OSDI 2024 (confirmed from usenix.net/conference/osdi24/presentation/zhong-yinmin, arxiv.org/abs/2401.09670v2, hao-ai-lab.github.io). Splitwise: 1.4x throughput at 20% lower cost or 2.35x throughput same budget, ISCA 2024 (confirmed from microsoft.com/en-us/research/publication/splitwise, arxiv.org/abs/2311.18677). AWS/Cerebras disaggregated inference announced March 13, 2026 (confirmed from press.aboutamazon.com, aws.amazon.com/blogs/machine-learning/introducing-disaggregated-inference-on-aws-powered-by-llm-d, thenextgentechinsider.com).

The KV Cache: The Hidden Memory Hog

You learned about the KV cache in Chapter 20 (context windows). During inference, the model stores the attention keys and values for every token it has processed, so it does not need to recompute them when generating each new token. This cache grows linearly with sequence length and can consume enormous amounts of memory.

How Much Memory Does the KV Cache Use?

The KV cache size depends on the model architecture and the sequence length. The formula is:

KV cache size (bytes) = 2 × num_layers × num_kv_heads × head_dim × seq_length × bytes_per_element

The factor of 2 accounts for both keys and values. Let us calculate for a concrete example:

def kv_cache_calculator():
    """
    Calculate KV cache memory for different models and sequence lengths.
    """
    print("KV Cache Memory Calculator")
    print("=" * 70)
    
    models = [
        ("LLaMA 3.1 70B (GQA, 8 KV heads)", 80, 8, 128, "BF16"),
        ("LLaMA 4 Maverick (GQA, 8 KV heads)", 48, 8, 256, "BF16"),
        ("GPT-4 class (est. 96 layers, MHA)", 96, 96, 128, "BF16"),
    ]
    
    seq_lengths = [4096, 32768, 131072, 1000000]
    
    for name, layers, kv_heads, head_dim, dtype in models:
        bytes_per = 2 if dtype == "BF16" else 1  # BF16 = 2 bytes
        print(f"\n  {name}")
        print(f"  Layers: {layers}, KV heads: {kv_heads}, Head dim: {head_dim}")
        print(f"  {'Seq Length':>12}  {'KV Cache Size':>14}  {'Per-request':>14}")
        print(f"  {'-'*12}  {'-'*14}  {'-'*14}")
        
        for seq_len in seq_lengths:
            # 2 for K and V, bytes_per for precision
            cache_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per
            cache_gb = cache_bytes / (1024**3)
            cache_mb = cache_bytes / (1024**2)
            
            if cache_gb >= 1:
                size_str = f"{cache_gb:.1f} GB"
            else:
                size_str = f"{cache_mb:.0f} MB"
            
            print(f"  {seq_len:>12,}  {size_str:>14}  {'per request':>14}")
    
    print(f"\n  Key insight: With a batch of 32 requests at 32K tokens each,")
    print(f"  the KV cache for LLaMA 70B alone uses ~10 GB.")
    print(f"  At 1M tokens, a single request's KV cache is ~19.5 GB.")
    print(f"  This is why KV cache management is critical for serving.")

kv_cache_calculator()

Notice the impact of Grouped-Query Attention (GQA), which you learned about in earlier chapters. LLaMA 3.1 70B uses only 8 KV heads (instead of the full 64 attention heads), reducing the KV cache by 8x compared to standard Multi-Head Attention. This architectural choice was made specifically to reduce serving costs. Without GQA, the KV cache for a 70B model at 32K tokens would be ~80 GB per request instead of ~10 GB, making it impractical to serve multiple users simultaneously.

PagedAttention: Virtual Memory for the KV Cache

The KV cache creates a memory management problem. Different requests have different sequence lengths, and those lengths change as tokens are generated. Pre-allocating the maximum possible KV cache for every request wastes enormous amounts of memory. But allocating exactly the right amount requires knowing in advance how long each response will be, which is impossible.

PagedAttention, introduced in the vLLM paper (SOSP 2023, arXiv:2309.06180), solves this by borrowing an idea from operating systems: virtual memory with paging.

In an operating system, physical memory is divided into fixed-size pages. Programs see a continuous virtual address space, but the OS maps virtual pages to physical pages that may be scattered across memory. This eliminates fragmentation and allows efficient memory sharing.

PagedAttention does the same thing for the KV cache:

The KV cache is divided into fixed-size blocks (e.g., 16 tokens per block).
Each request has a block table that maps logical token positions to physical memory blocks.
New blocks are allocated only when needed (as new tokens are generated).
When a request finishes, its blocks are freed and can be reused by other requests.
Requests that share a common prefix (e.g., the same system prompt) can share the same physical blocks, avoiding redundant storage.

def paged_attention_demo():
    """
    Illustrate how PagedAttention manages KV cache memory.
    """
    print("PagedAttention: Virtual Memory for the KV Cache")
    print("=" * 65)
    
    print("\n  Without PagedAttention (pre-allocated):")
    print("  " + "-" * 50)
    print("  Request A (actual: 150 tokens, allocated: 2048)")
    print("  [USED: 150 tokens][WASTED: 1,898 tokens = 92.7% waste]")
    print("  Request B (actual: 2000 tokens, allocated: 2048)")
    print("  [USED: 2,000 tokens][WASTED: 48 tokens]")
    print("  Request C (actual: 50 tokens, allocated: 2048)")
    print("  [USED: 50 tokens][WASTED: 1,998 tokens = 97.6% waste]")
    
    print("\n  With PagedAttention (block-based, 16 tokens/block):")
    print("  " + "-" * 50)
    print("  Physical memory blocks: [0][1][2][3][4][5][6]...")
    print("  Request A: block table -> [0, 3, 7, 12, ...] (10 blocks)")
    print("  Request B: block table -> [1, 2, 5, 8, ...] (125 blocks)")
    print("  Request C: block table -> [4, 9, 15, 20]     (4 blocks)")
    print("  Blocks allocated on demand. Near-zero waste.")
    
    print("\n  Memory savings:")
    pre_alloc = 3 * 2048  # 3 requests, max 2048 each
    actual = 150 + 2000 + 50
    paged = (10 + 125 + 4) * 16  # blocks * tokens_per_block
    print(f"    Pre-allocated: {pre_alloc:,} token slots ({actual/pre_alloc*100:.1f}% utilized)")
    print(f"    PagedAttention: {paged:,} token slots ({actual/paged*100:.1f}% utilized)")

paged_attention_demo()

The vLLM paper showed that PagedAttention achieves near-zero waste in KV cache memory and improves serving throughput by 2-4x compared to systems like FasterTransformer and Orca, with the improvement being more pronounced for longer sequences and larger models.

Source: PagedAttention introduced in “Efficient Memory Management for Large Language Model Serving with PagedAttention” (SOSP 2023, arXiv:2309.06180). Achieves near-zero KV cache memory waste and 2-4x throughput improvement over FasterTransformer and Orca (confirmed from dl.acm.org/doi/abs/10.1145/3600006.3613165, harvard.edu, emergentmind.com, snowan.gitbook.io).

RadixAttention: Smarter Cache Reuse

RadixAttention, introduced by SGLang (developed at UC Berkeley), takes KV cache management further. Instead of treating each request’s KV cache independently, RadixAttention organizes the cache as a radix tree (a prefix tree data structure). Requests that share a common prefix (such as the same system prompt or few-shot examples) automatically share the same cached key-value pairs.

This is particularly valuable for agent workflows (Chapter 23), where the same system prompt and tool definitions are sent with every request. Without RadixAttention, the system prompt’s KV cache is recomputed for every request. With RadixAttention, it is computed once and shared across all requests that use the same prefix, achieving up to 6.4x higher throughput in prefix-heavy workloads.

Source: SGLang with RadixAttention developed by UC Berkeley and LMSYS, uses radix tree for automatic KV cache reuse, achieving up to 6.4x higher throughput (confirmed from arxiv.org/html/2312.07104, lmsys.org, inference.net).

Speculative Decoding: Using a Small Model to Speed Up a Large One

The decode phase is slow because the large model generates one token at a time, and each token requires reading the entire model’s weights from memory. Speculative decoding is a technique that uses a small, fast “draft” model to guess multiple tokens ahead, then has the large “target” model verify those guesses in a single parallel pass.

Here is how it works:

The draft model (a small, fast model, perhaps 1-7B parameters) generates K candidate tokens autoregressively. This is fast because the draft model is small.
The target model (the large model you actually want to serve) processes all K candidate tokens in a single forward pass. This is efficient because the target model can process multiple tokens in parallel (like the prefill phase), which is compute-bound and utilizes the GPU well.
The target model checks each candidate token against its own probability distribution. If the candidate matches (or is close enough), it is accepted. If not, the target model’s token replaces it, and all subsequent candidates are discarded.
The process repeats from the first rejected position.

The key insight is that the verification step (step 2) costs roughly the same as generating a single token, because the target model processes all K candidates in parallel. If the draft model guesses correctly most of the time, you effectively generate K tokens for the cost of one target model forward pass.

def speculative_decoding_demo():
    """
    Demonstrate how speculative decoding works step by step.
    """
    print("Speculative Decoding: Draft and Verify")
    print("=" * 65)
    
    print("\n  Setup:")
    print("    Target model: 70B parameters (slow, high quality)")
    print("    Draft model:  7B parameters (fast, decent quality)")
    print("    Speculation length: K=5 tokens")
    
    print("\n  Step 1: Draft model generates 5 candidate tokens (fast)")
    draft_tokens = ["The", "quick", "brown", "fox", "jumped"]
    print(f"    Draft: {' '.join(draft_tokens)}")
    
    print("\n  Step 2: Target model verifies all 5 in one forward pass")
    print("    Token 1 'The'   -> Target agrees    [ACCEPT]")
    print("    Token 2 'quick' -> Target agrees    [ACCEPT]")
    print("    Token 3 'brown' -> Target agrees    [ACCEPT]")
    print("    Token 4 'fox'   -> Target disagrees [REJECT -> 'dog']")
    print("    Token 5 'jumped'-> Discarded (after rejection)")
    
    print("\n  Result: 3 tokens accepted + 1 corrected = 4 tokens generated")
    print("          in the time of ~1 target model forward pass")
    
    print("\n  Without speculative decoding: 4 tokens = 4 forward passes")
    print("  With speculative decoding:    4 tokens ~ 1 forward pass")
    print("  Speedup: ~2-3x in practice (acceptance rate ~60-80%)")
    
    print("\n  Critical property: The output is IDENTICAL to what the target")
    print("  model would have generated alone. No quality loss.")

speculative_decoding_demo()

The critical property of speculative decoding is that it produces exactly the same output distribution as the target model alone. There is no quality degradation. The draft model’s guesses are only used if the target model agrees with them. This makes speculative decoding a “free lunch” in terms of output quality.

In practice, speculative decoding achieves 2-3x speedup for autoregressive generation. The actual speedup depends on the acceptance rate (how often the draft model’s guesses match the target model) and the speculation length K. Higher acceptance rates and longer speculation lengths yield greater speedups.

Source: Speculative decoding achieves 2-3x speedup by using a small draft model to propose tokens verified by the target model in parallel, with no change to output quality (confirmed from introl.com, mtbox-ai.com, n1n.ai, inference.net). Original papers by Leviathan et al. (2023) and Chen et al. (2023) (confirmed from matx.com/research/sd).

Quantization for Deployment: Making Models Smaller and Faster

You learned about quantization in earlier chapters as a training technique. For serving, quantization is even more important: it reduces the memory footprint of model weights, allowing larger models to fit on fewer GPUs and increasing the number of users that can be served simultaneously.

The core idea is simple: instead of storing each model weight as a 16-bit floating-point number (2 bytes), store it as an 8-bit integer (1 byte) or a 4-bit integer (0.5 bytes). This reduces memory by 2x or 4x, respectively.

Precision Formats

Format	Bits	Bytes per Parameter	Memory for 70B Model	Relative Quality
FP32	32	4	280 GB	Baseline (training)
FP16/BF16	16	2	140 GB	~Same as FP32
FP8	8	1	70 GB	Very close to FP16
INT8	8	1	70 GB	Very close to FP16
INT4	4	0.5	35 GB	Slight degradation
FP4	4	0.5	35 GB	Slight degradation

At INT4, a 70B model shrinks from 140 GB (FP16) to approximately 35 GB, fitting on a single GPU with 40+ GB of memory (like an A100 or H100) with room for the KV cache. This is the difference between needing two GPUs and needing one, which halves the hardware cost.

GPTQ: Post-Training Quantization

GPTQ (Generative Post-Training Quantization) is a weight-only quantization method that compresses model weights to 4-bit or 3-bit precision after training. It works by:

Processing the model layer by layer
For each layer, using a small calibration dataset (typically 128-256 examples) to measure how quantization errors in one weight affect the layer’s output
Adjusting the remaining weights to compensate for quantization errors (using an approximate second-order method based on the Hessian matrix)
Storing the quantized weights along with small scaling factors for dequantization

GPTQ is designed for GPU inference. The quantized weights are stored in INT4 format and dequantized to FP16 on-the-fly during matrix multiplication. This is fast on modern GPUs because the memory savings (reading 4-bit weights instead of 16-bit) more than compensate for the dequantization overhead.

AWQ: Activation-Aware Weight Quantization

AWQ (Activation-Aware Weight Quantization) takes a different approach. Instead of treating all weights equally, AWQ observes that a small fraction of weights (typically 1%) are disproportionately important because they correspond to large activation values. AWQ protects these “salient” weights by keeping them at higher precision, while aggressively quantizing the rest.

The key insight is that not all weights matter equally. Weights that multiply large activations have a bigger impact on the output than weights that multiply small activations. By identifying and protecting these critical weights, AWQ achieves better quality than GPTQ at the same bit width.

GGUF: CPU-Friendly Quantization

GGUF (GPT-Generated Unified Format) is the quantization format used by llama.cpp and Ollama for local inference on consumer hardware. Unlike GPTQ and AWQ, which are optimized for GPU inference, GGUF is designed to work efficiently on CPUs and supports hybrid CPU/GPU execution (offloading some layers to the GPU while running others on the CPU).

GGUF offers a range of quantization levels (Q2_K through Q8_0) that trade off quality against size. The most popular choice for local inference is Q4_K_M, which provides a good balance of quality and memory usage. A 70B model quantized to Q4_K_M is approximately 40 GB, which can run on a high-end consumer setup with 64 GB of RAM using CPU inference, or on a GPU with 48 GB of VRAM.

def quantization_comparison():
    """
    Compare quantization methods for a 70B parameter model.
    """
    print("Quantization Methods Compared (70B Parameter Model)")
    print("=" * 75)
    
    methods = [
        ("FP16 (no quant)", 140.0, "Baseline", "GPU", "Reference quality"),
        ("GPTQ INT4", 35.0, "~99% of FP16", "GPU", "Fast GPU inference"),
        ("AWQ INT4", 35.0, "~99.2% of FP16", "GPU", "Better quality than GPTQ"),
        ("GGUF Q4_K_M", 40.0, "~98.5% of FP16", "CPU+GPU", "Local/consumer HW"),
        ("GGUF Q8_0", 70.0, "~99.8% of FP16", "CPU+GPU", "Higher quality local"),
        ("FP8", 70.0, "~99.9% of FP16", "GPU", "Minimal quality loss"),
    ]
    
    print(f"  {'Method':<20} {'Size (GB)':>10} {'Quality':>16} {'Target':>8} {'Notes':<25}")
    print(f"  {'-'*20} {'-'*10} {'-'*16} {'-'*8} {'-'*25}")
    
    for method, size, quality, target, notes in methods:
        print(f"  {method:<20} {size:>10.1f} {quality:>16} {target:>8} {notes:<25}")
    
    print(f"\n  Key tradeoff: INT4 quantization cuts memory by 4x with ~1% quality loss.")
    print(f"  This often means the difference between 1 GPU and 2 GPUs,")
    print(f"  which halves the serving cost.")

quantization_comparison()

When to Use Which

FP8: Use when you have B200 GPUs with native FP8 support and want minimal quality loss. Best for production API serving where quality is paramount.
GPTQ INT4: Use for GPU-based serving when you need to minimize GPU count. Well-supported by vLLM and TensorRT-LLM.
AWQ INT4: Use when you want slightly better quality than GPTQ at the same bit width. Also well-supported by serving frameworks.
GGUF Q4_K_M: Use for local inference on consumer hardware (laptops, desktops) via llama.cpp or Ollama.
GGUF Q8_0: Use for local inference when you have enough memory and want higher quality.

Source: GPTQ and AWQ are weight-only quantization methods targeting 4-bit precision for GPU inference. AWQ preserves salient weights based on activation magnitudes for better quality. GGUF is CPU-friendly quantization for llama.cpp (confirmed from youngju.dev, rohan-paul.com, dasroot.net, gopenai.com). A 70B model in FP16 requires ~140 GB; INT4 reduces to ~35 GB (confirmed from youngju.dev, spheron.network, twm.me).

Serving Frameworks: The Software Stack

The hardware and algorithms described above are implemented in serving frameworks that handle the complexity of loading models, managing memory, scheduling requests, and generating tokens. Four frameworks dominate LLM serving in March 2026.

vLLM: The Open-Source Standard

vLLM is the most widely used open-source LLM serving framework. Built around PagedAttention, it provides:

Continuous batching for high throughput
PagedAttention for efficient KV cache management
Tensor parallelism and pipeline parallelism for multi-GPU serving
Quantization support: GPTQ, AWQ, FP8, and more
OpenAI-compatible API: drop-in replacement for the OpenAI API
Speculative decoding support

vLLM achieves 2-4x higher throughput than naive serving implementations and has become the default choice for teams deploying open-source models in production. It supports all major model architectures (LLaMA, Mistral, Qwen, Falcon, and many others) and runs on NVIDIA, AMD, and Intel GPUs.

TensorRT-LLM: NVIDIA’s Optimized Engine

TensorRT-LLM is NVIDIA’s inference engine, optimized specifically for NVIDIA GPUs. It uses fused CUDA kernels, optimized FP8/INT8 computation paths, and deep integration with NVIDIA’s Triton Inference Server. On H100 GPUs with FP8 precision, TensorRT-LLM can achieve over 10,000 output tokens per second at peak throughput with time-to-first-token latencies below 100 milliseconds.

TensorRT-LLM typically delivers higher raw performance than vLLM on NVIDIA hardware, but it is more complex to set up and less flexible. It requires a model compilation step that converts the model into an optimized engine format, which can take significant time and must be redone for each hardware configuration.

SGLang: The Newcomer

SGLang, developed at UC Berkeley and hosted by LMSYS, combines a frontend domain-specific language (DSL) for structured LLM programs with a high-performance backend runtime. Its key innovation is RadixAttention for automatic KV cache reuse across requests. SGLang is particularly strong for agent workloads and structured generation tasks where many requests share common prefixes.

Hugging Face TGI: The Ecosystem Play

Text Generation Inference (TGI) is Hugging Face’s serving solution. It provides continuous batching, Flash Attention, tensor parallelism, and quantization support. TGI’s main advantage is tight integration with the Hugging Face model ecosystem, making it easy to deploy any model from the Hugging Face Hub. As of December 2025, TGI entered maintenance mode, with Hugging Face focusing on multi-backend support (integrating vLLM and TensorRT-LLM as backends).

def serving_framework_comparison():
    """
    Compare the major LLM serving frameworks in March 2026.
    """
    print("LLM Serving Frameworks Comparison (March 2026)")
    print("=" * 80)
    
    frameworks = [
        ("vLLM", "Open source", "PagedAttention, continuous batching",
         "NVIDIA, AMD, Intel", "Most popular, flexible"),
        ("TensorRT-LLM", "NVIDIA", "Fused kernels, FP8 optimization",
         "NVIDIA only", "Highest perf on NVIDIA"),
        ("SGLang", "UC Berkeley", "RadixAttention, structured gen",
         "NVIDIA, AMD", "Best for agents/prefix sharing"),
        ("TGI", "Hugging Face", "HF ecosystem integration",
         "NVIDIA, AMD, Intel", "Maintenance mode since Dec 2025"),
    ]
    
    for name, org, features, hw, notes in frameworks:
        print(f"\n  {name} ({org})")
        print(f"    Key features: {features}")
        print(f"    Hardware:     {hw}")
        print(f"    Notes:        {notes}")

serving_framework_comparison()

Source: vLLM uses PagedAttention for 2-4x throughput improvement, supports continuous batching, tensor/pipeline parallelism, GPTQ/AWQ/FP8 quantization, OpenAI-compatible API (confirmed from emergentmind.com, lobehub.com, glukhov.org, easecloud.io). TensorRT-LLM achieves 10,000+ output tokens/sec on H100 with FP8, TTFT under 100ms (confirmed from nvidia.github.io/TensorRT-LLM, introl.com). SGLang uses RadixAttention for up to 6.4x throughput on prefix-heavy workloads (confirmed from arxiv.org/html/2312.07104, lmsys.org). TGI entered maintenance mode December 11, 2025 (confirmed from huggingface.co/docs/inference-endpoints/engines/tgi).

The Economics of Inference

Understanding the cost structure of LLM inference is essential for anyone building products with these models. Whether you use a commercial API or run your own infrastructure, the economics follow the same fundamental principles.

API Pricing: What You Pay Per Token

Commercial API providers charge per token (or per million tokens). Here is the pricing landscape as of March 2026:

def api_pricing_table():
    """
    Compare API pricing across major providers (March 2026).
    All prices are per 1 million tokens.
    """
    print("LLM API Pricing Comparison (March 2026)")
    print("=" * 80)
    
    models = [
        # (Model, Input $/MTok, Output $/MTok, Context, Notes)
        ("GPT-5.4", 2.50, 15.00, "1.05M", "2x input above 272K"),
        ("GPT-5.4 mini", 0.75, 4.50, "400K", "Released Mar 17, 2026"),
        ("GPT-5.4 nano", 0.20, 1.25, "API-only", "Released Mar 17, 2026"),
        ("Claude Opus 4.6", 5.00, 25.00, "1M", "No surcharge since Mar 13"),
        ("Claude Sonnet 4.6", 3.00, 15.00, "1M", "Best value for reasoning"),
        ("Claude Haiku 4.5", 1.00, 5.00, "200K", "Speed tier"),
        ("Gemini 3.1 Pro", 2.00, 12.00, "1M", "Same price as Gemini 3 Pro"),
        ("Gemini 3 Flash", 0.50, 3.00, "1M", "3x faster than Pro"),
        ("Gemini Flash-Lite 2.5", 0.10, 0.40, "1M", "Budget tier (text)"),
        ("Gemini Flash-Lite 3.1", 0.25, 1.50, "1M", "Budget tier, Mar 3 2026"),
        ("DeepSeek V3.2", 0.28, 0.42, "128K", "Open source, self-hostable"),
        ("LLaMA 4 Maverick", 0.20, 0.60, "1M", "Open source, via providers"),
    ]
    
    print(f"  {'Model':<22} {'Input':>8} {'Output':>8} {'Context':>8} {'Notes':<30}")
    print(f"  {'-'*22} {'-'*8} {'-'*8} {'-'*8} {'-'*30}")
    
    for model, inp, out, ctx, notes in models:
        inp_str = f"${inp:.2f}" if isinstance(inp, float) else inp
        out_str = f"${out:.2f}" if isinstance(out, float) else out
        print(f"  {model:<22} {inp_str:>8} {out_str:>8} {ctx:>8} {notes:<30}")
    
    print(f"\n  Price range: 250x between cheapest (Flash-Lite 2.5 input at $0.10)")
    print(f"  and most expensive (Claude Opus 4.6 output at $25). GPT-5.4 nano")
    print(f"  at $0.20/$1.25 undercuts Gemini 3.1 Flash-Lite ($0.25) on input.")

api_pricing_table()

Several pricing patterns are worth noting:

Input is cheaper than output. Every provider charges more for output tokens than input tokens (typically 3-6x more). This is because output tokens are generated one at a time in the memory-bandwidth-bound decode phase, while input tokens are processed in parallel during the compute-efficient prefill phase.

Caching reduces input costs dramatically. Most providers offer prompt caching: if you send the same system prompt repeatedly, the cached version costs 80-90% less. GPT-5.4 cached input costs $0.25/MTok (90% discount from $2.50). Claude Sonnet 4.6 cache reads cost $0.30/MTok (90% discount from $3.00). For applications with consistent system prompts (which is most applications), caching is the single biggest cost optimization.

Long context costs more. GPT-5.4 charges 2x the standard input rate for prompts exceeding 272K tokens. Anthropic removed its long-context surcharge on March 13, 2026 (previously 2x input and 1.5x output above 200K tokens). Gemini 3.1 Pro charges $4.00/$18.00 per MTok above 200K tokens (2x the standard rate).

Open-source models are dramatically cheaper. DeepSeek V3.2 at $0.28/$0.42 per MTok delivers quality that rivals models costing 10-30x more. LLaMA 4 Maverick at $0.20/$0.60 per MTok (via Groq; as low as $0.15 input on some providers) is even cheaper on input. These prices reflect the lower cost of serving open-source models on commodity hardware, without the margin that commercial providers add.

Source: GPT-5.4 pricing: $2.50 input, $0.25 cached, $15.00 output per MTok, 2x above 272K (confirmed from openai.com/api/pricing, community.openai.com, apidog.com). GPT-5.4 mini: $0.75 input, $0.075 cached, $4.50 output per MTok; GPT-5.4 nano: $0.20 input, $1.25 output per MTok; both released March 17, 2026, mini with 400K context window (confirmed from community.openai.com/t/introducing-gpt-5-4-mini-and-nano, openai.com/index/introducing-gpt-5-4-mini-and-nano, buildfastwithai.com, technosports.co.in, geeky-gadgets.com, blockchain.news, siliconangle.com). Claude Opus 4.6: $5/$25, Sonnet 4.6: $3/$15, Haiku 4.5: $1/$5 per MTok (confirmed from curlscape.com, pecollective.com, tldl.io, karangoyal.cc, cloud.google.com/vertex-ai/generative-ai/pricing). Anthropic removed long-context surcharge March 13, 2026 for Claude 4.6 models (confirmed from blockchain.news, the-decoder.com, karangoyal.cc). Gemini 3.1 Pro: $2/$12 standard, $4/$18 above 200K; Gemini 3 Flash: $0.50/$3; Gemini 2.5 Flash-Lite: $0.10/$0.40; Gemini 3.1 Flash-Lite: $0.25/$1.50 (confirmed from cloud.google.com/vertex-ai/generative-ai/pricing, invertedstone.com, macaron.im, apidog.com). DeepSeek V3.2: $0.28/$0.42 per MTok via DeepSeek API (confirmed from tldl.io, deepseek.com, api-docs.deepseek.com). LLaMA 4 Maverick: $0.20/$0.60 per MTok via Groq, $0.15/$0.60 via DeepInfra, $0.27/$0.85 via Together AI (confirmed from groq.com/pricing, inworld.ai, aicostcheck.com, artificialanalysis.ai).

Self-Hosting Economics

For organizations with sufficient scale, running your own inference infrastructure can be significantly cheaper than using commercial APIs. The tradeoff is operational complexity: you need to manage hardware, software, scaling, monitoring, and on-call support.

def self_hosting_economics():
    """
    Compare the economics of API usage vs self-hosting.
    """
    print("API vs Self-Hosting: Cost Comparison")
    print("=" * 70)
    
    print("\n  Scenario: Serving LLaMA 4 Maverick (400B MoE, 17B active)")
    print("  Volume: 100 million tokens per day (input + output)")
    
    # API cost
    api_input_rate = 0.20  # per MTok (Groq pricing)
    api_output_rate = 0.60  # per MTok (Groq pricing)
    daily_tokens_m = 100  # million tokens
    # Assume 40% input, 60% output
    daily_api_cost = (daily_tokens_m * 0.4 * api_input_rate + 
                      daily_tokens_m * 0.6 * api_output_rate)
    monthly_api = daily_api_cost * 30
    
    print(f"\n  Option 1: Third-party API (e.g., Groq, Fireworks)")
    print(f"    Input rate:  ${api_input_rate}/MTok")
    print(f"    Output rate: ${api_output_rate}/MTok")
    print(f"    Daily cost:  ${daily_api_cost:.2f}")
    print(f"    Monthly:     ${monthly_api:.2f}")
    
    # Self-hosting cost
    # Maverick needs ~200 GB in INT4, fits on 4x H100 (320 GB total)
    num_gpus = 4
    gpu_hourly = 2.50  # H100 cloud price
    daily_gpu = num_gpus * gpu_hourly * 24
    monthly_gpu = daily_gpu * 30
    
    print(f"\n  Option 2: Self-hosted on 4x H100 (INT4 quantized)")
    print(f"    GPU cost:    {num_gpus} x ${gpu_hourly}/hr = ${num_gpus * gpu_hourly:.2f}/hr")
    print(f"    Daily cost:  ${daily_gpu:.2f}")
    print(f"    Monthly:     ${monthly_gpu:.2f}")
    print(f"    + Engineering overhead (monitoring, on-call, updates)")
    
    print(f"\n  Breakeven analysis:")
    print(f"    API monthly:       ${monthly_api:>10,.2f}")
    print(f"    Self-host monthly: ${monthly_gpu:>10,.2f} (hardware only)")
    
    if monthly_api > monthly_gpu:
        savings = (1 - monthly_gpu / monthly_api) * 100
        print(f"    Self-hosting saves: ~{savings:.0f}% on hardware alone")
    else:
        print(f"    API is cheaper at this volume")
    
    print(f"\n  At higher volumes (1B+ tokens/day), self-hosting savings grow.")
    print(f"  At lower volumes (<10M tokens/day), API is usually cheaper.")

self_hosting_economics()

The crossover point depends on your volume, the model you are serving, and your engineering team’s capacity. As a rough guideline:

Under 10 million tokens/day: Use APIs. The operational overhead of self-hosting is not worth the savings.
10-100 million tokens/day: Evaluate both options. Self-hosting may save 30-60% on hardware costs, but factor in engineering time.
Over 100 million tokens/day: Self-hosting almost always wins on cost, often by 50-80%. At this scale, the engineering investment is amortized across enough volume to be worthwhile.

The Cost Trend: Rapidly Declining

LLM inference costs have been falling dramatically. GPT-4’s initial pricing in March 2023 was $30 per million input tokens and $60 per million output tokens. Three years later, GPT-5.4 (a significantly more capable model) costs $2.50/$15.00, a 12x reduction in input cost and 4x reduction in output cost. The smaller GPT-5.4 mini ($0.75/$4.50) and GPT-5.4 nano ($0.20/$1.25), released March 17, 2026, push the floor even lower: GPT-5.4 nano’s input pricing is 150x cheaper than GPT-4’s original input pricing, while delivering capabilities that exceed GPT-4 on many benchmarks. Gemini 2.5 Flash-Lite at $0.10/$0.40 is 300x cheaper than GPT-4’s original input pricing.

This decline is driven by four factors:

Hardware improvements: Each GPU generation delivers 2-4x more inference performance per dollar.
Software optimizations: Continuous batching, PagedAttention, speculative decoding, and better quantization methods extract more throughput from the same hardware.
Competition: The entry of DeepSeek, open-source models, and multiple cloud providers has driven prices down aggressively.
Architectural efficiency: MoE models like DeepSeek V3.2 and LLaMA 4 Maverick deliver frontier-class quality while activating only a fraction of their total parameters per token, dramatically reducing the compute cost per request.

Putting It All Together: From Request to Response

Let us trace a complete request through a production serving system to see how all these pieces fit together.

def request_lifecycle():
    """
    Trace a request through a production LLM serving system.
    """
    print("Lifecycle of an LLM API Request")
    print("=" * 70)
    
    steps = [
        ("1. Request arrives",
         "User sends prompt via API. Load balancer routes to a server.",
         "~1-5 ms"),
        ("2. Tokenization",
         "Prompt text converted to token IDs using the model's tokenizer.",
         "~1-2 ms"),
        ("3. KV cache check",
         "Check if prompt prefix matches cached KV entries (RadixAttention).\n"
         "         If cache hit, skip prefill for cached portion.",
         "~0.1 ms"),
        ("4. Prefill phase",
         "Process all input tokens in parallel through the model.\n"
         "         Compute attention keys/values, store in KV cache.\n"
         "         This is compute-bound (GPU Tensor Cores fully utilized).",
         "~50-500 ms"),
        ("5. Join decode batch",
         "Request joins the continuous batch for token generation.\n"
         "         Scheduler assigns KV cache memory blocks (PagedAttention).",
         "~0.1 ms"),
        ("6. Decode loop",
         "Generate tokens one at a time (or via speculative decoding).\n"
         "         Each token: read weights + KV cache, compute, write new KV.\n"
         "         This is memory-bandwidth-bound.",
         "~10-30 ms/token"),
        ("7. Streaming response",
         "Tokens sent to user as they are generated (server-sent events).\n"
         "         User sees text appearing word by word.",
         "Continuous"),
        ("8. Completion",
         "Model generates end-of-sequence token. Request exits batch.\n"
         "         KV cache blocks freed for reuse. Slot opens for next request.",
         "~0.1 ms"),
    ]
    
    for step, desc, timing in steps:
        print(f"\n  {step} ({timing})")
        for line in desc.split('\n'):
            print(f"    {line.strip()}")
    
    print(f"\n  Total for a typical request (500-token prompt, 200-token response):")
    print(f"    TTFT (time to first token): ~100-300 ms")
    print(f"    Full response:              ~2-6 seconds")
    print(f"    Tokens per second:          ~30-100 (varies by model and hardware)")

request_lifecycle()

Key Takeaways

LLM inference has two distinct phases. The prefill phase processes the input prompt in parallel and is compute-bound (limited by GPU TFLOPS). The decode phase generates output tokens one at a time and is memory-bandwidth-bound (limited by how fast weights and KV cache can be read from HBM). Different optimizations target each phase.
Three hardware families dominate LLM serving in March 2026. NVIDIA H100 (80 GB HBM3, 3.35 TB/s, ~$1.25-3.90/hr) is the current workhorse. NVIDIA B200 (192 GB HBM3e, 8 TB/s, ~$2.25-6.50/hr) delivers 2.3x more BF16 dense compute and 2.4x more bandwidth. The B300 (Blackwell Ultra, 288 GB HBM3e, shipped January 2026) adds 50% more memory and 1.5x more FP4 compute over the B200. Google TPU v7 Ironwood (192 GB HBM3e, 7.38 TB/s, 4,614 FP8 TFLOPS per chip) is in preview and matches the B200 on per-chip specs while scaling to 9,216-chip pods. TPU v6e (918 TFLOPS BF16) remains available for cost-efficient workloads.
Model parallelism splits large models across GPUs. Tensor parallelism splits each layer’s matrix operations across GPUs within a node (requires NVLink). Pipeline parallelism assigns entire layers to different GPUs across nodes (works over InfiniBand). Expert parallelism distributes MoE experts across GPUs. Production systems combine all three.
Continuous batching (introduced by Orca, OSDI 2022) treats each decode iteration as an opportunity to adjust the batch, replacing completed requests with waiting ones. The original paper demonstrated up to 36.9x throughput improvement over FasterTransformer, and it is used by every major serving framework. Prefill-decode disaggregation (DistServe, OSDI 2024; Splitwise, ISCA 2024) takes this further by running the compute-bound prefill and memory-bound decode phases on separate hardware, achieving up to 4.48x higher goodput. AWS and Cerebras announced a production disaggregated inference collaboration on March 13, 2026.
The KV cache stores attention keys and values for all processed tokens and grows linearly with sequence length. For a 70B model with GQA at 32K tokens, the KV cache is approximately 10 GB per request. Grouped-Query Attention reduces KV cache size by 8x compared to standard Multi-Head Attention, which is why nearly all modern models use it.
PagedAttention (vLLM, SOSP 2023) manages KV cache memory using virtual memory paging, allocating fixed-size blocks on demand and freeing them when requests complete. This achieves near-zero memory waste and 2-4x throughput improvement. RadixAttention (SGLang) extends this with prefix-based cache sharing for up to 6.4x throughput on agent workloads.
Speculative decoding uses a small draft model to guess multiple tokens, then verifies them in parallel with the large target model. This achieves 2-3x speedup with zero quality loss, because rejected guesses are replaced by the target model’s tokens.
Quantization reduces model weight precision from 16-bit to 8-bit or 4-bit, cutting memory by 2-4x. GPTQ and AWQ target GPU inference at INT4 (a 70B model shrinks from 140 GB to ~35 GB). GGUF targets CPU/hybrid inference for local deployment via llama.cpp. FP8 on B200 GPUs offers minimal quality loss with 2x memory savings.
Four serving frameworks dominate. vLLM (open source, PagedAttention, most popular), TensorRT-LLM (NVIDIA, highest performance on NVIDIA GPUs), SGLang (RadixAttention, best for agents), and TGI (Hugging Face ecosystem, now in maintenance mode).
API pricing spans a 250x range in March 2026: from Gemini 2.5 Flash-Lite at $0.10/$0.40 per MTok to Claude Opus 4.6 at $5/$25 per MTok. GPT-5.4 nano ($0.20/$1.25) and GPT-5.4 mini ($0.75/$4.50), both released March 17, 2026, offer aggressive pricing for near-flagship performance. GPT-5.4 nano undercuts Gemini 3.1 Flash-Lite ($0.25) on input pricing. Prompt caching saves 80-90% on repeated system prompts. Open-source models (DeepSeek V3.2 at $0.28/$0.42, LLaMA 4 Maverick at $0.20/$0.60 via Groq) are 10-30x cheaper than frontier closed models.
Inference costs are falling rapidly. GPT-5.4 input pricing ($2.50/MTok) is 12x cheaper than GPT-4’s launch pricing ($30/MTok) three years earlier, despite being a significantly more capable model. GPT-5.4 nano at $0.20/MTok input is 150x cheaper than GPT-4’s original input pricing. This decline is driven by hardware improvements, software optimizations (batching, caching, quantization), MoE architectural efficiency, and market competition.
Self-hosting becomes cost-effective above ~10 million tokens per day, with savings of 30-80% on hardware costs at higher volumes. Below that threshold, commercial APIs are usually cheaper when engineering overhead is factored in.

What’s Next

You now understand the infrastructure that turns trained models into products: the GPUs and TPUs that provide the compute and memory bandwidth, the parallelism strategies that split models across hardware, the batching and caching techniques that maximize throughput, the quantization methods that shrink models to fit smaller hardware, and the economics that determine what inference costs. In Chapter 25, we will explore the divide between open and closed models: what “open weights” actually means, how to run models locally, and the fine-tuning techniques that let you customize open models for your specific needs.

Chapter 25. Open vs. Closed Models, The Great Divide