Appendix B. GPU Memory Calculator

Appendix A derived the exact formulas for parameter counts and FLOPs inside a Transformer. Chapter 11 showed how model sizes translate to memory requirements. Chapter 18 explained the KV cache and its memory cost. This appendix brings all of those formulas together into a single, runnable Python calculator that answers the question every practitioner asks: “How much GPU memory do I actually need to run this model?”

The calculator covers three scenarios: inference (just running the model), fine-tuning with LoRA (Chapter 28), and full training. For each scenario, it estimates the memory breakdown: model weights, KV cache, activations, optimizer states, and framework overhead. It then tells you how many GPUs you need and which hardware fits.


B.1 The Three Memory Components of Inference

When you run a model for inference (generating text, not training), GPU memory is consumed by three things:

  1. Model weights: The parameters themselves, stored at whatever precision you choose (bfloat16, INT8, INT4, etc.).
  2. KV cache: The cached key and value vectors for all tokens in the current context, for all layers (Chapter 18).
  3. Overhead: Activation memory for the current forward pass, CUDA context, framework buffers, and memory fragmentation.

The formula for total inference memory is:

Total = Weight Memory + KV Cache Memory + Overhead

Weight memory is straightforward (Chapter 11):

Weight Memory (GB) = total_parameters (in billions) * bytes_per_parameter

The common precision options and their bytes per parameter:

| Precision | Bytes/Param | Typical Use |
|---|---|---|
| FP32 | 4.0 | Training master weights |
| BF16/FP16 | 2.0 | Default inference and training |
| FP8 | 1.0 | Near-lossless inference on Hopper/Blackwell GPUs |
| INT8 | 1.0 | Quantized inference |
| NVFP4 | 0.5 | Blackwell-native ultra-low-precision inference |
| INT4 | 0.5 | Aggressive quantization for memory-constrained deployment |

A note on FP8: a comprehensive study across 500,000+ evaluations (Kurtic et al., arXiv:2411.02355, November 2024; accepted to ACL 2025) found that FP8 weight-and-activation quantization (W8A8-FP) is “effectively lossless across all model scales.” On GPUs with native FP8 tensor cores (Hopper and Blackwell), FP8 inference can be up to 2x faster than BF16 while using half the memory. As of March 2026, FP8 is a common default precision for production inference on H100 and B200/B300 hardware.

In the examples below, “INT8” and “FP8” have identical memory footprints (1 byte per parameter). The difference is in compute: FP8 runs on the floating-point tensor cores of Hopper and Blackwell GPUs, while INT8 relies on integer matrix arithmetic and, in weight-only schemes, is dequantized to a higher precision before the multiplication. For memory planning purposes, they are interchangeable.

Similarly, “NVFP4” and “INT4” both use a nominal 0.5 bytes per parameter, and the calculator treats their memory footprints as identical. NVFP4 is NVIDIA’s native 4-bit floating-point format for Blackwell GPUs, using micro-block scaling (groups of 16 values sharing an FP8 E4M3 scale factor) to preserve accuracy. The shared scale adds 8 bits per 16 values, or about 0.5 bits per value, which is why on B200 and B300 hardware NVFP4 uses approximately 1.8x less memory than FP8 and 3.5x less than FP16 rather than exactly 2x and 4x. The 0.5 bytes per parameter used in this appendix therefore slightly underestimates the real footprint. NVIDIA’s benchmarks on DeepSeek-R1-0528 show 1% or less accuracy degradation across seven evaluation tasks when quantized from FP8 to NVFP4. Accuracy recovery improves with model size; smaller models show more variability. However, NVFP4 requires Blackwell hardware; on older GPUs, INT4 (via GPTQ or AWQ) remains the standard 4-bit option.
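The 1.8x and 3.5x ratios can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes one 8-bit scale per block of 16 four-bit values, as described above, and ignores any per-tensor scale; the function name is illustrative.

```python
def effective_bits_per_value(value_bits, scale_bits, block_size):
    """Average bits stored per value, including the shared block scale."""
    return value_bits + scale_bits / block_size


# Assumption: 4-bit values, one 8-bit (FP8 E4M3) scale per 16 values.
nvfp4_bits = effective_bits_per_value(value_bits=4, scale_bits=8, block_size=16)
nvfp4_bytes = nvfp4_bits / 8

print(f"NVFP4 effective bits per value:  {nvfp4_bits}")
print(f"NVFP4 effective bytes per param: {nvfp4_bytes}")
print(f"Reduction vs FP8:  {8 / nvfp4_bits:.2f}x")
print(f"Reduction vs FP16: {16 / nvfp4_bits:.2f}x")
```

Under these assumptions the effective cost is 4.5 bits (0.5625 bytes) per parameter, so planning with 0.5 bytes underestimates NVFP4 weight memory by roughly 12%.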

KV cache memory depends on the sequence length and batch size (Chapter 18):

KV Cache (bytes) = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element * seq_len * batch_size
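As a worked instance of this formula, the sketch below plugs in the LLaMA 3 8B values used later in this appendix (32 layers, 8 KV heads, head dimension 128, BF16 cache). The helper name is illustrative.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    """Bytes of KV cache added by one token of one sequence."""
    # Factor of 2: one key vector and one value vector per KV head per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
cache_bytes = per_token * 8192 * 1  # seq_len = 8192, batch_size = 1

print(f"Per token: {per_token:,} bytes ({per_token / 1024:.0f} KiB)")
print(f"8K context, batch 1: {cache_bytes / 1024**3:.2f} GiB")
```

Each token costs 128 KiB of cache, so an 8K-token context costs exactly 1 GiB, and the total grows linearly with both sequence length and batch size.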

Overhead is harder to pin down exactly, but a practical rule of thumb is 10-20% of the weight memory for inference. This covers the CUDA context (roughly 500 MB to 1 GB), activation tensors for the current forward pass, and memory fragmentation from the allocator.


B.2 The GPU Landscape (March 2026)

Before we build the calculator, here is the hardware you are choosing from. These are the GPUs relevant to LLM inference and training as of March 2026:

| GPU | Memory | Type | Bandwidth | Typical Use |
|---|---|---|---|---|
| RTX 4090 | 24 GB | GDDR6X | 1,008 GB/s | Consumer, local inference |
| RTX 5090 | 32 GB | GDDR7 | 1,792 GB/s | Consumer, local inference |
| A100 SXM | 80 GB | HBM2e | 2,039 GB/s | Cloud training/inference |
| H100 SXM | 80 GB | HBM3 | 3,350 GB/s | Cloud training/inference |
| H200 | 141 GB | HBM3e | 4,800 GB/s | Cloud inference, large models |
| B200 | 180 GB | HBM3e | 8,000 GB/s | Cloud training/inference |
| B300 | 288 GB | HBM3e | 8,000 GB/s | Cloud training/inference |
| M4 Max (Apple) | 128 GB | LPDDR5X | 546 GB/s | Local inference (unified memory) |
| Mac Studio M3 Ultra | 512 GB | LPDDR5X | 819 GB/s | Local inference (unified memory) |

A few notes on this table:

  • The RTX 4090 and RTX 5090 are consumer GPUs. Their VRAM is dedicated to the GPU and separate from system RAM. The RTX 5090 launched January 30, 2025 at $1,999.

  • Apple Silicon uses unified memory, meaning the CPU and GPU share the same memory pool. A MacBook Pro with M4 Max can have up to 128 GB of unified memory, and the 2025 Mac Studio (launched March 12, 2025) with M3 Ultra supports up to 512 GB. Apple skipped the M4 Ultra entirely; the next Ultra chip is expected to be the M5 Ultra in 2026. This makes Apple hardware surprisingly capable for local LLM inference, because most of the memory pool can be made available to the model. The tradeoff is lower memory bandwidth compared to HBM-based data center GPUs.

  • The NVIDIA B200 (Blackwell architecture) shipped in 2025 with 180 GB of usable HBM3e per GPU in production DGX/HGX systems (the physical chip contains 192 GB across eight 24 GB stacks, but 12 GB is reserved for ECC and yield management). The B300 (Blackwell Ultra) shipped in January 2026 with 288 GB of HBM3e and approximately 13 PFLOPS of dense FP4 compute per GPU on the HGX platform (the NVIDIA HGX spec sheet lists 13,125 TFLOPS; NVIDIA’s GTC 2025 keynote and press materials cite 15 PFLOPS, likely reflecting a higher-power DGX or GB300 configuration).

  • For multi-GPU setups, total available memory is the sum of all GPUs minus inter-GPU communication overhead. An 8x H100 node has 640 GB total; an 8x B200 node has 1,440 GB (1.4 TB) per the NVIDIA DGX B200 User Guide; an 8x B300 node has 2,304 GB (2.25 TB). The memory calculator in this appendix uses 180 GB per B200 GPU, matching the official DGX/HGX specification.
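The bandwidth column matters because single-stream token generation is usually limited by how fast the weights can be read from memory. The sketch below gives a rough upper bound for a dense model. It assumes every weight is read once per generated token and ignores KV cache reads, compute time, and kernel efficiency, so real throughput will be lower. The function name and the selection of GPUs are illustrative.

```python
# Bandwidth figures (GB/s) copied from the table above.
GPU_BANDWIDTH_GB_PER_S = {
    "RTX 4090": 1008,
    "H100 SXM": 3350,
    "B200": 8000,
    "Mac Studio M3 Ultra": 819,
}


def max_decode_tokens_per_s(weight_gb, bandwidth_gb_per_s):
    """Upper bound on tokens/s if each token requires one full pass over the weights."""
    return bandwidth_gb_per_s / weight_gb


weight_gb = 8.0 * 2.0  # 8B parameters in BF16 = 16 GB
for gpu, bandwidth in GPU_BANDWIDTH_GB_PER_S.items():
    bound = max_decode_tokens_per_s(weight_gb, bandwidth)
    print(f"{gpu:<22} <= {bound:6.1f} tokens/s (batch size 1)")
```

Under this simplification, memory capacity decides whether a model fits at all, while bandwidth decides how fast it can generate once it does.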

Source: RTX 4090: 24 GB GDDR6X, 1,008 GB/s (techpowerup.com/gpu-specs/geforce-rtx-4090.c3889). RTX 5090: 32 GB GDDR7, 1,792 GB/s, launched January 30, 2025 at $1,999 (beebom.com, pcgamesn.com, techpowerup.com). A100 SXM 80 GB: HBM2e, 2,039 GB/s (techpowerup.com, horizoniq.com). H100 SXM: 80 GB HBM3, 3,350 GB/s (glennklockwood.com/garden/processors/H100, rightnowai.co). H200: 141 GB HBM3e, 4,800 GB/s (rightnowai.co, runpod.io). B200 SXM: 180 GB HBM3e usable per GPU (physical chip 192 GB; 12 GB reserved for ECC/yield), 8 TB/s; NVIDIA DGX B200 User Guide: “1,440 GB total GPU memory” for 8 GPUs = 180 GB each (docs.nvidia.com/dgx/dgxb200-user-guide); HGX AI Factory Table 1: “180GB HBM3e” (docs.nvidia.com/enterprise-reference-architectures/hgx-ai-factory/latest/components.html); TechPowerUp: “NVIDIA has paired 180 GB HBM3e memory with the B200” (techpowerup.com/gpu-specs/b200-sxm-192-gb.c4210). B300: 288 GB HBM3e, 8 TB/s, shipped January 2026; HGX spec sheet via glennklockwood.com/garden/processors/B300 shows FP4 Matrix dense = 13,125 TFLOPS (~13 PFLOPS); NVIDIA GTC 2025 keynote and Tom’s Hardware (tomshardware.com) cite 15 PFLOPS dense FP4; spheron.network reports 14 PFLOPS; the discrepancy reflects different power/platform configurations (HGX vs DGX vs GB300 superchip). M4 Max: up to 128 GB LPDDR5X, 546 GB/s (macrumors.com, apple.com). Mac Studio 2025 M3 Ultra: up to 512 GB LPDDR5X, 819 GB/s, launched March 12, 2025 (apple.com/newsroom/2025/03/apple-unveils-new-mac-studio). Apple skipped M4 Ultra; M5 Ultra Mac Studio expected mid-2026 but not yet released as of March 2026 (macworld.com, macobserver.com, gadgethacks.com).


B.3 Model Configurations Reference

The calculator needs the architecture details for each model. Here are the configurations for the models we will use throughout this appendix, checked against the sources listed after the code:

# Model configurations for the GPU memory calculator.
# Each config contains the architecture details needed to compute memory.

MODEL_CONFIGS = {
    "LLaMA 3 8B": {
        "total_params_b": 8.0,
        "hidden_size": 4096,
        "num_layers": 32,
        "num_q_heads": 32,
        "num_kv_heads": 8,
        "head_dim": 128,
        "intermediate_size": 14336,
        "vocab_size": 128256,
        "max_seq_len": 8192,
        "is_moe": False,
    },
    "Qwen3-8B": {
        "total_params_b": 8.2,
        "hidden_size": 4096,
        "num_layers": 36,
        "num_q_heads": 32,
        "num_kv_heads": 8,
        "head_dim": 128,
        "intermediate_size": 12288,
        "vocab_size": 151936,
        "max_seq_len": 32768,  # 131,072 with YaRN.
        "is_moe": False,
    },
    "LLaMA 3 70B": {
        "total_params_b": 70.6,
        "hidden_size": 8192,
        "num_layers": 80,
        "num_q_heads": 64,
        "num_kv_heads": 8,
        "head_dim": 128,
        "intermediate_size": 28672,
        "vocab_size": 128256,
        "max_seq_len": 8192,
        "is_moe": False,
    },
    "LLaMA 3.1 405B": {
        "total_params_b": 405.0,
        "hidden_size": 16384,
        "num_layers": 126,
        "num_q_heads": 128,
        "num_kv_heads": 16,
        "head_dim": 128,
        "intermediate_size": 53248,
        "vocab_size": 128256,
        "max_seq_len": 131072,
        "is_moe": False,
    },
    "LLaMA 4 Maverick": {
        "total_params_b": 400.0,
        "active_params_b": 17.0,
        "hidden_size": 5120,
        "num_layers": 48,
        "num_q_heads": 40,
        "num_kv_heads": 8,
        "head_dim": 128,
        "intermediate_size_dense": 16384,
        "intermediate_size_moe": 8192,
        "num_experts": 128,
        "num_experts_per_tok": 1,
        "vocab_size": 202048,
        "max_seq_len": 524288,
        "is_moe": True,
    },
    "DeepSeek-V3": {
        "total_params_b": 671.0,
        "active_params_b": 37.0,
        "hidden_size": 7168,
        "num_layers": 61,
        "num_q_heads": 128,
        "num_kv_heads": 128,  # MLA; effective KV cache is different.
        "head_dim": 128,
        "kv_lora_rank": 512,
        "qk_rope_head_dim": 64,
        "vocab_size": 129280,
        "max_seq_len": 131072,
        "is_moe": True,
        "uses_mla": True,
    },
    "Mistral 7B": {
        "total_params_b": 7.3,
        "hidden_size": 4096,
        "num_layers": 32,
        "num_q_heads": 32,
        "num_kv_heads": 8,
        "head_dim": 128,
        "intermediate_size": 14336,
        "vocab_size": 32000,
        "max_seq_len": 32768,
        "is_moe": False,
    },
}

Source: LLaMA 3 8B: hidden_size=4096, num_hidden_layers=32, num_attention_heads=32, num_key_value_heads=8, intermediate_size=14336, vocab_size=128256 (confirmed from continuumlabs.pro, emergentmind.com, apxml.com). LLaMA 3 70B: hidden_size=8192, num_hidden_layers=80, num_attention_heads=64, num_key_value_heads=8, intermediate_size=28672 (confirmed from emergentmind.com/topics/llama3-70b-model). LLaMA 3.1 405B: hidden_size=16384, num_hidden_layers=126, num_attention_heads=128, num_key_value_heads=16, intermediate_size=53248 (confirmed from cedricchee.com config.json leak). Qwen3-8B: hidden_size=4096, num_hidden_layers=36, num_attention_heads=32, num_key_value_heads=8, head_dim=128, intermediate_size=12288, vocab_size=151936 (confirmed from huggingface.co/Qwen/Qwen3-8B). LLaMA 4 Maverick: hidden_size=5120, num_hidden_layers=48, num_attention_heads=40, num_key_value_heads=8, head_dim=128, intermediate_size_mlp=16384, intermediate_size=8192, 128 experts, vocab_size=202048 (confirmed from huggingface.co/docs/transformers/main/model_doc/llama4). DeepSeek-V3: 61 layers, kv_lora_rank=512, qk_rope_head_dim=64, num_attention_heads=128 (confirmed from arxiv.org/html/2412.19437v1, huggingface.co/deepseek-ai/DeepSeek-V3). Mistral 7B: hidden_size=4096, num_hidden_layers=32, num_attention_heads=32, num_key_value_heads=8, intermediate_size=14336, vocab_size=32000 (confirmed from imaddabbura.github.io, emergentmind.com).


B.4 The Complete Memory Calculator

Here is the full calculator. It computes memory for inference, LoRA fine-tuning, and full training. Every formula has been explained in earlier chapters; this code simply combines them.

import math


def calc_weight_memory_gb(total_params_b, bytes_per_param=2.0):
    """
    Model weight memory in GB.

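    bytes_per_param:
        4.0 = FP32
        2.0 = BF16/FP16
        1.0 = INT8/FP8
        0.5 = INT4/NVFP4

    Parameters in billions times bytes per parameter gives decimal GB
    (10**9 bytes). calc_kv_cache_gb divides by 1024**3 instead, so the
    two figures differ in unit by about 7%, well inside the overhead margin.
    """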
    return total_params_b * bytes_per_param


def calc_kv_cache_gb(num_layers, num_kv_heads, head_dim, seq_len,
                     batch_size=1, bytes_per_element=2.0,
                     uses_mla=False, kv_lora_rank=0, qk_rope_head_dim=0):
    """
    KV cache memory in GB.

    For standard GQA/MHA:
        bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

    For MLA (DeepSeek-V3):
        bytes_per_token = num_layers * (kv_lora_rank + qk_rope_head_dim) * bytes_per_element
        (MLA stores a single compressed latent vector plus a small RoPE key,
         instead of separate K and V vectors per head.)
    """
    if uses_mla:
        bytes_per_token = num_layers * (kv_lora_rank + qk_rope_head_dim) * bytes_per_element
    else:
        bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

    total_bytes = bytes_per_token * seq_len * batch_size
    return total_bytes / (1024 ** 3)


def calc_inference_memory_gb(total_params_b, num_layers, num_kv_heads,
                             head_dim, seq_len, batch_size=1,
                             weight_bytes=2.0, kv_bytes=2.0,
                             overhead_fraction=0.15,
                             uses_mla=False, kv_lora_rank=0,
                             qk_rope_head_dim=0):
    """
    Total GPU memory for inference.

    Returns a dict with the breakdown:
        weights, kv_cache, overhead, total (all in GB).
    """
    weights = calc_weight_memory_gb(total_params_b, weight_bytes)
    kv_cache = calc_kv_cache_gb(
        num_layers, num_kv_heads, head_dim, seq_len,
        batch_size, kv_bytes, uses_mla, kv_lora_rank, qk_rope_head_dim
    )
    overhead = weights * overhead_fraction
    total = weights + kv_cache + overhead
    return {
        "weights_gb": weights,
        "kv_cache_gb": kv_cache,
        "overhead_gb": overhead,
        "total_gb": total,
    }


def calc_training_memory_gb(total_params_b, num_layers, num_kv_heads,
                            head_dim, seq_len, batch_size=1,
                            precision="bf16", optimizer="adamw",
                            gradient_checkpointing=True,
                            hidden_size=None):
    """
    Total GPU memory for full-parameter training.

    Training memory has four components:
    1. Model weights
    2. Gradients (same size as weights)
    3. Optimizer states (AdamW stores 2 extra copies: momentum + variance)
    4. Activations (depends on seq_len, batch_size, and checkpointing)

    For mixed-precision training with AdamW:
        - Weights: 2 bytes/param (BF16)
        - FP32 master weights: 4 bytes/param
        - Gradients: 2 bytes/param (BF16)
        - Optimizer momentum (FP32): 4 bytes/param
        - Optimizer variance (FP32): 4 bytes/param
        Total: 16 bytes/param

    With FP32 training and AdamW:
        - Weights: 4 bytes/param
        - Gradients: 4 bytes/param
        - Optimizer momentum: 4 bytes/param
        - Optimizer variance: 4 bytes/param
        Total: 16 bytes/param (same total, different breakdown)
    """
    if precision == "bf16" and optimizer == "adamw":
        # Mixed-precision: BF16 weights + FP32 master copy + BF16 grads + FP32 optimizer.
        bytes_per_param = 2 + 4 + 2 + 4 + 4  # = 16 bytes/param
    elif precision == "fp32" and optimizer == "adamw":
        bytes_per_param = 4 + 4 + 4 + 4  # = 16 bytes/param
    elif precision == "bf16" and optimizer == "sgd":
        bytes_per_param = 2 + 4 + 2 + 4  # = 12 bytes/param (no variance)
    else:
        bytes_per_param = 16  # Default to AdamW mixed-precision.

    param_memory_gb = total_params_b * bytes_per_param

    # Activation memory estimate.
    # In theory, activations scale as O(sqrt(L) * seq_len * hidden_size) with
    # gradient checkpointing and O(L * seq_len * hidden_size) without it.
    # This calculator uses a simpler model: the full per-layer estimate,
    # scaled by a flat factor when checkpointing is enabled.
    if gradient_checkpointing:
        # Checkpointing reduces activation memory by ~60-70%.
        activation_factor = 0.35
    else:
        activation_factor = 1.0

    # Approximate activation memory: ~34 * hidden_size * seq_len * num_layers * 2 bytes
    # (from the Transformer arithmetic: each layer stores ~34 * hidden_size activations per token).
    if hidden_size is None:
        # Fallback heuristic: assume a 4:1 ratio of query heads to KV heads,
        # so hidden_size = 4 * num_kv_heads * head_dim. This holds for
        # LLaMA 3 8B (32 query heads, 8 KV heads) but not for LLaMA 3 70B
        # (64 query heads, 8 KV heads). Pass hidden_size explicitly for
        # accurate results.
        hidden_size = num_kv_heads * head_dim * 4
    activation_bytes = (34 * hidden_size * seq_len * num_layers
                        * 2 * batch_size * activation_factor)
    activation_gb = activation_bytes / (1024 ** 3)

    # Cap activation memory at a reasonable fraction of param memory.
    # In practice, activation memory for a single batch is typically
    # 20-60% of parameter memory with gradient checkpointing.
    activation_gb = min(activation_gb, param_memory_gb * 0.6)

    total = param_memory_gb + activation_gb
    return {
        "param_and_optimizer_gb": param_memory_gb,
        "activation_gb": activation_gb,
        "total_gb": total,
        "bytes_per_param": bytes_per_param,
    }


def calc_lora_memory_gb(total_params_b, lora_fraction=0.01,
                        weight_bytes=2.0, optimizer="adamw"):
    """
    Additional memory for LoRA fine-tuning on top of inference memory.

    LoRA (Chapter 28) adds small low-rank adapter matrices to a subset
    of the model's weight matrices. Only the adapter parameters are trained.

    lora_fraction: fraction of total params that are trainable LoRA params.
        Typical values: 0.005 to 0.02 (0.5% to 2%).

    The trainable LoRA parameters need:
        - LoRA weights: lora_params * weight_bytes
        - Gradients: lora_params * weight_bytes
        - Optimizer states: lora_params * 8 bytes (AdamW FP32 momentum + variance)
    """
    lora_params_b = total_params_b * lora_fraction

    if optimizer == "adamw":
        # LoRA weights (BF16) + gradients (BF16) + optimizer (FP32 momentum + variance).
        lora_bytes_per_param = weight_bytes + weight_bytes + 4 + 4
    else:
        lora_bytes_per_param = weight_bytes + weight_bytes + 4

    lora_memory_gb = lora_params_b * lora_bytes_per_param
    return {
        "lora_params_b": lora_params_b,
        "lora_memory_gb": lora_memory_gb,
    }


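def recommend_gpus(total_memory_gb):
    """
    Recommend GPU configurations based on total memory requirement.

    Returns a list of (name, capacity_gb, utilization_percent) tuples for
    every configuration that fits, in the order listed below. The list is
    not sorted by capacity, so the first entry is not always the smallest.
    """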
    gpus = [
        ("1x RTX 4090", 24),
        ("1x RTX 5090", 32),
        ("1x A100 80GB", 80),
        ("1x H100 80GB", 80),
        ("1x H200 141GB", 141),
        ("1x B200 180GB", 180),
        ("1x B300 288GB", 288),
        ("1x Mac Studio M3 Ultra 512GB", 512),
        ("2x H100 (NVLink)", 160),
        ("2x B200 (NVLink)", 360),
        ("4x H100", 320),
        ("4x B200", 720),
        ("8x H100 (DGX H100)", 640),
        ("8x B200 (DGX B200)", 1440),
        ("8x B300 (HGX B300)", 2304),
    ]

    fits = []
    for name, mem in gpus:
        # Reserve 8% of capacity for system overhead (applied to every
        # configuration, single-GPU or multi-GPU).
        usable = mem * 0.92
        if usable >= total_memory_gb:
            fits.append((name, mem, total_memory_gb / mem * 100))

    return fits
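Because the examples below take the first entry of the returned list, the order of the configuration list affects the recommendation. A variant that always returns the smallest configuration that fits could look like the sketch below. The name smallest_fit is hypothetical; the configuration list and the 92% usable fraction are copied from recommend_gpus.

```python
GPU_OPTIONS = [
    ("1x RTX 4090", 24), ("1x RTX 5090", 32),
    ("1x A100 80GB", 80), ("1x H100 80GB", 80),
    ("1x H200 141GB", 141), ("1x B200 180GB", 180),
    ("1x B300 288GB", 288), ("1x Mac Studio M3 Ultra 512GB", 512),
    ("2x H100 (NVLink)", 160), ("2x B200 (NVLink)", 360),
    ("4x H100", 320), ("4x B200", 720),
    ("8x H100 (DGX H100)", 640), ("8x B200 (DGX B200)", 1440),
    ("8x B300 (HGX B300)", 2304),
]


def smallest_fit(total_memory_gb, options=GPU_OPTIONS, usable_fraction=0.92):
    """Return (name, capacity_gb) of the smallest option that fits, or None."""
    candidates = [
        (name, mem) for name, mem in options
        if mem * usable_fraction >= total_memory_gb
    ]
    if not candidates:
        return None
    # Pick by capacity rather than by position in the list.
    return min(candidates, key=lambda item: item[1])


print(smallest_fit(19.4))
print(smallest_fit(300.0))
print(smallest_fit(5000.0))
```

Capacity is only one criterion. In practice bandwidth, interconnect, and cost also matter, so treat the result as a starting point.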



B.5 Running the Calculator: Inference Examples

Let us use the calculator on real models at different precisions and sequence lengths.

def format_gb(gb):
    """Format GB as human-readable string."""
    if gb >= 1024:
        return f"{gb / 1024:.2f} TB"
    elif gb >= 1:
        return f"{gb:.1f} GB"
    else:
        return f"{gb * 1024:.1f} MB"


# ── Example 1: LLaMA 3 8B at various precisions ──────────────
print("=" * 70)
print("INFERENCE: LLaMA 3 8B")
print("=" * 70)

for precision_name, weight_bytes in [("BF16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    result = calc_inference_memory_gb(
        total_params_b=8.0,
        num_layers=32,
        num_kv_heads=8,
        head_dim=128,
        seq_len=8192,
        batch_size=1,
        weight_bytes=weight_bytes,
        kv_bytes=2.0,
    )
    print(f"\n  {precision_name} weights, BF16 KV cache, 8K context:")
    print(f"    Weights:  {format_gb(result['weights_gb'])}")
    print(f"    KV cache: {format_gb(result['kv_cache_gb'])}")
    print(f"    Overhead: {format_gb(result['overhead_gb'])}")
    print(f"    TOTAL:    {format_gb(result['total_gb'])}")

    fits = recommend_gpus(result["total_gb"])
    if fits:
        print(f"    Fits on:  {fits[0][0]} ({fits[0][2]:.0f}% utilized)")

Running this produces:

======================================================================
INFERENCE: LLaMA 3 8B
======================================================================

  BF16 weights, BF16 KV cache, 8K context:
    Weights:  16.0 GB
    KV cache: 1.0 GB
    Overhead: 2.4 GB
    TOTAL:    19.4 GB
    Fits on:  1x RTX 4090 (81% utilized)

  INT8 weights, BF16 KV cache, 8K context:
    Weights:  8.0 GB
    KV cache: 1.0 GB
    Overhead: 1.2 GB
    TOTAL:    10.2 GB
    Fits on:  1x RTX 4090 (42% utilized)

  INT4 weights, BF16 KV cache, 8K context:
    Weights:  4.0 GB
    KV cache: 1.0 GB
    Overhead: 0.6 GB
    TOTAL:    5.6 GB
    Fits on:  1x RTX 4090 (23% utilized)

Key insight: an 8B model in BF16 fits on a single RTX 4090 (24 GB) with room to spare. At INT4, it uses less than a quarter of the GPU. The KV cache at 8K tokens is only 1 GB, which is negligible compared to the weights.

Now let us see what happens with a longer context:

# ── Example 2: LLaMA 3 8B at 128K context ────────────────────
print("\n" + "=" * 70)
print("INFERENCE: LLaMA 3 8B at 128K context")
print("=" * 70)

for seq_len in [8192, 32768, 131072]:
    result = calc_inference_memory_gb(
        total_params_b=8.0,
        num_layers=32,
        num_kv_heads=8,
        head_dim=128,
        seq_len=seq_len,
        batch_size=1,
        weight_bytes=2.0,
        kv_bytes=2.0,
    )
    print(f"\n  BF16, {seq_len:,} tokens:")
    print(f"    Weights:  {format_gb(result['weights_gb'])}")
    print(f"    KV cache: {format_gb(result['kv_cache_gb'])}")
    print(f"    TOTAL:    {format_gb(result['total_gb'])}")

Output:

======================================================================
INFERENCE: LLaMA 3 8B at 128K context
======================================================================

  BF16, 8,192 tokens:
    Weights:  16.0 GB
    KV cache: 1.0 GB
    TOTAL:    19.4 GB

  BF16, 32,768 tokens:
    Weights:  16.0 GB
    KV cache: 4.0 GB
    TOTAL:    22.4 GB

  BF16, 131,072 tokens:
    Weights:  16.0 GB
    KV cache: 16.0 GB
    TOTAL:    34.4 GB

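At 128K tokens, the KV cache (16.0 GB) is as large as the model weights (16.0 GB). This is the pattern Chapter 18 warned about: for long contexts, the KV cache dominates memory. (The 128K window applies to LLaMA 3.1 8B, which shares this architecture; the original LLaMA 3 8B is limited to 8,192 tokens, as the configuration in B.3 shows.) At 34.4 GB, an 8B model at 128K context in BF16 fits on neither a single RTX 4090 (24 GB) nor an RTX 5090 (32 GB). You would need a larger GPU such as an A100 or H100 (80 GB), or quantized weights: INT8 weights bring the total to 25.2 GB, which fits on an RTX 5090, and INT4 weights bring it to 20.6 GB, which fits on an RTX 4090.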
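The effect of quantization at 128K context can be sketched with the same arithmetic as calc_inference_memory_gb. The helper below is a condensed, self-contained restatement with LLaMA 3 8B defaults. The FP8 KV cache row (1 byte per element) is an assumption added for illustration and depends on the inference framework supporting it.

```python
def inference_total_gb(params_b, weight_bytes, kv_bytes, seq_len,
                       num_layers=32, num_kv_heads=8, head_dim=128,
                       batch_size=1, overhead_fraction=0.15):
    """Weights + KV cache + overhead, following the formulas in B.1."""
    weights = params_b * weight_bytes
    kv_cache = (2 * num_layers * num_kv_heads * head_dim * kv_bytes
                * seq_len * batch_size) / 1024**3
    return weights + kv_cache + weights * overhead_fraction


scenarios = [
    ("BF16 weights, BF16 KV cache", 2.0, 2.0),
    ("INT8 weights, BF16 KV cache", 1.0, 2.0),
    ("INT4 weights, BF16 KV cache", 0.5, 2.0),
    ("INT4 weights, FP8 KV cache", 0.5, 1.0),  # assumption: FP8 cache supported
]

for label, weight_bytes, kv_bytes in scenarios:
    total = inference_total_gb(8.0, weight_bytes, kv_bytes, seq_len=131072)
    print(f"{label:<30} {total:5.1f} GB")
```

Quantizing the weights alone leaves the 16 GB cache untouched, so at long contexts the cache precision becomes the larger lever.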


B.6 Large Model Examples

# ── Example 3: LLaMA 3 70B ───────────────────────────────────
print("=" * 70)
print("INFERENCE: LLaMA 3 70B")
print("=" * 70)

for precision_name, weight_bytes in [("BF16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    result = calc_inference_memory_gb(
        total_params_b=70.6,
        num_layers=80,
        num_kv_heads=8,
        head_dim=128,
        seq_len=8192,
        batch_size=1,
        weight_bytes=weight_bytes,
        kv_bytes=2.0,
    )
    print(f"\n  {precision_name}, 8K context:")
    print(f"    Weights:  {format_gb(result['weights_gb'])}")
    print(f"    KV cache: {format_gb(result['kv_cache_gb'])}")
    print(f"    TOTAL:    {format_gb(result['total_gb'])}")
    fits = recommend_gpus(result["total_gb"])
    if fits:
        print(f"    Fits on:  {fits[0][0]}")

Output:

======================================================================
INFERENCE: LLaMA 3 70B
======================================================================

  BF16, 8K context:
    Weights:  141.2 GB
    KV cache: 2.5 GB
    TOTAL:    164.9 GB
    Fits on:  1x B200 180GB

  INT8, 8K context:
    Weights:  70.6 GB
    KV cache: 2.5 GB
    TOTAL:    83.7 GB
    Fits on:  1x H200 141GB

  INT4, 8K context:
    Weights:  35.3 GB
    KV cache: 2.5 GB
    TOTAL:    43.1 GB
    Fits on:  1x A100 80GB

A 70B model in BF16 requires 141 GB for weights alone, which exceeds an 80 GB A100 or H100 and leaves no room for anything else on a 141 GB H200. With the KV cache and 15% overhead, the total is 164.9 GB. That is more than 2x H100 (160 GB) can hold, so the first configuration that passes the calculator's 92% usable-memory check is a single B200 (180 GB). At INT4 quantization, the model fits on a single A100 or H100 (80 GB) with plenty of room for the KV cache. The quality degradation from INT4 on a 70B model is generally modest, making this a practical deployment option.

# ── Example 4: LLaMA 3.1 405B ────────────────────────────────
print("\n" + "=" * 70)
print("INFERENCE: LLaMA 3.1 405B")
print("=" * 70)

for precision_name, weight_bytes in [("BF16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    result = calc_inference_memory_gb(
        total_params_b=405.0,
        num_layers=126,
        num_kv_heads=16,
        head_dim=128,
        seq_len=8192,
        batch_size=1,
        weight_bytes=weight_bytes,
        kv_bytes=2.0,
    )
    print(f"\n  {precision_name}, 8K context:")
    print(f"    Weights:  {format_gb(result['weights_gb'])}")
    print(f"    KV cache: {format_gb(result['kv_cache_gb'])}")
    print(f"    TOTAL:    {format_gb(result['total_gb'])}")
    fits = recommend_gpus(result["total_gb"])
    if fits:
        print(f"    Fits on:  {fits[0][0]}")
    else:
        print(f"    Does not fit on any single configuration listed.")

Output:

======================================================================
INFERENCE: LLaMA 3.1 405B
======================================================================

  BF16, 8K context:
    Weights:  810.0 GB
    KV cache: 7.9 GB
    TOTAL:    939.4 GB
    Fits on:  8x B200 (DGX B200)

  INT8, 8K context:
    Weights:  405.0 GB
    KV cache: 7.9 GB
    TOTAL:    473.6 GB
    Fits on:  4x B200

  INT4, 8K context:
    Weights:  202.5 GB
    KV cache: 7.9 GB
    TOTAL:    240.8 GB
    Fits on:  1x B300 288GB

The 405B model is a beast. In BF16, it needs over 800 GB for weights alone, requiring a full 8-GPU B200 node (1.4 TB). At INT8, the 473.6 GB total is below the 512 GB of a Mac Studio with M3 Ultra, but it narrowly fails the calculator's 92% usable-memory check (471 GB), so the first listed configuration that fits is 4x B200 (720 GB). Even if the model could be squeezed onto the Mac Studio, inference would be slow due to the lower memory bandwidth of LPDDR5X compared to HBM. At INT4, a single B300 (288 GB) can hold it.

At the full 128K context window:

# ── Example 5: LLaMA 3.1 405B at full 128K context ───────────
result = calc_inference_memory_gb(
    total_params_b=405.0,
    num_layers=126,
    num_kv_heads=16,
    head_dim=128,
    seq_len=131072,
    batch_size=1,
    weight_bytes=0.5,  # INT4
    kv_bytes=2.0,
)
print(f"\n405B INT4 at 128K context:")
print(f"  Weights:  {format_gb(result['weights_gb'])}")
print(f"  KV cache: {format_gb(result['kv_cache_gb'])}")
print(f"  TOTAL:    {format_gb(result['total_gb'])}")

Output:

405B INT4 at 128K context:
  Weights:  202.5 GB
  KV cache: 126.0 GB
  TOTAL:    358.9 GB

Even at INT4, the 405B model at 128K context needs 359 GB. The KV cache alone is 126 GB. This is why Chapter 18 spent so much time on KV cache compression: at long contexts, the cache can exceed the model weights.
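Since the KV cache scales linearly with batch size, the same arithmetic can be turned around to ask how many concurrent sequences fit in a given memory budget. The sketch below is illustrative: max_batch_size is a hypothetical helper, the 15% overhead matches the calculator's default, and the budget applies the same 92% usable fraction to an 8x B200 node.

```python
def max_batch_size(budget_gb, params_b, weight_bytes, num_layers,
                   num_kv_heads, head_dim, seq_len,
                   kv_bytes=2.0, overhead_fraction=0.15):
    """Largest number of full-length sequences whose KV caches fit in the budget."""
    weights = params_b * weight_bytes
    fixed = weights * (1 + overhead_fraction)  # weights + overhead
    per_sequence = (2 * num_layers * num_kv_heads * head_dim
                    * kv_bytes * seq_len) / 1024**3
    free = budget_gb - fixed
    if free < per_sequence:
        return 0
    return int(free // per_sequence)


# LLaMA 3.1 405B at INT4 with 128K-token sequences on an 8x B200 node.
budget = 1440 * 0.92
batch = max_batch_size(budget, params_b=405.0, weight_bytes=0.5,
                       num_layers=126, num_kv_heads=16, head_dim=128,
                       seq_len=131072)
print(f"Usable budget: {budget:.1f} GB, max concurrent 128K sequences: {batch}")
```

Each 128K-token sequence costs 126 GB of cache in this configuration, so even a full B200 node serves only a handful of such requests at once.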


B.7 MoE Models: Total vs. Active Parameters

Mixture-of-Experts models (Chapter 12) have a crucial distinction: total parameters (all expert weights combined) determine memory, while active parameters (the experts used per token) determine compute. For memory planning, you must use the total parameter count.

# ── Example 6: LLaMA 4 Maverick (MoE) ────────────────────────
print("=" * 70)
print("INFERENCE: LLaMA 4 Maverick (400B total, 17B active)")
print("=" * 70)

for precision_name, weight_bytes in [("BF16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    result = calc_inference_memory_gb(
        total_params_b=400.0,  # ALL experts must be in memory.
        num_layers=48,
        num_kv_heads=8,
        head_dim=128,
        seq_len=8192,
        batch_size=1,
        weight_bytes=weight_bytes,
        kv_bytes=2.0,
    )
    print(f"\n  {precision_name}, 8K context:")
    print(f"    Weights (all 128 experts): {format_gb(result['weights_gb'])}")
    print(f"    KV cache:                  {format_gb(result['kv_cache_gb'])}")
    print(f"    TOTAL:                     {format_gb(result['total_gb'])}")
    fits = recommend_gpus(result["total_gb"])
    if fits:
        print(f"    Fits on: {fits[0][0]}")

# ── Example 7: DeepSeek-V3 with MLA ──────────────────────────
print("\n" + "=" * 70)
print("INFERENCE: DeepSeek-V3 (671B total, 37B active, MLA)")
print("=" * 70)

for precision_name, weight_bytes in [("BF16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    result = calc_inference_memory_gb(
        total_params_b=671.0,
        num_layers=61,
        num_kv_heads=128,
        head_dim=128,
        seq_len=8192,
        batch_size=1,
        weight_bytes=weight_bytes,
        kv_bytes=2.0,
        uses_mla=True,
        kv_lora_rank=512,
        qk_rope_head_dim=64,
    )
    print(f"\n  {precision_name}, 8K context:")
    print(f"    Weights (all experts): {format_gb(result['weights_gb'])}")
    print(f"    KV cache (MLA):        {format_gb(result['kv_cache_gb'])}")
    print(f"    TOTAL:                 {format_gb(result['total_gb'])}")

Output:

======================================================================
INFERENCE: LLaMA 4 Maverick (400B total, 17B active)
======================================================================

  BF16, 8K context:
    Weights (all 128 experts): 800.0 GB
    KV cache:                  1.5 GB
    TOTAL:                     921.5 GB
    Fits on: 8x B200 (DGX B200)

  INT8, 8K context:
    Weights (all 128 experts): 400.0 GB
    KV cache:                  1.5 GB
    TOTAL:                     461.5 GB
    Fits on: 1x Mac Studio M3 Ultra 512GB

  INT4, 8K context:
    Weights (all 128 experts): 200.0 GB
    KV cache:                  1.5 GB
    TOTAL:                     231.5 GB
    Fits on: 1x B300 288GB

======================================================================
INFERENCE: DeepSeek-V3 (671B total, 37B active, MLA)
======================================================================

  BF16, 8K context:
    Weights (all experts): 1.31 TB
    KV cache (MLA):        549.0 MB
    TOTAL:                 1.51 TB

  INT8, 8K context:
    Weights (all experts): 671.0 GB
    KV cache (MLA):        549.0 MB
    TOTAL:                 772.2 GB

  INT4, 8K context:
    Weights (all experts): 335.5 GB
    KV cache (MLA):        549.0 MB
    TOTAL:                 386.4 GB

Two things stand out:

  1. MoE models need memory for all experts. LLaMA 4 Maverick activates only 17B parameters per token, but all 400B parameters must be loaded into GPU memory. The router (Chapter 12) can send any token to any expert, so every expert must be accessible.

  2. MLA dramatically reduces KV cache. DeepSeek-V3’s MLA compresses the KV cache to just 576 values per token per layer (kv_lora_rank 512 + qk_rope_head_dim 64), compared to 2 * 128 * 128 = 32,768 values for standard MHA. At 8K tokens, the KV cache is only 549 MB, which is negligible compared to the 671+ GB of weights. Even at 128K tokens, the MLA KV cache is only about 8.6 GB. The bottleneck for DeepSeek-V3 is entirely the weight memory.

Note: DeepSeek released V3.2 in late 2025 with 685B total parameters (up from 671B), but the same MLA architecture and 37B active parameters per token. The memory formulas are identical; simply substitute 685 for 671 in the weight calculation. At INT4, DeepSeek-V3.2 requires approximately 394 GB total (685 * 0.5 + 15% overhead + KV cache), compared to 386 GB for V3.


B.8 Training Memory: Why It Costs So Much More

Training requires far more memory than inference. The reason is simple: during training, you need to store not just the model weights, but also the gradients (one gradient value per parameter), the optimizer states (AdamW stores two additional values per parameter: the running mean and variance of the gradients), and the activations from the forward pass (needed for backpropagation).

The standard formula for mixed-precision training with AdamW:

| Component | Bytes per Parameter | Purpose |
| --- | --- | --- |
| BF16 weights | 2 | Current model weights |
| FP32 master weights | 4 | High-precision copy for stable updates |
| BF16 gradients | 2 | Gradient of loss with respect to each weight |
| FP32 momentum (AdamW) | 4 | Running mean of gradients |
| FP32 variance (AdamW) | 4 | Running mean of squared gradients |
| Total | 16 | |

That is 16 bytes per parameter, compared to 2 bytes per parameter for BF16 inference. Training a model requires 8x more memory per parameter than running it.
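The 16-byte figure is just the sum of the five components listed above. A small sketch makes that explicit; the dictionary and function names are illustrative choices, not part of the calculator itself.

```python
# Sketch: per-parameter byte budget for mixed-precision AdamW training.
# Names are illustrative; the byte counts come from the table above.

TRAINING_BYTES = {
    "bf16_weights": 2,
    "fp32_master_weights": 4,
    "bf16_gradients": 2,
    "fp32_adam_momentum": 4,
    "fp32_adam_variance": 4,
}
INFERENCE_BYTES_BF16 = 2


def training_state_gb(params_billions: float) -> float:
    """Weights + gradients + optimizer states, in GB (activations excluded)."""
    return params_billions * sum(TRAINING_BYTES.values())


bytes_per_param = sum(TRAINING_BYTES.values())
ratio = bytes_per_param / INFERENCE_BYTES_BF16

print(f"Training bytes per parameter: {bytes_per_param}")
print(f"Ratio to BF16 inference:      {ratio:.0f}x")
print(f"State for an 8B model:        {training_state_gb(8.0):.0f} GB")
```

Keeping the components in a dictionary makes it easy to model variants, for example dropping the FP32 master copy or swapping in an 8-bit optimizer, by editing a single entry.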

# ── Example 8: Training memory for LLaMA 3 8B ────────────────
print("=" * 70)
print("TRAINING: LLaMA 3 8B (full-parameter, AdamW, BF16)")
print("=" * 70)

params_gb = 8.0 * 16  # 16 bytes/param
print(f"  Parameter + optimizer memory: {params_gb:.0f} GB")
print(f"  Plus activations (with gradient checkpointing): ~30-50 GB")
print(f"  Estimated total: ~160-180 GB")
print(f"  Minimum hardware: 2x H100 80GB or 1x H200 141GB + offloading")

print()

# ── Example 9: Training memory for LLaMA 3 70B ───────────────
print("=" * 70)
print("TRAINING: LLaMA 3 70B (full-parameter, AdamW, BF16)")
print("=" * 70)

params_gb = 70.6 * 16
print(f"  Parameter + optimizer memory: {params_gb:.0f} GB")
print(f"  Plus activations: ~200-400 GB (depends on batch size)")
print(f"  Estimated total: ~1.3-1.5 TB")
print(f"  Minimum hardware: 8x H100 (640 GB) with ZeRO-3 sharding")
print(f"  Comfortable: 8x B200 (1.4 TB)")

Output:

======================================================================
TRAINING: LLaMA 3 8B (full-parameter, AdamW, BF16)
======================================================================
  Parameter + optimizer memory: 128 GB
  Plus activations (with gradient checkpointing): ~30-50 GB
  Estimated total: ~160-180 GB
  Minimum hardware: 2x H100 80GB or 1x H200 141GB + offloading

======================================================================
TRAINING: LLaMA 3 70B (full-parameter, AdamW, BF16)
======================================================================
  Parameter + optimizer memory: 1130 GB
  Plus activations: ~200-400 GB (depends on batch size)
  Estimated total: ~1.3-1.5 TB
  Minimum hardware: 8x H100 (640 GB) with ZeRO-3 sharding
  Comfortable: 8x B200 (1.4 TB)

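This is why full-parameter training of large models is so expensive. A 70B model needs over 1.1 TB just for parameters and optimizer states, before any activations. ZeRO-3 (from DeepSpeed) shards the optimizer states, gradients, and parameters across all GPUs, so each GPU holds only 1/8th of the total, and they communicate as needed. On 8x H100 (640 GB total), however, 1/8th of 1,130 GB is still about 141 GB per GPU, more than the 80 GB each card provides. Sharding alone is therefore not enough: the 8x H100 minimum above also relies on offloading optimizer states to CPU memory (for example, with DeepSpeed's ZeRO-Offload), or on adding more GPUs. This works, but the communication overhead slows training by 10-30%, and offloading slows it further.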
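Before committing to a cluster size, it is worth checking what each GPU actually has to hold once the training state is sharded. The sketch below assumes perfectly even ZeRO-3 sharding and ignores activations, communication buffers, and fragmentation; both helper names are hypothetical.

```python
import math

# Sketch: per-GPU share of the training state under ideal ZeRO-3 sharding.
# Assumes an even split and ignores activations and communication buffers.


def sharded_state_per_gpu_gb(params_billions: float, num_gpus: int,
                             bytes_per_param: int = 16) -> float:
    """GB of weights + gradients + optimizer state each GPU must hold."""
    return params_billions * bytes_per_param / num_gpus


def min_gpus_for_state(params_billions: float, gpu_memory_gb: float,
                       bytes_per_param: int = 16) -> int:
    """Smallest GPU count whose combined memory holds the sharded state alone."""
    return math.ceil(params_billions * bytes_per_param / gpu_memory_gb)


per_gpu = sharded_state_per_gpu_gb(70.6, num_gpus=8)
print(f"LLaMA 3 70B on 8 GPUs: {per_gpu:.1f} GB of state per GPU")
print(f"Fits in 80 GB without offload: {per_gpu <= 80}")
print(f"GPUs needed for state alone (80 GB each):  {min_gpus_for_state(70.6, 80)}")
print(f"GPUs needed for state alone (180 GB each): {min_gpus_for_state(70.6, 180)}")
```

These counts are lower bounds: activations come on top, so a real configuration needs either more GPUs than this or CPU offloading.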


B.9 LoRA Fine-Tuning: The Practical Middle Ground

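LoRA (Chapter 28) makes fine-tuning dramatically cheaper by training only a small set of low-rank adapter matrices. Instead of paying the training cost for all parameters, you pay it only for the LoRA adapters (typically 0.5-2% of total parameters), plus the inference cost of loading the frozen base model weights. The examples below budget 12 bytes per adapter parameter rather than 16: BF16 weights (2), BF16 gradients (2), and the two FP32 AdamW states (4 + 4), with no separate FP32 master copy.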

# ── Example 10: LoRA fine-tuning LLaMA 3 8B ──────────────────
print("=" * 70)
print("LoRA FINE-TUNING: LLaMA 3 8B")
print("=" * 70)

# Base model weights (frozen, BF16).
base_weights_gb = 8.0 * 2  # 16 GB

# LoRA adapters: ~1% of parameters = 80M params.
lora_params_b = 8.0 * 0.01  # 0.08B = 80M
lora_memory_gb = lora_params_b * 12  # 12 bytes/param: BF16 weights (2) + BF16 grads (2) + FP32 AdamW moments (4 + 4)

# Activations for LoRA (much smaller than full training).
activation_gb = 4.0  # Rough estimate for batch_size=1, seq_len=2048.

total = base_weights_gb + lora_memory_gb + activation_gb
print(f"  Base model weights (BF16, frozen): {base_weights_gb:.1f} GB")
print(f"  LoRA adapters ({lora_params_b * 1000:.0f}M params):     {lora_memory_gb:.2f} GB")
print(f"  Activations (estimated):           {activation_gb:.1f} GB")
print(f"  TOTAL:                             {total:.1f} GB")
print(f"  Fits on: 1x RTX 4090 (24 GB)")

print()

# ── Example 11: QLoRA fine-tuning LLaMA 3 70B ────────────────
print("=" * 70)
print("QLoRA FINE-TUNING: LLaMA 3 70B (INT4 base + LoRA)")
print("=" * 70)

# Base model weights (frozen, INT4).
base_weights_gb = 70.6 * 0.5  # 35.3 GB

# LoRA adapters: ~0.5% of parameters = 353M params.
lora_params_b = 70.6 * 0.005  # 0.353B
lora_memory_gb = lora_params_b * 12

# Activations.
activation_gb = 8.0

total = base_weights_gb + lora_memory_gb + activation_gb
print(f"  Base model weights (INT4, frozen): {base_weights_gb:.1f} GB")
print(f"  LoRA adapters ({lora_params_b * 1000:.0f}M params):    {lora_memory_gb:.2f} GB")
print(f"  Activations (estimated):           {activation_gb:.1f} GB")
print(f"  TOTAL:                             {total:.1f} GB")
print(f"  Fits on: 1x A100 80GB or 1x H100 80GB")

Output:

======================================================================
LoRA FINE-TUNING: LLaMA 3 8B
======================================================================
  Base model weights (BF16, frozen): 16.0 GB
  LoRA adapters (80M params):     0.96 GB
  Activations (estimated):           4.0 GB
  TOTAL:                             21.0 GB
  Fits on: 1x RTX 4090 (24 GB)

======================================================================
QLoRA FINE-TUNING: LLaMA 3 70B (INT4 base + LoRA)
======================================================================
  Base model weights (INT4, frozen): 35.3 GB
  LoRA adapters (353M params):    4.24 GB
  Activations (estimated):           8.0 GB
  TOTAL:                             47.5 GB
  Fits on: 1x A100 80GB or 1x H100 80GB

This is the power of LoRA and QLoRA. Fine-tuning an 8B model fits on a single consumer GPU. Fine-tuning a 70B model with QLoRA (INT4 base weights + LoRA adapters) fits on a single data center GPU. Compare this to full-parameter training of the 70B model, which needs over 1 TB.
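The two examples above follow the same pattern, so they can be folded into one helper. This is a sketch under the same assumptions as Examples 10 and 11 (12 bytes per adapter parameter, a hand-picked activation estimate); the function name and argument names are illustrative.

```python
# Sketch: generic LoRA / QLoRA memory estimate, mirroring Examples 10 and 11.
# The 12 bytes per adapter parameter and the activation figures are the same
# rough assumptions used above, not measured values.


def lora_finetune_gb(params_billions: float, base_bytes_per_param: float,
                     lora_fraction: float, activation_gb: float,
                     adapter_bytes_per_param: float = 12.0) -> dict:
    """Return the memory breakdown (in GB) for a LoRA-style fine-tune."""
    base = params_billions * base_bytes_per_param
    adapters = params_billions * lora_fraction * adapter_bytes_per_param
    return {
        "base": base,
        "adapters": adapters,
        "activations": activation_gb,
        "total": base + adapters + activation_gb,
    }


scenarios = {
    "LLaMA 3 8B, BF16 LoRA": lora_finetune_gb(8.0, 2.0, 0.01, 4.0),
    "LLaMA 3 8B, INT4 QLoRA": lora_finetune_gb(8.0, 0.5, 0.01, 4.0),
    "LLaMA 3 70B, INT4 QLoRA": lora_finetune_gb(70.6, 0.5, 0.005, 8.0),
}
for name, breakdown in scenarios.items():
    print(f"{name:<26} total: {breakdown['total']:.1f} GB")
```

Returning the breakdown as a dictionary rather than a single number keeps it obvious which term dominates: for QLoRA on a 70B model it is still the frozen base weights.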


B.10 Quick Reference Tables

These tables summarize the memory requirements for the most common scenarios. All values are computed using the formulas above.

Inference Memory (Single Request, 8K Context)

| Model | BF16 | INT8/FP8 | INT4 | Minimum GPU (INT4) |
| --- | --- | --- | --- | --- |
| Mistral 7B | 17.8 GB | 9.4 GB | 5.2 GB | RTX 4090 (24 GB) |
| LLaMA 3 8B | 19.4 GB | 10.2 GB | 5.6 GB | RTX 4090 (24 GB) |
| Qwen3-8B | 20.0 GB | 10.6 GB | 5.8 GB | RTX 4090 (24 GB) |
| LLaMA 3 70B | 164.9 GB | 83.7 GB | 43.1 GB | A100/H100 (80 GB) |
| LLaMA 4 Maverick (400B) | 921.5 GB | 461.5 GB | 231.5 GB | B300 (288 GB) |
| LLaMA 3.1 405B | 939.4 GB | 473.6 GB | 240.8 GB | B300 (288 GB) |
| DeepSeek-V3 (671B) | 1,544 GB | 772 GB | 386 GB | 4x B200 (720 GB) |

Note: MoE models (Maverick, DeepSeek-V3) require memory for all experts, not just the active parameters. The KV cache at 8K tokens is small (under 8 GB for all models listed), so the weight memory dominates at short contexts. The “Minimum GPU” column shows the smallest practical GPU configuration (a consumer RTX 4090 for the 7-8B models, data center GPUs for the rest); a Mac Studio M3 Ultra (512 GB) can technically hold DeepSeek-V3 at INT4, but its lower memory bandwidth makes MoE inference impractical at production speeds.
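Each cell in the inference table is weights plus 15% overhead plus the 8K KV cache. The sketch below reproduces a few rows; the function name is illustrative, the overhead is applied to the weights only (matching the earlier examples), and the KV cache figures are taken from the KV cache table that follows.

```python
# Sketch: total inference memory = weights * (1 + overhead) + KV cache.
# The 15% overhead is applied to the weights only, as in the examples above.


def inference_total_gb(params_billions: float, bytes_per_param: float,
                       kv_cache_gb: float, overhead_fraction: float = 0.15) -> float:
    """Estimated GB needed to serve one request at the given precision."""
    weights = params_billions * bytes_per_param
    return weights * (1 + overhead_fraction) + kv_cache_gb


PRECISIONS = {"BF16": 2.0, "INT8/FP8": 1.0, "INT4": 0.5}
# (name, total parameters in billions, KV cache at 8K tokens in GB)
MODELS = [
    ("LLaMA 3 8B", 8.0, 1.0),
    ("LLaMA 4 Maverick", 400.0, 1.5),
]

for name, params_b, kv_gb in MODELS:
    cells = [f"{label}: {inference_total_gb(params_b, nbytes, kv_gb):.1f} GB"
             for label, nbytes in PRECISIONS.items()]
    print(f"{name:<18} " + " | ".join(cells))
```

For MoE models, pass the total parameter count (400 for Maverick), not the active count, for the reason given in the note above.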

KV Cache Memory (BF16, Single Request)

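| Model | Layers | KV Heads | Per Token | 8K | 32K | 128K |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 3 8B | 32 | 8 | 128 KB | 1.0 GB | 4.0 GB | 16.0 GB |
| Qwen3-8B | 36 | 8 | 144 KB | 1.1 GB | 4.5 GB | 18.0 GB |
| LLaMA 3 70B | 80 | 8 | 320 KB | 2.5 GB | 10.0 GB | 40.0 GB |
| LLaMA 3.1 405B | 126 | 16 | 1,008 KB | 7.9 GB | 31.5 GB | 126.0 GB |
| LLaMA 4 Maverick | 48 | 8 | 192 KB | 1.5 GB | 6.0 GB | 24.0 GB |
| DeepSeek-V3 (MLA) | 61 | n/a | 68.6 KB | 0.5 GB | 2.1 GB | 8.6 GB |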

The DeepSeek-V3 row uses the MLA formula: 61 layers * (512 + 64) * 2 bytes = 70,272 bytes per token (approximately 68.6 KB). This is roughly 2.8x smaller than LLaMA 4 Maverick’s KV cache per token, despite DeepSeek-V3 being a much larger model.
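The GQA rows of the table can be regenerated from the layer and KV-head counts alone. The sketch below assumes a head dimension of 128 for every model listed (the same value Example 12 uses for LLaMA 3 70B) and BF16 cache entries; the function name is illustrative.

```python
# Sketch: regenerate the GQA rows of the KV-cache table.
# head_dim=128 is an assumption applied to every model in the list.


def gqa_kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                           head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Bytes of K and V cached per token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# (name, layers, KV heads) as listed in the table above
MODELS = [
    ("LLaMA 3 8B", 32, 8),
    ("Qwen3-8B", 36, 8),
    ("LLaMA 3 70B", 80, 8),
    ("LLaMA 3.1 405B", 126, 16),
    ("LLaMA 4 Maverick", 48, 8),
]

for name, layers, kv_heads in MODELS:
    per_token = gqa_kv_bytes_per_token(layers, kv_heads)
    sizes_gb = [per_token * k * 1024 / 1024**3 for k in (8, 32, 128)]
    cells = "  ".join(f"{size:6.1f} GB" for size in sizes_gb)
    print(f"{name:<18} {per_token / 1024:>6,.0f} KB/token  {cells}")
```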

Training Memory (Full-Parameter, AdamW, BF16 Mixed-Precision)

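| Model | Params + Optimizer | Activations (est.) | Total (est.) | Minimum Hardware |
| --- | --- | --- | --- | --- |
| LLaMA 3 8B | 128 GB | 30-50 GB | 160-180 GB | 2x H100 or 1x B200 |
| LLaMA 3 70B | 1,130 GB | 200-400 GB | 1.3-1.5 TB | 8x H100 + ZeRO-3 with CPU offload |
| LLaMA 3.1 405B | 6,480 GB | 1-2 TB | 7.5-8.5 TB | 64+ H100 or 32+ B200 (with offloading) |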

LoRA/QLoRA Fine-Tuning Memory

| Model | Base Precision | LoRA % | Total (est.) | Minimum GPU |
| --- | --- | --- | --- | --- |
| LLaMA 3 8B | BF16 | 1% | ~21 GB | RTX 4090 (24 GB) |
| LLaMA 3 8B | INT4 (QLoRA) | 1% | ~9 GB | RTX 4090 (24 GB) |
| Qwen3-8B | BF16 | 1% | ~22 GB | RTX 4090 (24 GB) |
| LLaMA 3 70B | INT4 (QLoRA) | 0.5% | ~48 GB | A100/H100 (80 GB) |
| LLaMA 4 Maverick | INT4 (QLoRA) | 0.5% | ~224 GB | B300 (288 GB) |

B.11 Batch Size and Serving Capacity

The examples above assume a single request (batch_size=1). In production, servers handle many concurrent requests. Each request needs its own KV cache, so the total KV cache memory scales linearly with batch size.

# ── Example 12: Serving capacity for LLaMA 3 70B ─────────────
print("=" * 70)
print("SERVING CAPACITY: LLaMA 3 70B on 8x H100 (640 GB total)")
print("=" * 70)

weight_memory = 70.6 * 2  # BF16 = 141.2 GB
overhead = weight_memory * 0.15  # ~21 GB
available_for_kv = 640 - weight_memory - overhead  # ~478 GB
kv_per_token = 2 * 80 * 8 * 128 * 2  # 327,680 bytes = 320 KB

print(f"  Model weights (BF16):     {weight_memory:.1f} GB")
print(f"  Overhead:                 {overhead:.1f} GB")
print(f"  Available for KV cache:   {available_for_kv:.1f} GB")
print(f"  KV cache per token:       {kv_per_token / 1024:.0f} KB")
print()

for avg_seq_len in [2000, 8000, 32000, 128000]:
    kv_per_request = kv_per_token * avg_seq_len
    max_requests = int(available_for_kv * 1024**3 / kv_per_request)
    print(f"  At {avg_seq_len:>7,} tokens/request: {max_requests:>5} concurrent requests")

Output:

======================================================================
SERVING CAPACITY: LLaMA 3 70B on 8x H100 (640 GB total)
======================================================================
  Model weights (BF16):     141.2 GB
  Overhead:                 21.2 GB
  Available for KV cache:   477.6 GB
  KV cache per token:       320 KB

  At   2,000 tokens/request:   782 concurrent requests
  At   8,000 tokens/request:   195 concurrent requests
  At  32,000 tokens/request:    48 concurrent requests
  At 128,000 tokens/request:    12 concurrent requests

This is the fundamental tradeoff of LLM serving: longer contexts mean fewer concurrent users on the same hardware. At 2K tokens per request, you can serve nearly 800 users simultaneously. At 128K tokens, you can serve 12. The hardware cost per user scales linearly with context length.
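The same arithmetic can be run in the other direction: given a target number of concurrent users and a context length, how much memory does the cluster need? The sketch below reuses the LLaMA 3 70B figures from Example 12; the function names are illustrative, and like that example it treats the GB of the GPU spec and the 1024^3-byte GB of the KV cache as the same unit.

```python
# Sketch: memory required for a target concurrency (inverse of Example 12).
# Uses the LLaMA 3 70B BF16 figures from above; helper names are illustrative.

KV_BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2  # 327,680 bytes, as in Example 12


def kv_gb_per_request(seq_len: int,
                      kv_bytes_per_token: int = KV_BYTES_PER_TOKEN) -> float:
    """KV cache for a single request of seq_len tokens, in GB."""
    return kv_bytes_per_token * seq_len / 1024**3


def memory_needed_gb(concurrent_requests: int, seq_len: int,
                     weight_gb: float = 141.2,
                     overhead_fraction: float = 0.15) -> float:
    """Weights + overhead + KV cache for all concurrent requests, in GB."""
    overhead_gb = weight_gb * overhead_fraction
    return weight_gb + overhead_gb + concurrent_requests * kv_gb_per_request(seq_len)


for seq_len in (2000, 8000, 32000, 128000):
    per_request = kv_gb_per_request(seq_len)
    total = memory_needed_gb(100, seq_len)
    print(f"{seq_len:>7,} tokens: {per_request:6.2f} GB/request, "
          f"{total:8.1f} GB for 100 concurrent requests")
```

Because the per-request cost is linear in `seq_len`, a 64x longer context costs 64x more KV memory per user, which is the linear scaling described above.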


B.12 Practical Decision Flowchart

Here is a simplified decision process for choosing hardware:

  1. Determine your scenario: inference only, LoRA fine-tuning, or full training.

  2. Calculate weight memory: total_params * bytes_per_parameter. For MoE models, use total parameters (not active).

  3. Add KV cache (inference/fine-tuning): use the formula from B.1. Multiply by your expected batch size.

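  4. Add optimizer states (training): multiply trainable parameters by 16 bytes/param for AdamW. This figure already includes the BF16 weights from step 2, so for full training use it in place of step 2 rather than on top of it; for LoRA, apply the per-parameter training cost only to the adapter parameters (the examples in B.9 use 12 bytes/param).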

  5. Add overhead: 15% for inference, 20-30% for training.

  6. Compare to available GPU memory:

    • Under 24 GB: RTX 4090 or RTX 5090
    • 24-80 GB: A100 or H100
    • 80-141 GB: H200
    • 141-180 GB: B200
    • 180-288 GB: B300
    • 288-512 GB: Mac Studio M3 Ultra (slow but works) or 2-4x data center GPUs
    • 512 GB+: Multi-GPU setup required

  7. Consider quantization: FP8 is the best default for data center GPUs (effectively lossless quality, half the memory of BF16). On Blackwell GPUs (B200, B300), NVFP4 offers approximately 1.8x memory reduction over FP8 (slightly less than the nominal 2x, because each block of 16 values also stores an FP8 scale factor) with typically 1% or less accuracy degradation on large models, and is the fastest inference precision available. INT4 reduces weight memory by 4x compared to BF16 and works on all GPU generations. For inference, INT4 quality is acceptable for most 70B+ models. For 7-8B models, FP8 or INT8 is usually the sweet spot (INT4 can noticeably degrade quality at smaller scales).
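For the inference case, steps 2, 3, 5, and 6 of the flowchart can be sketched as a small lookup. The tier boundaries are the ones listed in step 6; the function and constant names are hypothetical, the 15% overhead is applied to the weights only (as in the earlier examples), and training scenarios are deliberately left out.

```python
# Sketch: inference-only version of the decision flowchart (steps 2, 3, 5, 6).
# Tier limits come from step 6; names and structure are illustrative.

GPU_TIERS = [
    (24, "RTX 4090 or RTX 5090"),
    (80, "A100 or H100"),
    (141, "H200"),
    (180, "B200"),
    (288, "B300"),
    (512, "Mac Studio M3 Ultra or 2-4x data center GPUs"),
]


def estimate_inference_gb(total_params_b: float, bytes_per_param: float,
                          kv_cache_gb_per_request: float, batch_size: int = 1,
                          overhead_fraction: float = 0.15) -> float:
    """Steps 2, 3, and 5: weights, KV cache for the batch, and overhead."""
    weights = total_params_b * bytes_per_param          # step 2 (total, not active)
    kv_cache = kv_cache_gb_per_request * batch_size     # step 3
    return weights * (1 + overhead_fraction) + kv_cache  # step 5


def pick_gpu_tier(total_gb: float) -> str:
    """Step 6: smallest tier whose memory covers the estimate."""
    for limit_gb, hardware in GPU_TIERS:
        if total_gb <= limit_gb:
            return hardware
    return "Multi-GPU setup required"


for name, params_b, nbytes, kv_gb in [
    ("LLaMA 3 8B, BF16", 8.0, 2.0, 1.0),
    ("LLaMA 3 70B, INT4", 70.6, 0.5, 2.5),
    ("LLaMA 4 Maverick, INT4", 400.0, 0.5, 1.5),
]:
    total = estimate_inference_gb(params_b, nbytes, kv_gb)
    print(f"{name:<24} {total:7.1f} GB -> {pick_gpu_tier(total)}")
```

Raising `batch_size` shows how quickly a deployment moves up a tier once several long-context requests share the same GPU.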


B.13 Key Takeaways

  • Inference memory is dominated by model weights at short contexts and by the KV cache at long contexts. The crossover point depends on the model: for LLaMA 3 8B in BF16, the KV cache equals the weight memory at approximately 128K tokens (both are 16 GB).

  • Training memory is approximately 8x inference memory per parameter, because AdamW stores 16 bytes per parameter (weights + master weights + gradients + momentum + variance) compared to 2 bytes for BF16 inference.

  • LoRA fine-tuning reduces the training memory overhead to a small fraction (typically 0.5-2% of parameters), making it possible to fine-tune an 8B model on a single RTX 4090 or a 70B model on a single A100/H100 with QLoRA.

  • MoE models require memory for all experts, not just the active ones. LLaMA 4 Maverick has 400B total parameters but only 17B active per token; you still need GPU memory for all 400B.

  • MLA (DeepSeek-V3) dramatically reduces KV cache memory by compressing key-value information into a low-rank latent vector. At 128K tokens, DeepSeek-V3’s KV cache is approximately 8.6 GB, compared to 126 GB for LLaMA 3.1 405B.

  • Quantization is the most practical lever for reducing memory. INT4 quantization reduces weight memory by 4x compared to BF16, often making the difference between needing one GPU and needing eight. FP8 quantization (1 byte per parameter, same memory as INT8) is effectively lossless and is the default for production inference on Blackwell and Hopper GPUs as of March 2026. On Blackwell GPUs, NVFP4 (nominally 0.5 bytes per parameter, like INT4, plus a small per-block scale-factor overhead) provides approximately 1.8x memory reduction over FP8 with typically 1% or less accuracy degradation on large models, and is the fastest inference precision available on B200/B300 hardware.

  • Batch size is the hidden cost of production serving. Each concurrent request needs its own KV cache. A LLaMA 3 70B server on 8x H100 GPUs can handle ~782 concurrent requests at 2K tokens each, but only ~12 at 128K tokens.

  • The GPU landscape as of March 2026 spans from 24 GB (RTX 4090) to 288 GB (B300) per chip, with Apple’s Mac Studio M3 Ultra offering up to 512 GB of unified memory for local inference. Multi-GPU nodes scale to 2.3 TB (8x B300). The B200 provides 180 GB of usable HBM3e per GPU in production DGX/HGX systems.
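The crossover point mentioned in the first takeaway is simply the weight memory divided by the KV cache per token. A minimal sketch, with an illustrative function name, following the takeaway's convention of treating the weight GB and the 1024^3-byte KV cache GB as the same unit:

```python
# Sketch: context length at which the KV cache equals the weight memory.
# Treats 1 GB as 1024**3 bytes for both sides, as the takeaway above does.


def crossover_tokens(params_billions: float, bytes_per_param: float,
                     kv_bytes_per_token: int) -> float:
    """Tokens of context at which one request's KV cache matches the weights."""
    weight_bytes = params_billions * bytes_per_param * 1024**3
    return weight_bytes / kv_bytes_per_token


llama3_8b = crossover_tokens(8.0, 2, 128 * 1024)     # 128 KB per token
llama3_70b = crossover_tokens(70.6, 2, 320 * 1024)   # 320 KB per token

print(f"LLaMA 3 8B  (BF16): {llama3_8b:,.0f} tokens")
print(f"LLaMA 3 70B (BF16): {llama3_70b:,.0f} tokens")
```

With batching, the crossover arrives proportionally sooner: at a batch size of 16, the combined KV cache of a LLaMA 3 8B server matches its weights at 8K tokens per request.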


Sources: GPU specifications: RTX 4090 24 GB GDDR6X (techpowerup.com/gpu-specs/geforce-rtx-4090.c3889). RTX 5090 32 GB GDDR7, launched January 30, 2025 (beebom.com, pcgamesn.com, techpowerup.com). A100 SXM 80 GB HBM2e, 2,039 GB/s (techpowerup.com, horizoniq.com). H100 SXM 80 GB HBM3, 3,350 GB/s (glennklockwood.com/garden/processors/H100). H200 141 GB HBM3e, 4,800 GB/s (rightnowai.co, runpod.io). B200 SXM 180 GB HBM3e usable per GPU (physical chip contains 192 GB across 8x 24 GB stacks; 12 GB reserved for ECC/yield), 8 TB/s; NVIDIA DGX B200 User Guide: “8 x NVIDIA B200 GPUs that provide 1,440 GB total GPU memory” = 180 GB each (docs.nvidia.com/dgx/dgxb200-user-guide/introduction-to-dgxb200.html); NVIDIA HGX AI Factory docs Table 1: “180GB HBM3e” per B200 SXM (docs.nvidia.com/enterprise-reference-architectures/hgx-ai-factory/latest/components.html); physical 192 GB confirmed from techpowerup.com/gpu-specs/b200-sxm-192-gb.c4210 (“NVIDIA has paired 180 GB HBM3e memory with the B200”); hyperpc.ae (“Physically the chip does contain 192 GB… in production systems the user typically gets 180 GB”). B300 288 GB HBM3e, 8 TB/s, shipped January 2026; HGX spec sheet: FP4 Matrix dense = 13,125 TFLOPS (~13 PFLOPS) per glennklockwood.com/garden/processors/B300; NVIDIA GTC 2025 keynote and Tom’s Hardware (tomshardware.com) cite 15 PFLOPS dense FP4; spheron.network reports 14 PFLOPS; discrepancy reflects different power/platform configurations (HGX vs DGX vs GB300 superchip). M4 Max up to 128 GB LPDDR5X, 546 GB/s (macrumors.com, apple.com). Mac Studio 2025 M3 Ultra up to 512 GB LPDDR5X, launched March 12, 2025 (apple.com/newsroom/2025/03/apple-unveils-new-mac-studio). Apple skipped M4 Ultra; M5 Ultra Mac Studio expected mid-2026 but not yet released as of March 2026 (macworld.com, macobserver.com, appleinsider.com, gadgethacks.com). 
FP8 quantization: Kurtic et al., “Give Me BF16 or Give Me Death?” arXiv:2411.02355, submitted November 4, 2024, accepted to ACL 2025; 500,000+ evaluations showing FP8 W8A8-FP is effectively lossless (arxiv.org/abs/2411.02355, huggingface.co/papers/2411.02355). NVFP4: NVIDIA’s native 4-bit floating-point format for Blackwell GPUs using micro-block scaling (groups of 16 values with FP8 E4M3 scale factors); “1% or less accuracy degradation on key language modeling tasks” for DeepSeek-R1-0528, “accuracy recovery improves with model size,” 3.5x memory reduction vs FP16, 1.8x vs FP8 (developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference, June 24, 2025); layer-wise sensitivity analysis in arXiv:2603.08747; SGLang FP4 benchmarks on B200 from huggingface.co/blog/apsys/blackwell-nvfp4-comparison; Red Hat NVFP4 guide from developers.redhat.com/articles/2026/02/04/accelerating-large-language-models-nvfp4-quantization. Model configurations: LLaMA 3 8B (continuumlabs.pro, emergentmind.com). LLaMA 3 70B: 80 layers, 8192 hidden, 64 heads, 8 KV heads, 28672 intermediate (emergentmind.com/topics/llama3-70b-model). LLaMA 3.1 405B: 126 layers, 16384 hidden, 128 heads, 16 KV heads, 53248 intermediate (cedricchee.com config.json). Qwen3-8B: 36 layers, 4096 hidden, 32 heads, 8 KV heads, 12288 intermediate, 151936 vocab (huggingface.co/Qwen/Qwen3-8B). LLaMA 4 Maverick: 48 layers, 5120 hidden, 40 heads, 8 KV heads, 128 experts, 202048 vocab (huggingface.co/docs/transformers/main/model_doc/llama4). DeepSeek-V3: 61 layers, kv_lora_rank=512, qk_rope_head_dim=64, 671B total, 37B active (arxiv.org/html/2412.19437v1, huggingface.co/deepseek-ai/DeepSeek-V3). DeepSeek-V3.2: 685B total, 37B active, same MLA architecture as V3 (huggingface.co/deepseek-ai/DeepSeek-V3.2, sebastianraschka.com/p/technical-deepseek). Mistral 7B: 32 layers, 4096 hidden, 32 heads, 8 KV heads, 14336 intermediate (imaddabbura.github.io, emergentmind.com). 
Training memory formula: 16 bytes/param for mixed-precision AdamW (2 BF16 weights + 4 FP32 master + 2 BF16 gradients + 4 FP32 momentum + 4 FP32 variance), confirmed from lyceum.technology/magazine/gpu-memory-requirements-transformer, arxiv.org/html/2602.23349 (FlashOptim paper confirming 16 bytes baseline), propelrc.com, spheron.network. Inference overhead 10-20% rule of thumb from spheron.network/blog/gpu-memory-requirements-llm (“total memory footprint in production can exceed 200 GB” for a 140 GB model). All numerical outputs verified via Python computation.