Appendix D. Timeline of Key Milestones (2017 to March 2026)

This book covers nearly a decade of rapid progress in language models, from the original Transformer paper to the trillion-parameter frontier models of 2026. This appendix organizes every major milestone referenced throughout the book into a single chronological timeline. Each entry includes the date, the event, and the chapter(s) where it is discussed in detail.

The pace of progress is striking. In 2017, the base Transformer was a 65-million-parameter translation model (213 million for the larger “big” variant). By March 2026, frontier models exceed a trillion parameters, process over a million tokens of context, see images, hear audio, use tools, and reason through multi-step problems autonomously.


How to Read This Timeline

Each entry follows this format:

  • Date: The specific release or publication date, verified via web search as of March 2026.
  • Event: What happened, with key numbers (parameters, context windows, benchmarks).
  • Chapter: Where this event is discussed in the book.

Events are grouped by year, then listed chronologically within each year. When an exact day is known, it is included. When only a month is confirmed, the entry is placed at the start of that month.


2017

Date | Event | Chapter
June 12 | Vaswani et al. submit “Attention Is All You Need” to arXiv (arXiv:1706.03762), introducing the Transformer architecture. The paper proposes self-attention as a replacement for recurrence and convolution, with a ~65M parameter model for machine translation. Presented at NeurIPS 2017 in December. | 1, 7, 10

The Transformer paper is the single most important event in this timeline. Every model, technique, and system described in this book builds directly on the architecture it introduced.


2018

Date | Event | Chapter
June | OpenAI publishes “Improving Language Understanding by Generative Pre-Training,” introducing GPT-1 with 117 million parameters. It demonstrates that unsupervised pre-training followed by supervised fine-tuning can achieve strong results across diverse NLP tasks. | 1
October 11 | Google publishes the BERT paper (arXiv:1810.04805), introducing bidirectional pre-training with masked language modeling. BERT dominates NLP benchmarks for the next two years and establishes the encoder-only Transformer as the standard for understanding tasks. | 1

2019

Date | Event | Chapter
February 14 | OpenAI releases GPT-2 with 1.5 billion parameters, initially withholding the full model due to concerns about misuse. GPT-2 demonstrates that scaling up a simple next-token prediction objective produces surprisingly coherent long-form text. The full 1.5B model is released on November 5, 2019. | 1, 27

2020

Date | Event | Chapter
May 28 | OpenAI publishes the GPT-3 paper (arXiv:2005.14165), describing a 175-billion-parameter model trained on 300 billion tokens with a 2,048-token context window. GPT-3 demonstrates strong few-shot learning: it can perform tasks from just a few examples in the prompt, without any fine-tuning. The API beta launches on June 11. | 1, 11, 13
July 28 | Zaheer et al. publish BigBird (arXiv:2007.14062, NeurIPS 2020), combining sliding window attention, global tokens, and random attention to handle sequences up to 8x longer than standard attention. Proven to be Turing complete. | 20

2021

Date | Event | Chapter
June 17 | Hu et al. publish LoRA (arXiv:2106.09685), introducing low-rank adaptation for fine-tuning large models. Instead of updating all parameters, LoRA freezes the pre-trained weights and injects small trainable low-rank matrices, reducing trainable parameters by up to 10,000x (reported for GPT-3 175B) while matching full fine-tuning performance. Presented at ICLR 2022. | 28
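
The low-rank idea in the LoRA entry can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the layer sizes, the rank, and the `alpha` scaling value are assumptions chosen for readability, and `lora_forward` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 512, 512, 4   # illustrative sizes (assumed, not from the paper)
alpha = 8.0                       # scaling hyperparameter (assumed value)

W = rng.normal(size=(d_out, d_in))              # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(rank, d_in))   # trainable, small random init
B = np.zeros((d_out, rank))                     # trainable, zero init so the update starts at 0

def lora_forward(x):
    """Compute W x plus the scaled low-rank update B (A x); W is never modified."""
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.normal(size=d_in)
y = lora_forward(x)

full_params = W.size              # parameters touched by full fine-tuning of this layer
lora_params = A.size + B.size     # parameters trained under LoRA for this layer
reduction = full_params / lora_params
```

For this toy layer the reduction is 64x. The much larger ratios quoted for real models come from applying small ranks to very wide layers and adapting only a subset of the weight matrices.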

2022

Date | Event | Chapter
January 27 | OpenAI announces InstructGPT via blog post, demonstrating that Reinforcement Learning from Human Feedback (RLHF) can align a 1.3B parameter model to be preferred by humans over the 175B GPT-3. The accompanying paper (arXiv:2203.02155, submitted March 4) describes the three-step process (SFT, reward modeling, PPO) that becomes the standard alignment recipe. Presented at NeurIPS 2022. | 15
March | Hoffmann et al. at DeepMind publish the Chinchilla scaling laws paper (arXiv:2203.15556), showing that most large models are undertrained. The key finding: for a given compute budget, model size and training data should be scaled in roughly equal proportions. This overturns the prevailing “bigger model, less data” approach and reshapes how every subsequent model is trained. | 13
May 27 | Tri Dao et al. publish FlashAttention (arXiv:2205.14135), an IO-aware exact attention algorithm that reduces GPU memory usage from O(n^2) to O(n) by tiling the computation in on-chip SRAM. FlashAttention becomes the default attention implementation in virtually all LLM training and inference systems. Presented at NeurIPS 2022. | 20, Appendix A
September 21 | OpenAI releases Whisper, an open-source speech recognition model trained on 680,000 hours of multilingual audio data. Whisper achieves near-human accuracy on English transcription and supports 99 languages. | 22
November 30 | OpenAI launches ChatGPT, a conversational interface built on GPT-3.5 (a fine-tuned version of GPT-3). ChatGPT reaches one million users in five days, becoming the fastest-growing consumer application in history at that time. It brings LLMs into mainstream public awareness. | 1, 15
December | Bai et al. at Anthropic publish Constitutional AI (arXiv:2212.08073), introducing a method where the model critiques and revises its own outputs based on a set of principles, reducing the need for human feedback in alignment. | 15, 26
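
The Chinchilla entry's “roughly equal proportions” rule can be made concrete with a small calculation. The sketch below rests on two assumptions that are not stated in this timeline: the common approximation that training compute is about 6 x parameters x tokens, and the often-quoted rule of thumb of roughly 20 training tokens per parameter. The function name `compute_optimal` is hypothetical.

```python
import math

def compute_optimal(c_flops, tokens_per_param=20.0):
    """Split a FLOP budget between model size N and training tokens D.

    Assumes C = 6 * N * D and D = tokens_per_param * N, which gives
    N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(c_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n1, d1 = compute_optimal(1e21)
n2, d2 = compute_optimal(4e21)   # four times the compute budget

# Both quantities grow with the square root of compute,
# so 4x compute means about 2x parameters and 2x tokens.
param_growth = n2 / n1
token_growth = d2 / d1
```

Under these assumptions, quadrupling compute doubles both the parameter count and the token count, which is the sense in which the two are scaled in equal proportion.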

2023

2023 is the year LLMs went from research curiosity to industry infrastructure. Open-weight models emerged as serious competitors to closed APIs, Mixture-of-Experts architectures proved their efficiency, and the first inference optimization systems laid the groundwork for production-scale serving.

Date | Event | Chapter
February 24 | Meta announces LLaMA (Large Language Model Meta AI), releasing models from 7B to 65B parameters to the research community. LLaMA demonstrates that smaller, well-trained models can match much larger ones, sparking the open-source LLM movement. | 1, 11
March 13 | Stanford releases Alpaca, fine-tuning LLaMA 7B on 52,000 instruction-response pairs generated by OpenAI’s text-davinci-003. Alpaca shows that instruction-following behavior can be distilled from a large model into a small one cheaply, launching a wave of open fine-tuned models. | 28
March 14 | OpenAI releases GPT-4, a multimodal model that accepts both text and image inputs. GPT-4 achieves human-level performance on many professional exams and represents a major leap in reasoning capability over GPT-3.5. | 1, 11, 16
May 23 | Dettmers et al. publish QLoRA (arXiv:2305.14314), combining 4-bit NormalFloat (NF4) quantization with LoRA to enable fine-tuning of a 65B parameter model on a single 48GB GPU while preserving full 16-bit fine-tuning performance. Presented at NeurIPS 2023. | 28
May 29 | Rafailov et al. publish Direct Preference Optimization (DPO) (arXiv:2305.18290), which eliminates the need for a separate reward model in RLHF by directly optimizing the policy with a simple classification loss on preference data. DPO becomes a widely adopted alternative to PPO for alignment. Presented at NeurIPS 2023. | 15
June 13 | OpenAI introduces function calling in the GPT API, allowing models to output structured JSON that triggers external tool execution. This is the foundational capability that enables agents. | 23
June 20 | vLLM launches with PagedAttention (Kwon et al., arXiv:2309.06180), an inference serving system that manages KV cache memory like virtual memory pages. vLLM achieves 2-4x higher throughput than existing systems. Presented at SOSP 2023. | 18, 24
July 17 | Tri Dao publishes FlashAttention-2 (arXiv:2307.08691), rewritten from scratch using NVIDIA CUTLASS 3.x primitives. FlashAttention-2 reaches 230 TFLOPs/s on A100 (up from 124 TFLOPs/s for FlashAttention-1), achieving 72% model FLOP utilization. Presented at ICLR 2024. | 20, Appendix A
July 18 | Meta releases LLaMA 2 in partnership with Microsoft, with 7B, 13B, and 70B parameter models under a permissive commercial license. This is the first time a frontier-class open model is freely available for commercial use. | 1, 11
September 27 | Mistral AI releases Mistral 7B under the Apache 2.0 license, a 7.3B parameter model that outperforms LLaMA 2 13B on most benchmarks. Mistral 7B introduces sliding window attention (window size 4,096) and Grouped Query Attention to the open-source ecosystem. | 8, 11, 20
November 6 | At OpenAI DevDay, parallel function calling is introduced, allowing models to invoke multiple tools simultaneously in a single response. | 23
December 6 | Google announces Gemini 1.0 (Ultra, Pro, Nano), its first natively multimodal model family, trained from the ground up on text, images, audio, and video. Gemini replaces the PaLM model family. | 1, 22
December 11 | Mistral AI releases Mixtral 8x7B, a Mixture-of-Experts model with 46.7B total parameters and ~12.9B active per token. Mixtral matches GPT-3.5 performance while running at a fraction of the cost, demonstrating that MoE architectures are viable for open-weight models. Released under Apache 2.0. | 12
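
The DPO entry describes a “simple classification loss on preference data.” A minimal sketch of that loss for a single preference pair follows. It takes summed log-probabilities as plain floats rather than calling a model, and the `beta` value and example numbers are assumptions for illustration only.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy being trained
    and under the frozen reference model.
    """
    # Implicit reward margin: how much more the policy has shifted toward
    # the chosen response than toward the rejected one, relative to the reference.
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written in a numerically stable softplus form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: margin is 0 and the loss is log(2).
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)

# Policy raises the chosen response and lowers the rejected one: loss drops.
improved = dpo_loss(-0.5, -2.0, -1.0, -1.0)
```

No reward model appears anywhere in the computation; the preference signal enters only through which response is labeled as chosen.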

2024

2024 is the year of multimodal models, reasoning breakthroughs, and the emergence of the agent paradigm. Every major lab ships models that can see, hear, and use tools. The first dedicated reasoning models appear, and the MCP protocol is born.

Date | Event | Chapter
March 4 | Anthropic releases the Claude 3 family (Haiku, Sonnet, Opus), its first multimodal models with vision capabilities and a 200K-token context window. Claude 3 Opus matches GPT-4 on many benchmarks. | 1, 11, 21
March 18 | NVIDIA announces the Blackwell B200 GPU at GTC 2024, with 208 billion transistors, 192 GB HBM3e (180 GB usable), and up to 2,250 TFLOPS BF16 dense. Blackwell introduces a second-generation Transformer Engine and native FP4 support. | 24, Appendix B
April 18 | Meta releases LLaMA 3 with 8B and 70B parameter models, trained on 15 trillion tokens. LLaMA 3 uses GQA across all sizes and a 128K vocabulary. | 8, 11
May 13 | OpenAI releases GPT-4o (“omni”), a natively multimodal model that processes text, images, and audio through a single neural network. GPT-4o responds to audio input in as little as 232 ms (320 ms on average), approaching human conversational speed. | 21, 22
May 22 | Anthropic publishes the Chinchilla-era scaling analysis showing that most models are still undertrained relative to compute-optimal ratios. | 13
June 20 | Anthropic releases Claude 3.5 Sonnet, which outperforms Claude 3 Opus on most benchmarks at twice the speed and one-fifth the cost. It becomes the most widely used Claude model. | 11, 21
July 11 | Shah et al. publish FlashAttention-3 (arXiv:2407.08608), optimized for NVIDIA Hopper GPUs using asynchronous execution, warp specialization, and FP8 low-precision. FlashAttention-3 reaches 840 TFLOPs/s BF16 on H100 (85% utilization) and 1.3 PFLOPs/s in FP8. Presented at NeurIPS 2024. | 20, Appendix A
July 23 | Meta releases LLaMA 3.1 with 8B, 70B, and 405B parameter models. The 405B model is the largest open-weight model at the time, with 126 layers, 128 attention heads, and a 128K-token context window. | 8, 11, 18
August 6 | OpenAI introduces Structured Outputs in the API, guaranteeing that model responses conform to a provided JSON schema. | 23
September 12 | OpenAI releases o1-preview, the first dedicated reasoning model. o1 uses chain-of-thought reasoning at inference time. OpenAI's launch post reports 83.3% on AIME 2024 for the full o1 model with 64-sample consensus (compared to ~12% for GPT-4o) and 78% on GPQA Diamond. This marks the beginning of the “reasoning model” era. | 16
October 22 | Anthropic introduces Computer Use in beta with Claude 3.5 Sonnet, allowing the model to control a computer by viewing screenshots and executing mouse/keyboard actions. | 21, 23
November 25 | Anthropic open-sources the Model Context Protocol (MCP), a standard for connecting AI models to external tools and data sources. MCP uses JSON-RPC 2.0 messaging and defines a client-server architecture for tool discovery and invocation. | 23, 29
December 5 | OpenAI releases the full o1 model (not just preview), with improved reasoning capabilities. | 16
December 11 | Google announces Gemini 2.0, its next-generation multimodal model family with improved reasoning and agentic capabilities. Gemini 2.0 Flash is released the same day, outperforming Gemini 1.5 Pro on key benchmarks at twice the speed, with native image and audio output and tool use. | 22
December 26 | DeepSeek releases DeepSeek-V3, a 671B-parameter MoE model with 37B active parameters per token, trained on 14.8 trillion tokens for approximately $5.576M in compute costs. DeepSeek-V3 uses Multi-head Latent Attention (MLA), first introduced in DeepSeek-V2, which compresses the KV cache by 57x compared to standard MHA. | 8, 12, 18
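
The MCP entry notes that the protocol uses JSON-RPC 2.0 messaging for tool discovery and invocation. The sketch below builds two such request messages with the standard library only. The method names `tools/list` and `tools/call` and the `name`/`arguments` parameter shape reflect my understanding of the MCP specification and should be checked against the current version; the `get_weather` tool and the `jsonrpc_request` helper are hypothetical.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC requests carry an id so responses can be matched to them

def jsonrpc_request(method, params=None):
    """Build a JSON-RPC 2.0 request object as a plain dict."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message

# Discovery: ask the server which tools it exposes.
list_tools = jsonrpc_request("tools/list")

# Invocation: call one tool by name with structured arguments.
# "get_weather" is a made-up tool used only for illustration.
call_tool = jsonrpc_request(
    "tools/call",
    {"name": "get_weather", "arguments": {"city": "Paris"}},
)

wire = json.dumps(call_tool)  # the serialized form sent over the transport
```

The transport that carries `wire` (standard input/output, HTTP, and so on) is a separate concern from the message format shown here.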

2025

2025 is the year everything converges. Open-weight MoE models match closed APIs. Reasoning models become standard. Agents move from demos to production. Multimodal capabilities expand from vision to native audio, video, and image generation. The infrastructure layer matures with new GPUs, serving frameworks, and cost optimizations.

Date | Event | Chapter
January 20 | DeepSeek releases DeepSeek-R1, a 671B-parameter reasoning model trained with GRPO (Group Relative Policy Optimization) that matches OpenAI o1 on math and coding benchmarks: 79.8% on AIME 2024 (pass@1), 71.5% on GPQA Diamond, 97.3% on MATH-500. Released under the MIT license, it is the first open reasoning model at frontier quality. | 15, 16
January 30 | NVIDIA launches the RTX 5090 consumer GPU with 32 GB GDDR7 and 1,792 GB/s bandwidth at $1,999. | Appendix B
January 31 | OpenAI releases o3-mini to all ChatGPT users, a smaller reasoning model optimized for speed in technical domains. | 16
February 17 | xAI releases Grok 3, an MoE model estimated at ~3 trillion parameters, trained on 12.8 trillion tokens using xAI’s Colossus supercomputer (100,000+ H100 GPUs). Grok 3 introduces DeepSearch for real-time web analysis and Big Brain Mode for extended reasoning. It launches with a 128K-token context window. | 11
February 24 | Anthropic releases Claude 3.7 Sonnet, the first hybrid reasoning model that combines instant responses with extended thinking in a single model. Users can toggle between standard mode and extended thinking mode (budget_tokens up to 128,000). Anthropic also launches Claude Code (beta), a command-line coding agent that reaches general availability on May 22. | 15, 16, 23
March 11 | OpenAI releases the Responses API and the OpenAI Agents SDK, providing a framework for building multi-step agents with tool use. The Agents SDK supports 100+ LLMs and is provider-agnostic. The Assistants API is announced for sunset on August 26, 2026. | 23, 29
March 12 | Apple launches the Mac Studio with M3 Ultra chip, offering up to 512 GB unified LPDDR5X memory at 819 GB/s, enabling local inference of large models. | Appendix B
March 25 | Google releases Gemini 2.5 Pro (experimental), a reasoning model with a 1M-token context window. It quickly reaches the top of multiple benchmarks. OpenAI also launches native image generation in GPT-4o on the same day, with 130M+ users generating 700M+ images in the first week. | 16, 20, 22
March 26 | The MCP specification is updated to replace Server-Sent Events (SSE) with Streamable HTTP transport, improving reliability for production deployments. | 23, 29
April 5 | Meta releases LLaMA 4 with two models: Scout (109B total, 17B active, 16 experts, 10M-token context) and Maverick (400B total, 17B active, 128 experts, 1M-token context). Both are natively multimodal MoE models using MetaCLIP vision encoders with early fusion training. | 8, 9, 11, 12, 21
April 9 | Google releases the Agent Development Kit (ADK) and the Agent2Agent (A2A) Protocol at Cloud Next 2025. Google also announces TPU v7 Ironwood, its seventh-generation TPU with 4,614 FP8 TFLOPS, 192 GB HBM3e, and 7.38 TB/s bandwidth per chip. Ironwood pods scale to 9,216 chips delivering 42.5 ExaFLOPS. | 23, 24
April 14 | OpenAI releases GPT-4.1 (API-only), with a 1M-token context window, improved coding performance, and lower pricing ($2.00/$8.00 per million tokens input/output). | 17, 20, 24
April 16 | OpenAI releases o3 and o4-mini, its most advanced reasoning models. o3 achieves 96.7% on AIME 2024 and 87.7% on GPQA Diamond. Both models can use tools (web search, Python, image analysis) natively. OpenAI also releases Codex CLI, an open-source command-line coding agent under the Apache 2.0 license. | 16, 23
April 29 | Alibaba releases Qwen3, a family of models from 600M to 235B parameters (MoE, 22B active). Qwen3 introduces hybrid thinking mode (switchable reasoning) and is trained on 36 trillion tokens across 119 languages. Released under Apache 2.0. | 12, 17, 28
May 16 | OpenAI launches Codex, a cloud-based software engineering agent powered by the codex-1 model (a fine-tuned version of o3). Codex can write features, fix bugs, run tests, and propose pull requests autonomously in sandboxed environments. Available to ChatGPT Pro, Team, and Enterprise users. | 23
May 20 | Google DeepMind announces Gemini Diffusion at Google I/O 2025, an experimental text diffusion model that generates text in parallel rather than token-by-token. Gemini Diffusion achieves 1,479 tokens per second on benchmarks (857 tps in practical demos), matching Gemini 2.0 Flash-Lite coding performance at dramatically higher speed. It represents a fundamentally different approach to text generation. | 17
May 22 | Anthropic releases Claude 4 with two models: Opus 4 and Sonnet 4. Opus 4 achieves 72.5% on SWE-bench Verified (79.4% with parallel test-time compute) and can sustain multi-hour agentic coding sessions. Both models feature hybrid instant/extended thinking modes. | 15, 16, 23
May 28 | DeepSeek quietly releases DeepSeek-R1-0528 on Hugging Face, a significant update to R1 using the same 671B V3 base with improved post-training. R1-0528 matches OpenAI o3 and Gemini 2.5 Pro on key reasoning benchmarks, released under the MIT license. | 15, 16
June 17 | Google announces Gemini 2.5 Pro and Gemini 2.5 Flash have reached General Availability (GA), exiting their experimental/preview phases. Alongside the GA launch, Google debuts a preview of Gemini 2.5 Flash-Lite, a cost-optimized variant for high-throughput tasks. | 16, 20, 24
July 9 | xAI releases Grok 4 with scientist-level reasoning, a 256K-token context window (128K in app, 256K in API), native tool use (web search, Python, image analysis), and a $300/month SuperGrok Heavy subscription. Grok 4 is trained on xAI’s Colossus supercomputer using 200,000+ GPUs. | 16
August 1 | Google launches Gemini 2.5 Deep Think, its most advanced reasoning mode, available to AI Ultra subscribers ($250/month). Deep Think explores multiple ideas in parallel before choosing the best answer, achieving 87.6% on LiveCodeBench and a Bronze medal at IMO 2025. | 16
August 5 | OpenAI releases GPT-OSS, its first open-weight language models since GPT-2 in 2019. The family includes gpt-oss-120B and gpt-oss-20B, both MoE architectures released under the Apache 2.0 license. The 20B model matches o3-mini on common benchmarks and runs on devices with 16 GB of memory. On the same day, Anthropic releases Claude Opus 4.1, a drop-in replacement for Opus 4 with improved agentic coding (74.5% on SWE-bench Verified) at the same pricing. | 11, 25
August 7 | OpenAI releases GPT-5, a unified system that combines fast conversational responses with deep reasoning in a single model. GPT-5 eliminates the need to choose between separate model variants. It is the first GPT model available free to all ChatGPT users. | 1, 11, 17
August 21 | DeepSeek releases DeepSeek-V3.1, a hybrid model with 671B total parameters (37B active) that unifies thinking and non-thinking modes in a single checkpoint. V3.1 supports a 128K-token context window and is positioned as DeepSeek’s “first step toward the agent era,” with improved tool calling and agent capabilities. It surpasses DeepSeek-R1 on multiple reasoning benchmarks. | 12
September 5 | Alibaba previews Qwen 3-Max, its first model with over one trillion parameters. The full release follows on September 24 at the Yunqi Conference. Qwen 3-Max is an MoE model trained on 36 trillion tokens, achieving third place on the LMArena text leaderboard. An upgraded version with test-time scaling (TTS) in January 2026 scores 100% on AIME 2025. | 11, 13
September 19 | xAI releases Grok 4 Fast with a 2-million-token context window, the largest production context window as of March 2026. Grok 4 Fast reduces average thinking tokens by 40% compared to Grok 4, achieving a 98% cost reduction per task. | 20
September 29 | Anthropic releases Claude Sonnet 4.5, its best coding model at the time, with breakthrough advances in computer use, reasoning, and sustained focus. Anthropic also releases the Claude Agent SDK (rebranded from Claude Code SDK), providing a framework for building agents with Claude models. | 23, 29
October 15 | Anthropic releases Claude Haiku 4.5, the fastest model in the Claude 4.5 family. Haiku 4.5 matches Claude Sonnet 4 on coding and agent benchmarks at one-third the cost and twice the speed, making it the go-to model for latency-sensitive and high-volume applications. | 24
November 12 | OpenAI releases GPT-5.1 (Instant and Thinking variants), a mid-cycle upgrade to GPT-5 with improved conversational quality, adaptive reasoning that dynamically adjusts thinking time based on task complexity, and six personality presets. GPT-5.1 Codex and Codex Max follow on November 19. | 11, 17
November 18 | Google releases Gemini 3 Pro with a 1M-token context window, achieving 37.5% on Humanity’s Last Exam and setting new benchmarks across reasoning, coding, and multimodal tasks. | 11, 20, 21
November 24 | Anthropic releases Claude Opus 4.5, achieving 80.9% on SWE-bench Verified (the highest at the time), with a 67% price reduction compared to Opus 4. Opus 4.5 is designed for sustained multi-hour agentic coding sessions and complex reasoning tasks. | 11, 16, 23
November 25 | The MCP specification is updated with support for tasks and workflows, enabling more complex agent orchestration patterns. | 23
November 26 | Alibaba releases Qwen3-VL (arXiv:2511.21631), a family of multimodal vision-language models with dense variants (2B/4B/8B/32B) and MoE variants (30B-A3B/235B-A22B). Qwen3-VL supports 256K native context, 2-hour video understanding with 99.5% needle-in-a-haystack accuracy, and introduces DeepStack feature fusion. | 21
December 1 | DeepSeek releases DeepSeek-V3.2 with 685B total parameters (37B active), continued pre-training from the V3.1-Terminus checkpoint. | 12
December 9 | The Agentic AI Foundation (AAIF) is formed under the Linux Foundation, with eight platinum members (AWS, Anthropic, Block, Bloomberg, Cloudflare, Google, Microsoft, OpenAI) each contributing $350,000 to develop governance standards for AI agent interoperability. | 23
December 11 | OpenAI releases GPT-5.2 with three variants (Instant, Thinking, Pro), focused on professional knowledge work with improved reasoning, long-context handling, and agentic capabilities. | 11, 21
December 11 | Hugging Face announces TGI maintenance mode, recommending migration to vLLM or SGLang for new deployments. | 24
December 16 | OpenAI releases GPT Image 1.5, an updated image generation API that is 4x faster and 20% cheaper than GPT Image 1 (released April 23, 2025). | 22
December 17 | Google releases Gemini 3 Flash, a smaller, faster variant of Gemini 3 optimized for high-throughput inference. | 22

2026 (Through March 20)

The first quarter of 2026 continues the rapid pace. Open-weight models with early fusion multimodal capabilities become the norm. Context windows exceed one million tokens as standard. Agent frameworks mature with production-grade tooling.

Date | Event | Chapter
January | NVIDIA ships the B300 (Blackwell Ultra) GPU with 288 GB HBM3e, 8 TB/s bandwidth, and ~13-15 PFLOPS dense FP4 (varies by platform configuration). The B300 is the most powerful single GPU available as of March 2026. | 24, Appendix B
January 21 | PyTorch 2.10 is released. | 27
January 24 | The MCP Python SDK reaches version 1.26.0. | 29
January 28 | NVIDIA announces Dynamic Context Parallelism, achieving 1.48x speedup for long-context inference by dynamically distributing attention computation across GPUs. | 20
February 5 | Anthropic releases Claude Opus 4.6 with a 1M-token context window (beta), improved agentic coding performance, and adaptive reasoning (low/medium/high/max effort levels). Claude Opus 4.6 achieves 80.8% on SWE-bench Verified (79.2% in Thinking mode per vals.ai). Pricing: $5/$25 per million tokens input/output. On the same day, OpenAI releases GPT-5.3-Codex, its most capable agentic coding model, combining the coding performance of GPT-5.2-Codex with the reasoning capabilities of GPT-5.2 in a single model that is 25% faster. | 11, 16, 17, 19, 20
February 16 | Alibaba releases Qwen 3.5 with a flagship 397B-parameter MoE model (17B active), supporting 262K native context across 201 languages. Qwen 3.5 introduces hybrid Gated DeltaNet attention and multi-token prediction. Released under Apache 2.0. | 12, 17, 21, 22
February 17 | Anthropic releases Claude Sonnet 4.6. Pricing: $3/$15 per million tokens input/output. | 19
February 19 | Google releases Gemini 3.1 Pro with a 1M-token context window. | 20
February 26 | Google releases Nano Banana 2 (Gemini 3.1 Flash Image), the latest iteration of its image generation model. | 22
March 2 | Alibaba releases the Qwen 3.5 small series (0.8B to 35B parameters), including Qwen3.5-9B with 9B parameters, hybrid architecture, and 262K native context. | 21, 28
March 3 | Google releases Gemini 3.1 Flash-Lite at $0.25/$1.50 per million tokens. | 24
March 3 | Tri Dao et al. publish FlashAttention-4 (arXiv:2603.05451), redesigned for NVIDIA Blackwell GPUs using algorithm and kernel pipelining co-design. FlashAttention-4 reaches 1,613 TFLOPs/s BF16 on B200 (71% utilization), nearly doubling FlashAttention-3’s 840 TFLOPs/s on H100. Implemented entirely in Python-embedded CuTe-DSL with 20-30x faster compile times than C++ templates. | 20, Appendix A
March 5 | OpenAI releases GPT-5.4 with native computer-use capabilities, a 1.05M-token context window, and tool search (47% token reduction on MCP Atlas benchmark tasks). GPT-5.4 achieves 75% on OSWorld-Verified, exceeding the 72.4% human baseline. Available in standard, Thinking, and Pro variants. Pricing: $2.50/$15.00 per million tokens input/output. | 11, 16, 20, 21, 23, 24
March 10 | Google releases Gemini Embedding 2 with 3,072-dimensional vectors, Matryoshka Representation Learning (MRL) support for 128-3,072 dimensions, and 8,192-token input limit. | 22
March 13 | Anthropic removes the long-context surcharge (previously 2x input, 1.5x output above 200K tokens), making long-context usage the same price as standard usage. AWS and Cerebras announce a disaggregated inference collaboration using Trainium3 and WSE-3 with the llm-d open-source framework. | 19, 24
March 16 | Mistral AI releases Mistral Small 4 with 119B total parameters, 6B active per token, 128 experts (top-4 routing), 256K context window, and multimodal text+image input. It unifies instruct, reasoning, and coding capabilities in a single model. Released under Apache 2.0. | 12, 21, 22
March 17 | OpenAI releases GPT-5.4 mini and GPT-5.4 nano. Mini: $0.75/$4.50 per million tokens. Nano: $0.20/$1.25 per million tokens. GPT-5.4 mini achieves 72.1% on OSWorld-Verified. | 23, 24
March 18 | PyTorch 2.11 is released. | 27
March 19 | The OpenAI Agents SDK reaches version 0.12.5, adding WebSocket transport (up to 40% faster for 20+ tool calls), human-in-the-loop support, and session persistence with multiple backends. | 29

The Big Picture: Nine Years in Numbers

To appreciate the scale of progress, here are a few comparisons between the original Transformer (2017) and the frontier as of March 2026:

Metric | 2017 (Transformer) | March 2026 (Frontier) | Change
Parameters | ~65 million | 1+ trillion (Qwen 3-Max, Grok 3 estimated ~3T) | ~15,000x
Training data | ~36 million sentence pairs (WMT 2014 English-French) | 36+ trillion tokens (Qwen3) | ~1,000,000x by raw count (sentence pairs vs. tokens, so not like-for-like)
Context window | 512 tokens | 2,000,000 tokens (Grok 4 Fast) | ~4,000x
Modalities | Text only | Text, images, audio, video, code, tools | 6+ modalities
Training cost | ~$100 (estimated) | $50M-$500M+ (frontier) | ~500,000x
Inference cost | N/A | $0.20 per million input tokens (GPT-5.4 nano) | Commoditized
Open-weight models | 0 | Thousands (LLaMA 4, Qwen 3.5, Mistral, DeepSeek, GPT-OSS) | Ecosystem
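
The multipliers in the Change column follow from simple division of the table's endpoints. The short check below recomputes them, taking the lower bound wherever the table gives a range (for example $50M for frontier training cost). The dictionary keys are illustrative labels, not terms from the book.

```python
# (2017 value, March 2026 value) pairs taken from the table above.
comparisons = {
    "parameters": (65e6, 1e12),
    "training_data": (36e6, 36e12),
    "context_window": (512, 2_000_000),
    "training_cost_usd": (100, 50e6),   # lower bound of the $50M-$500M+ range
}

ratios = {name: new / old for name, (old, new) in comparisons.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:,.0f}x")
```

The table rounds these results, so 15,385x appears as ~15,000x and 3,906x as ~4,000x.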

Thematic Threads

Reading the timeline chronologically reveals several recurring themes that the book explores in depth:

1. The scaling hypothesis proved right, then hit limits (Chapters 11, 13). From GPT-1 (117M) to GPT-3 (175B) to frontier models exceeding a trillion parameters, scaling consistently improved capabilities. But the Chinchilla paper (March 2022) showed that most models were undertrained, and by 2025, the “data wall” forced labs to explore synthetic data and new training techniques.

2. Open-weight models closed the gap (Chapters 11, 12, 25). LLaMA (February 2023) started the open-source revolution. OpenAI itself joined the movement with GPT-OSS (August 2025), its first open-weight models since GPT-2. By March 2026, open-weight MoE models like Qwen 3.5 (397B/17B active), LLaMA 4 Maverick (400B/17B active), and Mistral Small 4 (119B/6B active) compete directly with closed APIs on most benchmarks.

3. MoE became the dominant architecture (Chapter 12). Mixtral 8x7B (December 2023) proved MoE viable for open models. By March 2026, every major open-weight frontier model uses MoE: DeepSeek-V3 (671B/37B), LLaMA 4 (400B/17B), Qwen 3.5 (397B/17B), Mistral Small 4 (119B/6B). The pattern is consistent: large total parameter counts with small active parameter counts.

4. Reasoning models emerged as a new paradigm (Chapters 15, 16). OpenAI o1 (September 2024) introduced the idea that models could “think” before answering. DeepSeek-R1 (January 2025) made reasoning open-source, and R1-0528 (May 2025) closed the gap with o3. Claude 3.7 Sonnet (February 2025) pioneered hybrid reasoning, combining instant responses with extended thinking in a single model. DeepSeek-V3.1 (August 2025) unified thinking and non-thinking modes in a single checkpoint. Qwen 3-Max (September 2025) pushed the scale frontier past one trillion parameters with test-time scaling. Google’s Gemini 2.5 Deep Think (August 2025) introduced parallel reasoning exploration. By March 2026, every major model family includes reasoning capabilities, either as dedicated reasoning models or as hybrid modes within general models.

5. Context windows grew 4,000x (Chapter 20). The original Transformer handled 512 tokens. GPT-3 managed 2,048. By March 2026, Grok 4 Fast supports 2 million tokens, and several models offer 1M+ token contexts. This expansion was enabled by FlashAttention (2022), FlashAttention-2 (2023), FlashAttention-3 (2024, 840 TFLOPs/s on H100), FlashAttention-4 (March 2026, 1,613 TFLOPs/s on B200), Ring Attention (2023), and hardware improvements (B200/B300 with 180-288 GB HBM3e).

6. Multimodal became the default (Chapters 21, 22). GPT-4 (March 2023) added vision. GPT-4o (May 2024) unified text, image, and audio. By March 2026, every frontier model is natively multimodal, and open-weight models like LLaMA 4, Qwen 3.5, and Mistral Small 4 all use early fusion architectures trained on interleaved multimodal data.

7. Agents moved from concept to production (Chapter 23). Function calling (June 2023) enabled tool use. MCP (November 2024) standardized tool interfaces. By March 2026, production agent frameworks (OpenAI Agents SDK, Claude Agent SDK, Google ADK) support multi-step workflows, coding agents (OpenAI Codex, Claude Code) handle autonomous software engineering, and the AAIF provides governance standards.

8. Inference costs plummeted (Chapter 24). GPT-3 API pricing in 2020 was roughly $60 per million tokens. By March 2026, GPT-5.4 nano costs $0.20 per million input tokens, a 300x reduction. Open-weight models running on optimized infrastructure (vLLM, SGLang, TensorRT-LLM) push costs even lower.
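
Two of the quantitative claims in the threads above are easy to verify by hand: the small active-parameter share of the MoE models in thread 3, and the price reduction in thread 8. The sketch below uses only the figures quoted in those threads; the variable names are illustrative.

```python
# Thread 3: (total, active) parameter counts in billions.
moe_models = {
    "DeepSeek-V3": (671, 37),
    "LLaMA 4 Maverick": (400, 17),
    "Qwen 3.5": (397, 17),
    "Mistral Small 4": (119, 6),
}

# Share of parameters that participate in computing any single token.
active_share = {name: active / total for name, (total, active) in moe_models.items()}

# Thread 8: price points in USD per million tokens.
gpt3_2020_price = 60.0
nano_2026_price = 0.20
price_reduction = gpt3_2020_price / nano_2026_price

for name, share in active_share.items():
    print(f"{name}: {share:.1%} of parameters active per token")
print(f"price reduction: {price_reduction:.0f}x")
```

All four models activate between roughly 4% and 6% of their parameters per token, which is the “large total, small active” pattern the thread describes.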


D.1 Key Takeaways

  • The Transformer paper (June 2017) is the single origin point for everything in this book. Every model, optimization, and application described in these chapters builds on the architecture Vaswani et al. introduced.

  • The pace of progress is accelerating, not slowing. The gap between GPT-3 (May 2020) and GPT-4 (March 2023) was nearly three years. The gap between GPT-5 (August 2025) and GPT-5.4 (March 2026) was seven months. Model generations are compressing.

  • Open-weight models consistently trail closed models by 6-12 months, then match or exceed the closed models they trailed. LLaMA 2 (July 2023) matched GPT-3.5. DeepSeek-V3 (December 2024) matched GPT-4o. Even OpenAI joined the open-weight movement with GPT-OSS (August 2025). By March 2026, the gap between open and closed models is the narrowest it has ever been.

  • The MoE architecture is the defining technical trend of 2025-2026. It allows models to have large total parameter counts (for knowledge capacity) while keeping active parameters small (for inference efficiency).

  • Reasoning models represent a fundamental shift from “predict the next token faster” to “think longer on harder problems.” This shift, starting with o1 in September 2024, has reshaped how every lab designs its models.

  • The infrastructure stack (FlashAttention, vLLM, MCP, agent frameworks) is as important as the models themselves. Without these systems, frontier models would be too slow, too expensive, and too isolated to be useful in production. The FlashAttention series alone progressed from 124 TFLOPs/s on A100 (2022) to 840 TFLOPs/s on H100 (2024) to 1,613 TFLOPs/s on B200 (March 2026), a 13x improvement in four years that combines kernel advances with three generations of GPU hardware.

  • Parameter-efficient fine-tuning (LoRA in 2021, QLoRA in 2023) and simplified alignment (DPO in 2023) democratized model customization. These techniques made it possible to fine-tune billion-parameter models on consumer hardware and align them without complex reinforcement learning pipelines.

  • The release cadence in late 2025 was extraordinary. In a single 68-day stretch from September 19 to November 26, 2025, the industry saw Grok 4 Fast, Qwen 3-Max, Claude Sonnet 4.5, Claude Haiku 4.5, GPT-5.1, Gemini 3 Pro, Claude Opus 4.5, and Qwen3-VL. This compressed timeline reflects the intense competition among labs and the maturation of model development pipelines.
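
The date gaps and the FlashAttention speedup cited in these takeaways can be rechecked with Python's standard library. This is a sketch under stated assumptions: the dates are the ones listed in the Sources note at the end of this appendix, the GPT-3 date is the paper date rather than the API beta, and the dictionary and function names are illustrative.

```python
from datetime import date

# Release dates as listed in this appendix's Sources note.
RELEASES = {
    "GPT-3 paper": date(2020, 5, 28),
    "GPT-4": date(2023, 3, 14),
    "GPT-5": date(2025, 8, 7),
    "GPT-5.4": date(2026, 3, 5),
    "Grok 4 Fast": date(2025, 9, 19),
    "Claude Sonnet 4.5": date(2025, 9, 29),
    "Qwen3-VL": date(2025, 11, 26),
}

# Peak FlashAttention throughput figures quoted above, in TFLOPs/s.
FLASH_ATTENTION_TFLOPS = {"A100 (2022)": 124, "H100 (2024)": 840, "B200 (2026)": 1_613}


def days_between(earlier: str, later: str) -> int:
    """Number of days from one named release to another."""
    return (RELEASES[later] - RELEASES[earlier]).days


gpt3_to_gpt4 = days_between("GPT-3 paper", "GPT-4")
gpt5_to_gpt54 = days_between("GPT-5", "GPT-5.4")
grok_to_qwen = days_between("Grok 4 Fast", "Qwen3-VL")
sonnet_to_qwen = days_between("Claude Sonnet 4.5", "Qwen3-VL")

# Ratio of the newest to the oldest quoted throughput figure.
fa_speedup = FLASH_ATTENTION_TFLOPS["B200 (2026)"] / FLASH_ATTENTION_TFLOPS["A100 (2022)"]

print(f"GPT-3 to GPT-4: {gpt3_to_gpt4} days ({gpt3_to_gpt4 / 365.25:.1f} years)")
print(f"GPT-5 to GPT-5.4: {gpt5_to_gpt54} days ({gpt5_to_gpt54 / 30.44:.1f} months)")
print(f"Grok 4 Fast to Qwen3-VL: {grok_to_qwen} days")
print(f"Claude Sonnet 4.5 to Qwen3-VL: {sonnet_to_qwen} days")
print(f"FlashAttention A100 to B200: {fa_speedup:.1f}x")
```

Run as written, the script reports 1,020 days between the GPT-3 paper and GPT-4, 210 days between GPT-5 and GPT-5.4, and 68 days from Grok 4 Fast to Qwen3-VL (58 days if the window is counted from Claude Sonnet 4.5 instead). The throughput ratio compares peak figures on different GPUs, so it reflects hardware and kernel improvements together.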

Appendix E provides a detailed comparison table of every frontier model referenced in this timeline, with specifications, pricing, and context windows as of March 2026.


Sources: All dates and facts in this timeline are verified via web search as of March 2026 and cross-referenced with the source citations in each chapter. Key primary sources include: Vaswani et al., “Attention Is All You Need,” NeurIPS 2017, arXiv submitted June 12, 2017 (arxiv.org/abs/1706.03762). GPT-1 released June 2018 (en.wikipedia.org/wiki/GPT-1). BERT published October 11, 2018 (arxiv.org/abs/1810.04805). GPT-2 released February 14, 2019 (wikiwand.com/en/Generative_Pre-trained_Transformer). GPT-3 paper May 28, 2020, API beta June 11, 2020 (arxiv.org/abs/2005.14165, gwern.net). BigBird submitted July 28, 2020, NeurIPS 2020 (arxiv.org/abs/2007.14062). LoRA published June 17, 2021 (arxiv.org/abs/2106.09685). InstructGPT announced January 27, 2022, arXiv paper March 4, 2022 (openai.com/blog/instruction-following, arxiv.org/abs/2203.02155). Chinchilla scaling laws March 2022 (arxiv.org/abs/2203.15556). FlashAttention published May 27, 2022 (arxiv.org/abs/2205.14135). Whisper released September 21, 2022 (openai.com/blog/whisper). ChatGPT launched November 30, 2022 (ofzenandcomputing.com/chatgpt-release-date). Constitutional AI December 2022 (arxiv.org/abs/2212.08073). LLaMA announced February 24, 2023 (en.wikipedia.org/wiki/Llama_(language_model)). Stanford Alpaca released March 13, 2023 (simonwillison.net, crfm.stanford.edu/2023/03/13/alpaca.html). GPT-4 launched March 14, 2023 (openai.com/index/gpt-4-research). QLoRA published May 23, 2023 (arxiv.org/abs/2305.14314). DPO published May 29, 2023 (arxiv.org/abs/2305.18290). vLLM launched June 2023 (blog.vllm.ai/2023/06/20/vllm.html). FlashAttention-2 published July 17, 2023 (arxiv.org/abs/2307.08691, hazyresearch.stanford.edu/blog/2023-07-17-flash2). LLaMA 2 released July 18, 2023 (about.fb.com/news/2023/07/llama-2). Mistral 7B released September 27, 2023 (mistral.ai). Gemini 1.0 announced December 6, 2023 (en.wikipedia.org/wiki/Gemini_(language_model)). 
Mixtral 8x7B released December 11, 2023 (mistral.ai/news/mixtral-of-experts). Claude 3 released March 4, 2024 (anthropic.com/news/claude-3-family). NVIDIA B200 announced March 2024 at GTC (techzine.eu). LLaMA 3 released April 18, 2024 (ai.meta.com/blog/meta-llama-3). GPT-4o released May 13, 2024 (openai.com). Claude 3.5 Sonnet released June 20, 2024 (anthropic.com/news/claude-3-5-sonnet, siliconangle.com). FlashAttention-3 published July 11, 2024, NeurIPS 2024 (arxiv.org/abs/2407.08608, neurips.cc/virtual/2024/poster/93328). LLaMA 3.1 405B released July 23, 2024 (ai.meta.com). o1-preview released September 12, 2024 (openai.com/index/learning-to-reason-with-llms). MCP announced November 25, 2024 (anthropic.com/news/model-context-protocol). Gemini 2.0 announced December 11, 2024 (blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024). DeepSeek-V3 released December 26, 2024 (arxiv.org/abs/2412.19437). DeepSeek-R1 released January 20, 2025 (arxiv.org/abs/2501.12948). Grok 3 released February 17, 2025 (testingcatalog.com, techcrunch.com). Claude 3.7 Sonnet and Claude Code released February 24, 2025 (anthropic.com/news/claude-3-7-sonnet). OpenAI Agents SDK released March 11, 2025 (openai.com/index/new-tools-for-building-agents). LLaMA 4 released April 5, 2025 (huggingface.co/blog/llama4-release). Google ADK and TPU v7 Ironwood announced April 9, 2025 (9to5google.com). GPT-4.1 released April 14, 2025 (openai.com/index/gpt-4-1). o3, o4-mini, and Codex CLI released April 16, 2025 (openai.com/index/introducing-o3-and-o4-mini, techcrunch.com). Qwen3 released April 29, 2025 (alibabacloud.com/blog/602192). OpenAI Codex agent launched May 16, 2025 (openai.com/index/introducing-codex). Gemini Diffusion announced May 20, 2025 at Google I/O (blog.google/technology/google-deepmind/gemini-diffusion, simonwillison.net, the-decoder.com). Claude 4 (Opus 4, Sonnet 4) released May 22, 2025 (anthropic.com/news/claude-4). 
DeepSeek-R1-0528 released May 28, 2025 (pandaily.com, rits.shanghai.nyu.edu, felloai.com). Gemini 2.5 Pro and Flash GA June 17, 2025 (9to5google.com, blog.google/products/gemini/gemini-2-5-model-family-expands, cloud.google.com/blog). Grok 4 released July 9, 2025, 256K context window (jagranjosh.com, datastudios.org, datacamp.com). Gemini 2.5 Deep Think launched August 1, 2025 (testingcatalog.com, beebom.com, tdtu.edu.vn). GPT-OSS released August 5, 2025 (theregister.com, milvus.io, jagranjosh.com, huggingface.co/blog/welcome-openai-gpt-oss). Claude Opus 4.1 released August 5, 2025 (anthropic.com/news/claude-opus-4-1, testingcatalog.com). GPT-5 released August 7, 2025 (openai.com/index/introducing-gpt-5). DeepSeek-V3.1 released August 21, 2025 (startuphub.ai, the-decoder.com, gigazine.net, outlookbusiness.com). Qwen 3-Max preview September 5, full release September 24 at Yunqi Conference (ctol.digital, cybernews.com, qwen-ai.com, startuptalky.com). Grok 4 Fast released September 19, 2025 (x.ai/news/grok-4-fast, alternativeto.net). Claude Sonnet 4.5 released September 29, 2025 (anthropic.com/news/claude-sonnet-4-5, claudefa.st). Claude Haiku 4.5 released October 15, 2025 (anthropic.com/news/claude-haiku-4-5, unite.ai, analyticsvidhya.com). GPT-5.1 released November 12, 2025 (openai.com/index/gpt-5-1, cometapi.com). Gemini 3 Pro released November 18, 2025 (gigazine.net, businessworld.in). Claude Opus 4.5 released November 24, 2025 (anthropic.com/news/claude-opus-4-5, llm-stats.com, digitalapplied.com). Qwen3-VL released November 26, 2025 (arxiv.org/abs/2511.21631, emergentmind.com, alibabacloud.com/blog/602584). DeepSeek-V3.2 released December 1, 2025 (huggingface.co/deepseek-ai/DeepSeek-V3.2). AAIF formed December 9, 2025 (linuxfoundation.org). GPT-5.2 released December 11, 2025 (openai.com/index/introducing-gpt-5-2). NVIDIA B300 shipped January 2026 (spheron.network/blog/nvidia-b300-blackwell-ultra-guide). 
Claude Opus 4.6 and GPT-5.3-Codex both released February 5, 2026 (anthropic.com/research/claude-opus-4-6, openai.com/index/introducing-gpt-5-3-codex). Qwen 3.5 released February 16, 2026 (launchberg.com, qwen-ai.com). FlashAttention-4 published March 3, 2026 (arxiv.org/abs/2603.05451, pytorch.org/blog/flexattention-flashattention-4-fast-and-flexible, research.colfax-intl.com). GPT-5.4 released March 5, 2026 (openai.com/index/introducing-gpt-5-4). Mistral Small 4 released March 16, 2026 (mistral.ai/news/mistral-small-4). GPT-5.4 mini and nano released March 17, 2026 (openai.com/index/introducing-gpt-5-4-mini-and-nano). OpenAI Agents SDK version 0.12.5 as of March 19, 2026 (pypi.org/project/openai-agents). All dates cross-referenced with Wikipedia, official announcements, and multiple independent sources.