Chapter 25. Open vs. Closed Models, The Great Divide

Part 8. The Ecosystem, How It All Connects

Every model discussed in this book falls into one of two categories: models whose weights you can download and run yourself, and models you can only access through an API. This distinction, between open-weight and closed models, is the most consequential decision in the LLM ecosystem. It determines what you can customize, what you can inspect, where your data goes, how much you pay, and ultimately who controls the AI you depend on. This chapter explains exactly what “open” means in the context of LLMs, surveys the major open and closed model families as of March 2026, shows you how to run models locally on your own hardware, and walks through the fine-tuning techniques that let you customize open models for your specific needs.

What “Open” Actually Means

The term “open source” gets thrown around loosely in the AI world, but it means something very specific in software: the source code is publicly available under a license that allows anyone to use, modify, and redistribute it. In the LLM world, the situation is more nuanced. There are at least three distinct levels of openness, and confusing them leads to bad decisions.

Open Weights

Open weights means the trained model parameters (the billions of numbers that define the model’s behavior) are publicly available for download. You can load these weights into a compatible framework, run inference on your own hardware, and serve the model to your own users. Most models that people casually call “open source” are actually open-weight models. LLaMA 4, Mistral, and Gemma fall into this category.

Open weights give you the finished product but not necessarily the recipe. You get the trained model, but you may not get the training data, the training code, the data preprocessing pipeline, or the evaluation infrastructure. You can use the model, but you cannot fully reproduce how it was made.

Open Source (by the OSI Definition)

On October 28, 2024, the Open Source Initiative (OSI) published the Open Source AI Definition (OSAID) 1.0 at the All Things Open conference, the first formal standard for what “open source” means in the context of AI. According to OSAID, an AI system must guarantee four freedoms: it must be usable, studyable, modifiable, and shareable for any purpose. Critically, this requires not just the model weights but also sufficient information about the training data, training code, and methodology to allow meaningful study and modification.

By this strict definition, very few large language models qualify as truly open source. Most “open” models release weights and sometimes inference code, but not the full training pipeline or training data. DeepSeek comes closest: it releases weights under the MIT license, publishes detailed technical reports, and provides training code. But even DeepSeek does not release its full training dataset.

Source: The Open Source Initiative published OSAID 1.0 on October 28, 2024 at the All Things Open conference, defining four freedoms: use, study, modify, and share for any purpose (confirmed from prweb.com, theoutpost.ai, itsfoss.com, infoq.com).

Closed (Proprietary)

Closed models are accessible only through an API. You send your input to the provider’s servers, the model processes it on their hardware, and you receive the output. You cannot download the weights, inspect the architecture details, run the model locally, or modify it in any way. GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro are closed models.

Closed models offer convenience (no infrastructure to manage) and often represent the absolute frontier of capability. But they come with fundamental limitations: your data leaves your network, you depend on the provider’s uptime and pricing decisions, and you cannot customize the model beyond what the API allows (system prompts, temperature, etc.).

The Spectrum in Practice

In reality, “open vs. closed” is a spectrum, not a binary:

def openness_spectrum():
    """
    Show the spectrum of model openness as of March 2026.
    """
    print("The Openness Spectrum (March 2026)")
    print("=" * 75)

    levels = [
        ("Fully Closed",
         "API only, no weights, no architecture details",
         "GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro"),
        ("Closed + Published Research",
         "API only, but architecture described in papers",
         "GPT-4 (technical report), Gemini (technical report)"),
        ("Open Weights (Restrictive)",
         "Weights downloadable, custom license with limits",
         "LLaMA 4 (Llama Community License)"),
        ("Open Weights (Permissive)",
         "Weights downloadable, permissive license",
         "Mistral Small 4 (Apache 2.0), GPT-OSS (Apache 2.0)"),
        ("Open Weights + Code + Report",
         "Weights, inference code, detailed technical report",
         "DeepSeek V3.2 (MIT), Qwen 3.5 (Apache 2.0), Kimi K2.5 (Modified MIT)"),
        ("Fully Open Source (OSAID)",
         "Weights, code, training data, full reproducibility",
         "Very few models qualify"),
    ]

    for i, (level, desc, examples) in enumerate(levels):
        if i > 0:
            print(f"  |")
        print(f"  [{i+1}] {level}")
        print(f"      {desc}")
        print(f"      Examples: {examples}")

    print(f"\n  Most 'open source' LLMs sit at levels 3-5.")
    print(f"  True OSAID compliance (level 6) remains rare.")

openness_spectrum()

The practical difference between levels 3 and 5 matters enormously. A model released under the MIT license with a detailed technical report (like DeepSeek V3.2) gives you far more freedom than a model released under a restrictive custom license (like LLaMA 4). Both are “open weights,” but the legal and practical implications are very different.

The Major Open Model Families

As of March 2026, five families dominate the open-weight landscape: Meta’s LLaMA, DeepSeek, Alibaba’s Qwen, Mistral, and (in a historic reversal) OpenAI’s GPT-OSS. Each takes a different approach to architecture, licensing, and community engagement.

LLaMA 4 (Meta)

Meta released LLaMA 4 on April 5, 2025, with two initial variants:

LLaMA 4 Scout: 109B total parameters, 17B active per token, 16 experts, 10M token context window
LLaMA 4 Maverick: 400B total parameters, 17B active per token, 128 experts, 1M token context window

A third variant, LLaMA 4 Behemoth (2 trillion total parameters, 288B active, 16 experts), was announced alongside Scout and Maverick but has been repeatedly delayed. Originally planned for the April 2025 LlamaCon event, then pushed to June, then to fall 2025, and most recently to May 2026. Internal evaluations revealed that the model did not significantly outperform prior versions, and Meta has acknowledged that Behemoth’s performance fell short of expectations. The company is focusing on smaller LLaMA 4 variants instead.

Both Scout and Maverick use a Mixture-of-Experts architecture (covered in Chapter 12) with early-fusion multimodal capabilities (covered in Chapter 22). They accept both text and image inputs and support 12 languages. The MoE design means that despite having hundreds of billions of total parameters, only 17B are active for any given token, making inference costs comparable to a much smaller dense model.

LLaMA 4 is released under the Llama Community License, which is more permissive than earlier versions but is not a standard open-source license. Key restrictions include:

If your product or service has more than 700 million monthly active users, you must request a separate license from Meta
You must include “Built with Llama” attribution in your products
You cannot use LLaMA outputs to train models that compete with LLaMA

These restrictions mean LLaMA 4 is not open source by the OSI definition. It is open-weight with a custom commercial license. For most companies and developers, the restrictions are irrelevant (very few products have 700 million monthly users), but for large tech companies like Google, Apple, or Amazon, the license creates complications.

Source: LLaMA 4 Scout (109B total, 17B active, 16 experts, 10M context) and Maverick (400B total, 17B active, 128 experts, 1M context) released April 5, 2025 (confirmed from huggingface.co/blog/llama4-release, arxiv.org/html/2601.11659v1, simonwillison.net). Behemoth (2T total, 288B active) repeatedly delayed to May 2026 due to underperformance (confirmed from streamlinefeed.co.ke, aibase.com/en/news/26207, the-decoder.com, neuron.expert, winbuzzer.com). Llama Community License requires separate license above 700M monthly active users (confirmed from redresscompliance.com, fortune.com, ai.meta.com/llama/license).

DeepSeek (DeepSeek AI)

DeepSeek, a Chinese AI lab founded in July 2023, has produced some of the most impactful open models in the history of the field. Their release timeline:

DeepSeek-V3 (December 26, 2024): 671B total parameters, 37B active per token, trained on 14.8 trillion tokens using 2,048 NVIDIA H800 GPUs. The full training consumed 2.788 million GPU hours at an estimated cost of $5.576 million, a fraction of what comparable frontier models cost. Released under the MIT license.
DeepSeek-R1 (January 20, 2025): A reasoning model built on the V3 architecture, trained using large-scale reinforcement learning. R1 matched OpenAI’s o1 on reasoning benchmarks while being fully open under the MIT license. This was a watershed moment: it demonstrated that frontier-level reasoning capabilities could be achieved and released openly.
DeepSeek-V3.1 (August 21, 2025): 671B total parameters, 37B active, with hybrid inference modes (thinking and non-thinking) and enhanced agent capabilities. 128K context window. MIT license.
DeepSeek-V3.2 (December 1, 2025): 671B total parameters, 37B active, 128K context window. Introduced DeepSeek Sparse Attention (DSA) for more efficient long-context inference. MIT license. API pricing: $0.28 per million input tokens, $0.42 per million output tokens.

DeepSeek’s impact goes beyond the models themselves. The V3 technical report was extraordinarily detailed, describing innovations like Multi-Head Latent Attention (which compresses the KV cache) and an auxiliary-loss-free load balancing strategy for MoE routing. Other labs, including Moonshot AI (Kimi K2) and Zhipu AI (GLM-5), have openly built on DeepSeek’s published techniques. This is the open-weight ecosystem working as intended: teams build on each other’s innovations in public, and the pace of progress compounds.

The MIT license is the most permissive option available. It allows commercial use, modification, distribution, and sublicensing with essentially no restrictions beyond including the copyright notice. This makes DeepSeek models the easiest to integrate into commercial products without legal review.

Source: DeepSeek-V3 released December 26, 2024, 671B total/37B active, 14.8T tokens, 2.788M H800 GPU hours, ~$5.576M training cost, MIT license (confirmed from arxiv.org/html/2412.19437v1, simonwillison.net, helicone.ai). DeepSeek-R1 released January 20, 2025, MIT license, matched o1 on reasoning benchmarks (confirmed from medium.com/@mayadakhatib, deeplearning.ai, fusionchat.ai). DeepSeek-V3.1 released August 21, 2025, 671B total, 37B active, 128K context, hybrid thinking/non-thinking modes (confirmed from huggingface.co/deepseek-ai/DeepSeek-V3.1, wandb.ai, emergentmind.com, aibinger.com, livemint.com). DeepSeek-V3.2 released December 1, 2025, 671B total, 128K context, MIT license, $0.28/$0.42 per MTok (confirmed from api-docs.deepseek.com, introl.com, canopywave.com, stackviv.ai).

Qwen 3.5 (Alibaba)

Alibaba’s Qwen 3.5 series, released between February 16 and March 2, 2026, represents the state of the art in open multimodal models:

Flagship (February 16, 2026): Qwen3.5-397B-A17B, 397B total parameters, 17B active, MoE architecture, 256K context window, 201 languages
Medium series (late February 2026): 32B, 72B, and 122B variants, all vision-language models with early-fusion multimodal training and 262K context
Small series (March 2, 2026): 0.8B, 2B, 4B, and 9B dense models optimized for on-device inference

All Qwen 3.5 models are released under the Apache 2.0 license, one of the most permissive standard open-source licenses. Apache 2.0 allows commercial use, modification, and distribution with minimal restrictions (attribution and a patent grant). Unlike the Llama Community License, there are no user-count thresholds or competitive-use restrictions.

The Qwen 3.5 small series is particularly notable: the 9B model scored 81.7 on the GPQA Diamond reasoning benchmark, beating OpenAI’s GPT-OSS-120B (80.1) despite being over 13x smaller. These small models are designed to run on consumer hardware, including smartphones and laptops with just 16 GB of RAM.

Source: Qwen 3.5 flagship (397B/17B active) released February 16, 2026; small series (0.8B-9B) released March 2, 2026; all Apache 2.0 licensed; 201 languages; 256K-262K context (confirmed from ai.rs, nvidia.com, officechai.com, techcommunity.microsoft.com). Qwen3.5-9B scored 81.7 on GPQA Diamond vs. GPT-OSS-120B at 80.1 (confirmed from computertech.co, nyu.edu, officechai.com, digitalapplied.com; GPT-OSS-120B official score 80.1% from arxiv.org/html/2508.10925v1 Table 3, the 71.5% figure cited by some sources is the GPT-OSS-20B score).

Mistral (Mistral AI)

Mistral AI, a French company, has consistently released high-quality open models under permissive licenses:

Mistral Small 3 (January 30, 2025): 24B dense parameters, Apache 2.0 license, 32K context window. A latency-optimized model that set a new benchmark for sub-70B models.
Mistral Small 4 (March 16, 2026): 119B total parameters, 6B active per token, 128 experts (top-4 routing), 256K context window, Apache 2.0 license. Multimodal (text + image input). Includes a configurable “reasoning_effort” parameter that lets users choose between faster replies and deeper reasoning.

Mistral’s approach is distinctive in two ways. First, they consistently use the Apache 2.0 license, making their models among the most legally straightforward to deploy commercially. Second, their MoE architectures are extremely sparse: Mistral Small 4 activates only 6B of its 119B parameters per token (a 20:1 ratio), making it remarkably efficient to serve.

Source: Mistral Small 3 released January 30, 2025, 24B parameters, Apache 2.0, 32K context (confirmed from simonwillison.net, huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501). Mistral Small 4 released March 16, 2026, 119B total/6B active, 128 experts top-4, 256K context, Apache 2.0, multimodal, configurable reasoning_effort (confirmed from mistral.ai, the-decoder.com, testingcatalog.com, huggingface.co/mistralai/Mistral-Small-4-119B-2603-NVFP4).

OpenAI GPT-OSS: The Historic Reversal

On August 5, 2025, OpenAI released its first open-weight models since GPT-2 in 2019, marking a dramatic reversal of the company’s closed-source strategy. The GPT-OSS family includes two models, both released under the Apache 2.0 license:

gpt-oss-120b: 117B total parameters, 5.1B active per token, 128 experts (top-4 routing), 36 layers, 131K context window. Fits on a single 80 GB GPU (like an H100). Achieves near-parity with OpenAI’s o4-mini on core reasoning benchmarks.
gpt-oss-20b: 21B total parameters, 3.6B active per token, 32 experts (top-4 routing), 24 layers. Runs on edge devices with just 16 GB of memory. Delivers performance similar to OpenAI’s o3-mini.

Both models are text-only, support configurable reasoning effort (low, medium, high), and include structured output support. They use the same MoE architecture pattern that dominates the open ecosystem: large total parameter counts with very sparse activation (only 3-5% of parameters active per token).

The release was significant for several reasons. OpenAI had been the most prominent advocate of the closed-model approach, arguing that safety concerns justified keeping weights private. The GPT-OSS release signaled that even OpenAI recognized the competitive pressure from open models. Independent evaluations confirmed that gpt-oss-120b beats o3-mini but falls behind o4-mini and o3, making it the most capable model that fits on a single H100 GPU.

Source: GPT-OSS released August 5, 2025, Apache 2.0 license. gpt-oss-120b: 117B total/5.1B active, 128 experts top-4, 36 layers, 131K context, GPQA Diamond 80.1% (high, no tools). gpt-oss-20b: 21B total/3.6B active, 32 experts top-4, 24 layers, GPQA Diamond 71.5% (high, no tools), 16 GB minimum (confirmed from openai.com/index/introducing-gpt-oss, arxiv.org/html/2508.10925v1 Table 3, adgully.com, ainews.com, oproai.com, replicate.com, artificialanalysis.ai). Near-parity with o4-mini (confirmed from openai.com/index/introducing-gpt-oss). 120b beats o3-mini, most capable model on single H100 (confirmed from artificialanalysis.ai).

Other Notable Open Models

GLM-5 (Zhipu AI / Z.ai, February 11, 2026): 744B total parameters, 40B active per token, 256 experts (8 selected per token), 200K context window, released under the MIT license. Trained on 28.5 trillion tokens using Huawei Ascend chips (not NVIDIA GPUs), making it the first frontier-class model trained entirely on non-NVIDIA hardware. GLM-5 integrates DeepSeek Sparse Attention for efficient long-context inference. It scored 77.8% on SWE-bench Verified, 50.4% on Humanity’s Last Exam, and 75.9 on BrowseComp, making it one of the strongest open-weight models for agentic coding tasks.
MiniMax M2.5 (MiniMax, February 12, 2026): 229B total parameters, 10B active per token, sparse MoE architecture, 204K context window, released under a modified MIT license. Trained with large-scale reinforcement learning across hundreds of thousands of real-world environments, M2.5 is specifically optimized for agentic workloads: coding, tool use, web browsing, and multi-step task execution. It scored 80.2% on SWE-bench Verified, matching Claude Opus 4.6 (80.8% Anthropic-reported; 79.2% on vals.ai standardized) while costing roughly 1/10th to 1/20th the price at $0.15 per million input tokens. This makes M2.5 the first open-weight model to match a closed frontier model on a major coding benchmark.
Kimi K2 / K2.5 (Moonshot AI, July 2025 / January 2026): Kimi K2 is a 1 trillion total parameter MoE model with 32B active per token, 384 experts (8 selected per token plus 1 shared), released under a modified MIT license. Trained on 15.5 trillion tokens, it was the first open-weight model to reach the trillion-parameter scale. Kimi K2.5, released January 27, 2026, added native multimodal capabilities through a 400-million-parameter vision encoder (MoonViT) and introduced a “self-directed agent swarm” architecture that coordinates up to 100 parallel sub-agents. K2.5 scored 76.8% on SWE-bench Verified and 87.6 on GPQA Diamond. API pricing starts at $0.60 per million input tokens.
Gemma 3 (Google, March 12, 2025): 1B, 4B, 12B, and 27B parameter models derived from Gemini 2.0 research. Multimodal (text + image), 128K context. Released under Google’s custom Gemma license, which allows commercial use but is not Apache 2.0. Gemma models are popular for on-device deployment due to their small sizes.
DeepSeek-R1 distilled models (January 2025): DeepSeek released six distilled versions of R1 ranging from 1.5B to 70B parameters, all under the MIT license. These smaller models capture much of R1’s reasoning ability at a fraction of the size, making reasoning capabilities accessible on consumer hardware.

Source: GLM-5 released February 11, 2026, 744B total/40B active, 256 experts, 200K context window, 28.5T training tokens, MIT license, trained on Huawei Ascend chips, 77.8% SWE-bench Verified, 50.4% HLE, 75.9 BrowseComp (confirmed from buildfastwithai.com, huggingface.co/unsloth/GLM-5-GGUF, creativeainews.com, letsdatascience.com, chatbotkit.com, llm-stats.com, thenextgentechinsider.com). MiniMax M2.5 released February 12, 2026, 229B total/10B active, modified MIT license, 80.2% SWE-bench Verified, 51.3% Multi-SWE-Bench, $0.15/MTok input (confirmed from huggingface.co/MiniMaxAI/MiniMax-M2.5, winbuzzer.com, digitalapplied.com, minimax.io, thepromptbuddy.com). Kimi K2 released July 2025, 1T total/32B active, 384 experts, 15.5T training tokens, modified MIT license (confirmed from arxiv.org/html/2507.20534, huggingface.co/moonshotai/Kimi-K2-Base, scalebytech.com, apxml.com). Kimi K2.5 released January 27, 2026, native multimodal with MoonViT 400M vision encoder, agent swarm architecture, 76.8% SWE-bench Verified, 87.6 GPQA Diamond self-reported (84.1% on vals.ai standardized), $0.60/MTok input (confirmed from modelslab.com, llm-stats.com, huggingface.co/blog/mlabonne/kimik25, spheron.network, recapio.com; vals.ai standardized score from maniac.ai/blog/chinese-frontier-models-compared). Gemma 3 released March 12, 2025, 1B/4B/12B/27B parameters, multimodal, 128K context, custom Gemma license (confirmed from arxiv.org/html/2503.19786, buildfastwithai.com, google.dev/gemma/terms).

The Closed Frontier

The major closed model providers as of March 2026 are OpenAI, Anthropic, and Google (for their Gemini API models). Each offers capabilities that open models have not yet fully matched on every dimension, though the gap has narrowed to near-zero on coding and knowledge benchmarks.

OpenAI (GPT-5.4 Family)

GPT-5.4: $2.50/$15 per MTok (input/output), 1.05M token context window
GPT-5.4 mini: $0.75/$4.50 per MTok, 400K context, released March 17, 2026
GPT-5.4 nano: $0.20/$1.25 per MTok, released March 17, 2026

GPT-5.4 represents the current frontier for general-purpose language models. Its architecture details are not published, but it scores 75% on OSWorld-Verified (above the 72.4% human baseline for computer use tasks) and leads on many reasoning benchmarks. The nano variant is aggressively priced, undercutting many open-model API providers.

Note that OpenAI now straddles both sides of the open/closed divide: GPT-5.4 remains fully closed, while GPT-OSS (described above) is open-weight under Apache 2.0. This dual strategy lets OpenAI compete in both markets.

Anthropic (Claude 4.6 Family)

Claude Opus 4.6: $5/$25 per MTok, the most expensive but often highest-quality option
Claude Sonnet 4.6: $3/$15 per MTok, the most popular tier for production use
Claude Haiku 4.5: $1/$5 per MTok, optimized for speed and cost

Anthropic removed its long-context surcharge on March 13, 2026, making Claude 4.6 models more competitive for long-document workloads. Claude Opus 4.6 scored 80.8% on SWE-bench Verified according to Anthropic’s own evaluation, while the independent vals.ai standardized leaderboard reports 79.2% for the Thinking variant. Claude Sonnet 4.6 scored 79.6% on SWE-bench Verified at one-fifth the price of Opus, making it the preferred choice for many production coding workloads.

Google (Gemini Family)

Gemini 3.1 Pro: $2/$12 per MTok (standard), $4/$18 above 200K tokens, 1M context
Gemini 3 Flash: $0.50/$3 per MTok
Gemini 2.5 Flash-Lite: $0.10/$0.40 per MTok (the cheapest frontier-adjacent option)
Gemini 3.1 Flash-Lite: $0.25/$1.50 per MTok

Google’s Gemini models are tightly integrated with Google Cloud and offer strong multimodal capabilities. The Flash-Lite tier provides an extremely low-cost option for high-volume applications.

Source: GPT-5.4 pricing and specs confirmed in Chapter 24 sources. Claude Opus 4.6 80.8% SWE-bench Verified (Anthropic self-reported, confirmed from beehiiv.com/walterslabreport, therundown.ai, ayyaztech.com). vals.ai standardized leaderboard shows 79.2% for Claude Opus 4.6 Thinking (confirmed from vals.ai/benchmarks/swebench). Claude Sonnet 4.6 79.6% SWE-bench Verified (confirmed from natural20.com, digitalapplied.com, nxcode.io, caylent.com). Gemini pricing confirmed in Chapter 24 sources.

The Performance Gap: How Close Are Open Models?

The gap between open and closed models has narrowed dramatically. At the end of 2023, the best closed model scored approximately 88% on MMLU while the best open model managed roughly 70.5%, a gap of 17.5 percentage points. By early 2026, that gap is effectively zero on knowledge benchmarks and single digits on most reasoning tasks.

A comprehensive benchmark analysis by whatllm.org in January 2026 found that open-weight models had narrowed the quality gap with proprietary models to within five points on a normalized quality index. On cost, the difference is even more striking: open-source models average $0.83 per million tokens versus $6.03 for proprietary models, a 7.3x cost advantage.

def performance_comparison():
    """
    Compare open vs. closed model performance across key benchmarks.
    Numbers reflect the state of play in early 2026.
    """
    print("Open vs. Closed Model Performance (Early 2026)")
    print("=" * 75)

    print("\n  Knowledge Benchmarks (MMLU, MMLU-Pro)")
    print("  " + "-" * 55)
    print("  Gap in late 2023:  ~17.5 percentage points")
    print("  Gap in early 2026: ~0 points (effectively closed)")
    print("  Example: DeepSeek V3.2 and Qwen 3.5 match or exceed")
    print("           GPT-4-class performance on MMLU")

    print("\n  Reasoning Benchmarks (GPQA Diamond)")
    print("  " + "-" * 55)
    print("  Kimi K2.5:           87.6 (open, 1T total, self-reported)")
    print("  Qwen 3.5-9B:         81.7 (open, 9B params)")
    print("  GPT-OSS-120B:        80.1 (open, 117B total)")
    print("  GPT-OSS-20B:         71.5 (open, 21B total)")
    print("  Note: vals.ai standardized scores differ; Kimi K2.5")
    print("  scores 84.1% on vals.ai vs. 87.6% self-reported")

    print("\n  Coding Benchmarks (SWE-bench Verified)")
    print("  " + "-" * 55)
    print("  Claude Opus 4.6:     80.8% (closed, Anthropic-reported)")
    print("  MiniMax M2.5:        80.2% (open, 229B MoE, MIT license)")
    print("  Claude Sonnet 4.6:   79.6% (closed)")
    print("  GLM-5:               77.8% (open, 744B MoE, MIT license)")
    print("  GPT-5.4:             77.2% (closed)")
    print("  Kimi K2.5:           76.8% (open, 1T MoE)")
    print("  Gemini 3 Flash:      76.2% (closed)")
    print("  DeepSeek V3.2-Spec.: 73.1% (open, 671B MoE, MIT license)")

    print("\n  Cost per Million Tokens (average)")
    print("  " + "-" * 55)
    print("  Open models:   ~$0.83/MTok")
    print("  Closed models: ~$6.03/MTok")
    print("  Ratio:         7.3x cheaper for open models")

    print("\n  Key insight: Open models have reached parity on knowledge,")
    print("  have matched closed models on coding (MiniMax M2.5 80.2%")
    print("  vs. Claude Opus 4.6 80.8%), and are closing fast on the")
    print("  hardest reasoning tasks. The cost advantage is 7.3x.")

performance_comparison()

The practical implication is that for roughly 80-90% of production use cases, open models now offer comparable quality at dramatically lower cost. The remaining 10-20%, the hardest multi-step reasoning tasks and the most nuanced creative work, is where closed frontier models still justify their premium pricing. But even this gap is shrinking rapidly. On coding benchmarks, the gap has effectively closed: MiniMax M2.5’s 80.2% on SWE-bench Verified matches Claude Opus 4.6’s 80.8% (Anthropic-reported; 79.2% on the vals.ai standardized leaderboard), and GLM-5’s 77.8% surpasses GPT-5.4’s 77.2%. The question is no longer whether open models can compete with closed models, but whether the remaining advantages of closed models justify their 7x higher cost.

Source: MMLU gap narrowed from ~17.5 points (late 2023) to effectively zero (early 2026) (confirmed from letsdatascience.com). Open models average $0.83/MTok vs. $6.03 for proprietary, 7.3x cheaper (confirmed from whatllm.org). Quality gap within 5 points on normalized index (confirmed from thenextgentechinsider.com citing whatllm.org January 2026 analysis). Claude Opus 4.6 80.8% SWE-bench Verified (confirmed from therundown.ai, ayyaztech.com). Claude Sonnet 4.6 79.6% (confirmed from natural20.com, caylent.com). MiniMax M2.5 80.2% SWE-bench Verified (confirmed from huggingface.co/MiniMaxAI/MiniMax-M2.5, winbuzzer.com, digitalapplied.com). GLM-5 77.8% SWE-bench Verified (confirmed from buildfastwithai.com, chatbotkit.com, creativeainews.com). Kimi K2.5 76.8% (confirmed from modelslab.com, clore.ai). DeepSeek V3.2-Speciale 73.1% (confirmed from beebom.com, aiplanetx.com).

The Open-Source MoE Revolution

One of the most important trends in the open-weight ecosystem is the dominance of Mixture-of-Experts (MoE) architectures (covered in detail in Chapter 12). Every major open-weight frontier model released in 2025 and 2026 uses MoE. This is not a coincidence: MoE is what makes it possible to release models with frontier-level quality that ordinary organizations can actually afford to run.

The key insight is the ratio between total parameters and active parameters:

Model	Total Params	Active Params	Ratio	License
GPT-OSS-120B	117B	5.1B	22.9:1	Apache 2.0
GPT-OSS-20B	21B	3.6B	5.8:1	Apache 2.0
LLaMA 4 Scout	109B	17B	6.4:1	Llama Community
LLaMA 4 Maverick	400B	17B	23.5:1	Llama Community
DeepSeek V3.2	671B	37B	18.1:1	MIT
Qwen 3.5 Flagship	397B	17B	23.4:1	Apache 2.0
GLM-5	744B	40B	18.6:1	MIT
MiniMax M2.5	229B	10B	22.9:1	Modified MIT
Kimi K2.5	1,000B	32B	31.3:1	Modified MIT
Mistral Small 4	119B	6B	19.8:1	Apache 2.0

The “active parameters” column is what determines inference cost. A model with 400B total parameters but only 17B active per token costs roughly the same to run as a 17B dense model, while delivering quality closer to a 400B dense model. This is why MoE has become the dominant architecture for open models: it lets you ship frontier quality at a fraction of the serving cost.

DeepSeek V3’s training cost of approximately $5.576 million was a shock to the industry when announced in December 2024. For context, training GPT-4 reportedly cost over $100 million, and frontier models from OpenAI and Google are estimated to cost $200-500 million or more. DeepSeek achieved comparable quality at roughly 1/20th to 1/50th the cost, largely through MoE efficiency, FP8 mixed-precision training, and innovations like Multi-Head Latent Attention that reduce memory requirements. Moonshot AI’s Kimi K2, released in July 2025, pushed the open-weight frontier to 1 trillion total parameters while keeping active parameters at just 32B, demonstrating that the MoE approach scales to even larger model sizes.

OpenAI’s GPT-OSS models take sparsity to an extreme: gpt-oss-120b activates only 5.1B of its 117B parameters per token (a 22.9:1 ratio), meaning less than 5% of the model is active for any given token. This extreme sparsity is what allows a 117B-parameter model to fit on a single 80 GB GPU and run at competitive speeds.

This cost efficiency cascades through the entire ecosystem. When a model is cheap to train, the lab can afford to release it openly. When it is cheap to serve (because only a fraction of parameters are active), organizations can afford to self-host it. The MoE architecture is the economic engine that makes the open-weight ecosystem viable.

Source: DeepSeek V3 training cost ~$5.576M for 2.788M H800 GPU hours (confirmed from arxiv.org/html/2412.19437v1, simonwillison.net). GPT-OSS-120B: 117B total, 5.1B active, 128 experts top-4, fits on single 80GB GPU (confirmed from adgully.com, oproai.com, artificialanalysis.ai, openai.com/index/introducing-gpt-oss). Model parameter counts confirmed from sources cited in individual model sections above.

Running Models Locally

One of the most powerful advantages of open-weight models is the ability to run them on your own hardware. This means complete data privacy (nothing leaves your machine), zero per-token costs after the initial hardware investment, no rate limits, and no dependency on external services. Several tools make this practical.

llama.cpp: The Foundation

llama.cpp is a C/C++ implementation of LLM inference, started by Georgi Gerganov in March 2023. When Meta released the original LLaMA weights, Gerganov ported the inference code to pure C++ in a single weekend, enabling the model to run on a MacBook CPU without any dependency on Python or PyTorch. This was a pivotal moment: it proved that large language models could run on consumer hardware.

llama.cpp introduced the GGUF (GGML Unified Format) file format, which has become the standard for local LLM deployment. GGUF files contain the model weights in a quantized format along with metadata (tokenizer configuration, model architecture details, quantization parameters) in a single self-contained file. You download one GGUF file and you have everything needed to run the model.

The project supports a wide range of hardware:

CPU inference with AVX2, AVX-512, and ARM NEON optimizations
GPU acceleration via NVIDIA CUDA, Apple Metal, AMD ROCm, and Vulkan
Hybrid CPU+GPU inference (offloading some layers to GPU, keeping others on CPU)

As of March 2026, llama.cpp had nearly 98,000 stars on GitHub (97,941 as of March 15, 2026), making it one of the most popular open-source AI projects. It supports models from all major open families: LLaMA, DeepSeek, Qwen, Mistral, Gemma, GPT-OSS, and many others.

Source: llama.cpp started March 2023 by Georgi Gerganov, pure C/C++ implementation of LLaMA inference (confirmed from wikipedia.org/wiki/Llama.cpp, vife.ai). GGUF format created by Gerganov as successor to GGML (confirmed from gitbook.io, panaversity.org). 97,941 GitHub stars as of March 15, 2026 (confirmed from evanli.github.io/Github-Ranking).

Ollama: One-Command Simplicity

Ollama wraps llama.cpp in a user-friendly command-line interface that makes running local models as simple as a single command. It handles model downloading, format conversion, GPU detection, and serves an OpenAI-compatible REST API on localhost:11434.

def ollama_quickstart():
    """
    Show how to get started with Ollama for local LLM inference.
    These are shell commands, not Python, but shown here for reference.
    """
    print("Running a Model Locally with Ollama")
    print("=" * 60)

    print("\n  Step 1: Install Ollama (one command)")
    print("  $ curl -fsSL https://ollama.com/install.sh | sh")

    print("\n  Step 2: Run a model (downloads automatically)")
    print("  $ ollama run llama4")
    print("  # Downloads LLaMA 4 Scout and starts a chat session")

    print("\n  Step 3: Use the API (OpenAI-compatible)")
    print("  $ curl http://localhost:11434/v1/chat/completions \\")
    print("    -H 'Content-Type: application/json' \\")
    print("    -d '{\"model\": \"llama4\", \"messages\": [{\"role\": \"user\",")
    print("          \"content\": \"Explain MoE in one paragraph\"}]}'")

    print("\n  Supported models include:")
    models = [
        ("llama4", "LLaMA 4 Scout (109B MoE, 17B active)"),
        ("deepseek-v3", "DeepSeek V3 (671B MoE, 37B active)"),
        ("qwen3.5", "Qwen 3.5 (various sizes)"),
        ("mistral-small", "Mistral Small 4 (119B MoE, 6B active)"),
        ("gemma3", "Gemma 3 (1B to 27B dense)"),
    ]
    for cmd, desc in models:
        print(f"    ollama run {cmd:<20s} # {desc}")

    print("\n  Hardware requirements (approximate):")
    print("    7B model:  8 GB RAM minimum")
    print("    13B model: 16 GB RAM minimum")
    print("    70B model: 64 GB RAM (or GPU with 48+ GB VRAM)")

ollama_quickstart()

Ollama supports macOS, Linux, and Windows. As of March 2026, it sees over 11 million downloads per month with 29-30% month-over-month growth, and has become the de facto standard for running local LLMs alongside LM Studio (which provides a desktop GUI instead of a CLI).

Source: Ollama runs LLMs locally with OpenAI-compatible API on localhost:11434, supports LLaMA, Qwen, DeepSeek, Gemma, Mistral (confirmed from lobehub.com, chatgate.ai, nerdleveltech.com). 11.1M downloads per month as of March 2026 with 29% month-over-month growth (confirmed from ai-buzz.com/companies/ollama).

LM Studio: The Desktop GUI

LM Studio is a free desktop application that provides a graphical interface for downloading and running open models locally. It uses llama.cpp as its inference engine under the hood but adds a polished chat interface, a built-in model browser connected to Hugging Face, and point-and-click quantization settings. LM Studio supports Windows, macOS (Apple Silicon), and Linux, with GPU acceleration via CUDA, Metal, Vulkan, and ROCm.

LM Studio is particularly useful for non-developers who want to experiment with local models without touching the command line. It also exposes an OpenAI-compatible local server, so applications built for the OpenAI API can be pointed at a local model with a single configuration change.

Source: LM Studio is a free desktop app for local LLM inference, uses llama.cpp engine, supports Windows/macOS/Linux, GPU acceleration via CUDA/Metal/Vulkan/ROCm, OpenAI-compatible API (confirmed from nerdleveltech.com, elephas.app, fundesk.io, vife.ai).

vLLM: Production-Grade Serving

For production deployments (serving open models to many users simultaneously), vLLM is the leading open-source serving framework. Covered in detail in Chapter 24, vLLM uses PagedAttention for efficient KV cache management, continuous batching for high throughput, and supports tensor parallelism for splitting large models across multiple GPUs. It exposes an OpenAI-compatible API, making it a drop-in replacement for the OpenAI API in existing applications.

vLLM is the right choice when you need to serve an open model to hundreds or thousands of concurrent users. Ollama and LM Studio are designed for single-user local inference; vLLM is designed for multi-user production serving.

Source: vLLM is the leading open-source LLM serving engine, uses PagedAttention (confirmed from vllm.ai, perficient.com, startuphub.ai).

Quantization for Local Deployment

Running a large model locally requires fitting it into your available memory (RAM or GPU VRAM). A 70-billion-parameter model in 16-bit precision requires approximately 140 GB just for the weights, far more than any consumer GPU. Quantization solves this by reducing the precision of each weight from 16 bits to 8, 4, or even 2 bits, shrinking the model proportionally.

Chapter 24 covered quantization from the serving infrastructure perspective. Here we focus on the practical aspects of quantizing open models for local use.

The Quantization Formats

Three quantization approaches dominate the open-model ecosystem:

GGUF (for llama.cpp / Ollama / LM Studio)

GGUF is the format used by llama.cpp and all tools built on it (Ollama, LM Studio). It supports a range of quantization levels, each trading quality for size:

Quant Level	Bits/Weight	70B Model Size	Quality Impact
Q2_K	~2.5	~23 GB	Significant degradation
Q3_K_M	~3.4	~31 GB	Noticeable on complex tasks
Q4_K_M	~4.5	~40 GB	Minimal for most tasks
Q5_K_M	~5.5	~48 GB	Very close to FP16
Q6_K	~6.5	~55 GB	Nearly indistinguishable
Q8_0	8.0	~70 GB	Effectively lossless
FP16	16.0	~140 GB	Full precision (baseline)

Q4_K_M is the most popular choice for local deployment: it reduces a 70B model from 140 GB to approximately 40 GB (fitting on a single 48 GB GPU or in system RAM), with minimal quality loss for most tasks.

GPTQ (for GPU inference)

GPTQ (Frantar and Alistarh, arXiv:2210.17323, 2022) is a one-shot weight quantization method that uses approximate second-order information to minimize quantization error. It can quantize a 175B-parameter model in approximately four GPU hours, reducing weights to 3 or 4 bits with negligible accuracy degradation. GPTQ is designed specifically for GPU inference and is supported by vLLM and other GPU-based serving frameworks.

AWQ (for GPU inference)

AWQ (Activation-aware Weight Quantization, Lin et al., arXiv:2306.00978, 2023) takes a different approach: instead of treating all weights equally, it identifies the 1% of weights that are most important (based on activation magnitudes) and protects them during quantization. This produces better quality than GPTQ at the same bit width, particularly at very low precision (3-4 bits). AWQ is also supported by vLLM and is generally preferred over GPTQ for new deployments.

def quantization_decision_guide():
    """
    Help users choose the right quantization format for their use case.
    """
    print("Quantization Decision Guide")
    print("=" * 65)

    print("\n  Where are you running the model?")
    print("  " + "-" * 50)

    scenarios = [
        ("CPU only (laptop, no GPU)",
         "GGUF Q4_K_M via Ollama or LM Studio",
         "Best CPU optimization, runs on any hardware"),
        ("Consumer GPU (8-24 GB VRAM)",
         "GGUF Q4_K_M via Ollama, or AWQ INT4 via vLLM",
         "Fits larger models in limited VRAM"),
        ("Professional GPU (48+ GB VRAM)",
         "AWQ INT4 or GPTQ INT4 via vLLM",
         "Best throughput for serving multiple users"),
        ("Multi-GPU server (production)",
         "FP8 or AWQ INT4 via vLLM with tensor parallelism",
         "Maximum throughput and quality balance"),
    ]

    for scenario, recommendation, reason in scenarios:
        print(f"\n  Scenario: {scenario}")
        print(f"    Use: {recommendation}")
        print(f"    Why: {reason}")

    print("\n  Rule of thumb:")
    print("    - Single user, local: GGUF via Ollama or LM Studio")
    print("    - Multiple users, production: AWQ/GPTQ via vLLM")
    print("    - If unsure, start with GGUF Q4_K_M in Ollama")

quantization_decision_guide()

Source: GPTQ (arXiv:2210.17323, Frantar and Alistarh, 2022) quantizes 175B model in ~4 GPU hours to 3-4 bits with negligible accuracy loss (confirmed from arxiv.org/abs/2210.17323, ar5iv.labs.arxiv.org). AWQ (arXiv:2306.00978, Lin et al., 2023) protects 1% salient weights based on activation magnitudes (confirmed from arxiv.org/abs/2306.00978, huggingface.co/papers/2306.00978). GGUF quantization levels and sizes confirmed from practical deployment guides (confirmed from tonisagrista.com, youngju.dev, dasroot.net).

Fine-Tuning Open Models

Running an open model as-is gives you a general-purpose assistant. Fine-tuning lets you specialize it for your specific domain, writing style, or task. Instead of training a model from scratch (which costs millions of dollars), you take a pre-trained model and continue training it on a much smaller, task-specific dataset. The model retains its general knowledge while learning the patterns in your data.

Fine-tuning is one of the most compelling reasons to choose open models over closed APIs. With a closed model, you are limited to prompt engineering (crafting system prompts and few-shot examples). With an open model, you can modify the model’s weights directly, embedding your domain knowledge into the model itself.

Full Fine-Tuning vs. Parameter-Efficient Fine-Tuning

Full fine-tuning updates every parameter in the model. For a 70B model, this means updating 70 billion numbers, which requires enormous GPU memory (multiple high-end GPUs) and significant compute time. Full fine-tuning produces the best results but is impractical for most organizations.

Parameter-Efficient Fine-Tuning (PEFT) updates only a small fraction of the model’s parameters, reducing memory and compute requirements by 90-99% while achieving results within 1-2% of full fine-tuning quality. The two dominant PEFT methods are LoRA and QLoRA.

LoRA: Low-Rank Adaptation

LoRA (Low-Rank Adaptation), introduced by Hu et al. at Microsoft in 2021 (arXiv:2106.09685), is the most widely used fine-tuning technique for open LLMs. The core idea is elegant: instead of updating the full weight matrices during fine-tuning, LoRA freezes the original weights and injects small, trainable low-rank matrices into each layer.

Here is how it works. A weight matrix in a Transformer layer might have shape [8192, 8192], containing 67 million parameters. LoRA decomposes the update to this matrix into two much smaller matrices: one of shape [8192, r] and one of shape [r, 8192], where r (the rank) is typically 8, 16, or 32. With rank 16, the two adapter matrices contain only 8192 x 16 + 16 x 8192 = 262,144 parameters, a 256x reduction from the original 67 million.

During inference, the LoRA adapter matrices are multiplied together and added to the original frozen weights, so there is zero additional latency. You can also swap different LoRA adapters in and out without reloading the base model, enabling one base model to serve multiple specialized tasks.

The original LoRA paper demonstrated that for GPT-3 (175B parameters), LoRA reduced the number of trainable parameters by 10,000x and the GPU memory requirement by 3x compared to full fine-tuning, with comparable quality on downstream tasks.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """
    A linear layer with LoRA adaptation.
    The original weight matrix W is frozen.
    Two small matrices A and B are trained.
    Output = x @ W + x @ A @ B (scaled by alpha/rank).
    """
    def __init__(self, in_features: int, out_features: int, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.W = nn.Linear(in_features, out_features, bias=False)
        self.W.weight.requires_grad = False  # Freeze original weights

        self.A = nn.Linear(in_features, rank, bias=False)
        self.B = nn.Linear(rank, out_features, bias=False)
        nn.init.kaiming_normal_(self.A.weight)
        nn.init.zeros_(self.B.weight)

        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base_output = self.W(x)
        lora_output = self.B(self.A(x)) * self.scale
        return base_output + lora_output

# Demonstrate the parameter savings
in_dim, out_dim, rank = 8192, 8192, 16
layer = LoRALinear(in_dim, out_dim, rank=rank)

original_params = in_dim * out_dim
lora_params = (in_dim * rank) + (rank * out_dim)
ratio = original_params / lora_params

print(f"Original weight matrix: {in_dim} x {out_dim} = {original_params:,} parameters")
print(f"LoRA adapters (rank {rank}): {in_dim}x{rank} + {rank}x{out_dim} = {lora_params:,} parameters")
print(f"Parameter reduction: {ratio:.0f}x fewer trainable parameters")
print(f"Memory for original: {original_params * 2 / 1e9:.2f} GB (FP16)")
print(f"Memory for LoRA:     {lora_params * 2 / 1e6:.1f} MB (FP16)")

Source: LoRA introduced by Hu et al. (arXiv:2106.09685, 2021), reduces trainable parameters by 10,000x and GPU memory by 3x for GPT-3 175B (confirmed from arxiv.org/html/2106.09685v1, huggingface.co/papers/2106.09685, microsoft.com/en-us/research/publication/lora-low-rank-adaptation-of-large-language-models).

QLoRA: Fine-Tuning on a Single GPU

QLoRA (Dettmers et al., arXiv:2305.14314, NeurIPS 2023) combines LoRA with 4-bit quantization to make fine-tuning accessible on consumer hardware. The key innovation: the base model is loaded in 4-bit precision (using a new data type called NormalFloat4, or NF4), and LoRA adapters are trained in 16-bit precision on top of the quantized base.

The result: QLoRA can fine-tune a 65B-parameter model on a single 48 GB GPU while preserving full 16-bit fine-tuning quality. This was transformative. Before QLoRA, fine-tuning a 65B model required multiple expensive GPUs. After QLoRA, a single NVIDIA A6000 (48 GB, approximately $1,000-1,500 used) or even a consumer RTX 4090 (24 GB, for smaller models) was sufficient.

def qlora_memory_comparison():
    """
    Compare memory requirements for different fine-tuning approaches
    on a 70B parameter model.
    """
    print("Fine-Tuning Memory Requirements: 70B Parameter Model")
    print("=" * 65)

    approaches = [
        ("Full Fine-Tuning (FP16)",
         "~280 GB",
         "4x A100 80GB or 8x A6000 48GB",
         "Updates all 70B parameters"),
        ("LoRA (FP16 base)",
         "~140 GB + adapters",
         "2x A100 80GB",
         "Freezes base, trains ~0.1% of params"),
        ("QLoRA (4-bit base + FP16 adapters)",
         "~35 GB + adapters",
         "1x A6000 48GB or 1x A100 40GB",
         "4-bit base, trains ~0.1% in FP16"),
        ("QLoRA (4-bit, smaller model: 7B)",
         "~4 GB + adapters",
         "1x RTX 4060 8GB (consumer GPU)",
         "Fine-tune a 7B model at home"),
    ]

    for approach, memory, hardware, note in approaches:
        print(f"\n  {approach}")
        print(f"    Memory:   {memory}")
        print(f"    Hardware: {hardware}")
        print(f"    Note:     {note}")

    print("\n  QLoRA made fine-tuning accessible to individual developers")
    print("  and small teams, not just well-funded AI labs.")

qlora_memory_comparison()

Source: QLoRA (arXiv:2305.14314, Dettmers et al., NeurIPS 2023) fine-tunes a 65B model on a single 48GB GPU using 4-bit NormalFloat quantization with LoRA adapters, preserving full 16-bit fine-tuning quality (confirmed from arxiv.org/abs/2305.14314, hf.co/papers/2305.14314, dl.acm.org/doi/10.5555/3666122.3666563).

Practical Fine-Tuning with Unsloth

Unsloth is an open-source library that optimizes LoRA and QLoRA fine-tuning by rewriting PyTorch operations as custom Triton kernels. It achieves up to 2x faster training and 70% less VRAM usage compared to standard Hugging Face training, with no loss in accuracy. As of March 2026, Unsloth supports models from all major open families (LLaMA, DeepSeek, Qwen, Gemma, GPT-OSS) and has released Unsloth Studio (beta, March 18, 2026), an open-source, no-code web interface for training, running, and exporting models locally. Studio includes visual dataset creation via graph-node workflows (Data Recipes, powered by NVIDIA DataDesigner), self-healing tool calling, code execution in a sandbox, and support for text, vision, TTS audio, and embedding models. It runs on Windows, macOS, and Linux, with CPU-only mode for chat inference. Earlier additions include 500K-context fine-tuning (December 2025), FP8 reinforcement learning with GRPO on a single RTX 4090 (November 2025), vision RL fine-tuning (September 2025), and 3x faster training kernels (December 2025).

def fine_tuning_example():
    """
    Show a minimal QLoRA fine-tuning setup using Unsloth.
    This is runnable code (requires: pip install unsloth).
    """
    code = '''
# pip install unsloth
from unsloth import FastLanguageModel
import torch

# Step 1: Load a pre-trained model in 4-bit
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B",
    max_seq_length=2048,
    load_in_4bit=True,       # QLoRA: load base in 4-bit
)

# Step 2: Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                     # LoRA rank
    lora_alpha=32,            # Scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0,
    bias="none",
)

# Step 3: Prepare your training data
# Format: list of dicts with "instruction" and "output" keys
training_data = [
    {"instruction": "Summarize this legal contract clause...",
     "output": "This clause establishes..."},
    # ... hundreds or thousands of examples
]

# Step 4: Train with Hugging Face Trainer
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    train_dataset=training_data,  # Your dataset here
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        max_steps=100,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        output_dir="outputs",
    ),
)
trainer.train()

# Step 5: Save the LoRA adapter (small file, ~50-200 MB)
model.save_pretrained("my-legal-adapter")

# Step 6: Export to GGUF for local deployment
model.save_pretrained_gguf(
    "my-legal-model",
    tokenizer,
    quantization_method="q4_k_m",
)
# Result: a single GGUF file you can run with Ollama
'''
    print("QLoRA Fine-Tuning with Unsloth (Runnable Example)")
    print("=" * 60)
    print(code)
    print("  This fine-tunes a Qwen3-8B model on your custom data,")
    print("  then exports it as a GGUF file for local deployment.")
    print("  Total time: ~30-60 minutes on a single consumer GPU.")
    print("  VRAM required: ~6-8 GB for an 8B model with QLoRA.")

fine_tuning_example()

The fine-tuning workflow for open models in 2026 is straightforward:

Choose a base model. Pick the smallest model that meets your quality requirements. For most tasks, a 7-8B model fine-tuned on domain-specific data outperforms a general-purpose 70B model.
Prepare your dataset. Collect 500-10,000 examples of the input-output pairs you want the model to learn. Quality matters more than quantity: 1,000 high-quality examples often outperform 10,000 noisy ones.
Fine-tune with QLoRA. Use Unsloth or Hugging Face’s PEFT library. Training takes 30 minutes to a few hours on a single GPU.
Export and deploy. Save the LoRA adapter (a small file, typically 50-200 MB) or merge it into the base model and export as GGUF for local deployment.
Evaluate. Test on held-out examples. If quality is insufficient, add more training data or increase the LoRA rank.

Source: Unsloth achieves up to 2x faster training and 70% less VRAM, supports LLaMA/DeepSeek/Qwen/Gemma/GPT-OSS, released Unsloth Studio beta March 18, 2026 as open-source no-code web UI with Data Recipes (NVIDIA DataDesigner), self-healing tool calling, code execution, text/vision/TTS/embedding support (confirmed from github.com/unslothai/unsloth, unslothai.substack.com, thenextgentechinsider.com, alternativeto.net, gigazine.net, junia.ai, railway.app, orendra.com, clore.ai, huggingface.co/blog/unsloth-trl, unsloth.ai/blog).

When to Choose Open vs. Closed

The decision between open and closed models is not ideological; it is practical. Different situations call for different approaches, and many organizations use both.

Choose Open Models When:

Data privacy is non-negotiable. If your data cannot leave your network (healthcare, finance, legal, government), open models running on your own infrastructure are the only option. No data is sent to any external API.
You need customization beyond prompting. If system prompts and few-shot examples are not enough to get the behavior you need, fine-tuning an open model lets you embed domain knowledge directly into the weights.
Cost at scale matters. Above approximately 10 million tokens per day (as discussed in Chapter 24), self-hosting open models becomes significantly cheaper than API pricing, with savings of 30-80%.
You need predictable latency and availability. Self-hosted models do not have rate limits, do not experience provider outages, and do not change behavior when the provider updates the model behind the API.
You want to avoid vendor lock-in. With open models, you can switch between providers, run on different hardware, or move between cloud and on-premises without changing your model.

Choose Closed Models When:

You need absolute frontier performance. For the hardest multi-step reasoning tasks and the most nuanced creative work, closed frontier models (GPT-5.4, Claude Opus 4.6) still hold a slight edge. On coding, open models like MiniMax M2.5 have reached parity. On general reasoning and instruction following, closed models retain a small but measurable lead.
You want zero infrastructure overhead. If you do not have (or do not want) a team to manage GPU servers, model updates, and serving infrastructure, a managed API is the right choice. You pay per token and the provider handles everything.
Your volume is low. Below approximately 10 million tokens per day, the engineering cost of self-hosting typically exceeds the API cost savings.
You need the latest capabilities immediately. Closed model providers ship new features (tool use improvements, new modalities, safety updates) faster than the open ecosystem can replicate them.
Compliance requires a vendor relationship. Some enterprise compliance frameworks require a contractual relationship with the AI provider, including SLAs, data processing agreements, and audit rights. Closed API providers offer these; open-weight models do not.

The Hybrid Approach

Many organizations in 2026 use a hybrid strategy: open models for high-volume, cost-sensitive, or privacy-critical workloads, and closed models for the hardest tasks or when they need the absolute best quality. This is not a compromise; it is an optimization. A typical setup might use:

Qwen 3.5 9B (self-hosted) for customer-facing chatbot responses (high volume, cost-sensitive)
DeepSeek V3.2 (API at $0.28/$0.42) for internal analysis tasks (moderate volume, good quality)
Claude Opus 4.6 (API at $5/$25) for complex legal document review (low volume, quality-critical)

def cost_comparison_hybrid():
    """
    Compare costs for a hybrid deployment strategy processing
    100 million tokens per day across three tiers.
    """
    print("Hybrid Deployment Cost Comparison")
    print("(100M tokens/day total, split across tiers)")
    print("=" * 65)

    tiers = [
        ("High-volume chatbot (80M tok/day)",
         "Qwen 3.5 9B, self-hosted on 2x B200",
         2 * 6.50 * 24,  # 2 GPUs at $6.50/hr
         80_000_000 / 1_000_000 * 3.0,  # If using Claude Sonnet instead
         ),
        ("Internal analysis (15M tok/day)",
         "DeepSeek V3.2 API ($0.28/$0.42)",
         15_000_000 / 1_000_000 * 0.35,  # Blended ~$0.35/MTok
         15_000_000 / 1_000_000 * 9.0,  # If using GPT-5.4 instead
         ),
        ("Legal review (5M tok/day)",
         "Claude Opus 4.6 API ($5/$25)",
         5_000_000 / 1_000_000 * 15.0,  # Blended ~$15/MTok
         5_000_000 / 1_000_000 * 15.0,  # Same (already using best)
         ),
    ]

    total_hybrid = 0
    total_closed = 0

    for desc, solution, hybrid_cost, closed_cost in tiers:
        print(f"\n  {desc}")
        print(f"    Solution: {solution}")
        print(f"    Daily cost (hybrid):     ${hybrid_cost:,.0f}")
        print(f"    Daily cost (all closed): ${closed_cost:,.0f}")
        total_hybrid += hybrid_cost
        total_closed += closed_cost

    print(f"\n  {'=' * 50}")
    print(f"  Total daily cost (hybrid):     ${total_hybrid:,.0f}")
    print(f"  Total daily cost (all closed): ${total_closed:,.0f}")
    print(f"  Monthly savings:               ${(total_closed - total_hybrid) * 30:,.0f}")
    savings_pct = (1 - total_hybrid / total_closed) * 100
    print(f"  Savings:                       {savings_pct:.0f}%")

cost_comparison_hybrid()

The Ecosystem: Hugging Face and Model Distribution

The infrastructure that makes open models accessible is as important as the models themselves. Hugging Face is the central hub of the open-model ecosystem. As of March 2026, it hosts over 2.6 million public models (2,693,054 as of mid-March), more than 500,000 public datasets, and serves 13 million users. It is, in effect, the GitHub of machine learning.

When a lab releases an open model, the typical distribution path is:

Weights uploaded to Hugging Face in the original format (usually PyTorch safetensors)
Community members create quantized versions (GGUF, GPTQ, AWQ) within hours
Ollama and LM Studio add support within days
vLLM and other serving frameworks add support within days to weeks
Cloud providers (AWS, GCP, Azure) add hosted versions within weeks

This ecosystem means that a model released by a Chinese lab (DeepSeek, Qwen) or a French startup (Mistral) becomes globally accessible on every major platform within days. The speed of this distribution pipeline is one of the open ecosystem’s greatest strengths.

Hugging Face also hosts the Open LLM Leaderboard, which provides standardized benchmark evaluations for open models. This creates a transparent, reproducible way to compare models, something that is much harder with closed models (where benchmark results are self-reported by the provider).

Source: Hugging Face hosts over 2.6 million public models (2,693,054 as of mid-March 2026), 500,000+ datasets, 13 million users (confirmed from huggingface.co/MODELS showing 2,693,054, huggingface.co/blog/huggingface/state-of-os-hf-spring-2026 reporting 2M+ public models and 13M users in 2025).

Key Takeaways

“Open source” in AI is a spectrum, not a binary. Open weights (downloadable model parameters) is the most common form. True open source by the OSI’s OSAID 1.0 definition (weights + training data + code + full reproducibility) remains rare. Most “open” models are technically open-weight with varying license restrictions.
Five families dominate the open-weight landscape in March 2026, with several more close behind. LLaMA 4 (Meta, Llama Community License, 109B-400B MoE), DeepSeek V3.2 (MIT license, 671B MoE, $5.576M training cost), Qwen 3.5 (Alibaba, Apache 2.0, 0.8B-397B), Mistral Small 4 (Apache 2.0, 119B MoE, 6B active), and GPT-OSS (OpenAI, Apache 2.0, 21B-117B MoE, first open-weight release since GPT-2 in 2019). Notable additions in early 2026 include GLM-5 (Zhipu AI, MIT, 744B MoE, 77.8% SWE-bench Verified, trained on Huawei chips), MiniMax M2.5 (modified MIT, 229B MoE, 80.2% SWE-bench Verified), and Kimi K2.5 (Moonshot AI, modified MIT, 1T MoE, first open model at the trillion-parameter scale). DeepSeek, Qwen, Mistral, GPT-OSS, and GLM-5 all use permissive licenses (MIT or Apache 2.0).
The performance gap between open and closed models has effectively closed. On knowledge benchmarks (MMLU), the gap is zero. On reasoning tasks, it is single digits. On coding benchmarks (SWE-bench Verified), open models now match closed models: MiniMax M2.5 scores 80.2% (open) vs. Claude Opus 4.6 at 80.8% (Anthropic-reported; 79.2% on vals.ai standardized), and GLM-5 at 77.8% surpasses GPT-5.4 at 77.2%. Open models average 7.3x cheaper per token than closed models.
MoE is the economic engine of the open ecosystem. Every major open frontier model uses Mixture-of-Experts, activating only 3-20% of total parameters per token. This makes frontier-quality models affordable to both train (DeepSeek V3 cost ~$5.6M vs. $100M+ for comparable closed models) and serve (inference cost proportional to active, not total, parameters). MiniMax M2.5 takes efficiency to an extreme: 229B total parameters with only 10B active (a 22.9:1 ratio), yet it matches Claude Opus 4.6 on SWE-bench Verified. GPT-OSS-120B fits on a single H100 with 117B total but only 5.1B active.
Three tools dominate local model deployment. llama.cpp (C/C++ inference engine, started March 2023, nearly 98K GitHub stars) provides the foundation. Ollama (11M+ monthly downloads) wraps it in a one-command CLI. LM Studio provides a desktop GUI. All use the GGUF format for quantized model storage.
Quantization makes large models fit on consumer hardware. GGUF Q4_K_M reduces a 70B model from 140 GB to ~40 GB with minimal quality loss. GPTQ and AWQ target GPU inference at INT4 precision. For production serving, vLLM supports all major quantization formats.
LoRA and QLoRA make fine-tuning accessible. LoRA (Hu et al., 2021) reduces trainable parameters by up to 10,000x by injecting small low-rank adapter matrices into frozen model weights. QLoRA (Dettmers et al., 2023) adds 4-bit quantization, enabling fine-tuning of a 65B model on a single 48 GB GPU. Unsloth further optimizes this with 2x speed and 70% less VRAM.
Choose open models for privacy, customization, cost at scale, and vendor independence. Choose closed models for absolute frontier performance, zero infrastructure overhead, low volume, and immediate access to the latest capabilities. Many organizations use a hybrid approach: open models for high-volume workloads, closed models for the hardest tasks.
Hugging Face is the distribution hub, hosting 2.6M+ models and 500K+ datasets for 13M users. New open models become globally accessible across all major platforms within days of release.
Even OpenAI has gone open. The August 2025 release of GPT-OSS under Apache 2.0 marked a historic shift: the company most associated with closed-source AI acknowledged the competitive necessity of open-weight models. Every major AI lab now releases at least some open-weight models. Chinese labs (DeepSeek, Qwen, Zhipu AI, MiniMax, Moonshot AI) have been particularly aggressive, releasing frontier-class models under MIT or Apache 2.0 licenses and driving a wave of open innovation that has reshaped the competitive landscape.

What’s Next

You now understand the divide between open and closed models: what “open” actually means, the major model families and their licenses, how to run models locally, the quantization techniques that make this practical, and the fine-tuning methods that let you customize open models for your needs. In Chapter 26, we will confront the hardest questions in the field: why models hallucinate, how alignment techniques try to make models safe and honest, the ongoing arms race between jailbreaking and safety measures, and the fundamental limitations of what language models can and cannot do.

Chapter 24. Serving Infrastructure, From GPU to API Response Chapter 26. Safety, Alignment & Limitations