Chapter 22. Native Multimodal Models: The 2026 Standard
In Chapter 21, you learned how language models process images by connecting a separate vision encoder to a text model. That approach works, but it has a fundamental limitation: the vision encoder and language model were trained separately, then stitched together. The visual representations and textual representations live in different spaces, connected only by a thin projection layer or cross-attention mechanism. In 2026, the frontier has moved beyond this “bolted-on” approach. The leading models are now natively multimodal: they process text, images, audio, and video through a unified architecture, trained from scratch on interleaved multimodal data. This chapter explains what “native multimodal” actually means, how these unified architectures work, and why they enable capabilities that were impossible with the older approach.
The Old Way: Separate Models Stitched Together
Before we examine native multimodal architectures, let us be precise about what they replace.
The traditional approach to multimodal AI used a pipeline of separate models:
For vision-language tasks: A pretrained vision encoder (like CLIP or SigLIP) extracts image features. A projection layer maps those features into the language model’s embedding space. The language model processes the combined sequence.
For speech-to-text: A separate speech recognition model (like Whisper) transcribes audio to text. The text is then fed to the language model.
For text-to-speech: The language model generates text. A separate text-to-speech model converts that text to audio.
For image generation: The language model generates a text description or prompt. A separate diffusion model (like DALL-E or Stable Diffusion) generates the image from that prompt.
This pipeline approach has several problems:
Information loss at boundaries. When audio is transcribed to text before being processed by the language model, the model loses access to tone, emotion, emphasis, and speaker identity. When images are encoded by a separately trained vision encoder, the visual representations may not align perfectly with the language model’s internal concepts.
Latency accumulation. Each model in the pipeline adds latency. For voice interaction, the old ChatGPT pipeline (before GPT-4o) used three models: Whisper for speech-to-text, GPT-3.5 or GPT-4 for text processing, and a TTS model for speech output. The average response time was 2.8 seconds with GPT-3.5 and 5.4 seconds with GPT-4.
No cross-modal reasoning during generation. When a diffusion model generates an image from a text prompt, it cannot reason about the image as it generates it. It cannot say “wait, this does not match what I meant” and adjust. The language model and image generator are separate systems that cannot communicate during the generation process.
Training inefficiency. Each model is trained separately on different data. The vision encoder learns from image-text pairs. The language model learns from text. The speech model learns from audio-text pairs. There is no joint learning that could discover deeper cross-modal relationships.
Source: GPT-4o pipeline comparison: “Previously, Voice Mode was powered by a pipeline of three separate models: one simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio” (confirmed from ritvik19.medium.com/papers-explained-185-gpt-4o). Average response time 2.8 seconds for GPT-3.5 and 5.4 seconds for GPT-4 in the pipeline approach (confirmed from business-standard.com, news.ycombinator.com/item?id=40508755).
The New Way: Unified Architectures
A natively multimodal model processes all modalities through a single neural network, trained end-to-end on mixed-modality data. There is no pipeline. Audio goes in, audio comes out. Images go in, images come out. Text, images, and audio can be interleaved in any combination, both as input and output.
The key architectural insight is that all modalities can be represented as sequences of tokens (or continuous embeddings) that a Transformer can process. Text is already tokenized. Images can be tokenized by dividing them into patches (as you learned in Chapter 21) or by using a learned tokenizer that converts image patches into discrete codes. Audio can be tokenized by converting waveforms into discrete acoustic tokens. Once everything is tokens, a single Transformer can process them all.
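As a minimal sketch of the patch-based route, the snippet below splits an image array into a sequence of flattened patches using NumPy. The function name `patchify`, the 224 x 224 input, and the 16-pixel patch size are illustrative assumptions, not details of any particular model.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patches."""
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    rows, cols = height // patch_size, width // patch_size
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch, in raster order (left to right, top to bottom)
    return patches.reshape(rows * cols, patch_size * patch_size * channels)

# Hypothetical input: a 224x224 RGB image filled with increasing values
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
patch_sequence = patchify(image, patch_size=16)
print(patch_sequence.shape)  # (196, 768): 14 x 14 patches of 16*16*3 values
```

Each row of `patch_sequence` plays the role of one position in the Transformer's input sequence, before any learned embedding or quantization is applied.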
GPT-4o: The First Mainstream Omnimodal Model
GPT-4o (the “o” stands for “omni”), released by OpenAI on May 13, 2024, was among the first widely deployed models to demonstrate native multimodality across text, vision, and audio. GPT-4o is trained end-to-end across all three, so all inputs and outputs are processed by a single neural network.
The difference is dramatic. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is comparable to human response time in conversation. The old pipeline averaged 5.4 seconds with GPT-4 and 2.8 seconds with GPT-3.5, so the 320-millisecond average is roughly 17x faster than the GPT-4 pipeline and about 9x faster than the GPT-3.5 pipeline.
More importantly, GPT-4o can perceive and generate audio with nuance that was impossible with the pipeline approach. It can detect emotion in a speaker’s voice and respond with appropriate emotional tone. It can be interrupted mid-sentence and respond naturally. It can generate speech with laughter, singing, or dramatic emphasis, not just flat text-to-speech output.
def compare_voice_latency():
    """
    Compare voice response latency between pipeline and native approaches.
    """
    print("Voice Response Latency Comparison")
    print("=" * 55)
    # (approach, reported latency, notes)
    approaches = [
        ("Pipeline (GPT-3.5 era)", "2,800 ms", "3 models in sequence"),
        ("Pipeline (GPT-4 era)", "5,400 ms", "3 models in sequence"),
        ("Native (GPT-4o)", "320 ms avg", "Single end-to-end model"),
        ("Native (GPT-4o minimum)", "232 ms", "Best case"),
        ("Human conversation", "~300 ms", "Natural response time"),
    ]
    print(f"{'Approach':<30} {'Latency':>12} {'Notes':<25}")
    print("-" * 70)
    for approach, latency, notes in approaches:
        print(f"{approach:<30} {latency:>12} {notes:<25}")
    print("\n GPT-4o achieves human-like response times by eliminating")
    print(" the transcription and TTS steps entirely.")

compare_voice_latency()

OpenAI has not published GPT-4o's full architecture, but third-party analyses describe it as a unified decoder-only Transformer. All modalities (text, audio waveforms, images, video frames) are encoded into a single shared embedding space with modality-agnostic weights, and the same attention mechanism processes all tokens, whether they came from text, audio, or images.
Source: GPT-4o released May 13, 2024. Audio response latency 232ms minimum, 320ms average, comparable to human response time (confirmed from arxiv.org/html/2410.21276v1 GPT-4o System Card, magicstudioonline.com, hothardware.com). “GPT-4o is an autoregressive omni model, which accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs” (confirmed from arxiv.org/html/2410.21276v1). End-to-end training across text, vision, and audio with single neural network (confirmed from michaeljohnpena.com, emergentmind.com/topics/gpt-4o-language-model).
Gemini: Native Multimodal from Day One
Google’s Gemini family, first released in December 2023, was designed as natively multimodal from the beginning. Unlike GPT-4 (which started as a text model and had vision bolted on later), Gemini was trained jointly across text, image, audio, and video data from the start.
The architecture uses a decoder-only Transformer backbone with explicit token-type embeddings that facilitate native multimodal fusion. Images and video frames are tokenized into discrete units similar to text tokens, enabling direct fusion across modalities. All input modalities are cast into a single token stream via modality markers: image data is discretized and embedded in a manner similar to VQ-VAE, and audio data is embedded using speech features.
This means Gemini does not have a separate “vision encoder” that was pretrained independently. The visual processing layers and the language processing layers are the same layers, trained together on interleaved multimodal data. The model learns cross-modal relationships from the ground up, rather than trying to bridge two separately trained systems.
Source: Gemini architecture uses “an enhanced Transformer decoder backbone configured for multimodality” that “supports mixed modality input sequences by tokenizing images and video frames into discrete units akin to text tokens, enabling direct fusion across modalities” (confirmed from emergentmind.com/topics/gemini-models). “Unified Multimodal Token Interleaving: All input modalities (text, image, audio, video) are cast into a single token stream via modality markers” (confirmed from emergentmind.com/topics/gemini-2-0-model). Gemini 1.0 released December 2023, trained jointly across image, audio, video, and text data (confirmed from ritvik19.medium.com, exploreai.tools, ts2.tech).
How Images Become Tokens: The Visual Tokenizer
In Chapter 21, you learned that vision encoders like ViT convert images into continuous vectors (patch embeddings). For native multimodal models that need to both understand and generate images, there is a different approach: converting images into discrete tokens, just like text.
The key technology is the VQ-VAE (Vector Quantized Variational Autoencoder). A VQ-VAE has three components:
- Encoder: Takes an image and compresses it into a grid of latent vectors.
- Codebook: A fixed-size dictionary of learned vectors (typically 8,192 or 16,384 entries). Each latent vector from the encoder is replaced by the nearest vector in the codebook. This “quantization” step converts continuous representations into discrete codes (integers).
- Decoder: Takes the discrete codes, looks up the corresponding codebook vectors, and reconstructs the image.
After training, the encoder and codebook can convert any image into a sequence of integers, and the decoder can convert those integers back into an image. These integers are image tokens, directly analogous to text tokens.
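The quantization step itself is a nearest-neighbor lookup. The sketch below shows it with NumPy on random data; the codebook size (512), latent dimension (16), and 8 x 8 grid are toy values chosen to keep the example small, and the random vectors stand in for a trained encoder and codebook.

```python
import numpy as np

def quantize(latents: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each latent vector to the index of its nearest codebook vector.

    latents:  (num_positions, dim) continuous encoder outputs
    codebook: (codebook_size, dim) learned code vectors
    returns:  (num_positions,) integer token IDs
    """
    # Squared Euclidean distance: ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2
    distances = (
        (latents ** 2).sum(axis=1, keepdims=True)
        - 2.0 * latents @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    return distances.argmin(axis=1)

def dequantize(token_ids: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Look up the codebook vector for each token ID (the decoder's input)."""
    return codebook[token_ids]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 16))    # toy stand-in for a learned codebook
latents = rng.normal(size=(8 * 8, 16))   # toy stand-in for encoder output

token_ids = quantize(latents, codebook)
reconstructed = dequantize(token_ids, codebook)
quantization_error = ((latents - reconstructed) ** 2).mean()
print(token_ids[:8], f"mean squared quantization error: {quantization_error:.3f}")
```

The gap between `latents` and `reconstructed` is the information discarded by quantization; a trained VQ-VAE learns its encoder and codebook jointly to keep that gap small.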
import numpy as np

def vqvae_tokenization_demo():
    """
    Demonstrate how a VQ-VAE converts an image into discrete tokens.
    This is the foundation for autoregressive image generation in LLMs.
    """
    # Image: 512 x 512 pixels, 3 channels (RGB)
    image_size = 512
    channels = 3
    # The encoder compresses the image spatially
    # A typical compression ratio is 16x in each dimension
    compression_ratio = 16
    latent_grid = image_size // compression_ratio  # 32 x 32
    # Each position in the latent grid maps to one codebook entry
    num_tokens = latent_grid * latent_grid  # 1,024 tokens
    # The codebook has a fixed number of entries
    codebook_size = 8192
    num_values = image_size * image_size * channels

    print("VQ-VAE Image Tokenization")
    print("=" * 55)
    print(f" Input image: {image_size}x{image_size}x{channels} "
          f"({num_values:,} values)")
    print(f" Spatial compression: {compression_ratio}x per dimension")
    print(f" Latent grid: {latent_grid}x{latent_grid}")
    print(f" Number of tokens: {num_tokens}")
    print(f" Codebook size: {codebook_size:,} entries")
    print(f" Bits per token: {np.log2(codebook_size):.1f} bits")
    print(f" Overall reduction: "
          f"{num_values / num_tokens:.0f}x "
          f"(from {num_values:,} values "
          f"to {num_tokens} tokens)")
    print(f"\n Each token is an integer from 0 to {codebook_size - 1}.")
    # Random IDs stand in for real encoder output in this demo
    print(f" The sequence [{np.random.randint(0, codebook_size)} "
          f"{np.random.randint(0, codebook_size)} "
          f"{np.random.randint(0, codebook_size)} ...] represents an image,")
    print(" just like [15339 318 257 ...] represents text.")
    print("\n To generate an image, the model predicts these tokens")
    print(" one at a time, exactly like predicting text tokens.")
    print(" Then the VQ-VAE decoder converts them back to pixels.")

vqvae_tokenization_demo()

This is the critical insight: once images are discrete tokens, you can generate them the same way you generate text. The model predicts the next image token, then the next, then the next, autoregressively. The VQ-VAE decoder then converts the predicted token sequence back into pixels.
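The generation loop can be sketched as follows. The function `toy_next_token_logits` is a hypothetical stand-in that returns random scores where a real model would run a Transformer forward pass over the prefix, so the resulting tokens are meaningless; only the control flow (predict a distribution, sample a token, append, repeat, then reshape into the latent grid) is the point.

```python
import numpy as np

CODEBOOK_SIZE = 8192   # image vocabulary size used earlier in this section
GRID = 32              # 32 x 32 latent grid for a 512 x 512 image

def toy_next_token_logits(prefix, rng):
    """Hypothetical stand-in for a Transformer forward pass over `prefix`."""
    return rng.normal(size=CODEBOOK_SIZE)

def sample_image_tokens(num_tokens, temperature=1.0, seed=0):
    """Sample image tokens one at a time, each conditioned on the prefix."""
    rng = np.random.default_rng(seed)
    tokens = []
    for _ in range(num_tokens):
        logits = toy_next_token_logits(tokens, rng) / temperature
        # Numerically stable softmax over the image vocabulary
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        tokens.append(int(rng.choice(CODEBOOK_SIZE, p=probs)))
    return np.array(tokens)

tokens = sample_image_tokens(GRID * GRID)
# Raster order: the flat sequence is reshaped into the grid the decoder expects
token_grid = tokens.reshape(GRID, GRID)
print(token_grid.shape)  # (32, 32)
```

In a real system the VQ-VAE decoder would take `token_grid`, look up the corresponding codebook vectors, and produce pixels.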
Chameleon: Meta’s Early Fusion Research Model
Meta’s Chameleon (arXiv:2405.09818, May 2024) was one of the first open research models to demonstrate this approach at scale. Chameleon is an early-fusion, token-based model that represents all modalities (images, text, and code) as discrete tokens and uses a uniform Transformer architecture trained from scratch on approximately 10 trillion tokens of interleaved mixed-modal data.
Chameleon’s image tokenizer encodes a 512 x 512 image into 1,024 discrete tokens from a codebook of 8,192 entries. The total vocabulary is 65,536 tokens: 8,192 for images plus the text vocabulary. The model processes sequences of interleaved text and image tokens with the same Transformer, using the same attention mechanism for both modalities.
Meta released 7B and 34B parameter versions. Training drew on a dataset of approximately 4.4 trillion tokens and consumed more than 5 million GPU-hours across the two models. Chameleon demonstrated state-of-the-art performance in image captioning, text generation competitive with Mixtral 8x7B and Gemini Pro, and non-trivial image generation, all in a single model.
def chameleon_tokenization():
    """
    Show how Chameleon represents a mixed text-image sequence.
    """
    # Chameleon vocabulary
    total_vocab = 65536
    image_vocab = 8192                      # Image codebook entries
    text_vocab = total_vocab - image_vocab  # 57,344 non-image tokens
    # A mixed-modal sequence might look like:
    # [text tokens] [image tokens] [text tokens]
    text_tokens_before = 50  # "Here is a photo of a sunset:"
    image_tokens = 1024      # 512x512 image = 1024 tokens
    text_tokens_after = 30   # "The colors are beautiful."
    total_sequence = text_tokens_before + image_tokens + text_tokens_after

    print("Chameleon: Mixed-Modal Token Sequence")
    print("=" * 55)
    # The ID ranges below are an illustrative layout, not Chameleon's actual one
    print(f" Text vocabulary: {text_vocab:,} tokens (IDs 0-{text_vocab-1:,})")
    print(f" Image vocabulary: {image_vocab:,} tokens "
          f"(IDs {text_vocab:,}-{total_vocab-1:,})")
    print(f" Total vocabulary: {total_vocab:,} tokens")
    print("\n Example mixed sequence:")
    print(f" Text tokens: {text_tokens_before} "
          f"('Here is a photo of a sunset:')")
    print(f" Image tokens: {image_tokens} "
          f"(512x512 image, codebook size 8192)")
    print(f" Text tokens: {text_tokens_after} "
          f"('The colors are beautiful.')")
    print(f" Total: {total_sequence} tokens")
    print("\n The Transformer processes this entire sequence uniformly.")
    print(" It does not know or care which tokens are text vs image.")
    print(" The same attention mechanism handles both modalities.")

chameleon_tokenization()

Chameleon’s significance is that it proved early fusion works at scale. A single model, with a single set of weights, can understand images, generate images, understand text, and generate text. Apart from the image tokenizer that converts pixels to tokens and back, the model does not need separate encoders, decoders, or projection layers for different modalities. Everything is tokens, and the Transformer processes them all.
Source: Chameleon (arXiv:2405.09818, May 2024) is “a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence” (confirmed from arxiv.org/html/2405.09818v1). Image tokenizer: 512x512 image encoded into 1,024 discrete tokens from codebook of 8,192 entries (confirmed from ritvik19.medium.com, andlukyane.com, ailab.sh). Total vocabulary 65,536 tokens (confirmed from medium.com/@maxwbuckley). 7B and 34B parameter versions, trained on ~10T tokens of interleaved mixed-modal data (confirmed from techxplore.com, xagi.in). Models trained on a ~4.4T-token dataset over 5M+ GPU-hours in total (confirmed from bdtechtalks.substack.com). HuggingFace model checkpoints: facebook/chameleon-7b (confirmed from huggingface.co/facebook/chameleon-7b).
Transfusion: Combining Autoregressive and Diffusion
Chameleon’s approach of converting images to discrete tokens works, but it has a limitation: the quantization step (mapping continuous image features to the nearest codebook entry) introduces information loss. Fine details, subtle color gradients, and textures can be degraded by this discretization.
Meta and Waymo’s Transfusion (arXiv:2408.11039, August 2024) proposed an elegant alternative: use next-token prediction for text (which is inherently discrete) and diffusion for images (which are inherently continuous), within a single Transformer.
In Transfusion, a mixed-modality sequence contains both discrete text tokens and continuous image vectors. The text tokens are processed with the standard autoregressive language modeling loss (predict the next token). The image vectors are processed with a diffusion loss (learn to denoise). The same Transformer backbone handles both, but with modality-specific encoding and decoding layers at the input and output.
The key advantage is that images remain in continuous space throughout the process. There is no quantization bottleneck. The model can generate images with the full fidelity of a diffusion model while also generating text with the full quality of a language model.
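The two objectives can be combined into a single training loss by adding them. The sketch below, in plain NumPy, computes cross-entropy over text positions and mean squared error over predicted noise for image positions, then sums them with a balancing weight. The weight `lam`, its default value, and all array shapes here are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def cross_entropy(logits: np.ndarray, targets: np.ndarray) -> float:
    """Mean next-token cross-entropy. logits: (n, vocab), targets: (n,)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def noise_prediction_mse(predicted_noise: np.ndarray, true_noise: np.ndarray) -> float:
    """Diffusion loss: how well the model predicted the noise that was added."""
    return float(((predicted_noise - true_noise) ** 2).mean())

def combined_loss(text_logits, text_targets, predicted_noise, true_noise, lam=1.0):
    """Language-modeling loss on text plus weighted diffusion loss on images."""
    return cross_entropy(text_logits, text_targets) + lam * noise_prediction_mse(
        predicted_noise, true_noise
    )

rng = np.random.default_rng(0)
text_logits = rng.normal(size=(6, 100))     # 6 text positions, toy vocab of 100
text_targets = rng.integers(0, 100, size=6)
true_noise = rng.normal(size=(16, 8))       # 16 image patches, toy latent dim 8
predicted_noise = true_noise + 0.1 * rng.normal(size=(16, 8))

print(f"combined loss: {combined_loss(text_logits, text_targets, predicted_noise, true_noise):.3f}")
```

Because both terms are computed from the same backbone's outputs, one backward pass updates the shared weights for both modalities.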
def transfusion_architecture():
    """
    Illustrate how Transfusion combines autoregressive and diffusion
    within a single Transformer.
    """
    print("Transfusion: Dual-Objective Training")
    print("=" * 60)
    print("\n Text tokens (discrete):")
    print(" Objective: Next-token prediction (autoregressive)")
    print(" Attention: Causal (left-to-right)")
    print(" Loss: Cross-entropy")
    print("\n Image patches (continuous):")
    print(" Objective: Diffusion (denoising)")
    print(" Attention: Bidirectional (within image)")
    print(" Loss: Mean squared error (noise prediction)")
    print("\n Shared Transformer backbone:")
    print(" Same weights process both text and image tokens")
    print(" Modality-specific input/output layers handle encoding/decoding")
    print("\n Generation:")
    print(" Text: Predict one token at a time (autoregressive)")
    print(" Image: Iterative denoising (diffusion, many steps per image)")
    print("\n Key insight: No image quantization needed.")
    print(" Images stay in continuous space, preserving full fidelity.")
    print(" Text stays in discrete space, preserving exact token prediction.")

transfusion_architecture()

Transfusion’s experiments showed that this approach scales significantly better than the discrete-token approach (like Chameleon). At 7B parameters trained on 2 trillion multimodal tokens, Transfusion produced a model that generates images and text on par with similar-scale specialized models. The researchers also demonstrated that images can be compressed to just 16 patches (instead of 1,024 tokens in Chameleon), dramatically reducing the sequence length needed for image generation.
This matters because it suggests the future of native multimodal models may not require choosing between discrete tokens (good for text) and continuous representations (good for images). You can have both in the same model.
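The mixed attention pattern (causal for text, bidirectional within each image) can be expressed as a boolean mask. The sketch below builds one for a short sequence; the function name and the segment-ID encoding are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def mixed_attention_mask(segment_ids: np.ndarray, is_image: np.ndarray) -> np.ndarray:
    """Return mask[i, j] = True where position i may attend to position j.

    segment_ids: (n,) ID of the contiguous segment each position belongs to
    is_image:    (n,) True for positions that hold image patches
    """
    n = len(segment_ids)
    # Causal part: every position sees itself and everything before it
    causal = np.tril(np.ones((n, n), dtype=bool))
    # Bidirectional part: patches of the same image see each other fully
    same_segment = segment_ids[:, None] == segment_ids[None, :]
    same_image = same_segment & is_image[:, None] & is_image[None, :]
    return causal | same_image

# Toy sequence: 2 text tokens, 3 image patches, 2 text tokens
segment_ids = np.array([0, 0, 1, 1, 1, 2, 2])
is_image = np.array([False, False, True, True, True, False, False])
mask = mixed_attention_mask(segment_ids, is_image)
print(mask.astype(int))
```

In the printed matrix, the 3 x 3 block for the image positions is fully filled in, while the text rows remain lower-triangular.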
Source: Transfusion (arXiv:2408.11039, August 2024) “combines the language modeling loss function (next token prediction) with diffusion to train a single transformer over mixed-modality sequences” (confirmed from arxiv.org/abs/2408.11039). “Scaling our Transfusion recipe to 7B parameters and 2T multi-modal tokens produces a model that can generate images and text on a par with similar scale diffusion models and language models” (confirmed from arxiv.org/abs/2408.11039). Images compressed to 16 patches (confirmed from arxiv.org/abs/2408.11039). Scales significantly better than discrete image tokens (confirmed from emergentmind.com/papers/2408.11039).
A Note on Gemini Diffusion
In May 2025, Google DeepMind announced Gemini Diffusion, an experimental research model that applies diffusion techniques to text generation rather than image generation. Unlike autoregressive models that predict one token at a time, Gemini Diffusion generates text by iteratively refining random noise into coherent output, similar to how image diffusion models work. This is a different application of diffusion than Transfusion: Transfusion uses diffusion for images within a multimodal model, while Gemini Diffusion uses diffusion for text itself.
The speed numbers are striking. Google reported an average sampling speed of 1,479 tokens per second (excluding a fixed 0.84-second overhead), roughly 5x faster than Gemini 2.0 Flash-Lite while matching its quality on benchmarks like HumanEval (89.6% pass@1). Because diffusion generates entire blocks of tokens simultaneously rather than one at a time, it can also self-correct during generation, which is particularly useful for editing, math, and code tasks. This is an active research direction that may influence future multimodal architectures: if diffusion works well for both images and text, a single diffusion-based model could potentially handle all modalities with a unified generation mechanism.
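To see what the fixed overhead means in practice, the sketch below estimates end-to-end time for outputs of different lengths. It assumes the 0.84-second overhead is paid once per request and simply added to the sampling time, which is an interpretation of the reported figures rather than a documented formula.

```python
OVERHEAD_SECONDS = 0.84    # reported fixed overhead
TOKENS_PER_SECOND = 1479   # reported average sampling speed

def estimated_seconds(num_tokens: int) -> float:
    """Estimated wall-clock time, assuming overhead is additive per request."""
    return OVERHEAD_SECONDS + num_tokens / TOKENS_PER_SECOND

def effective_tokens_per_second(num_tokens: int) -> float:
    """Throughput once the fixed overhead is included."""
    return num_tokens / estimated_seconds(num_tokens)

for num_tokens in (100, 1000, 10000):
    print(f"{num_tokens:>6} tokens: {estimated_seconds(num_tokens):6.2f} s, "
          f"{effective_tokens_per_second(num_tokens):7.0f} tokens/s effective")
```

Under this assumption, short responses are dominated by the overhead, and the headline sampling speed is approached only for long outputs.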
Source: Gemini Diffusion announced May 2025 at Google I/O, “a state-of-the-art text diffusion model that learns to generate outputs by converting random noise into coherent text or code” (confirmed from blog.google/technology/google-deepmind/gemini-diffusion, deepmind.google/models/gemini-diffusion). 1,479 tokens per second excluding 0.84s overhead (confirmed from gigazine.net, the-decoder.com, govindhtech.com). Matches Gemini 2.0 Flash-Lite quality at 5x speed, 89.6% HumanEval pass@1 (confirmed from handyai.substack.com, simonwillison.net, llm-stats.com).
Native Image Generation: Models That Draw
The most visible consequence of native multimodality is that language models can now generate images directly, without calling a separate image generation model. This is not a minor feature. It fundamentally changes how users interact with AI.
GPT-4o Image Generation
On March 25, 2025, OpenAI launched native image generation in GPT-4o. This replaced the previous approach where ChatGPT called DALL-E 3 as a separate model. With native image generation, GPT-4o generates images using the same model that processes text, within the same conversation context.
The difference is significant:
Context awareness: Because the image generator is the same model that understands the conversation, it can maintain consistency across multiple images in a conversation. If you ask for a character in one image and then ask for the same character in a different scene, the model can maintain visual consistency because it has full access to the conversation history.
Text rendering: GPT-4o’s native image generation can render text within images with unprecedented accuracy. Menus, product labels, signs, and diagrams with text are generated correctly, which was a persistent weakness of diffusion-based models like DALL-E 3.
Iterative refinement: You can ask the model to modify a generated image through conversation. “Make the sky more orange.” “Remove the person on the left.” “Add a logo in the top right corner.” The model understands these instructions in context and modifies the image accordingly.
The architecture uses an autoregressive approach rather than diffusion. GPT-4o generates images by predicting visual tokens sequentially, similar to how it generates text tokens. This is fundamentally different from diffusion models (like DALL-E 3 or Stable Diffusion), which generate images by iteratively denoising random noise.
The impact was immediate and massive. In the first week after launch, over 130 million users generated more than 700 million images. The demand was so intense that OpenAI’s infrastructure strained under the load, with CEO Sam Altman noting that their GPUs were under unprecedented pressure.
OpenAI subsequently released the model as GPT Image 1 (model ID: gpt-image-1) via the API on April 23, 2025, and then upgraded to GPT Image 1.5 (gpt-image-1.5) on December 16, 2025, with 4x faster generation, improved instruction following, better text rendering, and a 20% price reduction for image tokens.
def gpt4o_image_gen_timeline():
    """
    Timeline of GPT-4o native image generation milestones.
    """
    print("GPT-4o Native Image Generation Timeline")
    print("=" * 65)
    # (date, event, detail)
    milestones = [
        ("May 13, 2024", "GPT-4o released",
         "Native multimodal (text, image, audio input/output)"),
        ("Mar 25, 2025", "Native image generation in ChatGPT",
         "Replaces DALL-E 3; autoregressive, not diffusion"),
        ("Apr 23, 2025", "GPT Image 1 API release",
         "gpt-image-1 model ID for developers"),
        ("Dec 16, 2025", "GPT Image 1.5 released",
         "4x faster, better text rendering, 20% cheaper"),
    ]
    for date, event, detail in milestones:
        print(f" {date:<16} {event}")
        print(f" {'':16} {detail}")
        print()
    print(" First week stats (March 25 - April 1, 2025):")
    print(" 130+ million users generated 700+ million images")

gpt4o_image_gen_timeline()

Source: GPT-4o native image generation launched March 25, 2025 (confirmed from openai.com/index/introducing-4o-image-generation, wikipedia.org/wiki/GPT_Image). Replaces DALL-E 3 (confirmed from maginative.com, techcrunch.com). Autoregressive approach, not diffusion (confirmed from medium.com/@ricodedeijn, theaienterprise.io, medium.com/@zingabera_7320). 130+ million users, 700+ million images in first week (confirmed from techcrunch.com, opentools.ai, gigazine.net, bgr.com). GPT Image 1 API released April 23, 2025 as gpt-image-1 (confirmed from wikipedia.org/wiki/GPT_Image). GPT Image 1.5 released December 16, 2025, 4x faster, 20% price reduction (confirmed from ainews.com, affiliatebooster.com, how2shout.com, felloai.com).
Gemini’s Image Generation: From Nano Banana to Nano Banana 2
Google’s approach to native image generation followed a different path. On December 11, 2024, Google announced Gemini 2.0 Flash with native multimodal output capabilities, including natively generated images mixed with text and steerable text-to-speech audio. The image generation feature was initially available to trusted testers, then opened to developers on March 12, 2025 via Google AI Studio.
Unlike diffusion-based models, Gemini’s image generation is built into the same Transformer that handles text. The model generates images autoregressively, similar to how it generates text but with visual tokens instead of word tokens. This means you can have a conversation where the model generates text and images interleaved naturally: “Here is a story about a turtle” followed by an illustration, followed by more text, followed by another illustration, all maintaining visual consistency.
Google’s image generation capabilities evolved rapidly through a series of models with the internal codename “Nano Banana”:
Nano Banana (August 2025): The original model, officially Gemini 2.5 Flash Image, launched as an image generation and editing tool within the Gemini app. Before its official launch it had appeared anonymously on LMArena (a blind testing platform) in the summer of 2025, where its quality surprised evaluators. After launch it went viral, adding 13 million first-time users in just 4 days and 23 million within two weeks, and generating over 500 million images.
Nano Banana Pro (November 20, 2025): Built on Gemini 3 Pro, this upgrade introduced native 4K resolution output, support for 14 simultaneous reference images, web search integration for real-time information grounding, and 94% text rendering accuracy across multiple languages. It reduced text rendering errors from 56% to 8%.
Nano Banana 2 (February 26, 2026): Officially designated Gemini 3.1 Flash Image, this model combined the quality of Nano Banana Pro with the speed of Gemini Flash. It became the default image generator across all Gemini image modes (Fast, Thinking, and Pro), with native 4K support and improved text rendering.
def gemini_image_gen_evolution():
    """
    Track the evolution of Google's native image generation.
    """
    print("Google Gemini Native Image Generation Evolution")
    print("=" * 70)
    # (date, model, detail)
    models = [
        ("Dec 11, 2024", "Gemini 2.0 Flash",
         "First native image output announced (trusted testers)"),
        ("Mar 12, 2025", "Gemini 2.0 Flash (public)",
         "Image generation available in AI Studio for developers"),
        ("Aug 26, 2025", "Nano Banana (Gemini 2.5 Flash Image)",
         "13M users in 4 days, 23M in 2 weeks, 500M+ images"),
        ("Nov 20, 2025", "Nano Banana Pro (Gemini 3 Pro Image)",
         "4K output, 94% text accuracy, 14 reference images"),
        ("Feb 26, 2026", "Nano Banana 2 (Gemini 3.1 Flash Image)",
         "Pro quality at Flash speed, default for all modes"),
    ]
    for date, model, detail in models:
        print(f" {date:<16} {model}")
        print(f" {'':16} {detail}")
        print()

gemini_image_gen_evolution()

Source: Gemini 2.0 Flash announced December 11, 2024 with native image and audio output (confirmed from siliconangle.com, mobilesyrup.com, magicstudioonline.com). Image generation opened to developers March 12, 2025 (confirmed from developers.googleblog.com/en/experiment-with-gemini-20-flash-native-image-generation). Nano Banana launched August 26, 2025, officially Gemini 2.5 Flash Image, 13M first-time users in 4 days, 23M by September 9, 500M+ images (confirmed from storyboard18.com, mspoweruser.com, ndtv.com, news9live.com). Nano Banana Pro launched November 20, 2025, built on Gemini 3 Pro, native 4K, 94% text accuracy, text rendering errors reduced from 56% to 8% (confirmed from spectrumailab.com, royfactory.net, createvision.ai, gadgets360.com). Nano Banana 2 launched February 26, 2026, officially Gemini 3.1 Flash Image, combines Pro quality with Flash speed, 40-50% faster and 50% cheaper than Nano Banana Pro (confirmed from sci-tech-today.com, techspot.com, theneuron.ai, gadgets360.com, photoworkout.com).
Native Audio: Models That Listen and Speak
Audio is the third major modality in native multimodal models. The progression from pipeline-based audio to native audio mirrors the progression in vision: separate models are being replaced by unified architectures that process speech natively.
The Pipeline Approach (Pre-2024)
Before GPT-4o, voice interaction with AI followed a three-step pipeline:
- Speech-to-text: A model like OpenAI’s Whisper (released September 2022, trained on 680,000 hours of multilingual audio) transcribes the user’s speech into text.
- Text processing: The language model (GPT-4, Claude, etc.) processes the text and generates a text response.
- Text-to-speech: A TTS model converts the text response back into audio.
This pipeline works but loses information at each boundary. When speech is transcribed to text, the model loses:
- Tone and emotion: Sarcasm, excitement, sadness, frustration
- Emphasis: Which words the speaker stressed
- Speaker identity: Who is speaking in a multi-speaker conversation
- Non-verbal sounds: Laughter, sighs, pauses, background context
- Prosody: The rhythm and melody of speech
The language model processes only the flat text transcript, stripped of all this information. Its response is then converted to speech by a TTS model that has no understanding of the conversation context, producing output that sounds robotic and emotionally flat.
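The information loss can be made concrete with a toy model of the pipeline. Everything in the sketch below is hypothetical: the `Utterance` fields and the three stage functions are stand-ins for real speech recognition, language model, and TTS systems. The point is only that the interface between stages is a plain string, so anything not in the string cannot influence the output.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    """Toy representation of what a spoken utterance carries."""
    words: str
    emotion: str
    speaker: str
    stressed_word: str

def transcribe(utterance: Utterance) -> str:
    # Speech-to-text boundary: only the words survive
    return utterance.words

def language_model(prompt: str) -> str:
    # Stand-in for the text-only model; it never sees the audio
    return f"(reply to: {prompt})"

def synthesize(text: str) -> dict:
    # Stand-in for TTS; it sees only the reply text, not the conversation
    return {"text": text, "emotion": "neutral"}

def voice_pipeline(utterance: Utterance) -> dict:
    return synthesize(language_model(transcribe(utterance)))

sincere = Utterance("Great, another meeting.", "excited", "Ana", "Great")
sarcastic = Utterance("Great, another meeting.", "sarcastic", "Ben", "another")

# Different speakers, emotions, and emphasis produce identical responses
print(voice_pipeline(sincere) == voice_pipeline(sarcastic))  # True
```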
GPT-4o: End-to-End Audio
GPT-4o eliminated this pipeline. Audio goes directly into the model as audio tokens, and audio comes directly out. The model processes the raw audio signal, not a text transcription. This means it can perceive and respond to all the nuances that the pipeline approach loses.
The GPT-4o System Card (arXiv:2410.21276) describes the model as “an autoregressive omni model, which accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs.” The audio capabilities include:
- Responding to audio inputs with an average latency of 320 milliseconds
- Generating speech with emotional expression, laughter, and singing
- Understanding and responding to tone, emphasis, and emotion in the speaker’s voice
- Being interrupted mid-sentence and responding naturally
- Processing audio in multiple languages
Gemini: Native Audio Generation
Google’s Gemini 2.0 Flash (December 2024) also introduced native audio output, including steerable text-to-speech in multiple languages. The Multimodal Live API enables real-time audio streaming interactions where the model can process audio input and generate audio output simultaneously.
Google subsequently developed dedicated native audio models. The Gemini 2.5 Flash Native Audio model, updated in December 2025, processes speech natively rather than converting it to text first. Google describes this as “a fundamental architectural shift in how AI processes speech.” Unlike traditional voice systems that chain together speech-to-text, language model, and text-to-speech components, the native audio model directly understands and responds to audio input, preserving nuances like tone, emotion, and conversational context. The model provides 30 HD voices in 24 languages, with enhanced audio quality that Google says “feels like speaking with a person.”
By March 2026, Google had expanded this into Gemini Audio, a dedicated model family that can translate real-time speech in over 70 languages, automatically detect which languages are being spoken, and filter out background noise while preserving the characteristics of the original speakers. Gemini Audio also powers expressive audio generation with granular control over style, tone, and performance.
Additionally, on February 18, 2026, Google integrated Lyria 3, DeepMind’s most advanced music generation model, directly into the Gemini app. Users can now generate 30-second tracks with vocals and instrumentals from text prompts, images, or video clips. Lyria 3 produces 48kHz stereo audio and was trained on over 2 million tracks. This represents another dimension of native audio generation: not just speech, but music and sound effects as well.
Claude: Voice via External Partnership
Anthropic took a different approach to audio. Rather than building native audio capabilities into Claude’s architecture, Anthropic partnered with ElevenLabs for speech synthesis. Claude’s voice mode, launched in late May 2025 on iOS and Android, uses ElevenLabs technology for text-to-speech rather than an in-house model. This means Claude’s voice interaction still follows the pipeline approach: speech is transcribed to text, processed by Claude, and then converted back to speech by ElevenLabs.
This is a notable architectural difference. While OpenAI and Google have invested in end-to-end audio processing within their models, Anthropic has focused its resources on text and vision capabilities, outsourcing audio to a specialist. Claude’s voice mode initially launched with English support and five voices (Buttery, Airy, Mellow, Glassy, and Rounded), and has since expanded to detect 38 input languages with 14 neural TTS voices and code-switching support (switching between languages within a single sentence).
Source: Whisper released September 2022, trained on 680,000 hours of multilingual audio (confirmed from openai.com/index/whisper, openwhispr.com). GPT-4o System Card arXiv:2410.21276: “autoregressive omni model” accepting “any combination of text, audio, image, and video” (confirmed from arxiv.org/html/2410.21276v1). Gemini 2.0 Flash native audio output including steerable TTS (confirmed from neowin.net, themobileindian.com, medium.com/@chongcht). Gemini 2.5 Flash Native Audio: 30 HD voices in 24 languages, “fundamental architectural shift in how AI processes speech” (confirmed from docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash-live-api, blog.google/products/gemini/gemini-audio-model-updates, supermaker.ai). Gemini Audio translates real-time speech in 70+ languages (confirmed from deepmind.google/models/gemini-audio, notebookcheck.net). Lyria 3 integrated into Gemini app February 18, 2026, 30-second tracks with vocals and instrumentals, 48kHz stereo, trained on 2M+ tracks (confirmed from creati.ai, winbuzzer.com, toolmesh.ai, bitcoinethereumnews.com). Claude voice mode launched late May 2025, uses ElevenLabs for TTS, initially English-only with 5 voices (Buttery, Airy, Mellow, Glassy, Rounded), later expanded to 38 input languages and 14 neural TTS voices with code-switching (confirmed from the-decoder.com, aitechsuite.com, datastudios.org).
Video Understanding: Processing Temporal Sequences
Video is the most token-expensive modality. A video is a sequence of image frames over time, and each frame needs to be tokenized. The challenge is that even a short video contains an enormous number of frames.
How Video Becomes Tokens
The standard approach to video understanding in multimodal models is to sample frames at a fixed rate and tokenize each frame as an image. Gemini, for example, samples video at 1 frame per second (FPS) and tokenizes each frame.
def video_token_cost():
    """
    Calculate the token cost of processing video in Gemini.
    """
    # Gemini video tokenization
    tokens_per_second = 263  # Fixed rate for Gemini
    tokens_per_frame = 258   # Each frame (<=384px) = 258 tokens

    print("Video Token Cost in Gemini")
    print("=" * 65)
    print(f" Tokenization rate: {tokens_per_second} tokens per second")
    print(f" (Approximately 1 frame/second at {tokens_per_frame} tokens/frame")
    print(" plus audio tokens)")
    print()

    durations = [
        (10, "10-second clip"),
        (60, "1-minute video"),
        (300, "5-minute video"),
        (3600, "1-hour video"),
        (7200, "2-hour movie"),
    ]

    print(f"{'Duration':<20} {'Tokens':>12} {'Context %':>12}")
    print(f"{'':20} {'':>12} {'(of 1M)':>12}")
    print("-" * 50)
    for seconds, label in durations:
        tokens = seconds * tokens_per_second
        pct = (tokens / 1_000_000) * 100
        print(f"{label:<20} {tokens:>12,} {pct:>11.1f}%")

    print("\n A 1-hour video uses ~947K tokens, nearly filling")
    print(" a 1M-token context window.")
    print(" A 2-hour movie requires ~1.9M tokens, exceeding")
    print(" most models' context windows.")


video_token_cost()

The numbers reveal why video understanding is so expensive. A 1-hour video at 263 tokens per second consumes approximately 947,000 tokens, nearly filling a 1-million-token context window. A 2-hour movie would require about 1.9 million tokens. This is why video understanding is currently limited to relatively short clips in most applications, or requires models with very large context windows.
What Video Understanding Can Do
Despite the token cost, video understanding enables powerful capabilities:
- Temporal reasoning: Understanding what happens over time. “What did the person do after picking up the cup?” requires tracking actions across frames.
- Activity recognition: Identifying activities like cooking, exercising, or driving from video sequences.
- Timestamp-based Q&A: “At what point in the video does the speaker mention pricing?” requires correlating content with time.
- Summarization: Condensing a long video into key points, identifying the most important moments.
- Real-time analysis: Processing live video streams for applications like sports coaching (“How can I improve my golf swing?”) or security monitoring.
Gemini’s video understanding processes video by extracting frames at 1 FPS, tokenizing each frame, and including audio tokens alongside the visual tokens. The model can then reason over the combined visual and audio sequence, answering questions that require understanding both what is shown and what is said.
The limitation is that 1 FPS sampling misses fast-moving events. A ball being thrown, a quick gesture, or a brief facial expression may fall between sampled frames. Higher frame rates would capture more detail but would multiply the token cost proportionally.
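The trade-off between frame rate and token cost can be sketched with a little arithmetic. This is a simplified model under stated assumptions: it counts only visual tokens at the 258 tokens per frame quoted above (audio tokens are ignored), and it assumes frames are sampled at evenly spaced instants with a random offset, so an event lasting `event_s` seconds is caught with probability `min(1, event_s * fps)`. The function names are illustrative, not part of any Gemini API.

```python
def frame_tokens(duration_s: float, fps: float, tokens_per_frame: int = 258) -> int:
    """Visual tokens for a clip sampled at `fps` frames per second."""
    return round(duration_s * fps * tokens_per_frame)


def capture_probability(event_s: float, fps: float) -> float:
    """Chance that at least one sampled frame lands inside a brief event,
    assuming evenly spaced samples with a uniformly random phase."""
    return min(1.0, event_s * fps)


# A 0.2-second event (a quick gesture) inside a 60-second clip.
for fps in (1, 2, 5, 10):
    tokens = frame_tokens(60, fps)
    p = capture_probability(0.2, fps)
    print(f"{fps:>3} FPS: {tokens:>8,} visual tokens/min, "
          f"P(catch 0.2 s event) = {p:.0%}")
```

Under these assumptions, reliably catching a 0.2-second event requires roughly 5 FPS, which multiplies the visual token cost by about five compared with the 1 FPS default.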
Source: Gemini video tokenization at 263 tokens per second (confirmed from geminibyexample.com/027-calculate-input-tokens). Gemini samples video at 1 FPS (confirmed from s-anand.net, google.dev/discuss, getdecipher.com). Each frame (<=384px) = 258 tokens (confirmed from geminibyexample.com, google.dev/discuss). Gemini 3 Video Understanding API processes video at approximately 300 tokens per second at default resolution (confirmed from fastgptplus.com).
The Architecture Spectrum: From Bolted-On to Fully Native
Not all “multimodal” models are equally native. There is a spectrum of architectural approaches, and understanding where each model falls on this spectrum helps explain their capabilities and limitations.
Level 1: Bolted-On (Separate Models)
The least integrated approach. A language model calls separate models for different modalities. The language model generates a text prompt, sends it to DALL-E for image generation, or receives transcribed text from Whisper. The models do not share weights or training.
Examples: ChatGPT with DALL-E 3 (before March 2025), Claude with ElevenLabs voice.
Limitations: Information loss at boundaries, high latency, no cross-modal reasoning during generation.
Level 2: Encoder-Attached (Vision Encoder + LLM)
A pretrained vision encoder is connected to a language model through a projection layer or cross-attention mechanism. The vision encoder and language model may be fine-tuned together, but they were originally trained separately.
Examples: LLaVA, LLaMA 3.2 Vision.
Capabilities: Image understanding, visual question answering, document analysis.
Limitations: Cannot generate images. Vision encoder and language model may have misaligned representations. Image tokens consume context window (in projection-based approaches).
Level 2.5: Early Fusion with Vision Encoder
A vision encoder (often pretrained) is integrated into the model, but the entire system is trained end-to-end on interleaved text and image data from the start. The vision encoder’s representations are fused into the language model at the input level rather than being bolted on after pretraining. This is a meaningful step beyond Level 2 because the model learns cross-modal relationships during pretraining, not just during fine-tuning.
Examples: LLaMA 4 Maverick (MetaCLIP vision encoder with early fusion, trained on 40+ trillion tokens of interleaved multimodal data), Qwen 3.5 (early fusion MoE, trained on text, images, and video simultaneously), Mistral Small 4 (Pixtral vision encoder with multimodal input).
Capabilities: Strong image understanding, visual reasoning, document analysis, video understanding (Qwen 3.5 supports 60-second video clips; the related Qwen3-VL family supports up to 2-hour videos with its 1M-token context).
Limitations: Cannot generate images or audio. Still uses a distinct vision encoder component, even though it is trained jointly.
Level 3: Native Input, Pipeline Output
The model natively processes multiple input modalities but uses separate systems for non-text output. For example, a model might natively process text, images, and audio as input, but only generate text as output, relying on external models for image or audio generation.
Examples: Claude Sonnet 4.6 (natively processes text and images, uses ElevenLabs for audio output, cannot generate images).
Level 4: Fully Native (End-to-End Multimodal)
The model processes and generates all supported modalities through a single architecture, trained end-to-end. No separate models, no pipelines, no information loss at boundaries.
Examples: GPT-4o (text, image, audio), Gemini (text, image, audio, video).
def multimodal_spectrum():
    """
    Compare models across the multimodal architecture spectrum.
    """
    print("Multimodal Architecture Spectrum (March 2026)")
    print("=" * 92)
    headers = ("Model", "Txt In", "Img In", "Aud In",
               "Txt Out", "Img Out", "Aud Out", "Level")
    models = [
        ("GPT-4o / GPT-5.4", "Native", "Native", "Native",
         "Native", "Native", "Native", "4"),
        ("Gemini 3 Pro", "Native", "Native", "Native",
         "Native", "Native", "Native", "4"),
        ("Claude Sonnet 4.6", "Native", "Native", "Pipeline*",
         "Native", "No", "Pipeline*", "3"),
        ("LLaMA 4 Maverick", "Native", "Native**", "No",
         "Native", "No", "No", "2.5"),
        ("Qwen 3.5", "Native", "Native**", "No",
         "Native", "No", "No", "2.5"),
        ("Mistral Small 4", "Native", "Native**", "No",
         "Native", "No", "No", "2.5"),
    ]

    def format_row(row):
        # The widest cell is "Pipeline*" (9 characters), so each column gets 9.
        return f"  {row[0]:<20}" + "".join(f" {cell:>9}" for cell in row[1:])

    print(format_row(headers))
    print("  " + "-" * 90)
    for row in models:
        print(format_row(row))
    print("\n  * Claude uses ElevenLabs for speech (pipeline, not native)")
    print("  ** Early fusion: vision encoder trained jointly with LLM")
    print("     from the start on interleaved multimodal data")


multimodal_spectrum()

The trend is clear: models are moving up the spectrum toward fully native multimodality. In 2023, most models were at Level 1 or 2. By March 2026, the frontier models (GPT-4o/GPT-5.4, Gemini) are at Level 4, and open-source models (LLaMA 4, Qwen 3.5, Mistral Small 4) have adopted early fusion for input processing, placing them at Level 2.5. The gap between closed and open models is narrowing on the input side but remains wide on the output side: none of the open-source models compared here can generate images or audio natively within the language model.
Why Native Multimodality Matters: Concrete Advantages
The shift from bolted-on to native multimodality is not just an architectural curiosity. It enables capabilities that are impossible or impractical with the pipeline approach.
1. Interleaved Generation
A natively multimodal model can generate text and images interleaved in a single response. Ask Gemini to “write a children’s story about a turtle and illustrate each scene,” and it will generate text paragraphs alternating with illustrations, maintaining visual consistency throughout. The turtle looks the same in every image because the model has full context of what it has already generated.
With the pipeline approach, you would need to generate the text first, then separately prompt an image generator for each illustration, manually ensuring consistency.
2. Cross-Modal Reasoning
When a model processes audio natively, it can reason about the relationship between what someone says and how they say it. “The speaker said ‘I’m fine’ but their tone suggests they are upset” requires understanding both the text content and the audio signal simultaneously. A pipeline that transcribes audio to text before processing loses this capability entirely.
3. Iterative Multimodal Editing
Native image generation within a language model enables conversational image editing. You can say “generate a logo for my coffee shop,” see the result, then say “make the text larger and change the color to dark green,” and the model modifies the existing image rather than generating a new one from scratch. This works because the model maintains the full context of the conversation, including the previously generated image.
4. Reduced Latency
Eliminating pipeline steps reduces latency. GPT-4o’s 320ms audio response time (compared to 5.4 seconds with the GPT-4 pipeline) is a direct consequence of processing audio natively rather than through transcription and TTS steps.
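The latency argument is simple addition: a pipeline pays for every stage in sequence, while a native model pays once. In the sketch below, the 5.4-second total and the 320-millisecond native figure come from the text above, but the per-stage split is a hypothetical illustration chosen only so that the stages sum to 5.4 seconds; it is not a measured breakdown.

```python
NATIVE_LATENCY_S = 0.32  # GPT-4o average audio response time quoted above


def pipeline_latency(stage_seconds: dict) -> float:
    """Stages run one after another, so their latencies add up."""
    return sum(stage_seconds.values())


# Hypothetical split of the 5.4 s GPT-4 voice pipeline (illustrative only).
stages = {
    "speech_to_text": 1.2,
    "language_model": 3.0,
    "text_to_speech": 1.2,
}

total = pipeline_latency(stages)
speedup = total / NATIVE_LATENCY_S
print(f"Pipeline: {total:.1f} s, native: {NATIVE_LATENCY_S:.2f} s, "
      f"speedup: {speedup:.1f}x")
```

Whatever the real split is, the ratio of the two published averages works out to roughly a 17x reduction.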
5. Emergent Cross-Modal Capabilities
When a model is trained on interleaved multimodal data from scratch, it can develop capabilities that were not explicitly targeted. For example, GPT-4o can generate speech with laughter, singing, and dramatic emphasis. A plausible explanation is that unified training on diverse audio data let it pick up these patterns without a dedicated “laughing speech” dataset, although OpenAI has not published the training details needed to confirm this.
The Cost of Native Multimodality
Native multimodal models are more expensive to train and run than text-only models. Understanding the cost structure helps explain why not all models have adopted this approach.
Training Cost
Training a natively multimodal model requires:
- Multimodal training data: Interleaved text, image, audio, and video data is harder to collect and curate than text alone. The data must be high-quality across all modalities.
- Longer training: Processing images and audio alongside text increases the total number of tokens. Chameleon was trained on approximately 10 trillion tokens; a text-only model of similar size might train on 2-5 trillion text tokens.
- More compute: Image and audio tokens are more expensive to process than text tokens because they require additional encoding/decoding steps and often use higher-dimensional representations.
This is why the encoder-attached approach (Level 2) remains popular for open-source models. You can take an existing pretrained language model and pretrained vision encoder, connect them with a small projection layer, and fine-tune for a fraction of the cost. LLaVA-1.5 completes training in about 208 GPU-hours (as discussed in Chapter 21). Training a natively multimodal model from scratch requires millions of GPU-hours.
Inference Cost
Multimodal inputs and outputs consume more tokens than text alone:
def multimodal_cost_comparison():
    """
    Compare the token cost of different modalities.
    """
    print("Token Cost by Modality (Approximate)")
    print("=" * 60)
    # Each entry: (description, numeric token count, display string)
    examples = [
        ("Text: 1 page (~500 words)", 700, "~700 tokens"),
        ("Image: 1024x1024 (GPT-4o high)", 765, "765 tokens"),
        ("Image: 1000x1000 (Claude)", 1333, "1,333 tokens"),
        ("Audio: 1 minute (GPT-4o)", 1500, "~1,500 tokens*"),
        ("Video: 1 minute (Gemini)", 15780, "15,780 tokens"),
        ("Video: 10 minutes (Gemini)", 157800, "157,800 tokens"),
    ]
    # The display column is 16 wide so the longest entry stays aligned.
    print(f" {'Content':<40} {'Tokens':>16}")
    print(" " + "-" * 57)
    for content, _tokens, note in examples:
        print(f" {content:<40} {note:>16}")
    print("\n * Audio token counts vary by model and are less")
    print("   precisely documented than image token counts.")
    print("\n A 10-minute video call with screen sharing could")
    print(" easily consume 200,000+ tokens, compared to ~5,000")
    print(" tokens for the same conversation as text.")


multimodal_cost_comparison()

The token cost of multimodal content means that applications processing images, audio, or video consume context window space and API credits much faster than text-only applications. This is a practical consideration for developers building multimodal applications.
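For planning purposes, the per-item costs above can be turned into a rough budget check. This is a minimal sketch: the `TOKEN_COST` keys, the helper names, and the 8,192-token output reserve are illustrative assumptions, and the costs are the approximate figures from this section rather than values returned by any provider's token-counting endpoint.

```python
# Approximate per-item token costs taken from the comparison above.
TOKEN_COST = {
    "text_page": 700,
    "image_gpt4o_high_1024": 765,
    "image_claude_1000": 1333,
    "audio_minute_gpt4o": 1500,
    "video_minute_gemini": 15780,
}


def request_tokens(items: dict) -> int:
    """Estimated input tokens for a request, given item counts by type."""
    return sum(TOKEN_COST[name] * count for name, count in items.items())


def fits(items: dict, context_window: int, reserve_for_output: int = 8192) -> bool:
    """True if the inputs plus an output reserve fit in the context window."""
    return request_tokens(items) + reserve_for_output <= context_window


# Ten pages of notes plus a ten-minute video.
request = {"text_page": 10, "video_minute_gemini": 10}
print(f"Estimated input: {request_tokens(request):,} tokens")
print("Fits in 1M context:  ", fits(request, 1_000_000))
print("Fits in 128K context:", fits(request, 128_000))
```

In a real application the estimate should be replaced by the provider's own token counter, since actual counts depend on resolution, sampling settings, and model version.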
The State of Native Multimodality in March 2026
Let us take stock of where each major model family stands on native multimodality as of March 2026.
| Model Family | Native Input | Native Output | Image Gen | Audio Gen | Video Understanding | Architecture |
|---|---|---|---|---|---|---|
| GPT-4o / GPT-5.4 | Text, image, audio, video | Text, image, audio | Yes (autoregressive) | Yes (native) | Yes | Unified decoder-only Transformer |
| Gemini 3 / 3.1 | Text, image, audio, video | Text, image, audio | Yes (autoregressive) | Yes (native TTS + Lyria 3 music) | Yes (1 FPS, 263 tok/s) | Decoder-only with modality embeddings |
| Claude Sonnet 4.6 | Text, image | Text | No | No (ElevenLabs) | No | Text+vision native, audio pipeline |
| LLaMA 4 Maverick | Text, image | Text | No | No | No | MoE + MetaCLIP early fusion |
| Qwen 3.5 | Text, image | Text | No | No | Yes (60s clips) | Early fusion MoE |
| Qwen3-VL | Text, image, video | Text | No | No | Yes (up to 2h) | Encoder-attached, 256K-1M context |
| Mistral Small 4 | Text, image | Text | No | No | No | MoE + Pixtral early fusion |
The table reveals a clear divide. OpenAI and Google have invested heavily in fully native multimodal architectures that can both understand and generate across multiple modalities. Anthropic, Meta, Alibaba, and Mistral have focused primarily on multimodal input (understanding images and, in some cases, video) while keeping output primarily text-based.
This divide is partly about resources (training a fully native multimodal model with generation capabilities is extremely expensive) and partly about strategy (Anthropic has explicitly prioritized safety and reasoning over multimodal generation). Notably, the open-source models have converged on early fusion as the standard approach for multimodal input: LLaMA 4 (April 2025), Qwen 3.5 (February 2026), and Mistral Small 4 (March 2026) all train on interleaved text and image data from the start, rather than bolting on a vision encoder after pretraining.
The trend is also moving toward making native multimodal capabilities available in smaller, cheaper models. On March 17, 2026, OpenAI released GPT-5.4 mini and GPT-5.4 nano, bringing the multimodal understanding capabilities of the GPT-5.4 family to faster and more affordable models. GPT-5.4 mini runs more than 2x faster than the full GPT-5.4 while approaching its performance on several evaluations. This mirrors the pattern seen in open-source models, where Qwen 3.5’s smaller variants (4B, 9B) include native multimodal support with early fusion. Native multimodality is no longer reserved for the largest, most expensive models.
The Open-Source Gap
The gap between closed and open models is particularly stark for multimodal generation, but narrowing for multimodal understanding. As of March 2026:
- Image understanding: Open-source models have largely caught up. LLaMA 4 Maverick (April 2025), Qwen 3.5 (February 2026), and Mistral Small 4 (March 2026) all use early fusion to process images natively alongside text. The Qwen3-VL family (November 2025) can process up to 2 hours of video with its 1M-token context window, while Qwen 3.5 supports 60-second video clips natively.
- Image generation: Only GPT-4o/GPT-5.4 and Gemini offer native image generation within a language model. No open-source model matches this capability.
- Audio generation: Only GPT-4o/GPT-5.4 and Gemini offer native audio generation. Claude uses an external partnership (ElevenLabs). Open-source models generally lack native audio.
- Video understanding: Gemini, GPT-4o/GPT-5.4, and the Qwen family support video input. Qwen3-VL (November 2025) demonstrated 99.5% accuracy on needle-in-a-haystack tests in 2-hour videos. Most other open-source models are limited to image input or short video clips.
The research foundations exist (Chameleon, Transfusion), but the engineering effort and compute required to train production-quality native multimodal models with generation capabilities at frontier scale remain concentrated in a few well-resourced labs.
Source: LLaMA 4 Maverick released April 5, 2025, early fusion with MetaCLIP vision encoder, trained on 40+ trillion tokens (confirmed from huggingface.co/blog/llama4-release, ai.azure.com). Qwen 3.5 released February 16, 2026, early fusion MoE, 397B/17B active, 201 languages, 262K context extensible to 1M, supports 60-second video clips and images up to 1344x1344 (confirmed from launchberg.com, huggingface.co/blog/mlabonne/qwen35, the-decoder.com, thenextgentechinsider.com). Qwen3-VL released November 26, 2025, 256K native context expandable to 1M, 99.5% accuracy on needle-in-a-haystack in 2-hour videos (confirmed from unite.ai, the-decoder.com, huggingface.co/Qwen/Qwen3-VL-4B-Instruct). Qwen 3.5 small model series released March 2, 2026 with early fusion (confirmed from oflight.co.jp, techcommunity.microsoft.com). Mistral Small 4 released March 16, 2026, 119B/6B active, 128 experts, multimodal text+image, Apache 2.0 (confirmed from mistral.ai/news/mistral-small-4, huggingface.co/mistralai/Mistral-Small-4-119B-2603). GPT-5.4 mini and nano released March 17, 2026, multimodal understanding, mini runs 2x faster than GPT-5.4 (confirmed from openai.com/index/introducing-gpt-5-4-mini-and-nano, buildfastwithai.com, blockchain.news).
How Native Multimodal Training Works
Training a natively multimodal model differs fundamentally from the two-stage approach described in Chapter 21 (where you connect a pretrained vision encoder to a pretrained language model). In native multimodal training, the model learns all modalities simultaneously from the beginning.
Data Preparation
The training data for a native multimodal model consists of interleaved sequences of different modalities:
- Text documents: Standard web text, books, code (same as text-only pretraining)
- Image-text pairs: Images with captions or descriptions
- Interleaved image-text documents: Web pages, articles, and social media posts where images and text appear together naturally
- Audio-text pairs: Speech recordings with transcriptions
- Video-text pairs: Videos with descriptions or subtitles
The key is that these are not separate datasets processed independently. They are combined into a single training stream where the model encounters text, images, audio, and video in natural combinations, just as they appear on the internet.
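The idea of a single training stream can be sketched as follows. The token strings are hypothetical placeholders (real systems use integer IDs produced by each modality's tokenizer), and the `<image_start>` / `<image_end>` style boundary markers follow the naming used later in this chapter rather than any specific model's vocabulary.

```python
def interleave(document: list) -> list:
    """Flatten (modality, tokens) segments into one token stream.

    Non-text segments are wrapped in boundary markers so the model can
    tell where each modality begins and ends.
    """
    stream = []
    for modality, tokens in document:
        if modality == "text":
            stream.extend(tokens)
        else:
            stream.append(f"<{modality}_start>")
            stream.extend(tokens)
            stream.append(f"<{modality}_end>")
    return stream


# A tiny interleaved document: text, an image, then more text.
document = [
    ("text", ["A", "turtle"]),
    ("image", ["img_17", "img_902"]),
    ("text", ["swims"]),
]
stream = interleave(document)

# Next-token prediction pairs span modality boundaries without special casing.
training_pairs = list(zip(stream[:-1], stream[1:]))
print(stream)
print(training_pairs[2])
```

Because the pairs are built from the flattened stream, the model is asked to predict the first image token from the preceding text, which is how cross-modal relationships enter the training signal.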
Tokenization Across Modalities
Each modality needs to be converted into a format the Transformer can process:
def multimodal_tokenization():
    """
    Show how different modalities are tokenized for a native multimodal model.
    """
    print("Multimodal Tokenization")
    print("=" * 65)
    # Each entry: (modality, tokenizer, token type, density, note)
    modalities = [
        ("Text", "BPE tokenizer", "Discrete tokens",
         "~100K-200K vocab", "Chapter 4"),
        ("Images", "VQ-VAE or patch embedding", "Discrete or continuous",
         "1,024 tokens per 512x512", "Chapter 21"),
        ("Audio", "Audio codec (e.g., EnCodec)", "Discrete tokens",
         "~50-75 tokens/second", "Codec-based"),
        ("Video", "Frame sampling + image tokenizer", "Per-frame tokens",
         "258 tokens/frame at 1 FPS", "Frame-based"),
    ]
    # Column widths are sized to the longest entry in each column.
    print(f" {'Modality':<10} {'Tokenizer':<34} {'Token Type':<24} "
          f"{'Density':<26}")
    print(" " + "-" * 97)
    for mod, tok, ttype, density, _note in modalities:
        print(f" {mod:<10} {tok:<34} {ttype:<24} {density:<26}")
    print("\n All tokens are embedded into the same vector space")
    print(" and processed by the same Transformer layers.")
    print(" Special tokens mark modality boundaries:")
    print(" <image_start> ... image tokens ... <image_end>")
    print(" <audio_start> ... audio tokens ... <audio_end>")


multimodal_tokenization()

Training Objectives
The training objective depends on the modality:
- Text: Standard next-token prediction (cross-entropy loss). This is the same objective used in text-only pretraining (Chapter 14).
- Images (discrete tokens): Next-token prediction, identical to text. The model predicts the next image token in the sequence. This is the Chameleon approach.
- Images (continuous): Diffusion loss (mean squared error on noise prediction). The model learns to denoise images. This is the Transfusion approach.
- Audio: Next-token prediction on discrete audio tokens, or a specialized audio loss function.
For models that use discrete tokens for all modalities (like Chameleon), the training is elegantly simple: it is just next-token prediction on a mixed sequence of text and image tokens. The model does not need to know which tokens are text and which are images. It just predicts the next token.
For models that use the Transfusion approach (discrete for text, continuous for images), the training uses two loss functions simultaneously: cross-entropy for text tokens and diffusion loss for image patches. The losses are combined with a weighting factor that balances the two objectives.
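The two-loss setup can be sketched in a few lines of plain Python. This is a toy illustration, not the Transfusion implementation: the inputs are hand-written numbers rather than model outputs, and the weight of 5.0 is an illustrative choice for the balancing factor.

```python
import math


def cross_entropy(probs_of_correct_token: list) -> float:
    """Mean negative log-likelihood over the text positions."""
    return -sum(math.log(p) for p in probs_of_correct_token) / len(probs_of_correct_token)


def diffusion_mse(predicted_noise: list, true_noise: list) -> float:
    """Mean squared error between predicted and actual noise values."""
    return sum((p - t) ** 2 for p, t in zip(predicted_noise, true_noise)) / len(true_noise)


def combined_loss(text_probs, predicted_noise, true_noise, weight):
    """Text loss plus a weighted image loss, summed into one scalar."""
    return cross_entropy(text_probs) + weight * diffusion_mse(predicted_noise, true_noise)


# Toy batch: two text positions and two noise values from one image patch.
text_probs = [0.5, 0.25]          # probability assigned to each correct token
predicted_noise = [0.1, -0.2]
true_noise = [0.0, 0.0]

loss = combined_loss(text_probs, predicted_noise, true_noise, weight=5.0)
print(f"text loss  = {cross_entropy(text_probs):.4f}")
print(f"image loss = {diffusion_mse(predicted_noise, true_noise):.4f}")
print(f"combined   = {loss:.4f}")
```

Raising the weight pushes the optimizer to prioritize image denoising; lowering it favors text prediction. Picking that balance is part of the tuning effort the next section describes.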
Stability Challenges
Training natively multimodal models is harder than training text-only models. The Chameleon paper documented several stability challenges:
- Divergence: The training loss can suddenly spike and diverge, especially when the model encounters unusual combinations of modalities. Chameleon required careful architectural and training modifications (including query-key normalization, reordering of the normalization layers, and z-loss regularization) to achieve stable training.
- Modality competition: If one modality dominates the training data, the model may become good at that modality at the expense of others. Careful data mixing ratios are needed to ensure balanced performance.
- Codebook collapse: In VQ-VAE-based approaches, some codebook entries may never be used (the encoder always maps to the same subset of codes). This wastes representational capacity and degrades image quality.
These challenges explain why native multimodal training requires significantly more engineering effort than the encoder-attached approach, and why it remains concentrated at well-resourced labs.
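Codebook collapse, the third challenge listed above, is usually easy to detect once you count which codes the encoder actually emits. The sketch below computes two common diagnostics, the fraction of codebook entries used and the perplexity of the usage distribution. The function name and the toy 8-entry codebook are illustrative assumptions; a real check would run over the code indices produced for a large batch of images.

```python
import math
from collections import Counter


def codebook_usage(code_indices: list, codebook_size: int) -> tuple:
    """Return (utilization, perplexity) for a batch of emitted code indices.

    utilization: fraction of codebook entries that appear at least once.
    perplexity:  exp(entropy) of the usage distribution; it equals the
                 codebook size when every code is used equally often.
    """
    counts = Counter(code_indices)
    total = len(code_indices)
    utilization = len(counts) / codebook_size
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return utilization, math.exp(entropy)


healthy = list(range(8)) * 4       # all 8 codes used equally
collapsed = [0] * 30 + [1] * 2     # almost everything maps to code 0

for name, indices in (("healthy", healthy), ("collapsed", collapsed)):
    utilization, perplexity = codebook_usage(indices, codebook_size=8)
    print(f"{name:<10} utilization = {utilization:.2f}, "
          f"perplexity = {perplexity:.2f}")
```

A perplexity far below the codebook size signals that most of the representational capacity is going unused.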
Comparing Approaches: When to Use What
The choice between bolted-on, encoder-attached, early fusion, and native multimodal architectures depends on the use case:
| Use Case | Best Approach | Why |
|---|---|---|
| Image understanding only | Early fusion (Level 2.5) | Strong performance, many open-source options (LLaMA 4, Qwen 3.5, Mistral Small 4) |
| Image generation | Native (Level 4) | Only native models can generate images within conversation context |
| Voice assistant | Native (Level 4) | Low latency, emotional understanding |
| Document analysis | Early fusion or Native | Both work well; early fusion is cheaper |
| Video analysis | Native (Level 4) or Qwen3-VL | Requires temporal reasoning across frames; Qwen3-VL supports up to 2h video |
| Multi-turn image editing | Native (Level 4) | Requires maintaining visual context across turns |
| Text-only tasks | Any | Native models are no worse at text; they simply handle other modalities as well |
For developers building applications in March 2026, the practical choice often comes down to:
- If you need image or audio generation: Use GPT-4o/GPT-5.4 or Gemini. These are the only models with native multimodal output.
- If you need image understanding at low cost: Use an open-source early fusion model (LLaMA 4, Qwen 3.5, Mistral Small 4) running locally or on your own infrastructure. These models train on interleaved multimodal data from the start, providing stronger cross-modal reasoning than older encoder-attached approaches.
- If you need the best reasoning with image input: Claude Sonnet 4.6 or GPT-5.4, depending on the specific task.
- If you need video understanding: Gemini (best video support with long context), GPT-5.4, or Qwen3-VL (up to 2 hours of video with its 1M-token context window; 99.5% needle-in-a-haystack accuracy).
The Future: Unified Multimodal Embeddings
A recent development points to where native multimodality is heading. On March 10, 2026, Google released Gemini Embedding 2, described as the first natively multimodal embedding model. It maps text, images, video, audio, and PDF documents into a single unified vector space.
This is significant because it extends the native multimodal concept beyond generation to retrieval and search. With a unified embedding space, you can:
- Search for images using text queries (and vice versa)
- Find videos that match an audio clip
- Retrieve documents that are semantically similar to an image
- Build RAG (Retrieval-Augmented Generation) systems that work across all modalities
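What a unified vector space buys you can be shown with a toy retrieval example. The three-dimensional vectors, file names, and query below are made up for illustration; a real system would obtain the vectors from an embedding model and would typically use a vector database instead of a sorted list.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical index: items of different modalities share one vector space.
index = [
    ("image", "turtle_photo.png", [0.9, 0.1, 0.0]),
    ("audio", "ocean_waves.wav", [0.2, 0.9, 0.1]),
    ("pdf", "pricing_sheet.pdf", [0.0, 0.1, 0.9]),
]


def search(query_vec: list, index: list, top_k: int = 1) -> list:
    """Rank every item by similarity to the query, regardless of modality."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[2]), reverse=True)
    return [(modality, name) for modality, name, _ in ranked[:top_k]]


text_query = [0.8, 0.2, 0.1]  # hypothetical embedding of the text "a sea turtle"
print(search(text_query, index, top_k=2))
```

The search function never inspects the modality label. With separate per-modality embedding models, the same query would need one index and one similarity scale per modality, plus logic to merge the results.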
The model generates 3,072-dimensional embedding vectors by default and supports Matryoshka Representation Learning (MRL), which allows the output dimensions to be scaled down to as few as 128 without significant quality loss (Google recommends 768, 1,536, or 3,072 for production use). This is useful for applications that need to trade off embedding quality against storage and computation costs.
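The practical effect of Matryoshka-style embeddings is that a shorter vector is obtained by keeping the leading dimensions. The sketch below assumes the common practice of re-normalizing after truncation and assumes 4-byte float storage; the helper names and the synthetic input vector are illustrative and are not output from any embedding API.

```python
import math


def truncate_embedding(vec: list, dims: int) -> list:
    """Keep the first `dims` values and rescale to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]


def storage_bytes(dims: int, bytes_per_value: int = 4) -> int:
    """Storage for one vector, assuming 32-bit floats."""
    return dims * bytes_per_value


# Synthetic stand-in for a 3,072-dimensional embedding.
full = [math.sin(i + 1) for i in range(3072)]

for dims in (3072, 1536, 768, 128):
    small = truncate_embedding(full, dims)
    print(f"{dims:>5} dims: {storage_bytes(dims):>6,} bytes per vector, "
          f"length after rescaling = {math.sqrt(sum(x * x for x in small)):.3f}")
```

Dropping from 3,072 to 768 dimensions cuts storage per vector by a factor of four, which is where the trade-off between quality and cost comes from.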
The input limits are generous: up to 8,192 tokens for text (four times the 2,048-token limit of the previous gemini-embedding-001 model), up to 6 images per request, up to 120 seconds of video, and up to 80 seconds of audio. The model supports semantic understanding in over 100 languages and handles audio natively without requiring a transcription step, which matters for music, ambient sound, or non-speech audio that would lose meaning if converted to text first.
Early adopters reported a 70% latency reduction over conventional multi-model pipelines that used separate embedding models for each modality. Legal discovery teams saw a 20% improvement in recall when switching from separate text and image embedding pipelines to the unified model.
This suggests the future of AI systems is not just native multimodal generation, but native multimodal everything: understanding, generation, retrieval, and reasoning, all unified in a single architecture.
Source: Gemini Embedding 2 released March 10, 2026, first natively multimodal embedding model, maps text, images, video, audio, and PDFs into single unified vector space (confirmed from the-decoder.com, buildfastwithai.com, awesomeagents.ai, launchberg.com). 3,072-dimensional vectors with MRL support (confirmed from gaga.art, docs.cloud.google.com/vertex-ai, apidog.com). 8,192 token limit, 4x previous 2,048-token limit (confirmed from the-decoder.com). Input limits: 6 images, 120s video, 80s audio (confirmed from kiadev.net, launchberg.com). Handles audio natively without transcription (confirmed from the-decoder.com). 70% latency reduction from early adopters, 20% recall improvement in legal discovery (confirmed from buildfastwithai.com).
Key Takeaways
Native multimodal models process all modalities (text, images, audio, video) through a single neural network trained end-to-end, rather than stitching together separate models. This eliminates information loss at modality boundaries, reduces latency, and enables cross-modal reasoning during generation.
GPT-4o (May 2024) was the first mainstream native multimodal model, achieving 320 ms average audio response latency (compared to 2.8 seconds with the GPT-3.5 voice pipeline and 5.4 seconds with the GPT-4 pipeline) by processing audio directly rather than through transcription. It accepts any combination of text, audio, image, and video as input and generates text, audio, and image outputs.
Visual tokenization converts images into discrete tokens using VQ-VAE (Vector Quantized Variational Autoencoder). A 512x512 image becomes approximately 1,024 tokens from a codebook of 8,192 entries. Once images are tokens, they can be generated autoregressively, just like text.
Chameleon (Meta, arXiv:2405.09818, May 2024) demonstrated early fusion at scale: a single model with 7B or 34B parameters, trained on 10 trillion tokens of interleaved text and image data, that can understand and generate both modalities. It proved that a unified architecture can match specialized models in each domain.
Transfusion (Meta/Waymo, arXiv:2408.11039, August 2024) showed that you can combine autoregressive generation (for text) and diffusion (for images) in a single Transformer, avoiding the information loss from image quantization while maintaining the benefits of unified training.
GPT-4o native image generation launched March 25, 2025, replacing DALL-E 3 with an autoregressive approach that generates images within the same model that processes text. In the first week, 130+ million users generated 700+ million images. The model was released as GPT Image 1 (API, April 23, 2025) and upgraded to GPT Image 1.5 (December 16, 2025) with 4x faster generation.
Gemini’s image generation evolved through Nano Banana (August 2025, 13M users in 4 days, 23M in 2 weeks), Nano Banana Pro (November 2025, native 4K, 94% text accuracy), and Nano Banana 2 (February 2026, Pro quality at Flash speed, 40-50% faster and 50% cheaper). All use autoregressive generation within the Gemini Transformer.
Native audio in GPT-4o and Gemini enables emotional understanding, natural interruption handling, and expressive speech generation. Gemini 2.5 Flash Native Audio provides 30 HD voices in 24 languages. Google’s Gemini Audio product family also includes real-time speech translation across 70+ languages and Lyria 3 music generation (integrated February 18, 2026, producing 30-second tracks with vocals from text, image, or video prompts). Claude uses ElevenLabs for audio (a pipeline approach), reflecting Anthropic’s focus on text and vision rather than native audio.
Video understanding is token-expensive: Gemini processes video at about 263 tokens per second (1 FPS sampling, with 258 tokens per frame). A 1-hour video is 3,600 seconds, so it consumes approximately 947,000 tokens (3,600 × 263 = 946,800), nearly filling a 1M-token context window.
The architecture spectrum ranges from Level 1 (bolted-on separate models) through Level 2 (encoder-attached), Level 2.5 (early fusion with vision encoder, used by LLaMA 4, Qwen 3.5, and Mistral Small 4), Level 3 (native input, pipeline output), to Level 4 (fully native). GPT-4o/GPT-5.4 and Gemini are at Level 4; Claude is at Level 3; the leading open-source models are at Level 2.5.
Early fusion has become the standard for open-source multimodal models. LLaMA 4 (April 2025), Qwen 3.5 (February 2026), and Mistral Small 4 (March 2026) all train on interleaved text and image data from the start, rather than bolting on a vision encoder after pretraining. This narrows the gap with closed models on multimodal understanding, though not on multimodal generation.
Training native multimodal models requires interleaved multimodal data, unified tokenization across modalities, and careful handling of stability challenges (divergence, modality competition, codebook collapse). This is significantly more expensive than the encoder-attached approach, which is why full native multimodality (with generation) remains concentrated at well-resourced labs.
Gemini Embedding 2 (March 10, 2026) extends native multimodality to embeddings, mapping text, images, video, audio, and PDFs into a single 3,072-dimensional vector space for unified cross-modal retrieval, with 8,192-token text input and native audio support without transcription.
Gemini Diffusion (May 2025) demonstrated that diffusion techniques can generate text at 1,479 tokens per second, 5x faster than Gemini 2.0 Flash-Lite, while matching that model’s quality. If diffusion works well for both images and text, future multimodal models may use a single diffusion-based generation mechanism for all modalities.
Native multimodality is moving to smaller models. GPT-5.4 mini and nano (March 17, 2026) bring multimodal understanding to faster, cheaper models. Qwen 3.5’s smaller variants (4B, 9B) include early fusion multimodal support. The era of multimodality as a premium feature reserved for frontier models is ending.
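Two of the figures in the takeaways above are simple arithmetic and can be checked directly: the image token count from visual tokenization and the token cost of an hour of video. The sketch below is illustrative only. The downsampling factor of 16 is an assumption inferred from the 512x512 to 1,024-token figure (a 32x32 grid), and the function names are hypothetical.

```python
def vq_token_count(height, width, downsample=16):
    """Tokens produced when each downsample x downsample patch becomes one code.

    Assumption: a factor of 16, inferred from 512x512 -> 1,024 tokens.
    """
    if height % downsample or width % downsample:
        raise ValueError("image size must be a multiple of the downsample factor")
    return (height // downsample) * (width // downsample)

def bits_per_token(codebook_size):
    """Bits needed to index one entry of the codebook."""
    return (codebook_size - 1).bit_length()

def video_tokens(seconds, tokens_per_second=263):
    """Token cost of video at the per-second rate quoted in the text."""
    return seconds * tokens_per_second

image_tokens = vq_token_count(512, 512)       # 32 * 32 = 1,024 tokens
index_bits = bits_per_token(8192)             # 13 bits per token
hour_tokens = video_tokens(3600)              # 946,800 tokens
context_share = hour_tokens / 1_000_000       # fraction of a 1M-token window
```

Under these assumptions, a tokenized 512x512 image is a sequence of 1,024 thirteen-bit indices, and one hour of video uses roughly 95% of a 1M-token context window before any prompt text is added.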
What’s Next
You now understand how native multimodal models work: the unified architectures that process all modalities through a single network, the tokenization schemes that convert images and audio into sequences the Transformer can process, and the training approaches that enable models to understand and generate across modalities. In Chapter 23, we will explore how these multimodal models are being deployed as agents that can take actions in the world: calling tools, browsing the web, operating computers, and executing multi-step workflows autonomously.