The world is not text. It is images, sound, video, code, diagrams, handwriting, sensor readings, and speech — often all at once. Multimodal models process and reason across these modalities, moving AI from language-only to something closer to human perception.
Early LLMs operated exclusively on text. A user could describe an image in words, but the model could not see it. Multimodal models accept multiple input types — images, audio, video, code — in a single context window, and reason across them.
The architecture typically encodes each modality into the same embedding space the language model uses. An image encoder (often a Vision Transformer, ViT) maps image patches to token-like representations that are concatenated with text tokens in the context. The language model processes the combined sequence, reasoning about the image and text jointly.
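The patch-to-token step described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than any particular model's implementation: the 224×224 image size, the 16-pixel patch size, the embedding width `D_MODEL`, and the random projection matrix `W_proj` (a stand-in for learned weights) are all assumptions chosen for the example.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    grid = image.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * cols, patch * patch * c)

rng = np.random.default_rng(0)
D_MODEL = 64  # hypothetical embedding width shared with the language model

image = rng.random((224, 224, 3))          # stand-in for a real image
patches = patchify(image, 16)              # shape (196, 768)

# Stand-in for the learned linear projection of a ViT-style encoder.
W_proj = rng.normal(size=(patches.shape[1], D_MODEL)) * 0.02
image_tokens = patches @ W_proj            # shape (196, 64)

# Stand-in for already-embedded text tokens (12 tokens here).
text_tokens = rng.normal(size=(12, D_MODEL))

# The language model would consume this combined sequence.
sequence = np.concatenate([image_tokens, text_tokens], axis=0)  # shape (208, 64)
```

A real encoder would also add position information and pass the patches through several transformer layers before handing them to the language model; those steps are omitted to keep the sketch short.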
Early fusion interleaves modality tokens at the input level — image tokens are mixed with text tokens and processed together through the full transformer stack. This allows deep cross-modal reasoning but requires training (or fine-tuning) the entire model on multimodal data.
Late fusion processes each modality through a separate encoder, then combines the outputs at a later stage. This is simpler and allows reuse of existing unimodal models, but limits the depth of cross-modal interaction. Modern frontier models (GPT-4V, Claude, Gemini) are generally understood to use early or intermediate fusion for richer reasoning, although most vendors do not publish full architectural details.
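The difference between the two fusion styles can be made concrete with a toy example. The sketch below assumes a single weight-free self-attention step as a stand-in for a full transformer stack, and mean pooling as the late-fusion combiner; both are simplifications chosen for illustration, not a description of any production model.

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """One weight-free attention step: every token attends to every token in x."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

def early_fusion(image_tokens: np.ndarray, text_tokens: np.ndarray) -> np.ndarray:
    """Concatenate first, so image tokens can attend to text tokens and vice versa."""
    joint = np.concatenate([image_tokens, text_tokens], axis=0)
    return self_attention(joint)

def late_fusion(image_tokens: np.ndarray, text_tokens: np.ndarray) -> np.ndarray:
    """Encode each modality in isolation, then combine pooled summaries."""
    img = self_attention(image_tokens).mean(axis=0)
    txt = self_attention(text_tokens).mean(axis=0)
    return np.concatenate([img, txt])

rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(4, 8))
text_a = rng.normal(size=(3, 8))
text_b = rng.normal(size=(3, 8))

# Late fusion: the image half of the output ignores the text entirely.
late_unchanged = np.allclose(
    late_fusion(image_tokens, text_a)[:8],
    late_fusion(image_tokens, text_b)[:8],
)

# Early fusion: changing the text changes how the image tokens are represented.
early_changed = not np.allclose(
    early_fusion(image_tokens, text_a)[:4],
    early_fusion(image_tokens, text_b)[:4],
)
```

In this toy setup the image representation under late fusion is identical regardless of the accompanying text, while under early fusion it shifts with the text, which is the "depth of cross-modal interaction" the paragraphs above refer to.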
Image understanding is the most mature multimodal capability. Vision-language models can:
These capabilities combine the image encoder’s perceptual ability with the language model’s reasoning, enabling tasks that neither component could do alone.
Counting remains unreliable — models frequently miscount objects in cluttered scenes. Fine-grained spatial reasoning (“Is the red cup to the left or right of the blue book?”) is inconsistent. Hallucination extends to vision: models may describe objects that are not in the image, particularly when the question implies their presence. OCR accuracy varies with font quality, image resolution, and language. These limitations are improving with each model generation but are not yet solved.
Whisper (OpenAI) demonstrated that a single transformer trained on 680,000 hours of multilingual audio could match or exceed specialized speech recognition systems. The model handles multiple languages, accents, and background noise, and it can translate speech from other languages into English.
Real-time voice capabilities (GPT-4o’s voice mode, Claude’s voice) go further: the model processes speech input and generates speech output natively, without a separate speech-to-text / text-to-speech pipeline. This enables natural conversational interaction — interruptions, tone-aware responses, and sub-second latency.
The pipeline approach — transcribe speech to text, process text with an LLM, synthesize text to speech — loses information at each stage. Prosody, tone, emphasis, and emotional cues in the input speech are discarded during transcription. The synthesized output speech is disconnected from the conversational flow. End-to-end voice models process audio directly, preserving these signals and producing more natural interactions. The tradeoff: end-to-end models are harder to train and debug, and their reasoning about the spoken content is less transparent than a text-intermediate pipeline.
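The information loss at the transcription stage can be illustrated with a deliberately simplified model of an utterance. Everything here is hypothetical: the `Utterance` class, its `tone` and `emphasis` fields, and the `transcribe` function are invented for the example and do not correspond to any real speech API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Utterance:
    """Toy stand-in for a spoken input: words plus paralinguistic cues."""
    words: str
    tone: str                   # e.g. "enthusiastic", "sarcastic"
    emphasis: tuple[str, ...]   # words the speaker stressed

def transcribe(utterance: Utterance) -> str:
    """Pipeline stage 1 (speech-to-text): only the words survive."""
    return utterance.words

sincere = Utterance("great, another meeting", "enthusiastic", ("great",))
sarcastic = Utterance("great, another meeting", "sarcastic", ("another",))

# The two inputs differ as speech...
differ_as_audio = sincere != sarcastic

# ...but the text-only LLM downstream receives identical input for both.
indistinguishable = transcribe(sincere) == transcribe(sarcastic)
```

An end-to-end model would, in this picture, receive the whole `Utterance` rather than only its `words` field, which is why it can respond to tone and emphasis.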
Video extends image understanding with temporal reasoning. Models that process video can:
Video is computationally expensive — a 30-second clip at 30fps contains 900 frames, each requiring encoding. Practical approaches sample keyframes or process video at reduced temporal resolution, trading temporal granularity for computational feasibility.
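The frame arithmetic and one simple sampling strategy can be sketched as follows. Uniform sampling from the middle of equal-length segments is only one option, chosen here for illustration; the budget of 8 frames is an arbitrary assumption.

```python
def total_frames(duration_s: float, fps: float) -> int:
    """Number of frames in a clip of the given length and frame rate."""
    return round(duration_s * fps)

def uniform_sample_indices(n_frames: int, n_samples: int) -> list[int]:
    """Pick n_samples frame indices spread evenly across the clip."""
    if n_samples >= n_frames:
        return list(range(n_frames))
    step = n_frames / n_samples
    # Take the frame at the middle of each equal-length segment.
    return [int(step * i + step / 2) for i in range(n_samples)]

n = total_frames(30, 30)              # 900 frames in a 30-second clip at 30 fps
idx = uniform_sample_indices(n, 8)    # encode 8 frames instead of 900
reduction = n / len(idx)              # 112.5x fewer frames to encode
```

Sampling this sparsely leaves gaps of several seconds between frames, so brief events can fall between samples; this is the loss of temporal granularity mentioned above.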
Current video models handle “what is happening” better than “why did it happen” or “what will happen next.” Causal and predictive reasoning about video content requires understanding physics (objects fall, liquids flow), social dynamics (facial expressions indicate emotion, gestures communicate intent), and narrative structure (setup implies payoff). These capabilities are emerging but remain significantly weaker than the corresponding text-based reasoning abilities.
The trajectory of multimodal AI points toward unified models — single architectures that natively process all modalities. Google’s Gemini was designed from the start as a multimodal model, training on text, images, audio, and video jointly rather than bolting vision onto a text model after the fact.
A unified model represents all modalities in a shared embedding space. Text, image patches, audio segments, and video frames are all sequences of tokens processed by the same transformer. This enables cross-modal reasoning that is difficult with separate models: “Listen to this audio clip and describe what is visible in this image that could be making the sound.”
The ultimate multimodal model accepts any combination of input modalities and generates any combination of output modalities. Text in, image out (image generation). Image and text in, audio out (a spoken description of what is in the image). Video and audio in, text out (meeting transcription with speaker identification). Current models handle many of these combinations but not all, and output modalities remain more limited than input modalities — most models generate text and some generate images, but few generate high-quality audio or video.
The most valuable multimodal capability is not just processing different modalities but reasoning across them. Practical applications:
Document processing: an insurance claim includes a photo of damage, a handwritten repair estimate, and a typed policy document. A multimodal model reads the photo, OCRs the handwriting, parses the policy, and determines coverage — a task that previously required multiple specialized systems and human review.
Accessibility: describing images for visually impaired users, transcribing audio for deaf users, interpreting sign language, reading aloud from photos of text. Multimodal models make content accessible across sensory modalities.
Content creation: given a text description and reference images, generate a presentation with appropriate visuals. Given a video, generate a text summary, chapter markers, and a social media post. The model handles cross-modal translation that previously required specialized tools and human creativity.
A persistent challenge in cross-modal reasoning is grounding — connecting generated text to specific regions of an image or specific moments in audio/video. When a model says “the crack in the upper left corner,” can it point to the exact pixels? Grounding enables fact-checking (did the model actually see what it claims to see?) and makes the output actionable (the user can look at the specific region). Models are increasingly capable of generating bounding boxes, segmentation masks, and temporal markers, but grounding accuracy varies by task and model.
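For bounding boxes, one widely used way to quantify grounding accuracy is intersection over union (IoU) between the region the model predicts and a human-annotated reference region. The sketch below assumes boxes in `(x_min, y_min, x_max, y_max)` pixel coordinates; the `Box` type and the example coordinates are hypothetical.

```python
from typing import NamedTuple

class Box(NamedTuple):
    x_min: float
    y_min: float
    x_max: float
    y_max: float

def area(box: Box) -> float:
    """Area of a box; degenerate (inverted) boxes count as zero."""
    return max(0.0, box.x_max - box.x_min) * max(0.0, box.y_max - box.y_min)

def iou(a: Box, b: Box) -> float:
    """Intersection over union: 1.0 for identical boxes, 0.0 for disjoint ones."""
    intersection = Box(
        max(a.x_min, b.x_min),
        max(a.y_min, b.y_min),
        min(a.x_max, b.x_max),
        min(a.y_max, b.y_max),
    )
    inter_area = area(intersection)
    union_area = area(a) + area(b) - inter_area
    return inter_area / union_area if union_area > 0 else 0.0

predicted = Box(10, 10, 60, 60)   # where the model says "the crack" is
reference = Box(20, 20, 70, 70)   # where an annotator marked it
score = iou(predicted, reference) # 1600 / 3400, roughly 0.47
```

Evaluations typically count a prediction as correct when the IoU exceeds a chosen threshold; the threshold itself is a design choice that depends on how precise the downstream use needs to be.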
This lesson establishes: