Multimodal AI

The world is not text. It is images, sound, video, code, diagrams, handwriting, sensor readings, and speech — often all at once. Multimodal models process and reason across these modalities, moving AI from language-only to something closer to human perception.

From Text-Only to Multimodal

Early LLMs operated exclusively on text. A user could describe an image in words, but the model could not see it. Multimodal models accept multiple input types — images, audio, video, code — in a single context window, and reason across them.

The architecture typically encodes each modality into the same embedding space the language model uses. An image encoder (often a Vision Transformer, ViT) maps image patches to token-like representations that are concatenated with text tokens in the context. The language model processes the combined sequence, reasoning about the image and text jointly.
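The patch-to-token step described above can be sketched in a few lines. This is a minimal illustration, not any particular model's implementation: the patch size, embedding width, and random projection weights are assumptions, and a real Vision Transformer would also add position information and run the patches through its own transformer layers before handing them to the language model.

```python
import random

def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patch vectors."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly into patches"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches

def project(vec, weights):
    """Linear projection; weights holds one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

rng = random.Random(0)
D_MODEL = 8   # assumed embedding width shared with the language model
PATCH = 4     # assumed patch size in pixels

# Toy 8x8 image and an untrained (random) projection matrix.
image = [[rng.random() for _ in range(8)] for _ in range(8)]
proj = [[rng.gauss(0, 0.1) for _ in range(PATCH * PATCH)] for _ in range(D_MODEL)]

# Each patch becomes one token-like vector of width D_MODEL.
image_tokens = [project(p, proj) for p in patchify(image, PATCH)]

# Stand-in text embeddings of the same width (3 hypothetical text tokens).
text_tokens = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(3)]

# The language model would process this single combined sequence.
sequence = image_tokens + text_tokens
print(len(image_tokens), len(sequence), len(sequence[0]))
```

The point of the sketch is the shape bookkeeping: once image patches and text tokens share one vector width, they can sit in the same sequence.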

Early vs Late Fusion

Early fusion interleaves modality tokens at the input level — image tokens are mixed with text tokens and processed together through the full transformer stack. This allows deep cross-modal reasoning but requires training (or fine-tuning) the entire model on multimodal data.

Late fusion processes each modality through a separate encoder, then combines the outputs at a later stage. This is simpler and allows reuse of existing unimodal models, but limits the depth of cross-modal interaction. Modern frontier models (GPT-4V, Claude, Gemini) are generally understood to use early or intermediate fusion for richer reasoning, although their architectures are not fully published.
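The structural difference between the two approaches can be shown with toy vectors. In this sketch the function names are hypothetical and mean pooling stands in for a full per-modality encoder. Early fusion keeps every token in one sequence so a shared model could attend across modalities. Late fusion reduces each modality to a summary first and only then joins the summaries.

```python
def early_fusion(image_tokens, text_tokens):
    """Build one tagged sequence; a single transformer would attend over all of it."""
    return [("image", t) for t in image_tokens] + [("text", t) for t in text_tokens]

def mean_pool(tokens):
    """Collapse a token sequence to one vector (stand-in for a unimodal encoder)."""
    dim = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]

def late_fusion(image_tokens, text_tokens):
    """Summarize each modality separately, then concatenate the summaries."""
    return mean_pool(image_tokens) + mean_pool(text_tokens)

image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_tokens = [[2.0, 2.0], [4.0, 0.0]]

fused_seq = early_fusion(image_tokens, text_tokens)   # 5 tokens, per-token detail kept
fused_vec = late_fusion(image_tokens, text_tokens)    # 4 numbers, per-token detail lost

print(len(fused_seq), len(fused_vec))
```

After late fusion, a downstream model can no longer relate an individual image patch to an individual word, which is the limit on cross-modal depth noted above.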

Vision-Language Models

Vision-language modeling is the most mature multimodal capability. Vision-language models can:

  • Describe images in natural language with high accuracy and detail.
  • Answer questions about image content: “What brand is the laptop in this photo?” or “How many people are in this room?”
  • Read text in images (OCR) including handwriting, signs, receipts, and screenshots.
  • Interpret diagrams — flowcharts, architecture diagrams, circuit schematics — and explain their content.
  • Reason spatially — understand relative positions, counts, and relationships between objects.

These capabilities combine the image encoder’s perceptual ability with the language model’s reasoning, enabling tasks that neither component could do alone.

Limitations of Vision-Language Models

Counting remains unreliable: models frequently miscount objects in cluttered scenes. Fine-grained spatial reasoning (“Is the red cup to the left or right of the blue book?”) is inconsistent. Hallucination extends to vision: models may describe objects that are not in the image, particularly when the question implies their presence. OCR accuracy varies with font quality, image resolution, and language. Each model generation narrows these gaps, but none of them is solved.

Speech and Audio Models

Whisper (OpenAI) demonstrated that a single transformer trained on 680,000 hours of multilingual audio could rival specialized speech recognition systems. The model handles multiple languages, accents, and background noise, and can translate speech from other languages into English.

Real-time voice capabilities such as GPT-4o’s voice mode go further: the model processes speech input and generates speech output natively, without a separate speech-to-text / text-to-speech pipeline. This enables natural conversational interaction: interruptions, tone-aware responses, and sub-second latency.

End-to-End vs Pipeline Voice

The pipeline approach — transcribe speech to text, process text with an LLM, synthesize text to speech — loses information at each stage. Prosody, tone, emphasis, and emotional cues in the input speech are discarded during transcription. The synthesized output speech is disconnected from the conversational flow. End-to-end voice models process audio directly, preserving these signals and producing more natural interactions. The tradeoff: end-to-end models are harder to train and debug, and their reasoning about the spoken content is less transparent than a text-intermediate pipeline.
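The information loss at the transcription stage can be made concrete with a toy model. Everything here is hypothetical: the Utterance fields are hand-written labels standing in for acoustic cues that a real system would have to extract from raw audio, and the functions are stubs rather than real speech components.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Utterance:
    words: str
    tone: str       # hypothetical label standing in for prosody
    emphasis: str   # hypothetical: which word the speaker stressed

def speech_to_text(u):
    """Stub transcription stage: only the words survive."""
    return u.words

def pipeline_llm_input(u):
    """What the LLM receives in a transcribe -> LLM -> synthesize pipeline."""
    return {"words": speech_to_text(u)}

def end_to_end_input(u):
    """What a model that consumes the audio directly could still draw on."""
    return asdict(u)

sarcastic = Utterance("oh great", tone="sarcastic", emphasis="great")
delighted = Utterance("oh great", tone="delighted", emphasis="oh")

# The pipeline cannot tell these two utterances apart; the direct path can.
print(pipeline_llm_input(sarcastic) == pipeline_llm_input(delighted))
print(end_to_end_input(sarcastic) == end_to_end_input(delighted))
```

The text intermediate that discards tone is also what makes the pipeline easier to inspect: the transcript is a readable record of what the LLM was given.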

Video Understanding

Video extends image understanding with temporal reasoning. Models that process video can:

  • Summarize the content of a video clip.
  • Answer questions about events and their sequence (“What happened after the car stopped?”).
  • Track objects across frames.
  • Understand actions and activities in the scene.

Video is computationally expensive — a 30-second clip at 30fps contains 900 frames, each requiring encoding. Practical approaches sample keyframes or process video at reduced temporal resolution, trading temporal granularity for computational feasibility.
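The frame arithmetic and the sampling tradeoff can be sketched as follows. This uses plain uniform sampling, which is the simplest option and not true keyframe detection (that would pick frames based on scene changes). The function names and the budget of 8 frames are assumptions for illustration.

```python
def total_frames(seconds, fps):
    """Number of frames in a clip of the given length and frame rate."""
    return int(seconds * fps)

def uniform_sample(n_frames, n_samples):
    """Pick n_samples frame indices spread evenly across the clip.

    Each index is taken from the middle of its equal-width segment.
    """
    if n_samples >= n_frames:
        return list(range(n_frames))
    step = n_frames / n_samples
    return [int(i * step + step / 2) for i in range(n_samples)]

n = total_frames(30, 30)          # 30 seconds at 30 fps
idx = uniform_sample(n, 8)        # assumed budget: 8 frames for the encoder

print(n)              # 900
print(idx)            # [56, 168, 281, 393, 506, 618, 731, 843]
print(n / len(idx))   # 112.5 frames skipped per frame kept
```

With this budget the model sees one frame every 3.75 seconds, so any event shorter than that can fall entirely between samples.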

The Temporal Reasoning Gap

Current video models handle “what is happening” better than “why did it happen” or “what will happen next.” Causal and predictive reasoning about video content requires understanding physics (objects fall, liquids flow), social dynamics (facial expressions indicate emotion, gestures communicate intent), and narrative structure (setup implies payoff). These capabilities are emerging but remain significantly weaker than the corresponding text-based reasoning abilities.

Unified Architectures

The trajectory of multimodal AI points toward unified models — single architectures that natively process all modalities. Google’s Gemini was designed from the start as a multimodal model, training on text, images, audio, and video jointly rather than bolting vision onto a text model after the fact.

A unified model represents all modalities in a shared embedding space. Text, image patches, audio segments, and video frames are all sequences of tokens processed by the same transformer. This enables cross-modal reasoning that is difficult with separate models: “Listen to this audio clip and describe what is visible in this image that could be making the sound.”

The Any-to-Any Vision

The ultimate multimodal model accepts any combination of input modalities and generates any combination of output modalities. Text in, image out (image generation). Image and text in, audio out (a spoken description of what is in the image). Video and audio in, text out (meeting transcription with speaker identification). Current models handle many of these combinations but not all, and output modalities remain more limited than input modalities — most models generate text and some generate images, but few generate high-quality audio or video.

Cross-Modal Reasoning

The most valuable multimodal capability is not just processing different modalities but reasoning across them. Practical applications:

Document processing: an insurance claim includes a photo of damage, a handwritten repair estimate, and a typed policy document. A multimodal model reads the photo, OCRs the handwriting, parses the policy, and determines coverage — a task that previously required multiple specialized systems and human review.

Accessibility: describing images for visually impaired users, transcribing audio for deaf users, interpreting sign language, reading aloud from photos of text. Multimodal models make content accessible across sensory modalities.

Content creation: given a text description and reference images, generate a presentation with appropriate visuals. Given a video, generate a text summary, chapter markers, and a social media post. The model handles cross-modal translation that previously required specialized tools and human creativity.

Grounding and Attribution

A persistent challenge in cross-modal reasoning is grounding — connecting generated text to specific regions of an image or specific moments in audio/video. When a model says “the crack in the upper left corner,” can it point to the exact pixels? Grounding enables fact-checking (did the model actually see what it claims to see?) and makes the output actionable (the user can look at the specific region). Models are increasingly capable of generating bounding boxes, segmentation masks, and temporal markers, but grounding accuracy varies by task and model.
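One common way to score a predicted bounding box against a reference box is intersection over union (IoU). The sketch below assumes boxes are given as (x_min, y_min, x_max, y_max) in pixels. The example coordinates are made up for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union for two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max). Returns a value in [0, 1].
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped to zero when boxes are disjoint.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

predicted = (10, 10, 60, 60)   # hypothetical box emitted by a model
reference = (20, 20, 70, 70)   # hypothetical human-annotated box

score = iou(predicted, reference)
print(round(score, 3))   # 0.471
```

A score of 1.0 means the boxes coincide and 0.0 means they do not overlap. Evaluations typically count a prediction as correct above some chosen threshold, and that threshold is a judgment call that depends on the task.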

Key takeaways

This lesson establishes:

  • How vision-language models encode images into the same representation space as text
  • How early fusion and late fusion approaches to multimodal architecture differ
  • The advantages of end-to-end voice models over speech-to-text/LLM/text-to-speech pipelines
  • The current limitations of multimodal models (counting, spatial reasoning, hallucination)
  • Three practical applications of cross-modal reasoning
