AI Multimodal Models Introduction

Level: Intermediate. Reading time: 10 minutes.

What you'll learn

✓ What multimodal really means
✓ How images become tokens
✓ Common APIs for multimodal calls
✓ Where multimodal models help
✓ Cost and latency considerations

Prerequisites

• Familiar with APIs

Multimodal models accept, and sometimes produce, content in more than one modality: text, images, audio, video. Most current frontier models handle text and images, many also handle audio, and video support is still emerging. This post is a tour of what these models are, how they work at a high level, and how to use them.

What multimodal really is

A multimodal model has been trained on aligned data across modalities. The simplest version pairs images with captions and learns to map both into a shared representation. Later stages tune the model to follow instructions that reference the image, like “describe this chart” or “extract the totals from this invoice.”

Under the hood, a vision encoder splits each image into patches and turns each patch into a vector. Those vectors are fed alongside text tokens into the same transformer. The model does not see pixels the way you do; it sees a sequence of vectors produced by the encoder.
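
To make the patch step concrete, here is a minimal sketch in plain Python. It assumes a toy grayscale image stored as a list of rows and a hypothetical helper named patchify. In a real vision encoder each flattened patch would then go through a learned projection to become an embedding; that step is omitted here.

```python
def patchify(image, patch_size):
    """Split a 2D grid of pixel values into flattened square patches.

    `image` is a list of rows. Height and width must be multiples of
    `patch_size`. Patches are returned in row-major order, each one as a
    flat list of pixel values.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A toy 4x4 "image" split into 2x2 patches yields 4 vectors of length 4.
toy_image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
toy_patches = patchify(toy_image, patch_size=2)
print(len(toy_patches), toy_patches[0])  # 4 [0, 1, 4, 5]
```

The number of patches, and therefore the number of image tokens, grows with the image area, which is why resolution matters for cost.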

Mental model

Think of the model as a text model with an extra mouth that eats images. Whatever the mouth swallows gets turned into tokens that line up next to your written prompt. The model then attends across all of them and writes a response.

image  -> vision encoder -> image tokens
text   -> tokenizer      -> text tokens
                   \          /
                    v        v
                 shared transformer
                        |
                        v
                      output

Multimodal input flow

Hands-on example

Most APIs accept images as URLs or as base64-encoded data, placed alongside text in a list of content blocks. The example below uses the Anthropic Python SDK.

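from anthropic import Anthropic
import base64

# The client reads the API key from the ANTHROPIC_API_KEY environment variable.
client = Anthropic()

# Read the image and base64-encode it for the request body.
with open("invoice.png", "rb") as f:
    data = base64.standard_b64encode(f.read()).decode()

# Send one user message containing an image block followed by a text block.
resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64",
                "media_type": "image/png", "data": data}},
            {"type": "text", "text": "Extract vendor, total, and date as JSON."},
        ],
    }],
)

# The reply is a list of content blocks; the first one holds the text.
print(resp.content[0].text)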
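
If you send more than one file type, hard-coding "image/png" becomes fragile. The sketch below builds the same kind of image block as the example above, but guesses the media type from the file extension. The helper name image_block_from_file is hypothetical, and it uses only the standard library.

```python
import base64
import mimetypes
from pathlib import Path


def image_block_from_file(path):
    """Build a base64 image content block shaped like the one in the example above."""
    # Guess the media type from the file extension, e.g. ".png" -> "image/png".
    media_type, _ = mimetypes.guess_type(str(path))
    if media_type is None or not media_type.startswith("image/"):
        raise ValueError(f"cannot determine an image media type for {path}")
    data = base64.standard_b64encode(Path(path).read_bytes()).decode("ascii")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }
```

The returned dictionary can be placed in the "content" list next to a text block. Check your provider's documentation for the image formats it actually accepts, since guessing a media type does not guarantee the format is supported.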

Use cases that work well today include document understanding (invoices, IDs, forms), chart reading, UI screenshots for QA, accessibility descriptions, and visual question answering. Audio support follows the same pattern: send a clip, ask a question, get text back.

Trade-offs

Image inputs cost more than text. A single high-resolution image can be billed as the equivalent of a couple of thousand text tokens. Resize before sending if you do not need pixel-level detail.
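
For a rough sense of scale, the sketch below estimates image tokens from pixel dimensions. It assumes the rule of thumb given in Anthropic's vision documentation, tokens ≈ (width × height) / 750. Other providers count image tokens differently, and the estimate ignores any resizing the provider applies on its side, so treat the output as a ballpark figure only. The function name is hypothetical.

```python
def estimate_image_tokens(width, height, pixels_per_token=750):
    """Approximate the token cost of an image from its pixel dimensions.

    The default of 750 pixels per token is an assumption taken from one
    provider's published rule of thumb; adjust it for your provider.
    """
    return round(width * height / pixels_per_token)


# A 1500x1000 image works out to about two thousand tokens under this rule.
print(estimate_image_tokens(1500, 1000))  # 2000
```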

Latency is higher than for pure text. Encoding the image adds processing time that depends on the image, not on how short the question is. If you have many images and do not need each answer immediately, batch them.

Spatial precision varies. Models are good at reading text in images and recognizing common objects, but they struggle with exact pixel coordinates or fine-grained measurements. If you need to find a button at a precise location, pair the model with a detector or OCR step.

Audio support is improving fast but still trails text in instruction following. For high-stakes transcription tasks, a specialized speech model often beats a general multimodal one.

Practical tips

Always say what you want extracted. Vague prompts on visual content get vague answers. Asking for “the total at the bottom of the invoice as a number” outperforms “what is the total.”

Validate structured outputs. If you ask for JSON, parse it and check the schema. Multimodal outputs are no more reliable than text outputs in this regard.
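
A minimal validation sketch for the invoice example, using only the standard library. The field names come from the prompt used earlier (vendor, total, date); the expected types and the function name parse_invoice_json are assumptions for illustration.

```python
import json

# Assumed schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"vendor": str, "total": (int, float), "date": str}


def parse_invoice_json(raw_text):
    """Parse model output as JSON and check it against REQUIRED_FIELDS."""
    try:
        payload = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object at the top level")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in payload:
            raise ValueError(f"missing field: {field}")
        value = payload[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(f"field {field!r} has unexpected type {type(value).__name__}")
    return payload


invoice = parse_invoice_json('{"vendor": "Acme", "total": 129.5, "date": "2024-03-01"}')
print(invoice["total"])  # 129.5
```

On a ValueError you can retry the request, fall back to a stricter prompt, or flag the document for manual review.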

Downsample before sending. For most document tasks, 1024 pixels on the long edge is enough. Sending the original 4K image just burns tokens.
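
The target size is easy to compute before handing the image to whatever imaging library you use. This sketch only does the arithmetic: it scales the dimensions so the long edge is at most 1024 pixels and keeps the aspect ratio. The function name fit_long_edge is hypothetical, and the actual pixel resampling is left to your imaging library.

```python
def fit_long_edge(width, height, max_long_edge=1024):
    """Return (width, height) scaled so the longer side is at most max_long_edge.

    Images that already fit are returned unchanged; this never upscales.
    """
    long_edge = max(width, height)
    if long_edge <= max_long_edge:
        return width, height
    scale = max_long_edge / long_edge
    # Clamp to at least 1 pixel so extreme aspect ratios stay valid.
    return max(1, round(width * scale)), max(1, round(height * scale))


# A 4K frame (3840x2160) shrinks to 1024x576.
print(fit_long_edge(3840, 2160))  # (1024, 576)
```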

Use a smaller model for triage. A cheap model can decide whether an image even needs the expensive one. For example, classify pages as “form” or “free text” first, then route only the form pages to a stronger model.
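
The routing idea can be sketched without committing to any particular model. In the code below, classify_page, handle_form, and handle_free_text are hypothetical callables that you would supply: the classifier would wrap a call to a small, cheap model, and handle_form would wrap the stronger one. Stub functions stand in for them here so the sketch runs on its own.

```python
def route_pages(pages, classify_page, handle_form, handle_free_text):
    """Send each page to a handler chosen by a cheap classification step."""
    results = []
    for page in pages:
        label = classify_page(page)  # expected to return "form" or "free text"
        if label == "form":
            results.append(handle_form(page))  # the expensive path
        else:
            results.append(handle_free_text(page))  # the cheap path
    return results


# Stubs for illustration only; real versions would call models.
def stub_classifier(page):
    return "form" if page.endswith("_form.png") else "free text"


routed = route_pages(
    ["p1_form.png", "p2_letter.png"],
    classify_page=stub_classifier,
    handle_form=lambda page: ("strong-model", page),
    handle_free_text=lambda page: ("cheap-model", page),
)
print(routed)  # [('strong-model', 'p1_form.png'), ('cheap-model', 'p2_letter.png')]
```

Keeping the classifier and handlers as parameters makes it straightforward to swap models or to log how many pages take the expensive path.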

Be careful with private images. The data goes to the provider unless you self-host. Treat images the same as you would any sensitive payload, including PII redaction policies and retention agreements.

Wrap-up

Multimodal models extend the LLM workflow to images, audio, and video by turning each modality into tokens the transformer already knows how to handle. The patterns you learned for text generation carry over, with a few new considerations around cost, latency, and precision. Start with a focused task like document extraction and grow from there.