
Gemini API Multimodal: Vision and Video Processing Guide

If you have ever tried to send a PDF, a screen recording, or a 30-minute meeting video to an LLM and watched the request fail with a token limit error, you already know why the Gemini API multimodal stack matters. Google’s Gemini 2.5 family ingests images, video, audio, and PDFs natively in a 1-million-token context window, and the SDK exposes a File API that handles uploads up to 2 GB without you stitching anything together. This guide walks through the patterns that actually hold up in production: extracting structured data from images, transcribing and tagging video, choosing between inline bytes and the File API, and avoiding the token-count surprises that wreck cost forecasts. Read on if you are building a tool that needs to see, watch, or read its inputs, and you want code that runs the first time.

What Makes the Gemini API Multimodal Different

The Gemini API multimodal interface accepts text, images, video, audio, and PDFs through one generateContent call, without separate endpoints or model variants for each format. Unlike OpenAI’s vision input, which caps inline images at roughly 20 MB per request, Gemini exposes a File API that stores uploads for 48 hours and lets you reference them by URI across multiple prompts. Video is treated as a first-class input: Gemini samples one frame per second and tokenizes the audio track alongside the frames, so a 10-minute clip costs roughly 158,000 tokens on a default ingest.

Three model tiers cover most workloads. For long-context tasks (full feature films, hour-long meetings, multi-PDF retrieval), reach for Gemini 2.5 Pro. The Flash tier gives you near-Pro accuracy on images and short video at a fraction of the cost. Meanwhile, Flash-Lite is the cheapest option for high-volume image tagging where latency matters more than depth.

Setting Up the Gemini SDK

Install the current SDK, google-genai, rather than the older google-generativeai package, which is in maintenance mode and scheduled for deprecation in 2026.

pip install google-genai pillow

Grab a key from Google AI Studio and export it. The SDK reads GEMINI_API_KEY by default.

export GEMINI_API_KEY="your-key-here"

A minimal text call confirms the install works before you start sending media. Notably, the Client object is the entry point for every multimodal feature in this guide.

from google import genai

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Reply with the single word OK."
)

print(response.text)

If you get an authentication error, the most common cause is a stale key from the older Vertex AI flow. Generate a fresh AI Studio key and try again. For production workloads on Google Cloud, you can swap to Vertex AI auth by passing vertexai=True to the Client, but stick with the AI Studio path for this tutorial.
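
If you do move to Vertex AI later, the switch is confined to client construction. A minimal sketch, assuming a Google Cloud project with the Vertex AI API enabled (the project ID and region below are placeholders):

from google import genai

# Vertex AI auth instead of an AI Studio key; relies on standard
# Google Cloud application default credentials.
client = genai.Client(
    vertexai=True,
    project="your-gcp-project-id",
    location="us-central1",
)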

Processing Images with the Gemini API

Images can be passed three ways: inline bytes (best for files under 20 MB), a PIL Image object (convenient when you already have one in memory), or a File API reference (mandatory for files over 20 MB or shared across requests). Most production code uses inline bytes for one-shot requests and the File API when the same image feeds several prompts.

The simplest form sends raw bytes with a MIME type:

from google import genai
from google.genai import types

client = genai.Client()

with open("invoice.png", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        "Extract the invoice number, vendor name, and total amount."
    ]
)

print(response.text)

A few details matter here. First, the order of parts is significant: placing the image before the text typically gives slightly better grounding than the reverse, especially on documents with dense layouts. Second, Gemini charges 258 tokens per image tile; an image that fits within 384×384 pixels costs a flat 258 tokens, while larger images are cropped into multiple tiles, each billed at 258 tokens, so high-resolution photos cost several times that before your prompt counts at all. Third, the model accepts PNG, JPEG, WEBP, HEIC, and HEIF without conversion.
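
The second option, passing a PIL Image object directly, is convenient when the image is already in memory, for example after downscaling it to keep tile costs down. A minimal sketch, assuming the client from earlier (the resize target is illustrative):

from PIL import Image

# Downscale before sending; fewer tiles means fewer tokens billed.
img = Image.open("invoice.png")
img.thumbnail((1024, 1024))

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[img, "Extract the invoice number, vendor name, and total amount."]
)
print(response.text)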

For multiple images in one prompt, just add more parts. This pattern works well for comparison tasks like spot-the-difference, before/after analysis, or picking a winner from product photos.

import pathlib

def load_image(path: str) -> types.Part:
    return types.Part.from_bytes(
        data=pathlib.Path(path).read_bytes(),
        mime_type="image/jpeg"
    )

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        load_image("kitchen_before.jpg"),
        load_image("kitchen_after.jpg"),
        "Compare these two kitchen photos. List five concrete changes "
        "and rank them by visual impact from highest to lowest."
    ]
)

For OCR-heavy work — receipts, forms, screenshots of legacy systems — Gemini 2.5 Flash matches GPT-4o on most public benchmarks and costs roughly a third as much. However, for hand-written content or low-contrast scans, Pro still pulls ahead by a meaningful margin.

Working with Video Files

Video is where Gemini’s multimodal pricing gets interesting. The default ingest samples one frame per second and tokenizes the accompanying audio, which works out to roughly 263 tokens per second of footage once you include the per-frame visual encoding. A 10-minute meeting recording lands around 158,000 tokens, well within Pro’s 1M context, but you should know the number before you build a SaaS product on top of it.

Small videos (under 20 MB and roughly under a minute) can ride inline:

with open("clip.mp4", "rb") as f:
    video_bytes = f.read()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        types.Part.from_bytes(data=video_bytes, mime_type="video/mp4"),
        "Summarize this clip in three bullet points. "
        "Include any text shown on screen."
    ]
)

For anything larger, you must use the File API. Uploads are async; you wait for the file to reach ACTIVE state before referencing it in a prompt.

import time
from google import genai

client = genai.Client()

uploaded = client.files.upload(file="meeting.mp4")

while uploaded.state.name == "PROCESSING":
    time.sleep(2)
    uploaded = client.files.get(name=uploaded.name)

if uploaded.state.name != "ACTIVE":
    raise RuntimeError(f"File upload failed: {uploaded.state.name}")

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        uploaded,
        "Produce a meeting summary with: attendees mentioned, "
        "decisions made, action items with owners, and unresolved questions. "
        "Cite timestamps in MM:SS format for each action item."
    ]
)

print(response.text)
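
If you want to confirm the cost math before committing to a pricing model, you can ask the API to count tokens for the uploaded file without generating anything. A quick sketch, reusing the uploaded reference from above:

token_info = client.models.count_tokens(
    model="gemini-2.5-pro",
    contents=[uploaded]
)
print(token_info.total_tokens)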

The timestamp-citing trick is one of the most useful patterns in production. Gemini knows the frame rate at which it sampled and produces accurate timestamps that map back to your source video. Furthermore, you can ask for transcripts with speaker labels and get them aligned to the second.
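
A transcription prompt against the same uploaded file looks like the sketch below; the speaker labels are inferred by the model, so treat them as best-effort rather than ground truth.

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        uploaded,
        "Transcribe the audio. Label each speaker as Speaker 1, Speaker 2, and so on, "
        "and prefix every utterance with its start time in MM:SS format."
    ]
)
print(response.text)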

For YouTube specifically, Gemini accepts URLs directly — no download required:

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=types.Content(
        parts=[
            types.Part(file_data=types.FileData(
                file_uri="https://www.youtube.com/watch?v=...",
                mime_type="video/mp4"
            )),
            types.Part(text="List the five key takeaways with timestamps.")
        ]
    )
)

This works only for public, unlisted-with-direct-link, or your own videos. Private and members-only content returns a permissions error.

Controlling Video Sampling Rate

The default 1 fps sampling is fine for talking-head videos but wastes tokens on static slides and underweights fast-moving content (sports, gameplay, surveillance). Use video_metadata to override the rate:

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=types.Content(parts=[
        types.Part(
            file_data=types.FileData(
                file_uri=uploaded.uri,
                mime_type="video/mp4"
            ),
            video_metadata=types.VideoMetadata(
                start_offset="0s",
                end_offset="120s",
                fps=0.5  # one frame every two seconds
            )
        ),
        types.Part(text="Describe the visual changes between scenes.")
    ])
)

For slide decks recorded as video, dropping fps to 0.2 (one frame every five seconds) typically cuts token cost by 80% without missing content. In contrast, for action sequences, push fps to 5 — the maximum the API supports — to catch fast transitions.

Using the File API for Large Media

The File API is mandatory for anything over 20 MB and useful for any media you plan to reference more than once. Files persist for 48 hours, and the same URI can feed multiple generateContent calls without re-uploading. This is the right pattern for batch analysis (run five different prompts against the same video) or interactive sessions (let a user ask follow-up questions about a document).

def upload_and_wait(client: genai.Client, path: str, poll_seconds: int = 2):
    """Upload a file and block until it's ACTIVE or FAILED."""
    uploaded = client.files.upload(file=path)
    while uploaded.state.name == "PROCESSING":
        time.sleep(poll_seconds)
        uploaded = client.files.get(name=uploaded.name)
    if uploaded.state.name != "ACTIVE":
        raise RuntimeError(
            f"Upload {path} failed: state={uploaded.state.name}"
        )
    return uploaded
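
With the helper in place, the reuse pattern is straightforward: upload once, then run as many prompts as you need against the same reference. The prompts below are illustrative.

video = upload_and_wait(client, "meeting.mp4")

prompts = [
    "List every decision made in this meeting.",
    "List the action items with owners and MM:SS timestamps.",
    "Summarize the questions that were raised but not resolved.",
]

for prompt in prompts:
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[video, prompt]
    )
    print(response.text)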

You can list, fetch, and delete uploaded files explicitly. Deleting is good hygiene if you handle sensitive content — files auto-expire after 48 hours, but explicit deletes shrink the window of exposure.

for f in client.files.list():
    print(f.name, f.mime_type, f.size_bytes)

client.files.delete(name=uploaded.name)

PDFs deserve a special mention here. Gemini treats each PDF page as both an image (for layout, charts, diagrams) and as extracted text, so a page costs roughly 258 tokens in visual encoding plus its text tokens. A 50-page document with charts typically lands around 30,000 tokens, which is dramatically cheaper than chunking it into a vector store for most question-answering tasks.
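
For PDFs above the 20 MB inline limit, the same upload helper applies. A sketch, with a placeholder file name and question:

report = upload_and_wait(client, "annual_report.pdf")

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        report,
        "Which three sections of this report contain charts, and what does each chart show?"
    ]
)
print(response.text)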

Structured Output from Visual Inputs

Free-form text from a vision model is fine for chat interfaces but useless for pipelines. Gemini supports structured outputs via JSON schema, which you should reach for whenever you plan to parse the response programmatically. The pattern combines vision input with a response_schema:

from pydantic import BaseModel

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float

class Invoice(BaseModel):
    invoice_number: str
    vendor_name: str
    invoice_date: str
    line_items: list[LineItem]
    subtotal: float
    tax: float
    total: float

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        types.Part.from_bytes(
            data=open("invoice.pdf", "rb").read(),
            mime_type="application/pdf"
        ),
        "Extract every field from this invoice."
    ],
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=Invoice
    )
)

invoice = Invoice.model_validate_json(response.text)
print(f"{invoice.vendor_name}: ${invoice.total}")

This approach beats prompting for JSON by example. The Pydantic schema becomes a hard contract — the model cannot return malformed JSON or extra fields. For a deeper dive into the underlying technique, see our guide on building production pipelines with OpenAI structured outputs, which covers the same idea in a different SDK.

Real-World Scenario: Receipt Processing for an Expense Tool

A small B2B SaaS team building an expense management tool needs to extract structured data from receipt photos submitted via mobile. Their first attempt used Tesseract OCR followed by a regex-based parser, which worked on roughly 60% of receipts. The remaining 40% — thermal-printed receipts that smudged, photos taken at angles, receipts in languages other than English — required manual cleanup that consumed two support hours per day.

Switching to the Gemini API multimodal pipeline produced a measurable change. They use Gemini 2.5 Flash for the first pass with a strict Pydantic schema covering merchant, date, line items, tax, and total. The model returns structured JSON in roughly 1.5 seconds per receipt and handles smudged thermal prints, angled photos, and non-English receipts in the same code path. For receipts where the model marks any field as null, they fall back to Gemini 2.5 Pro, which clears most of the remaining cases. Approximately 5% of receipts still need human review — typically those where the photo cuts off the total or shows physical damage to the paper. The team reports their manual cleanup time dropped enough that a single support agent now handles what previously needed two, and the cost per receipt is well under a cent at Flash pricing.

The pattern generalizes: Flash for volume, Pro for fallback, human review for the residual. This three-tier funnel keeps unit economics tight while protecting accuracy on edge cases.
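
A minimal sketch of that funnel, assuming the client and types objects from earlier and a simplified Receipt schema; the field names and the review handling are illustrative, not the team’s actual code:

from pydantic import BaseModel

class Receipt(BaseModel):
    merchant: str | None
    date: str | None
    tax: float | None
    total: float | None

def extract_receipt(image_bytes: bytes) -> tuple[Receipt, str]:
    """Flash first, Pro as fallback, human review for the residual."""
    for model in ("gemini-2.5-flash", "gemini-2.5-pro"):
        response = client.models.generate_content(
            model=model,
            contents=[
                types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
                "Extract the merchant, date, tax, and total from this receipt. "
                "Use null for any field you cannot read."
            ],
            config=types.GenerateContentConfig(
                response_mime_type="application/json",
                response_schema=Receipt
            )
        )
        receipt = Receipt.model_validate_json(response.text)
        # Accept the first tier that fills every field; otherwise escalate.
        if None not in (receipt.merchant, receipt.date, receipt.tax, receipt.total):
            return receipt, model
    return receipt, "needs_human_review"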

When to Use the Gemini API Multimodal Features

  • You need to process video longer than a few minutes — the 1M context and native video ingest beat alternatives that require manual frame extraction
  • Your inputs are mixed (PDF plus image plus text in one request) and you want a single API call instead of orchestrating multiple services
  • Cost matters: Flash pricing on vision tasks runs significantly cheaper than comparable GPT-4o calls for OCR and document extraction
  • You need timestamps in video summaries — Gemini’s frame-aligned outputs are accurate without extra work
  • The use case involves YouTube content, where direct URL ingestion saves a download step

When the Gemini API Multimodal Stack Is the Wrong Tool

  • Your application needs sub-100ms latency per image (Flash averages 800ms-1.5s; consider Cloud Vision API or a self-hosted CLIP model)
  • You require 100% deterministic output for compliance reasons — even with structured schemas, LLMs can hallucinate field values on degraded inputs
  • Total monthly volume sits below 1,000 calls and you already have an OpenAI account; the integration effort may not pay back
  • You’re processing personally identifiable medical imagery without a BAA — use Vertex AI on Google Cloud, not the AI Studio endpoint
  • Your media exceeds 2 GB per file or 50 files in active state — you’ll need to chunk or move to streaming pipelines

Common Mistakes with the Gemini API Multimodal Stack

  • Sending large files as base64-encoded inline data instead of using the File API, which doubles your bandwidth and burns latency on every retry
  • Forgetting to poll for the ACTIVE state after upload — referencing a PROCESSING file returns an error that looks like a bad path
  • Not setting response_schema when the downstream code parses JSON, then debugging trailing markdown fences in production
  • Sampling video at the default 1 fps for slide decks, paying 5x more in tokens than necessary
  • Mixing AI Studio keys with Vertex AI endpoints — they look identical in code but use different auth paths
  • Trusting model output on the first call without a confidence pass; build a fallback to a stronger model for low-confidence cases
  • Sending hand-written content to Flash and assuming OCR accuracy will match printed text; Pro is required for messy handwriting

Where to Take This Next

The Gemini API multimodal stack lets you collapse what used to be separate OCR, transcription, and vision pipelines into a single SDK call with a JSON schema attached. Start with Gemini 2.5 Flash for image tasks, move to Pro for video over five minutes or anything with handwritten content, and always use the File API for media you’ll reference more than once. The combination of a 1M-token context, native video ingest, and YouTube URL support genuinely changes what’s practical to build solo or with a small team.

For your next step, see how function calling layers on top of LLM inputs in our guide on Claude tool use patterns — the same concept maps directly to Gemini’s tool-calling API. If you are evaluating providers side by side, our deep dive on building apps with the OpenAI API covers the equivalent vision endpoints, and getting started with the Claude API covers Anthropic’s approach to images and PDFs. To squeeze more accuracy out of any multimodal request, the techniques in our prompt engineering best practices guide apply directly to the prompt-text portion of these requests.
