
If you are running thousands of LLM calls for classification, embedding generation, content tagging, or evaluation runs, the OpenAI Batch API is the single highest-leverage cost optimization available today. It runs the same models at exactly half the price of the synchronous API, in exchange for an asynchronous SLA of up to 24 hours. For workloads that do not need a response within seconds, that trade is almost always worth taking.
This tutorial walks through everything you need to ship batch jobs to production: how the workflow differs from the standard Chat Completions endpoint, how to format the JSONL request file, how to submit and poll jobs in Python, how to parse mixed success/error results, and the operational pitfalls that bite teams the first time they migrate from synchronous calls.
By the end, you will have a working pipeline that takes 10,000 prompts, ships them to the batch endpoint, retrieves the results, and reconciles failures, all at half the per-token cost you are paying today.
What Is the OpenAI Batch API?
The OpenAI Batch API is an asynchronous endpoint that accepts a JSONL file of API requests, processes them within 24 hours, and returns a JSONL file of responses at a 50% discount on both input and output tokens. It supports /v1/chat/completions, /v1/embeddings, /v1/completions, and /v1/responses requests, with a separate rate-limit pool that is dramatically higher than the synchronous tier.
In contrast to the standard API, where you send one request and block on the HTTP response, batch jobs decouple submission from retrieval. You upload a file, get back a batch_id, poll for status, and then download the completed output file. Because the work runs in OpenAI’s idle compute windows, the discount is structural rather than promotional.
The 50% reduction applies to every supported model, from gpt-4o-mini to o3. For a team paying $3,000/month on synchronous classification, the same workload through batch costs $1,500 with no quality difference whatsoever.
How the Batch API Actually Works
The end-to-end flow has four steps:
- Build a JSONL file where each line is a self-contained API request with a unique custom_id
- Upload the file to the Files API with purpose="batch"
- Create a batch referencing the uploaded file, the endpoint, and a 24-hour completion window
- Poll the batch status until it reaches completed, then download the output file
Each line in the request file is independent. Failures on one row do not cancel the rest of the batch. Furthermore, the output file preserves the custom_id of each input row, which means you can reconcile responses back to your application records without relying on order.
Under current limits, a single batch supports up to 50,000 requests and a 200 MB input file. For larger workloads, split the work into multiple batches and submit them in parallel — they share a separate batch-only rate-limit bucket, so synchronous traffic is not affected.
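The split itself is mechanical. A minimal sketch, assuming each request is already a dict in the JSONL shape described later in this tutorial:

import json
from pathlib import Path

def split_into_batch_files(
    requests: list[dict],
    output_dir: Path,
    max_per_batch: int = 50_000,
) -> list[Path]:
    """Write one JSONL file per chunk of at most max_per_batch requests."""
    output_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(0, len(requests), max_per_batch):
        chunk = requests[i : i + max_per_batch]
        path = output_dir / f"batch_part_{i // max_per_batch:03d}.jsonl"
        with path.open("w", encoding="utf-8") as f:
            for request in chunk:
                f.write(json.dumps(request, ensure_ascii=False) + "\n")
        paths.append(path)
    return paths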
When to Use the OpenAI Batch API
- You have thousands of independent LLM calls with no real-time UX dependency (overnight classification, weekly evaluation runs, historical reprocessing)
- You are generating embeddings for a large corpus before initial indexing
- You are running offline content moderation, tagging, or enrichment pipelines
- You need to absorb a spike in volume without raising your synchronous rate limit
- You are running model-evaluation suites where wall-clock time does not matter
When NOT to Use the OpenAI Batch API
- Any user-facing request that expects a response in under a minute (chatbots, search, autocomplete)
- Interactive workflows where the next step depends on the LLM output within the same session
- Streaming responses — batch returns final completions only, not deltas
- Very small jobs (under ~100 requests) where the operational overhead outweighs the cost savings
- Time-sensitive analysis like incident triage or live customer support routing
Common Mistakes with the OpenAI Batch API
- Using a non-unique custom_id across rows, which makes result reconciliation ambiguous
- Forgetting that the 24-hour window is a maximum SLA, not a guaranteed time — plan for variability
- Treating the batch endpoint as synchronous and polling every second instead of every few minutes
- Not handling partial failures — assuming a completed batch means every row succeeded
- Mixing different models in a single batch when a per-row override is what you actually need
- Hardcoding output file IDs instead of fetching them from the batch metadata after completion
Setting Up the OpenAI Python Client
Before writing any batch code, you need the OpenAI Python SDK and an API key with batch access enabled. Batch access is available on the standard tier — there is no special application — but new accounts may need to add a payment method first.
pip install "openai>=1.40.0"
export OPENAI_API_KEY="sk-..."
Then verify the client connects:
from openai import OpenAI
client = OpenAI()
models = client.models.list()
print(f"Connected. {len(models.data)} models available.")
If this prints a model count, you are ready. If you get a 401, double-check the environment variable. For production, store the key in a secrets manager rather than a shell profile — see API security checklist for the patterns that hold up under audit.
Building the JSONL Request File
The batch endpoint reads a JSON Lines file where each line is a complete request object with three required fields: custom_id, method, and url, plus a body that mirrors the synchronous API payload.
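A single request row, built from those fields, looks like this (in the actual file it is one physical line):

{"custom_id": "review-00001", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o-mini", "messages": [{"role": "system", "content": "You classify product reviews."}, {"role": "user", "content": "Battery died after a week."}], "max_tokens": 500}}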
Here is a function that turns a list of prompts into a valid batch file:
import json
from pathlib import Path
from typing import Iterable
def build_batch_file(
prompts: Iterable[dict],
output_path: Path,
model: str = "gpt-4o-mini",
endpoint: str = "/v1/chat/completions",
) -> Path:
"""
Build a JSONL request file for the OpenAI Batch API.
Each prompt dict must contain:
- id: unique string used as custom_id
- system: system prompt string
- user: user prompt string
"""
with output_path.open("w", encoding="utf-8") as f:
for prompt in prompts:
request = {
"custom_id": prompt["id"],
"method": "POST",
"url": endpoint,
"body": {
"model": model,
"messages": [
{"role": "system", "content": prompt["system"]},
{"role": "user", "content": prompt["user"]},
],
"max_tokens": 500,
"temperature": 0,
},
}
f.write(json.dumps(request, ensure_ascii=False) + "\n")
return output_path
Why this works: The custom_id is the only field that ties a response back to your application record, so it must be unique within the batch. Furthermore, the body is exactly what you would send to the synchronous endpoint — anything that works there works here, including tool calls, JSON mode, and structured outputs. If you have not used schema-enforced output before, OpenAI structured outputs is the right place to start before adding it to batch jobs.
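For example, a per-row body that enforces a JSON schema would look like the sketch below; the schema name and label set are illustrative, not part of the pipeline above:

body = {
    "model": "gpt-4o-mini",
    "messages": [
        {"role": "system", "content": "Classify the review."},
        {"role": "user", "content": "Stopped charging after a week."},
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "review_classification",  # illustrative schema name
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "sentiment": {
                        "type": "string",
                        "enum": ["positive", "neutral", "negative"],
                    },
                },
                "required": ["sentiment"],
                "additionalProperties": False,
            },
        },
    },
}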
A sample input list looks like this:
prompts = [
{
"id": "review-00001",
"system": "You classify product reviews as positive, neutral, or negative. Reply with one word.",
"user": "Battery lasted 2 hours less than advertised but the screen is gorgeous.",
},
{
"id": "review-00002",
"system": "You classify product reviews as positive, neutral, or negative. Reply with one word.",
"user": "Returned it. Stopped charging after a week.",
},
]
build_batch_file(prompts, Path("batch_input.jsonl"))
Submitting the Batch Job
With the JSONL file in hand, submission is a two-step process: upload the file, then create the batch.
from openai import OpenAI
from pathlib import Path
def submit_batch(
client: OpenAI,
input_file: Path,
description: str,
endpoint: str = "/v1/chat/completions",
) -> str:
"""Upload a JSONL file and create a batch job. Returns the batch ID."""
uploaded = client.files.create(
file=input_file.open("rb"),
purpose="batch",
)
batch = client.batches.create(
input_file_id=uploaded.id,
endpoint=endpoint,
completion_window="24h",
metadata={"description": description},
)
return batch.id
Why the metadata field matters: In a production pipeline you will likely have dozens of in-flight batches at once. The metadata field is searchable and shows up in the dashboard, so use it to encode the originating job ID, environment, and version; on-call engineers can then trace a stuck batch back to its source.
Calling the wrapper from a nightly job looks like this:
batch_id = submit_batch(
client,
Path("batch_input.jsonl"),
description="nightly_review_classification_v3",
)
print(f"Submitted batch {batch_id}")
Polling the Batch Status
Batch jobs progress through several states: validating, in_progress, finalizing, completed, failed, expired, and cancelled. The right polling strategy depends on workload size, but a sensible default is exponential backoff capped at five minutes.
import time
from openai import OpenAI
TERMINAL_STATES = {"completed", "failed", "expired", "cancelled"}
def wait_for_batch(
client: OpenAI,
batch_id: str,
initial_delay: int = 30,
max_delay: int = 300,
) -> dict:
"""
Poll a batch until it reaches a terminal state.
Returns the final batch object as a dict.
"""
delay = initial_delay
while True:
batch = client.batches.retrieve(batch_id)
status = batch.status
counts = batch.request_counts
print(
f"[{batch_id}] status={status} "
f"completed={counts.completed}/{counts.total} "
f"failed={counts.failed}"
)
if status in TERMINAL_STATES:
return batch.model_dump()
time.sleep(delay)
delay = min(delay * 2, max_delay)
Why exponential backoff: Batches under 1,000 requests often finish within a few minutes, but larger jobs can take hours. Polling every second wastes API quota on no-op retrieve calls and trips rate limits during normal operation. Starting at 30 seconds and doubling to a 5-minute ceiling keeps the observability acceptable without burning calls.
For long-running batches in production, keeping a worker alive just to poll is the wrong pattern. At the time of writing, OpenAI does not provide a native webhook for batches, so a lightweight Cloud Function or scheduled Lambda that checks status every five minutes and writes to a queue is the standard alternative.
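A minimal sketch of that scheduled check, reusing TERMINAL_STATES from above; enqueue_result_processing is a hypothetical stand-in for whatever hands completed batches to the rest of your pipeline:

def check_pending_batches(client: OpenAI, pending_batch_ids: list[str]) -> list[str]:
    """Run on a schedule (e.g. every 5 minutes). Returns batch IDs still pending."""
    still_pending = []
    for batch_id in pending_batch_ids:
        batch = client.batches.retrieve(batch_id)
        if batch.status in TERMINAL_STATES:
            enqueue_result_processing(batch_id, batch.status)  # hypothetical queue writer
        else:
            still_pending.append(batch_id)
    return still_pending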
Retrieving and Parsing the Results
When a batch reaches completed, two file IDs become available on the batch object: output_file_id for successful responses and error_file_id for rows that failed validation or hit per-request errors. Both are JSONL files keyed by custom_id.
import json
from pathlib import Path
from openai import OpenAI
def download_batch_results(
client: OpenAI,
batch_id: str,
output_dir: Path,
) -> dict[str, Path]:
"""
Download output and error files for a completed batch.
Returns dict mapping 'output'/'errors' to local file paths.
"""
batch = client.batches.retrieve(batch_id)
output_dir.mkdir(parents=True, exist_ok=True)
paths = {}
if batch.output_file_id:
content = client.files.content(batch.output_file_id)
output_path = output_dir / f"{batch_id}_output.jsonl"
output_path.write_bytes(content.read())
paths["output"] = output_path
if batch.error_file_id:
content = client.files.content(batch.error_file_id)
error_path = output_dir / f"{batch_id}_errors.jsonl"
error_path.write_bytes(content.read())
paths["errors"] = error_path
return paths
Parsing the output file gives you back the original custom_id and the full Chat Completions response:
def parse_batch_output(output_path: Path) -> dict[str, str]:
"""Return mapping of custom_id -> assistant message content."""
results = {}
with output_path.open("r", encoding="utf-8") as f:
for line in f:
row = json.loads(line)
custom_id = row["custom_id"]
response = row.get("response")
if response and response.get("status_code") == 200:
content = response["body"]["choices"][0]["message"]["content"]
results[custom_id] = content
else:
results[custom_id] = None
return results
Why check status_code per row: A batch can be completed overall even when individual rows return 400-level errors (malformed messages, content-policy rejections, token-limit breaches). Treating “batch completed” as “every row succeeded” is the number-one production bug teams hit on their first deployment.
Handling Errors and Partial Failures
The error file uses the same custom_id keying but stores failure metadata instead of a model response. A typical row looks like this:
{
"id": "batch_req_abc123",
"custom_id": "review-00042",
"response": null,
"error": {
"code": "context_length_exceeded",
"message": "This model's maximum context length is 16385 tokens."
}
}
A production retry layer should split failures into two buckets: retryable (rate limits, transient 5xx) and terminal (bad request, content policy, context length). Retryable rows go into a follow-up batch; terminal rows go to a dead-letter store for human review.
RETRYABLE_CODES = {"rate_limit_exceeded", "server_error", "timeout"}
def split_errors(error_path: Path) -> tuple[list[str], list[dict]]:
"""Return (retryable_custom_ids, terminal_errors)."""
retryable = []
terminal = []
with error_path.open("r", encoding="utf-8") as f:
for line in f:
row = json.loads(line)
code = row.get("error", {}).get("code", "")
if code in RETRYABLE_CODES:
retryable.append(row["custom_id"])
else:
terminal.append(row)
return retryable, terminal
For applications where retries materially affect cost, combine this with token counting to prevent context-length failures before they hit the batch — it is cheaper to truncate input than to pay for a failed call.
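A minimal pre-flight check with tiktoken might look like the sketch below; it assumes the o200k_base encoding used by the gpt-4o family, so verify the encoding for the model you actually run:

import tiktoken

ENCODING = tiktoken.get_encoding("o200k_base")  # gpt-4o family; check your model's encoding

def truncate_to_token_budget(text: str, max_tokens: int = 12_000) -> str:
    """Truncate input text so the request stays safely under the model's context window."""
    tokens = ENCODING.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return ENCODING.decode(tokens[:max_tokens])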
A Realistic Production Scenario
Consider a mid-sized SaaS team running a product-feedback pipeline. Every night, roughly 30,000 new user reviews land in a Postgres table and need classification (sentiment, topic, urgency) before the morning standup dashboard refreshes.
On the synchronous API with gpt-4o-mini, the workload runs in about 90 minutes with 20 worker threads, costing roughly $18 per night at current input/output token mixes. The team has no need for sub-minute results because the dashboard is consumed at 9 AM and the data arrives by midnight.
After migrating to the Batch API, the same workload submits as a single batch at 12:05 AM, completes by roughly 2 AM in their experience, and costs $9. At $9 saved per night, that is roughly $3,300 a year without any infrastructure change beyond two scheduled jobs (submit + poll). Furthermore, the synchronous rate limit on the customer-facing chatbot is no longer competing with the classification workload, which incidentally fixes an unrelated 429 spike they were seeing during peak hours.
The migration took roughly two days of engineering work: rewriting the worker as a JSONL builder, adding a polling job, and updating the result-reconciliation step to read from a file instead of a queue.
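Stitched together from the functions defined earlier, the nightly pipeline reduces to a submit step and a retrieve step. A sketch, where load_reviews_from_postgres and write_labels_to_postgres are hypothetical stand-ins for the team's own data-access code:

from pathlib import Path

def run_nightly_classification(client: OpenAI) -> None:
    # Submit step (runs at 12:05 AM)
    prompts = load_reviews_from_postgres()  # hypothetical data loader
    input_path = build_batch_file(prompts, Path("batch_input.jsonl"))
    batch_id = submit_batch(client, input_path, description="nightly_review_classification_v3")

    # Retrieve step: poll until terminal, then reconcile results
    final = wait_for_batch(client, batch_id)
    if final["status"] != "completed":
        raise RuntimeError(f"Batch {batch_id} ended in {final['status']}")
    paths = download_batch_results(client, batch_id, Path("results"))
    labels = parse_batch_output(paths["output"])
    write_labels_to_postgres(labels)  # hypothetical reconciliation step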
How the Batch API Compares to Alternatives
| Approach | Cost | Latency | Best For |
|---|---|---|---|
| Synchronous Chat Completions | Full price | Seconds | User-facing real-time UX |
| Batch API | 50% off | Up to 24h | Bulk offline processing |
| Fine-tuning a smaller model | One-time + inference | Seconds | Repeated identical task patterns |
| Prompt caching | Up to 90% off cached tokens | Seconds | Long shared system prompts |
The four are not mutually exclusive. A mature production pipeline often uses three at once: caching for the long system prompt, batch for nightly enrichment, and synchronous for the live chat surface. If you have not enabled prompt caching yet, Anthropic prompt caching covers the equivalent pattern on Claude, and the OpenAI implementation works on the same conceptual basis with automatic detection above 1024 tokens.
For workloads where the comparison is closer, the OpenAI Assistants API vs Chat Completions comparison walks through when the stateful endpoint pays off — and importantly, batch does not currently support the Assistants endpoint, so high-volume Assistants traffic cannot benefit from the discount.
Cancelling and Cleaning Up
A submitted batch can be cancelled with a single API call as long as it has not reached finalizing. Cancellation is best-effort — some requests in flight may still complete and bill — but it prevents new ones from starting.
def cancel_batch(client: OpenAI, batch_id: str) -> str:
"""Cancel a running batch. Returns the final status."""
batch = client.batches.cancel(batch_id)
return batch.status
After a successful run, both input and output files persist on OpenAI’s storage until you delete them. For long-running production pipelines, schedule a weekly cleanup job that lists files older than 30 days with purpose="batch" and removes them:
from datetime import datetime, timedelta, timezone
def purge_old_batch_files(client: OpenAI, days: int = 30) -> int:
"""Delete batch-purpose files older than N days. Returns count deleted."""
cutoff = datetime.now(timezone.utc) - timedelta(days=days)
deleted = 0
for file in client.files.list().data:
if file.purpose != "batch":
continue
created = datetime.fromtimestamp(file.created_at, tz=timezone.utc)
if created < cutoff:
client.files.delete(file.id)
deleted += 1
return deleted
Storage is free at small scale but counts against your organization’s quota at high volume. Most teams discover this only after their batch pipeline has accumulated several gigabytes of stale JSONL.
Monitoring Batch Jobs in Production
The OpenAI dashboard shows batch status, but it is not enough for production observability. At minimum, emit metrics for: submission count per batch, completion latency, per-row success rate (from request_counts on the batch object), and total token cost (summed from the usage field in each row of the output file).
A simple Prometheus-compatible emitter looks like this:
from prometheus_client import Counter, Histogram
batch_submitted = Counter("openai_batch_submitted_total", "Batches submitted")
batch_completed = Counter("openai_batch_completed_total", "Batches completed", ["status"])
batch_duration = Histogram("openai_batch_duration_seconds", "End-to-end batch time")
batch_failures = Counter("openai_batch_row_failures_total", "Per-row failures", ["error_code"])
Wire these into the submission and polling functions, and your existing monitoring stack will alert when a batch stalls longer than expected or when failure rates spike — both early signals of upstream data-quality regressions.
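Wiring them in takes only a few lines. A sketch that wraps the submit_batch and wait_for_batch functions defined earlier:

import time

def submit_batch_with_metrics(client: OpenAI, input_file: Path, description: str) -> str:
    batch_id = submit_batch(client, input_file, description)
    batch_submitted.inc()
    return batch_id

def wait_for_batch_with_metrics(client: OpenAI, batch_id: str) -> dict:
    start = time.monotonic()
    final = wait_for_batch(client, batch_id)
    batch_duration.observe(time.monotonic() - start)
    batch_completed.labels(status=final["status"]).inc()
    return final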
Conclusion
The OpenAI Batch API is the rare cost optimization with no quality trade-off — the models and outputs are the same as you would get from the synchronous API, and the only thing you give up is the ability to wait less than a minute. For any workload that runs in scheduled windows rather than user-facing flows, migrating delivers a clean 50% reduction with a couple of days of engineering work.
Start by picking your highest-volume offline workflow, rewriting the worker to emit a JSONL file, and submitting a single test batch to confirm the output format matches what your downstream code expects. Once the pipeline is in place, the same pattern extends to embeddings, evaluation suites, and content enrichment with no additional infrastructure.
Next, read Building Apps With the OpenAI API for the broader request patterns that compose with batch, or RAG From Scratch if your immediate target is generating embeddings for a large corpus before indexing.