
If your product sends a stream of nearly identical questions to an LLM, you are paying full price for answers you have already generated. Users phrase the same intent a dozen ways: “how do I reset my password,” “reset password,” “I forgot my login.” A traditional key-value cache treats each of these as a distinct request, so it never hits. Semantic caching for LLMs solves exactly this problem by matching on meaning rather than exact text, and it can cut both inference cost and tail latency dramatically on high-repeat workloads.
This guide is for backend and AI engineers running LLM features in production who want to reduce spend without degrading answer quality. You will learn what semantic caching actually is, how it differs from exact-match and prompt caching, how to build a working cache with embeddings and a vector store, and — most importantly — how to tune the similarity threshold so you do not serve wrong answers. The code is production-oriented, and the trade-offs are the ones you will actually hit.
What Is Semantic Caching for LLMs?
Semantic caching for LLMs stores past prompts alongside their responses, then serves a cached answer when a new prompt is semantically similar to a stored one. Instead of hashing the raw text, it embeds each prompt into a vector and compares vectors by cosine similarity. When a new query lands within a similarity threshold of a cached entry, the system returns the stored response and skips the model call entirely.
The key insight is that meaning, not surface text, drives the cache hit. Two prompts with zero shared words can map to nearly the same vector, so they resolve to the same cached answer. That is what makes this technique effective for chatbots, support tools, and FAQ-style assistants where users express the same handful of intents endlessly.
Semantic Caching vs Exact-Match and Prompt Caching
Before building anything, it helps to know where semantic caching sits among the other caching strategies you have probably already met. These are complementary layers, not competitors, and production systems often run all three.
| Cache type | Matches on | Best for | Typical hit condition |
|---|---|---|---|
| Exact-match (key-value) | Byte-identical prompt | Deterministic repeated calls | Same string, same params |
| Prompt caching (provider-side) | Shared prompt prefix | Long, stable system prompts and context | Reused prefix within TTL |
| Semantic caching | Prompt meaning (embedding) | Paraphrased, varied user queries | Similarity above threshold |
Exact-match caching is the classic cache-aside pattern applied to LLM calls, and it is worth reading up on the broader caching strategies like write-through and cache-aside before you layer semantics on top. It hits only when the input is byte-for-byte identical, which is rare for free-text user input.
Provider-side prompt caching is a different mechanism entirely. It caches a prefix of your request — a large system prompt or shared context — so repeated calls reprocess only the changed suffix. Anthropic’s implementation, covered in our guide to Anthropic prompt caching, is a prefix match: any byte change in the cached region invalidates everything after it. That saves input-token cost on stable context, but it does nothing when two users ask the same question in different words. Semantic caching fills that gap.
How Semantic Caching Works: The Core Flow
The runtime path for semantic caching for LLMs follows a consistent sequence. Understanding it makes the failure modes obvious later.
- Embed the incoming prompt into a vector using an embedding model.
- Search the vector store for the nearest stored prompt embedding.
- If the top match’s similarity clears your threshold, return its cached response (a cache hit).
- On a miss, call the LLM as normal, then embed and store the new prompt-response pair.
- Apply a TTL and eviction policy so stale or low-value entries expire.
Notice that every request incurs an embedding call even on a hit. That cost is real but small: embedding a short query is orders of magnitude cheaper and faster than a full generation. The economics only work when your hit rate is high enough that the saved generations outweigh the embedding overhead on misses.
Building a Semantic Cache With Embeddings and a Vector Store
Let us build a working cache. The design uses an embedding model to vectorize prompts and a vector store to hold entries. For clarity the example uses OpenAI embeddings and Redis with vector search, but the pattern is provider-agnostic — swap in any embedding model or vector database from our comparison.
First, the embedding layer. Choosing the right model matters here; our breakdown of OpenAI, Voyage, and Cohere embeddings covers the trade-offs in dimension size, cost, and retrieval quality.
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
EMBED_MODEL = "text-embedding-3-small" # 1536 dims, cheap, good enough for caching
def embed(text: str) -> list[float]:
"""Return the embedding vector for a prompt.
We normalize whitespace so trivial formatting differences do not
perturb the vector. Embedding a short query costs a tiny fraction
of a generation, which is what makes the cache economically viable.
"""
cleaned = " ".join(text.split())
response = client.embeddings.create(model=EMBED_MODEL, input=cleaned)
return response.data[0].embedding
Next, the cache itself. Redis 8 (or Redis Stack) ships vector similarity search, so it can act as both the vector index and the response store in one hop. If you already run Redis, this avoids adding another dependency — and our guide to Redis data structures beyond caching shows how much it can do here.
import json
import numpy as np
import redis
from redis.commands.search.field import VectorField, TextField, NumericField
from redis.commands.search.query import Query
r = redis.Redis(host="localhost", port=6379, decode_responses=False)
INDEX_NAME = "llm_cache_idx"
KEY_PREFIX = "llmcache:"
DIM = 1536
def create_index() -> None:
"""Create the vector index once at startup. Idempotent."""
try:
r.ft(INDEX_NAME).info()
return # already exists
except redis.ResponseError:
pass
schema = (
VectorField(
"embedding",
"HNSW", # approximate NN — fast at scale, tunable recall
{"TYPE": "FLOAT32", "DIM": DIM, "DISTANCE_METRIC": "COSINE"},
),
TextField("prompt"),
TextField("response"),
NumericField("created_at"),
)
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
r.ft(INDEX_NAME).create_index(
schema,
definition=IndexDefinition(prefix=[KEY_PREFIX], index_type=IndexType.HASH),
)
With the index in place, the lookup and store operations complete the cache. The similarity threshold is the single most important number in this whole system, so it lives in one clearly named constant.
import time
SIMILARITY_THRESHOLD = 0.92 # cosine similarity; tune this carefully (see below)
TTL_SECONDS = 60 * 60 * 24 * 7 # one week
def _to_bytes(vec: list[float]) -> bytes:
return np.array(vec, dtype=np.float32).tobytes()
def lookup(prompt: str) -> str | None:
"""Return a cached response for a semantically similar prompt, or None."""
query_vec = _to_bytes(embed(prompt))
q = (
Query("*=>[KNN 1 @embedding $vec AS score]")
.sort_by("score")
.return_fields("response", "score", "prompt")
.dialect(2)
)
results = r.ft(INDEX_NAME).search(q, query_params={"vec": query_vec})
if not results.docs:
return None
top = results.docs[0]
# Redis returns COSINE *distance* (0 = identical); similarity = 1 - distance
similarity = 1.0 - float(top.score)
if similarity >= SIMILARITY_THRESHOLD:
return top.response.decode() if isinstance(top.response, bytes) else top.response
return None
def store(prompt: str, response: str) -> None:
"""Persist a new prompt-response pair with its embedding and a TTL."""
key = f"{KEY_PREFIX}{int(time.time() * 1000)}"
r.hset(
key,
mapping={
"embedding": _to_bytes(embed(prompt)),
"prompt": prompt,
"response": response,
"created_at": int(time.time()),
},
)
r.expire(key, TTL_SECONDS)
Finally, wire the cache around your actual LLM call. This wrapper is the only thing your application code needs to touch.
def cached_completion(prompt: str) -> str:
"""Serve from the semantic cache when possible, else call the LLM."""
hit = lookup(prompt)
if hit is not None:
return hit # cache hit — no generation, no generation cost
completion = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
)
answer = completion.choices[0].message.content
store(prompt, answer)
return answer
That is a complete, working semantic cache in well under a hundred lines. The hard parts are not the code, though — they are the threshold, invalidation, and knowing when the whole approach is a bad idea.
Setting the Similarity Threshold (the Hard Part)
The similarity threshold decides whether two prompts count as “the same question.” Set it too low and the cache serves confidently wrong answers; set it too high and your hit rate collapses to near zero. There is no universal correct value, because it depends on your embedding model, your domain, and how much a wrong answer actually costs you.
Start empirically rather than guessing. Collect a few hundred real prompts, generate their embeddings, and compute pairwise similarities for pairs you know are genuinely equivalent versus genuinely different. For most modern embedding models and short queries, equivalent paraphrases cluster around 0.90 to 0.97 cosine similarity, while unrelated prompts sit well below 0.80. Your threshold belongs in the gap between those two distributions.
The cost asymmetry should shape where you land in that gap. In a support FAQ bot, a slightly-wrong cached answer is annoying but recoverable, so a threshold near 0.90 that maximizes hit rate is reasonable. In anything touching billing, medical, or legal content, a false hit is unacceptable — push the threshold to 0.97 or higher, accept a lower hit rate, and treat the cache as a bonus rather than a load-bearing optimization.
# A quick offline calibration harness. Run this against labeled pairs
# before you pick SIMILARITY_THRESHOLD for production.
def cosine(a: list[float], b: list[float]) -> float:
va, vb = np.array(a), np.array(b)
return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
equivalent_pairs = [
("how do I reset my password", "I forgot my login, help me reset it"),
("what are your business hours", "when are you open"),
]
different_pairs = [
("how do I reset my password", "what is your refund policy"),
]
for a, b in equivalent_pairs:
print(f"EQUIV {cosine(embed(a), embed(b)):.3f} {a!r} / {b!r}")
for a, b in different_pairs:
print(f"DIFFER {cosine(embed(a), embed(b)):.3f} {a!r} / {b!r}")
Run that, look at where the two groups separate, and set your threshold in the safe middle. Re-run it whenever you change embedding models, because the numeric ranges shift between models and a threshold tuned for one is meaningless for another.
Cache Invalidation and TTLs
Cached LLM answers go stale in ways that ordinary data caches do not. If your knowledge base changes — a new pricing page, an updated policy — every cached response derived from the old facts is now wrong, even though the prompts still match perfectly. Semantic caching does not know your source data changed, so you have to manage freshness deliberately.
TTLs are your first line of defense. A one-week expiry, as in the example above, bounds how long a stale answer can circulate. For volatile content, drop the TTL to hours. Additionally, wire cache invalidation into your content-update pipeline: when the underlying data behind a category of answers changes, delete the relevant cache entries rather than waiting for them to expire.
You should also avoid caching anything user-specific or non-deterministic. A prompt like “what’s my account balance” must never be cached, because the same words map to different correct answers per user and per moment. Gate those requests out of the cache entirely with a simple allowlist or a per-prompt “cacheable” flag before lookup is ever called.
Production Concerns: Cost, Latency, and Eviction
The whole point of semantic caching for LLMs is cost, so track the number that matters: net savings, not raw hit rate. Every request pays for an embedding, and every miss pays for embedding plus generation plus a store write. Your break-even hit rate is whatever makes the saved generations exceed the embedding overhead across all requests. Instrument hits, misses, and both cost components so you can prove the cache is actually paying for itself. Pair this with the discipline in our guide to token counting and budget management for LLM apps.
Latency generally improves, but not uniformly. A hit returns in the time of one embedding call plus a vector search — typically tens of milliseconds against Redis — versus seconds for a full generation. A miss, however, is slower than an uncached call, because you paid for the embedding and search before falling through to the model. High hit rates make the average a clear win; low hit rates can make it a net loss on latency too.
Eviction keeps the index from growing without bound. TTLs handle time-based expiry, but for a bounded cache you also want size-based eviction. An LRU or least-frequently-used policy on the cache keys works well, since popular intents should stay resident while one-off queries age out. Redis handles TTL expiry natively; for LRU behavior, either use Redis’s maxmemory-policy or track access counts and prune the coldest entries on a schedule.
A Real-World Scenario
Consider a mid-sized SaaS company running an in-app support assistant that handles a few thousand user questions a day. The team notices that a large share of traffic clusters around a small set of intents — password resets, billing questions, feature availability — each phrased countless ways. Their exact-match cache almost never hits because free-text input is never byte-identical, so nearly every question triggers a full generation.
After adding a semantic cache with a threshold calibrated against a few hundred labeled support tickets, the picture changes. A substantial fraction of questions now resolve to cached answers, which trims the generation bill and drops perceived latency for the most common intents. The team’s biggest lesson during rollout is not technical but procedural: they wire cache invalidation into their docs pipeline, so that whenever a help-center article is updated, the related cache entries are purged. Without that step, the cache would have kept serving last month’s policy long after it changed. The trade-off they accept is a modest engineering cost to maintain the invalidation hooks, in exchange for meaningfully lower spend on their highest-volume queries.
When to Use Semantic Caching for LLMs
- Your workload has high intent repetition, like support bots, FAQ assistants, or search-style Q&A
- User input is free text, so exact-match caching almost never hits
- Answers are largely stable over hours or days, not personalized per request
- You can tolerate an occasional near-miss, or you set a strict threshold where you cannot
- Inference cost or tail latency is a real constraint you need to move
When NOT to Use Semantic Caching for LLMs
- Responses are user-specific, time-sensitive, or non-deterministic (balances, live data, personalized advice)
- Your traffic is mostly unique, one-off prompts with little semantic overlap
- Wrong answers carry high stakes and you cannot set a threshold conservative enough to trust
- The task depends on exact context or tool calls that vary per request, where a “similar” prompt is not actually equivalent
- Your volume is low enough that the embedding overhead outweighs any savings
Common Mistakes With Semantic Caching for LLMs
- Setting the similarity threshold by intuition instead of calibrating it against labeled prompt pairs
- Caching personalized or volatile responses, which serves one user’s answer to another
- Forgetting invalidation, so the cache keeps returning stale answers after source data changes
- Ignoring the embedding cost on misses and assuming any hit rate is a net win
- Reusing a threshold across different embedding models, whose similarity ranges do not transfer
- Letting the index grow unbounded with no TTL or eviction policy
Conclusion
Semantic caching for LLMs turns the natural repetition in user questions into a genuine cost and latency win, but only when you respect its two sharp edges: the similarity threshold and cache invalidation. Get the threshold right by calibrating against real data, gate out anything user-specific or volatile, and wire invalidation into your content pipeline — then the cache pays for itself on high-repeat workloads. Start by measuring your actual intent repetition; if a small set of paraphrased questions dominates your traffic, you have a strong candidate for semantic caching. From there, explore the vector database options that best fit your scale and the embedding models that will drive your similarity matching.