Agentic RAG Explained: Retrieval Meets Autonomous Agents

If your retrieval-augmented generation pipeline answers simple questions well but falls apart on multi-step queries, conditional lookups, or questions that need data from three different sources, you are hitting the ceiling of standard RAG. Agentic RAG is the architectural shift that gets you past it. Instead of running a fixed retrieve-then-generate flow, an autonomous agent decides what to retrieve, when to retrieve again, and which tools to call along the way.

This deep dive is for engineers who have already built at least one RAG prototype and now need to handle queries that require planning. We will walk through the architecture, the core patterns that make agentic RAG work in production, a complete code example using LangGraph, and the trade-offs you need to understand before adopting it. By the end, you will be able to decide whether agentic RAG fits your use case or whether better chunking and reranking would solve the same problem at a fraction of the cost.

What Is Agentic RAG?

Agentic RAG is a retrieval architecture where an LLM agent autonomously decides which retrieval steps to perform, in what order, and whether the retrieved results are sufficient to answer the user’s query. Unlike standard RAG, which always runs a single embedding lookup before generation, agentic RAG treats retrieval as one of several tools the agent can call as many times as needed.

The key word is autonomous. The agent inspects the query, picks a strategy, fires off retrieval calls, evaluates what came back, and either generates a final answer or decides to retrieve more. For instance, a query like “Compare our Q3 revenue to the industry average and explain the gap” requires at least two distinct lookups (internal financial data and external benchmarks) plus a reasoning step that joins them. Standard RAG cannot plan that. An agent can.

If you are new to RAG fundamentals, start with our RAG from scratch guide before going further. Agentic RAG assumes you already understand chunking, embeddings, and vector search.

Standard RAG vs Agentic RAG: Key Differences

Standard RAG is a fixed pipeline. Agentic RAG is a loop with a controller. The table below highlights the differences that matter most in production.

| Aspect | Standard RAG | Agentic RAG |
|---|---|---|
| Retrieval strategy | Fixed: embed query, search, generate | Dynamic: agent decides per query |
| Number of retrieval calls | Exactly one per query | Zero to many, decided at runtime |
| Multi-source support | Single index, usually | Multiple indexes and tools |
| Self-correction | None | Agent can re-query if results are weak |
| Latency | Predictable, low | Variable, often 2-5x higher |
| Token cost | Low | Higher (planning + tool calls) |
| Best for | Q&A over a single corpus | Multi-hop, conditional, comparative queries |
| Debugging difficulty | Easy (linear trace) | Hard (branching execution) |

Notice the trade-off pattern. Agentic RAG buys you flexibility at the cost of latency, tokens, and observability complexity. As a result, you should not reach for it unless the failure modes of standard RAG are blocking real user value.

The Agentic RAG Architecture

At a high level, an agentic RAG system has four components working together: the agent controller, the retrieval tools, the memory or state store, and the response generator. Each plays a distinct role in turning a vague user query into a grounded answer.

The agent controller is an LLM running in a loop. On each iteration, it reads the current state (the user query plus everything retrieved so far) and produces one of three actions: call a retrieval tool, call a different tool such as a calculator or SQL query, or emit a final answer. The loop runs until the agent decides it has enough information or hits a maximum step count.
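
The loop described above can be sketched in plain Python. This is a minimal illustration, not a framework API: Action, LoopState, run_controller, and the stub policy are hypothetical names, and stub_decide stands in for the LLM call that would normally pick the next action.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Action:
    kind: str          # "retrieve", "tool", or "final"
    payload: str = ""  # query text, tool input, or the final answer


@dataclass
class LoopState:
    query: str
    observations: list[str] = field(default_factory=list)


def run_controller(
    query: str,
    decide: Callable[[LoopState], Action],
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> str:
    """Loop until the policy emits a final answer or the step budget runs out."""
    state = LoopState(query=query)
    for _ in range(max_steps):
        action = decide(state)
        if action.kind == "final":
            return action.payload
        # "retrieve" and "tool" both dispatch to a registered callable.
        state.observations.append(tools[action.kind](action.payload))
    return "Step budget exhausted without a final answer."


def stub_decide(state: LoopState) -> Action:
    """Stand-in for the LLM: retrieve once, then answer from what came back."""
    if not state.observations:
        return Action("retrieve", state.query)
    return Action("final", f"Based on: {state.observations[-1]}")


stub_tools = {"retrieve": lambda q: f"doc about {q}"}
answer = run_controller("riverpod", stub_decide, stub_tools)
print(answer)
```

The max_steps cap is what keeps a policy that never emits "final" from running forever.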

The retrieval tools are typed function calls the agent can invoke. A typical setup exposes several: search_internal_docs(query, top_k), search_web(query), query_database(sql), and sometimes a dedicated lookup_by_id(doc_id) for following references. Each tool returns structured results that get appended to the agent’s working context.
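
To make "typed tools returning structured results" concrete, here is a small sketch under stated assumptions: the ToolResult shape, the in-memory corpus, and the keyword matching are all hypothetical stand-ins, and a real search_internal_docs would call your vector store instead.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ToolResult:
    tool: str        # which tool produced this result
    source_id: str   # document identifier, useful for citations
    text: str


# Hypothetical in-memory corpus used only for this sketch.
_DOCS = {
    "doc-1": "Refunds are processed within 5 business days.",
    "doc-2": "Enterprise plans include SSO and audit logs.",
}


def search_internal_docs(query: str, top_k: int = 3) -> list[ToolResult]:
    """Toy keyword match standing in for a vector search."""
    terms = query.lower().split()
    hits = [
        ToolResult("search_internal_docs", doc_id, text)
        for doc_id, text in _DOCS.items()
        if any(term in text.lower() for term in terms)
    ]
    return hits[:top_k]


def lookup_by_id(doc_id: str) -> list[ToolResult]:
    """Fetch one document directly, e.g. to follow a reference."""
    text = _DOCS.get(doc_id)
    return [ToolResult("lookup_by_id", doc_id, text)] if text is not None else []


TOOLS: dict[str, Callable[..., list[ToolResult]]] = {
    "search_internal_docs": search_internal_docs,
    "lookup_by_id": lookup_by_id,
}


def call_tool(name: str, context: list[ToolResult], **kwargs) -> None:
    """Run a tool by name and append its results to the working context."""
    context.extend(TOOLS[name](**kwargs))


context: list[ToolResult] = []
call_tool("search_internal_docs", context, query="refunds", top_k=1)
call_tool("lookup_by_id", context, doc_id="doc-2")
```

Keeping the tool name and source ID on every result makes later grading and tracing much easier.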

The memory or state store tracks what the agent has done so far in the current session. It prevents redundant retrieval (the agent should not search for the same thing twice) and gives the loop the data it needs to decide its next action. LangGraph, the framework we use in the example below, makes this state explicit and serializable.
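
One simple way to avoid searching for the same thing twice is to key past results by a normalized query. The sketch below assumes a hypothetical RetrievalMemory class and a fake search function; it is not how LangGraph stores state, only an illustration of the deduplication idea.

```python
from typing import Callable


class RetrievalMemory:
    """Remembers queries already run in this session and reuses their results."""

    def __init__(self) -> None:
        self._seen: dict[str, list[str]] = {}

    @staticmethod
    def _normalize(query: str) -> str:
        # Lowercase and collapse whitespace so trivial variants share one key.
        return " ".join(query.lower().split())

    def retrieve(self, query: str, search: Callable[[str], list[str]]) -> list[str]:
        key = self._normalize(query)
        if key not in self._seen:
            self._seen[key] = search(query)
        return self._seen[key]


calls: list[str] = []


def fake_search(query: str) -> list[str]:
    """Stand-in for a real retrieval call; records each invocation."""
    calls.append(query)
    return [f"result for {query}"]


memory = RetrievalMemory()
first = memory.retrieve("Pricing  tiers", fake_search)
second = memory.retrieve("pricing tiers", fake_search)  # served from memory
```

Exact-match normalization only catches trivial repeats; catching paraphrased repeats would need a semantic comparison, which this sketch does not attempt.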

The response generator is often the same model as the controller, called one final time with the accumulated context to produce the user-facing answer. Some teams separate these for cost reasons, using a cheaper model for the final synthesis. For background on how agents orchestrate tools, see our guide on building AI agents with tools, planning, and execution.

Core Patterns in Agentic RAG

Four patterns show up across nearly every production agentic RAG system. Knowing them helps you read existing implementations and design your own without reinventing wheels.

Self-Querying

The agent rewrites the user’s natural language query into one or more optimized retrieval queries before calling the vector search. For example, “What did the CTO say about hiring last quarter?” becomes a filtered search with author=CTO, date_range=Q3 2025, and a query string of “hiring plans”. This pattern alone resolves a surprising number of failure cases in standard RAG, since user queries rarely match the embedding space directly.
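
The output of a self-querying step is easiest to reason about as a small structured object. In this sketch, StructuredQuery, rewrite_stub, and the metadata keys are hypothetical; in a real system an LLM with structured output would produce the object, and the text field would go to the vector search after filtering.

```python
from dataclasses import dataclass, field


@dataclass
class StructuredQuery:
    text: str                                   # what to embed and search for
    filters: dict[str, str] = field(default_factory=dict)  # exact-match metadata


def rewrite_stub(user_query: str) -> StructuredQuery:
    """Stand-in for the LLM rewrite; hard-coded to the example above."""
    return StructuredQuery(
        text="hiring plans",
        filters={"author": "CTO", "date_range": "Q3 2025"},
    )


def apply_filters(docs: list[dict], sq: StructuredQuery) -> list[dict]:
    """Keep only documents whose metadata matches every filter."""
    return [
        d for d in docs
        if all(d["metadata"].get(k) == v for k, v in sq.filters.items())
    ]


docs = [
    {"id": "a", "metadata": {"author": "CTO", "date_range": "Q3 2025"}},
    {"id": "b", "metadata": {"author": "CEO", "date_range": "Q3 2025"}},
    {"id": "c", "metadata": {"author": "CTO", "date_range": "Q1 2025"}},
]

sq = rewrite_stub("What did the CTO say about hiring last quarter?")
candidates = apply_filters(docs, sq)
```

Filtering before the similarity search shrinks the candidate set, so the short query string only has to rank documents that already match the author and date.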

Multi-Step Retrieval (Iterative Refinement)

The agent retrieves once, evaluates whether the results are sufficient, and retrieves again with a refined query if they are not. A common implementation uses an LLM call between retrievals to ask: “Does this context answer the question? If not, what specifically is missing?” The missing pieces become the next query. This pattern is essential for multi-hop questions where the answer requires chaining facts.
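
Here is a framework-free sketch of that retrieve, check, and refine loop. The names iterative_retrieve and find_gap are hypothetical, and the toy search and gap finder stand in for the vector store and the LLM call that asks what is missing.

```python
from typing import Callable, Optional


def iterative_retrieve(
    question: str,
    search: Callable[[str], list[str]],
    find_gap: Callable[[str, list[str]], Optional[str]],
    max_rounds: int = 3,
) -> list[str]:
    """Retrieve, ask what is missing, and search for the gap until done."""
    context: list[str] = []
    query = question
    for _ in range(max_rounds):
        context.extend(search(query))
        gap = find_gap(question, context)
        if gap is None:      # context is judged sufficient
            break
        query = gap          # the missing piece becomes the next query
    return context


# Toy knowledge base: a fact is returned when its key appears in the query.
FACTS = {
    "acme": "Acme was founded by Dana Lee.",
    "dana lee": "Dana Lee was born in Oslo.",
}


def toy_search(query: str) -> list[str]:
    q = query.lower()
    return [fact for key, fact in FACTS.items() if key in q]


def toy_find_gap(question: str, context: list[str]) -> Optional[str]:
    """Stand-in for the LLM check: done once a birthplace fact is present."""
    if any("born" in c for c in context):
        return None
    return "Dana Lee birthplace"


context = iterative_retrieve(
    "Where was the founder of Acme born?", toy_search, toy_find_gap
)
```

The first round finds who founded Acme, and the second round searches for the fact the first round revealed was missing, which is the chaining behavior multi-hop questions need.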

Tool Routing

When you have multiple data sources, the agent picks which one to query based on the question. Internal product docs go to your vector store. Real-time pricing goes to a SQL database. Public information goes to a web search tool. The routing decision happens in the controller, usually via structured tool calling. This is where agentic RAG starts to look less like search and more like an orchestrator. Hybrid search inside a single store is a different pattern; we cover it separately in our hybrid search guide.
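
As a rough sketch of the routing decision, the example below uses keyword rules where a production controller would use the model's structured tool call. All function names and the routing rules are hypothetical.

```python
from typing import Callable


def search_docs(question: str) -> str:
    return f"[vector store] {question}"


def query_pricing_db(question: str) -> str:
    return f"[sql] {question}"


def search_web(question: str) -> str:
    return f"[web] {question}"


ROUTES: dict[str, Callable[[str], str]] = {
    "docs": search_docs,
    "pricing": query_pricing_db,
    "web": search_web,
}


def choose_route(question: str) -> str:
    """Keyword stand-in for the controller's structured tool-call decision."""
    q = question.lower()
    if "price" in q or "pricing" in q:
        return "pricing"
    if "competitor" in q or "industry" in q:
        return "web"
    return "docs"


def route(question: str) -> tuple[str, str]:
    """Return the chosen route name and that tool's result."""
    name = choose_route(question)
    return name, ROUTES[name](question)


print(route("What is the current price of the Pro plan?"))
print(route("How do I rotate an API key?"))
```

Returning the route name alongside the result keeps the decision visible in logs, which matters once you need to debug why a question went to the wrong source.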

Self-Correction (Critique and Retry)

After generating an answer, the agent (or a separate critic model) checks it against the retrieved context. If the answer contains a claim that is not supported, the agent retrieves additional evidence or revises the answer. Frameworks like CRAG and Self-RAG formalize this loop. In practice, even a lightweight version, where the critic just flags unsupported claims, cuts hallucination rates noticeably.
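
To show the shape of a lightweight critic, the sketch below flags answer sentences with little word overlap against the retrieved context. Word overlap is a deliberately crude stand-in chosen so the example runs without a model; it is not what CRAG or Self-RAG do, and a real critic would use an LLM or an entailment model. The function names and the 0.6 threshold are assumptions.

```python
import re


def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def split_claims(answer: str) -> list[str]:
    """Treat each sentence as one claim."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def is_supported(claim: str, context: list[str], threshold: float = 0.6) -> bool:
    """A claim counts as supported if enough of its words appear in one chunk."""
    claim_words = _words(claim)
    if not claim_words:
        return True
    best = max(
        (len(claim_words & _words(chunk)) / len(claim_words) for chunk in context),
        default=0.0,
    )
    return best >= threshold


def flag_unsupported(answer: str, context: list[str]) -> list[str]:
    """Return the claims the critic could not ground in the context."""
    return [c for c in split_claims(answer) if not is_supported(c, context)]


context = [
    "Riverpod is a state management library for Flutter created by Remi Rousselet."
]
answer = "Riverpod was created by Remi Rousselet. It was first released in 2015."
flagged = flag_unsupported(answer, context)
print(flagged)
```

The second sentence is flagged because nothing in the context mentions a release date, so the agent can either retrieve more evidence or drop the claim.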

Building Agentic RAG: A Production Example

Let’s build a minimal but realistic agentic RAG system using LangGraph and OpenAI. The agent will answer questions over a small document corpus with the ability to retrieve, evaluate, and re-retrieve.

First, install the dependencies:

pip install langgraph langchain-openai langchain-community langchain-text-splitters chromadb

Next, set up the vector store. We use Chroma here for simplicity, but the pattern works with any vector database. For a comparison of options, see our vector databases compared post.

# setup_vectorstore.py
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
# The splitters live in their own package; importing from there avoids
# depending on the legacy langchain.text_splitter path.
from langchain_text_splitters import RecursiveCharacterTextSplitter

def build_vectorstore(documents: list[str], persist_path: str = "./chroma_db"):
    """Build a persistent Chroma index from raw document strings."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,
    )
    chunks = splitter.create_documents(documents)
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=persist_path,
    )
    return vectorstore

Why text-embedding-3-small? It is several times cheaper than text-embedding-3-large (check the current OpenAI pricing page for the exact ratio) and accurate enough for most internal-document use cases. Reserve the larger model for cases where retrieval precision is the bottleneck.

Now define the agent state and the tools. LangGraph uses a typed state dict that flows through every node in the graph.

# agent_state.py
from typing import TypedDict, Annotated
from langgraph.graph.message import add_messages

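class AgentState(TypedDict):
    messages: Annotated[list, add_messages]
    query: str
    retrieved_docs: list[str]
    retry_count: int
    # Set by the grade node and read by the router; it must be declared
    # here so the update is kept in the graph state.
    is_sufficient: bool
    final_answer: str | None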

The state tracks the conversation, the current search query (which the rewrite node may replace with a refined version), retrieved documents, how many retrieval rounds have already run, and the final answer once produced. The retry_count prevents infinite loops.

Next, define the graph nodes. Each node is a function that takes state and returns state updates.

# nodes.py
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage

from agent_state import AgentState

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def retrieve_node(state: AgentState, vectorstore) -> dict:
    """Run a vector search using the current query."""
    docs = vectorstore.similarity_search(state["query"], k=4)
    return {
        "retrieved_docs": [d.page_content for d in docs],
        "retry_count": state.get("retry_count", 0) + 1,
    }

def grade_node(state: AgentState) -> dict:
    """Ask the LLM whether the retrieved docs are sufficient."""
    context = "\n\n".join(state["retrieved_docs"])
    grade_prompt = f"""You are grading whether retrieved context answers a question.
Question: {state["query"]}
Context: {context}

Respond with exactly one word: SUFFICIENT or INSUFFICIENT."""
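    response = llm.invoke([HumanMessage(content=grade_prompt)])
    # "INSUFFICIENT" contains "SUFFICIENT", so a substring check would
    # always pass. Check how the verdict starts instead.
    verdict = response.content.strip().upper()
    is_sufficient = verdict.startswith("SUFFICIENT")
    return {"messages": [response], "is_sufficient": is_sufficient}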

def rewrite_query_node(state: AgentState) -> dict:
    """Generate a refined query when retrieval was insufficient."""
    rewrite_prompt = f"""The original query "{state["query"]}" did not return useful results.
Generate a better search query that captures the same intent with different wording.
Return only the new query, nothing else."""
    response = llm.invoke([HumanMessage(content=rewrite_prompt)])
    return {"query": response.content.strip()}

def answer_node(state: AgentState) -> dict:
    """Generate the final grounded answer."""
    context = "\n\n".join(state["retrieved_docs"])
    answer_prompt = f"""Answer the question using only the provided context.
If the context does not contain the answer, say so explicitly.

Question: {state["query"]}
Context: {context}"""
    response = llm.invoke([
        SystemMessage(content="You are a precise technical assistant."),
        HumanMessage(content=answer_prompt),
    ])
    return {"final_answer": response.content, "messages": [response]}

Now wire these into a graph with conditional edges. The graph defines the flow: retrieve, grade, then either answer or rewrite-and-retry. The retry_count cap prevents runaway loops.

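# graph.py
from langgraph.graph import StateGraph, START, END
from functools import partial

from agent_state import AgentState
from nodes import retrieve_node, grade_node, rewrite_query_node, answer_node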

def build_agent_graph(vectorstore, max_retries: int = 2):
    graph = StateGraph(AgentState)

    graph.add_node("retrieve", partial(retrieve_node, vectorstore=vectorstore))
    graph.add_node("grade", grade_node)
    graph.add_node("rewrite", rewrite_query_node)
    graph.add_node("answer", answer_node)

    graph.add_edge(START, "retrieve")
    graph.add_edge("retrieve", "grade")

    def grade_router(state: AgentState) -> str:
        if state.get("is_sufficient"):
            return "answer"
        if state.get("retry_count", 0) >= max_retries:
            return "answer"  # give up gracefully
        return "rewrite"

    graph.add_conditional_edges("grade", grade_router, {
        "answer": "answer",
        "rewrite": "rewrite",
    })
    graph.add_edge("rewrite", "retrieve")
    graph.add_edge("answer", END)

    return graph.compile()

Finally, run a query end to end:

# main.py
from setup_vectorstore import build_vectorstore
from graph import build_agent_graph

docs = [
    "Riverpod is a state management library for Flutter created by Remi Rousselet.",
    "Provider was the predecessor to Riverpod, also by the same author.",
    "Bloc emphasizes a strict event-state separation pattern.",
]
vectorstore = build_vectorstore(docs)
agent = build_agent_graph(vectorstore)

result = agent.invoke({
    "query": "Who made Riverpod?",
    "messages": [],
    "retrieved_docs": [],
    "retry_count": 0,
    "final_answer": None,
})
print(result["final_answer"])

Why this design works: the grade step prevents the agent from confidently answering off-topic retrievals, the rewrite step gives it a recovery path, and the retry cap bounds latency. In production, you would extend this with logging at each node and a tool-routing step before retrieval. For deeper LangGraph patterns, our LlamaIndex vs LangChain comparison covers when each framework fits.

Real-World Scenario

A common pattern in mid-sized SaaS companies looks like this. A customer support team builds a standard RAG system over their product docs and ticket history. Initial demos go well. Then, after launch, support engineers report that the bot fails on questions that combine product behavior with customer-specific data, such as “Why is feature X disabled for this customer?”

The root cause is not retrieval quality. The vector store contains both the product docs and the customer’s account configuration, but the embedding lookup for “Why is feature X disabled” never returns the customer-specific configuration row, because the query does not mention the customer ID. Teams that diagnose this often try better chunking first, then reranking, then hybrid search. Those help, but the underlying issue is that the query needs two retrievals against two different filters, joined by reasoning.

Migrating that system to agentic RAG (with a tool that takes customer_id as a filter and a separate tool for general product docs) typically resolves the bulk of these failures within a few weeks of iteration. The trade-off is real, though: average response time roughly doubles, and the per-query LLM cost increases by 2-3x because of the planning and grading steps. Most teams accept that trade-off for the support use case because hallucinations there are expensive.
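
The two-tool split in this scenario might look roughly like the sketch below. The tool names, the customer ID, and the in-memory data are all hypothetical; the point is only that the agent gathers evidence from two differently filtered sources before it reasons over them.

```python
# Hypothetical data standing in for a product-docs index and an account store.
PRODUCT_DOCS = {
    "feature_x": "Feature X requires the Enterprise plan.",
}
ACCOUNT_CONFIG = {
    "cust-42": {"plan": "Starter", "feature_x_enabled": False},
}


def search_product_docs(topic: str) -> str:
    """General product knowledge, no customer filter."""
    return PRODUCT_DOCS.get(topic, "")


def get_account_config(customer_id: str) -> dict:
    """Customer-specific lookup, filtered by customer_id."""
    return ACCOUNT_CONFIG.get(customer_id, {})


def gather_evidence(topic: str, customer_id: str) -> dict:
    """Collect both pieces so the LLM can join them in the answer step."""
    return {
        "product_doc": search_product_docs(topic),
        "account": get_account_config(customer_id),
    }


evidence = gather_evidence("feature_x", "cust-42")
print(evidence)
```

With both facts in context, the answer step can state that the feature is disabled because the account is on a plan that does not include it, which a single embedding lookup on the question text would not surface.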

When to Use Agentic RAG

  • Queries genuinely require multiple retrievals from different sources or filters
  • Users ask comparative or multi-hop questions that standard RAG cannot decompose
  • Hallucinations are costly enough to justify the cost of self-correction loops
  • You need to combine retrieval with non-retrieval tools (SQL, calculators, web search)
  • Your top failure mode is “retrieval missed the relevant chunk” rather than “answer wording is wrong”
  • You have observability infrastructure in place to debug branching execution paths

When NOT to Use Agentic RAG

  • Standard RAG with better chunking and reranking would solve 80% of your problem at 10% of the cost
  • Latency budget is under one second per query (agentic loops rarely fit)
  • Your traffic is high enough that the 2-5x token cost would break your unit economics
  • You do not yet have evals to measure whether the agent is actually doing better
  • The team lacks experience debugging non-deterministic LLM execution
  • You are still on a single corpus with no need for tool routing

Common Mistakes with Agentic RAG

The biggest mistake is reaching for agentic RAG before exhausting cheaper fixes. Teams often see standard RAG fail on a handful of queries, conclude “we need an agent”, and end up with a slower, more expensive system that fails on the same queries for the same reasons. Before adding an agent, try smarter chunking strategies and a reranking layer. These two changes alone resolve most of the failures attributed to “RAG is too rigid”.

A second common mistake is having no retry budget. Without a hard cap on retrieval iterations, a confused agent will loop indefinitely, burn tokens, and cause downstream callers to time out. Always set max_retries and max_steps. Pick a number based on actual user query complexity; for most consumer-facing apps, 2-3 retries is the right ceiling.
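
A budget is easier to enforce when it lives in one object that every step must pass through. The sketch below is one possible shape, with hypothetical names (LoopBudget, BudgetExceeded); it caps both the step count and the wall-clock time, since a loop can stay under its step limit and still blow a caller's timeout.

```python
import time


class BudgetExceeded(Exception):
    """Raised when the agent loop runs out of steps or time."""


class LoopBudget:
    def __init__(self, max_steps: int, max_seconds: float, clock=time.monotonic):
        self.max_steps = max_steps
        self.max_seconds = max_seconds
        self._clock = clock
        self._start = clock()
        self.steps = 0

    def spend(self) -> None:
        """Call once per agent step; raises when either limit is exceeded."""
        self.steps += 1
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step limit {self.max_steps} reached")
        if self._clock() - self._start > self.max_seconds:
            raise BudgetExceeded(f"time limit {self.max_seconds}s reached")


budget = LoopBudget(max_steps=3, max_seconds=10.0)
completed = 0
stop_reason = ""
try:
    while True:  # stands in for a confused agent that never finishes
        budget.spend()
        completed += 1
except BudgetExceeded as exc:
    stop_reason = str(exc)

print(completed, stop_reason)
```

When the budget trips, fall back to answering with whatever context you have, or tell the user the question could not be resolved, rather than surfacing a raw timeout.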

A third mistake is treating the grade step as optional. The grade node is what prevents the agent from confidently answering with bad context. Skipping it because “the LLM will figure it out” leads to confidently wrong answers. Keep the grade step even when it feels redundant. The cost is one extra LLM call; the benefit is catching the cases where retrieval silently failed.

A fourth mistake is using the same LLM for the controller and the synthesis step without tracking cost separately. The controller does many small calls (planning, grading, rewriting), and the synthesis does one large call. If your bills surprise you, instrument these separately so you know which to optimize. For some teams, swapping the controller to a cheaper model while keeping the synthesis on a stronger one cuts costs in half without measurable quality loss.
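
Tracking cost per role can be as simple as tagging every LLM call with "controller" or "synthesis" and summing tokens. The sketch below assumes a hypothetical CostTracker class, and the per-million-token prices are made-up placeholders, not real model prices.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Usage:
    calls: int = 0
    input_tokens: int = 0
    output_tokens: int = 0


class CostTracker:
    def __init__(self, prices_per_million: dict[str, tuple[float, float]]):
        # role -> (input price, output price) in USD per 1M tokens
        self.prices = prices_per_million
        self.usage: dict[str, Usage] = defaultdict(Usage)

    def record(self, role: str, input_tokens: int, output_tokens: int) -> None:
        u = self.usage[role]
        u.calls += 1
        u.input_tokens += input_tokens
        u.output_tokens += output_tokens

    def cost(self, role: str) -> float:
        u = self.usage[role]
        price_in, price_out = self.prices[role]
        return (u.input_tokens * price_in + u.output_tokens * price_out) / 1_000_000


# Placeholder prices for illustration only.
tracker = CostTracker({"controller": (1.0, 2.0), "synthesis": (10.0, 20.0)})

for _ in range(3):  # e.g. grade, rewrite, grade
    tracker.record("controller", input_tokens=1000, output_tokens=100)
tracker.record("synthesis", input_tokens=4000, output_tokens=500)

for role in ("controller", "synthesis"):
    print(role, tracker.usage[role].calls, round(tracker.cost(role), 6))
```

In practice you would read the token counts from the usage metadata your LLM client returns on each response and feed them into record.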

Production Considerations

Three operational concerns dominate once agentic RAG hits production: latency, cost, and observability. Each requires deliberate design.

Latency is the most painful surprise. A standard RAG call is one embedding lookup plus one generation, typically 500-1000ms end to end. An agentic call with retrieve, grade, possibly rewrite, retrieve again, and synthesize can easily reach 3-5 seconds. Streaming the final answer hides some of this, but the user still waits for the agent loop to decide what to retrieve. Mitigation strategies include running grading on a faster model, caching frequent query rewrites, and parallelizing tool calls when the agent decides to query multiple sources at once.
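
Parallelizing tool calls is the cheapest of these mitigations to prototype. The sketch below uses asyncio.gather with two hypothetical async tools whose sleeps stand in for network latency.

```python
import asyncio


async def search_docs(query: str) -> str:
    await asyncio.sleep(0.05)  # stands in for a vector store round trip
    return f"docs:{query}"


async def search_web(query: str) -> str:
    await asyncio.sleep(0.05)  # stands in for a web search round trip
    return f"web:{query}"


async def fan_out(query: str) -> list[str]:
    """Run both lookups concurrently; results come back in call order."""
    return await asyncio.gather(search_docs(query), search_web(query))


results = asyncio.run(fan_out("q3 revenue"))
print(results)
```

Because both coroutines wait at the same time, the total wait is roughly one round trip instead of two. This only helps when the lookups are independent; a lookup that depends on the result of another still has to run afterwards.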

Cost scales with the number of LLM calls per query, not just tokens. A query that triggers two retrieval rounds plus grading plus synthesis is four to six LLM calls instead of one. Use a small model for the controller (gpt-4o-mini, Haiku, or similar) and reserve the strong model for the final synthesis. Furthermore, batch any embedding generation aggressively; embedding costs are low per call but add up fast.

Observability is where most teams underinvest. Standard RAG produces a single linear trace per query, easy to scan. An agentic system produces a tree: which tools were called, in what order, with what results, leading to which answer. Without proper tracing, debugging “the agent gave a wrong answer” becomes a needle-in-haystack search. Use LangSmith, Langfuse, or your own structured logging from day one, capturing the full step sequence including grades, query rewrites, and tool inputs and outputs. For an end-to-end production example with observability, see our Pinecone Serverless production RAG tutorial.
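
If you start with your own structured logging, one JSON record per query that lists every step is a workable baseline. The Trace class below is a hypothetical sketch of that idea, not the LangSmith or Langfuse API.

```python
import json
import time
import uuid


class Trace:
    """Collects the ordered steps of one agent run as plain dicts."""

    def __init__(self, query: str) -> None:
        self.trace_id = uuid.uuid4().hex
        self.query = query
        self.steps: list[dict] = []

    def log(self, node: str, **fields) -> None:
        self.steps.append(
            {"step": len(self.steps) + 1, "node": node, "ts": time.time(), **fields}
        )

    def to_json(self) -> str:
        return json.dumps(
            {"trace_id": self.trace_id, "query": self.query, "steps": self.steps}
        )


trace = Trace("Who made Riverpod?")
trace.log("retrieve", query="Who made Riverpod?", num_docs=4)
trace.log("grade", verdict="INSUFFICIENT")
trace.log("rewrite", new_query="Riverpod author creator")
print(trace.to_json())
```

Emitting the whole run as one record means a single log search by trace_id shows which queries were tried, how each was graded, and where the run went wrong.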

A final consideration is evals. You cannot improve what you do not measure, and agentic RAG has more knobs to tune than standard RAG. Build a regression test set of real queries with known good answers, and run it on every prompt or model change. Without that, you will not know whether your improvements are actually improvements or whether you are just trading one failure mode for another.
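
A regression set does not need a framework to get started. The sketch below assumes a hypothetical run_evals helper and a substring-based pass criterion, which is crude but cheap; stub_agent stands in for the real agent so the example runs on its own.

```python
from typing import Callable

EVAL_SET = [
    {"query": "Who made Riverpod?", "must_contain": ["Remi Rousselet"]},
    {"query": "What pattern does Bloc emphasize?", "must_contain": ["event-state"]},
]


def run_evals(answer_fn: Callable[[str], str], cases: list[dict]) -> dict:
    """Run every case and report which expected strings were missing."""
    failures = []
    for case in cases:
        answer = answer_fn(case["query"])
        missing = [
            s for s in case["must_contain"] if s.lower() not in answer.lower()
        ]
        if missing:
            failures.append({"query": case["query"], "missing": missing})
    return {
        "total": len(cases),
        "passed": len(cases) - len(failures),
        "failures": failures,
    }


def stub_agent(query: str) -> str:
    """Stand-in for the real agent: gets one case right and one wrong."""
    if "Riverpod" in query:
        return "Riverpod was created by Remi Rousselet."
    return "I could not find that in the context."


report = run_evals(stub_agent, EVAL_SET)
print(report)
```

To run this against the graph from the example earlier, answer_fn would wrap agent.invoke with the initial state and return its final_answer field. Run the set on every prompt or model change and compare the passed count against the previous run.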

Conclusion

Agentic RAG is the right tool for queries that standard RAG cannot decompose: multi-hop, multi-source, conditional, or self-correcting. It buys you flexibility and grounding quality at the cost of latency, tokens, and operational complexity. The decision to adopt it should come from your failure analysis, not from architectural fashion.

Start with standard RAG, fix what better chunking and reranking can fix, and only add an agent when the remaining failures genuinely require planning. When you do build agentic RAG, design the retry budget and observability layer before the first production deploy. For your next step, walk through our building AI agents with tools and planning post to deepen the agent control loop, then try wiring up a second retrieval tool to see tool routing in action.
