RAG & Vector Search

RAG Chunking Strategies: Fixed, Recursive, and Semantic

If your retrieval-augmented generation system surfaces documents that contain the right keywords but miss the actual answer, your chunking step is usually the culprit. RAG chunking strategies decide whether the model sees a coherent passage or a sliced-up paragraph that loses its meaning halfway through. This guide compares fixed-size, recursive, and semantic chunking — with production code, trade-off analysis, and a decision framework you can apply to your own pipeline today. By the end, you will know which strategy fits your data, what to measure, and where most teams quietly go wrong.

Why Chunking Decides Retrieval Quality

Embedding a 300-page PDF as a single vector is useless. The embedding averages the whole document into one direction in vector space, so any query maps to “broadly relevant” rather than the specific paragraph that answers the question. Chunking exists to break long documents into pieces small enough to embed precisely, but large enough to preserve the context the model needs to answer.

The chunk size affects three measurable things at once: retrieval precision, recall, and the prompt tokens you pay for at inference. Small chunks score higher on precision because each chunk is tightly focused, but they fragment ideas across boundaries and force the retriever to pull many chunks to reconstruct the full answer. Large chunks preserve context but dilute the embedding signal, so semantically close passages compete for the same vector space.

Furthermore, chunk boundaries change what the embedding actually represents. A chunk that ends mid-sentence creates an embedding for a grammatical fragment, not an idea. A chunk that spans two unrelated sections produces a vector that points at the midpoint between them — useful for neither. As a result, the choice between fixed, recursive, and semantic chunking is not a stylistic preference. It directly determines whether retrieval works.

If you are new to the pipeline these chunks feed into, the end-to-end RAG walkthrough on TeachMeIDEA covers the surrounding components — embedding, storage, retrieval, and generation — so you can see where chunking sits in context.

Fixed-Size Chunking Explained

Fixed-size chunking is the simplest approach: split the source text into segments of a target character or token count, optionally with overlap between adjacent chunks. The splitter does not care about sentence boundaries, paragraphs, or semantic structure. It walks the text and cuts every N tokens.

How Fixed-Size Chunking Works

A typical fixed-size implementation takes three parameters: chunk size, chunk overlap, and the unit of measurement (characters or tokens). The overlap ensures that a sentence split across two chunks appears in both, giving retrieval a second chance to surface the right context.

import tiktoken

def fixed_size_chunk(text: str, chunk_size: int = 512, overlap: int = 64,
                    encoding_name: str = "cl100k_base") -> list[str]:
    """Split text into fixed token-count chunks with overlap.

    chunk_size and overlap are in tokens, not characters, so chunk
    budgets line up with model context windows.
    """
    encoder = tiktoken.get_encoding(encoding_name)
    tokens = encoder.encode(text)

    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunk_tokens = tokens[start:end]
        chunks.append(encoder.decode(chunk_tokens))
        if end == len(tokens):
            break
        start = end - overlap

    return chunks

Why tokens instead of characters: embedding models and LLMs both bill and reason in tokens. Counting characters drifts away from the model’s actual unit, so a 1,000-character chunk might be 180 tokens in English and 700 tokens in Japanese. Counting tokens gives you predictable embedding cost and predictable prompt size at retrieval time.

Trade-offs of Fixed-Size Chunking

Fixed-size chunking is fast, deterministic, and trivial to debug. Indexing a million-document corpus is bounded by I/O, not by chunker logic. Furthermore, every chunk has predictable cost — useful for capacity planning when you are deciding how many embeddings to budget for a vector database.

The cost is that fixed-size chunks happily slice through the middle of sentences, code blocks, and tables. An embedding of “of the customer is responsible for paying any applicable” is meaningless on its own. Overlap reduces this problem but does not eliminate it, and overlap also inflates your storage cost by roughly overlap / chunk_size — a 64-token overlap on 512-token chunks adds about 12.5% more vectors.

Fixed-size is the right default when documents are uniform in structure (chat logs, paginated PDFs from the same template, scraped pages with predictable layouts) and you need to ship something working in an afternoon. It is the wrong default for documents with headings, lists, or mixed content, where structural splits would be obvious wins.

Recursive Chunking Explained

Recursive chunking respects the structure of natural text by trying a hierarchy of separators in order. It first attempts to split on the largest meaningful boundary (double newlines, signalling paragraphs). If a resulting chunk is still too large, it recurses into the next separator (single newlines, then sentences, then words) until the chunks fit the target size.

How Recursive Chunking Works

Most production RAG systems built on LangChain or LlamaIndex use recursive character splitting as the default. The implementation is conceptually a depth-first tree split: each oversized chunk gets re-split with a smaller separator until everything fits.

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""],
    is_separator_regex=False,
)

docs = splitter.create_documents([raw_text])

The separator order matters. Paragraphs first preserves the most semantic context; single newlines next handles lists and code; sentence endings catch long flowing prose; spaces and empty strings are last-resort fallbacks. If you swap the order, you get materially different chunks for the same input.

For Markdown, HTML, or code, use a format-aware splitter rather than the generic one. LangChain ships MarkdownHeaderTextSplitter for header-aware Markdown splitting, and RecursiveCharacterTextSplitter.from_language(Language.PYTHON) for source code with language-appropriate separators (class/function/block boundaries instead of paragraphs). Format-aware splitting is the single biggest retrieval-quality improvement most teams make after they realize their generic splitter shredded their codebase.

Trade-offs of Recursive Chunking

Recursive chunking gives you most of the structural awareness of semantic chunking at the cost of fixed-size chunking. It is the right default for documentation, articles, books, and any text where paragraph and sentence boundaries carry meaning — which covers the majority of real-world RAG corpora.

The honest limit is that recursive chunking only knows about syntactic structure, not semantic structure. Two adjacent paragraphs covering completely different topics get chunked as if they belong together; a single argument that spans three paragraphs gets split across chunks. For most data the syntactic structure correlates strongly with semantic structure, so this trade-off rarely matters. For dense technical writing or transcripts with rapid topic shifts, it shows up as retrieval misses.

Semantic Chunking Explained

Semantic chunking ignores syntactic separators entirely. Instead, it embeds each sentence, then groups consecutive sentences whose embeddings are close together in vector space. Boundaries appear wherever the embedding similarity between adjacent sentences drops — those drops mark topic shifts.

How Semantic Chunking Works

The algorithm in practice: sentence-tokenize the document, embed each sentence (or small group of sentences), compute the cosine distance between each adjacent pair, and place a chunk boundary at points where the distance exceeds a threshold. The threshold is usually defined as a percentile of all distances in the document (for instance, the 95th percentile), so it adapts to documents with different levels of topical homogeneity.

from typing import Sequence
import numpy as np
from openai import OpenAI

client = OpenAI()

def semantic_chunk(sentences: Sequence[str],
                   model: str = "text-embedding-3-small",
                   percentile_threshold: float = 95.0) -> list[str]:
    """Group consecutive sentences into chunks at topic-shift points."""
    if len(sentences) < 2:
        return list(sentences)

    response = client.embeddings.create(input=list(sentences), model=model)
    embeddings = np.array([item.embedding for item in response.data])

    similarities = []
    for i in range(len(embeddings) - 1):
        a, b = embeddings[i], embeddings[i + 1]
        cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        similarities.append(1.0 - cosine)  # distance, not similarity

    cutoff = np.percentile(similarities, percentile_threshold)

    chunks, current = [], [sentences[0]]
    for i, distance in enumerate(similarities):
        if distance > cutoff:
            chunks.append(" ".join(current))
            current = [sentences[i + 1]]
        else:
            current.append(sentences[i + 1])
    if current:
        chunks.append(" ".join(current))

    return chunks

Why a percentile threshold and not a fixed number: cosine distances vary wildly across domains. Legal text reads more uniformly than a podcast transcript, so a fixed threshold of 0.3 might produce 10 chunks for one and 500 for the other. Percentile thresholds adapt automatically and tend to land in a useful range.

Trade-offs of Semantic Chunking

Semantic chunking produces the best-aligned chunks for retrieval when topic boundaries are the actual signal you care about. Long-form content with topic transitions (interviews, course transcripts, multi-topic articles) benefits the most. Furthermore, semantic chunks often align with how humans would naturally divide the text, which makes downstream evaluation easier.

The cost is real and worth understanding before adopting it. Indexing now requires an embedding API call (or local model inference) for every sentence in your corpus, on top of the embedding calls you already make for the final chunks. For a million-document corpus, this can double or triple indexing cost and latency. Additionally, semantic chunking is not deterministic across embedding model versions — upgrading from text-embedding-3-small to text-embedding-3-large will change your chunk boundaries, which means re-indexing everything to keep retrieval consistent.

There is a more pragmatic position emerging in production RAG: use recursive chunking by default, and reserve semantic chunking for the specific document types where evaluation shows it helps. Indexing the entire corpus semantically as a first move is rarely the bottleneck you think it is.

Beyond the Three: Sentence-Window and Parent-Document Patterns

Two patterns sit alongside the three core strategies and often combine with them. Both address the same trade-off: small chunks retrieve precisely, but large chunks give the LLM enough context to answer.

The sentence-window pattern indexes one sentence (or a small group of sentences) per vector for retrieval, but at query time expands the retrieved chunk to include the surrounding N sentences before passing context to the LLM. You get precise retrieval and rich context simultaneously, at the cost of storing a metadata pointer back to the original document.

from llama_index.core.node_parser import SentenceWindowNodeParser

parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
nodes = parser.get_nodes_from_documents(documents)

The parent-document pattern does something similar but at coarser granularity: it indexes small “child” chunks for retrieval and stores larger “parent” chunks in a document store. Retrieval returns child IDs; the pipeline resolves them to parents before generation. LangChain’s ParentDocumentRetriever implements this directly.

Both patterns matter because they let you decouple the size used for retrieval from the size used for generation. In practice, this is often more impactful than agonizing over fixed vs recursive vs semantic. If you are already comfortable with LangChain fundamentals, adding a parent-document retriever is a half-day change that often outperforms switching chunking algorithms entirely.

RAG Chunking Strategies Compared

The table below summarizes when each approach earns its place in production. None of these are universal winners — match the strategy to the document type and the retrieval failure mode you are seeing.

StrategySpeedCostStructure-AwareBest Fit
Fixed-SizeFastestLowestNoUniform documents, prototypes, chat logs
RecursiveFastLowSyntacticDocumentation, articles, mixed corpora (default)
SemanticSlowHighSemanticLong-form with topic shifts, high-value corpora
Sentence-WindowFast indexingLowHybridRetrieval precision with generation context
Parent-DocumentFastLowHybridQ&A over books or long docs

A useful mental model: fixed-size is your fallback when nothing else fits the data, recursive is your default for almost everything else, and semantic is reserved for cases where you have measured a retrieval problem that the other two cannot solve.

Real-World Scenario: Tuning a Customer Support RAG

Consider a mid-sized SaaS company building an internal RAG system over 3,000 help-center articles, an API reference, and roughly 18 months of support ticket transcripts. The initial implementation uses fixed-size chunking at 1,024 tokens with no overlap, embedded with text-embedding-3-small and stored in a vector database. Initial retrieval looks reasonable in spot checks but fails on real questions about specific API error codes.

The first investigation finds that long support tickets are getting chunked mid-conversation, so a single chunk often contains a user’s question without the agent’s resolution. Switching the support transcript loader to a recursive splitter with \n\n---\n\n as the top separator (ticket boundary), then \n (turn boundary), keeps each conversation intact. Retrieval on tickets improves noticeably.

The API reference is a different problem. The fixed-size chunker is slicing through code examples, so embeddings of half-formed Python snippets dominate retrieval. The fix is a format-aware Markdown splitter that respects heading hierarchy and never splits inside fenced code blocks. After that change, the API reference becomes the strongest retrieval source.

The help-center articles are the last category, and they reveal where semantic chunking earns its cost. The articles average 1,500 words and frequently mix billing, configuration, and troubleshooting in a single page. Recursive chunking respects paragraphs but does not catch the topic shifts inside them, so retrieval surfaces “billing” articles for “configuration” questions because both topics share a chunk. Switching only the help-center pipeline to semantic chunking — keeping recursive for the rest — solves this specific failure mode without paying the semantic-embedding cost on the other two corpora.

The lesson is that “which chunking strategy” is the wrong question at the corpus level. The right question is: for each document type, which strategy matches its structure and the queries users actually ask?

When to Use Each RAG Chunking Strategy

The decision is rarely between one strategy for the whole pipeline. Most production systems mix strategies per document type. Use this section as a starting matrix, then validate against your own retrieval evaluation.

When to Use Fixed-Size Chunking

  • Documents are structurally uniform (chat logs, paginated forms, sensor readings, scraped pages from a single template)
  • You are prototyping and need a working pipeline within hours, not days
  • Token-budget predictability matters more than chunk quality (cost-controlled embedding pipelines)
  • Source text has no meaningful structure to preserve (raw transcripts without speaker turns or timestamps)

When to Use Recursive Chunking

  • General-purpose documentation, articles, books, and long-form web content
  • Mixed corpora where you cannot guarantee a single structure across documents
  • Source files with format conventions a structure-aware splitter can leverage (Markdown, HTML, source code)
  • You want the strong default that solves 80% of cases without semantic-chunking cost

When to Use Semantic Chunking

  • High-value corpora where retrieval quality justifies the indexing cost (legal, medical, internal knowledge bases)
  • Long-form content with rapid topic shifts inside paragraphs (interview transcripts, multi-topic articles)
  • You have measured a specific retrieval failure that paragraph-level splitting cannot fix
  • Re-indexing on embedding model upgrades is a cost you can afford and have budgeted for

When NOT to Use Each RAG Chunking Strategy

Knowing where each approach breaks down is more valuable than knowing where it works. Most teams discover these limits the hard way.

When Fixed-Size Falls Short

  • Documents with headings, lists, tables, or code blocks (the splitter will slice through them)
  • Multilingual corpora where character-based sizing creates inconsistent token counts
  • Q&A use cases where the answer is a single sentence that gets cut in half by an arbitrary boundary
  • Any scenario where retrieval evaluation shows fragmented or partial-thought chunks

When Recursive Becomes Overkill

  • Tiny structured documents (under 200 tokens each) where any splitting is unnecessary
  • Streaming pipelines where the recursion overhead matters at high throughput
  • Highly regular formats (logs, CSV-derived text) where fixed-size is faster and equivalent

When Semantic Isn’t Worth the Cost

  • Prototypes and proof-of-concept systems where the cost of being wrong is low
  • Corpora where retrieval evaluation already passes with recursive chunking
  • Pipelines that re-index frequently — every re-index pays the semantic-embedding cost again
  • Documents without meaningful topic shifts (technical specs, reference tables) where the algorithm has nothing useful to do

Common Mistakes with RAG Chunking Strategies

Most chunking failures are not about choosing the wrong algorithm. They are about applying any algorithm without measuring the result.

  • Picking chunk size by intuition rather than by retrieval evaluation. A 512-token default is fine to start, but the right size depends on your queries and embedding model. Measure recall@k on a curated eval set before shipping.
  • Using a single chunking strategy across heterogeneous document types. Support tickets, API references, and help articles need different splitters; one pipeline rarely wins everywhere.
  • Skipping overlap on fixed-size chunkers and then wondering why important sentences disappear at chunk boundaries.
  • Splitting through code blocks, tables, and lists with a generic text splitter. Use format-aware splitters for any document that is not plain prose.
  • Treating semantic chunking as a free upgrade. The indexing cost is real, and the boundary instability across embedding model versions is an operational hazard.
  • Forgetting to re-evaluate after upgrading the embedding model. Different embedding models produce different similarity geometries, and the chunk size that worked for one may underperform for the next.
  • Ignoring the relationship between chunk size and prompt budget. If your retriever pulls eight chunks and each is 1,500 tokens, you are spending 12,000 tokens on context alone — at scale, this dominates inference cost. The decision to use fine-tuning instead of RAG sometimes comes down to recurring context cost, not retrieval quality.
  • Failing to test chunking against real user queries. Reading sample chunks reveals little — running retrieval against the questions your system actually receives reveals everything.

Conclusion

The right RAG chunking strategy is the one your evaluation set tells you to use, not the one that is fashionable on Twitter this month. Recursive chunking is the strong default, fixed-size earns its place on uniform structures and tight prototypes, and semantic chunking is worth its cost only on corpora where measured retrieval problems demand it. Most production systems benefit more from mixing strategies per document type and adding a sentence-window or parent-document layer than from agonizing over which single algorithm to standardize on.

The next step is to wire up an evaluation harness — even a hand-curated set of 50 representative queries with expected source documents is enough to start. Once you can measure recall@5 on your own data, the chunking question stops being theoretical. For the surrounding pipeline, the RAG-from-scratch guide on TeachMeIDEA shows how chunking fits into embedding, storage, and retrieval, and the vector databases comparison covers the storage layer your chunks will eventually land in. If your system is heading toward autonomous retrieval, the patterns in building AI agents show how the same chunking decisions affect tool-using agents that retrieve on demand.

Leave a Comment