RAG Is Not Just Search: Chunking, Ranking, Reranking & More

"RAG is just search plus a prompt" is the sentence that produces most disappointing RAG systems. Plain semantic search gets you a demo; it does not get you answers users trust. The gap between the two is a stack of unglamorous techniques (careful chunking, hybrid ranking, a reranking pass, and explicit hallucination controls) that turn "vaguely related paragraphs" into "the right evidence, in the right order, with the model honest about its limits." This guide is about that gap and how to close it.

What separates a RAG demo from a RAG product

Why naive top-k vector search retrieves plausible-but-wrong context
Chunking strategies that preserve meaning instead of shredding it
Hybrid ranking: combining keyword and semantic signals
Reranking: a cheap second pass that dramatically lifts precision
Hallucination control: grounding, thresholds, and graceful abstention

Info

The mental shift

Retrieval is not "find similar text." It is "assemble the minimal set of evidence that lets the model answer correctly, and nothing that distracts it." Every technique here serves that goal: relevance, ordering, and honesty.

Why naive vector search disappoints

A single embedding search returns the k chunks whose vectors sit closest to the question's vector. That sounds right, but it fails in predictable ways: it misses exact terms (product codes, error numbers, names) that semantic similarity blurs over; it returns near-duplicates that crowd out diverse evidence; and it has no notion of authority or freshness, so a stale draft can outrank the current policy. The model then dutifully answers from mediocre context and sounds confident doing it.

Question: "What's the refund window for error code E-402?"

Naive top-k may return:
  ✓ a paragraph about refunds (semantically close)
  ✗ three near-duplicate marketing blurbs about "hassle-free returns"
  ✗ misses the support doc that literally lists "E-402 → 14 days"
       because that doc is terse and embeds far from the chatty query
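The failure above can be reproduced in miniature. The sketch below is illustrative only: the three-dimensional "embeddings" are hand-picked hypothetical values rather than output from a real model, and `naive_top_k` is an assumed helper name. It shows how near-duplicates with almost identical vectors fill the top slots while the terse support doc never makes the cut.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def naive_top_k(query_vec, index, k=3):
    """Return the k chunk ids whose vectors are closest to the query."""
    ranked = sorted(index, key=lambda cid: cosine(query_vec, index[cid]), reverse=True)
    return ranked[:k]

# Hypothetical embeddings, chosen by hand to mimic the scenario above.
index = {
    "refund-policy": [0.9, 0.1, 0.0],
    "marketing-1":   [0.8, 0.2, 0.0],
    "marketing-2":   [0.8, 0.2, 0.1],
    "marketing-3":   [0.8, 0.3, 0.0],
    "e402-support":  [0.2, 0.1, 0.9],  # terse doc, far from the chatty query
}
query = [0.9, 0.2, 0.0]

top = naive_top_k(query, index, k=4)
print(top)  # the marketing near-duplicates crowd out "e402-support"
```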

Chunking: the foundation everything rests on

If a chunk does not contain a coherent, self-contained idea, no amount of clever ranking rescues it. Chunking is where most quality is won or lost, and the right strategy depends on your content.

Strategies, from blunt to smart

✓ Pros

Structure-aware: split on headings/sections so chunks match real topics
Recursive character splitting with overlap: solid general-purpose default
Sentence/semantic chunking: group sentences that belong together
Parent-child: embed small chunks, but feed the larger parent to the model

✕ Cons

Fixed-size character splits that cut sentences mid-thought
Chunks so large they bury the relevant line in noise
Chunks so small they lose the context that gives them meaning
Dropping headings/titles, leaving chunks ambiguous out of context
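As a rough illustration of the "do" column, here is a minimal sentence-aware chunker with overlap that also keeps the section heading attached to every chunk. The function name, the regex-based sentence split, and the `max_chars` and `overlap` defaults are all assumptions made for this sketch; a production splitter would need a more robust sentence tokenizer.

```python
import re

def chunk_sentences(text, heading="", max_chars=200, overlap=1):
    """Group whole sentences into chunks of roughly max_chars characters.

    The last `overlap` sentences of each chunk are repeated at the start of
    the next one. max_chars is a soft limit: a chunk can exceed it when the
    carried-over sentences plus one new sentence are longer than the limit.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    # Prefix the heading so each chunk stays interpretable out of context.
    return [f"{heading}\n{c}" if heading else c for c in chunks]

text = (
    "Refunds are accepted within 14 days. Items must be unused. "
    "Contact support to start a return. Shipping fees are not refunded."
)
chunks = chunk_sentences(text, heading="Refund policy", max_chars=60, overlap=1)
for c in chunks:
    print(repr(c))
```

No sentence is cut mid-thought, neighbouring chunks share one sentence of context, and the heading travels with every chunk.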

💡

Pro tip

The parent-child pattern is the single highest-leverage upgrade for most systems: index small, precise child chunks for accurate retrieval, but pass the surrounding parent section to the model so it has enough context to answer. You get precision in search and completeness in generation.
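A minimal sketch of the lookup step in that pattern, assuming retrieval has already ranked the child chunks. The data, the id scheme (`refunds#0`), and the `expand_to_parents` helper are hypothetical; the point is only that each hit is swapped for its parent section and that parents are not repeated.

```python
# Hypothetical data: two parent sections and the small child chunks cut from them.
parents = {
    "refunds": "Refund policy. Refunds are accepted within 14 days. "
               "Error E-402 follows the same window.",
    "shipping": "Shipping policy. Orders ship within 2 business days. "
                "Fees are not refundable.",
}
children = {
    "refunds#0": {"text": "Refunds are accepted within 14 days.", "parent_id": "refunds"},
    "refunds#1": {"text": "Error E-402 follows the same window.", "parent_id": "refunds"},
    "shipping#0": {"text": "Orders ship within 2 business days.", "parent_id": "shipping"},
}

def expand_to_parents(child_hits, children, parents):
    """Map ranked child ids to their parent sections, deduplicated, order kept."""
    seen, sections = set(), []
    for child_id in child_hits:
        parent_id = children[child_id]["parent_id"]
        if parent_id not in seen:
            seen.add(parent_id)
            sections.append(parents[parent_id])
    return sections

# Suppose the (omitted) search step ranked these child ids best-first.
hits = ["refunds#1", "refunds#0", "shipping#0"]
context_sections = expand_to_parents(hits, children, parents)
print(context_sections)
```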

Hybrid ranking: keywords and vectors together

Semantic search understands meaning; keyword search (BM25) nails exact terms. They fail in opposite ways, which is exactly why combining them wins. Hybrid retrieval runs both and fuses the results — commonly with Reciprocal Rank Fusion (RRF), which blends rankings without needing the two scoring scales to be comparable. The result catches both "what they meant" and "the literal code they typed."

# Reciprocal Rank Fusion: combine a keyword and a vector ranking.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:                 # each is an ordered list of doc ids
        for rank, doc_id in enumerate(ranking, start=1):  # RRF ranks are 1-based
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf([bm25_results, vector_results])  # best of both worlds
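For reference, the usual formulation of RRF sums one reciprocal-rank term per ranking r in the set R being fused. Ranks are conventionally counted from 1, and k = 60 is a common default rather than a required value:

```latex
\mathrm{RRF}(d) \;=\; \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)}
```

As a worked example with k = 60: a document ranked 1st by BM25 and 3rd by vector search scores 1/61 + 1/63 ≈ 0.0323, while a document ranked 1st by only one of the two scores 1/61 ≈ 0.0164. Appearing in both lists beats topping just one, which is why the raw score scales never need to be comparable.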

💡

Tip

Hybrid is often the biggest single win

Teams chase fancier embedding models when simply adding BM25 alongside vector search would have fixed their "it can't find exact terms" complaints overnight. Try hybrid before anything exotic.

Reranking: the cheap precision multiplier

Initial retrieval optimises for speed across millions of chunks, so it is deliberately approximate. Reranking adds a second, slower-but-smarter pass over just the top candidates: retrieve, say, the top 50 cheaply, then use a cross-encoder reranker to score how well each one actually answers the query, and keep the best 5. Because the cross-encoder scores only those few dozen candidates rather than the whole corpus, the cost stays modest, and it often produces one of the largest jumps in answer quality in the pipeline.

Stage 1  Retrieve top 50 (fast, approximate)   ── recall-oriented
Stage 2  Rerank those 50 with a cross-encoder   ── precision-oriented
Stage 3  Pass the top 5 to the model             ── grounded generation

Cost: rerank runs on 50 items, not the whole corpus → cheap, high impact.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(question: str, k: int = 5):
    candidates = hybrid_search(question, top_k=50)        # cheap, recall-oriented

    # The cross-encoder scores how well each candidate answers THIS query.
    pairs = [(question, c.text) for c in candidates]
    for cand, score in zip(candidates, reranker.predict(pairs)):
        cand.score = score

    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    return ranked[:k]                                     # precision-oriented top k

Top 5 of 50 candidates: reranking trades a little latency for a large precision gain.

Assembling context the model can use

Once you have the best chunks, how you present them matters. Deduplicate near-identical passages so the model sees diverse evidence. Order by relevance, and be aware that models attend more strongly to the start and end of long contexts, so do not bury the best chunk in the middle. Label each chunk with a source id so the model can cite it, and keep the total tight: more context is not better context.
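One way those steps could fit together is sketched below. Everything here is an assumption made for illustration: the chunk dict shape (`id`, `text`, `score`), the exact-match deduplication (a crude stand-in for real near-duplicate detection), and the edge-first ordering, which is only one of several reasonable layouts.

```python
def assemble_context(chunks, max_chunks=5):
    """Deduplicate, trim, order, and label chunks for the prompt.

    chunks: list of dicts with 'id', 'text', and 'score' (higher is better).
    """
    # 1. Deduplicate on normalised text, keeping the highest-scoring copy.
    seen, unique = set(), []
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        key = " ".join(chunk["text"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(chunk)

    # 2. Trim to the best few.
    top = unique[:max_chunks]

    # 3. Put the strongest evidence at the edges: alternate chunks between
    #    the front and the back so the weakest ones land in the middle.
    front, back = [], []
    for i, chunk in enumerate(top):
        (front if i % 2 == 0 else back).append(chunk)
    ordered = front + back[::-1]

    # 4. Label each chunk with its source id so the model can cite it.
    return "\n\n".join(f"[{c['id']}] {c['text']}" for c in ordered)

chunks = [
    {"id": "a", "text": "E-402 refunds: 14 days.", "score": 0.9},
    {"id": "b", "text": "Refunds need a receipt.", "score": 0.8},
    {"id": "a-copy", "text": "e-402 refunds:  14 days.", "score": 0.85},
    {"id": "c", "text": "Returns are hassle-free.", "score": 0.7},
    {"id": "d", "text": "Shipping takes 2 days.", "score": 0.6},
]
context = assemble_context(chunks)
print(context)
```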

⚠

Warning

The "lost in the middle" effect

Stuffing twenty chunks into the prompt can score worse than five well-chosen ones, because relevant evidence placed in the middle of a long context gets under-weighted. Curate, order, and trim — quality of context beats quantity every time.

Hallucination control

Even with perfect retrieval, a model can embellish. Controlling that is a layered job, not a single setting.

Instruct grounding and abstention

Tell the model to answer only from the provided context and to say "I don't know" when the context is insufficient. Make abstention an explicit, acceptable outcome.
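A minimal sketch of such an instruction, built as a plain string. The wording and the `build_grounded_prompt` name are assumptions, not a tested or recommended prompt; treat the text as a starting point to tune against your own evaluation set.

```python
ABSTAIN = "I don't know based on the provided sources."

def build_grounded_prompt(question, context):
    """Wrap the question and labelled sources in grounding instructions."""
    return (
        "Answer the question using ONLY the sources below.\n"
        "Cite the source id in square brackets after each claim.\n"
        f'If the sources do not contain the answer, reply exactly: "{ABSTAIN}"\n\n'
        f"Sources:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_grounded_prompt(
    "What's the refund window for error code E-402?",
    "[doc-1] E-402 refunds: 14 days.",
)
print(prompt)
```

Spelling out the exact abstention sentence makes it easy to detect abstentions downstream with a simple string comparison.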

Set a relevance threshold

If the best retrieved chunk scores below a confidence bar, do not answer from weak evidence — abstain or ask a clarifying question instead of guessing.
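The gate itself can be a few lines. In this sketch the `min_score` value of 0.5 is an arbitrary placeholder and `generate` is a hypothetical callable standing in for the model call; scores from different retrievers and rerankers sit on different scales, so the bar needs calibrating on your own data.

```python
FALLBACK = "I couldn't find a reliable source for that. Could you add more detail?"

def answer_or_abstain(question, ranked, generate, min_score=0.5):
    """Answer only when the best evidence clears the confidence bar.

    ranked: list of (chunk_text, score) pairs, best first.
    generate: callable(question, passages) -> answer string.
    """
    if not ranked or ranked[0][1] < min_score:
        return FALLBACK  # abstain instead of guessing from weak evidence
    passages = [text for text, score in ranked if score >= min_score]
    return generate(question, passages)

def fake_generate(question, passages):
    """Stand-in for a real model call, used only for this example."""
    return f"answered from {len(passages)} passage(s)"

weak = [("hassle-free returns!", 0.21)]
strong = [("E-402 -> 14 days", 0.93), ("refund overview", 0.61), ("blog post", 0.12)]

print(answer_or_abstain("refund window?", weak, fake_generate))
print(answer_or_abstain("refund window?", strong, fake_generate))
```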

Require citations

Demand a source id after each claim. Citations make the answer auditable and discourage the model from asserting things no chunk supports.
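A citation requirement is only useful if something checks it. The sketch below is a structural check, not a proof of support: it flags sentences with no citation and citations that point at ids the model was never given. It assumes citations look like `[doc-1]` and appear before the sentence's final punctuation; the function name and format are hypothetical.

```python
import re

def check_citations(answer, allowed_ids):
    """Return (uncited_sentences, unknown_ids) for a generated answer."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    uncited, unknown = [], set()
    for sentence in sentences:
        ids = re.findall(r"\[([^\]]+)\]", sentence)
        if not ids:
            uncited.append(sentence)
        unknown.update(i for i in ids if i not in allowed_ids)
    return uncited, sorted(unknown)

answer = (
    "The refund window is 14 days [doc-1]. "
    "Returns are free [doc-9]. "
    "Support is open daily."
)
uncited, unknown = check_citations(answer, allowed_ids={"doc-1", "doc-2"})
print(uncited)  # sentences with no citation at all
print(unknown)  # cited ids that were never in the context
```

Whether a cited chunk actually supports the claim is a separate, harder check that typically needs an entailment model or a second model pass.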

Verify when stakes are high

For critical answers, add a check that each claim is actually supported by a cited chunk, and flag or suppress unsupported ones.

Measure, or you are guessing

Every technique here is a knob, and knobs need a dial. Maintain an evaluation set of real questions with known-good evidence and answers. Measure retrieval (did the right chunk make the final cut?) and generation (is the answer faithful, relevant, and correctly cited?) separately, and re-run the suite on every change. Without it, "we added reranking" is a vibe, not a result.
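For the retrieval half, two simple numbers go a long way: hit rate at k (did any known-good chunk make the final cut?) and mean reciprocal rank (how high did it land?). The sketch below assumes an eval set shaped as a list of dicts and a `retrieve` callable that returns chunk ids best-first; both shapes are hypothetical.

```python
def evaluate_retrieval(eval_set, retrieve, k=5):
    """Compute hit rate@k and mean reciprocal rank over an eval set.

    eval_set: list of {'question': str, 'relevant_ids': set of chunk ids}.
    retrieve: callable(question) -> list of chunk ids, best first.
    """
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    hits, reciprocal_ranks = 0, []
    for case in eval_set:
        retrieved = retrieve(case["question"])[:k]
        # 1-based rank of the first relevant chunk, or None if it is missing.
        rank = next(
            (i for i, cid in enumerate(retrieved, start=1) if cid in case["relevant_ids"]),
            None,
        )
        if rank is not None:
            hits += 1
        reciprocal_ranks.append(1 / rank if rank else 0.0)
    n = len(eval_set)
    return {"hit_rate": hits / n, "mrr": sum(reciprocal_ranks) / n}

# Canned retrieval results standing in for a real pipeline.
canned = {"q1": ["a", "b"], "q2": ["x", "b"], "q3": ["x", "y"]}
eval_set = [
    {"question": "q1", "relevant_ids": {"a"}},
    {"question": "q2", "relevant_ids": {"b"}},
    {"question": "q3", "relevant_ids": {"z"}},
]
metrics = evaluate_retrieval(eval_set, lambda q: canned[q], k=5)
print(metrics)
```

Running the same function before and after a pipeline change gives a concrete number for whether the change helped retrieval.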

✓ Pros

Structure-aware or parent-child chunking with sensible overlap
Hybrid retrieval (BM25 + vector) fused with RRF
A cross-encoder reranking pass over the top candidates
Deduplicated, relevance-ordered, citation-labelled context
Grounding, thresholds, citations, and an eval set for both stages

✕ Cons

No single-shot naive top-k as your final retrieval
No dumping every retrieved chunk into the prompt unfiltered
No answering from low-confidence retrievals instead of abstaining
No shipping pipeline changes without re-running evals

! Common mistakes to avoid

✕ Relying on a single naive top-k vector search.
✓ Add keyword (BM25) search and fuse with Reciprocal Rank Fusion so exact terms are not missed.

✕ Stuffing twenty retrieved chunks into the prompt.
✓ Deduplicate, rerank, and pass only the best few; relevant evidence buried mid-context gets under-weighted.

✕ Skipping the reranking pass to save latency.
✓ Rerank the top ~50 candidates; it runs on a few dozen items, not the whole corpus, and often gives a large quality jump.

✕ Measuring only the final answer.
✓ Evaluate retrieval separately; a great answer is impossible if the right chunk was never retrieved.

? Frequently asked questions

Why does naive vector search return wrong-but-plausible results?

It optimises for semantic similarity only, so it misses exact terms (codes, names), returns near-duplicates, and has no sense of authority or freshness. The model then answers confidently from mediocre context.

What is hybrid search and why does it help?

Hybrid combines keyword (BM25) and semantic search, which fail in opposite ways. Fusing them — often with Reciprocal Rank Fusion — catches both "what they meant" and "the literal term they typed".

What does reranking do?

It is a second, smarter pass that re-scores the top candidates from initial retrieval with a cross-encoder, pushing the truly relevant chunks to the top before generation. Because it runs on few items, it is cheap and high-impact.

What is the parent-child chunking pattern?

You embed small, precise child chunks for accurate retrieval but feed the larger parent section to the model for context — getting precision in search and completeness in generation.

What is the "lost in the middle" problem?

Models attend more to the start and end of a long context, so relevant evidence placed in the middle gets under-weighted. Curate, order by relevance, and trim rather than over-stuffing the prompt.

✓

Success

The compounding payoff

None of these techniques is exotic, and each adds a slice of quality: better chunks feed better hybrid retrieval, reranking sharpens what survives, and hallucination controls keep the model honest about the rest. Stacked together they are the difference between a RAG demo you show once and a RAG product people rely on.

Bishrul Haq
