Published: Jun 18, 2026

RAG Is Not Just Search: Chunking, Ranking, Reranking, and Hallucination Control

"RAG is just search plus a prompt" is the sentence that produces most disappointing RAG systems. Plain semantic search gets you a demo; it does not get you answers users trust. The gap between the two is a stack of unglamorous techniques (careful chunking, hybrid ranking, a reranking pass, and explicit hallucination controls) that turn "vaguely related paragraphs" into "the right evidence, in the right order, with the model honest about its limits." This guide is about that gap and how to close it.

What separates a RAG demo from a RAG product

  • Why naive top-k vector search retrieves plausible-but-wrong context
  • Chunking strategies that preserve meaning instead of shredding it
  • Hybrid ranking: combining keyword and semantic signals
  • Reranking: a cheap second pass that dramatically lifts precision
  • Hallucination control: grounding, thresholds, and graceful abstention

Info

The mental shift

Retrieval is not "find similar text." It is "assemble the minimal set of evidence that lets the model answer correctly, and nothing that distracts it." Every technique here serves that goal: relevance, ordering, and honesty.

Why naive vector search disappoints

A single embedding search returns the k chunks whose vectors sit closest to the question's vector. That sounds right, but it fails in predictable ways: it misses exact terms (product codes, error numbers, names) that semantic similarity blurs over; it returns near-duplicates that crowd out diverse evidence; and it has no notion of authority or freshness, so a stale draft can outrank the current policy. The model then dutifully answers from mediocre context and sounds confident doing it.

Question: "What's the refund window for error code E-402?"

Naive top-k may return:
  ✓ a paragraph about refunds (semantically close)
  ✗ three near-duplicate marketing blurbs about "hassle-free returns"
  ✗ misses the support doc that literally lists "E-402 → 14 days"
       because that doc is terse and embeds far from the chatty query

Chunking: the foundation everything rests on

If a chunk does not contain a coherent, self-contained idea, no amount of clever ranking rescues it. Chunking is where most quality is won or lost, and the right strategy depends on your content.

Strategies, from blunt to smart

Do

  • Structure-aware: split on headings/sections so chunks match real topics
  • Recursive character splitting with overlap: solid general-purpose default
  • Sentence/semantic chunking: group sentences that belong together
  • Parent-child: embed small chunks, but feed the larger parent to the model

Avoid

  • Fixed-size character splits that cut sentences mid-thought
  • Chunks so large they bury the relevant line in noise
  • Chunks so small they lose the context that gives them meaning
  • Dropping headings/titles, leaving chunks ambiguous out of context
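As a rough illustration of sentence-aware chunking with overlap, here is a minimal sketch. The function name `chunk_sentences`, the regex-based sentence splitter, and the `max_chars` and `overlap` defaults are all assumptions made for this example; a real pipeline would likely use a proper sentence tokenizer and count tokens rather than characters.

```python
import re


def chunk_sentences(text: str, max_chars: int = 200, overlap: int = 1) -> list[str]:
    """Group whole sentences into chunks of roughly max_chars.

    The last `overlap` sentences of each chunk are repeated at the start of
    the next one. A single over-long sentence is kept whole, never cut.
    """
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry the tail forward so neighbouring chunks share context.
            current = current[-overlap:] if overlap else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because chunks are built from whole sentences, no chunk ends mid-thought, and the overlap gives each chunk a little of the context that precedes it.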

Pro tip

The parent-child pattern is often one of the highest-leverage upgrades: index small, precise child chunks for accurate retrieval, but pass the surrounding parent section to the model so it has enough context to answer. You get precision in search and completeness in generation.
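The parent-child pattern can be sketched in a few lines. Everything named here (`Child`, `split_children`, `toy_score`, `retrieve_parents`) is hypothetical, and `toy_score` is a word-overlap stand-in for real vector similarity, used only so the example runs without an embedding model.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Child:
    parent_id: str
    text: str


def split_children(parents: dict[str, str]) -> list[Child]:
    """One child chunk per sentence, each remembering its parent section."""
    children = []
    for parent_id, section in parents.items():
        for sentence in re.split(r"(?<=[.!?])\s+", section.strip()):
            if sentence:
                children.append(Child(parent_id, sentence))
    return children


def toy_score(query: str, text: str) -> int:
    """Stand-in for vector similarity (assumption): count shared words."""
    def words(s: str) -> set[str]:
        return set(re.findall(r"[a-z0-9-]+", s.lower()))
    return len(words(query) & words(text))


def retrieve_parents(query: str, parents: dict[str, str], top_k: int = 2) -> list[str]:
    """Search over small children, then return their unique parent sections."""
    children = split_children(parents)
    hits = sorted(children, key=lambda c: toy_score(query, c.text), reverse=True)[:top_k]
    seen: set[str] = set()
    sections = []
    for hit in hits:
        if hit.parent_id not in seen:  # several children may share one parent
            seen.add(hit.parent_id)
            sections.append(parents[hit.parent_id])
    return sections
```

The key design choice is that scoring happens on the small child text while the return value is the full parent section, deduplicated so two hits from the same section do not send it to the model twice.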

Hybrid ranking: keywords and vectors together

Semantic search understands meaning; keyword search (BM25) nails exact terms. They fail in opposite ways, which is exactly why combining them wins. Hybrid retrieval runs both and fuses the results — commonly with Reciprocal Rank Fusion (RRF), which blends rankings without needing the two scoring scales to be comparable. The result catches both "what they meant" and "the literal code they typed."

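# Reciprocal Rank Fusion: combine a keyword ranking and a vector ranking.
def rrf(rankings, k=60):
    """Fuse ranked lists of doc ids; k dampens the advantage of top ranks."""
    scores = {}
    for ranking in rankings:                 # each is an ordered list of doc ids
        for rank, doc_id in enumerate(ranking, start=1):   # ranks are 1-based
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# bm25_results and vector_results are the ordered doc-id lists from each retriever.
fused = rrf([bm25_results, vector_results])  # best of both worlds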

Tip

Hybrid is often the biggest single win

Teams chase fancier embedding models when simply adding BM25 alongside vector search would have fixed their "it can't find exact terms" complaints overnight. Try hybrid before anything exotic.
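To make the keyword side concrete, here is a small teaching sketch of BM25 ranking in pure Python. The `tokenize` and `bm25_rank` names, the regex tokenizer, and the `k1` and `b` defaults are assumptions for illustration; in practice you would more likely rely on a search engine or an existing BM25 library than on hand-rolled scoring.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Keep hyphenated codes such as "e-402" as single tokens.
    return re.findall(r"[a-z0-9-]+", text.lower())


def bm25_rank(query: str, docs: dict[str, str], k1: float = 1.5, b: float = 0.75) -> list[str]:
    """Return ids of docs sharing at least one query term, best BM25 score first."""
    if not docs:
        return []
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    avg_len = sum(len(t) for t in tokens.values()) / len(tokens)
    # Document frequency: in how many docs does each term appear?
    doc_freq = Counter(term for t in tokens.values() for term in set(t))
    query_terms = set(tokenize(query))
    scores = {}
    for doc_id, terms in tokens.items():
        tf = Counter(terms)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            n = doc_freq[term]
            idf = math.log((len(tokens) - n + 0.5) / (n + 0.5) + 1)
            # Length normalisation: long docs need more matches to score well.
            norm = tf[term] + k1 * (1 - b + b * len(terms) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)
```

Note that exact-term matching is both the strength and the limit here: a query containing "E-402" finds the terse support doc, but "refund" does not match "refunds" without stemming, which is where the vector side of the hybrid helps.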

Reranking: the cheap precision multiplier

Initial retrieval optimises for speed across millions of chunks, so it is deliberately approximate. Reranking adds a second, slower-but-smarter pass over just the top candidates: retrieve, say, the top 50 cheaply, then use a cross-encoder reranker to score how well each one actually answers the query, and keep the best 5. Because it runs on only a few dozen candidates rather than the whole corpus, it is affordable, and it often produces one of the largest jumps in answer quality in the pipeline.

Stage 1  Retrieve top 50 (fast, approximate)   ── recall-oriented
Stage 2  Rerank those 50 with a cross-encoder   ── precision-oriented
Stage 3  Pass the top 5 to the model             ── grounded generation

Cost: rerank runs on 50 items, not the whole corpus → cheap, high impact.

from sentence_transformers import CrossEncoder

# Load once at startup; the model weights are downloaded on first use.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(question: str, k: int = 5):
    # hybrid_search is the first-stage retriever (not defined here); it is
    # expected to return candidate objects with a .text attribute.
    candidates = hybrid_search(question, top_k=50)        # cheap, recall-oriented
    if not candidates:
        return []

    # The cross-encoder scores how well each candidate answers THIS query.
    pairs = [(question, c.text) for c in candidates]
    for cand, score in zip(candidates, reranker.predict(pairs)):
        cand.score = float(score)

    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    return ranked[:k]                                     # precision-oriented top k

Top 5 of 50 candidates: reranking trades a little latency for a large precision gain.

Assembling context the model can use

Once you have the best chunks, how you present them matters. Deduplicate near-identical passages so the model sees diverse evidence. Order by relevance, and be aware that models attend more strongly to the start and end of long contexts, so do not bury the best chunk in the middle. Label each chunk with a source id so the model can cite it, and keep the total tight: more context is not better context.

Warning

The "lost in the middle" effect

Stuffing twenty chunks into the prompt can score worse than five well-chosen ones, because relevant evidence placed in the middle of a long context gets under-weighted. Curate, order, and trim — quality of context beats quantity every time.
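The assembly steps (deduplicate, trim, order, label) might look like the sketch below. The chunk shape (dicts with `id` and `text`, already sorted best-first), the Jaccard threshold of 0.8, and the function names are assumptions for this example. The edge-first ordering is one possible way to act on the "lost in the middle" observation, not a prescribed method.

```python
import re


def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9-]+", text.lower()))


def _jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dedupe(chunks: list[dict], threshold: float = 0.8) -> list[dict]:
    """Keep a chunk only if it is not a near-copy of a better-ranked one."""
    kept: list[dict] = []
    for chunk in chunks:  # chunks arrive best-first
        words = _words(chunk["text"])
        if all(_jaccard(words, _words(k["text"])) < threshold for k in kept):
            kept.append(chunk)
    return kept


def edge_order(ranked: list) -> list:
    """Put the strongest items at the start and end, the weakest in the middle."""
    front, back = [], []
    for i, item in enumerate(ranked):
        (front if i % 2 == 0 else back).append(item)
    return front + back[::-1]


def build_context(chunks: list[dict], max_chunks: int = 5) -> str:
    """Deduplicate, trim to max_chunks, reorder, and label with source ids."""
    selected = edge_order(dedupe(chunks)[:max_chunks])
    return "\n\n".join(f"[{c['id']}] {c['text']}" for c in selected)
```

With five ranked chunks, `edge_order` yields ranks 1, 3, 5, 4, 2: the best chunk opens the context, the second best closes it, and the weakest sits in the middle where under-weighting costs the least.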

Hallucination control

Even with perfect retrieval, a model can embellish. Controlling that is a layered job, not a single setting.

1. Instruct grounding and abstention

Tell the model to answer only from the provided context and to say "I don't know" when the context is insufficient. Make abstention an explicit, acceptable outcome.

2. Set a relevance threshold

If the best retrieved chunk scores below a confidence bar, do not answer from weak evidence: abstain or ask a clarifying question instead of guessing.

3. Require citations

Demand a source id after each claim. Citations make the answer auditable and discourage the model from asserting things no chunk supports.

4. Verify when stakes are high

For critical answers, add a check that each claim is actually supported by a cited chunk, and flag or suppress unsupported ones.
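The four layers above can be sketched as three small helpers. All names, the prompt wording, the chunk shape (dicts with `id`, `text`, `score`), and the `min_score` value are assumptions for illustration. In particular, the right threshold depends on your reranker's score scale (some cross-encoders emit unbounded logits rather than 0 to 1 values), so treat 0.5 as a placeholder to tune on your own evaluation set.

```python
import re

ABSTAIN = "I don't know based on the available sources."


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Layers 1 and 3: ground the model, allow abstention, demand citations."""
    context = "\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return (
        "Answer using only the sources below. Cite a source id in square "
        "brackets after each claim. If the sources are insufficient, reply "
        f'exactly: "{ABSTAIN}"\n\n'
        f"Sources:\n{context}\n\nQuestion: {question}"
    )


def should_abstain(chunks: list[dict], min_score: float = 0.5) -> bool:
    """Layer 2: refuse to answer when no chunk clears the relevance bar."""
    return not chunks or max(c["score"] for c in chunks) < min_score


def flag_unsupported(answer: str, chunks: list[dict]) -> list[str]:
    """Layer 4 (partial): flag sentences with no citation or an unknown id.

    Assumes citations sit before the sentence's final punctuation. This only
    checks that a valid id is cited, not that the chunk supports the claim;
    that stronger check needs an entailment model or a second LLM pass.
    """
    known = {c["id"] for c in chunks}
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sentence:
            continue
        cited = set(re.findall(r"\[([^\]]+)\]", sentence))
        if not cited or not cited <= known:
            flagged.append(sentence)
    return flagged
```

A typical flow would call `should_abstain` before generation, send `build_prompt` to the model otherwise, and run `flag_unsupported` on the reply before showing it to the user.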

Measure, or you are guessing

Every technique here is a knob, and knobs need a dial. Maintain an evaluation set of real questions with known-good evidence and answers. Measure retrieval (did the right chunk make the final cut?) and generation (is the answer faithful, relevant, and correctly cited?) separately, and re-run the suite on every change. Without it, "we added reranking" is a vibe, not a result.
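For the retrieval half of that evaluation, two common metrics are easy to compute yourself. This sketch assumes a hypothetical data shape: `gold` maps each question to the set of chunk ids that contain the right evidence, and `results` maps each question to the ranked chunk ids your pipeline returned. Generation quality (faithfulness, citation correctness) needs human or model-based grading and is not covered here.

```python
def hit_rate_at_k(results: dict[str, list[str]], gold: dict[str, set[str]], k: int = 5) -> float:
    """Fraction of questions with at least one gold chunk in the top k."""
    if not gold:
        return 0.0
    hits = sum(
        1 for question, relevant in gold.items()
        if relevant & set(results.get(question, [])[:k])
    )
    return hits / len(gold)


def mean_reciprocal_rank(results: dict[str, list[str]], gold: dict[str, set[str]]) -> float:
    """Average of 1/rank of the first gold chunk (0 when none is retrieved)."""
    if not gold:
        return 0.0
    total = 0.0
    for question, relevant in gold.items():
        for rank, chunk_id in enumerate(results.get(question, []), start=1):
            if chunk_id in relevant:
                total += 1.0 / rank
                break
    return total / len(gold)
```

Running both before and after a change (say, adding a reranker) shows whether the right chunk now makes the cut more often and whether it lands higher in the list.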

Do

  • Structure-aware or parent-child chunking with sensible overlap
  • Hybrid retrieval (BM25 + vector) fused with RRF
  • A cross-encoder reranking pass over the top candidates
  • Deduplicated, relevance-ordered, citation-labelled context
  • Grounding, thresholds, citations, and an eval set for both stages

Avoid

  • Single-shot naive top-k as your final retrieval
  • Dumping every retrieved chunk into the prompt unfiltered
  • Answering from low-confidence retrievals instead of abstaining
  • Shipping pipeline changes without re-running evals

Common mistakes to avoid

  • Relying on a single naive top-k vector search.

    Add keyword (BM25) search and fuse with Reciprocal Rank Fusion so exact terms are not missed.

  • Stuffing twenty retrieved chunks into the prompt.

    Deduplicate, rerank, and pass only the best few — relevant evidence buried mid-context gets ignored.

  • Skipping the reranking pass to save latency.

    Rerank the top ~50 candidates; it runs on only a few dozen items and often gives one of the biggest quality jumps.

  • Measuring only the final answer.

    Evaluate retrieval separately — a great answer is impossible if the right chunk was never retrieved.

Frequently asked questions

Why does naive vector search return wrong-but-plausible results?

It optimises for semantic similarity only, so it misses exact terms (codes, names), returns near-duplicates, and has no sense of authority or freshness. The model then answers confidently from mediocre context.

What is hybrid search and why does it help?

Hybrid combines keyword (BM25) and semantic search, which fail in opposite ways. Fusing them — often with Reciprocal Rank Fusion — catches both "what they meant" and "the literal term they typed".

What does reranking do?

It is a second, smarter pass that re-scores the top candidates from initial retrieval with a cross-encoder, pushing the truly relevant chunks to the top before generation. Because it runs on few items, it is cheap and high-impact.

What is the parent-child chunking pattern?

You embed small, precise child chunks for accurate retrieval but feed the larger parent section to the model for context — getting precision in search and completeness in generation.

What is the "lost in the middle" problem?

Models attend more to the start and end of a long context, so relevant evidence placed in the middle gets under-weighted. Curate, order by relevance, and trim rather than over-stuffing the prompt.

Success

The compounding payoff

None of these techniques is exotic, and each adds a slice of quality: better chunks feed better hybrid retrieval, reranking sharpens what survives, and hallucination controls keep the model honest about the rest. Stacked together they are the difference between a RAG demo you show once and a RAG product people rely on.
