Published

Jun 15, 2026


Building a Private RAG System: Architecture, Embeddings, Vector DBs, and Evaluation

Retrieval-Augmented Generation (RAG) is how you make a language model answer questions about your data (your docs, your tickets, your codebase) without retraining anything. A "private" RAG system keeps that data inside your own infrastructure, which for many organisations is the only way the project gets legal sign-off. This guide walks the full pipeline end to end: ingestion, chunking, embeddings, the vector store, retrieval, generation, and the evaluation loop that tells you whether any of it actually works.

What you will build a mental model of

  • The two-phase RAG pipeline: offline indexing and online querying
  • How chunking and embeddings turn documents into searchable vectors
  • Choosing a vector database and the right similarity metric
  • Wiring retrieval into a grounded, citation-friendly prompt
  • Evaluating retrieval and answer quality so you can improve it

Info

Why RAG instead of fine-tuning

Fine-tuning teaches a model a style or skill; it is a poor and expensive way to teach it facts that change. RAG injects current, source-attributable facts at query time, so updating knowledge means re-indexing a document, not retraining a model. For most "answer questions about our data" needs, RAG is the right tool.

The big picture: two pipelines

Every RAG system is really two pipelines that share a vector store. The offline pipeline runs whenever your data changes; the online pipeline runs on every user question. Keeping them mentally separate makes the whole design tractable.

OFFLINE (indexing, runs when data changes)
  documents ─▶ clean ─▶ chunk ─▶ embed ─▶ store vectors + metadata
                                              │
                                       [ VECTOR DB ]
                                              │
ONLINE (querying, runs per question)          ▼
  question ─▶ embed ─▶ similarity search ─▶ top-k chunks
          ─▶ build grounded prompt ─▶ LLM ─▶ answer + citations

Step 1: Ingestion and cleaning

Garbage in, garbage out applies brutally to RAG. Before anything else, extract clean text from your sources (PDFs, HTML, wikis, tickets), strip boilerplate like nav bars and footers, and preserve structure (headings, lists, tables), because that structure carries meaning the model will use. Attach metadata to every document now: source, URL, author, last-updated, access level. You will need it for filtering and citations later.
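As a rough illustration of the cleaning step, here is a minimal standard-library sketch for HTML sources. The set of tags treated as boilerplate, the `Document` shape, the metadata keys, and the sample page (including its URL and content) are all assumptions made for the example; real pipelines usually need per-source rules and a proper extractor for PDFs.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate (an assumption; adjust per site).
SKIP_TAGS = {"nav", "footer", "header", "script", "style"}
HEADING_TAGS = {"h1", "h2", "h3"}


class TextExtractor(HTMLParser):
    """Collects visible text, skipping boilerplate tags and marking headings."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0      # greater than 0 while inside a skipped tag
        self.in_heading = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in HEADING_TAGS:
            self.in_heading = True

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in HEADING_TAGS:
            self.in_heading = False

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self.skip_depth == 0:
            # Keep headings recognisable so later steps can use the structure.
            self.parts.append(f"# {text}" if self.in_heading else text)


@dataclass
class Document:
    text: str
    metadata: dict = field(default_factory=dict)


def ingest_html(html: str, **metadata) -> Document:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return Document(text="\n".join(parser.parts), metadata=metadata)


# Made-up page and metadata, for illustration only.
doc = ingest_html(
    "<nav>Home | Docs</nav><h1>Refunds</h1><p>Refunds take 5 days.</p>"
    "<footer>Copyright</footer>",
    source="wiki", url="https://example.internal/refunds", access_level=1,
)
print(doc.text)
```

The nav and footer text is dropped, the heading survives as a marked line, and the metadata travels with the text so later steps can filter and cite.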

Step 2: Chunking, the most underrated decision

Models retrieve and reason over chunks, not whole documents. Chunk too large and you dilute relevance and waste context; chunk too small and you sever the meaning that spans sentences. Good defaults are a few hundred tokens per chunk with modest overlap, so ideas straddling a boundary are not lost, but the right size is content-dependent and worth tuning against your evals.

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,          # characters; tune against your eval set
    chunk_overlap=120,       # keep ideas that straddle a boundary
    separators=["\n\n", "\n", ". ", " "],  # prefer natural boundaries
)
chunks = splitter.split_text(clean_document_text)
💡

Pro tip

Store the source heading and document title alongside each chunk and prepend them to the chunk text before embedding. A bare paragraph is ambiguous; the same paragraph under "Refunds > Eligibility" embeds into a far more findable vector. This one trick noticeably lifts retrieval quality.
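The tip above might look like this in practice. The `contextualize` helper, the record fields, the separator format, and the sample text are assumptions for illustration; the point is only that the text you embed carries its title and heading trail while the raw chunk is kept for display.

```python
def contextualize(chunk_text: str, title: str, heading_path: list[str]) -> str:
    """Return the text to embed: title and heading trail, then the chunk."""
    trail = " > ".join(heading_path)
    header = f"{title} | {trail}" if trail else title
    return f"{header}\n{chunk_text}"


# Made-up record; field names are hypothetical.
record = {
    "title": "Billing Handbook",
    "heading_path": ["Refunds", "Eligibility"],
    "text": "Requests must be made within 30 days of purchase.",
}

# Embed `embed_text`, but keep the raw `text` for display and citations.
record["embed_text"] = contextualize(
    record["text"], record["title"], record["heading_path"]
)
print(record["embed_text"])
```

Whether this helps, and by how much, is something to confirm against your own eval set rather than assume.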

Step 3: Embeddings

An embedding model turns each chunk into a vector, a list of numbers, such that semantically similar text lands close together in space. The same model embeds your chunks at index time and your questions at query time; you must use the same model for both, or the geometry does not line up. For a private system, run an open embedding model locally so your text never leaves the network.
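To make "close together in space" concrete, here is a small sketch of cosine similarity. The three-dimensional vectors are hand-made stand-ins, not the output of any real model (real embeddings have hundreds of dimensions or more); only the comparison logic is the point.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made 3-dimensional vectors standing in for real embeddings.
toy_vectors = {
    "How do I get a refund?": [0.9, 0.1, 0.0],
    "Refund eligibility rules": [0.8, 0.2, 0.1],
    "Rotating API keys": [0.0, 0.1, 0.9],
}

query = toy_vectors["How do I get a refund?"]
for text, vec in toy_vectors.items():
    print(f"{cosine_similarity(query, vec):.3f}  {text}")
```

The two refund texts score close to 1 against each other, while the unrelated text scores near 0, which is the behaviour a retrieval step relies on.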

Warning

Never mix embedding models

Vectors from two different embedding models are not comparable — searching a Model-A index with a Model-B query vector returns nonsense. If you change embedding models, you must re-embed your entire corpus. Pin the model and version, and record it in your index metadata.
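One way to enforce this rule is to store a small description of the embedding model with the index and compare it before every search. The sketch below assumes a hypothetical `EmbeddingSpec` record; the model names, versions, and dimensions are placeholders, not real model identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    model_name: str
    model_version: str
    dimensions: int


class EmbeddingMismatchError(RuntimeError):
    pass


def check_compatible(index_spec: EmbeddingSpec, query_spec: EmbeddingSpec) -> None:
    """Raise before searching if the query model differs from the index model."""
    if index_spec != query_spec:
        raise EmbeddingMismatchError(
            f"index built with {index_spec}, query uses {query_spec}; "
            "re-embed the corpus before switching models"
        )


# Placeholder names and versions, not real model identifiers.
index_spec = EmbeddingSpec("local-embedder", "1.0", 384)
check_compatible(index_spec, EmbeddingSpec("local-embedder", "1.0", 384))  # passes

try:
    check_compatible(index_spec, EmbeddingSpec("other-embedder", "2.0", 768))
except EmbeddingMismatchError as exc:
    print(f"refused: {exc}")
```

Failing loudly at query time is cheaper than debugging silently degraded search results after a model upgrade.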

Step 4: The vector database

The vector store holds your chunk vectors plus metadata and answers "which chunks are most similar to this query vector?" fast, using approximate nearest-neighbour (ANN) search. The options span embedded libraries to full servers, and the right pick depends on scale and ops appetite.

Pros

  • pgvector — add vectors to Postgres you already run; great for moderate scale
  • Qdrant / Milvus / Weaviate — purpose-built, scale to large corpora
  • FAISS / Chroma — embedded, perfect for prototyping and small sets
  • Metadata filtering lets you enforce access control at query time

Cons

  • A new server is real ops: back it up, monitor it, secure it
  • ANN trades a little recall for big speed; tune the index parameters
  • Re-indexing large corpora is slow; plan capacity and timing
  • Cosine vs dot vs Euclidean must match your embedding model
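The last point is easier to see with numbers. The sketch below uses hand-picked 2-dimensional vectors (an assumption for illustration) to show that the three metrics can rank the same documents differently on raw vectors, and agree once every vector is normalised to unit length.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine_distance(a, b):
    return 1.0 - dot(a, b) / (norm(a) * norm(b))


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(a):
    n = norm(a)
    return [x / n for x in a]


def rankings(query, docs):
    """Rank document names from closest to farthest under each metric."""
    def order(score, reverse=False):
        return sorted(docs, key=lambda name: score(query, docs[name]), reverse=reverse)
    return {
        "cosine": order(cosine_distance),          # smaller distance is closer
        "dot": order(dot, reverse=True),           # larger product is closer
        "euclidean": order(euclidean_distance),    # smaller distance is closer
    }


# Hand-picked toy vectors: one short and aligned with the query, one long and off-axis.
query = [1.0, 0.0]
docs = {"short_aligned": [0.5, 0.0], "long_offaxis": [3.0, 3.0]}

raw = rankings(query, docs)
unit = rankings(normalize(query), {n: normalize(v) for n, v in docs.items()})
print(raw)   # dot product favours the long vector; cosine and Euclidean do not
print(unit)  # all three metrics agree after normalisation
```

Check which metric your embedding model was trained for and whether it emits normalised vectors, then configure the index to match.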
💡

Tip

If you already run Postgres, start with pgvector

It keeps your vectors, metadata, and existing relational data in one system you already back up and secure. Many private RAG systems never need to outgrow it; reach for a dedicated vector DB only when scale or recall genuinely demands it.
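For orientation, a pgvector setup for the `chunks` table queried in Step 5 might look like the following. This is a sketch, not a prescribed schema: the column types, the index name, and the dimension of 384 are assumptions, and the HNSW index type requires pgvector 0.5.0 or later. The SQL is kept in a Python string so you can run it through whatever migration tool or driver you already use.

```python
# Placeholder dimension; it must equal your embedding model's output size.
EMBEDDING_DIM = 384

# Assumed layout for a table with id, title, text, acl_level, and embedding columns.
SCHEMA_SQL = f"""
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS chunks (
    id         text PRIMARY KEY,
    title      text NOT NULL,
    text       text NOT NULL,
    acl_level  integer NOT NULL,
    embedding  vector({EMBEDDING_DIM}) NOT NULL
);

-- HNSW index for cosine distance, the metric behind the <=> operator.
CREATE INDEX IF NOT EXISTS chunks_embedding_idx
    ON chunks USING hnsw (embedding vector_cosine_ops);
"""

print(SCHEMA_SQL)
```

The operator class on the index has to match the operator used in queries: `vector_cosine_ops` pairs with `<=>`, so a query ordered by a different distance operator would not use this index.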

Step 5: Retrieval and the grounded prompt

At query time you embed the question, retrieve the top-k most similar chunks (filtered by the user's access level), and assemble a prompt that hands those chunks to the model with strict instructions: answer only from the provided context, cite sources, and say "I don't know" when the context does not contain the answer. That last instruction is what turns a confident hallucinator into a trustworthy assistant. The access filter belongs in the query itself, so a user can never retrieve a chunk they are not allowed to see:

RELEVANCE_THRESHOLD = 0.35   # placeholder cosine-distance cutoff; tune against your eval set


def retrieve(question: str, user, k: int = 6):
    qvec = embed(question)                       # SAME model used at index time

    # pgvector similarity search WITH access control pushed into the SQL.
    # The explicit ::vector cast makes sure the parameter is compared as a pgvector value.
    rows = db.execute(
        """
        SELECT id, title, text, (embedding <=> %(q)s::vector) AS distance
        FROM chunks
        WHERE acl_level <= %(level)s              -- never return chunks above the user's level
        ORDER BY embedding <=> %(q)s::vector      -- cosine distance, ANN-indexed
        LIMIT %(k)s
        """,
        {"q": qvec, "level": user.acl_level, "k": k},
    ).fetchall()

    # Abstain instead of answering from weak evidence.
    # (r.distance assumes the driver returns rows with attribute access.)
    return [r for r in rows if r.distance < RELEVANCE_THRESHOLD]


SYSTEM = (
    "Answer the question using ONLY the context below. "
    "Cite the source id in brackets after each claim, e.g. [doc-12]. "
    "If the context does not contain the answer, say you don't know."
)

top_k_chunks = retrieve(question, user)
if not top_k_chunks:
    answer = "I don't know."                     # nothing relevant retrieved, so skip the LLM call
else:
    context = "\n\n".join(f"[{c.id}] {c.title}\n{c.text}" for c in top_k_chunks)
    prompt = f"{SYSTEM}\n\nContext:\n{context}\n\nQuestion: {question}"
    answer = llm.complete(prompt)

Step 6: Evaluation, the part teams skip and regret

Without evaluation you are tuning blind. RAG quality has two halves, and you must measure both, because a great answer is impossible if retrieval handed the model the wrong chunks.

1. Evaluate retrieval

Build a set of questions with known relevant chunks, then measure whether retrieval surfaces them: recall and precision at k. If the right chunk is not retrieved, no prompt can save the answer.
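Both metrics are a few lines of code. The sketch below assumes each eval case gives you the retrieved chunk ids in rank order plus a hand-labelled set of relevant ids; the ids shown are made up.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return len(set(top) & relevant) / len(top)


# Made-up eval case: ids returned by retrieval, in rank order, and the gold set.
retrieved = ["doc-3", "doc-9", "doc-1", "doc-7"]
relevant = {"doc-1", "doc-2"}

print(recall_at_k(retrieved, relevant, k=3))     # 0.5: one of the two relevant chunks was found
print(precision_at_k(retrieved, relevant, k=3))  # one of the three results is relevant
```

Average these over the whole question set and track them per change; recall usually matters more, since a missing chunk cannot be recovered downstream.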

2. Evaluate the answers

For the same questions, judge the generated answers for faithfulness (no claims beyond the context), relevance, and correct citations. An LLM-as-judge plus a human spot-check is the practical standard.
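Of the three criteria, citation correctness is the easiest to check mechanically. The sketch below assumes answers cite sources in the bracketed form the prompt asks for (such as [doc-12]) and flags any cited id that was never retrieved; the sample answer and ids are made up. It does not judge faithfulness or relevance, which still need an LLM judge or a human.

```python
import re

# Matches bracketed citation ids such as [doc-12].
CITATION = re.compile(r"\[([^\[\]]+)\]")


def check_citations(answer: str, retrieved_ids: set[str]) -> dict:
    cited = set(CITATION.findall(answer))
    return {
        "cited": cited,
        "unknown": cited - retrieved_ids,   # cited but never retrieved
        "has_citation": bool(cited),
    }


report = check_citations(
    "Refunds are available within 30 days [doc-12]. Shipping is free [doc-99].",
    retrieved_ids={"doc-12", "doc-4"},
)
print(report["unknown"])  # {'doc-99'}
```

An answer with no citations, or with citations to ids outside the retrieved set, is a cheap signal worth failing on before any more expensive judging runs.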

3. Track it over time

Run the eval on every change: new chunking, new embedding model, new prompt. RAG is a system of knobs; evaluation is the only way to know which turn helped.
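A simple way to make this routine is to store the metrics from the last accepted run and compare every new run against them. The metric names, scores, and tolerance below are made up for illustration.

```python
def find_regressions(baseline: dict, current: dict, tolerance: float = 0.01) -> dict:
    """Return metrics that dropped by more than `tolerance` versus the baseline."""
    return {
        name: (baseline[name], current[name])
        for name in baseline
        if name in current and current[name] < baseline[name] - tolerance
    }


# Made-up scores for illustration only.
baseline = {"recall@6": 0.82, "precision@6": 0.41, "faithfulness": 0.93}
current = {"recall@6": 0.86, "precision@6": 0.42, "faithfulness": 0.88}

regressions = find_regressions(baseline, current)
print(regressions)  # {'faithfulness': (0.93, 0.88)}
```

Here a change that improved retrieval also lowered faithfulness, which is exactly the kind of trade-off that goes unnoticed without a fixed question set.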

Two halves: retrieval quality and answer quality. Measure both or you are flying blind.

Controlling hallucination

Grounding does not automatically stop a model from inventing things, but several controls together get you most of the way: instruct the model to abstain when context is insufficient, require citations so claims are traceable, set a relevance threshold so weak retrievals trigger "I don't know" instead of a guess, and consider a reranking step to push the truly relevant chunks to the top before generation.
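The reranking step can be sketched as a second scoring pass over the retrieved chunks. The `lexical_overlap` scorer below is a deliberately crude stand-in (word overlap) so the example runs on its own; in practice you would likely plug in a stronger scorer such as a cross-encoder. The function names, the `keep` and `min_score` defaults, and the sample chunks are assumptions.

```python
from typing import Callable


def lexical_overlap(question: str, text: str) -> float:
    """Toy relevance score: share of question words that appear in the text."""
    q_words = set(question.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0


def rerank(question: str, chunks: list[dict],
           score: Callable[[str, str], float] = lexical_overlap,
           keep: int = 3, min_score: float = 0.2) -> list[dict]:
    """Score each chunk against the question, drop weak ones, keep the best."""
    scored = [(score(question, c["text"]), c) for c in chunks]
    scored = [(s, c) for s, c in scored if s >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:keep]]


# Made-up chunks for illustration.
chunks = [
    {"id": "doc-4", "text": "office opening hours and holidays"},
    {"id": "doc-12", "text": "refund policy for annual plans"},
]
top = rerank("refund policy annual plans", chunks)
print([c["id"] for c in top])  # ['doc-12']
```

An empty result from the reranker is a natural trigger for abstention: if nothing clears the minimum score, answer "I don't know" instead of calling the model.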

Danger

A confident wrong answer is worse than no answer

In a private knowledge system, users trust the output. An assistant that fabricates a refund policy or a security procedure does real damage. Bias your system toward abstention: "I could not find that" is a feature, not a failure.

The private RAG checklist

Must have

  • Local embedding and generation models, so data never leaves your network
  • Metadata-based access control enforced at retrieval time
  • Citations on every answer for traceability and trust
  • An eval set covering both retrieval and answer quality
  • A re-indexing pipeline that runs when source data changes

Must avoid

  • No mixing embedding models without full re-indexing
  • No skipping evaluation: you cannot improve what you do not measure
  • No serving answers without access-control filtering
  • No deployment that hallucinates instead of abstaining

Common mistakes to avoid

  • Embedding chunks with one model and queries with another.

    Use the exact same embedding model and version for both; changing it means re-embedding everything.

  • Dumping whole documents in as single chunks.

    Chunk into coherent, self-contained pieces with overlap, and prepend the section heading.

  • Letting the model answer from weak retrievals.

    Set a relevance threshold and instruct it to say "I don't know" rather than guess.

  • Skipping evaluation and tuning blind.

    Measure retrieval and answer quality separately on a fixed question set, on every change.

Frequently asked questions

When should I use RAG instead of fine-tuning?

Use RAG to give a model current, source-attributable facts; updating knowledge then means re-indexing a document, not retraining. Fine-tuning teaches style or skills, not changing facts.

Which vector database should I choose?

If you already run Postgres, start with pgvector — it keeps vectors, metadata, and relational data in one backed-up system. Move to a dedicated store like Qdrant, Milvus, or Weaviate only when scale or recall demands it.

How big should my chunks be?

A few hundred tokens with modest overlap is a solid default, but the right size is content-dependent. Tune it against your evaluation set rather than guessing.

How do I keep it private?

Run open embedding and generation models locally so text never leaves your network, and enforce per-user access control with metadata filters at retrieval time.

How do I stop it hallucinating?

Ground the model in retrieved context only, require citations, set a relevance threshold for abstention, and bias the system toward "I could not find that" over a confident guess.

Success

RAG is plumbing done well

There is no single magic component in RAG — quality comes from doing each unglamorous step well: clean ingestion, smart chunking, consistent embeddings, filtered retrieval, a disciplined prompt, and relentless evaluation. Get those right inside your own walls and you have a private assistant your users can actually trust.
