Published

5 hours ago

Jun 10, 2026


Building a RAG Service with FastAPI: Chunking, Embeddings, and Vector Search

Part 10 builds the feature most LLM apps need: answering questions over your own documents. Retrieval-augmented generation, or RAG, finds the passages most relevant to a question and hands them to the model as context. This part walks through the full pipeline (chunking, embeddings, vector search, and a grounded answer) and wires it into a FastAPI endpoint.

What you will build

  • A chunking step that splits documents into searchable pieces
  • Embeddings that turn text into vectors for similarity search
  • Cosine similarity search you can run in the browser
  • A FastAPI endpoint that answers grounded in retrieved chunks

1. The RAG pipeline in four steps

RAG is not magic. It comes down to four steps: chunk, embed, search, and answer. At index time you split documents into chunks and store a vector for each. At query time you embed the question, find the nearest chunks, and pass them to the model with an instruction to answer only from that context. Retrieval is the step that sets the ceiling on answer quality.

2. Chunking

Chunks that are too large bury the relevant sentence in noise; too small and they lose context. A few hundred words with a little overlap is a sound default. Split on structure where you can, such as paragraphs or headings.

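def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    # An overlap >= size would never advance and loop forever.
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = start + size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break               # last chunk reached; skip a redundant tail chunk
        start = end - overlap   # overlap keeps context across boundaries
    return chunks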
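
Splitting on structure, as suggested above, can be sketched by packing whole paragraphs into chunks instead of cutting at a fixed word count. This is an illustration, not the article's method: the name `chunk_paragraphs`, the 300-word budget, and the assumption that paragraphs are separated by blank lines are all hypothetical choices. A single paragraph longer than the budget is kept whole rather than split.

```python
def chunk_paragraphs(text: str, max_words: int = 300) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_words words.

    Assumes paragraphs are separated by blank lines. A paragraph that is
    longer than max_words on its own becomes a single oversized chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk when this paragraph would exceed the budget.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because chunk boundaries always fall between paragraphs, no sentence is cut in half, at the price of less uniform chunk sizes.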

3. Embeddings and similarity

An embedding maps text to a vector so that similar meanings sit close together. You compare vectors with cosine similarity, which measures the angle between two vectors and ignores their length. The playground below shows the core idea with tiny hand-built vectors, so you can see why nearest-neighbor search works without calling any API.

Python playground
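
A minimal sketch of the idea the playground demonstrates, using only the standard library. The three-dimensional vectors, the document titles, and the helper names `cosine` and `top_k` are made up for illustration; real embeddings come from an embedding model and have hundreds or thousands of dimensions.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)

# Hand-built "embeddings" (hypothetical values, chosen so that each
# dimension loosely stands for accounts, billing, and shipping).
DOCS = {
    "How to reset your password": [0.9, 0.1, 0.0],
    "Refund and billing policy": [0.1, 0.9, 0.1],
    "Shipping times by region": [0.0, 0.2, 0.9],
}

def top_k(query_vec: list[float], k: int = 2) -> list[tuple[float, str]]:
    """Score every document against the query and keep the k best."""
    scored = [(cosine(query_vec, vec), title) for title, vec in DOCS.items()]
    scored.sort(reverse=True)
    return scored[:k]

if __name__ == "__main__":
    query = [0.8, 0.2, 0.1]  # pretend embedding for "I forgot my login"
    for score, title in top_k(query):
        print(f"{score:.3f}  {title}")
```

The query vector points mostly along the first dimension, so the password document ranks first even though the two texts share no words.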

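In a real system you generate embeddings with an embedding model and store them in a vector database such as pgvector, Qdrant, or Pinecone, which runs this nearest-neighbor search at scale. The ranking idea is the same one the playground demonstrates, although at scale these stores usually rely on approximate nearest-neighbor indexes instead of comparing the query against every vector.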

4. The grounded answer endpoint

Put it together in FastAPI. Retrieve the top chunks, build a prompt that includes them, and instruct the model to answer only from the provided context and to say when it does not know. That instruction is the main guard against the model making things up.

from anthropic import AsyncAnthropic
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

class Query(BaseModel):
    question: str

def retrieve(question: str) -> list[str]:
    # embed the question, search the vector store, return top chunks
    ...

@app.post("/ask")
async def ask(q: Query) -> dict:
    context = "\n\n".join(retrieve(q.question))
    prompt = (
        "Answer using only the context below. "
        "If the answer is not in the context, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {q.question}"
    )
    # Await the async client so the model call does not block the event loop.
    resp = await client.messages.create(
        model="claude-opus-4-8",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return {"answer": resp.content[0].text}

Checkpoint

A RAG system gives confident but wrong answers. Where do you look first?

5. Show your sources

A grounded answer that cannot be checked is only half useful. Carry metadata with each chunk, such as the document title and a link, and return those sources alongside the answer. Users trust an answer they can verify, and when the model does get something wrong, citations make it obvious which passage led it astray. Practically, this means your chunks are not just text; they are small records.

from pydantic import BaseModel

class Chunk(BaseModel):
    text: str
    source: str        # e.g. "Billing FAQ"
    url: str

class Answer(BaseModel):
    answer: str
    sources: list[str]

# After retrieving Chunk objects, pass their text to the model for the answer,
# and return the unique sources so the UI can render "based on: Billing FAQ".
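
The two comments above can be sketched as a pair of helpers. To keep the sketch dependency-free, a standard-library dataclass stands in for the Pydantic `Chunk` model; the helper names `unique_sources` and `build_context` and the bracketed source label are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    # Stand-in for the Pydantic Chunk model, same three fields.
    text: str
    source: str
    url: str

def unique_sources(chunks: list[Chunk]) -> list[str]:
    """Return source titles in retrieval order, without duplicates."""
    # dict.fromkeys keeps first-seen order while dropping repeats.
    return list(dict.fromkeys(c.source for c in chunks))

def build_context(chunks: list[Chunk]) -> str:
    """Join chunk texts, labelling each passage with its source title."""
    return "\n\n".join(f"[{c.source}]\n{c.text}" for c in chunks)
```

Keeping retrieval order in `unique_sources` means the first source listed is the one that ranked highest, which is usually the one a user wants to open first.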

6. Better retrieval: filtering, hybrid, reranking

Pure vector search is a strong baseline, not the ceiling. Three upgrades matter in practice. Metadata filtering restricts the search to the right subset, for example only this user's documents, before similarity even runs. Hybrid search combines keyword matching with vector similarity, which rescues cases where an exact term such as an error code matters more than meaning. And reranking takes a larger candidate set, say the top twenty, and uses a stronger model to reorder it, so the few chunks you actually send are the best few, not merely the closest in vector space.

Retrieval upgrades and when they help
Technique        | Fixes                                       | Cost
Metadata filter  | Wrong scope, leaking other users' data      | Cheap
Hybrid search    | Misses on exact terms, codes, names         | Moderate
Reranking        | Right doc retrieved but buried below noise  | Extra model call
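
A rough sketch of the first two upgrades, filtering and hybrid scoring, over an in-memory list. Everything here is an assumption made for illustration: the record layout with an `owner` field, the `keyword_score` heuristic (share of query terms found verbatim), and the `alpha` weight that blends the two scores. Production systems typically use BM25 or the vector store's built-in hybrid mode instead, and reranking is left out because it needs a second model.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms that appear verbatim in the text."""
    terms = set(query.lower().split())
    words = set(text.lower().split())
    return len(terms & words) / len(terms) if terms else 0.0

def search(query: str, query_vec: list[float], records: list[dict],
           owner: str, k: int = 3, alpha: float = 0.5) -> list[dict]:
    # 1. Metadata filter: other users' chunks never enter the ranking.
    candidates = [r for r in records if r["owner"] == owner]
    # 2. Hybrid score: blend vector similarity with exact-term overlap.
    scored = [
        (alpha * cosine(query_vec, r["vector"])
         + (1 - alpha) * keyword_score(query, r["text"]), r)
        for r in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored[:k]]
```

With `alpha=1.0` this reduces to pure vector search, which makes it easy to compare the two modes on the same evaluation questions.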

7. Grounding and refusing to guess

The instruction you give the model is what separates a careful assistant from a confident fabricator. Tell it to answer only from the provided context and to say plainly when the answer is not there. Without that line, a model will often fill the gap with a plausible invention. With it, a missing answer becomes an honest "I do not know", which in a support or docs setting is far better than a wrong one.

SYSTEM = (
    "You answer strictly from the provided context. "
    "If the context does not contain the answer, reply exactly: "
    "'I could not find that in the available documents.' "
    "Never use outside knowledge or guess."
)
# Pass SYSTEM as the system prompt and the retrieved chunks as the user content.
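
One way to follow that comment is to assemble the request as a plain dictionary, which can be inspected and tested without a network call. The helper name `build_request` and the layout of the user message are assumptions; `system`, `messages`, `model`, and `max_tokens` are the Messages API parameters already used in the endpoint, with `system` as a top-level argument rather than a message role.

```python
SYSTEM = (
    "You answer strictly from the provided context. "
    "If the context does not contain the answer, reply exactly: "
    "'I could not find that in the available documents.' "
    "Never use outside knowledge or guess."
)

def build_request(question: str, chunks: list[str], model: str,
                  max_tokens: int = 512) -> dict:
    """Assemble keyword arguments for a messages.create call."""
    context = "\n\n".join(chunks)
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": SYSTEM,  # instructions live here, not in the user turn
        "messages": [{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}",
        }],
    }
```

In the endpoint this would be used as `client.messages.create(**build_request(q.question, chunks, model="claude-opus-4-8"))`. Because the refusal sentence is fixed, the calling code can also detect it and show a "no answer found" state in the UI.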

8. Evaluate retrieval, not vibes

You cannot improve what you do not measure. Build a small evaluation set of real questions paired with the chunk that should answer each one, then measure how often retrieval returns the right chunk in its top results. That single number tells you whether a change to chunking, embeddings, or search actually helped, and it stops you from tuning blindly. When answer quality is poor, this is the metric that reveals whether the problem is retrieval or generation.
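
The metric described here is often called hit rate at k. A minimal sketch, assuming each evaluation item pairs a question with the id of the chunk that should answer it, and that a hypothetical `retrieve_ids(question, k)` callable returns the ids of the top k chunks:

```python
from typing import Callable

def hit_rate_at_k(
    eval_set: list[tuple[str, str]],
    retrieve_ids: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """Share of questions whose expected chunk id appears in the top k."""
    if not eval_set:
        return 0.0
    hits = sum(
        1 for question, expected_id in eval_set
        if expected_id in retrieve_ids(question, k)
    )
    return hits / len(eval_set)
```

Run it before and after every change to chunking, embeddings, or search. If the number is high while answers are still poor, the problem is more likely in generation than in retrieval.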

💡

Pro tip

Tune retrieval before touching the prompt or the model. If the passage that answers the question never reaches the model, no prompt and no model can save the answer. Retrieval quality is the ceiling on RAG quality.

9. Indexing is a pipeline too

Retrieval gets the attention, but indexing is where the data is prepared, and a sloppy index caps quality no matter how good the search is. Indexing runs once per document, not per query, so it belongs in background work (the queue pattern from Part 6): embedding a large document is slow and should not block an upload request. The steps are: load the document, clean it, chunk it, embed each chunk, and upsert the vectors with their metadata.

def index_document(doc_id: str, text: str, title: str, url: str) -> int:
    chunks = chunk(text)                    # from the chunking step above
    records = []
    for i, piece in enumerate(chunks):
        vector = embed(piece)               # your embedding model
        records.append({
            "id": f"{doc_id}:{i}",
            "vector": vector,
            "text": piece,
            "source": title,
            "url": url,
        })
    upsert(records)                         # write to the vector store
    return len(records)

Two details save pain later. Use a stable id per chunk, such as the document id plus the chunk index, so re-indexing updates in place instead of creating duplicates. And store the metadata next to the vector, so a retrieved chunk already carries the source and link you need to cite it.
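
The effect of stable ids can be seen with a dictionary standing in for the vector store. This is a sketch under assumptions: the `reindex` helper and the in-memory `store` are hypothetical, and vectors are omitted for brevity. It also handles one case stable ids alone do not cover: when a document gets shorter, chunks with higher indexes from the old version would otherwise linger.

```python
def reindex(store: dict[str, dict], doc_id: str, pieces: list[str]) -> int:
    """Write chunks under stable ids and drop leftovers from older versions."""
    records = [
        {"id": f"{doc_id}:{i}", "text": piece}
        for i, piece in enumerate(pieces)
    ]
    keep = {r["id"] for r in records}
    # Chunks of this document that the new version no longer produces.
    stale = [
        key for key in store
        if key.startswith(f"{doc_id}:") and key not in keep
    ]
    for key in stale:
        del store[key]
    for r in records:
        store[r["id"]] = r  # same id overwrites in place, no duplicates
    return len(records)
```

Indexing the same document twice leaves one copy of each chunk, and indexing a shortened version leaves no orphaned chunks behind.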

10. Cost and caching

RAG has two cost centers: embeddings at index time and the model call at query time. Embeddings are cheap per call but add up across a large corpus, so embed once and store the vectors rather than recomputing them. At query time, the context you retrieve is input you pay for on every question, so passing fifteen chunks when five would do is a direct, recurring cost. Cache embeddings for repeated or unchanged content, and tune the number of retrieved chunks against your evaluation set rather than maximizing it.
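
Caching embeddings for unchanged content can be as simple as keying on a hash of the text. A minimal in-memory sketch; the `EmbeddingCache` class is hypothetical, `embed_fn` stands for whatever embedding call you use, and a real deployment would persist the cache (for example in the same database as the vectors) rather than hold it in a dictionary.

```python
import hashlib
from typing import Callable

class EmbeddingCache:
    """Memoize an embedding function by the SHA-256 of the input text."""

    def __init__(self, embed_fn: Callable[[str], list[float]]) -> None:
        self._embed_fn = embed_fn
        self._cache: dict[str, list[float]] = {}
        self.misses = 0  # number of times the real embedder was called

    def embed(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._embed_fn(text)
        return self._cache[key]
```

When a document is re-indexed after a small edit, only the chunks whose text changed miss the cache, so the cost of re-indexing tracks the size of the edit rather than the size of the document.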

Do

  • Embed once at index time and reuse the stored vectors
  • Carry metadata with each chunk so answers cite sources
  • Tune the chunk count against an eval set, not by feel

Avoid

  • Re-embedding the same text on every query, which wastes money
  • Passing too many chunks, which adds cost and dilutes the answer
  • Tuning retrieval without an evaluation set, which is tuning blind

Frequently asked questions

How do I keep one user from seeing another user's data?

Tag every chunk with an owner in its metadata and apply a metadata filter at query time so search only ever considers that user's documents.

My answers cite the wrong passage. What changed?

Usually chunking or retrieval. Check your evaluation metric for whether the right chunk is even being retrieved before adjusting the prompt or model.

The bottom line

RAG is a pipeline, and its quality lives in retrieval. Chunk sensibly, embed and search for the nearest passages, and ground the model with an instruction to use only what it was given. Wired into FastAPI, that is a real document question-answering feature. The next part makes it feel instant by streaming the answer to the browser.

Frequently asked questions

Which vector database should I use?

pgvector if you already run Postgres, Qdrant or Pinecone for a dedicated store. The retrieval logic and prompt stay the same across all of them.

How many chunks should I pass?

Start with three to five. Too few misses context, too many adds noise and cost. Tune against real questions.

Up next: Part 11, streaming LLM responses.
