Part 10 builds the feature most LLM apps need: answering questions over your own documents. Retrieval-augmented generation, or RAG, finds the passages most relevant to a question and hands them to the model as context. This part walks through the full pipeline (chunking, embeddings, vector search, and a grounded answer) and wires it into a FastAPI endpoint.
What you will build
- A chunking step that splits documents into searchable pieces
- Embeddings that turn text into vectors for similarity search
- Cosine similarity search you can run in the browser
- A FastAPI endpoint that answers grounded in retrieved chunks
1. The RAG pipeline in four steps
RAG is not magic. At index time you split documents into chunks and store a vector for each. At query time you embed the question, find the nearest chunks, and pass them to the model with an instruction to answer only from that context. Retrieval is the part that decides answer quality.
2. Chunking
Chunks that are too large bury the relevant sentence in noise; too small and they lose context. A few hundred words with a little overlap is a sound default. Split on structure where you can, such as paragraphs or headings.
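def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into chunks of `size` words that share `overlap` words with the previous chunk."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = start + size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break  # the last chunk reached the end; avoid a redundant tail chunk
        start = end - overlap  # overlap keeps context across boundaries
    return chunks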
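Splitting on structure, as suggested above, can be sketched as follows. This is a minimal sketch, assuming paragraphs are separated by blank lines; the name `chunk_paragraphs` and the `max_words` default are illustrative and not part of the original code. It packs whole paragraphs into a chunk until the word budget is reached, so a paragraph is never cut in the middle.

```python
import re


def chunk_paragraphs(text: str, max_words: int = 300) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_words words.

    Assumes paragraphs are separated by one or more blank lines.
    A single paragraph longer than max_words becomes its own chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in paragraphs:
        n = len(para.split())
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Compared with fixed word windows, this trades uniform chunk size for cleaner boundaries. Which one retrieves better depends on the documents, so it is worth checking against real questions.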
3. Embeddings and similarity
An embedding maps text to a vector so that similar meanings sit close together. You compare vectors with cosine similarity. The playground below shows the core idea with tiny hand-built vectors so you can see why nearest-neighbor search works, no API needed.
In a real system you generate embeddings with an embedding model and store them in a vector database such as pgvector, Qdrant, or Pinecone, which does this nearest-neighbor search at scale. The ranking logic is exactly what you just ran.
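The same ranking idea can be written in a few lines of Python. This is a rough sketch with hand-built three-dimensional vectors. The titles, the vector values, and the helper names `cosine` and `top_k` are made up for illustration; a real system would get its vectors from an embedding model.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


def top_k(
    query: list[float], index: dict[str, list[float]], k: int = 2
) -> list[tuple[str, float]]:
    """Return the k entries whose vectors are most similar to the query."""
    scored = [(name, cosine(query, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hand-built vectors; the axes loosely mean (billing, login, shipping).
INDEX = {
    "How to update your card": [0.9, 0.1, 0.0],
    "Resetting your password": [0.0, 1.0, 0.1],
    "Tracking a delivery": [0.1, 0.0, 0.9],
}

query = [0.8, 0.2, 0.0]  # stands in for a question about a declined payment
results = top_k(query, INDEX)
for title, score in results:
    print(f"{score:.3f}  {title}")
```

The billing-flavored query lands closest to the card article because their vectors point in nearly the same direction. Vector length does not affect the score.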
4. The grounded answer endpoint
Put it together in FastAPI. Retrieve the top chunks, build a prompt that includes them, and instruct the model to answer only from the provided context and to say when it does not know. That instruction is what keeps the model from making things up.
from anthropic import Anthropic
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment


class Query(BaseModel):
    question: str


def retrieve(question: str) -> list[str]:
    # embed the question, search the vector store, return top chunks
    ...


@app.post("/ask")
def ask(q: Query) -> dict:
    # A plain `def` lets FastAPI run this handler in a thread pool, so the
    # blocking Anthropic client call does not stall the event loop.
    context = "\n\n".join(retrieve(q.question))
    prompt = (
        "Answer using only the context below. "
        "If the answer is not in the context, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {q.question}"
    )
    resp = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return {"answer": resp.content[0].text}
Checkpoint
A RAG system gives confident but wrong answers. Where do you look first?
5. Show your sources
A grounded answer that cannot be checked is only half useful. Carry metadata with each chunk, such as the document title and a link, and return those sources alongside the answer. Users trust an answer they can verify, and when the model does get something wrong, citations make it obvious which passage led it astray. Practically, this means your chunks are not just text; they are small records.
from pydantic import BaseModel


class Chunk(BaseModel):
    text: str
    source: str  # e.g. "Billing FAQ"
    url: str


class Answer(BaseModel):
    answer: str
    sources: list[str]


# After retrieving Chunk objects, pass their text to the model for the answer,
# and return the unique sources so the UI can render "based on: Billing FAQ".
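Collecting the unique sources described in that comment might look like the sketch below. To keep it runnable with only the standard library, plain dataclasses stand in for the Pydantic models, and `build_answer` is a hypothetical helper name rather than part of the original code.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str  # e.g. "Billing FAQ"
    url: str


@dataclass
class Answer:
    answer: str
    sources: list[str]


def build_answer(answer_text: str, chunks: list[Chunk]) -> Answer:
    """Attach de-duplicated source titles, keeping retrieval order."""
    # dict.fromkeys preserves insertion order, so the best-ranked source stays first.
    unique_sources = list(dict.fromkeys(c.source for c in chunks))
    return Answer(answer=answer_text, sources=unique_sources)
```

Keeping the retrieval order means the first source shown is the one that ranked highest, which is usually the one a reader should check first.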
6. Better retrieval: filtering, hybrid, reranking
Pure vector search is a strong baseline, not the ceiling. Three upgrades matter in practice. Metadata filtering restricts the search to the right subset, for example only this user's documents, before similarity even runs. Hybrid search combines keyword matching with vector similarity, which rescues cases where an exact term like an error code matters more than meaning. And reranking takes the top twenty candidates and uses a stronger model to reorder them, so the few chunks you actually send are the best few, not merely the closest in vector space.
| Technique | Fixes | Cost |
|---|---|---|
| Metadata filter | Wrong scope, leaking other users' data | Cheap |
| Hybrid search | Misses on exact terms, codes, names | Moderate |
| Reranking | Right doc retrieved but buried below noise | Extra model call |
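The first two techniques can be sketched together. This is a minimal illustration, not a production design. It assumes each record carries an `owner` field, uses a crude term-overlap score in place of a real keyword index, and merges the keyword and vector rankings with reciprocal rank fusion, one common way to combine ranked lists. All names and the sample records are hypothetical, and the vector ranking is hard-coded where a vector store would supply it.

```python
def filter_by_owner(records: list[dict], owner: str) -> list[dict]:
    """Metadata filter: keep only the records this user is allowed to search."""
    return [r for r in records if r["owner"] == owner]


def keyword_rank(query: str, records: list[dict]) -> list[str]:
    """Rank ids by how many query terms appear in the text (a crude keyword score)."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(r["text"].lower().split())), r["id"]) for r in records]
    return [rid for score, rid in sorted(scored, key=lambda s: -s[0]) if score > 0]


def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per id."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, rid in enumerate(ranking, start=1):
            scores[rid] = scores.get(rid, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


records = [
    {"id": "a", "owner": "u1", "text": "Error E42 means the card was declined"},
    {"id": "b", "owner": "u1", "text": "How payments are processed"},
    {"id": "c", "owner": "u2", "text": "Error E42 in another account"},
]

mine = filter_by_owner(records, "u1")      # record "c" is never considered
vector_ranking = ["b", "a"]                # assumed output of vector search over `mine`
keyword_ranking = keyword_rank("error E42", mine)
fused = rrf([vector_ranking, keyword_ranking])
print(fused)
```

Here the exact match on the error code lifts record `a` above `b`, even though the assumed vector ranking preferred `b`. A reranker would then reorder the fused list with a stronger model.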
7. Grounding and refusing to guess
The instruction you give the model is what separates a careful assistant from a confident fabricator. Tell it to answer only from the provided context and to say plainly when the answer is not there. Without that line, a model will often fill the gap with a plausible invention. With it, a missing answer becomes an honest "I do not know", which in a support or docs setting is far better than a wrong one.
SYSTEM = (
    "You answer strictly from the provided context. "
    "If the context does not contain the answer, reply exactly: "
    "'I could not find that in the available documents.' "
    "Never use outside knowledge or guess."
)
# Pass SYSTEM as the system prompt and the retrieved chunks as the user content.
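One way to keep the system prompt separate from the retrieved chunks is to build the request arguments in a small function. This is a sketch: `build_request` is a hypothetical helper, and it only assembles a dictionary, so it runs without an API key. It relies on the Messages API accepting the system prompt as a top-level `system` parameter.

```python
SYSTEM = (
    "You answer strictly from the provided context. "
    "If the context does not contain the answer, reply exactly: "
    "'I could not find that in the available documents.' "
    "Never use outside knowledge or guess."
)


def build_request(
    question: str, chunks: list[str], model: str, max_tokens: int = 512
) -> dict:
    """Build keyword arguments for a Messages API call.

    The grounding rules go in `system`; the retrieved chunks and the
    question go in the single user message.
    """
    context = "\n\n".join(chunks)
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": SYSTEM,
        "messages": [
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}",
            }
        ],
    }
```

With the `client` from the endpoint section, the call would be `client.messages.create(**build_request(q.question, retrieve(q.question), model="claude-opus-4-8"))`.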
8. Evaluate retrieval, not vibes
You cannot improve what you do not measure. Build a small evaluation set of real questions paired with the chunk that should answer each one, then measure how often retrieval returns the right chunk in its top results. That single number tells you whether a change to chunking, embeddings, or search actually helped, and it stops you from tuning blindly. When answer quality is poor, this is the metric that reveals whether the problem is retrieval or generation.
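That metric can be computed with very little code. The sketch below assumes the evaluation set is a list of `(question, expected_chunk_id)` pairs and that `search` is whatever function returns ranked chunk ids for a question. Both shapes are assumptions for illustration, as is the name `hit_rate_at_k`.

```python
from typing import Callable


def hit_rate_at_k(
    eval_set: list[tuple[str, str]],
    search: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose expected chunk id appears in the top k results."""
    if not eval_set:
        return 0.0
    hits = sum(
        1 for question, expected_id in eval_set if expected_id in search(question, k)
    )
    return hits / len(eval_set)
```

Run it before and after each change to chunking, embeddings, or search, with the same evaluation set and the same `k`, so the two numbers are comparable.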
Pro tip
Tune retrieval before touching the prompt or the model. If the passage that answers the question never reaches the model, no prompt and no model can save the answer. Retrieval quality is the ceiling on RAG quality.
9. Indexing is a pipeline too
Retrieval gets the attention, but indexing is where the data is prepared, and a sloppy index caps quality no matter how good the search is. Indexing runs once per document, not per query, so it belongs in background work (the queue pattern from Part 6), because embedding a large document is slow and should not block an upload request. The steps are: load the document, clean it, chunk it, embed each chunk, and upsert the vectors with their metadata.
def index_document(doc_id: str, text: str, title: str, url: str) -> int:
    chunks = chunk(text)  # from the chunking step above
    records = []
    for i, piece in enumerate(chunks):
        vector = embed(piece)  # your embedding model
        records.append({
            "id": f"{doc_id}:{i}",
            "vector": vector,
            "text": piece,
            "source": title,
            "url": url,
        })
    upsert(records)  # write to the vector store
    return len(records)
Two details save pain later. Use a stable id per chunk, such as the document id plus the chunk index, so re-indexing updates in place instead of creating duplicates. And store the metadata next to the vector, so a retrieved chunk already carries the source and link you need to cite it.
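To see why stable ids matter, here is a small sketch with a plain dictionary standing in for the vector store. The helpers `upsert_in_memory` and `delete_stale` are hypothetical; a real store has its own upsert and delete calls. The second helper covers a case stable ids alone do not: a document that shrinks on re-indexing leaves its old trailing chunks behind unless they are removed.

```python
def upsert_in_memory(store: dict[str, dict], records: list[dict]) -> None:
    """Insert or overwrite records keyed by their stable id."""
    for record in records:
        store[record["id"]] = record


def delete_stale(store: dict[str, dict], doc_id: str, kept: int) -> int:
    """Remove chunks of doc_id whose index is >= kept; return how many were removed."""
    stale = []
    for rid in store:
        # Ids look like "<doc_id>:<chunk index>"; split on the last colon.
        owner, _, index = rid.rpartition(":")
        if owner == doc_id and int(index) >= kept:
            stale.append(rid)
    for rid in stale:
        del store[rid]
    return len(stale)


store: dict[str, dict] = {}

# First indexing pass: three chunks.
upsert_in_memory(store, [{"id": f"doc1:{i}", "text": f"old {i}"} for i in range(3)])

# Re-index after an edit: the document now yields two chunks.
new_records = [{"id": f"doc1:{i}", "text": f"new {i}"} for i in range(2)]
upsert_in_memory(store, new_records)          # overwrites doc1:0 and doc1:1 in place
removed = delete_stale(store, "doc1", kept=len(new_records))
print(sorted(store), removed)
```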
10. Cost and caching
RAG has two cost centers: embeddings at index time and the model call at query time. Embeddings are cheap per call but add up across a large corpus, so embed once and store the vectors rather than recomputing them. At query time, the context you retrieve is input you pay for on every question, so passing fifteen chunks when five would do is a direct, recurring cost. Cache embeddings for repeated or unchanged content, and tune the number of retrieved chunks against your evaluation set rather than maximizing it.
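Caching embeddings for unchanged content can be as simple as keying them by a hash of the text. This is a minimal in-memory sketch. The class name `EmbeddingCache` is made up, and `embed` stands for whatever embedding function you use. A persistent cache would store the same key next to the vector in a database.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """Memoize embeddings by a hash of the text so unchanged content is embedded once."""

    def __init__(self, embed: Callable[[str], list[float]]) -> None:
        self._embed = embed
        self._cache: dict[str, list[float]] = {}
        self.misses = 0  # number of real embedding calls made

    def get(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._embed(text)
            self.misses += 1
        return self._cache[key]
```

When a document is re-indexed after a small edit, only the chunks whose text changed miss the cache, so the embedding cost tracks the size of the edit rather than the size of the document.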
✓ Pros
- Embed once at index time and reuse the stored vectors
- Carry metadata with each chunk so answers cite sources
- Tune the chunk count against an eval set, not by feel
✕ Cons
- Re-embedding the same text on every query wastes money
- Passing too many chunks adds cost and dilutes the answer
- No evaluation set means you are tuning retrieval blind
? Frequently asked questions
How do I keep one user from seeing another user's data?
Tag every chunk with an owner in its metadata and apply a metadata filter at query time so search only ever considers that user's documents.
My answers cite the wrong passage. What changed?
Usually chunking or retrieval. Check your evaluation metric for whether the right chunk is even being retrieved before adjusting the prompt or model.
The bottom line
RAG is a pipeline, and its quality lives in retrieval. Chunk sensibly, embed and search for the nearest passages, and ground the model with an instruction to use only what it was given. Wired into FastAPI, that is a real document question answering feature. The next part makes it feel instant by streaming the answer to the browser.
? Frequently asked questions
Which vector database should I use?
pgvector if you already run Postgres, Qdrant or Pinecone for a dedicated store. The retrieval logic and prompt stay the same across all of them.
How many chunks should I pass?
Start with three to five. Too few misses context, too many adds noise and cost. Tune against real questions.
Up next: Part 11, streaming LLM responses.