Most "talk to your docs" demos break the moment you point them at a real corpus. The reason is almost never the model it is the retrieval.
A good Retrieval-Augmented Generation (RAG) system spends 80% of its effort on chunking, embedding, indexing, and ranking, and only 20% on the prompt. If you skip that work, you get hallucinations dressed up as confidence: fluent answers that sound correct and quietly invent facts that are not in your documents.
This guide builds a production-quality RAG pipeline in Laravel using PostgreSQL with pgvector for storage and search. No vector database service, no LangChain, no Python sidecar. Just SQL you can read, PHP you can debug, and a queue you already operate. By the end you will have a system that ingests documents, chunks them intelligently, embeds them in batches, retrieves with hybrid search, reranks with a cross-encoder, and produces citation-grounded answers from Claude.
Why pgvector instead of a managed vector DB
The hosted vector database market is crowded. Here is the honest comparison for a Laravel team that already runs Postgres:
| Option | Strengths | Tradeoffs |
|---|---|---|
| pgvector | Same DB, ACID, joins, full-text + vector hybrid in one query, free | Manual tuning past ~50M vectors |
| Pinecone | Managed, scales horizontally, fast | Per-vector pricing, separate system to operate, no joins |
| Qdrant | Open source, payload filtering, fast HNSW | Another service to deploy and back up |
| Weaviate | Built-in vectorization, GraphQL API | Heavier ops, opinionated schema |
For 95% of Laravel apps, pgvector wins on operational simplicity alone. You already know how to back up Postgres, restore it, run migrations, monitor query plans, and join across tables.
That last point is underrated: being able to filter retrieval by tenant ID or document ACL in the same SQL statement is a feature you do not appreciate until you do not have it.
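To make that concrete, here is a sketch of a tenant-scoped ANN query. It assumes a tenant_id column on doc_chunks, which the Step 1 migration below does not add; treat it as an illustration of the pattern rather than part of the pipeline:
// Hypothetical: ANN search and access control in one statement.
// Assumes doc_chunks has a tenant_id column (not in the Step 1 migration).
$rows = DB::select('
    SELECT id, content
    FROM doc_chunks
    WHERE tenant_id = ?
    ORDER BY embedding <=> ?::vector
    LIMIT 10
', [$tenantId, $queryVector]);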
Architecture
Ingest: Document → Chunker → Embedding API → pgvector index
Query:  User question → Embed → ANN search over that index (top-k) → Reranker → LLM answer
Each stage is independently swappable. You can swap the chunker without touching retrieval (you re-chunk and re-embed), swap the embedding model without rewriting the retrieval code, and swap the LLM without touching the index. That separation is the whole reason the pipeline is worth building from parts.
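If you want those seams explicit in code, small contracts work well. The interfaces below are only a sketch of that idea; the later snippets use concrete classes directly and do not depend on them:
// Hypothetical contracts that make the swappable seams explicit.
// The concrete classes in Steps 2-5 could implement these, but do not have to.
namespace App\Services\Rag\Contracts;

interface ChunkerContract
{
    /** @return string[] */
    public function chunk(string $text): array;
}

interface EmbedderContract
{
    /** @return array<int, array<float>> one vector per input text */
    public function embed(array $texts, string $type = 'document'): array;
}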
Step 1 : Enable pgvector
On Postgres 16+, most managed providers ship the pgvector extension. Enable it once:
CREATE EXTENSION IF NOT EXISTS vector;
Then create the chunks table. vector(1024) matches Voyage's voyage-3 embedding size; adjust if you use a different model.
// database/migrations/2026_05_11_000000_create_doc_chunks_table.php
public function up(): void
{
Schema::create('doc_chunks', function (Blueprint $table) {
$table->id();
$table->foreignId('document_id')->constrained()->cascadeOnDelete();
$table->unsignedInteger('position');
$table->text('content');
$table->jsonb('metadata')->default('{}');
$table->timestamps();
});
DB::statement('ALTER TABLE doc_chunks ADD COLUMN embedding vector(1024)');
DB::statement('CREATE INDEX doc_chunks_embedding_hnsw ON doc_chunks USING hnsw (embedding vector_cosine_ops)');
}
HNSW indexes give you sub-50ms ANN queries on millions of rows. Use HNSW instead of ivfflat unless you have a hard memory limit.
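HNSW also exposes two build-time knobs (m, ef_construction) and a query-time knob (hnsw.ef_search) if you need to trade memory and build time for recall. The values below are pgvector's defaults, shown only so you know where the dials are; if you do set build parameters, use this in place of the plain CREATE INDEX in the migration above:
// Same index as in the migration, with explicit build parameters
// (m = 16 and ef_construction = 64 are pgvector's defaults).
DB::statement('
    CREATE INDEX doc_chunks_embedding_hnsw
    ON doc_chunks USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64)
');

// Query-time recall vs. speed tradeoff, per connection (pgvector default is 40).
DB::statement('SET hnsw.ef_search = 40');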
Step 2 : Smart chunking
Chunking is the part most teams underbuild. Splitting on a fixed character count tears sentences apart. Use a chunker that respects paragraph boundaries and carries a small overlap between chunks; for paragraphs longer than the target size, a sentence-level fallback (sketched after the class below) keeps chunks near the target.
// app/Services/Rag/Chunker.php
namespace App\Services\Rag;
class Chunker
{
public function __construct(
private int $targetSize = 800,
private int $overlap = 120,
) {}
public function chunk(string $text): array
{
$text = preg_replace("/\r\n|\r/", "\n", trim($text));
$paragraphs = preg_split("/\n\s*\n/", $text);
$chunks = [];
$buffer = '';
foreach ($paragraphs as $paragraph) {
$paragraph = trim($paragraph);
if ($paragraph === '') continue;
if (strlen($buffer) + strlen($paragraph) + 2 > $this->targetSize && $buffer !== '') {
$chunks[] = $buffer;
$tail = substr($buffer, max(0, strlen($buffer) - $this->overlap));
$buffer = $tail . "\n\n" . $paragraph;
} else {
$buffer = $buffer === '' ? $paragraph : $buffer . "\n\n" . $paragraph;
}
}
if ($buffer !== '') {
$chunks[] = $buffer;
}
return $chunks;
}
}
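One gap worth knowing about: the class above splits only on paragraph boundaries, so a single paragraph longer than the target size becomes one oversized chunk. If your corpus has long unbroken paragraphs, a sentence-level fallback helps. Here is a minimal sketch you could fold into the class; the regex is a rough heuristic, not a real sentence tokenizer:
// Hypothetical helper for the Chunker class above: split an oversized
// paragraph on sentence boundaries so pieces stay near the target size.
private function splitLongParagraph(string $paragraph): array
{
    $sentences = preg_split('/(?<=[.!?])\s+/', $paragraph) ?: [$paragraph];
    $pieces = [];
    $buffer = '';
    foreach ($sentences as $sentence) {
        if ($buffer !== '' && strlen($buffer) + strlen($sentence) + 1 > $this->targetSize) {
            $pieces[] = $buffer;
            $buffer = $sentence;
        } else {
            $buffer = $buffer === '' ? $sentence : $buffer . ' ' . $sentence;
        }
    }
    if ($buffer !== '') {
        $pieces[] = $buffer;
    }
    return $pieces;
}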
Step 3 : Embed in batches
Embedding APIs charge per token but reward batching. Always send 32 to 128 chunks per request.
// app/Services/Rag/Embedder.php
namespace App\Services\Rag;
use Illuminate\Support\Facades\Http;
class Embedder
{
public function embed(array $texts, string $type = 'document'): array
{
$response = Http::withToken(config('rag.voyage_api_key'))
->timeout(60)
->post('https://api.voyageai.com/v1/embeddings', [
'model' => 'voyage-3',
'input' => $texts,
'input_type' => $type,
])->throw()->json();
return collect($response['data'])->pluck('embedding')->all();
}
}
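The snippets in this guide read the API key through config('rag.voyage_api_key'). That key is not defined by any package; a minimal config file you would add yourself might look like this, with the extra entries purely optional:
// config/rag.php -- minimal config assumed by the snippets in this guide.
return [
    'voyage_api_key' => env('VOYAGE_API_KEY'),

    // Optional: keep model and dimension in config so a model swap is one change.
    'embedding_model' => env('RAG_EMBEDDING_MODEL', 'voyage-3'),
    'embedding_dimensions' => (int) env('RAG_EMBEDDING_DIMENSIONS', 1024),
];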
Step 4 : Indexing job
Indexing is async work. A queued job keeps ingestion off your request path and lets you retry safely.
// app/Jobs/IndexDocument.php
public function handle(Chunker $chunker, Embedder $embedder): void
{
$document = Document::findOrFail($this->documentId);
$chunks = $chunker->chunk($document->raw_text);
$embeddings = collect($chunks)->chunk(64)->flatMap(
fn ($batch) => $embedder->embed($batch->values()->all(), 'document')
)->all();
DB::transaction(function () use ($document, $chunks, $embeddings) {
$document->chunks()->delete();
foreach ($chunks as $i => $content) {
DocChunk::create([
'document_id' => $document->id,
'position' => $i,
'content' => $content,
'metadata' => ['source' => $document->source_url],
'embedding' => '[' . implode(',', $embeddings[$i]) . ']',
]);
}
});
}
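"Retry safely" deserves to be concrete. A skeleton around that handle() method might look like the following; the tries and backoff values are illustrative, not tuned:
// app/Jobs/IndexDocument.php -- illustrative skeleton around the handle() shown above.
namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

class IndexDocument implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public int $tries = 3;                  // retry transient embedding API failures
    public array $backoff = [30, 120, 600]; // seconds to wait between attempts

    public function __construct(public int $documentId) {}

    // handle(Chunker $chunker, Embedder $embedder) as shown above
}
Dispatch it with IndexDocument::dispatch($document->id) wherever documents are created or re-uploaded.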
Step 5 : Retrieval with reranking
Vector search gives you fast recall. A cross-encoder reranker gives you precision. Pull the top 30 by ANN, then rerank down to the top 6; this single step is the biggest quality win in most RAG systems.
// app/Services/Rag/Retriever.php
namespace App\Services\Rag;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Http;
class Retriever
{
public function __construct(private Embedder $embedder) {}
public function retrieve(string $question, int $k = 6): array
{
[$queryEmbedding] = $this->embedder->embed([$question], 'query');
$vector = '[' . implode(',', $queryEmbedding) . ']';
$candidates = DB::select('
SELECT id, document_id, content, metadata,
1 - (embedding <=> ?::vector) AS similarity
FROM doc_chunks
ORDER BY embedding <=> ?::vector
LIMIT 30
', [$vector, $vector]);
return $this->rerank($question, $candidates, $k);
}
private function rerank(string $question, array $candidates, int $k): array
{
$response = Http::withToken(config('rag.voyage_api_key'))
->post('https://api.voyageai.com/v1/rerank', [
'model' => 'rerank-2.5',
'query' => $question,
'documents' => array_map(fn ($c) => $c->content, $candidates),
'top_k' => $k,
])->throw()->json();
return collect($response['data'])
->map(fn ($r) => (object) array_merge(
(array) $candidates[$r['index']],
['rerank_score' => $r['relevance_score']],
))
->all();
}
}
Step 6 : Answer with citations
The prompt must instruct the model to cite chunks by ID. Citations are how you verify the answer is grounded; without them, RAG is just a slower chatbot.
// app/Services/Rag/Answerer.php
public function answer(string $question): array
{
$chunks = $this->retriever->retrieve($question);
$context = collect($chunks)->map(fn ($c, $i) =>
"[{$c->id}] " . str_replace("\n", ' ', $c->content)
)->implode("\n\n");
$response = $this->claude->messages->create([
'model' => 'claude-sonnet-4-6',
'max_tokens' => 800,
'system' => 'Answer using only the context. Cite chunks like [123]. If the answer is not in the context, say so.',
'messages' => [
['role' => 'user', 'content' => "Context:\n{$context}\n\nQuestion: {$question}"],
],
]);
return [
'answer' => $response->content[0]['text'],
'sources' => collect($chunks)->map(fn ($c) => [
'id' => $c->id,
'snippet' => Str::limit($c->content, 200),
]),
];
}
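Exposing this over HTTP is ordinary Laravel. A hypothetical controller action, with routing, auth, and rate limiting left to you:
// Hypothetical controller action; wire up the route, auth, and throttling yourself.
public function ask(Request $request, Answerer $answerer): JsonResponse
{
    $validated = $request->validate([
        'question' => ['required', 'string', 'max:2000'],
    ]);

    return response()->json($answerer->answer($validated['question']));
}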
Hybrid search : vector + keyword in one query
Pure vector search struggles with rare strings: error codes, SKUs, function names. Pure full-text struggles with paraphrasing. The answer is to run both and combine the scores. Postgres makes this almost embarrassingly easy:
SELECT id, content,
0.6 * (1 - (embedding <=> $1::vector)) +
0.4 * ts_rank_cd(content_tsv, plainto_tsquery('english', $2)) AS score
FROM doc_chunks
WHERE content_tsv @@ plainto_tsquery('english', $2)
OR embedding <=> $1::vector < 0.4
ORDER BY score DESC
LIMIT 30;
The 60/40 weighting is a sane default. Tune it on your eval set — domains heavy in identifiers (codebases, SKUs) tilt toward keyword; conceptual corpora (policies, narratives) tilt toward vector.
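One prerequisite: the query above assumes a content_tsv column, which the Step 1 migration does not create. A generated column plus a GIN index keeps it in sync with content automatically; a migration sketch:
// Adds the tsvector column the hybrid query expects (not part of the Step 1 migration).
DB::statement("
    ALTER TABLE doc_chunks
    ADD COLUMN content_tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('english', content)) STORED
");
DB::statement('CREATE INDEX doc_chunks_content_tsv_gin ON doc_chunks USING gin (content_tsv)');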
Embedding model selection
| Model | Dim | Best for | Notes |
|---|---|---|---|
| voyage-3 | 1024 | General docs, English | Strong default, recommended by Anthropic |
| voyage-3-large | 2048 | High-precision retrieval | Higher cost, marginal gain unless quality is bottleneck |
| voyage-code-3 | 1024 | Codebases, API docs | Pick this for any technical content |
| text-embedding-3-large | 3072 | OpenAI ecosystems | Larger index footprint |
Evaluating retrieval quality
You cannot improve what you do not measure. Build a small eval set of 30 to 50 question/expected-chunk pairs and track these every time you change the pipeline (a minimal scoring sketch follows the list):
- Recall@k : did the gold chunk appear in the top-k results?
- MRR (Mean Reciprocal Rank) : how high in the list was it?
- Citation accuracy : does the final answer actually cite the right chunks?
- Refusal rate : how often does the model say "not in context" when the answer is there? (Indicates poor retrieval.)
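The scoring harness does not need to be elaborate. Assuming an eval set shaped as ['question' => ..., 'gold_chunk_id' => ...] pairs (a shape chosen here purely for illustration), Recall@k and MRR reduce to a few lines:
// Minimal eval sketch; the $evalSet shape is an assumption made for this example.
$k = 6;
$recallHits = 0;
$reciprocalRanks = [];

foreach ($evalSet as $case) {
    $results = $retriever->retrieve($case['question'], $k);
    $rank = collect($results)->search(
        fn ($c) => (int) $c->id === (int) $case['gold_chunk_id']
    );

    $recallHits += $rank !== false ? 1 : 0;
    $reciprocalRanks[] = $rank !== false ? 1 / ($rank + 1) : 0;
}

$recallAtK = $recallHits / count($evalSet);                    // Recall@k
$mrr = array_sum($reciprocalRanks) / count($reciprocalRanks);  // MRR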
Cost & scale
| Stage | When it scales painfully | Mitigation |
|---|---|---|
| Embedding ingest | Re-indexing the whole corpus | Hash chunk content; skip unchanged |
| HNSW index | Past ~10M vectors | Partition by tenant or document type |
| Reranker | Latency budget <500ms | Rerank only top 30, not top 100 |
| LLM answer | Long context = high cost | Top-6 chunks max, trim aggressively |
Tuning checklist
- Chunk size 600 to 1000 tokens. Smaller for FAQs and Q&A; larger for technical docs and runbooks.
- Always rerank. ANN-only retrieval misses semantically similar but irrelevant chunks. The cross-encoder pass is the single biggest quality unlock.
- Hybrid search. Combine vector similarity with Postgres full-text (tsvector) for queries with rare keywords or proper nouns.
- Cache embeddings by content hash. Re-indexing should never re-embed unchanged chunks. A SHA-256 column lets you skip 90% of an ingest run (sketched after this list).
- Track citation rate. If users flag answers that did not match cited chunks, your retrieval is the problem, not the model.
- Add metadata filters. tenant_id, product_version, and language belong in the WHERE clause, not the prompt.
- Strip boilerplate before embedding. Headers, footers, and "© 2026 Acme Corp" appear on every page and waste embedding signal.
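A minimal version of that hash check, assuming a content_hash column on doc_chunks (not in the Step 1 migration) and an upsert-style ingest instead of the delete-all approach in the indexing job above:
// Sketch: only embed chunks whose content actually changed.
// Assumes a content_hash column and per-position upserts instead of delete-all.
$existingHashes = DocChunk::where('document_id', $document->id)
    ->pluck('content_hash', 'position');

$toEmbed = collect($chunks)->reject(
    fn ($content, $i) => $existingHashes->get($i) === hash('sha256', $content)
);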
RAG quality is a retrieval problem dressed as a generation problem. Fix retrieval and the answers fix themselves.
Postgres + pgvector is enough for the vast majority of real-world RAG workloads. It scales to tens of millions of chunks, runs on infrastructure you already operate, and lets you debug a slow query with EXPLAIN ANALYZE instead of vendor support tickets.