Published May 15, 2026

Building a Production RAG System with pgvector and Laravel

Most "talk to your docs" demos break the moment you point them at a real corpus. The reason is almost never the model; it is the retrieval.

 

A good Retrieval-Augmented Generation (RAG) system spends 80% of its effort on chunking, embedding, indexing, and ranking, and only 20% on the prompt. If you skip that work, you get hallucinations dressed up as confidence: fluent answers that sound correct and quietly invent facts that are not in your documents.

 

This guide builds a production-quality RAG pipeline in Laravel using PostgreSQL with pgvector for storage and search. No vector database service, no LangChain, no Python sidecar. Just SQL you can read, PHP you can debug, and a queue you already operate. By the end you will have a system that ingests documents, chunks them intelligently, embeds them in batches, retrieves with hybrid search, reranks with a cross-encoder, and produces citation-grounded answers from Claude.

Why pgvector instead of a managed vector DB

The hosted vector database market is crowded. Here is the honest comparison for a Laravel team that already runs Postgres:

| Option | Strengths | Tradeoffs |
| --- | --- | --- |
| pgvector | Same DB, ACID, joins, full-text + vector hybrid in one query, free | Manual tuning past ~50M vectors |
| Pinecone | Managed, scales horizontally, fast | Per-vector pricing, separate system to operate, no joins |
| Qdrant | Open source, payload filtering, fast HNSW | Another service to deploy and back up |
| Weaviate | Built-in vectorization, GraphQL API | Heavier ops, opinionated schema |

For 95% of Laravel apps, pgvector wins on operational simplicity alone. You already know how to back up Postgres, restore it, run migrations, monitor query plans, and join across tables.

 

That last point is underrated: being able to filter retrieval by tenant ID or document ACL in the same SQL statement is a feature you do not appreciate until you do not have it.
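As an illustration, scoping retrieval to one tenant is just a join plus a WHERE clause on the same query. The `tenant_id` column here is hypothetical; this guide's schema does not add one:

```sql
SELECT c.id, c.content
FROM doc_chunks c
JOIN documents d ON d.id = c.document_id
WHERE d.tenant_id = $1              -- hypothetical tenant column
ORDER BY c.embedding <=> $2::vector
LIMIT 30;
```

The planner applies the filter before (or alongside) the ANN scan, so you never leak chunks across tenants and never post-filter in PHP.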

Architecture

Document → Chunker → Embedding API → pgvector index
                                          ↓
User question → Embed → ANN search (top-k) → Reranker → LLM answer

 

Each stage is independently swappable. You can switch chunkers (at the cost of re-chunking and re-embedding), switch embedders without rewriting retrieval, and switch the LLM without touching the index. That separation is the whole reason the pipeline is worth building from parts.

Step 1 : Enable pgvector

On Postgres 16+, most managed providers already ship the extension. Enable it once:

 

CREATE EXTENSION IF NOT EXISTS vector;
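If you are unsure whether the extension is available, or which version you are running, the system catalog will tell you:

```sql
SELECT extversion FROM pg_extension WHERE extname = 'vector';
```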

 

Then create the chunks table. vector(1024) matches Voyage's voyage-3 embedding size; adjust if you use a different model.

 

// database/migrations/2026_05_11_000000_create_doc_chunks_table.php
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Schema;

public function up(): void
{
    Schema::create('doc_chunks', function (Blueprint $table) {
        $table->id();
        $table->foreignId('document_id')->constrained()->cascadeOnDelete();
        $table->unsignedInteger('position');
        $table->text('content');
        $table->jsonb('metadata')->default('{}');
        $table->timestamps();
    });

    DB::statement('ALTER TABLE doc_chunks ADD COLUMN embedding vector(1024)');
    DB::statement('CREATE INDEX doc_chunks_embedding_hnsw ON doc_chunks USING hnsw (embedding vector_cosine_ops)');
}

 

HNSW indexes give you sub-50ms ANN queries on millions of rows. Use it instead of ivfflat unless you have a hard memory limit.
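If you need to trade recall against build time, pgvector exposes both build-time and query-time knobs. The migration's one-line index uses the defaults; here they are written out explicitly (the values shown are pgvector's documented defaults, a reasonable starting point):

```sql
-- Build-time: higher m / ef_construction improve recall at the cost
-- of index build time and memory.
CREATE INDEX doc_chunks_embedding_hnsw ON doc_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Query-time: raise ef_search (per session) when recall matters more
-- than latency.
SET hnsw.ef_search = 60;
```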

Step 2 : Smart chunking

Chunking is the part most teams underbuild. Splitting on a fixed character count tears sentences apart. Use a recursive chunker that prefers paragraph and sentence boundaries.

 

// app/Services/Rag/Chunker.php
namespace App\Services\Rag;

class Chunker
{
    public function __construct(
        private int $targetSize = 800,
        private int $overlap = 120,
    ) {}

    public function chunk(string $text): array
    {
        $text = preg_replace("/\r\n|\r/", "\n", trim($text));
        $paragraphs = preg_split("/\n\s*\n/", $text);

        $chunks = [];
        $buffer = '';

        foreach ($paragraphs as $paragraph) {
            $paragraph = trim($paragraph);
            if ($paragraph === '') continue;

            if (strlen($buffer) + strlen($paragraph) + 2 > $this->targetSize && $buffer !== '') {
                $chunks[] = $buffer;
                $tail = substr($buffer, max(0, strlen($buffer) - $this->overlap));
                $buffer = $tail . "\n\n" . $paragraph;
            } else {
                $buffer = $buffer === '' ? $paragraph : $buffer . "\n\n" . $paragraph;
            }
        }

        if ($buffer !== '') {
            $chunks[] = $buffer;
        }

        return $chunks;
    }
}

Step 3 : Embed in batches

Embedding APIs charge per token but reward batching. Always send 32 to 128 chunks per request.

 

// app/Services/Rag/Embedder.php
namespace App\Services\Rag;

use Illuminate\Support\Facades\Http;

class Embedder
{
    public function embed(array $texts, string $type = 'document'): array
    {
        $response = Http::withToken(config('rag.voyage_api_key'))
            ->timeout(60)
            ->post('https://api.voyageai.com/v1/embeddings', [
                'model'      => 'voyage-3',
                'input'      => $texts,
                'input_type' => $type,
            ])->throw()->json();

        return collect($response['data'])->pluck('embedding')->all();
    }
}

Step 4 : Indexing job

Indexing is async work. A queued job keeps ingestion off your request path and lets you retry safely.

 

// app/Jobs/IndexDocument.php
use App\Models\DocChunk;
use App\Models\Document;
use App\Services\Rag\Chunker;
use App\Services\Rag\Embedder;
use Illuminate\Support\Facades\DB;

public function handle(Chunker $chunker, Embedder $embedder): void
{
    $document = Document::findOrFail($this->documentId);

    $chunks = $chunker->chunk($document->raw_text);
    $embeddings = collect($chunks)->chunk(64)->flatMap(
        fn ($batch) => $embedder->embed($batch->values()->all(), 'document')
    )->all();

    DB::transaction(function () use ($document, $chunks, $embeddings) {
        $document->chunks()->delete();

        foreach ($chunks as $i => $content) {
            DocChunk::create([
                'document_id' => $document->id,
                'position'    => $i,
                'content'     => $content,
                'metadata'    => ['source' => $document->source_url],
                'embedding'   => '[' . implode(',', $embeddings[$i]) . ']',
            ]);
        }
    });
}

Step 5 : Retrieval with reranking

Vector search gives you fast recall. A cross-encoder reranker gives you precision. Pull the top 30 by ANN, then rerank down to the top 6; this single step is the biggest quality win in most RAG systems.

 

// app/Services/Rag/Retriever.php
namespace App\Services\Rag;

use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Http;

class Retriever
{
    public function __construct(private Embedder $embedder) {}

    public function retrieve(string $question, int $k = 6): array
    {
        [$queryEmbedding] = $this->embedder->embed([$question], 'query');
        $vector = '[' . implode(',', $queryEmbedding) . ']';

        $candidates = DB::select('
            SELECT id, document_id, content, metadata,
                   1 - (embedding <=> ?::vector) AS similarity
            FROM doc_chunks
            ORDER BY embedding <=> ?::vector
            LIMIT 30
        ', [$vector, $vector]);

        return $this->rerank($question, $candidates, $k);
    }

    private function rerank(string $question, array $candidates, int $k): array
    {
        $response = Http::withToken(config('rag.voyage_api_key'))
            ->post('https://api.voyageai.com/v1/rerank', [
                'model'     => 'rerank-2.5',
                'query'     => $question,
                'documents' => array_map(fn ($c) => $c->content, $candidates),
                'top_k'     => $k,
            ])->json();

        return collect($response['data'])
            ->map(fn ($r) => (object) array_merge(
                (array) $candidates[$r['index']],
                ['rerank_score' => $r['relevance_score']],
            ))
            ->all();
    }
}

Step 6 : Answer with citations

The prompt must instruct the model to cite chunks by ID. Citations are how you verify the answer is grounded; without them, RAG is just a slower chatbot.

 

// app/Services/Rag/Answerer.php
use Illuminate\Support\Str;

public function answer(string $question): array
{
    $chunks = $this->retriever->retrieve($question);

    $context = collect($chunks)->map(fn ($c) =>
        "[{$c->id}] " . str_replace("\n", ' ', $c->content)
    )->implode("\n\n");

    $response = $this->claude->messages->create([
        'model'      => 'claude-sonnet-4-6',
        'max_tokens' => 800,
        'system'     => 'Answer using only the context. Cite chunks like [123]. If the answer is not in the context, say so.',
        'messages'   => [
            ['role' => 'user', 'content' => "Context:\n{$context}\n\nQuestion: {$question}"],
        ],
    ]);

    return [
        'answer'   => $response->content[0]['text'],
        'sources'  => collect($chunks)->map(fn ($c) => [
            'id'      => $c->id,
            'snippet' => Str::limit($c->content, 200),
        ]),
    ];
}

Hybrid search : vector + keyword in one query

Pure vector search struggles with rare strings: error codes, SKUs, function names. Pure full-text struggles with paraphrasing. The answer is to run both and combine the scores. Postgres makes this almost embarrassingly easy:

 

SELECT id, content,
       0.6 * (1 - (embedding <=> $1::vector)) +
       0.4 * ts_rank_cd(content_tsv, plainto_tsquery('english', $2)) AS score
FROM doc_chunks
WHERE content_tsv @@ plainto_tsquery('english', $2)
   OR embedding <=> $1::vector < 0.4
ORDER BY score DESC
LIMIT 30;
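This query assumes a `content_tsv` column, which the migration above does not create. A stored generated column (Postgres 12+) keeps it in sync with `content` automatically, and a GIN index makes the `@@` match fast:

```sql
ALTER TABLE doc_chunks
    ADD COLUMN content_tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;

CREATE INDEX doc_chunks_content_tsv_gin
    ON doc_chunks USING gin (content_tsv);
```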

 

The 60/40 weighting is a sane default. Tune it on your eval set — domains heavy in identifiers (codebases, SKUs) tilt toward keyword; conceptual corpora (policies, narratives) tilt toward vector.
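If you instead fetch the vector and keyword candidates in two separate queries, the same fusion can be applied in PHP. A minimal sketch of the formula above (the function name is illustrative):

```php
// Mirrors the SQL scoring: similarity = 1 - cosine distance.
// $vectorWeight is the tunable 0.6 knob discussed in the text.
function hybridScore(float $cosineDistance, float $tsRank, float $vectorWeight = 0.6): float
{
    return $vectorWeight * (1.0 - $cosineDistance)
         + (1.0 - $vectorWeight) * $tsRank;
}
```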

Embedding model selection

| Model | Dim | Best for | Notes |
| --- | --- | --- | --- |
| voyage-3 | 1024 | General docs, English | Strong default, recommended by Anthropic |
| voyage-3-large | 2048 | High-precision retrieval | Higher cost, marginal gain unless quality is bottleneck |
| voyage-code-3 | 1024 | Codebases, API docs | Pick this for any technical content |
| text-embedding-3-large | 3072 | OpenAI ecosystems | Larger index footprint |

 

Important: never mix models in one index. Re-embed the entire corpus when you switch; embeddings from different models do not share a vector space.

Evaluating retrieval quality

You cannot improve what you do not measure. Build a small eval set of 30 to 50 question/expected-chunk pairs and track these every time you change the pipeline:

  • Recall@k : did the gold chunk appear in the top-k results?
  • MRR (Mean Reciprocal Rank) : how high in the list was it?
  • Citation accuracy : does the final answer actually cite the right chunks?
  • Refusal rate : how often does the model say "not in context" when the answer is there? (Indicates poor retrieval.)
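The first two metrics fit in a few lines of plain PHP. This sketch assumes each eval case records the gold chunk ID and the ranked IDs your retriever returned (the function name and array shape are illustrative):

```php
// Recall@k and MRR over an eval set.
// Each case: ['gold' => int, 'ranked' => int[]] (retriever output, best first).
function evaluateRetrieval(array $cases, int $k = 6): array
{
    $hits = 0;
    $reciprocalSum = 0.0;

    foreach ($cases as $case) {
        $rank = array_search($case['gold'], $case['ranked'], true);
        if ($rank !== false) {
            if ($rank < $k) {
                $hits++;                       // gold chunk appeared in top-k
            }
            $reciprocalSum += 1 / ($rank + 1); // $rank is 0-based
        }
    }

    $n = count($cases);

    return [
        'recall_at_k' => $hits / $n,
        'mrr'         => $reciprocalSum / $n,
    ];
}
```

Run it after every pipeline change; a chunking tweak that drops Recall@6 by ten points is cheaper to catch here than in user complaints.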

Cost & scale

| Stage | When it scales painfully | Mitigation |
| --- | --- | --- |
| Embedding ingest | Re-indexing the whole corpus | Hash chunk content; skip unchanged |
| HNSW index | Past ~10M vectors | Partition by tenant or document type |
| Reranker | Latency budget <500ms | Rerank only top 30, not top 100 |
| LLM answer | Long context = high cost | Top-6 chunks max, trim aggressively |

Tuning checklist

  • Chunk size 600 to 1000 tokens. Smaller for FAQs and Q&A; larger for technical docs and runbooks.
  • Always rerank. ANN-only retrieval misses semantically similar but irrelevant chunks. The cross-encoder pass is the single biggest quality unlock.
  • Hybrid search. Combine vector similarity with Postgres full-text (tsvector) for queries with rare keywords or proper nouns.
  • Cache embeddings by content hash. Re-indexing should never re-embed unchanged chunks. A SHA-256 column lets you skip 90% of an ingest run.
  • Track citation rate. If users flag answers that did not match cited chunks, your retrieval is the problem, not the model.
  • Add metadata filters. tenant_id, product_version, and language belong in the WHERE clause, not the prompt.
  • Strip boilerplate before embedding. Headers, footers, and "© 2026 Acme Corp" appear on every page and waste embedding signal.
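The content-hash idea from the checklist reduces to a pure function: given freshly chunked text and the hashes already stored, return only what needs embedding. The helper name is illustrative:

```php
// Return only the chunks whose SHA-256 is not already indexed,
// keyed by position so embeddings line up with chunks on insert.
function chunksToEmbed(array $chunks, array $knownHashes): array
{
    $known = array_flip($knownHashes); // O(1) hash lookups
    $pending = [];

    foreach ($chunks as $position => $content) {
        $hash = hash('sha256', $content);
        if (!isset($known[$hash])) {
            $pending[$position] = ['content' => $content, 'hash' => $hash];
        }
    }

    return $pending;
}
```

Store the hash in a column next to `embedding` and an unchanged document becomes a no-op ingest.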

RAG quality is a retrieval problem dressed as a generation problem. Fix retrieval and the answers fix themselves.

Postgres + pgvector is enough for the vast majority of real-world RAG workloads. It scales to tens of millions of chunks, runs on infrastructure you already operate, and lets you debug a slow query with EXPLAIN ANALYZE instead of vendor support tickets.
