Most "talk to your docs" demos break the moment you point them at a real corpus. The reason is almost never the model it is the retrieval.
A good Retrieval-Augmented Generation (RAG) system spends 80% of its effort on chunking, embedding, indexing, and ranking, and only 20% on the prompt. If you skip that work, you get hallucinations dressed up as confidence: fluent answers that sound correct and quietly invent facts that are not in your documents.
This guide builds a production-quality RAG pipeline in Laravel using PostgreSQL with pgvector for storage and search. No vector database service, no LangChain, no Python sidecar. Just SQL you can read, PHP you can debug, and a queue you already operate. By the end you will have a system that ingests documents, chunks them intelligently, embeds them in batches, retrieves with hybrid search, reranks with a cross-encoder, and produces citation-grounded answers from Claude.
Why pgvector instead of a managed vector DB
The hosted vector database market is crowded. Here is the honest comparison for a Laravel team that already runs Postgres:
| Option | Strengths | Tradeoffs |
|---|---|---|
| pgvector | Same DB, ACID, joins, full-text + vector hybrid in one query, free | Manual tuning past ~50M vectors |
| Pinecone | Managed, scales horizontally, fast | Per-vector pricing, separate system to operate, no joins |
| Qdrant | Open source, payload filtering, fast HNSW | Another service to deploy and back up |
| Weaviate | Built-in vectorization, GraphQL API | Heavier ops, opinionated schema |
For 95% of Laravel apps, pgvector wins on operational simplicity alone. You already know how to back up Postgres, restore it, run migrations, monitor query plans, and join across tables.
That last point is underrated: being able to filter retrieval by tenant ID or document ACL in the same SQL statement is a feature you do not appreciate until you do not have it.
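To make that concrete, here is a sketch of a tenant-scoped ANN query. It assumes a tenant_id column on doc_chunks, which the Step 1 migration below does not add; treat it as an illustration of the pattern rather than part of the pipeline:
// Hypothetical: ANN search and access control in one statement.
// Assumes doc_chunks has a tenant_id column (not in the Step 1 migration).
$rows = DB::select('
    SELECT id, content
    FROM doc_chunks
    WHERE tenant_id = ?
    ORDER BY embedding <=> ?::vector
    LIMIT 10
', [$tenantId, $queryVector]);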
Architecture
Ingest: Document → Chunker → Embedding API → pgvector index
Query:  User question → Embed → ANN search over that index (top-k) → Reranker → LLM answer
Each stage is independently swappable. You can swap the chunker without touching retrieval (you re-chunk and re-embed), swap the embedding model without rewriting the retrieval code, and swap the LLM without touching the index. That separation is the whole reason the pipeline is worth building from parts.
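If you want those seams explicit in code, small contracts work well. The interfaces below are only a sketch of that idea; the later snippets use concrete classes directly and do not depend on them:
// Hypothetical contracts that make the swappable seams explicit.
// The concrete classes in Steps 2-5 could implement these, but do not have to.
namespace App\Services\Rag\Contracts;

interface ChunkerContract
{
    /** @return string[] */
    public function chunk(string $text): array;
}

interface EmbedderContract
{
    /** @return array<int, array<float>> one vector per input text */
    public function embed(array $texts, string $type = 'document'): array;
}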
Step 1 : Enable pgvector
On Postgres 16+, most managed providers ship the pgvector extension. Enable it once:
CREATE EXTENSION IF NOT EXISTS vector;
Then create the chunks table. vector(1024) matches Voyage's voyage-3 embedding size; adjust if you use a different model.
// database/migrations/2026_05_11_000000_create_doc_chunks_table.php
public function up(): void
{
Schema::create('doc_chunks', function (Blueprint $table) {
$table->id();
$table->foreignId('document_id')->constrained()->cascadeOnDelete();
$table->unsignedInteger('position');
$table->text('content');
$table->jsonb('metadata')->default('{}');
$table->timestamps();
});
DB::statement('ALTER TABLE doc_chunks ADD COLUMN embedding vector(1024)');
DB::statement('CREATE INDEX doc_chunks_embedding_hnsw ON doc_chunks USING hnsw (embedding vector_cosine_ops)');
}
HNSW indexes give you sub-50ms ANN queries on millions of rows. Use HNSW instead of ivfflat unless you have a hard memory limit.
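HNSW also exposes two build-time knobs (m, ef_construction) and a query-time knob (hnsw.ef_search) if you need to trade memory and build time for recall. The values below are pgvector's defaults, shown only so you know where the dials are; if you do set build parameters, use this in place of the plain CREATE INDEX in the migration above:
// Same index as in the migration, with explicit build parameters
// (m = 16 and ef_construction = 64 are pgvector's defaults).
DB::statement('
    CREATE INDEX doc_chunks_embedding_hnsw
    ON doc_chunks USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64)
');

// Query-time recall vs. speed tradeoff, per connection (pgvector default is 40).
DB::statement('SET hnsw.ef_search = 40');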
Step 2 : Smart chunking
Chunking is the part most teams underbuild. Splitting on a fixed character count tears sentences apart. Use a chunker that respects paragraph boundaries and carries a small overlap between chunks; for paragraphs longer than the target size, a sentence-level fallback (sketched after the class below) keeps chunks near the target.
// app/Services/Rag/Chunker.php
namespace App\Services\Rag;
class Chunker
{
public function __construct(
private int $targetSize = 800,
private int $overlap = 120,
) {}
public function chunk(string $text): array
{
$text = preg_replace("/\r\n|\r/", "\n", trim($text));
$paragraphs = preg_split("/\n\s*\n/", $text);
$chunks = [];
$buffer = '';
foreach ($paragraphs as $paragraph) {
$paragraph = trim($paragraph);
if ($paragraph === '') continue;
if (strlen($buffer) + strlen($paragraph) + 2 > $this->targetSize && $buffer !== '') {
$chunks[] = $buffer;
$tail = substr($buffer, max(0, strlen($buffer) - $this->overlap));
$buffer = $tail . "\n\n" . $paragraph;
} else {
$buffer = $buffer === '' ? $paragraph : $buffer . "\n\n" . $paragraph;
}
}
if ($buffer !== '') {
$chunks[] = $buffer;
}
return $chunks;
}
}
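One gap worth knowing about: the class above splits only on paragraph boundaries, so a single paragraph longer than the target size becomes one oversized chunk. If your corpus has long unbroken paragraphs, a sentence-level fallback helps. Here is a minimal sketch you could fold into the class; the regex is a rough heuristic, not a real sentence tokenizer:
// Hypothetical helper for the Chunker class above: split an oversized
// paragraph on sentence boundaries so pieces stay near the target size.
private function splitLongParagraph(string $paragraph): array
{
    $sentences = preg_split('/(?<=[.!?])\s+/', $paragraph) ?: [$paragraph];
    $pieces = [];
    $buffer = '';
    foreach ($sentences as $sentence) {
        if ($buffer !== '' && strlen($buffer) + strlen($sentence) + 1 > $this->targetSize) {
            $pieces[] = $buffer;
            $buffer = $sentence;
        } else {
            $buffer = $buffer === '' ? $sentence : $buffer . ' ' . $sentence;
        }
    }
    if ($buffer !== '') {
        $pieces[] = $buffer;
    }
    return $pieces;
}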
Step 3 : Embed in batches
Embedding APIs charge per token but reward batching. Always send 32 to 128 chunks per request.
// app/Services/Rag/Embedder.php
namespace App\Services\Rag;
use Illuminate\Support\Facades\Http;
class Embedder
{
public function embed(array $texts, string $type = 'document'): array
{
$response = Http::withToken(config('rag.voyage_api_key'))
->timeout(60)
->post('https://api.voyageai.com/v1/embeddings', [
'model' => 'voyage-3',
'input' => $texts,
'input_type' => $type,
])->throw()->json();
return collect($response['data'])->pluck('embedding')->all();
}
}
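The snippets in this guide read the API key through config('rag.voyage_api_key'). That key is not defined by any package; a minimal config file you would add yourself might look like this, with the extra entries purely optional:
// config/rag.php -- minimal config assumed by the snippets in this guide.
return [
    'voyage_api_key' => env('VOYAGE_API_KEY'),

    // Optional: keep model and dimension in config so a model swap is one change.
    'embedding_model' => env('RAG_EMBEDDING_MODEL', 'voyage-3'),
    'embedding_dimensions' => (int) env('RAG_EMBEDDING_DIMENSIONS', 1024),
];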
Step 4 : Indexing job
Indexing is async work. A queued job keeps ingestion off your request path and lets you retry safely.
// app/Jobs/IndexDocument.php
public function handle(Chunker $chunker, Embedder $embedder): void
{
$document = Document::findOrFail($this->documentId);
$chunks = $chunker->chunk($document->raw_text);
$embeddings = collect($chunks)->chunk(64)->flatMap(
fn ($batch) => $embedder->embed($batch->values()->all(), 'document')
)->all();
DB::transaction(function () use ($document, $chunks, $embeddings) {
$document->chunks()->delete();
foreach ($chunks as $i => $content) {
DocChunk::create([
'document_id' => $document->id,
'position' => $i,
'content' => $content,
'metadata' => ['source' => $document->source_url],
'embedding' => '[' . implode(',', $embeddings[$i]) . ']',
]);
}
});
}
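"Retry safely" deserves to be concrete. A skeleton around that handle() method might look like the following; the tries and backoff values are illustrative, not tuned:
// app/Jobs/IndexDocument.php -- illustrative skeleton around the handle() shown above.
namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

class IndexDocument implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public int $tries = 3;                  // retry transient embedding API failures
    public array $backoff = [30, 120, 600]; // seconds to wait between attempts

    public function __construct(public int $documentId) {}

    // handle(Chunker $chunker, Embedder $embedder) as shown above
}
Dispatch it with IndexDocument::dispatch($document->id) wherever documents are created or re-uploaded.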
Step 5 : Retrieval with reranking
Vector search gives you fast recall. A cross-encoder reranker gives you precision. Pull the top 30 by ANN, then rerank down to the top 6; this single step is the biggest quality win in most RAG systems.
// app/Services/Rag/Retriever.php
namespace App\Services\Rag;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Http;
class Retriever
{
public function __construct(private Embedder $embedder) {}
public function retrieve(string $question, int $k = 6): array
{
[$queryEmbedding] = $this->embedder->embed([$question], 'query');
$vector = '[' . implode(',', $queryEmbedding) . ']';
$candidates = DB::select('
SELECT id, document_id, content, metadata,
1 - (embedding <=> ?::vector) AS similarity
FROM doc_chunks
ORDER BY embedding <=> ?::vector
LIMIT 30
', [$vector, $vector]);
return $this->rerank($question, $candidates, $k);
}
private function rerank(string $question, array $candidates, int $k): array
{
$response = Http::withToken(config('rag.voyage_api_key'))
->post('https://api.voyageai.com/v1/rerank', [
'model' => 'rerank-2.5',
'query' => $question,
'documents' => array_map(fn ($c) => $c->content, $candidates),
'top_k' => $k,
])->throw()->json();
return collect($response['data'])
->map(fn ($r) => (object) array_merge(
(array) $candidates[$r['index']],
['rerank_score' => $r['relevance_score']],
))
->all();
}
}
Step 6 : Answer with citations
The prompt must instruct the model to cite chunks by ID. Citations are how you verify the answer is grounded; without them, RAG is just a slower chatbot.
// app/Services/Rag/Answerer.php
public function answer(string $question): array
{
$chunks = $this->retriever->retrieve($question);
$context = collect($chunks)->map(fn ($c, $i) =>
"[{$c->id}] " . str_replace("\n", ' ', $c->content)
)->implode("\n\n");
$response = $this->claude->messages->create([
'model' => 'claude-sonnet-4-6',
'max_tokens' => 800,
'system' => 'Answer using only the context. Cite chunks like [123]. If the answer is not in the context, say so.',
'messages' => [
['role' => 'user', 'content' => "Context:\n{$context}\n\nQuestion: {$question}"],
],
]);
return [
'answer' => $response->content[0]['text'],
'sources' => collect($chunks)->map(fn ($c) => [
'id' => $c->id,
'snippet' => Str::limit($c->content, 200),
]),
];
}
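Exposing this over HTTP is ordinary Laravel. A hypothetical controller action, with routing, auth, and rate limiting left to you:
// Hypothetical controller action; wire up the route, auth, and throttling yourself.
public function ask(Request $request, Answerer $answerer): JsonResponse
{
    $validated = $request->validate([
        'question' => ['required', 'string', 'max:2000'],
    ]);

    return response()->json($answerer->answer($validated['question']));
}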
Hybrid search : vector + keyword in one query
Pure vector search struggles with rare strings: error codes, SKUs, function names. Pure full-text struggles with paraphrasing. The answer is to run both and combine the scores. Postgres makes this almost embarrassingly easy:
SELECT id, content,
0.6 * (1 - (embedding <=> $1::vector)) +
0.4 * ts_rank_cd(content_tsv, plainto_tsquery('english', $2)) AS score
FROM doc_chunks
WHERE content_tsv @@ plainto_tsquery('english', $2)
OR embedding <=> $1::vector < 0.4
ORDER BY score DESC
LIMIT 30;
The 60/40 weighting is a sane default. Tune it on your eval set — domains heavy in identifiers (codebases, SKUs) tilt toward keyword; conceptual corpora (policies, narratives) tilt toward vector.
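One prerequisite: the query above assumes a content_tsv column, which the Step 1 migration does not create. A generated column plus a GIN index keeps it in sync with content automatically; a migration sketch:
// Adds the tsvector column the hybrid query expects (not part of the Step 1 migration).
DB::statement("
    ALTER TABLE doc_chunks
    ADD COLUMN content_tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('english', content)) STORED
");
DB::statement('CREATE INDEX doc_chunks_content_tsv_gin ON doc_chunks USING gin (content_tsv)');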
Embedding model selection
| Model | Dim | Best for | Notes |
|---|---|---|---|
| voyage-3 | 1024 | General docs, English | Strong default, recommended by Anthropic |
| voyage-3-large | 2048 | High-precision retrieval | Higher cost, marginal gain unless quality is bottleneck |
| voyage-code-3 | 1024 | Codebases, API docs | Pick this for any technical content |
| text-embedding-3-large | 3072 | OpenAI ecosystems | Larger index footprint |
Evaluating retrieval quality
You cannot improve what you do not measure. Build a small eval set of 30 to 50 question/expected-chunk pairs and track these every time you change the pipeline (a minimal scoring sketch follows the list):
- Recall@k : did the gold chunk appear in the top-k results?
- MRR (Mean Reciprocal Rank) : how high in the list was it?
- Citation accuracy : does the final answer actually cite the right chunks?
- Refusal rate : how often does the model say "not in context" when the answer is there? (Indicates poor retrieval.)
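The scoring harness does not need to be elaborate. Assuming an eval set shaped as ['question' => ..., 'gold_chunk_id' => ...] pairs (a shape chosen here purely for illustration), Recall@k and MRR reduce to a few lines:
// Minimal eval sketch; the $evalSet shape is an assumption made for this example.
$k = 6;
$recallHits = 0;
$reciprocalRanks = [];

foreach ($evalSet as $case) {
    $results = $retriever->retrieve($case['question'], $k);
    $rank = collect($results)->search(
        fn ($c) => (int) $c->id === (int) $case['gold_chunk_id']
    );

    $recallHits += $rank !== false ? 1 : 0;
    $reciprocalRanks[] = $rank !== false ? 1 / ($rank + 1) : 0;
}

$recallAtK = $recallHits / count($evalSet);                    // Recall@k
$mrr = array_sum($reciprocalRanks) / count($reciprocalRanks);  // MRR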
Cost & scale
| Stage | When it scales painfully | Mitigation |
|---|---|---|
| Embedding ingest | Re-indexing the whole corpus | Hash chunk content; skip unchanged |
| HNSW index | Past ~10M vectors | Partition by tenant or document type |
| Reranker | Latency budget <500ms | Rerank only top 30, not top 100 |
| LLM answer | Long context = high cost | Top-6 chunks max, trim aggressively |
Tuning checklist
- Chunk size 600 to 1000 tokens. Smaller for FAQs and Q&A; larger for technical docs and runbooks.
- Always rerank. ANN-only retrieval misses semantically similar but irrelevant chunks. The cross-encoder pass is the single biggest quality unlock.
- Hybrid search. Combine vector similarity with Postgres full-text (tsvector) for queries with rare keywords or proper nouns.
- Cache embeddings by content hash. Re-indexing should never re-embed unchanged chunks. A SHA-256 column lets you skip 90% of an ingest run (sketched after this list).
- Track citation rate. If users flag answers that did not match cited chunks, your retrieval is the problem, not the model.
- Add metadata filters. tenant_id, product_version, and language belong in the WHERE clause, not the prompt.
- Strip boilerplate before embedding. Headers, footers, and "© 2026 Acme Corp" appear on every page and waste embedding signal.
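A minimal version of that hash check, assuming a content_hash column on doc_chunks (not in the Step 1 migration) and an upsert-style ingest instead of the delete-all approach in the indexing job above:
// Sketch: only embed chunks whose content actually changed.
// Assumes a content_hash column and per-position upserts instead of delete-all.
$existingHashes = DocChunk::where('document_id', $document->id)
    ->pluck('content_hash', 'position');

$toEmbed = collect($chunks)->reject(
    fn ($content, $i) => $existingHashes->get($i) === hash('sha256', $content)
);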
RAG quality is a retrieval problem dressed as a generation problem. Fix retrieval and the answers fix themselves.
Postgres + pgvector is enough for the vast majority of real-world RAG workloads. It scales to tens of millions of chunks, runs on infrastructure you already operate, and lets you debug a slow query with EXPLAIN ANALYZE instead of vendor support tickets.