Building your first RAG pipeline with Claude and pgvector

Published on April 8, 2026 · 5 min read

RAG
Claude
pgvector
PostgreSQL
TypeScript

Most RAG posts push you toward a new vector DB, a new framework, a new orchestrator. You don’t need any of that to ship v1. Postgres with pgvector plus the Claude API is enough to go live. It is also what I deploy for clients 9 times out of 10.

Here is the exact pipeline I ship when a team needs a domain Q&A feature on top of an existing Postgres.

Why pgvector and not a dedicated vector DB

Three reasons:

You already run Postgres. No extra service to operate, monitor, back up.
Transactional guarantees. Your vectors sit next to your business data, same DB, same permissions, same backup story.
It scales further than people claim. Millions of vectors with an HNSW index run fine on a mid-sized Postgres box.

I only reach for a managed vector DB above 50M vectors or when I need sub-50ms retrieval across regions. Below that, pgvector wins on simplicity. Every time.

The pipeline in four steps

Ingest documents and split them into chunks.
Embed each chunk and store the vector.
Retrieve the top-k chunks for a user query.
Generate an answer with Claude, grounded on the retrieved chunks.

Let’s walk through each one.

1. Schema and indexes

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  project_id UUID NOT NULL,
  source TEXT NOT NULL,
  title TEXT,
  created_at TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE chunks (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  document_id UUID REFERENCES documents(id) ON DELETE CASCADE,
  chunk_index INT NOT NULL,
  content TEXT NOT NULL,
  embedding vector(1536),
  token_count INT
);

CREATE INDEX chunks_embedding_hnsw
  ON chunks USING hnsw (embedding vector_cosine_ops);

CREATE INDEX chunks_document_id ON chunks(document_id);

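Use 1536 dimensions if you pair with OpenAI text-embedding-3-small, which returns 1536-dimensional vectors by default. The column size must match the model output, so change vector(1536) if you switch models. Anthropic does not offer a first-party embedding model yet, so I pair Claude for generation with OpenAI or Voyage for embeddings. No ideology, just what works.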
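A dimension mismatch otherwise only shows up as an insert error from Postgres. As a small sketch, a guard like the one below could fail earlier with a clearer message. The names EMBEDDING_DIM and assertEmbeddingDim are illustrative assumptions, not part of pgvector or the OpenAI SDK.

```typescript
// Assumption: this constant mirrors the vector(1536) column in the chunks table.
export const EMBEDDING_DIM = 1536;

// Hypothetical guard: returns the vector unchanged, or throws if its length
// does not match the column dimension.
export function assertEmbeddingDim(
  embedding: number[],
  expected: number = EMBEDDING_DIM,
): number[] {
  if (embedding.length !== expected) {
    throw new Error(
      `Expected ${expected} dimensions, got ${embedding.length}`,
    );
  }
  return embedding;
}
```

Calling it on each vector just before the insert keeps a model swap from failing halfway through an ingest job.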

2. Chunking and embedding

Chunking is where most RAG pipelines die silently. Don’t cut on arbitrary character counts. Split on paragraphs with overlap, or use a semantic splitter. Test your splits on real documents before writing any retrieval code.
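The ingest code in this section calls a splitByParagraph helper that the post does not show. Here is a minimal sketch of what such a helper might look like, under two loud assumptions: tokens are estimated at roughly 4 characters each (swap in a real tokenizer for accurate counts), and overlap is taken from the tail of the previous chunk at a word boundary. The types TextChunk and SplitOptions are also illustrative.

```typescript
export interface TextChunk {
  content: string;
  tokens: number; // estimated, not exact
}

export interface SplitOptions {
  maxTokens: number; // soft upper bound per chunk
  overlap: number; // approximate tokens carried over from the previous chunk
}

// Assumption: ~4 characters per token, a rough heuristic for English text.
export function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Take up to maxChars from the end of text, dropping a leading partial word.
function tailAtWordBoundary(text: string, maxChars: number): string {
  if (maxChars <= 0) return "";
  if (text.length <= maxChars) return text;
  const tail = text.slice(-maxChars);
  const firstSpace = tail.search(/\s/);
  return firstSpace === -1 ? tail : tail.slice(firstSpace + 1).trim();
}

export function splitByParagraph(text: string, opts: SplitOptions): TextChunk[] {
  const paragraphs = text
    .split(/\n\s*\n/) // blank lines separate paragraphs
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: TextChunk[] = [];
  let current: string[] = [];
  let currentTokens = 0;

  const flush = () => {
    if (current.length === 0) return;
    const content = current.join("\n\n");
    chunks.push({ content, tokens: estimateTokens(content) });
  };

  for (const para of paragraphs) {
    const paraTokens = estimateTokens(para);
    // Close the current chunk when adding this paragraph would exceed the budget.
    // A single paragraph larger than maxTokens is kept whole in this sketch.
    if (current.length > 0 && currentTokens + paraTokens > opts.maxTokens) {
      flush();
      const tail = tailAtWordBoundary(current.join("\n\n"), opts.overlap * 4);
      current = tail ? [tail] : [];
      currentTokens = estimateTokens(tail);
    }
    current.push(para);
    currentTokens += paraTokens;
  }
  flush();
  return chunks;
}
```

Run it on a few real documents and read the output by eye. If a chunk starts mid-thought or splits a table, fix the splitter before touching retrieval.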

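import OpenAI from "openai";
// Assumption: ./db exports both the database client and the chunks table
// definition, and splitByParagraph lives in a local helper module.
// Adjust these paths to your project.
import { db, chunksTable } from "./db";
import { splitByParagraph } from "./chunking";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

export async function ingestDocument(docId: string, text: string) {
  const chunks = splitByParagraph(text, { maxTokens: 500, overlap: 50 });
  if (chunks.length === 0) return; // avoid sending an empty input array

  // One batched embedding call for all chunks of the document
  const embeddings = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: chunks.map((c) => c.content),
  });

  // Insert inside a transaction so a failed ingest leaves no partial document
  await db.transaction(async (tx) => {
    for (let i = 0; i < chunks.length; i++) {
      await tx.insert(chunksTable).values({
        document_id: docId,
        chunk_index: i,
        content: chunks[i].content,
        embedding: embeddings.data[i].embedding,
        token_count: chunks[i].tokens,
      });
    }
  });
}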

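Batch the embedding calls. One call with 100 chunks is much faster than 100 sequential calls, because you pay the network round trip once instead of 100 times. Token cost is the same either way, since embeddings are billed per token. If you skip this, your ingest job will crawl.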
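For large documents a single call is not always possible, since embedding APIs typically cap the number of inputs and tokens per request. A rough sketch of fixed-size batching follows. The helper names toBatches and embedAll are illustrative, the batch size of 100 is an assumption to tune against your provider's limits, and the embedding call is injected as a function so the sketch does not depend on any particular SDK.

```typescript
// Split an array into consecutive batches of at most `size` items.
export function toBatches<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new Error("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical signature: takes a batch of texts, returns one vector per text,
// in the same order. Wrap your embedding client to match it.
export type EmbedBatch = (inputs: string[]) => Promise<number[][]>;

// Embed all texts with one request per batch, preserving input order.
export async function embedAll(
  texts: string[],
  embedBatch: EmbedBatch,
  batchSize = 100, // assumption: tune to your provider's per-request limits
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of toBatches(texts, batchSize)) {
    const result = await embedBatch(batch);
    if (result.length !== batch.length) {
      throw new Error(
        `Expected ${batch.length} vectors, got ${result.length}`,
      );
    }
    for (const vector of result) vectors.push(vector);
  }
  return vectors;
}
```

With the OpenAI client from the ingest code, embedBatch could be a thin wrapper that calls openai.embeddings.create and maps each item of the response data to its embedding.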

3. Retrieval with a twist

Naive retrieval returns nearest neighbors by cosine similarity. That’s the baseline. Two upgrades matter in practice:

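Metadata filtering before vector search. If the user is scoped to a project, filter on documents.project_id in the same query. Skipping this is how you leak one tenant’s data to another. One caveat: with an HNSW index the filter is applied after the index is scanned, so a very selective filter can return fewer rows than the LIMIT.
Reranking after retrieval. The top 20 from vector search are often noisy. A reranker (Cohere, Voyage, or a cross-encoder) rescores each candidate against the actual query and keeps the best k. Cheap, huge quality bump.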

export async function retrieve(query: string, projectId: string, k = 5) {
  const embeddingResponse = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: [query],
  });
  const queryVec = embeddingResponse.data[0].embedding;

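  // Vector search with metadata filter.
  // pgvector expects the text form "[0.1,0.2,...]". A raw JS array may be
  // serialized by the driver as a Postgres array literal ("{...}"), which
  // does not cast to vector, so serialize it explicitly.
  const candidates = await db.query(
    `
    SELECT c.id, c.content, d.title
    FROM chunks c
    JOIN documents d ON d.id = c.document_id
    WHERE d.project_id = $1
    ORDER BY c.embedding <=> $2::vector
    LIMIT 20
    `,
    [projectId, JSON.stringify(queryVec)],
  );

  // Rerank for better precision and keep the best k.
  // rerank is a helper that is not shown in this post.
  return rerank(query, candidates, k);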
}
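The rerank helper called above is not shown in the post. One possible shape, as a sketch only: the relevance scorer is injected as a function, so it can be backed by Cohere, Voyage, or a local cross-encoder without changing the sorting logic. The names Candidate, Scorer, and topKByScore are assumptions, and this version takes the scorer as a fourth argument, unlike the three-argument call in retrieve.

```typescript
export interface Candidate {
  id: string;
  content: string;
  title: string | null;
}

// Hypothetical scorer: returns one relevance score per document,
// in the same order as `docs`. Higher means more relevant.
export type Scorer = (query: string, docs: string[]) => Promise<number[]>;

// Pure helper: keep the k items with the highest scores, best first.
export function topKByScore<T>(items: T[], scores: number[], k: number): T[] {
  if (items.length !== scores.length) {
    throw new Error("items and scores must have the same length");
  }
  return items
    .map((item, i) => ({ item, score: scores[i] }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((entry) => entry.item);
}

// Score every candidate against the query, then keep the best k.
export async function rerank(
  query: string,
  candidates: Candidate[],
  k: number,
  scorer: Scorer,
): Promise<Candidate[]> {
  if (candidates.length === 0) return [];
  const scores = await scorer(
    query,
    candidates.map((c) => c.content),
  );
  return topKByScore(candidates, scores, k);
}
```

Keeping the sort separate from the scoring call makes the ordering logic testable without network access, and lets you swap rerankers while comparing retrieval hit rate.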

4. Generation with Claude

Claude shines here because of its long context window and its willingness to follow grounding instructions. Pass the retrieved chunks, give crisp instructions, require citations.

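import Anthropic from "@anthropic-ai/sdk";

const claude = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Shape of the rows returned by retrieve(); title is nullable in the schema
type Chunk = { id: string; content: string; title: string | null };

export async function answer(query: string, chunks: Chunk[]) {
  // Number each chunk so the model can cite it as [1], [2], ...
  const context = chunks
    .map((c, i) => `[${i + 1}] ${c.title ?? "Untitled"}\n${c.content}`)
    .join("\n\n---\n\n");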

  const response = await claude.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    system: `You are a precise assistant. Answer the user's question using ONLY the provided context. Cite sources with [1], [2] markers. If the context does not contain the answer, say so.`,
    messages: [
      {
        role: "user",
        content: `Context:\n${context}\n\nQuestion: ${query}`,
      },
    ],
  });

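  // Return the first text block, or an empty string if the response has none
  const textBlock = response.content.find((block) => block.type === "text");
  return textBlock?.type === "text" ? textBlock.text : "";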
}

Two prompt rules I never drop:

Explicit grounding: “use ONLY the provided context”.
Explicit fallback: “if the context does not contain the answer, say so”. Skip this and Claude will fill the gap with confident nonsense.
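Requiring citations only pays off if you check them. The following is a sketch of a post-processing step that pulls the [n] markers out of an answer and flags any that point outside the chunks actually passed in. The function name extractCitations and its return shape are assumptions for illustration.

```typescript
export interface CitationCheck {
  cited: number[]; // valid 1-based chunk numbers found in the answer
  invalid: number[]; // markers that do not match any provided chunk
}

// Collect [n] markers from the answer and split them into valid and invalid
// references, given how many chunks were included in the prompt.
export function extractCitations(
  answer: string,
  chunkCount: number,
): CitationCheck {
  const cited = new Set<number>();
  const invalid = new Set<number>();
  const pattern = /\[(\d+)\]/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(answer)) !== null) {
    const n = Number(match[1]);
    if (n >= 1 && n <= chunkCount) {
      cited.add(n);
    } else {
      invalid.add(n);
    }
  }
  const ascending = (a: number, b: number) => a - b;
  return {
    cited: Array.from(cited).sort(ascending),
    invalid: Array.from(invalid).sort(ascending),
  };
}
```

An answer with no valid citations, or with any invalid ones, is a good candidate for logging and for the weekly human spot check.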

What to measure in production

Once live, track these four:

Retrieval hit rate. On a sample of real queries, is the chunk that contains the correct answer in the top-k? If not, chunking or embeddings are the problem, not the model.
Groundedness. Are answers actually based on retrieved chunks? LLM-as-judge plus weekly human spot checks.
Latency breakdown. Embedding call, vector search, generation. Optimize the slowest.
Cost per query. Embeddings plus generation tokens. A chatty system prompt can double your bill overnight.
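Two of these metrics reduce to simple arithmetic once you log the right fields. Below is a sketch of hit rate at k and cost per query. The types EvalCase, Usage, and Prices are assumptions about what you log, and prices are passed in as parameters (USD per million tokens) rather than hard-coded, because they change and differ by model.

```typescript
export interface EvalCase {
  expectedChunkId: string; // chunk a human labeled as containing the answer
  retrievedChunkIds: string[]; // ids returned by retrieval, best first
}

// Fraction of cases where the expected chunk appears in the first k results.
export function hitRateAtK(cases: EvalCase[], k: number): number {
  if (cases.length === 0) return 0;
  const hits = cases.filter(
    (c) => c.retrievedChunkIds.slice(0, k).indexOf(c.expectedChunkId) !== -1,
  ).length;
  return hits / cases.length;
}

export interface Usage {
  embeddingTokens: number; // tokens sent to the embedding model
  inputTokens: number; // prompt tokens sent to the generation model
  outputTokens: number; // tokens generated in the answer
}

// Assumption: all prices are in USD per million tokens,
// taken from your providers' current pricing pages.
export interface Prices {
  embeddingPerMTok: number;
  inputPerMTok: number;
  outputPerMTok: number;
}

// Total cost of one query in USD.
export function costPerQuery(usage: Usage, prices: Prices): number {
  const total =
    usage.embeddingTokens * prices.embeddingPerMTok +
    usage.inputTokens * prices.inputPerMTok +
    usage.outputTokens * prices.outputPerMTok;
  return total / 1e6;
}
```

If you use a paid reranker, add its per-request cost as another term. Tracking hit rate at several values of k also shows whether a larger k or a better reranker is the cheaper fix.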

What’s next

Once v1 is running, the usual upgrades are hybrid search (BM25 + vectors), query rewriting (turn vague user input into better retrieval queries), and agentic retrieval (let Claude decide what to search for). All three are worth it — but only after you have a boring, grounded v1 in production and real usage data. Ship first, optimize on evidence.
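For hybrid search, the post does not say how to combine the BM25 and vector result lists. One common option is reciprocal rank fusion, sketched here as an assumption rather than the author's method. Each document scores the sum of 1 / (k + rank) over the lists it appears in, with k = 60 as the conventional default.

```typescript
// Reciprocal rank fusion over several ranked lists of document ids.
// Each list is ordered best first; rank starts at 1.
export function reciprocalRankFusion(
  rankings: string[][],
  k = 60, // smoothing constant; 60 is the commonly used default
): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map((entry) => entry[0]);
}
```

Because it uses only ranks, this avoids having to normalize BM25 scores against cosine distances, which live on different scales.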

If you want a second pair of eyes on your RAG architecture, I run 30-minute scoping calls. Bring a real query log.