AI Engineering · Mar 1, 2026 · 9 min read

RAG Pipeline Architecture for Production: What Actually Works

Most RAG demos look great. Most RAG production systems fail silently: wrong chunks, hallucinated answers, P95 latency spikes, and accuracy that erodes as the knowledge base grows. Here's what the architecture looks like when it actually has to work.

What RAG is and why the naive version breaks

Retrieval-Augmented Generation (RAG) is the pattern of fetching relevant context from a knowledge base and injecting it into an LLM prompt before generating an answer. The basic version is simple: embed the query, find similar chunks in a vector DB, stuff them in the prompt.
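To make the naive version concrete, here is a minimal sketch of it: brute-force cosine similarity over in-memory chunks, then prompt stuffing. The names (`StoredChunk`, `retrieveTopK`, `buildNaivePrompt`) are illustrative, not from any library, and a real system would use a vector DB's ANN index instead of a linear scan.

```typescript
// Naive RAG retrieval, sketched in memory. All names here are hypothetical.
interface StoredChunk {
  id: string;
  text: string;
  embedding: number[];
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Linear scan: score every chunk, keep the k most similar.
function retrieveTopK(queryEmbedding: number[], chunks: StoredChunk[], k: number): StoredChunk[] {
  return chunks
    .map(chunk => ({ chunk, score: cosineSimilarity(queryEmbedding, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map(scored => scored.chunk);
}

// "Stuff them in the prompt": no dedup, no budget, no abstention instruction.
function buildNaivePrompt(question: string, context: StoredChunk[]): string {
  const contextText = context.map(c => c.text).join("\n\n");
  return `Context:\n${contextText}\n\nQuestion: ${question}\nAnswer:`;
}
```

Everything the rest of this article adds (chunking strategy, hybrid retrieval, reranking, fallbacks) is a response to what this sketch leaves out.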

The basic version has predictable failure modes at production scale:

Chunking strategy that splits mid-sentence or mid-concept, destroying semantic coherence
Retrieval that returns topically similar but contextually wrong chunks
Context windows that overflow silently, truncating the most relevant parts
No reranking layer, so the first-retrieved results are the ones used regardless of quality
No fallback when retrieval confidence is low, so the model hallucinates rather than saying "I don't know"
Embedding model drift when you update your embedding model but keep vectors produced by the old one

None of these fail loudly. Your system returns answers. They just happen to be wrong, or occasionally wrong, or confidently wrong in ways your users notice before you do.

The architecture that actually holds up

Production RAG is a pipeline with distinct stages, each with its own job and its own latency budget. Here's the structure:

1. Ingestion: Parse, clean, chunk, embed, store (offline / batch)
2. Retrieval: Embed query, ANN search, metadata filter (P50 < 80ms)
3. Reranking: Cross-encoder or LLM-based relevance scoring (P50 < 200ms)
4. Context assembly: Deduplicate, truncate, format for prompt (in-process)
5. Generation: Prompt + context → LLM → structured output (P50 < 1.5s)
6. Grounding check: Verify answer cites retrieved context (optional / high-stakes)
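Stage 4 (context assembly) is where silent context overflow gets prevented, so it's worth a sketch. The version below deduplicates, fills a token budget starting with the highest-scoring chunks, and reports what it dropped instead of truncating quietly. `RankedChunk`, `assembleContext`, and the 4-characters-per-token estimate are assumptions for illustration; use your model's real tokenizer for actual budgets.

```typescript
// Context assembly sketch. Types and function names are hypothetical.
interface RankedChunk {
  id: string;
  text: string;
  source: string; // e.g. document title or URL from chunk metadata
  score: number;  // retrieval or reranker score, higher is better
}

interface AssembledContext {
  context: string;
  used: RankedChunk[];
  dropped: RankedChunk[]; // chunks that did not fit the budget
}

// Assumption: roughly 4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function assembleContext(chunks: RankedChunk[], maxTokens: number): AssembledContext {
  const seen = new Set<string>();
  const blocks: string[] = [];
  const used: RankedChunk[] = [];
  const dropped: RankedChunk[] = [];
  let remaining = maxTokens;

  // Highest score first, so anything cut for budget is the least relevant.
  const ordered = chunks.slice().sort((a, b) => b.score - a.score);

  for (const chunk of ordered) {
    // Skip exact duplicates after case and whitespace normalization.
    const key = chunk.text.trim().toLowerCase().replace(/\s+/g, " ");
    if (seen.has(key)) continue;
    seen.add(key);

    const block = `[${used.length + 1}] (${chunk.source})\n${chunk.text.trim()}`;
    const cost = estimateTokens(block);
    if (cost > remaining) {
      dropped.push(chunk); // record the overflow instead of truncating silently
      continue;
    }
    remaining -= cost;
    used.push(chunk);
    blocks.push(block);
  }

  return { context: blocks.join("\n\n"), used, dropped };
}
```

Logging `dropped.length` per request is a cheap way to notice when the knowledge base has outgrown the context budget.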

Chunking: where most teams get it wrong

Fixed-size chunking (every 512 tokens, no overlap) is the default in most tutorials. It's fine for demos. It fails for production knowledge bases because it breaks semantic units (a paragraph mid-argument, a code block mid-function, a list missing its header).

What works better in production:

Semantic chunking: split on natural boundaries (paragraphs, headers, sections), not fixed token or character counts
Overlap chunking: 10–20% overlap between adjacent chunks preserves cross-boundary context
Hierarchical chunks: store both a small chunk (for precise retrieval) and its parent section (for context injection)
Metadata-rich chunks: attach document title, section header, URL, and date to every chunk at index time

// Hierarchical chunk structure (TypeScript)
interface Chunk {
  id: string;
  content: string;          // small chunk for retrieval
  parentContent: string;    // parent section for context
  metadata: {
    documentId: string;
    title: string;
    section: string;
    url: string;
    updatedAt: string;
    chunkIndex: number;
  };
  embedding: number[];      // stored in vector DB
  parentEmbedding?: number[]; // optional, for parent retrieval
}
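As a rough illustration of semantic chunking with overlap, the sketch below splits on blank lines (paragraph boundaries), packs paragraphs into chunks up to a size limit, and carries the last paragraph forward into the next chunk. `chunkByParagraph`, `maxChars`, and `overlapParagraphs` are hypothetical names; measuring size in characters and overlap in whole paragraphs are simplifications of the token-based, 10–20% overlap described above.

```typescript
// Paragraph-boundary chunking with overlap. Names are illustrative only.
interface TextChunk {
  content: string;
  chunkIndex: number;
}

function chunkByParagraph(text: string, maxChars: number, overlapParagraphs = 1): TextChunk[] {
  // Split on blank lines so no paragraph is ever cut in the middle.
  const paragraphs = text
    .split(/\n\s*\n/)
    .map(p => p.trim())
    .filter(p => p.length > 0);

  const chunks: TextChunk[] = [];
  let current: string[] = [];

  const flush = (): void => {
    if (current.length === 0) return;
    chunks.push({ content: current.join("\n\n"), chunkIndex: chunks.length });
  };

  for (const paragraph of paragraphs) {
    const candidateLength = current.join("\n\n").length + 2 + paragraph.length;
    if (current.length > 0 && candidateLength > maxChars) {
      flush();
      // Carry trailing paragraphs forward to preserve cross-boundary context.
      current = overlapParagraphs > 0 ? current.slice(-overlapParagraphs) : [];
    }
    // A single oversized paragraph still becomes its own chunk rather than being split.
    current.push(paragraph);
  }
  flush();

  return chunks;
}
```

Each resulting chunk would then be embedded and stored with its parent section and metadata, in the shape of the `Chunk` interface above.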

Retrieval: ANN search is just the start

Vector similarity search gets you candidate chunks. It does not get you the right chunks. The gap between "semantically similar" and "contextually relevant" is where RAG accuracy lives.

Production retrieval should layer three things:

Dense retrieval: standard vector similarity (cosine, dot product) against your embedded knowledge base
Sparse retrieval: BM25 or keyword search to catch exact-match terms that embeddings handle poorly (product names, version numbers, proper nouns)
Hybrid fusion: RRF (Reciprocal Rank Fusion) combines dense and sparse results into a single ranked list

Pinecone, Qdrant, and Weaviate all support hybrid search natively now. Use it. On benchmarks, hybrid consistently beats pure vector retrieval by 10–25% on recall@10.
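If you ever need to fuse results yourself (for example, dense results from one store and BM25 from another), RRF is only a few lines. This is a minimal sketch: `reciprocalRankFusion` is an illustrative name, the inputs are ranked lists of chunk IDs, and `k = 60` is the constant commonly used for RRF, which you may want to tune.

```typescript
// Reciprocal Rank Fusion: score(id) = sum over lists of 1 / (k + rank), rank starting at 1.
function reciprocalRankFusion(
  rankings: string[][],
  k = 60
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();

  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }

  // Highest fused score first.
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Usage: fuse a dense ranking with a sparse (keyword) ranking.
const denseIds = ["a", "b", "c"];
const sparseIds = ["b", "d", "a"];
const fused = reciprocalRankFusion([denseIds, sparseIds]);
```

RRF uses only rank positions, so it sidesteps the problem that cosine scores and BM25 scores live on incomparable scales.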

Reranking: the layer most teams skip

After retrieval you have 10–20 candidate chunks. A reranker scores each chunk against the original query using a cross-encoder model, which reads the query and the chunk together instead of comparing two independently computed vectors. That makes it slower but far more accurate than the bi-encoder used for retrieval. You keep the top 3–5 and discard the rest.

The latency cost is real (50–200ms). The accuracy gain is also real. In my production deployments, adding a reranker improved answer quality measurably in user evaluations. For systems where accuracy matters (medical, legal, financial, enterprise), it's not optional.

Options: Cohere Rerank, BGE-Reranker, cross-encoder/ms-marco-MiniLM. For most use cases, Cohere Rerank via API is the fastest path to production.
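The reranking step itself has a simple shape regardless of which model you pick: score every (query, chunk) pair, sort, keep the top few. The sketch below assumes a pluggable `scorePair` function; `termOverlapScore` is a toy stand-in so the example runs, not a real cross-encoder, and in practice the scorer would be an async call to a hosted or local reranker.

```typescript
// Rerank-and-truncate sketch. All names are hypothetical.
interface Candidate {
  id: string;
  text: string;
}

// Scores one query/passage pair jointly. Higher means more relevant.
type PairScorer = (query: string, passage: string) => number;

function rerank(
  query: string,
  candidates: Candidate[],
  scorePair: PairScorer,
  keep: number
): Array<Candidate & { rerankScore: number }> {
  return candidates
    .map(c => ({ ...c, rerankScore: scorePair(query, c.text) }))
    .sort((a, b) => b.rerankScore - a.rerankScore)
    .slice(0, keep);
}

// Toy scorer (assumption, for illustration only): fraction of query terms found in the passage.
function termOverlapScore(query: string, passage: string): number {
  const terms = query.toLowerCase().split(/\W+/).filter(t => t.length > 0);
  if (terms.length === 0) return 0;
  const passageTerms = new Set(passage.toLowerCase().split(/\W+/));
  const hits = terms.filter(t => passageTerms.has(t)).length;
  return hits / terms.length;
}
```

Keeping the scorer behind a function type makes it straightforward to swap one reranker for another and compare them on the same evaluation set.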

Handling retrieval failure gracefully

The worst thing a RAG system can do is answer confidently when it shouldn't. When retrieval returns low-confidence results, your system needs to know how to respond. Three patterns:

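Confidence thresholding: if the maximum similarity score is below a cutoff (0.75 is a reasonable starting point, but score distributions differ between embedding models, so tune it on your own data), trigger a fallback response rather than generating from weak context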
Abstention prompting: instruct the model explicitly to say "I don't have enough information to answer this" rather than guessing
Clarification routing: low-confidence queries get routed to a clarification flow before retrieval is retried

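// Confidence threshold check (TypeScript)
// `vectorDB` is a placeholder client; the query shape varies by vendor.
const FALLBACK_THRESHOLD = 0.75; // below this, skip generation entirely
const MIN_CHUNK_SCORE = 0.65;    // per-chunk floor for what enters the prompt

async function retrieveWithFallback(queryEmbedding: number[]) {
  const results = await vectorDB.query({
    vector: queryEmbedding,
    topK: 10,
    includeScores: true,
  });

  // Math.max() on an empty list returns -Infinity, so handle "no results" explicitly
  const maxScore = results.length > 0
    ? Math.max(...results.map(r => r.score))
    : 0;

  if (maxScore < FALLBACK_THRESHOLD) {
    return {
      answer: null,
      fallback: true,
      reason: 'low_retrieval_confidence',
      score: maxScore,
    };
  }

  // Proceed with top results
  const topChunks = results
    .filter(r => r.score > MIN_CHUNK_SCORE)
    .slice(0, 5);

  return { fallback: false, topChunks };
}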
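Abstention prompting, the second pattern above, is mostly a matter of prompt construction. Here is one possible sketch: `buildGroundedPrompt`, `isAbstention`, and the exact instruction wording are my own illustrative choices, not a standard, and the wording usually needs tuning per model.

```typescript
// Abstention prompt sketch. Names and wording are assumptions for illustration.
const ABSTAIN_MESSAGE = "I don't have enough information to answer this.";

function buildGroundedPrompt(question: string, contextBlocks: string[]): string {
  // Number each block so the model can cite what it used.
  const context = contextBlocks
    .map((block, i) => `[${i + 1}] ${block}`)
    .join("\n\n");

  return [
    "Answer the question using only the numbered context below.",
    "Cite the context numbers you relied on, for example [1].",
    `If the context does not contain the answer, reply exactly: "${ABSTAIN_MESSAGE}"`,
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}

// Detect an abstention so the caller can route to a fallback or clarification flow.
function isAbstention(answer: string): boolean {
  return answer.trim().startsWith(ABSTAIN_MESSAGE);
}
```

Using a fixed abstention string makes the outcome machine-detectable, which lets you count abstentions as a metric instead of discovering them in user complaints.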

Vector database selection

The right choice depends on your scale, latency requirements, and hosting preference:

Pinecone: Fastest to production, managed, generous free tier. Tradeoff: vendor lock-in, cost at scale
Qdrant: Self-hosted, excellent filtering, open source. Tradeoff: ops overhead if self-hosted
Weaviate: Built-in BM25 hybrid, strong ecosystem. Tradeoff: more complex config
pgvector: If you're already on Postgres, zero new infra. Tradeoff: slower ANN at large scale

My default for new projects is Pinecone for getting to production fast, with a migration path to Qdrant if the client has the ops capacity and cost sensitivity justifies it.

Evaluation: the piece everyone ignores until it's too late

You cannot ship a RAG system to production without an evaluation framework. You need to know when accuracy regresses, when a data update breaks retrieval, when a model upgrade changes output behavior, when a new document type confuses the chunker.

The minimum viable eval stack:

A golden dataset: 50–200 question/answer pairs covering your core use cases, manually curated
Retrieval eval: for each golden question, does the correct chunk appear in the top-5 results? Track recall@5
Generation eval: for each golden Q&A, does the generated answer match the expected answer? Use an LLM judge (frontier model or internal reasoning model) for semantic similarity scoring
Regression testing in CI: run evals on every data update and model change, fail the pipeline if accuracy drops below threshold
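The retrieval eval and the CI gate above can be sketched in a few lines. `GoldenCase`, `recallAtK`, and `assertNoRegression` are hypothetical names, and the retriever is modeled as a synchronous function returning ranked chunk IDs to keep the example short; a real one would be async and call your retrieval stack.

```typescript
// Retrieval eval sketch: recall@k over a golden dataset. Names are illustrative.
interface GoldenCase {
  question: string;
  expectedChunkId: string; // the chunk a correct retrieval must surface
}

// Returns chunk IDs ranked best-first for a question.
type Retriever = (question: string, k: number) => string[];

function recallAtK(cases: GoldenCase[], retrieve: Retriever, k: number): number {
  if (cases.length === 0) throw new Error("Golden dataset is empty");
  const hits = cases.filter(c =>
    retrieve(c.question, k).slice(0, k).includes(c.expectedChunkId)
  ).length;
  return hits / cases.length;
}

// CI gate: throw (and fail the pipeline) when the metric falls below the threshold.
function assertNoRegression(metric: number, threshold: number): void {
  if (metric < threshold) {
    throw new Error(
      `Retrieval recall ${metric.toFixed(3)} is below threshold ${threshold.toFixed(3)}`
    );
  }
}
```

In CI, this would run after every re-index or model change, with the threshold set slightly below the last known-good recall so that normal noise does not fail the build.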

At Sanofi, this evaluation framework is what got us to 96.2% accuracy on the production system. Without it, we would have shipped degraded accuracy to users after a knowledge base update and had no way to detect it.

