AI Engineering · Mar 1, 2026 · 9 min read

RAG Pipeline Architecture for Production: What Actually Works

Most RAG demos look great. Most RAG production systems fail silently — wrong chunks, hallucinated answers, P95 latency spikes, and accuracy that erodes as the knowledge base grows. Here's what the architecture looks like when it actually has to work.


What RAG is and why the naive version breaks

Retrieval-Augmented Generation (RAG) is the pattern of fetching relevant context from a knowledge base and injecting it into an LLM prompt before generating an answer. The basic version is simple: embed the query, find similar chunks in a vector DB, stuff them in the prompt.

The basic version has predictable failure modes at production scale:

- Fixed-size chunking splits semantic units, so retrieved chunks arrive stripped of the context that makes them meaningful
- Semantic similarity surfaces chunks that are related to the query but don't actually answer it
- Latency compounds across embedding, search, and generation, producing P95 spikes
- Accuracy erodes as the knowledge base grows and stale or near-duplicate content crowds out the right chunks

None of these fail loudly. Your system returns answers. They just happen to be wrong, or occasionally wrong, or confidently wrong in ways your users notice before you do.

The architecture that actually holds up

Production RAG is a pipeline with distinct stages, each of which needs to be built correctly. Here's the structure:

| Stage | What happens | Budget |
| --- | --- | --- |
| 1. Ingestion | Parse, clean, chunk, embed, store | Offline / batch |
| 2. Retrieval | Embed query, ANN search, metadata filter | P50 < 80ms |
| 3. Reranking | Cross-encoder or LLM-based relevance scoring | P50 < 200ms |
| 4. Context assembly | Deduplicate, truncate, format for prompt | In-process |
| 5. Generation | Prompt + context → LLM → structured output | P50 < 1.5s |
| 6. Grounding check | Verify answer cites retrieved context | Optional / high-stakes |
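The online stages (2 through 5) compose into a single async function. A minimal sketch of that composition, with every stage behind an injected dependency so each can be swapped or measured independently (all names and signatures here are illustrative, not from any specific library):

```typescript
// Illustrative shapes for the pipeline's data
interface RetrievedChunk {
  content: string;
  score: number;
}

interface RagAnswer {
  answer: string;
  sources: string[];
}

async function answerQuery(
  query: string,
  deps: {
    embed: (text: string) => Promise<number[]>;
    search: (vector: number[], topK: number) => Promise<RetrievedChunk[]>;
    rerank: (query: string, chunks: RetrievedChunk[]) => Promise<RetrievedChunk[]>;
    generate: (prompt: string) => Promise<string>;
  },
): Promise<RagAnswer> {
  // 2. Retrieval: embed the query, run ANN search for candidates
  const queryVector = await deps.embed(query);
  const candidates = await deps.search(queryVector, 20);

  // 3. Reranking: rescore candidates, keep the best few
  const reranked = (await deps.rerank(query, candidates)).slice(0, 5);

  // 4. Context assembly: deduplicate and format for the prompt
  const context = [...new Set(reranked.map(c => c.content))].join('\n---\n');

  // 5. Generation: prompt + context → LLM
  const answer = await deps.generate(
    `Answer using only this context:\n${context}\n\nQuestion: ${query}`,
  );

  return { answer, sources: reranked.map(c => c.content) };
}
```

Keeping each stage behind an interface like this is also what makes the per-stage latency budgets in the table measurable: you can time each dependency call in isolation.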

Chunking: where most teams get it wrong

Fixed-size chunking (every 512 tokens, no overlap) is the default in most tutorials. It's fine for demos. It fails for production knowledge bases because it breaks semantic units — a paragraph mid-argument, a code block mid-function, a list missing its header.

What works better in production:

- Chunk along semantic boundaries (headings, paragraphs, code blocks), not fixed token counts
- Keep chunks small for retrieval precision, but store a pointer to the parent section so the prompt gets full context
- Attach metadata (document, section, URL, freshness) to every chunk for filtering and citation
```typescript
// Hierarchical chunk structure
interface Chunk {
  id: string;
  content: string;        // small chunk for retrieval
  parentContent: string;  // parent section for context
  metadata: {
    documentId: string;
    title: string;
    section: string;
    url: string;
    updatedAt: string;
    chunkIndex: number;
  };
  embedding: number[];         // stored in vector DB
  parentEmbedding?: number[];  // optional, for parent retrieval
}
```
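A minimal splitter that produces these parent/child pairs for markdown input might look like the following. This is a sketch: the heading regex, paragraph splitting, and size limit are deliberately simplified stand-ins for real parsing logic.

```typescript
// Illustrative child chunk: small retrieval unit plus its parent section
interface SimpleChunk {
  content: string;       // small chunk, embedded for retrieval
  parentContent: string; // full section, injected into the prompt
  section: string;
}

function chunkBySection(markdown: string, maxChildChars = 400): SimpleChunk[] {
  const chunks: SimpleChunk[] = [];
  // Treat each "## " heading as a parent section boundary
  for (const section of markdown.split(/\n(?=## )/)) {
    const heading = section.match(/^## (.+)/);
    const title = heading ? heading[1] : '(untitled)';
    // Accumulate paragraphs into child chunks up to maxChildChars,
    // so a child never cuts a paragraph in half
    let current = '';
    for (const paragraph of section.split(/\n\n+/)) {
      if (current && current.length + paragraph.length > maxChildChars) {
        chunks.push({ content: current.trim(), parentContent: section, section: title });
        current = '';
      }
      current += paragraph + '\n\n';
    }
    if (current.trim()) {
      chunks.push({ content: current.trim(), parentContent: section, section: title });
    }
  }
  return chunks;
}
```

At query time you embed and search over `content`, then hand `parentContent` to the LLM, which is the core of the parent-document retrieval pattern.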

Retrieval: ANN search is just the start

Vector similarity search gets you candidate chunks. It does not get you the right chunks. The gap between "semantically similar" and "contextually relevant" is where RAG accuracy lives.

Production retrieval should layer three things:

1. Dense vector search (ANN) for semantic matches
2. Sparse keyword search (BM25) for exact terms, names, and codes that embeddings miss
3. Metadata filtering (document type, recency, access scope) to cut the candidate pool before ranking
Pinecone, Qdrant, and Weaviate all support hybrid search natively now. Use it. On benchmarks, hybrid consistently beats pure vector retrieval by 10–25% on recall@10.

Reranking: the layer most teams skip

After retrieval you have 10–20 candidate chunks. A reranker scores each chunk against the original query using a cross-encoder model — which is slower but far more accurate than the bi-encoder used for retrieval. You keep the top 3–5 and discard the rest.

The latency cost is real (50–200ms). The accuracy gain is also real — in my production deployments, adding a reranker improved answer quality measurably in user evaluations. For systems where accuracy matters (medical, legal, financial, enterprise), it's not optional.

Options: Cohere Rerank, BGE-Reranker, cross-encoder/ms-marco-MiniLM. For most use cases, Cohere Rerank via API is the fastest path to production.
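Whichever reranker you pick, the integration shape is the same: score every candidate against the original query, sort by that score, keep the top few. A sketch with a pluggable scorer, where `scorePair` is a hypothetical stand-in for your cross-encoder inference call or hosted rerank API:

```typescript
// Illustrative candidate shapes
interface Candidate { id: string; content: string; }
interface Reranked extends Candidate { relevance: number; }

async function rerank(
  query: string,
  candidates: Candidate[],
  // Stand-in for a cross-encoder or hosted rerank API call
  scorePair: (query: string, passage: string) => Promise<number>,
  keep = 5,
): Promise<Reranked[]> {
  // Score every (query, passage) pair; cross-encoders see both together,
  // which is why they beat the bi-encoder used at retrieval time
  const scored = await Promise.all(
    candidates.map(async c => ({
      ...c,
      relevance: await scorePair(query, c.content),
    })),
  );
  return scored.sort((a, b) => b.relevance - a.relevance).slice(0, keep);
}
```

Isolating the scorer behind a function like this also makes it cheap to A/B a hosted API against a self-hosted model later.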

Handling retrieval failure gracefully

The worst thing a RAG system can do is answer confidently when it shouldn't. When retrieval returns low-confidence results, your system needs an explicit fallback path, whether that means refusing to answer, asking the user to rephrase, or escalating to a human. The simplest building block is a confidence threshold on retrieval scores:

```typescript
// Confidence threshold check
const results = await vectorDB.query({
  vector: queryEmbedding,
  topK: 10,
  includeScores: true,
});

const maxScore = Math.max(...results.map(r => r.score));

if (maxScore < 0.72) {
  return {
    answer: null,
    fallback: true,
    reason: 'low_retrieval_confidence',
    score: maxScore,
  };
}

// Proceed with top results
const topChunks = results
  .filter(r => r.score > 0.65)
  .slice(0, 5);
```

Vector database selection

The right choice depends on your scale, latency requirements, and hosting preference:

| Option | Strengths | Tradeoffs |
| --- | --- | --- |
| Pinecone | Fastest to production, managed, generous free tier | Vendor lock-in, cost at scale |
| Qdrant | Self-hosted, excellent filtering, open source | Ops overhead if self-hosted |
| Weaviate | Built-in BM25 hybrid, strong ecosystem | More complex config |
| pgvector | If you're already on Postgres — zero new infra | Slower ANN at large scale |

My default for new projects is Pinecone for getting to production fast, with a migration path to Qdrant when the client has the ops capacity and the cost savings justify the move.

Evaluation: the piece everyone ignores until it's too late

You cannot ship a RAG system to production without an evaluation framework. You need to know when accuracy regresses — when a data update breaks retrieval, when a model upgrade changes output behavior, when a new document type confuses the chunker.

The minimum viable eval stack:

- A golden set of real questions paired with known-correct answers and source chunks
- Retrieval metrics (recall@k, MRR) computed against that set on every knowledge base update
- Answer-level grading (human review or LLM-as-judge) for faithfulness to the retrieved context
- A regression run wired into CI so a data or model change can't ship silently

At Sanofi, this evaluation framework is what got us to 96.2% accuracy on the production system. Without it, we would have shipped degraded accuracy to users after a knowledge base update and had no way to detect it.
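The retrieval half of such a framework can start very small: a golden set of questions with expected chunk ids, and a recall@k number you recompute on every change. A sketch (the golden set format and `retrieve` signature are illustrative):

```typescript
// One golden case: a real user question plus the chunk ids that a
// correct retrieval run should surface
interface GoldenCase {
  question: string;
  expectedChunkIds: string[];
}

async function recallAtK(
  cases: GoldenCase[],
  retrieve: (question: string) => Promise<string[]>, // chunk ids, best first
  k = 10,
): Promise<number> {
  let hits = 0;
  for (const c of cases) {
    const top = (await retrieve(c.question)).slice(0, k);
    // A case counts as a hit if any expected chunk appears in the top k
    if (c.expectedChunkIds.some(id => top.includes(id))) hits++;
  }
  return hits / cases.length;
}
```

Run this before and after every knowledge base update; a drop in the score is exactly the silent regression described above, caught before users see it.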

Building a RAG system?

I've built production RAG pipelines for Fortune 500 clients. If you're designing the architecture or debugging accuracy issues, a deep dive is the fastest path forward.

Book a deep dive →

