RAG Pipeline Architecture for Production: What Actually Works
Most RAG demos look great. Most RAG production systems fail silently — wrong chunks, hallucinated answers, P95 latency spikes, and accuracy that erodes as the knowledge base grows. Here's what the architecture looks like when it actually has to work.
What RAG is and why the naive version breaks
Retrieval-Augmented Generation (RAG) is the pattern of fetching relevant context from a knowledge base and injecting it into an LLM prompt before generating an answer. The basic version is simple: embed the query, find similar chunks in a vector DB, stuff them in the prompt.
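As a concrete reference point, the naive pattern looks roughly like this. Everything here is a toy stand-in: the "embedding" is bag-of-words cosine similarity and the "vector DB" is an in-memory list, where a real system would use a learned embedding model and an ANN index.

```python
from collections import Counter
from math import sqrt

# Toy embedding: bag-of-words term counts. A real system calls an
# embedding model here.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def naive_rag_prompt(query: str, chunks: list[str], k: int = 3) -> str:
    """Embed the query, rank chunks by similarity, stuff top-k into a prompt."""
    q_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q_vec, embed(c)), reverse=True)
    context = "\n---\n".join(ranked[:k])
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

chunks = [
    "Our API rate limit is 100 requests per minute.",
    "The office cafeteria serves lunch from noon to 2pm.",
    "Rate limit errors return HTTP status 429.",
]
prompt = naive_rag_prompt("What is the API rate limit?", chunks, k=2)
```

Simple, and it works on a demo corpus. The rest of this article is about why this shape breaks at scale.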
The basic version has predictable failure modes at production scale:
- Chunking strategy that splits mid-sentence or mid-concept, destroying semantic coherence
- Retrieval that returns topically similar but contextually wrong chunks
- Context windows that overflow silently, truncating the most relevant parts
- No reranking layer, so the first-retrieved results are the ones used regardless of quality
- No fallback when retrieval confidence is low — the model hallucinates rather than saying "I don't know"
- Embedding mismatch — upgrading the embedding model without re-embedding the whole corpus, so new queries are compared against stale vectors
None of these fail loudly. Your system still returns answers. They just happen to be wrong: subtly wrong, intermittently wrong, or confidently wrong, in ways your users notice before you do.
The architecture that actually holds up
Production RAG is a pipeline with distinct stages, each of which has to be built correctly. The sections below walk through each stage.
Chunking: where most teams get it wrong
Fixed-size chunking (every 512 tokens, no overlap) is the default in most tutorials. It's fine for demos. It fails for production knowledge bases because it breaks semantic units — a paragraph mid-argument, a code block mid-function, a list missing its header.
What works better in production:
- Semantic chunking — split on natural boundaries (paragraphs, headers, sections), not character counts
- Overlap chunking — 10–20% overlap between adjacent chunks preserves cross-boundary context
- Hierarchical chunks — store both a small chunk (for precise retrieval) and its parent section (for context injection)
- Metadata-rich chunks — attach document title, section header, URL, and date to every chunk at index time
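The first three techniques combine naturally: split on paragraph boundaries, pack paragraphs up to a size budget, and carry the last paragraph forward into the next chunk. A minimal sketch, with illustrative field names (`title`, `text`) rather than any standard schema:

```python
def chunk_document(text: str, title: str, max_words: int = 120, overlap: int = 1) -> list[dict]:
    """Split on paragraph boundaries (blank lines), pack paragraphs into
    chunks of up to ~max_words, and carry `overlap` trailing paragraphs
    into the next chunk to preserve cross-boundary context."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        current.append(para)
        if sum(len(p.split()) for p in current) >= max_words:
            chunks.append(current)
            current = current[-overlap:]  # seed next chunk with the overlap
    # Flush the tail unless it is only the overlap we already emitted
    if current and (not chunks or current != chunks[-1][-overlap:]):
        chunks.append(current)
    return [{"title": title, "text": "\n\n".join(c)} for c in chunks]
```

Hierarchical and metadata-rich chunking extend the same idea: store the parent section alongside each chunk, and attach section header, URL, and date to the dict at index time.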
Retrieval: ANN search is just the start
Vector similarity search gets you candidate chunks. It does not get you the right chunks. The gap between "semantically similar" and "contextually relevant" is where RAG accuracy lives.
Production retrieval should layer three things:
- Dense retrieval — standard vector similarity (cosine, dot product) against your embedded knowledge base
- Sparse retrieval — BM25 or keyword search to catch exact-match terms that embeddings handle poorly (product names, version numbers, proper nouns)
- Hybrid fusion — RRF (Reciprocal Rank Fusion) combines dense and sparse results into a single ranked list
Pinecone, Qdrant, and Weaviate all support hybrid search natively now. Use it. On benchmarks, hybrid consistently beats pure vector retrieval by 10–25% on recall@10.
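RRF itself is simple enough to sketch in a few lines: each document scores the sum of 1/(k + rank) across every ranked list it appears in, so documents ranked well by both retrievers float to the top. The doc IDs below are illustrative; k=60 is the conventional constant.

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked result lists: each doc scores sum(1 / (k + rank))
    over the lists it appears in, then sort by fused score."""
    scores: dict[str, float] = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["doc_a", "doc_b", "doc_c"]   # vector similarity order
sparse = ["doc_c", "doc_a", "doc_d"]   # BM25 order
fused = reciprocal_rank_fusion([dense, sparse])
```

Note that RRF ignores the raw scores entirely and uses only ranks, which is exactly why it fuses dense and sparse results cleanly: cosine similarities and BM25 scores live on incomparable scales.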
Reranking: the layer most teams skip
After retrieval you have 10–20 candidate chunks. A reranker scores each chunk against the original query using a cross-encoder model — which is slower but far more accurate than the bi-encoder used for retrieval. You keep the top 3–5 and discard the rest.
The latency cost is real (50–200ms). The accuracy gain is also real — in my production deployments, adding a reranker improved answer quality measurably in user evaluations. For systems where accuracy matters (medical, legal, financial, enterprise), it's not optional.
Options: Cohere Rerank, BGE-Reranker, cross-encoder/ms-marco-MiniLM. For most use cases, Cohere Rerank via API is the fastest path to production.
Handling retrieval failure gracefully
The worst thing a RAG system can do is answer confidently when it shouldn't. When retrieval returns low-confidence results, your system needs to know how to respond. Three patterns:
- Confidence thresholding — if max similarity score is below 0.75, trigger a fallback response rather than generating from weak context
- Abstention prompting — instruct the model explicitly to say "I don't have enough information to answer this" rather than guessing
- Clarification routing — low-confidence queries get routed to a clarification flow before retrieval is retried
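The three patterns above compose into a single routing decision made before generation. A minimal sketch; the threshold values are illustrative and must be tuned against your own score distribution, since similarity scores from different embedding models are not comparable.

```python
from enum import Enum

class Route(Enum):
    ANSWER = "answer"
    CLARIFY = "clarify"
    ABSTAIN = "abstain"

# Illustrative thresholds -- calibrate on your own retrieval scores.
ANSWER_THRESHOLD = 0.75
CLARIFY_THRESHOLD = 0.60

def route_query(retrieval_scores: list[float]) -> Route:
    """Decide how to respond based on the best retrieval score."""
    if not retrieval_scores:
        return Route.ABSTAIN
    best = max(retrieval_scores)
    if best >= ANSWER_THRESHOLD:
        return Route.ANSWER      # generate from retrieved context
    if best >= CLARIFY_THRESHOLD:
        return Route.CLARIFY     # ask the user to refine the query
    return Route.ABSTAIN         # "I don't have enough information"
```

The ABSTAIN branch pairs with abstention prompting: even above threshold, the generation prompt should still instruct the model to decline when the context doesn't contain the answer.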
Vector database selection
The right choice depends on your scale, latency requirements, and hosting preference. My default for new projects is Pinecone to get to production fast, with a migration path to self-hosted Qdrant when the client has the ops capacity and the cost savings justify the move.
Evaluation: the piece everyone ignores until it's too late
You cannot ship a RAG system to production without an evaluation framework. You need to know when accuracy regresses — when a data update breaks retrieval, when a model upgrade changes output behavior, when a new document type confuses the chunker.
The minimum viable eval stack:
- A golden dataset — 50–200 question/answer pairs covering your core use cases, manually curated
- Retrieval eval — for each golden question, does the correct chunk appear in the top-5 results? Track recall@5
- Generation eval — for each golden Q&A, does the generated answer match the expected answer? Use an LLM judge (GPT-4 or Claude) for semantic similarity scoring
- Regression testing in CI — run evals on every data update and model change, fail the pipeline if accuracy drops below threshold
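The retrieval half of that stack reduces to one metric worth wiring into CI. A sketch of recall@k over a golden dataset; `retrieve` stands in for your retrieval pipeline, and the golden-item field names are illustrative:

```python
def recall_at_k(golden: list[dict], retrieve, k: int = 5) -> float:
    """Fraction of golden questions whose expected chunk ID appears in
    the top-k retrieved results. `retrieve(question)` is a stand-in
    for the full retrieval pipeline and returns ranked chunk IDs."""
    hits = 0
    for item in golden:
        top_k = retrieve(item["question"])[:k]
        if item["expected_chunk_id"] in top_k:
            hits += 1
    return hits / len(golden)

# In CI: fail the build if retrieval regresses below threshold, e.g.
# assert recall_at_k(golden_set, retrieve) >= 0.90
```

Generation eval follows the same shape, with an LLM judge replacing the membership check.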
At Sanofi, this evaluation framework is what got us to 96.2% accuracy on the production system. Without it, we would have shipped degraded accuracy to users after a knowledge base update and had no way to detect it.
Building a RAG system?
I've built production RAG pipelines for Fortune 500 clients. If you're designing the architecture or debugging accuracy issues, a deep dive is the fastest path forward.
Book a deep dive →