Building RAG Systems for Production: Lessons Learned
Practical insights on building Retrieval-Augmented Generation systems that work reliably in production environments.
Our first RAG system was a disaster.
It worked beautifully in demos. We'd ask questions, get impressive answers with citations, show it to stakeholders, and everyone nodded approvingly. Then we deployed it to real users, and everything fell apart. Queries that should have found obvious information returned nothing. The system confidently cited documents that didn't actually support its answers. Response times spiked when multiple users hit it simultaneously. And worst of all, we had no way to know it was failing—users just stopped using it.
That project taught us more about RAG than anything we'd read in papers or blog posts. Three years and a dozen RAG deployments later, here's what we wish we'd known from the start.
The Fundamental Insight: Retrieval Is the Hard Part
Everyone focuses on the generation side—which model to use, how to prompt it, how to format the output. These things matter, but they're not where most RAG systems fail. They fail at retrieval.
If you retrieve the wrong documents, the best language model in the world can't save you. It'll confidently synthesize an answer from irrelevant information, or tell you it can't find what you're looking for even though the answer is sitting right there in your corpus.
The implication is uncomfortable: you need to spend more time on chunking, embedding, and retrieval strategy than on the LLM integration. The ratio should probably be 70/30 in favor of retrieval, but most teams do the opposite.
This mismatch happens because retrieval engineering is less glamorous than LLM prompt engineering. There are no viral Twitter threads about chunking strategies. The demos that excite stakeholders show the generation step, not the search query that made it possible. Teams naturally gravitate toward the visible, discussable parts of the system while neglecting the plumbing that determines whether it actually works.
The best RAG engineers we know spend their time in search indexes, looking at query results, debugging why document X didn't surface for query Y. They treat retrieval as a search engineering problem first and an AI problem second. That perspective shift is uncomfortable but essential.
Chunking Is Where RAG Systems Are Won or Lost
We spent two weeks optimizing prompts on our first system. We should have spent those two weeks on chunking.
Chunking—how you split documents into pieces for embedding and retrieval—determines the upper bound on your system's quality. Get it wrong, and no amount of prompt engineering will fix it.
The too-small trap: We started with 256-token chunks because that's what the tutorials suggested. Retrieval precision was excellent—we found exactly the sentences we were looking for. But the chunks had no context. "The quarterly revenue was $12.4 million" doesn't help if you don't know which company or which quarter. We spent weeks adding metadata and pre/post context, essentially rebuilding information we'd thrown away.
The too-large trap: We overcorrected to 2000-token chunks. Now we had context, but retrieval suffered. A chunk about "Q3 2024 financial results" might contain revenue, expenses, headcount, and market analysis. Ask about revenue, and you get a hit—but most of the chunk is noise that confuses the model.
What actually works: For most use cases, 500-800 tokens with 100-token overlap. But more important than the numbers is respecting document structure. Split on section boundaries, not token counts. Keep tables together. Preserve the heading hierarchy. A chunk should be a coherent unit of information, not an arbitrary slice of text.
We eventually built a semantic chunker that uses the document structure and a small model to identify natural break points. It took a week to build and improved retrieval quality more than any other single change we made.
The semantic chunker works by first extracting document structure—headers, subheaders, paragraphs, lists—then using a lightweight classifier to score potential split points. A split point that follows a heading scores higher than one in the middle of a paragraph. A split that keeps a code block intact scores higher than one that bisects it. The result is chunks that respect the document's logical organization rather than fighting against it.
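To make the idea concrete, here's a minimal sketch of the structure-first half of that approach: split on headings, then pack paragraphs into chunks with overlap. The heading regex, token budget (approximated here by word count), and overlap size are illustrative placeholders, not our production chunker.

```python
import re

MAX_TOKENS = 700      # target chunk size (approximated here by word count)
OVERLAP_TOKENS = 100  # carried over between adjacent chunks for continuity

def split_sections(text: str) -> list[str]:
    """Split on Markdown-style headings so chunks never straddle a section."""
    parts = re.split(r"(?m)^(?=#{1,6} )", text)
    return [p.strip() for p in parts if p.strip()]

def chunk_section(section: str) -> list[str]:
    """Pack paragraphs into chunks up to MAX_TOKENS, with overlap between chunks."""
    paragraphs = [p for p in section.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        if current and sum(len(p.split()) for p in current) + len(para.split()) > MAX_TOKENS:
            chunks.append("\n\n".join(current))
            # start the next chunk with the tail of the previous one
            tail, budget = [], 0
            for prev in reversed(current):
                budget += len(prev.split())
                tail.insert(0, prev)
                if budget >= OVERLAP_TOKENS:
                    break
            current = tail
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks

def chunk_document(text: str) -> list[str]:
    return [c for section in split_sections(text) for c in chunk_section(section)]
```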
Some document types require specialized handling. PDFs with complex layouts—think financial reports with tables and footnotes—need layout analysis before chunking. Code documentation needs to keep function definitions with their docstrings. Meeting transcripts benefit from speaker-aware chunking that preserves conversational context. Generic chunkers struggle with all of these; purpose-built chunkers handle them well.
The investment in chunking infrastructure pays dividends because chunking decisions are hard to change later. Re-chunking a corpus means regenerating all embeddings, which is expensive and time-consuming. Decisions made early about chunk size and strategy become baked into the system's DNA.
Hybrid Search Beats Pure Semantic Search
Pure semantic search sounds elegant: embed queries, embed documents, find the closest matches. In practice, it has frustrating failure modes.
Ask for "document ABC-1234" and semantic search might return documents about ABCs and 1234s, but not the specific document you named. Ask about "John Smith" and you might get documents about Johns, Smiths, or even people named John who work at Smith Industries. The embedding model understands meaning, but sometimes you need exact matching.
Hybrid search combines semantic search with keyword matching (usually BM25). For every query, you run both, then combine the results. There are various ways to do the combination—reciprocal rank fusion, weighted scoring, learned reranking—but even a simple approach helps.
In our experience, hybrid search improves retrieval quality by 10-30% across the board, with the biggest gains on queries that include specific identifiers, proper nouns, or technical terms. The implementation cost is minimal—you're adding a BM25 index alongside your vector index. There's almost no reason not to do it.
The fusion strategy matters more than it might seem. Reciprocal Rank Fusion (RRF) is a good starting point—it combines rankings without requiring score calibration. More sophisticated approaches learn weights for each retrieval method based on query type, but RRF often gets you 80% of the benefit with 10% of the complexity.
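Here's roughly what RRF looks like in practice; the k=60 constant comes from the original RRF formulation, and the document IDs in the usage example are placeholders.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs.

    Each document scores 1 / (k + rank) in every list it appears in;
    k=60 is the conventional constant and rarely needs tuning.
    """
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fuse BM25 and vector-search results for the same query.
fused = reciprocal_rank_fusion([
    ["doc_7", "doc_2", "doc_9"],   # BM25 ranking
    ["doc_2", "doc_4", "doc_7"],   # embedding-similarity ranking
])
```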
One pattern we've found valuable: adjusting the balance between semantic and keyword search based on query analysis. Queries with identifiers or proper nouns weight keyword search higher. Conversational queries or those asking "how to" questions weight semantic search higher. A simple classifier that routes queries to the right mixture improves results without adding latency.
Don't forget about filtering. Often you want to search within a specific subset of your corpus—documents from a particular time range, or from a specific source, or with certain metadata properties. These filters should apply before retrieval, not after. Retrieving 100 documents and then filtering to 10 wastes compute and returns suboptimal results compared to filtering first.
You Need Evaluation From Day One
The biggest mistake on our first project was treating evaluation as a later problem. "We'll add metrics once the basic system works." By the time we got around to it, we'd made dozens of decisions based on vibes and demo performance. Some of those decisions were wrong, and we had to undo them.
Build evaluation infrastructure before you build the RAG pipeline. You need at least three things:
A golden test set. Fifty to a hundred question-answer pairs that you've manually verified. When you make changes, you should know within minutes whether quality improved or regressed. Building this test set is tedious, but it's the foundation of systematic improvement.
Retrieval metrics. For each test question, you should know whether the relevant documents were retrieved and at what rank. If retrieval fails, generation can't succeed—so retrieval metrics are your leading indicator.
LLM-as-judge for generation quality. Use a strong model (GPT-4 or Claude) to evaluate whether the generated answer is correct, complete, and grounded in the retrieved documents. Human evaluation is more reliable, but LLM evaluation scales better for continuous testing.
The teams that improve fastest are the ones that run evaluations on every change. Make it part of your CI pipeline.
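As a concrete starting point for the retrieval metrics above, here's a minimal harness that computes hit rate and MRR over a golden test set. The golden-set format and the `search` callable are assumptions you'd adapt to your own retriever.

```python
def evaluate_retrieval(golden_set, search, k: int = 10):
    """Compute hit rate (recall@k) and MRR over a golden test set.

    golden_set: list of {"question": str, "relevant_ids": set[str]} entries.
    search: callable(question, k) -> ranked list of chunk IDs (your retriever).
    """
    hits, reciprocal_ranks = 0, []
    for example in golden_set:
        retrieved = search(example["question"], k)
        relevant = example["relevant_ids"]
        # hit rate: at least one relevant chunk appears in the top k
        if relevant & set(retrieved):
            hits += 1
        # MRR: reciprocal rank of the first relevant chunk, 0 if none retrieved
        rank = next((i for i, doc_id in enumerate(retrieved, 1) if doc_id in relevant), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return {
        "recall_at_k": hits / len(golden_set),
        "mrr": sum(reciprocal_ranks) / len(golden_set),
    }
```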
Beyond these basics, consider building a "disagreement detection" system. When the LLM-as-judge and human evaluators disagree, those cases are gold—they reveal where your automatic evaluation is miscalibrated and where the model is failing in ways that seem correct. We maintain a queue of these disagreements for periodic human review, and each review session teaches us something about the system's failure modes.
Evaluation should also include latency and cost tracking. A system that gives perfect answers in 30 seconds is useless for interactive applications. A system that costs $5 per query isn't viable for high-volume use cases. Track these metrics alongside quality metrics so you can make informed tradeoffs.
Context Window Management Is Underrated
Early in a project, you have a handful of test documents and everything fits comfortably in the context window. In production, you're retrieving from thousands of documents, and suddenly context management matters a lot.
The naive approach is to retrieve the top N chunks and concatenate them. This works until it doesn't—until you hit the context limit, or until you're stuffing so much text that the model can't find the relevant parts, or until your costs and latency spike because you're processing enormous prompts.
More sophisticated approaches we've used:
Relevance-based truncation. Set a similarity threshold and only include chunks that clear it. If only three chunks are truly relevant, don't pad with seven marginally related ones.
Hierarchical summarization. For document sets too large to fit in context, generate summaries of each document, then retrieve over summaries. When you find a relevant summary, drill down to the chunks.
Map-reduce for broad questions. If a question requires information scattered across many documents, process chunks in batches and combine the results. More complex to implement, but sometimes necessary.
The right approach depends on your use case. But you need an approach—"throw everything in and hope for the best" stops working surprisingly fast.
A technique we've found valuable is dynamic context sizing. Rather than always including a fixed number of chunks, adjust based on the query complexity and the available token budget. Simple factual questions might need only two or three chunks. Complex synthesis questions might need ten. The model can signal when it needs more context—a response of "I found partial information but would need to see more documents" triggers a second retrieval pass.
Context ordering also matters. LLMs are sensitive to position—they attend more to the beginning and end of the context. Put the most relevant chunks first. Put metadata and less important context in the middle. Structure the context like you'd structure a document for a human reader, not like a random bag of snippets.
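A minimal sketch of how these two ideas combine into context assembly; the similarity threshold, token budget, and whitespace-based token estimate are placeholders you'd tune for your own stack.

```python
def build_context(chunks, min_score: float = 0.75, max_tokens: int = 6000) -> str:
    """Assemble prompt context from scored chunks.

    chunks: list of {"text": str, "score": float} from the retriever.
    Only chunks above the similarity threshold are kept, the token budget is
    enforced greedily, and the highest-scoring chunks go first so they sit in
    the positions the model attends to most.
    """
    kept, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        if chunk["score"] < min_score:
            break                                # everything below this is under threshold
        tokens = len(chunk["text"].split())      # crude estimate; swap in a real tokenizer
        if used + tokens > max_tokens:
            continue
        kept.append(chunk["text"])
        used += tokens
    return "\n\n---\n\n".join(kept)
```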
Production Concerns Nobody Talks About
The gap between "works in a notebook" and "works in production" is larger for RAG than for most systems.
Latency: A RAG query has multiple serial steps: embed the query, search the index, retrieve documents, call the LLM. Each step adds latency. For interactive applications, you need to parallelize where possible and optimize aggressively. Sub-second response times are achievable but not automatic.
Cost: You're paying for embeddings, vector database queries, and LLM tokens. At scale, costs can surprise you. We've seen systems where the embedding costs exceeded the LLM costs because of inefficient document update patterns. Model the costs before you deploy.
Observability: When a RAG system gives a bad answer, you need to know why. Was retrieval the problem? Was the information present but the model confused? Was the source document wrong? You need logging at every step—queries, retrieved documents, generated answers, user feedback—so you can diagnose issues.
Document updates: Your corpus changes over time. How do you update embeddings? How do you handle versioning? How do you avoid serving stale information? These aren't hard problems, but they need answers before you go to production.
Failure handling: What happens when the vector database is slow? When the LLM times out? When retrieval returns nothing? Your system needs graceful degradation and clear error messages—not silent failures that users interpret as "the AI is stupid."
Rate limiting and queueing: Production RAG systems need backpressure handling. What happens when 100 requests arrive simultaneously? LLM providers have rate limits. Vector databases have throughput limits. Build queuing with priority handling—urgent queries should preempt batch processing, not wait behind it.
Caching strategy: Some RAG queries repeat frequently. Caching responses can dramatically reduce costs and latency for common questions. But cache invalidation is tricky—when the underlying documents change, cached responses become stale. We typically cache for short periods (hours, not days) and build document-change detection that invalidates affected cache entries.
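Here's the shape of that caching logic as a minimal in-memory sketch; a production version would live in Redis or similar, but the TTL-plus-document-invalidation structure is the same.

```python
import hashlib
import time

class AnswerCache:
    """Tiny in-memory response cache: short TTL plus document-change invalidation."""

    def __init__(self, ttl_seconds: int = 4 * 3600):
        self.ttl = ttl_seconds
        self.entries = {}          # key -> (answer, source_doc_ids, timestamp)

    def _key(self, query: str) -> str:
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, query: str):
        entry = self.entries.get(self._key(query))
        if entry and time.time() - entry[2] < self.ttl:
            return entry[0]
        return None

    def put(self, query: str, answer: str, source_doc_ids: set[str]):
        self.entries[self._key(query)] = (answer, source_doc_ids, time.time())

    def invalidate_document(self, doc_id: str):
        """Drop every cached answer that cited the changed document."""
        self.entries = {k: v for k, v in self.entries.items() if doc_id not in v[1]}
```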
Streaming responses: For interactive applications, streaming the LLM response while it's being generated improves perceived latency dramatically. The first token can appear while retrieval is still happening if you pipeline the stages properly. Users see activity immediately rather than staring at a loading spinner.
What We'd Build Today
If we were starting a new RAG system tomorrow, here's the stack we'd use:
For corpora under 100K documents, we'd use PostgreSQL with pgvector. It's simpler to operate than a dedicated vector database, and the performance is fine for most use cases. For larger corpora, Pinecone or Weaviate, depending on whether we needed managed infrastructure or more control.
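For illustration, here's what a filtered pgvector similarity query looks like from Python. The table layout and column names are assumptions; `<=>` is pgvector's cosine-distance operator.

```python
import psycopg2

# Table assumed for illustration:
#   chunks(id, content, source, published_at, embedding vector(1536))
conn = psycopg2.connect("dbname=rag user=rag")

def search(query_embedding: list[float], source: str, limit: int = 5):
    """Filter by metadata first, then rank by cosine distance."""
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, content
            FROM chunks
            WHERE source = %s
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (source, vector_literal, limit),
        )
        return cur.fetchall()
```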
For embeddings, OpenAI's text-embedding-3-large or Cohere's embed-v3. Both work well. We'd use a local model (like bge-large) only if there were strict data residency requirements.
For the LLM, Claude for most applications. It follows instructions more reliably than GPT-4 for RAG-specific tasks, especially when you need precise citations. GPT-4 when Claude's context window is limiting or when you need specific OpenAI features.
For orchestration, LangChain or LlamaIndex—not because they're perfect, but because they handle the plumbing and let you focus on the parts that matter. Roll your own only if you have unusual requirements.
For evaluation, RAGAS or a custom harness built around the metrics we described. Run it on every commit.
For document processing, Unstructured.io has become our default for parsing PDFs, HTML, and other formats into clean text. It handles the edge cases—multi-column layouts, embedded tables, mixed content types—that trip up simpler parsers. The extraction quality directly affects RAG quality, so investing here pays off.
For reranking, we'd add Cohere's Rerank or a fine-tuned cross-encoder. Rerankers score query-document pairs more accurately than embedding similarity, improving precision for the documents that make it into context. The latency hit is typically 50-100ms, which is acceptable for most use cases.
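A minimal reranking sketch using the sentence-transformers CrossEncoder class; the specific model name is just a commonly used public checkpoint, not a recommendation over a hosted reranker.

```python
from sentence_transformers import CrossEncoder

# A small public cross-encoder; swap in whichever reranker you standardize on.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Score every (query, candidate) pair jointly and keep the best top_k."""
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```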
We'd also invest in a metadata strategy from day one. Every chunk should carry metadata about its source document, section, date, author, and any domain-specific attributes. This metadata enables filtering, improves answer attribution, and helps with debugging. Adding metadata later is painful; designing it in from the start is cheap.
The Honest Truth About RAG
RAG is the most practical LLM application for enterprises right now. It lets you ground AI responses in your actual data without the cost and complexity of fine-tuning. When it works well, it's genuinely useful—we've built systems that save teams hours of research time daily.
But it's not a magic box. It requires careful engineering, especially around retrieval. It requires ongoing maintenance as your corpus changes. It requires evaluation infrastructure to ensure quality over time. Teams that treat it as a quick win end up with systems that degrade and get abandoned.
Build it right, and RAG is a genuine capability upgrade. Build it wrong, and it's an expensive demo that never makes it to production.
Advanced Retrieval Techniques
Once you've nailed the basics, several advanced techniques can push retrieval quality higher.
Query Expansion and Rewriting
Users rarely phrase queries optimally for retrieval. "What's the return policy?" might not match documents that discuss "refund procedures" and "merchandise exchanges." Query expansion helps bridge this gap.
Synonym expansion adds related terms to the query. The naive approach uses a thesaurus; the better approach uses embeddings to find semantically similar terms. We expand queries before both semantic and keyword search, improving recall without sacrificing precision.
Query rewriting uses an LLM to rephrase queries for better retrieval. "Where do I send stuff back?" becomes "What is the return shipping address and procedure?" The rewritten query matches more relevant documents while preserving intent.
Hypothetical Document Embedding (HyDE) is a clever technique: generate a hypothetical answer to the query, embed that answer, and search for documents similar to the hypothetical answer. It works because the generated answer uses vocabulary and framing similar to the actual documents. We've found HyDE particularly effective for technical documentation where user queries use different terminology than the docs.
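A sketch of HyDE, assuming `generate`, `embed`, and `vector_index` are thin wrappers around your own LLM, embedding model, and vector store.

```python
def hyde_search(query: str, generate, embed, vector_index, top_k: int = 10):
    """Hypothetical Document Embedding: search with an imagined answer, not the query.

    generate(prompt) -> str and embed(text) -> list[float] are stand-ins for your
    LLM and embedding calls; vector_index.search(vector, k) is your vector store.
    """
    hypothetical = generate(
        f"Write a short passage that directly answers this question, "
        f"as if quoted from internal documentation:\n\n{query}"
    )
    # The fake answer shares vocabulary and framing with real documents, so its
    # embedding lands closer to them than the raw query's embedding would.
    return vector_index.search(embed(hypothetical), top_k)
```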
Multi-Stage Retrieval
Simple retrieval fetches the top K documents and stops. Multi-stage retrieval does multiple passes, refining results at each stage.
First stage: fast and broad. Retrieve a larger candidate set (100-500 documents) using cheap, fast methods—keyword search, approximate nearest neighbor with lower precision settings. Speed matters here; accuracy can be imperfect.
Second stage: slow and precise. Rerank the candidate set using a cross-encoder or more expensive embedding model. This is where you apply the expensive but accurate methods, now operating on a manageable set size.
Third stage: diversity. Ensure the final set isn't just the top-scoring documents, which are often near-duplicates. Apply a diversity algorithm such as maximal marginal relevance (MMR) to ensure variety in the retrieved set.
This three-stage approach gives you both recall (from the broad first stage) and precision (from the accurate second stage) without the cost of running expensive models on your entire corpus.
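For the diversity stage, here's a minimal MMR implementation over candidate embeddings; the lambda weight is a tunable placeholder.

```python
import numpy as np

def mmr(query_vec, doc_vecs, lambda_mult: float = 0.7, top_k: int = 5) -> list[int]:
    """Maximal marginal relevance: trade off query relevance against redundancy.

    query_vec: (d,) array; doc_vecs: (n, d) array of candidate embeddings.
    Returns indices of the selected candidates, most valuable first.
    """
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

    relevance = np.array([cosine(query_vec, d) for d in doc_vecs])
    selected, remaining = [], list(range(len(doc_vecs)))
    while remaining and len(selected) < top_k:
        best, best_score = None, -np.inf
        for i in remaining:
            # penalize candidates that are too similar to something already chosen
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            score = lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected
```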
Parent-Child Retrieval
Standard chunking creates a dilemma: small chunks improve retrieval precision but lose context; large chunks preserve context but hurt retrieval. Parent-child retrieval solves this by indexing at multiple granularities.
Index small chunks (the "children") for retrieval precision. When a child chunk matches, return its parent—a larger chunk that provides context. The retrieval system finds specific matches; the generation system gets coherent context.
Implementation requires linking chunks to their parents during indexing. When displaying search results, you can show the matched child highlighted within the parent context, helping users understand why that document was retrieved.
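A sketch of the lookup side, assuming a child-level index that returns (child_id, parent_id) pairs and a parent store keyed by ID.

```python
def parent_child_search(query_vec, child_index, parents: dict, top_k: int = 4) -> list[str]:
    """Retrieve on small child chunks, return the larger parent chunks for context.

    child_index.search(vector, k) -> list of (child_id, parent_id) is assumed;
    parents maps parent_id -> full parent text. Duplicate parents are collapsed
    so two matching children from the same section don't waste context budget.
    """
    seen, results = set(), []
    for child_id, parent_id in child_index.search(query_vec, k=top_k * 3):
        if parent_id in seen:
            continue
        seen.add(parent_id)
        results.append(parents[parent_id])
        if len(results) == top_k:
            break
    return results
```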
Self-Querying and Metadata Filtering
For structured domains, combining semantic search with metadata filtering improves results dramatically. "Sales reports from Q3" shouldn't semantically search for documents that mention "sales" and "Q3"—it should filter by document type and date, then semantically search within that subset.
Self-querying RAG uses an LLM to extract metadata filters from natural language queries. "What were John's expenses last month?" becomes a query with filters: author=John, type=expense, date=last_month. The filtering happens before retrieval, dramatically reducing the search space.
This pattern requires well-structured metadata—document types, dates, authors, departments—indexed alongside the document content. The upfront investment in metadata pays back in query precision.
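A minimal sketch of the filter-extraction step; `generate` stands in for your LLM call, and the filter schema shown is an example, not a fixed format.

```python
import json

FILTER_PROMPT = """Extract search filters from the user's question.
Return JSON with any of these keys you can infer, plus a "text" key for the
remaining semantic query: author, doc_type, date_from, date_to.

Question: {question}
JSON:"""

def extract_filters(question: str, generate) -> dict:
    """Use an LLM to turn a natural-language question into metadata filters.

    generate(prompt) -> str is a stand-in for your LLM call; falls back to an
    unfiltered semantic search if the model returns something unparseable.
    """
    try:
        return json.loads(generate(FILTER_PROMPT.format(question=question)))
    except (json.JSONDecodeError, TypeError):
        return {"text": question}

# "What were John's expenses last month?" might yield something like:
# {"author": "John", "doc_type": "expense", "date_from": "...", "date_to": "...",
#  "text": "expenses"}
```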
Handling Different Document Types
One-size-fits-all chunking works for homogeneous corpora. Real-world document collections are heterogeneous—and different document types need different handling.
Tables and Structured Data
Tables are particularly problematic for standard RAG. Chunk a table, and you lose the row-column relationships that make the data meaningful. "Revenue: $12M" means nothing without knowing which company, which period, which segment.
We handle tables by extracting them as structured data (JSON or Markdown table format), preserving column headers with each row. For retrieval, we index both the structured representation and a natural language description generated by an LLM. The description enables semantic search; the structured data preserves accuracy.
For complex tables, we generate multiple representations: the full table for comprehensive queries, individual rows for specific lookups, and summary statistics for aggregate queries.
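Here's a minimal sketch of the row-level representation; the company name and figures in the usage example are made up for illustration.

```python
def table_to_records(headers: list[str], rows: list[list[str]], caption: str) -> list[str]:
    """Turn a table into one indexable text record per row, headers included,
    so a value like 'Revenue: $12.4M' never loses its company, period, and segment."""
    records = []
    for row in rows:
        fields = "; ".join(f"{h}: {v}" for h, v in zip(headers, row))
        records.append(f"{caption} | {fields}")
    return records

# Usage:
table_to_records(
    headers=["Segment", "Q3 2024 revenue"],
    rows=[["Cloud", "$12.4M"], ["On-prem", "$3.1M"]],
    caption="Acme Corp quarterly results",
)
# ['Acme Corp quarterly results | Segment: Cloud; Q3 2024 revenue: $12.4M', ...]
```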
Code and Technical Documentation
Code requires special handling. You can't chunk in the middle of a function—the opening brace needs its closing brace. Docstrings should stay with the functions they describe. Import statements provide context for the entire file.
We use code-aware chunking that respects syntax structure: functions, classes, and methods as natural chunk boundaries. We include preceding comments and docstrings. For large files, we index at multiple levels: the full file summary, individual classes, and individual functions.
Language-specific handling matters. Python's significant whitespace, JavaScript's various module systems, SQL's statement boundaries—each language has chunking quirks.
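For Python source specifically, the standard library's ast module gets you most of the way; this sketch splits on top-level functions and classes and keeps line-range metadata. Other languages would need a parser such as tree-sitter.

```python
import ast

def chunk_python_source(source: str, path: str) -> list[dict]:
    """Split a Python file on top-level function and class boundaries.

    Each chunk keeps its docstring (it lives inside the node) and carries the
    file path plus line range as metadata. Requires Python 3.8+ for end_lineno.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append({
                "text": text,
                "metadata": {"path": path, "name": node.name,
                             "lines": (node.lineno, node.end_lineno)},
            })
    return chunks
```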
Multi-Modal Content
Documents increasingly include images, diagrams, and charts that carry information not present in the text. "See Figure 3" in the text is useless without Figure 3.
For diagrams and charts, we generate text descriptions using vision models, then index those descriptions alongside the visual content. The description enables retrieval; the original image provides accuracy for generation.
For presentations, we treat each slide as a document, combining the visual content (OCR'd text, image descriptions) with any speaker notes. The slide-as-document model works better than extracting all text into one blob.
Debugging and Troubleshooting
RAG systems fail in ways that are hard to diagnose without proper tooling.
The Retrieval Debug Loop
When a RAG system gives a bad answer, the first question is always: was it a retrieval problem or a generation problem? Did the system retrieve the right documents but fail to synthesize them correctly? Or did retrieval fail, leaving generation with nothing to work from?
We build explicit retrieval logging into every system: for each query, record the retrieved documents, their scores, and their ranks. When investigating a failure, start by examining what was retrieved. More often than not, retrieval is the culprit.
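A minimal sketch of that per-query logging, with field names chosen for illustration.

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("rag.trace")

def log_retrieval(query: str, results: list[dict], answer: str):
    """Emit one structured log line per query so a bad answer can be traced
    back to the exact documents (and scores) the model saw."""
    logger.info(json.dumps({
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "query": query,
        "retrieved": [
            {"id": r["id"], "score": round(r["score"], 4), "rank": i + 1}
            for i, r in enumerate(results)
        ],
        "answer_preview": answer[:200],
    }))
```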
Near-Miss Analysis
Often the relevant document was retrieved—but not in the top positions. It was at rank 15 instead of rank 3, pushed down by documents that matched keywords but lacked the actual answer.
Near-miss analysis examines these cases: what made the correct document score lower? Was the query phrasing poor? Was the chunk containing the answer too diluted with other content? Was there a keyword mismatch that hurt ranking?
These cases often reveal systematic issues—topics where your chunking strategy fails, query patterns that need expansion, metadata that could enable filtering.
User Feedback Integration
Users know when answers are wrong—or at least when answers don't help. Capturing this feedback creates a continuous improvement loop.
We implement explicit feedback (thumbs up/down, ratings) and implicit feedback (did the user ask a follow-up clarifying the same thing? did they give up and contact support?). Both signals feed into a review queue where bad cases get human analysis.
The analysis identifies patterns: are there topics where the system consistently fails? Are there query phrasings that confuse retrieval? Are there documents that should be in the corpus but aren't? Each finding becomes an improvement—better chunking, query expansion rules, corpus additions.
Scaling RAG Systems
Small-scale RAG is straightforward. Enterprise-scale RAG introduces new challenges.
Corpus Size Strategies
As document collections grow into millions of documents, naive approaches break. Vector search over millions of embeddings adds latency. Index updates become expensive. Storage costs escalate.
We address scale with tiered architectures. A fast tier contains recent and frequently accessed documents with aggressive caching. A standard tier contains the bulk of the corpus with optimized indexing. A cold tier archives rarely accessed documents with cheaper storage and on-demand indexing.
Routing queries to the right tier requires intent detection. "What's our current return policy?" should search only recent documents. "What was our policy in 2019?" needs the archive. Most queries can be satisfied from the fast tier, reducing average latency.
Multi-Tenant Considerations
Enterprise RAG often serves multiple tenants—different departments, clients, or access levels. Documents from different tenants must be isolated; users must see only what they're authorized to see.
We implement tenant isolation at the index level (separate vector collections per tenant), at the filter level (tenant ID as a mandatory filter), or at the permission level (document-level access control checked at query time). The right approach depends on scale, security requirements, and whether cross-tenant queries are ever needed.
Performance optimization for multi-tenant systems requires careful index design. Tenant-specific indexes are simpler but duplicate common content. Shared indexes with filtering are more efficient but require careful permission handling.
Cost Optimization at Scale
Enterprise RAG can get expensive fast. Embedding costs, vector database costs, LLM token costs—they all scale with query volume.
We optimize aggressively: cache frequent queries, batch similar queries, route simple queries to cheaper models, and use smaller embeddings where quality permits. For systems with predictable query patterns, pre-computing answers for common queries eliminates runtime costs entirely.
The 80/20 rule usually applies: 80% of queries can be answered with cached or pre-computed responses. The remaining 20% justify the full RAG pipeline. Identifying and handling the common cases cheaply makes the system economically viable at scale.
Common Failure Patterns We've Seen
After working on many RAG implementations, certain failure patterns recur often enough to be worth naming:
The "it worked in testing" failure. The test corpus was clean, small, and well-structured. Production documents are messy, enormous, and inconsistent. Systems optimized for synthetic test cases crumble under real-world messiness. Test with real, ugly data from day one.
The "users can't describe what they want" failure. RAG systems assume users know how to query. In practice, users ask vague questions, use the wrong terminology, and expect the system to read their minds. Query expansion, clarifying questions, and zero-result handling are essential, not optional.
The "stale documents" failure. The knowledge base is updated quarterly, but the business changes daily. Users ask about things that happened last week and get year-old information. Real-time or near-real-time indexing is often necessary, even if it's expensive.
The "one size fits all" failure. Different document types need different chunking strategies. Different query types need different retrieval parameters. Treating everything uniformly works for demos but fails for production diversity.
The "nobody owns it" failure. RAG systems need continuous maintenance—new document types, changing user needs, evolving business knowledge. Systems without clear ownership degrade as the organization changes around them.
Naming these patterns doesn't prevent them, but awareness helps. Every RAG project should have explicit plans for handling each one.
What We Recommend for New RAG Projects
Based on our experience, here's what we'd recommend for most new RAG implementations:
Start simple. Single vector database, basic chunking, straightforward retrieval. Add complexity only when you have data showing you need it. Many systems never need advanced features.
Instrument everything from day one. Every query, every retrieval, every generation. You'll need this data for optimization. It's much harder to add instrumentation later.
Build evaluation into the workflow. Not as an afterthought, but as a core part of how you develop. The ability to measure quality is what lets you improve it.
Plan for maintenance. RAG systems aren't set-and-forget. Someone needs to own document updates, quality monitoring, and user feedback. Build that into project planning.
Expect iteration. The first version won't be good enough. Budget for multiple rounds of improvement based on user feedback and evaluation data. The teams that ship once and move on have mediocre RAG systems.
Consider hybrid approaches. RAG alone might not be the answer. Sometimes structured data queries, sometimes cached responses, sometimes human escalation. The best systems know when RAG is the right tool and when it isn't.
Building a RAG system? We've done this enough times to know where the problems hide. Get in touch if you want to skip the mistakes we made.