Essay · Nov 2025 · 11 min read

RAG without the sales pitch.

Twelve checks that separate a working RAG system from a demo. Nothing here is novel. Most are skipped.

Every other LinkedIn post in 2025 was a RAG architecture diagram. Most of them fall apart in production. Here’s the checklist I run before calling a RAG system done.

Retrieval

1. You measured retrieval recall@k separately from end-to-end accuracy

If you only have an end-to-end metric, you don’t know whether your failures are retrieval failures or generation failures. They have completely different fixes. Build a retrieval-only eval. Score recall@k on documents you marked as the gold answer.
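The metric itself is a few lines. A minimal sketch, assuming each eval example stores the IDs of the chunks you marked as gold:

```python
def recall_at_k(retrieved_ids, gold_ids, k=5):
    """Fraction of gold chunks that appear in the top-k retrieved results."""
    if not gold_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(gold_ids)) / len(gold_ids)

def mean_recall_at_k(eval_pairs, k=5):
    """Average recall@k over (retrieved_ids, gold_ids) pairs."""
    return sum(recall_at_k(r, g, k) for r, g in eval_pairs) / len(eval_pairs)
```

Run this against the retriever alone, with no model in the loop, and you know exactly which half of the pipeline is failing.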

2. You’re using hybrid search, not just dense embeddings

Embeddings miss exact-match queries (product SKUs, error codes, names with unusual spellings). BM25 catches them. Combine the two. The cheapest 10% accuracy bump in the entire pipeline is adding BM25 alongside whatever dense retriever you started with.
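One common way to combine the two ranked lists is reciprocal rank fusion, which needs no score normalization between BM25 and the dense retriever. A sketch (the constant `k=60` is the conventional RRF default, not something tuned here):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc IDs (e.g. [bm25_ids, dense_ids]) into one.

    Each document scores 1 / (k + rank) in every list it appears in;
    documents ranked well by either retriever float to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that only one retriever finds still make the list, which is exactly the behavior you want for SKUs and error codes.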

3. Your chunking strategy is matched to your documents

A naive 512-token sliding window with 50-token overlap is the default everyone uses, and it’s wrong for almost every domain. Code? Chunk by function. Legal? Chunk by clause. Customer support tickets? Chunk by message. The chunking strategy is the single biggest free win in retrieval, and the one most teams skip.
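For the code case, structure-aware chunking can be as simple as walking the AST. A sketch for Python source (assumes Python 3.8+, where nodes carry `end_lineno`):

```python
import ast

def chunk_python_by_function(source):
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return chunks
```

The same idea applies elsewhere: split on the boundaries your documents already have (clauses, messages, headings) rather than on a token count.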

4. You’re reranking

A cross-encoder reranker on the top 50 results, returning the top 5, is a 15–30% recall@k improvement on most domains. Cohere and Voyage both have hosted ones. Latency cost: ~80ms. There is no excuse for not having this.
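The shape of the step is the same regardless of provider. A sketch where `score(query, doc)` stands in for whatever cross-encoder you call, hosted or local:

```python
def rerank(query, candidates, score, n_in=50, n_out=5):
    """Re-order the top candidates with a stronger (slower) relevance scorer.

    `score(query, doc) -> float` is a placeholder for a cross-encoder call;
    first-stage retrieval supplies `candidates` in its own ranked order.
    """
    pool = candidates[:n_in]
    ranked = sorted(pool, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:n_out]
```

The 50-in/5-out split is the usual trade: the cross-encoder is too slow to score your whole corpus, but cheap enough to score fifty candidates.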

Generation

5. Your prompt explicitly tells the model what to do when retrieval fails

If retrieval returns nothing relevant, your model will hallucinate. The fix is one sentence in the prompt: “If the context does not contain the answer, say ‘I don’t know based on the provided documents.’” Then you score how often it does that instead of inventing things.
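A sketch of both halves: building the prompt with the abstention instruction, and checking whether a response abstained so you can score the rate. The marker string and function names here are illustrative, not a standard:

```python
ABSTAIN_MARKER = "I don't know based on the provided documents"

def build_prompt(question, chunks):
    """Assemble a grounded prompt with numbered chunks and an abstain rule."""
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the context below.\n"
        f'If the context does not contain the answer, say "{ABSTAIN_MARKER}."\n\n'
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

def abstained(answer):
    """True if the model declined to answer, for scoring abstention rate."""
    return ABSTAIN_MARKER.lower() in answer.lower()
```

Run `abstained` over your adversarial eval questions (check 9) and you get the number that matters: how often the model says “I don’t know” when it should.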

6. Citations are required, not optional

Every claim in the response cites a chunk by ID. This is a one-line prompt change and it transforms the system from a black box into something you can audit. It also dramatically reduces hallucinations — the model will not invent a citation as readily as it will invent a fact.
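Requiring citations also makes them checkable. A sketch of a validator, assuming citations are rendered as bracketed chunk IDs like `[c12]` (your format may differ):

```python
import re

def validate_citations(answer, retrieved_ids):
    """Find every [id] citation and flag ones that point at no retrieved chunk."""
    cited = set(re.findall(r"\[([^\[\]]+)\]", answer))
    return {
        "cited": cited,
        "invalid": cited - set(retrieved_ids),  # invented citations
        "has_citation": bool(cited),
    }
```

An invented citation is a hallucination you can detect mechanically, without a judge model, on every single production response.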

7. You’re using a model the size of the task

I’ve seen teams burn budget on Claude Opus or GPT-4 for what is effectively a structured-extraction task that Claude Haiku does just as well 80% cheaper. Pick the smallest model that hits your eval bar. Re-test quarterly — smaller models keep getting better.

Eval

8. You have an eval set with at least 100 examples

Below 100, you can’t tell signal from noise on a single percentage-point change. Above 1000, you’re paying for diminishing returns on labeling effort. 100–300 is the sweet spot for most teams.
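The 100-example threshold falls out of binomial noise. The standard error of a measured accuracy is:

```python
import math

def accuracy_std_error(accuracy, n):
    """Standard error of a measured accuracy over n independent examples."""
    return math.sqrt(accuracy * (1 - accuracy) / n)

# At a true accuracy of 80%:
#   n=100  -> ~4.0 points of noise per run
#   n=1000 -> ~1.3 points
```

At n=100 a one-point swing is well inside the noise, which is why chasing single-point changes on a small eval set is chasing randomness.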

9. Your eval includes adversarial examples

Questions whose answer is “the documents don’t say.” Questions phrased ambiguously. Questions about content that contradicts what’s in retrieval. If your eval set is only happy-path, you’re measuring the wrong thing.

10. You run the eval in CI

Every prompt change. Every retriever change. Every model upgrade. The day you don’t catch a 5% regression in CI is the day a customer catches it.
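The CI gate itself can be small. A sketch that compares per-example results against a stored baseline, so a failing build also tells you *which* examples regressed (the result format here is an assumption, not a standard):

```python
def regression_report(baseline, current, max_regression=0.02):
    """Compare per-example pass/fail dicts (example_id -> bool).

    Fails if aggregate accuracy dropped more than `max_regression`,
    and lists examples that passed at baseline but fail now.
    """
    newly_failing = sorted(
        eid for eid, ok in current.items()
        if not ok and baseline.get(eid, False)
    )
    base_acc = sum(baseline.values()) / len(baseline)
    cur_acc = sum(current.values()) / len(current)
    return {
        "passed": (base_acc - cur_acc) <= max_regression,
        "newly_failing": newly_failing,
        "baseline_accuracy": base_acc,
        "current_accuracy": cur_acc,
    }
```

Wire the `passed` flag to the build’s exit code and the 5% regression never reaches a customer.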

Operations

11. You’re logging the full retrieval payload, not just the answer

When a customer reports a bad answer, you need to be able to reconstruct what the model saw. That means: the query, the embeddings, the chunks retrieved (in order), the reranked order, the final prompt, and the response. All of it. Cheap on disk, priceless when something breaks.
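A sketch of the record, written as JSON lines so any log pipeline can carry it. Field names are illustrative; embeddings can be added as another field if your storage budget allows:

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field

@dataclass
class RetrievalTrace:
    """Everything needed to reconstruct what the model saw for one request."""
    query: str
    retrieved_ids: list   # chunks in original retrieval order
    reranked_ids: list    # order after the reranker
    final_prompt: str
    response: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)

def log_trace(trace, fh):
    """Append one trace as a JSON line."""
    fh.write(json.dumps(asdict(trace)) + "\n")
```

When the bad-answer report comes in, you replay the trace instead of guessing.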

12. You have a feedback loop from production back into eval

Bad answers reported by users become entries in your eval set the next week. Without this, your eval set is frozen at the moment you wrote it, and your production traffic has moved on. The loop is what keeps the system useful.

What I left out

I deliberately didn’t cover: vector database choice (Pinecone, Weaviate, pgvector are all fine for <10M chunks), embedding model choice (start with text-embedding-3-large, switch only when you have a measured reason), or fine-tuning (you almost certainly don’t need to). These are the choices that get the LinkedIn posts. They’re also the choices that matter least until you’ve done the twelve above.

None of this is novel. Most of it is skipped. That’s the gap between a RAG demo and a RAG system that ships.
