RAG that actually works: beyond the naive vector search
Retrieval-augmented generation is the workhorse behind most useful LLM products: connect a model to your data so it answers from facts, not vibes. The first prototype is famously easy — embed some documents, do a similarity search, stuff the results into a prompt. Then you point it at real data and the quality falls off a cliff.
After shipping RAG into legal, medical and financial products, here's what we've learned actually matters.
Chunking is a product decision, not a default
How you split documents determines what the model can ever retrieve. Naive fixed-size chunks slice sentences in half and destroy context. We chunk along semantic boundaries — headings, sections, logical units — and attach metadata (source, date, section title) so retrieval can be filtered and citations can be precise.
Garbage chunks in, garbage answers out. In our experience, a large share of RAG quality is decided at indexing time, before a single query runs.
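To make the idea concrete, here is a minimal sketch of heading-based chunking with metadata attached. It assumes the source documents use Markdown-style `#` headings, which is an assumption for illustration only; the `Chunk` class and `chunk_by_headings` function are hypothetical names, not a description of any particular production pipeline.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str         # where the chunk came from, used for filtering and citations
    date: str           # document date, used for filtering
    section_title: str  # heading the chunk sits under


# Assumption: sections are introduced by Markdown-style headings ("# Title").
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headings(doc: str, source: str, date: str) -> list[Chunk]:
    """Split a document at headings so each chunk is one logical section."""
    chunks: list[Chunk] = []
    title = ""             # title for any text that appears before the first heading
    body: list[str] = []   # lines collected for the current section

    def flush() -> None:
        # Emit the current section as a chunk, skipping empty sections.
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk(text, source, date, title))
        body.clear()

    for line in doc.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Fees\nThe fee is 2%.\n\n# Term\nThe term is 12 months."
for chunk in chunk_by_headings(doc, source="contract.md", date="2024-01-01"):
    print(chunk.section_title, "|", chunk.text)
```

Real documents usually need more than this: very long sections may have to be split further, and PDFs or HTML need their own boundary detection. The point of the sketch is only that boundaries follow the document's structure and every chunk carries its metadata.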
Hybrid search beats pure vectors
Vector similarity is great at meaning but weak at exact terms — product codes, names, acronyms. We combine dense vector search with classic keyword (BM25) search and fuse the rankings. The result catches both 'what they meant' and 'the exact string they typed'.
- Dense retrieval for semantic intent
- Sparse/keyword retrieval for precise terms and identifiers
- A reranker to put the genuinely most relevant chunks at the top
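One common way to fuse a dense ranking and a keyword ranking is reciprocal rank fusion (RRF). The sketch below assumes RRF as the fusion method, which the text above does not specify; the function name, the document IDs, and the constant `k = 60` (a conventional default) are illustrative assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    so items ranked well by both retrievers rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a dense retriever and a BM25 retriever.
dense_hits = ["a", "b", "c"]
keyword_hits = ["c", "a", "d"]
print(reciprocal_rank_fusion([dense_hits, keyword_hits]))
# → ['a', 'c', 'b', 'd']
```

RRF only looks at rank positions, so it sidesteps the problem that vector similarity scores and BM25 scores live on different scales.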
Rerank before you generate
Retrieval gives you candidates; a cross-encoder reranker tells you which ones actually answer the question. Adding a reranking step is the single highest-ROI upgrade we make to most RAG systems — it consistently lifts answer quality more than swapping to a bigger generation model.
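The shape of a reranking step can be sketched as follows. The scoring function here is a toy word-overlap measure standing in for a real cross-encoder, purely so the example runs without a model; in practice the `score` argument would call a trained cross-encoder that reads the query and passage together. All names are hypothetical.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    top_n: int = 3,
) -> list[str]:
    """Score every (query, passage) pair and keep the top_n passages."""
    scored = [(score(query, passage), passage) for passage in candidates]
    # Sort by score only; Python's sort is stable, so ties keep retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:top_n]]


def toy_overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer (NOT a cross-encoder): fraction of query words in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & passage_words) / len(query_words)


candidates = [
    "Fees are billed monthly.",
    "The notice period for termination is 30 days.",
    "Termination requires written notice.",
]
print(rerank("termination notice period", candidates, toy_overlap_score, top_n=2))
```

Because the reranker is far more expensive per pair than first-stage retrieval, it typically runs only on the few dozen candidates that retrieval returns, not on the whole corpus.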
Cite everything
Users don't trust an AI that asserts. They trust one that shows its sources. We make the model quote and link the exact passages it used, so every answer is auditable. This also turns hallucinations into something you can catch: if the cited passage doesn't support the claim, the claim is unverified and should be treated as a failure.
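A first, cheap audit is to check that every quoted passage really appears in the chunk it cites. The sketch below assumes the model returns citations as dictionaries with `chunk_id` and `quote` keys, which is a hypothetical output format. It only verifies that the quote exists; judging whether the quote supports the claim still needs a human or a separate check.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return " ".join(text.lower().split())


def unsupported_citations(citations: list[dict], chunks: dict[str, str]) -> list[dict]:
    """Return the citations whose quote is not found in the cited chunk.

    citations: [{"chunk_id": ..., "quote": ...}]  (hypothetical format)
    chunks:    mapping from chunk ID to the retrieved chunk text
    """
    bad = []
    for citation in citations:
        quote = normalize(citation["quote"])
        chunk_text = normalize(chunks.get(citation["chunk_id"], ""))
        # An empty quote or a quote missing from the chunk counts as unsupported.
        if not quote or quote not in chunk_text:
            bad.append(citation)
    return bad


chunks = {"c1": "The notice period for termination is 30 days."}
citations = [
    {"chunk_id": "c1", "quote": "notice period for termination is 30 days"},
    {"chunk_id": "c1", "quote": "notice period is 90 days"},  # not in the chunk
    {"chunk_id": "c9", "quote": "anything"},                  # unknown chunk
]
print(unsupported_citations(citations, chunks))
```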
Measure retrieval and generation separately
When a RAG answer is bad, you need to know why: did retrieval fail to find the right document, or did the model ignore it? We evaluate the two stages independently — retrieval recall on one axis, answer faithfulness on the other — so we fix the actual bottleneck instead of randomly swapping components.
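A minimal sketch of keeping the two axes apart: retrieval is scored with recall@k against a labelled set of relevant chunk IDs, and faithfulness is taken as a boolean supplied from elsewhere (a human label or a separate judge, which is an assumption here and not implemented). The function names and the triage rule are illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def diagnose(
    retrieved: list[str],
    relevant: set[str],
    answer_is_faithful: bool,
    k: int = 5,
) -> str:
    """Attribute a result to a stage: 'retrieval', 'generation', or 'ok'."""
    if recall_at_k(retrieved, relevant, k) == 0.0:
        return "retrieval"   # the right chunks never reached the model
    if not answer_is_faithful:
        return "generation"  # the evidence was there but the answer ignored it
    return "ok"


print(recall_at_k(["a", "b", "c"], {"a", "d"}, k=3))               # → 0.5
print(diagnose(["x", "y"], {"a"}, answer_is_faithful=False, k=2))  # → retrieval
print(diagnose(["a", "y"], {"a"}, answer_is_faithful=False, k=2))  # → generation
```

Counting how many failures land in each bucket over a test set shows which stage is the bottleneck.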
The bottom line
Great RAG is an information-retrieval problem wearing an AI costume. Invest in chunking, hybrid retrieval, reranking and citations — and evaluate each stage on its own — and you'll ship answers people actually rely on.
