Retrieval Strategy Playbook
A practical guide to retrieval quality, indexing choices, and grounding strategy.
By Ryan Setter
Retrieval is not a feature. It is the memory interface between your model and the world you actually operate in.
This playbook is for engineers and architects who need retrieval to behave like infrastructure: measurable, debuggable, and boring (in the best way).
The retrieval system boundary (model vs system)
If you remember one framing, make it this:
- The model generates tokens.
- The retrieval system decides what the model is allowed to know right now.
When retrieval fails, people blame the model. When retrieval succeeds, people credit the prompt. Both are convenient stories.
The canonical RAG pipeline (reference architecture)
This is the shape you will converge on even if you pretend you won't:
User
-> Intent / answer-class router
-> (No retrieval) -> LLM
-> (Retrieval)
-> Query builder (rewrite/decompose)
-> Candidate fetch (sparse + dense + filters)
-> Reranking (cheap -> expensive)
-> Context assembly (dedupe/pack/cite)
-> LLM generation (with constraints)
-> Post-checks (grounding, policy, formatting)
Everything in that middle block is engineering. None of it is vibes.
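The shape above can be sketched as composed stages. Everything here is a stub under assumed names (`route`, `fetch_candidates`, etc. are illustrative, not a real API); the point is the wiring, not the implementations:

```python
# Minimal sketch of the reference pipeline as composable stages.
# Every stage body is a stub/assumption standing in for real components.

NO_RETRIEVAL = ("hello", "thanks")  # stub allowlist for the no-retrieval path

def route(query: str) -> str:
    """Stub answer-class router: direct generation for chit-chat, else retrieval."""
    return "direct" if query.lower().strip("!. ") in NO_RETRIEVAL else "retrieval"

def build_queries(query: str) -> list[str]:
    return [query.strip()]                       # rewrite/decompose would go here

def fetch_candidates(queries: list[str]) -> list[dict]:
    corpus = [{"chunk_id": "c1", "text": "retrieval is ranking"}]
    return corpus                                # sparse + dense + filters in reality

def rerank(query: str, cands: list[dict]) -> list[dict]:
    return cands                                 # cheap -> expensive ladder

def assemble(cands: list[dict], budget: int = 2000) -> str:
    return "\n".join(c["text"] for c in cands)[:budget]

def answer(query: str) -> str:
    if route(query) == "direct":
        return f"LLM({query})"                   # no-retrieval path
    context = assemble(rerank(query, fetch_candidates(build_queries(query))))
    return f"LLM({query} | context: {context})"
```

Each stage is independently replaceable and testable, which is what makes the middle block engineering rather than vibes.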
Step 0: Define answer classes (the router you keep avoiding)
Not every question needs retrieval. If you retrieve by default, you will:
- pay latency/cost for nothing,
- inject irrelevant context (quality regression),
- and eventually build a "RAG system" that is really a randomness amplifier.
Define answer classes and route explicitly:
| Answer class | What it is | Retrieval? | Typical source of truth | Failure mode if you get it wrong |
|---|---|---|---|---|
| Static reference | Stable facts that don't change often | Usually yes | Docs, manuals, KB | Hallucinated details / outdated answers |
| Dynamic policy | Rules that change (pricing, access, on-call, incident status) | Yes, with freshness | Config, DB, APIs | Confidently wrong policy decisions |
| Personalized state | User/team-specific state | Yes, with permissions | CRM, tickets, profiles | Data leakage or generic answers |
| Procedural reasoning | "Think through it" tasks | Maybe | Sometimes none | Retrieval distracts reasoning |
| Tool-required | Needs computation / action | No (or minimal) | Tools | Model makes up results |
Dry truth: most "hallucinations" in production are "wrong answer class + ungrounded generation."
Implementation note: your router can be rules, a small classifier, or an LLM with a strict schema. The key is that it is versioned and measurable.
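A rules-first router can be a few regexes with a version stamp. The patterns and class names below are illustrative assumptions, not a production taxonomy:

```python
import re

# Hypothetical rules-first router over the answer classes above.
# Patterns are examples only; a real router earns its rules from traffic.
RULES = [
    (re.compile(r"\b(price|pricing|on-call|incident)\b", re.I), "dynamic_policy"),
    (re.compile(r"\b(my|our)\b.*\b(ticket|account|team)\b", re.I), "personalized_state"),
    (re.compile(r"\b(calculate|sum|convert)\b", re.I), "tool_required"),
]

ROUTER_VERSION = "router-v3"  # version every decision so it is attributable

def route(query: str) -> dict:
    for pattern, answer_class in RULES:
        if pattern.search(query):
            return {"class": answer_class, "version": ROUTER_VERSION}
    return {"class": "static_reference", "version": ROUTER_VERSION}
```

Because the version travels with every decision, you can diff routing behavior across releases instead of guessing.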
Step 1: Define the corpus contract (what are we indexing, exactly?)
Before embeddings, decide what a "document" is in your system.
Minimum contract you want to lock down:
- `doc_id`: stable identifier (never reuse)
- `source`: system-of-record (wiki, repo, ticketing, etc.)
- `title`, `url` (or internal handle)
- `owner` / `domain`
- `created_at`, `updated_at`, `ingested_at`
- `acl`: permissions boundary (tenant/team/user)
- `content_hash`: detect churn + avoid unnecessary re-embeds
If you cannot answer "what is the authoritative source for this chunk?" you are building a rumor database.
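As a sketch, the contract fits in a frozen record; the field names follow the list above, and the defaults are assumptions:

```python
from dataclasses import dataclass, field
import hashlib
import time

# Sketch of the minimum corpus contract as an immutable record.
@dataclass(frozen=True)
class Document:
    doc_id: str                 # stable identifier, never reused
    source: str                 # system-of-record (wiki, repo, ticketing, ...)
    title: str
    url: str                    # or internal handle
    owner: str
    acl: tuple[str, ...]        # permissions boundary (tenant/team/user)
    content: str
    created_at: float
    updated_at: float
    ingested_at: float = field(default_factory=time.time)

    @property
    def content_hash(self) -> str:
        # detect churn and skip unnecessary re-embeds
        return hashlib.sha256(self.content.encode()).hexdigest()

doc = Document("doc-1", "wiki", "Runbook", "wiki://runbook", "sre-team",
               ("tenant-a",), "restart the pods", 0.0, 0.0)
```

The ingestion pipeline compares `content_hash` against the last indexed version and only re-embeds on change.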
Step 2: Chunking is an information architecture problem, not a token problem
Chunking determines what retrieval can possibly return. If chunking is wrong, no embedding model will save you.
Good chunking properties:
- Semantic boundaries: split on headings/sections/paragraph intent.
- Addressable: each chunk has a stable `chunk_id` and cites back to a source span.
- Composable: chunks can be re-assembled into "just enough context" without dragging the entire internet.
Common strategies (pick intentionally):
| Strategy | Best for | Cost | Failure mode |
|---|---|---|---|
| Section-based | Docs with headings | Low | Misses cross-section context |
| Sliding window + overlap | Unstructured text | Medium | Duplicate context + wasted tokens |
| Multi-granularity (small + parent) | Precision + coherence | Medium | More index complexity |
| Structured extraction (FAQs, tables, policies) | Highly operational content | High | Requires reliable parsing |
Practical recommendation:
- Index two granularities: a smaller "answerable chunk" and a larger "parent section" pointer.
- Retrieve small chunks for precision, then optionally attach parent context if the model needs coherence.
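The two-granularity pattern can be sketched like this; `parent_id` linkage is the key idea, and the function name and record shape are assumptions:

```python
# Sketch of multi-granularity chunking: small "answerable" chunks that
# point back to their parent section via parent_id.

def chunk_section(doc_id: str, heading: str, paragraphs: list[str]) -> list[dict]:
    parent_id = f"{doc_id}#{heading}"
    parent = {"chunk_id": parent_id, "granularity": "parent",
              "text": "\n\n".join(paragraphs)}
    children = [
        {"chunk_id": f"{parent_id}/{i}", "granularity": "small",
         "parent_id": parent_id, "text": p}
        for i, p in enumerate(paragraphs)
    ]
    return [parent] + children

chunks = chunk_section("doc-1", "restarts",
                       ["Drain the node first.", "Then restart the pod."])
# retrieve small chunks for precision; follow parent_id when coherence is needed
```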
Step 3: Index for intent (dense, sparse, hybrid, and filters)
Retrieval is a ranking problem under constraints (latency, cost, governance). Your index choice is how you express that.
Sparse (BM25 / keyword)
- Strong when queries contain rare terms, IDs, error codes, product names.
- Predictable failure: synonyms and paraphrases.
Dense (embeddings)
- Strong when queries are conceptual ("how does X relate to Y?").
- Predictable failure: over-smoothing (everything looks kind of similar), weak on exact matches.
Hybrid (recommended default)
Hybrid wins in production because reality contains both:
- human language ("why is latency spiking?")
- and machine identifiers ("OOMKilled", "HTTP 429", "HNSW efSearch").
Implementation pattern:
- Retrieve candidates from sparse and dense indexes.
- Normalize scores per channel.
- Merge with simple weighting.
- Rerank on the merged candidate set.
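The merge step above can be sketched in a few lines. Min-max normalization and a single `dense_weight` are one reasonable baseline among several (reciprocal rank fusion is a common alternative); candidate sets here are plain `{chunk_id: score}` dicts:

```python
# Sketch of hybrid merge: normalize each channel's scores to [0, 1],
# then combine with a simple weight and sort.

def normalize(scores: dict[str, float]) -> dict[str, float]:
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0                      # avoid divide-by-zero
    return {cid: (s - lo) / span for cid, s in scores.items()}

def hybrid_merge(sparse: dict[str, float], dense: dict[str, float],
                 dense_weight: float = 0.5) -> list[tuple[str, float]]:
    ns, nd = normalize(sparse), normalize(dense)
    merged = {cid: (1 - dense_weight) * ns.get(cid, 0.0)
                   + dense_weight * nd.get(cid, 0.0)
              for cid in ns.keys() | nd.keys()}
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

ranked = hybrid_merge({"c1": 12.0, "c2": 4.0}, {"c2": 0.91, "c3": 0.88})
```

Per-channel normalization matters because BM25 scores and cosine similarities live on different scales; merging raw scores silently lets one channel dominate.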
Metadata filters (non-negotiable)
Filters are not "nice to have." They are how you prevent:
- cross-tenant leakage,
- wrong-environment answers,
- and stale policy being treated as current.
At minimum, your retrieval call needs to support filters like:
- `tenant_id` / `team_id`
- `environment` (prod/stage)
- `doc_type` (policy/runbook/spec)
- `updated_at` ranges (freshness)
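A minimal sketch of filter enforcement, assuming chunk metadata carries the keys above. In production the vector store pushes these filters down into the index; the important property is that the wrong tenant's chunks are never candidates at all:

```python
# Sketch of pre-ranking filter enforcement over chunk metadata.

def passes_filters(meta: dict, *, tenant_id: str, environment: str,
                   updated_after: float = 0.0) -> bool:
    return (meta.get("tenant_id") == tenant_id
            and meta.get("environment") == environment
            and meta.get("updated_at", 0.0) >= updated_after)

chunks = [
    {"chunk_id": "c1", "tenant_id": "a", "environment": "prod", "updated_at": 200.0},
    {"chunk_id": "c2", "tenant_id": "b", "environment": "prod", "updated_at": 300.0},
]
visible = [c for c in chunks if passes_filters(c, tenant_id="a", environment="prod")]
# only c1 survives: tenant b's chunk is excluded before ranking ever sees it
```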
Step 4: Query building (rewrite, decompose, and don't get injected)
Your user query is not a search query. It is an intent signal.
Useful query transforms:
- Rewrite: remove pleasantries, expand acronyms, canonicalize product names.
- Decompose: split multi-part questions into subqueries.
- Entity extraction: pull IDs, services, error codes, versions.
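The three transforms can be sketched as small pure functions; the acronym map and the identifier pattern are illustrative assumptions:

```python
import re

# Illustrative query transforms; ACRONYMS and the regexes are assumptions.
ACRONYMS = {"k8s": "kubernetes", "oom": "out of memory"}

def rewrite(query: str) -> str:
    """Strip pleasantries, expand known acronyms."""
    q = re.sub(r"^(hi|hey|please)[,!\s]+", "", query.strip(), flags=re.I)
    return " ".join(ACRONYMS.get(w.lower(), w) for w in q.split())

def extract_entities(query: str) -> list[str]:
    """Pull identifier-like tokens, e.g. HTTP status codes or CamelCase names."""
    return re.findall(r"\b(?:HTTP \d{3}|[A-Z]{2,}[A-Za-z]*[a-z]\w*)\b", query)

def decompose(query: str) -> list[str]:
    """Naively split multi-part questions into subqueries."""
    return [part.strip() for part in re.split(r"\band\b", query) if part.strip()]
```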
Guardrail:
- Never let retrieved content change your query-building policy.
If your query rewrite can be instructed by the corpus, you have invented prompt injection with extra steps.
Step 5: Candidate retrieval (optimize for recall, then earn precision)
Think in two stages:
- Stage A: high-recall candidate fetch
- Stage B: precision via reranking
Operational dials that matter:
- `k_dense` / `k_sparse`: start larger than you think, then measure.
- Dynamic k: raise k when the query is ambiguous; lower it when the query contains rare identifiers.
- Time boost: prefer newer content for dynamic policy classes.
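The dynamic-k dial can be sketched as a heuristic; the thresholds and identifier pattern here are assumptions to tune against your own traffic:

```python
import re

# Heuristic sketch of dynamic k: widen the candidate fetch for ambiguous
# natural-language queries, narrow it when rare identifiers pin the target.
ID_PATTERN = re.compile(r"\b(?:[A-Z]{2,}\w*|\w+-\d+|HTTP \d{3})\b")

def choose_k(query: str, k_base: int = 50) -> int:
    has_identifier = bool(ID_PATTERN.search(query))
    is_short = len(query.split()) <= 3          # short queries are often ambiguous
    if has_identifier:
        return max(10, k_base // 5)             # identifiers are precise: lower k
    if is_short:
        return k_base * 2                       # ambiguous: raise k, rely on rerank
    return k_base
```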
If you only do one thing: stop tuning embedding models and start measuring candidate recall.
Step 6: Reranking (the cheapest quality win you will ever ship)
Reranking answers: "given these candidates, which ones actually answer the query?"
Typical ladder:
| Reranker | Cost | When to use |
|---|---|---|
| Lightweight heuristic | Low | Remove duplicates, down-rank boilerplate |
| Cross-encoder | Medium | Default for quality-critical retrieval |
| LLM judge rerank | High | Last mile when you can afford latency/cost |
Practical notes:
- Cache rerank results for repeated queries (or near-duplicates).
- Rerank on chunk text + title + metadata, not just raw chunk.
- If your reranker is an LLM, constrain it: JSON output, explicit rubric.
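The bottom rung of the ladder plus the caching note can be sketched together. The overlap scorer and boilerplate list are assumptions; a cross-encoder would replace `score()` on the medium rung while the caching pattern stays the same:

```python
from functools import lru_cache

# Sketch of a lightweight heuristic reranker with caching.
BOILERPLATE = ("table of contents", "copyright", "all rights reserved")

def score(query: str, title: str, text: str) -> float:
    """Token overlap over title + text, with boilerplate down-ranked."""
    q = set(query.lower().split())
    doc = set((title + " " + text).lower().split())
    overlap = len(q & doc) / (len(q) or 1)
    penalty = 0.5 if any(b in text.lower() for b in BOILERPLATE) else 0.0
    return overlap - penalty

@lru_cache(maxsize=4096)            # cache repeated (query, candidates) pairs
def rerank(query: str, candidates: tuple[tuple[str, str, str], ...]) -> tuple[str, ...]:
    ranked = sorted(candidates, key=lambda c: score(query, c[1], c[2]), reverse=True)
    return tuple(chunk_id for chunk_id, _, _ in ranked)

cands = (("c1", "Restart runbook", "drain the node then restart the pod"),
         ("c2", "Legal", "copyright all rights reserved"))
order = rerank("how do I restart the pod", cands)
```

Candidates are passed as tuples rather than lists so `lru_cache` can hash them; that is the whole caching trick.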
Step 7: Context assembly (packing is a budgeted optimization problem)
The output of retrieval is not "context." It is a set of candidates that must be packed into a constrained window.
Context assembly should be deterministic:
- Deduplicate near-identical chunks.
- Prefer primary sources over summaries.
- Keep citations stable (chunk IDs don't change).
- Quote the exact lines that support key claims when possible.
If you do citations, make them enforceable:
- Require the model to cite a chunk for any non-trivial claim.
- Reject or down-rank outputs with missing/irrelevant citations.
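Deterministic packing can be sketched as dedupe-then-greedy-pack under a budget. Token counting is approximated by whitespace split here (a real system would use the model's tokenizer), and the record shape is an assumption:

```python
import hashlib

# Sketch of deterministic context assembly: dedupe near-identical chunks
# by normalized hash, then greedily pack by rank under a token budget.

def pack_context(ranked_chunks: list[dict], budget_tokens: int = 1500) -> list[dict]:
    seen_hashes: set[str] = set()
    packed, used = [], 0
    for chunk in ranked_chunks:
        norm = " ".join(chunk["text"].lower().split())
        h = hashlib.sha256(norm.encode()).hexdigest()
        if h in seen_hashes:
            continue                              # near-duplicate, skip
        cost = len(chunk["text"].split())         # tokenizer proxy
        if used + cost > budget_tokens:
            break                                 # budget exhausted
        seen_hashes.add(h)
        used += cost
        packed.append({"chunk_id": chunk["chunk_id"], "text": chunk["text"]})
    return packed                                 # cite via stable chunk_id

ctx = pack_context([
    {"chunk_id": "c1", "text": "Drain the node first."},
    {"chunk_id": "c2", "text": "drain  the node first."},   # near-duplicate of c1
    {"chunk_id": "c3", "text": "Then restart the pod."},
])
```

Because the same ranked input always yields the same packed context, retrieval traces become reproducible, which is what makes citation checks enforceable.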
Step 8: Grounding and evaluation (measure the pipeline, not the vibes)
You need two eval layers:
Retrieval layer metrics
- `recall@k`: did we retrieve relevant chunks at all?
- MRR: how high did the first relevant chunk appear?
- nDCG: are we ranking relevance well across the list?
Answer layer metrics
- Faithfulness: are claims supported by the provided context?
- Attribution quality: do citations map to the claims they appear next to?
- Refusal correctness: did we abstain when we should?
Build a small golden set (50-200 questions) that matches real traffic. Update it monthly. Treat it like tests.
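The retrieval-layer metrics are a few lines each; this sketch assumes each golden-set example maps a query to the set of relevant `chunk_id`s and the ranked list you actually retrieved:

```python
# Sketch of recall@k and MRR over a golden set.

def recall_at_k(relevant: set[str], retrieved: list[str], k: int) -> float:
    """Fraction of relevant chunks found in the top k."""
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved[:k])) / len(relevant)

def mrr(relevant: set[str], retrieved: list[str]) -> float:
    """Reciprocal rank of the first relevant chunk, 0.0 if none retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

golden = [
    {"relevant": {"c3"}, "retrieved": ["c1", "c3", "c9"]},
    {"relevant": {"c7"}, "retrieved": ["c7", "c2", "c4"]},
]
avg_recall = sum(recall_at_k(g["relevant"], g["retrieved"], 3) for g in golden) / len(golden)
avg_mrr = sum(mrr(g["relevant"], g["retrieved"]) for g in golden) / len(golden)
```

Run these on every index or chunking change, exactly like a test suite.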
Step 9: Observability (how to debug retrieval in one screen)
Instrument retrieval as a first-class system.
Minimum events per request:
- router decision (answer class)
- query transforms (what changed)
- candidate fetch stats (per channel)
- rerank stats (latency, top reasons if available)
- context pack stats (tokens, dedupe rate)
- generation outcome (accepted/edited/refused)
Minimum dimensions:
- request ID, tenant/team
- model + prompt version
- index version + embedding model version
- top retrieved `doc_id` / `chunk_id` list
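The events and dimensions above fit in one structured log line per request. The field names and version strings below are illustrative assumptions; `retrieval_event` would feed whatever logging pipeline you already run:

```python
import json
import time

# Sketch of a per-request retrieval trace event carrying the minimum
# events and dimensions listed above.

def retrieval_event(request_id: str, tenant: str, **stages) -> str:
    event = {
        "ts": time.time(),
        "request_id": request_id,
        "tenant": tenant,
        "prompt_version": "p-12",          # assumed versioning scheme
        "index_version": "idx-2024-06",    # assumed
        "embedding_model": "embed-v2",     # assumed
        **stages,                          # router, candidates, rerank, pack, outcome
    }
    return json.dumps(event, sort_keys=True)

line = retrieval_event(
    "req-123", "tenant-a",
    router={"class": "static_reference"},
    candidates={"sparse": 40, "dense": 60},
    top_chunks=["doc-1/intro/0", "doc-7/faq/3"],
)
```

One line per request is enough to replay a failure from a request ID alone.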
Step 10: Failure modes and fixes (the playbook part)
When retrieval is "bad," classify the failure before you tweak knobs.
| Symptom | Likely cause | Fix |
|---|---|---|
| Great answers in dev, terrible in prod | permissions/filters, index drift | verify filters + index versions per env |
| Correct docs exist, never retrieved | chunking too coarse/fine, sparse missing | re-chunk; add sparse channel; raise k |
| Retrieved chunks are relevant, answer is wrong | context assembly, instruction conflict | pack fewer, higher-signal chunks; tighten constraints |
| Citations exist but don't support claims | post-processing missing | add faithfulness checks + rejection |
| Answers are stale | ingestion lag, freshness signal missing | index faster; add time boost; route dynamic policy to APIs |
| Everything looks vaguely relevant | embedding collapse, no rerank | add reranker; add sparse; add filters |
If you can't reproduce the failure with a request ID and a retrieval trace, you don't have a retrieval problem. You have a guessing problem.
A pragmatic maturity path
If you want a sane rollout sequence:
- Hybrid candidate retrieval + metadata filters
- Cross-encoder reranking
- Deterministic context assembly + stable citations
- Golden-set evals (retrieval + faithfulness)
- Observability dashboards + replay
- Dynamic routing by answer class + freshness
Most teams try to skip from (1) to "agentic RAG." That is how you get a demo that cannot be maintained.