Retrieval Strategy Playbook

A practical guide to retrieval quality, indexing choices, and grounding strategy.

By Ryan Setter

8/14/2025 · 8 min read

Retrieval is not a feature. It is the memory interface between your model and the world you actually operate in.

This playbook is for engineers and architects who need retrieval to behave like infrastructure: measurable, debuggable, and boring (in the best way).

The retrieval system boundary (model vs system)

If you remember one framing, make it this:

  • The model generates tokens.
  • The retrieval system decides what the model is allowed to know right now.

When retrieval fails, people blame the model. When retrieval succeeds, people credit the prompt. Both are convenient stories.

The canonical RAG pipeline (reference architecture)

This is the shape you will converge on even if you pretend you won't:

User
  -> Intent / answer-class router
    -> (No retrieval) -> LLM
    -> (Retrieval)
         -> Query builder (rewrite/decompose)
         -> Candidate fetch (sparse + dense + filters)
         -> Reranking (cheap -> expensive)
         -> Context assembly (dedupe/pack/cite)
         -> LLM generation (with constraints)
         -> Post-checks (grounding, policy, formatting)

Everything in that middle block is engineering. None of it is vibes.

Step 0: Define answer classes (the router you keep avoiding)

Not every question needs retrieval. If you retrieve by default, you will:

  • pay latency/cost for nothing,
  • inject irrelevant context (quality regression),
  • and eventually build a "RAG system" that is really a randomness amplifier.

Define answer classes and route explicitly:

  • Static reference: stable facts that don't change often. Retrieve? Usually yes. Source of truth: docs, manuals, KB. If you get it wrong: hallucinated details / outdated answers.
  • Dynamic policy: rules that change (pricing, access, on-call, incident status). Retrieve? Yes, with freshness. Source of truth: config, DBs, APIs. If you get it wrong: confidently wrong policy decisions.
  • Personalized state: user/team-specific state. Retrieve? Yes, with permissions. Source of truth: CRM, tickets, profiles. If you get it wrong: data leakage or generic answers.
  • Procedural reasoning: "think through it" tasks. Retrieve? Maybe. Source of truth: sometimes none. If you get it wrong: retrieval distracts reasoning.
  • Tool-required: needs computation / action. Retrieve? No (or minimal). Source of truth: tools. If you get it wrong: the model makes up results.

Dry truth: most "hallucinations" in production are "wrong answer class + ungrounded generation."

Implementation note: your router can be rules, a small classifier, or an LLM with a strict schema. The key is that it is versioned and measurable.
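A rules-first router can be a handful of regexes plus a versioned, loggable decision record. A minimal sketch; the patterns and class names are illustrative, not a production taxonomy:

```python
# Hypothetical rules-first router. Patterns and class names are
# illustrative assumptions, not a production taxonomy.
import re

RULES = [
    (re.compile(r"\b(price|pricing|on-call|incident|access)\b", re.I), "dynamic_policy"),
    (re.compile(r"\b(my|our)\b.*\b(ticket|account|team)\b", re.I), "personalized_state"),
    (re.compile(r"\b(calculate|convert|sum)\b", re.I), "tool_required"),
]

def route(query: str, router_version: str = "v1") -> dict:
    """Return a versioned routing decision you can log and measure."""
    for pattern, answer_class in RULES:
        if pattern.search(query):
            return {"answer_class": answer_class, "router_version": router_version}
    # Default: assume stable reference material and retrieve from the KB.
    return {"answer_class": "static_reference", "router_version": router_version}
```

Start with rules and graduate to a classifier only once you have measured the rules' miss rate on real traffic.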

Step 1: Define the corpus contract (what are we indexing, exactly?)

Before embeddings, decide what a "document" is in your system.

Minimum contract you want to lock down:

  • doc_id: stable identifier (never reuse)
  • source: system-of-record (wiki, repo, ticketing, etc.)
  • title
  • url (or internal handle)
  • owner / domain
  • created_at, updated_at, ingested_at
  • acl: permissions boundary (tenant/team/user)
  • content_hash: detect churn + avoid unnecessary re-embeds

If you cannot answer "what is the authoritative source for this chunk?" you are building a rumor database.
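The contract above maps naturally to a typed record. A sketch, with field names taken from the list and an assumed SHA-256 content hash:

```python
# Sketch of the corpus contract as a typed record. Field names follow
# the contract above; the hash helper is an illustrative choice.
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class DocRecord:
    doc_id: str            # stable identifier, never reused
    source: str            # system of record (wiki, repo, ticketing, ...)
    title: str
    url: str               # or internal handle
    owner: str
    created_at: str        # ISO-8601 timestamps
    updated_at: str
    ingested_at: str
    acl: tuple[str, ...]   # permissions boundary (tenant/team/user)
    content_hash: str      # detects churn, avoids unnecessary re-embeds

def hash_content(text: str) -> str:
    """Stable hash of the document body; re-embed only when it changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```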

Step 2: Chunking is an information architecture problem, not a token problem

Chunking determines what retrieval can possibly return. If chunking is wrong, no embedding model will save you.

Good chunking properties:

  • Semantic boundaries: split on headings/sections/paragraph intent.
  • Addressable: each chunk has a stable chunk_id and cites back to a source span.
  • Composable: chunks can be re-assembled into "just enough context" without dragging the entire internet.

Common strategies (pick intentionally):

  • Section-based: best for docs with headings. Cost: low. Failure mode: misses cross-section context.
  • Sliding window + overlap: best for unstructured text. Cost: medium. Failure mode: duplicate context + wasted tokens.
  • Multi-granularity (small + parent): best for precision + coherence. Cost: medium. Failure mode: more index complexity.
  • Structured extraction (FAQs, tables, policies): best for highly operational content. Cost: high. Failure mode: requires reliable parsing.

Practical recommendation:

  • Index two granularities: a smaller "answerable chunk" and a larger "parent section" pointer.
  • Retrieve small chunks for precision, then optionally attach parent context if the model needs coherence.
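A minimal sketch of the two-granularity approach, assuming markdown-style "#" headings mark section boundaries; the ID scheme is illustrative:

```python
# Two-granularity chunker sketch: large "parent section" chunks split on
# headings, small "answerable" chunks split on paragraphs. The "#" heading
# convention and ID scheme are illustrative assumptions.
def chunk_two_granularities(doc_id: str, text: str) -> tuple[list[dict], list[dict]]:
    parents, children = [], []
    section_lines, section_idx = [], 0

    def flush():
        nonlocal section_idx
        body = "\n".join(section_lines).strip()
        if not body:
            return
        parent_id = f"{doc_id}:sec{section_idx}"
        parents.append({"chunk_id": parent_id, "text": body})
        # Small chunks: one per paragraph, each pointing back at its parent.
        for i, para in enumerate(p for p in body.split("\n\n") if p.strip()):
            children.append({"chunk_id": f"{parent_id}:p{i}",
                             "parent_id": parent_id, "text": para.strip()})
        section_idx += 1

    for line in text.splitlines():
        if line.startswith("#") and section_lines:
            flush()
            section_lines = [line]
        else:
            section_lines.append(line)
    flush()
    return parents, children
```

At query time, retrieve over the children and attach the parent text only when the model needs the surrounding coherence.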

Step 3: Index for intent (dense, sparse, hybrid, and filters)

Retrieval is a ranking problem under constraints (latency, cost, governance). Your index choice is how you express that.

Sparse (BM25 / keyword)

  • Strong when queries contain rare terms, IDs, error codes, product names.
  • Predictable failure: synonyms and paraphrases.

Dense (embeddings)

  • Strong when queries are conceptual ("how does X relate to Y?").
  • Predictable failure: over-smoothing (everything looks kind of similar), weak on exact matches.

Hybrid wins in production because reality contains both:

  • human language ("why is latency spiking?")
  • and machine identifiers ("OOMKilled", "HTTP 429", "HNSW efSearch").

Implementation pattern:

  • Retrieve candidates from sparse and dense indexes.
  • Normalize scores per channel.
  • Merge with simple weighting.
  • Rerank on the merged candidate set.
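The merge step in that pattern fits in a few lines. A sketch; the min-max normalization and the 0.4/0.6 weights are illustrative starting points, not recommendations:

```python
# Hybrid merge sketch: min-max normalize each channel's scores, then
# combine with a simple weight. Weights are illustrative defaults.
def normalize(scores: dict[str, float]) -> dict[str, float]:
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid divide-by-zero when all scores are equal
    return {k: (v - lo) / span for k, v in scores.items()}

def hybrid_merge(sparse: dict[str, float], dense: dict[str, float],
                 w_sparse: float = 0.4, w_dense: float = 0.6) -> list[tuple[str, float]]:
    """Merge per-channel candidate scores; a missing channel contributes 0."""
    s, d = normalize(sparse), normalize(dense)
    ids = set(s) | set(d)
    merged = {i: w_sparse * s.get(i, 0.0) + w_dense * d.get(i, 0.0) for i in ids}
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
```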

Metadata filters (non-negotiable)

Filters are not "nice to have." They are how you prevent:

  • cross-tenant leakage,
  • wrong-environment answers,
  • and stale policy being treated as current.

At minimum, your retrieval call needs to support filters like:

  • tenant_id / team_id
  • environment (prod/stage)
  • doc_type (policy/runbook/spec)
  • updated_at ranges (freshness)
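A sketch of those filters as a pre-ranking predicate; field names follow the list above, and the freshness check assumes ISO-8601 timestamp strings (which compare chronologically):

```python
# Pre-ranking filter predicate sketch over chunk metadata. Field names
# mirror the list above; ISO-8601 strings sort chronologically.
def passes_filters(meta: dict, *, tenant_id: str, environment: str,
                   updated_after: str = "") -> bool:
    if meta.get("tenant_id") != tenant_id:
        return False  # hard boundary: never cross tenants
    if meta.get("environment") != environment:
        return False  # never answer prod questions from stage docs
    if updated_after and meta.get("updated_at", "") < updated_after:
        return False  # freshness floor for dynamic-policy classes
    return True
```

Apply these before ranking, not after: a filtered-out chunk must never compete for a slot in the context window.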

Step 4: Query building (rewrite, decompose, and don't get injected)

Your user query is not a search query. It is an intent signal.

Useful query transforms:

  • Rewrite: remove pleasantries, expand acronyms, canonicalize product names.
  • Decompose: split multi-part questions into subqueries.
  • Entity extraction: pull IDs, services, error codes, versions.

Guardrail:

  • Never let retrieved content change your query-building policy.

If your query rewrite can be instructed by the corpus, you have invented prompt injection with extra steps.
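The rewrite and entity-extraction transforms can be sketched as pure functions of the user query, never of retrieved content; the acronym map and ID patterns below are illustrative assumptions:

```python
# Query-building sketch: acronym expansion plus entity extraction.
# The acronym map and ID patterns are illustrative assumptions.
import re

ACRONYMS = {"kb": "knowledge base", "slo": "service level objective"}

ID_PATTERNS = [
    re.compile(r"\bHTTP \d{3}\b"),      # status codes like HTTP 429
    re.compile(r"\b[A-Z]{2,}-\d+\b"),   # ticket IDs like OPS-1234
]

def build_query(user_query: str) -> dict:
    """Deterministic transforms of the raw query; log what changed."""
    words = [ACRONYMS.get(w.lower(), w) for w in user_query.split()]
    rewritten = " ".join(words)
    entities = [m for p in ID_PATTERNS for m in p.findall(user_query)]
    return {"rewritten": rewritten, "entities": entities}
```

Extracted entities are a natural input to the sparse channel and to dynamic-k decisions.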

Step 5: Candidate retrieval (optimize for recall, then earn precision)

Think in two stages:

  • Stage A: high-recall candidate fetch
  • Stage B: precision via reranking

Operational dials that matter:

  • k_dense / k_sparse: start larger than you think, then measure.
  • Dynamic k: raise k when the query is ambiguous; lower it when the query contains rare identifiers.
  • Time boost: prefer newer content for dynamic policy classes.

If you only do one thing: stop tuning embedding models and start measuring candidate recall.
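A dynamic-k policy is only a few lines; the thresholds and the rare-identifier pattern here are illustrative:

```python
# Dynamic-k sketch: shrink k when the query carries rare identifiers
# (exact-match signals), grow it when the query is short and vague.
# Thresholds and the identifier pattern are illustrative assumptions.
import re

RARE_ID = re.compile(r"[A-Z]{2,}-\d+|0x[0-9a-fA-F]+|HTTP \d{3}")

def choose_k(query: str, base_k: int = 50) -> int:
    if RARE_ID.search(query):
        return max(base_k // 2, 10)  # precise query: precision is cheap
    if len(query.split()) <= 3:
        return base_k * 2            # short/ambiguous query: buy recall
    return base_k
```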

Step 6: Reranking (the cheapest quality win you will ever ship)

Reranking answers: "given these candidates, which ones actually answer the query?"

Typical ladder:

  • Lightweight heuristic: low cost. Use it to remove duplicates and down-rank boilerplate.
  • Cross-encoder: medium cost. The default for quality-critical retrieval.
  • LLM judge rerank: high cost. The last mile, when you can afford the latency/cost.

Practical notes:

  • Cache rerank results for repeated queries (or near-duplicates).
  • Rerank on chunk text + title + metadata, not just the raw chunk text.
  • If your reranker is an LLM, constrain it: JSON output, explicit rubric.
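The lightweight-heuristic rung of the ladder can be sketched as dedupe plus a boilerplate penalty; the near-duplicate key and the boilerplate markers are illustrative assumptions:

```python
# Lightweight heuristic reranker sketch: drop near-duplicates, down-rank
# boilerplate. The dedupe key and marker strings are illustrative.
def heuristic_rerank(candidates: list[dict]) -> list[dict]:
    """candidates: [{'chunk_id', 'text', 'score'}, ...]."""
    seen, kept = set(), []
    for c in sorted(candidates, key=lambda c: c["score"], reverse=True):
        # Cheap near-duplicate key: lowercase, whitespace-collapsed prefix.
        key = " ".join(c["text"].lower().split())[:200]
        if key in seen:
            continue
        seen.add(key)
        score = c["score"]
        if any(m in c["text"].lower()
               for m in ("all rights reserved", "table of contents")):
            score *= 0.5  # boilerplate penalty
        kept.append({**c, "score": score})
    return sorted(kept, key=lambda c: c["score"], reverse=True)
```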

Step 7: Context assembly (packing is a budgeted optimization problem)

The output of retrieval is not "context." It is a set of candidates that must be packed into a constrained window.

Context assembly should be deterministic:

  • Deduplicate near-identical chunks.
  • Prefer primary sources over summaries.
  • Keep citations stable (chunk IDs don't change).
  • Quote the exact lines that support key claims when possible.

If you do citations, make them enforceable:

  • Require the model to cite a chunk for any non-trivial claim.
  • Reject or down-rank outputs with missing/irrelevant citations.
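Deterministic assembly can be sketched as a greedy loop under a token budget; the 4-characters-per-token estimate stands in for a real tokenizer and is an assumption, not a rule:

```python
# Deterministic greedy packer sketch: highest-scoring chunks first, under
# a token budget, deduped, with stable chunk IDs preserved for citation.
# The 4-chars-per-token estimate stands in for a real tokenizer.
def pack_context(chunks: list[dict], token_budget: int) -> list[dict]:
    """chunks: [{'chunk_id', 'text', 'score'}, ...]; returns packed subset."""
    packed, used, seen_texts = [], 0, set()
    # Tie-break on chunk_id so identical scores always pack the same way.
    for c in sorted(chunks, key=lambda c: (-c["score"], c["chunk_id"])):
        norm = " ".join(c["text"].split()).lower()
        if norm in seen_texts:
            continue  # dedupe near-identical chunks
        cost = max(len(c["text"]) // 4, 1)  # crude token estimate
        if used + cost > token_budget:
            continue
        seen_texts.add(norm)
        packed.append(c)
        used += cost
    return packed
```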

Step 8: Grounding and evaluation (measure the pipeline, not the vibes)

You need two eval layers:

Retrieval layer metrics

  • recall@k: did we retrieve relevant chunks at all?
  • MRR: how high did the first relevant chunk appear?
  • nDCG: are we ranking relevance well across the list?
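These three metrics are short enough to implement directly (binary relevance assumed; `relevant` is the golden set of chunk IDs for a query, `ranked` is what you returned):

```python
# Retrieval-layer metrics with binary relevance. `ranked` is the returned
# chunk-ID list in rank order; `relevant` is the golden set for the query.
import math

def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def mrr(ranked: list[str], relevant: set[str]) -> float:
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i  # reciprocal rank of the first relevant hit
    return 0.0

def ndcg(ranked: list[str], relevant: set[str], k: int) -> float:
    dcg = sum(1.0 / math.log2(i + 1)
              for i, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```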

Answer layer metrics

  • Faithfulness: are claims supported by the provided context?
  • Attribution quality: do citations map to the claims they appear next to?
  • Refusal correctness: did we abstain when we should?

Build a small golden set (50-200 questions) that matches real traffic. Update it monthly. Treat it like tests.

Step 9: Observability (how to debug retrieval in one screen)

Instrument retrieval as a first-class system.

Minimum events per request:

  • router decision (answer class)
  • query transforms (what changed)
  • candidate fetch stats (per channel)
  • rerank stats (latency, top reasons if available)
  • context pack stats (tokens, dedupe rate)
  • generation outcome (accepted/edited/refused)

Minimum dimensions:

  • request ID, tenant/team
  • model + prompt version
  • index version + embedding model version
  • top retrieved doc_id / chunk_id list
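One structured event per request is enough to start. A sketch of the shape as a JSON line; this is an illustrative record, not a logging-library API:

```python
# Retrieval trace event sketch: one JSON line per request, carrying the
# minimum events and dimensions listed above. The field layout is an
# illustrative assumption.
import json
import time
import uuid

def retrieval_event(router: dict, fetch: dict, rerank: dict,
                    pack: dict, versions: dict, outcome: str) -> str:
    """Serialize one retrieval trace event for the log pipeline."""
    return json.dumps({
        "request_id": str(uuid.uuid4()),
        "ts": time.time(),
        "router": router,      # e.g. {"answer_class": "...", "tenant": "..."}
        "fetch": fetch,        # e.g. per-channel candidate counts
        "rerank": rerank,      # e.g. {"latency_ms": 12.0}
        "pack": pack,          # e.g. {"tokens": 1800, "top_chunk_ids": [...]}
        "versions": versions,  # model / prompt / index / embedding versions
        "outcome": outcome,    # accepted / edited / refused
    })
```

With this in place, "give me the request ID" becomes a complete debugging workflow.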

Related: AI Observability Basics

Step 10: Failure modes and fixes (the playbook part)

When retrieval is "bad," classify the failure before you tweak knobs.

  • Great answers in dev, terrible in prod. Likely cause: permissions/filters, index drift. Fix: verify filters + index versions per environment.
  • Correct docs exist but are never retrieved. Likely cause: chunking too coarse/fine, missing sparse channel. Fix: re-chunk; add a sparse channel; raise k.
  • Retrieved chunks are relevant, but the answer is wrong. Likely cause: context assembly, instruction conflict. Fix: pack fewer, higher-signal chunks; tighten constraints.
  • Citations exist but don't support claims. Likely cause: missing post-processing. Fix: add faithfulness checks + rejection.
  • Answers are stale. Likely cause: ingestion lag, missing freshness signal. Fix: index faster; add a time boost; route dynamic-policy queries to APIs.
  • Everything looks vaguely relevant. Likely cause: embedding collapse, no reranker. Fix: add a reranker; add sparse; add filters.

If you can't reproduce the failure with a request ID and a retrieval trace, you don't have a retrieval problem. You have a guessing problem.

A pragmatic maturity path

If you want a sane rollout sequence:

  1. Hybrid candidate retrieval + metadata filters
  2. Cross-encoder reranking
  3. Deterministic context assembly + stable citations
  4. Golden-set evals (retrieval + faithfulness)
  5. Observability dashboards + replay
  6. Dynamic routing by answer class + freshness

Most teams try to skip from (1) to "agentic RAG." That is how you get a demo that cannot be maintained.