Retrieval Strategy Playbook
A practical guide to retrieval quality, indexing choices, and grounding strategy.
By Ryan Setter
Retrieval is not a feature. It is the memory interface between your model and the world you actually operate in.
This playbook is for engineers and architects who need retrieval to behave like infrastructure: measurable, debuggable, and boring (in the best way).
The retrieval system boundary (model vs system)
If you remember one framing, make it this:
- The model generates tokens.
- The retrieval system decides what the model is allowed to know right now.
When retrieval fails, people blame the model. When retrieval succeeds, people credit the prompt. Both are convenient stories.
The canonical RAG pipeline (reference architecture)
This is the shape you will converge on even if you pretend you won't:
User
-> Intent / answer-class router
-> (No retrieval) -> LLM
-> (Retrieval)
-> Query builder (rewrite/decompose)
-> Candidate fetch (sparse + dense + filters)
-> Reranking (cheap -> expensive)
-> Context assembly (dedupe/pack/cite)
-> LLM generation (with constraints)
-> Post-checks (grounding, policy, formatting)
Everything in that middle block is engineering. None of it is vibes.
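The shape above can be sketched as composed stages. Everything here is a stub under assumed names (`route`, `fetch_candidates`, etc. are illustrative, not a real API); the point is the wiring, not the implementations:

```python
# Minimal sketch of the reference pipeline as composable stages.
# Every stage body is a stub/assumption standing in for real components.

NO_RETRIEVAL = ("hello", "thanks")  # stub allowlist for the no-retrieval path

def route(query: str) -> str:
    """Stub answer-class router: direct generation for chit-chat, else retrieval."""
    return "direct" if query.lower().strip("!. ") in NO_RETRIEVAL else "retrieval"

def build_queries(query: str) -> list[str]:
    return [query.strip()]                       # rewrite/decompose would go here

def fetch_candidates(queries: list[str]) -> list[dict]:
    corpus = [{"chunk_id": "c1", "text": "retrieval is ranking"}]
    return corpus                                # sparse + dense + filters in reality

def rerank(query: str, cands: list[dict]) -> list[dict]:
    return cands                                 # cheap -> expensive ladder

def assemble(cands: list[dict], budget: int = 2000) -> str:
    return "\n".join(c["text"] for c in cands)[:budget]

def answer(query: str) -> str:
    if route(query) == "direct":
        return f"LLM({query})"                   # no-retrieval path
    context = assemble(rerank(query, fetch_candidates(build_queries(query))))
    return f"LLM({query} | context: {context})"
```

Each stage is independently replaceable and testable, which is what makes the middle block engineering rather than vibes.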
Step 0: Define answer classes (the router you keep avoiding)
Not every question needs retrieval. If you retrieve by default, you will:
- pay latency/cost for nothing,
- inject irrelevant context (quality regression),
- and eventually build a "RAG system" that is really a randomness amplifier.
Define answer classes and route explicitly:
| Answer class | What it is | Retrieval? | Typical source of truth | Failure mode if you get it wrong |
|---|---|---|---|---|
| Static reference | Stable facts that don't change often | Usually yes | Docs, manuals, KB | Hallucinated details / outdated answers |
| Dynamic policy | Rules that change (pricing, access, on-call, incident status) | Yes, with freshness | Config, DB, APIs | Confidently wrong policy decisions |
| Personalized state | User/team-specific state | Yes, with permissions | CRM, tickets, profiles | Data leakage or generic answers |
| Procedural reasoning | "Think through it" tasks | Maybe | Sometimes none | Retrieval distracts reasoning |
| Tool-required | Needs computation / action | No (or minimal) | Tools | Model makes up results |
Dry truth: most "hallucinations" in production are "wrong answer class + ungrounded generation."
Implementation note: your router can be rules, a small classifier, or an LLM with a strict schema. The key is that it is versioned and measurable.
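A rules-first router can be a few regexes with a version stamp. The patterns and class names below are illustrative assumptions, not a production taxonomy:

```python
import re

# Hypothetical rules-first router over the answer classes above.
# Patterns are examples only; a real router earns its rules from traffic.
RULES = [
    (re.compile(r"\b(price|pricing|on-call|incident)\b", re.I), "dynamic_policy"),
    (re.compile(r"\b(my|our)\b.*\b(ticket|account|team)\b", re.I), "personalized_state"),
    (re.compile(r"\b(calculate|sum|convert)\b", re.I), "tool_required"),
]

ROUTER_VERSION = "router-v3"  # version every decision so it is attributable

def route(query: str) -> dict:
    for pattern, answer_class in RULES:
        if pattern.search(query):
            return {"class": answer_class, "version": ROUTER_VERSION}
    return {"class": "static_reference", "version": ROUTER_VERSION}
```

Because the version travels with every decision, you can diff routing behavior across releases instead of guessing.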
Step 1: Define the corpus contract (what are we indexing, exactly?)
Before embeddings, decide what a "document" is in your system.
Minimum contract you want to lock down:
- `doc_id`: stable identifier (never reuse)
- `source`: system-of-record (wiki, repo, ticketing, etc.)
- `title`, `url` (or internal handle)
- `owner` / `domain`
- `created_at`, `updated_at`, `ingested_at`
- `acl`: permissions boundary (tenant/team/user)
- `content_hash`: detect churn + avoid unnecessary re-embeds
If you cannot answer "what is the authoritative source for this chunk?" you are building a rumor database.
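As a sketch, the contract fits in a frozen record; the field names follow the list above, and the defaults are assumptions:

```python
from dataclasses import dataclass, field
import hashlib
import time

# Sketch of the minimum corpus contract as an immutable record.
@dataclass(frozen=True)
class Document:
    doc_id: str                 # stable identifier, never reused
    source: str                 # system-of-record (wiki, repo, ticketing, ...)
    title: str
    url: str                    # or internal handle
    owner: str
    acl: tuple[str, ...]        # permissions boundary (tenant/team/user)
    content: str
    created_at: float
    updated_at: float
    ingested_at: float = field(default_factory=time.time)

    @property
    def content_hash(self) -> str:
        # detect churn and skip unnecessary re-embeds
        return hashlib.sha256(self.content.encode()).hexdigest()

doc = Document("doc-1", "wiki", "Runbook", "wiki://runbook", "sre-team",
               ("tenant-a",), "restart the pods", 0.0, 0.0)
```

The ingestion pipeline compares `content_hash` against the last indexed version and only re-embeds on change.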
Step 2: Chunking is an information architecture problem, not a token problem
Chunking determines what retrieval can possibly return. If chunking is wrong, no embedding model will save you.
Good chunking properties:
- Semantic boundaries: split on headings/sections/paragraph intent.
- Addressable: each chunk has a stable `chunk_id` and cites back to a source span.
- Composable: chunks can be re-assembled into "just enough context" without dragging the entire internet.
Common strategies (pick intentionally):
| Strategy | Best for | Cost | Failure mode |
|---|---|---|---|
| Section-based | Docs with headings | Low | Misses cross-section context |
| Sliding window + overlap | Unstructured text | Medium | Duplicate context + wasted tokens |
| Multi-granularity (small + parent) | Precision + coherence | Medium | More index complexity |
| Structured extraction (FAQs, tables, policies) | Highly operational content | High | Requires reliable parsing |
Practical recommendation:
- Index two granularities: a smaller "answerable chunk" and a larger "parent section" pointer.
- Retrieve small chunks for precision, then optionally attach parent context if the model needs coherence.
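The two-granularity pattern can be sketched like this; `parent_id` linkage is the key idea, and the function name and record shape are assumptions:

```python
# Sketch of multi-granularity chunking: small "answerable" chunks that
# point back to their parent section via parent_id.

def chunk_section(doc_id: str, heading: str, paragraphs: list[str]) -> list[dict]:
    parent_id = f"{doc_id}#{heading}"
    parent = {"chunk_id": parent_id, "granularity": "parent",
              "text": "\n\n".join(paragraphs)}
    children = [
        {"chunk_id": f"{parent_id}/{i}", "granularity": "small",
         "parent_id": parent_id, "text": p}
        for i, p in enumerate(paragraphs)
    ]
    return [parent] + children

chunks = chunk_section("doc-1", "restarts",
                       ["Drain the node first.", "Then restart the pod."])
# retrieve small chunks for precision; follow parent_id when coherence is needed
```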
Step 3: Index for intent (dense, sparse, hybrid, and filters)
Retrieval is a ranking problem under constraints (latency, cost, governance). Your index choice is how you express that.
Sparse (BM25 / keyword)
- Strong when queries contain rare terms, IDs, error codes, product names.
- Predictable failure: synonyms and paraphrases.
Dense (embeddings)
- Strong when queries are conceptual ("how does X relate to Y?").
- Predictable failure: over-smoothing (everything looks kind of similar), weak on exact matches.
Hybrid (recommended default)
Hybrid wins in production because reality contains both:
- human language ("why is latency spiking?")
- and machine identifiers ("OOMKilled", "HTTP 429", "HNSW efSearch").
Implementation pattern:
- Retrieve candidates from sparse and dense indexes.
- Normalize scores per channel.
- Merge with simple weighting.
- Rerank on the merged candidate set.
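The merge step above can be sketched in a few lines. Min-max normalization and a single `dense_weight` are one reasonable baseline among several (reciprocal rank fusion is a common alternative); candidate sets here are plain `{chunk_id: score}` dicts:

```python
# Sketch of hybrid merge: normalize each channel's scores to [0, 1],
# then combine with a simple weight and sort.

def normalize(scores: dict[str, float]) -> dict[str, float]:
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0                      # avoid divide-by-zero
    return {cid: (s - lo) / span for cid, s in scores.items()}

def hybrid_merge(sparse: dict[str, float], dense: dict[str, float],
                 dense_weight: float = 0.5) -> list[tuple[str, float]]:
    ns, nd = normalize(sparse), normalize(dense)
    merged = {cid: (1 - dense_weight) * ns.get(cid, 0.0)
                   + dense_weight * nd.get(cid, 0.0)
              for cid in ns.keys() | nd.keys()}
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

ranked = hybrid_merge({"c1": 12.0, "c2": 4.0}, {"c2": 0.91, "c3": 0.88})
```

Per-channel normalization matters because BM25 scores and cosine similarities live on different scales; merging raw scores silently lets one channel dominate.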
Metadata filters (non-negotiable)
Filters are not "nice to have." They are how you prevent:
- cross-tenant leakage,
- wrong-environment answers,
- and stale policy being treated as current.
At minimum, your retrieval call needs to support filters like:
- `tenant_id` / `team_id`
- `environment` (prod/stage)
- `doc_type` (policy/runbook/spec)
- `updated_at` ranges (freshness)
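A minimal sketch of filter enforcement, assuming chunk metadata carries the keys above. In production the vector store pushes these filters down into the index; the important property is that the wrong tenant's chunks are never candidates at all:

```python
# Sketch of pre-ranking filter enforcement over chunk metadata.

def passes_filters(meta: dict, *, tenant_id: str, environment: str,
                   updated_after: float = 0.0) -> bool:
    return (meta.get("tenant_id") == tenant_id
            and meta.get("environment") == environment
            and meta.get("updated_at", 0.0) >= updated_after)

chunks = [
    {"chunk_id": "c1", "tenant_id": "a", "environment": "prod", "updated_at": 200.0},
    {"chunk_id": "c2", "tenant_id": "b", "environment": "prod", "updated_at": 300.0},
]
visible = [c for c in chunks if passes_filters(c, tenant_id="a", environment="prod")]
# only c1 survives: tenant b's chunk is excluded before ranking ever sees it
```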
Step 4: Query building (rewrite, decompose, and don't get injected)
Your user query is not a search query. It is an intent signal.
Useful query transforms:
- Rewrite: remove pleasantries, expand acronyms, canonicalize product names.
- Decompose: split multi-part questions into subqueries.
- Entity extraction: pull IDs, services, error codes, versions.
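The three transforms can be sketched as small pure functions; the acronym map and the identifier pattern are illustrative assumptions:

```python
import re

# Illustrative query transforms; ACRONYMS and the regexes are assumptions.
ACRONYMS = {"k8s": "kubernetes", "oom": "out of memory"}

def rewrite(query: str) -> str:
    """Strip pleasantries, expand known acronyms."""
    q = re.sub(r"^(hi|hey|please)[,!\s]+", "", query.strip(), flags=re.I)
    return " ".join(ACRONYMS.get(w.lower(), w) for w in q.split())

def extract_entities(query: str) -> list[str]:
    """Pull identifier-like tokens, e.g. HTTP status codes or CamelCase names."""
    return re.findall(r"\b(?:HTTP \d{3}|[A-Z]{2,}[A-Za-z]*[a-z]\w*)\b", query)

def decompose(query: str) -> list[str]:
    """Naively split multi-part questions into subqueries."""
    return [part.strip() for part in re.split(r"\band\b", query) if part.strip()]
```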
Guardrail:
- Never let retrieved content change your query-building policy.
If your query rewrite can be instructed by the corpus, you have invented prompt injection with extra steps.
Step 5: Candidate retrieval (optimize for recall, then earn precision)
Think in two stages:
- Stage A: high-recall candidate fetch
- Stage B: precision via reranking
Operational dials that matter:
- `k_dense` / `k_sparse`: start larger than you think, then measure.
- Dynamic k: raise k when the query is ambiguous; lower it when the query contains rare identifiers.
- Time boost: prefer newer content for dynamic policy classes.
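The dynamic-k dial can be sketched as a heuristic; the thresholds and identifier pattern here are assumptions to tune against your own traffic:

```python
import re

# Heuristic sketch of dynamic k: widen the candidate fetch for ambiguous
# natural-language queries, narrow it when rare identifiers pin the target.
ID_PATTERN = re.compile(r"\b(?:[A-Z]{2,}\w*|\w+-\d+|HTTP \d{3})\b")

def choose_k(query: str, k_base: int = 50) -> int:
    has_identifier = bool(ID_PATTERN.search(query))
    is_short = len(query.split()) <= 3          # short queries are often ambiguous
    if has_identifier:
        return max(10, k_base // 5)             # identifiers are precise: lower k
    if is_short:
        return k_base * 2                       # ambiguous: raise k, rely on rerank
    return k_base
```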
If you only do one thing: stop tuning embedding models and start measuring candidate recall.
Step 6: Reranking (the cheapest quality win you will ever ship)
Reranking answers: "given these candidates, which ones actually answer the query?"
Typical ladder:
| Reranker | Cost | When to use |
|---|---|---|
| Lightweight heuristic | Low | Remove duplicates, down-rank boilerplate |
| Cross-encoder | Medium | Default for quality-critical retrieval |
| LLM judge rerank | High | Last mile when you can afford latency/cost |
Practical notes:
- Cache rerank results for repeated queries (or near-duplicates).
- Rerank on chunk text + title + metadata, not just raw chunk.
- If your reranker is an LLM, constrain it: JSON output, explicit rubric.
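The bottom rung of the ladder plus the caching note can be sketched together. The overlap scorer and boilerplate list are assumptions; a cross-encoder would replace `score()` on the medium rung while the caching pattern stays the same:

```python
from functools import lru_cache

# Sketch of a lightweight heuristic reranker with caching.
BOILERPLATE = ("table of contents", "copyright", "all rights reserved")

def score(query: str, title: str, text: str) -> float:
    """Token overlap over title + text, with boilerplate down-ranked."""
    q = set(query.lower().split())
    doc = set((title + " " + text).lower().split())
    overlap = len(q & doc) / (len(q) or 1)
    penalty = 0.5 if any(b in text.lower() for b in BOILERPLATE) else 0.0
    return overlap - penalty

@lru_cache(maxsize=4096)            # cache repeated (query, candidates) pairs
def rerank(query: str, candidates: tuple[tuple[str, str, str], ...]) -> tuple[str, ...]:
    ranked = sorted(candidates, key=lambda c: score(query, c[1], c[2]), reverse=True)
    return tuple(chunk_id for chunk_id, _, _ in ranked)

cands = (("c1", "Restart runbook", "drain the node then restart the pod"),
         ("c2", "Legal", "copyright all rights reserved"))
order = rerank("how do I restart the pod", cands)
```

Candidates are passed as tuples rather than lists so `lru_cache` can hash them; that is the whole caching trick.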
Step 7: Context assembly (packing is a budgeted optimization problem)
The output of retrieval is not "context." It is a set of candidates that must be packed into a constrained window.
Context assembly should be deterministic:
- Deduplicate near-identical chunks.
- Prefer primary sources over summaries.
- Keep citations stable (chunk IDs don't change).
- Quote the exact lines that support key claims when possible.
If you do citations, make them enforceable:
- Require the model to cite a chunk for any non-trivial claim.
- Reject or down-rank outputs with missing/irrelevant citations.
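Deterministic packing can be sketched as dedupe-then-greedy-pack under a budget. Token counting is approximated by whitespace split here (a real system would use the model's tokenizer), and the record shape is an assumption:

```python
import hashlib

# Sketch of deterministic context assembly: dedupe near-identical chunks
# by normalized hash, then greedily pack by rank under a token budget.

def pack_context(ranked_chunks: list[dict], budget_tokens: int = 1500) -> list[dict]:
    seen_hashes: set[str] = set()
    packed, used = [], 0
    for chunk in ranked_chunks:
        norm = " ".join(chunk["text"].lower().split())
        h = hashlib.sha256(norm.encode()).hexdigest()
        if h in seen_hashes:
            continue                              # near-duplicate, skip
        cost = len(chunk["text"].split())         # tokenizer proxy
        if used + cost > budget_tokens:
            break                                 # budget exhausted
        seen_hashes.add(h)
        used += cost
        packed.append({"chunk_id": chunk["chunk_id"], "text": chunk["text"]})
    return packed                                 # cite via stable chunk_id

ctx = pack_context([
    {"chunk_id": "c1", "text": "Drain the node first."},
    {"chunk_id": "c2", "text": "drain  the node first."},   # near-duplicate of c1
    {"chunk_id": "c3", "text": "Then restart the pod."},
])
```

Because the same ranked input always yields the same packed context, retrieval traces become reproducible, which is what makes citation checks enforceable.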
Step 8: Grounding and evaluation (measure the pipeline, not the vibes)
You need two eval layers:
Retrieval layer metrics
- `recall@k`: did we retrieve relevant chunks at all?
- MRR: how high did the first relevant chunk appear?
- nDCG: are we ranking relevance well across the list?
Answer layer metrics
- Faithfulness: are claims supported by the provided context?
- Attribution quality: do citations map to the claims they appear next to?
- Refusal correctness: did we abstain when we should?
Build a small golden set (50-200 questions) that matches real traffic. Update it monthly. Treat it like tests.
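The retrieval-layer metrics are a few lines each; this sketch assumes each golden-set example maps a query to the set of relevant `chunk_id`s and the ranked list you actually retrieved:

```python
# Sketch of recall@k and MRR over a golden set.

def recall_at_k(relevant: set[str], retrieved: list[str], k: int) -> float:
    """Fraction of relevant chunks found in the top k."""
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved[:k])) / len(relevant)

def mrr(relevant: set[str], retrieved: list[str]) -> float:
    """Reciprocal rank of the first relevant chunk, 0.0 if none retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

golden = [
    {"relevant": {"c3"}, "retrieved": ["c1", "c3", "c9"]},
    {"relevant": {"c7"}, "retrieved": ["c7", "c2", "c4"]},
]
avg_recall = sum(recall_at_k(g["relevant"], g["retrieved"], 3) for g in golden) / len(golden)
avg_mrr = sum(mrr(g["relevant"], g["retrieved"]) for g in golden) / len(golden)
```

Run these on every index or chunking change, exactly like a test suite.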
Step 9: Observability (how to debug retrieval in one screen)
Instrument retrieval as a first-class system.
Minimum events per request:
- router decision (answer class)
- query transforms (what changed)
- candidate fetch stats (per channel)
- rerank stats (latency, top reasons if available)
- context pack stats (tokens, dedupe rate)
- generation outcome (accepted/edited/refused)
Minimum dimensions:
- request ID, tenant/team
- model + prompt version
- index version + embedding model version
- top retrieved `doc_id` / `chunk_id` list
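The events and dimensions above fit in one structured log line per request. The field names and version strings below are illustrative assumptions; `retrieval_event` would feed whatever logging pipeline you already run:

```python
import json
import time

# Sketch of a per-request retrieval trace event carrying the minimum
# events and dimensions listed above.

def retrieval_event(request_id: str, tenant: str, **stages) -> str:
    event = {
        "ts": time.time(),
        "request_id": request_id,
        "tenant": tenant,
        "prompt_version": "p-12",          # assumed versioning scheme
        "index_version": "idx-2024-06",    # assumed
        "embedding_model": "embed-v2",     # assumed
        **stages,                          # router, candidates, rerank, pack, outcome
    }
    return json.dumps(event, sort_keys=True)

line = retrieval_event(
    "req-123", "tenant-a",
    router={"class": "static_reference"},
    candidates={"sparse": 40, "dense": 60},
    top_chunks=["doc-1/intro/0", "doc-7/faq/3"],
)
```

One line per request is enough to replay a failure from a request ID alone.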
Step 10: Failure modes and fixes (the playbook part)
When retrieval is "bad," classify the failure before you tweak knobs.
| Symptom | Likely cause | Fix |
|---|---|---|
| Great answers in dev, terrible in prod | permissions/filters, index drift | verify filters + index versions per env |
| Correct docs exist, never retrieved | chunking too coarse/fine, sparse missing | re-chunk; add sparse channel; raise k |
| Retrieved chunks are relevant, answer is wrong | context assembly, instruction conflict | pack fewer, higher-signal chunks; tighten constraints |
| Citations exist but don't support claims | post-processing missing | add faithfulness checks + rejection |
| Answers are stale | ingestion lag, freshness signal missing | index faster; add time boost; route dynamic policy to APIs |
| Everything looks vaguely relevant | embedding collapse, no rerank | add reranker; add sparse; add filters |
If you can't reproduce the failure with a request ID and a retrieval trace, you don't have a retrieval problem. You have a guessing problem.
A pragmatic maturity path
If you want a sane rollout sequence:
- Hybrid candidate retrieval + metadata filters
- Cross-encoder reranking
- Deterministic context assembly + stable citations
- Golden-set evals (retrieval + faithfulness)
- Observability dashboards + replay
- Dynamic routing by answer class + freshness
Most teams try to skip from (1) to "agentic RAG." That is how you get a demo that cannot be maintained.