AI Observability Basics

What to instrument first when your product starts depending on language models.

By Ryan Setter

9/21/2025 · 6 min read

AI systems are probabilistic. Production systems are not.

Observability is the bridge between those two facts.

This guide focuses on what to instrument first so you can answer the only questions that matter after launch:

  • What happened?
  • Why did it happen?
  • How often does it happen?
  • What does it cost?
  • Can we fix it without guessing?

The mistake: treating LLM calls like HTTP calls

Traditional services are mostly deterministic. You can often debug with:

  • request/response,
  • status codes,
  • a stack trace.

LLM-driven systems need more, because the "response" is the end of a pipeline:

  • routing
  • retrieval
  • tool calls
  • context packing
  • model inference
  • validation / policy enforcement

If you only log the final text, you have built a system that fails silently and looks confident while doing it.

What "good" looks like: the one-screen rule

For any request ID, an operator should be able to answer in one screen:

  • what the user asked (redacted as needed)
  • how the system classified/routed it
  • what context was retrieved/assembled
  • what tools were called (args, results, latency)
  • what model and prompt version ran
  • what policy/validation gates were applied
  • what the system returned (and why)

If your debugging workflow begins with "paste the conversation into another model," you do not have observability. You have superstition with better UX.

Reference architecture: the observability surface

Most AI products converge on the same stages. Instrument them explicitly.

User
  -> API boundary (authn/authz, rate limits)
  -> Router (answer class, tool intent)
  -> Retrieval (optional): query build -> fetch -> rerank -> pack
  -> Tool execution (optional): call -> result -> retries/timeouts
  -> Model call (prompt version + model)
  -> Post-processing: schema validation, grounding checks, redaction
  -> Output: UI formatting, citations, follow-up actions

Observability is not a feature bolted on at the end. It is cross-cutting plumbing.
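One way to make that plumbing concrete is a small context manager that times each stage and appends it to the request's trace. This is a minimal stdlib sketch; `stage` and the `trace` list are illustrative names, not a real instrumentation library:

```python
import time
from contextlib import contextmanager

@contextmanager
def stage(name: str, trace: list):
    """Time one pipeline stage and record it on the request's trace.

    Wrapping every stage this way makes instrumentation cross-cutting
    instead of bolted on after the fact.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = round((time.perf_counter() - start) * 1000, 2)
        trace.append({"stage": name, "latency_ms": elapsed_ms})

# One request walks the stages in order, timing each:
trace: list = []
with stage("retrieval", trace):
    pass  # query build -> fetch -> rerank -> pack would run here
with stage("model_call", trace):
    pass  # model inference would run here
```

The same wrapper applies to routing, tool execution, and post-processing, so every stage shows up in the trace with its own latency.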

Step 1: Define events (traces) before dashboards (charts)

Dashboards are aggregates. Aggregates are useless if the raw events are missing.

Instrument a small set of first-class events. Start here:

Core events (minimum viable trace)

  • request_received
  • route_selected
  • context_assembled
  • retrieval_performed (if applicable)
  • tool_called (0..N)
  • model_called
  • output_validated
  • policy_applied
  • response_sent

Why these events

  • They map to controllable architecture stages.
  • They are replayable.
  • They let you separate model issues from integration issues.
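A minimal sketch of emitting these events as structured JSON lines. The `log_event` helper and its field names are illustrative assumptions, not a specific logging library; a real system would ship these records to a log pipeline rather than stdout:

```python
import json
import time
import uuid

def log_event(event: str, request_id: str, **fields) -> dict:
    """Emit one structured trace event as a JSON line.

    Every event carries the request_id so the full trace for a request
    can be reassembled later.
    """
    record = {
        "event": event,
        "request_id": request_id,
        "ts_ms": int(time.time() * 1000),
        **fields,
    }
    print(json.dumps(record))
    return record

# One request produces an ordered series of first-class events:
rid = str(uuid.uuid4())
log_event("request_received", rid, route="/chat")
log_event("route_selected", rid, answer_class="product_question")
log_event("model_called", rid, model="example-model", prompt_version="v3")
log_event("response_sent", rid, latency_ms=412)
```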

Step 2: Standard dimensions (the keys that make analysis possible)

Every event needs consistent dimensions. If you cannot group by it, you cannot learn from it.

Required dimensions (start strict, relax later):

  • request_id (global correlation)
  • user_id (or anonymous stable id)
  • tenant_id / team_id (if multi-tenant)
  • environment (prod/stage)
  • feature_flag / experiment_id
  • route (API endpoint / product surface)
  • answer_class (router output)
  • model + model_version
  • prompt_version
  • retrieval_index_version (if retrieval)
  • tool_schema_version

Also capture:

  • latency_ms per stage
  • tokens_in, tokens_out
  • cost_usd (even if estimated)

If you do not log versions, you cannot answer "what changed" after an incident.
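One way to enforce consistent dimensions is a single frozen dataclass attached to every event, so nothing can be emitted without them. A sketch with hypothetical field values:

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass(frozen=True)
class TraceDimensions:
    """Dimensions attached to every event so traces can be grouped,
    filtered, and diffed after an incident."""
    request_id: str
    environment: str
    route: str
    model: str
    model_version: str
    prompt_version: str
    user_id: Optional[str] = None
    tenant_id: Optional[str] = None
    answer_class: Optional[str] = None
    experiment_id: Optional[str] = None

dims = TraceDimensions(
    request_id="req-123",
    environment="prod",
    route="/chat",
    model="example-model",
    model_version="2025-06-01",
    prompt_version="support_v7",
)
# asdict(dims) merges cleanly into any event record before it ships.
```

Making the required fields positional means a missing dimension fails at construction time, not at analysis time.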

Step 3: Treat prompt + tool + retrieval as versioned code

You should be able to draw a line from a production output back to the exact configuration that produced it.

Minimum versioning set:

  • model name + pinned version (or provider snapshot)
  • prompt template id + version
  • router version
  • tool schemas version
  • retrieval index version (plus embedding model)

This is not paperwork. It is how you avoid "we shipped a prompt tweak" becoming "we have no idea why conversion dropped."
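A lightweight way to make that line drawable is to hash the whole versioned bundle into a single fingerprint logged on every request. The config keys and the `config_fingerprint` helper below are illustrative, not a standard:

```python
import hashlib
import json

def config_fingerprint(config: dict) -> str:
    """Hash the full generation config into a short, stable fingerprint.

    Canonical JSON (sorted keys) makes the hash independent of key order,
    so the same configuration always yields the same fingerprint.
    """
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

config = {
    "model": "example-model@2025-06-01",
    "prompt_template": "support_answer@v7",
    "router": "intent_router@v3",
    "tool_schemas": "tools@v12",
    "retrieval_index": "docs@2025-06-10",
    "embedding_model": "example-embed@v2",
}
fingerprint = config_fingerprint(config)
# Log `fingerprint` on every event: if it changed, something shipped.
```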

Step 4: Log the right payloads (and not the wrong ones)

There are two competing truths:

  • Payload logs make debugging possible.
  • Payload logs can create privacy, security, and compliance liabilities.

Practical compromise

  • Always log metadata + metrics.
  • Log structured summaries of payloads.
  • Sample full payloads only when needed, under strict retention.

What to log for each stage:

  • router input summary + output label
  • retrieval query (sanitized) + top chunk IDs + source URLs
  • tool calls: tool name, args (redacted), status, latency
  • model call: model, prompt version, token counts, stop reason
  • output validation: schema pass/fail, refusal reason, grounding score (if present)

What not to log by default:

  • raw secrets, API keys, credentials
  • entire documents from retrieval
  • full conversation transcripts with PII

If you must store raw text for debugging, store it behind access controls, encrypt it, and set retention windows that an auditor would not laugh at.
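The compromise above can be sketched as a summarizer that logs size metadata plus a scrubbed, truncated preview instead of the raw payload. This is illustrative only; real redaction needs far more than two regexes:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
DIGITS = re.compile(r"\b\d{6,}\b")  # long digit runs: accounts, phone numbers

def summarize_payload(text: str, max_chars: int = 120) -> dict:
    """Produce a loggable summary of a payload: size metadata plus a
    redacted preview, never the raw text."""
    scrubbed = EMAIL.sub("<email>", text)
    scrubbed = DIGITS.sub("<number>", scrubbed)
    return {
        "chars": len(text),
        "preview": scrubbed[:max_chars],
        "truncated": len(scrubbed) > max_chars,
    }
```

The metadata (`chars`, `truncated`) is always safe to keep; the preview follows whatever retention policy your full payload samples do.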

Step 5: Metrics that matter (reliability, quality, cost)

Reliability metrics

  • request error rate by route
  • tool error rate by tool name
  • timeout rate by stage
  • refusal rate by policy category
  • fallback rate (model fallback, retrieval fallback, tool fallback)

Latency metrics

  • p50/p95/p99 end-to-end latency
  • latency breakdown by stage (retrieval, tool, model)
  • time-to-first-token (if streaming)

Cost metrics

  • tokens in/out by route and model
  • cost per request
  • cost per successful outcome
  • cost per accepted answer (not per generated answer)

Cost per outcome is the honest metric. Tokens are just the billing surface.
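The difference between cost per request and cost per accepted answer falls out of the same event stream. A sketch with made-up numbers:

```python
def cost_per_outcome(events: list) -> float:
    """Total spend divided by accepted answers, not by answers generated.

    Rejected outputs still cost money, which is exactly why this metric
    is more honest than cost per request.
    """
    total_cost = sum(e["cost_usd"] for e in events)
    accepted = sum(1 for e in events if e.get("accepted"))
    return total_cost / accepted if accepted else float("inf")

events = [
    {"cost_usd": 0.02, "accepted": True},
    {"cost_usd": 0.05, "accepted": False},  # generated, paid for, rejected
    {"cost_usd": 0.03, "accepted": True},
]
# Cost per request is ~$0.033; cost per accepted answer is $0.05.
```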

Step 6: Quality signals (without pretending you have ground truth)

Quality is the hardest thing to measure, which is why teams avoid it until users do it for them.

Start with observable proxies:

  • user acceptance (thumbs up/down, "solved" clicks)
  • edit distance (how much did the user rewrite the output)
  • escalation rate (handoff to human)
  • repeat-question rate (user asks the same thing again)

Then add structured evaluation:

  • golden set regression (50-200 cases to start)
  • offline faithfulness checks (does the answer cite the provided chunks)
  • pairwise comparisons (A vs B) when changing prompts/models

If you use an LLM as a judge, log judge model/version too. Otherwise your evaluator will drift and you will think your product improved.
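The edit-distance proxy can be approximated with the stdlib's `difflib` rather than a full Levenshtein implementation. A sketch; acceptable thresholds are product-specific:

```python
from difflib import SequenceMatcher

def rewrite_fraction(generated: str, final: str) -> float:
    """Quality proxy: how much of the generated output the user rewrote.

    0.0 means the user kept it verbatim; values near 1.0 mean heavy
    edits, which usually signals the output missed the mark.
    """
    return 1.0 - SequenceMatcher(None, generated, final).ratio()
```

Logged per request alongside acceptance and escalation, this gives a trend line you can watch across prompt and model versions.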

Step 7: Retrieval observability (if you do RAG)

Retrieval failures masquerade as model failures.

At minimum log per request:

  • retrieval enabled? (yes/no)
  • query transforms (rewrite/decompose)
  • candidate counts per channel (dense/sparse)
  • top chunk IDs + source
  • rerank latency
  • context pack tokens

Core retrieval metrics:

  • recall@k on a labeled set
  • MRR / nDCG (if you have graded relevance)
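Recall@k is cheap to compute once you log top chunk IDs and maintain a labeled set. A minimal sketch:

```python
def recall_at_k(retrieved_ids: list, relevant_ids: set, k: int) -> float:
    """Fraction of the labeled relevant chunks that appear in the
    top-k retrieved results for one query."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)
```

Averaged over the labeled set, a drop in recall@k after reindexing tells you the model is being asked to answer without the evidence, before anyone blames the prompt.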

Related: Retrieval Strategy Playbook

Step 8: Safety and security instrumentation (yes, this is observability)

AI systems create new failure classes. You want those failures to be visible.

Track:

  • prompt-injection detections (direct and indirect)
  • tool-call blocks (policy denied)
  • data access denials (authz)
  • PII redaction actions
  • "suspicious" retrieval sources (unknown domains, external web)

If you cannot measure attempted abuse, you cannot improve defenses. You can only hope.
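A sketch of counting these detections by category and reason so attempted abuse becomes a queryable time series. `SafetyCounters` is an illustrative name, not a real library; in production these would be metrics-backend counters, not in-memory state:

```python
from collections import Counter

class SafetyCounters:
    """Count safety-relevant events by category:reason so defenses can
    be evaluated against measured attempts, not hope."""

    def __init__(self) -> None:
        self.counts = Counter()

    def record(self, category: str, reason: str) -> None:
        self.counts[f"{category}:{reason}"] += 1

counters = SafetyCounters()
counters.record("prompt_injection", "indirect_via_retrieval")
counters.record("tool_call_blocked", "policy_denied")
counters.record("prompt_injection", "indirect_via_retrieval")
```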

Step 9: Dashboards that earn their screen space

Dashboards should answer operational questions quickly.

Reliability dashboard

  • error rate, timeout rate, tool failure rate
  • by route, model version, and feature flag

Quality dashboard

  • acceptance rate, escalation rate, repeat-question rate
  • by user segment and answer class

Cost dashboard

  • cost per outcome
  • spend by model, route, tenant
  • cost anomalies (sudden token spikes)

Safety dashboard

  • blocks/refusals by reason
  • injection attempts over time
  • data access denials

Step 10: Debugging workflow (the operator loop)

When a report comes in, do not guess. Follow the trace.

  1. Identify request ID (or reconstruct via user/time/route).
  2. Check router decision and answer class.
  3. Check retrieval (if any): were relevant chunks even present?
  4. Check tool calls: did anything fail or return partial data?
  5. Check model + prompt versions.
  6. Check validation/policy: was output rejected or modified?
  7. Compare against known-good traces.

Your goal is to classify the failure into a fixable bucket:

  • routing
  • retrieval
  • tool integration
  • prompt/constraints
  • policy enforcement
  • model capability

A pragmatic maturity path

If you want a sane order of operations:

  1. Correlation IDs + stage events + version logging
  2. Token/cost accounting + latency breakdowns
  3. Tool-call audit logs + idempotency
  4. Retrieval traces (if applicable)
  5. Quality proxies + golden sets
  6. Safety instrumentation + retention controls

The model gets the headlines. The trace gets you uptime.