AI Observability Basics

What to instrument first when your product starts depending on language models.

By Ryan Setter

9/21/2025 · 6 min read

AI systems are probabilistic. Production systems are not.

Observability is the bridge between those two facts.

This guide focuses on what to instrument first so you can answer the only questions that matter after launch:

  • What happened?
  • Why did it happen?
  • How often does it happen?
  • What does it cost?
  • Can we fix it without guessing?

The mistake: treating LLM calls like HTTP calls

Traditional services are mostly deterministic. You can often debug with:

  • request/response,
  • status codes,
  • a stack trace.

LLM-driven systems need more, because the "response" is the end of a pipeline:

  • routing
  • retrieval
  • tool calls
  • context packing
  • model inference
  • validation / policy enforcement

If you only log the final text, you have built a system that fails silently and looks confident while doing it.

What "good" looks like: the one-screen rule

For any request ID, an operator should be able to answer in one screen:

  • what the user asked (redacted as needed)
  • how the system classified/routed it
  • what context was retrieved/assembled
  • what tools were called (args, results, latency)
  • what model and prompt version ran
  • what policy/validation gates were applied
  • what the system returned (and why)

If your debugging workflow begins with "paste the conversation into another model," you do not have observability. You have superstition with better UX.

Reference architecture: the observability surface

Most AI products converge on the same stages. Instrument them explicitly.

User
  -> API boundary (authn/authz, rate limits)
  -> Router (answer class, tool intent)
  -> Retrieval (optional): query build -> fetch -> rerank -> pack
  -> Tool execution (optional): call -> result -> retries/timeouts
  -> Model call (prompt version + model)
  -> Post-processing: schema validation, grounding checks, redaction
  -> Output: UI formatting, citations, follow-up actions

Observability is not a feature bolted on at the end. It is cross-cutting plumbing.
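One way to make that plumbing concrete is a small context manager that times each stage and appends it to the request's trace. This is a minimal stdlib sketch; `stage` and the `trace` list are illustrative names, not a real instrumentation library:

```python
import time
from contextlib import contextmanager

@contextmanager
def stage(name: str, trace: list):
    """Time one pipeline stage and record it on the request's trace.

    Wrapping every stage this way makes instrumentation cross-cutting
    instead of bolted on after the fact.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = round((time.perf_counter() - start) * 1000, 2)
        trace.append({"stage": name, "latency_ms": elapsed_ms})

# One request walks the stages in order, timing each:
trace: list = []
with stage("retrieval", trace):
    pass  # query build -> fetch -> rerank -> pack would run here
with stage("model_call", trace):
    pass  # model inference would run here
```

The same wrapper applies to routing, tool execution, and post-processing, so every stage shows up in the trace with its own latency.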

Step 1: Define events (traces) before dashboards (charts)

Dashboards are aggregates. Aggregates are useless if the raw events are missing.

Instrument a small set of first-class events. Start here:

Core events (minimum viable trace)

  • request_received
  • route_selected
  • context_assembled
  • retrieval_performed (if applicable)
  • tool_called (0..N)
  • model_called
  • output_validated
  • policy_applied
  • response_sent

Why these events

  • They map to controllable architecture stages.
  • They are replayable.
  • They let you separate model issues from integration issues.
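A minimal sketch of emitting these events as structured JSON lines. The `log_event` helper and its field names are illustrative assumptions, not a specific logging library; a real system would ship these records to a log pipeline rather than stdout:

```python
import json
import time
import uuid

def log_event(event: str, request_id: str, **fields) -> dict:
    """Emit one structured trace event as a JSON line.

    Every event carries the request_id so the full trace for a request
    can be reassembled later.
    """
    record = {
        "event": event,
        "request_id": request_id,
        "ts_ms": int(time.time() * 1000),
        **fields,
    }
    print(json.dumps(record))
    return record

# One request produces an ordered series of first-class events:
rid = str(uuid.uuid4())
log_event("request_received", rid, route="/chat")
log_event("route_selected", rid, answer_class="product_question")
log_event("model_called", rid, model="example-model", prompt_version="v3")
log_event("response_sent", rid, latency_ms=412)
```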

Step 2: Standard dimensions (the keys that make analysis possible)

Every event needs consistent dimensions. If you cannot group by it, you cannot learn from it.

Required dimensions (start strict, relax later):

  • request_id (global correlation)
  • user_id (or anonymous stable id)
  • tenant_id / team_id (if multi-tenant)
  • environment (prod/stage)
  • feature_flag / experiment_id
  • route (API endpoint / product surface)
  • answer_class (router output)
  • model + model_version
  • prompt_version
  • retrieval_index_version (if retrieval)
  • tool_schema_version

Also capture:

  • latency_ms per stage
  • tokens_in, tokens_out
  • cost_usd (even if estimated)

If you do not log versions, you cannot answer "what changed" after an incident.
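One way to enforce consistent dimensions is a single frozen dataclass attached to every event, so nothing can be emitted without them. A sketch with hypothetical field values:

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass(frozen=True)
class TraceDimensions:
    """Dimensions attached to every event so traces can be grouped,
    filtered, and diffed after an incident."""
    request_id: str
    environment: str
    route: str
    model: str
    model_version: str
    prompt_version: str
    user_id: Optional[str] = None
    tenant_id: Optional[str] = None
    answer_class: Optional[str] = None
    experiment_id: Optional[str] = None

dims = TraceDimensions(
    request_id="req-123",
    environment="prod",
    route="/chat",
    model="example-model",
    model_version="2025-06-01",
    prompt_version="support_v7",
)
# asdict(dims) merges cleanly into any event record before it ships.
```

Making the required fields positional means a missing dimension fails at construction time, not at analysis time.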

Step 3: Treat prompt + tool + retrieval as versioned code

You should be able to draw a line from a production output back to the exact configuration that produced it.

Minimum versioning set:

  • model name + pinned version (or provider snapshot)
  • prompt template id + version
  • router version
  • tool schemas version
  • retrieval index version (plus embedding model)

This is not paperwork. It is how you avoid "we shipped a prompt tweak" becoming "we have no idea why conversion dropped."
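A lightweight way to make that line drawable is to hash the whole versioned bundle into a single fingerprint logged on every request. The config keys and the `config_fingerprint` helper below are illustrative, not a standard:

```python
import hashlib
import json

def config_fingerprint(config: dict) -> str:
    """Hash the full generation config into a short, stable fingerprint.

    Canonical JSON (sorted keys) makes the hash independent of key order,
    so the same configuration always yields the same fingerprint.
    """
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

config = {
    "model": "example-model@2025-06-01",
    "prompt_template": "support_answer@v7",
    "router": "intent_router@v3",
    "tool_schemas": "tools@v12",
    "retrieval_index": "docs@2025-06-10",
    "embedding_model": "example-embed@v2",
}
fingerprint = config_fingerprint(config)
# Log `fingerprint` on every event: if it changed, something shipped.
```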

Step 4: Log the right payloads (and not the wrong ones)

There are two competing truths:

  • Payload logs make debugging possible.
  • Payload logs can create privacy, security, and compliance liabilities.

Practical compromise

  • Always log metadata + metrics.
  • Log structured summaries of payloads.
  • Sample full payloads only when needed, under strict retention.

What to log for each stage:

  • router input summary + output label
  • retrieval query (sanitized) + top chunk IDs + source URLs
  • tool calls: tool name, args (redacted), status, latency
  • model call: model, prompt version, token counts, stop reason
  • output validation: schema pass/fail, refusal reason, grounding score (if present)

What not to log by default:

  • raw secrets, API keys, credentials
  • entire documents from retrieval
  • full conversation transcripts with PII

If you must store raw text for debugging, store it behind access controls, encrypt it, and set retention windows that an auditor would not laugh at.
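The compromise above can be sketched as a summarizer that logs size metadata plus a scrubbed, truncated preview instead of the raw payload. This is illustrative only; real redaction needs far more than two regexes:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
DIGITS = re.compile(r"\b\d{6,}\b")  # long digit runs: accounts, phone numbers

def summarize_payload(text: str, max_chars: int = 120) -> dict:
    """Produce a loggable summary of a payload: size metadata plus a
    redacted preview, never the raw text."""
    scrubbed = EMAIL.sub("<email>", text)
    scrubbed = DIGITS.sub("<number>", scrubbed)
    return {
        "chars": len(text),
        "preview": scrubbed[:max_chars],
        "truncated": len(scrubbed) > max_chars,
    }
```

The metadata (`chars`, `truncated`) is always safe to keep; the preview follows whatever retention policy your full payload samples do.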

Step 5: Metrics that matter (reliability, quality, cost)

Reliability metrics

  • request error rate by route
  • tool error rate by tool name
  • timeout rate by stage
  • refusal rate by policy category
  • fallback rate (model fallback, retrieval fallback, tool fallback)

Latency metrics

  • p50/p95/p99 end-to-end latency
  • latency breakdown by stage (retrieval, tool, model)
  • time-to-first-token (if streaming)

Cost metrics

  • tokens in/out by route and model
  • cost per request
  • cost per successful outcome
  • cost per accepted answer (not per generated answer)

Cost per outcome is the honest metric. Tokens are just the billing surface.
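The difference between cost per request and cost per accepted answer falls out of the same event stream. A sketch with made-up numbers:

```python
def cost_per_outcome(events: list) -> float:
    """Total spend divided by accepted answers, not by answers generated.

    Rejected outputs still cost money, which is exactly why this metric
    is more honest than cost per request.
    """
    total_cost = sum(e["cost_usd"] for e in events)
    accepted = sum(1 for e in events if e.get("accepted"))
    return total_cost / accepted if accepted else float("inf")

events = [
    {"cost_usd": 0.02, "accepted": True},
    {"cost_usd": 0.05, "accepted": False},  # generated, paid for, rejected
    {"cost_usd": 0.03, "accepted": True},
]
# Cost per request is ~$0.033; cost per accepted answer is $0.05.
```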

Step 6: Quality signals (without pretending you have ground truth)

Quality is the hardest thing to measure, which is why teams avoid it until users do it for them.

Start with observable proxies:

  • user acceptance (thumbs up/down, "solved" clicks)
  • edit distance (how much did the user rewrite the output)
  • escalation rate (handoff to human)
  • repeat-question rate (user asks the same thing again)

Then add structured evaluation:

  • golden set regression (50-200 cases to start)
  • offline faithfulness checks (does the answer cite the provided chunks)
  • pairwise comparisons (A vs B) when changing prompts/models

If you use an LLM as a judge, log judge model/version too. Otherwise your evaluator will drift and you will think your product improved.
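The edit-distance proxy can be approximated with the stdlib's `difflib` rather than a full Levenshtein implementation. A sketch; acceptable thresholds are product-specific:

```python
from difflib import SequenceMatcher

def rewrite_fraction(generated: str, final: str) -> float:
    """Quality proxy: how much of the generated output the user rewrote.

    0.0 means the user kept it verbatim; values near 1.0 mean heavy
    edits, which usually signals the output missed the mark.
    """
    return 1.0 - SequenceMatcher(None, generated, final).ratio()
```

Logged per request alongside acceptance and escalation, this gives a trend line you can watch across prompt and model versions.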

Step 7: Retrieval observability (if you do RAG)

Retrieval failures masquerade as model failures.

At minimum log per request:

  • retrieval enabled? (yes/no)
  • query transforms (rewrite/decompose)
  • candidate counts per channel (dense/sparse)
  • top chunk IDs + source
  • rerank latency
  • context pack tokens

Core retrieval metrics:

  • recall@k on a labeled set
  • MRR / nDCG (if you have graded relevance)
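Recall@k is cheap to compute once you log top chunk IDs and maintain a labeled set. A minimal sketch:

```python
def recall_at_k(retrieved_ids: list, relevant_ids: set, k: int) -> float:
    """Fraction of the labeled relevant chunks that appear in the
    top-k retrieved results for one query."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)
```

Averaged over the labeled set, a drop in recall@k after reindexing tells you the model is being asked to answer without the evidence, before anyone blames the prompt.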

Related: Retrieval Strategy Playbook

Step 8: Safety and security instrumentation (yes, this is observability)

AI systems create new failure classes. You want those failures to be visible.

Track:

  • prompt-injection detections (direct and indirect)
  • tool-call blocks (policy denied)
  • data access denials (authz)
  • PII redaction actions
  • "suspicious" retrieval sources (unknown domains, external web)

If you cannot measure attempted abuse, you cannot improve defenses. You can only hope.
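A sketch of counting these detections by category and reason so attempted abuse becomes a queryable time series. `SafetyCounters` is an illustrative name, not a real library; in production these would be metrics-backend counters, not in-memory state:

```python
from collections import Counter

class SafetyCounters:
    """Count safety-relevant events by category:reason so defenses can
    be evaluated against measured attempts, not hope."""

    def __init__(self) -> None:
        self.counts = Counter()

    def record(self, category: str, reason: str) -> None:
        self.counts[f"{category}:{reason}"] += 1

counters = SafetyCounters()
counters.record("prompt_injection", "indirect_via_retrieval")
counters.record("tool_call_blocked", "policy_denied")
counters.record("prompt_injection", "indirect_via_retrieval")
```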

Step 9: Dashboards that earn their screen space

Dashboards should answer operational questions quickly.

Reliability dashboard

  • error rate, timeout rate, tool failure rate
  • by route, model version, and feature flag

Quality dashboard

  • acceptance rate, escalation rate, repeat-question rate
  • by user segment and answer class

Cost dashboard

  • cost per outcome
  • spend by model, route, tenant
  • cost anomalies (sudden token spikes)

Safety dashboard

  • blocks/refusals by reason
  • injection attempts over time
  • data access denials

Step 10: Debugging workflow (the operator loop)

When a report comes in, do not guess. Follow the trace.

  1. Identify request ID (or reconstruct via user/time/route).
  2. Check router decision and answer class.
  3. Check retrieval (if any): were relevant chunks even present?
  4. Check tool calls: did anything fail or return partial data?
  5. Check model + prompt versions.
  6. Check validation/policy: was output rejected or modified?
  7. Compare against known-good traces.

Your goal is to classify the failure into a fixable bucket:

  • routing
  • retrieval
  • tool integration
  • prompt/constraints
  • policy enforcement
  • model capability

A pragmatic maturity path

If you want a sane order of operations:

  1. Correlation IDs + stage events + version logging
  2. Token/cost accounting + latency breakdowns
  3. Tool-call audit logs + idempotency
  4. Retrieval traces (if applicable)
  5. Quality proxies + golden sets
  6. Safety instrumentation + retention controls

The model gets the headlines. The trace gets you uptime.