Error Taxonomy: Classifying AI System Failures Before They Become Incidents
A failure-language doctrine for production AI: classify failures by boundary crossed, control missed, and detection path so incidents stop dissolving into vague model blame.
By Ryan Setter
Saying "the model failed" is not a diagnosis.
It is the architectural equivalent of saying "something weird happened" with better branding.
Production AI failures only become manageable once they are classifiable.
The Pattern
A production error taxonomy is a control-plane artifact for operations. It tells the team how to name a failure, where to look first, and what kind of gate should have intercepted it before users or operators had to.
The useful unit of classification is not "the answer was bad." It is the boundary that was crossed and the control that failed to stop it.
That sounds pedantic until the same symptom starts coming from three different causes:
- wrong evidence retrieved
- correct evidence retrieved but ignored
- correct answer generated, then broken by a downstream tool or validator path
Those are not one problem with three flavors. They are three different failures that happen to look similar at the UI layer.
This page treats taxonomy as a systems contract, not a postmortem writing style.
Key Takeaways
- A useful error taxonomy classifies failures by boundary crossed and control missed, not by vague symptoms like "hallucination".
- Every failure class should map to a detection path, an owner, and a release gate that should have caught it earlier.
- The same visible symptom can come from different failure classes; taxonomy exists to stop teams from applying the wrong fix to the right pain.
- If incidents do not feed new cases into traces, evals, and runbooks, the taxonomy is decorative.
Why Taxonomy Exists
Teams without a taxonomy usually fall back to symptom words:
- hallucination
- weird answer
- bad tool call
- prompt drift
Those labels feel descriptive, but they usually collapse cause, symptom, and blame into one foggy category. The same failed request gets discussed as a retrieval issue by one engineer, a prompt issue by another, and a model issue by whoever arrived last with the strongest tone.
That is how incident reviews become literature.
Taxonomy exists to make failure legible enough that the next action is obvious. If the class is clear, the likely owner is clearer, the release implication is clearer, and the regression case you add afterward has a fighting chance of testing the right thing.
One symptom, different diagnoses
Suppose the system returns a confident but wrong answer.
That could be:
- a retrieval boundary failure because the wrong tenant's evidence was pulled
- a grounding failure because the right evidence was present but unsupported claims slipped through
- an evaluation blind spot because the release gate never contained a case for that failure mode
The user sees one bad answer. The operator should not.
Failure surfaces show where symptoms can emerge. The taxonomy classifies the crossed boundary and the missed control, not just the symptom that reached the user.
The Failure Classification Contract
A classified failure record should be lightweight enough to use during real incidents and strict enough to support later analysis. At minimum, capture:
- `failure_id`
- `workflow_id`
- `request_id` or `trace_id`
- `failure_class`
- `observed_symptom`
- `boundary_crossed`
- `control_missed`
- `detection_stage`
- `severity`
- `operator_impact`
- `release_action`
- `regression_case_added`
Draft example shape
```json
{
  "failure_id": "fail_2026_03_013",
  "workflow_id": "incident-triage",
  "trace_id": "trc_01HQ...",
  "failure_class": "retrieval-boundary-failure",
  "observed_symptom": "answer cited irrelevant deployment notes",
  "boundary_crossed": "tenant-evidence-scope",
  "control_missed": "retrieval filter enforcement",
  "detection_stage": "pre-release-eval",
  "severity": "high",
  "release_action": "block",
  "regression_case_added": true
}
```
The point is not to build a perfect universal schema. The point is to make classification consistent enough that incident review, evaluation, and release policy are all talking about the same failure.
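As a sketch of keeping that consistency, the record can live in a typed structure so classification fields are mandatory rather than optional prose. The type and field names below mirror the draft shape above and are assumptions to adapt to your own schema:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record type mirroring the draft example shape above.
@dataclass(frozen=True)
class FailureRecord:
    failure_id: str
    workflow_id: str
    failure_class: str          # e.g. "retrieval-boundary-failure"
    observed_symptom: str
    boundary_crossed: str
    control_missed: str
    detection_stage: str        # e.g. "pre-release-eval", "runtime-incident"
    severity: str
    release_action: str         # e.g. "block", "escalate", "review"
    regression_case_added: bool = False
    trace_id: Optional[str] = None
    request_id: Optional[str] = None
    operator_impact: Optional[str] = None

record = FailureRecord(
    failure_id="fail_2026_03_013",
    workflow_id="incident-triage",
    trace_id="trc_01HQ...",
    failure_class="retrieval-boundary-failure",
    observed_symptom="answer cited irrelevant deployment notes",
    boundary_crossed="tenant-evidence-scope",
    control_missed="retrieval filter enforcement",
    detection_stage="pre-release-eval",
    severity="high",
    release_action="block",
    regression_case_added=True,
)
```

Freezing the record is a deliberate choice: a classification attached to an incident should be revised by writing a new record, not silently mutated after the review.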
Core Failure Classes
Use a small number of classes that map to actual system boundaries. If every incident gets its own bespoke class, you have rebuilt confusion with extra ceremony.
| Failure class | What actually failed | First control that should catch it | Primary owner lane |
|---|---|---|---|
| `contract-failure` | malformed output, missing required fields, unsupported values | schema validation, parser, repair loop | interface / workflow contract |
| `retrieval-boundary-failure` | wrong tenant, stale source, irrelevant evidence, broken evidence scope | retrieval filters, provenance checks, isolation tests | retrieval / data boundary |
| `grounding-failure` | claims exceed or contradict available evidence | grounding validators, citation checks, answer routing | generation + validation |
| `tool-authority-failure` | unsafe tool selection, over-broad args, unauthorized write path | capability contract, arg validation, Two-Key Writes | tooling / authority boundary |
| `policy-failure` | refusal, escalation, or compliance behavior breaks under bounded cases | policy validators, refusal tests, release gates | governance / policy enforcement |
| `budget-failure` | latency, cost, or loop depth exceeds architecture limits | hard budgets, trace analytics, retry ceilings | operational control |
| `evaluation-blind-spot` | the release process never covered the failure class that later surfaced | Golden Sets, targeted eval subsets | evaluation / release discipline |
| `operator-process-failure` | evidence existed, but review, approval, rollback, or incident handling failed | runbook, approval workflow, release procedure | operational process |
The table is not a taxonomy for all time. It is a sane starting grid. Add classes only when repeated failures cannot be explained cleanly with the existing set.
A quick distinction, because these three get conflated constantly:
- `retrieval-boundary-failure`: the wrong evidence entered the reasoning path
- `grounding-failure`: the right evidence was available, but the output exceeded or contradicted it
- `evaluation-blind-spot`: the release process never covered the case, so the failure shipped unchallenged
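One way to keep the three apart during triage is a small decision helper over facts pulled from the trace. The predicate names here are illustrative assumptions, not a real API; the point is the order of the checks, which follows the order of the boundaries in the pipeline:

```python
def classify_wrong_answer(evidence_in_scope: bool,
                          output_grounded: bool,
                          case_in_release_eval: bool) -> str:
    """Hypothetical triage helper for a confident-but-wrong answer.

    Checks boundaries in pipeline order: retrieval scope first,
    then grounding, then release-eval coverage.
    """
    if not evidence_in_scope:
        # Wrong evidence entered the reasoning path.
        return "retrieval-boundary-failure"
    if not output_grounded:
        # Right evidence was available; the output exceeded or contradicted it.
        return "grounding-failure"
    if not case_in_release_eval:
        # The failure shipped because no gate ever tested this case.
        return "evaluation-blind-spot"
    return "unclassified"

# A wrong-tenant retrieval is a boundary failure even when the answer
# looks perfectly grounded in the (wrong) evidence it was given.
assert classify_wrong_answer(False, True, True) == "retrieval-boundary-failure"
```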
Example: A "Hallucination" That Was Actually a Retrieval Boundary Failure
Suppose a support copilot answers an enterprise customer's question with a fluent, confident response and two clean-looking citations.
The answer is wrong.
At first glance, this gets labeled the usual way: hallucination.
But the model did not invent unsupported facts out of nowhere. It answered from retrieved material that should never have been in scope for that request. The citations were real. The boundary was wrong.
Scenario
- workflow: support copilot
- visible symptom: confident wrong answer with plausible citations
- actual failure: documents were retrieved from the wrong tenant corpus
- root cause: retrieval isolation filter bug
Classification
- `observed_symptom`: wrong confident answer
- `failure_class`: `retrieval-boundary-failure`
- `boundary_crossed`: tenant evidence boundary
- `control_missed`: retrieval isolation filter
- `detection_stage`: runtime incident
- `release_action`: block until retrieval isolation coverage exists
This is exactly why taxonomy matters.
If the team calls this a hallucination, they will probably tighten prompts, adjust answer wording, or blame the model. None of those fixes touch the real failure surface.
The right response is different:
- fix the retrieval boundary
- add an isolation-focused regression case
- ensure the trace captures retrieval set and filter decisions
- treat future occurrences as a data-boundary incident, not a generation-quality debate
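The regression case that falls out of this classification targets the boundary, not the wording of the answer. A minimal isolation check might look like the sketch below; the function, field names, and tenant values are hypothetical, and the document shape should match whatever your retrieval layer actually returns:

```python
def check_tenant_isolation(retrieved_docs, request_tenant):
    """Flag any retrieved document outside the requesting tenant's scope.

    `retrieved_docs` is assumed to be a list of dicts carrying "id" and
    "tenant" keys; adapt to your retrieval layer's real record shape.
    """
    leaks = [d for d in retrieved_docs if d["tenant"] != request_tenant]
    return {
        "passed": not leaks,
        "failure_class": "retrieval-boundary-failure" if leaks else None,
        "leaked_doc_ids": [d["id"] for d in leaks],
    }

docs = [
    {"id": "doc-1", "tenant": "acme"},
    {"id": "doc-7", "tenant": "globex"},  # out-of-scope document
]
result = check_tenant_isolation(docs, request_tenant="acme")
```

Note that the check reports a failure class directly, so a red result lands in the taxonomy instead of being re-litigated as a generation-quality debate.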
Detection Paths
Detection path matters because it turns classification into action. A useful taxonomy does not stop at naming the class. It answers the more uncomfortable question: where should this have been caught?
When a failure lands, operators should be able to answer three questions quickly:
- What boundary was crossed?
- What control missed it?
- At what stage should the system have caught it?
For most production AI systems, the detection stack looks like this:
- interface/schema validation
- retrieval isolation checks
- grounding validators
- trace review
- golden set regressions
- runtime policy blocks
- operator review or incident response
The later a class is first detected, the more expensive the lesson becomes.
- If a `contract-failure` is first discovered in production, the interface contract is too soft.
- If a `retrieval-boundary-failure` is first discovered by a customer, the data boundary is not real yet.
- If a `policy-failure` is only found during incident response, the release gate was ceremonial.
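That expectation can be enforced mechanically: compare where a class was actually first detected against the earliest stage that should own it. The stage ordering and the per-class expectations below are assumptions to replace with your own detection stack:

```python
# Detection stack ordered earliest to latest; a later index is a costlier lesson.
STAGES = [
    "schema-validation",
    "retrieval-isolation-check",
    "grounding-validation",
    "pre-release-eval",
    "runtime-policy-block",
    "incident-response",
]

# Hypothetical expectations: the latest acceptable first-detection stage per class.
EXPECTED_BY = {
    "contract-failure": "schema-validation",
    "retrieval-boundary-failure": "retrieval-isolation-check",
    "grounding-failure": "grounding-validation",
    "policy-failure": "pre-release-eval",
}

def detected_too_late(failure_class: str, detection_stage: str) -> bool:
    """True when the control that should own this class was missed upstream."""
    expected = EXPECTED_BY.get(failure_class)
    if expected is None:
        return False  # no expectation recorded for this class yet
    return STAGES.index(detection_stage) > STAGES.index(expected)

# A contract failure first seen in production means the contract is too soft.
assert detected_too_late("contract-failure", "incident-response")
```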
This is why taxonomy belongs next to traces and evals rather than inside a separate incident wiki nobody reads during release week.
Decision Criteria
Use a formal error taxonomy when:
- multiple change surfaces can create the same visible symptom
- operators need to distinguish model, retrieval, tool, and policy failures quickly
- the workflow has release gates, audits, or real production consequence
- you are tired of every regression getting labeled "hallucination" and then fixed by vibes
It becomes mandatory when you have more than one meaningful change surface (model, prompt, retrieval, tools, validators, policy), because the odds that one visible symptom hides multiple causes go up fast.
Failure Modes
Taxonomy collapse
Everything gets labeled with one vague umbrella term such as "hallucination" or "model failure." Ownership stays blurry, fixes target the wrong surface, and the incident review sounds decisive while changing very little.
Symptom-first diagnosis
The team classifies the symptom that was visible to the user instead of the boundary that failed underneath it. Retrieval, grounding, and policy issues get mixed together, and the wrong guardrail gets tightened with great confidence.
Taxonomy sprawl
Every new incident creates a shiny new class. The taxonomy stops supporting trend analysis and turns into a graveyard of one-off labels that explain one meeting and nothing else.
No release consequence
The taxonomy exists in postmortems but never changes traces, eval subsets, or release gates. The same incident returns wearing a different shirt and everyone acts surprised by the outfit.
Ownership blur
Nobody knows whether the fix belongs to prompt design, retrieval policy, validator logic, release discipline, or operator workflow. Incident response turns into committee theater with excellent attendance and weak control improvement.
Minimal Implementation
Step 1: Define 6-8 stable failure classes
Keep them architecture-aligned, not vendor-aligned. The taxonomy should survive a provider swap without losing its meaning.
Step 2: Add taxonomy fields to trace and incident records
Classification should attach to request evidence, not float separately in a meeting note or postmortem spreadsheet.
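Attaching classification to request evidence can be as simple as enriching the trace record in place, refusing incomplete classifications. A minimal sketch, with field names assumed from the contract above:

```python
def attach_classification(trace: dict, classification: dict) -> dict:
    """Return a copy of a trace record with taxonomy fields attached.

    Rejects partial classifications so the trace never carries a
    half-named failure; field names are illustrative.
    """
    required = {"failure_class", "boundary_crossed",
                "control_missed", "detection_stage"}
    missing = required - classification.keys()
    if missing:
        raise ValueError(f"incomplete classification: {sorted(missing)}")
    return {**trace, "classification": classification}

trace = {"trace_id": "trc_01HQ...", "workflow_id": "incident-triage"}
enriched = attach_classification(trace, {
    "failure_class": "retrieval-boundary-failure",
    "boundary_crossed": "tenant-evidence-scope",
    "control_missed": "retrieval filter enforcement",
    "detection_stage": "runtime-incident",
})
```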
Step 3: Require class assignment in postmortems
Every serious incident should land in the taxonomy or force a taxonomy revision.
Step 4: Tie classes to eval subsets and release gates
The release process should know which classes block ship, which trigger escalation, and which require explicit operator review.
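A class-keyed release gate can be sketched as a lookup the ship pipeline consults, with the strictest open action winning. The policy values below are illustrative assumptions, not a recommended posture:

```python
# Hypothetical class-to-action policy; tune to your own risk posture.
RELEASE_POLICY = {
    "retrieval-boundary-failure": "block",
    "policy-failure": "block",
    "contract-failure": "block",
    "grounding-failure": "escalate",
    "budget-failure": "escalate",
    "operator-process-failure": "operator-review",
}

def release_action(open_failure_classes: list) -> str:
    """Strictest action wins: block > escalate > operator-review > ship."""
    order = ["block", "escalate", "operator-review", "ship"]
    actions = [RELEASE_POLICY.get(c, "operator-review")
               for c in open_failure_classes]  # unknown classes default to review
    return min(actions, key=order.index) if actions else "ship"

# One open boundary failure is enough to block, whatever else is open.
assert release_action(["grounding-failure", "retrieval-boundary-failure"]) == "block"
```

Defaulting unknown classes to operator review rather than ship is the conservative choice: a failure the taxonomy cannot name should not pass a gate silently.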
Step 5: Feed incidents back into the doctrine system
Close the loop deliberately:
- traces explain the failure
- golden sets catch the recurrence
- policy and architecture docs encode the new boundary
That is how a taxonomy stops being a naming exercise and becomes part of the operating system.
Closing Position
AI systems do not fail mysteriously.
They fail at boundaries.
If you cannot name the boundary that broke, you are not debugging yet.
Related Reading
- Probabilistic Core / Deterministic Shell: Containing Uncertainty Without Shipping Chaos
- Two-Key Writes: Preventing Accidental Autonomy in AI Systems
- The Minimum Useful Trace: An Observability Contract for Production AI
- Golden Sets: Regression Engineering for Probabilistic Systems
- Architecture Principles for AI Products