Error Taxonomy: Classifying AI System Failures Before They Become Incidents

A failure-language doctrine for production AI: classify failures by boundary crossed, control missed, and detection path so incidents stop dissolving into vague model blame.

By Ryan Setter

3/13/2026 · 8 min read

Saying "the model failed" is not a diagnosis.

It is the architectural equivalent of saying "something weird happened" with better branding.

Production AI failures only become manageable once they are classifiable.

The Pattern

A production error taxonomy is a control-plane artifact for operations. It tells the team how to name a failure, where to look first, and what kind of gate should have intercepted it before users or operators had to.

The useful unit of classification is not "the answer was bad." It is the boundary that was crossed and the control that failed to stop it.

That sounds pedantic until the same symptom starts coming from three different causes:

  • wrong evidence retrieved
  • correct evidence retrieved but ignored
  • correct answer generated, then broken by a downstream tool or validator path

Those are not one problem with three flavors. They are three different failures that happen to look similar at the UI layer.

This page treats taxonomy as a systems contract, not a postmortem writing style.

Key Takeaways

  • A useful error taxonomy classifies failures by boundary crossed and control missed, not by vague symptoms like "hallucination".
  • Every failure class should map to a detection path, an owner, and a release gate that should have caught it earlier.
  • The same visible symptom can come from different failure classes; taxonomy exists to stop teams from applying the wrong fix to the right pain.
  • If incidents do not feed new cases into traces, evals, and runbooks, the taxonomy is decorative.

Why Taxonomy Exists

Teams without a taxonomy usually fall back to symptom words:

  • hallucination
  • weird answer
  • bad tool call
  • prompt drift

Those labels feel descriptive, but they usually collapse cause, symptom, and blame into one foggy category. The same failed request gets discussed as a retrieval issue by one engineer, a prompt issue by another, and a model issue by whoever arrived last with the strongest tone.

That is how incident reviews become literature.

Taxonomy exists to make failure legible enough that the next action is obvious. If the class is clear, the likely owner is clearer, the release implication is clearer, and the regression case you add afterward has a fighting chance of testing the right thing.

One symptom, different diagnoses

Suppose the system returns a confident but wrong answer.

That could be:

  • a retrieval boundary failure because the wrong tenant's evidence was pulled
  • a grounding failure because the right evidence was present but unsupported claims slipped through
  • an evaluation blind spot because the release gate never contained a case for that failure mode

The user sees one bad answer. The operator should not.

[Figure: AI system failure surfaces, showing a central request flowing through orchestration, retrieval, model, and tool layers, with cross-cutting governance, evaluation, and infrastructure controls.]

Failure surfaces show where symptoms can emerge. The taxonomy classifies the crossed boundary and the missed control, not just the symptom that reached the user.

The Failure Classification Contract

A classified failure record should be lightweight enough to use during real incidents and strict enough to support later analysis. At minimum, capture:

  • failure_id
  • workflow_id
  • request_id or trace_id
  • failure_class
  • observed_symptom
  • boundary_crossed
  • control_missed
  • detection_stage
  • severity
  • operator_impact
  • release_action
  • regression_case_added

Draft example shape

{
  "failure_id": "fail_2026_03_013",
  "workflow_id": "incident-triage",
  "trace_id": "trc_01HQ...",
  "failure_class": "retrieval-boundary-failure",
  "observed_symptom": "answer cited irrelevant deployment notes",
  "boundary_crossed": "tenant-evidence-scope",
  "control_missed": "retrieval filter enforcement",
  "detection_stage": "pre-release-eval",
  "severity": "high",
  "operator_impact": "triage answers untrustworthy until filter fix",
  "release_action": "block",
  "regression_case_added": true
}

The point is not to build a perfect universal schema. The point is to make classification consistent enough that incident review, evaluation, and release policy are all talking about the same failure.
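As a sketch of what "consistent enough" can mean in practice, the contract above can be enforced with a few lines of validation wherever failure records are written. This is a minimal illustration, not a real schema library; the field set mirrors the list above.

```python
# Minimal failure-record contract check. Field names follow the
# classification contract above; everything else is illustrative.
REQUIRED_FIELDS = {
    "failure_id", "workflow_id", "trace_id", "failure_class",
    "observed_symptom", "boundary_crossed", "control_missed",
    "detection_stage", "severity", "operator_impact",
    "release_action", "regression_case_added",
}

def validate_failure_record(record: dict) -> list[str]:
    """Return the contract fields missing from a candidate record."""
    return sorted(REQUIRED_FIELDS - record.keys())

record = {
    "failure_id": "fail_2026_03_013",
    "workflow_id": "incident-triage",
    "trace_id": "trc_01HQ...",
    "failure_class": "retrieval-boundary-failure",
    "observed_symptom": "answer cited irrelevant deployment notes",
    "boundary_crossed": "tenant-evidence-scope",
    "control_missed": "retrieval filter enforcement",
    "detection_stage": "pre-release-eval",
    "severity": "high",
    "operator_impact": "triage answers untrustworthy until filter fix",
    "release_action": "block",
    "regression_case_added": True,
}

missing = validate_failure_record(record)
# An empty list means the record satisfies the minimal contract.
```

A check like this belongs wherever incidents are recorded, so an incomplete classification is rejected at write time instead of discovered during trend analysis.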

Core Failure Classes

Use a small number of classes that map to actual system boundaries. If every incident gets its own bespoke class, you have rebuilt confusion with extra ceremony.

| Failure class | What actually failed | First control that should catch it | Primary owner lane |
| --- | --- | --- | --- |
| contract-failure | malformed output, missing required fields, unsupported values | schema validation, parser, repair loop | interface / workflow contract |
| retrieval-boundary-failure | wrong tenant, stale source, irrelevant evidence, broken evidence scope | retrieval filters, provenance checks, isolation tests | retrieval / data boundary |
| grounding-failure | claims exceed or contradict available evidence | grounding validators, citation checks, answer routing | generation + validation |
| tool-authority-failure | unsafe tool selection, over-broad args, unauthorized write path | capability contract, arg validation, Two-Key Writes | tooling / authority boundary |
| policy-failure | refusal, escalation, or compliance behavior breaks under bounded cases | policy validators, refusal tests, release gates | governance / policy enforcement |
| budget-failure | latency, cost, or loop depth exceeds architecture limits | hard budgets, trace analytics, retry ceilings | operational control |
| evaluation-blind-spot | the release process never covered the failure class that later surfaced | Golden Sets, targeted eval subsets | evaluation / release discipline |
| operator-process-failure | evidence existed, but review, approval, rollback, or incident handling failed | runbook, approval workflow, release procedure | operational process |

The table is not a taxonomy for all time. It is a sane starting grid. Add classes only when repeated failures cannot be explained cleanly with the existing set.

A quick distinction, because these three get conflated constantly:

  • retrieval-boundary-failure: the wrong evidence entered the reasoning path
  • grounding-failure: the right evidence was available, but the output exceeded or contradicted it
  • evaluation-blind-spot: the release process never covered the case, so the failure shipped unchallenged
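The three-way distinction above can be mechanized against trace evidence. A rough triage sketch follows, assuming the trace records the requesting tenant, the tenant of each retrieved document, whether the answer's claims were supported by the retrieved evidence, and whether release evals covered the case. All field names here are hypothetical, not a real trace schema.

```python
def classify_wrong_answer(trace: dict) -> str:
    """Triage a confident-but-wrong answer using trace evidence.

    Field names are illustrative; adapt them to your trace schema.
    """
    docs = trace["retrieved_docs"]
    if any(d["tenant"] != trace["request_tenant"] for d in docs):
        # Wrong evidence entered the reasoning path.
        return "retrieval-boundary-failure"
    if not trace["claims_supported_by_evidence"]:
        # The right evidence was available, but the output exceeded it.
        return "grounding-failure"
    if not trace["covered_by_release_evals"]:
        # The release process never challenged this case.
        return "evaluation-blind-spot"
    return "unclassified"
```

The point of the sketch is the ordering: the boundary check runs before the grounding check, because a grounding verdict is meaningless against out-of-scope evidence.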

Example: A "Hallucination" That Was Actually a Retrieval Boundary Failure

Suppose a support copilot answers an enterprise customer's question with a fluent, confident response and two clean-looking citations.

The answer is wrong.

At first glance, this gets labeled the usual way: hallucination.

But the model did not invent unsupported facts out of nowhere. It answered from retrieved material that should never have been in scope for that request. The citations were real. The boundary was wrong.

Scenario

  • workflow: support copilot
  • visible symptom: confident wrong answer with plausible citations
  • actual failure: documents were retrieved from the wrong tenant corpus
  • root cause: retrieval isolation filter bug

Classification

  • observed_symptom: wrong confident answer
  • failure_class: retrieval-boundary-failure
  • boundary_crossed: tenant evidence boundary
  • control_missed: retrieval isolation filter
  • detection_stage: runtime incident
  • release_action: block until retrieval isolation coverage exists

This is exactly why taxonomy matters.

If the team calls this a hallucination, they will probably tighten prompts, adjust answer wording, or blame the model. None of those fixes touch the real failure surface.

The right response is different:

  • fix the retrieval boundary
  • add an isolation-focused regression case
  • ensure the trace captures retrieval set and filter decisions
  • treat future occurrences as a data-boundary incident, not a generation-quality debate
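The isolation-focused regression case can be as small as one assertion over the retrieval step, run in the release suite. A sketch, assuming a `retrieve(query, tenant)` function and documents tagged with their owning tenant; both the function and the document shape are assumptions, not a real API.

```python
def tenant_leaks(retrieve, query: str, tenant: str) -> list[str]:
    """Return IDs of retrieved documents that cross the tenant boundary.

    `retrieve` is an assumed callable returning dicts with "id" and
    "tenant" keys; an empty result means isolation held for this query.
    """
    return [d["id"] for d in retrieve(query, tenant) if d["tenant"] != tenant]

# In the release suite, the regression case blocks ship on any leak:
#   assert tenant_leaks(retrieve, incident_query, "tenant-a") == []
```

Because the check asserts on the retrieval set rather than the final answer, it keeps catching the boundary bug even after prompt or model changes make the symptom look different.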

Detection Paths

Detection path matters because it turns classification into action. A useful taxonomy does not stop at naming the class. It answers the more uncomfortable question: where should this have been caught?

When a failure lands, operators should be able to answer three questions quickly:

  1. What boundary was crossed?
  2. What control missed it?
  3. At what stage should the system have caught it?

For most production AI systems, the detection stack looks like this:

  • interface/schema validation
  • retrieval isolation checks
  • grounding validators
  • trace review
  • golden set regressions
  • runtime policy blocks
  • operator review or incident response

The later a class is first detected, the more expensive the lesson becomes.

  • If a contract-failure is first discovered in production, the interface contract is too soft.
  • If a retrieval-boundary-failure is first discovered by a customer, the data boundary is not real yet.
  • If a policy-failure is only found during incident response, the release gate was ceremonial.
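One way to make "the later, the more expensive" operational is to order the detection stack and flag any class first caught later than its expected gate. A sketch with illustrative stage names mirroring the stack above; the expected-stage mapping is an assumed policy, not a standard.

```python
# Detection stack, ordered earliest to latest. Names are illustrative.
DETECTION_STAGES = [
    "schema-validation",
    "retrieval-isolation-check",
    "grounding-validator",
    "trace-review",
    "golden-set-regression",
    "runtime-policy-block",
    "operator-incident",
]

# Where each class *should* first be caught (assumed policy).
EXPECTED_FIRST_STAGE = {
    "contract-failure": "schema-validation",
    "retrieval-boundary-failure": "retrieval-isolation-check",
    "policy-failure": "golden-set-regression",
}

def gate_gap(failure_class: str, detected_at: str) -> int:
    """How many stages past its expected gate a failure slipped (0 = on time)."""
    expected = DETECTION_STAGES.index(EXPECTED_FIRST_STAGE[failure_class])
    actual = DETECTION_STAGES.index(detected_at)
    return max(0, actual - expected)
```

A nonzero gap is the quantitative version of the bullets above: a retrieval-boundary-failure first seen during incident response slipped five gates, and each one names a control that was too soft.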

This is why taxonomy belongs next to traces and evals rather than inside a separate incident wiki nobody reads during release week.


Decision Criteria

Use a formal error taxonomy when:

  • multiple change surfaces can create the same visible symptom
  • operators need to distinguish model, retrieval, tool, and policy failures quickly
  • the workflow has release gates, audits, or real production consequence
  • you are tired of every regression getting labeled "hallucination" and then fixed by vibes

It becomes mandatory when you have more than one meaningful change surface (model, prompt, retrieval, tools, validators, policy), because the odds that one visible symptom hides multiple causes go up fast.


Failure Modes

Taxonomy collapse

Everything gets labeled with one vague umbrella term such as "hallucination" or "model failure." Ownership stays blurry, fixes target the wrong surface, and the incident review sounds decisive while changing very little.

Symptom-first diagnosis

The team classifies the symptom that was visible to the user instead of the boundary that failed underneath it. Retrieval, grounding, and policy issues get mixed together, and the wrong guardrail gets tightened with great confidence.

Taxonomy sprawl

Every new incident creates a shiny new class. The taxonomy stops supporting trend analysis and turns into a graveyard of one-off labels that explain one meeting and nothing else.

No release consequence

The taxonomy exists in postmortems but never changes traces, eval subsets, or release gates. The same incident returns wearing a different shirt and everyone acts surprised by the outfit.

Ownership blur

Nobody knows whether the fix belongs to prompt design, retrieval policy, validator logic, release discipline, or operator workflow. Incident response turns into committee theater with excellent attendance and weak control improvement.

Minimal Implementation

Step 1: Define 6-8 stable failure classes

Keep them architecture-aligned, not vendor-aligned. The taxonomy should survive a provider swap without losing its meaning.

Step 2: Add taxonomy fields to trace and incident records

Classification should attach to request evidence, not float separately in a meeting note or postmortem spreadsheet.

Step 3: Require class assignment in postmortems

Every serious incident should land in the taxonomy or force a taxonomy revision.

Step 4: Tie classes to eval subsets and release gates

The release process should know which classes block ship, which trigger escalation, and which require explicit operator review.
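Step 4 can be encoded as a small policy table the release pipeline consults before shipping. A sketch: the class names follow the grid above, but the action assignments are illustrative, not a recommendation for any specific system.

```python
# Release action per failure class when an open (unresolved) instance
# exists. The assignments here are illustrative.
RELEASE_POLICY = {
    "contract-failure": "block",
    "retrieval-boundary-failure": "block",
    "grounding-failure": "block",
    "tool-authority-failure": "block",
    "policy-failure": "escalate",
    "budget-failure": "escalate",
    "evaluation-blind-spot": "operator-review",
    "operator-process-failure": "operator-review",
}

def release_decision(open_failure_classes: list[str]) -> str:
    """Most restrictive action across open failure classes ('ship' if none)."""
    order = ["ship", "operator-review", "escalate", "block"]
    worst = "ship"
    for cls in open_failure_classes:
        # An unknown class defaults to operator review rather than ship.
        action = RELEASE_POLICY.get(cls, "operator-review")
        if order.index(action) > order.index(worst):
            worst = action
    return worst
```

The defaulting choice matters: a failure class the taxonomy does not recognize should force a human decision, not slide through as shippable.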

Step 5: Feed incidents back into the doctrine system

Close the loop deliberately:

  • traces explain the failure
  • golden sets catch the recurrence
  • policy and architecture docs encode the new boundary

That is how a taxonomy stops being a naming exercise and becomes part of the operating system.

Closing Position

AI systems do not fail mysteriously.

They fail at boundaries.

If you cannot name the boundary that broke, you are not debugging yet.