Architecture Principles for AI Products

Core principles for building maintainable, testable, and resilient AI products.

By Ryan Setter

10/18/2025 · 6 min read

AI products fail in predictable ways. Not because models are "bad", but because we keep wrapping a probabilistic component in a deterministic product and then acting surprised.

These principles are not theory. They are the boring architecture moves that let you ship AI features without turning your on-call rotation into a literary genre.

AI products are systems, not models

The model is a component.

  • It interprets language and generates tokens.
  • It does not enforce policy.
  • It does not own state.
  • It does not provide accountability.

Your product does all of that.

If you design the system boundary correctly, model upgrades are manageable. If you do not, every model change becomes a production incident with better marketing.

Reference architecture (minimal viable AI system)

Most production AI features converge on a similar shape:

User
  -> API boundary (authn/authz, rate limits, input validation)
  -> Orchestrator (routing, budgets, retries, timeouts)
      -> Context assembly (instructions, memory, retrieval)
      -> Tool executor (least privilege, schemas, audit logs)
      -> Model call (versioned prompt + model)
      -> Post-processing (schema validation, policy checks, redaction)
  -> Output (UI constraints, citations, follow-up actions)

Observability and evaluation span all of it.

The rest of this article is how to make each box survivable.

Principle 1: Separate interpretation from enforcement

Models interpret. Systems enforce.

If you let a model both interpret the request and enforce the rules, you have built a policy engine that cannot be audited, cannot be tested, and will eventually be "aligned" into refusing your own product requirements.

Patterns that work

  • Policy as code: enforce permissions, safety constraints, and data access in deterministic code.
  • Tool gating: the model can propose a tool call, but the system decides whether it happens.
  • Schema-first outputs: the model can propose structured output, but the system validates it.

Example: tool calls that always pass through enforcement.

// Pseudocode: every tool call the model proposes passes through policy first.
const route = await router.classify(request);
let plan = await model.proposePlan({ request, route, allowedTools: policy.allowedTools(user) });

while (plan.toolCalls.length > 0) {
  const call = plan.toolCalls.shift();
  if (!policy.canCallTool(user, call.name, call.args)) {
    return refuse("tool_not_allowed");
  }
  const result = await toolExecutor.execute(call, { audit: true, timeoutMs: 5000 });
  plan = await model.continue({ toolResult: result });
}

const output = await model.finalize();
return policy.validateOutput(output);

Anti-patterns

  • "The system prompt says not to do that." It will anyway. Eventually. Under load.
  • Letting the model decide which tools exist or what their parameters mean.
  • Treating moderation as a UI concern instead of an enforcement concern.

Checklist

  • Are permissions enforced outside the model?
  • Can you audit why a tool was called?
  • Can you replay the decision without calling the model again?

Principle 2: Treat model I/O as untrusted data

The model output is not an API response. It is untrusted input that happens to be fluent.

If you would not JSON.parse() an anonymous string and run it against production systems, do not do the moral equivalent with an LLM.

Patterns that work

  • Strict schemas for tool calls and structured outputs.
  • Validation + repair loops: validate, reject, ask for a corrected payload.
  • Idempotent tool execution with request IDs to prevent duplicate side effects.

Example: a tool contract that is testable.

{
  "tool": "create_ticket",
  "schema": {
    "type": "object",
    "required": ["title", "severity", "service"],
    "properties": {
      "title": { "type": "string", "minLength": 8 },
      "severity": { "type": "string", "enum": ["sev1", "sev2", "sev3"] },
      "service": { "type": "string" },
      "summary": { "type": "string" }
    },
    "additionalProperties": false
  }
}

Anti-patterns

  • Free-form tool arguments.
  • "We'll just regex it." You are building a parser with denial.
  • Treating the model output as authoritative data instead of a proposal.

Checklist

  • Are tool schemas versioned and reviewed like code?
  • Do you validate every structured output before use?
  • Do you have idempotency keys for side-effecting tools?

Principle 3: Keep state explicit and versioned

AI products feel conversational, but production behavior must be inspectable.

State lives in at least three places:

  • Conversation state: what the user said, what the system answered.
  • Workflow state: where the product is in a multi-step task.
  • Memory state: what the system persists across sessions (profiles, documents, retrieval indexes).

If any of those are implicit, you cannot debug, test, or govern the system.

Patterns that work

  • Event log per request: inputs, decisions, tool calls, retrieved chunks, outputs.
  • Version everything: prompt versions, tool schemas, router rules, index versions.
  • Explicit budgets: token caps, tool call limits, timeouts.
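Explicit budgets are easiest to enforce from one place. A minimal sketch of a per-request budget tracker (the limits, field names, and refusal strings below are illustrative, not a prescribed contract):

```typescript
// Per-request budget state: the orchestrator checks before each model call
// or tool call, so runaway loops fail fast with a named reason.
interface Budget {
  maxToolCalls: number;
  maxTokensOut: number;
  deadlineMs: number; // absolute epoch ms
}

class BudgetTracker {
  private toolCalls = 0;
  private tokensOut = 0;

  constructor(
    private readonly budget: Budget,
    private readonly now: () => number = Date.now
  ) {}

  // Returns a refusal reason if the next step would exceed a budget, else null.
  check(step: { toolCall?: boolean; tokensOut?: number }): string | null {
    if (this.now() >= this.budget.deadlineMs) return "budget_deadline_exceeded";
    if (step.toolCall && this.toolCalls + 1 > this.budget.maxToolCalls) return "budget_tool_calls_exceeded";
    if ((step.tokensOut ?? 0) + this.tokensOut > this.budget.maxTokensOut) return "budget_tokens_exceeded";
    return null;
  }

  record(step: { toolCall?: boolean; tokensOut?: number }): void {
    if (step.toolCall) this.toolCalls++;
    this.tokensOut += step.tokensOut ?? 0;
  }
}
```

The refusal strings double as the categorized refusal reasons you will want in traces later.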

Example: a minimal trace event shape.

{
  "request_id": "req_01HT...",
  "user_id": "u_123",
  "model": "gpt-4.1",
  "prompt_version": "assist.v17",
  "router": { "class": "dynamic_policy", "version": "router.v5" },
  "retrieval": { "index": "kb.v12", "top_chunk_ids": ["c_77", "c_19"], "k": 20 },
  "tools": [{ "name": "get_policy", "status": "ok", "latency_ms": 142 }],
  "output": { "status": "accepted", "tokens_out": 612 }
}

Anti-patterns

  • Relying on "whatever is in the chat" as the state store.
  • Not logging retrieved context because it is "too big".
  • Changing prompts in production without a version bump.

Checklist

  • Can you replay a request using logged inputs and versions?
  • Can you answer "what changed" when behavior changes?
  • Is long-term memory governed like data, not like a feature?

Principle 4: Design failure paths before success paths

This is the part everyone skips until the first outage.

Models fail. Retrieval fails. Tools fail. Networks fail. And the user will ask for the one thing you did not test.

A useful failure taxonomy

  • No relevant context: retrieval misses or the corpus is wrong.
  • Conflicting context: the corpus disagrees with itself (it will).
  • Tool failure: timeouts, 500s, partial results.
  • Schema failure: output does not validate.
  • Policy failure: request is disallowed.
  • Ambiguity: the question cannot be safely answered without clarification.

Patterns that work

  • Degrade gracefully: answer with what you know, ask for what you need, refuse when required.
  • Fallback models / fallback modes: smaller model, retrieval-only, citations-only.
  • Stop doing work: a hard timeout is better than a heroic spiral.
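"Stop doing work" can be as simple as racing the primary path against a deadline, with an explicit degraded mode behind it. A sketch, where `primary` and `retrievalOnly` are hypothetical stand-ins for your full pipeline and a citations-only fallback:

```typescript
// Each answer carries its mode, so degraded responses are honest and measurable.
type Answer = { mode: "full" | "retrieval_only" | "refused"; text: string };

// Reject a promise that does not settle within the deadline.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("timeout")), ms);
    p.then(
      (v) => { clearTimeout(timer); resolve(v); },
      (e) => { clearTimeout(timer); reject(e); }
    );
  });
}

// Try the full pipeline; on timeout or error, degrade to retrieval-only,
// and refuse explicitly if even that fails.
async function answerWithFallback(
  primary: () => Promise<Answer>,
  retrievalOnly: () => Promise<Answer>,
  timeoutMs: number
): Promise<Answer> {
  try {
    return await withTimeout(primary(), timeoutMs);
  } catch {
    try {
      return await withTimeout(retrievalOnly(), timeoutMs);
    } catch {
      return { mode: "refused", text: "Could not answer within budget." };
    }
  }
}
```

Because the mode is part of the return type, refusal and degradation rates fall out of your logs for free.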

Checklist

  • Do you have explicit behavior for each failure class?
  • Do you know your default on tool failure (retry, skip, or refuse)?
  • Are refusals measurable and actionable (not just "sorry")?

Principle 5: Prefer composable modules with narrow contracts

Monolithic "agent" code is fun until you need to debug it.

Prefer small modules with explicit I/O:

  • router (answer class + tool intent)
  • retriever (candidate fetch)
  • reranker
  • context packer
  • generator
  • validator / policy checker
  • tool executor
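The module boundaries above can be written down as narrow interfaces, with the pipeline reduced to composition over them. The names and shapes here are illustrative, not a prescribed contract:

```typescript
// Narrow contracts per module: each stage takes and returns plain data,
// so any one of them can be swapped, mocked, or tested in isolation.
interface Chunk { id: string; text: string; score: number }

interface Retriever { fetch(query: string, k: number): Promise<Chunk[]> }
interface Reranker { rerank(query: string, candidates: Chunk[]): Promise<Chunk[]> }
interface ContextPacker { pack(chunks: Chunk[], tokenBudget: number): string }
interface Generator { generate(context: string, query: string): Promise<string> }
interface Validator { validate(output: string): { ok: boolean; reason?: string } }

// The pipeline is just composition over the contracts.
async function answer(
  q: string,
  m: {
    retriever: Retriever;
    reranker: Reranker;
    packer: ContextPacker;
    generator: Generator;
    validator: Validator;
  }
): Promise<string> {
  const candidates = await m.retriever.fetch(q, 20);
  const ranked = await m.reranker.rerank(q, candidates);
  const context = m.packer.pack(ranked, 4000);
  const out = await m.generator.generate(context, q);
  const v = m.validator.validate(out);
  if (!v.ok) throw new Error(v.reason ?? "validation_failed");
  return out;
}
```

Swapping the retriever or model is now a one-field change, and each stage can be unit tested with stubs for its neighbors.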

This is not academic purity. It is how you make change safe.

Patterns that work

  • Deterministic boundaries: the same input yields the same output (where possible).
  • Test harnesses per module: chunking tests, retrieval tests, schema tests.
  • Replaceability: swap retrievers or models without rewriting the product.

Checklist

  • Can you unit test retrieval without calling the model?
  • Can you evaluate prompts without hitting production tools?
  • Can you swap a model without rewriting business logic?

Principle 6: Make evaluation and observability part of the architecture

If you cannot measure quality, you cannot improve it. And if you cannot trace failures, you cannot operate it.

What to evaluate

  • Retrieval quality: recall@k, MRR, nDCG (if you do retrieval)
  • Answer quality: faithfulness to context, correctness on golden sets
  • Operational quality: latency, cost per successful outcome, refusal rate
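Recall@k, for instance, is a few lines once you have relevance judgments per query (the chunk IDs below are hypothetical):

```typescript
// Recall@k: fraction of known-relevant documents that appear in the top-k
// retrieved results. Requires labeled relevance judgments per query.
function recallAtK(retrievedIds: string[], relevantIds: Set<string>, k: number): number {
  if (relevantIds.size === 0) return 0;
  const topK = retrievedIds.slice(0, k);
  let hits = 0;
  for (const id of topK) if (relevantIds.has(id)) hits++;
  return hits / relevantIds.size;
}
```

Running this over a golden set on every index or prompt change is the regression gate.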

Patterns that work

  • Golden sets that reflect real traffic (50-200 cases to start).
  • Regression runs on prompt/model/index changes.
  • "LLM-as-judge" used carefully, with drift monitoring and spot checks.

What to instrument

  • request ID through every component
  • prompt + model version
  • retrieval candidates and chosen chunks
  • tool calls and results (with redaction)
  • validation outcomes and refusal reasons

Related: AI Observability Basics

Checklist

  • Do you have a golden set and a regression gate?
  • Can you answer "what changed" after a deployment?
  • Can you attribute cost to outcomes, not tokens?

Principle 7: Optimize for operator cognition

Operators do not need more telemetry. They need less ambiguity.

If an operator cannot explain a failure in one screen, the system is too opaque.

The one-screen rule

For any request ID, you should be able to see:

  • router decision
  • retrieval top chunks (with source links)
  • tool calls (args, status, latency)
  • final output + validation results
  • versions (model, prompt, index, schemas)

If your debugging process begins with "paste the conversation into ChatGPT", you have built a system that requires an oracle to operate.

Checklist

  • Can an engineer debug without reproducing the issue live?
  • Do you have stable IDs for chunks, tools, prompts, and routers?
  • Are refusal reasons categorized and visible?

A pragmatic rollout sequence

If you want a sane order of operations:

  1. Policy enforcement + tool schemas (interpretation vs enforcement)
  2. Explicit state + versioning + request traces
  3. Retrieval done properly (hybrid + rerank + context packing)
  4. Golden set evals + regression gates
  5. Operator UX (one-screen traces + replay)
  6. Only then: more autonomy, more tools, more "agents"

The rule of thumb: increase autonomy only when you can constrain and observe it.