Architecture Principles for AI Products

Core principles for building maintainable, testable, and resilient AI products.

By Ryan Setter

10/18/2025 · 6 min read

AI products fail in predictable ways. Not because models are "bad", but because we keep wrapping a probabilistic component in a deterministic product and then acting surprised.

These principles are not theory. They are the boring architecture moves that let you ship AI features without turning your on-call rotation into a literary genre.

AI products are systems, not models

The model is a component.

  • It interprets language and generates tokens.
  • It does not enforce policy.
  • It does not own state.
  • It does not provide accountability.

Your product does all of that.

If you design the system boundary correctly, model upgrades are manageable. If you do not, every model change becomes a production incident with better marketing.

Reference architecture (minimal viable AI system)

Most production AI features converge on a similar shape:

User
  -> API boundary (authn/authz, rate limits, input validation)
  -> Orchestrator (routing, budgets, retries, timeouts)
      -> Context assembly (instructions, memory, retrieval)
      -> Tool executor (least privilege, schemas, audit logs)
      -> Model call (versioned prompt + model)
      -> Post-processing (schema validation, policy checks, redaction)
  -> Output (UI constraints, citations, follow-up actions)

Observability and evaluation span all of it.

The rest of this article is how to make each box survivable.

Principle 1: Separate interpretation from enforcement

Models interpret. Systems enforce.

If you let a model both interpret the request and enforce the rules, you have built a policy engine that cannot be audited, cannot be tested, and will eventually be "aligned" into refusing your own product requirements.

Patterns that work

  • Policy as code: enforce permissions, safety constraints, and data access in deterministic code.
  • Tool gating: the model can propose a tool call, but the system decides whether it happens.
  • Schema-first outputs: the model can propose structured output, but the system validates it.

Example: tool calls that always pass through enforcement.

// Pseudocode: every tool call the model proposes passes through policy first.
const route = await router.classify(request);
let plan = await model.proposePlan({ request, route, allowedTools: policy.allowedTools(user) });

while (plan.toolCalls.length > 0) {
  const call = plan.toolCalls.shift();
  if (!policy.canCallTool(user, call.name, call.args)) {
    return refuse("tool_not_allowed");
  }
  const result = await toolExecutor.execute(call, { audit: true, timeoutMs: 5000 });
  plan = await model.continue({ toolResult: result });
}

const output = await model.finalize();
return policy.validateOutput(output);

Anti-patterns

  • "The system prompt says not to do that." It will anyway. Eventually. Under load.
  • Letting the model decide which tools exist or what their parameters mean.
  • Treating moderation as a UI concern instead of an enforcement concern.

Checklist

  • Are permissions enforced outside the model?
  • Can you audit why a tool was called?
  • Can you replay the decision without calling the model again?

Principle 2: Treat model I/O as untrusted data

The model output is not an API response. It is untrusted input that happens to be fluent.

If you would not JSON.parse() an anonymous string and run it against production systems, do not do the moral equivalent with an LLM.

Patterns that work

  • Strict schemas for tool calls and structured outputs.
  • Validation + repair loops: validate, reject, ask for a corrected payload.
  • Idempotent tool execution with request IDs to prevent duplicate side effects.

Example: a tool contract that is testable.

{
  "tool": "create_ticket",
  "schema": {
    "type": "object",
    "required": ["title", "severity", "service"],
    "properties": {
      "title": { "type": "string", "minLength": 8 },
      "severity": { "type": "string", "enum": ["sev1", "sev2", "sev3"] },
      "service": { "type": "string" },
      "summary": { "type": "string" }
    },
    "additionalProperties": false
  }
}

Anti-patterns

  • Free-form tool arguments.
  • "We'll just regex it." You are building a parser with denial.
  • Treating the model output as authoritative data instead of a proposal.

Checklist

  • Are tool schemas versioned and reviewed like code?
  • Do you validate every structured output before use?
  • Do you have idempotency keys for side-effecting tools?

Principle 3: Keep state explicit and versioned

AI products feel conversational, but production behavior must be inspectable.

State lives in at least three places:

  • Conversation state: what the user said, what the system answered.
  • Workflow state: where the product is in a multi-step task.
  • Memory state: what the system persists across sessions (profiles, documents, retrieval indexes).

If any of those are implicit, you cannot debug, test, or govern the system.

Patterns that work

  • Event log per request: inputs, decisions, tool calls, retrieved chunks, outputs.
  • Version everything: prompt versions, tool schemas, router rules, index versions.
  • Explicit budgets: token caps, tool call limits, timeouts.
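Explicit budgets are easiest to enforce from one place. A minimal sketch of a per-request budget tracker (the limits, field names, and refusal strings below are illustrative, not a prescribed contract):

```typescript
// Per-request budget state: the orchestrator checks before each model call
// or tool call, so runaway loops fail fast with a named reason.
interface Budget {
  maxToolCalls: number;
  maxTokensOut: number;
  deadlineMs: number; // absolute epoch ms
}

class BudgetTracker {
  private toolCalls = 0;
  private tokensOut = 0;

  constructor(
    private readonly budget: Budget,
    private readonly now: () => number = Date.now
  ) {}

  // Returns a refusal reason if the next step would exceed a budget, else null.
  check(step: { toolCall?: boolean; tokensOut?: number }): string | null {
    if (this.now() >= this.budget.deadlineMs) return "budget_deadline_exceeded";
    if (step.toolCall && this.toolCalls + 1 > this.budget.maxToolCalls) return "budget_tool_calls_exceeded";
    if ((step.tokensOut ?? 0) + this.tokensOut > this.budget.maxTokensOut) return "budget_tokens_exceeded";
    return null;
  }

  record(step: { toolCall?: boolean; tokensOut?: number }): void {
    if (step.toolCall) this.toolCalls++;
    this.tokensOut += step.tokensOut ?? 0;
  }
}
```

The refusal strings double as the categorized refusal reasons you will want in traces later.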

Example: a minimal trace event shape.

{
  "request_id": "req_01HT...",
  "user_id": "u_123",
  "model": "gpt-4.1",
  "prompt_version": "assist.v17",
  "router": { "class": "dynamic_policy", "version": "router.v5" },
  "retrieval": { "index": "kb.v12", "top_chunk_ids": ["c_77", "c_19"], "k": 20 },
  "tools": [{ "name": "get_policy", "status": "ok", "latency_ms": 142 }],
  "output": { "status": "accepted", "tokens_out": 612 }
}

Anti-patterns

  • Relying on "whatever is in the chat" as the state store.
  • Not logging retrieved context because it is "too big".
  • Changing prompts in production without a version bump.

Checklist

  • Can you replay a request using logged inputs and versions?
  • Can you answer "what changed" when behavior changes?
  • Is long-term memory governed like data, not like a feature?

Principle 4: Design failure paths before success paths

This is the part everyone skips until the first outage.

Models fail. Retrieval fails. Tools fail. Networks fail. And the user will ask for the one thing you did not test.

A useful failure taxonomy

  • No relevant context: retrieval misses or the corpus is wrong.
  • Conflicting context: the corpus disagrees with itself (it will).
  • Tool failure: timeouts, 500s, partial results.
  • Schema failure: output does not validate.
  • Policy failure: request is disallowed.
  • Ambiguity: the question cannot be safely answered without clarification.

Patterns that work

  • Degrade gracefully: answer with what you know, ask for what you need, refuse when required.
  • Fallback models / fallback modes: smaller model, retrieval-only, citations-only.
  • Stop doing work: a hard timeout is better than a heroic spiral.
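"Stop doing work" can be as simple as racing the primary path against a deadline, with an explicit degraded mode behind it. A sketch, where `primary` and `retrievalOnly` are hypothetical stand-ins for your full pipeline and a citations-only fallback:

```typescript
// Each answer carries its mode, so degraded responses are honest and measurable.
type Answer = { mode: "full" | "retrieval_only" | "refused"; text: string };

// Reject a promise that does not settle within the deadline.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("timeout")), ms);
    p.then(
      (v) => { clearTimeout(timer); resolve(v); },
      (e) => { clearTimeout(timer); reject(e); }
    );
  });
}

// Try the full pipeline; on timeout or error, degrade to retrieval-only,
// and refuse explicitly if even that fails.
async function answerWithFallback(
  primary: () => Promise<Answer>,
  retrievalOnly: () => Promise<Answer>,
  timeoutMs: number
): Promise<Answer> {
  try {
    return await withTimeout(primary(), timeoutMs);
  } catch {
    try {
      return await withTimeout(retrievalOnly(), timeoutMs);
    } catch {
      return { mode: "refused", text: "Could not answer within budget." };
    }
  }
}
```

Because the mode is part of the return type, refusal and degradation rates fall out of your logs for free.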

Checklist

  • Do you have explicit behavior for each failure class?
  • Do you know your default on tool failure (retry, skip, or refuse)?
  • Are refusals measurable and actionable (not just "sorry")?

Principle 5: Prefer composable modules with narrow contracts

Monolithic "agent" code is fun until you need to debug it.

Prefer small modules with explicit I/O:

  • router (answer class + tool intent)
  • retriever (candidate fetch)
  • reranker
  • context packer
  • generator
  • validator / policy checker
  • tool executor
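The module boundaries above can be written down as narrow interfaces, with the pipeline reduced to composition over them. The names and shapes here are illustrative, not a prescribed contract:

```typescript
// Narrow contracts per module: each stage takes and returns plain data,
// so any one of them can be swapped, mocked, or tested in isolation.
interface Chunk { id: string; text: string; score: number }

interface Retriever { fetch(query: string, k: number): Promise<Chunk[]> }
interface Reranker { rerank(query: string, candidates: Chunk[]): Promise<Chunk[]> }
interface ContextPacker { pack(chunks: Chunk[], tokenBudget: number): string }
interface Generator { generate(context: string, query: string): Promise<string> }
interface Validator { validate(output: string): { ok: boolean; reason?: string } }

// The pipeline is just composition over the contracts.
async function answer(
  q: string,
  m: {
    retriever: Retriever;
    reranker: Reranker;
    packer: ContextPacker;
    generator: Generator;
    validator: Validator;
  }
): Promise<string> {
  const candidates = await m.retriever.fetch(q, 20);
  const ranked = await m.reranker.rerank(q, candidates);
  const context = m.packer.pack(ranked, 4000);
  const out = await m.generator.generate(context, q);
  const v = m.validator.validate(out);
  if (!v.ok) throw new Error(v.reason ?? "validation_failed");
  return out;
}
```

Swapping the retriever or model is now a one-field change, and each stage can be unit tested with stubs for its neighbors.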

This is not academic purity. It is how you make change safe.

Patterns that work

  • Deterministic boundaries: the same input yields the same output (where possible).
  • Test harnesses per module: chunking tests, retrieval tests, schema tests.
  • Replaceability: swap retrievers or models without rewriting the product.

Checklist

  • Can you unit test retrieval without calling the model?
  • Can you evaluate prompts without hitting production tools?
  • Can you swap a model without rewriting business logic?

Principle 6: Make evaluation and observability part of the architecture

If you cannot measure quality, you cannot improve it. And if you cannot trace failures, you cannot operate it.

What to evaluate

  • Retrieval quality: recall@k, MRR, nDCG (if you do retrieval)
  • Answer quality: faithfulness to context, correctness on golden sets
  • Operational quality: latency, cost per successful outcome, refusal rate
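Recall@k, for instance, is a few lines once you have relevance judgments per query (the chunk IDs below are hypothetical):

```typescript
// Recall@k: fraction of known-relevant documents that appear in the top-k
// retrieved results. Requires labeled relevance judgments per query.
function recallAtK(retrievedIds: string[], relevantIds: Set<string>, k: number): number {
  if (relevantIds.size === 0) return 0;
  const topK = retrievedIds.slice(0, k);
  let hits = 0;
  for (const id of topK) if (relevantIds.has(id)) hits++;
  return hits / relevantIds.size;
}
```

Running this over a golden set on every index or prompt change is the regression gate.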

Patterns that work

  • Golden sets that reflect real traffic (50-200 cases to start).
  • Regression runs on prompt/model/index changes.
  • "LLM-as-judge" used carefully, with drift monitoring and spot checks.

What to instrument

  • request ID through every component
  • prompt + model version
  • retrieval candidates and chosen chunks
  • tool calls and results (with redaction)
  • validation outcomes and refusal reasons

Related: AI Observability Basics

Checklist

  • Do you have a golden set and a regression gate?
  • Can you answer "what changed" after a deployment?
  • Can you attribute cost to outcomes, not tokens?

Principle 7: Optimize for operator cognition

Operators do not need more telemetry. They need less ambiguity.

If an operator cannot explain a failure in one screen, the system is too opaque.

The one-screen rule

For any request ID, you should be able to see:

  • router decision
  • retrieval top chunks (with source links)
  • tool calls (args, status, latency)
  • final output + validation results
  • versions (model, prompt, index, schemas)

If your debugging process begins with "paste the conversation into ChatGPT", you have built a system that requires an oracle to operate.

Checklist

  • Can an engineer debug without reproducing the issue live?
  • Do you have stable IDs for chunks, tools, prompts, and routers?
  • Are refusal reasons categorized and visible?

A pragmatic rollout sequence

If you want a sane order of operations:

  1. Policy enforcement + tool schemas (interpretation vs enforcement)
  2. Explicit state + versioning + request traces
  3. Retrieval done properly (hybrid + rerank + context packing)
  4. Golden set evals + regression gates
  5. Operator UX (one-screen traces + replay)
  6. Only then: more autonomy, more tools, more "agents"

The rule of thumb: increase autonomy only when you can constrain and observe it.