Generative AI: A Systems and Architecture Reference
An engineer-first map of generative models: what they optimize, how inference behaves, and what that implies for production architecture.
By Ryan Setter
Generative AI is not a feature. It is a probabilistic runtime that produces artifacts (text, code, images, audio) by sampling from a learned distribution.
That framing matters because it forces the right engineering questions:
- What distribution did we actually learn?
- What conditioning signals do we control at inference time?
- What failure modes are intrinsic to sampling?
- Where do we draw deterministic boundaries so the system is operable?
If you are looking for a beginner tutorial, this is not it. This is the systems view: objectives, inference mechanics, and architectural consequences.
The definition that survives contact with production
A generative model learns a distribution and then samples from it.
Two common forms:
- Unconditional generation: learn p(x) and sample x.
- Conditional generation: learn p(y | c), where c is conditioning context (prompt, image, retrieved documents, tool results, metadata), then sample y.
In practice, production systems almost always use conditional generation. Unconditional generation is fun. Conditional generation is how you ship.
Taxonomy (by generative process, not by marketing)
Most deployed generative models cluster into a few process families. The family determines latency shape, controllability, and serving constraints.
Autoregressive (token-by-token)
Large language models (LLMs) are typically autoregressive:
p(y_1..y_T | c) = prod_t p(y_t | y_<t, c)
Implications:
- Latency is sequential (you cannot generate token t+1 before token t).
- Cost scales with tokens_in + tokens_out.
- The decoder policy (temperature/top-p/stop) is part of the product behavior.
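The sequential dependency above is worth seeing in code. This is a toy sketch, not any real model API: `next_token_dist` is a hypothetical stand-in for a forward pass, and the vocabulary is four tokens. The point is the shape of the loop, which is why output length dominates latency.

```python
import random

def next_token_dist(context):
    # Hypothetical stand-in for a real model forward pass: returns a
    # probability distribution over a toy four-token vocabulary.
    vocab = ["the", "cat", "sat", "<eos>"]
    probs = [0.4, 0.3, 0.2, 0.1]
    return vocab, probs

def generate(context, max_tokens=8):
    # The loop is inherently sequential: token t+1 conditions on token t,
    # so wall-clock latency grows with output length, not just input length.
    out = []
    for _ in range(max_tokens):
        vocab, probs = next_token_dist(context + out)
        token = random.choices(vocab, weights=probs, k=1)[0]
        if token == "<eos>":
            break  # the model, not the caller, decides when to stop
        out.append(token)
    return out
```

Real servers amortize this with batching and speculative decoding, but the per-request dependency chain does not go away.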
Diffusion (iterative denoising)
Diffusion models generate by reversing a noise process over multiple steps.
Implications:
- Latency scales with number of steps (often dozens). Distillation reduces steps but changes failure modes.
- Conditioning often happens via guidance or cross-attention; control surfaces differ from LLM prompts.
- Serving is usually throughput-oriented (batching helps) but step loops complicate tail latency.
Latent token models / masked modeling (fill-in)
Many image/video approaches are effectively "predict missing parts" in a latent space.
Implications:
- Great for editing/inpainting style workflows.
- Less natural for long open-ended sequences.
The useful engineer question is: what is the generation loop, and what does it do to tail latency and debuggability?
Training objectives: what the model is actually optimized to do
The model does not learn "truth". It learns to optimize a training objective under a data distribution.
For autoregressive LLMs, pretraining is typically maximum likelihood / next-token prediction:
minimize: E_{(c,y)~data}[ -log p_theta(y | c) ]
That objective buys you a strong conditional generator. It does not buy you:
- factuality,
- policy compliance,
- tool correctness,
- or your company's definition of "done".
Those behaviors come from additional stages (instruction tuning, preference optimization, tool-call finetunes) and from the surrounding system.
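The maximum-likelihood objective above is easy to make concrete. This sketch computes mean per-token negative log-likelihood from the probabilities a model assigned to the observed next tokens; the helper name and inputs are illustrative, not any library's API.

```python
import math

def next_token_nll(token_probs):
    # token_probs: the probability the model assigned to each *observed*
    # next token in a sequence. Pretraining minimizes the mean of
    # -log p over the data distribution.
    return sum(-math.log(p) for p in token_probs) / len(token_probs)

# A model that puts high probability on the observed tokens has lower loss
# than one that hedges uniformly -- regardless of whether the data was true.
confident = next_token_nll([0.9, 0.8, 0.95])
uniform = next_token_nll([0.25, 0.25, 0.25])
```

Note what the loss rewards: matching the data distribution. Nothing in it distinguishes a confidently reproduced falsehood from a fact.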
Post-training is policy shaping, not capability creation
Instruction tuning / preference optimization (RLHF variants, DPO-like objectives) reshape the model into something closer to a conversational policy.
Common production consequences:
- Refusal behavior is a learned policy; it can change with model updates.
- "Helpfulness" can increase verbosity and confidence without increasing correctness.
- Models become better at aligning with instructions, which makes prompt-injection a first-class security problem.
Inference is where architecture gets real
The serving behavior you observe is a function of:
- model weights,
- runtime kernels and precision,
- context window and caching,
- and decoding strategy.
Treat "the model" as a component with a large configuration surface.
Tokens, context windows, and why memory is an illusion
Tokenization is not a UX detail. It is the unit of compute and billing.
The context window is a bounded working set:
- The model conditions on the provided context.
- It does not persist that context unless your system does.
If you want persistence, you need explicit memory architecture (retrieval, profiles, state stores). Related: Retrieval Strategy Playbook
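Explicit memory architecture means your code decides what enters the window. A minimal sketch, with assumed inputs (a system prompt, pre-ranked memory snippets, a token budget) and a crude whitespace token counter standing in for the model's real tokenizer:

```python
def assemble_context(system_prompt, memory_snippets, user_msg, budget_tokens,
                     count_tokens=lambda s: len(s.split())):
    # The model "remembers" only what this function puts in the window.
    # count_tokens is a whitespace proxy; production systems must use the
    # serving model's actual tokenizer, or budgets will be wrong.
    parts = [system_prompt]
    used = count_tokens(system_prompt) + count_tokens(user_msg)
    for snippet in memory_snippets:  # assumed pre-ranked, most relevant first
        cost = count_tokens(snippet)
        if used + cost > budget_tokens:
            break  # persistence is bounded: lower-ranked memory is dropped
        parts.append(snippet)
        used += cost
    parts.append(user_msg)
    return "\n\n".join(parts)
```

Everything the function drops is, from the model's point of view, forgotten. That is the whole memory model.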
KV cache: the hidden serving budget
Autoregressive Transformers rely on caching key/value tensors for prior tokens.
Practical implication: serving often becomes a VRAM allocation problem.
- Long contexts and many concurrent sessions inflate cache memory.
- Cache pressure drives OOMs, eviction strategies, and multi-tenant isolation concerns.
If your scaling plan does not include KV cache math, your scaling plan is a mood board.
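Here is the mood-board antidote. For standard multi-head or grouped-query attention with an fp16/bf16 cache, per-sequence KV memory is 2 (keys and values) x layers x KV heads x head dim x sequence length x bytes per element. The model shape below is illustrative, loosely 7B-class, not any specific vendor's spec:

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # 2x covers keys and values; bytes_per_elem=2 assumes fp16/bf16.
    # GQA models reduce n_kv_heads, which is exactly why the knob is exposed.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative 7B-class shape: 32 layers, 32 KV heads, head_dim 128.
per_session = kv_cache_bytes(seq_len=8192, n_layers=32,
                             n_kv_heads=32, head_dim=128)
per_session_gib = per_session / 2**30  # 4.0 GiB per 8k-token session
```

At 4 GiB per 8k-token session, ten concurrent sessions exceed a 40 GB card before you have loaded a single weight. This is why the math belongs in the scaling plan.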
Decoding: your product chooses a point on the determinism spectrum
Decoding is where you decide whether the system behaves like:
- a deterministic service (low temperature, constrained outputs),
- a stochastic generator (higher temperature/top-p), or
- a constrained planner (structured tool calls + validation loops).
Useful rule:
- For decisions and actions: prefer constrained, schema-validated outputs.
- For drafts and ideation: controlled stochasticity can be a feature.
If you need repeatability, you need more than a seed. You need stable versions of: model, prompt, runtime, and any retrieved context.
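To ground the determinism spectrum, here is a self-contained sketch of temperature plus nucleus (top-p) sampling over a logits dict. It is illustrative of the mechanics, not a production sampler (real ones work on tensors and handle ties and numerical stability):

```python
import math
import random

def sample_top_p(logits, temperature=1.0, top_p=1.0, rng=random):
    # temperature -> 0 collapses to greedy argmax (deterministic);
    # top_p < 1 truncates the tail of the distribution (nucleus sampling).
    if temperature <= 0:
        return max(logits, key=logits.get)  # greedy decoding
    scaled = {t: l / temperature for t, l in logits.items()}
    z = sum(math.exp(l) for l in scaled.values())
    probs = sorted(((t, math.exp(l) / z) for t, l in scaled.items()),
                   key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for token, p in probs:
        kept.append((token, p))
        cum += p
        if cum >= top_p:
            break  # drop the low-probability tail
    tokens, weights = zip(*kept)
    return rng.choices(tokens, weights=weights, k=1)[0]
```

Note that even `temperature=0` only pins down the sampler. The repeatability caveat above still applies: model, prompt, runtime, and retrieved context must all be versioned.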
Failure modes (classify them like an engineer)
"Hallucination" is a label, not a diagnosis. In production, you want a taxonomy that points to a fix.
1) Ungrounded generation
The model produces plausible text that is not supported by any authoritative source.
Fix surface:
- route to retrieval / tools when the answer class requires it,
- require citations or structured evidence,
- add post-checks (faithfulness, schema validation).
2) Wrong answer class (routing failure)
The system asked the model to generate when it should have looked up, retrieved, or called a tool.
Fix surface:
- explicit router with versioned labels,
- regressions tracked with golden sets.
3) Context poisoning and instruction collision
Retrieved text, user content, or tool output contains instructions that override system intent.
Fix surface:
- strict separation of instruction channels (system vs retrieved vs user),
- retrieval sanitization, tool gating, least-privilege execution,
- treat model I/O as untrusted data.
4) Format and contract violations
The model does not follow your schema, even if you asked nicely.
Fix surface:
- schema-first outputs + validation + repair loops,
- reduce degrees of freedom (shorter outputs, fewer fields),
- avoid free-form tool arguments.
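The schema-first fix surface above is worth one concrete loop. A minimal sketch: deterministic validation in code, with the validator's error fed back to the model on retry. `call_model` is a hypothetical hook into your LLM client, and the "schema" here is just required JSON fields; real systems would use a proper schema validator.

```python
import json

def validate(output, required_fields):
    # Deterministic contract enforcement: the check is code, not another
    # model call. Returns (data, error_message).
    try:
        data = json.loads(output)
    except json.JSONDecodeError as e:
        return None, f"invalid JSON: {e}"
    missing = [f for f in required_fields if f not in data]
    if missing:
        return None, f"missing fields: {missing}"
    return data, None

def generate_with_repair(call_model, required_fields, max_attempts=3):
    # call_model(error_feedback) is a hypothetical hook into your LLM client;
    # on retry it should include the validator error in the prompt.
    feedback = None
    for _ in range(max_attempts):
        data, err = validate(call_model(feedback), required_fields)
        if data is not None:
            return data
        feedback = err
    raise ValueError("model failed schema contract after repairs")
```

The escalation path matters: after bounded repairs, fail loudly to a deterministic handler rather than shipping a malformed contract downstream.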
Related architectural framing: Architecture Principles for AI Products
Architecture patterns that survive production
Generative models are best treated as interpreters and proposal engines. Deterministic systems should enforce constraints and execute actions.
Pattern: interpret -> retrieve -> decide -> act
Reference pipeline (minimal but robust):
User input
-> API boundary (authn/authz, rate limits, input validation)
-> Router (answer class, tool intent, risk class)
-> (Optional) Retrieval / tools (governed, audited)
-> Generator (LLM) with explicit constraints
-> Validators (schema, policy, grounding)
-> Output renderer / action executor
If you merge these stages into a single "agent" loop without contracts and logs, you will get a demo that cannot be debugged.
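The router stage deserves a sketch, because it is the stage teams most often skip. This toy version is keyword rules; production routers are versioned classifiers or rule sets with golden-set regression tests, and the keyword list here is invented for illustration:

```python
def route(request_text, risk_keywords=("delete", "refund", "transfer")):
    # Toy deterministic router. The contract is what matters: every request
    # gets an explicit route, risk class, and approval flag that can be
    # logged, versioned, and regression-tested.
    text = request_text.lower()
    if any(k in text for k in risk_keywords):
        return {"route": "tool", "risk": "high", "requires_approval": True}
    if text.endswith("?"):
        return {"route": "retrieve_then_generate", "risk": "low",
                "requires_approval": False}
    return {"route": "generate", "risk": "low", "requires_approval": False}
```

Because the router's output is structured data, it is the natural place to attach audit logs and rollbackable versions, which is exactly what a fused "agent" loop lacks.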
Pattern: make the model argue with evidence, not with confidence
If the domain cares about correctness, require the output to carry its own traceability:
- citations to retrieved chunk IDs,
- tool outputs referenced by request-scoped IDs,
- explicit uncertainty and abstention behavior.
The model should generate claims. The system should require support.
RAG vs fine-tuning vs "just prompt it" (a decision table)
Architecturally, these are different levers:
| Need | Best lever | Why |
|---|---|---|
| Fresh, changing facts (policies, runbooks, tickets) | Retrieval + tools | Update data without retraining; enforce permissions |
| Private corpora with governance requirements | Retrieval with ACL filters | You need access control on memory |
| Stable style/format, narrow domain phrasing | Fine-tune / adapters | Shift output distribution toward your conventions |
| New capabilities (reasoning, multi-step planning) | Usually not fine-tuning | You need better base models + system constraints |
| Hard constraints ("never do X") | Deterministic enforcement | The model is not a policy engine |
Prompts are configuration. Retrieval is memory. Fine-tuning is behavior shaping. Policy is code.
Evaluation: measure behavior, not vibes
Generative systems need evaluation at two levels:
- Component-level: retrieval quality, schema compliance, tool correctness.
- System-level: task success, refusal correctness, operator burden, cost per successful outcome.
Practical baseline:
- 50-200 golden cases that reflect real traffic.
- regression gates on model/prompt/index changes.
- logging of model + prompt + index versions per request.
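The regression-gate item above can be a few lines of CI glue. A minimal sketch, assuming golden cases are (input, check) pairs where the check is an executable predicate on the system's output:

```python
def regression_gate(golden_cases, run_system, min_pass_rate=0.9):
    # golden_cases: list of (input, check) pairs, check(output) -> bool.
    # run_system: the full pipeline under test (model + prompt + index),
    # pinned to the candidate versions. Returns data CI can block on.
    failures = []
    for case_input, check in golden_cases:
        output = run_system(case_input)
        if not check(output):
            failures.append(case_input)
    pass_rate = 1 - len(failures) / len(golden_cases)
    return pass_rate >= min_pass_rate, pass_rate, failures
```

Run it on every model, prompt, or index change; the returned failure list is what makes a red gate debuggable rather than a vibe.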
Related operations view: AI Observability Basics
Security: treat the prompt as a control plane
Generative AI expands your attack surface because it accepts instructions from:
- users,
- retrieved documents,
- tool outputs,
- and sometimes other models.
If any of those can influence tool execution or data access without deterministic enforcement, you have built a polite remote code execution surface.
Minimum viable defenses:
- strict authz around tools and retrieval,
- schema validation for tool arguments,
- audit logs for tool calls and data reads,
- prompt-injection monitoring (attempts are signals).
Cost and performance: tokens are the new network egress
From an architecture standpoint:
- Tokens are a metered input/output.
- Latency is dominated by model time plus any retrieval/tool calls.
- Tail latency is often driven by retries, tool timeouts, and long generations.
High-leverage knobs:
- route aggressively (not every request deserves the biggest model),
- cap output length and require structured outputs where possible,
- cache what is safe to cache (retrieval candidates, reranks, tool results),
- treat quantization/precision changes as model variants with their own evals.
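Tying the knobs back to the evaluation section's "cost per successful outcome": the metric is worth computing per route, because retries and failed generations still burn tokens. A sketch with invented request records and per-1k-token prices (placeholders, not any vendor's rates):

```python
def cost_per_success(requests, price_in_per_1k, price_out_per_1k):
    # requests: dicts with tokens_in, tokens_out, success (bool).
    # Failed calls still contribute spend but not successes, which is
    # why cost per call understates the real unit economics.
    spend = sum((r["tokens_in"] * price_in_per_1k +
                 r["tokens_out"] * price_out_per_1k) / 1000
                for r in requests)
    successes = sum(1 for r in requests if r["success"])
    return spend / successes if successes else float("inf")
```

Computed per route and per model tier, this number is what justifies (or kills) sending traffic to the biggest model.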
A checklist for architects (the non-hype version)
Before you call something "generative AI in production", you should be able to answer:
- What answer classes exist, and what routes to retrieval/tools vs generation?
- What is the system-of-record for facts, and how is freshness handled?
- What are the deterministic enforcement points (policy, schemas, tool gating)?
- What is versioned (model, prompt, router, tool schemas, index), and how do you roll back?
- What is measured (quality, cost, latency, refusals), and what is the operator workflow?
If you cannot answer those, you do not have an AI system. You have a text generator attached to a pager.