AI Terminology
A high-signal terminology reference for modern AI systems (LLMs, retrieval, agents, and production concerns).
By Ryan Setter
2/19/2026 · 10 min read
This reference defines terms the way they behave in real systems, not the way they are marketed.
Core framing
- Model vs system: The model produces tokens; the system provides context, tools, policy, memory, and accountability.
- Probabilistic output: Even at temperature 0, you are getting deterministic decoding over uncertain internal representations, not a truth oracle.
- Interfaces matter: Most failures are integration failures (context assembly, tool schemas, permissions, evaluation), not "bad prompts".
Model primitives
- Context window: The maximum tokens available for prompt + retrieved context + tool traces + output.
- Decoding: The strategy that turns probabilities into text (greedy, sampled, beam, speculative).
- Logits: Pre-softmax scores for next-token probabilities.
- Logprob: Log probability of a token (or sequence); useful for confidence heuristics and routing.
- Prompt: The input sequence; in practice it is a structured envelope (instructions + context + constraints), not a paragraph.
- Stop sequence: A delimiter that truncates generation; common footgun when the model emits the delimiter in normal text.
- Streaming: Incremental token output; improves UX but complicates moderation, tool gating, and audits.
- System prompt: The highest-priority instruction channel (when supported); treat it like policy, not copywriting.
- Temperature: Sampling noise; higher values increase diversity and error variance.
- Token: The discrete unit a model reads/writes (not always a word); tokens drive cost, latency, and context limits.
- Tokenizer: The algorithm mapping text to tokens (often BPE-like); a hidden source of failure for non-English, code, and weird punctuation.
- Top-k: Samples from the k most likely next tokens; constrains tails, sometimes too aggressively.
- Top-p (nucleus sampling): Samples from the smallest set of tokens whose cumulative probability is >= p; often a more stable dial than temperature.
- User prompt: The user-supplied content; by default it is untrusted input.
- Vocabulary: The token set the model knows; changing it is not a config toggle, it is a retrain.
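The sampling dials above compose into a small amount of code. A minimal sketch of top-p (nucleus) sampling, assuming a plain list of next-token probabilities rather than real logits (the toy distribution is invented for illustration):

```python
import random

def top_p_sample(probs, p=0.9, rng=random):
    """Sample a token index from the smallest set of tokens whose
    cumulative probability mass reaches p (nucleus sampling)."""
    # Sort token indices by descending probability.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= p:
            break
    # Renormalize within the nucleus and sample from it.
    total = sum(probs[i] for i in nucleus)
    r = rng.random() * total
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]

probs = [0.5, 0.3, 0.15, 0.05]   # toy next-token distribution
idx = top_p_sample(probs, p=0.8)  # nucleus is {token 0, token 1} here
```

Note how a small p collapses toward greedy decoding: with p=0.1, only the most likely token survives, which is one reason top-p is often a gentler dial than temperature.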
Architectures and model types
- Attention: Content-addressed mixing over prior tokens; for full attention, compute scales quadratically with sequence length, while KV memory scales linearly.
- Autoregressive LM: Generates one token at a time conditioned on prior tokens; most chat LLMs are this.
- Diffusion model: Iterative denoising generator (common in images/video); different failure modes than autoregressive LMs.
- Encoder-decoder: Separate encode/decode stages (classic translation); less common for frontier chat LLMs.
- Mixture of Experts (MoE): Routes tokens through subsets of parameters; cheaper per token at inference, harder to serve.
- Multi-head attention (MHA): Multiple attention subspaces; improves representation capacity.
- Multimodal model: Accepts/produces multiple modalities (text, images, audio); reliability depends on modality adapters.
- Positional encoding: How the model represents order; modern LLMs often use RoPE-like schemes.
- RoPE: Rotary position embeddings; enables longer contexts, with caveats around extrapolation.
- Self-attention: Attention within a single sequence; the engine of in-context "working memory".
- Sparse model: Activates only parts of the network per token (MoE is one approach); favors throughput, complicates debugging.
- State-space model (SSM): Sequence models with linear-time recurrence properties; interesting tradeoffs, not a drop-in replacement.
- Transformer: The dominant architecture for language/video/vision generation; attention is the core operator.
- VLM: Vision-language model; typically vision encoder + language decoder.
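The attention operator itself is small. A minimal single-head self-attention sketch in plain Python, assuming Q = K = V = X (no learned projections) so the content-addressed mixing stays visible; real layers add projection matrices, masking, and multiple heads:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(X):
    """Single-head self-attention over a toy sequence.
    X is a list of d-dimensional vectors (one per token)."""
    d = len(X[0])
    scale = math.sqrt(d)
    out = []
    for q in X:
        # Content-addressed scores: dot(q, k) / sqrt(d) against every position.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in X]
        weights = softmax(scores)
        # Mix values by attention weight (a convex combination of rows of X).
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = self_attention(X)  # same shape as X; each row mixes all positions
```

The double loop over positions is the quadratic cost mentioned above: every query attends to every key.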
Training and adaptation
- Adapters: Small trainable modules inserted into a frozen base; cheaper to train and easier to swap.
- Alignment: Shaping behavior toward human/organizational goals; includes safety, reliability, and policy.
- Catastrophic forgetting: Fine-tuning that degrades general capability; often a sign of overly aggressive adaptation.
- Data contamination: Evaluation items leaking into training; the fastest way to fool yourself.
- DPO: Direct preference optimization; simpler preference learning alternative to RLHF-style pipelines.
- Distillation: Training a smaller model to mimic a larger one; useful for latency/cost, but can harden quirks.
- Fine-tuning: Updating model weights for a task/domain; powerful, but adds lifecycle and drift obligations.
- Instruction tuning: Training on (instruction, response) pairs to shape interactive behavior.
- LoRA: Low-rank adapters; common default for efficient fine-tuning.
- Next-token prediction: The classic LM objective; surprisingly broad in what it induces.
- Pretraining: Learning general representations from large corpora; where most capability comes from.
- Prompt tuning: Learning soft prompt vectors; niche but useful for constrained deployments.
- QLoRA: LoRA with quantized base weights during training; reduces VRAM requirements.
- RLAIF: RL from AI feedback; scales preference collection, also scales preference errors.
- Reward model: A model that scores outputs; it becomes part of your product's incentive system.
- RLHF: Reinforcement learning from human feedback; trains to maximize preference signals, not truth.
- Safety training: Techniques to reduce harmful outputs; expect tradeoffs with helpfulness and edge-case refusal.
- Supervised fine-tuning (SFT): A supervised pass on curated examples; improves style and compliance.
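LoRA's forward path fits in a few lines. A sketch of the adapted layer y = W x + (alpha / r) * B(A x), assuming plain Python lists for the weights; the toy shapes are invented for illustration, and in training only A and B receive gradients while W stays frozen:

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=1.0):
    """y = W x + (alpha / r) * B (A x).
    W is the frozen base weight; only the low-rank factors
    A (r x d_in) and B (d_out x r) are trained."""
    r = len(A)
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (identity here)
A = [[1.0, 1.0]]               # rank r=1, shape (1 x 2)
B = [[0.5], [0.5]]             # shape (2 x 1)
y = lora_forward(W, A, B, [2.0, 0.0])
```

Because r is much smaller than the base dimensions in practice, the trainable parameter count drops by orders of magnitude, which is why LoRA (and QLoRA on top of quantized weights) is the common default for efficient fine-tuning.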
Embeddings, retrieval, and memory
- ANN (approximate nearest neighbor): Fast similarity search that trades exactness for speed.
- Attribution: Linking claims to sources (citations); necessary, not sufficient, for groundedness.
- BM25: Sparse (keyword) retrieval baseline; still wins more often than people admit.
- Chunking: Splitting source material into retrieval units; chunk boundaries are an accuracy dial.
- Cosine similarity: Common similarity measure for normalized embeddings.
- Cross-encoder: A model that scores query+document jointly; higher quality, higher cost.
- Embedding: A vector representation of content; used for similarity search and clustering.
- Episodic memory: Remembering interactions/events; useful for personalization, risky for privacy.
- Freshness: How up-to-date your system's answers are; typically an indexing and policy problem.
- GraphRAG: RAG using graph structure for retrieval/assembly; adds complexity, sometimes earns it.
- Grounding: Constraining generation to retrieved/authoritative sources; implies measurement, not just citations.
- HNSW: A common ANN index structure; good recall/speed, non-trivial memory overhead.
- Hybrid search: Combining dense (embeddings) and sparse (BM25) signals.
- Knowledge cutoff: The model's training data horizon; retrieval is the usual workaround.
- Knowledge graph: Entity-relationship store; good for constrained domains and explainable traversal.
- Long-term memory (system): Persisted state outside the model (profiles, histories, documents); treat as data with governance.
- RAG: Retrieval-augmented generation; the dominant pattern for grounding LLM answers in external knowledge.
- Reranking: Re-scoring candidates (often with a cross-encoder) to improve top-k quality.
- Semantic memory: Remembering stable facts/knowledge; often implemented as curated notes + retrieval.
- Short-term memory (context): What fits in the prompt; it is expensive, lossy, and ephemeral.
- Vector database: Storage + approximate nearest neighbor (ANN) search over embeddings; operational characteristics matter.
- Vector space: The geometric space embeddings live in; "close" means similar under a chosen metric.
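Dense retrieval reduces to ranking by a similarity metric. A brute-force sketch of cosine-similarity top-k over a toy in-memory index (the document IDs and vectors are invented); real systems replace the exact loop with an ANN index such as HNSW:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, index, k=2):
    """Exact nearest-neighbor search: score every document, keep the best k.
    An ANN index trades this exactness for sublinear search time."""
    scored = sorted(index.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

index = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.9, 0.1],
    "doc_c": [0.0, 1.0],
}
result = top_k([1.0, 0.05], index, k=2)  # ["doc_a", "doc_b"]
```

Hybrid search would combine these dense scores with a sparse signal like BM25 before reranking; the fusion step, not the vector math, is usually where quality is won or lost.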
Tool use, agents, and orchestration
- Agent: A loop that plans, acts (tools), and observes; "agent" is an architecture, not a model feature.
- Chain-of-thought (CoT): Intermediate reasoning text; useful as a technique, unreliable as an explanation.
- Context assembly: The process of selecting and packaging instructions, memory, and retrieval into the prompt.
- Function calling: Tool calling constrained to a declared schema (often JSON); reduces parsing chaos.
- MCP (Model Context Protocol): A pattern/protocol for exposing tools and resources to models; essentially "drivers" for agent context.
- Orchestration: The control plane for tool routing, retries, budgets, and policy; where production systems live.
- Planner-executor: Separates plan generation from tool execution; improves debuggability.
- ReAct: A pattern blending reasoning and actions; effective, but easy to leak sensitive context into tool inputs.
- Scratchpad: Internal working area for intermediate steps; treat it as potentially exfiltratable unless isolated.
- Self-consistency: Sampling multiple reasoning paths and voting; trades cost for robustness.
- Structured output: Forcing outputs into schemas; improves reliability but can hide semantic errors.
- Tool calling: The model emits a structured request to call external code (APIs, DB queries, actions).
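The agent loop named above is an orchestration pattern, not a model feature, and it can be sketched without any model at all. Here a scripted function stands in for the LLM call, and the tool names and schema are invented for illustration:

```python
def run_agent(model_step, tools, max_steps=5):
    """Minimal agent loop: the model proposes a structured action,
    the orchestrator executes it (or stops), and the observation is
    fed back on the next step. `model_step` stands in for a real LLM call."""
    observations = []
    for _ in range(max_steps):
        action = model_step(observations)      # e.g. {"tool": "add", "args": {...}}
        if action["tool"] == "final_answer":
            return action["args"]["text"]
        if action["tool"] not in tools:        # guardrail: only declared tools run
            observations.append({"error": f"unknown tool {action['tool']}"})
            continue
        result = tools[action["tool"]](**action["args"])
        observations.append({"tool": action["tool"], "result": result})
    return None  # step budget exhausted

# Stub "model": call the add tool once, then answer with its result.
def scripted_model(observations):
    if not observations:
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"tool": "final_answer", "args": {"text": str(observations[-1]["result"])}}

answer = run_agent(scripted_model, {"add": lambda a, b: a + b})  # "5"
```

Everything production-grade lives around this loop: budgets, retries, tracing, and permission checks on each tool call before it executes.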
Safety and security (practical)
- Content filtering: Detecting unsafe content; needs to cover both inputs and outputs.
- Data exfiltration: Coaxing the model/system to leak secrets (system prompts, keys, private docs).
- Guardrails: Pre/post checks, schemas, policies, and constraints around the model; guardrails are code, not vibes.
- Indirect prompt injection: Injection via retrieved web pages, documents, or tool outputs.
- Jailbreak: Attempts to bypass safety constraints; treat as an adversarial testing discipline.
- Over-permissioned tools: The fastest path from demo to incident; least privilege is not optional.
- PII: Personally identifiable information; governs retention, logging, and training data reuse.
- Prompt injection: Malicious instructions embedded in user/retrieved content to subvert system policy.
- Red teaming: Structured adversarial testing; do it before you ship, and after each model/tool change.
- Tool injection: Malicious content that causes unsafe tool calls (e.g., "run this SQL").
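"Guardrails are code" is concrete advice. One sketch: validate every model-emitted tool call against an allowlist and a declared argument schema before anything executes (the tool names and schemas here are invented), which blunts both tool injection and over-permissioned tools:

```python
# Allowlist of tools the agent may call, with their argument schemas.
ALLOWED = {
    "search_docs": {"query": str},
    "get_weather": {"city": str},
}

def validate_tool_call(call):
    """Pre-execution guardrail: reject tool calls that are not in the
    allowlist or whose arguments do not match the declared schema.
    This runs before anything touches real systems."""
    schema = ALLOWED.get(call.get("tool"))
    if schema is None:
        return False, "tool not in allowlist"
    args = call.get("args", {})
    if set(args) != set(schema):
        return False, "unexpected or missing arguments"
    for name, typ in schema.items():
        if not isinstance(args[name], typ):
            return False, f"argument {name} has wrong type"
    return True, "ok"

ok, _ = validate_tool_call({"tool": "search_docs", "args": {"query": "pricing"}})
bad, reason = validate_tool_call(
    {"tool": "run_sql", "args": {"sql": "DROP TABLE users"}})  # rejected
```

Schema validation is a floor, not a ceiling: least privilege on the tools' own credentials still matters, because a valid-looking call can carry a malicious payload.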
Inference, serving, and performance
- Batching: Serving multiple requests together; improves throughput, can hurt tail latency.
- Caching (response): Reusing prior outputs; only safe when inputs + context are identical and policy allows.
- Context truncation: Dropping tokens when over budget; silent truncation is a source of "it forgot" bugs.
- Continuous batching: Dynamically forming batches over time; common in high-throughput LLM serving.
- Distilled model: A smaller student model; cheaper to run, not necessarily cheaper to maintain.
- GPU memory (VRAM): The hard limit for model + KV cache; most serving failures are actually VRAM math.
- Inference: Running the model to generate outputs; dominated by memory bandwidth and batching behavior.
- KV cache: Cached attention keys/values for prior tokens; critical for speed, expensive in memory.
- Latency: End-to-end time to answer; includes retrieval, tool calls, and post-processing.
- Prefix caching: Reusing KV cache for shared prompt prefixes; huge win for system prompts and templates.
- Quantization: Lower-precision weights/activations (int8/int4); reduces VRAM, can reduce quality.
- Rate limiting: Enforcing budgets per user/tenant; necessary for cost control and abuse resistance.
- Speculative decoding: Using a small draft model to propose tokens and a large model to verify; speeds up generation while preserving the target model's output distribution.
- Throughput: Tokens per second (or requests per second) under load.
- TTFT: Time to first token; drives perceived responsiveness.
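The "VRAM math" behind most serving failures is mostly KV-cache arithmetic. A back-of-envelope estimator, using illustrative shape numbers that do not correspond to any specific model:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Rough KV-cache size per sequence: keys + values for every layer,
    assuming fp16/bf16 (2 bytes per value) by default. Servers also pad
    and pre-allocate, so treat this as a lower bound."""
    # Factor of 2 covers keys and values.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Illustrative shapes: 32 layers, 8 KV heads, head_dim 128, 8k context, fp16.
size = kv_cache_bytes(32, 8, 128, 8192)
size_gib = size / 2**30  # 1.0 GiB per sequence at these shapes
```

Multiply by concurrent sequences and the appeal of prefix caching, quantized KV, and continuous batching becomes obvious: the cache, not the weights, is often what caps batch size.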
Evaluation, quality, and observability
- Benchmark: A standardized eval; useful for comparison, dangerous as a proxy for your product.
- Cost per outcome: The metric that matters; tokens are just the billing surface.
- Eval: A defined measurement over a dataset (offline) or traffic (online); if you cannot measure it, you cannot improve it.
- Factuality: Whether claims match reality.
- Faithfulness: Whether outputs are supported by provided context.
- Golden set: Curated test cases with expected outcomes; your regression firewall.
- Groundedness: Whether outputs are supported by retrieved/cited sources.
- Hallucination: Output that is not grounded in reality or provided sources; a system property, not just a model flaw.
- LLM-as-judge: Using a model to score outputs; scalable, but judge drift is real.
- Model versioning: Changing the model changes behavior; plan for canaries, fallbacks, and auditability.
- Pairwise evaluation: Comparing two outputs to reduce scoring noise.
- Prompt versioning: Treat prompts like code; changes need diffs, reviews, and rollbacks.
- Refusal: The model declines to answer; can be correct policy enforcement or a capability regression.
- Tracing: Capturing structured spans/events across retrieval, model calls, and tools.
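A golden set only works as a regression firewall if running it is trivial. A minimal offline-eval harness, assuming the system under test is a plain function and using a substring check as a stand-in for whatever scoring your product actually needs (the cases and fake system are invented):

```python
def run_golden_set(system, cases):
    """Offline eval sketch: run each golden case through the system and
    report pass/fail plus an aggregate pass rate. `system` stands in for
    the full pipeline (retrieval + model + post-processing)."""
    results = []
    for case in cases:
        output = system(case["input"])
        passed = case["expect"] in output   # substring check; real scorers vary
        results.append({"id": case["id"], "passed": passed})
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results

cases = [
    {"id": "refund-policy", "input": "refund window?", "expect": "30 days"},
    {"id": "greeting", "input": "hello", "expect": "Hi"},
]

def fake_system(text):
    return "Hi! Our refund window is 30 days."

rate, results = run_golden_set(fake_system, cases)  # rate == 1.0
```

Run this in CI on every prompt, model, or tool change; a dropped pass rate is the cheapest possible incident report.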
Governance and licensing
- Data residency: Where data is processed/stored; can be a hard constraint for enterprise.
- Data retention: How long you store prompts, traces, and outputs; defaults are rarely acceptable.
- Model card: Documentation about training, intended use, limitations, and evaluation.
- Model license: Terms governing usage, redistribution, and training; read it like you are shipping it.
- Open weights: Model weights are available; does not imply open source or permissive licensing.
- Provenance: Where data and artifacts come from; necessary for trust and compliance.
Quick distinctions that prevent arguments
- "AI" vs "ML": AI is the system goal; ML is one class of statistical methods to achieve it.
- "Chatbot" vs "assistant": A chatbot chats; an assistant owns tasks with tools, memory, and policy.
- "Citations" vs "grounding": Citations are UI; grounding is a measurable constraint.
- "RAG" vs "fine-tuning": RAG changes context at runtime; fine-tuning changes weights and lifecycle.