AI Terminology

A high-signal terminology reference for modern AI systems (LLMs, retrieval, agents, and production concerns).

By Ryan Setter

2/19/2026 · 10 min read

This reference defines terms the way they behave in real systems, not the way they are marketed.

Core framing

  • Model vs system: The model produces tokens; the system provides context, tools, policy, memory, and accountability.
  • Probabilistic output: Even at temperature 0, you are getting deterministic decoding over uncertain internal representations, not a truth oracle.
  • Interfaces matter: Most failures are integration failures (context assembly, tool schemas, permissions, evaluation), not "bad prompts".

Model primitives

  • Context window: The maximum tokens available for prompt + retrieved context + tool traces + output.
  • Decoding: The strategy that turns probabilities into text (greedy, sampled, beam, speculative).
  • Logits: Pre-softmax scores for next-token probabilities.
  • Logprob: Log probability of a token (or sequence); useful for confidence heuristics and routing.
  • Prompt: The input sequence; in practice it is a structured envelope (instructions + context + constraints), not a paragraph.
  • Stop sequence: A delimiter that truncates generation; common footgun when the model emits the delimiter in normal text.
  • Streaming: Incremental token output; improves UX but complicates moderation, tool gating, and audits.
  • System prompt: The highest-priority instruction channel (when supported); treat it like policy, not copywriting.
  • Temperature: Sampling noise; higher values increase diversity and error variance.
  • Token: The discrete unit a model reads/writes (not always a word); tokens drive cost, latency, and context limits.
  • Tokenizer: The algorithm mapping text to tokens (often BPE-like); a hidden source of failure for non-English, code, and weird punctuation.
  • Top-k: Samples from the k most likely next tokens; constrains tails, sometimes too aggressively.
  • Top-p (nucleus sampling): Samples from the smallest set of tokens whose cumulative probability >= p; often a more stable dial than temperature.
  • User prompt: The user-supplied content; by default it is untrusted input.
  • Vocabulary: The token set the model knows; changing it is not a config toggle, it is a retrain.
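The decoding knobs above (greedy at temperature 0, temperature scaling, nucleus filtering) can be sketched as a single sampler over raw logits. This is a minimal illustration in plain Python, not any particular library's implementation; the function name and shapes are my own.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=None):
    """Toy sketch: temperature + nucleus (top-p) sampling over logits."""
    rng = rng or random.Random(0)
    if temperature == 0:
        # Greedy decoding: deterministic argmax, no sampling noise.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature scales logits before softmax; higher -> flatter distribution.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus filtering: keep the smallest set of tokens whose
    # cumulative probability reaches top_p, then renormalize.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    r = rng.random() * mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

Note how a small top_p collapses toward greedy behavior: with a dominant logit, the nucleus contains only one token.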

Architectures and model types

  • Attention: Content-addressed mixing over prior tokens; compute and memory for full attention scale roughly quadratically with sequence length.
  • Autoregressive LM: Generates one token at a time conditioned on prior tokens; most chat LLMs are this.
  • Diffusion model: Iterative denoising generator (common in images/video); different failure modes than autoregressive LMs.
  • Encoder-decoder: Separate encode/decode stages (classic translation); less common for frontier chat LLMs.
  • Mixture of Experts (MoE): Routes tokens through subsets of parameters; cheaper per token at inference, harder to serve.
  • Multi-head attention (MHA): Multiple attention subspaces; improves representation capacity.
  • Multimodal model: Accepts/produces multiple modalities (text, images, audio); reliability depends on modality adapters.
  • Positional encoding: How the model represents order; modern LLMs often use RoPE-like schemes.
  • RoPE: Rotary position embeddings; enables longer contexts, with caveats around extrapolation.
  • Self-attention: Attention within a single sequence; the engine of in-context "working memory".
  • Sparse model: Activates only parts of the network per token (MoE is one approach); favors throughput, complicates debugging.
  • State-space model (SSM): Sequence models with linear-time recurrence properties; interesting tradeoffs, not a drop-in replacement.
  • Transformer: The dominant architecture for language/video/vision generation; attention is the core operator.
  • VLM: Vision-language model; typically vision encoder + language decoder.
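Self-attention as defined above reduces to a small computation: score each query against prior keys, softmax, and mix the values. A toy single-head sketch in plain Python, using identity projections for Q/K/V to stay short (real layers learn separate weight matrices):

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def self_attention(X, causal=True):
    """Toy single-head self-attention over a list of d-dim vectors.
    With causal=True, token i attends only to tokens 0..i (the
    autoregressive case)."""
    d = len(X[0])
    scale = 1.0 / math.sqrt(d)  # standard scaled dot-product factor
    out = []
    for i, q in enumerate(X):
        keys = X[: i + 1] if causal else X
        scores = [scale * sum(qa * ka for qa, ka in zip(q, k)) for k in keys]
        w = softmax(scores)
        # Output is the attention-weighted mix of (here, value = key) vectors.
        out.append([sum(wj * kv[a] for wj, kv in zip(w, keys)) for a in range(d)])
    return out
```

The double loop over tokens and keys is the quadratic cost mentioned under "Attention".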

Training and adaptation

  • Adapters: Small trainable modules inserted into a frozen base; cheaper to train and easier to swap.
  • Alignment: Shaping behavior toward human/organizational goals; includes safety, reliability, and policy.
  • Catastrophic forgetting: Fine-tuning that degrades general capability; often a sign of overly aggressive adaptation.
  • Data contamination: Evaluation items leaking into training; the fastest way to fool yourself.
  • DPO: Direct preference optimization; simpler preference learning alternative to RLHF-style pipelines.
  • Distillation: Training a smaller model to mimic a larger one; useful for latency/cost, but can harden quirks.
  • Fine-tuning: Updating model weights for a task/domain; powerful, but adds lifecycle and drift obligations.
  • Instruction tuning: Training on (instruction, response) pairs to shape interactive behavior.
  • LoRA: Low-rank adapters; common default for efficient fine-tuning.
  • Next-token prediction: The classic LM objective; surprisingly broad in what it induces.
  • Pretraining: Learning general representations from large corpora; where most capability comes from.
  • Prompt tuning: Learning soft prompt vectors; niche but useful for constrained deployments.
  • QLoRA: LoRA with quantized base weights during training; reduces VRAM requirements.
  • RLAIF: RL from AI feedback; scales preference collection, also scales preference errors.
  • Reward model: A model that scores outputs; it becomes part of your product's incentive system.
  • RLHF: Reinforcement learning from human feedback; trains to maximize preference signals, not truth.
  • Safety training: Techniques to reduce harmful outputs; expect tradeoffs with helpfulness and edge-case refusal.
  • Supervised fine-tuning (SFT): A supervised pass on curated examples; improves style and compliance.
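The LoRA idea above is compact enough to show directly: the frozen weight W stays untouched, and a low-rank product A·B carries the trainable update. A minimal forward-pass sketch with hand-rolled matrix math (the function name and alpha scaling convention are illustrative, not a specific library's API):

```python
def lora_forward(x, W, A, B, alpha=1.0):
    """Sketch of a LoRA-adapted linear layer: y = x·W + alpha · x·A·B.
    W is the frozen base weight (d_in x d_out); A (d_in x r) and
    B (r x d_out) are the small trainable low-rank factors."""
    def matvec(v, M):
        # v: length-n vector, M: n x m nested list -> length-m vector
        return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]
    base = matvec(x, W)            # frozen path
    delta = matvec(matvec(x, A), B)  # low-rank trainable path
    return [b + alpha * d for b, d in zip(base, delta)]
```

With rank r much smaller than d_in and d_out, the trainable parameter count drops from d_in·d_out to r·(d_in + d_out), which is why LoRA is the common default for efficient fine-tuning.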

Embeddings, retrieval, and memory

  • ANN (approximate nearest neighbor): Fast similarity search that trades exactness for speed.
  • Attribution: Linking claims to sources (citations); necessary, not sufficient, for groundedness.
  • BM25: Sparse (keyword) retrieval baseline; still wins more often than people admit.
  • Chunking: Splitting source material into retrieval units; chunk boundaries are an accuracy dial.
  • Cosine similarity: Common similarity measure for normalized embeddings.
  • Cross-encoder: A model that scores query+document jointly; higher quality, higher cost.
  • Embedding: A vector representation of content; used for similarity search and clustering.
  • Episodic memory: Remembering interactions/events; useful for personalization, risky for privacy.
  • Freshness: How up-to-date your system's answers are; typically an indexing and policy problem.
  • GraphRAG: RAG using graph structure for retrieval/assembly; adds complexity, sometimes earns it.
  • Grounding: Constraining generation to retrieved/authoritative sources; implies measurement, not just citations.
  • HNSW: A common ANN index structure; good recall/speed, non-trivial memory overhead.
  • Hybrid search: Combining dense (embeddings) and sparse (BM25) signals.
  • Knowledge cutoff: The model's training data horizon; retrieval is the usual workaround.
  • Knowledge graph: Entity-relationship store; good for constrained domains and explainable traversal.
  • Long-term memory (system): Persisted state outside the model (profiles, histories, documents); treat as data with governance.
  • RAG: Retrieval-augmented generation; the dominant pattern for grounding LLM answers in external knowledge.
  • Reranking: Re-scoring candidates (often with a cross-encoder) to improve top-k quality.
  • Semantic memory: Remembering stable facts/knowledge; often implemented as curated notes + retrieval.
  • Short-term memory (context): What fits in the prompt; it is expensive, lossy, and ephemeral.
  • Vector database: Storage + approximate nearest neighbor (ANN) search over embeddings; operational characteristics matter.
  • Vector space: The geometric space embeddings live in; "close" means similar under a chosen metric.
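Cosine similarity and hybrid search combine naturally into one small ranking function. A toy sketch, where simple term overlap stands in for a real BM25 scorer (the `hybrid_rank` name and the (doc_id, embedding, text) tuple shape are my own):

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product over the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def hybrid_rank(query_vec, query_terms, docs, alpha=0.5):
    """Toy hybrid search: blend a dense (cosine) and a sparse signal.
    `docs` is a list of (doc_id, embedding, text) tuples; term overlap
    is a crude stand-in for BM25."""
    scored = []
    for doc_id, emb, text in docs:
        dense = cosine(query_vec, emb)
        tokens = set(text.lower().split())
        sparse = len(query_terms & tokens) / max(len(query_terms), 1)
        scored.append((alpha * dense + (1 - alpha) * sparse, doc_id))
    return [d for _, d in sorted(scored, reverse=True)]
```

In production the dense leg runs against an ANN index (e.g. HNSW) and the top candidates are often reranked with a cross-encoder; the blending logic stays conceptually the same.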

Tool use, agents, and orchestration

  • Agent: A loop that plans, acts (tools), and observes; "agent" is an architecture, not a model feature.
  • Chain-of-thought (CoT): Intermediate reasoning text; useful as a technique, unreliable as an explanation.
  • Context assembly: The process of selecting and packaging instructions, memory, and retrieval into the prompt.
  • Function calling: Tool calling constrained to a declared schema (often JSON); reduces parsing chaos.
  • MCP (Model Context Protocol): A pattern/protocol for exposing tools and resources to models; essentially "drivers" for agent context.
  • Orchestration: The control plane for tool routing, retries, budgets, and policy; where production systems live.
  • Planner-executor: Separates plan generation from tool execution; improves debuggability.
  • ReAct: A pattern blending reasoning and actions; effective, but easy to leak sensitive context into tool inputs.
  • Scratchpad: Internal working area for intermediate steps; treat it as potentially exfiltratable unless isolated.
  • Self-consistency: Sampling multiple reasoning paths and voting; trades cost for robustness.
  • Structured output: Forcing outputs into schemas; improves reliability but can hide semantic errors.
  • Tool calling: The model emits a structured request to call external code (APIs, DB queries, actions).
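Function calling per the definitions above means the model emits a structured request that the orchestration layer validates against a declared schema before executing. A minimal dispatch sketch; the registry shape, tool name, and error strings are hypothetical, not any vendor's API:

```python
import json

# Hypothetical tool registry: the schema the model sees, plus the code
# actually executed when the model emits a matching call.
TOOLS = {
    "get_weather": {
        "schema": {
            "name": "get_weather",
            "parameters": {"type": "object",
                           "properties": {"city": {"type": "string"}},
                           "required": ["city"]},
        },
        "fn": lambda args: f"Sunny in {args['city']}",
    }
}

def dispatch(model_output: str) -> str:
    """Parse a model-emitted tool call (assumed JSON) and route it.
    Unknown tools and missing required arguments are rejected rather
    than guessed -- the model's output is untrusted input."""
    call = json.loads(model_output)
    tool = TOOLS.get(call.get("name"))
    if tool is None:
        return "error: unknown tool"
    required = tool["schema"]["parameters"]["required"]
    args = call.get("arguments", {})
    if any(k not in args for k in required):
        return "error: missing required argument"
    return tool["fn"](args)
```

An agent loop wraps this in plan/act/observe: the tool's return value goes back into the context, and the model decides the next step.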

Safety and security (practical)

  • Content filtering: Detecting unsafe content; needs to cover both inputs and outputs.
  • Data exfiltration: Coaxing the model/system to leak secrets (system prompts, keys, private docs).
  • Guardrails: Pre/post checks, schemas, policies, and constraints around the model; guardrails are code, not vibes.
  • Indirect prompt injection: Injection via retrieved web pages, documents, or tool outputs.
  • Jailbreak: Attempts to bypass safety constraints; treat as an adversarial testing discipline.
  • Over-permissioned tools: The fastest path from demo to incident; least privilege is not optional.
  • PII: Personally identifiable information; governs retention, logging, and training data reuse.
  • Prompt injection: Malicious instructions embedded in user/retrieved content to subvert system policy.
  • Red teaming: Structured adversarial testing; do it before you ship, and after each model/tool change.
  • Tool injection: Malicious content that causes unsafe tool calls (e.g., "run this SQL").
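"Guardrails are code, not vibes" can be made concrete with a pre-call gate that enforces least privilege and screens tool arguments. This is a deliberately crude sketch (the allowlist, patterns, and function name are illustrative); pattern matching catches only the laziest injections and should be one layer among several:

```python
import re

# Hypothetical per-agent allowlist: least privilege means the agent can
# only ever reach the tools it actually needs.
ALLOWED_TOOLS = {"search_docs", "get_weather"}

# Crude screen for obvious injection/exfiltration patterns in arguments.
SUSPICIOUS = re.compile(
    r"(ignore (all|previous) instructions|drop table|rm -rf)", re.I
)

def gate_tool_call(tool_name: str, arguments: str) -> bool:
    """Return True only if the call passes the least-privilege and
    content checks; the orchestrator must refuse the call otherwise."""
    if tool_name not in ALLOWED_TOOLS:
        return False
    if SUSPICIOUS.search(arguments):
        return False
    return True
```

Red teaming is how you find out what this gate misses; indirect prompt injection in retrieved documents will not announce itself with obvious phrases.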

Inference, serving, and performance

  • Batching: Serving multiple requests together; improves throughput, can hurt tail latency.
  • Caching (response): Reusing prior outputs; only safe when inputs + context are identical and policy allows.
  • Context truncation: Dropping tokens when over budget; silent truncation is a source of "it forgot" bugs.
  • Continuous batching: Dynamically forming batches over time; common in high-throughput LLM serving.
  • Distilled model: A smaller student model; cheaper to run, not necessarily cheaper to maintain.
  • GPU memory (VRAM): The hard limit for model + KV cache; most serving failures are actually VRAM math.
  • Inference: Running the model to generate outputs; dominated by memory bandwidth and batching behavior.
  • KV cache: Cached attention keys/values for prior tokens; critical for speed, expensive in memory.
  • Latency: End-to-end time to answer; includes retrieval, tool calls, and post-processing.
  • Prefix caching: Reusing KV cache for shared prompt prefixes; huge win for system prompts and templates.
  • Quantization: Lower-precision weights/activations (int8/int4); reduces VRAM, can reduce quality.
  • Rate limiting: Enforcing budgets per user/tenant; necessary for cost control and abuse resistance.
  • Speculative decoding: Using a small draft model to propose tokens and a large model to verify; improves throughput.
  • Throughput: Tokens per second (or requests per second) under load.
  • TTFT: Time to first token; drives perceived responsiveness.
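The claim that "most serving failures are actually VRAM math" is easy to check on paper: KV-cache size is keys plus values, per layer, per KV head, per token. A back-of-envelope calculator, with illustrative (roughly Llama-2-7B-shaped) numbers that are assumptions, not measurements:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch,
                   bytes_per_elem=2):
    """Back-of-envelope KV-cache size. The leading factor 2 covers
    keys and values; fp16 -> 2 bytes per element."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative: 32 layers, 32 KV heads, head_dim 128, 4k context, batch 8
gib = kv_cache_bytes(32, 32, 128, 4096, 8) / 2**30  # -> 16.0 GiB
```

Sixteen GiB of cache on top of ~14 GiB of fp16 weights is why that configuration does not fit a 24 GiB card; grouped-query attention (fewer KV heads), quantized caches, and prefix caching all attack exactly this number.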

Evaluation, quality, and observability

  • Benchmark: A standardized eval; useful for comparison, dangerous as a proxy for your product.
  • Cost per outcome: The metric that matters; tokens are just the billing surface.
  • Eval: A defined measurement over a dataset (offline) or traffic (online); if you cannot measure it, you cannot improve it.
  • Factuality: Whether claims match reality.
  • Faithfulness: Whether outputs are supported by provided context.
  • Golden set: Curated test cases with expected outcomes; your regression firewall.
  • Groundedness: Whether outputs are supported by retrieved/cited sources.
  • Hallucination: Output that is not grounded in reality or provided sources; a system property, not just a model flaw.
  • LLM-as-judge: Using a model to score outputs; scalable, but judge drift is real.
  • Model versioning: Changing the model changes behavior; plan for canaries, fallbacks, and auditability.
  • Pairwise evaluation: Comparing two outputs to reduce scoring noise.
  • Prompt versioning: Treat prompts like code; changes need diffs, reviews, and rollbacks.
  • Refusal: The model declines to answer; can be correct policy enforcement or a capability regression.
  • Tracing: Capturing structured spans/events across retrieval, model calls, and tools.
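A golden set earns its "regression firewall" name when it actually blocks deploys. A minimal harness sketch; the cases, the substring-match check, and the function names are all hypothetical (real evals often need semantic scoring, not substring matching):

```python
# Hypothetical golden set: curated inputs with expected outcomes.
GOLDEN_SET = [
    {"input": "2+2", "expect": "4"},
    {"input": "capital of France", "expect": "Paris"},
]

def run_regression(system, golden=GOLDEN_SET):
    """`system` is any callable from input text to output text.
    Returns the list of failing case inputs (empty means pass);
    wire a non-empty result to a blocked deploy."""
    failures = []
    for case in golden:
        out = system(case["input"])
        if case["expect"] not in out:
            failures.append(case["input"])
    return failures
```

Run this on every model version, prompt version, and tool change; pairwise evaluation and LLM-as-judge layer on top for the cases a substring check cannot capture.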

Governance and licensing

  • Data residency: Where data is processed/stored; can be a hard constraint for enterprise.
  • Data retention: How long you store prompts, traces, and outputs; defaults are rarely acceptable.
  • Model card: Documentation about training, intended use, limitations, and evaluation.
  • Model license: Terms governing usage, redistribution, and training; read it like you are shipping it.
  • Open weights: Model weights are available; does not imply open source or permissive licensing.
  • Provenance: Where data and artifacts come from; necessary for trust and compliance.

Quick distinctions that prevent arguments

  • "AI" vs "ML": AI is the system goal; ML is one class of statistical methods to achieve it.
  • "Chatbot" vs "assistant": A chatbot chats; an assistant owns tasks with tools, memory, and policy.
  • "Citations" vs "grounding": Citations are UI; grounding is a measurable constraint.
  • "RAG" vs "fine-tuning": RAG changes context at runtime; fine-tuning changes weights and lifecycle.