AI Terminology
A high-signal terminology reference for modern AI systems (LLMs, retrieval, agents, and production concerns).
By Ryan Setter
2/19/2026 · 10 min read
This reference defines terms the way they behave in real systems, not the way they are marketed.
Core framing
- Model vs system: The model produces tokens; the system provides context, tools, policy, memory, and accountability.
- Probabilistic output: Even at temperature 0, you are getting deterministic decoding over uncertain internal representations, not a truth oracle.
- Interfaces matter: Most failures are integration failures (context assembly, tool schemas, permissions, evaluation), not "bad prompts".
Model primitives
- Context window: The maximum tokens available for prompt + retrieved context + tool traces + output.
- Decoding: The strategy that turns probabilities into text (greedy, sampled, beam, speculative).
- Logits: Pre-softmax scores for next-token probabilities.
- Logprob: Log probability of a token (or sequence); useful for confidence heuristics and routing.
- Prompt: The input sequence; in practice it is a structured envelope (instructions + context + constraints), not a paragraph.
- Stop sequence: A delimiter that truncates generation; common footgun when the model emits the delimiter in normal text.
- Streaming: Incremental token output; improves UX but complicates moderation, tool gating, and audits.
- System prompt: The highest-priority instruction channel (when supported); treat it like policy, not copywriting.
- Temperature: Sampling noise; higher values increase diversity and error variance.
- Token: The discrete unit a model reads/writes (not always a word); tokens drive cost, latency, and context limits.
- Tokenizer: The algorithm mapping text to tokens (often BPE-like); a hidden source of failure for non-English, code, and weird punctuation.
- Top-k: Samples from the k most likely next tokens; constrains tails, sometimes too aggressively.
- Top-p (nucleus sampling): Samples from the smallest set of tokens whose cumulative probability is >= p; often a more stable dial than temperature.
- User prompt: The user-supplied content; by default it is untrusted input.
- Vocabulary: The token set the model knows; changing it is not a config toggle, it is a retrain.
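The sampling dials above compose into a small amount of code. A minimal sketch of top-p (nucleus) sampling, assuming a plain list of next-token probabilities rather than real logits (the toy distribution is invented for illustration):

```python
import random

def top_p_sample(probs, p=0.9, rng=random):
    """Sample a token index from the smallest set of tokens whose
    cumulative probability mass reaches p (nucleus sampling)."""
    # Sort token indices by descending probability.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= p:
            break
    # Renormalize within the nucleus and sample from it.
    total = sum(probs[i] for i in nucleus)
    r = rng.random() * total
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]

probs = [0.5, 0.3, 0.15, 0.05]   # toy next-token distribution
idx = top_p_sample(probs, p=0.8)  # nucleus is {token 0, token 1} here
```

Note how a small p collapses toward greedy decoding: with p=0.1, only the most likely token survives, which is one reason top-p is often a gentler dial than temperature.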
Architectures and model types
- Attention: Content-addressed mixing over prior tokens; for full attention, compute scales quadratically with sequence length, while KV memory scales linearly.
- Autoregressive LM: Generates one token at a time conditioned on prior tokens; most chat LLMs are this.
- Diffusion model: Iterative denoising generator (common in images/video); different failure modes than autoregressive LMs.
- Encoder-decoder: Separate encode/decode stages (classic translation); less common for frontier chat LLMs.
- Mixture of Experts (MoE): Routes tokens through subsets of parameters; cheaper per token at inference, harder to serve.
- Multi-head attention (MHA): Multiple attention subspaces; improves representation capacity.
- Multimodal model: Accepts/produces multiple modalities (text, images, audio); reliability depends on modality adapters.
- Positional encoding: How the model represents order; modern LLMs often use RoPE-like schemes.
- RoPE: Rotary position embeddings; enables longer contexts, with caveats around extrapolation.
- Self-attention: Attention within a single sequence; the engine of in-context "working memory".
- Sparse model: Activates only parts of the network per token (MoE is one approach); favors throughput, complicates debugging.
- State-space model (SSM): Sequence models with linear-time recurrence properties; interesting tradeoffs, not a drop-in replacement.
- Transformer: The dominant architecture for language/video/vision generation; attention is the core operator.
- VLM: Vision-language model; typically vision encoder + language decoder.
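The attention operator itself is small. A minimal single-head self-attention sketch in plain Python, assuming Q = K = V = X (no learned projections) so the content-addressed mixing stays visible; real layers add projection matrices, masking, and multiple heads:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(X):
    """Single-head self-attention over a toy sequence.
    X is a list of d-dimensional vectors (one per token)."""
    d = len(X[0])
    scale = math.sqrt(d)
    out = []
    for q in X:
        # Content-addressed scores: dot(q, k) / sqrt(d) against every position.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in X]
        weights = softmax(scores)
        # Mix values by attention weight (a convex combination of rows of X).
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = self_attention(X)  # same shape as X; each row mixes all positions
```

The double loop over positions is the quadratic cost mentioned above: every query attends to every key.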
Training and adaptation
- Adapters: Small trainable modules inserted into a frozen base; cheaper to train and easier to swap.
- Alignment: Shaping behavior toward human/organizational goals; includes safety, reliability, and policy.
- Catastrophic forgetting: Fine-tuning that degrades general capability; often a sign of overly aggressive adaptation.
- Data contamination: Evaluation items leaking into training; the fastest way to fool yourself.
- DPO: Direct preference optimization; simpler preference learning alternative to RLHF-style pipelines.
- Distillation: Training a smaller model to mimic a larger one; useful for latency/cost, but can harden quirks.
- Fine-tuning: Updating model weights for a task/domain; powerful, but adds lifecycle and drift obligations.
- Instruction tuning: Training on (instruction, response) pairs to shape interactive behavior.
- LoRA: Low-rank adapters; common default for efficient fine-tuning.
- Next-token prediction: The classic LM objective; surprisingly broad in what it induces.
- Pretraining: Learning general representations from large corpora; where most capability comes from.
- Prompt tuning: Learning soft prompt vectors; niche but useful for constrained deployments.
- QLoRA: LoRA with quantized base weights during training; reduces VRAM requirements.
- RLAIF: RL from AI feedback; scales preference collection, also scales preference errors.
- Reward model: A model that scores outputs; it becomes part of your product's incentive system.
- RLHF: Reinforcement learning from human feedback; trains to maximize preference signals, not truth.
- Safety training: Techniques to reduce harmful outputs; expect tradeoffs with helpfulness and edge-case refusal.
- Supervised fine-tuning (SFT): A supervised pass on curated examples; improves style and compliance.
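LoRA's forward path fits in a few lines. A sketch of the adapted layer y = W x + (alpha / r) * B(A x), assuming plain Python lists for the weights; the toy shapes are invented for illustration, and in training only A and B receive gradients while W stays frozen:

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=1.0):
    """y = W x + (alpha / r) * B (A x).
    W is the frozen base weight; only the low-rank factors
    A (r x d_in) and B (d_out x r) are trained."""
    r = len(A)
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (identity here)
A = [[1.0, 1.0]]               # rank r=1, shape (1 x 2)
B = [[0.5], [0.5]]             # shape (2 x 1)
y = lora_forward(W, A, B, [2.0, 0.0])
```

Because r is much smaller than the base dimensions in practice, the trainable parameter count drops by orders of magnitude, which is why LoRA (and QLoRA on top of quantized weights) is the common default for efficient fine-tuning.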
Embeddings, retrieval, and memory
- ANN (approximate nearest neighbor): Fast similarity search that trades exactness for speed.
- Attribution: Linking claims to sources (citations); necessary, not sufficient, for groundedness.
- BM25: Sparse (keyword) retrieval baseline; still wins more often than people admit.
- Chunking: Splitting source material into retrieval units; chunk boundaries are an accuracy dial.
- Cosine similarity: Common similarity measure for normalized embeddings.
- Cross-encoder: A model that scores query+document jointly; higher quality, higher cost.
- Embedding: A vector representation of content; used for similarity search and clustering.
- Episodic memory: Remembering interactions/events; useful for personalization, risky for privacy.
- Freshness: How up-to-date your system's answers are; typically an indexing and policy problem.
- GraphRAG: RAG using graph structure for retrieval/assembly; adds complexity, sometimes earns it.
- Grounding: Constraining generation to retrieved/authoritative sources; implies measurement, not just citations.
- HNSW: A common ANN index structure; good recall/speed, non-trivial memory overhead.
- Hybrid search: Combining dense (embeddings) and sparse (BM25) signals.
- Knowledge cutoff: The model's training data horizon; retrieval is the usual workaround.
- Knowledge graph: Entity-relationship store; good for constrained domains and explainable traversal.
- Long-term memory (system): Persisted state outside the model (profiles, histories, documents); treat as data with governance.
- RAG: Retrieval-augmented generation; the dominant pattern for grounding LLM answers in external knowledge.
- Reranking: Re-scoring candidates (often with a cross-encoder) to improve top-k quality.
- Semantic memory: Remembering stable facts/knowledge; often implemented as curated notes + retrieval.
- Short-term memory (context): What fits in the prompt; it is expensive, lossy, and ephemeral.
- Vector database: Storage + approximate nearest neighbor (ANN) search over embeddings; operational characteristics matter.
- Vector space: The geometric space embeddings live in; "close" means similar under a chosen metric.
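Dense retrieval reduces to ranking by a similarity metric. A brute-force sketch of cosine-similarity top-k over a toy in-memory index (the document IDs and vectors are invented); real systems replace the exact loop with an ANN index such as HNSW:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, index, k=2):
    """Exact nearest-neighbor search: score every document, keep the best k.
    An ANN index trades this exactness for sublinear search time."""
    scored = sorted(index.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

index = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.9, 0.1],
    "doc_c": [0.0, 1.0],
}
result = top_k([1.0, 0.05], index, k=2)  # ["doc_a", "doc_b"]
```

Hybrid search would combine these dense scores with a sparse signal like BM25 before reranking; the fusion step, not the vector math, is usually where quality is won or lost.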
Tool use, agents, and orchestration
- Agent: A loop that plans, acts (tools), and observes; "agent" is an architecture, not a model feature.
- Chain-of-thought (CoT): Intermediate reasoning text; useful as a technique, unreliable as an explanation.
- Context assembly: The process of selecting and packaging instructions, memory, and retrieval into the prompt.
- Function calling: Tool calling constrained to a declared schema (often JSON); reduces parsing chaos.
- MCP (Model Context Protocol): A pattern/protocol for exposing tools and resources to models; essentially "drivers" for agent context.
- Orchestration: The control plane for tool routing, retries, budgets, and policy; where production systems live.
- Planner-executor: Separates plan generation from tool execution; improves debuggability.
- ReAct: A pattern blending reasoning and actions; effective, but easy to leak sensitive context into tool inputs.
- Scratchpad: Internal working area for intermediate steps; treat it as potentially exfiltratable unless isolated.
- Self-consistency: Sampling multiple reasoning paths and voting; trades cost for robustness.
- Structured output: Forcing outputs into schemas; improves reliability but can hide semantic errors.
- Tool calling: The model emits a structured request to call external code (APIs, DB queries, actions).
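The agent loop named above is an orchestration pattern, not a model feature, and it can be sketched without any model at all. Here a scripted function stands in for the LLM call, and the tool names and schema are invented for illustration:

```python
def run_agent(model_step, tools, max_steps=5):
    """Minimal agent loop: the model proposes a structured action,
    the orchestrator executes it (or stops), and the observation is
    fed back on the next step. `model_step` stands in for a real LLM call."""
    observations = []
    for _ in range(max_steps):
        action = model_step(observations)      # e.g. {"tool": "add", "args": {...}}
        if action["tool"] == "final_answer":
            return action["args"]["text"]
        if action["tool"] not in tools:        # guardrail: only declared tools run
            observations.append({"error": f"unknown tool {action['tool']}"})
            continue
        result = tools[action["tool"]](**action["args"])
        observations.append({"tool": action["tool"], "result": result})
    return None  # step budget exhausted

# Stub "model": call the add tool once, then answer with its result.
def scripted_model(observations):
    if not observations:
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"tool": "final_answer", "args": {"text": str(observations[-1]["result"])}}

answer = run_agent(scripted_model, {"add": lambda a, b: a + b})  # "5"
```

Everything production-grade lives around this loop: budgets, retries, tracing, and permission checks on each tool call before it executes.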
Safety and security (practical)
- Content filtering: Detecting unsafe content; needs to cover both inputs and outputs.
- Data exfiltration: Coaxing the model/system to leak secrets (system prompts, keys, private docs).
- Guardrails: Pre/post checks, schemas, policies, and constraints around the model; guardrails are code, not vibes.
- Indirect prompt injection: Injection via retrieved web pages, documents, or tool outputs.
- Jailbreak: Attempts to bypass safety constraints; treat as an adversarial testing discipline.
- Over-permissioned tools: The fastest path from demo to incident; least privilege is not optional.
- PII: Personally identifiable information; governs retention, logging, and training data reuse.
- Prompt injection: Malicious instructions embedded in user/retrieved content to subvert system policy.
- Red teaming: Structured adversarial testing; do it before you ship, and after each model/tool change.
- Tool injection: Malicious content that causes unsafe tool calls (e.g., "run this SQL").
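"Guardrails are code" is concrete advice. One sketch: validate every model-emitted tool call against an allowlist and a declared argument schema before anything executes (the tool names and schemas here are invented), which blunts both tool injection and over-permissioned tools:

```python
# Allowlist of tools the agent may call, with their argument schemas.
ALLOWED = {
    "search_docs": {"query": str},
    "get_weather": {"city": str},
}

def validate_tool_call(call):
    """Pre-execution guardrail: reject tool calls that are not in the
    allowlist or whose arguments do not match the declared schema.
    This runs before anything touches real systems."""
    schema = ALLOWED.get(call.get("tool"))
    if schema is None:
        return False, "tool not in allowlist"
    args = call.get("args", {})
    if set(args) != set(schema):
        return False, "unexpected or missing arguments"
    for name, typ in schema.items():
        if not isinstance(args[name], typ):
            return False, f"argument {name} has wrong type"
    return True, "ok"

ok, _ = validate_tool_call({"tool": "search_docs", "args": {"query": "pricing"}})
bad, reason = validate_tool_call(
    {"tool": "run_sql", "args": {"sql": "DROP TABLE users"}})  # rejected
```

Schema validation is a floor, not a ceiling: least privilege on the tools' own credentials still matters, because a valid-looking call can carry a malicious payload.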
Inference, serving, and performance
- Batching: Serving multiple requests together; improves throughput, can hurt tail latency.
- Caching (response): Reusing prior outputs; only safe when inputs + context are identical and policy allows.
- Context truncation: Dropping tokens when over budget; silent truncation is a source of "it forgot" bugs.
- Continuous batching: Dynamically forming batches over time; common in high-throughput LLM serving.
- Distilled model: A smaller student model; cheaper to run, not necessarily cheaper to maintain.
- GPU memory (VRAM): The hard limit for model + KV cache; most serving failures are actually VRAM math.
- Inference: Running the model to generate outputs; dominated by memory bandwidth and batching behavior.
- KV cache: Cached attention keys/values for prior tokens; critical for speed, expensive in memory.
- Latency: End-to-end time to answer; includes retrieval, tool calls, and post-processing.
- Prefix caching: Reusing KV cache for shared prompt prefixes; huge win for system prompts and templates.
- Quantization: Lower-precision weights/activations (int8/int4); reduces VRAM, can reduce quality.
- Rate limiting: Enforcing budgets per user/tenant; necessary for cost control and abuse resistance.
- Speculative decoding: Using a small draft model to propose tokens and a large model to verify; speeds up generation while preserving the target model's output distribution.
- Throughput: Tokens per second (or requests per second) under load.
- TTFT: Time to first token; drives perceived responsiveness.
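The "VRAM math" behind most serving failures is mostly KV-cache arithmetic. A back-of-envelope estimator, using illustrative shape numbers that do not correspond to any specific model:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Rough KV-cache size per sequence: keys + values for every layer,
    assuming fp16/bf16 (2 bytes per value) by default. Servers also pad
    and pre-allocate, so treat this as a lower bound."""
    # Factor of 2 covers keys and values.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Illustrative shapes: 32 layers, 8 KV heads, head_dim 128, 8k context, fp16.
size = kv_cache_bytes(32, 8, 128, 8192)
size_gib = size / 2**30  # 1.0 GiB per sequence at these shapes
```

Multiply by concurrent sequences and the appeal of prefix caching, quantized KV, and continuous batching becomes obvious: the cache, not the weights, is often what caps batch size.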
Evaluation, quality, and observability
- Benchmark: A standardized eval; useful for comparison, dangerous as a proxy for your product.
- Cost per outcome: The metric that matters; tokens are just the billing surface.
- Eval: A defined measurement over a dataset (offline) or traffic (online); if you cannot measure it, you cannot improve it.
- Factuality: Whether claims match reality.
- Faithfulness: Whether outputs are supported by provided context.
- Golden set: Curated test cases with expected outcomes; your regression firewall.
- Groundedness: Whether outputs are supported by retrieved/cited sources.
- Hallucination: Output that is not grounded in reality or provided sources; a system property, not just a model flaw.
- LLM-as-judge: Using a model to score outputs; scalable, but judge drift is real.
- Model versioning: Changing the model changes behavior; plan for canaries, fallbacks, and auditability.
- Pairwise evaluation: Comparing two outputs to reduce scoring noise.
- Prompt versioning: Treat prompts like code; changes need diffs, reviews, and rollbacks.
- Refusal: The model declines to answer; can be correct policy enforcement or a capability regression.
- Tracing: Capturing structured spans/events across retrieval, model calls, and tools.
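A golden set only works as a regression firewall if running it is trivial. A minimal offline-eval harness, assuming the system under test is a plain function and using a substring check as a stand-in for whatever scoring your product actually needs (the cases and fake system are invented):

```python
def run_golden_set(system, cases):
    """Offline eval sketch: run each golden case through the system and
    report pass/fail plus an aggregate pass rate. `system` stands in for
    the full pipeline (retrieval + model + post-processing)."""
    results = []
    for case in cases:
        output = system(case["input"])
        passed = case["expect"] in output   # substring check; real scorers vary
        results.append({"id": case["id"], "passed": passed})
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results

cases = [
    {"id": "refund-policy", "input": "refund window?", "expect": "30 days"},
    {"id": "greeting", "input": "hello", "expect": "Hi"},
]

def fake_system(text):
    return "Hi! Our refund window is 30 days."

rate, results = run_golden_set(fake_system, cases)  # rate == 1.0
```

Run this in CI on every prompt, model, or tool change; a dropped pass rate is the cheapest possible incident report.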
Governance and licensing
- Data residency: Where data is processed/stored; can be a hard constraint for enterprise.
- Data retention: How long you store prompts, traces, and outputs; defaults are rarely acceptable.
- Model card: Documentation about training, intended use, limitations, and evaluation.
- Model license: Terms governing usage, redistribution, and training; read it like you are shipping it.
- Open weights: Model weights are available; does not imply open source or permissive licensing.
- Provenance: Where data and artifacts come from; necessary for trust and compliance.
Quick distinctions that prevent arguments
- "AI" vs "ML": AI is the system goal; ML is one class of statistical methods to achieve it.
- "Chatbot" vs "assistant": A chatbot chats; an assistant owns tasks with tools, memory, and policy.
- "Citations" vs "grounding": Citations are UI; grounding is a measurable constraint.
- "RAG" vs "fine-tuning": RAG changes context at runtime; fine-tuning changes weights and lifecycle.