AI Glossary
Short, plain-English definitions of the terms used across the AI section. Where a concept has a dedicated page, the term links to it. Terms are listed alphabetically.
A2A (Agent2Agent)
An open standard from Google for communication between agents built on different frameworks or by different vendors. Where MCP connects an agent to tools, A2A connects an agent to other agents, using Agent Cards for capability discovery. See Agents.
Agent
An LLM autonomously using tools in a loop: the model decides what to do next, your code runs the chosen tool, the result is fed back, and the loop repeats. See AI Agents.
Agentic AI
The umbrella label for systems where an LLM plans, decides, and acts via tools rather than producing a single output. The same territory as "agents", framed as an architectural property you can add incrementally. See AI Agents.
Agent skill
A portable, version-controlled workflow package (SKILL.md plus optional scripts and references) that
teaches a coding agent how to perform a specific task. Loaded on demand when the agent matches the skill's
description to the current task, unlike always-on rules or project memory files. See Agent Skills.
ANN (Approximate Nearest Neighbor)
The efficiency primitive behind vector databases: instead of comparing a query against every stored vector, an ANN algorithm checks a carefully selected subset, trading a little accuracy for a large speedup. See RAG.
Attention
The core operation inside each transformer layer: every token computes how much to "attend to" every prior token in the context window, letting the model relate words across a sequence.
Base model
A foundation model straight out of pre-training -- fluent but not yet a helpful assistant. It becomes an instruct/chat model only after post-training. See LLMs.
Chain-of-thought
Prompting or training a model to reason step by step in tokens before answering, trading latency for accuracy on multi-step problems. See LLMs.
Context compaction
Summarizing a conversation that is nearing the context-window limit and reinitializing a fresh window with the summary -- the first lever for long-horizon coherence. See Context Engineering.
Context engineering
The discipline of curating what goes into the context window across many turns -- compaction, structured notes, retrieval, and sub-agents. The agentic-era successor to prompt engineering.
Context rot
The empirical degradation of an LLM's recall as the context window fills -- "lost in the middle". Caused by O(n^2) attention and short-sequence-heavy training data; bigger windows do not fix it. See Context Engineering.
Context window
The maximum number of tokens a model can attend to at once (today: thousands to millions). All input, retrieved context, and generated output must fit inside it. See LLMs.
Cosine similarity
The dominant metric for comparing embeddings: the cosine of the angle between two vectors, measuring how similar in meaning two pieces of text are regardless of length. See RAG.
Deep modules
Modules with simple interfaces that hide significant internal complexity (Ousterhout). AI agents work best inside such clear boundaries, because the interface is the contract. See AI-Assisted Software Development.
Embedding
A dense numeric vector (typically 384--4096 dimensions) representing text, an image, or audio, chosen so that semantically similar inputs land close together in vector space. The layer RAG depends on.
Fallback chain
Trying a cheaper or faster model first and escalating to a higher tier only when validation fails, confidence is low, or the user requests it. See Cost, Latency & Model Routing.
Fine-tuning
Continued training of a pretrained model on task- or domain-specific data, baking knowledge and behavior into the weights. Best for style, tone, and format; use RAG for changing facts. See RAG vs fine-tuning.
Foundation model
A large model trained on broad data that serves as a reusable base you adapt to many tasks (via prompting, RAG, or fine-tuning) rather than training per task. LLMs are the best-known foundation models. See LLMs.
Frontier model
The current capability ceiling -- the largest, most capable models (Claude, GPT, Gemini), usually closed-source and accessed via API. See Cloud vs Local Models.
Function calling
The developer's name for tool use: you expose functions with a schema, and the model emits a structured request to call one. Your code executes it. See Agents.
GPT (Generative Pre-trained Transformer)
A decoder-only transformer trained to generate text by next-token prediction; also OpenAI's product line. The architecture pattern underlying most modern LLMs.
Groundedness
An evaluation metric (also "faithfulness") for whether an answer stuck to its retrieved sources rather than hallucinating. See Evaluation and LLMOps.
Graceful degradation
Designing a product so core functionality still works when an LLM API fails, is disabled, or returns low-quality output -- manual mode, cached answers, or queued retry instead of a broken UX. See AI in Products.
Guardrails
Input/output policy enforcement around an LLM: content filtering, PII redaction, claim verification, and refusal paths. One of the layered mitigations for hallucination and misuse.
Hallucination
Fluent, confident, wrong output -- fabricated facts, invented citations, made-up parameters. A structural consequence of next-token prediction, contained (not eliminated) by RAG, tool use, evaluation, and guardrails. See LLMs.
Harness engineering
Building the evaluation and execution scaffolding -- datasets, runner, judges, traces, CI gates -- that turns a non-deterministic model into a testable, observable system. See Evaluation and LLMOps.
Human-in-the-loop
Requiring human approval, review, or escalation before an agent or LLM feature takes high-impact or irreversible action. Includes maker-checker and audit trails. See Human-in-the-Loop.
Hybrid retrieval
Combining dense (semantic) embedding search with sparse keyword search (BM25) and fusing the results. The highest-ROI fix for weak RAG.
Inference
Running a trained model to produce output, as opposed to training it. The per-request cost and latency layer of any LLM application. See LLMs.
Instruct / chat model
A foundation model after post-training -- the variant end users actually talk to. See LLMs.
Jailbreaking
Crafting inputs that bypass an LLM's safety constraints to elicit forbidden output; closely related to prompt injection. See AI Safety & Guardrails.
Just-in-time context
A retrieval strategy where an agent keeps lightweight references (paths, queries, links) and loads data on demand via tools, instead of pre-loading everything via embeddings. See Context Engineering.
LLM (Large Language Model)
A neural network trained on large text corpora to predict the next token; aligned via post-training into a useful assistant. See Large Language Models.
LLM-as-judge
Using a second LLM to score an output against a rubric -- necessary for open-ended responses, but it must be calibrated against human judgment. See Evaluation and LLMOps.
LLM-wiki pattern
A knowledge-management pattern where an LLM incrementally builds and maintains a structured, interlinked Markdown wiki between you and raw sources -- compiling synthesis once and keeping it current. See Knowledge Management.
LLMOps
The operational discipline of running LLM apps and agents in production: prompt and model versioning, evaluation, cost/latency optimization, monitoring, and incident response. See Tooling.
llms.txt
A proposed standard for a curated, LLM-readable Markdown overview of a website, published at /llms.txt. See
Knowledge Management.
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method that injects small trainable matrices into a frozen model, so you adapt under 1% of the parameters. The basis of QLoRA. See RAG vs fine-tuning.
MCP (Model Context Protocol)
An open standard from Anthropic -- the "USB-C for AI" -- that gives LLM apps a uniform way to discover and invoke external tools and data, collapsing the N x M integration problem to N + M. See Agents.
Memex
Vannevar Bush's 1945 vision of a personal, curated knowledge store with associative trails -- the conceptual ancestor of the LLM-wiki pattern. See Knowledge Management.
MLOps
Machine Learning Operations -- DevOps applied to the full ML lifecycle (data, training, deployment, monitoring). LLMOps is its LLM-specific extension. See Evaluation and LLMOps.
Multi-agent system
A system where several specialized agents coordinate (orchestrator-worker, router, hierarchical, critic-refiner, network) to exceed a single agent's capability or context. See Agents.
Model routing
Sending each request to an appropriate model tier (frontier, mid, small/local) via classifiers, fallback chains, or task-specific rules to balance quality, cost, and latency. See Cost, Latency & Model Routing.
Open-weights model
A model whose weights are downloadable (Llama, Mistral, Qwen, DeepSeek, Gemma, Phi), so it can be self-hosted and fine-tuned. The basis for local model usage.
Orchestrator-worker
The most common multi-agent pattern: a central manager decomposes a task and delegates subtasks to specialist workers. The orchestrator layer is the main defense against error amplification. See Agents.
Parameters
The learned weights of a model. Count (billions to trillions) correlates roughly with capability and cost. See LLMs.
PEFT (Parameter-Efficient Fine-Tuning)
The family of techniques (including LoRA and QLoRA) that adapt a model by training a small fraction of its parameters instead of all of them. See RAG vs fine-tuning.
Post-training
The alignment stage that turns a base model into a helpful assistant: supervised fine-tuning plus preference optimization (RLHF / DPO). See LLMs.
Pre-training
The compute-dominant first stage: self-supervised next-token prediction on a web-scale corpus, producing a fluent base model. See LLMs.
Prompt engineering
Shaping model behavior by changing the input text -- instructions, examples, and formatting. The cheapest, fastest adaptation lever before RAG or fine-tuning. See Context Engineering.
Prompt caching
A provider feature that discounts repeated identical prefix tokens across requests -- useful when system prompts or long documents stay stable. See Cost, Latency & Model Routing.
Prompt injection
An attack that disguises malicious instructions as normal input to override an LLM's original instructions; the OWASP #1 LLM risk, including indirect injection via retrieved data. See AI Safety & Guardrails.
Quantization
Replacing model weights with lower-precision approximations (e.g. 4-bit NF4) to cut memory use, with minimal quality loss for most tasks. What lets large models fit on consumer GPUs. See Cloud vs Local Models.
QLoRA
Quantize the base model to 4-bit, freeze it, and train LoRA adapters on top -- the standard recipe for fine-tuning a moderate-size LLM on a single consumer GPU. See Cloud vs Local Models.
RAG (Retrieval-Augmented Generation)
Retrieving relevant information from an external source before generation and injecting it into the prompt, so the model summarizes facts instead of recalling them. The primary defense against hallucination. See RAG.
Red-teaming
Continuously and adversarially probing a system to discover new failure modes, complementing static benchmarks. See AI Safety & Guardrails.
Reranking
Re-scoring the top retrieved candidates with a cross-encoder model to improve relevance -- often a bigger quality win than swapping the embedding model. See RAG.
Semantic search
Searching by meaning rather than keyword match, using embeddings and similarity. The capability that powers retrieval in RAG.
Semantic cache
Caching LLM responses keyed by embedding similarity of the query -- returning a stored answer when a new question is close enough to a prior one. Requires TTL and invalidation when source data changes. See Cost, Latency & Model Routing.
SPDD (REASONS Canvas)
Structured-Prompt-Driven Development -- a methodology treating prompts as versioned, reviewed delivery artifacts, structured by the seven-part REASONS Canvas. See AI-Assisted Software Development.
Structured note-taking
An agent writing to a persistent store outside the context window and reading it back -- external memory that conserves the attention budget. See Context Engineering.
Structured output
Constraining an LLM response to a machine-parseable schema (JSON Schema, tool arguments) with validation and repair loops. See Structured Outputs.
Sub-agent architecture
A lead agent delegating focused subtasks to sub-agents that explore in clean context windows and return distilled summaries. See Context Engineering.
Temperature
A sampling parameter controlling randomness during generation: 0 is deterministic, higher values produce more varied output. See LLMs.
Token
The atomic unit a model reads and writes -- a learned subword, not a whole word. Text is split into tokens before the model sees it. See LLMs.
Tool use
Letting an LLM act in the outside world by requesting calls to functions you define; your code executes them and returns the result. The mechanism that turns a text generator into an agent. See Agents.
Transformer
The neural-network architecture every modern LLM uses, built from stacked layers of self-attention and feed-forward networks. See LLMs.
Vector database
A database specialized for storing and searching high-dimensional embeddings by similarity (via ANN) rather than exact match. Examples: Pinecone, pgvector, OpenSearch, Weaviate, Milvus, Chroma. See RAG.
Vector quantization
Compressing embedding vectors (e.g. to int8 or binary) to cut storage and speed up search; distinct from
model-weight quantization. For retrieval, unbiased similarity preservation matters more than
reconstruction accuracy. See Embeddings Deep Dive.
Vertical slices
Implementing a feature across all layers in one pass ("tracer bullets") rather than horizontally per layer, forcing integration to work. See AI-Assisted Software Development.
vLLM
A high-throughput inference engine for serving open-weights models on GPUs in production. See Cloud vs Local Models.