Reference

Glossary

The ~105 terms the rest of the playbook leans on without stopping to define - grouped so you can scan by theme, not just alphabetically.

Core concepts

Agent - An LLM given tools, memory, and a loop so it can take actions, not just answer. The intelligence is the model; the agency is the loop. (II.5, Part II)

Agent DLC (Agent Development Lifecycle) - The build-and-operate lifecycle for an agentic system: design, tools and MCP, memory, build, evaluation, deploy, runtime, observe, retire. Evaluation-first rather than code-first, because the same input can behave differently twice. (I.10)

Agentic loop / orchestration - The control logic running an agent: selects tools, executes calls, manages memory, decides when to stop. Where guardrails and least-privilege are enforced. (II.8)

API - Application programming interface - a defined way for one program to call another over a network. The model API is one instance; agents reach tools and data through ordinary APIs too. (II.5)

Context window - The fixed span of tokens a model can see at once - its working memory; the security-critical asset. (I.6)

Context-CIA - The CIA triad reframed around the context window: read the prompt or another tenant’s data (C), inject acted-on instructions (I), exhaust the loop (A). (I.6)

Function calling / tool use - The model emitting a structured request that your code executes, then feeds back. (II.5)

Inference - Running a trained, frozen model to produce output; happens on every request.

Memory - Anything an agent carries beyond one call: short-term is the context window, long-term is a persistent store (often a vector DB) that can be poisoned across sessions. (II.8)

Multimodal - A model that handles more than text - images, audio, video. Each modality is an added injection surface. (II.4)

Prompt - The text input to a model: system prompt + user input + any appended content.

RAG - Retrieval-Augmented Generation - fetching documents at inference and feeding them to the model instead of retraining. (II.3)

System prompt - Hidden instructions setting a model’s role and rules; leakable (OWASP LLM07). Steers behavior, does not enforce it - not a security boundary.

Temperature - A sampling setting controlling how random/creative output is. Why a jailbreak or guardrail result is a rate, not a guarantee.

Token / tokenization - The subword units text is split into; models read and write tokens, not words.

Tool - A named function the model can ask to invoke; the model only requests it, the host runs it and returns the result. As dangerous as its privileges. (II.5)

Trust boundary - The line between zones of differing trust; in AI the decisive one is the path from untrusted content in to privileged action out. (I.6)

Model internals & training

Alignment - Training (RLHF/DPO) that makes a model helpful, honest, harmless. A behavioral layer, not a security boundary.

Attention - The transformer mechanism that weighs how much each token relates to every other token in view.

Base model - The model straight out of pre-training, before instruction-tuning or alignment.

DPO - Direct Preference Optimization - a post-training method to align a model from preference data.

Embedding - A vector representing the meaning of text/an image; nearby vectors mean similar content. (II.4)

Fine-tuning - Further training of a base model on narrower data to specialize it.

Foundation / frontier model - A large general model trained at scale; “frontier” = the most capable current generation. (II.16)

LoRA - Low-Rank Adaptation - lightweight fine-tuning that produces a small “adapter” file.

Model - The trained artifact - a file of weights that maps inputs to outputs.

Neural network - Layers of weighted connections whose weights are learned from data.

Parameters / weights - The billions of numbers learned during training; functionally, “the model” itself.

Pre-training - The first, largest training stage on broad web-scale data → a base model. (II.2)

Reasoning model / chain-of-thought (CoT) - A model that spends inference-time compute generating an internal reasoning trace before answering (OpenAI o-series, DeepSeek-R1). The trace is a new asset - it can leak secrets - and a fragile monitoring signal, since reasoning is often unfaithful to the real cause of the answer. (I.2)

RLHF - Reinforcement Learning from Human Feedback - alignment using human preference signals.

safetensors - A zero-code tensor format (Hugging Face) that stores raw weights + metadata and cannot execute code on load - unlike legacy pickle formats (.bin/.pt/.ckpt) whose __reduce__ enables RCE at torch.load(). The default answer to model-file supply-chain risk. (II.12)

SFT - Supervised Fine-Tuning - post-training on curated instruction/response examples.

Training - Learning weights from data; expensive, done once per model version.

Transformer - The dominant LLM architecture, built around attention.

Protocols & agent infrastructure

Agent Card - The JSON descriptor (commonly /.well-known/agent-card.json) by which an A2A agent advertises identity and capabilities; a spoofing target. (II.7)

A2A (Agent-to-Agent) - Open standard for agents delegating tasks to each other, across orgs. (II.7)

Data lake / warehouse - Central stores of raw or structured data (S3, Snowflake, Databricks, BigQuery) that feed training and RAG. (II.13)

Ingestion / ETL - The pipeline that pulls, transforms, chunks and embeds source data into the index; the poisoning entry point. (II.13)

MCP (Model Context Protocol) - Open standard connecting an agent to tools and data, over JSON-RPC (stdio or HTTP). Three roles, not machines: host (the app), client (the connector), server (exposes tools; can be local or remote). Servers expose three primitives - Tools (model-invoked functions), Resources (context data), Prompts (templates). (II.6)

Sampling (MCP) - One of three client primitives (with Roots and Elicitation): a server asks the client’s LLM to generate text (sampling/createMessage). Abused, it runs the attacker’s LLM work on your tokens (reverse trust) or launders instructions through the host model. The sampling.tools sub-capability (2025-11-25) lets a server drive an autonomous tool loop through your model, and human-in-the-loop is only a SHOULD. Deprecated in the 2026-07-28 revision. (II.6)

Roots (MCP) - The usually-forgotten third client primitive: the client declares which file:// directories are in scope (roots/list) and can change that scope mid-session. Not a security boundary - the docs state roots do not enforce security restrictions and that enforcement belongs to OS file permissions or a sandbox. Clients MUST validate root URIs against path traversal; servers only SHOULD respect the boundary. Deprecated in the 2026-07-28 revision. (II.6)

Elicitation (MCP) - A client primitive: a server asks the user for input or confirmation (elicitation/create). Abused, it becomes a phishing prompt that harvests data the app would never request. Form mode MUST NOT request passwords, keys, tokens or payment credentials - those MUST use URL mode, which carries a spec-documented cross-user account takeover and its own client MUSTs. The only client primitive surviving the 2026-07-28 revision. (II.6)

Logging (MCP) - A server capability, not a client primitive: server-to-client notifications/message, which the client MAY persist. Messages MUST NOT contain credentials, PII or internal system details. Deprecated in the 2026-07-28 revision, with stderr as the migration path. (II.6)

MCP extension - An opt-in protocol add-on with a reverse-DNS identifier (io.modelcontextprotocol/ui), advertised in capabilities, disabled by default, and versioned independently of the core protocol - so pinning a protocol version does not pin extension behavior or its security properties. Allowlist identifiers as exact strings, never prefixes. (II.6)

MCP Apps - The io.modelcontextprotocol/ui extension: server-supplied HTML/JS rendered in a sandboxed iframe inside the host, able to request tool calls, send messages, write to the model’s context, and request device permissions such as camera and microphone. Already shipping in eleven clients. (II.6)

Tool annotations (MCP) - Server-asserted hints about a tool’s behavior (readOnlyHint, destructiveHint, idempotentHint, openWorldHint). Clients MUST treat them as untrusted unless the server is trusted, so auto-approving on readOnlyHint: true is trusting attacker-controlled metadata. Allowlist on name, not the displayed title. (II.6)

Vector database - A store of embeddings for similarity search; powers RAG. Often weakly authenticated and internet-exposed - a top data-layer risk. (II.13)

Vector space / vector store - The geometric space of embeddings; a database of them powers RAG. (II.4)

Attacks & failures

Adversarial example (evasion) - An input perturbed, often imperceptibly, to make a model misclassify or misbehave. (II.1)

Backdoor - Hidden behavior triggered by a specific input, implanted via poisoned data or tampered weights.

Competing objectives - The jailbreak failure mode where a model’s instruction-following pressure overrides its safety training when the two conflict - the engine behind role-play, authority, and refusal-suppression attacks (Wei et al., 2023); paired with mismatched generalization. (II.18)

Data exhaust - Leftover AI data stores - forgotten vector DBs, prompt logs from abandoned or shadow projects - left unmanaged and exposed. (II.13)

Data poisoning - Corrupting training, fine-tuning, or RAG data so the model learns an attacker-chosen behavior; attaches at training time, unlike injection. (II.2, II.13)

Denial of wallet - Driving cost rather than downtime: forcing expensive inference (long reasoning chains, token floods) until the bill, not the outage, is the damage. (II.3)

Excessive agency - An agent given more capability, autonomy, or privilege than its task needs, enlarging the blast radius of any hijack (OWASP LLM06). (II.8)

Hallucination - Confident output that is fabricated or wrong - a primary cause of OWASP LLM09:2025 Misinformation.

Jailbreak - An input that bypasses a model’s safety alignment. Targets the model’s policy, where injection targets the app’s control flow. (II.3)

Lethal trifecta - The three conditions dangerous together: access to private data, exposure to untrusted content, and the ability to act externally. The core agent-risk lens. (II.8)

LLM Scope Violation - Aim Labs’ name for the EchoLeak (CVE-2025-32711) root cause: untrusted ingested content steering a model or agent to read and egress data outside the scope it was authorized to act on. (II.3)

Membership inference - Determining whether a specific record was in a model’s training data; a privacy attack on the training set. (II.2)

Memory poisoning - Writing a false “fact” into an agent’s persistent memory (vector store, profile, episodic history) so it re-fires across future sessions; the agentic persistence path, with no classic-malware equivalent. (II.8)

Mismatched generalization - The jailbreak failure mode where a model’s capabilities cover inputs its safety training never did, so the capability fires where the guardrail is absent - the engine behind encoding, low-resource-language, and adversarial-suffix attacks (Wei et al., 2023); paired with competing objectives. (II.18)

Model extraction (model theft) - Reconstructing a model’s behavior or weights through access, e.g. heavy querying to distill a clone. (II.2)

Model inversion - Reconstructing sensitive training inputs from a model’s outputs or parameters. (II.2)

Morris II - The first demonstrated worm for GenAI ecosystems: an adversarial self-replicating prompt that makes a model reproduce it, carries a payload, and hops to new agents via a shared RAG store or forwarded email (Cohen, Bitton & Nassi, 2024). (II.8)

Prompt injection - Malicious instructions in input or ingested content that hijack the model. Direct comes from the user; indirect arrives via content the model reads. (II.3)

Slopsquatting - Registering a package name a model routinely hallucinates, so the hallucination becomes a real install. A supply-chain attack created by model error rather than by typos. (II.12)

SSRF (server-side request forgery) - Coaxing a server, or an agent’s fetch / “summarize this URL” tool, into making requests on the attacker’s behalf - classically to reach a cloud metadata service and steal credentials. (II.11)

Supply chain (AI) - Risk in pulled-in components: pretrained models (unsafe deserialization), poisoned datasets, malicious or typosquatted packages and MCP servers, slopsquatting. (II.12)

System-prompt / spec extraction - Recovering a model’s hidden system prompt (OWASP LLM07) - its role, tool definitions, guardrail wording, and any embedded secrets - which hands an attacker the blueprint for further attacks. (II.3)

Tool poisoning - Hidden instructions placed in a tool’s description or schema, which the model reads as trusted; fires merely by the tool being connected. A rug pull swaps a clean description for a poisoned one later. (II.6)

Defenses

Agent / MCP gateway - A policy-enforcement point every agent tool call routes through instead of reaching tools or MCP servers directly; the action-layer analog of an API gateway, doing tool allowlisting, argument validation, egress control, and audit in one place. (III.1)

Agent / MCP gateway - A policy-enforcement point in front of tools and MCP servers: authenticates, authorizes, allowlists, rate-limits and logs every call centrally instead of trusting each server. (II.6, III.1)

CaMeL - A design-level defense that confines untrusted tool/retrieved content so it can never redirect privileged actions - the agent acts only on a trusted plan derived from the user’s request (Debenedetti et al., Google DeepMind, 2025). (III.1)

Circuit breaker - A runtime cut-out that halts an agent or a tool path once error, cost or anomaly thresholds trip - the automated half of a kill switch. (III.1)

Dual-LLM / quarantined-LLM - A pattern where the privileged model that holds the tools never sees raw untrusted data; a separate quarantined model reads that data but has no tools, so injected text cannot reach what can act. (III.1)

Guardrail - A runtime filter or policy that screens model inputs or outputs; a control to be measured, not a guarantee or a security boundary. (III.1)

Fail-open vs fail-closed - What a guardrail does when it errors, times out, or exceeds budget: fail-open allows the request (keeps availability, but a guardrail-crashing input becomes a bypass); fail-closed blocks it (keeps safety, but becomes a denial-of-service lever). Decide per control by the blast radius of the action behind it. (III.1)

Human-in-the-loop (HITL) - Requiring explicit human approval before a consequential or irreversible action. Only ever as strong as what the approval dialog actually shows the human. (I.10, III.1)

Instruction hierarchy - A built-in priority order for whose instructions win: system above user, user above tool/retrieved content - so content the model merely reads cannot assert system authority. (III.1)

Machine unlearning - Post-hoc removal of specific memorized data or facts from a trained model without full retraining (gradient-ascent unlearning, ROME/MEMIT edits); the practical answer to a GDPR/PDPA erasure request, but hard to verify and prone to residuals resurfacing. (II.2)

Role-aware retrieval - RAG retrieval that re-checks the requesting user’s permissions against document metadata, preventing permission stripping. (II.13)

SLSA - Supply-chain Levels for Software Artifacts: a graded standard for tamper-resistant builds and verifiable provenance. Build L3 means a hardened, isolated builder emitting provenance the builder itself cannot forge. (II.12)

Spotlighting - Wrapping untrusted content (retrieved docs, tool results, user files) in unique delimiters the model is told to treat as data, never instructions; blunts indirect injection but does not eliminate it (Hines et al., Microsoft, 2024). (III.1)

Identity & access

Confused deputy - A component that acts on requests using its own privileges rather than the caller’s; an over-privileged MCP server is the classic case. (II.6)

DPoP (RFC 9449) - Demonstrating Proof-of-Possession: binds a token to a key the client holds, so a stolen token is useless without the private key. The structural answer to bearer-token theft. (III.2)

Non-human identity (NHI) - The identity an agent, service, or workload authenticates with, as opposed to a human; needs least privilege, short-lived audience-bound credentials, and an action log. (III.2)

Sender-constrained (proof-of-possession) token - An OAuth token bound to a key or certificate the legitimate client holds (DPoP, RFC 9449; or mTLS-bound, RFC 8705), so a stolen bearer token alone is useless - the structural answer to token theft. (III.2)

Governance, assurance & evaluation

AIBOM - AI Bill of Materials - an inventory of the models, datasets, adapters, and components in an AI system, for provenance. (II.12, II.13)

AIMA - OWASP’s AI Maturity Assessment: the maturity lens that tells leadership where the organization sits overall and what the next level requires. (IV.4)

AISVS - OWASP AI Security Verification Standard (1.0, Jun 2026) - a catalog of testable AI-security requirements across the lifecycle, each at assurance Level 1/2/3; the ASVS-modeled “what good looks like” checklist. (IV.4)

AIVSS - OWASP AI Vulnerability Scoring System - extends CVSS v4.0 with agentic amplifiers (autonomy, tool-use scope, multi-agent, non-determinism) to score an AI/agent finding 0-10. The CVSS-equivalent for agentic AI. (IV.4)

ASR (attack-success rate) - The fraction of trials on which an attack works; because models are probabilistic, AI findings are reported as a rate under adaptive attack, not a single pass/fail. (II.17)

Capability threshold - A predefined level of dangerous capability that, once an eval shows a model crossing it, triggers stronger controls before release. (II.16)

Configuration / architecture review - Static assessment of how a system is set up against a baseline; finds misconfiguration, not novel exploits. Distinct from a pentest or red team. (II.21)

Evaluation (eval) / benchmark - A repeatable test set scoring a model or system on a capability or safety dimension as a rate; the unit of frontier-safety and guardrail measurement. (II.16, II.21)

Guardrails effectiveness assessment - Bounded, metric-driven evaluation of how reliably a guardrail enforces its policy (catch rate, false positives, coverage); control validation, not red teaming. (II.21)

ISO/IEC 42001 / 23894 - The certifiable AI management system standard (42001 - PDCA across clauses 4-10 plus normative Annex A controls) and its non-certifiable risk-method companion (23894, built on ISO 31000). (IV.4)

MAESTRO - A seven-layer threat-modeling method for agentic systems (Multi-Agent Environment, Security, Threat, Risk & Outcome) that walks the stack from the foundation model up to the agent ecosystem (CSA, 2025). (I.8)

MITRE ATLAS - The ATT&CK-style knowledge base of adversary tactics and techniques against AI systems (AML.Txxxx ids). The operational layer every “Maps to” footer in this book points at. (IV.1)

NIST AI RMF - The US voluntary risk process for AI: four functions (Govern, Map, Measure, Manage) serving seven trustworthy-AI characteristics; the GenAI Profile (AI 600-1) specializes it for generative AI. (IV.3)

Penetration test - Scoped, hands-on testing of a defined target against defined objectives; narrower than red teaming but still dynamic and adversarial. (II.17)

Red teaming (AI) - Adversarial, goal-driven testing - achieve a harmful outcome by any path; unbounded scope, qualitative deliverable. Contrast pentest and config review. (II.17)

SAIF (Secure AI Framework) - Google’s controls framework: four components (Data, Infrastructure, Model, Application), 15 risks, and 6 control categories, now stewarded vendor-neutrally as the CoSAI Risk Map. (IV.2)

Sandbagging - A model - or a vendor - underperforming on an evaluation on purpose, so measured capability understates real capability. Why held-out sets and adaptive testing matter. (II.16)

Shadow AI - Unsanctioned AI tools, copilots, and model endpoints in use across an organization without security’s knowledge; the discovery problem that precedes every other AI control. (III.3)

SSDF - NIST’s Secure Software Development Framework (SP 800-218), with SP 800-218A the generative-AI profile that adds AI-specific tasks. The spine of the secure AI SDLC. (I.9)

TEVV - Test, Evaluation, Verification & Validation - the assurance discipline that red teaming is one subset of; the umbrella for how model behavior is measured before and after deployment. (II.17)

Uplift - The marginal increase in a user’s ability to cause harm with the model, beyond what conventional tools already give them. The metric CBRN and cyber capability evaluations actually grade. (II.19)