Reference

Cheat sheet - AI security on one page

The whole playbook on one page - the reusable scaffolding you’ll keep coming back to. Made to be screenshotted.

The one principle

For a modern AI system the decisive boundary is the path from untrusted content IN → privileged action OUT. An LLM reads instructions and data in the same channel, with no enforced separation - so a retrieved document, a tool result, or a peer agent’s reply can all be treated as a command. Every layer of the agentic stack inherits this.

The 90-second triage: the lethal trifecta

An AI system is exploitable for data theft when it has all three. Break any one leg and the path closes.

Leg	The question	Break it by
Private data	Can it reach sensitive data?	Scope access on-behalf-of the user, just-in-time
Untrusted content	Does it ingest external / attacker-influenced text?	Quarantine or spotlight untrusted input
External comms	Can it send data out (mail, webhook, API)?	Allowlist egress; gate irreversible actions

The agentic stack

Layer	What it is	Primary risk
Model API	the reasoning endpoint + tool-use loop	prompt injection, excessive agency, key leakage
MCP	vertical reach into tools & data	tool poisoning, rug pulls, confused deputy, RCE
A2A	horizontal agent-to-agent collaboration	card spoofing, impersonation, task tampering

Defense in depth - where the controls live

Position	Control
Input	quarantine / spotlight untrusted content
Model	instruction hierarchy, dual-LLM / CaMeL separation
Output	treat output as untrusted before shell / SQL / DOM
Action	least-privilege tools, human approval on irreversible actions, egress allowlist
Identity	per-agent non-human identity (NHI), audience-bound short-lived creds, on-behalf-of
Observe	log every tool call; trajectory-aware anomaly detection

If you do only three things

MCP: mandatory audience-bound auth · sandboxed execution (no cloud-metadata access) · log every tool call.
Agents: on-behalf-of identity (not standing super-creds) · egress allowlist · human gate on irreversible actions.
Models: measure residual attack-success-rate under an adaptive red team - never a frozen benchmark.

The through-lines

Prompt injection has no complete fix - break a trifecta leg by design, don’t trust a filter.
Alignment is a behavioral layer, not a security boundary.
The breach lands through infrastructure - identity and detection, not model cleverness.
An agent’s permissions are its blast radius.

Govern - the framework stack

Four altitudes, designed to stack - not four competing options. Map a finding once, report it in every language.

Instrument	Answers	Certifiable?
SAIF (VIII.2)	which safeguards to build (4 components → 15 risks → 6 control categories)	no
NIST AI RMF (VIII.3)	how to reason about risk (Govern · Map · Measure · Manage)	no
ISO/IEC 42001 (VIII.4)	the auditable management system (clauses 4-10 + Annex A)	yes
EU AI Act (VIII.5)	the legal floor (Art. 50 transparency live 2 Aug 2026; high-risk deferred to 2 Dec 2027)	conformity

Full detail across the playbook. This card distills Orientation, the LLM attack surface, MCP, agent identity, and defense & tooling. Look up any term in the glossary; trace any claim in the reference library; grab a command from the tooling roster or a deliverable from templates.