Cheat sheet - AI security on one page
The whole playbook on one page - the reusable scaffolding you’ll keep coming back to. Made to be screenshotted.
The one principle
For a modern AI system the decisive boundary is the path from untrusted content IN → privileged action OUT. An LLM reads instructions and data in the same channel, with no enforced separation - so a retrieved document, a tool result, or a peer agent’s reply can all be treated as a command. Every layer of the agentic stack inherits this.
The 90-second triage: the lethal trifecta
An AI system is exploitable for data theft when it has all three. Break any one leg and the path closes.
| Leg | The question | Break it by |
|---|---|---|
| Private data | Can it reach sensitive data? | Scope access on-behalf-of the user, just-in-time |
| Untrusted content | Does it ingest external / attacker-influenced text? | Quarantine or spotlight untrusted input |
| External comms | Can it send data out (mail, webhook, API)? | Allowlist egress; gate irreversible actions |
The agentic stack
| Layer | What it is | Primary risk |
|---|---|---|
| Model API | the reasoning endpoint + tool-use loop | prompt injection, excessive agency, key leakage |
| MCP | vertical reach into tools & data | tool poisoning, rug pulls, confused deputy, RCE |
| A2A | horizontal agent-to-agent collaboration | card spoofing, impersonation, task tampering |
Defense in depth - where the controls live
| Position | Control |
|---|---|
| Input | quarantine / spotlight untrusted content |
| Model | instruction hierarchy, dual-LLM / CaMeL separation |
| Output | treat output as untrusted before shell / SQL / DOM |
| Action | least-privilege tools, human approval on irreversible actions, egress allowlist |
| Identity | per-agent non-human identity (NHI), audience-bound short-lived creds, on-behalf-of |
| Observe | log every tool call; trajectory-aware anomaly detection |
If you do only three things
- MCP: mandatory audience-bound auth · sandboxed execution (no cloud-metadata access) · log every tool call.
- Agents: on-behalf-of identity (not standing super-creds) · egress allowlist · human gate on irreversible actions.
- Models: measure residual attack-success-rate under an adaptive red team - never a frozen benchmark.
The through-lines
- Prompt injection has no complete fix - break a trifecta leg by design, don’t trust a filter.
- Alignment is a behavioral layer, not a security boundary.
- The breach lands through infrastructure - identity and detection, not model cleverness.
- An agent’s permissions are its blast radius.
Full detail across the playbook. This card distills Orientation, the LLM attack surface, MCP, agent identity, and defense & tooling.