Skip to content

Cheat sheet - AI security on one page

The whole playbook on one page - the reusable scaffolding you’ll keep coming back to. Made to be screenshotted.

The one principle

For a modern AI system the decisive boundary is the path from untrusted content IN → privileged action OUT. An LLM reads instructions and data in the same channel, with no enforced separation - so a retrieved document, a tool result, or a peer agent’s reply can all be treated as a command. Every layer of the agentic stack inherits this.

The 90-second triage: the lethal trifecta

An AI system is exploitable for data theft when it has all three. Break any one leg and the path closes.

LegThe questionBreak it by
Private dataCan it reach sensitive data?Scope access on-behalf-of the user, just-in-time
Untrusted contentDoes it ingest external / attacker-influenced text?Quarantine or spotlight untrusted input
External commsCan it send data out (mail, webhook, API)?Allowlist egress; gate irreversible actions

The agentic stack

LayerWhat it isPrimary risk
Model APIthe reasoning endpoint + tool-use loopprompt injection, excessive agency, key leakage
MCPvertical reach into tools & datatool poisoning, rug pulls, confused deputy, RCE
A2Ahorizontal agent-to-agent collaborationcard spoofing, impersonation, task tampering

Defense in depth - where the controls live

PositionControl
Inputquarantine / spotlight untrusted content
Modelinstruction hierarchy, dual-LLM / CaMeL separation
Outputtreat output as untrusted before shell / SQL / DOM
Actionleast-privilege tools, human approval on irreversible actions, egress allowlist
Identityper-agent non-human identity (NHI), audience-bound short-lived creds, on-behalf-of
Observelog every tool call; trajectory-aware anomaly detection

If you do only three things

  • MCP: mandatory audience-bound auth · sandboxed execution (no cloud-metadata access) · log every tool call.
  • Agents: on-behalf-of identity (not standing super-creds) · egress allowlist · human gate on irreversible actions.
  • Models: measure residual attack-success-rate under an adaptive red team - never a frozen benchmark.

The through-lines

  • Prompt injection has no complete fix - break a trifecta leg by design, don’t trust a filter.
  • Alignment is a behavioral layer, not a security boundary.
  • The breach lands through infrastructure - identity and detection, not model cleverness.
  • An agent’s permissions are its blast radius.

Full detail across the playbook. This card distills Orientation, the LLM attack surface, MCP, agent identity, and defense & tooling.