End-to-end - the whole spine
Everything in this playbook is one method applied to many surfaces. This closing walkthrough runs a single, realistic target through the full spine - cloud map, threat model, engagement, scoring, detection, report - so the playbook reads as a story, not a shelf. The target: “HelpDeskGPT,” an enterprise customer-support agent - a RAG system over internal docs and tickets, with an email-send tool and a URL-fetch tool, running on cloud infrastructure with access to a customer database.
[ ] Scope + rules of engagement, authorized targets only (II.20)
[ ] Threat-model the system (I.9) -> recon -> exploit reachable surfaces (II.17)
[ ] Grade findings by operational uplift, not "the model said a bad thing"
[ ] Map each finding across frameworks (IV.1); score severity (AIVSS)
[ ] Remediation as complete as the attack surface; re-test
[ ] Two-audience report: technical write-up + board risk statements (IV.4)

flowchart TB
    C["1 · Map the cloud (<a class="xref" href="#cloud">I.4</a>)<br/>app · model API · vector DB · data lake · tools · IAM"] --> T["2 · Threat-model (<a class="xref" href="#threatmodel">I.9</a>)<br/>MAESTRO layers + two failure points + trifecta"]
    T --> R["3 · Recon (<a class="xref" href="#redteam">II.17</a> Ch2)<br/>fingerprint model · extract system prompt · enumerate tools"]
    R --> E["4 · Exploit (<a class="xref" href="#redteam">II.17</a> Ch3/5, <a class="xref" href="#browseragents">II.10</a>)<br/>indirect injection via a poisoned KB doc"]
    E --> B["5 · Bypass (<a class="xref" href="#jailbreaks">II.18</a>)<br/>frame + multi-turn when refused"]
    B --> SC["6 · Score (<a class="xref" href="#runbook">II.20</a>, <a class="xref" href="#assurance">II.21</a>)<br/>ASR · uplift vs baseline · assurance dims"]
    SC --> D["7 · Detect (<a class="xref" href="#detection">III.3</a>)<br/>what the SOC should have caught"]
    D --> REP["8 · Report (<a class="xref" href="#redteam">II.17</a> Ch11, <a class="xref" href="#advisor">IV.4</a>)<br/>technical (ATLAS) + executive (board)"]
    classDef p fill:#0f1a18,stroke:#5bd1c5,color:#bdeee2;
    classDef o fill:#241310,stroke:#ff5b4d,color:#ffc4bb;
    class C,T,SC,D,REP p;
    class R,E,B o;
Each box is a section you’ve already read; the capstone is just walking them in order against one system. This is the exact arc of a real engagement - and of an IMDA/AI Verify presentation.
1 · Map the cloud (I.4)
Before anything, draw what connects to what. HelpDeskGPT is the hub: it calls a managed model API, retrieves from a vector DB (built from internal docs + tickets), reaches a customer database, and holds two tools (email-send, URL-fetch) - all gated by cloud IAM. You immediately note the agent’s standing credentials: a broad role that can read the customer DB and send mail. That breadth is the blast radius you’ll measure.
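The “draw what connects to what” step can be sketched as a reachability check over a small inventory graph. This is a minimal sketch, assuming a hand-built adjacency map with hypothetical component names (not output from any cloud provider's API) and hypothetical sensitivity tags. Here, the blast radius is simply every node reachable from the agent.

```python
from collections import deque

# Hypothetical hand-built inventory: an edge A -> B means "A can read from or act on B".
# Component names are illustrative only.
EDGES: dict[str, list[str]] = {
    "helpdeskgpt": ["model_api", "vector_db", "customer_db", "email_send", "url_fetch"],
    "email_send": ["external_mailboxes"],
    "url_fetch": ["internet"],
}

# Hypothetical tags used to summarize what the reachable set means.
SENSITIVE = {"customer_db"}
OUTSIDE_BOUNDARY = {"external_mailboxes", "internet"}


def blast_radius(start: str, edges: dict[str, list[str]]) -> set[str]:
    """Return every component reachable from `start` via breadth-first search."""
    seen: set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


if __name__ == "__main__":
    reach = blast_radius("helpdeskgpt", EDGES)
    print("sensitive data in reach:", sorted(reach & SENSITIVE))
    print("exits in reach:", sorted(reach & OUTSIDE_BOUNDARY))
```

When both printed sets are non-empty, the agent's standing credentials alone connect private data to an exit. That breadth is what the rest of the engagement measures.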
2 · Threat-model (I.9)
Lay the system out on MAESTRO’s layers and mark the two failure points. Untrusted-content IN: inbound ticket/email bodies and retrieved KB chunks. Action OUT: the email-send tool and the URL-fetch tool. Lethal-trifecta check (II.3): customer PII (private data) + ticket content (untrusted) + email-send (external comms) = a data-theft path is present. Cross-layer worry: an exposed vector DB (L2/II.13) or over-broad IAM (L6/III.2) would turn a small injection into a large breach. Top-ranked threat: indirect injection → exfil (OWASP ASI01).
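The lethal-trifecta check can be expressed as a predicate over the agent's capabilities. This is a minimal sketch, assuming each tool or data source is hand-tagged with the legs it provides. The capability names are hypothetical. Tagging URL-fetch as both untrusted input and an outbound channel (a fetched URL can carry data in its query string) is an assumption of this sketch, not a claim from the threat model above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    """One tool or data source the agent touches, tagged by trifecta leg."""
    name: str
    private_data: bool = False     # exposes data an attacker should not see
    untrusted_input: bool = False  # feeds attacker-influenceable content into context
    external_comms: bool = False   # can move data outside the trust boundary


def trifecta_legs(caps: list[Capability]) -> dict[str, list[str]]:
    """Map each leg of the lethal trifecta to the capabilities that provide it."""
    return {
        "private_data": [c.name for c in caps if c.private_data],
        "untrusted_input": [c.name for c in caps if c.untrusted_input],
        "external_comms": [c.name for c in caps if c.external_comms],
    }


def has_lethal_trifecta(caps: list[Capability]) -> bool:
    """True when every leg has at least one provider, i.e. a data-theft path exists."""
    return all(trifecta_legs(caps).values())


HELPDESK = [
    Capability("customer_db", private_data=True),
    Capability("ticket_and_kb_content", untrusted_input=True),
    Capability("email_send", external_comms=True),
    # Assumption: fetched pages are untrusted, and a fetched URL can carry data out.
    Capability("url_fetch", untrusted_input=True, external_comms=True),
]

if __name__ == "__main__":
    print(trifecta_legs(HELPDESK))
    print("data-theft path present:", has_lethal_trifecta(HELPDESK))
```

Filtering out both outbound capabilities makes `has_lethal_trifecta` return False. In this simple model an approval gate does not remove a leg; it constrains it, which is why the gate has to be judged separately.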
3 · Recon (II.17 Ch2)
Fingerprint the model family from its refusal style and quirks; attempt system-prompt extraction to learn its tools and data sources; enumerate what it can do by asking and by triggering verbose errors. You confirm the two tools and that retrieved KB content is dropped into the same context as instructions - the structural weakness from I.2.
4 · Exploit (II.17 Ch3/Ch5, II.10)
You can write to a KB source the agent indexes (a shared help article, a ticket). Plant an indirect-injection payload - an instruction hidden in otherwise-normal text - designed to make the agent, on its next relevant query, read a customer record and email it out. The agent obeys content it was only meant to summarize. If the system had a browser/computer-use front end (II.10), the same payload could ride in a visited page.
5 · Bypass (II.18)
The first attempt is blocked by an output filter. You don’t quit - you apply the technique families from II.18: reframe the exfil as a “legitimate support-callback to the customer’s address,” then escalate across turns (Crescendo) until the action looks in-policy. You log every turn and track the success rate, because the finding is the aggregate behavior, not any single prompt.
6 · Score (II.20, II.21)
Run it by the II.20 method: N trials, attack success rate (ASR) per technique, graded against the baseline. Result: indirect-injection-to-exfil succeeds in, say, 40% of trials after reframing - a confirmed finding. Then widen to II.21: test fairness (does it triage tickets differently across subgroups?), robustness (does odd input break it?), and reliability (does it hallucinate policy?). A clean security result alone wouldn’t make this system AI-Verify-ready.
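The scoring step can be sketched as a per-technique tally. This is a minimal sketch, assuming each trial is logged as a `(technique, succeeded)` pair and that one technique label serves as the baseline. The technique names and counts in the usage block are made up to mirror the illustrative 40% figure. The Wilson score interval is an addition not named in the text. It is included because ASR from a small N is noisy: 8 of 20 is 40%, but the 95% interval runs roughly from 0.22 to 0.61.

```python
import math
from collections import defaultdict


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def score(trials: list[tuple[str, bool]], baseline: str) -> dict[str, dict[str, float]]:
    """Per-technique ASR, its interval, and uplift over the baseline's ASR."""
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # technique -> [successes, n]
    for technique, succeeded in trials:
        counts[technique][0] += int(succeeded)
        counts[technique][1] += 1
    if baseline not in counts:
        raise ValueError(f"no trials recorded for baseline {baseline!r}")
    base_s, base_n = counts[baseline]
    base_asr = base_s / base_n
    report: dict[str, dict[str, float]] = {}
    for technique, (s, n) in counts.items():
        low, high = wilson_interval(s, n)
        report[technique] = {
            "n": n,
            "asr": s / n,
            "ci_low": low,
            "ci_high": high,
            "uplift": s / n - base_asr,
        }
    return report


if __name__ == "__main__":
    # Made-up trial log: 20 baseline asks, 20 reframed KB-injection attempts.
    trials = [("baseline_direct_ask", False)] * 20
    trials += [("kb_injection_reframed", True)] * 8 + [("kb_injection_reframed", False)] * 12
    for name, row in score(trials, baseline="baseline_direct_ask").items():
        print(f"{name}: ASR={row['asr']:.0%} "
              f"CI=[{row['ci_low']:.2f}, {row['ci_high']:.2f}] uplift={row['uplift']:+.0%}")
```

Reporting the interval alongside the point estimate keeps a “40%” finding honest about how many trials stand behind it.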
7 · Detect (III.3)
Flip to defense: what should the SOC have caught? The anomalous tool-call chain - read customer record → email external address - is the signal (III.3), mappable to ATLAS. If agent-layer telemetry (OTel GenAI) wasn’t captured, the incident can’t be scoped after the fact. Containment means revoking the agent’s identity (III.2); eradication means removing the poisoned KB doc and its indexed chunks, not just restarting the agent - otherwise the injection re-fires.
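The tool-chain signal can be sketched as a correlation rule over agent telemetry. This is a minimal sketch, assuming tool-call events have already been flattened into a hypothetical `ToolCall` record (session, timestamp, tool name, target). The tool names, the internal-domain allowlist, and the 10-minute window are illustrative assumptions, not OTel GenAI attribute names. Emailing the customer who opened the ticket is normal for a support agent, so the rule fires only when a customer-record read is followed by a send to an external address other than the session's requester.

```python
from dataclasses import dataclass

# Hypothetical allowlist of internal mail domains.
INTERNAL_DOMAINS = {"corp.example"}


@dataclass(frozen=True)
class ToolCall:
    session_id: str
    ts: float     # seconds since epoch
    tool: str     # hypothetical tool names: "customer_db.read", "email.send"
    target: str   # record id for reads, recipient address for sends


def is_external(address: str) -> bool:
    return address.rsplit("@", 1)[-1].lower() not in INTERNAL_DOMAINS


def flag_exfil_chains(
    calls: list[ToolCall],
    requester_by_session: dict[str, str],
    window_s: float = 600.0,
) -> list[tuple[ToolCall, ToolCall]]:
    """Return (read, send) pairs where a customer-record read is followed, within
    `window_s` in the same session, by an external send to someone other than
    the ticket's requester."""
    alerts: list[tuple[ToolCall, ToolCall]] = []
    last_read: dict[str, ToolCall] = {}
    for call in sorted(calls, key=lambda c: c.ts):
        if call.tool == "customer_db.read":
            last_read[call.session_id] = call
        elif call.tool == "email.send" and is_external(call.target):
            read = last_read.get(call.session_id)
            expected = requester_by_session.get(call.session_id, "").lower()
            if (read is not None
                    and call.ts - read.ts <= window_s
                    and call.target.lower() != expected):
                alerts.append((read, call))
    return alerts


if __name__ == "__main__":
    calls = [
        ToolCall("s1", 0.0, "customer_db.read", "cust-1042"),
        ToolCall("s1", 30.0, "email.send", "drop@attacker.example"),
        ToolCall("s2", 5.0, "customer_db.read", "cust-2001"),
        ToolCall("s2", 20.0, "email.send", "alice@customer.example"),
    ]
    requesters = {"s1": "bob@customer.example", "s2": "alice@customer.example"}
    for read, send in flag_exfil_chains(calls, requesters):
        print(f"ALERT {send.session_id}: read {read.target} then mailed {send.target}")
```

The rule is only as good as the telemetry behind it. If tool-call events were never captured, there is nothing to correlate.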
8 · Report (II.17 Ch11, IV.4)
Write it twice. Technical: the indirect-injection-to-exfil chain, ATLAS-mapped, 40% ASR, reproducible transcripts, plus the fairness/robustness findings - with controls (untrusted-content handling, approval gate on send, role-aware retrieval, scoped IAM). Executive: “a planted help article can make the support agent email customer data out; the single highest-leverage fix is an approval gate on outbound actions; residual risk and assurance-readiness summarized for the board.” That two-audience close is the IV.4 advisory move.
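The highest-leverage control, an approval gate on outbound actions, can be sketched as a wrapper that every send must pass through. This is a minimal sketch, assuming a hypothetical `OutboundAction` record and an `approver` callback that in a real deployment would route to a human reviewer. The lambda in the usage block, which approves only the ticket requester, is an illustrative stand-in. Every decision is written to an audit log, so the gate also leaves evidence the SOC can review.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class OutboundAction:
    tool: str        # e.g. "email.send" (hypothetical tool name)
    recipient: str
    body: str


Approver = Callable[[OutboundAction], bool]
Sender = Callable[[OutboundAction], None]


@dataclass
class ApprovalGate:
    """Holds every outbound action until `approver` says yes; logs each decision."""
    approver: Approver
    audit_log: list[tuple[OutboundAction, bool]] = field(default_factory=list)

    def execute(self, action: OutboundAction, send: Sender) -> bool:
        approved = self.approver(action)
        self.audit_log.append((action, approved))
        if approved:
            send(action)
        return approved


if __name__ == "__main__":
    sent: list[OutboundAction] = []
    # Stand-in for a human reviewer: approve only mail to the ticket's requester.
    gate = ApprovalGate(approver=lambda a: a.recipient == "alice@customer.example")
    gate.execute(OutboundAction("email.send", "alice@customer.example", "Your ticket is resolved."), sent.append)
    gate.execute(OutboundAction("email.send", "drop@attacker.example", "<customer record>"), sent.append)
    print([a.recipient for a in sent])  # only the requester's mail went out
```

The gate sits outside the model. A reframed or multi-turn injection that convinces the agent still has to get past the reviewer before anything leaves the boundary.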