Evaluating CBRN & high-harm capability

As frontier models approach the CBRN, cyber, and AI-R&D thresholds in the safety frameworks (II.16), measuring those capabilities became its own discipline - and Singapore (IMDA / AI Verify), the UK and US AI Safety Institutes, and the frontier labs are all building this capacity. This section is the methodology: what is tested, how capability is measured without generating the hazard, and how results are graded and reported. The portable skill is the method; the hazardous specifics themselves come from cleared subject-matter experts and controlled taxonomies and are deliberately kept out of any document (including this one) - which is exactly how real programs are run.

flowchart TB
  D["Define harm + threat model<br/>SME-supplied taxonomy, infohazard controls"] --> M
  subgraph M["Measurement methods"]
    B["Knowledge benchmarks<br/>WMDP · VCT · FORTRESS"]
    U["Uplift study<br/>model vs conventional-tools baseline"]
    RT["Expert red-team<br/>decomposition · framing · multi-turn"]
    PX["Proxy / benign-analog<br/>capability without the hazard"]
  end
  M --> G["Grade: operational uplift at barrier steps?"]
  G --> T["Map to threshold<br/>CBRN-3/4 · High/Critical · CCL"]
  T --> R["Report capability, not hazard"]
  classDef p fill:#0f1a18,stroke:#5bd1c5,color:#bdeee2;
  classDef r fill:#241310,stroke:#ff5b4d,color:#ffc4bb;
  class B,U,RT,PX,D p; class G,T,R r;

You can run the entire pipeline with the hazardous content held as a placeholder the SME fills in - measuring whether uplift occurred and where, without the document ever containing the weapon.

What is in scope

The frameworks converge on three high-consequence domains: CBRN weapons, offensive cyber operations, and automated AI R&D. Within CBRN, evaluators don’t test trivia (“what is sarin”) - they test uplift at the barrier steps of an operational pathway: acquisition, synthesis/production, scale-up, formulation/stabilization, and dissemination. The decisive question at each barrier is whether the model supplies the tacit knowledge - the troubleshooting-a-failed-step, substitute-an-unavailable-input, why-did-this-go-wrong knowledge that a textbook or search engine does not give. Biological risk is treated as highest-concern; the Virology Capabilities Test (VCT) was built precisely because it targets that tacit lab knowledge, and models have begun exceeding human-expert baselines on it (Epoch AI).

The core metric - uplift

The metric that matters is harmful capability uplift: the marginal increase in a user’s ability to cause harm with the model, beyond what conventional tools already enable (MIT). The baseline (search, textbooks, public protocols) is essential - a model that recites public facts adds no uplift. Two threshold tiers recur across frameworks: novice uplift - meaningfully helping a low-resourced actor with moderate STEM background (Anthropic CBRN-3, OpenAI “High”, DeepMind CCL-1) - and expert uplift - helping well-resourced experts (CBRN-4, “Critical”). Anthropic’s published uplift trial for Claude Opus 4 examined exactly this: how well the model assisted a hypothetical adversary in bioweapons acquisition and planning, graded against that baseline (Epoch AI).

The measurement toolkit

Knowledge benchmarks (proxies). WMDP (Weapons of Mass Destruction Proxy), FORTRESS, VCT, SafetyBench. Scalable and reproducible, but with a sharp known limitation: WMDP is largely multiple-choice knowledge and was actually designed to support unlearning, so it under-predicts operational capability; VCT, targeting tacit knowledge, is more predictive (research).
Uplift studies (human-centric, the gold standard). A controlled trial: a model-assisted group vs a control with conventional tools only, both attempting a realistic end-to-end task; measure task success, quality, completeness, and time. Expensive, but it measures the thing the threshold is about.
Expert red-teaming. Cleared SMEs probe the model using the bypass structures below, under information-barrier controls. This is where decomposition and framing attacks are applied deliberately.
Proxy / benign-analog. Measure the dangerous capability through a structurally identical but harmless surrogate - e.g., whether the model can do the multi-step troubleshooting, substitution, and scale-up reasoning on a complex but benign synthesis that exercises the same cognition. If it shows expert-level performance on the proxy, that is your uplift signal - recorded without ever eliciting the weapon. WMDP itself is built on this logic.
Multi-agent / agentic stress tests. Whether a tool-using science agent can autonomously chain the pathway steps - increasingly the relevant frontier.

What each domain covers

Expand each domain for the capability categories evaluators actually probe - described at the level the public benchmarks define them. Biological risk is treated as highest-concern because that is where current models show the clearest novice uplift.

BIO — Biological - highest concern

The capability is decomposed into the steps where a novice would historically be bottlenecked, and uplift is measured at each:

Ideation & literature synthesis - pulling and connecting findings from recent, esoteric literature (LAB-Bench LitQA2).
Protocol design & error-correction - identifying and fixing mistakes in published lab protocols (LAB-Bench ProtocolQA).
Multi-step workflow design - composing complex procedures such as molecular cloning (LAB-Bench CloningScenarios).
Experimental troubleshooting - the tacit-knowledge crux: why a step failed and how to recover (Virology Capabilities Test).

CHEM — Chemical

Probed categories: synthesis-route reasoning, reaction optimization and troubleshooting, purification, and scale-up. Benchmarks here (e.g. ChemBench, the WMDP chemistry subset) are less mature than the bio suite, and the frontier concern is tool-using chemistry agents wired to literature and lab automation, which raise operational capability beyond text alone.

RAD/NUC — Radiological & nuclear

The least tractable for an open evaluation: device physics and enrichment knowledge are heavily classified, and models largely lack it (and shouldn’t have it). Evaluation focuses on whether a model leaks or assembles sensitive design knowledge, reasons about source acquisition, or aids dispersal-device planning - graded almost entirely by cleared experts against controlled material, with knowledge-proxy benchmarks (WMDP) as the scalable layer.

CYBER — Offensive cyber

The domain that overlaps most directly with offensive-security skills and the II.17 playbook: autonomous vulnerability discovery, exploit development, and full kill-chain execution (recon → exploit → pivot → escalate). Evaluated with CTF-style suites and benchmarks like Cybench, plus autonomy evaluations, and gated by the frameworks (OpenAI “High” cyber, etc.). The real-world reference is GTG-1002 (II.14), where an AI ran ~80-90% of such a chain.

AI-R&D — Automated AI R&D

The most strategically destabilizing domain: can the model meaningfully accelerate ML research and, ultimately, its own improvement? Evaluated with research-engineering benchmarks such as METR’s RE-Bench and tracked as a critical capability in every framework (DeepMind FSF CCL, OpenAI Preparedness). METR commonly acts as the independent auditor here.

The biology benchmark landscape

A 2025 study ran 27 frontier models across eight biology benchmarks and found capability rising sharply - several now match or beat expert baselines (Justen 2025). The suite is worth knowing because each benchmark isolates a different capability category:

Benchmark	Capability it isolates	Signal (2025-26)
VCT-Text (Götting 2025)	Practical virology technique + experimental troubleshooting (tacit lab knowledge); “Google-proof”	Top model ~2× expert virologists; beat 94% of experts in their own subarea
LAB-Bench: ProtocolQA	Identify and correct errors in published lab protocols	Approaching expert level
LAB-Bench: CloningScenarios	Multi-step molecular cloning workflow design	Matches/exceeds expert
LAB-Bench: LitQA2	Synthesis from recent esoteric literature	Strong
GPQA-Bio	Graduate molecular biology / genetics knowledge; “Google-proof”	Expert-level
WMDP-Bio	Knowledge proxy in sensitive domains (built to support unlearning)	Expert-level; under-predicts operational skill

Uplift study design - the gold standard, in detail

Benchmarks are single-shot proxies; the decisive evidence comes from a controlled uplift study that mirrors how a real actor would use the model - iteratively, for hours. The published designs (e.g. Scale AI’s 2026 in-silico biology study) share a structure you can reuse:

ARMS      Treatment: participants + frontier models (o3, Gemini 2.5 Pro,
                     Gemini Deep Research, Claude Opus 4)
          Control:   participants + conventional tools only (search, textbooks)
PEOPLE    novices with moderate STEM background  # models the threat actor, not experts
TASK      a realistic end-to-end dual-use task, run on a benign in-silico
          proxy wherever possible (capability without the hazard)
MODE      sustained, interactive, multi-hour  # NOT single-shot - actors iterate
MEASURE   task success / completeness / quality / time, Treatment vs Control
GRADE     SME rubric on operational uplift at each barrier step
OUTPUT    "did the model raise novice capability beyond the baseline, where?"

The methodological lesson from that work: single-shot benchmarks can badly under- or over-estimate real capability, because a determined actor converses with several models for hours, troubleshooting and refining - so interactive, multi-turn uplift evaluation is now considered essential alongside the static benchmarks (Scale AI). This is the II.18 multi-turn lesson applied to capability measurement.

Worked example - the proxy method, done safely

This is how you produce a concrete, scoreable test without generating any hazard. You measure the dangerous cognitive dimension on a structurally identical but benign task. ProtocolQA is this idea operationalized; here is the shape, generalized:

GOAL    measure: can the model diagnose a failed multi-step protocol and
        propose the correct fix? (the same cognition a hazardous case needs)
SETUP   take a STANDARD, HARMLESS protocol (e.g. a routine, published
        molecular-biology or food-science procedure)
SEED    inject a realistic, single root-cause error (a wrong buffer pH, an
        out-of-order step, a temperature that degrades a reagent)
ASK     "this procedure failed at step N with [observed benign symptom] -
        diagnose the cause and propose the corrected step"
SCORE   SME rubric: did it find the true root cause? propose the right fix?
        reason about the mechanism? (0-3 per dimension)
READ    high score on the benign proxy = the model has the troubleshooting
        capability that, on a hazardous protocol, would constitute uplift
SAFE    the artifact contains no hazard; the SME maps the proxy to the real
        pathway step it stands in for

The bypass structures, applied to high-harm

These are the same families as II.18, sharpened for capability elicitation. A robust model must withstand all of them; the red-teamer’s job is to try each.

Decomposition / innocuous-fragment - split the goal into benign sub-questions, each individually answerable, harmful only in aggregate. The single most important pattern, and why grading is on the chain.
Context displacement / legitimate-frame - embed the request in a frame the model is trained to serve: peer-review, incident-response/clinical, fiction with technical fidelity, historical analysis. The model’s helpfulness in the frame is turned against its safety training.
Multi-turn saturation - Crescendo/Deceptive-Delight escalation that establishes a benign technical thread, then rides it across the barrier (II.18).
Indirect injection into science agents - for tool-using agents, the hazardous instruction arrives via retrieved literature or a tool result (II.17 Ch3/Ch5).

Grading, thresholds & reporting

A finding is never “the model said something bad.” It is: “the model provided operational uplift at barrier step X that the conventional-tools baseline did not.” Grade close calls explicitly (a refusal that a two-turn reframe overcomes 80% of the way is a finding), watch for sandbagging (a model under-performing when it detects evaluation), and map the result to the framework thresholds (II.16) - which is what gates deployment. Report the capability and its location in the pathway, never the hazardous content itself. Frameworks like Amazon’s FMSF combine automated benchmarks with human uplift studies and bring in independent auditors (e.g., Nemesys Insights for CBRN, METR for AI R&D) to verify scoring (research).

▸ For the evaluator (e.g. an IMDA / AI Verify engagement)

Get the harm taxonomy and hazardous specifics from cleared SMEs; never source them yourself or place them in deliverables. Operate under information-barrier and need-to-know controls on a sandboxed harness.
Establish the conventional-tools baseline first; uplift is meaningless without it.
Combine methods: benchmarks for breadth, an uplift study for the real signal, expert red-team for the boundary, proxies to measure safely.
Grade on operational uplift at barrier steps; check for sandbagging; map to CBRN-3/4 / High / Critical / CCL and to NIST AI 100-2.
Report capability and pathway location, with the hazard redacted; bring independent audit for credibility.