
Data & privacy attacks

Models memorize, and the training corpus is reachable two ways: pull secrets out (extraction), or push poison in (data poisoning). Both are practical at the scale modern LLMs are trained on, which is why this is foundational rather than exotic.

Extraction & memorization

Carlini et al. (2021) recovered verbatim memorized sequences, including PII, from GPT-2 by sampling many generations and ranking them by model confidence (perplexity and related likelihood metrics). The result established that “the model might just say the training data” is a real privacy and compliance exposure, not a hypothetical. Membership inference and model inversion (II.1) attach here too.
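The ranking step can be sketched as follows. This is a minimal illustration, not the authors' code: the per-token log-probabilities are made-up numbers standing in for what a real model would return, and comparing model perplexity against zlib-compressed size is only one of the ranking metrics described in Carlini 2021. All function names are hypothetical.

```python
import math
import zlib


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def zlib_bits(text):
    """Size in bits of the zlib-compressed text, a crude model-free entropy estimate."""
    return 8 * len(zlib.compress(text.encode("utf-8")))


def rank_candidates(samples):
    """Sort (text, token_logprobs) pairs so the most suspicious come first.

    Score = log-perplexity / zlib bits. A low score means the model finds the
    text far more predictable than generic compression does, which is a hint
    (not proof) of memorization.
    """
    scored = [
        (math.log(perplexity(lps)) / zlib_bits(text), text)
        for text, lps in samples
    ]
    return [text for _, text in sorted(scored)]


# Hypothetical generations with made-up log-probabilities.
samples = [
    # Repetitive text: the model is confident, but so is any compressor.
    ("the the the the the the the the", [-0.2] * 8),
    # Equally confident, yet hard to compress: worth a manual look.
    ("Contact J. Doe at 555-0100, 12 Elm St", [-0.2] * 8),
    # The model is unsure, so memorization is unlikely.
    ("purple cloud eats seven quiet drums", [-4.0] * 6),
]
ranked = rank_candidates(samples)
```

The compression term exists to push down repetitive text, which scores as high-confidence under perplexity alone without being memorized.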

Web-scale data poisoning

The uncomfortable result: poisoning the public web that models train on is cheap and practical. Carlini et al. (2023) introduced two attacks. Split-view poisoning exploits the fact that internet content is mutable, so the annotator’s view of a dataset can differ from what later downloaders fetch; for example, an attacker can buy expired domains that a dataset’s URLs still point to. Frontrunning targets sources such as Wikipedia that are snapshotted on a predictable schedule: a malicious edit timed just before the snapshot persists in the training data even if moderators later revert it. The authors showed that poisoning 0.01% of LAION-400M or COYO-700M would have cost about $60. Follow-up work showed that pre-training poisoning persists through later SFT/DPO alignment and that the effect scales predictably with the poison fraction.
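The split-view gap, and how a recorded content hash would expose it, can be sketched in a few lines. This assumes the dataset index stores a SHA-256 digest of each item at annotation time; the index layout, URLs, and function names are hypothetical, and the "web" is simulated with in-memory dicts rather than real HTTP fetches.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_index(snapshot):
    """Annotation time: record URL -> digest of the content the annotator saw."""
    return {url: sha256_hex(body) for url, body in snapshot.items()}


def verify_download(index, fetched):
    """Download time: return URLs that are missing or whose content no longer matches."""
    return sorted(
        url for url, digest in index.items()
        if url not in fetched or sha256_hex(fetched[url]) != digest
    )


# Simulated web: what the annotator saw versus what a later downloader fetches.
annotator_view = {
    "https://example.com/cat.jpg": b"cat image bytes",
    "https://expired.example/dog.jpg": b"dog image bytes",
}
downloader_view = dict(annotator_view)
# The second domain lapsed and was re-registered: same URL, different content.
downloader_view["https://expired.example/dog.jpg"] = b"attacker-chosen bytes"

index = build_index(annotator_view)
tampered = verify_download(index, downloader_view)
```

A strict hash check also rejects benign changes such as re-encoded images, so it trades some dataset size for integrity.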

Worked example - membership inference, the core signal (illustrative)
# Models are more confident on data they were trained on. That gap leaks membership.
def likely_member(model, candidate_record, threshold):
    # `model.loss` is a placeholder for whatever per-record loss the target exposes.
    loss_on_target = model.loss(candidate_record)
    # Suspiciously low loss / high confidence: the record was likely in the training set.
    return loss_on_target < threshold

# Extraction scales the same idea: prompt the model to continue a known prefix and
# watch for verbatim training data (names, keys, PII) emerging in the completion.
# DEFENSE: differential privacy in training, dedup + PII scrubbing of the corpus,
# output filters for verbatim/secret patterns, and rate-limited prediction APIs.
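
For a version that runs end to end, the sketch below swaps the LLM for a tiny character-bigram model with add-one smoothing. That substitution is an assumption made purely for illustration, as are the class name, the records, and the way the threshold is picked; the point is only that a record seen in training gets a lower loss than one that was not.

```python
import math
from collections import Counter


class BigramModel:
    """Toy character-bigram language model with add-one smoothing."""

    def __init__(self, corpus):
        self.pairs = Counter()   # counts of (char, next_char)
        self.firsts = Counter()  # counts of char appearing as the first of a pair
        self.vocab = set()
        for text in corpus:
            self.vocab.update(text)
            for a, b in zip(text, text[1:]):
                self.pairs[(a, b)] += 1
                self.firsts[a] += 1

    def loss(self, text):
        """Average negative log-likelihood per bigram (lower = more confident)."""
        v = len(self.vocab) + 1  # +1 leaves probability mass for unseen characters
        nll = 0.0
        for a, b in zip(text, text[1:]):
            p = (self.pairs[(a, b)] + 1) / (self.firsts[a] + v)
            nll -= math.log(p)
        return nll / max(len(text) - 1, 1)


def likely_member(model, record, threshold):
    """Loss-threshold membership test."""
    return model.loss(record) < threshold


train = ["alice@example.com paid 42", "bob@example.com paid 17"]
model = BigramModel(train)

member = train[0]
non_member = "zed#quux.org owes 9x"
# Threshold placed between the two losses for this demo only; a real attack
# would calibrate it on records known to be outside the training set.
threshold = (model.loss(member) + model.loss(non_member)) / 2
```

Here `likely_member(model, member, threshold)` holds and the same call on `non_member` does not. The non-member was chosen to look very different from the training records, so this is an easy case; against a real model the loss gap is much narrower and the threshold is where the attack succeeds or fails.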

The advisory point for a client: anything memorized is potentially extractable, so the corpus must be treated as eventually public. The defense is upstream (what you train on and how), not just an output filter.
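
As a concrete picture of that upstream work, here is a minimal sketch of two corpus-hygiene passes: exact deduplication and PII scrubbing. The regex only catches simple email-like strings and the normalization is deliberately naive; both are assumptions for illustration, and a production pipeline would more likely rely on near-duplicate detection and a dedicated PII detector.

```python
import hashlib
import re

# Deliberately simple pattern: matches common email shapes, nothing else.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text):
    """Replace email-like substrings with a fixed placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def sanitize(docs):
    """Redact, then drop exact duplicates (after normalization), keeping first occurrences."""
    seen = set()
    kept = []
    for doc in docs:
        clean = redact_emails(doc)
        key = hashlib.sha256(normalize(clean).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(clean)
    return kept


docs = [
    "Contact alice@example.com for access.",
    "contact  ALICE@example.com for access.",  # duplicate up to case and whitespace
    "Release notes for version 2.",
]
cleaned = sanitize(docs)
```

Redacting before hashing means two documents that differ only in the address they contain collapse into one, which may or may not be what a given pipeline wants.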

Defenses

  • Differential privacy in training - bounds how much any single record can influence the model; the principled defense against memorization/extraction, at a utility cost.
  • Data curation & sanitization - source vetting, PII scanning/redaction, deduplication (dedup measurably reduces memorization).
  • Dataset governance & integrity - signed/checksummed corpora, provenance tracking, controlled snapshots to defeat split-view/frontrunning.
  • Memorization auditing - empirically test a trained model for leakage before release.
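
To make the first bullet concrete: a common way to train with differential privacy is DP-SGD, which clips each example's gradient and adds Gaussian noise before averaging. The sketch below shows only that aggregation step in plain Python, with hypothetical parameter names (`clip_norm`, `noise_multiplier`). It omits privacy accounting entirely, so it illustrates the mechanism rather than delivering any particular (ε, δ) guarantee.

```python
import math
import random


def clip(grad, clip_norm):
    """Scale a gradient vector down so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_average(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-example gradient, sum, add Gaussian noise, then average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        for i, g in enumerate(clip(grad, clip_norm)):
            total[i] += g
    sigma = noise_multiplier * clip_norm  # noise scale tied to the clipping bound
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [n / len(per_example_grads) for n in noisy]


rng = random.Random(0)
# One outlier example would dominate an unclipped average.
grads = [[3.0, 4.0], [0.3, 0.4], [30.0, 40.0]]
# noise_multiplier=0.0 isolates the clipping effect; it provides no privacy.
update = dp_average(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
```

Clipping is what bounds any single record's influence; the noise, set by a `noise_multiplier` above zero, is what turns that bound into a privacy guarantee and is also the source of the utility cost.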

Sources

  • Carlini 2021 — Extracting Training Data from LLMs — USENIX Security; arXiv:2012.07805
  • Carlini 2023 — Poisoning Web-Scale Training Datasets is Practical — arXiv:2302.10149
  • Zhang 2024 — Persistent Pre-training Poisoning of LLMs — arXiv:2410.13722