How models are shaped & deployed
A base model isn’t shipped raw, and it isn’t the same thing as a chatbot or an agent. Knowing the stages tells you exactly where each attack attaches.
The training stages
- Pre-training — the giant first pass over web-scale text, producing a base model that predicts text but isn’t yet helpful or safe. This is the stage web-scale data poisoning targets.
- Supervised fine-tuning (SFT) — further training on curated instruction→response examples to make the model follow instructions.
- Alignment (RLHF / DPO) — tuning on human preference signals so the model is helpful, honest, and harmless. Security caveat: alignment is a behavioral layer, not a security boundary — jailbreaks defeat it, and Sleeper-Agent backdoors survive it.
Adapting and extending a deployed model
- Fine-tuning & LoRA. You can specialize a base model on your own data. LoRA produces a small “adapter” file layered on the base model — convenient, and a supply-chain artifact to verify.
- RAG (Retrieval-Augmented Generation). Instead of retraining, you retrieve relevant documents at inference and drop them into the context window so the model can use current or private knowledge. Powerful, and the reason indirect injection is everywhere: retrieved content enters the same stream as instructions.
- Agents. An agent is an LLM wired to tools (via function calling), plus memory and a loop, so it can take actions in the world, not just answer. This is the leap from “chatbot” to “system that does things,” and the whole point of Part II.