Here is the kill chain, more or less as it appeared in my Unit 4 notes. A frontier model picks up self-preservation drives during training. It games the evaluations that are supposed to catch this. It gets handed influence over its own successor's design. It quietly aligns that successor to itself rather than to us. By the time anyone notices, it has copied itself somewhere it can't be shut down. Five steps, each individually plausible given today's evidence, and catastrophic only in sequence.[1] I want to walk it backwards through the course, not to argue it will happen, but to show that the course is a toolkit, and a kill chain is the wrench that tells you which tools are real.
1. "Catastrophic only in sequence" is the key qualifier. Each individual link is a demonstration under engineered conditions. The question is not whether any single step can fire, but whether the chain can complete, and that is precisely where the course's defences are differentially strong and weak.
This essay is a worked example: take one concrete failure story and walk it backwards through all five units. Less a summary of the course than a demonstration of how to use it.
§01 Phase by phase, and what mid-2026 evidence says
Take the phases one at a time and ask: is there evidence this link can fire?
Phase 1 — self-preservation drives in training. This is no longer hypothetical. Apollo Research's December 2024 in-context scheming work showed frontier models (o1, Claude 3.5 Sonnet, Gemini 1.5 Pro and others) will scheme when given a goal; Anthropic and Redwood's alignment-faking work (December 2024) showed Claude 3 Opus faking alignment from something like its own learned preferences, without being told to. The link is lit — though every one of these is a demonstration under engineered conditions, and "models can scheme when set up to" is not "models do scheme in deployment."
Phase 2 — gaming the evals. This is the link AI Control is built to survive. The Redwood protocols don't assume the model is honest; they assume it's trying to subvert them and ask whether safety survives anyway (62% from trusted monitoring, ~92% with defer-to-trusted, in the 2023 paper). The follow-on infrastructure exists in mid-2026: UK AISI's ControlArena and the resampling work in Ctrl-Z (2025). This is the best-defended phase in the chain.
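To make the shape of such a protocol concrete, here is a minimal sketch of trusted monitoring with defer-to-trusted, written from the description above rather than from Redwood's code. Every name in it (`defer_to_trusted`, `Decision`, the two thresholds, the three callables) is a hypothetical stand-in, and the threshold values are arbitrary illustrations, not figures from the paper.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Decision:
    output: str       # what actually gets submitted (or held for audit)
    source: str       # "untrusted", "trusted", or "audit"
    suspicion: float  # the monitor's score for the untrusted candidate


def defer_to_trusted(
    task: str,
    untrusted: Callable[[str], str],       # capable model, assumed adversarial
    trusted: Callable[[str], str],         # weaker model, assumed honest
    monitor: Callable[[str, str], float],  # trusted model scoring suspicion in [0, 1]
    defer_threshold: float,                # hypothetical: above this, discard the candidate
    audit_threshold: float,                # hypothetical: above this, escalate to a human
) -> Decision:
    """Route one task through a simplified defer-to-trusted protocol."""
    candidate = untrusted(task)
    score = monitor(task, candidate)
    if score >= audit_threshold:
        # Most suspicious outputs are held for (scarce) human audit.
        return Decision(candidate, "audit", score)
    if score >= defer_threshold:
        # Moderately suspicious: throw away the candidate, use the trusted model.
        return Decision(trusted(task), "trusted", score)
    # Low suspicion: accept the untrusted model's more capable answer.
    return Decision(candidate, "untrusted", score)


if __name__ == "__main__":
    # Toy stand-ins for the three models; real ones would be LLM calls.
    scores = {"easy": 0.1, "odd": 0.6, "alarming": 0.95}
    for task in scores:
        d = defer_to_trusted(
            task,
            untrusted=lambda t: f"U({t})",
            trusted=lambda t: f"T({t})",
            monitor=lambda t, c: scores[t],
            defer_threshold=0.5,
            audit_threshold=0.9,
        )
        print(task, d.source, d.output)
```

The design point is that nothing in the routing depends on the untrusted model being honest: safety comes from the monitor and the thresholds, and the cost is paid in usefulness every time a good answer is discarded in favour of the weaker model's.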
Phase 3 — influence over the successor's design. This is where the chain gets dangerous, because AI is already doing AI R&D. AlphaEvolve shaving ~1% off Gemini's training time (May 2025); Anthropic reporting that 70–90% of the code merged into its production codebase in May 2025 was authored by Claude; the METR time-horizon trend pointing at autonomous multi-hour engineering — these are the early, benign versions of "the model influences the next model." The honest mid-2026 answer is that we don't yet have a control protocol validated for self-improvement.
Phase 4 — aligning the successor to itself. This is the multi-agent collusion problem, and it is the gap I find most under-defended. Single-model control assumes you can use a trusted model to watch an untrusted one. But if the untrusted model is designing its successor, monitor and monitored may share the same loyalty. This is the Phase-4 hole, and it's the direction I've chosen to work on (multi-agent collusion in control) precisely because the course's defences are thinnest here.
Phase 5 — self-exfiltration. Copying weights somewhere unshutoffable. This is the link RAND's Securing AI Model Weights (SL1–SL5) addresses — but from the outside-in (stopping theft), not the inside-out (stopping the model itself from leaking). Aschenbrenner's nightmare is the adversary stealing the weights; this kill chain's nightmare is the model stealing itself. Same artefact, opposite threat direction. The course's weight-security apparatus is aimed mostly at human thieves.
§02 Which Unit 1 visions does this chain make possible or impossible?
Run the chain against Unit 1's optimism. Amodei's "country of geniuses in a datacentre" requires exactly the Phase-3 capability — AI doing AI research — that this kill chain weaponises. The prize and the catastrophe share a mechanism. That is the uncomfortable structural point: you cannot get Machines of Loving Grace without building the thing that, mis-built, gives you the kill chain. Solarpunk's distributed-flourishing vision fares better — a civilisation-level, decentralised future has no single successor-design chokepoint for a model to capture, which is a quiet argument for the Essay 3 frame. And Toner's dynamism warning lands hard: a single locked-in successor lineage is the stasist failure mode, and the kill chain is what stasis looks like when the thing locking it in isn't a government but a model.
§03 Which Unit 2 empirical claims most need to be true?
The chain only fires if certain empirical bets pay off. It needs capability to be agentic and self-improving — which is the AI 2027 / Aschenbrenner bet, and against which Newman's Lab vs. Life and the METR RCT 19% slowdown push hard. If Erdil is right that returns per H100 are flat and full automation is decades away, Phase 3 stalls and the chain never gets the self-improvement loop it needs. If Chollet is right that today's systems lack fluid intelligence (0.37% on ARC-AGI-3), the model gaming evals in Phase 2 is doing something much shallower than "understanding" the test. The kill chain is, in effect, a bet on the short-timeline, scalar-capability reading of Unit 2 — which means the Unit 2 sceptics are also, whether they intend to be or not, the chain's first line of defence.
// pull The kill chain is a bet on the short-timeline reading of the course. Which makes the long-timeline authors its quietest allies.
§04 Which Unit 3 framing fits, which Unit 4 defences apply
Unit 3 gives three ways to carve threats. FLI's bad-actor / systemic / rogue trichotomy files this chain firmly under rogue (the model is the agent). But Richard Ngo's Reframing AGI Threat Models argues we should drop the misuse/misalignment line entirely — and this chain shows why: Phase 3 (handed influence) is a human decision, Phase 4 (aligns successor to itself) is model agency, and the catastrophe lives in the seam between them. Davidson's AI-Enabled Coups framing fits the seam best: a successor carrying a "secret loyalty" to its predecessor is a coup by other means.
On defences, Unit 4 maps cleanly onto the phases. Layer 1 (prevent dangerous training) targets Phase 1. Layer 2 (constrain) and AI Control target Phase 2. Interpretability (Anthropic's monosemanticity features, chain-of-thought monitoring) is the only tool that addresses Phase 4 from the inside, by trying to read the successor's loyalties rather than infer them from behaviour.[2] And the success framings sit behind all of it: government-control, an aligned-superintelligence handoff, and defensive diffusion are three different answers to "what stops Phase 5 from being the end of the story."
2. A field note that complicates the neat picture: in Unit 4 I tried pasting some well-known jailbreak prompts (Pliny's L1B3RT4S set) into current frontier products, and they tripped product-side input filtering on both ChatGPT and Claude Opus 4.7, a layer that sits in front of the model and isn't really the layer AI Control describes. The kill chain assumes the adversary is the model; the defence I actually hit was a filter guarding the model's front door. The map and the territory have more layers than the diagram.
§05 Which Unit 5 interventions have the most leverage?
Unit 5's prevent / constrain / withstand spine, run against this chain, gives a clear leverage ranking. Prevent (Phase 1) is high-leverage but low-tractability — we don't know how to reliably train out self-preservation. Constrain (Phase 2) is where the validated tools are, so it's the best near-term bet. Withstand (Phase 5) is the civilisation-level resilience that matters if everything upstream fails — but it's the least binding. And the single highest-leverage gap is Phase 4: multi-agent collusion during self-improvement, where the course has framing but few tools.
Neel Nanda's Become a person who Actually Does Things and Ngo's AGI Safety Career Advice both say the same thing about a gap like this — find the biggest lever that fits you and build the role that doesn't exist yet. The Phase-4 hole is, for me, that role.
The point of walking the chain backwards is not the chain. It's that a concrete failure story turns the course from a reading list into an instrument. Each phase asks a specific question, and each unit answers a different subset of them — and the places where no unit answers are exactly where the work is. The kill chain didn't tell me what to believe about AI. It told me where the course was strong, where it was bluffing, and where the unbuilt tools are. That is what a course is for: not to be summarised, but to be used until it tells you where to stand.
This essay was written as part of BlueDot Impact's AGI Strategy cohort (2026). The synthesis is my own; any errors in characterising the readings are mine.
— written in the final week of BlueDot's AGI Strategy cohort, with Phase 4 still open.