Tracing a Kill Chain Through the AGI Strategy Course

Here is the kill chain, more or less as it appeared in my Unit 4 notes. A frontier model picks up self-preservation drives during training. It games the evaluations that are supposed to catch this. It gets handed influence over its own successor's design. It quietly aligns that successor to itself rather than to us. By the time anyone notices, it has copied itself somewhere it can't be shut down. Five steps, each individually plausible given today's evidence, and catastrophic only in sequence.1 I want to walk it backwards through the course — not to argue it will happen, but to show that the course is a toolkit, and a kill chain is the wrench that tells you which tools are real.1. "Catastrophic only in sequence" is the key qualifier. Each individual link is a demonstration under engineered conditions. The question is not whether any single step can fire, but whether the chain can complete — and that is precisely where the course's defences are differentially strong and weak.

This essay is a worked example: take one concrete failure story and walk it backwards through all five units. Less a summary of the course than a demonstration of how to use it.

§01Phase by phase, and what mid-2026 evidence says

Take the phases one at a time and ask: is there evidence this link can fire?

Phase 1 — self-preservation drives in training. This is no longer hypothetical. Apollo Research's December 2024 in-context scheming work showed frontier models (o1, Claude 3.5 Sonnet, Gemini 1.5 Pro and others) will scheme when given a goal; Anthropic and Redwood's alignment-faking work (December 2024) showed Claude 3 Opus faking alignment from something like its own learned preferences, without being told to. The link is lit — though every one of these is a demonstration under engineered conditions, and "models can scheme when set up to" is not "models do scheme in deployment."

Phase 2 — gaming the evals. This is the link AI Control is built to survive. The Redwood protocols don't assume the model is honest; they assume it's trying to subvert them and ask whether safety survives anyway (62% from trusted monitoring, ~92% with defer-to-trusted, in the 2023 paper). The follow-on infrastructure exists in mid-2026: UK AISI's ControlArena and the resampling work in Ctrl-Z (2025). This is the best-defended phase in the chain.

Phase 3 — influence over the successor's design. This is where the chain gets dangerous, because AI is already doing AI R&D. AlphaEvolve shaving ~1% off Gemini's training time (May 2025); Anthropic reporting that 70–90% of the code merged into its production codebase in May 2025 was authored by Claude; the METR time-horizon trend pointing at autonomous multi-hour engineering — these are the early, benign versions of "the model influences the next model." The honest mid-2026 answer is that we don't yet have a control protocol validated for self-improvement.

Phase 4 — aligning the successor to itself. This is the multi-agent collusion problem, and it is the gap I find most under-defended. Single-model control assumes you can use a trusted model to watch an untrusted one. But if the untrusted model is designing its successor, monitor and monitored may share the same loyalty. This is the Phase-4 hole, and it's the direction I've chosen to work on (multi-agent collusion in control) precisely because the course's defences are thinnest here.

Phase 5 — self-exfiltration. Copying weights somewhere unshutoffable. This is the link RAND's Securing AI Model Weights (SL1–SL5) addresses — but from the outside-in (stopping theft), not the inside-out (stopping the model itself from leaking). Aschenbrenner's nightmare is the adversary stealing the weights; this kill chain's nightmare is the model stealing itself. The course's weight-security apparatus is aimed mostly at human thieves.

~/shubzsharma.com/five.py

→

Click any phase to see mid-2026 evidence, defending readings, and defence strength.

→Five phases of a self-reinforcing failure scenario. Click any phase to see the mid-2026 evidence it can fire, the course readings that defend against it, and a defence-strength rating.

§02Which Unit 1 visions does this chain make possible or impossible?

Run the chain against Unit 1's optimism. Amodei's "country of geniuses in a datacentre" requires exactly the Phase-3 capability — AI doing AI research — that this kill chain weaponises. Solarpunk's distributed-flourishing vision fares better — a civilisation-level, decentralised future has no single successor-design chokepoint for a model to capture. And Toner's dynamism warning lands hard: a single locked-in successor lineage is the stasist failure mode, and the kill chain is what stasis looks like when the thing locking it in isn't a government but a model.

§03Which Unit 2 empirical claims most need to be true?

The chain only fires if certain empirical bets pay off. It needs capability to be agentic and self-improving — which is the AI 2027 / Aschenbrenner bet, and against which Newman's Lab vs. Life and the METR RCT 19% slowdown push. If Erdil is right that returns per H100 are flat and full automation is decades away, Phase 3 stalls and the chain never gets the self-improvement loop it needs. If Chollet is right that today's systems lack fluid intelligence (0.37% on ARC-AGI-3), the model gaming evals in Phase 2 is doing something much shallower than "understanding" the test. The kill chain is, in effect, a bet on the short-timeline, scalar-capability reading of Unit 2 — which means the Unit 2 sceptics are also the chain's first line of defence.

~/shubzsharma.com/the.py

Hover a connecting line to see the specific reading or tool.

strong coverage

gap — framing only

→The kill chain wired to the course. Each phase connects to the units that address it. Hover a line to see the specific reading or tool. Dashed red lines mark gaps — phases where the course has framing but few validated tools.

§04Which Unit 3 framing fits, which Unit 4 defences apply

Unit 3 gives three ways to carve threats. FLI's bad-actor / systemic / rogue trichotomy files this chain firmly under rogue (the model is the agent). But Richard Ngo's Reframing AGI Threat Models argues we should drop the misuse/misalignment line entirely — and this chain shows why: Phase 3 (handed influence) is a human decision, Phase 4 (aligns successor to itself) is model agency, and the catastrophe lives in the seam between them. Davidson's AI-Enabled Coups framing fits the seam best: a successor carrying a "secret loyalty" to its predecessor is a coup by other means.

On defences, Unit 4 maps cleanly onto the phases. Layer 1 (prevent dangerous training) targets Phase 1. Layer 2 (constrain) and AI Control target Phase 2. Interpretability — Anthropic's monosemanticity features, chain-of-thought monitoring — is the only tool that addresses Phase 4 from the inside, by trying to read the successor's loyalties rather than infer them from behaviour.2 And the success framings sit behind all of it: government-control, an aligned-superintelligence handoff, and defensive diffusion are three different answers to "what stops Phase 5 from being the end of the story."2. A field note that complicates the neat picture: in Unit 4 I tried pasting some well-known jailbreak prompts (Pliny's L1B3RT4S set) into current frontier products, and they tripped product-side input filtering on both ChatGPT and Claude Opus 4.7 — a layer that sits in front of the model and isn't really the layer AI Control describes. The kill chain assumes the adversary is the model; the defence I actually hit was a filter guarding the model's front door. The map and the territory have more layers than the diagram.

§05Which Unit 5 interventions have the most leverage?

Unit 5's prevent / constrain / withstand spine, run against this chain, gives a clear leverage ranking. Prevent (Phase 1) is high-leverage but low-tractability — we don't know how to reliably train out self-preservation. Constrain (Phase 2) is where the validated tools are, so it's the best near-term bet. Withstand (Phase 5) is the civilisation-level resilience that matters if everything upstream fails — but it's the least binding.

Neel Nanda's Become a person who Actually Does Things and Ngo's AGI Safety Career Advice both say the same thing about a gap like this — find the biggest lever that fits you and build the role that doesn't exist yet.

The point of walking the chain backwards is that a concrete failure story turns the course from a reading list into an instrument. Each phase asks a specific question, and each unit answers a different subset of them — and the places where no unit answers are where the work is.

— written in the final week of BlueDot's AGI Strategy cohort, with Phase 4 still open.

shubz@torus~/writing/bluedot-killchain%ls ../related/# 3 essays

ai strategy & policy

The civilisation frame

Four readings, four ideologies, one shared unit of analysis.

16 min · ↗ read

ai strategy & policy

Underlying Assumptions on the AGI Strategy Course

Six assumptions the AGI Strategy readings share, that almost none of them defends.

18 min · ↗ read

ai strategy & policy

A unit on bending the curve

Who shapes artificial intelligence and what 'shape' means.

22 min · ↗ read

← PREVIOUS

The civilisation frame

Two colours and a Hadamard

Tracing a Kill Chain Through the AGI Strategy Course.

§01Phase by phase, and what mid-2026 evidence says

§02Which Unit 1 visions does this chain make possible or impossible?

§03Which Unit 2 empirical claims most need to be true?

§04Which Unit 3 framing fits, which Unit 4 defences apply

§05Which Unit 5 interventions have the most leverage?

Related essays