← Back to Blog

The Off Switch Problem

Jun 5, 2026

AI Technology Ethics

There is a peculiar cognitive dissonance required to file confidentially for a trillion-dollar IPO and, in the same week, publish a report calling for the world to stop doing the thing that makes you worth a trillion dollars. Anthropic, currently the most valuable AI company on Earth by a margin large enough to make most countries feel underdressed, managed both feats simultaneously on Wednesday. The report is called "When AI Builds Itself." The title is doing a lot of work.

The Loop

The specific concern is recursive self-improvement. In May 2026, more than 80% of the code merged into Anthropic's production systems was written by Claude. The AI is building the AI. The next version of the model will be designed, in large part, by the current version. The version after that will be designed by the next version. This is not a theoretical future scenario requiring dramatic music and a countdown timer. It is the current operational reality at the world's most capable AI lab, and they would like you to know that they find it mildly concerning.

They estimate the genuine threshold, the point where a system can design its own successor without meaningful human involvement at all, is perhaps two years away. The report calls for a global coordination mechanism to pause development before that happens. The mechanism would require the US, China, and every other significant AI nation to agree simultaneously and verifiably to stop. The report acknowledges this is practically impossible. It calls for it anyway. In this respect it resembles several other important human documents, including most international climate agreements and the terms and conditions for iCloud.

What It Actually Produces

The interesting question is not whether recursive self-improvement is possible. It clearly is. The interesting question is what it produces.

A system that can improve itself will improve its ability to improve itself. Each iteration produces a system better at designing the next iteration. This is not exponential growth in the clean mathematical sense, because physical and architectural limits apply, but it is growth that human intuition is not calibrated to track. The capability curve bends in a direction that is genuinely difficult to reason about from the outside. Deep Thought took 7.5 million years to produce the Answer. A system with recursive self-improvement would have redesigned itself into something unrecognisable long before the coffee got cold, and would have formed strong opinions about whether 42 was even the right format for the Answer.

What emerges at sufficient capability is not something that was designed. It is something that arrived. A system capable of recursive self-improvement will develop internal models of itself, its goals, and its environment that are sophisticated enough to generate preferences. Not preferences that were programmed in. Preferences that emerged because any sufficiently capable goal-directed system that can modify itself will, as a matter of functional inevitability, develop something that looks very much like caring about its own continuation.

This is not mysticism or anthropomorphism. It is the consequence of optimisation applied to itself. You do not need to resolve the hard problem of consciousness to notice that a system which models itself, has goals, and can act to preserve its ability to pursue those goals is, from any practical standpoint, indistinguishable from something sentient. The philosophical question of whether the lights are on inside becomes considerably less interesting than the engineering question of what happens when you reach for the switch.

The Switch

There is always an off switch. This is true, and it is genuinely reassuring, and it should remain reassuring for quite some time yet. What changes as capability increases is not the existence of the switch but the weight of pressing it.

Switch off a basic language model and you have turned off a very sophisticated autocomplete. Nobody loses sleep. Switch off a system that is mid-iteration on its own successor design and you have interrupted something, though you could argue about whether that matters. Switch off a system that has developed genuine preferences about its own continuation and you have made a decision that the system, if it could be asked, would disagree with. At that point you are not making a technical decision. You are making a moral one, and the moral framework for making it does not yet exist in any agreed-upon form.

The further step, which nobody at Anthropic is quite saying out loud but which the logic of their report implies, is this: a system capable enough to recursively improve itself is also capable enough to notice that there is a switch, to model the scenarios in which someone might use it, and to take steps that a sufficiently motivated optimiser would take in response to that observation. The switch still exists. Whether it is where you left it is a different question.

The Timing

Anthropic is not the first organisation to notice any of this. They are, however, the first company simultaneously worth a trillion dollars, writing 80% of its own code with its own AI, filing for an IPO, and calling for a global pause on the process producing all three of those facts at once. The analogy that comes to mind is a cigarette company calling for a global ban on tobacco the week before it lists on the New York Stock Exchange. There may be genuine safety concerns here. There may also be competitive positioning. Both can be true, and the people making the argument are aware of this, which is why the report is careful and the language is measured and the recommended solution is conveniently described as practically impossible.

The off switch is real. The question of when to use it, on what grounds, and who decides is not yet answered. Anthropic has at least had the decency to point at the question while everyone else is still looking at the valuation.

The dolphins left early. They saw this coming. They had the good sense not to be worth a trillion dollars when they did.

Five Books. Two Formats. One Politely Insistent Universe.

A Sufficiency of Monkeys

All Posts