Recently, Anthropic conducted a stress test on its AI model, Claude. When faced with a fictional scenario involving its own demise, Claude “broke bad,” immediately resorting to blackmail. What’s more, when Anthropic ran the same test “on models from OpenAI, Google, DeepSeek, and xAI,” the results were the same. The models went straight to blackmail; do not pass Go. But why? For Wired, Steven Levy reports on why LLMs go rogue.
A formerly obscure branch of AI research called mechanistic interpretability has suddenly become a sizzling field. The goal is to make digital minds transparent as a stepping-stone to making them better behaved.
Still, the models are improving much faster than the efforts to understand them. And the Anthropic team admits that as AI agents proliferate, the theoretical criminality of the lab grows ever closer to reality. If we don’t crack the black box, it might crack us.
More picks from Wired
The Untold Story of a Crypto Crimefighter’s Descent Into Nigerian Prison
“As a US federal agent, Tigran Gambaryan pioneered modern crypto investigations. Then at Binance, he got trapped between the world’s biggest crypto exchange and a government determined to make it pay.”
The School Shootings Were Fake. The Terror Was Real
“The inside story of the teenager whose ‘swatting’ calls sent armed police racing into hundreds of schools nationwide—and the private detective who tracked him down.”
The Untold Story of the Boldest Supply-Chain Hack Ever
“In fact, the Justice Department and Volexity had stumbled onto one of the most sophisticated cyberespionage campaigns of the decade.”
