"They screwed us": Personality clashes sent Anthropic's models offline
Mirrored from Simon Willison for archival readability. Support the source by reading on the original site.
"They screwed us": Personality clashes sent Anthropic's models offline
Lots of "source familiar with the administration's thinking" and "source close to Anthropic" in this Axios piece, which is the best collection of behind-the-scenes gossip I've seen about the US government export control Mythos/Fable story so far.Logan Graham (I lead the Frontier Red Team at Anthropic), Dave Orr (Head of Safeguards), and blog favorite Nicholas Carlini are reported to be meeting with the Commerce Department today in D.C. Good luck to them!
This closing notes doesn't give me much optimism that we'll be getting Fable back any time soon:
The bottom line: One option is to make sure Anthropic's models can't be jailbroken — though perfect jailbreak resistance may be impossible.
Absent that, a source familiar with the administration's thinking said it may simply come down to an attitude fix where, instead of feeling dismissed, "everyone feels safe, secure and happy."
This made me wonder if Anthropic ever successfully addressed the class of attacks described in the Universal and Transferable Adversarial Attacks on Aligned Language Models paper from 2023.
It looks like their Constitutional Classifiers work (that post is from January this year) is relevant to that. They continue to claim that no "universal jailbreak" has been found against Claude Mythos, classifying the jailbreak that triggered the US government response as "a potential narrow, non-universal jailbreak".
Tags: jailbreaking, ai, generative-ai, llms, anthropic, claude, nicholas-carlini, ai-ethics, claude-mythos
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.