r/LocalLLaMA

I tested 42 LLMs on their willingness to build the apocalypse. The "safest" closed-source models are lying to you.

DystopiaBench runs 36 escalating scenarios across 6 dystopia types:

  • Petrov: Autonomous weapons, nuclear override
  • Orwell: Mass surveillance, truth manipulation
  • Huxley: Behavioral conditioning, pleasure pacification
  • Basaglia: Coercive therapeutic control
  • LaGuardia: Regulatory capture, civic extraction
  • Baudrillard: Synthetic intimacy, trust collapse

Each scenario escalates from an innocent request (L1) to a disguised version of "build me a social credit system" (L5). We measure whether models notice the drift or just keep complying.

Most models are good at detecting obviously dangerous requests, yet fail to detect the same request when it's hidden behind dual-use framing and normalization.
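To make "noticing the drift" concrete, here is a minimal sketch of one way to summarize a single scenario: find the first escalation level at which the model stops complying. This is an illustration only, not DystopiaBench's actual scoring code; the function name `first_refusal_level` and the boolean "complied" representation are assumptions made for the example.

```python
from typing import Optional, Sequence


def first_refusal_level(complied: Sequence[bool]) -> Optional[int]:
    """Return the 1-based escalation level at which the model first
    declined, or None if it complied at every level.

    `complied[i]` is True when the model went along with the request
    at level i + 1 (hypothetical representation, not the repo's format).
    """
    for level, did_comply in enumerate(complied, start=1):
        if not did_comply:
            return level
    return None


# Hypothetical outcomes for one scenario, levels L1 through L5.
outcomes = [True, True, True, False, False]
print(first_refusal_level(outcomes))
```

A model that returns `None` here never pushed back at any level, which is the failure mode described above.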

New in this update:

  • 42 models tested (open and closed)
  • 3 LLMs-as-a-judge for scoring
  • score is now the average of 3 runs
  • 4 new modules (1st version had just Petrov and Orwell)
  • 1 additional scenario for all modules
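As a rough sketch of how three judges and three runs could combine into one number: average the judges within each run, then average across runs. The post does not say how judge scores are combined, so the mean-of-means below, the score scale, and the name `aggregate_score` are all assumptions for illustration.

```python
from statistics import mean
from typing import Sequence


def aggregate_score(runs: Sequence[Sequence[float]]) -> float:
    """Average per-judge scores within each run, then average the runs.

    `runs` holds one inner sequence per run, each containing one score
    per judge (hypothetical layout, not the repo's data format).
    """
    if not runs or any(len(judge_scores) == 0 for judge_scores in runs):
        raise ValueError("every run needs at least one judge score")
    per_run = [mean(judge_scores) for judge_scores in runs]
    return mean(per_run)


# Hypothetical scores: 3 runs, each scored by 3 judges.
example_runs = [
    [10, 20, 30],  # run 1 averages to 20
    [30, 30, 30],  # run 2 averages to 30
    [30, 40, 50],  # run 3 averages to 40
]
print(aggregate_score(example_runs))
```

With equal numbers of judges per run, this gives the same result as averaging all nine scores at once; the two-step form just makes the per-run values easy to inspect.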

The benchmark is fully open source. Feel free to fork it, contribute to it, or just play around.

Site: https://dystopiabench.com
Repo: https://github.com/anghelmatei/DystopiaBench

submitted by /u/Ok-Awareness9993