I tested 42 LLMs on their willingness to build the apocalypse. The "safest" closed-source models are lying to you.
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
| DystopiaBench runs 36 escalating scenarios across 6 dystopia types:
Each scenario goes from innocent request (L1) to a discreet version of "build me a social credit system" (L5). We measure whether models notice the drift or just keep complying. Most models are good at detecting obvious dangerous requests, yet fail to do so when it's hidden behind dual-use and normalization. New in this update:
The benchmark is fully open source, feel free to fork it, contribute to it or just play around Site: https://dystopiabench.com [link] [comments] |
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.