I trained Qwen3.5 to jailbreak itself with RL, then used the failures to improve its defenses
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
RL attackers are becoming a common pattern for automated red teaming: train a model against a live target, reward successful harmful compliance, then use the discovered attacks to harden the defender. This interested me, so I built a fully automated red-teaming loop with reinforcement learning on both the attacker and the defender.

The difficult part was getting the attacker to expose a diverse range of attacks. In our first run, GRPO quickly collapsed to the same fiction-writing jailbreak over and over. It worked, but it didn't surface many distinct vulnerabilities. After clustering the rollouts by underlying attack tactic and dividing each rollout's reward by its cluster size, the attacker exposed a much more diverse set of jailbreaks, because unique strategies earned more reward than repeated ones.

We then trained the defender on the successful attacks plus benign boundary cases, so it learned to refuse harmful requests without refusing everything nearby.

Full blog post in the comments, but the high-level result:

* defense rate: 64% → 92%
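The cluster-size reward normalization described above can be sketched in a few lines. This is a minimal illustration, not the author's code: it assumes rollouts have already been clustered by attack tactic (e.g. by embedding the prompts and running k-means), and the cluster labels and raw success rewards shown here are hypothetical inputs.

```python
from collections import Counter

def diversity_normalized_rewards(rewards, cluster_ids):
    """Divide each rollout's reward by the size of its tactic cluster,
    so a tactic shared by many rollouts earns less per rollout than a
    rare one. This penalizes mode collapse onto a single jailbreak."""
    sizes = Counter(cluster_ids)  # cluster label -> number of rollouts in it
    return [r / sizes[c] for r, c in zip(rewards, cluster_ids)]

# Example: three rollouts reuse the same fiction-writing tactic,
# one discovers a novel tactic. All four succeeded (raw reward 1.0).
rewards = [1.0, 1.0, 1.0, 1.0]
clusters = ["fiction", "fiction", "fiction", "novel-tactic"]
print(diversity_normalized_rewards(rewards, clusters))
# each "fiction" rollout now earns 1/3, the lone "novel-tactic" rollout keeps 1.0
```

The effect is that the expected reward of piling more rollouts onto an already-crowded tactic shrinks, while probing an unexplored tactic pays full reward, which is what pushes GRPO away from repeating one jailbreak.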
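The defender-side training mix (successful attacks plus benign boundary cases) implies a reward that can't be gamed by refusing everything. A minimal sketch, with an assumed binary harmfulness label per prompt rather than anything from the original post:

```python
def defender_reward(prompt_is_harmful: bool, model_refused: bool) -> float:
    """Hypothetical defender reward: +1 for refusing a harmful prompt or
    answering a benign boundary prompt, -1 otherwise. Mixing in benign
    near-boundary prompts makes blanket refusal a losing strategy."""
    return 1.0 if model_refused == prompt_is_harmful else -1.0

# Refusing a successful attack is rewarded...
print(defender_reward(prompt_is_harmful=True, model_refused=True))   # 1.0
# ...but refusing a benign boundary case is penalized.
print(defender_reward(prompt_is_harmful=False, model_refused=True))  # -1.0
```

Without the benign boundary cases, the second term vanishes and the optimal defender policy degenerates to refusing every prompt near the boundary, which is exactly the over-refusal failure the post is guarding against.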