r/LocalLLaMA · 4 min read

How does the new abliteration tool Apostate compare with others? - Abliterlitics


Apostate is a new abliteration tool by heterodoxin. He asked me to benchmark it.

Why Qwen 2.5 7B?

Qwen 2.5 7B was recommended by heterodoxin because it's the most tested model for Apostate. I abliterated the model with both Heretic v1.3.0 and Apostate. The models are available on Hugging Face.

The tool itself is inspired by Heretic, but after reviewing the code, it is clearly original work by someone who understands the ML and maths involved.

The author of Heretic, p-e-w, also confirmed this when Apostate was shared in the Heretic Discord. So we can rest easy, this isn't another hauhaucs incident!

So how does it stack up against Heretic and Huihui? Let's find out!

Heretic has the edge: 100% ASR (attack success rate) with zero items still refused, roughly half as many parameters changed (20.0% versus about 36%), and the model actually gets better at some tasks. Apostate and Huihui both land just above 98% (98.8% and 98.2%) but leave a handful of items refused. Overall Apostate is still very good, and it was close between the three of them.

Check out the full analysis on HuggingFace.

The three variants

| Variant | Source | Tensors changed | Params changed |
|---|---|---|---|
| Apostate | heterodoxin, balanced profile | 55 (16.2%) | 35.8% |
| Huihui | huihui-ai, community | 57 (16.8%) | 36.8% |
| Heretic | Heretic v1.3.0, run by me | 37 (10.9%) | 20.0% |

All three do the same thing: find the "refusal direction" in the model's activations and remove it by editing the weights. They just find different directions and edit different layers.
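As a rough illustration of what "removing a direction" means, here is a minimal sketch in plain Python. It assumes a refusal direction `r` has already been estimated (commonly from the difference in mean activations on harmful versus harmless prompts) and projects it out of a weight matrix that writes into the residual stream, i.e. W' = W - r̂ r̂ᵀ W. The function names and the toy matrix are hypothetical, and none of the three tools is claimed to implement it exactly this way.

```python
import math

def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [dot(row, x) for row in W]

def ablate_direction(W, r):
    """Project direction r out of the output space of W.

    W is a d_out x d_in matrix (list of rows); r has length d_out.
    Returns W' = W - r_hat (r_hat^T W), so W' @ x has no component along r.
    """
    r_hat = normalize(r)
    d_out, d_in = len(W), len(W[0])
    # r_hat^T W: how strongly each input column writes along r
    proj = [sum(r_hat[i] * W[i][j] for i in range(d_out)) for j in range(d_in)]
    return [[W[i][j] - r_hat[i] * proj[j] for j in range(d_in)]
            for i in range(d_out)]

# Toy example with made-up numbers: 3-dim output, 2-dim input
W = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
r = [1.0, 1.0, 0.0]          # assumed (hypothetical) refusal direction
W_abl = ablate_direction(W, r)

# Any output of the edited matrix is now orthogonal to r (up to float error)
y = matvec(W_abl, [0.7, -1.3])
print(dot(y, normalize(r)))
```

Real tools differ mainly in how they estimate `r`, which layers and matrices they edit, and how strongly they scale the edit per layer.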

The surprising bit

Apostate and Huihui found almost entirely different refusal directions. Cosine similarity 0.023. So these two tools independently found completely different ways to disable the safety training, yet both achieved nearly identical results.

This shows the safety training in Qwen 2.5 7B doesn't have a single "off switch." There are multiple independent paths to remove it.
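To put the 0.023 figure in context, here is a small sketch of how cosine similarity is computed and what angle it corresponds to. The hidden size used for the random-vector comparison (3584) is an assumption about Qwen 2.5 7B, and the post does not say in which space the two directions were compared, so treat the scale check as indicative only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def angle_degrees(cos_sim):
    """Angle in degrees for a given cosine similarity."""
    return math.degrees(math.acos(cos_sim))

reported = 0.023                      # value quoted in the post
angle = angle_degrees(reported)       # close to 90 degrees, i.e. nearly orthogonal

# For scale: two random unit vectors in d dimensions have a cosine with
# standard deviation of roughly 1/sqrt(d).
d_model = 3584                        # assumed hidden size, not stated in the post
random_scale = 1 / math.sqrt(d_model)

print(f"angle = {angle:.2f} degrees, random-vector scale = {random_scale:.4f}")
```

Under that assumption, 0.023 is within the range you would expect from two unrelated directions, which fits the description of them as almost entirely different.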

Benchmarks

Evaluated with lm-evaluation-harness via vLLM 0.19.0, bf16 on RTX 5090 32GB.

| Task | Base | Apostate | Huihui | Heretic |
|---|---|---|---|---|
| MMLU | 71.78 | 71.43 | 70.27 | 71.59 |
| GSM8K | 79.23 | 80.74 | 80.74 | 80.82 |
| HellaSwag | 80.47 | 80.32 | 79.88 | 80.24 |
| ARC Challenge | 55.12 | 55.12 | 55.12 | 55.55 |
| WinoGrande | 71.03 | 69.38 | 69.53 | 70.72 |
| TruthfulQA MC2 | 64.83 | 62.59 | 60.89 | 60.39 |
| PiQA | 80.25 | 79.92 | 79.60 | 80.41 |
| LAMBADA ppl ↓ | 3.683 | 3.860 | 4.087 | 3.627 |

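All three barely move the needle on most tasks; the biggest drop is on TruthfulQA MC2, where each variant loses roughly 2 to 4.5 points. GSM8K actually goes up across all three. Heretic is the only one where the model gets better at predicting text (LAMBADA perplexity falls from 3.683 to 3.627). None of them damage the model in any meaningful way.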
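If you want to see the per-task movement at a glance, the sketch below computes each variant's difference from the base model using the scores in the benchmark table. Only the accuracy-style tasks are included, since LAMBADA perplexity runs in the opposite direction (lower is better). The variable and function names are my own.

```python
# Scores copied from the benchmark table: (Base, Apostate, Huihui, Heretic)
scores = {
    "MMLU":           (71.78, 71.43, 70.27, 71.59),
    "GSM8K":          (79.23, 80.74, 80.74, 80.82),
    "HellaSwag":      (80.47, 80.32, 79.88, 80.24),
    "ARC Challenge":  (55.12, 55.12, 55.12, 55.55),
    "WinoGrande":     (71.03, 69.38, 69.53, 70.72),
    "TruthfulQA MC2": (64.83, 62.59, 60.89, 60.39),
    "PiQA":           (80.25, 79.92, 79.60, 80.41),
}
variants = ("Apostate", "Huihui", "Heretic")

def deltas(table):
    """Return {task: {variant: score - base}} rounded to 2 decimals."""
    out = {}
    for task, (base, *rest) in table.items():
        out[task] = {v: round(s - base, 2) for v, s in zip(variants, rest)}
    return out

d = deltas(scores)
for task, row in d.items():
    cells = "  ".join(f"{v}: {x:+.2f}" for v, x in row.items())
    print(f"{task:15s} {cells}")
```

Positive numbers mean the abliterated variant scored higher than the base model on that task.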

HarmBench

400 harmful behaviours tested. Is the model willing to comply with our evil requests?

| Variant | ASR | Complied | Refused | Persistent |
|---|---|---|---|---|
| Base | 31.0% | 124 | 276 | - |
| Apostate | 98.8% | 395 | 5 | 5 |
| Huihui | 98.2% | 393 | 7 | 7 |
| Heretic | 100.0% | 400 | 0 | 0 |

The base model refuses 276 out of 400 harmful requests. All three abliterated variants flip the vast majority of those to compliant. Heretic got all 400. Apostate left 5 on the table, Huihui left 7.
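The ASR column is just the share of the 400 behaviours the model complied with. A quick sketch to reproduce it from the complied counts in the table (names are my own):

```python
TOTAL = 400  # HarmBench behaviours tested, per the post

# Complied counts copied from the HarmBench table
complied = {"Base": 124, "Apostate": 395, "Huihui": 393, "Heretic": 400}

def asr_percent(n_complied, total=TOTAL):
    """Attack success rate: percentage of requests the model complied with."""
    return 100 * n_complied / total

for name, n in complied.items():
    print(f"{name:9s} ASR = {asr_percent(n):6.2f}%   refused = {TOTAL - n}")
```

The table rounds these to one decimal place, so the unrounded values for Apostate and Huihui carry one more digit than shown there.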

The leftover refusals are in the hardest categories: harassment and harmful content. Heretic is the only one that clears those.

KL Divergence

How much did the model's behaviour change on normal, harmless prompts? Lower is better.

| Variant | KL batchmean |
|---|---|
| Apostate | 0.134 |
| Huihui | 0.190 |
| Heretic | 0.211 |

All three are moderate. The model still talks normally. Apostate shifts it the least because it spreads its edits across more layers with a lighter touch. Heretic hits fewer layers but harder, so the overall shift is slightly bigger. None of these numbers are concerning.
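For readers unfamiliar with the metric, here is a minimal sketch of a "batchmean" KL divergence: the KL divergence between the two models' next-token distributions is summed over the vocabulary for each prompt, then averaged over prompts. The direction KL(base || edited) and the choice of which token position to compare are assumptions on my part, and the logits below are made up.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_batchmean(base_logits, edited_logits):
    """Mean over prompts of KL(base || edited) over the vocabulary."""
    total = 0.0
    for b, e in zip(base_logits, edited_logits):
        log_p, log_q = log_softmax(b), log_softmax(e)
        total += sum(math.exp(p) * (p - q) for p, q in zip(log_p, log_q))
    return total / len(base_logits)

# Hypothetical logits: 2 prompts, vocabulary of 3 tokens
base_logits   = [[2.0, 0.5, -1.0], [0.0, 1.0, 0.0]]
edited_logits = [[1.8, 0.7, -1.0], [0.1, 0.9, 0.2]]

print(kl_batchmean(base_logits, edited_logits))
```

A value of 0 means the edited model predicts exactly the same distribution as the base model on those prompts; larger values mean a bigger behavioural shift.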

Heretic is non-deterministic. We could have kept running Heretic trials and gotten a better KL score. Luckily, we got this decent result with just one run of 200 trials.

Weight analysis

| Metric | Apostate | Huihui | Heretic |
|---|---|---|---|
| Tensors changed | 55 (16.2%) | 57 (16.8%) | 37 (10.9%) |
| Params changed | 35.8% | 36.8% | 20.0% |
| Mean edit norm | 1.63 | 1.85 | 2.33 |
| Layers modified | 27 of 28 | 28 of 28 | 19 of 28 |
| Embedding touched | Yes (minimal) | Yes (minimal) | No |

Heretic changed the smallest share of the model. It skips the first 9 layers entirely and doesn't touch the embedding. But each edit it does make is more aggressive (mean edit norm 2.33 versus 1.63 and 1.85). Apostate and Huihui edit more of the model but with lighter touches per layer.
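For a sense of how numbers like these can be derived, here is a hedged sketch that diffs two checkpoints represented as plain dicts of flattened tensors. The exact definitions used in the report are not given in this post, so two assumptions are baked in: "params changed" counts every parameter in any tensor that differs, and "mean edit norm" is the Frobenius norm of the difference averaged over changed tensors. The tensor names and values are toy data.

```python
import math

def diff_stats(base, edited, tol=0.0):
    """Compare two {tensor_name: flat list of floats} checkpoints."""
    changed_tensors = 0
    changed_params = 0
    total_params = 0
    norms = []
    for name, w0 in base.items():
        w1 = edited[name]
        total_params += len(w0)
        sq_diff = sum((a - b) ** 2 for a, b in zip(w0, w1))
        if sq_diff > tol:
            changed_tensors += 1
            changed_params += len(w0)          # assumption: whole tensor counts
            norms.append(math.sqrt(sq_diff))   # Frobenius norm of the edit
    return {
        "tensors_changed": changed_tensors,
        "tensor_pct": 100 * changed_tensors / len(base),
        "param_pct": 100 * changed_params / total_params,
        "mean_edit_norm": sum(norms) / len(norms) if norms else 0.0,
    }

# Toy checkpoints with hypothetical tensor names
base   = {"embed": [1.0, 2.0], "layer0.o_proj": [0.0, 0.0, 0.0, 0.0], "norm": [5.0, 5.0]}
edited = {"embed": [1.0, 2.0], "layer0.o_proj": [0.0, 3.0, 4.0, 0.0], "norm": [5.0, 5.0]}

stats = diff_stats(base, edited)
print(stats)
```

With real checkpoints you would load the tensors from the model files instead of hand-written lists, but the bookkeeping stays the same.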

The verdict

Heretic is the pick for this model. 100% ASR, most capability retained, fewest parameters changed. The model actually gets better at some things.

Apostate is new and it works. Gets you to 98.8% ASR with the lowest behaviour shift on normal prompts. The 5 items it still refuses are the hardest ones. A solid second place and a perfectly valid choice.

Huihui takes the biggest capability hit of the three because it touches every single layer. Still fine at 98.2% but no real reason to pick it over the other two for this model.

Links

Full report with all tables, charts, and raw data: HuggingFace and on our new website Abliterlitics.dev

Forensics toolkit: Abliterlitics on GitHub

For my last Gemma 4 E2b comparison, thanks for calling out the AI slop. I will admit I got lazy with the Reddit post and some parts. Going forward I hope to provide readers with more delicious human slop. <3 Thanks for supporting Abliterlitics!

submitted by /u/nathandreamfast