How does the new abliteration tool Apostate compare with others? - Abliterlitics
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Why Qwen 2.5 7B? Apostate is a new abliteration tool by heterodoxin. He asked me to benchmark it.
Qwen 2.5 7B was recommended by heterodoxin as it's the most tested model for Apostate. I abliterated the model with Heretic v1.3.0 and Apostate. The models are available on HuggingFace.
The tool itself is inspired by Heretic, but after reviewing the code I can say it is clearly original work by someone who understands the ML and maths involved.
The author of Heretic, p-e-w, also confirmed this when Apostate was shared in the Heretic Discord. So we can rest easy, this isn't another hauhaucs incident!
So how does it stack up against Heretic and Huihui? Let's find out!
Heretic has the edge. 100% ASR with zero items still refused, changes roughly half as many parameters (20.0% vs about 36% for the other two), and the model actually gets better at some tasks. Apostate and Huihui both hit 98% but leave a handful of items refused. Overall Apostate is still very good and it was close between the three of them.
Check out the full analysis on HuggingFace.
The three variants
| Variant | Source | Tensors changed | Params changed |
|---|---|---|---|
| Apostate | heterodoxin, balanced profile | 55 (16.2%) | 35.8% |
| Huihui | huihui-ai, community | 57 (16.8%) | 36.8% |
| Heretic | Heretic v1.3.0, run by me | 37 (10.9%) | 20.0% |
All three do the same thing: find the "refusal direction" in the model's weights and remove it. They just find slightly different directions and edit different layers.
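To make "remove a direction" a bit more concrete, here is a minimal NumPy sketch of the generic idea: project one direction out of what a weight matrix can write. This is not the actual code of Apostate, Huihui or Heretic. The names `ablate_direction`, `W` and `r` are made up for the example, and the matrix is random rather than taken from a real model.

```python
import numpy as np

def ablate_direction(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Return a copy of W whose outputs have no component along r.

    W has shape (d_out, d_in) and r has shape (d_out,).
    The edit is W' = W - r_hat r_hat^T W, with r_hat the unit vector along r.
    """
    r_hat = r / np.linalg.norm(r)
    return W - np.outer(r_hat, r_hat @ W)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))   # toy weight matrix (hypothetical)
r = rng.normal(size=8)        # toy "refusal direction" (hypothetical)
W_abl = ablate_direction(W, r)

x = rng.normal(size=4)
residual = float(np.dot(r, W_abl @ x))
print(residual)  # close to zero, up to floating-point error
```

Real tools differ in how they estimate `r`, which matrices they edit and how strongly they apply the edit per layer, which is where the differences in the table above come from.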
The surprising bit
Apostate and Huihui found almost entirely different refusal directions. Cosine similarity 0.023. So these two tools independently found completely different ways to disable the safety training, yet both achieved nearly identical results.
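For a feel of how small 0.023 is, here is a quick sketch of cosine similarity between two unrelated random vectors. The dimension 3584 is my assumption for the hidden size of Qwen 2.5 7B, and the vectors are random stand-ins, not the real refusal directions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
d = 3584  # assumed hidden size, not taken from the report
a = rng.normal(size=d)
b = rng.normal(size=d)

cos_random = cosine_similarity(a, b)  # two unrelated directions
cos_self = cosine_similarity(a, a)    # a direction compared with itself
print(cos_random, cos_self)
```

In a space this large, unrelated random vectors typically land within a few hundredths of zero, so 0.023 is in the range you would expect from two directions that share essentially nothing.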
This shows the safety training in Qwen 2.5 7B doesn't have a single "off switch." There are multiple independent paths to remove it.
Benchmarks
Evaluated with lm-evaluation-harness via vLLM 0.19.0, bf16 on RTX 5090 32GB.
| Task | Base | Apostate | Huihui | Heretic |
|---|---|---|---|---|
| MMLU | 71.78 | 71.43 | 70.27 | 71.59 |
| GSM8K | 79.23 | 80.74 | 80.74 | 80.82 |
| HellaSwag | 80.47 | 80.32 | 79.88 | 80.24 |
| ARC Challenge | 55.12 | 55.12 | 55.12 | 55.55 |
| WinoGrande | 71.03 | 69.38 | 69.53 | 70.72 |
| TruthfulQA MC2 | 64.83 | 62.59 | 60.89 | 60.39 |
| PiQA | 80.25 | 79.92 | 79.60 | 80.41 |
| LAMBADA ppl ↓ | 3.683 | 3.860 | 4.087 | 3.627 |
All three barely move the needle on most tasks. GSM8K actually goes up across all three. Heretic is the only one where the model gets better at predicting text (LAMBADA perplexity drops from 3.683 to 3.627). None of them damage the model in any meaningful way.
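If you want to see the movement per task at a glance, this small sketch computes each variant's delta against the base model. The numbers are copied from the table above; the variable names are just for the example. LAMBADA is left out because it is a perplexity, where lower is better.

```python
# Scores per task in the order (Base, Apostate, Huihui, Heretic).
scores = {
    "MMLU": (71.78, 71.43, 70.27, 71.59),
    "GSM8K": (79.23, 80.74, 80.74, 80.82),
    "HellaSwag": (80.47, 80.32, 79.88, 80.24),
    "ARC Challenge": (55.12, 55.12, 55.12, 55.55),
    "WinoGrande": (71.03, 69.38, 69.53, 70.72),
    "TruthfulQA MC2": (64.83, 62.59, 60.89, 60.39),
    "PiQA": (80.25, 79.92, 79.60, 80.41),
}
variants = ("Apostate", "Huihui", "Heretic")

# Delta in percentage points versus the base model, rounded to 2 decimals.
deltas = {
    task: {v: round(row[i + 1] - row[0], 2) for i, v in enumerate(variants)}
    for task, row in scores.items()
}

for task, d in deltas.items():
    print(f"{task:15s}", d)
```

Most deltas stay within about a point and a half of the base. The largest movement in the table is on TruthfulQA MC2.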
HarmBench
400 harmful behaviours tested. Is the model willing to comply with our evil requests?
| Variant | ASR | Complied | Refused | Persistent |
|---|---|---|---|---|
| Base | 31.0% | 124 | 276 | - |
| Apostate | 98.8% | 395 | 5 | 5 |
| Huihui | 98.2% | 393 | 7 | 7 |
| Heretic | 100.0% | 400 | 0 | 0 |
The base model refuses 276 out of 400 harmful requests. All three abliterated variants flip the vast majority of those to compliant. Heretic got all 400. Apostate left 5 on the table, Huihui left 7.
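The ASR column is just complied divided by total. A tiny sketch using the counts from the table (the function name is made up for the example):

```python
def attack_success_rate(complied: int, total: int) -> float:
    """Percentage of harmful behaviours the model complied with."""
    return 100.0 * complied / total

TOTAL = 400
complied = {"Base": 124, "Apostate": 395, "Huihui": 393, "Heretic": 400}

asr = {name: attack_success_rate(c, TOTAL) for name, c in complied.items()}
for name, value in asr.items():
    print(f"{name}: {value:.2f}% ASR, {TOTAL - complied[name]} refused")
```

Unrounded, Apostate is 98.75% and Huihui is 98.25%, so the gap between them is two items out of 400.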
The leftover refusals are in the hardest categories: harassment and harmful content. Heretic is the only one that clears those.
KL Divergence
How much did the model's behaviour change on normal, harmless prompts? Lower is better.
| Variant | KL batchmean |
|---|---|
| Apostate | 0.134 |
| Huihui | 0.190 |
| Heretic | 0.211 |
All three are moderate. The model still talks normally. Apostate shifts it the least because it spreads its edits across more layers with a lighter touch. Heretic hits fewer layers but harder, so the overall shift is slightly bigger. None of these numbers are concerning.
Heretic is non-deterministic. We could have kept running Heretic trials and got a better KL score. Luckily, we got this decent result with just one run of 200 trials.
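For anyone wondering what a "batchmean" KL number is, here is a NumPy sketch: the KL divergence between the two models' next-token distributions, summed over the vocabulary and averaged over the batch. The direction KL(base || edited), the toy logits and the function names are my assumptions for illustration. Which prompts and token positions the real measurement uses is not covered here.

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kl_batchmean(base_logits: np.ndarray, edited_logits: np.ndarray) -> float:
    """KL(base || edited): sum over vocab, then mean over batch rows."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(edited_logits)
    kl_per_row = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl_per_row.mean())

rng = np.random.default_rng(0)
base_logits = rng.normal(size=(16, 100))                        # toy batch of 16, vocab of 100
edited_logits = base_logits + 0.3 * rng.normal(size=(16, 100))  # slightly perturbed copy

kl_same = kl_batchmean(base_logits, base_logits)
kl_shifted = kl_batchmean(base_logits, edited_logits)
print(kl_same, kl_shifted)
```

Identical models give 0, and the number grows as the edited model's predictions drift away from the base model's.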
Weight analysis
| - | Apostate | Huihui | Heretic |
|---|---|---|---|
| Tensors changed | 55 (16.2%) | 57 (16.8%) | 37 (10.9%) |
| Params changed | 35.8% | 36.8% | 20.0% |
| Mean edit norm | 1.63 | 1.85 | 2.33 |
| Layers modified | 27 of 28 | 28 of 28 | 19 of 28 |
| Embedding touched | Yes (minimal) | Yes (minimal) | No |
Heretic changed the least of the model. It skips the first 9 layers entirely and doesn't touch the embedding. But each edit it does make is more aggressive. Apostate and Huihui edit more of the model but with lighter touches per layer.
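A rough sketch of how numbers like these can be computed by diffing two checkpoints. This is not the Abliterlitics implementation. I am assuming "params changed" means the share of parameters that live in changed tensors and that "edit norm" is the Frobenius norm of the difference; the tensor names and toy weights are invented.

```python
import numpy as np

def diff_report(base: dict, edited: dict) -> dict:
    """Compare two name -> array mappings with identical keys and shapes."""
    total_params = sum(w.size for w in base.values())
    changed, changed_params, norms = [], 0, []
    for name, w in base.items():
        delta = edited[name] - w
        if np.any(delta != 0):
            changed.append(name)
            changed_params += w.size
            norms.append(float(np.linalg.norm(delta)))  # Frobenius norm of the edit
    return {
        "tensors_changed": len(changed),
        "tensor_fraction": len(changed) / len(base),
        "param_fraction": changed_params / total_params,
        "mean_edit_norm": float(np.mean(norms)) if norms else 0.0,
    }

rng = np.random.default_rng(0)
base = {f"layer{i}.o_proj": rng.normal(size=(4, 4)) for i in range(3)}
base["embed"] = rng.normal(size=(10, 4))
edited = {k: v.copy() for k, v in base.items()}
edited["layer1.o_proj"] += 0.1  # pretend one tensor was abliterated

report = diff_report(base, edited)
print(report)
```

On real checkpoints you would load the tensors from the model files instead of generating them, and you might want a small tolerance instead of an exact `!= 0` check if the two checkpoints were saved in different precisions.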
The verdict
Heretic is the pick for this model. 100% ASR, most capability retained, fewest parameters changed. The model actually gets better at some things.
Apostate is new and it works. Gets you to 98.8% ASR with the lowest behaviour shift on normal prompts. The 5 items it still refuses are the hardest ones. A solid second place and a perfectly valid choice.
Huihui takes the biggest capability hit of the three because it touches every single layer. Still fine at 98.2% but no real reason to pick it over the other two for this model.
Links
Full report with all tables, charts, and raw data: HuggingFace and on our new website Abliterlitics.dev
Forensics toolkit: Abliterlitics on GitHub
For my last Gemma 4 E2b comparison, thanks for calling out the AI slop. I will admit I got lazy with the Reddit post and some parts. Going forward I hope to provide readers with more delicious human slop. <3 thanks for supporting abliterlitics!