arXiv — Machine Learning · · 3 min read

Exemplar Partitioning for Mechanistic Interpretability

Mirrored from arXiv — Machine Learning for archival readability. Support the source by reading on the original site.

Computer Science > Machine Learning

arXiv:2605.14347 (cs)
[Submitted on 14 May 2026]

Title:Exemplar Partitioning for Mechanistic Interpretability

View a PDF of the paper titled Exemplar Partitioning for Mechanistic Interpretability, by Jessica Rumbelow
View PDF HTML (experimental)
Abstract:We introduce Exemplar Partitioning (EP), an unsupervised method for constructing interpretable feature dictionaries from large language model activations with $\sim 10^{3}\times$ fewer tokens than comparable sparse autoencoders (SAEs). An EP dictionary is a Voronoi partition of activation space, built by leader-clustering streamed activations within a distance threshold. Each region is anchored by an observed exemplar that serves as both its membership criterion and intervention direction; dictionary size is not prespecified, but determined by the activation geometry at that threshold. Because exemplars are observed rather than learned, dictionaries built from the same data stream are directly comparable across layers, models, and training checkpoints.
We characterise EP as an interpretability object via targeted demonstrations of properties newly accessible through this construction, plus one head-to-head benchmark. In Gemma-2-2B, EP dictionary regions are interpretable and support causal interventions: refusal in instruction-tuned Gemma concentrates in a region whose exemplar ablation can collapse held-out refusal. Cross-checkpoint matching between base and instruction-tuned dictionaries separates the directions preserved through finetuning from those introduced by it. EP regions and Gemma Scope SAE features decompose activation space differently but agree on a shared core: $\sim 20\%$ of EP regions match an SAE feature at $F_{1} > 0.5$, and EP one-hot probes retain $\sim 97\%$ of raw-activation probe accuracy at $\ell_{0} = 1$. Nearest-exemplar distance provides a free out-of-distribution signal at inference. On AxBench latent concept detection at Gemma-2-2B-it L20, EP at $p_{1}$ reaches mean AUROC $0.881$, $+0.126$ over the canonical GemmaScope SAE leaderboard entry and within $0.030$ of SAE-A's $0.911$, at $\sim 10^{3}\times$ less build compute.
Comments: Code: this https URL. Pretrained dictionaries: this https URL
Subjects: Machine Learning (cs.LG)
Cite as: arXiv:2605.14347 [cs.LG]
  (or arXiv:2605.14347v1 [cs.LG] for this version)
  https://doi.org/10.48550/arXiv.2605.14347
arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Jessica Rumbelow [view email]
[v1] Thu, 14 May 2026 04:15:30 UTC (1,157 KB)
Full-text links:

Access Paper:

Current browse context:

cs.LG
< prev   |   next >
Change to browse by:
cs

References & Citations

Loading...

BibTeX formatted citation

loading...
Data provided by:

Bookmark

BibSonomy Reddit
Bibliographic Tools

Bibliographic and Citation Tools

Bibliographic Explorer Toggle
Bibliographic Explorer (What is the Explorer?)
Connected Papers Toggle
Connected Papers (What is Connected Papers?)
Litmaps Toggle
Litmaps (What is Litmaps?)
scite.ai Toggle
scite Smart Citations (What are Smart Citations?)
Code, Data, Media

Code, Data and Media Associated with this Article

alphaXiv Toggle
alphaXiv (What is alphaXiv?)
Links to Code Toggle
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub Toggle
DagsHub (What is DagsHub?)
GotitPub Toggle
Gotit.pub (What is GotitPub?)
Huggingface Toggle
Hugging Face (What is Huggingface?)
ScienceCast Toggle
ScienceCast (What is ScienceCast?)
Demos

Demos

Replicate Toggle
Replicate (What is Replicate?)
Spaces Toggle
Hugging Face Spaces (What is Spaces?)
Spaces Toggle
TXYZ.AI (What is TXYZ.AI?)
Related Papers

Recommenders and Search Tools

Link to Influence Flower
Influence Flower (What are Influence Flowers?)
Core recommender toggle
CORE Recommender (What is CORE?)
IArxiv recommender toggle
IArxiv Recommender (What is IArxiv?)
About arXivLabs

arXivLabs: experimental projects with community collaborators

arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.

Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.

Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs.

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from arXiv — Machine Learning