Hugging Face Daily Papers · · 6 min read

Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Papers
arxiv:2606.12629

Published on Jun 17 · Submitted by Varun Reddy Nalagatla on Jun 18

Abstract

The standard basis of transformer hidden states serves as a training-free, architecture-general feature representation where individual dimensions encode semantic content through signs and confidence through magnitudes, functioning as independent binary registers without requiring learned rotations or optimization.

We show the standard basis of transformer hidden states already provides a training-free, architecture-general feature basis. Individual dimensions encode semantic content via their signs (+/-1) and confidence via their magnitudes, acting as independent binary registers; a feature is a subset of dimensions with a consistent sign pattern, read by counting sign agreements with no learned rotation. We validate this Bag of Dims framework across seven models spanning language (Qwen 3.5-4B, Gemma 3-4B, Mistral 7B, Qwen3-32B), vision (DINOv2, ViT-Base), and audio (AST). Signs alone carry predictive content: unit-magnitude sign patterns preserve 60-93% top-5 next-token accuracy through the LM head, and decoder-free Hamming scoring reaches 80-90% top-4096. From a single-token cache (one forward pass per token, no context, no labels), we detect 175 categories at AUC 0.97-0.99 by sign agreement; a trained probe adds only +0.018 AUC and converges to axis-aligned weights. These features are causally operative: they survive the K/V attention projections, trace to the FFN neuron coalitions that write them (random-weight controls never reproduce this), and flipping a feature's signs during the live forward pass suppresses its concept across four language models, magnitude-matched and concept-specific. Dimensions stay independent throughout (pairwise mutual information below 0.006 bits). The structure is not specific to language: the same per-dimension signs appear in self-supervised vision (DINOv2, 9/12 ImageNet superclasses), supervised vision (ViT-Base, 11/12), and audio (AST, 50/50 ESC-50 categories), so it reflects transformer training in general, not the language-modeling objective. The standard basis already suffices for feature reading at one forward pass, no optimization, no GPU-days. The open problem shifts from finding the right rotation to cataloging what each dimension encodes.
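The independence claim in the abstract (pairwise mutual information below 0.006 bits) can be checked on any cache of hidden states. The sketch below is one plausible way to do it, not the paper's own procedure: it uses a plain plug-in estimator on the sign bits of two dimensions, and the function name `sign_mutual_information` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def sign_mutual_information(x, y):
    """Plug-in estimate of I(sign(x); sign(y)) in bits from paired samples.

    x, y: 1-D arrays holding the values of two dimensions over the same tokens.
    """
    a = (np.asarray(x) > 0).astype(int)   # sign bit of the first dimension
    b = (np.asarray(y) > 0).astype(int)   # sign bit of the second dimension
    joint = np.zeros((2, 2))
    np.add.at(joint, (a, b), 1)           # 2x2 contingency table of sign pairs
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True) # marginal of a, shape (2, 1)
    pb = joint.sum(axis=0, keepdims=True) # marginal of b, shape (1, 2)
    nz = joint > 0                        # skip empty cells (0 * log 0 = 0)
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (pa @ pb)[nz])))

# Synthetic stand-ins for two columns of a hidden-state cache (assumption).
rng = np.random.default_rng(0)
dim_a = rng.normal(size=10_000)
dim_b = rng.normal(size=10_000)
print("independent dims:", sign_mutual_information(dim_a, dim_b))
print("identical dims:  ", sign_mutual_information(dim_a, dim_a))
```

The plug-in estimator is biased upward on small samples, so values near zero are only meaningful when the number of tokens is large relative to the four cells of the table.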

Community

Paper author Paper submitter about 3 hours ago

TL;DR: you can read interpretable features out of transformer hidden states with
no trained SAE and no probe, straight from the standard basis.

The idea is to treat each dimension as an independent register: its sign carries
content, its magnitude carries confidence. A "feature" is then just a subset of
dims with a consistent sign pattern, read by counting sign agreements. One forward
pass, zero training.
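A minimal sketch of that reading rule, under stated assumptions: a feature is taken to be the set of dimensions whose sign is consistent across example hidden states for a concept, and a new vector is scored by the fraction of those dimensions that agree in sign. The names `find_feature` and `sign_agreement`, the `min_consistency` threshold, and the synthetic data are all hypothetical; the paper may select dimensions differently.

```python
import numpy as np

def find_feature(examples, min_consistency=0.9):
    """Return (dims, signs) for dimensions whose sign is consistent across examples.

    examples: array of shape (n_examples, d_model), hidden states for one concept.
    min_consistency: hypothetical threshold on |mean sign|, not a value from the paper.
    """
    mean_sign = np.sign(examples).mean(axis=0)   # near +1 or -1 when the sign is stable
    dims = np.flatnonzero(np.abs(mean_sign) >= min_consistency)
    return dims, np.sign(mean_sign[dims])

def sign_agreement(h, dims, signs):
    """Fraction of the feature's dimensions whose sign in h matches the pattern."""
    return float(np.mean(np.sign(h[dims]) == signs))

# Synthetic stand-in for a cache: 50 vectors sharing a sign pattern on dims 0..7.
rng = np.random.default_rng(0)
pattern = np.array([1, -1, 1, 1, -1, -1, 1, -1])
concept = rng.normal(size=(50, 64))
concept[:, :8] = np.abs(concept[:, :8]) * pattern

dims, signs = find_feature(concept)
print("feature dims:", dims)
print("agreement on a concept vector:", sign_agreement(concept[0], dims, signs))
print("agreement on a random vector: ", sign_agreement(rng.normal(size=64), dims, signs))
```

Magnitudes are ignored on purpose here; weighting each agreement by magnitude would be a natural variant if magnitude does carry confidence.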

Look at the figures in the paper; they directly triggered the idea to look at the raw dims.

Some results that surprised me:

  • Sign alone (all magnitudes set to 1) preserves 60–93% of top-5 next-token accuracy
  • 175 semantic categories detected from a single-token cache with zero labels;
    a trained probe adds only +0.018 AUC and converges to axis-aligned weights
  • The same structure appears in vision (DINOv2, ViT) and audio (AST), so it looks like a
    property of transformer training, not of language
  • Flipping a feature's signs mid-forward-pass causally suppresses the concept
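The first and last results in the list above rest on two simple operations on a hidden state: setting every magnitude to 1 while keeping the sign, and negating a chosen subset of dimensions. The sketch below shows both on a toy setup. The random matrix `W` standing in for an LM head, the random vector `h`, and the helper names are assumptions; a real model would also apply its final normalization before the LM head, and the flip would be applied inside a forward hook rather than on a standalone vector.

```python
import numpy as np

def top_k(logits, k=5):
    """Indices of the k largest logits, best first."""
    return np.argsort(logits)[::-1][:k]

def sign_only(h):
    """Keep each dimension's sign and set every magnitude to 1 (exact zeros map to +1)."""
    return np.where(h >= 0, 1.0, -1.0)

def flip_feature(h, dims):
    """Negate the chosen dimensions; magnitudes and all other dims are unchanged."""
    out = h.copy()
    out[dims] *= -1
    return out

rng = np.random.default_rng(0)
d_model, vocab = 256, 1000
W = rng.normal(size=(vocab, d_model))   # random stand-in for an LM head (assumption)
h = rng.normal(size=d_model)            # random stand-in for a final hidden state

# Does the full-precision top-1 token survive in the sign-only top-5?
full_top1 = top_k(W @ h, k=1)[0]
survives = full_top1 in top_k(W @ sign_only(h), k=5)
print("top-1 survives in sign-only top-5:", bool(survives))

# Flip a hypothetical 8-dimension feature and compare the top-5 before and after.
feature_dims = np.arange(8)
print("top-5 before flip:", top_k(W @ h))
print("top-5 after flip: ", top_k(W @ flip_feature(h, feature_dims)))
```

On random data the printed numbers say nothing about real models; the point is only that both interventions are a few lines each, so the claims are cheap to test on a model of your choice.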

It's deliberately simple: simple enough that you can hand the paper to a coding agent and reproduce it in ~15-20 minutes on a model of your choice on your laptop. I kept the code away on purpose; an independent reproduction is a better testament to the methodology, whether it works or fails.

Would love scrutiny, especially from folks deep in SAEs.



Get this paper in your agent:

hf papers read 2606.12629
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
