Hugging Face Daily Papers · 3 min read

Confidence and Calibration of Activation Oracles for Reliable Interpretation of Language Model Internals

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.26045


Published on May 25, 2026 · Submitted by Federico Torrielli on May 27, 2026
Authors: Federico Torrielli, Peter Schneider-Kamp, Lukas Galke Poech

Abstract

AI-generated summary: The paper evaluates confidence estimation methods for activation oracles and finds that bootstrap mode frequency yields better-calibrated confidence scores than log-probability approaches.

Activation oracles aim to make the activations of other models legible to humans and yield promising results compared to white-box interpretability techniques. However, uncertainty quantification (UQ) for the natural-language outputs of such activation oracles is so far understudied. Here, we investigate 6 different methods for estimating the confidence of activation oracles and evaluate how well-calibrated their confidence scores are. Our experiments on 6,000 samples per oracle (varying verbalizer and context prompts) reveal that bootstrap mode frequency is the best-calibrated method among those tested (ECE 5.7% vs. 25.5% for the answer-word log-probability on Qwen3-8B; 10.3% vs. 13.1% on Qwen3.6-27B), and that the log-prob baseline can serve as a fast triage signal at a fraction of the cost. Code and the patched trainer are available at https://github.com/federicotorrielli/probabilistic_activation_oracles.
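The two quantities named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that "bootstrap mode frequency" means the share of repeatedly sampled oracle answers that agree with the most common answer, and that ECE is the standard equal-width-bin expected calibration error. The function names, the text normalization, and the bin count are hypothetical choices made for illustration.

```python
from collections import Counter


def mode_frequency_confidence(answers):
    """Return (mode answer, share of samples agreeing with it).

    Assumption: `answers` holds the oracle's answers from repeated
    sampling (e.g. across resampled verbalizer/context prompts).
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Hypothetical normalization: ignore case and surrounding whitespace.
    normalized = [a.strip().lower() for a in answers]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / len(normalized)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of
    (bin share) * |bin accuracy - bin mean confidence|.

    Bins are [lo, hi), except the last, which also includes 1.0.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have equal length")
    if not confidences:
        raise ValueError("need at least one prediction")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, bool(ok)))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Four hypothetical oracle samples for one input.
    print(mode_frequency_confidence(["Tree", "tree ", "leaf", "tree"]))
    # Toy confidences and correctness flags, not data from the paper.
    print(expected_calibration_error([0.9, 0.9, 0.6, 0.6],
                                     [True, True, True, False]))
```

Under this reading, a lower ECE means that stated confidence tracks empirical accuracy more closely, which is the sense in which the abstract compares 5.7% against 25.5% on Qwen3-8B.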

Community

Federico Torrielli (paper author, paper submitter) · May 27, 2026

We propose the first uncertainty-quantification benchmark for activation oracles, comparing six confidence estimators across two Qwen-family oracles. We also train and release, for the first time, an activation oracle and taboo target models for Qwen3.6-27B, extending the setup to a hybrid linear-plus-full attention architecture. Bootstrap confidence is best calibrated, while log-probability remains a cheap triage signal.


Get this paper in your agent:

hf papers read 2605.26045
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 116

