Hugging Face Daily Papers · May 27, 2026 · 3 min read

Confidence and Calibration of Activation Oracles for Reliable Interpretation of Language Model Internals

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


We propose the first uncertainty-quantification benchmark for activation oracles, comparing six confidence estimators across two Qwen-family oracles. We also train and release, for the first time, an activation oracle and taboo target models for Qwen3.6-27B, extending the setup to a hybrid linear-plus-full attention architecture. Bootstrap confidence is best calibrated, while log-probability remains a cheap triage signal.

arxiv:2605.26045

Published on May 25

Submitted by Federico Torrielli on May 27

AI Safety & Interpretability Lab

Authors: Federico Torrielli, Peter Schneider-Kamp, Lukas Galke Poech

AI-generated summary

This research evaluates confidence estimation methods for activation oracles and finds that bootstrap mode frequency yields better-calibrated confidence scores than log-probability approaches.

Abstract

Activation oracles aim to make the activations of other models legible to humans and yield promising results compared to white-box interpretability techniques. However, uncertainty quantification (UQ) for the natural-language outputs of such activation oracles is so far understudied. Here, we investigate 6 different methods for estimating the confidence of activation oracles and evaluate how well-calibrated their confidence scores are. Our experiments on 6,000 samples per oracle (varying verbalizer and context prompts) reveal that bootstrap mode frequency is the best-calibrated method among those tested (ECE 5.7% vs. 25.5% for the answer-word log-probability on Qwen3-8B; 10.3% vs. 13.1% on Qwen3.6-27B), and that the log-prob baseline can serve as a fast triage signal at a fraction of the cost. Code and the patched trainer are available at https://github.com/federicotorrielli/probabilistic_activation_oracles.
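The two quantities named in the abstract can be illustrated with a minimal sketch. It assumes that "bootstrap mode frequency" means the fraction of resampled oracle answers that agree with the most common answer, and that ECE is computed with equal-width confidence bins. Neither detail is confirmed by this page, and the function names, the text normalization, and the default of 10 bins are hypothetical choices made for illustration, not the authors' implementation.

```python
from collections import Counter
from typing import Sequence


def mode_frequency(answers: Sequence[str]) -> tuple[str, float]:
    """Return the most common answer and the share of resamples that match it.

    Assumption: answers are compared after stripping whitespace and lowercasing.
    """
    if not answers:
        raise ValueError("need at least one resampled answer")
    normalized = [a.strip().lower() for a in answers]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / len(normalized)


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # hypothetical default; the paper's binning is not stated here
) -> float:
    """Equal-width-bin ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one sample")

    # Assign each sample to a bin; a confidence of exactly 1.0 goes in the last bin.
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

In use, each oracle query would be resampled several times, `mode_frequency` would turn the resampled answers into one answer plus a confidence, and `expected_calibration_error` would then compare those confidences against correctness labels over the whole evaluation set. A lower ECE means the stated confidence tracks the observed accuracy more closely.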




Models citing this paper 116

