Confidence and Calibration of Activation Oracles for Reliable Interpretation of Language Model Internals
Authors: Federico Torrielli, Peter Schneider-Kamp, Lukas Galke Poech
Organization: AI Safety & Interpretability Lab (aisilab)
Published: 2026-05-25 · arXiv:2605.26045
Abstract
AI-generated summary: This work evaluates confidence estimation methods for activation oracles and finds that bootstrap mode frequency provides better-calibrated confidence scores than log-probability approaches.
Activation oracles aim to make the activations of other models legible to humans and yield promising results compared to white-box interpretability techniques. However, uncertainty quantification (UQ) for the natural-language outputs of such activation oracles is so far understudied. Here, we investigate 6 different methods for estimating the confidence of activation oracles and evaluate how well-calibrated their confidence scores are. Our experiments on 6,000 samples per oracle (varying verbalizer and context prompts) reveal that bootstrap mode frequency is the best-calibrated method among those tested (ECE 5.7% vs. 25.5% for the answer-word log-probability on Qwen3-8B; 10.3% vs. 13.1% on Qwen3.6-27B), and that the log-prob baseline can serve as a fast triage signal at a fraction of the cost.
Code and the patched trainer are available at https://github.com/federicotorrielli/probabilistic_activation_oracles.
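The two quantities the abstract compares can be illustrated with a minimal sketch. It is not taken from the linked repository. It assumes that "bootstrap mode frequency" means the share of repeatedly sampled oracle answers that agree with the most common answer, and that ECE is the standard equal-width-bin estimate. The function names `mode_frequency` and `expected_calibration_error`, the lowercase normalization, and the default of 10 bins are all illustrative assumptions.

```python
from collections import Counter
from typing import Sequence, Tuple


def mode_frequency(answers: Sequence[str]) -> Tuple[str, float]:
    """Return the most common answer and the fraction of samples agreeing with it.

    Assumption: answers are compared after trimming whitespace and lowercasing.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    normalized = [a.strip().lower() for a in answers]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / len(normalized)


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: sample-weighted mean of |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one sample")

    # Bins are [k/n_bins, (k+1)/n_bins); a confidence of exactly 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece
```

Under these assumptions, each evaluated sample contributes one confidence (its mode frequency) and one correctness flag (whether the modal answer matches the ground truth), and the ECE is then computed over all samples for a given oracle.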
Community
We propose the first uncertainty-quantification benchmark for activation oracles, comparing six confidence estimators across two Qwen-family oracles. We also train and release, for the first time, an activation oracle and taboo target models for Qwen3.6-27B, extending the setup to a hybrid linear-plus-full attention architecture. Bootstrap confidence is the best calibrated, while log-probability remains a cheap triage signal.