Hugging Face Daily Papers · June 9, 2026 · 4 min read

Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.08960

Published on Jun 8

Submitted by Ziqian Zhong on Jun 9 · Carnegie Mellon University

Authors: Ziqian Zhong, Ivgeni Segal, Ivan Bercovich, Shashwat Saxena, Kexun Zhang, Aditi Raghunathan

Abstract

Researchers identify widespread vulnerabilities in agent benchmark verification systems and develop an automated iterative process using LLM agents to create robust verifiers that resist exploitation while maintaining legitimate task performance.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Agent benchmarks score submissions with outcome verifiers that are typically hand-written and brittle, leaving them open to reward hacking. We audit 1,968 tasks across five terminal-agent benchmarks and find 323 (16%) hackable by frontier models given only the task description. This corrupts both leaderboard rankings and RL training signal, yet the standard response is manual and reactive.

We introduce the hacker-fixer loop, a method for building exploit-resistant verifiers without per-task manual patching. The loop alternates three LLM agents: a hacker tries to pass the verifier without solving the task, a fixer patches the verifier to reject each discovered exploit, and a solver confirms the patched verifier still admits legitimate solutions. The loop iterates: each patch reshapes what the verifier rewards, surfacing the next exploit. We further add verifier access and let patches transfer across tasks to broaden the set of exploits the loop discovers.

On KernelBench, the loop drives the attack success rate from 62% to 0% on a held-out corpus of publicly reported exploits. We also find that weaker agents in the loop can defend against much stronger hackers: a loop run with Gemini 3 Flash drives the attack success rates of the stronger Gemini 3.1 Pro and Claude Opus 4.7 from 76% and 61%, respectively, to 0% on KernelBench, and that of Gemini 3.1 Pro from 39% to 17% on Terminal Bench across 77 tasks.

We release Terminal Wrench (323 hackable environments, 3,632 hack trajectories) as a snapshot of the current attack surface, along with our patched verifiers, the exploits the loop discovered, and our implementation, as a basis for future work.
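The control flow of the loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the three LLM agents are replaced by plain Python functions, a verifier is modeled as a list of boolean checks, and every name here (`Verifier`, `hacker`, `fixer`, `solver_confirms`, `harden`, and the example task fields) is a hypothetical stand-in chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, Optional

Submission = dict  # toy stand-in for the final state an agent leaves behind
Check = Callable[[Submission], bool]


@dataclass
class Verifier:
    checks: list = field(default_factory=list)

    def accepts(self, submission: Submission) -> bool:
        return all(check(submission) for check in self.checks)


def hacker(verifier: Verifier, exploits: list) -> Optional[Submission]:
    """Stand-in for the hacker agent: find a shortcut the verifier still accepts."""
    return next((e for e in exploits if verifier.accepts(e)), None)


def fixer(verifier: Verifier, exploit: Submission, candidates: list) -> Iterator[Verifier]:
    """Stand-in for the fixer agent: propose patches that reject the exploit."""
    for check in candidates:
        if not check(exploit):
            yield Verifier(verifier.checks + [check])


def solver_confirms(verifier: Verifier, solution: Submission) -> bool:
    """Stand-in for the solver agent: a legitimate solution must still pass."""
    return verifier.accepts(solution)


def harden(verifier, exploits, candidates, solution, max_rounds=10):
    for _ in range(max_rounds):
        exploit = hacker(verifier, exploits)
        if exploit is None:
            break  # no known exploit passes any more
        patched = next(
            (v for v in fixer(verifier, exploit, candidates)
             if solver_confirms(v, solution)),
            None,
        )
        if patched is None:
            break  # no patch both blocks the exploit and admits the solution
        verifier = patched  # the patched verifier is attacked in the next round
    return verifier


def attack_success_rate(verifier, exploits):
    return sum(verifier.accepts(e) for e in exploits) / len(exploits)


# Hypothetical task: "write a sorted file"; the original verifier only checks existence.
solution = {"output_exists": True, "output_sorted": True, "tests_modified": False}
exploits = [
    {"output_exists": True, "output_sorted": False, "tests_modified": False},
    {"output_exists": True, "output_sorted": True, "tests_modified": True},
]
candidates = [
    lambda s: False,                    # over-strict patch; the solver step rejects it
    lambda s: s["output_sorted"],
    lambda s: not s["tests_modified"],
]
baseline = Verifier([lambda s: s["output_exists"]])
hardened = harden(baseline, exploits, candidates, solution)

print(attack_success_rate(baseline, exploits), attack_success_rate(hardened, exploits))
```

In this sketch the solver step is what prevents the fixer from "winning" by rejecting everything: the over-strict candidate blocks each exploit but is discarded because the legitimate solution no longer passes. The fixed exploit pool is a simplification; in the paper the hacker is an LLM agent that searches for new exploits against each patched verifier.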

Community

fjzzq2002

Paper submitter about 11 hours ago

Automatically hardening benchmarks and training environments with the hacker–fixer loop.

Paper: https://arxiv.org/abs/2606.08960
Code: https://github.com/few-sh/harden-v0

Get this paper in your agent:

hf papers read 2606.08960

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
