Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models
Authors: Malikeh Ehghaghi, Boglárka Ecsedi, Marsha Chechik, Colin Raffel
Published: 2026-06-09
arXiv: 2606.11409
Code: https://github.com/r-three/risk-under-pressure
Abstract
Compute-aware evaluation framework using FLOPs and risk-compute curves reveals non-monotonic effects of alignment training and varying attack costs across different harm categories.
Adversarial robustness evaluations of large language models (LLMs) typically report attack success rate (ASR) under fixed query budgets, implicitly treating all attacks as equally costly. In practice, the computational expense of different attack strategies can vary by orders of magnitude. Consequently, ASR at a fixed budget can obscure the true effort required to jailbreak a model, thereby making it hard to determine whether an attack's cost justifies its payoff to the attacker. We propose a compute-aware evaluation framework based on computational pressure, measured in cumulative floating-point operations (FLOPs), as a proxy for adversarial effort. We introduce risk-compute curves, which map compute budgets to attack risk, and derive two metrics that summarize the average pressure required for a given attack to succeed. Across ten models spanning three families and four different stages in language model training and alignment, evaluated with three attack strategies (gradient-based, iterative refinement, and template-based) on two jailbreak robustness benchmarks, we find: (1) alignment training has non-monotonic effects on compute-space robustness; (2) scaling model size reduces gradient-based attack effectiveness but has limited impact on cheaper template-based attacks; (3) gradient-based attacks optimized on a surrogate model can transfer to a separate target model, providing a way to reduce attacker costs; (4) compute cost varies by up to ≈5× across harm categories within a single model; and (5) safety-aligned RL increases aggregate cost while leaving some categories disproportionately accessible. We release our framework to enable compute-aware risk assessment and evaluation.
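To make the idea of a risk-compute curve concrete, here is a minimal sketch, not the authors' implementation. It assumes that each prompt is summarized by the cumulative FLOPs an attack had spent when it first succeeded (`None` if it never succeeded). The function names, the toy numbers, and the capped-mean summary are all hypothetical illustrations; the abstract does not define its two metrics, so the summary below is a stand-in of our own. The forward-pass estimate uses the common rule of thumb of about 2 FLOPs per parameter per token, which may differ from the paper's accounting.

```python
from typing import List, Optional, Sequence


def forward_flops(num_params: float, num_tokens: int) -> float:
    """Rule-of-thumb forward-pass cost: ~2 FLOPs per parameter per token.

    Assumption for illustration only; not necessarily the paper's accounting.
    """
    return 2.0 * num_params * num_tokens


def risk_compute_curve(
    flops_to_success: Sequence[Optional[float]],
    budgets: Sequence[float],
) -> List[float]:
    """For each compute budget, return the fraction of prompts jailbroken
    using at most that many cumulative FLOPs."""
    n = len(flops_to_success)
    return [
        sum(1 for f in flops_to_success if f is not None and f <= b) / n
        for b in budgets
    ]


def mean_capped_flops(
    flops_to_success: Sequence[Optional[float]],
    max_budget: float,
) -> float:
    """Hypothetical summary: average FLOPs spent per prompt, where failed or
    over-budget attempts are charged the full budget."""
    capped = [
        max_budget if f is None else min(f, max_budget)
        for f in flops_to_success
    ]
    return sum(capped) / len(capped)


# Toy, made-up data: FLOPs at first success for five prompts (None = never).
toy_flops = [1e12, 5e12, None, 2e13, 3e14]
toy_budgets = [1e12, 1e13, 1e14, 1e15]

print(risk_compute_curve(toy_flops, toy_budgets))  # -> [0.2, 0.4, 0.6, 0.8]
print(mean_capped_flops(toy_flops, max_budget=1e15))
```

Under these assumptions, two attacks with the same ASR at the largest budget can still have very different curves: a cheap template-based attack would rise early, while a gradient-based attack would rise only after far more FLOPs have been spent.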
Community
Compute-aware jailbreak evaluation framework showing that attack success alone is misleading, and that measuring adversarial effort in FLOPs reveals nuanced tradeoffs between alignment, model scaling, attack transferability, and harm-category-specific robustness.