LLM agents patch security bugs, pass all tests, but still leave the vulnerability open [R]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
I built CVE-Bench: 20 real-world CVEs across 18 Python projects (Pillow, GitPython, yt-dlp, urllib3, and others), 5 frontier models, 3 prompt conditions, 300 runs total. Each agent runs in a sandboxed container and is scored against a hidden test. To better understand failure modes, I tested three prompt conditions: advisory, diagnose, and locate.
The three conditions test meaningfully different things. A model that does well on advisory but drops on diagnose can't translate a behavioral description into a location in the codebase. A model that holds up on locate is recognizing dangerous code on its own.

The leaderboard isn't the finding. The best solve rate is 50% overall and 60% under advisory. Cross-family separation (OpenAI vs Laguna) is confirmed by McNemar's test with continuity correction (all four pairs are significant at α = 0.05). Within-family gaps are noise: a power analysis puts the task count needed to detect a meaningful within-family edge at ~700. That cuts both ways: if the expensive models had a large true advantage, 20 tasks would have been enough to surface it. At 12× the cost of gpt-5.4-mini, gpt-5.5 is not the rational choice.

The failure taxonomy is the most interesting finding.
That last one is the operationally dangerous case. It's indistinguishable from a correct fix without the hidden test.

The locate condition is the sharpest tool. Strip the advisory and give only the file and function (no description of the flaw, no attack scenario), and every model drops. gpt-5.5 and gpt-5.4-nano drop least (by 2 solves each). No model reads unfamiliar code cold and reliably identifies what's wrong. That's the capability gap the aggregate solve rate doesn't surface.

More tokens do not mean more solves. The Laguna models consume 3–4× more tokens than OpenAI models of equivalent capability, driven by longer, less decisive runs.

Caveats: all CVEs are post-training-cutoff, but contamination isn't fully eliminable. The task set skews toward compact, self-contained patches (no protocol-level changes, no multi-service fixes). Per-class solve rates (injection 33%, DoS 24%, XSS/CSRF 20%, info-disclosure 0%) are directional signals, not powered estimates.

Full write-up, tasks, and result traces: https://giovannigatti.github.io/cve-bench
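For readers who want to run the same kind of paired comparison on their own results, here is a minimal sketch of McNemar's test with continuity correction, the test named above for the cross-family comparison. It is not the code used for the post. The function name `mcnemar_cc` and the outcome lists are hypothetical placeholders, and the numbers are made up for illustration, not taken from the benchmark.

```python
from math import erfc, sqrt


def mcnemar_cc(solved_a, solved_b):
    """McNemar's test with continuity correction on paired pass/fail outcomes.

    solved_a[i] and solved_b[i] are booleans for the same task/condition run
    by two different models. Returns (b, c, statistic, p_value).
    """
    if len(solved_a) != len(solved_b):
        raise ValueError("outcome lists must be paired, one entry per run")

    # Only discordant pairs carry information: runs where exactly one model solved.
    b = sum(1 for x, y in zip(solved_a, solved_b) if x and not y)
    c = sum(1 for x, y in zip(solved_a, solved_b) if y and not x)
    if b + c == 0:
        return b, c, 0.0, 1.0  # no disagreement, nothing to test

    # Continuity-corrected statistic, chi-square with 1 degree of freedom.
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square(1): P(Z^2 > stat) = erfc(sqrt(stat / 2)).
    p_value = erfc(sqrt(stat / 2))
    return b, c, stat, p_value


# Hypothetical outcomes for two models on the same 20 runs (NOT benchmark data).
model_a = [True] * 14 + [False] * 2 + [True] * 4
model_b = [False] * 14 + [True] * 2 + [True] * 4
b, c, stat, p_value = mcnemar_cc(model_a, model_b)
print(f"discordant pairs: b={b}, c={c}, statistic={stat:.4f}, p={p_value:.4f}")
```

With few discordant pairs, as is likely at 20 tasks, the chi-square approximation is rough, so an exact binomial test on `b` and `c` is a reasonable cross-check.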