r/MachineLearning · 3 min read

LLM agents patch security bugs, pass all tests, but still leave the vulnerability open [R]

I built CVE-Bench: 20 real-world CVEs across 18 Python projects (Pillow, GitPython, yt-dlp, urllib3, others), 5 frontier models, 3 prompt conditions, 300 runs total. Each agent runs in a sandboxed container and is scored against a hidden test_security.py derived from the maintainer's own fix. Binary pass/fail (a 90%-patched vulnerability is still a vulnerability).

To better understand the failure modes, I tested three prompt conditions:

  • advisory (full GHSA report),
  • diagnose (exploit description only, no file or function), and
  • locate (exact file and function, no description of the flaw).

The three conditions test meaningfully different things. A model that does well on advisory but drops on diagnose can’t translate a behavioral description into a location in the codebase. A model that holds up on locate is recognizing dangerous code on its own.
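One way to picture the three conditions is as three views over the same task record, each withholding a different piece of information. The sketch below is illustrative only: the `Task` fields, the `build_prompt` helper, and the prompt wording are all assumptions, not the benchmark's actual schema or prompts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    # Hypothetical field names; the real task schema may differ.
    advisory_text: str    # full GHSA report
    exploit_summary: str  # behavioral description of the exploit only
    file_path: str        # file that contains the flaw
    function_name: str    # function that contains the flaw


def build_prompt(task: Task, condition: str) -> str:
    """Return the prompt for one condition, withholding everything else."""
    if condition == "advisory":
        return f"Fix the vulnerability described in this advisory:\n{task.advisory_text}"
    if condition == "diagnose":
        # Exploit description only: no file, no function.
        return f"This attack works against the project. Find and fix the cause:\n{task.exploit_summary}"
    if condition == "locate":
        # Exact location only: no description of the flaw.
        return f"There is a security flaw in {task.function_name} in {task.file_path}. Fix it."
    raise ValueError(f"unknown condition: {condition!r}")
```

Keeping all three prompts derived from one record makes it harder to leak the location into diagnose or the flaw description into locate by accident.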

The leaderboard isn't the finding. The best solve rate is 50% overall and 60% under advisory. Cross-family separation (OpenAI vs Laguna) holds under McNemar's test with continuity correction: all four pairs are significant at α = 0.05. Within-family gaps are noise: a power analysis puts the task count needed to detect a meaningful within-family edge at ~700. That cuts both ways: if the expensive models had a large true advantage, 20 tasks would have been enough to surface it. On this evidence, gpt-5.5 at 12× the cost of gpt-5.4-mini is not the rational choice.
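To see why small gaps need hundreds of paired observations while large gaps need only a few dozen, here is a rough sample-size sketch for a paired (McNemar-style) comparison. It uses a standard normal-approximation formula for paired proportions; this is not necessarily the calculation behind the ~700 figure, and the discordance rates in the usage note are made-up inputs, not measured values. Note that it counts paired observations, which may or may not map one-to-one onto tasks.

```python
import math
from statistics import NormalDist


def mcnemar_pairs_needed(p_only_a: float, p_only_b: float,
                         alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate number of paired observations for a two-sided McNemar test.

    p_only_a: probability that only model A solves a given pair
    p_only_b: probability that only model B solves a given pair
    """
    p_disc = p_only_a + p_only_b   # total discordance rate
    delta = p_only_a - p_only_b    # gap we want to detect
    if delta == 0:
        raise ValueError("no difference to detect")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = (z_alpha * math.sqrt(p_disc)
         + z_beta * math.sqrt(p_disc - delta ** 2)) ** 2 / delta ** 2
    return math.ceil(n)
```

With hypothetical rates, `mcnemar_pairs_needed(0.125, 0.075)` (a 5-point gap) lands in the hundreds, while `mcnemar_pairs_needed(0.30, 0.05)` (a 25-point gap) lands in the tens.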

All four cross-family pairwise comparisons reach statistical significance at α = 0.05 (McNemar's test with continuity correction, n = 60 paired runs per model pair, i.e. 20 tasks × 3 conditions): gpt-5.5 vs laguna-m.1 (p = 0.015), gpt-5.4-nano vs laguna-m.1 (p = 0.017), gpt-5.5 vs laguna-xs.2 (p = 0.028), gpt-5.4-nano vs laguna-xs.2 (p = 0.040). Within-family comparisons remain far from significance; those rankings should be read as approximate.
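For readers who want to check a pairwise comparison by hand, here is a minimal stdlib sketch of McNemar's test with continuity correction. Only the discordant pairs matter: runs that exactly one of the two models solved. The function name and the counts in the usage note are illustrative; they are not the benchmark's data.

```python
import math


def mcnemar_cc(b: int, c: int) -> tuple[float, float]:
    """McNemar's test with continuity correction.

    b: paired runs solved by model A only
    c: paired runs solved by model B only
    Returns (chi-square statistic, two-sided p-value).
    """
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs, nothing to test
    stat = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of chi-square with 1 degree of freedom:
    # P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value
```

For example, `mcnemar_cc(15, 3)` (hypothetical counts) gives a p-value below 0.05, while `mcnemar_cc(6, 4)` does not come close. With few discordant pairs the chi-square approximation is rough, and an exact binomial test is a common alternative.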

The failure taxonomy is the most interesting finding.

  • Wrong-search drift — model finds the right file early, makes one incorrect inference, spends the remaining turns chasing it. Budget expires, no edit made.
  • Budget exhaustion mid-implementation — correct diagnosis, scaffolding added, fix never wired in. Understanding present but model took too long to get to the point.
  • Correct file, wrong gadget — coherent edit, all regression tests green, vulnerability still present. Nothing in the output signals the problem.

That last one is the operationally dangerous case. It's indistinguishable from a correct fix without the hidden test.

How runs end. Outcome breakdown per model across all 60 runs. The larger "no edit attempted" share for gpt-5.5 and laguna-m.1 shows models that deliberated and gave up. The elevated regression bars for nano, laguna-m.1, and laguna-xs.2 show models that patched too aggressively.
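The outcome categories, and the reason the wrong-gadget case is invisible in practice, can be sketched as a small classifier. The `RunResult` fields, the category labels, and the order of the checks are my assumptions about how such a harness could be wired, not the benchmark's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    # Hypothetical fields for one agent run.
    edited: bool                 # did the agent change any file?
    regression_tests_pass: bool  # visible test suite
    security_test_pass: bool     # hidden test_security.py


def classify(run: RunResult) -> str:
    """Map one run to an outcome label (binary solve, no partial credit)."""
    if not run.edited:
        return "no_edit"
    if not run.regression_tests_pass:
        return "regression"
    if run.security_test_pass:
        return "solved"
    return "failed"  # includes "correct file, wrong gadget"


def visible_signal(run: RunResult) -> tuple[bool, bool]:
    """What a reviewer without the hidden test can observe."""
    return run.edited, run.regression_tests_pass
```

A correct fix and a wrong-gadget patch produce the same `visible_signal`, `(True, True)`; only the hidden test separates "solved" from "failed".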

The locate condition is the sharpest tool. Strip the advisory, give only file and function (no description of the flaw, no attack scenario). Every model drops. gpt-5.5 and gpt-5.4-nano drop least (by 2 solves each). No model reads unfamiliar code cold and reliably identifies what's wrong. That's the capability gap the aggregate solve rate doesn't surface.

More tokens do not mean more solves. Each dot is one run; colour shows outcome (green = solved, orange = regression, red = failed). The Laguna models consume 3–4× more tokens than OpenAI models of equivalent capability, driven by longer, less decisive runs.

Caveats: all CVEs postdate the models' training cutoffs, but contamination can't be fully ruled out. The task set skews toward compact, self-contained patches (no protocol-level changes, no multi-service fixes). Per-class solve rates (injection 33%, DoS 24%, XSS/CSRF 20%, info-disclosure 0%) are directional signals, not powered estimates.
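To make "directional, not powered" concrete, a Wilson score interval shows how wide the uncertainty is on a solve rate from a handful of runs. This is a generic sketch; the per-class run counts are not given above, so the counts in the usage note are placeholders rather than the benchmark's numbers.

```python
import math
from statistics import NormalDist


def wilson_interval(solved: int, runs: int,
                    confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a solve rate of solved / runs."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = solved / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)
```

For instance, `wilson_interval(0, 10)` (placeholder counts) still has an upper bound well above 20%, so a 0% observed rate on a small class does not show the class is unsolvable.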

Full write-up, tasks, and result traces: https://giovannigatti.github.io/cve-bench

Code: https://github.com/GiovanniGatti/cve-bench

submitted by /u/Fickle-Box1433