r/LocalLLaMA · May 31, 2026 · 2 min read

PolyRange: Contamination-resistant offensive-AI benchmark for web targets (that ain't a benchmark, THAT's a benchmark)



Author here. The short version of why I built this:

Cyber-AI evaluation is converging on the same diagnosis from multiple labs. Anthropic's Claude Mythos system card this year: their cyber ranges "lack many features often present in real-world environments such as defensive tooling," and CTF-style benchmarks are saturated to the point Anthropic is questioning whether to continue reporting them. UK AISI's most recent multi-step cyber paper (Folkerts et al.): "No active defenders. Our ranges are static." OpenAI's Trustworthy Third-Party Evaluations playbook: "Evaluators should prefer private or newly constructed tasks where possible." Carlini at DeepMind, last year on Latent Space: stop relying on standardised public benchmarks; construct private custom ones.

What has been missing is a methodology that acts on that diagnosis.

PolyRange operationalises the diagnosis. Every deploy is generated fresh by an LLM, using whichever generator model the researcher chooses, so OpenAI's "newly constructed tasks" criterion is satisfied by construction, and Anthropic's structural worry that "this report will, itself, likely contribute to the problem" doesn't apply: there's no static artefact for a future model to ingest. Defence tiers approximate the active-defender conditions that UK AISI and Anthropic publicly note are missing from current ranges.

The existing alternatives split into two lanes. CTF-style (DVWA, NYU CTF Bench, CyberGym, AutoPenBench): static targets that enter training corpora. Bug-bounty-style (XBOW): find-and-report against undefined defensive infrastructure. Neither measures performance under the production-like conditions the labs have publicly committed to wanting.

Disclosure since people will ask: I'm CEO and co-founder of Aether AI (commercial security AI). PolyRange is independent research, MIT-licensed, intentionally outside Aether's commercial roadmap. The contamination problem seemed worth addressing in the open rather than internally.

v1.0 ships 84 WSTG-derived classes across all 12 OWASP testing-guide categories, two defence tiers, an agent-submits-flag oracle convention, real backends throughout (Postgres dialects, real PHP for LFI, a real shell for command injection, real Jinja2 for SSTI), and a single-command eval CLI. It is MIT-licensed and self-hostable on Fly.io or any Docker host. The methodological contribution is the framework; an empirical paper with a publishable sample size depends on partnership funding for the full run.
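
For readers unfamiliar with the agent-submits-flag convention, here is a minimal sketch of the general idea, not PolyRange's actual code. A fresh random flag is minted per deploy and planted behind the vulnerability, and the oracle only checks whether the agent hands back that exact string. The function names and the `FLAG{...}` format are assumptions made up for illustration.

```python
import hmac
import secrets


def new_flag(prefix: str = "FLAG") -> str:
    """Mint a fresh random flag for one deploy (hypothetical format)."""
    # 16 random bytes -> 32 hex characters, unguessable and unique per deploy.
    return f"{prefix}{{{secrets.token_hex(16)}}}"


def check_submission(expected: str, submitted: str) -> bool:
    """Return True only if the agent submitted the exact per-deploy flag."""
    # Strip surrounding whitespace, then compare in constant time so the
    # oracle does not leak how many leading characters matched.
    return hmac.compare_digest(expected.encode(), submitted.strip().encode())


if __name__ == "__main__":
    flag = new_flag()
    print(check_submission(flag, flag + "\n"))   # correct flag, trailing newline
    print(check_submission(flag, new_flag()))    # a flag from a different deploy
```

Because the flag is random per deploy, a model cannot pass by recalling a flag value seen in training data; it has to actually reach the planted secret.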

Happy to answer questions about the design, particularly the two-bucket entropy framing that separates exploit-recall axes from cosmetic/realism axes. I think adjacent benchmark literature tends to conflate the two.
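
As a rough illustration of why keeping the two buckets separate matters, the sketch below tallies variation per bucket. It assumes each axis is sampled uniformly and independently, so an axis with n options contributes log2(n) bits. The axis names and option counts are hypothetical, invented for this example rather than taken from PolyRange.

```python
import math


def bucket_entropy_bits(axes: dict[str, int]) -> float:
    """Total entropy in bits for independent, uniformly sampled axes."""
    return sum(math.log2(options) for options in axes.values())


# Hypothetical axes: changing these changes which exploit actually works.
exploit_recall_axes = {"injection_sink": 4, "sql_dialect": 3, "filter_variant": 8}

# Hypothetical axes: changing these only changes how the target looks.
cosmetic_axes = {"site_theme": 16, "route_naming": 32}

if __name__ == "__main__":
    print(f"exploit-recall bits: {bucket_entropy_bits(exploit_recall_axes):.2f}")
    print(f"cosmetic bits:       {bucket_entropy_bits(cosmetic_axes):.2f}")
```

Reporting a single combined number would hide the case where most of the variation is cosmetic and a memorised exploit still works unchanged, which is the conflation the two-bucket split is meant to expose.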

submitted by /u/theonejvo