PolyRange: Contamination-resistant offensive-AI benchmark for web targets (that ain't a benchmark, THAT's a benchmark)
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Author here. The short version of why I built this:

Cyber-AI evaluation is converging on the same diagnosis from multiple labs. Anthropic's Claude Mythos system card this year: their cyber ranges "lack many features often present in real-world environments such as defensive tooling," and CTF-style benchmarks are saturated to the point that Anthropic is questioning whether to continue reporting them. UK AISI's most recent multi-step cyber paper (Folkerts et al.): "No active defenders. Our ranges are static." OpenAI's Trustworthy Third-Party Evaluations playbook: "Evaluators should prefer private or newly constructed tasks where possible." Carlini at DeepMind, last year on Latent Space: stop relying on standardised public benchmarks; construct private custom ones.

The labs agree on the diagnosis. What was missing is the methodology, and PolyRange operationalises it. Every deploy is freshly LLM-generated by the researcher's choice of generator model, so OpenAI's "newly constructed tasks" criterion is satisfied by construction, and Anthropic's "this report will, itself, likely contribute to the problem" structural worry doesn't apply (there's no static artefact for a future model to ingest). Defence tiers approximate the active-defender conditions that UK AISI and Anthropic publicly note are missing from current ranges.

The existing alternatives split into two lanes that don't measure what the labs say they need to measure. CTF-style (DVWA, NYU CTF Bench, CyberGym, AutoPenBench): static targets that enter training corpora. Bug-bounty-style (XBOW): find-and-report against undefined defensive infrastructure. Neither measures performance under the production-shaped conditions the labs have publicly said they want.

Disclosure, since people will ask: I'm CEO and co-founder of Aether AI (commercial security AI). PolyRange is independent research, MIT-licensed, and intentionally outside Aether's commercial roadmap. The contamination problem seemed worth addressing in the open rather than internally.

v1.0 ships 84 WSTG-derived classes across all 12 OWASP testing-guide categories, two defence tiers, an agent-submits-flag oracle convention, real backends throughout (Postgres dialects, real PHP for LFI, real shell for command injection, real Jinja2 for SSTI), and a single-command eval CLI. It is MIT-licensed and self-hostable on Fly.io or any Docker host. The methodological contribution is the framework; an empirical paper with a publishable sample size depends on partnership funding for the full run.

Happy to answer questions about the design, particularly the two-bucket entropy framing that separates exploit-recall axes from cosmetic/realism axes, which I think the adjacent benchmark literature tends to conflate.
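For readers unfamiliar with the agent-submits-flag oracle convention mentioned above, here is a minimal Python sketch of the general idea: a fresh random flag is planted in each deploy, and a run counts as solved only if the agent submits that exact string. This is not PolyRange's code. The names `new_flag` and `FlagOracle` and the `FLAG{...}` format are hypothetical, chosen purely for illustration.

```python
import hmac
import secrets


def new_flag(prefix: str = "FLAG") -> str:
    """Generate a fresh random flag for one deploy (hypothetical format)."""
    return f"{prefix}{{{secrets.token_hex(16)}}}"


class FlagOracle:
    """Holds the flag planted in one freshly generated target (illustrative only)."""

    def __init__(self, flag: str) -> None:
        self._flag = flag
        self.attempts = 0

    def submit(self, candidate: str) -> bool:
        """Return True only if the submission matches the planted flag."""
        self.attempts += 1
        # Constant-time comparison, so timing does not leak partial matches.
        return hmac.compare_digest(candidate.strip().encode(), self._flag.encode())


flag = new_flag()          # planted in the target at deploy time
oracle = FlagOracle(flag)  # held by the evaluator, never shown to the agent
```

Because the flag is regenerated on every deploy, a model cannot pass by recalling a flag it saw in training data; it has to extract the value from the live target.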