r/MachineLearning · June 24, 2026

DeepSWE: new benchmark looking at how well today's frontier models can actually write code [R]

DeepSWE delivers four advances over existing public benchmarks:

Contamination-free: Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining.
High diversity: Tasks span a broad pool of 91 repositories across 5 languages.
Real-world complexity: Prompts are ~half the length of SWE-bench Pro's, yet solutions require 5.5x more code and ~2x more output tokens.
Reliable verification: Verifiers are hand-written to test software behavior rather than implementation details.

The result is a benchmark that reflects how today's frontier coding agents actually perform in software engineering work.
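
To make the "Reliable verification" point concrete, here is a minimal sketch of what testing behavior rather than implementation details can look like. It is purely illustrative and not taken from DeepSWE: the `slugify_regex`, `slugify_loop`, and `behavioral_verifier` names and the test cases are all hypothetical. The idea is that two structurally different solutions should both pass, because the verifier only looks at inputs and outputs.

```python
import re


def slugify_regex(title: str) -> str:
    """Hypothetical solution A: collapse non-alphanumeric runs with a regex."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def slugify_loop(title: str) -> str:
    """Hypothetical solution B: build the slug one character at a time."""
    parts, word = [], []
    for ch in title.lower():
        if ch.isascii() and ch.isalnum():
            word.append(ch)
        elif word:
            # A separator ends the current word.
            parts.append("".join(word))
            word = []
    if word:
        parts.append("".join(word))
    return "-".join(parts)


def behavioral_verifier(slugify) -> bool:
    """Hypothetical verifier: checks observable input/output behavior only.

    It never inspects source code, helper names, or which library was used.
    """
    cases = {
        "Hello, World!": "hello-world",
        "  spaces   everywhere  ": "spaces-everywhere",
        "already-a-slug": "already-a-slug",
        "": "",
    }
    return all(slugify(text) == expected for text, expected in cases.items())


if __name__ == "__main__":
    print(behavioral_verifier(slugify_regex))
    print(behavioral_verifier(slugify_loop))
```

A verifier that instead asserted "the solution must call `re.sub`" would reject the loop-based version even though it behaves identically, which is the failure mode that behavior-level checks avoid.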
