r/MachineLearning

DeepSWE: new benchmark looking at how well today's frontier models can actually write code [R]

DeepSWE delivers four advances over existing public benchmarks:

  • Contamination free: Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining.
  • High diversity: Tasks span a broad pool of 91 repositories across 5 languages.
  • Real-world complexity: Prompts are about half the length of SWE-bench Pro's, yet solutions require 5.5x more code and about 2x more output tokens.
  • Reliable verification: Verifiers are hand-written to test software behavior rather than implementation details.
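
To make the "behavior rather than implementation details" point concrete, here is a minimal sketch of what a behavioral verifier can look like. It is not taken from the DeepSWE repo; the task (order-preserving deduplication) and all function names are hypothetical and chosen only for illustration. The verifier checks inputs and outputs only, so two differently written solutions both pass, while a solution with the wrong behavior fails.

```python
from typing import Callable, Iterable, List


def dedupe_with_set(items: Iterable[int]) -> List[int]:
    """Hypothetical solution A: track already-seen values in a set."""
    seen = set()
    out = []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out


def dedupe_with_dict(items: Iterable[int]) -> List[int]:
    """Hypothetical solution B: rely on dict insertion order."""
    return list(dict.fromkeys(items))


def broken_dedupe(items: Iterable[int]) -> List[int]:
    """Wrong behavior: removes duplicates but loses the original order."""
    return sorted(set(items))


def behavioral_verifier(solution: Callable[[Iterable[int]], List[int]]) -> bool:
    """Check observable behavior only: given input, is the output correct?

    Nothing here inspects which data structure or helper the solution used.
    """
    cases = [
        ([], []),
        ([1, 1, 1], [1]),
        ([3, 1, 3, 2, 1], [3, 1, 2]),
    ]
    return all(solution(list(inp)) == expected for inp, expected in cases)


if __name__ == "__main__":
    for fn in (dedupe_with_set, dedupe_with_dict, broken_dedupe):
        print(fn.__name__, behavioral_verifier(fn))
```

A verifier that instead asserted something like "the solution must call `set()`" would reject solution B even though it behaves correctly, which is the failure mode the bullet describes.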

The result is a benchmark that reflects how today's frontier coding agents actually perform in software engineering work.

https://preview.redd.it/lacvagyr159h1.png?width=1373&format=png&auto=webp&s=6514340a15d51d7f03da733f08fb3f6a302cac75

It's open-source: https://github.com/datacurve-ai/deep-swe

submitted by /u/we_are_mammals