SWE-rebench Leaderboard (March, April and May 2026): GPT-5.5, Opus 4.7, Cursor (Composer 2.5), Kimi K2.6 and More
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Hi all,

Sorry for going missing. We’ve been collecting a larger, higher-quality set of more complex tasks, and we’re excited to share a major leaderboard update covering the past three months.

We’ve updated the SWE-rebench leaderboard with 110 fresh Python tasks from GitHub PRs created in March, April, and part of May. The setup follows the standard SWE-bench format: models read real PR issues, edit code, run tests, and must make the full test suite pass.

This time, instead of our usual monthly updates with a smaller number of tasks, we collected a larger batch so we could evaluate models on a broader task set. You can still select narrower task windows on the leaderboard if you want a more focused view.

We’ll add more models over the next week, including Gemini Flash 3.5, DeepSeek v4 Pro, and Qwen3.5-397B-A17B, along with smaller models for local development. Going forward, we’ll continue updating models frequently, but over relatively larger task batches. We’re also working on adding multilingual tasks to the leaderboard, plus a few more things we’ll share soon.

Please send requests for models you want us to run! Looking forward to your thoughts and feedback. Join the leaderboard channel in our Discord to discuss models, share ideas, ask questions, or report issues.