DeepSWE: Measuring coding agents on original, long-horizon engineering tasks
DeepSWE is a new benchmark for evaluating AI coding agents on fresh, complex software engineering tasks. It avoids data contamination by using tasks written from scratch, covers 91 repositories in 5 languages, requires large code changes, and uses hand-written verifiers. Leading models spread widely, from 70% for gpt-5.5[xhigh] down to 5% for gemini-3-flash.
Article intelligence
Key points
- DeepSWE is a contamination-free benchmark with original tasks.
- Tasks span 91 repositories in 5 languages.
- Solutions require 5.5x more code than SWE-bench Pro.
- GPT-5.5 leads at 70% accuracy.
Why it matters
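Leading public coding benchmarks are starting to saturate at the frontier, so top models are hard to tell apart. DeepSWE's original, contamination-free tasks are built to separate them.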
Technical impact
May affect model selection, inference cost, product capability, and evaluation benchmarks.
Today's leading public coding benchmarks are starting to saturate at the frontier: top models cluster within a narrow score band where adjacent configurations often have overlapping confidence intervals. DeepSWE is a long-horizon software engineering benchmark built to separate them. It delivers four advances over existing public benchmarks:
Contamination free: Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining.
High diversity: Tasks span a broad pool of 91 repositories across 5 languages.
Real-world complexity: Prompts are ~half the length of SWE-bench Pro's, yet solutions require 5.5x more code and ~2x more output tokens.
Reliable verification: Verifiers are hand-written to test software behavior rather than implementation details.
The result is a benchmark that reflects how today's frontier coding agents actually perform in software engineering work.
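The difference between testing behavior and testing implementation details can be shown with a small sketch. This is not one of DeepSWE's verifiers: the two `sort_labels_*` solutions and both checks are hypothetical, chosen only to show why a behavioral check accepts any correct solution while an implementation-coupled check rejects valid ones.

```python
def sort_labels_builtin(labels):
    """Hypothetical solution A: delegates to the built-in sort."""
    return sorted(labels)


def sort_labels_insertion(labels):
    """Hypothetical solution B: hand-rolled insertion sort."""
    out = []
    for label in labels:
        i = len(out)
        # Walk left past every element that is greater than the new label.
        while i > 0 and out[i - 1] > label:
            i -= 1
        out.insert(i, label)
    return out


def behavioral_verifier(solution):
    """Checks observable behavior only: given inputs, expected outputs."""
    cases = [
        (["b", "a", "c"], ["a", "b", "c"]),
        ([], []),
        (["x", "x", "a"], ["a", "x", "x"]),
    ]
    return all(solution(list(given)) == expected for given, expected in cases)


def implementation_coupled_check(solution):
    """Brittle check: passes only if the solution references `sorted`."""
    return "sorted" in solution.__code__.co_names


if __name__ == "__main__":
    for solution in (sort_labels_builtin, sort_labels_insertion):
        print(
            solution.__name__,
            "behavioral:", behavioral_verifier(solution),
            "implementation-coupled:", implementation_coupled_check(solution),
        )
```

Both solutions pass the behavioral verifier, but only the first passes the implementation-coupled check, even though the second is equally correct. On tasks with large solutions there are many valid implementations, so a verifier tied to one of them would undercount correct answers.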
Leaderboard
Model                       Score
gpt-5.5[xhigh]              70%±4%
gpt-5.4[xhigh]              56%±5%
claude-opus-4.7[max]        54%±5%
claude-sonnet-4.6[high]     32%±4%
gemini-3.5-flash[medium]    28%±4%
gpt-5.4-mini[xhigh]         24%±4%
kimi-k2.6                   24%±4%
mimo-v2.5-pro               19%±4%
glm-5.1                     18%±4%
gemini-3.1-pro              10%±3%
deepseek-v4-pro             8%±2%
gemini-3-flash              5%±2%
All models are run with mini-swe-agent; see "Why mini-swe-agent" for a comparison against other harnesses.
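One way to read the leaderboard is to check which neighboring models are actually separated by their reported margins. The sketch below assumes each "±" value is a symmetric interval around the score; the source does not state the confidence level or how the margins were computed, and the helper names are illustrative.

```python
# Scores and margins copied from the leaderboard, in rank order (percent).
LEADERBOARD = [
    ("gpt-5.5[xhigh]", 70, 4),
    ("gpt-5.4[xhigh]", 56, 5),
    ("claude-opus-4.7[max]", 54, 5),
    ("claude-sonnet-4.6[high]", 32, 4),
    ("gemini-3.5-flash[medium]", 28, 4),
    ("gpt-5.4-mini[xhigh]", 24, 4),
    ("kimi-k2.6", 24, 4),
    ("mimo-v2.5-pro", 19, 4),
    ("glm-5.1", 18, 4),
    ("gemini-3.1-pro", 10, 3),
    ("deepseek-v4-pro", 8, 2),
    ("gemini-3-flash", 5, 2),
]


def interval(score, margin):
    """Return (low, high), assuming a symmetric margin around the score."""
    return (score - margin, score + margin)


def overlaps(a, b):
    """True if closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def separated_neighbors(board):
    """Return adjacent (higher, lower) pairs whose intervals do not overlap."""
    pairs = []
    for (name_a, s_a, m_a), (name_b, s_b, m_b) in zip(board, board[1:]):
        if not overlaps(interval(s_a, m_a), interval(s_b, m_b)):
            pairs.append((name_a, name_b))
    return pairs


if __name__ == "__main__":
    for higher, lower in separated_neighbors(LEADERBOARD):
        print(f"{higher} is separated from {lower}")
```

Under that assumption, gpt-5.5[xhigh] is clear of gpt-5.4[xhigh], while gpt-5.4[xhigh] and claude-opus-4.7[max] overlap. Overlapping intervals are only a rough screen and do not replace a proper significance test.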
Task Examples
Abort pending body reads on shutdown
Ensure interrupted request and response body reads, formData parsing, and discarded timers abort cleanly during shutdown.
capricorn86/happy-dom (TypeScript)
Fix PromQL label sorting across typed and untyped values
PromQL label sorting must order mixed typed and untyped label values with stable typed comparison rules.
prometheus/prometheus (Go)
Add config file parsing to Cliffy commands
Add command-level config file loading, parsing, merging, and precedence handling.
c4spar/cliffy (TypeScript)
Add deterministic map conflict detection to Y.Map writes
Add strict, deterministic conflict detection for Y.Map key writes with collect and error policies.
yjs/yjs (JavaScript)
Add trap coredump generation to wasmi
Generate opt-in Wasm coredumps on traps and attach the bytes to errors.
wasmi-labs/wasmi (Rust)
Add XML diff, patch, and merge operations to etree
Add recursive XML diffing, patch generation and application, reverse patching, three-way merge, and diff summaries.
beevik/etree (Go)
All 113 tasks
Read the full blog
01 Introduction: Why a new benchmark
02 Overview: What separates DeepSWE
03 Methodology: How tasks and verifiers are built
04 Results: Where frontier models diverge
05 Qualitative analysis: How each frontier model fails
06 Limitations & future work: What we'd do differently