
DeepSWE: Measuring coding agents on original, long-horizon engineering tasks

DeepSWE is a new benchmark for evaluating AI coding agents on original, long-horizon software engineering tasks. Its tasks are written from scratch to avoid data contamination, span 91 repositories in 5 languages, require 5.5x more code per solution than SWE-bench Pro, and are checked by hand-written verifiers. Scores vary widely across leading models, from 70% for gpt-5.5[xhigh] down to 5% for gemini-3-flash.

Article intelligence

Engineers · Advanced

Key points

  • DeepSWE is a contamination-free benchmark with original tasks.
  • Tasks span 91 repositories in 5 languages.
  • Solutions require 5.5x more code than SWE-bench Pro.
  • gpt-5.5[xhigh] leads the leaderboard at 70% ± 4%.

Why it matters

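Leading public coding benchmarks are saturating at the frontier, with top models clustering in a narrow score band. Because DeepSWE's tasks are written from scratch, no model can have seen their solutions during pretraining, and the benchmark separates models that existing benchmarks no longer distinguish.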

Technical impact

The results may inform model selection for coding agents and expectations about inference cost, since solutions use roughly 2x the output tokens of SWE-bench Pro solutions.

Today's leading public coding benchmarks are starting to saturate at the frontier: top models cluster within a narrow score band where adjacent configurations often overlap on confidence intervals. DeepSWE is a long-horizon software engineering benchmark built to separate them. It delivers four advances over existing public benchmarks:

Contamination free: Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining.

High diversity: Tasks span a broad pool of 91 repositories across 5 languages.

Real-world complexity: Prompts are ~half the length of SWE-bench Pro's, yet solutions require 5.5x more code and ~2x more output tokens.

Reliable verification: Verifiers are hand-written to test software behavior rather than implementation details.

The result is a benchmark that reflects how today's frontier coding agents actually perform in software engineering work.
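
The reliable-verification point above can be illustrated with a small sketch. It is not taken from DeepSWE: the task ("remove duplicates while preserving first-seen order"), both solutions, and both verifier functions are hypothetical, chosen only to show why a verifier that checks observable behavior accepts any correct solution, while one tied to implementation details can reject a valid one.

```python
from typing import Callable, Iterable, List

Solution = Callable[[Iterable[str]], List[str]]


# Two hypothetical, equally correct solutions to the same task.
def dedupe_with_set(items: Iterable[str]) -> List[str]:
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


def dedupe_with_dict(items: Iterable[str]) -> List[str]:
    # dicts preserve insertion order in Python 3.7+
    return list(dict.fromkeys(items))


def behavioral_verifier(solution: Solution) -> bool:
    """Checks only input/output behavior, not how the result is produced."""
    cases = [
        ([], []),
        (["a"], ["a"]),
        (["b", "a", "b", "c", "a"], ["b", "a", "c"]),
    ]
    return all(solution(list(inp)) == expected for inp, expected in cases)


def implementation_coupled_verifier(solution: Solution) -> bool:
    """Anti-pattern: passes only if the solution has a local named `seen`."""
    return "seen" in solution.__code__.co_varnames


if __name__ == "__main__":
    for fn in (dedupe_with_set, dedupe_with_dict):
        print(fn.__name__, behavioral_verifier(fn), implementation_coupled_verifier(fn))
```

Both solutions pass the behavioral check, but the implementation-coupled check rejects `dedupe_with_dict` even though it is correct. A benchmark built on the second style of check would penalize agents for making valid design choices.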

Leaderboard

  • gpt-5.5[xhigh]: 70% ± 4%
  • gpt-5.4[xhigh]: 56% ± 5%
  • claude-opus-4.7[max]: 54% ± 5%
  • claude-sonnet-4.6[high]: 32% ± 4%
  • gemini-3.5-flash[medium]: 28% ± 4%
  • gpt-5.4-mini[xhigh]: 24% ± 4%
  • kimi-k2.6: 24% ± 4%
  • mimo-v2.5-pro: 19% ± 4%
  • glm-5.1: 18% ± 4%
  • gemini-3.1-pro: 10% ± 3%
  • deepseek-v4-pro: 8% ± 2%
  • gemini-3-flash: 5% ± 2%

All models are run with mini-swe-agent; see Why mini-swe-agent for a comparison against other harnesses.
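
One way to read the ± values is to check which neighboring leaderboard entries are separated and which overlap. The sketch below does this with the published numbers. The page does not say whether ± is a standard error or a confidence interval, so the code simply treats each entry as a symmetric interval; that treatment and the function names are assumptions. Interval overlap is a rough heuristic, not a formal significance test.

```python
from typing import Dict, List, Tuple

# (score, half-width) in percentage points, copied from the leaderboard.
LEADERBOARD: Dict[str, Tuple[int, int]] = {
    "gpt-5.5[xhigh]": (70, 4),
    "gpt-5.4[xhigh]": (56, 5),
    "claude-opus-4.7[max]": (54, 5),
    "claude-sonnet-4.6[high]": (32, 4),
    "gemini-3.5-flash[medium]": (28, 4),
    "gpt-5.4-mini[xhigh]": (24, 4),
    "kimi-k2.6": (24, 4),
    "mimo-v2.5-pro": (19, 4),
    "glm-5.1": (18, 4),
    "gemini-3.1-pro": (10, 3),
    "deepseek-v4-pro": (8, 2),
    "gemini-3-flash": (5, 2),
}


def overlaps(a: Tuple[int, int], b: Tuple[int, int]) -> bool:
    """True if the intervals score ± half-width of a and b intersect."""
    lo_a, hi_a = a[0] - a[1], a[0] + a[1]
    lo_b, hi_b = b[0] - b[1], b[0] + b[1]
    return lo_a <= hi_b and lo_b <= hi_a


def adjacent_overlaps(board: Dict[str, Tuple[int, int]]) -> List[Tuple[str, str, bool]]:
    """Compare each entry with the next-ranked one, highest score first."""
    ranked = sorted(board.items(), key=lambda kv: kv[1][0], reverse=True)
    return [
        (name_a, name_b, overlaps(a, b))
        for (name_a, a), (name_b, b) in zip(ranked, ranked[1:])
    ]


if __name__ == "__main__":
    for name_a, name_b, overlap in adjacent_overlaps(LEADERBOARD):
        status = "overlap" if overlap else "separated"
        print(f"{name_a} vs {name_b}: {status}")
```

Under this reading, three adjacent pairs are cleanly separated: gpt-5.5[xhigh] from gpt-5.4[xhigh], claude-opus-4.7[max] from claude-sonnet-4.6[high], and glm-5.1 from gemini-3.1-pro. The remaining neighbors overlap, including gpt-5.4[xhigh] and claude-opus-4.7[max].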

Task Examples

Abort pending body reads on shutdown

Ensure interrupted request and response body reads, formData parsing, and discarded timers abort cleanly during shutdown.

capricorn86/happy-dom (TypeScript)

Fix PromQL label sorting across typed and untyped values

PromQL label sorting must order mixed typed and untyped label values with stable typed comparison rules.

prometheus/prometheus (Go)

Add config file parsing to Cliffy commands

Add command-level config file loading, parsing, merging, and precedence handling.

c4spar/cliffy (TypeScript)

Add deterministic map conflict detection to Y.Map writes

Add strict, deterministic conflict detection for Y.Map key writes with collect and error policies.

yjs/yjs (JavaScript)

Add trap coredump generation to wasmi

Generate opt-in Wasm coredumps on traps and attach the bytes to errors.

wasmi-labs/wasmi (Rust)

Add XML diff, patch, and merge operations to etree

Add recursive XML diffing, patch generation and application, reverse patching, three-way merge, and diff summaries.

beevik/etree (Go)

All 113 tasks

Read the full blog

01 Introduction: Why a new benchmark

02 Overview: What separates DeepSWE

03 Methodology: How tasks and verifiers are built

04 Results: Where frontier models diverge

05 Qualitative analysis: How each frontier model fails

06 Limitations & future work: What we'd do differently