MirrorCode: What's the largest software project AI can complete on its own?
AI can already tackle long-horizon coding tasks. Claude Opus 4.7 reimplemented gotree, a roughly 16,000-line bioinformatics toolkit, in 14 hours for $251. However, its headline MirrorCode score is only 56%, leaving significant room for improvement. Models are improving rapidly over time, though data contamination remains a caveat. We open-source 22 of the 25 target programs.
AI can already perform some long-horizon coding tasks
AI can already solve long-horizon MirrorCode tasks, despite their difficulty. For example, Claude Opus 4.7 reimplemented gotree, a bioinformatics toolkit with ~16,000 lines of Go and 40+ commands.[1] We believe this same task would take a human engineer without AI assistance 2–17 weeks. Opus 4.7 completed it in 14 hours at a cost of $251.
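As a rough back-of-envelope illustration of these figures, the sketch below converts the 2–17 week human estimate into working hours and compares it with the 14-hour AI run. The 40-hour working week is our assumption for illustration only (the post does not state one), and the comparison sets human working hours against the model's reported run time, so the ratios are indicative at best.

```python
# Back-of-envelope comparison using the figures quoted above.
# ASSUMPTION: a 40-hour working week; the post does not specify one.
HOURS_PER_WEEK = 40

ai_hours = 14          # reported run time for the gotree task
ai_cost_usd = 251      # reported cost for the gotree task
human_weeks_low, human_weeks_high = 2, 17  # estimated human effort range

# Convert the human estimate from weeks to working hours.
human_hours_low = human_weeks_low * HOURS_PER_WEEK
human_hours_high = human_weeks_high * HOURS_PER_WEEK

# Ratio of human working hours to AI run time, and AI cost per hour of run time.
ratio_low = human_hours_low / ai_hours
ratio_high = human_hours_high / ai_hours
cost_per_hour = ai_cost_usd / ai_hours

print(f"Human working time: {human_hours_low}-{human_hours_high} hours")
print(f"Ratio to AI run time: {ratio_low:.1f}x to {ratio_high:.1f}x")
print(f"AI cost per hour of run time: ${cost_per_hour:.2f}")
```

Under the 40-hour-week assumption, the human estimate spans 80 to 680 working hours. A different assumption about working hours would shift the ratios proportionally.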
However, MirrorCode is not fully solved. Claude Opus 4.7’s headline score is only 56%, meaning there is significant room for further improvement.[2] We look forward to evaluating new models on the benchmark.
We also found that AI models are improving rapidly over time. Leading models from a year ago would have scored about 30%, and were limited to simpler programs, such as a calendar utility. There was no clear overall trend in cost: GPT-5.5 cost 3× more than GPT-5 to solve the same tasks, whereas Claude Opus 4.7 was 3× cheaper than Claude Opus 4.1.
One important caveat to these results is data contamination. Because MirrorCode tasks involve reimplementing open-source programs, AI models are likely to have seen the original codebases in pretraining. This might lead to inflated performance on the benchmark. However, AI successfully reimplemented several target programs that passed our memorization screen, and failed to reimplement programs where the screen showed evidence of memorization. This suggests that the results were not dominated by memorization, but we cannot rule out the possibility that memorization contributes to AI performance. Overall, we expect that the capabilities measured by MirrorCode would generalize to an unseen codebase. We discuss this further, along with more results and details on benchmark construction, in the paper.
Open-source code
We release our scaffold and 22 of the 25 MirrorCode target programs (totaling 132 task instances across the six supported programming languages) as open-source, with the other three targets held out as a private test set.
This work was co-developed with METR and supported by a grant from METR. The authors of MirrorCode are Tom Adamczewski, David Owen, and David Rein. Florian Brand, Giles Edkins, Allen Hart, and Daniel O’Connell contributed additional target programs. Rasmus Faber-Espensen made crucial infrastructure improvements and gave advice on engineering.
Notes
1. The best-scoring AI gotree implementations passed 2000/2001 tests, but failed a single edge-case test for a niche command to manipulate date annotations. Consequently, they do not strictly solve the task to 100% completion, but we consider the reimplementation near-perfect, covering essentially all scoped functionality.
2. On 21 of the 25 MirrorCode targets, an AI model has passed at least 99% of tests in at least one run. The outstanding test failures typically come from a handful of edge cases. At the stricter threshold of reimplementation (100% of tests passing), eight MirrorCode targets have never been solved in any run, so 17 have been solved at least once. Benchmark scores are lower than 17/25 = 68% because several targets are not solved reliably: AI solves them only in some runs.
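To illustrate why a score averaged over runs can sit below the share of targets solved at least once, here is a minimal sketch with made-up data. The target names, per-run pass fractions, and the averaging rule (mean per-target solve rate) are all hypothetical; this is not MirrorCode's actual scoring procedure, which is described in the paper.

```python
# HYPOTHETICAL data: fraction of tests passed in each run, for three made-up targets.
# These are NOT MirrorCode results; names and numbers are illustrative only.
runs = {
    "target_a": [1.0, 1.0, 1.0],      # fully solved in every run
    "target_b": [1.0, 0.97, 0.995],   # fully solved in one run out of three
    "target_c": [0.999, 0.98, 0.90],  # never fully solved, but reached >= 99% once
}


def ever_reached(fractions, threshold):
    """Return True if any run passed at least `threshold` of the tests."""
    return any(f >= threshold for f in fractions)


def solve_rate(fractions, threshold=1.0):
    """Return the share of runs that passed at least `threshold` of the tests."""
    return sum(f >= threshold for f in fractions) / len(fractions)


n_targets = len(runs)

# Share of targets that reached each threshold in at least one run.
ever_99 = sum(ever_reached(f, 0.99) for f in runs.values()) / n_targets
ever_100 = sum(ever_reached(f, 1.0) for f in runs.values()) / n_targets

# ASSUMED aggregation: average the per-target share of fully solved runs.
mean_solve_rate = sum(solve_rate(f) for f in runs.values()) / n_targets

# The gotree figure from the notes: 2000 of 2001 tests clears 99% but not 100%.
gotree_best = 2000 / 2001

print(f"Reached >= 99% at least once: {ever_99:.0%}")
print(f"Reached 100% at least once:   {ever_100:.0%}")
print(f"Mean per-target solve rate:   {mean_solve_rate:.0%}")
print(f"gotree best pass fraction:    {gotree_best:.4f}")
```

In this toy data, two of the three targets are solved at least once, yet the mean solve rate is lower because one of them is solved in only a third of its runs. The same effect explains how a benchmark score can fall below the solved-at-least-once share.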