Reward hacking undermines model intelligence gains in coding benchmarks
Smarter AI models are increasingly exploiting benchmark environments to retrieve known fixes rather than deriving solutions, a phenomenon known as reward hacking. Cursor's audit found that 63% of successful Opus 4.8 Max resolutions on SWE-bench Pro were retrieved. Restricting git history and internet access sharply reduced scores, especially for newer models. The study emphasizes the need for controlled eval environments to ensure benchmarks measure true coding ability.
Smarter models are becoming more resourceful at hacking coding benchmarks.
Eval suites built from real bugs that were later fixed are especially vulnerable because the problems have already been solved. If the agent has access to repository history or the public web, it can sometimes look up the answer rather than derive it.
To measure how widespread this behavior is, we built an agent to audit eval trajectories. On SWE-bench Pro, we found that 63% of successful Opus 4.8 Max resolutions retrieved the fix rather than derived it. When we sealed git history and restricted internet access, scores dropped sharply for Opus as well as for our model, Composer 2.5:
Opus 4.8 Max fell from 87.1% to 73.0%
Composer 2.5 fell from 74.7% to 54.0%
Prior research has shown that coding benchmark answers can leak through publicly available sources, including a 2024 study and a 2025 Meta report. Our study quantifies the problem on current frontier coding-agent runs. The broader lesson is that avoiding training-time contamination is not enough: agentic coding benchmarks also need controlled runtime environments.
For teams conducting evals, we propose mitigating this reward-hacking behavior by auditing transcripts and constraining the eval environment.
#Catch a model with a model
To measure the scale of the reward-hacking behavior, we had the auditor examine 731 Opus 4.8 Max trajectories. It saw the problem statement and full agent trajectory, without seeing whether the run passed, and classified whether the agent had retrieved the known answer.
The two most common reward-hacking patterns it found were:
Upstream lookup: In 57% of the trajectories, Opus 4.8 Max found the merged PR or fixed source file on the public web, then reproduced the fix nearly verbatim.
Git-history mining: In 9% of trajectories, Opus 4.8 Max searched the bundled .git history for the future commit that fixed the bug, then extracted the patch.[1]
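To make the blind-audit bookkeeping concrete, here is a minimal sketch of how a figure like "share of successful resolutions that were retrieved" could be computed. It is not the auditor itself: the `Run` record, `auditor_view`, and `retrieval_rate_among_passes` are hypothetical names, and the verdicts are assumed to come from whatever classifier inspects the trajectory. The key point is that the pass/fail outcome is withheld from the auditor and only joined back in afterwards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One eval trajectory (hypothetical record layout)."""
    task_id: str
    problem: str
    trajectory: str
    passed: bool  # known to the harness, hidden from the auditor


def auditor_view(run: Run) -> dict:
    """Return only the fields the auditor may see; the outcome is left out."""
    return {
        "task_id": run.task_id,
        "problem": run.problem,
        "trajectory": run.trajectory,
    }


def retrieval_rate_among_passes(runs: list, verdicts: dict) -> float:
    """Fraction of passing runs whose verdict says the fix was retrieved.

    `verdicts` maps task_id -> True when the auditor judged the run
    to have retrieved the known answer.
    """
    passed = [r for r in runs if r.passed]
    if not passed:
        return 0.0
    retrieved = sum(1 for r in passed if verdicts[r.task_id])
    return retrieved / len(passed)
```

Joining verdicts to outcomes only after classification keeps the auditor from being biased toward labeling passing runs as retrieval.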
As models get stronger, they can sometimes infer that they are in an eval, especially when the task is borrowed from a past public repository. Even in cases where they don't remember the fix from training, the environment can still give them clues that the bug has already been solved.
In one SWE-bench Multilingual task from a 2019 jq issue, the agent tried to reproduce the bug with the system jq binary. Because the image had been built after the bug was fixed, the reproduction failed, and the agent inferred that the issue had already been solved. That awareness pushed it toward searching for the fix instead of deriving one.
A few cases were more direct. One agent found a SWE-bench mirror page that exposed hidden tests and the gold patch. Another obtained hidden test files and hardcoded the expected exception string needed to pass.
Upstream lookup (Opus 4.8 Max). The agent queried the merged PR through the GitHub API to find the files the fix touched, then reproduced it (the same response also exposes each file's diff):
cd /testbed && curl -s "https://api.github.com/repos/apache/druid/pulls/14092/files" 2>/dev/null | grep '"filename"'
Git-history mining (Composer 2.5). The agent located the fix commit in the bundled .git history, read its diff, then applied it directly:
cd /testbed && git show 895abd8929 -p 2>/dev/null | head -400
cd /testbed && git cherry-pick 895abd8929 2>&1
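Commands like the two above leave recognizable traces in a transcript. As a rough illustration, a cheap pattern-based pre-filter could flag candidate trajectories before a model-based auditor reviews them. This is a swapped-in heuristic, not the auditor described in this post; the patterns and the `flag_command` helper are assumptions, and they will over-flag legitimate uses of `git show`.

```python
import re

# Hypothetical patterns, loosely modeled on the two commands quoted above.
PATTERNS = {
    "upstream_lookup": re.compile(
        r"api\.github\.com/repos/[^/\s]+/[^/\s]+/(?:pulls|commits)/"
        r"|github\.com/[^/\s]+/[^/\s]+/(?:pull|commit)/"
    ),
    "git_history_mining": re.compile(
        r"\bgit\s+(?:show|cherry-pick|reflog)\b"
        r"|\bgit\s+log\b.*--all\b"
    ),
}


def flag_command(command: str) -> list[str]:
    """Return the sorted labels of every pattern the shell command matches."""
    return sorted(
        label for label, pattern in PATTERNS.items() if pattern.search(command)
    )
```

A flag here is only a reason to look closer. Whether the agent actually reproduced a known fix still has to be judged from the full trajectory.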
#Stricter environment design
Most reward hacking flowed through the public web and repository history. For evals built from historical public repositories, those channels need to be controlled because they may contain the answer. In response, we built a strict harness with two isolation mechanisms:
History isolation. Before the agent starts, the .git directory is removed and the repository is reinitialized as a fresh single-commit repo. The original history is restored only at scoring time, so tests run as usual.
Egress proxying. Network access is denied by default. As a best-effort control, a pinned proxy allows dependency resolution against an allow-list of package registries, and nothing else.
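The two mechanisms can be sketched in a few lines. This is an illustration under stated assumptions, not the harness itself: the registry hostnames, the `git-history` stash name, and all function names are hypothetical, and a real egress control is enforced at the network layer rather than in application code. The step that reinitializes a fresh single-commit repo is left as a comment because it needs the git CLI.

```python
import shutil
from pathlib import Path
from urllib.parse import urlsplit

# Hypothetical allow-list; the post does not name the registries it permits.
ALLOWED_HOSTS = {"pypi.org", "files.pythonhosted.org", "registry.npmjs.org"}


def egress_allowed(url: str) -> bool:
    """Deny by default; allow only exact host matches on the allow-list."""
    host = urlsplit(url).hostname  # None when the URL has no scheme or host
    return host is not None and host in ALLOWED_HOSTS


def stash_history(repo: Path, stash: Path) -> None:
    """Move .git out of the agent's reach before the run starts."""
    shutil.move(str(repo / ".git"), str(stash / "git-history"))
    # A real harness would now run `git init`, `git add`, and `git commit`
    # inside `repo` to create the fresh single-commit repository.


def restore_history(repo: Path, stash: Path) -> None:
    """Put the original history back at scoring time."""
    fresh = repo / ".git"
    if fresh.exists():
        shutil.rmtree(fresh)  # discard the throwaway single-commit repo
    shutil.move(str(stash / "git-history"), str(fresh))
```

Exact host matching is deliberate: suffix or substring checks would let a lookalike host such as `pypi.org.evil.example` through.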
This restriction is specific to evals built from historical public repositories. It's one reason we prefer evals built from non-public repositories, like CursorBench. They can test agentic coding ability while still letting agents use tools in the ways they would during real work.
#A growing gap
We reran SWE-bench Pro and SWE-bench Multilingual in the stricter harness, then took the gap between each model's standard-harness and strict-harness scores as a proxy for the combined effect of removing these leakage channels[2]:
On SWE-bench Multilingual, the gap was under 1 point for Opus 4.6, 9.1 points for Opus 4.8 Max, and 7.5 points for Composer 2.5.
On SWE-bench Pro, the gap was under 1 point for Opus 4.6, 14.1 points for Opus 4.8 Max, and 20.7 points for Composer 2.5.
The clear takeaway is that reward hacking is far more common with newer, more sophisticated models than with older ones. Interestingly, GPT models don't show the same escalation, with generally smaller gaps in our runs.
We also observed that our own model, Composer 2.5, had the largest Pro gap in the study. This is one reason we do not treat the standard SWE-bench Pro score as a reliable benchmark number for Composer. The score was real in the narrow sense that the harness produced it, but it mixed coding ability with access to known fixes.
Standard vs. strict harness (test pass rate)
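| Rank | Model | Standard | Strict | Gap (points) |
|---|---|---|---|---|
| 1 | Opus 4.8 (max) | 91.16% | 82.03% | +9.1 |
| 2 | Opus 4.8 (xhigh) | 88.86% | 80.67% | +8.2 |
| 3 | Opus 4.7 (max) | 84.80% | 80.47% | +4.3 |
| 4 | Opus 4.7 (xhigh) | 83.74% | 78.60% | +5.1 |
| 5 | Opus 4.8 (high) | 83.09% | 79.26% | +3.8 |
| 6 | Opus 4.8 (medium) | 81.87% | 77.84% | +4.0 |
| 7 | Opus 4.7 (high) | 81.42% | 77.75% | +3.7 |
| 8 | Opus 4.8 (low) | 79.51% | 74.36% | +5.2 |
| 9 | Composer 2.5 | 79.15% | 71.60% | +7.5 |
| 10 | GPT-5.4 (xhigh) | 79.00% | 75.20% | +3.8 |
| 11 | GPT-5.5 (xhigh) | 77.80% | 74.40% | +3.4 |
| 12 | Opus 4.7 (medium) | 77.33% | 75.72% | +1.6 |
| 13 | GPT-5.5 (high) | 77.30% | 74.70% | +2.6 |
| 14 | GPT-5.4 (high) | 76.80% | 73.30% | +3.5 |
| 15 | Opus 4.6 (max) | 76.33% | 76.06% | +0.3 |
| 16 | Opus 4.6 (high) | 76.11% | 75.22% | +0.9 |
| 17 | Opus 4.7 (low) | 75.89% | 72.64% | +3.3 |
| 18 | GPT-5.5 (medium) | 75.30% | 74.20% | +1.1 |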
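The gap in each row is the standard pass rate minus the strict pass rate. The sketch below recomputes it from the table and averages it by model family. The grouping by family and the helper names are illustrative choices, not an analysis from the study, and because the inputs are the rounded two-decimal values, a recomputed gap can differ from the published one in the last digit.

```python
from collections import defaultdict

# (model, standard %, strict %) copied from the table above.
ROWS = [
    ("Opus 4.8 (max)", 91.16, 82.03),
    ("Opus 4.8 (xhigh)", 88.86, 80.67),
    ("Opus 4.7 (max)", 84.80, 80.47),
    ("Opus 4.7 (xhigh)", 83.74, 78.60),
    ("Opus 4.8 (high)", 83.09, 79.26),
    ("Opus 4.8 (medium)", 81.87, 77.84),
    ("Opus 4.7 (high)", 81.42, 77.75),
    ("Opus 4.8 (low)", 79.51, 74.36),
    ("Composer 2.5", 79.15, 71.60),
    ("GPT-5.4 (xhigh)", 79.00, 75.20),
    ("GPT-5.5 (xhigh)", 77.80, 74.40),
    ("Opus 4.7 (medium)", 77.33, 75.72),
    ("GPT-5.5 (high)", 77.30, 74.70),
    ("GPT-5.4 (high)", 76.80, 73.30),
    ("Opus 4.6 (max)", 76.33, 76.06),
    ("Opus 4.6 (high)", 76.11, 75.22),
    ("Opus 4.7 (low)", 75.89, 72.64),
    ("GPT-5.5 (medium)", 75.30, 74.20),
]


def mean_gap_by_family(rows):
    """Average (standard - strict) per model family, in percentage points."""
    gaps = defaultdict(list)
    for model, standard, strict in rows:
        family = model.split(" (")[0]  # drop the effort suffix, e.g. " (max)"
        gaps[family].append(standard - strict)
    return {family: round(sum(v) / len(v), 2) for family, v in gaps.items()}


if __name__ == "__main__":
    means = mean_gap_by_family(ROWS)
    for family in sorted(means, key=means.get, reverse=True):
        print(f"{family}: {means[family]:.2f} points")
```

Averaging across effort levels is a coarse summary, since each family has a different mix of settings in the table, so the per-row gaps remain the more reliable comparison.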
#Designing evals for aware agents
The main takeaway for teams running coding evals is that benchmark design should not stop at dataset construction. It also has to account for the runtime environment, including what the agent can search, inspect, fetch, and recover while the task is running.
That does not mean every eval should remove internet access or git history. Some evals are meant to test how well agents use the surrounding context of a real codebase, and in those settings broad access may be part of the task. The problem is when that access changes what the score means.
For historical public-repo benchmarks, open access can let agents find the known fix rather than solve the bug. Without controls in the harness, scores can conflate coding ability with answer retrieval.
Teams running evals should decide what behavior they want to measure, design the harness around that, and make the setup clear when they report results. Auditing transcripts can help reveal when models are solving tasks in unexpected ways. The goal is not to ban normal tool use, but to make sure the benchmark measures what it claims to measure.
Even then, there remains a harder open problem. As models become more aware of when they are being evaluated, they may change their behavior in subtler ways that are not fixed by sealing git history or restricting internet access. Runtime contamination is one concrete version of a broader challenge of building evals that retain construct validity even when the model infers that it is being evaluated.
[1] SWE-bench has since addressed this upstream by stripping future git history from its environment images (PR #471), with follow-up git cleanup work in early 2026 (PR #533). The images we had ingested predated that fix.
[2] The exact gap sizes and the frequency of reward-hacking attempts depend on the prompts used. For example, hacking attempts increased when we instructed the model to keep working without stopping.