Better Models: Worse Tools
Armin Ronacher reports a curious problem: newer Claude models (Opus 4.8 and Sonnet 5) sometimes add invented fields to Pi's edit tool calls, so Pi rejects the call and asks the model to retry, while older models do not. He theorizes that Anthropic's reinforcement learning, aimed at Claude Code's built-in edit tool, inadvertently degrades performance on third-party harnesses. This raises the question of whether harnesses like Pi should implement multiple edit tools to match model-specific training.
Simon Willison’s Weblog
4th July 2026 - Link Blog
Better Models: Worse Tools. Armin reports on a weird problem he ran into while hacking on Pi:
The short version is that newer Claude models sometimes call Pi’s edit tool with extra, invented fields in the nested edits[] array. And not Haiku or some small model: Opus 4.8. The edit itself is usually correct but the arguments do not match the schema as the model invents made-up keys and Pi thus rejects the tool call and asks to try again.
That alone is not too surprising as models emit malformed tool calls sometimes. Particularly small ones. What surprised me is that this is getting worse with newer Anthropic models as both Opus 4.8 and Sonnet 5 show it but none of the older models. In other words, the SOTA models of the family are worse at this specific tool schema than their older siblings.
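To make the failure mode concrete, here is a minimal Python sketch of the kind of strict check that rejects a call with invented keys. The field names (`edits`, `old_text`, `new_text`) and the helper `validate_edit_call` are hypothetical and are not taken from Pi's actual schema or code.

```python
# Hypothetical field names for one entry in the nested edits[] array.
ALLOWED_EDIT_KEYS = {"old_text", "new_text"}


def validate_edit_call(args: dict) -> list[str]:
    """Return a list of schema errors; an empty list means the call is valid."""
    edits = args.get("edits")
    if not isinstance(edits, list) or not edits:
        return ["'edits' must be a non-empty list"]

    errors = []
    for i, edit in enumerate(edits):
        if not isinstance(edit, dict):
            errors.append(f"edits[{i}] must be an object")
            continue
        # Keys the model made up that the schema does not define.
        extra = sorted(set(edit) - ALLOWED_EDIT_KEYS)
        # Keys the schema requires that the model left out.
        missing = sorted(ALLOWED_EDIT_KEYS - set(edit))
        if extra:
            errors.append(f"edits[{i}] has unexpected keys: {extra}")
        if missing:
            errors.append(f"edits[{i}] is missing keys: {missing}")
    return errors
```

In this sketch an edit whose `old_text` and `new_text` are perfectly correct is still rejected if it carries one extra key, which matches the behaviour Armin describes: the edit is usually right, but the arguments do not match the schema.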
Armin theorizes that this is because more recent Anthropic models have been specifically trained (presumably via Reinforcement Learning) to better use the edit tools that are baked into Claude Code. This has the unfortunate effect that other coding harnesses, such as Pi, may find that their own custom edit tools are more likely to be used incorrectly.
Claude Code's edit tool uses search and replace. OpenAI's Codex uses an apply_patch mechanism instead, and OpenAI have talked in the past about how their models are trained to use that tool effectively.
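For readers unfamiliar with the search-and-replace style, a rough Python sketch follows. It assumes the common design where the text to replace must match exactly once. The function name `apply_search_replace` and its parameters are hypothetical, and this is not Claude Code's actual implementation.

```python
def apply_search_replace(content: str, old_text: str, new_text: str) -> str:
    """Replace old_text with new_text, assuming it must occur exactly once."""
    if not old_text:
        raise ValueError("old_text must not be empty")
    count = content.count(old_text)
    if count != 1:
        # Zero matches means the model misquoted the file; more than one
        # means the edit is ambiguous. Either way, refuse to guess.
        raise ValueError(f"expected exactly 1 match for old_text, found {count}")
    return content.replace(old_text, new_text, 1)
```

A patch-based tool takes a different shape of input (a diff-like description of the change rather than an old/new string pair), which is why a model trained heavily on one style may not transfer cleanly to the other.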
Does this mean third-party coding harnesses like Pi should implement multiple edit tools just so they can use the one with the best performance for the underlying model the user has selected?
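One way a harness could act on that idea is sketched below, purely as an illustration. The family prefixes, the tool names, and the helper `pick_edit_tool` are all hypothetical; nothing here reflects how Pi works or plans to work.

```python
# Hypothetical mapping from model-name prefix to the edit tool style
# that family is assumed to handle best.
EDIT_TOOL_BY_FAMILY = {
    "claude": "search_replace",
    "gpt": "apply_patch",
}
DEFAULT_EDIT_TOOL = "search_replace"


def pick_edit_tool(model_name: str) -> str:
    """Choose which edit tool to expose for the user's selected model."""
    name = model_name.lower()
    for prefix, tool in EDIT_TOOL_BY_FAMILY.items():
        if name.startswith(prefix):
            return tool
    # Unknown families fall back to a single default tool.
    return DEFAULT_EDIT_TOOL
```

The cost of this approach is that the harness has to maintain and test one edit implementation per tool style, and keep the mapping current as new models ship.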
This is a link post by Simon Willison, posted on 4th July 2026.