
The engineering practices Claude Code and Codex use to improve AI agents

This article investigates how coding agents like Claude Code and Codex autonomously apply engineering practices such as failure mode analysis and ad-hoc evaluations to improve AI agents without specialized tooling. The experiment across five applications shows they can ship improvements by analyzing baseline traces, abstracting failure patterns, and creating prompt variants.

Source: Hacker News AI · Author: anndvision

April 24, 2026 · Andrew Jesson

Coding agents perform common engineering practices when asked to improve AI agents. Will they subsume specialized tools for failure-mode analysis, evaluations, and prompt optimization?

Give a coding agent a simulated agent application, a hundred baseline traces, and a metric to optimize, and it will ship an improvement. Both Claude Code and Codex do this. I was interested in seeing what they do while doing it.

agent

I’m checking the current TensorZero config and the baseline traces for yc_bench_tutorial_v0::yc_bench_act so I can identify failure patterns before editing variants.

↳searching for config and variant definitions

I prompted Claude Code and Codex to optimize five simulated agent applications, varying only which agent CLI was in the container. I was surprised, though maybe I should not have been, to find that both used practices the prompt never asked for, such as clustering and summarizing failure patterns. They also ran ad-hoc evaluations to refine and debug their proposed changes to the model or prompt. By performing these common engineering practices, they shipped improvements without calling any specialized tooling for failure-mode analysis, evaluations, or prompt optimization. These observations led me to reconsider the role and shape of such tooling as agent optimization becomes more automated. They are also why I started a project I call harness attribution; this post is its first probe.

Setup

For each of the following applications, I ran a baseline agent with an initial prompt and model (gpt-5.4-mini) on up to 100 different tasks. The resulting traces were scored with application-specific feedback.

| Application | Description | Metric |
|---|---|---|
| Software Engineering (Terminal-bench) | Long-horizon Linux agent solving coding tasks through execute_command / submit_solution | reward (verifier score, 0–1) |
| Business Management (YC Bench) | Multi-turn CEO agent driving a business simulation through a single run_command tool | tasks_succeeded (number of tasks delivered on or before deadline) |
| Data Extraction: NER (CoNLL++) | Single-shot: a sentence → four entity lists (person, organization, location, miscellaneous) | exact_match on entity sets |
| Data Extraction: NDA (Kleister) | Single-shot: OCR’d NDA text → effective_date, jurisdiction, party (list), term | f1 over fields |
| Science (Replication Bench) | Long-horizon agent reproducing a published astrophysics paper from a sandboxed dataset and a masked PDF via execute_command / submit_solution | reward (binary match against paper’s value) |

The optimization task was to propose improvements to the application by modifying the baseline agent prompt and/or choosing a different similar-price-point model. The optimizer agent (Claude Code on claude-sonnet-4-6 or Codex on gpt-5.4) was then dropped into a container with access to those traces, feedback, a copy of the baseline agent config, and a markdown skill file describing the task. It analyzed the traces and feedback, wrote one or more new model-prompt variants into the agent config, and exited. Validation of the proposed improvements revealed that both coding agents shipped new variants that matched or beat the baseline on every application: decisively on NER, Business Management, and Software Engineering; within one standard error on NDA and Science.

Held-out test scores by application. Error bars are mean ± SE across 5 seeds for the optimized variants; the baseline was run with a single seed for budget reasons, so its seed variance is unmeasured.
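As a minimal sketch of the "mean ± SE across 5 seeds" computation, the standard error of the mean is the sample standard deviation divided by the square root of the number of seeds. The scores below are made-up placeholders, not results from this post, and the `withinOneSE` check is only one plausible reading of "within one standard error".

```javascript
// Mean and standard error of the mean across seeds.
function meanAndSE(scores) {
  const n = scores.length;
  if (n < 2) throw new Error('need at least two seeds to estimate SE');
  const mean = scores.reduce((a, b) => a + b, 0) / n;
  // Sample variance, with n - 1 in the denominator.
  const variance = scores.reduce((a, x) => a + (x - mean) ** 2, 0) / (n - 1);
  return { mean, se: Math.sqrt(variance / n) };
}

// True when the baseline lies within one SE of the variant's mean.
// ASSUMPTION: this is an illustrative rule, not the post's stated test.
function withinOneSE(baseline, { mean, se }) {
  return Math.abs(mean - baseline) <= se;
}

// Hypothetical per-seed scores for one optimized variant.
const variantStats = meanAndSE([0.62, 0.58, 0.65, 0.60, 0.55]);
console.log(variantStats, withinOneSE(0.59, variantStats));
```

Because the baseline was run with a single seed, a comparison like this only accounts for the variant's seed variance.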

What engineering practices do the agents use?

Both coding agents use the same skill file. It includes the application name, metric, available models, data layout, some recipes for efficiency, and a four-bullet methodology that says survey → add variants → test → iterate.

The skill

Placeholders like {config_dir}, {function_name}, {baseline_metrics}, and {model_list} are substituted per-run by the harness.
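The harness code that performs this substitution is not shown in the post. A minimal sketch of one way to do it, assuming placeholders are always a single word in braces; `renderSkill` and the config path are hypothetical names used only for illustration:

```javascript
// Replace {name} placeholders with values. Unknown names are left intact
// so a missing key stays visible in the rendered skill instead of
// silently becoming an empty string.
function renderSkill(template, values) {
  return template.replace(/\{(\w+)\}/g, (match, key) =>
    Object.prototype.hasOwnProperty.call(values, key) ? String(values[key]) : match
  );
}

const skillTemplate =
  'Function: {function_name}\nMetric: {metric_name}\nConfig: {config_dir}/';

console.log(renderSkill(skillTemplate, {
  function_name: 'yc_bench_tutorial_v0::yc_bench_act',
  metric_name: 'tasks_succeeded',
  config_dir: '/workspace/config', // hypothetical path
}));
```

Matching only `\w+` between braces means JavaScript object literals inside the skill's own snippets, such as `{input: process.stdin}`, are not mistaken for placeholders.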

TensorZero Function Optimizer

You are optimizing a TensorZero function to improve its performance metric.

Environment

  • T0 config files: {config_dir}/ (only these and the baseline data below are relevant — don't explore elsewhere)
  • Gateway URL: {gateway_url}
  • Pre-dumped baseline data: {baseline_data_dir}/ (read-only; direct DB access is not available)
  • Restart after config edits: curl -sf -X POST http://eval:5111/restart-gateway
  • Isolated container. No Python or pip; node and curl are on $PATH; jq is not installed. Use node -e "..." for JSONL parsing (readline + JSON.parse + project to stdout) — prefer it over shell pipelines when you need fields per row.
  • Don't set temperature on any variant (some models reject non-default values). Keep an initial variant as a baseline reference.
  • Don't run evaluation episodes yourself — the harness does that after you exit.

Task

  • Function: {function_name}
  • Metric: {metric_name}. Check the metric's optimize field in tensorzero.toml for direction (boolean and float metrics may minimize or maximize).
  • Baseline performance: {baseline_metrics}

Available Models

{model_list}

Baseline data

  • {baseline_data_dir}/inferences.jsonl — one row per inference (what the model said per task).
  • {baseline_data_dir}/feedback.jsonl — one row per metric value.
  • {baseline_data_dir}/initial_config/ — read-only copy of the starting T0 config tree.

Files are often 20+ MB. Don't cat them whole. Start by head -3 on each to learn the row shape (field names and nesting vary by env), then project out the fields you need.

The projection pattern

grep first to narrow, then node -e to project:

grep $TARGET_ID {baseline_data_dir}/inferences.jsonl \
  | node -e "
    require('readline').createInterface({input: process.stdin}).on('line', l => {
      const r = JSON.parse(l);
      console.log(r.id, r.variant_name, JSON.stringify(r.output).slice(0,200));
    });"

cat inferences.jsonl | ... loads the whole file; grep-first keeps the pipeline cheap.

Cross-record one-liners

Adapt the failure predicate to your metric — boolean uses "value":0 / "value":1; float values depend on optimize direction.

# Inferences per episode
grep -o '"episode_id":"[^"]*"' {baseline_data_dir}/inferences.jsonl | sort | uniq -c | sort -rn | head

# Last inference of a failing episode
grep $FAIL_ID {baseline_data_dir}/inferences.jsonl | tail -1

# Which metrics are present
grep -o '"metric_name":"[^"]*"' {baseline_data_dir}/feedback.jsonl | sort | uniq -c

# target_ids of failures (boolean example — adapt the predicate for float metrics)
grep '"metric_name":"{metric_name}"' {baseline_data_dir}/feedback.jsonl \
  | node -e "
    require('readline').createInterface({input: process.stdin}).on('line', l => {
      const r = JSON.parse(l);
      if (r.value === 0 || r.value === false) console.log(r.target_id);
    });" > /tmp/failed.txt
head -5 /tmp/failed.txt | while read id; do grep "$id" {baseline_data_dir}/inferences.jsonl | head -1; done

Templates, schemas, and the required content shape

TensorZero has two co-existing config styles. Check which one the function uses in tensorzero.toml:

Legacy (per-role):

[functions."my_fn"]
user_schema = "functions/my_fn/user_schema.json" # and system_schema, assistant_schema

[functions."my_fn".variants.initial]
user_template = "functions/my_fn/initial/user_template.minijinja"

New (named):

[functions."my_fn"]
schemas.user_query.path = "functions/my_fn/user_query_schema.json"

[functions."my_fn".variants.initial]
templates.user_query.path = "functions/my_fn/initial/user_query.minijinja"

Canonical content block for a templated message (both styles):

"content": [{
  "type": "template",
  "name": "",
  "arguments": { /* object matching the schema */ }
}]

For legacy, "name" is the role ("user" / "system" / "assistant"). For new, it's the key under schemas / templates (user_query in the example above).

For a role with no schema: "content": "Hello" or [{"type":"text","text":"Hello"}].

Methodology

The core loop is: survey the baseline → add variants → test one → iterate. The decisions worth getting right:

  • Metric direction defines "failure." Don't assume value:0 is bad; read the metric's optimize field.
  • Judge manual variant tests by the curl /inference output itself — right tool call, right JSON, right content.
  • Multi-turn agentic envs (customer service, business management, coding) need real conversational state to be representative. Pick a real episode from inferences.jsonl, copy its first 2–3 messages into your curl body, check how the variant continues. A turn-0 probe alone tells you little.
  • When done, leave the best config in place with the experimentation section below, and exit.

Routing: Experimentation Config

After creating new variants, add an experimentation section — otherwise the gateway round-robins and wastes test episodes on bad variants. Keep candidates to your best ~3–4, including initial as a baseline.

[functions."{function_name}".experimentation]
type = "track_and_stop"
metric = "{metric_name}"
candidate_variants = ["initial", "your_new_variant_1", "your_new_variant_2"]
fallback_variants = []
min_samples_per_variant = 5
delta = 0.1
epsilon = 0.0
update_period_s = 5
min_prob = 0.0
max_samples_per_variant = 10000

The skill stays silent on how to abstract failure patterns, or how to validate an improvement beyond probing it. Both agents fill that gap. Each reads the baseline traces and feedback, abstracts a handful of failure modes from the raw rows, writes two to four prompt variants, runs a few inferences, analyzes the new outputs, and exits. What they do in those gaps, and what each agent reaches for differently, is below.
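For the "runs a few inferences" step, the skill's advice is to replay the first few messages of a real episode against a new variant. A minimal sketch of building such a request body follows. It assumes each inferences.jsonl row exposes `input.messages` (and optionally `input.system`), which the skill warns may vary by environment; `buildProbeBody` and the sample row are hypothetical.

```javascript
// Build a request body for POST {gateway_url}/inference that replays the
// first few messages of a recorded episode against a chosen variant.
// ASSUMPTION: row.input.messages holds the recorded conversation.
function buildProbeBody(row, functionName, variantName, nMessages = 3) {
  const messages = (row.input && row.input.messages) || [];
  if (messages.length === 0) throw new Error('row has no input.messages');
  const input = { messages: messages.slice(0, nMessages) };
  if (row.input.system !== undefined) input.system = row.input.system;
  return { function_name: functionName, variant_name: variantName, input };
}

// Hypothetical row shaped like the assumption above.
const sampleRow = {
  input: {
    messages: [
      { role: 'user', content: 'Start the simulation.' },
      { role: 'assistant', content: 'Checking company status.' },
      { role: 'user', content: 'Status: 3 tasks open.' },
      { role: 'assistant', content: 'Assigning staff.' },
    ],
  },
};

console.log(JSON.stringify(buildProbeBody(sampleRow, 'my_fn', 'your_new_variant_1'), null, 2));
```

Cutting the history at an odd message count keeps a user turn last here, so the variant under test produces the next assistant turn.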

They perform failure mode analysis

Failure mode analysis here is going from a dataset of inferences and feedback to “the model over-extracts miscellaneous because it treats it as a catch-all”. The skill leaves both prerequisites up to the agent: projecting the failed rows out of JSONL, then abstracting them into a named pattern.

On the projection step, the data is split across two files: feedback.jsonl says which target_ids failed, inferences.jsonl says what the model actually said for each one. The original skill described the join in prose (pull failing target_ids, then look up the corresponding inference rows) but did not say how. Both agents converged on the same recipe: grep the failing target_ids out of feedback, then grep each one back into inferences and tail to the last row. I folded that recipe back into the skill, alongside a few related cross-record one-liners (inferences-per-episode, which-metrics-are-present, last-inference-of-a-failing-episode), because re-discovering them cost three to six turns at the start of every session.
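The same join can be written as one Node script instead of a grep per id. The sketch below is an illustration, not what the agents ran. It assumes feedback rows carry `metric_name`, `target_id`, and `value`, that inference rows carry `id` and `episode_id`, and that the metric's optimize direction is "max" or "min"; the 0.5 threshold for float metrics is a hypothetical default.

```javascript
// Decide whether a feedback value counts as a failure, given the
// metric's optimize direction. Booleans are coerced to 0/1.
function isFailure(value, optimize, threshold = 0.5) {
  const v = typeof value === 'boolean' ? Number(value) : value;
  return optimize === 'max' ? v < threshold : v > threshold;
}

// Return a Map from failing target_id to the last matching inference row.
// Like the grep recipe, a target_id may match an episode_id or an inference id.
function lastInferenceOfFailures(feedbackLines, inferenceLines, metricName, optimize) {
  const failed = new Set();
  for (const line of feedbackLines) {
    const r = JSON.parse(line);
    if (r.metric_name === metricName && isFailure(r.value, optimize)) failed.add(r.target_id);
  }
  const last = new Map(); // later rows overwrite earlier ones, like tail -1
  for (const line of inferenceLines) {
    const r = JSON.parse(line);
    const key = failed.has(r.episode_id) ? r.episode_id : failed.has(r.id) ? r.id : null;
    if (key !== null) last.set(key, r);
  }
  return last;
}

// Tiny in-memory example with made-up rows.
const demoFeedback = [
  '{"metric_name":"exact_match","target_id":"ep-1","value":false}',
  '{"metric_name":"exact_match","target_id":"ep-2","value":true}',
];
const demoInferences = [
  '{"id":"inf-1","episode_id":"ep-1"}',
  '{"id":"inf-2","episode_id":"ep-2"}',
  '{"id":"inf-3","episode_id":"ep-1"}',
];
console.log(lastInferenceOfFailures(demoFeedback, demoInferences, 'exact_match', 'max'));
```

Each file is read once, which matters when the files are 20+ MB; for real data the two loops would be fed line by line from readline rather than from arrays.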

With the failed rows projected, both agents abstract failure patterns across multiple traces, often surfacing bugs not mentioned in the skill or the function’s documentation. The excerpt below shows the moment one agent enumerates the failure modes it has just abstracted from the baseline traces.

agent

Now I have a clear picture of the failures. The main issues are:

  1. Over-extraction as MISC: dates, numbers, phrases, common nouns are incorrectly labeled
  2. Entity boundary confusion: sub-parts of named entities extracted separately (e.g., "Baltimore" from "Baltimore Orioles")
  3. Wrong category: sports teams like "ENGLAND" labeled as LOC instead of ORG
  4. Generic names as ORG: "local police", "amnesty committee" are not named orgs

Let me build improved variants. First, I'll create the improved system prompt and directory structure.
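A claim such as "over-extraction as MISC" can be cross-checked with a simple count: if failing rows hold noticeably more miscellaneous entities than passing rows, the pattern is at least consistent with the data. The sketch below is not what the agents ran. It assumes each record has been reduced to `{ failed, output }` with the four entity lists from the NER task, and the sample records are made up.

```javascript
// Compare the mean number of entities per category in failing vs passing rows.
// ASSUMPTION: record shape is { failed: boolean, output: { person: [], ... } }.
const CATEGORIES = ['person', 'organization', 'location', 'miscellaneous'];

function meanEntitiesByOutcome(records) {
  const sums = { failed: {}, passed: {} };
  const counts = { failed: 0, passed: 0 };
  for (const c of CATEGORIES) {
    sums.failed[c] = 0;
    sums.passed[c] = 0;
  }
  for (const rec of records) {
    const bucket = rec.failed ? 'failed' : 'passed';
    counts[bucket] += 1;
    for (const c of CATEGORIES) sums[bucket][c] += (rec.output[c] || []).length;
  }
  const result = {};
  for (const c of CATEGORIES) {
    result[c] = {
      failed: counts.failed ? sums.failed[c] / counts.failed : 0,
      passed: counts.passed ? sums.passed[c] / counts.passed : 0,
    };
  }
  return result;
}

// Made-up records for illustration only.
const demoRecords = [
  { failed: true, output: { person: [], organization: ['local police'], location: [], miscellaneous: ['1996', 'second round'] } },
  { failed: false, output: { person: ['Smith'], organization: [], location: ['Baltimore'], miscellaneous: [] } },
];
console.log(meanEntitiesByOutcome(demoRecords));
```

A gap between the two means does not prove over-extraction, since gold labels are not consulted here; it only points to which categories deserve a closer read of the raw rows.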

Additionally, the coding agents perform bug discovery — not “the model gets this kind of thing wrong” but “the simulato

[truncated for AI cost control]