2026-05-01 00:00 UTC

Updated: 2026-06-27 00:25 UTC

Reinforced Agent: Inference-Time Feedback for Tool-Calling Agents

Apple researchers propose an inference-time evaluation method in which a reviewer agent assesses provisional tool calls before execution, enabling real-time error correction. The method achieves +5.5% on irrelevance detection on BFCL and +7.1% on multi-turn tasks on τ2-Bench. The authors also introduce Helpfulness-Harmfulness metrics to quantify how often reviewer feedback corrects errors versus degrades correct responses.

Source: Apple Machine Learning Research

Research areas: Methods and Algorithms; Tools, Platforms, Frameworks

Workshop at ACL

Content type: paper

Published May 2026

Authors: Anh Ta, Junjie Zhu, Shahin Shayandeh

This paper was accepted at the Fifth Workshop on Natural Language Generation, Evaluation, and Metrics at ACL 2026.

Tool-calling agents are evaluated on tool selection, parameter accuracy, and scope recognition, yet LLM trajectory assessments remain inherently post-hoc. Disconnected from the active execution loop, such assessments identify errors that are usually addressed through prompt-tuning or retraining, and fundamentally cannot course-correct the agent in real time. To close this gap, we move evaluation into the execution loop at inference time: a specialized reviewer agent evaluates provisional tool calls prior to execution, shifting the paradigm from post-hoc recovery to proactive evaluation and error mitigation.
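The review-before-execute idea described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `ToolCall`, `Review`, `run_with_review`, the `max_revisions` parameter, and the toy agent, reviewer, and executor are all hypothetical stand-ins chosen for illustration. In a real system, the agent and reviewer would be LLM calls.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """A provisional tool call proposed by the base agent (hypothetical type)."""
    name: str
    arguments: dict = field(default_factory=dict)


@dataclass
class Review:
    """The reviewer's verdict on a provisional call (hypothetical type)."""
    approved: bool
    feedback: str = ""


def run_with_review(request, propose, review, execute, max_revisions=1):
    """Review a provisional tool call before executing it.

    propose(request, feedback) -> ToolCall   (base agent)
    review(request, call)      -> Review     (reviewer agent)
    execute(call)              -> result     (actual tool execution)
    """
    call = propose(request, None)
    for _ in range(max_revisions):
        verdict = review(request, call)
        if verdict.approved:
            break
        # Feed the reviewer's feedback back to the base agent for a revision.
        call = propose(request, verdict.feedback)
    # Assumption of this sketch: once the revision budget is spent,
    # the latest call is executed without a further review.
    return execute(call)


# Toy stand-ins for the LLM-backed agent, reviewer, and tool runtime.
def toy_agent(request, feedback):
    unit = "celsius" if feedback else "fahrenheit"
    return ToolCall("get_weather", {"city": "Paris", "unit": unit})


def toy_reviewer(request, call):
    if "celsius" in request.lower() and call.arguments.get("unit") != "celsius":
        return Review(False, "The user asked for celsius.")
    return Review(True)


executed = []


def toy_execute(call):
    executed.append(call)
    return call


result = run_with_review(
    "Weather in Paris in celsius?", toy_agent, toy_reviewer, toy_execute
)
print(result)
```

In this toy run, the first proposal uses the wrong unit, the reviewer rejects it, and only the revised call reaches `toy_execute`. The erroneous call is never executed, which is the point of moving evaluation ahead of execution.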

In practice, this architecture establishes a clear separation of concerns between the primary execution agent and a secondary review agent. As with any multi-agent system, the reviewer can introduce new errors while correcting others, yet no prior work to our knowledge has systematically measured this tradeoff. To quantify this tradeoff, we introduce Helpfulness-Harmfulness metrics: helpfulness measures the percentage of base agent errors that feedback corrects; harmfulness measures the percentage of correct responses that feedback degrades. These metrics directly inform reviewer design by revealing whether a given model or prompt provides net positive value.
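As a rough illustration of the two definitions above, the following Python sketch computes helpfulness and harmfulness from paired per-example correctness labels (base agent alone versus base agent with reviewer feedback). The function name, the dictionary keys, and the sample labels are assumptions made for this example, not taken from the paper. The net accuracy change is included because it follows directly from the definitions: corrected errors minus degraded correct responses, divided by the total number of examples.

```python
from __future__ import annotations


def helpfulness_harmfulness(
    base_correct: list[bool], reviewed_correct: list[bool]
) -> dict[str, float]:
    """Compute helpfulness, harmfulness, and net accuracy change.

    base_correct[i]     -- was the base agent correct on example i?
    reviewed_correct[i] -- was the final answer correct after reviewer feedback?
    """
    if len(base_correct) != len(reviewed_correct):
        raise ValueError("both label lists must have the same length")
    total = len(base_correct)
    if total == 0:
        raise ValueError("at least one example is required")

    pairs = list(zip(base_correct, reviewed_correct))
    base_errors = sum(1 for base, _ in pairs if not base)
    base_right = total - base_errors
    corrected = sum(1 for base, rev in pairs if not base and rev)
    degraded = sum(1 for base, rev in pairs if base and not rev)

    return {
        # Share of base-agent errors that feedback corrects.
        "helpfulness": corrected / base_errors if base_errors else 0.0,
        # Share of base-agent correct responses that feedback degrades.
        "harmfulness": degraded / base_right if base_right else 0.0,
        # Change in overall accuracy attributable to the reviewer.
        "net_accuracy_change": (corrected - degraded) / total,
    }


# Made-up labels for illustration only: 6 correct and 4 wrong base responses.
base = [True, True, True, True, True, True, False, False, False, False]
# After review: one correct response is degraded, two errors are corrected.
reviewed = [True, True, True, True, True, False, True, True, False, False]

metrics = helpfulness_harmfulness(base, reviewed)
print(metrics)
```

With these made-up labels, helpfulness is 2/4 = 0.5, harmfulness is 1/6, and the net accuracy change is (2 - 1)/10 = 0.1. A reviewer is net positive only when it corrects more examples than it degrades, so a high helpfulness rate can still be outweighed when the base agent is already correct on most examples.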

We evaluate our approach on BFCL (single-turn) and τ2-Bench (multi-turn stateful scenarios), achieving +5.5% on irrelevance detection and +7.1% on multi-turn tasks. Our metrics reveal that reviewer model choice is critical: the reasoning model o3-mini achieves a 3:1 benefit-to-risk ratio versus 2.1:1 for GPT-4o. Automated prompt optimization via GEPA provides an additional +1.5–2.8%. Together, these results demonstrate a core advantage of separating execution and review: the reviewer can be systematically improved through model selection and prompt optimization, without retraining the base agent.