Reinforced Agent: Inference-Time Feedback for Tool-Calling Agents

Apple researchers propose an inference-time evaluation method in which a reviewer agent assesses provisional tool calls before execution, enabling real-time error correction. Evaluated on BFCL and τ2-Bench, the method achieves +5.5% on BFCL irrelevance detection and +7.1% on τ2-Bench multi-turn tasks. The authors also introduce Helpfulness-Harmfulness metrics to quantify the tradeoff between the errors a reviewer corrects and the errors it introduces.

Research areas: Methods and Algorithms; Tools, Platforms, Frameworks
Venue: Workshop at ACL

Content type: paper
Published: May 2026

Authors: Anh Ta, Junjie Zhu, Shahin Shayandeh

This paper was accepted at the Fifth Workshop on Natural Language Generation, Evaluation, and Metrics at ACL 2026.

Tool-calling agents are evaluated on tool selection, parameter accuracy, and scope recognition, yet LLM trajectory assessments remain inherently post-hoc. Disconnected from the active execution loop, such assessments identify errors that are usually addressed through prompt-tuning or retraining, and fundamentally cannot course-correct the agent in real time. To close this gap, we move evaluation into the execution loop at inference time: a specialized reviewer agent evaluates provisional tool calls prior to execution, shifting the paradigm from post-hoc recovery to proactive evaluation and error mitigation.
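The review-before-execute loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `ToolCall`, `Review`, `run_with_review`, the `max_revisions` limit, and the choice to execute the last proposal when the reviewer never approves are all assumptions made for the example, and the toy functions stand in for real LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToolCall:
    """A provisional tool call proposed by the base agent (hypothetical type)."""
    name: str
    arguments: dict


@dataclass
class Review:
    """The reviewer's verdict on a provisional call (hypothetical type)."""
    approved: bool
    feedback: str = ""


def run_with_review(
    propose: Callable[[str, Optional[str]], ToolCall],
    review: Callable[[str, ToolCall], Review],
    execute: Callable[[ToolCall], dict],
    user_request: str,
    max_revisions: int = 2,
) -> dict:
    """Review each provisional call before it runs, feeding critiques back."""
    feedback: Optional[str] = None
    call = propose(user_request, feedback)
    for _ in range(max_revisions):
        verdict = review(user_request, call)
        if verdict.approved:
            break
        # Assumption: the base agent re-proposes with the reviewer's feedback.
        feedback = verdict.feedback
        call = propose(user_request, feedback)
    # Assumption: if revisions run out, the latest proposal is executed anyway.
    return execute(call)


# Toy stand-ins for the base agent, the reviewer, and the tool runtime.
def toy_propose(request: str, feedback: Optional[str]) -> ToolCall:
    unit = "celsius" if feedback else "kelvin"  # first attempt is wrong
    return ToolCall("get_weather", {"city": "Paris", "unit": unit})


def toy_review(request: str, call: ToolCall) -> Review:
    if call.arguments.get("unit") != "celsius":
        return Review(False, "Use unit='celsius'.")
    return Review(True)


def toy_execute(call: ToolCall) -> dict:
    return {"tool": call.name, **call.arguments}


result = run_with_review(toy_propose, toy_review, toy_execute, "Weather in Paris?")
print(result)
```

In this toy run the reviewer rejects the first proposal, the base agent revises it, and only the revised call reaches `toy_execute`. Nothing is executed before a review has happened, which is the property the abstract contrasts with post-hoc trajectory assessment.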

In practice, this architecture establishes a clear separation of concerns between the primary execution agent and a secondary review agent. As with any multi-agent system, the reviewer can introduce new errors while correcting others, yet to our knowledge no prior work has systematically measured this tradeoff. To quantify it, we introduce Helpfulness-Harmfulness metrics: helpfulness measures the percentage of base agent errors that feedback corrects; harmfulness measures the percentage of correct responses that feedback degrades. These metrics directly inform reviewer design by revealing whether a given model or prompt provides net positive value.
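As a worked illustration of the two definitions, the sketch below computes both percentages from per-example correctness flags. The function name `helpfulness_harmfulness`, the boolean-list input format, and the convention of returning 0.0 when a denominator is empty are assumptions for this example; the sample data is invented and is not taken from the paper.

```python
from typing import Sequence, Tuple


def helpfulness_harmfulness(
    base_correct: Sequence[bool],
    reviewed_correct: Sequence[bool],
) -> Tuple[float, float]:
    """Return (helpfulness %, harmfulness %) for paired per-example outcomes.

    base_correct[i]     -- was the base agent right on example i without review?
    reviewed_correct[i] -- was the final answer right on example i with review?
    """
    if len(base_correct) != len(reviewed_correct):
        raise ValueError("outcome lists must have the same length")

    pairs = list(zip(base_correct, reviewed_correct))
    base_errors = sum(1 for b, _ in pairs if not b)
    base_successes = len(pairs) - base_errors

    # Helpfulness: share of base-agent errors that feedback corrects.
    corrected = sum(1 for b, r in pairs if not b and r)
    # Harmfulness: share of originally correct responses that feedback degrades.
    degraded = sum(1 for b, r in pairs if b and not r)

    # Assumption: report 0.0 when there is nothing to correct or degrade.
    helpfulness = 100 * corrected / base_errors if base_errors else 0.0
    harmfulness = 100 * degraded / base_successes if base_successes else 0.0
    return helpfulness, harmfulness


# Invented data: 5 base successes and 4 base errors.
base = [True] * 5 + [False] * 4
# With review: one success is degraded, two of the four errors are corrected.
reviewed = [True, True, True, True, False] + [True, True, False, False]

helpfulness, harmfulness = helpfulness_harmfulness(base, reviewed)
print(f"helpfulness={helpfulness:.1f}%  harmfulness={harmfulness:.1f}%")
# prints "helpfulness=50.0%  harmfulness=20.0%"
```

The two percentages use different denominators (base errors versus base successes), so a reviewer's net effect on overall accuracy also depends on how often the base agent is right to begin with.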

We evaluate our approach on BFCL (single-turn) and τ2-Bench (multi-turn stateful scenarios), achieving +5.5% on irrelevance detection and +7.1% on multi-turn tasks. Our metrics reveal that reviewer model choice is critical: the reasoning model o3-mini achieves a 3:1 benefit-to-risk ratio versus 2.1:1 for GPT-4o. Automated prompt optimization via GEPA provides an additional +1.5–2.8%. Together, these results demonstrate a core advantage of separating execution and review: the reviewer can be systematically improved through model selection and prompt optimization, without retraining the base agent.

Copyright © 2026 Apple Inc. All rights reserved.