2026-06-02 04:00 UTC (updated 2026-06-30 13:03 UTC)

On Effectiveness and Efficiency of Agentic Tool-calling and RL Training

This paper systematically analyzes tool-calling in LLM agents along two axes: effectiveness (measurement) and efficiency (learning). It reveals that evaluation results are highly sensitive to minor implementation choices like random seed and system prompt, making leaderboard rankings unreliable, especially in multi-turn settings. On efficiency, it identifies computational waste in standard RL training and proposes two techniques that achieve substantial speedup without performance loss.

Source: arXiv Machine Learning
Authors: Tong Liu, Cheng Qian, Matej Cief, Yuan He, Daniele Dan, Nikolaos Aletras, Gabriella Kazai

Article intelligence

Audience: Engineers (Advanced)

Key points

Tool-calling evaluation is highly sensitive to implementation choices such as random seed and system prompt, leading to unreliable leaderboard rankings in multi-turn settings.
Standard RL training for tool-calling suffers from computational waste: many prompts yield no learning signal, and policy updates are costly.
Two proposed acceleration techniques significantly speed up training without degrading performance.
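
The first key point can be made concrete with a small check for whether a model ranking survives changes in seed and system prompt. This is a minimal sketch, not the paper's evaluation protocol: the function names, the `(seed, prompt_id)` configuration key, and the toy scores in the usage note are all hypothetical and invented for illustration.

```python
import statistics


def summarize_runs(scores_by_model):
    """Per-model mean, spread, and range over all evaluated configurations.

    scores_by_model maps model name -> {(seed, prompt_id): score}.
    This layout is an assumption made for the sketch.
    """
    summary = {}
    for model, runs in scores_by_model.items():
        values = list(runs.values())
        summary[model] = {
            "mean": statistics.mean(values),
            # Sample standard deviation needs at least two runs.
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
            "min": min(values),
            "max": max(values),
        }
    return summary


def ranking_is_stable(scores_by_model):
    """Return True if every shared configuration yields the same model order."""
    # Only compare configurations that every model was evaluated on.
    shared = set.intersection(*(set(runs) for runs in scores_by_model.values()))
    orderings = set()
    for config in shared:
        order = tuple(
            sorted(
                scores_by_model,
                key=lambda model: scores_by_model[model][config],
                reverse=True,
            )
        )
        orderings.add(order)
    return len(orderings) == 1
```

For example, with made-up scores `{"model_a": {(0, "p0"): 0.62, (1, "p0"): 0.55}, "model_b": {(0, "p0"): 0.58, (1, "p0"): 0.60}}`, the two seeds order the models differently, so `ranking_is_stable` returns `False`. Reporting the spread from `summarize_runs` alongside the mean is one way to show how much of a leaderboard gap could be explained by configuration alone.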

Why it matters

Leaderboard rankings for tool-calling agents may not be trustworthy unless evaluation pipelines are standardized, and standard RL training spends compute on rollouts that produce no learning signal.

Technical impact

May affect model selection, RL training cost, and the design of evaluation benchmarks.

This panel is AI-generated and reviewed for accuracy.

[Submitted on 28 May 2026]

Title: On Effectiveness and Efficiency of Agentic Tool-calling and RL Training

Abstract: Tool-calling is a central component of modern large language model (LLM) agents, equipping them with skills beyond their parametric knowledge. This paper studies tool-calling along two complementary axes: effectiveness, i.e., how this capability is measured, and efficiency, i.e., how it is learned. On effectiveness, we systematically analyze tool-calling evaluation pipelines and show that results can be highly sensitive to seemingly minor, often undocumented implementation choices, including the random seed, system prompt, multi-turn template construction, and how prior interaction/reasoning history is carried forward. These choices can lead to substantial differences in reported performance, especially in multi-turn settings, where leaderboard rankings are unreliable without rigorous standardization. On efficiency, we examine standard reinforcement learning (RL) for tool-calling and identify two sources of computational waste: (i) during rollouts, many prompts produce no learning signal, and (ii) during policy updates, optimization incurs high computational cost. Guided by these findings, we introduce two techniques that accelerate RL-based tool-calling training, achieving substantial wall-clock speedup without degrading performance.
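
The abstract does not name the RL algorithm or describe the two acceleration techniques, so the following is only an illustration of how a prompt can "produce no learning signal", not the authors' method. It assumes a group-relative advantage in the style of GRPO, where each rollout's reward is normalized against the other rollouts for the same prompt. Under that assumption, a prompt whose rollouts all receive the same reward gives every rollout an advantage of zero and contributes nothing to the policy gradient. All function names are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (reward - group mean) / (group std + eps).

    Assumes a GRPO-style normalization; the paper's setup may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def has_learning_signal(rewards):
    """A group carries signal only if its rewards are not all identical."""
    return max(rewards) != min(rewards)


def split_by_signal(rollout_rewards):
    """Partition prompts into those with and without a usable signal.

    rollout_rewards maps prompt id -> list of per-rollout rewards.
    """
    useful, wasted = {}, {}
    for prompt_id, rewards in rollout_rewards.items():
        target = useful if has_learning_signal(rewards) else wasted
        target[prompt_id] = rewards
    return useful, wasted
```

With toy rewards such as `{"p1": [1, 1, 1, 1], "p2": [0, 1, 0, 1], "p3": [0, 0, 0, 0]}`, only `p2` lands in `useful`: the always-solved and never-solved prompts yield all-zero advantages, so the rollout compute spent on them does not change the policy under this assumption.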

Comments: ICML 2026

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Cite as: arXiv:2606.00135 [cs.LG]

(or arXiv:2606.00135v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2606.00135

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Tong Liu
[v1] Thu, 28 May 2026 22:21:47 UTC (672 KB)
