
Don't Just Add Tools to Agents: They Can't Choose Wisely! Fudan & Tongyi Propose a New CUA Training Paradigm

Fudan University and Tongyi Lab jointly introduce ToolCUA, a Computer Use Agent designed to master hybrid GUI-Tool action spaces. It achieves 46.85% accuracy on OSWorld-MCP, surpassing Claude-4-Sonnet, through a two-stage training pipeline that teaches agents when to use GUI actions and when to call tools.

Source: 量子位 (QbitAI) | Author: Jay

In the rapidly evolving field of AI agents, a common assumption has been that equipping agents with both graphical user interface (GUI) operations and tool calling capabilities would naturally improve performance. However, researchers from Fudan University and Tongyi Lab have uncovered a counterintuitive phenomenon: when agents are given access to both modalities simultaneously, performance can actually decline. Their new work, ToolCUA, presents a systematic solution to this hybrid action space challenge.

Traditional Computer Use Agents (CUAs) rely on atomic GUI actions like clicking, typing, and scrolling. While these actions generalize well, they are often slow and error-prone. Tool calls, on the other hand, are efficient and precise, but only work when a suitable tool exists for the current context. Naively combining the two creates "path confusion": agents struggle to decide whether to click a button or call an API, and end up underusing or overusing tools.

ToolCUA addresses this through a two-stage training paradigm. First, the team developed a data synthesis pipeline that converts existing GUI-only trajectories into interleaved GUI-Tool trajectories. By analyzing task objectives and action sequences, they generate a grounded tool library and produce multiple trajectory variants that mix GUI and tool steps. This initial stage, combined with tool-bootstrapped reinforcement fine-tuning (RFT), gives the model a foundational ability to use tools and recognize switching points.
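The conversion idea in the first stage can be illustrated with a minimal sketch. It is not the team's pipeline: the `Step` and `ToolSpec` types, the `interleave` function, and the GUI action names are all hypothetical, and the sketch simply assumes that each tool-library entry records which contiguous GUI steps it can stand in for.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    modality: str                      # "gui" or "tool"
    name: str                          # GUI action name or tool name
    args: dict = field(default_factory=dict)


@dataclass
class ToolSpec:
    """Hypothetical tool-library entry: a tool plus the GUI span it replaces."""
    name: str
    replaces: list[str]                # contiguous GUI action names covered by the tool
    args: dict = field(default_factory=dict)


def interleave(gui_steps: list[Step], tool: ToolSpec) -> list[Step]:
    """Return a variant in which the first matching GUI span becomes one tool call.

    If the span never occurs (or the spec is empty), a copy of the
    original trajectory is returned unchanged.
    """
    n = len(tool.replaces)
    if n == 0:
        return list(gui_steps)
    names = [s.name for s in gui_steps]
    for i in range(len(names) - n + 1):
        if names[i:i + n] == tool.replaces:
            call = Step("tool", tool.name, dict(tool.args))
            return gui_steps[:i] + [call] + gui_steps[i + n:]
    return list(gui_steps)


# Illustrative GUI-only trajectory (action names are made up for this example).
gui_only = [
    Step("gui", "open_insert_menu"),
    Step("gui", "click_pivot_table"),
    Step("gui", "confirm_range"),
    Step("gui", "click_ok"),
]
pivot_tool = ToolSpec(
    "create_pivot_table",
    ["open_insert_menu", "click_pivot_table", "confirm_range"],
)
mixed = interleave(gui_only, pivot_tool)
```

Applying different tool specs to the same GUI-only trajectory would yield several mixed variants, which is the general shape of the "multiple trajectory variants" described above.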

The second stage employs online agentic reinforcement learning in a real GUI-Tool environment. The key innovation is a Tool-Efficient Path Reward that combines standard task success rewards with two additional components: a tool appropriateness reward (R_tool) that encourages tool use only when beneficial, and a path efficiency reward (R_length) that rewards shorter successful trajectories. This reward structure prevents the model from overusing or underusing tools.
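The article does not give the exact formula, so the following is only a sketch of how such a reward could be composed. The weights `w_tool` and `w_length`, the `max_steps` budget, the boolean `tool_beneficial` label, and the choice to grant the extra terms only on successful episodes are all assumptions made for illustration.

```python
def tool_efficient_path_reward(
    success: bool,
    used_tool: bool,
    tool_beneficial: bool,
    num_steps: int,
    max_steps: int = 30,      # assumed step budget, not from the article
    w_tool: float = 0.2,      # assumed weight for R_tool
    w_length: float = 0.2,    # assumed weight for R_length
) -> float:
    """Combine task success with tool-appropriateness and path-length terms."""
    if max_steps <= 0:
        raise ValueError("max_steps must be positive")

    # Standard task reward: 1 for a solved task, 0 otherwise.
    r_task = 1.0 if success else 0.0

    # R_tool (assumed form): reward the agent when its decision to use,
    # or not use, a tool matches whether a tool was actually beneficial.
    r_tool = 1.0 if (success and used_tool == tool_beneficial) else 0.0

    # R_length (assumed form): shorter successful trajectories score higher.
    r_length = max(0.0, 1.0 - num_steps / max_steps) if success else 0.0

    return r_task + w_tool * r_tool + w_length * r_length
```

Under these assumptions a failed episode earns nothing, so the agent cannot collect reward by calling tools or finishing quickly without solving the task. Among successful episodes, the shorter one with the appropriate tool decision scores highest.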

The results are striking. ToolCUA-8B achieves 46.85% accuracy on the OSWorld-MCP benchmark, a relative improvement of about 66% over the Qwen3-VL-8B baseline. It surpasses stronger models including Claude-4-Sonnet (43.54%) and Gemini-3.1-Pro (41.14%), and approaches Claude-4.5-Sonnet (48.35%). More importantly, it completes tasks in an average of only 14.93 steps, the lowest among all tested models, which demonstrates both effectiveness and efficiency.
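To put these figures in perspective, the short calculation below works out the gaps in percentage points and the baseline score implied by "about 66%". The article does not state the baseline accuracy, so the implied value is only an estimate that follows from the rounded relative gain. The function names are illustrative.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


def implied_baseline(new: float, rel_gain: float) -> float:
    """Baseline score implied by a new score and a relative gain."""
    return new / (1.0 + rel_gain)


toolcua = 46.85
reported = {
    "Claude-4-Sonnet": 43.54,
    "Gemini-3.1-Pro": 41.14,
    "Claude-4.5-Sonnet": 48.35,
}

# Gap in percentage points; negative means ToolCUA-8B trails that model.
gaps = {name: round(toolcua - score, 2) for name, score in reported.items()}

# Estimate only: derived from the rounded "about 66%" figure.
baseline_estimate = implied_baseline(toolcua, 0.66)
```

Under that assumption the baseline lands at roughly 28%. ToolCUA-8B leads Claude-4-Sonnet by 3.31 points and Gemini-3.1-Pro by 5.71 points, and trails Claude-4.5-Sonnet by 1.5 points.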

Ablation studies confirm that each component is critical. Without the offline interleaved data, online RL alone fails to learn reliable tool use, with tool invocation rates remaining low. Removing the path reward leads to unstable accuracy and no improvement in trajectory length. The hybrid GUI-Tool training consistently outperforms pure GUI training, indicating that the hybrid action space itself provides a more informative training signal.

ToolCUA also shows strong cross-platform generalization. Despite training only on Linux desktop environments, it achieves 33.8% accuracy on unseen Windows desktop apps in WindowsAgentArena, surpassing even larger models like Qwen3-VL-235B. This suggests the training paradigm imparts transferable hybrid action orchestration skills.

Real-world examples illustrate ToolCUA's path selection. In a LibreOffice Calc task, ToolCUA calls the create_pivot_table tool directly, bypassing a lengthy sequence of menu navigations. In a VS Code task, it first calls the add_folder tool to add directories, then switches to the GUI to click 'Yes' on a trust dialog, showing how the two modalities complement each other.
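The VS Code example can be written down as a tiny trajectory to make the switching point explicit. This is an illustrative sketch only: the `(modality, name)` tuple format and the `switch_points` helper are hypothetical, and the two add_folder calls are an assumption, since the article only says that directories are added.

```python
def switch_points(trajectory: list[tuple[str, str]]) -> list[int]:
    """Return the indices where the modality differs from the previous step."""
    return [
        i
        for i in range(1, len(trajectory))
        if trajectory[i][0] != trajectory[i - 1][0]
    ]


# Assumed reconstruction of the VS Code task described above.
vscode_task = [
    ("tool", "add_folder"),
    ("tool", "add_folder"),
    ("gui", "click 'Yes' on trust dialog"),
]

switches = switch_points(vscode_task)  # the agent changes modality once, at index 2
```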

ToolCUA represents a significant step toward practical computer use agents that can intelligently combine GUI and tool operations. The team has open-sourced the code and model weights, inviting further research into hybrid action training for next-generation CUAs.