
AI Agent Tool Design: What Works and What Doesn't

This article analyzes key patterns in AI agent tool design, arguing that most agent failures stem from tool design flaws rather than model capability. It covers effective practices like single-responsibility tools, tight schemas, scope-defining descriptions, structured error returns, and idempotency, while warning against common pitfalls such as wrapping unfiltered APIs, loading all tools into context, and silent partial success.

Source: Hacker News AI. Author: eigenBasis

In this article, you will learn how tool design — not model capability — is the root cause of most AI agent failures, and what concrete design patterns you can apply to fix it.

Topics we will cover include:

Tool design practices that improve agent reliability, including single-responsibility tools, tight schemas, and structured error returns.

Common failure modes such as unfiltered API exposure, silent partial success, and overlapping tool names that break real-world workloads.

Schema and error handling patterns that reduce hallucination and unreliable behavior at the tool boundary.

Let’s get into it.

Introduction

Most AI agent failures look like model mistakes: choosing the wrong tool, passing bad arguments, or mishandling errors. But in practice, the model is usually working with the interface it was given. The underlying issue is often the tool design itself.

A model can only reason from the information exposed through the tool interface: the tool name, its description, the parameter schema, and the parameter descriptions. Those details shape how the model interprets intent, plans actions, and executes tasks. When the tool design is unclear, incomplete, or loosely structured, failures become predictable rather than accidental.

Problems like vague naming, ambiguous instructions, inconsistent schemas, weak parameter definitions, and poor error handling all increase the likelihood of failures. Stronger models can reduce some mistakes, but they cannot reliably compensate for a flawed interface. This article covers:

Tool design practices that improve reliability

Failure modes that look fine in demos but break under real workloads

Schema and error design that reduces hallucination at the tool boundary

Each pattern is paired with its failure counterpart, because understanding why a design fails is as important as knowing what to replace it with.

What Works in AI Agent Tool Design

  1. One Tool, One Responsibility

In most agent systems, a tool should represent a single, clear operation. When one tool handles multiple behaviors through an action parameter, the model must first figure out which mode to invoke before it can solve the actual task.

The difference becomes clearer when comparing a multi-action tool against dedicated single-purpose tools:

    # Avoid: action-based multi-behavior tool
    @tool
    def manage_customer(
        action: str,
        customer_id: str | None = None,
        data: dict | None = None
    ):
        """
        action: create | get | update | delete | suspend
        """
        ...

    # Prefer: single-responsibility tools
    @tool
    def create_customer(data: CustomerInput) -> Customer:
        """Create a new customer record."""
        ...

    @tool
    def get_customer(customer_id: str) -> Customer:
        """Retrieve a customer by ID."""
        ...

    @tool
    def suspend_customer(customer_id: str, reason: str) -> SuspensionResult:
        """Suspend a customer account."""
        ...

Single-responsibility tools give the model an unambiguous function and give you cleaner error handling and easier observability.

⚠️ Note: This is a useful default rather than a universal rule. Some domains — such as shell, filesystem, browser, or calendar tools — may benefit from a constrained multi-action interface because the action space itself is part of the underlying abstraction.
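Where a multi-action interface is justified, the action set can still be closed rather than free-form. The following is a minimal sketch, assuming a hypothetical calendar tool; `CalendarAction` and `manage_calendar` are illustrative names, not part of any framework. It shows an enum-backed action parameter with per-action argument checks, so an invalid mode is rejected with a message that lists the valid ones.

```python
from enum import Enum
from typing import Optional


class CalendarAction(str, Enum):
    """Closed set of actions; anything else is rejected before dispatch."""
    LIST = "list"
    CREATE = "create"
    CANCEL = "cancel"


def manage_calendar(action: str,
                    event_id: Optional[str] = None,
                    title: Optional[str] = None) -> dict:
    """Hypothetical constrained multi-action tool with per-action checks."""
    try:
        parsed = CalendarAction(action)
    except ValueError:
        valid = ", ".join(a.value for a in CalendarAction)
        return {"ok": False,
                "error": f"Unknown action '{action}'. Valid actions: {valid}."}

    # Each action declares the arguments it needs; missing ones are reported.
    if parsed is CalendarAction.CREATE and not title:
        return {"ok": False, "error": "Action 'create' requires 'title'."}
    if parsed is CalendarAction.CANCEL and not event_id:
        return {"ok": False, "error": "Action 'cancel' requires 'event_id'."}
    return {"ok": True, "action": parsed.value}
```

The trade-off is that the per-action argument rules live in code rather than in the schema, so the description still has to spell them out for the model.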

  2. Schemas That Make Invalid States Impossible

In tool-calling agents, the model constructs tool call arguments by reasoning from your schema.

A loose schema means the model guesses at constraints.

A tight schema encodes those constraints so no guessing is needed.

Here’s an example:

    from pydantic import BaseModel, Field
    from enum import Enum

    class Priority(str, Enum):
        LOW = "low"
        MEDIUM = "medium"
        HIGH = "high"

    class CreateTaskInput(BaseModel):
        title: str = Field(
            description="Short, actionable task title. Use imperative form: 'Review PR', not 'PR Review'.",
            min_length=5,
            max_length=100
        )
        priority: Priority = Field(
            description="Task priority. Use HIGH only for blockers affecting other work.",
            default=Priority.MEDIUM
        )
        due_date: str = Field(
            description="Due date in ISO 8601 format: YYYY-MM-DD. Must be a future date.",
            pattern=r"^\d{4}-\d{2}-\d{2}$"
        )

Enums are particularly useful for fields with a small set of valid values because they eliminate a class of plausible-but-invalid outputs. Validation failures surface at the tool boundary rather than as cryptic downstream errors.
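One constraint in the schema above, "Must be a future date", is stated in the description but cannot be expressed by a format pattern alone. The sketch below uses only the standard library to show what checking such rules at the tool boundary might look like. `TaskInput` and its `validate` method are hypothetical stand-ins for illustration, not the article's Pydantic model.

```python
import re
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Priority(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass
class TaskInput:
    """Stdlib stand-in for a validated tool input (hypothetical)."""
    title: str
    due_date: str
    priority: str = "medium"

    def validate(self, today: date) -> list:
        """Return a list of problems; an empty list means the input is valid."""
        problems = []
        if not 5 <= len(self.title) <= 100:
            problems.append("title must be 5-100 characters")
        if self.priority not in {p.value for p in Priority}:
            problems.append("priority must be one of: low, medium, high")
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", self.due_date):
            problems.append("due_date must match YYYY-MM-DD")
        else:
            try:
                # The pattern only checks shape; parse to check the value.
                if date.fromisoformat(self.due_date) <= today:
                    problems.append("due_date must be in the future")
            except ValueError:
                problems.append("due_date is not a real calendar date")
        return problems
```

Returning every problem at once, rather than failing on the first, gives the model a chance to correct all arguments in a single retry.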

  3. Descriptions That Define Scope, Not Just Purpose

Tool descriptions are model-facing documentation. They need to do two things: explain when to use the tool, and explain when not to. Most descriptions only do the first.

    # Weak: explains what it does, not when not to use it
    """Search for documents in the knowledge base."""

    # Strong: defines purpose, scope, and boundaries
    """
    Search the internal knowledge base for documents, policies, and reference material.
    Use this when the user asks about company procedures, product specs, or documented workflows.
    Do NOT use this for real-time data (prices, availability, current status); use get_live_data() instead.
    Returns up to 5 results ranked by relevance. If no results are returned, the information is not in the knowledge base.
    """

Without this disambiguation, the model infers scope from the tool name alone, which is a consistent source of selection errors at scale. A good tool definition states its boundaries with neighboring tools, not just its usage instructions.
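One way to keep scope statements from being forgotten is to assemble descriptions from named parts and refuse to build one that lacks a boundary. This is only a sketch of that idea; `build_tool_description` and its parameter names are assumptions made for illustration.

```python
def build_tool_description(purpose: str, use_when: str,
                           do_not_use_when: str, returns: str) -> str:
    """Assemble a description that always states scope boundaries (hypothetical helper)."""
    parts = {
        "purpose": purpose,
        "use_when": use_when,
        "do_not_use_when": do_not_use_when,
        "returns": returns,
    }
    # Fail loudly if any section is blank, so boundaries cannot be skipped.
    missing = [name for name, text in parts.items() if not text.strip()]
    if missing:
        raise ValueError(f"Description is missing: {', '.join(missing)}")
    return "\n".join([
        purpose.strip(),
        f"Use this when {use_when.strip()}",
        f"Do NOT use this for {do_not_use_when.strip()}",
        returns.strip(),
    ])
```

A check like this can run in tests or at tool registration time, so a description without a "Do NOT use" line never reaches the model.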

  4. Structured, Actionable Error Returns

When a tool fails, the model reads the error and decides what to do next. An unhandled exception or stack trace produces noise-driven follow-up behavior. A structured error gives the model something to branch on.

Structured errors should not only report what failed but also help the agent decide what to do next. A good error format makes retry behavior explicit and gives the model a clear recovery path:

    from pydantic import BaseModel

    class ToolError(BaseModel):
        error_code: str        # machine-readable, for the model to branch on
        message: str           # human-readable description
        recoverable: bool      # can the agent recover by retrying or taking a corrective step?
        suggested_action: str  # what the agent should do next

    # Inside a tool body. Record not found: recoverable (fetch valid IDs, then call again)
    return ToolError(
        error_code="RECORD_NOT_FOUND",
        message="No user record found with ID 'usr_123'.",
        recoverable=True,
        suggested_action="Use list_users() to get valid user IDs before calling get_user()."
    )

    # Inside a tool body. Quota exceeded: not recoverable
    return ToolError(
        error_code="QUOTA_EXCEEDED",
        message="API quota for this tool has been reached for today.",
        recoverable=False,
        suggested_action="Notify the user and stop. Do not retry this tool today."
    )

The recoverable flag and suggested_action field are what change agent behavior. Without them, models retry non-retryable errors or abandon recoverable ones.
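To illustrate how those two fields can drive control flow on the agent side, here is a minimal sketch of a decision function. It mirrors the `ToolError` fields with a standard-library dataclass; `next_step`, the attempt limit, and the returned step names are assumptions for illustration, not part of any agent framework.

```python
from dataclasses import dataclass


@dataclass
class ToolError:
    """Stdlib mirror of the ToolError fields described above."""
    error_code: str
    message: str
    recoverable: bool
    suggested_action: str


def next_step(result, attempts: int, max_attempts: int = 3) -> str:
    """Decide what a (hypothetical) agent loop does after a tool call."""
    if not isinstance(result, ToolError):
        return "continue"               # normal result, carry on with the plan
    if not result.recoverable:
        return "stop_and_report"        # never retry non-recoverable errors
    if attempts >= max_attempts:
        return "stop_and_report"        # recoverable, but the budget is spent
    return "follow_suggested_action"    # e.g. look up valid IDs, then retry
```

Capping attempts matters even for recoverable errors; otherwise a tool that keeps returning the same recoverable error can hold the loop indefinitely.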

  5. Idempotent State-Changing Operations

Every tool that mutates state — creates a record, sends a message, transfers funds — must be safe to call twice. In practice, agents retry, networks fail, and the LLM loop may issue a second call because confirmation of the first never arrived.

A simple way to prevent duplicate side effects is to require an idempotency key for every write operation:

    @tool
    def send_email(
        to: str,
        subject: str,
        body: str,
        idempotency_key: str = Field(
            description="Unique key for this send operation, generated once and reused unchanged on retry "
                        "(for example, a hash of recipient + subject + the original request timestamp). "
                        "Same key on retry returns the original result without re-sending."
        )
    ) -> dict:
        """Send an email. Idempotent: the same idempotency_key will not trigger a second send."""
        # Return the stored result if this key has already been processed
        existing = idempotency_store.get(idempotency_key)
        if existing is not None:
            return existing
        result = email_service.send(to=to, subject=subject, body=body)
        # Keep the result so retries become no-ops (86400 = 24 hours if the store's ttl is in seconds)
        idempotency_store.set(idempotency_key, result, ttl=86400)
        return result

Without idempotency guarantees, transient failures can easily turn into duplicate actions.
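The key only protects against duplicates if it is identical on every retry, so it should be derived from values fixed at the time of the original request. Below is a small self-contained sketch of that behavior. The in-memory store, the `request_id` input, and `send_email_once` are hypothetical stand-ins for the `idempotency_store` and `email_service` used above.

```python
import hashlib


class InMemoryIdempotencyStore:
    """Toy store; a real one would be shared, persistent, and expire entries."""

    def __init__(self):
        self._results = {}

    def get(self, key):
        return self._results.get(key)

    def set(self, key, value):
        self._results[key] = value


def make_idempotency_key(request_id: str, to: str, subject: str) -> str:
    """Derive the key from values that do not change between retries."""
    raw = "\x1f".join([request_id, to, subject])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


sent_log = []  # stands in for the real email service
store = InMemoryIdempotencyStore()


def send_email_once(to: str, subject: str, body: str, idempotency_key: str) -> dict:
    """Send at most once per key; repeat calls return the stored result."""
    existing = store.get(idempotency_key)
    if existing is not None:
        return existing
    sent_log.append((to, subject, body))
    result = {"status": "sent", "message_number": len(sent_log)}
    store.set(idempotency_key, result)
    return result
```

This sketch is not safe under concurrent calls: two simultaneous requests with the same key could both pass the lookup before either stores a result. A production store would need an atomic check-and-set.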

What Doesn’t Work in AI Agent Tool Design

  1. Thin Wrappers Around Unfiltered APIs

Pointing an agent at a REST API and surfacing it as a tool is the most common shortcut and the most common source of production failures. APIs built for developers often expose far more detail than agents actually need. Responses come packed with hundreds of fields, even when only a handful are relevant. They rely on pagination, use opaque internal IDs with little contextual meaning, and return error codes that require deep domain knowledge to interpret.

A purpose-built wrapper handles pagination internally, projects only the fields the agent needs, and maps API errors to the structured ToolError format discussed above. The agent never constructs API paths or manages pages; it receives typed objects it can reason about.

That said, over-wrapping can also be harmful. If every endpoint becomes a separate, narrowly defined tool with no shared structure, the tool surface can become fragmented and harder for the model to navigate. The goal is not maximal abstraction, but a consistent, agent-friendly abstraction layer.
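As a rough illustration of such a wrapper, the sketch below follows pagination internally, projects each record down to a few fields, and maps a raw failure to the structured error shape described earlier. The paginated API (`raw_list_orders`), the `OrderSummary` type, and the field names are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class OrderSummary:
    """Only the fields the agent needs to reason about."""
    order_id: str
    status: str
    total: float


# Hypothetical raw API data: paginated, with fields the agent never needs.
_FAKE_PAGES = {
    None: {"items": [{"id": "o1", "status": "shipped", "total": 20.0,
                      "internal_ref": "x9", "etag": "a"}], "next": "p2"},
    "p2": {"items": [{"id": "o2", "status": "pending", "total": 5.5,
                      "internal_ref": "x7", "etag": "b"}], "next": None},
}


def raw_list_orders(customer_id: str, cursor=None) -> dict:
    """Stand-in for a developer-facing REST call."""
    if customer_id != "cust_1":
        raise KeyError(customer_id)
    return _FAKE_PAGES[cursor]


def list_customer_orders(customer_id: str, max_pages: int = 10) -> dict:
    """Agent-facing wrapper: follows pagination, projects fields, maps errors."""
    orders, cursor = [], None
    for _ in range(max_pages):
        try:
            page = raw_list_orders(customer_id, cursor)
        except KeyError:
            return {
                "error_code": "CUSTOMER_NOT_FOUND",
                "message": f"No customer with ID '{customer_id}'.",
                "recoverable": True,
                "suggested_action": "Look up a valid customer ID before listing orders.",
            }
        orders += [OrderSummary(i["id"], i["status"], i["total"]) for i in page["items"]]
        cursor = page["next"]
        if cursor is None:
            break
    # Report truncation explicitly instead of returning a silently short list.
    return {"orders": orders, "truncated": cursor is not None}
```

The `truncated` flag is a deliberate choice: a page cap without it would hide incomplete results from the agent.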

  2. Loading All Tools Into Every Context

Accuracy degrades as the tool catalog grows. LongFuncEval, a 2025 study on tool-calling performance across long contexts, found that performance dropped substantially as the tool catalog grew, even in models with 128K context windows. Loading every tool into every system prompt compounds this by consuming token budget before any task content is processed.

Dynamic tool loading addresses both problems. Determine which tools are relevant to the current step and include only those:

    # Map each workflow step to the tool names relevant to it
    STEP_TOOL_MAP = {
        "research": ["search_documents", "search_web", "get_url_content"],
        "write": ["create_document", "update_document", "format_text"],
        "send": ["send_email", "post_to_slack", "create_calendar_event"],
    }

    def get_tools_for_step(step_type: str, available_tools: list) -> list:
        # Unknown step types map to an empty list, so no tools are exposed
        relevant_names = STEP_TOOL_MAP.get(step_type, [])
        return [t for t in available_tools if t.name in relevant_names]

Exposing only a small, relevant subset of tools at each step — rather than the full toolset — generally improves selection accuracy and reduces per-call token cost.

  3. Silent Partial Success

Partial success becomes a problem when a tool completes only part of the requested work but returns a response that looks fully successful. The agent continues execution with an incomplete or misleading view of the system state.

This usually happens when tools suppress internal failures and return only the successful portion of the result:

    # This version silently misleads the agent
    @tool
    def bulk_create_tasks(tasks: list) -> dict:
        created = []
        for task in tasks:
            try:
                result = task_api.create(task)
                created.append(result.id)
            except Exception:
                pass  # silent failure: this is the bug
        return {"created": created}

    # This version makes partial success explicit
    @tool
    def bulk_create_tasks(tasks: list) -> BulkCreateResult:
        created, failed = [], []
        for task in tasks:
            try:
                created.append(task_api.create(task).id)
            except TaskCreationError as e:
                failed.append({"input": task.title, "reason": str(e)})
        return BulkCreateResult(
            created_ids=created,
            failed_items=failed,
            success=len(failed) == 0,
            partial_success=len(created) > 0 and len(failed) > 0
        )

The partial_success flag gives the model something to branch on: retry the failed it

[truncated for AI cost control]