How AI Agents Work: An Architectural Deep Dive
This article provides an in-depth analysis of AI agent architecture, focusing on the ReAct pattern, tool use, memory, multi-agent systems, and observability. It cites a widely circulated but disputed estimate that production agents are roughly 98.4% infrastructure and only 1.6% AI logic, and discusses the high failure rates and evaluation challenges in enterprise adoption.
How AI Agents Actually Work: An Architectural Deep Dive
An analysis of the patterns, infrastructure, and trade-offs behind the systems that have redefined what large language models can do
Tags: Research, Technology, AI Agents, LLM, ReAct, Tool Use, Multi-Agent Systems, Observability, Software Engineering, Claude Code
Executive Summary
The term “AI agent” has become one of the most overloaded in modern tech, but at its core it refers to a simple pattern: a large language model (LLM) connected to external tools and operating in a loop where it reasons about what to do, calls a tool, observes the result, and repeats until the task is complete. This pattern, known as ReAct after the 2022 paper “ReAct: Synergizing Reasoning and Acting in Language Models,” underlies most production AI agents today.
What makes agents work well is not the model itself but the surrounding infrastructure: how context windows are managed across thousands of tool calls, how tools are designed for non-deterministic consumers, and how safety boundaries are enforced. One widely circulated claim has become the defining statistic in this space: that Claude Code’s leaked source code revealed only about 1.6% of its codebase constitutes AI decision logic, with the remaining 98.4% being operational infrastructure [3]. This figure is disputed: critics argue it misinterprets how the Liu et al. paper categorizes different kinds of code, and that the distinction between “AI logic” and “infrastructure” is itself an interpretive choice rather than a fact about the code. Regardless of the exact percentage, the underlying intuition holds: production agent systems are dominated by operational engineering.
The architecture has evolved through several identifiable layers:
- The ReAct loop (Thought → Action → Observation) interleaves reasoning traces with external actions so the model can induce, track, and update plans while interacting with real data sources.
- Tool use connects the model to APIs, files, databases, and other systems. The key insight is that tools must be designed specifically for agents, i.e., non-deterministic consumers, not just wrapped as API endpoints.
- Memory comes in two forms: short-term (in-context learning bounded by the context window) and long-term (external vector stores via Retrieval-Augmented Generation).
- Planning and composition patterns (orchestrator-workers, evaluator-optimizer, parallelization) allow agents to handle complex multi-step tasks.
- Multi-agent systems delegate subtasks to specialized workers, trading much higher token costs for dramatic gains in capability on open-ended problems.
- Observability (distributed tracing via OpenTelemetry GenAI semantic conventions, infinite loop detection, cost attribution, and session replay) has emerged as a critical operational layer. Without it, debugging non-deterministic agent behavior is nearly impossible.
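As one illustration of the observability layer, infinite loop detection can be approximated by fingerprinting each tool call and flagging a run once the same call repeats too often. The sketch below is a minimal version under stated assumptions: the `LoopDetector` class, its `max_repeats` threshold, and the fingerprinting scheme are hypothetical and are not taken from OpenTelemetry or any agent framework.

```python
import json
from collections import Counter


class LoopDetector:
    """Flags an agent run that keeps issuing the same tool call.

    Hypothetical helper: the name, threshold, and fingerprint format are
    illustrative assumptions, not part of any tracing standard.
    """

    def __init__(self, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats
        self._seen = Counter()

    def record(self, tool_name: str, args: dict) -> bool:
        """Record one tool call; return True once it has repeated too often."""
        # Canonical JSON so that argument order does not change the fingerprint.
        fingerprint = tool_name + ":" + json.dumps(args, sort_keys=True)
        self._seen[fingerprint] += 1
        return self._seen[fingerprint] >= self.max_repeats


detector = LoopDetector(max_repeats=3)
calls = [("search", {"q": "paris"})] * 3
flags = [detector.record(name, args) for name, args in calls]
# flags is [False, False, True]: the third identical call trips the detector.
```

Exact-match fingerprints only catch literal repetition. A run that cycles through slightly different arguments would need a fuzzier comparison or a hard step budget.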
The most important finding from this research is that agent architecture has converged around a small set of well-understood patterns. The competition between framework vendors (LangChain, CrewAI, OpenAI’s SDKs, Anthropic’s Agent SDK) is largely about ergonomics. Real engineering effort goes into context management, tool design, and reliability, areas where the best practitioners have accumulated significant domain knowledge.
A second important finding is that the gap between agent benchmarks and real-world performance is much wider than commonly assumed: 95% of enterprise AI pilots deliver zero measurable ROI [25], and roughly half of SWE-bench-passing PRs would not be merged by real maintainers [17]. The field’s primary bottleneck is now evaluation methodology, not model capability [21].
A third finding: the “agent winter” critique has empirical backing. Enterprise adoption has been slower and more cautious than early hype suggested, with Gartner predicting 40% of agentic AI projects will be scrapped by 2027, citing “rising costs, unclear business value, and integration complexity,” and PwC identifying integration complexity (67%), lack of monitoring (58%), and unclear escalation paths (52%) as the top causes of pilot failure.
Definitions: What Is an “Agent” and How Does It Differ from Other AI Systems?
The word “agent” has a long history in computer science. The classic definition from Russell and Norvig’s Artificial Intelligence: A Modern Approach describes an agent as anything that perceives its environment through sensors and acts upon that environment through actuators. This is a broad definition; a thermostat is technically an agent.
In the modern AI literature, the term has narrowed. Anthropic defines agents as “systems where LLMs dynamically direct their own processes and tool usage,” distinguishing them from workflows: systems where LLMs and tools are orchestrated through predefined code paths. This distinction matters: a customer support bot that follows a decision tree of prompts is a workflow; one that decides on its own whether to query a knowledge base, check a user’s account history, or ask for clarification is an agent.
The key property that makes something “agentic” is autonomy in tool selection and task decomposition. An autonomous system chooses which tools to use and in what order; it breaks complex goals into subgoals without explicit human instruction for each step.
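The workflow/agent distinction can be made concrete with a small sketch. Everything here is hypothetical: the two tool functions are stand-ins for real services, and `keyword_policy` stands in for the LLM that would normally choose the next tool. The point is only where the control flow lives: in the developer's code for a workflow, in the policy for an agent.

```python
from typing import Callable, Dict, List, Optional


# Hypothetical tools; real ones would call a knowledge base or account service.
def query_knowledge_base(question: str) -> str:
    return f"KB article for: {question}"


def check_account_history(question: str) -> str:
    return f"Account history relevant to: {question}"


TOOLS: Dict[str, Callable[[str], str]] = {
    "query_knowledge_base": query_knowledge_base,
    "check_account_history": check_account_history,
}


def run_workflow(question: str) -> List[str]:
    """Workflow: the developer fixes the code path in advance."""
    return [query_knowledge_base(question), check_account_history(question)]


def run_agent(
    question: str,
    choose_tool: Callable[[str, List[str]], Optional[str]],
    max_steps: int = 5,
) -> List[str]:
    """Agent: a policy (an LLM in practice) picks the next tool at each step."""
    results: List[str] = []
    for _ in range(max_steps):
        name = choose_tool(question, results)
        if name is None:  # the policy decides it is done
            break
        results.append(TOOLS[name](question))
    return results


def keyword_policy(question: str, results: List[str]) -> Optional[str]:
    """Stand-in for the model: one tool call chosen from the question text."""
    if results:
        return None
    if "refund" in question.lower():
        return "check_account_history"
    return "query_knowledge_base"


agent_results = run_agent("I want a refund", keyword_policy)
```

With the refund question, the agent calls only `check_account_history`, while `run_workflow` always calls both tools in the same order.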
A related term, copilot, refers to systems that assist a human operator but do not operate independently. ChatGPT, GitHub Copilot, and Cursor are copilots: they generate suggestions but require the user to approve and execute each action. Claude Code occupies an interesting middle ground: it can autonomously edit files and run commands in a sandbox, but permission modes (plan, default, auto) control how much autonomy it has.
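A permission gate of this kind can be sketched as a single check that runs before each action. The mode names below come from the text, but the semantics attached to them are assumptions made for illustration (plan blocks side effects, default asks a human, auto proceeds). This is not a description of how Claude Code implements its modes.

```python
from enum import Enum
from typing import Callable


class Mode(Enum):
    PLAN = "plan"
    DEFAULT = "default"
    AUTO = "auto"


def is_allowed(mode: Mode, mutates: bool, approve: Callable[[], bool]) -> bool:
    """Decide whether an action may run under the given mode.

    Assumed semantics (hypothetical):
      - read-only actions are always allowed
      - PLAN never performs side effects
      - DEFAULT asks the human via `approve`
      - AUTO proceeds without asking
    """
    if not mutates:
        return True
    if mode is Mode.PLAN:
        return False
    if mode is Mode.DEFAULT:
        return approve()
    return True


# A file edit in default mode runs only if the human approves it.
edit_runs = is_allowed(Mode.DEFAULT, mutates=True, approve=lambda: True)
```

Keeping the check in deterministic code, outside the model, means the autonomy level does not depend on the model following instructions.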
The ReAct Pattern: Core Architecture
The single most important pattern in agent design is ReAct (short for “Reasoning and Acting”), introduced by Yao et al. at Google Research and Princeton University in October 2022 [1]. Before ReAct, reasoning (chain-of-thought prompting) and acting (action plan generation) had been studied as separate capabilities. The paper’s central insight was that interleaving them creates a synergy: reasoning traces help the model induce, track, and update action plans, while actions enable interaction with external sources of information.
How the Loop Works
The ReAct loop is deceptively simple:
while not done:
    thought = model(reasoning_trace + available_tools)
    if thought is a tool call:
        result = execute_tool(thought.tool, thought.args)
        observation = format_result(result)
        append to reasoning trace
    else:
        return thought
In practice, the “thought” that the model generates can be either a natural-language reasoning step or a structured tool call. The model alternates between these two types of outputs. Each iteration adds both a reasoning trace and an observation (the result of the previous action) to the context window.
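The loop can be run end to end without any model API by substituting a scripted stand-in for the LLM. The following is a minimal sketch, not a reference implementation: `Step`, `react_loop`, `scripted_model`, and the `lookup` tool are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One model output: either a tool call or a final answer."""
    tool: Optional[str] = None
    args: dict = field(default_factory=dict)
    answer: Optional[str] = None


def react_loop(
    model: Callable[[List[str]], Step],
    tools: Dict[str, Callable[..., str]],
    task: str,
    max_steps: int = 5,
) -> Tuple[str, List[str]]:
    """Run thought -> action -> observation until the model answers."""
    trace = [f"Task: {task}"]
    for _ in range(max_steps):
        step = model(trace)
        if step.tool is None:  # no tool call: this is the final answer
            return step.answer or "", trace
        result = tools[step.tool](**step.args)  # deterministic execution
        trace.append(f"Action: {step.tool}({step.args})")
        trace.append(f"Observation: {result}")
    raise RuntimeError("step budget exhausted without a final answer")


def scripted_model(trace: List[str]) -> Step:
    """Stand-in for an LLM: look something up once, then answer from it."""
    if not any(line.startswith("Observation:") for line in trace):
        return Step(tool="lookup", args={"key": "capital_of_france"})
    observation = trace[-1][len("Observation: "):]
    return Step(answer=f"The capital is {observation}.")


facts = {"capital_of_france": "Paris"}
tools = {"lookup": lambda key: facts[key]}
answer, trace = react_loop(scripted_model, tools, "What is the capital of France?")
# answer is "The capital is Paris."
```

The explicit `max_steps` budget replaces the open-ended `while not done` of the pseudocode, so a model that never produces a final answer cannot loop forever.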
Why It Works
There are three reasons ReAct outperforms its predecessors:
- Error correction: Chain-of-thought reasoning alone is vulnerable to error propagation. If the model makes a mistake in step 2, every subsequent step compounds that error. By interleaving actions (like Wikipedia lookups), the agent can detect and correct mistakes early.
- Information grounding: The ReAct paper showed that on question-answering tasks (HotpotQA) and fact verification (FEVER), ReAct “overcomes issues of hallucination and error propagation prevalent in chain-of-thought reasoning by interacting with a simple Wikipedia API” [1].
- Interpretability: Because the agent’s thought process is visible, failures are debuggable. You can see exactly where the model went wrong. Was it the initial plan? A tool call with wrong arguments? An incorrect interpretation of the result?
A Minimal ReAct Implementation
Below is a minimal working implementation of the ReAct loop using OpenAI’s function calling API, illustrating how the pattern translates from theory to code:
import json

import openai

# Define tools as JSON schemas the model understands
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_wikipedia",
            "description": "Search Wikipedia for relevant information",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"}
                },
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Perform arithmetic calculation",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {"type": "string", "description": "Math expression to evaluate"}
                },
                "required": ["expression"],
            },
        },
    },
]

# Tool implementations (executed by deterministic code, not the model)
def search_wikipedia(query: str) -> str:
    """Actual Wikipedia API call."""
    # ... real implementation; it must return a string
    raise NotImplementedError("plug in a real Wikipedia lookup here")

def calculate(expression: str) -> str:
    # Simplified for illustration: eval() is unsafe on untrusted input
    return str(eval(expression))

tool_functions = {"search_wikipedia": search_wikipedia, "calculate": calculate}

# The ReAct loop
messages = [{"role": "user", "content": "What is the capital of France and what's its population squared?"}]
max_iterations = 10

for _ in range(max_iterations):
    response = openai.chat.completions.create(
        model="gpt-4o", messages=messages, tools=tools
    )
    msg = response.choices[0].message

    if msg.tool_calls:
        # Model wants to call one or more tools. Append the assistant turn
        # once, with all of its tool calls (the "Thought" phase).
        messages.append(msg)
        for tool_call in msg.tool_calls:
            # Execute the tool deterministically
            func_name = tool_call.function.name
            func_args = json.loads(tool_call.function.arguments)
            result = tool_functions[func_name](**func_args)
            # Append the observation back to history
            messages.append({
                "role": "tool",
                "content": str(result),
                "tool_call_id": tool_call.id,
            })
    else:
        # No tool call; model has a final answer
        print(msg.content)
        break
This code illustrates the core separation: the model decides what to do (which tool to call and with what arguments), while deterministic Python code handles the execution. The conversation history grows with each iteration (thought, action, observation) until the model produces a final answer rather than a tool call.
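The `calculate` tool above relies on `eval`, which will execute whatever string the model sends. One common alternative, sketched here as an assumption rather than as the article's method, is to parse the expression with Python's `ast` module and evaluate only a whitelist of arithmetic nodes. The name `safe_calculate` and the set of allowed operators are illustrative choices.

```python
import ast
import operator

# Whitelisted operators; anything else in the expression is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def safe_calculate(expression: str) -> str:
    """Evaluate +, -, *, / over numeric literals without calling eval()."""

    def evaluate(node: ast.AST):
        if isinstance(node, ast.Constant):
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
                return node.value
            raise ValueError("only numeric literals are allowed")
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](evaluate(node.operand))
        raise ValueError("unsupported expression")

    tree = ast.parse(expression, mode="eval")
    return str(evaluate(tree.body))


total = safe_calculate("2 + 3 * 4")  # "14"
```

Exponentiation is left out on purpose, since very large powers can stall the process; a square can be written as `x * x`. Function calls, names, and attribute access all raise `ValueError` instead of executing.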
Performance
The ReAct paper reported significant improvements: on ALFWorld (a synthetic household task environment), ReAct outperformed imitation and reinforcement learning methods by an absolute success rate of 34%. On WebShop (an online shopping environment with 1.18 million products), it beat baselines by 10% in success rate. These results were achieved with only one or two in-context examples.
Mechanistic Analysis: Why Interleaving Works (and When It Does Not)
The ReAct paper’s claim of “synergy” between reasoning and acting has been both validated and challenged by subsequent research. Understanding why interleaving helps at the model level requires examining what actually happens inside a transformer during an agent loop.
The functional explanation. At the behavioral level, interleaving creates a dynamic feedback loop: each tool output becomes new input for the next reasoning step, allowing the model to continuously update its understanding of the task. Choices are informed by both internal logic (pre-trained knowledge) and external results (tool outputs). This reduces hallucination because the model cannot rely solely on parametric memory.
The transformer-level explanation. When a model generates a tool call and then receives the tool’s output appended to its context, several things happen at the attention level:
Attention re-weighting: The newly appended tool output tokens receive full attention from all subsequent generation steps. The model’s attention heads redistribute their focus across the entire contex
[truncated for AI cost control]