2026-06-29 15:07 UTC | In-site rewrite | 6 min read | Updated: 2026-06-29 15:26 UTC

Building AI Agents in Ruby with the Anthropic SDK

This article explains how to build AI agents in Ruby using the Anthropic SDK. It covers the concept of agents vs. workflows, the minimal agent loop, tool design, streaming, background execution, security, error handling, observability, and testing. It emphasizes that a simple model call is often sufficient, and the agent loop should only be used for truly open-ended tasks.

Source: Hacker News AI | Author: nikita-ruby

An AI agent is a language model that actually does things. You hand it a goal and a set of tools (functions it is allowed to call), and it decides which to use, runs them, reads the results, and keeps going until the task is done. That loop of deciding, acting, and observing is what separates an agent from a single prompt. A support agent that looks up customer invoices and drafts a reply, or an internal tool that pulls from three systems to answer a question, is an agent in this sense.

The official Anthropic Ruby SDK ships with streaming, connection pooling, and a tool runner that handles the agent loop for you. This post covers what an agent actually is, how to structure one in Rails, how to design tools the model can use reliably, and the production concerns that make the difference between a demo and something you can actually ship.

What an Agent Actually Is

The concept is simple. In Anthropic's words, "agents are typically just LLMs using tools based on environmental feedback in a loop" (Building Effective Agents). The model receives a goal, decides whether it needs to call a tool, you execute the tool and feed the result back, and the loop repeats until the model stops asking for tools.

The same article draws a distinction worth understanding before you write any code. "Workflows are systems where LLMs and tools are orchestrated through predefined code paths," while "agents are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks." Workflows are predictable and consistent; agents are flexible at the cost of higher latency, higher token spend, and the potential for compounding errors. Which of these you actually need is the call that matters here, and the honest answer is usually "less agent than you think."

Anthropic's own guidance is to find "the simplest solution possible, and only increasing complexity when needed." For many features, a single well-prompted model call with good context beats an autonomous agent: cheaper, faster, and easier to debug. Reach for a true loop only when the task is open-ended enough that you genuinely cannot predict the steps in advance.

The Minimal Agent Loop in Ruby

Start with the official gem:

# Gemfile
gem "anthropic"

The client is thread-safe and maintains its own connection pool, so create it once and reuse it. An initializer is the natural home:

# config/initializers/anthropic.rb
ANTHROPIC = Anthropic::Client.new(
  api_key: ENV.fetch("ANTHROPIC_API_KEY")
)

A single model call looks like this:

message = ANTHROPIC.messages.create(
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Summarize Q1 in one sentence." }]
)

puts message.content

That is not yet an agent, because there is no loop and no tools. The loop is what makes it agentic: send the conversation to the model, check whether it wants to use a tool, run the tool, append the result to the conversation, and repeat until it stops asking for tools. Written by hand, the loop is only a dozen lines, and it is worth seeing once before you let the SDK handle it, because understanding what is under the abstraction is what lets you debug it when it breaks.

def run_agent(client:, tools:, messages:, model: "claude-sonnet-4-6")
  loop do
    response = client.messages.create(
      model: model,
      max_tokens: 1024,
      tools: tools.map(&:definition),
      messages: messages
    )

    # The model is done when it stops asking to use tools.
    # to_s keeps the comparison valid whether stop_reason is a String or a Symbol.
    break response if response.stop_reason.to_s != "tool_use"

    messages [...]

# [...]_body") was rendered when the message
# record was created, so each chunk just adds a text node to it - far
# cheaper than re-rendering the whole bubble on every token.
Turbo::StreamsChannel.broadcast_append_to(
  conversation,                          # the stream the browser subscribed to
  target: "message_#{message.id}_body",  # element to append into
  html: chunk
)
end

# Persist the finished text once the stream closes, so a page reload
# shows the full response rather than an empty bubble.
message.update!(body: full_text)

The view subscribes to the stream with turbo_stream_from and renders the empty message_#{message.id}_body container once; from then on every broadcast_append_to lands inside it with no controller round trip. In a Rails app this pairs naturally with Turbo Streams or ActionCable: each text chunk becomes a broadcast, and the user watches the response appear. The streaming interface also exposes accumulation helpers and event-level access when you need to react to specific events rather than just the text, which is useful for showing the user "calling tool: looking up invoices" as it happens.

Run Agents in the Background

A real agent loop can run for many turns, and each turn is a network round trip to the model. That can easily exceed the time budget of a web request, and tying up a Puma worker for thirty seconds while an agent thinks is a good way to exhaust your connection pool under load. Agents belong in background jobs.

Enqueue the agent run, stream results back over a channel, and let your existing job infrastructure handle retries and concurrency.

class AgentRunJob #{content.to_s.gsub(//, "")}

RESULT end

Third, limit what the agent can do. An agent that can only read cannot be injected into deleting data. The more powerful the write tools, the more carefully you need to guard against injection.

Jailbreaking (attempts to make the model ignore its system prompt through roleplay, hypotheticals, or cleverly worded requests) is a related but different problem. The practical defenses: tell the model in the system prompt that it should decline roleplay or hypotheticals that would cause it to act outside its defined scope; validate that tool calls make sense before executing them; and accept that no system prompt is perfectly jailbreak-proof. Defense in depth matters more than trying to write an unbreakable prompt.
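One way to make "validate that tool calls make sense before executing them" concrete is a small gate in front of tool dispatch. The sketch below is only an illustration: ToolCall, ALLOWED_TOOLS, and validate_tool_call are hypothetical names that are not part of the Anthropic SDK, and the rules shown are examples rather than a complete policy.

```ruby
# Hypothetical stand-in for a tool_use block: a tool name plus an input hash.
ToolCall = Struct.new(:name, :input, keyword_init: true)

# Allowlist of tools and the argument rules each call must satisfy.
# The single entry here is illustrative.
ALLOWED_TOOLS = {
  "lookup_invoices" => {
    required: [:customer_id],
    check: ->(input) { input[:customer_id].is_a?(Integer) && input[:customer_id].positive? }
  }
}.freeze

# Returns nil when the call is acceptable, otherwise an error string that can
# be handed back to the model as the tool result.
def validate_tool_call(call)
  rules = ALLOWED_TOOLS[call.name]
  return "Error: unknown tool #{call.name.inspect}" unless rules

  missing = rules[:required].reject { |key| call.input.key?(key) }
  return "Error: missing arguments: #{missing.join(', ')}" unless missing.empty?

  return "Error: arguments failed validation" unless rules[:check].call(call.input)

  nil
end
```

Running this check before every dispatch means a tool the model was never given, or arguments of the wrong shape, are rejected in ordinary code rather than relying on the prompt to hold.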

Error Handling and Retries

The SDK raises a typed hierarchy of errors, all descending from Anthropic::Errors::APIError, which lets you handle each failure mode deliberately:

begin
  message = ANTHROPIC.messages.create(
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    messages: messages
  )
rescue Anthropic::Errors::RateLimitError
  # HTTP 429: back off and retry, or shed load.
  raise
rescue Anthropic::Errors::APIConnectionError => e
  # Network problem reaching the API.
  Rails.logger.error("Anthropic unreachable: #{e.cause}")
  raise
rescue Anthropic::Errors::APIStatusError => e
  Rails.logger.error("Anthropic returned #{e.status}")
  raise
end

The SDK already retries certain failures for you: by default it retries twice, with a short exponential backoff, on connection errors, request timeouts, 409 conflicts, 429 rate limits, and 5xx errors. You can tune this per client or per request with the max_retries option, and set it to zero when you want to handle retries entirely in your own job layer.
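If you do set max_retries to zero and move retries into your own layer, the backoff logic becomes yours to write. The following is a minimal plain-Ruby sketch of exponential backoff with a little jitter. TransientError and with_backoff are hypothetical names chosen for the illustration, and in a Rails job you would more likely reach for Active Job's retry_on than hand-roll this.

```ruby
# Hypothetical error class standing in for whatever transient failure you
# decide to retry (for example a rate limit or a connection error).
class TransientError < StandardError; end

# Runs the block, retrying on TransientError with exponential backoff.
# `sleeper` is injectable so tests do not actually sleep.
def with_backoff(max_attempts: 3, base_delay: 0.5, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue TransientError
    # Out of attempts: let the last error propagate to the caller.
    raise if attempt >= max_attempts

    # 0.5s, 1s, 2s, ... plus up to 10% random jitter.
    delay = base_delay * (2**(attempt - 1))
    sleeper.call(delay + rand * delay * 0.1)
    retry
  end
end
```

The jitter keeps many jobs that failed at the same moment from retrying in lockstep, and re-raising after the last attempt lets the job framework record the failure.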

For agents specifically, there is a second class of error beyond HTTP failures: the model doing something you did not expect, like calling a tool with arguments that fail validation or looping without converging. Always set a maximum iteration count as a stopping condition, even when using the tool runner, so a confused agent fails loudly instead of running up a bill. Treat your tool code defensively, validate inputs, and return a clear error string to the model when something is wrong rather than raising, because a well-worded error in the tool result often lets the model correct itself on the next turn.
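To illustrate the iteration cap, here is a self-contained sketch of a bounded loop that runs offline. Block, Response, AgentDidNotConverge, and run_bounded_agent are hypothetical stand-ins, not SDK classes; call_model is assumed to be any callable that takes the messages array and returns a response. The assistant and tool_result message shapes follow the Messages API format.

```ruby
# Hypothetical stand-ins for SDK response objects, so the sketch runs offline.
Block    = Struct.new(:type, :id, :name, :input, :text, keyword_init: true)
Response = Struct.new(:stop_reason, :content, keyword_init: true)

class AgentDidNotConverge < StandardError; end

# `call_model` is any callable that takes the messages array and returns a
# Response; `tools` maps tool names to callables that take the input hash.
def run_bounded_agent(call_model:, tools:, messages:, max_turns: 10)
  max_turns.times do
    response = call_model.call(messages)
    return response unless response.stop_reason.to_s == "tool_use"

    # Echo the assistant turn, then answer every tool_use block it contains.
    messages << { role: "assistant", content: response.content }
    results = response.content.select { |b| b.type.to_s == "tool_use" }.map do |block|
      tool = tools[block.name]
      output = tool ? tool.call(block.input) : "Error: unknown tool #{block.name}"
      { type: "tool_result", tool_use_id: block.id, content: output.to_s }
    end
    messages << { role: "user", content: results }
  end

  # Fail loudly instead of looping (and billing) forever.
  raise AgentDidNotConverge, "no final answer after #{max_turns} turns"
end
```

An unknown tool name is returned to the model as an error string rather than raised, which matches the advice above about letting the model correct itself on the next turn.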

Observability: Make the Agent's Thinking Visible

Anthropic's guidance for building agents includes the principle to "prioritize transparency by explicitly showing the agent's planning steps." Of the principles it lists, transparency is the easiest to skip, and it is how you debug. An agent that fails silently is nearly impossible to diagnose; an agent that logs every tool call, every argument, and every result is straightforward to diagnose.

Log each tool invocation with the tool name, the arguments, the user on whose behalf it ran, and the result. In practice this log becomes three things at once: your debugging trace, your audit trail, and your cost-attribution record. Capture token usage from each response too, because that is how you understand and control spend. The model returns usage figures on every message; persist them against the conversation so you can see which agents and which users are expensive.
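As a rough sketch of the cost-attribution side, the snippet below sums persisted usage figures per user. UsageRecord and tokens_by_user are hypothetical names; in a Rails app the records would presumably be database rows and the sum a grouped query, but a Struct keeps the example self-contained. The input_tokens and output_tokens fields mirror the usage figures returned with each message.

```ruby
# Hypothetical record of one model turn. In a Rails app this would likely be
# a database row; a Struct keeps the sketch self-contained.
UsageRecord = Struct.new(:user_id, :conversation_id, :input_tokens, :output_tokens, keyword_init: true)

# Sums input and output tokens per user so the expensive users stand out.
def tokens_by_user(records)
  totals = Hash.new { |hash, key| hash[key] = { input_tokens: 0, output_tokens: 0 } }
  records.each do |record|
    totals[record.user_id][:input_tokens]  += record.input_tokens
    totals[record.user_id][:output_tokens] += record.output_tokens
  end
  totals
end

# Illustrative data: two turns for user 1, one turn for user 2.
records = [
  UsageRecord.new(user_id: 1, conversation_id: 10, input_tokens: 100, output_tokens: 20),
  UsageRecord.new(user_id: 1, conversation_id: 10, input_tokens: 150, output_tokens: 12),
  UsageRecord.new(user_id: 2, conversation_id: 11, input_tokens: 80,  output_tokens: 5)
]
totals = tokens_by_user(records)
```

Grouping by conversation_id instead of user_id gives the per-agent-run view with the same shape.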

A busy agent fleet writes a lot of these rows - one per tool call, plus a usage record per model turn - and they are exactly the append-heavy, time-ordered shape that strains a plain table once you start running aggregate queries over it. If the volume gets there, I would move the token-usage and tool-call tables to TimescaleDB, which is built for high-volume telemetry; the per-hour and per-day rollups you want for cost dashboards are what continuous aggregates are built for.

A simple wrapper around tool execution gives you this for free:

def execute_tool(tool, block)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = tool.call(block.input)
  AgentToolCall.create!(
    tool_name: block.name,
    arguments: block.input,
    user_id: Current.user&.id,
    duration_ms: ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000).round
  )
  result
rescue => e
  Rails.logger.error("Tool #{block.name} failed: #{e.message}")
  "Error: #{e.message}" # Hand a usable error back to the model.
end

Testing Agents

You can test an agent without ever calling the real API or spending a token. Two layers cover most of the risk: the tools on their own, and the loop with the API stubbed. The first is a plain Ruby test and the most valuable one to write, because the tool is where your data and your authorization live.

Tools are ordinary objects, so test them like any other. The test that earns its keep is the authorization one: prove a tool cannot return another tenant's rows, no matter what arguments the model invents. Because call just takes something that responds to the input fields, you can drive it with a Struct stand-in and skip the SDK entirely.
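A self-contained sketch of that authorization check follows. It is not the article's LookupInvoices tool: ScopedInvoiceLookup, the in-memory User and Invoice structs, and LookupInput are all hypothetical stand-ins, used so the idea runs without Rails or the SDK.

```ruby
# Hypothetical in-memory stand-ins; a real tool would query the database.
User    = Struct.new(:id, :account_id, keyword_init: true)
Invoice = Struct.new(:id, :account_id, :customer_id, :amount_cents, keyword_init: true)

# A tool whose lookup is always scoped to the calling user's account,
# whatever arguments the model supplies.
class ScopedInvoiceLookup
  def initialize(current_user:, invoices:)
    @current_user = current_user
    @invoices = invoices
  end

  def call(input)
    @invoices
      .select { |inv| inv.account_id == @current_user.account_id }
      .select { |inv| inv.customer_id == input.customer_id }
      .map { |inv| { id: inv.id, amount_cents: inv.amount_cents } }
  end
end

# The Struct stand-in for the model's input: it only needs to respond to the
# fields the tool reads.
LookupInput = Struct.new(:customer_id, keyword_init: true)

invoices = [
  Invoice.new(id: 1, account_id: 1, customer_id: 4471, amount_cents: 42_000),
  Invoice.new(id: 2, account_id: 2, customer_id: 9001, amount_cents: 99_000)
]
acme_tool = ScopedInvoiceLookup.new(current_user: User.new(id: 7, account_id: 1), invoices: invoices)

own_rows   = acme_tool.call(LookupInput.new(customer_id: 4471))
other_rows = acme_tool.call(LookupInput.new(customer_id: 9001)) # belongs to account 2
```

In a Minitest case the last two lines would feed an assert_equal on own_rows and an assert_empty on other_rows; the second assertion is the one that proves the tenant boundary holds.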

require "test_helper"

class LookupInvoicesTest [...]

[...]

  JSON_HEADERS = { "Content-Type" => "application/json" }.freeze

  test "dispatches the tool the model requests and feeds the result back" do
    stub_request(:post, "https://api.anthropic.com/v1/messages").to_return(
      { status: 200, headers: JSON_HEADERS, body: tool_use_turn.to_json },
      { status: 200, headers: JSON_HEADERS, body: final_turn.to_json }
    )

    tool = LookupInvoices.new(current_user: users(:acme_admin))

    # Record the dispatch without touching the database.
    dispatched = nil
    tool.define_singleton_method(:call) do |input|
      dispatched = input
      [{ id: 1, status: "open", amount_cents: 42_000 }]
    end

    run_agent(
      client: ANTHROPIC,
      tools: [tool],
      messages: [{ role: "user", content: "What does customer 4471 owe?" }]
    )

    # The tool ran with the arguments the model sent...
    assert_equal 4471, dispatched.customer_id

    # ...and the loop made two requests, exactly one of which carried the
    # tool_result. A block passed to assert_requested filters which requests
    # are counted, so the two checks are made separately.
    assert_requested :post, "https://api.anthropic.com/v1/messages", times: 2
    assert_requested :post, "https://api.anthropic.com/v1/messages", times: 1 do |req|
      JSON.parse(req.body)["messages"].any? do |msg|
        Array(msg["content"]).any? { |block| block["type"] == "tool_result" }
      end
    end
  end

  private

  def tool_use_turn
    {
      id: "msg_01",
      type: "message",
      role: "assistant",
      model: "claude-sonnet-4-6",
      stop_reason: "tool_use",
      content: [
        { type: "tool_use", id: "toolu_01", name: "lookup_invoices", input: { customer_id: 4471 } }
      ],
      usage: { input_tokens: 100, output_tokens: 20 }
    }
  end

  def final_turn
    {
      id: "msg_02",
      type: "message",
      role: "assistant",
      model: "claude-sonnet-4-6",
      stop_reason: "end_turn",
      content: [{ type: "text", text: "Customer 4471 owes $420.00." }],
      usage: { input_tokens: 150, output_tokens: 12 }
    }
  end
end

When you want fidelity closer to the real wire format, record a real exchange once with VCR and replay the cassette forever after. It is the better choice for asserting that your code handles a genuine multi-tool turn, because hand-writing those response bodies gets tedious and drifts from reality. Whichever you use,
