
Agent Observability with LangSmith, Langfuse, and Arize: A Hands-On Comparison


Source: Analytics Vidhya. Author: Riya Bansal


Riya Bansal. Last updated: 03 Jun, 2026


Your AI agent works great in testing. Then you ship it, and something breaks. A tool call loops forever. A retrieval step returns garbage, and costs spike. You have no idea why.

That's the agent observability problem. If you're building with LLMs, you need to solve it before production, not after. This post breaks down three of the most widely used observability tools: LangSmith, Langfuse, and Arize. We'll set each one up, trace the same agent, and compare what you actually get.

Table of contents

What is Agent Observability?

Setting Up the Test Agent

LangSmith: Native LangChain Tracing

Langfuse: Open Source and Framework-Agnostic

Arize: Production-Grade ML Observability

Which Should You Pick for Agent Observability?

Conclusion

What is Agent Observability?

Traditional application monitoring tracks requests, errors, and latency, but that is not enough for agents.

An agent may call multiple tools in sequence, and each LLM step has its own prompt, token usage, latency, and potential failure point. A single failed retrieval or tool call can lead to an incorrect final response.

Agent observability captures the full execution graph: every step, decision, LLM input and output, tool call, arguments, results, token usage, latency, and evaluation score. Without this visibility, debugging agent behavior becomes guesswork.
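To make the idea of an execution graph concrete, here is a minimal sketch in plain Python. It is not how LangSmith, Langfuse, or Arize store traces. The `Span` and `Tracer` names, their fields, and the token counts are all assumptions made up for illustration.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """One step in an agent run (hypothetical structure, not a vendor schema)."""
    name: str
    kind: str  # for example "agent", "llm", or "tool"
    inputs: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)
    tokens: int = 0
    latency_s: float = 0.0
    children: List["Span"] = field(default_factory=list)


class Tracer:
    """Builds a tree of spans by keeping a stack of the spans currently open."""

    def __init__(self) -> None:
        self.root: Optional[Span] = None
        self._stack: List[Span] = []

    @contextmanager
    def span(self, name: str, kind: str, **inputs):
        current = Span(name=name, kind=kind, inputs=inputs)
        if self._stack:
            # Nest under whichever span is open right now.
            self._stack[-1].children.append(current)
        else:
            self.root = current
        self._stack.append(current)
        start = time.perf_counter()
        try:
            yield current
        finally:
            current.latency_s = time.perf_counter() - start
            self._stack.pop()


def total_tokens(span: Span) -> int:
    """Sum token usage over a span and all of its descendants."""
    return span.tokens + sum(total_tokens(child) for child in span.children)


tracer = Tracer()
with tracer.span("support_agent", "agent", question="Where is ORD-002?"):
    with tracer.span("plan", "llm", prompt="Where is ORD-002?") as llm_span:
        llm_span.tokens = 120  # made-up number
        llm_span.outputs = {"tool_call": "get_order_status"}
    with tracer.span("get_order_status", "tool", order_id="ORD-002") as tool_span:
        tool_span.outputs = {"status": "Processing"}
    with tracer.span("answer", "llm") as final_span:
        final_span.tokens = 80  # made-up number

print([child.name for child in tracer.root.children])
# ['plan', 'get_order_status', 'answer']
print(total_tokens(tracer.root))
# 200
```

The real tools add much more (IDs, timestamps, cost, scores), but the core shape is the same: a tree of steps you can walk to find which one was slow, expensive, or wrong.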

Setting Up the Test Agent

We will use a simple LangChain agent as the common test case. The agent receives a question from the user, retrieves relevant context, and calls one or more tools to produce an answer.

First, install the libraries the code below imports: langchain, langchain-openai, and python-dotenv.

The base agent exposes two tools, search_docs and get_order_status. Save it as agent_base.py, because the demo scripts import build_agent and TEST_QUESTIONS from that module. It is the shared baseline for all three observability tools.

    """
    Base agent used across all three observability demos.

    Swap the OPENAI_API_KEY env var or call build_agent() from any demo file.
    """

    import os

    from dotenv import load_dotenv
    from langchain.agents import AgentExecutor, create_openai_tools_agent
    from langchain.tools import tool
    from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
    from langchain_openai import ChatOpenAI

    load_dotenv()


    @tool
    def search_docs(query: str) -> str:
        """Search internal docs for relevant information."""
        # Simulated retrieval: swap with your actual vector store
        docs = {
            "refund": (
                "Refunds are processed within 5-7 business days. "
                "Items must be returned within 30 days."
            ),
            "shipping": (
                "Standard shipping takes 3-5 business days. "
                "Express is 1-2 days."
            ),
            "account": (
                "You can reset your password via the login page. "
                "Contact support for account issues."
            ),
        }

        for keyword, content in docs.items():
            if keyword in query.lower():
                return content

        return f"Found general docs related to: {query}"


    @tool
    def get_order_status(order_id: str) -> str:
        """Look up the status of an order by ID."""
        # Simulated order lookup
        statuses = {
            "ORD-001": "Shipped - expected delivery 2026-05-30",
            "ORD-002": "Processing - not yet shipped",
            "ORD-003": "Delivered on 2026-05-25",
        }

        return statuses.get(
            order_id,
            f"Order {order_id} not found in the system.",
        )


    def build_agent() -> AgentExecutor:
        llm = ChatOpenAI(
            model="gpt-4o",
            temperature=0,
            api_key=os.environ["OPENAI_API_KEY"],
        )

        tools = [search_docs, get_order_status]

        prompt = ChatPromptTemplate.from_messages(
            [
                (
                    "system",
                    "You are a helpful customer support assistant. "
                    "Use tools when needed.",
                ),
                ("user", "{input}"),
                MessagesPlaceholder(variable_name="agent_scratchpad"),
            ]
        )

        agent = create_openai_tools_agent(llm, tools, prompt)

        return AgentExecutor(
            agent=agent,
            tools=tools,
            verbose=False,
        )


    TEST_QUESTIONS = [
        "What are the refund policies?",
        "What is the status of order ORD-002?",
        "How long does shipping take?",
    ]


    if __name__ == "__main__":
        executor = build_agent()

        for question in TEST_QUESTIONS:
            print(f"\nQ: {question}")
            result = executor.invoke({"input": question})
            print(f"A: {result['output']}")

This gives us a baseline agent that each of the three observability platforms can trace. We start with LangSmith.
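Before wiring in any tracing, it can help to know what the simulated tools should return, so that odd values in a trace stand out. The sketch below copies the lookup logic into plain functions with trimmed data. The `_plain` names are hypothetical, and there is no LangChain, `@tool` decorator, or API key involved.

```python
# Trimmed copies of the simulated data from the base agent.
DOCS = {
    "refund": (
        "Refunds are processed within 5-7 business days. "
        "Items must be returned within 30 days."
    ),
    "shipping": (
        "Standard shipping takes 3-5 business days. "
        "Express is 1-2 days."
    ),
}
ORDER_STATUSES = {"ORD-003": "Delivered on 2026-05-25"}


def search_docs_plain(query: str) -> str:
    """Same keyword match as search_docs, without the @tool wrapper."""
    for keyword, content in DOCS.items():
        if keyword in query.lower():
            return content
    return f"Found general docs related to: {query}"


def get_order_status_plain(order_id: str) -> str:
    """Same dictionary lookup as get_order_status, without the @tool wrapper."""
    return ORDER_STATUSES.get(order_id, f"Order {order_id} not found in the system.")


print(search_docs_plain("How long does shipping take?"))
print(search_docs_plain("Do you sell gift cards?"))  # no keyword, falls through
print(get_order_status_plain("ORD-003"))
print(get_order_status_plain("ord-003"))  # lookup is case-sensitive, so this misses
```

The last two calls show the kind of detail a trace makes visible: if the model passes a lowercased order ID as the tool argument, the tool reports "not found" even though the order exists.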

LangSmith: Native LangChain Tracing

LangSmith is built by the LangChain team, so if you already use LangChain, integration is quick.

    """
    LangSmith observability demo.

    Setup:
        pip install langsmith
        Set LANGCHAIN_API_KEY in your .env file.

    How it works:
        LangSmith hooks into LangChain's callback system via env vars,
        so no code changes are needed beyond the two os.environ lines below.
    """

    import os

    from dotenv import load_dotenv

    from agent_base import TEST_QUESTIONS, build_agent

    load_dotenv()

    # Enable LangSmith tracing. These two vars are all you need.
    os.environ["LANGCHAIN_TRACING_V2"] = "true"
    os.environ["LANGCHAIN_PROJECT"] = "agent-observability-demo"

    # LANGCHAIN_API_KEY must be set in your .env or environment.


    def run_with_metadata(
        executor,
        question: str,
        user_id: str = "demo-user",
    ):
        """Run the agent and attach per-run metadata via config."""
        return executor.invoke(
            {"input": question},
            config={
                "metadata": {
                    "user_id": user_id,
                    "source": "langsmith_demo",
                },
                # Optional: tag runs for filtering in the dashboard.
                "tags": ["observability-blog", "demo"],
            },
        )


    def main():
        print("=== LangSmith Demo ===")
        print("Traces will appear at: https://smith.langchain.com")
        print(f"Project: {os.environ['LANGCHAIN_PROJECT']}\n")

        executor = build_agent()

        for question in TEST_QUESTIONS:
            print(f"Q: {question}")
            result = run_with_metadata(executor, question)
            print(f"A: {result['output']}\n")

        print("Done. Open LangSmith to inspect the full trace tree for each run.")


    if __name__ == "__main__":
        main()

LangSmith hooks into LangChain's callback system automatically. You do not need decorators or wrappers: once the environment variables are set, each run appears in your project dashboard.
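If the env-var approach feels like magic, the general pattern is easy to sketch in plain Python. This is not LangChain's or LangSmith's actual implementation. The `DEMO_TRACING` variable, the `Recorder` class, and the `run_tool` helper are hypothetical stand-ins that only show how a framework can switch tracing on without any change to the calling code.

```python
import os
import time
from typing import Callable, List


class Recorder:
    """Hypothetical handler that keeps events in memory instead of sending them anywhere."""

    def __init__(self) -> None:
        self.events: List[dict] = []

    def on_tool_start(self, name: str, args: dict) -> None:
        self.events.append({"event": "tool_start", "tool": name, "args": args})

    def on_tool_end(self, name: str, output: str, latency_s: float) -> None:
        self.events.append(
            {"event": "tool_end", "tool": name, "output": output, "latency_s": latency_s}
        )


RECORDER = Recorder()


def active_handlers() -> List[Recorder]:
    """Return the global handler only when the made-up DEMO_TRACING var is 'true'."""
    enabled = os.environ.get("DEMO_TRACING", "").lower() == "true"
    return [RECORDER] if enabled else []


def run_tool(name: str, func: Callable[..., str], **args) -> str:
    """Framework-side wrapper: notifies handlers before and after the tool runs."""
    handlers = active_handlers()
    for handler in handlers:
        handler.on_tool_start(name, args)
    start = time.perf_counter()
    output = func(**args)
    elapsed = time.perf_counter() - start
    for handler in handlers:
        handler.on_tool_end(name, output, elapsed)
    return output


def get_order_status(order_id: str) -> str:
    return "Processing" if order_id == "ORD-002" else "not found"


os.environ["DEMO_TRACING"] = "false"
run_tool("get_order_status", get_order_status, order_id="ORD-002")
print(len(RECORDER.events))
# 0

os.environ["DEMO_TRACING"] = "true"
run_tool("get_order_status", get_order_status, order_id="ORD-002")
print([event["event"] for event in RECORDER.events])
# ['tool_start', 'tool_end']
```

The call site is identical in both cases. Only the environment changes, which is why the LangSmith demo needs nothing beyond its two `os.environ` lines.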

What you’ll see on the dashboard:

LangSmith’s trace view shows the full agent execution tree, from the initial call to tool use, LLM responses, and final output. Each node includes inputs, outputs, and latency.

You can tag runs, add metadata, filter by outcome, save runs as datasets, and run evaluations. This is useful when improving prompts or retrieval logic.

The prompt playground is another strong feature. You can open any trace, edit the prompt inline, and rerun it to debug poor LLM performance.

LangSmith’s limitations appear at scale. The free tier has caps, and integration takes more effort if you are not using LangChain, though OpenTelemetry is supported.

Langfuse: Open Source and Framework-Agnostic

Langfuse is the open-source alternative here. You can host it on your own server or use the Langfuse cloud service. It integrates with many frameworks, including LangChain, LlamaIndex, and the raw OpenAI API.

The docstring at the top of the demo lists the dependencies and setup steps.

    """
    Langfuse observability demo.

    Setup:
        pip install langfuse
        Set LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY in your .env file.
        LANGFUSE_HOST defaults to https://cloud.langfuse.com; override for self-hosted.

    Key differences from LangSmith:
      • Callback handler is passed per-invoke for more explicit control.
      • Native session grouping for multi-turn conversations.
      • You can score any trace after the fact via the Langfuse client.
    """

    import os

    from dotenv import load_dotenv
    from langfuse import Langfuse
    # Note: langfuse.callback is the v2-style SDK import path. Newer SDK releases
    # moved the LangChain handler, so check the docs for your installed version.
    from langfuse.callback import CallbackHandler

    from agent_base import TEST_QUESTIONS, build_agent

    load_dotenv()


    def build_handler(
        session_id: str,
        user_id: str = "demo-user",
    ) -> CallbackHandler:
        return CallbackHandler(
            public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
            secret_key=os.environ["LANGFUSE_SECRET_KEY"],
            host=os.getenv("LANGFUSE_HOST", "https://cloud.langfuse.com"),
            session_id=session_id,
            user_id=user_id,
            metadata={"source": "langfuse_demo"},
            tags=["observability-blog", "demo"],
        )


    def score_trace(
        trace_id: str,
        score: float,
        comment: str = "",
    ):
        """Add a correctness score to a trace after reviewing the output."""
        lf = Langfuse(
            public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
            secret_key=os.environ["LANGFUSE_SECRET_KEY"],
            host=os.getenv("LANGFUSE_HOST", "https://cloud.langfuse.com"),
        )

        lf.score(
            trace_id=trace_id,
            name="correctness",
            value=score,
            comment=comment,
        )

        lf.flush()

        print(f"Scored trace {trace_id}: {score}")


    def run_single_session(
        executor,
        session_id: str,
    ):
        """Run all test questions in a single session so they're linked in the UI."""
        handler = build_handler(session_id=session_id)
        trace_ids = []

        for question in TEST_QUESTIONS:
            print(f"Q: {question}")

            result = executor.invoke(
                {"input": question},
                config={"callbacks": [handler]},
            )

            print(f"A: {result['output']}\n")

            # handler.get_trace_id() returns the trace ID for the last run.
            trace_ids.append(handler.get_trace_id())

        # Flush ensures traces are sent before the process exits.
        # This is critical in batch jobs.
        handler.flush()

        return trace_ids


    def main():
        print("=== Langfuse Demo ===")
        print(f"Dashboard: {os.getenv('LANGFUSE_HOST', 'https://cloud.langfuse.com')}\n")

        executor = build_agent()
        session_id = "demo-session-001"

        trace_ids = run_single_session(executor, session_id)

        # Example: programmatically score the first trace.
        if trace_ids and trace_ids[0]:
            print("\nScoring first trace as an example:")
            score_trace(trace_ids[0], score=0.9, comment="Answer was accurate")

        print(f"\nDone. Find all runs under session '{session_id}' in your Langfuse dashboard.")


    if __name__ == "__main__":
        main()

Here you pass the callback handler on every run. That is a little more explicit than LangSmith's approach, but it gives you more flexibility: each handler can carry its own user ID, session ID, and custom metadata.

Evaluation Workflow

Langfuse also has a solid evaluation workflow: you can attach scores to a trace after it has completed.

    from langfuse import Langfuse

    lf = Langfuse()

    # Score a specific trace by ID.
    lf.score(
        trace_id="trace-abc123",
        name="correctness",
        value=0.9,
        comment="Answer was accurate but slightly verbose",
    )

This pairs well with human review: as your team scores responses, the scores accumulate into aggregated evaluation metrics you can track over time.
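As a rough picture of what aggregating scores over time means, the sketch below averages a named score per day. It does not call Langfuse. The records are invented, the `day` field is an assumption added for illustration, and the other field names simply mirror the `lf.score(...)` arguments used in this article.

```python
from collections import defaultdict
from statistics import mean

# Invented score records, as if exported after several review sessions.
score_records = [
    {"trace_id": "trace-a", "name": "correctness", "value": 1.0, "day": "2026-05-28"},
    {"trace_id": "trace-b", "name": "correctness", "value": 0.5, "day": "2026-05-28"},
    {"trace_id": "trace-b", "name": "helpfulness", "value": 0.25, "day": "2026-05-28"},
    {"trace_id": "trace-c", "name": "correctness", "value": 1.0, "day": "2026-05-29"},
    {"trace_id": "trace-d", "name": "correctness", "value": 1.0, "day": "2026-05-29"},
]


def mean_score_by_day(records, name):
    """Average one named score per day, returned in chronological order."""
    by_day = defaultdict(list)
    for record in records:
        if record["name"] == name:
            by_day[record["day"]].append(record["value"])
    return {day: round(mean(values), 3) for day, values in sorted(by_day.items())}


print(mean_score_by_day(score_records, "correctness"))
# {'2026-05-28': 0.75, '2026-05-29': 1.0}
```

A trend like this is what tells you whether a prompt or retrieval change actually improved answers, rather than relying on a handful of spot checks.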

Langfuse also groups traces into sessions. All traces that share a session ID are linked in the UI, so you can follow an entire multi-turn conversation in one place.
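Conceptually, session grouping is a group-by on the session ID followed by a sort on time. The sketch below shows that idea on invented trace records. The field names (`session_id`, `started_at`, `input`) are assumptions for illustration and not the Langfuse data model.

```python
from collections import defaultdict

# Invented trace records; started_at is a simple counter standing in for a timestamp.
trace_records = [
    {"trace_id": "t3", "session_id": "demo-session-001", "started_at": 3,
     "input": "How long does shipping take?"},
    {"trace_id": "t1", "session_id": "demo-session-001", "started_at": 1,
     "input": "What are the refund policies?"},
    {"trace_id": "t2", "session_id": "demo-session-002", "started_at": 2,
     "input": "What is the status of order ORD-002?"},
]


def conversations_by_session(records):
    """Group trace inputs by session, ordered by start time within each session."""
    sessions = defaultdict(list)
    for record in sorted(records, key=lambda r: r["started_at"]):
        sessions[record["session_id"]].append(record["input"])
    return dict(sessions)


for session_id, turns in conversations_by_session(trace_records).items():
    print(session_id, turns)
```

In the Langfuse demo above, every question runs under `demo-session-001`, so all three traces would land in the same group.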

Arize: Production-Grade ML Observability

Arize started as a platform for monitoring conventional machine learning models and now covers LLMs and agents as well. Its original focus, helping teams run models in production at scale, still shapes the product.

Utilizing OpenInference

In addition to using the OpenInference standard as its measurement scheme, Arize integ

[truncated for AI cost control]