
Harbor x LangChain: A Unified Stack for Evaluating Agents

Evaluating long-running, stateful agents requires a new eval runner. Harbor integrates with LangChain's Deep Agents, LangSmith sandboxes, and observability to provide scalable, isolated evaluations with explainable traces.


June 30, 2026


Key Takeaways

One small entry point connects your agent to Harbor. A langgraph.json registry plus a make_graph factory is the only glue you write, and that factory can stay model-agnostic by reading the model Harbor passes from the command line.

Cloud sandboxes let you scale evals horizontally and run agents in isolation. Each trial gets a fresh LangSmith sandbox, so trials never share state, and you can run hundreds in parallel instead of churning through them serially on one machine.

Traces turn scores into explanations. With the langsmith plugin, every job lands as a dataset and experiment with the verifier's reward as feedback, and agent traces attach directly so you can see why a trial passed or failed, not just whether it did.

As agents grow more capable, evaluating them has gotten harder. Agent harnesses like Claude Code, Pi, and Deep Agents now give agents access to entire computers to read files, execute scripts, run code, and more. As a result, every agent needs to run in its own clean, reproducible environment for a given task.

Evaluating long-running, stateful agents requires a new eval runner. Harbor has emerged as the industry leader in this space. In this post, we first explain why everyone running agent evals should know what Harbor is, and then show how to integrate Deep Agents, LangSmith Sandboxes, and LangSmith Experiments into Harbor.

We ultimately need to run agents in a real, reproducible, isolated environment, many times in parallel, with a deterministic check at the end. Harbor solves this problem and is now wired directly into Deep Agents, LangSmith Sandboxes, and LangSmith Observability.

How Harbor works

Harbor is an eval harness. You bring three things:

Your agent

Your dataset

Your sandbox

Each dataset has tasks, which consist of:

An Environment (Dockerfile / Docker Compose YAML)

An Instruction (Markdown)

An Evaluation script (test.sh)
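The three parts of a task can be modeled as a small data structure. The sketch below is illustrative only: the file names (environment/Dockerfile, environment/docker-compose.yaml, instruction.md, tests/test.sh) and the load_task helper are assumptions made for this example, not Harbor's documented layout or API.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Task:
    """One eval task: where it runs, what to do, and how it is judged."""

    environment: Path  # Dockerfile or Docker Compose YAML
    instruction: Path  # Markdown instruction given to the agent
    evaluation: Path   # test.sh script that judges the result


# Hypothetical file names, chosen for this sketch only.
ENVIRONMENT_CANDIDATES = (
    "environment/Dockerfile",
    "environment/docker-compose.yaml",
)


def load_task(root: Path) -> Task:
    """Check that a task directory has all three parts and return them."""
    environment = next(
        (root / rel for rel in ENVIRONMENT_CANDIDATES if (root / rel).is_file()),
        None,
    )
    instruction = root / "instruction.md"
    evaluation = root / "tests" / "test.sh"

    missing = []
    if environment is None:
        missing.append("environment (Dockerfile or Compose file)")
    if not instruction.is_file():
        missing.append("instruction.md")
    if not evaluation.is_file():
        missing.append("tests/test.sh")
    if missing:
        raise FileNotFoundError(f"{root}: missing {', '.join(missing)}")

    return Task(environment, instruction, evaluation)
```

Calling load_task(Path("my-task")) on a directory that lacks any of the three parts raises FileNotFoundError and names what is missing, which is a cheap way to catch a malformed dataset before any sandbox starts.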

Compared to simpler LLM evaluation, there are two main differences:

The environment the agent runs in matters so much that it is called out as part of the task. Simpler LLM evals don't need an environment; they just call the LLM. Agents do.

Judging the agent is done with a script. The agent often produces other files or modifies state in some way, so it's not enough to look at its final response. You need to look at the artifacts it creates along the way.
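To make the artifact check concrete, here is a minimal sketch of a deterministic verifier. In Harbor the check is the task's test.sh; this Python version only illustrates the idea. The task itself (the agent must leave a report.txt whose first line is "status: ok") and the verify function are hypothetical and exist only for this example.

```python
from pathlib import Path


def verify(workdir: Path) -> float:
    """Return a reward of 1.0 if the expected artifact exists, else 0.0.

    The check inspects what the agent left on disk, not what it said
    in its final response.
    """
    report = workdir / "report.txt"  # hypothetical artifact for this sketch
    if not report.is_file():
        return 0.0

    lines = report.read_text().splitlines()
    if lines and lines[0].strip() == "status: ok":
        return 1.0
    return 0.0
```

Because the check depends only on files in the working directory, running it twice against the same sandbox state gives the same reward, which is what makes scores comparable across trials.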

LangChain plugs into Harbor in three places. We integrate with Deep Agents so any deep agent you build can run inside Harbor's sandboxed environment. We integrate with LangSmith Sandboxes so Harbor can run each task in a LangSmith sandbox, giving each run its own clean machine. And we integrate with LangSmith Observability, the evaluation platform where you view results in detail: every job lands as a dataset and experiment with agent traces attached when the agent supports them.

Unifying LangChain agents with Harbor

You plug a custom agent into Harbor through its built-in langgraph agent, selected with --agent langgraph. It runs any LangGraph application, including Deep Agents.

Harbor treats langgraph.json as a registry. It lists the dependencies your agent needs and maps a graph name to the function that builds it:

{ "dependencies": [ "deepagents>=0.6.10,=1.3.1,"
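As a rough illustration of the registry idea, the sketch below looks up a graph name and splits its entry into a file path and a factory name. It assumes the usual LangGraph convention of a "graphs" mapping whose values look like "path:attribute". The example registry, the ./agent.py path, and the resolve_graph_spec helper are hypothetical, and this is not a description of how Harbor resolves the entry internally.

```python
import json

# Hypothetical registry for illustration; a real langgraph.json will differ.
EXAMPLE_REGISTRY = """
{
  "dependencies": ["deepagents"],
  "graphs": {
    "deep_agent": "./agent.py:make_graph"
  }
}
"""


def resolve_graph_spec(registry_text: str, graph_name: str) -> tuple[str, str]:
    """Map a graph name to (file path, factory name) from a registry."""
    registry = json.loads(registry_text)
    graphs = registry.get("graphs", {})
    if graph_name not in graphs:
        raise KeyError(f"graph {graph_name!r} is not listed in the registry")

    # Split on the last colon so paths that contain colons still work.
    path, sep, attribute = graphs[graph_name].rpartition(":")
    if not sep or not path or not attribute:
        raise ValueError(f"expected 'path:attribute', got {graphs[graph_name]!r}")
    return path, attribute
```

With the example registry, resolve_graph_spec(EXAMPLE_REGISTRY, "deep_agent") points at the make_graph function in ./agent.py, which matches the graph=deep_agent argument passed on the command line.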

export LANGSMITH_PROFILE=prod
export LANGSMITH_TRACING=true
export LANGSMITH_PROJECT=harbor-deepagents
export FIREWORKS_API_KEY=""

# --agent, --model, and --ak configure the agent; -d selects the dataset of tasks;
# -e selects the cloud environment.
harbor run \
  --agent langgraph \
  --model fireworks:accounts/fireworks/models/glm-5p2 \
  --ak project_path=./deep-agent --ak graph=deep_agent \
  -d [email protected] \
  -e langsmith \
  --plugin langsmith

Read the Harbor integrations docs to get started. For more on running evals in Harbor, see Run evals.
