Test Run Comparisons
Compare LLM test runs side-by-side with LangSmith's Test Run Comparisons. Manually inspect data, filter results, and gain insights faster.
The LangChain Team
October 17, 2023
4 min read
One pattern I noticed is that great AI researchers are willing to manually inspect lots of data. And more than that, they build infrastructure that allows them to manually inspect data quickly. Though not glamorous, manually examining data gives valuable intuitions about the problem.
- Jason Wei, OpenAI
Evaluations continue to be one of the hardest parts of building LLM applications. It's really tough to quantitatively evaluate the effect of changes to your prompt, chain, or agent. We're bullish on LLM-assisted evaluation, but at the same time we recognize that it's hard to trust it completely.
Jason's tweet above sums up what we see a lot of the best researchers (and engineers) doing. They want to manually inspect data to gain intuition about the problem. At LangChain, we want to build the infrastructure to help do that - which is why we're excited to announce Test Run Comparisons today.
In the initial release of LangSmith we had support for running tests, including scoring them with LLM-assisted feedback. However, each test was run in isolation. We quickly saw two usage patterns emerge:
- People were still hesitant to trust the LLM-assisted feedback directly.
- Users often wanted not only to score their test run in isolation, but also to compare it to previous iterations.
When building Test Run Comparisons, we kept both of these insights in mind. We wanted to create an easy UX to see multiple test runs side-by-side. We also wanted to create an easy UX where people could use LLM-assisted evals (or regex/other eval) to get an initial score, then manually explore those datapoints for further insights.
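As a rough illustration of the "initial score, then manual exploration" workflow, here is a minimal sketch in plain Python. It does not use the LangSmith SDK; the outputs, the `HAS_NUMBER` pattern, and the `regex_score` helper are all hypothetical stand-ins for whatever evaluator you actually run.

```python
import re

# Hypothetical test-run outputs, keyed by example id.
outputs = {
    "ex-1": "The answer is 42.",
    "ex-2": "I am not sure.",
    "ex-3": "Answer: 7",
}

# Hypothetical heuristic: an output "passes" if it contains a number.
HAS_NUMBER = re.compile(r"\d+")


def regex_score(text: str) -> int:
    """Return 1 if the output matches the pattern, else 0."""
    return 1 if HAS_NUMBER.search(text) else 0


# Step 1: get an initial automated score for every datapoint.
scores = {example_id: regex_score(text) for example_id, text in outputs.items()}

# Step 2: datapoints that failed the heuristic are a natural place
# to start manual inspection.
needs_review = sorted(
    example_id for example_id, score in scores.items() if score == 0
)

print(scores)
print(needs_review)
```

The automated score is only a first pass here: it narrows down which datapoints a person should read, rather than replacing that reading.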
So how does it work?
First, you need to set up a dataset and run some tests. See the LangSmith documentation for instructions on how to do that. Nothing new here, so if you've already done that for an existing project, you're all good.
Inside a dataset, you can easily select two (or more) test runs, then click Compare.
From there, you will be brought into the Test Run Comparison view.
You can easily see the inputs, the reference output, and then the actual output for each datapoint - along with any eval metrics, time and latency for that run.
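To make the shape of that view concrete, here is a small sketch of how two test runs over the same dataset could be lined up row by row. This is plain Python, not the LangSmith API; the `Result` class, the example data, and the `side_by_side` helper are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Result:
    output: str
    correct: bool      # hypothetical eval metric: a simple pass/fail score
    latency_s: float   # hypothetical latency in seconds


# Hypothetical dataset: each example has an input and a reference output.
examples = {
    "ex-1": {"input": "2 + 2", "reference": "4"},
    "ex-2": {"input": "capital of France", "reference": "Paris"},
}

# Hypothetical results from two test runs over the same examples.
run_a = {"ex-1": Result("4", True, 0.8), "ex-2": Result("Lyon", False, 1.1)}
run_b = {"ex-1": Result("4", True, 0.6), "ex-2": Result("Paris", True, 0.9)}


def side_by_side(examples, runs):
    """Build one row per example, with one column per test run."""
    rows = []
    for example_id, example in examples.items():
        row = {
            "id": example_id,
            "input": example["input"],
            "reference": example["reference"],
        }
        for run_name, results in runs.items():
            # None if a run has no result for this example.
            row[run_name] = results.get(example_id)
        rows.append(row)
    return rows


rows = side_by_side(examples, {"run_a": run_a, "run_b": run_b})
for row in rows:
    print(
        row["input"], "|", row["reference"], "|",
        row["run_a"].output, "|", row["run_b"].output,
    )
```

Keying every run by the same example id is what makes the comparison meaningful: each row shows different outputs for an identical input.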
This view is designed to make it easy to quickly compare test runs across the same inputs. If you want a deeper look at a particular datapoint, you can click on that row and a sidebar will pop up, allowing you to drill down into the details of those runs.
On that sidebar, we've also added up and down carets (▲ and ▼) to easily flip between runs.
This view should hopefully make it easy to compare runs for a particular datapoint. But how do you know which datapoints to look at?
We've added filters for each column - similar to Excel. Using these filters, you can filter the rows according to any criteria.
💡 The criteria we recommend using to start? Filter one test run to datapoints it got correct, and the other one to datapoints that it got incorrect. This allows you to quickly drill into places of significant difference between the two test runs, which should make it easier to discover what has changed.
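The recommended filter can be expressed as a simple set operation. The sketch below is plain Python rather than anything from LangSmith itself; the correctness dictionaries and the `disagreements` helper are hypothetical.

```python
# Hypothetical per-datapoint correctness for two test runs, keyed by example id.
run_a_correct = {"ex-1": True, "ex-2": False, "ex-3": True, "ex-4": False}
run_b_correct = {"ex-1": True, "ex-2": True, "ex-3": False, "ex-4": False}


def disagreements(first, second):
    """Return ids where `first` was correct and `second` was incorrect."""
    shared_ids = first.keys() & second.keys()
    return sorted(i for i in shared_ids if first[i] and not second[i])


# Datapoints run A got right but run B got wrong.
regressions = disagreements(run_a_correct, run_b_correct)

# Datapoints run B got right but run A got wrong.
improvements = disagreements(run_b_correct, run_a_correct)

print(regressions)
print(improvements)
```

Datapoints where both runs agree (both correct or both incorrect) drop out, leaving only the rows where the change between runs actually altered the outcome.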
Building an LLM application is hard. A big part of that is understanding how the LLM is working on a particular task. Setting up an evaluation dataset and then being able to easily compare runs on that dataset is crucial for developing the understanding needed to improve the application. Test Run Comparisons in LangSmith is aimed at solving this problem. Please let us know any feedback you have!
LangSmith is in private beta - sign up here. We'll be rolling out more access over the next few weeks, as well as continuing to add features like this.