New in LangSmith: a Fleet on-call copilot for alert triage, computer use for agents, voice trace debugging, and experiment status tracking. Plus Deep Agents Rubrics, programmatic subagents, a new LangSmith Deployment course, and upcoming events in Chicago, Berlin, DC, and Vegas.
Fleet On-Call Copilot: a prebuilt agent template that triages alerts and drafts updates using code, traces, and runbooks.
Computer Use: agents can now operate an isolated virtual computer for code, files, and authenticated API calls.
On the Max Agency Podcast, Zack Reneau-Wedeen discusses the future of AI agents, advocating for simple architectures, outcome-based pricing, and avoiding 'org chart shipping.' He shares insights from building customer-facing agents at Sierra.
Simple agent architectures outperform complex multi-agent systems.
Outcome-based pricing aligns incentives for high-value tasks.
Klarna's AI assistant, built on LangGraph and LangSmith, handles the work of 700 full-time staff, reducing customer query resolution time by 80% and automating 70% of repetitive support tasks.
Klarna's AI assistant handles over 2.5 million conversations, performing the work of 700 full-time employees.
The assistant reduced average customer query resolution time by 80% and automated ~70% of repetitive tasks.
The EU AI Act compliance deadline is August 2, 2026. This article explains what the Act requires for high-risk AI systems and how LangSmith and LangChain OSS help meet each requirement through full observability, automated evaluations, human oversight, and more.
The EU AI Act requires risk management, automatic logging, transparency, human oversight, and post-market monitoring for high-risk AI systems.
LangSmith provides end-to-end tracing capturing every agent input, reasoning step, tool call, and output.
A practical guide to adding memory to AI agents, covering short-term and long-term memory concepts, trace analysis, and how LangSmith's tools enable a complete memory loop for agent improvement across runs.
Memory enables agents to remember user preferences and corrections, reducing repeated instruction.
Short-term memory handles current tasks; long-term memory persists facts, preferences, and skills.
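The short-term versus long-term split can be illustrated with a few lines of plain Python. This is only a sketch under stated assumptions: `AgentMemory`, its methods, and the JSON file are hypothetical names invented for illustration, not LangSmith or LangChain APIs.

```python
import json
import tempfile
from pathlib import Path


class AgentMemory:
    """Hypothetical holder: short-term memory lives in RAM, long-term on disk."""

    def __init__(self, store_path: Path):
        self.store_path = store_path
        self.short_term: list[str] = []  # messages for the current task only
        # Long-term facts survive across runs because they are reloaded from disk.
        self.long_term: dict[str, str] = (
            json.loads(store_path.read_text()) if store_path.exists() else {}
        )

    def observe(self, message: str) -> None:
        self.short_term.append(message)

    def remember(self, key: str, value: str) -> None:
        self.long_term[key] = value
        self.store_path.write_text(json.dumps(self.long_term))

    def end_run(self) -> None:
        self.short_term.clear()


store = Path(tempfile.mkdtemp()) / "memory.json"
run1 = AgentMemory(store)
run1.observe("User: please answer in metric units")
run1.remember("units", "metric")  # a preference worth keeping
run1.end_run()

run2 = AgentMemory(store)  # a later run starts with empty short-term memory
```

In this sketch the second run no longer sees the earlier conversation, but it still knows the stored preference, so the user does not have to repeat the instruction.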
LangSmith launches a no-code agent builder that enables non-technical users to create AI agents with memory, guided prompts, and MCP tools. The builder uses conversational guidance, built-in memory, and sub-agents to lower the barrier for agent development, suitable for internal productivity use cases.
LangSmith Agent Builder offers a no-code experience with memory and guided prompt creation.
Agents consist of four core components: prompt, tools, triggers, and sub-agents.
Factory AI leveraged LangSmith's observability and feedback API to close the product feedback loop, achieving a 2x improvement in iteration speed and significant reductions in development cycle time.
Factory integrated LangSmith with AWS CloudWatch for enhanced observability and debugging.
Open SWE is an open-source, cloud-hosted coding agent that autonomously handles GitHub tasks—planning, coding, testing, and opening PRs. It features a multi-agent architecture, human-in-the-loop control, and asynchronous execution.
Open SWE is an open-source, async, cloud-hosted coding agent that integrates directly with GitHub.
It uses a multi-agent architecture (Planner, Programmer, Reviewer) to ensure code quality.
Monte Carlo built an AI Troubleshooting Agent on LangGraph and debugged with LangSmith to help data teams resolve issues faster by exploring multiple investigation paths in parallel.
Monte Carlo used LangGraph to create a dynamic graph for automated, parallel troubleshooting.
LangSmith enabled visualization and rapid iteration of prompts from day one.
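Exploring several investigation paths in parallel can be sketched with the standard library alone. The example below uses `asyncio.gather` rather than LangGraph, and the path names and findings are made-up placeholders, not details from Monte Carlo's agent.

```python
import asyncio


async def check_path(name: str, delay: float, finding: str) -> tuple[str, str]:
    # The sleep stands in for a warehouse query or an LLM call.
    await asyncio.sleep(delay)
    return name, finding


async def investigate() -> dict[str, str]:
    # All three hypothetical checks run concurrently; gather keeps input order.
    results = await asyncio.gather(
        check_path("schema_changes", 0.02, "no recent changes"),
        check_path("upstream_freshness", 0.01, "source table is stale"),
        check_path("query_failures", 0.03, "no failed jobs"),
    )
    return dict(results)


findings = asyncio.run(investigate())
```

A graph framework adds branching, state, and retries on top of this idea; the sketch only shows the fan-out and the merge of results.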
LangSmith launches public benchmarks and evaluation dataset sharing to help developers compare LLM architecture performance. The first benchmark is a Q&A dataset over LangChain docs, accompanied by the langchain-benchmarks package. The article analyzes various models and architectures, providing insights into performance and debugging.
LangSmith now supports sharing evaluation datasets and results for community-driven benchmarks.
The initial benchmark is a Q&A dataset over LangChain docs to test RAG systems.
LangSmith's homepage is now organized into Observability, Evaluation, and Prompt Engineering, with improved Resource Tags for flexible resource grouping. New onboarding guides improve usability, and attribute-based access control (ABAC) is upcoming.
The homepage is divided into three sections: Observability, Evaluation, and Prompt Engineering.
Resource Tags now support flexible grouping by 'Application' or custom tags.
Agent engineering is an emerging discipline that integrates product thinking, engineering, and data science to build reliable LLM agents through rapid iteration and production feedback. It addresses the unpredictability of agents by cycling through build, test, ship, observe, and refine, as practiced by companies like Clay, Vanta, LinkedIn, and Cloudflare.
Agent engineering is an iterative process: build, test, ship, observe, refine, repeat.
It combines product thinking (scope and behavior), engineering (infrastructure), and data science (measurement and improvement).
AI agents work best when they reflect the knowledge and judgment your team has built over time. This article explores how to integrate human judgment into each stage of agent development, using a trader copilot example. It covers workflow design, tool design, and context engineering, and emphasizes the importance of automated evaluations and continuous iteration.
Agents need tacit knowledge from domain experts.
Human judgment can be embedded through workflow, tool, and context design.
Learn how Deep Agents SDK manages context for long-running AI tasks through offloading, summarization, and filesystem abstraction to prevent context rot.
Three compression techniques: offloading large tool results (>20K tokens), offloading large tool inputs (at >85% context usage), and summarization (when offloading is insufficient).
Offloaded content is saved to the filesystem and replaced with a pointer; the agent can retrieve it via file operations.
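The first technique, offloading large tool results, might look roughly like the following. This is a sketch and not the Deep Agents SDK implementation: `maybe_offload`, the four-characters-per-token estimate, and the pointer format are all assumptions; only the 20K-token threshold comes from the summary above.

```python
import tempfile
from pathlib import Path

TOKEN_LIMIT = 20_000  # threshold quoted in the summary


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token.
    return len(text) // 4


def maybe_offload(tool_name: str, result: str, workdir: Path) -> str:
    """Return the result unchanged, or a short pointer if it is too large."""
    tokens = estimate_tokens(result)
    if tokens <= TOKEN_LIMIT:
        return result
    path = workdir / f"{tool_name}_result.txt"
    path.write_text(result)  # full content stays retrievable from disk
    return f"[offloaded {tokens} tokens to {path}] preview: {result[:200]}"


workdir = Path(tempfile.mkdtemp())
small = maybe_offload("search", "short answer", workdir)
large = maybe_offload("read_logs", "x" * 100_000, workdir)
```

Here the small result goes into the context as-is, while the large one is replaced by a pointer plus a short preview, and the agent can read the file later if it needs the details.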
This post explores how to build reliable AI agents by designing loops, not just using a good model. It introduces four nested loops: the agent loop, verification loop, event-driven loop, and hill climbing loop, each building on the previous to create agents that work consistently and improve over time. Using LangChain primitives, developers can implement each level and embed human oversight where needed.
The agent loop lets the model call tools repeatedly to complete tasks. It's the fundamental loop.
The verification loop checks output quality and provides feedback, ensuring consistency.
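The two innermost loops can be sketched without any framework. In this illustration `fake_model` is a stand-in for an LLM, and `agent_loop`, `verification_loop`, and the `add_one` tool are hypothetical names rather than LangChain primitives.

```python
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {
    "add_one": lambda arg: str(int(arg) + 1),
}


def fake_model(history: list[str]) -> str:
    """Stand-in for an LLM: requests a tool once, then gives a final answer."""
    tool_outputs = [h for h in history if h.startswith("tool:")]
    if not tool_outputs:
        return "call add_one 41"
    return "final " + tool_outputs[-1].removeprefix("tool:")


def agent_loop(task: str, max_steps: int = 5) -> str:
    # Inner loop: keep calling tools until the model produces a final answer.
    history = [f"task:{task}"]
    for _ in range(max_steps):
        action = fake_model(history)
        if action.startswith("final "):
            return action.removeprefix("final ")
        _, name, arg = action.split()
        history.append("tool:" + TOOLS[name](arg))
    raise RuntimeError("agent did not finish")


def verification_loop(
    task: str, verify: Callable[[str], tuple[bool, str]], max_attempts: int = 3
) -> str:
    # Outer loop: rerun the agent with feedback until the check passes.
    feedback = ""
    for _ in range(max_attempts):
        answer = agent_loop(task + feedback)
        ok, note = verify(answer)
        if ok:
            return answer
        feedback = f" (previous attempt rejected: {note})"
    raise RuntimeError("no attempt passed verification")


result = verification_loop(
    "what is 41 + 1?", lambda a: (a.isdigit(), "must be a number")
)
```

The verification loop wraps the agent loop rather than replacing it, which is the nesting the post describes; the event-driven and hill climbing loops would wrap these in turn.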
LangChain and Fireworks fine-tuned an open model to mine perceived error signals from production traces, matching frontier model performance at a fraction of the cost.
LangSmith processes billions of tokens daily across production traces.
A fine-tuned Qwen model detects 'Perceived Error' signals at frontier-level performance with 100x cost savings.
The article explores the definition of AI agents, proposing that an agent is a system that uses an LLM to decide the control flow of an application. The author agrees with Andrew Ng that agent capabilities are a spectrum and introduces the concept of 'agentic' behavior, discussing its implications for development, operation, evaluation, and monitoring.
An AI agent is a system that uses an LLM to determine the control flow of an application.
Agent capabilities exist on a spectrum, from simple routing to highly autonomous agents.
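The low end of that spectrum, simple routing, can be shown in a few lines. This is an illustrative sketch: `fake_llm_route` stands in for a real model call, and the handlers are invented for the example.

```python
def fake_llm_route(query: str) -> str:
    """Stand-in for an LLM call that returns the name of the next step."""
    return "refund" if "refund" in query.lower() else "faq"


HANDLERS = {
    "refund": lambda q: "Routing to the refund workflow.",
    "faq": lambda q: "Answering from the FAQ.",
}


def hardcoded_app(query: str) -> str:
    # Not agentic: the developer fixed the control flow in code.
    return HANDLERS["faq"](query)


def router_app(query: str) -> str:
    # Minimally agentic: the model's output decides which branch runs.
    return HANDLERS[fake_llm_route(query)](query)
```

Moving further along the spectrum means letting the model decide more: which tools to call, in what order, and when to stop.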
LangChain built a GTM agent using Deep Agents that automates lead research, drafting, and account intelligence, achieving a 250% increase in lead conversion and saving 40 hours per rep per month.
The agent automates outbound and inbound lead processing, with human-in-the-loop approval via Slack.
It uses Deep Agents for multi-step orchestration and LangSmith for evaluations and feedback.
This article analyzes two seemingly opposing blog posts—'Don't Build Multi-Agents' by Cognition and 'How we built our multi-agent research system' by Anthropic—and finds they share common insights about when and how to build multi-agent systems. Key points include the critical role of context engineering, the relative ease of read-oriented vs. write-oriented multi-agent systems, and production reliability challenges. It also highlights how tools like LangGraph and LangSmith address these challenges.
Context engineering is the most critical part of building multi-agent systems, requiring dynamic communication of task context to models.
Multi-agent systems focused on 'reading' (e.g., research) are easier than those focused on 'writing' (e.g., coding), as writing requires more complex coordination and merging.
Learn how Replit Agent leverages LangSmith's observability features to debug complex agent workflows, including improvements in trace performance, search, and human-in-the-loop threads.
Replit Agent uses LangGraph and LangSmith for monitoring and debugging.
LangSmith was enhanced to handle large traces with hundreds of steps.
Interrupt 2025, LangChain's first industry conference, gathered 800 people in San Francisco. Keynote themes included Agent Engineering as a new discipline, multi-model LLM apps, LangGraph for reliable agents, and AI observability. Product launches included LangGraph Platform GA, Open Agent Platform, LangGraph Studio v2, LangGraph Pre-Builts, LangSmith observability updates, Open Evals, and LLM-as-Judge private preview.
LangChain held its first Interrupt conference, focusing on AI agents.
Several new products were announced, including LangGraph Platform GA and Open Agent Platform.
A guide to building production-ready RAG apps using Pinecone Serverless, LangChain, and LangServe, addressing pain points like vectorstore management, rapid deployment, and observability.
OpenEvals and AgentEvals provide pre-built evaluators for LLM-as-judge, structured data, and agent trajectory evaluation. These open-source packages help developers quickly establish evaluation workflows to ensure reliability of LLM applications.
OpenEvals and AgentEvals offer ready-to-use evaluators covering LLM-as-judge, structured data, and agent trajectory evaluation.
LLM-as-judge evaluators are customizable with few-shot examples and scoring schemas, suitable for conversational quality, hallucination detection, and more.
LangSmith introduces self-improving LLM-as-a-Judge evaluators that leverage human corrections as few-shot examples to align evaluations with human preferences without prompt engineering.
LLM-as-a-Judge evaluators are popular for grading natural language outputs but require careful prompt engineering.
LangSmith's new feature stores human corrections as few-shot examples to improve evaluator alignment over time.
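The idea of feeding human corrections back as few-shot examples can be sketched as follows. This is not LangSmith's implementation: `Correction`, `SelfImprovingJudge`, and the prompt layout are hypothetical, and the actual call to a judge model is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Correction:
    output: str
    judge_score: int  # what the evaluator originally said
    human_score: int  # what the reviewer corrected it to


@dataclass
class SelfImprovingJudge:
    instructions: str
    corrections: list[Correction] = field(default_factory=list)

    def add_correction(self, c: Correction) -> None:
        self.corrections.append(c)

    def build_prompt(self, output: str, max_examples: int = 5) -> str:
        # The most recent human corrections become few-shot examples.
        parts = [self.instructions]
        for c in self.corrections[-max_examples:]:
            parts.append(
                f"Example output: {c.output}\nCorrect score: {c.human_score}"
            )
        parts.append(f"Output to grade: {output}\nScore:")
        return "\n\n".join(parts)


judge = SelfImprovingJudge("Score helpfulness as 0 or 1.")
before = judge.build_prompt("Try restarting.")
judge.add_correction(Correction("I don't know.", judge_score=1, human_score=0))
after = judge.build_prompt("Try restarting.")
```

The instructions never change in this sketch; only the examples grow, which is how the evaluator can drift toward human preferences without manual prompt engineering.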