Recapping two days of Interrupt 2026 — LangSmith Engine, Sandboxes GA, LangChain Labs, and 23 talks from teams at LinkedIn, Rippling, Cisco, and more. Now on demand.
LangSmith Engine automates failure analysis from production traces.
LangSmith Sandboxes reaches General Availability for secure agent execution.
Lyft used LangGraph and LangSmith to build a self-serve AI agent platform for customer support, cutting agent development from months to weeks. The platform lets non-technical domain experts build agents via prompts and configuration, and it combines a router-based multi-agent architecture with a robust evaluation pipeline.
Lyft moved agent development closer to domain experts by letting ops teams, VoC leads, and product managers define agents through prompts and configuration.
A router-based multi-agent architecture with LangGraph routes rider and driver requests across specialized subagents with safety checks and state management.
AI models have plateaued on raw intelligence, and the next gains come from what you build around them. The AI agent harness provides tools, memory, and human-in-the-loop capabilities to transform LLMs into useful digital assistants. Companies like Google, LangChain, OpenAI, and Anthropic offer different solutions.
AI intelligence gains are plateauing; agent harnesses are the new frontier.
Agent harnesses add tools, memory, and human oversight to LLMs.
In this post, we provide a solution to build highly scalable, serverless multi-agent generative AI systems on AWS using LangGraph Agents as orchestrators integrated with Amazon Bedrock AgentCore Memory and Amazon Bedrock AgentCore Observability.
Combines LangGraph, Amazon Bedrock AgentCore, and serverless AWS services for production-ready multi-agent AI systems.
LangGraph's explicit graph-based execution model enables deterministic coordination, parallelism, and conditional routing between agents.
This study integrates a femtosecond laser-pumped Coherent Ising Machine (CIM) with an LLM-driven agentic system built on the LangGraph and LangChain frameworks. The LLMs perform QUBO/Ising model calibration, constraint weight iteration, and validation of literature-reported schemes. All tasks run on domestic large models and CIM hardware, so the practical CIM workflow relies entirely on domestic core technologies. The study also reports a new paradigm in which agent-assisted quantum computing iterations in turn enhance the agent's own problem-solving capability.
Integration of femtosecond laser-pumped CIM with LLM-driven agentic system
LLMs perform QUBO/Ising calibration, constraint iteration, and validation
An AI agent written in Java using LangChain4j, similar to Claude Code. Free to use with a free Mistral account. It generated a good calculator app on the first try.
LangSmith Auth Proxy keeps credentials out of sandbox runtimes, injects auth headers at the network layer, and lets teams define egress policies and dynamic credential flows to secure agent network access.
Credentials stay outside the sandbox, reducing exposure from prompt injection and malicious dependencies
Egress policies restrict agent network access to approved destinations
Move beyond token streaming. Learn how the latest streaming primitives in Deep Agents, LangChain, and LangGraph enable typed events, scoped subscriptions, subagent visibility, multimodal outputs, and resilient frontend experiences for building production-ready agent applications.
Streaming needs to evolve beyond tokens; modern agents generate messages, tool calls, subagent activity, state changes, approvals, and media, requiring structured event streams.
Typed events and projections simplify frontend development; applications subscribe directly to messages, tool calls, state, subagents, or custom channels while the runtime handles assembly, ordering, and reconnection.
Amazon SageMaker AI now offers OpenAI-compatible API support for real-time inference endpoints. Users of OpenAI SDK, LangChain, or Strands Agents can invoke models on SageMaker AI by changing only the endpoint URL, without custom clients, SigV4 wrappers, or code rewrites. The feature supports Chat Completions requests and streaming responses, with bearer token authentication.
Amazon SageMaker AI endpoints now support OpenAI-compatible API, simplifying model invocation.
Existing OpenAI SDK or frameworks can be used with only a URL change.
LangSmith Engine is an agent that sits on top of your agent traces, spots recurring issues, and suggests what to do next. This post details its technical architecture, including how it screens traces at scale, investigates likely issues, and produces actionable outputs.
Engine automatically identifies failure patterns in traces and converts them into issues.
It uses a two-phase approach: broad screening with trajectory summaries and deep investigation of flagged traces.
In 2026, enterprise agentic AI has moved from pilots to production. This guide ranks the top 10 platforms — Salesforce Agentforce, Microsoft Copilot Studio, ServiceNow, LangGraph, and more — with verified pricing, real adoption data, and honest constraints to help enterprise teams make the right platform decision.
Salesforce Agentforce leads for CRM-native workflows with $800M ARR and 29,000 deals, but its value narrows outside the Salesforce ecosystem.
Microsoft Copilot Studio has the highest volume: 160,000 organizations and 400,000+ agents. It is best suited to Microsoft 365 enterprises.
Deep Agents was previously designed in a generic way to work well across model families. Today we’re adding model-specific profiles to adjust prompts, tools, and middleware. We ship profiles for OpenAI, Anthropic, and Google models out of the box, and we see them deliver a 10–20 point jump on a subset of tau2-bench over the default harness.
Deep Agents introduces model-specific profiles to optimize prompts, tools, and middleware per model.
The built-in profiles yield 10–20 point gains on a subset of tau2-bench for OpenAI, Anthropic, and Google models.
itsharness is a complete harness for building, running, and observing AI agent workflows. It offers a visual canvas to design flows, exports a runtime-agnostic spec, compiles to various frameworks, and supports running, tracing, and debugging. The spec is at version 0.2.0 with 14 node types and 5 example flows.
itsharness provides a visual canvas for designing AI agent workflows and exports a runtime-agnostic JSON spec.
Adapters compile the spec to frameworks like LangGraph, CrewAI, Mastra, and Microsoft Agent Framework.
A relatively quiet day in AI news highlights a smaller trend: the convergence of coding agent form factors around Conductor's pioneering approach. Key stories include GitHub's new Copilot App mimicking Conductor, OpenAI's Codex mobile launch, LangChain's agent infrastructure updates (SmithDB, Engine, Labs), Anthropic's Claude Code restrictions backlash, Figure's 24/7 autonomous sorting livestream, and notable research releases on diffusion LMs, time-series forecasting, and mechanistic interpretability.
GitHub launches Copilot App with an agent-first UX similar to Conductor; YC CEO Garry Tan publicly endorses Conductor as superior.
OpenAI integrates Codex into ChatGPT mobile, enabling remote task initiation, review, and execution.
LangChain Labs is a new applied research effort focused on continual learning for agents, with partners advancing open research on self-improving AI systems.
LangChain Labs focuses on continual learning for agents using data they generate.
Partners include Harvey, NVIDIA, Prime Intellect, Fireworks, and Baseten.
Halgorithem is a custom algorithm that detects AI hallucinations without using AI itself, by parsing inputs into trees and comparing with file chunk trees to flag inconsistencies. It integrates with Python AI workflows like LangGraph and CrewAI, showing high accuracy in benchmarks.
Halgorithem detects AI hallucinations by comparing tree structures, without using AI itself.
It integrates into popular Python AI pipelines such as LangGraph and CrewAI.
LangSmith launches SmithDB, a purpose-built distributed database for agent observability, delivering up to 12x faster performance with full portability for self-hosted and multi-cloud deployments.
SmithDB is a distributed database built specifically for agent observability, achieving up to 12x performance improvements.
Backed by object storage with stateless ingestion and query services, it scales easily for self-hosted and multi-cloud environments.
Introducing LangSmith LLM Gateway: runtime governance for AI agents with spend limits, PII redaction, and trace continuity, built directly into LangSmith.
LLM Gateway sits between agents and LLM providers, enforcing spend limits and redacting PII before requests reach the model.
Policy violations appear as traceable events in LangSmith, so teams can investigate and fix them in the same place.
LangSmith launches Context Hub, a centralized repository for storing, versioning, and collaborating on AI agent behavior files like AGENTS.md, skills, and policies. It addresses the need for a dedicated home for context, often managed by non-engineers and updated frequently, with features such as version control, tags, comments, and integration with Deep Agents for persistent memory.
Context Hub centralizes agent context files, including AGENTS.md, skills, policies, and more.
Context significantly impacts agent behavior; many failures stem from missing or stale context.
LangSmith Sandboxes are now GA, providing hardware-virtualized microVMs with kernel isolation for safely running AI agent code. New features include snapshots, cheap forks, service URLs, CLI, and auth proxy with custom callbacks. Built for coding agents, CI agents, and data pipelines.
Each sandbox is a hardware-virtualized microVM, fully kernel-isolated from host and other sandboxes, offering stronger security than containers.
New GA features include snapshots with copy-on-write forks, Service URLs, Sandbox CLI, Auth Proxy with custom callbacks, and auto-pause for idle sandboxes.
Run deep agents in production with durable execution, sandboxes, tool access, and LangSmith observability, without building the runtime yourself. Now in private beta.
Managed Deep Agents provides a hosted runtime with durable threads, streaming, checkpointing, and human-in-the-loop.
Context Hub enables agents to retain and update context across runs, with LangSmith Engine for learning from real usage.
LangSmith Engine watches your production traces, clusters failures into named issues, and proposes targeted fixes and eval coverage. Stop manually triaging agent failures.
LangSmith Engine automates the agent improvement loop by clustering production failures into named issues and proposing fixes.
Each resolved issue strengthens your eval suite to prevent regressions.
Torrix is a self-hosted LLM observability tool that tracks tokens, cost, latency, full prompt traces, reasoning tokens, and PII masking. It supports many LLM providers and can be deployed with Docker without Postgres or Redis. It offers SDKs for Python, Node.js, Go, C#, Java, as well as LangChain callback and HTTP proxy.
Self-hosted LLM observability without Postgres or Redis.
LangGraph 1.2 introduces DeltaChannel, which reduces checkpoint storage from O(N²) to near-constant by storing only diffs each step with periodic full snapshots. It achieves a 41x reduction for coding agents, with no config changes or data migration required.
Checkpoint storage under the full-snapshot model grows O(N²) with session length
DeltaChannel stores deltas each step and writes full snapshots every K steps, keeping storage flat
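To see why writing a full snapshot at every step grows quadratically while writing deltas does not, here is a toy storage model. It is not LangGraph's implementation: the one-unit-per-step state growth, the function names, and the snapshot interval `k` are all assumptions made for illustration.

```python
def full_snapshot_storage(n_steps: int) -> int:
    """Total units stored when every step writes the entire state.

    Assumes the state grows by one unit per step, so step i writes i units.
    """
    return sum(step for step in range(1, n_steps + 1))


def delta_storage(n_steps: int, k: int) -> int:
    """Total units stored when each step writes a one-unit delta and
    every k-th step additionally writes a full snapshot of the state."""
    total = 0
    for step in range(1, n_steps + 1):
        total += 1  # the delta written at this step
        if step % k == 0:
            total += step  # periodic full snapshot of the whole state
    return total


if __name__ == "__main__":
    for n in (100, 1000):
        full = full_snapshot_storage(n)
        delta = delta_storage(n, k=50)
        print(f"steps={n} full={full} delta={delta} ratio={full / delta:.1f}x")
```

In this toy model the periodic snapshots still grow with the state, so the saving depends on the chosen `k`. The 41x figure above is the source's measurement, not something this sketch reproduces.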
OncoAgent is an open-source, privacy-preserving clinical decision support system for oncology. It features a dual-tier LLM architecture (9B speed vs 27B deep reasoning), multi-agent LangGraph topology, Corrective RAG pipeline over 70+ NCCN and ESMO guidelines, and a three-layer reflexion safety validator with Zero-PHI policy. The system routes queries via complexity scoring and was fine-tuned on AMD Instinct MI300X, achieving 56x throughput acceleration. It supports on-premises deployment to ensure data sovereignty.
Open-source, privacy-preserving oncology decision support system for on-premises deployment.
Dual-tier LLM: 9B speed-optimized and 27B deep-reasoning models, routed via additive complexity scorer.
This article outlines a systematic four-phase agent development lifecycle: Build, Test, Deploy, and Monitor. It emphasizes testing before deployment, using runtimes and sandboxes for reliable deployment, and leveraging traces and feedback for monitoring. It covers tools from code-first to no-code and discusses best practices for datasets, experiments, simulations, and signal harvesting.
The lifecycle consists of Build, Test, Deploy, and Monitor phases. Testing should start before production.
Build phase offers tools from code-first frameworks to no-code platforms, with distinctions between agent frameworks, runtimes, and harnesses.
Seismic data analysis is an essential component of energy exploration, but configuring complex processing workflows has traditionally been a time-consuming and error-prone challenge. Halliburton’s Seismic Engine, a cloud-native application for seismic data processing, is a powerful tool that previously required manual configuration of approximately 100 specialized tools to create workflows. This process was not only time-consuming but also required deep expertise, potentially limiting the accessibility and efficiency of the software.
To address this challenge, Halliburton partnered with the AWS Generative AI Innovation Center to develop an AI-powered assistant for Seismic Engine. The solution uses Amazon Bedrock, Amazon Bedrock Knowledge Bases, Amazon Nova, and Amazon DynamoDB to transform complex workflow creation into conversations. Geoscientists and data scientists can configure processing tools through natural language interaction instead of manual configuration.
In this post, we’ll explore how we built a proof-of-concept that converts natural language queries into executable seismic workflows while providing a question-answering capability for Seismic Engine tools and documentation. We’ll cover the technical details of the solution, share evaluation results showing workflow acceleration of up to 95%, and discuss key learnings that can help other organizations enhance their complex technical workflows with generative AI.
Our collaboration with AWS has been instrumental in accelerating subsurface interpretation workflows. By integrating Amazon Bedrock services with Halliburton Landmark’s DS365 Seismic Engine, we were able to reduce traditionally time‑consuming workflow‑building tasks by an order of magnitude. This generative AI–powered workflow assistant not only improves efficiency and accuracy but also makes our advanced geophysical tools more accessible to a broader range of users. The scalable cloud‑native architecture on AWS has enabled us to deliver a seamless, conversational experience that fundamentally improves productivity across subsurface workflows.
- Phillip Norlund, Manager of Subsurface Technologies, Halliburton Landmark
- Slim Bouchrara, Senior Product Owner of Subsurface R&D, Halliburton Landmark
Solution overview
Our project aimed to address two key objectives: transforming natural language queries into executable seismic workflows, and providing an intelligent question and answer (Q&A) system for Seismic Engine documentation. To achieve this, we developed a solution using Amazon Bedrock that enables geoscientists to interact with complex seismic tools through natural conversation.
The backbone of our system is a FastAPI application deployed on AWS App Runner, which handles user queries through a streaming interface. When a user submits a query, an intent router powered by Amazon Nova Lite analyzes the request to determine whether it’s seeking workflow generation or technical information. For Q&A requests, the system uses Amazon Bedrock Knowledge Bases with Amazon OpenSearch Serverless to provide relevant answers from indexed documentation. For workflow requests, a generation agent using Anthropic’s Claude on Amazon Bedrock creates YAML workflows by selecting from 82 available Seismic Engine tools.
To maintain context and enable multi-turn conversations, we integrated Amazon DynamoDB for chat history and interaction logging. The system supports streaming responses for both Q&A and workflow generation, providing immediate feedback to users as the system processes their requests. This architecture allows complex technical workflows to be created and modified through natural conversation, while maintaining the precise control required for seismic data processing. The following diagram illustrates the solution architecture.
Query routing and intent classification
After the user’s query is provided to the system, the Intent Router classifies the intent label of the given query by calling Amazon Nova Lite via the Amazon Bedrock API. The large language model (LLM) is given a prompt to produce one of three intent labels: “Workflow_Generation”, “QnA”, and “General_Question”.
The “Workflow_Generation” label is used to route queries related to workflow generation, including reading/loading datasets, data processing operations, and various requests involving manipulating specific datasets. The “QnA” intent label is used for questions related to specific tools, requests for sample workflows, or questions about Seismic Engine documentation. The “General_Question” label is reserved for queries unrelated to Seismic Engine operations or workflows.
In our implementation, Amazon Nova Lite performed the routing task efficiently, offering a good balance between accuracy and latency.
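The routing step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the prompt wording, the handler names, and the fallback to “General_Question” for unrecognized replies are assumptions, and `call_llm` stands in for whatever function sends a prompt to Amazon Nova Lite and returns its text reply.

```python
from typing import Callable

INTENT_LABELS = ("Workflow_Generation", "QnA", "General_Question")

# Hypothetical prompt wording; the source does not publish its prompt.
ROUTER_PROMPT = (
    "Classify the user query into exactly one label: "
    + ", ".join(INTENT_LABELS)
    + ".\nReply with the label only.\n\nQuery: {query}"
)


def classify_intent(query: str, call_llm: Callable[[str], str]) -> str:
    """Return one of INTENT_LABELS for the query.

    call_llm is any function that sends a prompt to a model and returns
    its text reply. Unrecognized replies fall back to General_Question
    (an assumption, not something the source specifies).
    """
    reply = call_llm(ROUTER_PROMPT.format(query=query)).strip().lower()
    for label in INTENT_LABELS:
        if label.lower() in reply:
            return label
    return "General_Question"


def route(query: str, call_llm: Callable[[str], str]) -> str:
    """Map the classified intent to a hypothetical handler name."""
    handlers = {
        "Workflow_Generation": "workflow_agent",
        "QnA": "knowledge_base_qa",
        "General_Question": "general_reply",
    }
    return handlers[classify_intent(query, call_llm)]
```

Passing the model call in as a function keeps the router testable without network access; in production it would wrap the Amazon Bedrock API call.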
Question answering implementation
The Q&A component handles Seismic Engine-related queries by using Amazon Bedrock Knowledge Bases, a fully managed service for end-to-end Retrieval Augmented Generation (RAG) workflows. We chose Bedrock Knowledge Bases because it alleviates the operational overhead of managing vector databases, chunking strategies, and embedding pipelines. As a fully managed service, it handles infrastructure scaling, security, and maintenance automatically, so that our team could focus on solution development rather than RAG infrastructure operations. The service provides native support for multiple chunking strategies including hierarchical chunking, which maintains parent-child relationships to balance granular retrieval with broader document context.
The data sources include tool documentation markdown files and Seismic Engine manuals stored in Amazon S3. We kept tool documentation files unchunked since they’re relatively short, preserving complete context for individual tools. For longer documents like Seismic Engine manuals, we used hierarchical chunking with default settings. We use Amazon Titan Text Embeddings V2 for embedding generation and OpenSearch Serverless as the vector database. The system also stores metadata such as file names, URLs, and document types for each indexed item for downstream use.
For both retrieval and response generation, we use Amazon Bedrock Knowledge Bases’ retrieve_and_generate API with Claude 3.5 Haiku as the model. The system supports multi-turn conversations by maintaining session context, and responses are formatted with inline citations for enhanced traceability.
Note: This solution was developed and evaluated using Claude 3.5 Sonnet V2 and Claude 3.5 Haiku. Since then, these models have been succeeded by Claude Sonnet 4.5 and most recently Claude Sonnet 4.6, as well as Claude Haiku 4.5, all available through Amazon Bedrock. The solution architecture supports model upgrades without code changes, so that you can use the latest model capabilities.
This approach enables our system to provide context-aware, relevant answers to user queries about Seismic Engine tools and workflows.
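For readers who want to see what a retrieve_and_generate call looks like, here is a minimal sketch. The helper names (`build_rag_request`, `ask`) are hypothetical and the knowledge base ID and model ARN are placeholders you would supply. The request shape follows the boto3 `bedrock-agent-runtime` client as we understand it, so check it against the current API reference before relying on it.

```python
from typing import Any, Optional


def build_rag_request(
    question: str,
    knowledge_base_id: str,
    model_arn: str,
    session_id: Optional[str] = None,
) -> dict[str, Any]:
    """Build keyword arguments for retrieve_and_generate."""
    request: dict[str, Any] = {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": knowledge_base_id,
                "modelArn": model_arn,
            },
        },
    }
    if session_id is not None:
        # Reusing the session ID from a previous response continues
        # the same multi-turn conversation.
        request["sessionId"] = session_id
    return request


def ask(
    client: Any,
    question: str,
    knowledge_base_id: str,
    model_arn: str,
    session_id: Optional[str] = None,
) -> tuple[str, str]:
    """Call the API and return (answer_text, session_id)."""
    response = client.retrieve_and_generate(
        **build_rag_request(question, knowledge_base_id, model_arn, session_id)
    )
    return response["output"]["text"], response["sessionId"]
```

In practice `client` would be `boto3.client("bedrock-agent-runtime")`. Passing the returned session ID back into the next call is how the multi-turn context described above would be preserved in this sketch.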
Workflow generation
For queries classified as “Workflow_Generation”, our solution uses LLM agents to convert natural language into executable YAML workflows. The agent is bound with 82 tools available on Seismic Engine. Based on the user’s query and tool specifications that define inputs, parameters, and outputs, the agent selects appropriate tools, determines their correct execution order, and generates a YAML workflow that addresses the user’s requirements. The following figure illustrates the workflow generation process.
We used both Claude 3.5 Sonnet V2 and Claude 3.5 Haiku in our implementation, orchestrated through the LangChain framework for agent management and tool binding. The models are provided with detailed tool descriptions and specifications, so that they can understand each tool’s capabilities and requirements. When generating workflows, the system considers not only the explicit requirements in the user’s query but also includes necessary default parameters when specific values aren’t provided.
The workflow generation process supports multi-turn conversations, so that users can modify previously generated workflows through natural language requests. By using conversation history stored in Amazon DynamoDB, the LLM can either generate new workflows or modify existing ones according to the user’s current query.
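One way to make the default-parameter behavior concrete is a post-processing step that checks each generated step against its tool specification and fills in defaults. The source does not say whether defaults are applied by the model or in code, so treat this as an illustrative assumption. The tool names, parameters, and step structure below are hypothetical, since real Seismic Engine specifications are not given.

```python
from typing import Any

# Hypothetical tool specifications; real Seismic Engine tool names and
# parameters are not given in the source.
TOOL_SPECS: dict[str, dict[str, Any]] = {
    "load_dataset": {"defaults": {"format": "segy"}, "required": ["path"]},
    "bandpass_filter": {"defaults": {"low_hz": 5, "high_hz": 60}, "required": []},
}


def complete_workflow(steps: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Validate tool names and fill in default parameters for each step.

    Each step is {"tool": name, "params": {...}}. Explicit parameters
    take precedence over defaults. Raises ValueError on unknown tools
    or missing required parameters.
    """
    completed = []
    for index, step in enumerate(steps):
        name = step["tool"]
        if name not in TOOL_SPECS:
            raise ValueError(f"step {index}: unknown tool {name!r}")
        spec = TOOL_SPECS[name]
        # Defaults first, then user-supplied values override them.
        params = {**spec["defaults"], **step.get("params", {})}
        missing = [key for key in spec["required"] if key not in params]
        if missing:
            raise ValueError(f"step {index}: {name} is missing {missing}")
        completed.append({"tool": name, "params": params})
    return completed
```

The completed list preserves step order, so it could then be serialized to YAML in whatever schema the target system expects.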
Evaluation
To evaluate our solution’s effectiveness, we created a comprehensive test dataset of query-workflow pairs, consisting of both low and medium complexity workflows. These were derived from real historical workflows and validated by subject matter experts to verify they accurately represent typical user requests.
Workflow generation results
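| Model | Complexity | Success Rate | Mean Generation Time (s) | Median Generation Time (s) |
| --- | --- | --- | --- | --- |
| Claude Haiku 3.5 | simple | 84% | 8.3 | 5.9 |
| Claude Haiku 3.5 | medium | 90% | 12.4 | 9.1 |
| Claude Sonnet 3.5 V2 | simple | 86% | 11.2 | 11.5 |
| Claude Sonnet 3.5 V2 | medium | 97% | 15.8 | 16.6 |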
Both models demonstrated strong performance, with Claude Sonnet 3.5 V2 showing superior success rates, particularly for medium complexity workflows. The system delivers responses through streaming, providing users with immediate feedback as the workflow is generated, with complete workflows delivered within 5.9-16.6 seconds. Claude Haiku 3.5 offers faster generation times, providing a trade-off option between speed and accuracy.
Comparison to baseline performance
| User Type | % Success | % Failure | Time to Build Simple Flow (min) | Time to Build Complex Flow (min) |
| --- | --- | --- | --- | --- |
| New User | 70% | 20% | 4 | 20 |
| Experienced User | 85% | 10% | 2 | 5 |
| Our Solution | 84-97% | 3-16% | 0.13-0.26 | 0.21-0.28 |
Our generative AI solution demonstrates the following improvements:
Success rates of 84-97% exceed those of new users (70%) and are comparable to or better than those of experienced users (85%).
Workflow creation time is reduced from minutes to seconds, representing over a 95% time reduction.
These results demonstrate that users across experience levels can enhance productivity by over 95%, while maintaining or exceeding the accuracy of manual workflow creation.
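As a quick check of the arithmetic, the time reduction can be computed directly from the baseline comparison table. The sketch below uses only the values reported there and assumes the solution's times are directly comparable to the manual build times. Under that assumption the reduction ranges from about 87% (experienced user, simple flow, slowest generation) to about 99% (new user, complex flow), so the exact figure depends on which baseline is chosen.

```python
def reduction_pct(manual_min: float, assisted_min: float) -> float:
    """Percentage of build time saved relative to the manual baseline."""
    return (1 - assisted_min / manual_min) * 100


# Values copied from the baseline comparison table (minutes).
manual = {
    ("New User", "simple"): 4,
    ("New User", "complex"): 20,
    ("Experienced User", "simple"): 2,
    ("Experienced User", "complex"): 5,
}
# (fastest, slowest) assisted build times per flow type.
assisted = {"simple": (0.13, 0.26), "complex": (0.21, 0.28)}

for (user, flow), minutes in manual.items():
    fastest, slowest = assisted[flow]
    low = reduction_pct(minutes, slowest)
    high = reduction_pct(minutes, fastest)
    print(f"{user:17} {flow:8} {low:.1f}% to {high:.1f}%")
```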
Conclusion
In this post, we showed how we used Amazon Bedrock to transform complex technical processes into natural conversations. By implementing an AI-powered assistant with integrated Q&A capabilities, we achieved workflow generation success rates of 84-97% while reducing creation time by over 95% compared to manual processes. The system’s ability to handle both low and medium complexity workflows, combined with its contextual understanding of Seismic Engine tools, demonstrates how generative AI can improve industrial software usability without compromising accuracy.
This approach also generalizes well to other domains with complex, multi-step agentic workflows requiring specialized tool knowledge and configuration. As next steps, consider exploring multi-agent architectures using frameworks like Strands Agents SDK with Amazon Bedrock AgentCore for improved accuracy through specialized sub-agents.
Halliburton partnered with AWS to develop an AI-powered assistant for Seismic Engine using Amazon Bedrock.
The solution converts natural language into executable seismic workflows with up to 95% time reduction.
This post details how to build an automated company due diligence agent using LangChain's Deep Agents for orchestration and Parallel's Task API for structured web research, with five research subagents and compliance observability via LangSmith.
Deep Agents orchestrates five research subagents for corporate profile, financial health, litigation, news, and competitive landscape.
Parallel's Task API returns structured findings with source citations and confidence scores (Basis), enabling verifiable research.
This tutorial demonstrates building a Groq-powered research agent using LangGraph, LangChain, and various custom tools including web search, file I/O, Python execution, skill loading, sub-agent delegation, and long-term memory. The agent runs on Groq's free OpenAI-compatible endpoint and can autonomously research topics, delegate subtasks, generate reports, and store memory across sessions.
Uses Groq's llama-3.3-70b-versatile model via OpenAI-compatible API.
Integrates LangGraph for agent loop management, LangChain for tool binding.
Harrison Chase argues that agent observability should go beyond debugging to power learning loops. Observability alone provides traces of what happened, but feedback—user signals, indirect metrics, LLM-as-judge, and deterministic rules—is essential to turn those traces into actionable insights for improving the model, harness, and context. The post explores learning at multiple levels and outlines what an observability platform needs: store traces, store feedback, and generate feedback.
Agent observability's deeper purpose is to power learning, not just debugging.
Feedback—from users, indirect behavior, LLM judges, and rules—transforms traces into learning signals.
Hapag-Lloyd's Digital Customer Experience team built a generative AI-powered feedback analysis solution using Amazon Bedrock, Elasticsearch, and LangChain/LangGraph to automate sentiment classification, trend analysis, and reporting, reducing manual effort and enabling faster, data-driven product decisions.
Automated customer feedback analysis using generative AI reduces manual effort from hours to seconds.
Solution uses Amazon Bedrock for sentiment classification and content moderation with guardrails.
Open SWE is an open-source framework built on Deep Agents and LangGraph that captures the successful architectural patterns of internal coding agents from companies like Stripe, Ramp, and Coinbase, providing customizable sandbox, toolset, orchestration, and integration components.
Open SWE integrates isolated cloud sandboxes, curated toolsets, subagent orchestration, and Slack/Linear/GitHub invocation.
The framework is built on Deep Agents, enabling easy upgrades and customization without forking.
Open-weight models like GLM-5 and MiniMax M2.7 now achieve comparable performance to closed frontier models on core agent tasks including file operations, tool use, and instruction following, at significantly lower cost and latency. LangChain's evaluations show correctness scores close to top closed models, making open models viable for production agent workflows. This article details the evaluation methodology, results, and how to use open models with Deep Agents SDK.
Open models GLM-5 and MiniMax M2.7 match closed frontier models on agent tasks.
Cost and latency benefits: up to 20x cheaper, faster inference.
Evaluate and iterate on LLM applications with confidence using LangSmith's regression testing. Compare experiments, track performance, and identify changes.
LangSmith improves regression testing for LLM applications.
AI tests may not achieve perfect scores, requiring performance tracking over time.
LangSmith, the unified DevOps platform for LLM applications, is now available as a transactable offering in the Azure Marketplace, enabling deployment within Azure VPC with full data control and MACC credit support.
LangSmith is now purchasable via the Azure Marketplace as an Azure Kubernetes Application.
Data remains fully contained within the customer's Azure VPC, with no third-party sharing.
Dosu uses evaluation driven development and LangSmith to build reliable LLM products at scale, monitor production performance, and iterate with confidence.
Dosu employs evaluation driven development (EDD) to ensure LLM reliability, similar to test-driven development.
LangSmith's SDK is easy to integrate, providing fine-grained control and customizability for monitoring.
Multi-agent systems that mirror real engineering teams — not just code faster — can cut debug time by 93% and compress cross-team delivery. Here's the architecture built on LangGraph.
Agentic engineering is a multi-agent coordination model where AI agents act as digital team members with defined roles, shared memory, and common observability.
In a pilot, coordinated agent execution reduced time-to-root-cause by 93% in debugging workflows, saving over 200 engineering hours in one month.