GitHub Copilot CLI now uses smarter subagent delegation to reduce unnecessary handoffs and wait times. Production A/B testing shows a 23% reduction in tool failures and a 5% improvement in user wait time. The article details how the team identified delegation bottlenecks, refined the orchestration policy, and validated improvements.
Copilot CLI now delegates more selectively, using subagents only when they create real leverage.
Production A/B test results: tool failures down 23%, P95 wait time reduced by 5%.
Box AI built Box Agent on Deep Agents to search, analyze, and synthesize enterprise content while preserving security, permissions, and model flexibility. The parent/child agent architecture dynamically spawns sub-agents for complex tasks, and middleware handles citations, caching, and context management.
Box Agent evolved from single-file Q&A to multi-document enterprise analysis using Deep Agents.
Deep Agents provided model agnosticism and 3x faster iteration.
TrajGenAgent proposes a hierarchical LLM agent framework for generating realistic synthetic human mobility trajectories without model fine-tuning. It uses a two-stage orchestrator-worker design: an LLM first synthesizes individual- and weekday-conditioned activity chains via in-context learning, then a deterministic workflow grounds each activity into a complete visit using personalized POI retrieval, distance-aware location selection, kinematics-aware travel-time propagation, and LLM-based duration estimation. An anomaly-detection-based evaluation framework assesses behavioral and semantic plausibility. Experiments show improvements in spatiotemporal fidelity, semantic coherence, and individual-specific behavioral realism over existing methods.
TrajGenAgent is a hierarchical LLM agent framework for generating human mobility trajectories without fine-tuning.
It employs a two-stage design: LLM synthesizes activity chains, and a deterministic workflow converts activities to visits.
Arbor is a multi-agent framework introducing structured tree search as a cognition layer for autonomous agents in large stateful action spaces. Validated on full-stack LLM inference optimization, it achieves up to 193% Pareto improvement in throughput-latency over vendor baselines, with a critic agent ensuring stability.
Arbor uses tree search as shared working memory across agents for coordinated optimization.
Achieves up to 193% throughput-latency Pareto improvement on full-stack LLM inference, hardware-agnostic.
OpenAI Group PBC today announced plans to acquire Ona, a startup with a platform for managing long-running AI agents. The acquisition will enhance OpenAI's Codex AI assistant by enabling it to perform tasks that span hours or days. Ona's cloud sandbox technology allows AI agents to continue running even when developers shut down their workstations, and provides security features such as blocking malicious programs via hashing.
OpenAI plans to acquire Ona (Gitpod GmbH) to improve its Codex AI assistant's ability to handle long-running tasks.
Ona's platform runs AI agents in cloud sandboxes that persist beyond developer workstation shutdowns.
Benchling's Head of AI Nicholas Larus-Stone discusses building agents for life sciences on the Max Agency podcast. He explains their multi-model approach for quality, production trace review processes, and how agents compress workflows to accelerate scientific discovery. Benchling AI launched in October 2025 on top of a 14-year-old data platform.
Benchling runs multiple models from different providers on the same task to leverage diverse error patterns for higher quality.
A rotating 'fire chief' reviews production traces weekly, supplemented by user feedback (thumbs up/down).
Agent-EvalKit is an open-source toolkit (Apache 2.0) that makes agent evaluation infrastructure available by integrating with AI coding assistants, including Claude Code, Kiro CLI, and Kilo Code. This post walks through how Agent-EvalKit works across its six evaluation phases, using a travel research agent built with the Strands Agents SDK and Amazon Bedrock as a running example.
Agent-EvalKit provides a six-phase evaluation workflow (Plan, Data, Trace, Run agent, Eval, Report) integrated with AI coding assistants.
It detects issues like hallucination when tools return empty results, as demonstrated with a travel research agent.
SmithDB supports full-text search and JSON filtering over agent traces with a median latency of 400 ms, despite large nested JSON documents in object storage. The article covers challenges, query shapes, inverted index basics, why Tantivy wasn't used, and the two design iterations.
SmithDB's inverted index is tailored for object storage and large agent trace payloads.
Traditional search libraries like Tantivy are not suitable due to mmap and local disk assumptions.
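To make the "inverted index basics" concrete, here is a minimal in-memory sketch of full-text search plus exact-match JSON filtering over nested trace documents. It is not SmithDB's design: the `InvertedIndex` class, the dotted-path convention, and the sample field names are all assumptions for illustration, and a real object-storage index would persist its postings rather than hold them in Python sets.

```python
import re
from collections import defaultdict


def flatten(doc, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf of a nested JSON document."""
    if isinstance(doc, dict):
        for key, value in doc.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(doc, list):
        for item in doc:
            yield from flatten(item, prefix)
    else:
        yield prefix, doc


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", str(text).lower())


class InvertedIndex:
    """Hypothetical toy index: token -> trace ids, and (path, value) -> trace ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.fields = defaultdict(set)

    def add(self, trace_id, doc):
        for path, value in flatten(doc):
            self.fields[(path, value)].add(trace_id)
            for token in tokenize(value):
                self.postings[token].add(trace_id)

    def search(self, text="", filters=None):
        """Return ids that contain every token in `text` and match every filter."""
        candidate_sets = [self.postings.get(t, set()) for t in tokenize(text)]
        candidate_sets += [
            self.fields.get(item, set()) for item in (filters or {}).items()
        ]
        result = None
        for ids in candidate_sets:
            result = set(ids) if result is None else result & ids
        return sorted(result or [])
```

A query intersects one posting set per token and per filter, so cost scales with the size of the matching sets rather than with the size of the stored documents.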
Most AI agent tools run on servers, limiting access to browser APIs, device capabilities, and frontend state. Discover how LangChain headless tools enable secure client-side tool execution for modern agent applications.
Most agent tools only see the backend, missing browser and device capabilities.
Headless tools bring client-side capabilities into the agent loop as first-class tools.
This tutorial shows how to build an AI-powered equipment repair assistant using Amazon Bedrock AgentCore, helping farmers and field technicians diagnose problems, identify parts, and access repair procedures via natural language. The solution uses AgentCore Runtime with Strands Agents SDK, Amazon Nova 2 Lite as the foundation model, Amazon Bedrock Knowledge Base for RAG, and AgentCore Memory for conversation persistence.
Build an AI repair assistant supporting natural language diagnostics and repair guidance.
Uses Amazon Bedrock AgentCore, Strands Agents SDK, and Nova 2 Lite model.
In this post, we demonstrate how a hands-free first notice of loss (FNOL) intake system combines agents built with the Strands Agents SDK for domain reasoning with Amazon Bedrock AgentCore Browser Tool for live portal interaction. This approach preserves human expertise while removing repetitive screen work.
Combines Strands Agents for domain reasoning with Amazon Bedrock AgentCore Browser Tool for browser automation.
Nova Act drives portal interactions while Strands agents perform evidence interpretation and correlation.
Steve Yegge has 40 years of coding experience, including nearly two decades at Amazon and Google. Known for his influential essays, he now explores multi-agent orchestration with projects like Gas Town and Gas City, and offers AI transformation consulting.
40 years of coding experience, with nearly 20 years at Amazon and Google.
Author of influential essays since 2004, impacting companies and programmers.
Learn how to build a real-time AI voice agent for emergency helplines using LangChain, AssemblyAI, and OpenAI. The agent listens to caller distress, triages the situation, dispatches emergency services, and keeps the caller calm—all without typing or menus.
Use AssemblyAI for real-time speech-to-text transcription with partial and final transcripts.
The AI agent (ARIA) uses LangChain and LangGraph for reasoning and tool use, including location lookup, emergency dispatch, human escalation, and calming protocols.
Pizx is a fork of zx with native Pi AI integration, offering 15 AI agent patterns for shell scripting, AI text generation, coding agents, and orchestration topologies. It includes quick query, script writing, and advanced features like per-phase model selection.
Pizx forks zx and integrates Pi AI with 15 agent patterns.
Supports quick queries, script writing, and coding agents for tasks like code review and auto-fix.
Formalizing complex reasoning from natural text is a central challenge in computational linguistics. Current Argument Mining techniques identify basic claims and premises but struggle with richer structures required by advanced schemas like the Carneades Argumentation Framework (CAF). We introduce CAF-Gen, an automated multi-agent framework that enriches shallow argument structures into CAF-compliant models using an iterative Creator-Reviewer pipeline. Experiments show the iterative feedback loop improves data quality and achieves strong alignment with original annotations.
CAF-Gen is a multi-agent system that enriches basic argument structures into the advanced Carneades Argumentation Framework.
It uses an iterative Creator-Reviewer pipeline to ensure structural integrity.
Yafei Lee, founder of OpenClacky, an open-source AI agent in Ruby, shares how building features like skills, memory, sub-agents, browser automation, dynamic model switching, and long-running sessions led to severe prompt caching issues. Over two years and three architecture generations (first two failed), they converged on seven engineering decisions that achieved 90%+ cache hit rates. The article details the failures of RAG and multi-agent orchestration, and the first three decisions: double cache markers, frozen system prompt, and single meta-tool.
Every agent feature introduces a cache invalidation surface, reducing cache hit rates.
First-generation RAG failed due to high cost, staleness, and insufficient recall.
AI agents need secure execution environments. LangSmith Sandboxes provide hardware-virtualized microVMs, giving each agent a full computer with fast startup and persistent state, enabling code generation, data analysis, CI workflows, and more.
Agents require real computer environments (filesystem, shell, package manager) but direct infrastructure access is dangerous.
Container isolation is insufficient against kernel exploits; hardware-level separation is necessary.
LangGraph provides built-in primitives for retries, timeouts, and error handling to build resilient AI agents. The post explains how to use RetryPolicy, TimeoutPolicy, and error_handler, and demonstrates the SAGA pattern for multi-step workflows with side effects.
LangGraph offers three fault tolerance primitives: RetryPolicy, TimeoutPolicy, and error_handler.
These attach directly to nodes, enabling per-step configuration of automatic retries with backoff.
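The retry-with-backoff and SAGA ideas can be sketched in plain Python without any framework. This is not LangGraph's API: `run_with_retry` and `run_saga` are hypothetical helpers, and the sketch only shows the control flow, assuming each step is paired with a compensation that undoes its side effect.

```python
import time


def run_with_retry(step, max_attempts=3, base_delay=0.0, sleep=time.sleep):
    """Call `step` until it succeeds or attempts run out, doubling the delay each time."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))


def run_saga(steps):
    """Run (action, compensation) pairs in order.

    If an action still fails after its retries, run the compensations of the
    steps that already completed, in reverse order, and report failure.
    """
    completed = []
    try:
        for action, compensate in steps:
            run_with_retry(action)
            completed.append(compensate)
    except Exception:
        for compensate in reversed(completed):
            compensate()
        return False
    return True
```

Retries cover transient faults inside a single step; compensation covers the case where a later step fails for good and earlier side effects have to be rolled back.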
This article explains how AI agents are transforming data science workflows, automating routine tasks, and requiring new skills such as system design, tool integration, and agent observability. It covers frameworks like LangGraph, AutoGen, and smolagents, the shift from procedural to evaluative work, and emerging roles.
The agentic era is here: AI agents autonomously plan, execute multi-step tasks, and evaluate results, redefining data science.
Data scientists need new skills: system design, prompt engineering, tool design, agent observability, and multi-agent architecture.
A developer experiments with embedding a Swift interpreter (SwiftScript) to replace Bash in an AI agent framework, achieving a more controlled and secure execution environment while maintaining sandboxing.
SwiftScript is an embeddable tree-walking Swift interpreter that avoids compilation steps.
ShellKit provides a controlled runtime environment with sandboxing and file access restrictions.
Explore why model neutrality is critical for AI agents. Learn how labs lock you in at the harness layer—and why a neutral, open-source framework is the answer.
Model neutrality is more important than cloud neutrality due to faster model iteration cycles.
AI labs are replicating cloud-era lock-in strategies at the agent harness layer.
JackHamr is a cloud platform that provides hosted environments, specialist AI agents, and full pipeline orchestration to help teams ship software faster. Agents have personality and autonomy, handling end-to-end tasks from spec to release. Developers interact via chat or voice, and the platform supports custom LLMs, skills, and flexible resource configurations.
Named AI agents with personality and full autonomy from spec to ship.
Quick-provisioning dev environments with VS Code, Docker, Git integration.
Lookspan is a local-first observability dashboard for AI agents, supporting MCP, LangGraph, CrewAI, and OpenTelemetry. All data stays in local SQLite, no cloud required. Features include real-time tracing, cost tracking, alerts, replay evaluation, and dataset experiments. Launch with one command.
Local-first: data never leaves your machine, zero infrastructure cost.
Supports multiple AI agent frameworks including MCP, LangGraph, CrewAI, and OpenTelemetry.
This article explores building custom agent harnesses using LangChain's create_agent and middleware. A harness is the scaffolding connecting a model to the real world; customizing it is key to agent usefulness. Middleware hooks into the agent loop at each step, enabling deterministic logic, tool lifecycle management, custom state, and stream handling. Task-harness fit determines effectiveness.
Agent = model + harness; harness determines usability.
create_agent provides the core loop; middleware enables customization at every step.
Harmonic rebuilt their AI Scout using Deep Agents and LangSmith, achieving a 4x increase in user retention and transforming the tool from a rigid search interface to a trusted advisor that handles complex investment queries.
Scout V1 was a rigid LangGraph pipeline requiring extensive evals; V2 uses a single frontier model with two tool categories, simplifying architecture.
The new UX allows users to interact naturally, generating visualizations and search results that the agent can reference, creating a shared source of truth.
LiteHarness is a unified SDK that provides a single TypeScript and Python interface for multiple AI agent harnesses, including Claude Agent SDK and OpenAI Agents SDK. It allows easy switching between harnesses and models, and supports streaming messages. The project is in preview.
Unified interface for Claude Agent and OpenAI Agents harnesses.
Deep Agents' RubricMiddleware adds a self-evaluation loop to your agent runs. Set a rubric, configure a grader, and get reliable outputs on tasks where correctness matters.
Agents often produce outputs that need multiple attempts to get right.
RubricMiddleware lets agents self-evaluate and correct based on a rubric.
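A self-evaluation loop of this kind can be sketched as follows. This is not the RubricMiddleware API: `run_with_rubric` is a hypothetical helper, the grader is reduced to plain predicates on the output, and the agent is any callable that accepts the list of failed criteria as feedback.

```python
def run_with_rubric(generate, rubric, max_rounds=3):
    """Regenerate until every rubric criterion passes or the rounds run out.

    generate(feedback) -> str, where feedback lists the criteria that failed
    in the previous round. rubric maps a criterion name to a predicate on the
    output. Returns (output, rounds_used, passed).
    """
    feedback = []
    output = ""
    for round_number in range(1, max_rounds + 1):
        output = generate(feedback)
        feedback = [name for name, passes in rubric.items() if not passes(output)]
        if not feedback:
            return output, round_number, True
    return output, max_rounds, False
```

Capping the number of rounds keeps the loop from running indefinitely when the agent cannot satisfy a criterion.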
HCompany releases Holo3.1, a major upgrade to its computer use agent model family, enhancing robustness across desktop, mobile, and agent frameworks, and introducing quantized checkpoints for local inference.
Holo3.1 improves robustness across web, desktop, and mobile environments, with significant gains on AndroidWorld. It also introduces function-calling protocol support for better integration with third-party agent stacks.
New model sizes (0.8B to 35B-A3B) and quantized checkpoints (FP8, Q4 GGUF, NVFP4) offer cost-effective and private deployment options.
Multi-objective molecular optimization requires searching vast chemical spaces under conflicting objectives. Existing methods rely on a single policy or fixed scalarization, limiting trade-off representation. We propose ATOM, a multi-agent framework that formulates molecular optimization as tree-structured search. Agents coordinate along different paths, maintaining alternative evolution trajectories. Global memory supports balanced exploration. Experiments show improved Pareto coverage and hypervolume over strong baselines.
ATOM is a multi-agent framework for molecular optimization using tree-structured search.
Agents coordinate along different paths to explore diverse trade-offs.
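For readers unfamiliar with the two metrics named above, the sketch below computes a Pareto front and a two-objective hypervolume. It is a generic illustration, not ATOM's evaluation code: it assumes both objectives are maximized, that every point lies above the reference point, and that candidates are plain score tuples.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def hypervolume_2d(front, reference=(0.0, 0.0)):
    """Area dominated by a two-objective front, measured from `reference`."""
    area, best_y = 0.0, reference[1]
    # Sweep from the largest first objective down, adding each new horizontal strip.
    for x, y in sorted(front, key=lambda p: p[0], reverse=True):
        if y > best_y:
            area += (x - reference[0]) * (y - best_y)
            best_y = y
    return area
```

A larger hypervolume means the front covers more of the trade-off space, which is why it is a common summary for methods that aim to represent many trade-offs rather than a single scalarized optimum.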
The Next Era of Knowledge Work report explores how Codex is transforming productivity through AI-powered research, data analysis, workflow automation, and content creation.
BrandOS is an AI marketing operating system that turns brand knowledge, campaign history, and marketing intelligence into a company brain, enabling automated content generation, brand safety compliance, and cross-platform campaign orchestration.
BrandOS centralizes brand rules, legal constraints, and past campaigns into executable marketing intelligence.
It offers 24/7 competitor monitoring and daily marketing intelligence briefings.
A structured six-step LLMOps roadmap covering observability, evaluation, cost control, and agent orchestration to build production-grade LLM systems. The LLMOps market is projected to grow from $1.97 billion in 2024 to $4.9 billion by 2028 at a 42% CAGR.
LLMOps differs from traditional MLOps in prompt versioning, non-deterministic output evaluation, and cost optimization.
Foundational skills required: Python, LLM fundamentals, cloud infrastructure, and version control discipline.
MAVEN (Modular Agentic Verification and Execution Network) is a lightweight symbolic reasoning scaffold designed to enhance generalization in tool-calling environments through structured decomposition, adaptive tool orchestration, and intermediate verification. On the MAVEN-Bench stress test, MAVEN improves the GPT-OSS-120b base model from 48% to 71% accuracy without additional training, using an open-weight backbone at roughly 1/10 the cost of proprietary baselines.
MAVEN is a lightweight symbolic reasoning scaffold for improving generalization in agentic tool calling.
On MAVEN-Bench, MAVEN boosts GPT-OSS-120b accuracy from 48% to 71% without extra training.
OWASP Agent Memory Guard is a runtime defense layer that screens every read and write to AI agent memory, blocking prompt injection, secret leakage, and integrity tampering. It is the OWASP reference implementation for ASI06: Memory Poisoning. Supports LangChain, OpenAI Agents, AutoGen, and more. Benchmark: 92.5% recall, 0% false positive.
Agent Memory Guard is an OWASP Incubator Project focused on preventing AI agent memory poisoning.
It provides runtime defense by screening memory reads and writes, detecting prompt injection, secret leakage, and tampering.
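The idea of screening every memory read and write can be sketched as a thin wrapper around a key-value store. This is not the Agent Memory Guard API: the `GuardedMemory` class and both regular expressions are hypothetical, and real detectors would be far broader than these two illustrative patterns.

```python
import hashlib
import re

# Illustrative patterns only; assumed for this sketch, not taken from the project.
INJECTION = re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE)
SECRET = re.compile(r"\b(sk-[A-Za-z0-9]{16,}|AKIA[0-9A-Z]{16})\b")


class GuardedMemory:
    """Hypothetical memory store that screens writes and verifies reads."""

    def __init__(self):
        self._store = {}  # key -> (value, sha256 hex digest recorded at write time)

    def write(self, key, value):
        if INJECTION.search(value):
            raise ValueError("blocked: possible prompt injection")
        if SECRET.search(value):
            raise ValueError("blocked: possible secret")
        self._store[key] = (value, hashlib.sha256(value.encode()).hexdigest())

    def read(self, key):
        value, digest = self._store[key]
        # Integrity check: the stored value must still match its recorded digest.
        if hashlib.sha256(value.encode()).hexdigest() != digest:
            raise ValueError("blocked: entry was modified outside the guard")
        return value
```

Checking on both paths matters: write-time screening stops poisoned content from being stored, while the read-time digest catches entries altered after they were accepted.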
Autolang is a scripting language designed for AI agents to write code safely, quickly, and at low cost. It acts as an orchestration layer, allowing AI to call predefined wrapped functions while preventing unauthorized actions through static compilation and runtime restrictions.
Autolang is a lightweight compiler for safely executing short AI-generated scripts.
It prevents common AI errors like infinite loops and null pointer access via static analysis and opcode limits.
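The runtime half of that safety model, an opcode budget plus an allowlist of wrapped functions, can be sketched with a toy stack machine. This is not Autolang's instruction set or compiler: the four opcodes, the `run` function, and `StepLimitExceeded` are assumptions made up for illustration, and the static-analysis side is not shown.

```python
class StepLimitExceeded(Exception):
    """Raised when a script uses more instructions than its budget allows."""


def run(program, functions, max_steps=1000):
    """Execute (op, arg) instructions with a hard step budget.

    Ops: PUSH value; CALL name (pops one argument, pushes the result);
    JUMP index; HALT. Only names present in `functions` may be called.
    """
    stack, pc, steps = [], 0, 0
    while pc < len(program):
        steps += 1
        if steps > max_steps:
            raise StepLimitExceeded(f"stopped after {max_steps} steps")
        op, arg = program[pc]
        pc += 1
        if op == "PUSH":
            stack.append(arg)
        elif op == "CALL":
            if arg not in functions:
                raise PermissionError(f"function not allowed: {arg}")
            stack.append(functions[arg](stack.pop()))
        elif op == "JUMP":
            pc = arg
        elif op == "HALT":
            break
        else:
            raise ValueError(f"unknown op: {op}")
    return stack
```

Because every instruction counts against the budget, an accidental infinite loop ends with an error instead of hanging the host, and a script can only reach the outside world through the functions the host chose to expose.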
Anthropic introduced dynamic workflows in Claude Code, but the author argues that a task-based architecture surpasses session-based approaches for team engineering. This post explains why task trees scale from small fixes to large migrations and why orchestration should be a substrate, not a mode.
Anthropic's dynamic workflows signal a shift from single prompts to orchestration in coding agents.
The author advocates for task and task tree architecture over sessions for durable team work.
This article introduces LangChain's Interpreter Skills, an extension to agent skills that includes a TypeScript module for deterministic execution. Agents can import and run the module inside an interpreter, enabling reliable and evaluable workflows such as GitHub issue triage.
Interpreter skills extend traditional skills with a TypeScript module executable in an interpreter.
Deterministic parts are coded, while the model decides when to invoke them, improving reliability and evaluation.
This article describes a macroeconomic research agent built with Deep Agents, LangSmith, and the You.com Finance Research API. It analyzes GDP data across all 27 EU member states, detects anomalies, and produces a cited briefing in approximately 45 minutes. The report details the anomalous growth in Ireland and contraction in Germany, emphasizing the importance of traceability and auditability.
The AI agent analyzes GDP data for all 27 EU countries in about 45 minutes at an API cost of roughly $2.20.
Ireland's 12.3% GDP growth is driven by pharma export front-loading, while Germany faces structural contraction from automotive and construction sectors.
VFEAgent is an end-to-end multi-agent system that automates finite element analysis (FEA) modeling and simulation directly from input images and problem descriptions. It combines a multimodal vision-language multi-agent pipeline with a verification-first code synthesis framework, using ReAct-driven reasoning to extract structured FEA specifications and incorporating self-debugging and fallback mechanisms for executability and physical validity. Experiments show high success rates in generating complete, physically valid simulations, outperforming LLM-based baselines in reliability and correctness, and promising to free engineers from tedious manual analysis.
VFEAgent automates FEA modeling and simulation from images and problem descriptions.
Employs a multimodal vision-language multi-agent pipeline with ReAct-driven reasoning.
This post combines learnings from LangChain’s work on evaluating deep agents and Anthropic’s guide to demystifying evals for AI agents into a practical guide. You will learn how to apply five evaluation patterns for deep agents, build offline evaluations using pytest and LangSmith, and configure online monitoring for production. The walkthrough uses a text-to-SQL deep agent with Amazon Bedrock for the full development to production lifecycle.
Agent evaluations face challenges: non-determinism, error propagation, and creative solutions.
Introduces three grader types: code-based, model-based (LLM-as-judge), and human graders, with recommendations for combining them.
The article explores the shift from tightly coupled local developer workflows to asynchronous background agents in AI coding, highlighting the December 2025 model inflection that made spec-to-PR workflows practical, and delving into the architecture, security, testing, memory, and multi-agent orchestration behind Devin and OpenInspect.
Background agents are becoming mainstream; Devin's merged PR share grew from 16% to 80% on Cognition repos.
The December 2025 model upgrades (Opus 4.5/GPT 5.2) enabled agents to autonomously go from specification to a complete pull request.
As of mid-2026, seven major AI agent frameworks (DSPy, Claude Agent SDK, OpenAI Agents SDK, CrewAI, AutoGen, LangGraph, Google ADK) vary in design philosophy, architecture, and production readiness. LangGraph leads in production deployments, Claude Agent SDK offers the deepest single-provider capabilities, OpenAI Agents SDK provides the cleanest multi-agent handoffs, and CrewAI excels in developer velocity. The market is projected to grow from $7.84B in 2025 to $52.62B by 2030.
LangGraph has the most mature durable execution model, deployed by ~400 enterprises.
Claude Agent SDK offers the most powerful single-provider capabilities but is locked to Anthropic models.
Recapping two days of Interrupt 2026 — LangSmith Engine, Sandboxes GA, LangChain Labs, and 23 talks from teams at LinkedIn, Rippling, Cisco, and more. Now on demand.
LangSmith Engine automates failure analysis from production traces.
LangSmith Sandboxes reaches General Availability for secure agent execution.
LangChain's April newsletter announces product updates for LangSmith including 30+ evaluator templates, cost alerting, and Fleet RBAC/ABAC. Deep Agents now supports one-command deployment. Interrupt 2026 conference agenda is released. Customer stories feature Credit Genie and Cisco achieving significant efficiency gains.
LangSmith introduces 30+ evaluator templates, cost alerts, and RBAC/ABAC for tools.
Deep Agents launches `deepagents deploy` for single-command production deployment.
Lyft used LangGraph and LangSmith to build a self-serve AI agent platform for customer support, cutting agent development from months to weeks. The platform empowers non-technical domain experts to build agents via prompts and configuration, with a router-based multi-agent architecture and robust evaluation pipeline.
Lyft moved agent development closer to domain experts by letting ops teams, VoC leads, and product managers define agents through prompts and configuration.
A router-based multi-agent architecture with LangGraph routes rider and driver requests across specialized subagents with safety checks and state management.
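The router pattern described above can be sketched in a few lines of plain Python. This is not Lyft's implementation or LangGraph code: the keyword classifier, the subagent names, and the canned replies are all hypothetical, and a production router would more likely use a model to classify requests.

```python
def is_safety_issue(message):
    """Hypothetical safety check that runs before any routing decision."""
    return any(word in message.lower() for word in ("accident", "unsafe", "emergency"))


def classify(message):
    """Hypothetical keyword router; stands in for a model-based classifier."""
    text = message.lower()
    if "refund" in text or "charge" in text:
        return "payments"
    if "lost" in text or "left my" in text:
        return "lost_item"
    return "general"


# Each subagent receives the shared state and returns a reply.
SUBAGENTS = {
    "payments": lambda state: "Opening a fare review.",
    "lost_item": lambda state: "Contacting your driver about the item.",
    "general": lambda state: "Here is a help article.",
}


def handle(message, role):
    """Route one rider or driver request and return the updated state."""
    state = {"role": role, "message": message, "route": None, "reply": None}
    if is_safety_issue(message):
        state["route"] = "human_escalation"
        state["reply"] = "Connecting you to a safety specialist."
        return state
    state["route"] = classify(message)
    state["reply"] = SUBAGENTS[state["route"]](state)
    return state
```

Keeping the safety check ahead of the router means sensitive requests never reach an automated subagent, and the shared state dictionary gives every step a single record to read from and write to.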
AI models have plateaued on raw intelligence, and the next gains come from what you build around them. The AI agent harness provides tools, memory, and human-in-the-loop capabilities to transform LLMs into useful digital assistants. Companies like Google, LangChain, OpenAI, and Anthropic offer different solutions.
AI intelligence gains are plateauing; agent harnesses are the new frontier.
Agent harnesses add tools, memory, and human oversight to LLMs.
Learn how to build a multi-agent campaign review system that demonstrates parallel reasoning, context persistence, and traceable execution paths using an integrated architecture combining NVIDIA NIM for GPU-accelerated inference, Amazon Bedrock AgentCore for managed runtime, and Strands Agents for serverless orchestration.
Combines NVIDIA NIM, Amazon Bedrock AgentCore, and Strands Agents for high-performance multi-agent AI.
Enables parallel reasoning, context persistence, and traceable execution.