A deep dive into Modal's new ultra-low-latency Serverless Servers product, explaining the architecture decisions behind it: a custom proxy (fprs) built on Pingora, Envoy at the edge, and Spanner for configuration, all optimized for LLM inference workloads.
Modal introduces Serverless Servers for ultra-low-latency HTTP/WebSocket/gRPC traffic.
Unlike Web Functions, Servers sacrifice queuing and retries for lower latency.
Modal and Decagon collaborated to cut inference latency by 100ms using speculative decoding, outperforming proprietary providers. The article lays out the low-latency playbook: reducing communication, host-overhead, prefill, and decode latency, with custom speculative decoding models (DFlash) delivering the largest gains.
Modal Auto Endpoints achieve low latency via speculative decoding, leveraging Blackwell GPUs, SGLang engine, and Modal Servers.
Speculative decoding reduces decode latency by parallelizing token generation, with efficiency depending on acceptance length.
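The draft-then-verify loop behind these takeaways can be sketched with toy models. This is a minimal illustration of greedy speculative decoding, not the Modal/Decagon implementation: `toy_target`, `toy_draft`, and the helper names are hypothetical, and a real engine verifies all drafted tokens in one batched forward pass rather than in a Python loop.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def toy_target(context: List[int]) -> int:
    """Hypothetical stand-in for the large model: a deterministic next token."""
    return (31 * sum(context) + len(context)) % 50


def toy_draft(context: List[int]) -> int:
    """Hypothetical stand-in for the speculator: wrong whenever len(context) % 4 == 0.

    It calls toy_target only to keep the toy short; a real draft model is a
    separate, cheaper network.
    """
    guess = toy_target(context)
    return (guess + 1) % 50 if len(context) % 4 == 0 else guess


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Draft k tokens, keep the prefix the target agrees with, then add one target token."""
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(context + proposal))

    accepted: List[int] = []
    for token in proposal:
        expected = target(context + accepted)
        if token != expected:
            accepted.append(expected)  # first mismatch: emit the target's token and stop
            return accepted
        accepted.append(token)
    accepted.append(target(context + accepted))  # all k accepted: bonus token
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             n_tokens: int, k: int) -> Tuple[List[int], int]:
    """Return n_tokens generated tokens and the number of verification steps used."""
    output = list(prompt)
    steps = 0
    while len(output) - len(prompt) < n_tokens:
        output.extend(speculative_step(target, draft, output, k))
        steps += 1
    return output[len(prompt):len(prompt) + n_tokens], steps


def greedy(target: NextToken, prompt: List[int], n_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    output = list(prompt)
    for _ in range(n_tokens):
        output.append(target(output))
    return output[len(prompt):]
```

Calling `generate(toy_target, toy_draft, [1, 2, 3], 16, 4)` yields the same tokens as `greedy`, but in fewer verification steps than tokens. The more drafted tokens the target accepts per step, the fewer steps are needed, which is why acceptance length drives the speedup.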
Modal launches Auto Endpoints, a self-serve on-ramp to production-grade LLM inference, allowing users to deploy frontier open models with a single command and gain full visibility and control over inference code, metrics, and infrastructure. Built on Modal's AI infrastructure platform, it features high-performance autoscaling, a custom container runtime, global GPU availability, and Modal Servers for ultra-low-latency routing (5ms overhead). It ships with pre-tuned recipes, drawn from experience with top-tier teams, and DFlash speculative decoding. The roadmap includes full automation of inference engineering.
Auto Endpoints enable one-command deployment of open models (e.g., GLM 5.2) with complete ownership of the inference stack.
Auto Endpoints provide engine-level observability, exposing both server and inference metrics.
Modal is all-in on speculative decoding, arguing it's the single most important inference optimization, delivering 2-3x speedups. They released state-of-the-art DFlash speculators for Qwen models, achieving 5-20% extra speedups, and explain the theory, simulation, and math behind the acceleration.
Speculative decoding is the only engine optimization that matters for high-interactivity inference, delivering integer-multiple (2-3x) speedups.
Modal released new DFlash speculators for Qwen models, improving speed by 5-20% over strong baselines.
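As a rough companion to the math this article refers to, the sketch below uses the standard simplified model from the speculative sampling literature, not anything specific to DFlash. It assumes each drafted token is accepted independently with probability `alpha`, that `k` tokens are drafted per step, and that `draft_cost` is the cost of one draft pass relative to one target pass. All three names are illustrative assumptions, and the sketch does not reproduce the article's figures.

```python
import random


def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step: (1 - alpha^(k+1)) / (1 - alpha)."""
    if alpha >= 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Tokens per step divided by step cost (one target pass plus k draft passes)."""
    return expected_tokens_per_step(alpha, k) / (1.0 + k * draft_cost)


def simulate_tokens_per_step(alpha: float, k: int, trials: int, seed: int = 0) -> float:
    """Monte Carlo check of the closed form under the same independence assumption."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        accepted = 0
        while accepted < k and rng.random() < alpha:
            accepted += 1
        total += accepted + 1  # accepted drafts plus one token from the target
    return total / trials
```

Under this model, raising the acceptance rate matters more than drafting longer: at a fixed `alpha` the expected tokens per step is capped at `1 / (1 - alpha)` no matter how large `k` gets, so a better speculator raises the ceiling itself.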
This article explores the practical application of reinforcement learning in post-training large language models, highlighting that the current bottleneck is infrastructure rather than algorithms. Modal shares its experience running RL post-training at scale and introduces its open-source library to help teams address key challenges like multi-node training, environment management, and GPU utilization.
The bottleneck for RL post-training LLMs is infrastructure, including training engines, inference sandboxes, and environment isolation.
Multi-node training makes weight synchronization costly; RDMA and delta compression significantly reduce latency.
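The summary does not say which delta compression scheme is used, so the following is only a minimal sketch of the general idea: after a training step, send the inference workers only the weights that changed instead of the full tensor. The function names and the flat-list weight representation are assumptions made for illustration.

```python
from typing import Dict, List


def encode_delta(old: List[float], new: List[float], tol: float = 0.0) -> Dict[int, float]:
    """Return {index: new_value} for every weight that moved by more than tol."""
    if len(old) != len(new):
        raise ValueError("weight vectors must have the same length")
    return {i: n for i, (o, n) in enumerate(zip(old, new)) if abs(n - o) > tol}


def apply_delta(weights: List[float], delta: Dict[int, float]) -> List[float]:
    """Rebuild the updated weights on the receiving side from the sparse delta."""
    updated = list(weights)
    for index, value in delta.items():
        updated[index] = value
    return updated
```

With `tol=0.0` the round trip is exact. A larger tolerance shrinks the payload further at the cost of leaving small updates unsynchronized.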
Modal introduces Role-Based Access Control for all Team and Enterprise users, built on Environments to provide granular permissions for humans and AI agents.
RBAC is now available for all Team and Enterprise plan users, centered around Environments as secure boundaries.
Restricted Environments allow precise control over who can deploy and manage resources.
Modal raised $355M at a $4.65B valuation, led by General Catalyst and Redpoint. The company has grown fivefold since September, exceeding $300M in annualized revenue. Modal provides a cloud platform for AI, focusing on elastic inference, agent runtimes, and sandboxes. The funding will support expansion in low-latency inference, reinforcement learning, and agent compute.
Modal raised $355M at a $4.65B valuation, with General Catalyst and Redpoint leading the round.
The company has grown fivefold since September and surpassed $300M in ARR.
Applied Compute trains custom AI agents for enterprises using Reinforcement Learning, focusing on post-training to differentiate from commoditized frontier models. Their 'Specific Intelligence' approach leverages Modal's infrastructure for fast, flexible, and reliable RL training loops serving clients like DoorDash, Cognition, and Mercor.
Applied Compute uses RL to create custom AI agents for enterprises, emphasizing post-training as the competitive differentiator.
Their 'Specific Intelligence' approach trains agents on proprietary data, improving with each use.
Anthropic and Modal announce the integration of Claude Managed Agents with Modal Sandboxes, allowing developers to run tool calls in self-hosted, customizable sandboxes with fast startup, cost efficiency, and scalability. The collaboration enables secure, isolated execution environments for AI agents, with early adopters like Mason AI, DoorDash, and Blend sharing positive experiences.
Claude Managed Agents now integrates with Modal Sandboxes for custom, scalable agent execution.
Modal offers fast cold-starts, custom images, persistence options, and cost-efficient burst pricing.
Modal's deep engineering reduces GPU inference server boot times from kiloseconds to tens of seconds, enabling truly serverless computing for variable inference workloads.
Four key optimizations drive the improvement: cloud buffers, a custom filesystem, CPU checkpoint/restore, and CUDA checkpoint/restore.
Boot time dropped from ~2000s to ~50s (40x faster).
By profiling SGLang's scheduler, Modal engineers discovered a bottleneck from repeated CUDA IPC pool handle opens. Replacing them with a simple Python dictionary cache improved throughput by 16.2% and reduced latency by over 10% on Qwen2.5-VL-3B. The optimization is merged in SGLang v0.5.10.
SGLang's scheduler spent significant CPU time repeatedly opening CUDA IPC pool handles for multimodal inputs.
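The fix described here is a memoization pattern: pay for the expensive open once per handle and reuse the result. The sketch below shows that pattern in isolation. It is not the SGLang patch; `HandleCache` and `fake_open` are hypothetical names, and `fake_open` stands in for the real CUDA IPC open call.

```python
from typing import Callable, Dict, Generic, Hashable, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


class HandleCache(Generic[K, V]):
    """Dictionary-backed cache: each key is opened at most once."""

    def __init__(self, opener: Callable[[K], V]) -> None:
        self._opener = opener
        self._handles: Dict[K, V] = {}
        self.opens = 0  # number of times the expensive opener actually ran

    def get(self, key: K) -> V:
        if key not in self._handles:
            self._handles[key] = self._opener(key)
            self.opens += 1
        return self._handles[key]


def fake_open(handle_id: str) -> dict:
    """Hypothetical stand-in for an expensive IPC handle open."""
    return {"id": handle_id}
```

Repeated `cache.get("pool-0")` calls return the same object and leave `cache.opens` at 1, which is the saving the profile pointed to.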
AE Studio used Modal to compare Evolution Strategies and GRPO for training LLMs on math theorem proving. By leveraging Modal's parallel GPUs, sandboxed verification, and volume storage, they reduced setup time by 60% and cost by up to 75%. Early results show ES matching or outperforming GRPO in several scenarios, especially with limited data.
AE Studio implemented Evolution Strategies (ES) alongside GRPO for theorem proving on Modal.
Modal's .map() for parallel GPU inference, Sandboxes for isolated verification, and Volumes for model storage streamlined the infrastructure.
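For readers unfamiliar with Evolution Strategies, here is a toy sketch of the basic update. It uses antithetic sampling on a three-parameter quadratic objective and is in no way AE Studio's training code. `es_optimize`, `toy_fitness`, `TARGET`, and all hyperparameters are illustrative assumptions. The key property is that ES needs only fitness evaluations, not gradients, so each perturbed candidate can be scored independently and in parallel.

```python
import random
from typing import Callable, List

TARGET = [1.0, -2.0, 0.5]  # hypothetical optimum for the toy objective


def toy_fitness(params: List[float]) -> float:
    """Higher is better: negative squared distance to TARGET."""
    return -sum((p - t) ** 2 for p, t in zip(params, TARGET))


def es_optimize(fitness: Callable[[List[float]], float], theta: List[float],
                iterations: int = 100, pairs: int = 20, sigma: float = 0.1,
                lr: float = 0.05, seed: int = 0) -> List[float]:
    """Gradient-free ascent: estimate the gradient from paired random perturbations."""
    rng = random.Random(seed)
    theta = list(theta)
    dim = len(theta)
    for _ in range(iterations):
        grad = [0.0] * dim
        for _ in range(pairs):
            eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            plus = fitness([t + sigma * e for t, e in zip(theta, eps)])
            minus = fitness([t - sigma * e for t, e in zip(theta, eps)])
            scale = (plus - minus) / (2.0 * sigma)  # finite difference along eps
            for i in range(dim):
                grad[i] += scale * eps[i]
        theta = [t + lr * g / pairs for t, g in zip(theta, grad)]
    return theta
```

Starting from `[0.0, 0.0, 0.0]`, the returned parameters score far better on `toy_fitness` than the starting point. In an LLM setting each `fitness` call would be a batch of model rollouts, which is where parallel GPU inference comes in.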
Modal becomes an official sandbox provider for the OpenAI Agents SDK. This article demonstrates how to build a custom coding agent harness from scratch, integrating Modal sandboxes for secure, parallel, and scalable automation, using the Parameter Golf challenge as an example.
Modal is an official sandbox provider for the OpenAI Agents SDK, offering isolated, scalable environments.
The article walks through building an agent harness step by step, including basic agent, sandboxing, memory, subagents, async parallelism, and snapshots.
Modal integrates with Autoresearch to provide elastic GPU scaling, allowing AI agents to dynamically provision compute resources. In a Parameter Golf challenge, an agent ran 113 experiments across 238 GPU-hours, achieving 5x speedup over a single workstation while using a fraction of a dedicated cluster's resources.
Modal enables agents to seamlessly scale from single GPUs to dozens of H100s, adapting to workload demands.
The Parameter Golf agent completed core training runs 5x faster than a single workstation, with efficient resource utilization.
Modal announces that Butter, an AI sandbox technology company, is joining Modal. Founder Erik Dunteman and researcher Raymond Tana will join the Modal Sandbox team. Butter's expertise includes agent harness engineering and the development of bVisor, a lightweight ephemeral sandbox built with Zig.
Butter team joins Modal to advance agent engineering and sandbox products.
Founder Erik Dunteman has a long history with Modal, including co-founding Banana.
Physical Intelligence uses Modal to achieve low-latency remote real-time inference for robots, with a specialized QUIC-based transport adding only 10-15 ms of network overhead, enabling experimentation with larger models.
Physical Intelligence develops a Vision-Language-Action (VLA) model for general-purpose robotics, and every robot arm movement flows through continuous inference. The team runs that inference remotely on Modal, but TCP tunnels introduce jitter into control loops.
Physical Intelligence and Modal built a custom QUIC-based portal over UDP with NAT traversal, reducing network overhead to ~10-15 ms.
Modal announces product updates including NVIDIA RTX Pro 6000 Blackwell GPU support, a Command K palette in the dashboard, the Sandbox Filesystem API Beta, SDK improvements, and customer stories.
RTX Pro 6000 Blackwell is now available with 96GB VRAM for inference and fine-tuning.
The Command K palette adds navigation shortcuts and jump-to-object-by-ID in the dashboard.
Runway partners with Modal to power real-time inference for Runway Characters, a real-time video agent API that generates expressive digital personas from a single image. Modal's serverless platform enables low-latency, multi-GPU inference globally, allowing Runway to move from proof of concept to production in under 30 days.
Runway partners with Modal for real-time inference of Runway Characters.
Runway Characters is a real-time video agent API built on GWM-1.
Doppel, an AI-native cybersecurity platform, migrated its ML workflows to Modal to accelerate experimentation and simplify inference deployment. Training experiments now run in parallel, shortening feedback loops. Inference builds dropped from 30 minutes to under a minute, with automatic scaling for traffic spikes.
Doppel used Modal to parallelize ML training experiments, drastically shortening iteration cycles.
Modal's image caching and persistent volumes cut model deployment build times by up to 10x.