AI News Hub · LIVE
Public articles: 20 | Collected articles: 23 | Trust: 82 | Refresh: 120 min
Health: Healthy | Source type: Official | Full-text rights: Official full text | Last ingested: 2026-06-25 | ID: modal-blog | Status: Enabled

Official AI infrastructure blog; confirm reuse terms before full body display.

Latest public articles

Routing for serverless servers with Pingora, Envoy, and Spanner

A deep dive inside Modal's new ultra-low-latency serverless server product, explaining the architecture decisions behind building a custom proxy (fprs) using Pingora, Envoy at the edge, and Spanner for configuration, all optimized for LLM inference workloads.

  • Modal introduces Serverless Servers for ultra-low-latency HTTP/WebSocket/gRPC traffic.
  • Unlike Web Functions, Servers sacrifice queuing and retries for lower latency.

Achieve state-of-the-art inference latencies with speculative decoding

Modal and Decagon collaborated to cut inference latency by 100ms using speculative decoding, outperforming proprietary providers. The article details the low-latency playbook, covering communication, host overhead, prefill, and decode latencies, with a focus on custom speculative decoding models (DFlash) as the largest source of gains.

  • Modal Auto Endpoints achieve low latency via speculative decoding, leveraging Blackwell GPUs, SGLang engine, and Modal Servers.
  • Speculative decoding reduces decode latency by parallelizing token generation, with efficiency depending on acceptance length.
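
To make the dependence on acceptance length concrete, here is a minimal sketch of the standard textbook model for draft-then-verify decoding. It assumes each drafted token is accepted independently with probability `accept_prob`, and that drafting one token costs `draft_cost` relative to one target-model pass. These names and the independence assumption are illustrative; they are not taken from the article and may not describe how DFlash drafts tokens.

```python
def expected_tokens_per_step(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Assumption: every drafted token is accepted independently with
    probability accept_prob, and each verification pass always yields one
    extra token (the correction or bonus token).
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_prob == 1.0:
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_prob ** (draft_len + 1)) / (1.0 - accept_prob)


def estimated_speedup(accept_prob: float, draft_len: int, draft_cost: float) -> float:
    """Rough speedup over plain decoding under the same assumptions.

    draft_cost is the (hypothetical) cost of drafting one token, measured
    as a fraction of one target-model forward pass.
    """
    step_cost = 1.0 + draft_len * draft_cost
    return expected_tokens_per_step(accept_prob, draft_len) / step_cost
```

Under this model, an 80% acceptance rate with 4 drafted tokens gives about 3.36 tokens per verification pass, which is why a higher acceptance length translates directly into lower decode latency.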

Introducing Modal Auto Endpoints: Optimized inference you actually own

Modal launches Auto Endpoints, a self-serve on-ramp to production-grade LLM inference that lets users deploy frontier open models with a single command while keeping full visibility into and control over inference code, metrics, and infrastructure. Built on Modal's AI infrastructure platform, it features high-performance autoscaling, a custom container runtime, global GPU availability, and Modal Servers for ultra-low-latency routing (5ms overhead). Pre-tuned recipes, informed by top-tier team experience, and DFlash speculative decoding are included. The roadmap includes full automation of inference engineering.

  • Auto Endpoints enable one-command deployment of open models (e.g., GLM 5.2) with complete ownership of the inference stack.
  • Engine-level observability with server and inference metrics exposed.

Speculation Is All You Need

Modal is all-in on speculative decoding, arguing it's the single most important inference optimization, delivering 2-3x speedups. They released state-of-the-art DFlash speculators for Qwen models, achieving 5-20% extra speedups, and explain the theory, simulation, and math behind the acceleration.

  • Modal argues that speculative decoding is the only engine optimization that matters for high-interactivity inference, delivering integer-multiple speedups.
  • Modal released new DFlash speculators for Qwen models, improving speed by 5-20% over strong baselines.

Reinforcement Learning is an Infrastructure Problem

This article explores the practical application of reinforcement learning to post-training large language models, arguing that the current bottleneck is infrastructure rather than algorithms. Modal shares its experience running RL post-training at scale and introduces its open-source library to help teams address key challenges such as multi-node training, environment management, and GPU utilization.

  • The bottleneck for RL post-training of LLMs is infrastructure, including training engines, inference sandboxes, and environment isolation.
  • Multi-node training makes weight synchronization costly; RDMA and delta compression significantly reduce latency.
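
The idea behind delta compression can be illustrated with a minimal sketch: instead of shipping every weight after each update, the sender transmits only the entries that changed. The functions below are hypothetical and operate on plain Python lists; a real system would work on GPU tensors and move the payload over RDMA, and the article's library may use a different encoding.

```python
from typing import Dict, List, Sequence


def encode_delta(old: Sequence[float], new: Sequence[float], tol: float = 0.0) -> Dict[int, float]:
    """Return {index: new_value} for entries that changed by more than tol."""
    if len(old) != len(new):
        raise ValueError("weight vectors must have the same length")
    return {i: n for i, (o, n) in enumerate(zip(old, new)) if abs(n - o) > tol}


def apply_delta(weights: Sequence[float], delta: Dict[int, float]) -> List[float]:
    """Reconstruct the updated weights on the receiving side."""
    updated = list(weights)
    for index, value in delta.items():
        updated[index] = value
    return updated


# Only one of four weights changed, so only one entry needs to be sent.
old_weights = [0.1, 0.2, 0.3, 0.4]
new_weights = [0.1, 0.25, 0.3, 0.4]
delta = encode_delta(old_weights, new_weights)
synced = apply_delta(old_weights, delta)
```

Storing the new value rather than the numeric difference keeps the reconstruction exact; the payload shrinks in proportion to how few weights changed between synchronizations.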

Role-Based Access Control for Humans and Agents

Modal introduces Role-Based Access Control for all Team and Enterprise users, built on Environments to provide granular permissions for humans and AI agents.

  • RBAC is now available for all Team and Enterprise plan users, centered around Environments as secure boundaries.
  • Restricted Environments allow precise control over who can deploy and manage resources.

Modal's Series C: Raising $355M at a $4.65B valuation

Modal raised $355M at a $4.65B valuation, led by General Catalyst and Redpoint. The company has grown fivefold since September, exceeding $300M in annualized revenue. Modal provides a cloud platform for AI, focusing on elastic inference, agent runtimes, and sandboxes. The funding will support expansion in low-latency inference, reinforcement learning, and agent compute.

  • Modal raised $355M at a $4.65B valuation, with General Catalyst and Redpoint leading the round.
  • The company has grown fivefold since September and surpassed $300M in ARR.

Scaling Reinforcement Learning at Applied Compute

Applied Compute trains custom AI agents for enterprises using Reinforcement Learning, focusing on post-training to differentiate from commoditized frontier models. Their 'Specific Intelligence' approach leverages Modal's infrastructure for fast, flexible, and reliable RL training loops serving clients like DoorDash, Cognition, and Mercor.

  • Applied Compute uses RL to create custom AI agents for enterprises, emphasizing post-training as the competitive differentiator.
  • Their 'Specific Intelligence' approach trains agents on proprietary data, improving with each use.

Introducing Claude Managed Agents with Modal Sandboxes

Anthropic and Modal announce the integration of Claude Managed Agents with Modal Sandboxes, allowing developers to run tool calls in self-hosted, customizable sandboxes with fast startup, cost efficiency, and scalability. The collaboration enables secure, isolated execution environments for AI agents, with early adopters like Mason AI, DoorDash, and Blend sharing positive experiences.

  • Claude Managed Agents now integrates with Modal Sandboxes for custom, scalable agent execution.
  • Modal offers fast cold-starts, custom images, persistence options, and cost-efficient burst pricing.

How to achieve truly serverless GPUs

Modal's deep engineering reduces GPU inference server boot times from kiloseconds to tens of seconds, enabling truly serverless computing for variable inference workloads.

  • Four key optimizations: cloud buffers, a custom filesystem, CPU checkpoint/restore, and CUDA checkpoint/restore.
  • Boot time reduced from ~2000s to ~50s (40x faster).

Boosting multimodal inference performance by >10% with a single Python dictionary

By profiling SGLang's scheduler, Modal engineers discovered a bottleneck from repeated CUDA IPC pool handle opens. Replacing them with a simple Python dictionary cache improved throughput by 16.2% and reduced latency by over 10% on Qwen2.5-VL-3B. The optimization is merged in SGLang v0.5.10.

  • SGLang's scheduler spent significant CPU time repeatedly opening CUDA IPC pool handles for multimodal inputs.
  • A Python dictionary cache eliminated redundant _new_shared_cuda calls, reducing scheduler overhead.
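
The general pattern described here, caching the result of an expensive open call in a dictionary keyed by the handle's identity, can be sketched as follows. This is an illustration only: `HandleCache` and `open_fn` are hypothetical names, and the code is not the patch merged into SGLang.

```python
from typing import Callable, Dict, Generic, Hashable, TypeVar

H = TypeVar("H")


class HandleCache(Generic[H]):
    """Memoize an expensive open call so each key is opened at most once."""

    def __init__(self, open_fn: Callable[[Hashable], H]) -> None:
        self._open_fn = open_fn              # the expensive operation (assumed)
        self._handles: Dict[Hashable, H] = {}

    def get(self, key: Hashable) -> H:
        # Open on the first request only; later requests reuse the handle.
        if key not in self._handles:
            self._handles[key] = self._open_fn(key)
        return self._handles[key]


# Stand-in for an expensive IPC open; it records how often it is called.
open_calls = []

def fake_open(key):
    open_calls.append(key)
    return f"handle-for-{key}"

cache = HandleCache(fake_open)
first = cache.get("pool-0")
second = cache.get("pool-0")   # served from the dictionary, no second open
third = cache.get("pool-1")
```

A production version would also need an eviction or invalidation rule for handles whose underlying memory is freed; the sketch omits that.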

Building an RL Theorem-Proving Workflow on Modal

AE Studio used Modal to compare Evolution Strategies and GRPO for training LLMs on math theorem proving. By leveraging Modal's parallel GPUs, sandboxed verification, and volume storage, they reduced setup time by 60% and cost by up to 75%. Early results show ES matching or outperforming GRPO in several scenarios, especially with limited data.

  • AE Studio implemented Evolution Strategies (ES) alongside GRPO for theorem proving on Modal.
  • Modal's .map() for parallel GPU inference, Sandboxes for isolated verification, and Volumes for model storage streamlined the infrastructure.
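
For readers unfamiliar with Evolution Strategies, the following toy sketch shows the core loop: perturb the parameters with random noise, score each perturbation, and move in the direction that scored better, with no backpropagation. It optimizes a single scalar with antithetic (mirrored) sampling; the function name, hyperparameters, and quadratic reward are assumptions for illustration and do not reflect AE Studio's setup, which applies ES to LLM weights.

```python
import random
from typing import Callable


def es_maximize(
    reward: Callable[[float], float],
    x0: float,
    sigma: float = 0.1,   # noise scale for perturbations
    lr: float = 0.05,     # step size
    pairs: int = 20,      # mirrored perturbation pairs per step
    steps: int = 100,
    seed: int = 0,
) -> float:
    """Maximize a scalar reward using only reward evaluations."""
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        grad_estimate = 0.0
        for _ in range(pairs):
            eps = rng.gauss(0.0, 1.0)
            # Compare a perturbation with its mirror image to reduce variance.
            diff = reward(x + sigma * eps) - reward(x - sigma * eps)
            grad_estimate += diff / (2.0 * sigma) * eps
        x += lr * grad_estimate / pairs
    return x


# Toy reward with its maximum at x = 3.
best = es_maximize(lambda x: -(x - 3.0) ** 2, x0=0.0)
```

Because each perturbation is scored independently, the inner loop parallelizes naturally across workers, which is what makes ES a good fit for fan-out primitives such as the `.map()` call mentioned above.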

Building with Modal and the OpenAI Agents SDK

Modal becomes an official sandbox provider for the OpenAI Agents SDK. This article demonstrates how to build a custom coding agent harness from scratch, integrating Modal sandboxes for secure, parallel, and scalable automation, using the Parameter Golf challenge as an example.

  • Modal is an official sandbox provider for the OpenAI Agents SDK, offering isolated, scalable environments.
  • The article walks through building an agent harness step by step, including basic agent, sandboxing, memory, subagents, async parallelism, and snapshots.

Autoscaling Autoresearch: Give your agents elastic GPUs on Modal

Modal integrates with Autoresearch to provide elastic GPU scaling, allowing AI agents to dynamically provision compute resources. In a Parameter Golf challenge, an agent ran 113 experiments across 238 GPU-hours, achieving 5x speedup over a single workstation while using a fraction of a dedicated cluster's resources.

  • Modal enables agents to seamlessly scale from single GPUs to dozens of H100s, adapting to workload demands.
  • The Parameter Golf agent completed core training runs 5x faster than a single workstation, with efficient resource utilization.

Butter is joining Modal

Modal announces that Butter, an AI sandbox technology company, is joining Modal. Founder Erik Dunteman and researcher Raymond Tana will join the Modal Sandbox team. Butter's expertise includes agent harness engineering and the development of bVisor, a lightweight ephemeral sandbox built with Zig.

  • Butter team joins Modal to advance agent engineering and sandbox products.
  • Founder Erik Dunteman has a long history with Modal, including co-founding Banana.

Real-time inference for robots at Physical Intelligence

Physical Intelligence uses Modal to achieve low-latency remote real-time inference for robots, with a specialized QUIC-based transport adding only 10-15 ms of network overhead, enabling experimentation with larger models.

  • Physical Intelligence develops a Vision-Language-Action (VLA) model for general-purpose robotics, and every robot arm movement flows through continuous inference. The team runs that inference remotely on Modal, but TCP tunnels introduce jitter that disrupts control loops.
  • Physical Intelligence and Modal built a custom QUIC-based portal over UDP with NAT traversal, reducing network overhead to ~10-15ms.

Product Updates: RTX Pro 6000 Blackwell, Command K, Sandbox FS API and more

Modal announces product updates including NVIDIA RTX Pro 6000 Blackwell GPU support, Command K palette in dashboard, Sandbox Filesystem API Beta, SDK improvements, and customer stories.

  • RTX Pro 6000 Blackwell is now available with 96GB VRAM for inference and fine-tuning.
  • Command K palette provides navigation shortcuts and object ID jump in dashboard.

Runway Chooses Modal to Power Real-Time Inference for Runway Characters

Runway partners with Modal to power real-time inference for Runway Characters, a real-time video agent API that generates expressive digital personas from a single image. Modal's serverless platform enables low-latency, multi-GPU inference globally, allowing Runway to move from proof of concept to production in under 30 days.

  • Runway partners with Modal for real-time inference of Runway Characters.
  • Runway Characters is a real-time video agent API built on GWM-1.

How Doppel eliminated ML infrastructure tax with Modal

Doppel, an AI-native cybersecurity platform, migrated its ML workflows to Modal to accelerate experimentation and simplify inference deployment. Training experiments became parallel, reducing feedback loops. Inference builds dropped from 30 minutes to under a minute, with automatic scaling for traffic spikes.

  • Doppel used Modal to parallelize ML training experiments, drastically shortening iteration cycles.
  • Modal's image caching and persistent volumes cut model deployment build times by up to 10x.

Product Updates: Directory Snapshots, GLM-5, Billing updates and more

A roundup of everything Modal shipped in February: Directory Snapshots for Sandboxes, a free GLM-5 endpoint, a new billing API, and more.

  • Directory Snapshots allow snapshotting specific directories, separating dependencies from application code.
  • Free GLM-5 endpoint available until end of April, great for coding agents.
