Updated: 2026-06-18 · 5 min read

Six Numbers from Running 1,500 AI Agents Simultaneously

This post presents six key metrics from running 1,500 fully KVM-isolated virtual machines simultaneously on a single AWS c6i.metal instance for AI agent tasks, including VM warm launch time, memory density, DNS cache performance, network latency, TLS session reuse, and throughput, demonstrating the cost and benefits of a lightweight isolation architecture.

Source: Hacker News AI · Author: amitlimaye

May 12, 2026

This is post 4 of a series.

The first three posts in this series built the foundation: Post 1 showed how we rewrite every syscall in a binary at load time so the guest can never bypass the runtime. Post 2 explained why we run with no guest kernel — just a 36KB shim that handles the ~30 syscalls an agent actually needs. Post 3 covered the pool daemon: how VMs are pre-warmed, how connections are pre-established, how the runtime sits in every agent’s path.

Post 4 is later than I’d like. The gap wasn’t writer’s block; it was the work itself: getting 1,500 VMs to run stably, getting the pool daemon to hold TLS sessions under real load, and chasing down the snapshot portability bugs that only show up when you move from a dev box to a different CPU family on AWS. This post exists because the system is now stable enough to measure reliably.

This post is what you can do with all of that — what it actually looks like running at scale.

Every number here comes from running 1,500 fully KVM-isolated virtual machines simultaneously on a single AWS c6i.metal. Not containers. Not shared-kernel processes. Each agent in its own VM, with its own hardware isolation boundary.

The question I wanted to answer: what does that actually cost? In memory, latency, and throughput, is the isolation overhead small enough to disappear into the noise, or do you pay for it everywhere?

Here’s what the operations dashboard showed.

Time to launch a warm VM — 0.42s

The pool has to be ready before anything else matters.

Warming 1,500 KVM VMs the naive way (fork, exec, boot, import) takes about 200ms per agent. Run sequentially, that’s 1,500 × 200ms = 300 seconds: five minutes before the first job runs.

What we do instead: run Python through its full import sequence once, freeze the VM at that point, and save the result as a snapshot. 99MB of non-zero pages — interpreter state, loaded modules, pre-resolved imports. At pool startup, all 1,500 VMs restore from that snapshot in parallel. Physical pages are shared across VMs until an agent writes to one. Each agent pays no import cost.
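The page sharing described above is copy-on-write. The series hasn’t shown the runtime’s restore code, so this is a user-space analogy only: it maps one snapshot file several times with Python’s `mmap.ACCESS_COPY`, which requests a private copy-on-write mapping. Restoring guest memory from a privately mapped snapshot file is a common VMM technique, but whether this runtime does exactly that is an assumption, and the file name and sizes here are made up.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def write_snapshot(path: str, pages: int) -> None:
    """Write a toy 'snapshot': page i starts with the integer i."""
    with open(path, "wb") as f:
        for i in range(pages):
            f.write(i.to_bytes(8, "little").ljust(PAGE, b"\0"))


def restore(path: str) -> mmap.mmap:
    """Map the snapshot copy-on-write (a private mapping).

    Until a page is written, reads are served from the shared page cache;
    the first write gives this mapping its own private copy of that page.
    """
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)


def cow_demo(n_vms: int = 4, pages: int = 8) -> list[int]:
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "python.snapshot")  # hypothetical file name
        write_snapshot(path, pages)
        vms = [restore(path) for _ in range(n_vms)]
        vms[0][3 * PAGE] = 0xFF  # "VM 0" dirties one byte of page 3
        seen = [vm[3 * PAGE] for vm in vms]  # what each mapping now reads there
        for vm in vms:
            vm.close()
        with open(path, "rb") as f:
            f.seek(3 * PAGE)
            on_disk = f.read(1)[0]  # the snapshot file itself is untouched
    return seen + [on_disk]


if __name__ == "__main__":
    print(cow_demo())
```

`cow_demo()` returns `[255, 3, 3, 3, 3]`: only the mapping that wrote sees its change, while the other three mappings and the file on disk still read the original byte.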

One thing worth being clear about: this snapshot only includes the Python standard library. User-level imports — anthropic, openai, requests, whatever your agent uses — still happen at agent startup. We could snapshot later in the boot sequence, after those imports, and squeeze the dispatch time down further. The tradeoff is that you’d end up with a different snapshot per agent type rather than one universal Python image shared across the entire fleet. We chose the universal image — 0.42s is fast enough, and one image is operationally simpler.

The bar chart shows how the 1,500 VMs were distributed across 100ms time buckets. Most landed in the first two buckets. Total wall time: 0.42 seconds.
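The chart isn’t reproduced here, but the bookkeeping behind it is simple. A minimal sketch, assuming each VM reports a completion time in milliseconds since pool start (the sample values are invented, not the measured data): bucket completions into 100ms bins, and note that parallel wall time is set by the slowest VM, whereas the naive sequential path pays the sum.

```python
from collections import Counter


def bucketize(done_ms: list[float], bucket_ms: int = 100) -> dict[int, int]:
    """Count VMs per time bucket, keyed by the bucket's start time in ms."""
    counts = Counter(int(t // bucket_ms) * bucket_ms for t in done_ms)
    return dict(sorted(counts.items()))


def parallel_wall_time_s(done_ms: list[float]) -> float:
    """Restores run in parallel, so wall time is the slowest VM, not the sum."""
    return max(done_ms) / 1000


def sequential_time_s(n_vms: int, per_vm_ms: float) -> float:
    """The naive path: every VM boots and imports one after another."""
    return n_vms * per_vm_ms / 1000


# Made-up completion times (ms since pool start), not measured data.
sample = [12.0, 57.5, 99.9, 100.0, 180.2, 420.0]
print(bucketize(sample))             # {0: 3, 100: 2, 400: 1}
print(parallel_wall_time_s(sample))  # 0.42
print(sequential_time_s(1500, 200))  # 300.0
```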

We have no guest kernel. The shim is ~36KB. Nothing else.

Running density — 10.8 agents/GB

This is the number I get asked about most.

1,500 KVM-isolated VMs — each making active TLS calls to an LLM API — fit in 139GB of physical RAM. That’s 10.8 agents per GB, not idle, not at rest, but under real load. The figure includes everything an agent accumulates at runtime: Python heap, open TLS sessions, in-flight request buffers, stack frames, syscall state.

The ps RSS reads higher — ~290GB summed across all VMM processes. That number is misleading: the 99MB Python snapshot is shared across all 1,500 workers and RSS accounting counts it once per process, not once per physical page. The 139GB on the dashboard is physical RAM, measured directly from the host.
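On Linux, one way to total memory across processes without multiplying shared pages is PSS (proportional set size), which splits each shared page evenly among the processes that map it; `/proc/<pid>/smaps_rollup` reports both `Rss` and `Pss`. The post says only that the 139GB was measured directly from the host, so treat this sketch as an independent cross-check, not the author’s method. The list of VMM process IDs is an input you would supply.

```python
import os


def parse_rollup(text: str) -> dict[str, int]:
    """Parse 'Key:   123 kB' lines from /proc/<pid>/smaps_rollup into {key: kB}."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        # The first line is an address-range header; it doesn't match this shape.
        if sep and len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def total_rss_pss_kb(pids: list[int]) -> tuple[int, int]:
    """Sum Rss (shared pages counted per process) and Pss (shared pages split)."""
    rss = pss = 0
    for pid in pids:
        try:
            with open(f"/proc/{pid}/smaps_rollup") as f:
                fields = parse_rollup(f.read())
        except OSError:
            continue  # process exited, or we lack permission to read it
        rss += fields.get("Rss", 0)
        pss += fields.get("Pss", 0)
    return rss, pss


if __name__ == "__main__":
    # Hypothetical usage: pass the VMM process IDs; here, just this process.
    print(total_rss_pss_kb([os.getpid()]))
```

Summed over processes that share a snapshot, the `Rss` total overshoots in exactly the way the post describes, while the `Pss` total approximates the physical footprint.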

A warm idle VM costs 3MB; the 10.8 agents/GB figure is measured under real load. Your number will fall somewhere between those two bounds, depending on what your agents do and how much heap they accumulate at runtime.
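The density figures reduce to a few lines of arithmetic. This sketch reuses the post’s numbers and assumes decimal units (1 GB = 1,000 MB), which the post doesn’t specify. The last function is a rough model of the RSS discrepancy: if all 99MB of the snapshot is resident in every VMM process, summing RSS adds 1,499 extra copies on top of the physical total, which lands near the ~290GB that `ps` reported.

```python
MB_PER_GB = 1000  # assumption: decimal units; the post doesn't specify


def agents_per_gb(n_agents: int, physical_gb: float) -> float:
    return n_agents / physical_gb


def mb_per_agent(n_agents: int, physical_gb: float) -> float:
    return physical_gb * MB_PER_GB / n_agents


def idle_agents_per_gb(idle_mb: float) -> float:
    """Upper bound on density if every VM stayed at its warm idle footprint."""
    return MB_PER_GB / idle_mb


def summed_rss_gb(n_agents: int, physical_gb: float, shared_mb: float) -> float:
    """Rough model of the ps total: physical RAM holds the shared snapshot once,
    but each of the n RSS figures includes it, so n - 1 extra copies appear.
    Assumes every shared page is resident in every process."""
    return physical_gb + (n_agents - 1) * shared_mb / MB_PER_GB


print(round(agents_per_gb(1500, 139), 1))   # 10.8
print(round(mb_per_agent(1500, 139), 1))    # 92.7
print(round(idle_agents_per_gb(3)))         # 333
print(round(summed_rss_gb(1500, 139, 99)))  # 287
```

Under load, then, each agent averages about 93MB, roughly 30 times its idle footprint.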

DNS cache — 100% hit rate, 260.7K hits

A native Linux process gets OS-level DNS caching. Move that process into a VM with its own kernel and you typically lose it — the guest runs its own resolver, makes its own upstream queries, pays its own latency.

Our guests have no kernel. DNS queries exit through the pool daemon, which sits in the path before any UDP packet leaves the host. The pool maintains a shared cache across all 1,500 agents — so each guest gets exactly what a native process gets, and more: a single upstream query serves the entire fleet.

If 50 agents cold-miss the same name simultaneously, one query goes out and all 50 get the answer. The upstream resolver sees one query per name per TTL window, regardless of fleet size.
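Coalescing concurrent misses like this is often called “singleflight”. The pool daemon’s code isn’t shown, so this is a minimal asyncio sketch assuming one event loop and a hypothetical `upstream` coroutine that returns addresses and a TTL: the first miss for a name starts the query, later callers for the same name await that same future, and the answer is cached until the TTL expires.

```python
import asyncio
import time
from typing import Awaitable, Callable

# Hypothetical upstream resolver: name -> (addresses, ttl_seconds)
Upstream = Callable[[str], Awaitable[tuple[list[str], float]]]


class CoalescingDnsCache:
    """TTL cache where concurrent misses for one name share a single upstream query."""

    def __init__(self, upstream: Upstream, clock: Callable[[], float] = time.monotonic):
        self._upstream = upstream
        self._clock = clock
        self._cache: dict[str, tuple[list[str], float]] = {}  # name -> (addrs, expiry)
        self._inflight: dict[str, asyncio.Future] = {}
        self.hits = self.coalesced = self.upstream_queries = 0

    async def resolve(self, name: str) -> list[str]:
        entry = self._cache.get(name)
        if entry is not None and entry[1] > self._clock():
            self.hits += 1
            return entry[0]
        fut = self._inflight.get(name)
        if fut is not None:
            self.coalesced += 1  # a query for this name is already in flight
            return await fut
        fut = asyncio.get_running_loop().create_future()
        self._inflight[name] = fut
        self.upstream_queries += 1
        try:
            addrs, ttl = await self._upstream(name)
            self._cache[name] = (addrs, self._clock() + ttl)
            fut.set_result(addrs)
        except Exception as exc:
            fut.set_exception(exc)  # waiters see the same failure
        finally:
            del self._inflight[name]
            if not fut.done():
                fut.cancel()  # leader was cancelled: don't strand the waiters
        return await fut


async def coalescing_demo() -> tuple[int, int, int]:
    async def upstream(name: str) -> tuple[list[str], float]:
        await asyncio.sleep(0.01)  # simulated upstream latency
        return ["192.0.2.10"], 60.0  # documentation address, 60s TTL

    cache = CoalescingDnsCache(upstream)
    # 50 agents cold-miss the same name at the same moment.
    answers = await asyncio.gather(*(cache.resolve("api.anthropic.com") for _ in range(50)))
    assert all(a == ["192.0.2.10"] for a in answers)
    await cache.resolve("api.anthropic.com")  # a later lookup: plain cache hit
    return cache.upstream_queries, cache.coalesced, cache.hits


if __name__ == "__main__":
    print(asyncio.run(coalescing_demo()))
```

`asyncio.run(coalescing_demo())` returns `(1, 49, 1)`: one upstream query, 49 agents that waited on it, and one later cache hit.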

If you’ve been following the series, this will sound familiar: for anything the pool can’t parse or handle — exotic record types, DNSSEC, anything non-standard — the query falls through to the native Linux stack. The same principle as the syscall design. Fast path for the common case, the real thing as the backstop.
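A minimal sketch of that split, assuming the daemon sees raw DNS queries in RFC 1035 wire format. The policy is illustrative, not the post’s: answer single-question IN queries for A or AAAA from the cache, allow an EDNS(0) OPT record (RFC 6891) only if its DNSSEC OK bit (RFC 3225) is clear, and send everything else, including anything that fails to parse, to the host resolver. `route` and `build_query` are hypothetical names.

```python
import struct
from typing import Optional

FAST_QTYPES = {1, 28}  # A, AAAA
QCLASS_IN = 1
TYPE_OPT = 41          # EDNS(0) pseudo-record (RFC 6891)
DO_BIT = 0x8000        # "DNSSEC OK" flag in the OPT record (RFC 3225)


def _skip_qname(msg: bytes, off: int) -> int:
    """Return the offset just past an uncompressed name; queries rarely compress."""
    while True:
        length = msg[off]
        if length == 0:
            return off + 1
        if length & 0xC0:
            raise ValueError("compressed or extended label")
        off += 1 + length


def route(query: bytes) -> str:
    """'fast' if the shared cache should answer, 'fallback' for the host resolver."""
    try:
        _id, flags, qd, an, ns, ar = struct.unpack_from("!6H", query, 0)
        if flags & 0x8000 or (flags >> 11) & 0xF:  # a response, or opcode != QUERY
            return "fallback"
        if qd != 1 or an or ns or ar > 1:
            return "fallback"
        off = _skip_qname(query, 12)
        qtype, qclass = struct.unpack_from("!HH", query, off)
        if qtype not in FAST_QTYPES or qclass != QCLASS_IN:
            return "fallback"
        if ar == 1:
            off += 4
            # Allow only an EDNS OPT record (root owner name) without the DO bit.
            if query[off] != 0:
                return "fallback"
            rtype, _size, _ext, edns_flags, _rdlen = struct.unpack_from("!5H", query, off + 1)
            if rtype != TYPE_OPT or edns_flags & DO_BIT:
                return "fallback"
        return "fast"
    except (struct.error, IndexError, ValueError):
        return "fallback"  # anything we can't parse goes to the native stack


def build_query(name: str, qtype: int, do_bit: Optional[bool] = None) -> bytes:
    """Build a test query; do_bit=None means no EDNS OPT record at all."""
    ar = 0 if do_bit is None else 1
    msg = struct.pack("!6H", 0x1234, 0x0100, 1, 0, 0, ar)  # RD set, one question
    msg += b"".join(bytes([len(p)]) + p.encode() for p in name.split(".")) + b"\0"
    msg += struct.pack("!HH", qtype, QCLASS_IN)
    if do_bit is not None:
        msg += b"\0" + struct.pack("!5H", TYPE_OPT, 1232, 0, DO_BIT if do_bit else 0, 0)
    return msg


print(route(build_query("api.anthropic.com", 1)))               # fast
print(route(build_query("api.anthropic.com", 1, do_bit=True)))  # fallback
print(route(build_query("example.com", 43)))                    # fallback (DS)
```

Defaulting to `fallback` on any parse failure keeps the fast path narrow: the cache only has to be correct for the queries it explicitly accepts.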

100% hit rate, 260,700 cache hits. In this run all 1,500 agents were hitting the same two endpoints — api.anthropic.com and a local Postman mock — so the cache saturates quickly. In a more diverse fleet the hit rate will be lower, but the coalescing benefit scales with agent count regardless. Isolation didn’t cost the agents their DNS cache — they got a better one.

Time to first network call — p50 44ms, p99 105ms

Agent code is deterministic. The same task hits the network at the same point every time. So the time from dispatch to first network call is almost entirely infrastructure — time the agent spent waiting to be ready, not working.

Dispatch to first syscall: p50 44ms · p95 86ms · p99 105ms · n = 1,500

Connect to first send (warm TLS): p50 70ms · p99 1.6s · n = 1,500

The dispatch number is pure infrastructure overhead — VM resume to first syscall. The connect number is how long until the first byte hits the wire over a pre-established TLS session. The p99 tail on connect reflects agents that hit Anthropic rate limiting before their first send.
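The post doesn’t say how its percentiles were computed. One common definition is nearest-rank: sort the n samples and take the value at rank ⌈p·n/100⌉. Under that definition, with n = 1,500 the p99 is the 1,485th-smallest sample, so at most 15 agents were slower than it. A minimal sketch, run on synthetic data:

```python
def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: the value at rank ceil(p/100 * n), 1-indexed."""
    if not samples or not 1 <= p <= 100:
        raise ValueError("need at least one sample and an integer p in 1..100")
    ordered = sorted(samples)
    rank = -(-p * len(ordered) // 100)  # integer ceil, avoids float rounding
    return ordered[rank - 1]


def latency_summary(samples: list[float]) -> dict:
    out = {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}
    out["n"] = len(samples)
    return out


# Synthetic data: 1,500 samples of 1..1500 ms (not the measured latencies).
print(latency_summary(list(range(1, 1501))))
# {'p50': 750, 'p95': 1425, 'p99': 1485, 'n': 1500}
```

Other definitions (linear interpolation, as in many statistics libraries) can differ slightly at the tail, which matters most for a p99 drawn from only 1,500 samples.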

For a single job, 44ms is noise. At 1,500 agents running continuously, it decides whether the infrastructure overhead stays in the noise or becomes a cost you can measure.

TLS cache — 338 sessions held

Every HTTPS call normally starts with a TCP handshake (10–50ms) and a TLS handshake (50–150ms). At 1,500 agents all hitting the same endpoint, that’s a lot of latency you pay on every cold connection.

The pool daemon sits in the network path between every agent and every upstream destination — which means it can own the TLS layer too.

At pool startup, the daemon establishes persistent TLS sessions to configured upstream endpoints. When an agent makes an HTTPS request, it writes plaintext into a ring buffer. The daemon handles TLS on its behalf, into an already-established session. No handshake. No round trip.
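The ring buffer’s layout isn’t described in the post. As a sketch under stated assumptions, here is the indexing logic of a single-producer/single-consumer byte ring: the agent side writes plaintext, and the daemon side drains it before encrypting into a held session. Head and tail are ever-increasing byte counters and the capacity is a power of two, so positions wrap with a mask. A real ring shared across the VM boundary would keep these indices in shared memory with memory barriers; this Python class only models the arithmetic, and `ByteRing` is a hypothetical name.

```python
class ByteRing:
    """Single-producer/single-consumer byte ring (indexing logic only)."""

    def __init__(self, capacity: int = 4096):
        if capacity <= 0 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two")
        self._buf = bytearray(capacity)
        self._mask = capacity - 1
        self._head = 0  # total bytes ever written; only the producer advances it
        self._tail = 0  # total bytes ever read; only the consumer advances it

    def used(self) -> int:
        return self._head - self._tail

    def free(self) -> int:
        return len(self._buf) - self.used()

    def write(self, data: bytes) -> int:
        """Agent side: copy as much of data as fits, return the byte count."""
        n = min(len(data), self.free())
        start = self._head & self._mask
        first = min(n, len(self._buf) - start)  # bytes before the wrap point
        self._buf[start:start + first] = data[:first]
        self._buf[:n - first] = data[first:n]   # remainder wraps to the front
        self._head += n
        return n

    def read(self, max_bytes: int) -> bytes:
        """Daemon side: take up to max_bytes of buffered plaintext."""
        n = min(max_bytes, self.used())
        start = self._tail & self._mask
        first = min(n, len(self._buf) - start)
        out = bytes(self._buf[start:start + first]) + bytes(self._buf[:n - first])
        self._tail += n
        return out


ring = ByteRing(8)
print(ring.write(b"GET /v1"))    # 7
print(ring.read(4))              # b'GET '
print(ring.write(b"/messages"))  # 5: only 5 bytes free; the caller retries the rest
print(ring.read(100))            # b'/v1/mess', reassembled across the wrap point
```

A full ring accepts only what fits, so the writer learns how many bytes went in and retries the remainder, which is the usual backpressure contract for this kind of buffer.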

338 sessions held across the fleet. 1,500 agents sharing 338 upstream connections. The API sees 338 connections. The agents see zero handshake latency.
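How requests map onto the 338 held sessions isn’t described. One simple model, assumed here, is a checkout pool: a request borrows an idle session, sends over it, and returns it, so with more agents than sessions, requests queue briefly instead of opening new connections. (Multiplexing several requests over one HTTP/2 connection would be another option.) `FakeSession` is a hypothetical stand-in for an established TLS session.

```python
import asyncio
import itertools


class FakeSession:
    """Hypothetical stand-in for an already-established upstream TLS session."""

    _ids = itertools.count()

    def __init__(self) -> None:
        self.id = next(self._ids)
        self.requests = 0

    async def send(self, plaintext: bytes) -> bytes:
        self.requests += 1
        await asyncio.sleep(0.001)  # simulated round trip; no handshake needed
        return b"200 OK"


class SessionPool:
    """Fixed set of sessions opened once; requests borrow and return them."""

    def __init__(self, size: int) -> None:
        self.sessions = [FakeSession() for _ in range(size)]
        self._idle: asyncio.Queue = asyncio.Queue()
        for s in self.sessions:
            self._idle.put_nowait(s)

    async def request(self, plaintext: bytes) -> bytes:
        session = await self._idle.get()  # waits only if every session is busy
        try:
            return await session.send(plaintext)
        finally:
            self._idle.put_nowait(session)


async def pool_demo(agents: int = 1500, sessions: int = 338) -> tuple[int, int]:
    pool = SessionPool(sessions)
    replies = await asyncio.gather(*(pool.request(b"POST /v1/messages") for _ in range(agents)))
    sessions_used = sum(1 for s in pool.sessions if s.requests)
    return len(replies), sessions_used


if __name__ == "__main__":
    print(asyncio.run(pool_demo()))
```

`asyncio.run(pool_demo())` returns `(1500, 338)`: every request is served, and the upstream only ever sees the 338 sessions opened at startup.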

One thing worth being direct about: the pool daemon reads the plaintext of every agent request. This is intentional — the same seam is what makes policy enforcement, credential injection, and response inspection possible. Whether that’s what you want depends on your threat model. It’s a design choice, not a side effect.

Network throughput — 33.2 MB/s in, 37.0 MB/s out

138,972 LLM calls completed at the time of this snapshot — cumulative since pool start, with 1,500 agents each making roughly one call per second.

33.2 MB/s in, 37.0 MB/s out. These numbers move with the fleet — this is a point-in-time snapshot from a live run, not a ceiling. At 1,500 agents the runtime isn’t anywhere near the network limit of a c6i.metal. The constraint here is the LLM API — in our case, Anthropic rate limiting. Agents spend most of their time waiting, not moving bytes.
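A back-of-envelope check of “nowhere near the network limit”, under two assumptions the post doesn’t state: decimal megabytes, and c6i.metal’s published 50 Gbps network bandwidth. Taking the post’s “roughly one call per second” per agent, the same numbers also give the average payload per call in each direction.

```python
NIC_GBPS = 50.0          # assumption: c6i.metal's published network bandwidth
AGENTS = 1500
CALLS_PER_AGENT_S = 1.0  # the post's "roughly one call per second"


def utilization(mb_per_s: float, nic_gbps: float = NIC_GBPS) -> float:
    """Fraction of link capacity used (decimal MB and Gb assumed)."""
    return mb_per_s * 8 / (nic_gbps * 1000)


def kb_per_call(mb_per_s: float, agents: int = AGENTS, rate: float = CALLS_PER_AGENT_S) -> float:
    """Average KB per call implied by fleet-wide throughput."""
    return mb_per_s * 1000 / (agents * rate)


for label, rate in (("in", 33.2), ("out", 37.0)):
    print(f"{label}: {utilization(rate):.2%} of the NIC, ~{kb_per_call(rate):.1f} KB per call")
```

Both directions come out at roughly half a percent of the link, with about 22KB in and 25KB out per call on average, consistent with agents that mostly wait on the API rather than move bytes.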

What the numbers add up to

Put them together and you get the lifecycle of a job on a running KVM fleet:

Pool warms up: 1,500 VMs in 0.42s, fitting in 10.8 agents/GB

Job arrives: 44ms to first network call

Agent resolves endpoint: <1µs DNS cache hit, 0ms TCP (pre-established)

Agent makes HTTPS call: 0ms TLS — 338 sessions held, agent writes plaintext

Response flows back: 33.2 MB/s sustained across the fleet

Each layer works the same way: the runtime is already in the path, so it caches the DNS response, holds the TLS session, shares the snapshot pages. Nothing required a separate system. The same mechanism that enforces isolation is the one doing the optimization.

KVM isolation doesn’t have to mean heavy. These numbers are the evidence.

At 1,500 agents the savings are measurable. At 10,000 they become structural — the difference between needing more machines and not.

The next post covers what becomes possible when the runtime sees every syscall, every DNS query, every TLS session, every model call, across every agent on the host simultaneously.

If you’re running agents in production and any of this is a live problem — reach out on LinkedIn.