2026-05-17 · Rewritten on-site · 6 min read · Updated: 2026-06-12

A cheap fix that saves the AI industry $400M a year and brings 4B people online

Codec is a novel protocol that keeps token IDs as the wire format end-to-end, eliminating repeated detokenization and re-tokenization in the AI inference stack. It reduces bytes on the wire by 16–1,700×, saving an estimated $400M/year in global AI costs and making AI accessible to ~5 billion people on slow or expensive connections.

Source: Hacker News AI · Author: Zombwaffle

v0.4.1 shipping · source-available

The control plane for AI inference.

AI inference is burning megawatts of GPU power, and datacenter buildout is racing to keep up. Meanwhile, your inference stack pays again at every hop, on top of the GPU bill. Models think in tokens, but the rest of the stack speaks text. Every gateway, router, tool dispatcher, and middleware in the path performs the same ritual: detokenize the model's IDs to text, encode as UTF-8, wrap in JSON, ship it, parse it, decode UTF-8, and re-tokenize back to IDs. That burns CPU, memory, and latency on lossy conversions the AI never asked for, and risks KV-cache corruption when the re-tokenize doesn't round-trip cleanly.

Codec is a drop-in upgrade that keeps token IDs as the wire format end-to-end: gateways forward IDs verbatim, tool dispatchers match on raw IDs, and cross-model handoffs translate vocabularies in-process. Same model, same prompts, same answers; typically 16× less data on the wire on real agent traffic, and up to ~1,700× when the content compresses well. How big the win is depends on what your AI generates; full receipts are below.

On mobile: snappier app, lighter cloud bill. At fleet scale: megawatt-hours of network energy and middleware CPU not burned on bytes nobody reads. Plug-in libraries for TypeScript, Python, Rust, Java, .NET, and C work with the AI servers you already use (sglang, vllm, llama.cpp). Your code doesn't change. We can't make the model smaller, but we can shrink the waste. And by shrinking the wire 1,000+×, Codec opens AI access to the ~5 billion people on slow, expensive, or metered connections that JSON-SSE prices out.

GitHub: wdunn001/Codec

~$400M+/yr total wire + GPU savings worldwide: ~$320M cloud egress (heavy-agent baseline; tool-use + A2A is the default at Claude/ChatGPT/Gemini) + ~$50M GPU on blocked prompts + ~$35M Starlink; sub-agent-heavy flows push to $500–700M/yr.

Up to 10× faster on mobile: 2K-token reply over 10 Mbps 4G.

~400 cars/yr off the road today, ~4,000 by 2030: counts bidirectional traffic + ~8 round-trips per visible reply (the heavy-agent topology every major provider runs today) + ~10% of doomed prompts blocked client-side.

~5B people for whom AI becomes accessible where it wasn't: 2.6B offline + 2–3B on slow / expensive mobile (ITU 2024).

Token IDs straight on the wire. Tool-call dispatch, observability, cross-vocab handoff — all the things you'd want to do at the inference layer reduce to integer compares on the stream. Detokenize becomes a byproduct, not a per-token cost.

control-plane primitives

Three operations. All on raw token IDs.

Codec gives the inference layer the same primitives a service mesh gives a microservice fleet: route, dispatch, translate. Run them on raw uint32 tokens, never on text. The compression you see in the receipts below is what falls out for free when you stop reserializing every hop.

Models think in tokens. Every middleware in your stack — gateway, router, log sink — speaks text, so it detokenizes, JSON-wraps, ships, parses, re-tokenizes — once per hop, burning CPU and risking KV-cache drift. Codec keeps token IDs as the wire format end-to-end; UTF-8 happens once, at the edge that actually displays text. Same compression options on top (gzip / brotli / dict-zstd). Same framing on every engine; six client languages decode byte-identically.
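To make the per-hop cost concrete, here is a minimal sketch that compares the bytes a hop ships when each token travels as its own JSON-SSE text event against the same tokens sent as packed little-endian uint32 IDs. The event shape, the count-prefixed frame, the token pieces, and the IDs are all assumptions made for illustration; this is not Codec's actual framing.

```python
import json
import struct

def json_sse_bytes(pieces):
    """Bytes on the wire when every token piece is sent as its own SSE event."""
    total = 0
    for piece in pieces:
        body = json.dumps({"choices": [{"delta": {"content": piece}}]})
        total += len(f"data: {body}\n\n".encode("utf-8"))
    return total

def pack_ids(ids):
    """Hypothetical frame: a uint32 count followed by little-endian uint32 IDs."""
    return struct.pack(f"<I{len(ids)}I", len(ids), *ids)

def unpack_ids(frame):
    """Inverse of pack_ids: read the count, then that many uint32 IDs."""
    (count,) = struct.unpack_from("<I", frame, 0)
    return list(struct.unpack_from(f"<{count}I", frame, 4))

pieces = ["The", " weather", " in", " Paris", " is", " sunny", "."]
ids = [791, 9282, 304, 12366, 374, 40798, 13]  # made-up IDs, one per piece

frame = pack_ids(ids)
assert unpack_ids(frame) == ids  # IDs survive the hop unchanged
print(json_sse_bytes(pieces), "bytes as JSON-SSE events")
print(len(frame), "bytes as packed IDs")  # 32: 4-byte count + 7 x 4-byte IDs
```

A gateway that only forwards such a frame never has to touch a tokenizer, which is the property the paragraph above describes.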

16–1700× less wire (workload-dependent)

3 engines, one wire

The MCP path normally tokenizes the tool result at the gateway, every call. A Codec-aware MCP server (codec-time-leaf) attaches token IDs to its result via _meta['ai.codec/leaf-tokenization']; the codec-metamcp gateway forwards them verbatim — [Codec][leaf] fires, the gateway becomes a transparent ID pipe, and the consumer skips its BPE re-tokenize. tools/list across a 40-tool namespace: 21.4 KB → 5.9 KB (3.6×). ToolWatcher detects tool boundaries on the raw ID stream at 26.7× the speed of detokenize+regex (lab EPYC, 481 Mtok/s).

12.4× leaf consumer CPU

26.7× ToolWatcher vs detok+regex
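Detecting tool boundaries on the raw ID stream can be sketched as a small state machine that compares each incoming ID against two marker IDs. The marker values and the function name below are hypothetical, and ToolWatcher's real interface may look quite different; the point is only that no detokenize or regex step is needed.

```python
from typing import Iterable, Iterator, List

TOOL_OPEN_ID = 128010   # hypothetical special-token ID that opens a tool call
TOOL_CLOSE_ID = 128011  # hypothetical special-token ID that closes a tool call

def watch_tool_calls(stream: Iterable[int]) -> Iterator[List[int]]:
    """Yield the token IDs found between each open/close marker pair."""
    inside = False
    current: List[int] = []
    for tok in stream:
        if not inside:
            if tok == TOOL_OPEN_ID:      # one integer compare per token
                inside = True
                current = []
        elif tok == TOOL_CLOSE_ID:
            inside = False
            yield current
        else:
            current.append(tok)

stream = [10, 11, TOOL_OPEN_ID, 500, 501, 502, TOOL_CLOSE_ID,
          12, TOOL_OPEN_ID, 7, TOOL_CLOSE_ID]
calls = list(watch_tool_calls(stream))
print(calls)  # [[500, 501, 502], [7]]
```

Tokens outside a marker pair are skipped without being decoded, so ordinary text never costs more than the compare.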

A Llama-3 agent's stream feeds a Qwen-2 agent through one in-process detokenize / retokenize step. UTF-8 never crosses the wire. At 2 K tokens the Codec path ships 15.1× fewer wire bytes (10.4 KB → 709 B) at bridge CPU within noise of the JSON-SSE+retokenize path. Both paths emit byte-identical Qwen-2 output; the bench asserts strict equality before reporting numbers.

15.1× smaller wire @ 2K

≡ byte-identical output
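The in-process handoff can be illustrated with two toy vocabularies: IDs from agent A are detokenized and immediately retokenized for agent B inside one function, so the intermediate text never leaves the process. Both vocabularies and the greedy longest-match encoder are stand-ins chosen for brevity; real Llama-3 and Qwen-2 tokenizers use BPE with their own merge rules.

```python
VOCAB_A = {1: "hello", 2: " world", 3: "!"}                      # toy source vocabulary
VOCAB_B = {"hel": 10, "lo": 11, " wor": 12, "ld": 13, "!": 14}   # toy target vocabulary

def detokenize(ids, vocab):
    """Concatenate the text piece behind each source ID."""
    return "".join(vocab[i] for i in ids)

def retokenize(text, vocab):
    """Greedy longest-match encoding; a stand-in for a real BPE encoder."""
    max_len = max(len(piece) for piece in vocab)
    out, pos = [], 0
    while pos < len(text):
        for size in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if piece in vocab:
                out.append(vocab[piece])
                pos += size
                break
        else:
            raise ValueError(f"no token covers text at offset {pos}")
    return out

def bridge(ids_a):
    """Translate agent A's IDs into agent B's IDs without shipping text."""
    return retokenize(detokenize(ids_a, VOCAB_A), VOCAB_B)

print(bridge([1, 2, 3]))  # [10, 11, 12, 13, 14]
```

Only the input and output ID lists would cross the wire in this arrangement; the string exists briefly inside `bridge`.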

The same wire format extends to diffusion models: VAE latents stream in length-prefixed binary frames instead of decoded pixels. The client runs vae_decode locally; pixels never touch the wire. Measured on the lab against codec-diffusers running SD-1.5: a 512×512 latent at int8 packs to 16.4 KB (~5–10× smaller than JPEG, ~90× smaller than raw fp16 pixels). The int4 pipeline halves it again. Pipeline math validates byte-for-byte against spec/PIPELINES.md.

3.9× int4 vs raw latent

~90× vs raw fp16 pixels

receipts

What falls out when the inference layer stays token-native.

Compression isn't the headline — the primitives are. But once every hop runs on raw uint32 token IDs, the wire reduction and the tool-call latency floor are measurable byproducts. Numbers below are from the cross-stack benchmark matrix: same prompt, same model, three real inference engines, six real client languages. Every cell is measured. Full SCHEMA-v1 result JSONs in packages/bench/results/.

MCP gateway

tools/list across 40 tools

| format | wire | vs JSON |
| --- | --- | --- |
| JSON-RPC | 21.4 KB | 1.0× |
| msgpack + gzip | 5.9 KB | 3.6× |

[Codec][leaf] log fires end-to-end on codec-time-leaf tool calls — gateway becomes a transparent ID pipe.

Latents (v0.3)

512×512 SD-1.5 latent on the wire

| pipeline | wire | vs raw |
| --- | --- | --- |
| raw fp16 | 32.4 KB | 1.0× |
| int8 | 16.4 KB | 2.0× |
| int4 | 8.4 KB | 3.9× |

For comparison: same image as a JPEG ~80–150 KB; raw fp16 pixels 1.5 MB. int4 = ~10× smaller than JPEG, ~180× vs raw pixels.
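The rough sizes in the latent table follow from the tensor shape alone. The sketch below assumes the usual SD-1.5 latent of 4×64×64 values for a 512×512 image and uses a simple symmetric linear quantizer with two int4 values per byte; the quantizer, the nibble order, and the random test data are assumptions, not the pipeline defined in spec/PIPELINES.md. These are payload-only sizes, so they land close to the measured wire figures without necessarily matching them exactly.

```python
import random
import struct

CHANNELS, HEIGHT, WIDTH = 4, 64, 64   # assumed SD-1.5 latent shape for 512x512
N = CHANNELS * HEIGHT * WIDTH

random.seed(0)
latent = [random.gauss(0.0, 1.0) for _ in range(N)]  # synthetic stand-in data

def quantize(values, bits):
    """Symmetric linear quantization to signed integers of the given width."""
    qmax = (1 << (bits - 1)) - 1
    scale = max(abs(v) for v in values) / qmax
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale

def pack_int8(q):
    """One signed byte per value."""
    return struct.pack(f"<{len(q)}b", *q)

def pack_int4(q):
    """Two signed 4-bit values per byte, low nibble first."""
    out = bytearray()
    for lo, hi in zip(q[0::2], q[1::2]):
        out.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(out)

fp16_bytes = struct.pack(f"<{N}e", *latent)   # 'e' is IEEE half precision
q8, _ = quantize(latent, 8)
q4, _ = quantize(latent, 4)
print(len(fp16_bytes), len(pack_int8(q8)), len(pack_int4(q4)))  # 32768 16384 8192
```

Each step halves the payload, which is the 2.0× and roughly 4× progression the table reports.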

Wire size per engine (chart series: JSON-SSE, Codec (identity), Codec + dict-zstd)

| engine | JSON-SSE | Codec encoding | Codec wire | reduction |
| --- | --- | --- | --- | --- |
| sglang | 485 KB | msgpack + dict-zstd | 291 B | 1,707× @ 44.7 ms |
| vllm | 518 KB | msgpack + gzip | 3.9 KB | 137× @ 59.0 ms |
| llama.cpp | 529 KB | msgpack + dict-zstd | 140 B | 3,868× @ 40.8 ms |
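Much of the gap between the identity and dictionary-compressed encodings comes from both ends already holding the byte patterns that recur across replies. The sketch below shows that idea with zlib's preset-dictionary support from the Python standard library, used here as a stand-in because zstd is not assumed to be installed; it is not the dict-zstd codec the benchmarks ran, and the token IDs and dictionary contents are invented.

```python
import struct
import zlib

def pack(ids):
    """Serialize IDs as little-endian uint32 values."""
    return struct.pack(f"<{len(ids)}I", *ids)

# Hypothetical ID sequence that recurs across replies (e.g. a tool-call preamble).
common = [128010, 5018, 609, 794, 330, 456, 70464, 498]
reply = common + [9125, 12366] + common + [374, 40798]

dictionary = pack(common)   # agreed out-of-band by sender and receiver
payload = pack(reply)

# Sender: compress with the shared dictionary primed into the window.
compressor = zlib.compressobj(level=9, zdict=dictionary)
with_dict = compressor.compress(payload) + compressor.flush()
without_dict = zlib.compress(payload, 9)

# Receiver: must supply the same dictionary to decode.
decompressor = zlib.decompressobj(zdict=dictionary)
restored = decompressor.decompress(with_dict)
assert restored == payload

print(len(payload), "raw,", len(without_dict), "zlib,", len(with_dict), "zlib + dictionary")
```

How much the dictionary helps depends entirely on how well it matches the traffic, which is consistent with the workload-dependent spread in the table above.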

Bridge response time — the latency the next agent waits on

CPU cost to turn the inbound stream into Qwen-2 IDs ready for agent B. Lower is better — this is the wall-clock the handoff blocks on, before agent B's first new token.

Wire bytes — the bandwidth the bridge has to ingest

What the bridge has to receive before any translation can run. Network-bound: the slower the link, the more this dominates response time on top of the bridge CPU above.

| Surface | JSON wire | Codec wire | Reduction | JSON total | Codec total | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| mock get_weather | 13,419 B | 794 B | 16.9× | 1,662 ms | 189 ms | 8.8× |
| SearXNG (live web) | 42,302 B | 2,348 B | 18.0× | 2,078 ms | 1,257 ms | 1.65× |
| MetaMCP gateway (Time MCP) | 18,072 B | 1,061 B | 17.0× | 210 ms | 216 ms | ~neutral |

| Path (get_current_time, ~30 char result) | wire (bytes) | consumer tokenize | total |
| --- | --- | --- | --- |
| plain MCP (consumer re-tokenizes text) | 105 | 0.052 ms | 0.5 ms |
| mcp-leaf (consumer reads ids from _meta) | 316 | 0.004 ms | 0.4 ms |
| delta | +211 bytes (leaf 3× larger on wire) | 12.4× faster | n/a |

The leaf _meta envelope is a fixed ~210-byte cost per text block; the consumer-CPU savings scale linearly with text length. The wire crossover where leaf ≤ plain sits at ~300+ characters per text block — timestamps pay a wire tax for the CPU win, while paginated docs / search results / large MCP outputs win on both axes. 20/20 integrity: every leaf sample's ids equal tokenizer.encode(text) under the declared map_id.
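On the consumer side, the leaf path amounts to reading IDs from `_meta` when they are present and were produced under the expected vocabulary, and falling back to tokenizing the text otherwise. In this sketch the `_meta` key comes from the text above, while the inner layout (`map_id` and `ids` fields), the `toy-bytes-v1` identifier, and the byte-level stand-in tokenizer are assumptions for illustration.

```python
LEAF_KEY = "ai.codec/leaf-tokenization"

def toy_encode(text):
    """Stand-in tokenizer: one ID per UTF-8 byte. A real consumer would run BPE."""
    return list(text.encode("utf-8"))

def ids_for_block(block, expected_map_id):
    """Return (ids, source), reusing leaf IDs only when the vocabulary matches."""
    leaf = block.get("_meta", {}).get(LEAF_KEY)
    if leaf and leaf.get("map_id") == expected_map_id:
        return leaf["ids"], "leaf"
    return toy_encode(block["text"]), "retokenized"

text = "2026-05-17T12:00:00Z"
plain_block = {"type": "text", "text": text}
leaf_block = {
    "type": "text",
    "text": text,
    "_meta": {LEAF_KEY: {"map_id": "toy-bytes-v1", "ids": toy_encode(text)}},
}

results = []
for block in (plain_block, leaf_block):
    ids, source = ids_for_block(block, "toy-bytes-v1")
    assert ids == toy_encode(block["text"])  # same integrity check as the 20/20 test
    results.append((source, len(ids)))
print(results)  # [('retokenized', 20), ('leaf', 20)]
```

Checking `map_id` before trusting the IDs matters because IDs from a different vocabulary would decode to different text.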

Codec is a wire + dispatch primitive, not an inference accelerator. The model still runs at the same TPS on the same GPU. The cost story is on the network (egress, mobile data, radio energy), the client CPU (BPE tokenize + JSON parse removed from the hot path), and the server CPU floor (response-side serialize + UTF-8 encode removed, raising the concurrent-request ceiling per GPU). It is NOT on GPU compute.

A 2K-token chat reply ships 485 KB JSON-SSE vs 291 B Codec on sglang — but per visible user reply, real bytes-out are ~4× that because every major AI platform now defaults to tool-use + agent-to-agent: initial response + final response + 2–3 tool requests/results + sub-agent handoffs that span regions. The "single chat reply" era is over — Claude Code, ChatGPT-with-tools, Gemini-Agentic are all multi-hop by design.

| Scope | Replies/day | Chat-only floor | Heavy-agent baseline (~4×) |
| --- | --- | --- | --- |
| Anthropic Claude | ~900M | $14M/yr | ~$56M/yr |
| OpenAI ChatGPT + Copilot | ~2.5B | $40M/yr | ~$160M/yr |
| Google Gemini | ~600M | $9M/yr | ~$36M/yr |
| Others (Grok, Perplexity, …) | ~300M | $5M/yr | ~$20M/yr |
| Worldwide AI traffic (heavy baseline) | ~5B | $80M/yr | ~$320M/yr |

Chat-only floor: 485 KB JSON-SSE per reply × replies/day × $0.09/GB AWS S3. Heavy-agent baseline multiplies by ~4× for the topology that's now default at every major provider — multi-tool dispatch, A2A handoffs, sub-agent invocations, RAG context retrieval all crossing egress. Extreme deep-research / sub-agent-heavy flows push to 6-8× ($480-640M/yr); only legacy chat-only deployments hit the floor. GCP + Azure egress are in the same ballpark ($0.08-$0.12/GB).
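The chat-only floor formula can be reproduced in a few lines. This sketch assumes decimal units (1 KB = 1,000 bytes, 1 GB = 10^9 bytes) and 365 days per year, which the text does not state; it multiplies unrounded values, so some heavy-agent results come out a little above the table, which appears to round the floor first and then multiply by 4.

```python
REPLY_BYTES = 485_000         # JSON-SSE bytes per 2K-token reply (sglang figure)
EGRESS_USD_PER_GB = 0.09      # egress price quoted in the text
HEAVY_AGENT_MULTIPLIER = 4    # ~4x bytes-out per visible reply

def yearly_egress_usd(replies_per_day, multiplier=1):
    """Annual egress cost for a given daily reply volume."""
    gb_per_day = REPLY_BYTES * replies_per_day / 1e9
    return gb_per_day * EGRESS_USD_PER_GB * 365 * multiplier

scopes = [
    ("Anthropic Claude", 900e6),
    ("OpenAI ChatGPT + Copilot", 2.5e9),
    ("Google Gemini", 600e6),
    ("Others", 300e6),
    ("Worldwide", 5e9),
]
for name, replies in scopes:
    floor = yearly_egress_usd(replies)
    heavy = yearly_egress_usd(replies, HEAVY_AGENT_MULTIPLIER)
    print(f"{name}: ${floor / 1e6:.1f}M floor, ${heavy / 1e6:.1f}M heavy-agent")
```

Changing `HEAVY_AGENT_MULTIPLIER` to 6 or 8 gives the deep-research range quoted above.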

Carrier add'l-data rate ~$10/GB. Radio-link energy ~50 nJ/bit (conservative cross-tech estimate from published 4G/5G/Wi-Fi measurements). Per 2K-token chat reply:

| Per reply | JSON-SSE | Codec |
| --- | --- | --- |
| Data cost | $0.0049 | $0.000003 |
| Radio energy | 194 mJ | 0.12 mJ |
| Bits over the air | 3.88 Mb | 2.3 Kb |

Per-response cost is tiny on a phone; the number becomes intuitive only when multiplied by the user base. ~20M Claude users × ~50 mobile chat replies/day each ≈ 1B replies/day across a mobile fleet: ~194 MJ vs ~0.12 MJ on radio links. That is ~54 kWh/day saved at the airlink alone, roughly 1.8 average US households' daily electricity (EIA ~30 kWh/household-day), plus the per-user battery and data-cap relief. The full non-GPU energy delta (radio + network + client CPU) is bigger; see the power+latency card below.
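The per-reply and fleet-level radio figures can be checked with the constants quoted above. The sketch assumes decimal units (1 GB = 10^9 bytes, 1 kWh = 3.6 MJ) and takes the 485 KB and 291 B wire sizes from the sglang measurement.

```python
JOULES_PER_BIT = 50e-9    # radio-link energy estimate from the text
USD_PER_GB = 10.0         # carrier additional-data rate from the text

def per_reply(wire_bytes):
    """Bits, radio energy (mJ), and data cost (USD) for one reply."""
    bits = wire_bytes * 8
    return {
        "bits": bits,
        "energy_mj": bits * JOULES_PER_BIT * 1e3,
        "usd": wire_bytes / 1e9 * USD_PER_GB,
    }

json_sse = per_reply(485_000)
codec = per_reply(291)

replies_per_day = 20e6 * 50   # ~20M users x ~50 mobile replies/day
saved_joules = (json_sse["energy_mj"] - codec["energy_mj"]) / 1e3 * replies_per_day
saved_kwh = saved_joules / 3.6e6

print(f"JSON-SSE: {json_sse['bits'] / 1e6:.2f} Mb, {json_sse['energy_mj']:.0f} mJ, ${json_sse['usd']:.4f}")
print(f"Codec:    {codec['bits'] / 1e3:.1f} Kb, {codec['energy_mj']:.2f} mJ, ${codec['usd']:.6f}")
print(f"Fleet airlink saving: {saved_kwh:.1f} kWh/day")
```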

Two measured points from the bench section above, both worth real CPU on the consumer side:

ToolWatcher — 2.08 ns/token (single 32-bit compare) vs 55.42 ns/token for detokenize+regex match. 26.7× less CPU on every tool-detection pass.

mcp-leaf — 0.004 ms meta-read vs 0.052 ms BPE tokenize per tool result. 12.4× less CPU per tool call where the result includes _meta['ai.codec/leaf-tokenization'].

Per call it's microseconds. At fleet scale (100M consumers × 100 tool-bearing turns/day) the agent-mesh saves on the order of ~1,000 CPU-hours/day across the consumer fleet. On a single laptop running an agent loop locally: less fan noise, longer battery.

Codec-aware clients with web-safety enabled refuse doomed prompts locally (policy violations, safety-policy mismatches, malformed payloads) before any wire round-trip. Those requests never reach the GPU. At a ~10% client-side block rate on ~5B daily requests, that is ~500M GPU requests/day avoided.

| Assumption | Per call | At 500M blocks/day |
| --- | --- | --- |
| GPU-seconds avoided | ~1 s avg | ~138K GPU-hr/day |
| @ $1/GPU-hr blended | $0.000278 | ~$50M/yr saved |
| @ $2/GPU-hr (premium) | $0.000556 | ~$100M/yr saved |

This is the only place Codec actually reduces GPU dollars — not because the model runs faster, but because the request never runs at all. On the ~90% of requests that DO reach the GPU, compute is unchanged. ~$50–100M/yr is the defensible range at 10% block rate; more aggressive client-side dedup / safety / format-validation pushes it higher.
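The GPU-dollar range follows directly from the stated assumptions, as this short calculation shows. The function name is hypothetical, and 365 days per year is assumed.

```python
def gpu_savings(daily_requests, block_rate, gpu_seconds_per_request, usd_per_gpu_hour):
    """Return (blocked requests/day, GPU-hours/day avoided, USD/year avoided)."""
    blocked = daily_requests * block_rate
    gpu_hours_per_day = blocked * gpu_seconds_per_request / 3600
    usd_per_year = gpu_hours_per_day * usd_per_gpu_hour * 365
    return blocked, gpu_hours_per_day, usd_per_year

for rate in (1.0, 2.0):
    blocked, hours, usd = gpu_savings(5e9, 0.10, 1.0, rate)
    per_call = rate / 3600   # cost of the ~1 GPU-second a blocked call would have used
    print(f"${rate:.0f}/GPU-hr: {blocked / 1e6:.0f}M blocked/day, "
          f"{hours / 1e3:.0f}K GPU-hr/day, ${per_call:.6f}/call, ${usd / 1e6:.1f}M/yr")
```

The result scales linearly in every input, so halving the block rate or the average GPU-seconds per request halves the saving.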

The model still runs at the same TPS. Codec doesn't accelerate token generation, doesn't ch

[truncated for AI cost control]