2026-05-30 22:07 UTC · In-site rewrite · 6 min read · Updated: 2026-06-30 13:03 UTC

Show HN: Thaw – Git branch for a running LLM (fork agents, skip prefill)

Thaw is an open-source tool that forks a running LLM session into multiple branches without repeating the costly prefill phase, so AI agents can explore in parallel. It reports sub-second fork rounds (0.88s median) versus roughly 340s for a cold boot, and works with vLLM and SGLang.

Source: Hacker News AI · Author: nilsmatteson


The fork primitive for AI agents.

When your agent forks N ways to explore a problem, thaw skips the cold prefill and runs the branches in parallel from one shared state. Snapshot a running session (weights, KV cache, scheduler state, prefix-hash table) and hydrate N divergent children at the fork point: git branch for live AI agents.

pip install thaw-vllm

The receipt — ForkPool, 2026-04-20

A pre-warmed subprocess pool loads the engine once; each fork_completions() call then snapshots only the KV cache.

Llama-3.1-8B on H100 80 GB PCIe, 5 rounds × 4 branches × 64 tokens:

| Stage | Time |
| --- | --- |
| init_pool (one-time; workers boot with real weights) | 22.3s |
| First fork round | 1.16s |
| Median fork round | 0.88s |

Per-round cost: ~340s cold-boot → sub-second (≈400× amortized). All rounds 4/4 non-empty and divergent. Bit-identical at the fork boundary. The first sub-second fork amortization proof on real hardware.

Reproducer: demos/fork_pool_rl.py · Receipt JSON: site/receipts/2026-04-20_h100_fork_pool_rl.json

What you can build with it

Agent branching — fork a conversation into N parallel hypotheses mid-reasoning, run them concurrently, pick the winner.

RL rollouts — collapse num_rollouts × prefill_time to num_rollouts × memcpy_time. Real dollars on $100k+/month training budgets. HuggingFace's 2026 async-RL survey: "no current async library supports [KV pivot resampling] out of the box." This ships it.

Parallel coding agents — turn "8 agents exploring 8 solutions" from an expensive re-prefill tax into a fast primitive.

Session migration — move a live inference session between GPUs, pods, or data centers without losing state.
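The RL-rollout arithmetic above can be checked against the figures this README quotes in its ForkPool receipt. The sketch below is only a back-of-the-envelope calculation: it assumes rollouts run back to back and that each one pays the full quoted figure, and it uses the median fork round as a stand-in for the per-rollout warm cost. The function name is illustrative and not part of thaw.

```python
# Figures quoted in this README's ForkPool receipt (Llama-3.1-8B, H100 80 GB PCIe).
COLD_BOOT_S = 340.0   # approximate cold-boot cost per round
WARM_FORK_S = 0.88    # median warm-pool fork round


def step_cost_seconds(num_rollouts: int, per_rollout_s: float) -> float:
    """Wall time for one training step if rollouts run sequentially (assumption)."""
    return num_rollouts * per_rollout_s


cold = step_cost_seconds(16, COLD_BOOT_S)   # 16 rollouts, each paying a cold boot
warm = step_cost_seconds(16, WARM_FORK_S)   # 16 rollouts from a warm pool
speedup = COLD_BOOT_S / WARM_FORK_S         # per-round ratio

print(f"cold step: {cold / 60:.1f} min")
print(f"warm step: {warm:.1f} s")
print(f"per-round speedup: {speedup:.0f}x")
```

The results line up with the rounded claims elsewhere in this README: roughly 90 minutes against roughly 15 seconds for a 16-rollout step, and a per-round ratio close to the quoted 400×.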

Who this is for

RL post-training teams. PPO, DPO, tree-GRPO, and best-of-N loops that fork rollouts from a shared trunk pay for prefill on every branch. The receipt above takes a round from ~340s cold-boot to 0.88s warm-pool. A step with 16 rollouts: ~90 minutes → ~15 seconds. Multiply by steps × epochs. HuggingFace's 2026 async-RL survey documented the gap: "no current async library supports [KV pivot resampling] out of the box."

Coding-agent teams. Parallel-exploration products — Cursor-style N approaches, SWE-bench agents, test-driven coding loops — pay a prefill tax on every branch. ForkPool turns "explore 8 approaches" from 8× full prefill into an 8-branch fork against one warm KV state. More hypotheses per user request at the same GPU spend.

Platform + framework teams. thaw.fork(llm) returns a portable, serializable handle you can ship across processes and pods. Session migration, multi-model hot-swap, session replay — without rewriting your inference layer. Drop-in for LangGraph nodes, Modal functions, Ray workers.

Not for you yet. Single-prompt serving — one request, one response, no shared trunk, no repeated forking — vLLM / SGLang alone are fine. thaw earns its keep when you fork ≥2 children from shared state or hot-swap between sessions.

Works with vLLM and SGLang. Open source (Apache-2.0).

▶ 75-second demo — Hot-swap LLMs in 0.29s · How it works (4m) · Fork a running agent (2m 20s)

Inside a single fork

ForkPool amortizes setup cost across repeated forks. Each primitive behind it is receipted individually.

Sleep / wake round-trip (vLLM native LLM.sleep(level=2) + LLM.wake_up() composed with thaw's snapshot — bit-identical greedy output both sides):

| Config | Sleep | Wake | Snapshot | CuMemAllocator freed | Receipt |
| --- | --- | --- | --- | --- | --- |
| Llama-3.1-8B, 1× H100 SXM, TP=1 | 3.4s | 11.1s | 16 GB, 195 regions | 45.38 GiB | sleep_mode_8b_tp1.json |
| Llama-3.1-70B, 2× H100 SXM, TP=2 | 16.1s | 53.6s | 141 GB, 966 regions | 72.67 GiB/rank (145 GiB total) | sleep_mode_70b_tp2.json |

Slot-warm hot-swap (thaw serve with a persisted pinned mmap, H100 SXM Llama-3-8B): one-time cudaHostRegister pin ~6s, then 0.29s / 55 GB/s per reload (86% of PCIe Gen5 line rate). Reproducer: bench_slot_warm.py, correctness: bench_slot_warm_correctness.py. Extrapolates to ~2.5s hot-swap for a 70B at 140 GB.
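The slot-warm numbers above are linked by simple bandwidth arithmetic, which the sketch below reproduces. It assumes a 16 GB snapshot for the 8B model (the size quoted for the 8B snapshot elsewhere in this README) and treats the transfer as purely bandwidth-bound; it is a sanity check, not a benchmark.

```python
MEASURED_GB_S = 55.0        # reload bandwidth quoted above
LINE_RATE_FRACTION = 0.86   # quoted fraction of PCIe Gen5 line rate


def transfer_seconds(size_gb: float, bandwidth_gb_s: float) -> float:
    """Time to move size_gb at a sustained bandwidth, ignoring fixed overheads."""
    return size_gb / bandwidth_gb_s


reload_8b = transfer_seconds(16, MEASURED_GB_S)          # assumed 16 GB snapshot
reload_70b = transfer_seconds(140, MEASURED_GB_S)        # 140 GB figure from the text
implied_line_rate = MEASURED_GB_S / LINE_RATE_FRACTION   # GB/s implied by "86%"

print(f"8B reload:  {reload_8b:.2f} s")
print(f"70B reload: {reload_70b:.2f} s")
print(f"implied line rate: {implied_line_rate:.0f} GB/s")
```

Under these assumptions the 8B reload comes out at about 0.29s and the 70B extrapolation at about 2.5s, matching the quoted figures.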

Every other "fast model loading" tool restores weights only. thaw restores the full state of a live inference session — weights + KV blocks + prefix-hash table + scheduler state — and that's what makes fork work.

Numbers are per-pod; freeze-side throughput is NVMe-bound (not code-bound). Re-measure on your own pod before citing as a ceiling. Methodology: docs/BENCHMARKS.md.

A pre-staged RAM path (mmap + cudaHostRegister) exists behind THAW_ZEROCOPY_MMAP=1. cudaHostRegister is O(pages) — pinning a 16 GB mmap costs ~7s, so the path is only a win when amortized across many restores (what thaw serve does by persisting the pin on each slot).

All paths produce bit-identical inference output. KV cache restore preserves the prefix cache across cold starts, so new requests that match a restored prefix skip prefill for it.

How it works

Fork is a composition of four primitives: freeze weights, freeze KV cache, freeze scheduler state, restore all three into a fresh process. None of that was possible at GPU speeds before thaw.

    flowchart TB
        A["Running vLLM engine weights (16 GB) + KV blocks + prefix-hash table + scheduler state"]
        A -- "thaw.freeze_model + thaw.freeze_kv_cache" --> B["Durable artifact (.thaw + .thawkv on disk or S3)"]
        B -- "pipelined CUDA DMA (double-buffered, O_DIRECT)" --> C1["Child engine 1 same weights + KV"]
        B -- "pipelined CUDA DMA" --> C2["Child engine 2 same weights + KV"]
        B -- "pipelined CUDA DMA" --> C3["Child engine N same weights + KV"]
        C1 --> D1[diverges here →]
        C2 --> D2[diverges here →]
        C3 --> D3[diverges here →]

        classDef src fill:#1e293b,stroke:#64748b,color:#f1f5f9
        classDef art fill:#0f172a,stroke:#38bdf8,color:#e0f2fe
        classDef child fill:#134e4a,stroke:#2dd4bf,color:#ccfbf1
        classDef diverge fill:none,stroke:none,color:#94a3b8
        class A src
        class B art
        class C1,C2,C3 child
        class D1,D2,D3 diverge

Freeze captures the full engine state into two binary files: .thaw (weights) and .thawkv (KV blocks + prefix-hash table + scheduler metadata).

Restore initializes a fresh vLLM engine with dummy weights (fast — no disk I/O), overwrites them from the snapshot via double-buffered pipelined DMA through pinned host memory, then rebuilds the prefix-cache block table from the .thawkv sidecar. Two CUDA streams overlap PCIe transfers with disk reads. New requests matching the restored prefix skip prefill entirely.
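The double-buffering idea in the restore path can be illustrated without a GPU. The sketch below is a CPU-only analogy, not thaw's implementation: a reader thread stands in for the O_DIRECT disk read, a slice assignment stands in for the host-to-device DMA, and two reusable staging buffers let the next read proceed while the previous chunk is being copied. All names here are hypothetical.

```python
import io
import queue
import threading


def double_buffered_copy(src: io.BufferedIOBase, dst: bytearray, chunk_size: int) -> int:
    """Copy src into dst through two staging buffers so reads overlap copies."""
    free_q: queue.Queue = queue.Queue()
    full_q: queue.Queue = queue.Queue()
    for _ in range(2):                      # exactly two staging buffers
        free_q.put(bytearray(chunk_size))

    def reader() -> None:
        while True:
            buf = free_q.get()              # wait for a free staging buffer
            n = src.readinto(buf)           # stand-in for the disk read
            full_q.put((buf, n))
            if not n:                       # zero bytes read means end of input
                return

    worker = threading.Thread(target=reader, daemon=True)
    worker.start()

    offset = 0
    while True:
        buf, n = full_q.get()
        if not n:
            break
        dst[offset:offset + n] = memoryview(buf)[:n]   # stand-in for the DMA copy
        offset += n
        free_q.put(buf)                     # hand the buffer back to the reader
    worker.join()
    return offset


payload = bytes(range(256)) * 1000
dst = bytearray(len(payload))
copied = double_buffered_copy(io.BytesIO(payload), dst, chunk_size=4096)
print(copied, bytes(dst) == payload)
```

In a real restore the staging buffers would be pinned host memory and the copy would be an asynchronous transfer on a CUDA stream; the queue hand-off here only models the ordering between the two stages.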

Three restore modes:

Disk: reads snapshot from NVMe with O_DIRECT, bypassing the kernel page cache. Throughput is NVMe-bound; re-measure per pod before citing a ceiling.

Pre-staged RAM: snapshot already in memory (tmpfs, shared memory, or mmapped with page cache warm). The full zero-copy path (mmap + cudaHostRegister) is implemented behind THAW_ZEROCOPY_MMAP=1, but the one-time registration cost makes it a win only when amortized across many restores.

Slot-warm hot-swap (thaw serve): when a pool slot warms up, thaw serve pins the snapshot mmap once (~6s cudaHostRegister for 16 GB) and persists the pinned handle on the slot. Every subsequent model swap into that slot reuses the pinned buffer and runs as pure PCIe DMA — 0.29s at 55 GB/s for an 8B model on H100 SXM.
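Why persisting the pin matters can be seen from the amortized cost per swap. The sketch below uses the figures quoted for the slot-warm path (about 6s to pin, 0.29s per reload) and assumes the pin is paid exactly once per slot; the function name is illustrative only.

```python
PIN_S = 6.0       # one-time cudaHostRegister cost quoted for a 16 GB snapshot
RELOAD_S = 0.29   # per-swap DMA time quoted for an 8B model


def amortized_swap_seconds(num_swaps: int) -> float:
    """Average cost per swap when the pin is paid once and reused."""
    if num_swaps < 1:
        raise ValueError("num_swaps must be at least 1")
    return (PIN_S + num_swaps * RELOAD_S) / num_swaps


for n in (1, 10, 100):
    print(f"{n:>3} swaps: {amortized_swap_seconds(n):.2f} s per swap")
```

With a single swap the pin dominates (about 6.3s), while at 100 swaps the average falls to about 0.35s, which is why the zero-copy path only pays off when the registration is reused.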

KV cache snapshots are the hard part. vLLM's prefix-cache hash table maps token-hash → block-id, and the scheduler assumes those block assignments are live. thaw serializes the block contents, the hash table, and the scheduler's view of which blocks are cached. On restore, the block data is DMA'd back to GPU and the hash table is rebuilt — so a request whose prefix was cached in the parent immediately hits cache in the child. Nobody else does this.
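A toy model makes the hash-table point concrete. The sketch below is not vLLM's or thaw's code and does not reflect the .thawkv format: it assumes a chained per-block hash, a dict mapping hash to block id, and JSON as the serialization. The point is only that restoring the block data together with the table lets a child hit the cache for the parent's prefix.

```python
import hashlib
import json

BLOCK_SIZE = 4  # tokens per KV block (toy value)


def block_hashes(tokens: list[int]) -> list[str]:
    """Chained hashes: each full block's hash covers every token before it."""
    hashes, parent = [], ""
    full = len(tokens) - len(tokens) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        block = tokens[start:start + BLOCK_SIZE]
        parent = hashlib.sha256((parent + json.dumps(block)).encode()).hexdigest()
        hashes.append(parent)
    return hashes


class ToyPrefixCache:
    def __init__(self) -> None:
        self.hash_to_block: dict[str, int] = {}
        self.blocks: dict[int, list[int]] = {}   # block id -> stand-in for KV data

    def insert(self, tokens: list[int]) -> None:
        for i, h in enumerate(block_hashes(tokens)):
            if h not in self.hash_to_block:
                block_id = len(self.blocks)
                self.hash_to_block[h] = block_id
                self.blocks[block_id] = tokens[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]

    def cached_prefix_len(self, tokens: list[int]) -> int:
        """Number of leading tokens whose blocks are already cached."""
        hit = 0
        for h in block_hashes(tokens):
            if h not in self.hash_to_block:
                break
            hit += BLOCK_SIZE
        return hit

    def snapshot(self) -> str:
        return json.dumps({"hash_to_block": self.hash_to_block,
                           "blocks": {str(k): v for k, v in self.blocks.items()}})

    @classmethod
    def restore(cls, blob: str) -> "ToyPrefixCache":
        data = json.loads(blob)
        child = cls()
        child.hash_to_block = data["hash_to_block"]
        child.blocks = {int(k): v for k, v in data["blocks"].items()}
        return child


parent = ToyPrefixCache()
trunk = list(range(10))                 # 10 tokens, so 2 full blocks are cached
parent.insert(trunk)
child = ToyPrefixCache.restore(parent.snapshot())
hit = child.cached_prefix_len(trunk + [99, 100])
print(hit)
```

The child reports 8 cached tokens for a request that extends the trunk, while a fresh cache with no restored table would report 0 and pay full prefill.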

Sleep-mode integration (vLLM RFC #34303)

thaw_vllm.sleep_mode composes thaw's freeze/restore around vLLM's native LLM.sleep(level=2) and LLM.wake_up() rather than adding a parallel path. sleep() freezes the model, then lets vLLM's CuMemAllocator free the GPU memory; wake_up() re-allocates the tensor storage, then thaw populates it. Requires enable_sleep_mode=True at LLM construction (strict-mode gate).

    sequenceDiagram
        autonumber
        participant U as user code
        participant TS as thaw_vllm.sleep_mode
        participant T as thaw (freeze/restore_model_tp)
        participant V as vLLM (LLM.sleep / wake_up)
        participant CMA as CuMemAllocator (GPU memory)

        rect rgba(59,130,246,0.12)
            note over U,CMA: sleep(llm, path, level=2)
            U->>TS: sleep(llm, path)
            TS->>T: freeze_model_tp(llm, path)
            T-->>TS: snapshot on disk (.thaw)
            TS->>V: llm.sleep(level=2)
            V->>CMA: release tagged allocations
            CMA-->>V: GPU memory freed (receipt: 72.67 GiB / rank on 70B TP=2)
            V-->>TS: ok
            TS-->>U: stats (freed=True)
        end

        rect rgba(16,185,129,0.12)
            note over U,CMA: wake_up(llm, path)
            U->>TS: wake_up(llm, path)
            TS->>V: llm.wake_up()
            V->>CMA: re-allocate tensor storage
            CMA-->>V: GPU tensors re-created (empty)
            V-->>TS: ok (~0.33s on 70B)
            TS->>T: restore_model_tp(llm, path)
            T-->>TS: snapshot populated into GPU tensors
            TS-->>U: stats (bit-identical greedy output)
        end

Receipts (2× H100 SXM, bit-identical greedy output on both ends): sleep_mode_8b_tp1.json, sleep_mode_70b_tp2.json. Source: python/thaw_vllm/sleep_mode.py. Tests: tests/test_sleep_mode.py (8 passing, CPU-only).

    from vllm import LLM
    import thaw_vllm.sleep_mode as sm

    llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct",
              enable_sleep_mode=True,   # required by the strict-mode gate
              enforce_eager=True,
              dtype="float16")

    llm.generate(["hello"])                  # warm the engine

    sm.sleep(llm, "/snap/llama8b.thaw")      # freeze, then llm.sleep(level=2)
    # GPU memory is actually freed here, not just tagged

    sm.wake_up(llm, "/snap/llama8b.thaw")    # llm.wake_up(), then restore
    llm.generate(["hello"])                  # bit-identical tokens

Architecture

    thaw/
      crates/
        thaw-core/         Rust. File format, region tables, I/O. No CUDA dep.
        thaw-cuda-sys/     Rust. FFI bindings to CUDA runtime (cudaMallocHost,
                           cudaMemcpyAsync, streams). Built via build.rs.
        thaw-runtime/      Rust. Orchestration: freeze/restore pipelines,
                           double-buffered DMA, O_DIRECT, thread-local WC-buffer
                           cache, unified zero-copy/staging restore. MockCuda for Mac.
        thaw-py/           Rust. PyO3 bindings exposing pipelined freeze/restore
                           to Python. Builds a native .so via maturin.
        thaw-cli/          Rust. thaw-bench-freeze binary + internal tooling.
      python/
        thaw_common/       Engine-agnostic freeze/restore primitives (shared).
        thaw_vllm/         vLLM integration + engine pool + OpenAI server.
          snapshot.py      vLLM TP freeze/restore via collective_rpc.
          kv_snapshot.py   KV cache freeze/restore (pipelined path, .meta sidecar).
          loader.py        vLLM ModelLoader: load_format="thaw".
          pool.py          Engine pool: pre-warmed slots, model hot-swap.
          server.py        OpenAI-compatible API server.
          cli.py           CLI: thaw freeze, thaw serve, thaw info.
        thaw_sglang/       SGLang integration (class-passthrough loader).
        vllm_demo.py       End-to-end benchmark: normal vs thaw cold start.
        kv_cache_demo.py   KV cache snapshot/restore demo with correctness test.
      demos/
        agent_fork.py      Agent fork demo: clone session, fork parallel completions.

Testing on Mac, shipping on GPU. The CudaBackend trait abstracts all GPU operations. MockCuda (a HashMap-backed fake) lets 48 runtime tests run on any machine. The cuda feature flag activates real GPU paths only when needed.
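The backend-abstraction pattern described above can be sketched in Python for readers who do not work in Rust. This is an analogy only: the real CudaBackend trait and MockCuda live in the Rust crates, and the method names below are invented for illustration rather than copied from them.

```python
from typing import Protocol


class GpuBackend(Protocol):
    """Hypothetical interface; every GPU operation goes through it."""

    def alloc(self, nbytes: int) -> int: ...
    def copy_to_device(self, handle: int, data: bytes) -> None: ...
    def copy_to_host(self, handle: int) -> bytes: ...


class MockBackend:
    """Dict-backed fake that satisfies GpuBackend without any GPU."""

    def __init__(self) -> None:
        self._buffers: dict[int, bytearray] = {}
        self._next_handle = 1

    def alloc(self, nbytes: int) -> int:
        handle = self._next_handle
        self._next_handle += 1
        self._buffers[handle] = bytearray(nbytes)
        return handle

    def copy_to_device(self, handle: int, data: bytes) -> None:
        buf = self._buffers[handle]
        if len(data) > len(buf):
            raise ValueError("payload larger than allocation")
        buf[:len(data)] = data

    def copy_to_host(self, handle: int) -> bytes:
        return bytes(self._buffers[handle])


def restore_region(backend: GpuBackend, payload: bytes) -> int:
    """Pipeline code depends only on the interface, so it runs against the mock."""
    handle = backend.alloc(len(payload))
    backend.copy_to_device(handle, payload)
    return handle


backend = MockBackend()
handle = restore_region(backend, b"weights")
print(backend.copy_to_host(handle))
```

Because restore_region never names a concrete backend, the same logic can be exercised in tests on a laptop and run against a real GPU implementation in production.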

Quick start

pip install thaw-vllm[all]
