Show HN: Thaw – Git branch for a running LLM (fork agents, skip prefill)
Thaw is an open-source tool that forks a running LLM session into multiple branches without repeating the costly prefill phase, so AI agents can explore in parallel. It reports sub-second fork rounds (0.88s median) versus a ~340s cold boot, and works with vLLM and SGLang.
Article intelligence
Key points
- Thaw provides a fork primitive for AI agents, allowing them to branch from a running session without re-prefill.
- Demonstrated performance: sub-second fork times on H100 GPU, ~400x amortization over cold boot.
- Use cases include agent branching, RL rollouts, parallel coding agents, and session migration.
- Open source (Apache-2.0), integrates with vLLM and SGLang.
Why it matters
Agents that explore several continuations from a shared context normally pay for prefill on every branch. Forking from a running session removes that repeated cost.
Technical impact
May affect inference cost and product capability for workloads that fork from shared state (RL rollouts, parallel agents); single-prompt serving gains little.
The fork primitive for AI agents.
When your agent forks N ways to explore a problem, thaw skips the cold prefill and runs them in parallel from one shared memory. Snapshot a running session — weights, KV cache, scheduler state, prefix-hash table — and hydrate N divergent children at the fork point. git branch for live AI agents.
pip install thaw-vllm
The receipt — ForkPool, 2026-04-20
Pre-warmed subprocess pool holds the engine once; each fork_completions() call snapshots KV only.
Llama-3.1-8B on H100 80 GB PCIe, 5 rounds × 4 branches × 64 tokens:
| Stage | Time |
| --- | --- |
| init_pool (one-time; workers boot with real weights) | 22.3s |
| First fork round | 1.16s |
| Median fork round | 0.88s |
Per-round cost: ~340s cold-boot → sub-second (≈400× amortized). All rounds 4/4 non-empty and divergent. Bit-identical at the fork boundary. The first sub-second fork amortization proof on real hardware.
Reproducer: demos/fork_pool_rl.py · Receipt JSON: site/receipts/2026-04-20_h100_fork_pool_rl.json
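To see where the ≈400× figure comes from and how the one-time init_pool cost amortizes, the receipt numbers can be plugged into a few lines of Python. This is only arithmetic on the figures above: it assumes every round after the first costs the median 0.88s, and the function names are illustrative rather than part of thaw.

```python
# Figures from the receipt above (Llama-3.1-8B, H100 80 GB PCIe).
COLD_BOOT_S = 340.0    # approximate cold-boot cost per round without the pool
INIT_POOL_S = 22.3     # one-time pool start-up
FIRST_FORK_S = 1.16    # first fork round
MEDIAN_FORK_S = 0.88   # median fork round


def pooled_total(rounds: int) -> float:
    """Total seconds for `rounds` fork rounds, including the one-time init.

    Assumption: every round after the first costs the median time.
    """
    if rounds < 1:
        return 0.0
    return INIT_POOL_S + FIRST_FORK_S + (rounds - 1) * MEDIAN_FORK_S


def cold_total(rounds: int) -> float:
    """Total seconds if every round paid a full cold boot."""
    return rounds * COLD_BOOT_S


per_round_ratio = COLD_BOOT_S / MEDIAN_FORK_S
print(f"per-round ratio: {per_round_ratio:.0f}x")
for n in (1, 5, 100):
    print(f"{n:>3} rounds: cold {cold_total(n):8.1f}s  pooled {pooled_total(n):6.1f}s")
```

The per-round ratio works out to roughly 386×, which the receipt rounds to ≈400×. Even with the 22.3s init included, the pooled path is ahead of a single cold boot after the first round.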
What you can build with it
Agent branching — fork a conversation into N parallel hypotheses mid-reasoning, run them concurrently, pick the winner.
RL rollouts — collapse num_rollouts × prefill_time to num_rollouts × memcpy_time. Real dollars on $100k+/month training budgets. HuggingFace's 2026 async-RL survey: "no current async library supports [KV pivot resampling] out of the box." This ships it.
Parallel coding agents — turn "8 agents exploring 8 solutions" from an expensive re-prefill tax into a fast primitive.
Session migration — move a live inference session between GPUs, pods, or data centers without losing state.
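The branching pattern behind these use cases (fork from a shared trunk, run the children concurrently, keep the best) can be sketched without any GPU. The sketch below is an assumption-laden illustration of the control flow only: `run_branch` and `score` are hypothetical stand-ins, not thaw or vLLM functions, and a real child would decode from restored KV state instead of concatenating strings.

```python
from concurrent.futures import ThreadPoolExecutor

SHARED_PREFIX = "Task: sort a list. Plan so far: compare approaches."


def run_branch(prefix: str, hint: str) -> str:
    """Hypothetical stand-in for one forked child continuing from the shared prefix."""
    return f"{prefix} -> try {hint}"


def score(completion: str) -> int:
    """Hypothetical scorer; a real one might run tests or a reward model."""
    return len(completion)


def explore(prefix: str, hints: list[str]) -> str:
    """Run one branch per hint concurrently and return the best-scoring result."""
    with ThreadPoolExecutor(max_workers=len(hints)) as pool:
        completions = list(pool.map(lambda h: run_branch(prefix, h), hints))
    return max(completions, key=score)


best = explore(SHARED_PREFIX, ["quicksort", "mergesort", "heapsort", "radix sort"])
print(best)
```

Every child starts from the same prefix, which is the part a fork primitive avoids recomputing; only the per-branch suffix differs.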
Who this is for
RL post-training teams. PPO, DPO, tree-GRPO, and best-of-N loops that fork rollouts from a shared trunk pay for prefill on every branch. The receipt above takes a round from ~340s cold-boot to 0.88s warm-pool. A step with 16 rollouts: ~90 minutes → ~15 seconds. Multiply by steps × epochs. HuggingFace's 2026 async-RL survey documented the gap: "no current async library supports [KV pivot resampling] out of the box."
Coding-agent teams. Parallel-exploration products — Cursor-style N approaches, SWE-bench agents, test-driven coding loops — pay a prefill tax on every branch. ForkPool turns "explore 8 approaches" from 8× full prefill into an 8-branch fork against one warm KV state. More hypotheses per user request at the same GPU spend.
Platform + framework teams. thaw.fork(llm) returns a portable, serializable handle you can ship across processes and pods. Session migration, multi-model hot-swap, session replay — without rewriting your inference layer. Drop-in for LangGraph nodes, Modal functions, Ray workers.
Not for you yet. Single-prompt serving — one request, one response, no shared trunk, no repeated forking — vLLM / SGLang alone are fine. thaw earns its keep when you fork ≥2 children from shared state or hot-swap between sessions.
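The "~90 minutes → ~15 seconds" figure for a 16-rollout step follows directly from the per-round numbers in the receipt. A minimal check, assuming rollouts are costed one after another and using an illustrative helper name:

```python
def step_seconds(num_rollouts: int, per_rollout_s: float) -> float:
    """Cost of one RL step when every rollout pays `per_rollout_s` up front."""
    return num_rollouts * per_rollout_s


cold = step_seconds(16, 340.0)   # each rollout pays the ~340s cold path
warm = step_seconds(16, 0.88)    # each rollout pays a median fork round

print(f"cold: {cold / 60:.1f} min, warm: {warm:.1f} s")
```

Sixteen cold rollouts come to about 90.7 minutes and sixteen warm ones to about 14.1 seconds, matching the rounded figures quoted for RL post-training teams.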
Works with vLLM and SGLang. Open source (Apache-2.0).
▶ 75-second demo — Hot-swap LLMs in 0.29s · How it works (4m) · Fork a running agent (2m 20s)
Inside a single fork
ForkPool amortizes setup cost across repeated forks. Each primitive behind it is receipted individually.
Sleep / wake round-trip (vLLM native LLM.sleep(level=2) + LLM.wake_up() composed with thaw's snapshot — bit-identical greedy output both sides):
| Config | Sleep | Wake | Snapshot | CuMemAllocator freed | Receipt |
| --- | --- | --- | --- | --- | --- |
| Llama-3.1-8B, 1× H100 SXM, TP=1 | 3.4s | 11.1s | 16 GB, 195 regions | 45.38 GiB | sleep_mode_8b_tp1.json |
| Llama-3.1-70B, 2× H100 SXM, TP=2 | 16.1s | 53.6s | 141 GB, 966 regions | 72.67 GiB/rank (145 GiB total) | sleep_mode_70b_tp2.json |
Slot-warm hot-swap (thaw serve with a persisted pinned mmap, H100 SXM Llama-3-8B): one-time cudaHostRegister pin ~6s, then 0.29s / 55 GB/s per reload (86% of PCIe Gen5 line rate). Reproducer: bench_slot_warm.py, correctness: bench_slot_warm_correctness.py. Extrapolates to ~2.5s hot-swap for a 70B at 140 GB.
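The hot-swap numbers are consistent with simple bandwidth arithmetic. The sketch below assumes the 8B snapshot is the 16 GB reported in the sleep/wake table and that the reload is a pure transfer at the quoted 55 GB/s; the helper name is illustrative.

```python
def transfer_seconds(size_gb: float, bandwidth_gb_s: float) -> float:
    """Time to move `size_gb` at a sustained `bandwidth_gb_s`."""
    return size_gb / bandwidth_gb_s


MEASURED_GB_S = 55.0        # reload bandwidth quoted above
LINE_RATE_FRACTION = 0.86   # quoted fraction of PCIe Gen5 line rate

t_8b = transfer_seconds(16.0, MEASURED_GB_S)     # 16 GB snapshot for the 8B model
t_70b = transfer_seconds(140.0, MEASURED_GB_S)   # 140 GB for a 70B model
implied_line_rate = MEASURED_GB_S / LINE_RATE_FRACTION

print(f"8B:  {t_8b:.2f} s")
print(f"70B: {t_70b:.2f} s")
print(f"implied line rate: {implied_line_rate:.0f} GB/s")
```

This reproduces the 0.29s reload and the ~2.5s extrapolation for 70B, and implies a line rate of about 64 GB/s.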
Every other "fast model loading" tool restores weights only. thaw restores the full state of a live inference session — weights + KV blocks + prefix-hash table + scheduler state — and that's what makes fork work.
Numbers are per-pod; freeze-side throughput is NVMe-bound (not code-bound). Re-measure on your own pod before citing as a ceiling. Methodology: docs/BENCHMARKS.md.
A pre-staged RAM path (mmap + cudaHostRegister) exists behind THAW_ZEROCOPY_MMAP=1. cudaHostRegister is O(pages) — pinning a 16 GB mmap costs ~7s, so the path is only a win when amortized across many restores (what thaw serve does by persisting the pin on each slot).
All paths produce bit-identical inference output. KV cache restore preserves the prefix cache across cold starts, so new requests that match a restored prefix skip prefill for it.
How it works
Fork is a composition of four primitives: freeze weights, freeze KV cache, freeze scheduler state, restore all three into a fresh process. None of that was possible at GPU speeds before thaw.
    flowchart TB
        A["Running vLLM engine weights (16 GB) + KV blocks + prefix-hash table + scheduler state"]
        A -- "thaw.freeze_model + thaw.freeze_kv_cache" --> B["Durable artifact (.thaw + .thawkv on disk or S3)"]
        B -- "pipelined CUDA DMA (double-buffered, O_DIRECT)" --> C1["Child engine 1 same weights + KV"]
        B -- "pipelined CUDA DMA" --> C2["Child engine 2 same weights + KV"]
        B -- "pipelined CUDA DMA" --> C3["Child engine N same weights + KV"]
        C1 --> D1[diverges here →]
        C2 --> D2[diverges here →]
        C3 --> D3[diverges here →]

        classDef src fill:#1e293b,stroke:#64748b,color:#f1f5f9
        classDef art fill:#0f172a,stroke:#38bdf8,color:#e0f2fe
        classDef child fill:#134e4a,stroke:#2dd4bf,color:#ccfbf1
        classDef diverge fill:none,stroke:none,color:#94a3b8
        class A src
        class B art
        class C1,C2,C3 child
        class D1,D2,D3 diverge
Freeze captures the full engine state into two binary files: .thaw (weights) and .thawkv (KV blocks + prefix-hash table + scheduler metadata).
Restore initializes a fresh vLLM engine with dummy weights (fast — no disk I/O), overwrites them from the snapshot via double-buffered pipelined DMA through pinned host memory, then rebuilds the prefix-cache block table from the .thawkv sidecar. Two CUDA streams overlap PCIe transfers with disk reads. New requests matching the restored prefix skip prefill entirely.
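The double-buffered pipeline described above can be illustrated in plain Python. This is a sketch of the general technique, not thaw's implementation (which lives in the Rust runtime crate): a thread and two queues stand in for the disk-read stage and the CUDA copy stream, and two reusable bytearrays stand in for pinned host buffers. All names are hypothetical.

```python
import queue
import threading

CHUNK = 4  # tiny chunk size so the example is easy to follow


def restore(snapshot: bytes) -> bytes:
    """Copy `snapshot` to a destination through two reusable staging buffers."""
    free = queue.Queue()     # staging buffers ready to be filled
    filled = queue.Queue()   # (buffer, length) pairs ready to be copied out
    for _ in range(2):       # double buffering: exactly two staging buffers
        free.put(bytearray(CHUNK))

    def reader() -> None:
        # Stand-in for the disk-read stage: fill whichever buffer is free.
        for off in range(0, len(snapshot), CHUNK):
            buf = free.get()
            part = snapshot[off:off + CHUNK]
            buf[:len(part)] = part
            filled.put((buf, len(part)))
        filled.put(None)     # sentinel: no more chunks

    t = threading.Thread(target=reader)
    t.start()

    dest = bytearray()
    # Stand-in for the host-to-device copy stage.
    while (item := filled.get()) is not None:
        buf, n = item
        dest += buf[:n]      # "DMA" the chunk to its destination
        free.put(buf)        # hand the buffer back to the reader
    t.join()
    return bytes(dest)


data = bytes(range(10))
print(restore(data) == data)
```

With two buffers, the reader can fill one while the other is being drained, so neither stage waits for the other unless it is strictly faster. That overlap is what the pipelining buys.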
Three restore modes:
Disk: reads snapshot from NVMe with O_DIRECT, bypassing the kernel page cache. Throughput is NVMe-bound; re-measure per pod before citing a ceiling.
Pre-staged RAM: snapshot already in memory (tmpfs, shared memory, or mmapped with page cache warm). The full zero-copy path (mmap + cudaHostRegister) is implemented behind THAW_ZEROCOPY_MMAP=1, but the one-time registration cost makes it a win only when amortized across many restores.
Slot-warm hot-swap (thaw serve): when a pool slot warms up, thaw serve pins the snapshot mmap once (~6s cudaHostRegister for 16 GB) and persists the pinned handle on the slot. Every subsequent model swap into that slot reuses the pinned buffer and runs as pure PCIe DMA — 0.29s at 55 GB/s for an 8B model on H100 SXM.
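Why the pinned path only pays off when amortized can be seen by averaging the one-time pin over the swaps that reuse it. The sketch uses the ~6s pin and 0.29s swap figures quoted above; the function name is illustrative.

```python
PIN_S = 6.0     # one-time cudaHostRegister cost quoted for a 16 GB snapshot
SWAP_S = 0.29   # per-swap DMA time quoted for an 8B model on H100 SXM


def avg_swap_seconds(swaps: int) -> float:
    """Average cost per swap when the pin is paid once and reused."""
    if swaps < 1:
        raise ValueError("swaps must be at least 1")
    return (PIN_S + swaps * SWAP_S) / swaps


for n in (1, 10, 100):
    print(f"{n:>3} swaps: {avg_swap_seconds(n):.2f} s per swap on average")
```

A single restore pays the full pin (about 6.3s in total), while a slot that is reused a hundred times averages about 0.35s per swap.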
KV cache snapshots are the hard part. vLLM's prefix-cache hash table maps token-hash → block-id, and the scheduler assumes those block assignments are live. thaw serializes the block contents, the hash table, and the scheduler's view of which blocks are cached. On restore, the block data is DMA'd back to GPU and the hash table is rebuilt — so a request whose prefix was cached in the parent immediately hits cache in the child. Nobody else does this.
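A toy model makes the restore requirement concrete: the block contents and the token-hash → block-id table must both survive the round trip for a child to hit cache. The sketch below is a simplified illustration under stated assumptions. It is not vLLM's block manager or thaw's file format; the chained SHA-256 hashing, the 4-token block size, and the JSON serialization are all stand-ins chosen for readability.

```python
import hashlib
import json

BLOCK_TOKENS = 4  # tokens per KV block (illustrative only)


def block_hashes(tokens: list[int]) -> list[str]:
    """Chain-hash each full block so a hash identifies the whole prefix up to it."""
    hashes, prev = [], ""
    for i in range(0, len(tokens) - len(tokens) % BLOCK_TOKENS, BLOCK_TOKENS):
        payload = prev + ":" + ",".join(map(str, tokens[i:i + BLOCK_TOKENS]))
        prev = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(prev)
    return hashes


class PrefixCache:
    """Toy token-hash -> block-id table plus the block contents it points at."""

    def __init__(self) -> None:
        self.table: dict[str, int] = {}
        self.blocks: dict[int, str] = {}   # block-id -> fake KV payload

    def insert(self, tokens: list[int]) -> None:
        for h in block_hashes(tokens):
            if h not in self.table:
                block_id = len(self.blocks)
                self.table[h] = block_id
                self.blocks[block_id] = f"kv-for-{h[:8]}"

    def cached_blocks(self, tokens: list[int]) -> int:
        """Number of leading blocks of `tokens` that are already cached."""
        hits = 0
        for h in block_hashes(tokens):
            if h not in self.table:
                break
            hits += 1
        return hits

    def freeze(self) -> str:
        return json.dumps({"table": self.table, "blocks": self.blocks})

    @classmethod
    def restore(cls, blob: str) -> "PrefixCache":
        state = json.loads(blob)
        cache = cls()
        cache.table = state["table"]
        cache.blocks = {int(k): v for k, v in state["blocks"].items()}
        return cache


parent = PrefixCache()
parent.insert(list(range(8)))                 # two full blocks cached in the parent
child = PrefixCache.restore(parent.freeze())
hits = child.cached_blocks(list(range(8)) + [99, 100, 101, 102])
print(hits)
```

The child reports hits for the two blocks it inherited and a miss for the new third block, which is the behavior the paragraph describes: the shared prefix is served from cache and only the divergent suffix needs computing.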
Sleep-mode integration (vLLM RFC #34303)
thaw_vllm.sleep_mode composes thaw's freeze/restore around vLLM's native LLM.sleep(level=2) + LLM.wake_up() — not a parallel path: sleep() freezes then lets vLLM's CuMemAllocator free the GPU memory; wake_up() re-allocates the tensor storage then thaw populates it. Requires enable_sleep_mode=True at LLM construction (strict-mode gate).
    sequenceDiagram
        autonumber
        participant U as user code
        participant TS as thaw_vllm.sleep_mode
        participant T as thaw (freeze/restore_model_tp)
        participant V as vLLM (LLM.sleep / wake_up)
        participant CMA as CuMemAllocator (GPU memory)

        rect rgba(59,130,246,0.12)
            note over U,CMA: sleep(llm, path, level=2)
            U->>TS: sleep(llm, path)
            TS->>T: freeze_model_tp(llm, path)
            T-->>TS: snapshot on disk (.thaw)
            TS->>V: llm.sleep(level=2)
            V->>CMA: release tagged allocations
            CMA-->>V: GPU memory freed (receipt: 72.67 GiB / rank on 70B TP=2)
            V-->>TS: ok
            TS-->>U: stats (freed=True)
        end

        rect rgba(16,185,129,0.12)
            note over U,CMA: wake_up(llm, path)
            U->>TS: wake_up(llm, path)
            TS->>V: llm.wake_up()
            V->>CMA: re-allocate tensor storage
            CMA-->>V: GPU tensors re-created (empty)
            V-->>TS: ok (~0.33s on 70B)
            TS->>T: restore_model_tp(llm, path)
            T-->>TS: snapshot populated into GPU tensors
            TS-->>U: stats (bit-identical greedy output)
        end
Receipts (2× H100 SXM, bit-identical greedy output on both ends): sleep_mode_8b_tp1.json, sleep_mode_70b_tp2.json. Source: python/thaw_vllm/sleep_mode.py. Tests: tests/test_sleep_mode.py (8 passing, CPU-only).
    from vllm import LLM
    import thaw_vllm.sleep_mode as sm

    llm = LLM(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        enable_sleep_mode=True,   # required by the strict-mode gate
        enforce_eager=True,
        dtype="float16",
    )

    llm.generate(["hello"])                 # warm the engine

    sm.sleep(llm, "/snap/llama8b.thaw")     # freeze, then llm.sleep(level=2)
    # GPU memory is actually freed here, not just tagged

    sm.wake_up(llm, "/snap/llama8b.thaw")   # llm.wake_up(), then restore
    llm.generate(["hello"])                 # bit-identical tokens
Architecture
    thaw/
      crates/
        thaw-core/          Rust. File format, region tables, I/O. No CUDA dep.
        thaw-cuda-sys/      Rust. FFI bindings to CUDA runtime (cudaMallocHost, cudaMemcpyAsync, streams). Built via build.rs.
        thaw-runtime/       Rust. Orchestration: freeze/restore pipelines, double-buffered DMA, O_DIRECT, thread-local WC-buffer cache, unified zero-copy/staging restore. MockCuda for Mac.
        thaw-py/            Rust. PyO3 bindings exposing pipelined freeze/restore to Python. Builds a native .so via maturin.
        thaw-cli/           Rust. thaw-bench-freeze binary + internal tooling.
      python/
        thaw_common/        Engine-agnostic freeze/restore primitives (shared).
        thaw_vllm/          vLLM integration + engine pool + OpenAI server.
          snapshot.py       vLLM TP freeze/restore via collective_rpc.
          kv_snapshot.py    KV cache freeze/restore (pipelined path, .meta sidecar).
          loader.py         vLLM ModelLoader: load_format="thaw".
          pool.py           Engine pool: pre-warmed slots, model hot-swap.
          server.py         OpenAI-compatible API server.
          cli.py            CLI: thaw freeze, thaw serve, thaw info.
        thaw_sglang/        SGLang integration (class-passthrough loader).
        vllm_demo.py        End-to-end benchmark: normal vs thaw cold start.
        kv_cache_demo.py    KV cache snapshot/restore demo with correctness test.
      demos/
        agent_fork.py       Agent fork demo: clone session, fork parallel completions.
Testing on Mac, shipping on GPU. The CudaBackend trait abstracts all GPU operations. MockCuda (a HashMap-backed fake) lets 48 runtime tests run on any machine. The cuda feature flag activates real GPU paths only when needed.
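The same pattern, an interface for device operations plus a map-backed fake, can be shown in Python. This is an analogue for illustration only: the real CudaBackend is a Rust trait whose methods are not listed here, so the method names below (`alloc`, `copy_to_device`, `copy_to_host`) are hypothetical.

```python
from typing import Protocol


class Backend(Protocol):
    """Minimal device interface; a Python analogue of a backend trait."""

    def alloc(self, size: int) -> int: ...
    def copy_to_device(self, ptr: int, data: bytes) -> None: ...
    def copy_to_host(self, ptr: int, size: int) -> bytes: ...


class MockBackend:
    """Dict-backed fake device memory, so tests need no GPU."""

    def __init__(self) -> None:
        self._mem: dict[int, bytearray] = {}
        self._next = 1

    def alloc(self, size: int) -> int:
        ptr, self._next = self._next, self._next + 1
        self._mem[ptr] = bytearray(size)
        return ptr

    def copy_to_device(self, ptr: int, data: bytes) -> None:
        if len(data) > len(self._mem[ptr]):
            raise ValueError("copy exceeds allocation")
        self._mem[ptr][:len(data)] = data

    def copy_to_host(self, ptr: int, size: int) -> bytes:
        return bytes(self._mem[ptr][:size])


def roundtrip(backend: Backend, payload: bytes) -> bytes:
    """Code under test depends only on the interface, not on real hardware."""
    ptr = backend.alloc(len(payload))
    backend.copy_to_device(ptr, payload)
    return backend.copy_to_host(ptr, len(payload))


result = roundtrip(MockBackend(), b"weights")
print(result)
```

Because `roundtrip` only sees the interface, the same logic can run against a fake on a laptop and against a real device backend on a GPU host.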
Quick start
pip install thaw-vllm[all]
This i
[truncated for AI cost control]