2026-07-01 16:19 UTC | In-site rewrite | 6 min read | Updated: 2026-07-01 16:32 UTC

Reduce gVisor Cold Starts with GPU Snapshotting

This article describes how Cerebrium uses GPU memory checkpointing to reduce cold start time of GPU workloads in gVisor sandboxes from 50 seconds to as low as 2.25 seconds. It explains the concept: perform expensive startup work once, freeze the result, and restore on demand. The implementation involves modifying the gVisor containerd shim to decide at container creation whether to boot normally or restore a checkpoint, and addresses various edge cases related to timing, network state, multiprocessing, file system, and storage performance.

Source: Hacker News AI | Author: jono_irwin

Key points

GPU workload initialization (Python imports, PyTorch loading, kernel compilation) is deterministic and can be cached via checkpointing.
Cerebrium extended the gVisor runtime to restore from snapshots at container creation when a compatible checkpoint exists.
Restoring a 9 GiB checkpoint reduced startup from 50s to 2.25s (S3) or 9s (local NVMe) on a g5.12xlarge instance.
Real-world challenges include handling network connections, multiprocessing file descriptors, local runtime files, CUDA work consistency, and checkpoint compatibility.

A GPU workload’s initialization path includes importing Python modules, loading PyTorch, assembling model weights, copying them onto the GPU, and running the framework’s warmup path: torch.compile, CUDA graph capture, KV cache initialization, and whatever else the serving stack needs before it can take traffic.

Every one of these stages is deterministic.

Importing PyTorch produces the same loaded modules every time. Building the model and copying weights onto the GPU produces the same bytes in GPU memory every time. torch.compile and CUDA graph capture produce the same kernels every time.

Yet on every scale-up, we pay to recompute a result that is already known.

That is what checkpointing changes.

The idea is simple: do the expensive startup work once, freeze the result, and restore it on demand.

Concretely, taking a checkpoint means:

Pause execution: pause all application processes, threads, and, crucially, GPU work.

Dump memory: serialize the in-memory state from both CPU and GPU to files.

Upload: push those files to fast, durable storage.

Restoring runs the same process in reverse. We pull the checkpoint files down, rehydrate CPU and GPU memory, repair the pieces of state that cannot survive a move, and unpause the workload.

The restored application process is the same warmed-up runtime we froze earlier: PyTorch has already been imported, model weights are already resident on the GPU, kernels are already compiled, and the application is ready to serve traffic.
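The freeze-and-restore idea can be illustrated with a deliberately small analogy. The sketch below works at the application level with pickle, not at the level of process or GPU memory, and every name in it (expensive_warmup, take_checkpoint, restore_checkpoint) is hypothetical. It only shows the shape of the trade: pay for warmup once, then hand later replicas the frozen result.

```python
import pickle
import time

WARMUP_RUNS = 0  # counts how often the expensive path actually executes


def expensive_warmup() -> dict:
    """Stand-in for imports, weight loading, and kernel compilation."""
    global WARMUP_RUNS
    WARMUP_RUNS += 1
    time.sleep(0.05)  # pretend this is the slow part
    return {
        "weights": list(range(1000)),
        "compiled_kernels": ["matmul", "attention"],
    }


def take_checkpoint(state: dict) -> bytes:
    """Freeze the warmed state into bytes that could be uploaded to storage."""
    return pickle.dumps(state)


def restore_checkpoint(blob: bytes) -> dict:
    """Rehydrate the warmed state without rerunning the warmup."""
    return pickle.loads(blob)


# First start: pay the full cost once, then freeze the result.
warm_state = expensive_warmup()
snapshot = take_checkpoint(warm_state)

# Later scale-ups: restore instead of recomputing.
replicas = [restore_checkpoint(snapshot) for _ in range(3)]
```

A real checkpoint captures the whole process, including GPU memory, so the application does not need to know which objects to serialize. That is the part the toy version leaves out.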

The mental model is straightforward. Making it work reliably for real GPU workloads is not.

High-level architecture

At a high level, checkpointing needs to sit in the one place where it can control the full lifecycle of a container: between the container runtime and the sandbox running the workload.

Cerebrium runs user workloads inside gVisor sandboxes for isolation. To support checkpointing, we extended that runtime path so that when a container starts, we can make a decision before the normal boot sequence completes:

Should this container start from scratch, or should it be restored from a checkpoint?

If no checkpoint exists, the container follows the normal path. The image starts, the application boots, models load, GPU memory is populated, and the workload becomes ready. Once the container is fully warmed, the user can trigger a checkpoint. At that point, we pause the workload, capture its CPU and GPU state, write the checkpoint to disk, and upload it to fast storage.

If a checkpoint does exist, we skip the normal startup path. Instead of launching the container and waiting for Python imports, model loading, GPU transfers, torch.compile, and CUDA graph capture, we restore the saved state directly into the sandbox. The process resumes as if it had just finished warming up.

That sounds simple, but it requires the runtime to answer a few questions at exactly the right time:

Which workload is being started?

Does a compatible checkpoint exist for this image, GPU type, machine type, and runtime version?

Where is the checkpoint stored?

Is the checkpoint already cached locally on the host?

Should we restore, or fall back to a clean boot?
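One way to picture how those questions combine is a single lookup that ends in either a restore or a clean boot. This is a minimal sketch under assumptions: the CheckpointRecord type, its fields, the index dictionary, and the example key and URL are all hypothetical and are not taken from Cerebrium's implementation.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class CheckpointRecord:
    """Hypothetical index entry describing one stored checkpoint."""
    remote_url: str                    # durable copy in object storage
    cached_path: Optional[str] = None  # set when this host already holds a copy
    corrupted: bool = False            # set when validation of the files failed


def choose_start_path(
    workload_key: str, index: Dict[str, CheckpointRecord]
) -> Tuple[str, Optional[str]]:
    """Return (action, location) for a container that is about to start."""
    record = index.get(workload_key)
    if record is None or record.corrupted:
        # No usable checkpoint for this image/hardware/runtime: boot normally.
        return ("clean_boot", None)
    if record.cached_path is not None:
        # The checkpoint files are already on this host.
        return ("restore", record.cached_path)
    # Otherwise the files have to be pulled from durable storage first.
    return ("restore", record.remote_url)


# Hypothetical index with one remote-only checkpoint.
index = {
    "llm-image@a10g": CheckpointRecord(remote_url="s3://example-bucket/ckpt/llm"),
}
decision = choose_start_path("llm-image@a10g", index)
fallback = choose_start_path("unknown-image@a10g", index)
```

Keeping the clean boot as the default branch means any doubt about a checkpoint degrades to the ordinary startup path rather than to a failed container.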

To make this work, we added two components to the node runtime.

The first is a small checkpoint service that runs on every host. It handles the operational side of checkpointing: downloading checkpoints, uploading new ones, caching them locally, evicting old or corrupted checkpoints, and reporting restore status.

The second is a modified gVisor containerd shim. This is the piece that sits in the container startup path. It intercepts container creation, checks whether a checkpoint can be restored, and either continues with the normal boot flow or replaces that flow with a restore.

In other words, the checkpoint service moves and manages the snapshot files. The shim decides whether a new container should boot normally or wake up from a snapshot.

The hardest part was not the API between those two components. It was timing.

Containerd starts a sandbox through a fixed sequence:

Sandbox Create → Sandbox Start → Container Create → Container Start

The natural place to decide whether to restore is when the sandbox starts. But at that point, we do not yet have enough information about the container image to know whether a checkpoint exists. The image information only becomes available later, during container creation.

So we had to reorder the startup sequence slightly.

When containerd asks us to start the sandbox, we defer the real start. We keep containerd satisfied with the expected status responses, but delay the actual sandbox startup until container creation, once we know which image is being launched and whether a matching checkpoint exists.

At that point, we choose one of two paths:

Normal boot: start the sandbox, launch the container, let the application initialize, and optionally checkpoint it once warm.

Checkpoint restore: download or locate the checkpoint, restore CPU and GPU memory into the sandbox, repair runtime state that cannot survive a move, and resume the process.

The work is mostly the same work the runtime would already do. The key change is that we moved the restore decision from sandbox start to container creation, where the image information is finally available and we can determine whether a matching checkpoint exists.

That small reordering is what lets checkpointing feel transparent from the user’s perspective. They start a workload the same way, but once a checkpoint exists, future scale-ups restore the warmed process instead of rebuilding it from scratch.
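The reordering can be modeled as a small state machine. The sketch below is a toy in Python, not the containerd shim API, and the class and method names are hypothetical. It only shows the ordering: the shim acknowledges the sandbox start, remembers that a start is owed, and performs it during container creation once the image is known.

```python
class DeferredStartShim:
    """Toy model of a shim that postpones sandbox start until the image is known."""

    def __init__(self, images_with_checkpoint: set):
        self.images_with_checkpoint = images_with_checkpoint
        self.sandbox_start_pending = False
        self.sandbox_running = False
        self.log = []

    def sandbox_create(self) -> str:
        self.log.append("sandbox created")
        return "ok"

    def sandbox_start(self) -> str:
        # Report success to the caller, but only record that a start is owed.
        self.sandbox_start_pending = True
        self.log.append("sandbox start deferred")
        return "ok"

    def container_create(self, image: str) -> str:
        # The image is known only now, so the restore decision happens here.
        if not self.sandbox_start_pending:
            raise RuntimeError("container created before sandbox start")
        path = "restore" if image in self.images_with_checkpoint else "normal boot"
        self.sandbox_start_pending = False
        self.sandbox_running = True
        self.log.append(f"sandbox started via {path}")
        return path

    def container_start(self) -> str:
        if not self.sandbox_running:
            raise RuntimeError("container started before sandbox is running")
        self.log.append("container running")
        return "ok"


# Walk through the fixed sequence for an image that has a checkpoint.
shim = DeferredStartShim(images_with_checkpoint={"registry.example/llm:warm"})
shim.sandbox_create()
shim.sandbox_start()
path = shim.container_create("registry.example/llm:warm")
shim.container_start()
```

From the caller's side the four calls arrive and succeed in the usual order. Only the log shows that the real start happened one step later than requested.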

As we tested and developed the feature, we ran into several edge cases that were not obvious from the available documentation. Where possible, we are working to move those fixes upstream so that the next team adopting this technology does not have to rediscover the same issues.

Some of the issues we uncovered included:

A race condition in the TCP network stack that stopped the network from working when the container received many packets during the checkpointing process.

A race condition that crashed gVisor when running under containerd if a checkpoint took longer than a few seconds.

The need to support Container Device Interface (CDI) injection for NVIDIA GPUs.

Checkpoint distribution: why the storage layer matters more than you'd think

A checkpoint of a warmed-up GPU container is large. One of our test workloads is around 9 GiB, while restoring DeepSeek V4 FP8 with vLLM would mean a checkpoint of 640 GB. Restoring is only worth it if we can move that much data faster than the container would have cold-started on its own. That makes the storage and network path the single most important design decision in the whole system.

The math is unforgiving:

For our 9 GB container on a g5.12xlarge, a full vLLM cold start took around 50 seconds. Restoring from a 9 GiB checkpoint reduced startup to 2.25 seconds from S3 and 9 seconds from local NVMe.

We use S3 as the default restore path because it is fast enough and portable across the clouds and regions Cerebrium supports. Local NVMe is fast when the checkpoint is already cached on the node, while object storage remains the durable source of truth.

These results are specific to g5.12xlarge. On nodes with higher network bandwidth or faster local storage, restore times improve further.
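The figures above imply an effective restore rate, which makes the break-even condition easy to check. The sketch below only does arithmetic on the numbers quoted in this section. The helper names are hypothetical, and the 640 GB projection assumes the same effective rate would hold at that size, which the measurements here do not establish.

```python
GIB = 2**30  # bytes in one GiB


def effective_rate_gib_s(checkpoint_gib: float, restore_seconds: float) -> float:
    """End-to-end restore rate: checkpoint size divided by total restore time."""
    return checkpoint_gib / restore_seconds


def restore_beats_cold_start(
    checkpoint_bytes: float, rate_bytes_s: float, cold_start_s: float
) -> bool:
    """Restoring only pays off if moving the data is faster than booting."""
    return checkpoint_bytes / rate_bytes_s < cold_start_s


s3_rate = effective_rate_gib_s(9, 2.25)    # 4.0 GiB/s
nvme_rate = effective_rate_gib_s(9, 9.0)   # 1.0 GiB/s

# The same S3 figure expressed in gigabits per second (about 34.4).
s3_gbit_s = s3_rate * GIB * 8 / 1e9

# Break-even check for the measured workload against its 50 s cold start.
worth_it = restore_beats_cold_start(9 * GIB, s3_rate * GIB, 50.0)

# Assumption: the 4 GiB/s rate also holds for a 640 GB checkpoint.
projected_640gb_s = 640e9 / (s3_rate * GIB)  # roughly 149 seconds
```

Because restore time also covers rehydrating memory, these rates are a lower bound on the raw transfer speed. They still show how quickly the required bandwidth grows with checkpoint size.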

The hard part: real workloads are messy

Checkpointing is easiest when the workload’s state is fully contained in memory. Real GPU workloads are rarely that clean.

A snapshot can preserve the warmed-up runtime, but it cannot blindly preserve every external dependency around it. After restore, the application may still hold references to a filesystem path, socket, IP address, device handle, or driver state that was valid before the move but invalid after it. That is where most of the finicky behavior comes from.

Network state is the first obvious example. Open TCP connections are tied to the original runtime environment. After restore, those connections have been terminated, and the container may also have a different external IP. This breaks frameworks that use the container’s external IP for internal heartbeats, worker coordination, or control-plane communication. In vLLM, for example, this meant the process could restore successfully but still fail internally because parts of the runtime were trying to communicate through an address that was no longer valid. The fix was to pin internal framework communication to loopback using VLLM_HOST_IP=127.0.0.1, so that worker coordination no longer depended on the external IP assigned to the container.

Multiprocessing creates another class of problems. Many Python serving frameworks use worker processes, and if those workers are created with fork, they can inherit NVIDIA driver file descriptors from the parent process. That matters because the checkpoint system needs a clean understanding of which processes actually own GPU state. Leaked driver file descriptors can make the runtime believe the GPU is still in use by processes that should not block checkpointing, or cause restore behavior that is difficult to reason about. For vLLM, the fix was to use spawn instead of fork for GPU workers with VLLM_WORKER_MULTIPROC_METHOD=spawn, so child processes start cleanly instead of inheriting GPU driver state from the parent.

Local runtime files are another subtle edge. Frameworks often create Unix sockets, temporary files, lock files, and coordination state on local disk. If that local filesystem is not restored with the checkpoint, the process can wake up expecting files that no longer exist. This is one of the more annoying failure modes because the process may look healthy from the outside while workers silently fail to communicate internally. In vLLM, we solved this by moving restore-critical RPC state to a small preserved path using VLLM_RPC_BASE_PATH=/run/cuda-ckpt.
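The three vLLM settings from the last three paragraphs can be collected in one place. The sketch below only builds an environment mapping. The variable names and values are the ones given above, while the helper name build_vllm_env and the idea of layering them over a base environment are assumptions made for illustration.

```python
import os
from typing import Dict, Mapping, Optional

# Settings described above for making vLLM survive a checkpoint and restore.
RESTORE_SAFE_VLLM_ENV = {
    # Keep internal coordination on loopback, not the container's external IP.
    "VLLM_HOST_IP": "127.0.0.1",
    # Start GPU workers with spawn so they do not inherit driver descriptors.
    "VLLM_WORKER_MULTIPROC_METHOD": "spawn",
    # Put restore-critical RPC state on a path preserved with the checkpoint.
    "VLLM_RPC_BASE_PATH": "/run/cuda-ckpt",
}


def build_vllm_env(base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the base environment with the restore-safe settings applied."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(RESTORE_SAFE_VLLM_ENV)
    return env


# Example: layer the settings over a minimal, hypothetical base environment.
base = {"PATH": "/usr/bin", "VLLM_HOST_IP": "10.0.0.7"}
env = build_vllm_env(base)
```

The resulting mapping could be passed as the env argument when launching the server process, for example through subprocess.Popen. Setting the values before the process starts matters, because they need to be in place before any workers or sockets are created.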

The timing of the checkpoint also matters. A checkpoint needs a consistent view of CPU and GPU memory. If CUDA work is still running while the snapshot is taken, the checkpoint may be inconsistent or unsafe to restore. In practice, this means checkpointing has to happen after the workload has finished warming up and reached a known idle state. For some frameworks, that requires an explicit readiness step: load the model, run the warmup pass, wait for compilation or CUDA graph capture to finish, and only then trigger the checkpoint.

Another optimization is deciding what should not be checkpointed. vLLM sleep mode is useful here because it can drop transient state like the KV cache before the checkpoint is taken. The KV cache can be large, and preserving it makes the checkpoint bigger, slower to upload, and slower to restore. For many workloads, the cache is not worth carrying across restores because it is request-specific and can be rebuilt naturally once traffic resumes. In those cases, putting vLLM into sleep mode before checkpointing dramatically reduces the snapshot size and improves restore performance. We expose this as a choice rather than forcing one behavior: users can decide whether they want to preserve that state across restores or discard it to make checkpointing faster.
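The ordering described in the last two paragraphs can be written as one small function. This is a sketch under assumptions: the function name and all of its parameters are hypothetical, and each step is passed in as a callable so the example stays independent of any particular framework. In a PyTorch-based stack, the idle step might be something like torch.cuda.synchronize, and the optional drop step might put vLLM into sleep mode.

```python
from typing import Callable, List, Optional


def checkpoint_when_idle(
    load_model: Callable[[], None],
    run_warmup: Callable[[], None],
    wait_for_gpu_idle: Callable[[], None],
    trigger_checkpoint: Callable[[], None],
    drop_transient_state: Optional[Callable[[], None]] = None,
) -> List[str]:
    """Run the readiness steps in order and return the names of the steps taken."""
    steps = []

    load_model()
    steps.append("load_model")

    run_warmup()  # compilation and graph capture happen here
    steps.append("run_warmup")

    if drop_transient_state is not None:
        drop_transient_state()  # e.g. discard a KV cache that is cheap to rebuild
        steps.append("drop_transient_state")

    wait_for_gpu_idle()  # no GPU work may be in flight when the snapshot starts
    steps.append("wait_for_gpu_idle")

    trigger_checkpoint()
    steps.append("trigger_checkpoint")
    return steps


# Example with stand-in callables that only record that they were called.
calls: List[str] = []
steps = checkpoint_when_idle(
    load_model=lambda: calls.append("load"),
    run_warmup=lambda: calls.append("warmup"),
    wait_for_gpu_idle=lambda: calls.append("idle"),
    trigger_checkpoint=lambda: calls.append("checkpoint"),
    drop_transient_state=lambda: calls.append("drop"),
)
```

The drop step runs before the idle wait on purpose, since releasing GPU state can itself queue GPU work that should finish before the snapshot is taken.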

The last constraint is compatibility. A GPU memory checkpoint is not a portable artifact in the same way a container image is. It is tied to the environment it was created in: GPU type, CPU architecture, machine type, driver/runtime compatibility, and gVisor version. A checkpoint created on one combination of hardware and runtime cannot safely be restored onto an arbitrary different one. Because of that, we key checkpoints by compatibility, not just by application. The restore path only uses a checkpoint when the target environment matches the original checkpoint environment.
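Keying by compatibility can be sketched as hashing every field that must match. The dataclass below is an illustration only. The field names follow the list in the paragraph above, while the class name, the hashing scheme, and the example values are assumptions and do not describe how Cerebrium actually derives its keys.

```python
import hashlib
from dataclasses import astuple, dataclass, replace


@dataclass(frozen=True)
class CheckpointEnvironment:
    """Everything that must match between checkpoint time and restore time."""
    image_digest: str
    gpu_type: str
    cpu_arch: str
    machine_type: str
    driver_version: str
    gvisor_version: str

    def key(self) -> str:
        # Join fields with a separator that cannot appear in normal values,
        # so ("ab", "c") and ("a", "bc") never collide.
        joined = "\x1f".join(astuple(self))
        return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Placeholder values for illustration only.
created_on = CheckpointEnvironment(
    image_digest="sha256:example-digest",
    gpu_type="A10G",
    cpu_arch="x86_64",
    machine_type="g5.12xlarge",
    driver_version="example-driver",
    gvisor_version="example-release",
)

same_shape = replace(created_on)                   # identical environment
other_gpu = replace(created_on, gpu_type="H100")   # one field differs

can_restore_same = created_on.key() == same_shape.key()
can_restore_other = created_on.key() == other_gpu.key()
```

With a key like this, a lookup on a mismatched node simply finds nothing, and the container falls back to a clean boot.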

The bigger pattern is that GPU memory checkpointing is not just “dump memory and reload it.” It is about separating state that can be frozen from state that must be recreated, reconnected, or moved into a checkp

[truncated for AI cost control]