
VRAM Ghost Busting: Who You Gonna close()?

At H Company, GPU clusters showed VRAM held by ghost processes after job crashes. Investigation traced the issue to FUSE mounts used for S3 storage: the userspace daemon (rclone mount) was killed by OOM, but the kernel failed to abort the FUSE connection because a file descriptor was still held by the broker process (fusermount-server). This left training threads stuck in D-state holding CUDA contexts. A hotfix forced abort of FUSE connections, but a root cause fix requires addressing the fd leak.

Source: Hacker News AI | Author: zhwu

June 25th, 2026

Léonard Benedetti, Charles Park, Tony Wu (H Company); Kevin Mingtarja (SkyPilot)

Context

At H Company, our training workloads, in particular online RL and large supervised fine-tuning jobs, run on GPU clusters managed by SkyPilot across two backends, as described in our previous blog post: a Kubernetes one (on AWS SageMaker HyperPod, exposing EKS) and a Slurm one. A typical job is a multi-node distributed run on 8xH100 nodes; checkpoints and dataset shards live in S3 and have to be streamed in at high throughput.

SkyPilot’s MOUNT_CACHED is what makes that simple: it exposes an S3 bucket inside the pod as if it were a local directory, with a write-back cache on local disk and rclone mount driving asynchronous sync with the bucket. Training code sees a plain filesystem; under the hood, it is a FUSE mount, with the kernel relaying every filesystem syscall to a userspace daemon.

The bug below lived inside that path. It only ever surfaced on the Kubernetes backend, where SkyPilot uses an additional privileged helper to broker the FUSE mount for unprivileged user pods, but as discussed at the end of the post, the underlying pattern is generic and applies anywhere a similar fd-broker exists.

Initial observation: held VRAM, no owning process

The first sighting was an 8xH100 node that a researcher had freshly booked through SkyPilot. The job OOMed while loading model weights, before training proper had even started. We ran nvidia-smi after the job had crashed and got this:

On a freshly booked node, every GPU should report 0 MiB. Instead GPU 0 had eighty gigabytes pinned, and the seven others were not clean either: each still held between a few hundred MiB and a gigabyte of VRAM it should not have. None of it had a process to attach to or a PID to kill, as if VRAM were being used by a phantom process.

The pre-training Ray bootstrap log from the same node confirmed the picture:

It was not a one-off. Over the following weeks the same pattern kept reappearing: nodes that the SkyPilot scheduler considered idle, nvidia-smi reporting tens of gigabytes of allocated VRAM, and no owning process. We started referring to them as “ghost processes” 👻: no PID, no container, no obvious owner.
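A minimal sketch of how this first signal could be checked mechanically, assuming the CSV query mode of nvidia-smi. The function name held_vram_gpus and the threshold argument are illustrative, not part of our tooling:

```shell
# Sketch: flag GPUs that still hold VRAM on a node expected to be idle.
# Input (stdin): one "index, memory.used" line per GPU, in MiB, as printed by
#   nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits
# $1: threshold in MiB (hypothetical knob; defaults to 0, i.e. any held VRAM).
held_vram_gpus() {
    threshold_mib="${1:-0}"
    awk -F', *' -v t="$threshold_mib" \
        '$2 + 0 > t + 0 { printf "GPU %s holds %s MiB\n", $1, $2 }'
}

# Usage on a node (not executed here):
#   nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits | held_vram_gpus
#   nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader
```

If the first command flags GPUs while the second lists no compute process, the node matches the ghost pattern described above.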

Diagnosis: tracing the stuck workers back to FUSE

The standard process-level tools (lsof, nvidia-smi, dmesg) all came up empty: from their point of view, nothing was running on the node. We had to drop one layer down, to containerd, and then to the kernel, before anything useful turned up.

Every command below was therefore run on the bare GPU node itself, over AWS SSM Session Manager onto the SageMaker HyperPod instance, not from inside a pod or from a remote workstation. The containerd CLI talks to the host's containerd socket, and the kernel-level surfaces we ended up poking at (/proc/$tid/wchan, /sys/fs/fuse/connections/) only exist in the host’s procfs/sysfs.

Step 1 — Look for containers that should not be there anymore

Even though the Kubernetes API reported no pods on the node, we listed live containerd tasks directly:
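The checklist later in this post gives the command for this signal as ctr --namespace k8s.io tasks list. A minimal sketch of narrowing its output down to tasks still marked RUNNING, assuming a header row followed by TASK, PID and STATUS columns (the function name running_tasks is illustrative):

```shell
# Sketch: keep only the tasks that containerd still reports as RUNNING.
# Input (stdin): output of `ctr --namespace k8s.io tasks list`, assumed to be
# a header line followed by whitespace-separated TASK, PID, STATUS columns.
running_tasks() {
    awk 'NR > 1 && $3 == "RUNNING" { print $1, $2 }'
}

# Usage on the host (not executed here):
#   ctr --namespace k8s.io tasks list | running_tasks
```

Any task printed here on a node where the Kubernetes API reports no pods is a candidate for the stuck containers described next.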

Several containers were still in RUNNING even though their pods had been deleted minutes or hours earlier. Pulling the metadata back to the originating pod confirmed they were all SkyPilot job containers:

$ ctr --namespace k8s.io containers info

So whatever was holding the GPU was sitting inside a container that the kubelet had been unable to reap.

Step 2 — Find the threads the kernel cannot wake up

If a container is alive but SIGKILL cannot finish the job, the prime suspect is a process in D-state. D-state corresponds to the Linux TASK_UNINTERRUPTIBLE state, in which a thread is blocked inside a kernel call and ignores signals (including SIGKILL) until that kernel call returns (see this LWN article on TASK_KILLABLE for background). We listed every D-state thread on the node:
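The checklist later in this post condenses this step to ps -eLo state,tid piped through awk. A minimal sketch of that filter, written to read the ps output from stdin so it can be tried on captured output as well (the function name d_state_threads is illustrative):

```shell
# Sketch: print the thread IDs of all threads in uninterruptible sleep.
# Input (stdin): output of `ps -eLo state,tid,comm`; the header row starts
# with "S", so it never matches the D-state pattern.
d_state_threads() {
    awk '$1 ~ /^D/ { print $2 }'
}

# Usage on the host (not executed here):
#   ps -eLo state,tid,comm | d_state_threads
```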

This returned 30+ threads, with names like rl-training::Train, pt_nccl_heartbt, cuda-EvtHandlr, pt_elastic, and cuda00001800007, which appeared to have been started in Python:

$ ps -o pid,stat,comm -p

Those names pointed at CUDA/NCCL stuck inside the NVIDIA driver, exactly the kind of hang you would expect from a half-collapsed distributed training job. Red herring: they were just the thread names the training process had given its workers, not evidence the kernel was stuck inside CUDA. We spent some time pulling on that thread before catching it.

Step 3 — Find the culprit: FUSE

The thing the kernel actually knows about a sleeping task is its wait channel: the symbol of the kernel function it is blocked inside, which is exposed as wchan on Linux (see the proc_pid_wchan(5) man page). For instance:

$ ps -o pid,stat,wchan:24,comm -p

Hence, we dumped the wait channel for each thread from /proc/$tid/wchan and counted them:
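A minimal sketch of that tally, assuming a list of thread IDs on stdin and root privileges on the host. The function name count_wchans and the optional proc-root argument (useful for dry runs against a copy of the tree) are illustrative:

```shell
# Sketch: tally the wait channel of each thread ID read from stdin.
# $1: procfs root (defaults to /proc; overriding it is only for dry runs).
count_wchans() {
    proc_root="${1:-/proc}"
    while read -r tid; do
        # A thread may exit between listing and reading; skip it quietly.
        wchan=$(cat "$proc_root/$tid/wchan" 2>/dev/null) || continue
        printf '%s\n' "$wchan"
    done | sort | uniq -c | sort -rn
}

# Usage on the host (not executed here):
#   ps -eLo state,tid | awk '$1 ~ /^D/ { print $2 }' | count_wchans
```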

Every single D-state thread was sleeping inside the same kernel function: request_wait_answer. That symbol is defined in fs/fuse/dev.c, i.e. the FUSE device driver. It is the function that parks a thread until the userspace FUSE daemon answers a request (more details about FUSE below). None of this was CUDA, NCCL, or similar: every stuck worker was blocked on a FUSE reply that never arrived.

Concretely the training process was blocked inside a read() on a file that lived behind the FUSE mount, and the answer was never coming.

It also helped that, by the time we got here, we had two distinct sets of affected nodes: one running our rl-training workload and one not, both launched through SkyPilot. That narrowed the scope further: the intersection was the SkyPilot and FUSE setup, not the training code.

Step 4 — Confirm from the FUSE side

Every active FUSE mount on a Linux host shows up as a numbered directory under /sys/fs/fuse/connections/, with a waiting file giving (according to documentation):

The number of requests which are waiting to be transferred to userspace or being processed by the filesystem daemon. If there is no filesystem activity and ‘waiting’ is non-zero, then the filesystem is hung or deadlocked.

We listed connections with pending waiters:

$ for c in /sys/fs/fuse/connections/*/; do id=$(basename "$c") waiting=$(cat "$c/waiting" 2>/dev/null) [ "$waiting" -gt 0 ]

The number of pending waiters across connections matched the D-state thread count exactly. The kernel was holding live FUSE connections, with no userspace daemon on the other end answering them.
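A minimal sketch of such a listing, assuming the FUSE control filesystem is mounted at its usual location. The function name fuse_waiting_connections, the output format and the optional root argument are illustrative:

```shell
# Sketch: print "<connection id> <waiting>" for every FUSE connection that
# has pending requests.
# $1: control filesystem root (defaults to /sys/fs/fuse/connections;
#     overriding it is only for dry runs against a test directory).
fuse_waiting_connections() {
    conn_root="${1:-/sys/fs/fuse/connections}"
    for c in "$conn_root"/*/; do
        waiting=$(cat "$c/waiting" 2>/dev/null) || continue
        if [ "${waiting:-0}" -gt 0 ]; then
            echo "$(basename "$c") $waiting"
        fi
    done
    return 0
}

# Usage on the host, summing waiters to compare with the D-state thread count
# (not executed here):
#   fuse_waiting_connections | awk '{ total += $2 } END { print total + 0 }'
```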

But this is not how FUSE is supposed to behave: when the userspace daemon dies, the kernel is meant to abort all pending requests so that blocked readers wake up with -ECONNABORTED. The function that does this is fuse_abort_conn, and it was clearly not called for these connections.

Step 5 — Verify by forcing the connection closed

The FUSE control filesystem exposes an abort file per connection that forces the kernel to tear it down regardless of who still holds a file descriptor (“fd”) onto it. On a cordoned, otherwise-idle test node, we aborted the stuck connections:

$ for c in /sys/fs/fuse/connections/*/; do w=$(cat "$c/waiting" 2>/dev/null) [ "$w" -gt 0 ]

The blocked workers exited within seconds, and nvidia-smi immediately reported the GPUs as free. That closed the loop on the mechanism: stuck FUSE connections were keeping training processes parked in request_wait_answer, which kept them holding their CUDA contexts, which is why the VRAM never came back.
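For reference, a minimal sketch of the abort step as a reusable function. It relies on the documented behaviour that writing anything into a connection's abort file aborts that connection. The function name and the optional root argument are illustrative, and this is destructive: it should only be run as root on a node already known to be idle.

```shell
# Sketch: abort every FUSE connection that still has pending requests.
# $1: control filesystem root (defaults to /sys/fs/fuse/connections;
#     overriding it is only for dry runs against a test directory).
abort_stuck_fuse_connections() {
    conn_root="${1:-/sys/fs/fuse/connections}"
    for c in "$conn_root"/*/; do
        waiting=$(cat "$c/waiting" 2>/dev/null) || continue
        if [ "${waiting:-0}" -gt 0 ]; then
            echo "aborting FUSE connection $(basename "$c") (waiting=$waiting)"
            # Writing anything into 'abort' tears the connection down.
            echo 1 > "$c/abort"
        fi
    done
    return 0
}
```

Connections with no pending requests are left alone, so healthy mounts on the same node are not affected.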

What was not yet clear was: (a) why the userspace FUSE daemon died, and (b) why fuse_abort_conn had failed to fire on its own once the FUSE daemon died (which is what should normally short-circuit that chain by waking the parked read() with -ECONNABORTED).

Detection checklist

To recap, here are the five signals from steps 1–5 above, condensed into a single checklist you can run on another fleet:

Held VRAM on an idle GPU node with no owning process: nvidia-smi

Containers stuck in RUNNING after their pods are deleted: ctr --namespace k8s.io tasks list

D-state threads sleeping in request_wait_answer: cat /proc/$tid/wchan for each tid from ps -eLo state,tid | awk '$1 ~ /D/'

FUSE connections with pending waiters and no userspace daemon: cat /sys/fs/fuse/connections/*/waiting

/dev/fuse fds accumulating in fusermount-server across mounts: kubectl exec -n skypilot-system -- sh -c 'ls -l /proc/1/fd | grep -c fuse'
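For the last signal, a minimal sketch that counts open /dev/fuse descriptors by resolving the symlinks in a process's fd directory rather than grepping ls output. The function name count_fuse_fds is illustrative, and the default of PID 1 simply mirrors the checklist command:

```shell
# Sketch: count how many file descriptors in an fd directory point at /dev/fuse.
# $1: fd directory (defaults to /proc/1/fd, mirroring the checklist command).
count_fuse_fds() {
    fd_dir="${1:-/proc/1/fd}"
    count=0
    for fd in "$fd_dir"/*; do
        target=$(readlink "$fd" 2>/dev/null) || continue
        if [ "$target" = "/dev/fuse" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

A count that keeps growing as mounts come and go is the pattern to look for; a steady count that matches the number of live mounts is not suspicious by itself.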

A first Quick-and-Dirty™ hotfix: aborting stuck FUSE connections at fleet scale

While the investigation continued, we needed the fleet to stop poisoning new jobs. The Step 5 abort recipe worked node by node, so we wrapped it in a small script (kill_zombies_k8s; by this point, the ghosts had turned into zombies in our minds 🎃) that targeted idle GPU nodes on our SageMaker HyperPod cluster, probed each one for held VRAM with an nvidia-smi pod, and ran fusermount -u --abort on the ghost processes through AWS SSM.

The abort cleared most ghost processes on the spot. A few stubborn nodes kept their VRAM even after the kernel-level abort and required a reboot. That asymmetry was a clue we did not fully appreciate at the time; we’ll come back to it once the root cause is in view.

The hotfix gave researchers their cluster back. It did not tell us why the leak existed in the first place.

MOUNT_CACHED and the FUSE plumbing underneath

To explain why the kernel was holding live FUSE connections with no userspace daemon, we need to sketch the storage path that produced them and to explain how FUSE works.

SkyPilot’s MOUNT_CACHED exposes, using rclone mount, an S3 bucket to a job as a local directory with a local VFS cache layered on top: reads are served from cache when possible, and writes hit local disk first and are asynchronously uploaded to the bucket. That is what makes it usable for our checkpoints: training writes at local-disk speed instead of waiting on S3 round-trips.

Under the hood, it is a FUSE mount. The kernel docs describe it this way:

FUSE is a userspace filesystem framework. It consists of a kernel module (fuse.ko), a userspace library (libfuse.*) and a mount utility (fusermount).

A FUSE mount has two halves wired together by a single open fd onto /dev/fuse. The kernel side receives every filesystem syscall against the mount point (e.g. read("/mnt/myfs/foo")) and serializes it as a request (FUSE_READ) onto that fd; the userspace daemon reads requests from the fd onto /dev/fuse and writes back answers. In MOUNT_CACHED’s case that daemon is rclone mount, which serves reads from the local cache and reconciles with S3 in the background.
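To see which mount points on a host are backed by FUSE in the first place, a minimal sketch that filters the kernel's mount table by filesystem type. The function name fuse_mounts is illustrative, and the assumption is that FUSE filesystems report a type of fuse or fuse.<name> (an rclone mount would typically appear as fuse.rclone):

```shell
# Sketch: list "<mount point> <fstype>" for every FUSE-backed mount.
# $1: mount table to read (defaults to /proc/mounts).
# The fusectl control filesystem itself is deliberately not matched.
fuse_mounts() {
    mounts_file="${1:-/proc/mounts}"
    awk '$3 == "fuse" || $3 ~ /^fuse\./ { print $2, $3 }' "$mounts_file"
}
```

Every path printed here is one where a read() is ultimately answered by a userspace daemon rather than by the kernel alone.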

Why the userspace FUSE daemon died

This was the easiest question to answer: the rclone mount daemon was OOM-killed under heavy parallel reads. Configured with vfs_read_chunk_size: 32M and vfs_read_chunk_streams: 16, it shares the main container’s memory budget with the training process, so a memory spike on the training side could evict it. That answers (a) but not (b): even after rclone was gone, the kernel kept parking read()s on the mount point instead of aborting them with -ECONNABORTED.
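As a rough back-of-the-envelope sketch of why those two settings matter, assume each file being read can buffer up to one chunk per stream. That is an assumption for illustration, not a measured figure, and the function name and the open-file count are hypothetical:

```shell
# Sketch: rough upper bound, in MiB, on read buffering under the stated assumption
# that every open file can hold one chunk per parallel stream.
# $1: chunk size in MiB, $2: parallel streams, $3: files read concurrently (default 1).
rclone_read_buffer_mib() {
    chunk_mib="$1"
    streams="$2"
    open_files="${3:-1}"
    echo $((chunk_mib * streams * open_files))
}

# With the settings quoted above: rclone_read_buffer_mib 32 16
# With eight files read in parallel (hypothetical): rclone_read_buffer_mib 32 16 8
```

Under that assumption the quoted settings allow 512 MiB per file, which multiplies quickly when many shards are streamed at once inside a memory budget shared with the training process.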

If you are picking up MOUNT_CACHED for production training, SkyPilot’s cache-tuning guide is the n

[truncated for AI cost control]
