MinIO adds petabyte-scale MemKV cache for Nvidia GPU inference

MinIO introduces MemKV, a petabyte-scale caching system for Nvidia GPUs, built on its AIStor object storage. It leverages Nvidia's STX architecture to provide microsecond-latency shared context across GPU clusters, boosting utilization from 50% to over 90% in a 128-GPU deployment and yielding $2 million annual compute savings.

Key points

  • MemKV sits behind the GPU HBM, CPU DRAM, and local SSD caches and in front of AIStor object storage, running natively within BlueField-4 STX infrastructure.
  • MinIO claims a large improvement in time-to-first-token and a rise in GPU utilization from 50 percent to over 90 percent in a 128-GPU inference deployment.
  • MinIO positions MemKV as a G3.5 tier, purpose-built for inference, contrasting with legacy storage approaches.

Why it matters

This matters because, when the HBM, CPU DRAM, and local SSD caches fill up, GPUs have to recompute context they have already generated. MemKV gives the whole cluster a shared pool of context to fetch from instead.

Technical impact

May affect model selection, inference cost, product capability, and evaluation benchmarks.

Chris Mellor

Blocks & Files editor

Published Tue 12 May 2026 // 13:00 UTC

MinIO has built a petabyte-scale MemKV caching system for Nvidia GPUs, effectively sitting atop its AIStor object storage.

A cluster of GPUs running AI inferencing workloads needs closely coupled high-bandwidth memory (HBM) to hold the context, the vectorized tokens and intermediate key-value, or KV, pairs used during repeated inference calculations. In a KV cache scheme, when a GPU's HBM fills up, the data is cached downstream in the memory hierarchy to the server's CPU DRAM and then to NVMe SSDs with controller software running in Nvidia's BlueField-4 (BF4) DPUs. When these fill up, a backing object storage system such as MinIO's AIStor comes into play. Nvidia's STX architecture defines how this hierarchy of caches layered on top of external storage operates. MemKV complies with it to deliver persistent, shared context across GPU clusters at a scale MinIO claims existing memory and storage tiers cannot satisfy.
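The spill-down behavior described above can be sketched as a chain of fixed-size caches. This is a minimal illustration, not MinIO's or Nvidia's implementation: the `Tier` and `TieredKVCache` classes, the tier names, the tiny capacities, and the least-recently-used eviction policy are all assumptions made for the example.

```python
from collections import OrderedDict


class Tier:
    """One cache level holding a fixed number of blocks, evicted in LRU order."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.blocks = OrderedDict()

    def put(self, key, value):
        """Store a block; return the evicted (key, value) pair, or None."""
        self.blocks[key] = value
        self.blocks.move_to_end(key)
        if len(self.blocks) > self.capacity:
            return self.blocks.popitem(last=False)  # least recently used
        return None


class TieredKVCache:
    """Tiers ordered fastest first; evictions cascade to the next tier down."""

    def __init__(self, tiers):
        self.tiers = tiers

    def put(self, key, value):
        spilled = (key, value)
        for tier in self.tiers:
            spilled = tier.put(*spilled)
            if spilled is None:
                return
        # A block evicted from the last tier is dropped and must be recomputed.

    def get(self, key):
        """Return (value, tier_name), promoting a hit back to the fastest tier."""
        for tier in self.tiers:
            if key in tier.blocks:
                value = tier.blocks.pop(key)
                self.put(key, value)
                return value, tier.name
        return None, "recompute"


# Hypothetical capacities, chosen only to force spills in a short demo.
cache = TieredKVCache([
    Tier("HBM", 1), Tier("DRAM", 1), Tier("local NVMe", 1), Tier("shared pool", 2),
])
for block_id in ["a", "b", "c", "d"]:
    cache.put(block_id, f"kv-{block_id}")

print(cache.get("a"))  # found in the slowest tier, then promoted
print(cache.get("a"))  # now served from the fastest tier
print(cache.get("z"))  # never cached, so the caller must recompute
```

The last tier stands in for whatever sits behind the local caches. A miss there is the case where a GPU has to regenerate context from scratch.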

AB Periasamy, co-founder and co-CEO, said: "The industry has been papering over context loss for years because, at small scale, you may be able to absorb the recompute tax and move on. At the GPU density hyperscalers and neoclouds are building toward, that is no longer true. A GPU recomputing context it has already generated is burning power without return, and at a thousand GPUs that is not inefficiency, it is structural drag. Yield economics at this scale demand something purpose-built for the inference data path. MemKV was designed for exactly this."

MinIO says that, for the first time, an entire GPU cluster can access a common pool of context at microsecond latencies that keep pace with inference, rather than waiting on millisecond-latency external storage. When the HBM, CPU DRAM, and local SSD caches fill up, expensive GPUs have to recompute context.
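To see how quickly those caches can fill, here is a rough estimate of KV cache size using the standard per-token formula for transformer attention (one key and one value vector per layer and KV head). The model dimensions below are hypothetical and do not come from MinIO. The 128K-token context matches the length MinIO cites for its test deployment.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor of 2: one key vector and one value vector per layer and KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


GIB = 1024 ** 3

# Hypothetical model shape (assumed for illustration): 80 layers,
# 8 KV heads of dimension 128, 16-bit values.
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_value=2)
context_tokens = 128 * 1024
per_sequence = per_token * context_tokens

print(per_token // 1024, "KiB per token")      # 320 KiB per token
print(per_sequence / GIB, "GiB per sequence")  # 40.0 GiB per sequence
```

Under these assumed dimensions, one full-length sequence carries about 40 GiB of KV state, so a modest number of concurrent long-context sessions can outgrow any single memory tier.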

MinIO claims MemKV delivered a large improvement in time-to-first-token at production concurrency. It also increased GPU utilization from 50 percent to over 90 percent, in a 128-GPU deployment with a 128K-token context length, which MinIO says resulted in $2 million in annual compute savings.
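MinIO has not published how it arrived at the savings figure. One way to sanity-check it is to treat the savings as the value of the GPU-hours recovered by the utilization gain and back out the implied price per GPU-hour. That interpretation, the year-round operation, and the function names are assumptions of this sketch, and 90 percent is used as a lower bound for "over 90 percent".

```python
HOURS_PER_YEAR = 24 * 365


def useful_gpu_hours(gpus, utilization, hours=HOURS_PER_YEAR):
    """GPU-hours per year spent on useful work at a given utilization."""
    return gpus * utilization * hours


def implied_rate(savings, gpus, util_before, util_after):
    """Dollar value per GPU-hour implied by a savings claim (assumed model)."""
    recovered = useful_gpu_hours(gpus, util_after) - useful_gpu_hours(gpus, util_before)
    return savings / recovered


recovered = useful_gpu_hours(128, 0.90) - useful_gpu_hours(128, 0.50)
rate = implied_rate(2_000_000, gpus=128, util_before=0.50, util_after=0.90)

print(f"{recovered:,.0f} GPU-hours recovered per year")  # 448,512
print(f"${rate:.2f} implied per GPU-hour")                # $4.46
```

On these assumptions, the claim corresponds to valuing a GPU-hour at roughly $4.46. A higher final utilization or a different accounting method would change that figure.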

We're told MemKV is purpose-built for the inference data path, being designed for Nvidia's STX architecture, and supporting its Dynamo and NIXL caching software. It delivers petabytes of shared context memory at SSD economics, replacing the cost and capacity constraints of GPU HBM and DRAM with a tier that scales independently of the compute cluster. MemKV features:

Native support for BlueField-4 STX: Runs directly within STX infrastructure as a single ARM64-native binary, embedded in the storage tier rather than deployed on separate x86 storage servers connected over the network.

End-to-end RDMA transport: KV cache moves data from GPU memory to NVMe over RDMA, bypassing file system or object storage protocols entirely.

GPU-native block sizes: Operates in 2-16 MB blocks optimized for throughput-oriented GPU access patterns, not the 4 KB blocks designed for legacy storage workloads.

Wire-speed fabric performance: Built for Nvidia Spectrum-X Ethernet networking and PCIe Gen6, driving throughput to near-wire-speed across the physical fabric.

MemKV moves data directly from NVMe SSDs to the AI data path via end-to-end RDMA transport, with no HTTP overhead, no file system translation, and no storage servers between the GPU and its context.
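The block-size point can be made concrete by counting how many transfers it takes to move the same payload at each block size. This is simple arithmetic, not a model of MemKV's actual I/O path, and it assumes binary units (1 MB = 1,024 KB) for the sizes quoted above.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def blocks_needed(payload_bytes, block_bytes):
    """Number of blocks required to carry a payload (ceiling division)."""
    return -(-payload_bytes // block_bytes)


for label, block in [("4 KB", 4 * KIB), ("2 MB", 2 * MIB), ("16 MB", 16 * MIB)]:
    print(f"{label:>5} blocks per GiB: {blocks_needed(GIB, block):,}")
# 4 KB blocks per GiB: 262,144
# 2 MB blocks per GiB: 512
# 16 MB blocks per GiB: 64
```

Moving one gibibyte takes 512 times fewer operations at 2 MB than at 4 KB, and 4,096 times fewer at 16 MB, which is the per-request overhead the larger blocks are meant to avoid.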

B&F view of Nvidia's CMX scheme

MinIO wants us to realize that "every storage vendor" announcing "context memory support" is doing one of two things: extending a local NVMe offering (G3) that can't be shared across the cluster, or adapting a general-purpose, shared storage platform (G4) into the inference data path. Neither was designed for this job, it argues, whereas MemKV is built from scratch to occupy G3.5, the tier between the two.

It emphasizes that, when "legacy storage vendors claim G3.5 support, data still flows through the same protocol nodes, metadata services, and file system translation layers they've always had. Those layers exist to provide enterprise durability, ACID consistency, and erasure coding. That's exactly right for training data and model weights, but not for KV cache, which is ephemeral, recomputable, and needs to move in 2-16 MB inference-optimized blocks – not the 4 KB blocks every legacy storage system was architected around."

GPU-powered, hardware RAID supplier GRAID also has an STX-supporting KV cache offering, as does WEKA. A large group of storage suppliers has also signed up to support Nvidia's STX architecture, including Cloudian, Dell, DDN, Everpure, Hammerspace, Hitachi Vantara, HPE, Lightbits/ScaleFlux, NetApp, Nutanix, Peak:AIO, Pliops, and VAST Data.