2026-05-19 01:23 UTC · Updated: 2026-06-30 13:03 UTC

SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference

SuperInfer is a high-performance LLM inference system designed for emerging superchips (e.g., NVIDIA GH200). It introduces RotaSched, a proactive SLO-aware rotary scheduler, and DuplexKV, a full-duplex memory engine, achieving up to 74.7% higher TTFT SLO attainment while maintaining comparable TBT and throughput.

Source: Hacker News AI · Author: matt_d

SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips


News

2026-01-26: SuperInfer has been accepted at MLSys 2026! 🎉


Abstract


Large Language Model (LLM) serving faces a fundamental tension between stringent latency Service Level Objectives (SLOs) and limited GPU memory capacity. When high request rates exhaust the KV cache budget, existing LLM inference systems often suffer severe head-of-line (HOL) blocking. While prior work explored PCIe-based offloading, these approaches cannot sustain responsiveness under high request rates, often failing to meet tight Time-To-First-Token (TTFT) and Time-Between-Tokens (TBT) SLOs.

To address these issues, we present SuperInfer, a high-performance LLM inference system designed for emerging Superchips (e.g., NVIDIA GH200) with a tightly coupled GPU-CPU architecture connected via NVLink-C2C. SuperInfer introduces (1) RotaSched, the first proactive, SLO-aware rotary scheduler that rotates requests to maintain responsiveness on Superchips, and (2) DuplexKV, an optimized rotation engine that enables full-duplex transfer over NVLink-C2C.

Evaluations on GH200 using various models and datasets show that SuperInfer improves TTFT SLO attainment rates by up to 74.7% while maintaining comparable TBT and throughput compared to state-of-the-art systems, demonstrating that SLO-aware scheduling and memory co-design unlocks the full potential of Superchips for responsive LLM serving.

Background

The Memory Wall: During autoregressive generation, each request maintains a growing KV cache that quickly exhausts GPU memory under high loads, leading to SLO violations.

The Interconnect Bottleneck: Existing KV offloading systems are crippled by slow PCIe bandwidth (~32-64 GB/s), causing severe head-of-line (HOL) blocking and SLO violations.


Increasing swap bandwidth beyond the PCIe Gen5 x16 unidirectional limit significantly reduces both TTFT and TBT.

PCIe's low swap bandwidth creates two major obstacles to reducing tail latencies: request backlogging and HOL blocking.

Superchip Opportunity: Emerging tightly coupled CPU-GPU Superchips provide high-speed CPU-GPU interconnects that break the PCIe bottleneck. For example, the NVIDIA GH200 integrates a Hopper GPU and a Grace CPU via NVLink-C2C with 900 GB/s of interconnect bandwidth.
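To give a rough sense of scale, the sketch below estimates how long it takes to swap one request's KV cache at the bandwidths quoted above. The model shape (32 layers, 8 KV heads, head dimension 128, fp16, matching LLaMA-3-8B) and the 4096-token request are assumptions chosen for illustration, and the function names are hypothetical. The figures are peak-bandwidth estimates, not measurements from the paper.

```python
# Assumed model shape (LLaMA-3-8B): 32 layers, 8 KV heads, head dim 128, fp16.
LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE = 32, 8, 128, 2


def kv_bytes_per_token():
    # The factor 2 covers the separate key and value tensors in every layer.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def swap_time_ms(num_tokens, bandwidth_gb_per_s):
    # Ideal transfer time at peak bandwidth, ignoring per-copy overheads.
    total_bytes = num_tokens * kv_bytes_per_token()
    return total_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


# Bandwidths taken from the text above: PCIe at 32-64 GB/s, NVLink-C2C at 900 GB/s.
for name, bandwidth in [("PCIe, 32 GB/s", 32),
                        ("PCIe, 64 GB/s", 64),
                        ("NVLink-C2C, 900 GB/s", 900)]:
    print(f"{name}: {swap_time_ms(4096, bandwidth):.2f} ms")
```

Under these assumptions a 4096-token request holds 512 MiB of KV cache, which takes roughly 17 ms to move at 32 GB/s and well under 1 ms at 900 GB/s. Real transfers would only approach these numbers if the copies are large enough to saturate the link.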

Software Bottlenecks: Existing serving stacks fall short on two fronts:

SLO-unaware

React to memory pressure, not latency urgency. Static Waiting-First / Swapped-First policies bias one SLO (TTFT or TBT) at the expense of the other.

Under-utilized C2C

Exploit < 5% of NVLink-C2C bandwidth — PagedAttention fragments KV cache into tiny pieces.
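The fragmentation problem can be made concrete with a little arithmetic. The sketch below counts the separate contiguous chunks that would need to be copied to swap one request, assuming a 16-token block, a LLaMA-3-8B-like shape (32 layers, 8 KV heads, head dimension 128, fp16), and a layout in which each layer stores the K and V data of a block as separate contiguous chunks. All of these are illustrative assumptions rather than details stated on this page, and the function names are hypothetical.

```python
# Assumptions for illustration: 16-token blocks, LLaMA-3-8B-like shape, fp16.
BLOCK_TOKENS = 16
LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE = 32, 8, 128, 2


def chunk_bytes():
    # Size of one contiguous piece: K (or V) of one block in one layer.
    return BLOCK_TOKENS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def chunks_for_request(num_tokens):
    # Ceiling division gives the number of blocks the request occupies.
    blocks = -(-num_tokens // BLOCK_TOKENS)
    # Each block contributes one K chunk and one V chunk per layer.
    return blocks * LAYERS * 2


print(chunk_bytes())             # bytes per contiguous chunk
print(chunks_for_request(4096))  # number of separate copies for one request
```

With these numbers a 4096-token request is scattered across 16384 chunks of 32 KiB each. Per-copy launch overhead then dominates transfer time, which is one plausible reading of why a fast link ends up mostly idle.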

SuperInfer Design

RotaSched: Proactive Rotation

Rotates requests between running (HBM) and a novel transient rotary (DRAM) state by latency urgency — OS-style time-slicing for LLM serving.

Virtual Lag Time (VLT): Per-request metric of deviation from TTFT/TBT SLOs, defined for running / rotary / waiting requests — the scheduling currency that drives RotaSched.

Largest-VLT-First (LVF): Prioritizes requests with the largest positive VLT — those most vulnerable to SLO violation — while preempting long-running requests with smaller VLT to the rotary state to free HBM.
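A minimal sketch of how VLT and LVF could fit together. This page does not give the VLT formula, so the definition used here is an assumption: a request's lag is the current time minus the deadline of its next token, where that deadline is arrival time + TTFT SLO + tokens emitted × TBT SLO. The `Request` fields, `virtual_lag`, and `lvf_schedule` are hypothetical names, and HBM capacity is modeled as a simple block count.

```python
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    arrival: float    # arrival time in seconds
    tokens_done: int  # output tokens emitted so far (0 while waiting)
    kv_blocks: int    # HBM blocks the request needs in order to run
    state: str        # "waiting", "running", or "rotary"


def virtual_lag(req, now, ttft_slo, tbt_slo):
    # Assumed definition: positive means the next token is already overdue,
    # negative means the request is ahead of its SLO schedule.
    deadline = req.arrival + ttft_slo + req.tokens_done * tbt_slo
    return now - deadline


def lvf_schedule(requests, now, hbm_blocks, ttft_slo, tbt_slo):
    # Largest-VLT-First: rank by lag, then admit greedily while HBM blocks remain.
    ranked = sorted(requests,
                    key=lambda r: virtual_lag(r, now, ttft_slo, tbt_slo),
                    reverse=True)
    free = hbm_blocks
    plan = {}
    for req in ranked:
        if req.kv_blocks <= free:
            free -= req.kv_blocks
            plan[req.rid] = "running"
        elif req.state == "waiting":
            plan[req.rid] = "waiting"  # never started, so nothing to rotate out
        else:
            plan[req.rid] = "rotary"   # KV cache is held in DRAM instead of HBM
    return plan


requests = [
    Request("A", arrival=0.0, tokens_done=200, kv_blocks=6, state="running"),
    Request("B", arrival=9.5, tokens_done=0, kv_blocks=4, state="waiting"),
    Request("C", arrival=2.0, tokens_done=50, kv_blocks=4, state="rotary"),
]
plan = lvf_schedule(requests, now=10.0, hbm_blocks=8, ttft_slo=1.0, tbt_slo=0.1)
print(plan)
```

In this example request C is already behind schedule and is brought back into HBM first, the newly arrived request B is admitted next, and the long-running request A, which is far ahead of its token schedule, is rotated out to DRAM to make room.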

DuplexKV: Full-Duplex Engine

Eager Block Rotation: The KV cache fills its blocks incrementally; fully written blocks are eagerly copied to DRAM in the background and marked "synced". On preemption, synced HBM blocks are simply discarded, which breaks the swap-in/swap-out data race and enables full-duplex transfers.
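The bookkeeping behind eager block rotation can be sketched as a toy model. Here plain Python lists stand in for HBM and DRAM blocks, and a synchronous list copy stands in for the background transfer. The class name `EagerKVCache` and its methods are hypothetical, and the sketch covers a request only up to its preemption.

```python
class EagerKVCache:
    """Toy model: full blocks are mirrored to DRAM as soon as they fill."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.hbm = []        # list of blocks, each a list of per-token entries
        self.dram = {}       # block index -> copy of that block
        self.synced = set()  # indices of full blocks already mirrored in DRAM

    def append(self, token_kv):
        # Open a new block when there is none or the last one is full.
        if not self.hbm or len(self.hbm[-1]) == self.block_size:
            self.hbm.append([])
        self.hbm[-1].append(token_kv)
        # A block that just became full is copied out eagerly.
        if len(self.hbm[-1]) == self.block_size:
            index = len(self.hbm) - 1
            self.dram[index] = list(self.hbm[index])
            self.synced.add(index)

    def preempt(self):
        # Only unsynced blocks must be copied at preemption time;
        # synced blocks are dropped from HBM without any transfer.
        copied = 0
        for index, block in enumerate(self.hbm):
            if index not in self.synced:
                self.dram[index] = list(block)
                copied += 1
        self.hbm = []
        return copied


cache = EagerKVCache(block_size=4)
for token in range(10):
    cache.append(token)
print(cache.preempt())  # blocks copied on the preemption path
```

With 10 tokens and 4-token blocks, two blocks are already synced when preemption happens, so only the partially filled third block is copied. Because swap-out traffic at preemption time is limited to that tail, it no longer has to be serialized against swap-in traffic for the requests being resumed.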

Block-first Layout + Batched DMA: Targets the under-utilized NVLink-C2C bandwidth noted under Software Bottlenecks, where PagedAttention fragments the KV cache into tiny pieces.

Evaluation Results

The system was evaluated on a GH200 Superchip (144 GB HBM, 480 GB DRAM) using models such as LLaMA-3-8B, Qwen2.5-32B, and Mixtral-8x7B, against baselines such as vLLM, TensorRT-LLM, LightLLM, LTR, and NEO.

Versatility: SuperInfer demonstrates significant performance improvements on both Dense and MoE models.

SLO Attainment: Up to 74.7% higher TTFT SLO attainment than vLLM at high request rates, with comparable TBT SLO attainment.

Integration: SuperInfer is built on top of vLLM, a widely used open-source inference engine.

BibTeX

@misc{yu2026superinfer,
  title={SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips},
  author={Jiahuan Yu and Mingtao Hu and Zichao Lin and Minjia Zhang},
  year={2026},
  eprint={2601.20309},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2601.20309},
}