SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference
SuperInfer is a high-performance LLM inference system designed for emerging superchips (e.g., NVIDIA GH200). It introduces RotaSched, a proactive SLO-aware rotary scheduler, and DuplexKV, a full-duplex memory engine, achieving up to 74.7% higher TTFT SLO attainment while maintaining comparable TBT and throughput.
Article intelligence
Key points
- Proposes RotaSched, the first proactive SLO-aware rotary scheduler that rotates requests between HBM and DRAM based on latency urgency.
- DuplexKV engine enables full-duplex KV cache transfer over NVLink-C2C, overcoming PCIe bandwidth limitations.
- Evaluated on GH200, showing up to 74.7% improvement in TTFT SLO attainment over state-of-the-art systems like vLLM.
Why it matters
RotaSched is presented as the first proactive SLO-aware rotary scheduler: it rotates requests between HBM and DRAM by latency urgency, so serving can stay responsive when the KV cache budget is exhausted.
Technical impact
May affect model selection, inference cost, product capability, and evaluation benchmarks.
SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips
News
2026-01-26: SuperInfer has been accepted at MLSys 2026! 🎉
Abstract
Large Language Model (LLM) serving faces a fundamental tension between stringent latency Service Level Objectives (SLOs) and limited GPU memory capacity. When high request rates exhaust the KV cache budget, existing LLM inference systems often suffer severe head-of-line (HOL) blocking. While prior work explored PCIe-based offloading, these approaches cannot sustain responsiveness under high request rates, often failing to meet tight Time-To-First-Token (TTFT) and Time-Between-Tokens (TBT) SLOs.
To address these issues, we present SuperInfer, a high-performance LLM inference system designed for emerging Superchips (e.g., NVIDIA GH200) that tightly couple the GPU and CPU via NVLink-C2C. SuperInfer introduces (1) RotaSched, the first proactive, SLO-aware rotary scheduler that rotates requests to maintain responsiveness on Superchips, and (2) DuplexKV, an optimized rotation engine that enables full-duplex transfer over NVLink-C2C.
Evaluations on GH200 using various models and datasets show that SuperInfer improves TTFT SLO attainment rates by up to 74.7% while maintaining comparable TBT and throughput compared to state-of-the-art systems, demonstrating that SLO-aware scheduling and memory co-design unlocks the full potential of Superchips for responsive LLM serving.
Background
The Memory Wall: During autoregressive generation, each request maintains a growing KV cache that quickly exhausts GPU memory under high loads, leading to SLO violations.
The Interconnect Bottleneck: Existing KV offloading systems are crippled by slow PCIe bandwidth (~32-64 GB/s), causing severe head-of-line (HOL) blocking and SLO violations.
Increasing swap bandwidth beyond the PCIe Gen5 x16 unidirectional limit significantly reduces both TTFT and TBT.
PCIe's low swap bandwidth creates two major obstacles to reducing tail latencies: request backlogging and HOL blocking.
Superchip Opportunity: Emerging tightly coupled CPU-GPU Superchips provide high-speed CPU-GPU interconnects that break the PCIe bottleneck. For example, NVIDIA GH200 integrates a Hopper GPU and a Grace CPU via NVLink-C2C with 900 GB/s of interconnect bandwidth.
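To make the bandwidth gap concrete, the following back-of-the-envelope sketch estimates how long it takes to move one request's KV cache at the bandwidths quoted above. The model shape (32 layers, 8 KV heads, head dimension 128, fp16) is an assumption matching a LLaMA-3-8B-style configuration, and the function names are hypothetical. It ignores per-transfer overheads and treats each figure as fully usable in one direction, so the results are optimistic lower bounds rather than measurements.

```python
# Rough swap-time estimate for one request's KV cache.
# Shape values are assumptions for a LLaMA-3-8B-style model in fp16.
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 2  # fp16


def kv_bytes_per_token() -> int:
    """KV cache bytes for one token; the factor 2 covers keys and values."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM


def swap_time_ms(num_tokens: int, bandwidth_gb_s: float) -> float:
    """Ideal transfer time in milliseconds, ignoring launch overheads."""
    total_bytes = num_tokens * kv_bytes_per_token()
    return total_bytes / (bandwidth_gb_s * 1e9) * 1e3


if __name__ == "__main__":
    tokens = 8192  # hypothetical context length
    links = [("PCIe, 32 GB/s", 32.0), ("PCIe, 64 GB/s", 64.0),
             ("NVLink-C2C, 900 GB/s", 900.0)]
    for name, bw in links:
        print(f"{name}: {swap_time_ms(tokens, bw):.2f} ms")
```

Under these assumptions an 8192-token request holds about 1 GiB of KV cache, so the swap cost drops from tens of milliseconds over PCIe to roughly a millisecond over NVLink-C2C, which is what makes frequent rotation plausible.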
Software Bottlenecks: Existing serving stacks fall short on two fronts:
SLO-unaware
React to memory pressure, not latency urgency. Static Waiting-First / Swapped-First policies bias one SLO (TTFT or TBT) at the expense of the other.
Under-utilized C2C
Exploit < 5% of NVLink-C2C bandwidth — PagedAttention fragments KV cache into tiny pieces.
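One way to see why fragmentation hurts is a simple cost model in which every copy pays a fixed launch overhead before any bytes move. The sketch below is illustrative only: the 5 µs overhead and the chunk sizes are assumed values, not figures from the paper, and `effective_bandwidth_gb_s` is a hypothetical helper.

```python
def effective_bandwidth_gb_s(chunk_bytes: int, peak_gb_s: float,
                             overhead_us: float) -> float:
    """Achieved bandwidth when each copy pays a fixed overhead.

    overhead_us is an assumed per-copy launch cost, not a measured value.
    """
    transfer_s = chunk_bytes / (peak_gb_s * 1e9)
    total_s = overhead_us * 1e-6 + transfer_s
    return chunk_bytes / total_s / 1e9


if __name__ == "__main__":
    peak = 900.0  # GB/s, the NVLink-C2C figure quoted in the text
    for label, size in [("4 KiB", 4 * 1024), ("1 MiB", 1024 ** 2),
                        ("64 MiB", 64 * 1024 ** 2)]:
        bw = effective_bandwidth_gb_s(size, peak, overhead_us=5.0)
        print(f"{label} copies: {bw:.1f} GB/s ({bw / peak:.1%} of peak)")
```

In this model, tiny copies are dominated by the fixed overhead and reach only a small fraction of peak, while large copies approach it. That is consistent with the motivation for moving the KV cache in larger, batched transfers.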
SuperInfer Design
RotaSched: Proactive Rotation
Rotates requests between running (HBM) and a novel transient rotary (DRAM) state by latency urgency — OS-style time-slicing for LLM serving.
Virtual Lag Time (VLT): Per-request metric of deviation from TTFT/TBT SLOs, defined for running / rotary / waiting requests — the scheduling currency that drives RotaSched.
Largest-VLT-First (LVF): Prioritizes requests with the largest positive VLT — those most vulnerable to SLO violation — while preempting long-running requests with smaller VLT to the rotary state to free HBM.
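The scheduling loop can be sketched in a few lines. This page does not give the exact VLT formula, so the sketch assumes a simple stand-in: the time elapsed since the request's last event (arrival for waiting requests, last emitted token otherwise) minus the relevant SLO budget. All names (`Request`, `virtual_lag`, `schedule_lvf`) are hypothetical, and the real system may differ substantially.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    WAITING = "waiting"   # not yet prefilled
    RUNNING = "running"   # KV cache resident in HBM
    ROTARY = "rotary"     # KV cache parked in DRAM


@dataclass
class Request:
    rid: str
    state: State
    blocks: int            # KV blocks the request needs in HBM
    last_event_s: float    # arrival time if WAITING, else last token time
    ttft_slo_s: float = 1.0
    tbt_slo_s: float = 0.1


def virtual_lag(req: Request, now_s: float) -> float:
    """Assumed VLT: elapsed time since the last event minus the SLO budget."""
    slo = req.ttft_slo_s if req.state is State.WAITING else req.tbt_slo_s
    return (now_s - req.last_event_s) - slo


def schedule_lvf(requests: list, hbm_blocks: int, now_s: float) -> list:
    """Admit requests in descending VLT order until HBM is full.

    Running requests that no longer fit are moved to the rotary state.
    """
    ranked = sorted(requests, key=lambda r: virtual_lag(r, now_s), reverse=True)
    free = hbm_blocks
    for req in ranked:
        if req.blocks <= free:
            free -= req.blocks
            req.state = State.RUNNING
        elif req.state is State.RUNNING:
            req.state = State.ROTARY  # preempted to free HBM
    return ranked
```

With 8 free blocks and three 4-block requests, a waiting request that is already past its TTFT budget and a rotary request past its TBT budget are both admitted, while a running request that has just emitted a token (the smallest VLT) is rotated out to DRAM.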
DuplexKV: Full-Duplex Engine
Eager Block Rotation: KV cache fills its blocks incrementally; fully-written blocks are eagerly copied to DRAM in the background and marked "synced". On preemption, synced HBM blocks are simply discarded → swap-in/out data race broken, full-duplex transfers enabled.
Block-first Layout + Batched DMA: Targets the under-utilized C2C bandwidth described above, where PagedAttention fragments the KV cache into tiny pieces, by laying out the KV cache block-first and batching DMA transfers.
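The bookkeeping behind eager block rotation can be illustrated with a small simulation. This is a minimal sketch of the idea as described above, not the DuplexKV implementation: the classes, the 16-token block size, and the copy counter are all hypothetical, and no real memory is moved.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    capacity: int
    filled: int = 0
    synced: bool = False  # True once a DRAM copy of this block exists

    @property
    def full(self) -> bool:
        return self.filled == self.capacity


@dataclass
class RequestKV:
    block_tokens: int = 16          # assumed tokens per KV block
    blocks: list = field(default_factory=list)
    copies_to_dram: int = 0         # counts simulated HBM-to-DRAM copies

    def append_token(self) -> None:
        """Write one token's KV entry, opening a new block when needed."""
        if not self.blocks or self.blocks[-1].full:
            self.blocks.append(Block(self.block_tokens))
        self.blocks[-1].filled += 1

    def background_sync(self) -> None:
        """Eagerly copy fully written, not-yet-synced blocks to DRAM."""
        for block in self.blocks:
            if block.full and not block.synced:
                block.synced = True
                self.copies_to_dram += 1

    def preempt(self) -> tuple:
        """Swap out: copy only unsynced blocks, then release all HBM blocks.

        Returns (blocks copied at preemption time, HBM blocks freed).
        """
        pending = [b for b in self.blocks if not b.synced]
        for block in pending:
            block.synced = True
        self.copies_to_dram += len(pending)
        return len(pending), len(self.blocks)
```

For a request with 40 tokens and 16-token blocks, two full blocks are synced in the background, so preemption only has to copy the one partially filled block before all three HBM blocks can be released. Keeping the swap-out critical path this short is what leaves the link free for concurrent swap-in traffic.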
Evaluation Results
The system was evaluated on a GH200 Superchip (144 GB HBM, 480 GB DRAM) using models including LLaMA-3-8B, Qwen2.5-32B, and Mixtral-8x7B against baselines such as vLLM, TensorRT-LLM, LightLLM, LTR, and NEO.
Versatility: SuperInfer demonstrates significant performance improvements on both Dense and MoE models.
SLO Attainment: Up to 74.7% higher TTFT SLO attainment than vLLM at high request rates, with comparable TBT SLO attainment.
Integration: SuperInfer is built on top of vLLM, a widely used open-source inference engine.
BibTeX
@misc{yu2026superinfer,
  title={SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips},
  author={Jiahuan Yu and Mingtao Hu and Zichao Lin and Minjia Zhang},
  year={2026},
  eprint={2601.20309},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2601.20309},
}