2026-06-09 (updated 2026-06-09)

Enabling KV Caching of Shared Prefix for Diffusion Language Models

In diffusion language models (DLMs), bidirectional attention breaks existing KV caching methods: applying them causes accuracy to collapse to near zero. The proposed bidirectional prefix caching technique, bicache, dynamically identifies a safe layer depth for reusing shared prefix KVs, improving throughput by 36.3%-98.3% with only a 0-1.8% accuracy difference.

Source: arXiv Machine Learning
Authors: Younghun Go, Jaehoon Han, Changyong Shin, Chuk Yoo, Gyeongsik Yang

[Submitted on 26 May 2026]

Abstract: Key-value (KV) caching for shared prefixes is essential for high-throughput large language model (LLM) serving, but it faces critical challenges in emerging diffusion language models (DLMs). In DLMs, bidirectional attention means that updating any token dynamically alters the entire context and its corresponding KVs. Thus, existing caching techniques developed for LLMs, which assume that KVs remain invariant once computed, corrupt the shared prefix KVs. Our experiments show that applying these techniques to DLMs causes model accuracy to collapse to near zero.
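The invariance assumption can be illustrated with a toy experiment. The sketch below is not the paper's setup: it uses a single attention head with identity Q/K/V projections and random token vectors, and the helper names (`attention`, `max_prefix_diff`, `rand_tokens`) are made up for this example. It feeds the same prefix followed by two different suffixes through one attention layer. Under a causal mask the prefix outputs are identical, so the KVs that the next layer derives from them can be cached. Under bidirectional attention the prefix outputs change with the suffix, so deeper-layer prefix KVs are no longer invariant.

```python
import math
import random


def attention(x, causal):
    """Single-head self-attention with identity Q/K/V projections (toy setup)."""
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        # A causal mask lets position i see only positions 0..i;
        # bidirectional attention lets it see every position.
        visible = range(i + 1) if causal else range(n)
        scores = [
            sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
            for j in visible
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w / total * x[j][k] for w, j in zip(weights, visible))
            for k in range(d)
        ])
    return out


def max_prefix_diff(a, b, prefix_len):
    """Largest absolute difference between two outputs over the prefix rows."""
    return max(
        abs(u - v)
        for row_a, row_b in zip(a[:prefix_len], b[:prefix_len])
        for u, v in zip(row_a, row_b)
    )


rng = random.Random(0)


def rand_tokens(n, d=4):
    return [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]


prefix = rand_tokens(3)
suffix_a = rand_tokens(2)
suffix_b = rand_tokens(2)

# Same prefix, different suffixes, compared under both masking schemes.
causal_diff = max_prefix_diff(
    attention(prefix + suffix_a, causal=True),
    attention(prefix + suffix_b, causal=True),
    len(prefix),
)
bidir_diff = max_prefix_diff(
    attention(prefix + suffix_a, causal=False),
    attention(prefix + suffix_b, causal=False),
    len(prefix),
)
print("causal prefix difference:", causal_diff)
print("bidirectional prefix difference:", bidir_diff)
```

In this toy, the first layer's prefix KVs depend only on the prefix tokens themselves, and the divergence appears in what later layers receive as input.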

To unlock high-throughput DLM serving, we propose bidirectional prefix caching, bicache, the first KV caching technique for shared prefixes in DLMs. bicache is designed based on key observations from our comprehensive analysis: shared prefix KVs remain stable and reusable in shallow layers, while the depth of shallow layers depends on the fraction of shared prefix tokens in each request. Thus, bicache dynamically identifies a safe layer depth for reusing shared prefix KVs and eliminates redundant computation. Evaluations demonstrate that bicache significantly improves serving throughput by 36.3%-98.3% compared to existing techniques without accuracy collapse (only 0-1.8% difference).
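As a rough illustration of depth-gated reuse, the sketch below maps a request's shared-prefix fraction to a layer depth and marks each layer as either reusing cached prefix KVs or recomputing them. Everything here is an assumption made for illustration: the abstract does not say how bicache selects the depth, the function names (`safe_reuse_depth`, `layer_plan`) are hypothetical, the table values are placeholders rather than results from the paper, and the choice that a larger shared fraction permits deeper reuse is a guess, not a reported finding.

```python
from bisect import bisect_right

# Hypothetical calibration table: (minimum shared-prefix fraction, reusable depth).
# Placeholder numbers for illustration only; not taken from the paper.
DEPTH_PROFILE = [(0.0, 0), (0.25, 4), (0.5, 8), (0.75, 16)]


def safe_reuse_depth(prefix_len, total_len, profile=DEPTH_PROFILE):
    """Return how many shallow layers may reuse cached prefix KVs."""
    if total_len <= 0 or not 0 <= prefix_len <= total_len:
        raise ValueError("expected 0 <= prefix_len <= total_len and total_len > 0")
    fraction = prefix_len / total_len
    thresholds = [threshold for threshold, _ in profile]
    # Pick the last table row whose threshold does not exceed the fraction.
    row = bisect_right(thresholds, fraction) - 1
    return profile[row][1]


def layer_plan(prefix_len, total_len, num_layers):
    """Label each layer 'reuse' (cached prefix KVs) or 'recompute'."""
    depth = min(safe_reuse_depth(prefix_len, total_len), num_layers)
    return ["reuse" if layer < depth else "recompute" for layer in range(num_layers)]


# Example: a 600-token shared prefix in a 1000-token request on a 32-layer model.
plan = layer_plan(prefix_len=600, total_len=1000, num_layers=32)
print(plan.count("reuse"), "layers reuse cached prefix KVs")
```

A real system would need to derive such a table from measurements of prefix KV stability per layer; the lookup here only shows the shape of a per-request decision.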

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Cite as: arXiv:2606.07571 [cs.LG]

(or arXiv:2606.07571v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2606.07571

arXiv-issued DOI via DataCite

Submission history

From: Jaehoon Han [view email] [v1] Tue, 26 May 2026 05:27:57 UTC (428 KB)
