2026-06-02 04:00 UTC | Updated: 2026-06-30 13:03 UTC

ART: Attention Run-time Termination for Efficient Large Language Model Decoding

Long-context decoding in LLMs is constrained by the memory bandwidth needed to fetch the KV cache. Most existing methods prune on keys alone before decoding, ignoring that attention outputs depend jointly on keys and values. ART is a lightweight run-time mechanism that tracks accumulated attention outputs and terminates KV block accesses once further contributions become negligible. It is orthogonal to existing key-based methods and achieves 20% higher generation throughput at large batch sizes on LongBench than the state-of-the-art baseline, with comparable accuracy.

Source: arXiv Computational Linguistics
Authors: Chen Qiu, Guozhong Li, Panos Kalnis

[Submitted on 15 Apr 2026]

Abstract: Long-context decoding in Large Language Models (LLMs) is severely constrained by the memory bandwidth required to fetch the large Key-Value (KV) cache. Most existing KV management methods rely on key-only pruning before decoding, even though there is evidence that attention outputs depend jointly on keys and values, because incorporating values into these methods incurs prohibitive additional overhead. In this paper, we propose Attention Run-time Termination (ART), a lightweight run-time mechanism that tracks accumulated attention outputs during kernel execution and terminates subsequent KV block accesses once further contributions become negligible. This design makes ART orthogonal to existing key-based KV cache management methods, enabling seamless integration with them. Experiments on LongBench benchmarks show that ART achieves 20% higher generation throughput at large batch sizes than the state-of-the-art baseline while maintaining comparable accuracy.
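
The abstract does not spell out ART's termination rule, so the following is only a minimal sketch of the general idea, not the authors' kernel. It computes single-query attention block by block with an online softmax, keeps the accumulated output, and stops reading further KV blocks once the newest block moves the normalized output by less than a tolerance. The function name `blockwise_attention`, the parameter `tol`, the max-absolute-change test, and the front-to-back block order are all assumptions made for illustration. The test is a heuristic, not a guaranteed error bound.

```python
import math


def blockwise_attention(q, keys, values, block_size=4, tol=None):
    """Single-query attention over KV blocks using an online softmax.

    Hypothetical early-exit rule (an assumption, not taken from the paper):
    if `tol` is set, stop reading blocks once the latest block changes the
    normalized output by less than `tol` in every coordinate.
    Returns (output_vector, number_of_blocks_read).
    """
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf             # largest score seen so far
    denom = 0.0                         # running softmax denominator
    acc = [0.0] * len(values[0])        # running unnormalized output
    prev_out = None
    blocks_read = 0

    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]

        # Rescale previous partial sums to the new running maximum.
        new_max = max(running_max, max(scores))
        corr = math.exp(running_max - new_max)   # 0.0 on the first block
        denom *= corr
        acc = [a * corr for a in acc]

        # Fold this block's keys AND values into the accumulators.
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * vi for a, vi in zip(acc, v)]
        running_max = new_max
        blocks_read += 1

        out = [a / denom for a in acc]
        if tol is not None and prev_out is not None:
            delta = max(abs(o - p) for o, p in zip(out, prev_out))
            if delta < tol:
                break                   # skip the remaining KV blocks
        prev_out = out

    return [a / denom for a in acc], blocks_read


# Toy data: the first block dominates the attention mass.
q = [1.0, 0.0]
keys = [[10.0, 0.0]] * 4 + [[-10.0, 0.0]] * 12
values = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 12

full_out, full_blocks = blockwise_attention(q, keys, values)
early_out, early_blocks = blockwise_attention(q, keys, values, tol=1e-3)
print(full_blocks, early_blocks)
```

Because the test is applied to the accumulated output rather than to the scores alone, a block whose values already match the running output contributes little even when its keys score highly, which is the joint key-value dependence the abstract points to. How blocks are ordered is left to whatever key-based selection method runs beforehand.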

Subjects: Computation and Language (cs.CL)

Cite as: arXiv:2606.00024 [cs.CL]

(or arXiv:2606.00024v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2606.00024

arXiv-issued DOI via DataCite

Submission history

From: Chen Qiu
[v1] Wed, 15 Apr 2026 06:55:14 UTC (732 KB)
