2026-05-18 04:00 UTC | Updated: 2026-06-30 13:03 UTC

GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding

Researchers propose Group-Query Latent Attention (GQLA), a modification of DeepSeek's Multi-head Latent Attention that provides two hardware-adaptive decoding paths without retraining. This approach enables efficient inference on both H100 and H20 GPUs, and includes TransGQLA for converting pretrained GQA models.

Source: arXiv Machine Learning | Author: Fanxu Meng


[Submitted on 14 May 2026]


Abstract: Multi-head Latent Attention (MLA), the attention mechanism used in DeepSeek-V2/V3, jointly compresses keys and values into a low-rank latent and matches the H100 roofline almost perfectly. Its trained weights, however, expose only one decoding path, an absorbed MQA form, which ties efficient inference to H100-class compute-bandwidth ratios, forfeits tensor parallelism along the head axis, and yields no Multi-Token Prediction (MTP) gain on commodity inference GPUs such as the export-restricted H20. We propose Group-Query Latent Attention (GQLA), a minimal modification of MLA whose trained weights expose two algebraically equivalent decoding paths over the same parameters: an MQA-absorb path identical to MLA's, and a GQA path with a per-group expanded cache. The runtime picks the path that matches the target hardware, with no retraining and no custom kernels, so a single set of GQLA weights pins the rooflines of both H100 (MQA-absorb, s_q=1) and H20 (GQA + MTP, s_q=2), while supporting up to 8-way zero-redundancy tensor parallelism on the GQA path. To avoid pretraining from scratch, we extend TransMLA into TransGQLA, which converts a pretrained GQA checkpoint into a GQLA model; on LLaMA-3-8B it compresses the per-token KV cache to 28.125% of the GQA baseline on the MQA-absorb path while structurally preserving GQA-level traffic on the per-group path.
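The 28.125% figure can be sanity-checked with a little arithmetic. The sketch below is illustrative only: it uses LLaMA-3-8B's public attention shape (8 KV heads, head dimension 128) and works out how many cached elements per token per layer the reported ratio implies. The abstract does not state the latent dimensions; the 512 + 64 split shown here is an assumption borrowed from DeepSeek-V2's MLA configuration, and the names `ASSUMED_LATENT_DIM` and `ASSUMED_ROPE_DIM` are hypothetical.

```python
from fractions import Fraction

# LLaMA-3-8B attention shape (from the public model config).
NUM_KV_HEADS = 8
HEAD_DIM = 128

# Ratio reported in the abstract for the MQA-absorb path: 28.125%.
REPORTED_RATIO = Fraction(28125, 100000)


def gqa_cache_elems(num_kv_heads: int, head_dim: int) -> int:
    """Elements cached per token per layer under GQA (keys plus values)."""
    return 2 * num_kv_heads * head_dim


gqa_elems = gqa_cache_elems(NUM_KV_HEADS, HEAD_DIM)

# Elements per token per layer implied by the reported ratio.
absorb_elems = REPORTED_RATIO * gqa_elems

# ASSUMPTION (not stated in the abstract): a DeepSeek-V2-style layout with a
# 512-dim shared latent plus a 64-dim decoupled RoPE key.
ASSUMED_LATENT_DIM = 512
ASSUMED_ROPE_DIM = 64

print("GQA baseline elements:", gqa_elems)
print("Implied MQA-absorb elements:", absorb_elems)
print(
    "Consistent with assumed 512 + 64 split:",
    ASSUMED_LATENT_DIM + ASSUMED_ROPE_DIM == absorb_elems,
)
```

Under these assumptions the ratio is exactly 9/32, so the implied cache size is a whole number of elements, which is at least consistent with a fixed-width latent replacing the per-head keys and values.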

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Cite as: arXiv:2605.15250 [cs.LG]

(or arXiv:2605.15250v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2605.15250

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Meng Fanxu
[v1] Thu, 14 May 2026 15:50:01 UTC (665 KB)
