AI News Hub

GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding

Researchers propose Group-Query Latent Attention (GQLA), a modification of DeepSeek's Multi-head Latent Attention that provides two hardware-adaptive decoding paths without retraining. This approach enables efficient inference on both H100 and H20 GPUs, and includes TransGQLA for converting pretrained GQA models.

Article intelligence

Engineers · Advanced

Key points

  • GQLA extends DeepSeek's MLA with dual decoding paths (MQA-absorb and GQA) to match different hardware rooflines.
  • A single set of GQLA weights can be used on H100 (MQA path) or H20 (GQA path with multi-token prediction).
  • TransGQLA converts pretrained GQA checkpoints to GQLA, compressing KV cache to 28.125% on LLaMA-3-8B.
  • Supports up to 8-way zero-redundancy tensor parallelism on the GQA path.
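The 28.125% figure can be reproduced with simple arithmetic. The sketch below uses the public LLaMA-3-8B attention shape (8 KV heads, head dimension 128) and assumes a DeepSeek-style latent of 512 compressed dimensions plus 64 decoupled RoPE dimensions. That split is an assumption made here for illustration; the summary does not state the latent size, only the resulting ratio.

```python
from fractions import Fraction

# LLaMA-3-8B attention shape (public model config).
N_KV_HEADS = 8
HEAD_DIM = 128

# Assumption (not stated in the summary): a DeepSeek-style latent of
# 512 compressed dimensions plus 64 decoupled RoPE dimensions.
LATENT_DIM = 512
ROPE_DIM = 64


def gqa_cache_per_token(n_kv_heads: int, head_dim: int) -> int:
    """Elements cached per token per layer by GQA: keys plus values."""
    return 2 * n_kv_heads * head_dim


def latent_cache_per_token(latent_dim: int, rope_dim: int) -> int:
    """Elements cached per token per layer on the absorbed (latent) path."""
    return latent_dim + rope_dim


baseline = gqa_cache_per_token(N_KV_HEADS, HEAD_DIM)
latent = latent_cache_per_token(LATENT_DIM, ROPE_DIM)
ratio = Fraction(latent, baseline)

print(baseline, latent, f"{float(ratio):.3%}")  # 2048 576 28.125%
```

Under these assumptions the ratio is exactly 576/2048 = 9/32, which matches the reported compression on the MQA-absorb path.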

Why it matters

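MLA's trained weights expose a single absorbed-MQA decoding path, which ties efficient inference to H100-class compute-to-bandwidth ratios, rules out tensor parallelism along the head axis, and yields no Multi-Token Prediction (MTP) gain on GPUs such as the H20. GQLA adds an algebraically equivalent GQA path over the same weights, so the runtime can pick whichever path fits the target hardware without retraining.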

Technical impact

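One GQLA checkpoint could serve both H100 and H20 deployments, with no retraining and no custom kernels. Converting an existing GQA model with TransGQLA is reported to cut the per-token KV cache to 28.125% of the GQA baseline on the MQA-absorb path (LLaMA-3-8B), which bears directly on inference cost and hardware selection.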

[2605.15250] GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding

[Submitted on 14 May 2026]

Author: Fanxu Meng

Abstract: Multi-head Latent Attention (MLA), the attention used in DeepSeek-V2/V3, jointly compresses keys and values into a low-rank latent and matches the H100 roofline almost perfectly. Its trained weights, however, expose only one decoding path, an absorbed MQA form, which ties efficient inference to H100-class compute-bandwidth ratios, forfeits tensor parallelism along the head axis, and yields no Multi-Token Prediction (MTP) gain on commodity inference GPUs such as the export-restricted H20. We propose Group-Query Latent Attention (GQLA), a minimal modification of MLA whose trained weights expose two algebraically equivalent decoding paths over the same parameters: an MQA-absorb path identical to MLA's, and a GQA path with a per-group expanded cache. The runtime picks the path that matches the target hardware, with no retraining and no custom kernels, so a single set of GQLA weights pins the rooflines of both H100 (MQA-absorb, s_q=1) and H20 (GQA + MTP, s_q=2), while supporting up to 8-way zero-redundancy tensor parallelism on the GQA path. To avoid pretraining from scratch, we extend TransMLA into TransGQLA, which converts a pretrained GQA checkpoint into a GQLA model; on LLaMA-3-8B it compresses the per-token KV cache to 28.125% of the GQA baseline on the MQA-absorb path while structurally preserving GQA-level traffic on the per-group path.
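The claim that two decoding paths can be algebraically equivalent over the same parameters rests on a standard identity: for a query q, an up-projection W and a cached latent c, q·(Wc) = (Wᵀq)·c. The toy sketch below checks that identity for attention scores with one key up-projection per group. It is an illustration only, not the paper's parameterization: it omits RoPE, values, softmax and scaling, and all sizes and names (`w_uk`, `latents`, `queries`) are hypothetical.

```python
import random

random.seed(0)

# Toy sizes, chosen only for illustration.
D_LATENT, D_HEAD = 6, 4
N_GROUPS, HEADS_PER_GROUP = 2, 3
N_TOKENS = 5


def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Shared latent cache: one D_LATENT vector per past token.
latents = rand_matrix(N_TOKENS, D_LATENT)
# One key up-projection per group, shape (D_HEAD, D_LATENT).
w_uk = [rand_matrix(D_HEAD, D_LATENT) for _ in range(N_GROUPS)]
# queries[g][h] is the D_HEAD query of head h in group g.
queries = [rand_matrix(HEADS_PER_GROUP, D_HEAD) for _ in range(N_GROUPS)]


def scores_gqa_path():
    """Expand latents into per-group keys, then score each head against them."""
    out = []
    for g in range(N_GROUPS):
        keys = [matvec(w_uk[g], c) for c in latents]  # per-group expanded cache
        for q in queries[g]:
            out.append([dot(q, k) for k in keys])
    return out


def scores_absorb_path():
    """Fold the up-projection into each query, then score against raw latents."""
    out = []
    for g in range(N_GROUPS):
        w_t = transpose(w_uk[g])  # shape (D_LATENT, D_HEAD)
        for q in queries[g]:
            q_latent = matvec(w_t, q)
            out.append([dot(q_latent, c) for c in latents])
    return out


gqa = scores_gqa_path()
absorb = scores_absorb_path()
max_gap = max(abs(a - b) for ra, rb in zip(gqa, absorb) for a, b in zip(ra, rb))
print(f"max score difference: {max_gap:.2e}")
```

The two paths differ only in what is read from memory per token: the GQA path reads one expanded key per group, while the absorbed path reads a single shared latent and does more compute per query. That trade-off is what lets a runtime match the path to a GPU's compute-to-bandwidth ratio.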


Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Cite as: arXiv:2605.15250 [cs.LG]

(or arXiv:2605.15250v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2605.15250

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Meng Fanxu [view email] [v1] Thu, 14 May 2026 15:50:01 UTC (665 KB)
