Hugging Face Daily Papers · May 18, 2026 · 4 min read

GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.15250

Published on May 14 · Submitted by mengfanxu on May 18 · Peking University

Authors: Fanxu Meng

Abstract

AI-generated summary: Group-Query Latent Attention (GQLA) enables efficient inference across different hardware by exposing multiple decoding paths from a single set of trained weights, supporting both high-performance and commodity GPUs without retraining.

Multi-head Latent Attention (MLA), the attention used in DeepSeek-V2/V3, jointly compresses keys and values into a low-rank latent and matches the H100 roofline almost perfectly. Its trained weights, however, expose only one decoding path, an absorbed MQA form, which ties efficient inference to H100-class compute-bandwidth ratios, forfeits tensor parallelism along the head axis, and yields no Multi-Token Prediction (MTP) gain on commodity inference GPUs such as the export-restricted H20. We propose Group-Query Latent Attention (GQLA), a minimal modification of MLA whose trained weights expose two algebraically equivalent decoding paths over the same parameters: an MQA-absorb path identical to MLA's, and a GQA path with a per-group expanded cache. The runtime picks the path that matches the target hardware, with no retraining and no custom kernels, so a single set of GQLA weights pins the rooflines of both H100 (MQA-absorb, s_q=1) and H20 (GQA + MTP, s_q=2), while supporting up to 8-way zero-redundancy tensor parallelism on the GQA path. To avoid pretraining from scratch we extend TransMLA into TransGQLA, which converts a pretrained GQA checkpoint into a GQLA model; on LLaMA-3-8B it compresses the per-token KV cache to 28.125% of the GQA baseline on the MQA-absorb path while structurally preserving GQA-level traffic on the per-group path.
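The claim that two decoding paths can be "algebraically equivalent ... over the same parameters" rests on a simple low-rank identity: a key up-projection can be applied either to every cached latent or, once, to the query. The sketch below checks that identity on toy data in plain Python. It is only an illustration under stated assumptions, not the paper's construction: the sizes, the names `w_up`, `scores_expanded`, and `scores_absorbed` are hypothetical, and RoPE, the value path, softmax, and GQLA's per-group structure are all omitted.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def scores_expanded(q, w_up, latents):
    # Expanded path: up-project every cached latent into a full key,
    # then take the usual query-key dot product.
    return [dot(q, matvec(w_up, c)) for c in latents]


def scores_absorbed(q, w_up, latents):
    # Absorbed path: fold the up-projection into the query once,
    # then score directly against the raw cached latents.
    q_absorbed = matvec(transpose(w_up), q)
    return [dot(q_absorbed, c) for c in latents]


# Hypothetical toy sizes, chosen only to keep the example small.
HEAD_DIM, LATENT_DIM, SEQ_LEN = 4, 3, 5
rng = random.Random(0)

w_up = [[rng.uniform(-1, 1) for _ in range(LATENT_DIM)] for _ in range(HEAD_DIM)]
q = [rng.uniform(-1, 1) for _ in range(HEAD_DIM)]
latents = [[rng.uniform(-1, 1) for _ in range(LATENT_DIM)] for _ in range(SEQ_LEN)]

expanded = scores_expanded(q, w_up, latents)
absorbed = scores_absorbed(q, w_up, latents)

# The two paths agree up to floating-point rounding.
max_gap = max(abs(a - b) for a, b in zip(expanded, absorbed))
print(f"largest score difference: {max_gap:.2e}")
```

Both functions use the same `w_up`, so the choice between them is a runtime decision rather than a training one: the expanded form stores and reads full-width keys, while the absorbed form reads only the narrower latent and moves the extra multiplications to the query side.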
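The 28.125% figure can be reproduced with a little arithmetic. The sketch below uses LLaMA-3-8B's public attention shape (32 layers, 8 KV heads, head dimension 128) and assumes, since the abstract does not spell it out, that the MQA-absorb cache has the same layout as DeepSeek-V2's MLA: a 512-dimensional joint latent plus a 64-dimensional shared RoPE key. That layout is an assumption that happens to match the stated ratio, not a detail confirmed by this page; the function names are hypothetical.

```python
# LLaMA-3-8B attention shape (public model configuration).
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128

# ASSUMPTION: latent layout borrowed from DeepSeek-V2's MLA.
# The abstract states only the resulting ratio, not these two numbers.
LATENT_DIM = 512
ROPE_DIM = 64


def gqa_cache_width(num_kv_heads, head_dim):
    """Elements cached per token per layer under GQA (keys plus values)."""
    return 2 * num_kv_heads * head_dim


def latent_cache_width(latent_dim, rope_dim):
    """Elements cached per token per layer: joint latent plus shared RoPE key."""
    return latent_dim + rope_dim


def cache_bytes_per_token(width, num_layers, bytes_per_elem=2):
    """Total cache bytes per token across all layers (2 bytes for FP16/BF16)."""
    return width * num_layers * bytes_per_elem


gqa_width = gqa_cache_width(NUM_KV_HEADS, HEAD_DIM)
latent_width = latent_cache_width(LATENT_DIM, ROPE_DIM)
ratio = latent_width / gqa_width

print(f"GQA width:    {gqa_width} elements/token/layer")
print(f"latent width: {latent_width} elements/token/layer")
print(f"ratio:        {ratio:.5%}")
print(f"GQA bytes/token:    {cache_bytes_per_token(gqa_width, NUM_LAYERS)}")
print(f"latent bytes/token: {cache_bytes_per_token(latent_width, NUM_LAYERS)}")
```

Under these assumptions the GQA baseline caches 2048 elements per token per layer and the latent form caches 576, and 576 / 2048 is exactly 0.28125.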

Community

fxmeng (paper submitter), May 18, 2026:

Efficient on both compute-constrained (H100) and bandwidth-constrained (H20) hardware.
