GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding
Fanxu Meng (Peking University). Published 2026-05-14. arXiv: 2605.15250
Abstract
AI-generated summary: Group-Query Latent Attention (GQLA) enables efficient inference across different hardware by exposing multiple decoding paths from a single set of trained weights, supporting both high-performance and commodity GPUs without retraining.
Multi-head Latent Attention (MLA), the attention used in DeepSeek-V2/V3, jointly compresses keys and values into a low-rank latent and matches the H100 roofline almost perfectly. Its trained weights, however, expose only one decoding path - an absorbed MQA form - which ties efficient inference to H100-class compute-bandwidth ratios, forfeits tensor parallelism along the head axis, and yields no Multi-Token Prediction (MTP) gain on commodity inference GPUs such as the export-restricted H20. We propose Group-Query Latent Attention (GQLA), a minimal modification of MLA whose trained weights expose two algebraically equivalent decoding paths over the same parameters: an MQA-absorb path identical to MLA's, and a GQA path with a per-group expanded cache. The runtime picks the path that matches the target hardware - no retraining, no custom kernels - so a single set of GQLA weights pins the rooflines of both H100 (MQA-absorb, s_q=1) and H20 (GQA + MTP, s_q=2), while supporting up to 8-way zero-redundancy tensor parallelism on the GQA path. To avoid pretraining from scratch we extend TransMLA into TransGQLA, which converts a pretrained GQA checkpoint into a GQLA model; on LLaMA-3-8B it compresses the per-token KV cache to 28.125% of the GQA baseline on the MQA-absorb path while structurally preserving GQA-level traffic on the per-group path.
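Two claims in the abstract can be checked on paper with a few lines of Python. The sketch below is illustrative only and is not the paper's implementation. It first shows the generic "absorb" identity behind MLA-style decoding for a single head on toy dimensions: expanding each cached latent into a key and value gives the same output as folding the up-projections into the query and the output. It then reproduces the 28.125% cache figure. All names (`expanded_path`, `absorbed_path`, `kv_cache_ratio`, `w_uk`, `w_uv`) are hypothetical, the decoupled RoPE component is left out for simplicity, and GQLA's actual per-group parameterization is not modeled. The GQA baseline uses LLaMA-3-8B's 8 KV heads of dimension 128; the 576-element latent is inferred from the stated percentage and is an assumption.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def expanded_path(q, latents, w_uk, w_uv):
    """Expand every cached latent into a full key and value, then attend."""
    scale = 1.0 / math.sqrt(len(q))
    keys = [[dot(row, c) for row in w_uk] for c in latents]    # k_t = W_uk c_t
    values = [[dot(row, c) for row in w_uv] for c in latents]  # v_t = W_uv c_t
    probs = softmax([scale * dot(q, k) for k in keys])
    return [sum(p * v[i] for p, v in zip(probs, values)) for i in range(len(w_uv))]


def absorbed_path(q, latents, w_uk, w_uv):
    """Fold W_uk into the query and W_uv into the output; attend over latents only."""
    scale = 1.0 / math.sqrt(len(q))
    rank = len(latents[0])
    # q_latent = W_uk^T q, so q_latent . c_t == q . (W_uk c_t)
    q_latent = [sum(w_uk[i][j] * q[i] for i in range(len(q))) for j in range(rank)]
    probs = softmax([scale * dot(q_latent, c) for c in latents])
    mixed = [sum(p * c[j] for p, c in zip(probs, latents)) for j in range(rank)]
    return [dot(row, mixed) for row in w_uv]                   # W_uv (sum_t p_t c_t)


def kv_cache_ratio(latent_elems, n_kv_heads, head_dim):
    """Per-token, per-layer cache size relative to a GQA cache of keys plus values."""
    return latent_elems / (2 * n_kv_heads * head_dim)


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes, chosen only to keep the example small (assumption).
random.seed(0)
RANK, HEAD_DIM, TOKENS = 4, 3, 5
q = [random.gauss(0.0, 1.0) for _ in range(HEAD_DIM)]
latents = rand_matrix(TOKENS, RANK)   # what the absorbed path caches
w_uk = rand_matrix(HEAD_DIM, RANK)    # key up-projection (head_dim x rank)
w_uv = rand_matrix(HEAD_DIM, RANK)    # value up-projection (head_dim x rank)

out_expanded = expanded_path(q, latents, w_uk, w_uv)
out_absorbed = absorbed_path(q, latents, w_uk, w_uv)
max_diff = max(abs(a - b) for a, b in zip(out_expanded, out_absorbed))
print("paths agree:", max_diff < 1e-9)

# LLaMA-3-8B GQA baseline: 2 * 8 * 128 = 2048 cached elements per token per layer.
print(kv_cache_ratio(576, 8, 128))  # 0.28125
```

The two paths differ only in what is cached and where the matrix products happen: the absorbed form reads a small latent per token but does more arithmetic per byte, while the expanded form reads full keys and values. That trade-off is what the abstract refers to when it says the runtime picks the path matching the target hardware.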
Community
Efficient on both compute-constrained (H100) and bandwidth-constrained (H20) hardware