Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention
Mirrored from r/LocalLLaMA for archival readability.
Submitted by /u/seraschka.