r/LocalLLaMA

UPDATE: Qwen-27B-IQ4_KS and Qwen-27B-IQ4_KS_KT for ik_llama.cpp, especially for NVIDIA with 16GB VRAM


Continuing 16GB VRAM Optimizations: New Qwen3.6-27B GGUF Quants (Experimental Trellis/iq4_kt & MTP)

Hi everyone,

I'm continuing my optimization efforts for NVIDIA GPUs with 16GB VRAM, following on from this post:

https://www.reddit.com/r/LocalLLaMA/comments/1tkmgwj/qwen27biq4_ks_for_ik_llamacpp_especially_for/

As a result, I've just uploaded two new quantizations for ik_llama.cpp.

  1. To the Qwen3.6-27B-i1-IQ4_KS-GGUF repository, I added a new quant: Qwen3.6-27B.i1-IQ4_KS-attn_qkv-IQ4_KS.gguf. In theory, it has a more logical layout (I'm still learning as I go). It is exactly the same size as the previous Qwen3.6-27B.i1-IQ4_KS-attn_qkv-IQ4_KSS.gguf, but I tweaked it to boost logic at the expense of the model's general knowledge, which should help with coding tasks.

    PPL Test Results:

    ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_KS-attn_qkv-IQ4_KS.gguf -f /mnt/Samsung4TB/models/pg19.txt -c 65536 --chunks 32 -ngl 99 -khad -vhad -ctk q4_0 -ctv q4_0 -fa 1 -b 512 -ub 256

    [1]6.6926,[2]7.0049,[3]7.2043,[4]7.3382,[5]7.4861,[6]7.3838,[7]7.4411,[8]7.4459,[9]7.4857,[10]7.5303,[11]7.5779,[12]7.4131,
    Final estimate: PPL over 12 chunks for n_ctx=65536 = 7.4131 +/- 0.02774

  2. The second model, Qwen3.6-27B-i1-IQ4_KS_KT-GGUF, is purely experimental. I wanted to find out where the highly efficient Trellis quantization (iq4_kt) could be used successfully. Normally, this type of quantization completely wrecks the model's logic, so I applied it only to tensors with near-Gaussian distributions. The results turned out pretty interesting.

    PPL Test Results:

    ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_KS_KT-attn_qkv-IQ4_KS.gguf -f /mnt/Samsung4TB/models/pg19.txt -c 65536 --chunks 32 -ngl 99 -khad -vhad -ctk q4_0 -ctv q4_0 -fa 1 -b 512 -ub 256

    [1]6.6915,[2]7.0030,[3]7.1945,[4]7.3323,[5]7.4815,[6]7.3783,[7]7.4367,[8]7.4409,[9]7.4804,[10]7.5251,[11]7.5728,[12]7.4091,
    Final estimate: PPL over 12 chunks for n_ctx=65536 = 7.4091 +/- 0.02777
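
The post does not say how "near-Gaussian" was judged for the second model. One simple way to screen a tensor's weights is excess kurtosis, which is close to 0 for a Gaussian and grows with heavier tails. The sketch below is only an illustration of that idea, not the author's method: the function names, the 0.5 tolerance, and the synthetic samples are all assumptions, and real use would feed in the weights read from an actual tensor.

```python
import random


def excess_kurtosis(values):
    """Return the excess kurtosis of a sample (about 0 for Gaussian data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    if m2 == 0:
        raise ValueError("constant sample has no defined kurtosis")
    return m4 / (m2 * m2) - 3.0


def looks_gaussian(values, tolerance=0.5):
    """Hypothetical screen: accept if excess kurtosis is within the tolerance."""
    return abs(excess_kurtosis(values)) < tolerance


# Synthetic stand-ins for tensor weights (assumption: no real model data here).
rng = random.Random(0)
gaussian_like = [rng.gauss(0.0, 1.0) for _ in range(20000)]
# Laplace-distributed sample: heavier tails, theoretical excess kurtosis of 3.
heavy_tailed = [rng.expovariate(1.0) * rng.choice((-1, 1)) for _ in range(20000)]

if __name__ == "__main__":
    for name, sample in (("gaussian_like", gaussian_like),
                         ("heavy_tailed", heavy_tailed)):
        print(name, round(excess_kurtosis(sample), 3), looks_gaussian(sample))
```

A tensor that passes such a screen would be a candidate for iq4_kt under this assumed criterion; one that fails would stay on iq4_ks.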

As you can see from the results, both models show very similar PPL (perplexity): 7.4131 vs. 7.4091, a difference well within the reported error of about ±0.028. Unfortunately, I don't have the means to run KLD tests right now, so if anyone has the setup for it, I'd be super grateful if you could test them out.
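
To make the comparison concrete, here is a small sketch that parses the two "Final estimate" lines quoted above and compares the PPL gap with the combined error bar. The helper names are made up for illustration. It treats the two error bars as independent, which is not strictly true since both runs use the same text, so take it as a rough sanity check rather than a proper significance test.

```python
import math
import re

# Final lines copied from the two llama-perplexity runs quoted in the post.
RESULTS = {
    "IQ4_KS": "Final estimate: PPL over 12 chunks for n_ctx=65536 = 7.4131 +/- 0.02774",
    "IQ4_KS_KT": "Final estimate: PPL over 12 chunks for n_ctx=65536 = 7.4091 +/- 0.02777",
}

# Matches the trailing "= <ppl> +/- <error>" part of a final-estimate line.
FINAL_RE = re.compile(r"=\s*([0-9.]+)\s*\+/-\s*([0-9.]+)\s*$")


def parse_final(line):
    """Extract (ppl, error) from a 'Final estimate' line."""
    match = FINAL_RE.search(line)
    if match is None:
        raise ValueError(f"not a final-estimate line: {line!r}")
    return float(match.group(1)), float(match.group(2))


def compare(line_a, line_b):
    """Return (gap, combined_error, gap / combined_error) for two runs."""
    ppl_a, err_a = parse_final(line_a)
    ppl_b, err_b = parse_final(line_b)
    gap = abs(ppl_a - ppl_b)
    combined = math.hypot(err_a, err_b)  # errors added in quadrature
    return gap, combined, gap / combined


if __name__ == "__main__":
    gap, combined, ratio = compare(RESULTS["IQ4_KS"], RESULTS["IQ4_KS_KT"])
    print(f"gap={gap:.4f} combined_error={combined:.4f} ratio={ratio:.2f}")
```

The gap of 0.0040 is roughly a tenth of the combined error, which is why PPL alone cannot separate the two quants and a KLD test would be more informative.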

To keep up with recent trends, I also threw MTP (Multi-Token Prediction) into the mix, though it leaves little headroom for context. I made two versions: i1_MTP denotes iq4_ks quantization, while plain MTP denotes q8_0.

submitted by /u/Pablo_the_brave