r/LocalLLaMA · · 2 min read

Qwen-27B-IQ4_KS for ik_llama.cpp, especially for NVIDIA with 16GB VRAM

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Hi everyone,

I'm presenting a new quantization of the Qwen-27B model, created specifically with 16GB VRAM NVIDIA GPUs in mind. I used quants that, unfortunately, are not yet available in the main upstream llama.cpp. I'm talking about the KS and KSS quants developed by ikawrakow. After many trials, I managed to create a 14.1GB model which, in my testing, delivers results highly comparable to my previous 14.7GB IQ4_XS quantization.

Model Link: cHunter789/Qwen3.6-27B-i1-IQ4_KS-GGUF

ik_llama.cpp Project: ikawrakow/ik_llama.cpp

Unfortunately, the ik_llama.cpp project required to run this model is NVIDIA CUDA and CPU only. There is currently no way to run this on AMD or Apple Silicon (Metal) :/

Using this model with ik_llama.cpp and a Q4_0 Hadamard KV cache allows for a 105k context window.

Benchmark Results & Real-World Impressions

The model was heavily tested in daily production workflows for several days. It runs much faster (1.5x-1.75x) and more reliably than the previous iteration—completely eliminating the issue of "blank outputs", while the search-replace functionality works flawlessly.

  • Qwen Benchmark: Successfully passed the performance evaluations on qwen3-6-27b-benchmark.vercel.app.
  • Needle In A Haystack: Successfully evaluated with satisfying results across the full 100k context window.
  • Comparison: In direct testing, this model performs slightly better than my previous variant: Qwen3.6-27B-i1-IQ4_XS-GGUF.

Perplexity (PPL) Testing

Perplexity evaluations were conducted focusing exclusively on the KV Cache quantization setup (q4_0), as this is the primary target use case:

```bash wget https://www.gutenberg.org/files/2600/2600-0.txt -O pg19.txt

./llama-perplexity -m Qwen3.6-27B.i1-IQ4_KS-attn_qkv-IQ4_KSS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl 99 -khad -vhad -ctk q4_0 -ctv q4_0 -fa 1 -b 512 -ub 512 ```

Test Log Output: ```text perplexity: calculating perplexity over 12 chunks, n_ctx=65536, batch_size=512, n_seq=1 perplexity: 71.10 seconds per pass - ETA 14.22 minutes [1]6.6897,[2]7.0032,[3]7.1989,[4]7.3327,[5]7.4816,[6]7.3770,[7]7.4325,[8]7.4378,[9]7.4754,[10]7.5192,[11]7.5669,[12]7.4040,

Final estimate: PPL over 12 chunks for n_ctx=65536 = 7.4040 +/- 0.02773 ```

Note: I currently do not have the capability to run KLD (Kullback–Leibler divergence) tests.

Example Server Configuration

For reference, here is the server configuration I used during my tests:

bash llama-server \ -m "$MODEL_PATH" \ -a Qwen3.6-27B \ --ctx-size 105000 \ --chat-template-file chat_template.jinja \ --n-gpu-layers 99 \ --cache-type-k q4_0 \ --cache-type-v q4_0 \ --batch-size 512 \ --ubatch-size 256 \ --flash-attn on \ --no-mmap \ --host 0.0.0.0 \ --port 8081 \ --reasoning on \ --reasoning-format deepseek \ -t 8 \ --parallel 1 \ -khad \ -vhad \ --chat-template-kwargs '{"preserve_thinking": true}' \ --defrag-thold 0.3 \ --jinja \ --cont-batching \ --temp 0.15 \ --top-k 1 \ --min-p 0.1 \ --repeat-last-n 512 \ --repeat-penalty 1.05

```

submitted by /u/Pablo_the_brave
[link] [comments]

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from r/LocalLLaMA