r/LocalLLaMA

Krasis update: Qwen3.6-35B-A3B (Q4) at reading speed, 1x 8GB 3070 Mobile laptop (32GB RAM)


Context

Krasis is an LLM runtime for running models that don't fit into VRAM. It streams the model from system RAM through VRAM efficiently, and treats prefill and decode as separate use cases, each with its own optimised architecture.
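To make "doesn't fit into VRAM" concrete, here is a rough back-of-the-envelope sketch. It is not part of Krasis: the function names are made up for illustration, and it only counts the raw quantised weights, ignoring quantisation metadata (scales, zero points), the KV cache and activations, so real usage will be higher.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantised weights in GB (1 GB = 1e9 bytes).

    Illustrative only: ignores scales/zero points, KV cache and activations.
    """
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


def fits_in_vram(params_billion: float, bits_per_weight: float, vram_gb: float) -> bool:
    """True if the raw weights alone would fit in the given amount of VRAM."""
    return weight_footprint_gb(params_billion, bits_per_weight) <= vram_gb


# A 35B-parameter model at 4 bits per weight vs an 8GB laptop GPU.
print(weight_footprint_gb(35, 4))  # 17.5
print(fits_in_vram(35, 4, 8))      # False
```

At roughly 17.5 GB of weights against 8 GB of VRAM, most of the model has to live in system RAM, which is the situation Krasis targets.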

Latest results (v1.0 release)

  • 1x Laptop RTX 3070 Mobile 8GB, (35B param, Q4) Qwen3.6-35B-A3B (HQQ4, k4v4): 222 pp, 12.48 tg
  • 1x RTX 5080 16GB, (35B param, Q4) Qwen3.6-35B-A3B (HQQ4, k4v4): 3,743 pp, 60 tg
  • 1x RTX A4500 20GB, (35B param, Q4) Qwen3.6-35B-A3B (HQQ6, k6v6): 2,235 pp, 51 tg
  • 1x RTX A4500 20GB, (80B param, Q4) Qwen3-Coder-Next (HQQ6, k4v4): 1,569 pp, 34.7 tg
  • 1x RTX 5090 32GB, (35B param, Q4) Qwen3.6-35B-A3B (HQQ4, k4v4): 10,030 pp, 124.9 tg
  • 1x RTX 5090 32GB, (80B param, Q4) Qwen3-Coder-Next (HQQ8, k4v4): 6,111 pp, 88.6 tg
  • 1x RTX 5090 32GB, (122B param, Q4) Qwen3.5-122B-A10B (HQQ6, k4v4): 4,880 pp, 25.2 tg

(Benchmark note: pp is prefill (prompt processing) and tg is decode (token generation), both in tokens per second. Krasis runs a number of prompt lengths when gathering benchmark numbers for both prefill and decode. These figures are the best throughput obtained during the benchmark, not the average across all prompt lengths. Prefill throughput broadly scales up with larger inputs, and decode throughput tends to drop with longer outputs, as is generally the case across runtimes.)
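As a rough way to read these numbers, the sketch below turns a pp/tg pair into an estimated time for a single request. It assumes both figures are tokens per second, and because the figures above are best-case throughput, the result is an optimistic lower bound rather than a prediction. The function name and the example request size (2,000 prompt tokens, 500 output tokens) are made up for illustration.

```python
def min_request_seconds(prompt_tokens: int, output_tokens: int,
                        pp_tok_s: float, tg_tok_s: float) -> float:
    """Optimistic time for one request: prefill time plus decode time."""
    prefill_seconds = prompt_tokens / pp_tok_s
    decode_seconds = output_tokens / tg_tok_s
    return prefill_seconds + decode_seconds


# RTX 3070 Mobile figures from the list above: 222 pp, 12.48 tg.
seconds = min_request_seconds(2000, 500, pp_tok_s=222, tg_tok_s=12.48)
print(round(seconds, 1))  # 49.1
```

On the laptop, decode dominates in this example (about 40 of the 49 seconds), which is why the tg figure matters most for interactive use.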

Latest Updates

It's been a couple of months now since the initial release of Krasis.

What I thought would be relatively quick changes have taken far longer than I expected, but Krasis is now at a point where I feel it is a solid base on which to build support for more models.

Here are the biggest changes:

  • All Rust Execution: Krasis no longer runs Python at all in the hot path. I found that the Python GIL was frequently causing difficulties and slowdowns where they didn't really need to exist. Python is still there for the initial pre-processing, but when the model runs it's now 100% Rust, and it runs faster.
  • Speed: Krasis runs models faster now. The biggest gains are with prefill but decode is also quicker.
  • Ampere support: RTX 3000 series cards are now fully supported. I've been running an A4500 20GB and getting good speeds on substantial models that don't fit on the GPU, like Qwen3.6-35B-A3B and even Qwen3-Coder-Next (80B parameters).
  • Memory improvements: Krasis no longer requires 2x the size of the quantized model in system RAM; it now needs 1x plus some overhead.
  • New 4-bit and 6-bit KV cache: Krasis now has 4-bit and 6-bit KV cache implementations, both thoroughly tested for accuracy against BF16 with good results. Polar4, which was based on TurboQuant, has been dropped because it just wasn't accurate enough. (Interestingly, the TurboQuant accuracy claims relate to preserving scores on tasks, whereas in Krasis I measure accuracy on a variety of prompts, quantised vs the BF16 reference, using exact-match length of the output, top-k containment, perplexity and distribution drift.) The new KV cache doesn't require FP8 instructions, so it is fully compatible with Ampere cards.
  • Sensitivity Aware HQQ Attention at 4, 6 or 8 bits: Krasis no longer uses AWQ attention. AWQ required running the model in BF16 to generate a template that people could download. Users often don't have the VRAM to do this themselves, so I wanted a better alternative. Krasis now runs HQQ attention in 4, 6 or 8 bits and can mix precisions to achieve higher accuracy. HQQ assets are built by mathematically assessing the model and don't require a previously built template. During the assessment Krasis can also estimate which areas of the model are most sensitive to quantisation and offer 90% HQQ4 + 10% HQQ6 or 90% HQQ6 + 10% HQQ8. This keeps memory usage low while moving the more sensitive areas to a higher precision, giving better accuracy vs BF16 execution. HQQ is also fully compatible with Ampere cards.
  • Stability improvements: Krasis now handles changes in VRAM usage elsewhere in the system by dynamically evicting from its cache. Krasis maximises its use of VRAM to optimise the performance of the model run, but previously, if you ran Krasis on Windows via WSL and then opened Opencode, you might see it fail because Windows allocated 500MB+ of VRAM to Opencode (transiently or otherwise). Krasis now handles this and backs off, maintaining the safety buffer.
  • Qwen3.6-35B-A3B support: Krasis now supports the latest Qwen 3.6 model.
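Two of the items above can be illustrated with a small sketch. This is not Krasis code: the function names are hypothetical, the token ids are toy values, and the definitions of "exact-match length" and "top-k containment" are my reading of those terms, not confirmed details of how Krasis measures them. The mixed-precision part assumes the 90%/10% split is a fraction of weights.

```python
from typing import Mapping, Sequence


def average_bits(mix: Mapping[int, float]) -> float:
    """Average bits per weight for a mix of {bit_width: fraction_of_weights}."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(bits * fraction for bits, fraction in mix.items())


def exact_match_length(reference: Sequence[int], candidate: Sequence[int]) -> int:
    """Number of leading tokens that are identical in both outputs."""
    matched = 0
    for ref_token, cand_token in zip(reference, candidate):
        if ref_token != cand_token:
            break
        matched += 1
    return matched


def top_k_containment(reference_tokens: Sequence[int],
                      candidate_top_k: Sequence[Sequence[int]]) -> float:
    """Fraction of positions where the reference token is in the candidate's top-k."""
    if len(reference_tokens) != len(candidate_top_k):
        raise ValueError("inputs must cover the same positions")
    if not reference_tokens:
        raise ValueError("need at least one position")
    hits = sum(1 for token, top_k in zip(reference_tokens, candidate_top_k)
               if token in top_k)
    return hits / len(reference_tokens)


# Mixed precision: average cost of the two mixes mentioned above.
print(round(average_bits({4: 0.9, 6: 0.1}), 2))  # 4.2
print(round(average_bits({6: 0.9, 8: 0.1}), 2))  # 6.2

# Toy token ids: the quantised output diverges at the third token.
print(exact_match_length([5, 9, 2, 7], [5, 9, 3, 7]))  # 2

# Toy top-2 lists: the reference token is present at 2 of 3 positions.
print(round(top_k_containment([5, 9, 2], [[5, 1], [4, 9], [7, 8]]), 2))  # 0.67
```

Under these assumptions, moving 10% of the weights up one precision step costs only about 0.2 extra bits per weight on average, which is consistent with the "keeping memory usage low" claim.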

Trying it out

Krasis is a copy/paste setup. You can run it on Linux, or on Windows using WSL, and once it's installed you can update to the latest release or prerelease using "krasis update" or "krasis prerelease".

GitHub Repo - https://github.com/brontoguana/krasis

Coming soon

Now that Krasis has a solid and accurate base, with the KV cache and attention in a good place, I plan to focus on more models like Google's Gemma and MiniMax, and to look at implementing vision support for the models.

Very interested to hear if anyone has any opinions on the future direction it should take or how they might use it.

submitted by /u/mrstoatey