ByteShape Qwen3.6-35B-A3B: 30% faster than Unsloth IQ on 6GB VRAM laptop
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
A few days ago I posted about my experiments with MTP on a 6GB VRAM laptop. That didn't work so well; CPU offload hurts MTP performance badly. But now I've tried out the new ByteShape quants for Qwen3.6-35B-A3B that are claimed to be both smaller and faster than others while still having excellent quality. I decided to compare my previous best Unsloth UD-IQ4_XS setup head-to-head with the ByteShape "CPU-5" quant in terms of performance.
TL;DR: The ByteShape quant is 30% faster on token generation (TG) but about 4% slower on prompt processing (PP) than the similarly sized Unsloth quant when partially offloaded to CPU on a 6GB VRAM laptop.
Hardware
- Asus ROG Zephyrus G14 laptop, 2021 model
- AMD Ryzen 7 5800HS with Radeon Graphics (8 CPU cores / 16 threads)
- NVIDIA RTX 3060 Laptop GPU, 6GB VRAM
- 24GB RAM (DDR4 3200 MT/s), 1TB SSD
Software
- Linux Mint 22.2 (based on Ubuntu 24.04) with the Cinnamon desktop running on the Radeon iGPU (thus the 3060 was dedicated to llama.cpp only)
- llama.cpp version: 9203 (87589042c) built from current master branch with GNU 13.3.0 for Linux x86_64
- CUDA 12.0 installed from Ubuntu repositories
Test setup
I fixed the following for all the experiments:
- context size 65536 (enough to do agentic coding on e.g. Pi or Dirac, or run Hermes Agent)
- mmap off, mlock on, ubatch size 2048 (gives much better PP speed than the default 512)
- no mmproj (no image input support needed for now)
- for more details, see configuration below
The quants tested:
- Unsloth UD-IQ4_XS (17.7 GB)
- ByteShape CPU-5 aka Q4_K_S-4.22bpw (18.3 GB)
Configuration
My models-preset.ini contents:
    version = 1

    [Qwen3.6-35B-A3B]
    # Unsloth variant
    m = /proj/llms/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf
    # ByteShape variant
    # m = /proj/llms/Qwen3.6-35B-A3B-Q4_K_S-4.22bpw.gguf
    fit = true
    fit-target = 64
    c = 65536
    chat-template-kwargs = {"preserve_thinking": true}
    temp = 0.6
    top-p = 0.95
    min-p = 0.0
    top-k = 20
    repeat-penalty = 1.0
    presence-penalty = 0.0
    ctx-checkpoints = 64
    flash-attn = on
    b = 2048
    ub = 2048
    jinja = true
    ctk = q8_0
    ctv = q8_0
    threads = 6
    parallel = 1
    cache-ram = 4096
    mmap = false
    mlock = true

Benchmark results
I used a test prompt of approx. 10k tokens, followed by 1.5-2k tokens of generation. Tried both twice, got pretty much exactly the same numbers.
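For anyone repeating this kind of run, here is a minimal sketch of how PP and TG figures fall out of token counts and elapsed time per phase. The `RunTimings` class and its field names are hypothetical (not llama.cpp's own output format), and the numbers in the example are made up for illustration, not my measurements.

```python
from dataclasses import dataclass


@dataclass
class RunTimings:
    """Hypothetical container for one benchmark run (names are illustrative)."""
    prompt_tokens: int       # tokens processed during prompt processing (PP)
    prompt_ms: float         # wall-clock time spent on PP, in milliseconds
    generated_tokens: int    # tokens produced during token generation (TG)
    generated_ms: float      # wall-clock time spent on TG, in milliseconds


def tokens_per_second(tokens: int, elapsed_ms: float) -> float:
    """Convert a token count and elapsed milliseconds into tokens per second."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    return tokens / (elapsed_ms / 1000.0)


def summarize(run: RunTimings) -> dict:
    """Return PP and TG throughput for a single run."""
    return {
        "pp_tok_s": tokens_per_second(run.prompt_tokens, run.prompt_ms),
        "tg_tok_s": tokens_per_second(run.generated_tokens, run.generated_ms),
    }


# Made-up example: a 10k-token prompt and 2k generated tokens.
example = RunTimings(
    prompt_tokens=10_000, prompt_ms=20_000.0,
    generated_tokens=2_000, generated_ms=80_000.0,
)

if __name__ == "__main__":
    print(summarize(example))
```

Keeping PP and TG separate matters here because the two phases stress the hardware differently, and a quant can win on one while losing on the other.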
| | Unsloth | ByteShape | Δ |
|---|---|---|---|
| PP tok/s | 585 | 564 | -4% |
| TG tok/s | 25.4 | 33.1 | +30% |
Despite being a bit larger, the ByteShape quant is about 30% faster on generation than the Unsloth quant! PP speed is about 4% lower for ByteShape, though.
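To put the two numbers in context, here is a small sketch that recomputes the Δ column from the table and estimates the total time for one request. The 10,000 prompt tokens and 2,000 generated tokens are an assumption based on the approximate test sizes mentioned above, and the function names are just illustrative.

```python
def relative_change(baseline: float, candidate: float) -> float:
    """Percentage change of candidate relative to baseline."""
    return (candidate - baseline) / baseline * 100.0


def end_to_end_seconds(prompt_tokens: int, gen_tokens: int,
                       pp_tok_s: float, tg_tok_s: float) -> float:
    """Estimated total time: prompt processing plus token generation."""
    return prompt_tokens / pp_tok_s + gen_tokens / tg_tok_s


# Throughput figures taken from the benchmark table (tok/s).
RESULTS = {
    "Unsloth": {"pp": 585.0, "tg": 25.4},
    "ByteShape": {"pp": 564.0, "tg": 33.1},
}

# Assumed request shape, roughly matching the test prompt described above.
PROMPT_TOKENS = 10_000
GEN_TOKENS = 2_000

if __name__ == "__main__":
    base, cand = RESULTS["Unsloth"], RESULTS["ByteShape"]
    print(f"PP delta: {relative_change(base['pp'], cand['pp']):+.1f}%")
    print(f"TG delta: {relative_change(base['tg'], cand['tg']):+.1f}%")
    for name, r in RESULTS.items():
        total = end_to_end_seconds(PROMPT_TOKENS, GEN_TOKENS, r["pp"], r["tg"])
        print(f"{name}: ~{total:.0f} s for the whole request")
```

Under these assumptions the whole request comes out at roughly 96 s for Unsloth versus 78 s for ByteShape, since generation dominates the total time and the small PP loss barely registers. With a much longer prompt and a short answer the balance would shift.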
Observations
- Part of the difference may be explained by the quant types: IQ (i-quant) vs regular Q (K-quant). Unsloth UD-IQ4_XS is an IQ type, and I understand that these are slower to compute on CPU. (Strictly speaking, an imatrix is a calibration method that can be applied to either type, so "IQ" and "imatrix" are not the same thing.) A better comparison would be against the ByteShape GPU-5 quant, which is also an IQ type in my understanding. But I wanted an upgrade over UD-IQ4_XS and definitely got it!
- I noticed that my TG performance seems to degrade over time by ~10% or more without changing the setup. I suspect that repeatedly suspending and resuming the laptop somehow hurts, but I haven't figured out the reason; it's not just memory pressure building up AFAICT. Rebooting the machine gives me the best performance, so I did that before benchmarking.
- I haven't made any detailed quality measurements between the models. The ByteShape model seems very similar; possibly the thinking output is generally somewhat shorter than with Unsloth, but that could be a measurement error. I hope that someone does an independent comparison between ByteShape and other quants in terms of output quality, because their claims seem to be a bit too good to be true!
Notes
This post was assembled from 100% biodegradable bytes. No AIs were harmed in the process.