r/LocalLLaMA · May 22, 2026 · 3 min read

ByteShape Qwen3.6-35B-A3B: 30% faster than Unsloth IQ on 6GB VRAM laptop

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

A few days ago I posted about my experiments with MTP on a 6GB VRAM laptop. That didn't work so well; CPU offload hurts MTP performance badly. But now I've tried out the new ByteShape quants for Qwen3.6-35B-A3B that are claimed to be both smaller and faster than others while still having excellent quality. I decided to compare my previous best Unsloth UD-IQ4_XS setup head-to-head with the ByteShape "CPU-5" quant in terms of performance.

TL;DR: ByteShape quant is 30% faster on TG but slightly slower on PP than the similarly sized Unsloth quant when partially offloaded to CPU on a 6GB VRAM laptop.

Hardware

Asus ROG Zephyrus G14 laptop, 2021 model
AMD Ryzen 7 5800HS with Radeon Graphics (8 CPU cores / 16 threads)
NVIDIA RTX 3060 Laptop GPU, 6GB VRAM
24GB RAM (DDR4 3200 MT/s), 1TB SSD

Software

Linux Mint 22.2 (based on Ubuntu 24.04) with the Cinnamon desktop running on the Radeon iGPU (thus the 3060 was dedicated to llama.cpp only)
llama.cpp version: 9203 (87589042c) built from current master branch with GNU 13.3.0 for Linux x86_64
CUDA 12.0 installed from Ubuntu repositories

Test setup

I fixed the following for all the experiments:

context size 65536 (enough to do agentic coding on e.g. Pi or Dirac, or run Hermes Agent)
mmap off, mlock on, ubatch size 2048 (gives much better PP speed than the default 512)
no mmproj (no image input support needed for now)
for more details, see configuration below

The quants tested:

Unsloth UD-IQ4_XS (17.7 GB)
ByteShape CPU-5 aka Q4_K_S-4.22bpw (18.3 GB)

Configuration

My models-preset.ini contents:

version = 1 [Qwen3.6-35B-A3B] # Unsloth variant m = /proj/llms/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf # ByteShape variant # m = /proj/llms/Qwen3.6-35B-A3B-Q4_K_S-4.22bpw.gguf fit = true fit-target = 64 c = 65536 chat-template-kwargs = {"preserve_thinking": true} temp = 0.6 top-p = 0.95 min-p = 0.0 top-k = 20 repeat-penalty = 1.0 presence-penalty = 0.0 ctx-checkpoints = 64 flash-attn = on b = 2048 ub = 2048 jinja = true ctk = q8_0 ctv = q8_0 threads = 6 parallel = 1 cache-ram = 4096 mmap = false mlock = true

Benchmark results

I used a test prompt of approx. 10k tokens, followed by 1.5-2k tokens of generation. Tried both twice, got pretty much exactly the same numbers.

	Unsloth	ByteShape	Δ
PP tok/s	585	564	-4%
TG tok/s	25.4	33.1	+30%

The ByteShape quant, despite being a bit larger than Unsloth, is over 30% faster on generation than the Unsloth quant! PP speed is slightly lower for ByteShape though.

Observations

Part of the difference may be explained by imatrix (IQ) vs regular (Q) quants. Unsloth UD-IQ4_XS is imatrix, and I understand that these are slower to compute on CPU. A better comparison would be against the ByteShape GPU-5 quant, which is also imatrix in my understanding. But I wanted an upgrade over UD-IQ4_XS and definitely got it!
I noticed that my TG performance seems to degrade over time by ~10% or more without changing the setup. I suspect suspending and then awakening the laptop repeatedly somehow hurts, but I haven't figured out the reason; it's not just memory pressure building up AFAICT. Rebooting the machine brings me the best performance, so I did that before benchmarking.
I haven't made any detailed quality measurements between the models. The ByteShape model seems very similar; possibly the thinking output is generally somewhat shorter than with Unsloth, but that could be a measurement error. I hope that someone does an independent comparison between ByteShape and other quants in terms of output quality, because their claims seem to be a bit too good to be true!

Notes

This post assembled from 100% biodegradeable bytes. No AIs were harmed in the process.

submitted by /u/OsmanthusBloom
[link] [comments]

Discussion (0)

No comments yet. Sign in and be the first to say something.