r/LocalLLaMA

ByteShape Qwen3.6-35B-A3B: 30% faster than Unsloth IQ on 6GB VRAM laptop


A few days ago I posted about my experiments with MTP on a 6GB VRAM laptop. That didn't work so well; CPU offload hurts MTP performance badly. But now I've tried out the new ByteShape quants for Qwen3.6-35B-A3B that are claimed to be both smaller and faster than others while still having excellent quality. I decided to compare my previous best Unsloth UD-IQ4_XS setup head-to-head with the ByteShape "CPU-5" quant in terms of performance.

TL;DR: ByteShape quant is 30% faster on TG but slightly slower on PP than the similarly sized Unsloth quant when partially offloaded to CPU on a 6GB VRAM laptop.

Hardware

  • Asus ROG Zephyrus G14 laptop, 2021 model
  • AMD Ryzen 7 5800HS with Radeon Graphics (8 CPU cores / 16 threads)
  • NVIDIA RTX 3060 Laptop GPU, 6GB VRAM
  • 24GB RAM (DDR4 3200 MT/s), 1TB SSD

Software

  • Linux Mint 22.2 (based on Ubuntu 24.04) with the Cinnamon desktop running on the Radeon iGPU (thus the 3060 was dedicated to llama.cpp only)
  • llama.cpp version: 9203 (87589042c) built from current master branch with GNU 13.3.0 for Linux x86_64
  • CUDA 12.0 installed from Ubuntu repositories

Test setup

I fixed the following for all the experiments:

  • context size 65536 (enough to do agentic coding on e.g. Pi or Dirac, or run Hermes Agent)
  • mmap off, mlock on, ubatch size 2048 (gives much better PP speed than the default 512)
  • no mmproj (no image input support needed for now)
  • for more details, see configuration below
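As a rough illustration of the ubatch setting, the sketch below counts how many chunks a prompt is split into, assuming prompt processing works through the prompt in pieces of at most ubatch tokens. The function name is made up for this example, and the chunk count alone says nothing definite about actual speed.

```python
import math


def num_ubatches(prompt_tokens: int, ubatch: int) -> int:
    """Number of chunks when a prompt is split into pieces of at most `ubatch` tokens."""
    return math.ceil(prompt_tokens / ubatch)


prompt_tokens = 10_000  # roughly the size of the test prompt used in this post
for ub in (512, 2048):
    print(f"ubatch={ub}: {num_ubatches(prompt_tokens, ub)} chunks")
```

Fewer, larger chunks mean fewer passes over the weights during prompt processing, which is one plausible reason the larger setting helps here.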

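The quants tested:

  • Unsloth UD-IQ4_XS (Qwen3.6-35B-A3B-UD-IQ4_XS.gguf)
  • ByteShape "CPU-5" (Qwen3.6-35B-A3B-Q4_K_S-4.22bpw.gguf)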

Configuration

My models-preset.ini contents:

    version = 1

    [Qwen3.6-35B-A3B]
    # Unsloth variant
    m = /proj/llms/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf
    # ByteShape variant
    # m = /proj/llms/Qwen3.6-35B-A3B-Q4_K_S-4.22bpw.gguf
    fit = true
    fit-target = 64
    c = 65536
    chat-template-kwargs = {"preserve_thinking": true}
    temp = 0.6
    top-p = 0.95
    min-p = 0.0
    top-k = 20
    repeat-penalty = 1.0
    presence-penalty = 0.0
    ctx-checkpoints = 64
    flash-attn = on
    b = 2048
    ub = 2048
    jinja = true
    ctk = q8_0
    ctv = q8_0
    threads = 6
    parallel = 1
    cache-ram = 4096
    mmap = false
    mlock = true
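The ByteShape filename above includes "4.22bpw" (bits per weight). Here is a back-of-the-envelope size estimate showing why this model has to be partially offloaded to CPU on a 6GB card. It assumes a nominal 35 billion parameters (taken from the model name, not measured) and that the bpw figure is an average over all weights. The function name is hypothetical, and KV cache and compute buffers are ignored.

```python
def estimate_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in GiB: parameters * bits per weight / 8 bits per byte."""
    size_bytes = n_params * bits_per_weight / 8
    return size_bytes / 2**30


N_PARAMS = 35e9   # assumption: nominal parameter count from the "35B" in the model name
BPW = 4.22        # from the ByteShape filename
VRAM_GIB = 6.0    # assumption: treating the "6GB" VRAM figure as 6 GiB

size = estimate_size_gib(N_PARAMS, BPW)
print(f"estimated weights: {size:.1f} GiB, VRAM: {VRAM_GIB:.0f} GiB")
print(f"weights exceed VRAM: {size > VRAM_GIB}")
```

Under these assumptions the weights alone come to roughly 17 GiB, so most of the model has to sit in system RAM.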

Benchmark results

I used a test prompt of approx. 10k tokens, followed by 1.5-2k tokens of generation. I ran each quant twice and got nearly identical numbers both times.

            Unsloth   ByteShape   Δ
PP tok/s    585       564         -4%
TG tok/s    25.4      33.1        +30%

The ByteShape quant, despite being a bit larger than the Unsloth one, is just over 30% faster on token generation (TG). Prompt processing (PP) is about 4% slower, though.
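To make the arithmetic explicit, here is a small sketch that recomputes the relative differences from the measured speeds and estimates the end-to-end time for one request. The 10,000 prompt tokens and 2,000 generated tokens are round-number assumptions based on the test described above (generation was 1.5-2k tokens), and the function names are made up for this example.

```python
def pct_delta(baseline: float, candidate: float) -> float:
    """Relative difference of candidate vs. baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


def total_seconds(prompt_tokens: int, gen_tokens: int, pp: float, tg: float) -> float:
    """Time to process the prompt plus time to generate the reply, in seconds."""
    return prompt_tokens / pp + gen_tokens / tg


# Measured speeds in tok/s from the results above
results = {
    "Unsloth": {"pp": 585.0, "tg": 25.4},
    "ByteShape": {"pp": 564.0, "tg": 33.1},
}

PROMPT_TOKENS = 10_000  # assumption: round figure for the ~10k token prompt
GEN_TOKENS = 2_000      # assumption: upper end of the 1.5-2k generation range

base, cand = results["Unsloth"], results["ByteShape"]
print(f"PP delta: {pct_delta(base['pp'], cand['pp']):+.1f}%")
print(f"TG delta: {pct_delta(base['tg'], cand['tg']):+.1f}%")
for name, r in results.items():
    secs = total_seconds(PROMPT_TOKENS, GEN_TOKENS, r["pp"], r["tg"])
    print(f"{name}: about {secs:.0f} s end to end")
```

With these assumed token counts, generation dominates the total time, so the TG gain outweighs the small PP loss.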

Observations

  • Part of the difference may be explained by imatrix (IQ) vs regular (Q) quants. Unsloth UD-IQ4_XS is imatrix, and I understand that these are slower to compute on CPU. A better comparison would be against the ByteShape GPU-5 quant, which is also imatrix in my understanding. But I wanted an upgrade over UD-IQ4_XS and definitely got it!
  • I noticed that my TG performance seems to degrade over time by ~10% or more without changing the setup. I suspect suspending and then awakening the laptop repeatedly somehow hurts, but I haven't figured out the reason; it's not just memory pressure building up AFAICT. Rebooting the machine brings me the best performance, so I did that before benchmarking.
  • I haven't made any detailed quality measurements between the models. The ByteShape model seems very similar; possibly the thinking output is generally somewhat shorter than with Unsloth, but that could be a measurement error. I hope that someone does an independent comparison between ByteShape and other quants in terms of output quality, because their claims seem to be a bit too good to be true!

Notes

This post assembled from 100% biodegradable bytes. No AIs were harmed in the process.

submitted by /u/OsmanthusBloom