r/LocalLLaMA · · 7 min read

Qwen3.6-35B-A3B tool calling benchmark: ByteShape vs. Unsloth GGUFs, KV cache quants & long context performance

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.


I've previously posted some small performance benchmarks, but this time I got interested in the qualitative side. u/Substantial_Step_351 posted a few days ago about why models are not benchmarked on tool calling, and u/complexminded pointed out the tool-eval-bench utility by SeraphimSerapis in a comment. That prompted me to benchmark a few questions I've wondered about and haven't seen good answers to:

  1. Are the ByteShape quants of Qwen3.6-35B-A3B as good as they claim in their blog post? Their benchmark shows that their ~4bpw quants retain >99% of the benchmark scores of unquantized models, matching or exceeding other quants such as Unsloth, AesSedai and bartowski, while being faster and usually smaller.
  2. How does KV cache quantization affect real world performance? Is q8_0 free lunch? How much worse is q4_0?
  3. Does the picture change if we look at long context settings instead of short prompts?

TL;DR: No clear winner in ByteShape vs. Unsloth; q8_0 KV cache is practically a free lunch, while q4_0 is slightly worse (about 1 point on average); long context significantly degrades tool calling performance across all scenarios.

Materials

I had temporary access to a mostly idle cluster of V100 GPUs with 32GB VRAM each, so I set out to do some experiments using llama.cpp and tool-eval-bench. First, I chose the following Qwen3.6-35B-A3B quants to compare, including both IQ and Q type quants:

  1. ByteShape IQ3_S-3.48bpw a.k.a. GPU-3 (15.1 GB), the one ByteShape recommends for 16GB VRAM (it just barely fits)
  2. ByteShape IQ4_XS-4.15bpw a.k.a. GPU-5 (18.0 GB), the one ByteShape recommends for 24GB VRAM
  3. ByteShape Q4_K_S-4.22bpw a.k.a. CPU-5 (18.3 GB), the one I use on my 6GB VRAM laptop, partially on CPU
  4. Unsloth UD-IQ3_XXS (13.2 GB), very compact IQ quant, fits into 16GB VRAM, punches above its weight in some benchmarks
  5. Unsloth UD-Q3_K_XL (16.8 GB), a Q quant similar in size to ByteShape CPU-5
  6. Unsloth UD-IQ4_XS (17.7 GB), an IQ quant similar in size to ByteShape GPU-5
  7. Unsloth UD-Q4_K_M (22.1 GB), the default quant size for many
  8. Unsloth UD-Q6_K (29.3 GB), the largest I could fit into 32GB VRAM

I decided not to test quants from others because I'm mostly interested in ByteShape vs. the rest and Unsloth seems to be a common choice trusted by many.

To measure the effect of KV cache quantization, I decided on three configurations to test: default f16, q8_0/q8_0 and q4_0/q4_0. To limit the number of runs, I decided not to test asymmetric KV cache quants this time.

To measure performance on long vs. short context, I used the --context-pressure parameter of tool-eval-bench (later abbreviated cp), setting it to either 0.0 or 0.5. A value of 0.0 means short context (a system prompt of approximately 5k tokens containing the tool call definitions), while 0.5 means that the prompt includes an additional 122k tokens of text that could confuse the model. This simulates how the model behaves when the context window is already 50% filled with conversation and tool call history.

I repeated each benchmark run three times using different random seeds. This gave a total of (8 GGUFs) x (3 KV quants) x (2 context lengths) x (3 repetitions) = 144 runs. The short context runs took only about 15 minutes, but the long context runs took around 4 hours each. Total time spent was thus around 300 GPU-hours, including some experimental and failed runs.
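The run count and the rough time budget above can be reproduced with a short sketch. The GGUF labels and per-run durations come from the post; the seed values are placeholders, since the actual seeds are not listed.

```python
from itertools import product

# Labels taken from the post.
ggufs = [
    "ByteShape GPU-3", "ByteShape GPU-5", "ByteShape CPU-5",
    "Unsloth UD-IQ3_XXS", "Unsloth UD-Q3_K_XL", "Unsloth UD-IQ4_XS",
    "Unsloth UD-Q4_K_M", "Unsloth UD-Q6_K",
]
kv_quants = ["f16", "q8_0", "q4_0"]
context_pressures = [0.0, 0.5]
seeds = [1, 2, 3]  # hypothetical seed values, the post does not state them

# Full factorial design: every combination is one benchmark run.
runs = list(product(ggufs, kv_quants, context_pressures, seeds))

# Approximate durations quoted in the post: ~15 min short, ~4 h long context.
hours_per_run = {0.0: 0.25, 0.5: 4.0}
total_hours = sum(hours_per_run[cp] for _, _, cp, _ in runs)

print(len(runs))     # number of benchmark runs
print(total_hours)   # rough GPU-hours, excluding failed or experimental runs
```

The long context runs dominate the budget almost entirely, which is why adding more repetitions is expensive.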

Software setup

To run the models, I used llama.cpp version 9529 (96fbe0039) built with CUDA support. For the tool use benchmarks, I used tool-eval-bench 2.0.4.

llama.cpp parameters: -m $GGUF --temperature 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 -ngl 99 --ubatch-size 2048 --fit-target 256 -ctk $KV_QUANT -ctv $KV_QUANT --port $PORT

tool-eval-bench parameters: --base-url $BASE_URL --hardmode --weight-by-difficulty --backend llamacpp --context-size 262144 --context-pressure $CONTEXT_PRESSURE --seed $SEED
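For anyone scripting a similar sweep, here is a minimal sketch of how the two parameter lists above could be assembled per run. The flags are copied verbatim from the post; the executable names (llama-server, tool-eval-bench), the model path, the port and the base URL are assumptions for illustration, not the author's actual script.

```python
import shlex

def llama_server_cmd(gguf_path: str, kv_quant: str, port: int) -> list:
    """Build the server command line from the flags quoted in the post."""
    return [
        "llama-server",  # assumption: llama.cpp server binary is on PATH
        "-m", gguf_path,
        "--temperature", "0.6", "--top-p", "0.95", "--top-k", "20",
        "--min-p", "0.0", "--presence-penalty", "0.0", "--repeat-penalty", "1.0",
        "-ngl", "99", "--ubatch-size", "2048", "--fit-target", "256",
        "-ctk", kv_quant, "-ctv", kv_quant,  # same quant for K and V (symmetric)
        "--port", str(port),
    ]

def bench_cmd(base_url: str, context_pressure: float, seed: int) -> list:
    """Build the benchmark command line from the flags quoted in the post."""
    return [
        "tool-eval-bench",  # assumption: name of the installed executable
        "--base-url", base_url,
        "--hardmode", "--weight-by-difficulty",
        "--backend", "llamacpp",
        "--context-size", "262144",
        "--context-pressure", str(context_pressure),
        "--seed", str(seed),
    ]

server = llama_server_cmd("model.gguf", "q8_0", 8080)  # hypothetical path and port
bench = bench_cmd("http://127.0.0.1:8080", 0.5, 1)     # hypothetical URL and seed

print(shlex.join(server))
print(shlex.join(bench))
```

Building the commands as lists and only joining them for display avoids shell quoting problems if a model path contains spaces.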

I did not spend much time optimizing or even measuring the PP/TG speeds, as I was only interested in the quality of output, not raw performance. I did not enable MTP or other speculative decoding for the same reason. The bottleneck in the very slow long context runs was mainly PP speed, so I did increase --ubatch-size to 2048, which seemed to help a bit.

Scoring metric

The metric I looked at is what tool-eval-bench reports as "total points". With --hardmode enabled, this version of tool-eval-bench performs 84 separate tests. Each test gives 2 points for a successful tool use, 1 point for a partially correct tool use, and 0 for failure. The theoretical maximum in this case is 84 * 2 = 168 points. tool-eval-bench also returns an overall score, but this is just a rounded percentage of total points and the rounding loses some precision, so I opted for the raw total points instead. I couldn't figure out what the --weight-by-difficulty option is doing; it didn't seem to have any effect on scores.
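To make the scoring concrete, here is a small sketch of the point scheme and of why a rounded percentage loses precision. The outcome labels are hypothetical, and rounding to the nearest whole percent is an assumption about how the overall score is derived.

```python
MAX_POINTS = 84 * 2  # 84 hardmode tests, 2 points each

# Hypothetical outcome labels; the point values are the ones described in the post.
POINTS = {"success": 2, "partial": 1, "failure": 0}

def total_points(outcomes):
    """Sum the points over a list of per-test outcomes."""
    return sum(POINTS[outcome] for outcome in outcomes)

def rounded_percent(total):
    """Assumed overall score: percentage of the maximum, rounded to an integer."""
    return round(100 * total / MAX_POINTS)

# Two different raw totals that collapse to the same rounded percentage.
for total in (144, 145):
    print(total, rounded_percent(total))
```

Since most averages in this benchmark differ by only a point or two, a whole-percent score would hide exactly the differences being compared.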

Results by GGUF

Here is an overview of the models, their sizes, overall scores as well as scores broken down by KV cache quant and separately by short vs. long context. See also the scatterplot diagram.

| Model | Size (GB) | Avg overall | KV f16 | KV q8_0 | KV q4_0 | cp=0.0 | cp=0.5 |
|---|---|---|---|---|---|---|---|
| Unsloth UD-IQ3_XXS | 13.2 | 143.6 | 142.2 | 143.2 | 145.5 | 150.7 | 136.6 |
| ByteShape GPU-3 | 15.1 | 144.5 | 147.0 | 144.5 | 142.0 | 149.7 | 139.3 |
| Unsloth UD-Q3_K_XL | 16.8 | 143.8 | 145.0 | 143.7 | 142.8 | 147.3 | 140.3 |
| Unsloth UD-IQ4_XS | 17.7 | 144.8 | 143.0 | 146.8 | 144.5 | 149.7 | 139.9 |
| ByteShape GPU-5 | 18.0 | 146.8 | 147.8 | 147.3 | 145.3 | 149.0 | 144.7 |
| ByteShape CPU-5 | 18.3 | 142.2 | 143.0 | 141.5 | 142.0 | 145.4 | 138.9 |
| Unsloth UD-Q4_K_M | 22.1 | 144.4 | 143.0 | 143.7 | 146.5 | 148.3 | 140.4 |
| Unsloth UD-Q6_K | 29.3 | 145.2 | 147.7 | 146.7 | 141.2 | 150.7 | 139.7 |

The overall best model is ByteShape GPU-5, which beats much larger models including Unsloth UD-Q4_K_M and UD-Q6_K when looking at average scores. It stands out especially for the good performance on long context tasks. ByteShape CPU-5 is the worst performer. Model size appears to only weakly correlate with benchmark scores; this could also indicate a noisy benchmark metric.
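The weak relationship between size and score can be checked with a quick Pearson correlation over the table values. This is only a sketch on the eight averaged rows, not on the raw per-seed runs, so it should be read as a rough indication.

```python
import math

# (size in GB, average total points) copied from the table above.
results = {
    "Unsloth UD-IQ3_XXS": (13.2, 143.6),
    "ByteShape GPU-3": (15.1, 144.5),
    "Unsloth UD-Q3_K_XL": (16.8, 143.8),
    "Unsloth UD-IQ4_XS": (17.7, 144.8),
    "ByteShape GPU-5": (18.0, 146.8),
    "ByteShape CPU-5": (18.3, 142.2),
    "Unsloth UD-Q4_K_M": (22.1, 144.4),
    "Unsloth UD-Q6_K": (29.3, 145.2),
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

sizes = [size for size, _ in results.values()]
scores = [score for _, score in results.values()]
r = pearson(sizes, scores)

best = max(results, key=lambda name: results[name][1])
worst = min(results, key=lambda name: results[name][1])
print(f"r = {r:.2f}, best = {best}, worst = {worst}")
```

With only eight data points, a coefficient of this kind carries wide uncertainty, which fits the suspicion that the metric itself is noisy.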

Results by KV cache quant

Here is a breakdown of the benchmark scores grouped by the KV cache quant used. First the overall score, then conditional scores by short vs. long context. See also the bar graph diagram.

| KV quant | Avg overall | cp=0.0 | cp=0.5 |
|---|---|---|---|
| f16 | 144.8 | 149.2 | 140.5 |
| q8_0 | 144.7 | 149.2 | 140.1 |
| q4_0 | 143.7 | 148.1 | 139.3 |

The f16 and q8_0 KV cache quants are practically tied; their benchmark scores are so close that they are likely within the margin of error. However, f16 may have a slight advantage in the long context (cp=0.5) case. The q4_0 quant is behind the others by approximately 1 point.
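One way to put these gaps in perspective is a paired comparison per GGUF, using the per-model KV columns from the results-by-GGUF table. This is a rough sketch: it works on averaged table values rather than raw per-seed scores, so the spread across GGUFs is only a proxy for the real noise level.

```python
from statistics import mean, stdev

# Average total points per GGUF, in table order (UD-IQ3_XXS ... UD-Q6_K).
f16  = [142.2, 147.0, 145.0, 143.0, 147.8, 143.0, 143.0, 147.7]
q8_0 = [143.2, 144.5, 143.7, 146.8, 147.3, 141.5, 143.7, 146.7]
q4_0 = [145.5, 142.0, 142.8, 144.5, 145.3, 142.0, 146.5, 141.2]

def paired_diffs(baseline, other):
    """Per-model score difference (baseline minus other)."""
    return [b - o for b, o in zip(baseline, other)]

diff_q8 = paired_diffs(f16, q8_0)
diff_q4 = paired_diffs(f16, q4_0)

for label, diffs in (("f16 - q8_0", diff_q8), ("f16 - q4_0", diff_q4)):
    print(f"{label}: mean = {mean(diffs):.2f}, stdev = {stdev(diffs):.2f}")
```

The per-model differences change sign from one GGUF to the next, which is consistent with the noise discussed in the caveats.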

Findings

  • It is not clear whether ByteShape or Unsloth quants are better. ByteShape had both the best (GPU-5) and worst (CPU-5) performing quants.
  • f16 and q8_0 KV cache quants are practically tied, so q8_0 could be seen as free lunch. Using q4_0 has a surprisingly small effect, but it is there.
  • Long context hurts performance considerably, with an average gap of about 9 points between the cp=0.0 and cp=0.5 cases. The ByteShape GPU-5 quant was more resilient than the others under long context pressure.
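The long context finding can be recomputed from the cp=0.0 and cp=0.5 columns of the results-by-GGUF table. A minimal sketch, again using the averaged table values rather than raw runs:

```python
# (cp=0.0 score, cp=0.5 score) per GGUF, copied from the results table.
by_context = {
    "Unsloth UD-IQ3_XXS": (150.7, 136.6),
    "ByteShape GPU-3": (149.7, 139.3),
    "Unsloth UD-Q3_K_XL": (147.3, 140.3),
    "Unsloth UD-IQ4_XS": (149.7, 139.9),
    "ByteShape GPU-5": (149.0, 144.7),
    "ByteShape CPU-5": (145.4, 138.9),
    "Unsloth UD-Q4_K_M": (148.3, 140.4),
    "Unsloth UD-Q6_K": (150.7, 139.7),
}

# Points lost when moving from short to long context, per model.
drops = {name: short - long for name, (short, long) in by_context.items()}
mean_drop = sum(drops.values()) / len(drops)
most_resilient = min(drops, key=drops.get)

for name, drop in sorted(drops.items(), key=lambda item: item[1]):
    print(f"{name:20s} {drop:5.1f}")
print(f"mean drop = {mean_drop:.1f}, most resilient = {most_resilient}")
```

Sorting by drop makes it easy to see which quants hold up best as the context fills.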

Caveats

This benchmark relies entirely on the tool-eval-bench tasks and how the results are graded. It may or may not be representative of real tool use performance. To me it seems that the author of tool-eval-bench has done a great job in coming up with realistic looking tool call tasks, including some really hard ones enabled using --hardmode. For the long context runs, I relied on the --context-pressure setting in tool-eval-bench, which (in my limited understanding) populates the context with realistic looking conversation and tool call history that could confuse the model.

There was substantial variation and noise in the benchmark scores, including some surprising results where the smallest quants (both in GGUF files and KV cache) occasionally beat the largest ones and similar anomalies. Each individual measurement should be taken with a grain of salt; however, I think that the aggregate scores are still at least somewhat meaningful. I did my best to collect good benchmark numbers, but this benchmark is inherently very noisy and I only have limited resources for repeating benchmark runs.

Note: No AI was used for writing this post, it's all organic, though I did use some AI assistance (the same Qwen3.6-35B-A3B!) in writing the benchmark scripts as well as for analyzing and plotting the results.

submitted by /u/OsmanthusBloom