Qwen3.6-35B-A3B tool calling benchmark: ByteShape vs. Unsloth GGUFs, KV cache quants & long context performance
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I've previously posted some small performance benchmarks, but this time I got interested in the qualitative side. u/Substantial_Step_351 posted a few days ago about why models are not benchmarked on tool calling, and u/complexminded pointed out the tool-eval-bench utility by SeraphimSerapis in a comment. This got me interested in benchmarking a few questions I've wondered about but don't recall seeing good answers to:
TL;DR: No clear winner in ByteShape vs. Unsloth; q8_0 is free lunch, but q4_0 is worse; long context significantly degrades tool calling performance across all scenarios.

Materials

I had temporary access to a mostly idle cluster of V100 GPUs with 32GB VRAM each, so I set out to do some experiments using llama.cpp and tool-eval-bench. First, I chose the following Qwen3.6-35B-A3B quants to compare, including both IQ and Q type quants:
I decided not to test quants from others because I'm mostly interested in ByteShape vs. the rest and Unsloth seems to be a common choice trusted by many.

To measure the effect of KV cache quantization, I decided on three configurations to test: default f16, q8_0/q8_0 and q4_0/q4_0. To limit the number of runs, I decided not to test asymmetric KV cache quants this time.

To measure performance on long vs. short context, I used the

I repeated each benchmark run three times using different random seeds. This gave a total of (8 GGUFs) x (3 KV quants) x (2 context lengths) x (3 repetitions) = 144 runs. The short context runs took only about 15 minutes, but the long context runs took around 4 hours each. Total time spent was thus around 300 GPU-hours, including some experimental and failed runs.

Software setup

To run the models, I used llama.cpp version 9529 (96fbe0039) built with CUDA support. For the tool use benchmarks, I used tool-eval-bench 2.0.4.

llama.cpp parameters:

tool-eval-bench parameters:

I did not spend much time optimizing or even measuring the PP/TG speeds, as I was only interested in the quality of output, not raw performance. I did not enable MTP or other speculative decoding for the same reason. The bottleneck in the very slow long context runs was mainly PP speed, so I did increase

Scoring metric

The metric I looked at is what tool-eval-bench reports as "total points". With

Results by GGUF

Here is an overview of the models, their sizes, overall scores as well as scores broken down by KV cache quant and separately by short vs. long context. See also the scatterplot diagram.
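As a sanity check on the experiment size described in the setup, here is a minimal Python sketch that enumerates the run grid and adds up the approximate GPU time. The GGUF labels and seed values are placeholders, and the per-run durations are the rough figures quoted above (15 minutes and 4 hours). The `--cache-type-k` and `--cache-type-v` flags are the llama.cpp options for setting the KV cache types; the exact command lines used for these runs are not shown in this post, so treat the argument list as an assumption.

```python
from itertools import product

# Placeholder labels: the eight GGUF file names are not reproduced here.
GGUFS = [f"gguf_{i}" for i in range(1, 9)]
KV_QUANTS = ["f16", "q8_0", "q4_0"]  # symmetric K/V cache types
CONTEXTS = ["short", "long"]
SEEDS = [1, 2, 3]  # hypothetical seed values

# Approximate duration of one run in hours, using the rough figures from the text.
HOURS_PER_RUN = {"short": 0.25, "long": 4.0}


def kv_cache_args(kv: str) -> list[str]:
    """llama.cpp flags that select the same cache type for K and V."""
    return ["--cache-type-k", kv, "--cache-type-v", kv]


def build_runs() -> list[dict]:
    """One entry per (GGUF, KV quant, context length, seed) combination."""
    return [
        {
            "gguf": gguf,
            "kv": kv,
            "context": context,
            "seed": seed,
            "server_args": kv_cache_args(kv),
        }
        for gguf, kv, context, seed in product(GGUFS, KV_QUANTS, CONTEXTS, SEEDS)
    ]


def total_gpu_hours(runs: list[dict]) -> float:
    """Sum the approximate per-run durations."""
    return sum(HOURS_PER_RUN[run["context"]] for run in runs)


runs = build_runs()
print(len(runs))              # 144
print(total_gpu_hours(runs))  # 306.0
```

The 306 hours from the rough per-run figures lines up with the "around 300 GPU-hours" total; the long context runs account for almost all of it.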
The overall best model is ByteShape GPU-5, which beats much larger models including Unsloth UD-Q4_K_M and UD-Q6_K when looking at average scores. It stands out especially for the good performance on long context tasks. ByteShape CPU-5 is the worst performer. Model size appears to only weakly correlate with benchmark scores; this could also indicate a noisy benchmark metric.

Results by KV cache quant

Here is a breakdown of the benchmark scores grouped by the KV cache quant used. First the overall score, then conditional scores by short vs. long context. See also the bar graph diagram.
The f16 and q8_0 KV cache quants are practically tied; their benchmark scores are so close that they are likely within the margin of error. However, f16 may have a slight advantage in the long context (cp=0.5) case. The q4_0 quant is behind the others by approximately 1 point.

Findings
Caveats

This benchmark relies entirely on the tool-eval-bench tasks and how the results are graded. It may or may not be representative of real tool use performance. To me it seems that the author of tool-eval-bench has done a great job in coming up with realistic looking tool call tasks, including some really hard ones enabled using

There was substantial variation and noise in the benchmark scores, including some surprising results where the smallest quants (both in GGUF files and KV cache) occasionally beat the largest ones and similar anomalies. Each individual measurement should be taken with a grain of salt; however, I think that the aggregate scores are still at least somewhat meaningful. I did my best to collect good benchmark numbers, but this benchmark is inherently very noisy and I only have limited resources for repeating benchmark runs.

Note: No AI was used for writing this post; it's all organic. I did use some AI assistance (the same Qwen3.6-35B-A3B!) in writing the benchmark scripts as well as for analyzing and plotting the results.
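For anyone repeating this kind of noisy benchmark, one rough way to judge whether two configurations really differ is to compare the gap between their mean scores against the spread across seeds. The sketch below is only an illustration of that idea, not the analysis used for this post: the function names are hypothetical, the threshold `k=2.0` is an arbitrary choice, and the scores are made up. With only three repetitions per configuration, a check like this is crude at best.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(scores: list[float]) -> tuple[float, float]:
    """Return (mean, standard error of the mean) for repeated runs."""
    m = mean(scores)
    # With a single run there is no spread estimate, so treat the error as unbounded.
    sem = stdev(scores) / sqrt(len(scores)) if len(scores) > 1 else float("inf")
    return m, sem


def clearly_different(a: list[float], b: list[float], k: float = 2.0) -> bool:
    """Crude check: do the means differ by more than k combined standard errors?"""
    mean_a, sem_a = summarize(a)
    mean_b, sem_b = summarize(b)
    return abs(mean_a - mean_b) > k * sqrt(sem_a**2 + sem_b**2)


# Made-up scores for illustration only, NOT results from this benchmark.
config_a = [80.0, 82.0, 84.0]
config_b = [81.0, 83.0, 82.0]
config_c = [70.0, 71.0, 72.0]

print(clearly_different(config_a, config_b))  # False
print(clearly_different(config_a, config_c))  # True
```

A difference that fails this check is not proven to be zero; it just cannot be separated from seed-to-seed noise with so few runs.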