BeeLlama v0.3.1 – latest llama.cpp with extras! DFlash, MTP, q6_0 cache, TurboQuant. Single RTX 3090: Qwen 3.6 27B & Gemma 4 31B up to 177.8 tps (4.93x over baseline)
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
BeeLlama v0.3.0 and v0.3.1 are here! This is a big architectural update that aligns the fork with upstream llama.cpp and integrates its additions, such as MTP and Gemma 4 12B support, while also updating DFlash to handle complex configurations like multi-slot and multi-GPU.
Now also recommended by club-3090! Thanks to noonghunna for inviting Bee to the club and for their help with testing v0.3.0 on a multi-GPU setup.
Not quite a pegasus, but close enough.
GitHub | Qwen 3.6 27B Quick Start | Gemma 4 31B Quick Start
- Updated to a much newer llama.cpp base: MTP, Gemma 4 12B, VRAM optimizations, unified llama app, backend improvements across CUDA, Metal, Vulkan, and more.
- Prebuilt binaries and Docker images are now provided for all major platforms.
- DFlash now works across multiple concurrent slots with shared drafter batching.
- Adaptive draft depth got smarter: it seeds baselines, probes depths, backs off on failure, and resets per request.
- Multi-GPU DFlash now works (and quite decently) after many fixes and improvements.
- Faster speculative verification that fails safely on bad state.
- Better tool-call and reasoning output handling: earlier streaming, stale KV state clearing, isolated deltas.
- New cache and quantization options: q6_0 KV cache, TQ3_1S and TQ4_1S models.
- ...and many more improvements!
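The adaptive draft depth item in the list above describes a probe, back-off, and per-request reset loop. As a rough illustration of that general idea only, here is a minimal Python sketch. It is not BeeLlama's implementation: the class name, the thresholds, and the depth limits are all hypothetical, and the baseline-seeding step is left out.

```python
class AdaptiveDepth:
    """Hypothetical draft-depth controller (illustration only, not BeeLlama code)."""

    def __init__(self, min_depth: int = 1, max_depth: int = 16, start_depth: int = 4):
        self.min_depth = min_depth
        self.max_depth = max_depth
        self.start_depth = start_depth
        self.depth = start_depth

    def reset(self) -> None:
        """Return to the starting depth, e.g. at the beginning of a new request."""
        self.depth = self.start_depth

    def update(self, accepted: int) -> int:
        """Adjust depth after one verification step.

        `accepted` is how many of the `self.depth` proposed draft tokens
        the target model accepted in that step.
        """
        if accepted == self.depth:
            # Every draft token was accepted: probe one step deeper.
            self.depth = min(self.max_depth, self.depth + 1)
        elif accepted < self.depth // 2:
            # Fewer than half were accepted: back off by halving the depth.
            self.depth = max(self.min_depth, self.depth // 2)
        return self.depth


if __name__ == "__main__":
    ctl = AdaptiveDepth()
    for accepted in (4, 5, 1):
        print("accepted", accepted, "-> next depth", ctl.update(accepted))
    ctl.reset()
    print("after reset:", ctl.depth)
```

The asymmetry (grow by one, shrink by half) is just one common way to probe cautiously and retreat quickly; the real heuristics may differ.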
Benchmarks
These were run back on BeeLlama v0.2.0, but neither engine has had a major performance update since then, other than MTP getting 5-10% faster. club-3090 ran benchmarks of their own on v0.3.0, including a multi-GPU setup, and ended up recommending Bee as the default.
- Setup: Windows 11, AMD Ryzen 7 5700X3D, 32 GB DDR4 RAM, RTX 3090 24 GB
- Config: same as in quick start docs, but with reasoning off for non-chat prompts
- Baseline and MTP servers used for comparison: llama.cpp b9275, CUDA 13.1 Windows prebuilt
- The full text of the benchmark prompts is in README.md on GitHub
Qwen 3.6 27B
Target model: Qwen 3.6 27B Q5_K_S or Qwen 3.6 27B MTP Q5_K_S. DFlash model: Q4_K_M.
| Prompt | Server | Output | Median | Best | Speedup | Acceptance |
|---|---|---|---|---|---|---|
| Task store module | Baseline | ~1K tok | 37.2 tok/s | 37.2 tok/s | 1.00x | N/A |
| Task store module | DFlash | ~1K tok | 163.9 tok/s | 181.9 tok/s | 4.40x | 67.7% / 89.2% |
| Task store module | MTP | ~1K tok | 69.3 tok/s | 69.6 tok/s | 1.86x | 92.0% / 73.3% |
| KV report module | Baseline | ~1K tok | 34.6 tok/s | 36.5 tok/s | 1.00x | N/A |
| KV report module | DFlash | ~1K tok | 157.7 tok/s | 162.5 tok/s | 4.56x | 58.8% / 88.9% |
| KV report module | MTP | ~1K tok | 67.3 tok/s | 68.1 tok/s | 1.94x | 89.3% / 73.0% |
| Doubly-linked list | Baseline | ~4K tok | 36.8 tok/s | 36.9 tok/s | 1.00x | N/A |
| Doubly-linked list | DFlash | ~4K tok | 130.8 tok/s | 154.1 tok/s | 3.56x | 50.4% / 86.8% |
| Doubly-linked list | MTP | ~4K tok | 66.3 tok/s | 68.0 tok/s | 1.80x | 87.8% / 72.5% |
| Prompt processing | Baseline | ~20K tok | 1229.5 tok/s | 1229.5 tok/s | 1.00x | N/A |
| Prompt processing | DFlash | ~20K tok | 1214.4 tok/s | 1221.7 tok/s | 0.99x | N/A |
| Prompt processing | MTP | ~20K tok | 1162.6 tok/s | 1164.7 tok/s | 0.95x | N/A |
| Multi-turn coding | Baseline | ~28K tok | 33.3 tok/s | 33.3 tok/s | 1.00x | N/A |
| Multi-turn coding | DFlash | ~30K tok | 64.6 tok/s | 65.4 tok/s | 1.94x | 24.9% / 72.9% |
| Multi-turn coding | MTP | ~34K tok | 56.5 tok/s | 56.5 tok/s | 1.70x | 71.9% / 68.3% |
Acceptance: accepted draft tokens as a share of proposed draft tokens / accepted draft tokens as a share of final generated tokens
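To make the two Acceptance numbers concrete, here is a small Python sketch of how such ratios can be computed from token counters. The function name and the counts in the example are made up for illustration; they are not taken from the benchmark runs.

```python
def acceptance_rates(proposed: int, accepted: int, generated: int) -> tuple[float, float]:
    """Return (accepted / proposed, accepted / generated) as fractions.

    proposed:  draft tokens the drafter suggested
    accepted:  draft tokens the target model accepted
    generated: all tokens in the final output
    """
    if proposed <= 0 or generated <= 0:
        raise ValueError("proposed and generated must be positive")
    if not 0 <= accepted <= min(proposed, generated):
        raise ValueError("accepted must be between 0 and min(proposed, generated)")
    return accepted / proposed, accepted / generated


if __name__ == "__main__":
    # Illustrative counts only, not measured data.
    draft_rate, output_share = acceptance_rates(proposed=200, accepted=120, generated=150)
    print(f"{draft_rate:.1%} / {output_share:.1%}")
```

The first ratio says how often the drafter guesses right; the second says how much of the output came from accepted drafts. The second typically stays below 100% because the target model usually emits at least one token of its own per verification step.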
Gemma 4 31B
Target model: Gemma 4 31B Q4_K_S. DFlash model: Q5_K_M.
| Prompt | Server | Output | Median | Best | Speedup | Acceptance |
|---|---|---|---|---|---|---|
| Task store module | Baseline | ~1K tok | 36.1 tok/s | 36.1 tok/s | 1.00x | N/A |
| Task store module | DFlash | ~1K tok | 177.8 tok/s | 182.0 tok/s | 4.93x | 65.7% / 90.0% |
| KV report module | Baseline | ~1K tok | 35.9 tok/s | 36.0 tok/s | 1.00x | N/A |
| KV report module | DFlash | ~1K tok | 154.3 tok/s | 162.8 tok/s | 4.29x | 55.7% / 88.6% |
| Doubly-linked list | Baseline | ~1.9K tok | 36.0 tok/s | 36.0 tok/s | 1.00x | N/A |
| Doubly-linked list | DFlash | ~1.9K tok | 116.6 tok/s | 127.3 tok/s | 3.24x | 44.5% / 84.9% |
| Prompt processing | Baseline | ~24K tok | 1021.3 tok/s | 1021.3 tok/s | 1.00x | N/A |
| Prompt processing | DFlash | ~24K tok | 954.5 tok/s | 954.9 tok/s | 0.93x | N/A |
| Multi-turn coding | Baseline | ~12K tok | 34.8 tok/s | 34.8 tok/s | 1.00x | N/A |
| Multi-turn coding | DFlash | ~12K tok | 60.6 tok/s | 64.1 tok/s | 1.74x | 24.4% / 72.3% |
Acceptance: accepted draft tokens as a share of proposed draft tokens / accepted draft tokens as a share of final generated tokens
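The Speedup column appears to be the median throughput divided by the baseline median for the same prompt. The post does not state this explicitly, so treat it as an assumption. The sketch below recomputes it for the Gemma 4 31B DFlash rows using the medians from the table; small differences in the last digit are expected because the table values are already rounded.

```python
# (prompt, baseline median tok/s, DFlash median tok/s, reported speedup)
GEMMA_ROWS = [
    ("Task store module", 36.1, 177.8, 4.93),
    ("KV report module", 35.9, 154.3, 4.29),
    ("Doubly-linked list", 36.0, 116.6, 3.24),
    ("Prompt processing", 1021.3, 954.5, 0.93),
    ("Multi-turn coding", 34.8, 60.6, 1.74),
]


def speedup(baseline_tps: float, candidate_tps: float) -> float:
    """Ratio of candidate throughput to baseline throughput."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return candidate_tps / baseline_tps


def max_deviation(rows) -> float:
    """Largest absolute gap between recomputed and reported speedup."""
    return max(abs(speedup(base, cand) - reported) for _, base, cand, reported in rows)


if __name__ == "__main__":
    for name, base, cand, reported in GEMMA_ROWS:
        print(f"{name:20s} recomputed {speedup(base, cand):.2f}x, reported {reported:.2f}x")
    print(f"max deviation: {max_deviation(GEMMA_ROWS):.3f}")
```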