
BeeLlama v0.3.1 – latest llama.cpp with extras! DFlash, MTP, q6_0 cache, TurboQuant. Single RTX 3090: Qwen 3.6 27B & Gemma 4 31B up to 177.8 tps (4.93x over baseline)

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

BeeLlama v0.3.0 and v0.3.1 are here! This is a big architectural update that aligns the fork with upstream llama.cpp and integrates its recent additions, such as MTP and Gemma 4 12B support, while also updating DFlash to handle complex configurations such as multi-slot and multi-GPU.

Now also recommended by club-3090! Thanks to noonghunna for inviting Bee to the club and for their help with testing v0.3.0 on a multi-GPU setup.

Not quite a pegasus, but close enough.

GitHub | Qwen 3.6 27B Quick Start | Gemma 4 31B Quick Start

  • Updated to a much newer llama.cpp base: MTP, Gemma 4 12B, VRAM optimizations, unified llama app, backend improvements across CUDA, Metal, Vulkan, and more.
  • Prebuilt binaries and Docker images are now provided for all major platforms.
  • DFlash now works across multiple concurrent slots with shared drafter batching.
  • Adaptive draft depth got smarter: it seeds baselines, probes depths, backs off on failure, and resets per request.
  • Multi-GPU DFlash now works (and quite decently) after many fixes and improvements.
  • Faster speculative verification that fails safely on bad state.
  • Better tool-call and reasoning output handling: earlier streaming, stale KV state clearing, isolated deltas.
  • New cache and quantization options: q6_0 KV cache, TQ3_1S and TQ4_1S models.
  • ...and many more improvements!
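To make the adaptive draft depth bullet more concrete, here is a toy controller in Python that follows the same four ideas (seed a baseline, probe deeper, back off on failure, reset per request). It is only a sketch under assumptions: the class name, the halving rule, the one-step probe, and the default limits are all hypothetical and are not taken from the BeeLlama source.

```python
class AdaptiveDraftDepth:
    """Toy draft-depth controller (hypothetical, not BeeLlama's actual logic)."""

    def __init__(self, start=4, lo=1, hi=16):
        self.start, self.lo, self.hi = start, lo, hi
        self.reset()

    def reset(self):
        """Call at the start of every request so state never leaks across requests."""
        self.depth = self.start    # draft tokens to propose on the next step
        self.settled = self.start  # last depth that was known to work
        self.baseline = None       # accepted tokens per step at the settled depth

    def observe(self, accepted):
        """Record how many draft tokens the last step accepted; return the next depth."""
        if accepted == 0:
            # Failure: back off by halving and forget the old baseline.
            self.depth = max(self.lo, self.depth // 2)
            self.settled = self.depth
            self.baseline = None
        elif self.baseline is None or accepted > self.baseline:
            # Seed the baseline, or adopt a probe that paid off, then probe one deeper.
            self.baseline = accepted
            self.settled = self.depth
            self.depth = min(self.hi, self.depth + 1)
        else:
            # The probe did not beat the baseline: return to the settled depth.
            self.depth = self.settled
        return self.depth


ctl = AdaptiveDraftDepth()
for accepted in [4, 5, 3, 0]:
    print(ctl.observe(accepted))  # prints 5, 6, 5, 2 on successive lines
```

A real implementation would more likely compare throughput than raw accepted counts, since deeper drafts also cost more drafter time per step.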

Benchmarks

These were run back on BeeLlama v0.2.0, but neither engine has had major performance updates since then, other than MTP getting 5-10% faster. club-3090 ran benchmarks of their own on v0.3.0, including a multi-GPU setup, and ended up recommending Bee as the default.

  • Setup: Windows 11, AMD Ryzen 7 5700X3D, 32 GB DDR4 RAM, RTX 3090 24 GB
  • Config: same as in quick start docs, but with reasoning off for non-chat prompts
  • Baseline and MTP server used for comparison: llama.cpp b9275, CUDA 13.1 Windows prebuilt binary
  • The full text of the benchmark prompts is in README.md on GitHub

Qwen 3.6 27B

Target model: Qwen 3.6 27B Q5_K_S or Qwen 3.6 27B MTP Q5_K_S. DFlash model: Q4_K_M.

| Prompt | Server | Output | Median | Best | Speedup | Acceptance |
|---|---|---|---|---|---|---|
| Task store module | Baseline | ~1K tok | 37.2 tok/s | 37.2 tok/s | 1.00x | N/A |
| Task store module | DFlash | ~1K tok | 163.9 tok/s | 181.9 tok/s | 4.40x | 67.7% / 89.2% |
| Task store module | MTP | ~1K tok | 69.3 tok/s | 69.6 tok/s | 1.86x | 92.0% / 73.3% |
| KV report module | Baseline | ~1K tok | 34.6 tok/s | 36.5 tok/s | 1.00x | N/A |
| KV report module | DFlash | ~1K tok | 157.7 tok/s | 162.5 tok/s | 4.56x | 58.8% / 88.9% |
| KV report module | MTP | ~1K tok | 67.3 tok/s | 68.1 tok/s | 1.94x | 89.3% / 73.0% |
| Doubly-linked list | Baseline | ~4K tok | 36.8 tok/s | 36.9 tok/s | 1.00x | N/A |
| Doubly-linked list | DFlash | ~4K tok | 130.8 tok/s | 154.1 tok/s | 3.56x | 50.4% / 86.8% |
| Doubly-linked list | MTP | ~4K tok | 66.3 tok/s | 68.0 tok/s | 1.80x | 87.8% / 72.5% |
| Prompt processing | Baseline | ~20K tok | 1229.5 tok/s | 1229.5 tok/s | 1.00x | N/A |
| Prompt processing | DFlash | ~20K tok | 1214.4 tok/s | 1221.7 tok/s | 0.99x | N/A |
| Prompt processing | MTP | ~20K tok | 1162.6 tok/s | 1164.7 tok/s | 0.95x | N/A |
| Multi-turn coding | Baseline | ~28K tok | 33.3 tok/s | 33.3 tok/s | 1.00x | N/A |
| Multi-turn coding | DFlash | ~30K tok | 64.6 tok/s | 65.4 tok/s | 1.94x | 24.9% / 72.9% |
| Multi-turn coding | MTP | ~34K tok | 56.5 tok/s | 56.5 tok/s | 1.70x | 71.9% / 68.3% |

Acceptance: accepted to proposed draft tokens / accepted draft tokens to final generated tokens
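The two acceptance figures are plain ratios, and a short Python sketch shows how they could be computed and what they imply per verification step. The counter values below are hypothetical, chosen only to reproduce the 67.7% / 89.2% shape of a DFlash row. The per-step estimate additionally assumes the usual speculative decoding accounting, where every verification step emits exactly one token sampled by the target model plus the accepted draft tokens; the post does not state this, so treat the derived numbers as rough.

```python
def acceptance_ratios(proposed, accepted, generated):
    """Return (accepted / proposed, accepted / generated) as fractions."""
    return accepted / proposed, accepted / generated


def implied_per_step(accept_of_proposed, accept_of_generated):
    """Back out average per-step counts from the two published ratios.

    Assumption: generated = steps + accepted, i.e. one target-sampled
    token per verification step on top of the accepted draft tokens.
    """
    tokens = 1.0 / (1.0 - accept_of_generated)   # tokens emitted per step
    accepted = tokens - 1.0                      # accepted draft tokens per step
    proposed = accepted / accept_of_proposed     # draft tokens proposed per step
    return tokens, accepted, proposed


# Hypothetical counters, not measured data.
r1, r2 = acceptance_ratios(proposed=1000, accepted=677, generated=759)
print(f"{r1:.1%} / {r2:.1%}")  # prints 67.7% / 89.2%

# Ratios from the Qwen 3.6 27B "Task store module" rows.
for server, a, b in [("DFlash", 0.677, 0.892), ("MTP", 0.920, 0.733)]:
    tokens, accepted, proposed = implied_per_step(a, b)
    print(f"{server}: ~{tokens:.1f} tok/step, ~{proposed:.1f} proposed/step")
```

Under that assumption, the MTP row works out to roughly 3 proposed draft tokens per step, while the DFlash row implies a much deeper draft with a lower hit rate per proposed token but more tokens emitted per step overall.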

Gemma 4 31B

Target model: Gemma 4 31B Q4_K_S. DFlash model: Q5_K_M.

| Prompt | Server | Output | Median | Best | Speedup | Acceptance |
|---|---|---|---|---|---|---|
| Task store module | Baseline | ~1K tok | 36.1 tok/s | 36.1 tok/s | 1.00x | N/A |
| Task store module | DFlash | ~1K tok | 177.8 tok/s | 182.0 tok/s | 4.93x | 65.7% / 90.0% |
| KV report module | Baseline | ~1K tok | 35.9 tok/s | 36.0 tok/s | 1.00x | N/A |
| KV report module | DFlash | ~1K tok | 154.3 tok/s | 162.8 tok/s | 4.29x | 55.7% / 88.6% |
| Doubly-linked list | Baseline | ~1.9K tok | 36.0 tok/s | 36.0 tok/s | 1.00x | N/A |
| Doubly-linked list | DFlash | ~1.9K tok | 116.6 tok/s | 127.3 tok/s | 3.24x | 44.5% / 84.9% |
| Prompt processing | Baseline | ~24K tok | 1021.3 tok/s | 1021.3 tok/s | 1.00x | N/A |
| Prompt processing | DFlash | ~24K tok | 954.5 tok/s | 954.9 tok/s | 0.93x | N/A |
| Multi-turn coding | Baseline | ~12K tok | 34.8 tok/s | 34.8 tok/s | 1.00x | N/A |
| Multi-turn coding | DFlash | ~12K tok | 60.6 tok/s | 64.1 tok/s | 1.74x | 24.4% / 72.3% |

Acceptance: accepted to proposed draft tokens / accepted draft tokens to final generated tokens
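The Speedup column is not defined in the post, but it appears to be the server's median throughput divided by the baseline median for the same prompt. The Python sketch below recomputes it for a few rows copied from the tables above. The recomputed values match the published ones to within about 0.01, which would be consistent with the published figures having been derived from unrounded throughput; that explanation is a guess.

```python
# (model, prompt, server): (baseline median tok/s, server median tok/s, published speedup)
ROWS = {
    ("Qwen 3.6 27B", "Task store module", "DFlash"): (37.2, 163.9, 4.40),
    ("Qwen 3.6 27B", "Task store module", "MTP"): (37.2, 69.3, 1.86),
    ("Qwen 3.6 27B", "Prompt processing", "MTP"): (1229.5, 1162.6, 0.95),
    ("Gemma 4 31B", "Task store module", "DFlash"): (36.1, 177.8, 4.93),
    ("Gemma 4 31B", "KV report module", "DFlash"): (35.9, 154.3, 4.29),
    ("Gemma 4 31B", "Multi-turn coding", "DFlash"): (34.8, 60.6, 1.74),
}


def speedup(baseline_tps, server_tps):
    """Throughput ratio relative to the baseline server on the same prompt."""
    return server_tps / baseline_tps


for (model, prompt, server), (base, fast, published) in ROWS.items():
    ratio = speedup(base, fast)
    # Flag any row where the recomputed ratio drifts from the published one.
    status = "ok" if abs(ratio - published) <= 0.01 else "mismatch"
    print(f"{model:13} {prompt:18} {server:7} {ratio:.3f}x vs {published:.2f}x {status}")
```

Values below 1.00x on the prompt processing rows mean the speculative setup is slightly slower than baseline at ingesting the prompt, which is expected since drafting only helps during generation.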

submitted by /u/Anbeeld