r/LocalLLaMA · June 4, 2026 · 3 min read

BeeLlama v0.3.1 – latest llama.cpp with extras! DFlash, MTP, q6_0 cache, TurboQuant. Single RTX 3090: Qwen 3.6 27B & Gemma 4 31B up to 177.8 tps (4.93x over baseline)

BeeLlama v0.3.0 and v0.3.1 are here! This is a big architectural update that aligns the fork with upstream llama.cpp and integrates its additions, such as MTP and Gemma 4 12B support, while also updating DFlash to handle complex configurations like multi-slot and multi-GPU.

Now also recommended by club-3090! Thanks to noonghunna for inviting Bee to the club and for their help with testing v0.3.0 on a multi-GPU setup.

Not quite a pegasus, but close enough.

GitHub | Qwen 3.6 27B Quick Start | Gemma 4 31B Quick Start

Updated to a much newer llama.cpp base: MTP, Gemma 4 12B, VRAM optimizations, unified llama app, backend improvements across CUDA, Metal, Vulkan, and more.
Prebuilt binaries and Docker images are now provided for all major platforms.
DFlash now works across multiple concurrent slots with shared drafter batching.
Adaptive draft depth got smarter: it seeds baselines, probes depths, backs off on failure, and resets per request.
Multi-GPU DFlash now works (and quite decently) after many fixes and improvements.
Faster speculative verification that fails safely on bad state.
Better tool-call and reasoning output handling: earlier streaming, stale KV state clearing, isolated deltas.
New cache and quantization options: q6_0 KV cache, TQ3_1S and TQ4_1S models.
...and many more improvements!
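
For anyone wondering what the adaptive draft depth item could look like in practice, here is a minimal Python sketch of that loop: seed a baseline, probe candidate depths, back off when drafting does not pay off, reset per request. This is not BeeLlama's code. The class name, the candidate depths, and the use of tokens per second as the score are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AdaptiveDepth:
    """Hypothetical draft-depth controller (illustration only)."""
    candidates: tuple = (4, 8, 16)   # assumed depths to probe
    baseline_tps: float = 0.0        # throughput measured without drafting
    scores: dict = field(default_factory=dict)  # depth -> measured tok/s

    def reset(self):
        """Start of a new request: forget earlier measurements."""
        self.baseline_tps = 0.0
        self.scores.clear()

    def seed_baseline(self, tps):
        """Record the no-draft throughput to compare probes against."""
        self.baseline_tps = tps

    def report(self, depth, tps):
        """Store the throughput observed while drafting at `depth`."""
        self.scores[depth] = tps

    def next_depth(self):
        """Probe untested depths first, then settle on the best one.

        Returns 0 (drafting off) when no probed depth beats the baseline,
        which is this sketch's reading of "backs off on failure".
        """
        for d in self.candidates:
            if d not in self.scores:
                return d
        best = max(self.scores, key=self.scores.get)
        return best if self.scores[best] > self.baseline_tps else 0


ctl = AdaptiveDepth()
ctl.seed_baseline(36.0)
ctl.report(4, 90.0)
ctl.report(8, 120.0)
ctl.report(16, 100.0)
chosen = ctl.next_depth()  # depth 8 scored highest among the probes
```

The real implementation presumably reacts to per-step acceptance rather than whole-request throughput; the sketch only shows the shape of the control loop.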

Benchmarks

These benchmarks were run on BeeLlama v0.2.0; neither engine has had major performance updates since then, other than MTP becoming 5-10% faster. club-3090 ran benchmarks of their own on v0.3.0, including a multi-GPU setup, and ended up recommending Bee as the default.

Setup: Windows 11, AMD Ryzen 7 5700X3D, 32 GB DDR4 RAM, RTX 3090 24 GB
Config: same as in quick start docs, but with reasoning off for non-chat prompts
Baseline and MTP server in comparison: llama.cpp b9275 CUDA 13.1 Windows prebuilt
The full text of the benchmark prompts is in README.md on GitHub

Qwen 3.6 27B

Target model: Qwen 3.6 27B Q5_K_S or Qwen 3.6 27B MTP Q5_K_S. DFlash model: Q4_K_M.

Prompt	Server	Output	Median	Best	Speedup	Acceptance
Task store module	Baseline	~1K tok	37.2 tok/s	37.2 tok/s	1.00x	N/A
Task store module	DFlash	~1K tok	163.9 tok/s	181.9 tok/s	4.40x	67.7% / 89.2%
Task store module	MTP	~1K tok	69.3 tok/s	69.6 tok/s	1.86x	92.0% / 73.3%
KV report module	Baseline	~1K tok	34.6 tok/s	36.5 tok/s	1.00x	N/A
KV report module	DFlash	~1K tok	157.7 tok/s	162.5 tok/s	4.56x	58.8% / 88.9%
KV report module	MTP	~1K tok	67.3 tok/s	68.1 tok/s	1.94x	89.3% / 73.0%
Doubly-linked list	Baseline	~4K tok	36.8 tok/s	36.9 tok/s	1.00x	N/A
Doubly-linked list	DFlash	~4K tok	130.8 tok/s	154.1 tok/s	3.56x	50.4% / 86.8%
Doubly-linked list	MTP	~4K tok	66.3 tok/s	68.0 tok/s	1.80x	87.8% / 72.5%
Prompt processing	Baseline	~20K tok	1229.5 tok/s	1229.5 tok/s	1.00x	N/A
Prompt processing	DFlash	~20K tok	1214.4 tok/s	1221.7 tok/s	0.99x	N/A
Prompt processing	MTP	~20K tok	1162.6 tok/s	1164.7 tok/s	0.95x	N/A
Multi-turn coding	Baseline	~28K tok	33.3 tok/s	33.3 tok/s	1.00x	N/A
Multi-turn coding	DFlash	~30K tok	64.6 tok/s	65.4 tok/s	1.94x	24.9% / 72.9%
Multi-turn coding	MTP	~34K tok	56.5 tok/s	56.5 tok/s	1.70x	71.9% / 68.3%

Acceptance: accepted draft tokens as a share of proposed draft tokens / accepted draft tokens as a share of final generated tokens
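
To make the two Acceptance numbers concrete, here is a small Python sketch that computes both from raw counters. The counter values are made up for illustration and are not benchmark data, and the function names are mine. The tokens-per-pass helper additionally assumes that every verification pass contributes exactly one token from the target model on top of the accepted draft tokens, which is how speculative decoding usually works but is not stated in the post.

```python
def acceptance(proposed, accepted, generated):
    """Return (accepted / proposed, accepted / generated) as fractions."""
    draft_rate = accepted / proposed      # how often a drafted token survives
    output_share = accepted / generated   # how much of the output was drafted
    return draft_rate, output_share


def tokens_per_verify_pass(accepted, generated):
    """Average output tokens per target-model pass.

    Assumption: each pass yields the accepted draft tokens plus exactly
    one token produced by the target model itself.
    """
    passes = generated - accepted
    return generated / passes


# Illustrative counters only (not measured values).
rate, share = acceptance(proposed=1200, accepted=900, generated=1000)
per_pass = tokens_per_verify_pass(accepted=900, generated=1000)
print(f"{rate:.1%} / {share:.1%}, {per_pass:.1f} tokens per pass")
```

Under the same assumption, the second number is the one that tracks speedup most directly: a 73.3% share works out to roughly 3.7 tokens per pass, while an 89.2% share works out to roughly 9.3.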

Gemma 4 31B

Target model: Gemma 4 31B Q4_K_S. DFlash model: Q5_K_M.

Prompt	Server	Output	Median	Best	Speedup	Acceptance
Task store module	Baseline	~1K tok	36.1 tok/s	36.1 tok/s	1.00x	N/A
Task store module	DFlash	~1K tok	177.8 tok/s	182.0 tok/s	4.93x	65.7% / 90.0%
KV report module	Baseline	~1K tok	35.9 tok/s	36.0 tok/s	1.00x	N/A
KV report module	DFlash	~1K tok	154.3 tok/s	162.8 tok/s	4.29x	55.7% / 88.6%
Doubly-linked list	Baseline	~1.9K tok	36.0 tok/s	36.0 tok/s	1.00x	N/A
Doubly-linked list	DFlash	~1.9K tok	116.6 tok/s	127.3 tok/s	3.24x	44.5% / 84.9%
Prompt processing	Baseline	~24K tok	1021.3 tok/s	1021.3 tok/s	1.00x	N/A
Prompt processing	DFlash	~24K tok	954.5 tok/s	954.9 tok/s	0.93x	N/A
Multi-turn coding	Baseline	~12K tok	34.8 tok/s	34.8 tok/s	1.00x	N/A
Multi-turn coding	DFlash	~12K tok	60.6 tok/s	64.1 tok/s	1.74x	24.4% / 72.3%

Acceptance: accepted draft tokens as a share of proposed draft tokens / accepted draft tokens as a share of final generated tokens
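
The Speedup column appears to be the median throughput divided by the baseline median for the same prompt. Here is a quick Python check using medians copied from the Gemma 4 31B table above; the function name is mine, and because the inputs are already rounded to one decimal, the last digit can differ from the published value on some rows.

```python
def speedup(median_tps, baseline_median_tps):
    """Ratio of median throughput to the baseline median, to 2 decimals."""
    return round(median_tps / baseline_median_tps, 2)


# (DFlash median, baseline median) in tok/s, copied from the table.
gemma_rows = {
    "Task store module": (177.8, 36.1),
    "Doubly-linked list": (116.6, 36.0),
    "Prompt processing": (954.5, 1021.3),
    "Multi-turn coding": (60.6, 34.8),
}

for prompt, (dflash, baseline) in gemma_rows.items():
    print(f"{prompt}: {speedup(dflash, baseline):.2f}x")
```

The first row is the 4.93x headline figure; prompt processing lands below 1.00x because drafting only helps during generation.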

submitted by /u/Anbeeld