r/LocalLLaMA · May 22, 2026

BeeLlama v0.2.0 – major DFlash update. Single RTX 3090: Qwen 3.6 27B up to 164 tps (4.40x), Gemma 4 31B up to 177.8 tps (4.93x). Prompt processing speed near baseline.


BeeLlama v0.2.0 is here!

Not quite a pegasus, but close enough.

GitHub | Qwen 3.6 27B Quick Start | Gemma 4 31B Quick Start

Full Gemma 4 31B support, with an efficient DFlash implementation and vision support.
Major Qwen 3.6 27B performance update from lower DFlash overhead, cleaner prefill handling, drafter K/V projection caching, and safer CUDA execution.
DFlash GGUFs with upstream architecture are now supported.
Fixes to adaptive profit behavior around baseline probing.
The reduced verifier path is now stricter, with a safer fallback to full logits when grammar, sampler state, or reasoning requires it.
Reasoning and tool-call boundaries were tightened.
Stricter draft/target validation and better draft-model discovery.
...and many more improvements!

Benchmarks

Setup: Windows 11, AMD Ryzen 7 5700X3D, 32 GB DDR4 RAM, RTX 3090 24 GB
Config: same as in quick start docs, but with reasoning off for non-chat prompts
Baseline and MTP servers used for comparison: llama.cpp b9275, CUDA 13.1 Windows prebuilt
The full text of the benchmark prompts is in README.md on GitHub

Qwen 3.6 27B

Prompt	Server	Output	Median	Best	Speedup	Acceptance
Task store module	Baseline	~1K tok	37.2 tok/s	37.2 tok/s	1.00x	N/A
Task store module	DFlash	~1K tok	163.9 tok/s	181.9 tok/s	4.40x	67.7% / 89.2%
Task store module	MTP	~1K tok	69.3 tok/s	69.6 tok/s	1.86x	92.0% / 73.3%
KV report module	Baseline	~1K tok	34.6 tok/s	36.5 tok/s	1.00x	N/A
KV report module	DFlash	~1K tok	157.7 tok/s	162.5 tok/s	4.56x	58.8% / 88.9%
KV report module	MTP	~1K tok	67.3 tok/s	68.1 tok/s	1.94x	89.3% / 73.0%
Doubly-linked list	Baseline	~4K tok	36.8 tok/s	36.9 tok/s	1.00x	N/A
Doubly-linked list	DFlash	~4K tok	130.8 tok/s	154.1 tok/s	3.56x	50.4% / 86.8%
Doubly-linked list	MTP	~4K tok	66.3 tok/s	68.0 tok/s	1.80x	87.8% / 72.5%
Prompt processing	Baseline	~20K tok	1229.5 tok/s	1229.5 tok/s	1.00x	N/A
Prompt processing	DFlash	~20K tok	1214.4 tok/s	1221.7 tok/s	0.99x	N/A
Prompt processing	MTP	~20K tok	1162.6 tok/s	1164.7 tok/s	0.95x	N/A
Multi-turn coding	Baseline	~28K tok	33.3 tok/s	33.3 tok/s	1.00x	N/A
Multi-turn coding	DFlash	~30K tok	64.6 tok/s	65.4 tok/s	1.94x	24.9% / 72.9%
Multi-turn coding	MTP	~34K tok	56.5 tok/s	56.5 tok/s	1.70x	71.9% / 68.3%

Acceptance: ratio of accepted to proposed draft tokens / share of the final generated tokens that came from accepted draft tokens
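For anyone who wants to reproduce the two Acceptance figures from their own logs, here is a minimal Python sketch of the arithmetic. The `DraftStats` fields and the counts are hypothetical and are not taken from BeeLlama's output. The `tokens_per_verify_step` helper additionally assumes the usual speculative-decoding pattern, where each verification step emits its accepted draft tokens plus exactly one token from the target model. The post does not confirm that this is how DFlash works.

```python
from dataclasses import dataclass


@dataclass
class DraftStats:
    proposed: int   # draft tokens proposed by the drafter (hypothetical field)
    accepted: int   # draft tokens the target model accepted (hypothetical field)
    generated: int  # total tokens in the final output (hypothetical field)


def acceptance(stats: DraftStats) -> tuple[float, float]:
    """Return (accepted / proposed, accepted / generated) as fractions."""
    return stats.accepted / stats.proposed, stats.accepted / stats.generated


def tokens_per_verify_step(accepted_share: float) -> float:
    # Assumption: every verification step emits its accepted draft tokens
    # plus exactly one token produced by the target model itself.
    return 1.0 / (1.0 - accepted_share)


# Made-up counts, chosen only to keep the arithmetic easy to follow.
stats = DraftStats(proposed=1200, accepted=900, generated=1000)
draft_rate, output_share = acceptance(stats)
print(f"{draft_rate:.1%} / {output_share:.1%}")  # 75.0% / 90.0%
print(f"{tokens_per_verify_step(output_share):.1f} tokens per verification step")
```

Under that assumption, a second figure near 90% would mean roughly ten tokens per target-model pass. A figure near 73% would mean a little under four.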

Gemma 4 31B

Target model: Gemma 4 31B Q4_K_S. DFlash model: Q5_K_M.

Prompt	Server	Output	Median	Best	Speedup	Acceptance
Task store module	Baseline	~1K tok	36.1 tok/s	36.1 tok/s	1.00x	N/A
Task store module	DFlash	~1K tok	177.8 tok/s	182.0 tok/s	4.93x	65.7% / 90.0%
KV report module	Baseline	~1K tok	35.9 tok/s	36.0 tok/s	1.00x	N/A
KV report module	DFlash	~1K tok	154.3 tok/s	162.8 tok/s	4.29x	55.7% / 88.6%
Doubly-linked list	Baseline	~1.9K tok	36.0 tok/s	36.0 tok/s	1.00x	N/A
Doubly-linked list	DFlash	~1.9K tok	116.6 tok/s	127.3 tok/s	3.24x	44.5% / 84.9%
Prompt processing	Baseline	~24K tok	1021.3 tok/s	1021.3 tok/s	1.00x	N/A
Prompt processing	DFlash	~24K tok	954.5 tok/s	954.9 tok/s	0.93x	N/A
Multi-turn coding	Baseline	~12K tok	34.8 tok/s	34.8 tok/s	1.00x	N/A
Multi-turn coding	DFlash	~12K tok	60.6 tok/s	64.1 tok/s	1.74x	24.4% / 72.3%

Acceptance: ratio of accepted to proposed draft tokens / share of the final generated tokens that came from accepted draft tokens
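The Speedup column appears to be the server's median tok/s divided by the baseline median for the same prompt. The post does not spell this out, so treat it as an assumption. A short Python sketch using the Gemma 4 31B medians from the table above (the `speedup` helper is just an illustrative name):

```python
def speedup(median_tps: float, baseline_median_tps: float) -> float:
    """Throughput relative to the baseline server on the same prompt."""
    return median_tps / baseline_median_tps


# (baseline median, DFlash median) in tok/s, copied from the Gemma 4 31B table.
gemma_medians = {
    "Task store module": (36.1, 177.8),
    "KV report module": (35.9, 154.3),
    "Doubly-linked list": (36.0, 116.6),
    "Prompt processing": (1021.3, 954.5),
    "Multi-turn coding": (34.8, 60.6),
}

for prompt, (baseline, dflash) in gemma_medians.items():
    print(f"{prompt}: {speedup(dflash, baseline):.2f}x")
```

Because the table lists medians rounded to one decimal, recomputing from them can differ from the published Speedup by 0.01 in the last digit.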
