r/LocalLLaMA · June 2, 2026 · 3 min read

Using Gemma 4 E4B with the LiteRT engine - ~2.4x speedup over Q4 GGUF in text generation, image processing roughly the same

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

I know there is a PR in llama.cpp to support MTP for the 26b and 31b versions of Gemma 4, but as far as I can tell there is nothing yet for the E2B and E4B models.

Using Hermes Agent, I had it set up Gemma 4 E4B in Google's Lite RT format, and then write a Python wrapper around it to create an OpenAI compatible endpoint, and ran some speed tests, comparing the LiteRT model with the Unsloth/AtomicChat Q4M quant of E4B.

The tests were conducted by giving each model identical prompts and measuring the output speed. I also had each model caption 111 images in a folder (using the same script for both models).

Results:

Text Generation Speed

Prompt	LiteRT-LM 4B (MTP)	llama.cpp GGUF 4B	Speedup
Transfer learning	160.6 tok/s	66.3 tok/s	2.4×
Transformer architecture	148.2 tok/s	65.9 tok/s	2.2×
ML paradigms	162.7 tok/s	66.8 tok/s	2.4×
Average	157.2 tok/s	66.3 tok/s	2.4×

Image Captioning (111 images, full resolution)

Metric	LiteRT-LM 4B	llama.cpp GGUF 4B	Speedup
Per image	0.65s	0.72s	1.1×
Total	~72s	~80s	1.1×

Summary

For text generation, LiteRT-LM is 2.4× faster thanks to MTP (multi-token prediction). The MTP drafter predicts multiple tokens ahead and verifies them, effectively giving ~1.5-2× throughput on top of the already efficient LiteRT runtime.
For image captioning, the speed difference is only 11% because the bottleneck is the vision encoder, not the text decoder. MTP only helps with text generation, not image encoding.

Both models were tested 'warm' (aka loaded into memory prior to eliminate warm-up time).

This was done on a 4060ti 16gb, with only one model loaded into memory at a time.

Memory footprint between the two was basically the same.

Audio transcription also works, but it is CPU only.

I now have this model configured as my go-to in Hermes Agent for a number of roles (summarization, vision, title generation, etc). It's faster locally than using Gemma4 26b via API.

Notes: The Python wrapper doesn't have the full features of an OpenAI compatible endpoint yet - can't currently select parameters like temperature, etc. It runs at whatever the default LiteRT engine does. Also, responses do not stream, but come in one chunk. Dunno if that's how the LiteRT model works or if it's the way the Python wrapper handles it.

Disclaimer - the Python wrapper was completely vibe coded, using the stealth Owl-Alpha model on Openrouter, inside Hermes Agent. It also ran the tests and made the result chart and the summary below the chart. I

Conclusion: tokens go brr with LiteRT. I don't know if wrapping it in an OpenAI compatible endpoint is the best way to use it, but it makes it easy for me to drop it in my existing apps (including Hermes) as a typical OpenAI endpoint.

I've uploaded the Python server wrapper to Github here:

https://github.com/Madvulcan/litert-lm-server-wrapper

Further disclaimer: everything in that repo was AI created, including the readme, etc. Note the known limitations:

Deterministic output (no temperature sensitivity) This is a known limitation of the current LiteRT-LM engine for Gemma 4. The model produces identical responses regardless of temperature/top_p/seed settings. This is likely a .litertlm conversion issue (greedy decoding only).

Known Limitations

Single-session engine — Only one active conversation per engine instance. New sessions close previous ones.

Deterministic output — Temperature/top_p/seed are accepted but not honored by the C++ engine.

No batching — Each request is processed sequentially.

Linux only — Tested on Ubuntu 24.04 LTS. Windows/macOS not tested.

Maybe someone can build on it or use it for inspiration.

submitted by /u/AnticitizenPrime
[link] [comments]

Discussion (0)

No comments yet. Sign in and be the first to say something.

Discussion (0)

More from r/LocalLLaMA