Gemma 4 + LiteRT-LM on mobile: much better memory/perf than my llama.cpp setup
Hi r/LocalLLaMA - I've been paying close attention to the edge AI ecosystem because it's an area where I see huge potential and where I truly believe AI will become more useful for day-to-day tasks. Around the Gemma 4 release I was already experimenting with local AI, but the memory usage I was getting, even for the smaller Gemma 3 variants, was unacceptable. I have a Samsung flagship and I could feel the UX degrading; the OS was killing the app every now and then too (let's not talk about the phone getting hot).
Gemma 3 through llama.cpp (with a React Native bridge) had a footprint of around 4-5 GB on every inference, and keeping the model loaded but idle still held about 1 GB of memory until I released it, at which point memory went back to normal.
I was banging my head against the wall looking for a solution, and then Gemma 4 arrived. I tried it through the AI Edge Gallery and noticed two things:
- The speed difference between CPU and GPU is enormous
- How quickly the model loaded and replied; my phone kept working smoothly and memory jumps were barely noticeable.
This is when I learnt about LiteRT-LM and how optimized it is for edge AI.
I got it working, not without its quirks of course: I had to write native modules for both Android and iOS (through Objective-C, since they don't offer a Swift API yet!).
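For anyone curious, here's a rough sketch of the shape of the Android module (not my actual code). The com.facebook.react.bridge classes are the real React Native native-module API; LiteRtLmEngine, LlmBridgeModule and the loadModel/generate/release names are placeholders for illustration, and the actual LiteRT-LM calls are left as a TODO rather than reproducing its real API here.

```kotlin
// Minimal sketch of the Android side of a React Native bridge.
// The com.facebook.react.bridge classes are the standard RN native-module API;
// LiteRtLmEngine is a *hypothetical* stand-in for your own wrapper around the
// LiteRT-LM runtime, not its real API.
package com.example.llmbridge

import com.facebook.react.bridge.Promise
import com.facebook.react.bridge.ReactApplicationContext
import com.facebook.react.bridge.ReactContextBaseJavaModule
import com.facebook.react.bridge.ReactMethod
import java.util.concurrent.Executors

// Hypothetical wrapper interface; fill in with the actual LiteRT-LM calls you use.
interface LiteRtLmEngine : AutoCloseable {
    fun generate(prompt: String): String
    companion object {
        fun load(modelPath: String, preferGpu: Boolean): LiteRtLmEngine =
            TODO("wrap the real LiteRT-LM load/create call here")
    }
}

class LlmBridgeModule(context: ReactApplicationContext) :
    ReactContextBaseJavaModule(context) {

    // Single-threaded executor so inference never blocks the JS thread.
    private val executor = Executors.newSingleThreadExecutor()
    private var engine: LiteRtLmEngine? = null

    override fun getName() = "LlmBridge"

    @ReactMethod
    fun loadModel(modelPath: String, preferGpu: Boolean, promise: Promise) {
        executor.execute {
            try {
                engine = LiteRtLmEngine.load(modelPath, preferGpu)
                promise.resolve(true)
            } catch (e: Exception) {
                promise.reject("LOAD_FAILED", e)
            }
        }
    }

    @ReactMethod
    fun generate(prompt: String, promise: Promise) {
        executor.execute {
            val current = engine
            if (current == null) {
                promise.reject("NOT_LOADED", "Call loadModel first")
            } else {
                try {
                    promise.resolve(current.generate(prompt))
                } catch (e: Exception) {
                    promise.reject("GENERATION_FAILED", e)
                }
            }
        }
    }

    @ReactMethod
    fun release(promise: Promise) {
        executor.execute {
            engine?.close()   // frees the memory the runtime holds while idle
            engine = null
            promise.resolve(true)
        }
    }
}
```

The iOS module ends up with the same three methods, just written in Objective-C.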
I've not written anything to use the NPU yet, but GPU and CPU inference both work quite well. Memory footprint is around 1.5-2 GB. The oldest phone I've tried it on where it still runs well is an iPhone 13 Pro Max.
The only thing I don't love is that you have to release the model to recover memory, since it keeps its allocation even when idle. The startup cost isn't too bad after it has picked its preferred backend, but it could still be faster.
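Because of that, I treat the model like a cache in the app: load it lazily when a feature needs it and drop it when the app goes to background. A rough sketch of that pattern, where LlmHolder is a made-up wrapper around whatever load/generate/release calls your bridge exposes (ProcessLifecycleOwner and DefaultLifecycleObserver are real androidx.lifecycle APIs):

```kotlin
// Load lazily, release when backgrounded, so the idle allocation (~1.5-2 GB)
// doesn't get the app killed. LlmHolder is a hypothetical wrapper; only the
// androidx.lifecycle pieces are real APIs.
import androidx.lifecycle.DefaultLifecycleObserver
import androidx.lifecycle.LifecycleOwner
import androidx.lifecycle.ProcessLifecycleOwner

object LlmHolder {
    private var loaded = false

    @Synchronized fun ensureLoaded() {
        if (!loaded) {
            // placeholder: load the model and let the runtime pick its backend
            loaded = true
        }
    }

    @Synchronized fun generate(prompt: String): String {
        ensureLoaded()
        return "" // placeholder: delegate to the runtime's generate call
    }

    @Synchronized fun release() {
        if (loaded) {
            // placeholder: free the engine so memory goes back to normal
            loaded = false
        }
    }
}

class LlmLifecycleObserver : DefaultLifecycleObserver {
    // App moved to background: drop the idle allocation.
    override fun onStop(owner: LifecycleOwner) = LlmHolder.release()
}

// Register once, e.g. in Application.onCreate():
fun registerLlmLifecycle() {
    ProcessLifecycleOwner.get().lifecycle.addObserver(LlmLifecycleObserver())
}
```

The trade-off is paying the startup cost again on the next request, but that beats the OS killing the whole app for memory pressure.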
I have a strength-tracking mobile app, and this is how I use it right now:
- routine generation
- performance check for suggestions on exercises mid workout
- follow ups and suggestions after finishing workouts
Each inference call takes 2-4 seconds on GPU; add one or two more seconds on CPU.
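To give an idea of how one of those calls looks from the app side, here's a sketch of a mid-workout suggestion, reusing the hypothetical LlmHolder wrapper from above and running inference off the main thread since a call takes a few seconds even on GPU:

```kotlin
// Sketch of a mid-workout suggestion call, assuming the hypothetical
// LlmHolder wrapper above. Inference runs on a background dispatcher so the
// workout UI stays responsive during the 2-4 s (GPU) generation.
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext

suspend fun suggestNextSet(exercise: String, lastSets: List<String>): String =
    withContext(Dispatchers.Default) {
        val prompt = buildString {
            appendLine("You are a strength coach. Exercise: $exercise")
            appendLine("Recent sets: ${lastSets.joinToString()}")
            appendLine("Suggest weight and reps for the next set in one sentence.")
        }
        LlmHolder.generate(prompt)  // 2-4 s on GPU, a couple more on CPU
    }
```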
What I plan to do next:
- image recognition for exercises (Gemma has proven challenging for this feature, but perhaps with some good prompting you can get something going)
- on the spot workout generation
So far I've had a great experience with the model and the framework, and I hope they keep releasing updates, and smaller models too! :)
| Setup | Device | Backend | Model | Memory | Latency (full inference) |
|---|---|---|---|---|---|
| llama.cpp + RN bridge | Samsung S25 Ultra | CPU (couldn't get GPU working) | Gemma 3 1B IT | 4–5 GB peak | ~7–10 s |
| LiteRT-LM | Samsung S25 Ultra | GPU/CPU | Gemma 4 E2B IT | 1.5–2 GB | 2–4 s GPU (add 1–2 s on CPU) |
| LiteRT-LM | iPhone 13 Pro Max | CPU (haven't tested GPU due to Metal constraints) | Gemma 4 E2B IT | 1.5–2 GB | 3–6 s |