r/LocalLLaMA · 3 min read

Gemma 4 + LiteRT-LM on mobile: much better memory/perf than my llama.cpp setup


Hi r/LocalLLaMA - I've been paying close attention to the edge AI ecosystem because it's an area where I see huge potential and where I truly believe AI will become more useful for day-to-day tasks. Around the Gemma 4 release I was already experimenting with local AI, but the memory usage I was getting even for the smaller variants of Gemma 3 was unacceptable. I have a flagship from Samsung and I could feel the UX degrading, and the OS was killing the app every now and then too (let's not talk about the phone getting hot).

Gemma 3 through llama.cpp (with a React Native bridge) had a footprint of around 4-5 GB on every inference, and keeping the model loaded but idle still held onto about 1 GB of memory until I released it, at which point memory went back to normal.

I was banging my head against the wall looking for a solution, and then Gemma 4 arrived. I tried it through the AI Edge Gallery and noticed two things:

  1. The speed difference between CPU and GPU is enormous
  2. How quickly the model loaded and replied; my phone kept working smoothly and memory jumps were barely noticeable.

This is when I learnt about LiteRT-LM and how optimized it is for edge AI.

I got it working, not without its quirks of course: I had to write some native modules for both Android and iOS (through Objective-C, since they don't offer a Swift API yet!).
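For anyone curious about the bridge, the Android side is roughly the shape of the sketch below. It's a minimal illustration, assuming the `LlmInference` API from Google AI Edge (the MediaPipe GenAI tasks, which run on LiteRT underneath); the module and method names are made up for the example, not my exact code.

```kotlin
// Illustrative React Native native module (Android, Kotlin).
// Assumes the com.google.mediapipe.tasks.genai LLM Inference API;
// the wrapper itself (LocalLlmModule, loadModel, generate, release) is hypothetical.
package com.example.llm

import com.facebook.react.bridge.Promise
import com.facebook.react.bridge.ReactApplicationContext
import com.facebook.react.bridge.ReactContextBaseJavaModule
import com.facebook.react.bridge.ReactMethod
import com.google.mediapipe.tasks.genai.llminference.LlmInference

class LocalLlmModule(private val ctx: ReactApplicationContext) :
    ReactContextBaseJavaModule(ctx) {

    private var llm: LlmInference? = null

    override fun getName() = "LocalLlm"

    // Load the model once; it keeps its allocation resident until release() is called.
    @ReactMethod
    fun loadModel(modelPath: String, promise: Promise) {
        try {
            val options = LlmInference.LlmInferenceOptions.builder()
                .setModelPath(modelPath)   // e.g. a Gemma .task bundle on device
                .setMaxTokens(512)         // prompt + response budget
                .build()
            llm = LlmInference.createFromOptions(ctx, options)
            promise.resolve(true)
        } catch (e: Exception) {
            promise.reject("LOAD_FAILED", e)
        }
    }

    // Blocking generation; in real code you'd run this off the module thread and stream tokens.
    @ReactMethod
    fun generate(prompt: String, promise: Promise) {
        val engine = llm ?: return promise.reject("NOT_LOADED", "Call loadModel first")
        try {
            promise.resolve(engine.generateResponse(prompt))
        } catch (e: Exception) {
            promise.reject("GENERATE_FAILED", e)
        }
    }

    // Free the native allocation; the only way to get the memory back while idle.
    @ReactMethod
    fun release() {
        llm?.close()
        llm = null
    }
}
```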

I haven't written anything to use the NPU, but GPU and CPU inference work quite well. Memory footprint is around 1.5 GB to 2 GB. The oldest phone I've tried where it runs well is an iPhone 13 Pro Max.

The only thing I don't like much is that you have to release the model to recover memory, since it keeps its allocation even when idle. The startup cost isn't too bad once it has picked its preferred backend to run on, but it could be even faster.
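One workaround for the idle allocation, just as a sketch (this helper is hypothetical, not something LiteRT-LM provides): drop the engine after it has sat unused for a while and lazily reload it on the next call.

```kotlin
// Hypothetical idle-release policy layered on top of the module above:
// release the model after 60 s without a request, reload lazily on the next generate().
import android.os.Handler
import android.os.Looper

class IdleReleaser(
    private val releaseModel: () -> Unit,
    private val idleMs: Long = 60_000
) {
    private val handler = Handler(Looper.getMainLooper())
    private val releaseTask = Runnable { releaseModel() }

    // Call after every inference; resets the countdown to release.
    fun touch() {
        handler.removeCallbacks(releaseTask)
        handler.postDelayed(releaseTask, idleMs)
    }

    // Call before starting a new inference so the model isn't released mid-request.
    fun cancel() = handler.removeCallbacks(releaseTask)
}
```

You trade the reload cost (which is small once the backend is picked) for not holding 1.5-2 GB hostage while the user isn't talking to the model.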

I have a strength-tracking mobile app and this is how I use it right now:

  • routine generation
  • performance checks with suggestions on exercises mid-workout (rough example prompt after the list)
  • follow ups and suggestions after finishing workouts
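To give an idea of what those calls look like, the mid-workout check boils down to one short prompt pushed through the bridge. The wording and the helper below are made up for illustration, not my actual prompts.

```kotlin
// Illustrative only: turning the current exercise and completed sets into a
// single prompt string for the generate() method sketched earlier.
fun buildMidWorkoutPrompt(exercise: String, sets: List<Pair<Double, Int>>): String {
    val history = sets.joinToString("\n") { (kg, reps) -> "- $kg kg x $reps reps" }
    return listOf(
        "You are a concise strength coach.",
        "Current exercise: $exercise",
        "Sets so far:",
        history,
        "In 2-3 sentences, say whether to adjust the load or reps for the next set."
    ).joinToString("\n")
}

// Example: buildMidWorkoutPrompt("Barbell squat", listOf(100.0 to 5, 105.0 to 5))
```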

Each inference call takes 2-4 seconds on GPU; add one or two more on CPU.

What I plan to do next:

  • image recognition for exercises (Gemma has proven to be a challenging model for this feature, but perhaps with some good prompting you can get something going)
  • on-the-spot workout generation

So far I've had a great experience with the model and framework, and I hope they keep releasing updates, and smaller models, too! :)

| Setup | Device | Backend | Model | Memory | Latency (full inference) |
|---|---|---|---|---|---|
| llama.cpp RN bridge | Samsung S25 Ultra | CPU (couldn't make GPU work for some reason) | Gemma 3 1B IT | 4–5 GB peak | ~7–10 s |
| LiteRT-LM | Samsung S25 Ultra | GPU/CPU | Gemma 4 E2B IT | 1.5–2 GB | 2–4 s (add 1–2 s on CPU) |
| LiteRT-LM | iPhone 13 Pro Max | CPU (haven't tested GPU due to Metal constraints) | Gemma 4 E2B IT | 1.5–2 GB | 3–6 s |
submitted by /u/Aguxez
