r/LocalLLaMA · 2 min read

Llama.cpp VS LiteRT on a custom Xiaomi 12 Pro 24/7 Server (V2 Redesign)

https://preview.redd.it/sm4ysgdw1w2h1.png?width=1376&format=png&auto=webp&s=3705932403919814fbf2008a1cba189d17e0591e

Thanks everyone for the advice on my previous post, "24/7 Headless AI Server on Xiaomi 12 Pro (Snapdragon 8 Gen 1 + Ollama/Gemma4)". You really inspired me, and I completely redesigned the cooling and power supply for this setup.

What's new:

  • Cooling: Installed a copper heatsink with a fan on the back. On the front, I removed the screen and mounted the device with a thermal pad directly onto an aluminum plate that carries 2 fans. The fans now switch on at 40°C and off at 35°C.
  • Power Supply: Built a custom, fully safe PSU. I took apart the battery and wired the PSU directly to the battery's BMS via a capacitor. Added 2 fuses (input/output), a crowbar circuit that trips at 4.3V to protect the phone, and a backup fan for the PSU itself (though after a week of testing I barely needed the fan, since the PSU doesn't get that hot).
  • Housing: 3D-printed a custom case, built a stand out of aluminum extrusions, and routed an external power button.
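The 40°C on / 35°C off behavior described above is a two-threshold (hysteresis) control. The post does not say how the switching is implemented, so the following is only a minimal sketch of the idea; the class and method names are hypothetical, and only the two temperature thresholds come from the post.

```python
from dataclasses import dataclass


@dataclass
class FanController:
    """Two-threshold (hysteresis) fan control.

    The 40/35 °C defaults are the thresholds quoted in the post;
    the class itself is a hypothetical illustration.
    """
    on_at_c: float = 40.0
    off_at_c: float = 35.0
    running: bool = False

    def update(self, temp_c: float) -> bool:
        """Feed one temperature reading; return whether the fan should run."""
        if temp_c >= self.on_at_c:
            self.running = True
        elif temp_c <= self.off_at_c:
            self.running = False
        # Between the two thresholds the previous state is kept,
        # which avoids rapid on/off cycling around a single setpoint.
        return self.running


if __name__ == "__main__":
    fan = FanController()
    for reading in (33.0, 38.0, 40.5, 37.0, 35.0, 36.0):
        state = "on" if fan.update(reading) else "off"
        print(f"{reading:5.1f} °C -> fan {state}")
```

The 5°C gap is the point of the design: at 38°C the fan is off while the phone is warming up, but stays on while it is cooling down from above 40°C.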

Here is how it looks now:

https://preview.redd.it/z17nqy6w2w2h1.jpg?width=3072&format=pjpg&auto=webp&s=09c02d18e53d2771383ae85f35796150ed8b91d8

https://reddit.com/link/1tlgxms/video/ul2iivua3w2h1/player

https://reddit.com/link/1tlgxms/video/xiuyt9wk3w2h1/player

Benchmarks (gemma-4-E4B):
(Prompt: “Write 2000 words IT essay”)

  1. Llama.cpp

https://reddit.com/link/1tlgxms/video/v0t8t5n54w2h1/player

  • Speed: Prompt: 30.6 t/s | Generation: 5.7 t/s
  • The CPU load is pretty "gentle," and the PSU shows a lower amp draw.

https://preview.redd.it/l0wnc1xo4w2h1.jpg?width=2937&format=pjpg&auto=webp&s=d426d9edb9e3801e0a9a487aa4cc729aa7da4dcd
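To put the 5.7 t/s generation speed in perspective, here is a rough estimate of how long the 2000-word essay prompt would take. The tokens-per-word ratio is an assumption (it depends on the tokenizer and the text), and the function name is hypothetical; only the word count and the speed come from the post.

```python
def generation_seconds(words: int, tokens_per_word: float, tokens_per_s: float) -> float:
    """Estimated wall-clock time to generate `words` words of output."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return words * tokens_per_word / tokens_per_s


if __name__ == "__main__":
    # 2000 words and 5.7 t/s are from the post.
    # 1.3 tokens per word is an ASSUMED ratio, not a measured one.
    seconds = generation_seconds(words=2000, tokens_per_word=1.3, tokens_per_s=5.7)
    print(f"~{seconds:.0f} s (~{seconds / 60:.1f} min)")
```

Under that assumed ratio the essay takes several minutes to generate; prompt processing at 30.6 t/s is negligible for a prompt this short.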

  2. LiteRT (by Google)

https://reddit.com/link/1tlgxms/video/1cbz7rk85w2h1/player

https://preview.redd.it/dh7lc91d5w2h1.png?width=1804&format=png&auto=webp&s=5aacb2bdbcd135e79cfe20afda44009a3896ce83

  • Slightly faster generation, but it maxes out the CPUs, and the amp draw is noticeably higher.

https://preview.redd.it/avfhuxlg5w2h1.jpg?width=2693&format=pjpg&auto=webp&s=3f5e143df4f192225e84e10738c7673f6394b948
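Since one backend is slightly faster but draws more current, a fairer comparison is energy per generated token. The sketch below shows the arithmetic only. The post gives no voltage or current figures, so every number in the example is a placeholder to be replaced with readings from your own PSU, and the function name is hypothetical.

```python
def joules_per_token(tokens_per_s: float, volts: float, amps: float) -> float:
    """Energy per generated token in joules: power (V * A) divided by tokens/s."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return (volts * amps) / tokens_per_s


if __name__ == "__main__":
    # ALL values below are PLACEHOLDERS, not measurements from the post.
    backend_a = joules_per_token(tokens_per_s=5.0, volts=4.0, amps=1.0)
    backend_b = joules_per_token(tokens_per_s=6.0, volts=4.0, amps=1.5)
    print(f"backend A: {backend_a:.2f} J/token")
    print(f"backend B: {backend_b:.2f} J/token")
```

With these placeholder numbers the faster backend still spends more energy per token, which is the kind of trade-off the amp-draw observation hints at. Whether it holds for this setup depends on the actual readings.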

GPU Struggles

I tried running LiteRT on the GPU, but unfortunately, Google AI Edge hasn't released an APK for my Snapdragon 8 Gen 1. Swapping library files from the Qualcomm site didn't work either. I also tried running a Vulkan build of llama.cpp but ran into issues. I'll post updated benchmarks once I manage to get it working.

Conclusion

If anyone asks whether it was worth it: if you have a powerful spare phone lying around and want a great DIY project, definitely yes. But if you just need an LLM server and don't want the hassle, you're better off buying a mini PC.

Thanks again to this sub for the inspiration—I wouldn't have committed to such a massive rebuild without your feedback!

submitted by /u/Aromatic_Ad_7557