r/LocalLLaMA · · 2 min read

Galaxy Z Fold6 as a local inference node — llama.cpp/Vulkan, homelab telemetry, SHA-256 model verification

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Built a small Android app called Pocket Node that runs llama.cpp inference

on-device. Here's what it actually does and what it doesn't.

**What it does**

* Loads a GGUF model (SmolLM3 Q4_0, ~1.1B params) directly on the Fold6

* Uses the Vulkan/OpenCL backend via llama.cpp — not CPU-only

* Streams tokens to a native Jetpack Compose UI

* Handles Stop during prefill, not just decode: tapping Stop during the

prefill phase sets the native abort flag, cancels the JNI call, resets

the UI, and lets you send a follow-up prompt normally

* SHA-256 verifies the model file against a local registry on first load;

if the hash doesn't match, inference is blocked and the UI shows a

recovery path (Rescan / Re-import / Choose another)

* Reports model state and health to a homelab monitoring stack so I can

see at a glance whether the phone is up and inference is ready

**The stack**

* App: Kotlin + Jetpack Compose, llama.cpp via JNI, Vulkan/OpenCL backend

* Model: SmolLM3 Q4_0 (1.1B) — SHA-256 verified on load

* Homelab side: Python monitoring service polls the phone's health endpoint

and includes it in a daily digest alongside the other nodes

* The phone exposes an OpenAI-compatible API on Tailscale — direct calls

work; it's not registered in the LiteLLM routing layer yet, so automatic

routing doesn't apply. That's the next config step.

* Debug build, Android 16

**What it doesn't do**

* Not a replacement for a desktop GPU or a Mac Studio. SmolLM3 at Q4_0

on a phone handles short tasks but context is limited and longer prompts

are slow.

* No persistent memory or RAG. Each conversation is independent.

* Battery and thermal: short runs are fine. Sustained generation heats the

device. Don't leave it in a benchmark loop.

* Not tested on other Android hardware. Vulkan driver quality varies by

device. I can't say it works on your phone.

* Not a public server. The API is Tailscale-gated, LAN only.

**Why bother**

For short tasks — quick classification, a local chat response that doesn't

need to leave the device — it works. The goal isn't to match a frontier

model on a phone. It's zero cloud cost for the tasks that don't need cloud.

The verification step mattered more than I expected. Knowing the model file

matches a known-good SHA-256 before running it is the kind of thing you

want when you're running a model you downloaded months ago.

**Screenshots in gallery:** chat UI with inference status, diagnostics, stop-in-progress state, P20 health digest.

Happy to answer questions about the llama.cpp JNI layer, the stop/prefill

handling, or the homelab monitoring side.

---

*Clarification pre-emptively: "Vulkan/OpenCL" means the backend llama.cpp

selects on this device. I'm not doing anything custom on the GPU side beyond

what llama.cpp exposes.*

submitted by /u/GsxrGuy80s
[link] [comments]

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from r/LocalLLaMA