Galaxy Z Fold6 as a local inference node — llama.cpp/Vulkan, homelab telemetry, SHA-256 model verification
Mirrored from r/LocalLLaMA.
Built a small Android app called Pocket Node that runs llama.cpp inference
on-device. Here's what it actually does and what it doesn't.
**What it does**
* Loads a GGUF model (SmolLM3 Q4_0, ~1.1B params) directly on the Fold6
* Uses the Vulkan/OpenCL backend via llama.cpp — not CPU-only
* Streams tokens to a native Jetpack Compose UI
* Handles Stop during prefill, not just decode: tapping Stop during the
prefill phase sets the native abort flag, cancels the JNI call, resets
the UI, and lets you send a follow-up prompt normally
* Verifies the model file's SHA-256 against a local registry on first load;
if the hash doesn't match, inference is blocked and the UI shows a
recovery path (Rescan / Re-import / Choose another)
* Reports model state and health to a homelab monitoring stack so I can
see at a glance whether the phone is up and inference is ready
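
The Stop-during-prefill item in the list above comes down to polling one abort flag between chunks of prompt processing, not only between generated tokens. Here is a rough sketch of that control flow, written in Python for brevity (the app itself does this in Kotlin and native code). `run_inference`, `prefill_chunk`, and `decode_token` are hypothetical stand-ins, not the app's or llama.cpp's actual functions, and the chunk size is an arbitrary assumption:

```python
import threading


def run_inference(prompt_tokens, abort, prefill_chunk, decode_token,
                  chunk_size=32, max_new_tokens=16):
    """Return (status, generated_tokens); status is 'done' or 'aborted'."""
    # Prefill: process the prompt in chunks, polling the flag between chunks.
    for start in range(0, len(prompt_tokens), chunk_size):
        if abort.is_set():
            return "aborted", []
        prefill_chunk(prompt_tokens[start:start + chunk_size])

    # Decode: poll the same flag before each new token.
    generated = []
    for _ in range(max_new_tokens):
        if abort.is_set():
            return "aborted", generated
        generated.append(decode_token())
    return "done", generated


abort = threading.Event()
chunks_seen = []


def fake_prefill(chunk):
    chunks_seen.append(len(chunk))
    if len(chunks_seen) == 2:
        abort.set()  # simulate the user tapping Stop partway through prefill


status, tokens = run_inference(list(range(100)), abort, fake_prefill, lambda: 0)

# The flag has to be cleared before a follow-up prompt can run normally.
abort.clear()
status2, tokens2 = run_inference(list(range(10)), abort, lambda c: None,
                                 lambda: 7, max_new_tokens=3)
```

The part that's easy to miss is the reset: if the flag isn't cleared after an aborted run, the next prompt aborts immediately.
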
**The stack**
* App: Kotlin + Jetpack Compose, llama.cpp via JNI, Vulkan/OpenCL backend
* Model: SmolLM3 Q4_0 (1.1B) — SHA-256 verified on load
* Homelab side: Python monitoring service polls the phone's health endpoint
and includes it in a daily digest alongside the other nodes
* The phone exposes an OpenAI-compatible API on Tailscale — direct calls
work; it's not registered in the LiteLLM routing layer yet, so automatic
routing doesn't apply. That's the next config step.
* Debug build, Android 16
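
For the homelab side, a poll-and-summarize step could look roughly like the sketch below. The JSON field names (`model_loaded`, `hash_verified`) and the digest wording are assumptions made up for illustration; the post doesn't specify the real health payload or endpoint path.

```python
import json
import urllib.request


def fetch_health(url, timeout=3.0):
    """Return the parsed JSON health payload, or None if the node is
    unreachable or the reply isn't valid JSON."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # URLError and socket timeouts are OSError; bad JSON is ValueError.
        return None


def digest_line(name, payload):
    """Turn one node's health payload into a one-line digest entry."""
    if payload is None:
        return f"{name}: DOWN (no response)"
    if not payload.get("model_loaded"):
        return f"{name}: UP, model not loaded"
    if not payload.get("hash_verified"):
        return f"{name}: UP, model blocked (hash mismatch)"
    return f"{name}: UP, inference ready"


# Payloads are hard-coded here so the sketch runs without a phone on the network.
print(digest_line("fold6", {"model_loaded": True, "hash_verified": True}))
print(digest_line("fold6", None))
```

In a real poller, `fetch_health` would be pointed at the phone's tailnet address and its result passed straight into `digest_line`. Treating "unreachable" and "up but blocked" as separate states is what lets the digest answer both questions at a glance.
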
**What it doesn't do**
* Not a replacement for a desktop GPU or a Mac Studio. SmolLM3 at Q4_0
on a phone handles short tasks but context is limited and longer prompts
are slow.
* No persistent memory or RAG. Each conversation is independent.
* Battery and thermal: short runs are fine. Sustained generation heats the
device. Don't leave it in a benchmark loop.
* Not tested on other Android hardware. Vulkan driver quality varies by
device. I can't say it works on your phone.
* Not a public server. The API is Tailscale-gated; nothing is exposed to the public internet.
**Why bother**
For short tasks — quick classification, a local chat response that doesn't
need to leave the device — it works. The goal isn't to match a frontier
model on a phone. It's zero cloud cost for the tasks that don't need cloud.
The verification step mattered more than I expected. Knowing the model file
matches a known-good SHA-256 before running it is the kind of thing you
want when you're running a model you downloaded months ago.
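
The check itself is small. A minimal sketch, again in Python for brevity (the app does this in Kotlin), assuming the registry is just a mapping from filename to expected hex digest; the post doesn't describe the real registry format, so that shape and the function names are hypothetical:

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Hash the file in 1 MiB chunks so a large GGUF never sits fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path, registry):
    """Return 'ok', 'mismatch', or 'unknown' (file not in the registry)."""
    expected = registry.get(os.path.basename(path))
    if expected is None:
        return "unknown"
    return "ok" if sha256_of(path) == expected.lower() else "mismatch"


# Demo on a throwaway file: register it, then corrupt it by one byte.
with tempfile.TemporaryDirectory() as tmp:
    model = os.path.join(tmp, "model.gguf")
    with open(model, "wb") as f:
        f.write(b"stand-in bytes, not a real GGUF")
    registry = {"model.gguf": sha256_of(model)}
    first = verify_model(model, registry)   # file untouched since registration
    with open(model, "ab") as f:
        f.write(b"\x00")
    second = verify_model(model, registry)  # file changed after registration
```

Anything other than `'ok'` would be the point where inference gets blocked and the recovery path is shown.
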
**Screenshots in gallery:** chat UI with inference status, diagnostics, stop-in-progress state, P20 health digest.
Happy to answer questions about the llama.cpp JNI layer, the stop/prefill
handling, or the homelab monitoring side.
---
*Pre-emptive clarification: "Vulkan/OpenCL" means the backend llama.cpp
selects on this device. I'm not doing anything custom on the GPU side beyond
what llama.cpp exposes.*