r/LocalLLaMA · May 25, 2026 · 3 min read

Wrote a custom C++ engine for MiniCPM-V 4.6 on Orange Pi AIPro (Ascend 310B) to bypass framework overhead

Hey everyone, just wanted to share a project I've been hacking on for the last few weeks. I managed to build a from-scratch C++ inference engine to run MiniCPM-V 4.6 entirely on the Orange Pi AIPro (the budget board with the Ascend 310B NPU, costs around $149 for 20 TOPS INT8 / 10 TFLOPS FP16).

If you want to check out the custom ops, build scripts, or the Gradio web UI, the repository is open source on GitHub at github.com/lvyufeng/minicpm-v-4.6-orangepi

https://preview.redd.it/upfsqb0jm73h1.png?width=1655&format=png&auto=webp&s=1e80185171fa6db651d81e20d717b3a05791614c

If you've ever tried deploying local LLMs or VLMs on this specific hardware, you probably know that dealing with the standard framework stack can be a massive pain, especially if you want to get any decent performance on the edge. To get around this, I skipped the heavy frameworks and went low-level. Both the text generation and the SigLIP vision tower run natively on the NPU inside a single C++ subprocess. There is absolutely zero torch_npu dependency on the hot path. Python is only used on the cold path for CPU-side tokenization and image preprocessing.

The initial stock aclnnMm baseline was pretty rough during the token decoding phase because it heavily underutilized the NPU's cube unit when M=1 (vector-matrix multiply). It was giving me around 2.88 tokens/s (taking about 350ms per step).

After rewriting the critical paths with custom AscendC kernels, it's now hitting 5.90 tokens/s in FP16 (dropping the per-step latency down to 170ms). Here is the actual breakdown of how the 2x speedup happened:

| Stage | Tokens/s | Per-step latency | Saved per step |
|---|---|---|---|
| Stock `aclnnMm` baseline | 2.88 | 350 ms | - |
| + Custom cube matmul (M=1) | 4.37 | 229 ms | 121 ms |
| + `lm_head` 16-chunk cube path | 4.99 | 200 ms | 29 ms |
| + Vectorized causal-conv1d step kernel | 5.90 | 170 ms | 30 ms |

First, I wrote a custom cube matmul kernel for M=1 using MatmulImpl to bypass the slow generic vector path. This single change boosted the speed from 2.88 to 4.37 tokens/s, saving around 121ms per step.

Second, the lm_head was way too wide for normal cube tiling because the vocabulary size is huge (around 248k). Running the stock matmul directly was a bottleneck. So I made the engine chunk the weights into 16 cube-friendly slices at load time, running sequential matmuls followed by a host reduce. This shaved off another 29ms, bringing it up to 4.99 tokens/s.
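To make the chunking idea concrete, here is a minimal CPU-only sketch in plain C++. It is not the engine's code: the names (`Chunk`, `make_chunks`, `chunk_matvec`, `greedy_next_token`) are made up for illustration, the split is assumed to run along the vocabulary dimension, and the "host reduce" is assumed to be a greedy argmax over the gathered logits. On the board, each `chunk_matvec` call would be one cube matmul over a pre-sliced weight buffer.

```cpp
#include <cstddef>
#include <vector>

// One contiguous slice of lm_head rows (vocabulary entries). Hypothetical type.
struct Chunk {
    std::size_t first_row;  // vocabulary index of the slice's first row
    std::size_t rows;       // number of vocabulary rows in the slice
};

// Split `vocab` rows into `n_chunks` near-equal slices (done once at load time).
std::vector<Chunk> make_chunks(std::size_t vocab, std::size_t n_chunks) {
    std::vector<Chunk> chunks;
    const std::size_t base = vocab / n_chunks, extra = vocab % n_chunks;
    std::size_t start = 0;
    for (std::size_t i = 0; i < n_chunks; ++i) {
        const std::size_t rows = base + (i < extra ? 1 : 0);
        chunks.push_back({start, rows});
        start += rows;
    }
    return chunks;
}

// CPU stand-in for one per-chunk matmul: logits[r] = dot(W[r], h).
// W is row-major [vocab x hidden].
void chunk_matvec(const std::vector<float>& W, const std::vector<float>& h,
                  const Chunk& c, std::vector<float>& logits) {
    const std::size_t hidden = h.size();
    for (std::size_t r = c.first_row; r < c.first_row + c.rows; ++r) {
        float acc = 0.0f;
        for (std::size_t k = 0; k < hidden; ++k) acc += W[r * hidden + k] * h[k];
        logits[r] = acc;
    }
}

// Sequential per-chunk matvecs, then a host-side argmax over all logits
// (assumed form of the "host reduce"; sampling would reduce differently).
std::size_t greedy_next_token(const std::vector<float>& W, const std::vector<float>& h,
                              const std::vector<Chunk>& chunks, std::size_t vocab) {
    std::vector<float> logits(vocab);
    for (const Chunk& c : chunks) chunk_matvec(W, h, c, logits);
    std::size_t best = 0;
    for (std::size_t r = 1; r < vocab; ++r)
        if (logits[r] > logits[best]) best = r;
    return best;
}
```

With a vocabulary of around 248k, `make_chunks(vocab, 16)` gives slices of roughly 15.5k rows each, a far more manageable width for one matmul than the full matrix.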

Third, I replaced a highly scalar causal-conv1d baseline with a vectorized step kernel using Unified Buffer DMAs, which saved another 30ms per step, bringing it to the final 5.90 tokens/s.
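For anyone unfamiliar with what a causal-conv1d "step" computes during decoding, here is a scalar reference sketch in plain C++. It assumes the common depthwise formulation (one filter per channel, no bias or activation), which the post does not spell out, and `ConvState` and `conv1d_step` are hypothetical names. The AscendC kernel would do the same arithmetic across many channels at once out of the Unified Buffer; this version just shows the semantics.

```cpp
#include <cstddef>
#include <vector>

// Rolling state for a depthwise causal conv1d: the last (K-1) inputs per
// channel, stored oldest-first, row-major [channels x (K-1)].
struct ConvState {
    std::size_t channels;
    std::size_t kernel;          // K, number of filter taps
    std::vector<float> history;  // channels * (K-1) values, zero-initialised

    ConvState(std::size_t c, std::size_t k)
        : channels(c), kernel(k), history(c * (k - 1), 0.0f) {}
};

// One decode step. For each channel c:
//   y[c] = sum_j w[c][j] * window[c][j]
// where window is the stored history followed by the new input x[c].
// w is row-major [channels x K]; w[c][K-1] multiplies the newest sample.
// The history is then slid forward by one position.
std::vector<float> conv1d_step(ConvState& s, const std::vector<float>& w,
                               const std::vector<float>& x) {
    const std::size_t K = s.kernel, H = K - 1;
    std::vector<float> y(s.channels);
    for (std::size_t c = 0; c < s.channels; ++c) {
        float* hist = s.history.data() + c * H;
        const float* wc = w.data() + c * K;
        float acc = wc[H] * x[c];  // newest tap
        for (std::size_t j = 0; j < H; ++j) acc += wc[j] * hist[j];
        for (std::size_t j = 0; j + 1 < H; ++j) hist[j] = hist[j + 1];  // drop oldest
        if (H > 0) hist[H - 1] = x[c];                                  // append newest
        y[c] = acc;
    }
    return y;
}
```

Each step only touches K values per channel, so the work is tiny. A scalar loop like this one wastes most of its time on per-element overhead, which is why vectorizing it across channels can pay off.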

Right now, the decoding step is fundamentally limited by how fast the board can stream the FP16 weights over its 44 GB/s memory bus. The theoretical floor for reading the 1.4GB of weights once per step is around 32ms (1.4 GB / 44 GB/s), and my current cube path sits at 170ms, so there is still roughly 5x of headroom before hitting that floor. The next logical step is implementing fused INT4/INT8 dequantization kernels on the cube path, which cut the bytes read per step, to push it past 12 tokens/s.

Let me know if you have any questions about AscendC kernel tuning, the C++ SigLIP implementation, or edge VLM deployment in general!

submitted by /u/Known_Ice9380