
I finally put my NPU (Intel Arrow Lake) to use doing ASR for my smart home

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.


I wrote up what I found in a deep dive elsewhere (which I won't mention, because Reddit doesn't like cross-linking), but I wanted to share it here, since this is where I learn the most about AI stuff and I've seen questions here before about NPUs, which are often dismissed as marketing gimmicks (and for the most part they are if we're talking about LLMs, but not for other ML workloads).

If you care about the traps I hit along the way getting onnx-asr working on OpenVINO compiled for the NPU, you can read the article. I'm here to post the findings.

The table compares, for each device, the total time, the energy per transcription (joules), and the average power during inference (watts), in that order:

| Audio length | CPU (INT8) | NPU (FP32) | Speedup | Energy |
|---|---|---|---|---|
| 10 s | 978 ms / 44.6 J / 45.6 W | 204 ms / 4.2 J / 20.5 W | 4.8× faster | 10.7× less energy |
| 20 s | 1708 ms / 79.8 J / 46.7 W | 615 ms / 7.8 J / 12.7 W | 2.8× faster | 10.2× less energy |
| 60 s | 5011 ms / 237.7 J / 47.4 W | 818 ms / 11.0 J / 13.4 W | 6.1× faster | 21.6× less energy |
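As a quick sanity check on the table, energy should be roughly average power times elapsed time, and the two ratio columns follow directly from the raw figures. Here is a small sketch that recomputes them from the numbers above (the function and variable names are mine, not from the post):

```python
# Rows copied from the table:
# (audio, cpu_ms, cpu_watts, cpu_joules, npu_ms, npu_watts, npu_joules)
ROWS = [
    ("10s", 978, 45.6, 44.6, 204, 20.5, 4.2),
    ("20s", 1708, 46.7, 79.8, 615, 12.7, 7.8),
    ("60s", 5011, 47.4, 237.7, 818, 13.4, 11.0),
]


def energy_joules(ms, watts):
    """Energy = average power * time, with time converted from ms to s."""
    return watts * ms / 1000.0


def summarize(rows):
    """Recompute speedup, energy ratio, and power*time for each row."""
    out = []
    for audio, cpu_ms, cpu_w, cpu_j, npu_ms, npu_w, npu_j in rows:
        out.append({
            "audio": audio,
            "speedup": cpu_ms / npu_ms,         # how many times faster the NPU is
            "energy_ratio": cpu_j / npu_j,      # how many times less energy it uses
            "cpu_j_check": energy_joules(cpu_ms, cpu_w),
            "npu_j_check": energy_joules(npu_ms, npu_w),
        })
    return out


if __name__ == "__main__":
    for row in summarize(ROWS):
        print(
            f"{row['audio']}: {row['speedup']:.1f}x faster, "
            f"{row['energy_ratio']:.1f}x less energy, "
            f"CPU {row['cpu_j_check']:.1f} J, NPU {row['npu_j_check']:.1f} J"
        )
```

The recomputed values agree with the table to within rounding. The 10 s energy ratio comes out slightly lower when derived from the already-rounded joule figures, presumably because the post used unrounded measurements.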

The energy was sampled at 10 Hz using intel-rapl, which reports total package power. From that I subtracted the idle power I measured before the run, so when you see that the power was 12.7 W, it means 12.7 W above idle.
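For anyone who wants to reproduce this kind of measurement, here is a rough sketch of the idea. It is not the author's script. It assumes the Linux powercap interface exposes the package counter at `/sys/class/powercap/intel-rapl:0/energy_uj` (the usual location, though it can differ per machine and often needs root to read), and all function names are hypothetical.

```python
import time
from pathlib import Path

# Assumed location of the package-level RAPL energy counter (microjoules).
RAPL_DIR = Path("/sys/class/powercap/intel-rapl:0")


def read_energy_uj(rapl_dir=RAPL_DIR):
    """Read the cumulative package energy counter, in microjoules."""
    return int((rapl_dir / "energy_uj").read_text())


def sample_counter(duration_s, hz=10, read=read_energy_uj,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll the counter at roughly `hz` for `duration_s` seconds.

    Returns a list of (timestamp_seconds, energy_uj) tuples.
    """
    samples = []
    start = clock()
    while True:
        now = clock()
        samples.append((now, read()))
        if now - start >= duration_s:
            return samples
        sleep(1.0 / hz)


def net_energy(samples, idle_watts, wrap_uj=None):
    """Return (joules_above_idle, average_watts_above_idle).

    `wrap_uj` is the value at which the counter wraps around; on Linux it
    can be read from `max_energy_range_uj` in the same directory.
    """
    total_uj = 0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        delta = cur - prev
        if delta < 0 and wrap_uj is not None:
            delta += wrap_uj  # the counter wrapped between two samples
        total_uj += delta
    elapsed = samples[-1][0] - samples[0][0]
    net_j = total_uj / 1e6 - idle_watts * elapsed
    return net_j, net_j / elapsed
```

In use, you would sample for a few seconds with nothing running and take `net_energy(idle_samples, 0.0)[1]` as the idle baseline, then sample again in a background thread while a transcription runs and pass that baseline as `idle_watts`.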

I think this is a remarkable result considering Intel NPUs are, at least on paper, rather weak at 13 TOPS compared with the >40 TOPS of the AMD ones, but still more than fast enough for this task.

Some real-world end-to-end numbers from Home Assistant:

[Image: CPU]

[Image: NPU]

Running this on the NPU frees the CPU to do CPU stuff, and also saves a valuable 2-3 GB of VRAM on my 7900XTX for LLM stuff.

Incidentally, in real-world usage this setup happens to beat the 12GB RTX 3060 eGPU I was using before. On a 3-4 s voice command, the NPU takes ~120-160 ms, while the 3060 took ~150-300 ms. I am not claiming that the NPU is more powerful than the Nvidia card. I suspect the advantage comes from the NPU being able to wake up instantly from dormancy, while the Nvidia card took long enough to ramp up that, for short workloads like smart home voice commands, the NPU's head start was enough to win. Quite likely the Nvidia card would win again when transcribing long-form audio.

I finally found a nice use for the NPU, and I want to move the TTS audio generation to the NPU next.

https://github.com/cibernox/wyoming-parakeet-on-intel-npu

submitted by /u/cibernox