How we catch silent NPU fallback on Snapdragon in CI [D]
Posting because I've now seen this exact bug at multiple teams shipping ML to Snapdragon, and the pattern is worth writing up.
ONNX Runtime's QNN execution provider (the one that targets Qualcomm's Hexagon NPU on Snapdragon SoCs) will silently route unsupported ops to the CPU. Your accuracy is fine, your eval latency on the dev board looks fine, but production latency mysteriously triples because the input distribution stresses fallback paths differently — and the runtime never raises anything louder than a startup-log line nobody reads.
The default median-of-N latency gate doesn't catch this, because fallback creates a bimodal distribution and the median lands on the fast cluster. Three things end up being necessary:
1. **Run on real hardware** — emulators implement the ISA in software so every op is "supported" (for the wrong reason), and cloud x86 doesn't load the QNN EP at all
2. **Gate on coefficient of variation alongside median** — healthy on-NPU CV is 2–5%; intermittent fallback pushes it above 15%
3. **Parse the ORT profiling JSON and assert NPU FLOP percentage** — the routing info is in there but you have to opt into `profiling_level=detailed` and post-process it; the default warning-level log just says "23 nodes assigned to QNN, 7 to CPU"
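A minimal sketch of the CV gate from the second point, assuming you've already collected per-iteration latencies on the device. The function name and the exact thresholds are illustrative, not from any library:

```python
import statistics

def assert_latency_healthy(latencies_ms, max_cv=0.15, max_median_ms=None):
    """Gate a latency sample on median AND coefficient of variation.

    A healthy all-NPU run typically shows CV in the 2-5% range; intermittent
    CPU fallback makes the distribution bimodal and pushes CV past ~15%.
    Thresholds here are starting points, not canon -- tune per model/device.
    """
    median = statistics.median(latencies_ms)
    cv = statistics.stdev(latencies_ms) / statistics.mean(latencies_ms)
    if max_median_ms is not None and median > max_median_ms:
        raise AssertionError(f"median {median:.2f} ms exceeds {max_median_ms} ms")
    if cv > max_cv:
        raise AssertionError(
            f"CV {cv:.1%} exceeds {max_cv:.0%}: latency is likely bimodal "
            "(intermittent NPU->CPU fallback)"
        )
    return median, cv
```

Note that the median check alone would pass a bimodal sample whose fast cluster holds the majority; the CV check is what catches it.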
The third one is the diagnostic that actually identifies which op fell back, so you can either swap it for a supported equivalent, pin the QNN SDK, or escalate to firmware.
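A hedged sketch of that post-processor. Two assumptions up front: the profile doesn't carry FLOP counts directly, so this approximates NPU share by per-provider kernel time instead; and it assumes the chrome-trace-style schema recent ORT versions write when you set `sess_options.enable_profiling = True`, where `Node` events whose name ends in `_kernel_time` carry `args["provider"]` and `args["op_name"]`. Check the schema against your ORT version before trusting the numbers:

```python
import json
from collections import defaultdict

def fallback_report(profile_path, npu_provider="QNNExecutionProvider"):
    """Summarize per-provider kernel time from an ORT profiling JSON.

    Returns (npu_time_share, {op_type: count}) where the dict lists ops
    that ran on anything other than the NPU provider -- i.e. the ops to
    swap, pin the QNN SDK over, or escalate.
    """
    with open(profile_path) as f:
        events = json.load(f)
    time_by_provider = defaultdict(int)
    fallen_back_ops = defaultdict(int)
    for ev in events:
        # Only per-node kernel events carry provider routing info.
        if ev.get("cat") != "Node" or not ev.get("name", "").endswith("_kernel_time"):
            continue
        args = ev.get("args", {})
        provider = args.get("provider", "unknown")
        time_by_provider[provider] += ev.get("dur", 0)  # dur is microseconds
        if provider != npu_provider:
            fallen_back_ops[args.get("op_name", "?")] += 1
    total = sum(time_by_provider.values()) or 1
    return time_by_provider[npu_provider] / total, dict(fallen_back_ops)
```

In CI you'd assert the returned share against a floor (say 0.95) and print the op dict on failure, so the log names the offending op types directly.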
Wrote up the full pattern with the actual Python (CV gating function + ORT profile parser): https://edgegate.frozo.ai/blog/how-we-catch-silent-npu-fallback-on-snapdragon-in-ci
Curious whether anyone here has hit similar silent-fallback patterns with TensorRT on Jetson or CoreML on iOS — I'd expect the same symptoms (bimodal latency, silent provider routing) but haven't gone digging. Same question for ExecuTorch.