
· AI Infrastructure  · 12 min read

Benchmarking Edge Silicon: NPU vs GPU Inference

NPUs promise efficient edge LLM inference, but how do they actually compare to discrete GPUs under real production workloads?

Key Takeaways
  • NPUs excel at fixed-shape workloads: Neural Processing Units are silicon-specialized for matrix multiplication at fixed precision, making them highly efficient for quantized models running predictable shapes.
  • GPUs dominate when flexibility matters: GPUs handle dynamic shapes, mixed precision, and arbitrary tensor dimensions with minimal performance penalty, which is critical for LLM inference with variable context lengths.
  • Your deployment profile determines the winner: If you serve thousands of identical requests with fixed batch sizes, an NPU gives you lower energy per inference. If your inputs are variable, a GPU will waste less silicon handling the shape changes.

I have been running models on edge hardware for a long time, and the NPU question keeps coming up. Not from researchers, not from people designing new chips. From engineers who have been asked to deploy a model on a device that ships with an NPU baked into the silicon.

The question is always the same: should I use the NPU or should I use the GPU? On paper, the NPU looks better. It is built for neural networks. It has dedicated matrix multiply units. Its power envelope is lower. The marketing is compelling.

But paper benchmarks don’t tell you how things behave when the model sees a variable-length input for the 47th time that day.

So I ran the benchmarks myself. Across multiple deployment profiles, multiple models, multiple precision modes. Here is what I found, and more importantly, here is what the benchmark suites don’t tell you.

How NPUs and GPUs Actually Work

Before we compare numbers, it matters to understand why the numbers look different.

An NPU is purpose-built silicon for neural network operations. Think of it as a GPU that had every transistor devoted to general-purpose computation removed and replaced with dedicated matrix multiply units. This is why the efficiency numbers look great. Almost no circuit area goes to general-purpose floating-point units, texture mapping, or any of the other capabilities that a GPU dedicates silicon to.

The trade-off is rigidity. An NPU expects your tensor shapes to match its hardware schedule. When your input shape changes, the NPU may need to reconfigure its internal data flow, which has a real cost.

A GPU is a general-purpose SIMD machine with excellent matrix multiply capability. It is not specialized for neural networks the way an NPU is, but it is flexible enough to handle whatever tensor shapes your model throws at it. You can run FP16, INT8, BF16, FP8, even weird custom precisions that only show up when a framework decides it's fine.

The GPU wastes more silicon per compute cycle. But it wastes far less silicon on shape-mismatch penalties.

The Benchmark Setup

I ran tests across four deployment profiles, because if you only measure one thing, you are not measuring reality.

Profile 1: Fixed batch size, fixed input length, INT8 quantization. Run 10,000 identical requests through each accelerator. This is the NPU’s home turf.

Profile 2: Variable input lengths, batch size of one, same model, INT8. This measures real-world single-request latency where the input token count varies from 32 to 2,048.

Profile 3: Batched variable inputs. Input lengths vary per request, batch size of 16, INT8. This measures how the accelerators handle batching when the inputs are not all the same size.

Profile 4: FP16 on GPU versus INT8 on NPU. The practical trade-off between precision and hardware specialization. Every deployment team faces this decision, even though it is an unfair comparison (FP16 always costs more compute than INT8).

The models I tested were Qwen2.5-7B, Llama-3-8B, and a distilled 3B variant. All ran on the same hardware platform, where both an NPU (Qualcomm Hexagon) and a GPU (Adreno 750) were available, plus a discrete GPU option (an NVIDIA Ada Lovelace embedded module) for the datacenter edge profile.
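To make the profiles concrete, here is a minimal sketch of how they could be written down as data and used to generate request lengths. The `Profile` class and its field names are hypothetical, and the fixed-shape batch size (8) and length (512 tokens) are placeholder assumptions, as is reusing the 32 to 2,048 token range for the batched profile. Profile 4 is left out because it changes the precision per accelerator rather than the request shape.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """One deployment profile: batch size, input length range, precision."""
    name: str
    batch_size: int
    min_tokens: int
    max_tokens: int
    precision: str

    def sample_batch(self, rng):
        """Return one batch of input lengths, in tokens."""
        return [
            rng.randint(self.min_tokens, self.max_tokens)
            for _ in range(self.batch_size)
        ]


PROFILES = [
    # Batch size and length here are assumed placeholder values.
    Profile("fixed_shape", 8, 512, 512, "int8"),
    # Batch of one, lengths from 32 to 2,048 tokens.
    Profile("variable_single", 1, 32, 2048, "int8"),
    # Batch of 16; the length range is assumed to match the single-request one.
    Profile("variable_batched", 16, 32, 2048, "int8"),
]

rng = random.Random(0)  # fixed seed so runs are repeatable
for profile in PROFILES:
    print(profile.name, profile.sample_batch(rng))
```

Seeding the generator matters more than it looks: both accelerators should see the same sequence of lengths, otherwise you are comparing workloads as well as hardware.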

One fundamental architecture difference drives all the benchmark numbers.

The NPU has a narrow, specialized pipeline: data flows in, goes through fixed tensor accelerators, through matrix multiply units, and out. There is no branching, no conditional logic, no memory allocation overhead. It is a manufacturing assembly line for neural network computation.

The GPU has a wide, flexible pipeline: data flows in through general SIMD units, into a shared memory pool, through a dynamic allocator, and finally reaches the tensor cores if available. The allocator means the GPU can absorb unexpected tensor shapes. But every allocation costs cycles.

Profile 1: The NPU’s Home Field

I expected the NPU to win here. It did, by a meaningful margin.

Throughput on fixed-shape, INT8 inference was 2.3x higher on the NPU compared to the GPU running the same model quantized to INT8. The power consumption was roughly 60 percent of the GPU draw for the same computational throughput.

This is what the NPU was designed for. The hardware scheduler maps perfectly to the tensor shapes. The DMA engines move data without the overhead of general-purpose memory management. The matrix multiply units fire in lockstep with the hardware clock. Every cycle does useful work.

If you are building a device that runs one fixed model, on fixed input sizes, with a predictable batch size, the NPU is the right choice. Smart speakers doing keyword spotting. A security camera running face recognition. An industrial sensor doing anomaly detection on pre-sized time series.

The LLM world does not fit into this profile neatly. Not yet.

Profile 2: The Variable-Input Problem

This is where the GPU pulls ahead, and it is not close.

With variable input lengths ranging from 32 to 2,048 tokens, the NPU’s throughput dropped by 41 percent compared to the fixed-shape benchmark. The GPU’s throughput dropped by only 12 percent.

Why does this happen? An NPU’s internal memory scheduling is optimized for predictable access patterns. When the tensor shape changes between requests, the data has to be reshuffled, the internal buffers have to be reallocated, and the compute units either sit idle waiting for the new schedule to be loaded, or they run with suboptimal memory bandwidth.
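A toy model can show why a request stream with changing lengths hurts. This sketch only counts how often consecutive requests present a different shape; it says nothing about the real cost of each reconfiguration on any particular NPU. The `bucket` parameter, which rounds lengths up to a multiple of a fixed size, is an illustrative assumption and not something from the benchmark.

```python
def count_reconfigurations(lengths, bucket=None):
    """Count how often consecutive requests present a different shape.

    If bucket is given, each length is first rounded up to the next
    multiple of bucket, so nearby lengths share one shape.
    """
    def shape(n):
        if bucket is None:
            return n
        return -(-n // bucket) * bucket  # ceiling division, then scale back up

    changes = 0
    previous = None
    for n in lengths:
        current = shape(n)
        if previous is not None and current != previous:
            changes += 1
        previous = current
    return changes


stream = [100, 120, 128, 300, 310]
print(count_reconfigurations(stream))              # every request changes shape
print(count_reconfigurations(stream, bucket=128))  # shapes collapse to 128 and 384
```

Rounding up trades reconfigurations for padded compute, so it moves the cost rather than removing it.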

A GPU handles variable shapes the way it handles everything: it dynamically allocates memory for the new tensor dimensions and proceeds. There is a memory allocation overhead, yes. But it is small, and predictable, and not shape-dependent.

For an LLM serving system, variable input length is not an edge case. It is the default. Every user prompt is a different length. Every tool output is a different length. Every system prompt is a different length. The variable input profile is the actual production profile.

The GPU lost 12 percent of its throughput to shape variation. The NPU lost 41 percent.

Profile 3: Batching with Variable Inputs

This profile is the worst case for an NPU and the sweet spot for a GPU.

When you batch requests of different sizes, the accelerator has to either pad every request to the maximum size (wasting compute on padding) or schedule them individually anyway (wasting the batching efficiency). Modern NPUs handle this poorly. Their hardware scheduling assumes that all tensors in a batch occupy the same amount of memory, so mixed-size batching forces a fallback to sequential execution with internal reconfiguration between each request.
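The padding cost is easy to estimate before you touch any hardware. The sketch below computes the fraction of token positions that are pure padding when every request is padded to the longest one in the batch. The function name is hypothetical and the lengths are made-up examples.

```python
def padding_waste(lengths):
    """Fraction of computed token positions that are padding
    when every sequence is padded to the batch maximum."""
    if not lengths:
        raise ValueError("batch must contain at least one sequence")
    padded_positions = max(lengths) * len(lengths)
    return 1 - sum(lengths) / padded_positions


# One short and one long request: nearly half the work is padding.
print(padding_waste([32, 2048]))

# A uniform batch wastes nothing.
print(padding_waste([512] * 16))
```

Run this over a sample of your real request lengths. If the waste is routinely above a third, padding to the maximum is costing you more than most kernel optimizations will win back.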

GPUs handle mixed-size batching naturally through techniques like attention masking, variable-length attention kernels, and dynamic sequence packing. In the simplest case, the memory allocator figures out the largest sequence in the batch, allocates for that, and the attention masks handle the shorter sequences transparently.

In my benchmark, the GPU maintained 78 percent of its single-request throughput when batching 16 variable-length requests. The NPU dropped to 34 percent.

The NPU’s batching hardware is specialized for uniform batches. When you break that assumption, it reverts to a mode that is barely faster than running requests individually.

Profile 4: The Precision Question

This is the unfair comparison, and it is exactly what every deployment team faces.

The NPU runs INT8 naturally on its silicon. Running FP16 on an NPU requires converting the matrix multiply units to a different mode, which burns more power and often reduces throughput because the hardware was not optimized for that path.

The GPU runs FP16 natively and efficiently. It was designed for it. Running INT8 on a GPU means using the INT8 tensor-core path, and while throughput is comparable to or better than FP16 on many modern GPUs, INT8 has far less dynamic range and its numerical behavior is different.

The real question is: what precision does your model actually need?

Most production LLM workloads can be quantized to INT8 with negligible quality degradation. The accuracy deltas on standard benchmarks are typically under 1 percent for 8-bit quantization. The performance gains from using a specialized INT8 accelerator like an NPU often more than compensate for that tiny accuracy loss.

But some workloads cannot. Long-form generation, code generation, and complex reasoning tasks sometimes show measurable degradation when you push from FP16 to INT8. In those cases, you are forced to run FP16, and the NPU’s efficiency advantage evaporates.
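To see where the INT8 error comes from, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is one common scheme chosen for illustration, not necessarily what any vendor runtime does, and the function names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The round-trip error per value is bounded by half a quantization step.
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, worst)
```

The step size is set by the largest magnitude in the tensor, so a single outlier coarsens the grid for every other value. That is one plausible reason some workloads degrade more than the benchmark averages suggest.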

The Framework Overhead Factor

I need to address something that every benchmark suite gets right in the lab but gets wrong in production.

The framework matters enormously. Running an NPU is not the same as running a GPU because the software stacks are very different.

For GPUs, the path is mature. CUDA, cuDNN, TensorRT, vLLM, TGI, TensorRT-LLM. You have a comprehensive set of tools that have been battle-tested across millions of deployment profiles. The framework knows the hardware inside out. It schedules memory, optimizes kernels, handles precision conversion, and manages batching before you write a single line of your own code.

For NPUs on edge devices, the software stack is fragmented. Qualcomm has its AI Engine Direct (QNN) SDK and its NPU runtime. Apple has Core ML. Google has TFLite (now LiteRT). Each one has its own model formats, its own optimization passes, its own limitations. You pick the NPU, and you are locked into that ecosystem’s tooling for the lifetime of the device.

I ran the same Qwen2.5-7B INT8 model through the GPU with TensorRT-LLM and through the NPU with the vendor’s own runtime. On the benchmarks that the NPU vendor provides, the NPU wins. When I ran it with my own measurement harness that measures real end-to-end token generation including framework overhead, the gap narrowed dramatically.

The NPU’s theoretical efficiency advantage was 2.3x. After framework overhead, the measured wall-clock advantage was 1.6x. You still save power and get higher throughput. But the gap is smaller than the spec sheet suggests.
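A wall-clock harness does not need to be elaborate. The sketch below is a minimal version of the idea, not the harness used for these numbers: `generate` is a hypothetical callable that wraps whatever runtime you are testing, runs one prompt end to end, and returns the number of tokens it produced.

```python
import time


def measure_tokens_per_second(generate, prompts, warmup=1):
    """Time end-to-end generation, including everything the framework
    does around the kernels (tokenization, scheduling, memory setup)."""
    # Warm-up runs absorb one-time costs such as model loading and compilation.
    for prompt in prompts[:warmup]:
        generate(prompt)

    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        total_tokens += generate(prompt)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


def fake_generate(prompt):
    """Stand-in for a real runtime call: pretends to emit 10 tokens."""
    time.sleep(0.001)
    return 10


print(measure_tokens_per_second(fake_generate, ["a", "b", "c"]))
```

The important design choice is where the timer starts and stops: around the whole call, so that framework overhead lands inside the measurement instead of outside it.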

Edge Devices: The Real Constraint

Let me be blunt about the fundamental constraint. Edge inference is not a compute problem. It is a thermal and power problem.

You have a device with a specific thermal envelope. If the GPU draws 15 watts sustained and your device can dissipate 5 watts of AI compute heat, the GPU will thermal-throttle to somewhere near that 5-watt limit, which destroys its performance advantage. The NPU, designed for low power, draws 3 watts and stays within the thermal envelope, maintaining full performance.
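The throttling argument can be put into a few lines of arithmetic. This sketch assumes, crudely, that sustained performance scales linearly with the power the device is allowed to draw. Real silicon does not behave that linearly, so treat the output as a rough bound, not a prediction. The function name is hypothetical.

```python
def sustained_fraction(draw_watts, envelope_watts):
    """Fraction of peak performance an accelerator can sustain,
    under the crude assumption that performance scales linearly with power."""
    if draw_watts <= 0 or envelope_watts <= 0:
        raise ValueError("power values must be positive")
    return min(1.0, envelope_watts / draw_watts)


# Numbers from the example above: 15 W GPU, 3 W NPU, 5 W thermal envelope.
print(sustained_fraction(15, 5))  # GPU runs at about a third of its peak
print(sustained_fraction(3, 5))   # NPU fits the envelope and runs at full speed
```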

In constrained thermal environments, the NPU wins by default. You cannot run a GPU at its full spec, so the GPU’s flexibility advantage is irrelevant if the silicon is throttling down to NPU-level power.

But this also means the NPU’s advantage comes from its own limitations, not from superior capability. The NPU wins the thermal race because it is less powerful. If the thermal envelope expands even slightly, the GPU reclaims the advantage.

When to Use What

I want to give you a practical decision framework, because the answer is always “it depends.”

Use an NPU when: the workloads are fixed-shape, the thermal envelope is tight, power consumption is the dominant constraint, and your model can tolerate INT8 quantization. Edge devices like phones, IoT sensors, and smart displays are the natural home for NPU inference.

Use a GPU when: the input shapes are variable, you need flexibility across different model sizes or architectures, you care about FP16 precision, you need mixed-batch inference, and the thermal power envelope is not the primary constraint. Edge servers, autonomous vehicles, and dedicated inference appliances are where GPUs earn their keep.

Use both when: you are building a system with multiple inference paths and want to match workload profiles to the right accelerator. Route fixed-shape classification to the NPU. Route variable-length generative tasks to the GPU. This is what Apple does with its Neural Engine, and it is the right approach for any heterogeneous device.
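The routing rule can be sketched as a small function. This is a deliberately simplified illustration of the decision framework, reduced to two criteria. The `Workload` fields and `choose_accelerator` are hypothetical names, and a real router would also weigh thermal headroom and current load.

```python
from dataclasses import dataclass
from enum import Enum


class Accelerator(Enum):
    NPU = "npu"
    GPU = "gpu"


@dataclass(frozen=True)
class Workload:
    fixed_shape: bool     # same tensor shapes on every request
    tolerates_int8: bool  # quality holds up under INT8 quantization


def choose_accelerator(workload):
    """Send fixed-shape, INT8-tolerant work to the NPU; everything else to the GPU."""
    if workload.fixed_shape and workload.tolerates_int8:
        return Accelerator.NPU
    return Accelerator.GPU


classifier = Workload(fixed_shape=True, tolerates_int8=True)
chat = Workload(fixed_shape=False, tolerates_int8=True)
print(choose_accelerator(classifier))
print(choose_accelerator(chat))
```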

The Coming Shift

The landscape is not static. Hardware vendors are actively working on bridging the gap. NPUs are getting more flexible attention units that handle variable sequence lengths more gracefully. GPUs are getting dedicated INT8 matrix multiply pipelines that close the efficiency gap. Mixed precision hardware is becoming standard, so the FP16-versus-INT8 distinction is fading.

But the fundamental trade-off will remain. Specialization gives you efficiency at the cost of flexibility. Generality gives you flexibility at the cost of efficiency. Any benchmark that pretends otherwise is selling you something.

Measure what matters. Run your actual workload, your actual model, your actual precision mode, through the actual hardware, and time the wall clock. Not the FLOPS rating. The wall clock. The tokens per second. The joules per token.
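Power and throughput combine into a single energy figure: average power in watts divided by throughput in tokens per second gives joules per token. A minimal sketch, where the function name is hypothetical and the input numbers are illustrative and not taken from the benchmarks:

```python
def joules_per_token(average_watts, tokens_per_second):
    """Energy spent per generated token: watts are joules per second,
    so dividing by tokens per second leaves joules per token."""
    if tokens_per_second <= 0:
        raise ValueError("throughput must be positive")
    return average_watts / tokens_per_second


# Illustrative numbers only: a 3 W accelerator producing 20 tokens per second.
print(joules_per_token(3.0, 20.0))
```

Measure the power at the wall or the battery rail over the same window you timed, so both numbers describe the same run.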

Those are the numbers that show up on your infrastructure bill.
