In-Depth Analysis of Intel Meteor Lake NPU: Architecture, Performance, and Limitations

Abstract

This article provides an in-depth analysis of the neural processing unit (NPU) integrated into Intel’s Meteor Lake mobile processor. Codenamed “NPU 3720,” this NPU is based on Movidius technology acquired by Intel and is designed to handle machine learning workloads with higher energy efficiency. The article discusses its high-level architecture, neural compute engine design, memory subsystem, link bandwidth, and real-world application performance (such as Stable Diffusion). The analysis indicates that the NPU has energy efficiency advantages in specific scenarios, but it has limitations in performance, software ecosystem support, and the variety of data types, especially when compared to integrated GPUs and discrete GPUs.

1. Overview
2. High-Level Architecture
3. Neural Compute Engine Architecture
4. Memory Subsystem
5. Link Bandwidth
6. Stable Diffusion v1.5 Performance Testing
7. Conclusion and Outlook
8. References

1. Overview

AI is a hot topic currently, and to avoid falling behind, Intel has integrated a neural processing unit (NPU) into its Meteor Lake mobile processor. This NPU is referred to internally at Intel as “NPU 3720,” although this name does not appear in public market materials. This analysis is based on the Intel Core Ultra 7 155H processor installed in the ASUS Zenbook 14. The NPU 3720 features a large number of execution units and operates at a relatively low clock frequency of 1.16 GHz, aiming to process machine learning workloads with better energy efficiency than other hardware.

In-Depth Analysis of Intel Meteor Lake NPU: Architecture, Performance, and Limitations — Intel Tech Tour Slides

This article is relatively short because a deep exploration of the NPU requires extensive testing, writing, validation, and research work. For a non-profit project conducted in spare time, the workload for a deep analysis of the NPU is too large.

2. High-Level Architecture

Intel’s NPU design is derived from Movidius’s solution, which initially took the form of a programmable digital signal processor (DSP). After acquiring Movidius in 2016, Intel expanded the existing platform to include a large-scale multiply-accumulate (MAC) array. AI workloads involve a significant amount of matrix multiplication, and MAC units are well-suited for such computations. The MAC array of the NPU 3720 is distributed across two Neural Compute Engine (NCE) tiles, capable of executing 4096 INT8 multiply-accumulate operations per cycle, achieving a peak performance of 9.5 TOPS at the lower frequency of 1.16 GHz.

Similar to a GPU, the NPU connects to the system as a PCIe device, with commands issued by the host. Intel did not design a dedicated command processor but instead retained Movidius’s 32-bit LEON microcontroller. The LEON uses the SPARC instruction set and runs a real-time operating system. One core, named “LeonRT,” is responsible for NPU initialization and host command parsing; the other core, “LeonNN,” dispatches tasks to the NCE compute resources, acting as a low-level hardware scheduler. Both LEON cores have independent caches. The NPU also integrates Movidius’s SHAVE (Streaming Hybrid Architecture Vector Engine) DSP, which operates in parallel with the MAC array to handle machine learning steps that cannot be mapped to the MAC array.

The core of accelerator design is to tightly couple hardware to specific tasks, a principle that is also reflected in the NPU’s memory hierarchy. Each NCE is equipped with 2 MB of software-managed SRAM; since it is not a cache, the NPU does not need to store tags or state arrays to track data locations. Access requests can read data directly from the SRAM storage, eliminating the need for tag matching and virtual address translation steps. However, the lack of a caching mechanism places a greater burden on the compiler and software, which must explicitly move data into SRAM.

At the system level, the NPU is integrated alongside the integrated GPU and CPU cores through Meteor Lake’s Scalable Fabric and shares the LPDDR5 memory subsystem with other modules. The operating system driver accesses the NPU as a PCIe device, with an access mechanism consistent with other peripherals like the integrated GPU.

3. Neural Compute Engine Architecture

The MAC array of a single NCE tile is further divided into 512 MAC Processing Engines (MPE), with each MPE capable of executing 4 INT8 multiply-accumulate operations per cycle; the throughput for FP16 MAC is half that of INT8.

Details on the SHAVE DSP core were not obtained, but tests conducted on the OpenCLOn12 platform indicate that its FP32 throughput is slightly above 50 GFLOPS. Since the DPU does not support FP32, this result should come from the SHAVE core. The SHAVE architecture, announced by Movidius in 2011, features a single 128-bit vector data path, with a theoretical value of 37 GFLOPS at 1.16 GHz; Intel may have subsequently iterated on SHAVE to enhance vector throughput. Nevertheless, the SHAVE core does not handle primary computational tasks at the DPU level but is used to support extended data types like FP32, transcendental functions, and data type conversions. However, compared to CPUs or GPUs, its data type support remains limited, with complete lack of FP64 support, which can be a significant obstacle in certain scenarios.

LevelZeroCompilerInDriver: Compilation of the network failed. Error code: 2013265924. Unable to find FP64 for the time stepIntel OpenVINO, error when attempting to compile the UNET model for Stable Diffusion on the NPU

Running the MatMul test using the Intel NPU acceleration library, I measured an FP16 throughput of 1.35 TFLOPS, while the theoretical peak is 4.7 TFLOPS, indicating a significant gap. Small matrix sizes suffer from efficiency losses due to task startup overhead; the CPU vector unit has nearly zero startup latency, and data is often already resident in cache, giving it an advantage in small matrix scenarios. As the matrix size increases, NPU utilization improves, but it is still limited by memory bandwidth; at this point, Meteor Lake’s integrated GPU can compete strongly with the NPU even at FP32 precision.

At a matrix size of 7168×7168, performance drops sharply. VTune shows that the NPU’s memory bandwidth is very low, but no other metrics are available. It is possible that the NPU is optimized for smaller models, and when memory usage becomes too large, TLB misses occur. Alternatively, when data overflows SRAM, the NPU may be limited by DMA latency.

GPUs have a large number of fused multiply-accumulate units, as graphics rendering also involves matrix multiplication. With a larger power budget, fast caches, and high memory bandwidth, even older discrete GPUs like NVIDIA’s GTX 1080 outperform the NPU significantly.

Newer GPUs, such as AMD’s RX 6900XT, operate at a completely different performance level. GPU performance scales well with shrinking process nodes and increased power budgets, so even without tensor cores, the RX 6900 XT can deliver massive matrix multiplication throughput.

4. Memory Subsystem

The NPU can access 128 KB of storage space with relatively low latency. The latency is 16.02 ns, slightly lower than the RDNA 2 scalar cache (17.8 ns), and falls between the L1 cache latencies of NVIDIA’s Ampere and Ada Lovelace architectures. It is unclear how the NPU completed this latency test, but it is speculated that its code runs on the SHAVE DSP.

Although the latency test suggests the presence of a cache, its behavior is entirely different from conventional caches. Changing the stride of pointer chasing did not affect the results until the stride exceeded 128 KB. If described in cache terms, it is equivalent to having only a single 128 KB cache line. Furthermore, it is both direct-mapped and fully associative.

It is more likely that we are not facing a traditional cache but rather a 128 KB high-speed storage configured next to the load/store unit. When an access miss occurs, the software is responsible for one-time moving 128 KB of data to fill it. The 237 ns latency measured at 512 KB is likely due to the NPU loading a 128 KB block from the NCE’s scratchpad at over 500 GB/s; in contrast, DRAM latency is close to 1 µs, making it appear extremely slow. Intel’s data manual indicates that requests undergo address translation via the MMU before leaving the device.

OpenCL’s local memory corresponds better to the NPU’s scratchpad memory, as both are non-coherent, directly-addressable storage spaces. The NPU again demonstrates excellent latency characteristics; while the CPU’s latency is certainly lower, the NPU is now on par with the latest generation of GPUs.

5. Link Bandwidth

Similar to the integrated GPU, the NPU should be able to quickly access the CPU’s memory space. Data copying between the NPU memory and CPU memory does not need to go through the PCIe bus. However, when using OpenCL’s clEnqueueWriteBuffer and clEnqueueReadBuffer commands to transfer data, the NPU’s performance lags far behind that of the integrated GPU.

Meteor Lake’s integrated GPU can exceed 19 GB/s when moving data from CPU memory space to GPU memory space. The NPU, however, cannot break through 10 GB/s. This is either due to its connection to the Meteor Lake Scalable Fabric being narrow or its DMA engine capabilities being weak.

6. Stable Diffusion v1.5 Performance Testing

Stable Diffusion generates images from text prompts: starting from random noise, it iteratively denoises to the final image using UNET. The number of UNET steps is significant, and even on high-end hardware, it takes considerable time, making it a typical target for hardware acceleration.

Getting Stable Diffusion to run on the NPU is not easy. I first attempted to modify the OpenVINO Stable Diffusion pipeline to adapt it for NPU compilation:

• UNET compilation failed due to FP64 operations
• The VAE decoder reported an error due to lack of support for a certain operator:

Cannot create ScaledDotProductAttention layer __module.vae.decoder.mid_block.attentions.0/aten::scaled_dot_product_attention/ScaledDotProductAttention id:111 from unsupported opset: opset13Intel OpenVINO, error when attempting to compile the VAE decoder for the NPU

Subsequently, I attempted to modify the ONNX version of the Stable Diffusion pipeline to call the NPU via DirectML. However, the NPU is a special case, and the process remains cumbersome. Typically, applications need to first create a DirectX12 device based on a DXGI (DirectX Graphics Infrastructure) device, then use DirectML; however, the NPU lacks graphics rendering or display output capabilities, so it is not a DXGI device. Posts on StackOverflow regarding setting the ONNX device_id did not resolve the issue.Thus, I turned to study the ONNX source code^[1] and found that the device filter needed to be configured instead of device_id. However, the target code segment had been conditionally compiled (#ifdef<code>) out. After spending several hours trying to recompile ONNX from source with the relevant macros enabled, I ultimately gave up.

Eventually, I found the OpenVINO GIMP plugin released by Intel, which has implemented Stable Diffusion functionality. It loads a precompiled opaque blob rather than compiling the public model on the fly. It can be inferred that Intel internally generated this blob to bypass the NPU’s limitations regarding unsupported operators and FP64; at the same time, the model was converted to INT8 to run at full speed on the NPU (the original Stable Diffusion defaults to FP32). Since users cannot generate blobs themselves, they cannot use newer models like Stable Diffusion 2.0 and can only run version 1.5 to generate 512×512 images on the NPU.

The UNET inference of Stable Diffusion can be split into two steps: UNET+ and UNET-, with the latter used for negative prompts. The GIMP plugin provided by Intel defaults to executing UNET+ on the NPU and UNET- on the integrated GPU. I modified the plugin code to test both the “full NPU” and “full integrated GPU” paths. For comparison, I also ran the unmodified Stable Diffusion 1.5 model using Olive+DirectML on Meteor Lake’s integrated GPU and a discrete GPU to assess overall performance differences.

The measured results align with Intel’s official slides: Meteor Lake’s integrated GPU easily surpasses the NPU. My “pure GPU” score was 1.38 iterations/s, which closely matches the official figure of 20 iterations/14.5 s (or 1.379 iterations/s). Even when loading the full FP32 model, the integrated GPU is still 62% faster than the NPU; if the same INT8 model is run on the integrated GPU, the lead increases to 261%. As for discrete GPUs, taking the AMD RX 6900 XT as an example, even at FP32 precision, its speed is 6.7 times that of the NPU.

7. Conclusion and Outlook

Since the dawn of computing, engineers have continuously designed various accelerators. When hardware is closely matched to specific tasks, performance or power savings can be achieved. The NPU in Meteor Lake is designed for machine learning workloads: its core supports INT8 and FP16, with only symbolic support for FP32 and a complete abandonment of FP64; its memory subsystem is also simplified accordingly, reducing overhead through predictable data movement patterns in machine learning. Overall, this NPU reflects the value of Intel’s past acquisitions and demonstrates the flexibility of the Meteor Lake Scalable Fabric—allowing for more freedom in the positioning and number of integrated accelerators compared to the older Sandy Bridge architecture.

However, the NPU still has significant shortcomings. Accelerators inherently lack the flexibility of more general designs: while no one expects the NPU to handle general computing, the limited data type support of the NPU 3720 directly hinders the operation of certain ML models. Custom accelerators need to build a dedicated software ecosystem around them—deploying Stable Diffusion on the NPU is both cumbersome and limited. By 2024, ordinary consumers may be accustomed to multiple programming languages and willing to browse open-source code, but they expect to directly call ready-made models rather than obtaining opaque blobs out of thin air. In other scenarios, the NPU may perform better; I will further validate this when I have more time and resources.

For runnable models, the NPU only brings power savings and does not necessarily lead to performance improvements. Meteor Lake reports power consumption for the CPU cores, integrated GPU, and System Agent separately; the NPU belongs to the System Agent, with a maximum power consumption of just over 7 W, while the integrated GPU can reach 20 W. However, the integrated GPU achieves performance and flexibility that the NPU cannot match. If machine learning workloads do not need to be executed in mobile scenarios, discrete GPUs can provide orders of magnitude performance leaps. The NPU may have value in specific contexts, but the marketing label of “AI PC” is clearly disconnected from reality.

Consider the AI PC as a super personal computer equipped with the right hardware and software to handle AI…ZDNET article “What is an AI PC?”^[2]

As long as there is sufficient video memory, any PC equipped with a mid-range discrete graphics card can handle AI tasks. GPUs already have a mature AI software stack and can support a wider range of models without additional hassle than the NPU. I can directly run Stable Diffusion 2.0 to generate 768×768 high-resolution images with GPU acceleration, still faster than the NPU generating 512×512 images with a mysterious INT8 blob. In my view, any enthusiast-level gaming desktop is already an “AI PC”; as for the ASUS Zenbook 14, I would gladly call it an “AI PC” regardless of whether it has an NPU.

The NPU in Meteor Lake is an intriguing accelerator, but its applicable scenarios and range of benefits are quite limited. If I relied on AI tasks daily, I would run ready-made models directly on the integrated GPU—higher performance and less deployment time. The NPU only makes sense when extreme battery life extension is required, and I hardly work in a state without power; nowadays, even economy class seats come with power outlets. I hope Intel continues to iterate on both hardware and software to broaden the NPU’s application boundaries. Over the past fifteen years, GPU computing has evolved to a point of considerable ease of use; witnessing workloads being offloaded to low-power modules remains exciting, and I look forward to the NPU experiencing a similar evolution.

8. References

• Intel Core Ultra Processors (PS Series) Datasheet
• https://techcrunch.com/2016/09/05/intel-buys-computer-vision-startup-movidius-as-it-looks-to-build-up-its-realsense-platform/

This article is translated and reconstructed from Intel Meteor Lake’s NPU^[3]

Citation Links

[1] ONNX Source Code: https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/dml/dml_provider_factory.cc[2] “What is an AI PC?”: https://www.zdnet.com/article/what-is-an-ai-pc-exactly-and-should-you-buy-one-in-2025/[3] Intel Meteor Lake’s NPU: https://chipsandcheese.com/p/intel-meteor-lakes-npu

In-Depth Analysis of Intel Meteor Lake NPU: Architecture, Performance, and Limitations

Abstract

Table of Contents

1. Overview

2. High-Level Architecture

3. Neural Compute Engine Architecture

4. Memory Subsystem

5. Link Bandwidth

6. Stable Diffusion v1.5 Performance Testing

7. Conclusion and Outlook

8. References

Citation Links

Leave a Comment Cancel reply

Abstract

Table of Contents

1. Overview

2. High-Level Architecture

3. Neural Compute Engine Architecture

4. Memory Subsystem

5. Link Bandwidth

6. Stable Diffusion v1.5 Performance Testing

7. Conclusion and Outlook

8. References

Citation Links

Related posts

Leave a Comment Cancel reply