Why the NPU Has Become One of the Hottest Chips in the AI Wave

Recently, Qualcomm launched three flagship chips: the Snapdragon 8 Elite Gen 5 mobile platform (its next-generation flagship mobile SoC), the Snapdragon X2 Elite Extreme, and the Snapdragon X2 Elite PC processor, all built on 3nm process technology.

The third-generation Oryon CPU is designed with AI applications in mind and works with Qualcomm’s Hexagon NPU to schedule machine learning tasks efficiently and drive AI agents. In addition, hardware matrix acceleration allows AI models to be deployed directly on the CPU, further improving flexibility and processing speed.

According to HotHardware, the Snapdragon 8 Elite Gen 5 demonstrated an “absolute leading advantage” on Geekbench 6, “easily defeating the Apple A18 Pro” and performing comparably to the Apple M4 processor in the iPad Pro.

Combined with CPU and GPU optimizations, the Snapdragon 8 Elite Gen 5 can intelligently manage various workloads and further raise NPU performance to support generative AI, image processing, real-time translation, AI-accelerated photography, and video applications. For example, this NPU can support real-time motion capture in NetEase games, synchronizing the actions of real people and virtual characters almost perfectly.

We often hear about CPUs and GPUs in everyday use, but what exactly is an NPU, and how does it differ from a GPU? This article gives a general overview.

What is an NPU

A Neural Processing Unit (NPU) is a specialized processor optimized for neural networks and artificial intelligence workloads.

NPUs have two main technical characteristics.

The first is that they model the way human neural networks operate: they excel at parallel processing and allocate “task flows” within the chip so that fewer computational resources sit idle.

The second is the integration of storage and computation, achieved through “near-memory computing” (placing the processor close to DRAM to reduce data transfer latency and power consumption) or “in-memory computing” (moving simple logical operations into the memory arrays). This reduces energy consumption during computation and shortens access times, which improves the speed and efficiency of AI computations.
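To make the data-movement argument concrete, here is a minimal sketch of a toy energy model in Python. It is not a measurement of any real chip: the function name `inference_energy` and every cost value are hypothetical placeholders in arbitrary units, chosen only to show how a lower per-byte transfer cost changes the total when the amount of computation stays the same.

```python
def inference_energy(num_macs, bytes_moved, energy_per_mac, energy_per_byte):
    """Toy model: total energy = compute energy + data-movement energy.

    All quantities are in arbitrary units; nothing here is measured data.
    """
    return num_macs * energy_per_mac + bytes_moved * energy_per_byte


if __name__ == "__main__":
    # Hypothetical workload: one million multiply-accumulates,
    # one million bytes of weights and activations moved.
    macs = 1_000_000
    moved = 1_000_000

    # Placeholder costs (assumptions): moving a byte from distant memory is
    # taken to be far more expensive than moving it from memory placed
    # next to the compute units.
    far_memory = inference_energy(macs, moved, energy_per_mac=1, energy_per_byte=100)
    near_memory = inference_energy(macs, moved, energy_per_mac=1, energy_per_byte=10)

    print("far memory :", far_memory)
    print("near memory:", near_memory)
```

In this toy setup the arithmetic work is identical in both cases, so the whole difference comes from the transfer term, which is the effect near-memory and in-memory designs aim for.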

Compared to CPUs and GPUs, the low-power, high-performance NPU is particularly suitable for handling AI inference tasks, including image recognition, natural language processing, object detection, and other applications.

In recent years, major PC and mobile chip manufacturers have incorporated NPUs into their chip designs to enhance AI computing capabilities.

NPU vs. GPU

Although GPUs are strong at parallel computing, they do not operate independently and need a CPU to work with them: the construction of the neural network model and the data flow are still handled on the CPU. GPUs are also power-hungry and physically large. The higher the performance, the larger the GPU and the higher its power consumption, which makes it impractical for small devices and mobile applications. This is why the NPU, a compact, low-power, high-performance dedicated chip, emerged.

An NPU works by simulating human neurons and synapses at the circuit level. It processes large numbers of neurons and synapses directly through a deep learning instruction set, so that a single instruction can complete the processing of a whole group of neurons. Unlike CPUs and GPUs, NPUs integrate storage and computation through synaptic weights, which improves operational efficiency.

CPUs and GPUs require thousands of instructions to complete neuron processing, while NPUs can accomplish this with just one or a few instructions, giving them a significant advantage in deep learning processing efficiency. Experimental results are reported to show that, at the same power consumption, NPU performance can reach 118 times that of a GPU, although such figures depend heavily on the workload and the chips being compared.
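The instruction-count argument can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two functions below are hypothetical, the "instruction" counter is a simplified tally (one per multiply-accumulate, bias add, and activation), and `layer_instruction` merely stands in for the idea of a single layer-level instruction rather than modeling any real NPU instruction set.

```python
def scalar_neuron_layer(inputs, weights, biases):
    """Evaluate a layer one scalar step at a time, as a scalar core would.

    Returns (outputs, instruction_count), counting each multiply-accumulate,
    bias add, and ReLU activation as one issued instruction.
    """
    outputs = []
    instructions = 0
    for neuron_weights, bias in zip(weights, biases):
        acc = 0.0
        for x, w in zip(inputs, neuron_weights):
            acc += x * w                 # one multiply-accumulate
            instructions += 1
        acc += bias                      # one bias add
        instructions += 1
        outputs.append(max(0.0, acc))    # one ReLU activation
        instructions += 1
    return outputs, instructions


def layer_instruction(inputs, weights, biases):
    """Hypothetical layer-level instruction: same arithmetic, issued once."""
    outputs, _ = scalar_neuron_layer(inputs, weights, biases)
    return outputs, 1


if __name__ == "__main__":
    # A small made-up layer: 64 inputs feeding 32 neurons.
    inputs = [1.0] * 64
    weights = [[0.5] * 64 for _ in range(32)]
    biases = [0.0] * 32

    _, scalar_count = scalar_neuron_layer(inputs, weights, biases)
    _, layer_count = layer_instruction(inputs, weights, biases)
    print(scalar_count, layer_count)  # 2112 versus 1
```

Even this small layer takes 32 × (64 + 2) = 2112 scalar steps, so "thousands of instructions" is reached quickly. The amount of arithmetic is the same in both paths; what changes is how many instructions have to be fetched, decoded, and issued to drive it.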

Comparison Item	GPU	NPU
Architectural Differences	General parallel computing architecture that requires CPU collaboration for task processing. The construction of neural network models and data flow still occurs on the CPU.	Simulates human neurons and synapses at the circuit level, capable of directly processing large numbers of neurons and synapses using a deep learning instruction set.
Computational Efficiency Comparison – Instruction Efficiency	CPUs and GPUs require thousands of instructions to complete neuron processing.	NPUs can complete neuron processing with just one or a few instructions.
Computational Efficiency Comparison – Energy Efficiency Ratio	Under the same power consumption, general-purpose GPUs have significantly lower AI computing performance than dedicated NPUs.	Under the same power consumption, dedicated NPUs can achieve AI computing performance 10-50 times that of equivalent GPUs.
Computational Efficiency Comparison – Memory Access	Due to the lack of a special storage-computation integration mechanism, memory access overhead is high.	NPUs achieve storage-computation integration through synaptic weights, greatly reducing memory access overhead.
Application Scenario Differences	Advantageous Scenarios: General computing, graphics rendering, large-scale training.	Advantageous Scenarios: Edge device inference, low-power scenarios, real-time AI applications.
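
As a rough worked example of the energy efficiency row, the sketch below turns the 10-50 times figure from the table into two ranges: the throughput an NPU would deliver at the GPU's power budget, and the power it would need to match the GPU's throughput. The function names and the sample baseline values are hypothetical, and the calculation assumes the ratio holds uniformly and that performance scales linearly with power, which real chips will not follow exactly.

```python
def npu_throughput_range(gpu_throughput, low_ratio=10, high_ratio=50):
    """Throughput range for an NPU running at the same power as the GPU."""
    return gpu_throughput * low_ratio, gpu_throughput * high_ratio


def npu_power_range(gpu_power_watts, low_ratio=10, high_ratio=50):
    """Power range an NPU would need to match the GPU's throughput.

    Assumes linear scaling of performance with power (a simplification).
    """
    return gpu_power_watts / high_ratio, gpu_power_watts / low_ratio


if __name__ == "__main__":
    # Hypothetical baseline: a GPU delivering 4 units of AI throughput at 100 W.
    print(npu_throughput_range(4))   # (40, 200) units at the same 100 W
    print(npu_power_range(100))      # (2.0, 10.0) watts for the same 4 units
```

Read this way, the same ratio explains both the "higher performance at equal power" claim and the suitability of NPUs for battery-powered edge devices.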

Characteristics of Different Processing Units

CPU – About 70% of transistors are used to build cache and control units. It has fewer computational units and is suited to logical control operations.

GPU – Most transistors are used to build computational units, each of low computational complexity, which suits large-scale parallel computing. Mainly used for big data, backend servers, and image processing.

NPU – Simulates neurons at the circuit level, achieving storage-computation integration through synaptic weights. A single instruction completes the processing of a group of neurons, improving operational efficiency. Mainly used in communication, big data, and image processing.

FPGA – Programmable logic, high computational efficiency, closer to low-level IO. Achieves logical editability through redundant transistors and connections. Essentially instructionless, does not require shared memory, and has higher computational efficiency than CPUs and GPUs. Mainly used in smartphones, portable devices, and automobiles.

Type	Characteristics	Applications
CPU	About 70% of transistors are used to build cache and control units. Fewer computational units, suitable for logical control operations. Highly versatile but low AI computing efficiency.	General computing, device control
GPU	Transistors are mainly used to build computational units. Low computational complexity, suitable for large-scale parallel computing.	Big data, backend servers, image processing
NPU	Simulates neurons at the circuit level, achieving storage-computation integration through synaptic weights. A single instruction can complete the processing of a group of neurons, improving operational efficiency.	Communication, big data, image processing, edge AI
FPGA	Programmable logic, high computational efficiency, closer to low-level IO. Achieves logical editability through redundant transistors and connections. Essentially instructionless, does not require shared memory, and has higher computational efficiency than CPUs and GPUs.	Smartphones, portable devices, automobiles

Practical Applications of NPU

• AI scene recognition and image processing during photography
• Identifying light sources and synthesizing dark-area detail for super night scene shots
• Voice assistant operations
• Working with GPU Turbo to predict and pre-render the next frame, making games run more smoothly
• Predicting touch operations to improve responsiveness and sensitivity
• Assessing differences in front-end and back-end network speed requirements and optimizing network connections through Link Turbo technology
• Intelligently adjusting resolution based on game rendering load
• Offloading AI computation during gaming to save power
• Dynamic scheduling of the CPU and GPU
• Assisting with targeted advertising based on big data
• Intelligent word association in input methods

Definitions of Various Processing Units

• APU: Accelerated Processing Unit, AMD’s product line that combines a CPU and a GPU on one chip to accelerate graphics and image processing
• BPU: Brain Processing Unit, Horizon’s embedded processor architecture
• CPU: Central Processing Unit, mainstream product at the core of PCs
• DPU: Dataflow Processing Unit, AI architecture proposed by Wave Computing
• FPU: Floating Point Processing Unit, floating-point module in general processors
• GPU: Graphics Processing Unit, multi-threaded SIMD architecture designed for graphics processing
• HPU: Holographic Processing Unit, Microsoft’s holographic computing chip and device
• IPU: Intelligence Processing Unit, AI processor product from Graphcore (DeepMind investment)
• MPU/MCU: Microprocessor/Microcontroller Unit, typically RISC-architecture products used in applications with low computing demands
• NPU: Neural Network Processing Unit, a new type of processor built to accelerate neural network algorithms
• TPU: Tensor Processing Unit, Google’s processor specifically designed to accelerate artificial intelligence algorithms
• VPU: Vision Processing Unit, a chip line from Movidius (since acquired by Intel) focused on accelerating image processing and artificial intelligence

Disclaimer: Some content is sourced from the internet for educational and communication purposes. The copyright of the article belongs to the original author. If there are any issues, please contact for removal.
