Getting Started with NPU: Essential Modules for Learning

With the current popularity of AI chips, many IC engineers are eager to learn about NPU (Neural Processing Unit) but are unsure where to start.

This article will discuss various dedicated modules for AI chips, providing some direction for your learning.

Dedicated Modules for AI Chips

1. NPU Core Module (Neural Processing Unit)

Definition: A hardware unit that performs large-scale matrix/vector computations.

Features: Processes convolution, fully connected, pooling, and other layers in parallel.

Uniqueness: Not found in standard SoCs, specifically designed for AI algorithms.

Submodules:

  • Systolic Array / MAC Array: An efficient parallel matrix multiplication engine.
  • Quantization Unit: Supports precision conversion such as INT8, BF16.
  • Activation Unit: Hardwired evaluators for activation functions such as ReLU and GELU.
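
To make "hardwired activation" concrete, here is a minimal behavioral sketch in Python (a golden-model view, not any vendor's design): ReLU reduces to a compare-and-select, while GELU is usually realized as a lookup table or piecewise approximation of the well-known tanh formula.

```python
import math

def relu(x: float) -> float:
    # In hardware: a comparator and a mux.
    return x if x > 0.0 else 0.0

def gelu_tanh_approx(x: float) -> float:
    # Common tanh-based GELU approximation; real NPUs typically
    # implement this with a lookup table or piecewise-linear fit.
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3)))
```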

2. Dedicated Instruction Controller (AI Scheduler / Controller)

Function: Schedules and controls the AI instruction flow (similar to a scheduler in microarchitecture).

Features: Can parse the “graph instruction stream” or “layer instructions” issued by the AI compiler.

Uniqueness: Optimized scheduling for AI models, not a general-purpose CPU scheduler.

3. Weight & Feature Buffer (On-chip High Bandwidth SRAM)

Function: Caches input feature maps, weights, and intermediate results.

Features: On-chip multi-bank, high concurrency, low latency.

Uniqueness: Designed to support high throughput for systolic/MAC units, distinct from general caches.
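
As an illustration of "multi-bank, high concurrency", here is a toy Python model of low-order bank interleaving, one common (but here purely assumed) scheme: consecutive addresses rotate across banks, so parallel MAC lanes can read in the same cycle without conflicts.

```python
NUM_BANKS = 8  # illustrative; real buffers match bank count to MAC lanes

def bank_of(addr: int) -> tuple[int, int]:
    """Map a word address to (bank index, row within bank).

    Low-order interleaving: consecutive addresses rotate across banks,
    so a burst of 8 sequential reads hits all 8 banks in one cycle.
    """
    return addr % NUM_BANKS, addr // NUM_BANKS

# Eight consecutive feature-map words land in eight distinct banks.
assert len({bank_of(a)[0] for a in range(8)}) == NUM_BANKS
```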

4. Tensor DMA / AI Dedicated Memory Access Controller

Function: Responsible for transporting tensor-format data, supporting block-split, channel-split, etc.

Features: Can optimize memory access order based on Tensor layout (e.g., NHWC).

Uniqueness: Compared to standard DMA, it has tensor awareness and supports AI layouts.

5. AI Graph/Layer Scheduling Execution Unit (Layer Dispatcher)

Function: Receives network layers segmented by the upper-level compiler and schedules submodule execution.

Features: Typically used in conjunction with software (Runtime).

Uniqueness: Focused on scheduling AI task graphs, not general operating system task scheduling.

6. Post-processing & Decode Unit (e.g., NMS, Softmax Accelerator)

Function: Accelerates post-processing tasks in object detection such as Non-Maximum Suppression (NMS) for YOLO/SSD.

Features: Fixed function, saves CPU time.

Uniqueness: Customized for CV models; on a standard SoC these steps would run in software on the CPU/GPU.
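
For reference, the algorithm such a unit hardens is compact enough to state directly; a minimal greedy-NMS sketch in Python (serving as a software golden model, with boxes given as (x1, y1, x2, y2)) looks like this:

```python
def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thresh=0.5):
    # Greedy NMS: keep the highest-scoring box, drop overlapping rivals.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```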

7. AI Compiler Support Module (e.g., Layer-IR Decoder)

Function: Parses layer instructions from models transmitted by the upper-level compiler and maps them to internal chip operation sequences.

Features: Similar to a translator from “IR to hardware microinstructions”.

Uniqueness: Serves neural network execution, differing from traditional CPU decoding logic.
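
A minimal sketch of what "IR to hardware microinstructions" could look like, with entirely hypothetical layer fields and micro-op names, is shown below; a real decoder would emit binary descriptors rather than strings.

```python
# Hypothetical layer IR -> micro-op decoder (illustrative names only).
def decode_layer(layer: dict) -> list[str]:
    if layer["op"] == "conv2d":
        return [
            "DMA_LOAD weights",                 # fetch kernel into weight buffer
            "DMA_LOAD ifmap",                   # fetch input feature map
            "MAC_RUN conv",                     # fire the systolic/MAC array
            f"ACT {layer.get('act', 'relu')}",  # fused activation, if any
            "DMA_STORE ofmap",                  # write the result back out
        ]
    raise NotImplementedError(f"unsupported op: {layer['op']}")

print(decode_layer({"op": "conv2d", "act": "gelu"}))
```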

Recommended Learning Modules and Path for Beginners

Just as a beginner in IC design or verification would not be handed the CPU or flash modules but rather the timer and I2C modules, the same principle applies here: the timer is simpler than flash and less likely to discourage a newcomer. Likewise, if we had to pick from the seven modules above those best suited to beginners in AI chip learning, I would recommend the following three.

1. Systolic Array (Matrix Multiplication Array)

Why recommend it?

Because it is the computational core of the AI chip, executing the operation neural networks spend most of their time on: matrix multiplication, built from multiply-accumulate (MAC) operations.

Once you understand it, you grasp the essence of “acceleration” in AI chips.

Learning this module will help you understand how networks such as CNNs and Transformers execute on hardware.

Its structure is clear and systematic, making it ideal for writing UVM testbenches, assertions, and analyzing coverage.

From this module, you can learn:

  • Data flow design (input stationary / weight stationary)
  • Pipeline, buffering, and parallel scheduling concepts
  • How to integrate quantization and buffering
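
Before touching RTL or UVM, it helps to have a golden model of the dataflow. The following Python sketch models a weight-stationary array computing C = A × B in the textbook fashion (each PE holds one weight and accumulates passing partial sums); this is exactly the reference behavior a scoreboard would check, though real arrays add pipelining and skewed timing.

```python
def systolic_matmul(A, B):
    """Golden model of a weight-stationary MAC array computing C = A @ B.

    B is 'pre-loaded': PE (k, j) holds weight B[k][j] for the whole run.
    Rows of A stream through; each PE multiplies the passing activation
    by its held weight and forwards the partial sum onward.
    """
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for i in range(M):           # each row of A streams through the array
        for j in range(N):       # column of PEs producing C[i][j]
            acc = 0
            for k in range(K):   # partial sums flow through K PEs
                acc += A[i][k] * B[k][j]   # one MAC per PE per beat
            C[i][j] = acc
    return C

assert systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]) == [[19, 22], [43, 50]]
```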

2. Quantization Unit

Why recommend it?

Quantization is the bridge connecting the training (floating-point) and inference (fixed-point) worlds, a necessary technology for AI deployment.

Mastering it will help you understand efficient AI chip precision modes like INT8 / BF16.

Verification difficulty is moderate, with clear logic (mainly scaling, truncation, clamp), making it very suitable for beginners to practice writing test cases.

From this module, you can learn:

  • Converting floating-point values to integers: scale + zero point + saturation (sketched below)
  • How to avoid overflow and precision loss
  • How much bandwidth INT8 traffic occupies, and how to drive it from UVM
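
A minimal Python sketch of the scale + zero-point + saturation path described above (asymmetric INT8 quantization in the style of common inference runtimes; the parameter values are illustrative):

```python
def quantize_int8(x: float, scale: float, zero_point: int) -> int:
    # Scale and shift by the zero point, then round and saturate to INT8.
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))   # saturation guards against overflow

def dequantize_int8(q: int, scale: float, zero_point: int) -> float:
    return (q - zero_point) * scale

# Round-trip: absent clamping, reconstruction error is at most scale / 2.
scale, zp = 0.05, 0
x = 1.23
q = quantize_int8(x, scale, zp)
assert abs(dequantize_int8(q, scale, zp) - x) <= scale / 2
```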

3. Tensor-aware DMA

Why recommend it?

It connects the computing units to off-chip DDR, serving as the “vascular system” of the entire AI chip.

Tensor DMA is more complex than standard DMA (considering formats like NHWC/NCHW), making it a good entry point for learning memory optimization and address generator design.

Beginners can practice simulation and debugging skills by writing simple data transport scenarios.

From this module, you can learn:

  • Tensor memory layouts (NCHW, NHWC)
  • DMA burst and stride modes
  • How to verify that the DMA transports tensor data correctly (key UVM content)
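
To give a feel for "tensor awareness", here is a toy Python address generator for an NHWC tensor (C fastest-varying); a real DMA would fold these nested strides into burst descriptors and stride counters, but the address arithmetic it must reproduce is the same:

```python
def nhwc_addresses(N, H, W, C, base=0, elem_size=1):
    """Yield byte addresses of an NHWC tensor in memory order.

    Strides (in elements): C is contiguous, then W, H, N.
    A tensor-aware DMA walks exactly this pattern with nested
    stride counters instead of software loops.
    """
    sC, sW, sH, sN = 1, C, W * C, H * W * C
    for n in range(N):
        for h in range(H):
            for w in range(W):
                for c in range(C):
                    yield base + (n * sN + h * sH + w * sW + c * sC) * elem_size

# First addresses of a 1x2x2x3 INT8 tensor: 0,1,2 (channels), then the next pixel.
assert list(nhwc_addresses(1, 2, 2, 3))[:6] == [0, 1, 2, 3, 4, 5]
```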

AI chips are not an unreachable black box; they are composed of logical modules just like familiar digital chips. However, they target more complex scenarios, process more data, and compute more efficiently.

For every IC engineer looking to transition to AI/NPU, there is no need to rush to master the entire AI workflow or to understand the mathematical essence behind deep models right away. Starting with individual “dedicated modules” to understand their structure, data flow, and verification approaches is the first step.

Systolic Array, Quantization Unit, and Tensor DMA are foundational building blocks of AI chips and the most practical places to start hands-on work.

I hope this article helps clarify your direction, build your confidence, and open the door to the world of AI chips.
