As a dedicated AI accelerator, an NPU (Neural Processing Unit) needs hardware interfaces that deliver high bandwidth, low latency, and flexible scalability. The sections below cover the interface signals and protocols commonly used by NPUs at the hardware level, with examples from real chips:
1. Memory Interface
Function: Connects to external memory (such as DDR, HBM) to transfer large volumes of data such as weights and activation values.
Signals and Protocols:
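- DDR4/5 Interface:
  - Signal Lines:
    - Differential Clock (CK_t/CK_c)
    - Address Lines (A0-A17 on DDR4; DDR5 uses a multiplexed CA bus instead)
    - Data Lines (DQ0-DQ63)
    - Control Lines (RAS/CAS/WE)
  - Protocol: JEDEC DDR standard, supports burst transfers (burst length 8 on DDR4, 16 on DDR5).
  - Example: Eight 64-bit DDR4-3200 channels peak at 8 × 25.6GB/s = 204.8GB/s, which is why bandwidth-hungry training chips such as the Huawei Ascend 910 use HBM instead (see the architecture example below).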
- HBM2e/3 Interface:
  - Signal Lines:
    - 1024-bit wide data bus (per stack)
    - TSV (Through-Silicon Via) vertical interconnect signals
  - Protocol: JEDEC HBM standard; each stack connects to the processor die through microbumps on a silicon interposer (2.5D packaging).
  - Example: NVIDIA A100 GPU reaches about 1.6TB/s with HBM2 in its 40GB version and about 2TB/s with HBM2e in its 80GB version.
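The bandwidth figures in this section follow from simple arithmetic: bus width times per-pin transfer rate. The sketch below is a minimal illustration of that calculation, not a vendor formula. The function name `peak_bandwidth_gbs` is hypothetical, and the 3200 MT/s pin rate used for the HBM2e stack is an assumed value chosen for illustration.

```python
def peak_bandwidth_gbs(bus_width_bits: int, data_rate_mtps: float, channels: int = 1) -> float:
    """Theoretical peak bandwidth in GB/s (decimal units).

    bus_width_bits: data bus width of one channel or stack
    data_rate_mtps: transfers per second per pin, in MT/s
    channels:       number of identical channels or stacks
    """
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * data_rate_mtps * channels / 1000


# One 64-bit DDR4-3200 channel
ddr4_one_channel = peak_bandwidth_gbs(64, 3200)

# Eight such channels in parallel
ddr4_eight_channels = peak_bandwidth_gbs(64, 3200, channels=8)

# One 1024-bit HBM2e stack, assuming 3200 MT/s per pin (illustrative value)
hbm2e_one_stack = peak_bandwidth_gbs(1024, 3200)

print(f"DDR4-3200 x1 channel : {ddr4_one_channel:.1f} GB/s")
print(f"DDR4-3200 x8 channels: {ddr4_eight_channels:.1f} GB/s")
print(f"HBM2e x1 stack       : {hbm2e_one_stack:.1f} GB/s")
```

These are upper bounds; sustained bandwidth is lower once refresh, bank conflicts, and read/write turnaround are taken into account.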
2. High-Speed Interconnect Interface
Function: Interconnects with CPU/GPU/other accelerators to achieve heterogeneous computing.
Signals and Protocols:
- PCIe Gen4/5:
  - Signal Lines:
    - Differential Data Pairs (Tx+/Tx-, Rx+/Rx-)
    - Reference Clock (REFCLK)
  - Protocol: PCI Express protocol, supports links up to x16 (16 GT/s per lane on Gen4, 32 GT/s per lane on Gen5).
  - Example: Habana Goya AI accelerator card communicates with the host via PCIe Gen4 x16.
- CXL (Compute Express Link):
  - Signal Lines: Reuses the PCIe physical layer and layers cache coherence protocols on top of it.
  - Protocol: CXL 2.0/3.0, supports memory sharing between devices (e.g., CXL.mem).
  - Example: Intel Sapphire Rapids CPU connects to accelerator cards via CXL 1.1.
- NVLink/NVSwitch:
  - Signal Lines: High-speed differential pairs (4-12 channels per direction).
  - Protocol: NVIDIA proprietary protocol, supports direct peer-to-peer memory access between GPUs (GPUDirect P2P).
  - Example: NVIDIA DGX H100 system achieves 900GB/s of interconnect bandwidth per GPU via NVLink 4.0.
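A GT/s figure is a raw signaling rate per lane, not a usable byte rate. The sketch below converts it to an approximate per-direction bandwidth for a given link width, assuming the 128b/130b line encoding used by PCIe Gen3 through Gen5. The table and function names are hypothetical, and packet-level overhead (headers, flow control) is ignored.

```python
# Per-lane raw rate in GT/s and line-encoding efficiency for each generation.
# Gen3, Gen4, and Gen5 all use 128b/130b encoding.
PCIE_GENERATIONS = {
    3: (8.0, 128 / 130),
    4: (16.0, 128 / 130),
    5: (32.0, 128 / 130),
}


def pcie_bandwidth_gbs(generation: int, lanes: int) -> float:
    """Approximate one-direction link bandwidth in GB/s, before packet overhead."""
    rate_gtps, efficiency = PCIE_GENERATIONS[generation]
    # Each transfer carries one bit per lane; divide by 8 to get bytes.
    return rate_gtps * efficiency * lanes / 8


gen4_x16 = pcie_bandwidth_gbs(4, 16)
gen5_x16 = pcie_bandwidth_gbs(5, 16)

print(f"PCIe Gen4 x16: {gen4_x16:.1f} GB/s per direction")
print(f"PCIe Gen5 x16: {gen5_x16:.1f} GB/s per direction")
```

Comparing these results with the memory bandwidths in the previous section shows why host links are usually the narrowest path into an accelerator.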
3. On-Chip Bus Interface
Function: Connects NPU with other IP cores (such as CPU, DSP) within the SoC.
Signals and Protocols:
- AXI4/AXI-Stream:
  - Signal Lines:
    - Address, Data, and Response Channels (AW/AR/W/R/B)
    - Streaming Interface (TDATA/TVALID/TREADY)
  - Protocol: ARM AMBA 4 standard, supports out-of-order transaction completion and multiple masters.
  - Example: Tesla FSD chip connects NPU with CPU via AXI bus.
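- CHI (Coherent Hub Interface):
  - Signal Lines: Cache-coherent request (REQ), response (RSP), snoop (SNP), and data (DAT) channels.
  - Protocol: ARM AMBA 5 CHI, suitable for multi-core coherent interconnect.
  - Example: Arm Neoverse platforms such as AWS Graviton3 (Neoverse V1 cores) use a CHI-based coherent mesh, which is the kind of fabric an on-chip NPU would attach to.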
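The TVALID/TREADY pair listed above implements a two-way handshake: a data beat moves only on a clock cycle where both signals are high. The cycle-level model below is a simplified sketch of that rule, not RTL. The function name `simulate_axi_stream` is hypothetical, and the source is assumed to hold TVALID high whenever it still has data to send.

```python
def simulate_axi_stream(payload, ready_pattern):
    """Cycle-level model of an AXI-Stream transfer.

    payload:       list of TDATA beats the source wants to send
    ready_pattern: per-cycle TREADY values driven by the sink
    Returns (received_beats, cycles_used).
    """
    received = []
    next_beat = 0
    cycles = 0
    for tready in ready_pattern:
        if next_beat == len(payload):
            break  # source has nothing left to send
        cycles += 1
        tvalid = True                 # source keeps TVALID asserted while data remains
        tdata = payload[next_beat]    # TDATA must stay stable until the handshake completes
        if tvalid and tready:
            received.append(tdata)    # handshake: the beat is accepted this cycle
            next_beat += 1
    return received, cycles


# The sink applies backpressure on cycles 2, 4, and 5.
beats, cycles = simulate_axi_stream(
    payload=[10, 20, 30],
    ready_pattern=[True, False, True, False, False, True, True],
)
print(beats, cycles)
```

In this run three beats take six cycles, because the sink deasserts TREADY on three of them. Backpressure of this kind is how a downstream block throttles an NPU data path without dropping data.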
4. Control and Debug Interface
Function: Configures NPU operating modes, monitors status, and supports debugging.
Signals and Protocols:
- APB (Advanced Peripheral Bus):
  - Signal Lines: PADDR (address), PWDATA (write data), PRDATA (read data), plus the PSEL, PENABLE, and PWRITE control signals.
  - Protocol: ARM AMBA 3 APB, used for low-speed register access.
  - Example: Google TPU configures control registers via APB.
- JTAG/SWD:
  - Signal Lines: TDI (data input), TDO (data output), TCK (clock), TMS (mode select) for JTAG; SWDIO and SWCLK for SWD.
  - Protocol: IEEE 1149.1 standard (JTAG), supports boundary scan and debugging; SWD is Arm's two-pin debug alternative.
  - Example: Xilinx Versal AI Edge chip programs NPU firmware via JTAG.
- I2C/SPI:
  - Signal Lines: SCL/SDA (I2C), CS/SCK/MOSI/MISO (SPI).
  - Protocol: Low-speed serial buses that carry configuration parameters or sensor data (e.g., temperature monitoring).
  - Example: Horizon Sunrise X3 configures the NPU supply voltage through a PMIC connected via I2C.
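JTAG debug access is driven entirely by the TMS line, which steps a 16-state TAP controller defined in IEEE 1149.1 on each TCK edge. The sketch below encodes that state machine as a lookup table so that TMS sequences can be checked in software. The names `TAP_TRANSITIONS` and `step_tap` are hypothetical, and the model covers state transitions only, not the data shifted on TDI/TDO.

```python
# TAP controller transitions: state -> (next state if TMS=0, next state if TMS=1)
TAP_TRANSITIONS = {
    "Test-Logic-Reset": ("Run-Test/Idle", "Test-Logic-Reset"),
    "Run-Test/Idle":    ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-DR-Scan":   ("Capture-DR", "Select-IR-Scan"),
    "Capture-DR":       ("Shift-DR", "Exit1-DR"),
    "Shift-DR":         ("Shift-DR", "Exit1-DR"),
    "Exit1-DR":         ("Pause-DR", "Update-DR"),
    "Pause-DR":         ("Pause-DR", "Exit2-DR"),
    "Exit2-DR":         ("Shift-DR", "Update-DR"),
    "Update-DR":        ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-IR-Scan":   ("Capture-IR", "Test-Logic-Reset"),
    "Capture-IR":       ("Shift-IR", "Exit1-IR"),
    "Shift-IR":         ("Shift-IR", "Exit1-IR"),
    "Exit1-IR":         ("Pause-IR", "Update-IR"),
    "Pause-IR":         ("Pause-IR", "Exit2-IR"),
    "Exit2-IR":         ("Shift-IR", "Update-IR"),
    "Update-IR":        ("Run-Test/Idle", "Select-DR-Scan"),
}


def step_tap(state, tms_bits):
    """Advance the TAP controller by one TCK cycle per TMS bit (0 or 1)."""
    for tms in tms_bits:
        state = TAP_TRANSITIONS[state][tms]
    return state


# From reset, TMS = 0,1,0,0 reaches Shift-DR, where data registers are scanned.
print(step_tap("Test-Logic-Reset", [0, 1, 0, 0]))

# Holding TMS high for five clocks returns to reset from any state.
print({step_tap(s, [1] * 5) for s in TAP_TRANSITIONS})
```

The second check reflects a property debuggers rely on: five TCK cycles with TMS high bring the controller back to Test-Logic-Reset regardless of where it started.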
5. Data Stream Interface
Function: Directly connects to sensors or preprocessing modules to reduce data transfer overhead.
Signals and Protocols:
- MIPI CSI-2:
  - Signal Lines: Differential data pairs (D0+/D0- … Dn+/Dn-), synchronous clock (CLK+/CLK-).
  - Protocol: Supports multi-lane RAW image transmission (e.g., 12-bit 4K@60fps).
  - Example: Ambarella CV5 NPU receives raw data from the camera via CSI-2.
- Ethernet/RoCE:
  - Signal Lines: SGMII (1Gbps), USXGMII (10Gbps) PHY interfaces.
  - Protocol: TCP/IP or RoCEv2 (RDMA over Converged Ethernet).
  - Example: Intel Habana Gaudi2 achieves distributed training via 100GbE RoCE.
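The "12-bit 4K@60fps" case above can be turned into a link budget with a few multiplications. The sketch below computes the raw pixel payload rate and divides it across lanes. It assumes that 4K means 3840×2160 and that the link has four data lanes; both are illustrative assumptions, the function name is hypothetical, and blanking intervals and packet overhead are ignored.

```python
def csi2_payload_gbps(width: int, height: int, fps: int, bits_per_pixel: int) -> float:
    """Raw pixel payload rate in Gbit/s, excluding blanking and protocol overhead."""
    return width * height * fps * bits_per_pixel / 1e9


# Assumed sensor mode: 3840x2160 at 60 fps, RAW12
total_gbps = csi2_payload_gbps(3840, 2160, 60, 12)

# Assumed link width: 4 data lanes
lanes = 4
per_lane_gbps = total_gbps / lanes

print(f"Total payload : {total_gbps:.3f} Gbit/s")
print(f"Per lane (x{lanes}): {per_lane_gbps:.3f} Gbit/s")
```

The real per-lane rate has to be somewhat higher than this figure to leave room for line blanking and packet headers.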
6. Power and Clock Interface
Function: Power supply and clock synchronization.
Signals and Protocols:
- Power Management:
  - Signal Lines: VDD (core power), VDDQ (IO power), PG (power good indicator).
  - Protocol: PMBus 1.3 for regulator control, with dynamic voltage and frequency scaling (DVFS).
  - Example: Apple M2 Ultra NPU supports dynamic voltage adjustment from 0.8V to 1.2V.
- Clock Network:
  - Signal Lines: Differential clock input (REFCLK), PLL control signals (FB/CP).
  - Protocol: Timing closure is ensured through Clock Tree Synthesis (CTS).
  - Example: Qualcomm Hexagon NPU integrates a low-jitter PLL with clock accuracy of ±50ppm.
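Two of the numbers above are easier to interpret with a quick calculation: how much a voltage change affects power, and what a ppm tolerance means in hertz. The sketch below uses the common first-order CMOS approximation that dynamic power scales with V² × f, which ignores leakage. The function names and the 100MHz reference clock are assumptions for illustration, not values taken from any datasheet.

```python
def dynamic_power_ratio(v_new: float, v_old: float, f_new: float = 1.0, f_old: float = 1.0) -> float:
    """Relative dynamic power under the first-order model P ~ C * V^2 * f."""
    return (v_new / v_old) ** 2 * (f_new / f_old)


def ppm_to_hz(nominal_hz: float, ppm: float) -> float:
    """Frequency deviation in Hz for a tolerance given in parts per million."""
    return nominal_hz * ppm / 1e6


# Raising the supply from 0.8 V to 1.2 V at constant frequency
ratio = dynamic_power_ratio(1.2, 0.8)

# A 50 ppm tolerance on an assumed 100 MHz reference clock
deviation_hz = ppm_to_hz(100e6, 50)

print(f"Dynamic power ratio 0.8 V -> 1.2 V: {ratio:.2f}x")
print(f"50 ppm of 100 MHz: +/-{deviation_hz:.0f} Hz")
```

The quadratic dependence on voltage is why DVFS lowers voltage together with frequency rather than frequency alone.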
Typical NPU Interface Architecture Example
Taking the Huawei Ascend 910 as an example:
- Memory Interface: Four HBM2 stacks, 4096 bits wide in total, with about 1.2TB/s of bandwidth.
- Interconnect Interface: PCIe Gen4 x16 to the host, plus Huawei's HCCS (Huawei Cache Coherence System) links for chip-to-chip interconnect.
- On-Chip Bus: AXI4-Stream connects the AI Cores with the memory controller.
- Control Interface: APB configures registers; JTAG is used for chip testing.
- Data Stream Interface: Integrates a RoCEv2 engine, supports direct connection to a 100GbE network.
Design Considerations and Trends
- Bandwidth and Latency Balance: Prefer HBM3 (6.4Gb/s per pin, roughly 819GB/s per 1024-bit stack) over DDR5 when bandwidth dominates, but weigh the added cost.
- Protocol Compatibility: CXL, which runs over the PCIe physical layer, is becoming the standard for coherent heterogeneous computing interconnect.
- Energy Efficiency Optimization: Employ near-memory computing (e.g., HBM-PIM) to reduce data transfer power consumption.
- Emerging Interfaces:
  - UCIe (Universal Chiplet Interconnect Express): Standardizes die-to-die links for chiplet-based NPU integration in 2D and 2.5D packages, with 3D stacking addressed in later revisions.
  - OpenHBI (Open High Bandwidth Interface): Open die-to-die high-bandwidth interface specification from the Open Compute Project.
Well-chosen interfaces and protocols keep an NPU's compute units fed with data, which is what lets chips in the class of the Tesla Dojo D1 (rated at roughly 362 TFLOPS for BF16/CFP8) sustain their peak throughput at an acceptable energy cost.