Implementation of Edge-side PD Disaggregation in NPU+CPU Heterogeneous Computing

For large-model inference on edge devices, achieving low latency within tight compute and memory budgets is a core requirement. An NPU+CPU collaborative PD disaggregation architecture targets the TTFT (time to first token) bottleneck of edge-side inference by running the Prefill phase on the NPU, running the Decode phase on the CPU, and optimizing how the KV Cache is shared between the two.

1. What is PD Disaggregation (Prefill-Decode Disaggregation)?

PD disaggregation (Prefill-Decode disaggregation) is an inference architecture that runs the two core stages of large-model inference, Prefill and Decode, separately so that each can be optimized on its own.

In traditional LLM inference frameworks, the Prefill and Decode stages are executed by the same hardware, so the inference engine has to switch dynamically between the two stages to complete a request. The PD disaggregation architecture separates the stages and runs them independently on different hardware instances: the Prefill stage processes the input prompt and generates the KV Cache (key-value cache); the Decode stage then produces the output token by token using that KV Cache. The two stages are connected by an efficient data transmission mechanism, so the performance of each can be tuned independently, which improves overall inference efficiency.
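The hand-off between the two stages can be illustrated with a toy sketch. It is not a real model: the `KVCache` class, the `prefill` and `decode` functions, and the arithmetic used to pick the "next token" are all hypothetical stand-ins, chosen only to show that Prefill produces the cache in one pass and Decode consumes and extends it step by step.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class KVCache:
    """Toy stand-in for the per-position key/value entries."""
    keys: List[int] = field(default_factory=list)
    values: List[int] = field(default_factory=list)

    def append(self, token: int) -> None:
        # A real model would store projected K/V tensors; toy numbers are used here.
        self.keys.append(token * 2)
        self.values.append(token * 3)


def next_token(cache: KVCache) -> int:
    # Hypothetical replacement for "attention over the cache, then sampling".
    return sum(cache.values) % 100


def prefill(prompt: List[int]) -> Tuple[KVCache, int]:
    """Process the whole prompt at once; return the cache and the first output token."""
    cache = KVCache()
    for tok in prompt:  # conceptually parallel on real hardware
        cache.append(tok)
    return cache, next_token(cache)


def decode(cache: KVCache, first: int, max_new_tokens: int) -> List[int]:
    """Generate tokens one at a time, reading and extending the cache at every step."""
    out = [first]
    for _ in range(max_new_tokens - 1):
        cache.append(out[-1])
        out.append(next_token(cache))
    return out
```

In a disaggregated deployment, `prefill` would run on one device and `decode` on another, and the only state that crosses the boundary is the cache plus the first token.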

2. Why Implement PD Disaggregation?

The core value of PD disaggregation lies in matching the characteristics of each inference stage to the strengths of the hardware that runs it. This removes the resource contention of traditional architectures and fits the hardware typically available on edge devices.

1. Core Characteristics of Prefill and Decode Stages

Prefill Stage: A compute-intensive task whose core work is to process the entire input prompt in one pass and generate the KV Cache required for subsequent inference. Its computational load grows at least linearly with the number of input tokens (the attention computation itself grows quadratically with sequence length), so it demands high parallel computing capability, and its performance directly determines TTFT (time to first token).

Decode Stage: A memory-bound task that generates output autoregressively, one token at a time, and must read the KV Cache produced in the Prefill stage (plus the entries appended since) at every step. The computation per step is relatively small, but the stage places high demands on memory bandwidth and data locality, and its performance primarily determines TPOT (time per output token).
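A rough back-of-the-envelope estimate makes the contrast concrete. The sketch below rests on two simplifying assumptions that are not from the original text: a forward pass costs about 2 FLOPs per parameter per token, and the weights are read from memory once per pass. KV Cache and activation traffic are ignored, which if anything understates how memory-bound Decode is. The function name and the example model size are hypothetical.

```python
def arithmetic_intensity(params: float, tokens_per_pass: int,
                         bytes_per_weight: float = 2.0) -> float:
    """Approximate FLOPs per byte of weight traffic for one forward pass.

    Assumes ~2 FLOPs per parameter per token and one read of every weight
    per pass (FP16 weights by default).
    """
    flops = 2.0 * params * tokens_per_pass
    bytes_moved = params * bytes_per_weight
    return flops / bytes_moved


# Hypothetical 1B-parameter model with a 512-token prompt.
prefill_intensity = arithmetic_intensity(1e9, tokens_per_pass=512)  # whole prompt in one pass
decode_intensity = arithmetic_intensity(1e9, tokens_per_pass=1)     # one token per pass
```

Under these assumptions Prefill performs hundreds of times more arithmetic per byte moved than Decode, which is why Prefill benefits from dense compute units while Decode is limited mainly by memory access.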

2. Hardware Adaptation Logic: NPU for Prefill, CPU for Decode

The core advantage of NPU for Prefill: The NPU is a dedicated processor designed for neural network computation. It combines many specialized compute units (such as MAC arrays and tensor processing units) with an optimized memory subsystem, and it handles core neural network operations such as matrix multiplication and activation functions efficiently. The highly parallel computation of the Prefill stage matches this architecture well, so the NPU can complete KV Cache generation quickly and reduce TTFT at its source.

The core advantage of CPU for Decode: As a general-purpose processor, the CPU has strong control-flow capability and a flexible memory access scheduling mechanism. Token-by-token generation in the Decode stage involves complex control logic, frequent cache access, and instruction scheduling. The CPU’s strengths in serial processing and its mature memory management keep the Decode stage continuous and stable. In addition, a CPU is present in virtually every edge device, so Decode can use existing hardware without adding cost.

In traditional architectures, sharing hardware between the two stages leads to resource contention: prioritizing Prefill degrades Decode performance, while prioritizing Decode increases Prefill wait times. PD disaggregation resolves this conflict by separating the hardware, so TTFT and TPOT can both be optimized rather than traded off against each other.

3. How to Achieve PD Disaggregation: Efficient Sharing of KV Cache

To implement NPU+CPU collaborative edge-side PD disaggregation, the core challenge is to solve the efficient sharing of the KV Cache between the two hardware components, requiring a comprehensive solution from the perspectives of transmission mechanisms, memory layout, and scheduling strategies.

1. Establishing a Low-Latency KV Cache Transmission Channel

After the Prefill stage finishes computing the KV Cache on the NPU, the cache must reach the CPU over a low-latency path. On edge devices this can be an on-chip bus (such as AMBA AXI) or a shared-memory mechanism that reduces data-copy overhead: the NPU writes the KV Cache directly into a shared memory region, and the CPU accesses it through memory mapping. This avoids an explicit cross-device transfer and lets Decode start promptly once Prefill completes.
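The zero-copy idea can be sketched with Python's standard `multiprocessing.shared_memory` module. This is only an analogy: on a real edge SoC the shared region would be provided by the platform's NPU driver or memory allocator, not by this module. The helper names `publish_kv`, `attach_kv`, and `demo` are hypothetical, and FP32 values are used purely for readability.

```python
import os
import struct
from multiprocessing import shared_memory


def publish_kv(name, values):
    """Producer side (stands in for the NPU): write FP32 values into a named shared block."""
    shm = shared_memory.SharedMemory(name=name, create=True, size=4 * len(values))
    struct.pack_into(f"{len(values)}f", shm.buf, 0, *values)
    return shm


def attach_kv(name, count):
    """Consumer side (stands in for the CPU decoder): map the same block without copying."""
    shm = shared_memory.SharedMemory(name=name)
    # Slice first: some platforms round the block size up to a page multiple.
    view = shm.buf[: 4 * count].cast("f")  # typed view over the shared bytes
    return shm, view


def demo():
    name = f"kv_demo_{os.getpid()}"
    producer = publish_kv(name, [0.5, 1.5, -2.0])
    consumer, view = attach_kv(name, 3)
    try:
        # A later write by the producer is visible through the consumer's view,
        # which shows that both sides address the same memory.
        struct.pack_into("f", producer.buf, 0, 4.0)
        return list(view)
    finally:
        view.release()  # views must be released before the mapping is closed
        consumer.close()
        producer.close()
        producer.unlink()
```

The consumer never receives a copy of the data; it only learns the block's name and size, which is the property that keeps the Prefill-to-Decode hand-off cheap.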

2. Optimizing the Memory Layout of KV Cache

Because memory on edge devices is limited, the KV Cache is stored in a compact layout that reduces fragmentation and redundant overhead. The data arrangement is also tuned to the CPU’s cache characteristics (such as cache line size and the multi-level cache hierarchy) so that the CPU achieves a higher cache hit rate when reading the KV Cache during the Decode stage, which reduces memory access latency.
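One possible layout, shown here only as an illustration, stores the cache as `[layer][K or V][head][position][dim]` and pads each per-position vector to a multiple of the cache line size. The 64-byte line size is an assumption (it is common but not universal), and the function names and parameters are hypothetical.

```python
CACHE_LINE = 64  # bytes; assumed, check the target CPU


def align_up(n: int, alignment: int) -> int:
    """Round n up to the next multiple of alignment."""
    return (n + alignment - 1) // alignment * alignment


def kv_offsets(layer, head, pos, *, num_heads, max_seq, head_dim, elem_bytes):
    """Byte offsets of the K and V vectors for (layer, head, pos).

    Positions of one head are adjacent in memory, so a Decode step that scans
    all past positions of a head reads one sequential, line-aligned stream.
    """
    row = align_up(head_dim * elem_bytes, CACHE_LINE)  # one padded vector
    head_block = row * max_seq                          # all positions of one head
    plane = head_block * num_heads                      # all heads of K (or V) in one layer
    layer_block = 2 * plane                             # K plane followed by V plane
    k = layer * layer_block + head * head_block + pos * row
    v = k + plane
    return k, v
```

Padding trades a little memory for alignment; when `head_dim * elem_bytes` is already a multiple of the line size, as with a head dimension of 64 in FP16, no memory is wasted.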

3. Designing an Intelligent Scheduling Mechanism

A lightweight scheduler coordinates the two stages. While the NPU executes the Prefill computation, the CPU performs Decode initialization in advance. As soon as the KV Cache has been written to shared memory, the scheduler triggers the CPU to start decoding. Throughout, it monitors the execution status of both stages in real time and dynamically adjusts resource allocation so that neither side sits idle waiting for the other, keeping hardware utilization high.
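The trigger part of such a scheduler can be sketched with two threads and a `threading.Event`. This is a simplified illustration under stated assumptions: the two worker functions stand in for the NPU and CPU, a dictionary stands in for shared memory, the token arithmetic is a toy, and the monitoring and dynamic resource adjustment described above are not modeled.

```python
import threading


def run_pipeline(prompt, max_new_tokens):
    kv_ready = threading.Event()
    shared = {}   # stands in for the shared-memory region
    output = []

    def prefill_worker():            # would run on the NPU
        shared["kv"] = [t * 2 for t in prompt]
        kv_ready.set()               # signal: the cache is now in shared memory

    def decode_worker():             # runs on the CPU
        buf = [0] * max_new_tokens   # initialization overlaps with Prefill
        kv_ready.wait()              # block until the scheduler's trigger fires
        kv = shared["kv"]
        for i in range(max_new_tokens):
            tok = sum(kv) % 100      # toy stand-in for one decode step
            kv.append(tok * 2)
            buf[i] = tok
        output.extend(buf)

    # Decode is started first so its setup runs while Prefill is still working.
    threads = [threading.Thread(target=decode_worker),
               threading.Thread(target=prefill_worker)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return output
```

The event plays the role of the scheduler's "KV Cache ready" notification; on real hardware this would more likely be an interrupt or a driver-level completion signal.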

4. Advanced Optimization: KV Cache Quantization

The storage overhead of the KV Cache is a key constraint for the edge-side PD disaggregation architecture. Quantization can further reduce memory usage and access latency without significantly compromising inference accuracy, which improves overall performance.

1. Core Value of KV Cache Quantization

Memory on edge devices is typically limited, while the KV Cache of a large model grows linearly with sequence length and also with model scale (layer count, number of KV heads, head dimension), so it can become a performance bottleneck. Quantizing the KV Cache from FP16/FP32 to INT8 or even INT4 cuts its storage substantially: relative to FP16, INT8 halves it and INT4 reduces it by about 75%, before accounting for the small overhead of scales and zero-points. It also lowers memory bandwidth usage, which speeds up CPU access and indirectly improves TPOT in the Decode stage. In addition, a quantized KV Cache is smaller to transfer, which can further shorten the hand-off between the two stages.
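The savings follow directly from the size formula for the cache. The sketch below computes it for a hypothetical configuration (32 layers, 8 KV heads, head dimension 128, 4096 tokens); these numbers are illustrative assumptions, not measurements from a specific model, and quantization metadata such as scales is ignored.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits):
    """Size of the KV Cache in bytes, ignoring quantization metadata."""
    # Two tensors (K and V) per layer, one head_dim vector per head per position.
    elems = 2 * layers * kv_heads * head_dim * seq_len
    return elems * bits // 8


MIB = 1024 * 1024
cfg = dict(layers=32, kv_heads=8, head_dim=128, seq_len=4096)  # hypothetical model

fp16_mib = kv_cache_bytes(**cfg, bits=16) // MIB  # 512
int8_mib = kv_cache_bytes(**cfg, bits=8) // MIB   # 256
int4_mib = kv_cache_bytes(**cfg, bits=4) // MIB   # 128
```

For this configuration the cache shrinks from 512 MiB in FP16 to 256 MiB in INT8 and 128 MiB in INT4, which on a memory-constrained device can decide whether a long context fits at all.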

2. Key Points for Implementing Edge-side KV Cache Quantization

Adopt a mixed-precision quantization strategy: Retain higher precision for the parts of the KV Cache that are sensitive to precision (such as critical feature dimensions) and apply low-precision quantization to the non-sensitive parts, balancing precision loss against performance gain.
Hardware-friendly quantization schemes: Design the quantization and dequantization steps to match the CPU’s instruction set (such as ARM’s NEON instructions and x86’s AVX instructions) so that operations on the low-precision KV Cache execute efficiently and quantization does not add significant computational overhead.
Quantization calibration and error compensation: Determine reasonable quantization ranges through offline calibration to reduce quantization error, and apply error compensation algorithms on critical inference paths so that the output quality of the quantized model meets the requirements of edge-side applications.
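The calibration point can be illustrated with a minimal sketch of symmetric INT8 quantization, where the scale is derived offline from the largest absolute value seen in a set of calibration samples. This is one common scheme chosen for illustration, not necessarily the one used in any particular deployment. It is written in plain Python for clarity, so it does not show NEON or AVX usage, and mixed precision and error compensation are not covered. All function names are hypothetical.

```python
def calibrate_scale(samples, qmax=127):
    """Offline calibration: derive one scale from the largest magnitude observed."""
    absmax = max(abs(x) for vec in samples for x in vec)
    return absmax / qmax if absmax > 0 else 1.0


def quantize(vec, scale, qmax=127):
    """Map floats to integer codes in [-qmax, qmax]; out-of-range values are clamped."""
    return [max(-qmax, min(qmax, round(x / scale))) for x in vec]


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


# Calibrate on sample K/V vectors, then quantize a new vector with the fixed scale.
scale = calibrate_scale([[0.5, -1.0], [0.25, 2.0]])
codes = quantize([2.0, -2.0, 0.0, 3.0], scale)  # 3.0 exceeds the calibrated range
restored = dequantize(codes, scale)
```

Values inside the calibrated range are recovered with an error of at most half a quantization step, while values outside it are clamped, which is why the choice of calibration data matters.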