Implementation of Edge-side PD Disaggregation in NPU+CPU Heterogeneous Computing
In large-model inference on edge devices, balancing low latency with high performance is a core requirement. The NPU+CPU collaborative Prefill-Decode (PD) disaggregation architecture addresses the time-to-first-token (TTFT) bottleneck of edge-side inference by running the Prefill phase on the NPU, running the Decode phase on the CPU, and optimizing …
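To make the division of labor concrete, here is a minimal sketch of a two-stage PD pipeline. It assumes the usual PD-disaggregation pattern: the prefill stage processes the whole prompt in one pass, emits the first token (which is where TTFT ends), and hands its KV cache to a separate decode stage that generates the remaining tokens one at a time. `npu_prefill`, `cpu_decode_step`, `KVCache` and the token arithmetic are hypothetical placeholders, not a real NPU runtime or model API. Two threads and a `queue.Queue` stand in for the NPU and CPU workers.

```python
import queue
import threading
from dataclasses import dataclass, field


@dataclass
class KVCache:
    # Hypothetical stand-in for the per-layer key/value tensors built during prefill.
    tokens: list[int] = field(default_factory=list)


@dataclass
class Request:
    req_id: int
    prompt: list[int]
    max_new_tokens: int


def npu_prefill(req: Request) -> tuple[KVCache, int]:
    # Placeholder for one batched forward pass over the full prompt on the NPU
    # (compute-bound: all prompt tokens are processed together).
    kv = KVCache(tokens=list(req.prompt))
    first_token = sum(req.prompt) % 1000  # dummy "sampled" first token
    return kv, first_token


def cpu_decode_step(kv: KVCache, last_token: int) -> int:
    # Placeholder for one autoregressive step on the CPU: extend the KV cache
    # with the previous token and emit the next one (memory-bound in practice).
    kv.tokens.append(last_token)
    return (last_token * 31 + len(kv.tokens)) % 1000


def prefill_worker(requests: list[Request], handoff: queue.Queue) -> None:
    for req in requests:
        kv, first = npu_prefill(req)
        # The first token exists at this point, so TTFT is bounded by prefill time.
        handoff.put((req, kv, first))
    handoff.put(None)  # sentinel: no more requests


def decode_worker(handoff: queue.Queue, results: dict[int, list[int]]) -> None:
    while True:
        item = handoff.get()
        if item is None:
            break
        req, kv, token = item
        out = [token]
        for _ in range(req.max_new_tokens - 1):
            token = cpu_decode_step(kv, token)
            out.append(token)
        results[req.req_id] = out


def run(requests: list[Request]) -> dict[int, list[int]]:
    handoff: queue.Queue = queue.Queue()
    results: dict[int, list[int]] = {}
    npu = threading.Thread(target=prefill_worker, args=(requests, handoff))
    cpu = threading.Thread(target=decode_worker, args=(handoff, results))
    npu.start()
    cpu.start()
    npu.join()
    cpu.join()
    return results


if __name__ == "__main__":
    out = run([Request(0, [1, 2, 3], 4), Request(1, [10, 20], 3)])
    print(out)
```

The queue decouples the two stages, so the prefill worker can start on the next prompt while the decode worker is still generating tokens for the previous one. That overlap is the point of disaggregation. This sketch ignores the cost of moving the KV cache between NPU and CPU memory, which a real implementation would need to account for.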