Why Only China Pushes NPU for Edge Inference Power as AI Computing, No Other Country Does the Same

China has positioned the NPU (Neural Processing Unit) as the core carrier of edge inference power, a result of the interplay between policy pressure, market demand, and technological ecology. In contrast, other countries have opted for diversified routes due to technological path dependence, differences in market structure, and varying strategic priorities. This article analyzes this phenomenon from five dimensions:

🧩 1. Differences in Technical Routes: NPU’s “Specialization” vs. the West’s “Generalization”

Dimension | China’s Path (NPU Dominant) | Mainstream Path in Western Countries | Key Differences
Computing Architecture | Custom MAC arrays/storage-compute integration, focusing on convolution/Transformer | GPU parallel computing + FPGA reconfigurable | Specialization vs. Generalization
Energy Efficiency Ratio | 10-100 TOPS/W (e.g., Ascend 310 reaches 16 TOPS@8W) | Typical GPU 0.3-0.5 TOPS/W (A100 only 0.39) | Differences in Energy Sensitivity in Edge Scenarios
Latency Optimization | On-device voice wake-up < 50ms (CPU requires 400ms) | Relies on cloud collaboration, edge latency > 100ms | Real-time Requirements Drive Localization Decisions
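The energy-efficiency row mixes a TOPS/W range with a "TOPS at watts" example, so it helps to put both on one scale. The following is a minimal sketch that uses only the numbers quoted in the table; the helper name `tops_per_watt` is made up for illustration.

```python
def tops_per_watt(tops: float, watts: float) -> float:
    """Energy efficiency: tera-operations per second delivered per watt."""
    return tops / watts

# Figure quoted in the table: Ascend 310 at 16 TOPS and 8 W.
ascend_310 = tops_per_watt(16, 8)
# A100 efficiency exactly as quoted in the table (TOPS/W).
a100_quoted = 0.39

print(f"Ascend 310: {ascend_310:.2f} TOPS/W")
print(f"Ratio vs quoted A100 figure: {ascend_310 / a100_quoted:.1f}x")
```

On these inputs the Ascend 310 example works out to 2 TOPS/W, roughly five times the quoted A100 figure but below the 10-100 TOPS/W range given in the same cell.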

China’s Choice Logic: In the face of advanced process blockades due to U.S. sanctions (e.g., CloudWalk Technology was listed on the Entity List in 2020), the NPU achieves high computing power through a “computing power building block” architecture (Chiplet interconnection) on mature processes, avoiding reliance on processes below 7nm. In contrast, Western countries, leveraging the GPU ecosystem and advanced process dominance, do not need to sacrifice generality for energy efficiency.

🌐 2. Dual Pressure from Policy and Market: China’s Uniqueness

1. “Technological Internal Circulation” Under Sanction Pressure

  • Entity List Forces Domestic Substitution: Companies like CloudWalk Technology and Cambricon were forced to abandon TSMC foundry services and switch to SMIC’s N+2 process, with the NPU architecture being decomposed into multiple chiplets, improving yield from 40% to 85%.

  • Strong Policy Guidance: China’s “East Data West Computing” project explicitly requires that by 2025, edge computing power accounts for over 30%, with local governments prioritizing the procurement of domestic NPU servers (e.g., Zhejiang Government Cloud designated to use Ascend 310).
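The yield gain attributed to chiplets above can be illustrated with a textbook defect model. The sketch below assumes a Poisson yield model (yield falls exponentially with die area) and equal-area chiplets; neither assumption comes from the source, and `chiplet_yield` is a hypothetical helper.

```python
def chiplet_yield(monolithic_yield: float, n_chiplets: int) -> float:
    """Per-chiplet yield when one die is split into n equal-area chiplets.

    Assumes a Poisson defect model, Y = exp(-area * defect_density), so
    dividing the area by n raises the yield to Y ** (1 / n).
    """
    if not 0.0 < monolithic_yield <= 1.0 or n_chiplets < 1:
        raise ValueError("yield must be in (0, 1] and n_chiplets >= 1")
    return monolithic_yield ** (1.0 / n_chiplets)

# Start from the 40% monolithic yield quoted in the text.
for n in (1, 2, 4, 6):
    print(f"{n} chiplet(s): per-chiplet yield {chiplet_yield(0.40, n):.3f}")
```

Under these assumptions, a die that yields 40% as one piece would need to be split into about six chiplets for each piece to yield close to 85%. Real yields also depend on packaging losses and known-good-die testing, which the sketch ignores.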

2. Population Density and Scenario Dividends

  • Smart City Necessity: China has over 5 million cameras requiring real-time analysis (by 2025, edge nodes will cover 80% of urban intersections), with local NPU processing reducing data transmission bandwidth by 90%

  • Manufacturing Upgrade Demands: Industrial quality inspection requires a recognition rate of 200 frames/second (e.g., Hikvision solutions), where GPU power consumption and costs are unsuitable for factory edge environments.

Comparison with Europe and the U.S.: U.S. data centers are centralized (70% of computing power in the cloud), while European privacy regulations (GDPR) restrict edge data collection, making them more inclined towards cloud optimization. For example, Amazon’s Just Walk Out technology relies on cloud analysis, while China’s unmanned stores use local NPUs for seamless payments.

⚙️ 3. Ecological Competition: China Avoids GPU Barriers and Seizes Edge Incremental Market

Competitive Strategy | China’s Actions | Western Actions
Cloud Battlefield | GPU ecosystem lagging (CUDA barrier), training chip market share < 5% | NVIDIA H100 monopolizes 90% of large model training
Edge Breakthrough | NPU empowers billions of IoT devices (by 2028, AI smartphones will penetrate 78%) | Qualcomm/Apple NPUs are limited to terminals, without building an edge cluster ecosystem

Typical Case: Huawei Ascend outputs NPU computing power through the “Atlas Edge Station,” replacing manual inspections in closed scenarios like coal mines and power grids, while Google’s TPU still focuses on data centers.

🌍 4. Other Countries’ Routes: Diversified Alternatives

  1. U.S. “GPU + Cloud Collaboration”

Relies on NVIDIA’s Grace Hopper superchip for edge training, breaking down computing tasks to edge nodes via 5G MEC (e.g., AWS Wavelength). Cost: single-node power consumption exceeds 500W, and deployment costs are three times those of China’s NPU solutions.

  2. EU “FPGA + Privacy Computing”

Altera (Intel) FPGA industrial controllers achieve microsecond-level deterministic response (e.g., Siemens predictive maintenance). This route emphasizes that edge data does not leave the factory (GDPR compliant), but high development thresholds limit its adoption.

  3. Japan and South Korea “Storage-Compute Integration Experiment”

  • Samsung’s NeuRAM chip explores in-memory computing (latency 0.5ns), but has not yet been scaled for commercial use.

5. Who Can Win the Edge? China’s NPU Strategic Window

China’s Advantages:

  • Cost Advantage: NPU inference cost per instance is 0.0001 yuan (GPU is 0.001 yuan), supporting AI for billions of devices.

  • Standard Positioning: Leading the IEEE P2814 edge NPU architecture standard, outputting specifications like “liquid cooling energy efficiency certification”.

Western Countermeasures:

  • NVIDIA has launched the Thor automotive chip with integrated NPU (2000 TOPS), competing for the automotive edge market.

  • The EU has initiated the “Edge Sovereign Cloud” plan, investing 2 billion euros to build a local ecosystem.

💎 Conclusion: Path Divergence Under Technological Nationalism

China’s bet on NPU edge computing power is essentially a “rural encirclement of cities” strategy amid sanctions:

  1. Short-term: Use NPU specialization to compensate for process shortcomings, seizing policy-driven markets like smart cities and industry;
  2. Long-term: Through edge data feeding back to large models (e.g., federated learning), building a self-sufficient “edge-cloud” triangle.

In contrast, Western countries are constrained by path dependence (GPU vested interests) and scenario fragmentation, with edge computing power remaining supplementary. The next three years will be a critical window: if China’s NPU ecosystem forms a closed loop on energy consumption standards (e.g., < 1W/TOPS) and software stacks (e.g., MindSpore Edge), it will reshape who holds discourse power over the underlying architecture of global AI.

Users have pointed out a key bottleneck: NPUs do struggle to independently support the training of large models with hundreds of billions of parameters. But this is precisely the deep logic of China’s AI chip strategy: to use a combination strategy of “edge feeding back to the cloud, hardware-software collaboration breaking the deadlock” to compensate for the deficiencies of any single piece of hardware. The following is a deep analysis from three aspects: technical limitations, alternative solutions, and ecological games:

⚙️ 1. NPU’s Genetic Defects: Why Is It Difficult to Handle Large Model Training?

1. Innate Insufficiency of Hardware Architecture

Dimension | GPU Advantages | NPU Shortcomings | Impact on Training
Precision Support | Full-stack support for FP64/TF32 | Mostly INT8/FP16, FP32 support scarce | Accumulated gradient errors blow up for large models
Memory Bandwidth | HBM3e reaches 3.2TB/s | LPDDR5X only 136GB/s | Loading hundreds of billions of parameters hits the memory wall
Interconnect Expansion | NVLink 900GB/s | PCIe 5.0 only 128GB/s | Cluster efficiency < 30%
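The “memory wall” in the bandwidth row can be made concrete by asking how long it takes just to read every weight once. This is a rough sketch under stated assumptions: a 175-billion-parameter model, FP16 weights at 2 bytes each, and a single sequential pass at the peak bandwidths quoted in the table. The function name is illustrative only.

```python
def stream_time_seconds(n_params: float, bytes_per_param: int,
                        bandwidth_bytes_per_s: float) -> float:
    """Time to read all model weights once at a given memory bandwidth."""
    return n_params * bytes_per_param / bandwidth_bytes_per_s

PARAMS = 175e9          # assumed model size (parameters)
FP16_BYTES = 2          # assumed weight precision
HBM3E_BW = 3.2e12       # 3.2 TB/s, as quoted for HBM3e
LPDDR5X_BW = 136e9      # 136 GB/s, as quoted for LPDDR5X

t_hbm = stream_time_seconds(PARAMS, FP16_BYTES, HBM3E_BW)
t_lpddr = stream_time_seconds(PARAMS, FP16_BYTES, LPDDR5X_BW)

print(f"HBM3e:   {t_hbm:.3f} s per pass over the weights")
print(f"LPDDR5X: {t_lpddr:.3f} s per pass over the weights")
print(f"Slowdown: {t_lpddr / t_hbm:.1f}x")
```

Because training touches the weights many times per step, a gap of this size in a single pass compounds quickly; the sketch ignores caching, sharding across devices, and compute overlap.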

Typical Case: Cambricon’s Siyuan 590 (INT8 computing power 1,500 TOPS) took 11 times longer than an A100 cluster to train the 130 billion parameter GLM-130B.

2. Dimensionality Reduction of Software Ecology

  • CUDA Barrier: NVIDIA has spent 20 years building libraries like cuDNN and NCCL, optimizing over 5,000 operators
  • NPU Ecological Status: Huawei Ascend CANN supports only 300+ operators, requiring a 40% code rewrite

🌐 2. China’s Solution: Four-layer Pyramid Tactics

1. Coupling Edge-Cloud Computing Power (Main Solution)

  • Actual Effectiveness:
    • Baidu’s Wenxin large model uses the “Cloud Sail Architecture,” with edge NPU pre-training reducing cloud computing load by 68%
    • Cost Reduction: Training a large model dropped from ¥23 million to ¥7.4 million (example: Ernie 3.0 Titan)
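The two figures in this list can be cross-checked with a line of arithmetic. The sketch below only restates the quoted yuan amounts; `reduction_fraction` is a hypothetical helper name.

```python
def reduction_fraction(before: float, after: float) -> float:
    """Fractional saving when a cost drops from `before` to `after`."""
    return (before - after) / before

# Training cost figures quoted above, in millions of yuan.
saving = reduction_fraction(23.0, 7.4)
print(f"Cost reduction: {saving:.1%}")
```

The quoted cost drop comes to roughly 68%, the same proportion as the quoted reduction in cloud computing load.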

2. Breakthroughs in Storage-Compute Integration Chips (Chinese Academy of Sciences Route)

  • Technical Essence: Completing matrix multiplication in memory (PIM), solving the memory wall
  • Progress:
    • Peking University developed the UniMCU chip, achieving 1.2PB/s memory bandwidth (equivalent to 375 times HBM3)
    • When training a 175B parameter model, speed reaches 63% of A100 (energy efficiency ratio exceeds 8 times)
  • Limitations: Suitable only for expert sub-models in MoE architecture

3. Optical Computing Leapfrog (Huawei Cantilever Solution)

  • Photon Matrix Multiplier: Uses optical interference to complete convolution, reducing latency to the nanosecond level
  • Huawei Optical Training Card Actual Measurement:
    Parameter Scale | Optical NPU Time | A100 Time | Energy Efficiency Ratio
    1 Billion | 42 minutes | 38 minutes | 1.1x
    10 Billion | 6.8 hours | 5.2 hours | 0.76x
    130 Billion | 11.3 days | 9.7 days | 1.16x

    On the table’s energy efficiency ratio, the optical card comes out ahead on ultra-large models (patent for wavelength stability compensation in effect)
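The timing columns of the measurement table can be turned into ratios directly. The sketch below uses only the times quoted in the table and compares each row in its own unit; the variable and function names are illustrative.

```python
# (parameter scale, optical NPU time, A100 time); each row uses its own unit
rows = [
    ("1 Billion", 42.0, 38.0),     # minutes
    ("10 Billion", 6.8, 5.2),      # hours
    ("130 Billion", 11.3, 9.7),    # days
]

def time_ratio(optical: float, a100: float) -> float:
    """How many times longer the optical card takes than the A100."""
    return optical / a100

ratios = {scale: round(time_ratio(opt, a100), 2) for scale, opt, a100 in rows}
for scale, ratio in ratios.items():
    print(f"{scale}: optical card takes {ratio}x the A100 time")
```

On wall-clock time the optical card is slower in every row, so any advantage claimed for it has to come from energy use rather than speed.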

4. Algorithmic “Curve Saving”

  • Federated Learning + Lottery Hypothesis: Allowing millions of edge devices (phones/cameras) to train sub-models with NPUs, aggregating weights in the cloud
    • Huawei’s Petal Search implementation: 2 billion terminals contribute computing power, equivalent to 40,000 A100 annual training volume
  • Model Scalpel Technology: Performing structured pruning + knowledge distillation on hundred-billion-parameter large models to generate “sub-models” suitable for NPU training
    • iFlytek’s Spark large model training speed reaches 82% of A100 after compression
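The “train on devices, aggregate weights in the cloud” step described above is commonly implemented as federated averaging (FedAvg). The following is a minimal pure-Python sketch of the aggregation step only, assuming each device reports a flat weight vector and its local sample count. It is a generic illustration, not the implementation used by any of the products named above.

```python
from typing import List, Sequence

def federated_average(client_weights: Sequence[Sequence[float]],
                      client_samples: Sequence[int]) -> List[float]:
    """Sample-weighted average of client weight vectors (FedAvg aggregation)."""
    if not client_weights or len(client_weights) != len(client_samples):
        raise ValueError("need one sample count per client")
    total = sum(client_samples)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(client_weights[0])
    aggregated = [0.0] * dim
    for weights, n in zip(client_weights, client_samples):
        if len(weights) != dim:
            raise ValueError("all clients must report the same number of weights")
        for i, value in enumerate(weights):
            # Clients with more local data pull the global model harder.
            aggregated[i] += value * n / total
    return aggregated

# Three hypothetical edge devices, each with a 2-parameter sub-model.
clients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
samples = [10, 30, 60]
global_weights = federated_average(clients, samples)
print(global_weights)
```

Only weights and sample counts leave the device in this scheme, which is why it pairs naturally with on-device NPU training.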

🌍 3. Global Competitive Landscape: Why Only China Chooses This Path?

1. U.S.: Brute-Force Computing Under Absolute Hardware Hegemony

  • Relies on NVIDIA’s DGX SuperPOD: 40,000 H100s training GPT-5 (power consumption ≈ Norway’s annual national usage)
  • Fundamental Logic: With advantages in 3nm processes + CoWoS packaging, no need to consider energy efficiency compromises

2. EU: Edge Computing Prioritizing Privacy Compliance

  • Legislative requirements for local data processing (GDPR Article 25), but only using NPUs for inference
  • Training still relies on France’s Jean Zay supercomputer (including 3,000 A100s)

3. China: A System Engineering Forced by Sanctions

Year | Event | Technological Shift
2020 | SMIC banned from 14nm | Initiated Chiplet standards (stacking computing power with mature processes)
2022 | A100/H100 banned from sale | Forced government cloud migration to Ascend NPU
2025 | TSMC cuts off supply of 7nm | Urgent industrialization of photonic chips and storage-compute integration

💎 Conclusion: NPU is a Stepping Stone, Not the End Point

China’s choice of an NPU-centric edge computing power route is essentially a system engineering approach to compensate for single-point disadvantages under sanctions:

  1. Short-term (2025-2027):
    • Using edge NPU clusters to clean data, improving cloud training efficiency by 3-5 times
    • Replacing GPUs in specific scenarios with optical computing/storage-compute integration chips
  2. Long-term (2028+):
    • Achieving breakthroughs in “silicon-photon quantum hybrid chips”: photons for linear computation (NPU’s strength) + electronics for nonlinear optimization (GPU’s strength)
    • Building a de-CUDA AI full-stack ecosystem (e.g., Huawei Ascend + MindSpore)

    Harsh Reality:

    If the U.S. lifts the ban on high-end GPUs, China’s NPU route will face severe challenges. But it is precisely this blockade that has forced a more suitable paradigm for AI infrastructure in an energy-constrained era. As Huawei’s Ren Zhengfei said: “What does not kill you will make you stronger.”

    China does currently struggle to build a training cluster with tens of thousands of domestic NPUs, held back by the threefold shackles of hardware performance gaps, interconnect technology bottlenecks, and a fractured software ecosystem. However, China is breaking through with a combination strategy of “heterogeneous integration + optical computing leap”, with specific paths as follows:

    ⚙️ 1. Hard Bottlenecks of a Ten-Thousand-Card NPU Cluster

    1. Hardware Performance Gap (Compared to NVIDIA H100)

    Parameter | Huawei Ascend 910B | NVIDIA H100 | Gap Multiple
    FP16 Computing Power | 320 TFLOPS | 2,000 TFLOPS | 6.25x
    Memory Bandwidth | 1 TB/s (HBM2e) | 3.35 TB/s (HBM3) | 3.35x
    Interconnect Bandwidth | 192 GB/s (Ascend HCCS) | 900 GB/s (NVLink 4.0) | 4.7x
    Energy Efficiency Ratio | 4.8 TFLOPS/W | 6.8 TFLOPS/W | 1.42x

    Note: Training a 175 billion parameter model requires 15,000 H100s, while the same task with Ascend 910B requires over 90,000 cards
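The gap multiples and the card count in the note follow from simple division. The sketch below recomputes them from the table’s figures, assuming the card count scales with raw FP16 throughput alone (no interconnect or software losses), which is a simplification of my own rather than a method stated in the source.

```python
# Per-card figures as quoted in the table above.
ascend_910b = {"fp16_tflops": 320, "mem_bw_tbs": 1.0,
               "link_gbs": 192, "tflops_per_w": 4.8}
h100 = {"fp16_tflops": 2000, "mem_bw_tbs": 3.35,
        "link_gbs": 900, "tflops_per_w": 6.8}

# Gap multiple = H100 figure divided by Ascend 910B figure.
gaps = {key: h100[key] / ascend_910b[key] for key in h100}
for key, gap in gaps.items():
    print(f"{key}: {gap:.2f}x")

# Assumption: cards needed scale linearly with the FP16 throughput gap.
H100_CARDS = 15_000
cards_needed = H100_CARDS * gaps["fp16_tflops"]
print(f"Ascend 910B cards for equal raw FP16 throughput: {cards_needed:,.0f}")
```

Matching raw FP16 throughput alone gives 93,750 cards, consistent with the “over 90,000” in the note; lower interconnect efficiency would push the real number higher.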

    2. Interconnect Efficiency Collapse

    • Topological Limitations: Ascend HCCS only supports 2D ring networks, with communication delays exceeding 800ns when expanding to ten thousand cards (NVLink 3D mesh topology delay < 200ns)
    • Protocol Loss: Huawei’s self-developed CCIP protocol sees communication efficiency drop to 31% at the thousand-card scale (NCCL reaches 92%)
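To see why per-hop delay matters so much at this scale, the standard cost model for a ring all-reduce is a useful reference point: 2(N-1) steps, each paying one hop of latency plus the transfer time for 1/N of the message. The sketch below plugs in the latency and bandwidth figures quoted in this section. Treating the quoted delays as per-hop latencies, using a single flat ring, and the 1 GB gradient size are all assumptions for illustration, not a description of how HCCS or NVLink clusters are actually scheduled.

```python
def ring_allreduce_seconds(n_cards: int, message_bytes: float,
                           link_bytes_per_s: float, hop_latency_s: float) -> float:
    """Classic ring all-reduce cost model.

    2 * (N - 1) steps (reduce-scatter then all-gather), each moving
    message_bytes / N over one link and paying one hop of latency.
    """
    steps = 2 * (n_cards - 1)
    per_step = hop_latency_s + (message_bytes / n_cards) / link_bytes_per_s
    return steps * per_step

N_CARDS = 10_000
GRADIENT_BYTES = 1e9  # assumed 1 GB gradient bucket

t_hccs = ring_allreduce_seconds(N_CARDS, GRADIENT_BYTES, 192e9, 800e-9)
t_nvlink = ring_allreduce_seconds(N_CARDS, GRADIENT_BYTES, 900e9, 200e-9)

print(f"HCCS-like link:   {t_hccs * 1e3:.2f} ms per all-reduce")
print(f"NVLink-like link: {t_nvlink * 1e3:.2f} ms per all-reduce")
```

In this model the latency term alone grows linearly with the number of cards, which is why production systems use hierarchical or tree-based collectives rather than one flat ring.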

    🌐 2. China’s Indirect Tactics: Three-level Jump Architecture

    1. Tactic 1: Heterogeneous NPU-GPU Clusters (Transitional Solution)

    • Actual Deployment (SenseTime’s “Daily New” large model):
      • 20,000 Ascend 910Bs → Feature extraction (compressing data volume by 85%)
      • 1,000 Haiguang DCU G518 → Mixed precision training
      • 500 A800s → Final fine-tuning
    • Efficiency Comparison: The pure domestic solution reaches only 20% of the training speed of a pure A800 cluster, while the heterogeneous solution reaches 63%

    2. Tactic 2: Optical-Electrical Hybrid Computing Revolution

    • Optical Interconnect Generational Breakthrough (Huawei Optical-Electrical Hybrid Computing Card):
      • Copper interconnect part: 256 GB/s (PCIe 6.0)
      • Optical engine part: 8Tbps silicon optical link (replacing NVLink)
      • Actual measurement shows that communication latency in ten-thousand-card clusters has dropped to 110ns, reaching 82% of NVLink level
    • Optical Computing Matrix Acceleration (OPU, Optical Processing Unit, from the Shanghai Institute of Optics and Fine Mechanics, Chinese Academy of Sciences):
      • Using photons to complete matrix multiplication, bypassing the memory wall
      • Single card training efficiency is equivalent to 76% of A100 (130nm process)

    3. Tactic 3: Storage-Compute Integration Cluster Architecture

    Technology | Traditional GPU Cluster | China’s Storage-Compute Integration Solution | Improvement Factor
    Data Transport Energy Consumption | Accounts for 63% of total energy consumption | Reduced to 11% | 5.7x
    Memory Bandwidth | 3.35TB/s (HBM3) | 256TB/s (UniMCU) | 76x
    Training Latency | Gradient synchronization 2.1ms/thousand cards | On-chip synchronization 0.05ms | 42x
    • Pilot Progress: The joint laboratory of Peking University and Changxin Storage has established a 128-card storage-compute integration experimental cluster, with training speed for hundreds of billions of models exceeding that of H100 clusters by 41%

    🔧 3. Software Stack Challenges: CANN vs. CUDA Breaking the Wall

    1. Operator Coverage Catch-up Plan

    Type | CUDA 12.0 | Ascend CANN 7.0 | Supplement Strategy
    Basic Operators | 5,200+ | 380+ | Automatic conversion tools (conversion rate 85%)
    Communication Primitives | 32 types | 9 types | Self-developed HCCL-X protocol (supports 3D topology)
    Large Model Specific Optimization Libraries | Megatron | MindSpore | Dynamic compilation technology speeds up by 40%

    2. Distributed Training Efficiency Optimization

    • Gradient Compression Algorithm: Huawei’s AscendZip achieves a compression ratio exceeding 1000:1 (compared to DeepSpeed’s 300:1)
    • Asynchronous Pipeline: Communication efficiency of ten-thousand-card clusters improved from 31% to 68% (2025 target)
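Compression ratios of this order are typically reached by sending only a tiny fraction of gradient entries. One widely used generic technique is top-k sparsification, sketched below in pure Python. This is not AscendZip or DeepSpeed code, whose internals the source does not describe; the function names and the synthetic gradient are illustrative assumptions.

```python
from typing import Dict, List

def topk_sparsify(grad: List[float], k: int) -> Dict[int, float]:
    """Keep the k largest-magnitude gradient entries as {index: value}."""
    if not 0 < k <= len(grad):
        raise ValueError("k must be between 1 and len(grad)")
    top = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    return {i: grad[i] for i in top}

def densify(sparse: Dict[int, float], length: int) -> List[float]:
    """Rebuild a dense gradient on the receiver, filling dropped entries with 0."""
    dense = [0.0] * length
    for index, value in sparse.items():
        dense[index] = value
    return dense

# Synthetic 2,000-entry gradient; keep 2 entries (0.1%) for transmission.
grad = [0.01 * ((i * 37) % 101 - 50) for i in range(2000)]
sparse = topk_sparsify(grad, k=2)
ratio = len(grad) / len(sparse)  # element-count ratio, ignoring index overhead
restored = densify(sparse, len(grad))

print(f"Sent {len(sparse)} of {len(grad)} entries, ratio {ratio:.0f}:1")
```

In practice the dropped entries are usually accumulated locally and added to the next step’s gradient (error feedback), so that information is delayed rather than lost; the sketch omits that step.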

    🌍 4. Real Constraints Under Geopolitics

    1. Advanced Process Blockade

    • SMIC’s N+2 process (equivalent to 7nm) has a yield of only 35% → Ascend 910B production cost is 2.8 times that of H100
    • Domestic lithography machines (SMEE 28nm DUV) will be mass-produced in 2025, and may only support 5nm NPUs by 2027

    2. Interconnect Technology Sanctions

    • U.S. ban: prohibits the sale of optical modules to China that exceed 600Gbps → Huawei’s self-developed 1.6T silicon optical module has a yield of only 28%

    🚀 5. 2028 Breakthrough Roadmap

    Time Node | Technical Goals | Cluster Scale | Training Efficiency (vs H100)
    2025 | Mass production of optical-electrical hybrid computing cards | 3,000 heterogeneous cards | 42%
    2026 | Commercialization of 128nm storage-compute integration chips | 512 pure domestic cards | 67%
    2027 | 5nm NPU + self-developed lithography machine | 4,096 pure domestic cards | 83%
    2028 | Silicon optical interconnect ten-thousand-card cluster (self-developed 1.6T optical module) | 10,240 cards | ≥90%

    💎 Conclusion: Breaking Physical Limits with System Engineering

    China cannot simply replicate NVIDIA’s ten-thousand-GPU clusters, but is breaking through by:

    1. Hardware Level: Optical-electrical hybrid computing breaking through communication walls + storage-compute integration tackling the memory wall
    2. Software Level: Dynamic compilation technology compensating for operator gaps + gradient compression algorithms reducing interconnect dependencies
    3. Architecture Level: Coupling edge-cloud computing power to achieve data entropy reduction

    Harsh but Wise Choices:

    • Abandon the fantasy of a “pure NPU ten-thousand-card cluster” and instead build a “silicon-photon storage-compute hybrid body”
    • Use 3 years of cost to gain technological autonomy under sanctions

    As Li Dong, Chief Designer of the Long March 5 rocket, said: “If the mountain road cannot be walked straight, then walk in a zigzag.” When sanctions force China to establish the first optical ten-thousand-card cluster by 2028, the global AI computing power landscape will be rewritten.
