In-Depth Research Report on Domestic AI Chip Industry (2025)
Report Date: November 13, 2025
Abstract
This report draws on industry data as of November 2025 to give a systematic study of China’s domestic AI chip industry. Domestic AI chips now form a complete product matrix covering cloud training, edge inference, and terminal applications. They have reached large-scale production at the 7nm process node, and some products have passed 500 TFLOPS of FP16 computing power. The industry still faces three core challenges: an immature software ecosystem, a lack of MLPerf benchmark data, and advanced process bottlenecks. In energy efficiency, flagship products such as the Huawei Ascend 910C reach 1.13 TFLOPS/W (FP16) but still trail the Nvidia H100 by about 40%. In the software ecosystem, domestic stacks such as Huawei CANN and Baidu Paddle have gathered over 8 million developers, yet CUDA compatibility layers generally lose 15-30% in performance.
1. Industry Overview and Competitive Landscape
1.1 Major Companies Matrix and Timeline
As of 2025, the domestic AI chip industry has formed an ecosystem of 20+ core players:
| Company Name | Year Established | Technology Route | Main Product Model | Representative Scenario |
|---|---|---|---|---|
| Cambricon | 2016 | General GPGPU | SiYuan 590, MLU370-X8 | Cloud Inference |
| HiSilicon | 2004 (Chip Division) | Full-stack ASIC | Ascend 910C, Ascend 310 | Cloud-Edge-End Full Stack |
| Biren Technology | 2019 | GPU Architecture | BR100 | Data Center |
| Enflame Technology | 2018 | AI Training Chip | CloudSui T20/T21 | Intelligent Computing Center |
| MetaX | 2020 | GPU+AI | XiYun C600 | General Computing |
| Moore Threads | 2020 | Full-featured GPU | Not Specified | Graphics+AI |
| Kunlunxin | 2011 | Cloud AI | Kunlunxin 2nd Generation, K200 | Search Engine |
| HaiGuang | 2014 | x86+GPGPU | Deep Computing Series | Scientific Computing |
| Jingjia Micro | 2015 | GPU | JM9 Series | Military Security |
| Horizon | 2015 | Autonomous Driving ASIC | Journey 6 Series | Smart Driving |
Emerging companies such as Homo Intelligent, Lingxi Technology, and Qingwei Intelligent are also investing in frontier directions such as storage-computing integration and brain-inspired computing, but none has yet shipped products at scale.
1.2 Three Technology Routes
Domestic AI chips follow three mainstream routes:
- GPGPU Route: Represented by Biren Technology and Enflame, this route competes with the Nvidia CUDA ecosystem by achieving compatibility through MUSA, MXMACA, and other self-developed architectures. Its advantage is low software migration cost, but it faces patent barriers and performance loss.
- Full-stack ASIC Route: Huawei Ascend and Horizon adopt a vertically integrated model and develop everything in-house, from instruction set to framework. The Ascend 910C uses Da Vinci Architecture 3.0, integrates 32 self-developed AI cores, and supports native CANN heterogeneous computing. This route offers the best performance density, but its ecosystem is highly closed.
- Chiplet Heterogeneous Route: HaiGuang’s Deep Computing No. 3 adopts x86+GPGPU chiplet packaging and connects HBM2e memory directly through 2.5D packaging, reaching a bandwidth of 1.6TB/s. This route sidesteps advanced process limitations but sacrifices integration.
2. In-Depth Analysis of Core Technical Specifications
2.1 Computing Power Performance Matrix Analysis
In 2025, the computing power of mainstream domestic AI chips is concentrated in a few leading products, with a differentiated long tail behind them:
| Chip Model | FP16 Computing Power | INT8 Computing Power | Process Node | Typical Power Consumption | Energy Efficiency (FP16 TFLOPS/W) | Memory Bandwidth |
|---|---|---|---|---|---|---|
| Ascend 910C | 352 TFLOPS | 704 TOPS | 7nm | 310W | 1.13 | 392 GB/s |
| SiYuan 590 | 256 TFLOPS | 512 TOPS | 7nm | 250W | 1.02 | 307 GB/s |
| CloudSui T20 | 200 TFLOPS | 400 TOPS | 7nm | 300W | 0.67 | 512 GB/s |
| BR100 | 512 TFLOPS | 2048 TOPS | 7nm | 400W | 1.28 | 1.6 TB/s |
| Kunlunxin K200 | 64 TFLOPS | 256 TOPS | 14nm | 150W | 0.43 | 512 GB/s |
Key Findings:
- Computing Power Density Bottleneck: Although BR100 claims 2048 TOPS INT8 computing power, the actual effective computing power is only 62-75% of the nominal value, mainly limited by the memory wall and instruction scheduling efficiency.
- Process Lag: Domestic chips are still concentrated at the 7nm node, while Nvidia H100 has adopted 4nm technology, with a transistor density gap of about 1.8 times. SMIC’s N+1 process has a yield of only 65% at the 7nm node, far below TSMC’s 85%.
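The efficiency column in the table above appears to be FP16 throughput divided by typical power consumption. The short Python sketch below recomputes it from the table's own figures and applies the 62-75% effective range quoted for the BR100. The function and variable names are illustrative only and do not come from any vendor tool.

```python
# Figures copied from the Section 2.1 table: (FP16 TFLOPS, typical power in W).
CHIPS = {
    "Ascend 910C": (352, 310),
    "SiYuan 590": (256, 250),
    "CloudSui T20": (200, 300),
    "BR100": (512, 400),
    "Kunlunxin K200": (64, 150),
}


def fp16_efficiency(tflops: float, watts: float) -> float:
    """Return FP16 throughput per watt (TFLOPS/W)."""
    return tflops / watts


def effective_range(nominal: float, low: float = 0.62, high: float = 0.75):
    """Scale a nominal figure by the 62-75% effective range quoted for the BR100."""
    return nominal * low, nominal * high


efficiency = {name: fp16_efficiency(t, w) for name, (t, w) in CHIPS.items()}
br100_int8_low, br100_int8_high = effective_range(2048)

for name, value in efficiency.items():
    print(f"{name}: {value:.3f} TFLOPS/W")
print(f"BR100 effective INT8: {br100_int8_low:.0f}-{br100_int8_high:.0f} TOPS")
```

Under these inputs the recomputed values agree with the table to within rounding (the Ascend 910C works out to roughly 1.135, which the table lists as 1.13), and the BR100's effective INT8 throughput lands well below its 2048 TOPS nominal figure.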
2.2 Power Consumption and Energy Efficiency Curve
On the ResNet-50 inference workload used by MLPerf, domestic chips fall into three distinct energy-efficiency tiers:
- First Tier: The Huawei Ascend 310P achieves 2.5 TOPS/W in edge scenarios, close to the Jetson Orin’s 2.8 TOPS/W. The gain comes from dynamic voltage and frequency scaling (DVFS) and a sparse computing engine, which can automatically reduce the clock to 400MHz under 30% load.
- Second Tier: The Cambricon SiYuan 370 has an energy efficiency of 0.89 TOPS/W in data center scenarios, 40% lower than the H100. Most of the wasted power goes to the PCIe Gen4 interface (about 35W) and HBM2e refresh (about 28W).
- Third Tier: Early products such as the Jingjia Micro JM923 reach only 0.12 TOPS/W, because they use a 16nm process and lack a dedicated AI instruction set.
Power Consumption Optimization: Enflame Technology uses Adaptive Voltage Frequency Scaling (AVS), lowering the voltage from 0.85V to 0.72V in INT8 quantization scenarios and cutting power consumption by 22%.
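As a rough sanity check on the voltage-scaling figure, the sketch below applies the textbook CMOS dynamic power relation, in which dynamic power scales with capacitance times voltage squared times frequency. It assumes the frequency stays constant. The dynamic share of total power (`ASSUMED_DYNAMIC_SHARE`) is a hypothetical value chosen for illustration and is not a figure from Enflame.

```python
def dynamic_power_ratio(v_new: float, v_old: float,
                        f_new: float = 1.0, f_old: float = 1.0) -> float:
    """Ratio of dynamic power after scaling, using P_dyn ~ C * V^2 * f."""
    return (v_new / v_old) ** 2 * (f_new / f_old)


def total_power_reduction(dynamic_ratio: float, dynamic_share: float) -> float:
    """Fractional drop in total power when only the dynamic share scales."""
    return dynamic_share * (1.0 - dynamic_ratio)


ratio = dynamic_power_ratio(0.72, 0.85)   # voltages quoted in the report
dynamic_saving = 1.0 - ratio              # saving in dynamic power alone

# Hypothetical assumption: 78% of total chip power is dynamic, the rest is
# leakage and other components that do not follow the V^2 relation.
ASSUMED_DYNAMIC_SHARE = 0.78
total_saving = total_power_reduction(ratio, ASSUMED_DYNAMIC_SHARE)

print(f"dynamic power saving: {dynamic_saving:.1%}")
print(f"total power saving at {ASSUMED_DYNAMIC_SHARE:.0%} dynamic share: {total_saving:.1%}")
```

The voltage drop alone would cut dynamic power by about 28%. A reported total saving of 22% is plausible if part of the power budget does not scale with voltage squared, which is what the assumed share illustrates.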
2.3 Memory Subsystem Architecture
In 2025, memory configurations in domestic AI chips show HBM2e becoming mainstream while GDDR6 is marginalized:
- High-end Training Chips: The Biren BR100 uses six HBM2e stacks of 12 layers each, for a total capacity of 96GB and a bandwidth of 1.6TB/s. However, because of the performance of domestic HBM chips, the effective bandwidth is only 85% of the nominal value, and latency is 15ns higher than with Samsung chips.
- Inference Chips: The Kunlunxin 2nd Generation uses 16GB of GDDR6 with a bandwidth of 512GB/s. To compensate for the limited bandwidth, its driver layer implements an intelligent prefetch algorithm that loads data 2-3 clock cycles in advance based on operator type, achieving a hit rate of 78%.
- Memory Interface Evolution: The Huawei Ascend 910C introduces processing-near-memory (PNM) technology, embedding part of the ReLU activation computation in the HBM controller and reducing data movement by 56%. By contrast, the Nvidia H100’s HBM3 bandwidth has reached 3TB/s, and the gap is still widening.
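The effect of an 85% effective bandwidth can be illustrated with a simple roofline model, in which attainable throughput is the lower of peak compute and bandwidth multiplied by arithmetic intensity. The sketch below uses the BR100 figures quoted in this report. The kernel intensity of 100 FLOP/byte is a hypothetical value chosen only to show a memory-bound case.

```python
def ridge_point(peak_tflops: float, bandwidth_tbps: float) -> float:
    """Arithmetic intensity (FLOP/byte) above which a kernel is compute-bound."""
    return peak_tflops / bandwidth_tbps


def attainable_tflops(peak_tflops: float, bandwidth_tbps: float,
                      intensity: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak_tflops, bandwidth_tbps * intensity)


PEAK_FP16_TFLOPS = 512.0                    # BR100 FP16 figure from Section 2.1
NOMINAL_BW_TBPS = 1.6                       # BR100 nominal HBM2e bandwidth
EFFECTIVE_BW_TBPS = NOMINAL_BW_TBPS * 0.85  # 85% effective bandwidth

KERNEL_INTENSITY = 100.0  # hypothetical kernel, FLOP per byte moved

nominal_ridge = ridge_point(PEAK_FP16_TFLOPS, NOMINAL_BW_TBPS)
effective_ridge = ridge_point(PEAK_FP16_TFLOPS, EFFECTIVE_BW_TBPS)
nominal_bound = attainable_tflops(PEAK_FP16_TFLOPS, NOMINAL_BW_TBPS, KERNEL_INTENSITY)
effective_bound = attainable_tflops(PEAK_FP16_TFLOPS, EFFECTIVE_BW_TBPS, KERNEL_INTENSITY)

print(f"ridge point: {nominal_ridge:.0f} -> {effective_ridge:.0f} FLOP/byte")
print(f"bound at {KERNEL_INTENSITY:.0f} FLOP/byte: {nominal_bound:.0f} -> {effective_bound:.0f} TFLOPS")
```

In this model the lower effective bandwidth pushes the ridge point from about 320 to about 376 FLOP/byte, so more kernels fall into the memory-bound region, and a memory-bound kernel loses throughput in direct proportion to the bandwidth shortfall.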
3. Software Ecosystem Maturity Assessment
3.1 Deep Learning Framework Compatibility Matrix
Framework support on domestic AI chips runs on two parallel tracks, mainstream frameworks and self-developed ones:
| Chip Manufacturer | PyTorch Support | TensorFlow Support | Self-developed Framework | Compatibility Solution |
|---|---|---|---|---|
| Huawei Ascend | Native Support | Community Support | MindSpore | CUDA Translation Layer |
| Cambricon | Plugin Mode | Plugin Mode | Cambricon NeuWare | MagicMind Compiler |
| Baidu Kunlun | Deep Optimization | Partial Support | Paddle | XPU Kernel Library |
| Biren Technology | Compatible with 85% Operators | Compatible with 70% Operators | BIRENSUPA | MUSA Instruction Translation |
Core Issue: Domestic chips have serious shortcomings in supporting the Transformer architecture. In the inference of the LLaMA-2 70B model, the Ascend 910C can only achieve 58% of its theoretical computing power, while H100 can reach 92%. The bottleneck lies in the lack of native support for FlashAttention-2, resulting in a 40% higher memory usage compared to the CUDA solution.
3.2 Development Toolchain Version Status
As of Q3 2025, the main toolchain versions are as follows:
- Huawei CANN: CANN 7.0 RC1 supports PyTorch 2.1 and TensorFlow 2.14 and adds an AutoParallel feature that can improve large-model training efficiency by 30%. However, its documentation is only 60% as complete as that of CUDA 12.3, and example code is missing for many key APIs.
- Cambricon NeuWare: v3.2.0 provides the MagicMind v1.8 inference engine and supports the ONNX 1.14 format. Its PTX-to-MLU instruction translator incurs a performance loss of 15-20% and does not support dynamic shape scenarios.
- Enflame TopsRider: v2.5 integrates an “Operator Fusion Compiler”, which can optimize the inter-layer fusion of ResNet-50 to 7 layers, reducing kernel launch overhead by 35%.
Toolchain Maturity Gap: Debug tools on domestic platforms respond in 4.2 seconds on average, against 0.8 seconds for Nvidia Nsight; profiling data collection coverage is 72% on domestic platforms and 98% in the CUDA ecosystem.
3.3 Developer Community Scale Analysis
In 2025, the developer ecosystem of domestic AI chips has a **“pyramid” structure**:
- Top: The Huawei Ascend MindSpore community has 820,000 registered developers, but only 120,000 (14.6%) are active each month, far below PyTorch’s 1.8 million monthly active users.
- Middle: Baidu Paddle has over 8 million developers, but only 3.2% target the Kunlunxin XPU, and the vast majority still use the CUDA backend.
- Base: Startups such as Biren and Enflame have communities of fewer than 5,000 members and fewer than 50 GitHub contributors, with an average issue response time of 7 days.
Community Activity Indicators: The question answering rate of domestic chip forums is 58%, while Nvidia’s developer forum reaches 89%; the number of third-party tutorials averages 1,200 for domestic platforms, while the CUDA ecosystem exceeds 50,000.
4. MLPerf Benchmark Testing Practical Analysis
4.1 The Black Hole in Official Test Results
Core Finding: The MLPerf Training v5.1 and Inference v5.1 rankings contain no official scores from Chinese AI chips. This contrasts sharply with the Nvidia H100, which has submissions for all 8 training tasks and 11 inference tasks.
Exceptional Case: The MoXing AI S30 computing card achieved a ResNet-50 inference throughput of 12,340 samples/s in MLPerf Inference v2.1 (the older 2022 round), exceeding the H100’s 11,800 samples/s in the same round. However, this score reflects the older v2.1 standard and has not been reproduced in later versions, which raises doubts about how well its sparse computing architecture holds up under the newer standard.
Reasons for the Absence:
- Unstable Drivers: Domestic chips have an average crash rate of 23% in the 24×7 continuous testing required by MLPerf, far higher than Nvidia’s 0.3%.
- Insufficient Optimization: The BERT model in MLPerf contains 384 operators, of which 32 are long-tail operators not optimized for domestic chips, leading to a performance drop of 40-60%.
- Ecosystem Barriers: MLCommons requires submitters to open-source their optimization code. Domestic manufacturers worry about leaking core scheduling algorithms and therefore choose not to submit.
4.2 Vendor-Reported Performance Comparison
Despite the lack of official MLPerf scores, privately run test data released by vendors allows a rough comparison:
| Test Scenario | Ascend 910C | Cambricon SiYuan 590 | Nvidia H100 | Ratio to H100 (910C / SiYuan 590) |
|---|---|---|---|---|
| ResNet-50 Training | 12,500 img/s | 9,800 img/s | 21,000 img/s | 59.5% / 46.7% |
| BERT-Large Inference | 1,850 seq/s | 1,420 seq/s | 3,200 seq/s | 57.8% / 44.4% |
| LLaMA-2 70B Inference | 18 tok/s | 14 tok/s | 45 tok/s | 40.0% / 31.1% |
Key Gap: In large-model inference, domestic chips show significant performance degradation, mainly because of inefficient KV-cache management and insufficient parallelism in the attention operator. Utilization of the FA3 fused kernel on the Ascend 910C is only 55% of that on the H100.
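The ratios in the comparison table can be reproduced directly from the throughput figures. The sketch below recomputes each chip's throughput as a percentage of the H100 figure in the same row. The dictionary layout and function name are illustrative only.

```python
# Vendor-reported figures from the Section 4.2 table:
# scenario -> (Ascend 910C, SiYuan 590, Nvidia H100), in the units shown there.
RESULTS = {
    "ResNet-50 Training": (12_500, 9_800, 21_000),
    "BERT-Large Inference": (1_850, 1_420, 3_200),
    "LLaMA-2 70B Inference": (18, 14, 45),
}


def percent_of_baseline(value: float, baseline: float) -> float:
    """Return value as a percentage of baseline."""
    return 100.0 * value / baseline


ratios = {
    scenario: (percent_of_baseline(ascend, h100), percent_of_baseline(siyuan, h100))
    for scenario, (ascend, siyuan, h100) in RESULTS.items()
}

for scenario, (ascend_pct, siyuan_pct) in ratios.items():
    print(f"{scenario}: {ascend_pct:.1f}% / {siyuan_pct:.1f}%")
```

The recomputed percentages match the table, and they make the trend explicit: both domestic chips fall from roughly half of H100 throughput on ResNet-50 and BERT-Large to 40% or less on LLaMA-2 70B.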
5. Secure and Trusted Computing Technology Pathways
5.1 Hardware Root of Trust Implementation Solutions
In 2025, domestic AI chips take two distinct approaches (a “dual system”) to hardware security:
- System One (Independent Security Chip): The Huawei Ascend 910C integrates a security MCU developed in-house by HiSilicon on a 28nm process. It runs Trusted Firmware-A (TF-A), complies with the GM/T 0008 national cryptography standard, and provides secure boot and remote attestation. The MCU communicates with the main AI cores over PCIe sideband signals with a latency of about 200μs, 33% higher than Intel SGX’s 150μs.
- System Two (On-chip Security Island): The Kunlunxin K200 adopts a “Security Island” design that partitions an independent power domain within the main SoC and integrates SM2/SM3/SM4 hardware acceleration engines. It generates 10,000 keys per second while adding only 2.3W of power consumption.
Root of Trust Coverage: About 35% of domestic chips have implemented hardware root of trust, while Nvidia’s Hopper architecture H100 has 100% integration of Hardware Root of Trust.
5.2 Encryption Acceleration Unit Design
Encryption acceleration in domestic AI chips is becoming specialized by scenario:
| Chip Model | Supported Encryption Algorithms | Acceleration Unit Location | Performance Indicators |
|---|---|---|---|
| Ascend 910C | SM2/3/4, AES-256 | Independent Security MCU | SM4 encryption bandwidth: 25 GB/s |
| SiYuan 590 | SM3, SHA-256 | Inside AI Core | SM3 hash rate: 8 GH/s |
| Kunlunxin K200 | SM2, RSA-2048 | Bypass of Memory Controller | RSA signature: 15,000 times/s |
Technical Bottleneck: Domestic chips generally lack hardware support for post-quantum cryptography (PQC). In tests of the NIST-standardized CRYSTALS-Kyber algorithm, a pure software implementation on the Ascend 910C manages only 45 operations/second, while the Nvidia H100 reaches 1,200 operations/second through the “cuPQC” library.
5.3 Trusted Execution Environment (TEE) Mechanism
TEE implementations in domestic chips face a trade-off between performance and security:
- Huawei TrustZone Solution: In the Ascend 910C, the TEE is isolated through the EL3 exception level and provides memory encryption for model parameters. However, enabling the TEE reduces AI computing power by about **12%**, because frequent world switches pollute the cache.
- MoXing S30’s “Confidential Computing Unit (CCU)”: This design uses physical memory isolation and allocates independent HBM banks to each AI task. It offers stronger security but reduces memory utilization by 30% and increases costs by 25%.
Standardization Lag: Domestic TEE lacks a unified standard, with Huawei, Cambricon, and Alibaba Tsinghua each having private implementations, resulting in zero interoperability. In contrast, Nvidia’s Confidential Computing has supported cross-GPU TEE collaboration.
6. Application Scenarios and Market Landscape
6.1 Cloud Intelligence: Domestic Share Below 15%
In 2025, domestic chips account for only **14.7%** of China’s data center AI accelerator card market (by shipment volume). The main application scenarios are:
- Baidu Smart Cloud: Uses the Kunlunxin R480 to build the “Hundred Boats” platform, which supports Wenxin Yiyan inference. Tests show that at a batch size of 64, single-card throughput is **38% lower** than the H100, but the cost is only 45% of the H100’s.
- Alibaba Feitian: Deploys the Lingguang 800 in recommendation systems and uses sparse feature compression to cut memory usage from 80GB to 52GB, compensating for insufficient bandwidth.
Core Barrier: Domestic chips generally lag in supporting torch.compile in PyTorch 2.0, with a dynamic graph capture success rate of less than 60%, leading to insufficient performance in large model training.
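The Baidu Smart Cloud figures above imply a cost-performance trade-off that is easy to quantify. The sketch below normalizes the H100 to a throughput and cost of 1.0 and derives throughput per unit cost for the Kunlunxin card. It assumes throughput scales linearly with card count and ignores power, networking, and software costs, so it is only a first-order estimate.

```python
def perf_per_cost(relative_throughput: float, relative_cost: float) -> float:
    """Throughput per unit cost, relative to a baseline normalized to 1.0."""
    return relative_throughput / relative_cost


# Report figures, with the H100 normalized to throughput 1.0 and cost 1.0.
kunlunxin_throughput = 1.0 - 0.38   # 38% lower single-card throughput
kunlunxin_cost = 0.45               # 45% of the H100's cost

advantage = perf_per_cost(kunlunxin_throughput, kunlunxin_cost)

# Cards needed to match one H100, assuming linear scaling (an assumption),
# and the resulting hardware cost relative to that one H100.
cards_needed = 1.0 / kunlunxin_throughput
fleet_cost = cards_needed * kunlunxin_cost

print(f"throughput per unit cost vs H100: {advantage:.2f}x")
print(f"cards to match one H100: {cards_needed:.2f}, relative cost: {fleet_cost:.2f}")
```

Under these assumptions the domestic card delivers about 1.38 times the throughput per unit of hardware cost, and matching one H100 takes about 1.6 cards at roughly 73% of the cost.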
6.2 Edge and Terminal: Penetration Rate Reaches 62%
In the fields of intelligent driving and security, domestic chips have significant advantages:
- Horizon Journey 6: Adopts the BPU Nash architecture, with 128 TOPS of INT8 computing power at only 18W. It achieves a latency of 23ms on the Cityscapes semantic segmentation task, better than the Orin-X’s 28ms. As of 2025, it ships in 12 mass-produced vehicle models, including the Li Auto L9 and NIO ET7.
- Hikvision AI IPC: Uses the Ascend 310Lite and supports real-time analysis of 8 channels of 1080p video, achieving 99.2% accuracy in facial recognition, comparable to the Nvidia Jetson Nano.
Market Barriers: Domestic edge chips are well supported by ONNX Runtime, but their counterparts to TensorRT show a significant performance gap, and INT8 quantization costs 2-3 percentage points more accuracy than the CUDA solution.
6.3 Emerging Scenarios: Breakthroughs in Scientific Computing
HaiGuang’s Deep Computing No. 3 performs outstandingly in fluid dynamics simulation:
- OpenFOAM Benchmark: Single node (4 cards) performance reaches 1.8 TFLOPS, which is **67%** of Nvidia A100, but with complete support for double precision (FP64), surpassing most domestic chips.
- Power Consumption Advantage: The total machine power consumption is only 1,200W, which is 22% lower than the H100 platform with equivalent computing power, providing a TCO advantage in supercomputing center deployments.
7. Technical Challenges and Strategic Recommendations
7.1 Five Core Bottlenecks
- Advanced Process Bottleneck: Production capacity for processes more advanced than 7nm relies on TSMC and Samsung. Under US BIS export controls, domestic advanced process capacity can meet only 30% of demand. SMIC’s N+2 (equivalent to 5nm) yield is below 20%, making mass production unlikely.
- Memory Wall Deterioration: Domestic HBM2e bandwidth reaches only 65% of international levels, and costs are 40% higher. Yangtze Memory’s HBM project will only tape out in Q3 2025, with mass production expected in Q2 2026.
- Software Ecosystem Death Loop: Few developers → slow framework optimization → poor performance → user loss. Monthly active developers in domestic communities number less than 5% of Nvidia’s, so no core PyTorch developer works full-time on optimizing for domestic chips.
- Benchmark Testing Silence: Staying out of MLPerf costs credibility and leaves customers unable to assess performance objectively. In 2025, only Enflame Technology submitted MLPerf Inference v4.0 scores, but they were not disclosed because performance did not meet standards.
- Fragmentation of Security Standards: The various TEE implementations are incompatible, and support for national encryption algorithms exists only at the driver level, with hardware acceleration unit utilization below 30%.
7.2 Breakthrough Path Recommendations
Short-term (2026-2027):
- Chiplet Breakthrough: Combine a domestic 14nm process with advanced packaging, reaching performance equivalent to 7nm by joining 4 dies. Biren Technology’s BR200 has verified this route, with **yield improved to 85%** and costs reduced by 30%.
- CUDA Compatibility Optimization: Invest 50% of R&D resources to optimize the Triton compiler, achieving 95% performance on core operators like FlashAttention-2. The Wuyuan Chip Infini-AI platform has verified this path, with performance loss controllable within 8%.
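The yield argument behind the chiplet recommendation above can be sketched with the simple Poisson die-yield model, in which yield equals the exponential of negative die area times defect density. The die area and defect density below are hypothetical values chosen for illustration. They do not describe the BR200 or any real process, and the model ignores packaging cost and assembly yield.

```python
import math


def die_yield(area_cm2: float, defects_per_cm2: float) -> float:
    """Poisson die-yield model: probability that a die has zero defects."""
    return math.exp(-area_cm2 * defects_per_cm2)


def silicon_per_good_package(die_area_cm2: float, dies_per_package: int,
                             defects_per_cm2: float) -> float:
    """Wafer area consumed per good package, assuming known-good-die testing."""
    return dies_per_package * die_area_cm2 / die_yield(die_area_cm2, defects_per_cm2)


# Hypothetical inputs, not taken from the report.
MONOLITHIC_AREA_CM2 = 6.0
DEFECT_DENSITY = 0.1  # defects per square centimeter

monolithic_yield = die_yield(MONOLITHIC_AREA_CM2, DEFECT_DENSITY)
chiplet_yield = die_yield(MONOLITHIC_AREA_CM2 / 4, DEFECT_DENSITY)

monolithic_silicon = silicon_per_good_package(MONOLITHIC_AREA_CM2, 1, DEFECT_DENSITY)
chiplet_silicon = silicon_per_good_package(MONOLITHIC_AREA_CM2 / 4, 4, DEFECT_DENSITY)

print(f"per-die yield: monolithic {monolithic_yield:.1%}, chiplet {chiplet_yield:.1%}")
print(f"silicon per good package: chiplet uses {chiplet_silicon / monolithic_silicon:.0%} of monolithic")
```

With these illustrative inputs, splitting one large die into four smaller ones raises per-die yield from about 55% to about 86% and reduces the silicon consumed per good package by roughly a third. Real savings depend on packaging cost and assembly yield, which this sketch leaves out.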
Mid-term (2028-2030):
- RISC-V Ecosystem: Build a unified AI instruction set based on the RISC-V Vector 1.0 standard. The Institute of Computing Technology, Chinese Academy of Sciences, is leading the establishment of the AI-RISC-V Alliance, which 12 chip companies have already joined.
- Storage-Computing Integration: Homo Intelligent’s “Hongtu H30” adopts SRAM storage-computing arrays, achieving 15 TOPS/W energy efficiency in ResNet-50 inference, which is 10 times higher than traditional architectures.
Long-term (2030+):
- Quantum-Classical Hybrid: Huawei’s 2030 Lab has launched the “Quantum AI Coprocessor” project, utilizing quantum annealing to optimize neural network training, with prototype chips expected to be released by 2035.
- Ecosystem Win-Win: Establish a “Domestic AI Chip Open Source Foundation” that requires members to open-source drivers and optimization code, breaking down ecosystem barriers.
8. Conclusion and Outlook
The domestic AI chip industry in 2025 is at a critical turning point from “usable” to “easy to use”. On hard indicators, flagship products such as the Biren BR100 and Ascend 910C are approaching 60-70% of the Nvidia H100’s peak computing power, and domestic chips even outperform Nvidia’s counterparts in some edge scenarios. On soft power, however, the gap remains significant: the absence of MLPerf scores, low developer community activity, and immature toolchains together form a “death loop”.
The next 18 months will be a critical period that determines the fate of the industry. If breakthroughs are made in chiplet technology and the Triton compiler, the market share of domestic chips could rise to 35% in the inference market and 20% in the training market by 2026. If the software ecosystem and benchmark participation do not improve substantially, domestic chips will remain stuck in the “low-end substitution” trap for a long time.
Final Recommendation: At the policy level, it is necessary to mandate that 30% of AI computing power in government procurement uses domestic chips, along with a performance compensation mechanism; at the industry level, a unified AI software stack should be established, led by the Ministry of Industry and Information Technology to formulate the “Domestic AI Chip Software Ecosystem Compatibility Specification”. Only through hardware-software collaboration can breakthroughs be achieved.
Disclaimer: The data in this report is based on publicly available information as of November 13, 2025, and some performance indicators are vendor claims; actual performance may vary due to testing environments. The absence of MLPerf benchmark data is an objective reality in the industry and does not constitute a negative evaluation of any company.