The Ascend DeepSeek integrated machine is an AI solution based on self-developed Ascend AI chips (such as Ascend 910B and 910C) deeply integrated with the DeepSeek large model, aimed at providing a high-performance, low-cost, domestically produced AI computing power platform. This article provides a detailed analysis from various dimensions including the technology, products, architecture, specifications, customization, and industrial ecology of the integrated machine.
For more Ascend background, see “Domestic AI Chips: Ascend AI Processors”, “Domestic AI Chips: Ascend AI Computing Models”, and “Domestic AI Chips: Ascend AI Core Units”.
Huawei’s Ascend 910B is a high-performance processor chip designed for AI training and inference tasks, demonstrating outstanding performance.
Ascend 910B Manufacturing Process and Architecture Design
In terms of manufacturing process, the 910B adopts a 7-nanometer process, which delivers high performance at low power consumption. At the architectural level, the 910B is based on Huawei’s self-developed Da Vinci architecture, which integrates a large number of heterogeneous processing cores on chip together with advanced high-speed interconnect technology. This design enables efficient communication and collaborative computing between the processing cores, allowing the 910B to handle a wide range of complex AI tasks with high efficiency.
Ascend 910B Computing Power Performance
Peak Performance: Huawei’s Ascend 910B delivers a peak performance of up to 376 TFLOPS at FP16 precision (reported figures vary with the testing environment, but remain consistently high). By comparison, the NVIDIA A100 peaks at 312 TFLOPS at the same precision. This level of computing power gives the 910B a clear edge in computation-intensive tasks.
Multi-Precision Support: Beyond FP16, the Ascend 910B is compatible with FP32, INT8, and INT4 formats. This multi-precision support lets it match the numeric precision to each task’s requirements, trading precision for memory footprint and throughput where appropriate, which effectively improves processing efficiency.
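As a rough illustration of what the lower-precision formats buy (generic NumPy, not Ascend-specific code), here is a minimal symmetric INT8 quantization sketch: the INT8 copy occupies a quarter of the FP32 memory, while the round-trip error stays below one quantization step.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map [-max|x|, max|x|] onto [-127, 127]."""
    scale = float(np.abs(x).max()) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)   # toy FP32 weight matrix
q, s = quantize_int8(w)
w_hat = dequantize(q, s)

# INT8 uses 4x less memory than FP32; the reconstruction error is at most scale/2.
err = float(np.abs(w - w_hat).max())
print(q.nbytes, w.nbytes, err < s)  # 65536 262144 True
```

Production stacks add per-channel scales and calibration, but the memory and bandwidth arithmetic is the same.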
Ascend 910B Memory and Bandwidth
Memory Capacity: Huawei’s Ascend 910B is equipped with 64GB of HBM2E memory. Although this capacity trails some competing products, it remains among the leading configurations in the industry, providing ample room for large-scale datasets and keeping data processing efficient.
Bandwidth Performance: The 910B is equipped with a high-speed PCIe 5.0 interface and has a high internal bandwidth. This feature greatly accelerates data transfer speeds, allowing data to flow quickly between various components, thereby significantly enhancing the overall system performance.
Application Scenarios:
Huawei’s Ascend 910B is deployed across a wide range of industries, including autonomous driving, AI integrated machines, energy, finance, public utilities, transportation, telecommunications, manufacturing, and education. Backed by targeted solutions such as Intelligent Hub, Ascend Smart Patrol, Ascend Smart Travel, and Ascend Manufacturing, it provides strong support for the intelligent transformation and upgrading of these industries.
Ecological System: NVIDIA’s CUDA ecosystem has long held the leading position in the industry, while the self-developed CANN programming stack used by Huawei’s Ascend 910B is still in a phase of rapid development. As Huawei continues to invest in AI and build out a complete ecosystem, CANN’s maturity will keep improving and its developer community is expected to grow, offering developers richer resources and broader development space.
Ascend DeepSeek Integrated Machine
The core competitiveness of the Ascend DeepSeek integrated machine comes from the deep synergy between hardware and software.
Ascend 910B/910C Chip Technology:
Process and Computing Power: The 910B adopts a 7nm process, with FP16 computing power of 280 TFLOPS and INT8 computing power of 140 TOPS. The 910C is further optimized to SMIC N+2 process, with FP16 improved to about 320 TFLOPS, approaching 60%-70% of the performance of NVIDIA’s H100.
Energy Efficiency Optimization: Through dynamic voltage and frequency scaling (DVFS) and hand-written CANN kernels, power consumption on the 910C is reduced to about 250W, significantly lower than the H100’s 700W.
Heterogeneous Computing Support: Integrating AI Core (based on Da Vinci architecture), AI CPU, and DVPP modules, supporting multi-task parallelism.
DeepSeek Model Optimization:
MoE Architecture: DeepSeek adopts a Mixture-of-Experts (MoE) architecture, activating only a small number of parameters (about 4%) per token, doubling inference efficiency.
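The routing idea behind MoE can be sketched in a few lines of NumPy (toy sizes and random weights, not DeepSeek’s actual gating network): only the top-k experts’ weight matrices are touched per token, which is where the activated-parameter savings come from.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d, top_k = 64, 16, 2          # 2 of 64 experts active per token (~3%)
experts = rng.standard_normal((n_experts, d, d)) * 0.02   # toy expert weights
router = rng.standard_normal((d, n_experts)) * 0.02       # gating projection

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route one token through its top-k experts and mix by softmax gate weights."""
    logits = x @ router                       # score each expert: shape (n_experts,)
    top = np.argsort(logits)[-top_k:]         # indices of the k best-scoring experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                      # softmax over the selected experts only
    # Only top_k expert matmuls run; the remaining experts cost nothing for this token.
    return sum(g * (x @ experts[e]) for g, e in zip(gates, top))

token = rng.standard_normal(d)
out = moe_forward(token)
print(out.shape)  # (16,)
```

Real MoE layers add load-balancing losses and batched expert dispatch, but the compute saving comes entirely from this sparse selection step.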
DualPipe Algorithm: By overlapping computation with communication, DualPipe hides nearly all cross-node communication overhead; DeepSeek trained its 671B-parameter model on only 2048 H800 GPUs in about two months.
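The overlap principle can be illustrated with a toy Python pipeline (sleeps standing in for compute and transfer; this is not the actual DualPipe schedule): communication for one microbatch runs on a background thread while the next microbatch computes, so most of its cost is hidden.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def compute(chunk):            # stand-in for a microbatch's compute work
    time.sleep(0.05)
    return chunk * 2

def communicate(result):       # stand-in for the cross-node transfer
    time.sleep(0.05)
    return result

chunks = list(range(8))

# Serial baseline: compute, then communicate; transfer time is fully exposed.
t0 = time.perf_counter()
serial = [communicate(compute(c)) for c in chunks]
t_serial = time.perf_counter() - t0

# Overlapped: chunk i communicates on a background thread while chunk i+1 computes.
t0 = time.perf_counter()
overlapped = []
with ThreadPoolExecutor(max_workers=1) as comm:
    pending = None
    for c in chunks:
        r = compute(c)
        if pending is not None:
            overlapped.append(pending.result())
        pending = comm.submit(communicate, r)
    overlapped.append(pending.result())
t_overlap = time.perf_counter() - t0

print(serial == overlapped, t_overlap < t_serial)  # True True
```

With equal compute and transfer times the overlapped version approaches half the serial wall-clock, which is the effect DualPipe scales up across pipeline stages.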
Software Stack Adaptation: MindSpore and CANN are deeply optimized and support migrating CUDA code to CANN, reportedly cutting developer migration costs by 80%.
The Ascend 910C introduces hand-written CANN kernels (playing a role analogous to low-level PTX tuning on CUDA) that optimize the matrix multiplications in Transformer models, reducing inference latency from 10ms to 6ms.
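The matrix-multiplication tiling such kernels rely on can be sketched generically in NumPy (illustrative only; real kernels tile for the chip’s on-chip buffers and issue hardware matrix instructions): the computation is decomposed into small blocks so each working tile stays resident in fast memory while it is reused.

```python
import numpy as np

def blocked_matmul(a: np.ndarray, b: np.ndarray, tile: int = 64) -> np.ndarray:
    """Matmul computed tile by tile, the blocking pattern kernel authors use to
    keep working tiles resident in fast on-chip memory."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    c = np.zeros((m, n), dtype=a.dtype)
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):       # accumulate partial products per tile
                c[i0:i0+tile, j0:j0+tile] += (
                    a[i0:i0+tile, k0:k0+tile] @ b[k0:k0+tile, j0:j0+tile]
                )
    return c

rng = np.random.default_rng(0)
a = rng.standard_normal((128, 96)).astype(np.float32)
b = rng.standard_normal((96, 160)).astype(np.float32)
print(np.allclose(blocked_matmul(a, b), a @ b, atol=1e-3))  # True
```

The tile size is the key tuning knob: it is chosen so that one tile of each operand plus the accumulator fits in the fastest memory tier.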
DeepSeek enhances the accuracy of complex tasks (such as mathematical reasoning) through a Multi-Head Latent Attention (MLA) mechanism, achieving an inference throughput of 500 tokens per second.
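A heavily simplified sketch of the latent-compression idea behind MLA (toy dimensions and random weights; real MLA adds per-head up-projections and decoupled rotary embeddings): keys and values are reconstructed from a small shared latent vector, so only the latent needs to be cached per token instead of full-width K and V.

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d, r = 32, 64, 8                        # r << d: the latent cache is 8x smaller
W_q  = rng.standard_normal((d, d)) * 0.05
W_dk = rng.standard_normal((d, r)) * 0.05    # down-projection: this output is cached
W_uk = rng.standard_normal((r, d)) * 0.05    # up-projections applied at attention time
W_uv = rng.standard_normal((r, d)) * 0.05

x = rng.standard_normal((seq, d))
latent = x @ W_dk          # cached per token: (seq, r) instead of two (seq, d) tensors
q = x @ W_q
k = latent @ W_uk          # keys and values reconstructed from the shared latent
v = latent @ W_uv

scores = q @ k.T / np.sqrt(d)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)     # row-wise softmax
out = attn @ v

print(latent.shape, out.shape)  # (32, 8) (32, 64)
```

The shrunken KV cache is what lifts inference throughput: at long sequence lengths, decoding is bound by cache bandwidth rather than arithmetic.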
System Architecture of the Ascend DeepSeek Integrated Machine
The Ascend DeepSeek integrated machine adopts a modular, distributed design:
Hardware Layer:
Core: Ascend 910B/910C + Kunpeng 920 CPU.
Storage: NVMe SSD (single machine capacity up to 16TB).
Network: RoCE v2 (200Gbps bandwidth), supporting ultra-large-scale clusters. The RoCE network adopts a non-uniform Bruck algorithm, improving cluster communication efficiency by 50%, with network cost accounting for less than 20%.
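The classic Bruck allgather pattern that such schemes build on can be simulated directly (the “non-uniform” variant referenced above additionally handles unequal block sizes, which this sketch omits): each of the ceil(log2 p) rounds doubles how many blocks every rank holds, so even non-power-of-two clusters finish in logarithmically many exchanges.

```python
def bruck_allgather(blocks):
    """Simulate a Bruck-style allgather for p ranks in ceil(log2 p) exchange rounds."""
    p = len(blocks)
    data = [[b] for b in blocks]             # data[r] = blocks rank r currently holds
    step = 1
    while step < p:
        nxt = []
        for r in range(p):
            # Rank r receives everything held by rank (r + step) % p and appends it.
            recv = data[(r + step) % p]
            nxt.append((data[r] + recv)[:p])  # final round may need only part
        data = nxt
        step *= 2
    # data[r] is ordered r, r+1, ..., r+p-1 (mod p); rotate into canonical order.
    return [[row[(i - r) % p] for i in range(p)] for r, row in enumerate(data)]

blocks = [f"blk{r}" for r in range(6)]       # p = 6: a non-power-of-two rank count
gathered = bruck_allgather(blocks)
print(all(row == blocks for row in gathered))  # True
```

Six ranks complete in three rounds rather than the five a naive ring allgather would need, which is where the claimed communication-efficiency gains originate.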
Software Layer:
The MindSpore framework provides model training and fine-tuning tools.
The CANN software stack optimizes operator scheduling, improving inference efficiency by 30%. CANN supports ACL interfaces, allowing developers to customize high-performance operators to meet specific industry needs.
Distributed Computing:
Supports multi-card parallelism (8/16/32 cards), achieving efficient communication through the HCCL library.
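Ring allreduce, the communication pattern collective libraries such as HCCL implement for multi-card gradient synchronization, can be simulated in pure NumPy (a schematic model, not HCCL’s actual scheduling or topology): a reduce-scatter phase followed by an allgather phase, each of n-1 steps in which every card exchanges only 1/n of the data.

```python
import numpy as np

def ring_allreduce(tensors):
    """Simulate ring allreduce over n 'cards': reduce-scatter, then allgather."""
    n = len(tensors)
    shards = [np.array_split(t.astype(np.float64), n) for t in tensors]
    # Reduce-scatter: after n-1 steps, card r holds the fully summed shard (r+1) % n.
    for s in range(n - 1):
        sent = [shards[r][(r - s) % n].copy() for r in range(n)]  # simultaneous sends
        for r in range(n):
            shards[r][(r - 1 - s) % n] += sent[(r - 1) % n]
    # Allgather: circulate each completed shard once around the ring.
    for s in range(n - 1):
        sent = [shards[r][(r + 1 - s) % n].copy() for r in range(n)]
        for r in range(n):
            shards[r][(r - s) % n] = sent[(r - 1) % n]
    return [np.concatenate(sh) for sh in shards]

rng = np.random.default_rng(0)
grads = [rng.standard_normal(12) for _ in range(8)]   # 8 cards, one gradient each
result = ring_allreduce(grads)
expected = np.sum(grads, axis=0)
print(all(np.allclose(r, expected) for r in result))  # True
```

Because per-card traffic stays near 2·(n-1)/n of the tensor size regardless of card count, this pattern scales well from 8 to 32 cards and beyond.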
Product Forms of the Ascend DeepSeek Integrated Machine
The Ascend DeepSeek integrated machine is divided into two main product lines:
Training and Inference Integrated Machine (FusionCube A3000 DS version):
Supports training and inference of DeepSeek V3 (671B parameters) and the entire R1 model series.
FusionCube supports modular expansion, scalable from a single machine with 8 cards to a cluster of 1024 cards, with training efficiency increasing linearly with scale.
Targeted at customers needing customized models, such as financial risk control and medical research.
Inference Integrated Machine (Atlas Series):
Built-in DeepSeek-R1 models of different scales (32B, 70B, 671B).
The Atlas 300I Pro inference card has a single card power consumption of only 150W, supporting real-time analysis of 80 channels of 1080p video.
Focused on efficient inference, suitable for edge and cloud deployment.
Specifications, Performance, and Configuration of the Ascend DeepSeek Integrated Machine
Specifications:
Single Card: 24GB LPDDR4X memory, bandwidth 204.8 GB/s.
Single card FP16 computing power comparison: 910C (320 TFLOPS) vs H100 (1410 TFLOPS), but the energy efficiency ratio reaches 1.8:1.
Cluster: 8 cards (entry-level), 32 cards (high-end).
Cluster scalability: with a 32-card configuration, computing power reaches 8960 TOPS (INT8), with power consumption of only 8kW.
Performance:
Inference: 671B model at 500 tokens per second, latency 6ms.
Training: pre-training on 14.8 trillion tokens, with efficiency close to 90% of the H100.
Configuration:
Supports domestic CPUs such as Kunpeng and Haiguang, with strong compatibility.
Customization of the Ascend DeepSeek Integrated Machine
Customization is a major highlight of the Ascend DeepSeek integrated machine: hardware configurations can be flexibly adjusted and models optimized at the software level to precisely fit the needs of different industries and enterprises. This flexibility lowers the barrier to adoption while significantly improving deployment efficiency and cost-effectiveness. The following analyzes customization from three aspects: hardware, software, and case studies.
Hardware Customization: Flexible Configuration to Meet Diverse Needs
The hardware design of the Ascend DeepSeek integrated machine adopts a modular concept, allowing users to freely adjust the number of cards, storage capacity, and network bandwidth according to computing power needs and budget. This “building block” style customization enables it to serve both small enterprises and support ultra-large-scale intelligent computing centers.
Software Customization: Model Distillation and Industry Fine-Tuning
The Ascend DeepSeek integrated machine provides deep customization at the software level, including lightweight model distillation and industry-specific fine-tuning versions. This capability allows enterprises to quickly build dedicated AI tools based on existing frameworks without starting from scratch to train large models.
Customization Case: China Telecom’s “Xirang Intelligent Computing Integrated Machine”
The “Xirang Intelligent Computing Integrated Machine” customized by China Telecom based on the Ascend DeepSeek integrated machine is a typical success case. This product is optimized for 5G edge computing scenarios, integrating Ascend computing power and DeepSeek models, supporting low-latency inference and real-time data processing.
Source: Comprehensive online compilation