Huawei has made a significant breakthrough in the training of large AI models: its Ascend NPUs have successfully trained a near-trillion-parameter model, marking a leap for domestic computing platforms into the world-leading ranks of AI large-model training.

Previously, training trillion-parameter models faced numerous challenges, such as difficulty in load balancing, high communication overhead, and low training efficiency. The Huawei Pangu team (including Noah's Ark Lab, Huawei Cloud, and other units) tackled these issues on the Ascend domestic computing platform, completing long-term, stable training of a 718-billion-parameter (718B) MoE model on a cluster of more than 6,000 Ascend NPUs. A series of system-level optimization breakthroughs delivered substantial gains in training efficiency and supports the development of industry-leading models.
Training MoE models with ultra-large parameter counts faces four major challenges. First is architecture parameter optimization: the optimal configuration must be found among a vast number of parameter combinations, and the large-scale MoE architecture must be designed to suit the Ascend NPU so that computing resources are used efficiently. Second is dynamic load balancing: the routing mechanism must allocate tokens intelligently to avoid uneven use of expert resources, which would otherwise drag down training efficiency through the "barrel effect" and could even derail model convergence. Third is the distributed-communication bottleneck: at near-trillion-parameter scale, the flow of tokens between computing nodes generates heavy communication overhead, and this "communication wall" limits training efficiency. Finally, there is the complexity of hardware adaptation: the MoE algorithm must be co-designed with dedicated AI accelerators such as the Ascend NPU, requiring full-stack optimization that integrates algorithm design, software frameworks, and hardware characteristics to fully unleash the hardware's computational potential.
To address these issues, Huawei approached the solution from three angles: model architecture, MoE training analysis, and system optimization. On model architecture, the team optimized both the choice of MoE structure and its affinity with the Ascend hardware. Preliminary experiments led them to a paradigm of fine-grained experts plus shared experts, weighing multiple factors during model selection. For computation and memory affinity, they enlarged the hidden size while reducing the number of activated parameters, raising the model's arithmetic intensity, compute utilization, and inference throughput. For multi-dimensional parallel affinity, they set the number of experts to a power of two, enabling a TP8×EP4 hybrid parallel scheme, used TP-extend-EP to avoid a decline in operator efficiency, and applied grouped AllToAll communication to reduce EP communication overhead. For DaVinci architecture affinity, they aligned tensor dimensions to 256 to match the 16×16 matrix compute units and unlock the Ascend NPU's compute power. For pipeline-scheduling affinity, they combined PP, VPP, and placeholder (empty) layers to balance load between PP and VPP stages and reduce idle compute.
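The fine-grained-plus-shared-experts paradigm can be illustrated with a toy PyTorch layer: every token always passes through a shared expert, and a router adds the weighted contributions of its top-k fine-grained experts. The dimensions, expert count, and top-k below are made-up illustrative values, and the dense dispatch is for readability only; this is a sketch of the general paradigm, not the Pangu implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FineGrainedMoE(nn.Module):
    """Toy MoE layer: one always-active shared expert plus top-k routed
    fine-grained experts. All sizes are illustrative, not Pangu's."""

    def __init__(self, hidden=512, expert_hidden=256, num_experts=16, top_k=2):
        super().__init__()
        make_ffn = lambda: nn.Sequential(nn.Linear(hidden, expert_hidden),
                                         nn.GELU(), nn.Linear(expert_hidden, hidden))
        self.shared = make_ffn()                              # shared expert
        self.experts = nn.ModuleList([make_ffn() for _ in range(num_experts)])
        self.router = nn.Linear(hidden, num_experts)
        self.top_k = top_k

    def forward(self, x):                                     # x: [tokens, hidden]
        scores = F.softmax(self.router(x), dim=-1)            # [tokens, E]
        weights, idx = scores.topk(self.top_k, dim=-1)        # top-k routing
        # Dense gate: zero everywhere except each token's selected experts.
        gate = torch.zeros_like(scores).scatter(-1, idx, weights)
        # Toy dense dispatch: run every expert on every token (a real MoE
        # sends each token only to its selected experts).
        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)
        routed = (gate.unsqueeze(-1) * expert_outs).sum(dim=1)
        # Output is the weighted sum of shared and routed contributions.
        return self.shared(x) + routed

layer = FineGrainedMoE()
print(layer(torch.randn(8, 512)).shape)   # torch.Size([8, 512])
```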
For model structure simulation, the team narrowed the range of candidate model configurations according to hardware-adaptation constraints and built a dedicated modeling and simulation tool. The tool decomposes the model structure, execution strategy, and hardware system into fine-grained parameters, simulating computation, data transfer, and memory access at the operator, block, and layer levels, and predicts end-to-end model performance with better than 85% accuracy. Using this tool, the team swept every parameter combination that satisfied the hardware-adaptation requirements, evaluated its throughput, and identified the best-performing model structures.
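The simulation tool itself is not public, but its operator/block/layer decomposition is in the spirit of a roofline-style estimate: each operator's time is bounded by the slowest of its compute, memory traffic, and communication, and block and layer times are aggregated from there. The sketch below is a minimal illustration; the hardware numbers are placeholders, not Ascend specifications.

```python
from dataclasses import dataclass

@dataclass
class Hardware:
    peak_tflops: float       # placeholder peak compute of one device
    mem_bw_gbs: float        # placeholder memory bandwidth (GB/s)
    link_bw_gbs: float       # placeholder interconnect bandwidth (GB/s)

@dataclass
class Op:
    flops: float             # floating-point operations
    bytes_moved: float       # bytes read/written from device memory
    comm_bytes: float = 0.0  # bytes exchanged with other devices

def op_time(op: Op, hw: Hardware) -> float:
    """Roofline-style estimate: the operator is bound by the slowest of
    compute, memory traffic, and communication."""
    compute_s = op.flops / (hw.peak_tflops * 1e12)
    memory_s = op.bytes_moved / (hw.mem_bw_gbs * 1e9)
    comm_s = op.comm_bytes / (hw.link_bw_gbs * 1e9)
    return max(compute_s, memory_s, comm_s)

def layer_time(ops, hw):
    # A block or layer is the sum of its operators; a full-model estimate
    # repeats layers and layers pipeline scheduling on top.
    return sum(op_time(op, hw) for op in ops)

hw = Hardware(peak_tflops=300, mem_bw_gbs=1000, link_bw_gbs=200)  # made-up numbers
attn = Op(flops=4e12, bytes_moved=2e10)
moe_ffn = Op(flops=6e12, bytes_moved=3e10, comm_bytes=1e9)
print(f"estimated layer time: {layer_time([attn, moe_ffn], hw) * 1e3:.2f} ms")
```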
In MoE training analysis, the research community has proposed various auxiliary loss functions to counter uneven load distribution. The Huawei team designed a new EP-group load-balancing loss that does not demand strict balance in local token allocation, avoiding "overcorrection," and that consumes less communication bandwidth, saving communication cost while keeping the constraint moderate. To tackle the "barrel effect" caused by uneven expert loads, the team compared candidate solutions and adopted a dropless scheme for training Pangu Ultra MoE, optimizing along four directions: better parallelization strategies, more efficient data transfer, improved memory usage, and more uniform task allocation. On the cluster of more than 6,000 Ascend NPUs, model FLOPs utilization (MFU) reached 30.0%, a relative improvement of 58.7%. The final parallel scheme combined 16-way pipeline parallelism, 8-way tensor parallelism, 4-way expert parallelism, 2-way virtual pipeline parallelism, and 48-way data parallelism. For expert parallelism they used the TP-extend-EP strategy, spreading the 256 experts across 32 expert-parallel groups. The virtual-pipeline strategy proved particularly effective, cutting the idle-compute rate from 18.98% to 10.49%, and load overflow caused by uneven token allocation was kept within 5%.
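The article does not give the exact form of the EP-group load-balancing loss, but its spirit, balancing load at the granularity of expert-parallel groups rather than individual experts, can be sketched as follows. The group layout (256 experts in 32 groups of 8) follows the parallel scheme above; the loss form itself is an assumption modeled on standard auxiliary balance losses, not Huawei's published formula.

```python
import torch

def ep_group_balance_loss(router_probs, expert_index, num_experts, ep_group_size):
    """Hypothetical group-level auxiliary balance loss.

    router_probs : [tokens, num_experts] softmax router outputs
    expert_index : [tokens] chosen (top-1) expert per token
    Experts are assumed to be partitioned into contiguous EP groups of size
    `ep_group_size`; the loss asks only that *groups* receive similar load,
    not that every expert does, which relaxes the local constraint.
    """
    num_groups = num_experts // ep_group_size
    group_of_expert = torch.arange(num_experts) // ep_group_size

    # Fraction of tokens dispatched to each group.
    token_group = group_of_expert[expert_index]
    dispatch_frac = torch.bincount(token_group, minlength=num_groups).float()
    dispatch_frac = dispatch_frac / expert_index.numel()

    # Mean router probability mass assigned to each group.
    prob_per_group = router_probs.reshape(-1, num_groups, ep_group_size).sum(-1).mean(0)

    # Same form as the classic auxiliary loss, but at group granularity.
    return num_groups * torch.sum(dispatch_frac * prob_per_group)

probs = torch.softmax(torch.randn(1024, 256), dim=-1)
top1 = probs.argmax(dim=-1)
print(ep_group_balance_loss(probs, top1, num_experts=256, ep_group_size=8))
```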
To resolve the communication bottleneck as parallelism scales, the team designed a hierarchical EP communication scheme and an adaptive pipeline-overlap mechanism. Hierarchical EP communication first uses cross-machine AllGather to synchronize the tokens needed within each machine, then sorts tokens locally and redistributes them with intra-machine AllToAll, reducing inter-machine communication volume. The adaptive forward-backward overlap strategy exploits the independence of inter-machine and intra-machine communication links so that their transfers hide one another, alleviates host-bound issues through careful operator scheduling, and splits the experts' backward pass into separate weight-gradient (dw) and input-gradient (dx) computations for finer-grained overlap.
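One general reason hierarchical schemes shrink inter-machine traffic is that a token routed to several experts hosted on the same remote machine needs to cross the slow inter-machine link only once, with the per-expert fan-out handled over fast intra-machine links. The Monte-Carlo sketch below quantifies that effect under assumed parameters (4 machines of 8 NPUs each hosting the 256 experts, and an assumed top-8 routing); it illustrates the general principle, not Huawei's exact protocol.

```python
import random

def cross_machine_sends(num_machines, experts_per_machine, top_k, trials=20000):
    """Estimate how many token copies must traverse the network, comparing
    per-expert delivery (flat AllToAll) with per-machine delivery
    (hierarchical: cross the network once per destination machine, then
    redistribute to experts over intra-machine links). For simplicity this
    ignores the case where a destination sits on the token's own machine."""
    num_experts = num_machines * experts_per_machine
    flat, hierarchical = 0, 0
    for _ in range(trials):
        experts = random.sample(range(num_experts), top_k)       # routed experts
        machines = {e // experts_per_machine for e in experts}   # distinct targets
        flat += len(experts)          # one network send per destination expert
        hierarchical += len(machines) # one network send per destination machine
    return flat / trials, hierarchical / trials

flat, hier = cross_machine_sends(num_machines=4, experts_per_machine=64, top_k=8)
print(f"token copies over the network: flat={flat:.2f}, hierarchical={hier:.2f}")
```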
For memory optimization, the team applied fine-grained recomputation to selected modules, keeping the extra computation small, and used tensor swapping to make efficient use of NPU memory. They are also researching further memory-saving techniques and plan to combine the various strategies into the mix best suited to each device configuration.
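Tensor swapping of this kind can be prototyped with PyTorch's saved-tensor hooks, which move activations saved for backward out to host memory and copy them back on demand (PyTorch also ships a ready-made torch.autograd.graph.save_on_cpu context with pinned-memory support). The sketch below shows the generic pattern, not the team's Ascend implementation.

```python
import torch
from torch import nn
from torch.autograd.graph import saved_tensors_hooks

def pack_to_host(t):
    # After forward, park the saved activation in host memory and remember
    # its original device, freeing accelerator memory in between.
    return (t.device, t.to("cpu"))

def unpack_from_host(packed):
    device, host_t = packed
    # Copy the activation back just before backward needs it.
    return host_t.to(device)

model = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256))
x = torch.randn(32, 256, requires_grad=True)

# Every tensor saved for backward inside this context is swapped to the host.
with saved_tensors_hooks(pack_to_host, unpack_from_host):
    loss = model(x).pow(2).mean()
loss.backward()
print(x.grad.shape)
```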
To further raise training efficiency, the team designed a dynamic device-level load-balancing mechanism: a planner observes expert workloads, predicts upcoming load, and uses a greedy algorithm to plan expert re-placement, while an executor periodically migrates expert parameters and optimizer states, lifting the model's MFU by 10%.
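The greedy re-placement step might look like the following: given predicted per-expert loads, hand the heaviest unplaced expert to the least-loaded device that still has a free expert slot. The prediction model and the executor's parameter migration are omitted, and the function is a generic sketch rather than Huawei's algorithm.

```python
import heapq
import random

def greedy_expert_placement(predicted_load, num_devices, experts_per_device):
    """Place experts on devices so predicted per-device load stays even:
    heaviest expert first, onto the least-loaded device with a free slot."""
    assert len(predicted_load) <= num_devices * experts_per_device
    heap = [(0.0, dev, experts_per_device) for dev in range(num_devices)]
    heapq.heapify(heap)                          # ordered by current load
    placement = {}
    for expert in sorted(range(len(predicted_load)),
                         key=lambda e: -predicted_load[e]):
        load, dev, free = heapq.heappop(heap)    # least-loaded device so far
        placement[expert] = dev
        if free > 1:                             # full devices leave the pool
            heapq.heappush(heap, (load + predicted_load[expert], dev, free - 1))
    return placement

random.seed(0)
loads = [random.random() for _ in range(256)]    # predicted tokens per expert
plan = greedy_expert_placement(loads, num_devices=32, experts_per_device=8)
per_device = [0.0] * 32
for expert, dev in plan.items():
    per_device[dev] += loads[expert]
print(f"device load spread: max={max(per_device):.2f}, min={min(per_device):.2f}")
```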
In addition, the team implemented Ascend-specific host-side optimizations, computation offloading, data sharing, and fused operators. For operator dispatch, they reduced the number of operators requiring frequent synchronization and applied fine-grained CPU binding. For computation offloading and data sharing, computations ill-suited to the NPU are assigned to the CPU during data loading, and data-sharing techniques speed up both computation and data transfer. For fused operators, they brought the FlashAttention and RMSNorm fused operators from the Pangu dense models into the MoE model and added GMMAdd, Permute, and Unpermute fused operators.
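Fine-grained CPU binding can be illustrated with the standard Linux sched_setaffinity call, pinning a host process (for example, a dispatch process or dataloader worker) to a fixed set of cores so that host-side scheduling jitter does not stall operator dispatch. The core ranges and process roles below are assumptions for illustration, not Huawei's configuration.

```python
import os

def bind_to_cores(core_ids):
    """Pin the calling process to a fixed set of CPU cores (Linux only)."""
    os.sched_setaffinity(0, set(core_ids))   # pid 0 = current process
    return os.sched_getaffinity(0)

# Hypothetical split: reserve cores 0-3 for the operator-dispatch process;
# dataloader workers would each get their own small core set elsewhere.
if hasattr(os, "sched_setaffinity"):
    print("dispatch process bound to cores:", sorted(bind_to_cores(range(4))))
```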
On the data side, the team enforced strict quality control when building the training corpus, emphasizing diversity, complexity, and comprehensive coverage. They introduced special markers for long chain-of-thought samples and, in the later stages of training, adopted instruction fine-tuning spanning a wide range of domains, with a 3:1 ratio of reasoning to non-reasoning samples. Experimental results show that the Pangu Ultra MoE chat version is highly competitive across domains, performing on par with DeepSeek-R1 on most benchmarks and excelling in demanding tests of general understanding, mathematical reasoning, and code generation.
The team also analyzed expert specialization in Pangu Ultra MoE and found that, for different tasks, tokens at the same network layer are preferentially routed to different experts, indicating strong task-specific specialization. This confirms that Pangu Ultra MoE has developed clear expert differentiation, which strengthens the model's expressive capability. The output of each MoE layer is a weighted sum of contributions from the shared experts and the routed experts, with the routed experts contributing at a strength comparable to the shared experts across all layers; this balanced collaboration improves the model's overall representational power. An expert co-activation analysis further showed that, with few exceptions, there was no significant co-activation among experts in the three layers examined, indicating relatively low redundancy among Pangu Ultra MoE's experts.
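A co-activation analysis of this kind can be reproduced from routing records: for every pair of experts, count how often they fire for the same token and normalize by how often each expert fires at all. The sketch below uses random stand-in routing data purely to show the computation; the real analysis would use the trained router's outputs.

```python
import torch

def coactivation_matrix(topk_indices, num_experts):
    """topk_indices: [tokens, k] experts selected for each token.
    Returns an [E, E] matrix whose (i, j) entry is the fraction of tokens
    activating expert i that also activate expert j (diagonal zeroed)."""
    tokens = topk_indices.shape[0]
    onehot = torch.zeros(tokens, num_experts)
    onehot.scatter_(1, topk_indices, 1.0)        # token-by-expert activation map
    joint = onehot.t() @ onehot                  # pairwise co-occurrence counts
    activations = onehot.sum(0).clamp(min=1.0)   # how often each expert fires
    coact = joint / activations.unsqueeze(1)     # normalize row i by expert i's count
    coact.fill_diagonal_(0)
    return coact

# Stand-in routing data: 4096 tokens, top-8 of 256 experts, chosen at random.
fake_routes = torch.stack([torch.randperm(256)[:8] for _ in range(4096)])
coact = coactivation_matrix(fake_routes, num_experts=256)
print("max pairwise co-activation rate:", coact.max().item())
```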