How Huawei Tamed a Trillion-Parameter Sparse Model? Key Technical Breakthroughs in MoE Training on Ascend NPU

In the race to build large models, sparse models represented by Mixture of Experts (MoE) are becoming favorites in the AI field thanks to their outstanding efficiency. Recently, Huawei released a technical report titled “Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs,” which details how a sparse large language model with nearly a trillion parameters was trained on Ascend NPUs, showcasing Huawei’s strength in AI infrastructure. This article walks through the report and examines how Huawei achieves efficient training of sparse large models through careful model design and system-level optimization.

Background: The Rise and Challenges of MoE Models

How Do Sparse Models Change the AI Competitive Landscape?

In the past two years, sparse large language models represented by MoE (Mixture of Experts) have gradually become mainstream. From Google’s Switch Transformers and Mistral AI’s Mixtral to DeepSeek R1, sparse models have learned efficiently from multi-trillion-token datasets while sharply cutting computational cost through their distinctive “on-demand activation” mechanism, maintaining or even improving performance.

The core advantage of sparse models is that only a small portion of the parameters are activated for each input token. If a traditional dense model is a team where every expert works on every problem, the MoE model is a team that divides labor on demand: only the most suitable experts are consulted for each question, saving resources while preserving expertise. This design makes MoE models far more efficient at inference than dense models of comparable size, since only a fraction of the parameters are active, greatly reducing computational cost.
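
A minimal sketch of this on-demand activation idea, using generic top-k gating (the function names and exact gating details are illustrative assumptions, not Pangu Ultra MoE’s specific implementation):

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden_states, router_weight, k=8):
    """Select the top-k experts for each token; only those experts will run.

    hidden_states: [num_tokens, hidden_dim]
    router_weight: [hidden_dim, num_experts]
    """
    logits = hidden_states @ router_weight            # [num_tokens, num_experts]
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_ids = probs.topk(k, dim=-1)      # k experts kept per token
    # Renormalize so each token's k gating weights sum to 1
    topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
    return topk_probs, topk_ids
```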

However, training such a large model is no easy task. Huawei’s report points out that training MoE models faces three major challenges:

  1. Enormous Parameter Scale: Nearly a trillion parameters impose extremely high requirements on hardware
  2. Dynamic Sparse Structure: Different inputs activate different parameters, leading to unstable computational loads
  3. Low Utilization of Computational Resources: Traditional training methods struggle to fully utilize NPU computing power

In response, the Huawei team proposed a complete solution spanning model architecture design and system optimization, achieving efficient training of Pangu Ultra MoE, a 718-billion-parameter sparse model, on Ascend NPUs.

Model Design: Finding the Optimal Architecture for Ascend NPU

Model Architecture Search: A Hardware-Centric Model Design Philosophy

Traditional model design often follows the approach of “design the model first, then adapt it to the hardware,” but the Huawei team adopted the opposite strategy: hardware-centric model design. This method considers not only the model’s theoretical performance but also how it actually runs on specific hardware, avoiding the awkward situation of a model that looks good on paper but runs slowly in practice.

Exploratory Experiments on MoE Core Structure

The Huawei team first constructed a small MoE model with approximately 20 billion parameters as a test foundation, quickly validating the effects of various design choices through this “mini version” model. They focused on two key dimensions:

  1. Expert Granularity: Is it beneficial to increase the number of experts within a fixed computational budget? The results showed that increasing from 64 to 256 experts significantly reduced training loss, but further increasing to 512 experts yielded diminishing returns. This indicates that expert diversity can indeed enhance model performance, but there is a balance point.
  2. Shared Expert Architecture: Should there be a globally shared expert? Experiments showed that models with a shared expert had lower training loss than variants without one, confirming the value of shared experts. Shared experts act as the model’s “generalists,” handling general tasks that do not require specialized knowledge, allowing the routed experts to focus on specific areas (a sketch of this combination follows the list).
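
A hedged sketch of how a shared “generalist” expert can be combined with routed “specialist” experts in one forward pass; the module structure, activation function, and gating details here are illustrative assumptions rather than the report’s exact design:

```python
import torch
import torch.nn as nn

class SharedPlusRoutedMoE(nn.Module):
    """Every token passes through one always-active shared expert plus its top-k routed experts."""

    def __init__(self, hidden_dim, ffn_dim, num_experts=256, k=8):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.Linear(hidden_dim, ffn_dim), nn.GELU(),
                                 nn.Linear(ffn_dim, hidden_dim))
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        self.experts = nn.ModuleList(ffn() for _ in range(num_experts))
        self.shared_expert = ffn()          # the "generalist" path, always active
        self.k = k

    def forward(self, x):                   # x: [num_tokens, hidden_dim]
        probs = self.router(x).softmax(dim=-1)
        weights, ids = probs.topk(self.k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = self.shared_expert(x)         # shared contribution for every token
        for e, expert in enumerate(self.experts):
            hit = ids == e                  # [num_tokens, k] slots routed to expert e
            token_mask = hit.any(dim=-1)
            if token_mask.any():
                gate = (weights * hit).sum(dim=-1, keepdim=True)[token_mask]
                out = out.index_add(0, token_mask.nonzero(as_tuple=True)[0],
                                    gate * expert(x[token_mask]))
        return out
```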

Hardware-Aware Model Simulation and Optimization

After settling the basic MoE structure, how do you find the configuration best suited to the Ascend NPU? Trying every possible combination directly is clearly impractical: training a large model can take weeks and cost millions.

Huawei’s innovation lies in developing a high-precision performance simulation framework that quickly screens for optimal architectures by analyzing the theoretical performance of different configurations on the NPU (a simplified sketch follows the list below). The simulation accounts not only for basic factors such as computation and communication volume but also for hardware characteristics of the Ascend NPU, such as:

  • Cube computing units and vector computing capabilities
  • Memory access performance and bandwidth
  • Network communication characteristics
  • The impact of various parallel strategies
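
A heavily simplified sketch of what a simulation-driven architecture search can look like; the cost model and search space below are toy placeholders (the report’s simulator models the Ascend cube/vector units, memory bandwidth, and interconnect in far more detail):

```python
from itertools import product

def simulated_step_time(cfg):
    """Toy analytical cost model returning a relative per-step time.
    A real simulator would model the hardware characteristics listed above."""
    compute = cfg["layers"] * cfg["hidden"] ** 2 / (cfg["tp"] * cfg["pp"])
    comm = cfg["hidden"] * (cfg["tp"] - 1) * 50 + cfg["ep"] * 1e5
    return compute + comm

search_space = {
    "layers": [48, 61, 64],
    "hidden": [6144, 7680, 8192],
    "tp": [4, 8],
    "pp": [8, 16],
    "ep": [2, 4, 8],
}

# Exhaustively score every candidate configuration with the cost model
candidates = (dict(zip(search_space, values)) for values in product(*search_space.values()))
best = min(candidates, key=simulated_step_time)
print("fastest simulated configuration:", best)
```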

Interestingly, this simulator has an accuracy rate of up to 90%, providing reliable guidance for model design. By simulating and analyzing nearly 10,000 different configurations, Huawei ultimately determined the key parameters for Pangu Ultra MoE:

  • 61-layer Transformer structure
  • Hidden size of 7680
  • 256 routed experts (8 activated per token)
  • Key parallel-strategy combination (TP=8, PP=16, EP=4, etc.)

These parameters were not chosen arbitrarily but were carefully calculated to ensure optimal performance on the Ascend NPU. For example, 7680 rather than 8192 was chosen as the hidden size because the former better utilizes the Ascend NPU’s matrix multiplication units (which operate on 16×16 blocks), while 256 experts (2^8) makes expert parallelism easy to partition efficiently.

MoE Training Strategy Optimization: Enhancing Stability of Sparse Models

Expert Load Balancing: Finding the Equilibrium Point

One of the biggest challenges in MoE model training is expert load imbalance. Imagine a group of students asking questions of a panel of professors: if all questions are concentrated on one or two professors, those professors become overloaded while the others suffer “skill degradation” from lack of practice. MoE training faces a similar issue; without an appropriate mechanism, most tokens may choose the same few “star experts,” leading to:

  1. Training Instability: A few experts become overly specialized while most experts do not receive adequate training
  2. Low Computational Efficiency: Load imbalance leads to some NPUs being idle while others are overloaded

The Huawei team systematically evaluated several strategies to address this:

Auxiliary Loss Function Design

Traditional auxiliary loss functions (such as sequence-level and micro-batch-level) force each expert to handle a similar number of tokens, but this may be too strict for large models. Huawei proposed an EP-group-level auxiliary loss, which computes load balance within each expert-parallel group, finding a better compromise between strict balance and training effectiveness (a sketch follows the list below).

Specifically, they found:

  • Sequence-level auxiliary loss has the strongest constraint but the highest training loss
  • Data-parallel group-level auxiliary loss has the weakest constraint but the lowest training loss
  • EP group-level auxiliary loss is in between, balancing performance and training efficiency
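
As a reference point, here is a hedged sketch of a Switch-Transformer-style balancing loss, where the balancing “level” is simply a question of which tokens you pool before computing the statistics (the report’s exact loss formulation may differ):

```python
import torch

def load_balance_aux_loss(router_probs, expert_ids, num_experts):
    """Switch-Transformer-style balancing loss over a pool of tokens.

    router_probs: [tokens, num_experts] softmax outputs of the router
    expert_ids:   [tokens, k] experts selected for each token
    The "level" of the loss (sequence / micro-batch / EP group / DP group)
    is determined by which tokens are pooled before calling this function.
    """
    # f_e: fraction of routed token-slots assigned to expert e
    counts = torch.bincount(expert_ids.flatten(), minlength=num_experts).float()
    dispatch_frac = counts / counts.sum()
    # P_e: mean router probability mass placed on expert e
    prob_frac = router_probs.mean(dim=0)
    return num_experts * torch.sum(dispatch_frac * prob_frac)
```

For the EP-group-level variant, the token statistics would be aggregated across the ranks of one expert-parallel group (e.g., via a collective reduction) before computing the two fractions; that communication step is omitted here.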

Drop-and-Pad vs Dropless

For overloaded experts, a common strategy is to simply drop tokens that exceed capacity (Drop-and-Pad). Huawei’s experiments found that as model size increases, the negative impact of this strategy becomes more pronounced:

  • 20-billion-parameter model: the token drop rate under the Drop-and-Pad strategy is about 6%
  • 718-billion-parameter model: the drop rate rises to 8%, with a more significant performance loss

In pursuit of accuracy, the final Pangu Ultra MoE adopted a no-drop (Dropless) strategy, sacrificing some training speed to preserve model quality. To compensate for the resulting efficiency loss, the team implemented a series of deep optimizations at the system level.
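
A hedged sketch contrasting the two policies; the capacity formula and function names are illustrative, not the report’s exact recipe:

```python
import torch
import torch.nn.functional as F

def drop_and_pad_keep_mask(expert_ids, num_experts, capacity_factor=1.0):
    """Each expert accepts at most `capacity` tokens; later arrivals are dropped.

    expert_ids: [num_slots] expert chosen for each routed token-slot.
    Returns a keep mask and the resulting drop rate.
    """
    num_slots = expert_ids.numel()
    capacity = int(capacity_factor * num_slots / num_experts)
    one_hot = F.one_hot(expert_ids, num_experts)          # [num_slots, num_experts]
    # 1-based position of each slot inside its expert's queue
    position = (one_hot.cumsum(dim=0) * one_hot).sum(dim=-1)
    keep = position <= capacity
    drop_rate = 1.0 - keep.float().mean().item()
    return keep, drop_rate

# A dropless dispatcher simply keeps every token (variable work per expert),
# which is what Pangu Ultra MoE chose, trading some speed for accuracy.
```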

Training System Optimization: Four Major Breakthroughs on Ascend Platform

Parallel Optimization: Five-Dimensional Parallel Strategy

To collaboratively train such a large model on over 6000 Ascend NPUs, the design of parallel strategies is crucial. The Huawei team comprehensively utilized five parallel techniques:

  1. Tensor Parallelism (TP): Splitting a single layer of the model across multiple devices
  2. Pipeline Parallelism (PP): Allocating different layers of the model to different devices
  3. Expert Parallelism (EP): Allocating different experts to different devices
  4. Data Parallelism (DP): Processing different batches of data on different devices
  5. Context Parallelism (CP): Processing different parts of sequences on different devices

The key innovation is the **Virtual Pipeline Parallelism (VPP)** design, which assigns two virtual pipeline stages to each device and interleaves their execution, significantly reducing the pipeline bubble rate from 18.98% to 10.49%. Particularly clever is the load-balancing treatment of the multi-token prediction (MTP) layer: since the MTP layer’s computational load is 2.5 times that of a standard MoE layer, the team spread its computational components across multiple stages, keeping the workload overflow within a 5% tolerance.
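
The reported bubble numbers are consistent with the standard interleaved-1F1B bubble formula, (p − 1) / (v·m + p − 1), if we assume p = 16 pipeline stages (matching PP=16 above) and roughly m = 64 micro-batches; the micro-batch count is our inference, not a figure from the report:

```python
def pipeline_bubble_ratio(p, m, v):
    """Bubble fraction of an interleaved 1F1B schedule with v virtual stages per device."""
    return (p - 1) / (v * m + p - 1)

p, m = 16, 64   # 16 pipeline stages (PP=16); 64 micro-batches is an assumed value
print(f"v=1: {pipeline_bubble_ratio(p, m, 1):.2%}")   # ~18.99%, close to the reported 18.98%
print(f"v=2: {pipeline_bubble_ratio(p, m, 2):.2%}")   # 10.49%, matching the reported value
```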

Communication Optimization: Breaking the Bottleneck of Inter-Device Interaction

When thousands of NPUs work together, inter-device communication becomes a key bottleneck limiting training speed. Huawei achieved two significant breakthroughs:

Hierarchical Expert Parallel All-to-All Communication

In traditional expert parallel implementations, inter-node communication uses the same mechanism as intra-node communication, leading to a large amount of inefficient cross-node data transfer. Huawei’s proposed hierarchical strategy divides communication into two stages:

  1. Inter-node AllGather Synchronization: NPUs with the same rank on different nodes first perform global data synchronization
  2. Intra-node All-to-All Redistribution: Each node selects only the tokens relevant to its local experts, then performs an optimized All-to-All exchange within the node

This design converts most cross-node communication into high-bandwidth intra-node communication, significantly reducing communication overhead.
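
A hedged sketch of this two-stage pattern using torch.distributed collectives; group construction, uneven split handling, and token bookkeeping are heavily simplified and do not reflect Huawei’s actual implementation:

```python
import torch
import torch.distributed as dist

def hierarchical_moe_dispatch(tokens, expert_ids, inter_node_group, intra_node_group,
                              local_expert_range):
    """Stage 1: AllGather across nodes (same local rank on each node).
    Stage 2: All-to-All inside the node for the experts hosted there."""
    # Stage 1: replicate this rank's tokens and routing decisions across nodes
    world = dist.get_world_size(inter_node_group)
    token_buf = [torch.empty_like(tokens) for _ in range(world)]
    id_buf = [torch.empty_like(expert_ids) for _ in range(world)]
    dist.all_gather(token_buf, tokens, group=inter_node_group)
    dist.all_gather(id_buf, expert_ids, group=inter_node_group)
    all_tokens, all_ids = torch.cat(token_buf), torch.cat(id_buf)

    # Keep only the tokens destined for experts hosted on this node
    lo, hi = local_expert_range
    mine = (all_ids >= lo) & (all_ids < hi)
    selected = all_tokens[mine]

    # Stage 2: redistribute the selected tokens among the NPUs of this node
    # (a real system uses uneven split sizes; even splits are assumed here)
    output = torch.empty_like(selected)
    dist.all_to_all_single(output, selected, group=intra_node_group)
    return output
```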

Adaptive Pipeline Overlapping Mechanism

To hide communication latency as much as possible, the Huawei team proposed an adaptive pipeline overlapping strategy (1F1B_overlap) based on VPP, cleverly exploiting the independence between micro-batches to overlap one micro-batch’s forward computation with another’s backward communication (and vice versa). Additionally, they achieved:

  • Hierarchical communication overlap: intra-EP and inter-EP communications overlap with each other
  • Host bottleneck alleviation: decoupling preprocessing from permutation to reduce delays caused by synchronization
  • Routed-expert backpropagation decoupling: allowing gradient computation to overlap more flexibly with communication

Ultimately, these optimizations achieved a 95% communication overlap rate, nearly eliminating delays caused by communication.

Memory Optimization: Breaking Through NPU Storage Limitations

In large model training, insufficient NPU memory is often the biggest bottleneck. The Huawei team proposed two innovative technologies to address this challenge:

Fine-Grained Re-computation

Traditional re-computation strategies usually target entire layers or modules, while Huawei implemented a more fine-grained selective re-computation, optimizing memory usage for specific operators. For example:

  • MLA QKV Re-computation: Releasing the activation memory held by query, key, and value projections
  • Permute Operation Re-computation: Reducing peak activation memory overhead
  • SwiGLU Activation Re-computation: Providing the best memory-time trade-off

This approach not only saves memory but also overlaps with communication, minimizing the additional time overhead caused by re-computation.
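
A minimal sketch of operator-level selective recomputation with torch.utils.checkpoint; wrapping SwiGLU (and, by the same pattern, the MLA QKV projections or the MoE permute step) follows the report’s idea, while the module itself is an illustrative assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

class RecomputedSwiGLUMLP(nn.Module):
    """SwiGLU activations are recomputed during backward instead of stored,
    trading a little extra compute for a large cut in activation memory."""

    def __init__(self, hidden_dim, ffn_dim):
        super().__init__()
        self.gate_up = nn.Linear(hidden_dim, 2 * ffn_dim)
        self.down = nn.Linear(ffn_dim, hidden_dim)

    def _swiglu(self, x):
        gate, up = self.gate_up(x).chunk(2, dim=-1)
        return F.silu(gate) * up

    def forward(self, x):
        # checkpoint() discards the intermediate activations of _swiglu and
        # reruns it during the backward pass to regenerate them.
        h = checkpoint(self._swiglu, x, use_reentrant=False)
        return self.down(h)
```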

Tensor Swapping

Huawei adopted an efficient host-device memory swapping mechanism, temporarily offloading activations to host memory during the forward pass and reloading them onto the device during the backward pass. In particular, swapping the routing probability values used in the token un-permute computation yields significant memory savings. Combined with a prefetch mechanism, this approach keeps the overhead of activation management low.
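
A hedged sketch of the offload/prefetch idea with pinned host memory and asynchronous copies; the real system schedules these transfers so that they overlap with NPU computation, which is omitted here:

```python
import torch

class ActivationSwapper:
    """Offload selected activations to pinned host memory during forward,
    then prefetch them back to the device before backward needs them."""

    def __init__(self):
        self.host_buffers = {}

    def offload(self, name, tensor):
        buf = torch.empty(tensor.shape, dtype=tensor.dtype,
                          device="cpu", pin_memory=True)
        buf.copy_(tensor, non_blocking=True)          # device -> pinned host, async
        self.host_buffers[name] = buf

    def prefetch(self, name, device):
        # Issued ahead of time so the host -> device copy overlaps with
        # earlier backward computation.
        return self.host_buffers.pop(name).to(device, non_blocking=True)
```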

Load Balancing Optimization: Making Sparsity an Advantage

To address the device-level imbalance caused by uneven expert loads, Huawei proposed a dynamic device-level load balancing mechanism that makes real-time adjustments through a planner and an executor:

Planner

Based on historical load distribution data, it uses a sliding window average method to predict future load distribution and finds a balanced solution through a lightweight greedy algorithm. This mechanism leverages the temporal locality of load distribution, reducing device-level load imbalance by 80%-90%.
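
A simplified sketch of the planner’s two steps: predict each expert’s load with a sliding-window average, then greedily assign the heaviest experts to the currently lightest device (the report’s actual algorithm and its constraints are more involved):

```python
from collections import deque

class ExpertPlacementPlanner:
    def __init__(self, num_experts, num_devices, window=8):
        self.history = [deque(maxlen=window) for _ in range(num_experts)]
        self.num_devices = num_devices

    def record(self, token_counts):
        """token_counts: tokens routed to each expert in the latest step."""
        for hist, count in zip(self.history, token_counts):
            hist.append(count)

    def plan(self):
        # Predict next-step load as the sliding-window average of past loads
        predicted = [sum(h) / max(len(h), 1) for h in self.history]
        # Greedy placement: heaviest expert goes to the currently lightest device
        placement = [[] for _ in range(self.num_devices)]
        device_load = [0.0] * self.num_devices
        for expert in sorted(range(len(predicted)), key=lambda e: -predicted[e]):
            target = min(range(self.num_devices), key=lambda d: device_load[d])
            placement[target].append(expert)
            device_load[target] += predicted[expert]
        return placement
```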

Executor

Before the next forward pass, the executor dynamically re-places expert parameters, exchanging expert weights and optimizer states through efficient All-to-All communication. To minimize communication overhead, expert reallocation for a given layer is triggered only when its load imbalance worsens.

Through this series of system optimizations, Huawei raised Pangu Ultra MoE’s Model FLOPs Utilization (MFU) on 6,000 Ascend NPUs from 18.9% to 30.0% and the training throughput from 0.61M to 1.46M tokens per second, an improvement of roughly 2.4×.

Model Performance and Expert Behavior Analysis

Evaluation of Pangu Ultra MoE’s Strength

Huawei’s Pangu Ultra MoE performed excellently in comprehensive evaluations, reaching levels comparable to DeepSeek R1, particularly in reasoning capability. It did especially well in the medical domain, achieving accuracies of 87.1% on MedQA and 80.8% on MedMCQA, surpassing DeepSeek R1.

In-Depth Analysis of MoE Expert Behavior

To understand the working mechanism of MoE models, the Huawei team conducted a systematic analysis of the expert behavior in Pangu Ultra MoE, discovering several interesting phenomena:

  1. Domain Specialization: Tokens from different tasks tend to activate different experts, confirming that experts do form specialized divisions of labor
  2. Depth-Dependent Specialization: Experts in deep layers (e.g., layer 60) are significantly more specialized than those in shallow layers (e.g., layer 3)
  3. Contribution of Routed Experts: Routed experts and shared experts contribute comparably to the model output, showing that the MoE structure genuinely enhances the model’s expressive capacity
  4. Co-activation of Experts: The degree of co-activation among experts is low, indicating minimal redundancy and a clear division of labor

These findings not only validate the effectiveness of the MoE architecture but also provide valuable references for future large model designs.

Conclusion and Insights

Huawei successfully trained the 718-billion-parameter Pangu Ultra MoE model on Ascend NPUs, showcasing China’s strong capabilities in AI infrastructure. The key to this breakthrough lies in integrating hardware characteristics with model design and in systematic optimization that addresses the core challenges of training sparse large models.

From this work, we can draw the following important insights:

  1. Hardware-Aware Model Design Will Become a Trend: Future model architectures will increasingly account for hardware characteristics to achieve hardware-software co-optimization
  2. System Optimization and Algorithm Innovation Are Equally Important: In the era of large models, engineering capability has become key to AI competitiveness
  3. Sparse Models Are an Important Direction for Large Model Development: Structured sparsity enables more efficient use of computational resources
  4. Autonomous and Controllable AI Infrastructure Is Crucial: Mastering full-stack capabilities from chips to algorithms is the foundation of AI competitiveness

With the development of sparse models like MoE, we are entering a new era of AI computational efficiency. Huawei’s work is not only a technical breakthrough but also an important milestone in China’s AI infrastructure capabilities. We look forward to seeing more innovative applications and technological advancements based on this achievement!🚀
