Huawei's Strategic Move Surpasses NVIDIA Despite Generational Chip Process Lag

The blow in the spring of 2019 was truly painful: the sanctions choked the supply chain. When the Ascend 910 had just launched, inventory was so low that we dared not sell it to internet companies and gave priority only to key industries. Outside commentary was cold, suggesting that Huawei’s chip path was doomed. At the time, I was worried too. Who would have thought that six years later we would witness today’s developments?

Whether the manufacturing process can catch up is only one dimension. Huawei has focused on parallel systems: organizing more chips to work together reliably, with minimal latency, maximal bandwidth, and precise scheduling. Once resources can be orchestrated at large scale, single-card performance no longer decides success or failure.

In March of this year, the Atlas 900 super node was launched, combining 384 Ascend 910C chips for 300 PFLOPS of training compute and competing directly with the GB200 NVL72 system. Overseas institutions stated frankly that while the chip process is a generation behind, the overall solution is more advanced. That statement is both painful and uplifting!
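To make the system-over-single-card argument concrete, here is a small arithmetic sketch using only the two figures quoted above (384 chips, 300 PFLOPS). The `efficiency` parameter and the function name `cluster_pflops` are hypothetical, introduced only for illustration; real scaling efficiency depends on the workload and is not given here.

```python
# Figures quoted in the text above.
CHIPS = 384
TOTAL_PFLOPS = 300.0

# Implied average compute per chip, assuming the total is a simple sum.
per_chip_pflops = TOTAL_PFLOPS / CHIPS


def cluster_pflops(chips: int, per_chip: float, efficiency: float = 1.0) -> float:
    """Aggregate compute of a cluster.

    `efficiency` is a hypothetical scaling factor in (0, 1] standing in for
    losses from communication and scheduling; it is an assumption, not a
    published number.
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return chips * per_chip * efficiency


if __name__ == "__main__":
    print(f"Implied per-chip compute: {per_chip_pflops:.3f} PFLOPS")
    print(f"Cluster at ideal scaling: {cluster_pflops(CHIPS, per_chip_pflops):.1f} PFLOPS")
```

The point of the sketch is that the system total is the product of chip count, per-chip compute, and how well the interconnect and scheduler keep the chips busy, so a weaker chip can be offset by a larger count only if the efficiency term holds up.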

In September, Huawei Connect raised the stakes further, with the Atlas 960 SuperPoD pushing the scale to 15,488 chips. Total compute grew, and interconnect bandwidth was put to full use. The goal is straightforward: large scale, stable efficiency, and uninterrupted training.

NVIDIA also aims to scale up. It experimented early on with the NVL256, but the short reach of copper cables and optical module failures led it to cut back to 72 cards. Huawei drew on its optical communication foundation, using low-loss, long-distance optical interconnects combined with Lingqu’s low-latency scheduling to unify multi-machine, multi-card resources.

The significance of Lingqu lies in controlling the standard. The halting of the CPU interconnect protocol was a stark lesson. Huawei has systematically built out the optical devices, optical modules, interconnect protocol, and interconnect chips, and has opened up the 2.0 specification so the industry chain can participate, removing the barriers of distance and scale. It was a decisive step!

On the ecosystem side, there were no shortcuts: Huawei integrated CANN and MindSpore directly, aligning its own compilation stack with its own framework, which is crucial for out-of-the-box usability. Major internet companies are already training models on super nodes, and the CloudMatrix 384 reportedly cuts overall cost by about 30% and shortens delivery cycles, addressing the pain points enterprises care about.

✅ Scale consistency: Thousands of chips maintain global scheduling and gradient synchronization, so that long training runs do not fall behind.

✅ Optical interconnect capability: Low link loss and long distance, flexible cross-cabinet wiring, and expansion not limited by copper cables.
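For readers unfamiliar with the gradient synchronization mentioned in the first point, the sketch below simulates a ring all-reduce, a common generic technique for averaging gradients across workers. It is an illustration only: nothing here describes how Lingqu, CANN, or MindSpore actually implement synchronization, and the function name `ring_allreduce_mean` is hypothetical.

```python
from typing import List


def ring_allreduce_mean(grads: List[List[float]]) -> List[List[float]]:
    """Simulate a ring all-reduce that averages one gradient vector per worker.

    Assumes the vector length is divisible by the number of workers.
    Returns the buffer each worker holds afterwards (all identical).
    """
    n = len(grads)
    length = len(grads[0])
    if length % n != 0:
        raise ValueError("vector length must be divisible by worker count")
    size = length // n
    bufs = [list(g) for g in grads]

    def chunk(c: int) -> slice:
        return slice(c * size, (c + 1) * size)

    # Phase 1, reduce-scatter: after n-1 steps, worker i holds the full sum
    # of chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing data first so all sends happen "simultaneously".
        sends = [(i, (i - step) % n, bufs[i][chunk((i - step) % n)]) for i in range(n)]
        for i, c, data in sends:
            dst = (i + 1) % n
            bufs[dst][chunk(c)] = [a + b for a, b in zip(bufs[dst][chunk(c)], data)]

    # Phase 2, all-gather: circulate the fully reduced chunks around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, bufs[i][chunk((i + 1 - step) % n)]) for i in range(n)]
        for i, c, data in sends:
            bufs[(i + 1) % n][chunk(c)] = data

    # Divide by the worker count to turn sums into means.
    return [[x / n for x in b] for b in bufs]


if __name__ == "__main__":
    workers = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    for rank, buf in enumerate(ring_allreduce_mean(workers)):
        print(rank, buf)
```

Each worker only ever talks to its ring neighbor, and every step waits on the slowest link, which is why interconnect latency and bandwidth matter so much as the chip count grows.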

Over the past six years, we have endured every hardship. The earlier predicament of not daring to sell is still fresh in our minds. Now customers are queuing for delivery, and the turnaround is tangible. The generational lag in chip process has not closed off the path; a more advanced system solution has turned the situation around. What comes next depends on production capacity and the ecosystem. Whoever can reliably run thousands of chips in coordination will have the decisive say.
