Overview
The official PyTorch plugin provided by Huawei, torch-npu, enables running PyTorch code on Huawei Ascend servers, facilitating AI development (training and inference) based on the open-source PyTorch ecosystem.
Huawei has its own machine learning framework, MindSpore, and similar frameworks exist at other companies, such as Baidu's PaddlePaddle. Essentially, these frameworks compete directly with PyTorch. However, a significant issue arises regarding the open-source ecosystem: in machine learning and deep learning today, the mainstream development frameworks are built on PyTorch, which gives it the most comprehensive ecosystem. If a company limits itself to its proprietary framework, it can hurt both its hardware adoption and its software ecosystem. Consequently, more hardware manufacturers, such as Huawei Ascend and Cambricon, are integrating PyTorch support on their hardware platforms.
Adaptation and Integration
PrivateUse1
PyTorch 2.1 introduced an important feature: PrivateUse1. PrivateUse1 is a device type in PyTorch. In PyTorch, the device type indicates the hardware environment where tensors and operations are executed, with CPU and GPU being the common device types. PrivateUse1 is a special device type reserved for custom or vendor-specific hardware backends; with it, developers can adapt PyTorch code to non-standard hardware, such as dedicated accelerators or experimental hardware platforms.
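For illustration, device types are visible at the Python level. By default the PrivateUse1 slot answers to the name "privateuseone" and has no kernels until a vendor registers a backend; a minimal sketch:

import torch

print(torch.device("cpu"))        # a standard in-tree device type
print(torch.device("cuda", 0))    # another in-tree device type
# The PrivateUse1 slot is addressable as "privateuseone" by default;
# allocating tensors on it fails until a backend registers kernels.
print(torch.device("privateuseone"))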
The source code of the torch-npu[1] plugin library provided by Ascend integrates the NPU device into the PyTorch backend through the PrivateUse1-specific interfaces:
import torch
import torch_npu

# Rename the PrivateUse1 backend so devices can be addressed as 'npu'
torch.utils.rename_privateuse1_backend("npu")
# Register torch_npu.npu as the torch.npu device module
torch._register_device_module('npu', torch_npu.npu)
unsupported_dtype = [torch.quint8, torch.quint4x2, torch.quint2x4, torch.qint32, torch.qint8]
# Generate helpers such as Tensor.npu() and Tensor.is_npu for the renamed backend
torch.utils.generate_methods_for_privateuse1_backend(
    for_tensor=True, for_module=True, for_storage=True,
    unsupported_dtype=unsupported_dtype,
)
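After this registration, the generated helpers behave like their CUDA counterparts; a sketch, assuming torch_npu is installed and an Ascend device is available:

import torch
import torch_npu

t = torch.ones(2, 2, device="npu")  # 'npu' now resolves to PrivateUse1
print(t.is_npu)                     # True: a generated property
print(t.device)                     # npu:0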
Of course, this code alone is not enough; PrivateUse1 is a special device type that PyTorch provides for third-party hardware vendors to integrate against. By utilizing the PrivateUse1 feature, operations originally targeting CUDA devices (such as torch.cuda calls) can be seamlessly replaced with torch.npu, allowing the hardware backend to be switched without modifying most native code. However, dispatching operators to the PrivateUse1 device requires an additional mechanism.
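A minimal sketch of such a swap, assuming torch_npu is installed, an Ascend device is present, and torch.npu mirrors the torch.cuda calls used here:

import torch
import torch_npu  # importing the plugin registers the 'npu' backend

# CUDA version: x = torch.randn(4, device="cuda"); torch.cuda.synchronize()
x = torch.randn(4, device="npu")
torch.npu.synchronize()
print((x * 2).cpu())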
Dispatcher
The Dispatcher is PyTorch's official scheduling and dispatch mechanism, a key component of PyTorch's core architecture. Its main responsibility is to route operations to the appropriate kernel functions based on the input tensors' device type, data type, and other information. In simple terms, when you call an operation (like addition or multiplication) in PyTorch, the Dispatcher determines which device the operation will execute on and which specific implementation will be used to complete it.
Roughly speaking, from an implementation perspective, dispatch is similar to Spring IoC: it finds concrete implementations for interface definitions in order to support extension. Since PyTorch's default operator implementations target CPU and GPU, the dispatcher is introduced to make it easy to extend operators and integrate third-party devices.
The dispatcher is essentially a large map/dictionary: when an operator executes, the dispatcher computes the dispatch key for the call and then looks up the kernel function registered under that key, as sketched below.
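A toy model of that lookup, with hypothetical names and structure (not PyTorch internals):

# The dispatch table maps operator name -> {dispatch key -> kernel}.
def add_cpu(a, b):
    return [x + y for x, y in zip(a, b)]

def add_privateuse1(a, b):
    # In a real backend this would launch a kernel on the custom device.
    return [x + y for x, y in zip(a, b)]

DISPATCH_TABLE = {
    "aten::add": {"CPU": add_cpu, "PrivateUse1": add_privateuse1},
}

def call(op_name, dispatch_key, *args):
    kernel = DISPATCH_TABLE[op_name][dispatch_key]  # key lookup, then kernel lookup
    return kernel(*args)

print(call("aten::add", "PrivateUse1", [1, 2], [3, 4]))  # [4, 6]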

There are three ways to register operators, after which the dispatcher will route them to the corresponding function implementations for execution (see the sketch after this list):
- TORCH_LIBRARY_IMPL
  - m.impl
  - m.fallback
- TORCH_LIBRARY
  - m.def
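These are C++ macros; the Python torch.library API mirrors them (Library(ns, "DEF") corresponds to TORCH_LIBRARY/m.def, Library(ns, "IMPL") to TORCH_LIBRARY_IMPL/m.impl). A minimal sketch with a hypothetical namespace myops:

import torch

# Declare the operator schema (the m.def step)
def_lib = torch.library.Library("myops", "DEF")
def_lib.define("double_it(Tensor x) -> Tensor")

# Register a kernel for a specific dispatch key (the m.impl step)
impl_lib = torch.library.Library("myops", "IMPL")

def double_it_cpu(x):
    return x * 2

impl_lib.impl("double_it", double_it_cpu, dispatch_key="CPU")

print(torch.ops.myops.double_it(torch.tensor([1.0, 2.0])))  # tensor([2., 4.])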



In the source code of the torch-npu plugin, such registration functions appear throughout, dispatching operators to Huawei Ascend NPU hardware for execution. This is also why Ascend provides many versions of the torch-npu plugin, which has been available since PyTorch 1.x. Starting from PyTorch 2.1, however, the torch-npu plugin can be considered natively supported by PyTorch, and the MindSpeed series of training and inference acceleration middleware is built heavily on top of torch-npu.
OpenReg Example Project
The official PyTorch team provides an example project[2] that demonstrates how to dispatch custom operators to new hardware in an out-of-tree scenario using the PrivateUse1 feature and the dispatcher.

The csrc directory contains the C++ code invoked via torch._C, exposed through pybind11 so that Python can call into C++.
The _aten_impl.py file registers some new operator implementations and a kernel fallback against PyTorch's native aten namespace, as follows:
import torch

# The handler functions below are defined elsewhere in _aten_impl.py.
# Global fallback: ops without an explicit PrivateUse1 kernel land here.
_openreg_lib = torch.library.Library("_", "IMPL")
_openreg_lib.fallback(_openreg_kernel_fallback, dispatch_key="PrivateUse1")

# Explicit PrivateUse1 implementations for a few core aten ops.
_openreg_lib_aten = torch.library.Library("aten", "IMPL")
_openreg_lib_aten.impl("_copy_from", _copy_from, dispatch_key="PrivateUse1")
_openreg_lib_aten.impl(
    "set_.source_Tensor", _set_source_tensor, dispatch_key="PrivateUse1"
)
_openreg_lib_aten.impl(
    "_local_scalar_dense", _local_scalar_dense, dispatch_key="PrivateUse1"
)
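OpenReg forwards such calls to a separate device-daemon process; as a simpler illustration of what a fallback hook can do, here is a hedged sketch of a CPU-round-trip fallback (hypothetical device name "openreg"; not the project's actual implementation):

import torch

def cpu_roundtrip_fallback(op, *args, **kwargs):
    # Move tensor arguments to CPU, run the kernel there, and move the
    # result back: a common pattern for backends that have not yet
    # implemented every operator natively.
    def to_cpu(v):
        return v.cpu() if isinstance(v, torch.Tensor) else v

    result = op(*[to_cpu(a) for a in args],
                **{k: to_cpu(v) for k, v in kwargs.items()})
    if isinstance(result, torch.Tensor):
        result = result.to("openreg")  # hypothetical backend name
    return result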
The _device_daemon and _meta_parser modules handle device information and the conversion of context metadata.
References
- OpenReg
- Facilitating New Backend Integration through PrivateUse1[3]

[1] torch-npu: https://gitee.com/ascend/pytorch
[2] Example Project: https://github.com/pytorch/pytorch/blob/main/test/cpp_extensions/open_registration_extension/pytorch_openreg/__init__.py
[3] Facilitating New Backend Integration through PrivateUse1: https://pytorch.ac.cn/tutorials/advanced/privateuseone.html