Local Deployment of DeepSeek-R1 Large Model on RK3588 Platform: NPU Edition

Running large-model inference on the NPU allows system resources to be allocated more sensibly, leading to more efficient applications.


1 Preparation Before Deployment

To deploy a large language model on the RK3588 NPU, the following files need to be prepared in advance: Rockchip's official NPU example resources and the DeepSeek model files.

bash

## Rockchip Official Resources
https://github.com/airockchip/rknn-llm/tree/main/examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo
## DeepSeek Resources
https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/tree/main

Before deploying the large model, first check whether the NPU driver version is 0.9.8. If it is older, it is recommended to update the driver to the latest version.

bash

sudo cat /sys/kernel/debug/rknpu/version
    RKNPU driver: v0.9.8
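
If the debugfs node is not available on your image, the kernel boot log usually also records the RKNPU driver initialization (a hedged alternative; the exact message text depends on the driver version):

bash

sudo dmesg | grep -i rknpu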

2 Driver Update Considerations

To update the NPU driver, recompile the kernel according to the official manual, paying attention to the following two points.

Compilation Error 1: The RK3588 and RK3576 share the same NPU, so they use the same driver source file. However, the RK3588 kernel does not define the symbol referenced below, so simply comment out that line of code at the indicated location.

bash

237 |  .set_soc_info = rockchip_opp_set_low_length,
    |                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~
    |                  rockchip_pvtpll_add_length
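
A minimal sketch of the fix for Error 1, assuming the error points at the NPU devfreq source (rknpu_devfreq.c in recent SDKs; the file and line number may differ in yours): just comment out the assignment that references the undefined symbol.

c

/* RK3588 kernels do not provide rockchip_opp_set_low_length; this RK3576-only
 * hook can simply be disabled. */
// .set_soc_info = rockchip_opp_set_low_length,
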
Compilation Error 2: On kernel version 5.10, the functions shown below are not defined. Define them according to the official manual.

bash

  998 |  vm_flags_set(vma, VM_MIXEDMAP);
      |  ^~~~~~~~~~~~
make[3]: *** [scripts/Makefile.build:273: drivers/rknpu/rknpu_devfreq.o] Error 1
make[3]: *** Waiting for unfinished jobs....
 1148 |  vm_flags_clear(vma, VM_PFNMAP);
      |  ^~~~~~~~~~~~~~
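
vm_flags_set() and vm_flags_clear() only appeared in mainline Linux 6.3, so on a 5.10 kernel they must be supplied manually. A minimal sketch of the compatibility helpers the manual asks for, added after the includes of the affected driver source file (this assumes vma->vm_flags is still a directly writable field, which is the case on 5.10):

c

static inline void vm_flags_set(struct vm_area_struct *vma, vm_flags_t flags)
{
	vma->vm_flags |= flags;
}

static inline void vm_flags_clear(struct vm_area_struct *vma, vm_flags_t flags)
{
	vma->vm_flags &= ~flags;
}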

If you are using an RK3588, you can also obtain the already-modified driver files and copy them directly into the kernel driver folder before compiling.

3 Kernel Compilation and Deployment

3.1 Compile the Kernel

Copy the modified driver files over the originals in the kernel driver folder, compile the kernel according to the manual, and flash the generated boot.img to the board.

After updating the kernel, check the NPU driver version again. If the driver version has been upgraded, you can proceed to the next step.

bash

sudo cat /sys/kernel/debug/rknpu/version
    RKNPU driver: v0.9.8

3.2 Virtual Machine Operations for Deployment

Clone the official repository files to your local machine and install the rkllm-toolkit. It is recommended to create a new Python 3.8 or 3.10 environment using conda.

The rkllm-toolkit.whl file is located in the rknn-llm-main/rkllm-toolkit folder after cloning the repository, and currently supports Python 3.8 or Python 3.10.

bash

git clone https://github.com/airockchip/rknn-llm.git


conda create -n rkllm python=3.10
conda activate rkllm


pip3 install rkllm_toolkit-1.1.4-cp310-cp310-linux_x86_64.whl
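# the cp310 tag in the wheel name must match your Python version (use the cp38 wheel for a Python 3.8 environment)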

After installation, you can test whether the installation was successful in the Python environment.

python

from rkllm.api import RKLLM
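# if this import completes without an error, the toolkit is installed correctly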

Clone the model repository data, or you can download it directly.

bash

# Install git-lfs
sudo apt-get install git-lfs
git lfs install


# Get DeepSeek-R1-Distill-Qwen-1.5B
git clone https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
# Get from mirror site (optional)
git clone https://hf-mirror.com/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B


# Get DeepSeek-R1-Distill-Qwen-7B
git clone https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
# Get from mirror site (optional)
git clone https://hf-mirror.com/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B

Then create data_quant.json for quantizing the rkllm model.

bash

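# -m points to the locally downloaded model directory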
python generate_data_quant.py  -m ../deepseek-aiDeepSeek-R1-Distill-Qwen-1.5B/

Export the rkllm model. Before running the script, modify the target hardware platform, the model path, and the other settings in export_rkllm.py.

bash

python export_rkllm.py

python

modelpath = './deepseek-aiDeepSeek-R1-Distill-Qwen-1.5B'
dataset = "./data_quant.json"
qparams = None # Use extra_qparams
target_platform = "RK3588"
optimization_level = 1
quantized_dtype = "W8A8"
quantized_algorithm = "normal"
num_npu_core = 3
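
For context, these settings are consumed by export_rkllm.py roughly as sketched below: load the Hugging Face checkpoint, quantize and build it for the target platform, then export the .rkllm file. This is a simplified, hedged outline of the toolkit's load/build/export flow rather than the official script; parameter names and return-value conventions may differ between rkllm-toolkit versions, so treat the repository's export_rkllm.py as authoritative. The output filename below is only an example.

python

from rkllm.api import RKLLM

# modelpath, dataset, qparams, target_platform, etc. are the variables shown above
llm = RKLLM()

# load the local Hugging Face checkpoint
ret = llm.load_huggingface(model=modelpath)
if ret != 0:
    raise SystemExit('Load model failed!')

# quantize and build the model for the target NPU
ret = llm.build(do_quantization=True,
                optimization_level=optimization_level,
                quantized_dtype=quantized_dtype,
                quantized_algorithm=quantized_algorithm,
                target_platform=target_platform,
                num_npu_core=num_npu_core,
                extra_qparams=qparams,
                dataset=dataset)
if ret != 0:
    raise SystemExit('Build model failed!')

# write the .rkllm file that will later be copied to the board
ret = llm.export_rkllm('./DeepSeek-R1-Distill-Qwen-1.5B.rkllm')
if ret != 0:
    raise SystemExit('Export model failed!')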

Transfer the generated .rkllm file to the board; this completes the work on the virtual machine. Note that exporting the rkllm model requires a large amount of memory, so choose the build machine accordingly.

3.3 Board Operations for Deployment

First, install some tools needed for compilation.

bash

sudo apt update
sudo apt install gcc g++ cmake

The board also needs to clone the official repository files, or you can transfer the cloned files from the virtual machine to the board.

bash

git clone https://github.com/airockchip/rknn-llm.git

Enter the deploy folder of the corresponding example, set the compiler in the build script, and run the compilation.

bash

cd rknn-llm/examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo/deploy


# in build-linux.sh, set GCC_COMPILER_PATH=aarch64-linux-gnu (building natively on the board)


./build-linux.sh

After execution, the llm_demo file will be generated in the build folder, and you need to give this file execution permissions.

bash

chmod 777 llm_demo

Copy the necessary library files to the /usr/lib directory.

bash

cp rknn-llm-main/rkllm-runtime/Linux/librkllm_api/aarch64/librkllmrt.so /usr/lib
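
Alternatively, if you would rather not copy the library into /usr/lib, you can point LD_LIBRARY_PATH at the runtime directory instead (a sketch; the path assumes the repository layout used in the copy command above):

bash

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$(pwd)/rknn-llm-main/rkllm-runtime/Linux/librkllm_api/aarch64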

Finally, execute the following command to run DeepSeek.

bash

ulimit -Sn 50000
export LD_LIBRARY_PATH=.
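# arguments: model path, max_new_tokens, max_context_len (as in the official demo's usage)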
./llm_demo deepseek_r1_rk3588.rkllm 10000 1000

4 Running Experience

bash

./llm_demo RK3588.rkllm 10000 1000
    rkllm init start
    I rkllm: rkllm-runtime version: 1.1.4, rknpu driver version: 0.9.8, platform: RK3588
    rkllm init success

Ask it if it understands Chinese.

bash

user: Do you speak Chinese?
robot: <think>

</think>

Yes, I understand Chinese. Is there anything you would like to ask me?

Then test its professional knowledge.

bash

user: Do you know about Xilinx FPGAs?
robot: <think>

</think>

Yes, Xilinx FPGA (Field-Programmed Readable Array) is a chip used to implement programmable logic devices, mainly used to provide fast and flexible solutions in digital circuit design and system development. Xilinx is well known for its innovation and mature products in this field.

### Main features of Xilinx FPGAs:
1. **Programmable logic devices**: support a variety of logic functions such as adders, multipliers, and memories.
2. **Embedded solutions**: suitable for embedded system design, offering high availability and low power consumption.
3. **Timing design tools**: support languages such as VHDL and Verilog for describing and verifying designs.
4. **Rapid development and debugging**: Xilinx development boards (such as the XLP10) simplify the development process.

### Main Xilinx products:
- **XLP10**: an embedded development board supporting a variety of logic functions and timing tools.
- **XLP700 series**: high-bandwidth, low-latency high-speed development devices.
- **XLP200 series**: low-power devices suited to small system designs.

### Xilinx applications in industry:
- **Embedded systems**: such as Android and iOS systems, used to implement complex logic and timing functions.
- **Internet of Things (IoT)**: used in the design of data processing and control modules.
- **Autonomous driving**: used in the development of sensor data processing and decision-making algorithms.

Through its products and technology, Xilinx helps developers quickly implement complex digital system designs and plays an important role in a wide range of applications.

The question above can be answered directly from the model's built-in knowledge. Now ask a deeper question that requires some reasoning.

bash

user: How do I deploy a large model on the RK3588?
robot: <think>

</think>

I cannot directly access or deploy a specific model such as the RK3588. However, if you want to learn how to use large language models (such as GPT-4), you can refer to the following steps:

1. **Choose a suitable model**: for example GPT-4, Claude, and so on.

2. **Install the necessary tools and libraries**:
   - `llama-cpp` or `llama-python` for compiling the model.
   - the `sentence-transformers` and `transformers` libraries for handling context and pre-training data.

3. **Configure the model parameters**: adjust the model size, hidden layers, and other parameters according to your needs.

4. **Train or run inference**:
   - Learn from a training set, or run inference with a test set.
   - Tune the hyperparameters to optimize model performance.

5. **Deploy and use**: deploy the model to a server and run inference or generate text through an API or a local environment.

If you have a specific question or need further help, please provide more details.

There was no deep thinking process, and the answer was quite general. Now, let’s ask a simple math question.

bash

user: Please tell me what is 7 add 9?
robot: <think>
I need to calculate the sum of 7 and 9.

First, I'll identify the numbers involved: 7 and 9.

Next, I'll perform the addition by combining these two quantities.

Finally, I'll determine that the result of adding 7 and 9 is 16.
</think>

**Solution:**

To find the sum of 7 and 9, follow these simple steps:

1. **Identify the numbers to add:**  
   \[ 7 + 9 \]

2. **Perform the addition:**  
   \[ 7 + 9 = 16 \]

**Final Answer:**
\[ \boxed{16} \]

It must be said that, compared with CPU deployment, the running speed improves significantly, but the NPU deployment process still has some difficulties:

  • Compiling the kernel may require certain modifications to the driver due to differences in kernel versions and platforms;

  • Exporting the rkllm model places considerable memory demands on the virtual machine.
