Recently, I got an Orange Pi 5 Plus development board to use as an all-in-one server. Given how popular large language models have become over the past two years, as a tech enthusiast I can't fall behind, right? So I plan to run some tests on this server. The board uses the RK3588, Rockchip's most powerful SoC, with four Cortex-A76 and four Cortex-A55 cores, so it should perform decently.
Next, I will explain how to run large models locally. After searching online, the simplest and most user-friendly option is Ollama, a framework built on top of llama.cpp. From what I understand, at the moment it only supports CPU + NVIDIA GPU or pure-CPU inference, so my Orange Pi can only run in pure CPU mode.
Without further ado: following online tutorials, it takes just two Docker commands to set everything up. Here they are:
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main
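Before wiring up the web UI, it's worth checking that the Ollama container is actually listening. A quick sanity check is to hit its REST API with curl; the endpoint below is part of Ollama's standard API and returns the list of locally available models (empty at first):
curl http://localhost:11434/api/tags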
However, during actual testing, I found that open-webui could not start. According to the logs, the tool attempted to download from huggingface.co, but that site was inaccessible, preventing the web service from starting. I found a mirror site online (https://hf-mirror.com), and all I needed to do was point the HF_ENDPOINT environment variable at the mirror. The complete command is as follows:
docker run -d -p 5399:8080 --add-host=host.docker.internal:host-gateway -v /DATA/AppData/open-webui:/app/backend/data --env HF_ENDPOINT=https://hf-mirror.com --name open-webui --restart always ghcr.io/open-webui/open-webui:main
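If the container still refuses to come up, the startup logs will show whether the download is now going through the mirror. This is just the standard Docker log command, nothing specific to open-webui:
docker logs -f open-webui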
Next, I successfully opened the web page and registered an account. I downloaded three models in total (they can also be pulled directly with the Ollama CLI, as shown after the list):
- gemma:2b
- llama2:7b
- llama2:13b
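I pulled the models through the web UI, but the same can be done from the Ollama CLI inside the container, for example (using the model tags listed above):
docker exec -it ollama ollama pull gemma:2b
docker exec -it ollama ollama pull llama2:7b
docker exec -it ollama ollama pull llama2:13b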
Below are the response speed tests for these three models (don't click on the gifs; they won't play):
[gemma 2b]
The response speed of gemma 2b is decent, close to usable, though there is noticeable latency before the first token appears.
[llama2 7b]
The response speed of llama2 7b is slow, with long wait times that make it unusable in practice.
[llama2 13b]
llama2 13b is far too slow, with a very long time to first response and a slow output rate.
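The gifs only give a rough feel. For actual numbers, Ollama can print generation statistics (prompt eval rate and eval rate in tokens per second) when a model is run with the --verbose flag, for example:
docker exec -it ollama ollama run gemma:2b --verbose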
Summary:
In pure CPU mode, only the 2b model is fast enough to be barely usable, while the 7b and 13b models are too slow to be practical. I look forward to running them on the GPU, or even on the RK3588's built-in 6 TOPS NPU, and I'll come back with a real test then.