Author: David Tippett
Running extensive performance tests on a Raspberry Pi may sound dull, but it taught me how subtle performance testing really is. In some of the tests below, I managed to exhaust the Pi's resources entirely, which tanked performance. Every application has its own performance characteristics, so let's walk through the factors to consider when deploying Valkey.
Testing Environment
We will use the Raspberry Pi Compute Module 4 (CM4) as the hardware. This is a single-board computer with a 1.5GHz quad-core Broadcom CPU and 8GB of system memory. It is hardly an ideal choice for a production system, but that is exactly why it is useful here: the CM4 makes it easy to demonstrate how to tune Valkey under different hardware constraints.
Our operating system is 64-bit, Debian-based Raspbian, a distribution optimized specifically for the CM4. Valkey will run in a Docker container orchestrated with Docker Compose. I like using containers for deployment because it simplifies operations. If you want to follow along, here's a guide[1] to installing Docker. Be sure to continue to the second page[2] of the installation instructions; skipping it will complicate the later steps.
We will use two CM4s for testing: the first hosts Valkey, and the second hosts the benchmarking software. This setup better reflects how most people run Valkey in production. The benchmark uses redis-benchmark, since it can be installed with sudo apt install redis-tools. The built-in benchmarking tool valkey-benchmark would also work, but it requires installing Valkey on the benchmark server or starting a container and connecting through it; the two tools are functionally almost identical. If you want to try valkey-benchmark without installing anything, see the sketch below.
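A minimal way to do that, assuming the official valkey/valkey image bundles the valkey-benchmark binary alongside valkey-cli, is to run it from a throwaway container:

# Runs valkey-benchmark from the official image against a remote Valkey host.
# --network host lets the container use the Pi's network stack directly.
# Substitute your own server IP and the password you generate below.
docker run --rm --network host valkey/valkey:latest \
  valkey-benchmark -h 10.0.1.136 -a <PASSWORD FROM .env> -t set,get -P 16 -q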
Setting Up the Environment
Below is a simple Docker Compose file that starts a Valkey container. It binds Valkey to port 6379 on the host device, which means anyone with access to your network can reach it! That exposure is exactly what lets the benchmark server connect.
# valkey.yaml
services:
  valkey-1:
    image: valkey/valkey:latest
    hostname: valkey1
    command: valkey-server --port 6379 --requirepass ${VALKEY_PASSWORD} --io-threads ${IO_THREADS} --save ""
    volumes:
      - ./data:/data
    network_mode: host
volumes:
  data:
    driver: local
Since we are exposing Valkey to the internal network, we need to create a password for the default user. I generated a random one with head -16 /dev/urandom | openssl sha1. Given how quickly Valkey handles requests, a brute-force attack could try hundreds of thousands of passwords per second, so make it a strong one. After generating the password, I placed it in a .env file in the same directory as the Docker Compose file.
#.env
VALKEY_PASSWORD=e41fb9818502071d592b36b99f63003019861dad
NODE_IP=<VALKEY SERVER IP>
IO_THREADS=1
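Before bringing the server up, it can help to let Compose render the fully-resolved file, which catches YAML indentation mistakes and missing .env variables early (a generic Compose check, nothing Valkey-specific):

# Prints the final configuration with .env values substituted in,
# and exits with an error if the file is malformed.
docker compose -f valkey.yaml config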
Now, running docker compose -f valkey.yaml up -d starts the Valkey server with the password we set!
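With the container running, a quick way to confirm the server is reachable from the benchmark machine is a PING (a minimal check; 10.0.1.136 is my Valkey server's IP, so substitute your own):

# redis-cli ships with redis-tools on the benchmark server.
# A successful round trip prints PONG.
redis-cli -h 10.0.1.136 -a e41fb9818502071d592b36b99f63003019861dad ping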
Baseline Testing
Now we are ready to run some baseline tests. Log in to the benchmark server; if you haven't installed redis-benchmark yet, you can do so with sudo apt install redis-tools.
redis-benchmark -n 1000000 -t set,get -P 16 -q -a <PASSWORD FROM .env> --threads 5 -h 10.0.1.136
Test details:
- -n – run 1,000,000 operations using the commands given in -t
- -t – run the set and get tests
- -P – pipeline 16 operations per request
- -q – quiet output, only show the final results
- -a – authenticate with the specified password
- -h – run the tests against the specified host
- --threads – number of threads generating test data
Honestly, the results of the first test shocked me. I expected Valkey to be fast, but this fast on a single-board computer, running single-threaded? Truly amazing.
redis-benchmark -n 1000000 -t set,get -P 16 -q -a e41fb9818502071d592b36b99f63003019861dad --threads 5 -h 10.0.1.136
SET: 173040.33 requests per second, p50=4.503 msec
GET: 307031.00 requests per second, p50=2.455 msec
Between the two tests, we averaged 240,000 requests per second.
Increasing CPU Clock Speed
Since Valkey is a single-threaded application, raising the clock speed naturally improves performance. I don't expect most people to overclock production servers; the point is simply that different servers come with different CPU clock speeds.
Note: Clock speeds are only really comparable between CPUs with similar architectures. You can reasonably compare a 12th-generation Intel i5 against a 12th-generation Intel i7, but a 12th-generation i7 with a maximum clock of 5GHz is not necessarily slower than an AMD Ryzen 9 9900X clocked at 5.6GHz.
If you are following along on your own Pi, I have listed the steps to overclock the CM4. Otherwise, you can skip to the results section.
Warning: Please note that overclocking may damage your device. Please proceed with caution and research safe settings.
- Open the following file:
  sudo nano /boot/firmware/config.txt
- Add the following lines at the end of the file:
  [all]
  over_voltage=8
  arm_freq=2200
- Reboot the Pi and log back in:
  sudo reboot now
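After the reboot, you can confirm the overclock actually took effect using the stock Raspberry Pi firmware tool (the frequency is reported in Hz, so expect a value around 2200000000 under load):

# Reports the current ARM core clock in Hz.
vcgencmd measure_clock arm
# Worth keeping an eye on while overclocked: the SoC throttles itself
# if it gets too hot.
vcgencmd measure_temp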
By increasing the clock speed from 1.5GHz to 2.2GHz, we have boosted the Pi’s speed by 47%. Now let’s rerun the tests and see the results!
redis-benchmark -n 1000000 -t set,get -P 16 -q -a e41fb9818502071d592b36b99f63003019861dad --threads 5 -h 10.0.1.136
SET: 394368.41 requests per second, p50=1.223 msec
GET: 438058.53 requests per second, p50=1.135 msec
We reached 416,000 requests per second (again, the average across both operations). Mathematicians might note that this gain far exceeds the expected 47%: requests per second actually rose by 73%. What's going on?
Adding IO Threads
After achieving these gains, I was excited to try the new IO threads[3] introduced in Valkey 8. First, stop the running instance with docker compose -f valkey.yaml down, then change the IO_THREADS parameter in the .env file to 5.
#.env
VALKEY_PASSWORD=e41fb9818502071d592b36b99f63003019861dad
NODE_IP=<VALKEY SERVER IP>
IO_THREADS=5
Next, restart it with docker compose -f valkey.yaml up -d, log back in to the benchmark server, rerun the test, and the results are…?
redis-benchmark -n 10000000 -t set,get -P 16 -q -a e41fb9818502071d592b36b99f63003019861dad --threads 5 -h 10.0.1.136
SET: 345494.75 requests per second, p50=0.911 msec
GET: 327858.09 requests per second, p50=0.879 msec
Wait a minute, these results are even worse than before? Requests per second dropped from 416,000 to 336,000… What’s going on?
Our CPU is over-subscribed: with io-threads set to 5 on a 4-core CPU, we created more worker threads than there are cores. Under constant load, those threads compete with each other for time on each core, not to mention with the Valkey process itself.
This is why Valkey recommends setting the number of IO threads to fewer than the number of cores. For our little 4-core server, we set the IO_THREADS parameter in the .env file to 2 and tried again.
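If you want to confirm which value the running server actually picked up, you can query the live setting (CONFIG GET works the same way in Valkey as in Redis):

# Prints the io-threads value the server is currently running with.
redis-cli -h 10.0.1.136 -a e41fb9818502071d592b36b99f63003019861dad config get io-threads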
redis-benchmark -n 10000000 -t set,get -P 16 -q -a e41fb9818502071d592b36b99f63003019861dad --threads 5 -h 10.0.1.136
SET: 609050.44 requests per second, p50=0.831 msec
GET: 521186.22 requests per second, p50=0.719 msec
Much better! Now we see around 565,000 requests per second, roughly a 35% improvement over the previous average! What's more, monitoring the server showed all four CPU cores at 100% utilization, which would seem to leave no room for further improvement!
Really? Believe it or not, our little CM4 has even more performance to uncover!
Here is roughly what is happening on the server: the Valkey process has to spend precious cycles managing the IO threads, on top of a lot of memory-management work. That is a heavy load for a single process.
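If you want to watch this happen during a benchmark run, per-core utilization is far more revealing than a single load average (a small sketch, assuming the sysstat package, which provides mpstat):

# Install sysstat, then print per-core utilization once per second.
# With oversubscribed IO threads you will see all four cores pinned near 100%.
sudo apt install sysstat
mpstat -P ALL 1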
In fact, there is one more optimization that makes single-threaded Valkey run even faster. Valkey has recently done a lot of work to support speculative execution, which lets it predict which memory values will be needed in upcoming processing steps. That way the server doesn't stall on main-memory accesses, which are orders of magnitude slower than L1 cache hits. I won't detail the technique here, as there is already a good blog[4] describing how to leverage these optimizations. Here are the results:
redis-benchmark -n 10000000 -t set,get -P 16 -q -a e41fb9818502071d592b36b99f63003019861dad --threads 5 -h 10.0.1.136
SET: 632791.25 requests per second, p50=1.191 msec
GET: 888573.00 requests per second, p50=0.695 msec
Although these results are better, they are also somewhat puzzling. After talking with some Valkey maintainers, it seems Raspbian may use a different memory-write configuration. In their tests, GET and SET throughput were nearly identical, but in mine, writes consistently lagged behind reads. If you know why, please contact me!
Clustering Valkey
Finally, we will start a Valkey cluster. The cluster runs multiple Valkey instances, each owning its own slice of the keyspace, which makes it easier for the instances to work in parallel.
I’m not going to go into detail about how the keyspace works, but here’s a great getting started guide[5] to help you understand Valkey clustering.
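As a tiny illustration of what that guide covers: once the cluster below is up, the cluster-aware CLI transparently follows MOVED redirects to whichever node owns a key's hash slot (a hypothetical key, using the same redis-tools client):

# -c enables cluster mode: if 10.0.1.136:6379 does not own the slot for
# "foo", the client is redirected to the node that does and retries there.
redis-cli -c -h 10.0.1.136 -p 6379 -a <PASSWORD FROM .env> set foo bar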
First, stop the previous Valkey container with docker compose -f valkey.yaml down. Now we can create a Docker Compose file for the cluster. Since each instance is exposed on the host, each needs its own port, and they all need to know they are running in cluster mode so that requests can be redirected to the appropriate instance.
# valkey-cluster.yaml
services:
  valkey-node-1:
    hostname: valkey1
    image: valkey/valkey:latest
    command: valkey-server --port 6379 --cluster-enabled yes --cluster-config-file nodes.conf --cluster-node-timeout 5000 --requirepass ${VALKEY_PASSWORD} --save ""
    volumes:
      - ./data1:/data
    network_mode: host
  valkey-node-2:
    hostname: valkey2
    image: valkey/valkey:latest
    command: valkey-server --port 6380 --cluster-enabled yes --cluster-config-file nodes.conf --cluster-node-timeout 5000 --requirepass ${VALKEY_PASSWORD} --save ""
    volumes:
      - ./data2:/data
    network_mode: host
  valkey-node-3:
    hostname: valkey3
    image: valkey/valkey:latest
    command: valkey-server --port 6381 --cluster-enabled yes --cluster-config-file nodes.conf --cluster-node-timeout 5000 --requirepass ${VALKEY_PASSWORD} --save ""
    volumes:
      - ./data3:/data
    network_mode: host
volumes:
  data1:
    driver: local
  data2:
    driver: local
  data3:
    driver: local
Run docker compose -f valkey-cluster.yaml up -d to start the cluster. One step remains to finish creating it: use docker ps --format '{{.Names}}' to find the name of one of your nodes.
docker ps --format '{{.Names}}'
kvtest-valkey-node-1-1
kvtest-valkey-node-3-1
kvtest-valkey-node-2-1
I will use the first container to finish creating the cluster. Once the containers are up, we need to hand Valkey the details of the cluster. Below is the command I ran, listing the host IP and port of every container; these must be addresses the benchmark server can reach.
docker exec -it kvtest-valkey-node-1-1 valkey-cli --cluster create 10.0.1.136:6379 10.0.1.136:6380 10.0.1.136:6381 -a e41fb9818502071d592b36b99f63003019861dad
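Before benchmarking, it is worth verifying that the cluster reports itself healthy and that all 16,384 hash slots were assigned (a quick check run through the same container):

# Expect cluster_state:ok and cluster_slots_assigned:16384.
docker exec -it kvtest-valkey-node-1-1 valkey-cli -a e41fb9818502071d592b36b99f63003019861dad cluster info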
Now we can run our benchmark! We need to add the --cluster flag to the benchmark command. Also, since the cluster is so fast, I increased the number of requests from 1 million to 10 million to make sure Valkey's resources are fully saturated.
redis-benchmark -n 10000000 -t set,get -P 16 -q --cluster -a e41fb9818502071d592b36b99f63003019861dad --threads 5 -h 10.0.1.136
Cluster has 3 master nodes:
Master 0: 219294612b44226fa32482871cf21025ff531875 10.0.1.136:6380
Master 1: e5d85b970551c27065f1552f5358f4add6114d98 10.0.1.136:6381
Master 2: 1faf3d0dd22e518eec11fd46c0de6ce18cd15cfe 10.0.1.136:6379
SET: 1122838.50 requests per second, p50=0.575 msec
GET: 1188071.75 requests per second, p50=0.511 msec
1,155,000 requests per second. We successfully doubled our throughput, and all of it on a credit-card-sized single-board computer.
While this is far from the standard I would recommend for production servers, these steps are what I would suggest for evaluating Valkey. It is important to first test with a single instance to find the best settings. Then you can scale the tests by adding IO threads or Valkey instances.
Tests should reflect your production workload as closely as possible. This test used synthetic data. Therefore, I recommend checking the documentation to understand other settings you may need to test. For example, we tested with the default settings of 50 client connections and 3-byte payloads. Your production workload may differ, so explore all settings! You may find that IO threads work better in your use case.
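As a starting point, a run that better approximates a heavier workload might raise the client count and payload size (the values here are hypothetical; -c and -d are standard redis-benchmark flags, so tune them to match your application):

# 200 concurrent clients and 1 KiB values instead of the defaults (50 and 3 bytes).
redis-benchmark -n 10000000 -t set,get -P 16 -q -c 200 -d 1024 \
  -a <PASSWORD FROM .env> --threads 5 -h 10.0.1.136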
If you enjoyed this article, remember to visit my blog TippyBits.com, where I will regularly post similar content. My friends, stay curious!
[1] Guide: https://docs.docker.com/engine/install/debian/
[2]Second Page: https://docs.docker.com/engine/install/linux-postinstall/
[3]IO Threads: https://valkey.io/blog/unlock-one-million-rps/
[4]Blog: https://valkey.io/blog/unlock-one-million-rps-part2/
[5]Getting Started Guide: https://valkey.io/topics/cluster-tutorial/