Running Stable Diffusion on Raspberry Pi with Only 260MB RAM

Machine Heart reports

Editors: Ziwen, Zhang Qian

Stable Diffusion can now run on Raspberry Pi!

Stable Diffusion was born 11 months ago, and the news that it could run on consumer-grade GPUs encouraged many researchers. Soon after, Apple got involved and brought Stable Diffusion to iPhones, iPads, and Macs. This greatly lowered the hardware bar, turning Stable Diffusion into a piece of cutting-edge technology that anyone can use.
Now, it can even run on the Raspberry Pi Zero 2.
Raspberry Pi Zero 2: ‘Just as small. Five times as fast.’
What does this mean? Running Stable Diffusion is not an easy task: it involves models totaling roughly 1 billion parameters, and the usual recommended minimum is 8GB of RAM/VRAM. The RPI Zero 2, by contrast, is a microcomputer with just 512MB of RAM.
This means running Stable Diffusion on the RPI Zero 2 is a huge challenge. Moreover, while doing so, the author neither added swap space nor offloaded intermediate results to disk.
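A quick back-of-the-envelope estimate (illustrative arithmetic only, based on the parameter count above) shows how large the gap is even before counting activations or framework overhead:

    // Rough memory estimate for ~1 billion parameters, counting only the
    // weights (no activations, no framework overhead).
    #include <cstdio>

    int main() {
        const double params = 1.0e9;             // ~1 billion parameters
        const double mib = 1024.0 * 1024.0;
        std::printf("FP32 weights: ~%.0f MiB\n", params * 4 / mib); // ~3815 MiB
        std::printf("FP16 weights: ~%.0f MiB\n", params * 2 / mib); // ~1907 MiB
        std::printf("RPI Zero 2 RAM: 512 MiB\n");
        return 0;
    }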
Generally speaking, major machine learning frameworks and libraries focus on minimizing inference latency and/or maximizing throughput, but all of this comes at the cost of memory usage. Therefore, the author decided to write a super small, hackable inference library aimed at minimizing memory consumption.
OnnxStream has achieved this.
Project address: https://github.com/vitoplantamura/OnnxStream
OnnxStream is based on the idea of decoupling the inference engine from the component responsible for providing model weights, the latter being a class derived from WeightsProvider. A specialization of WeightsProvider can implement any type of model parameter loading, caching, and prefetching. For example, a custom WeightsProvider can decide to download data directly from an HTTP server without loading or writing anything to disk (this is also why ‘Stream’ is in the name OnnxStream). There are two default WeightsProviders available: DiskNoCache and DiskPrefetch.
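To illustrate the idea (with hypothetical type and method names, not the actual OnnxStream classes, which live in the repo), a provider in this style might look roughly like this:

    // Hypothetical sketch of the decoupling idea: the engine asks a provider
    // for each tensor's bytes only when it needs them, so the full set of
    // weights never has to sit in RAM at once.
    #include <cstdint>
    #include <fstream>
    #include <iterator>
    #include <string>
    #include <vector>

    // Stand-in for a WeightsProvider-style interface (names are illustrative).
    struct IWeightsProvider {
        virtual ~IWeightsProvider() = default;
        // Return the raw bytes of one named tensor, on demand.
        virtual std::vector<std::uint8_t> get(const std::string& tensor_name) = 0;
    };

    // One possible specialization, in the spirit of DiskNoCache: re-read each
    // tensor from disk every time and cache nothing.
    struct DiskNoCacheProvider : IWeightsProvider {
        std::string dir;
        explicit DiskNoCacheProvider(std::string d) : dir(std::move(d)) {}
        std::vector<std::uint8_t> get(const std::string& tensor_name) override {
            std::ifstream f(dir + "/" + tensor_name, std::ios::binary);
            return {std::istreambuf_iterator<char>(f),
                    std::istreambuf_iterator<char>()};
        }
    };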
Compared to Microsoft’s inference framework OnnxRuntime, OnnxStream needs only about 1/55 of the memory to produce the same results, while being only 0.5x to 2x slower (on CPU).
The rest of this article shows the results of running Stable Diffusion on the RPI Zero 2 and the methods behind it. Although generation is slow, this is a brand-new attempt at running large models on smaller, more constrained devices.
Online commenters think this project is cool
Running Stable Diffusion on Raspberry Pi Zero 2
The VAE decoder is the only model in Stable Diffusion that cannot fit into the RPI Zero 2 RAM in single or half precision. This is because there are residual connections, very large tensors, and convolutions in the model. The only solution is static quantization (8 bits).
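As a refresher on the technique, here is a minimal sketch of asymmetric unsigned 8-bit quantization with a percentile-clipped calibration range, the general approach behind the project’s W8A8 static quantization; it is illustrative only, not OnnxStream’s actual implementation.

    // Minimal sketch of asymmetric unsigned 8-bit quantization with a
    // percentile-clipped range (the general technique; not OnnxStream's code).
    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct QuantParams { float scale; std::uint8_t zero_point; };

    // Choose the range from e.g. the 0.1st/99.9th percentiles of calibration
    // samples instead of the absolute min/max, so rare outliers do not stretch it.
    QuantParams calibrate(std::vector<float> values, float pct = 0.001f) {
        std::sort(values.begin(), values.end());
        const std::size_t n = values.size();
        const float lo = values[static_cast<std::size_t>(pct * (n - 1))];
        const float hi = values[static_cast<std::size_t>((1.0f - pct) * (n - 1))];
        const float scale = (hi - lo) / 255.0f;
        const float zp = std::round(-lo / scale);
        return {scale, static_cast<std::uint8_t>(std::clamp(zp, 0.0f, 255.0f))};
    }

    std::uint8_t quantize(float x, const QuantParams& q) {
        const float v = std::round(x / q.scale) + static_cast<float>(q.zero_point);
        return static_cast<std::uint8_t>(std::clamp(v, 0.0f, 255.0f));
    }

    float dequantize(std::uint8_t x, const QuantParams& q) {
        return (static_cast<float>(x) - static_cast<float>(q.zero_point)) * q.scale;
    }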
The following images were generated by the Stable Diffusion example included in the author’s repo, running OnnxStream with the VAE decoder at different precisions.
The first image was generated on the author’s PC using the same latents generated by the RPI Zero 2.
Image generated with the VAE decoder at W16A16 precision
Image generated with the VAE decoder at W8A32 precision
The third image was generated by the RPI Zero 2 in about 3 hours.
Image generated with the VAE decoder at W8A8 precision
Features of OnnxStream
  • Decoupling inference engine from WeightsProvider
  • WeightsProvider can be DiskNoCache, DiskPrefetch, or custom
  • Attention slicing
  • Dynamic quantization (8 bit unsigned, asymmetric, percentiles)
  • Static quantization (W8A8 unsigned, asymmetric, percentiles)
  • Easy calibration of quantized models
  • Supports FP16 (with or without FP16 arithmetic)
  • Implements 24 of the most commonly used ONNX operators
  • Operations are executed sequentially, but all operators are multithreaded
  • Single implementation file + header file
  • XNNPACK calls are encapsulated in the XnnPack class (for future replacement)
It is also important to note that OnnxStream relies on XNNPACK to accelerate certain primitives: MatMul, Convolution, element-wise Add/Sub/Mul/Div, Sigmoid, and Softmax.
Performance Comparison
Stable Diffusion consists of three models: text encoder (672 operations and 123 million parameters), UNET model (2050 operations and 854 million parameters), and VAE decoder (276 operations and 49 million parameters).
Assuming a batch size of 1, generating a complete image with 10 steps (which gives acceptable results with the Euler Ancestral scheduler) requires running the text encoder twice, the UNET model 20 times (i.e., 2*10), and the VAE decoder once.
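Those counts follow from classifier-free guidance at batch size 1: every denoising step needs one conditional and one unconditional UNET pass, and the text encoder embeds both the prompt and the empty prompt. A trivial counting sketch (illustrative only; the real pipeline is in the repo):

    // Counting sketch for the 2 / 20 / 1 run totals above.
    #include <cstdio>

    int main() {
        const int steps = 10;                 // denoising steps from the article
        const int text_encoder_runs = 2;      // prompt + unconditional embedding
        const int unet_runs = 2 * steps;      // two passes per step at batch size 1
        const int vae_decoder_runs = 1;       // decode the final latent once
        std::printf("text encoder: %d, UNET: %d, VAE decoder: %d\n",
                    text_encoder_runs, unet_runs, vae_decoder_runs); // 2, 20, 1
        return 0;
    }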
The author’s benchmark table lists the inference times of the three Stable Diffusion models, along with their memory consumption (i.e., peak working set size on Windows or maximum resident set size on Linux).
It shows that for the UNET model (run at FP16 precision with FP16 arithmetic enabled in OnnxStream), OnnxStream’s memory consumption is only about 1/55 of OnnxRuntime’s, while it is only 0.5x to 2x slower.
Several points to note about this test:
  • The first run of OnnxRuntime is a warm-up inference because its InferenceSession is created before the first run and reused in all subsequent runs. OnnxStream, however, does not have a warm-up inference because its design is purely ‘eager’ (though subsequent runs can benefit from the operating system’s caching of weight files).
  • Currently, OnnxStream does not support inputs with batch size != 1, unlike OnnxRuntime, which can significantly speed up the whole diffusion process by running the UNET model with batch size = 2.
  • In the test, changing OnnxRuntime’s SessionOptions (such as EnableCpuMemArena and ExecutionMode) had no significant impact on the results.
  • In terms of memory consumption and inference time, the performance of OnnxRuntime is very similar to that of NCNN (another framework).
  • Test running conditions: Windows Server 2019, 16GB RAM, 8750H CPU (AVX2), 970 EVO Plus SSD, 8 virtual cores on VMWare.
Attention Slicing and Quantization
When running the UNET model, the ‘attention slicing’ technique is used, and W8A8 quantization is applied to the VAE decoder; both are crucial for bringing memory consumption down to a level that fits the RPI Zero 2.
While there is a lot of information about quantizing neural networks on the internet, there is little about ‘attention slicing.’
The idea here is simple: the goal is to avoid generating the complete Q @ K^T matrix when calculating the scaled dot-product attention of various multi-head attentions in the UNET model. In the UNET model, when the number of attention heads is 8, the shape of Q is (8,4096,40), while K^T is (8,40,4096). Therefore, the final shape of the first MatMul is (8,4096,4096), which is a 512MB tensor (FP32 precision).
The solution is to vertically slice Q and then perform normal attention operations on each Q block. Q_sliced has the shape (1,x,40), where x is 4096 (in this case), divided by onnxstream::Model::m_attention_fused_ops_parts (default value is 2 but can be customized).
This simple trick reduces the overall memory consumption of the UNET model from 1.1GB to 300MB when running at FP32 precision. A more efficient alternative would be FlashAttention, but FlashAttention requires writing custom kernels for each supported architecture (AVX, NEON), bypassing XnnPack in the author’s case.
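To make the trick concrete, here is a minimal single-head C++ sketch of sliced scaled dot-product attention. It is illustrative only, not the OnnxStream kernel: Q, K, and V are plain row-major matrices, and the `parts` argument plays the role the text assigns to onnxstream::Model::m_attention_fused_ops_parts.

    // Computes softmax(Q K^T / sqrt(d)) V one horizontal slice of Q at a time,
    // so only a (chunk x n) score block exists in memory, never the full (n x n).
    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    using Matrix = std::vector<std::vector<float>>; // row-major helper type

    Matrix sliced_attention(const Matrix& Q, const Matrix& K, const Matrix& V,
                            std::size_t parts) {
        const std::size_t n = Q.size(), d = Q[0].size(), dv = V[0].size();
        const float scale = 1.0f / std::sqrt(static_cast<float>(d));
        const std::size_t chunk = (n + parts - 1) / parts;
        Matrix out(n, std::vector<float>(dv, 0.0f));

        for (std::size_t start = 0; start < n; start += chunk) {
            const std::size_t rows = std::min(chunk, n - start);
            // Score block for this slice only: rows x n instead of n x n.
            Matrix scores(rows, std::vector<float>(K.size(), 0.0f));
            for (std::size_t i = 0; i < rows; ++i) {
                float max_s = -1e30f;
                for (std::size_t j = 0; j < K.size(); ++j) {
                    float s = 0.0f;
                    for (std::size_t k = 0; k < d; ++k) s += Q[start + i][k] * K[j][k];
                    scores[i][j] = s * scale;
                    max_s = std::max(max_s, scores[i][j]);
                }
                float sum = 0.0f;
                for (float& s : scores[i]) { s = std::exp(s - max_s); sum += s; }
                for (std::size_t j = 0; j < K.size(); ++j)
                    for (std::size_t k = 0; k < dv; ++k)
                        out[start + i][k] += (scores[i][j] / sum) * V[j][k];
            }
        }
        return out;
    }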
For more information, see the project’s GitHub page.
Reference links:
https://www.reddit.com/r/MachineLearning/comments/152ago3/p_onnxstream_running_stable_diffusion_in_260mb_of/
https://github.com/vitoplantamura/OnnxStream
