For some time now, I have been contemplating setting up an environment to implement neural-network-based algorithms on a smaller (8-pin) microcontroller. After reviewing existing solutions, I found that none truly satisfied me. An obvious problem is that flexibility usually comes at the cost of overhead. As usual, for a truly optimized solution, you have to roll your own.
GitHub link:
https://github.com/cpldcpu/BitNetMCU
It is always easier to work towards a clear goal: I chose the CH32V003 as my target platform. This is currently the smallest RISC-V microcontroller on the market, priced at $0.10. It has 2 KB of SRAM and 16 KB of flash memory. It is somewhat unusual in that it implements the RV32EC instruction set architecture, which does not even support multiplication. In other words, in many ways the capabilities of this controller are below those of an Arduino UNO.
As a test case, I chose the famous MNIST dataset, which consists of images of handwritten digits that have to be classified as 0 to 9. There are, for instance, some inspiring MNIST implementations on the Arduino; in one of them, the inference time is 7 seconds per digit and the accuracy is 82%.
The idea is to train the neural network on a PC and optimize it for inference on the CH32V003 while meeting the following conditions:
- As fast and accurate as possible
- Low SRAM usage during inference, suitable for 2 KB of SRAM
- Keep the neural network weights as small as possible
- No multiplication!
These criteria can be met by using neural networks with quantized weights, where each weight is represented with as few bits as possible. When the quantization is included in the training loop (quantization-aware training), the resulting models can come close to the accuracy of models trained with full-precision weights. There is currently some hype around binary and ternary weights for large language models, but the same methods can just as well be used to squeeze neural networks onto small microcontrollers.
The benefit of using only a few bits to represent each weight is the low memory footprint. It also means that no real multiplication instructions are needed: inference can be reduced to additions and shifts.
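As a side note on how this looks in memory, the inner loop shown further down consumes the weights as 32-bit words containing eight 4-bit codes, packed most significant nibble first. A minimal packing helper along those lines could look like the sketch below; the function name is hypothetical, and the actual export script in the repository may differ.

#include <stdint.h>

// Pack eight already-quantized 4-bit weight codes (low nibble of each byte)
// into one 32-bit word, most significant nibble first. Eight weights thus
// occupy only four bytes of flash.
static uint32_t pack_weight_chunk(const uint8_t codes[8])
{
    uint32_t chunk = 0;
    for (int i = 0; i < 8; i++) {
        chunk = (chunk << 4) | (codes[i] & 0xFu);
    }
    return chunk;
}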
Model Structure and Optimization
For simplicity, I decided to use an architecture based on fully connected layers instead of a convolutional neural network. The input images are downscaled to 16×16 = 256 pixels before being fed into the network.
The implementation of the inference engine is straightforward since it only uses fully connected layers. The following code snippet shows the inner loop, which implements multiplication of 4-bit weights using addition and shifting. Weights are encoded in two’s complement, which helps improve code efficiency. 1-bit, ternary, and 2-bit quantization are implemented similarly.
int32_t sum = 0;
for (uint32_t k = 0; k < n_input; k += 8) {
    uint32_t weightChunk = *weightidx++;
    for (uint32_t j = 0; j < 8; j++) {
        int32_t in = *activations_idx++;
        int32_t tmpsum = (weightChunk & 0x80000000) ? -in : in;
        sum += tmpsum;                                    // sign*in*1
        if (weightChunk & 0x40000000) sum += tmpsum << 3; // sign*in*8
        if (weightChunk & 0x20000000) sum += tmpsum << 2; // sign*in*4
        if (weightChunk & 0x10000000) sum += tmpsum << 1; // sign*in*2
        weightChunk <<= 4;
    }
}
output[i] = sum;
In addition to the fully connected layers, normalization and ReLU operators are required. I found that the more complex RMS normalization can be replaced by a simple shift operation during inference, so the entire inference path gets by without a full 32×32-bit multiplication. With such a simple inference structure, most of the effort has to go into the training side.
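I have not verified the exact normalization code in the repository, but a shift-based ReLU/normalization stage in this spirit could look like the sketch below; the function name, the 8-bit output width, and the max-based shift rule are my assumptions.

#include <stdint.h>

// Hypothetical ReLU + normalization: instead of an RMS norm with multiplications,
// scale the 32-bit accumulator outputs of a layer down to 8 bits with a single
// right shift derived from the largest activation in that layer.
static void relu_shift_norm(const int32_t *in, int8_t *out, uint32_t n)
{
    // Find the largest positive accumulator value.
    int32_t max = 0;
    for (uint32_t i = 0; i < n; i++) {
        if (in[i] > max) max = in[i];
    }

    // Choose a shift so that the maximum fits into the signed 8-bit output range.
    uint32_t shift = 0;
    while ((max >> shift) > 127) shift++;

    // Apply ReLU (negatives become zero) and the shared shift.
    for (uint32_t i = 0; i < n; i++) {
        int32_t v = in[i];
        out[i] = (v > 0) ? (int8_t)(v >> shift) : 0;
    }
}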
I explored network variants with different bit widths and sizes by varying the number of hidden activations. To my surprise, I found that the prediction accuracy scales with the total number of bits used to store the weights. For example, a network with 2-bit weights needs roughly twice as many weights to reach the same performance as one with 4-bit weights. The following graph shows the relationship between training loss and total bits. We can see that for 1 to 4 bits, more weights can essentially be traded for fewer bits per weight. For 8 bits and no quantization (fp32), this trade-off becomes less efficient.
I further optimized the training by using data augmentation, cosine scheduling, and more epochs. It seems that 4-bit weights provide the best trade-off.
With a model size of 12 KB, an accuracy of over 99% was achieved. Larger models can push the accuracy even higher, but this result is already far more accurate than other MCU implementations of MNIST.
Implementation on the Microcontroller
The model data is exported to a C header file that is included by the inference code. I used the excellent ch32v003fun environment, which allowed me to reduce the overhead enough to fit the 12 KB of weights plus the inference engine into just 16 KB of flash memory.
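For illustration, the generated header might look roughly like the sketch below; the macro and array names are placeholders rather than the identifiers actually produced by the export script.

// model_weights.h (hypothetical layout of the exported header)
#include <stdint.h>

#define MODEL_N_INPUT   256   // 16x16 downscaled input image
#define MODEL_N_HIDDEN   64   // placeholder layer width
#define MODEL_N_OUTPUT   10   // one output per digit

// Quantized weights, eight 4-bit codes packed per 32-bit word, kept in flash.
const uint32_t L1_weights[MODEL_N_INPUT * MODEL_N_HIDDEN / 8] = {
    0 /* packed weight words generated by the export tool */
};

const uint32_t L2_weights[MODEL_N_HIDDEN * MODEL_N_OUTPUT / 8] = {
    0 /* packed weight words generated by the export tool */
};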
There is still enough flash left to include four example images. The inference output is shown above. The execution time for one inference is 13.7 milliseconds, which is fast enough to process moving image inputs in real time.
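To turn the ten output activations into a predicted digit, the final step is simply an argmax over the last layer's outputs. A minimal sketch, with the function and variable names assumed rather than taken from the repository:

#include <stdint.h>

// Return the digit whose output activation is largest.
static uint32_t predict_digit(const int32_t output[10])
{
    uint32_t best = 0;
    for (uint32_t i = 1; i < 10; i++) {
        if (output[i] > output[best]) best = i;
    }
    return best;   // predicted digit, 0..9
}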
I also tested a smaller model with 4512 2-bit parameters that uses only about 1 KB of flash. Despite its small size, it still achieved a test accuracy of 94.22%, with an execution time of only 1.88 milliseconds.
Conclusion
This turned out to be a rather tedious project that involved hunting down plenty of missing bits and rounding errors. I am very happy with the results, as they show that, with some effort, neural networks can be optimized significantly. I learned a lot and plan to use the data pipeline for more interesting applications.