Author: Hu Po
Introduction
Recurrent Neural Networks (RNNs) can retain memory and learn from sequences of data. Because of their recurrent structure, it is difficult to parallelize all of their computations on conventional hardware: current CPUs offer little large-scale parallelism, while GPUs provide only limited parallelism because of the sequential parts of RNN models. To address this, researchers from Purdue University proposed a hardware implementation of LSTM on the Zynq 7020 FPGA, building a two-layer RNN with 128 hidden units and testing it with a character-level language model. The implementation is 21 times faster than the same model running on the ARM Cortex-A9 CPU embedded in the Zynq 7020.
LSTM is a special kind of RNN that is well suited to processing and predicting events separated by long intervals and delays in a time series. A standard RNN can retain and use information from the recent past but cannot learn long-term dependencies, and it is hard to train on long sequences because of vanishing and exploding gradients. To solve these problems, LSTM adds memory-control gates that decide when to remember, forget, and output. The structure of an LSTM unit is shown in Figure 1, where ⊙ denotes element-wise multiplication.
Figure 1
Mathematically, the unit in Figure 1 is described by the equations shown in Figure 2, reproduced here:

i_t = σ(W_xi x_t + W_hi h_(t-1) + b_i)    (1)
f_t = σ(W_xf x_t + W_hf h_(t-1) + b_f)    (2)
o_t = σ(W_xo x_t + W_ho h_(t-1) + b_o)    (3)
c̃_t = tanh(W_xc x_t + W_hc h_(t-1) + b_c)    (4)
c_t = f_t ⊙ c_(t-1) + i_t ⊙ c̃_t    (5)
h_t = o_t ⊙ tanh(c_t)    (6)

Here σ is the Sigmoid function, x_t is the layer's input vector, the W and b terms are the model parameters, c_t is the activation of the memory cell, c̃_t is the candidate memory cell, and h_t is the layer's output vector. The subscript t−1 denotes the previous time step, and the subscripts i, f, and o refer to the input, forget, and output gates. These gates determine when to remember or forget an input in the sequence and when to produce an output. The model has to be trained to obtain the desired parameters: training is an iterative process in which training data is fed in, the resulting output is compared with the target, and the weights are updated with the backpropagation (BP) algorithm. With more layers and different functions, the model can become quite complex. In an LSTM, each module has four gates and some element-wise operations; a deep LSTM network cascades multiple LSTM modules, so that the output of one layer becomes the input of the next.
Figure 2
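To make the data flow concrete, here is a minimal floating-point C sketch of one LSTM step following equations (1)-(6). It is only a reference illustration: the sizes, function names (gate_preact, lstm_step), and data layout are assumptions and are not taken from the paper's implementation.

```c
#include <math.h>

/* Illustrative sizes: one layer with N inputs and H hidden units. */
#define N 128
#define H 128

/* Gate pre-activation W*x + U*h_prev + b. In the hardware this corresponds
 * to two MAC passes (one over x, one over h_prev) whose results are summed. */
static void gate_preact(const float W[H][N], const float U[H][H],
                        const float b[H], const float x[N],
                        const float h_prev[H], float y[H]) {
    for (int r = 0; r < H; r++) {
        float acc = b[r];
        for (int c = 0; c < N; c++) acc += W[r][c] * x[c];
        for (int c = 0; c < H; c++) acc += U[r][c] * h_prev[c];
        y[r] = acc;
    }
}

static float sigmoidf(float v) { return 1.0f / (1.0f + expf(-v)); }

/* One LSTM time step, equations (1)-(6). c and h are updated in place. */
void lstm_step(const float Wi[H][N], const float Ui[H][H], const float bi[H],
               const float Wf[H][N], const float Uf[H][H], const float bf[H],
               const float Wo[H][N], const float Uo[H][H], const float bo[H],
               const float Wc[H][N], const float Uc[H][H], const float bc[H],
               const float x[N], float c[H], float h[H]) {
    float i[H], f[H], o[H], g[H];
    gate_preact(Wi, Ui, bi, x, h, i);   /* (1) input gate      */
    gate_preact(Wf, Uf, bf, x, h, f);   /* (2) forget gate     */
    gate_preact(Wo, Uo, bo, x, h, o);   /* (3) output gate     */
    gate_preact(Wc, Uc, bc, x, h, g);   /* (4) candidate cell  */
    for (int k = 0; k < H; k++) {
        i[k] = sigmoidf(i[k]); f[k] = sigmoidf(f[k]);
        o[k] = sigmoidf(o[k]); g[k] = tanhf(g[k]);
        c[k] = f[k] * c[k] + i[k] * g[k];   /* (5) element-wise cell update */
        h[k] = o[k] * tanhf(c[k]);          /* (6) element-wise output      */
    }
}
```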
Now that we understand how an LSTM works, how do we implement it on an FPGA? Let's look at the implementation.
1) Hardware
The main operations in the hardware implementation are matrix-vector multiplication and nonlinear functions.
Matrix-vector multiplication is computed by MAC (multiply-accumulate) units, which consume two streams: the input vector stream and the row stream of the weight matrix. The same vector stream is multiplied and accumulated with each row of the weight matrix, producing an output vector whose dimension equals the height of the weight matrix. After each output element is computed, the MAC is reset so that it does not accumulate values from the previous matrix row. The bias b can be folded into the multiply-accumulate by appending the bias vector as the last column of the weight matrix and appending an extra unit value to the input vector. This way, no additional input port is needed for the bias and no extra pre-configuration step is required for the MAC units. The results of the MAC units are summed, and the adder's output is passed element-wise through a nonlinear function, which is implemented with linear mappings.
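As a software analogy of this bias-folding trick, here is a short C sketch; the names and memory layout are illustrative, not taken from the paper's code.

```c
/* Each weight row stores its bias as the last element, and the input vector
 * carries a trailing 1.0, so a single multiply-accumulate pass computes
 * W*x + b. The MAC accumulator is reset for every output element. */
void matvec_bias_folded(const float *w_rows, /* H rows of (N+1) values   */
                        const float *x_aug,  /* N inputs followed by 1.0 */
                        float *y, int H, int N) {
    for (int r = 0; r < H; r++) {
        float acc = 0.0f;                      /* MAC reset per output element */
        const float *row = &w_rows[r * (N + 1)];
        for (int c = 0; c < N + 1; c++)
            acc += row[c] * x_aug[c];          /* last term adds the bias */
        y[r] = acc;
    }
}
```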
The nonlinear function is approximated by linear segments y = ax + b, each valid over a specific range of x. During the configuration phase, the values of a and b and the x range of each segment are stored in configuration registers. Each linear segment is implemented with a MAC unit and a comparator: the comparator checks whether the input falls within the segment's range, and the segment either processes the input or passes it on to the next segment module. The nonlinear function is divided into 13 segments, so the nonlinear module contains 13 pipelined segment modules. The main building blocks of the design are the gate modules shown in Figure 3.
Figure 3
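A small C sketch of the piecewise-linear evaluation described above; the segment structure and the saturation behavior outside the configured range are assumptions for illustration, not details from the paper.

```c
/* Each segment holds a range [lo, hi) and coefficients (a, b) loaded at
 * configuration time; the first segment whose range contains x computes
 * y = a*x + b, otherwise x is passed on to the next segment. */
typedef struct { float lo, hi, a, b; } Segment;

float pwl_eval(const Segment seg[], int nseg, float x) {
    for (int s = 0; s < nseg; s++) {
        if (x >= seg[s].lo && x < seg[s].hi)
            return seg[s].a * x + seg[s].b;   /* this segment handles x */
        /* otherwise x is passed on to the next segment module */
    }
    /* outside the configured range, clamp to the nearest end point
     * (e.g. a sigmoid approximation saturates toward 0 or 1) */
    return (x < seg[0].lo) ? seg[0].a * seg[0].lo + seg[0].b
                           : seg[nseg - 1].a * seg[nseg - 1].hi + seg[nseg - 1].b;
}
```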
The implemented module uses Direct Memory Access (DMA) ports for data input and output. Because the DMA ports are independent, the input streams are not guaranteed to be synchronized even if the modules activate all ports at the same time, so a stream-synchronization block is needed. This block buffers incoming stream data until all ports are streaming; once the last port starts transmitting, it begins outputting the synchronized streams. This guarantees that vector elements and matrix-row elements arrive at the MAC units aligned. The gate module in Figure 3 also contains a re-partition block that splits each 32-bit stream value into two 16-bit values. The MAC units multiply 16-bit operands, producing 32-bit products, and the subsequent additions are performed on 32-bit values to maintain precision.
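The following C sketch mirrors the re-partition and precision handling just described. The fixed-point Q format (number of fractional bits) and the ordering of the two halves inside a 32-bit word are assumptions for illustration; the article does not state them.

```c
#include <stdint.h>

#define FRAC_BITS 8   /* assumed fractional bits of the 16-bit fixed-point format */

/* Split one 32-bit DMA word into two 16-bit operands. */
void split32(uint32_t word, int16_t *lo, int16_t *hi) {
    *lo = (int16_t)(word & 0xFFFFu);   /* lower half (ordering is an assumption)  */
    *hi = (int16_t)(word >> 16);       /* upper half                              */
}

/* Multiply-accumulate two 16-bit fixed-point streams: 16x16-bit multiplies
 * give 32-bit products, accumulation stays in 32 bits, then the sum is
 * rescaled back to the 16-bit Q format. */
int16_t mac16(const int16_t *a, const int16_t *b, int n) {
    int32_t acc = 0;                                  /* 32-bit accumulation      */
    for (int k = 0; k < n; k++)
        acc += (int32_t)a[k] * (int32_t)b[k];         /* 16x16 -> 32-bit product  */
    return (int16_t)(acc >> FRAC_BITS);               /* rescale; a real design
                                                         would guard overflow     */
}
```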
Equations 1, 2, 3, and 4 in Figure 2 can be implemented with the modules described above; what remains is the element-wise arithmetic of equations 5 and 6. For this, the design introduces a module containing additional multipliers and adders, shown in Figure 4.
Figure 4
The final LSTM implementation is shown in Figure 5. It uses three modules from Figure 3 and one from Figure 4. Each gate is pre-configured with its nonlinear function (tanh or Sigmoid), and the internal modules are sequenced by a state machine that performs the series of operations. The design uses four 32-bit DMA ports; since the arithmetic is done in 16 bits, each DMA port can carry two 16-bit streams. Weights and input vectors are laid out in main memory to take advantage of this, and the streams are routed to the different modules depending on the operation to be performed.
Figure 5
2) Driver Software
The control and test software is written in C. It places the weight values and input vectors in main memory and drives the hardware modules through a set of configuration registers. Each row of the weight matrix ends with its corresponding bias value, and the input vector carries an extra unit element, so the matrix-vector multiplication picks up the bias simply by multiplying the last element of the matrix row by one. Zero padding is used to match the lengths of the matrix rows and vectors, which simplifies stream synchronization.
Because of the recurrent nature of LSTM, the values of c and h are overwritten on every iteration, which minimizes the number of memory copies the CPU has to perform. For a multi-layer LSTM, the output of the previous layer is copied to the input location of the next layer so that it is preserved for the next layer's computation. In addition, the control software switches between the weight sets of the different layers by writing their memory locations into the control registers. A sketch of this memory layout follows.
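The C sketch below illustrates the driver-side layout just described: bias appended to each weight row, a trailing 1.0 on the input vector, and zero padding to a common stream length. The function names are illustrative and float is used instead of the hardware's 16-bit fixed point for clarity; none of this is taken from the paper's driver code.

```c
#include <string.h>

/* Copy one weight row (n values) plus its bias into main memory, padding the
 * remainder of the stream slot with zeros so all streams have equal length. */
void pack_row(float *dst, const float *w_row, float bias, int n, int padded) {
    memcpy(dst, w_row, n * sizeof(float));
    dst[n] = bias;                                        /* bias as last column */
    memset(dst + n + 1, 0, (padded - n - 1) * sizeof(float));
}

/* Build the augmented input vector: n inputs, a trailing 1.0 that multiplies
 * the bias column, then zero padding to the same length as the weight rows. */
void pack_input(float *dst, const float *x, int n, int padded) {
    memcpy(dst, x, n * sizeof(float));
    dst[n] = 1.0f;                                        /* multiplies the bias */
    memset(dst + n + 1, 0, (padded - n - 1) * sizeof(float));
}
```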
Experiments and Results
The experiment implements a character-level language model that predicts the next character given the previous character. Character by character, the model generates text that resembles its training data, which can be a book or any text corpus of roughly 2 MB or more. Here the model was trained on a portion of Shakespeare's works. The implemented model is a two-layer LSTM with 128 hidden units per layer.
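For intuition, here is a hypothetical C sketch of the generation loop of such a character-level model: the previous character is one-hot encoded, pushed through the two LSTM layers, and the next character is chosen from the resulting scores. The output projection, vocabulary size, and all callback names are assumptions for illustration; the article does not describe that part of the model.

```c
#define VOCAB  65    /* assumed character vocabulary size */
#define HIDDEN 128   /* hidden size given in the article  */

typedef void (*layer_fn)(const float *in, float *c, float *h);   /* one LSTM layer step   */
typedef void (*proj_fn)(const float *h, float *scores);          /* hidden -> char scores */

int next_char(int prev, layer_fn layer1, layer_fn layer2, proj_fn project,
              float c1[HIDDEN], float h1[HIDDEN],
              float c2[HIDDEN], float h2[HIDDEN]) {
    float x[VOCAB] = {0}, scores[VOCAB];
    x[prev] = 1.0f;                 /* one-hot encode the previous character  */
    layer1(x, c1, h1);              /* layer 1 updates its own c/h            */
    layer2(h1, c2, h2);             /* layer 2 consumes layer 1's output      */
    project(h2, scores);            /* per-character scores                   */
    int best = 0;                   /* greedy choice; sampling is also common */
    for (int k = 1; k < VOCAB; k++)
        if (scores[k] > scores[best]) best = k;
    return best;
}
```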
The design was implemented on a ZedBoard, which carries a Zynq-7000 SoC (XC7Z020) with a dual-core ARM Cortex-A9 MPCore. The C-code LSTM used for comparison runs on this ARM Cortex-A9 at a clock frequency of 667 MHz, while the FPGA implementation runs at 142 MHz. The total power consumption of the chip is 1.942 W, and the hardware utilization is shown in Table 1.
Table 1
Figure 6 shows the execution time of the feedforward LSTM character-level language model on different embedded platforms, with shorter time being better. Even at a clock frequency of 142 MHz, this implementation is still 21 times faster than the implementation running on the ARM Cortex-A9 CPU embedded in the Zynq 7020.
Figure 6
Figure 7 shows the performance per unit of power for the different embedded platforms (higher is better). The FPGA implementation's performance per watt far exceeds that of the other platforms, further demonstrating the advantage of the FPGA implementation.
Figure 7
Citation:
Recurrent Neural Networks Hardware Implementation on FPGA, Andre Xian Ming Chang, Berin Martini, Eugenio Culurciello