Efficient Bitcoin Mining System Using Zynq SoC

Author: Alexander Standridge

Master of Science Graduate

California State University, Northridge

Calvin Ho

Master of Science Graduate

California State University, Northridge

Shahnam Mirzaei

Assistant Professor

California State University, Northridge

Ramin Roosta

Professor

California State University, Northridge

Integrating programmable logic with a processor subsystem in a single device to develop an adaptable and cost-effective Bitcoin mining machine.

Bitcoin is a virtual currency that has gained popularity over the past few years. As a result, Bitcoin enthusiasts invest some of their assets to support the currency by purchasing or “mining” bitcoins. Mining is the process of using computer hardware to perform mathematical calculations for the Bitcoin network. Miners who provide this service can receive a reward (25 bitcoins at the time of writing) as well as any embedded transaction fees. Because the network reward is divided according to how much computation the miners have completed, competition among miners is exceptionally fierce.

Bitcoin mining began as a software process running on low-cost hardware such as CPUs and GPUs. However, as Bitcoin gained popularity, the mining process underwent a dramatic transformation. Early miners had to use high-power processors to achieve satisfactory “hash rates”, or mining speeds. Although mining with CPU/GPU was very inefficient, it was flexible enough to adapt to changes in the Bitcoin protocol. In recent years, mining activities have gradually shifted to dedicated or semi-dedicated ASIC hardware to optimize and achieve efficient “hash rates”. Switching to this hardware improved mining efficiency, but at the cost of reduced flexibility to adapt to changes in mining protocols.

ASICs are specialized hardware used to efficiently execute specific tasks in certain applications. Although ASIC Bitcoin miners are relatively low-cost and achieve excellent “hash rates”, they come at the cost of reduced flexibility, making it difficult to adapt to protocol changes.

Like ASICs, FPGAs can serve as efficient and relatively low-cost mining solutions. FPGAs are also more flexible than ASICs, which lets them adapt to changes in the Bitcoin protocol. The current challenge is to design a fully efficient and flexible mining system that can connect to the Bitcoin network without relying on a PC host or relay devices. Our team accomplished this using the Xilinx Zynq-7000 All Programmable SoC.

Overall Approach

To design a complete mining system consisting of a viable Bitcoin node and an efficient and flexible mining machine, we need a powerful FPGA chip that meets both flexibility and performance requirements. In addition to FPGAs, we also need to use processing engines to enhance efficiency.

On this complete System-on-Chip (SoC), we need optimized cores to run all required Bitcoin tasks, including network maintenance and transaction processing. The hardware that meets all these conditions is the Zynq-7020 SoC located on the ZedBoard development board. The ZedBoard is priced at about $300 to $400, which is quite affordable compared to similar products (see http://www.zedboard.org/buy).

The Zynq-7020 SoC chip integrates two ARM® Cortex™-A9 processors and 85,000 Artix®-7 FPGA logic cells. The ZedBoard development board also comes with 512MB DDR3 memory, allowing us to run SoC designs more quickly. Finally, the ZedBoard provides an SD card slot for mass storage, enabling us to store the entire updated Bitcoin program on the SD card.

We implemented our SoC Bitcoin miner using the ZedBoard. It consists of a host, a relay, a driver, and the miner. We used the non-graphical interface of the original Bitcoin client as the host to interact with the Bitcoin network. The relay uses the driver to pass work from the host to the miner.

In-depth Mining Kernel

We began developing the mining kernel in Vivado® HLS (High-Level Synthesis), working from the SHA-256 specification published by the U.S. Department of Commerce. Thanks to the rapid behavioral testing in Vivado HLS, we quickly laid out several prototype mining kernels, ranging from simple single-process systems to complex multi-process systems. Before exploring the structure of the mining kernel in detail, however, it helps to understand the basics of the SHA-256 process.

SHA-256 generates a 32-byte hash value from 64 bytes of data through 64 rounds of shifting, addition, and XOR operations. During this computation, eight 4-byte registers store the results of each round. Once the process is complete, the eight registers are concatenated to produce the hash value. If the input data is shorter than 56 bytes, it is padded to 56 bytes (a single 1 bit followed by zeros), and the last 8 bytes of the 64-byte block store the length of the input data. More commonly, the input data does not fit in a single block.

In this case, the data is padded to the next multiple of 64 bytes, with the last 8 bytes reserved for the length of the data. The SHA-256 process then runs on each 64-byte block in sequence, using the output of the previous block (also known as the intermediate state) as the starting point for the next.
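The padding rule can be made concrete with a short sketch. It follows the rule in the SHA-256 specification (FIPS 180-4): one 0x80 marker byte, zero fill, then the message length in bits as an 8-byte big-endian number. The function name `sha256_pad` and the all-zero placeholder header are illustrative only and are not taken from the authors' design.

```python
def sha256_pad(message: bytes) -> bytes:
    """Pad a message to a multiple of 64 bytes as SHA-256 requires."""
    bit_length = len(message) * 8
    padded = message + b"\x80"                      # a single 1 bit, then seven 0 bits
    padded += b"\x00" * ((56 - len(padded)) % 64)   # zero-fill up to 56 bytes mod 64
    padded += bit_length.to_bytes(8, "big")         # 64-bit big-endian length in bits
    return padded


header = bytes(80)                  # placeholder 80-byte block header
blocks = sha256_pad(header)
print(len(blocks) // 64)            # 2 blocks of 64 bytes
print(len(blocks) - len(header))    # 48 bytes of padding, including the length
```

For an 80-byte header this gives two 64-byte blocks, the second holding the last 16 header bytes plus 48 bytes of padding.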

In Vivado HLS, the SHA-256 process is a simple “for” loop design: one array holds the constants required for each iteration, another array holds temporary values used in subsequent iterations, eight variables hold the results of each iteration, and the logical operations are defined alongside them, as shown in Figure 2.
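The loop structure described here can be sketched in software. The following is a minimal Python model of the standard SHA-256 compression function, not the authors' Vivado HLS source: `K` is the constants array, `w` holds the temporary values, and `a` through `h` are the eight working variables. The helper names (`first_primes`, `icbrt`, `compress`) are illustrative, and the constants are derived from prime roots as the specification defines them rather than typed in by hand.

```python
import hashlib
import math

MASK = 0xFFFFFFFF


def first_primes(count):
    """Return the first `count` primes by trial division."""
    primes, candidate = [], 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def icbrt(n):
    """Integer cube root: the largest r with r**3 <= n."""
    lo, hi = 0, 1 << 40
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


PRIMES = first_primes(64)
# Round constants: first 32 fractional bits of the cube roots of the primes.
K = [icbrt(p << 96) & MASK for p in PRIMES]
# Initial value: first 32 fractional bits of the square roots of 8 primes.
H0 = [math.isqrt(p << 64) & MASK for p in PRIMES[:8]]


def rotr(x, n):
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK


def compress(state, block):
    """Run the 64 rounds on one 64-byte block; return eight 32-bit words."""
    # Temporary values: 16 words from the block, extended to 64.
    w = [int.from_bytes(block[i:i + 4], "big") for i in range(0, 64, 4)]
    for t in range(16, 64):
        s0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)

    a, b, c, d, e, f, g, h = state
    for t in range(64):
        big_s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + big_s1 + ch + K[t] + w[t]) & MASK
        big_s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (big_s0 + maj) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g

    # Add the working variables back into the incoming state.
    return [(x + y) & MASK for x, y in zip(state, (a, b, c, d, e, f, g, h))]


# Check one hand-padded block against Python's built-in SHA-256.
message = b"abc"
block = message + b"\x80" + bytes(52) + (len(message) * 8).to_bytes(8, "big")
digest = b"".join(word.to_bytes(4, "big") for word in compress(H0, block))
print(digest == hashlib.sha256(message).digest())
```

The signature of `compress` mirrors a 32-byte initial value, a 64-byte data block, and a 32-byte result, so the same routine can be chained across blocks.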

In simple terms, Bitcoin mining is a SHA-256 process combined with a comparator. The SHA-256 process hashes the block header information twice, and the result is then compared with the extended target difficulty of the Bitcoin network, as shown in Figure 3A. Implementing this simple concept in hardware is more involved because the initial Bitcoin block header is 80 bytes long. The first pass therefore has to run the SHA-256 process twice (once per 64-byte block), while the second pass only has to run it once, as shown in Figure 3B. Defining two separate SHA-256 modules for the two passes complicates matters, because we want the SHA-256 module to be as generic as possible to save development time and allow reuse. This requirement determines the inputs and outputs of the module: a 32-byte initial value, a 64-byte data block, and a 32-byte hash value. With the SHA-256 process isolated in this way, we are free to lay out these inputs as needed on the kernel side.
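As a rough software analogue of the double-hash data path, the sketch below uses Python's standard `hashlib` rather than the hardware modules. The helper `blocks_needed` is a hypothetical name introduced here to show why the first pass needs two 64-byte blocks and the second pass only one.

```python
import hashlib


def double_sha256(header: bytes) -> bytes:
    """Hash a block header twice, as the mining process does."""
    first = hashlib.sha256(header).digest()    # 80 bytes in: two 64-byte blocks
    return hashlib.sha256(first).digest()      # 32 bytes in: one 64-byte block


def blocks_needed(length: int) -> int:
    """Number of 64-byte blocks SHA-256 processes for `length` bytes."""
    # data + one 0x80 marker byte + 8 length bytes, rounded up to 64
    return (length + 1 + 8 + 63) // 64


header = bytes(80)   # placeholder header, not real network data
print(blocks_needed(len(header)), blocks_needed(32))   # 2 1
print(double_sha256(header).hex())
```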

Among the three prototypes we developed, the first prototype used a single SHA-256 process module. From the beginning, we knew this kernel would be the slowest of the three since it did not use pipelining. However, we wanted to see how small the minimum SHA-256 process could be.

The second prototype placed three separate SHA-256 process modules in series, as shown in Figure 3C. This configuration allows static inputs: the first process handles the first 64 bytes of the header, the second handles the remaining 16 bytes of the header along with the 48 bytes of padding, and the third handles the hash value produced by the first two. Keeping the three processes separate simplifies the control logic and makes simple pipelining easier.

The third and final prototype used two SHA-256 process modules and took full advantage of a quirk in the Bitcoin header information and mining process that is well known in the Bitcoin mining community. The header is laid out so that its first 64 bytes change only when the block itself is modified. This means that once the first 64 bytes of the header are processed, the output (the intermediate state) can be saved, and only the last 16 bytes of the header need to be processed for each attempt, which significantly raises the hash rate (see Figure 3D).
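The saving from reusing the intermediate state can be illustrated in software. This sketch swaps in `hashlib`'s `copy()` method for the hardware's stored state: the hash object keeps the state after the first 64 bytes, and each attempt only feeds the last 16 bytes. The function names, the placeholder header, and the assumption that the 4-byte value being searched sits little-endian at the end of the header (as in the standard Bitcoin header layout) are illustrative, not taken from the authors' kernel.

```python
import hashlib
import struct


def make_midstate(header_prefix: bytes):
    """Process the first 64 header bytes once and keep the hash state."""
    if len(header_prefix) != 64:
        raise ValueError("expected the first 64 bytes of the header")
    return hashlib.sha256(header_prefix)


def hash_with_nonce(midstate, tail12: bytes, nonce: int) -> bytes:
    """Finish the double hash for one nonce from the saved state."""
    h = midstate.copy()                            # reuse, do not recompute
    h.update(tail12 + struct.pack("<I", nonce))    # last 16 bytes of the header
    return hashlib.sha256(h.digest()).digest()


header = bytes(range(80))                # placeholder header contents
midstate = make_midstate(header[:64])
for nonce in range(3):
    print(nonce, hash_with_nonce(midstate, header[64:76], nonce).hex())
```

Each call hashes only one 64-byte block in the first pass instead of two, which is where the speedup comes from.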

The comparator is a major component that, when properly designed, can significantly improve performance. A solution is a hash value that is numerically less than the extended target difficulty defined by the Bitcoin network, so every generated hash value must be compared to see whether it is a solution. Because of a quirk in the extended target difficulty, its first four bytes are always zero regardless of the difficulty, and the number of leading zeros grows as the network difficulty increases. At the time of writing, the first 13 bytes of the extended target difficulty were all zeros. Therefore, rather than immediately comparing the entire hash value with the target difficulty, we first check whether the leading bytes of the hash are zero. If they are, we compare the hash value with the target difficulty. If they are not, we discard the hash value and start over.
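A two-stage comparator of this kind might look like the following in software. This is a sketch under stated assumptions: the digest is read as a little-endian number (the Bitcoin convention), so the "leading" zero bytes sit at the end of the digest; `EASY_TARGET` is a made-up, deliberately easy target so the demo finishes quickly; and all function names are hypothetical.

```python
import hashlib
import struct


def leading_zero_bytes(target: int) -> int:
    """Count the zero bytes at the top of a 256-bit target."""
    return 32 - (target.bit_length() + 7) // 8


def meets_target(digest: bytes, target: int) -> bool:
    """Return True if a double-SHA-256 digest is a solution for `target`."""
    zeros = leading_zero_bytes(target)
    # Cheap pre-check: the most significant bytes must all be zero.
    # Read little-endian, those bytes are the last ones in the digest.
    if any(digest[len(digest) - zeros:]):
        return False
    # Full comparison only for the few candidates that pass the pre-check.
    return int.from_bytes(digest, "little") < target


def double_sha256(data: bytes) -> bytes:
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


EASY_TARGET = 0xFFFF << 232     # made-up target with one leading zero byte
header76 = bytes(76)            # placeholder header fields, not real data
solution = next(
    nonce for nonce in range(100_000)
    if meets_target(double_sha256(header76 + struct.pack("<I", nonce)), EASY_TARGET)
)
print(solution)
```

Most candidates fail the pre-check, so the full 256-bit comparison runs only rarely.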

After reviewing the results of our prototypes, we ultimately selected a mining kernel developed specifically for Xilinx FPGAs by the Bitcoin open-source community.

ISE Development

Bitcoin mining is a race to find a value between 0 and 2^32 - 1 that can serve as a solution. Therefore, there are effectively only two ways to improve the performance of the mining kernel: speeding up processing or dividing and conquering. We tested a series of different frequencies, pipelining techniques, and parallelization methods using Spartan®-3 and Spartan-6 development boards.

We started frequency testing with the Spartan-3E development board. However, we quickly found that it was ineffective beyond 50MHz. Therefore, we switched to the Spartan-6 to complete the frequency testing.

On the Spartan-6 development board, we tested three different frequencies: 50, 100, and 150MHz, yielding predictable results of 0.8 MHps, 1.6 MHps, and 2.4 MHps, respectively. In the parallelization and pipelining tests, we also attempted 75MHz and 125MHz, but these two frequencies were merely compromises made to adapt the miner to the Spartan-6.

One of the reasons we chose the mining kernel from the open-source community is that it is designed so that a depth variable controls the depth of the pipeline. Depth is an exponent between 0 and 6 that controls the number of executions of the VHDL “generate” statement in powers of 2, from 2^0 to 2^6. The depth used in the initial frequency tests was 0, meaning each process executes only one round.

On the Spartan-3E, we tested depths 0 and 1 before timing constraint issues arose. At 50MHz, depth 0 achieved approximately 0.8MHps, while depth 1 achieved approximately 1.6MHps.

On the Spartan-6, we reached a depth of 3 before timing constraint issues arose. Running at 50MHz, depths 0 and 1 gave results similar to those obtained on the Spartan-3E. At this point, we noticed an interesting trend: doubling the frequency had the same effect as increasing the pipeline depth by one. The upper limit was determined by the available routing resources, with the peak averaging around 3.8MHps, as shown in Figure 4.
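The reported figures are consistent with a simple throughput model, sketched below. The model is an inference from the numbers in the article, not something the authors state: it assumes one hash needs 64 rounds and that 2^depth rounds complete per clock cycle. The function name `estimated_hash_rate` is illustrative.

```python
ROUNDS = 64  # SHA-256 rounds per 64-byte block


def estimated_hash_rate(clock_hz: float, depth: int) -> float:
    """Hashes per second, assuming 2**depth rounds finish per clock cycle."""
    if not 0 <= depth <= 6:
        raise ValueError("depth must be between 0 and 6")
    cycles_per_hash = ROUNDS / (2 ** depth)
    return clock_hz / cycles_per_hash


for clock_mhz, depth in [(50, 0), (50, 1), (100, 0), (150, 0)]:
    rate = estimated_hash_rate(clock_mhz * 1e6, depth)
    print(f"{clock_mhz} MHz, depth {depth}: {rate / 1e6:.2f} MHps")
```

Under these assumptions the model gives 0.78, 1.56, 1.56, and 2.34 MHps, close to the measured 0.8, 1.6, 1.6, and 2.4 MHps, and it shows why doubling the clock and adding one level of depth are equivalent.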

The SHA-256 process module needs to complete 10 different 32-bit additions. In this design, we attempted to complete all additions within a single clock cycle. To shorten the longest path, we tried to break the adder chain into a series of stages. However, doing so would significantly alter the control logic of the entire kernel, requiring a complete rewrite. We abandoned these modifications to save time and effort.

The final performance improvement we tested was parallelization. With minor modifications, we could double or quadruple the number of SHA-256 groups, where each group consists of two SHA-256 process modules. This effectively cuts the range each group has to search in half for doubled groups, or to a quarter for quadrupled groups. To fit the additional SHA-256 processes on the Spartan-6, we had to reduce the system frequency: four groups of SHA-256 process modules ran at 75MHz, while two groups ran at 125MHz. The improvement in hash rate was difficult to record. We could easily observe the hash rate of a single SHA-256 group, but with multiple groups the mining kernel found solutions faster than that hash rate alone would imply.
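One way to read the divide-and-conquer approach is that each SHA-256 group searches its own share of the 2^32 candidate values. The article does not say how the split is made; the contiguous, equal-sized ranges below are an assumption, and `split_nonce_space` is a hypothetical helper.

```python
NONCE_SPACE = 1 << 32   # candidate values 0 .. 2**32 - 1


def split_nonce_space(groups: int):
    """Divide the candidate values into contiguous ranges, one per group."""
    if groups < 1:
        raise ValueError("need at least one group")
    size = NONCE_SPACE // groups
    ranges = []
    for i in range(groups):
        # The last group absorbs any remainder so the whole space is covered.
        end = NONCE_SPACE if i == groups - 1 else (i + 1) * size
        ranges.append(range(i * size, end))
    return ranges


for groups in (1, 2, 4):
    parts = split_nonce_space(groups)
    print(groups, [(hex(r.start), hex(r.stop - 1)) for r in parts])
```

With two groups each range covers half of the space, and with four groups a quarter, so a solution anywhere in the space is reached sooner even at a lower clock frequency.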

Using EDK

After testing the FPGA, the next step was to connect the mining kernel to the AXI4 bus of the Zynq SoC. The Xilinx Embedded Development Kit (EDK) comes pre-loaded with configuration utilities designed specifically for the Zynq SoC, making it easy for us to configure every aspect. By default, the system enables 512MB DDR3, Ethernet, USB, and SD interfaces, which are all the configurations required for the Bitcoin SoC.

The two Cortex-A9 processors use the AXI4 interface instead of the PLB system used in previous soft-core systems. All peripherals are connected to the processor via the AXI4 interface, as shown in Figure 5.

The EDK custom peripheral wizard provides stub code for different variants of the AXI4 interface and serves as the development foundation for the AXI interface of the mining kernel. For simplicity, we used the AXI4-Lite interface to provide basic read and write functionalities for the mining kernel. Ideally, developers would want to use the standard AXI4 interface to leverage the advanced interface control features such as data bursts.

To keep things simple, we used three memory-mapped registers to handle the I/O of the mining kernel. The first register feeds the host data packets to the miner. This register keeps track of the amount of data passed through it and locks itself after the 11th byte has been sent. The lock is released automatically once a solution is found. If necessary, it can also be released manually through the status register.

We used the second register as a status register, with specific bits representing the different states of the mining kernel during operation. Given the simplicity of our design, we used only three flags: a loading flag, a running flag, and a solution-found flag. The loading flag is set when the mining kernel has received the 11 bytes mentioned earlier and is cleared when we write to the status register. The running flag is set when the mining kernel starts and is cleared by the mining kernel when a solution is found. The final register is the output register, which holds the solution that was found.
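The behavior of the three registers can be modeled in a few lines. This is a behavioral sketch only: the bit positions, the class and method names, and the choice to start the running flag as soon as loading completes are all assumptions, since the article names the flags but does not give the register layout.

```python
LOADED, RUNNING, FOUND = 0x1, 0x2, 0x4   # hypothetical bit positions
LOAD_COUNT = 11                           # input locks after the 11th unit of data


class MinerRegisters:
    """Software model of the input, status, and output registers."""

    def __init__(self):
        self.status = 0
        self.output = 0
        self._work = []

    def write_input(self, value: int) -> bool:
        """Register 1: accept work data until the register locks."""
        if self.status & LOADED:
            return False                      # locked: the write is ignored
        self._work.append(value)
        if len(self._work) == LOAD_COUNT:
            # Assumed: mining starts as soon as the work is fully loaded.
            self.status = (self.status | LOADED | RUNNING) & ~FOUND
        return True

    def write_status(self) -> None:
        """Register 2: a write clears the loading flag (manual unlock)."""
        self.status &= ~LOADED
        self._work.clear()

    def solution_found(self, nonce: int) -> None:
        """Simulated kernel event: publish a solution and unlock the input."""
        self.output = nonce                   # Register 3: holds the solution
        self.status = (self.status | FOUND) & ~(RUNNING | LOADED)
        self._work.clear()


regs = MinerRegisters()
for unit in range(LOAD_COUNT):
    regs.write_input(unit)
print(bin(regs.status))          # loading and running flags set
regs.solution_found(42)
print(bin(regs.status), regs.output)
```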

Before adding this new component, we tested each stage of the AXI4-Lite interface development. We also tested the AXI4-Lite interface using a bare-metal system in the Xilinx Software Development Kit. The tests served two purposes: first, to confirm that the AXI4-Lite interface of the mining kernel was functioning correctly; second, to ensure that the mining kernel received data in the correct byte order.

Embedded LINUX

After completing the connection between the mining kernel and the processor, we turned to software development. Ideally, we would build our own firmware around the Linux kernel for maximum performance. To simplify development, however, we installed a full Linux distribution: Xillinux, an Ubuntu derivative developed by Xillybus for rapid embedded system development.

Our first priority was to compile Bitcoind for the Cortex-A9 architecture. We used the original open-source Bitcoin software to ensure compatibility with the Bitcoin network.

Testing Bitcoind is straightforward. We first run the daemon and wait for a substantial part of the block chain to download, then use command-line instructions to start the CPU mining software built into Bitcoind.

A Linux driver must implement a few specific functions and register them with the Linux kernel in order to operate. When the driver is loaded, its initialization function runs and prepares the system for interaction with the hardware. In this function, we first confirm that the memory address of the mining kernel is available, and then we reserve that address. Unlike the bare-metal system, Linux uses virtual memory addressing, so before we can use the reserved address we must request that it be remapped. The Linux kernel returns a virtual address through which we subsequently communicate with the mining kernel registers.

Now that the hardware is accessible, the initialization function runs a simple test on the mining kernel to confirm that it is working properly. If it is, we register the driver's major and minor device numbers so that user programs can identify the device. Every function has a counterpart, and the exit function is the counterpart of the initialization function. In it we undo everything done during initialization: we release the major and minor device numbers used by the driver and then release the mapped virtual address.

When the relay program first accesses the mining kernel, the open function runs. All we do there is confirm that the self-test in the initialization function succeeded. If the self-test failed, we report a system error and exit with failure. When the relay program releases the device, the close function is called. Because the open function only checks the self-test result, there is nothing to do in the close function. The read function checks the data buffer to see which register the user is reading from, then retrieves the data from the mining kernel and returns it. The write function determines which register the user is writing to and passes the data to the mining kernel.

The final component of our system is a small relay program that passes work from Bitcoind to the mining kernel through the driver and returns the results. Naturally, the relay program has to check that Bitcoind is running, that the mining kernel is ready, and that it is functioning correctly. Because the relay program sits idle most of the time, we plan to add a statistics subsystem that gathers mining data and organizes it into a log file. Ideally, we will also provide a web-hosted interface for configuration, statistics, and device status.

Complete and Efficient Mining System

We developed an efficient and complete Bitcoin mining system using the Xilinx Zynq-7000 All Programmable SoC on the ZedBoard development board. This system can flexibly adapt to changes in the Bitcoin protocol while providing a high-performance FPGA solution with SoC functionality. To improve our current design without sacrificing adaptability, we are preparing to add more mining kernels connected to the main Bitcoin node. Adding more mining kernels can significantly enhance performance and raise the overall hash rate.

Another way to further optimize this design is to use dedicated firmware. Our current design runs on Ubuntu Linux 12.04, which has many unnecessary processes, such as SSH, running alongside the Bitcoin programs. Running these processes while the Bitcoin program is active wastes the resources of the development board. In future versions, we will eliminate these processes and run only our own firmware tailored to Bitcoin tasks.
