Welcome FPGA engineers to join the official WeChat technical group.

Introduction
As a software developer, you may have heard of FPGAs, perhaps even used one for computer-architecture verification in a college course project, but your first impression is probably that they are a hardware engineer's tool.
Currently, with the rise of artificial intelligence, GPUs have taken center stage thanks to deep learning and are busily running all kinds of workloads, from training to inference. FPGAs are also gradually entering data centers and playing to their strengths. So next, let's talk about how FPGAs let programmers develop in a friendlier way, without writing tedious RTL code and without simulation software such as VCS or ModelSim, making unit tests easy to write.
This shift in programming philosophy comes from FPGA support for OpenCL, which lets programmers target an FPGA simply by writing C/C++ with the appropriate pragmas. To get higher performance from an FPGA application implemented in OpenCL, you need to be familiar with the hardware introduced below. We will also introduce compilation options that help the toolchain convert your OpenCL application to RTL, map it onto the FPGA, and deploy it for execution.
FPGA Overview
An FPGA is a highly configurable integrated circuit. Unlike a CPU or GPU, it has no fixed bit widths for basic data types, so datapaths of essentially arbitrary precision can be built by configuring and combining its resources. This flexibility makes FPGAs particularly well suited to low-level operations such as bit masking, shifting, and addition, which map onto the fabric very naturally.
To achieve parallel computing, FPGA contains look-up tables (LUTs), registers, on-chip memory, and arithmetic operation cores (such as digital signal processing (DSP) blocks). These internal modules in FPGA are connected through a network, and through programming, the connections can be configured to achieve specific logical functions. This reconfigurable characteristic of network connections provides FPGA with high-level programmability. (The programmability of FPGA is reflected in changing the connections between various modules and logical resources.)
For example, a programmer can think of an FPGA LUT as a small memory (RAM). A 3-input LUT is equivalent to a memory with 3 address lines and 8 one-bit storage cells (an array of length 8, each element 1 bit wide). To implement a 3-bit bitwise AND, the array stores the AND result for each of the 8 possible input combinations; to implement a 3-bit bitwise XOR, it stores the XOR result for each of the 8 combinations instead. Within one clock cycle the result for any 3-bit input can be read out, and different bitwise functions can be realized simply by rewriting the values in the RAM, which is what makes the LUT fully programmable.
Example of a 3-bit input LUT implementing bitwise AND:
Note: 3-bit input LUT look-up table
Here the three-input bitwise AND is implemented with a LUT inside the FPGA.
The figure above shows a 3-input, 1-output LUT. By combining LUTs in parallel, in series, and in other topologies, more complex logic functions can be built.
Traditional FPGA Development
Comparison of Traditional FPGA and Software Development
For traditional FPGA development and software development, the toolchain can be simply compared in the following table:
Note: Comparison table of traditional FPGA and software development
It is worth highlighting that the synthesis phase differs fundamentally from software compilation. General-purpose processors such as CPUs and GPUs are finished ASICs with fixed instruction sets. An FPGA, by contrast, starts out blank: there are only primitive components, and nothing exists yet, but you can build circuits of any structure, which gives enormous freedom. That freedom is both the FPGA's advantage and, during development, its burden.
Writing this reminds me of a recent meme from “The Mysterious Programmers”:
Note: Comic source “The Mysterious Programmers 56” by Xi Qiao
Traditional FPGA development is like the 10-year-old Linux user in the comic: to eat a cake, you first have to process the raw ingredients from scratch. FPGA is in that state; to implement an algorithm you must write RTL, design state machines, and verify correctness through simulation.
Traditional FPGA Development Method
Complex systems require finite state machines (FSMs), which generally involve designing the three parts of logic shown in the diagram: combinational logic, sequential logic, and output logic. The combinational logic determines the next state, the sequential logic stores the current state, and the output logic combines the two to produce the final output.
Then, for a specific algorithm, design the flow of logic in the state machine:
The implemented RTL is as follows:
module fsm_using_single_always (
clock , // clock
reset , // Active high, syn reset
req_0 , // Request 0
req_1 , // Request 1
gnt_0 , // Grant 0
gnt_1
);
//=============Input Ports=============================
input clock, reset, req_0, req_1;
//=============Output Ports===========================
output gnt_0, gnt_1;
//=============Input ports Data Type===================
wire clock, reset, req_0, req_1;
//=============Output Ports Data Type==================
reg gnt_0, gnt_1;
//=============Internal Constants======================
parameter SIZE = 3;
parameter IDLE = 3'b001, GNT0 = 3'b010, GNT1 = 3'b100;
//=============Internal Variables======================
reg [SIZE-1:0] state;      // Seq part of the FSM
reg [SIZE-1:0] next_state; // combo part of FSM
//==========Code starts Here==========================
always @ (posedge clock)
begin : FSM
  if (reset == 1'b1) begin
    state <= #1 IDLE;
    gnt_0 <= 0;
    gnt_1 <= 0;
  end else
    case (state)
      IDLE : if (req_0 == 1'b1) begin
               state <= #1 GNT0;
               gnt_0 <= 1;
             end else if (req_1 == 1'b1) begin
               gnt_1 <= 1;
               state <= #1 GNT1;
             end else begin
               state <= #1 IDLE;
             end
      GNT0 : if (req_0 == 1'b1) begin
               state <= #1 GNT0;
             end else begin
               gnt_0 <= 0;
               state <= #1 IDLE;
             end
      GNT1 : if (req_1 == 1'b1) begin
               state <= #1 GNT1;
             end else begin
               gnt_1 <= 0;
               state <= #1 IDLE;
             end
      default : state <= #1 IDLE;
    endcase
end
endmodule // End of Module arbiter
Traditional RTL design is simply a nightmare for programmers: the toolchain is completely different, the development mindset is entirely different, and timing analysis is required. If a clock cycle is off, everything must be redone and re-verified; it all feels too low-level and inconvenient. So let's leave that to the professional FPGA engineers. The following introduces OpenCL development for FPGA, which is more like the 25-year-old Linux user in the comic: with a higher level of abstraction, it is naturally more convenient to use.
FPGA Development Based on OpenCL
OpenCL has injected fresh blood into FPGA development. It is a programming framework oriented toward heterogeneous systems, with the FPGA as one selectable device. The CPU host controls the overall execution flow of the program, while the FPGA device serves as the accelerator. A heterogeneous architecture frees the CPU from workloads it is not good at by delegating them to the device side. Typical heterogeneous devices today include GPUs, Intel Xeon Phi, and FPGAs.
What is OpenCL?
Note: Quoted from wiki
Open Computing Language (OpenCL) is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators. OpenCL specifies a programming language (based on C99) for programming these devices and application programming interfaces (APIs) to control the platform and execute programs on the compute devices. OpenCL provides a standard interface for parallel computing using task-based and data-based parallelism.
In summary: OpenCL is a framework for programming heterogeneous platforms whose devices include CPUs, GPUs, DSPs, FPGAs, and other hardware accelerators. Device-side code is written in a language based on C99, and corresponding APIs are provided for the host to invoke it. OpenCL offers standard interfaces for parallel computing that support both task parallelism and data parallelism.
OpenCL Case Analysis
Here we analyze the matrix multiplication case from the Altera official website. You can download the case via the following link: Altera OpenCL Matrix Multiplication.
The code structure is as follows:
.
|-- common
|   |-- inc
|   |   `-- AOCLUtils
|   |       |-- aocl_utils.h
|   |       |-- opencl.h
|   |       |-- options.h
|   |       `-- scoped_ptrs.h
|   |-- readme.css
|   `-- src
|       `-- AOCLUtils
|           |-- opencl.cpp
|           `-- options.cpp
`-- matrix_mult
    |-- Makefile
    |-- README.html
    |-- device
    |   `-- matrix_mult.cl
    `-- host
        |-- inc
        |   `-- matrixMult.h
        `-- src
            `-- main.cpp
Among these, the FPGA-related code is matrix_mult.cl, which describes the kernel function. The compiler turns this function into RTL code and then maps it onto FPGA circuits.
The definition of the kernel function is as follows:
__kernel
__attribute__((reqd_work_group_size(BLOCK_SIZE, BLOCK_SIZE, 1)))
__attribute__((num_simd_work_items(SIMD_WORK_ITEMS)))
void matrixMult(
    __global float *restrict C,
    __global float *A,
    __global float *B,
    int A_width,
    int B_width)
The pattern is fairly fixed. Note that __global marks data coming from the host, stored in global memory, which may be FPGA on-chip storage, DDR, QDR, and so on, depending on the FPGA's OpenCL BSP. num_simd_work_items specifies the SIMD width, and reqd_work_group_size specifies the work-group size. These concepts are described in the OpenCL documentation.
The function implementation is as follows:
// Declare local storage to temporarily hold one BLOCK of each array
__local float A_local[BLOCK_SIZE][BLOCK_SIZE];
__local float B_local[BLOCK_SIZE][BLOCK_SIZE];

// Block index
int block_x = get_group_id(0);
int block_y = get_group_id(1);

// Local ID index (offset within a block)
int local_x = get_local_id(0);
int local_y = get_local_id(1);

// Compute loop bounds
int a_start = A_width * BLOCK_SIZE * block_y;
int a_end   = a_start + A_width - 1;
int b_start = BLOCK_SIZE * block_x;

float running_sum = 0.0f;
for (int a = a_start, b = b_start; a <= a_end; a += BLOCK_SIZE, b += (BLOCK_SIZE * B_width))
{
    // Read the corresponding BLOCK data from global memory into local memory
    A_local[local_y][local_x] = A[a + A_width * local_y + local_x];
    B_local[local_x][local_y] = B[b + B_width * local_y + local_x];

    // Wait for the entire block to be loaded.
    barrier(CLK_LOCAL_MEM_FENCE);

    // Compute the partial sum; unrolling replicates the compute units
    // in parallel, forming a multiply-add tree
    #pragma unroll
    for (int k = 0; k < BLOCK_SIZE; ++k)
    {
        running_sum += A_local[local_y][k] * B_local[local_x][k];
    }

    // Wait for the block to be fully consumed before loading the next block.
    barrier(CLK_LOCAL_MEM_FENCE);
}

// Store result in matrix C
C[get_global_id(1) * get_global_size(0) + get_global_id(0)] = running_sum;
Simulating FPGA with CPU
During emulation the programmer does not need to worry about how the timing flows; only the logical function needs to be verified. The Altera OpenCL SDK provides a CPU emulation device, which can be used as follows:
# Generate a .aocx file for debugging that targets a specific accelerator board
$ aoc -march=emulator device/matrix_mult.cl -o bin/matrix_mult.aocx --fp-relaxed --fpc --no-interleaving default --board <your-board>
# Generate Host exe.
$ make
# Run the application
$ env CL_CONTEXT_EMULATOR_DEVICE_ALTERA=8 ./bin/host -ah=512 -aw=512 -bw=512
In the script above, -march=emulator generates a device executable for CPU-based debugging, -g adds debugging information, and --board produces a debug build matching the target device. The CL_CONTEXT_EMULATOR_DEVICE_ALTERA environment variable sets the number of emulated devices.
After executing the above script, the output is as follows:
$ env CL_CONTEXT_EMULATOR_DEVICE_ALTERA=8 ./bin/host -ah=512 -aw=512 -bw=512
Matrix sizes:
  A: 512 x 512
  B: 512 x 512
  C: 512 x 512
Initializing OpenCL
Platform: Altera SDK for OpenCL
Using 8 device(s)
  EmulatorDevice : Emulated Device
  ...
  EmulatorDevice : Emulated Device
Using AOCX: matrix_mult.aocx
Generating input matrices
Launching for device 0 (global size: 512, 64)
...
Launching for device 7 (global size: 512, 64)
Time: 5596.620 ms
Kernel time (device 0): 5500.896 ms
...
Kernel time (device 7): 5137.931 ms
Throughput: 0.05 GFLOPS
Computing reference output
Verifying
Verification: PASS
With Device = 8, the emulation runs a (512, 512) x (512, 512) matrix multiplication across 8 emulated devices and verifies correctness at the end. Next we can compile it to run on the real FPGA device.
Running Matrix Multiplication on FPGA Device
At this point we are ready to run the code on the FPGA. All that is needed is to compile the *.cl code for the FPGA with the compiler provided by the OpenCL SDK:
$ aoc device/matrix_mult.cl -o bin/matrix_mult.aocx --fp-relaxed --fpc --no-interleaving default --board <your-board>
This process is relatively slow, generally taking several hours to more than ten hours depending on the amount of FPGA resources used. (For now this time cannot be avoided: the compilation is actually constructing a working circuit, and the tools must perform placement and routing.)
Once compilation completes, the generated matrix_mult.aocx file can be programmed onto the FPGA.
The programming command is as follows:
$ aocl program <your-board> matrix_mult.aocx
At this point, the task is completed, and the host-side program can be run:
$ ./host -ah=512 -aw=512 -bw=512
Matrix sizes:
  A: 512 x 512
  B: 512 x 512
  C: 512 x 512
Initializing OpenCL
Platform: Altera SDK for OpenCL
Using 1 device(s)
  <your-board> : Altera OpenCL QPI FPGA
Using AOCX: matrix_mult.aocx
Generating input matrices
Launching for device 0 (global size: 512, 512)
Time: 2.253 ms
Kernel time (device 0): 2.191 ms
Throughput: 119.13 GFLOPS
Computing reference output
Verifying
Verification: PASS
As shown, the matrix multiplication runs correctly on the FPGA, with a throughput of about 119 GFLOPS.
Conclusion
Judging from the development flow described above, OpenCL greatly shortens the FPGA development cycle, and it is relatively easy for software developers to pick up. That is its advantage. There are still problems today, such as immature compiler optimization, which leaves a performance gap compared with hand-written RTL, and very long device-side compilation times. These should gradually improve as the ecosystem matures.
If you are interested in FPGAs, or have built solutions with them, you are welcome to join the discussion.
