The Viability of FPGA Compared to CPU, GPU, and ASIC
In recent years, FPGAs have become increasingly visible. There are FPGA-based Bitcoin miners, for example, and Microsoft has said it would use FPGAs to "replace" CPUs in its data centers, among other things. To professionals, FPGAs are nothing new and have long been widely used. Most people, however, are still unfamiliar with them and have many questions: what exactly is an FPGA? Why use one? What are its characteristics compared with CPUs, GPUs, and ASICs (application-specific integrated circuits)? With these questions in mind, let's uncover the mysteries of the FPGA.

1. Why Use FPGA?

It is well known that Moore's Law for general-purpose processors (CPUs) has entered its twilight years, while machine learning and web services are growing exponentially in scale. People use custom hardware to accelerate common computing tasks, but the rapidly changing industry demands that this custom hardware be reprogrammable to perform new kinds of tasks. The FPGA (Field Programmable Gate Array) is exactly such a hardware-reconfigurable architecture. For many years it served as a low-volume alternative to ASICs, but in recent years it has been deployed at scale in the data centers of companies such as Microsoft and Baidu, offering both strong computing power and sufficient flexibility.

[Figure: Comparison of performance and flexibility across different architectures]

Why is the FPGA fast? As the joke goes, "it only looks good because of the peers it is compared against."

CPUs and GPUs both belong to the von Neumann architecture: instructions are fetched, decoded, and executed, and memory is shared. The fundamental reason the FPGA is more energy-efficient than the CPU and even the GPU is that it has no instructions and needs no shared memory.

In the von Neumann architecture, because an execution unit (such as a CPU core) can execute arbitrary instructions, it needs instruction memory, a decoder, arithmetic units for the various instructions, and branch-handling logic. Because the control logic for the instruction stream is complex, there cannot be too many independent instruction streams; the GPU therefore uses SIMD (Single Instruction, Multiple Data) so that many execution units process different data in lockstep, and the CPU supports SIMD instructions as well. In an FPGA, by contrast, the function of each logic unit is fixed when the device is reprogrammed (burned), so no instructions are needed.

In the von Neumann architecture, memory serves two purposes: holding state and communicating between execution units. Because memory is shared, access arbitration is required; and to exploit locality of access, each execution unit has a private cache, which in turn requires maintaining cache coherence among execution units. For holding state, the registers and on-chip memory (BRAM) in an FPGA belong to their own control logic, so no arbitration or caching is needed. For communication, the connections between logic units in an FPGA are likewise determined when the device is reprogrammed (burned), so there is no need to communicate through shared memory.

Having covered so much at a high level, how does the FPGA perform in practice? Let's look at compute-intensive tasks and communication-intensive tasks.

Examples of compute-intensive tasks include matrix operations, image processing, machine learning, compression, asymmetric encryption, and ranking for Bing search. Generally, the CPU offloads these tasks to the FPGA for execution.
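As a brief aside before looking at the numbers, here is a minimal illustration of the SIMD point made above. It is a hypothetical example, not from the original article: one x86 AVX instruction operates on eight floats at once, which is the CPU/GPU style of data parallelism that the FPGA's spatial, instruction-free layout is being contrasted with.

```c
/* Minimal SIMD illustration (hypothetical example): one AVX instruction
 * adds eight floats at a time. GPUs push the same idea much further.   */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float a[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];

    __m256 va = _mm256_loadu_ps(a);     /* load 8 floats                 */
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);  /* a single instruction, 8 adds  */
    _mm256_storeu_ps(c, vc);

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);          /* prints: 8 8 8 8 8 8 8 8       */
    printf("\n");
    return 0;
}
```

(Compile with a flag such as gcc -mavx.)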
For these tasks, the integer-multiplication performance of Altera's (now Intel's, though I still prefer to call it Altera) current Stratix V FPGA is roughly on par with a 20-core CPU, and its floating-point multiplication performance is roughly on par with an 8-core CPU, but an order of magnitude below a GPU. The next-generation FPGA we will soon be using, the Stratix 10, will carry far more multipliers and hardened floating-point units, and should in theory reach the computing power of today's top GPU compute cards.

[Figure: Estimated integer multiplication capability of the FPGA (without using DSPs, estimated from logic resource usage)]

[Figure: Estimated floating-point multiplication capability of the FPGA (float16 using soft cores, float32 using hard cores)]

In the data center, the core advantage of the FPGA over the GPU is latency. For a task such as ranking Bing search results, results must be returned as quickly as possible, so latency must be minimized at every step. With GPU acceleration, the batch size cannot be too small if the GPU's computing power is to be fully used, which pushes latency to the millisecond level. With FPGA acceleration, only microseconds of PCIe latency are needed (our FPGAs are currently used as PCIe accelerator cards). Once Intel ships Xeon + FPGA parts connected over QPI, the latency between CPU and FPGA should drop below 100 nanoseconds, comparable to accessing main memory.

Why is the FPGA's latency so much lower than the GPU's? Fundamentally because of architecture. The FPGA offers both pipeline parallelism and data parallelism, whereas the GPU has almost only data parallelism (its pipelines are shallow). Suppose processing a data packet takes 10 steps. An FPGA can build a 10-stage pipeline in which different stages work on different packets; a packet is finished once it has passed through all 10 stages, and each finished packet can be output immediately. The GPU's data-parallel approach instead creates 10 compute units, each processing a different packet, but all units must march in lockstep, executing the same thing (SIMD); this forces 10 packets to enter and leave together, which lengthens input and output latency. When tasks arrive one by one rather than in batches, pipeline parallelism achieves lower latency than data parallelism, so for streaming computing tasks the FPGA has an inherent latency advantage over the GPU.

[Figure: Order-of-magnitude comparison of CPU, GPU, FPGA, and ASIC on a compute-intensive task, 16-bit integer multiplication (the numbers are only rough estimates)]

ASICs are the best in throughput, latency, and power consumption, yet Microsoft did not adopt them, for two reasons:

1. Computing tasks in the data center are flexible and varied, while ASIC development is expensive and slow. If a batch of accelerator cards is deployed for one neural network and a different network then becomes popular, the investment is wasted; an FPGA needs only a few hundred milliseconds to update its logic. The FPGA's flexibility protects the investment; indeed, the way Microsoft uses FPGAs today is quite different from the original plan.

2. Data centers are rented out to different tenants. If some machines carry neural-network accelerator cards, others Bing-search accelerator cards, and still others network-virtualization accelerator cards, task scheduling and server maintenance become very complicated. Using FPGAs keeps the data center homogeneous.
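To make the pipeline-versus-batch latency argument above concrete, here is a tiny back-of-envelope model. It is a hypothetical sketch, not from the original article: it assumes every stage or step takes the same unit of time and that packets arrive one per unit of time, and it simply prints the completion latency each packet would see under the two execution styles.

```c
/* Hypothetical model of the 10-step packet example above: an FPGA-style
 * 10-stage pipeline versus a GPU-style batch of 10 processed in lockstep.
 * Assumes each stage/step takes T_STEP_US and packets arrive one per step. */
#include <stdio.h>

#define STAGES    10      /* processing steps per packet            */
#define BATCH     10      /* packets a SIMD batch must wait for     */
#define T_STEP_US 1.0     /* assumed time per step, in microseconds */

int main(void) {
    for (int i = 0; i < BATCH; i++) {
        double arrive    = i * T_STEP_US;                /* packet i arrives    */
        double done_pipe = arrive + STAGES * T_STEP_US;  /* leaves the pipeline */
        /* In the batch model nothing starts until all BATCH packets are in,
         * and all of them come out together after the STAGES steps.          */
        double done_batch = (BATCH - 1) * T_STEP_US + STAGES * T_STEP_US;
        printf("packet %2d: pipeline latency %4.1f us, batch latency %4.1f us\n",
               i, done_pipe - arrive, done_batch - arrive);
    }
    return 0;
}
```

Under these assumptions every packet sees 10 units of latency in the pipeline, while in the batch model the first packet waits nearly twice as long; the larger the batch, the wider the gap.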
Next, let's look at communication-intensive tasks. Compared with compute-intensive tasks, they do not require complex processing of each input item; usually a simple computation is performed and the result is sent out, so communication is often the bottleneck. Examples include symmetric encryption, firewalls, and network virtualization.

[Figure: Order-of-magnitude comparison of CPU, GPU, FPGA, and ASIC on a communication-intensive task, processing a 64-byte network packet (the numbers are only rough estimates)]

For communication-intensive tasks, the FPGA's advantage over the CPU and GPU is even greater.

In terms of throughput, the FPGA's transceivers can be wired directly to 40 Gbps or even 100 Gbps cables and process packets of any size at line rate, whereas a CPU must first receive the packets from a network card, and many network cards cannot handle 64-byte small packets at line rate. High performance can be reached by installing several network cards, but the number of PCIe slots a CPU and motherboard support is limited, and the network cards and switches themselves are expensive.

In terms of latency, when a network card receives a packet, the CPU processes it and sends it back to the network card; even with a high-performance packet-processing framework such as DPDK, the latency is still around 4 to 5 microseconds. A more serious problem is that the latency of a general-purpose CPU is not stable: under high load, forwarding latency can rise to tens of microseconds or more (see the figure below), and clock interrupts and task scheduling in modern operating systems add further uncertainty.

[Figure: Forwarding latency of ClickNP (FPGA), a Dell S6000 switch (commercial switch chip), Click+DPDK (CPU), and Linux (CPU); error bars indicate the 5th and 95th percentiles. Source: [5]]

The GPU can also process packets at high throughput, but it has no network port, so packets must first come in through the network card; throughput is therefore bounded by the CPU and/or the network card, and the GPU's own latency is even more of a concern.

So why not implement these network functions directly in the network card, or use programmable switches? The flexibility of ASICs remains the sticking point. Increasingly capable programmable switch chips exist, such as Tofino with its P4 language, but an ASIC still cannot perform complex stateful processing, for example certain custom encryption algorithms.

In summary, the FPGA's main advantage in the data center is stable and extremely low latency, which suits both streaming compute-intensive tasks and communication-intensive tasks.
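To put the 64-byte line-rate figures above into perspective, here is a quick back-of-envelope calculation. It is a hypothetical sketch using standard Ethernet framing overheads, not a measurement from the article: it computes the packet rate a 40 Gbps link carries with minimum-size frames, and the per-packet time budget that leaves the software.

```c
/* Back-of-envelope line-rate calculation for 64-byte frames on 40 Gbps.
 * On the wire each frame also costs an 8-byte preamble and a 12-byte
 * inter-frame gap (the 64 bytes already include the FCS).              */
#include <stdio.h>

int main(void) {
    const double link_bps   = 40e9;          /* 40 Gbps link           */
    const double wire_bytes = 64 + 8 + 12;   /* frame + preamble + IFG */

    double pps = link_bps / (wire_bytes * 8.0);
    printf("64-byte line rate:      %.1f Mpps\n", pps / 1e6);  /* ~59.5 Mpps */
    printf("budget per packet:      %.1f ns\n", 1e9 / pps);    /* ~16.8 ns   */
    return 0;
}
```

At roughly 17 nanoseconds per packet, a single trip to DRAM already costs several times the budget, which is why line-rate processing of small packets is so hard to sustain in software.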
2. Microsoft's Practice of Deploying FPGA

In September 2016, Wired published an article titled "Microsoft Bets the Future on FPGA," detailing the past and present of the Catapult project. Shortly afterwards, Doug Burger, the head of the Catapult project, demonstrated FPGA-accelerated machine translation with Microsoft CEO Satya Nadella at Ignite 2016. The total computing power of the demonstration was 1.03 million T ops, that is, 1.03 exa-ops, equivalent to 100,000 top-tier GPU compute cards. A single FPGA (together with its onboard memory and network interfaces) consumes about 30 W, adding only a tenth to the server's total power consumption.

[Figure: The demonstration at Ignite 2016: machine translation at 1 exa-op (10^18 operations) per second]

Deploying FPGAs at Microsoft did not go smoothly from the start. On the question of where to put the FPGA, the approach went through three phases:

1. Dedicated FPGA clusters, with machines full of FPGAs;
2. One FPGA per machine, connected by a dedicated network;
3. One FPGA per machine, placed between the network card and the switch, sharing the server's network.

[Figure: The three phases of Microsoft's FPGA deployment approach. Source: [3]]

The first phase used dedicated clusters stuffed with FPGA accelerator cards, essentially a supercomputer built entirely of FPGAs. The image below shows the earliest BFB experimental board, which carried 6 FPGAs on a single PCIe card, with 4 such cards installed in each 1U server.

[Figure: The earliest BFB experimental board, carrying 6 FPGAs. Source: [1]]

The choice of vendor is worth a mention. In the semiconductor industry, as long as the volume is large enough, the price of a chip tends toward the price of sand. One vendor is rumored to have lost the deal because it was unwilling to supply chips at "sand prices." Of course, both vendors' FPGAs are now used in the data center field; once the scale is large enough, worries about FPGAs being expensive become unnecessary.

[Figure: The earliest BFB experimental setup, with 4 FPGA cards installed in a 1U server. Source: [1]]

This supercomputer-style deployment means a dedicated cabinet filled with servers like the one shown, each carrying 24 FPGAs (left side of the figure below). This approach has several problems:

1. FPGAs in different machines cannot communicate with each other, so the scale of the problems an FPGA can tackle is limited to the FPGAs within a single server;
2. Other machines in the data center must funnel tasks to this one cabinet, creating an incast (many-to-one) traffic pattern and making stable network latency hard to achieve;
3. The dedicated FPGA cabinet is a single point of failure; if it goes down, nothing gets accelerated;
4. Servers carrying FPGAs are custom-built, which complicates cooling and maintenance.

[Figure: Three deployment methods for FPGAs, from centralized to distributed. Source: [1]]

A less aggressive approach is to deploy one server full of FPGAs in each cabinet (shown in the figure above). This avoids problems (2) and (3), but (1) and (4) remain unresolved.

In the second phase, to keep the data center's servers homogeneous (which was also an important reason for not using ASICs), one FPGA was installed in each server (shown in the right image), with the FPGAs connected by a dedicated network. This is the deployment method Microsoft published at ISCA'14.
[Figure: Interior of an Open Compute Server; the red box marks the location of the FPGA. Source: [1]]

[Figure: An Open Compute Server with the FPGA installed. Source: [1]]

[Figure: Connection and mounting between the FPGA and the Open Compute Server. Source: [1]]

The FPGA is a Stratix V D5, with 172K ALMs, 2,014 M20K on-chip memory blocks, and 1,590 DSP blocks. The board carries 8 GB of DDR3-1333 memory, a PCIe Gen3 x8 interface, and two 10 Gbps network interfaces. The FPGAs within a rack are connected by a dedicated network: one group of 10G ports is linked in rings of 8, and another group of 10G ports in rings of 6, with no switches involved.

[Figure: Network connections between the FPGAs within a cabinet. Source: [1]]

This cluster of 1,632 servers and 1,632 FPGAs doubled the overall performance of Bing's search-result ranking (in other words, it halved the number of servers needed). As the figure below shows, every 8 FPGAs are linked in a chain, communicating over the 10 Gbps dedicated network mentioned above. Each of the 8 FPGAs has its own role: some extract features from documents (yellow), some compute feature expressions (green), and some compute document scores (red).

[Figure: FPGAs accelerating the ranking of Bing search results. Source: [1]]

[Figure: The FPGA not only reduces the latency of Bing searches but also significantly improves the stability of that latency. Source: [4]]

[Figure: Both local and remote FPGAs can reduce search latency; the communication latency to a remote FPGA is negligible compared with the search latency. Source: [4]]

The FPGA deployment in Bing was a success, and the Catapult project continued to expand within the company. Microsoft's Azure cloud-computing division has the most servers internally, and its pressing problem was the overhead of network and storage virtualization. Azure sells virtual machines to customers and must provide network functions such as firewalls, load balancing, tunneling, and NAT for those virtual machines. Because the physical storage of cloud storage is separated from the compute nodes, data must be moved from storage nodes over the network, and compressed and encrypted along the way.

In the era of 1 Gbps networks and mechanical hard drives, the CPU overhead of network and storage virtualization was negligible. But as networks reach 40 Gbps and a single SSD delivers around 1 GB/s, the CPU is increasingly overwhelmed. For example, the Hyper-V virtual switch can handle only about 25 Gbps of traffic and cannot reach 40 Gbps line rate, and it does even worse with small packets; AES-256 encryption plus SHA-1 signing runs at only about 100 MB/s per CPU core, a tenth of a single SSD's throughput.

[Figure: CPU cores required to process network tunneling protocols and firewall rules at 40 Gbps. Source: [5]]
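The core-count arithmetic implied by these numbers is worth spelling out. The sketch below simply turns the figures quoted above into rough core counts; the values are illustrative, not measurements.

```c
/* Rough core-count arithmetic from the figures quoted above:
 * crypto at ~100 MB/s per core versus a 1 GB/s SSD and a 40 Gbps NIC. */
#include <stdio.h>

int main(void) {
    const double crypto_MBps_per_core = 100.0;           /* AES-256 + SHA-1 per core */
    const double ssd_MBps             = 1000.0;          /* one SSD, ~1 GB/s         */
    const double nic_MBps             = 40e9 / 8 / 1e6;  /* 40 Gbps = 5000 MB/s      */

    printf("cores to encrypt one SSD's stream:   %.0f\n",
           ssd_MBps / crypto_MBps_per_core);             /* 10 */
    printf("cores to encrypt a full 40 Gbps NIC: %.0f\n",
           nic_MBps / crypto_MBps_per_core);             /* 50 */
    return 0;
}
```

Tens of cores spent purely on moving and encrypting tenants' bytes are cores that can no longer be sold as virtual machines, which is exactly the motivation for the SmartNIC deployment described next.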
To accelerate network functions and storage virtualization, Microsoft deployed the FPGA between the network card and the switch. As the figure below shows, each FPGA board carries 4 GB of DDR3-1333 DRAM and is connected to one CPU socket through two PCIe Gen3 x8 interfaces (physically a single PCIe Gen3 x16 interface; because the FPGA lacks a x16 hard core, it is treated logically as two x8 interfaces). The physical network card (NIC) is an ordinary 40 Gbps card, used only for communication between the host and the network.

[Figure: Architecture for deploying FPGAs in Azure servers. Source: [6]]

The FPGA (SmartNIC) presents a virtual network card to each virtual machine, which the virtual machine accesses directly via SR-IOV. The data-plane functions that used to live in the virtual switch have been moved into the FPGA, so virtual machines can send and receive packets without involving the CPU or passing through the physical NIC. This not only frees CPU resources that can be sold, but also improves virtual-machine network performance (25 Gbps) and cuts network latency between virtual machines in the same data center by a factor of 10.

[Figure: Accelerated architecture for network virtualization. Source: [6]]

This is the third-generation architecture for deploying FPGAs, the one now used for the large-scale "one FPGA per server" deployment. The original intent of having the FPGA reuse the host network was to accelerate networking and storage, but its more far-reaching effect is to extend the network connections between FPGAs to the scale of the entire data center, creating a truly cloud-scale "supercomputer." In the second-generation architecture, FPGA-to-FPGA connections were confined to a single rack; the dedicated interconnect between FPGAs was hard to scale, and forwarding through the CPU incurred too much overhead.

In the third-generation architecture, FPGAs communicate through LTL (Lightweight Transport Layer). Latency within a rack is under 3 microseconds; within 8 microseconds an FPGA can reach 1,000 FPGAs; and within 20 microseconds it can reach every FPGA in the same data center. Although the second-generation architecture has lower latency within a group of 8 machines, it can reach only 48 FPGAs over the network. To support large-scale communication between FPGAs, the LTL in the third-generation architecture also supports the PFC flow-control protocol and the DCQCN congestion-control protocol.

[Figure: Vertical axis: LTL latency; horizontal axis: number of reachable FPGAs. Source: [4]]

[Figure: Logical modules within the FPGA. Each Role is user logic (such as DNN acceleration, network-function acceleration, or encryption); the surrounding shell handles communication between Roles and between Roles and peripherals. Source: [4]]

[Figure: The data center acceleration plane formed by FPGAs, sitting between the network switching layer (TOR, L1, L2) and traditional server software (software running on the CPU). Source: [4]]

The high-bandwidth, low-latency network interconnecting the FPGAs forms a data center acceleration plane between the network switching layer and traditional server software. Beyond accelerating network and storage virtualization, which every server providing cloud services needs, the remaining FPGA resources can also accelerate tasks such as Bing search and deep neural networks (DNNs).

For many kinds of applications, as a distributed FPGA accelerator scales out, its performance improves superlinearly. For example, in CNN inference, when only one FPGA is used, the on-chip memory cannot hold the whole model, so the model weights must be fetched from DRAM again and again, and DRAM becomes the performance bottleneck. But if there are enough FPGAs, each FPGA can hold one layer, or a few features within a layer, so that all the model weights fit in on-chip memory; the DRAM bottleneck disappears and the FPGAs' compute units can be fully utilized. Of course, partitioning too finely also increases communication overhead; the key to distributing a task across an FPGA cluster is balancing computation against communication.
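A toy calculation makes the partitioning argument concrete. The layer sizes below are invented, and the roughly 5 MB of on-chip RAM per FPGA is only a ballpark figure in line with the Stratix V D5's 2,014 M20K blocks mentioned earlier; the point is simply how many FPGAs it takes before every layer's weights live entirely on-chip.

```c
/* Toy sketch of splitting a model across FPGAs so that all weights fit
 * in on-chip memory (hypothetical layer sizes; ~5 MB on-chip RAM assumed,
 * roughly what 2,014 M20K blocks provide).                               */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double layer_weights_MB[] = {12.0, 25.0, 25.0, 50.0, 100.0};
    const int    n_layers           = 5;
    const double onchip_MB_per_fpga = 5.0;

    int total = 0;
    for (int i = 0; i < n_layers; i++) {
        /* Split each layer across enough FPGAs that its slice fits on-chip. */
        int fpgas = (int)ceil(layer_weights_MB[i] / onchip_MB_per_fpga);
        total += fpgas;
        printf("layer %d: %6.1f MB of weights -> %2d FPGA(s)\n",
               i, layer_weights_MB[i], fpgas);
    }
    printf("FPGAs needed for a fully on-chip pipeline: %d\n", total);
    return 0;
}
```

Every extra FPGA also adds inter-FPGA traffic, which is why partitioning too finely shifts the bottleneck from DRAM to communication.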
[Figure: From a neural-network model to FPGAs on HaaS. By exploiting the parallelism within the model, different layers and different features are mapped onto different FPGAs. Source: [4]]

At MICRO'16, Microsoft proposed the concept of Hardware as a Service (HaaS), which treats hardware as a schedulable cloud service, enabling centralized scheduling, management, and large-scale deployment of FPGA services.

[Figure: Hardware as a Service (HaaS). Source: [4]]

From the first generation of dedicated servers stuffed with FPGAs, to the second generation of FPGA accelerator-card clusters connected by a dedicated network, to today's large-scale FPGA cloud that reuses the data center network, three guiding principles have shaped the path:

1. Hardware and software are not mutually exclusive but collaborative;
2. Flexibility, that is, the ability to be defined by software, is essential;
3. Scalability is essential.

3. The Role of FPGA in Cloud Computing

Finally, I would like to share my personal thoughts on the role of the FPGA in cloud computing. I am a third-year PhD student, and my research at Microsoft Research Asia tries to answer two questions:

1. What role should the FPGA play in large-scale network interconnection systems?
2. How can heterogeneous FPGA + CPU systems be programmed efficiently and scalably?

My main regret about FPGAs in industry is that the mainstream use of FPGAs in data centers, from internet giants other than Microsoft to the two major FPGA vendors and to academia, mostly treats the FPGA as an accelerator card for compute-intensive tasks, much like a GPU. But is the FPGA really suited to doing a GPU's job? As discussed earlier, the biggest difference between FPGA and GPU is architectural: the FPGA is better suited to low-latency streaming processing, the GPU to processing large batches of homogeneous data.

Because so many people want to use the FPGA as a compute accelerator, the high-level programming models released by the two major FPGA vendors are also based on OpenCL, imitating the GPU's shared-memory, batch-processing model. For the CPU to hand a task to the FPGA, it must first load the data into the DRAM on the FPGA board, then instruct the FPGA to start executing, and finally read the results back out of the DRAM.

Why go through the board's DRAM when the CPU and FPGA could communicate efficiently over PCIe? Perhaps it is an engineering-implementation issue: we found that writing to DRAM, launching the kernel, and reading back from DRAM with OpenCL takes about 1.8 milliseconds, whereas communication over PCIe DMA takes only 1 to 2 microseconds.
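For reference, the host-side round trip being measured looks roughly like the sketch below. It is a minimal, hypothetical example of the standard OpenCL host flow (platform, context, queue, and kernel setup are omitted), not Microsoft's or the vendors' actual code; the point is that every offload goes write-DRAM, launch, read-DRAM.

```c
/* Minimal sketch of the OpenCL host-side round trip described above:
 * copy input to the board's DRAM, launch the kernel, copy the result
 * back. All three steps detour through on-board DRAM.                 */
#include <CL/cl.h>
#include <stdio.h>

void offload(cl_context ctx, cl_command_queue q, cl_kernel kernel,
             const float *in, float *out, size_t n) {
    cl_int err;

    /* 1. Stage the input in the FPGA board's DRAM. */
    cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(float), NULL, &err);
    cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), NULL, &err);
    clEnqueueWriteBuffer(q, d_in, CL_TRUE, 0, n * sizeof(float), in, 0, NULL, NULL);

    /* 2. Tell the FPGA to start executing the (pre-compiled) kernel. */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_in);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_out);
    clEnqueueTask(q, kernel, 0, NULL, NULL);   /* single work-item kernel */

    /* 3. Read the result back from the board's DRAM. */
    clEnqueueReadBuffer(q, d_out, CL_TRUE, 0, n * sizeof(float), out, 0, NULL, NULL);
    clFinish(q);

    clReleaseMemObject(d_in);
    clReleaseMemObject(d_out);
}
```

Each of those three steps bounces through the board's DRAM, which is where the roughly 1.8 milliseconds goes; a direct PCIe DMA message between host and FPGA avoids the detour entirely.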
[Figure: Performance comparison between the PCIe I/O channel and OpenCL; the vertical axis is logarithmic. Source: [5]]

Communication between multiple kernels within OpenCL is even more extreme, as the default mechanism there is also shared memory. As noted at the beginning of this article, the FPGA's fundamental architectural advantage in energy efficiency over the CPU and GPU is that it has no instructions and needs no shared memory. Using shared memory for communication between kernels is unnecessary when the communication is sequential (a FIFO will do), and the DRAM on an FPGA board is generally much slower than the DRAM on a GPU anyway.

We therefore proposed the ClickNP network programming framework, which uses channels rather than shared memory for communication between execution units (elements/kernels) and between execution units and host software. Applications that need shared memory can also be built on top of channels, since CSP (Communicating Sequential Processes) and shared memory are theoretically equivalent. ClickNP is still a framework based on OpenCL, constrained by describing hardware in C (although HLS is indeed far more productive than Verilog); ideally, the hardware description language would not be C.

[Figure: ClickNP uses channels for communication between elements. Source: [5]]

[Figure: ClickNP uses channels for communication between the FPGA and the CPU. Source: [5]]
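To give a flavor of channel-based (FIFO) communication between kernels, here is a minimal sketch written in OpenCL C in the style of Intel's channel extension for FPGAs. It is a hypothetical illustration, not ClickNP code: two kernels form a two-stage pipeline connected by an on-chip FIFO, with no shared memory in between.

```c
/* Hypothetical two-stage pipeline using an on-chip FIFO (channel) instead
 * of shared memory, in the style of Intel's cl_intel_channels extension.  */
#pragma OPENCL EXTENSION cl_intel_channels : enable

channel uint pkt_ch __attribute__((depth(64)));   /* FIFO between the stages */

__kernel void stage_ingress(__global const uint *in, uint n) {
    for (uint i = 0; i < n; i++)
        write_channel_intel(pkt_ch, in[i]);        /* push a word downstream  */
}

__kernel void stage_process(__global uint *out, uint n, uint key) {
    for (uint i = 0; i < n; i++)
        out[i] = read_channel_intel(pkt_ch) ^ key; /* toy per-word transform  */
}
```

Because the two kernels run concurrently and hand data over word by word, results start flowing before the whole input has arrived, which is exactly the streaming behavior that a DRAM-based, batch-at-a-time model cannot provide.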
Low-latency streaming processing demands the most efficient communication possible. However, because of limited parallelism and operating-system scheduling, the CPU is not efficient at communication, and its latency is unstable. Moreover, communication inevitably involves scheduling and arbitration; the CPU's scheduling and arbitration performance is constrained by single-core performance and by inefficient inter-core communication, whereas hardware excels at this kind of repetitive work. I therefore see the FPGA as the "big steward" of communication: whether between servers, between virtual machines, between processes, or between the CPU and storage devices, communication can all be accelerated by the FPGA.

Success and failure spring from the same source: having no instructions is both the FPGA's strength and its weakness. Every different task occupies a certain amount of FPGA logic resources; if the tasks are complex and not highly repetitive, they consume a large amount of logic, much of which then sits idle, and in that case a von Neumann processor is the better fit. Many data-center tasks, however, exhibit strong locality and repetitiveness: one part is the networking and storage work the virtualization platform must do, which is all communication; another part is customer computing tasks such as machine learning and encryption/decryption. The priority should be to use the FPGA for what it does best, communication; perhaps in the future, as AWS does, FPGAs can also be rented out to customers as compute accelerators.

Whether for communication, machine learning, or encryption/decryption, the algorithms are complex. Trying to replace the CPU with the FPGA outright would waste a great deal of FPGA logic and drive up the development cost of FPGA programs. A more practical approach is for the FPGA and the CPU to work together: tasks with locality and repetitiveness go to the FPGA, and complex tasks are left to the CPU.

As we accelerate more and more services such as Bing search and deep learning with FPGAs; as the data planes of foundational components such as network and storage virtualization come under FPGA control; as the "data center acceleration plane" formed by FPGAs becomes a formidable barrier between the network and the servers... it seems the FPGA will take charge, and the tasks on the CPU will become fragmented, driven by the FPGA. In the past the CPU was dominant and offloaded repetitive computing tasks to the FPGA; in the future, will the FPGA be dominant and offload complex computing tasks to the CPU? With the arrival of Xeon + FPGA, will the ancient SoC see a renaissance in the data center?

"Crossing the memory wall and reaching a fully programmable world."

References:
[1] Large-Scale Reconfigurable Computing in a Microsoft Datacenter. https://www.microsoft.com/en-us/research/wp-content/uploads/2014/06/HC26.12.520-Recon-Fabric-Pulnam-Microsoft-Catapult.pdf
[2] A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services, ISCA'14. https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/Catapult_ISCA_2014.pdf
[3] Microsoft Has a Whole New Kind of Computer Chip—and It'll Change Everything
[4] A Cloud-Scale Acceleration Architecture, MICRO'16. https://www.microsoft.com/en-us/research/wp-content/uploads/2016/10/Cloud-Scale-Acceleration-Architecture.pdf
[5] ClickNP: Highly Flexible and High-performance Network Processing with Reconfigurable Hardware. Microsoft Research.
[6] Daniel Firestone. SmartNIC: Accelerating Azure's Network with FPGAs on OCS servers.