Unveiling the Power of FPGA in Modern Computing

In recent years, the concept of FPGA has become increasingly prevalent.

For example, Bitcoin mining has seen the use of FPGA-based mining machines. Additionally, Microsoft previously announced its intention to use FPGAs in data centers as a “replacement” for CPUs, among other applications.

For professionals, FPGAs are not a foreign concept, as they have been widely used for years. However, many people are still unfamiliar with them and have plenty of questions: What exactly is an FPGA? Why should we use it? How does an FPGA compare with CPUs, GPUs, and ASICs (Application-Specific Integrated Circuits)?

Today, with these questions in mind, let's unveil the FPGA together.

1. Why Use FPGA?

As we all know, Moore’s Law for general-purpose processors (CPUs) has reached its twilight, while the scale of machine learning and web services is growing exponentially.

People are using custom hardware to accelerate common computational tasks, but the rapidly changing industry demands that this custom hardware be reprogrammable to perform new types of computational tasks.

FPGA is precisely a hardware-reconfigurable architecture. Its full English name is Field Programmable Gate Array.

For many years, FPGAs have been used as a small-batch alternative to ASICs, but in recent years, they have been deployed on a large scale in data centers of companies like Microsoft and Baidu, to simultaneously provide powerful computational capabilities and sufficient flexibility.

Comparison of performance and flexibility among different architectures

Why is FPGA fast? Largely because of "the company it keeps": it looks fast mainly in comparison with its peers, the CPU and the GPU.

CPUs and GPUs both belong to the von Neumann architecture, with instruction decoding and execution and with shared memory. The fundamental reason FPGAs are more energy-efficient than CPUs, and even GPUs, is the architectural benefit of having no instructions and needing no shared memory.

In the von Neumann architecture, because an execution unit (such as a CPU core) can execute arbitrary instructions, it needs instruction memory, a decoder, arithmetic units for the various instructions, and branch/jump handling logic. Because the control logic for an instruction stream is complex, there cannot be too many independent instruction streams; this is why GPUs use SIMD (Single Instruction, Multiple Data) to let many execution units process different data in lockstep, and why CPUs also support SIMD instructions.

In contrast, the function of each logic unit in an FPGA is fixed when the configuration is programmed (burned) into the device, so no instructions are needed.

In the von Neumann structure, memory serves two purposes: one is to store state, and the other is for communication between execution units.

Since memory is shared, access arbitration is necessary; and to exploit locality of access, each execution unit has a private cache, which in turn requires keeping the caches of the execution units coherent.

For the need to store state, registers and on-chip memory (BRAM) in FPGA belong to their respective control logic, eliminating unnecessary arbitration and caching.

For communication needs, the connections between each logic unit in FPGA and surrounding logic units are determined during reprogramming (burning), and do not require communication through shared memory.
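
To make the contrast concrete, here is a minimal conceptual sketch in C in the style accepted by high-level synthesis (HLS) tools. The function, its local arrays, and the comments about pipelining are illustrative assumptions, not code from any of the systems discussed here; the point is that the loop body becomes dedicated logic with no instruction fetch or decode, the state lives in registers and BRAM owned by this one block, and data flows in and out over point-to-point connections instead of a shared memory.

```c
#define TAPS 8

/* A fixed-function streaming element: an 8-tap FIR filter.
 * After synthesis there are no instructions to fetch or decode;
 * the loop nest below simply becomes the circuit. */
void fir_element(const int in_stream[], int out_stream[], int n)
{
    /* Private state, mapped to on-chip registers/BRAM owned by this
     * element: no arbitration and no cache coherence are needed. */
    static const int coeff[TAPS] = {1, 2, 3, 4, 4, 3, 2, 1};
    int window[TAPS] = {0};

    for (int i = 0; i < n; i++) {
        /* An HLS tool would pipeline this loop: one sample in and
         * one result out per clock cycle once the pipeline is full. */
        for (int t = TAPS - 1; t > 0; t--)
            window[t] = window[t - 1];       /* shift register in flip-flops */
        window[0] = in_stream[i];

        int acc = 0;
        for (int t = 0; t < TAPS; t++)
            acc += coeff[t] * window[t];     /* unrolled into TAPS multipliers */

        /* The result goes straight to the next element over a dedicated
         * wire/FIFO fixed at configuration time, not via shared memory. */
        out_stream[i] = acc;
    }
}
```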

Having said all this at a high level, how does FPGA actually perform? Let’s look at both computationally intensive tasks and communication-intensive tasks.

Examples of computationally intensive tasks include matrix operations, image processing, machine learning, compression, asymmetric encryption, and sorting in Bing search. Generally, for these tasks, the CPU offloads the task to FPGA for execution. For these tasks, the integer multiplication performance of the Altera (which seems to be called Intel now, but I still prefer to call it Altera…) Stratix V FPGA is roughly equivalent to that of a 20-core CPU, while the floating-point multiplication performance is comparable to that of an 8-core CPU but lower than that of a GPU by an order of magnitude. The next-generation FPGA, Stratix 10, which we are about to use, will be equipped with more multipliers and hardware floating-point units, theoretically achieving computational capabilities comparable to today’s top GPU computing cards.

Estimated integer multiplication capabilities of FPGA (not using DSP, estimated based on logic resource usage)

Estimated floating-point multiplication capabilities of FPGA (float16 using soft cores, float32 using hard cores)
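
As a rough way to read such estimates (my own back-of-the-envelope framing, not necessarily the methodology behind the figures above), peak multiply throughput is simply the number of multipliers that can be instantiated on the chip times the clock rate they sustain:

$$\text{peak multiplies/s} \approx N_{\text{multipliers}} \times f_{\text{clock}}.$$

A few thousand parallel multipliers running at a few hundred MHz already give on the order of $10^{12}$ multiplications per second, which is the order of magnitude being compared against multi-core CPUs and GPUs here.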

In data centers, the core advantage of FPGA compared to GPU is latency.

For tasks like sorting in Bing search, to return search results as quickly as possible, minimizing the latency of each step is essential.

When using a GPU for acceleration, the batch size cannot be too small if the GPU's computational power is to be fully utilized, which pushes latency into the millisecond range.

In contrast, using FPGA for acceleration only requires microsecond-level PCIe latency (our current FPGA functions as a PCIe accelerator card).

In the future, when Intel launches Xeon + FPGA connected via QPI, the latency between CPU and FPGA can be reduced to below 100 nanoseconds, comparable to accessing main memory.

Why does FPGA have so much lower latency than GPU?

This is fundamentally due to architectural differences.

FPGA possesses both pipeline parallelism and data parallelism, while GPU primarily has data parallelism (with limited pipeline depth).

For example, if processing a data packet involves 10 steps, an FPGA can build a 10-stage pipeline in which different stages process different packets; a packet is fully processed once it has passed through all 10 stages, and as soon as it is done it can be output immediately.

In contrast, the GPU’s data parallel approach involves 10 computing units, each processing different packets, but all computing units must operate in sync and perform the same task (SIMD, Single Instruction Multiple Data). This requires that 10 packets must be input and output together, increasing input-output latency.

When tasks arrive one by one rather than in batches, pipeline parallelism can achieve lower latency than data parallelism. Therefore, for streaming computing tasks, FPGA inherently has a latency advantage over GPU.
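
A small worked example makes the difference concrete; the stage time, batch size, and arrival rate below are illustrative assumptions, not measurements from the systems discussed here:

$$T_{\text{pipeline}} \approx 10 \times t_{\text{stage}}, \qquad T_{\text{batch}} \approx \frac{B - 1}{\lambda} + T_{\text{compute}},$$

where $(B-1)/\lambda$ is the time spent waiting for the batch to fill. With $t_{\text{stage}} = 100$ ns, a packet leaves the FPGA pipeline about 1 microsecond after it arrives, no matter how many other packets are in flight. With a batch size of $B = 1000$ and packets arriving at $\lambda = 10^6$ per second, the first packet of each batch waits roughly 1 ms just for the batch to fill before any computation starts.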

Comparison of order of magnitude among CPU, GPU, FPGA, and ASIC for computationally intensive tasks (using 16-bit integer multiplication as an example, numbers are rough estimates)

ASICs are impeccable in terms of throughput, latency, and power consumption, but Microsoft has not adopted them for two reasons:

  • Computational tasks in data centers are flexible and variable, while ASIC development is costly and time-consuming. After painstakingly deploying a batch of accelerators for a particular neural network, if another neural network becomes more popular, the investment is wasted. FPGAs can update their logical functions in just a few hundred milliseconds. The flexibility of FPGAs protects investments; in fact, Microsoft’s current FPGA applications differ significantly from the original vision.

  • Data centers are rented out to different tenants. If some machines have neural network accelerator cards, some have Bing search accelerator cards, and others have network virtualization accelerator cards, scheduling tasks and maintaining servers can become quite cumbersome. Using FPGAs can maintain homogeneity in the data center.

Next, let’s look at communication-intensive tasks.

Compared to computationally intensive tasks, communication-intensive tasks involve relatively simple processing for each input data, often just a simple calculation before output, making communication the bottleneck. Examples of communication-intensive tasks include symmetric encryption, firewalls, and network virtualization.

Comparison of order of magnitude among CPU, GPU, FPGA, and ASIC for communication-intensive tasks (using 64-byte network packet processing as an example, numbers are rough estimates)

For communication-intensive tasks, FPGA has even greater advantages over CPUs and GPUs.

In terms of throughput, FPGAs can directly connect to 40 Gbps or even 100 Gbps network cables, processing packets of any size at wire speed; CPUs need to receive packets from the network card before processing them, and many network cards cannot process 64-byte packets at wire speed. Although multiple network cards can be inserted to achieve high performance, the number of PCIe slots supported by CPUs and motherboards is often limited, and network cards and switches can be quite expensive.
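
The arithmetic behind "wire speed for 64-byte packets" shows how tight the budget is. On Ethernet, every 64-byte frame carries an extra 20 bytes of overhead on the wire (8-byte preamble plus 12-byte inter-frame gap), so:

$$\frac{40\ \text{Gbps}}{(64 + 20)\times 8\ \text{bits/packet}} \approx 59.5\ \text{Mpps},$$

i.e., roughly 17 ns of processing budget per packet, less than the latency of a single main-memory access, which is why many network cards and CPU cores cannot keep up at the minimum packet size.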

In terms of latency, a packet has to go from the network card up to the CPU and then back out through the network card; even with a high-performance packet processing framework like DPDK, the latency is around 4-5 microseconds. A more serious issue is that the latency of a general-purpose CPU is not stable. For example, under heavy load, forwarding latency can rise to tens of microseconds or even higher (as shown in the figure below); clock interrupts and task scheduling in modern operating systems also add to the latency uncertainty.

Comparison of forwarding latency between ClickNP (FPGA), Dell S6000 switch (commercial switch chip), Click+DPDK (CPU), and Linux (CPU); error bars indicate the 5th and 95th percentiles. Source: [5]

Although GPUs can also process packets at high performance, they do not have network ports, meaning that packets must first be received from the network card before the GPU can process them. This limits throughput to that of the CPU and/or network card. The latency of the GPU itself is even more concerning.

So why not incorporate these network functions into network cards or use programmable switches? The flexibility of ASIC remains a significant drawback.

Although increasingly powerful programmable switch chips are emerging, such as Tofino, which supports the P4 language, ASICs still cannot handle complex stateful processing, such as certain custom encryption algorithms.

In summary, the main advantages of FPGA in data centers are stable and extremely low latency, making them suitable for both streaming computational tasks and communication-intensive tasks.

2. Microsoft’s Practice of Deploying FPGA

In September 2016, Wired magazine published an article titled “Microsoft Bets the Future on FPGA,” detailing the history of the Catapult project.

Following that, Doug Burger, the head of the Catapult project, demonstrated FPGA-accelerated machine translation alongside Microsoft CEO Satya Nadella at the Ignite 2016 conference.

The total computational capacity of the demonstration was 1.03 million T ops per second, roughly 1 Exa-op/s, equivalent to about 100,000 top-end GPU computing cards. The power consumption of a single FPGA (including on-board memory and network interfaces) is about 30 W, adding only one-tenth to the server's total power consumption.

Demonstration at Ignite 2016: 1 Exa-op (10^18) machine translation computational capability per second

The deployment of FPGA by Microsoft has not been without its challenges. Regarding where to deploy FPGA, it has generally gone through three stages:

  • Dedicated FPGA clusters filled with FPGAs

  • Each machine has one FPGA, connected via dedicated networks

  • Each machine has one FPGA, positioned between the network card and switch, sharing the server network

Three stages of Microsoft’s FPGA deployment methods, source: [3]

The first stage involved dedicated clusters filled with FPGA accelerator cards, resembling a supercomputer made up entirely of FPGAs.

The following image shows the earliest BFB experimental board, which housed 6 FPGAs on a PCIe card, with 4 PCIe cards inserted into each 1U server.

The earliest BFB experimental board, which housed 6 FPGAs. Source: [1]

Notably, the name of the company is visible. In the semiconductor industry, as long as the batch is large enough, the price of chips will approach the price of sand. It is rumored that it was precisely because this company refused to offer “sand prices” that another company was chosen.

Of course, now both companies’ FPGAs are used in the data center space. As long as the scale is sufficient, concerns about FPGA prices being too high will be unnecessary.

The earliest BFB experimental board, with 4 FPGA cards inserted into a 1U server. Source: [1]

Deploying FPGAs in this supercomputer-like manner means filling a dedicated cabinet with servers that each carry 24 FPGAs (as shown on the left of the figure below).

This approach has several issues:

  • FPGAs in different machines cannot communicate, limiting the scale of problems that FPGAs can address to the number of FPGAs in a single server;

  • Other machines in the data center have to funnel their tasks to this cabinet, creating incast congestion and making network latency hard to keep stable;

  • FPGA dedicated cabinets create single points of failure; if one fails, no one can accelerate anything;

  • Servers equipped with FPGAs are custom-built, complicating cooling and maintenance.

Three deployment methods for FPGA, from centralized to distributed. Source: [1]

A less aggressive approach is to deploy one server filled with FPGAs in each cabinet (as shown in the figure above). This avoids the second and third issues above, but the first and fourth remain unresolved.

The second stage aimed to ensure homogeneity among servers in the data center (which is also a significant reason for not using ASICs) by inserting one FPGA into each server (as shown on the right), with FPGAs connected via dedicated networks. This was also the deployment method adopted in Microsoft’s paper presented at ISCA’14.

Open Compute Server in the rack. Source: [1]

Interior view of Open Compute Server. The red box indicates the location for FPGA. Source: [1]

Open Compute Server with FPGA inserted. Source: [1]

Connection and fixation between FPGA and Open Compute Server. Source: [1]

The FPGA used is a Stratix V D5, with 172K ALMs, 2,014 M20K on-chip memory blocks, and 1,590 DSPs. The board carries 8 GB of DDR3-1333 memory, a PCIe Gen3 x8 interface, and two 10 Gbps network ports. FPGAs in the cabinet are connected by a dedicated network: one group of 10G ports is wired into rings of 8, and another group into rings of 6, without using any switches.

Network connection method between FPGAs in the cabinet. Source: [1]

This cluster of 1632 servers, each with one FPGA, has doubled the overall performance of sorting Bing’s search results (in other words, halving the number of servers needed).

As shown in the figure below, every 8 FPGAs are connected in a chain, using the aforementioned 10 Gbps dedicated network cables for communication. Each of these 8 FPGAs has specific responsibilities: some extract features from documents (yellow), some compute feature expressions (green), and some calculate document scores (red).

FPGA accelerating the sorting process of Bing’s search results. Source: [1]

FPGA not only reduced the latency of Bing search but also significantly improved the stability of that latency. Source: [4]

Both local and remote FPGAs can reduce search latency, with the communication latency of remote FPGAs being negligible compared to search latency. Source: [4]

The deployment of FPGA in Bing has been successful, and the Catapult project continues to expand within the company.

The department within Microsoft with the most servers is the cloud computing Azure division.

The urgent problem Azure needs to solve is the overhead brought by network and storage virtualization. Azure sells virtual machines to customers, which require firewall, load balancing, tunneling, NAT, and other network functions.

As the physical storage for cloud storage is separated from computing nodes, data must be transported over the network from storage nodes, which also requires compression and encryption.

In the era of 1 Gbps networks and mechanical hard drives, the CPU overhead of network and storage virtualization was negligible. However, as network speeds have increased to 40 Gbps and SSDs can achieve a throughput of 1 GB/s, CPUs are gradually becoming overwhelmed.

For instance, the Hyper-V virtual switch can only handle around 25 Gbps of traffic, which is insufficient for 40 Gbps wire speed, and performs worse with smaller data packets; AES-256 encryption and SHA-1 signing can only be processed at 100 MB/s per CPU core, which is only one-tenth of an SSD’s throughput.
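
The arithmetic behind that last figure: at roughly 100 MB/s of AES-256 plus SHA-1 per core (the number quoted above), keeping up with a single 1 GB/s SSD already takes about

$$\frac{1\ \text{GB/s}}{100\ \text{MB/s per core}} = 10\ \text{cores},$$

and saturating a 40 Gbps link (5 GB/s) with encrypted traffic would take on the order of 50 cores, before any actual storage or network work is done.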

Number of CPU cores needed for network tunnel protocol and firewall processing at 40 Gbps. Source: [5]

To accelerate network functions and storage virtualization, Microsoft has deployed FPGAs between network cards and switches.

As shown in the figure below, each FPGA has a 4 GB DDR3-1333 DRAM, connected to a CPU socket via two PCIe Gen3 x8 interfaces (physically a PCIe Gen3 x16 interface, as the FPGA does not have a x16 hard core, logically used as two x8). The physical network card (NIC) is just a standard 40 Gbps network card, used solely for communication between the host and the network.

Architecture of Azure servers deploying FPGA. Source: [6]

FPGA (SmartNIC) virtualizes a network card for each virtual machine, allowing virtual machines to directly access this virtual network card via SR-IOV. The data plane functions that were originally in the virtual switch have been moved to the FPGA, meaning that virtual machines can send and receive network packets without CPU involvement or passing through the physical network card (NIC). This not only saves CPU resources that can be sold but also improves the network performance of virtual machines (25 Gbps), reducing network latency between virtual machines in the same data center by a factor of ten.

Accelerated architecture for network virtualization. Source: [6]

This is the third-generation architecture for Microsoft’s FPGA deployment, which is also the architecture currently adopted for large-scale deployment of “one FPGA per server.”

The original intention of putting FPGAs on the host network was to accelerate networking and storage, but the more far-reaching impact is that it extends the network connections between FPGAs to the scale of the entire data center, creating a truly cloud-scale "supercomputer."

In the second-generation architecture, the network connections between FPGAs were limited to within the same rack, and the dedicated interconnection method between FPGAs was difficult to scale, while forwarding via CPU incurred too high overhead.

In the third-generation architecture, FPGAs communicate via LTL (Lightweight Transport Layer). The latency within the same rack is under 3 microseconds; within 8 microseconds, it can reach 1000 FPGAs; and within 20 microseconds, it can reach all FPGAs in the same data center. Although the second-generation architecture had lower latency within 8 machines, it could only access 48 FPGAs via the network. To support extensive communication between FPGAs, the LTL in the third-generation architecture also supports PFC flow control protocol and DCQCN congestion control protocol.

Vertical axis: Latency of LTL; Horizontal axis: Number of reachable FPGAs. Source: [4]

Logical module relationships within FPGA, where each Role represents user logic (such as DNN acceleration, network function acceleration, encryption), while the outer part is responsible for communication between Roles and communication between Roles and peripherals. Source: [4]

Data center acceleration plane composed of FPGAs, located between the network switching layer (TOR, L1, L2) and traditional server software (software running on CPU). Source: [4]

Through high-bandwidth, low-latency network interconnections, FPGAs form a data center acceleration plane situated between the network switching layer and traditional server software.

In addition to the network and storage virtualization acceleration required by every server providing cloud services, the remaining resources on FPGAs can also be used to accelerate tasks like Bing search and deep neural networks (DNN).

For many types of applications, as the scale of distributed FPGA accelerators expands, their performance improvements are superlinear.

For example, in CNN inference, when only one FPGA is used, the on-chip memory is insufficient to hold the entire model, necessitating continuous access to model weights in DRAM, creating a performance bottleneck. However, if there are enough FPGAs, each FPGA can handle one layer or several features within a layer, allowing the model weights to be entirely loaded into on-chip memory, thus eliminating the DRAM performance bottleneck and fully leveraging the computational capabilities of the FPGA.

Of course, overly granular division can also lead to increased communication overhead. The key to distributing tasks across a cluster of FPGAs is balancing computation and communication.
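
A rough calculation, using the Stratix V D5 figures quoted earlier and a hypothetical model size, shows where the single-FPGA DRAM bottleneck comes from:

$$2{,}014 \times 20\ \text{Kbit} \approx 40\ \text{Mbit} \approx 5\ \text{MB of on-chip memory per FPGA}.$$

A CNN with, say, 50 million weights (a hypothetical figure) needs 50 MB even at one byte per weight, so a single FPGA has to stream weights from DRAM continuously, whereas a group of ten or more FPGAs, each holding the weights of its own layers on chip, keeps DRAM off the critical path entirely.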

Mapping from neural network models to FPGAs on HaaS. Utilizing the parallelism within the model, different layers and different features are mapped to different FPGAs. Source: [4]

At the MICRO’16 conference, Microsoft proposed the concept of Hardware as a Service (HaaS), which allows hardware to be treated as a schedulable cloud service, enabling centralized scheduling, management, and large-scale deployment of FPGA services.

Hardware as a Service (HaaS). Source: [4]

From the first generation of dedicated clusters filled with FPGAs to the second generation of clusters connected via dedicated networks, to the current large-scale FPGA cloud utilizing the data center network, three guiding principles have shaped our trajectory:

  • Hardware and software are not in a relationship of replacement, but rather cooperation;

  • Flexibility is essential, meaning the ability to be software-defined;

  • Scalability is a must.

3. The Role of FPGA in Cloud Computing

Finally, I'd like to share my personal thoughts on the role of FPGA in cloud computing. As a third-year PhD student doing research at Microsoft Research Asia, I am trying to answer two questions:

  • What role should FPGA play in cloud-scale network interconnection systems?

  • How can we efficiently and scalably program heterogeneous systems of FPGA + CPU?

What I find most regrettable about FPGA in industry today is that the mainstream approach to FPGAs in data centers, from the internet giants other than Microsoft to the two major FPGA vendors and academia, largely treats the FPGA as a GPU-like computing accelerator for computationally intensive tasks. But is the FPGA really well suited for the tasks usually given to GPUs?

As previously mentioned, the biggest difference between FPGA and GPU lies in architecture, with FPGA being more suited for low-latency streaming processing, while GPU is better for processing large batches of homogeneous data.

Since many people plan to use FPGA as a computing accelerator, the high-level programming models introduced by the two major FPGA manufacturers are also based on OpenCL, mimicking the batch processing model based on shared memory used by GPUs. For a CPU to assign a task to an FPGA, it first needs to load data into the FPGA board’s DRAM, then instruct the FPGA to begin execution, after which the FPGA returns the execution results to DRAM, and the CPU is notified to retrieve them.

Why take a detour through the board's DRAM when the CPU and FPGA could communicate efficiently and directly over PCIe? Perhaps it is a matter of engineering effort, but we measured that an OpenCL round trip of writing to DRAM, launching a kernel, and reading back from DRAM takes about 1.8 milliseconds, whereas communication via PCIe DMA takes only 1-2 microseconds.

Performance comparison between PCIe I/O channel and OpenCL. The vertical axis is on a logarithmic scale. Source: [5]
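
For illustration, here is a minimal sketch of the host-side round trip being measured here, using the standard OpenCL C API. Context, queue, kernel, and buffer creation are assumed to have been done elsewhere, and this is not Microsoft's or the vendors' code; it only shows why every invocation touches the board's DRAM.

```c
#include <CL/cl.h>

/* One FPGA invocation in the OpenCL batch model:
 * host memory -> board DRAM -> kernel -> board DRAM -> host memory. */
void run_once(cl_command_queue queue, cl_kernel kernel,
              cl_mem in_buf, cl_mem out_buf,
              const void *in, void *out, size_t nbytes)
{
    /* 1. Copy the input from host memory into the FPGA board's DRAM. */
    clEnqueueWriteBuffer(queue, in_buf, CL_TRUE, 0, nbytes, in, 0, NULL, NULL);

    /* 2. Point the kernel at the buffers and tell the FPGA to start. */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &in_buf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &out_buf);
    clEnqueueTask(queue, kernel, 0, NULL, NULL);   /* single work-item kernel */

    /* 3. Copy the result back from board DRAM into host memory. */
    clEnqueueReadBuffer(queue, out_buf, CL_TRUE, 0, nbytes, out, 0, NULL, NULL);
    clFinish(queue);

    /* Each step crosses PCIe and stages data in board DRAM, which is where
     * the ~1.8 ms round trip comes from; a bare PCIe DMA message between
     * CPU and FPGA skips the DRAM staging and costs only microseconds. */
}
```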

Communication between multiple kernels in OpenCL is even worse: the default mechanism also goes through shared memory.

This article opened by saying that the fundamental architectural advantage that makes FPGAs more energy-efficient than CPUs and GPUs is that they have no instructions and need no shared memory. For sequential, FIFO-style communication between kernels, shared memory is entirely unnecessary. Moreover, the DRAM on FPGA boards is generally much slower than the DRAM on GPUs.

Therefore, we propose the ClickNP networking programming framework [5], utilizing channels for communication between execution units (elements/kernels) and between execution units and host software.

Applications that appear to need shared memory can also be implemented on top of pipes, since CSP (Communicating Sequential Processes) and shared memory are theoretically equivalent. ClickNP is still a framework based on OpenCL, constrained by what the C language can express as a hardware description (although HLS is indeed far more productive than Verilog); the ideal hardware description language is unlikely to be C.

ClickNP using channels for communication between elements, source: [5]

ClickNP using channels for communication between FPGA and CPU, source: [5]
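
To give a flavor of what channel-based communication looks like at the kernel level, here is a minimal sketch using Intel's OpenCL channels extension. It illustrates the idea of ClickNP-style elements connected by FIFOs, but it is not ClickNP's actual API, and the extension and intrinsic names may vary across SDK versions.

```c
#pragma OPENCL EXTENSION cl_intel_channels : enable

/* An on-chip FIFO connecting two kernels ("elements").
 * Data flows producer -> consumer without touching board DRAM. */
channel uint pkt_ch __attribute__((depth(64)));

__kernel void producer(__global const uint *in, uint n)
{
    for (uint i = 0; i < n; i++)
        write_channel_intel(pkt_ch, in[i]);   /* push to the downstream element */
}

__kernel void consumer(__global uint *out, uint n)
{
    for (uint i = 0; i < n; i++) {
        uint v = read_channel_intel(pkt_ch);  /* pop from the FIFO */
        out[i] = v ^ 0xFFFFFFFFu;             /* stand-in for per-packet work */
    }
}
```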

Low-latency streaming processing requires maximum efficiency in communication.

However, because of limits on parallelism and scheduling by the operating system, the CPU is not very efficient at communication, and its latency is unstable.

Moreover, communication inevitably involves scheduling and arbitration, and CPUs, due to single-core performance limitations and low efficiency of inter-core communication, have limited scheduling and arbitration performance, making hardware more suitable for such repetitive tasks. Therefore, I define FPGA as the “big steward” of communication, capable of accelerating communication between servers, between virtual machines, between processes, and between CPUs and storage devices.

Success and failure are intertwined. The lack of instructions is both an advantage and a weakness of FPGA.

Every distinct task requires a certain amount of FPGA logic resources. If the tasks are complex and not highly repetitive, they will occupy a large amount of logic resources, most of which will sit idle at any given time. In such cases it is better to use a von Neumann architecture processor.

Many tasks in data centers exhibit strong locality and repetitiveness: some belong to the network and storage functions required by the virtualization platform, while others are customers' computing tasks, such as machine learning and encryption/decryption.

By first applying the FPGA to communication, where it excels, FPGAs may later also be rented out to customers as computing accelerators, as AWS does.

Whether for communication, machine learning, or encryption/decryption, the algorithms involved are complex; attempting to completely replace CPUs with FPGAs will inevitably lead to significant wastage of FPGA logical resources and increase the development costs of FPGA programs. A more practical approach is for FPGA and CPU to work in collaboration, with FPGA handling tasks that have strong locality and repetitiveness, and CPUs managing the complex ones.

As we increasingly utilize FPGA to accelerate services like Bing search and deep learning; as the data plane for network virtualization and storage virtualization is dominated by FPGA; as the “data center acceleration plane” composed of FPGAs becomes a crucial link between networks and servers… it seems that FPGA will take the helm, while computational tasks on CPUs become fragmented and driven by FPGA. Previously, we relied on CPUs to offload repetitive computational tasks to FPGAs; will it shift to FPGAs taking the lead and offloading complex computational tasks to CPUs? With the advent of Xeon + FPGA, will the ancient SoC experience a renaissance in data centers?

"Across the memory wall and reach a fully programmable world."

References:

[1] Large-Scale Reconfigurable Computing in a Microsoft Datacenter https://www.microsoft.com/en-us/research/wp-content/uploads/2014/06/HC26.12.520-Recon-Fabric-Pulnam-Microsoft-Catapult.pdf

[2] A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services, ISCA’14 https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/Catapult_ISCA_2014.pdf

[3] Microsoft Has a Whole New Kind of Computer Chip—and It’ll Change Everything

[4] A Cloud-Scale Acceleration Architecture, MICRO’16 https://www.microsoft.com/en-us/research/wp-content/uploads/2016/10/Cloud-Scale-Acceleration-Architecture.pdf

[5] ClickNP: Highly Flexible and High-performance Network Processing with Reconfigurable Hardware – Microsoft Research

[6] Daniel Firestone, SmartNIC: Accelerating Azure's Network with FPGAs on OCS Servers.

Author Introduction:

Li Bojie, PhD candidate at Microsoft Research Asia, University of Science and Technology of China

This article is reprinted with the author’s permission.
