Editor’s note (Alibaba Sister): The rapid growth of data centers and cloud computing, the seemingly endless demand for computing power from applications such as AI, video, and gene sequencing, and the effective stagnation of Moore’s Law have together created enormous application potential and business opportunities for heterogeneous acceleration. However, FPGA as a Service (FaaS) solutions still have a high threshold. Today, let’s explore where the difficulties of FaaS lie and how we achieve true FaaS at Alibaba.
1. Introduction
In recent years, the data center (DC) and cloud computing fields have developed rapidly. Vendors at home and abroad have launched strategies such as “DC First,” “All in Cloud,” and “Cloud or Dead,” and companies have poured into the DC and cloud computing space regardless of their previous main business. However, as in any ICT field, after sufficient competition the market will stabilize, with the top two or three vendors dominating and the others sharing the remaining pie. On April 24, Gartner released a report showing that Alibaba Cloud holds a 19.6% market share and ranks first in the Asia-Pacific region, with AWS and Microsoft second and third, respectively. Globally, the ranking remains AWS, Microsoft, and Alibaba Cloud in first, second, and third place.
Alibaba Cloud’s FPGA as a Service (hereinafter referred to as FaaS) Shuntian platform is a leader and pioneer in the field of FPGA heterogeneous acceleration, and an advocate and builder of a healthy ecosystem in this area. Building on Alibaba Cloud’s million paying enterprise customers and its powerful Feitian operating system, the FaaS Shuntian platform serves internally as the infrastructure for Alibaba Group’s FPGA acceleration business; externally, it significantly lowers the threshold for developing and using FPGA, striving to provide customers with the most cost-effective computing power and to build a healthy FPGA acceleration ecosystem.
2. Differences Between Traditional FPGA Applications and FaaS
Due to its strong flexibility, FPGA has gained extensive applications in thousands of vertical markets since its inception. However, these applications cannot be considered “cloud” or “service.” We know that traditional IT infrastructure lacks elasticity, which can easily lead to either being unable to support business peaks, resulting in system crashes, or facing business troughs where a large amount of IT resources remain idle, causing high costs. Therefore, one of the biggest differences between “cloud” and “non-cloud” lies in whether it supports the elastic scaling of resources: obtaining on-demand when needed and releasing at any time when not needed. To achieve “elasticity,” virtualization must be supported. If elasticity and virtualization cannot be achieved, it cannot be called FaaS, and fundamentally, it is no different from traditional FPGA usage.
From the perspective of FPGA design and usage, designing glue logic or running a simple control algorithm on an FPGA is not especially difficult, even with the existing thresholds. Having these capabilities, however, does not amount to being able to provide FaaS.
Firstly, the threshold for implementing complex algorithms on FPGA is very high (for example, implementing H.265 encoding); secondly, the threshold for implementing them efficiently is higher still (taking H.265 encoding as an example, a poor design may support only one 1080p/30fps H.265 stream on a large FPGA, while a good design may support four); finally, the threshold for delivering FPGA’s acceleration capability to customers via the “cloud” is the highest of all. Therefore, one core of FaaS is to make FPGA’s computing power “x86-like,” meaning that purchasing and using FPGA computing power in the cloud should be as simple as purchasing and using CPU computing power; the second core is to make FPGA’s computing power “service-oriented,” meaning that customers do not need to perform secondary development and adaptation and can use it through simple URL-like calls.
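The effect of design efficiency on cost can be shown with simple arithmetic. The following is a minimal sketch: the stream counts (one versus four per FPGA) come from the example above, while the hourly FPGA cost is a made-up figure used only for illustration.

```python
# Hypothetical figure for illustration only; not an Alibaba Cloud price.
FPGA_COST_PER_HOUR = 2.0  # assumed cost of one large FPGA, arbitrary currency units


def cost_per_stream(streams_per_fpga: int,
                    fpga_cost_per_hour: float = FPGA_COST_PER_HOUR) -> float:
    """Hourly cost of one H.265 stream when one FPGA carries `streams_per_fpga` streams."""
    if streams_per_fpga <= 0:
        raise ValueError("an FPGA must carry at least one stream")
    return fpga_cost_per_hour / streams_per_fpga


poor = cost_per_stream(1)  # poor design: one stream per FPGA
good = cost_per_stream(4)  # good design: four streams per FPGA
print(f"poor design: {poor:.2f} per stream-hour")
print(f"good design: {good:.2f} per stream-hour")
print(f"the good design is {poor / good:.0f}x cheaper per stream")
```

Whatever the real price of the FPGA, the ratio stays the same: quadrupling the streams per chip cuts the cost per stream to a quarter.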
3. Where Are the Difficulties of FaaS?
The value of FaaS can be viewed from three aspects:
- Firstly, transforming FPGA’s computing power from traditional “offline” output to “online” output;
- Secondly, forming a resource pool for FPGA’s computing power, allowing it to be summoned instantly during business peaks to easily cope with high business volumes and released promptly during business troughs to save costs;
- Thirdly, making FPGA algorithm IP modular, allowing customers to select suitable IP “blocks” based on actual business needs and quickly form targeted solutions.
These three aspects of value also represent the difficulties of FaaS.
1. Cloudification
The traditional way of using FPGA (the so-called “offline” model) generally involves the FPGA being soldered onto the motherboard, and the main control CPU configuring and controlling the FPGA either directly or via a CPLD bridge, with the FPGA’s storage space directly mapped to the CPU’s main memory. In a cloud environment, the FPGA’s board (usually called an FPGA acceleration card) is a PCIe device on the host machine’s motherboard, and cloud service customers use virtual machines and PCIe messages to configure and control the acceleration card (the FPGA). At this point, there is no direct connection between the FPGA and the host machine’s CPU, and traditional functions such as resetting, loading, status, and performance monitoring of the FPGA are no longer so “conventional” in a virtual machine (cloud) environment. In short, issues that are not a problem offline may become significant problems online (in the cloud). Without resolving these issues, it is impossible to make FPGA easy to use, and thus the goal of “universal computing power” and “cloud service” cannot be achieved.
FPGA cloud vendors need to create an adaptation layer between the FPGA driver layer and the customer’s software SDK layer, which should shield the underlying software and hardware details as much as possible, providing necessary control interfaces to the customer’s software SDK through APIs, allowing customers to call FPGA’s computing power in a simple manner like a URL call. In short, if ease of use cannot be achieved, despite FPGA providing extremely high cost-performance compared to CPU and GPU, it will still fall short against the powerful ecosystems of CPU and GPU. Customers have a simple demand: they want to get results with just a mouse click (they do not care whether the underlying computing power is provided by CPU, GPU, or FPGA) rather than having to read hundreds of pages of manuals and requiring three to five or more developers to adapt for three months before it can be used.
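The role of such an adaptation layer can be sketched in a few lines. This is a simulation under stated assumptions, not the Shuntian platform’s actual API: the class names, the service path `/v1/reverse`, and the image name are all hypothetical, and the “driver” simply reverses bytes in software where a real one would talk to the acceleration card over PCIe.

```python
from typing import Dict, Optional


class SimulatedFpgaDriver:
    """Stand-in for the low-level driver layer (hypothetical, software only)."""

    def __init__(self) -> None:
        self.loaded_image: Optional[str] = None

    def load_image(self, image_name: str) -> None:
        self.loaded_image = image_name

    def run(self, data: bytes) -> bytes:
        if self.loaded_image is None:
            raise RuntimeError("no image loaded")
        return data[::-1]  # pretend the accelerator reverses the payload


class AdaptationLayer:
    """Hides image loading and driver calls behind one URL-like entry point."""

    def __init__(self, driver: SimulatedFpgaDriver) -> None:
        self._driver = driver
        # Maps a service path to the FPGA image implementing it (hypothetical names).
        self._routes: Dict[str, str] = {"/v1/reverse": "reverse_ip.img"}

    def call(self, path: str, payload: bytes) -> bytes:
        image = self._routes.get(path)
        if image is None:
            raise KeyError(f"unknown service path: {path}")
        if self._driver.loaded_image != image:
            self._driver.load_image(image)  # the customer never sees this step
        return self._driver.run(payload)


layer = AdaptationLayer(SimulatedFpgaDriver())
print(layer.call("/v1/reverse", b"abc"))  # prints b'cba'
```

The caller supplies only a path and a payload; reset, loading, and driver details stay below the API boundary, which is the point of the layer described above.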
2. Pooling of Computing Power
Traditionally, since the FPGA is soldered onto the motherboard, the host machine’s CPU has 100% “ownership” and “usage rights” over that FPGA (or several FPGAs). Even when the FPGA sits idle much of the time, other host machines cannot use it (even if the hosts are connected by a network, whether WAN or LAN, or by direct cables). In a cloud environment, however, each host machine’s FPGA is part of a computing cluster, and each host machine (and the virtual machines running on it) can use not only the FPGA on its own board/machine but also the FPGAs of other host machines. The traditional usage method cannot meet demands like “providing 1.25 FPGAs or 3.5 FPGAs for a user,” but in a cloud environment, meeting such demands is a basic function of cloud services.
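A fractional request such as “1.25 FPGAs” only makes sense once capacity is pooled across hosts. The sketch below illustrates the bookkeeping under one simplifying assumption that is ours, not the article’s: capacity is handed out in quarter-FPGA slices. The class, host names, and greedy placement policy are hypothetical.

```python
from typing import Dict

SLICES_PER_FPGA = 4  # assumption: capacity is shared in quarter-FPGA units


class FpgaPool:
    """Tracks free capacity per host and serves requests from the whole cluster."""

    def __init__(self, fpgas_per_host: Dict[str, int]) -> None:
        self.free = {host: n * SLICES_PER_FPGA for host, n in fpgas_per_host.items()}

    def allocate(self, fpgas: float) -> Dict[str, int]:
        """Reserve `fpgas` worth of capacity; returns the slices taken from each host."""
        needed = round(fpgas * SLICES_PER_FPGA)
        if needed <= 0 or needed != fpgas * SLICES_PER_FPGA:
            raise ValueError("request must be a positive multiple of 1/4 FPGA")
        if needed > sum(self.free.values()):
            raise RuntimeError("pool exhausted")
        grant: Dict[str, int] = {}
        for host in sorted(self.free):  # simple greedy placement
            take = min(self.free[host], needed)
            if take:
                grant[host] = take
                self.free[host] -= take
                needed -= take
            if needed == 0:
                break
        return grant

    def release(self, grant: Dict[str, int]) -> None:
        """Return capacity to the pool when the business trough arrives."""
        for host, slices in grant.items():
            self.free[host] += slices


pool = FpgaPool({"host-a": 2, "host-b": 4})
small = pool.allocate(1.25)  # 5 slices, all from host-a
large = pool.allocate(3.5)   # 14 slices, spilling over from host-a to host-b
print(small, large, pool.free)
pool.release(small)
```

The second request spans two hosts, which a soldered-on FPGA owned by a single CPU cannot offer.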
3. Modularization of Algorithm IP
Algorithms (often referred to as IP) are the soul of FPGA; without algorithms, FPGA is essentially meaningless; with algorithms, FPGA can do almost anything, and its high flexibility is enabled by algorithms. Many third-party ISVs and independent developers in the industry have fully utilized FPGA’s high parallelism and pipelining characteristics to develop many efficient IPs that can effectively perform specific functions. More often than not, customer needs require the cooperation of multiple IPs to be met. Since there is no standard organization to define external interfaces for IPs, the interfaces developed by various ISVs/independent developers are diverse. Combining them into a solution often requires a significant amount of time and effort to develop the adaptation layer in between, thus losing the advantage of quickly forming solutions through IP combinations. Only when all IPs adhere to a unified interface standard can they be combined like Lego blocks to quickly form solutions.
This is, of course, an ideal state. In fact, FPGA devices have been around for over 30 years, and many ISVs design various IPs, but very few of them have managed to grow and scale. No such ISV emerged in FPGA’s traditional strongholds, and none has emerged yet even as FaaS becomes a trend. The main problem in developing IP lies in balancing “universality” and “specificity,” both in the IP algorithm itself and in its IO interfaces. Generally speaking, better universality means poorer performance, while heavily performance-tuned IPs often lack universality and require various adaptations and sacrifices during use. Likewise, supporting more IO interfaces raises the cost of the IP itself, while supporting only certain IO interfaces reduces cost but severely limits the IP’s range of application.
In the FaaS cloud era, due to FPGA’s outstanding cost-performance ratio compared to CPU or GPU in specific vertical fields, the performance of IP is often not the primary consideration, and since only computing power can be output in the cloud, the issue of supported IO types does not exist. The biggest obstacle to cloudification of FPGA computing power is that IP is still quite far from being a “service”; to fully leverage IP’s performance, customers often need to perform secondary development and a significant amount of software adaptation, which contradicts the overarching goal of reducing the usage threshold of FPGA and making FPGA’s computing power universally available.
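To make the “Lego block” idea from this subsection concrete, here is a minimal software sketch of what a unified IP interface buys. Everything in it is an assumption for illustration: the `IpBlock` interface is hypothetical, and the two “IPs” are ordinary software functions from Python’s `zlib` module standing in for FPGA logic.

```python
import zlib
from abc import ABC, abstractmethod
from typing import Callable, List


class IpBlock(ABC):
    """Hypothetical unified interface: every IP consumes bytes and produces bytes."""

    @abstractmethod
    def process(self, data: bytes) -> bytes: ...


class CompressIp(IpBlock):
    """Software stand-in for a compression IP."""

    def process(self, data: bytes) -> bytes:
        return zlib.compress(data)


class ChecksumIp(IpBlock):
    """Appends a 4-byte CRC32 so a downstream consumer can verify the payload."""

    def process(self, data: bytes) -> bytes:
        return data + zlib.crc32(data).to_bytes(4, "big")


def build_pipeline(blocks: List[IpBlock]) -> Callable[[bytes], bytes]:
    """Chain blocks in order; no per-pair adaptation code is needed."""

    def run(data: bytes) -> bytes:
        for block in blocks:
            data = block.process(data)
        return data

    return run


pipeline = build_pipeline([CompressIp(), ChecksumIp()])
output = pipeline(b"hello " * 10)
body, crc = output[:-4], output[-4:]
print(zlib.crc32(body).to_bytes(4, "big") == crc)  # prints True
```

Because both blocks share one interface, adding or reordering IPs only changes the list passed to `build_pipeline`. With vendor-specific interfaces, each pairing would need its own adaptation layer.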
4. Alibaba Cloud Shuntian Platform: Achieving True FaaS
From its inception, Alibaba Cloud’s FaaS Shuntian platform has made it its mission to make FPGA computing power universally available. By delivering FPGA computing power through the cloud, it provides customers with more cost-effective computing solutions, which is the value of the Shuntian platform. The platform not only effectively addresses the three difficulties mentioned above but also focuses on three areas of work to ensure that FaaS truly lives up to its name.
Firstly, in specific vertical markets where FPGA has obvious acceleration advantages, the platform aims to deliver FPGA’s cost-effective computing power truly as a “service.” The ecosystems of CPU and GPU are already very mature, with tens of thousands or even millions of developers globally. This allows customers to easily and quickly build PaaS/SaaS services on IaaS infrastructure such as ECS (Elastic Compute Service) and EGS (Elastic GPU Service).
However, because the FPGA ecosystem is imperfect and fragmented, most cloud FPGA users are not able to build PaaS/SaaS on top of EFS (Elastic FPGA Service) after purchasing it. This means that if cloud service providers only offer FPGA IaaS, customers will not be willing to pay; even IaaS+ with IP included still requires customers to perform secondary development and adaptation, which significantly reduces the appeal of FaaS. Therefore, to make FPGA as a Service a viable business model, it is essential to provide SaaS services based on FPGA IaaS/IaaS+. Only then can it compete with CPU/GPU and leverage FPGA’s advantages in cost-performance, low latency, and high flexibility.
Secondly, establishing a complete FPGA IP cloud market, bridging the gap between IP vendors and customers of FPGA heterogeneous computing cloud services: IP vendors can generate revenue through Alibaba Cloud’s FaaS IP market, allowing them to grow and design IP suitable for more vertical markets; customers can quickly form solutions by flexibly selecting IP in the IP cloud market, thus obtaining higher cost-performance computing power. As mentioned above, without an ecosystem, there is no FaaS; the ecosystem is crucial for the success of FaaS, with FPGA device manufacturers, FPGA cloud service providers, and a large number of FPGA IP independent developers and ISVs forming the three pillars of this ecosystem. Without any one of them, building the ecosystem becomes extremely challenging.
Relatively speaking, FPGA device manufacturers and cloud service providers should invest more funds and resources to support independent developers and ISVs. Meanwhile, cloud service providers should closely collaborate with ISVs to quickly implement applications in vertical fields where FPGA has clear advantages, creating demonstrations and benchmarks to attract more independent developers and ISVs to join the effort to build the FaaS ecosystem.
Thirdly, Alibaba Cloud’s FaaS Shuntian platform is also dedicated to establishing a cloud-based FPGA development environment and platform, lowering the thresholds for FPGA design, development, and verification, allowing customers, ISVs, and independent developers to focus on design itself without worrying about EDA tools, development environments, verification environments, and other aspects that do not significantly add value to the final business but require substantial time and effort.
Traditionally, FPGA has been a “heavy asset” application: it requires purchasing FPGA devices, development boards, EDA tools, and FPGA debugging instruments (such as logic analyzers), all of which contribute to the high threshold of FPGA applications. The licensing fees for EDA tools alone (including those provided by FPGA device manufacturers, such as Intel’s Quartus and XILINX’s Vivado) can be prohibitively expensive for small and medium-sized ISVs and independent developers. Using unlicensed software also poses significant risks, including software infringement and potential design flaws. The FaaS Shuntian platform effectively addresses these issues, significantly reducing the costs and thresholds of FPGA development and usage and laying a solid foundation for building a healthy FaaS ecosystem.
To deliver this value, the FaaS Shuntian platform has introduced numerous targeted design innovations.
1. Support for Mainstream FPGA Device Manufacturers
Currently, Alibaba Cloud’s FaaS Shuntian platform supports devices from both Intel and XILINX, making Alibaba Cloud the public cloud service provider with the most comprehensive FaaS product line globally. Customers who only want to use FPGA computing power for acceleration do not need to know or care which manufacturer’s FPGA sits underneath. Supporting both manufacturers is necessary for two reasons: on one hand, the devices and development environments of each have their own strengths; on the other hand, a significant portion of the third-party ISVs and independent developers designing for FaaS also target offline applications.
2. Hardware Design Innovation
The F3 instance of the FaaS Shuntian platform employs a high-density design of dual chips on a single card (XILINX’s VU9P chip), while most cloud service providers claiming to offer FaaS use a more conservative single card single chip solution. The typical power consumption of a single VU9P chip is 75W, totaling 150W for two chips, making power supply and heat dissipation critical design considerations. If these issues are not addressed well, they could significantly impact the stability of F3. Additionally, achieving signal integrity (e.g., for PCIe, MAC, and other high-speed interfaces) is a significant challenge due to the PCB reaching 26 layers. Overcoming these challenges, the FaaS Shuntian platform’s F3 achieves the industry’s highest computing power density, potentially saving up to 50% on physical machine procurement costs, further enhancing the competitiveness of FPGA’s cost-performance ratio.
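The power and density figures above reduce to simple arithmetic. The sketch below uses the 75W per-chip figure from the text; the number of accelerator cards per physical machine and the fleet size of 64 FPGAs are assumptions chosen only to illustrate how the “up to 50%” saving arises.

```python
import math

CHIP_POWER_W = 75    # typical VU9P power consumption, from the text
CARDS_PER_HOST = 4   # assumption: accelerator card slots per physical machine


def card_power(chips_per_card: int) -> int:
    """Typical power one card must be supplied with and cooled for, in watts."""
    return chips_per_card * CHIP_POWER_W


def hosts_needed(total_fpgas: int, chips_per_card: int) -> int:
    """Physical machines required to house `total_fpgas` chips."""
    return math.ceil(total_fpgas / (chips_per_card * CARDS_PER_HOST))


single = hosts_needed(64, chips_per_card=1)  # single-chip cards
dual = hosts_needed(64, chips_per_card=2)    # dual-chip cards, as in F3
saving = 1 - dual / single

print(f"dual-chip card power: {card_power(2)} W")
print(f"hosts needed: {single} (single-chip) vs {dual} (dual-chip)")
print(f"physical machine saving: {saving:.0%}")
```

The saving reaches the full 50% only when the fleet fills whole machines; for small or odd-sized fleets the rounding in `hosts_needed` makes it smaller, which is consistent with the “up to” wording.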
3. Software Design Innovation
Alibaba Cloud FaaS also features numerous software innovations:
- A complete FPGA monitoring system, allowing users to obtain real-time information on various operational states of the FPGA, including power consumption, temperature, IP usage rate, etc.;
- Users can select interconnection topologies for 1/2/4 FPGAs, allowing them to flexibly choose and configure suitable instances based on their workload size to achieve the best cost-performance ratio; with high-speed interconnection channels of up to 600Gbps between FPGAs on the same card, there are no bandwidth bottlenecks for real-time, large-volume data transfers between two FPGAs if required;
- Adaptive network interfaces: Two 100G optical ports ensure that there are no bandwidth bottlenecks for communication between different FPGAs;
- Hot upgrades: Online reconfiguration of part of the user logic without interrupting customer operations to implement new functions and features;
- Support for co-simulation of hardware and software.
These innovative designs provide users with flexible and rich instance specification choices; they greatly simplify the complexity of outputting high cost-performance FPGA computing power, while significantly enhancing the ease of use of FaaS services.
The FaaS Shuntian platform offers two major suites: HDK and SDK, providing a more efficient and unified development and deployment platform.
- HDK: A combination of Shell + Role that ensures the Shell’s lightest weight and stability while balancing convenience and flexibility;
- SDK: One part corresponds to the host-side drivers and software libraries for HDK, while the other part is the FPGA management tool suite faascmd. The drivers and software libraries correspond to the Shell and Role of HDK, providing unified and flexible software support for users together with HDK. The faascmd tool suite offers cloud FPGA management services, including BIT/DCP file security verification, FPGA image generation, downloading and management, and FPGA acceleration card status query feedback, among other functions.
4. Security Innovation
Custom virtualization technology is used to achieve strong isolation between IP acceleration and deployment environments; users of IP and the netlist files of IP are completely isolated, and the transmission, deployment, and acceleration processes of the netlist files are entirely invisible to the user; at the same time, the acceleration computing capability can be transparently opened to customers using the IP. This innovation completely eliminates the possibility of FPGA IP being pirated or misused during cloud output, providing a very high level of security protection. Additionally, IP owners can use Alibaba Cloud’s KMS encryption service to protect their IP; every time before loading the IP, a key must be obtained from the KMS service for decryption, ensuring that the usage and download of the IP are traceable; furthermore, this ensures the internal security of the IP released by the IP provider within the data center, as even Alibaba Cloud cannot decrypt the encrypted netlist without the KMS key from the IP provider.
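The key-per-load flow described above can be sketched as follows. This is a simulation under loud assumptions: `SimulatedKms` is a hypothetical in-memory stand-in and does not reflect the real Alibaba Cloud KMS API, the key ID is invented, and the cipher is a toy keystream built from SHA-256 purely so the example runs with the standard library. It is not suitable for protecting real netlists.

```python
import hashlib
import secrets
from typing import Dict, List, Tuple


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """Toy XOR stream cipher (SHA-256 in counter mode). Illustration only, not real crypto."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        block = hashlib.sha256(key + (offset // 32).to_bytes(8, "big")).digest()
        out.extend(b ^ k for b, k in zip(data[offset:offset + 32], block))
    return bytes(out)


class SimulatedKms:
    """Hypothetical key service that records every key request."""

    def __init__(self) -> None:
        self._keys: Dict[str, bytes] = {}
        self.audit_log: List[Tuple[str, str]] = []

    def create_key(self, key_id: str) -> None:
        self._keys[key_id] = secrets.token_bytes(32)

    def get_key(self, key_id: str, requester: str) -> bytes:
        self.audit_log.append((key_id, requester))  # makes each use traceable
        return self._keys[key_id]


def load_ip(kms: SimulatedKms, key_id: str, encrypted_netlist: bytes, loader: str) -> bytes:
    """Fetch the key for this load only, then decrypt the netlist for the (simulated) FPGA."""
    key = kms.get_key(key_id, requester=loader)
    return toy_cipher(key, encrypted_netlist)


kms = SimulatedKms()
kms.create_key("vendor-a/encoder-ip")  # hypothetical key ID
netlist = b"placeholder netlist bytes"
encrypted = toy_cipher(kms.get_key("vendor-a/encoder-ip", "ip-owner"), netlist)

print(load_ip(kms, "vendor-a/encoder-ip", encrypted, loader="fpga-host-1") == netlist)
print(kms.audit_log)
```

Two properties from the text are visible here: the stored netlist is useless without a key request, and every request leaves an entry in the audit log, so each load of the IP can be traced.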