The Path of FPGA: Is AI ASIC Inevitable?

One month ago, a major piece of news hit the AI industry: the well-known AI hardware company Deephi was acquired by FPGA giant Xilinx. Rumored transaction figures varied (reportedly in the hundreds of millions of dollars), and everyone marveled at the founders' newfound financial freedom and generosity (a 5 million donation to Tsinghua University, a model of the domestic virtuous cycle of study, research, and giving back to education). The deal became a hot topic. Meanwhile, alarming rumors began to circulate: vertical integration of the AI field was about to begin, the bubble was about to burst, and so on, casting a shadow over the rapidly rising crop of AI hardware companies.

I do not wish to judge the motives behind that business transaction. Instead, I want to take this opportunity to discuss from a technical perspective the key factors behind this acquisition—the relationship between FPGA and ASIC in AI computation. I am not an expert, so please point out any misunderstandings.

From FPGA to ASIC,

Similarities or Differences?

Among the top three domestic AI hardware companies, Deephi's greatest strength lies in its AI-specific design kit, DNNDK, and its FPGA implementation, including its killer technology: sparse (pruned) networks. Anyone doing AI hardware who has not studied their pruning work might as well give up on research.

At the same time, Deephi also has its ASIC product line—the Listening Wave series SoC.


We assume that the Aristotle core in Listening Wave is inherited from Deephi's Aristotle architecture on the Zynq 7020, as shown in the figure below. (Note: this is pure speculation, and the assumption may well not hold.)

[Figure: Deephi's Aristotle architecture on the Zynq 7020]

So the question arises: is a seamless transition from FPGA to ASIC really the optimal path for an AI hardware architecture?

This question requires us to return to the respective design values of FPGA and ASIC. As FPGA devices have grown more sophisticated, the core building blocks of an FPGA fabric are no longer just look-up tables (LUTs). In FPGA designs where compute throughput is the main concern (neural networks being a typical example), how efficiently the design uses the FPGA's DSP and BRAM hard IP determines its final performance.

Let's take a look at the widely used DSP48 hard macro in the Xilinx 7 series, whose basic architecture is shown in the figure below. It can be understood as a configurable multiply-accumulate (MAC) unit. Notably, its multiplier inputs are 25 bits and 18 bits wide, and its output can be up to 48 bits wide.

[Figure: basic architecture of the Xilinx 7-series DSP48 slice]

Here an awkward situation arises. DNNs, especially most endpoint (edge) DNN applications, only require 8-bit precision. Using the powerful DSP48 for an 8-bit MAC is like shooting a mosquito with a cannon, while building the MAC out of LUT logic cannot meet timing. Xilinx therefore published white paper WP487, which presents a method for performing two 8-bit multiplications in parallel on a single DSP48 in a neural-network scenario. In short, the two 8-bit operands are packed into one 27-bit word with 10 guard bits in between, and multiplied by a shared third operand; the upper and lower segments of the product then yield the two individual results. In summary, the awkwardness remains.

[Figure: WP487 scheme for two parallel 8-bit multiplications on one DSP48]
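To make the packing trick concrete, here is a minimal software sketch of the idea, assuming unsigned 8-bit operands for simplicity; the real DSP48 scheme must also handle signed values, which requires sign extension and correction terms, and the function name below is illustrative rather than taken from WP487:

```python
# Minimal sketch of the dual 8-bit multiply trick described above.
# Assumes unsigned 8-bit operands; signed operands need extra correction logic.

def packed_dual_mult(a, b, c):
    packed = (a << 18) | b    # pack a and b with 10 guard bits in between
    product = packed * c      # one wide multiplication
    low = product & 0x3FFFF   # lower segment -> b * c (fits well within 18 bits)
    high = product >> 18      # upper segment -> a * c
    return high, low

# Example: both products recovered from a single multiplication.
assert packed_dual_mult(3, 5, 7) == (21, 35)
```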

In this scheme, each MAC takes three cycles to complete, and the complexity of the pipelined implementation leaves plenty of room for debugging pain. In an ASIC implementation, by contrast, an 8-bit MAC needs only one cycle, and running it at 500 MHz is trivial. Therefore, directly reusing the FPGA RTL in an ASIC introduces a lot of unnecessary performance loss. The issue is likely to become even more pronounced with today's popular low-precision neural networks. For example, at ISSCC 2018, KAIST proposed new MAC logic that falls outside what maps well onto an FPGA but shows significant advantages in power and performance.
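A back-of-the-envelope comparison illustrates the gap. The cycle counts follow the text above; the 200 MHz FPGA clock is a hypothetical figure chosen only for illustration, not a measured number:

```python
# Rough per-multiplier throughput comparison (illustrative numbers only).

def macs_per_second(clock_hz, cycles_per_mac, macs_per_issue=1):
    """Sustained 8-bit MACs per second for one multiplier."""
    return clock_hz * macs_per_issue / cycles_per_mac

# One DSP48 doing two packed 8-bit MACs every 3 cycles at a hypothetical 200 MHz.
fpga_dsp48 = macs_per_second(200e6, cycles_per_mac=3, macs_per_issue=2)
# One dedicated single-cycle 8-bit MAC at 500 MHz, as in the ASIC case above.
asic_mac = macs_per_second(500e6, cycles_per_mac=1)

print(f"FPGA DSP48: {fpga_dsp48/1e6:.0f} MMAC/s, ASIC MAC: {asic_mac/1e6:.0f} MMAC/s")
# -> FPGA DSP48: 133 MMAC/s, ASIC MAC: 500 MMAC/s
```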

The same issue arises with on-chip RAM. In the author's view, the biggest difference between CNN-specific processors and classic SIMD/matrix-multiplication accelerators lies in exploiting CNN data reuse through diverse data flows. What these data flows really need is a scratchpad of appropriate size next to each MAC to hold partial sums; in current mainstream designs this scratchpad is roughly 0.5 kb to 2 kb per MAC. However, the smallest on-chip RAM macro on the FPGA (RAMB18E1) provides an 18 kb BRAM/FIFO, far larger than the scratchpad requirement. Implementing the scratchpad on an FPGA therefore becomes a dilemma: synthesizing it directly consumes large amounts of LUT and flip-flop resources, while mapping it to the BRAM macros wastes capacity and squeezes the space available for features and weights. Because of this awkward scratchpad size, many FPGA DNN implementations settle for plain matrix multiplication and abandon support for the richer CNN/DNN data flows. In an ASIC this is a non-issue: with a RAM compiler, the designer can size the scratchpad freely.
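As a back-of-the-envelope sketch of the dilemma, suppose (hypothetically) that each MAC in a 256-MAC array gets its own private scratchpad mapped onto a dedicated RAMB18E1 block; the utilization of the allocated BRAM bits would then be:

```python
# Illustrative sketch of the scratchpad-vs-BRAM mismatch described above.
# The array size and per-MAC scratchpad sizes are assumed, not from any design.

BRAM_KBITS = 18  # one RAMB18E1 block provides 18 kb

def bram_utilization(num_macs, scratchpad_kbits):
    """Fraction of allocated BRAM bits actually used if each MAC's
    private scratchpad is mapped onto its own 18 kb block."""
    used = num_macs * scratchpad_kbits
    allocated = num_macs * BRAM_KBITS
    return used / allocated

for s in (0.5, 1.0, 2.0):
    print(f"{s:>4} kb per MAC -> {bram_utilization(256, s):.0%} of BRAM bits used")
# 0.5 kb per MAC -> 3%,  1.0 kb -> 6%,  2.0 kb -> 11%
```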

In summary, although FPGA and ASIC AI designs both involve writing RTL, they differ significantly in architecture and philosophy. The optimal FPGA design maps the architecture onto the underlying hard macro IP as efficiently as possible, while an ASIC has no such constraint and can explore the design space far more freely. An ASIC derived directly from an FPGA design may therefore inherit those constraints and fall short of the optimum that an ASIC allows. This may explain why Deephi, after completing its FPGA prototypes, still put a great deal of effort into finalizing the actual ASIC design.

FPGA Prototype Verification:

Tasteless to Eat, a Pity to Discard?

Traditionally, one important reason for the emergence of FPGA is to prototype ASICs. Undeniably, prototype verification remains a significant market for FPGAs.


In AI applications, besides verifying RTL functionality and providing much faster simulation, the more important advantage of FPGA prototyping is that it lets embedded software development enter the overall design flow earlier. The number of bugs and degrees of freedom on the software side often far exceed those in hardware; waiting until after ASIC tape-out to sort out software and system interfaces wastes precious time. A major advantage of prototype verification is that software and embedded development can start early, from a system and integration perspective, on the hardware prototype, while back-end work and tape-out proceed in parallel.

However, compared with RTL simulation, the debugging capability of a prototype is notoriously inferior. A common FPGA prototype debugging method is to hand-place observation points in the RTL, store the probed signals in on-chip BRAM, and then read them back over a JTAG-like serial interface to reconstruct the waveforms. Clearly, this competes with the functional RTL for on-chip BRAM, especially when the capture depth is large and the probed signals are wide. A more serious problem is that any large-scale change to the probe set forces a new round of synthesis and implementation, which can take so long that plain simulation ends up being more efficient. Mainstream FPGA debugging solutions today follow this approach, as in the ChipScope + ILA flow shown in the figure below.

[Figure: ChipScope + ILA debugging flow]
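To make the capture mechanism concrete, here is a minimal software model of the scheme described above: a trigger-controlled buffer standing in for the on-chip BRAM, with a read-back step standing in for the JTAG-like interface. It is purely illustrative and not tied to ChipScope, ILA, or any vendor API:

```python
# Toy model of BRAM-based signal capture for FPGA prototype debugging.
# Not a vendor API; it only illustrates the trigger/store/read-back flow.

from collections import deque

class CaptureBuffer:
    def __init__(self, depth):
        self.samples = deque(maxlen=depth)  # stands in for the on-chip BRAM
        self.frozen = False

    def clock(self, value, trigger):
        """Called once per clock with the probed value and trigger condition."""
        if not self.frozen:
            self.samples.append(value)
            if trigger:
                self.frozen = True          # stop capturing once triggered

    def read_back(self):
        """Stands in for reading the stored samples out over JTAG."""
        return list(self.samples)

ila = CaptureBuffer(depth=4)
for value in [0, 1, 3, 7, 15, 31]:
    ila.clock(value, trigger=(value == 7))
print(ila.read_back())  # [0, 1, 3, 7]: the window leading up to the trigger
```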

Moreover, FPGA prototyping also struggles with complex clocking. FPGA designers are generally advised to avoid clock gating altogether; yet clock gating, the most mainstream ASIC power-saving technique, appears in every corner of an AI chip, and in sparse networks in particular it is the simplest and most effective way to cut power. The FPGA's weak support for this feature can leave gaps in prototype coverage. Multi-clock-domain designs are another concern: the limited PLL resources on an FPGA impose many restrictions on the prototype.

Under these circumstances, while the FPGA remains an important prototype verification platform for AI chips, the challenges above are making it increasingly inadequate for the job.

Hardware Emulator,

Domain-Specific FPGA

With advances in IC EDA tools, a new kind of SoC development platform has emerged that offers good debugging capability while also supporting early software development: the hardware emulator. It can be said to combine the advantages of simulation and prototyping while largely compensating for their weaknesses. The mainstream EDA vendors all offer emulator platforms and expect SoC development flows to center on emulation in the near future: Synopsys' ZeBu, Cadence's Palladium, and Mentor's Veloce, among others. ZeBu, in particular, is built from Xilinx's high-end FPGAs.

Technically, the difference between FPGA-based emulation and prototyping is that the emulator's RTL mapping decomposes and partitions the original RTL across multiple FPGAs, each of which also carries debug-observation logic. During partitioning, the EDA software pays close attention to the communication between modules, using the high-speed transceivers and routing resources integrated into the FPGAs to stitch the partitioned SoC back together, thereby escaping the hardware resource limits of a single FPGA.


The figure below compares FPGA-based prototyping platforms with emulator platforms. While emulators have no advantage in speed, they offer far better visibility into internal signals and, consequently, far better debugging capability. In other words, an FPGA-based emulator is not a direct mapping of the original ASIC source code, but a new RTL mapping produced by a series of transformations of that source: partitioning, interconnection, and probe serialization. To put it stylishly, emulators are domain-specific FPGA prototyping.

[Figure: comparison of FPGA prototyping and emulator platforms]

Of course, FPGA emulators have one significant disadvantage: they are expensive! For an AI hardware startup just getting off the ground, buying an emulator is a real financial burden. Even so, as the system and application requirements on AI ASICs keep rising, will it become a trend for FPGA-based emulators to replace FPGA-based prototyping? Let us wait and see.

FPGA AI:

Is it Necessary to Follow the Old Path of ASIC?

As mentioned earlier, FPGA designs are difficult to transplant directly to ASIC. But does AI on FPGA really need to follow the traditional ASIC path of "identify needs, define the product spec, mass-produce and ship, update every few years"? We believe the reconfigurable nature of FPGAs lets them bypass this path and adopt a development model much closer to software. One example is the recently popular cloud FPGA instance (AWS, Alibaba Cloud, and others), where users load the bitstreams they need onto cloud FPGA instances, turning the FPGAs into dedicated accelerators for their own applications. Another benefit of cloud FPGAs is that they effectively standardize the target device, eliminating many unnecessary configuration bugs in open-source work; the well-known FPGA version of NVDLA, for instance, primarily targets AWS's FPGA platform.

With this approach, the design iteration speed of FPGA AI (especially when combined with agile hardware development flows such as Chisel and HLS) can far exceed that of the traditional ASIC process, while the hardware's energy efficiency remains far higher than that of conventional CPUs/GPUs. This aligns well with today's growing emphasis on heterogeneous computing (for more on heterogeneous computing, see the transcripts of the talks by computer architecture masters Patterson and Hennessy on RISC-V and DSA!). It is why Microsoft and Amazon are deploying FPGAs in their cloud data centers, and why Intel is packaging Altera FPGAs with its high-end CPUs. This new model is expected to become a new growth point for the FPGA market and deserves our attention.

In conclusion,

(1) In terms of AI hardware implementation, the optimization paths for FPGA and ASIC differ significantly, and directly transplanting from FPGA to ASIC is not an efficient approach.

(2) It is emphasized here that this does not mean that FPGA-based AI implementations do not have a future (on the contrary, I believe they have unlimited potential). This article merely presents some thoughts on the direct transplantation from FPGA to ASIC. We anticipate that FPGA will develop its own new ecosystem in conjunction with agile design.

(3) The role of the FPGA in the SoC design flow is evolving from prototype verification toward hardware emulation. Has your flow kept up?

