Artificial intelligence consists of three elements: algorithms, computation, and data. Algorithms are the core of AI implementation, while computation and data serve as the foundation. Algorithms can broadly be divided into engineering methods and simulation methods.
The engineering method employs traditional programming techniques to improve algorithm performance using extensive data processing experience; the simulation method mimics the methods or skills used by humans or other organisms to enhance algorithm performance, such as genetic algorithms and neural networks. Currently, the primary computation capability is achieved through GPU parallel computing for neural networks, while FPGA and ASIC are emerging as powerful contenders for the future.
With companies such as Baidu, Google, Facebook, and Microsoft entering the AI space, AI is already being applied across a very broad range of fields, and its applications can be expected to multiply geometrically in the future.
Application areas include the internet, finance, entertainment, government agencies, manufacturing, automotive, gaming, and more. From an industrial structure perspective, the AI ecosystem is divided into three layers: foundation, technology, and application. The application layer includes AI + various industries (fields), the technology layer encompasses algorithms, models, and application development, while the foundation layer consists of data resources and computation capabilities.
AI will see widespread applications in many fields. Currently, key deployed applications include speech recognition, facial recognition, drones, robots, and autonomous driving.
1. Deep Learning
The core of artificial intelligence is the algorithm, and deep learning is the most mainstream AI algorithm today. The ideas behind deep learning date back to the perceptron of 1958, but the field only gained real traction recently, mainly thanks to the explosion of available data and the rising capability and falling cost of computing hardware.
Deep learning is a machine learning method for modeling patterns (such as sound and images); it is also a statistical, probabilistic model. Once various patterns have been modeled, they can be recognized; for instance, if the pattern being modeled is sound, the recognition can be understood as speech recognition. By analogy, if machine learning is akin to sorting, then deep learning is one particular sorting algorithm that has certain advantages in specific application scenarios.
Deep learning is formally known as deep neural networks, which evolved from the earlier artificial neural network model. Such a model is usually depicted intuitively as a graph, and the 'depth' in deep learning refers to the number of layers in that graph and the number of nodes per layer, both of which are far greater than in earlier neural networks.
From a single neuron, to a simple neural network, to a deep neural network used for speech recognition, the complexity grows geometrically at each step.
Taking image recognition as an example, the raw input of an image is pixels; adjacent pixels form lines, multiple lines create textures, textures form patterns, and patterns constitute the local appearance of objects, until the appearance of the entire object emerges. It is not hard to see that the raw input connects first to shallow features and then, through intermediate features, gradually connects to high-level features.
Attempting to jump directly from raw input to high-level features is undoubtedly challenging. The entire recognition process requires a massive amount of data and computation.
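To make the notion of depth concrete, here is a minimal sketch in Python/NumPy of a tiny feed-forward network; the layer sizes, random weights, and the 28x28 input are illustrative assumptions rather than anything specified in this report. Each layer transforms the previous layer's output, mirroring the pixels-to-lines-to-textures-to-object progression described above.

```python
import numpy as np

# Minimal illustrative sketch of "depth": each layer builds on the
# previous layer's output. All sizes and weights here are arbitrary
# assumptions chosen only to show the structure.
rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

pixels = rng.random(784)                       # a flattened 28x28 "image"

W1 = rng.standard_normal((256, 784)) * 0.01    # pixels -> shallow features (edges/lines)
W2 = rng.standard_normal((64, 256)) * 0.01     # shallow -> intermediate features (textures/patterns)
W3 = rng.standard_normal((10, 64)) * 0.01      # intermediate -> high-level scores (object classes)

h1 = relu(W1 @ pixels)                         # shallow features
h2 = relu(W2 @ h1)                             # intermediate features
scores = W3 @ h2                               # high-level output, one score per class

print(scores.shape)                            # (10,)
```

Adding more such layers and more nodes per layer is exactly what makes a network "deep", and it is what drives the demand for data and computation discussed next.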
The breakthroughs in deep learning today are attributed to two factors, both of which are essential:
1. Training on massive amounts of data
2. High-performance computation (CPU, GPU, FPGA, ASIC)
2. Computation Power
The key metric for measuring a chip's computational performance is its computation power. It is usually quantified as the number of floating-point operations executed per second (the peak speed), abbreviated FLOPS. Mainstream chips currently reach the TFLOPS level; one TFLOPS (teraFLOPS) equals one trillion (10^12) floating-point operations per second. Increasing deep learning computation power requires simultaneous improvements along several dimensions:
1. System parallelism
2. Clock speed
3. Memory size (including registers, cache, and main memory)
4. Memory bandwidth
5. Bandwidth between computing chips and CPUs
6. Various subtle algorithmic improvements in hardware.
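As a rough illustration of the FLOPS figures above, the sketch below estimates peak computation power as cores x clock frequency x floating-point operations per core per cycle; the numbers are hypothetical and do not describe any particular chip.

```python
# Back-of-the-envelope peak-FLOPS estimate. The figures below are
# hypothetical examples, not the specification of a real chip.
cores = 3584                     # number of parallel cores (assumed)
clock_hz = 1.5e9                 # 1.5 GHz clock (assumed)
flops_per_core_per_cycle = 2     # e.g. one fused multiply-add counts as 2 FLOPs

peak_flops = cores * clock_hz * flops_per_core_per_cycle
print(f"{peak_flops / 1e12:.2f} TFLOPS")   # ~10.75 TFLOPS
```

The other items in the list matter because peak FLOPS only translates into delivered performance when memory size and bandwidth can keep the cores fed.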
This report will primarily focus on the chip domain in artificial intelligence, emphasizing the applications and future developments of GPU, FPGA, and ASIC chips in the field of AI.
3. Introduction to GPU
The GPU, also known as the graphics core, visual processor, or display chip, is a microprocessor that specializes in image computation on personal computers, workstations, game consoles, and some mobile devices (such as tablets and smartphones). It is similar to a CPU but designed specifically for the complex mathematical and geometric calculations that graphics rendering requires. With the development of artificial intelligence, today's GPUs are no longer limited to 3D graphics processing; general-purpose GPU computing has attracted significant industry attention, and GPUs have been shown to deliver tens to hundreds of times the performance of CPUs on floating-point and parallel workloads.
GPUs are characterized by a large number of cores (up to thousands) and large amounts of high-speed memory, and were originally designed for games and computer image processing. They excel at the massively data-parallel computation typical of image processing.
This suits image processing because pixels are largely independent of one another, and the GPU provides numerous cores to process many pixels concurrently; note, however, that this improves throughput rather than latency.
For example, when a single message arrives, only one core can work on that message no matter how many cores the GPU has, and because GPU cores are designed around image-processing-style computation, they are less versatile than CPU cores. GPUs are therefore best suited to workloads with high data parallelism, such as Monte Carlo simulations.
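The sketch below, a hypothetical Python/NumPy example rather than anything from this report, shows why per-pixel work parallelizes so well: each output pixel depends only on its own input pixel, so the same operation can be applied to all pixels at once (as a GPU would across its many cores) instead of one at a time.

```python
import numpy as np

# Each output pixel depends only on its own input pixel, so the work can
# be split across many cores. The vectorized form stands in for what a
# GPU does across thousands of cores; the loop is the serial equivalent.
image = np.random.rand(480, 640)             # a hypothetical grayscale frame

# Serial view: one pixel at a time (what a single core would do).
bright_loop = np.empty_like(image)
for i in range(image.shape[0]):
    for j in range(image.shape[1]):
        bright_loop[i, j] = min(image[i, j] * 1.2, 1.0)

# Data-parallel view: the same operation over the whole array at once.
bright_vec = np.minimum(image * 1.2, 1.0)

assert np.allclose(bright_loop, bright_vec)  # identical results, higher throughput
```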
The differences between CPUs and GPUs stem from their distinct architectures and design goals: a CPU devotes most of its silicon to control logic and cache so that a few powerful cores can handle serial, branch-heavy tasks with low latency, whereas a GPU devotes most of it to a very large number of simpler arithmetic cores aimed at throughput on parallel workloads.
Because GPUs are particularly well suited to large-scale parallel computation, they play a significant role in deep learning: they can process massive amounts of fragmented data in parallel. Deep learning relies on neural networks, modeled loosely on the neural systems of the human brain and designed to analyze vast amounts of data at high speed.
For instance, to teach such a network to recognize what a cat looks like, you must show it a huge number of cat images, and this is precisely the kind of task GPUs excel at. In addition, for this kind of workload GPUs consume considerably less energy than CPUs while processing massive data rapidly.
Although machine learning has a history of several decades, two relatively recent trends have facilitated the widespread application of machine learning: the emergence of massive training data and the powerful, efficient parallel computing provided by GPUs. People utilize GPUs to train these deep neural networks, using significantly larger training sets, reducing the time required, and occupying much less data center infrastructure.
GPUs are also used to run these trained models in the cloud for classification and prediction, supporting far larger data volumes and throughput than before with lower power consumption and less infrastructure.
Early adopters of GPU accelerators for machine learning include various network and social media companies of all sizes, as well as leading research institutions in data science and machine learning. Compared to solely using CPUs, GPUs have thousands of computing cores and can achieve 10-100 times application throughput, making GPUs the processors of choice for data scientists dealing with big data.
In summary, we believe that in the era of artificial intelligence, GPUs are no longer merely traditional graphics processors but should be regarded as specialized processors endowed with powerful parallel computing capabilities.
Domestically, GPU chip design is still in its infancy, with a clear gap to international mainstream products. However, a single spark can start a prairie fire: some companies, such as Jingjia Micro, are beginning to build independent research and development capabilities.
Jingjia Micro has developed China's first independently designed GPU chip, the JM5400, targeted at the graphics display and control field in which the company specializes. As a representative graphics chip, the JM5400 breaks the foreign monopoly on military GPUs in China and makes domestically produced military GPUs a reality. It is mainly intended to replace AMD's M9 GPU, against which it offers lower power consumption and better performance.
4. Introduction to FPGA
FPGA, or Field-Programmable Gate Array, is a further development of programmable devices such as PAL, GAL, and CPLD. An FPGA chip consists mainly of programmable input/output units, basic programmable logic blocks, clock management, embedded block RAM, abundant routing resources, embedded low-level functional units, and embedded dedicated hardware modules.
FPGAs also support static reprogramming and dynamic in-system reconfiguration, so their hardware functions can be modified through programming, much like software. An FPGA can implement the functions of any digital device; even a high-performance CPU can be implemented on an FPGA.
FPGAs possess a large number of programmable logic units, enabling targeted algorithm design tailored to customer needs. Furthermore, when processing massive data, FPGAs have unique advantages over CPUs and GPUs: FPGAs are closer to IO. In other words, FPGAs are at the hardware layer of the architecture.
For example, when data is processed on a GPU, it must first enter main memory and then, under CPU control, be copied into GPU memory; after the computation finishes, the result must be copied back to main memory for further processing by the CPU. These extra copies add latency rather than saving time.
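A minimal sketch of that data path, assuming PyTorch and a CUDA-capable GPU are available (neither is mentioned in this report); the point is only that the data must be copied to GPU memory before the computation and copied back afterwards.

```python
import torch

# Illustrative only: assumes PyTorch with a CUDA-capable GPU.
x_host = torch.randn(2048, 2048)   # data sitting in main (host) memory

x_gpu = x_host.to("cuda")          # copy: host memory -> GPU memory (under CPU control)
y_gpu = x_gpu @ x_gpu              # compute on the GPU

y_host = y_gpu.to("cpu")           # copy: GPU memory -> host memory
print(y_host.shape)                # only now can the CPU keep processing the result
```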
Although FPGAs generally have lower frequencies than CPUs, CPUs are general-purpose processors that may require many clock cycles for specific computations (such as signal processing, image processing), whereas FPGAs can reconfigure circuits through programming to generate dedicated circuits, potentially requiring only one clock cycle for specific computations due to circuit parallelism.
For instance, suppose a CPU runs at 3 GHz and an FPGA at 200 MHz, and a particular computation takes the CPU 30 clock cycles but the FPGA only one. The resulting times are: CPU: 30 / 3 GHz = 10 ns; FPGA: 1 / 200 MHz = 5 ns. For this particular computation the FPGA is therefore faster than the CPU, providing acceleration.
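The same arithmetic, written out as a small sketch using the example numbers from the paragraph above:

```python
# Example numbers taken from the text above.
cpu_clock_hz = 3e9        # 3 GHz CPU
fpga_clock_hz = 200e6     # 200 MHz FPGA
cpu_cycles = 30           # cycles the CPU needs for this computation
fpga_cycles = 1           # cycles the dedicated FPGA circuit needs

cpu_time_ns = cpu_cycles / cpu_clock_hz * 1e9      # 10.0 ns
fpga_time_ns = fpga_cycles / fpga_clock_hz * 1e9   # 5.0 ns
print(cpu_time_ns, fpga_time_ns)                   # 10.0 5.0 -> the FPGA is 2x faster here
```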
A collaborative study between Peking University and the University of California on FPGA acceleration of deep learning algorithms compared FPGA and CPU execution times: one iteration took 375 milliseconds on the CPU but only 21 milliseconds on the FPGA, a speedup of roughly 18 times.
FPGAs have a significant energy-consumption advantage over CPUs and GPUs, for two main reasons. First, FPGAs have no instruction fetch and decode stages; in Intel's CPUs, which use a CISC architecture, decoding alone accounts for about 50% of total chip energy consumption, and in GPUs instruction fetch and decode still consume 10% to 20%.
Second, FPGAs typically run at much lower clock frequencies than CPUs and GPUs, usually below 500 MHz, while CPUs and GPUs run between 1 GHz and 3 GHz. This large frequency difference means FPGAs consume significantly less energy than CPUs and GPUs.
Comparing the energy consumed by FPGA and CPU when executing deep learning algorithms, the CPU consumed 36 joules while the FPGA consumed only 10 joules, an energy saving of roughly 3.6 times. With FPGA acceleration and energy savings, real-time deep learning computation on mobile devices becomes far more feasible.
Compared to CPUs and GPUs, FPGAs exhibit unique advantages in deep learning applications due to their bit-level fine-grained customizable structure, pipelined parallel computing capabilities, and efficient energy consumption, holding great potential for large-scale server deployments or resource-constrained embedded applications. Additionally, the flexible FPGA architecture allows researchers to explore model optimizations beyond the fixed architectures of GPUs.
5. Introduction to ASIC
ASIC (Application-Specific Integrated Circuit) refers to integrated circuits designed and manufactured to meet specific user requirements or the needs of specific electronic systems. Strictly speaking, ASICs are specialized chips that differ from traditional general-purpose chips. They are chips customized for specific needs.
ASICs, as products that closely integrate integrated-circuit technology with the requirements of a specific user or system, offer several advantages over general-purpose ICs: smaller size, lower power consumption, higher reliability, better performance, stronger confidentiality, and lower cost. Returning to the most critical metrics for deep learning, computation power and power consumption, we compare NVIDIA's GK210 with a representative ASIC chip:
In terms of computation power, the ASIC offers 2.5 times the computational capacity of the GK210. The second metric is power consumption, which is reduced to 1/15 that of the GK210. The third metric is the size and bandwidth of on-chip memory; this internal memory plays the role that the cache plays on a CPU.
Deep learning models are quite large, typically ranging from hundreds of MB to around 1 GB, and their weights are read frequently. If the model is stored in external DDR, the bandwidth pressure on the DDR can reach TB/s levels.
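A back-of-the-envelope estimate of where that pressure comes from; the model size and read rate below are assumptions for illustration, not figures from this report.

```python
# If the full set of weights is re-read from external DDR for every
# inference, required bandwidth = model size x reads per second.
# Both numbers below are assumed for illustration.
model_size_bytes = 500e6      # hypothetical 500 MB model
reads_per_second = 2000       # hypothetical 2000 inferences per second

bandwidth = model_size_bytes * reads_per_second
print(f"{bandwidth / 1e12:.1f} TB/s")   # 1.0 TB/s of DDR traffic
```

Keeping the model in large on-chip memory removes that DDR traffic, which is exactly the advantage the third metric above points to.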
Fully customized ASIC designs, due to their inherent characteristics, have the following advantages over non-customized chips:
With the same process and the same functionality, a first-pass fully custom design already improves performance by 7.6 times.
In ordinary designs, the difference between fully customized and non-customized can vary by 1 to 2 orders of magnitude.
A fully customized approach can be ahead of a non-customized design by about four process nodes (a fully custom design fabricated in a 28nm process may outperform a non-custom design fabricated in a process several generations more advanced). We believe these advantages give ASICs significant potential in deep learning for artificial intelligence.
Although the application of ASICs in artificial intelligence deep learning is still limited, we can draw a similar inference from the development of Bitcoin mining chips. Bitcoin mining and artificial intelligence deep learning share similarities, as both rely on underlying chips for large-scale parallel computing. ASICs have demonstrated unparalleled advantages in the Bitcoin mining field.
Bitcoin mining chips have gone through four stages: CPU, GPU, FPGA, and ASIC. ASIC chips are tailored specifically for mining: functions that mining does not use are stripped out of the FPGA design, so they run faster than FPGA chips built on a comparable process and cost less once mass-produced.
The historical development of ASICs in the Bitcoin mining era illustrates the inherent advantages of ASICs in dedicated parallel computing: high computation power, low power consumption, low cost, and strong specificity. Google’s recently unveiled TPU, designed for artificial intelligence deep learning computation, is also an ASIC.
In conclusion, as the era of artificial intelligence arrives, traditional chip categories such as GPU, FPGA, and ASIC will each see a new wave of explosive growth.