The Evolution of Artificial Intelligence Chips: CPU, FPGA, and ASIC

Artificial Intelligence Chips (Part 1): Evolution

Artificial intelligence algorithms require strong computational power, and the large-scale adoption of deep learning has raised those demands even further. Deep learning models have enormous numbers of parameters, heavy computational loads, and even larger training data volumes. One early deep learning model for speech recognition had an input layer of 429 neurons and 156 million parameters across the whole network, and training it took more than 75 days. The Google Brain project, led by AI pioneers Andrew Ng and Jeff Dean, used a parallel computing platform with 16,000 CPU cores to train deep neural networks with over 1 billion connections (parameters). Going further, simulating the human brain's neural system, with roughly 100 billion neurons, would require several more orders of magnitude of computational power.
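
To make the scale of such models concrete, here is a minimal back-of-the-envelope sketch of how parameter counts add up in a fully connected network; the layer widths are hypothetical and chosen only for illustration, not taken from the speech model described above.

```python
# Rough parameter count for a fully connected deep network.
# The layer sizes below are illustrative only, not those of the original
# speech-recognition model mentioned above.
layer_sizes = [429, 2048, 2048, 2048, 2048, 2048, 2048, 9304]

total_params = 0
for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
    total_params += n_in * n_out + n_out  # weight matrix plus bias vector

print(f"total parameters: {total_params:,}")  # tens of millions for these sizes
```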

Additionally, with the rapid development of mobile terminals, represented by smartphones, there is a growing desire to run artificial intelligence on these devices, which places higher demands on both hardware computational power and power consumption. The traditional way to implement AI on a mobile terminal is to transmit all of the terminal's data to the cloud over the network, perform the computation in the cloud, and then send the results back to the device, as Apple's Siri service does. However, this approach has several problems. First, transmitting data over the network introduces latency, and results may take seconds or even tens of seconds to return to the terminal, so applications that need immediate results cannot use this method. For example, obstacle-avoidance algorithms on drones and image recognition algorithms in ADAS systems would face severe consequences if their computation depended on the cloud rather than running locally, because of communication delays and reliability issues. Second, once data is transmitted over the network, it is at risk of interception. Applications that require low computational latency or are sensitive to data security therefore need to run AI algorithms entirely on the terminal, or at least perform preprocessing on the terminal and send only a small amount of intermediate results (rather than large volumes of raw data) to the cloud for the final computation. This requires mobile hardware capable of rapid computation, so mobile hardware must satisfy both high-speed and low-power requirements.
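
As a toy illustration of the "preprocess locally, upload only compact results" pattern described above, the sketch below reduces a second of raw audio to a small feature vector before anything is sent to the cloud. The function name and the feature choice are hypothetical, picked only to show the data-volume difference.

```python
import numpy as np

def extract_features(raw_audio: np.ndarray) -> np.ndarray:
    """Hypothetical on-device preprocessing: reduce raw audio samples to a
    small, fixed-size feature vector (here a coarse spectral summary)."""
    spectrum = np.abs(np.fft.rfft(raw_audio))
    bands = np.array_split(spectrum, 32)          # 32 coarse frequency bands
    return np.array([band.mean() for band in bands], dtype=np.float32)

# One second of 16 kHz audio: 16,000 raw samples on the device...
raw = np.random.randn(16000).astype(np.float32)
features = extract_features(raw)

# ...but only 32 floats would need to be uploaded for cloud-side processing.
print(f"{raw.nbytes} bytes of raw data vs {features.nbytes} bytes uploaded")
```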

In response to these demands, AI core computing chips have gone through four major shifts. Before 2007, AI research and applications went through several boom-and-bust cycles without maturing into an industry; at the same time, because of limitations in algorithms, data, and other factors, demand for specialized chips was not strong, and general-purpose CPUs provided sufficient computational power. Later, driven by industries such as high-definition video and gaming, GPU products advanced rapidly, and people discovered that the GPU's parallel architecture was well suited to the large-scale data-parallel computation required by AI algorithms: in deep learning workloads, GPUs can be 9 to 72 times more efficient than traditional CPUs, which led to attempts to use GPUs for AI computation. After 2010, with the widespread adoption of cloud computing, AI researchers could use the cloud to harness large numbers of CPUs and GPUs for hybrid computation; in fact, cloud computing remains the main computing platform for AI today. However, the AI industry's demand for computational power continues to rise rapidly, which after 2015 led to the development of chips dedicated to AI, offering roughly another 10-fold improvement in computational efficiency through better hardware and chip architectures.

Trends in AI Core Computing Chip Development

Currently, based on the computational model, AI core computing chips are developing in two directions. One imitates the brain at the functional level using artificial neural networks; the main products here are standard CPUs, GPUs, FPGAs, and custom ASIC chips. The other is neuromorphic computing, which tries to approach the brain at the structural level. This direction can be further divided into two layers. The first is the neural network level, corresponding to neuromorphic architectures and processors such as IBM's TrueNorth chip, which treats digital processors as neurons and memory as synapses. Unlike the traditional von Neumann architecture, its memory, processing, and communication components are fully integrated, so information is processed entirely locally, overcoming the bottleneck between memory and CPU in conventional computers. Its neurons can also communicate quickly and easily: upon receiving impulses (action potentials) from other neurons, they can act simultaneously. The second is the neuron (device) level, corresponding to innovations in basic components; for example, IBM's Zurich Research Center announced the world's first artificial nanoscale stochastic phase-change neurons, capable of fast unsupervised learning.
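
For intuition about the event-driven, spike-based computation that such neuromorphic chips target, here is a minimal leaky integrate-and-fire neuron in Python. It is a generic textbook model offered only as a sketch, not TrueNorth's actual neuron circuit.

```python
import numpy as np

def lif_neuron(input_spikes, weights, threshold=1.0, leak=0.9):
    """Toy leaky integrate-and-fire neuron: weighted input spikes are
    integrated into a membrane potential that decays over time; a spike
    is emitted (and the potential reset) when the threshold is crossed."""
    potential = 0.0
    output = []
    for spikes in input_spikes:                   # one row per time step
        potential = leak * potential + float(np.dot(weights, spikes))
        if potential >= threshold:
            output.append(1)
            potential = 0.0                       # reset after firing
        else:
            output.append(0)
    return output

# Three presynaptic neurons firing randomly over 20 time steps.
rng = np.random.default_rng(0)
spike_train = rng.integers(0, 2, size=(20, 3))
print(lif_neuron(spike_train, weights=np.array([0.4, 0.3, 0.5])))
```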

Main Types of AI Brain-like Chips

From the perspective of the current development stage of AI chips, imitating the brain at the structural level is the ultimate goal of AI, but it remains far from practical application; functional imitation is the current mainstream. Therefore, general-purpose chips such as CPUs, GPUs, and FPGAs are the primary chips in the AI field, while ASICs dedicated to neural-network algorithms are being introduced by Intel, Google, Nvidia, and numerous startups, and are expected to replace today's general-purpose chips as the mainstay of AI chips in the coming years.

Artificial Intelligence Chips (Part 2): GPU

“The implementation of artificial intelligence algorithms requires robust computational support, particularly with the large-scale use of deep learning algorithms, which poses higher demands on computational capabilities.”

The main reason traditional general-purpose CPUs are ill-suited to AI algorithms is that they execute computational instructions largely sequentially, leaving much of the chip's potential unused. GPUs, by contrast, have a highly parallel structure and are more efficient than CPUs at processing graphical data and data-intensive algorithms. Comparing the two architectures, most of a CPU's die area is devoted to control logic and registers, whereas a GPU devotes more area to ALUs (Arithmetic Logic Units) for data processing, a structure well suited to parallel processing of dense data. When a CPU executes a computational task, it processes only one data element at a time and lacks true large-scale parallelism, while a GPU has many processing cores and can operate on many data elements simultaneously. Programs running on GPU systems often achieve speedups of tens to thousands of times over single-core CPUs. As Nvidia and AMD continue to advance their GPUs' support for large-scale parallelism, general-purpose GPUs (GPGPU) have become an important means of accelerating parallel applications.
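
The contrast can be sketched in Python: the loop below processes one element per step, while the vectorized form applies the same operation to every element at once. This runs on a CPU via NumPy and is only an analogy for the GPU's many-ALU execution model, not a GPU benchmark.

```python
import time
import numpy as np

a = np.random.rand(1_000_000).astype(np.float32)
b = np.random.rand(1_000_000).astype(np.float32)

# Scalar loop: one element per step, analogous to purely sequential execution.
t0 = time.perf_counter()
out = np.empty_like(a)
for i in range(a.size):
    out[i] = a[i] * b[i] + 1.0
t_loop = time.perf_counter() - t0

# Data-parallel form: the same operation applied to all elements at once,
# analogous to many ALUs working on many data elements simultaneously.
t0 = time.perf_counter()
out_vec = a * b + 1.0
t_vec = time.perf_counter() - t0

print(f"element-by-element: {t_loop:.3f}s   data-parallel: {t_vec:.5f}s")
```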

The development of GPUs has gone through three stages. In the first generation (before 1999), certain functions were offloaded from the CPU for hardware acceleration; the representative product was the GE (Geometry Engine), which could only accelerate 3D image processing and was not software-programmable.

The second generation of GPUs (1999-2005) brought further hardware acceleration and limited programmability. In 1999, Nvidia's GeForce 256 moved T&L (Transform and Lighting) off the CPU for fast geometry transformation, marking the true birth of the GPU. In 2001, Nvidia and ATI launched the GeForce 3 and Radeon 8500, organizing the graphics hardware pipeline around stream processors and introducing vertex-level programmability along with limited pixel-level programmability, although overall GPU programmability remained limited.

The third generation of GPUs (after 2006) gained convenient programming environments that allow programs to be written for them directly. In 2006, Nvidia and ATI introduced the CUDA (Compute Unified Device Architecture) and CTM (Close To Metal) programming environments, respectively; in 2008, Apple proposed OpenCL (Open Computing Language), a general-purpose parallel programming platform that, unlike CUDA, which is bound to Nvidia graphics cards, is not tied to any specific computing device.

Development Stages of GPU Chips

Currently, GPUs have reached a relatively mature stage. Companies like Google, Facebook, Microsoft, Twitter, and Baidu are using GPUs to analyze images, videos, and audio files to improve search and image tagging functionalities. GPUs are also applied in industries related to VR/AR. Additionally, many automotive manufacturers are using GPU chips to develop autonomous driving.

According to research firm Tractica LLC, the market for GPUs used in AI is predicted to grow from less than $100 million in 2016 to $14 billion by 2025, indicating explosive growth for GPUs.

Predictions for AI GPU Revenue by Region from 2016 to 2025 (Source: Tractica)

Over the past decade, general-purpose GPU computing for AI has been led by Nvidia. Nvidia began investing in AI products in 2010 and announced its next-generation Pascal GPU architecture in 2014, its fifth-generation GPU architecture and one designed specifically for deep learning, supporting all mainstream deep learning computing frameworks. In the first half of 2016, Nvidia launched the Pascal-based Tesla P100 chip and the corresponding DGX-1 supercomputer for training neural networks. Nvidia CEO Jensen Huang stated that developing the Tesla P100 cost $2 billion, while Nvidia's total annual revenue was only about $5 billion. The DGX-1 deep learning supercomputer contains Tesla P100 GPU accelerators connected with Nvidia's NVLink interconnect technology, and its software stack includes the major deep learning frameworks, deep learning SDKs, the DIGITS GPU training system, drivers, and CUDA, enabling rapid design of deep neural networks (DNNs). It delivers up to 170 TFLOPS of half-precision floating-point performance, equivalent to 250 conventional servers, accelerates deep learning training by 75 times, offers 56 times the performance of a CPU, and is priced at $129,000. In September 2016, at the GTC conference in Beijing, Nvidia launched the Pascal-based Tesla P4 and P40 for the neural network inference stage.

AMD also concentrated on releasing a series of AI products at the end of 2016, including three graphics accelerator cards (branded MI), four OEM chassis, and a series of open-source software, along with its next-generation Vega architecture GPU chips. Going forward, AMD aims to provide complete solutions for supercomputing clients based on its product line, through the combination of MI series hardware accelerators, the ROCm software platform, and Zen-based 32-core and 64-core server CPUs.

In addition to Nvidia and AMD, Intel plans to launch a deep learning inference accelerator and 72-core Xeon Phi chips in 2017. Beyond the traditional CPU and GPU giants, the GPU strategies of the major mobile players are also worth watching. Apple is reportedly recruiting GPU development talent to enter the VR market. The GPU performance of Apple's A9 is currently comparable to that of the Snapdragon 820, with the A9's GPU based on a design from the PowerVR Rogue family, while Apple's A9X processor performs comparably to Intel's Core M processors. ARM, whose architecture has long dominated the mobile processor market, is also beginning to take the GPU market seriously, with its Mali series of GPUs gradually gaining ground thanks to their low power consumption and low cost.

Artificial Intelligence Chips (Part 3): FPGA

An FPGA (Field-Programmable Gate Array) is a further development of programmable devices such as PAL, GAL, and CPLD. By loading ("burning") a configuration file, users define the connections among the device's logic gates and memory. This configuration is not one-time: a user can configure an FPGA as a microcontroller (MCU) and later edit the configuration file to reconfigure the same FPGA as an audio codec. FPGAs thus address the inflexibility of custom circuits while overcoming the limited gate counts of earlier programmable devices.

An FPGA contains large numbers of repeated basic units: IOBs (input/output blocks), CLBs (configurable logic blocks), and routing channels. As shipped, an FPGA is a "universal chip"; users describe the hardware circuit they need in a hardware description language (HDL). After each configuration, the FPGA's internal hardware has a fixed interconnection scheme and thus a specific function: input data flows through the configured logic to produce the output. In other words, there is no instruction-driven computation between input and output; signals simply propagate through the pre-configured hardware circuit, which makes FPGAs highly specialized for their target computation and very fast. This working mode requires laying out a large number of gate arrays in advance to meet the user's design, hence the saying "trading area for speed": consuming more of the FPGA's logic resources (chip area) in exchange for higher overall system speed.

FPGAs can perform both data parallel and task parallel computations simultaneously, demonstrating even greater efficiency in handling specific applications. For a particular computation, a general-purpose CPU might require multiple clock cycles, while an FPGA can reconfigure its circuit through programming to generate a dedicated circuit that can complete computations in just a few clock cycles or even in one clock cycle.
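
As a rough software analogy for this "configure once, then let signals flow through fixed logic" model, the Python sketch below precomputes a small truth table in the way an FPGA LUT is configured, so that each evaluation becomes a single lookup rather than a sequence of executed instructions. This is only an analogy; real FPGA design uses an HDL and synthesis tools.

```python
# Software analogy of an FPGA lookup table (LUT): a 4-input boolean function
# is "configured" once by precomputing its truth table; each later evaluation
# is a single table read, not a sequence of executed instructions.

def configure_lut(func, n_inputs=4):
    """Precompute the truth table of an n-input boolean function."""
    return [func(*(((i >> b) & 1) for b in range(n_inputs)))
            for i in range(2 ** n_inputs)]

# The function this "circuit" should implement: (a AND b) XOR (c OR d).
lut = configure_lut(lambda a, b, c, d: (a & b) ^ (c | d))

def evaluate(a, b, c, d):
    # One lookup, analogous to signals flowing through pre-configured logic.
    return lut[a | (b << 1) | (c << 2) | (d << 3)]

print(evaluate(1, 1, 0, 0))  # -> 1
```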

In terms of power consumption, FPGAs also have significant advantages, with energy consumption more than 10 times lower than CPUs and 3 times lower than GPUs. This is because FPGAs have no instruction fetch or instruction decode: in Intel's CPUs, decoding alone accounts for about 50% of total chip energy consumption due to the CISC architecture, and in GPUs, instruction fetch and decode consume about 10% to 20% of the energy.

Furthermore, due to the flexibility of FPGAs, many lower-level hardware control operations that are difficult to implement with general processors or ASICs can be easily realized with FPGAs, allowing for greater room for algorithm functionality realization and optimization. Additionally, the one-time costs (like photomask production costs) for FPGAs are much lower than those for ASICs, making FPGAs the best choice for implementing semi-custom AI chips when chip demand has not yet scaled, and deep learning algorithms require continuous iteration and improvement.

Given the flexibility and speed of FPGAs, there is a trend to replace ASICs in many fields. According to market research firm Grand View Research, the FPGA market is expected to grow from $6.36 billion in 2015 to about $11 billion by 2024, with an annual growth rate of 6%.

Global FPGA Market Size Forecast from 2014 to 2024 (Source: Grand View Research)

Currently, the FPGA market is dominated by four foreign companies: Xilinx, Altera (now part of Intel), Lattice, and Microsemi. Among them, Xilinx and Altera hold a near-absolute grip on FPGA technology and market share. In 2014, before Altera was acquired by Intel, Xilinx and Altera posted revenues of $2.38 billion and $1.93 billion, respectively, for 48% and 41% of the market, while Lattice and Microsemi (FPGA business only) generated revenues of $366 million and $275 million, respectively; the top two vendors together held nearly 90% of the market.

Market Share Analysis of FPGA Manufacturers in 2015 (Source: IHS)

Artificial Intelligence Chips (Part 4): ASIC (Application-Specific Integrated Circuits)

Currently, AI computing, represented by deep learning, mainly relies on existing general-purpose chips such as GPUs and FPGAs that are suited to parallel computation. Before large-scale industrial applications emerge, using these existing general-purpose chips avoids the high investment and risk of developing dedicated chips (ASICs). However, since these general-purpose chips were not originally designed for deep learning, they face inherent performance and power-consumption bottlenecks, and as the scale of AI applications expands, these problems will become increasingly prominent.

GPUs were originally designed as graphics processors to handle the large-scale parallel computation required in image processing, so they face three limitations when applied to deep learning algorithms. First, they cannot fully exploit their parallelism during deployment. Deep learning involves two computational phases, training and inference; GPUs are highly efficient for training, but during inference they often process only one input image at a time, so their parallelism cannot be fully utilized. Second, their hardware structure is fixed and cannot be reconfigured. Deep learning algorithms have not yet stabilized, and if the algorithms change significantly, GPUs cannot flexibly reconfigure their hardware the way FPGAs can. Third, their energy efficiency when running deep learning algorithms is far below that of FPGAs.
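
The batch-size effect behind the first limitation can be illustrated with a small experiment. The sketch below uses a CPU matrix multiply as a stand-in for the hardware, so the absolute numbers mean little; the point is the per-sample throughput gap between batch size 1 (typical online inference) and large batches (typical of training).

```python
import time
import numpy as np

# One fully connected layer (4096 -> 4096) as a stand-in for a network.
W = np.random.rand(4096, 4096).astype(np.float32)

def samples_per_second(batch_size, repeats=32):
    x = np.random.rand(batch_size, 4096).astype(np.float32)
    t0 = time.perf_counter()
    for _ in range(repeats):
        _ = x @ W
    return repeats * batch_size / (time.perf_counter() - t0)

# Per-sample throughput is far higher with large batches (training-like)
# than with batch size 1 (typical online inference).
for bs in (1, 64, 512):
    print(f"batch {bs:4d}: {samples_per_second(bs):,.0f} samples/s")
```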

Despite the optimism around FPGAs, and even though the new generation of Baidu Brain is built on FPGA platforms, FPGAs were not designed specifically for deep learning algorithms and still have several limitations. First, the computational power of their basic units is limited: to achieve reconfigurability, an FPGA contains huge numbers of very fine-grained basic units, but each unit (built mainly from LUTs, i.e., lookup tables) has far less computational power than the ALUs in CPUs and GPUs. Second, their speed and power consumption still lag behind fully custom chips (ASICs). Third, FPGAs are relatively expensive, and at large volumes their unit cost far exceeds that of a dedicated custom chip.

Therefore, as artificial intelligence algorithms and application technologies continue to develop, and as the industrial environment for AI dedicated chips (ASICs) matures, AI ASICs will become an inevitable trend in the development of AI computing chips.

First, custom chips bring significant performance improvements. For example, Nvidia's first chip designed from the ground up for deep learning, the Tesla P100, processes data 12 times faster than its 2014 GPU series. Google's TPU, a chip customized for machine learning, delivers performance that general-purpose chips would otherwise reach only after roughly seven more years of progress under Moore's Law. Just as the CPU transformed the room-sized computers of its era, AI ASICs will significantly reshape today's AI hardware landscape. For instance, the famous AlphaGo used roughly 170 GPUs and 1,200 CPUs, requiring an entire machine room, high-power air conditioning, and several experts for system maintenance; if dedicated chips were used throughout, the same capability would quite possibly fit in a box-sized device, with greatly reduced power consumption.

Second, downstream demand is driving the specialization of AI chips. From servers and computers to autonomous vehicles, drones, and all kinds of smart home appliances, the number of devices that need perception, interaction, and AI computing capability far exceeds the number of smartphones. Because of real-time requirements and the privacy of training data, these capabilities cannot rely entirely on the cloud; local hardware and software must support them, creating massive demand for AI chips.

In recent years, numerous AI chips have emerged both in China and abroad. Nvidia announced more than $2 billion of R&D investment in deep learning chips in 2016, while Google's TPU for deep learning had been running quietly for a year before directly powering the headline-making human-versus-machine Go match. Whether it is Nvidia, Google, IBM, and Qualcomm, or domestic companies such as Zhaoxin Micro and Cambricon, giants and startups alike regard AI chips as a strategically important technology, producing a flourishing AI chip landscape.

Overview of AI Dedicated Chip Development

Currently, dedicated AI chips are developing in three stages: semi-custom chips based mainly on FPGAs, fully custom chips designed for deep learning algorithms, and brain-like (neuromorphic) computing chips.

While chip demand has not yet reached scale and deep learning algorithms still require continuous iteration and improvement, semi-custom AI chips built on reconfigurable FPGAs are the best choice. A notable representative is the domestic startup Deep Insight Technology, which has designed a chip called the "Deep Processing Unit" (DPU) that aims to deliver better-than-GPU performance at ASIC-level power consumption, with its first products implemented on FPGA platforms. Although these semi-custom chips rely on an FPGA platform, their abstracted instruction set and compiler enable rapid development and iteration, a significant advantage over purpose-built FPGA accelerator products.

In the fully custom stage targeting deep learning algorithms, chips are designed entirely using ASIC methods, optimizing performance, power consumption, and area metrics for deep learning algorithms. Google’s TPU chip and the deep learning processor chip from the Institute of Computing Technology of the Chinese Academy of Sciences are typical representatives of this category.

In the brain-like computing stage, the goal of chip design is no longer merely to accelerate deep learning algorithms, but to build new brain-like computing architectures at the level of the basic circuit structure or even the individual device, for example by using memristors and ReRAM to increase storage density. Research on such chips is still far from producing technology mature enough for broad commercial use and carries significant risk, but in the long run brain-like chips could revolutionize computing architecture. IBM's TrueNorth chip is a typical representative of this category. The market potential for brain-like computing chips is enormous: according to third-party forecasts, the market, including consumer terminals, will reach hundreds of billions of dollars by 2022, with consumer devices as the largest segment at 98% of the total, and the remaining demand coming from industrial inspection, aerospace, military, and defense.

This article is sourced from: Electronic Product World
