Compiled by New Intelligence
Source: arXiv
Authors: Griffin Lacey, Graham Taylor, Shawki Areibi
Abstract
In recent years, the rapid increase in data volume and accessibility has led to a shift in the design concepts of algorithms in artificial intelligence. The manual creation of algorithms has been replaced by the ability of computers to automatically learn composable systems from large amounts of data, resulting in significant breakthroughs in key areas such as computer vision, speech recognition, and natural language processing. Deep learning is the most commonly used technology in these fields and has garnered significant attention from the industry. However, deep learning models require an extremely large amount of data and computational power, and only better hardware acceleration conditions can meet the growing demands of existing data and model scales. The existing solutions use GPU clusters as General-Purpose Graphics Processing Units (GPGPU), but Field-Programmable Gate Arrays (FPGA) offer another solution worth exploring. The increasingly popular FPGA design tools make it more compatible with the upper-level software commonly used in the deep learning field, making it easier for model builders and deployers to utilize FPGAs.FPGA architectures are flexible, allowing researchers to explore model optimization beyond the fixed architectures of GPUs. At the same time, FPGAs provide stronger performance per unit of energy, which is crucial for research in large-scale server deployments or resource-constrained embedded applications. This article examines deep learning and FPGAs from the perspective of hardware acceleration, pointing out trends and innovations that make these technologies compatible and stimulate discussions on how FPGAs can help the development of the deep learning field.
Machine learning has a profound impact on daily life. Whether clicking on personalized recommendations on websites, using voice communication on smartphones, or utilizing facial recognition technology to take photos, some form of artificial intelligence technology is involved. This new wave of artificial intelligence is accompanied by a shift in algorithm design concepts. Traditionally, data-based machine learning mostly relied on specific domain expertise to manually “shape” the features to be learned. The ability of computers to learn compositional feature extraction systems from large amounts of example data has led to significant performance breakthroughs in key areas such as computer vision, speech recognition, and natural language processing. Research on these data-driven technologies is called deep learning, which is currently attracting the attention of two important groups in the tech community: researchers who wish to use and train these models to achieve high-performance cross-task computing, and application scientists who want to deploy these models for new applications in the real world. However, they all face a limitation: the need to enhance hardware acceleration capabilities to meet the demands of scaling existing data and algorithms.
For deep learning, current hardware acceleration relies mainly on clusters of GPUs used as general-purpose graphics processing units (GPGPUs). Compared with traditional General-Purpose Processors (GPPs), GPUs have orders of magnitude more compute cores and lend themselves far more readily to parallel computation. NVIDIA's CUDA is the most mainstream GPGPU programming platform, and all major deep learning tools use it for GPU acceleration. Recently, the open parallel programming standard OpenCL has gained attention as an alternative for heterogeneous hardware programming, and enthusiasm for these tools is rising. Although OpenCL enjoys somewhat less support than CUDA in the deep learning field, it has two unique advantages. First, OpenCL is an open, royalty-free standard, in contrast to CUDA's single-vendor approach. Second, OpenCL supports a wide range of hardware, including GPUs, GPPs, FPGAs, and Digital Signal Processors (DSPs).
1.1. FPGA
This ability to target a range of hardware is particularly important for FPGAs, which are strong competitors to GPUs for algorithm acceleration. FPGAs differ from GPUs in the configurability of their hardware, and they typically provide better performance per unit of energy when running critical deep learning subroutines (such as sliding-window computations). However, configuring FPGAs requires hardware-specific knowledge that many researchers and application scientists do not possess, so FPGAs are often regarded as a specialist architecture. Recently, FPGA tools have begun to adopt software-level programming models, including OpenCL, making them increasingly attractive to users trained in mainstream software development.
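To make the "sliding window" concrete, the sketch below implements a valid-mode 2D convolution, the kernel at the heart of convolutional networks, as an explicit sliding window in plain Python. The deeply nested, data-independent loops are exactly the structure that an FPGA can unroll and pipeline in hardware; the function name and shapes here are illustrative, not from the paper.

```python
def conv2d(image, kernel):
    """Valid-mode 2D convolution computed as an explicit sliding window.

    `image` and `kernel` are lists of lists (rows of numbers). Each output
    element is the dot product of the kernel with one window of the image --
    the inner multiply-accumulate loops map naturally onto FPGA pipelines.
    """
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(ih - kh + 1):          # slide the window vertically
        row = []
        for x in range(iw - kw + 1):      # slide the window horizontally
            acc = 0
            for dy in range(kh):          # multiply-accumulate over window
                for dx in range(kw):
                    acc += image[y + dy][x + dx] * kernel[dy][dx]
            row.append(acc)
        out.append(row)
    return out
```

On a GPU the windows would be distributed across threads; on an FPGA the multiply-accumulate chain itself can be laid out as dedicated hardware, which is where the per-watt advantage comes from.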
For researchers examining a range of design tools, their selection criteria typically relate to whether the tools have user-friendly software development environments, whether they have flexible and upgradable model design methods, and whether they can compute quickly to reduce the training time of large models. As FPGAs become increasingly easier to program due to the emergence of highly abstract design tools, their reconfigurability makes customized architectures possible, while their high degree of parallel computing capability increases instruction execution speed, providing benefits to researchers in deep learning.
For application scientists, although there are similar tool-level choices, the focus of hardware selection is on maximizing performance per unit of energy to reduce costs for large-scale operations. Therefore, FPGAs, with their strong performance per unit of energy and the ability to customize architectures for specific applications, can benefit deep learning application scientists.
FPGAs can meet the needs of both audiences, making them a logical choice. This article examines the current state of deep learning on FPGAs and the technological developments currently used to bridge the gap between the two. Therefore, this article has three important objectives. First, it points out the opportunities to explore new hardware acceleration platforms in the field of deep learning, with FPGAs being an ideal choice. Secondly, it outlines the current state of FPGA support for deep learning and points out potential limitations. Finally, it provides key suggestions for the future direction of FPGA hardware acceleration to help address the challenges faced by deep learning in the future.
Traditionally, when evaluating the acceleration of hardware platforms, the trade-off between flexibility and performance must be considered. On one hand, General-Purpose Processors (GPPs) provide high flexibility and ease of use but lack efficiency in performance. These platforms are often easier to obtain, can be produced at a low cost, and are suitable for various purposes and reuse. On the other hand, Application-Specific Integrated Circuits (ASICs) can offer high performance but at the cost of being less flexible and harder to produce. These circuits are dedicated to specific applications and are expensive and time-consuming to manufacture.
FPGAs are a compromise between these two extremes. FPGAs belong to a class of more general Programmable Logic Devices (PLDs) and are essentially reconfigurable integrated circuits. Therefore, FPGAs can provide the performance advantages of integrated circuits while also possessing the reconfigurable flexibility of GPPs. FPGAs can easily implement sequential logic using flip-flops (FF) and combinational logic using look-up tables (LUT). Modern FPGAs also contain hardened components to realize some common functions, such as full processor cores, communication cores, arithmetic cores, and block RAM (BRAM). Additionally, current FPGA trends are moving towards System-on-Chip (SoC) design methods, where ARM co-processors and FPGAs are often located on the same chip. The current FPGA market is dominated by Xilinx and Altera, which together hold 85% of the market share. Furthermore, FPGAs are rapidly replacing ASICs and Application-Specific Standard Products (ASSP) for fixed-function logic. The FPGA market size is expected to reach $10 billion by 2016.
For deep learning, FPGAs offer acceleration potential well beyond that of traditional GPPs. GPPs execute software on the traditional von Neumann architecture, in which instructions and data are stored in external memory and fetched when needed. This gave rise to caches, which greatly relieve the burden of expensive external memory operations. The bottleneck of this architecture is the communication between processor and memory, which severely limits GPP performance, especially for the memory-bound operations that deep learning frequently requires. In contrast, the programmable logic elements of an FPGA can implement the data and control paths of ordinary logic functions without relying on the von Neumann architecture. FPGAs can also exploit distributed on-chip memory and deep pipeline parallelism, which fits naturally with feedforward deep learning methods. Modern FPGAs additionally support partial dynamic reconfiguration, allowing one part of the FPGA to be reconfigured while another part remains in use. This has implications for large-scale deep learning models: individual layers can be reconfigured on the FPGA without disrupting ongoing computations in other layers. It can serve models that do not fit on a single FPGA, while also reducing costly global memory reads by keeping intermediate results in local memory.
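The "natural fit" between pipelining and feedforward networks can be seen in a minimal sketch: a feedforward pass is just a chain of stages, each consuming the previous stage's output. On an FPGA, each `(W, b)` stage can become a hardware pipeline stage that processes a new input while downstream stages work on earlier ones. The function names here are illustrative, assuming a plain ReLU multilayer perceptron.

```python
def dense(vec, weights, bias):
    """One fully connected layer with ReLU: y = max(0, Wx + b)."""
    return [max(0.0, sum(w * x for w, x in zip(row, vec)) + b)
            for row, b in zip(weights, bias)]

def forward(vec, layers):
    """Feedforward pass: each (W, b) stage feeds the next.

    Because stage i only ever depends on the output of stage i-1, the
    chain maps directly onto a hardware pipeline, with intermediate
    results held in on-chip memory rather than global memory.
    """
    for W, b in layers:
        vec = dense(vec, W, b)
    return vec
```

A GPP must round-trip each layer's activations through the memory hierarchy; the pipelined FPGA layout keeps them local, which is the memory-traffic saving described above.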
Most importantly, compared to GPUs, FPGAs provide another perspective for exploring hardware acceleration design. The design of GPUs and other fixed architectures follows a software execution model and builds structures around autonomous computing units to perform tasks in parallel. As a result, the goal of developing GPUs for deep learning technology is to adapt algorithms to this model, ensuring that computations are completed in parallel and that data dependencies are maintained. In contrast, FPGA architectures are specifically customized for applications. When developing deep learning technologies for FPGAs, there is less emphasis on adapting algorithms to a fixed computational structure, thus leaving more freedom to explore optimizations at the algorithmic level. Techniques that require many complex lower-level hardware control operations are difficult to implement in higher-level software languages but are particularly attractive for FPGAs. However, this flexibility comes at the cost of significant compilation (placement and routing) time, which can often be a problem for researchers who need to iterate quickly through design cycles.
In addition to compilation time, attracting researchers and application scientists who prefer higher-level programming languages to develop for FPGAs is particularly challenging. While fluency in one programming language often means that one can easily learn another, this is not the case for hardware language translation skills. The most commonly used languages for FPGAs are Verilog and VHDL, both of which are Hardware Description Languages (HDL). The main difference between these languages and traditional software languages is that HDLs simply describe hardware, whereas software languages like C describe sequential instructions without needing to understand the execution details at the hardware level. Effectively describing hardware requires specialized knowledge of digital design and circuits, and although some lower-level implementation decisions can be left to automatic synthesis tools, they often do not achieve efficient designs. Therefore, researchers and application scientists tend to choose software design, as it is already mature and has many abstractions and conveniences to enhance programmer efficiency. These trends have made the FPGA field currently favor highly abstract design tools.
Milestones in FPGA Deep Learning Research
1987 VHDL becomes an IEEE standard
1992 GANGLION becomes the first FPGA neural network hardware implementation project (Cox et al.)
1994 Synopsys launches the first generation of FPGA behavioral synthesis solutions
1996 VIP becomes the first CNN implementation solution for FPGAs (Cloutier et al.)
2005 FPGA market value approaches $2 billion
2006 First FPGA implementation achieving 5 GOPS with the backpropagation (BP) algorithm
2011 Altera launches OpenCL support for FPGAs; large-scale FPGA-based CNN research emerges (Farabet et al.)
2016 FPGA-based CNN acceleration for data centers emerges, built on Microsoft's Catapult project (Ovtcharov et al.)
The future of deep learning, on FPGAs and overall, depends primarily on scalability. To successfully address future challenges, these technologies must scale to support rapidly growing data volumes and architectures. FPGA technology is adapting to this trend, with hardware evolving towards larger memory, smaller feature sizes, and better interconnects to accommodate multi-FPGA configurations. Intel's acquisition of Altera and IBM's collaboration with Xilinx both signal a transformation of the FPGA field, and we may soon see FPGAs integrated into both personal applications and data center applications. In addition, design tools are likely to evolve toward greater abstraction and a more software-like experience to attract a broader range of users.
4.1. Common Deep Learning Software Tools
Among the most commonly used software tools for deep learning, some have already recognized the need to support OpenCL alongside CUDA. This will make it easier to bring deep learning to FPGAs. Although, to our knowledge, no deep learning tool currently states explicit support for FPGAs, the list below covers tools that are moving towards OpenCL support:
Caffe, developed by the Berkeley Vision and Learning Center, provides unofficial support for OpenCL through its GreenTea project. There is also an AMD version of Caffe that supports OpenCL.
Torch, a scientific computing framework based on the Lua language, is widely used, with its project CLTorch providing unofficial support for OpenCL.
Theano, developed by the University of Montreal, is developing a gpuarray backend that provides unofficial support for OpenCL.
DeepCL, an OpenCL library developed by Hugh Perkins for training convolutional neural networks.
For those just entering this field and looking to choose tools, we recommend starting with Caffe, as it is widely used, well-supported, and has a simple user interface. It is also easy to experiment with pre-trained models using Caffe’s model zoo library.
4.2. Increasing Training Flexibility
One might assume that training a machine learning algorithm is fully automated; in reality, there are hyperparameters that must be tuned. This is especially true for deep learning, where model complexity in terms of parameter count is often accompanied by a large number of possible hyperparameter combinations. Tunable hyperparameters include the number of training iterations, the learning rate, the mini-batch size, the number of hidden units, and the number of layers, among others. Tuning these parameters amounts to selecting, among all possible models, the one best suited to a particular problem. Traditionally, hyperparameters are set by experience, by systematic grid search, or by the more efficient random search. Recently, researchers have turned to adaptive methods that use the results of previous tuning attempts to guide the next configuration; among these, Bayesian optimization is the most widely used.
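The random search mentioned above can be sketched in a few lines: sample hyperparameter settings uniformly from the search space, evaluate each, and keep the best. The function names and search space below are illustrative assumptions, not from the paper; `train_eval` stands in for a full train-and-validate run.

```python
import random

def random_search(train_eval, space, n_trials=20, seed=0):
    """Random hyperparameter search: sample configurations uniformly
    from `space` and keep the best-scoring one.

    `train_eval(cfg)` should train a model with configuration `cfg` and
    return a validation score (higher is better). `space` maps each
    hyperparameter name to a list of candidate values.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    best_score, best_cfg = float("-inf"), None
    for _ in range(n_trials):
        cfg = {name: rng.choice(choices) for name, choices in space.items()}
        score = train_eval(cfg)
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score
```

Grid search would instead enumerate the full cross product of `space`; random search typically finds good settings with far fewer trials when only a few hyperparameters matter, which is why the text calls it more efficient.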
No matter which method is used to adjust hyperparameters, the current training processes using fixed architectures somewhat limit the possibilities of models. In other words, we may only glimpse a part of all the solutions. Fixed architectures make it easy to explore hyperparameter settings within a model (e.g., number of hidden units, number of layers, etc.), but it becomes difficult to explore parameter settings between different models (e.g., different model types) because training a model that does not easily conform to a fixed architecture may take a long time. In contrast, the flexible architecture of FPGAs may be more suitable for the aforementioned types of optimization, as FPGAs can create an entirely different hardware architecture and accelerate it during runtime.
4.3. Low Power Compute Clusters
The most fascinating aspect of deep learning models is their scalability. Whether to discover complex high-level features from data or to enhance performance for data center applications, deep learning technology often scales across multi-node computing infrastructures. Current solutions use GPU clusters with Infiniband interconnect technology and MPI to achieve upper-layer parallel computing capabilities and rapid data transfer between nodes. However, as the load of large-scale applications becomes increasingly diverse, using FPGAs may be a better approach. The programmable nature of FPGAs allows systems to be reconfigured based on applications and loads, and their low power consumption helps reduce costs for the next generation of data centers.
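The multi-node scaling described above usually means data parallelism: each node computes gradients on its shard of data, then the nodes average gradients with an all-reduce (MPI_Allreduce over Infiniband in the GPU-cluster solutions mentioned). The sketch below simulates that reduction in-process; in a real cluster each inner list would live on a different node, and the helper name is an illustrative assumption.

```python
def average_gradients(per_node_grads):
    """All-reduce-style averaging of gradients from data-parallel nodes.

    `per_node_grads` is a list with one flat gradient vector per node.
    A real cluster would perform this with MPI_Allreduce so every node
    ends up with the same averaged gradient; here we simulate the
    reduction in a single process.
    """
    n = len(per_node_grads)
    length = len(per_node_grads[0])
    return [sum(grads[i] for grads in per_node_grads) / n
            for i in range(length)]
```

Whether the nodes are GPUs or FPGAs, this reduction step is the communication bottleneck that interconnect technology must hide, which is why reconfigurable, low-power nodes are attractive at data center scale.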
Compared to GPUs and GPPs, FPGAs provide an attractive alternative to meet the hardware demands of deep learning. With their ability for pipeline parallel computing and efficient energy consumption, FPGAs will demonstrate unique advantages in general deep learning applications that GPUs and GPPs do not possess. At the same time, algorithm design tools are becoming increasingly mature, making it possible to integrate FPGAs into commonly used deep learning frameworks. In the future, FPGAs will effectively adapt to the trends in deep learning, ensuring that relevant applications and research can be freely realized from an architectural perspective.