With the rapid development of large language models (LLMs), accelerators capable of efficiently processing LLM computations have become increasingly important. This article discusses the need for LLM accelerators and provides a comprehensive analysis of the main hardware and software features of currently available commercial LLM accelerators. Based on this analysis, it proposes development ideas and future research directions for the next generation of LLM accelerators.
Introduction
With the success of ChatGPT, research and development of LLMs and their applications have accelerated. Recently, in addition to traditional text-based chatbots, multimodal LLMs capable of processing text, images, audio, and video have emerged. As these models become increasingly complex, the data requirements for training and inference, along with the associated hardware costs, have also grown exponentially.
GPUs are commonly used for LLM training and inference due to their versatility and availability. However, the high power consumption of GPUs leads to thermal issues, necessitating expensive cooling systems. Therefore, optimizing hardware for LLM computations is crucial for reducing costs.
LLM accelerators are specifically designed to enhance LLM computations and are currently being developed in various forms. Unlike general-purpose GPUs, LLM accelerators use dedicated circuits to improve efficiency, providing higher performance with lower energy consumption. While advancements in hardware are critical, the supporting software stack is also becoming increasingly important.
This article analyzes the hardware and software of LLM accelerators, focusing on off-the-shelf commercial products, in particular those implemented as ASICs. Based on this analysis, it examines the fundamental characteristics of proprietary LLM accelerators, identifies the challenges they face, and proposes future research directions.
Types of LLM Accelerators
Based on the hardware architectures optimized for LLM computations, the main types of LLM accelerators are GPUs, FPGAs, and ASICs.
GPUs are the most widely used LLM accelerators, capable of quickly processing large amounts of data due to their highly parallel structure. They can execute many threads simultaneously and achieve further parallelism through multi-GPU interconnect technologies.
FPGA-based LLM accelerators provide an intermediate platform between ASICs and GPUs, offering programmable flexibility and a degree of optimization for LLM computations. They can adapt to new LLM architectures or algorithms through relatively flexible hardware reconfiguration and have relatively low power consumption.
ASIC-based LLM accelerators are chips tailored for LLM computations. Unlike FPGAs, ASICs are optimized for specific tasks, providing excellent efficiency. They can optimize memory usage in LLM computations and typically have lower power consumption than GPUs.
Requirements for Accelerators
LLM accelerators must efficiently process Transformer blocks, in particular the multi-head attention and feedforward operations that dominate LLM training and inference, and must also meet several further requirements.
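To make this core workload concrete, the following is a minimal PyTorch-style sketch of the multi-head attention and feedforward operations inside a Transformer block; the dimensions are illustrative assumptions, not those of any particular model or accelerator.

```python
# Minimal sketch of the two dominant Transformer-block operations
# (illustrative sizes; not tied to any specific accelerator).
import torch
import torch.nn as nn

d_model, n_heads, d_ff = 512, 8, 2048          # hypothetical model dimensions
x = torch.randn(1, 128, d_model)               # (batch, sequence, hidden)

# Multi-head attention: batched matrix multiplies plus a softmax.
mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
attn_out, _ = mha(x, x, x)

# Feedforward network: two large GEMMs with a nonlinearity between them.
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
y = ffn(attn_out)

print(y.shape)  # torch.Size([1, 128, 512])
```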
Low Power Consumption: LLM accelerators require higher throughput and lower power consumption than GPUs, necessitating collaborative hardware and software design to achieve energy-efficient computing. Optimizing memory, host-accelerator interfaces, and compiler technologies is essential.
Low Latency: Reducing inference latency is critical for LLM services. Techniques such as tiling and pipelining can decompose LLM computations into smaller units for efficient processing. Recent research has also proposed methods to optimize data transfer between accelerators and memory, or between accelerators.
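As a rough illustration of tiling, the sketch below splits one large matrix multiplication into smaller blocks that an accelerator could schedule, pipeline, or overlap with data movement; the tile size is an arbitrary assumption.

```python
# Illustrative tiled matrix multiplication: one large GEMM is decomposed
# into smaller blocks that can be scheduled or pipelined independently.
import numpy as np

def tiled_matmul(a: np.ndarray, b: np.ndarray, tile: int = 128) -> np.ndarray:
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    out = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                # Each block is a small unit of work that can be mapped to
                # on-chip memory and overlapped with data transfers.
                out[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
    return out

a = np.random.randn(512, 512).astype(np.float32)
b = np.random.randn(512, 512).astype(np.float32)
assert np.allclose(tiled_matmul(a, b), a @ b, atol=1e-3)
```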
High Memory Capacity: Models such as Llama 3.1 have up to 405 billion parameters and therefore require a large amount of memory. Inference at 32-bit or 16-bit precision requires roughly 1,944 GB or 972 GB, respectively, equivalent to about 24 or 12 NVIDIA H100 80 GB GPUs. Efficient memory management also requires optimizing how queries, keys, and values are stored and loaded alongside the model.
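These figures follow directly from the parameter count and numeric precision. The back-of-envelope calculation below reproduces them; the 1.2x overhead factor is an assumption chosen to match the quoted totals, and real deployments also need additional memory for the KV cache.

```python
# Back-of-envelope memory estimate for serving Llama 3.1 405B.
# The 1.2x overhead factor is an assumption that reproduces the figures
# quoted above; KV-cache and activation memory are not included.
params = 405e9
hbm_per_h100_gb = 80

for bits in (32, 16):
    weight_gb = params * (bits / 8) / 1e9      # raw weight storage
    total_gb = weight_gb * 1.2                 # assumed runtime overhead
    gpus = total_gb / hbm_per_h100_gb
    print(f"{bits}-bit: ~{total_gb:,.0f} GB -> ~{gpus:.0f} x H100 80GB")
# 32-bit: ~1,944 GB -> ~24 x H100 80GB
# 16-bit: ~972 GB   -> ~12 x H100 80GB
```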
Support for Mixed Precision: To reduce the memory required during LLM training and inference, techniques such as quantization are used. While 4-bit and 8-bit formats are employed for weights, operations such as the additions within Transformer blocks still require higher precision, so accelerators must support mixed precision.
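The sketch below illustrates the basic idea: weights and activations are stored in 8-bit form, but the accumulation inside the matrix multiply is carried out at higher precision. The per-tensor scaling scheme here is a simple illustrative assumption, not any vendor's specific format.

```python
# Simple per-tensor int8 quantization with higher-precision accumulation.
# Illustrative only; production quantization schemes are more elaborate.
import numpy as np

def quantize_int8(x: np.ndarray):
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(256, 256).astype(np.float32)   # weights stored in 8-bit
a = np.random.randn(256).astype(np.float32)        # activations

w_q, w_s = quantize_int8(w)
a_q, a_s = quantize_int8(a)

# Accumulate in 32-bit, then rescale: the add/accumulate step needs more
# precision than the 8-bit operands themselves.
acc = a_q.astype(np.int32) @ w_q.astype(np.int32)
y = acc.astype(np.float32) * (a_s * w_s)

print(np.max(np.abs(y - a @ w)))   # error is small relative to the output magnitude
```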
Support for Parallel Processing: As LLMs grow larger, executing them on a single accelerator becomes increasingly difficult, so parallel and distributed training and inference across multiple accelerators are necessary.
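As a toy illustration of one such technique (tensor parallelism), the sketch below splits a weight matrix column-wise across several "devices", simulated here with plain arrays; real multi-accelerator setups additionally rely on collective communication over the interconnects discussed later.

```python
# Toy tensor parallelism: a weight matrix is split column-wise across
# "devices" (plain arrays here); each shard computes a partial output.
import numpy as np

n_devices = 4
x = np.random.randn(8, 512).astype(np.float32)      # activations
w = np.random.randn(512, 1024).astype(np.float32)   # full weight matrix

shards = np.split(w, n_devices, axis=1)              # one shard per device
partials = [x @ shard for shard in shards]           # computed in parallel
y = np.concatenate(partials, axis=1)                 # gather the partial results

assert np.allclose(y, x @ w, atol=1e-4)
```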
Overview of LLM Accelerators
For the analysis of LLM accelerators, we selected the NVIDIA H100, AMD MI300X, Cerebras WSE-2, Google TPU v4, Graphcore IPU, Groq LPU, Intel Gaudi 3, SambaNova SN40L, and Tenstorrent Grayskull.
Hardware Characteristics
Memory and Performance
Table 1 shows the specifications of each LLM accelerator, including memory and power consumption. TDP refers to thermal design power; GB and TB denote gigabytes and terabytes.
The NVIDIA H100 (SXM) is widely used for LLM training and inference and consists of 132 streaming multiprocessors (SMs), each equipped with tensor cores, caches, and other resources.
The AMD MI300X adopts a chiplet architecture, using 8 accelerator complex dies for computation.
The Cerebras WSE-2 is a wafer-scale chip with an area about 55 times larger than a typical GPU die (WSE-2: 46,225 mm²; NVIDIA A100: 826 mm²), containing 850,000 cores. It uses SRAM to provide 40 GB of on-chip memory, but its drawback is extremely high power consumption, exceeding 20 kW.
The Google TPU v4 consists of two TensorCores per chip. The TPU follows an optically reconfigurable network architecture suited to large-scale machine learning tasks, enabling distributed processing at supercomputer scale by connecting up to 4,096 chips.
The Graphcore IPU consists of 1,472 independent processor cores designed for large-scale parallel processing. Each IPU is equipped with 900 MB of SRAM, distributed among the IPU's processing units (called tiles), providing fast data access through the processor's in-memory architecture.
The Groq LPU is built around a software-defined Tensor Streaming Processor (TSP) for large-scale computing. Through network interconnection, it can scale up to 10,440 TSPs. Equipped with only 230 MB of SRAM per chip, it requires hundreds of LPUs to run models such as Llama 70B, leading to significant power consumption and operational cost issues.
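The scale of this problem follows from simple arithmetic: the rough estimate below (assuming 16-bit weights and ignoring activation and KV-cache memory) shows why holding a 70B-parameter model in 230 MB of on-chip SRAM per chip requires hundreds of LPUs.

```python
# Rough estimate of how many 230 MB LPUs are needed just to hold Llama 70B
# weights in 16-bit precision (activations and KV cache ignored).
params = 70e9
weight_gb = params * 2 / 1e9          # 16-bit = 2 bytes per parameter
sram_gb = 0.230                       # 230 MB SRAM per LPU
print(f"~{weight_gb / sram_gb:.0f} LPUs")   # ~609 LPUs
```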
The Intel Gaudi 3 provides high parallel-processing performance through 8 matrix multiplication engines (MMEs) and 64 tensor processor cores (TPCs). It is equipped with 128 GB of HBM2E, supporting over 3 TB/s of bandwidth.
The SambaNova SN40L consists of 1,040 distributed pattern compute units. Unlike other LLM accelerators, it features a three-tier memory system composed of HBM, DRAM, and SRAM. This memory configuration, combined with a streaming dataflow execution model that fuses hundreds of complex operations into a single kernel call, achieves high parallelism and supports performance improvements.
The Tenstorrent Grayskull adopts a RISC-V-based architecture designed to intelligently process and move data through a network-on-chip (NoC), utilizing up to 120 compute cores called Tensix cores.
The performance of each LLM accelerator is shown in Table 2.
Interconnect
Another important aspect of LLM accelerators is interconnectivity. As LLMs grow in scale, processing them with a single accelerator becomes difficult, making interconnect technologies within and between accelerators extremely important. The interconnect support of the various LLM accelerators is shown in Table 3.
The NVIDIA H100 (SXM) supports the proprietary NVLink and NVSwitch interconnects as well as InfiniBand (Mellanox/NVIDIA). NVLink is used for communication between GPUs, typically providing higher bandwidth than PCIe. NVSwitch connects large GPU clusters, managing and processing large-scale network traffic. InfiniBand is a high-speed communication technology between computing nodes that supports RDMA (Remote Direct Memory Access) and is primarily used in HPC (high-performance computing) and data centers. NVIDIA provides the NVIDIA Collective Communication Library (NCCL) to support optimal parallel processing across multiple GPUs.
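As a minimal example of how a framework uses NCCL, PyTorch's distributed package can select the NCCL backend, after which collectives such as all-reduce run over NVLink, NVSwitch, or InfiniBand as available. The sketch assumes it is launched with torchrun on a multi-GPU node.

```python
# Minimal NCCL all-reduce sketch; assumes launch via
#   torchrun --nproc_per_node=<num_gpus> this_script.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")        # NCCL backend for GPU collectives
local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun
torch.cuda.set_device(local_rank)

t = torch.ones(4, device="cuda") * dist.get_rank()
dist.all_reduce(t, op=dist.ReduceOp.SUM)       # sums the tensor across all GPUs
print(f"rank {dist.get_rank()}: {t}")

dist.destroy_process_group()
```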
The AMD MI300X uses the proprietary Infinity Fabric to provide high-bandwidth interconnects; the Infinity Fabric mesh provides 128 GB/s bidirectional links between accelerators. AMD also offers the ROCm Communication Collectives Library (RCCL) for communication between multiple LLM accelerators.
The Google TPU v4 supports low-latency data transfer between chips via the ICI (Inter-Chip Interconnect). Multiple connected TPUs form a TPU Pod, and communication between TPU Pods can be reconfigured through OCS (Optical Circuit Switches), offering better scalability and flexibility than traditional electronic switches.
The Intel Gaudi 3 supports high-speed data interconnects through integrated Ethernet-based RDMA (RDMA over Converged Ethernet, RoCE) networking. RoCE implements the RDMA protocol over Ethernet, enabling low-latency transfers through direct data transmission between accelerators. Gaudi 3 also supports large-scale parallel data transfer through RDMA network cards, providing high scalability and communication efficiency. Intel offers optimizations for multi-device interconnects through the Intel oneAPI Collective Communications Library (oneCCL).
The Tenstorrent Grayskull allows data exchange between cores through a two-dimensional bidirectional torus network. It is designed around a dataflow architecture, providing high efficiency in parallel computing environments; however, PCIe is used for card-to-card communication.
Other accelerators, such as the Graphcore IPU, support accelerator clusters through IPU-Link and GW-Link, while the Cerebras WSE-2, Groq LPU, and SambaNova SN40L support communication between multiple accelerators using custom interconnects.
Software Features
An important aspect of LLM accelerators is compatibility with existing LLM software. Frameworks such as TensorFlow, PyTorch, and JAX are commonly used for machine learning and deep learning, and frameworks designed specifically for LLM development, such as vLLM and DeepSpeed, are also widely used. Table 4 shows the software support status for the proprietary LLM accelerators: "✓" indicates official support within the framework, "▲" indicates unofficial or experimental support, and a product name in parentheses indicates that support is limited to that hardware.
The NVIDIA H100 is one of the most representative LLM accelerators and, due to its extensive use for LLMs, is widely supported by various frameworks. NVIDIA provides dedicated frameworks such as TensorRT-LLM and accelerates computation through NVIDIA CUDA libraries.
LLM accelerators such as the AMD MI300X, Cerebras WSE-2, and Google TPU v4 likewise support various proprietary frameworks and hardware acceleration libraries. For existing LLM frameworks that do not officially support a given accelerator, vendors maintain modified versions of the framework source code, adding compiler and model-conversion code to support their hardware. They can also convert code or models developed in existing frameworks into a compatible intermediate format such as ONNX and connect it to their proprietary frameworks.
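As an example of this conversion path, a model developed in PyTorch can be exported to ONNX and then compiled by a vendor's proprietary toolchain; the tiny model below is purely illustrative.

```python
# Export a toy PyTorch model to ONNX so a vendor toolchain can ingest it.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 512)).eval()
dummy = torch.randn(1, 512)

torch.onnx.export(
    model, dummy, "toy_model.onnx",
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)
# The resulting .onnx file can then be compiled by an accelerator's SDK.
```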
To support a variety of LLMs, each accelerator vendor optimizes base models for its hardware and distributes them. However, because supporting every model is challenging, the set of models each accelerator can train and run for inference differs.
Challenges and Future Directions for LLM Accelerators
Memory: As the scale of LLMs increases, the required memory capacity also grows. HBM is adopted to meet these demands, but its complex manufacturing process drives up the price of LLM accelerators. With the continued growth of LLMs, new solutions are needed; although LLM accelerators equipped with SRAM or LPDDR have been proposed as alternatives to HBM, these may not be sufficient to meet future demands.
Low Power Consumption: The TDP of the LLM accelerators analyzed in this article ranges from 192 W to 20 kW. In particular, the power consumption of the AMD MI300X and Intel Gaudi 3 is 50-200 W higher than that of the NVIDIA H100. Moreover, since LLM services require multiple accelerators, total system power consumption rises sharply.
Scalability: Most LLM accelerators use interconnects developed by their respective manufacturers. Because executing large-scale LLMs on a single accelerator is difficult, interconnect technology is crucial for scaling, both for connecting multiple accelerators and for expanding these connected accelerator groups.
Software Compatibility: Most LLM accelerators use proprietary software stacks. Providing a seamless development environment is essential, as many users rely on frameworks commonly used with GPUs. Beyond framework support, the most critical element is compiler technology that allows developed models to achieve optimal performance on LLM accelerator hardware.
Conclusion
This article analyzed the hardware and software characteristics of commercial LLM accelerators and discussed the challenges they face. By comparing the specifications and performance of the major accelerators, we identified current technology trends and issues concerning memory capacity, power consumption, scalability, and software compatibility. Future research should pursue innovative hardware designs and software optimization techniques to address these challenges, focusing on improving energy efficiency and scalability while ensuring software compatibility.
Original link:
https://arxiv.org/html/2503.09650v1