ASIC
Application-Specific Integrated Circuit. An ASIC is designed and optimized for one specific task or function; its wiring and connections are fixed at fabrication and cannot be changed afterwards. Examples include Bitcoin mining chips and mobile image processing units.
CPU
Handles general-purpose tasks. Modern CPUs also support SIMD (Single Instruction, Multiple Data) instructions, which suit vectorized operations: for a multi-element array, the same operation is applied to several elements at once, improving efficiency. (SIMD is described here as an advance over traditional scalar CPU execution, not as a comparison with GPUs.)
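The difference between scalar and SIMD execution can be sketched as an instruction count. This is only a conceptual model in plain Python, not real SIMD: the lane width `LANES = 4` and the function names are assumptions made for illustration.

```python
from typing import List, Tuple

LANES = 4  # assumed vector width: 4 elements per SIMD instruction


def scalar_add(a: List[float], b: List[float]) -> Tuple[List[float], int]:
    """Scalar model: one add instruction per element."""
    out, instructions = [], 0
    for x, y in zip(a, b):
        out.append(x + y)
        instructions += 1
    return out, instructions


def simd_add(a: List[float], b: List[float], lanes: int = LANES) -> Tuple[List[float], int]:
    """SIMD model: one add instruction per group of `lanes` elements."""
    out, instructions = [], 0
    for i in range(0, len(a), lanes):
        # A real SIMD unit would compute all lanes of this group at once.
        out.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
        instructions += 1
    return out, instructions


a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [10.0] * 8
scalar_result, scalar_count = scalar_add(a, b)
simd_result, simd_count = simd_add(a, b)
print(scalar_count, simd_count)  # 8 2
```

Both versions produce the same array; only the number of issued instructions differs.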
FPGA
Field-Programmable Gate Array. An FPGA consists of a collection of logic gates and switches. The design is described in an HDL (Hardware Description Language) and compiled into a bitstream file, which selects the logic gates and switches to use and establishes the interconnection paths between logic units.
GPU
A GPU operates using SIMT (Single Instruction, Multiple Threads), executing multiple threads simultaneously.
In contrast, a CPU relies on a single thread to fetch data, so it has to keep the memory bus as full as possible to be efficient. For example: with a memory bandwidth of 131 GB/s and a memory latency of 89 ns, the bus can transfer 131 x 89 = 11659 bytes within 89 ns.
One ax+y step transfers 16 bytes (two 8-byte doubles) within 89 ns, giving a memory bandwidth utilization of 16/11659 = 0.14% and leaving 99.86% of the memory bus idle. To improve this, more data must be in flight at once, which takes 11659/16 ≈ 729 requests. This puts pressure on the threads, because a single core runs threads concurrently (time-slicing) rather than in parallel. For instance, a 2.9 GHz core can execute only about 89 x 2.9 ≈ 258 operations in 89 ns, and that budget includes context switching.
In contrast, a GPU can directly compute with 729 threads in parallel (this applies to computation; for scheduling, multi-threading is also ineffective).
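The arithmetic above can be checked with a few lines of Python. The constants are the figures quoted in the text; the variable names are only illustrative.

```python
import math

BANDWIDTH_GB_PER_S = 131  # memory bandwidth from the text
LATENCY_NS = 89           # memory latency from the text
CLOCK_GHZ = 2.9           # core clock from the text
BYTES_PER_REQUEST = 16    # two 8-byte doubles for one ax+y step

# GB/s times ns gives bytes, because 1e9 * 1e-9 = 1.
bytes_in_flight = BANDWIDTH_GB_PER_S * LATENCY_NS

# Fraction of the bus used when only one 16-byte request is outstanding.
utilization = BYTES_PER_REQUEST / bytes_in_flight

# Requests that must be outstanding to keep the bus full.
requests_needed = math.ceil(bytes_in_flight / BYTES_PER_REQUEST)

# Whole clock cycles a single core gets during one memory latency.
cycles_per_latency = int(CLOCK_GHZ * LATENCY_NS)

print(bytes_in_flight)       # 11659
print(f"{utilization:.2%}")  # 0.14%
print(requests_needed)       # 729
print(cycles_per_latency)    # 258
```

A single core has roughly 258 cycles to issue 729 requests, which is why the text turns to hundreds of parallel GPU threads instead.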
Q: So, can a GPU do everything a CPU can? It comes down to cost: a single GPU core delivers roughly half the performance of a CPU core (which is still significant).
GPUs excel by running far more threads than are strictly needed: at any moment some threads are waiting for data, some are waiting to be scheduled for computation, and some are actively computing. GPU memory must be sufficiently large, and each thread needs ample registers to hold its live data, which minimizes data movement.
CGRA
Coarse-Grained Reconfigurable Array. A CGRA is controlled by configuring registers and interconnect routing, which determine what the computational units do and how data flows between them.
Control Example
1. Configure computational units. Assume a Processing Element (PE) supports +, x, and AND operations; the configuration registers of the PE may include:
a) Opcode field: determines the current operation type (e.g., 00 for addition, 01 for multiplication, 10 for AND)
b) Input: which input port the operands come from
c) Output: where the result is sent, either to another PE or a storage unit
2. Configure the interconnect network
The interconnect network is defined through registers:
•How the output of one PE connects to the input of another PE
•Routing priority, such as whether to use a global bus or point-to-point transmission
3. Task scheduling
When executing multi-stage tasks:
a) Configure PE1 for addition, with inputs from data streams A and B
b) Configure PE2 for multiplication, with inputs from PE1 and data stream C
c) Configure the interconnect network to route the result of PE2 to the storage unit
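The three steps above can be sketched as a tiny software model. This is a minimal illustration, not how any particular CGRA is programmed: the `PEConfig` fields, the `run` helper, and the use of names such as "PE1" and "storage" to stand in for interconnect routing are all hypothetical. Only the opcode encoding comes from the text.

```python
from dataclasses import dataclass
from typing import Dict, List

# Opcode encoding taken from the text: 00 add, 01 multiply, 10 AND.
OP_ADD, OP_MUL, OP_AND = 0b00, 0b01, 0b10

ALU = {
    OP_ADD: lambda a, b: a + b,
    OP_MUL: lambda a, b: a * b,
    OP_AND: lambda a, b: a & b,
}


@dataclass
class PEConfig:
    """Hypothetical configuration register of one PE."""
    opcode: int  # which operation the PE performs
    src_a: str   # source routed to the first operand
    src_b: str   # source routed to the second operand
    dest: str    # where the result goes: another PE or "storage"


def run(configs: Dict[str, PEConfig], streams: Dict[str, List[int]]) -> List[int]:
    """Evaluate PEs in dataflow order and return what reaches storage."""
    signals = dict(streams)  # every named signal on the interconnect
    storage: List[int] = []
    for name, cfg in configs.items():
        op = ALU[cfg.opcode]
        out = [op(a, b) for a, b in zip(signals[cfg.src_a], signals[cfg.src_b])]
        signals[name] = out  # output becomes visible to downstream PEs
        if cfg.dest == "storage":
            storage = out
    return storage


# Steps a) to c) from the text: PE1 = A + B, PE2 = PE1 x C, PE2 -> storage.
configs = {
    "PE1": PEConfig(OP_ADD, "A", "B", dest="PE2"),
    "PE2": PEConfig(OP_MUL, "PE1", "C", dest="storage"),
}
streams = {"A": [1, 2, 3], "B": [4, 5, 6], "C": [2, 2, 2]}
stored = run(configs, streams)
print(stored)  # [10, 14, 18]
```

Changing the computation means changing only the configuration entries, which mirrors the idea that the fabric is fixed and only the registers are rewritten.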
Q: So, when an m x n CGRA chip is produced, each tile’s connections and functional units are already included and are then controlled by HDL. Why is layout necessary? Isn’t production just a matter of varying m and n?
FPGAs consist of numerous small programmable logic units (Lookup Tables, LUTs) and interconnect networks; CGRAs utilize larger computational units (such as Arithmetic Logic Units, ALUs, and multipliers) as basic modules.
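The difference in granularity can be illustrated with a small model. This is a sketch under simplifying assumptions: `make_lut` and `pe` are hypothetical helpers, a real LUT is a small memory addressed by its input bits, and a real PE is a hardware ALU.

```python
def make_lut(truth_table):
    """Return a k-input LUT; `truth_table` holds 2**k output bits."""
    def lut(*inputs):
        index = 0
        for bit in inputs:  # the input bits form the table address, MSB first
            index = (index << 1) | (bit & 1)
        return truth_table[index]
    return lut


# The same LUT structure becomes a different gate just by changing its bits.
xor2 = make_lut([0, 1, 1, 0])
and2 = make_lut([0, 0, 0, 1])


def pe(opcode, a, b):
    """Coarse-grained PE: one opcode selects a whole-word operation."""
    if opcode == "add":
        return a + b
    if opcode == "mul":
        return a * b
    if opcode == "and":
        return a & b
    raise ValueError(f"unknown opcode: {opcode}")


print(xor2(1, 0), and2(1, 1))  # 1 1
print(pe("add", 20, 22))       # 42
```

An FPGA builds a word-level adder out of many such bit-level LUTs, while a CGRA provides the adder as a single configurable unit.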
Others
•DRAM is dynamic: it is implemented with capacitors, which lose their data unless refreshed periodically (hence the term dynamic). It also loses data when power is off. It is what is commonly called main memory, and it allows random access: unlike hard drives, which are accessed by sector, any location can be addressed directly.
•SRAM is static: it does not require refreshing. It is used inside CPUs (for example, in caches) and also loses data when power is off.
•Flash memory is non-volatile storage, as used in SSDs. NAND flash is read and written in pages and erased in blocks, so it is not randomly accessible at byte granularity.