This article is from Semiconductor Industry Observation.

Abstract: Integrating multiple heterogeneous chiplets into a package is a promising and cost-effective strategy for building flexible, scalable systems that efficiently accelerate diverse workloads. Following this strategy, we present Arvon, which integrates a 14-nm FPGA chiplet and two closely packed high-performance 22-nm DSP chiplets using Embedded Multi-die Interconnect Bridges (EMIBs). The chiplets are interconnected via a 1.536-Tb/s Advanced Interface Bus (AIB) 1.0 interface and a 7.68-Tb/s AIB 2.0 interface. Arvon is programmable and supports a variety of workloads ranging from neural networks (NNs) to communication signal processing. Each DSP chiplet provides a peak performance of 4.14 TFLOPS in half-precision floating point at an energy efficiency of 1.8 TFLOPS/W. We also developed a compiler that partitions workloads between the FPGA and DSPs to optimize system performance and resource utilization. The AIB 2.0 interface, implemented with 36-µm-pitch micro-bumps, achieves a data rate of 4 Gb/s/pin at an energy efficiency of 0.10 pJ/b for the I/O, or 0.46 pJ/b including the adapter. Its shoreline bandwidth density reaches 1.024 Tb/s/mm, and its areal bandwidth density reaches 1.705 Tb/s/mm².
I. Introduction
With the rapid development of fields such as machine learning, robotics, and 5G/6G communications[1], [2], [3], [4], the DSP workloads in these areas are becoming increasingly dynamic and diverse. Meeting these demands calls for a hardware system that is both flexible and computationally efficient. Such a system can integrate multiple components, including CPUs, FPGAs, and DSP accelerators, where the CPU or FPGA provides the necessary flexibility and the DSP accelerators provide efficient kernel acceleration.
One way to build such a system is to design and manufacture a monolithic system-on-chip (SoC). However, building a large SoC incurs high cost and long development time. As a cost-effective alternative, a heterogeneous system-in-package (SiP) using 2.5-D or 3-D chiplet integration provides a promising avenue for constructing large-scale systems, achieving performance comparable to monolithic integration while avoiding its high costs, risks, and engineering effort. A SiP consists of interconnected components called chiplets. Each chiplet contains a functional module that can be manufactured in the technology node best suited to its performance and efficiency. Because each chiplet is smaller and more specialized, its design complexity is reduced and its yield is improved, and assembling the SiP from known good dies (KGDs) further improves system yield. In the envisioned chiplet ecosystem, verified chiplets from different vendors can be procured and reused to build diverse systems, greatly easing the rapid development of new systems.
Figure 1 shows a SiP solution for a multifunctional accelerator that integrates an FPGA chiplet, a DSP accelerator chiplet, and possibly additional chiplets such as analog-to-digital converters (ADC) or optical transceivers. This heterogeneous SiP design can flexibly map various dynamic DSP workloads—from machine learning to communication signal processing—onto it. The FPGA chiplet provides the necessary adaptability, the DSP chiplet contributes efficient computational power, while the additional chiplets provide connections to front-end (FE) components such as sensors, wireless, or optical interfaces. Within the SiP, the inter-chip interfaces between chiplets are crucial for data transmission, and they must provide sufficiently high bandwidth to ensure performance comparable to monolithic SoCs while maintaining low energy consumption per bit, ensuring the competitiveness of the entire solution.

Figure 1 Arvon SiP achieves flexible workload mapping through heterogeneous integration of FPGA, DSP, and FE chiplets
Recent studies have demonstrated the integration of chiplets in SiPs with high-bandwidth, efficient die-to-die interfaces[5], [6], [7], [8], [9], [10], [11]. In [5], two dual-Arm-core chiplets are integrated on chip-on-wafer-on-substrate (CoWoS) with an 8-Gb/s/pin low-voltage in-package interconnect (LIPINCON) interface. In [6], 36 deep neural network (DNN) accelerator chiplets are integrated on an organic substrate using a 25-Gb/s/pin ground-referenced signaling (GRS) interface[7]. In [8] and [9], four runtime-reconfigurable universal digital signal processors (UDSPs) are integrated on a silicon interconnect fabric (Si-IF) with a 1.1-Gb/s/pin SNR-10 interface. IntAct[10] integrates six 16-core chiplets on an active silicon interposer using a 1.2-Gb/s/pin 3-D-Plug interface. These results represent typical applications of homogeneous integration, effectively scaling computational systems by stitching together multiple instances of modular chiplets.
In Arvon, we demonstrate heterogeneous integration of different types of chiplets to build a multifunctional accelerator for DSP workloads. Arvon consists of a 14-nm FPGA chiplet and two 22-nm DSP chiplets integrated with Embedded Multi-die Interconnect Bridge (EMIB) technology[12], [13]. We prototyped the first and second generations of the open Advanced Interface Bus (AIB) inter-chip interface, referred to as AIB 1.0 and AIB 2.0, to connect these chiplets. The result is a SiP that can effectively accelerate a variety of machine learning and communication DSP workloads while maintaining high hardware utilization. This work also showcases the AIB 2.0 interface, which achieves a shoreline bandwidth density of 1 Tb/s/mm at an I/O energy of 0.1 pJ/b.
The remainder of this paper is organized as follows: Section II provides an overview of the Arvon SiP. Section III details the design of the AIB interface, including physical layer (PHY) I/O, clock distribution, and bus adaptation. Section IV delves into the details of the DSP chiplets and their vector engine design. Section V discusses the mapping of various workloads. Section VI introduces chip measurements and system evaluation. Finally, Section VII concludes the paper.
II. Arvon System Overview
The overview of the Arvon system is shown in Figure 2. The system includes one FPGA chiplet and two DSP chiplet instances, named DSP1 and DSP2. DSP2 is a physically rotated copy of DSP1. The FPGA connects to DSP1 through the AIB 1.0 interface over an EMIB, while DSP1 connects to DSP2 through the AIB 2.0 interface over another EMIB. Arvon supports three operational modes, as shown in Figure 2. In modes 1 and 2, the FPGA connects to either DSP1 or DSP2 and offloads common compute kernels onto the DSPs; these kernels include matrix-matrix multiplication (MMM) and two-dimensional convolution (conv), which are critical in neural network (NN) and communication workloads. In mode 3, DSP1 and DSP2 are combined for higher computational capability. DSP2 can also be replaced by an FE chiplet (e.g., an optical tile or an ADC tile) to realize a complete communication or sensing system.

Figure 2 Data flow modes supported by Arvon SiP: In modes 1 and 2, the FPGA connects to one of the DSPs; in mode 3, the FPGA connects to both DSPs simultaneously
A. DSP Chiplet
The DSP chiplet provides offloading and acceleration for compute-intensive workloads. Its design is shown in Figure 2, with inter-chip interfaces placed on both sides of the chiplet. The west side has 24 AIB 1.0 channels, providing 1.536 Tb/s of bandwidth for communication with the FPGA; the east side has 24 AIB 2.0 channels, providing 7.68 Tb/s of bandwidth for communication with another DSP. The chiplet contains three DSP clusters, each providing 1024 16-bit half-precision floating-point processing elements (PEs). Each cluster can use up to eight AIB 1.0 channels and eight AIB 2.0 channels for input and output. A low-jitter ring-oscillator-based phase-locked loop (PLL) generates the clocks for the DSP clusters and the AIB 1.0 and AIB 2.0 interfaces. Two rows of general-purpose input/output (GPIO) ports along the top and bottom of the chiplet are used for global configuration and debugging.
B. FPGA Host Chiplet
The FPGA plays a key role in realizing the flexibility of Arvon. Its programmable logic supports various tasks, such as data transposition and shuffling for the DSPs, and can provide special functions not available on the DSPs, thereby meeting end-to-end processing requirements.
In Arvon, the FPGA acts as the host, implemented as an instruction-based processor as shown in Figure 3. The simple host processor is equipped with an instruction memory, a data memory for input/output and weight data, and a direct memory access (DMA) unit that manages and coordinates data transfers with the DSP chiplets. Instructions configure and reconfigure the DSPs at runtime, direct the data flow between the data memory and the DSPs, and perform preprocessing and postprocessing for the DSPs.
Execution of a workload begins when the host processor within the FPGA is triggered and reads the first instruction from the instruction memory. The instructions specify all the required information, including the data content, register access addresses, memory addresses, bus addresses, DMA read/write lengths, and the execution order. Based on the instructions, the host processor generates AXI bus transactions to access the DSP configuration registers and issues DMA commands to read and write data between the data memory and the DSPs. Because the vector engines in the DSPs process data quickly, the FPGA host is kept highly utilized to minimize latency and avoid becoming a bottleneck.
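To make this control flow concrete, the following is a minimal Python sketch of the instruction-driven host loop described above; the instruction types, fields, and bus/DMA callbacks are illustrative assumptions rather than the actual FPGA host ISA or register map.

```python
# Minimal sketch of the instruction-driven host flow described above.
# Instruction names, fields, and bus helpers are illustrative assumptions,
# not the actual FPGA host ISA or register map.
from dataclasses import dataclass
from typing import List

@dataclass
class ConfigWrite:          # program a DSP configuration register over AXI
    bus_addr: int
    value: int

@dataclass
class DmaTransfer:          # move a block between data memory and a DSP
    mem_addr: int
    bus_addr: int
    length: int
    to_dsp: bool            # True: data memory -> DSP, False: DSP -> data memory

def run_host(instructions: List[object], axi_write, dma_issue) -> None:
    """Walk the instruction stream in order, issuing AXI configuration
    writes and DMA commands exactly as each instruction dictates."""
    for instr in instructions:
        if isinstance(instr, ConfigWrite):
            axi_write(instr.bus_addr, instr.value)
        elif isinstance(instr, DmaTransfer):
            dma_issue(instr.mem_addr, instr.bus_addr, instr.length, instr.to_dsp)
```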

Figure 3 Example of FPGA host implementation
III. AIB Inter-Chip Interface
Within the DSP chiplet, the west side integrates 24 AIB 1.0 channels[14], and the east side integrates 24 AIB 2.0 channels[14]. Each AIB channel consists of two layers: an adapter layer and a physical-layer (PHY) I/O layer. The adapter coordinates data transmission between the DSP core and the PHY I/O, handling data framing and synchronization between the two domains. A state machine initializes the AIB link and enables automatic clock phase adjustment, which determines the data eye width and center. In AIB 2.0, the adapter also supports optional Data Bus Inversion (DBI), which reduces bus switching activity and improves energy efficiency.
The AIB PHY implements source-synchronous, short-reach, low-latency, parallel single-ended I/O. In double data rate (DDR) mode, each AIB 1.0 I/O supports data rates from 1 Mb/s to 2 Gb/s using full-swing signaling. AIB 2.0 extends the range to 1 Mb/s to 4 Gb/s in DDR mode with signal swings from 0.4 V up to full swing, significantly raising the data transfer rate. A single AIB 1.0 channel consists of 96 pins, including 2 TX clock pins, 2 RX clock pins, 20 TX data pins, 20 RX data pins, and additional pins for sideband control and redundancy. A single AIB 2.0 channel consists of 102 pins, including 2 TX clock pins, 2 RX clock pins, 40 TX data pins, 40 RX data pins, and additional pins for sideband control and redundancy. AIB 2.0 thus doubles both the per-pin data rate and the number of data pins per channel relative to AIB 1.0, quadrupling the channel bandwidth, and it further improves energy efficiency by using low-swing signaling. A comparison between AIB 1.0 and AIB 2.0 is summarized in Table I; the two interfaces share a similar design structure.
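As a quick arithmetic cross-check of these figures, the snippet below recomputes the per-channel and aggregate bandwidths from the pin counts and data rates quoted above; the 1.536-Tb/s AIB 1.0 aggregate is assumed to correspond to the FPGA-limited 1.6-Gb/s/pin rate (800 MHz DDR) reported in Section VI rather than the 2-Gb/s/pin maximum.

```python
# Cross-check of the bandwidth figures implied by the pin counts above.
# AIB 2.0: 40 TX + 40 RX data pins per channel at 4 Gb/s/pin, 24 channels.
aib2_per_channel = (40 + 40) * 4            # Gb/s
aib2_total = aib2_per_channel * 24          # Gb/s
print(aib2_per_channel, aib2_total)         # 320 Gb/s per channel, 7680 Gb/s = 7.68 Tb/s

# AIB 1.0: 20 TX + 20 RX data pins per channel; the 1.536-Tb/s aggregate quoted
# earlier is consistent with 1.6 Gb/s/pin (800 MHz DDR, the FPGA-limited clock
# reported in Section VI) rather than the 2-Gb/s/pin maximum.
aib1_total = (20 + 20) * 1.6 * 24           # Gb/s
print(aib1_total)                           # 1536 Gb/s = 1.536 Tb/s
```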
Table I Comparison between AIB 1.0 and AIB 2.0

A. AIB 2.0 Adapter
The AIB adapter manages data transmission between the DSP core and the PHY I/O layer. The data path includes a serializer at the TX end and a deserializer at the RX end. Figure 4 illustrates an example data transfer. In chiplet 1, an AIB 2.0 TX channel collects four 80-bit-wide data streams from the DSP core at a 500-MHz clock. A serializer built from two levels of 2:1 multiplexers converts the four parallel streams into a single 80-bit-wide stream for transmission. After optional DBI, the 80-bit data is split into two 40-bit segments, which are sent to the data0 and data1 pins of the 40 TX I/O units, respectively. In DDR mode, these TX I/O units operate at 2 GHz and transmit 2 bits per unit per clock cycle, achieving an effective data rate of 4 Gb/s per pin. The differential 2-GHz TX clock is forwarded along with the data to chiplet 2, where an AIB 2.0 RX channel receives the 80-bit-wide data from the 40 RX I/O units, sampled at 2 GHz in DDR mode. The received stream is then deserialized by two levels of 1:2 demultiplexers back into four 80-bit-wide streams. The clock forwarded from the TX is fine-tuned by an adjustable delay line to serve as the sampling clock of the RX I/O units.
1) Automatic Clock Phase Adjustment: During link initialization, the RX clock phase is adjusted to sample the RX data at the optimal point. The adapter employs an automatic RX clock phase adjustment mechanism: the TX sends a known pseudo-random binary sequence (PRBS), while the RX sweeps the delay of the forwarded clock using a configurable delay line and monitors for errors. By analyzing the error pattern of the received PRBS, the RX estimates the boundaries of the data eye and sets the delay so that the sampling point sits at the estimated midpoint of the eye.
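A minimal sketch of this delay-scan procedure is given below, assuming a delay line with discrete taps and a callback that reports PRBS errors at each tap; both are illustrative stand-ins for the hardware state machine.

```python
# Minimal sketch of the automatic RX clock phase adjustment described above.
# `errors_at_tap(tap)` is an assumed callback that returns the PRBS error count
# observed with the RX delay line set to that tap; the tap count is illustrative.
def pick_sampling_tap(errors_at_tap, num_taps: int = 64) -> int:
    """Scan every delay tap, find the widest error-free window (the eye),
    and return the tap at its midpoint."""
    clean = [t for t in range(num_taps) if errors_at_tap(t) == 0]
    if not clean:
        raise RuntimeError("no error-free sampling point found")
    # Group consecutive clean taps into windows and keep the widest one.
    best_start, best_len, start, length = clean[0], 1, clean[0], 1
    for prev, cur in zip(clean, clean[1:]):
        if cur == prev + 1:
            length += 1
        else:
            start, length = cur, 1
        if length > best_len:
            best_start, best_len = start, length
    return best_start + best_len // 2        # midpoint of the widest eye
```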
2) Data Bus Inversion (DBI): AIB 2.0 supports data bus inversion, which reduces toggling and simultaneous switching output (SSO) noise in single-ended, source-synchronous interfaces. Figure 5 shows the 1:19 DBI encoder and decoder (one DBI bit per 19 data bits). At the TX end, the 80-bit data is encoded by four parallel DBI encoding units. Each unit takes 19 data lines (icurr[18:0] in Figure 5) and counts how many bits differ from the previously encoded data (iprev[18:0]). If the count exceeds 10 (half of 20 bits), the unit inverts the data and drives the DBI bit high. If the count equals 10 and the previous DBI bit is already high, the DBI bit remains high. Otherwise, the data is passed through unchanged and the DBI bit is set low. The DBI bit is combined with the encoded 19-bit data, packed into a 20-bit TX word, and sent to the 20 I/O units. At the RX end, four parallel DBI decoding units are used: if the DBI bit (the most significant bit of the received 20-bit block) is high, the unit inverts the received 19 data bits; otherwise, the data is left unchanged.
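The snippet below transcribes this encoding rule directly, modeling the 19-bit payload as an integer; the packing into I/O units and the exact hardware implementation are not modeled.

```python
# Sketch of one DBI encode/decode unit, transcribing the rule described above.
# Data is modeled as 19-bit integers; packing into I/O units is not modeled.
MASK19 = (1 << 19) - 1

def dbi_encode(icurr: int, iprev: int, prev_dbi: int):
    """Return (19-bit payload, DBI bit) for the current word, given the
    previously transmitted payload and DBI bit."""
    toggles = bin((icurr ^ iprev) & MASK19).count("1")
    if toggles > 10 or (toggles == 10 and prev_dbi == 1):
        return (~icurr) & MASK19, 1          # invert payload, DBI high
    return icurr & MASK19, 0                 # pass through, DBI low

def dbi_decode(payload: int, dbi: int) -> int:
    """Undo the optional inversion using the received DBI bit."""
    return (~payload) & MASK19 if dbi else payload & MASK19

# For the next word, feed the returned payload back as iprev and the returned
# DBI bit as prev_dbi.
p, d = dbi_encode(0x7FFFF, 0x00000, 0)       # 19 toggles -> invert
assert d == 1 and dbi_decode(p, d) == 0x7FFFF
```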

Figure 4 AIB 2.0 channel top-level diagram and automatic clock phase adjustment

Figure 5 1:19 ratio DBI encoder (top) and decoder (bottom)
B. AIB 2.0 I/O
Figure 6 shows the schematic and layout of the compact unified AIB 2.0 I/O unit. To meet the 36-µm micro-bump pitch target, the layout of the I/O unit is carefully optimized so that each unit fits under its corresponding micro-bump. The unified I/O unit supports multiple modes. First, the transmission direction can be set to TX or RX, which aids redundancy repair and allows flexible connections between chiplets; in TX mode the clock of the RX portion is gated to save power, and in RX mode the clock of the TX portion is gated. Second, the I/O signal swing can be set to full swing for AIB 1.0 and AIB 2.0, and can additionally be reduced to 0.4 V for AIB 2.0. Third, the transfer mode can be set to single data rate (SDR) or double data rate (DDR). In DDR mode, data0 and data1 are serialized for transmission, with data1 delayed by half a clock cycle relative to data0: data0 is sent to the driver on the rising edge of the TX clock and data1 on the falling edge, and the RX reverses this process to recover the two streams. SDR mode uses only data0, which is sent to the driver on the rising edge of the TX clock. Finally, the I/O unit can be set to asynchronous operation for the clock and other sideband signals.
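The DDR interleaving described above can be modeled in a few lines; this sketch only captures the bit ordering (data0 on the rising edge, data1 on the falling edge), not timing or the SDR/asynchronous modes.

```python
# Tiny model of the DDR behavior described above: data0 is launched on the
# rising edge and data1 on the falling edge, so the wire carries them
# interleaved; the RX deinterleaves them back. SDR would send data0 only.
def ddr_serialize(data0: list, data1: list) -> list:
    assert len(data0) == len(data1)
    wire = []
    for d0, d1 in zip(data0, data1):
        wire += [d0, d1]                    # rising-edge bit, then falling-edge bit
    return wire

def ddr_deserialize(wire: list):
    return wire[0::2], wire[1::2]           # recover the data0 and data1 streams

assert ddr_deserialize(ddr_serialize([1, 0, 1], [0, 0, 1])) == ([1, 0, 1], [0, 0, 1])
```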

Figure 6 Schematic and layout of a unified AIB I/O unit

Figure 7 CMOS-based TX driver (left) and StrongARM-based RX (right) schematics
1) TX Driver: As shown in Figure 7 (left), the TX driver uses a segmented design with four slices. Up to four driver slices can be enabled to adjust the drive strength, allowing flexible trade-offs between transmission speed and power across channel variations. Each slice includes an NMOS pull-down transistor and a switchable PMOS or NMOS pull-up driver, providing full-swing or low-swing drive as needed. In low-swing mode, the NMOS pull-up is moderately strengthened to balance the pull-down drive strength. Weak pull-ups and pull-downs can also be configured to set initial power-up values.
2) RX Buffer: The RX buffer distinguishes between full-swing and low-swing inputs. Full-swing input signals use standard-cell buffers; low-swing input signals use a regenerative comparator, as shown in Figure 7 (right). The comparator is an optimized version of the StrongARM latch[15], [16] that reduces the average offset to 4.1 mV without calibration, and it uses PMOS input devices to better detect low-swing inputs. A simple reference voltage generator is included. The comparator reliably detects inputs as low as 0.38 V at 2 GHz DDR.
3) Bump Diagram: Figure 8 shows the bump diagram of an AIB 2.0 channel with a 12×17 bump layout. The channel comprises 40 TX data pins, 40 RX data pins, 2 TX forwarded-clock pins, 2 RX forwarded-clock pins, and 18 sideband and redundancy pins. The TX and RX bump placements are symmetric, so the routing lengths of every TX-RX pair on the EMIB are equal. With 80 data pins in total, each running at 4 Gb/s, an AIB 2.0 channel provides 320 Gb/s of aggregate bandwidth. With a 36-µm micro-bump pitch and a channel shoreline width of 312.08 µm, the shoreline bandwidth density is 1024 Gb/s/mm.
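A one-line arithmetic check of the quoted shoreline density follows; the small difference from the stated 1024 Gb/s/mm is rounding.

```python
# Cross-check of the per-channel shoreline bandwidth density quoted above.
data_pins = 40 + 40                 # TX + RX data pins in one AIB 2.0 channel
rate_gbps = 4                       # Gb/s per pin (2 GHz DDR)
shoreline_mm = 0.31208              # channel shoreline width, 312.08 um
print(round(data_pins * rate_gbps / shoreline_mm))   # ~1025 Gb/s/mm, i.e. about 1.024 Tb/s/mm
```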

Figure 8 Bump diagram of an AIB 2.0 channel

Figure 9 Two-level clock distribution
C. Clock Distribution
For a high-speed parallel I/O interface like AIB, low-skew clock distribution is essential so that all data pins in a channel remain phase-aligned. As shown in Figure 9, each AIB channel adopts a two-level clock distribution: the upper level is a balanced H-tree covering the entire channel, and the lower level is a local clock grid. This two-level design limits the depth of the H-tree, keeping its branches well balanced, while the local clock grid provides more robust clock sinks without significantly increasing power. The resulting clock network keeps the worst-case skew within 8 ps. Both the H-tree and the grid are created and evaluated using the multi-source clock tree synthesis (MSCTS) flow of IC Compiler II.
IV. DSP Clusters
As shown in Figure 10, each DSP cluster includes a flexible vector engine, a bypass buffer, a rotation block for data framing, two AXI-compatible bus converters for packing and unpacking data across multiple AIB channels, and an AXI-compatible system bus. A bus hub establishes connections between the vector engine and the tester, the AIB 1.0 interface, or the AIB 2.0 interface. The bypass buffer supports mode 2 operation of Arvon, allowing the FPGA to connect directly to DSP2 while bypassing DSP1: AIB 1.0 transactions from the FPGA are forwarded directly as AIB 2.0 transactions to DSP2. The rotation block reverses the channel index order of the AIB interface. For example, when DSP1 connects to DSP2 (the rotated version of DSP1), channels 1-8 of DSP1 connect to channels 24-17 of DSP2, so the rotation block of DSP2 must reverse the connection order.
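The channel-index reversal can be written down directly from the example above (channels 1-8 on DSP1 meeting channels 24-17 on DSP2); the simple mirror formula below is an inference from that example.

```python
# Sketch of the channel-index reversal performed by the rotation block:
# channel i on DSP1 meets channel (25 - i) on the rotated DSP2 (1-based indices).
NUM_CHANNELS = 24

def rotated_channel(i: int) -> int:
    return NUM_CHANNELS + 1 - i

assert [rotated_channel(i) for i in range(1, 9)] == [24, 23, 22, 21, 20, 19, 18, 17]
```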

Figure 10 DSP cluster (top) and vector engine (bottom)
A. Vector Engine
The core component of the DSP cluster is the vector engine, which consists of four 2-D systolic array instances[17]. Each systolic array contains 256 PEs, and each PE performs multiplication in half-precision floating-point (FP16) format. The 256 PEs are divided into eight units of 32 PEs each, and the partial sums of the 32-PE units feed a configurable adder tree. By selecting which of the eight units to sum together, the configurable adder tree flexibly supports various workload mappings. This design provides shorter partial-sum accumulation paths and achieves higher utilization through concurrent workloads, distinguishing it from classical systolic arrays. The entire vector engine provides 1024 PEs to support matrix-matrix multiplication (MMM) and convolution (conv). Finally, a global I/O buffer and scheduler distribute inputs to the PE arrays using multicast or round-robin arbitration.
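The following functional model illustrates the configurable adder tree on one 256-PE array, treating it as eight per-unit partial sums that can be grouped arbitrarily; the grouping representation is illustrative, not the hardware configuration format.

```python
import numpy as np

# Functional model of one 256-PE array as eight 32-PE units feeding a
# configurable adder tree. `partitions` lists which units are summed together,
# e.g. [[0,1,2,3,4,5,6,7]] for one workload or [[0,1],[2,3],[4,5],[6,7]] for four.
def array_outputs(unit_sums: np.ndarray, partitions: list) -> list:
    """unit_sums: shape (8,) per-unit partial sums for one cycle."""
    assert unit_sums.shape == (8,)
    return [float(unit_sums[p].sum()) for p in partitions]

sums = np.arange(8, dtype=float)                              # stand-in per-unit results
print(array_outputs(sums, [[0, 1, 2, 3, 4, 5, 6, 7]]))        # one workload -> one output
print(array_outputs(sums, [[0, 1], [2, 3], [4, 5], [6, 7]]))  # four concurrent workloads
```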
Configured by instructions, the vector engine supports continuous computation on streaming inputs and offers high mapping flexibility. The four systolic arrays can be mapped independently, and the 256 PEs within each array can be configured in units of 32 PEs, accommodating one to eight independent workloads.
B. System Bus and Bus Converter
The AIB connections are abstracted by an AXI-compatible point-to-point system bus. The bus converter handles the packing and unpacking of data across multiple AIB channels and supports burst mode to maximize streaming bandwidth. The channels and signals of the system bus are illustrated in Figure 11. The system bus consists of four channels: read command, write command, read data, and write data. A master issues a read or write command with a 32-bit address and a 6-bit burst length; write commands are accompanied by 512-bit write data. In response to a read command, the slave returns 512-bit read data to the master. The conversion between the system bus and the AIB channels is performed by the bus converter, which adopts a header-based streaming approach to achieve high bandwidth and low latency. A vector engine can use up to eight AIB channels to maintain high utilization, and each AIB channel can be flexibly configured as a master or a slave, allowing the TX/RX bandwidth to be adjusted as needed.
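For illustration, the sketch below captures the transaction fields listed above (32-bit address, 6-bit burst length, 512-bit data beats); the field names and dataclass framing are assumptions, not the actual RTL interface.

```python
from dataclasses import dataclass

# Sketch of the system-bus transactions described above; field names are
# illustrative, widths follow the text (32-bit address, 6-bit burst length,
# 512-bit data beats).
@dataclass
class Command:
    address: int        # 32-bit target address
    burst_len: int      # 6-bit burst length (number of 512-bit beats)
    is_write: bool

@dataclass
class DataBeat:
    payload: bytes      # 512 bits = 64 bytes per beat

def check(cmd: Command) -> None:
    assert 0 <= cmd.address < 2**32 and 0 <= cmd.burst_len < 2**6

check(Command(address=0x1000_0000, burst_len=4, is_write=True))
```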

Figure 11 AXI-compatible system bus: bus converter (left) and bus interface channels and signals (right)
V. Workload Mapping
As a multifunctional computing platform, Arvon can support computing tasks of varying scales, and the complexity of these tasks can be dynamically adjusted as needed during operation. To ensure efficient data processing, a systematic approach must be established to map workloads to the optimal hardware configuration and data layout.
To achieve this goal, we developed a compiler, as shown in Figure 12. A workload is first partitioned into conv kernels, MMM kernels, or both, which can be mapped directly onto the Arvon DSPs through appropriate configurations; intermediate steps between compute kernels can be executed by the FPGA host. Specifically, a conv kernel is configured by the filter size (R × S × C) and the input size, while an MMM kernel is configured by the matrix sizes. The conv and MMM kernels are then scheduled and allocated to the vector engines of the Arvon DSPs by generating the corresponding instructions and memory data configurations. This allocation considers several key factors, including resource utilization, data reuse, and end-to-end latency.
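A high-level sketch of this flow is shown below: each kernel is described by the number of weight rows it pins onto PEs and is placed onto 32-PE units with a greedy first-fit policy. The data structures and the placement policy are illustrative only, not the actual compiler.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Kernel:
    name: str
    kind: str                 # "mmm" or "conv"
    weight_rows: int          # rows to pin onto PEs (static-weight scheme)

def schedule(kernels: List[Kernel], num_arrays: int = 4, units_per_array: int = 8,
             pes_per_unit: int = 32) -> Dict[str, Tuple[int, int]]:
    """Greedy first-fit of kernels onto 32-PE units within one vector engine;
    returns kernel name -> (array index, first unit index)."""
    free = [units_per_array] * num_arrays     # free 32-PE units left in each array
    placement = {}
    for k in kernels:
        need = math.ceil(k.weight_rows / pes_per_unit)
        for a in range(num_arrays):
            if free[a] >= need:
                placement[k.name] = (a, units_per_array - free[a])
                free[a] -= need
                break
        else:
            raise RuntimeError(f"{k.name} does not fit in this cluster")
    return placement

print(schedule([Kernel("fc1", "mmm", 64), Kernel("conv3x3", "conv", 144)]))
```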

Figure 12 Compilation process for workload mapping
The vector engine adopts a static-weight scheme, with the kernel weights pinned onto the PEs. To map MMM onto the vector engine[17], each row of the weight matrix is allocated to PEs, effectively distributing a one-dimensional vector across the two-dimensional array. Rows of the same weight matrix can be allocated to the same group of PEs. In multi-tenant scenarios involving multiple kernels, rows of different weight matrices can be allocated to different partitions, denoted p1–p8 in Figures 12 and 13. The partition outputs are routed to the corresponding inputs of the configurable adder tree so that separate sums are produced as outputs.
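The following check illustrates this static-weight view in simplified form, assuming each 32-element weight row is spread over one 32-PE unit (one multiply per PE) and the adder tree sums each unit independently; the physical dataflow and multi-partition bookkeeping are abstracted away.

```python
import numpy as np

# Simplified functional view of the static-weight MMM mapping: each weight row
# is spread over one 32-PE unit, the input activations stream through, and the
# adder tree sums each unit independently, reproducing the matrix-vector product.
# The hardware uses FP16; float32 is used here for an exact check.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 32)).astype(np.float32)   # 8 rows -> eight 32-PE units
x = rng.standard_normal(32).astype(np.float32)        # streamed input activations

pe_products = W * x                                   # one multiply per PE
unit_sums = pe_products.sum(axis=1)                   # adder tree, one sum per unit
assert np.allclose(unit_sums, W @ x)                  # matches W @ x
```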
The weight mapping for conv is similar to that of multi-tenant MMM, since it may involve multiple convolution kernels. Figure 13 illustrates mapping examples for two convolution cases. Each convolution kernel of size R × S × C is unfolded onto the two-dimensional PE array by interleaving its three-dimensional slices in two dimensions, and the three-dimensional input activations under the sliding convolution window are unfolded onto the PE array in the same way. Input activations are retained within the PE array and reused horizontally and/or vertically through systolic data forwarding between adjacent PEs. With a single convolution kernel (the first example in Figure 13), the mapping can ignore partition boundaries and achieves high efficiency; with multiple convolution kernels (the second example in Figure 13), each kernel must be aligned to partition boundaries, which reduces utilization.
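An im2col-style flattening makes this unfolding concrete: the R × S × C weights and the activations under each sliding window become vectors, so every output reduces to a dot product of the same form the MMM mapping handles. This formulation is an interpretation of the description above, not necessarily the exact on-chip dataflow.

```python
import numpy as np

# im2col-style view of the conv mapping: unrolling the RxSxC kernel and the
# activations under each sliding window into vectors turns every output pixel
# into a dot product (one adder-tree reduction).
def conv_as_dot_products(act, weight):
    C, H, Wd = act.shape
    Ck, R, S = weight.shape
    assert C == Ck
    w_vec = weight.reshape(-1)                           # RxSxC weights pinned onto PEs
    out = np.empty((H - R + 1, Wd - S + 1), dtype=act.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = act[:, i:i + R, j:j + S].reshape(-1)   # activations under the window
            out[i, j] = window @ w_vec
    return out

act = np.random.default_rng(1).standard_normal((3, 6, 6)).astype(np.float32)
wgt = np.random.default_rng(2).standard_normal((3, 3, 3)).astype(np.float32)
print(conv_as_dot_products(act, wgt).shape)              # (4, 4) output feature map
```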

Figure 13 Mapping examples for different kernel sizes
VI. Chip Measurements and Comparisons
The DSP chiplet is manufactured in 22-nm FinFET technology with an area of 32.3 mm², as shown in Figure 14. To construct the Arvon SiP, a 14-nm FPGA chiplet and two DSP chiplets are packaged and interconnected with two ten-layer EMIBs using 36-µm-pitch micro-bumps. The average wire length is 1.5 mm on the AIB 1.0 side and 0.85 mm on the AIB 2.0 side.
At room temperature and a 0.85-V core voltage, each DSP cluster runs at a maximum frequency of 675 MHz and consumes 0.76 W. In this configuration, the peak performance of the DSP chiplet is 4.14 TFLOPS at an energy efficiency of 1.8 TFLOPS/W. At a 0.85-V I/O voltage and an 800-MHz clock (limited by the FPGA clock frequency), the AIB 1.0 I/O consumes 0.44 pJ/b, or 0.85 pJ/b including the adapter, with a transmission latency of 3.75 ns. At room temperature, a 0.4-V I/O voltage, and a 2-GHz clock, the AIB 2.0 I/O consumes 0.10 pJ/b, or 0.46 pJ/b including the adapter, with a transmission latency of 1.5 ns. The energy breakdown of the AIB 2.0 interface is shown in Figure 15: the adapter dominates at 0.32 pJ/b, about 69% of the total, while the I/O units consume only 0.10 pJ/b, about 22% of the total, thanks to the low 0.4-V signal swing.
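As a quick consistency check of the quoted breakdown:

```python
# The two quoted shares of the 0.46 pJ/b AIB 2.0 energy (adapter and I/O units).
total, adapter, io = 0.46, 0.32, 0.10
print(f"{100 * adapter / total:.1f}% adapter, {100 * io / total:.1f}% I/O")  # ~69.6% and ~21.7%
```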

Figure 14 Test device, Arvon multi-chip package, and microphotograph of DSP chiplet

Figure 15 Energy consumption breakdown of AIB 2.0 interface

Figure 16 Relationship between energy efficiency and areal bandwidth density (left) and shoreline bandwidth density (right)
Table II compares Arvon’s AIB I/O with state-of-the-art SiP I/O interfaces. Like AIB, SNR-10[8], 3-D-Plug[10], and LIPINCON[5] are parallel I/O interfaces. Among them, LIPINCON has the highest data rate of 8 Gb/s/pin and the lowest I/O energy of only 0.073 pJ/b at a 0.3-V signal swing; 3-D-Plug has the highest shoreline bandwidth density of 900 Gb/s/mm; and SNR-10 has the smallest I/O size of only 137 µm². GRS[7] is a high-speed serial I/O interface that provides 25 Gb/s/pin at 1.17 pJ/b. Our AIB 2.0 prototype offers an attractive solution, with an I/O energy of only 0.10 pJ/b, or 0.46 pJ/b including the adapter. As shown in Table II, it also achieves the highest bandwidth densities of 1.0 Tb/s/mm (shoreline) and 1.7 Tb/s/mm² (areal). Figure 16 compares the energy efficiency, areal bandwidth density, and shoreline bandwidth density of inter-chip interfaces. Compared with the GRS interface, the AIB 2.0 interface improves energy efficiency, areal bandwidth density, and shoreline bandwidth density by 2.5, 3.4, and 5 times, respectively, outperforming the other interfaces.
We demonstrate various workload mappings on Arvon, including deep neural networks (DNNs), multiple-input multiple-output (MIMO) signal processing, and image filtering. The workload sizes, overall throughput, and utilization are summarized in Table III. In addition to commonly used DNN models, the 128×16 MIMO detection workload uses 128 receive antennas to detect 16 single-antenna users. Its processing includes minimum mean square error (MMSE) filtering, which requires MMM to compute the filtering matrix and further MMM to apply it. MMM kernels of sizes 32×256, 256×32, 32×32, and 32×1 are needed to complete this workload; these kernels map efficiently onto the PE arrays, achieving 100% utilization. The image filtering workload applies 16 5×5 filters and 16 3×3 filters to 1280×720 image frames using conv kernels; because the filters are small, its utilization is lower than that of the other workloads. These example workloads show that Arvon’s heterogeneous SiP architecture provides flexibility, performance, and efficiency for NN and communication processing.
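The listed kernel sizes are consistent with expanding the complex 128×16 channel matrix into its 256×32 real-valued equivalent (16 complex users give 32 real dimensions, 128 complex antennas give 256). The sketch below shows that decomposition, with the matrix inversion left to a generic solver since the text does not specify where it runs; the real-valued expansion is an inference from the listed sizes, not stated explicitly in the paper.

```python
import numpy as np

# Sketch of the 128x16 MIMO MMSE workload in terms of the listed MMM kernel
# sizes, assuming the complex channel H (128x16) is expanded into its
# real-valued equivalent (256x32). That expansion is an interpretation of the
# listed sizes (32x256, 256x32, 32x32, 32x1).
def real_expand(H):
    return np.block([[H.real, -H.imag], [H.imag, H.real]])   # 256 x 32

rng = np.random.default_rng(0)
H = rng.standard_normal((128, 16)) + 1j * rng.standard_normal((128, 16))
y = rng.standard_normal(256)                    # stand-in received vector (stacked re/im)
sigma2 = 0.1                                    # assumed noise variance

Hr = real_expand(H)                             # 256 x 32
G = Hr.T @ Hr + sigma2 * np.eye(32)             # 32x256 times 256x32 -> 32x32 Gram matrix
W_mmse = np.linalg.solve(G, Hr.T)               # 32 x 256 MMSE filter
x_hat = W_mmse @ y                              # 32 x 1 filtered estimate
print(x_hat.shape)                              # (32,)
```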
Table II Comparison with state-of-the-art SiP I/O interfaces

Table III Workload mapping results: sizes, throughput, and utilization

VII. Conclusion
Arvon is a heterogeneous system-in-package (SiP) that integrates an FPGA chiplet and two DSP chiplets using Embedded Multi-die Interconnect Bridges (EMIBs). This integration allows Arvon to leverage not only the flexibility of the FPGA as a host but also the high computational performance and efficiency of the DSPs.
The main features of the SiP include parallel, short-reach AIB 1.0 and AIB 2.0 interfaces for seamless chiplet connections. The I/O unit design is compact, mostly digital, and synthesizable. The units are highly flexible, supporting multiple modes, and employ mode-dependent clock gating and a two-level clock distribution to enhance energy efficiency. We achieved a low-swing 4-Gb/s AIB 2.0 interface with an energy efficiency of 0.10 pJ/b, or 0.46 pJ/b including the adapter, while achieving a shoreline bandwidth density of 1.024 Tb/s/mm and an areal bandwidth density of 1.705 Tb/s/mm². The interface is abstracted by an AXI-compatible bus protocol, simplifying its use by the host and DSPs.
Each DSP chiplet in Arvon adopts a low-latency systolic array architecture with 3072 FP16 PEs, hierarchically organized into three clusters of four 256-PE systolic arrays, with each array further divided into eight 32-PE units. This fine-grained organization allows multiple workloads to execute concurrently. Each DSP chiplet provides a peak performance of 4.14 TFLOPS at an energy efficiency of 1.8 TFLOPS/W. We developed a systematic compilation flow to map workloads onto Arvon and demonstrated a variety of workloads that Arvon accelerates with competitive performance and utilization.
Acknowledgments
The views, opinions, and/or findings expressed are those of the authors and should not be construed as representing the official views or policies of the Department of Defense or the U.S. Government.
Thanks to Professor Huang Letian and student Chen Feiyang from the School of Integrated Circuit Science and Engineering at the University of Electronic Science and Technology of China for their assistance in translation and proofreading.
References
[1] W. Jiang, B. Han, M. A. Habibi, and H. D. Schotten, “The road towards 6G: A comprehensive survey,” IEEE Open J. Commun. Soc., vol. 2, pp. 334–366, 2021.
[2] H. Tataria, M. Shafi, A. F. Molisch, M. Dohler, H. Sjöland, and F. Tufvesson, “6G wireless systems: Vision, requirements, challenges, insights, and opportunities,” Proc. IEEE, vol. 109, no. 7, pp. 1166–1199, Jul. 2021.
[3] S. Bianco, R. Cadene, L. Celona, and P. Napoletano, “Benchmark analysis of representative deep neural network architectures,” IEEE Access, vol. 6, pp. 64270–64277, 2018.
[4] G. Menghani, “Efficient deep learning: A survey on making deep learning models smaller, faster, and better,” 2021, arXiv:2106.08962.
[5] M.-S. Lin et al., “A 7-nm 4-GHz Arm-core-based CoWoS chiplet design for high-performance computing,” IEEE J. Solid-State Circuits, vol. 55, no. 4, pp. 956–966, Apr. 2020.
[6] B. Zimmer et al., “A 0.32–128 TOPS, scalable multi-chip-module-based deep neural network inference accelerator with ground-referenced signaling in 16 nm,” IEEE J. Solid-State Circuits, vol. 55, no. 4, pp. 920–932, Apr. 2020.
[7] J. W. Poulton et al., “A 1.17-pJ/b, 25-Gb/s/pin ground-referenced single-ended serial link for off- and on-package communication using a process- and temperature-adaptive voltage regulator,” IEEE J. Solid-State Circuits, vol. 54, no. 1, pp. 43–54, Jan. 2019.
[8] U. Rathore, S. S. Nagi, S. Iyer, and D. Markovic, “A 16 nm 785 GMACs/J 784-core digital signal processor array with a multilayer switch box interconnect, assembled as a 2 × 2 dielet with 10 µm-pitch inter-dielet I/O for runtime multi-program reconfiguration,” in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), Feb. 2022, pp. 52–54.
[9] S. S. Nagi, U. Rathore, K. Sahoo, T. Ling, S. S. Iyer, and D. Markovic, “A 16-nm 784-core digital signal processor array, assembled as a 2 × 2 dielet with 10 µm pitch interdielet I/O for runtime multiprogram reconfiguration,” IEEE J. Solid-State Circuits, vol. 58, no. 1, pp. 111–123, Jan. 2023.
[10] P. Vivet et al., “IntAct: A 96-core processor with six chiplets 3D-stacked on an active interposer with distributed interconnects and integrated power management,” IEEE J. Solid-State Circuits, vol. 56, no. 1, pp. 79–97, Jan. 2021.
[11] W. Tang et al., “Arvon: A heterogeneous SiP integrating a 14 nm FPGA and two 22 nm 1.8 TFLOPS/W DSPs with 1.7 Tbps/mm2 AIB 2.0 interface to provide versatile workload acceleration,” in Proc. IEEE Symp. VLSI Technol. Circuits, Jun. 2023, pp. 1–2.
[12] R. Mahajan et al., “Embedded multi-die interconnect bridge (EMIB)—A high density, high bandwidth packaging interconnect,” in Proc. IEEE 66th Electron. Compon. Technol. Conf. (ECTC), May 2016, pp. 557–565.
[13] G. Duan, Y. Kanaoka, R. McRee, B. Nie, and R. Manepalli, “Die embedding challenges for EMIB advanced packaging technology,” in Proc. IEEE 71st Electron. Compon. Technol. Conf. (ECTC), Jun. 2021, pp. 1–7.
[14] AIB Specifications. Accessed: Dec. 13, 2023. [Online]. Available: https://github.com/chipsalliance/AIB-specification
[15] B. Razavi, “The StrongARM latch [A circuit for all seasons],” IEEE Solid-State Circuits Mag., vol. 7, no. 2, pp. 12–17, Jun. 2015.
[16] M. Miyahara, Y. Asada, D. Paik, and A. Matsuzawa, “A low-noise self-calibrating dynamic comparator for high-speed ADCs,” in Proc. IEEE Asian Solid-State Circuits Conf., Nov. 2008, pp. 269–272.
[17] S.-G. Cho, W. Tang, C. Liu, and Z. Zhang, “PETRA: A 22 nm 6.97 TFLOPS/W AIB-enabled configurable matrix and convolution accelerator integrated with an Intel Stratix 10 FPGA,” in Proc. Symp. VLSI Circuits, Jun. 2021, pp. 1–2.
[18] Intel Stratix 10 TX 2800 Specification. [Online]. Available: https://www.intel.com/content/www/us/en/products/sku/210283/intel-stratix10-tx-2800-fpga/specifications.html