AliMei Guide: X-Engine is the new-generation storage engine developed by the Group Database Division, and it is the foundation of the new-generation distributed database X-DB. In OLTP (online transaction processing) database storage engines, efficiently reclaiming old multi-version data has always been a challenge, especially in write-intensive applications, where transaction processing is inevitably disturbed by background tasks (compaction or vacuum). The idea of introducing heterogeneous computing devices to offload these tasks has been around for a long time, but actually putting it into practice is genuinely difficult.
Today we introduce the FPGA-accelerated X-Engine storage engine in detail. This article covers not only how to design and implement efficient FPGA logic, but also how to improve I/O, schedule mixed workloads, and ensure fault tolerance. The word “smooth” may sound calm, but huge waves hide beneath it.
Introduction
X-Engine is the new-generation storage engine developed by the Group Database Division and the foundation of the new-generation distributed database X-DB. To achieve the goal of 10 times MySQL's performance at 1/10 the storage cost, X-DB has adopted a software-hardware co-design approach from the very beginning, fully leveraging cutting-edge advances in both fields. Introducing FPGA acceleration is our first attempt at customized computing. The FPGA-accelerated version of X-DB has begun small-scale gray (canary) testing in production, and during this year's June 18 and Double 11 promotions, FPGAs will help X-DB meet Alibaba's ever-higher performance requirements for databases without increasing costs.
Background
As the world’s largest online trading platform, Alibaba’s OLTP (online transaction processing) database system needs to meet high throughput business requirements. According to statistics, the daily record write volume of the OLTP database system reaches tens of billions, and during the Double 11 in 2017, the peak throughput of the system reached tens of millions of TPS (transactions per second). Alibaba’s business database system has the following characteristics:
- High transaction throughput with low latency for both read and write operations;
- A relatively high proportion of write operations; traditional database workloads typically have read-write ratios above 10:1, while Alibaba's trading system reached a read-write ratio of 3:1 on Double 11;
- Concentrated data access hotspots; a newly written record receives 99% of its accesses within the next 7 days, and the probability of it being accessed after 7 days is extremely low.
To meet Alibaba's demanding requirements for both performance and cost, we redesigned and developed a storage engine called X-Engine. In X-Engine we introduced many cutting-edge database technologies, including an efficient memory index structure, an asynchronous pipelined write mechanism, and the optimistic concurrency control used in in-memory databases.
To achieve extreme write performance and facilitate the separation of hot and cold data for tiered storage, X-Engine draws on the design philosophy of LSM-Tree. It maintains multiple memtables in memory, and all newly written data will be appended to the memtable instead of directly replacing existing records. Since the amount of data to be stored is large, it is impossible to store all data in memory.
When the data in memory reaches a certain volume, it is flushed to persistent storage as SSTables. To reduce read latency, X-Engine periodically schedules compaction tasks that merge the key-value pairs of multiple SSTables in persistent storage, retaining only the latest version of each multi-versioned key (any version still referenced by an active transaction must also be retained).
Based on data access characteristics, X-Engine tiers the persistent data: more active data stays in higher layers, while relatively inactive (rarely accessed) data is merged with lower-level data and stored in the bottom layer. Bottom-layer data is stored in highly compressed form and migrated to larger, relatively inexpensive storage media (such as SATA HDDs), so that large volumes of data can be stored at low cost.
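To make the tiering concrete, below is a minimal illustrative sketch of a per-level placement policy. All names, thresholds, and media labels here are our own assumptions for illustration, not X-Engine internals.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-level placement policy: hot data stays in high layers on
// fast media; cold data flows down into compressed form on cheap media.
struct LevelPolicy {
    int level;            // 0 = hottest, larger = colder
    bool compressed;      // bottom layers are stored highly compressed
    std::string medium;   // e.g. fast SSD for hot layers, SATA HDD for cold
    uint64_t max_bytes;   // exceeding this triggers merging into lower layers
};

std::vector<LevelPolicy> MakeTierPolicies() {
    return {
        {0, false, "nvme-ssd", 1ULL << 30},    // L0: small, uncompressed
        {1, false, "nvme-ssd", 16ULL << 30},   // L1: warm data
        {2, true,  "sata-hdd", 256ULL << 30},  // bottom: compressed, cheap
    };
}
```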
This tiered storage brings about a new problem: the entire system must frequently perform compaction. The larger the write volume, the more frequent the compaction process. Compaction is a compare & merge process, which consumes a lot of CPU and storage I/O. In high-throughput write scenarios, a large number of compaction operations consume a lot of system resources, inevitably leading to a drastic drop in overall system performance, which has a huge impact on application systems.
The completely redesigned X-Engine, however, has excellent multi-core scalability and achieves very high performance: front-end transaction processing alone can consume nearly all CPU resources. Its resource utilization efficiency compared to InnoDB is shown in the figure below:
At this level of performance, the system has no spare computing resources for compaction; running compaction anyway exacts a heavy price in performance.
Tests have shown that in the write-only scenario of the DbBench benchmark, the system experiences periodic performance fluctuations. When compaction occurs, system performance drops by over 40%, and when compaction ends, system performance returns to normal levels. As shown in the figure below:
However, if compaction is not performed in a timely manner, the accumulation of multi-version data will severely affect read operations.
To solve the problem of compaction jitter, academia has proposed structures such as VT-tree, bLSM, PE, PCP, and dCompaction. Although these algorithms optimize compaction performance through different methods, the CPU resources consumed by compaction itself are unavoidable. According to related studies, when using SSD storage devices, the computational operations of compaction occupy 60% of the computational resources in the system. Therefore, regardless of what optimizations have been made at the software level for compaction, the performance jitter caused by compaction will always be the Achilles’ heel for all LSM-tree-based storage engines.
Fortunately, the emergence of dedicated hardware provides a new idea for solving the performance jitter caused by compaction. In fact, using dedicated hardware to solve the performance bottlenecks of traditional databases has become a trend. Currently, select and where operations in databases have been offloaded to FPGAs, and more complex operations such as group by have also undergone related research. However, the current FPGA acceleration solutions have the following shortcomings:
- Current acceleration solutions are mostly designed for the SQL layer, with the FPGA typically placed between storage and host as a filter. Although there have been many attempts at FPGA-accelerated OLAP systems, FPGA acceleration for OLTP systems remains a design challenge;
- As FPGA process geometries shrink, internal errors such as single-event upsets (SEUs) pose a growing threat to FPGA reliability. A single chip experiences an internal error roughly once every 3-5 years; for large-scale, high-availability systems, designing a fault tolerance mechanism is therefore particularly important.
To mitigate the impact of compaction on X-Engine's performance, we introduced a heterogeneous hardware device, the FPGA, to take over compaction from the CPU, keeping overall system performance high and free of jitter. This is key to the storage engine's ability to serve demanding business requirements. The contributions of this article are as follows:
- An efficient design and implementation of FPGA compaction. By pipelining compaction operations, FPGA compaction achieves ten times the processing performance of a single-threaded CPU;
- An asynchronous scheduling design for the hybrid storage engine. Since a single FPGA compaction round trip takes milliseconds, traditional synchronous scheduling would block a large number of compaction threads and incur heavy thread-switching costs. Asynchronous scheduling reduces these costs and improves the system's engineering availability;
- A fault tolerance mechanism. Due to input data constraints and internal FPGA errors, some compaction tasks must be rolled back. To guarantee data integrity, every task rolled back by the FPGA is re-executed by an equivalent CPU compaction thread. The fault tolerance mechanism designed here meets Alibaba's actual business needs while shielding the system from the FPGA's instability.
Problem Background
Compaction of X-Engine
The storage structure of X-Engine includes one or more memory buffers (memtables) and multiple layers of persistent storage L0, L1, … Each layer consists of multiple SSTables.
When a memtable is full, it is converted into an immutable memtable and then flushed to the L0 layer as an SSTable. Each SSTable contains multiple data blocks plus an index block for locating them. When the number of files in L0 exceeds a limit, the L0 SSTables are merged with the L1 SSTables whose key ranges overlap them; this process is called compaction. Similarly, when the number of SSTables in a layer exceeds its threshold, that layer is merged with lower-level data. In this way, cold data continuously flows downward while hot data remains in the higher layers.
A compaction merges key-value pairs within a specified key range and may involve multiple data blocks. Generally, a compaction merges data blocks from two adjacent layers, but compaction between L0 and L1 requires special handling: since L0 SSTables are flushed directly from memory, their key ranges may overlap one another, so an L0-to-L1 compaction may need to merge data blocks from many SSTables at once (a multi-way merge).
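To show the compare & merge at the core of a compaction, here is a minimal single-threaded sketch. It assumes sorted input runs of (key, version, value) entries; the types and function names are ours for illustration, and the check for versions still referenced by active transactions is omitted.

```cpp
#include <cstdint>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct Entry {
    std::string key;
    uint64_t version;   // larger = newer
    std::string value;
};

// Merge sorted runs, keeping only the newest version of each key. A real
// compaction must also retain any version still referenced by an active
// transaction; that check is omitted here for brevity.
std::vector<Entry> MergeRuns(const std::vector<std::vector<Entry>>& runs) {
    using Cursor = std::pair<size_t, size_t>;  // (run index, offset in run)
    auto newer_first = [&](const Cursor& a, const Cursor& b) {
        const Entry& ea = runs[a.first][a.second];
        const Entry& eb = runs[b.first][b.second];
        if (ea.key != eb.key) return ea.key > eb.key;  // min-heap on key
        return ea.version < eb.version;                // newest version first
    };
    std::priority_queue<Cursor, std::vector<Cursor>, decltype(newer_first)>
        heap(newer_first);
    for (size_t i = 0; i < runs.size(); ++i)
        if (!runs[i].empty()) heap.push({i, 0});

    std::vector<Entry> out;
    while (!heap.empty()) {
        auto [run, pos] = heap.top();
        heap.pop();
        const Entry& e = runs[run][pos];
        if (out.empty() || out.back().key != e.key)
            out.push_back(e);              // first hit per key = newest version
        if (pos + 1 < runs[run].size())
            heap.push({run, pos + 1});     // advance this run's cursor
    }
    return out;
}
```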
For a read, X-Engine must search all memtables and, if the key is not found there, check persistent storage from higher layers to lower ones. Timely compaction therefore both shortens the read path and saves storage space, yet it also seizes the system's computing resources and causes performance jitter; this is the dilemma X-Engine urgently needs to resolve.
FPGA Accelerated Database
Analyzing existing FPGA-accelerated databases, we can categorize their architectures into two types: the “bump-in-the-wire” design and the hybrid design. In the early days, because FPGA boards had insufficient memory resources, the former architecture was more popular: the FPGA sits on the data path between storage and host, acting as a filter. Its advantage is zero-copy data transfer, but it requires the accelerated operation to be part of stream processing, making the design less flexible.
The latter design scheme treats FPGA as a co-processor, connecting the FPGA to the host via PCIe, and data is transmitted via DMA. As long as the offloaded operation is compute-intensive enough, the data transfer cost is acceptable. The hybrid architecture design allows for more flexible offloading methods. For the complex operation of compaction, data transmission between FPGA and host is necessary, so in X-Engine, we adopted a hybrid design architecture for hardware acceleration.
System Design
In traditional LSM-tree-based storage engines, the CPU not only handles normal user requests but also schedules and executes compaction tasks. In terms of compaction tasks, the CPU is both the producer and consumer. For the CPU-FPGA hybrid storage engine, the CPU is only responsible for producing and scheduling compaction tasks, while the actual execution of compaction tasks is offloaded to dedicated hardware (FPGA).
For X-Engine, normal user requests are processed much as in other LSM-tree-based storage engines (a minimal sketch of this write path follows the steps below):
- The user submits a request to operate on a specified KV pair (Get/Insert/Update/Delete); if it is a write operation, a new record is appended to the memtable;
- When the size of the memtable reaches a threshold, it is converted into an immutable memtable;
- The immutable memtable is converted into an SSTable and flushed to persistent storage.
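As referenced above, here is a minimal sketch of this write path. The types and the 256 MB threshold are illustrative assumptions, not X-Engine internals; a real multi-version engine appends new versions rather than overwriting in place, which the map here simplifies away.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

using MemTable = std::map<std::string, std::string>;

struct Engine {
    static constexpr std::size_t kMemTableLimit = 256u << 20;  // 256 MB (assumed)

    MemTable active;                // receives all new writes
    std::size_t active_bytes = 0;
    std::vector<std::shared_ptr<const MemTable>> immutables;  // awaiting flush

    void Put(const std::string& key, const std::string& value) {
        active[key] = value;  // simplification: a real MVCC engine appends
        active_bytes += key.size() + value.size();
        if (active_bytes >= kMemTableLimit) Seal();
    }

    // Convert the full memtable into an immutable one; a background thread
    // would then flush it to persistent storage as an L0 SSTable.
    void Seal() {
        immutables.push_back(
            std::make_shared<const MemTable>(std::move(active)));
        active.clear();
        active_bytes = 0;
    }
};
```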
When the number of SSTables in the L0 layer reaches the threshold, a compaction task is triggered. Offloading a compaction proceeds in the following steps (a sketch of the task-construction step appears after the list):
- Load the SSTables that need compaction from persistent storage. Based on metadata, the CPU splits them into multiple compaction tasks at data block granularity and pre-allocates memory for each task's results. Each constructed task is pushed into the Task Queue to await FPGA execution;
- The CPU reads the status of the Compaction Units on the FPGA and assigns tasks from the Task Queue to available Compaction Units;
- The input data is transferred to the FPGA's DDR memory via DMA;
- A Compaction Unit executes the compaction task. When the computation finishes, the results are returned to the host via DMA, together with a return code indicating the task's status (success or failure), and the completed task is pushed into the Finished Queue;
- The CPU checks the result status of tasks in the Finished Queue; if a compaction failed, it is executed again by the CPU;
- The compaction results are flushed to storage.
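As a concrete illustration of the first step, here is a minimal sketch of how a compaction job might be cut into CU-sized tasks and queued. CompactionTask, the queues, and EnqueueJob are our hypothetical stand-ins, not X-Engine's actual structures.

```cpp
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

struct DataBlock;  // a block of encoded, compressed KV pairs

struct CompactionTask {
    std::vector<const DataBlock*> inputs;  // one block per input way (<= 4)
    std::vector<uint8_t> result;           // pre-allocated DMA output buffer
    int return_code = -1;                  // set by the FPGA: 0 = success
};

std::queue<CompactionTask*> task_queue;      // produced by the CPU
std::queue<CompactionTask*> finished_queue;  // drained by the CPU

// The CPU uses index metadata to form groups of blocks with overlapping key
// ranges (one block per way) and builds one task per group, pre-allocating
// the result buffer so DMA has a destination before dispatch.
void EnqueueJob(const std::vector<std::vector<const DataBlock*>>& groups,
                std::size_t result_bytes_per_task) {
    for (const auto& group : groups) {
        auto* task = new CompactionTask;   // ownership passes via the queue
        task->inputs = group;
        task->result.resize(result_bytes_per_task);
        task_queue.push(task);             // waits for a free Compaction Unit
    }
}
```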
Detailed Design
FPGA-based Compaction
The Compaction Unit (CU) is the basic unit for executing compaction tasks on the FPGA. Multiple CUs can be placed on a single FPGA board, and each CU consists of the following modules:
- Decoder. In X-Engine, KV pairs are stored in data blocks in compressed, encoded form. The Decoder module's main task is to decode these key-value pairs. Each CU contains 4 Decoders, so a CU supports at most a 4-way compaction; tasks with more than 4 ways must be split by the CPU. Our assessments show that most compactions are 4-way or fewer. Placing 4 Decoders is also a balance between performance and hardware resource consumption: compared with 2 Decoders, it increases hardware resource usage by 50% while delivering a 3-fold performance improvement.
- KV Ring Buffer. KV pairs decoded by the Decoder module are staged in the KV Ring Buffer. Each KV Ring Buffer maintains a read pointer (managed by the Controller module) and a write pointer (managed by the Decoder module), along with three status signals: FLAG_EMPTY, FLAG_HALF_FULL, and FLAG_FULL. When FLAG_HALF_FULL is low, the Decoder keeps decoding KV pairs; otherwise it pauses until downstream consumers have drained the already-decoded KV pairs.
- KV Transfer. This module transfers keys to the Key Buffer. Since merging only requires key comparisons, values need not be transferred; the read pointer tracks the KV pair currently being compared.
- Key Buffer. This module stores the keys to be compared. Once all keys that need comparison have been transferred into the Key Buffer, the Controller notifies the Compaction PE to perform the comparison.
- Compaction PE. The Compaction Processing Engine compares the keys in the Key Buffer. The comparison result is sent to the Controller, which notifies KV Transfer to move the corresponding KV pair into the Encoding KV Ring Buffer, where it awaits the Encoder module.
- Encoder. The Encoder module encodes KV pairs from the Encoding KV Ring Buffer into data blocks. When a data block reaches the size threshold, it is flushed to the onboard DDR.
- Controller. Within the CU, the Controller acts as a coordinator. Although it is not a stage of the compaction pipeline, it plays a key role in every step of the pipeline's design.
A compaction pass consists of three steps: decode, merge, and encode. The biggest challenge in designing a suitable compaction pipeline is that the execution times of these steps differ greatly. For example, thanks to parallelization, the throughput of the Decoder module is much higher than that of the Encoder module, so faster modules must sometimes pause to wait for their downstream neighbors. To match the throughput differences among pipeline modules, we designed the Controller module to coordinate the different pipeline stages. An additional benefit of this design is that it decouples the pipeline's modules, enabling more agile development and maintenance in the engineering implementation.
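The half-full pause is easy to picture with a small software analogue. Below is a minimal single-producer/single-consumer sketch; the flag names mirror the text above, but everything else is our own illustration, not the hardware design.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>
#include <string>

// Software analogue of the KV Ring Buffer's status signals.
template <std::size_t N>
class KvRingBuffer {
public:
    bool Empty() const { return Size() == 0; }          // FLAG_EMPTY
    bool HalfFull() const { return Size() >= N / 2; }   // FLAG_HALF_FULL
    bool Full() const { return Size() == N; }           // FLAG_FULL

    // Decoder-side policy: pause issuing new decodes once half full, which
    // leaves headroom for KV pairs already in flight inside the decoder.
    bool ShouldPauseDecode() const { return HalfFull(); }

    // Writer side (Decoder): FLAG_FULL is the hard limit.
    bool TryPush(std::string kv) {
        if (Full()) return false;
        buf_[write_ % N] = std::move(kv);
        write_.fetch_add(1);
        return true;
    }

    // Reader side (read pointer managed by the Controller): consuming
    // entries eventually clears FLAG_HALF_FULL and resumes the Decoder.
    std::optional<std::string> TryPop() {
        if (Empty()) return std::nullopt;
        std::string kv = std::move(buf_[read_ % N]);
        read_.fetch_add(1);
        return kv;
    }

private:
    std::size_t Size() const { return write_.load() - read_.load(); }

    std::array<std::string, N> buf_;
    std::atomic<std::size_t> write_{0};  // advanced by the Decoder
    std::atomic<std::size_t> read_{0};   // advanced by the consumer
};
```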
Before integrating FPGA compaction into X-Engine, we first evaluated the standalone throughput of a single CU. The experimental baseline is a single-threaded CPU compaction thread (Intel(R) Xeon(R) E5-2682 v4 CPU, 2.5 GHz).
From the experiments, we can draw the following three conclusions:
- For all KV lengths, FPGA compaction throughput exceeds that of a single CPU thread, confirming the feasibility of offloading compaction;
- As key length increases, FPGA compaction throughput decreases, because longer keys increase the number of bytes compared and therefore the comparison cost;
- The acceleration ratio (FPGA throughput / CPU throughput) increases with value length, because for short values the frequent inter-module communication and state checks incur significant overhead relative to ordinary pipeline operation.
Asynchronous Scheduling Logic Design
Because a single FPGA compaction round trip takes milliseconds, traditional synchronous scheduling would incur frequent thread-switching costs. Targeting the FPGA's characteristics, we redesigned compaction scheduling to be asynchronous: the CPU constructs compaction tasks and pushes them into the Task Queue; a thread pool distributes tasks to the designated CUs; when a compaction finishes, the task is pushed into the Finished Queue and the CPU checks its execution status; tasks that failed are re-executed by the CPU's compaction threads. Asynchronous scheduling greatly reduces the CPU's thread-switching costs.
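Below is a minimal sketch of what such an asynchronous scheduling loop can look like. FpgaDispatch and FpgaPoll are hypothetical stand-ins for the real driver API (stubbed here so the sketch compiles); this illustrates the non-blocking polling pattern, not X-Engine's actual implementation.

```cpp
#include <atomic>
#include <deque>
#include <mutex>
#include <vector>

struct Task { int return_code = -1; };  // simplified task handle

// Stand-ins for the real FPGA driver API (assumed names; stubbed):
bool FpgaDispatch(int cu, Task* t) { (void)cu; t->return_code = 0; return true; }
Task* FpgaPoll(int cu) { (void)cu; return nullptr; }  // non-blocking check
void CpuCompaction(Task* t) { (void)t; /* equivalent software merge */ }

// One driver thread services all CUs without ever blocking on the
// millisecond-scale FPGA latency, so no compaction thread sleeps per task.
void DriverLoop(std::deque<Task*>& task_queue,
                std::deque<Task*>& finished_queue,
                int num_cus, std::mutex& mu,
                const std::atomic<bool>& stop) {
    std::vector<Task*> inflight(num_cus, nullptr);
    while (!stop.load()) {
        std::lock_guard<std::mutex> lk(mu);
        for (int cu = 0; cu < num_cus; ++cu) {
            if (inflight[cu]) {                          // reap, non-blocking
                if (Task* done = FpgaPoll(cu)) {
                    inflight[cu] = nullptr;
                    finished_queue.push_back(done);
                }
            }
            if (!inflight[cu] && !task_queue.empty()) {  // feed an idle CU
                Task* next = task_queue.front();
                if (FpgaDispatch(cu, next)) {
                    task_queue.pop_front();
                    inflight[cu] = next;
                }
            }
        }
        while (!finished_queue.empty()) {                // verify results
            Task* t = finished_queue.front();
            finished_queue.pop_front();
            if (t->return_code != 0) CpuCompaction(t);   // CPU fallback
        }
    }
}
```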
Design of Fault Tolerance Mechanisms
For FPGA compaction, there are three possible reasons for compaction task errors:
- Data corrupted in transit. We compute CRC values for the data before and after transmission and compare them; if the two CRC values differ, the data has been corrupted;
- Errors inside the FPGA itself (bit flips). To catch these, each CU is paired with a second CU, and the results of the two CUs are compared bit by bit; any mismatch indicates that a bit flip has occurred;
- Invalid input data. To simplify the design of FPGA compaction, we limit the length of KV pairs; compaction tasks containing KV pairs that exceed this limit are deemed invalid.
For every erroneous task, the CPU recomputes the result to guarantee data correctness. Under this fault tolerance mechanism, we handle the small number of compaction tasks that exceed the length limit and avoid the risk of internal FPGA errors.
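Here is a compact sketch of how these three checks could compose in software. The bitwise CRC-32 stands in for whatever checksum the transport actually uses, and the 32 KB length limit, status names, and functions are assumptions for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Standard bitwise CRC-32 (reflected, polynomial 0xEDB88320), standing in
// for the real transport checksum.
uint32_t Crc32(const std::vector<uint8_t>& data) {
    uint32_t crc = 0xFFFFFFFFu;
    for (uint8_t b : data) {
        crc ^= b;
        for (int i = 0; i < 8; ++i)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

enum class TaskStatus { kOk, kTransferCorrupted, kBitFlip, kInvalidInput };

constexpr std::size_t kMaxKvLen = 32 * 1024;  // assumed KV length limit

// Check 3: reject tasks whose KV pairs exceed the supported length.
TaskStatus ValidateInput(std::size_t longest_kv_in_task) {
    return longest_kv_in_task <= kMaxKvLen ? TaskStatus::kOk
                                           : TaskStatus::kInvalidInput;
}

// Checks 1 and 2: CRC across the DMA transfer, then a bit-by-bit comparison
// of the outputs of the paired CUs. Any non-OK status sends the task back to
// an equivalent CPU compaction thread.
TaskStatus VerifyResult(uint32_t crc_sent,
                        const std::vector<uint8_t>& received,
                        const std::vector<uint8_t>& mirror_cu_output) {
    if (Crc32(received) != crc_sent) return TaskStatus::kTransferCorrupted;
    if (received != mirror_cu_output) return TaskStatus::kBitFlip;
    return TaskStatus::kOk;
}
```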
Experimental Results
Experimental Environment
- CPU: 64-core Intel Xeon E5-2682 v4 processor (2.50 GHz)
- Memory: 128 GB
- FPGA board: Xilinx VU9P
- Memtable: 40 GB
- Block cache: 40 GB
We compared the performance of the two storage engines:
- X-Engine-CPU: compaction executed by the CPU
- X-Engine-FPGA: compaction offloaded to the FPGA
DbBench
Result Analysis:
- In the write-only scenario, X-Engine-FPGA's throughput improved by 40%. The performance curves show that when compaction begins, X-Engine-CPU's performance drops by more than one-third;
- Because FPGA compaction has higher throughput and runs more promptly, the read path is shortened more quickly; thus, in the mixed read-write scenario, X-Engine-FPGA's throughput improved by 50%;
- Throughput in the mixed read-write scenario is lower than in the pure write scenario: reads also access data in the persistent layers, which incurs I/O overhead and affects overall throughput;
- The two systems' performance curves reflect two different compaction states. In the left figure, system performance jitters periodically, indicating that compaction operations are competing with normal transaction-processing threads for CPU resources. In the right figure, X-Engine-CPU's performance stays flat at a low level, indicating that compaction is slower than the write rate, so SSTables accumulate and compaction tasks are continuously scheduled in the background;
- Since compaction scheduling is still performed by the CPU, X-Engine-FPGA also shows some jitter rather than being perfectly smooth.
YCSB
Result Analysis:
- In the YCSB benchmark, the performance of X-Engine-CPU dropped by about 80% under the impact of compaction, whereas X-Engine-FPGA, affected only by the compaction scheduling logic, fluctuated by just 20%;
- Uniqueness checks introduce read operations; as the test runs longer, the read path lengthens, so the performance of both storage engines declines over time;
- In the write-only scenario, X-Engine-FPGA's throughput improved by 40%. As the read ratio rises, the acceleration effect of FPGA compaction gradually diminishes: the higher the read ratio, the lower the write pressure, the slower SSTables accumulate, and the fewer threads execute compaction. FPGA acceleration therefore delivers its largest gains on write-intensive workloads;
- As the read-write ratio increased, overall throughput increased: write throughput here is lower than in the KV-interface tests, so the cache miss ratio stays low and frequent I/O is avoided. Conversely, as the write ratio rises, more threads execute compaction, reducing the system's throughput.
TPC-C (100 warehouses)
| Connections | X-Engine-CPU | X-Engine-FPGA |
| ----------- | ------------ | ------------- |
| 128         | 214279       | 240105        |
| 256         | 203268       | 230401        |
| 512         | 197001       | 219618        |
| 1024        | 189697       | 208532        |
Result Analysis:
- With FPGA acceleration, X-Engine-FPGA achieves a performance improvement of 10% to 15% as the number of connections increases from 128 to 1024. As connections increase, the throughput of both systems gradually decreases due to growing lock contention on hot rows;
- TPC-C's read-write ratio is 1.8:1. During the experiment we observed that over 80% of CPU resources were consumed by SQL parsing and hot-row lock contention, so the actual write pressure was modest: for X-Engine-CPU, no more than 3 threads (out of 64 cores) were executing compaction, which is why FPGA acceleration is less pronounced here than in the previous experiments.
SysBench
In this experiment, we also included InnoDB (buffer pool size = 80 GB) for comparison.
Result Analysis:
- X-Engine-FPGA improved throughput by over 40%. Because SQL parsing consumes significant CPU resources, DBMS throughput is lower than that of the KV interface;
- X-Engine-CPU settled into an equilibrium at a low level: compaction was slower than the write rate, SST files accumulated, and compaction was continuously scheduled;
- X-Engine-CPU's performance is twice that of InnoDB, demonstrating the advantage of LSM-tree-based storage engines in write-intensive scenarios;
- Compared with TPC-C, SysBench is closer to Alibaba's actual transaction scenarios: queries are mostly insertions and simple point lookups, with very few range queries, and fewer hot-row conflicts reduce the resources consumed by the SQL layer. During the experiment we observed that for X-Engine-CPU, more than 15 threads were executing compaction, making the performance improvement from FPGA acceleration much more evident.
Conclusion
In this article, we propose the X-Engine storage engine with FPGA acceleration, which achieves a 50% performance improvement for KV interfaces and a 40% performance improvement for SQL interfaces. As the read-write ratio decreases, the effect of FPGA acceleration becomes more apparent, indicating that FPGA compaction acceleration is suitable for write-intensive workloads, which is consistent with the design intent of LSM-tree. Additionally, we designed a fault tolerance mechanism to mitigate the inherent flaws of FPGA, ultimately forming a high-availability CPU-FPGA hybrid storage engine suitable for Alibaba’s actual business.
Final Note
This project is the first time X-DB has introduced heterogeneous computing devices to accelerate core database functions. Our experience shows that the FPGA can fully absorb the computational demands of X-Engine's compaction. We are also investigating offloading other suitable computational tasks to the FPGA, such as compression, Bloom filter generation, and SQL JOIN operators. The compression function has already been developed and will be integrated with compaction so that data merging and compression complete in a single pass.
X-DB FPGA-Compaction hardware acceleration is a collaborative R&D project completed by the database kernel team of the Database Division, the customized computing team of the Server R&D Division, and Zhejiang University. Its implementation also relied on strong support from Xilinx's technical team, for which we express our gratitude.
X-DB will be launched for public testing on Alibaba Cloud this year, and everyone can experience the extreme performance brought by FPGA acceleration to X-DB. We also look forward to your valuable suggestions.