Main Content
- DSI is a crucial component of the AI pipeline, equally important as models and accelerators.
- The main task of DSI is to provide a continuous data flow to accelerators to avoid performance bottlenecks.
- With the large-scale adoption of AI in enterprises, traditional HDD-based object storage (such as S3, Google Cloud, and Azure Blob) faces scalability limits in both capacity and I/O bandwidth.
- Recommendation models (DLRM) are widely used to predict user behavior and support the core businesses of companies such as Meta, Netflix, and Google.
- Unlike the one-time training of foundation models (such as ChatGPT), DLRM requires frequent training and fine-tuning to adapt to the latest user interaction data.
- DLRM parameter counts are enormous; one Meta model contains 12 trillion parameters, with a 96 TB embedding table (24 TB after compression).
- DLRM training datasets typically reach tens of PB and are read only once per training pass, so the storage system must deliver high throughput.
- DLRM pipeline architecture: Meta's DLRM pipeline is a closed-loop feedback system in which user interaction data continuously feeds back into training and fine-tuning.
- Storage architecture: initial data is stored in a database, then written in a compressed columnar format to the HDD-based Tectonic distributed file system.
- SSD cache layer (Shift):
  - To address I/O bandwidth bottlenecks, Meta deployed an SSD cache layer, Shift, on top of the Tectonic file system to optimize training data reads.
  - Shift exploits data locality to cache data that is frequently accessed across DLRM training jobs, such as feature columns that are often read together.
  - With Shift caching, DSI power consumption was reduced by 29%.
- Online preprocessing:
  - Includes decompression, decryption, data transformation, and filtering, performed on independent mini-batches without inter-node communication, which keeps it scalable.
  - Preprocessing is executed by DPP worker nodes, each independently processing its assigned data blocks.
- Payload characteristics: approximately 90% of read and write operations have a 512 KB payload.
- Write characteristics: most writes during the preprocessing stage are sequential, about 85%-95%.
- Read characteristics: some workloads also show highly sequential reads.
- Dynamic performance requirements: storage demand fluctuates over time, and queue depth does not stay consistently high.
=========
About a year and a half ago, we posed a question: What role do storage systems (especially flash storage) play in the field of AI?
Initially, we all thought the answer to this question was simple: AI generates massive amounts of data that need to be stored, thus storage demand must grow.
As datasets and model sizes continued to expand, we realized that storage systems could also be used to offload data, reducing costs and sometimes even parameter sizes. Among the many topics we discussed, the data ingestion phase stood out as particularly critical, and it is the focus of today's talk: we are now most interested in the role SSDs play in the data ingestion stage of the AI pipeline.
- Focus on data ingestion
- Use the Deep Learning Recommendation Model (DLRM) as the example, rather than an LLM
- Focus on the training process, not inference
- Share insights based on benchmark results
Specifically, I will first outline the key conclusions and then gradually elaborate.
1. Is data storage and ingestion (DSI) a critical component of the AI pipeline?
2. How does flash storage solve the scalability gap between storage capacity and I/O bandwidth?
3. What do storage traces of DLRM training workloads show?
4. How can the MLPerf training benchmark be enhanced to better represent DLRM workloads?
I will primarily reference Meta’s work in this area, particularly their research findings from four to five years ago, which provide a solid foundation for answering the above questions.
- Data storage and ingestion are core components of the AI pipeline
- This discussion focuses on DLRM
- Recommendation model training faces scalability challenges in storage capacity and I/O, which flash storage can effectively address
- We use MLPerf training DLRM benchmark results to analyze the model's training and preprocessing stages
In recent discussions, the issue of data ingestion has received significant attention, which is indeed a major challenge.
Let me first show a simplified pipeline diagram, after which we will explore more complex details.
Traditionally, the AI pipeline has focused mainly on models and accelerators. As mentioned earlier, we discussed data storage solutions, HBM adaptability, DRAM data offloading, and quantization technologies. However, in production environments, the situation is much more complex.
In this presentation, I will use the term “DSI” derived from the literature, which accurately describes the critical phase of data storage and ingestion. Why is DSI crucial?
It is not only data volume that keeps growing; even more notably, the frequency with which data is accessed is growing exponentially. This contrasts sharply with the traditional "write once, read many" model: the same data now has to be read back far more often than before.
DSI is also important because it keeps GPUs continuously fed with data. Idle GPUs not only reduce efficiency but also translate directly into economic losses. The optimization criteria have changed: the design goal of DSI is to keep accelerators amply supplied with data without compromising their performance.
This discussion will focus on the storage and ingestion phases, particularly their application in the DLRM training pipeline, without involving LLMs, and only focusing on the training phase.
Recommendation systems are ubiquitous in our daily lives, used by platforms from Netflix and Amazon to Facebook. These models aim to guide user behavior, whether it’s purchasing products or watching content. Recommendation models have become core pillars for these companies, generating revenue, driving research and development, and influencing corporate strategic decisions.
As illustrated in a research paper, the model consists of several key components: sparse features, dense features, and embedding table lookups. As in LLMs, embedding tables are used here too. On top of these sit neural network (MLP) layers and a feature interaction layer. Together, these layers predict user behavior, such as whether a user will click on a particular advertisement. The actual application scenarios are far broader, with hundreds of different applications already in production.
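To make this layer structure concrete, here is a minimal, illustrative DLRM-style forward pass in PyTorch. The feature counts, embedding cardinalities, and MLP widths are arbitrary stand-ins, not Meta's production configuration; it only sketches the sparse-embedding + dense-MLP + feature-interaction shape described above.

```python
# Minimal DLRM-style forward pass (illustrative only; all sizes are made up).
import torch
import torch.nn as nn

class TinyDLRM(nn.Module):
    def __init__(self, num_dense=13, cat_cardinalities=(1000,) * 26, emb_dim=16):
        super().__init__()
        # One embedding table per sparse (categorical) feature.
        self.embeddings = nn.ModuleList(
            [nn.Embedding(card, emb_dim) for card in cat_cardinalities]
        )
        # Bottom MLP maps dense features into the same embedding space.
        self.bottom_mlp = nn.Sequential(
            nn.Linear(num_dense, 64), nn.ReLU(), nn.Linear(64, emb_dim), nn.ReLU()
        )
        # Top MLP consumes the pairwise feature interactions plus the dense vector.
        num_vectors = len(cat_cardinalities) + 1
        num_pairs = num_vectors * (num_vectors - 1) // 2
        self.top_mlp = nn.Sequential(
            nn.Linear(num_pairs + emb_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, dense, sparse):
        # dense: (B, 13) floats; sparse: (B, 26) integer IDs.
        d = self.bottom_mlp(dense)                                   # (B, emb_dim)
        e = [emb(sparse[:, i]) for i, emb in enumerate(self.embeddings)]
        vecs = torch.stack([d] + e, dim=1)                           # (B, 27, emb_dim)
        inter = torch.bmm(vecs, vecs.transpose(1, 2))                # pairwise dot products
        idx = torch.triu_indices(vecs.size(1), vecs.size(1), offset=1)
        inter = inter[:, idx[0], idx[1]]                             # (B, num_pairs)
        logit = self.top_mlp(torch.cat([d, inter], dim=1))
        return torch.sigmoid(logit)                                  # click probability

model = TinyDLRM()
prob = model(torch.randn(4, 13), torch.randint(0, 1000, (4, 26)))
```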
An important characteristic of the Deep Learning Recommendation Model (DLRM) is its nature of continuous training. Unlike foundational models like ChatGPT-3/4 or LLaMA 3.1, which undergo one-time training, DLRM requires daily updates. A complete recommendation system typically contains 20-30 or even more sub-models, which are continuously fine-tuned with user interactions.
The parameters of DLRM consist mainly of Multi-Layer Perceptrons (MLP) and embedding tables. According to relevant papers, the parameter count can reach 12 trillion. In contrast, the latest LLaMA model has about 0.5 trillion parameters. The embedding table is particularly large, with an original size of 96TB, and even after compression, it remains at 24TB.
This presents a key question for Meta: How to allocate these data reasonably between HBM and memory?
More challenging is that these models need to be trained on PB-scale datasets. Five years ago, data volumes had already reached 25-30PB, even 40PB, and today’s scale far exceeds the past. Therefore, the Data Storage and Ingestion (DSI) system plays a crucial role in handling such massive amounts of data.
These contents fully illustrate the importance of DLRM and its challenges in storage, which are issues that we, as storage and flash experts, need to focus on.
When building the AI pipeline, two key points are worth noting.
First, the DLRM pipeline is a closed-loop feedback system, where user interaction data continuously feeds back into model training and fine-tuning.
Second, user interaction data, after being recorded and preprocessed, is stored in a disaggregated manner. Due to the large volume of data, multiple data centers are usually required to collaborate. These PB-scale data must be transmitted over the network, preprocessed, and then sent to GPUs for training. The periodic checkpoint mechanism is also indispensable in this process.
DLRM adopts a single-epoch training mode, which is fundamentally different from multi-epoch training. This means that 25PB of training data will only be read once, a characteristic that requires us to adopt special data processing methods.
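To make the single-pass access pattern concrete, here is a minimal sketch (assuming pyarrow and hypothetical shard file names) of a reader that visits every Parquet row group of the training set exactly once and never rewinds:

```python
# Sketch of a single-pass ("single-epoch") reader: each row group of each shard
# is streamed once and never revisited. Shard file names are hypothetical.
import pyarrow.parquet as pq

def stream_training_data(paths, columns=None):
    for path in paths:                       # shards are read once, in order
        pf = pq.ParquetFile(path)
        for rg in range(pf.num_row_groups):  # a row group is one large sequential read
            yield pf.read_row_group(rg, columns=columns)

for batch in stream_training_data(["day_0.parquet", "day_1.parquet"]):
    pass  # hand the batch to preprocessing / the trainer
```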
In production environments, Meta frames the optimization problem mainly in terms of power. Assuming a total power budget of 100 units, storage accounts for 20, preprocessing for 20, and GPUs for 60. The optimization goal is to shift that split to 90 units for GPUs, leaving only 10 for storage and preprocessing combined. This power budget directly limits how many GPUs can run simultaneously, and therefore business efficiency.
User interaction data processing is divided into two stages: data recording and data cleaning preprocessing. Meta uses RocksDB for initial data storage.
Data after preprocessing is stored in a compressed format in a distributed file system, with an overall storage capacity reaching EB levels. This forms the data foundation of the entire system.
Next, scalability. Beyond the growth in data volume, the workload accessing this data is also increasing. According to a paper published by Meta, the amount of data stored doubled in two years while I/O bandwidth demand quadrupled. This is mainly because multiple models (especially DLRM models) are trained on the same data, which greatly increases access frequency.
To address storage and I/O demands, Meta added an SSD cache layer on top of its HDD-based Tectonic file system. This cache layer, named Shift, is specifically designed for AI training, particularly DLRM training. Although Shift does not provide persistent storage, it significantly enhances system performance by combining CacheLib and the Tectonic file system. According to Meta, the Shift solution saves 29% in power consumption compared to using HDD alone.
In cases of multiple parties contending for data, both I/O bandwidth and capacity scalability face challenges. Let me elaborate on this process:
First, log devices record user interactions and write data into data warehouse table partitions. This data is simultaneously used for model fine-tuning and the data warehouse system. Subsequently, the data is transformed into columnar format and stored in the Tectonic file system, with SSD caching on top.
The key lies in the trade-off: simply adding HDDs can solve capacity issues, but to achieve the required I/O bandwidth, 2-3 times the number of drives would be needed. The hybrid solution adopted by Meta has proven effective. Training data is stored in the Tectonic file system, which is very similar to the object storage and blob storage solutions introduced by Microsoft last night.
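As a back-of-the-envelope illustration of that trade-off, the sketch below compares the HDD count dictated by capacity with the count dictated by bandwidth. Every number in it (dataset size, required bandwidth, per-drive capacity and throughput) is my own illustrative assumption, not a figure from the talk.

```python
# Back-of-the-envelope capacity-vs-bandwidth trade-off for an HDD-only tier.
# All numbers are illustrative assumptions, not figures from Meta or the talk.
dataset_tb      = 25_000    # ~25 PB of training data
required_gbps   = 800       # assumed aggregate read bandwidth the trainers need (GB/s)
hdd_capacity_tb = 18        # assumed capacity of one nearline HDD
hdd_seq_gbps    = 0.25      # assumed ~250 MB/s sustained throughput per HDD

drives_for_capacity  = dataset_tb / hdd_capacity_tb   # ~1,389 drives
drives_for_bandwidth = required_gbps / hdd_seq_gbps   # ~3,200 drives, roughly 2-3x more
print(drives_for_capacity, drives_for_bandwidth)
# If bandwidth demand quadruples while stored data only doubles, the drive count is
# soon dictated by I/O rather than capacity -- which is what an SSD cache layer absorbs.
```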
Data is stored in a compressed format, usually using Parquet format for block storage. Since single-epoch training is adopted, the caching system must have intelligent management capabilities across tasks. The system can identify frequently used datasets and discover data sharing patterns to achieve cross-task caching optimization.
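To illustrate the idea of cross-job caching of hot columnar blocks, here is a toy read-through cache. It is only a conceptual sketch; the real Shift layer is built on CacheLib with its own admission and eviction policies, and nothing below mirrors that API.

```python
# Toy read-through cache keyed by (file, block): a conceptual sketch of caching
# hot columnar blocks across training jobs. Not the CacheLib/Shift implementation.
from collections import OrderedDict

class BlockCache:
    def __init__(self, capacity_blocks, backend_read):
        self.capacity = capacity_blocks
        self.backend_read = backend_read      # e.g. a read from the HDD-backed file system
        self.blocks = OrderedDict()           # LRU order: oldest first
        self.hits = self.misses = 0

    def read(self, file_id, block_id):
        key = (file_id, block_id)
        if key in self.blocks:
            self.blocks.move_to_end(key)      # refresh recency
            self.hits += 1
            return self.blocks[key]
        self.misses += 1
        data = self.backend_read(file_id, block_id)
        self.blocks[key] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)   # evict least recently used block
        return data

# Two jobs reading the same hot feature columns share the cached blocks:
cache = BlockCache(capacity_blocks=4, backend_read=lambda f, b: f"{f}:{b}")
for job in ("job_a", "job_b"):
    for block in range(3):
        cache.read("clicks_table.parquet", block)
print(cache.hits, cache.misses)   # the second job hits blocks the first job warmed
```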
This solution effectively addresses scalability issues. Meta officially put the SSD cache layer into production between 2022 and 2023. During the preprocessing stage, data is split into small blocks, extracted, transformed, and distributed through Data Processing Pipeline (DPP) nodes, finally sent to the training cluster. The system is scalable and can adjust the number of nodes based on GPU bandwidth requirements.
It is worth noting that while some AI training still relies on HDDs, Meta is working to increase the proportion of data extracted from the SSD cache system.
In summary: I mention these details so that we can compare them with the MLPerf Training benchmark. Meta's DLRM pipeline stores data in columnar format in a distributed file system, with most of it read through the SSD cache. Although we lack exact figures for the fraction of reads served from cache, the system has proven its value in production.
Online processing of data includes decompression, decryption, and transformation. The transformation method depends on whether the data type is dense features or embedding vectors. The data processing pipeline does a lot of work. The Pre-processing Engine System (PES) adopts a near-storage computing approach, which is indeed suitable in this scenario. The key is that preprocessing occurs on independent data shards, allowing each DPP node to process data independently without inter-node communication.
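The key property, that each DPP worker preprocesses its own shard with no inter-worker communication, can be sketched as follows; the shard contents and the filter step are hypothetical stand-ins for the real decompression, decryption, and transformation work.

```python
# Conceptual sketch of DPP-style preprocessing: each worker decompresses and
# transforms its own shard independently, with no communication between workers.
import zlib
from multiprocessing import Pool

def preprocess_shard(blob):
    raw = zlib.decompress(blob)               # decompression (decryption would slot in here)
    rows = raw.decode().splitlines()
    # Transformation/filtering stays local to this shard (e.g. drop comment rows).
    return [r for r in rows if r and not r.startswith("#")]

if __name__ == "__main__":
    # Stand-in shards; in production each worker would read its shard from storage.
    shards = [zlib.compress(f"# header\nrow_{i}_a\nrow_{i}_b".encode()) for i in range(8)]
    with Pool(processes=4) as pool:
        mini_batches = pool.map(preprocess_shard, shards)   # embarrassingly parallel
    print(sum(len(b) for b in mini_batches), "rows after preprocessing")
```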
Regarding the MLPerf training benchmark, this benchmark evaluates performance by measuring the training speed required for the system to achieve target quality metrics. Each benchmark contains specific datasets and quality targets. Micron participated in this benchmark, and we compared the viewpoints from the literature with benchmark results.
We used a DLRM model with 13 numerical features, 26 categorical features, and a ground-truth label, predicting user click behavior. Although far smaller than the 12-trillion-parameter model, it is a good starting point. The test environment used 4 NVIDIA A100 GPUs, and the dataset was preprocessed from NumPy (NPZ) format into Parquet, about 1 TB in total.
Preprocessing was completed offline. The original data was converted to columnar format and stored in compressed Parquet format, with categorical data transformed into continuous integer representations. The processing hardware included an AMD 128-core processor and 8 NVIDIA A100 GPUs.
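A rough sketch of that offline conversion (assuming pandas and pyarrow, with synthetic arrays standing in for the real NPZ inputs): categorical IDs are remapped to contiguous integers and the result is written as compressed, columnar Parquet.

```python
# Sketch of the offline conversion described above. The arrays are synthetic
# stand-ins, not the actual benchmark dataset.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dense = rng.random((1000, 13)).astype(np.float32)                 # 13 numerical features
cats = rng.integers(0, 50_000, size=(1000, 26), dtype=np.int64)   # 26 categorical features
label = rng.integers(0, 2, size=1000, dtype=np.int8)              # click / no click

df = pd.DataFrame({f"dense_{i}": dense[:, i] for i in range(13)})
for j in range(26):
    # factorize() remaps arbitrary categorical IDs to contiguous integers 0..N-1.
    df[f"cat_{j}"], _ = pd.factorize(cats[:, j])
df["label"] = label

# Columnar, compressed on-disk format, analogous to the benchmark's Parquet data.
df.to_parquet("dlrm_day.parquet", compression="snappy", engine="pyarrow")
```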
By tracking the execution of workloads at the storage layer, we can study their characteristics. Let me elaborate on the experimental setup and key details.
In storage evaluations, the primary question is to determine the storage object: is it storing model parameters, state values, or just the training dataset? In this case, we focus on the storage of the training dataset.
Regarding performance evaluation, preprocessing work is done collaboratively by GPUs and CPUs. This method has been discussed by Nvidia and mentioned in Meta (Facebook) papers. We tried both processing methods. After preprocessing, we focused on the training phase, which is also the core of the benchmark.
The second important finding is about payload. In this case, about 90% of the payload size is 512KB, a characteristic that remains consistent in GPU training and preprocessing.
In terms of write operations, about 85-95% are sequential. This matches expectations, since data is stored in an append-only manner: existing content is never modified, and new data is added at the end. Notably, these sequential writes do not come from a single thread; they arrive as structured sequences of data.
Read operations show similar sequentiality, though to a slightly lesser degree. We can partially explain this behavior, but it needs further study. What is clear is that these workloads are dominated by large, mostly sequential data accesses.
In storage evaluations, there is often an overemphasis on maximum queue depth (Q-depth), but this may not accurately reflect actual usage. In real workloads, traffic often exhibits burst characteristics; Q depth may surge to 200-400 for a short time and then drop to 10-20 for a period. Therefore, we need to move beyond maximum Q depth and performance metrics to understand the dynamic characteristics of workloads.
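To show how such traces can be summarized, here is a sketch that computes the payload-size distribution, write sequentiality, and queue-depth behavior over time; the CSV columns (ts, op, offset, size, qdepth) are a hypothetical trace format, not our actual tooling.

```python
# Sketch of summarizing a block-level I/O trace. The file and column layout
# (ts, op, offset, size, qdepth) are hypothetical.
import pandas as pd

trace = pd.read_csv("io_trace.csv")

# Share of I/Os at each payload size (e.g. how many are 512 KiB).
print(trace["size"].value_counts(normalize=True).head())

# Sequentiality: a write is "sequential" if it starts where the previous write ended.
writes = trace[trace["op"] == "W"].sort_values("ts")
seq = (writes["offset"] == (writes["offset"] + writes["size"]).shift()).mean()
print(f"sequential writes: {seq:.0%}")

# Queue depth over time: bursty, not a constant maximum.
print(trace.groupby(trace["ts"] // 1_000)["qdepth"].max().describe())
```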
Let us compare Meta’s DLRM pipeline with the DLRM benchmark. In production environments, the pipeline’s preprocessing and training occur synchronously, while in the benchmark, we used an offline preprocessing method.
Data is stored in columnar format. Although we lack specific details on data striping and erasure coding applications, it is confirmed that data is stored in Parquet format, which allows for specific columnar arrangements to optimize data localization.
While production environments handle PB-scale data and the benchmark uses only TB-scale data, there are still important innovations in data placement. For example, frequently co-accessed data columns are stored physically adjacent to each other so that large blocks can be read efficiently from HDD. Even though the data is ultimately read from HDD, the way it is organized affects how effective the SSD cache layer can be.
- Establish more representative benchmarks
- Deepen our understanding of the role of data ingestion and storage
- Explore storage optimization directions beyond capacity and power consumption
- Study data localization strategies
- Synchronize preprocessing with training
Currently, preprocessing in the benchmark is done offline, which differs from production, where it runs as a continuous pipeline and accounts for more than 50% of power consumption. A deeper understanding of how AI pipelines operate, and of where storage systems fit and what they require, will help drive the whole storage industry forward; understanding the underlying architecture is the foundation for improvement.
References: Rajgopal, S. (2024, September 18). SNIA SDC 2024 – The Role of Flash in Data Ingestion within the AI Pipeline [Video]. YouTube. https://www.youtube.com/watch?v=wgHAhbXuVKc