ISort: SSD Internal Sorting Algorithm for Big Data



Introduction
The development of storage technology and cloud computing has made it possible to process terabytes and petabytes of data [1]. However, the widespread use of data-intensive applications and personal mobile devices generates massive amounts of data, estimated to reach 185 zettabytes in 2025 [2]. Therefore, quickly mining valuable information from massive data has become an urgent problem in this era of rapid data growth.
External sorting is one of the most fundamental algorithms in data management systems. It is used when the data volume is too large for main memory to hold all the data. For example, in the MapReduce framework of Hadoop, external sorting is used extensively to sort the intermediate data and the final data in both the map and reduce operations [3].
External sorting also plays a critical role in database query processing, since finding the desired results often involves a large amount of data [4]. External sorting consists of two phases: run generation and run merge. In the first phase, the input data are divided into blocks that fit in memory and are then loaded into memory for sorting. The sorted data blocks (runs) are then written back to the storage device. In the second phase, multiple runs are merged into a fully sorted chunk of data [5], which requires many I/O operations and results in I/O overhead. Therefore, the I/O time is critical to the elapsed time of external merge sorting.
Traditional external sorting algorithms are mainly designed for hard disk drives (HDDs), which are characterized by slow speed, high power consumption, and poor shock resistance. In contrast, SSDs offer clear advantages: no mechanical arm, fast random access, high read and write bandwidth, shock resistance, low energy consumption, high stability, long service life, and no noise [6]. With the development of flash memory technology and falling prices, SSDs are gradually replacing HDDs in the storage market [7]. However, external sorting is I/O-intensive: its execution involves many read/write operations on the storage device, which affects both the performance of the algorithm and the service life of the SSD. To solve this problem, researchers have tried to move computation into the SSD, an approach called computing and storage fusion.
Many efforts have been made towards external sorting. Reference [8] makes use of the computing resources in the SSD to accelerate deep learning. The blueDBM architecture [9] accelerates data queries through in-SSD computing. Reference [10] offloaded the external sorting work to the SSD. In Reference [11], source data are divided into multiple blocks and sorted separately in memory, and the merge work begins when there is an access request. This approach uses the channel parallelism of the SSD but does not consider the situation in which the data are partially sorted. All the above methods suffer from the same issue: when pages holding large values and pages holding small values are read simultaneously, the large values remain in memory for a long time, reducing memory utilization. To tackle this issue, we build an index table that records the minimum value of each block of each run; blocks are then read sequentially into the input buffer and merged within the SSD. The channel congestion problem caused by the read/write rate is also discussed. Our major contributions can be summarized as follows:
(i) We present a new external sorting algorithm named ISort that implements rapid sorting within the SSD. For partially sorted data, it records the minimum values of sorted blocks and indexes them to determine the order for merging. By avoiding the extended storage of large values in memory, ISort can enhance the internal memory utilization of the SSD and significantly improve external sorting performance.
(ii) We tune the SSD hardware configuration while running the ISort algorithm. By comparing the effects of different read/write channel ratios on external sorting, we find the best ratio of parallel read channels to write channels.
(iii) The experimental results show that ISort has better read and write performance than previous works. For example, ISort improves the read and write performance when the total amount of data increases. ISort also improves the performance when the data size remains the same and the memory size increases.
The rest of this paper is organized as follows. The background and motivation are introduced in Section 2. Section 3 describes the detailed implementation of ISort and different channel strategies. Simulation experiments are presented in Section 4. Section 5 provides an overview of the related work. The conclusion is presented in Section 6.

Background and Motivation
In this section, we first describe the basic external sorting algorithm and the general architecture of a typical SSD, and then we discuss the motivation of this work.

External Sorting.
Traditional external sorting is generally divided into two phases, as illustrated in Figure 1. Source data are initially in the storage device. The first phase divides the data into blocks sized to fit the input buffer. Then, each block is loaded into host memory for sorting, and the sorted data blocks are written back to the storage device. The second phase merges the sorted data blocks generated in the first phase into a sorted output through several iterations [5]. Over these iterations, the merge operation issues many read and write operations to the storage device, resulting in high I/O overhead. Because of the large performance gap between DRAM and storage devices, the I/O time is decisive in the elapsed time of external merge sorting. For data-intensive applications, a critical evaluation factor is whether the data can be processed quickly enough to return results promptly.
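The two phases described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation evaluated in this paper: the function names (`generate_runs`, `read_run`, `merge_runs`, `external_sort`) and the one-key-per-line text files are assumptions made for the example, and the input list stands in for source data that would normally reside on the storage device.

```python
import heapq
import os
import tempfile


def generate_runs(keys, buffer_size, run_dir):
    """Phase 1: sort buffer-sized blocks in memory and write each one out as a run."""
    run_paths = []
    for start in range(0, len(keys), buffer_size):
        block = sorted(keys[start:start + buffer_size])  # in-memory sort of one block
        path = os.path.join(run_dir, f"run_{len(run_paths)}.txt")
        with open(path, "w") as f:
            f.writelines(f"{k}\n" for k in block)        # write the run back to storage
        run_paths.append(path)
    return run_paths


def read_run(path):
    """Stream one sorted run back from storage, one key at a time."""
    with open(path) as f:
        for line in f:
            yield int(line)


def merge_runs(run_paths):
    """Phase 2: k-way merge of the sorted runs into one sorted output."""
    return list(heapq.merge(*(read_run(p) for p in run_paths)))


def external_sort(keys, buffer_size):
    """Run both phases, keeping the intermediate runs in a temporary directory."""
    with tempfile.TemporaryDirectory() as run_dir:
        runs = generate_runs(keys, buffer_size, run_dir)
        return merge_runs(runs)
```

Every key is written once in the first phase and read once more in the second, which is the I/O overhead that the text identifies as decisive for the elapsed time.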

Solid State Drives (SSDs).
With the development of flash memory technology and price reduction, SSDs have gradually become the mainstream type of high-performance storage media. SSDs based on flash memory have been widely studied by industry and academia because they provide random access, high speed, high throughput, and low energy consumption.
Considering the architecture, flash memory can be divided into single-level cell flash memory (SLC) and multilevel cell flash memory (MLC) [12]. MLC allows a single storage unit to hold twice as much data, making it cheaper to manufacture. MLC has a slower writing speed, higher power consumption, shorter life, and higher error rate than SLC. Flash memory can also be divided into NOR flash memory and NAND flash memory according to the type. The random read speed of NOR flash memory is fast, but its erase and programming operations are slow, and its capacity is relatively small. NOR flash memory allows random storage, is suitable for frequent read and write situations, and is usually used to store program code. Compared with NOR flash memory, NAND flash memory has a lower cost, higher density, higher capacity, and faster erase and write speed, making it suitable for data storage. This article only discusses NAND flash memory, which provides three basic operations:
(i) Read/write operations: The basic unit of read/write operations is the page, whereas the basic unit of the erase operation is the block. A write operation on flash memory generally takes 200-700 µs, approximately ten times as long as a read operation.
(ii) Erase operation: The erase operation sets all values on the target block to 1. If a flash page has been written and we want to write it again, we need to erase its block first. This process is called erasing before writing [13]. The delay of the erase operation is about 2-3 ms, which is longer than that of the read/write operations. Therefore, frequent erase operations will affect the overall performance.
Flash memory can only be subjected to a limited number of erasures. If a data block is erased too frequently, it can no longer be used. The SSD controller adds a translation layer named the flash translation layer (FTL) to avoid erasing before every write. Flash memory does not support overwriting. When updating data, the FTL writes the new data to other free pages, and the original data are marked invalid. The FTL has three main functions: address mapping, load balancing, and garbage collection [14]. Address mapping can be divided into page mapping, block mapping, and hybrid mapping [15]. The mapping table for block mapping is small, which reduces memory overhead and offers fast responses to read requests. Load balancing can improve the performance of the SSD and prolong its service life. Garbage collection [16] periodically reclaims space occupied by invalid data and erases appropriate blocks to recycle free pages.
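The out-of-place update and garbage collection behavior described above can be illustrated with a toy page-mapping FTL. This is only a sketch under simplifying assumptions: the class name `TinyFTL` and all of its fields are hypothetical, the victim block is chosen greedily (fewest valid pages), garbage collection runs only when no free page is left, and load balancing is not modeled at all.

```python
class TinyFTL:
    """Toy page-mapping FTL: out-of-place updates plus greedy garbage collection."""

    def __init__(self, num_blocks, pages_per_block):
        self.pages_per_block = pages_per_block
        self.flash = [[None] * pages_per_block for _ in range(num_blocks)]
        self.valid = [[False] * pages_per_block for _ in range(num_blocks)]
        self.mapping = {}                       # logical page -> (block, page)
        self.erase_count = [0] * num_blocks
        self.free = [(b, p) for b in range(num_blocks) for p in range(pages_per_block)]

    def read(self, lpn):
        b, p = self.mapping[lpn]
        return self.flash[b][p]

    def write(self, lpn, data):
        old = self.mapping.pop(lpn, None)
        if old is not None:                     # no overwrite in place:
            self.valid[old[0]][old[1]] = False  # the old copy is only marked invalid
        if not self.free:
            self._garbage_collect()
        self._program(lpn, data)

    def _program(self, lpn, data):
        b, p = self.free.pop(0)                 # take the next free physical page
        self.flash[b][p] = data
        self.valid[b][p] = True
        self.mapping[lpn] = (b, p)

    def _garbage_collect(self):
        # Victim: the block with the fewest valid pages (cheapest to relocate).
        victim = min(range(len(self.flash)), key=lambda b: sum(self.valid[b]))
        live = [(lpn, self.flash[b][p])
                for lpn, (b, p) in self.mapping.items() if b == victim]
        if len(live) == self.pages_per_block:
            raise RuntimeError("no invalid pages to reclaim")
        # Erase works on a whole block and makes all of its pages writable again.
        self.flash[victim] = [None] * self.pages_per_block
        self.valid[victim] = [False] * self.pages_per_block
        self.erase_count[victim] += 1
        self.free = [(victim, p) for p in range(self.pages_per_block)]
        for lpn, data in live:                  # copy the surviving pages back
            self._program(lpn, data)
```

With two blocks of two pages each, writing logical pages 0 and 1 twice fills the device; a third update of page 0 then triggers one erase of the block that holds only invalid copies.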
Figure 2 illustrates the general architecture of a NAND SSD, which is composed of a master controller chip, a set of DRAM, multiple interfaces, and an array of NAND flash memory chips connected to flash controllers by multiple channels.

Motivation.
In practical applications, source data usually have data locality. Most recent algorithms focus on reducing the amount of data transferred between memory and external storage devices. Reference [10] takes advantage of the SSD's internal computing power but does not consider its limited internal memory resources. Completely ignoring the characteristics of the data leads to a large amount of data being stuck in memory for a long time. Therefore, we aim to make full use of both the internal resources of the SSD and the characteristics of the data itself to accelerate the external sorting algorithm.

ISort Design
We present ISort, an external sorting algorithm that performs data merging by exploiting the internal hardware infrastructure of SSDs. We introduce its architecture and elaborate on its design techniques.

Overall Architecture.
We propose a new external sorting mechanism called ISort that performs data merging by exploiting the internal hardware infrastructure of SSDs. The traditional external sorting algorithm cannot be moved directly into the SSD because the SSD uses the FTL to process host-side data requests [17], as shown in Figure 2. Therefore, we need to change the standard software architecture layer of the SSD. However, the direct use of the merge sort algorithm would consume a large share of the memory resources in the SSD, which would significantly impact SSD performance. To solve this problem, we build a page min index that records the minimum values of all pages. The minimum values determine the order in which pages enter the input buffer. The whole process of ISort is shown in Figure 3, where the gray blocks represent unsorted data, the yellow and blue blocks represent internally ordered data, and the green blocks represent fully sorted data. We divide ISort into two phases. The first is the run generation phase, which differs from traditional external sorting in the way the data are written back to the storage device. The second is the run merge phase, which performs its operations inside the SSD.
Table 1 defines the notation for ISort. The size of the key in the record is $Q$. Keys are allocated in $B$ blocks, each represented as $b_i$ with $0 \le i \le B - 1$. We use $M$ to denote the host memory size assigned to perform the sort. The record is divided into $R = T/M$ parts, where $R$ indicates the number of runs. $P$ indicates the number of pages in a block; $C$ represents the number of channels; $D$ represents the fully sorted dataset; $CK_1$ represents the $C$ smallest pages; and $CK_2$ represents the next $C$ smallest pages.

Run Generation Phase.
Algorithm 1 aims to generate intermediate sorted files called runs. We discard the value in the record because we only need to sort the key. We split $Q$ into $S_i$, each of which is the size of an input buffer (see lines 2 and 3 in Algorithm 1). To accelerate writes to storage, ISort activates multiple flash channels simultaneously, making full use of the parallelism of the SSD. However, when the key values in a run are skewed, the channel will be blocked, resulting in slower read operations. We therefore slice each run into pages and write them in an interleaved fashion across the channels. During the write-back process, we record the minimum value of each page of each run in a table kept in the SSD memory, building an index called the page min index.
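The interleaved write-back and the construction of the page min index can be sketched as follows. This is an illustrative model rather than Algorithm 1 itself: the function name `generate_indexed_runs`, the round-robin channel assignment, and the `(minimum key, channel, slot)` index entries are assumptions made for the example, and each channel is modeled as a plain Python list of pages.

```python
def generate_indexed_runs(keys, buffer_size, page_size, num_channels):
    """Sort buffer-sized runs, stripe their pages over channels, build a page min index."""
    channels = [[] for _ in range(num_channels)]   # each channel holds a list of pages
    page_min_index = []                            # entries: (min key, channel, slot)
    next_channel = 0
    for start in range(0, len(keys), buffer_size):
        run = sorted(keys[start:start + buffer_size])      # one run fits the input buffer
        for off in range(0, len(run), page_size):
            page = run[off:off + page_size]
            channels[next_channel].append(page)            # interleaved write
            slot = len(channels[next_channel]) - 1
            page_min_index.append((page[0], next_channel, slot))  # page[0] is the minimum
            next_channel = (next_channel + 1) % num_channels
    page_min_index.sort()                          # ascending order of page minimum
    return channels, page_min_index
```

Because consecutive pages of a run land on different channels, pages that are adjacent in key order can later be fetched in parallel instead of queuing on a single channel.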

Run Merge Phase.
Algorithm 2 describes the run merge phase of ISort. Following the min index order recorded by Algorithm 1, $CK_1$ is read into the input buffer in the SSD's internal memory. Unlike traditional sorting methods, we do not look for a minimum page in every run. It is also possible that the pages read in parallel at the same time are from the same run. In ISort, the order in which pages are loaded into the SSD's internal memory follows only the page min index. Because the data are partially ordered, some pages belong to high-value runs whose keys will not be needed for a long time. Such data are not output soon after they are read into memory; instead, they are output only when a page with larger values is encountered. Next, we read $CK_2$ into the input buffer in order, to act as a buffer for $CK_1$. By doing so, the data transmission capacity can be better matched with the computing power. When a page is consumed, we can supplement the data without affecting the sorting of $CK_1$. Once there are $C$ pages in memory, the merging process starts and runs in step with the buffer data transfer process. In each iteration, the minimum key in the input buffer is copied to the output buffer. We use qsort in memory, whose average computational complexity is O(n log n).
We flush the output buffer to a flash chip when the output buffer becomes full. The pages of the same run are interleaved across different channels. Therefore, a channel will not be left without a page, except in the final phase. However, parallel read/write operations may cause channel congestion, which is discussed in the experiment section. Figure 4 illustrates an example. For the convenience of demonstration, we draw six channels and six input buffers to illustrate the merge phase of the ISort algorithm in more detail. Suppose there are six runs interleaved across six channels. We represent them in different colors. $CK_1$ in the flash chips is transferred to the input buffer in parallel, as shown in process 1. $CK_2$ is then transferred to the input buffer in parallel, as shown in process 2. When a page in $CK_1$ is exhausted, the $CK_2$ page on the same channel is immediately converted to $CK_1$. At the same time, the reading of the next page is triggered. $CK_2$ is continuously converted into $CK_1$ as it is consumed. In the best case, the pages of $CK_1$ are distributed on different channels, and we can read the channels concurrently, as shown in process 1. In the worst case, the pages of $CK_1$ are all on the same channel, and we can only read them serially, as in the traditional method. Because our data are partially ordered and the data of each run are interleaved across different channels, the probability of the worst case is negligible.
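The best and worst cases described above can be quantified with a very small model. The sketch below assumes that each channel can deliver one page per read round, which is a simplification introduced here for illustration; the function name `read_rounds` is hypothetical.

```python
from collections import Counter


def read_rounds(page_channels):
    """Rounds needed to fetch a batch when each channel serves one page per round."""
    if not page_channels:
        return 0
    # The busiest channel determines how long the whole batch takes.
    return max(Counter(page_channels).values())
```

Under this model, six pages spread over six channels need one round, while six pages on a single channel need six rounds, matching the concurrent and serial cases in the text.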
Figure 5 shows six sorted runs. Each run consists of three pages, and each page contains three keys. Let us assume that the input buffer can hold 12 pages, as shown in Figure 4. The middle of Figure 5 shows the traditional method of reading the minimum page of each run into the input buffer. When a page with large values and a page with small values appear in the input buffer simultaneously, the large values are retained in memory for a long time, thereby reducing memory utilization. The lower part of Figure 5 shows our method. Based on the page min index, ISort reads pages sequentially to avoid the occurrence of such large-value pages in the input buffer and to improve the space utilization of the input buffer.
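The idea of reading pages in page min index order can be sketched as below. This is a simplified model, not Algorithm 2: it reads one page at a time instead of $C$ pages in parallel, it does not model $CK_2$, and it uses a heap rather than qsort. The function name `merge_by_page_min_index` and the returned peak buffer occupancy are assumptions made for illustration. The sketch relies on one observation: every unread page has a minimum no smaller than that of the next page in the index, so any buffered key not exceeding that bound can be output safely.

```python
import heapq


def merge_by_page_min_index(pages):
    """Merge sorted, non-empty pages by reading them in ascending order of minimum key.

    Returns the sorted keys and the peak number of keys buffered at once.
    """
    order = sorted(range(len(pages)), key=lambda i: pages[i][0])  # the page min index
    buffer, output, peak = [], [], 0
    for pos, i in enumerate(order):
        for key in pages[i]:                    # load the next page into the buffer
            heapq.heappush(buffer, key)
        peak = max(peak, len(buffer))
        # Lower bound on every key that has not been read yet.
        if pos + 1 < len(order):
            bound = pages[order[pos + 1]][0]
        else:
            bound = float("inf")                # last page: drain everything
        while buffer and buffer[0] <= bound:
            output.append(heapq.heappop(buffer))
    return output, peak
```

In this model, large keys enter the buffer only when the page that contains them becomes the next one in index order, which is the effect the lower part of Figure 5 is meant to show.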

Experimental Results
Evaluation Design.
This section describes the experimental platform setup and the methodology used to evaluate ISort.
In the following experiments, we use SSDsim [18], an open-source solid state drive simulation system that follows the ONFI protocol and offers high accuracy and modularity. The hardware configuration parameters of the SSDsim simulator used in this paper are shown in Table 2.
We take ActiveSort as the baseline, which incurs an additional write-back operation compared with ISort. The comparison is conducted from the perspective of dataset size and memory. We also evaluate the impact of SSD memory and I/O trace on performance. In addition, we use different channel ratios to test the performance of specialized channels and hybrid channels.

Experimental Results.
Since external sorting is an I/O-intensive algorithm, read and write requests are initiated frequently and alternately in the merging phase, as shown in Figure 6. When more channels are used for writing, both the read time (RT) and write time (WT) of ActiveSort increase evidently, while those of ISort decrease, indicating the superior performance of ISort.
If read and write requests are processed on separate channels, increasing the number of read channels leads to congestion in write request processing. Similarly, reducing the number of read channels leads to congestion in read request processing. To avoid idle channels and improve the resource utilization of the channels, we can let all channels serve both read and write requests during the merge phase.

Mobile Information Systems
As shown in Figure 7, DRAM within SSD can cache read and write requests.With the increase of DRAM capacity, the hit rate of read and write requests can be improved.
Figure 8 shows the results of different data sets. We can find that ISort has a relatively more stable performance improvement than ActiveSort.
Figure 9 shows the results of different page sizes. When the page size is 4 kB or 8 kB, ISort has better performance. However, the performance will degrade as the page size increases or decreases. When the page size is relatively large, it will cause channel congestion and increase the request processing time.

(1) Input:
(7) end for
(8) $W \leftarrow P/C$
(9) for $i$ from 0 to $R - 1$ do
(10)   for $k$ from 0 to $W - 1$ do
(11)     Open Channels
(12)     write from $p^j_i$ to $p^j_{i+C}$
(13)     InsertIndexToMinIndex(minimum($p^j_i$), page.id)
(14)     SortMinIndex()
(15)   end for
(16) end for
ALGORITHM 1: Pseudo-code for the run generation phase.

Related Work
When the source data are too large to be loaded into limited memory and sorted at one time, external sorting is necessary. External sorting can be divided into HDD-based external sorting, embedded flash memory-based external sorting, SSD-based external sorting, and NVM-based external sorting.
External sorting based on HDDs generally reduces the seek time and rotation delay by optimizing the algorithm and reducing random access to the external memory device. Reference [19] proposes staggered placement and a new reading strategy, speeding up the execution of the HDD-based external sorting algorithm and improving performance. Reference [20] proposed an external sorting algorithm based on HDDs. This external sorting algorithm does not require additional disk space and does not generate intermediate data. The main idea is to use quick sorting and a particular merging strategy to reduce the number of comparisons in the sorting process and thereby improve execution performance.
Compared with HDDs, flash-based SSDs have no disk head or mechanical arm, so there is no seek time or rotation delay [21]. Reference [22] designed an FTL (FTL-SS) based on a single channel and single way and extended the FTL to the case of multiple channels and multiple ways, thus verifying the versatility and effectiveness of this method. References [23][24][25][26] make full use of the internal parallelism of the SSD through two phases, request rescheduling and dynamic write request mapping, to improve the performance of the SSD. Reference [27] proposes a channel striping technology to improve the resource utilization of the channels. FMsort makes full use of the SSD's low access latency and high random I/O bandwidth to speed up the execution of external merge sorting [28]. Montres [29] takes advantage of the performance of SSDs to speed up external sorting. ActiveSort implements the merging operation of external sorting inside the SSD by using Active SSD [10]. Active SSD is a special kind of SSD [30]. Kang et al. proposed a multichannel storage system based on NAND flash memory. The storage system has a plurality of independent channels, with each channel having a plurality of NAND flash memory chips [31].
With the development of storage technology, new storage technologies such as PCM, STT-RAM, and ReRAM have been widely used. PCM [32,33] is a new nonvolatile storage medium that is byte-addressable and offers high density and high persistence. NVM is a byte-addressable, nonvolatile storage device with random access, high density, low energy consumption, and high access speed [34]. However, NVM also has some limitations. The service life of NVM is limited, and its read and write performance is asymmetric [35][36][37]. Ahmed Khernache et al. proposed MONTRES-NVM, which is an external sorting algorithm based on a PCM and DRAM hybrid storage system [38].

Conclusion
The amount of data has increased exponentially in recent years, and the demand for data processing speed has grown with it. The emergence of Active SSD provides a new possibility for processing data near where it is stored. Traditional sorting algorithms need to be adjusted to better adapt to changes in memory size. This paper analyzes the latest algorithms and concludes that data with large values, once read into memory, remain there for a long time, affecting memory utilization. The main idea of ISort is to use the computing resources within the SSD to handle the merging phase. We read the data in the order of each page's minimum value to address the limited internal memory of the SSD. To further improve the speed, we adopt an interleaving strategy in the write-back part of the run generation phase. Because the many I/O operations produce varying degrees of read and write congestion, we carried out tests with different read/write channel ratios. We evaluated the performance under different read/write channel ratios, data sizes, page sizes, and SSD memory sizes. Compared with ActiveSort + write, ISort reduces execution time by more than 36%. As a direction for future work, it is worthwhile to study the influence of different storage devices on various algorithms, so that data access can benefit from the characteristics of other storage devices and the time overhead can be further reduced. This paper only discusses channel-level parallelism; in future in-depth research, we can continue to explore deeper levels of parallelism. We will also continue to study how data placement leads to an increased garbage collection load. Finally, for writing the ordered data back to the SSD during the merge phase of external sorting, we can further explore where the output should be written, comparing whether opening up new space or allocating piecemeal space is better.

(1) Input: Partially sorted runs $r_0, \ldots, r_{R-1}$
(2) Output: Sorted data
(3) Read $CK_1$ and $CK_2$
(4) while not all pages have been processed do
(5)   if there are $C$ pages in the memory then
(6)     Sort($CK_1$)
(7)     Output minimum key into buffer
(8)     if the output buffer is full then
(9)       Flush the output buffer to flash chip
(10)    end if
(11)  end if
(12) end while
ALGORITHM 2: Pseudo-code for the run merge phase.

Figure 4: The status of the input buffer and the layout of the flash chips.

Table 1: Algorithm parameters.