SW-LZMA: Parallel Implementation of LZMA Based on SW26010 Many-Core Processor

With the development of high-performance computing and big data applications, the scale of data transmitted, stored, and processed by high-performance computing cluster systems is increasing explosively. Compressing large-scale data efficiently, and thereby reducing the space required for data storage and transmission, is one of the keys to improving the performance of high-performance computing cluster systems. In this paper, we present SW-LZMA, a parallel design and optimization of LZMA based on the Sunway 26010 heterogeneous many-core processor. Taking the characteristics of the SW26010 processor into account, we analyse the storage space requirements, memory access characteristics, and hotspot functions of the LZMA algorithm and implement thread-level parallelism of the LZMA algorithm based on the Athread interface. Furthermore, we lay out the LDM address space at fine granularity to implement a cyclic sliding window algorithm with DMA double buffering, which optimizes the performance of SW-LZMA. The experimental results show that, compared with the serial baseline implementation of LZMA, the parallel LZMA algorithm obtains a maximum speedup of 4.1 times on the Silesia corpus benchmark, while on the large-scale data set the speedup is 5.3 times.


Introduction
As high-performance computers become more powerful, their scale keeps expanding. A large-scale high-performance computing cluster system must maintain long-term, stable, and uninterrupted operation. The amount of data transmitted, stored, and processed is increasing, and the amount of system log data is also increasing explosively. The scale of social computing data and scientific computing data can be expected to keep growing as informatization improves, which brings new challenges to big data processing. Effective compression is necessary to reduce the space required for data storage, make maximum use of the limited communication bandwidth, and let the high-performance computing cluster system operate at full efficiency. As the amount of data increases, blockchain applications also need a lot of storage space. A fast big data compression algorithm can improve the efficiency of blockchain applications [1].
In practical applications of the Internet of Things (IoT), there are obvious shortcomings, such as the limited energy and bandwidth of the sensor nodes, which pose huge challenges to network data transmission by IoT devices. Compression is currently an important technology for reducing the amount of transmitted data. It removes redundancy, reduces the data storage space of the IoT, and improves the speed and success rate of IoT data transmission. From the point of view of the server, the rapid development of information technology, especially the IoT, has brought explosive growth in the amount of data on the server because of the demand for big data processing. This also requires efficient compression algorithms to reduce the amount of data stored and processed by algorithms such as distributed big data processing and machine learning [2,3].
Lossless compression algorithms have a wide variety of open-source implementations. In the Sunway TaihuLight supercomputer, the existing data compression algorithms include zlib Deflate, XZ, and LZ4. None of these compression algorithms is optimized in parallel: only a single processor core is used for compression and decompression, so the processing performance has little room for improvement. Compression algorithms also face problems such as the conflict between compression rate and storage space and poor data locality. To achieve an effective performance improvement, deep algorithm reconstruction and optimization must be carried out for the specific high-performance processor.
Many studies have used multicore processor architectures to parallelize compression algorithms. The parallelization of the BWT (Burrows-Wheeler transform) compression algorithm appeared earlier. Pankratius et al. [4] first proposed a parallel implementation of BWT, which obtained a linear speedup ratio and was applied to the Bzip software. Pigz is a parallel version of the Gzip compression algorithm, which was proposed by Gristwood et al. [5] and has been widely used, but the compression rate of this parallel algorithm is low. Patel et al. [6] used a GPU to parallelize the binary tree search process of the BWT lossless compression algorithm, and the acceleration effect was significant. Wu et al. [7] studied compression based on CUDA (compute unified device architecture) and used a block parallel strategy to optimize the LZ77 compression algorithm on the GPU. Pankratius et al. [8] used MPI (message passing interface) programming to realize the distributed MPIBZIP compression algorithm, which is suitable for distributed memory computing. Wright [9] used the MPI and pthreads programming interfaces to implement the bzip2 parallel algorithm on the distributed memory structure and the shared memory structure, respectively. Although BWT-based compression algorithms are easy to parallelize, they are not as good as LZMA (Lempel-Ziv-Markov chain algorithm) in terms of compression rate. In the multithreaded parallelization of open-source compression software such as XZ and 7zip, only the character matching core function of the LZMA algorithm is parallelized. The acceleration effect is not ideal and is limited by the number of processors [10,11]. Leavline and Singh used an FPGA to accelerate the LZMA algorithm [12,13], which can obtain a higher speedup ratio, but the application cost is higher and the approach is not generally applicable.
In the Sunway TaihuLight supercomputer system [14], the basic unit is a computing node composed of a SW26010 many-core processor, 32 GB of memory, and other control units. The processor architecture is shown in Figure 1. Four core groups (CGs) constitute a SW26010 processor, and each CG contains 64 computing processing elements (CPEs) plus one management processing element (MPE), for a total of 260 computing units in the SW26010. The CPE adopts a lightweight core design: its instruction set is very streamlined, it does not support operations such as interrupts, and it runs only in user mode. Each CPE contains a 16 KB L1 instruction cache and 64 KB of LDM (local directive memory, on-chip local data space) and supports 256-bit SIMD operations. The CPE can share memory with the MPE and use DMA (direct memory access) to exchange data between memory and LDM. In the CPE cluster, the CPEs in the same row or column can exchange data through register communication; the maximum amount of data transmitted each time is 256 bits, and the delay is low. Figure 2 is a memory hierarchy diagram of the CPE. The slave core can read data from memory in two ways: direct register access and register LDM access. Since there is no shared cache between the CPEs and the MPE, the delay of direct register access reaches nearly a hundred clock cycles. One way to solve this problem is to copy data to LDM through DMA and access it there, which improves memory access speed. This increases the difficulty of parallel program design and requires programmers to set up DMA scheduling strategies carefully, so as to overlap computing and communication as much as possible and improve parallel efficiency.
The parallel program on the SW26010 processor adopts the master-slave parallel programming model. The master thread runs on the MPE, and the slave threads run on the CPEs. The master thread mainly completes data input, memory copy, result output, and other operations, and the slave threads mainly perform computing tasks. To support the master-slave parallel programming model, the Sunway TaihuLight supercomputer system provides the Athread accelerated thread library, which is divided into two parts: the MPE accelerated thread library and the CPE accelerated thread library.
The main purpose of this paper is to design an LZMA parallel algorithm for the Sunway TaihuLight supercomputer system and to reconstruct and optimize the algorithm according to the characteristics of the Sunway 26010 many-core processor. The resulting SW-LZMA obtains a maximum speedup of 4.1 times on the Silesia corpus benchmark, while on the large-scale data set the speedup is 5.3 times.

Analysis of LZMA Algorithm Based on SW26010 Processor
In this section, we mainly analyse the characteristics of the LZMA algorithm that affect the performance of the algorithm such as space requirements, memory access methods, data locality, and hotspot functions. Combined with the analysis of the key technologies of SW26010 Processor, the algorithm can be reconstructed and optimized in a targeted manner.
2.1. LZMA Workflow. The LZMA compression algorithm was proposed by Pavlov in 1998 [15], and its core is an improvement of the LZ77 compression algorithm. LZMA combines a sliding window-based dynamic dictionary compression algorithm with an interval coding algorithm, which gives it the advantages of a high compression rate, a small decompression space requirement, and high speed. Figure 3 shows the LZMA workflow, a two-stage compression consisting of the sliding window algorithm based on LZ77 [16] and interval encoding (range encoding) [17,18]. LZMA supports a dictionary space of 4 KB to hundreds of MB, which increases the compression rate but also makes its search cache space very large. To reduce the time required to find the longest matching string and to search quickly for matching characters, implementations of the LZMA algorithm store multiple possible longest matches in a hash table and use a hash linked list or a binary search tree to search the data. As shown in Figure 4, in the hash function, the hash value of the first two bytes of the search cache is used as the index into a hash array, and the hash array stores the starting position of the corresponding matching character group. The size of the hash array is a power of 2 that is half the size of the dictionary. The LZMA encoder sets up different levels of hash functions for 2, 3, and 4 adjacent bytes to achieve efficient positioning for different dictionary sizes.
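The hash-based match search described above can be illustrated with a minimal Python sketch. It is not the LZMA encoder: it keys a dictionary on the first two bytes instead of using a fixed-size hash array, keeps every candidate position in a list, and emits plain (distance, length) tokens without range coding. The function names and the 4 KB default window are assumptions made for the illustration.

```python
from collections import defaultdict

MIN_MATCH = 2      # the text hashes the first two bytes of the look-ahead data
MAX_MATCH = 273    # maximum match length in LZMA

def find_longest_match(data, pos, table, window):
    """Return (distance, length) of the longest earlier match, or (0, 0)."""
    if pos + MIN_MATCH > len(data):
        return 0, 0
    key = data[pos:pos + MIN_MATCH]
    limit = min(MAX_MATCH, len(data) - pos)
    best_dist, best_len = 0, 0
    for cand in reversed(table[key]):       # most recent candidates first
        if pos - cand > window:
            break                            # older candidates lie outside the window
        length = 0
        while length < limit and data[cand + length] == data[pos + length]:
            length += 1
        if length > best_len:
            best_dist, best_len = pos - cand, length
    return best_dist, best_len

def lz77_tokens(data, window=1 << 12):
    """Encode bytes as literals (ints) and (distance, length) pairs."""
    table = defaultdict(list)                # 2-byte prefix -> earlier positions
    pos, out = 0, []
    while pos < len(data):
        dist, length = find_longest_match(data, pos, table, window)
        matched = length >= MIN_MATCH
        out.append((dist, length) if matched else data[pos])
        step = length if matched else 1
        for p in range(pos, pos + step):     # hash table update
            if p + MIN_MATCH <= len(data):
                table[data[p:p + MIN_MATCH]].append(p)
        pos += step
    return out

def decode(tokens):
    """Inverse of lz77_tokens; copies byte by byte so overlapping matches work."""
    buf = bytearray()
    for t in tokens:
        if isinstance(t, tuple):
            dist, length = t
            for _ in range(length):
                buf.append(buf[-dist])
        else:
            buf.append(t)
    return bytes(buf)
```

For example, `lz77_tokens(b"abcabcabcabc")` yields three literals followed by a single match of distance 3 and length 9, and `decode` restores the input. The lookup by prefix is the random-access part of the search, while the byte-by-byte comparison is the sequential part.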

2.2. Memory Space Demand.
In the SW26010 processor, each CPE is equipped with a 64 KB LDM. To ensure that the CPE obtains high acceleration performance, the data being processed must be copied to the CPE's LDM space for memory access, which requires precise control over how the CPE's LDM space is used. Table 1 lists the local variables of the hotspot functions of the LZMA algorithm. It mainly covers the large local arrays; local scalars occupy little space and are negligible.
In the string-matching function based on the hash table, the large dictionary space means that the hash table hash_buf reserves a large hash space. This far exceeds the 64 KB LDM space of a CPE, so the use of local space must be reduced. The range of the dictionary search can be narrowed as far as the acceptable compression loss allows, thereby reducing the size of the hash space of the hash table lookup function.
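The space pressure can be made concrete with a small calculation. The sketch below assumes that each hash slot holds a 32-bit position (4 bytes) and reads "a power of 2 that is half the size of the dictionary" as the largest power of two not exceeding half the dictionary size in bytes; both are assumptions for illustration, as are the function names and the amount of LDM reserved for other buffers.

```python
LDM_BYTES = 64 * 1024      # LDM capacity per CPE, from the text
ENTRY_BYTES = 4            # assumption: each hash slot stores a 32-bit position

def hash_entries(dict_bytes):
    """Largest power of two not exceeding half the dictionary size."""
    half = dict_bytes // 2
    return 1 << (half.bit_length() - 1)

def hash_table_bytes(dict_bytes):
    return hash_entries(dict_bytes) * ENTRY_BYTES

def largest_dict_that_fits(reserved_bytes):
    """Largest power-of-two dictionary whose hash table fits in the free LDM."""
    budget = LDM_BYTES - reserved_bytes
    dict_bytes = 4 * 1024  # smallest dictionary size mentioned in the text
    best = None
    while hash_table_bytes(dict_bytes) <= budget:
        best = dict_bytes
        dict_bytes *= 2
    return best

# A 1 MB dictionary would need a 2 MB table under these assumptions,
# far more than the 64 KB of LDM available to one CPE.
print(hash_table_bytes(1 << 20), LDM_BYTES)
```

Under these assumptions, reserving 40 KB of LDM for data buffers leaves room for the hash table of a dictionary of at most 8 KB, which shows why the search range has to shrink once the table must live in LDM.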

2.3. Memory Access Characteristics.
In the LZMA algorithm, a hash linked list is used to find matching characters quickly. Because the search cache is relatively large, the hash look-up table is large as well, and memory is accessed randomly over a range of 100 KB to 10 MB. At the same time, the LZ77 algorithm is a streaming compression algorithm based on a sliding window: uncoded data is input continuously, and coded data is discarded once the upper limit of the search buffer space is reached, so data locality is poor.
Since neither the length and position of a repeated character string in the uncoded data nor the distance of the matching character string can be predicted, it is difficult to prefetch data in the LZMA algorithm. During the compression process, the size of the dictionary gradually increases as the number of matching strings increases. Compressing the current data block depends on the dictionary obtained from the previous compression process. The LZMA algorithm is therefore characterized by random memory access and data dependence and is a memory access-intensive algorithm. The key to optimizing its performance is to reconstruct the data structures and memory accesses of the algorithm around the storage structure of the SW26010 processor, so as to reduce memory access overhead and maximize the acceleration performance of the CPEs.

2.4. Hotspot Functions.
Most of the running time of the LZMA compression algorithm is spent in the LZ77 string matching core function. The pattern matching process of the core function is shown in Algorithm 1. Its time-consuming operations are mainly hash table lookup, character matching, and hash table update.
In the hotspot function, the memory accesses of the character matching process are largely sequential: starting from the current byte position, the input data is compared byte by byte against the search cache to find the longest identical string. The positions where the longest match may appear are stored in the hash table, and the accesses to this look-up table are random to a certain degree. In view of the characteristics of the hotspot functions, there are mainly two optimization ideas. The first is to finely 10

Design and Implementation of SW-LZMA
3.1. Parallel Design of SW-LZMA. First, we designed the SW-LZMA multithreaded parallel algorithm on the SW26010 processor. The data to be compressed is evenly distributed to 64 CPE cores. Each CPE directly accesses the main memory to read the data to be compressed, adds header information to the compressed data, and outputs the result directly to the main memory; finally, the MPE writes the data blocks into the file in order. We adopted master-slave asynchronous parallelism and handed over the core computing tasks of the LZMA algorithm, namely the LZ77 compression algorithm and interval coding, to the CPE cluster. The MPE is only responsible for data partitioning and I/O operations. The steps of the thread parallel algorithm are as follows.
Step 1. Data segmentation. According to the number of CPEs, the data to be compressed are divided into several subblocks. We divide them according to the integer multiple of the memory page size. Since the amount of calculation in the compression algorithm is approximately proportional to the amount of input data, the parallel task load balance can be achieved only if the size of the divided data block is equal.
Step 2. Two-stage compression. Each data block is compressed independently by a CPE in two stages. In the LZ77 stage, the compression dictionary is initialized first. As the sliding window advances, the data to be compressed continues to be input, and the dictionary size increases accordingly. Subsequently, the output of the LZ77 stage is passed as input to interval coding, which compresses it further.
Step 3. Data consolidation. After the CPEs complete the compression, the MPE is responsible for merging the compressed data. First, the MPE writes a 5-byte header that contains compression parameter information such as the dictionary size and the maximum matching length. Then, the compressed blocks are output in order, each preceded by a 4-byte header field block_size that holds the size of the compressed data block.
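The three steps can be sketched on a single machine as follows. This is an illustration under stated assumptions, not the SW-LZMA code: Python's standard `lzma.compress` stands in for the work of one CPE, the 5-byte parameter header is treated as opaque bytes because its field layout is not given here, the 4-byte block_size field is assumed to be little-endian, and a 4096-byte page size is assumed. The function names are hypothetical.

```python
import lzma
import struct

PAGE = 4096  # assumption: memory page size used as the splitting granularity

def split_blocks(data, n_workers, page=PAGE):
    """Step 1: equal-sized blocks whose size is a whole number of pages."""
    per_worker = -(-len(data) // n_workers)            # ceiling division
    block = max(page, -(-per_worker // page) * page)   # round up to a page multiple
    return [data[i:i + block] for i in range(0, len(data), block)]

def pack(blocks, params):
    """Steps 2 and 3: compress each block, then write header and sized blocks."""
    assert len(params) == 5                            # 5-byte parameter header
    out = bytearray(params)
    for blk in blocks:
        comp = lzma.compress(blk)                      # stand-in for one CPE's work
        out += struct.pack("<I", len(comp)) + comp     # 4-byte block_size, then data
    return bytes(out)

def unpack(stream):
    """Read the header, then decompress the blocks in their stored order."""
    params, pos, parts = stream[:5], 5, []
    while pos < len(stream):
        (size,) = struct.unpack_from("<I", stream, pos)
        pos += 4
        parts.append(lzma.decompress(stream[pos:pos + size]))
        pos += size
    return params, b"".join(parts)
```

Because every block carries its own size and is compressed independently, the blocks can be produced in any order by the workers and still be written and decoded sequentially. The price is that matches never cross block boundaries.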

3.2. Implementation of DMA Double-Buffers.
Through the parallel method in Section 3.1, the core computing part of the serial LZMA compression algorithm can be transplanted to the CPE cluster. However, when the CPE directly accesses the main memory, the memory access overhead seriously reduces the performance of the parallel algorithm, and the acceleration effect is not enough to compensate for the performance loss caused by the memory access delay. In addition, in the serial version of the LZMA compression algorithm, the data to be compressed is stored in a dynamically allocated memory space, and the current compression sliding window is determined by an address pointer. Because the compressed data is large and the space of the CPE's LDM is limited, a data block is far larger than the 64 KB maximum capacity of LDM even after the data is divided according to thread tasks, so the data block to be compressed cannot be loaded all at once. Therefore, the algorithm needs to be reconstructed and optimized to reduce its use of LDM space. To improve data locality, we use a double buffer technique based on nonblocking DMA, designed around the characteristics of the LZ77 sliding window algorithm and the available LDM space. As shown in Figure 5, the CPE does not directly access the main memory to read the data to be compressed. Instead, the data in the current compression window and the data before and after it are transferred to the LDM buffer as a compression unit through DMA to achieve fast memory access. At the same time, a DMA transfer for the next compression unit is initiated to prefetch its data. After the task of the current compression unit is completed, the compression calculation can continue directly on the next unit, so that calculation and memory access overlap, which further reduces memory access overhead. The output data is also buffered and copied to the memory through DMA. Algorithm 2 is an example of the multithreaded parallel implementation of the LZMA algorithm using the Athread interface.

Algorithm 1: Pattern matching process of the LZ77 core function.
Require:
    hash_table: hash table for fast search entry
    cur_pos: pointer to the first byte of the uncompressed data
Process:
    ...
    while (there is still uncompressed data at cur_pos) {
        calculate the hash_value of the first batch
        if (the hash_value can be found in hash_table) {
            update the value in the hash_table
            ...

Wireless Communications and Mobile Computing

3.3. LDM Space Layout Optimization.
In the serial version of the LZMA algorithm, a pointer is used to directly point to the memory space of the data to be compressed, and a sliding window-based dictionary compression algorithm is implemented in the form of displacement. In the SW-LZMA algorithm, the compressed data needs to be copied to the LDM buffer area for memory access. In order to achieve DMA double buffering and make full use of the LDM space, we use manual methods for fine-grained management and allocation of the LDM address space and reconstruct the sliding window algorithm. We set up continuous double buffer space, and the pointer buffer_base points to the starting address of the address space, that is, the starting position of the first buffer. The pointer buffer_middle points to the middle position of the buffer space, that is, the starting position of the second buffer. The pointers pos_start and pos_end point to the start and end positions of the current sliding window, respectively.
At the beginning of the algorithm, as shown in Figure 6, the CPE initiates a blocking DMA request to read a data block into buffer 1, then calls the sliding window compression function, and initiates a nonblocking DMA request to read the next data block into buffer 2. When the sliding window pointer pos_end reaches the buffer_middle position, the CPE checks that the nonblocking DMA request has completed before compression continues. Later, when the sliding window pointer pos_start reaches the buffer_middle position, a nonblocking DMA request is initiated to read the next data block into buffer 1. When the pointers pos_start and pos_end reach the end of the buffer space, they wrap around to its start and compression continues until all the data is compressed.
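The pointer movement can be followed in a small simulation. The sketch below is an assumption-laden model, not the Athread implementation: the DMA request is replaced by a synchronous copy, so no real overlap takes place, and the only point illustrated is the address arithmetic, namely that the window always finds its bytes inside a ring of two buffers and that a half is refilled only after pos_start has left it. The function name, the window parameter, and the one-byte step are hypothetical.

```python
def stream_through_ring(src, buffer_size, window):
    """Yield every `window`-byte sliding window of src, read only from the ring."""
    assert 0 < window <= buffer_size
    ring = bytearray(2 * buffer_size)    # buffer 1 followed by buffer 2
    loaded = 0                           # source bytes copied into the ring so far

    def dma_get(half):
        # Stand-in for a DMA transfer: copy the next source block into one half.
        nonlocal loaded
        chunk = src[loaded:loaded + buffer_size]
        base = half * buffer_size
        ring[base:base + len(chunk)] = chunk
        loaded += len(chunk)

    dma_get(0)                           # blocking read into buffer 1
    dma_get(1)                           # prefetch of the next block into buffer 2
    pos_start = 0                        # logical offset of the window start
    while pos_start + window <= len(src):
        # Once pos_start has entered the newest block, the other half holds only
        # data behind the window, so it can be refilled with the next block.
        if loaded < len(src) and pos_start >= loaded - buffer_size:
            dma_get((loaded // buffer_size) % 2)
        # Logical offsets wrap around at the end of the double buffer.
        yield bytes(ring[(pos_start + i) % (2 * buffer_size)]
                    for i in range(window))
        pos_start += 1
```

On the real hardware the refill would be issued as a nonblocking request, and the wait for its completion would be placed where pos_end first touches the refilled half, which is what allows the transfer to run concurrently with compression.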

Evaluation
We mainly test and analyse the compression ratio and compression time of the SW-LZMA algorithm. The baseline is the compression ratio and compression time of the serial LZMA algorithm running on the main core. Timing is done with the Athread timing interface, which counts the number of CPU clock cycles the algorithm has run; the run time is calculated from this count.

Algorithm 2: Multithreaded parallel implementation of LZMA using the Athread interface.
    DMA_get(buffer_base, src, buffer_size)
    while (src data is not empty) {
        ...
        Range_Encode(compressed_data)
        ...

The Silesia corpus [15] provides a data set of files covering the typical data types currently in use. The file sizes are between 6 MB and 51 MB. The corpus was proposed to address the lack of large files and the narrow range of file types in the traditional Canterbury corpus. Table 2 shows the test cases of the benchmark test set. The experimental platform is the Sunway TaihuLight supercomputing system, and its parameters are shown in Table 3 [18]. The compression benchmark test set used in the experiment is the Silesia corpus. In addition, to test the compression performance on a large amount of data, we copied and packaged the Silesia corpus files to form GB-level data for compression testing.

Performance Evaluation.
To test the acceleration effect of the SW-LZMA parallel algorithm on the SW26010 processor, the serial version of the LZMA algorithm running on the MPE was selected as the baseline for comparing the performance of the different optimization schemes. The compression speed and compression rate of SW-LZMA on the Silesia corpus benchmark are shown in Figure 7. Because of the large memory access bottleneck, the 64-thread parallel version that reads data directly from the main memory obtains an average speedup of only 2 times and in some cases even takes longer than serial compression. In contrast, the optimized version, which overlaps computation and communication using DMA double buffering, obtains an average speedup of 3.7 times and a maximum speedup of 4.1 times, indicating that DMA double buffering greatly improves parallel performance. In terms of compression ratio, the parallel and serial versions are basically the same.
As a further analysis, we examined how the choice of the single buffer size buffer_size in the DMA double buffer design affects the number of message transfers and the compression speed. As shown in Figure 8, when buffer_size is less than 20 KB, the amount of data copied by a single DMA is small, so calculation and communication cannot be fully overlapped. At the same time, the number of DMA transfers increases, the corresponding DMA overhead increases, and the compression speed decreases slightly. When buffer_size is greater than 25 KB, the compression speed does not change much with the buffer size. Theoretically, the buffer should be sized so that the DMA communication delay and the calculation are balanced, but because the LDM space is limited and part of it must be reserved for other local variables, the buffer cannot be expanded indefinitely.
Figure 6: LDM address space partition of sliding window encoding based on DMA double buffer (Steps 1 to 3).
In Section 2.2, we discuss the memory space demand of the LZMA algorithm and try to satisfy it within the 64 KB LDM space of each CPE. In Section 3, we design the DMA double buffers and the LDM address space partition of the sliding window to make full use of the LDM space. According to the experimental results, the SW-LZMA parallel algorithm has reached the maximum utilization of the local memory space of the CPEs; it cannot be scaled further to reach the maximum bandwidth utilization and frequency of the SW26010 processor, mainly because of the LDM space limitation and memory access latency. Therefore, we take the buffer size that currently gives the best performance as the optimal parameter, so as to maximize the gain from overlapping computation and communication.
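The two opposing pressures on buffer_size can be tabulated with a short calculation. The sketch assumes that the input is split evenly over 64 CPEs, that each CPE issues one input DMA request per buffer_size bytes, and that the two input buffers are the only large LDM consumers counted; output buffers and the hash table are ignored. The 51 MB input corresponds to the largest Silesia file size mentioned in the text, and the function name is hypothetical.

```python
LDM_BYTES = 64 * 1024   # LDM capacity per CPE, from the text
N_CPES = 64             # CPEs used by SW-LZMA, from the text

def dma_plan(total_bytes, buffer_size):
    """Return (input DMA requests per CPE, LDM bytes left after two buffers)."""
    per_cpe = -(-total_bytes // N_CPES)          # bytes handled by one CPE (ceiling)
    transfers = -(-per_cpe // buffer_size)       # one request per buffer-sized chunk
    ldm_left = LDM_BYTES - 2 * buffer_size       # space left for tables and locals
    return transfers, ldm_left

for kb in (8, 16, 20, 24, 28):
    transfers, left = dma_plan(51 * 1024 * 1024, kb * 1024)
    print(f"buffer_size={kb:>2} KB  transfers/CPE={transfers:>4}  "
          f"LDM left={left // 1024:>2} KB")
```

Larger buffers reduce the number of transfers per CPE, but every additional kilobyte of buffer removes two kilobytes from the space left for the hash table and other local data, which is the trade-off behind choosing the best measured buffer size rather than the largest possible one.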
Most compression test corpora are small in scale and provide no GB-level test cases. We use the Linux tar tool to package multiple copies of the Silesia corpus into several large-file test sets, and we test the compression performance of the SW-LZMA parallel algorithm on big data with these test sets.
... [19]. Their experimental platform is an Intel E5-1650 v2 3.5 GHz processor. We compare their results with the test results of the SW-LZMA parallel algorithm. Since the compressed data sets are different, we only compare the average compression ratio and the average speedup ratio. The results are shown in Table 5. Because of the obvious gap in CPU frequency, the serial LZMA algorithm performs worse on the SW26010 than on the Intel CPU, whereas the parallel version of LZMA achieves a more pronounced acceleration effect than on the Intel CPU, indicating that SW-LZMA has a performance advantage.

Conclusions
The main work of this paper is to transplant the LZMA compression algorithm to the Sunway TaihuLight supercomputer system and to reconstruct and optimize the parallel algorithm according to the characteristics of the Sunway many-core processor. We use the Athread interface to parallelize the LZMA algorithm over multiple threads and data blocks and design a DMA-based double buffer mode to overlap computation and communication. As a further optimization, we perform fine-grained management and layout optimization of the LDM address space and set the buffer size so as to obtain the best overlap of computation and communication. The test results show that on the Silesia corpus benchmark test set, the SW-LZMA algorithm achieves a maximum speedup of 4.1 times. In the large file compression test, the SW-LZMA parallel algorithm achieves a maximum speedup of 5.3 times. Compared with serial algorithms on mainstream CPUs such as x86 CPUs, the SW-LZMA algorithm has an obvious acceleration effect on SW26010 many-core processors, greatly reduces execution time, and performs better. The SW-LZMA parallel algorithm not only provides high-speed compression for applications in the field of high-performance computing but is also feasible for other big data applications such as smart grid [20] and cloud computing [21]. In the future, there are two research directions for further improving the performance of the LZMA algorithm: one is to revise the LZMA algorithm to further reduce the use of local space without affecting the compression rate; the other is to design more efficient parallel LZMA algorithms for new high-performance computing processors.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.