Efficient CSR-Based Sparse Matrix-Vector Multiplication on GPU

Sparse matrix-vector multiplication (SpMV) is an important operation in computational science, and it needs to be accelerated because it often represents the dominant cost in many widely used iterative methods and eigenvalue problems. We achieve this objective by proposing a novel SpMV algorithm based on the compressed sparse row (CSR) format on the GPU. Our method dynamically assigns different numbers of rows to each thread block, and each block selects its optimized implementation on the basis of the number of rows it is assigned. Accesses to the CSR arrays are fully coalesced, and the GPU's DRAM bandwidth is efficiently utilized by loading data into shared memory, which alleviates the bottleneck of many existing CSR-based algorithms (i.e., CSR-scalar and CSR-vector). Test results on C2050 and K20c GPUs show that our method outperforms the perfect-CSR algorithm that inspired our work, the vendor-tuned CUSPARSE V6.5 and CUSP V0.5.1, and three popular algorithms: clSpMV, CSR5, and CSR-Adaptive.


Introduction
Given their many-core structures, graphics processing units (GPUs) have sufficient computation power for scientific computations. Processing big data by using GPUs has drawn considerable attention over recent years. Following the introduction of the compute unified device architecture (CUDA), a programming model that supports the joint CPU/GPU execution of applications, by NVIDIA in 2007 [1], GPUs have become strong competitors as general-purpose parallel programming systems.
Sparse matrix-vector multiplication (SpMV) has proven to be of great importance in computational science. It represents the dominant cost in iterative methods for solving linear systems and eigenvalue problems, which arise in a wide variety of scientific and engineering applications [2][3][4][5]. Recently, with the increasing size of matrices, GPU-accelerated SpMV has attracted considerable attention.
SpMV performance is heavily dependent on the format used to store the sparse matrix in memory. Compressed sparse row (CSR) has historically been a frequently used format because it efficiently compresses both structured and unstructured matrices. The naive CSR-based parallel SpMV, known as CSR-scalar, assigns each row of the sparse matrix to a separate thread [6]. However, the memory-access pattern of CSR-scalar is rarely coalesced, which results in disappointing performance. CSR-vector [7, 8] improves the performance of CSR-scalar by using warps to access the CSR structure in a contiguous but not generally aligned fashion, which implies only partial coalescing. Since then, researchers have developed many highly efficient CSR-based SpMV implementations on the GPU by optimizing the memory-access pattern of the CSR structure [9][10][11][12][13][14][15][16][17][18][19].
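For reference, a minimal CUDA sketch of CSR-scalar is shown below. The array names AA, JA, and IA match the CSR notation used later in this paper; the kernel itself is our illustration, not the verbatim code of [6].

    // CSR-scalar (one thread per row): a minimal sketch. AA holds the
    // nonzero values, JA their column indices, and IA the row pointers.
    __global__ void csr_scalar(int numRows, const int *IA, const int *JA,
                               const float *AA, const float *x, float *y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < numRows) {
            float sum = 0.0f;
            // Neighboring threads read far-apart segments of AA and JA,
            // so these loads are rarely coalesced -- the weakness noted above.
            for (int j = IA[row]; j < IA[row + 1]; ++j)
                sum += AA[j] * x[JA[j]];
            y[row] = sum;
        }
    }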
Among the existing CSR-based algorithms on the GPU, a representative one, the perfect-CSR algorithm (PCSR) [13], achieves high performance through fully coalesced accesses to the CSR arrays. PCSR involves two kernels. The first kernel performs many parallel coalesced loads from the CSR arrays and places the resulting products into an intermediate array in global memory. The second kernel is composed of two stages: the first stage loads the intermediate array into shared memory in a fully coalesced manner; the second stage then performs a scalar-style reduction (with each thread working on a single row), and each row held in shared memory is reduced to an output value. Although the writing and reading of the intermediate global array is coalesced in PCSR, it still incurs a cost. If a matrix has a great number of nonzero values, the vector-sum-like operation in the first kernel becomes a bottleneck for PCSR performance [20]. Moreover, when the rows of a matrix have large and significantly different numbers of nonzero values, PCSR may leave many threads inactive during the scalar-style reduction in each block, and a thread with a long row often reduces more slowly than one with a short row. This greatly decreases the reduction efficiency.
These observations motivate us to enhance the performance of PCSR. We propose an improved PCSR, called IPCSR, to calculate SpMV on the GPU. IPCSR reduces the two kernels of PCSR to one kernel while keeping the coalesced loads of the CSR arrays intact, thereby saving the cost of writing and reading the intermediate global array. IPCSR includes only two stages. In the first stage, IPCSR dynamically calculates the suitable number of rows for each block and quickly loads values from the CSR arrays into shared memory. This stage merges the operations of the first kernel of PCSR and the first stage of its second kernel. The loads are coalesced and effectively utilize the GPU's DRAM bandwidth, alleviating the bottleneck of PCSR. IPCSR supports two reduction approaches: a scalar-style reduction (with each thread working on a single row) and a multiple scalar-style reduction (with multiple threads working on a single row). How many threads are assigned to a single row in each block is dynamically determined by a decision tree. Thus, IPCSR improves the reduction efficiency by keeping threads active as much as possible.
IPCSR loses efficiency when a block operates on a single row, and it becomes inoperative if a row has more values than can be allocated in shared memory. To overcome this limitation, we design a highly efficient algorithm for this case and dynamically determine whether each block executes IPCSR on a set of rows or on a single long row. Our experiments on a set of 20 sparse matrices show that IPCSR outperforms PCSR for all test cases. IPCSR achieves an average performance improvement over PCSR of 1.76x (single precision) and 1.53x (double precision) on the C2050 GPU and of 1.79x (single precision) and 1.56x (double precision) on the K20c GPU. In addition, compared to CUSPARSE V6.5 [21], IPCSR achieves an average performance improvement of 2.76x in single precision and 1.98x in double precision on the C2050 GPU, and of 3.52x in single precision and 2.52x in double precision on the K20c GPU. Compared to CUSP V0.5.1 [22], which chooses the best of six algorithms (COO, CSR-scalar, CSR-vector, DIA, ELL, and HYB), IPCSR achieves an average performance gain of 1.43x in single precision and 1.33x in double precision on the C2050 GPU, and of 1.75x in single precision and 1.62x in double precision on the K20c GPU. Compared to clSpMV [23], which combines the advantages of many existing algorithms, IPCSR achieves an average performance improvement of 1.55x in single precision and 1.22x in double precision on the C2050 GPU, and of 1.56x in single precision and 1.26x in double precision on the K20c GPU. Compared to CSR5 [18], IPCSR achieves an average performance improvement of 1.21x (single precision) and 1.11x (double precision) on the C2050 GPU and of 1.30x (single precision) and 1.20x (double precision) on the K20c GPU. IPCSR behaves slightly better than CSR-Adaptive [17], achieving an average performance improvement of 1.02x in single precision and 1.01x in double precision on the C2050 GPU, and of 1.07x in single precision and 1.08x in double precision on the K20c GPU.
The main contributions of this paper are summarized as follows: (i) We propose a novel CSR-based SpMV on the GPU. Our proposed algorithm is inspired by the work of Gao et al. [13] but achieves an average performance improvement over it of up to 1.76x on the C2050 GPU and 1.79x on the K20c GPU. (ii) In our proposed algorithm, each block may have a different number of rows, and the optimal number of rows each block involves is dynamically calculated. (iii) Each block can dynamically determine whether to execute our proposed algorithm on a set of rows or on a single long row on the basis of the number of rows it involves.
The rest of this paper is organized as follows. Section 2 addresses related work. Section 3 introduces the CSR format and the PCSR algorithm. Section 4 details our IPCSR algorithm. Experimental results are presented in Section 5. Section 6 summarizes our conclusions and suggestions for future research.

Related Work
Sparse matrix-vector multiplication (SpMV) is so important that a great deal of work exists on accelerating it; we discuss only the most relevant efforts here. Initial work on accelerating SpMV on CUDA-enabled GPUs was presented by Bell and Garland [7]. They implement several well-known formats, including DIA, ELL, CSR, and COO, and a new hybrid format, HYB, on the GPU. Each storage format has its ideal matrix type. They conclude that the best SpMV performance on the GPU depends on the storage format and advocate the column-major ELL and HYB. These conclusions are carried forward to the CUSP [22] library, which relies heavily on HYB for SpMV.
A lot of work has been proposed for GPUs using variants of the CSR, ELL, and COO formats, such as compressed sparse eXtended [24], bit-representation-optimized compression [25], block CSR [26], ELLPACK-R [27], sliced ELL [28, 29], SELL-C-σ [30], sliced COO [31], and blocked compressed COO [32]. Specialized storage formats provide definite advantages. However, because a great number of software programs use CSR, converting from CSR to these other formats presents a large engineering hurdle, can incur large runtime overheads, and requires extra storage space.
In [7, 8], Bell and Garland show that CSR-scalar loses efficiency due to its rarely coalesced memory accesses. CSR-vector improves the performance of CSR-scalar, but it accesses the CSR structure in a contiguous but not generally aligned fashion, which implies only partial coalescing. Compared to CSR-scalar and CSR-vector, the ELL, DIA, and HYB implementations on the GPU benefit from full coalescing. Following the idea of optimizing the memory-access pattern of the CSR structure, researchers have developed many algorithms. Lu et al. [33] optimize CSR-scalar by padding the CSR arrays and achieve a 30% improvement in memory-access performance. In [10], Dehnavi et al. propose a prefetch-CSR method that partitions the matrix nonzeros into blocks of the same size and distributes them among GPU resources. The method behaves slightly better than CSR-vector by padding rows with zeros to increase data regularity, using parallel reduction techniques, and prefetching data to hide global memory accesses. The authors further increase the performance of this method by replacing it with three subkernels in [11]. Gao et al. [13] propose a perfect-CSR (PCSR) algorithm on the GPU, which consists of two kernels; it achieves fully coalesced accesses to the CSR arrays by introducing an intermediate array. Greathouse and Daga suggest a novel algorithm, CSR-Adaptive, which keeps the CSR format intact and maps well to GPUs [14]. The authors subsequently enhance CSR-Adaptive so that it works well for both regular and irregular matrices while keeping the CSR format unchanged [17]. Liu and Schmidt present LightSpMV, a fast CUDA-compatible SpMV algorithm using the standard CSR format, which achieves high parallelism by benefiting from fine-grained dynamic distribution of matrix rows over warps/vectors [16]. Other related work can be found in [9, 12, 18].
Our proposed algorithm is inspired by Gao et al.'s work [13]. It alleviates the PCSR bottleneck described above while keeping fully coalesced accesses to the CSR arrays, and thus outperforms PCSR.

The CSR Format and PCSR
3.1. The CSR Format. The compressed sparse row (CSR) format is a popular, general-purpose sparse matrix representation. CSR stores a sparse matrix A via three arrays: (1) the array AA contains all the nonzero entries of A; (2) the array JA contains the column indices of the nonzero entries stored in AA; and (3) the entries of the array IA point to the first entry of each successive row of A in the arrays AA and JA.
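As a concrete illustration, consider the small matrix below. The names AA and JA appear verbatim later in the text; IA is the conventional row-pointer name, used here as an assumption because the paper's own symbol is not shown.

    /* Example: the 4 x 4 matrix
           [ 1 0 2 0 ]
       A = [ 0 3 0 0 ]
           [ 4 0 5 6 ]
           [ 0 0 0 7 ]
       is stored in the CSR format by:                        */
    float AA[] = { 1, 2, 3, 4, 5, 6, 7 };  // nonzero values, row by row
    int   JA[] = { 0, 2, 1, 0, 2, 3, 3 };  // column index of each value
    int   IA[] = { 0, 2, 3, 6, 7 };        // row i spans AA[IA[i]..IA[i+1]-1]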

3.2. PCSR. PCSR is a CSR-based SpMV implementation on the GPU and involves two kernels. Algorithm 1 shows the main procedure of the first kernel. This kernel loads values from the CSR arrays and writes the corresponding products into an intermediate array in global memory in a coalesced manner; each thread calculates only one nonzero entry. Algorithm 2 shows the main procedure of the second kernel, which is composed of two stages. The first stage performs many parallel coalesced loads from the intermediate array and quickly places the values into shared memory. The second stage completes a scalar-style reduction (with each thread working on a single row); each row held in shared memory is reduced to an output value.
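The following condensed CUDA sketch reflects our reading of the two kernels; the intermediate-array name P and the simplified boundary handling are ours, since the original symbols were not preserved.

    // Kernel I (cf. Algorithm 1): one thread per nonzero entry; coalesced
    // reads of AA and JA, products written to an intermediate global array P.
    __global__ void pcsr_kernel1(int nnz, const int *JA, const float *AA,
                                 const float *x, float *P)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < nnz;
             i += gridDim.x * blockDim.x)
            P[i] = AA[i] * x[JA[i]];
    }

    // Kernel II (cf. Algorithm 2): stage one loads the block's slice of P
    // into shared memory coalesced; stage two reduces one row per thread.
    // Boundary checks and chunking of large row blocks are omitted.
    __global__ void pcsr_kernel2(const int *IA, const float *P, float *y)
    {
        extern __shared__ float s[];              // dynamic shared memory
        int firstRow = blockIdx.x * blockDim.x;   // one row per thread
        int begin = IA[firstRow], end = IA[firstRow + blockDim.x];
        for (int i = begin + threadIdx.x; i < end; i += blockDim.x)
            s[i - begin] = P[i];                  // stage 1: coalesced
        __syncthreads();
        int row = firstRow + threadIdx.x;         // stage 2: reduction
        float sum = 0.0f;
        for (int j = IA[row]; j < IA[row + 1]; ++j)
            sum += s[j - begin];
        y[row] = sum;
    }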

Improved PCSR
On the basis of the above discussions, we propose an improved PCSR on the GPU, called IPCSR, as shown in Algorithm 3. Compared to PCSR, IPCSR reduces the two kernels to one kernel and involves only two stages. In the first stage, each thread block performs many parallel coalesced loads from the value and column-index arrays and quickly places the products into shared memory (see lines (7)-(9) in Algorithm 3). This stage combines the operations of the first kernel of PCSR (see lines (3)-(6) in Algorithm 1) and the first stage of the second kernel of PCSR (see lines (12)-(17) in Algorithm 2). The second stage, like the second stage of the second kernel of PCSR, completes a scalar-style reduction (with each thread working on a single row).
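A minimal CUDA sketch of the fused kernel under our reading of Algorithm 3 is given below. The rowBlock table (holding each block's first row; its construction is described below) and the fixed SHARED_SIZE constant are names of ours.

    #define SHARED_SIZE 1024   // float slots per block, set per the formula below

    __global__ void ipcsr_scalar(const int *rowBlock, const int *IA,
                                 const int *JA, const float *AA,
                                 const float *x, float *y)
    {
        __shared__ float s[SHARED_SIZE];
        int firstRow = rowBlock[blockIdx.x];      // this block's rows:
        int lastRow  = rowBlock[blockIdx.x + 1];  // [firstRow, lastRow)
        int begin = IA[firstRow], end = IA[lastRow];

        // Stage 1: coalesced loads straight from the CSR arrays into
        // shared memory -- no intermediate global array as in PCSR.
        for (int i = begin + threadIdx.x; i < end; i += blockDim.x)
            s[i - begin] = AA[i] * x[JA[i]];
        __syncthreads();

        // Stage 2: scalar-style reduction, one thread per row.
        for (int r = firstRow + threadIdx.x; r < lastRow; r += blockDim.x) {
            float sum = 0.0f;
            for (int j = IA[r]; j < IA[r + 1]; ++j)
                sum += s[j - begin];
            y[r] = sum;
        }
    }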
For IPCSR, the size of shared memory for a given T (the number of threads per block) is set as SHARED_SIZE = S_M / min(T_M / T, B_M), where T_M, S_M, and B_M are the maximum number of threads per multiprocessor, the maximum amount of shared memory per multiprocessor, and the maximum number of blocks per multiprocessor, respectively; T_M, S_M, and B_M are constant for a specific GPU.
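In code, this amounts to dividing the per-multiprocessor shared memory among the blocks that can be resident at once. The sketch below follows our reading of the formula; B_M is passed in because older CUDA runtimes do not expose it (8 on Fermi-class GPUs such as the C2050, 16 on Kepler-class GPUs such as the K20c).

    #include <cuda_runtime.h>

    // Shared-memory budget per block for T threads per block (a sketch).
    size_t shared_bytes_per_block(int T, const cudaDeviceProp &p, int B_M)
    {
        int resident = p.maxThreadsPerMultiProcessor / T;   // T_M / T
        if (resident > B_M) resident = B_M;                 // min(T_M/T, B_M)
        return p.sharedMemPerMultiprocessor / resident;     // S_M / min(...)
    }
    // SHARED_SIZE in float slots is then shared_bytes_per_block(...) / sizeof(float).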
The number of rows each block calculates in PCSR is a fixed value equal to the number of threads per block. In IPCSR, by contrast, the number of rows assigned to a block is calculated dynamically. IPCSR splits the problem into row blocks such that the nonzero values within a block fit into a statically sized amount of shared memory (i.e., SHARED_SIZE). For example, assume a matrix with 128 rows, each with 16 nonzero values, and SHARED_SIZE = 1024; then two blocks, each with 64 rows, are required. If the first row instead has 32 nonzero values and the remaining 127 rows still have 16 nonzero values each, three blocks are needed: the first, second, and third blocks work on 63, 64, and 1 rows, respectively. Under the assumption that no row has more than SHARED_SIZE nonzero values, the main procedure for generating row blocks is shown in Algorithm 4.
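A host-side sketch of this row-block generation, reconstructed from the description and the two examples above (not the verbatim Algorithm 4), is:

    // Greedily pack consecutive rows into blocks so that each block's
    // nonzeros fit into sharedSize slots. Assumes no row exceeds
    // sharedSize (long rows are handled separately; see Algorithm 6).
    int make_row_blocks(int numRows, const int *IA,
                        int sharedSize, int *rowBlock)
    {
        int nBlocks = 0, start = 0;
        rowBlock[nBlocks++] = 0;
        for (int row = 1; row <= numRows; ++row) {
            if (IA[row] - IA[start] > sharedSize) { // row would overflow the block,
                rowBlock[nBlocks++] = row - 1;      // so close it at row-1
                start = row - 1;
            }
        }
        rowBlock[nBlocks] = numRows;
        return nBlocks;                             // number of row blocks
    }

On the second example above, this routine produces exactly the 63-, 64-, and 1-row partition described in the text.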
Here, the procedure in Algorithm 4 is executed on the CPU. Of course, this procedure can also be performed on the GPU; however, our tests show that using the GPU for this procedure does not increase performance. For a specific matrix, the row blocks need to be calculated only once. The complexity of generating the blocks is linear in the number of rows, and the cost of generating the blocks is generally less than 1% of the cost of generating the CSR data structure itself.
In summary, compared to PCSR, IPCSR has the following improvements: (i) it reduces the two kernels to one kernel, merging the operations of the first kernel of PCSR with the first stage of its second kernel, which avoids writing and reading the intermediate global array and saves the corresponding cost;
(ii) it fits the number of rows dynamically to the amount of shared memory space, which simplifies the work done on the GPU.

4.1. Optimizing the Scalar-Style Reduction. When the rows assigned to a block have similar numbers of nonzero values, all threads in the block are active and can almost synchronously complete the scalar-style reduction. However, when a sparse matrix has a large or significantly varying number of nonzero values per row, many threads in each block are inactive, and a thread with a long row reduces more slowly than one with a short row, which greatly decreases performance. To improve the efficiency of the scalar-style reduction, we take full advantage of the threads in each block and propose a decision tree that dynamically fits the number of threads to a single row, as shown in Figure 1. For each block, this decision tree selects the appropriate number of threads to calculate a single row on the basis of the number of rows assigned to the block. When multiple threads calculate a single row in a block, the scalar-style reduction in Algorithm 3 is modified into the multiple scalar-style reduction (Algorithm 5).
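Figure 1's exact cut-off values are not reproduced here; the following illustration only conveys the idea of matching the thread-group size to the number of rows per block, and the paper's actual decision tree may differ.

    // Illustrative only: pick the thread-group size (threads per row)
    // for a block with T threads and R assigned rows.
    int threads_per_row(int T, int R)
    {
        int g = 1;
        while (2 * g <= T / R && g < 32)   // grow while every row keeps a group,
            g *= 2;                        // staying within one warp
        return g;                          // g == 1 => plain scalar-style reduction
    }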
The multiple scalar-style reduction consists of two steps: a partial-reduction step and a warp-reduction step. In the partial-reduction step, each of the threads in a thread group (warpNums threads are grouped into a thread group) performs a partial reduction and saves its partial value to the shared memory. Then, in the warp-reduction step, the partial values in the shared memory for each thread group are reduced in parallel to an output value. The main procedure of IPCSR with the adaptive number of threads is shown in Algorithm 7.
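A device-side sketch of the two steps for one thread group is given below. The scratch-array name red, the lane variable (lane = threadIdx.x % warpNums), and the warp-synchronous formulation (valid on the Fermi/Kepler GPUs used here, where warpNums <= 32) are our assumptions.

    // Multiple scalar-style reduction for one row (sketch). A group of
    // warpNums threads reduces the row's products held in s[rowBegin..rowEnd);
    // red is a blockDim.x-sized scratch array in shared memory.
    __device__ float multi_scalar_reduce(const volatile float *s,
                                         volatile float *red,
                                         int rowBegin, int rowEnd,
                                         int lane, int warpNums)
    {
        float partial = 0.0f;                    // partial-reduction step
        for (int j = rowBegin + lane; j < rowEnd; j += warpNums)
            partial += s[j];
        red[threadIdx.x] = partial;
        for (int off = warpNums / 2; off > 0; off /= 2)  // warp-reduction step
            if (lane < off)
                red[threadIdx.x] += red[threadIdx.x + off];
        return red[threadIdx.x - lane];          // the group's first slot holds the sum
    }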

4.2. Optimizing Very Long Rows. If a single row has more nonzero values than SHARED_SIZE, the IPCSR kernel in Algorithm 7 is not suitable for calculating it because the row's values cannot all be moved into shared memory. We solve this by giving each very long row to a single block of its own. To assign each very long row to a single block, the code in line (8) of Algorithm 4 is modified as shown in Algorithm 6.
The main procedure by which one block calculates a single long row is listed in Algorithm 8. First, the thread block performs many parallel coalesced loads from the value and column-index arrays, quickly placing the products into shared memory (see lines (2)-(4) in Algorithm 8). Second, the values in shared memory are reduced in parallel to an output value (see lines (7)-(15) in Algorithm 8).
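A CUDA sketch under our reading of Algorithm 8 follows; THREADS is a compile-time constant of ours, and blockDim.x is assumed to be a power of two.

    #define THREADS 256        // threads per block; assumed a power of two

    // One block reduces one very long row (sketch).
    __global__ void ipcsr_long_row(int row, const int *IA, const int *JA,
                                   const float *AA, const float *x, float *y)
    {
        __shared__ float s[SHARED_SIZE];   // products of the row
        __shared__ float red[THREADS];     // per-thread partial sums
        int begin = IA[row], n = IA[row + 1] - begin;

        // Coalesced loads of the row's products (cf. lines (2)-(4)).
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            s[i] = AA[begin + i] * x[JA[begin + i]];
        __syncthreads();

        // Parallel reduction to one output value (cf. lines (7)-(15)):
        // per-thread partial sums, then a tree reduction.
        float sum = 0.0f;
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            sum += s[i];
        red[threadIdx.x] = sum;
        __syncthreads();
        for (int off = blockDim.x / 2; off > 0; off /= 2) {
            if (threadIdx.x < off)
                red[threadIdx.x] += red[threadIdx.x + off];
            __syncthreads();
        }
        if (threadIdx.x == 0) y[row] = red[0];
    }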
The main procedure of IPCSR with very long rows is shown in Algorithm 9. Our experiments show that if a block involves only a single row whose nonzeros do not exceed SHARED_SIZE, the performance obtained by the multiple scalar-style reduction in Algorithm 5 is always lower than that obtained by the single-long-row reduction in Algorithm 8. Therefore, we also use the single-long-row reduction in Algorithm 8 for each block that involves a single row whose nonzeros do not exceed SHARED_SIZE.

4.3. Optimizing the Accesses to x. In IPCSR, the vector x in global memory is accessed randomly, which decreases performance. Therefore, the texture memory is utilized to alleviate the high latency of randomly accessing x in global memory [34].

For single precision, the product AA[firstCol + i] * x[JA[firstCol + i]] in equation (4) can be rewritten so that x[JA[firstCol + i]] is fetched with tex1Dfetch() from the texture bound to x. Because texture fetches do not support double values, the function fetch_double() shown in Algorithm 10 is adopted to convert the fetched int2 value into a double value. Furthermore, for double precision, based on fetch_double(), we rewrite the product so that x[JA[firstCol + i]] is obtained via fetch_double() on the texture bound to x.
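Algorithm 10 is the standard int2-to-double texture fetch; a reconstruction is given below (the texture name x_tex is ours).

    // Fetch a double through a texture that views x as an array of int2.
    texture<int2, 1> x_tex;   // texture reference bound to the vector x

    static __inline__ __device__ double fetch_double(texture<int2, 1> t, int i)
    {
        int2 v = tex1Dfetch(t, i);
        return __hiloint2double(v.y, v.x);  // y holds the high word, x the low word
    }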

Experimental Results

Experimental Setup.
We use a total of 20 sparse matrices: 14 of them are from [35] and 6 are from the University of Florida Sparse Matrix Collection [36]. Table 1 summarizes the information on the sparse matrices, including the number of rows, number of columns, number of nonzeros, and number of nonzeros per row. These matrices have been widely used in previous work [8, 14, 23, 28, 29, 32, 37]. Our experimental environments include one machine equipped with an Intel Xeon Quad-Core CPU and an NVIDIA Tesla C2050 GPU and another machine with an Intel Xeon Quad-Core CPU and an NVIDIA Tesla K20c GPU. Table 2 gives an overview of the two GPUs.
The performance of all algorithms is measured in terms of GFlop/s or GByte/s (GB/s). Measured performance does not include data transfer between the CPU and GPU. For each algorithm, we execute it 50 times and take the average performance.

Single Precision.
The performance of CUSPARSE, CUSP, clSpMV, CSR5, CSR-Adaptive, PCSR, and IPCSR in single precision for all test matrices is shown in Figures 2 and 3 for C2050 and K20c, respectively. GFlop/s values are calculated on the basis of the assumption of two flops per nonzero entry of a matrix [8]. Table 3 presents the speedups achieved by IPCSR over the other implementations on both GPUs for all test matrices. We can observe that the performance of IPCSR for each test matrix is much better than that of PCSR on both GPUs: IPCSR achieves speedups over PCSR ranging from 1.03 to 3.30 on C2050 and from 1.04 to 3.25 on K20c (Table 3), with average speedups of 1.76 and 1.79 on C2050 and K20c, respectively (Figure 4). Furthermore, from Figure 2, we can see that IPCSR outperforms CUSPARSE, CUSP, and clSpMV on C2050 for almost all test matrices, with average speedups of 2.76x over CUSPARSE, 1.43x over CUSP, and 1.55x over clSpMV (Figure 4). In addition, IPCSR behaves better than CSR5 for all test cases and achieves an average performance improvement of 1.21x over CSR5 (Figure 4). The performance of IPCSR is slightly higher than that of CSR-Adaptive for all test matrices except Protein and mip1, and IPCSR achieves an average performance improvement of 1.02x over CSR-Adaptive (Figure 4). Protein and mip1 contain many rows with a large number of nonzeros; CSR-Adaptive can use several blocks to calculate one row, whereas IPCSR uses at most one block per row, which may explain CSR-Adaptive's advantage on these matrices. Similar to C2050, Figure 3 shows that IPCSR achieves higher performance than CUSPARSE, CUSP, and clSpMV for all test matrices except Wind Tunnel and Epidemiology on K20c, with average speedups of 3.52x over CUSPARSE, 1.75x over CUSP, and 1.56x over clSpMV (Figure 4). IPCSR also outperforms CSR5, and its performance is slightly better than that of CSR-Adaptive for all test cases, with average performance improvements of 1.30x over CSR5 and 1.07x over CSR-Adaptive (Figure 4).

Double Precision.
The performance of CUSPARSE, CUSP, clSpMV, CSR5, CSR-Adaptive, PCSR, and IPCSR in double precision for all test matrices on C2050 and K20c is shown in Figures 5 and 6, respectively. Table 4 presents the speedups achieved by IPCSR over the other implementations on both GPUs for all test matrices. Compared to PCSR, IPCSR achieves higher performance for all matrices on both GPUs: the speedups of IPCSR over PCSR range from 1.07 to 2.72 on C2050 and from 1.08 to 2.68 on K20c, with average speedups of 1.53 and 1.56, respectively.
For CUSPARSE, CUSP, clSpMV, CSR5, CSR-Adaptive, and IPCSR, no single approach is best for all matrices on both GPUs. However, IPCSR usually equals or outperforms the other approaches and is the best approach on average. As in single precision, the major exceptional matrices in double precision are Wind Tunnel, QCD, and Epidemiology. Although IPCSR is not the best for these exceptional matrices, it achieves average performance improvements of 1.98x over CUSPARSE, 1.33x over CUSP, 1.22x over clSpMV, 1.11x over CSR5, and 1.01x over CSR-Adaptive on C2050 (Figure 4). Similarly, IPCSR achieves average performance improvements of 2.52x over CUSPARSE, 1.62x over CUSP, 1.26x over clSpMV, 1.20x over CSR5, and 1.08x over CSR-Adaptive on K20c (Figure 4). Therefore, we can conclude that IPCSR greatly improves the performance of PCSR and behaves better than the other implementations on average.

Effective Bandwidth.
We further evaluate IPCSR by its effective bandwidth, computed as (B_r + B_w) / 10^9 / t [20], where the effective bandwidth is in units of GB/s, B_r is the number of bytes read per kernel, B_w is the number of bytes written per kernel, and t is the kernel execution time in seconds.

Figure 3: Performance comparison for single precision on K20c.

Figure 4: Average speedups of IPCSR over other implementation techniques.

Table 1: Sparse matrices used in the experiments.

Note: for each test matrix, we choose the best of the six CUSP algorithms.

Table 3: Speedups of IPCSR over other implementation techniques in single precision.

Table 4: Speedups of IPCSR over other implementation techniques in double precision.
Figure 7 shows the effective bandwidth of CUSPARSE, CUSP, clSpMV, CSR5, CSR-Adaptive, PCSR, and IPCSR for all test matrices on C2050; the unit is GB/s, and the red line is the peak theoretical bandwidth (144 GB/s) of C2050. We can observe that IPCSR achieves an effective bandwidth ranging from 72.63 to 139.48 GB/s across all test matrices on C2050, with an average of 119.22 GB/s. On the basis of the definition of the bandwidth [20], we can conclude that IPCSR has high parallelism. For CUSPARSE, CUSP, clSpMV, CSR5, CSR-Adaptive, and PCSR, the average effective bandwidths over all test cases on C2050 are 61.64, 89.92, 85.35, 100.29, 117.43,