A Novel CSR-Based Sparse Matrix-Vector Multiplication on GPUs

Sparse matrix-vector multiplication (SpMV) is an important operation in scientific computations. Compressed sparse row (CSR) is the most frequently used format to store sparse matrices. However, CSR-based SpMVs on graphic processing units (GPUs), for example, CSR-scalar and CSR-vector, usually have poor performance due to irregular memory access patterns. This motivates us to propose a perfect CSR-based SpMV on the GPU that is called PCSR. PCSR involves two kernels and accesses CSR arrays in a fully coalesced manner by introducing a middle array, which greatly alleviates the deficiencies of CSR-scalar (rare coalescing) and CSR-vector (partial coalescing). Test results on a single C2050 GPU show that PCSR fully outperforms CSR-scalar, CSR-vector, and CSRMV and HYBMV in the vendor-tuned CUSPARSE library and is comparable with a most recently proposed CSR-based algorithm, CSR-Adaptive. Furthermore, we extend PCSR on a single GPU to multiple GPUs. Experimental results on four C2050 GPUs show that no matter whether the communication between GPUs is considered or not PCSR onmultiple GPUs achieves good performance and has high parallel efficiency.


Introduction
Sparse matrix-vector multiplication (SpMV) has proven to be an important operation in scientific computing.It need be accelerated because SpMV represents the dominant cost in many iterative methods for solving large-sized linear systems and eigenvalue problems that arise in a wide variety of scientific and engineering applications [1].Initial work about accelerating the SpMV on CUDA-enabled GPUs is presented by Bell and Garland [2,3].The corresponding implementations in the CUSPARSE [4] and CUSP libraries [5] include optimized codes of the well-known compressed sparse row (CSR), coordinate list (COO), ELLPACK (ELL), hybrid (HYB), and diagonal (DIA) formats.Experimental results show speedups between 1.56 and 12.30 compared to an optimized CPU implementation for a range of sparse matrices.
SpMV is a largely memory bandwidth-bound operation.Reported results indicate that different access patterns to the matrix and vectors on the GPU influence the SpMV performance [2,3].The COO, ELL, DIA, and HYB kernels benefit from full coalescing.However, the scalar CSR kernel (CSR-scalar) shows poor performance because of its rarely coalesced memory accesses [3].The vector CSR kernel (CSRvector) improves the performance of CSR-scalar by using warps to access the CSR structure in a contiguous but not generally aligned fashion [3], which implies partial coalescing.Since then, researchers have developed many highly efficient CSR-based SpMV implementations on the GPU by optimizing the memory access pattern of the CSR structure.Lu et al. [6] optimize CSR-scalar by padding CSR arrays and achieve 30% improvement of the memory access performance.Dehnavi et al. [7] propose a prefetch-CSR method that partitions the matrix nonzeros to blocks of the same size and distributes them amongst GPU resources.This method obtains a slightly better behavior than CSR-vector by padding rows with zeros to increase data regularity, using parallel reduction techniques, and prefetching data to hide global memory accesses.Furthermore, Dehnavi et al. enhance the performance of the prefetch-CSR method by replacing it with three subkernels [8].Greathouse and Daga suggest a CSR-Adaptive algorithm that keeps the CSR format intact and maps well to GPUs [9].Their implementation efficiently accesses DRAM by streaming data into the local scratchpad memory and dynamically assigns different numbers of rows to each parallel GPU compute unit.In addition, numerous works have proposed for GPUs using the variants of the CSR storage format such as the compressed sparse eXtended [10], bit-representation-optimized compression [11], block CSR [12,13], and row-grouped CSR [14].
Besides using the variants of CSR, many highly efficient SpMVs on GPUs have been proposed by utilizing the variants of the ELL and COO storage formats such as the ELLPACK-R [15], ELLR-T [16], sliced ELL [13,17], SELL-C- [18], sliced COO [19], and blocked compressed COO [20].Specialized storage formats provide definitive advantages.However, as many programs use CSR, the conversion from CSR to other storage formats will present a large engineering hurdle and can incur large runtime overheads and require extra storage space.Moreover, CSR-based algorithms generally have a lower memory usage than those that are based on other storage formats such as ELL, DIA, and HYB.
All the above observations motivate us to further investigate how to construct efficient SpMVs on GPUs while keeping CSR intact.In this study, we propose a perfect CSR algorithm, called PCSR, on GPUs.PCSR is composed of two kernels and accesses CSR arrays in a fully coalesced manner.Experimental results on C2050 GPUs show that PCSR outperforms CSR-scalar and CSR-vector and has a better behavior compared to CSRMV and HYBMV in the vendor-tuned CUSPARSE library [4] and a most recently proposed CSR-based algorithm, CSR-Adaptive.
The main contributions in this paper are summarized as follows: (i) A novel SpMV implementation on a GPU, which keeps CSR intact, is proposed.The proposed algorithm consists of two kernels and alleviates the deficiencies of many existing CSR algorithms that access CSR arrays in a rare or partial coalesced manner.
(ii) Our proposed SpMV algorithm on a GPU is extended to multiple GPUs.Moreover, we suggest two methods to balance the workload among multiple GPUs.
The rest of this paper is organized as follows.Following this introduction, the matrix storage, CUDA architecture, and SpMV are described in Section 2. In Section 3, a new SpMV implementation on a GPU is proposed.Section 4 discusses how to extend the proposed SpMV algorithm on a GPU to multiple GPUs.Experimental results are presented in Section 5. Section 6 contains our conclusions and points to our future research directions.

Matrix Storage.
To take advantage of the large number of zeros in sparse matrices, special storage formats are required.In this study, the compressed sparse row (CSR) format is only considered although there are many varieties of sparse matrix storage formats, such as the ELLPACK (or ITPACK) [21], COO [22], DIA [1], and HYB [3].Using CSR, an  ×  sparse matrix  with  nonzero elements is stored via three arrays: (1) the array  contains all the nonzero entries of , (2) the array  contains column indices of nonzero entries that are stored in , and (3) entries of the array  point to the first entry of subsequence rows of  in the arrays  and .
For example, the following matrix is stored in the CSR format by : : [0 1 3 0 1 2 4 1 2 5 0 3 4 1 3 4 5 2 4 5] : [0 3 7 10 13 17 20] . (2) 2.2.CUDA Architecture.The compute unified device architecture (CUDA) is a heterogenous computing model that involves both the CPU and the GPU [23].Executing a parallel program on the GPU using CUDA involves the following: (1) transferring required data to the GPU global memory; (2) launching the GPU kernel; and (3) transferring results back to the host memory.The threads of a kernel are grouped into a grid of thread blocks.The GPU schedules blocks over the multiprocessors according to their available execution capacity.When a block is given to a multiprocessor, it is split in warps that are composed of 32 threads.In the best case, all 32 threads have the same execution path and the instruction is executed concurrently.

SpMV on a GPU
In this section, we present a perfect implementation of CSRbased SpMV on the GPU.Different with other related work, the proposed algorithm involves the following two kernels: where We call the proposed SpMV algorithm PCSR.For simplicity, the symbols used in this study are listed in Table 1.

Kernel 1.
For Kernel 1, its detailed procedure is shown in Algorithm 2. We observe that the accesses to two arrays  and  in global memory are fully coalesced.However, the vector  in global memory is randomly accessed, which results in decreasing the performance of Kernel 1.On the basis of evaluations in [24], the best memory space to place data is the texture memory when randomly accessing the array.Therefore, here texture memory is utilized to place the vector instead of global memory.For the single-precision floating point texture, the fourth step in Algorithm 2 is rewritten as Because the texture does not support double values, the following function ℎ () is suggested to transfer the int2 value to the double value.
Furthermore, for the double-precision floating point texture, based on the function ℎ (), we rewrite the fourth step in Algorithm 2 as

Kernel 2. Kernel 2 accumulates element values of
V that is obtained by Kernel 1 and its detailed procedure is shown in Algorithm 3.This kernel is mainly composed of the following three stages: (i) In the first stage, the array  in global memory is piecewise assembled into shared memory   of each thread block in parallel.Each thread for a thread block is only responsible for loading an element value of  into   except for thread 0 (see lines (05)-(06) in Algorithm 3).The detailed procedure is illustrated in Figure 1.We can see that the accesses to  are aligned.
(ii) The second stage loads element values of V in global memory from the position  [0] to the position  [TB] into shared memory V  for each thread block.The assembling procedure is illustrated in Figure 2. In this case, the access to V is fully coalesced.
(iii) The third stage accumulates element values of V , as shown in Figure 3.The accumulation is highly efficient due to the utilization of two shared memory arrays   and V .
Obviously, Kernel 2 benefits from shared memory.Using the shared memory, not only are the data accessed fast, but also the accesses to data are coalesced.
From the above procedures for PCSR, we observe that PCSR needs additional global memory spaces to store a middle array V besides storing CSR arrays , , and .Saving data into V in Kernel 1 and loading data from V in Kernel 2 to a degree decrease the performance of PCSR.However, PCSR benefits from the middle array V because introducing V makes it access CSR arrays , , and  in a fully coalesced manner.This greatly improves the speed of accessing CSR arrays and alleviates the principal deficiencies of CSR-scalar (rare coalescing) and CSR-vector (partial coalescing).

SpMV on Multiple GPUs
In this section, we will present how to extend PCSR on a single GPU to multiple GPUs.Note that the case of multiple GPUs in a single node (single PC) is only discussed because of its good expansibility (e.g., also used in the multi-CPU and multi-GPU heterogeneous platform).To balance the workload among multiple GPUs, the following two methods can be applied: (1) For the first method, the matrix is equally partitioned into  (number of GPUs) submatrices according to the matrix rows.Each submatrix is assigned to one GPU, and each GPU is only responsible for computing the assigned submatrix multiplication with the complete input vector.
(2) For the second method, the matrix is equally partitioned into  submatrices according to the number of nonzero elements.Each GPU only calculates a submatrix multiplication with the complete input vector.
In most cases, two partitioned methods mentioned above are similar.However, for some exceptional cases, for example, most nonzero elements are involved in a few rows for a matrix, the partitioned submatrices that are obtained by the first method have distinct difference of nonzero elements, and those that are obtained by the second method have different rows.Which method is the preferred one for PCSR?
If each GPU has the complete input vector, PCSR on multiple GPUs will not need to communicate between GPUs.In fact, SpMV is often applied to a large number of iterative methods where the sparse matrix is iteratively multiplied by the input and output vectors.Therefore, if each GPU only includes a part of the input vector before SpMV, the communication between GPUs will be required in order to execute PCSR.Here PCSR implements the communication between GPUs using NVIDIA GPUDirect.

Experimental Setup.
In this section, we test the performance of PCSR.All test matrices come from the University of Florida Sparse Matrix Collection [25].Their properties are summarized in Table 2.
All algorithms are executed on one machine which is equipped with an Intel Xeon Quad-Core CPU and four NVIDIA Tesla C2050 GPUs.Our source codes are compiled and executed using the CUDA toolkit 6.5 under GNU/Linux Ubuntu v10.04.1.The performance is measured in terms of GFlop/s (second) or GByte/s (second).

v_s[i]
∑ ptr_s [1]≤i≤ptr_s [2]   CSR-vector in the CUSP library [5] are chosen in order to show the effects of accessing CSR arrays in a fully coalesced manner in PCSR.CSRMV in the CUSPARSE library [4] is a representative of CSR-based SpMV algorithms on the GPU.HYBMV in the CUSPARSE library [4] is a finely tuned HYB-based SpMV algorithm on the GPU and usually has a better behavior than many existing SpMV algorithms.CSR-Adaptive is a most recently proposed CSR-based algorithm [9].We select 15 sparse matrices with distinct sizes ranging from 25,228 to 2,063,494 as our test matrices.Figure 4 shows the single-precision and double-precision performance results in terms of GFlop/s of CSR-scalar, CSRvector, CSRMV, HYBMV, CSR-Adaptive, and PCSR on a Tesla C2050.GFlop/s values in Figure 4 are calculated on the basis of the assumption of two Flops per nonzero entry for a matrix [3,13].In Figure 5, the measured memory bandwidth results for single precision and double precision are reported.

Single Precision.
From Figure 4(a), we observe that PCSR achieves high performance for all the matrices in the single-precision mode.In most cases, the performance of over 9 GFlops/s can be obtained.Moreover, PCSR outperforms CSR-scalar, CSR-vector, and CSRMV for all test cases, and average speedups of 4.24x, 2.18x, and 1.62x compared to CSR-scalar, CSR-vector, and CSRMV can be obtained, respectively.Furthermore, PCSR has a slightly better behavior than HYBMV for all the matrices except for af shell9 and  cont-300.The average speedup is 1.22x compared to HYBMV. Figure 6 shows the visualization of af shell9 and cont-300.
We can find that af shell9 and cont-300 have a similar structure, and each row for two matrices has a very similar number of nonzero elements, which is suitable to be stored in the ELL section of the HYB format.Particularly, PCSR and CSR-Adaptive have close performance.The average performance of PCSR is nearly 1.05 times faster than CSR-Adaptive.Furthermore, PCSR almost has the best memory bandwidth utilization among all algorithms for all the matrices except for af shell9 and cont-300 (Figure 5(a)).The maximum memory bandwidth of PCSR exceeds 128 GBytes/s, which is about 90 percent of peak theoretical memory bandwidth for the Tesla C2050.Based on the performance metrics [26], we can conclude that PCSR achieves good performance and has high parallelism.4(b) and 5(b), we see that, for all algorithms, both the double-precision performance and memory bandwidth utilization are smaller than the corresponding single-precision values due to the slow software-based operation.PCSR is still better than CSR-scalar, CSR-vector, and CSRMV and slightly outperforms HYBMV and CSR-Adaptive for all the matrices.The average speedup of PCSR is 3.33x compared to CSRscalar, 1.98x compared to CSR-vector, 1.57x compared to CSRMV, 1.15x compared to HYBMV, and 1.03x compared to CSR-Adaptive.The maximum memory bandwidth of PCSR exceeds 108 GBytes/s, which is about 75 percent of peak theoretical memory bandwidth for the Tesla C2050.

PCSR Performance without Communication.
Here we take the double-precision mode, for example, to test the PCSR  performance on multiple GPUs without considering communication.We call PCSR with the first method and PCSR with the second method PCSR-I and PCSR-II, respectively.Some large-sized test matrices in Table 2 are used.The execution time comparison of PCSRI and PCSRII on two and four Tesla C2050 GPUs is listed in Tables 3 and 4, respectively.In Tables 3 and 4, ET, SD, and PE stand for the execution time, standard deviation, and parallel efficiency, respectively.The time unit is millisecond (ms).Figures 7 and 8 show the parallel efficiency of PCSRI and PCSRII on two and four GPUs, respectively.On two GPUs, we observe that PCSRII has better parallel efficiency than PCSRI for all the matrices except for G3 circuit from Table 3 and Figure 7.The maximum, average, and minimum parallel efficiency of PCSRII are 98.06%, 91.64%, and 77.41%, which wholly outperform the corresponding maximum, average, and minimum parallel efficiency of PCSRI 98.06%, 88.16%, and 72.20%.Moreover, PCSRII has a smaller standard deviation than PCSRI for all the matrices except for ecology2, Transport, and G3 circuit.This implies that the workload balance on two GPUs for the second method is advantageous over that for the first method.
On four GPUs, for the parallel efficiency and standard deviation, PCSRII outperforms PCSRI for all the matrices except for G3 circuit (Table 4 and Figure 8).The maximum, average, and minimum parallel efficiency of PCSRII for all the matrices are 96.35%,85.14%, and 64.17% and are advantageous over the corresponding maximum, average, and minimum parallel efficiency of PCSRI 96.21%, 78.89%,On the basis of the above observations, we conclude that PCSRII has high performance and is on the whole better than PCSRI.For PCSR on multiple GPUs, the second method is our preferred one.

PCSR Performance with Communication.
We still take the double-precision mode, for example, to test the PCSR performance on multiple GPUs with considering communication.PCSR with the first method and PCSR with the second method are still called PCSR-I and PCSR-II, respectively.The same test matrices as in the above experiment are utilized.The execution time comparison of PCSRI and PCSRII on two and four Tesla C2050 GPUs is listed in Tables 5 and 6, respectively.The time unit is ms.ET, SD, and PE in Tables 5 and 6 are as the same as those in Tables 3 and 4. Figures 9 and 10 show PCSRI and PCSRII parallel efficiency on two and four GPUs, respectively.On two GPUs, PCSRI and PCSRII have almost close parallel efficiency for most matrices (Figure 9 and Table 5).As a comparison, PCSRII slightly outperforms PCSRI.The maximum, average, and minimum parallel efficiency of PCSRII for all the matrices are 96.34%,88.51%, and 80.44% and are advantageous over the corresponding maximum, average, and minimum parallel efficiency of PCSRI 96.05%, 86.03%, and 73.57%.
On four GPUs, for the parallel efficiency and standard deviation, PCSRII is better than PCSRI for all the matrices except that PCSRI has slightly good parallel efficiency for thermal2, G3 circuit, and Hamrle3 and slightly small standard deviation for thermal2, G3 circuit, ecology2, and CurlCurl 4 (Figure 10 and Table 6).The maximum, average, and minimum parallel efficiency of PCSRII for all the matrices are 90.12%,74.50%, and 58.27%, which are better than the corresponding maximum, average, and minimum parallel efficiency of PCSRI 88.77%, 65.69%, and 56.39%.Therefore, compared to PCSRI and PCSRII without communication, although the performance of PCSRI and PCSRII with communication decreases due to the influence of communication, they still achieve significant performance.Because PCSRII overall outperforms PCSRI for all test matrices, the second method in this case is still our preferred one for PCSR on multiple GPUs.

Conclusion
In this study, we propose a novel CSR-based SpMV on GPUs (PCSR).Experimental results show that our proposed PCSR on a GPU is better than CSR-scalar and CSR-vector in the CUSP library and CSRMV and HYBMV in the CUSPARSE library and a most recently proposed CSR-based algorithm, CSR-Adaptive.To achieve high performance on multiple GPUs for PCSR, we present two matrix-partitioned methods to balance the workload among multiple GPUs.We observe that PCSR can show good performance with and without considering communication using the two matrixpartitioned methods.As a comparison, the second method is our preferred one.Next, we will further do research in this area and develop other novel SpMVs on GPUs.In particular, the future work will apply PCSR to some well-known iterative methods and thus solve the scientific and engineering problems.

Figure 5 :
Figure 5: Effective bandwidth results of all algorithms on a Tesla C2050.

Figure 8 :
Figure 8: Parallel efficiency of PCSRI and PCSRII without communication on four GPUs.

Figure 9 :Figure 10 :
Figure 9: Parallel efficiency of PCSRI and PCSRII with communication on two GPUs.

Table 1 :
Symbols used in this study.

Table 2 :
Properties of test matrices.

Table 3 :
Comparison of PCSRI and PCSRII without communication on two GPUs.

Table 4 :
Comparison of PCSRI and PCSRII without communication on four GPUs.

Table 5 :
Comparison of PCSRI and PCSRII with communication on two GPUs.

Table 6 :
Comparison of PCSRI and PCSRII with communication on four GPUs.