Sparse matrix-vector multiplication (SpMV) is an important operation in scientific computations. Compressed sparse row (CSR) is the most frequently used format for storing sparse matrices. However, CSR-based SpMV implementations on graphics processing units (GPUs), such as CSR-scalar and CSR-vector, usually perform poorly because of their irregular memory access patterns. This motivates us to propose a perfect CSR-based SpMV on the GPU, called PCSR. PCSR involves two kernels and accesses the CSR arrays in a fully coalesced manner by introducing a middle array, which greatly alleviates the deficiencies of CSR-scalar (rare coalescing) and CSR-vector (partial coalescing). Test results on a single C2050 GPU show that PCSR outperforms CSR-scalar, CSR-vector, and the CSRMV and HYBMV routines in the vendor-tuned CUSPARSE library, and is comparable with a recently proposed CSR-based algorithm, CSR-Adaptive. Furthermore, we extend PCSR from a single GPU to multiple GPUs. Experimental results on four C2050 GPUs show that, whether or not the communication between GPUs is taken into account, PCSR on multiple GPUs achieves good performance and high parallel efficiency.
Sparse matrix-vector multiplication (SpMV) has proven to be an important operation in scientific computing. It needs to be accelerated because SpMV represents the dominant cost in many iterative methods for solving large-sized linear systems and eigenvalue problems that arise in a wide variety of scientific and engineering applications [
SpMV is a largely memory-bandwidth-bound operation. Reported results indicate that different access patterns to the matrix and the vectors on the GPU influence SpMV performance [
Besides using variants of CSR, many highly efficient SpMVs on GPUs have been proposed that utilize variants of the ELL and COO storage formats, such as ELLPACK-R [
All the above observations motivate us to further investigate how to construct efficient SpMVs on GPUs while keeping CSR intact. In this study, we propose a perfect CSR algorithm on GPUs, called PCSR. PCSR is composed of two kernels and accesses the CSR arrays in a fully coalesced manner. Experimental results on C2050 GPUs show that PCSR outperforms CSR-scalar and CSR-vector and behaves better than the CSRMV and HYBMV routines in the vendor-tuned CUSPARSE library [
The main contributions in this paper are summarized as follows:
A novel SpMV implementation on a GPU, which keeps CSR intact, is proposed. The proposed algorithm consists of two kernels and alleviates the deficiency of many existing CSR algorithms, which access the CSR arrays in a rarely or only partially coalesced manner.
Our proposed SpMV algorithm on a GPU is extended to multiple GPUs. Moreover, we suggest two methods to balance the workload among multiple GPUs.
The rest of this paper is organized as follows. Following this introduction, the matrix storage, CUDA architecture, and SpMV are described in Section
To take advantage of the large number of zeros in sparse matrices, special storage formats are required. In this study, only the compressed sparse row (CSR) format is considered, although there are many other sparse matrix storage formats, such as ELLPACK (or ITPACK) [
For example, the following matrix
The compute unified device architecture (CUDA) is a heterogeneous computing model that involves both the CPU and the GPU [
Assume that
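For context, the classic CSR-scalar scheme assigns one thread to each row; the per-row dot product it performs can be sketched serially in C (an illustrative sketch, not the paper's exact kernel):

```c
/* y = A*x for a CSR matrix: each row's dot product is independent, and
 * CSR-scalar gives exactly one row to each GPU thread. */
void csr_spmv(int n, const int *row_ptr, const int *col_idx,
              const double *value, const double *x, double *y) {
    for (int i = 0; i < n; i++) {        /* plays the role of one thread */
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            sum += value[j] * x[col_idx[j]];
        y[i] = sum;
    }
}
```

On a GPU, the inner loop makes adjacent threads read value and col_idx at offsets a whole row apart, which is why CSR-scalar coalesces rarely; CSR-vector assigns a warp per row and coalesces only partially.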
In this section, we present a perfect implementation of CSR-based SpMV on the GPU. Unlike other related work, the proposed algorithm involves the following two kernels:
We call the proposed SpMV algorithm PCSR. For simplicity, the symbols used in this study are listed in Table
Symbols used in this study.
Symbol  Description
…  Sparse matrix
…  Input vector
…  Output vector
…  Size of the input and output vectors
…  Number of nonzero elements in the sparse matrix
…  Number of threads per block
…  Number of blocks per grid
…  Number of elements calculated by each thread
…  Size of shared memory
…  Number of GPUs
For
Furthermore, for double-precision floating-point texture fetches, based on the function
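On Fermi-era GPUs, a texture fetch cannot return a 64-bit double directly, so the input vector is typically bound as an int2 texture and each double is reassembled from its two 32-bit halves (CUDA provides the __hiloint2double intrinsic for this). A CPU sketch of the reassembly, with hypothetical helper names:

```c
#include <stdint.h>
#include <string.h>

/* Reassemble a double from its high and low 32-bit words, mirroring what
 * CUDA's __hiloint2double(hi, lo) does after an int2 texture fetch. */
double hilo_to_double(int32_t hi, int32_t lo) {
    uint64_t bits = ((uint64_t)(uint32_t)hi << 32) | (uint32_t)lo;
    double d;
    memcpy(&d, &bits, sizeof d);   /* bit-level reinterpretation */
    return d;
}

/* Split a double into the two 32-bit words a texture fetch would return. */
void double_to_hilo(double d, int32_t *hi, int32_t *lo) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    *hi = (int32_t)(bits >> 32);
    *lo = (int32_t)(bits & 0xffffffffu);
}
```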
CUDA-specific variables:
(i) threadIdx.x: a thread's index within its block
(ii) blockIdx.x: a block's index within the grid
(iii) blockDim.x: number of threads per block
(iv) gridDim.x: number of blocks per grid
In the first stage, the array
The second stage loads element values of
The third stage accumulates element values of
CUDA-specific variables:
(i) threadIdx.x: a thread's index within its block
(ii) blockIdx.x: a block's index within the grid
(iii) blockDim.x: number of threads per block
(iv) gridDim.x: number of blocks per grid
(Second kernel listing: it defines two shared-memory arrays and separates its three stages with __syncthreads() barriers.)
First stage of the second kernel.
Second stage of the second kernel.
Third stage of the second kernel.
Obviously,
From the above procedures for PCSR, we observe that PCSR needs additional global memory space to store a middle array
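Our reading of the two kernels above can be sketched serially in C (an illustration of the idea, not the authors' code): the first pass walks value and col_idx contiguously, which on the GPU maps to fully coalesced loads, and writes each product into the middle array; the second pass reduces each row's segment of the middle array using row_ptr.

```c
/* Phase 1: one contiguous pass over the nnz entries fills the middle
 * array; consecutive GPU threads would touch consecutive k, so all
 * three arrays are read/written in a fully coalesced manner. */
void pcsr_phase1(int nnz, const int *col_idx, const double *value,
                 const double *x, double *middle) {
    for (int k = 0; k < nnz; k++)
        middle[k] = value[k] * x[col_idx[k]];
}

/* Phase 2: per-row segmented reduction of the middle array. */
void pcsr_phase2(int n, const int *row_ptr, const double *middle,
                 double *y) {
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += middle[k];
        y[i] = sum;
    }
}
```

Splitting the work this way trades one extra nnz-sized array (and a second kernel launch) for regular, coalesced traffic in the expensive first pass.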
In this section, we present how to extend PCSR from a single GPU to multiple GPUs. Note that only the case of multiple GPUs in a single node (a single PC) is discussed, because of its good extensibility (e.g., it also applies to heterogeneous multi-CPU and multi-GPU platforms). To balance the workload among multiple GPUs, the following two methods can be applied:
For the first method, the matrix is partitioned by rows into as many submatrices as there are GPUs, such that each submatrix contains (nearly) the same number of rows.
For the second method, the matrix is partitioned by rows into as many submatrices as there are GPUs, such that each submatrix contains (nearly) the same number of nonzero elements.
In most cases, the two partitioning methods behave similarly. However, in some exceptional cases, for example, when most nonzero elements of a matrix are concentrated in a few rows, the submatrices obtained by the first method differ markedly in their numbers of nonzero elements, and those obtained by the second method differ in their numbers of rows. Which method is the preferred one for PCSR?
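A sketch of the second method, under our reading (the helper name and exact tie-breaking are our own): walk row_ptr and place a cut whenever the nonzeros seen so far reach the next share of nnz/g, keeping cuts on row boundaries so every part covers whole rows.

```c
/* Split n rows into g contiguous parts holding roughly nnz/g nonzeros
 * each; rows [part[p], part[p+1]) are owned by GPU p. */
void split_by_nnz(int n, int g, const int *row_ptr, int *part) {
    int nnz = row_ptr[n];
    int p = 1;
    part[0] = 0;
    for (int i = 1; i < n && p < g; i++) {
        /* cut once the nonzeros before row i reach the p-th share */
        if ((long long)row_ptr[i] * g >= (long long)p * nnz)
            part[p++] = i;
    }
    while (p <= g)          /* any remaining boundaries close at n */
        part[p++] = n;
}
```

Because cuts stay on row boundaries, the parts may differ in row count while staying close in nonzeros; the first method would simply set part[p] = p * n / g.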
If each GPU holds the complete input vector, PCSR on multiple GPUs does not need any communication between GPUs. In practice, however, SpMV is often applied inside iterative methods, where the sparse matrix is repeatedly multiplied by vectors and the output vector of one iteration feeds the next. Therefore, if each GPU holds only a part of the input vector before SpMV, communication between GPUs is required in order to execute PCSR. Here PCSR implements the communication between GPUs using NVIDIA GPUDirect.
In this section, we test the performance of PCSR. All test matrices come from the University of Florida Sparse Matrix Collection [
Properties of test matrices.
Name  Rows  Nonzeros (nz)  nz/row  Description 

epb2  25,228  175,027  6.94  Thermal problem 
ecl32  51,993  380,415  7.32  Semiconductor device 
bayer01  57,735  277,774  4.81  Chemical process 
g7jac200sc  59,310  837,936  14.13  Economic problem 
finan512  74,752  335,872  4.49  Economic problem 
2cubes_sphere  101,492  1,647,264  16.23  Electromagnetics 
torso2  115,967  1,033,473  8.91  2D/3D problem 
FEM_3D_thermal2  147,900  3,489,300  23.59  Nonlinear thermal 
scircuit  170,998  958,936  5.61  Circuit simulation 
cont300  180,895  988,195  5.46  Optimization problem 
Ga41As41H72  268,096  18,488,476  68.96  Pseudopotential method 
F1  343,791  26,837,113  78.06  Stiffness matrix 
rajat24  358,172  1,948,235  5.44  Circuit simulation 
language  399,130  1,216,334  3.05  Directed graph 
af_shell9  504,855  17,588,845  34.84  Sheet metal forming 
ASIC_680ks  682,712  2,329,176  3.41  Circuit simulation 
ecology2  999,999  4,995,991  5.00  Circuit theory 
Hamrle3  1,447,360  5,514,242  3.81  Circuit simulation 
thermal2  1,228,045  8,580,313  6.99  Unstructured FEM 
cage14  1,505,785  27,130,349  18.01  DNA electrophoresis 
Transport  1,602,111  23,500,731  14.67  Structural problem 
G3_circuit  1,585,478  7,660,826  4.83  Circuit simulation 
kkt_power  2,063,494  12,771,361  6.19  Optimization problem 
CurlCurl_4  2,380,515  26,515,867  11.14  Model reduction 
memchip  2,707,524  14,810,202  5.47  Circuit simulation 
Freescale1  3,428,755  18,920,347  5.52  Circuit simulation 
All algorithms are executed on one machine equipped with an Intel Xeon quad-core CPU and four NVIDIA Tesla C2050 GPUs. Our source codes are compiled and executed using the CUDA Toolkit 6.5 under GNU/Linux Ubuntu 10.04.1. Performance is measured in GFlop/s (gigaflops per second) or GByte/s (gigabytes per second).
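The flop count behind the GFlop/s figures is presumably the usual SpMV convention of one multiply and one add per nonzero; under that assumption:

```c
/* One SpMV performs one multiply and one add per nonzero, so we assume
 * the common convention of 2*nnz floating-point operations. */
double spmv_gflops(long long nnz, double seconds) {
    return 2.0 * (double)nnz / seconds / 1e9;
}
```

Effective bandwidth (GByte/s) divides bytes moved by time in the same way; the byte count depends on the storage format and caching, so it is not shown here.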
We compare PCSR with CSR-scalar, CSR-vector, CSRMV, HYBMV, and CSR-Adaptive. CSR-scalar and CSR-vector in the CUSP library [
We select 15 sparse matrices, with sizes ranging from 25,228 to 2,063,494 rows, as our test matrices. Figure
Performance of all algorithms on a Tesla C2050.
Single precision
Double precision
Effective bandwidth results of all algorithms on a Tesla C2050.
Single precision
Double precision
From Figure
Visualization of the af_shell9 and cont300 matrices.
cont300
af_shell9
Furthermore, PCSR has almost the best memory bandwidth utilization among all algorithms for all the matrices except af_shell9 and cont300 (Figure
From Figures
Here we take the double-precision mode as an example to test the PCSR performance on multiple GPUs without considering communication. We call PCSR with the first method PCSR-I and PCSR with the second method PCSR-II. Some large-sized test matrices in Table
Comparison of PCSR-I and PCSR-II without communication on two GPUs (ET: execution time; SD: standard deviation; PE: parallel efficiency in %).
Matrix  ET (GPU)  PCSR-I (2 GPUs): ET / SD / PE  PCSR-II (2 GPUs): ET / SD / PE
2cubes_sphere  0.4444  0.2670 / 0.0178 / 83.21  … / … / …
scircuit  0.3484  0.2413 / 0.0322 / 72.20  … / … / …
Ga41As41H72  4.2387  2.3084 / 0.0446 / 91.81  … / … / …
F1  6.5544  3.8865 / 0.7012 / 84.32  … / … / …
ASIC_680ks  0.8196  0.4567 / 0.0126 / 89.72  … / … / …
ecology2  1.2321  0.6665 / … / 92.42  … / 0.0152 / …
Hamrle3  1.7684  0.9651 / 0.0478 / 91.61  … / … / …
thermal2  2.0708  1.0559 / 0.0056 / 98.06  … / … / …
cage14  5.9177  3.4757 / 0.5417 / 85.13  … / … / …
Transport  4.7305  2.4665 / … / 95.89  … / 0.0407 / …
G3_circuit  1.9731  … / … / …  1.1061 / 0.1148 / 89.18
kkt_power  4.3465  2.7916 / 0.7454 / 77.85  … / … / …
CurlCurl_4  5.1605  2.7107 / 0.0347 / 95.18  … / … / …
memchip  3.8257  2.1905 / 0.3393 / 87.32  … / … / …
Freescale1  5.0524  3.0235 / 0.5719 / 83.55  … / … / …
Comparison of PCSR-I and PCSR-II without communication on four GPUs.
Matrix  ET (GPU)  PCSR-I (4 GPUs): ET / SD / PE  PCSR-II (4 GPUs): ET / SD / PE
2cubes_sphere  0.4444  0.1560 / 0.0132 / 71.23  … / … / …
scircuit  0.3484  0.1453 / 0.0262 / 59.94  … / … / …
Ga41As41H72  4.2387  1.6123 / 0.7268 / 65.72  … / … / …
F1  6.5544  2.5240 / 0.6827 / 64.92  … / … / …
ASIC_680ks  0.8196  0.2944 / 0.0298 / 69.59  … / … / …
ecology2  1.2321  0.3593 / 0.0160 / 85.72  … / … / …
Hamrle3  1.7684  0.5114 / 0.0307 / 86.45  … / … / …
thermal2  2.0708  0.5553 / 0.0271 / 93.22  … / … / …
cage14  5.9177  1.8126 / 0.3334 / 81.62  … / … / …
Transport  4.7305  1.2292 / 0.0270 / 96.21  … / … / …
G3_circuit  1.9731  … / … / …  0.6195 / 0.0790 / 79.63
kkt_power  4.3465  1.4974 / 0.5147 / 72.57  … / … / …
CurlCurl_4  5.1605  1.3554 / 0.0153 / 95.18  … / … / …
memchip  3.8257  1.1439 / 0.1741 / 83.61  … / … / …
Freescale1  5.0524  1.7588 / 0.4039 / 71.81  … / … / …
Parallel efficiency of PCSR-I and PCSR-II without communication on two GPUs.
Parallel efficiency of PCSR-I and PCSR-II without communication on four GPUs.
On two GPUs, we observe that PCSR-II has better parallel efficiency than PCSR-I for all the matrices except G3_circuit from Table
On four GPUs, for the parallel efficiency and standard deviation, PCSR-II outperforms PCSR-I for all the matrices except for
On the basis of the above observations, we conclude that PCSR-II achieves high performance and is, on the whole, better than PCSR-I. For PCSR on multiple GPUs, the second method is therefore our preferred one.
We again take the double-precision mode as an example to test the PCSR performance on multiple GPUs when communication is considered. PCSR with the first method and PCSR with the second method are still called PCSR-I and PCSR-II, respectively. The same test matrices as in the previous experiment are used. The execution time comparison of PCSR-I and PCSR-II on two and four Tesla C2050 GPUs is listed in Tables
Comparison of PCSR-I and PCSR-II with communication on two GPUs.
Matrix  ET (GPU)  PCSR-I (2 GPUs): ET / SD / PE  PCSR-II (2 GPUs): ET / SD / PE
2cubes_sphere  0.4444  … / … / …  0.2503 / … / 88.75
scircuit  0.3484  0.2234 / 0.0154 / 77.95  … / … / …
Ga41As41H72  4.2387  … / … / …  2.3795 / 0.0521 / 89.07
F1  6.5544  3.9252 / 0.6948 / 83.49  … / … / …
ASIC_680ks  0.8196  … / … / …  0.4998 / 0.0178 / 81.99
ecology2  1.2321  0.6865 / … / 89.74  … / … / …
Hamrle3  1.7684  1.0221 / 0.0209 / 86.50  … / … / …
thermal2  2.0708  1.1403 / 0.0230 / 90.80  … / … / …
cage14  5.9177  3.5756 / 0.5644 / 82.75  … / … / …
Transport  4.7305  2.4623 / 0.0203 / 96.05  … / … / …
G3_circuit  1.9731  … / … / …  1.1766 / 0.0896 / 83.84
kkt_power  4.3465  2.9539 / 0.6973 / 73.57  … / … / …
CurlCurl_4  5.1605  2.7064 / 0.0092 / 95.34  … / … / …
memchip  3.8257  2.3218 / 0.3467 / 82.39  … / … / …
Freescale1  5.0524  3.1216 / 0.5868 / 80.92  … / … / …
Comparison of PCSR-I and PCSR-II with communication on four GPUs.
Matrix  ET (GPU)  PCSR-I (4 GPUs): ET / SD / PE  PCSR-II (4 GPUs): ET / SD / PE
2cubes_sphere  0.4444  0.1567 / 0.0052 / 70.89  … / … / …
scircuit  0.3484  0.1544 / 0.0204 / 56.39  … / … / …
Ga41As41H72  4.2387  1.7157 / 0.7909 / 61.76  … / … / …
F1  6.5544  2.1149 / 0.3833 / 77.48  … / … / …
ASIC_680ks  0.8196  0.3449 / 0.0187 / 59.39  … / … / …
ecology2  1.2321  0.4257 / … / 72.35  … / 0.0056 / …
Hamrle3  1.7684  … / 0.0087 / …  0.6297 / … / 70.21
thermal2  2.0708  … / … / …  0.6959 / 0.0269 / 74.39
cage14  5.9177  1.9339 / 0.3442 / 76.50  … / … / …
Transport  4.7305  1.3323 / 0.0279 / 88.77  … / … / …
G3_circuit  1.9731  … / … / …  0.7458 / 0.0620 / 66.14
kkt_power  4.3465  1.7277 / 0.5495 / 62.89  … / … / …
CurlCurl_4  5.1605  1.5065 / … / 85.63  … / 0.8789 / …
memchip  3.8257  1.3804 / 0.1768 / 69.29  … / … / …
Freescale1  5.0524  2.0711 / 0.4342 / 60.98  … / … / …
Parallel efficiency of PCSR-I and PCSR-II with communication on two GPUs.
Parallel efficiency of PCSR-I and PCSR-II with communication on four GPUs.
On two GPUs, PCSR-I and PCSR-II achieve similar parallel efficiency for most matrices (Figure
On four GPUs, for the parallel efficiency and standard deviation, PCSR-II is better than PCSR-I for all the matrices, except that PCSR-I has slightly better parallel efficiency for
Therefore, although the performance of PCSR-I and PCSR-II with communication decreases relative to the communication-free case, due to the cost of communication, they still achieve good performance. Because PCSR-II overall outperforms PCSR-I on the test matrices, the second method in this case remains our preferred one for PCSR on multiple GPUs.
In this study, we propose a novel CSR-based SpMV on GPUs, called PCSR. Experimental results show that PCSR on a single GPU outperforms CSR-scalar and CSR-vector in the CUSP library as well as CSRMV and HYBMV in the CUSPARSE library, and compares favorably with a recently proposed CSR-based algorithm, CSR-Adaptive. To achieve high performance on multiple GPUs, we present two matrix-partitioning methods to balance the workload among GPUs. We observe that, with both partitioning methods, PCSR shows good performance whether or not communication is considered. Of the two, the second method is our preferred one.
In future work, we will continue research in this area and develop other novel SpMVs on GPUs. In particular, we plan to apply PCSR to some well-known iterative methods and thereby solve scientific and engineering problems.
The authors declare that they have no competing interests.
The research has been supported by the Chinese Natural Science Foundation under Grant no. 61379017.