Computing Low-Rank Approximation of a Dense Matrix on Multicore CPUs with a GPU and Its Application to Solving a Hierarchically Semiseparable Linear System of Equations

Low-rank matrices arise in many scientific and engineering computations. Both computational and storage costs of manipulating such matrices may be reduced by taking advantages of their low-rank properties. To compute a low-rank approximation of a dense matrix, in this paper, we study the performance of QR factorization with column pivoting or with restricted pivoting on multicore CPUs with a GPU.We first propose several techniques to reduce the postprocessing time, which is required for restricted pivoting, on a modern CPU. We then examine the potential of using a GPU to accelerate the factorization process with both column and restricted pivoting.Our performance results on two eight-core Intel SandyBridgeCPUswith oneNVIDIAKeplerGPUdemonstrate that using the GPU, the factorization time can be reduced by a factor of more than two. In addition, to study the performance of our implementations in practice, we integrate them into a recently developed software StruMF which algebraically exploits such low-rank structures for solving a general sparse linear system of equations. Our performance results for solving Poisson’s equations demonstrate that the proposed techniques can significantly reduce the preconditioner construction time of StruMF on the CPUs, and the construction time can be further reduced by 10%–50% using the GPU.


Introduction
In applied and numerical mathematics or in scientific and engineering simulations, we often encounter low-rank matrices, and more frequently, we encounter matrices whose submatrices are low-rank.We can reduce both computational and storage requirements of manipulating many of these matrices by taking advantages of their low-rank properties.In this paper, we study the performance of the following two algorithms for computing a low-rank approximation of a dense matrix on multicore CPUs and investigate the potential of using a GPU to accelerate the process: (1) the QR factorization with column pivoting (QP3) [1,2] and (2) the QR factorization with restricted pivoting (QPR) [3,4].In addition, to study the performance of QP3 and QPR in practice, we integrate our implementations of QP3 and QPR into StruMF [5,6] which is a recently developed software for solving a general sparse linear system of equations.StruMF algebraically exploits a low-rank structure, referred to as a hierarchically semiseparable (HSS) structure [7], of a coefficient matrix while computing and applying a preconditioner, and uses QP3 and QPR for computing the low-rank approximation of the dense submatrices.In many cases, the preconditioner construction time of StruMF is dominated by the low-rank approximation of the submatrices.
The rest of the paper is organized as follows.First, in Section 2, we review QP3 and QPR.Then, in Section 3, after analyzing the performance of StruMF on a CPU, we propose several techniques to improve the QPR performance on the CPU.Next, in Section 4, we describe our implementations of QP3 and QPR using a GPU.Finally, in Section 5, we show their performance on multicore CPUs with a GPU and its impact on the performance of StruMF.We provide our final remarks in Section 6.Throughout this paper, the th column of a matrix  is denoted by a  , while   1 : 2 , 1 : 2 is the submatrix consisting of the  1 th through the  2 th rows and the  1 th through the  2 th columns of .
More detailed discussion on the RRQR factorizations can be found in [9] and the references therein.
In the following subsections, we review the existing algorithms to compute such RRQR factorizations.Namely, we first review the blocked versions of Householder QR [8] (Section 2.1).We then outline QP3 [1,2] which is a greedy algorithm for solving task-1 and is implemented in LAPACK [10] (Section 2.2).Next, we describe how an algorithm solving task-1 can be modified to solve task-2 and present three types of so-called hybrid algorithms [9] that are theoretically guaranteed to solve task-1 or task-2, or both (Section 2.3).Finally, we discuss QPR [3] that uses the hybrid algorithm as a postprocessing scheme to reduce the computational bottleneck of QP3, while ensuring the rank-revealing properties of the computed factorization (Section 2.4).
2.1.Blocked QR Algorithm.The th step of Householder QR [8] generates the Householder transformations to zero out the off-diagonal elements in the th column of  (for  = 1, 2, . . ., ).To improve the data locality of the factorization, a blocked version of the algorithm (QR3) accumulates   Householder transformations and uses a BLAS-3 to apply the accumulated transformations at once to the trailing submatrix; that is, for  = 1, 2, . . ., /  , and  = ( − 1)  (we assume that both  and  are multiples of   , but the discussion can be easily extended for other cases),  :,+  : :=  []  :,:  :,+  : , where  [] is the product of the   Householder matrices; that is,  [] =  [,  :−1:1] =  [,  ]  [,  −1] ⋅ ⋅ ⋅  [,1] and  [,] =  −  []   v []   v []  .(The matrices  [] is not explicitly formed.Instead, we store v []  :, in the lower-triangular part of a  and store  []   in an -length vector.)This matrix  [] can be represented in a so-called  form [11]: where w []    =  []    [,−1:−1:1] v []   .Since  = 1, 2, . . .,   , these  − 1 Householder matrices are applied to the vector v []    in addition to the matrix ; this blocked algorithm requires about   / times more floating point operations (flops) than the unblocked algorithm.However, on a modern computer, the blocked algorithm takes advantage of the memory hierarchy and often obtains a significant speedup over an unblocked algorithm.

QP3 Algorithm.
At each step of an RRQR factorization, selecting an optimal pivot to satisfy task-1 (7) or task-2 (8), or both, is likely to be a combinatorial optimization problem.To reduce the computational cost, a greedy algorithm is generally used.In this section, we discuss such a greedy algorithm for solving task-1.Specifically, at the ( + 1)th step (for  = 0, 1, . . .,  − 1), assuming that the  well-conditioned columns of  have been already selected and factorized, the next pivot column is selected from the remaining  −  columns such that the smallest singular value of the  + 1 selected columns are maximized: setup:  := 0 and  := 0 while  <  do (1) panel factorization: for  = 1, . . .,   do generation of Householder transformation:
Algorithm 2 shows a blocked version (QP3) of QRP [2].One difference between QP3 and QR3 of Algorithm 1 is at the trailing submatrix update.Specifically, QR3 stores the product of the   Householder matrices in the  form (10) (or using the matrix  [] to reduce the memory requirement) and updates the trailing submatrix by two matrixmatrix multiplies (our implementation takes advantage of the triangular structures of both  [] and  [] ); that is, after the th panel factorization, the trailing submatrix is updated by  :=  −  []  [] where  [] =    [] .On the other hand, the th step of the th QP3 panel factorization computes the th column of the matrix-matrix product    [] and uses the resulting vector f []   to update the th row of , which, in return, is needed to downdate the column norms of the trailing submatrix (Steps 1.5 and 1.6 of Algorithm 2).Finally, QP3 updates the trailing submatrix by a single matrix-matrix multiply,  :=  −  []  [] .Since QP3 saves the result of the matrix-matrix product    [] in the auxiliary matrix  [] , QP3 and QR3 require about the same numbers of flops.However, QP3 performs about a half of the total flops using BLAS-2, while QR3 performs most of its flops using BLAS-3.
In some cases, due to the round-off errors, the downdated norms   diverge significantly from the true norms [14].When this occurs, the trailing submatrix is immediately updated with the outstanding Householder transformations, and the column norms are recomputed.If the column norms must be frequently recomputed, then in comparison to QR3, QP3 not only requires significantly more flops for recomputing the norms but also exhibits a poorer data locality since the trailing submatrices are updated using smaller blocks.

Hybrid Algorithms.
In this section, we first show how an algorithm solving task-1 can be used to solve task-2, and then describe so-called hybrid algorithms for solving task-1 or task-2, or both.The algorithms presented in this subsection  are used as a postprocessing to improve the numerical properties of an RRQR factorization, and the matrix  is of upper-triangular form.For instance, Algorithm 3(a) shows the QP3 algorithm applied to an upper triangular matrix .To simplify our notation, we write the RRQR factorization as  =  [  [1,1]  [1,2]    [2,2] ] , where  [1,1] is the -by- leading submatrix.Then, task-2 is equivalent to task-3: max Hence, to solve task-2, we apply an algorithm that solves task-1 to the transpose-inverse of the matrix .Namely, if we have such an RRQR factorization, [2,2] ] , then we also have [1,1] ) [2,2] ) −  [1,2]  [1,1]  ( [2,2] ) Moreover, if P is the permutation matrix with ones on the antidiagonal, then where  P is a permutation matrix,  P is an orthogonal matrix, and ) is minimized.Hence, the factorization ( 21) is an RRQR factorization solving task-2.This is called the unification principle of the algorithms solving task-1 and task-2 [9].
Hence, at the ( + 1)th step, the algorithm pivots the ( + )th column, whose corresponding element V ,1 of the most dominant right singular vector v 1 has the largest module.Algorithm 3(b) shows the Chan-II algorithm which is the task-2 version of the Chan-I algorithm and pushes the column that minimizes  max ( [2,2] ) into the trailing submatrix at each step.Finally, Algorithm 4 shows a single iteration of three hybrid algorithms Hybrid-I, Hybrid-II, and Hybrid-III [9] that solve task-1, task-2, and both task-1 and task-2, respectively.The iteration is terminated when the th column is not moved.

QPR Algorithm.
Even when QP3 and QR3 performs about the same number of flops, QP3 is often slower.This is largely because QR3 performs most of its flops using BLAS-3, while QP3 performs about half of its flops using BLAS-2.Since this BLAS-2 is needed to select the pivot among all the remaining columns, QPR [3] tries to reduce the bottleneck by selecting the pivot among a fixed number,   , of the columns.Algorithm 5(a) shows the pseudocode of QPR.At each step of the panel factorization, while QP3 computes the matrix-vector product with the whole trailing submatrix, QPR updates the columns within this window using a Householder matrix (both by BLAS-2).Since the pivots are now selected only within the window, in order to ensure the rank-revealing properties, QPR uses a condition estimator and accepts only the pivots that satisfy (5).After all the columns are either accepted or rejected (Step 1 in Algorithm 5(a)), the QR factorization of the rejected columns is computed to obtain the upper-triangular matrix  (Steps 2 and 3).At the end, in comparison to QP3, QPR performs a fewer flops using BLAS-2 to select the pivots.However, the   −   columns of the trailing submatrix are now updated using BLAS-2.In Sections 3 and 5.1, we compare the performance of QPR and QP3 on a CPU and on multicore CPUs with a GPU, respectively.
Since QPR selects the pivots only within the windows, it requires postprocessing to globally ensure the rank-revealing property.Though there are two postprocessing options [9,15], here, we focus on the first [9] because it has shown to be more effective in our experiments.Algorithm 5(b) shows the pseudocode of this postprocessing scheme, which modifies the Hybrid-III algorithm of Algorithm 4 to improve its convergence rate [3].

Case Studies with StruMF on a CPU
The performance of both QP3 and QRP depends on the input matrix .To study their performance, in this section, we provide their case studies with StruMF for solving 3D sevenpoint Poisson's equations on one core Intel Sandy Bridge CPU.In our experiments, we use the same default parameters used for solving Poisson's equations with StruMF in [5] (e.g., the numerical rank tolerance is set to be  = 5 × 10 −1 , and FGMRES(30) is considered to be converged when the residual norm is reduced at least by the order of 10 −6 ), and for QPR, we used the default window size recommended in [3] (i.e.,   =   + min(, , max{10,   /2 + 0.05})).

Performance of Original StruMF and QRP. Figure 1(a)
shows the breakdown of the StruMF time using QP3.It clearly shows that StruMF spends most of its preconditioner construction time in QP3.Moreover, the percentage of the time spent in QP3 often increases as the global matrix dimension increases.To analyze the performance of QP3, Figure 2 shows the dimensions of the off-diagonal blocks for which the low-rank approximations are computed.Most of these off-diagonal blocks are short and wide having a few hundreds rows and tens of thousands of columns.In addition, the dimensions of the off-diagonal blocks, especially, their numbers of columns, often increase with respect to the global dimension.Hence, if the compression time of these shortwide blocks can be reduced using another algorithm or a GPU, then the StruMF solution time may be significantly reduced.Figure 3 compares the StruMF solution times using QP3 and QPR.The figure clearly indicates that even though the factorization time is reduced using QPR, the postprocessing can be expensive.In the next subsection, we propose modifications to the QPR implementation, which often reduce the postprocessing time and make QPR more competitive.Just for reference, Figure 3 compares the StruMF solution time with that of a direct multifrontal factorization.We see that for a large enough system, StruMF can reduce not only the memory requirement (see Figure 1(b)) but also the solution time of the direct factorization.

Proposed Modifications to Original QPR Implementation.
The performance of QPR depends strongly on the condition estimator used to evaluate Step 1.1.1 of Algorithm 5(a).In the original implementation, the smallest singular value of  [1,1]  is estimated using an incremental condition estimator (ICE) [16,17], while the largest singular value is estimated by the product of the largest column norm and the third root of the matrix dimension (i.e., ( + 1) 1/3 max +1 =1 ‖ r  ‖ 2 where  [1,1] is of dimension  + 1, and   is the th column of  [1,1] ) [3].This simple estimator is used for the largest singular value because though the estimator may lead to a greater estimation error, it requires a fewer flops (i.e., ICE requires () flops to update the estimation of  [1,1] ).During the postprocessing phase, the same estimators are used at Step 1 of Algorithm 3(b), but in order to obtain an accurate final factorization, both the largest and smallest singular values are estimated by ICE at Step 3 of Algorithm 5(b).We found that, in our numerical experiments using random matrices, this simple estimator underestimates the largest singular values.As a result, many components that should be rejected are accepted.In addition, during the preconditioner construction for solving Poisson's equation by StruMF, the simple estimator overestimates the singular values, rejecting the components which should be accepted.In either case, this estimation error could significantly increase the postprocessing cost.Figure 4(a) shows that when a more accurate condition estimator (i.e., ICE) is used, the postprocessing time can be dramatically reduced.In addition, ICE reduced the factorization time slightly because most of the components are accurately accepted by Step 1 of Algorithm 5(a), and Step 2 has less work.
During the postprocessing (Algorithm 5(b)), the Golub-I algorithm requires the column norms of the trailing submatrix  :,: , and the Chan-II algorithm requires the column norms of the leading submatrix  1:+1,1:+1 .At the beginning of Step 1 to call the Chan-II algorithm, the original implementation computes the column norms of the leading submatrix  1:−1,1:−1 .Then, the norms ‖r  ‖ 2 and ‖r +1 ‖ 2 are computed as they are permuted into  1:+1,1:+1 .On the other hand, for each Golub-I call, the column norms of the trailing submatrix  :,: are recomputed.This is because when the Chan-II algorithm applies Given's rotation to the leading submatrix  1:,1: , the column norms of the trailing submatrix  :,: changes.In our experiments, this norm computation could become expensive, especially when Step 1 requires many iterations or a large tolerance  is used.To reduce this computational cost, we incrementally update or downdate the norms when Given's rotation is applied and when the submatrix size  changes at the outer iteration of Algorithm 5(b).In addition, we use the same criteria used in QP3 [14] to detect if the updated norms have diverged from the actual norms due to the round-off errors.When this happens, the column norms are recomputed.Figure 4(b) clearly indicates that the postprocessing time of StruMF can be significantly reduced by avoiding the norm computation.
For Step 2 of the factorization in Algorithm 5(a), the original implementation uses the column-wise QRP algorithm.On a modern computer, a blocked algorithm can obtain significant speedups.Hence, we replaced it with QP3.In addition, we use ICE to detect the numerical rank instead of the simple estimator used in the original implementation (i.e., QP3 terminates when the estimated condition number of the leading submatrix  1:+1,1:+1 becomes greater than the threshold ). Figure 5(a) shows that the blocked algorithm reduces the factorization time, but the overall improvement is not significant since after ICE is integrated, only a short time is spent at Step 2 for solving this Poisson's equation.
Finally, while QP3 aims to solve task-1, the QPR postprocessing tries to solve both task-1 and task-2.Hence, we replaced it with the postprocessing algorithms for solving  only either task-1 or task-2.Even though the postprocessing solved different tasks, StruMF converged in a similar number of iterations.However, the task-1 postprocessing converged slightly faster than the task-2 postprocessing, which was faster than the original postprocessing.We have also used the test matrices from the original paper [3] to evaluate the quality of the factorization after the postprocessing.Figure 5(c) shows that the quality of the factorization did not change significantly when different postprocessing schemes were used.We use the original postprocessing scheme in the remaining of the paper since it provides the most robust behavior, while the overhead is not significant after all the proposed modifications are integrated.Figure 5(d) compares the StruMF solution time using QP3 and QPR after all the modifications are made.Now, the solution time using QPR is shorter than that using QP3.

GPU Implementations
We now describe our QP3 and QPR implementations that utilizes a GPU.
4.1.QP3 Implementation.MAGMA (http://icl.utk.edu/magma/)extends LAPACK (http://www.netlib.org/lapack/) to heterogeneous architectures based on a hybrid programming and static scheduling.For instance, for the blocked QR factorization, the latency-limited BLAS-2 based panel factorization is scheduled on the CPUs, while the compute-intensive BLAS-3 based trailing submatrix update is scheduled on the GPU [18].Furthermore, as soon as the next panel is updated on the GPU, it is copied to the CPU such that the panel factorization on the CPUs can be hidden behind the remaining submatrix update on the GPU (this is commonly referred to as a lookahead).As a result, for a large enough matrix, the QR factorization by MAGMA can obtain the performance of the BLAS-3, which exhibits a high-level of data parallelism and can be efficiently implemented on the GPU [19].
Our first QP3 implementation is based on the same hybrid paradigm, where the CPUs factorize the panel while the GPU updates the trailing submatrix.However, in contrast to the QR panel factorization that accesses only a panel, each step of the QP3 panel factorization accesses the whole trailing submatrix to look for the pivot.Hence, the entire trailing submatrix must be updated before the panel factorization starts.As a result, though the transfer of the panel to the CPU can be overlapped with the remaining submatrix update, the actual panel factorization cannot be overlapped with the update (the lookahead is not possible).In addition, the QP3 factorization time is often dominated by the BLAS-2 based panel factorization that performs about a half of the total flops.Hence, our second implementation accelerates this panel factorization using a GPU.Because the flop count of the panel factorization is dominated by the matrix-vector product with the trailing submatrix (Step 1.4 of Algorithm 2), the GPU is used to accelerate this BLAS-2 kernel.As shown in Figure 5(e), our implementation computes the matrix-vector product with the top block row of the trailing submatrix on the CPUs, while the rest of the product is computed on the GPU.This is because this top block row is needed for the column norm downdating (Step 1.6), which is performed on the CPUs.Hence, this hybrid paradigm avoids the transfer of a row from the GPU to the CPU at each step of the panel factorization.Figure 5(f) illustrates this hybrid paradigm.6: Execution trace of the hybrid QP3 implementation.The top trace is on the CPU, while the remaining two traces are on the GPU with two GPU streams (matrix-vector multiply, matrix-matrix multiply, column swap, pivot selection, reflector generation, norm computation, and communication are in green, purple, orange, magenta, red, cyan, and black, respectively.Since the BLAS matrix-vector multiply routine does not support a vector-matrix multiply, a matrix-matrix multiply is used to compute f  at Step 1.4 of Algorithm 2).The second GPU stream is used to transfer the next panel and top block row to the CPU.
Unfortunately, in comparison to the QR factorization, this hybrid implementation of the QP3 factorization lead to much lower speedups.This is mainly because its performance is limited by the memory bandwidth to move a column between the CPU and GPU at each step of the panel factorization.This data transfer is needed for the column pivoting and matrix-vector multiply (see Figure 5(f)).As an execution trace in Figure 6 also shows, this CPU-GPU data transfer could become as expensive as the actual computation at each step of the panel factorization.To avoid this data transfer, our second implementation of QP3 performs all the computation on the GPU.At each step of the factorization, this GPU implementation still requires two synchronizations on the CPU; one to pick a pivot and the other one to check if the column norms must be recomputed (Steps 1.1 and 1.6, resp.).Furthermore, in many cases, the CPUs obtain higher performance of BLAS-1 (e.g., Householder vector generation) than the GPU.However, once the matrix is copied from the CPUs to the GPU, this GPU implementation does not transfer any vector or matrix between the CPUs and GPU and could obtain a higher performance than our hybrid implementation could (see Section 5).
Initially, our GPU implementations used an individual GPU kernel for each BLAS or LAPACK routine used for the panel factorization.However, many of these routines perform only small amounts of computation or require a scalar to be on the CPU.In order to improve the performance, we merged several computational kernels into one kernel in order to avoid the kernel launch overhead and unnecessary GPU-CPU communication.In addition, we have tuned these kernels (e.g., block size and thread grid) for the matrix dimensions typical for the QPR factorization.We will show the effects of these optimization techniques in Section 5. A similar GPU-implementation is proposed in [20].

QPR Implementation.
In contrast to QP3, QPR allows lookaheads.Namely, once the GPU updates all the columns of the next window, the CPU can start the panel factorization while the GPU updates the remaining submatrix.However, in contrast to the QR panel factorization that only requires the panel, the QPR panel factorization updates all the columns within the window at each step of the panel factorization.Hence, when the window size is large in comparison to the trailing submatrix dimension, it becomes difficult to hide the BLAS-2 based panel factorization on the CPUs behind the BLAS-3 based trailing submatrix update on the GPU (we have investigated a hybrid panel factorization.However, in many cases, it was less efficient due to the CPU-GPU synchronization and communication, especially in our experiments with StruMF, where only a few columns were accepted at each panel factorization).
For Step 2 of the QPR factorization in Algorithm 5(a), we use the GPU implementation of QP3 (described in Section 4.1).Since at the end of Step 1, the coefficient matrix is on the GPU, Step 2 does not require any data transfer to the GPU.For Step 3, if the remaining submatrix is relatively small ( < 300 in our experiments), then we compute its QR factorization using LAPACK on the CPU.Otherwise, the QR factorization is computed using MAGMA on the GPU.We found that, in many cases, QP3 does not accept any of the rejected columns from Step 1.Hence, in order to hide the cost of copying the matrix from the GPU to the CPU for Step 3, the matrix is asynchronously copied to the CPU, while QP3 is performed on the GPU.Only when QP3 accepted a column, the matrix is resent to the CPU after the whole matrix is factorized.Finally, to generate the final orthogonal matrix , the Householder vectors from both QP3 and QR are accumulated on the GPU.

Integration into StruMF.
To call our GPU kernels from StruMF, we copy the matrix to the GPU, compute the factorization, and then copy the result back to the CPU.Initially, we allocate a fixed amount of GPU memory for the workspace, and the workspace is reallocated when the current workspace is not large enough.Since the interfaces of most of the MAGMA routines are identical to those of LAPACK,  replacing the LAPACK routine with the MAGMA routine is relatively easy.

Performance Studies with a GPU
We now study the performance of our QP3 and QPR implementations with a GPU (Section 5.1) and its impacts on the performance of StruMF (Section 5.2).We compiled our codes using the C compiler gcc 4.4.6 and the CUDA compiler nvcc 5.0.35 and linked them to the threaded version of MLK 2013.4.183.We emphasize that our CPU codes have been optimized.Namely, our QPR code integrates all the modifications described in Section 3.2, and our QP3 code computes a partial factorization, where the factorization is terminated at the numerical rank specified by the tolerance .Computing a partial factorization obtains a significant speedup compared to the QP3 routine of MKL that computes the full factorization.This was especially true in our experiments, where a relatively large  is used.
5.1.Kernel Performance.Figure 7(a) shows the performance of our QP3 implementations to factorize square random matrices with an NDIVIA Tesla K20c GPU ( = 0.0) and compares it with that of MKL on two eight-core Intel Sandy Bridge CPUs.Our GPU implementation improves the performance of our hybrid implementation because it avoids the data transfer between the CPU and GPU. Figure 7(b) demonstrates the advantage of the GPU implementation for the short-wide matrices.In addition, our optimized GPU implementation (Section 4.1) obtains significant speedups for both small and short-wide matrices, and this is used for the remaining of the experiments.
Figures 7(a) and 7(b) also show that all of our implementations obtain significant speedups over MKL.We observed that, for large matrices, our GPU implementation and MKL obtain similar performance relative to the practical peak performance on the corresponding hardware, where the practical peak performance is measured by computing the required matrix-vector products with the trailing submatrices and the matrix-matrix products for the submatrix update without any synchronization.In other words, our GPU implementation obtains the speedups over MKL because it effectively utilizes the higher memory and computational bandwidths of the GPU.
Figure 8(a) shows the breakdown of the QP3 factorization time.Up to 70% of the factorization time is spent in the BLAS-3 matrix-matrix and BLAS-2 matrix-vector multiplies.Even though these BLAS-3 and BLAS-2 perform about the same numbers of flops, the bandwidth-limited BLAS-2 dominates the factorization time.Figure 8(b) compares the performance of our QPR implementations with that of the QP3 implementations on random short-wide matrices.It shows that after the proposed modifications were integrated, the QPR implementation obtained the higher performance and greater scalability on the CPUs.Furthermore, our hybrid QPR implementation outperformed our QP3 GPU implementation due to the fewer BLAS-2 flops required to select the pivots.Since the performance of QP3 and QPR (especially that of QPR) depends greatly on the input matrix (e.g., the number of columns accepted at each panel factorization), in the next subsection, we study the performance of these two algorithms within StruMF.partial factorization by QP3 (the factorization is terminated at the numerical rank specified by the tolerance ).Next, the preconditioner construction time of StruMF using QP3 is often dominated by BLAS-2 or BLAS-3 on small submatrices, and the construction time did not scale well on the CPUs.As a result, our GPU kernel was able to accelerate the construction time by 30%-50% over StruMF running on up to 16 CPUs.Figure 9(b) then shows the performance of StruMF using QPR.Using QPR, the preconditioner construction slowed down on a single CPU, but it scaled better and was faster than using QP3 on multiple CPUs.Furthermore, the GPU reduced the construction time by 10%-20%.In comparison to QP3, the GPU acceleration was smaller in QPR mainly because the trailing submatrices were updated using smaller blocks.Most of the columns in each window are rejected, and significant time is spent swapping the rejected columns to the end of the matrix.

Performance of StruMF.
Figure 10 shows the results of solving a 2D Poisson equation, where the same tolerance  from [5] was used (i.e.,  = 10 −4 ).Though the sizes of the dense submatrices were significantly smaller in the 2D problem (see Figure 11), our GPU kernel, especially the QP3 implementation, still obtained significant speedups.Unfortunately, the compression time was less dominant in the 2D problem, and the reduction in the preconditioner construction time was less significant using the GPU.

Conclusion
We studied the performance of QP3 and QPR for computing the low-rank approximation of dense submatrices.We first proposed several modifications to the original QPR implementation to improve its performance on the CPUs.We then investigated the potential of using a GPU to accelerate the factorization time.Our performance results demonstrated that the proposed modifications could significantly improve the performance of the QPR factorization on the CPU, and the factorization time can be further reduced using a GPU.In addition, we provided the case studies with an hierarchically semiseparable linear solver StruMF, which showed that the preconditioner construction time of StruMF can be reduced by 30%-50% and 10%-20% using the GPU for the QP3 and QPR factorizations, respectively.Though we only show the results of solving Poisson's equations in this paper, the performance is representative of many cases and good indications for other cases.We emphasize that our focus is to study the performance of computing low-rank approximations, and our aim is not on improving the numerical performance of Scientific Programming StruMF.The techniques discussed in this paper are applicable to other software which computes low-rank approximations or numerical ranks of dense matrices, including those on distributed-memory systems.
We did not study the impact of the input parameters on the performance of our implementations.For instance, we used all the default parameters of StruMF and QPR (e.g., compression rate , iteration stopping and restarting criteria for StruMF, and window size for QPR).In particular for StruMF, a smaller compression rate would lead to larger dense blocks increasing the effectiveness of our GPU kernels, while it would also reduce the iteration count, potentially reducing the total solution time.For QPR, the smaller window size would improve the effectiveness of the lookahead and may improve the performance of our hybrid implementation.On the other hand, the larger window size could improve the performance of the trailing submatrix updates and may also lead to more accurate factorization, potentially reducing the postprocessing time.We are currently studying the effects of these parameters on the performance of StruMF.In addition, we mentioned that, in comparison to QP3, QPR obtained smaller speedups on the GPU, mainly because it uses a smaller block to update the trailing submatrix.We are currently studying prepossessing techniques that would increase the block size and improve the performance of the QPR implementation (e.g., before the factorization, reorder the matrix columns in the descending order of their norms).We also discussed that the performance of the QPR implementation depends on the performance of the condition number estimator.We are investigating if more accurate estimators are needed for other test matrices.Finally, we are studying the performance of the randomization algorithm [21] to compute the low-rank approximation on a GPU [22] and would like to investigate the performance of our QP3 implementation in a communication-avoiding version of the algorithm [23] on multiple GPUs (e.g., distributed-memory system).

Figure 2 :
Figure 2: Statistics of off-diagonal blocks whose low-rank approximations are computed.
(b) Effect of tuning the postprocess

Figure 4 :
Figure 4: Solution time of original StruMF (left bar) and after modification (right bars), using ICE and tuning the postprocess in left and right plots.

Figure
Figure6: Execution trace of the hybrid QP3 implementation.The top trace is on the CPU, while the remaining two traces are on the GPU with two GPU streams (matrix-vector multiply, matrix-matrix multiply, column swap, pivot selection, reflector generation, norm computation, and communication are in green, purple, orange, magenta, red, cyan, and black, respectively.Since the BLAS matrix-vector multiply routine does not support a vector-matrix multiply, a matrix-matrix multiply is used to compute f  at Step 1.4 of Algorithm 2).The second GPU stream is used to transfer the next panel and top block row to the CPU.

Figure 7 :
Figure 7: Performance comparison of different QP3 implementations on random matrices.

Figure 8 :
Figure 8: Performance of the QP3 GPU-implementation on random short-wide matrices.