Sparse Cholesky Factorization on FPGA Using Parameterized Model

Cholesky factorization is a fundamental problem in most engineering and science computation applications. When dealing with a large sparse matrix, numerical decomposition consumes the most time. We present a vector architecture to parallelize numerical decomposition of Cholesky factorization. We construct an integrated analytical parameterized performance model to accurately predict the execution times of typical matrices under varying parameters. Our proposed approach is general for accelerator and limited by neither field-programmable gate arrays (FPGAs) nor application-specific integrated circuit. We implement a simplified module in FPGAs to prove the accuracy of the model. The experiments show that, for most cases, the performance differences between the predicted and measured execution are less than 10%. Based on the performance model, we optimize parameters and obtain a balance of resources and performance after analyzing the performance of varied parameter settings. Comparing with the state-of-the-art implementation in CPU and GPU, we find that the performance of the optimal parameters is 2x that of CPU. Our model offers several advantages, particularly in power consumption. It provides guidance for the design of future acceleration components.


Introduction
In engineering and science computations, the solution of large sparse linear system of equations is a fundamental problem in most applications.In general, two methods for solving this problem exist: the direct and iterative methods.In the direct method, when focusing on factorization algorithms, we have LU decomposition, QR decomposition, and Cholesky decomposition [1].It should be noted that Cholesky decomposition is a special form of LU decomposition that deals with symmetric positive definite matrices.Its computational complexity is roughly half of LU algorithm.The Cholesky algorithm is widely used in applications because most matrices involved are symmetric positive definite.
The solution of large sparse linear system of equations can be adapted quite easily to parallel architectures.Supercomputers and multiprocessor systems became the main focus of this kind of research [2][3][4].However these systems cost highly in making marketable products [5].
There are a lot of software-and hardware-based approaches developed to get better solution in Cholesky decomposition.Field-programmable gate arrays (FPGAs) have unique advantages in solving such problems.First, according to the characteristics of the algorithm, a dedicated architecture could be designed [6].Second, the computational resources, memory, and I/O bandwidth are reconfigurable.Third, the energy consumption and cost are relatively low under the premise of meeting certain performances.Fourth, it provides experimental and verification platform for future coprocessor design.Yang et al., 2010 [7, 8], presented a Cholesky decomposition based on GPUs and FPGAs.The results show that the dedicated FPGA implementation has the highest efficiency.The design in FPGAs has been able to combine parallel floating-point operations with high memory bandwidth in order to achieve higher performance than microprocessors [9].
A sparse matrix, in which the ratio of nonzero values is low, is especially suitable for direct methods.In our implementation, we chose the Cholesky factorization method over LU decomposition.
The reason why we chose the Cholesky factorization is less operation.The matrix is symmetric and half of factors in matrix need to be calculated and suitable for parallel architectures.In parallel, there is less communication within each interprocessor.There are a lot of work in parallelizing the Cholesky factorization [3,4].The memory requirement is significantly lower than other direct methods, because only the lower triangular matrix is used for factorization and the intermediate and final results (the Cholesky factor ) are overwritten in the original matrix.
The three algorithms in the implementation of Cholesky factorization are row-oriented, column-oriented, and submatrix algorithms.The difference among these three algorithms is in the order of the three loops [3].Among these three algorithms, the column-oriented algorithm is the most popular algorithm for sparse matrix.Once each column is regarded as a node, data dependency of each node in Cholesky factorization can be described as an elimination tree [10].In the tree, the parent nodes depend on their descendants while their descendants must update them before being decomposed.Thus, the process of factorization can be taken as a process of eliminating the tree from bottom to top.Two main operations of the algorithm exist: scaling operation in columns (div()) and updating operation between dependent columns (mod()).Basing on the updating order, the researchers found that the algorithm can be classified into left-looking and right-looking algorithms.
A number of studies in sparse Cholesky factorization optimized in GPU are found.Vuduc and Chandramowlishwaran mapped the submatrix tasks directly to the GPU [11], George scheduled the tasks on the GPU [12], and Lucas et al. used multiple CPU threads to take small-scale computing tasks in parallel and used GPU to speed-up large-scale computing tasks [13].The small-scale computing tasks are in the bottom of the elimination tree whereas the large-scale computing tasks are in the top of the elimination tree.Lacoste et al. modified the task scheduling module of Cholesky factorization algorithm to provide a unified programming interface on CPU-GPU heterogeneous platforms [14].Hogg et al. proposed a supernodal algorithm of HLS_MA87 solver for multicore CPU platforms [15].Chen et al. introduced a GPU version of the CHOLMOD solver for CPU-GPU platforms [16].Zou et al. improved the generation and scheduling of GPU tasks by queue-based approach and designed a subtreebased parallel method for multi-GPU system [17].
To decrease the storage and computation complexity, the sparse matrix would be compressed into a dense format (CSC format, Compressed Sparse Column Format).However, in the process of Cholesky factorization, many original zero entries will turn to be nonzeros.They increase the overhead memory and dynamically change the compressed data structure.To avoid this situation, the matrix should be preprocessed and decomposed in symbol before undergoing numerical factorization (when preprocessed and decomposed in symbol, we could generate nonzero pattern of matrix and its elimination tree.When we reorganize the data, we would get the vector array).Preprocessing exposes parallelism by reordering matrix columns.Symbol decomposition is then executed in advance to identify the sparsity pattern of the result matrix.With the symbol factorization technology being well-developed, numerical decomposition is the most time consuming.This paper discusses and evaluates the performance of numerical decomposition based on the preprocessing and symbol factorization results.
After analyzing Cholesky algorithm when dealing with a large sparse matrix, we rewrite the algorithm in a vector version.In order to explore the relationship between parallel scale and performance, to predict the performance on field-programmable gate arrays (FPGAs) platform, we also designed corresponding vector architecture and construct a parameterized model.The performance depends on the underlying architecture and system parameters.The capacity of on-chip memory, the amount of computational resources, and the size of I/O bandwidth are critical parameters.Our model takes these factors into consideration and focused on their relationship.By using the model, we obtain the best performance when choosing the capacity of on-chip memory and the amount of computational resources.
The rest of this paper is organized as follows.Section 2 introduces the vector version of Cholesky algorithm.Section 3 describes our performance model in ideal, memory-bound, and computational-bound cases.Section 4 proves the consistency of the performance of our model and the implementation; then it analyzes the parameters and tries to find an optimal performance.

Cholesky Factorization
Cholesky factorization decomposes a symmetric positive definite matrix into L × L  where L is a lower triangular matrix.The main operations of the algorithm are scale operation in columns (div()) and update operation between dependent columns (mod(, )).div() divides the subdiagonal elements of column  by the square root of the diagonal element in that column, and mod(, ) modifies column  by subtracting a multiple of column .The algorithm can be classified into left-looking and right-looking algorithm based on updating order.Right-looking algorithm performs all the updates mod(, ) using column  immediately after the div().By contrast, left-looking algorithm accumulates all necessary updates mod(, ) for column  until before the div().
From Algorithm 1 we can note that the parallelism is mainly among the nondata dependency columns and the elements in a column, corresponding to the parallel elimination among different path in the elimination tree (internode parallelism) and the parallel scaling among different subdiagonal elements in a node (intranode parallelism).Besides, because of the columns in dense matrix depending on each other strictly in order, the parallelism among columns only exists in sparse matrix decomposition.
While decomposing matrix   , the update matrix   is also generated; specifically, the following equation presents the calculation of matrix   . (1)

Vector Version and Architecture
This section chiefly introduces our vector version of the Cholesky algorithm and its implementation on FPGA.

Vector Version of Cholesky Algorithm.
In Algorithm 3, we rewrite multifrontal algorithm in a more detailed fashion.U () is a storage structure, the information of a series th node in all the update matrices U. () counts the number of its update matrices, which means the amount of child nodes of the th node. is an array that records the corresponding parent node.For instance, () is the parent node of the th node.In line (6) of Algorithm 3, the th column of L is calculated by the first column of F j .Lines (7) to (12) work on the generation of update matrix to its parent node and store it in U (()) .Notably, the update matrix is symmetric; thus, only a lower part of the matrix is generated and stored.
The execution of lines (2) to (3) in Algorithm 3 is illustrated in a vector fashion.In Algorithm 4 we exploited the strip-mined technology. is the size of F j while  is the amount of lanes in our vector architecture. is the number of child nodes of this processed node.Furthermore, U(i; * , t) is the th column of update matrix from the th child.Each updating of a column is assigned to a lane at one time.The calculation of L and the updating of matrix U are in a similar vector processing method.

3.2.
Architecture.The primary architecture of vector machine contains -independent processing elements (PEs).We divide div and mod modules for each PE.   from DDR memory and the results after factorization are also stored in DDR.The primary architecture of vector machine contains independent processing elements (PEs).We divide div and mod modules for each PE.Figures 2(a) and 2(b) show the two architectures.Each module has  lanes.From the two figures, we could conclude the data flow in one PE as described in Algorithm 3.For the extended addition between F and U, each lane processes one column at one time.More concretely, Figure 2(a) shows when mod(, ) = 2, the first two columns of F and corresponding columns in U are processed first.After that, all lanes are fully parallel most of the time.In Figure 2(b), only the first column of F needs to be processed when calculating the corresponding column of L. The rest of the columns in F need to be used when generating the updating matrix for its parent node.

Performance Model
Basing on the analysis of the Cholesky algorithm, we construct a performance model on FPGA platform to predict the performance of our designed vector architecture.The performance is related to the underlying architecture and its parameters.When ignoring the implementation architecture, we can predict the peak performance with on-chip memory, computational resources, and I/O bandwidth.
Figure 3 depicts the nonzeros pattern of L and its elimination tree.The elimination tree of the matrix L is defined to be the structure with size(L, 1) nodes and node  is the parent of  if and only if the following is established.
In the elimination tree, each column in the matrix is presented as a node.In Figure 3 the numbers on the right side of the elimination tree represent the levels of the corresponding nodes.The path with a dashed line is the critical path.Given the existing dependency between parent node and its child node, the amount of nodes in the critical path, which is determined by the algorithm and the nonzeros pattern of the matrix, has the least number of nodes that should be processed sequentially.
To construct the model on FPGA, we define these variables and their descriptions as listed in Table 1.
Assuming that data are initially stored in DDR3, the result will finally be written back to DDR3.Therefore, RW contains the input data, output data, and intermediate values.The intermediate values are forced to store and load in DDR3 when needed because of the small on-chip memory size.Thus  Table 1: Variables that we define and their descriptions.

AOs
The amount of operations for decomposing matrix  RW The data volume transferred between DDR3 and FPGA on-chip memory BW The available bandwidth between DDR3 and FPGA OPV Byte/Flop ratio, that is AOs/RW  div The total computing time for div of a PE  mod The total computing time for mod of a PE  node Computing time of decomposing one node The amount of nonzeros in matrix L The amount of nodes in the th level of the elimination tree,  = 1, 2, . . ., The amount of operations for decomposing matrix  ( *  ) The amount of the nonzero entries in th column of L () The amount of the nonzero entries of L RW = Input + output + 2 × temp.For a given problem, input and output data volumes are fixed, only temp is determined by the problem itself and FPGA on-chip memory.The amount of operations (AO) is determined by the amount of nonzero entries, specifically, for  ×  matrix.It takes one square operation and ( *  ) − 1 division operation to scale one column (div).Given that the update and front matrices are symmetric, only the lower part of the matrix should be taken into consideration.The amount of multiplication operations is (1/2)( *  ) × (( *  ) + 1), and the AOs for addition in mod are the same.As presented in the following equation, AO is the sum of all these operations.

AOs
( The chip resources contain on-chip memory and computational resources.The on-chip memory is the size of the memory that could be loaded immediately.It is a critical factor in RW.Moreover, computational resources determine the extent of parallel execution.In the next two subsections, we construct two different performance models for both memory-and computational-bounded cases.
From Figure 4(b), (clk sqrt + clk div + ⌈(( *  ) − 2)/⌉) is needed to finish div when the operations are pipeline executed.Similarly, mod will take (clk mult +clk add +⌈(( *  )+ 1) × (( *  )/2)/⌉) clocks.clk sqrt , clk div , clk mult , and clk add are the clocks that square root, division, multiplication, and addition operation take.In our implementation, our operation modules of square root, division, multiplication, and addition need 16, 27, 1, and 8 clocks, respectively.Generally, when ( *  ) > 8√, clk mod > clk div , then we have Assuming that resources are enough,  node will be independent from the scale of nonzeros.They are executed in parallel and the time it takes to finish a square root, division, multiplication, and addition operation sequentially is the same.Besides that, only the time of nodes in the critical path will be counted into the amount of time.In this ideal case, the total time is  1 = ∑  =1 clk mod (  ) and the corresponding performance reaches the roofline value [21].
To be more realistic, we assume one PE contains  lanes.A total of  nonzeros exist, executed in parallel after the diagonal element being processed in one node.In this case, we assume that one PE requires  units of resources.When the computational resources are bound and FPGA available resources are V units, one FPGA can contain  = ⌊V/⌋ PEs.It means that it supports  = ⌊V/⌋ nodes processed mostly in parallel.
Assuming that the PE, which reaches bottleneck, is assigned to process  1 ,  2 , . . .,   nodes, we indicate that the amount of clocks is clks = ∑  =1 clk mod (  ) clocks.If the FPGA working frequency is , the total process time is  = clks/.
Given that the largest number of nodes processed in parallel is , decomposing the th level needs time for processing ⌈  /⌉ nodes.For the whole elimination tree, a total of ∑  =1 ⌈  /⌉( node ) nodes are processed sequentially at most.Consequently, when computational resources are bounded, the computation time is shown as node ≥ , which means that  2 ≥  1 .So when computational resources and data dependency are taken into consideration, the process time is described as Furthermore, the performance is described as 4.3.Summary.When considering I/O bandwidth and computational resources, we can conclude that the following is the performance equation according to (4), ( 5), ( 6), ( 7), (8), and . The analysis and summary are demonstrated in Table 2. Once the parameters of FPGA and matrix are ascertained, we can obtain the peak performance and bottleneck.
Specifically, when the system is computational-bound, the performance is determined by  node and clk mod .Table 3 discusses the influence of these two factors.From the table, these two factors are subject to the number of PE() and the number of lanes() in PE.Greater values of  and  do not guarantee a better performance.When  = max(  ) and  = max{(( *  ) + 1) × ( *  )/2}, factor  node and clk mod will reach their ultimate values.

Experimental Evaluation
The matrices used in this paper are part of the University of Florida matrix collection, which is involved in the field of electronics-magnetics, structure design, and so on.The data format is double floating-point (64 bits), needing 8 bytes per entry when using Compressed Sparse Column format and ignoring the index.We adopt the following steps in our experiments: (1) predicting the performance of our model using commonly used matrices, (2) measuring the actual performance with the matrices in our implementation, and (3) assessing the consistency of predicted and measured values.The results are shown in Figure 5 for accuracy evaluation and Figure 6 for comparison of performance different rates.In Figure 5,  is the number of PEs and  is the number of modules in one PE.The percentage of relative difference between a predicted value and measured value is calculated by ( − )/ × 100, where  is the predicted value and  is the measured value.There are many optimizations in hardware implementation (like DDR storage access mode, hardware configuration parameters).After optimizing the implementation and selecting some general class of matrices so they can get a better agreement, we have the following observations from our experimental data.

Accuracy of Performance Modeling.
The accuracy of model is critical since we need to use this model to infer large-scale parallel performance.We focus on performance and the difference rate of the performance.Figure 5 shows the comparisons between the measured performance and predicted performance by our model of four classes matrix sets (shipsec1, nd24k, Trefethen_2000b, and nd3k) at different parallel scales (1 PE 1 module, 1 PE 2 module, 2 PE 1 module, and 2 PE 2 module).The measured performances are relatively stable and accurate.From [22][23][24] this established literature on performance model, we can conclude that the predicted and measured values have good consistency.Figure 6 presents the performance difference rates.As the scale of parallelism increased, the difference rates dropped form 10 percent to less than 5 percent.
As shown in Figure 5, the estimated performance is higher than the tested performance.We simplify data processing in the computational-bound model.Furthermore in the memory-bound model, a deviation in the bandwidth is found between DDR3 and FPGA.

Optimization.
To obtain a balance in resources in FPGA and the performance, we set the memory chip to 64 MB, with  PEs and each PE having  modules.
We compare the performance of our optimum implementation with four state-of-the-art supernode algorithms.One is the HSL_MA87 solver for multicore CPU platforms [15] while the other is the GPU version of CHOLMOD solver for CPU-GPU platforms [16].The rest are subtree-based parallel methods for CPU-GPU platforms [17].And the factorization time obtained is shown in Table 4.We have the following conclusions in the performance comparison.
The frequency of HSL_MA87, CHOLMOD, and GPU is 2.5 GHz, 3.2 GHz, and 2.93 GHz (CPU is 2.93 GHz, GPU is 1.15 GHz in [16]).Our frequency is 200 MHz.After testing    matrix sets nd3k, nd24k, and Trefethen_20000b, our performance is on the same level as GPU performance.When our work is implemented as a coprocessor (i.e., when the frequency increases, our model can guide the design of coprocessor), our implementation has a greater advantage in performance.

Figure 2 :
Figure 2: The architecture of mod and div.

Figure 3 :
Figure 3: Nonzeros pattern of L and its elimination tree.

Figure 4 :
Figure 4: The parallel and pipelining execution.

Figure 5 :
Figure 5: Accuracy evaluation on different scales.

Figure 6 :
Figure 6: Comparison of performance difference rates.

Table 2 :
The analysis of sparse Cholesky factorization.Performance is negative correlation to  node and clk mod

Table 3 :
The analysis of factors in computational-bound cases.