Implementation and Optimization of GPU-Based Static State Security Analysis in Power Systems

Static state security analysis (SSSA) is one of the most important computations to check whether a power system is in normal and secure operating state. It is a challenge to satisfy real-time requirements with CPU-based concurrent methods due to the intensive computations. A sensitivity analysis-based method with Graphics processing unit (GPU) is proposed for power systems, which can reduce calculation time by 40% compared to the execution on a 4-core CPU. The proposed method involves load flow analysis and sensitivity analysis. In load flow analysis, amultifrontalmethod for sparse LU factorization is explored onGPU through dynamic frontal task scheduling between CPU and GPU. The varying matrix operations during sensitivity analysis on GPU are highly optimized in this study. The results of performance evaluations show that the proposed GPU-based SSSA with optimized matrix operations can achieve a significant reduction in computation time.


Introduction
An electric power system includes a network of connected electrical components which are used to supply, transfer, distribute, and use electric power.It is necessary to know the physical properties of the power system for safety.Different experiments can be conducted to study the properties of power system.Sometimes, however, it is impossible to perform such experiments on large-scale, complex power systems because the experiments are too expensive or difficult to take measurements on those power systems.Therefore, power system analysis [1] becomes another popular choice and plays more and more important role in planning, design, and operation of electrical power system.Power system analysis aims at building an equivalent model for a given power system, running load flow calculation, evaluating the effects when subjected to disturbances, and generating the ways to improve the stability performance of the power system.Power system analysis can be classified into steady state analysis and transient stability analysis.The former is further categorized into Load flow analysis and Static state security analysis (SSSA).Load flow analysis involves determining voltages and power flow through a stable system, where any transient disturbances are assumed to have settled down.SSSA involves power flow calculation when a chosen device is connected or disconnected.Transient stability is analyzed to estimate the system's stability under dynamic conditions during disturbances.
Load flow analysis is one of the most significant computations in power system planning and operations.In load flow analysis, we need to model the entire network with all generators, loads, transmission lines, transformers, and reactors.Following the modeling, since power system is a large-scale and highly nonlinear dynamic system, power flow calculation involves solving higher-dimensional sparse nonlinear algorithmic equations based on nodal admittance form.Load flow equations allow us to compute bus voltage magnitudes and phase angles as well as branch current magnitudes.Solving nonlinear equations is related to iterative computations, which are data-and computation-intensive.

Mobile Information Systems
SSSA can be replaced by a series of load flow analyses.However, the computation would be more intensive if a rigorous load flow calculation is used.Some studies are focusing on SSSA to improve the performance [2,3].Many highly simplified algorithms are worked out to save iterative computations, and a few approaches with parallel load flow analysis are proposed.Graphics processing unit (GPU) is becoming an attractive accelerator for parallel computation [4] and can achieve high performance for general-purpose computation.Load flow analysis is intended to solve highdimensional sparse nonlinear equations, and sensitivity analysis solves changes in network state based on the load flow results, which uses a number of sparse matrix operations, such as matrix multiplication, addition, and inversion.These features make GPU a suitable and viable solution for SSSA.In this study, while we have provided a GPU-based SSSA solution [5], in practice, a considerable amount of work is needed to continue improving its speed and performance.
Many operations in SSSA can be parallelized on GPU to improve the performance.However a hybrid approach is necessary to combine CPU and GPU computations in GPUbased SSSA.Branch prediction or speculative execution is not supported on GPU, which is not good at iterative operations and judgements of convergence in solving nonlinear equations.For a large sparse matrix, the nonzero storage and computation mechanism should be adopted.The proposed method can combine various small matrices into one matrix in multiplication to use the GPU threads better.
The reminder of this paper is organized as follows: the background and related work are described in Section 2. Section 3 describes the workflow of static state security analysis and its modules.Section 4 addresses the optimization of matrix operations.Section 5 evaluates the performance of the proposed system and addresses optimization issues.Section 6 contains the conclusions of this study as well as a discussion of future improvements in the proposed system.

Related Work
SSSA is an important computation tool in the design and operation of large, interconnected power systems.It determines whether a system is operating in a secure state at any given time with respect to unforeseen outages  in  buses, lines, transformers, or generators. − 1 branch outage simulation is a basic validation of a power grid safety [6].SSSA can be implemented by running one load flow calculation for each outage case.If a large number of cases are provided, the time needed for calculation and analysis can be very long.Hence, other methods, such as DC (Direct Current) power method and sensitivity analysis method, have been proposed to speed up static state security analysis.
The DC power method [7] can solve nonlinear equations in power systems by transforming them into linear equations.This reduces computational complexity to make calculations more efficient but can lead to poor precision.The DC power method can only check for the overload in practical application.If the voltage is raised above the upper limit, it will not be checked.The DC power method is commonly used to design power systems, for which the active power flows are an important concern.
In sensitivity analysis, the offline of one transmission line is regarded as a perturbation under normal circumstances [8].The method involves calculating a sensitivity matrix from the Taylor series expansion of load flow equations.The impact of line outage can be simulated by net injection and withdraw changes.Many analyses of outages on the same network conduct only one load flow calculation and perform their sensitivity analysis on the load flow results.This bypasses many intermediate, iterative steps of the calculation and significantly increases the efficiency of line-off analysis.This method can not only improve performance and precision, but also obtain active power, reactive power, voltage magnitude, and angle at the system node.Therefore, it is commonly used in general practice.
Static state security analysis pays more attention to the calculation of the linear algebraic operation.Reference [9] presents a systematic approach for a large set of frequently encountered dense linear algebra operations, but it is shown to yield new high-performance algorithms.There are many researches based on GPU, in which cuBLAS [10] is the most commonly used.cuBLAS is a linear algebraic library on GPU, which is an implementation of BLAS (Basic Linear Algebra Subprograms) on NVIDIA's CUDA (Compute Unified Device Architecture) runtime.When programming on cuBLAS, we use it like BLAS and do not care about the details of thread modeling or the storage model of CUDA programming.
Reference [11] proposed a hybrid CPU-GPU implementation of sparse Cholesky factorization based on the multifrontal method.Reference [12] introduced an efficient GPUbased sparse solver for circuit problems and developed a hybrid parallel LU factorization approach combining tasklevel and data-level parallelism on GPUs.Reference [13] proposed an implementation of the Newton-Raphson (NR) load flow algorithm, as it pertained to parallelizing and implementing in CUDA.However, there is no simple and efficient static state security analysis method based on GPU to conduct  − 1 branch outage simulations.

Workflow and Modules of Static
State Security Analysis

Overview of Static State Security Analysis.
Sensitivity analysis is a method to be highly parallelized, which is convenient to be used with GPU to accelerate calculations in static state security analysis.The topology of the power grid can be simply modeled as a graph with  nodes and  edges, in which the nodes represent buses (power stations or transformers) and the edges represent transmission lines [14].
The graph network can be converted into a nodal admittance matrix or a nodal impedance matrix.The two matrices are highly sparse for a large power system.According to the conservation of complex power theorem, the load flow equations can be written as follows: where   and   are the injected active and the reactive powers, respectively, at node .  ,   are the elements of nodal admittance matrix.Load flow calculations can be roughly considered as the problem of solving node voltages   ,   ( = 1, 2, . . ., ) at each node when the injecting complex powers   ,   ( = 1, 2, . . ., ) are given.This nonlinear equation can be solved by the Newton-Raphson (NR) method, which is based on a shortened Taylor series.NR needs to iteratively solve (2) until it converges within a given tolerance of Δ  , Δ  ( = 1, 2, . . ., ): For a large system, the load flow calculation may require significant computational resources to calculate, store, and factorize the Jacobian matrix .Following load flow analysis, the normal state of the system is determined.If there is a randomly injected power disturbance Δ or network change Δ, the state vector is correspondingly changed by Δ.The equation can be expressed as follows: in which  0 is the active and reactive power of nodes in normal state,  0 is the voltage vector of the nodes, and  0 is the normal network parameter.When power disturbance is ignored, Δ = 0, and Taylor series expansion [15] is used for (3).Finally, we obtain the following equations: In the above,  0 is the sensitivity matrix, which is equal to  −1 .If only one line-off is considered in SSSA, the injected power change Δ  of nodes along the offline can be obtained by where (2)     (1)     (2)   i  (3)     (4) (4)     (1)    (2)    (1)    (2)     (3)    (4)    (3)    (4 In the above equations,   ,   ,   , and   , the elements of the Jacobian matrix , are calculated as follows: and  (1)   ,  (2)   ,  (3)   , and  (4)   , the elements related to the offline node in the sensitivity matrix, are calculated as follows: We find that the items on the right-hand side of (6) come from load flow calculation, and only two 4 × 4 matrix operations occur in the equation.The overall workflow of the static state security analysis system is shown in Figure 1.It consists of four main modules: I/O, preprocessing, power flow calculation, and static state security analysis.
When the original data is input, it is preprocessed first for power flow calculations.The system then conducts a full power flow calculation and determines the Jacobian matrix , which is prepared for static state security analysis.Finally, the

Power flow calculation
Frontal matrix generation

Parallel matrix multiplication
Parallel matrix addition changing network parameters are assumed by a line-off and new system state variables identified.Once a set of critical elements are simulated to be tripped, we can determine whether there are equipment-limit violations through SSSA.Hence, the state of power system (secure or unsecured) can be easily determined.

3.2.
Preprocessing.Data preprocessing is a major and essential stage in power flow calculation, which can speed up static state security analysis.There are three processings: sparse matrix storage, node numbering optimization, and result storage optimization, which aim at reducing storage space for matrices and speeding up the calculation of nonzero elements.
Following raw data input, data in large sparse matrix is compressed in CSR (Compressed Row Storage) format [16].CSR is very efficient due to an indirect addressing step for every single scalar operation in matrix vector.The subsequent nonzero elements in matrix rows are placed in contiguous locations in memory and traversed in a row-wise fashion.
By matrix sorting with the minimum-degree minimumlength algorithm [17] or the minimum-degree minimumnumber algorithm [18], the rows and columns of sparse matrix are moved to avoid creating new nonzero elements in the matrix during factorization, and the forward and backward substitutions can save many nonzero computations.Such node numbering optimizations are conducted with an elimination tree, which can determine the position of injection elements in factorization.
Static state security analysis is performed on the results of power flow calculations.Since three read operations are necessary when accessing CSR-formatted data every time, it incurs a considerable costs for I/O operations.Hence, the matrix is decompressed prior to calculation.It may take up a lot of space to assemble several matrices, and the memory may not be large enough to store the matrices.Therefore, preprocessing involves the following three steps: (1) Analyze the result of power flow calculation , where sensitivity matrix is obtained by the inverse  0 .Allocate memory to store matrices assembled from  0 .
(3) Check matrix size.If it does not exceed the capacity, copy it directly into GPU device memory; otherwise, partition the matrix and copy submatrices into device.

Power Flow Calculation.
Power flow calculation is executed with the Newton-Raphson (NR) method to solve large sparse nonlinear equations.The NR method repeatedly solves large, sparse, linear systems of equations.It is a significant challenge of computational and memory capacity to factorize the Jacobian matrix at each iteration; hence, it is crucial to design a fast and efficient solution for the factorization.The multifrontal algorithm is a particularly useful method for factoring large sparse linear systems and is adopted in the proposed solution.The workflow of the method is shown in Figure 2.  The multifrontal algorithm [19] is an effective method to solve large sparse matrices operation.A frontal matrix is a small, dense matrix.The term "multi" here refers to the fact that multiple frontal matrices are used during the factorization.The multifrontal method turns the factorization of sparse matrix into a sequence of factorizations of smaller, dense matrices organized as an elimination tree or an assembly tree.The method can get satisfactory data locality and great potential for parallelization.For example, [20] investigates an automatic tuning of SpMV (Sparse Matrix Vector) multiplication kernel in a partitioned global address space language, which supports a hybrid thread-and process-based communication layer for multicore systems.The multifrontal method proposed here is associated with super-nodal implementation, so that multiple columns with the same nonzero patterns can be grouped together as a dense kernel for concurrent factorization.
The method is formulated in terms of frontal matrices and update matrices.The processing order of frontal matrices is determined by the elimination tree from the preprocessing.The order is expressed by a frontal matrix chain, in which each frontal matrix is designated as a leaf and processed by a CPU thread.For node , subtree update matrix   is the sum of all subtree update matrices; frontal matrix   is formed by assembly over   and can be partitioned into four submatrices , , , and  with (10).Then the Cholesky factor vector and the update matrix of  can be solved with following Cholesky factorization:

Mobile Information Systems
It will take a considerable amount of time for matrix operations in the multifrontal method.A threshold is defined to distinguish CPU from GPU tasks and is derived from matrix calculations.If the number of the computations is greater than the threshold, the main thread will allocate tasks to GPU.Otherwise, a small-scale matrix is computed on CPU.Nodes in frontal matrix chain are calculated one by one until all of them are computed.Δ is then computed backward through substitute, decomposed matrices  and  to update .If the absolute value of Δ is less than 10 −8 , which is regarded as convergence, the calculation is concluded; otherwise, the iterations will continue to get a convergent result.

Static State Security Analysis.
In this module, the GPUbased sensitivity method is used for static state security analysis.The elements in the matrices of ( 6) are prepared from the previous load flow analysis.It consists of four steps.
Two matrix multiplication and one matrix inversion operations are involved during the analysis.Following these steps, the changes in the node state variables can be obtained, so that the power flow on each branch can be acquired following line outage.Moreover, many cases of SSSA can be combined in parallel on a GPU to improve the efficiency.

Optimization of Matrix Operations
4.1.Small Matrix Multiplication.In 4 × 4 matrix multiplication on GPU, 16 threads are executed for concurrent element multiplication in one block, shown in Figure 3.The 16 threads are completely independent and communicate with one another in the block.However, the same instruction is executed in a warp, which is allowed to launch 32 threads at once.Whether there are 32 or 16 threads in a task, they are launched by a warp with the same time cost.Hence if only one submatrix multiplication occurs in a block with only a warp, 16 threads are launched and the other 16 are idle.
It is a best practice to allocate more than one multiplication operation to a block.A block can contain a maximum of 1024 threads, which can store 64 submatrix multiplication operations.Host threads can combine their data into a large matrix and copy it to share the memory of the block.Since the instructions are the same in a block, it is necessary to access own data of the large matrix in each thread.The distribution of logical memory is shown in Figure 4. We assume that every 16 threads process a matrix multiplication operation and 64 matrix multiplications can be concurrently handled in a block.

Small Matrix Inversion.
Between  and  0 , there is a matrix inversion operation.In general, there are two methods for matrix inversion.One is the adjoint matrix method, which executes the calculation of  −1 =  * /||.The adjoint matrix  * of  must be calculated first for this.It is too complicated to derive the expression of  * for matrix  of order greater than 3.The other matrix inversion method is the elementary transformation method, which is simpler than the adjoint matrix method for high-order matrices, but more iterations are required and inefficient in terms of GPU resource consumption.In order to enhance efficiency, ( 11) is used to partition matrix inversion: If a 4 × 4 matrix is provided, it can be divided into four 2 × 2 matrices.In this manner, the inversion task is divided and assigned to four threads, which can complete their tasks almost simultaneously.In (11), only the operations of matrix multiplication, addition, and 2 × 2 matrix inversion are given.2 × 2 matrix inversion can be expressed as follows: In this way, four threads can be used with (12) to calculate the inversion of , , , and  in formula (11).The results are substituted into (11) to work out the inversion of the 4 × 4 matrix.

Vector Multiplication by Cross-Combining Storage.
When solving (5), each change in power flow is obtained by multiplying 4 × 1 row vector by 1 × 4 column vector.A set of vector-vector multiplication can be combined into a matrix vector-vector multiplication as Figure 5.
In sensitivity analysis, 32 ( is a positive integer) row vectors are merged into one long row vector placed in global memory.Global memory allows a warp of 32 threads to access it concurrently.As Figure 6 shows, there are 32 4 × 1 row  A number of row vectors are stored in one row vector by cross-combination.It is necessary to execute a reduce operation after vector multiplication.The result of vector multiplication is stored back into .The four result elements (,  + 32,  + 64,  + 96) are grouped, summed up, and stored at .Eventually, [0-32] is the result of 32 multiplications of a 1 × 4 vector by another 4 × 1 vector.

RC-MM Storage Method.
In general practice, cublasSgemm function in cuBLAS is called for GPU matrix multiplication.The matrices in C/C++ are in row-major order, but cuBLAS assumes that the matrices are stored in columnmajor order in the devices.The order exchange is a timeconsuming operation.Hence the RC-MM (Row Column-Matrix Multiplication) method is adopted, and , , and  are stored in column-major, row-major, and column-major order, respectively, in  = .
For the multiplication in the multifrontal method, one matrix is stored in row-major order and the another matrix in column-major order.The matrices of the same frontal chain share an array, because of which the entire row or column data is copied to global memory.Each computation only uses part of the data, which avoids having to move large amounts of data on demand.

Dataset and Experimental Environment.
The experiments are conducted on a server with an Intel i7 950 (3.07 GHz) CPU, 16 GB memory, and NVIDIA GeForce GTX460.The CentOS 5.9 Linux operation system with CUDA 4.0 is used.From MATPOWER [21], a professional software in power calculation derived from a real power grid and some IEEE standard testing datasets are chosen, shown in Table 1.Global memory   1 are used for testing.The results are shown in Table 2.In Table 2, there is a 4-core processor in the CPU, with which the program can almost achieve best performance.But with increasing of the number of active processor cores in the compute node, the program on the CPU platform suffers serious performance degradation.In dataset CA3012wp, the calculation time on an 8-core CPU increased by 13% compared with a 4-core CPU.
The experimental results show that the execution time on GPU is shorter than on CPU, except on a scale of 300 nodes.It takes much time to transfer data between host and device in GPU.When the scale is 300 nodes, the transfer time cannot be ignored, and it is occupied by a significant proportion of the execution time on GPU.On the other hand, for only 300 submatrices, every two submatrices are processed in a warp on average.However for 336 cores on GTX 460, it means that over half the cores are idle throughout the processing, which is a serious waste of GPU resources to hinder overall performance.
Excluding the dataset of the 300 nodes, the accelerating effect of GPU computation is reflected, and significant speedup is revealed with increasing data scale.GPU computing can reduce calculation time by 40% compared with the execution on 4-core CPU.The increasing trend of GPU speedup is shown in Figure 7.Following matrix combination, the 32 threads in the warp can be kept busy, which indicates that the optimization method can exhibit good performance.

Evaluation of Small Matrix Inversion.
Two experiments are conducted on 4 × 4 matrix inversion in (5).cuBLAS has provided a large matrix inversion function with low efficiency for batch small matrices inversion.The small batch matrix inversion method is implemented on GPU kernel, in which one submatrix is processed by one thread.The small batch matrix inversion function is called in Exp1 once, and 100 inversion threads are scheduled on the GPU.The cuBLAS matrix inversion function is called 100 times in Exp2.The results show that the small matrix inversion method yields better performance (Exp1 925 ms versus Exp2 1055 ms) in Figure 9.

Conclusion and Future Work
In this paper, GPU-based static state security analysis is proposed for power systems.The GPU-based multifrontal method is implemented to solve a large sparse matrix, and sensitivity analysis is chosen for static state security analysis on GPU.To make full use of GPU device, several optimization methods of matrix operations are presented, such as data combination in multiple small-scale matrix multiplication operations and the partition matrix method for matrix inversion.
Experimental results indicate that the proposed algorithm on GPU can significantly improve system performance.Our results show a speedup of 1.7-1.9 with power system simulation cases from a scale of 3,000 to 6,000.
In future work, it may be desirable to further improve performance that the system and methods could be ported to more scalable distributed memory environment, such as multi-GPUs [22].We can also use the compilers to speed up system migration and dynamic task scheduling in CPU-GPU heterogeneous parallel systems.Our way of dealing with small-scale matrices can be used in scientific calculations in other fields, and the processing method of specialdimensional matrix can be extended to all-dimensional matrices.

Figure 1 :
Figure 1: The overall workflow of static state security analysis.

Figure 2 :
Figure 2: The workflow of power flow calculation.

Figure 4 :
Figure 4: The distribution of logical memory.

Figure 7 :
Figure 7: Comparison in terms of execution time between CPU and GPU speedup.

Figure 8 :
Figure 8: Comparison in terms of matrix multiplication execution times of the two experiments.

Figure 9 :
Figure 9: Comparison in terms of small matrix inversion times of the two experiments.

Table 1 :
(13) datasets for SSSA experiments.The speedup ratio  is defined in(13), where  CPU time and  GPU time are the execution times on CPU and GPU, respectively.The program executed on CPU is a multithread program on multicore CPU.
If the matrices are not merged into a block, half the threads will be idle in a warp, which can launch 32 threads.

Table 2 :
Comparison in terms of execution time between CPU and GPU.