
An electric power system is a network of interconnected electrical components used to supply, transfer, distribute, and consume electric power. Knowledge of the physical properties of the power system is essential for safe operation. Various experiments can be conducted to study these properties. Sometimes, however, such experiments are impossible on large-scale, complex power systems because they are too expensive or the measurements are too difficult to take. Therefore, power system analysis [

Load flow analysis is one of the most significant computations in power system planning and operation. In load flow analysis, the entire network must be modeled, including all generators, loads, transmission lines, transformers, and reactors. After modeling, since a power system is a large-scale and highly nonlinear dynamic system, the power flow calculation involves solving high-dimensional sparse nonlinear algebraic equations based on the nodal admittance form. The load flow equations yield the bus voltage magnitudes and phase angles as well as the branch current magnitudes. Solving these nonlinear equations requires iterative computation, which is both data- and computation-intensive. SSSA can be replaced by a series of load flow analyses; however, the computation becomes far more intensive if a rigorous load flow calculation is used. Some studies therefore focus on improving the performance of SSSA [
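The iterative character of the load flow computation can be illustrated with a generic Newton-Raphson loop, the standard scheme for nodal power flow equations. The sketch below is a minimal Python/numpy illustration on a toy two-unknown nonlinear system, not the paper's implementation; the specific residual functions are invented for demonstration.

```python
import numpy as np

def newton_solve(f, jac, x0, tol=1e-8, max_iter=20):
    """Generic Newton-Raphson iteration: repeatedly solve J(x) dx = -f(x)."""
    x = x0.astype(float)
    for _ in range(max_iter):
        fx = f(x)
        if np.linalg.norm(fx, np.inf) < tol:   # convergence test
            return x
        x = x + np.linalg.solve(jac(x), -fx)   # Newton update step
    raise RuntimeError("did not converge")

# Toy two-unknown "power-balance"-style system (illustrative only):
#   f1 = x0^2 + x1 - 1.2,  f2 = x0 * x1 - 0.2
f = lambda x: np.array([x[0] ** 2 + x[1] - 1.2, x[0] * x[1] - 0.2])
jac = lambda x: np.array([[2 * x[0], 1.0], [x[1], x[0]]])
sol = newton_solve(f, jac, np.array([1.0, 0.5]))  # converges to (1.0, 0.2)
```

In a real load flow, `f` is the vector of nodal power mismatches and `jac` the sparse power flow Jacobian; the convergence test in the loop is exactly the kind of branching judgement that, as discussed below, GPUs handle poorly.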

Many operations in SSSA can be parallelized on the GPU to improve performance. However, a hybrid approach that combines CPU and GPU computation is necessary in GPU-based SSSA: GPUs do not support branch prediction or speculative execution, so they handle the iterative updates and convergence judgements of nonlinear solvers poorly. For a large sparse matrix, a nonzero-only storage and computation scheme should be adopted. The proposed method combines many small matrices into one matrix for multiplication to make better use of GPU threads.

The remainder of this paper is organized as follows: the background and related work are described in Section

SSSA is an important computation tool in the design and operation of large, interconnected power systems. It determines whether a system is operating in a secure state at any given time with respect to unforeseen outages

The DC power flow method [

In sensitivity analysis, the outage of one transmission line is treated as a perturbation of normal operating conditions [

Static state security analysis is dominated by linear algebraic operations. Reference [

Reference [

Sensitivity analysis is highly parallelizable, which makes it well suited to GPU acceleration of static state security analysis. The topology of the power grid can be simply modeled as a graph with

For a large system, the load flow calculation may require significant computational resources to calculate, store, and factorize the Jacobian matrix

The overall workflow of the static state security analysis system is shown in Figure

The overall workflow of static state security analysis.

When the original data is input, it is preprocessed first for power flow calculations. The system then conducts a full power flow calculation and determines the Jacobian matrix

Data preprocessing is a major and essential stage in power flow calculation and can speed up static state security analysis. Three preprocessing steps are applied: sparse matrix storage, node numbering optimization, and result storage optimization. They aim to reduce the storage space for matrices and to speed up computation on the nonzero elements.

Following raw data input, data in large sparse matrix is compressed in CSR
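CSR compression stores only the nonzero values together with their column indices and per-row offsets. The following is a minimal Python sketch of the encoding (the paper's implementation works on the GPU-side sparse matrices; this toy version only illustrates the format):

```python
import numpy as np

def to_csr(dense):
    """Compress a dense matrix into CSR: values, column indices, row pointers."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)    # keep only nonzero entries
                col_idx.append(j)   # remember their column positions
        row_ptr.append(len(values)) # offset where the next row starts
    return np.array(values), np.array(col_idx), np.array(row_ptr)

A = np.array([[4.0, 0, 0, 1.0],
              [0, 3.0, 0, 0],
              [2.0, 0, 5.0, 0]])
vals, cols, ptrs = to_csr(A)
# vals = [4, 1, 3, 2, 5]; cols = [0, 3, 1, 0, 2]; ptrs = [0, 2, 3, 5]
```

Accessing an element therefore requires reading all three arrays, which is the triple-read cost noted later when the matrix is decompressed before the analysis stage.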

By matrix sorting with the minimum-degree minimum-length algorithm [

Static state security analysis is performed on the results of the power flow calculation. Since three read operations are needed every time CSR-formatted data is accessed, the format incurs a considerable cost in I/O operations. Hence, the matrix is decompressed prior to calculation. Assembling several matrices may take up a lot of space, and the memory may not be large enough to store them all. Therefore, preprocessing involves the following three steps:

Analyze the result of power flow calculation

Generate submatrices and vectors from

Check matrix size. If it does not exceed the capacity, copy it directly into GPU device memory; otherwise, partition the matrix and copy submatrices into device.
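The size check in the last step might be sketched as below. This is an illustrative Python model only; the capacity value and the row-wise partitioning strategy are assumptions, standing in for the real device-memory query and the paper's partition scheme.

```python
import numpy as np

def plan_transfer(matrix, capacity_bytes):
    """Split a matrix into row-wise chunks that each fit in device capacity."""
    if matrix.nbytes <= capacity_bytes:
        return [matrix]                       # fits: copy in one transfer
    rows_per_chunk = max(1, capacity_bytes // matrix[0].nbytes)
    return [matrix[i:i + rows_per_chunk]      # partition into row blocks
            for i in range(0, matrix.shape[0], rows_per_chunk)]

A = np.zeros((8, 4))             # 8 rows x 4 float64 columns = 256 bytes
chunks = plan_transfer(A, 128)   # this capacity holds 4 rows per chunk
```

With a 128-byte capacity the 256-byte matrix is copied as two 4-row submatrices; with a larger capacity it would be copied whole.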

Power flow calculation is executed with the

The workflow of power flow calculation.

The multifrontal algorithm [

The method is formulated in terms of frontal matrices and update matrices. The processing order of frontal matrices is determined by the elimination tree from the preprocessing. The order is expressed by a frontal matrix chain, in which each frontal matrix is designated as a leaf and processed by a CPU thread. For node

Matrix operations account for a considerable share of the time in the multifrontal method. A threshold, derived from the matrix calculations, is defined to distinguish CPU tasks from GPU tasks. If the number of computations exceeds the threshold, the main thread allocates the task to the GPU; otherwise, the small-scale matrix is computed on the CPU. The nodes in the frontal matrix chain are calculated one by one until all of them have been processed.
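The threshold-based dispatch might look like the following Python sketch. The cutoff value and the cubic cost estimate are assumptions for illustration; the paper derives its threshold from the actual matrix calculations.

```python
import numpy as np

THRESHOLD = 1_000_000   # hypothetical operation-count cutoff (assumption)

def dispatch(frontal_chain):
    """Split the frontal-matrix chain into CPU-sized and GPU-sized work."""
    cpu_tasks, gpu_tasks = [], []
    for F in frontal_chain:
        n = F.shape[0]
        ops = 2 * n ** 3            # rough dense-multiplication cost estimate
        (gpu_tasks if ops > THRESHOLD else cpu_tasks).append(F)
    return cpu_tasks, gpu_tasks

chain = [np.eye(8), np.eye(200), np.eye(16)]
cpu_tasks, gpu_tasks = dispatch(chain)   # only the 200x200 front goes to GPU
```

Small frontal matrices stay on the CPU, avoiding kernel-launch and transfer overhead that would dominate their tiny compute cost.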

In this module, the GPU-based sensitivity method is used for static state security analysis. The elements in the matrices of (

Calculate

Calculate

Calculate

Calculate (

Two matrix multiplications and one matrix inversion are involved in the analysis. Following these steps, the changes in the node state variables can be obtained, so the power flow on each branch after a line outage can be acquired. Moreover, many SSSA cases can be combined and run in parallel on a GPU to improve efficiency.
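Combining many cases means the inversion and the two multiplications can be executed as batched operations over all outage cases at once. The Python/numpy sketch below only models that batching idea; the matrices are random stand-ins, not the actual sensitivity matrices of the analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cases, n = 64, 4                      # 64 outage cases, 4x4 case matrices

# Random well-conditioned stand-ins for the per-case sensitivity matrices:
A = rng.standard_normal((n_cases, n, n)) + 4 * np.eye(n)
B = rng.standard_normal((n_cases, n, n))
C = rng.standard_normal((n_cases, n, n))

# The three kernels of the analysis, batched over all cases in one call
# (mirrors combining many SSSA cases into a single GPU launch):
A_inv = np.linalg.inv(A)                # one batched inversion
result = B @ A_inv @ C                  # two batched multiplications
```

On the GPU, each small per-case problem maps to a group of threads, so a single launch keeps far more cores busy than 64 separate small operations would.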

In

Submatrix multiplication with 16 threads.

It is best practice to allocate more than one multiplication operation to a block. A block can contain at most 1024 threads, which accommodates 64 submatrix multiplication operations. Host threads combine their data into one large matrix and copy it into the block's shared memory. Since all threads in a block execute the same instructions, each thread must index its own portion of the large matrix. The distribution of logical memory is shown in Figure
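The thread-to-element mapping inside one such block can be simulated sequentially in Python: with 16 threads per 4x4 submatrix, 1024 "threads" cover 64 submatrix multiplications, each thread producing one output element. This is a CPU-side model of the indexing only, not the CUDA kernel itself.

```python
import numpy as np

def block_multiply(A_batch, B_batch):
    """Simulate one 1024-thread block: thread t computes element (r, c) of submatrix s."""
    n_sub, n = A_batch.shape[0], A_batch.shape[1]          # 64 submatrices of size 4x4
    C = np.zeros_like(A_batch)
    for t in range(n_sub * n * n):                          # 1024 "threads"
        s, rc = divmod(t, n * n)                            # which submatrix this thread owns
        r, c = divmod(rc, n)                                # which output element
        C[s, r, c] = A_batch[s, r, :] @ B_batch[s, :, c]    # one dot product per thread
    return C

rng = np.random.default_rng(1)
A = rng.standard_normal((64, 4, 4))
B = rng.standard_normal((64, 4, 4))
C = block_multiply(A, B)
```

Because every thread runs the same index arithmetic on its own `t`, the instruction stream is uniform across the block, matching the SIMT constraint described above.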

The distribution of logical memory.

Between

If a

In this way, four threads can be used with (
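The partitioned inversion described above can be grounded in the standard 2x2 block-inversion identity based on the Schur complement S = D - C A^{-1} B; the four blocks of the inverse can then be computed largely independently (one per thread). The sketch below is a numpy illustration of that identity under the assumption that the leading block A is invertible; the paper's exact partitioning details are not reproduced here.

```python
import numpy as np

def block_inverse(M, k):
    """Invert M by partitioning into [[A, B], [C, D]] with A of size k x k."""
    A, B = M[:k, :k], M[:k, k:]
    C, D = M[k:, :k], M[k:, k:]
    A_inv = np.linalg.inv(A)
    S_inv = np.linalg.inv(D - C @ A_inv @ B)        # Schur complement inverse
    # The four blocks of M^{-1}; each could be assigned to its own thread:
    top_left  = A_inv + A_inv @ B @ S_inv @ C @ A_inv
    top_right = -A_inv @ B @ S_inv
    bot_left  = -S_inv @ C @ A_inv
    return np.block([[top_left, top_right], [bot_left, S_inv]])

rng = np.random.default_rng(2)
M = rng.standard_normal((6, 6)) + 6 * np.eye(6)     # well-conditioned test matrix
M_inv = block_inverse(M, 3)
```

Each block inverse involves only k x k inversions and small multiplications, so the working set per thread stays small enough for shared memory.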

When solving (

Matrix vector multiplication.

In sensitivity analysis,

Matrix vector multiplication by cross-combining storage.

A number of row vectors are stored in one row vector by cross-combination. It is necessary to execute a reduce operation after vector multiplication. The result of vector multiplication is stored back into
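The cross-combined storage with a trailing reduce can be modeled as follows. The exact interleaving used in the paper is not specified here, so this sketch assumes simple concatenation of the row segments; the key point it shows is one elementwise multiply pass followed by a segmented reduction.

```python
import numpy as np

def crossed_matvec(rows, x):
    """Several matrix rows combined into one flat array, then one multiply + reduce."""
    flat = np.concatenate(rows)                      # combined storage of all rows
    prods = flat * np.tile(x, len(rows))             # single elementwise multiply pass
    return prods.reshape(len(rows), -1).sum(axis=1)  # segmented reduce: one dot per row

rows = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])]
x = np.array([1.0, 0.5, 2.0])
y = crossed_matvec(rows, x)   # equals [1 + 1 + 6, 4 + 2.5 + 12] = [8.0, 18.5]
```

On the GPU, the multiply pass maps one thread per element and the reduction runs per segment, so several matrix-vector products share a single kernel launch.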

In general practice, the cublasSgemm function in cuBLAS is called for GPU matrix multiplication. Matrices in C/C++ are stored in row-major order, but cuBLAS assumes that matrices on the device are stored in column-major order. Exchanging the order is a time-consuming operation. Hence the RC-MM
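The layout mismatch can be sidestepped with the identity (AB)^T = B^T A^T: a row-major buffer handed to a column-major routine is seen as the transpose, so swapping the operand order makes the column-major routine emit A·B directly into a buffer that reads correctly as row-major. The numpy sketch below demonstrates the identity; whether this is exactly what the paper's RC-MM scheme does is an assumption.

```python
import numpy as np

def gemm_rowmajor_via_colmajor(A, B):
    """Emulate a column-major GEMM (like cublasSgemm) fed row-major buffers."""
    A_cm = A.T            # what a column-major routine "sees" in A's row-major buffer
    B_cm = B.T
    C_cm = B_cm @ A_cm    # swapped operands: column-major routine computes (A @ B)^T
    return C_cm.T         # reinterpreting that result as row-major yields A @ B

rng = np.random.default_rng(3)
A, B = rng.standard_normal((3, 5)), rng.standard_normal((5, 2))
C = gemm_rowmajor_via_colmajor(A, B)   # equals A @ B with no reordering pass
```

No element of either matrix is physically moved; only the interpretation of the buffers changes, which is why the transpose cost disappears.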

For the multiplication in the multifrontal method, one matrix is stored in row-major order and the other in column-major order. The matrices of the same frontal chain share an array, so the entire row or column data is copied to global memory once; each computation then uses only part of the data, which avoids moving large amounts of data on demand.

The experiments are conducted on a server with an Intel i7 950 (3.07 GHz) CPU, 16 GB of memory, and an NVIDIA GeForce GTX 460. CentOS 5.9 Linux with CUDA 4.0 is used. From MATPOWER [

Test datasets for SSSA experiments.

| Dataset name | CA300 | CA3012wp | CA3120sp | CA5472 | CA5492wp | CA6024 | CA6240 |
|---|---|---|---|---|---|---|---|
| Number of nodes | 300 | 3012 | 3120 | 5472 | 5492 | 6024 | 6240 |
| Number of lines | 411 | 3572 | 3693 | 9794 | 9824 | 10706 | 11069 |

The speedup ratio

Executed on GPU and CPU platforms, the datasets in Table

Comparison in terms of execution time between CPU and GPU.

| Scale | 300 | 3000 | 3000 | 5000 | 5000 | 6000 | 6000 |
|---|---|---|---|---|---|---|---|
| Dataset name | CA300 | CA3012wp | CA3120sp | CA5472 | CA5492wp | CA6024 | CA6240 |
| CPU (ms) | 10.0 | 446.6 | 476.7 | 2170.0 | 2186.7 | 2603.3 | 2793.3 |
| GPU (ms) | 19.7 | 269.3 | 269.6 | 1214.3 | 1213.0 | 1408.0 | 1483.7 |

In Table

The experimental results show that the execution time on the GPU is shorter than on the CPU, except at the scale of 300 nodes. Transferring data between host and device takes considerable time on the GPU. At 300 nodes the transfer time cannot be ignored and accounts for a significant proportion of the GPU execution time. Moreover, with only 300 submatrices, each warp processes about two submatrices on average; on the 336 cores of the GTX 460, this means that over half the cores are idle throughout the processing, a serious waste of GPU resources that hinders overall performance.

Excluding the 300-node dataset, the accelerating effect of GPU computation is evident, and the speedup grows with data scale. GPU computing reduces the calculation time by about 40% compared with execution on a 4-core CPU. The increasing trend of the GPU speedup is shown in Figure

Comparison in terms of execution time between CPU and GPU speedup.

Equation (

Comparison in terms of matrix multiplication execution times of the two experiments.

The optimization method achieves a speedup of approximately 1.7. A warp launches 32 parallel threads; if the matrices are not merged into a block, half the threads in each warp remain idle. After matrix combination, all 32 threads in the warp are kept busy, which explains the good performance of the optimization method.

Two experiments are conducted on

Comparison in terms of small matrix inversion times of the two experiments.

In this paper, GPU-based static state security analysis is proposed for power systems. The GPU-based multifrontal method is implemented to solve the large sparse matrix, and sensitivity analysis is chosen for static state security analysis on the GPU. To make full use of the GPU device, several optimizations of matrix operations are presented, such as combining data across multiple small-scale matrix multiplications and the partitioned-matrix method for matrix inversion.

Experimental results indicate that the proposed algorithm on GPU can significantly improve system performance. Our results show a speedup of 1.7–1.9 with power system simulation cases from a scale of 3,000 to 6,000.

In future work, it may be desirable to further improve performance by porting the system and methods to more scalable distributed-memory environments, such as multi-GPU [

The authors declare that there is no conflict of interest regarding the publication of this paper.

This work was supported by the National Natural Science Foundation of China (no. 61133008), the National 973 Key Basic Research Plan of China (no. 2013CB2282036), Major Subject of State Grid Corporation of China (no. SGCC-MPLG001(001-031)-2012), the National 863 Research and Development Program of China (no. 2011AA05A118), and the National Science and Technology Pillar Program (no. 2012BAH14F02).