
Real-time electromagnetic transient simulators are important tools in the design stage of new control and protection systems for power systems. Real-time simulators are used to test and stress new devices under conditions similar to those the devices will face in a real network, with the purpose of finding errors and bugs in the design. The computation of an electromagnetic transient is complex and computationally demanding, due to factors such as the speed of the phenomenon, the size of the network, and the presence of time-variant and nonlinear elements in the network. In this work, the development of a SoC-based real-time (and also offline) electromagnetic transient simulator is presented. In the design, the required performance is met from two sides: (a) using a technique to split the power system into smaller subsystems, which allows parallelizing the algorithm, and (b) using specialized parallel hardware designed to speed up the solution flow. The results show that, for the proposed case studies and a balanced distribution of nodes among the subsystems, the proposed approach reduces the total simulation time by a factor of up to 99 compared with the classical approach running on a single high-performance 32-bit embedded ARM Cortex-A9 processor.

Electromagnetic transients (EMT) in power systems [

Developing real-time and even offline simulators is not a straightforward task, due to the size of power systems and the dynamic behavior of some elements, such as switches, which modify the conductance matrix, and nonlinear devices, which make the problem nonlinear. Modifications in the conductance matrix force a new matrix factorization, a time-consuming process. Moreover, for real-time simulations, the complexity increases significantly given the limited time window usually available for processing data, generally on the order of microseconds.

Most proposals for the implementation of real-time simulators deal with the limited window of time available for computing the transient, avoiding the solution of the resulting linear system of equations that relate voltages and currents in the network [

Since power electronics systems such as HVDC are widely used in power systems [

This work describes the implementation of an alternative electromagnetic transient simulator, capable of both offline and real-time operation, based on a System on Chip (SoC). The simulator was designed to keep an almost invariant simulation time step independently of the number of time-variant devices in the network, as it performs a complete solution of the system at every simulation time step.

The architecture of the proposed simulator is based on a highly parallel implementation designed to reach multiple levels of parallelism. For the transient simulation algorithm, a topological network-splitting technique is proposed, which permits a deeper degree of fragmentation than the well-known splitting technique based only on transmission lines [

Power systems are composed of several interconnected electric devices, like generators, transformers, transmission lines, and lumped devices, among others.

Digital simulators are built from synchronous devices, such as CPUs, DSPs, and other digital components, which all operate on clock signals that synchronize their internal electronics; their response is therefore limited by the clock period of the system. As a result, digital simulators cannot reproduce the naturally analog response (voltages and currents) of the power network devices.

Consider the relations at terminals of an inductor and a capacitor as written in (

The resulting difference equation can be taken to a circuit representation, as in Figure

Discretized model of lumped devices.
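As a concrete illustration (not part of the original implementation, and with hypothetical function names), the trapezoidal-rule companion models for the inductor and capacitor can be sketched in Python. Each element becomes a constant conductance in parallel with a history current source, so at every time step the network reduces to a purely resistive circuit:

```python
# Sketch (assumed notation): trapezoidal-rule companion models for L and C.
# The element relation becomes i(t) = G * v(t) + I_h, where I_h depends only
# on quantities computed at the previous time step.

def inductor_companion(L, dt):
    """Companion conductance of an inductor: G = dt / (2L)."""
    return dt / (2.0 * L)

def inductor_history(G, v_prev, i_prev):
    """History source of an inductor: I_h = i(t-dt) + G * v(t-dt)."""
    return i_prev + G * v_prev

def capacitor_companion(C, dt):
    """Companion conductance of a capacitor: G = 2C / dt."""
    return 2.0 * C / dt

def capacitor_history(G, v_prev, i_prev):
    """History source of a capacitor: I_h = -i(t-dt) - G * v(t-dt)."""
    return -i_prev - G * v_prev
```

Because G is constant as long as the time step and element values do not change, only the history sources have to be refreshed at every step.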

To include transmission lines in circuit simulators, the Bergeron model is adopted [

Lossless transmission line two-port equivalent.

Since the history sources of distributed transmission lines are expressed in terms of currents and voltages computed in the past, specifically one line travel time earlier, the model has the property of decoupling the solutions at the two ends of the line, as can be seen in Figure

Transmission line between two areas.
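A minimal sketch of this decoupling (illustrative only, with hypothetical names) keeps a delay buffer per end: each end sees a conductance 1/Zc in parallel with a history source built from the other end's voltage and current one travel time earlier. The example drives end k with an ideal step source and terminates end m with a matched load R = Zc:

```python
from collections import deque

# Sketch (not the authors' code): Bergeron model of a lossless line.
# i_km(t) = v_k(t)/Zc + I_k(t), with I_k(t) = -v_m(t-tau)/Zc - i_mk(t-tau),
# and symmetrically at end m; tau is discretized as delay_steps time steps.

def simulate_matched_line(V, Zc, delay_steps, steps):
    """Ideal step source V at end k, matched load R = Zc at end m.
    Returns (v_m, i_km) at the final step."""
    hist_k = deque([(0.0, 0.0)] * delay_steps, maxlen=delay_steps)  # end-m past
    hist_m = deque([(0.0, 0.0)] * delay_steps, maxlen=delay_steps)  # end-k past
    v_m = i_km = i_mk = 0.0
    for _ in range(steps):
        v_m_old, i_mk_old = hist_k[0]    # end-m quantities, tau ago
        v_k_old, i_km_old = hist_m[0]    # end-k quantities, tau ago
        I_k = -v_m_old / Zc - i_mk_old   # history source at end k
        I_m = -v_k_old / Zc - i_km_old   # history source at end m
        v_k = V                          # ideal source pins node k
        i_km = v_k / Zc + I_k
        # matched load at m: KCL gives -v_m/Zc = v_m/Zc + I_m
        v_m = -Zc * I_m / 2.0
        i_mk = v_m / Zc + I_m
        hist_k.append((v_m, i_mk))
        hist_m.append((v_k, i_km))
    return v_m, i_km
```

Note that the receiving end stays at zero until one travel time has elapsed, and with a matched termination the full source voltage then appears with no reflections.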

Digital Electromagnetic Transient Simulators (DETS) work based on discrete models of the devices that comprise the network. The discrete models are in fact difference equations, which can be easily translated into equivalent resistive circuits, as shown in Figures

Basic flow diagram of an electromagnetic transient simulator.

Since real power systems can contain hundreds of nodes, carrying out network segmentation only by transmission lines can sometimes be insufficient for real-time simulations. Moreover, in networks without transmission lines, distribution of the computational burden is not possible. Therefore, an alternative technique for splitting the system is proposed in this work. This approach can be used together with transmission lines for a finer degree of division, or alone for system topologies without lines.

This technique is based on the compensation method theorem [

Segmentation of a network using a current branch.

It should be considered that the initial power system network in Figure

The method performs the solution in three simplified steps: First, the resulting subsystems are solved as if they were independent of each other; then, the branch current

Consider an initial power system divided into two subnetworks interconnected at a pair of nodes

Network equivalent.

Subnetwork “b” as a Thevenin equivalent.

Now, a second equation is needed, which relates the variables

Subnetwork “a” as a Thevenin equivalent.

From Figure

The algorithm of the solution can be summarized in the following steps:

Compute the equivalent Thevenin impedance vector

Using (

Using (

Calculate the final voltage solution vector
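The steps above can be sketched numerically (assumed notation, not the authors' code) for two subsystems joined by a resistive link of resistance r_link between node p of subnetwork "a" and node q of subnetwork "b":

```python
import numpy as np

# Sketch of the compensation-method solution: solve the subsystems as if
# independent, compute the branch current from the Thevenin equivalents,
# then superpose the correction.

def solve_with_compensation(Ga, Ia, p, Gb, Ib, q, r_link):
    # Step 1: solve each subsystem as if independent of the other.
    va0 = np.linalg.solve(Ga, Ia)
    vb0 = np.linalg.solve(Gb, Ib)
    # Step 2: Thevenin impedances seen from the link terminals.
    ea = np.zeros(len(Ia)); ea[p] = 1.0
    eb = np.zeros(len(Ib)); eb[q] = 1.0
    za = np.linalg.solve(Ga, ea)   # column p of inv(Ga)
    zb = np.linalg.solve(Gb, eb)   # column q of inv(Gb)
    z_th = za[p] + zb[q] + r_link
    # Step 3: branch current from the open-circuit voltage difference.
    i_link = (va0[p] - vb0[q]) / z_th
    # Step 4: superpose the branch-current correction on each subsystem.
    va = va0 - za * i_link
    vb = vb0 + zb * i_link
    return va, vb, i_link
```

The corrected voltages match the solution of the full, unsplit system, while each linear solve only involves a subsystem-sized matrix.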

The method for segmenting the system is not limited to splitting a network into only two sections; it can also be applied to dividing a network into several regions using current branches to interconnect the resulting subsystems, as illustrated in Figure

Network divided into multiple subsystems joined through current branches.

The interconnection of multiple subnetworks increases the complexity of the solution, as now all the current branch contributions must be incorporated at the moment of updating the node voltages; in this case, (

One of the characteristics of the proposal for segmenting the power system is that the method enables a parallel solution scheme, as shown in Figure

Simplified parallel scheme solution flow for a network divided into 4 subsystems.

It can be seen that the diagram in Figure

The method also allows other degrees of parallelism. In fact, each individual task of the solution flow offers many options for parallel processing. For example, levels 2 and 3 require solving a linear system of equations; even though the subsystems are reduced in size, this can be computationally intensive, and parallel techniques can be used to solve the linear systems. Other stages where parallelism can be exploited are levels 5 and 6, where voltages and history terms are updated by the procedure. The update of history terms requires performing a computation of
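Since the per-subsystem solves are independent, the first level of parallelism can be sketched with a thread pool (an illustration of the idea only; the actual design uses replicated hardware, not software threads):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Sketch: the per-subsystem linear solves of one time step are
# independent, so they can run concurrently, one worker per subnetwork.

def solve_subsystem(args):
    G, I = args
    return np.linalg.solve(G, I)

def parallel_step(subsystems):
    """subsystems: list of (G, I) pairs, one per subnetwork."""
    with ThreadPoolExecutor(max_workers=len(subsystems)) as pool:
        return list(pool.map(solve_subsystem, subsystems))
```

In the hardware implementation the same idea appears as one SEMTS module per subsystem, all running simultaneously.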

The real-time simulator is based on a hardware-software platform supported by a System on Chip (SoC), primarily composed of a hard dual-core ARM processor (software) and a Xilinx Virtex 7 FPGA (hardware). The simulator was designed to combine the advantages of hardware and software. The software part is a user interface that receives the initial network interconnection data (netlist) and the simulation setup parameters, and handles software-hardware synchronization. The simulator is implemented almost entirely in hardware, to avoid spending time on communication between the processor and the FPGA.

The simulator implementation uses a custom architecture focused solely on solving the specific simulation tasks as fast as possible. Techniques such as parallelism and pipelining are used extensively in order to comply with the short time steps usually required in electromagnetic transient simulations. The complexity of the simulator is not visible to the final user and remains behind the software layer, so knowledge of hardware description languages or hardware design is not necessary for performing simulations.

To perform the algorithm described in Section

Electromagnetic transient simulator block diagram.

Small Electromagnetic Transient Simulator block diagram.

A SEMTS module is composed mainly of 5 submodules.

The Control Unit controls and synchronizes all internal processes inside the SEMTS.

The Source Generator (SG) submodule is in charge of generating the source waveforms; it works from lookup tables with interpolation between the discrete points of the waveform.

The History Source Update (HSU) module was designed to compute the history terms of network elements.

The System GI Update (GU) submodule forms and manages the conductance matrix; changes in time-variant elements are handled by this module.

Parallel Linear System Solver (LSS) is designed to solve the linear system of equations.
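The lookup-table-with-interpolation scheme described for the Source Generator can be sketched in software as follows (illustrative only; table size and names are my assumptions, not the SG hardware):

```python
import math

# Sketch: waveform generation from a lookup table with linear
# interpolation between the stored points.

TABLE_SIZE = 64
SINE_TABLE = [math.sin(2 * math.pi * k / TABLE_SIZE) for k in range(TABLE_SIZE)]

def lut_sine(phase):
    """phase in [0, 1) of one period; interpolate between table entries."""
    x = (phase % 1.0) * TABLE_SIZE
    k = int(x)
    frac = x - k
    y0 = SINE_TABLE[k]
    y1 = SINE_TABLE[(k + 1) % TABLE_SIZE]
    return y0 + frac * (y1 - y0)
```

A modest table is enough: with 64 entries the linear-interpolation error of a sine stays below about 0.12% of the amplitude, and the table can hold arbitrary waveforms, not only sinusoids.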

The heart of the simulator is the Linear System Solver (LSS) accelerator, shown in Figure

Internal structure of the LSS unit.

The LSS unit can solve linear systems up to 26-by-26, so any system of that size or smaller can be solved without hardware changes; the adjustments are handled in software. The LSS was designed to handle the matrix row by row: in the architecture, each row is stored in an individual BRAM memory inside a Core_GJ submodule. This approach allows simultaneous access to all rows of the linear system and a parallel execution of the elimination procedure.

During the solution of the linear system, the LSS accelerator emulates the procedure presented below, which is also represented in the flow diagram of Figure

Elimination procedure flow diagram.

Basically, two procedures are carried out to factorize the matrix

Normalization: in this step, a normalized row is calculated; for this, the

Elimination: during the elimination process, the normalized row is used to eliminate the
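The two procedures can be sketched as a plain Gauss-Jordan solver (my own sequential illustration, without pivoting; in the hardware each row lives in its own BRAM, so the eliminations of all rows run in parallel):

```python
# Sketch of the per-row procedures the LSS emulates: the reference row is
# normalized by its diagonal element, then used to eliminate its column
# from every other row of the augmented matrix [A | b].

def gauss_jordan(A, b):
    """Solve A x = b by Gauss-Jordan elimination (no pivoting)."""
    n = len(b)
    rows = [list(A[i]) + [b[i]] for i in range(n)]
    for r in range(n):
        # Normalization: divide the reference row by its diagonal element.
        piv = rows[r][r]
        rows[r] = [x / piv for x in rows[r]]
        # Elimination: subtract a multiple of the normalized row from
        # every other row to zero out column r.
        for i in range(n):
            if i != r:
                fac = rows[i][r]
                rows[i] = [x - fac * y for x, y in zip(rows[i], rows[r])]
    return [rows[i][n] for i in range(n)]
```

After the loop the matrix is reduced to the identity, so the solution is read directly from the augmented column with no back-substitution phase, which is what makes the method attractive for a row-parallel hardware layout.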

The LSS unit performs the parallel elimination using four different modules: Core_GJ, Global Control, Early Start, and the interconnection module.

The Control Unit is dedicated to generating the synchronization signals “solution_phase” and “reference_row”; together, these signals provide detailed information about the progress of the solution of the linear system. The outputs of the Control Unit are used by the Core_GJ modules to decode the current stage of the solution and the instruction to be executed; a Core_GJ responds with the “done” signal to indicate the end of the instruction execution. The Early_start module also depends on the Control Unit for synchronizing the Core_GJ operation. The “solution_phase” signal indicates the current solution phase, which can be the initialization, elimination, or return phase. The “reference_row” signal indicates the row currently being initialized, normalized, or returned.

In Figure

Control Unit block diagram.

Dynamics of the solution_phase and reference_row signals.

Synchronization between Core_GJ modules is not managed internally but rather by external modules called Early_start (Figure

Early Start block diagram.

In the same way, the Early_start module is used to anticipate the arrival of the eliminated row and prepare the lower part of the Core_GJ to save the updated row, as illustrated in Figure

Dynamics of the es_lu_elimination and es_lu_writeBack_RE signals.

The Early_start module can predict and synchronize the Core_lu modules based on the latency of the instruction being executed and that of the following instruction. For this, the internal subblock “sequence controller” keeps a constant memory table with the number of clock cycles to wait before triggering the next subprocess, while the subblock “latency cyc counter” is essentially a clock cycle counter; the counter is compared against the constants table to determine the exact clock cycle at which the next subprocess in the solution flow must begin so that everything proceeds in synchrony.

Core_GJ modules are the most complex hardware blocks in the design and consume most of the hardware of the LSS unit; there are 26 Core_GJ blocks in every LSS unit. During the instantiation, or automatic generation, of the Core_GJ array, each Core_GJ receives a unique ID to differentiate it from the others and to indicate its relative position. This ID also identifies the row assigned to the Core_GJ, ranging from 1 to 26 for the last Core_lu in the array. A block diagram with the main components of a Core_lu is shown in Figure

Block diagram of the Core_lu module.

Core_lu accelerators are designed to work on matrix rows (vectors), which reside in BRAM memories; their instruction set supports only a few basic instructions, designed mainly to compute (

Some supported instructions are as follows.

Move a scalar from memory RAM to a register (num, row, fac, or den).

Multiply/divide the vector by a register.

Add/subtract two vectors.

Update the vector.

Return the vector.

For high performance, the design uses a true dual-port BRAM memory, which supports two simultaneous reads or one read and one write. The Core_GJ also contains duplicated control logic to drive the true dual-port memory. The control logic can be divided into two sections, the lower and the upper part: the upper part handles the read-only port of the RAM, while the lower section is assigned to the read-write port. During operation, both sections work simultaneously whenever possible. A brief description of the internal subblocks of the Core_lu accelerator is given below.

The “Position Decoder” subblock generates the relative position of the core, using the reference_row input driven by the Global Control as reference. The result is signaled through the “equal” flag: when asserted, it indicates that the ID of the Core_lu matches reference_row; otherwise, the ID does not correspond to the input reference_row. The relative position is useful for decoding the instruction to be performed by the Core_lu according to the stage of the solution.

The internal block called the “Control Unit” decodes the instruction to be executed based on the solution_phase input and the relative position generated by the “Position Decoder”; however, it is the “Vector Instruction Generator” subblock that actually generates the synchronized signals to execute the vector instruction.

There are four registers in the block diagram: “num,” “row,” “fac,” and “den.” These registers are temporary locations used together with the “Vector Arithmetic Unit” to perform slightly more elaborate operations on the rows; for example, to execute (

The “Vector Arithmetic Unit” is a floating-point arithmetic unit capable of performing a small set of instructions, such as dividing or multiplying a vector by a scalar, adding or subtracting two vectors, and other simple operations.

Finally, the “Vector Instruction Generator” module is the block which generates the required sequence of control signals for the memory controller, the registers, and the Vector Arithmetic Unit in order to execute the instruction in turn.

In this section, simulation results and measurements of the simulator’s performance are presented; two hypothetical case studies were used for this purpose. All simulations were performed twice: first using only the PS side of the ZYNQ SoC, to obtain a reference for comparison purposes, and a second time using both PS and PL, implementing the proposed algorithm. In all cases, the PS runs at a frequency of 800 MHz, while the logic in the PL runs at 250 MHz, the highest rate directly allowed by the PS. The simulations considered the worst case, meaning that a complete reconfiguration of the system was performed.

The first system to consider is shown in Figure

47-node network under test.

The first simulation was carried out using the classical approach, based on the algorithm of the commercial ATP program, solving the complete network without any partitioning technique or hardware acceleration. The algorithm was compiled at the highest optimization level. After the first run, the timer measured a maximum total time of 449

Equivalent three-section network.

Power network distribution in the simulator.

The maximum time step captured by timers in PS was 11.0

Measured timing for study case I.

Subsystem | Nodes | Time for LSS | Time for UHS | Time for GU | Time for SG | Total
---|---|---|---|---|---|---
1 | 17 | 3.88 | 0.184 | 0.432 | 0.072 | 4.568
2 | 17 | 3.88 | 0.184 | 0.360 | 0.02 | 4.444
3 | 17 | 3.88 | 0.184 | 0.304 | 0.02 | 4.388

For this specific case, the maximum time step was reduced from 449

Voltage at load 1.

Voltage at load 2.

The second power system to be simulated is shown in Figure

82-node network under test in study case 2.

The power system was segmented and distributed into four subsystems, as in Figure

Measured timing for study case II.

Subsystem | Nodes | Time for LSS | Time for UHS | Time for GU | Time for SG | Total
---|---|---|---|---|---|---
1 | 21 | 4.8 | 0.212 | 0.44 | 0.07 | 5.522
2 | 21 | 4.8 | 0.212 | 0.44 | 0.07 | 5.522
3 | 21 | 4.8 | 0.212 | 0.368 | 0.02 | 5.4
4 | 24 | 5.9 | 0.22 | 0.384 | 0.02 | 6.508

Equivalent four-section network in study case 2.

The results show that an improvement of 99 times over the classical approach was obtained. Figure

Simulator results for voltages at loads 1, 2, and 3, study case 2.

Table

SEMTS and PL resources utilization.

Unit | LUTs | BRAMs | DSPs
---|---|---|---
SG | 472 (2%) | 1 (1%) | 12 (8%)
HSU | 472 (1%) | 3 (1%) | 12 (8%)
GU | 8759 (3%) | 4 (1%) | 6 (7%)
LSS | 13939 (6%) | 13 (2%) | 104 (11%)

This work deals with the design and implementation of a real-time electromagnetic transient simulator on a SoC. The approach followed here reaches a deeper segmentation degree than current simulators, which rely only on network segmentation through transmission lines. The proposed methodology can be used together with transmission lines: in that case, the power system is first segmented into areas by means of transmission lines, and the resulting areas can then be segmented again using the method proposed in this work, in order to obtain smaller subareas or subsystems in terms of nodes. A specific methodology for dividing the network is not discussed in this document, although the best performance should logically be obtained when the number of nodes is evenly distributed among the subsystems. An analysis to establish a formal method for breaking down the power system could be worthwhile.

The objective of using LC sections to represent transmission lines is only to illustrate how our proposed sectioning methodology distributes the computational burden for large systems.

The design of real-time electromagnetic transient simulators is a complex task due to the speed of the transients; this drives research on modeling techniques, parallel algorithms, and technologies capable of exploiting them. The real-time electromagnetic transient simulator presented in this work has shown an increase in simulation performance of up to 99 times with respect to the ATP algorithm running on the PS of the SoC. The simulator can run in real time on sections with more than 80 nodes using the currently available hardware and architecture. For larger power systems, simulator performance could be improved by using faster FPGAs with more resources, which can comply with the required simulation timing. Also, the interconnection of multiple FPGAs can provide more resources to implement faster hardware accelerators, which could further improve the simulator performance.

The simulator framework was implemented in a SoC; the proposed architecture was designed to provide the conditions necessary for exploiting the parallelism enabled by the new segmentation algorithm, by providing sufficient replicated parallel hardware. In this sense, the simulator architecture is capable, for example, of solving multiple linear systems of equations, independently computing history sources, or updating the conductance matrices of the generated subsystems simultaneously. The most hardware- and time-consuming accelerator is the LSS unit, which implements a hardware version of Gauss-Jordan elimination designed to speed up the solution. The hardware implementation exploits multiple levels of parallelism, including pipelining, a parallel elimination procedure, and an efficient architecture design.

Simulation results show the proper operation of the proposed simulator compared to the ATP method. The errors detected in the test cases are negligible and are associated with the precision lost when extracting data from the ATP software. Some issues still remain, such as the design of a specific hardware module for solving the link currents (for dealing with fault currents); solving the link currents in hardware will require hardware that supports pivoting, to avoid the propagation of numerical rounding errors. Another interesting issue to consider is the implementation of LU factorization in hardware.

The authors declare that they have no conflicts of interest.