Investigation of Arcing Time Prediction for Secondary Arc Based on Statistical Theory and Multivariable Regression Algorithms

Research on the secondary arc has long been important in the development of ultrahigh-voltage (UHV) transmission technology. Predicting the arcing time of the secondary arc is important for single-phase automatic reclosing of the circuit breaker, which protects the power system. To address this problem, this study uses a multifactor variance model to select the factors that significantly influence the arcing time and verifies the model through residual analysis. The results show that parameters such as wind direction, recovery voltage, secondary current, and arc length significantly influence the arcing time. The density-based clustering algorithm DBSCAN is then used to remove outliers from the data, and the influence of each factor on the arcing time is analyzed through scatterplots. Finally, linear multivariate regression and nonlinear Gaussian process regression are used to fit the results. The results show that the linear regression fits the data well. The proposed method is accurate and provides a novel means of predicting the arcing time.


Introduction
With the increasing deployment of UHV lines, the extinguishing time of the secondary arc has become an important research issue worldwide. Studies have shown that more than 90% of faults in UHV transmission lines are single-phase ground faults [1]. When a single-phase ground fault occurs in a power system, the circuit breaker (CB) and single-phase automatic reclosing (SPAR) operate together to cut off the faulty phase. However, during the operation of SPAR, the electromagnetic coupling between lines generates a secondary arc. If the secondary arc continues to burn, the SPAR fails [2,3]. SPAR is important for relay protection, and its operation depends on accurately predicting the extinction time of the secondary arc [4,5]. Therefore, a clear and detailed understanding of the extinction time is crucial to the correct operation of SPAR.
A considerable amount of research has been conducted on the parameters, generation mechanism, and kinematic law of the secondary arc in recent years. An algorithm has been proposed for detecting the edges in images of the secondary arc to solve the problem of discontinuous and false edges in conventional edge detection operators. The radius, length, and area of the secondary arc were analyzed by processing the edge coordinates thus obtained [6]. The variations in voltage harmonics generated during single-phase AC fault clearance were used to estimate the secondary arc extinction [7]. A secondary arc model has been developed in ATP that brings together the strengths of the main existing models and implements a pseudorandom variation of the secondary arc length to provide accurate extinction times of the secondary arc in studies of monopolar reclosures [8]. A stochastic model of the secondary arc with different initial positions was applied to the arc chain model to calculate the arcing time with dispersion [9]. The extinction characteristics of the secondary arc near an outdoor insulator under different secondary arc currents were studied experimentally [10]. EMTP (Electro-Magnetic Transient Program) was used to establish the secondary arc current calculation model of the transmission line of a 500 kV power station [11]. No study to date has sought to predict the extinguishing time of the secondary arc, so theoretical guidance for SPAR is lacking. This study thus proposes a method to predict the arcing time based on statistical theory and multiple algorithms.
In this study, key parameters are selected via statistical theory, and an explicit expression for the arcing time is then determined with multivariable regression algorithms. Compared with current methods of determining the arcing time of the secondary arc, the proposed method is a novel means of prediction.
This method can provide a theoretical basis for the fast and accurate operation of SPAR. The remainder of this study is organized as follows: Simulations of the secondary arc and the means of data acquisition are presented in Section 2. In Section 3, an analysis of variance model is used to identify the key parameters that significantly influence the arcing time, and the model is verified through residual analysis. The DBSCAN algorithm is used to remove outliers from the data in Section 4. Two regression models are used in Section 5 to fit the arcing time.

Experimental Setup.
Low-voltage simulations of the secondary arc were carried out at the China Electric Power Research Institute. The circuit that was used is shown in Figure 1.
The extinguishing time of the secondary arc is determined by the recovery voltage gradient (kV/m) of the arc channel rather than by the amplitude of the recovery voltage. Therefore, a short-gap arc can be used to simulate secondary arcs on UHV/EHV overhead lines with the same recovery voltage gradient. It is feasible to ignite the secondary arc with a small current (1000 A) instead of a large current at the same recovery voltage gradient with a short arc gap of 0.68 m. Thus, the low-voltage simulation experiment can equivalently simulate the secondary arc on a UHV transmission line [2,12,13].
The circuit was composed of an AC power source (peak voltage of 11.6 kV), an equivalent capacitance $C_q$, a reactor $L_s$, an insulator string, and circuit breakers CB1, CB2, and CB3. $L_s$ was used to generate 1 kA of short-circuit arc current, and the two ends of the insulator string were connected by a nichrome wire to generate the short-circuit arc and form an arc channel. A clear image was obtained through a high-speed camera for analysis. The equivalent capacitance $C_q$ is expressed in terms of $C_p$ and $C_g$, which represent the phase-to-phase capacitance and the phase-to-ground capacitance per unit length, respectively, and $l_0$, which represents the length of the overhead line. Secondary currents of 15 A, 30 A, and 45 A were generated by changing the value of $C_q$. The secondary current here was the effective value of the current calculated through the circuit and is denoted by $I_E$. The values of $L_s$ and $C_q$ and the corresponding values of $I_E$ are shown in Table 1.
A voltage divider (1000:1 voltage ratio) and a current transformer (CT) were used to record the voltage and current of the secondary arc, respectively, and the gradients of the recovery voltage between the electrodes and the secondary arc current in the loop were selected based on the results in [12,14,15]. A fan was placed near the insulator, and artificial wind was set by changing the position and the speed setting of the fan. Once the arc had been extinguished, a recovery voltage was generated across the insulator string. The waveforms of the secondary current and recovery voltage were recorded with an oscilloscope.

Parameter Acquisition.
The combustion process of the secondary arc was very complicated and is shown in Figure 2. Under the large short-circuit current, the nichrome wire was quickly broken at t = 0.008 s. The air around the insulator was strongly ionized, and a bright arc plasma channel was formed. The arc channel diffused over time. When the arc plasma channel had diffused to its greatest extent at t = 0.08 s, the arc channel narrowed and its brightness weakened. The secondary arc was generated in the plasma channel. With the extinction and reignition of the arc, the morphology of the arc changed greatly. In the last reignition at t = 0.194 s, the arc was very bright and its shape was complex, which makes it easy to extract the morphological characteristics of the secondary arc [13,16]. When the arcing time was short, the arc channel had not completely diffused and could thus be captured by the camera as the arc area. The arc area also reflects the intensity of the discharge and, as a morphological characteristic of the arc, can contribute to determining the arcing time. Raw images of the secondary arc captured by a high-speed camera are shown in Figure 3(a). To acquire the morphological characteristics of the arc, the corresponding binary images were produced with the Color Thresholder, as shown in Figure 3(b). The pixel dimensions of the nichrome wire and the secondary arc can be read from the binary images with the Image Region Analyzer in MATLAB. The ratio of pixels to actual length is obtained from the pixel length of the nichrome wire and its actual length, and the actual arc length is then calculated from this ratio. Therefore, the length, area, and diameter of the arc were obtained from the binary images of the secondary arc for further analysis.
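The pixel-to-length conversion described above can be illustrated with a minimal Python sketch. The function names and all numbers (a 680 mm reference wire spanning 340 px, and the arc measurements) are hypothetical and are not taken from the experiment; the study itself used the Image Region Analyzer in MATLAB.

```python
def length_per_pixel(reference_length_mm, reference_length_px):
    """Scale factor (mm per pixel) from an object of known size, here the nichrome wire."""
    if reference_length_px <= 0:
        raise ValueError("reference pixel length must be positive")
    return reference_length_mm / reference_length_px


def to_physical_length(length_px, scale):
    """Convert a length measured in pixels to millimetres."""
    return length_px * scale


def to_physical_area(area_px, scale):
    """Convert a pixel count to square millimetres (area scales with scale squared)."""
    return area_px * scale ** 2


# Hypothetical values, for illustration only.
scale = length_per_pixel(680.0, 340.0)            # mm per pixel
arc_length_mm = to_physical_length(450.0, scale)  # arc length in mm
arc_area_mm2 = to_physical_area(1200.0, scale)    # arc area in mm^2
```

Note that the area conversion uses the square of the scale factor, whereas length and diameter use the scale factor directly.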
The recovery voltage and secondary current for a capacitive current of 30 A are shown in Figure 4. As shown in Figure 4(a), when the secondary arc extinguished, the maximum recovery voltage appeared at the first peak of the waveform (point A in Figure 4). The maximum secondary current appeared before the arc extinguished, and a strong arc discharge was produced at the same time (point B in Figure 4).

Key Parameter Selection
The combustion of the secondary arc is complicated, and it is not known in advance which parameters significantly influence the arcing time. To determine the correlation between the arcing time and the parameters, the analysis of covariance (ANCOVA) method from statistics is used to identify the key parameters [17,18]. The key parameters are then selected for subsequent fitting. On the secondary arc experiment platform, the length of the electrode, the wind speed and direction, and the value and property of the secondary current are artificially set to different values. These parameters are fixed factors in the ANCOVA method. After the experiment, parameters such as the maximum recovery voltage, the transient secondary current at the moment when the maximum recovery voltage was generated, and the maximum secondary current were obtained. Some parameters, such as the arc area, arc length, and arc diameter, cannot be artificially designed and controlled on the experiment platform. These parameters are treated as covariates in the statistical analysis.
ANCOVA is a statistical method that combines the characteristics of variance analysis and regression analysis. The ANCOVA model consists of two main parts:
$$y = X_1 \beta + X_2 \gamma + \varepsilon \tag{2}$$
where $y$ is the arcing time. The independent variables corresponding to $X_1 \beta$ are factors, and $X_1$ is a 0-1 matrix. $X_1 \beta$ represents the variance analysis part of ANCOVA. The independent variables corresponding to $X_2 \gamma$ are covariates, and $X_2 \gamma$ represents the regression analysis part. $\varepsilon$ represents random noise. $X_1 \beta$, $X_2 \gamma$, and $\varepsilon$ can be calculated with statistical software. Covariance analysis is used to obtain the adjusted mean arcing time through linear regression. By converting the covariates to equal values, the influence of confounding covariates is controlled. Then, variance analysis is used to compare the differences between the adjusted means.
In the regression analysis, collinearity among the covariates can lead to results that are inconsistent with objective facts. A collinearity diagnosis can be performed during the regression analysis. Collinearity of the covariates can be assessed with the variance inflation factor (VIF) as follows:
$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2} \tag{3}$$
where $R_i$ is the multiple correlation coefficient from the regression of one covariate on the other covariates. The larger the VIF is, the greater is the likelihood of collinearity between the independent variables. If the VIF is between 0.1 and 10, collinearity between the parameters is not serious. The results of the diagnosis are shown in Table 2. The VIFs of all parameters are larger than 0.1 and lower than 10, which means that the covariates have passed the collinearity diagnosis.
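As an illustration of the VIF calculation, the following Python sketch treats the simplest case of two covariates, where the multiple correlation coefficient of one covariate on the other reduces to their Pearson correlation. The data are made-up numbers, not the values behind Table 2, and the function names are chosen only for this example.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_from_r2(r_squared):
    """VIF = 1 / (1 - R^2); undefined when R^2 reaches 1 (perfect collinearity)."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("R^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared)


# Two hypothetical covariates (for example, arc length and arc diameter).
covariate_a = [1.0, 2.0, 3.0, 4.0, 5.0]
covariate_b = [2.0, 1.0, 4.0, 3.0, 5.0]

r = pearson_r(covariate_a, covariate_b)
vif = vif_from_r2(r ** 2)
```

With more than two covariates, $R_i^2$ would instead come from a multiple regression of covariate $i$ on all the others; the final step from $R_i^2$ to the VIF is unchanged.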
The significance of each parameter in the ANCOVA model is shown in Table 3, where * indicates that the P value is less than 0.05 and ** indicates that the P value is less than 0.01. Table 3 shows that the P values of the current, the property of the current, arc length, maximum recovery voltage, transient secondary current, and wind direction are 0.015, 0.007, 0.0016, 0.022, 0.007, and 0.047, respectively. They thus have a significant effect on the arcing time. The P values of the transient secondary current and the property of the current are below 0.01, which means that these variables have an extremely significant effect on the arcing time. The P values of the maximum secondary current and of the area and diameter of the arc are above 0.05, which means that they do not have a significant effect on the arcing time.
The ANCOVA model needs to be verified by residual analysis. In Equation (2), $\varepsilon$ is assumed to be a random variable that obeys a normal distribution with an expected value of zero. The essence of residual analysis is to test whether this assumption regarding $\varepsilon$ holds.
There should be no explanatory or predictable information in $\varepsilon$. The residual is the difference between the observed value $y_i$ and the prediction $\hat{y}_i$ obtained from the regression analysis, and is expressed as $e_i$.
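The standardized residual is the value obtained by dividing the residual by its standard deviation and is expressed as $z_e$. The residual of the $i$th observed value is $e_i = y_i - \hat{y}_i$, and its standardized residual is $z_{e_i} = e_i / s_e$, where $s_e$ denotes the standard deviation of the residuals. The residual analysis shows that the data for the ANCOVA model satisfy the normality test and the homogeneity of variance, so the model is valid.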
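The residual and standardized residual defined above can be computed as in the following Python sketch. It assumes the sample standard deviation (with n - 1 in the denominator) as the normalizing value, which the text does not specify, and the observed and predicted values are invented for illustration.

```python
import statistics


def standardized_residuals(observed, predicted):
    """Return (residuals, standardized residuals) for paired observations."""
    residuals = [y - y_hat for y, y_hat in zip(observed, predicted)]
    # Assumption: sample standard deviation of the residuals is the scale.
    s_e = statistics.stdev(residuals)
    if s_e == 0:
        raise ValueError("residuals have zero spread; cannot standardize")
    return residuals, [e / s_e for e in residuals]


# Hypothetical arcing times (s) and model predictions (s).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]

residuals, z = standardized_residuals(observed, predicted)

# A common informal check: most standardized residuals should lie within +/-2
# if the normality assumption on the noise term is reasonable.
share_within_2 = sum(abs(v) <= 2 for v in z) / len(z)
```

A histogram or normal probability plot of `z` would be the usual visual companion to this check.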

Outlier Removal
It is concluded in Section 3 that the key parameters are arc length, maximum recovery voltage, transient secondary current, and transient power. However, accidental data in the experiment affect the subsequent fitting of the results. Outliers thus need to be removed to improve the accuracy of the fitting model. There are two approaches to removing outliers. The first is to remove them through statistical criteria, such as Grubbs' criterion and the Pauta criterion. However, statistical analysis cannot eliminate individual data that do not conform to the overall trend. The second approach is to use an appropriate machine learning method. The density-based clustering algorithm DBSCAN is an unsupervised machine learning clustering algorithm. DBSCAN does not require the number of clusters to be specified, and it works well with clusters of any shape and size [19,20]. Therefore, the DBSCAN algorithm is used here to remove the outliers in the data.
Epsilon and minPts are first defined for the DBSCAN algorithm. Epsilon is the maximum radius of the neighborhood around a point. If the mutual distance between data points is less than or equal to the specified value of epsilon, these points are grouped into one cluster. A larger value of epsilon produces clusters with more data. In general, a smaller value of epsilon should be chosen because clusters with a dense data distribution are needed. minPts is the minimum number of points that a cluster must contain. A lower minPts leads to more clusters, some of which may be formed from noise or outliers. A higher minPts makes clusters more robust. However, if a cluster is too large, outliers can be included in it, which is not conducive to removing them.
To determine a suitable value of epsilon, the average distance between every point and its nearest neighbor is used as a benchmark. minPts should be greater than or equal to the number of dimensions of the dataset, but it should be adjusted according to the experimental data at hand. The result of DBSCAN can be assessed by visualizing the data: the better the interpretability of the results is, the better is the effect of DBSCAN. The advantage of using the DBSCAN algorithm for outlier removal is that it can perform exploratory classification. Moreover, DBSCAN can cluster data with nonlinear relationships.
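The roles of epsilon and minPts can be made concrete with a small, self-contained Python sketch of DBSCAN and of the nearest-neighbor benchmark for epsilon. This is a textbook-style implementation written for illustration, not the code used in the study; it assumes Euclidean distance and counts a point as its own neighbor, and the five sample points are invented.

```python
import math

NOISE = -1


def neighbors_within(points, i, eps):
    """Indices of all points (including i itself) within eps of point i."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and NOISE for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = neighbors_within(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors_within(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


def mean_nearest_neighbor_distance(points):
    """Average distance from each point to its nearest other point."""
    nearest = [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]
    return sum(nearest) / len(nearest)


# Hypothetical data: four close points and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
labels = dbscan(points, eps=1.5, min_pts=3)
benchmark = mean_nearest_neighbor_distance(points)
```

Here the isolated point is labeled as noise, while the four neighboring points form a single cluster. In practice, variables with different units (for example, kV and seconds) would normally be rescaled before distances are computed.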
Epsilon was set to four and minPts to three. According to [13,16], arc length is positively correlated with arcing time, and recovery voltage is negatively correlated with arcing time. Clusters that do not fit these correlations need to be removed as outliers. The results of clustering are shown in Figure 6. The red × symbols in the figure are the calculated outliers, and the blue dots represent values that were considered normal. These outliers can be explained as accidental values that occurred for various reasons in the experiment. Figure 6(a) shows the results of the recovery voltage with arcing time after DBSCAN. Points in the upper-right corner were sparsely distributed and were considered outliers. Once they had been removed, the arcing time at most points was shorter than 0.3 s. A total of 82.2% of the recovery voltage values were in the range of 25 kV to 32 kV. Figure 6(a) also shows a negative correlation between recovery voltage and arcing time: the higher the maximum recovery voltage is, the more likely the secondary arc is to extinguish. The results of the transient secondary current with arcing time after the use of DBSCAN are shown in Figure 6(b). Owing to the limited resolution of the oscilloscope, the transient secondary current was recorded as discrete values of 0.0667, 0.1667, 0.6667, 1.3333, and 3.3333. Only three points had a value of 3.3333, one of which was considered an outlier owing to its distance from the other points. Most points had a value of 1.333 A or 1.667 A. Figure 6(c) shows the result of arc length with arcing time after DBSCAN was applied. After the outliers are removed, the scatterplot shows a significant positive correlation between arc length and arcing time. This means that the longer the arc was before the secondary arc extinguished, the longer the arcing time was.
During secondary arc combustion, the movement of the secondary arc was irregular in the vertical and horizontal directions owing to thermal buoyancy and the electromagnetic force. When the secondary arc was about to extinguish, the arc was quickly drawn out. Therefore, the longer the secondary arc burned, the longer the arc was before it extinguished.
In summary, DBSCAN and the conclusions of Section 3 are used to analyze the relationship between the arcing time and the parameters through scatterplots. However, the conclusions drawn from scatterplots alone are intuitive rather than quantitative. Correlations and scatterplots alone cannot yield the precise functional relationship between the arcing time and the relevant parameters.

Establishing and Analyzing Regression Model of Secondary Arc
The arcing time is significantly influenced by recovery voltage, secondary current, and arc length. This conclusion can help predict the arcing time.
There are two ways to predict the arcing time. One is to fit a formula to the given data. The advantage of such fitting is that it is simple and easy to implement, but the predicted result may differ greatly from the true response. To improve the accuracy of fitting, more complicated terms such as derivatives are often needed, which greatly increases the difficulty of fitting and of formulating the expression. The other is to use machine learning or deep learning algorithms to make predictions. These algorithms are accurate but lack a simple and clear expression. The two methods are compared to obtain an accurate prediction of the arcing time.

Fitting the Multivariate Model of Secondary Arc.
Before fitting, the key factors in the data need to be processed. There are three factors, namely, current value, current property, and wind direction. Because the sample size is insufficient, grouping the data by factor and fitting each group separately does not give good results.
One-hot encoding is a way to convert factors into numerical values so that they can be introduced into the regression equation. For each sample, one-hot encoding sets the attribute of a factor that applies to the sample to 1 and the other attributes to 0.
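A minimal Python sketch of one-hot encoding is given below for illustration. The three current levels (15 A, 30 A, and 45 A) are those used in the experiment; the sample sequence and the function name are hypothetical.

```python
def one_hot(values, categories=None):
    """Encode each value as a 0/1 row with a single 1 at its category position."""
    if categories is None:
        categories = sorted(set(values))
    position = {category: k for k, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1
        rows.append(row)
    return categories, rows


# Hypothetical sequence of secondary current settings (A) for four tests.
current_levels = [15, 30, 45, 30]
categories, encoded = one_hot(current_levels)
```

When the encoded columns are used together with an intercept in a linear regression, one column per factor is usually dropped, because the columns of a factor always sum to 1 and would otherwise be collinear with the intercept.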
The maximum recovery voltage and the corresponding transient current are generated only after the arc has extinguished. They are therefore of little help in predicting the arcing time. The arc length, in contrast, increases during arc combustion, so it is more reasonable to predict the arcing time from the arc length.
Linear regression is the most widely used regression method and is far faster than other regression models. The factors after one-hot encoding and the arc length are set as independent variables to conduct the multiple linear regression of Equation (7), in which $I_e$, $I_p$, and $W_d$ are all $n \times 1$ column vectors representing the numerical values of the current value, current property, and wind direction after one-hot encoding, and $B_2$, $B_3$, and $B_4$ are $1 \times n$ row vectors.
The results of the multiple linear regression are shown in Figure 7. Twenty percent of the points lie below the red line. Most of the predicted values are greater than the actual values. When the actual arcing time is short, the predicted value is much higher than the actual value. In the regression model of Equation (7), there is a positive correlation between arc length and arcing time. In the early stage of the secondary arc, the arc is short, and it elongates as combustion proceeds. The arc reaches its maximum length at the last reignition. If the predicted value provided by the regression model is less than the actual value at the early stage of the secondary arc, it is highly likely to lead to failure of SPAR.
Because the linear regression tends to overestimate the arcing time, it can provide a safe margin for SPAR closing at the initial stage of the secondary arc.
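The following Python sketch illustrates how a multiple linear regression on arc length and a one-hot encoded factor can be fitted by ordinary least squares. It is only a sketch under stated assumptions: the data are synthetic and generated from known coefficients (0.05, 0.2, and 0.03), so they say nothing about the experimental results, and the single 0/1 column stands in for the full set of encoded factors.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (collinear predictors?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_linear(rows, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    design = [[1.0] + list(row) for row in rows]
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


def predict(coefficients, row):
    """Predicted response for one row of predictors."""
    return coefficients[0] + sum(b * v for b, v in zip(coefficients[1:], row))


# Synthetic predictors: (arc length in m, 0/1 indicator for one factor level).
rows = [(0.5, 0), (1.0, 1), (1.5, 0), (2.0, 1), (2.5, 0)]
# Synthetic arcing times generated from known coefficients, without noise.
y = [0.05 + 0.2 * length + 0.03 * flag for length, flag in rows]

coefficients = fit_linear(rows, y)
estimate = predict(coefficients, (1.2, 1))
```

Because the synthetic data contain no noise, the fit recovers the generating coefficients; with measured data, the coefficients would only be estimates, and the residuals would need to be checked as in Section 3.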

Gaussian Process Regression.
Gaussian process regression (GPR) is a machine-learning-based regression algorithm. It performs well on complex problems involving many dimensions and nonlinearity [21,22]. GPR is based on the kernel function. The Gaussian process first establishes the prior probability and then completes the transformation from the prior probability to the posterior probability under the Bayesian framework. The Gaussian process can estimate the hyperparameters during regression; hence, the parameters of the kernel function need not be set manually.
For a given set of data $D = \{(x_i, y_i)\}_{i=1}^{n}$, the input is $x_i \in \mathbb{R}^d$ and the output is $y_i \in \mathbb{R}$. In the given data, $f(x^{(1)}), f(x^{(2)}), \ldots, f(x^{(n)})$ form a set of random variables that has a joint Gaussian distribution, where each $x^{(i)}$ is the column vector composed of the significant variables. The statistical features in GPR are the mean function $m(x)$ and the covariance function $k(x, x')$. Taking the noise in the target $y$ into account, the general model for the Gaussian regression problem is
$$y = f(x) + \varepsilon$$
In each experiment, the influence of the surroundings can be considered random. Therefore, such influences can be regarded as Gaussian noise, denoted $\varepsilon \sim N(0, \sigma_n^2)$. When $f(x)$ follows a Gaussian distribution, $y$ also follows a Gaussian distribution. The covariance function $C(X, X)$ can be expressed as follows:
$$C(X, X) = K(X, X) + \sigma_n^2 I$$
According to the Bayes principle, a Gaussian prior distribution of $Y$ is established for the set of data $D = \{(x_i, y_i)\}_{i=1}^{n}$ and transformed into a posterior distribution for the $n_*$ given test data $D_1 = \{(x_i, y_i)\}_{i=n+1}^{n+n_*}$. The Gaussian prior distribution of the output $Y$ and the joint Gaussian prior distribution of $Y$ with the predicted value $y_*$ are as follows:
$$Y \sim N\left(0, K(X, X) + \sigma_n^2 I_n\right)$$
$$\begin{bmatrix} Y \\ y_* \end{bmatrix} \sim N\left(0, \begin{bmatrix} K(X, X) + \sigma_n^2 I_n & K(X, x_*) \\ K(x_*, X) & K(x_*, x_*) \end{bmatrix}\right)$$
where $K(X, X)$ is the covariance matrix of the training set and $I_n$ is the $n$-dimensional identity matrix. $K(x_*, X) = K(X, x_*)^{T}$ is the covariance matrix between the test set $x_*$ and the training set $X$, and $K(x_*, x_*)$ is the covariance matrix of the test set itself. The posterior distribution of the predicted value $y_*$ is as follows:
$$y_* \mid X, Y, x_* \sim N\left(\bar{y}_*, \operatorname{cov}(y_*)\right)$$
$$\bar{y}_* = K(x_*, X)\left[K(X, X) + \sigma_n^2 I_n\right]^{-1} Y$$
$$\operatorname{cov}(y_*) = K(x_*, x_*) - K(x_*, X)\left[K(X, X) + \sigma_n^2 I_n\right]^{-1} K(X, x_*)$$
An important problem to solve when applying the GPR model is to test whether the predicted value reflects the true correlation between the arcing time and the variables influencing it. Usually, the coefficient of determination, R-square, can be used to test the quality of the model fit. R-square is determined by SSR and SST, where the former is the sum of squares of the differences between the predicted values and the mean of the observed data, and the latter is the sum of squares of the differences between the observed data and their mean.
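The GPR prediction step in this subsection, a posterior mean and variance obtained from a noisy Gaussian prior, can be sketched in plain Python as follows. The sketch makes several assumptions that the text does not state: a zero prior mean, a squared-exponential kernel with fixed hyperparameters (no hyperparameter optimization), a one-dimensional input, and invented training data.

```python
import math


def rbf_kernel(a, b, length_scale=1.0, signal_std=1.0):
    """Squared-exponential covariance between two scalar inputs (assumed kernel)."""
    return signal_std ** 2 * math.exp(-((a - b) ** 2) / (2.0 * length_scale ** 2))


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def gpr_predict(x_train, y_train, x_star, noise_var=1e-6):
    """Posterior mean and variance of the latent function at one test input."""
    n = len(x_train)
    # C = K(X, X) + sigma_n^2 I
    c = [
        [rbf_kernel(x_train[i], x_train[j]) + (noise_var if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]
    k_star = [rbf_kernel(x_star, xi) for xi in x_train]   # K(x*, X)
    alpha = solve(c, y_train)                             # C^{-1} Y
    weights = solve(c, k_star)                            # C^{-1} K(X, x*)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    variance = rbf_kernel(x_star, x_star) - sum(k * w for k, w in zip(k_star, weights))
    return mean, variance


# Invented training data (for example, arc length -> arcing time).
x_train = [0.0, 1.0, 2.0]
y_train = [0.0, 1.0, 0.5]

mean_near, var_near = gpr_predict(x_train, y_train, 1.0)   # at a training input
mean_far, var_far = gpr_predict(x_train, y_train, 10.0)    # far from all data
```

Near a training input the prediction follows the data with small variance, whereas far from the data it falls back to the zero prior mean with the prior variance. This behavior is one reason a GPR curve can look flatter than the data when the kernel hyperparameters smooth strongly.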
R-square is defined as the ratio of SSR to SST, and its value is between zero and one. It is a statistical indicator of the reliability of the regression model. The results of the GPR model are shown in Figure 8. The data in each group are plotted on the horizontal axis, with the index n in the range [1, 97]. The vertical axis represents the arcing time, which is arranged from small to large. The discrete dots represent the true response, and the red line is the arcing time as predicted by the GPR model. When n is less than 50, the predicted value of GPR is larger than the true value. When n is greater than 50, the predicted value is smaller than the true value. The slope of the GPR curve in the figure is smaller than that of the true response. The R-square of the GPR model is 0.273, and the R-square of the linear model is 0.377. When n is greater than 50, the difference between the linear regression prediction curve and the real value is small, and linear regression predicts well in this range. Both regression models have a certain degree of significance. The fit of linear regression is better than that of the nonlinear GPR model. Overall, the predicted value of linear regression is greater than that of GPR. The GPR prediction curve is relatively flat. The predicted value of linear regression is larger than the real value, and the linear regression prediction curve fluctuates relatively strongly.
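The R-square definition used here, the ratio of SSR to SST, can be written as a short Python function. The observed and predicted values below are invented for illustration and are unrelated to the values of 0.273 and 0.377 reported above.

```python
def r_square(observed, predicted):
    """R-square as SSR / SST, following the definition in the text."""
    mean_observed = sum(observed) / len(observed)
    # SSR: squared deviations of the predictions from the mean of the observations.
    ssr = sum((p - mean_observed) ** 2 for p in predicted)
    # SST: squared deviations of the observations from their own mean.
    sst = sum((y - mean_observed) ** 2 for y in observed)
    if sst == 0:
        raise ValueError("observations are constant; R-square is undefined")
    return ssr / sst


# Hypothetical observed and predicted arcing times (s).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]

score = r_square(observed, predicted)
```

The ratio SSR/SST coincides with the alternative form 1 - SSE/SST only for least-squares fits that include an intercept, so the two forms can give different numbers for models such as GPR.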
In this section, linear regression and nonlinear GPR are used to fit the arcing time. The relationship between arc length and arcing time is better described by the linear model than by the nonlinear one. The results show that the linear model fits well and can provide a margin for SPAR at the initial stage of secondary arc combustion.

Conclusion
This study used the data obtained from a low-voltage simulation of the secondary arc. The data were analyzed to obtain the key factors influencing the arcing time. After outliers were removed with the DBSCAN algorithm, a linear multivariate regression model and a GPR model were fitted, and the fitting results were compared. The following conclusions can be drawn from the analysis:
(1) The ANCOVA calculations showed that the most significant parameters affecting the arcing time are wind direction, recovery voltage, secondary current, and arc length.
(2) Scatterplots drawn after outlier removal show that the recovery voltage was negatively correlated with arcing time. By contrast, the arc length and arcing time showed a positive correlation. These correlations may help in the development of suppression technologies for secondary arcs.
(3) A comparison of the R-square values of the two regressions indicates that there was a linear relationship between the arcing time and the key parameters when the arcing time was long. Linear regression can provide a safe closing margin at the initial stage of the arc and can avoid further faults caused by premature closing.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
No potential conflicts of interest were reported by the authors.