A Power Transformer Fault Prediction Method through Temporal Convolutional Network on Dissolved Gas Chromatography Data

The power transformer is a key piece of power grid equipment, and its potential faults limit system availability and enterprise security. However, fault prediction for power transformers is limited by low data quality, the weak effect of binary classification, and small-sample learning. We propose a fault prediction method for power transformers based on dissolved gas chromatography data: after the defective raw data are preprocessed, fault classification is performed on the results of a predictive regression. A Mish-SN Temporal Convolutional Network (MSTCN) is introduced to improve accuracy in the regression step. Several experiments are conducted on a data set from China State Grid, and the experimental results are discussed.


Introduction
As key equipment, the power transformer directly affects the system availability and enterprise security of the power grid. Dissolved gas analysis (DGA) is one of the most reliable means of condition estimation and fault diagnosis for oil-immersed transformers and is recommended for condition evaluation by standards from the International Electrotechnical Commission (IEC) and the National Energy Administration. Through gas chromatography online monitoring technology, business analyses such as transformer fault detection can be done in quasi-real time, which improves the safety and stability of the power grid [1].
However, predicting power transformer faults faces challenges due to inherent limitations in practice. First, the low quality of raw data makes direct data usage infeasible because transmission links may be interrupted and data packets may be lost [2]. Owing to equipment or communication network problems, incomplete, missing, and outlier records exist in gas chromatography data. Availability is usually an essential requirement [3], and such defective data make fault prediction more difficult. Second, traditional methods such as the widely used binary classification are not accurate enough, because threshold-based fault detection ignores the data below the threshold and makes no use of historical trends. Values oscillating around the threshold may imply a potential fault but cannot be found by those methods alone. Third, the model is hard to learn from small samples. Power transformer faults occur only occasionally, so fault-related data make up a small proportion of the records, and traditional models perform poorly because too few features can be learned.
In this work, a fault prediction method is proposed for power transformers, which converts the classification problem for power transformers into a regression problem. Our contributions can be summarized as follows: (1) Missing-value imputation and outlier detection in the data preprocessing step guarantee completeness and continuity of the gas chromatography data, which markedly improves data quality. (2) The MSTCN proposed for the regression step can learn features from data below the fault threshold, which avoids the overfitting that comes with small-sample learning. (3) On real-world data, our work shows convincing benefits and has been adopted in a practical business project.
The rest of this work is organized as follows: Section 2 discusses related work. Section 3 presents the research background, including motivation and methodology, as well as the transformer fault diagnosis method known as the three-ratio rule. Section 4 elaborates the transformer fault prediction method based on the MSTCN model. Section 5 evaluates the effects in extensive experiments. Section 6 summarizes the conclusions.

Related Work
Power transformer fault prediction is significant nowadays, but its discovery still faces challenges in efficiency and accuracy. Many works have adopted deep learning techniques in specific domains [4,5]. We categorize related work into two technical perspectives: one is traditional algorithms through machine learning methods, and the other is deep learning methods, including recurrent neural networks (RNNs) and Temporal Convolutional Networks (TCNs).

Machine Learning Method.
Machine learning methods can learn the fault occurrence pattern and then predict possible faults. The study in [6] compared and analyzed the Multilayer Perceptron (MLP), radial basis function (RBF), fuzzy logic, and support vector machine (SVM) for fault prediction of power transformers. However, their parameters are mainly selected empirically, which limits the efficiency of modeling. The F1-score of these methods does not exceed 90% when evaluated on our data set.
Machine learning methods combined with DGA for transformer fault prediction have achieved many results. Dukarm [7] shows how fuzzy logic and neural networks are used to automate standard DGA methods. Furthermore, Wang et al. [8] developed a combined artificial neural network and expert system tool (ANNEPS) for transformer fault diagnosis using dissolved gas-in-oil analysis. Huang et al. [9] introduced an evolutionary programming (EP) based fuzzy logic technique to identify the incipient faults of power transformers. Yang et al. [10] employed bootstrap and genetic programming to improve the interpretation accuracy for DGA of power transformers. Hellmann [11] applied fuzzy logic (FL), which allows intermediate values to be defined between conventional evaluations such as true/false, yes/no, and high/low. Souahlia et al. [12] applied a support vector machine (SVM) based decision method to power transformer fault diagnosis.
However, these works share common problems: the amount of data is too small, the data types are few, and the classification rules are simple. For example, the fault categories are shallow, covering only overheating, discharging, and overheating with discharging. Advanced transformer fault prediction that makes full use of DGA data at Big Data scale is required.

Deep Learning Method.
In recent years, deep learning networks combined with DGA have further improved the accuracy of transformer fault prediction. Recurrent neural networks (RNNs) have been widely adopted in research areas concerned with sequential data, such as text, audio, and video [13]. Among RNN methods, Long Short-Term Memory (LSTM) [14] and Gated Recurrent Units (GRU) [15] in particular are excellent at exploiting the time-varying features of time series data. Although the gradient problem of RNNs has been solved to some extent in LSTM and GRU, it remains tricky for longer sequences [13].
Bai et al. [16] proposed the Temporal Convolutional Network (TCN), a deep learning model for sequence modeling tasks. TCN combines ideas from the convolutional neural network (CNN) and the recurrent neural network for processing time series data. Almqvist [17] compared the performance of RNN and TCN for time series forecasting. Instead of using a cell state to preserve information from previous outputs as in LSTMs, TCNs use connections between previous hidden layers configured with two hyperparameters: dilation factor and filter size. Zhang et al. [18] proposed a multiscale temporal convolutional network for fault prediction. They extracted multiscale time-frequency information with the discrete wavelet transform, and the data at each scale are handled by a separate TCN. Zhang et al. [19] presented an attention mechanism enhanced Temporal Convolutional Network for fault prediction. They utilized an attention mechanism to make the TCN-based fault prediction model focus on the more essential input variables and thereby enhance fault prediction performance. Zai et al. [20] put forward a predictive method for dissolved gas content in transformer oil based on a Temporal Convolutional Network (TCN) and a graph convolutional network (GCN). They designed a GCN to analyze the correlations among all gases and then established a topological graph of these correlations.
However, these models did not solve the problems caused by the rectified linear unit (ReLU) and weight normalization layers. The ReLU activation function applied in TCN underutilizes negative values, which leads to vanishing gradients. Meanwhile, the weight normalization applied in TCN is sensitive to initial values, which leads to overfitting. In addition, these models used binary classification for fault prediction, which might not thoroughly learn the information below the threshold and makes no use of historical trends.
Inspired by the works in [21,22], we propose a Mish-SN Temporal Convolutional Network (MSTCN) for dissolved gas regression to predict transformer faults. We apply the Mish activation function and switchable normalization in MSTCN to solve the problems caused by ReLU and weight normalization. Meanwhile, the dissolved gas regression can explore the numerical fluctuations below the threshold value and learn historical fault feature patterns.

Motivation.
Our work originated from a practical project of China State Grid. This work uses the dissolved gas chromatography data set provided by China State Grid for experimentation. The data set comes from the gas chromatography online monitoring equipment of the power grid, which is based on an integrated, high-speed two-way communication network [23]. The data set covers roughly 600 transformers. With the explosive growth in Internet of Things (IoT) devices, applications have also substantially expanded in recent years [24]. Some of the data form relatively long time series, containing more than 60 months of monitoring data, while others are short, with only three or four months of monitoring data. In addition, each data item in the dissolved gas chromatography data is a multidimensional vector rather than a single number, as in some stock market and house price analysis data sets. The main fields in the dissolved gas chromatography data set of transformer oil are shown in Table 1. The data are all collected and measured automatically through the gas chromatography online monitoring technology.
Definition 1. Status code. In this work, a status code identifies the possible fault category of each sample. The status code is used as the classification label for the later transformer fault classification. The possible status codes are summarized in Table 2.
The appearance of dissolved gas in the transformer oil indicates transformer faults. Gas formation arises from three conditions: overheating, discharge, and moisture. The amount of gas in the transformer oil can be measured frequently by technical means to keep track of the operating health of the transformer. If any of the gases tends to exceed a notice value, the gas production rate should be observed. However, if all the gases are below the notice value, the transformer is considered to be working properly. Based on the recommendations of the data provider, the notice values of our data set are shown in Table 3.

The Three-Ratio Rule.
We apply the three-ratio rule to convert the dissolved gas regression results to the status code in our proposed method. The three-ratio rule is proposed by the National Energy Administration of China [25]. By studying the trend of the dissolved gas amount in transformer oil, the status of the transformer can be determined from the gas chromatography combined with the three-ratio rule. The conversion of the three-ratio rule is shown in Tables 2 and 4. Table 4 shows the ratio code of two gases. For example, if the ratio of C2H2 to C2H4 is 0.2, the ratio code for C2H2/C2H4 is 1. Similarly, the other ratio codes for CH4/H2 and C2H4/C2H6 can be calculated. Table 2 shows the three-ratio codes and their corresponding faults. For example, if the ratio codes of C2H2/C2H4, CH4/H2, and C2H4/C2H6 are 1, 1, and 2, that is, the three-ratio code 112, the corresponding type of fault is low energy discharge, with the status code 6.

Overview.
The fault prediction method based on dissolved gas regression proposed in this work is shown in Figure 1. Our method is divided into three steps. The first step is data preprocessing. Inspired by the work of Ding et al. [26], we convert data from different sources in the dissolved gas chromatography data set into a uniform format and resolve problems such as missing values and outliers in the data as far as possible. The second step is to predict gas amounts using a deep learning model. We apply MSTCN to dissolved gas regression to obtain future gas amounts. The third step is fault classification. The predicted transformer status code is calculated from the regression results of the second step and the three-ratio rule mentioned above.
Existing transformer fault prediction methods usually apply machine learning models or statistical tools directly to predict transformer faults. Instead of directly using deep learning models to predict transformer faults, we add a gas regression step between data preprocessing and fault classification. The usual fault prediction uses binary labels for classification, and the fault class is judged against the threshold value, ignoring the fluctuation of the values below the threshold. The final prediction model might therefore not learn the fluctuation information below the threshold.

Data Preprocessing.
Gas chromatography online monitoring of power transformers suffers from problems such as network instability and server performance bottlenecks when processing extensive data. We mainly address the missing data and outliers that exist in the dissolved gas chromatography data set.
Definition 2. Missing data. The missing data types include negative numbers, not a number (NaN), and null. Let $X \in \mathbb{R}^{n \times m}$ be a feature matrix consisting of n data points and m features of dissolved gases. The t-th data point is denoted as $x_t$, and the j-th feature value of $x_t$ is denoted as $x_{tj}$.
The definition of outlier points, which combines the characteristics of the data set with the 3σ-rule, is as follows. Let $\{\ldots, x_{t+k}\}$ be defined as the set of the k-nearest neighbors of $x_t$. Each data point of X is recorded at a specific time point $t \in \mathbb{Z}^+$ and consists of m observations, denoted as $x_t = (x_{t1}, \ldots, x_{tm})$. Each dimension j of the m-dimensional vector at data point t is denoted as $x_{tj}$, its expected value as $\hat{x}_{tj}$, the Euclidean distance between two data points as d, and the highest distance threshold between a true data point and its expected data point as 3σ. A value $x_{tj}$ is an outlier when

$$d(x_{tj}, \hat{x}_{tj}) > 3\sigma. \quad (1)$$

With the definitions above, the missing data and outlier problems are explicitly defined and can be handled. Properly imputed data and corrected outliers lower the regression errors and thereby improve fault prediction effectiveness.
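The missing-data markers of Definition 2 (negative numbers, NaN, and null) can be flagged with a small helper before any imputation is attempted. The following is a minimal sketch in Python; the function names `is_missing` and `missing_mask` are hypothetical and not part of the original method, and null is assumed to be represented by Python's `None`.

```python
import math


def is_missing(value):
    """Return True if a gas reading counts as missing under Definition 2."""
    if value is None:                      # null (assumed to be encoded as None)
        return True
    if isinstance(value, float) and math.isnan(value):
        return True                        # not a number
    return value < 0                       # a negative gas amount is not physical


def missing_mask(record):
    """Flag every feature of one record x_t = (x_t1, ..., x_tm)."""
    return [is_missing(v) for v in record]


# One record with a valid value, a negative value, NaN, null, and a valid zero.
print(missing_mask([1.2, -0.5, float("nan"), None, 0.0]))
```

The resulting mask marks the positions that an imputation step would then have to fill.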

Security and Communication Networks
Missing Data Imputation. For the missing data mentioned earlier, considering the characteristics of gas amount data, we adopt a modification of the EM algorithm proposed by Junger [27]. The algorithm comprises the following steps: (i) replace the missing values by estimates; (ii) estimate the parameters μ and σ; (iii) estimate the level for each of the univariate pieces of data; (iv) re-estimate the missing values using the updated estimates of the parameters and the level of the data. These steps are iterated until a convergence criterion is reached. Let $x_t$ be the t-th data point of the m-feature matrix X. After iteration k + 1, the revised maximum likelihood estimates are obtained.
Outliers are handled with the notation introduced above: the set of the k-nearest neighbors of $x_t$, the observations $x_t = (x_{t1}, \ldots, x_{tm})$ recorded at time point $t \in \mathbb{Z}^+$, the expected value $\hat{x}_{tj}$ of $x_{tj}$, and the highest distance threshold 3σ between a true data point and its expected data point. In the outlier equation (1), the expected value $\hat{x}_{tj}$ and the highest distance threshold 3σ are obtained from the k-nearest neighbors of $x_t$. The expected value $\hat{x}_{tj}$ is also the corrected value of the outlier $x_t$.
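The outlier correction described above can be illustrated for a single gas series. The sketch below is written under stated assumptions, because the exact equations for the expected value and for σ are not reproduced in this text: the expected value is taken as the mean of up to k readings on each side of the point, and σ as the population standard deviation of those same neighbors. The function name `correct_outliers` is hypothetical.

```python
import statistics


def correct_outliers(series, k=3):
    """Replace 3-sigma outliers of one gas series by their expected value.

    ASSUMPTION: expected value = mean of up to k neighbours on each side,
    sigma = population standard deviation of those neighbours.
    """
    corrected = list(series)
    for t, value in enumerate(series):
        # Neighbours are always read from the original series, never from
        # already corrected values, and exclude the point itself.
        neighbours = series[max(0, t - k):t] + series[t + 1:t + 1 + k]
        if len(neighbours) < 2:
            continue
        expected = statistics.fmean(neighbours)
        sigma = statistics.pstdev(neighbours)
        if abs(value - expected) > 3 * sigma:
            corrected[t] = expected        # the expected value is the correction
    return corrected


readings = [10.0, 10.2, 9.9, 10.1, 50.0, 10.0, 9.8, 10.1, 10.2]
print(correct_outliers(readings))
```

In this example only the spike of 50.0 is replaced; its neighbors stay untouched because the spike inflates their local σ. With perfectly constant neighbors σ is zero, so any deviation is flagged, which may be too strict for real data.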

Dissolved Gas Regression.
In order to fully explore the numerical fluctuations below the threshold value and learn historical fault feature patterns, we propose a regression model called the Mish-SN Temporal Convolutional Network (MSTCN).
Moreover, once the predicted dissolved gas values are obtained from the prediction model, converting them to predicted status codes requires very few calculations. Figure 2 shows the complete MSTCN, formed by stacking h residual blocks. The input is denoted as $X = (x_1, \ldots, x_t)$. In this work, we use a technique common in RNN modeling, the time step, to improve predictive accuracy. The time step length is denoted as L, and the number of features is denoted as m. Let $v_t = (x_{t-L+1}, \ldots, x_t) \in \mathbb{R}^{L \times m}$ be defined as a new data point. For any $v_t$, its gas regression label is denoted as $y_t = (x_{t+1}, \ldots, x_{t+L})$. Therefore, the regression result of $v_t$ is denoted as $\hat{y}_t = (\hat{x}_{t+1}, \ldots, \hat{x}_{t+L})$. The convolution result of the h-th residual block layer is denoted as $V^{(h)}$. However, to solve the real-world problem, we are only interested in its last element: $v_t^{(h)} = (\hat{x}_{t+1}, \ldots, \hat{x}_{t+L})$ represents the L regression points for the input $X = (x_1, \ldots, x_t)$. The output Z of the MSTCN regression is therefore

$$Z = v_t^{(h)} = (\hat{x}_{t+1}, \ldots, \hat{x}_{t+L}).$$

Table 2: Three-ratio codes with their corresponding faults and status codes.
Three-ratio code | Fault type | Status code
… | Low temperature overheating (below 150°C) | 1
020 | Low temperature overheating (150∼300°C) | 2
021 | Medium temperature overheating (300∼700°C) | 3
0*2 | High temperature overheating (above 700°C) | 4
010 | Partial discharge | 5
10*, 11* | Low energy discharge | 6
12* | Low energy discharge and overheating | 7
20*, 21* | Arc discharge | 8
22* | Arc discharge and overheating | 9
For simplicity, * denotes any of 0, 1, and 2.

Residual Block. To address the vanishing gradient problem in a deep convolutional network, MSTCN applies the well-known residual block technique, shown in Figure 3. Residual blocks have proven effective for training deep networks because they enable the network to transmit information across layers.
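The time-step construction of $v_t$ and its label $y_t$ described in this subsection can be sketched as a sliding window over one transformer's record sequence. This is an illustrative sketch only; the function name `make_windows` and the list-of-lists data layout are assumptions, not the authors' implementation.

```python
def make_windows(series, L):
    """Build (v_t, y_t) pairs from one transformer's record sequence.

    series: list of records, each a list of m gas values.
    v_t = (x_{t-L+1}, ..., x_t) is the input window of length L,
    y_t = (x_{t+1}, ..., x_{t+L}) is the regression label of length L.
    """
    inputs, labels = [], []
    # t is the 0-based index of the last record inside the input window;
    # it stops early enough that L future records remain for the label.
    for t in range(L - 1, len(series) - L):
        inputs.append(series[t - L + 1:t + 1])
        labels.append(series[t + 1:t + 1 + L])
    return inputs, labels


# Eight single-feature records and a time step length of 3.
records = [[float(i)] for i in range(1, 9)]
windows, targets = make_windows(records, L=3)
print(len(windows), windows[0], targets[0])
```

Each input window and its label are disjoint in time, so the model never sees the values it is asked to predict.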
In Figure 3, the upper branch of the residual block is the dilated causal convolution H(·) applied to the block input $V^{(h-1)}$. The lower branch is the skip connection added to solve the vanishing gradient problem. In this work, we replace weight normalization with switchable normalization. Switchable normalization is self-learning: it lets the MSTCN decide which normalizer to use to obtain the best prediction effect. The MSTCN also introduces the Mish activation function to replace ReLU, which solves the dead ReLU problem, makes the activation function smooth and differentiable at 0, and improves the generalization of the model. Let δ(·) be the activation layer. The output $V^{(h)}$ is obtained by applying δ(·) to the sum of the two branches:

$$V^{(h)} = \delta\left(H\left(V^{(h-1)}\right) + V^{(h-1)}\right).$$

Dilated Causal Convolution. Figure 4 presents the structure of the dilated causal convolution stack from a residual block with filter size k = 2 and dilation factor d = 3. In Figure 4, the other layers and the skip connection are omitted. The dilated causal convolution takes the block input and produces the output of the upper branch. To enlarge the receptive field without deepening the structure, the MSTCN introduces dilated convolution.
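The two building blocks named in this subsection, the Mish activation and the dilated causal convolution, can be illustrated on a one-dimensional sequence. The sketch below is a simplified, dependency-free illustration rather than the MSTCN implementation: it uses a single channel, fixed example weights, and no switchable normalization, and the function names are hypothetical. Mish is taken in its usual form, x · tanh(ln(1 + e^x)).

```python
import math


def mish(x):
    """Mish activation: x * tanh(softplus(x)), smooth and defined for x < 0."""
    # For large x, softplus(x) is numerically equal to x; this avoids overflow.
    softplus = x if x > 20.0 else math.log1p(math.exp(x))
    return x * math.tanh(softplus)


def dilated_causal_conv(x, weights, dilation):
    """y[t] = sum_i weights[i] * x[t - i * dilation], with zero padding.

    Only current and past inputs contribute to y[t], which makes it causal.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(weights):
            j = t - i * dilation
            if j >= 0:
                acc += w * x[j]
        out.append(acc)
    return out


def residual_block(x, weights, dilation):
    """Simplified block: Mish applied to (convolution branch + skip branch)."""
    h = dilated_causal_conv(x, weights, dilation)
    return [mish(h_t + x_t) for h_t, x_t in zip(h, x)]


print(dilated_causal_conv([1, 2, 3, 4, 5], weights=[1, 1], dilation=2))
print(residual_block([0.5, -1.0, 2.0], weights=[0.5, 0.5], dilation=1))
```

With filter size 2 and dilation 2, each output combines the current input with the input two steps earlier, so stacking layers with growing dilation widens the receptive field without adding depth per step.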

Fault Classification.
The guidelines [25] stipulate that, in oil chromatographic analysis, if the content of any gas tends to increase or exceeds its notice value, the gas production rate should be observed. Based on the three-ratio rule of gas chromatography in Table 4, it can then be preliminarily judged whether there is an overheating fault or a discharging fault.
Let $U = (u_{t+1}, \ldots, u_{t+L}) \in \mathbb{R}^{L}$ be defined as the status codes. Let $Z = (x_{t+1}, \ldots, x_{t+L}) \in \mathbb{R}^{L \times n_{gas}}$ be defined as a set of L regression results. Let $gas \in \{C_2H_2, C_2H_4, C_2H_6, H_2, CH_4\}$ and $(r_1, r_2, r_3) \in \mathbb{Z}^3$. Let $n_{gas}$ denote the number of features of $x_t$, and let $x_{t,gas} \in \mathbb{R}$ denote the regression value of a dissolved gas. The fault classification algorithm is defined in Algorithm 1.
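Algorithm 1 can be sketched in Python as a table lookup. Several parts of this sketch are assumptions: Table 4 is not reproduced in the text, so the band edges and per-band ratio codes below are values commonly quoted for the three-ratio rule and should be replaced by the entries of Table 4; the status-code patterns follow the rows of Table 2 that are legible in the text, and the row whose three-ratio code is not given is left out. The names `ratio_code`, `status_code`, and `classify` are hypothetical.

```python
import bisect

# ASSUMED stand-in for Table 4: band edges and the ratio code of each band
# (bands: below 0.1, 0.1 to below 1, 1 to below 3, 3 and above).
BAND_EDGES = [0.1, 1.0, 3.0]
RATIO_CODES = {
    "C2H2/C2H4": [0, 1, 1, 2],
    "CH4/H2": [1, 0, 2, 2],
    "C2H4/C2H6": [0, 0, 1, 2],
}

# Table 2 rows recoverable from the text; "*" matches 0, 1, or 2.
STATUS_PATTERNS = [
    ("020", 2), ("021", 3), ("0*2", 4), ("010", 5),
    ("10*", 6), ("11*", 6), ("12*", 7),
    ("20*", 8), ("21*", 8), ("22*", 9),
]


def ratio_code(name, numerator, denominator):
    """Convert one gas ratio to its ratio code; None if the ratio is undefined."""
    if denominator <= 0:
        return None
    band = bisect.bisect_right(BAND_EDGES, numerator / denominator)
    return RATIO_CODES[name][band]


def status_code(gas):
    """Map one predicted record (dict of gas amounts) to a status code."""
    codes = [
        ratio_code("C2H2/C2H4", gas["C2H2"], gas["C2H4"]),
        ratio_code("CH4/H2", gas["CH4"], gas["H2"]),
        ratio_code("C2H4/C2H6", gas["C2H4"], gas["C2H6"]),
    ]
    if None in codes:
        return None
    key = "".join(str(c) for c in codes)
    for pattern, status in STATUS_PATTERNS:
        if all(p in ("*", c) for p, c in zip(pattern, key)):
            return status
    return None  # no matching row among the patterns listed here


def classify(Z):
    """Apply the lookup to every regression result in Z."""
    return [status_code(record) for record in Z]


# Ratios 0.2, 0.08, and about 3.3 give the three-ratio code 112.
sample = {"C2H2": 2.0, "C2H4": 10.0, "CH4": 4.0, "H2": 50.0, "C2H6": 3.0}
print(classify([sample]))
```

Under these assumed bands, the sample reproduces the worked example of the three-ratio rule: code 112, low energy discharge, status code 6. Records with a zero denominator, or with a code that matches none of the listed rows, return `None` here and would need a policy of their own, for example a check against the notice values.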

Setting.
The experiments in this work run on a server with the CentOS 7 operating system, an Intel Core i7-6700 CPU, 16 GB RAM, and 1 TB storage. The experiments are written in Python 3.9.6, using JupyterLab 3.1.11, TensorFlow 2.5.0, and Matplotlib 3.3.4. The data set was collected from oil-immersed power transformers in different substations in China, with 200,000 records covering the period from 2012 to 2017. The data set contains 7 fault-related gases (H2, C2H2, C2H4, C2H6, CH4, CO, CO2), the time of collection, other gases (N2, O2), substation and transformer information, and so forth. The distribution of the 7 fault-related gases is shown in Figure 5. The horizontal axis is the date, and the vertical axis is the amount of gas. The blue curve indicates no fault on the corresponding date. The orange curve indicates a fault status on the corresponding date because the CO2 amount has reached its notice value.
We selected a subset of 100 transformers with about 170,000 records from the data set, divided by transformer into 80 training, 10 validation, and 10 test sequences. The time range of the subset is from November 2012 to September 2017. The subset was selected because it has high data integrity and few missing values. In every transformer sequence of this data set, each record has the attributes of collection date, 7 different dissolved gas values, and the status code label according to the three-ratio rule, as shown in Table 1.

Algorithm 1: Fault classification algorithm.
Input: Regression result Z = (x_{t+1}, ..., x_{t+L})
Output: Status code U = (u_{t+1}, ..., u_{t+L})
(1) for i = t + 1 to t + L
(2)     compute the three ratios: r1 = x_{i,C2H2} / x_{i,C2H4}, r2 = x_{i,CH4} / x_{i,H2}, r3 = x_{i,C2H4} / x_{i,C2H6}
(3)     look up Table 4 to convert each gas ratio to its ratio code
(4)     combine the three ratio codes into r = 100 · r1 + 10 · r2 + r3 and look up Table 2 to get the status code u_i
(5) end for
(6) return U

Experiment
The parameters of MSTCN are listed in Table 5. For the sake of experimental rigor, the TCN, LSTM, and GRU methods use roughly the same parameters as MSTCN. This work applies the root mean square error (RMSE) as the loss function, shown in equation (6), and Adam as the optimization algorithm. Here m represents the total number of records, $y_i$ represents the actual gas amount of record i, and $\hat{y}_i$ represents the predicted gas amount.
To measure the predictive performance of the models, RMSE, MAE, MAPE, and $R^2$ are used as metrics. Their calculation formulas are shown in equation (6), where m represents the total number of records, $y_i$ represents the actual gas amount of record i, $\bar{y}$ represents the average of the actual gas amounts, and $\hat{y}_i$ represents the predicted gas amount. The minimum of RMSE, MAE, and MAPE is 0, and the closer the metric is to 0, the better the predictive effect is. The maximum of $R^2$ is 1; the closer to 1 the better. Figure 6 shows the actual gas amount curves and the regression curves predicted by the different models, including MSTCN, TCN, LSTM, and GRU. It can be seen from Figure 6 that the fitted curve of MSTCN is more accurate than the curves of the other models. Although LSTM performed better in the predictions for (g) H2, the overall MSTCN error is smaller than that of the other models.
We calculated the above metrics for MSTCN, TCN, LSTM, and GRU. The results are listed in Table 6, which compares the gas regression effect of the MSTCN model with that of the other deep learning models (TCN, LSTM, and GRU). RMSE, MAE, and MAPE measure the error between the true value and the predicted value of the data; $R^2$ also measures the difference between the true value and the predicted value and standardizes this difference to [0, 1]. For the predictions of C2H4, C2H6, CH4, CO, CO2, and H2, MSTCN achieved relatively good evaluation indices. Although TCN has a smaller prediction error (RMSE) for C2H2, MSTCN is overall significantly better than the other models.
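The four regression metrics used in this experiment can be computed as follows. The sketch uses the standard textbook definitions of RMSE, MAE, MAPE, and $R^2$, which is an assumption insofar as equation (6) is not reproduced in this text; MAPE is returned as a fraction rather than a percentage, and the function name `regression_metrics` is hypothetical.

```python
import math


def regression_metrics(actual, predicted):
    """Return RMSE, MAE, MAPE (as a fraction), and R^2 for one gas."""
    m = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / m
    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / m),
        "mae": sum(abs(e) for e in errors) / m,
        # MAPE is undefined when an actual value is zero.
        "mape": sum(abs(e / a) for e, a in zip(errors, actual)) / m,
        "r2": 1.0 - ss_res / ss_tot,
    }


print(regression_metrics([2.0, 4.0, 6.0, 8.0], [2.5, 3.5, 6.5, 7.5]))
```

Because MAPE divides by the actual value, gas readings of exactly zero need to be excluded or handled separately before this metric is applied.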
Experiment 2. Fault Classification. To further verify the superiority of the transformer fault prediction method proposed in this work, this experiment uses the regression values of the previous experiment as input and converts the predicted gas amounts to status codes according to the three-ratio rule. The control group uses the TCN, LSTM, and GRU models with the actual gas amounts as input to classify the fault of the transformer directly. To measure the accuracy of fault classification under the different models, following the confusion matrix, this experiment denotes faulty status as positive (P) and normal status as negative (N). Therefore, correct fault classifications are denoted as true positive (TP) and true negative (TN), and incorrect predictions as false positive (FP) and false negative (FN) [30].
This experiment introduces three metrics to measure the model's accuracy on the test set. Precision, recall, and F1-score are expressed in equations (7)-(9). The F1-score is the harmonic mean of precision and recall and, like them, takes values between 0 and 1. The closer the three metrics are to 1, the better the predictive effect of the model.
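Precision, recall, and F1-score follow directly from the confusion-matrix counts introduced above. The sketch below uses their standard definitions, with faulty status as the positive class; the function name `precision_recall_f1` and the boolean encoding of faulty versus normal are assumptions made for illustration.

```python
def precision_recall_f1(actual, predicted):
    """Compute precision, recall, and F1 with faulty status as positive.

    actual, predicted: sequences of booleans (True = faulty, False = normal).
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # faults correctly predicted
    fp = sum(1 for a, p in pairs if not a and p)      # false alarms
    fn = sum(1 for a, p in pairs if a and not p)      # missed faults
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


actual = [True, True, True, False, False, True, False, False]
predicted = [True, True, False, False, True, True, False, False]
print(precision_recall_f1(actual, predicted))
```

Here three faults are found, one is missed, and one false alarm is raised, so all three metrics equal 0.75. The zero-denominator guards return 0.0 by convention when a sequence contains no positive predictions or no actual faults.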
The evaluation metrics for the accuracy of fault classification under the different models are shown in Table 7, which compares the transformer fault classification results of the MSTCN model with those of the other deep learning models (TCN, LSTM, and GRU). The first column indicates the transformers participating in the experiment. Each transformer is an independent and complete experiment. The second column presents the evaluation metrics. The following columns compare the evaluation metrics of the four models. Overall, the prediction results of each model are satisfactory. This is caused by the three-ratio algorithm and the gas notice value. For example, although the model has a deviation between the predicted gas value and the true value, the value is still in the same ratio range, or the notice value is not reached at all, so the final predicted fault state does not change easily, resulting in an excellent overall prediction effect. The difference in fault prediction effect between transformers is larger than the difference between models; that is, the prediction effect is affected more by the data set than by the choice of model. Between models, the difference in fault prediction effect is relatively small. Overall, MSTCN scores slightly higher than the other models.
The transformer fault prediction method proposed in this work (Figure 2) can reduce or eliminate the loss of classification accuracy caused by threshold-based binary classification. Because the proposed method is based on the dissolved gas regression value, it can use the data below the threshold and make better use of historical fault data. At the same time, the classification step does not introduce additional errors because it uses the same judgment criteria as the existing fault diagnosis methods.

Conclusions
In this work, we propose a power transformer fault prediction method based on dissolved gas regression, which converts the transformer fault prediction problem into a regression problem for the dissolved gas amount. First, through data preprocessing, we overcome the difficulties in directly using raw data. Second, through dissolved gas regression, we learn from the data below the threshold more efficiently than binary classification does and avoid the small-sample learning caused by a large amount of preventive maintenance. Compared with the traditional fault prediction model based on binary classification, the fault prediction method based on gas amount prediction achieves better results, with an F1-score above 0.9741. This novel method provides new insights for power transformer fault prediction.
In summary, the fault prediction method based on dissolved gas regression using MSTCN has excellent potential. In future work, we will continue to research this concept and shorten the training time with more advanced deep learning techniques. In addition to the fault prediction method proposed, we plan to tune the procedure to simplify the method.

Data Availability
The oil chromatography data used to support the findings of this study were supplied by China State Grid under license and so cannot be made freely available. Requests for access to these data should be made to the corresponding author for an application of joint research.

Conflicts of Interest
The authors declare that they have no conflicts of interest.