Automatic Detection of Power Quality Disturbance Using Convolutional Neural Network Structure with Gated Recurrent Unit

Department of Electrical Electronics Engineering, Uludağ University, Bursa, Turkey
Department of Electrical and Electronics Engineering, Konya Technical University, Konya, Turkey
Department of Electrical and Electronics Engineering, Amasya University, Amasya, Turkey
School of Engineering and Applied Sciences, Bennett University, Greater Noida, India
RISC Lab (LR16ES07), National Engineering School of Tunis, University of Tunis El Manar, Tunis, Tunisia
Higher Institute of Information and Communication Technologies, University of Carthage, Tunis, Tunisia


Introduction
Due to rapid technological development and easier access to products, many technological devices have come into widespread use. These devices consume energy, mostly in the form of electrical energy. All of these distributed devices connected to the grid can cause power quality disturbances (PQDs) [1]. The leading causes of PQDs include nonlinear loads, flexible AC transmission devices, variable frequency drives, arc devices, and converters used in power electronics [2]. Ideally, grid voltages and currents should be clean sinusoids. If disturbing components enter the network, power losses and various disturbances occur. In this case, almost all electronic devices, from industrial equipment to household appliances, are adversely affected, and energy providers suffer as well [3]. Understanding the causes of these disturbances makes it possible to act on them, and classifying the problems that arise is one of the most effective ways to do so [4].
Pattern recognition (PR) applications are of critical importance for PQD detection. Machine learning (ML) methods and AI applications, which have recently become widespread, have made this process easier [5]. Some studies in the literature divide the PQD classification process into three parts: feature extraction, feature selection, and classifier design [6]. In fact, these three stages are interconnected. In the feature extraction stage, problem-specific characteristics are obtained. Although more effective solutions are available for this stage today, in the past the feature extraction step depended heavily on the expert's experience and statistical skill. Uyar et al. [7] proposed a wavelet entropy-based feature extraction approach for PQD classification. Jayasree et al. [8] presented a PQD classification framework consisting of two steps: in the first step, an envelope detector using the Hilbert transform is applied; in the second, an artificial neural network classifies the information from the first step. Reaz et al. [9] used discrete wavelet transforms as a feature extractor. Fengzhan and Rengang [10] used the S-transform as a feature extractor, and Janik and Lobos [11] used radial basis function (RBF) networks. Lopez-Ramirez et al. [12] used empirical mode decomposition (EMD) for the classification of PQD data. Most of these methods are tailored to the properties of a specific dataset; when they are applied to another dataset, their performance decreases. In addition, the dimensions of some features are not suitable for classifiers, or classification takes a long time.

For these reasons, some researchers use feature selection methods. After the feature selection procedure, a low-dimensional representation of the problem is obtained, while the representational power of these less parameterized features is largely preserved. Lee and Shen [13] proposed a feature selection algorithm named probabilistic neural network-based feature selection (PFS) for PQD data. Panigrahi and Pandi [14] used a genetic algorithm for feature vector selection to increase PQD classification performance. Singh and Singh [15] used the ant colony optimization technique to select optimal features. Huang et al. [16] presented a feature selection framework for PQD; their framework includes an entropy-importance (EnI)-based random forest (RF) model for the selection process. Feature selection algorithms contribute positively to classification success and reduce computation time, but these methods are computationally complex and model-sensitive.

The purpose of feature extraction and selection in the PQD classification process is to increase classification performance. Binary and multiclass classifiers are used together with the aforementioned hand-crafted features and feature selection methods. The hidden Markov model (HMM) [17], decision trees [18], rule-based systems [19], the support vector machine (SVM) [11], the probabilistic neural network [20], ANNs [21], independent component analysis (ICA) [22], and the K-nearest neighbor classifier (kNN) [23] are the most used algorithms for PQD classification. Studies with these classifiers continued intensively until the deep learning approach became popular. Deep learning methods produce striking results in almost all AI problems, and, inspired by this success, researchers have begun applying deep learning approaches to PQD classification.
CNN, the most popular deep learning technique, is particularly effective in analyzing two-dimensional matrices and images. The robust feature representation power of the CNN architecture on 2D matrices comes from the 2D kernels in the convolution layers, in addition to other factors such as the remaining layers of the architecture and the activation functions [24]. Among the current PQD classification studies in the literature, those involving CNN are briefly reviewed below. Wang and Chen [25] proposed a closed-loop deep learning method to classify PQD data. Liu et al. [26] used a deep CNN and SVM together to classify PQDs. Cai et al. [27] combined the Wigner-Ville distribution (WVD) with a CNN for the PQD dataset. Deng et al. [28] proposed a sequence-to-sequence deep learning model with a bidirectional GRU for PQD classification. Shen et al. [29] proposed an improved principal component analysis-guided 1D-CNN for PQDs. Rodriguez et al. [30] presented a convolutional auto-encoder compression framework and a stacked long short-term memory (LSTM). Subudhi and Dash [31] proposed the grey wolf optimization (GWO)-based extreme learning machine (ELM) algorithm to classify PQD signals with limited data. Bashawyah and Subasi [32] classified five PQD signal types with different machine learning algorithms. Biswal and Dash [33] used the fast dyadic S-transform algorithm with a fuzzy decision tree for power quality disturbances. Khokhar et al. [34] proposed an optimal feature selection algorithm to classify PQDs. Li et al. [35] detected and classified PQDs by using DAG-SVMs with a double-resolution S-transform. CNN-based studies have superior performance on real samples as well as artificial samples. However, CNN algorithms are not suitable for datasets that do not contain enough samples [36-38].
In this work, an efficient CNN architecture [39-41] is presented to classify single and composite PQD problems. For this purpose, the dataset is designed to prevent the CNN architecture from falling into the overfitting problem; as the literature on this subject shows, a relatively large and sufficient dataset is essential for training a CNN algorithm. As the second stage, a high-performance CNN architecture designed for PQD sample classification is presented. The proposed architecture is a GRU-supported linear CNN, and the proposed CNN-GRU combination improves performance compared with its counterparts in the literature. The fully connected layers (FCLs) of this linear CNN architecture have been removed and replaced with GRU layers. The contributions of the proposed method can be summarized as follows:

(i) Thanks to the proposed GRU-supported CNN architecture, high classification performance is obtained for PQD datasets containing fewer samples.
(ii) It is faster because it contains fewer parameters than CNN architectures with FCLs.
(iii) It is suitable for end-to-end training.
(iv) It performs more effectively than current state-of-the-art methods.

The rest of this paper is organized as follows: Section 2 includes background information on the related algorithms and the proposed method. Section 3 provides experimental details and results. Finally, the conclusion is presented in Section 4.

2.1. The Background in the Deep Learning Models. The concept of deep learning represents a relatively deep understanding of knowledge; in other words, it aims to be as knowledgeable as an expert in a problem. This is made possible by deeper networks, an increase in training samples, and some mathematical interventions in the architecture.
The most widely used deep learning architecture today is CNN [42]. CNN is ideal for analyzing one-, two-, and three-dimensional matrices or vectors; this section focuses on the 2D-CNN. The best way to understand the CNN architecture is to understand its layers, so the operation of the CNN layers is covered in this section.
The layer most associated with the CNN architecture is the convolution layer. A CNN architecture contains many convolutional layers with various kernel sizes. These convolution layers learn the features of the problem and are applied to the image by a convolution operation. One of the most significant advantages of sliding a convolution kernel over the image is parameter sharing, which reduces the total number of parameters in the network. Another layer in a basic CNN architecture is the pooling layer, usually placed after the convolution layer or after the rectified linear unit (ReLU) layer. The most important task of the pooling layer is downsampling: it reduces the total number of parameters while preserving the essential features in the matrix. There are variants such as max-pooling, sum-pooling, and average-pooling; max-pooling is used in this study. The other essential layer of a basic CNN architecture is ReLU. Generally, a ReLU follows almost every convolution layer; its task is to break the linear structure of the network, and the traditional ReLU pulls negative parameters to zero. The FCL is a kind of artificial neural network structure: it consists of many neurons, and all nodes are interconnected. This part is generally used for the classification of the extracted features. A CNN architecture consisting of these three basic layers is calculated as in the following equation:

$$l_{\text{next}} = \text{pool}_{n}\big(\sigma(w \ast D_{\text{in}} + b)\big),$$

where $l_{\text{next}}$ represents the input of the next layer, pool is the pooling layer, $n$ is the pooling window, $\sigma$ denotes the ReLU function, $w$ represents the convolution kernel, $D_{\text{in}}$ represents the input, and $b$ is the bias. The softmax function is generally used at the end of classifier networks. It calculates the probability distribution over the $m$ class outputs using the following equation:

$$\text{softmax}(z_{j}) = \frac{e^{z_{j}}}{\sum_{i=1}^{m} e^{z_{i}}}, \qquad j = 1, \ldots, m.$$

In addition to these base layers, many new layers have been proposed and continue to be introduced; however, the layers above are sufficient for a clear understanding of the CNN architecture proposed in this study.
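As a concrete illustration of these two equations, the following minimal PyTorch sketch applies one convolution, ReLU, and max-pooling step to a dummy 2D matrix and then a softmax over class scores; the kernel size, channel counts, input shape, and class count are illustrative assumptions, not values from this paper.

```python
# Minimal sketch of l_next = pool_n(sigma(w * D_in + b)) and the softmax;
# all sizes here are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1)  # w, b
pool = nn.MaxPool2d(kernel_size=2)                                          # pooling window n = 2

D_in = torch.randn(1, 1, 64, 64)        # a dummy two-dimensional input matrix
l_next = pool(F.relu(conv(D_in)))       # convolution -> ReLU (sigma) -> max-pooling

logits = torch.randn(1, 7)              # dummy scores for m = 7 classes
probs = F.softmax(logits, dim=1)        # class probabilities summing to 1
```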

2.2. Gated Recurrent Unit (GRU). The GRU emerged to effectively avoid the gradient explosion and vanishing gradient problems in recurrent neural networks (RNNs). A GRU can be thought of as a simpler version of the long short-term memory (LSTM) [43-45]. Two gates are defined in a GRU unit: the reset gate and the update gate. The reset gate, denoted by r, combines the new input with the previous memory. The update gate, denoted by z, is responsible for preserving the previous memory value. The transition functions of a GRU can be calculated with the following equations:

$$z_{t} = \sigma\left(W_{z} x_{t} + V_{z} h_{t-1} + b_{z}\right),$$
$$r_{t} = \sigma\left(W_{r} x_{t} + V_{r} h_{t-1} + b_{r}\right),$$
$$\tilde{h}_{t} = \tanh\left(W_{h} x_{t} + V_{h}\left(r_{t} \odot h_{t-1}\right) + b_{h}\right),$$
$$h_{t} = \left(1 - z_{t}\right) \odot h_{t-1} + z_{t} \odot \tilde{h}_{t},$$

where $\sigma(\cdot)$ is the logistic sigmoid, ⊙ denotes the element-wise product, $k$ represents the dimensionality of the hidden vectors $h_{t} \in \mathbb{R}^{k}$, and $W$, $V$, and $b$ are the shared parameters.
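The following NumPy sketch implements one GRU transition exactly as in the equations above; the $W$, $V$, $b$ naming follows the text, while the sizes and random weights are placeholders for illustration.

```python
# One GRU step, h_t = (1 - z_t) * h_prev + z_t * h_tilde, in plain NumPy;
# sizes and random weights are placeholders.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W, V, b):
    z = sigmoid(W["z"] @ x_t + V["z"] @ h_prev + b["z"])               # update gate
    r = sigmoid(W["r"] @ x_t + V["r"] @ h_prev + b["r"])               # reset gate
    h_tilde = np.tanh(W["h"] @ x_t + V["h"] @ (r * h_prev) + b["h"])   # candidate memory
    return (1 - z) * h_prev + z * h_tilde                              # new hidden state

k, d = 4, 3                                        # hidden size k, input size d
rng = np.random.default_rng(0)
W = {g: rng.normal(size=(k, d)) for g in "zrh"}    # input weights
V = {g: rng.normal(size=(k, k)) for g in "zrh"}    # recurrent weights
b = {g: np.zeros(k) for g in "zrh"}                # biases
h_t = gru_step(rng.normal(size=d), np.zeros(k), W, V, b)
```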

2.3. The Proposed Method. The proposed method is designed to investigate the effect of GRU layers on the CNN architecture. A performance comparison was made between the proposed model and the Vgg-16, GoogleNet, and ResNet-50 models. The pretrained CNN networks start learning with a transfer learning approach and are trained shallowly with the generated power quality disturbance data. These data, in one-dimensional signal format, are converted into a two-dimensional matrix format by the short-time Fourier transform (STFT); the data obtained through this transformation expose more features than the raw signal data. Feature extraction and classification are carried out automatically by feeding these matrices to our proposed GRU-based CNN network. In training the network, the mini-batch size is ten and the learning rate is 0.0001; dropout is set to 0.2. A high learning rate does not lead to convergence on the problem, while a very low learning rate requires a long training period [42,46,47]. The stochastic gradient descent algorithm is used for parameter optimization of the proposed CNN model [48].
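A brief sketch of the 1D-to-2D conversion described above, using SciPy's STFT on a stand-in sinusoid; the sampling rate and window length here are assumptions, not values reported in this paper.

```python
# Hedged sketch: turn a 1-D signal into the 2-D time-frequency matrix
# that the CNN consumes. fs and nperseg are assumed values.
import numpy as np
from scipy import signal

fs = 3200                                  # assumed sampling rate (Hz)
t = np.arange(0, 0.2, 1 / fs)
x = np.sin(2 * np.pi * 50 * t)             # stand-in 50 Hz sine wave

f, tau, Zxx = signal.stft(x, fs=fs, nperseg=64)
tf_matrix = np.abs(Zxx)                    # magnitude matrix fed to the network
```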
In the proposed CNN architecture, the GRU structure replaces the fully connected layers. Two GRU blocks are included in the proposed architecture: the first GRU block consists of 200 hidden neurons, and the second of 100 hidden neurons. A dropout layer is placed between the GRU blocks to avoid the overfitting problem. The proposed method is shown in Figure 1. The basic unit of the proposed architecture is the structure called a block, which consists of one convolution layer, one batch normalization layer, and one ReLU layer, in that order. There are five block structures in total, with max-pooling layers between the first four. Instead of fully connected layers, the architecture ends with the two GRU layers, and the dropout layer between them helps avoid overfitting.
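The sketch below is a speculative PyTorch rendering of this layout: five (convolution, batch normalization, ReLU) blocks with max pooling between the first four, followed by GRU layers of 200 and 100 hidden units separated by dropout 0.2. The channel counts, kernel sizes, and the pooling bridge from CNN feature maps to the GRU sequence are assumptions, not the paper's exact design.

```python
# Speculative sketch of the block/GRU layout; channel counts, kernel sizes,
# and the feature-to-sequence bridge are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def block(c_in, c_out):
    """One 'block': convolution -> batch normalization -> ReLU."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU())

class CnnGru(nn.Module):
    def __init__(self, n_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            block(1, 16), nn.MaxPool2d(2),
            block(16, 32), nn.MaxPool2d(2),
            block(32, 64), nn.MaxPool2d(2),
            block(64, 64), nn.MaxPool2d(2),
            block(64, 64))                               # five blocks, four pools
        self.gru1 = nn.GRU(64, 200, batch_first=True)    # first GRU block
        self.drop = nn.Dropout(0.2)                      # dropout between GRUs
        self.gru2 = nn.GRU(200, 100, batch_first=True)   # second GRU block
        self.out = nn.Linear(100, n_classes)             # softmax applied in the loss

    def forward(self, x):                                # x: (batch, 1, H, W)
        f = self.features(x)                             # (batch, 64, H', W')
        f = F.adaptive_avg_pool2d(f, (1, f.shape[-1]))   # collapse frequency axis
        seq = f.squeeze(2).permute(0, 2, 1)              # (batch, time, 64)
        h, _ = self.gru1(seq)
        h, _ = self.gru2(self.drop(h))
        return self.out(h[:, -1])                        # class scores

model = CnnGru()
scores = model(torch.randn(2, 1, 64, 64))                # -> shape (2, 7)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4) # SGD and lr from the text
```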

Dataset Description. The dataset used within the scope of this study was generated in a simulation environment, and noise at levels of 20-50 dB was added. The dataset consists of 7 classes in total and includes 12336 signals; 75% of these data are reserved for training and 25% for testing. The dataset comprises singular and composite power quality defects: the singular defects are sag, swell, oscillatory transient, flicker, and harmonics, while the composite defects consist of the sag + harmonics and swell + harmonics classes. The presence of different and high levels of noise in the signals makes the dataset more realistic and challenging. Table 1 gives the numbers of the seven PQD signal types, generated from pure sine waves, used to evaluate the performance of the GRU-based CNN network. Parameter changes were made in accordance with the IEEE-1159 standard [33]. Examples of the PQD signals are shown in Figure 2. The number of PQD signals per class is very important: this property, called data balance, is vital for training and testing CNN networks. The generated dataset contains 7 PQD signal classes in total: 1800 sag, 2000 swell, 1736 oscillatory transient, 1600 flicker, 2000 harmonic, 1600 sag + harmonic, and 1600 swell + harmonic samples. Table 1 includes the class distribution of these data.
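As an illustration of this kind of simulation, the sketch below generates one noisy "sag" sample; the parametric model, timing window, and value ranges are assumptions in the spirit of IEEE-1159, not the paper's exact generator.

```python
# Illustrative sketch of one noisy sag signal; the sag model and all
# parameter values are assumptions, not the paper's generator.
import numpy as np

fs, f0 = 3200, 50                          # assumed sampling rate and mains frequency
t = np.arange(0, 0.2, 1 / fs)
sag_depth = 0.5                            # assumed 50 % amplitude sag
mask = (t >= 0.06) & (t <= 0.14)           # assumed sag window
amp = np.where(mask, 1 - sag_depth, 1.0)
clean = amp * np.sin(2 * np.pi * f0 * t)

snr_db = np.random.uniform(20, 50)         # noise level drawn from the 20-50 dB range
noise_power = clean.var() / (10 ** (snr_db / 10))
noisy = clean + np.random.normal(0, np.sqrt(noise_power), clean.shape)
```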

3.3. Results. The proposed method was run on a computer with an Intel Core i7-7700K CPU (4.2 GHz), 32 GB DDR4 RAM, and an NVIDIA GeForce GTX 1080 graphics card. To assess the performance of the proposed method, a comparison was made with pretrained CNN networks of three different architectures. The importance of this comparison is to show the performance of the proposed from-scratch CNN architecture when supported by GRU modules. The training charts of each CNN network are shown in Figures 3-6. Blue lines on the accuracy curves show training accuracy, and black lines show test accuracy; red lines on the loss curves represent training loss, and black lines represent test loss. As can be seen from the graphs, the GRU-based CNN architecture showed the highest performance. Moreover, the overfitting problem does not occur in the proposed CNN + GRU architecture, and the proposed method quickly begins to converge on the problem during the training process. The availability of a sufficient amount of training data led to high performance in all CNN networks.
According to Figures 4-6, the pretrained CNN networks using transfer learning and the proposed CNN architecture all showed very high performance; the main reason is that enough data are available for training and testing. A comparative analysis of the specified CNN models is presented in Table 2, and Table 3 contains the number of parameters of each CNN architecture. The number of parameters of the proposed method is very low.

Discussion. Considering its performance and its number of parameters, the proposed CNN architecture stands out against the other, pretrained CNN networks. According to the results in Table 2, the proposed architecture achieves the best performance while updating the fewest parameters, with 245,284. The number of updated parameters is a factor that directly affects the training and test processes of the model. Table 4 includes the results of state-of-the-art methods. The lowest performance belongs to Khokhar et al. [34] with 86.86%, and the highest is obtained, with 97.94% accuracy, by the FST + fuzzy DT method [33]. The datasets in these studies do not contain any noise on the PQDs, whereas in our work the PQD dataset includes 20-50 dB noise levels.
This makes it a challenging problem for AI algorithms. Moreover, the number of samples is much larger than in the previous studies.
In conclusion, the proposed CNN + GRU algorithm achieved the highest performance in Table 4, with 98.44% accuracy.

Conclusions
In this study, a GRU-based CNN architecture is proposed for the identification of PQD defects. A CNN model has been designed that can analyze both individual and composite disturbances within the PQD family. The PQD signals are converted into two-dimensional form by applying the STFT. Despite the high level of noise in the PQD signals, a significant classification performance has been achieved; the proposed CNN model thus remains feasible even in noisy environments and exhibits an adaptive character. Of course, the algorithm and the study have some shortcomings, and researchers who continue to work on this topic are encouraged to analyze more PQD error classes and to measure disturbance durations as well.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.