A Hybrid Deep Neural Network for Electricity Theft Detection Using Intelligent Antenna-Based Smart Meters



Introduction
The rapid growth in the number of energy consumers has increased energy demand, which requires efficient generation and distribution of energy at the grid level. In this regard, the smart grid [1], through the incorporation of advanced metering infrastructure (AMI), monitors the energy consumption patterns of consumers. AMI establishes bidirectional communication between consumers and the grid to balance the supply and demand of energy [2]. Besides the demand management challenge, the smart grid faces two types of losses during energy transmission: technical losses (TLs) and nontechnical losses (NTLs). The former occur due to energy dissipation in power distribution lines and transformers. The latter are also known as commercial losses and occur due to unregistered connections [3], unpaid bills [4], tampering with the antenna-based smart meters, etc. [5]. The major cause of NTLs is electricity theft by fraudulent consumers.
NTLs further lead to revenue loss in the economies of countries [6]. For instance, the electric utilities of Brazil and India lose about 4.5 billion dollars annually [7], and the utilities of the USA lose around 6 billion dollars annually [8]. Power utilities therefore need to recover this revenue by detecting NTLs, and they use several strategies to do so. The integration of AMI in power grids enables many advanced and automatic electricity theft detection (ETD) methods. However, calculating such losses and locating them exactly are considered the most crucial tasks [9]. The inefficiency and reduced profitability of power utilities are other major concerns. These issues place an extra burden on honest consumers by adding extra charges to their actual utility bills. The aforementioned losses lead to other issues as well, such as inflation, disruption of industrial routines, and load shedding [10].
Several strategies and methods are presented in literature to handle issues related to ETD. These methods are commonly hardware based, game theory based, or data driven [1]. The hardware-based methods, also termed state-based methods [11], operate with physical devices, such as transformers, radio-frequency identification tags, sensors, and other electrical equipment. The state-based methods calculate the difference between energy generation at the power utility side and energy consumption at the consumers' side. These methods achieve high efficiency in theft detection; however, the maintenance and safety of the physical devices are major concerns [12]. On the other hand, in the game theory-based methods [13], a game is created between the participants (utilities and energy consumers). The participants compete with each other in order to increase their utility and to get more benefits [14]. Although the game theory-based methods are more efficient than the state-based methods, they rely on assumptions and are not able to perform efficient ETD.
In literature, data-driven methods are presented to perform efficient ETD. These methods utilize machine learning (ML) techniques and models, such as decision tree (DT) [3], artificial neural network (ANN) [15], and support vector machine (SVM) [16]. These techniques use both labeled and unlabeled data to give optimal ETD results, as they have efficient learning capabilities. However, handling imbalanced class data is challenging for these methods [17]. Traditional classifiers become biased towards the majority class if the data is imbalanced. Therefore, data balancing is required before classification in order to avoid classifier bias and to achieve optimal ETD results. Data balancing is performed using resampling techniques, which balance the number of samples across the classes present in a dataset. The abbreviations used in this work are presented in Table 1.

Problem Definition and Statement.
The increased electricity demand has led to several issues, not just in underdeveloped countries but also in developed countries. The increase in the poverty rate is one of these issues, and it pushes people towards electricity theft, i.e., towards illegal means of using electricity to fulfill their demands. Therefore, ETD is important and needs immediate attention to curb the ever-increasing electricity theft rate. Keeping this in mind, many data analysis techniques, such as SVM, logistic regression (LR), and gated recurrent unit (GRU), have been presented in the literature. However, efficient results have not been achieved yet because of the limitations of these techniques [15], such as a poor learning rate and limited generalization capability. The biggest issue, however, is class imbalance: there is a huge difference between the number of instances in the two classes, i.e., the honest and theft consumers' classes.
Electricity theft leads to a revenue loss of billions of dollars annually for electric utilities [1] and poses serious threats to a country's economy. In addition, electric utilities face electricity losses, which are further classified into TLs and NTLs, the latter being the most difficult to tackle. NTLs are caused by meddling with either the smart antennas or the smart meters installed at the consumers' end. Therefore, to deal with the class imbalance issue and to avoid NTLs, an efficient model is presented in this work, termed particle swarm optimization GRU (PSO-GRU).

Contributions.
This work presents a new variant of neural networks (NNs), named hybrid deep neural network (HDNN), in order to address the class imbalance and overfitting problems in ETD. This work extends the idea presented in [18]. The following are the primary contributions of this work:
(i) A metaheuristic model, known as particle swarm optimization (PSO), is used with conventional GRU and convolutional neural network (CNN) to fine tune the parameters and to improve the learning rate, which makes the proposed PSO-GRU model generalize better across training and testing and thereby addresses the overfitting issue
(ii) A real-world electricity consumption (EC) dataset provided by State Grid Corporation of China (SGCC) [19] is used
(iii) Missing values are recovered using the local average method, and the data is normalized using the min-max normalization technique
(iv) A comparison is made between the proposed and existing models, which shows the model's efficiency in terms of ETD

The rest of the paper is organized as follows. Section 2 gives an overview of the related work done for ETD. The proposed HDNN model is described in Section 3. The simulations performed in the proposed work are discussed in Section 4. In the end, Section 5 concludes the paper and presents the future work.

Related Work
In literature, many systems and approaches are presented for ETD. Most of them are based on hardware and game theory. However, these approaches still face maintenance and data diversity problems. The ML techniques are observed to be better than the abovementioned ETD methods because they need no maintenance and can handle data diversity. However, various existing machine and deep learning techniques proposed in the literature face the overfitting problem [17]. The authors in [3] proposed a classification technique based on ensemble bagged tree (EBT) for detecting NTLs in power grids. The technique handles the electricity loss issue in Multan Electric Power Company (MEPCO), Pakistan. It is validated in terms of various performance metrics and is found to be more efficient than the existing techniques. In [20], the authors proposed a hybrid model for ETD that is based on long short-term memory (LSTM) and CNN. The model is compared with other models, and the results show that it beats the existing models and achieves high accuracy. The model proposed in [21] is based on relative entropy (RE) along with principal component analysis (PCA). This work aims at detecting the electricity losses that occur in the vicinity of AMI, using the reconstructed data. The model is evaluated for sensitivity and specificity, and the results indicate good performance for ETD.
In [22], the authors used a fuzzy logic technique for the detection of suspicious electricity consumers. The selected time series data is linked with consumers, and fuzzy sets of suspicion are created. Based on these fuzzy sets, a threshold value is decided, which helps in the detection of suspicious consumers. The technique's performance is examined in terms of the curve membership function, and the results show that it performs better than benchmark techniques. Similarly, the authors in [23] presented a fuzzy logic technique to detect electricity theft and to increase the reliability of the power grid. The technique is evaluated on sixteen real-world scenarios, and its efficiency is measured in terms of classifying honest and fraudulent consumers. However, the integration of renewable sources with power grids is not handled well. Similarly, the authors in [24] extracted the EC behavior patterns of the users and detected abnormal consumption behavior. The authors in [25] proposed a deep learning model to overcome the issues related to NTLs in smart grids. This model considered unlabeled data and an adversarial model to mitigate the [...]. In [26], a hybrid approach is presented for ETD, which is based upon Gaussian mixture model (GMM) and LSTM. In this work, actual time series data is considered and some improvements are made in the LSTM structure. Simulations are carried out to show the performance of the proposed approach in terms of ETD.
In [27], the authors proposed a maximal overlap discrete wavelet packet transform (MODWPT) based model for feature extraction and the random undersampling boosting (RUSBoost) technique to detect NTLs in power grids. The model is compared with the benchmark techniques, and the results show that it outperforms them in terms of NTL detection. The authors in [28] established a relationship between commercial losses and the characterization of irregular consumers using the black hole algorithm. Two different datasets provided by a Brazilian electric utility are used in this work, and theft categorization is performed.
The authors in [29] presented a clustering-based approach for the detection of electricity theft. The approach is based on the maximum information coefficient (MIC) and clustering by fast search and find of density peaks (CFSFDP). The Irish smart meter dataset is used in this work for carrying out the simulations, and the proposed model is observed to perform efficient theft detection. Similarly, a clustering approach is used by the authors in [30] to divide the consumers into clusters on the basis of load consumption and to perform efficient short-term load forecasting. The authors in [31] adopted the LSTM method for forecasting the EC of consumers based on their recent consumption profiles. The continuous monitoring of the profiles helps in efficient ETD, and the simulation results support the model's efficiency. The authors in [32] used big data analytics for forecasting the EC along with the corresponding price. Simulations are performed to show the model's efficiency in terms of price forecasting. The authors in [33] proposed a fault-tolerant model to preserve the privacy of users and perform data aggregation in smart grids. Table 2 summarizes the related work in tabular form for better understanding.

Proposed System Model
In this section, the proposed system model is described along with the dataset used in this work. Furthermore, the different techniques used in this work are discussed. The proposed system model is shown in Figure 1.

Description of the Proposed Model.
The proposed HDNN-based model consists of several steps, discussed as follows. Initially, the data is gathered from the intelligent antenna-based smart meters installed at the consumers' end, shown in the lower box of the proposed system model. Each smart home has a smart meter and an intelligent antenna, which help in recording the EC data. The data is saved and is made publicly available by SGCC. Afterward, the dataset is preprocessed in order to normalize the data and to remove redundant and irrelevant data. Also, the outliers are removed to get more refined data. The most important features are obtained through the feature engineering process, in which feature selection and extraction are done. Then, the classification of normal and fraudulent consumers is done. Figure 2 shows the flow of data acquired from the smart homes having intelligent antenna-based smart meters. In Table 3, the identified limitations are mapped with their respective solutions and validations.
(1) Description of the Dataset. In this work, real-world EC data of the users is used, provided by SGCC [19]. The data is gathered from the consumers' side using intelligent antenna-based smart meters. The dataset consists of 1,035 features. A subset containing the data of 3000 consumers is selected from the whole dataset, of which 2480 are normal consumers and the remaining 520 are fraudulent. The dataset is clearly imbalanced, which strongly affects ETD. In this work, the data is balanced using the synthetic minority oversampling technique (SMOTE) [20], which equalizes the number of fraudulent and normal consumers. In addition, the dataset is divided into 75% and 25% portions for training and testing, respectively. Table 4 gives a detailed description of the dataset used in the proposed work.

(2) Synthetic Minority Oversampling Technique. SMOTE is an oversampling technique that increases the number of data points in the minority class (fraudsters) in order to handle the imbalanced data problem. The added points are synthesized within the minority class. In this work, the highly imbalanced data is balanced by using SMOTE. In SMOTE, if $(x_1, x_2)$ depicts a sample of the minority class, then $(x_1', x_2')$ is selected from its nearest neighbors. The synthesized data points are generated by the following equation.
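A minimal NumPy sketch of this interpolation step is given here for illustration. The function name `smote`, its parameters, and the toy points are hypothetical and not part of the original work; the sketch assumes the standard SMOTE procedure, in which a synthetic point is placed at a random position on the segment between a minority sample and one of its k nearest minority neighbors.

```python
import numpy as np

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by neighbor interpolation."""
    rng = np.random.default_rng(seed)
    X = np.asarray(minority, dtype=float)
    k = min(k, len(X) - 1)
    # Pairwise Euclidean distances inside the minority class.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)           # a sample is not its own neighbor
    neighbours = np.argsort(dist, axis=1)[:, :k]
    synthetic = np.empty((n_new, X.shape[1]))
    for j in range(n_new):
        i = rng.integers(len(X))             # pick a minority sample
        nb = X[rng.choice(neighbours[i])]    # pick one of its k neighbors
        delta = nb - X[i]                    # coordinate-wise difference
        synthetic[j] = X[i] + rng.random() * delta
    return synthetic

# Toy minority class with two features (hypothetical values).
thieves = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote(thieves, n_new=6, k=2)
print(new_points.shape)
```

For the subset described above, balancing 520 fraudulent against 2480 normal consumers would require 2480 - 520 = 1960 synthetic samples. In equation form, and assuming that $\Delta$ is taken as the coordinate-wise difference to the selected neighbor, the interpolation is commonly written as

$$(X_1, X_2) = (x_1, x_2) + \mathrm{random}(0, 1) \times \Delta, \qquad \Delta = \left(x_1' - x_1,\; x_2' - x_2\right)$$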
where $\mathrm{random}(0, 1)$ is a number chosen between 0 and 1 and $\Delta$ denotes the Euclidean distance between the minority class sample and its neighboring sample; it is calculated in Equation (2).

Data Preprocessing.
The major aim of the preprocessing step is to obtain the most refined data from the whole dataset. In this step, the missing or Not a Number (NaN) values are recovered by the local average method, which is given in the following equation [7].

When the NaN values are consecutive, $P_k$ takes the value 0.10. $x_i$ represents the EC of a user at a specific time interval. $\mu_k$ has a binary value, i.e., either 0 or 1, which is based on the threshold $k$ and is calculated using the following equation.
where $\mathrm{Average}_{\mathrm{local}}$ is computed in the following equation.

The min-max normalization technique is used to transform and normalize the dataset into the range [0, 1]. The minimum value is transformed into 0, the maximum value is transformed into 1, and all other values are mapped between 0 and 1. The min-max technique is calculated by the equation given below,

$$B = \frac{A - \min(A)}{\max(A) - \min(A)} \tag{6}$$

where $A$ refers to the actual value of the feature and $B$ indicates the value after normalization, while $\max(A)$ and $\min(A)$ are the maximum and minimum values of the feature.

Table 2: Summary of the related work.
Technique | Performance metrics | Limitations
[21] | ROC, specificity, and sensitivity | PCA only works for linear data
Fuzzy logic [22] | Generalized bell curve membership function | Increased computational time
Fuzzy logic [23] | Accuracy, F1-score, AUC | Issues related to renewable sources are not handled
Semisupervised deep neural network (DNN) [25] | Precision, true and false-positive rates, recall, and F1-score | High false-positive rate
LSTM and GMM [26] | AUC, MCC, recall, and accuracy | Data imbalance is not handled
MODWPT and RUSBoost [27] | F1-score, AUC, and precision | Oversampling issue is not tackled
Blackhole algorithm [28] | Average execution time and convergence | High false-positive rate
MIC and CFSFDP [29] | F1-score, precision, and recall | Low precision and recall
LSTM [31] | Accuracy, sensitivity, and specificity | Overfitting is not handled well
LSTM and regression [32] | F1-score, recall, and precision | Low F1-score
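The missing-value recovery and min-max scaling of the data preprocessing step can be sketched as follows. This is an illustrative sketch only: the function names and the toy readings are hypothetical, and the imputation shown is a simplified neighbor-average rule (mean of the two adjacent readings when both are valid, 0 otherwise), assumed here in place of the thresholded form with $P_k$ and $\mu_k$ used in this work.

```python
import numpy as np

def fill_missing_local_average(series):
    """Simplified local average: a NaN becomes the mean of its two
    neighbors when both are valid, and 0 otherwise (assumed rule)."""
    raw = np.asarray(series, dtype=float)
    filled = raw.copy()
    for i in np.flatnonzero(np.isnan(raw)):
        has_left = i > 0 and not np.isnan(raw[i - 1])
        has_right = i < raw.size - 1 and not np.isnan(raw[i + 1])
        if has_left and has_right:
            filled[i] = 0.5 * (raw[i - 1] + raw[i + 1])
        else:
            filled[i] = 0.0
    return filled

def min_max_normalize(values):
    """Map values linearly so that the minimum is 0 and the maximum is 1."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:                      # constant series: avoid division by zero
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)

# Hypothetical daily readings of one consumer with missing entries.
readings = [2.0, np.nan, 4.0, np.nan, np.nan, 10.0]
clean = fill_missing_local_average(readings)
scaled = min_max_normalize(clean)
print(clean)
print(scaled)
```

The neighbors are read from the raw series rather than from the partially filled one, so the result does not depend on the order in which the gaps are visited.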

Feature Engineering.
Once the data is preprocessed and normalized, the feature engineering process is performed. This process includes two steps: one is feature selection and another is feature extraction. The former selects the most relevant data features from the whole dataset to reduce both overfitting and training time and to improve accuracy; whereas the latter extracts the selected features for data dimensionality reduction and removal of data redundancy. With the feature engineering process, the performance of the model is enhanced.
In this work, feature engineering is done using a DNN, namely CNN. The idea of CNN was primarily presented in [20]. The typical CNN architecture has various layers, which include convolution, pooling, and fully connected layers. The first convolution layer contains many convolution filters, termed kernels, which perform the mapping operation. The convolution layer is mathematically given in the following equation [34]

$$y_{\mathrm{conv}}\left(X_{ft}\right) = \sigma\left(W_{ft}^{t} * X + b_{ft}^{t}\right) \tag{7}$$

where $\sigma$ refers to the activation function and $*$ represents the convolution operation. $W_{ft}^{t}$ and $b_{ft}^{t}$ represent the learnable parameters in the $f$th feature filter. The next layer in CNN is the pooling layer, which comes after the convolution layer. The main objectives of this layer are to extract the meaningful features and to downsample each feature map to achieve dimensionality reduction. It also reduces the execution time of the network. The pooling layer is commonly realized by one of two functions (typically max pooling and average pooling). The third layer of CNN, the fully connected layer, performs the final classification. In ETD, the data is classified into honest and fraudulent consumers' classes. The mathematical representation of the fully connected layer is given in Equation (8), taken from [34], where $W$ represents the weight and $b$ represents the bias. The function used in CNN to predict the final output is known as the Softmax function. The output is given in binary form, either 0 or 1 [20]. Equation (9) provides a complete mathematical form of CNN, as given in [34]. Figure 3 gives an overview of the architecture of CNN. The parameters involved in Equation (9) are initialized with random numbers drawn from the normal distribution. $m$ denotes the total number of training examples. Initially, the weight $W_{ij}$ is assigned randomly; it is later updated using the gradient descent method.
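Equations (10) and (11) give the update rules,

$$W_{ij}^{(l)} = W_{ij}^{(l)} - \alpha \frac{\partial}{\partial W_{ij}^{(l)}} A(W, b) \tag{10}$$

$$b_{i}^{(l)} = b_{i}^{(l)} - \alpha \frac{\partial}{\partial b_{i}^{(l)}} A(W, b) \tag{11}$$

where $\alpha$ represents the learning rate, $\partial$ denotes the partial derivative, and $W_{ij}^{(l)}$ is the connection weight between the $i$th neuron in the $l$th layer and the $j$th neuron in the $(l+1)$th layer. $b_{i}^{(l)}$ is the bias of the $i$th neuron in the $l$th layer. Equations (10) and (11) are repeated until the optimal value of the objective function $A(W, b)$ is achieved. The above mathematical representations are motivated by [34]. The hyperparameters of CNN used in this work and their values are given in Table 5.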
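The convolution and pooling operations described above can be illustrated with a small one-dimensional NumPy sketch. The function names, the ReLU activation chosen for $\sigma$, the kernel, and the toy load profile are assumptions made for illustration; they are not the configuration listed in Table 5.

```python
import numpy as np

def conv1d_relu(x, kernel, bias=0.0):
    """Slide the kernel over x without padding (cross-correlation, as in
    most deep learning libraries) and apply ReLU as the activation."""
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    k = kernel.size
    out = np.array([x[i:i + k] @ kernel + bias
                    for i in range(x.size - k + 1)])
    return np.maximum(out, 0.0)

def pool1d(x, size=2, mode="max"):
    """Downsample a feature map with non-overlapping windows."""
    usable = (x.size // size) * size          # drop an incomplete last window
    windows = x[:usable].reshape(-1, size)
    return windows.max(axis=1) if mode == "max" else windows.mean(axis=1)

# Hypothetical consumption profile of one consumer.
daily_load = [0, 1, 3, 2, 5, 4, 6]
feature_map = conv1d_relu(daily_load, kernel=[-1, 0, 1])
pooled_max = pool1d(feature_map, mode="max")
pooled_avg = pool1d(feature_map, mode="average")
print(feature_map, pooled_max, pooled_avg)
```

Here the kernel responds to the rise in consumption over a two-step span, and pooling halves the length of the feature map, which is the dimensionality reduction that the pooling layer is meant to provide.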

Gated Recurrent Unit.
It is considered a variation of LSTM and recurrent neural network (RNN) and is a subclass of DNN. It resolves the vanishing gradient problem of RNN by using two gates: the update gate and the reset gate. These gates determine how much information is passed on to the future. The update gate determines how much of the past information needs to be passed to the future. Equation (12) gives the formula to calculate the output of the update gate $z_t$ for time series data, taken from [35].
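Assuming the standard GRU formulation, with which the symbols of this section are consistent, the update gate can be written as

$$z_t = \sigma\left(W_z x_t + U_z h_{t-1}\right) \tag{12}$$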
where $x_t$ is the input given to the network unit and is multiplied by its weight $W_z$. $h_{t-1}$ holds the previous information and is multiplied by its weight $U_z$. The two weighted terms are then summed, and the result is squashed between 0 and 1 by applying the sigmoid function.
The reset gate decides how much of the previous information is to be neglected. Equation (13) gives the mathematical form of the reset gate, taken from [35].
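Under the same assumption of the standard GRU formulation, the reset gate $r_t$ has the same structure as the update gate but its own weights:

$$r_t = \sigma\left(W_r x_t + U_r h_{t-1}\right) \tag{13}$$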
where $x_t$ is multiplied by its weight $W_r$, and $h_{t-1}$ is multiplied by its weight $U_r$. Figure 4 shows the architectural view of GRU. In Table 6, the hyperparameters of GRU used in this work along with their values are presented.
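To show how the two gates act together, the following NumPy sketch performs one GRU step. It is illustrative only: the candidate state, the final blending rule (one common convention, in which $z_t$ weights the previous state), the parameter names `W_h` and `U_h`, and the layer sizes are assumptions taken from the standard GRU formulation, not from Table 6.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_params(n_in, n_hidden, seed=0):
    """Small random weights for the three GRU transformations."""
    rng = np.random.default_rng(seed)
    shapes = {"W": (n_hidden, n_in), "U": (n_hidden, n_hidden)}
    return {f"{m}_{g}": rng.normal(0.0, 0.1, shapes[m])
            for g in "zrh" for m in "WU"}

def gru_step(x_t, h_prev, p):
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev)   # update gate
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev)   # reset gate
    # Candidate state: the reset gate scales how much of h_prev is used.
    h_cand = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * h_prev))
    # Blend previous and candidate states (assumed convention).
    return z * h_prev + (1.0 - z) * h_cand

# Run a hypothetical week of 3-feature inputs through a 2-unit GRU.
params = init_params(n_in=3, n_hidden=2)
h = np.zeros(2)
for x in np.linspace(0.0, 1.0, 21).reshape(7, 3):
    h = gru_step(x, h, params)
print(h)
```

Because the new state is a convex combination of the previous state and a tanh output, every component of the hidden state stays within [-1, 1] when it starts at zero.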

Particle Swarm Optimization.
It is a population-based stochastic technique that handles the local optima issue by exploring the search space for globally optimal solutions. Traditional ML techniques, such as GRU, LR, and SVM, often get stuck in local optima, which leads to poor ETD performance and makes them unsuitable for ETD on their own. In this work, a hybrid technique is made by integrating PSO with GRU to perform efficient and accurate ETD. The proposed technique overcomes the local optima issue efficiently. PSO performs the search via swarm particles, which are updated in every iteration. The best solution is reached by moving each particle in the direction of its previous best $pbest(i, t)$ and the global best $gbest(t)$ solutions in the swarm [18]. Equations (14) and (15) give the mathematical form of calculating $pbest$ and $gbest$, respectively.
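Assuming a minimization problem and the standard PSO definitions, these two quantities can be written as

$$pbest(i, t) = \underset{k = 1, \ldots, t}{\arg\min}\; f\left(P_i(k)\right), \quad i \in \{1, 2, \ldots, N_p\} \tag{14}$$

$$gbest(t) = \underset{\substack{i = 1, \ldots, N_p \\ k = 1, \ldots, t}}{\arg\min}\; f\left(P_i(k)\right) \tag{15}$$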
where $i$ indicates the particle index, $t$ gives the current iteration number, $N_p$ gives the total number of particles, $f$ represents the fitness function, and $P$ denotes the position. The velocity $V$ of a particle is updated using the following equation.
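In the standard PSO formulation, which is assumed here, the velocity update reads

$$V_i(t+1) = \omega V_i(t) + c_1 r_1 \left(pbest(i, t) - P_i(t)\right) + c_2 r_2 \left(gbest(t) - P_i(t)\right) \tag{16}$$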
where $\omega$ is the inertia weight, which balances global exploration and local exploitation, $r_1$ and $r_2$ are uniformly distributed random variables in the range [0, 1], and $c_1$ and $c_2$ are positive constants, also known as acceleration coefficients. The hyperparameters of PSO used in this work and their values are given in Table 7. The above mathematical formulations of PSO are taken from [18]. Algorithm 1 gives the pseudocode of PSO. The presented hybrid model is optimized by passing three parameters of GRU to PSO. Based on these parameters, the training and testing processes of the model are optimized for the given dataset. As a result, the model becomes more accurate and more robust. These parameters are discussed below.
(i) Hidden Layer. It is considered the most important layer of GRU. It is positioned between the input layer and the output layer. This layer primarily performs the computational operations and assigns the weights to the input values. After the optimal input sets are obtained, the results are passed to the output layer for final predictions
(ii) Batches. They determine the number of training samples required to compute the training and testing loss. Generally, the loss is calculated by the predefined loss function

The validation is given in Figures 7 and 8.

These problems further lead to an increase in the false-negative rate and the false-positive rate (which are given in Equations (19) and (20)). Therefore, a hybrid PSO-GRU model is presented to overcome the aforementioned issues. The pseudocode of the model is given in Algorithm 2. The main purpose of PSO is to improve the learning of the GRU network and to solve the overfitting problem.

In the whole process, the data is initially preprocessed. In this phase, the local average method is used for recovering missing values, whereas min-max normalization is applied to scale the data. Afterward, data balancing is performed using the SMOTE oversampling technique, which balances the classes by generating synthetic samples of the minority class; a classifier trained on imbalanced data would otherwise be biased towards the majority class. Useful features are then extracted from the dataset by CNN and passed to GRU for training. The parameters of GRU are tuned using PSO. Finally, the fine-tuned model is used to perform classification.
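The PSO search loop can be sketched in a few lines of NumPy. The sketch below is not Algorithm 1 or Algorithm 2: the function `pso_minimize`, the swarm settings, the position update $P_i(t+1) = P_i(t) + V_i(t+1)$, and in particular the `surrogate_loss` function are assumptions made for illustration. The surrogate stands in for the validation loss of a trained GRU as a function of two hypothetical hyperparameters (hidden units and batch size); in practice, each fitness evaluation would train and validate the network.

```python
import numpy as np

def pso_minimize(fitness, bounds, n_particles=20, n_iters=60,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `fitness` inside box `bounds` with a basic PSO."""
    rng = np.random.default_rng(seed)
    low, high = np.asarray(bounds, dtype=float).T
    pos = rng.uniform(low, high, size=(n_particles, low.size))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_val = np.array([fitness(p) for p in pos])
    g = np.argmin(pbest_val)
    gbest, gbest_val = pbest[g].copy(), pbest_val[g]
    for _ in range(n_iters):
        r1 = rng.random(pos.shape)
        r2 = rng.random(pos.shape)
        # Velocity update: inertia + pull to personal best + pull to global best.
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, low, high)   # keep particles inside the box
        vals = np.array([fitness(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved] = pos[improved]
        pbest_val[improved] = vals[improved]
        g = np.argmin(pbest_val)
        if pbest_val[g] < gbest_val:
            gbest, gbest_val = pbest[g].copy(), pbest_val[g]
    return gbest, gbest_val

def surrogate_loss(p):
    """Hypothetical stand-in for GRU validation loss; minimum at (64, 32)."""
    hidden_units, batch_size = p
    return ((hidden_units - 64) / 64) ** 2 + ((batch_size - 32) / 32) ** 2

search_box = [(8, 256), (8, 128)]   # (hidden units, batch size), assumed ranges
best, best_val = pso_minimize(surrogate_loss, search_box)
print(best, best_val)
```

Integer-valued hyperparameters would be rounded before training; the continuous search space is kept here to keep the sketch short.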

Simulation Results and Discussion
Several simulations are conducted to assess the performance of the proposed hybrid model. The simulation results, performance metrics, and benchmark models are discussed in this section.

Performance Metrics.
The performance of the proposed model is examined by considering several performance measures, which include AUC, accuracy, F1-score, recall, and precision. The training and testing loss and accuracy are also calculated to assess the performance of the proposed and the benchmark models.

(1) Area Under Curve. This performance metric validates the model by considering the area under the curve. Moreover, it provides the cumulative performance over the binary classes. The value of AUC lies between 0 and 1, where a value close to 0 means that the performance of the model is poor, whereas 1 indicates the best performance. Equation (17) is used to compute the value of AUC [20], as given below

$$\mathrm{AUC} = \frac{\sum_{i \in \text{positive class}} \mathrm{Rank}_i - \frac{M(1 + M)}{2}}{M \times N} \tag{17}$$

where the rank values of the samples are indicated by $\mathrm{Rank}_i$, $M$ represents the total number of positive samples, and $N$ represents the number of negative ones.
(2) F1-Score. It is also referred to as the F-measure. It summarizes the testing performance of the model by combining recall and precision. The F1-score is calculated using the following equation [20]

$$\text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{18}$$

(3) Recall and Precision. The recall and precision are expressed using the following equations

$$\text{Recall} = \frac{TP}{TP + FN} \tag{19}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{20}$$

where recall is the proportion of actual energy thieves that the classifier correctly identifies. True positive (TP) means correctly classified energy thieves. False positive (FP) means honest electricity consumers misclassified as thieves. False negative (FN) means energy thieves misclassified as honest consumers, and true negative (TN) means correctly classified honest consumers. Precision measures the relevance of the classifier's results: if precision is high, most of the consumers flagged as thieves are actual thieves.
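The metrics above can be computed directly from predicted scores and labels, as in the following sketch. The function names, the 0.5 decision threshold, and the toy scores are hypothetical; the AUC function follows the rank-based form of Equation (17) with ranks assigned in ascending order of score and without tie averaging.

```python
import numpy as np

def rank_auc(scores, labels):
    """Rank-based AUC: label 1 marks the positive (theft) class."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(scores)
    ranks = np.empty(scores.size)
    ranks[order] = np.arange(1, scores.size + 1)   # rank 1 = lowest score
    m = int(labels.sum())                          # positives
    n = labels.size - m                            # negatives
    return (ranks[labels == 1].sum() - m * (m + 1) / 2) / (m * n)

def precision_recall_f1(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical classifier scores for three thieves and three honest consumers.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = rank_auc(scores, labels)
predicted = [int(s >= 0.5) for s in scores]
precision, recall, f1 = precision_recall_f1(labels, predicted)
print(auc, precision, recall, f1)
```

In this toy case, 8 of the 9 thief-honest pairs are ordered correctly, so the AUC is 8/9, while one missed thief and one false alarm give precision, recall, and F1-score of 2/3 each.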

Case Study A.
In this case study, the SGCC dataset is considered, which is publicly available on the internet. A brief description of the dataset is given in Section 3.1 (Description of the Dataset).

Description of Existing Models with their Performance.
This section presents the existing models along with their performance based on the abovementioned performance metrics.
(1) Support Vector Machine Model. SVM is widely used in ETD for binary classification [36]. The performance metrics used for SVM and their results are given in Table 8. SVM performs better than the LR model; however, its performance is worse than that of LSTM, GRU, and PSO. This means that SVM is less accurate in handling the imbalanced data.
(2) Logistic Regression Model. LR is a popular classifier that is widely used for both classification and regression. In literature, it is also used for ETD [37]. In this work, the performance of LR is examined in terms of the aforementioned performance metrics. The results of LR are given in Table 9, which show that it performs worse than all of the benchmark techniques. The reasons behind this are the overfitting issue and the inability of LR to handle imbalanced ETD data.

(3) Long Short-Term Memory Model. LSTM is a DNN model and is widely used for feature extraction and classification in ETD [38,39]. The performance of LSTM is checked in terms of the abovementioned metrics. The results are presented in Table 10, which show that LSTM performs better than SVM and LR and worse than the proposed model. The proposed model performs better than LSTM because it does not face the overfitting problem, which degrades the performance of LSTM.

(4) Gated Recurrent Unit Model. GRU is also a DNN model [40]. It is an advanced version of the LSTM. Its results are shown in Table 11 and are better than those of the benchmarks. This means that GRU is capable of handling imbalanced data while avoiding overfitting.

(5) Genetic Algorithm Model. Genetic algorithm (GA) is a metaheuristic technique. Its performance is assessed based on various performance metrics, as given in Table 12. The results show that GA performs better than the benchmark models: SVM, LR, LSTM, and GRU. GA is also found to be more accurate and robust than the benchmarks because of its better learning capability.

Results.
The simulations are performed to evaluate the proposed and benchmark models by considering the aforementioned performance measures along with loss and accuracy. Moreover, combined models are also evaluated. SVM and LR obtain AUC scores of 0.68 and 0.63, respectively. SVM has a higher AUC score than LR because it uses the kernel trick to cope with nonlinear data.
In contrast, LR has the lowest AUC score of 0.63 because it is effectively a single-layer model, which does not handle high dimensional data effectively and gets stuck in local minima. The performance results of SVM and LR are given in Tables 8 and 9, respectively. Figures 5 and 6 show the accuracy and loss values of the combined CNN-LSTM model. The CNN is utilized to extract abstract and latent features from EC data with the help of convolutional and pooling layers, whereas LSTM extracts temporal patterns and classifies consumers' records into normal and abnormal data patterns. From Figure 5, the training accuracy is observed to be 81%, whereas the testing accuracy of the model is 75.5%. Both accuracies increase with the number of epochs, which suggests that the model gives good results for a large number of epochs. However, the model is trained for only four epochs due to limited resources. There is a gap of about 6% between the training and testing accuracy curves, which indicates that the model suffers from overfitting. On the other hand, Figure 6 shows that the training and testing losses of the model decrease as the number of epochs increases. After the 4th epoch, the training and testing losses are 19% and 24.5%, respectively.
This implies that the number of epochs is the main controlling parameter that decides the point at which the model achieves its highest performance. However, the CNN-LSTM obtains 75.5% test accuracy, which is neither satisfactory nor acceptable in ETD. This happens due to the inappropriate selection of hyperparameters. For deep learning models, a suitable selection of hyperparameters has a great influence on the performance results. Figures 7 and 8 show the accuracy and loss values of the combined CNN-GRU model on the training and testing datasets. The CNN is used to extract optimal features, while the classification task is performed by the GRU model. The GRU model has reset and update gates, which extract more relevant information from the high variance features extracted by the CNN and remove the noisy and redundant features. This process makes the performance of CNN-GRU better than that of CNN-LSTM. There is a 4% difference between the accuracy curves and a 6.97% difference between the loss curves on the training and testing datasets, which indicates that the CNN-GRU model also suffers from overfitting. The inappropriate tuning of hyperparameters leads to overfitting, where the model gives better results on seen data than on unseen data.
In literature, there are different techniques to tune the hyperparameters of ML and deep learning models, such as random search, grid search, gradient-based optimization, and evolutionary algorithms. Each of these methods has its pros and cons. In this study, we utilize the population-based algorithms PSO and GA to find the optimal hyperparameters of the CNN-GRU model. These algorithms define a search space of hyperparameters and try to find the combination for which the model gives high values of the performance indicators.
The PSO is merged with CNN-GRU for hyperparameter tuning to enhance the performance of the proposed model. The results are shown in Figures 9 and 10. The training and testing accuracy of CNN-GRU-GA are shown in Figure 11. The figure shows that the training and testing accuracies are approximately 87% and 86.3%, respectively. Figure 12 presents the training and testing loss of CNN-GRU-GA, which keep decreasing as the number of epochs increases. As shown in the figure, the training loss and the testing loss are 13% and 13.7%, respectively, which are approximately the same. Next, the training and testing accuracy and loss of the hybrid CNN-GRU-PSO model are evaluated in Figures 9 and 10, respectively. Here, PSO and GA are used for tuning the hyperparameters of the CNN-GRU model. The results show that PSO finds a better set of hyperparameters than GA because PSO requires fewer parameters and less execution time. In PSO, each solution has its own local best, which leads it towards the global best after each iteration. In contrast, GA has crossover and mutation steps that create diversity in the newly generated offspring and prevent the model from falling into local optima. However, in this case, PSO performs better than GA and gives the combination of hyperparameters for which CNN-GRU gives good results. Figure 9 shows that both the training and testing accuracy of CNN-GRU-PSO increase with the number of epochs. The training accuracy is 89%, while the testing accuracy is 87.3%. Figure 10 gives the training and testing loss, which are 11% and 12.7%, respectively. Moreover, CNN is combined with GRU, and the performance of CNN-GRU is evaluated in terms of training and testing accuracy and loss. In Figures 5-10, the performance of the combined models is depicted.
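The crossover and mutation steps mentioned above can be sketched as a small genetic algorithm in plain Python. This is an illustrative sketch, not the authors' configuration: tournament selection, uniform crossover, Gaussian mutation, elitism, the rates, and the toy objective `toy_validation_loss` are all assumptions.

```python
import random


def ga_minimize(objective, bounds, pop_size=20, n_gens=30,
                mutation_rate=0.2, seed=0):
    """Minimize `objective` inside the box `bounds` with a simple real-coded GA."""
    rng = random.Random(seed)

    def tournament(pop, fits):
        # Pick two random individuals and keep the fitter one.
        a, b = rng.randrange(len(pop)), rng.randrange(len(pop))
        return pop[a] if fits[a] < fits[b] else pop[b]

    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for _ in range(n_gens):
        fits = [objective(ind) for ind in pop]
        # Elitism: the best individual survives unchanged.
        elite = pop[min(range(pop_size), key=lambda i: fits[i])][:]
        children = [elite]
        while len(children) < pop_size:
            p1, p2 = tournament(pop, fits), tournament(pop, fits)
            # Uniform crossover: each gene comes from either parent.
            child = [x if rng.random() < 0.5 else y for x, y in zip(p1, p2)]
            # Mutation: occasionally perturb a gene, then clip to the bounds.
            for d, (lo, hi) in enumerate(bounds):
                if rng.random() < mutation_rate:
                    step = rng.gauss(0.0, 0.1 * (hi - lo))
                    child[d] = min(max(child[d] + step, lo), hi)
            children.append(child)
        pop = children
    fits = [objective(ind) for ind in pop]
    best = min(range(pop_size), key=lambda i: fits[i])
    return pop[best], fits[best]


def toy_validation_loss(params):
    """Hypothetical stand-in for the validation loss of a trained model."""
    log_lr, dropout = params
    return (log_lr + 3.0) ** 2 + (dropout - 0.3) ** 2


# Assumed search space: log10(learning rate) in [-5, -1], dropout in [0, 0.8].
search_space = [(-5.0, -1.0), (0.0, 0.8)]
best_params, best_loss = ga_minimize(toy_validation_loss, search_space)
print(best_params, best_loss)
```

Compared with a basic PSO, this sketch has more design choices to set (selection scheme, crossover type, mutation rate and scale), which is in line with the remark above that PSO requires fewer parameters.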

Wireless Communications and Mobile Computing
It is observed that the CNN-GRU-PSO has the highest accuracy among the compared models. Similarly, the CNN-GRU-PSO model has the lowest loss, which reflects the model's generalization. Figures 13 and 14 show the combined performance of the used and existing models in terms of AUC. In the former, AUC is calculated using both the false positive rate and the true positive rate. The results indicate that the proposed CNN-GRU-PSO achieves a high AUC score, whereas the other models have low AUC scores. Moreover, the proposed model outperforms the existing ones regarding AUC in the presence of an imbalanced dataset. The GRU module in the proposed model has a strong ability to learn temporal correlations from the long-term electricity load profiles of consumers. It also maintains the context of previous EC information, which helps to handle any nonmalicious change (weather conditions, family structure, etc.) in the EC profile. Moreover, the integration of PSO for parameter tuning further enhances the performance of the proposed model towards efficient ETD. Furthermore, the proposed model is compared with the benchmark models in terms of the mentioned performance metrics, and the result is shown in Figure 15. The result indicates that the hybrid CNN-GRU-PSO model is more robust, accurate, efficient, and generalized than the benchmarks, as given in Table 13. Table 14 describes the running time of the proposed and baseline models. The SVM model takes 220 s during the training phase, which is higher than all other schemes. The selected SGCC dataset is high dimensional and not linearly separable. SVM draws n − 1 hyperplanes and then picks an optimal hyperplane with a high margin for distinguishing the two classes (n represents the number of dimensions), which is why SVM takes more time than the other models.
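To make the AUC computation discussed above concrete, the following sketch builds the ROC curve from the false positive rate and true positive rate at every score threshold and integrates it with the trapezoidal rule. The function name `roc_auc` and the example labels and scores are illustrative assumptions, not data from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve for binary labels (1 = theft, 0 = normal)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one sample of each class")

    # Sweep the decision threshold from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    tp = fp = 0
    prev_fpr = prev_tpr = 0.0
    auc = 0.0
    i = 0
    while i < len(ranked):
        j = i
        # Samples with identical scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        fpr, tpr = fp / n_neg, tp / n_pos
        # Trapezoid between the previous and the current ROC point.
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_fpr, prev_tpr = fpr, tpr
        i = j
    return auc


# Illustrative example: two normal consumers and two thieves.
example_labels = [0, 0, 1, 1]
example_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(example_labels, example_scores))  # 0.75
```

Because the AUC depends only on how the scores rank positive samples relative to negative ones, it is less sensitive to class imbalance than plain accuracy.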
The LR takes the lowest execution time because of its simple layer structure. It has only one hidden layer of a neural network (NN), so it has fewer weights to learn and consumes less time than the deep learning models. The CNN-GRU takes 56 seconds of running time in the training phase, which is 10 seconds less than CNN-LSTM because the GRU has fewer gates than the LSTM. The proposed CNN-GRU-PSO has a higher execution time than the other models because it uses PSO to tune the hyperparameters of both the CNN and GRU models concurrently.

Case Study B.
In this case study, the PRECON dataset is used, which is publicly available on the internet. The dataset is collected by Pakistan Residential EC company. This dataset contains the EC history of 42 residential houses for 365 days. In the dataset, the EC of each user is recorded at one-minute intervals. However, in this work, the data granularity is reduced to half an hour for ease of processing: the EC of every 30 minutes is aggregated into a single value across the whole dataset. All the consumers [...]

Figure 17 shows the accuracy of the proposed model. The accuracy indicates how accurately the data samples are classified; a higher accuracy means more correct predictions. The accuracy increases gradually on the test and train data after each epoch. The optimal number of epochs found by PSO is 25. Figure 18 illustrates the F1-score, which is the harmonic mean of precision and recall. It reflects how accurately the model identifies the energy thieves. A higher F1-score is beneficial for power utilities to recover maximum revenue. The AUC scores of the proposed and baseline models are presented in Figure 19. The AUC measures the separability between the positive and negative classes. The proposed model obtains an AUC score of 0.95, which is higher than that of all benchmark models. This implies that the proposed model efficiently distinguishes the two classes and reduces the misclassification rate to a minimal level. Furthermore, [...] test data. In this regard, three suitable statistical tests are chosen that are closely related to the classification task. The detailed description of these tests is given as follows.
(1) 5x2cv Paired t Test. This is a well-known statistical test for comparing the performance of classification and regression models. It was introduced by Dietterich in 1998 [41]. In this study, the test is conducted to compare the performance of different classifiers. It consists of twofold cross-validation with five repeats. For each fold, the classifiers are trained and the results are recorded. Afterward, the 5x2cv paired t test is applied to the final results to accept or reject the null hypothesis. In this case, the null hypothesis states that there is no real difference between the mean performance of the two algorithms. In this test, a p value is calculated for each t value. The p value is a probability that indicates whether the result observed on the sample data could have occurred by chance. It ranges from 0 to 1.
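As a rough illustration of the procedure in item (1), the sketch below computes the 5x2cv paired t statistic following the standard formulation of the test: the score difference on the first fold of the first repetition is divided by the square root of the average per-repetition variance of the differences. The input differences are hypothetical numbers, not results from this study, and the helper name `five_by_two_cv_t` is an assumption. The statistic is compared against 2.571, the two-sided 5% critical value of the t distribution with 5 degrees of freedom.

```python
import math

# Two-sided critical value of the t distribution with 5 degrees of freedom
# at the 5% significance level.
T_CRITICAL_5DF = 2.571


def five_by_two_cv_t(diffs):
    """5x2cv paired t statistic.

    `diffs` holds five (d1, d2) pairs, where d1 and d2 are the differences
    in score (e.g., error rate) between the two classifiers on fold 1 and
    fold 2 of one repetition of twofold cross-validation.
    """
    if len(diffs) != 5 or any(len(pair) != 2 for pair in diffs):
        raise ValueError("expected five (fold1, fold2) difference pairs")
    variances = []
    for d1, d2 in diffs:
        mean = (d1 + d2) / 2.0
        # Variance estimate of the differences within one repetition.
        variances.append((d1 - mean) ** 2 + (d2 - mean) ** 2)
    denom = math.sqrt(sum(variances) / 5.0)
    if denom == 0.0:
        raise ValueError("all differences are identical; statistic undefined")
    # Numerator: difference on the first fold of the first repetition.
    return diffs[0][0] / denom


# Hypothetical score differences between two classifiers.
example_diffs = [(0.02, 0.04), (0.03, 0.01), (0.05, 0.03),
                 (0.02, 0.04), (0.04, 0.02)]
t_stat = five_by_two_cv_t(example_diffs)
reject_null = abs(t_stat) > T_CRITICAL_5DF
print(t_stat, reject_null)
```

For these hypothetical differences the statistic is about 1.41, which is below the critical value, so the null hypothesis of equal performance would not be rejected.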
where Ms denotes McNemar's test statistic. Similar to the abovementioned test, a null hypothesis is formulated. The null hypothesis (H0) states that the classifiers have a similar proportion of errors on the test set. The p value is [...]

Table 17 describes the results of the different statistical tests. It is seen that the 5x2cv paired F test gives the best results among the tests because it is suitable for a large population size. The value of probability (p) is, in almost all cases, less than 0.005 (5%). A small value of p indicates that the models' results did not occur by chance. The t test does not perform well because the available dataset is large in both population and sample size. McNemar's test also yields a favorable value of p and shows that the models' results did not occur by chance. All the results are real and do not depend on any noise factor.
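For reference, McNemar's test can be sketched as follows. The exact form of the statistic used in this study is not recoverable from the text above, so the sketch assumes one common form, the continuity-corrected statistic (|b − c| − 1)² / (b + c), where b and c count the test samples misclassified by only one of the two classifiers. The counts in the example are hypothetical. The p value comes from the chi-square distribution with one degree of freedom, whose upper tail equals erfc(sqrt(x / 2)).

```python
import math


def mcnemar_test(b, c):
    """McNemar's test with continuity correction (assumed form).

    b: samples misclassified by classifier A but not by classifier B.
    c: samples misclassified by classifier B but not by classifier A.
    Returns the test statistic and its p value.
    """
    if b + c == 0:
        raise ValueError("the classifiers never disagree; test undefined")
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical disagreement counts between two classifiers.
stat, p = mcnemar_test(b=10, c=25)
print(stat, p)
```

With these hypothetical counts the statistic is 5.6 and the p value is roughly 0.018, so at a 5% significance level the null hypothesis of equal error proportions would be rejected.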

Conclusion and Future Work
This work presents an HDNN-based model to detect electricity theft in the smart grid. For this purpose, a dataset is taken from SGCC, which provides real EC data gathered using intelligent antenna-based smart meters installed at the consumers' end. The proposed model works in several steps. In the preprocessing step, the raw data is normalized, and the outliers and missing values are handled. The preprocessing is done using the local average method and the min-max normalization technique. Then, the feature engineering step is performed using CNN. Once the most relevant and normalized data is obtained, the classification process is done using the PSO-GRU integrated CNN. In this step, the normal and fraudulent consumers are classified. The proposed model is validated in terms of several performance metrics, namely accuracy, recall, precision, AUC, and F1-score. Moreover, the proposed model is compared with existing hybrid models, which include CNN-GRU, CNN-LSTM, and CNN-GRU-GA.
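The preprocessing step summarized above can be sketched in a few lines of Python. The exact rules used in this work are not restated here, so the sketch assumes a common variant: a missing reading (represented as `None`) is replaced by the average of its two neighbors when both are present and by zero otherwise, and min-max normalization rescales each series to the range [0, 1]. Outlier handling is omitted, and the function names are illustrative.

```python
def fill_missing_local_average(series):
    """Replace missing readings (None) with the average of their neighbors."""
    filled = []
    for i, value in enumerate(series):
        if value is not None:
            filled.append(value)
            continue
        prev_value = series[i - 1] if i > 0 else None
        next_value = series[i + 1] if i + 1 < len(series) else None
        if prev_value is not None and next_value is not None:
            filled.append((prev_value + next_value) / 2.0)
        else:
            # Assumption: fall back to zero when a neighbor is also missing.
            filled.append(0.0)
    return filled


def min_max_normalize(series):
    """Rescale a series of readings to the range [0, 1]."""
    lowest, highest = min(series), max(series)
    if highest == lowest:
        # A constant series carries no scale information.
        return [0.0] * len(series)
    return [(x - lowest) / (highest - lowest) for x in series]


# Hypothetical daily readings of one consumer with missing values.
raw = [2.0, None, 4.0, None, None, 6.0]
filled = fill_missing_local_average(raw)
normalized = min_max_normalize(filled)
print(filled)      # [2.0, 3.0, 4.0, 0.0, 0.0, 6.0]
print(normalized)
```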
The comparison results show the efficiency, accuracy, robustness, and generalization of the proposed hybrid model in handling the imbalanced class issue in ETD. Although the proposed method is well suited to efficient ETD, it incurs a slightly higher computational cost because the proposed model's modules are integrated in a sequential manner (CNN-GRU-PSO). First, CNN takes time to capture potential features from high-dimensional EC data. Second, GRU processes the feature maps extracted by the CNN for final classification. Meanwhile, PSO tunes the hyperparameters of both the CNN and GRU models. This workflow makes the execution time of the proposed model slightly higher than that of the existing methods. Moreover, the proposed method may fall short in accurately identifying electricity thieves in some complex real-world scenarios because simulated theft data is added to the minority class using SMOTE. In the future, more robust techniques will be utilized to efficiently handle the overfitting issue.

Data Availability
The datasets used in this study are openly available in [henryRD-lab/ElectricityTheftDetection] at [23] [Page 3, Section III-A1].