Energy Theft Detection in an Edge Data Center Using Deep Learning

With the development of smart grid cyber-physical systems, some data processing functions are gradually moving toward the edge layer near end users. To better realize energy theft detection at the edge, we propose an energy theft detection method based on the power consumption information acquisition system of power enterprises. The method involves the following steps. In the centralized data center, K-means is used to decompose a large amount of data into smaller subsets, which are then used to train the neural network parameters for feature extraction. We design a neural network named DWMCNN, which extracts features at the day, week, and month scales and therefore captures more accurate features. In the edge data center, the random forest (RF) algorithm is used to classify the extracted features. The experimental results show that the clustering method accords with the distributed-processing idea of edge computing and improves the operation speed and that the feature extractor has good convergence performance. In addition, compared with methods based on various other classifiers, this method has higher accuracy and lower computational complexity, which makes it suitable for deployment in edge data centers.


Introduction
In recent years, the industrial Internet of Things, especially smart grids, has developed rapidly. In the smart grid, the advanced metering infrastructure (AMI) [1], combined with Internet of Things technology and artificial intelligence technology, can obtain historical and real-time power consumption data from meters deployed in users' homes to realize emergency analysis and transient stability simulation [2,3].
However, smart meters are vulnerable to cyber-physical attacks in smart grids due to their insecure distributed deployment and physical environment, resulting in energy theft. The loss caused by energy theft is a nontechnical loss of electric power. By attacking smart meters and installing electricity-stealing modules in them, attackers can wiretap, damage, and tamper with meter readings, resulting in a significant income loss for energy enterprises and even endangering public safety (such as fire or electric shock). The electricity stealing rate in developing countries is quite high, reaching 30%. In India, energy theft costs as much as $4.5 billion a year [4].
Traditional energy theft detection mainly relies on the electric power enterprise to send technical personnel to read the electricity meter on a regular basis and then record, count, and analyze the data for manual discrimination.
There are also methods that use camera monitoring to prevent energy theft. However, these methods consume the human and material resources of power enterprises and cannot detect energy theft realized by advanced attack means. At present, the most commonly used method is combined with smart grid detection. Smart meters upload the collected data to a centralized data processing center, which then detects theft through an intelligent algorithm. However, the widely deployed smart meters and the large amount of power consumption data pose challenges to the centralized data center processing mode. To save on the energy consumption of nodes and reduce unnecessary data transmission, deploying edge data processing centers with a data processing function at the user edge has become a new detection mode for energy theft. In this mode, the user's electricity data do not need to be uploaded to the centralized data center, which reduces the upload bandwidth. Therefore, this paper aims to design a novel detection method to solve the above problems. We propose a neural network model that is suitable for deployment on edge devices and conforms to the daily, weekly, and monthly power consumption features of users to learn the power consumption data and identify energy thieves. The rest of the paper is organized as follows. Section 2 summarizes the related literature. Section 3 introduces the various parts of the model and the general process. We describe the characteristics of power consumption data in Section 4. We introduce the DWMCNN feature extraction model in Section 5. Then, we give the experimental results in Section 6. Finally, we summarize the paper in Section 7.

Related Works
Through a review of the literature, we identified several detection technologies for energy theft. Researchers divide energy theft detection systems into three basic methods: state-based detection, game theory-based detection, and classification-based detection. Using upgraded devices and sensors in state-based detection can improve the accuracy of energy theft detection. In [5], the authors designed a system that can conveniently detect and shield electricity stealing. The whole smart meter sensor is equipped with programmable logic controller (PLC) control and supervisory control and data acquisition (SCADA) monitoring. Energy theft is detected by a sensor that is triggered by any illegal use of electricity. The main limitations of the detection system are vulnerability, the high cost of hardware equipment, and the high maintenance cost. The detection method based on game theory is suitable for analyzing a large amount of data. Reference [6] proposed a detection method based on game theory that finds an optimal solution from the formulation of various potential strategies. In this process, the greatest challenge is to calculate the utility function among distributors, regulators, and thieves. The classification-based method mainly uses a machine learning algorithm to establish a classification model and analyze the daily power consumption patterns of users. Such classification models include decision trees (DTs), random forests (RFs), support vector machines (SVMs), and neural networks (NNs).
A classifier based on machine learning is used because the power consumption data are usually in the form of one dimension and time series.
There have been many new studies [7][8][9][10][11][12][13][14][15][16][17], and the support vector machine (SVM) classifier is the most common method. In addition, there are studies [18][19][20] that used artificial neural networks to detect energy theft. The accuracy of these studies in the detection of energy theft is very low. Additionally, the features extracted by traditional machine learning feature extraction methods cannot successfully achieve effective energy theft detection.
In [21], the author trained an SVM algorithm model and a rule engine algorithm using energy consumption data from customers with different time interval values. The different models proposed in the study achieved high success rates of 85.5% and 92%. The work in [22] proposed a convolutional neural network-long short-term memory (CNN-LSTM) model to detect energy theft in the smart grid. Due to the imbalance of the data distribution, data generation technology was used in energy theft detection. This increases the amount of theft data to the same level as that of normal users. Although the experimental accuracy was 89%, in practice, energy thieves are often far fewer in number than normal users. The work in [23] developed a new method to detect and identify energy theft in distribution systems using a multilayer perceptron artificial neural network (MP-ANN) algorithm. They successfully classified malicious users and normal users with an average accuracy of 93.4%. The work in [24] tried to identify customers who steal electricity by using smart meters through two different algorithms based on the linear regression method. Moreover, the study in [25] proposed a combination of convolutional neural network (CNN) and long short-term memory (LSTM) structures in which a model is used for short-term load forecasting and detection. Compared with other methods, the proposed model performed quite well.
There are also some methods that use clustering to detect anomalies. In [26], the density-based spatial clustering of applications with noise (DBSCAN) algorithm was used to detect and diagnose abnormal building operation patterns. In [27], a fuzzy clustering detection algorithm based on c-means was proposed. The Euclidean distance between the customer's consumption and a regular profile was calculated and used to measure the degree of anomaly.
However, whether using the machine learning method or deep learning method, most of the related research cannot be well combined with edge computing platforms, cannot achieve high efficiency and low computational complexity, and is unsuitable for edge platform deployment. Therefore, this paper is committed to proposing a power theft detection model suitable for deployment in edge nodes, which can save on bandwidth and detect energy theft.

Proposed Method
In this section, we discuss the basic structure of the energy theft detection model based on edge computing and the basic process of energy theft detection.
The energy theft detection model based on edge computing is composed of users, field terminals, edge data centers, and centralized data centers. Its specific structure is shown in Figure 1, and each component is described in detail in the next section.

System Model
Users: each user has a smart meter that can connect smart devices at home to aggregate their energy consumption. Users can be roughly divided into residential users, low-voltage general industrial and commercial users, small and medium-sized special transformer users, and large-scale special transformer users. Each type of user can be divided into several categories according to their electricity consumption habits.
Field terminal (FT): this includes centralized meter reading terminals and special transformer acquisition terminals that can collect the data collected by the electric energy meter installed in the user's home, conduct a small amount of simple data processing, and monitor the operation status of the electric energy meter.
Edge data center (EDC): an edge data center is composed of sensors and processors with certain computing power. It can be deployed near the components of the distribution network and can carry out simple calculation tasks.
Centralized data center (CDC): a centralized data center has powerful computing power and a large number of computer resources. One panel is configured in medium and low-voltage distribution stations, distribution stations, and other places. After receiving the data collected by the field terminal, the edge gateway can carry out the simple calculations and then send them to the centralized data processing center.

System Flow.
The computing power of the edge data center is not as strong as that of the centralized data center, so the task of energy theft detection cannot be completed at the edge alone. However, uploading all data to the centralized data center consumes many resources. In designing our energy theft detection scheme, we fully consider the advantages and disadvantages of the edge side and the central side and design a scheme in which the edge data center and the centralized data center cooperate with each other and give full play to their respective advantages. The specific process of the scheme is shown in Figure 2.
(1) The user's historical electricity consumption data (at least one year's data) collected by smart meters are uploaded to the CDC through the EDC. In the CDC, the K-means algorithm is used to cluster the data sets according to different electricity consumption habits. A K-nearest neighbor (KNN) model can also be trained to classify the data after clustering.
(2) The clustered data are sent to different EDCs, and the model parameters of the CNN feature extractor are trained in the EDC. After training, the network parameters have good universality and persistence and do not need to be updated again in a short time.
It should be noted that the user history data are given as $x \in \mathbb{R}^{M \times N}$, where $M$ and $N$ denote the number of samples and the length of observation, respectively. The data need to be divided into a training set $x_{train}$ and a test set $x_{test}$ according to formula (1). The CNN feature extractor is trained on $x$, and the learned weights are used as the features of the input data of the classifier, recorded as $C(x)$. $C(x)$ is also segmented according to formula (1).
(3) The EDC uses $C(x)$ as the input data of the RF classifier to train the RF classifier model. The obtained model is stored locally in the EDC. When the parameters of the model need to be changed, the model is modified directly in the EDC to reduce the bandwidth pressure of data upload.
(4) To evaluate the model, the test set is used to check the performance of the whole scheme. When new user data are uploaded from the EDC, the power consumption type of the user is determined by a KNN algorithm in the nearest EDC, and the data are transmitted to the corresponding EDC. The trained CNN-RF model then determines whether the user is an energy thief.
This process includes three steps: (1) a large amount of data cleaning and preprocessing, (2) CNN feature extraction model training, and (3) RF classifier training on the extracted features. The first step, which involves a large amount of data and high computational complexity, needs to be processed at the centralized data center. In steps 2 and 3, the amount of data is small, and anomaly detection can be realized at the edge data center in less time.
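The clustering and routing steps of this flow can be sketched in a few lines. The following is a minimal illustration, assuming NumPy only: a plain Lloyd's K-means stands in for the clustering at the CDC, and a majority-vote KNN assigns a new user to a cluster. The function names (`kmeans`, `knn_route`), the synthetic load profiles, and all parameter values are hypothetical and are not taken from the paper.

```python
import numpy as np

def nearest_centroid(x, centroids):
    """Index of the closest centroid (squared Euclidean distance) per row."""
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)

def kmeans(x, k, n_iter=50, seed=0):
    """Lloyd's algorithm: returns (centroids, cluster label of each row)."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids with k distinct samples.
    centroids = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = nearest_centroid(x, centroids)
        updated = centroids.copy()
        for c in range(k):
            members = x[labels == c]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                updated[c] = members.mean(axis=0)
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return centroids, nearest_centroid(x, centroids)

def knn_route(x_hist, cluster_of_hist, x_new, k=3):
    """Majority cluster among the k nearest historical users."""
    dist = ((x_hist - x_new) ** 2).sum(axis=1)
    nearest = np.argsort(dist)[:k]
    return int(np.bincount(cluster_of_hist[nearest]).argmax())

# Synthetic stand-in data: 20 low-consumption and 20 high-consumption users,
# each described by 7 daily readings (illustrative values only).
rng = np.random.default_rng(1)
low = rng.normal(5.0, 0.5, size=(20, 7))
high = rng.normal(20.0, 0.5, size=(20, 7))
history = np.vstack([low, high])

centroids, labels = kmeans(history, k=2)   # clustering at the CDC
new_user = np.full(7, 19.0)                # a newly uploaded profile
edc_id = knn_route(history, labels, new_user)
```

In a deployment along the lines of the scheme, each cluster index would correspond to one EDC holding the CNN-RF model trained for that consumption habit.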

Analysis of Electricity Consumption Data of Users
Many countries or distribution companies record real electricity consumption daily and regularly to investigate consumers' electricity consumption behavior. To determine the difference between normal users and energy thieves, we select a dataset released by the State Grid Corporation of China (SGCC) for analysis. This dataset contains the power consumption data of 42,372 power users over 1,035 days. We randomly selected a typical normal user and a typical power-stealing user and plotted their electricity consumption data to determine the difference between them. Figure 3 shows a comparison of the daily electricity consumption data of the two kinds of users (partial dates). It is obvious that the daily electricity consumption of the power-stealing user fluctuates significantly, while the daily electricity consumption of the normal user fluctuates only slightly. If we plot the data in two dimensions by week and by month, we can also observe the difference between normal users and power-stealing users.
Using two-dimensional plots by week and the Pearson correlation coefficient, Zheng [28] found that the power consumption data of normal users have a strong correlation, but this only proved the regularity of users' power consumption with the week as a unit. We found that the user's electricity consumption data have not only weekly but also monthly correlation characteristics. Figure 4 shows the monthly electricity consumption curve of normal users, and Figure 5 shows the monthly electricity consumption curve of energy thieves. The difference is apparent. The annual electricity consumption of power-stealing users is irregular, while the electricity consumption curve of normal users is periodic. The average power consumption of the three years reaches a peak in July every year, and the power consumption of other months is low. (For the sake of fairness, the normal users and power-stealing users we choose are small and medium-sized households with an annual power consumption of less than 20,000 kWh.) If a method for analyzing one-dimensional time series data is used to analyze the user's electricity consumption data, then it is often difficult to capture the regularity of the data. Many traditional data analysis methods, such as SVM and simple artificial neural networks (ANNs), cannot be directly applied to power consumption data due to their computational complexity and limited generalization ability.
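One way to quantify the weekly regularity discussed above is to arrange a daily series into rows of seven days and average the Pearson correlation over all pairs of weeks. The sketch below assumes NumPy and uses synthetic series; the function name `mean_weekly_correlation` and the data are hypothetical, and this is not necessarily the exact procedure used in [28].

```python
import numpy as np

def mean_weekly_correlation(daily):
    """Average Pearson correlation between all pairs of weeks."""
    w = len(daily) // 7                      # number of complete weeks
    weeks = np.asarray(daily[: w * 7], dtype=float).reshape(w, 7)
    corr = np.corrcoef(weeks)                # each row is treated as one week
    upper = np.triu_indices(w, k=1)          # distinct pairs of weeks
    return float(corr[upper].mean())

rng = np.random.default_rng(0)
# A regular user repeats the same weekly pattern with small noise.
weekly_pattern = np.array([3.0, 3.0, 3.0, 3.0, 3.0, 6.0, 7.0])
regular_user = np.tile(weekly_pattern, 8) + rng.normal(0.0, 0.1, size=56)
# An irregular user has no weekly structure at all.
irregular_user = rng.uniform(0.0, 10.0, size=56)

regular_score = mean_weekly_correlation(regular_user)
irregular_score = mean_weekly_correlation(irregular_user)
```

The same function applied to rows of a month-length window would give a corresponding monthly score, under the assumption of a fixed month length.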
Some scholars [28] considered the periodicity of power consumption data in the detection, but the design structure did not fully consider the three aspects of the day, week, and month, so it was not completely accurate in the prediction. In order to improve this situation, we add the day, week, and month convolution neural network (DWMCNN) to feature extraction in the framework of edge computing. DWMCNN is described in detail in Section 5.

Convolution Neural Network (CNN).
The CNN, a kind of feed-forward neural network, serves as the feature extractor of the proposed model; its parameters are trained in the centralized data processing center and assigned to the designated edge data center.
The CNN is an artificial neural network whose design was inspired by the structure of a cat's visual nerve. The models developed from it, such as AlexNet, the visual geometry group (VGG) network, and ResNet, are widely used in image processing [29]. The architecture of a CNN is composed of many distinct layers that transform input features into output features by differentiable functions. The basic convolution process of a CNN is as follows. The convolution layer consists of a group of learnable filters (kernels) that have small receptive fields but extend through the whole depth of the input volume. During the forward pass, each filter is convolved across the width and height of the input volume, the dot product between the filter entries and the input is calculated, and a two-dimensional activation map of the filter is generated. The pooling layer is a form of nonlinear downsampling that is used to gradually reduce the spatial size of the representation and reduce the number of parameters and the amount of calculation in the network so as to control overfitting. After several convolution and max pooling layers, the high-level reasoning in the neural network is completed through the fully connected layer. The neurons in the fully connected layer are connected to all of the activations in the previous layer. The fully connected layer is used to generate the final output.
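The convolution and pooling operations described above can be illustrated with a small NumPy sketch. It assumes a single-channel input, a "valid" convolution with stride 1 (implemented, as is usual in deep learning frameworks, as cross-correlation without flipping the kernel), and non-overlapping max pooling. The function names are illustrative only.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image and take the dot product at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling; trailing rows/columns that do not fit are dropped."""
    h = fmap.shape[0] // size * size
    w = fmap.shape[1] // size * size
    blocks = fmap[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

image = np.arange(16.0).reshape(4, 4)   # toy 4 x 4 input
kernel = np.ones((3, 3))                # toy 3 x 3 filter
activation = conv2d_valid(image, kernel)  # 2 x 2 activation map
pooled = max_pool2d(activation, size=2)   # 1 x 1 after pooling
```

With the all-ones kernel, each entry of the activation map is simply the sum of the 3 x 3 window beneath it, which makes the result easy to check by hand.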

DWMCNN.
To extract the features of users' electricity consumption data more comprehensively, the traditional CNN network structure needs to be improved. The traditional LeNet-5 network structure is simple and is composed of two convolution layers, two pooling layers, and two fully connected layers. The convolution kernel is 5 × 5 with stride = 1, and the pooling layer uses max pooling. With this structure, the potential power consumption relationships cannot be extracted effectively. In this paper, uncontrollable factors such as user lifestyle, seasonal change, and user type are considered when deciding the network structure, since the characteristics of user power consumption change are diverse. A CNN feature extraction framework for the day, week, and month is designed.
As shown in Figure 6, the DWMCNN framework is composed of 1D daily load feature convolution and 2D weekly and monthly load feature convolution. We explain this in detail as follows:
(1) Daily Load Feature Extraction. Daily load feature extraction is realized by a fully connected neural network layer. It learns global knowledge from 1D power consumption data. Customer electricity consumption is essentially 1D time series data. Each neuron in the fully connected layer determines the output of the node according to the rectified linear unit (ReLU) activation function. The ReLU function is
$$f(y) = \max(0, y),$$
where $y$ is determined by
$$y_j = \sum_{i=1}^{n} w_{i,j} x_i + b,$$
where $y_j$ is the output of the $j$th neuron of the fully connected layer, $n$ is the length of the one-dimensional input data, $x_i$ is the $i$th input value, $w_{i,j}$ is the weight between the $i$th input value and the $j$th neuron, and $b$ is the bias. After calculation, the value is sent through the activation function to the connected units in the higher layer to determine its contribution to the next prediction. The input of the one-dimensional daily load data is a vector of length $d$, where $d$ is the total number of days of historical power consumption data of users.
(2) Weekly and Monthly Load Feature Extraction. Because the daily electricity consumption fluctuates in a relatively independent way, it is difficult to identify the periodicity or nonperiodicity of electricity consumption from one-dimensional electricity consumption data. If we analyze the power consumption data of several weeks together, we can easily identify abnormal power consumption. Inspired by this observation, a deep CNN component is designed that transforms the one-dimensional data into two-dimensional data, convolves the features, and combines them. For the input layer, we have two input threads: one is arranged weekly and the other is arranged monthly. After the two parallel convolution layers, the two groups of data are merged once they have the same shape. The weekly input is a two-dimensional array indexed by week and day of the week, and the monthly input is a two-dimensional array indexed by month and day of the month, where $w$ is the total number of weeks and $m$ is the total number of months in the user's historical power consumption data.
(3) Combination Extraction and Classification. The daily load features extracted by the one-dimensional component and the weekly and monthly load features extracted by two-dimensional convolution are combined: the weighted sum of their outputs is used as a hidden feature and is then passed through the fully connected layer. Traditionally, a softmax classifier is used in the last output layer of a CNN. For classification problems, the softmax function is commonly added to the output layer to obtain the category. It compresses a k-dimensional vector of arbitrary real values into a k-dimensional vector of real values in which each entry lies in the range (0, 1) and all entries add up to 1. The CNN-RF classifier drops the softmax classifier, outputs 32-dimensional features directly from the fully connected layer, and then predicts the categories.
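The three input views and the fully connected ReLU layer of the daily branch can be sketched as follows. This is a minimal NumPy illustration, not the paper's network: the row-per-week and row-per-month orientation, the fixed month length of 30 days, and the helper names `make_views` and `dense_relu` are all assumptions made for the example.

```python
import numpy as np

def make_views(daily, days_per_month=30):
    """Arrange one user's daily series into day, week, and month views.

    Assumption: weeks and months are laid out one per row, and a month is
    approximated by a fixed number of days; incomplete trailing weeks or
    months are dropped.
    """
    d = len(daily)
    w = d // 7
    m = d // days_per_month
    day_view = daily.reshape(1, d)
    week_view = daily[: w * 7].reshape(w, 7)
    month_view = daily[: m * days_per_month].reshape(m, days_per_month)
    return day_view, week_view, month_view

def dense_relu(x, weights, bias):
    """Fully connected layer: y_j = sum_i w_ij * x_i + b, then ReLU."""
    y = x @ weights + bias
    return np.maximum(0.0, y)

daily = np.arange(365.0)                 # one year of toy readings
day_view, week_view, month_view = make_views(daily)

x = np.array([1.0, -2.0, 3.0])           # toy 3-value input
weights = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [-1.0, 1.0]])        # 3 inputs -> 2 neurons
hidden = dense_relu(x, weights, bias=0.0)
```

In the toy example, the first neuron's pre-activation is negative and is clipped to zero by the ReLU, while the second passes through unchanged.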
In the definition of the RF classifier, sigm is a sigmoid function that maps the outliers to 0 and the normal values to 1. The parameter set of the RF layer includes the number of decision trees and the maximum depth of the tree, which are obtained by a grid search algorithm.
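A grid search over the number of trees and the maximum depth can be sketched with scikit-learn's `RandomForestClassifier` and `GridSearchCV`. The sketch assumes scikit-learn is available; the 32-dimensional features below are synthetic stand-ins for the CNN output $C(x)$, and the candidate grid values are illustrative rather than the ones used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Synthetic stand-in for 32-dimensional extracted features:
# label 1 = normal user, label 0 = abnormal user.
normal = rng.normal(1.0, 0.3, size=(60, 32))
abnormal = rng.normal(-1.0, 0.3, size=(60, 32))
features = np.vstack([normal, abnormal])
labels = np.array([1] * 60 + [0] * 60)

# Candidate values are hypothetical; a real search would use a wider grid.
param_grid = {"n_estimators": [10, 30], "max_depth": [3, 6]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3)
search.fit(features, labels)

best_rf = search.best_estimator_       # refitted on all data by default
train_accuracy = best_rf.score(features, labels)
```

Because the search only involves the small feature vectors, it is the kind of step that could plausibly run on an edge node rather than at the centralized data center.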
To train the neural network, we define the loss function and optimizer to adjust the weights. In the neural network framework, we use categorical cross-entropy as the loss function and stochastic gradient descent as the optimizer. The cross-entropy of distributions $u$ and $v$ on a given discrete set is defined as follows:
$$H(u, v) = -\sum_{x} u(x) \log v(x).$$
Stochastic gradient descent (SGD) is an iterative method for optimizing a differentiable objective function and is a stochastic approximation of gradient descent optimization. The basic idea is to obtain a "gradient" from randomly selected data $(x_i, y_i)$ and use it to update the weight $W$ via
$$W_{t+1} = W_t - \eta \nabla_W L(W_t; x_i, y_i),$$
where $\eta$ is the learning rate and $L$ is the loss function.
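The cross-entropy loss and a single SGD update can be made concrete with a short NumPy sketch. A one-layer logistic model is used here as a stand-in for the full network, purely so that the gradient has a closed form; the function names and the numbers are illustrative assumptions.

```python
import numpy as np

def cross_entropy(u, v, eps=1e-12):
    """H(u, v) = -sum_x u(x) log v(x) for two discrete distributions."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(-np.sum(u * np.log(v + eps)))  # eps guards against log(0)

def sgd_step(w, x_i, y_i, lr):
    """One SGD update of a logistic model on a single sample (x_i, y_i)."""
    p = 1.0 / (1.0 + np.exp(-(w @ x_i)))   # predicted probability of class 1
    grad = (p - y_i) * x_i                 # gradient of the cross-entropy loss
    return w - lr * grad                   # step against the gradient

# True label is class 0 (one-hot), prediction is maximally uncertain.
loss = cross_entropy([1.0, 0.0], [0.5, 0.5])

w0 = np.zeros(2)
w1 = sgd_step(w0, x_i=np.array([1.0, 2.0]), y_i=1.0, lr=0.1)
```

Starting from zero weights, the model predicts 0.5, so the update moves the weights in the direction of the positive sample's features, scaled by the learning rate.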

(4) Technique of Selecting Parameters and Avoiding Overfitting. Table 1 summarizes the detailed parameters of the proposed DWMCNN structure, including the number of filters in each layer, filter size, and step size. Some units are randomly dropped from the neural network during training, which prevents these units from co-adapting too much and makes each neuron independent of the presence of other specific neurons. The application of appropriate training methods can also help to reduce overtraining. In each iteration, a penalty term on the weights is also applied. We also use binary cross entropy as the loss function. Finally, a grid search algorithm is used to optimize RF classifier parameters such as the number of decision trees and the maximum number of features.

Implementation
To evaluate the performance of the proposed energy theft detection scheme in a setting that is more realistic than that of other energy theft detection schemes, the algorithm is implemented in Python 3.7; the CNN is implemented with the TensorFlow and Keras frameworks, and the interface between the RF and the CNN is implemented with the Scikit-learn module. The energy usage data come from SGCC.

Data Preprocess.
In SGCC data, due to various external factors, such as smart meter failure, unreliable measurement data, and unplanned system maintenance, error or null data inevitably appear in the dataset. Because the amount of data that cannot be read by the system for any reason will directly affect the effectiveness of the model, the preprocessing stage of the dataset is very important for the system.

Data Selection.
In this study, data similar to the actual situation should be selected. Most of the electricity consumption data from 2014 and 2015 contain NaN and zero values, while the data from 2016 are more complete and contain fewer NaN values. Table 2 shows that in 2016, there were 169 users with 100-200 NaN data points and 132 users with more than 200 NaN data points. The number of users without any NaN or zero data points was 30,341.
To keep as much data as possible, it is necessary to approximate the NaN values. First, according to the 3-sigma principle, the statistical error values due to meter failure and other reasons are removed. Then, according to equation (11), the NaN and zero data in the daily power consumption of customers are eliminated.
where i is user i, x i,t is the daily power consumption data of the user on day t, avg(x i,t ) is the historical average daily power consumption of the user, σ(x i,t ) is the standard deviation of the user's historical daily electricity consumption, and θ is an artificially set deviation threshold.
where $i$ is user $i$, $x_{i,t}$ is the daily power consumption data of the user on day $t$, and $x_{i,t-1}$ and $x_{i,t+1}$ represent the daily power consumption data of the user on the days before and after day $t$, respectively. If $x_{i,t}$ is empty, then it is represented as NaN, which means that a missing value was uploaded by the smart meter.
To speed up the gradient descent in finding the optimal solution and to improve the accuracy, it is necessary to normalize the power consumption data. We choose the max-min scaling method to normalize the data according to the following equation:
$$f(x_{i,t}) = \frac{x_{i,t} - \min(x_i)}{\max(x_i) - \min(x_i)},$$
where $\min(x_i)$ and $\max(x_i)$ are the minimum and maximum daily power consumption of user $i$.
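The preprocessing chain (outlier handling by the 3-sigma principle, filling of missing values from neighboring days, and max-min scaling) can be sketched as below. Since the exact equations are not reproduced here, the rules in this sketch are assumptions: outliers are clipped to the upper bound rather than removed, an isolated NaN is replaced by the mean of its two neighbors, and any other NaN is set to zero. All function names are hypothetical.

```python
import numpy as np

def clip_outliers(x, theta=3.0):
    """Clip values above avg + theta * std (assumed 3-sigma rule); NaN is kept."""
    upper = np.nanmean(x) + theta * np.nanstd(x)
    return np.minimum(x, upper)

def fill_missing(x):
    """Assumed rule: an isolated NaN becomes the mean of its two neighbours,
    any other NaN (at the border or next to another NaN) becomes 0."""
    x = np.asarray(x, dtype=float)
    out = x.copy()
    for t in np.flatnonzero(np.isnan(x)):
        inside = 0 < t < len(x) - 1
        if inside and not np.isnan(x[t - 1]) and not np.isnan(x[t + 1]):
            out[t] = (x[t - 1] + x[t + 1]) / 2.0
        else:
            out[t] = 0.0
    return out

def min_max_scale(x):
    """Scale a series to [0, 1]; a constant series maps to all zeros."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

filled = fill_missing([1.0, np.nan, 3.0, np.nan, np.nan, 2.0])
spiky = np.array([1.0] * 20 + [50.0])     # one obvious meter error
clipped = clip_outliers(spiky)
scaled = min_max_scale(np.array([2.0, 4.0, 6.0]))
```

The order follows the text: outliers are treated first (using NaN-aware statistics so that missing values do not disturb the mean and standard deviation), then gaps are filled, and finally each user's series is scaled.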

Evaluation Method.
Because it is often expensive to verify the identification of abnormal users, it is very important to predict abnormal users accurately. The confusion matrix is a basic tool for evaluating the performance of classifiers, as shown in Figure 7. TP indicates that a user predicted as normal is actually a normal user, and TN indicates that a user predicted as abnormal is actually an abnormal user. The higher the TP and TN are, the better the detection effect. FP denotes a user predicted as normal who is actually abnormal, and FN denotes a user predicted as abnormal who is actually normal.
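According to the confusion matrix, several evaluation indexes can be derived, including precision (PR), recall (RE), the F1 score, the true positive rate (TPR), and the false positive rate (FPR):
$$\mathrm{PR} = \frac{TP}{TP + FP}, \quad \mathrm{RE} = \frac{TP}{TP + FN}, \quad F_1 = \frac{2 \cdot \mathrm{PR} \cdot \mathrm{RE}}{\mathrm{PR} + \mathrm{RE}},$$
$$\mathrm{TPR} = \frac{TP}{TP + FN}, \quad \mathrm{FPR} = \frac{FP}{FP + TN}.$$
TPR is the proportion of actual normal users that the detection model predicts as normal, and FPR is the proportion of actual abnormal users that the detection model predicts as normal.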
The area under the receiver operating characteristic curve (AUC): by changing the threshold, a receiver operating characteristic (ROC) curve is drawn by plotting TPR against FPR. The higher the AUC, the better the model can distinguish between abnormal and normal users [30]. When the AUC is 0.5, the model has no class separation ability.
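The evaluation indexes above follow directly from the four confusion matrix counts, and the AUC can be approximated from ROC points with the trapezoidal rule. The sketch below uses plain Python; the counts are made-up numbers for illustration and are not results from the paper.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Precision, recall (= TPR), F1 score, and FPR from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}

def auc_trapezoid(fpr_points, tpr_points):
    """Area under an ROC curve given points sorted by increasing FPR."""
    area = 0.0
    for i in range(1, len(fpr_points)):
        width = fpr_points[i] - fpr_points[i - 1]
        height = (tpr_points[i] + tpr_points[i - 1]) / 2.0
        area += width * height
    return area

# Hypothetical counts: 100 actual normal users and 100 actual abnormal users.
metrics = confusion_metrics(tp=80, fp=10, tn=90, fn=20)

chance_auc = auc_trapezoid([0.0, 1.0], [0.0, 1.0])             # diagonal ROC
perfect_auc = auc_trapezoid([0.0, 0.0, 1.0], [0.0, 1.0, 1.0])  # ideal ROC
```

The diagonal ROC curve gives an area of 0.5, which matches the statement that a model with this AUC has no class separation ability.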

Method Comparison.
To evaluate the accuracy of the day, week, and month convolution neural network-random forest (DWMCNN-RF), non-deep-learning methods including SVM, RF, gradient boosting decision tree (GBDT), and logistic regression (LR) were used to carry out comparative experiments. In addition, we also compared the classification results of various supervised classifiers: CNN feature extraction with an SVM classifier (CNN-SVM) and CNN feature extraction with a GBDT classifier (CNN-GBDT).
These methods were compared with the results of previous classification work. The following six methods were introduced and the results were analyzed: LR: the basic model in binary classification, which is equivalent to a neural network with a sigmoid activation function. Any value greater than 0.5 is classified as normal mode, and any value less than 0.5 is classified as abnormal mode.   Table 3.
Class 0 is the abnormal user class, and Class 1 is the normal user class. The ROC curve of the DWMCNN-RF model was drawn as shown in Figure 8. The AUC value was 0.988, which was much better than that of the baseline model (AUC = 0.5). This shows that the algorithm can classify these two classes accurately. The parameters of the comparison methods are summarized in Table 4, and the experiments were carried out accordingly.
The results of different methods are shown in Figure 8. Figure 10 shows the results of all comparative experiments in terms of accuracy, recall, and F1 score. Among the eight different detection algorithms, deep learning (including the improved CNN network structure and the ordinary CNN network structure) outperforms classical machine learning (such as LR, GBDT, and SVM). Among the deep learning methods, the algorithms that use the network structure proposed in this paper outperform those that do not (comparison of DWMCNN-RF with CNN-RF and CNN). Among the algorithms using the network structure in this paper, RF is the best classifier (comparison of DWMCNN-RF with DWMCNN-SVM and DWMCNN-GBDT). The reason for the above results is that, compared with the classical machine learning method, deep learning does not need feature engineering. Classical machine learning algorithms usually require complex feature engineering. First, deep exploratory data analysis is performed on the dataset, and then a simple dimensionality reduction process is performed. Finally, the best features must be carefully selected to pass on to the machine learning algorithm. When using deep networks, we do not need to do this because we can usually achieve good performance by simply passing data directly to the network. Among the deep learning networks, the DWMCNN is better than a CNN because the DWMCNN captures the periodic characteristics of the daily, weekly, and monthly power data and can extract features more effectively.
Figure 8: ROC curve for the paper model.
In addition, to further demonstrate the classification performance of the proposed method, confusion matrix heat maps of the proposed method and the ordinary CNN structural feature extraction method are shown in Figure 11. The heat maps show that, without the improved structure, the ordinary CNN method easily misclassifies normal data as abnormal data and is not robust to normal load changes. In the selection of classifiers, the RF classifier combines best with the model proposed in this paper and is suitable for large-scale training samples and high-dimensional feature data.

Conclusion
Based on a power consumption information acquisition system, this paper proposed an energy theft detection method for an edge data center. This method includes clustering and CNN training at a centralized data center, feature extraction based on a CNN feature extractor at an edge data center, and RF algorithm training based on the extracted features. The advantages of this method were demonstrated in the experiments as follows:
(1) Using K-means clustering technology can greatly shorten the computing time and realize distributed data processing. Compared with the traditional method of processing power consumption data at a centralized data center, this method has the advantages of fast calculation speed, less bandwidth occupation, and good privacy protection.
(2) Compared with the principal components analysis (PCA) based feature extraction method, the improved CNN network feature extraction method proposed in this paper can effectively find the periodicity of the data, which is consistent with the daily, weekly, and monthly variation characteristics of power consumption data. Compared with other traditional classifiers, the DWMCNN-RF combination model has higher accuracy and better robustness and can effectively realize energy theft detection.
In future research, we will continue to improve the monitoring function under this framework. User load forecasting and anomaly recognition based on edge computing is an important direction of framework development.

Data Availability
The dataset released by the State Grid Corporation of China (SGCC) contains the electricity consumption data of 42,372 electricity customers within 1,035 days (https://www.sgcc.com.cn/).

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Mathematical Problems in Engineering