A Comprehensive Analysis of Supervised Learning Techniques for Electricity Theft Detection

There are many methods and algorithms applicable to detecting electricity theft. However, comparative studies on supervised learning methods for electricity theft detection are still insufficient. In this paper, several supervised learning methods, namely decision tree (DT), artificial neural network (ANN), deep artificial neural network (DANN), and AdaBoost, are compared on predictive accuracy, recall, precision, AUC, and F1-score, and their performances are analyzed. A public dataset from the State Grid Corporation of China (SGCC) was used for this study. The dataset consists of power consumption readings in kWh. Based on the analysis results, DANN outperforms the other supervised learning classifiers (ANN, AdaBoost, and DT) in recall, F1-score, and AUC. As a future research direction, experiments can be performed on other supervised learning algorithms with different types of datasets, and suitable preprocessing methods can be applied to produce better performance.


Introduction
Electricity loss can be defined as the difference between the energy source that has been injected and the energy that has been delivered to consumers. In a power system, electricity losses generally occur in the processes of generating, transmitting, and distributing electrical energy [1]. Electricity loss can be classified into two categories, namely, technical loss (TL) and nontechnical loss (NTL). TLs involve the component of the electrical system [1,2] whereas NTL is more related to electricity theft that is caused by tampering of the meter reading device, hacking the electricity meter, stealing (illegal connections), and more [3].
In 2015, Northeast Group reported that the total cost of NTL worldwide was US$89.3 billion per year. Countries such as India, Brazil, and Russia lost US$16.2 billion, US$10.5 billion, and US$5.1 billion, respectively [4]. NTL can have a negative impact on power companies, which reduces future investments [5]. From this reason, numerous problems. Machine learning (ML) can be broadly classified into three categories: supervised learning, unsupervised learning, and reinforcement learning [31].
Supervised learning [31] is the most frequently used ML approach for electricity theft detection, with methods such as support vector machine (SVM) [27,28], ANN [32], deep bidirectional recurrent neural network [33], self-attention [34], wide and deep convolutional neural network [29], and more. Supervised learning uses input data and labels to train a model, which then generates predictions for new data. Supervised learning methods have been successfully applied to help reduce the cost of site inspections [1].
This article compares and analyzes the predictive accuracy, precision, recall, F1-score, and AUC for four classifiers based on the dataset obtained from the State Grid Corporation of China (SGCC). The classifiers that are compared are decision tree (DT), ANN, deep artificial neural network (DANN), and AdaBoost. The rest of the paper is organized as follows: Section 2 reviews several papers that investigated electricity theft detection issues by means of supervised learning algorithms, and Section 3 briefly explains the supervised learning algorithms used in the research. In Section 4, the dataset is described, while in Section 5, the preprocessing methods of the dataset are explained. Next, Section 6 presents the evaluation metrics. Sections 7 and 8 provide the experimental results and comparative analysis, respectively, for all the comparisons. Finally, the study is concluded in Section 9.

Related Works
Decision tree (DT) is a supervised learning method that is commonly applied in electricity theft detection [44]. The authors in [44] proposed DT in conjunction with SVM classifiers and compared it against fuzzy classification, rough sets, ANN, SVM, and fuzzy logic coupled with SVM. The results showed that the method achieved an accuracy of 92.5% and a false positive rate of 5.12%. This detector proved its effectiveness in real scenarios. In a different work, AdaBoost associated with SVM (AdaBoost-SVM) was proposed and could efficiently detect electricity theft [45]. AdaBoost-SVM was compared with four conventional ML techniques and two ensemble learning techniques. It was found that the suggested approach performed significantly better on the imbalanced dataset [45].
Besides, ANN can be utilized in detecting electricity theft [46,47]. Nevertheless, a majority of previous approaches are found to be less accurate in detecting electricity theft. It is found that extracting artificial features based on domain knowledge is necessary [29]. Recently, researchers in [29] proposed a wide and deep convolutional neural network (CNN) model. The study aimed to examine electricity consumption data and identify electricity theft offenders.
The Wide and Deep CNN model included a wide component of a fully-connected layer of neural networks and a deep CNN component with multiple convolutional layers, a fully-connected layer, and a pooling layer. This model integrated the benefits of the wide and deep CNN components, which give rise to its useful implementation and good performance in electricity theft detection.
SVM is a supervised machine learning algorithm that is well suited to classification (including nonseparable classes) and regression tasks. The concept of SVM is that a hyperplane, defined by the support vectors, separates the classes [48]. Nonseparable classes are handled by mapping the data from a lower-dimensional space into a higher-dimensional space by means of the kernel trick. SVM with a linear kernel for classification is known as a linear SVC (support vector classifier), which improves speed on large datasets [49]. NTL detection based on SVM is found in several studies in the literature [28,50]. Due to the parameter tuning problem in SVM, which increases implementation time, SVM has been combined with the genetic algorithm (GA), DT, social spider optimization, and fuzzy logic to improve the classification performance of the model [44,51,52,53].
Generally, electricity theft datasets are class-imbalanced, whereby anomalous consumers (thieves) are far fewer than normal consumers [35]. The problem of imbalanced datasets is well known in the literature. The effect of an imbalanced dataset is that the ML model learns to categorize the most common classes and does not learn the less common ones [36]. Therefore, the minority classes go undetected, as the confusion matrix shows, because the model learns to perform well only on the most common classes [36]. As a result, the ML model becomes worthless for solving the problem.
Recently, the study in [54] compared several machine learning methods with and without an imbalanced data handling technique on the SGCC dataset. The results showed that SVM and ANN yielded high accuracies even though the imbalanced data handling technique was not applied. In addition, many classifiers have been used without an imbalanced data handling technique for solving the electricity theft problem, including ANN, Naive Bayes, logistic regression (LR), linear discriminant analysis, quadratic discriminant analysis, random forest (RF), SVM, DT, K-nearest neighbor (KNN), stochastic gradient descent, AdaBoost, CatBoost, LightGBM, and XGBoost [38].
It can be clearly seen that some machine learning models can be applied without an imbalanced data handling technique for solving the classification task. Moreover, no research focused on comparing machine learning classifiers with DT, ANN, DANN, and AdaBoost. With these motivations, this paper aims to contribute to the literature by comparing a few supervised learning algorithms to analyze the best performance of the comparative methods without applying any imbalanced data handling techniques.

Supervised Learning Algorithms
This section describes and presents the concept and equation of each supervised learning algorithm for electricity theft detection. Figure 1 shows the standard steps in supervised learning algorithms, in which the ML algorithm uses training data, feature vectors, and label data as input to produce a predictive model before utilizing new data to provide the expected label as output.

Decision Tree (DT).
Decision Tree (DT) is defined as a supervised learning method that is able to solve problems pertaining to regression and classification [55]. DT [31] categorizes instances according to attribute values. In the DT algorithm, a tree consists of nodes and branches, whereby every node symbolizes a feature of an instance to be categorized and every branch represents a value that the node can assume. A simple representation of DT is shown in Figure 2.
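The node-and-branch structure described above can be illustrated with a minimal sketch. The tree below is hand-written for illustration only: the feature names (`mean_kwh`, `std_kwh`), the thresholds, and the labels are hypothetical and are not learned from the SGCC data.

```python
# Hypothetical toy tree: every internal node tests one feature, and each
# branch ("below" / "above") corresponds to a range of values of that feature.
TOY_TREE = {
    "feature": "mean_kwh",
    "threshold": 0.2,
    "below": {"label": "anomaly"},
    "above": {
        "feature": "std_kwh",
        "threshold": 0.05,
        "below": {"label": "anomaly"},
        "above": {"label": "normal"},
    },
}


def classify(tree, instance):
    """Walk from the root to a leaf, following the branch that matches the instance."""
    node = tree
    while "label" not in node:
        value = instance[node["feature"]]
        node = node["below"] if value < node["threshold"] else node["above"]
    return node["label"]


example = {"mean_kwh": 0.6, "std_kwh": 0.10}
prediction = classify(TOY_TREE, example)
print(prediction)
```

In practice the splits would be learned from labeled training data rather than written by hand.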

Artificial Neural Network (ANN).
The architecture of ANN consists of the input layer (one layer), hidden layers (one or more layers), and output layer (one layer) [48] as shown in Figure 3. ANN is also known as multilayer perceptron or multilayer feed-forward neural network. This algorithm is inspired by the interconnections in the human brain. In ANN, the inputs correspond to dendrites, which in the human brain receive electrochemical signals generated by neurons and then send them to the cell body.
Each input has a weight and carries signals to a specific hidden layer. Usually, a neuron is driven via an activation function called the sigmoid function. Among the various activation functions, such as the step function, Gaussian function, ramp function, and linear function, it is noted that the hyperbolic tangent function can be applied as well [57]. The last layer in ANN corresponds to the axon, which extends to the synapse and connects two distinct neurons. Generally, the simple construction of ANN contains a hidden layer, two inputs, and a single output. An epoch of a neural network (NN) is one complete pass of the training data through the network, forward from input to output and backward again. The best number of epochs depends on the tolerable error in the training of the NN. The equation of the ANN output is as indicated in the following equation:
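A common textbook form for the output of a single neuron is sketched here. The symbols (inputs $x_i$, weights $w_i$, bias $b$, activation function $f$, number of inputs $n$) are assumptions of this sketch, and the authors' own notation may differ:

$$y = f\left(\sum_{i=1}^{n} w_i x_i + b\right), \qquad f(z) = \frac{1}{1 + e^{-z}}$$

Here $f$ is taken to be the sigmoid function mentioned above; replacing it with the hyperbolic tangent changes only the definition of $f$.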

Deep Artificial Neural Network (DANN).
An ANN that has two or more hidden layers is considered a deep neural network (DNN) [58,59]. Deep learning (DL), also known as DNN, uses NN algorithms together with vast computing power and very large amounts of data to capture high-level information from the raw input data through successive layers. Another name for DL and DNN is deep artificial neural network (DANN). Each layer of a DANN is able to recognize attributes of various forms. Each layer does so by learning how information from the preceding layer is combined to create distinguishing features. The architecture of DANN is shown in Figure 4.
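How each layer builds on the output of the preceding one can be shown with a minimal forward-pass sketch. The layer sizes, weights, and biases below are hypothetical, and the sigmoid activation on every layer is an assumption, not the configuration used in the study.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum per neuron, then the sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Feed the activations of each layer into the next one."""
    activations = inputs
    for weights, biases in layers:
        activations = dense(activations, weights, biases)
    return activations


# Hypothetical parameters: 2 inputs -> 3 hidden -> 2 hidden -> 1 output.
# Two hidden layers make this a "deep" network in the sense used above.
LAYERS = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[0.2, -0.5, 0.3], [0.7, 0.1, -0.4]], [0.05, -0.05]),
    ([[1.0, -1.0]], [0.0]),
]

output = forward([0.3, 0.9], LAYERS)
print(output)
```

Training (adjusting the weights by backpropagation) is deliberately left out; the sketch only shows how information flows through the layers.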

AdaBoost.
AdaBoost is defined as an ensemble learning approach that has been introduced by Freund and Schapire [60]. It has obtained tremendous success in classification.
AdaBoost is noted to be less prone to overfitting than many learning methods in a majority of prediction problems. The approach trains weak learners using a set of weights maintained over the training dataset. Then, it adaptively modifies these weights after every weak learning cycle. In the training dataset, the weights of instances that are incorrectly classified by the current weak learner tend to increase, while the weights of instances that are correctly classified tend to decrease [61]. An example of AdaBoost is shown in Figure 5.
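The reweighting step can be sketched as follows. This uses the classic discrete AdaBoost update for binary classification, which is an assumption: the text does not state which AdaBoost variant was used, and the function name `adaboost_reweight` is hypothetical.

```python
import math


def adaboost_reweight(weights, correct):
    """One round of the classic discrete AdaBoost weight update.

    weights: current instance weights, summing to 1
    correct: booleans, True where the weak learner was right
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0.0 < error < 0.5:
        raise ValueError("weak learner error must lie strictly between 0 and 0.5")
    alpha = 0.5 * math.log((1.0 - error) / error)  # weight of this weak learner
    updated = [
        w * math.exp(-alpha if ok else alpha)  # shrink if correct, grow if wrong
        for w, ok in zip(weights, correct)
    ]
    total = sum(updated)
    return [w / total for w in updated], alpha


weights = [0.25, 0.25, 0.25, 0.25]
new_weights, alpha = adaboost_reweight(weights, [True, True, True, False])
print(new_weights, alpha)
```

After the update, the single misclassified instance carries half of the total weight, so the next weak learner is pushed to classify it correctly.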

Dataset Description
The dataset was made available by the State Grid Corporation of China (SGCC) [29]. It contained the electricity consumption data of 42,372 consumers, of whom 91.47% (38,757) were normal consumers and 8.53% (3,615) had abnormal consumption patterns that could be suspected of electricity theft. The dataset was collected from 1 January 2014 until 31 October 2016 (1,035 days). To obtain an even 148 weeks, one more day of data was added. Table 1 displays the description of the data.
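The figures quoted above are mutually consistent, which a short check confirms. The sketch below only recomputes the stated numbers; the variable names are illustrative.

```python
from datetime import date

start, end = date(2014, 1, 1), date(2016, 10, 31)
days = (end - start).days + 1  # inclusive of both the first and the last day
padded_days = days + 1  # one extra day, as described in the text
weeks, remainder = divmod(padded_days, 7)

normal, abnormal = 38757, 3615
total = normal + abnormal
normal_pct = round(100 * normal / total, 2)
abnormal_pct = round(100 * abnormal / total, 2)

print(days, weeks, remainder, total, normal_pct, abnormal_pct)
```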

Dataset Preprocessing
The dataset contained some missing data that were represented as not a number (NaN). The missing values were handled by imputing each NaN cell with the mean of its row, that is, the mean consumption of the corresponding consumer. The dataset was then normalized by using the MinMax scaler, since the different types of consumers increased the diversity of the data. The scaling is defined as follows:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)} \tag{2}$$

where $x$ is an original consumption value, $\min(x)$ and $\max(x)$ are the minimum and maximum of the values being scaled, and $x'$ is the scaled value in the range [0, 1]. Scaling the data is important since some algorithms, especially neural networks, are sensitive to differences in scale.
The dataset was then divided into training and testing sets with several different splitting ratios to observe which ratio gave the best result in predicting anomaly users. The ratios used were 90/10, 80/20, 70/30, and 60/40.
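The preprocessing steps described in this section can be sketched in plain Python. Several details are assumptions of this sketch rather than the study's implementation: scaling is applied per consumer (row), whereas a library scaler may work per column; rows with no observed values fall back to 0.0; and the split is a simple seeded shuffle without stratification. All function names are hypothetical.

```python
import math
import random


def impute_row_mean(row):
    """Replace each NaN with the mean of the observed values in the same row."""
    observed = [v for v in row if not math.isnan(v)]
    mean = sum(observed) / len(observed) if observed else 0.0  # assumed fallback
    return [mean if math.isnan(v) else v for v in row]


def min_max_scale(values):
    """Map values linearly onto [0, 1]; a constant sequence maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def split(rows, labels, train_fraction, seed=0):
    """Shuffle indices with a fixed seed and cut at the requested fraction."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * len(indices)))
    train, test = indices[:cut], indices[cut:]
    return (
        [rows[i] for i in train], [labels[i] for i in train],
        [rows[i] for i in test], [labels[i] for i in test],
    )


raw = [[1.0, float("nan"), 3.0]] * 10  # ten toy consumers, three days each
labels = [0] * 9 + [1]
clean = [min_max_scale(impute_row_mean(row)) for row in raw]

for fraction in (0.9, 0.8, 0.7, 0.6):  # the four ratios used in the text
    x_train, y_train, x_test, y_test = split(clean, labels, fraction)
    print(fraction, len(x_train), len(x_test))
```

With a class imbalance as strong as the one in this dataset, a stratified split would keep the proportion of anomalies equal in both sets; whether the study stratified its splits is not stated.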

Evaluation Metrics
The dataset used in this study was imbalanced, in that the number of normal consumers differed greatly from the number of anomalous consumers. A classifier trained on an imbalanced dataset tends to be biased, regarding real electricity thieves as normal customers. A simple accuracy metric was therefore deemed unreliable for further evaluation. For that reason, numerous evaluation measures were taken into consideration in the current research. The values of each performance indicator were derived from the confusion matrix. An example of the confusion matrix is shown in Table 2.
The normal consumer is represented by the negative class, while the anomaly consumer is represented by the positive class. The information derived from the confusion matrix is as follows:
(i) TP: anomaly consumer accurately predicted as anomaly
(ii) TN: normal consumer accurately predicted as normal
(iii) FP: normal consumer predicted as anomaly
(iv) FN: anomaly consumer predicted as normal
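The four counts can be obtained directly from the true and predicted labels. The sketch below assumes labels encoded as 1 for anomaly (positive) and 0 for normal (negative); the label vectors are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, and FN for a binary classification problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # anomaly predicted as anomaly
            else:
                fp += 1  # normal predicted as anomaly
        else:
            if actual == positive:
                fn += 1  # anomaly predicted as normal
            else:
                tn += 1  # normal predicted as normal
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


y_true = [1, 0, 0, 1, 0, 0, 1, 0]  # illustrative labels only
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)
```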

Figure 3: Architecture of artificial neural network (ANN), with input, hidden, and output layers (adopted from [56]).

The results were evaluated using accuracy, precision, recall, F1-score, and AUC (area under the ROC curve). Accuracy refers to the number of instances correctly categorized by the classifier divided by the total number of instances. The calculation is presented as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$

where TP, FP, TN, and FN indicate True Positives, False Positives, True Negatives, and False Negatives, respectively [48]. The value of TN refers to the normal consumers, who made up a very large share of this dataset. Therefore, the value for accuracy would also be high. A classifier could largely ignore the anomaly class and still score well, because misclassifying the minority class barely affects the accuracy metric. Because SGCC was an imbalanced dataset, relying on accuracy alone was not suitable for this kind of problem.
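The weakness of accuracy on this dataset can be made concrete with the class counts reported in the dataset description. The sketch below evaluates a trivial classifier that labels every consumer as normal; applying it to the full dataset rather than to a test split is a simplification.

```python
NORMAL, ANOMALY = 38757, 3615  # class counts reported for the SGCC dataset

# A trivial classifier that predicts "normal" for every consumer:
tp, fp = 0, 0  # nothing is ever predicted as anomaly
tn, fn = NORMAL, ANOMALY  # all normals are right, all anomalies are missed

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
print(f"accuracy={accuracy:.4f}, recall={recall:.4f}")
```

Such a classifier reaches about 91.47% accuracy while detecting no theft at all, which is why accuracy has to be read alongside the other metrics.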
Owing to that, the evaluation metrics suitable for imbalanced datasets were F1-score and AUC. AUC summarizes the trade-off between the TP rate and the FP rate, with values ranging from 0 to 1. This kind of measurement determines how well a classifier distinguishes the classes. A classifier was considered good if its ROC-AUC value was close to 1. For instance, if AUC was equal to 1, the classes were correctly classified by the classifier, whereas if AUC was equal to 0.5, the classifier performed no better than random prediction [36].
AUC denotes the ability of the model to separate the normal and anomaly classes. The value ranges from 0 to 1. A value near 1 indicated a high measure of separability, whereas a value of about 0.5 or lower signified that the classifier was unable to distinguish between the two classes and was effectively guessing at random.
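One way to compute AUC without drawing the ROC curve is the rank-based view: AUC equals the probability that a randomly chosen anomaly receives a higher score than a randomly chosen normal consumer, with ties counted as one half. The sketch below is a direct pairwise implementation of that definition, not the routine used in the study; it is quadratic in the number of samples and intended only for illustration.

```python
def roc_auc(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == positive]
    neg = [s for y, s in zip(y_true, scores) if y != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]  # illustrative labels and scores only
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)
```

A classifier that gives every consumer the same score yields 0.5 under this definition, which matches the random-guessing interpretation above.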
Precision measures the proportion of actual positives among all predicted positives. In this case, it denotes the proportion of correctly identified anomaly consumers out of all predicted anomaly consumers. High precision indicates a low FP rate. Recall measures the proportion of correctly predicted positives among all actual positives. For this study, it signifies the percentage of correctly identified anomaly consumers out of all actual anomaly consumers. High recall indicates a low false negative rate. F1-score is favored when the data are unevenly distributed, because it balances the precision and recall values. As a classifier trained on highly skewed data tends to be biased toward the majority class, evaluating the F1-score is more reliable than using recall or precision individually.
Precision in equation (4) refers to the ratio of the correctly categorized positive class (TP) to all predicted positives (TP + FP). A high precision value indicates a low FP rate. In equation (5), recall, also known as sensitivity or True Positive Rate (TPR), refers to the ratio of the correctly classified positive class (TP) to all observations in the actual positive class (TP + FN). These metrics help to identify the number of instances that are correctly categorized:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{4}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{5}$$

F1-score (F-measure) is more suitable for imbalanced class distribution; it is the weighted (harmonic) average of recall and precision [36], as shown in equation (6). Its value ranges from 0 (the worst) to 1 (the best) [62]. If the classes are very imbalanced, it is suggested to observe both recall and precision. F1-score merges the two measures into a more appropriate evaluation metric for a dataset of this type [36]:

$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{6}$$
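Precision, recall, and F1-score follow directly from the confusion-matrix counts. The sketch below uses the standard definitions; the counts in the example are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1-score from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # assumed zero-division rule
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical counts: 30 thieves caught, 20 false alarms, 70 thieves missed.
precision, recall, f1 = precision_recall_f1(tp=30, fp=20, fn=70)
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, so a classifier cannot score well by excelling at only one of the two.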

Experimental Results
This section presents the results for all comparison methods. Two sets were derived from the dataset, namely the training set and the testing set, as mentioned in Section 5. Different ratios for each algorithm gave different results of recall, accuracy, AUC, precision, and F1-score. AUC and F1-score were used because the dataset contained imbalanced classes. Table 3 presents the evaluation results for all comparison methods. The best results are highlighted in bold in the table. Based on the results, ANN showed the highest average accuracy with 92.54%, followed by DANN (92.31%), AdaBoost (91.75%), and DT (91.39%). Generally, DT and DANN achieved their best accuracy at the 70/30 splitting percentage with 91.77% and 93.04%, respectively. Besides, ANN and AdaBoost achieved their best accuracy at ratios of 60/40 and 90/10, respectively.
All classifiers at all ratios achieved more than 0.5 in the AUC evaluation. It can be concluded that these classifiers were applicable in performing classification tasks. In the 70/30 splitting percentage, DANN outperformed DT and AdaBoost in terms of AUC when it achieved 0.7310 as compared to 0.5149 and 0.5418, respectively. Another evaluation metric was the F1-score, which was related to AUC. The best results of AUC could also provide the best results of the F1-score. Based on Table 3 at the 70/30 splitting percentage, it can be clearly seen that while DT, ANN, DANN, and AdaBoost had their highest AUC scores with values of 0.5149, 0.7029, 0.7130, and 0.5418, the F1-score also yielded its highest results with 6.10%, 49.50%, 52.44%, and 15.45%, respectively.
To evaluate how reliably the anomaly class was predicted during classification, precision was used in this study. Based on the experimental results, three classifiers, namely, ANN, DANN, and AdaBoost, achieved their best precision at the 90/10 splitting percentage with 79.03%, 65.71%, and 63.64%, respectively. DT showed its highest precision at the 70/30 splitting percentage with 53.97%. The highest average precision was achieved by ANN with 64.05%.
As for the average recall, DANN outperformed the other comparison methods with 40.94%, followed by ANN (35.49%), AdaBoost (7.57%), and DT (2.87%). ANN and DANN both produced their highest recall at the 60/40 ratio with 50.71% and 61.03%, respectively. DT achieved its highest recall (4.37%) when the training and testing split was 80/20. In contrast, AdaBoost achieved its best recall at the 70/30 splitting percentage.
Even though ANN produced the highest average accuracy and average precision with 92.54% and 64.05%, respectively, DANN obtained the highest result for the three other evaluation metrics, that is, recall, F1-score, and AUC, with values of 40.94%, 45.83%, and 0.69, respectively. In conclusion, the ratio of training to testing data, also known as the splitting percentage, played a significant part in providing the best result.

Comparative Analysis and Discussion
Figures 6(a)-6(e) show the performance of DT, ANN, DANN, and AdaBoost for the different evaluation metrics, namely accuracy, precision, recall, F1-score, and AUC, respectively. Theoretically, the training dataset is used to fit the model, whereas the testing dataset is used to measure the fitness of the ML method. Generally speaking, the dataset is split in order to evaluate how the ML model performs on new data. Based on Figure 6(a), DANN achieved very high accuracy at the 70/30 splitting percentage, exceeding 93%, which was higher than that of AdaBoost, ANN, and DT. It is also noteworthy that the performance of the trained AdaBoost and DT models improved when the splitting percentage was 80/20 as compared to the 70/30 splitting percentage. Most classifiers except for AdaBoost showed significantly reduced accuracy when the splitting percentage was 90/10.
Based on Figure 6(b), the precision of ANN increased dramatically at the 90/10 splitting percentage and dropped at the 70/30 splitting percentage. It can be noticed that AdaBoost achieved its highest precision with 63.64% at the 90/10 splitting percentage. The precision of AdaBoost for the three other splitting percentages was almost constant at about 55%. At the initial step of the precision test, the precision of DT increased slightly at the 70/30 splitting percentage. However, its precision slowly decreased at the 60/40 splitting percentage. The precision of DANN nearly reached 66% when the splitting percentage was 90/10.

Journal of Electrical and Computer Engineering
Based on Figure 6(c), the recall value of DANN was higher at the 60/40 than at the 70/30 splitting percentage. It can be seen that DANN had a low recall at the 90/10 splitting percentage. The behavior of ANN was quite similar to that of DANN, in that its recall gradually increased with each splitting percentage. AdaBoost slightly improved its recall from the 90/10 to the 60/40 splitting percentage. It can be clearly seen that the recall value of DT was significantly lower than that of the other comparison methods.
Based on Figure 6(d), DANN achieved a higher F1-score at the 60/40 than at the 70/30, 80/20, and 90/10 splitting percentages. ANN also yielded its highest F1-score at 60/40, similar to DANN. Even though AdaBoost achieved a much lower F1-score than DANN and ANN, its score was better than that of DT.
Based on Figure 6(e), the AUC of DANN increased steadily across the splitting percentages. It is evident that the value of AUC for DANN increased as the training set decreased. ANN showed similar behavior to DANN. DT and AdaBoost slightly increased their AUC values at the 80/20 and 70/30 splitting percentages, respectively. Both of them provided AUC in the range between 0.50 and 0.53. Different classifiers provide different performances of precision, accuracy, recall, AUC, and F1-score at different splitting percentages. For most of them, the highest accuracy was obtained when the splitting percentage was 90/10.

Conclusion
This paper analyzed the performance of four supervised learning classifiers for electricity theft detection. Performance was evaluated using accuracy, precision, recall, F1-score, and AUC for all classifiers. DANN surpassed the other supervised learning classifiers, namely ANN, AdaBoost, and DT, in recall, F1-score, and AUC. For future research, experiments can be performed on other supervised learning algorithms with different types of datasets, and suitable preprocessing methods can be applied to produce better performance.

Data Availability
Previously reported State Grid Corporation of China (SGCC) datasets were used to support this study and are available at http://www.sgcc.com.cn/. These prior studies (and datasets) are cited at relevant places within the text as reference [29].

Conflicts of Interest
The authors declare that there are no conflicts of interest.