An Intelligent Fault Detection Model for Fault Detection in Photovoltaic Systems

Effective fault diagnosis in a PV system requires understanding the behavior of the current/voltage (I/V) parameters under different environmental conditions. Especially during the winter season, the I/V characteristics of certain faulty states in a PV system closely resemble those of a normal state. Therefore, a conventional fault detection model can falsely predict a well-operating PV system as faulty and vice versa. In this paper, an intelligent fault diagnosis model is proposed for fault detection and classification in PV systems. For the experimental verification, various fault state and normal state datasets were collected during the winter season under a wide range of environmental conditions. The collected datasets are normalized and preprocessed using several data-mining techniques and then fed into a probabilistic neural network (PNN). The PNN model is trained with historical data to predict and classify faults when new data is fed into it. The trained model showed better prediction accuracy than other machine learning classification methods.


Introduction
Fault detection and timely troubleshooting are essential for optimum performance in any power generation system, including photovoltaic (PV) systems. In particular, the goals of any commercial power producer are to maximize power production, minimize energy loss and maintenance cost, and operate the facility safely. Since PV systems are subject to various faults and failures, early detection of such faults and failures is crucial for achieving these goals [1][2][3]. The US National Electric Code requires the installation of OCPD (Overcurrent Protection Device) and GFDI (Ground Fault Detection Interrupters) in PV installations for protection against certain faults. However, the Bakersfield fire case (2009) and the Mount Holly case (2011) show the inability of these devices to detect the fault in those particular scenarios [4]. Faults in a PV system can arise from physical, environmental, or electrical conditions [5,6].
A wide range of technologies exists for PV array fault detection, and extensive studies have been done in the area to offer possible solutions [7]. The two most important parameters in determining the performance of a PV system are current and voltage. A simple current-voltage analysis method was proposed where the electrical signature of each faulty module and array was fixed by considering the deformations induced on the I/V curves [8]. Another study shows the use of infrared thermography where electrical and thermal models of a PV system were combined for extracting quantitative information about a mismatch fault [9]. Similar studies show the application of aerial infrared thermography for detecting damage on PV blocks [10] and an on-field infrared thermography-sensing technique for PV system efficiency assessment [11]. Likewise, reflectometry methods have also been used for fault detection in PV systems. A time domain reflectometry (TDR) method was used to detect short circuit and insulation defects [12,13], and recently, a spread spectrum TDR (SSTDR) method was investigated to detect ground faults and aging-related impedance variations in a PV system [14]. In addition, the application of wavelet decomposition techniques for detecting arc faults [15][16][17] and multiresolution signal decomposition for detecting line-line faults [18,19] are also found in the literature. A recent article has provided a comprehensive study of several advanced fault detection approaches in PV systems. The study divides fault detection approaches into model-based difference measurement (MBDM), real-time difference measurement (RDM), output signal analysis (OSM), and machine learning techniques (MLT). It also critically compares these advanced techniques with conventional methods, listing their pros and cons [20].
Nowadays, most PV systems are built with a monitoring system and have a database constantly backed by a large amount of historical data [21]. Artificial Intelligence (AI) methods are data-based, and with the availability of big data in PV systems, studies in this area are gaining momentum. In particular, machine learning (ML)-based algorithms and techniques have been proposed [22][23][24][25], where the model is trained with historical data to predict and classify faults. A recent study reports the application of thermography and ML techniques for fault classification in PV modules [26]. The study adopted a texture feature analysis to study the features of various faulty panel thermal images, and the developed algorithm reached 93.4% accuracy in training. Another study reports the application of ML techniques for fault detection, classification, and localization in PV systems. The study claims a prediction accuracy of 100% for the developed algorithm [27]. Likewise, another study utilizes a wavelet-based approach and radial basis function networks (RBFN) to detect short circuit and open circuit faults in the inverter [28]. Their work reports 100% training efficiency and 97% testing efficiency when tested in a 1 kW single-phase stand-alone PV system.
The performance of a model trained for a PV system using ML techniques can vary greatly when new data comes from a different environmental condition, especially data from the winter season. The irradiation level in the winter is substantially lower than that in the summer, and studies have shown faults occurring at such lower irradiation levels [29,30]. Such undetected faults can cause significant power losses, degrade panel quality, or even lead to panel deterioration. We propose an intelligent fault diagnosis model, applicable in all environmental conditions, for detecting faulty modules and further classifying the fault type. The model uses the multilayer perceptron (MLP) and follows the supervised learning approach. It is robustly trained with historical data of different faulty and normal states in different environmental conditions, with a particular focus on winter. The data was collected from a 1.8 kW grid-connected PV system located in Jeollabuk-do province of South Korea. The rest of the paper is organized as follows. Section 2 introduces the overview of PV system faults. Section 3 describes the whole system architecture of the fault diagnosis model. Section 4 presents experimental results, compares the model with existing classification methods, and discusses other relevant issues. Finally, Section 5 summarizes and concludes the article.

Overview of PV System Faults
Faults occurring in a PV system can be categorized in different ways. We classify such faults into three types: physical, environmental, and electrical [2,3]. However, faults can also be classified on other bases, e.g., location and structure [1].
Physical faults can be internal or external and generally include damage, cracks, and degradation in PV modules. PV system failures caused by aging are also physical in nature.
Environmental faults include soiling and dust accumulation, bird droppings, and temporary shading. Permanent [...] Line-line faults are created by an unintentional low-impedance current path in a PV array. Ground faults are similar to line-line faults; however, the low-impedance path runs from current-carrying conductors to ground/earth. Figure 1 shows the classification of PV array faults, whereas Figure 2 shows the main types of electrical faults in PV systems.
A PV module can be modeled electrically with a one-diode or two-diode model [18]. However, modeling a real PV system is very complex because electrical parameters vary largely between PV systems due to variation in the construction of PV modules (dimension, material, and ground connection), site, and physical layout [27]. Modeling poses particular technical challenges in large-scale power generation systems. In this study, we have limited our work to detecting only electrical faults.

Proposed System Architecture
This section explains in detail the steps that constitute the proposed fault diagnosis system architecture. Figure 3 shows the block diagram and the flow of each step in the proposed architecture.

Data Acquisition.
This is the first layer of the system architecture. For building the model, we acquired the current, voltage, irradiation level, and temperature data from the respective sensors attached to the PV array. The sensors operate at the 5 V level, while the PV module used in this study has an open circuit voltage (V_oc) of 39 V and a short circuit current (I_sc) of 9 A. Active analog filters were used to remove noise that could be injected into the current and voltage sensors from the PV panel. The irradiation level data was collected using a commercial lux meter (LX1330B) with a 0.01 to 200 klux range and an error rate of ±2%. Temperature data was collected from the sensor attached to the modules. The difference between ambient temperature and panel temperature was in the range of 1 to 7 °C. The temperature input fed to the model for training is the average temperature measured from each module. The dataset consists of data collected in summer and winter in all possible environmental conditions. The collected data was backed up on the local server as well as the cloud server.

Data Preprocessing.
Data preprocessing is the second layer in the proposed system architecture. It consists of all the actions taken before the data inputs are fed to the model for feature extraction. Figure 4 shows the functional block of the multilayer perceptron model used in this study. To create the fault detection model, seven PV data features are selected as the input attributes for the input layer.
x_1 is the current (A) in branch 1 of the PV system, x_2 is the current (A) in branch 2 of the PV system, x_3 is the voltage (V) in branch 1 of the PV system, x_4 is the voltage (V) in branch 2 of the PV system, x_5 is the irradiation level (klux), x_6 is the average temperature (°C) from each module, and x_7 is the weather condition (sunny, snowy, cloudy, or rainy). Among the input data, x_7 is categorical; thus, it is encoded as numerical data. The weather condition "sunny" was encoded as 1 and the rest ("snowy," "cloudy," and "rainy") were encoded as 0, since data collected in those environmental conditions showed quite similar features. After that, all the input data were normalized as follows:
$$z = \frac{x - u}{s},$$
where z is the standard score of sample x, u is the mean of the training samples, and s is the standard deviation of the training samples. The whole dataset was split into the training set and the test set with a ratio of 8 : 2.
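The preprocessing steps above (binary weather encoding, an 8 : 2 split, and standardization with training-set statistics) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation; the row layout, the function names, and the synthetic sample values are all assumptions made for the example.

```python
import random
import statistics

# Hypothetical row layout: (x1, x2, x3, x4, x5, x6, weather, label).

def encode_weather(weather):
    """Map the categorical weather condition to 1 (sunny) or 0 (any other)."""
    return 1.0 if weather == "sunny" else 0.0

def to_features(row):
    """Turn one raw row into the seven numeric inputs x1..x7."""
    *numeric, weather, _label = row
    return [float(v) for v in numeric] + [encode_weather(weather)]

def train_test_split(rows, test_ratio=0.2, seed=0):
    """Shuffle a copy of the rows and split them (8 : 2 by default)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

def fit_scaler(features):
    """Compute the per-column mean u and standard deviation s on training data."""
    columns = list(zip(*features))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) or 1.0 for col in columns]  # guard against s = 0
    return means, stds

def standardize(features, means, stds):
    """Apply z = (x - u) / s column by column."""
    return [[(x - u) / s for x, u, s in zip(row, means, stds)] for row in features]

# Synthetic, made-up rows purely for illustration (not measured data).
raw = [
    (4.0 + 0.1 * i, 3.9 + 0.1 * i, 90.0 + i, 91.0 + i, 50.0 + 2 * i, 5.0 + i,
     "sunny" if i % 2 == 0 else "cloudy", "normal")
    for i in range(10)
]
train_rows, test_rows = train_test_split(raw)
train_X = [to_features(r) for r in train_rows]
test_X = [to_features(r) for r in test_rows]
means, stds = fit_scaler(train_X)          # statistics come from the training set only
train_Z = standardize(train_X, means, stds)
test_Z = standardize(test_X, means, stds)  # test data reuses the training u and s
```

Fitting u and s on the training split alone, then reusing them for the test split, matches the definition of u and s as training-sample statistics and avoids leaking test information into the scaler.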

Multilayer Perceptron and Feature Extraction.
A multilayer perceptron (MLP) or probabilistic neural network (PNN) is a nonlinear learning algorithm in ML and is widely applied in both supervised and unsupervised learning. However, most of its applications are found in supervised classification problems.
Here, Φ_ij(y) is the probability density function of input vector y, d is the total category number of training samples, y_ij is the jth training center of the ith type of samples, and ω is the smoothing factor [23]. Figure 5 shows a feedforward multilayer perceptron.
Assuming that we used an input layer with n_0 neurons, input layer X can be given as
$$X = [x_1, x_2, \ldots, x_{n_0}].$$
For the feature extraction, the hidden part of the network is designed with two layers: h_1 being the first hidden layer and h_2 being the second hidden layer. Each of the input dimensions (x_1 to x_7) is fed to h_1, and then, the output from h_1 goes to h_2. The outputs h_i^j of neurons in the hidden layers are computed as
$$h_i^j = f\left(\sum_{k=1}^{n_{i-1}} w_{k,j}^{i-1} h_{i-1}^{k}\right), \quad i = 1, \ldots, N, \quad j = 1, \ldots, n_i,$$
where h_0^k = x_k denotes the inputs, w_{k,j}^{i-1} is the weight between neuron k in layer i−1 and neuron j in layer i, and n_i is the number of neurons in the ith hidden layer. Both of the hidden layers use a uniform distribution as the kernel initializer for initializing the weights in the network. Also, we chose ReLU (Rectified Linear Unit) as the activation function because of its several advantages in nonlinear datasets with multiple dimensions [31]. ReLU is given as
$$f(x) = \max(0, x).$$
The output layer consists of three neurons: y_1, y_2, and y_3. The network outputs are computed as
$$y_j = g\left(\sum_{k=1}^{n_N} w_{k,j}^{N} h_N^{k}\right), \quad j = 1, 2, 3,$$
where w_{k,j}^{N} is the weight between neuron k in the Nth hidden layer and neuron j in the output layer, n_N is the number of neurons in the Nth hidden layer, and g is the output activation function. The output layer also uses a uniform distribution as the kernel initializer, but unlike the hidden layers, it uses Softmax as the activation function to convert the logits into probabilities [32]. The Softmax function is given as
$$g(a)_j = \frac{e^{a_j}}{\sum_{k=1}^{3} e^{a_k}},$$
where a_j is the logit of output neuron j. Because of the nature of classification, we have used categorical crossentropy as the loss function, given as
$$L = -\sum_{j=1}^{3} t_j \log \hat{y}_j,$$
where ŷ_j is the predicted output for class j and t_j is the corresponding entry of the one-hot encoded true label.
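To make the forward computation described above concrete, the following is a minimal pure-Python sketch of a seven-input network with two ReLU hidden layers, a three-neuron Softmax output, and a categorical crossentropy loss. It is an illustration under stated assumptions rather than the authors' code: the hidden width `HIDDEN`, the initialization range `limit`, the omission of bias terms, and the sample input vector are all hypothetical choices made for the example.

```python
import math
import random

def relu(values):
    """ReLU activation applied elementwise: f(x) = max(0, x)."""
    return [max(0.0, a) for a in values]

def softmax(logits):
    """Convert logits to probabilities; subtracting the max keeps exp() stable."""
    peak = max(logits)
    exps = [math.exp(a - peak) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights):
    """Weighted sums for one layer; weights[k][j] links input k to neuron j.
    Bias terms are omitted for brevity (an assumption of this sketch)."""
    n_out = len(weights[0])
    return [sum(inputs[k] * weights[k][j] for k in range(len(inputs)))
            for j in range(n_out)]

def uniform_weights(n_in, n_out, rng, limit=0.05):
    """Uniform kernel initializer; the range +/-limit is a hypothetical choice."""
    return [[rng.uniform(-limit, limit) for _ in range(n_out)] for _ in range(n_in)]

def forward(x, layers):
    """ReLU on every hidden layer, Softmax on the final (output) layer."""
    h = x
    for w in layers[:-1]:
        h = relu(dense(h, w))
    return softmax(dense(h, layers[-1]))

def categorical_crossentropy(one_hot, probs, eps=1e-12):
    """L = -sum_j t_j * log(y_hat_j), with eps guarding against log(0)."""
    return -sum(t * math.log(p + eps) for t, p in zip(one_hot, probs))

rng = random.Random(0)
HIDDEN = 8  # hypothetical width; the actual layer sizes are those listed in Table 1
layers = [
    uniform_weights(7, HIDDEN, rng),       # inputs x1..x7 -> h1
    uniform_weights(HIDDEN, HIDDEN, rng),  # h1 -> h2
    uniform_weights(HIDDEN, 3, rng),       # h2 -> outputs y1..y3
]
x = [0.2, -0.1, 0.5, 0.4, -1.0, 0.3, 1.0]  # made-up standardized input
probs = forward(x, layers)
loss = categorical_crossentropy([1.0, 0.0, 0.0], probs)
```

With freshly initialized weights the three output probabilities are close to uniform, so the loss starts near log 3; training lowers it by moving probability mass toward the true class.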
Categorical crossentropy compares the distribution of the predictions (the activations in the output layer, one for each class) with the true distribution, where the probability of the true class is set to 1 and that of the other classes to 0. Among the many available optimizers, we used Adam (Adaptive Moment Estimation) for optimizing the proposed model [33]. Adam adapts the learning rate for each parameter: each update is scaled by running averages of the recent gradients and of their squares. Finally, the model was trained with a batch size of 5 for 200 epochs. Table 1 shows the different parameters used to construct the MLP as the fault classifier.
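The Adam update can be illustrated on a single scalar parameter. The sketch below follows the standard published form of the Adam rule; the helper name `adam_step`, the toy objective, and the learning rate of 0.05 are assumptions chosen for the example and are not the settings used to train the proposed model.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter.

    m and v are running averages of the gradient and of the squared gradient;
    t is the 1-based step count used for bias correction.
    """
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)  # bias-corrected second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Toy problem (hypothetical): minimize f(w) = (w - 3)^2, whose gradient is 2(w - 3).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.05)
```

Because the step is divided by the root of the running squared-gradient average, the very first update has a magnitude of roughly `lr` regardless of how large the gradient is, which is what makes the method insensitive to gradient scale.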
To check the bias-variance tradeoff, a k-fold cross-validation test was performed with 5 folds on the training data. Also, for improving the model and reducing overfitting, we implemented the dropout regularization technique. Dropout rates of 0.1 and 0.2 were selected for the first and second hidden layers, respectively. The result of the evaluation, improvement, and tuning of the model is provided in Section 4.1.
For the purpose of collecting experimental data, the data collected without any hardware or circuit modification in the PV system were categorized as "normal." Fault data were collected by intentionally creating several faults in the circuit of the PV array. Table 2 shows the minimum, maximum, and variance of the data collected in the different environmental conditions to train the proposed model. For the experimental verification, we set up a PV system of the kind used in the real power production industry, with the specifications given in Table 3 and Figure 6. As shown in Table 2, the variance in the winter dataset is much higher than that in the summer dataset, which requires special attention while training the model for accurate predictions. Figures 7 and 8 show normal and line-line fault dataset features of the input variables x_1 and x_2 during the summer and winter seasons on sunny days and cloudy days, respectively.
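Returning briefly to the validation step, the 5-fold cross-validation applied to the training data can be sketched with the standard library alone. The function name `k_fold_indices` is hypothetical, and the sample count of 2400 is an assumption derived from an 8 : 2 split of 3000 samples; the sketch only produces index partitions and leaves model fitting out.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) pairs for k-fold cross-validation.

    Indices are shuffled once, dealt into k folds, and each fold serves as the
    validation set exactly once while the remaining folds form the training set.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

# Assumed size: 80% of 3000 samples gives 2400 training samples.
splits = list(k_fold_indices(2400, k=5))
```

Each sample is used for validation in exactly one fold, so averaging the five validation scores gives an estimate of generalization that does not depend on a single lucky or unlucky split.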
Among the input variables, the irradiation level has the highest variance in absolute terms. However, visualizing the current sensor (S_I1/S_I2) data makes more sense since its relative variance (σ_x^2 / x̄) was the highest among the input variables. As seen in Figure 7, it is hard to distinguish "normal cloudy" and "line-line" fault data in the winter season as they overlap most of the time. Figure 9 shows the principal component analysis (PCA) dimensionality reduction technique used to project the seven-dimensional data (x_1 to x_7) onto two scaled dimensions. As seen in the center-left and center-right parts of the figure, small regions exist where "normal" state data and "line-line" fault data overlap. Figure 10 shows the training-test validation of the dataset with a train-test split ratio of 8 : 2. The proposed PNN model was extensively trained with 3000 data samples, 1000 from each of the three states of the PV system. Figure 11 shows the confusion matrix, giving an outcome of 100% accuracy when the test data was fed to the trained model. The numerical values shown in the confusion matrix are expressed in absolute terms, i.e., the total number of predicted labels for the line-line fault was 184, given that the number of true labels for the line-line fault was 184, which is 100%. Figure 12 shows the setup of the experimental 1.8 kW PV system at the Jeonbuk National University campus. Table 4 shows the comparison of the proposed method with existing studies found in the literature. Figure 13 shows a screenshot of the developed desktop application implementing the proposed model for fault detection.

Discussions.
ML-based fault detection and diagnosis techniques have been employed recently, and this trend is expected to continue in the coming years. The quality of an ML-based model heavily depends upon the training data. Studies show that models trained with PV data tend to have very high prediction accuracy, even up to 100% (Table 4). We tested our dataset with other machine learning models and obtained quite high accuracies (shown by the F1 score) in each case, as shown in Table 5. The correlations between the predictions of the classifiers are shown in Figure 14.
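For readers unfamiliar with the F1 score used in the comparison, the sketch below shows how per-class F1 and overall accuracy follow from a confusion matrix. The function names are hypothetical and the counts in `example` are made up for illustration; they are not results from this study.

```python
def accuracy(confusion):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total

def per_class_f1(confusion):
    """Per-class F1, where confusion[i][j] counts true class i predicted as j."""
    n = len(confusion)
    scores = []
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(n)) - tp  # predicted c, truly other
        fn = sum(confusion[c]) - tp                       # truly c, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2.0 * precision * recall / denom if denom else 0.0)
    return scores

# Made-up three-class counts, for illustration only.
example = [
    [90, 10, 0],
    [5, 95, 0],
    [0, 0, 100],
]
f1_scores = per_class_f1(example)
overall = accuracy(example)
```

Unlike plain accuracy, the per-class F1 score penalizes both false alarms and missed faults for each class, which matters when a "normal" state and a fault state are easily confused.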
It is very important to identify the main features in the input dataset while training an ML model. The most important parameters in a PV system are current and voltage. A fault detection model trained with only these two input features can be as robust as models trained with more input features.
No single fault detection technique is capable of detecting, diagnosing, and locating all types of faults in a PV system. As noted earlier, our study is limited to detecting major electrical faults. As a continuation of this study, we aim to develop hybrid techniques, including ML, toward a comprehensive and complete fault detection method.
In most cases, a fault detection model developed for one PV system cannot be applied to another PV system, as electrical parameters vary largely between PV systems. There is a need for flexible models that can be implemented in any PV system with minor modifications. We have given special consideration to making our model as flexible as possible while developing the desktop application. The source code and data used for the experiment are available on the author's GitHub page as an open-source project. The open-source community can use the model for their applications, provide feedback, or contribute to the improvement of the model as a whole.

Conclusion
PV systems are subject to various faults and failures, and early detection of those faults and failures is very important for the efficiency and safety of PV systems. ML-based fault detection models are trained with data and provide prediction results with very high accuracy. However, data-based fault detection models for PV systems can sometimes give false predictions, especially when the environmental parameters are not taken into consideration. This paper developed an intelligent fault detection model for PV arrays based on PNN for accurately classifying the fault types. The model was trained with a large dataset containing different data values under different environmental conditions in the summer and the winter seasons. For the experimental verification, various fault state and normal state datasets were collected from a 1.8 kW grid-connected PV system (six 300 W panels arranged as two parallel strings of three series-connected panels each). The experimental results demonstrate that the proposed method is superior in accurately predicting the result in cases where the fault state and the normal state are very hard to distinguish.