Classification of Imbalanced Data Using Deep Learning with Adding Noise

This paper proposes a method for classifying imbalanced data by adding noise to the feature space of a convolutional neural network (CNN) without changing the data set (the ratio of majority to minority data). In addition, a hybrid loss function of cross-entropy and KL divergence is proposed. The proposed approach improves the accuracy of the minority class on testing data. A simple design method for selecting the CNN structure is first introduced; then, noise is added in the feature space of the CNN to obtain proper features through the training process and to improve the classification results. Comparison results show that the proposed method extracts suitable features that improve the accuracy of the minority class. Finally, illustrated examples of multiclass classification problems and the corresponding discussion of the imbalance ratio are presented. Our approach performs well with a smaller network structure compared with other deep models. In addition, adding noise improves the defective-sample accuracy by over 40%. Finally, the accuracy remains higher than 96% even when the imbalance ratio (IR) is one hundred.


Introduction
In industrial applications, defect detection is very important since defects adversely affect the quality and performance of products [1]. Surface defect detection is one of the most common applications in this area and is necessary for steel, wood, or solar wafers [2][3][4]. Machine vision methods have become popular in recent years due to their high speed, cost savings, and high accuracy [5][6][7][8][9][10]. Traditional machine vision approaches can be divided into four categories, namely, statistical approaches, structural approaches, filter-based methods, and model-based approaches [7]. However, their performance depends on the application field, and the data set distribution affects the results [8][9][10]. In other applications such as cancer detection or environmental disaster prediction, a disproportionate amount of the available data comes from negative cases [11,12].
Recently, convolutional neural networks (CNNs) have been used in many fields [11][12][13][14]. A CNN automatically learns to extract features without requiring professional knowledge for feature extraction. In manufacturing, defect detection is a typical imbalanced data problem, since defect samples are usually far fewer than nondefective ones [15]. The imbalance ratio (IR) is usually used to describe the ratio of majority to minority samples; an IR greater than 1.5 is generally considered imbalanced and causes the learning result to bias towards the majority class [16,17]. To treat imbalanced data problems, many approaches have been presented, e.g., sampling methods, cost-sensitive methods, and kernel-based methods [17][18][19][20][21][22][23][24][25][26]. The most used sampling methods are random oversampling and undersampling [26]. Another variation of oversampling is the synthetic minority oversampling technique (SMOTE), which generates new samples by synthesizing minority samples [19,20]. SMOTE modifies the data in the feature plane to solve the problem; therefore, feature extraction must be applied first to obtain the features. On the other hand, data augmentation methods modify image data to add samples [17,[21][22][23][24][25]. Generative adversarial networks (GANs) have been used to generate realistic data from minority sets [17,22,23]; however, the corresponding results are not good enough on diverse data sets. Cost-sensitive methods change the threshold or weights of the network to bias it towards minority classes [22,23]. Kernel-based methods work on the classification boundary of the feature space [24,25]. Literature [27] presents MetaBalance, an algorithm that uses meta-learning for deep neural networks on class-imbalanced data.
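For reference, the core step of SMOTE interpolates between a minority sample and one of its nearest minority neighbors. The following is a minimal sketch of that idea, not the implementation of [19,20]; the function name `smote_sample` and the brute-force neighbor search are illustrative.

```python
import random

def smote_sample(minority, k=5, rng=None):
    """Generate one synthetic minority sample by interpolating between a
    random minority point and one of its k nearest minority neighbors."""
    rng = rng or random.Random(0)
    x = rng.choice(minority)
    # k nearest neighbors of x among the other minority samples (Euclidean)
    neighbors = sorted(
        (p for p in minority if p is not x),
        key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
    )[:k]
    nb = rng.choice(neighbors)
    u = rng.random()  # interpolation factor u ~ U(0, 1)
    return tuple(a + u * (b - a) for a, b in zip(x, nb))
```

The synthetic point always lies on the line segment between two real minority samples, which is why SMOTE can only synthesize information already contained in the minority data.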
Since the performance of a CNN is closely related to the data, the number and quality of data affect the classification results. Therefore, some methods have been introduced to deal with this problem [28][29][30]. However, a CNN trained on imbalanced data still misclassifies defective samples. Therefore, a CNN with noise added in the feature space is proposed to improve the accuracy.
In this paper, we propose a classification method using a CNN with noise added in the feature space for imbalanced data classification. Our target is to improve the accuracy on minority samples while preserving the total accuracy. Herein, a hybrid loss function of cross-entropy and KL divergence is adopted for training. In addition, a simple design method for selecting the CNN structure is introduced; then, noise is added in the feature space of the CNN to obtain proper features through the training process and to improve the classification results. In our experience, the proposed approach can find information that does not exist in the original minority samples, thereby improving the accuracy of the minority class on testing data. To demonstrate generalization, the method is applied to three imbalanced data sets of different sizes. Experiments show that the method effectively improves the accuracy on minority samples. In addition, we also apply this method to multiclass classification and different imbalance ratios.
The rest of this paper is organized as follows. Section 2 introduces the imbalanced data problem and the data sets. The major contributions are introduced in Section 3, including the CNN with added feature-space noise and the network structure selection. Section 4 presents the experimental and validation results. Finally, the conclusion is given.

Problem Formulation and Data Sets
This section introduces the imbalanced data classification problem and the experimental data sets. Three open data sets (DAGM 2007 [31], NEU surface defect [32], and MNIST [33]) are utilized to demonstrate the performance and effectiveness of our method.
2.1. Imbalanced Data Classification. Any data set with an unequal class distribution is imbalanced, and the minority samples usually carry significant concepts for classification [18,[34][35][36]. Figure 1 shows the classification problem caused by imbalanced data in the training process; the red and blue points are the minority and majority samples, respectively, and the dashed line denotes the decision boundary. The model is trained and judges by this decision boundary. Since the minority samples in the training data do not carry enough concepts to represent the minority class, the model may assign some minority samples to the wrong class in the testing data. If these minority samples are defective ones, this has a great impact on product quality. Therefore, our goal is to solve the misclassification of minority samples in imbalanced data.
For binary classification problems, the corresponding confusion matrix is used to show the classification results. If p and n denote the true positive and negative classes and Y and N the predicted positive and negative classes, then TP, FP, FN, and TN represent true positives, false positives (type I errors), false negatives (type II errors), and true negatives. In general, the positive samples are the defective or minority samples, and the negative samples are the nondefective or majority samples. Thus, accuracy is defined as

Accuracy = (TP + TN) / (TP + FP + FN + TN).

Figure 1: Problem of classification caused by imbalanced data in a learning model (training data versus testing data).

We use recall to evaluate our model as

Recall = TP / (TP + FN).

Herein, we mainly compare binary classification problems with recall. To make the experimental results clearer, we report recall as the accuracy of defective or minority samples. In addition, we also use precision,

Precision = TP / (TP + FP).

DAGM 2007 Data Set.
This data set is an artificially generated data set for defect detection on textured surfaces [31]. Table 1 presents the number of samples for each subdata set, with an imbalance ratio of about 20/3 (nondefective to defective); the image sizes are unified to 512 × 512.

Table 1: Sample numbers for each DAGM 2007 subdata set.

Subdata set | Defective (training) | Nondefective (training) | Defective (testing) | Nondefective (testing)
1 | 79 | 496 | 71 | 504
2 | 66 | 509 | 84 | 491
3 | 66 | 509 | 84 | 491
4 | 82 | 493 | 68 | 507
5 | 70 | 505 | 80 | 495
6 | 83 | 492 | 67 | 508
7 | 150 | 1000 | 150 | 1000
8 | 150 | 1000 | 150 | 1000
9 | 150 | 1000 | 150 | 1000
10 | 150 | 1000 | 150 | 1000
Each subdata set has different kinds of textures and defects; samples are shown in Figure 2.
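The accuracy, recall, and precision used throughout this section follow directly from the confusion-matrix counts; a minimal sketch (the function name `metrics` is illustrative):

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, recall (accuracy on defective/minority samples),
    and precision from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)     # fraction of true defects that were found
    precision = tp / (tp + fp)  # fraction of predicted defects that are real
    return accuracy, recall, precision
```

For imbalanced data, accuracy alone can be high even when most defects are missed, which is why recall on the minority class is the main metric compared later.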

NEU Surface Defect Data Set.
The data set is provided by Northeastern University, and the image dimensions of this version are 64 × 64 pixels [32]. Nine classes of typical surface defects of the hot-rolled steel strip are collected. The NEU surface defect data set includes two difficult challenges: large differences within the same class and similarities between different defect classes. Table 2 shows the number of images for each class, and Figure 3 shows samples of the data set.

MNIST Data Set.
MNIST (Modified National Institute of Standards and Technology) is a data set of handwritten digits collected from American Census Bureau employees and high school students [33]. Each picture is normalized to 28 × 28 pixels, as shown in Figure 4, and Table 3 shows the number of images for each digit. MNIST is not an imbalanced data set; therefore, the sample numbers of some classes are modified in later experiments.

Convolutional Neural Network with Adding Noise in Feature Space
CNN is one of the representatives of deep learning and artificial intelligence [14]. Figure 5 shows the basic architecture of a CNN with input image size 6 × 6 [37], in which convolution with eight 3 × 3 filters results in eight feature maps of 4 × 4 resolution, which are processed by 2 × 2 maximum pooling to reduce the dimensions. After the flatten layer, the feature maps are rearranged into one dimension; subsequently, a fully connected part with four hidden neurons and six outputs follows. In general, the structure affects the performance of a CNN; thus, we introduce an architecture design method in this section.
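The dimensions in Figure 5 can be traced with a small NumPy forward pass using random weights; this sketch only checks the shapes through each stage, not trained behavior:

```python
import numpy as np

def conv2d_valid(img, kernel):
    """'Valid' 2-D convolution: (H, W) * (kh, kw) -> (H-kh+1, W-kw+1)."""
    h, w = img.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def maxpool_2x2(fm):
    """2 x 2 maximum pooling, halving each spatial dimension."""
    h, w = fm.shape
    return fm.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

rng = np.random.default_rng(0)
img = rng.standard_normal((6, 6))                    # 6 x 6 input image
kernels = rng.standard_normal((8, 3, 3))             # eight 3 x 3 filters
maps = [conv2d_valid(img, k) for k in kernels]       # eight 4 x 4 feature maps
pooled = [maxpool_2x2(m) for m in maps]              # eight 2 x 2 maps
flat = np.concatenate([p.ravel() for p in pooled])   # flatten -> 32 values
hidden = rng.standard_normal((4, flat.size)) @ flat  # four hidden neurons
logits = rng.standard_normal((6, 4)) @ hidden        # six outputs
```

The 4 × 4 map size follows from valid convolution (6 − 3 + 1 = 4), and pooling halves it to 2 × 2, giving 8 × 2 × 2 = 32 flattened features.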
3.1. CNN with Adding Noise in Feature Space. To solve imbalanced data classification with a CNN, the CNN is modified by adding noise to the extracted features. The purpose of adding noise in the feature space is to change the distribution of features through the training process [38][39][40]. When noise is added in the feature space, we have the chance to extract features suitable for classification, especially for minority samples: to obtain good results on the noisy training data, the network must learn features that remain identifiable under random perturbation, and such features are then also able to identify testing samples. An illustration of adding noise to the feature space is given in Figure 6; the red and blue points denote the majority and minority samples, respectively, and the dashed line denotes the decision boundary obtained by training. Note that the trained neural network cannot distinguish minority samples that lie closer to the majority samples in the testing process. When we disturb such a point, illustrated as the dashed circle in Figure 6, it has the chance to resemble a point in the testing data that would otherwise be misjudged. To classify this point correctly in the training data, the network must learn extracted features and a fully connected classification layer that separate it. Therefore, minority samples that are close to the majority samples in the testing data may also be correctly classified.
Herein, we implement this concept to propose a CNN with adding noise in the feature space to obtain proper features through the training process and to improve the classification results. Figure 7 shows the proposed architecture, in which the noise is added in the last extracted feature layer; this network is called CNN noise. The structure selection will be introduced in the next subsection. The added noise follows the standard normal distribution and is multiplied by e^σ to ensure the scale is positive. Note that the nodes m are the features extracted by the CNN; in the testing process, we only take m and remove the noise. After the adding-noise part, we obtain the new feature c as

c = m + e^σ ⊙ n, n ~ ND(0, I),

where ND means the standard normal distribution and ⊙ denotes element-wise multiplication. Finally, c is passed to the softmax output to produce the prediction ŷ of y, and the network is adjusted by error back propagation. However, without any constraint in the loss function, the noise scale will approach zero during training, since a small noise value results in a small loss value. Therefore, we adopt the KL divergence to constrain the CNN and ensure the existence of the noise. KL divergence measures the difference between probability distributions; by using it as one of our loss terms, m and e^σ are pushed towards the mean and standard deviation of the standard normal distribution. Since the standard deviation of the standard normal distribution is one, we can ensure that the noise is not zero. Therefore, our loss function is

L = L_ce + α L_KL,   (5)

where

L_KL = (1 / (2N)) Σ_{i=1..N} Σ_{j=1..k} (m_ij^2 + e^{2σ_ij} − 2σ_ij − 1),
L_ce = −(1 / N) Σ_{i=1..N} Σ_c y_{i,c} log ŷ_{i,c}.

L_KL and L_ce denote the KL-divergence and cross-entropy losses, and the parameter α is used to suppress the KL divergence. In addition, N denotes the number of samples, k denotes the dimension of the feature, and y and ŷ denote the label of the data and the prediction, respectively. An illustrated result on DAGM 2007 for selecting the value of α in equation (5) will be introduced in Section 4. In our experience, α = 0.00025 works well and can be fine-tuned by experimental results.
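A NumPy sketch of the noise injection and the hybrid loss of equation (5); the exact KL term is an assumption, using the standard closed form of KL(N(m, e^{2σ}) || N(0, 1)) summed over the k feature dimensions and averaged over the N samples:

```python
import numpy as np

def add_feature_noise(m, sigma, rng):
    """Training-time noise injection: c = m + e^sigma * n, n ~ ND(0, I).
    At test time the noise is removed, so c = m."""
    return m + np.exp(sigma) * rng.standard_normal(m.shape)

def hybrid_loss(y, y_hat, m, sigma, alpha=0.00025, eps=1e-12):
    """Hybrid loss L = L_ce + alpha * L_KL. The KL term keeps e^sigma near 1
    (and m near 0), so the injected noise cannot shrink to zero."""
    n_samples = y.shape[0]
    l_ce = -np.sum(y * np.log(y_hat + eps)) / n_samples
    l_kl = np.sum(m ** 2 + np.exp(2 * sigma) - 2 * sigma - 1) / (2 * n_samples)
    return l_ce + alpha * l_kl
```

With m = 0 and σ = 0 the KL term vanishes, and any deviation of the learned noise scale from one increases L_KL, which is exactly the constraint described above.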
3.2. Architecture Design of CNN. The structure of a CNN affects the classification results and the computational complexity. Herein, a simple method to select the CNN structure is introduced. Recently, some literature uses transfer learning for defect detection [41][42][43][44]. At first, we adopt transfer learning on DAGM 2007 subdata set 1 as a test, using VGG16, InceptionV3, and ResNet50 [45][46][47]. These classic models are pretrained on the ImageNet data set, and the transfer learning results are presented in Table 4. The training algorithm is Adam; the initial learning rate is 0.0001 with decay 0.00001, and learning runs for 100 epochs. The averages reported here are obtained over 10 independent runs. From Table 4, the accuracy on defective samples is not good enough, although the accuracy on nondefective samples is acceptable. In addition, these models have very large numbers of adjustable parameters, more than 18 million, so the corresponding computation effort for training and detection is large. More details and comparisons will be introduced in Section 4. Thus, we introduce a design method for the CNN architecture to improve the classification results. Designing the architecture for a specific data set has also been used in other articles [48,49]. The advantage of this method compared to transfer learning is the short computation time.
Here are the steps to design the CNN architecture.
Step 1. First decide the number of pooling layers.
Step 2. Start with a small number of convolution layers, and gradually increase it to raise the accuracy.
Step 3. Stop when the accuracy no longer increases.
Step 4. Select one hidden layer for the fully connected part.
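Steps 2 and 3 can be sketched as a simple search loop; `train_and_eval` is a hypothetical routine (not part of the paper) that trains a CNN with a given number of convolution layers and returns its validation accuracy:

```python
def design_conv_depth(train_and_eval, max_layers=8):
    """Grow the number of convolution layers until the validation
    accuracy no longer increases (Steps 2-3 of the design procedure)."""
    best_n, best_acc = 1, train_and_eval(1)
    for n in range(2, max_layers + 1):
        acc = train_and_eval(n)
        if acc <= best_acc:
            break  # Step 3: stop when accuracy no longer increases
        best_n, best_acc = n, acc
    return best_n, best_acc
```

This greedy search keeps the network as small as possible, which is what allows the designed CNN to stay far below the parameter counts of the transfer-learning models compared later.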

Experimental Results
The following experiments are run on the Taiwan Computing Cloud (TWCC), a cloud computing platform provided by the National Center for High-performance Computing (NCHC) of the National Applied Research Laboratories (NARLabs). The computing speeds of Taiwania 2 and Taiwania ranked 23rd and 314th, respectively, in the TOP500 international rankings in 2019. The GPUs are NVIDIA Tesla V100 32 GB, and we use eight of them at the same time.
To demonstrate the proposed approach, several comparison and illustration experiments are introduced. To illustrate versatility, we also present results on the three data sets and on the multiclass classification problem, and we show the performance for higher imbalance ratio cases. In the following experiments, the data are randomly split into 80% training, 10% testing, and 10% validation (fine-tuned for each experiment). In addition, the random selection for each epoch is done by shuffling.

4.1. Architecture Design of CNN. Herein, an illustrative example of CNN architecture design on the DAGM 2007 data set is introduced for performance evaluation. At first, the number of pooling layers is determined by the range from which features are extracted, according to the results of [50]. The number of convolutional layers determines the features; i.e., more complex features require more convolutional layers. We gradually increase it and stop when the accuracy no longer increases. Since the DAGM 2007 data set poses a binary classification problem, the number of hidden layers of the fully connected part is selected to be one. In addition, the numbers of kernels of the convolutional layers and the fully connected layer are simply set to appropriate values. Our objective here is to obtain usable results with fewer parameters and shorter computation time while maintaining good accuracy; if the accuracy must be optimized over architectures, the uniform design method and optimization can be adopted [51,52]. Figure 8 shows the CNN architecture, and Table 5 shows the corresponding results on the DAGM 2007 data set. The accuracy on minority data is 94.37%, which is better than the results of VGG16, InceptionV3, and ResNet50 by transfer learning shown in Table 4. We use it as our CNN architecture (called CNN designed) in the following discussions. Besides, Table 6 compares the previous models with our method in parameter number and computation effort.
We can see that the previous models have far more parameters than our method. Note that the trainable parameters affect training time, the fixed parameters affect training and testing time, and the total parameters determine the computing requirement. From Table 6, the parameter number of our method is smaller than that of the others, and the corresponding training and testing times are relatively small. The large parameter counts may cause difficulties in implementing the transfer learning approach. Considering the low accuracy of transfer learning on defective samples and these implementation difficulties, we adopt CNN designed in the following experiments.
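The 80/10/10 random split with shuffling used in these experiments can be sketched as follows (a generic stdlib implementation, not the authors' code):

```python
import random

def split_80_10_10(samples, seed=0):
    """Random 80/10/10 training/testing/validation split with shuffling,
    as used in the experiments (per-experiment fine-tuning may differ)."""
    s = list(samples)
    random.Random(seed).shuffle(s)
    n = len(s)
    n_train, n_test = int(0.8 * n), int(0.1 * n)
    return s[:n_train], s[n_train:n_train + n_test], s[n_train + n_test:]
```

Shuffling before splitting ensures that minority samples are distributed across the three partitions rather than clustered in one of them.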

4.2. Performance of Adding Noise in Feature Space. As shown in Figure 1, the decision boundary in feature space is easily obtained when the distribution of features is separable. Thus, the purpose of adding noise is to obtain more separable features. Herein, we present the distribution of the trained features to demonstrate the proposed method. Figure 9 shows the distribution of eight features for training and testing data; blue and red are majority and minority samples, respectively. Figures 9(a) and 9(b) show the feature distributions extracted from training samples, and Figures 9(c) and 9(d) show the results for testing samples. From Figure 9(a), the features extracted by the CNN without adding noise are very close, and some samples overlap. Figure 9(b) shows the feature distribution extracted by CNN noise; the features extracted by CNN noise clearly increase the distance between majority and minority samples. To adapt to these noise-added features, the network must extract features that are more separated between classes. Figures 9(c) and 9(d) show the comparison on testing data: many samples are indistinguishable in the features extracted by the CNN without noise, whereas in Figure 9(d) CNN noise also effectively separates the features in the testing data. Very few features remain confused in Figure 9(d), which shows that the features obtained by CNN noise can effectively classify the samples. In addition, an evaluation of feature confusion is introduced as follows.
To evaluate the confusion of extracted features, we here define confusion number as

Confusion number = min(overlap_A, overlap_B),

where overlap_A denotes the number of class A samples that overlap class B, and overlap_B denotes the number of class B samples that overlap class A. Since we usually consider the fewer overlapping samples as the ones mixed into the other class, we take the smaller value as the confusion number. For example, if A = {0, 1, 2, 5} and B = {3, 4, 7, 8}, the class A samples overlapping class B are {5} and the class B samples overlapping class A are {3, 4}; therefore, the confusion number is min(1, 2) = 1. Table 7 presents the confusion numbers for different features of the CNN without noise (CNN designed) and with noise (CNN noise). We can see that CNN noise leaves no confused samples in either the training or the testing data, which means the features are separable. In contrast, the CNN without adding noise confuses a number of minority samples in the testing data, even though the features it extracts can be classified in the training data. Our method obtains new features during training by adding noise to the features of minority samples; these new features have the opportunity to approach features that are not covered by the original minority samples. In this way, our network also performs well on testing data. Although several features of CNN designed are imperfect on the training data, the remaining features still let the network reach high training accuracy. This also causes the network to stop converging, and the significant drop in feature quality on the testing data highlights the shortage of minority samples.
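Under the range-overlap reading of the example above (a sample of one class "overlaps" the other class when it falls inside that class's value range; this interpretation is an assumption based on the worked example), the confusion number can be computed as:

```python
def confusion_number(a, b):
    """min(overlap_A, overlap_B) for two lists of 1-D feature values."""
    overlap_a = [x for x in a if min(b) <= x <= max(b)]  # A samples inside B's range
    overlap_b = [x for x in b if min(a) <= x <= max(a)]  # B samples inside A's range
    return min(len(overlap_a), len(overlap_b))
```

For the worked example, `confusion_number([0, 1, 2, 5], [3, 4, 7, 8])` gives min(1, 2) = 1, matching the text.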

4.3. Comparison Results in Accuracy.
This section compares the classification accuracy of CNN designed and CNN noise on the DAGM 2007 data set. The same learning parameters are used: the initial learning rate is 0.0001 with decay 0.00001, the learning algorithm is Adam, and training runs for 200 epochs. Table 8 shows the accuracy on all test samples and on defective samples for each subdata set. From Table 8, CNN noise improves the accuracy and obtains much better results on these subdata sets, especially on minority samples (accuracy improved to 90.33% or above). We can also conclude that the accuracy increases in both the high-accuracy and the low-accuracy subdata sets; in particular, the subdata sets with lower accuracy on defect samples, such as subdata sets 4, 5, 7, and 9, are also improved well. These results verify that using this method to detect imbalanced defect data greatly improves the accuracy of CNN designed on minority samples.

4.4. Comparison Results on Other Data Sets.
To further verify the proposed method, two other commonly used data sets (NEU and MNIST) are adopted. The NEU surface data set is also a defect data problem with different numbers of samples in different classes, and MNIST is a commonly used data set for image recognition. The corresponding CNN structure is introduced in Table 9, and Table 10 gives the sample numbers of the selected classes, which are divided into majority and minority classes for binary classification. Table 11 shows the accuracy of CNN designed and CNN noise on the NEU surface data set; CNN noise again obtains better results. Herein, we use an initial learning rate of 0.0005 with a decay of 0.00001, and Adam is again used as the training algorithm.
Herein, we also apply CNN noise to multiclass problems, since multiclass classification is common in real applications. The NEU data set is categorized into multiple classes; therefore, it is adopted for this demonstration. Table 12 presents the confusion matrices of CNN designed and CNN noise. Classes 2 and 6 are the minority classes here, and they are also improved by CNN noise. The classification performance decreases with multiple classes because of the additional classes, but CNN noise still obtains a higher accuracy rate than the CNN without adding noise. The second validation uses the MNIST data set; note that MNIST is first manually modified to make it imbalanced. MNIST contains ten classes of handwritten digits from 0 to 9, and we choose the digits 7, 8, and 9 as the minority classes. There are 6000 samples per class in the original training data. An imbalance ratio of 100 is created by randomly subsampling the minority classes; the resulting sample numbers are given in Table 13. Table 14 shows the confusion matrix results on the modified imbalanced MNIST. Herein, we use an initial learning rate of 0.0001 with a decay of 0.00001, and the training algorithm is again Adam. From Table 14, the minority classes 7, 8, and 9 only reach about 77%, 64%, and 67% accuracy with the CNN without adding noise. In contrast, CNN noise improves the accuracy of the minority classes 7, 8, and 9 by about 13%, 20%, and 13%, respectively, without reducing the accuracy of the majority classes.
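The imbalanced-set construction used here and in the next subsection (randomly subsampling minority classes to reach a target IR) can be sketched as follows; `subsample_minority` is an illustrative helper, not the authors' code:

```python
import random

def subsample_minority(majority, minority, ir, seed=0):
    """Create an imbalanced set with imbalance ratio `ir` by keeping all
    majority samples and randomly keeping len(majority) // ir minority ones."""
    k = max(1, len(majority) // ir)
    return list(majority), random.Random(seed).sample(list(minority), k)
```

For example, with 6000 majority samples and IR = 100, only 60 minority samples are kept, which reproduces the severity of the modified MNIST setting.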
4.5. Discussion of the Imbalance Ratio. In practical problems, we may need to deal with a higher imbalance ratio [53]. Therefore, we artificially increase the imbalance ratio to verify CNN noise on binary classification and to confirm whether the results are affected by the imbalance ratio. Herein, we use class 1 (majority) and class 7 (minority) of the NEU surface defect data set for verification. Imbalance ratios of 2, 4, 10, 20, 50, and 100 are chosen, and random selection is used to create each data set. Table 15 compares CNN noise and the CNN at the different imbalance ratios. When the imbalance ratio increases, the minority-sample accuracy of the CNN decreases rapidly. In contrast, CNN noise decreases slowly and maintains the accuracy on minority samples above 96%. This shows that CNN noise has a higher tolerance for high imbalance ratios.

4.6. Comparison Results with Other Methods. Herein, comparisons with CNN noise are introduced in Tables 16 and 17. A functional analysis of CNN noise and the other methods is given in Table 16, where ○ means able, Δ partially able, and × unable. The second column indicates whether the method can be applied to image data directly; SMOTE must extract features from the picture before it can be used. The third column indicates whether the method requires complicated processing: random oversampling needs to select the sampled images before classifying them, and SMOTE and the cost-sensitive method also require extra processing first. The fourth column indicates whether the method generalizes across different data: SMOTE and data augmentation need different generation methods for data of different natures, and the cost-sensitive method may have to sacrifice the accuracy of the majority class. The last column indicates whether the method can obtain information beyond the existing data.
Both SMOTE and data augmentation are generation methods. However, since SMOTE synthesizes from the original data, it is more limited than the other methods. CNN noise adds noise in the feature space and can obtain information that does not exist in the data. Since CNN noise only needs to train a classification model and can be applied to different data sets directly, its versatility and simplicity are also good. Among these methods, random oversampling mainly addresses the imbalanced sample distribution in the learning algorithm; cost-sensitive methods work well if we do not care about the accuracy of the majority class; SMOTE can find and synthesize information within the data; and CNN noise and data augmentation can find information outside the data. Table 17 shows the performance comparison on the NEU surface data set with an imbalance ratio of 100; the results are averaged over the same ten experiments. All these methods improve the minority-class accuracy over using the CNN alone. Random oversampling improves the accuracy, but less than the other methods. Cost-sensitive methods raise the minority-class accuracy to a high level, but they sacrifice the accuracy of the majority class. SMOTE and data augmentation require some additional preprocessing. In contrast, CNN noise achieves good accuracy without complicated processing.

4.7. Discussion on Selecting Parameter α of the Hybrid Loss Function. Herein, a comparison for selecting the parameter α of the hybrid loss function (5) is introduced by experimental testing. Figure 10 shows the training history with different α on DAGM 2007 subdata set 1; the left column shows L KL and the right column shows L ce. We attempt to balance L KL and L ce by gradually reducing α. From Figure 10, we find that L ce is difficult to converge when α is too small. Besides, L KL in Figure 10(e) is larger than that in Figure 10(d). Since we hope to obtain a smaller L KL, α is set to 0.00025 for all experiments.

Conclusions
In this paper, we have proposed a method to improve CNN-based detection on imbalanced defect data sets by adding noise in the feature space. A simple design method for selecting the CNN structure was first introduced; then, noise was added in the feature space of the CNN to obtain proper features through the training process and to improve the classification results. In addition, a hybrid loss function of cross-entropy and KL divergence was adopted for training. In general, a CNN can distinguish defective from nondefective samples on training data; however, the results on testing data are not as good as those on training data. Therefore, we added noise to the feature space of the CNN to prevent the network from being limited by the minority samples during training, and we prevented the noise from being removed by the KL-divergence constraint. Finally, several comparison results on three data sets were introduced to demonstrate the performance and effectiveness of CNN noise. Through the different data sets, we also verified that our method is a general method for imbalanced data. In particular, DAGM 2007 and NEU are synthetic data sets for defect detection on textured surfaces, and our approach performs well with a smaller network structure compared with other deep models. In addition, adding noise improves the defective-sample accuracy by over 40%. Finally, the accuracy remains higher than 96% even when the imbalance ratio (IR) is one hundred. As shown above, the proposed method can be applied to defect detection problems and other imbalanced data sets after fine-tuning.

Data Availability
Three open data sets (DAGM 2007, NEU surface defect, and MNIST) are utilized to demonstrate the performance and effectiveness of our method.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Journal of Sensors