Improving the Performance of Deep Learning Model-Based Classification by the Analysis of Local Probability

Generally, the performance of deep learning-based classification models depends heavily on the features captured from the training samples. When a sample is unclear or contains similar numbers of features from several objects, it is hard to classify. Human beings, however, classify objects not only by their features but also by additional information, such as how probable each object is in a given environment. For example, when we know that one object is more probable in the environment than the others, we can more easily decide what the sample shows. We call this kind of probability local probability, since it is tied to the local environment. In this paper, we propose a new framework, named L-PDL, that improves the performance of deep learning by analysing this local probability. Firstly, our method trains the deep learning model on the training set; the trained model then gives the probability of objects on each sample. Secondly, we obtain the posterior local probability of objects on the validation set. Finally, this probability conditionally cooperates with the probability of objects on the testing samples. We evaluate three popular deep learning models on three real datasets. The experimental results show that our method improves the accuracy on these datasets and outperforms the state-of-the-art methods.


Introduction
Deep learning models have proved efficient in many applications [1][2][3][4][5][6][7][8]. Generally, the performance of a deep learning-based classification model depends on the captured features [9][10][11]. When a deep learning model is used for classification, it outputs the probability of each object. Then, the object that has the maximum value is selected as the final result.
In some cases, the probability of a wrong object may be higher than that of the correct one. This is caused by similar features among these objects or by the low efficiency of the trained models. To capture more features for higher accuracy, the structure of the models becomes bigger, but this is limited by many factors such as the computational resources or the vanishing gradient problem [12][13][14].
Thus, there should be another way to improve the performance of deep learning models in real applications. Unlike deep learning models, human beings classify an object based on not only its features but also other factors. Figure 1 illustrates examples of this kind. The probabilities of person and animal may both be high in these samples, which can easily cause wrong classification results. In Figure 1(a), if we know there are no big animals in this area, the object is more likely to be a person. In Figure 1(b), if we know there is no human activity in this area, the object is more likely to be an animal. We call this local probability, which represents the probability of objects in an environment. We believe that this is the reason why human beings can classify an object even though they have not seen it clearly.
In this paper, we built a novel framework (L-PDL, Local Probability-based Deep Learning) that improves the classification performance on the samples based on the analysis of local probability. Firstly, our method trains the deep learning model on the training set. Then, the trained model gives the probability of objects on each sample. Secondly, we get the local probability of objects on the validation set. Finally, this probability conditionally cooperates with the probability of objects on the testing samples.
Our contributions can be summarized as follows. (1) We built a novel framework that uses the local probability to increase the classification accuracy. Our framework does not need bigger models or more training samples, yet it achieves higher accuracy than the existing methods. (2) Our framework increases the robustness of deep learning models for the classification task. The local probability may vary across environments. In this case, our framework only needs to update the local probability, which costs less than retraining or transferring the models.
We ran our framework and the existing methods on the samples of CIFAR-10 [15][16][17], CIFAR-100 [18][19][20], and Mini-ImageNet [21][22][23]. All of these evaluations demonstrated the effectiveness of our framework. The paper is organized as follows. Section 1 introduces the background and our contributions. Section 2 introduces the existing methods and their problems. In Section 3, we present our framework and related analyses. The experiment is presented in Section 4. Section 5 gives the conclusion and future work.

Related Work
(i) VoVNet-57 is designed for the object detection task. It consists of a block of 3 convolution layers and 4 stages of OSA modules, with an output stride of 32 [24]. An OSA module comprises 5 convolution layers with the same input/output channel to minimize the memory access cost (MAC). Whenever the stage goes up, the feature map is downsampled by 3 × 3 max pooling with stride 2. VoVNet-57 has more OSA modules at the 4th and 5th stages, where downsampling is done in the last module.
(ii) VGG16 is a variant of the VGG models for image recognition [25]. Figure 2 shows the structure of this model. The image is passed through a stack of convolutional layers that use filters with a very small receptive field: 3 × 3. The convolution stride is fixed to 1 pixel. The padding is 1 pixel for the 3 × 3 convolutional layers. Spatial pooling is carried out by five max-pooling layers, which follow some of the convolutional layers. Max pooling is performed over a 2 × 2-pixel window, with stride 2. Three fully connected layers follow the stack of convolutional layers: the first two have 4096 channels each, and the third performs 1000-way ILSVRC classification and thus contains 1000 channels (one for each class). The final layer is the soft-max layer. All hidden layers are equipped with the rectification (ReLU) nonlinearity.
(iii) ResNeSt50 is a state-of-the-art deep learning framework for image classification that uses a modular Split-Attention block and enables attention across feature-map groups [26]. Stacking these Split-Attention blocks ResNet-style yields a new ResNet variant, which is called ResNeSt. There are four versions of ResNeSt. From ResNeSt50 to ResNeSt269, the structure becomes bigger and more complicated and can reach higher accuracy when the training samples are more numerous and larger. Based on the size of the testing samples and the computational resources, we use ResNeSt50 in this paper.
These models have been widely used and are useful in many applications. To increase their accuracy, we have to increase the number of training samples, which is hard work in many applications. Furthermore, the structure of these models has to be deeper, and the training process needs some special techniques. In real applications, however, there is other information that can be used to improve the accuracy.
The local probability is one such kind of information, and it is introduced in the next section.
Some fusion operators have been proposed to improve the performance of classification by using multiple models [27]. In that paper, the authors addressed mobile app traffic classification by proposing a multiclassification approach that intelligently combines the outputs of state-of-the-art classifiers proposed for mobile and encrypted traffic classification. In this paper, we also apply our framework to one of these fusion operators for higher accuracy.

Our Framework
Before giving the details of our framework, we give the following definitions, which are used to explain the implementation of the methods.

Preliminaries.
We set $S_n$ as a sample and $L_k$ as the label of an object. We set $G_n$ as the ground truth on $S_n$, where $G_n \in \{L_k\}$ [28,29]. The label is generally a number, which simplifies the computation [30,31]. For example, when there are 10 objects to be classified, the labels range from 0 to 9. Figure 3 introduces our framework, which is named L-PDL (Local Probability-based Deep Learning). Firstly, our framework trains a deep learning model on the training set. Then, the trained model gives the probability of labels (each label represents a kind of object) on each sample of the validation set. Secondly, we get the posterior local probability of objects on the validation set.

Our Framework.
Thirdly, we confirm the parameters of the conditional cooperation between this probability and the probability of labels. Finally, we use this conditional cooperation between the posterior local probability and the output probability of the models on the testing samples to get the final results.

The Probability by the Trained Model.
We define $P(M(S_n) = L_k)$ as the probability of label $L_k$ on the sample $S_n$ by the trained model $M$. Then, the most probable result $L_x$ is selected by the following equation:

$$L_x = \arg\max_{L_k} P(M(S_n) = L_k),$$

which is used by deep learning models to predict the final result, as Figure 4 illustrates. Then, we define the set of labels $L_k$ whose model probability satisfies a threshold condition; this selects the $L_k$ that have high enough probability to cooperate with the local probability. Figure 5(a) shows the probability of the labels that belong to the samples in the training set. Figure 5(b) shows the local probability of the labels that belong to the samples in the validation and testing sets. As we can see in this figure, some labels may have fewer samples than the others. We define $P(L_k)$ as the local probability of label $L_k$ and estimate it by the posterior local probability of $L_k$ on the validation set. Then, we make $P(L_k)$ assist the model probability to get correct results on the testing set.
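The two quantities defined above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and it assumes that the posterior local probability is estimated as the relative frequency of each ground-truth label in the validation set, which the text does not spell out.

```python
from collections import Counter

def predict_label(probs):
    """Return the index k that maximises the model probability P(M(S_n) = L_k)."""
    return max(range(len(probs)), key=lambda k: probs[k])

def posterior_local_probability(val_labels, num_labels):
    """Assumed estimator: relative frequency of each ground-truth label
    among the validation samples."""
    if not val_labels:
        raise ValueError("the validation set must not be empty")
    counts = Counter(val_labels)
    total = len(val_labels)
    return [counts.get(k, 0) / total for k in range(num_labels)]

# Hypothetical softmax output for one sample over 10 labels.
probs = [0.0, 0.0, 0.01, 0.0, 0.50, 0.01, 0.0, 0.48, 0.0, 0.0]
print(predict_label(probs))  # label 4 has the maximum probability

# Hypothetical validation labels from an environment where label 4 never occurs.
val_labels = [7, 7, 3, 7, 5, 3, 7, 7]
print(posterior_local_probability(val_labels, 10))
```

In this toy environment the model's top label (4) has zero local probability, while the runner-up (7) is the most frequent label.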

Conditional Cooperation.
In this subsection, we give two conditions that must hold for the cooperation between the local probability and the model probability:

$$\text{Con 1: } P(M(S_n) = L_x) < \delta, \qquad \text{Con 2: } P(M(S_n) = L_k) > \varepsilon.$$

Con 1 means that we only reconsider the result $L_x$ (the label with the maximum probability among all labels $L_k$) when its probability is smaller than $\delta$. Con 2 means that we then consider only the labels whose probabilities are bigger than $\varepsilon$ as the potential set of the final result. Based on these conditions, we derive two methods from our framework.
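As a rough sketch of the two conditions, the following Python functions test whether a prediction should be reconsidered and collect the potential set of labels. The function names and the example values of δ and ε are illustrative assumptions, not values taken from the paper.

```python
def needs_reconsideration(probs, delta):
    """Con 1: the maximum model probability is smaller than delta."""
    return max(probs) < delta

def candidate_labels(probs, epsilon):
    """Con 2: indices of the labels whose model probability is bigger than epsilon."""
    return [k for k, p in enumerate(probs) if p > epsilon]

# Hypothetical model outputs over 10 labels.
unsure = [0.0, 0.0, 0.01, 0.0, 0.50, 0.01, 0.0, 0.48, 0.0, 0.0]
confident = [0.0, 0.0, 0.0, 0.0, 0.95, 0.0, 0.0, 0.05, 0.0, 0.0]

print(needs_reconsideration(unsure, delta=0.6))     # True: 0.50 < 0.6
print(needs_reconsideration(confident, delta=0.6))  # False: 0.95 >= 0.6
print(candidate_labels(unsure, epsilon=0.1))        # [4, 7]
```

Labels with near-zero model probability never enter the potential set, so the local probability alone cannot decide the final result.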

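Joint Cooperation.
The first method is the joint cooperation between the model probability and the local probability, and we call this L-PDL-joint from now on.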

Weighted Cooperation.
The second method is the weighted cooperation, which uses a weight $\omega$, and we call this L-PDL-weight from now on. When using these methods, we should compute $P(L_k)$, $\delta$, $\varepsilon$, and $\omega$ (only for L-PDL-weight) on the validation set. We can get the posterior local probability $P(L_k)$ on the validation samples. $\delta$ is the threshold that decides whether we reconsider a result or not. For example, if $\max_k P(M(S_n) = L_k) < 0.6$, we consider that the trained model is not highly sure about the correctness of the result. The parameter $\varepsilon$ means that we only select some of the labels as the potential set of the final result. This prevents labels that have $P(M(S_n) = L_k) \approx 0$ from being selected as the final result because of the local probability alone. In other words, the local probability should not be the only reason to select the final result.
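The exact equations for the two methods are not reproduced in this text, so the sketch below rests on two loudly stated assumptions: joint cooperation is read as the product of the model probability and the local probability, and weighted cooperation as the convex combination $\omega P(M(S_n) = L_k) + (1 - \omega) P(L_k)$. Both forms, and all function names, are hypothetical readings rather than the authors' definitions.

```python
def _cooperate(probs, delta, epsilon, score):
    """Apply Con 1 and Con 2, then re-rank the potential set with `score`."""
    top = max(range(len(probs)), key=lambda k: probs[k])
    if probs[top] >= delta:          # Con 1 fails: keep the model's result
        return top
    candidates = [k for k, p in enumerate(probs) if p > epsilon]  # Con 2
    if not candidates:               # nothing passes epsilon: keep the model's result
        return top
    return max(candidates, key=score)

def l_pdl_joint(probs, local_prob, delta, epsilon):
    """ASSUMED joint form: score = model probability * local probability."""
    return _cooperate(probs, delta, epsilon,
                      lambda k: probs[k] * local_prob[k])

def l_pdl_weight(probs, local_prob, delta, epsilon, omega):
    """ASSUMED weighted form: score = omega * model + (1 - omega) * local."""
    return _cooperate(probs, delta, epsilon,
                      lambda k: omega * probs[k] + (1 - omega) * local_prob[k])

# Hypothetical model output and local probability over 10 labels.
probs = [0.0, 0.0, 0.01, 0.0, 0.50, 0.01, 0.0, 0.48, 0.0, 0.0]
local_prob = [0.1, 0.1, 0.1, 0.1, 0.0, 0.1, 0.1, 0.2, 0.1, 0.1]

print(l_pdl_joint(probs, local_prob, delta=0.6, epsilon=0.1))              # 7
print(l_pdl_weight(probs, local_prob, delta=0.6, epsilon=0.1, omega=0.5))  # 7
```

With `omega=1.0` the assumed weighted rule ignores the local probability and falls back to the model's own ranking, which is one way to see why ω has to be tuned on the validation set.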

Why Are Our Methods Better?
In this section, we try to explain why our methods can perform better than the existing methods.
First reason: in a deep learning model, the captured features play an important role in the classification. The number of captured features depends on the structure of the layers [32,33]. The training process of deep learning selects the features that can represent the samples. Then, the probability is used to represent the distribution over these features. Thus, the object is more likely to be the label $L_k$ than $L_q$ when the following relation holds:

$$E(P(M(S_n) = L_k)) > E(P(M(S_n) = L_q)),$$

where $E(\cdot)$ is the expected value and $L_k \neq L_q$. Thus, selecting the labels that have high probability is reasonable when reconsidering the result.
Second reason: there may be the following relation:

$$P(M(S_n) = L_q) > P(M(S_n) = L_k = G_n),$$

which means that the trained model predicted a wrong result. In this kind of case, especially when $P(M(S_n) = L_q) < \delta$, there may be $P(M(S_n) = L_k = G_n) \not\approx 0$, which shows that the correct result may be another label. For example, $P(M(S_n) = L_4) = 0.50$ and $P(M(S_n) = L_7) = 0.48$ in Figure 4. In this kind of case, if we have the local probabilities, we can easily select the correct result $L_7$, which is "horse" in Figure 4.
Table 1 categorizes the reviewed works and our framework along with their main distinctive characteristics. ResNeSt50, VGG16, and VoVNet-57 are the deep learning models. Fusion operators [27] and our framework are fusion methods, which are based on these deep learning models. These models need to be trained on the training set. Our framework needs to be trained on the validation set. Some of the fusion operators need to be trained on the validation set, while the others do not [27]. Our framework can be applied to a single model or to multiple ones.
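To make the Figure 4 example concrete, the following worked arithmetic uses hypothetical local probabilities, $P(L_4) = 0.05$ and $P(L_7) = 0.20$, which are illustrative values and not taken from the paper. It also assumes that the cooperation multiplies the model probability by the local probability, which is only one possible reading of the joint method.

```latex
\begin{align*}
P(M(S_n) = L_4)\,P(L_4) &= 0.50 \times 0.05 = 0.025,\\
P(M(S_n) = L_7)\,P(L_7) &= 0.48 \times 0.20 = 0.096.
\end{align*}
```

Since $0.096 > 0.025$, the re-ranked result under these assumptions is $L_7$, although the model alone prefers $L_4$ by a margin of only 0.02.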

Experiment
We evaluate our methods against the existing ones on some real datasets in different local probability cases. When the parameters are randomized, we repeat the evaluation 1000 times. We trained the deep learning models on these datasets with the reported default settings. We set the number of epochs [34,35] to 10 for all models on every training set. We do not focus on designing the structure or tuning the hyperparameters. Instead, we focus on how to use the local probability to increase the accuracy.

The Evaluation on CIFAR-10.
CIFAR-10 [15][16][17] has 50000 training samples and 10000 testing samples that belong to 10 labels. Each sample is an RGB image that has three channels: red, green, and blue. We use the 50000 training samples to train the models. Then, we have 10000 samples left. We assign different local probabilities to these samples as Table 2 shows.
We use three kinds of local probability to evaluate the methods. In this table, Zero20 means 20% of the labels have zero samples. We define Zero40 (40% of the labels have zero samples) and Zero80 (80% of the labels have zero samples) in the same way. The labels that get zero samples are randomly selected. Figure 6 shows examples of these local probabilities.
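The Zero20, Zero40, and Zero80 cases can be sketched as follows. The helper names, the fixed seed, and the toy dataset of (sample, label) pairs are assumptions made for illustration only.

```python
import random

def zero_labels(num_labels, zero_fraction, seed=0):
    """Randomly choose round(zero_fraction * num_labels) labels that get zero samples."""
    n_zero = round(zero_fraction * num_labels)
    return set(random.Random(seed).sample(range(num_labels), n_zero))

def apply_zero_case(samples, removed):
    """Keep only the (sample, label) pairs whose label was not removed."""
    return [(x, y) for x, y in samples if y not in removed]

# Toy stand-in for the 10000 held-out CIFAR-10 samples: 1000 per label.
samples = [(i, i % 10) for i in range(10000)]

removed = zero_labels(num_labels=10, zero_fraction=0.2)   # Zero20
kept = apply_zero_case(samples, removed)
print(len(removed), len(kept))  # 2 labels removed, 8000 samples kept
```

Because every label holds the same number of samples in this toy set, removing 20% of the labels leaves exactly 8000 samples.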
Then, the number of samples for the validation and testing sets is less than 10000 in these local probability cases. For example, there are about 8000 samples left for these sets in the Zero20 case.
We trained VoVNet-57 [24], VGG16 [25], and ResNeSt50 [26] on the training samples to generate the trained models. Then, we use 1000 samples as the validation set and the remaining samples as the testing set. As we can see in Table 2, our methods can increase the accuracy by about 2.56% (in the Zero20 case), 5.83% (in the Zero40 case), and 13.06% (in the Zero80 case) compared to the best of the existing methods.
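The validation/testing protocol can be sketched as a split followed by a grid search over the thresholds. This is a hedged illustration: the paper does not state how $\delta$ and $\varepsilon$ are searched, so the grid search, the helper names, and the stand-in `product_rule` (which multiplies model and local probabilities) are all assumptions.

```python
import random
from itertools import product

def split_validation_test(samples, n_validation, seed=0):
    """Shuffle the samples and split them into a validation and a testing part."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_validation], shuffled[n_validation:]

def accuracy(rule, samples, local_prob, delta, epsilon):
    """Fraction of (probs, ground_truth) pairs that `rule` labels correctly."""
    hits = sum(rule(p, local_prob, delta, epsilon) == g for p, g in samples)
    return hits / len(samples)

def tune_thresholds(rule, validation, local_prob, deltas, epsilons):
    """Assumed grid search: the (delta, epsilon) pair with the best validation accuracy."""
    return max(product(deltas, epsilons),
               key=lambda de: accuracy(rule, validation, local_prob, de[0], de[1]))

def product_rule(probs, local_prob, delta, epsilon):
    """Stand-in cooperation rule (an assumption, not the paper's equation)."""
    top = max(range(len(probs)), key=lambda k: probs[k])
    if probs[top] >= delta:
        return top
    candidates = [k for k, p in enumerate(probs) if p > epsilon] or [top]
    return max(candidates, key=lambda k: probs[k] * local_prob[k])

# Toy data: (model probabilities, ground-truth label) pairs over 3 labels.
samples = [([0.50, 0.48, 0.02], 1), ([0.90, 0.05, 0.05], 0)] * 3
validation, testing = split_validation_test(samples, n_validation=4)
local_prob = [0.3, 0.6, 0.1]

delta, epsilon = tune_thresholds(product_rule, validation, local_prob,
                                 deltas=[0.4, 0.6], epsilons=[0.1])
print(delta, epsilon, accuracy(product_rule, testing, local_prob, delta, epsilon))
```

The thresholds are chosen on the validation part only and then applied unchanged to the testing part, mirroring the 1000-sample validation set used in the experiments.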

The Evaluation on CIFAR-100.
This dataset is just like CIFAR-10, except that it has 100 classes containing 600 images each [18][19][20]. There are 500 training images and 100 testing images per label. We use the 50000 training samples to train the models. Then, we have 10000 samples left. We assign different local probabilities to these samples.
We define Zero20, Zero40, and Zero80 in the same way as introduced in Section 4.1. Then, we use 1000 samples as the validation set. As we can see in Table 3, our methods can increase the accuracy by about 2.57% (in the Zero20 case), 6.36% (in the Zero40 case), and 18.51% (in the Zero80 case) compared to the best of the existing methods.

The Evaluation on Mini-ImageNet.
The Mini-ImageNet [21][22][23] dataset is designed for few-shot learning evaluation. Its complexity is high because it uses ImageNet images, but it requires fewer resources and less infrastructure than running on the full ImageNet dataset. In total, there are 100 labels with 600 samples of 84 × 84 colour images per label. We use 48000 training samples to train the models. Then, we have 12000 samples left. We assign different local probabilities to these samples.
We define Zero20, Zero40, and Zero80 in the same way as introduced in Section 4.1. Then, we use 1000 samples as the validation set. As we can see in Table 4, our methods can increase the accuracy by about 2.26% (in the Zero20 case), 4.83% (in the Zero40 case), and 13.94% (in the Zero80 case) compared to the best of the existing methods.

Random Case on Two Datasets.
In this subsection, we randomly assign the local probability to CIFAR-100 and Mini-ImageNet. In more detail, we randomly select the labels and assign a random local probability to them to evaluate the methods.
Rand (.) is the function that outputs a random probability value. If the randomized value is smaller than 0, we use 0 instead of this value. Then, we can generate the local probability with this function. For example, if the number of original samples for an object label is 1000 and Rand (0, 1) = 0.9, we have 900 samples for this label in the local probability case. Figure 7 shows examples of Rand (0, 1), Rand (−1, 1), and Rand (−2, 1).
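A possible sketch of Rand (.) and of the resulting per-label sample counts is given below. The text does not name the distribution, so drawing uniformly from the interval is an assumption, as are the function names, the seed, and the rounding to the nearest integer.

```python
import random

def rand_fraction(low, high, rng):
    """Draw a value from [low, high] (uniform is ASSUMED) and replace negatives by 0."""
    return max(0.0, rng.uniform(low, high))

def local_sample_counts(original_counts, low, high, seed=0):
    """Scale every label's original sample count by its own random fraction."""
    rng = random.Random(seed)
    return [round(n * rand_fraction(low, high, rng)) for n in original_counts]

# The example from the text: 1000 original samples and a draw of 0.9.
print(round(1000 * 0.9))  # 900

# Hypothetical Rand(-1, 1) case over 100 labels with 1000 samples each:
# every negative draw is clipped to 0, so many labels end up with zero samples.
counts = local_sample_counts([1000] * 100, low=-1, high=1)
print(sum(c == 0 for c in counts), "labels have zero samples")
```

Lowering the lower bound, as in Rand (−2, 1), makes negative draws more frequent, so a larger share of the labels is left with zero samples.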
As we can see in Table 5, our methods can increase the average accuracy by about 1.13% (in the Rand (0, 1) case), 8.76% (in the Rand (−1, 1) case), and 12.20% (in the Rand (−2, 1) case) compared to the best of the existing methods.

Multiple Models on Two Datasets.
In this subsection, we apply our framework to the fusion operators, which use the probabilities of multiple models [27]. We select the soft combiners, which require some parameters to be estimated, usually by means of a validation set. As a representative, we selected the class-conscious trainable combiner based on KL weights (named CC-KL trainable in Table 6), which achieved better performance than the other methods in that paper. Then, we applied our framework to the result of this method, which is named CC-KL trainable with our framework in Table 6.
As we can see in Table 6, CC-KL trainable can increase the accuracy by using the probability of models. On the other hand, the performance is limited by the accuracy of these models. As this table shows, our framework can further increase the accuracies with the cooperation of CC-KL trainable (introduced in [27]), which are about 0.96% (in the Rand (0, 1) case), 8.56% (in the Rand (−1, 1) case), and 10.95% (in the Rand (−2, 1) case) higher than the existing methods on average.

Analysis
We have evaluated our methods against the existing ones on real datasets with different local probabilities. The results show the effectiveness of our framework in these cases. When using deep learning models in real applications, the accuracy can be improved by analysing the local probability. The local probability can be obtained by computation on the validation set. In some cases, it can be obtained in other ways, for example, from the experience of other users about the probability of the objects in an environment. Based on this kind of information, we can conclude that the object in Figure 4 may not be a "deer" but a "horse," as Figure 8 shows.
Another advantage of using L-PDL is that we do not need to retrain or transfer the models to each local environment to obtain robustness. This is like a person who follows the saying "when you are in Rome, do as the Romans do." This ability lets a person live well in a new place as soon as possible, which can be called robustness. In this paper, we also implemented this kind of robustness by using our framework.
As our framework is based on the existing deep learning models, the computational complexity is increased compared with these models. The models must output the probability, and there must be a validation set with the ground truth for tuning the parameters, which increases the complexity of managing samples. Furthermore, our framework incurs the additional cost of computing the cooperation between the local probability and the output of the models.

The Introduction of the Employed Acronyms.
We use Table 7 to introduce the acronyms employed in this paper for the reader's convenience.

Table 7: The employed acronyms (rows 13 to 17).
No. | Acronym | Introduction
13 | L-PDL-joint | Joint cooperation based on our framework, introduced in equation (4)
14 | L-PDL-weight | Weighted cooperation based on our framework, introduced in equation (5)
15 | Rand (.) | The function that outputs a random probability value
16 | CC-KL trainable | The existing class-conscious trainable combiner-based KL weights method that is introduced in the work [27]
17 | CC-KL trainable with our framework | Our framework on the class-conscious trainable combiner-based KL weights method that is introduced in the work [27]

Conclusions
In this paper, we have introduced a novel framework that combines the local probability with the probability of objects. Our framework uses the output of the model to present the probability of objects.
Then, this probability conditionally cooperates with the local probability to achieve higher accuracy. Our framework can improve the robustness of deep learning classification models in an environment. Furthermore, we also applied our framework to the existing fusion operators, which can further increase the accuracy. The evaluation results demonstrated the effectiveness of our framework for the deep learning models and for the fusion operators built on these models. Thus, our framework can be a choice for increasing the accuracy in real applications.
In future work, we will study a deeper cooperation between the model probability and the local probability, for example, how to use the output that precedes the probability of labels. This output may include more information about the features of objects, which can correctly represent "what the model has seen in the samples." Furthermore, a deeper cooperation between our framework and the fusion operators may further increase the accuracy, which is another direction of our future work.

Conflicts of Interest
The authors declare that they have no conflicts of interest.