Classification of Long-Tailed Data Based on Bilateral-Branch Generative Network with Time-Supervised Strategy

To address the long-tailed data distribution that is widespread in real-world datasets, this paper proposes a bilateral-branch generative network model. The data of the second branch is constructed by resampling and then refined by the generative network training method to improve data quality. The bilateral-branch network model is used to curb the risk of gradient explosion and to avoid over-fitting and under-fitting through the combined effect of different data branches. Meanwhile, a time-supervised strategy is introduced to improve the model's operational efficiency and its ability to cope with extreme conditions by supervising and collaboratively controlling the bilateral-branch generative network with time-variant parameters. The time-supervised strategy preserves the accuracy of the model while reducing the number of iterations. Experimental results on two publicly available datasets, CIFAR10 and CIFAR100, show that the proposed method effectively improves the performance of long-tailed data classification.


Introduction
With the rapid development of convolutional neural network algorithms in recent years, the performance of image classification has improved impressively. This success is inextricably linked to the availability of high-quality large-scale datasets, such as ImageNet ILSVRC 2012 [1], MS COCO [2], and the Places Database [3]. Compared with these high-quality datasets, however, real-world datasets are usually biased, and a uniform distribution of data is difficult to ensure. More often than not, certain classes are very abundant while the remaining classes are very scarce, which leads to a long-tailed distribution of data [4,5] and degrades the performance of image classification. According to the related literature, class re-balancing strategies are currently the usual response to uneven data distribution. They fall into two categories, namely resampling strategies [6][7][8][9][10][11][12][13] and re-weighting strategies [14][15][16][17]. The resampling strategy resamples the source data according to the desired frequency distribution to obtain a new dataset, by copying minority data [6,8,12,18] and reducing majority data [7,12,13]. This strategy can reduce the error rate caused by unbalanced data during training. Although resampling can give better results, it still has a negative effect on the model. For example, the SMOTE algorithm [11] merely repeats and abandons the original data in the process of resampling. Although it changes the data distribution, it cannot bring more classification information to the deep learning model. Therefore, classical resampling tends to make the tail data prone to over-fitting, while the head data tends toward under-fitting. This can be effectively avoided by a re-weighting method that adds a regularization term to the loss.
The regularized loss can often be expressed as

$$\mathrm{Loss} = \mathrm{loss}_1 + \lambda \cdot \mathrm{loss}_2, \qquad 0 \le \lambda \le 1. \tag{1}$$

The over-fitting and under-fitting of the model are suppressed by introducing a new loss function as a constraint. However, the regularization method also has its drawbacks. Because the regularization parameter is introduced into the loss as a restriction, the model parameters sometimes fail to converge, and in extreme cases gradient explosion even results. Therefore, some scholars have proposed ways to prevent over-fitting and under-fitting with data-dependent regularizers [19][20][21][22]. For example, the algorithm proposed by Zhou et al. [23], based on a bilateral-branch network model, achieves good accuracy in the classification of long-tailed data. The algorithm obtained first place in iNaturalist2019 with an error rate of 30.38% on the iNaturalist2018 public dataset. The model of Zhou et al. re-weights the loss function by a data-driven approach. This data-driven approach not only effectively curbs the risk of gradient explosion; the model also converges more easily under the combined effect of different data branches while avoiding over-fitting on tail data [4]. However, this method places very strict requirements on data quality. Since the regularization term itself is data-driven, if the data quality of the second branch is poor, $\mathrm{loss}_2$ will adversely affect the direction of the overall model learning. Therefore, how to construct a high-quality second branch with a different distribution becomes a new problem.
Many works focus on minority samples and re-balance the data by augmenting them [11,24,25], while other works provide a major-to-minor translation to re-balance the data distribution [26][27][28]. The M2m algorithm proposed by Kim et al. [29] differs from the above methods and gives a solution from the perspective of resampling. M2m resamples data through data generation with a generative adversarial network [26,30,31]. This method solves the problem of unbalanced data distribution by constructing minority-class data from majority-class data, while bringing more classification information to the data and improving data quality. It can effectively increase the number of minority-class samples while greatly reducing over-fitting on minority-class data. However, it often requires an already pre-trained network, and generating data during training still requires a large number of iterations, which reduces the efficiency of model learning.
The main contributions of this paper are as follows.
(1) Incorporating the respective advantages of the bilateral-branch model and the generative network model, this paper proposes a bilateral-branch generative network model. The data of the second branch is constructed by resampling and then generated by the generative network training method to improve data quality. The bilateral-branch network model is used to curb the risk of gradient explosion, and the combined effect of different data branches keeps the model from over-fitting and under-fitting, which improves long-tailed data classification.
(2) Since the generative network model adds a large number of iterations in the process of generating data, which affects the efficiency of the model, this paper introduces a time-supervised strategy. The strategy supervises and limits the number of iterations of the generative network through time-variant parameters, improves the operational efficiency of the model and its ability to cope with extreme conditions, and preserves the accuracy of the model while reducing the number of iterations.
(3) The accuracy of the algorithm was tested on two publicly available datasets, CIFAR10 and CIFAR100 [32], under different distributions, and the method proposed in this paper achieved higher accuracy than both the algorithm of Zhou et al. and the M2m algorithm.

Construction of a Bilateral-Branch Generative Network Model
The overall framework of the bilateral-branch generative network model proposed in this paper is shown in Figure 1, and the model is abbreviated as BBGN (Bilateral-Branch Generative Network).
As shown in the figure, the BBGN model first constructs the inverse-distribution data branch by data resampling. We refer to the source data branch as the first branch, and to the inverse-distribution data constructed by resampling as the second branch. After the data of the second branch has been produced by resampling, the generating network (GN) module uses the first-branch data as its data source to generate new high-quality data that replaces data of the second branch, thereby performing data augmentation on the second branch. In the GN module and in the subsequent network model we introduce the time-supervised parameters (1 − α) and α, respectively, to coordinate and control these two functional modules.
After data augmentation is completed, features are extracted from the two data branches using a pyramid-shaped multilayer feature perceptron to form separate feature sets.
Under the control of the time-supervised parameter α, the two feature sets are fused into a single feature set. The fused features are average-pooled (the GAP layer in the figure) and passed to the fully connected classifier for classification. The model loss is calculated, and the model weights are updated by backpropagation based on the classification results and the loss function.
Through the coordination and control of the time-supervised strategy, in the early stage of BBGN learning the BBN network is influenced by the time-supervised parameter so that the data of the second branch plays almost no role in model learning, while the GN network, governed by the same parameter, performs almost no data enhancement on the second branch. At this point the second branch, consisting only of resampled data, plays a weak role, and learning is dominated by the source data of the first branch. Over time, under the time-supervised strategy, the influence of the two branches on model learning gradually changes, and the influence of the second branch gradually increases. At the same time, the GN network is gradually activated by the time-supervised parameter, and the resampled data of the second branch starts to play its role in model learning after being enhanced by the GN module. The time-supervised strategy enables our algorithm to maintain accuracy while reducing the number of iterations and greatly improving computational efficiency.
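The inverse-distribution second branch described above can be illustrated with a short sketch. It assumes the reversed sampler of the original BBN work [23], in which each class is drawn with probability proportional to N_max / N_i; the function name and the example class sizes are hypothetical and are not taken from this paper.

```python
def reversed_sampling_probs(class_counts):
    """Per-class sampling probabilities for the inverse-distribution branch.

    Assumption: weight_i = N_max / N_i, normalised to sum to 1, as in the
    reversed sampler of the original BBN paper. BBGN may differ in detail.
    """
    n_max = max(class_counts)
    weights = [n_max / n for n in class_counts]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical long-tailed class sizes (head, middle, tail).
counts = [5000, 500, 50]
probs = reversed_sampling_probs(counts)
```

Under this assumption the tail class is sampled far more often than the head class, which is what gives the second branch its inverse distribution.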

Bilateral-Branch Network (BBN) Model.
The BBN model is data-driven and uses a loss-function weighting method to avoid over-fitting and under-fitting. The BBN model starts from a regularized weighting perspective, but unlike the usual way of constructing regularized expressions, it achieves regularized weighting by learning from branches of data with different distributions and fusing the losses generated by the branches. The pseudocode of the BBN model algorithm is as follows.
At row 14 of Algorithm 1, we obtain the model prediction $\mathrm{Target}_{pred}$ by mixing $\mathrm{feature}_1$ and $\mathrm{feature}_2$. This combiner function can be expressed by equation (6). Then, at row 18 of Algorithm 1, the Combiner2 function of the BBN model can be expressed by equation (2), where $\mathrm{loss}_1$ is the error generated in the learning process by the first-branch data, and $\mathrm{loss}_2$ is the error generated by the second-branch data. At rows 16 and 17 of Algorithm 1, the L function can be expressed by equation (8), and Loss is the total error generated by the BBN model in the learning process. Loss affects the learning process of the model by combining the error components of the two data branches.

Generating Network Models.
The pseudocode of our proposed algorithm for generating models in BBGN is shown below.
At the first row of Algorithm 2, a Bernoulli distribution function is used to select the class to be generated. Here $\mathrm{Target}_G$ denotes the class to be generated, $N_{\mathrm{Target}_G}$ denotes the total number of samples of that class, $N_0$ denotes the total number of samples of the class with the highest frequency, and $P_G$ denotes the probability that the class is selected as the class to be generated.
At row 5 of Algorithm 2, the generation source class $\mathrm{Target}_O$ is selected using a distribution function in which β ∈ (0, 1) is a fixed parameter, $N_{\mathrm{Target}_O}$ denotes the total number of samples of the generation source class, $N_{\mathrm{Target}_G}$ denotes the total number of samples of the class to be generated, and $P_O$ denotes the probability that the class is selected as the generation source class. The generation source must be selected based on the class to be generated. Since this data augmentation algorithm constructs minority-class samples by extracting class-irrelevant features from the majority class and class-relevant features from the minority class, the method places certain requirements on the quality of the generation source data: the class-irrelevant features contained in the data should be as rich as possible, so as to improve the generalization ability of the model for minority-class learning.
In Algorithm 2, the class to be generated, $\mathrm{Target}_G$, and the generation source class, $\mathrm{Target}_O$, are selected by resampling the class labels with these probability distributions. This approach meets our requirements on the quality of the generation source data. At the same time, the possibility of generating minority-class data from other minority-class data is preserved, which expands the choice of generation source classes.
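The two selection steps can be sketched as follows. The exact expressions for $P_G$ and $P_O$ are not reproduced in this text, so the sketch borrows the forms used in M2m [29] as an assumption: $P_G = (N_0 - N_{\mathrm{Target}_G}) / N_0$, and $P_O$ proportional to $1 - \beta^{\max(N_{\mathrm{Target}_O} - N_{\mathrm{Target}_G},\, 0)}$. The function names, the default β = 0.999, and the class sizes are hypothetical.

```python
import random


def generation_probability(n_target, n_max):
    """P_G: chance that a class with n_target samples is chosen for generation.

    Assumed form (borrowed from M2m): (N_0 - N_target) / N_0.
    """
    return (n_max - n_target) / n_max


def source_class_probs(class_counts, target_class, beta=0.999):
    """P_O over candidate source classes for a given target class.

    Assumed form (borrowed from M2m): proportional to
    1 - beta ** max(N_source - N_target, 0), so larger classes are preferred.
    Returns all zeros when no class is larger than the target.
    """
    n_target = class_counts[target_class]
    weights = [1.0 - beta ** max(n - n_target, 0) for n in class_counts]
    total = sum(weights)
    if total == 0.0:
        return weights
    return [w / total for w in weights]


def sample_source_class(class_counts, target_class, beta=0.999, rng=random):
    """Draw one source class label according to P_O."""
    probs = source_class_probs(class_counts, target_class, beta)
    return rng.choices(range(len(class_counts)), weights=probs, k=1)[0]


counts = [5000, 500, 50]  # hypothetical class sizes (head, middle, tail)
p_g_tail = generation_probability(counts[2], max(counts))
p_o_tail = source_class_probs(counts, target_class=2)
```

With these assumed forms the head class is never chosen as a generation target (its $P_G$ is 0), while a mid-sized class can still serve as a source for a smaller one, which matches the remark that minority classes may also act as generation sources.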
The framework diagram of the generative network used in Algorithm 2 is shown in Figure 2.
For the selected image I, the model F to be trained is applied to obtain the classification result $\mathrm{Label}_F$ of I, and the generation losses are computed. In the updating process, the goal of $\mathrm{loss}_g$ is to cause the selected image I to be classified as the class $\mathrm{Target}_G$ by the pre-trained network G. Since the pre-trained network G [...]

Algorithm listing (recovered fragment):
Inputs: training datasets $I_1$, λ, c, η, L > 0, β ∈ [0, 1), α
Output: BBN network model F(input; weight)
(1) ...
    x ← a random sample of class k in $I_2$
(5) $\mathrm{Target}_O$ ← a class drawn from $I_1$ with probability $P_O$
(6) I ← a random sample of class $\mathrm{Target}_O$ in $I_1$
(7) Add some noise σ to I
(8) for t = 1 to L do
(9)     $\mathrm{Loss}_G$ = L(G; I; k)
(10)    Loss ...
...

In unbalanced datasets, the majority class has a very rich set of class-irrelevant features in addition to its class-relevant features, whereas the minority class has only class-relevant features and a small number of class-irrelevant features. Therefore, the class-irrelevant features carried by the majority classes are combined with the class-relevant features carried by the minority classes in the pre-trained network G, through the generative network, to generate a large number of minority-class images. New classification information is introduced while the data distribution is balanced.
Also, to prevent the gradient from vanishing during training, which would make it impossible to generate the target image, a regularization term can be set. The $\mathrm{loss}_O$ above represents the difference between $\mathrm{Label}_F$, the prediction of the network F to be trained for image I, and the source class $\mathrm{Target}_O$ of the image to be generated. Before classification is performed, $\mathrm{Label}_F$ takes the form of a vector with entries between 0 and 1, each entry indicating the probability that the image belongs to the corresponding class. Setting this difference as a regularization term effectively prevents the gradient from vanishing and allows the input image I to be transformed more easily into an image of another class.
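The generation loop (add noise to a source-class image, then update it for L steps) can be sketched on a toy problem. The exact forms of $\mathrm{loss}_g$ and $\mathrm{loss}_O$ are not reproduced in this text, so the sketch makes simplifying assumptions: $\mathrm{loss}_g$ is the cross-entropy of G toward $\mathrm{Target}_G$, $\mathrm{loss}_O$ is the probability that F still assigns to $\mathrm{Target}_O$, and both networks are replaced by small linear softmax classifiers with analytic gradients. All names and numbers are hypothetical.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, x):
    """Class probabilities of a linear softmax classifier (stand-in network)."""
    logits = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return softmax(logits)


def objective(x, w_g, w_f, target, source, lam):
    """loss_g + lam * loss_o for one image x (simplified stand-in losses)."""
    loss_g = -math.log(predict(w_g, x)[target])  # push G toward the target class
    loss_o = predict(w_f, x)[source]             # pull F away from the source class
    return loss_g + lam * loss_o


def gradient(x, w_g, w_f, target, source, lam):
    """Analytic gradient of the objective with respect to the image x."""
    p_g = predict(w_g, x)
    p_f = predict(w_f, x)
    grad = []
    for d in range(len(x)):
        # d(cross-entropy)/dx = sum_j (p_j - 1[j == target]) * W_G[j]
        g_term = sum((p_g[j] - (1.0 if j == target else 0.0)) * w_g[j][d]
                     for j in range(len(w_g)))
        # d(p_source)/dx = p_source * (W_F[source] - sum_j p_j * W_F[j])
        mean_w = sum(p_f[j] * w_f[j][d] for j in range(len(w_f)))
        f_term = p_f[source] * (w_f[source][d] - mean_w)
        grad.append(g_term + lam * f_term)
    return grad


def generate(x, w_g, w_f, target, source, steps=20, eta=0.1, lam=0.1,
             sigma=0.01, rng=None):
    """Add noise to a source-class image, then take `steps` gradient steps."""
    rng = rng or random.Random(0)
    x = [v + rng.gauss(0.0, sigma) for v in x]
    for _ in range(steps):
        grad = gradient(x, w_g, w_f, target, source, lam)
        x = [v - eta * g for v, g in zip(x, grad)]
    return x


# Toy setup: 3 features, 2 classes; all numbers are illustrative.
W_G = [[1.0, 0.0, -1.0], [-1.0, 0.5, 1.0]]  # stand-in for pre-trained network G
W_F = [[0.5, -0.5, 0.0], [0.0, 0.5, -0.5]]  # stand-in for network F being trained
x_source = [1.0, 0.2, -0.3]                 # image from source class 0
x_new = generate(x_source, W_G, W_F, target=1, source=0)
```

In a real implementation the gradients would come from automatic differentiation through the convolutional networks; the toy only shows how the two loss terms jointly move an image from the source class toward the class to be generated.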

Time Supervision Strategy
The variation of the time-supervised parameter with training time is shown in Figure 3. The time-supervised strategy proposed in this paper acts on both the BBN model and the data generation module. It coordinates and controls the overall model learning process by setting the time-supervised parameter α.
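As a rough illustration of such a schedule, the sketch below uses the parabolic decay from the original BBN work [23], α = 1 − (T / T_max)². This is an assumption: the curve in Figure 3 may differ, and the function name and the 200-epoch horizon are hypothetical.

```python
def alpha_schedule(epoch, total_epochs):
    """Time-supervised parameter alpha for a given training epoch.

    Assumption: parabolic decay alpha = 1 - (T / T_max) ** 2, as in the
    original BBN paper; the schedule in Figure 3 may be different.
    """
    return 1.0 - (epoch / total_epochs) ** 2


# alpha weights the first (source) branch; 1 - alpha weights the second
# branch and, in BBGN, gates the generating network.
schedule = [alpha_schedule(t, 200) for t in (0, 100, 200)]
```

Under this assumed schedule, α stays close to 1 for most of the early training, so the first branch dominates, and then falls quickly toward 0, handing control to the second branch and the generating network.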

Time-Supervision Strategies for BBN Models.
In our BBGN model, the degree of influence of the different branches on the learning process can be adjusted by introducing the time-supervised parameter into the computation of the feature fusion and the loss of the two branches. The first-branch data is the original real data, while the second-branch data enhanced by the generative network is non-real data generated from data features. Therefore, in the early stage of model training, the real data should be the main focus, and the real features in the data should be learned as much as possible. When the minority classes gradually start to over-fit, the influence of the generated non-real data on model learning is gradually increased to prevent over-fitting on the minority classes, and at the same time to enhance the generalization ability of the model on the minority classes and improve its accuracy. The fused feature is obtained during the fusion of model features by equation (6).
It is important to note that when the result obtained by applying the classifier to the fused feature is used to calculate the loss function, the predicted label must be compared with both objectives, $\mathrm{Target}_O$ and $\mathrm{Target}_G$, to calculate the loss. Since the feature fusion is performed with the time-supervised parameter α, the final loss should also be calculated with α. The loss calculation in this paper is shown in equation (7),
where the loss function L(L, T) is calculated as shown in equation (8).
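A minimal sketch of α-weighted fusion and loss, assuming the linear forms used in the original BBN work [23]: the fused feature is α · feature_1 + (1 − α) · feature_2, and the total loss is α · L(pred, target_1) + (1 − α) · L(pred, target_2), with cross-entropy as L. The function names, the toy classifier, and the numbers are hypothetical; equations (6) to (8) of this paper may differ in detail.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Cross-entropy of a single prediction against an integer class label."""
    return -math.log(softmax(logits)[target])


def fuse_features(feat1, feat2, alpha):
    """Assumed fusion: alpha * first branch + (1 - alpha) * second branch."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(feat1, feat2)]


def classify(features, weights):
    """Fully connected classifier without bias (stand-in for the FC layer)."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def time_supervised_loss(logits, target1, target2, alpha):
    """Assumed loss: the same alpha weights the two branch objectives."""
    return (alpha * cross_entropy(logits, target1)
            + (1.0 - alpha) * cross_entropy(logits, target2))


# Hypothetical 2-d features from the two branches and a 3-class classifier.
feat1, feat2 = [1.0, 0.0], [0.0, 1.0]
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
alpha = 0.75
fused = fuse_features(feat1, feat2, alpha)
logits = classify(fused, W)
loss = time_supervised_loss(logits, target1=0, target2=1, alpha=alpha)
```

Using the same α for both the fusion and the loss keeps the two consistent: whichever branch contributes more to the fused feature also contributes more to the gradient.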

Time-Supervision Strategies in Generative Network Models.
The generative network model improves the performance of the model by improving the quality of the second-branch data. In the early stages, however, the second-branch data has little impact on model learning, so a large amount of generated data would not contribute to the learning process and would waste resources. We therefore use equation (9) and introduce a time-supervised strategy to limit the size of generated data in the early stage, where L is the number of samples to be generated, $N_{\mathrm{Target}_G}$ is the total number of samples of the class to be generated, and α is the time-supervised parameter. With the time-supervised strategy, the amount of data generated by the generative network is limited in the early stage of model training. As the number of iterations increases, the size L of the generated data is gradually raised by the time-supervised strategy; the main component of the second-branch data then changes from resampled data to generated data, and the data quality gradually improves. In this way, the number of iterations can be greatly reduced and the learning efficiency improved while the accuracy of the model is preserved.

Experimental Data Set.
In this paper, we use the public datasets CIFAR10 and CIFAR100 [32], each of which contains 60,000 RGB color images of size 32 × 32. Of these images, 50,000 are used for training and 10,000 for testing. CIFAR10 and CIFAR100 have 10 and 100 classes, respectively. The source datasets are uniformly distributed. In this paper, we resample CIFAR10 and CIFAR100 into class-imbalanced long-tailed datasets by setting the imbalance ratio parameter Ratio during the experiment, where Ratio = N_max / N_min. Ratio takes three values in this paper: 10, 50, and 100.
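The construction of the long-tailed training sets can be sketched as follows. The paper fixes only Ratio = N_max / N_min; the exponential decay of class sizes between the largest and smallest class is an assumption (a common convention for long-tailed CIFAR), and the function name is hypothetical.

```python
def long_tailed_counts(n_max, num_classes, ratio):
    """Per-class training sizes with N_max / N_min equal to `ratio`.

    Assumption: sizes decay exponentially from the first to the last class;
    the paper itself only specifies the imbalance ratio.
    """
    return [round(n_max * (1.0 / ratio) ** (i / (num_classes - 1)))
            for i in range(num_classes)]


# CIFAR10 has 5,000 training images per class before resampling.
cifar10_counts = {r: long_tailed_counts(5000, 10, r) for r in (10, 50, 100)}
```

For each ratio, the first class keeps all 5,000 images and the last class keeps 5,000 / Ratio images; the same function applies to CIFAR100 with n_max = 500 and num_classes = 100.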

Introduction of Experimentation.
Our experiments run on the Ubuntu 20.04 operating system. Training the model requires about 16 GB of RAM. The programming language of the experiments is Python. We first train our model with a pre-trained model and the unbalanced training datasets. We then evaluate the model on the test dataset. Finally, we visualize and analyze the results.

Analysis of Experimental Results.
In this paper, the accuracy of the best solution on the test set is used to compare the classification ability of the different methods. We also track the test-set accuracy of the different models during training and plot it as a line graph, which visualizes the differences between the models during the training process.
From Table 1 and Figure 4, we can find that the accuracy of the BBN algorithm and of our algorithm decreases when the number of training rounds is between 50 and 75, while the M2m algorithm quickly reaches a higher level and stabilizes at that level in the early stage. However, as the number of training rounds gradually increases, the improvement of M2m accuracy gradually slows, while the accuracy of the BBN algorithm and of our algorithm improves rapidly. Considering that both the BBN algorithm and our algorithm contain a bilateral-branch structure, the decrease in accuracy is related to the gradual increase in the influence of the second branch. At the beginning, the accuracy drops significantly because the data in the second branch has little influence on the model, but as the number of training rounds increases, the accuracy of the BBN algorithm and of our algorithm steadily exceeds that of the M2m algorithm thanks to the data in the second branch. This indicates that the pure data generation model can achieve high accuracy within a smaller number of training rounds, but its final accuracy ceiling is not high. Although our method outperforms the BBN algorithm, its lead on the CIFAR10 dataset is small when the imbalance ratio is 10 or 50. When faced with extreme imbalance, i.e., when the imbalance ratio is 100, the advantage of our algorithm becomes much more pronounced.
This shows that when the real-world setting is not very complex, data augmentation by resampling alone can achieve good results, but in an extreme setting the data quality of the second branch becomes very poor, and better results can then be obtained by introducing data generation to improve the data quality.

Results of the CIFAR100 Experiment.
The results of the CIFAR100 experiments are shown in Table 2 and Figure 5. The curves are roughly the same as for CIFAR10: the M2m accuracy stabilizes at a high value very quickly, while the accuracy of the BBN algorithm and of our algorithm first decreases and then improves as the number of training rounds increases. CIFAR100 has 10 times as many classes as CIFAR10 and fewer samples in each class, so obtaining high accuracy is very difficult. In this extreme condition, our algorithm can take advantage of data enhancement and lead the BBN and M2m algorithms. Surprisingly, however, in the most extreme case of Ratio = 100 for CIFAR100, the accuracy rates of all three are comparable and lower, and the accuracy curves are almost completely indistinguishable. This is perhaps because, in such an extreme case, the ResNet32 network is limited by its network structure.
To test this conjecture, we ran a set of comparison experiments on the CIFAR100 dataset with Ratio = 100, using the ResNet50 network.

Results of the ResNet50 Experiment.
The experimental results are shown in Figure 6. From the accuracy curves, we can find that when the model has more parameters and the network structure is more complex, the accuracy of our method and that of the BBN method differ very clearly. Benefiting from the data generation network, the BBGN model constructed by the generation method with the time-supervised strategy can also significantly improve accuracy in the case of a very extreme data distribution.

Summary and Outlook
In this paper, two classical solutions that address the unbalanced distribution of long-tailed datasets are introduced: resampling methods and re-weighting methods. We studied and analyzed two types of algorithms for processing long-tailed data, BBN and M2m, and proposed a bilateral-branch generative network model based on them. This model improves the accuracy of classifying long-tailed data through data-driven re-weighting and data enhancement based on generative networks. A time-supervised strategy is also introduced in this model to coordinate and control the GN and BBN modules, reducing the number of algorithm iterations while maintaining high accuracy. Compared with the BBN and M2m algorithms, this algorithm obtains higher accuracy stably. The following shortcoming still exists: although the time-supervised strategy proposed in this paper removes a large number of iterations and makes the model run significantly more efficiently than the M2m algorithm, a gap in efficiency remains compared with the BBN algorithm.

Data Availability
We used the CIFAR10 and CIFAR100 public datasets to support the findings of the research. All data used in the study can be downloaded at http://www.cs.toronto.edu/~kriz/cifar.html.

Conflicts of Interest
The authors declare that they have no conflicts of interest regarding the publication of the study.