Multisource Deep Transfer Learning Based on Balanced Distribution Adaptation

Traditional unsupervised transfer learning assumes that the samples are collected from a single domain. In practical applications, samples from a single source domain are often insufficient; in most cases, labeled data are collected from multiple domains. In recent years, multisource unsupervised transfer learning with deep learning has focused on aligning domains in a common feature space and then minimizing the distribution difference between the source and target domains, such as the marginal distribution, the conditional distribution, or both. Moreover, the conditional and marginal distributions are usually treated as equally important, which leads to poor performance in practical applications. The existing algorithms that consider balanced distribution adaptation are based on a single source domain. To solve these problems, we propose a multisource transfer learning algorithm based on balanced distribution adaptation (MTLBDA), which adjusts the weights of the two distributions to solve the problem of distribution adaptation in multisource transfer learning. Extensive experiments show that MTLBDA achieves significant results on popular image classification datasets such as Office-31.


Introduction
Machine learning can achieve good results in computer vision, but it often relies on the following assumptions: there are enough samples in the training dataset to learn a high-precision classifier, and the training and testing data come from the same feature space and the same distribution. For a new domain, it is often difficult to obtain enough labels. In this case, transfer learning [1] is a promising method that transfers knowledge from a source domain to the target domain, and the development of deep learning has raised the technical level of transfer learning models. Transfer learning usually assumes that training and testing data come from similar but different distributions [2]. For example, photos of the same object taken under different angles, backgrounds, and lighting may follow different marginal and conditional distributions. The existing transfer learning methods mainly focus on distribution adaptation, observing and reducing the difference between domains through joint distribution adaptation. For example, several unsupervised transfer learning methods [3][4][5] use maximum mean discrepancy in the neural network to reduce the domain difference; other models introduce different learning modes to align the source and target domains, including aligning second-order correlations [6,7].
In recent years, most unsupervised transfer learning algorithms have focused on the single-source setting, in which the training samples come from a single source domain. Earlier research focused on estimating sample weights, i.e., the density ratio between the source domain and the target domain [8][9][10][11]. In addition, manifold learning methods map samples from a high-dimensional space to a low-dimensional manifold space so that the subspaces of the source and target domains come closer. Some single-source transfer learning algorithms map the data of the two domains to a common feature space and describe the invariant features of the source and target domains by minimizing the difference in domain distribution [6,[12][13][14]. Long [15], Hou [16], and Hashemi [17] also proposed joint distribution adaptation methods to reduce the distribution difference between the source domain and the target domain. More recently, many deep transfer learning algorithms were proposed for data distribution adaptation: Tzeng et al. proposed DDC (deep domain confusion) [18] and Long et al. [12] proposed DAN (deep adaptation network) for marginal distribution adaptation, while Zhu et al. [19] proposed DSAN (deep subdomain adaptation network) and Wang [20] proposed DDAN (deep dynamic adaptation network) for joint distribution adaptation.
However, in practical work, we often face multiple source domains, so studying transfer from multiple source domains is both more feasible and more meaningful in practice. For multisource transfer learning, a simple idea is to merge all source domains into a single new source domain and then apply a single-source transfer learning algorithm to classify the target domain data. Because the dataset grows, such methods may yield better results; however, in practical applications, the distributions of the individual domains differ greatly, and this type of method does not perform well. Therefore, we need a better way to utilize data from multiple source domains.
With the rapid development of deep learning, there are many studies on transfer learning based on deep learning. Zhao et al. [21] proposed a multidomain adversarial network, which aligns the feature distributions of each source domain and the target domain through multiple domain discriminators; Xu et al. [22] proposed a deep cocktail network, in which a separate domain discriminator and classifier are designed for each source domain and the target domain. Current deep multisource transfer learning algorithms often have the following two problems: (1) They first map the source domain samples and target domain samples to the same common feature space; however, even for a single source domain, it is difficult to learn exactly the same features as those of the target domain. Moreover, with multiple source domains, the data samples are likely to overlap, which weakens the effect of feature alignment. (2) Current studies often consider only the marginal probability distribution or only the conditional probability distribution of the source and target domains, or adjust the marginal probability first and then the conditional probability; the relationship between the two is not fully exploited.
In this article, we combine the advantages of balanced distribution adaptation, convolutional neural networks, and multisource transfer learning and propose a new multisource transfer learning algorithm based on balanced distribution adaptation, MTLBDA. It first maps the multiple source domains and the target domain to the same subspace and aligns their features. Then, according to balanced distribution adaptation, the influence of class imbalance between each source domain and the target domain is decreased, and the difference between the marginal and conditional probability distributions of each source domain and the target domain is reduced. A convolutional neural network is then used as the classifier for each source domain and the target domain to complete the classification task. Finally, we add a weighted regularization term for the classifier of each source domain to prevent overfitting of the model.
Compared with previous work, the contributions of this work include the following: (1) a new multisource transfer learning algorithm named MTLBDA is proposed, which balances the difference between the conditional and marginal probability distributions to improve classification. This method first maps all domains to the same feature space, then reduces the difference between the marginal and conditional probability distributions with maximum mean discrepancy, and on this basis adds a separate regularization term to the convolutional neural networks. The rest of the paper is arranged as follows: Section 2 reviews the work related to multisource transfer learning and joint distribution adaptation. Section 3 proposes multisource deep transfer learning based on balanced distribution adaptation. Section 4 verifies the effectiveness of the algorithm on the SVHN, USPS, MNIST, Office-31, and DomainNet datasets. Section 5 summarizes the main work of this paper.

Joint Distribution Adaptation.
A domain often has two probability distributions: the marginal probability distribution and the conditional probability distribution. Long [15] gave the hypothesis of joint distribution adaptation, whose purpose is to reduce the distance between the joint probability distributions of the source domain and the target domain. Current research on joint distribution adaptation includes domain-invariant clustering [16], increasing structural consistency [17], target optimization [23], and so on. Wang [20] proposed a dynamic balanced adaptation algorithm, which pointed out that marginal distribution adaptation and conditional distribution adaptation are not equally important. However, these joint distribution adaptation methods are used in single-source transfer learning and have not yet played a role in the multisource setting.
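As a conceptual summary in our own notation (not a formula taken from [15] or [20]), joint distribution adaptation approximates the joint discrepancy by a marginal term plus a conditional term,

D(P_s(x, y), P_t(x, y)) ≈ D(P_s(x), P_t(x)) + D(P_s(y|x), P_t(y|x)),

whereas the balanced view of Wang [20] replaces the equal weighting with (1 − μ)·D(P_s(x), P_t(x)) + μ·D(P_s(y|x), P_t(y|x)), estimating μ instead of fixing it at 0.5.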

Multisource Transfer Learning.
Multisource transfer learning (as shown in Figure 1), as a research direction of transfer learning, has essential practical value. In real life and practical applications, there are often multiple source domains; although each source domain has a different similarity to the target, all of them can be used for knowledge transfer. Moreover, multiple source domains contain more knowledge, which can improve the model. Multisource transfer learning also has a theoretical basis: Crammer [24] first derived an expected-loss bound for multisource transfer learning; later, Mansour [25] proved that a distribution-weighted combining rule can reduce the discrepancy between the source domains and the target domain; and Ben-David [26] gave two learning bounds for empirical risk minimization by introducing a distance between the target domain and the source domains.
In recent years, much work has centered on multisource transfer learning with deep learning. Xu [22] proposed the deep cocktail network (DCTN), which uses a domain discriminator and a classifier for each source domain and the target domain; the domain discriminator aligns the feature distributions, and the classifier outputs the predicted probability distribution. Based on the outputs of the domain discriminators, DCTN uses a voting scheme over the multiple classifiers. Peng [27] proposed moment matching for multisource domain adaptation (M³SDA), which considers not only the alignment between each source domain and the target domain but also the feature distributions of the different source domains. Zhu et al. [28] proposed a framework that aligns domain-specific distributions and classifiers for cross-domain classification from multiple sources (MFSAN). However, current deep multisource transfer learning algorithms often consider only the marginal probability distributions, or consider the marginal and conditional probability distributions separately. In this paper, multisource transfer learning based on balanced distribution adaptation, which weights the two distributions jointly to improve accuracy, is proposed.

Problem.
In multisource transfer learning, there are N source domains whose labeled sample data can be represented as (X_{s_i}, Y_{s_i}) = {(x_j^{s_i}, y_j^{s_i})}, i = 1, . . . , N, with probability distributions P_{s_i}(x, y); the target domain provides unlabeled samples X_t = {x_j^t}, and its probability distribution can be expressed as P_t(x, y).
In recent years, some papers have defined the objective function of multisource deep transfer learning. They first map all domains to the same feature space and then learn a common domain-invariant representation for all domains. Zhu et al. [28] gave a definition of the loss function:

L = Σ_{i=1}^{N} [J(C_i(F(x^{s_i})), y^{s_i}) + λ·D̂(F(X_{s_i}), F(X_t))],  (1)

where the first term is the loss of the classification function (generally the cross-entropy loss) and the second term is a statistical measure of the discrepancy between the source domain and the target domain. Commonly used metrics are MMD [15], reference loss [29], CORAL loss [12], and confusion loss [13,14]; Zhu et al. define CORAL loss as a specific difference loss. The common problem of these methods is that they use such metrics only to reduce the marginal distribution difference between the source and target domains, without considering the influence of the conditional probabilities on the model. Zhao [30]'s paper published at ICML 2019 proved theoretically that reducing only the marginal distribution difference between the source domain and the target domain is not enough. At the same time, Wang's paper [20] pointed out that treating the marginal and conditional probability distributions as equally important is also not enough. Therefore, we propose a multisource deep transfer learning algorithm based on balanced distribution adaptation to solve these problems.
Similar to other multisource transfer learning algorithms, we first map the multiple source domains and the target domain to the same subspace, and then we align the marginal and conditional probability distributions of each source domain and the target domain. Ideally, one would tune a separate convolutional neural network for each pair of source and target domains; from a practical point of view, however, the amount of computation would be very large, so we use shared weights instead. Finally, we add a domain-specific regularization term to realize individual network tuning.

Multisource Deep Transfer Learning Based on Balanced Distribution Adaptation
To reduce the impact of class imbalance on existing multisource transfer learning algorithms, in this section we introduce a multisource transfer learning algorithm based on balanced distribution adaptation. We use the general regularization term proposed in [28] in place of a classification selector to output the final classification result. Algorithm structure: our algorithm contains three parts, a common feature extractor, a distribution balancer, and a regularizer, as shown in Figure 2.
Common feature extractor: we propose a common subnet F(·) to extract the common representation of all domains, which maps images from the original feature space to a common feature space.
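As an illustrative sketch only (the paper does not pin down the backbone, so the torchvision ResNet-50 below is our assumption), the common feature extractor can be written as follows:

```python
import torch
import torch.nn as nn
from torchvision import models

class CommonFeatureExtractor(nn.Module):
    """Shared subnet F(.) that maps images from every domain
    into one common feature space (backbone choice is an assumption)."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet50(weights=None)
        # Drop the final fc layer; keep the 2048-d pooled features.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])

    def forward(self, x):                  # x: (B, 3, H, W)
        feats = self.backbone(x)           # (B, 2048, 1, 1)
        return torch.flatten(feats, 1)     # (B, 2048)
```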
Domain-specific distribution balancer: we design a distribution balancer for each source domain and the target domain. Given labeled images (x_j^{s_i}, y_j^{s_i}) from the source domain X_{s_i} and unlabeled images x_j^t from the target domain X_t, the features of these specific domains are mapped to the same feature space through the common feature extractor, namely the source domain features F(x_j^{s_i}) and the target domain features F(x_j^t). Hence, we obtain N independent distribution balancers B_i(·), each corresponding to a specific source domain (X_{s_i}, Y_{s_i}). The balancer we propose is a domain-specific feature extractor. Generally, MMD, CORAL, adversarial, and other methods are used at this stage, but they often consider only one distribution. To balance the categories, we use the BDA algorithm proposed by Wang Jindong [20], which considers the conditional distribution, the marginal distribution, and multiclass balance, as the distribution balancer. We use a convolutional neural network as our classifier and define C_i as the classifier of the i-th source domain. Following common practice, our classification loss is the cross-entropy loss, and the loss function is denoted as J(·, ·).
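Continuing the sketch (the bottleneck width and the Office-31 class count of 31 are our assumptions), each source domain i gets its own balancer subnet B_i and classifier C_i, with the cross-entropy loss as J(·, ·):

```python
import torch.nn as nn

class DomainSpecificNets(nn.Module):
    """One balancer B_i and one classifier C_i per source domain."""
    def __init__(self, n_sources, feat_dim=2048, bottleneck=256, n_classes=31):
        super().__init__()
        self.balancers = nn.ModuleList(
            [nn.Sequential(nn.Linear(feat_dim, bottleneck), nn.ReLU())
             for _ in range(n_sources)])
        self.classifiers = nn.ModuleList(
            [nn.Linear(bottleneck, n_classes) for _ in range(n_sources)])
        self.J = nn.CrossEntropyLoss()     # classification loss J(., .)

    def cls_loss(self, i, common_feats, labels):
        z = self.balancers[i](common_feats)          # B_i(F(x))
        return self.J(self.classifiers[i](z), labels)
```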
Domain-specific regularization term: based on the behavior regularization proposed in the literature [33], for source domain i we define the regularization term R_i(w, w*, x_j, y_j), where w is the d-dimensional parameter vector containing all d parameters of the target-domain network and w* is the parameter vector of the source-domain network. Computing the term over all parameters for each domain is expensive, so we share the parameters of the first n − 3 layers.
Objective function: according to Figure 2, we define the final objective function of the algorithm as

L = L_cls + λ·L_bda + c·L_reg,  (2)

where λ and c weight the balance loss and the regularization loss.
The classification loss L_cls is the loss of the domain-specific classifiers. As Figure 2 shows, a sample x_j from source domain i undergoes a three-step transformation: first, F(x_j^{s_i}) is obtained through the common feature extractor; then B_i(F(x_j^{s_i})) is obtained through the distribution balancer; finally, C_i(B_i(F(x_j^{s_i}))) is obtained through the CNN classifier. The final classification loss is

L_cls = Σ_{i=1}^{N} Σ_j J(C_i(B_i(F(x_j^{s_i}))), y_j^{s_i}).  (3)

The balance loss L_bda is the domain-specific balancer loss, and we follow the single-source distribution balancer of Wang et al. [20]. The algorithm considers the conditional and marginal probability distributions of the source domains and the target domain at the same time. In particular, because the labels of the target domain are unavailable, we cannot estimate its conditional distribution directly. Therefore, we use the proof given in [31]: when there are enough labeled samples, the class-conditional distribution P(x_t|y_t) can be used to approximately match the conditional distribution P(y_t|x_t). To compute the class-conditional distribution P(x_t|y_t), we first use the domain-specific classifiers to label the target domain samples, forming pseudo-labels for the target domain.
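A minimal sketch of this pseudo-labeling step (function and variable names are ours): the current domain-specific classifier labels the unlabeled target batch, and those hard labels feed the class-conditional terms below.

```python
import torch

@torch.no_grad()
def pseudo_label(target_feats, balancer, classifier):
    """Label unlabeled target samples with the current domain-specific
    classifier to estimate the class-conditional distribution P(x_t | y_t)."""
    logits = classifier(balancer(target_feats))
    return logits.argmax(dim=1)            # hard pseudo-labels
```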
μ ∈ [0, 1] is the balance factor: when μ → 0, the marginal distribution is more important, and when μ → 1, the conditional distribution is more important. To compute the marginal and conditional discrepancies, following MMD and TCA [32], we estimate the domain-specific balancer empirically.
For the i-th source domain, the squared distance between the empirical kernel mean embeddings of the two domains is obtained from the empirical estimate of MMD:

MMD²(X_{s_i}, X_t) = ‖(1/n_{s_i}) Σ_{j=1}^{n_{s_i}} φ(x_j^{s_i}) − (1/n_t) Σ_{j=1}^{n_t} φ(x_j^t)‖²_H.  (6)
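A minimal sketch of this empirical estimate (we assume a Gaussian kernel; the paper only states that the squared distance between empirical kernel mean embeddings is used), together with the (1 − μ)/μ weighting that anticipates the balance loss defined below:

```python
import torch

def gaussian_kernel(a, b, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)), computed pairwise."""
    d2 = torch.cdist(a, b) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd2(xs, xt, sigma=1.0):
    """Biased empirical MMD^2 between source and target features."""
    k_ss = gaussian_kernel(xs, xs, sigma).mean()
    k_tt = gaussian_kernel(xt, xt, sigma).mean()
    k_st = gaussian_kernel(xs, xt, sigma).mean()
    return k_ss + k_tt - 2 * k_st

def balanced_mmd2(xs, ys, xt, yt_pseudo, n_classes, mu=0.5):
    """(1 - mu) * marginal MMD^2 + mu * class-conditional MMD^2,
    using classifier pseudo-labels for the unlabeled target domain."""
    loss = (1 - mu) * mmd2(xs, xt)
    for c in range(n_classes):
        xs_c, xt_c = xs[ys == c], xt[yt_pseudo == c]
        if len(xs_c) > 1 and len(xt_c) > 1:   # skip near-empty classes
            loss = loss + mu * mmd2(xs_c, xt_c)
    return loss
```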

We define formula (6) as the estimate of the difference between the source domain and the target domain. Therefore, the balance loss is defined as

L_bda = Σ_{i=1}^{N} [(1 − μ)·MMD²(X_{s_i}, X_t) + μ·Σ_{c=1}^{C} MMD²(X_{s_i}^{(c)}, X_t^{(c)})],  (7)

where X^{(c)} denotes the samples of class c (target classes are taken from the pseudo-labels). We define the regularization term for a specific domain i according to the behavior regularization term proposed in [33] as follows:

R_i(w, w*) = Σ_{j=1}^{n} Ω(w, w*, x_j, y_j),
Ω(w, w*, x_j, y_j) = α·Ω′(w, w*, x_j, y_j) + β·L₂(w, w*),
Ω′(w, w*, x_j, y_j) = Σ_k W_k(w*, x_j, y_j)·‖FM_k(w, x_j) − FM_k(w*, x_j)‖²₂,

where W_k(w*, x_j, y_j) is the weight assigned to the j-th image in the k-th layer of the network, FM_k(w, x_j) − FM_k(w*, x_j) is the difference between the feature maps of the two networks, ‖·‖₂ denotes the Euclidean distance, and L₂(w, w*) is the L₂ regularization term of w and w*. To reduce computation, k ∈ {n, n − 1, n − 2}. Collecting the regularization terms of the multiple source domains, we define the regularization loss as L_reg = Σ_{i=1}^{N} R_i(w, w*); the coefficient c ∈ [0, 1] in (2) weights this loss, and its value is chosen according to the subsequent selector. Final objective function: the total loss minimized in formula (10) combines the classification loss (3), the balance loss (7), and the regularization loss as in (2). In summary, the specific process steps of the MTLBDA algorithm are shown in Table 1.
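As an illustrative sketch of this regularizer under our reading of the behavior regularization in [33] (the attention weights W_k are taken here as given inputs, and all names are ours):

```python
import torch

def behavior_reg(feat_maps_w, feat_maps_ws, attn_weights, w, w_star,
                 alpha=0.01, beta=0.01):
    """Omega = alpha * sum_k W_k * ||FM_k(w) - FM_k(w*)||^2
               + beta * ||w - w*||^2.
    feat_maps_w / feat_maps_ws: feature maps of the last three layers
    (k = n, n-1, n-2) under target weights w and source weights w*."""
    omega = 0.0
    for fm_w, fm_ws, wk in zip(feat_maps_w, feat_maps_ws, attn_weights):
        omega = omega + wk * (fm_w - fm_ws.detach()).pow(2).sum()
    l2 = sum((p - q.detach()).pow(2).sum() for p, q in zip(w, w_star))
    return alpha * omega + beta * l2
```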

Experimental Results
To test the effectiveness and generalization of the MTLBDA algorithm, we test it on two types of image datasets. The first type consists of digit classification datasets: the SVHN [34], USPS [35], and MNIST [36] datasets. The second type consists of image classification datasets: the Office-31 [37], Caltech [38], and DomainNet [27] datasets. The experiments compare the single-source transfer learning algorithms DAN, DANN, BDA, and DDAN and the multisource transfer learning algorithms DCTN, MFSAN, and M³SDA.
For fairness, a 5-fold cross-validation strategy is used in all experiments, and each experiment is repeated twice to obtain the final comparison result. We use the average classification accuracy [39,40] and the recall rate of each algorithm over 10 runs as the evaluation criteria. The recall rate reflects how many positive examples in the sample are predicted correctly. Classification accuracy and recall are defined as follows:

Classification accuracy: Accuracy = |x ∈ X: f(x) = y(x)| / |X| × 100%.
Recall rate: R = TP/(TP + FN) × 100%.
Among them, TP represents the number of positive samples that are correctly classified as positive, FP represents the number of negative samples that are incorrectly classified as positive, TN represents the number of negative samples that are correctly classified as negative, and FN represents the number of positive samples that are incorrectly classified as negative.
X represents the test dataset of the target domain, f(x) is the class label of sample x predicted by the classifier, and y(x) is the true class label of sample x.
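For concreteness, the two criteria reduce to a few lines (a sketch; variable names are ours):

```python
def accuracy(y_pred, y_true):
    """|{x : f(x) = y(x)}| / |X| * 100%."""
    correct = sum(p == t for p, t in zip(y_pred, y_true))
    return 100.0 * correct / len(y_true)

def recall(tp, fn):
    """R = TP / (TP + FN) * 100%."""
    return 100.0 * tp / (tp + fn)
```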

Dataset Introduction.
Both the USPS and MNIST datasets contain the handwritten digits "0"-"9"; the former consists of 9298 16 × 16 images, and the latter of 70,000 28 × 28 images. The street view house numbers (SVHN) dataset was collected by Google; each picture contains a group of Arabic numerals "0"-"9", the dataset contains 73,257 digits, and each image is 32 × 32 pixels. Figure 3 shows examples from USPS, MNIST, and SVHN. The distributions of USPS and MNIST are different, but they share the same feature space; SVHN differs from them in both distribution and feature space. We extract 9000 images each from MNIST and SVHN as two domains; since USPS has only 9298 images, we use the whole dataset as a domain.
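Since the three domains differ in resolution (16 × 16, 28 × 28, 32 × 32), a shared feature extractor needs a common input size. A plausible preprocessing sketch (the paper does not state its exact pipeline, so the 32 × 32 target size and the grayscale-to-3-channel conversion are our assumptions):

```python
from torchvision import transforms

# Resize every domain to one common resolution and replicate
# grayscale digits to 3 channels so one backbone fits all domains.
digit_transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.Grayscale(num_output_channels=3),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```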

Experimental Data.
In this part, we compare some single-source transfer learning algorithms and multisource transfer learning algorithms such as DCTN and MFSAN with our algorithm MTLBDA. It can be seen from Table 2 that, in the three cross-domain tasks, the highest accuracy rates of the MTLBDA algorithm are 83.56%, 98.43%, and 96.14%, which are higher by 3.31%, 0.35%, and 0.02% than those of DCTN, MFSAN, and M³SDA, respectively. In the classification tasks with S as the target domain or a source domain, the multisource transfer learning algorithms are clearly better than the single-source ones; in the tasks with S as the target domain, our algorithm MTLBDA is 13.04% more accurate than DDAN, the best single-source transfer learning algorithm. From Figure 4, we can see that the different styles of the S, U, and M images lead to their distribution differences, which also supports the accuracy of our algorithm.
Table 1: Training procedure of the MTLBDA algorithm.
Input: samples from N source domains and the target domain; batch size m
Output: loss function f(x)
1: Set the number of training iterations T
2: for t = 1 to T do
3: Randomly draw m samples from one source domain
4: Draw m samples from the target domain
5: Feed the source and target samples into the common feature extractor to obtain the common representation F(·)
6: Input the common representation of the source samples into the domain-specific distribution balancer to obtain the domain-specific representation B(·)
7: Feed the domain-specific representation of the source samples to the domain-specific classifier; the classifier loss is computed with formula (3)
8: Input the common representation of the target samples into all domain-specific balancers to obtain the domain-specific representations of the target samples
9: Compute the balance loss with formula (7)
10: Minimize the total loss in formula (10); update the common feature extractor F(·), the distribution balancers B_1, B_2, . . . , B_N, the classifiers C_1, C_2, . . . , C_N, and the regularization terms R_1, R_2, . . . , R_N
11: end for

It can be seen from Table 3 that, in the four cross-domain tasks, the highest accuracies of the MTLBDA algorithm are 93.03%, 99.28%, 99.52%, and 94.53%, higher than those of the comparison algorithms DCTN, MFSAN, and M³SDA. At the same time, across the four cross-domain tasks, MTLBDA is 4.25%, 0.82%, 3.16%, and 2.77% more accurate than the best single-source transfer learning algorithm. In the A, W, D → H task, MTLBDA improves by 1.6% over the best multisource transfer learning algorithm, MFSAN, and by 3.25% over the best single-source algorithm. The average accuracy is also greatly improved, which demonstrates the effectiveness of the proposed algorithm.

Influence of Category and μ.
To demonstrate the advantages of our algorithm with respect to the number of categories, we selected the DomainNet dataset proposed in Ref. [27] (as shown in Figure 5); its domains are clipart, infograph, painting, quickdraw, real, and sketch, covering 345 classes and 599,859 images. The data distribution is shown in Table 4; each domain contains 345 classes. We gradually increase the number of classes from 20 to 345, show the impact of the number of iterations on the accuracy of the algorithm, and finally measure the sensitivity of the algorithm to μ.

Experimental Data.
(1) Category Influence. We plot how the performance of the different models changes as the number of categories increases; the figure covers all multidomain combinations on DomainNet. (a) It can be seen from Figure 6(a) that multisource transfer learning algorithms are very sensitive to the number of classes; at the same time, when there are many classes, our algorithm is clearly better than DCTN and MFSAN. (b) When the number of classes is greater than 150, our algorithm's accuracy is generally higher than that of the other algorithms. (c) Our algorithm performs better on datasets with a very large difference between the marginal and conditional probability distributions, such as DomainNet, which also shows that the marginal and conditional probabilities have a great impact on classification of practical images.
(2) Influence of Iterations. Figure 7 shows the effect of the number of iterations on the accuracy. (a) When the number of iterations exceeds 1000, the accuracy of the algorithm tends to be stable. (b) At the same time, MTLBDA shows better results.
(3) Influence of μ. In this section, we evaluate the effectiveness of the balance factor μ. We ran MTLBDA with μ ∈ {0, 0.1, . . . , 0.9, 1.0} on several tasks, taking μ = 0.5 as the baseline. Figure 8 shows the results. Clearly, the optimal μ differs across tasks, indicating the importance of balancing the marginal and conditional distributions between domains. In the tasks C, I, P, Q, R → S and C, I, P, S, R → Q, the optimal μ is 0.9: the marginal distributions are almost the same, so transfer performance mainly depends on the conditional distribution. In the task U, M → S, the optimal μ is 0.4: the marginal and conditional distributions contribute almost equally, with the marginal distribution slightly more important.
The observations are similar in the other tasks. This shows that μ is essential for balancing the marginal and conditional distributions in cross-domain learning problems. Therefore, MTLBDA is better able to obtain good performance.

Conclusion
In this article, to solve the problem of unbalanced categories across multiple source domains in transfer learning, a small-sample data classification technique based on balanced distribution adaptation and multisource transfer learning is proposed. Under unbalanced distributions, this method first maps the multiple source domains and the target domain to the same target space. Then, according to the balanced distribution adaptation algorithm, the distribution of each source domain and the target domain is balanced while adjusting their marginal and conditional distributions. A convolutional neural network is then used as the classifier for each source domain and the target domain. Finally, a regularization term for each source domain is added to prevent overfitting of the model. The experimental results on the SVHN, USPS, MNIST, Office-31, Caltech-256, and DomainNet datasets show that MTLBDA is superior to the benchmark algorithms in classification accuracy and training efficiency. Although the experimental results show that the MTLBDA algorithm outperforms the benchmark algorithms, further research is still needed in the following areas: the extension of MTLBDA to multiclassification problems, and the accurate estimation of μ, which remains a challenge.

Conflicts of Interest
The authors declare no conflicts of interest.