Instance Transfer Learning with Multisource Dynamic TrAdaBoost

Because transfer learning can employ knowledge from related domains to help the learning task in the current target domain, it reduces learning cost and improves learning efficiency compared with traditional learning. Focusing on the situation in which sample data from the source domains and the target domain have similar distributions, this paper proposes an instance transfer learning method based on multisource dynamic TrAdaBoost. The method uses knowledge from multiple source domains to avoid negative transfer; furthermore, the information that is conducive to learning the target task is used to train the candidate classifiers. The theoretical analysis suggests that, by adding a dynamic factor, the proposed algorithm mitigates the drift of weight entropy from source to target instances and classifies more effectively than single-source transfer. Finally, experimental results show that the proposed algorithm has higher classification accuracy.


Introduction
In data mining, a general assumption of traditional machine learning is that training data and test data have the same distribution. In practical applications, however, this assumption often cannot be met [1]. By transferring and sharing knowledge from different fields for target task learning, transfer learning turns traditional learning from scratch into cumulative learning, which improves learning efficiency and reduces learning cost [2,3]. In 2005, the Information Processing Techniques Office (IPTO) gave a new mission for transfer learning: the ability of a system to recognize and apply knowledge and skills learned in previous tasks to novel tasks. Under this definition, transfer learning aims to extract knowledge from one or more source tasks and apply it to a target task [2]. Since transfer learning needs to use information from similar domains and tasks, its effectiveness is related to the correlation between the source and target domains.
However, transfer learning is more complex than traditional machine learning because of the transfer step itself. Knowledge in related domains can be represented in many ways, such as sample instances, feature mappings, model parameters, and association rules. Because it is simple to implement, this paper selects sample instances as the knowledge representation for designing an effective transfer algorithm. In detail, instance transfer learning improves classification accuracy by finding training samples in other source domains that are strongly correlated with the target domain and reusing them in learning the target task [4]. How the weights of these training data are decided therefore influences the effectiveness of the candidate classifiers [5].
Researchers have proposed several approaches to solve transfer learning problems. Ben-David and Schuller provided a theoretical justification for multitask learning [6]. Daumé and Marcu studied the domain-transfer problem in statistical natural language processing by using a specific Gaussian model [7]. Wu and Dietterich proposed an image classification algorithm that uses both inadequate training data and plenty of low-quality auxiliary data [8]. This algorithm demonstrates some improvement from the auxiliary data, but it does not give a quantitative study using different auxiliary examples. Liao et al. proposed a new active learning method that selects the unlabeled data in a target domain to be labeled with the help of the source domain data [9]. Rosenstein et al. proposed a hierarchical naive Bayes approach for transfer learning with auxiliary data and discussed when transfer learning should be applied [10].
The Transfer AdaBoost algorithm, also called TrAdaBoost, is a classic transfer learning algorithm proposed by Dai et al. [11]. TrAdaBoost assumes that the source and target domain data use exactly the same set of features and labels, but that the distributions of the data in the two domains are different. In addition, TrAdaBoost assumes that, because of this difference in distributions, some of the source domain data may be useful in learning for the target domain, whereas some may not be and could even be harmful. Since TrAdaBoost relies on only one source, its learning performance degrades when the correlation between the source and target domains is weak. Moreover, as noted in [12][13][14], TrAdaBoost suffers from weight mismatch, introduced imbalance, and rapid convergence of source weights. The purpose of this paper is to remove the weight drift phenomenon efficiently, improve learning efficiency, and inhibit negative transfer.

Multisource Dynamic TrAdaBoost Algorithm
Considering the correlation between multiple source domains and the target domain, Yao and Doretto recently proposed the multisource TrAdaBoost (MSTrA) transfer learning algorithm [15]. As an instance-based transfer learning method, MSTrA selects its training samples from different source domains. At each iteration, MSTrA selects only the most related source domain to train the weak classifier. Although this ensures that the transferred knowledge is relevant to the target task, MSTrA ignores the effects of the other source domains. Al-Stouhi and Reddy proposed an algorithm (DTrAdaBoost) with an integrated dynamic cost to resolve a major issue in the boosting-based transfer algorithm TrAdaBoost [16]: source instance weights converge before they can be used for transfer learning. However, DTrAdaBoost has low learning efficiency. To overcome these disadvantages, a multisource dynamic TrAdaBoost algorithm (MSDTrA) is proposed. With this algorithm, the rate of convergence of the source sample weights is reduced on the basis of their weak correlation with the target domain [17]. Supposing there are $N$ source domains $\mathcal{D}_{S_1}, \ldots, \mathcal{D}_{S_N}$, $N$ source tasks $\mathcal{T}_{S_1}, \ldots, \mathcal{T}_{S_N}$, and $N$ sets of source training data $D_{S_1}, \ldots, D_{S_N}$, the purpose of transfer learning is to make good use of them to improve the learning of the target classifier function $\hat{f}_T : X \to Y$. In detail, the steps of MSDTrA are described as follows.
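Step 1. Initialize the weight vectors of the source and target training samples.
Step 2. Set the value of $\beta_S$ as follows:

$$\beta_S = \frac{1}{1 + \sqrt{2 \ln n_S / M}},$$

where $n_S = \sum_{k=1}^{N} n_{S_k}$ is the total number of training samples over all source domains, $n_{S_k}$ is the number of training samples in the $k$th source domain, and $M$ is the maximum number of iterations.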
Step 4. Select a base learner to obtain the candidate weak classifier $f_t^k$ from the training set $D_{S_k} \cup D_T$, where $D_{S_k}$ is the training set of the $k$th source domain and $D_T$ is the target training set; calculate the error of $f_t^k$ on $D_T$ according to

$$\varepsilon_t^k = \frac{\sum_{j=1}^{n_T} w_{T,j}^t \, [y_j \ne f_t^k(x_j)]}{\sum_{j=1}^{n_T} w_{T,j}^t},$$

and update the weight of $f_t^k$ by using the vectors update strategy. Repeat this procedure until all source domains are traversed. Here $\varepsilon_t^k$ is the error rate, on the target domain, of the candidate weak classifier trained with the $k$th source domain, $w_{T,j}^t$ is the weight of the $j$th of the $n_T$ target training samples at iteration $t$, and $[y_j \ne f_t^k(x_j)]$ equals 1 when the candidate weak classifier misclassifies sample $j$ and 0 otherwise. According to the vectors update strategy, the error of each weak classifier on the target training set is computed and a weight is assigned to each weak classifier according to that error: the larger the error, the smaller the weight. In other words, source domains that correspond to classifiers with high classification accuracy contain much valuable information for learning the target task.
Step 5. Integrate all weighted weak classifiers to obtain a candidate classifier $f_t$ at the $t$th iteration, where the classification error of $f_t$ on the target training set $D_T$ at iteration $t$ is

$$\varepsilon_t = \frac{\sum_{j=1}^{n_T} w_{T,j}^t \, [y_j \ne f_t(x_j)]}{\sum_{j=1}^{n_T} w_{T,j}^t},$$

and $\varepsilon_t$ must be less than 0.5. Then, calculate the errors of the candidate classifier on the source and target training sets, and update the weights of the source and target training samples on that basis. The weights of correctly classified source training samples remain unchanged.
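Step 6. Set $\beta_t = \varepsilon_t/(1 - \varepsilon_t)$ and $C_t = 2(1 - \varepsilon_t)$, where $\varepsilon_t$ is the error of the candidate classifier on the target training set at iteration $t$ and $C_t$ is the dynamic factor. Theorem 1 provides the derivation of $C_t$.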
Step 7. Update the weight vector of the source samples according to the following rule:

$$w_{S_k,i}^{t+1} = C_t \, w_{S_k,i}^t \, \beta_S^{[y_i \ne f_t(x_i)]}.$$

Update the weights of the target samples according to the rule

$$w_{T,j}^{t+1} = w_{T,j}^t \, \beta_t^{-[y_j \ne f_t(x_j)]}.$$

The weight update of the source instances uses the weighted majority algorithm (WMA) mechanism, which is computed from $\beta_S$ and the dynamic factor $C_t$. The target instance weights are updated by using $\beta_t$, which is calculated in Step 6.
Step 8. Retrain all weak classifiers using the training samples with updated weights. If the maximum number of iterations has not been reached ($t < M$), return to Step 3; otherwise, go to Step 9.
Step 9. Decide the final strong classifier.
In the MSDTrA algorithm, TrAdaBoost's ensemble learning is used to train classifiers on the combined set of source and target instances in every step. The WMA is used to adjust the weights of the source set by decreasing the weights of misclassified source instances and preserving the current weights of correctly classified source instances.
It can be seen from the above algorithm that MSDTrA allows all source training samples to participate in the learning process at each iteration and assigns different weights to different source training samples. If a source training sample can improve the learning of the target task, it is assigned a large weight. Overall, MSDTrA takes advantage of the useful knowledge in all source domains, which enhances the learning of the target task.
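The weight update in Steps 5 to 7 can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`source_beta`, `weighted_error`, `msdtra_update`) and the toy weights are hypothetical, the source multiplier is assumed to take the TrAdaBoost form $1/(1 + \sqrt{2 \ln n_S / M})$, and the combined candidate classifier is represented only by boolean flags marking the samples it misclassifies.

```python
import math

def source_beta(n_source_total, max_iters):
    """Assumed TrAdaBoost-style multiplier for misclassified source samples."""
    return 1.0 / (1.0 + math.sqrt(2.0 * math.log(n_source_total) / max_iters))

def weighted_error(weights, wrong):
    """Weighted fraction of misclassified samples (wrong is a list of bools)."""
    return sum(w for w, bad in zip(weights, wrong) if bad) / sum(weights)

def msdtra_update(src_w, src_wrong, tgt_w, tgt_wrong, beta_s):
    """One weight update; src_w and src_wrong hold one list per source domain."""
    eps = weighted_error(tgt_w, tgt_wrong)      # error on the target training set
    if not 0.0 < eps < 0.5:
        raise ValueError("this sketch needs 0 < eps_t < 0.5")
    beta_t = eps / (1.0 - eps)                  # target multiplier
    c_t = 2.0 * (1.0 - eps)                     # dynamic factor
    # Source: WMA-style decrease for misclassified samples, scaled by c_t.
    new_src = [[c_t * w * (beta_s if bad else 1.0) for w, bad in zip(ws, bads)]
               for ws, bads in zip(src_w, src_wrong)]
    # Target: misclassified samples gain weight by the factor 1 / beta_t.
    new_tgt = [w / beta_t if bad else w for w, bad in zip(tgt_w, tgt_wrong)]
    # Normalize so that all weights sum to 1 again.
    total = sum(sum(ws) for ws in new_src) + sum(new_tgt)
    new_src = [[w / total for w in ws] for ws in new_src]
    new_tgt = [w / total for w in new_tgt]
    return new_src, new_tgt, eps

# Hypothetical toy data: two source domains with two samples each, four target samples.
src_w = [[0.1, 0.1], [0.1, 0.1]]
src_wrong = [[False, False], [False, True]]
tgt_w = [0.15, 0.15, 0.15, 0.15]
tgt_wrong = [True, False, False, False]
beta_s = source_beta(n_source_total=4, max_iters=50)
new_src, new_tgt, eps = msdtra_update(src_w, src_wrong, tgt_w, tgt_wrong, beta_s)
```

In this toy run the misclassified source sample ends up with less weight than its correctly classified neighbors, and the misclassified target sample ends up with more. When every source sample is classified correctly, the factor $C_t$ cancels the normalization and the source weights stay where they were.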

Theoretical Analysis
The previous section introduced the proposed instance transfer learning algorithm in detail. In this section, the related theoretical analysis is given, following the analysis of the single-source TrAdaBoost algorithm [13]. First, Theorems 1 and 2 establish the effect of the dynamic factor in the source weights on the source and target sample weight vectors, respectively.

Theorem 1.
A dynamic factor $C_t = 2(1 - \varepsilon_t)$ applied to the source weights prevents their weight drift and yields the weight update mechanism for the source samples, where $\varepsilon_t$ is the error on the target training set at boosting iteration $t$.
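Proof. Let $W_S^t$ and $W_T^t$ denote the sums of the source and target weights at boosting iteration $t$, normalized so that $W_S^t + W_T^t = 1$. Let $A$ be the sum of correctly classified target weights at boosting iteration $t + 1$ and $B$ the sum of misclassified target weights at boosting iteration $t + 1$, both before normalization. With $\beta_t = \varepsilon_t/(1 - \varepsilon_t)$, consider

$$A = W_T^t (1 - \varepsilon_t), \qquad B = W_T^t \, \varepsilon_t \, \beta_t^{-1} = W_T^t (1 - \varepsilon_t).$$

Substituting for $A$ and $B$ to simplify the source update of TrAdaBoost for a correctly classified source instance (assuming the source instances are classified correctly, so that their weights are not reduced), we have

$$w_S^{t+1} = \frac{w_S^t}{W_S^t + A + B} = \frac{w_S^t}{W_S^t + 2 W_T^t (1 - \varepsilon_t)}.$$

Introducing the correction factor $C_t$ into the WMA and requiring $w_S^{t+1} = w_S^t$, we have

$$w_S^t = \frac{C_t \, w_S^t}{C_t W_S^t + 2 W_T^t (1 - \varepsilon_t)} \;\Longrightarrow\; C_t = \frac{2 W_T^t (1 - \varepsilon_t)}{1 - W_S^t} = 2(1 - \varepsilon_t).$$

Theorem 2. The dynamic factor $C_t = 2(1 - \varepsilon_t)$ applied to the source weights makes the target weights converge as outlined by TrAdaBoost.

Proof. In TrAdaBoost, without any source instances ($n_S = 0$, so that $W_S^t = 0$ and $W_T^t = 1$), the target weights of correctly classified instances are updated as

$$w_T^{t+1} = \frac{w_T^t}{A + B} = \frac{w_T^t}{2(1 - \varepsilon_t)}.$$

Applying the dynamic factor to the update of the source instance weights, we obtain the update mechanism of the target instance weights in MSDTrA:

$$w_T^{t+1} = \frac{w_T^t}{C_t W_S^t + 2 W_T^t (1 - \varepsilon_t)} = \frac{w_T^t}{2(1 - \varepsilon_t)(W_S^t + W_T^t)} = \frac{w_T^t}{2(1 - \varepsilon_t)},$$

which matches the TrAdaBoost update without source instances.

Next, we analyze the performance of MSDTrA on the target training set.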
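Returning briefly to Theorems 1 and 2, a small worked instance may make the role of the dynamic factor concrete. The numbers are illustrative assumptions, not values from the experiments: the normalized source and target weight sums are taken as $W_S^t = 0.4$ and $W_T^t = 0.6$, the target error as $\varepsilon_t = 0.25$ (so $\beta_t = 1/3$ and $C_t = 1.5$), and all source instances are assumed to be classified correctly. $A$ and $B$ are the unnormalized sums of correctly classified and misclassified target weights after the update.

```latex
\begin{align*}
A &= W_T^t (1 - \varepsilon_t) = 0.6 \times 0.75 = 0.45, &
B &= W_T^t \, \varepsilon_t \, \beta_t^{-1} = 0.6 \times 0.25 \times 3 = 0.45, \\
\text{without } C_t:\quad w_S^{t+1} &= \frac{w_S^t}{0.4 + 0.45 + 0.45} = \frac{w_S^t}{1.3} \approx 0.77 \, w_S^t, \\
\text{with } C_t = 1.5:\quad w_S^{t+1} &= \frac{1.5 \, w_S^t}{1.5 \times 0.4 + 0.9} = w_S^t, \\
w_T^{t+1} &= \frac{w_T^t}{1.5 \times 0.4 + 0.9} = \frac{w_T^t}{1.5} = \frac{w_T^t}{2(1 - \varepsilon_t)}.
\end{align*}
```

Without the factor, a correctly classified source instance loses about 23% of its weight in one iteration purely through normalization; with the factor, its weight is preserved and the correctly classified target weights are scaled exactly as in TrAdaBoost without source instances.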

Theorem 3. The final error on the target training set is
Proof. Let $I$ denote the set of all target training samples misclassified by the final classifier, so that the final error is $\varepsilon = |I|/n_T$, where $n_T$ is the number of target training samples. At each iteration $t$, the error on the target training set is the weighted error $\varepsilon_t$ of the candidate classifier, where $0 \le \varepsilon_t \le 1/2$. If the error on the target training set is 0 ($\varepsilon_t = 0$), the training sample weights are not updated: $w^{t+1} = w^t$. If $\varepsilon_t \ne 0$ and $\beta_t = \varepsilon_t/(1 - \varepsilon_t) \ne 0$, the weights of the target training samples are updated with $\beta_t$. Combining the resulting relations (18) and (19) and substituting $\beta_t = \varepsilon_t/(1 - \varepsilon_t)$ into (20), we obtain the bound of Theorem 3.
According to Theorem 3, because the condition $\varepsilon_t < 0.5$ is satisfied in the algorithm, the error on the final target training data decreases as the number of iterations increases. The upper bound of the associated generalization error can be calculated as $\varepsilon + O(\sqrt{M d_{VC}/n_T})$, where $d_{VC}$ is the VC-dimension of the weak classifier model and $M$ is the number of iterations.

Experimental Results and Analysis
The performance of the proposed method is investigated based on object category recognition in this section. Without loss of generality, we consider the following case: a small number of training samples of a target object category and a large number of training samples of other source object categories. For any test sample, we verify whether it belongs to the target object category or not.

Experimental Setting.
For object category recognition, the Caltech 256 dataset, which contains 256 object categories, is considered. Among these 256 object categories, the 80 categories that contain more than 50 samples are used in our experiment. We designate the target category and randomly draw the samples that form the target data. The number of samples for training is limited to between 1 and 50, while the number of samples for testing is 50. Furthermore, to illustrate that the proposed method does not depend on the data set, we have also used the background dataset, collected via the Google image search engine, along with the remaining categories as our augmented background data set, to verify the effectiveness and robustness of this method.
The remaining categories are treated as the repository from which to draw positive samples for the source data. The number of source categories (domains) is varied from 1 to 10 in order to investigate the performance of the classifiers with respect to the variability of domains. The number of samples for one source of data is 100. For each target object category, the performance of the classifier is evaluated over 20 random combinations of source object categories. Given the target and source categories, the performance of the classifier is obtained by averaging over 20 trials. The overall performance of the classifier is averaged over 20 target categories. An SVM is selected as the base classifier and the number of iterations is 50.

Error Analysis.
Since transfer learning is not needed to get good classification results when the target data set is large, the standard cross-validation method is not used here. A small portion of the target set is used for training, and most of the remaining samples are used for testing. The results are shown in Figure 1. Fixing the number of source domains at $N = 4$, Figure 1(a) shows the ROC curves of the four algorithms as the number of training instances $n_T$ increases. Since AdaBoost does not transfer any knowledge from the source, its performance depends mainly on $n_T$. For a very small value of $n_T$, it shows only slight improvement, as the ROC curves show. However, owing to the transfer learning mechanism, TrAdaBoost shows good improvement by combining the three sources. By incorporating the ability to transfer knowledge from multiple individual domains, MSTrA and MSDTrA demonstrate a significant improvement in recognition accuracy, even for a very small $n_T$. In addition, the performance of AdaBoost and TrAdaBoost strongly depends on the selection of source domains and target positive samples, as the standard deviation of the ROC shows.
Fixing the number of training instances at $n_T = 10$, Figure 1 shows the performance in both accuracy and consistency. Since TrAdaBoost is incapable of exploring the decision boundaries separating multiple source domains, its performance remains unchanged regardless of the number of source domains.
Figure 2 compares the classification performance of the different methods in the target domain. We can see that the AdaBoost algorithm does not transfer source domain knowledge and achieves lower classification accuracy. DTrAdaBoost has relatively poor test results, because it only uses one source.
To compare the results objectively, hypothesis testing is applied to the experimental results. Let the variables $X_1, X_2, X_3, X_4, X_5$ denote the classification error rates of the MSDTrA, MSTrA, CDASVM, DTrAdaBoost, and AdaBoost algorithms, respectively. Since the values of $X_1, \ldots, X_5$ are subject to many random factors, we assume that they follow normal distributions, $X_i \sim N(\mu_i, \sigma_i^2)$, $i = 1, 2, 3, 4, 5$. We then compare the means $\mu_i$ ($i = 1, 2, 3, 4, 5$) of these random variables. The smaller $\mu_i$ is, the lower the expected classification error rate is and the higher the efficiency is. Because the sample variance is an unbiased estimate of the population variance, the sample variance is used as the estimate of the population variance. In this experiment the significance level is set to 0.01. Table 1 shows the comparison of $\mu_i$ and the other parameters. We can see from Table 1 that the expected classification error rate of MSDTrA is far below that of the other algorithms.
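The comparison of mean error rates described above can be sketched as a one-sided test. This is a minimal sketch under stated assumptions rather than the procedure behind Table 1: the function name `one_sided_z_test` and the error-rate lists are hypothetical, and the test treats the sample variances as the population variances, so a normal (z) critical value is used at the 0.01 significance level.

```python
import math
from statistics import NormalDist, mean, variance

def one_sided_z_test(errors_a, errors_b, alpha=0.01):
    """Test H0: mu_a >= mu_b against H1: mu_a < mu_b.

    Sample variances stand in for the population variances, as the text
    assumes, so the statistic is compared with a normal quantile.
    """
    se = math.sqrt(variance(errors_a) / len(errors_a)
                   + variance(errors_b) / len(errors_b))
    z = (mean(errors_a) - mean(errors_b)) / se
    critical = NormalDist().inv_cdf(alpha)   # negative: lower-tail quantile
    return z, critical, z < critical         # True means H0 is rejected

# Hypothetical error rates from repeated trials (not the paper's data).
errors_proposed = [0.10, 0.12, 0.11, 0.09, 0.10, 0.11]
errors_baseline = [0.20, 0.22, 0.19, 0.21, 0.20, 0.23]
z, critical, reject = one_sided_z_test(errors_proposed, errors_baseline)
```

Rejecting the null hypothesis here would support the claim that the first algorithm has the lower expected error rate at the chosen significance level.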

Time Complexity.
Since several domains are used together in learning the target task, the time complexity with multiple source domains is higher than with a single domain. Supposing that the time complexities of training a classifier and updating the weights are ℎ and , respectively, the time complexities of AdaBoost, DTrAdaBoost, MSTrA, and MSDTrA can be approximated by ℎ ( ) + ( ), ℎ ( ) + ( ), ℎ ( ) + ( ), and ℎ ( ) + ( ), respectively. Furthermore, Figure 3 shows the average training time of the four algorithms with fixed , .

Dynamic Factor.
This experiment demonstrates the effect of the dynamic factor on the source weights and the target weights. Here a single source domain is considered ($N = 1$). In Figure 4(a), the number of instances is held constant ($n_S = 1000$, $n_T = 200$) and the source error rate is set to zero. According to the WMA, the source weights should not change because the source error is zero; that is, $w_S^{t+1} = w_S^t$. For target error rates $\varepsilon_t \in \{10\%, 20\%, 30\%, 40\%\}$, the ratio of the weights of MSDTrA and MSTrA is plotted at different boosting iterations.
We can see the following from Figure 4(a). (1) In MSTrA, the source weights always converge, even when the classification results are correct. (2) MSDTrA matches the behavior of the WMA. (3) If the dynamic factor is not applied, the smaller the value of $\varepsilon_t$ is, the faster the source weights converge. In addition, for a weak learner with $\varepsilon_t = 10\%$, MSTrA is still not able to get good performance by using over 1000 source instances, even though they were never misclassified.
Figure 4(b) considers different numbers of target instances, {10, 20, 50}. It can be observed that after a single boosting iteration, the ratio for a correctly classified source instance increases with the increase of .
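The behavior reported for Figure 4(a) can be reproduced qualitatively with a short simulation. This is a sketch under simplifying assumptions, not the authors' experiment code: the function name `source_weight_ratio` is hypothetical, every source instance is assumed to be classified correctly, the weighted target error is assumed to stay at the same value in every iteration, and only the total source weight is tracked relative to its initial value (the WMA would keep this ratio at 1).

```python
def source_weight_ratio(eps, n_src=1000, n_tgt=200, iters=20, dynamic=False):
    """Ratio of total source weight to its initial value after each iteration."""
    w_s = n_src / (n_src + n_tgt)   # total source weight under uniform initialization
    w_t = 1.0 - w_s                 # total target weight
    start = w_s
    ratios = []
    for _ in range(iters):
        c_t = 2.0 * (1.0 - eps) if dynamic else 1.0
        src = c_t * w_s                   # correctly classified source mass
        tgt = 2.0 * w_t * (1.0 - eps)     # target mass: correct part plus boosted misclassified part
        total = src + tgt
        w_s, w_t = src / total, tgt / total
        ratios.append(w_s / start)
    return ratios

without_factor = {eps: source_weight_ratio(eps) for eps in (0.1, 0.2, 0.3, 0.4)}
with_factor = {eps: source_weight_ratio(eps, dynamic=True) for eps in (0.1, 0.2, 0.3, 0.4)}
```

Under these assumptions the ratio stays at 1 when the dynamic factor is applied, while without it the source weight shrinks in every iteration, and shrinks faster for smaller target error rates.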

Conclusions
Considering the situation in which sample data from the source domains and the target domain have similar distributions, an instance transfer learning method based on multisource dynamic TrAdaBoost has been presented. By integrating the knowledge in multiple source domains, this method makes good use of the information of all source domains to guide the learning of the target task. Whenever candidate classifiers are trained, all the samples in all source domains are involved in learning, and the information that is beneficial to learning the target task can be obtained, so that negative transfer can be avoided. The theoretical analysis and experimental results suggest that the proposed algorithm has higher classification accuracy than several existing algorithms.