Training Classifiers under Covariate Shift by Constructing the Maximum Consistent Distribution Subset

1School of Information Science and Technology, Qingdao University of Science and Technology, Qingdao 266061, China 2The College of Textiles and Fashion, Qingdao University, Qingdao 266071, China 3Sino-German Faculty, Qingdao University of Science and Technology, Qingdao 266061, China 4College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China 5College of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, China


Introduction
Traditional classification methods, such as Support Vector Machines (SVMs) [1, 2], decision trees [3, 4], and neural networks [5, 6], are usually based on the assumption that the training and testing samples are drawn from the same distribution. In many classification scenarios, however, such as the class imbalance problem [7, 8], concept drift [9, 10], and covariate shift [11], this assumption is violated. An informal description of covariate shift is as follows: covariate shift refers to learning settings in which the source data sets and target data sets have the same feature attributes, label attribute, and conditional probability P(y | x) but different feature distributions. In this paper, we mainly study classification under covariate shift. Below, we give the two types of classification scenarios, both belonging to covariate shift, that we focus on in this paper.

Scenario 1. Classification problems contain training samples, testing samples, and auxiliary samples, where the training samples and testing samples are drawn from the same distribution, while the auxiliary samples are drawn from another distribution. In addition, the training set is very small.
In the real world, many classification problems belong to Scenario 1. For example, suppose we want to construct a Web page classification model. The Web data used in training such a model can easily become outdated when the model is applied to the Web some time later, because the topics on the Web change frequently. Often, new data are expensive to label, and thus their quantities are limited due to cost. How to accurately classify the new test data by making maximum use of the old data becomes a critical problem.

Scenario 2. Classification problems contain training samples and testing samples, where the training samples and testing samples are drawn from different distributions. There are no auxiliary samples.
For example, suppose we are using a learning method to induce a model that predicts the side effects of a treatment for a given patient. Because the treatment is not given randomly to individuals in the general population, the available training samples are not a random sample from the population; therefore, the training samples and testing samples are drawn from different distributions. How to accurately classify the testing data by employing the training data becomes a critical problem.
In this paper, we address the two types of covariate shift problems by training on a newly constructed set following approximately the target distribution. The rest of this paper is organized as follows. In Section 2, we formally define covariate shift in machine learning terms and describe related work on this problem. In Section 3, we propose the MIDS (maximum identical distribution subset) construction method, which matches sample numbers between the target set and the auxiliary set in each feature subspace. In Section 4, we present the corresponding data correction methods for the two scenarios, propose the corresponding classification algorithms, and analyze their time complexity. Our experimental results and discussion are given in Section 5. Section 6 summarizes the main contribution of this paper and outlines future work.

Related Concepts and Works
2.1. Related Concepts. In this section, we introduce some notation and definitions used in this paper. First of all, we give the definitions of a "domain" and a "task," respectively.

Definition 1 (domain). In this paper, a domain D consists of two components: a feature space X and a marginal probability distribution P(x), where x = (x_1, ..., x_n) ∈ X.
Definition 2 (task). Given a specific domain D = {X, P(x)}, a task consists of two components: a label space Y and an objective predictive function f(·) (denoted by T = {Y, f(·)}), which is not observed but can be learned from the training data, which consist of pairs {x_i, y_i}, where x_i ∈ X and y_i ∈ Y. The function f(·) can be used to predict the corresponding label, f(x), of a new instance x. From a probabilistic viewpoint, f(x) can be written as P(y | x).
In this section, we denote the source domain data by D_S and the target domain data by D_T; their formal notation is given together with Definition 3 (covariate shift). It is worthwhile to note that there can be multiple auxiliary data sets in classification problems under covariate shift, and their feature distributions can differ. In addition, from the definition of covariate shift we can see that the two scenarios described in Section 1 do belong to covariate shift: for Scenario 2, the testing samples can be considered target samples and the training samples can be considered auxiliary samples.

Related Works.
As described before, covariate shift includes the two scenarios described above. With respect to Scenario 1, auxiliary samples are utilized to improve the performance of classifiers. In previous works, Wu and Dietterich [12] proposed an image classification algorithm using both inadequate training data and plenty of low quality auxiliary data. They demonstrated some improvement by using the auxiliary data but did not give a quantitative study using different auxiliary examples. Liao et al. [13] improved learning with auxiliary data using active learning. Rosenstein et al. [14] proposed a hierarchical naive Bayes approach for transfer learning using auxiliary data and discussed when transfer learning would improve or decrease the performance. Dai et al. [15] proposed a covariate shift-related algorithm, TrAdaBoost, an extension of the AdaBoost algorithm, to address inductive transfer learning problems. TrAdaBoost assumes that the source and target domain data use exactly the same set of features and labels but that the distributions of the data in the two domains are different. In addition, TrAdaBoost assumes that, because of this difference in distributions, some of the source domain data may be useful in learning for the target domain while some may not and could even be harmful. It therefore iteratively reweights the source domain data to reduce the effect of the "bad" source data while encouraging the "good" source data to contribute more to the target domain. In each round of iteration, TrAdaBoost trains the base classifier on the weighted source and target data, and the error is calculated only on the target data. Furthermore, TrAdaBoost uses the same strategy as AdaBoost [16] to update the incorrectly classified examples in the target domain but a different strategy to update the incorrectly classified examples in the source domain. However, TrAdaBoost cannot deal with the case where there are multiple auxiliary data sets coming from different distributions.
With respect to Scenario 2, unlabeled testing samples are utilized to improve the performance of classifiers. Unlike the semisupervised learning problem [17], in Scenario 2 the unlabeled testing samples follow a different distribution from the training samples and are used to correct the sample selection bias. In previous works, most approaches intend to estimate the importance w(x) = p_te(x)/p_tr(x), the ratio of the test density to the training density. If we can estimate the importance of each instance, we can solve the learning problems under covariate shift. There exist various ways to estimate p_te(x)/p_tr(x).
Zadrozny [18] proposed to estimate the terms p_te(x) and p_tr(x) independently by constructing simple classification problems and then to estimate the importance by taking the ratio of the estimated densities. However, density estimation is known to be a hard problem, particularly in high-dimensional cases; therefore, this approach may not be effective.
Huang et al. [19] proposed a kernel mean matching (KMM) algorithm that learns p_te(x)/p_tr(x) directly by matching the means of the source domain data and the target domain data in a reproducing kernel Hilbert space (RKHS). KMM is shown to work well if tuning parameters such as the kernel width are chosen appropriately; thus, the importance estimation problem is relocated to a model selection problem. Standard model selection methods such as cross-validation, however, are heavily biased under covariate shift, so KMM cannot be applied directly in the cross-validation [20] framework.
Unlike KMM, Sugiyama et al. [21] proposed an algorithm known as the Kullback-Leibler importance estimation procedure (KLIEP), which is equipped with a natural model selection procedure. KLIEP can be integrated with cross-validation to perform model selection automatically in two steps: (1) estimate the importance weights of the source domain data; (2) train models on the reweighted data.
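To make the reweighting idea shared by these methods concrete, the toy sketch below (not KMM or KLIEP themselves) assumes the two densities are known and reweights training points by w(x) = p_te(x)/p_tr(x); a weighted average over training points then approximates an expectation under the test distribution. The densities and function names here are illustrative.

```python
def importance_weights(train_x, p_te, p_tr):
    """Weight each training point by w(x) = p_te(x) / p_tr(x)."""
    return [p_te(x) / p_tr(x) for x in train_x]

def weighted_mean(values, weights):
    """Weighted average; with importance weights this approximates
    an expectation taken under the test distribution."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Toy 1-D densities, chosen so the ratio is easy to check:
# training points uniform on [0, 2], test points uniform on [0, 1].
def p_tr(x):
    return 0.5 if 0.0 <= x <= 2.0 else 0.0

def p_te(x):
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

train_x = [0.1, 0.5, 0.9, 1.3, 1.7]
w = importance_weights(train_x, p_te, p_tr)  # points outside [0, 1] get weight 0
est = weighted_mean(train_x, w)              # approximates E_test[x] = 0.5
```

Note that the weighted mean recovers the test-distribution expectation even though most training points lie outside the test support, which is exactly the correction these estimators aim for.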
In this paper, we propose a novel method that deals with classification problems under covariate shift by constructing a MIDS. The formal definition of the MIDS and its construction method are given in the next section. Unlike previous transfer learning methods, our method handles both scenarios consistently, as well as the case where multiple auxiliary data sets come from different distributions. Furthermore, unlike the sample reweighting techniques above, we do not estimate distributions but match sample numbers between the target set and the auxiliary set in each feature subspace, and we do not reweight samples but construct a new training set following approximately the target distribution.

The MIDS Construction Method
In this section, we use two data sets, a target set and an auxiliary set. Our objective is to design a method that constructs the MIDS from the auxiliary samples according to the target distribution. First of all, we give the formal definitions of the identical distribution subset (IDS) and the MIDS.
Definition 4 (identical distribution subset, IDS). Let T = {t_1, t_2, ..., t_m} be a target sample set and S = {s_1, s_2, ..., s_n} a source sample set. Assume that they follow different distributions but have the same feature and label spaces; that is, P_S(x, y) ≠ P_T(x, y), X_S = X_T, and Y_S = Y_T. An identical distribution subset is a subset of S that follows the same distribution as T.

Definition 5 (maximum identical distribution subset, MIDS).
An identical distribution subset S* is called a maximum identical distribution subset if there exists no other identical distribution subset with a size bigger than that of S*.

Basic Idea of MIDS Construction Method.
Our basic idea of MIDS construction is first to partition the feature space into several subspaces and then to construct the MIDS by matching sample numbers between the target set and the auxiliary set in each feature subspace; that is, we select a maximum amount of auxiliary samples from each subspace to compose the MIDS according to the proportion of target samples in each subspace. The detailed process of the MIDS construction method is presented in Section 3.2.
(1) Partitioning the Feature Space into Several Subspaces. Firstly, compute the mean of the target set by the following formula:

x_0 = (1/m) * sum_{i=1}^{m} t_i.

Then, partition the n-dimensional space into 2^n subspaces. In detail, let x = (x_1, ..., x_d, ..., x_n) be any vector in the feature space, where x_d denotes the dth-dimensional value of the vector x. Comparing the dth-dimensional value of x with the dth-dimensional value of x_0 for each d, we obtain n inequalities. We use an n-dimensional binary vector to represent these inequalities; that is, if x_d ≤ x_{0,d}, we set the dth value of the binary vector to 0, and otherwise to 1. Thus we divide the feature space into 2^n subspaces, corresponding to the 2^n binary vectors from (0, 0, ..., 0) to (1, 1, ..., 1). We number each subspace with the decimal number corresponding to its binary vector.
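Step (1) can be sketched in a few lines; the function names and the list-of-lists representation of samples are illustrative, not from the paper.

```python
def target_mean(samples):
    """Componentwise mean x0 of the target feature vectors."""
    n, dim = len(samples), len(samples[0])
    return [sum(x[d] for x in samples) / n for d in range(dim)]

def subspace_index(x, x0):
    """Decimal number of the subspace containing x:
    bit d of the binary vector is 0 if x[d] <= x0[d], and 1 otherwise."""
    idx = 0
    for xd, x0d in zip(x, x0):
        idx = (idx << 1) | (1 if xd > x0d else 0)
    return idx

# Two 2-D target points whose mean is (1, 1); any vector is then mapped
# to one of the 2^2 = 4 subspaces numbered 0..3.
x0 = target_mean([[0.0, 2.0], [2.0, 0.0]])
```

For example, a point below the mean in both coordinates falls in subspace 0, and a point above it in both coordinates falls in subspace 2^n - 1.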
(2) Computing the Proportion of the Target Samples in Each Subspace. Compute the number of target samples in each subspace; from these counts we obtain the proportion of target samples in each subspace.
(3) Extracting Samples from the Auxiliary Set. We first compute the number of auxiliary samples in each subspace. Then, according to the target proportions and these numbers, we select a maximum amount of samples from each subspace to compose the MIDS, such that the proportions of the auxiliary samples selected from the subspaces are consistent with the proportions of the target samples.

Thus we obtain the MIDS, and this subset can be considered to follow approximately the target distribution.
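Step (3) leaves the exact selection rule implicit; one plausible reading is to take the largest total size m whose per-subspace quota m · p_k is available in the auxiliary set for every target-occupied subspace k. A sketch under that assumption, working on precomputed subspace indices:

```python
from collections import Counter

def mids_counts(target_idx, aux_idx):
    """For each subspace occupied by target samples, how many auxiliary
    samples to take so that the selected proportions match the target
    proportions (one plausible reading of the matching step)."""
    t, a = Counter(target_idx), Counter(aux_idx)
    n_t = sum(t.values())
    # largest total size m with m * t[k] / n_t <= a[k] in every target subspace
    m = min(a.get(k, 0) * n_t // t[k] for k in t)
    return {k: m * t[k] // n_t for k in t}
```

With target indices [0, 0, 1] and auxiliary indices [0, 0, 0, 0, 1, 1], the quota keeps four auxiliary samples from subspace 0 and two from subspace 1, preserving the 2:1 target proportion; if some target subspace contains no auxiliary sample, the quotas collapse to zero.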

The Description of the MIDS Construction Algorithm and Its Time Complexity Analysis.
The pseudocode of the MIDS construction algorithm is described in Algorithm 1. Firstly, we define two 2-dimensional arrays, A and B. The first dimension of Array A records the subspace numbers of the samples in the target sample set; note that Array A keeps only one record for samples sharing the same subspace number. The second dimension of Array A records the number of samples in the corresponding subspace. Array B is organized in the same way, but for the source sample set. We then obtain Array A from the target set and the mean x_0: the subspace number of a sample is computed by real number comparisons, and counting the samples in each subspace requires one scan of the whole target set. Array B is obtained analogously from the source set. Finally, we select a maximum amount of samples from each subspace to compose the MIDS S* according to Arrays A and B.
We split the whole space into 2^n subspaces, and for large n the number of subspaces is enormous, which would cause a curse-of-dimensionality problem if we counted the samples in every subspace. Fortunately, this is not necessary: samples are always sparse in a high-dimensional space, so we only need to count the samples in the subspaces that actually contain samples.

The time complexity therefore consists of two parts, corresponding to computing the proportions of the target samples in the occupied subspaces and computing the numbers of the source samples in the occupied subspaces. If we define one real number comparison as one basic operation, we need O(n_S + n_T) operations, where n_S denotes the size of the source sample set and n_T denotes the size of the target sample set.

Classification Methods under Covariate Shift by Constructing the MIDS
In Section 3, we presented a general MIDS construction method that matches sample numbers between the target set and the auxiliary set in each feature subspace. In this section, we propose the specific MIDS construction methods corresponding to Scenarios 1 and 2, respectively, together with the classification methods for the two scenarios. Assume that T_tr and T_te follow the same distribution P_T(x, y) and that T_au follows another distribution P_au(x, y). We present two kinds of MIDS construction methods, one direct and one indirect, and prove that the effect of the indirect method is equivalent to that of the direct method.

The Direct MIDS Construction Method of Scenario 1.
With respect to the direct MIDS construction, we consider the feature vector x and the label y as one joint vector. We take T = T_tr ∪ T_te as the target sample set and T_au as the source sample set; thus we can use the above algorithm directly to obtain the MIDS. The MIDS construction method that treats the feature vector x and the label y as a joint vector is called the direct MIDS construction method.
Since the feature vector x and the label y are treated as one joint vector, the dimension of the samples increases. As described in Section 3.3, the running time of the MIDS algorithm grows with the dimension. In the next section, we present the indirect MIDS construction method, which does not consider x and y collectively but constructs the MIDS from the feature vector x alone. The indirect method thus effectively reduces the running time and, moreover, applies to the case where the target testing set contains only feature vectors.

The Indirect MIDS Construction Method of Scenario 1.
With respect to the case where there are no class labels in the target testing set, we can construct the MIDS from the feature vector x alone. The MIDS construction method that considers only the feature vector x is called the indirect MIDS construction method. Now let T_te = {x_i^te | i = 1, 2, ..., m} be the target testing set. First of all, we present the detailed process of the indirect MIDS construction method.
Process 1. Remove all the labels of the samples in T, and denote the set composed of the remaining feature vectors by T'. Similarly, remove all the labels of the samples in T_au, and denote the set composed of the remaining feature vectors by T'_au.

Process 2. Use the MIDS construction algorithm to obtain a subset S'_au of T'_au.

Process 3. Add the class labels removed in Process 1 back to each sample of S'_au correspondingly; thus we obtain a subset S_au of T_au.
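The three processes can be sketched as follows; `construct_mids` stands in for any feature-only MIDS routine (e.g. Algorithm 1) and is assumed, for illustration, to return the indices of the auxiliary vectors it selects.

```python
def indirect_mids(target_feats, aux_labeled, construct_mids):
    """Processes 1-3 of the indirect method. construct_mids sees feature
    vectors only, never labels, and returns selected auxiliary INDICES."""
    aux_feats = [x for x, _ in aux_labeled]            # Process 1: drop the labels
    picked = construct_mids(target_feats, aux_feats)   # Process 2: MIDS on features alone
    return [aux_labeled[i] for i in picked]            # Process 3: re-attach the labels

# Demo with a stub selector in place of Algorithm 1 (keeps indices 0 and 2):
aux = [([0.1], "pos"), ([0.9], "neg"), ([0.2], "pos")]
stub = lambda t_feats, a_feats: [0, 2]
subset = indirect_mids([[0.0]], aux, stub)
```

Because the labels ride along by index, the conditional distribution P(y | x) of the selected subset is untouched, which is exactly the property Theorem 6 relies on.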
Below we prove that the effect of the indirect method is equivalent to that of the direct method.

Theorem 6. The subset S_au obtained from the auxiliary set T_au by the indirect MIDS method follows the same distribution as the target set T; that is, P_S(x, y) = P_T(x, y), where P_S(x, y) and P_T(x, y) denote the distributions of S_au and T, respectively.

Proof. From the definition of the conditional distribution, we have P(x, y) = P(x) P(y | x). As described in Process 3, S_au is obtained by adding the original class labels back to S'_au; thus the conditional probability is unchanged, that is, P_S(y | x) = P_au(y | x). From Definition 3, we know that P_au(y | x) = P_T(y | x), so P_S(y | x) = P_T(y | x). Moreover, since S'_au is a MIDS of T', we obtain P_S(x) = P_T(x). Therefore P_S(x, y) = P_S(x) P_S(y | x) = P_T(x) P_T(y | x) = P_T(x, y).

The MIDS Construction of Scenario 2. Let T_tr = {(x_i^tr, y_i^tr) | i = 1, 2, ..., n} be the target training set and T_te = {(x_i^te, y_i^te) | i = 1, 2, ..., m} the target testing set. Assume that T_tr and T_te follow the distributions P_tr(x, y) and P_te(x, y), respectively. We consider T_te as the target sample set and T_tr as the source sample set; thus we can again use the above algorithm directly to obtain the MIDS.
With respect to the case where there are no class labels in the target testing set, we can also use the indirect method to construct the MIDS. Now let T_te = {x_i^te | i = 1, 2, ..., m} be the target testing set. The detailed process is as follows.
Process 1. Remove all the labels of the samples in T_tr, and denote the set composed of the remaining feature vectors by T'_tr.

Process 2. Use the MIDS construction algorithm to obtain a subset S'_tr of T'_tr.

Process 3. Add the class labels removed in Process 1 back to each sample of S'_tr correspondingly; thus we obtain a subset S_tr of T_tr.

Experiments
In this section, we perform experiments to test the performance of the proposed classification algorithms. As proven in Section 4, the effect of the indirect method is equivalent to that of the direct method; thus, we test only the performance of the two indirect classification algorithms. The experimental data in this section come from the UCI Machine Learning Repository [22]. All experiments are run on a 2.00 GHz Intel(R) Core(TM) i5-4200U CPU with 4 GB main memory under Windows 8. We select auxiliary samples using a deliberately biased procedure (as in [19]). To describe our biased selection scheme, we define an additional random variable s_i for each point in the pool of possible training samples, where s_i = 1 means the ith sample is included and s_i = 0 indicates an excluded sample. In this paper, we discuss classification problems under covariate shift, so we only consider the situation P(s_i | x_i, y_i) = P(s_i | x_i). Below, we present the detailed method of experimental data construction. First of all, we select some samples randomly from the original data set to compose the target set, 1/4 of the data used for training and 3/4 for testing. Then, among the remaining samples, we apply a biased sampling scheme based on the input features to construct the auxiliary set. For convenience, in this paper, we only consider a biased sampling scheme based on one input feature (e.g., a single feature of the breast cancer data set).

The Experiment on Scenario 1

(2) Experimental Method. We select C-SVC [23] and the Radial Basis Function (RBF) kernel [1] as the basic classification algorithm and kernel function, respectively, for the above three methods, where C is a penalty factor, σ is a width parameter, and x and y are n-dimensional vectors in the original feature space. With respect to the multiclass data sets among the 20 selected data sets, we adopt the one-against-all (1-v-r) approach [24], which transforms a k-class problem into k two-class problems, where one class is separated from the remaining ones. In this experiment, the best C and σ are obtained by 10-fold cross-validation.
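The biased auxiliary-sample selection described above can be sketched as follows; the inclusion probabilities, threshold, and helper name are illustrative assumptions, not taken from [19].

```python
import random

def biased_subsample(samples, feature, threshold, p_high=0.8, p_low=0.2, seed=0):
    """Keep a sample x with probability p_high if x[feature] > threshold and
    with probability p_low otherwise, so inclusion depends on x only:
    P(s = 1 | x, y) = P(s = 1 | x)."""
    rng = random.Random(seed)
    return [x for x in samples
            if rng.random() < (p_high if x[feature] > threshold else p_low)]

# Pool of 1-D candidate points; the biased scheme favors large feature values.
pool = [[float(i)] for i in range(10)]
aux = biased_subsample(pool, feature=0, threshold=4.5, seed=0)
```

Because the selection probability ignores the label, the auxiliary set has a shifted feature distribution but the same conditional P(y | x), matching the covariate shift setting.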
(3) Result Analysis. The three methods are compared on the 20 selected data sets. Five runs of 10-fold cross-validation are performed for each algorithm, and the average result is reported in Table 1, where the numbers following "±" are the standard deviations. The running time and parameter values of the different algorithms are shown in Table 2, where N denotes the number of iterations, C is a penalty factor, σ is a width parameter, and t (ms) denotes the running time. N is set to 100 according to the parameter setting in [15], and the best C and σ are obtained by 10-fold cross-validation.
As shown in Table 1, the precision given by SVM is strictly lower than that of IDC1 and TrAdaBoost. Intuitively, this is because, unlike SVM, IDC1 and TrAdaBoost are learning techniques designed for classification in Scenario 1. Furthermore, Table 1 shows that IDC1 outperforms TrAdaBoost. In detail, pairwise two-tailed t-tests indicate that there are 16 data sets (Australian, balance, breast, Cleveland, credit, diabetes, ionosphere, iris, page, sonar, thyroid, voting, waveform40, wine, wdbc, and wpbc) on which IDC1 is significantly more accurate than TrAdaBoost, while there is no significant difference on the remaining 4 data sets. We believe that the auxiliary set contains not only good samples but also noisy data that make the distribution of the auxiliary set different from that of the target set. The reason why IDC1 outperforms TrAdaBoost might be that IDC1 always employs the most important samples, which are included in the MIDS, to help the learners, while TrAdaBoost sometimes cannot avoid using bad samples. Moreover, as shown in Table 2, IDC1 has the shortest running time, while TrAdaBoost has the longest; thus IDC1 is the most efficient model. The reason is as follows. In this experiment, the traditional SVM algorithm uses the target sample set and the whole auxiliary sample set for training, while IDC1 employs the target sample set and only a subset of the auxiliary sample set. With respect to TrAdaBoost, its running time is the longest because it needs repeated iterations to reach a better performance.

(2) Experimental Method. In the following, we compare our method (the indirect classification of Scenario 2, denoted by IDC2) against two other methods: the traditional classification algorithm and the KLIEP algorithm proposed in [21]. We again select C-SVC and the RBF kernel as the basic classification algorithm and kernel function, respectively, and the 1-v-r approach for the multiclass data sets.
(3) Result Analysis.Five runs of 10-fold cross-validation are performed for each algorithm, and the average result is reported in Table 3, where the numbers following "±" are the standard deviations.Also the running time and parameter values of different algorithms are shown in Table 4.
As shown in Table 3, the precision given by SVM is strictly lower than that of IDC2 and KLIEP. As in the analysis of Scenario 1, this is because, unlike SVM, IDC2 and KLIEP are learning techniques designed for classification in Scenario 2. Furthermore, Table 3 shows that IDC2 is comparable to KLIEP, a state-of-the-art algorithm. In detail, pairwise two-tailed t-tests indicate that there are 4 data sets (Australian, credit, page, and waveform21) on which IDC2 outperforms KLIEP and 3 data sets (heart, vehicle, and wpbc) on which IDC2 performs slightly worse than KLIEP, while there is no significant difference on the remaining 13 data sets.

Conclusion
In this paper, we first propose a MIDS construction method that matches sample numbers between the target set and the auxiliary set in each feature subspace, and we then propose a novel approach for classification under covariate shift whose basic idea is to train a model on a newly constructed data set following approximately the target distribution. Our approach consists of two methods, a direct one and an indirect one. The theoretical analysis shows that the indirect method is equivalent to the direct method for MIDS construction but runs in less time. In our experiments, the two indirect algorithms, IDC1 and IDC2, demonstrate better classification ability than traditional learning techniques. In addition, our method can consistently deal with both scenarios of covariate shift and with the case where multiple auxiliary data sets come from different distributions.
We note that our method assumes that the source domain and the target domain share the same concept and cannot deal with the case where the concepts differ, that is, the problem of concept drift. In the future, we will try to extend the proposed method to address this issue.

4.3. Classification Algorithms. With the help of the MIDS construction method, we can perform effective classification with traditional classification methods. With respect to Scenario 1, we first construct the MIDS from the auxiliary set and then train a model on the union of the target training set and the MIDS. The pseudocodes of the two classification algorithms corresponding to the direct and indirect MIDS construction methods are shown in Algorithms 2 and 3, respectively. With respect to Scenario 2, we first construct the MIDS from the target training set and then train a model on this MIDS alone. The pseudocodes of the two classification algorithms corresponding to the direct and indirect MIDS construction methods are shown in Algorithms 4 and 5, respectively.
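The two training pipelines can be sketched generically; `build_mids` and `fit` are hypothetical placeholders for the MIDS routine and an arbitrary base learner, not APIs from the paper.

```python
def train_scenario1(target_train, aux_set, build_mids, fit):
    """Scenario 1: augment the small target training set with the MIDS
    extracted from the auxiliary set, then train the base learner."""
    mids = build_mids(target_train, aux_set)
    return fit(target_train + mids)

def train_scenario2(target_test_feats, train_set, build_mids, fit):
    """Scenario 2: keep only the training samples consistent with the
    test distribution (the MIDS of the training set with respect to the
    test features), then train on that subset alone."""
    mids = build_mids(target_test_feats, train_set)
    return fit(mids)
```

The asymmetry mirrors the text: in Scenario 1 the MIDS supplements a scarce target training set, while in Scenario 2 it filters a biased training set down to its distribution-consistent part.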

(1) Experimental Data Construction (Scenario 1). This experiment is performed on 20 data sets, from which the target training set, the target testing set, and the auxiliary training set are constructed by the following principles.

Principle 1. The target training set and the target testing set should follow the same distribution.

Principle 2. The auxiliary training set and the target set should follow different distributions.

Principle 3. The size of the target training set is far less than that of the auxiliary training set.
Formally, we denote the source domain data by D_S = {(x_{S_1}, y_{S_1}), ..., (x_{S_{n_S}}, y_{S_{n_S}})}, where x_{S_i} ∈ X_S is a data instance and y_{S_i} ∈ Y_S is the corresponding class label. Similarly, we denote the target domain data by D_T = {(x_{T_1}, y_{T_1}), ..., (x_{T_{n_T}}, y_{T_{n_T}})}, where the input x_{T_i} is in X_T and y_{T_i} ∈ Y_T is the corresponding output. In most cases, 0 ≤ n_T ≪ n_S. We now give a formal definition of covariate shift.

Definition 3 (covariate shift). Covariate shift refers to learning settings with the following features: (1) the source domain and target domain have the same feature and label spaces, that is, X_S = X_T and Y_S = Y_T; (2) the source domain and target domain have different feature distributions, that is, P_S(x) ≠ P_T(x); (3) the source domain and target domain have the same concept, that is, P_S(y | x) = P_T(y | x).

Algorithm 1 (the MIDS construction algorithm).
Require: the source sample set S, the target sample set T, the dimension n, the size n_S of the source sample set, and the size n_T of the target sample set.
(1) A = Array_generation(T, x_0) /* obtain Array A from T and the mean x_0; the first dimension of A records the subspace numbers of the target samples (one record per distinct subspace number), and the second dimension records the number of samples in the corresponding subspace */
(2) B = Array_generation(S, x_0) /* obtain Array B from S and x_0; B is organized as A, but for the source sample set */
(3) S* = construct(A, B) /* select a maximum amount of samples from each subspace to compose the MIDS S* according to Arrays A and B */
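A runnable rendering of Algorithm 1 is given below; the array names A and B follow the text, but since the `construct` step is pseudocode, the proportion-matching quota used here is one plausible reading, not the paper's exact rule.

```python
from collections import Counter, defaultdict

def construct_mids(target, source):
    """Select the largest source subset whose per-subspace proportions
    match those of the target set (a sketch of Algorithm 1)."""
    n_t, dim = len(target), len(target[0])
    # mean of the target set
    x0 = [sum(x[d] for x in target) / n_t for d in range(dim)]

    def idx(x):
        # subspace number: bit d is 1 iff x[d] > x0[d]
        k = 0
        for d in range(dim):
            k = (k << 1) | (1 if x[d] > x0[d] else 0)
        return k

    A = Counter(idx(x) for x in target)   # target samples per occupied subspace
    by_space = defaultdict(list)
    for x in source:                      # source samples grouped by subspace
        by_space[idx(x)].append(x)
    B = {k: len(v) for k, v in by_space.items()}

    # largest total size m with m * A[k] / n_t <= B[k] in every target subspace
    # (m = 0 if some target subspace contains no source sample)
    m = min(B.get(k, 0) * n_t // A[k] for k in A)

    mids = []
    for k in A:
        mids.extend(by_space[k][: m * A[k] // n_t])
    return mids

target = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
source = [[-1.0, -1.0], [0.0, 0.0], [0.5, 0.5], [1.0, 1.0],
          [2.0, 2.0], [3.0, 3.0], [-1.0, 5.0]]
mids = construct_mids(target, source)
```

In this example two target subspaces are occupied in a 2:1 ratio; the routine keeps six source samples in the same ratio and discards the source point whose subspace contains no target sample, as the sparsity argument in Section 3.3 anticipates.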

Table 2: The running time and parameter values in Experiment 1.

Table 4: The running time and parameter values in Experiment 2.

As shown in Table 4, IDC2 costs less time than SVM and KLIEP. As in Experiment 1, the reason is that in the training process IDC2 uses only a subset of the training samples, while SVM and KLIEP must employ all of them.