Unsupervised Domain Adaptation Using Exemplar-SVMs with Adaptation Regularization

1School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing 100049, China 2Research Center on Fictitious Economy and Data Science, Chinese Academy of Sciences, Beijing 100190, China 3Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing 100190, China 4School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China 5School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China


Introduction
Over the past decades, machine learning technologies have achieved significant success in various areas, such as computer vision [1], natural language processing [2], and video detection [3]. However, traditional machine learning methods assume that training and testing data come from the same domain, which implies that they are drawn from the same distribution and represented in the same feature space. This assumption is often violated in the real world, because collecting enough suitable labeled data is time consuming and requires expensive manual effort. Lacking labeled data, most traditional machine learning methods lose generalization performance in practice. Therefore, it is desirable to use data from a related domain to help train a robust learner for the target domain. Driven by this requirement, transfer learning has developed rapidly in recent years [4]. Transfer learning relaxes the assumption of traditional machine learning that data and labels are drawn from the same distribution and represented in the same feature space. In transfer learning settings, the i.i.d. assumption is replaced by the assumption that the domains are similar or related, and some work assumes no relationship at all. Thus, transfer learning is strongly motivated both when developing classical machine learning methods and when applying them to real-world applications, and it can be regarded as a supplement to classical machine learning. One motivation is the problem of covariate shift or sample selection bias. Another motivation is the wish to train a universal or general model as a predictor for all tasks, with shared parameters or a shared learner, which is also considered a goal of Artificial General Intelligence. Transfer learning aims to use source or related domains to help target domain tasks. It has achieved significant success in various practical applications, such as face recognition [5], natural language processing [6], cross-language text classification [7], WiFi localization [8], and medical imaging [9].
Domain adaptation is a subproblem of transfer learning which assumes that source and target domain data share the same feature and label spaces but have different marginal probability distributions. It addresses settings with little or no labeled data in the target domain and usually uses labeled data in the source domain to assist training for target domain tasks. Many works focus on domain adaptation problems, with applications such as WiFi localization, text sentiment analysis, and multidomain image classification. Since distribution mismatch is common in real-world applications, other research areas are also concerned with domain adaptation. For example, the extreme learning machine (ELM) is an efficient model for training single-hidden-layer networks [10], and some ELM works address the domain adaptation setting [11,12]. Like most previous domain adaptation classifiers, they add a constraint term based on instance reweighting to minimize the Maximum Mean Discrepancy (MMD) [13]. However, these methods need to assume that the difference between the source and target domains is not too large; that is, they require the domains to be similar.
Most pattern recognition problems can be transformed into several basic classification tasks. Generally speaking, classification tasks assume that a category can be represented by a hyperplane [14,15], and most machine learning algorithms aim to learn hyperplanes to make predictions for unseen instances. To improve the representational ability of a hyperplane, some works cluster the samples first and then solve the classification tasks on the clusters. Compared with a category-level classifier, a cluster-level classifier can include more information about the positive category but carries a higher risk of overfitting. Motivated by object detection, [16] proposed an extreme classification model named exemplar support vector machines (E-SVMs), which trains one classifier for every positive instance against all the negative instances. In fact, exemplar-SVMs can be viewed as an extreme case of the cluster-level SVM in which every positive sample is regarded as a cluster. There are two viewpoints on why exemplar-SVMs achieve surprisingly good generalization performance. The first takes exemplar-SVMs as a representation that keeps the complete details of positive instances: every classifier captures details of its positive instance, such as background, corners, color, or orientation, and together the classifiers describe the category more intrinsically. The second is the transfer learning viewpoint: training data may not satisfy the underlying i.i.d. assumption, because instances in the training set may differ from each other, that is, there is sample selection bias [17]. Each exemplar-SVM classifier is trained on one highly weighted positive sample and the negative samples, so it can represent the positive sample well within its own distribution. Recently, [18] extended exemplar-SVMs to a transfer learning form that uses loss function reweighting and adds a low-rank regularization term on the classifiers.
In this work, we propose a novel model for unsupervised domain adaptation, in which the target domain data have no labels. Furthermore, the model permits distribution mismatch among instances. We train a kernel exemplar classifier for every positive instance and then integrate the classifiers to make predictions for target domain data. To correct the distribution mismatch, we embed a regularization term based on Transfer Component Analysis (TCA) in our classifiers. In our view, the model constructs a bridge for transferring knowledge: it uses the information in the kernel matrix, which contains the representation of the instances in the high-dimensional space, to assist classifier training across domains. To address sample selection bias, we integrate the classifiers to make a prediction. In essence, the integration step expands the representation of the hyperplanes and takes full advantage of the details learned beforehand.
Our contributions are as follows.
(1) We propose a novel unsupervised domain adaptation model based on exemplar-SVMs, named Domain Adaptation Exemplar Support Vector Machines (DAESVMs), which improves standard domain adaptation prediction accuracy by transferring knowledge across domains.
(2) Every DAESVM classifier constructs a bridge that transmits knowledge from the source domain to the target domain. Compared with the traditional two-step method, this strategy searches for the optimum of the model jointly, which makes the classification hyperplane more precise across domains.
(3) To solve the problem of sample selection bias, we use ensemble methods to integrate the classifiers. The ensemble process is similar to relaxing the classification hyperplane: it drops unreliable classification results and uses the reliable ones to make a prediction.
(4) Inspired by [19], we introduce pseudo labels into DAESVMs to supplement the information of the target domain, and the experiments verify the effectiveness of the pseudo labels.
(5) We go a step further and extend DAESVMs to multidomain adaptation.
The rest of this paper is organized as follows. The Notation and Related Works section introduces the notation of the problem and reviews related work on domain adaptation, exemplar-SVMs, and Transfer Component Analysis (TCA). The Domain Adaptation Exemplar Support Vector Machine section derives and formulates the DAESVM model. The Optimization Algorithm section proposes the optimization algorithm for our model. The Ensemble Domain Adaptation Exemplar Classifiers section integrates all the DAESVM classifiers to make a prediction. The Experiments section analyzes experiments on several transfer learning datasets to verify the effectiveness of DAESVMs. Finally, we conclude our work and discuss future directions.

Notation and Related Works
This section will introduce the notation and related works about this paper.

Notation.
In this paper, we use the notation of [4]. Approaches to domain adaptation are commonly divided into three groups: reweighting approaches, feature transfer approaches, and parameter-based approaches.
(1) Reweighting Approaches. In transfer learning tasks, the basic idea of using source data to help train a target predictor is to reduce the discrepancy between the source and target data as far as possible. Under the assumption that the source and target domains have many overlapping features, a conventional method is to reweight or select source domain instances to correct the mismatch in marginal probability distributions. Based on Maximum Mean Discrepancy (MMD), a distance metric between distributions, [20] proposed a technique called Kernel Mean Matching (KMM), which revises the weight of every instance to minimize the MMD between the source and target domains. Similarly to KMM, [21] used the same idea but a different metric to adjust the discrepancy between domains. Reference [22] used the AdaBoost strategy to update the weights of source domain data, increasing the weight of instances that favor the classification task, and also derived generalization error bounds for the model based on PAC learning theory. More recently, [23] used a two-step approach: it first samples instances that are similar to the other domain as landmarks and then uses these landmarks to map the data into a high-dimensional space in which the domains overlap more. Reference [24] addressed the same problem but relaxed the similarity assumption; it assumes that there are no relationships between the source and target domains. Its model, named Selective Transfer Machine (STM), reweights instances of personal faces to train a generic classifier. Most instance-based transfer learning techniques use KMM to measure the difference between distributions, and these methods are applied in many areas, such as facial action unit detection [25] and prostate cancer mapping [26].
(2) Feature Transfer Approaches. Compared with instance-based approaches, feature-based approaches relax the similarity assumption. They assume that the source and target domains share some features, named shared features, and that each domain also has its own features, named spec-features [27]. For example, consider using movie reviews to help a sentiment classification task on sofa reviews. The feature for the word "comfortable" is always nonzero in the sofa domain but always zero in the movie domain, so this word is a spec-feature of the sofa domain. Feature transfer approaches aim to find a shared latent subspace in which the distance between the source and target domains is minimized. Reference [28] proposed a kernel-based unsupervised domain adaptation approach named Geodesic Flow Kernel (GFK). GFK maps data onto Grassmann manifolds and constructs geodesic flows to reduce the mismatch among domains, which effectively exploits the intrinsic low-dimensional structure of the data in each domain. To solve cross-domain natural language processing (NLP) problems, [29] proposed a general method, structural correspondence learning (SCL), which learns a discriminative predictor by identifying correspondences among features from different domains. Essentially, SCL finds the pivot features and then links the shared features with each other. Reference [7] learned a predictor by mapping the target kernel matrix to a submatrix of the source kernel matrix. Deep neural networks are used not only for learning essential features but also for domain adaptation. Reference [30] proposed a neural network architecture for domain adaptation named Deep Adaptation Network (DAN) and extended it to joint adaptation networks (JAN) [31]. Reference [32] discussed the transferability of features in deep neural networks.
(3) Parameter-Based Approaches. The core idea of parameter-based approaches is to transfer parameters from source to target domain tasks. They assume that different domains share some parameters and that these parameters can be reused across domains. Reference [33] proposed the Adaptive Support Vector Machine (A-SVM) as a general method for adapting to new domains. A-SVM first trains an auxiliary classifier and then learns the target predictor based on the original parameters. Reference [34] reweighted the predictions of the source classifier on the target domain according to the signed distance between domains.
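The MMD distance that reweighting methods such as KMM try to minimize can be estimated directly from samples. The following is a minimal sketch of the biased empirical estimate of squared MMD; the RBF kernel, the `gamma` value, and the function names are illustrative assumptions and are not taken from [13] or [20].

```python
import math

def rbf(x, y, gamma=1.0):
    """RBF kernel between two feature vectors given as lists of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def mmd2(source, target, gamma=1.0):
    """Biased empirical estimate of squared MMD between two samples."""
    ns, nt = len(source), len(target)
    # Mean kernel value within the source, within the target, and across them.
    k_ss = sum(rbf(a, b, gamma) for a in source for b in source) / (ns * ns)
    k_tt = sum(rbf(a, b, gamma) for a in target for b in target) / (nt * nt)
    k_st = sum(rbf(a, b, gamma) for a in source for b in target) / (ns * nt)
    return k_ss + k_tt - 2.0 * k_st

# Toy data (hypothetical): the shifted sample is far from the source sample.
source = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
shifted = [[5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]
print(mmd2(source, source), mmd2(source, shifted))
```

Identical samples give a value of zero, and the value grows as the two samples move apart, which is the quantity instance reweighting tries to shrink.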

Exemplar Support Vector Machines.
Exemplar-SVMs [16] were proposed for object detection and achieve high performance. They train one classifier for every positive instance against all negative instances. Every positive instance is an exemplar, and the classifier corresponding to it can be viewed as a representation of that positive instance. At prediction time, every classifier predicts a value for the test instance, a calibration function is applied to that value, and the class of the highest-scoring classifiers is taken as the predicted class. Exemplar-SVMs address the problem that a single hyperplane can hardly represent all instances of a category, using an extreme strategy to train the predictor. In [35], the training procedure is gathered into one model and nuclear norm regularization is introduced for the domain generalization setting, which assumes that the target domain is unseen. The authors also extend the model to domain generalization and multiview problems [36,37]. In [38], the two hyperparameters are reduced to one and exemplar-SVMs are extended to a kernel form.

Transfer Component Analysis.
Reference [39] proposed a dimension reduction method called maximum mean discrepancy embedding (MMDE). By minimizing the distance between the source and target domain data distributions in a shared latent space, the source domain data are used to assist classifier training on the target domain. MMDE not only minimizes the distance between the domains in the latent space but also preserves the properties of the data by maximizing the data variance. Building on MMDE, TCA [40] adds the ability to deal with unseen instances and reduces the computational complexity of MMDE. In essence, TCA simplifies the learning of the kernel matrix by transforming an initial kernel matrix instead. The solution of this optimization problem is given by the $m$ leading eigenvectors of the objective matrix.

Domain Adaptation Exemplar Support Vector Machine
In this section, we present the formulation of Domain Adaptation Exemplar Support Vector Machine (DAESVM).
In the remainder of this paper, we use a boldface lowercase letter to represent a column vector and a boldface uppercase letter to represent a matrix. The notation introduced in the Notation section is extended. We use $\mathbf{x}_i^+$, $i \in \{1, \ldots, n_S^+\}$, where $n_S^+$ is the number of positive instances, to represent a positive instance, and $\mathbf{x}_j^-$, $j \in \{1, \ldots, n_S^-\}$, where $n_S^-$ is the number of negative instances, to represent a negative instance. The set of negative samples is written as $S^-$. This section introduces the formulation of a single exemplar classifier. In practice, we train as many exemplar classifiers as there are source domain instances, and the method that integrates these classifiers is presented in the Ensemble Domain Adaptation Exemplar Classifiers section.

Exemplar-SVM.
The exemplar-SVM is built on the extreme idea of training a classifier that separates one positive instance from all the negative instances and then calibrating the classifier outputs into a probability distribution to separate the samples. The model trains one classifier per positive instance. Learning a classifier that separates a positive instance from all the negative instances can be modeled as problem (1), where $\|\cdot\|$ is the 2-norm of a vector, $C_1$ and $C_2$ are tradeoff parameters, corresponding to $C$ in the SVM, that balance the positive and negative error costs, and $h(x) = \max(0, 1 - x)$ is the hinge loss function.
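For reference, the exemplar-SVM objective of [16] is usually written in the form below, which matches the quantities described above (a 2-norm term, two tradeoff parameters, and the hinge loss). It is a reconstruction from those definitions rather than a verbatim copy of problem (1); the symbols $\mathbf{w}$, $b$, $\mathbf{x}^{+}$ for the exemplar, and $S^{-}$ for the set of negative samples are assumed names.

```latex
\[
\min_{\mathbf{w},\, b} \;
\|\mathbf{w}\|^{2}
+ C_{1}\, h\!\left(\mathbf{w}^{T}\mathbf{x}^{+} + b\right)
+ C_{2} \sum_{\mathbf{x} \in S^{-}} h\!\left(-\mathbf{w}^{T}\mathbf{x} - b\right),
\qquad
h(x) = \max(0,\, 1 - x).
\]
```

Choosing $C_{1}$ much larger than $C_{2}$ forces the single positive instance to be classified correctly even though it is outnumbered by the negatives.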
Formulation (1) is the primal problem of the exemplar-SVM, and we derive its dual problem in order to use the kernel method. The dual formulation can be written as problem (2) [38], where $\boldsymbol{\alpha}$ are the Lagrangian multipliers and $\mathbf{e}$ is the all-ones vector. We take this model as an exemplar learner. The associated matrix is given in (3).

Pseudo Label for Kernel Matrix.
To make the best use of the samples in the source and target domains, we construct the kernel matrix on the data from both domains. However, in the dual problem of the SVM, the kernel matrix $\mathbf{K}$ requires labeled data. Our model addresses the unsupervised domain adaptation problem, in which only the source domain data are labeled. Motivated by [19], we use pseudo labels to help model training. Pseudo labels are predicted by a classical classifier, an SVM in our model, trained on the labeled source data. Because of the distribution mismatch between the source and target domains, many of these labels may be incorrect. Following [19], we assume that the pseudo class centroids calculated from them do not lie far from the true class centroids. Thus, we use the data of both domains to supplement the kernel matrix $\mathbf{K}$ with label information. Our experiments confirm that this method is effective.

Exemplar Learner in Domain Adaptation Form.
In fact, each exemplar learner is an SVM in kernel form which is trained on one positive instance and all the negative instances.
In the opinion of [16], a discriminative exemplar classifier can be taken as a representation of a positive instance. In object detection and image classification, this parametric representation is attractive because some characteristics of samples, such as angle, color, orientation, and background, are hard to represent explicitly. An instance-based parametric discriminative classifier can include more information about positive samples. Similarly, from the viewpoint of transfer learning, we can view each positive instance as a domain, with some mismatch among these domains. Our model aims to correct this mismatch and reduce the distance to the target domain. We construct a distance metric between the domain of an exemplar learner and the target domain based on MMD. However, this is only a distance metric; it does not by itself meet our requirement of minimizing the distance through some transformation. Motivated by Transfer Component Analysis (TCA), we want to map the instances into a latent space in which the source and target domain instances are more similar, and we denote this mapping by $\phi(\cdot)$. Namely, we aim to minimize the MMD distance between domains by mapping instances into another space, and we extend the distance function accordingly. Following the general approach, (4) is reformulated in kernel matrix form. We define the Gram matrices on the source positive domain, the source negative domain, and the target domain. The kernel matrix $\mathbf{K}$ is composed of nine submatrices, and the coefficient matrix $\mathbf{L}$ is constructed correspondingly. Thus, the primal distance function is represented by $\operatorname{tr}(\mathbf{K}\mathbf{L})$. Motivated by TCA [40], the mapping of the primal data is equivalent to a transformation of the kernel matrix generated by the source and target domain data. A low-dimensional transformation matrix $\mathbf{M} \in \mathbb{R}^{(1+n_S^-+n_T)\times m}$ reduces the dimension of the primal kernel matrix: it maps the empirical kernel map $\mathbf{K} = (\mathbf{K}\mathbf{K}^{-1/2})(\mathbf{K}^{-1/2}\mathbf{K})$ into an $m$-dimensional shared space. Accordingly, we replace the distance function $\operatorname{tr}(\mathbf{K}\mathbf{L})$ by $\operatorname{tr}(\mathbf{K}\mathbf{M}\mathbf{M}^{T}\mathbf{K}\mathbf{L})$. In our case, we follow [40] and minimize the trace of the distance. To control the complexity of $\mathbf{M}$ and preserve the data characteristics, we add a regularization term and a constraint.
The domain adaptation term is formulated following TCA, with a tradeoff parameter greater than zero and the identity matrix $\mathbf{I}_m \in \mathbb{R}^{m \times m}$.
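As a reminder of the TCA formulation in [40] that this term follows, the MMD coefficient matrix and the regularized objective can be sketched as below, written with the transformation matrix $\mathbf{M}$ of this section. Several symbols are assumptions: $n_S = 1 + n_S^-$ counts the exemplar plus the negative source instances, $\mu$ stands for the tradeoff parameter, and $\mathbf{H}$ is the usual centering matrix. The exact form used for DAESVM may differ.

```latex
\[
L_{ij} =
\begin{cases}
\dfrac{1}{n_S^{2}} & \mathbf{x}_i, \mathbf{x}_j \text{ both from the source domain},\\[4pt]
\dfrac{1}{n_T^{2}} & \mathbf{x}_i, \mathbf{x}_j \text{ both from the target domain},\\[4pt]
-\dfrac{1}{n_S n_T} & \text{otherwise},
\end{cases}
\qquad
\mathbf{H} = \mathbf{I} - \frac{1}{n_S + n_T}\,\mathbf{1}\mathbf{1}^{T}
\]
\[
\min_{\mathbf{M}} \;
\operatorname{tr}\!\left(\mathbf{M}^{T}\mathbf{K}\mathbf{L}\mathbf{K}\mathbf{M}\right)
+ \mu \operatorname{tr}\!\left(\mathbf{M}^{T}\mathbf{M}\right)
\quad \text{s.t.} \quad
\mathbf{M}^{T}\mathbf{K}\mathbf{H}\mathbf{K}\mathbf{M} = \mathbf{I}_{m}
\]
```

By the cyclic property of the trace, $\operatorname{tr}(\mathbf{M}^{T}\mathbf{K}\mathbf{L}\mathbf{K}\mathbf{M}) = \operatorname{tr}(\mathbf{K}\mathbf{M}\mathbf{M}^{T}\mathbf{K}\mathbf{L})$, so the first term is the transformed MMD distance, the second controls the complexity of $\mathbf{M}$, and the constraint preserves the data variance.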
Furthermore, as in the dual SVM, the objective function must incorporate the training label information. Thus, we construct the training label matrix $\mathbf{U}$ from $y_S^+$, the label of the positive instance, $\mathbf{y}_S^-$, the label vector of the negative source instances, and $\mathbf{y}_T$, the pseudo labels of the target instances predicted by the SVM beforehand. The label matrix $\mathbf{U}$ provides the information of the source domain labels and the target domain pseudo labels. The matrix $\mathbf{K}$ in the dual problem of the exemplar-SVM (2) is the kernel matrix of the primal data. We want to replace it by mapping the kernel matrix into a latent subspace. Namely, we replace $\mathbf{K}$ by $\tilde{\mathbf{K}}$, and the final objective function of each DAESVM model is formulated as problem (12).
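One way to read this construction is sketched below. The transformed kernel is the one that appears in the subproblem of the Optimization Algorithm section; the diagonal form of $\mathbf{U}$ is an assumption borrowed from the standard dual SVM, where labels enter as $\operatorname{diag}(\mathbf{y})\,\mathbf{K}\,\operatorname{diag}(\mathbf{y})$.

```latex
\[
\mathbf{U} = \operatorname{diag}\!\left(y_S^{+},\; \mathbf{y}_S^{-},\; \mathbf{y}_T\right),
\qquad
\tilde{\mathbf{K}} = \mathbf{U}\,\mathbf{K}\mathbf{M}\mathbf{M}^{T}\mathbf{K}\,\mathbf{U}
\]
```

Under this reading, $\tilde{\mathbf{K}}$ plays the role that the label-weighted kernel matrix plays in an ordinary kernel SVM, except that the kernel has first been projected into the $m$-dimensional shared space.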

Optimization Algorithm
To minimize problem (12), we adopt an alternating optimization method that alternates between solving two subproblems, one over the parameter $\boldsymbol{\alpha}$ and one over the mapping matrix $\mathbf{M}$. Under this scheme, the alternating optimization approach is guaranteed to decrease the objective function. Algorithm 1 summarizes the optimization procedure for problem (12). The subproblem over $\boldsymbol{\alpha}$ is $\min_{\boldsymbol{\alpha}} \boldsymbol{\alpha}^{T}\tilde{\mathbf{K}}\boldsymbol{\alpha} - \mathbf{e}^{T}\boldsymbol{\alpha}$, where $\tilde{\mathbf{K}} = \mathbf{U}\mathbf{K}\mathbf{M}\mathbf{M}^{T}\mathbf{K}\mathbf{U}$ is the kernel matrix transformed by the transformation matrix $\mathbf{M}$. This is a quadratic programming (QP) problem, and it can be solved efficiently using interior point methods or other iterative optimization procedures such as the Alternating Direction Method of Multipliers (ADMM).
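To make the QP subproblem concrete, the sketch below minimizes a quadratic of the same shape with projected gradient descent. This is a deliberately simple stand-in for the interior point or ADMM solvers mentioned above, not the solver used in the paper. It assumes only box constraints on the multipliers (upper bounds playing the role of $C_1$ and $C_2$); any further constraints of the actual dual are not modeled, and all names are hypothetical.

```python
def solve_box_qp(K, upper, step=0.05, iters=2000):
    """Minimize a^T K a - e^T a subject to 0 <= a[i] <= upper[i].

    K     : symmetric positive semidefinite matrix as a list of lists
    upper : per-coordinate upper bounds on the multipliers
    step  : gradient step size; should be below 1 / (largest eigenvalue of K)
    """
    n = len(K)
    a = [0.0] * n
    for _ in range(iters):
        # Gradient of a^T K a - e^T a is 2 K a - e when K is symmetric.
        grad = [2.0 * sum(K[i][j] * a[j] for j in range(n)) - 1.0
                for i in range(n)]
        # Gradient step followed by projection onto the box [0, upper].
        a = [min(max(a[i] - step * grad[i], 0.0), upper[i]) for i in range(n)]
    return a

def objective(K, a):
    """Value of a^T K a - e^T a."""
    n = len(a)
    quad = sum(a[i] * K[i][j] * a[j] for i in range(n) for j in range(n))
    return quad - sum(a)

# Toy problem (hypothetical): identity kernel, loose bound on the first
# multiplier and a tight bound on the second.
K_toy = [[1.0, 0.0], [0.0, 1.0]]
alpha = solve_box_qp(K_toy, upper=[10.0, 0.1])
print(alpha, objective(K_toy, alpha))
```

In the toy problem the first multiplier settles at its unconstrained optimum and the second is clipped at its upper bound, which is how different bounds for the exemplar and the negatives shape the solution.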

Ensemble Domain Adaptation Exemplar Classifiers
In this section, we introduce the method for integrating the exemplar classifiers. As mentioned before, we obtain as many classifiers as there are source domain instances, and this section aims to predict labels for target domain instances. In our view, the classification hyperplane of an exemplar classifier is a representation of a source domain positive instance. However, the hyperplanes contain information that comes from varied samples, such as images with different backgrounds or sources. We therefore aim to find the exemplar classifiers that come from instances similar to the testing sample, and we use an integration method to filter out classifiers that encode details different from the testing sample. Another view of the integration method is that it relaxes part of the hyperplanes: it removes the exemplar classifiers that were trained on instances with a large distribution mismatch.
In our method, we first construct the classifiers from the Lagrange multipliers $\boldsymbol{\alpha}$, which yields the weight $\mathbf{w}$ and the bias $b$ of each classifier. We then compute the score of every classifier on the testing instance. Second, we find the top $P$ scores among the classifiers of each class and compute the sum of those scores. Finally, we obtain one score per class, and the class with the highest score is the predicted category. The prediction method is described in Algorithm 2.
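The top-$P$ integration step can be sketched in a few lines. The sketch assumes the exemplar scores for one testing instance have already been computed and grouped by class; the function name, the dictionary layout, and the toy scores are hypothetical and are not taken from Algorithm 2.

```python
def predict_top_p(scores_by_class, p):
    """Sum the p highest exemplar scores per class and pick the best class.

    scores_by_class maps a class label to the list of scores that the
    exemplar classifiers of that class give to one testing instance.
    """
    totals = {}
    for label, scores in scores_by_class.items():
        # Keep only the p most confident exemplar classifiers of this class.
        top = sorted(scores, reverse=True)[:p]
        totals[label] = sum(top)
    best = max(totals, key=totals.get)
    return best, totals

# Toy scores (hypothetical) for one testing instance.
toy_scores = {
    "bike": [0.9, 0.1, 0.8, -0.5],
    "bottle": [0.6, 0.6, 0.6],
}
label, totals = predict_top_p(toy_scores, p=2)
print(label, totals)
```

Discarding all but the top $P$ scores is what filters out exemplar classifiers trained on instances unlike the testing sample.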

Experiments
In this section, we conduct experiments on four domains, Amazon, DSLR, Caltech, and Webcam, to evaluate the performance of the proposed Domain Adaptation Exemplar Support Vector Machines. We first compare our method with baselines and other domain adaptation methods. Next, we analyze the effectiveness of our approach. Finally, we discuss parameter sensitivity.

Data Preparation.
We run the experiments on the Office and Office-Caltech datasets. The Office dataset contains three domains: Amazon, Webcam, and DSLR. They include images from amazon.com and office environment images taken under varying lighting and pose with a webcam or a DSLR camera. The Office-Caltech dataset contains the ten categories that overlap between the Office dataset and the Caltech-256 dataset. Following the standard transfer learning experimental protocol, we merge the two datasets, which gives four domains in total, Amazon, DSLR, Caltech, and Webcam, as studied in [41]. The Amazon domain consists of images downloaded from Amazon merchants. The images in Webcam also come from online web pages, but they are of low quality because they were taken by a web camera. The DSLR domain is photographed with a digital SLR camera, so its images are of high quality. Caltech is commonly added to domain adaptation experiments; it was collected for object detection tasks. Each domain has its own characteristics. Image quality in DSLR is higher than in the other domains, and influencing factors such as object detection and background are fewer than in images downloaded from the web. Amazon and Webcam come from the web, and their images are of lower quality and more complex. However, they differ in some details: instances in Webcam show the object alone, whereas samples in Amazon have a more complex composition that includes background and other goods. Figure 1 shows examples of the backpack category from the four domains. From the transfer learning viewpoint, the datasets come from different domains, and the images have different marginal probability distributions. Our model aims to solve this problem and obtain a classifier that is robust across domains.
We chose ten common categories among all four datasets: backpack, bike, bike helmet, bookcase, bottle, calculator, desk chair, desk lamp, desktop computer, and file cabinet.There are 8 to 151 samples per category in a domain: 958 images in Amazon, 295 images in Webcam, 157 images in DSLR, 1123 images in Caltech, and 2533 images total in the dataset.Figure 1 shows examples for datasets.
We use both SURF and DeCAF features in the experiments. First, we use SURF features to encode the images into 800-bin histograms. Next, we use DeCAF features, extracted from the 7th layer of AlexNet [42], which encode the images into 4096-dimensional vectors. Finally, we normalize the features and z-score them to have zero mean and unit standard deviation in each dimension.
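The z-scoring step can be sketched with the Python standard library. The sketch assumes features are stored as a list of rows and uses the population standard deviation; how constant dimensions are handled (mapped to zero here) and the function name are illustrative choices, not details taken from the paper.

```python
import statistics

def zscore_columns(rows):
    """Standardize each feature dimension to zero mean and unit std."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    # Constant dimensions have zero spread; map them to 0.0 instead of dividing.
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Toy feature matrix (hypothetical): three images, three dimensions.
features = [[1.0, 10.0, 5.0], [2.0, 20.0, 5.0], [3.0, 30.0, 5.0]]
standardized = zscore_columns(features)
print(standardized)
```

After this step every non-constant dimension contributes on the same scale to the kernel, which matters for an RBF kernel with a single bandwidth.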
We run our experiments in the standard way for visual domain adaptation: one of the four datasets is used as the source domain and another as the target domain. Each dataset provides the same ten categories and uses the same image representation, so the problem is one of homogeneous domain adaptation. For example, we choose the images in DSLR (denoted by D) as source domain data and the images in Amazon (denoted by A) as target domain data. This problem is denoted as D → A. In this way, we can compose 12 domain adaptation subproblems from the four domains.
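The 12 subproblems are simply the ordered pairs of distinct domains, which can be enumerated as follows. The abbreviations D and A follow the text; C for Caltech and W for Webcam are assumed here for illustration.

```python
from itertools import permutations

# D and A are the abbreviations used in the text; C and W are assumed.
DOMAINS = ["A", "C", "D", "W"]

# Every ordered (source, target) pair of distinct domains is one task.
tasks = [f"{src} -> {tgt}" for src, tgt in permutations(DOMAINS, 2)]

for task in tasks:
    print(task)
```

Order matters because adapting from a small domain to a large one is a different problem from the reverse direction.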

Experiment Setup
(1) Baseline Method. We compare our DAESVM method with three kinds of approaches: classifiers without transfer learning regularization, conventional transfer learning methods, and the foundation model, the low-rank exemplar support vector machine. The methods are listed as follows:
(1) Transfer Component Analysis (TCA) [40]
(2) Support Vector Machine (SVM) [43]
(3) Geodesic Flow Kernel (GFK) [28]
(4) Landmarks Selection-based Subspace Alignment (LSSA) [23]
(5) Kernel Mean Matching (KMM) [20]
(6) Subspace Alignment (SA) [44]
(7) Transfer Joint Matching (TJM) [45]
(8) Low-Rank Exemplar-SVMs (LRESVMs) [18]
TCA, GFK, and KMM are classical transfer learning methods, and we compare our model with them. We also show that our method is more robust in the transfer learning scenario than models without domain adaptation terms. TCA is the foundation of our model; like GFK and SFA, it is based on the idea of feature transfer. KMM transfers knowledge by instance reweighting. TJM is a popular model for unsupervised domain adaptation. SA and LSSA are the models using landmarks to transfer knowledge.
(2) Implementation Details. For the baseline, the SVM is trained on the source data and tested on the target data [46]. TCA, SA, LSSA, TJM, and GFK are first applied as a dimension reduction step; a classifier is then trained on the source data and used to make predictions for the target domain [19]. Similarly, KMM first computes the weight of each instance and then trains a predictor on the reweighted source data. Under the assumption of unsupervised domain adaptation, it is impossible to tune the optimal parameters for the target domain task by cross validation, since there is a distribution mismatch between the domains. Therefore, in the experiments we adopt grid search to obtain the best parameters and report the best results. Our method involves five tunable parameters: the ESVM tradeoffs $C_1$ and $C_2$, the two tradeoffs of the regularization terms, and the dimension reduction parameter $m$. The ESVM tradeoffs $C_1$ and $C_2$ are selected over $\{10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}, 10^{3}\}$. We empirically fix the two regularization tradeoffs to 1 and $m = 40$, and we select the radial basis function (RBF) as the kernel function. In fact, our model is relatively stable under a wide range of parameter values. We train a classifier for every positive instance in the source domain data and then calibrate their outputs into a probability distribution. We handle the multiclass problem in a one-versus-rest way. To measure the performance of our method, we use the average accuracy and the standard deviation over ten repetitions. The average testing accuracies and standard errors of our method for all 12 tasks are reported in Table 1. Most of the remaining baseline results are cited from previously published papers.
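The grid search over the two ESVM tradeoffs can be sketched as below. The grid matches the set of values stated above; the `evaluate` callback, which would train a DAESVM with the given tradeoffs and return an accuracy, is hypothetical, and a toy scoring function stands in for it here.

```python
import math
from itertools import product

# Candidate values 10^-3 ... 10^3 for both tradeoffs.
C_GRID = [10.0 ** k for k in range(-3, 4)]

def grid_search(evaluate, c_grid=C_GRID):
    """Return the (c1, c2) pair with the highest score, and that score."""
    best_params, best_score = None, float("-inf")
    for c1, c2 in product(c_grid, repeat=2):
        score = evaluate(c1, c2)
        if score > best_score:
            best_params, best_score = (c1, c2), score
    return best_params, best_score

def toy_evaluate(c1, c2):
    """Hypothetical stand-in for training and scoring one model.

    It peaks when c1 is one hundred times c2 with c2 = 1.
    """
    return -abs(math.log10(c1) - 2.0) - abs(math.log10(c2))

best_params, best_score = grid_search(toy_evaluate)
print(best_params, best_score)
```

With seven candidates per tradeoff the search covers 49 combinations, each of which requires training the full set of exemplar classifiers, so fixing the remaining three parameters keeps the cost manageable.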

Experiments Results.
In this section, we compare our DAESVM with baseline methods regarding classification accuracy.
Table 1 summarizes the classification accuracy over all 10 categories for the 12 tasks generated from the 4 domains. The highest accuracy for each task is shown in bold. First, we implement the traditional classifiers without domain adaptation terms: the predictors are trained on the source domain data and make predictions for the target domain dataset. Second, we compare DAESVM with unsupervised domain adaptation methods, such as TCA and GFK, which are run with the same reduced dimension as the parameter $m$ in our model. Finally, we also compare DAESVM with recent transfer learning models, such as low-rank ESVMs [18]. Overall, following the usual transfer learning protocol, we run the datasets across different pairs of source and target domains. The accuracy of DAESVM for the adaptation from DSLR to Webcam reaches 92.1%, an improvement of 1.2% over LRESVM. Compared with TCA, DAESVM takes into account the distribution mismatch among instances as well as among domains. For the adaptation from Webcam to DSLR, the accuracy is 91.8%. For the Amazon and Caltech domains, which are larger than DSLR and Webcam, DAESVM obtains an accuracy of 77.5%, an improvement of about 36.2% over TJM. For transferring knowledge from a large dataset to a small one, from Amazon to DSLR, we obtain an accuracy of 76.8%. Conversely, from DSLR to Amazon, the prediction accuracy is 83.4%. Overall, DAESVM trained on one domain performs well, and it is also robust in the multidomain setting. We also carry out multidomain adaptation tasks, which use one or more domains as source domain data and adapt to other domains. The results are shown in Table 2. The accuracy of DAESVM for the adaptation from Amazon, DSLR, and Webcam to Caltech reaches 90.1%, an improvement over LRESVM. For the adaptation from Amazon and Caltech to Webcam and DSLR, the accuracy is 92.4%. The experiments show that our model is effective not only for single-domain adaptation but also for multidomain adaptation.
Two key factors may contribute to the superiority of our method. The first is the feature transfer regularization term, which relaxes the similarity assumption: it assumes only that different domains share some features, rather than that the domains are similar to each other. This makes the model more robust than models with a reweighting term. The second is the exemplar-SVM, which is motivated by transfer learning and takes into account that the instances themselves have mismatched distributions. Our model combines these two factors to counter both distribution mismatch among domains and sample selection bias among instances.

Pseudo Label Effectiveness.
Following [19], we use pseudo labels to supplement model training. In our experiments, we test how the prediction results are influenced by the accuracy of the pseudo labels. As shown in Figure 2, the prediction accuracy improves as the accuracy of the pseudo labels increases. This indicates that the pseudo label method is effective and that we can iterate by using the labels predicted by DAESVM as new pseudo labels. This iteration step can enhance the performance of the classifiers.

Parameter Sensitivity.
There are five parameters in our model. We conduct a parameter sensitivity analysis, which shows that the model can achieve optimal performance under a wide range of parameter values, and discuss the results.
(1) Tradeoff for the MMD term. This tradeoff parameter controls the weight of the MMD term, which aims to minimize the distribution mismatch between the source and target domains. Theoretically, we want this term to be equal to zero. However, if the parameter is set to infinity (→ ∞), the properties of the data may be lost when the source and target domain data are transformed into the high-dimensional space. Conversely, if it is set to zero, the model loses the ability to correct the distribution mismatch.
(2) Tradeoff for the data variance term. This tradeoff parameter controls the weight of the data variance term, which aims to preserve the properties of the data. Theoretically, we want this term to be equal to zero. However, if the parameter is set to infinity (→ ∞), it may increase the data distribution mismatch among the different domains; that is, the transformation matrix M cannot use the source data to assist the target task. Conversely, if it is set to zero, the model cannot preserve the properties of the original data.
(3) Dimension reduction. This parameter is the dimension of the transformation matrix, namely, the dimension of the subspace into which the samples are mapped. If the dimension is too small, the properties of the data may be lost, which may cause the classifier to fail. If it is too large, the ability to correct the distribution mismatch may be lost. We evaluate how the classification results are influenced by this dimension, and the results are displayed in Figure 3.
(4) Tradeoff parameters in ESVM. The two ESVM tradeoff parameters, with subscripts 1 and 2, are the upper bounds of the Lagrangian variables. In the standard SVM, positive and negative instances share the same value of this bound. In our model, we expect the weights of the positive samples to be higher than those of the negative samples. In our experiments, the value of the first parameter is one hundred times that of the second, which yields a high-performance predictor. The visual analysis of these two parameters is shown in Figure 4.
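To make the quantity weighted by the first tradeoff more concrete, the following sketch computes an empirical squared MMD between two sample sets. It assumes a linear kernel, for which the squared MMD reduces to the squared distance between the two sample means; the paper's own kernel and transformation are not reproduced here, and the names `column_means` and `linear_mmd_squared` are hypothetical.

```python
from typing import List, Sequence


def column_means(rows: Sequence[Sequence[float]]) -> List[float]:
    """Mean of each feature over all samples."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def linear_mmd_squared(source: Sequence[Sequence[float]],
                       target: Sequence[Sequence[float]]) -> float:
    """Squared MMD under a linear kernel (assumption): ||mean(source) - mean(target)||^2."""
    mu_s = column_means(source)
    mu_t = column_means(target)
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))


# Toy samples (hypothetical): the target is the source shifted by (1, 1).
source = [[0.0, 0.0], [2.0, 2.0]]
target = [[1.0, 1.0], [3.0, 3.0]]
print(linear_mmd_squared(source, source))  # identical samples give zero mismatch
print(linear_mmd_squared(source, target))
```

A value of zero means the two sample means coincide, which is the ideal case mentioned in item (1). A larger tradeoff weight pushes the learned transformation harder toward that case, at the cost of the data properties discussed in item (2).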

Conclusion
In this paper, we have proposed an effective method for domain adaptation problems with a regularization term that reduces the data distribution mismatch between domains and preserves the properties of the original data. Furthermore, integrating the classifiers allows the target domain data to be predicted with high accuracy. The proposed method mainly aims to solve problems in which distribution mismatch occurs among domains or instances. We also extend DAESVMs to multiple source or target domains. The experiments were conducted on transfer learning datasets that transfer knowledge from image to image.
Our future work is as follows. First, we will integrate the training of all the classifiers in an ensemble way. Training can be accelerated by rewriting all the weights in matrix form, a strategy that avoids the matrix inversion step in the optimization. Second, we want to impose a constraint on  that preserves sparsity. Finally, we will extend DAESVMs to transferring knowledge among domains that have few relationships, such as from image to video or text.

Notations and Descriptions
$\mathcal{D}_S, \mathcal{D}_T$: Source/target domain
$\mathcal{T}_S, \mathcal{T}_T$: Source/target task
: Dimension of feature
$\mathbf{X}_S, \mathbf{X}_T$: Source/target sample matrix
$\mathbf{y}_S, \mathbf{y}_T$: Source/target sample label matrix
$\mathbf{K}$: Kernel matrix without label information
: Lagrange multipliers vector
(subscripts $S$, $T$): The number of source/target domain instances
$\mathbf{e}$: Identity vector
$\mathbf{I}$: Identity matrix.

Algorithm 2: Ensemble Domain Adaptation Exemplar Classifiers.
Input: $\mathbf{y}_S$, $\mathbf{X}_{te}$; parameter $P$
Output: prediction labels $\mathbf{y}$
(1) Compute the weights $\mathbf{w}$ of the classifiers.
(2) Construct weight matrix $\mathbf{W}$ and bias $\mathbf{b}$ of predictors based on .
(3) repeat
(4)   Compute scores of each classifier in this category.
(5)   Find top $P$ scores.
(6)   Compute the sum of these top scores.
(7) until the number of categories
(8) Choose the max score owned category as the prediction label $\mathbf{y}$.
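Steps (3) to (8) of Algorithm 2 can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: each exemplar classifier is assumed to be linear with score w · x + b, the classifiers are assumed to be already grouped by category, and the names `classifier_score`, `ensemble_predict`, and the toy weights are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Classifier = Tuple[Sequence[float], float]  # (weight vector w, bias b)


def classifier_score(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Linear score w . x + b of one exemplar classifier (assumed form)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def ensemble_predict(classifiers: Dict[str, List[Classifier]],
                     x: Sequence[float],
                     top_p: int) -> str:
    """Return the category whose top-P classifier scores have the largest sum."""
    best_label, best_total = None, float("-inf")
    for label, members in classifiers.items():
        # Step (4): score every classifier belonging to this category.
        scores = sorted((classifier_score(w, b, x) for w, b in members), reverse=True)
        # Steps (5) and (6): keep the top P scores and add them up.
        total = sum(scores[:top_p])
        if total > best_total:
            best_label, best_total = label, total
    # Step (8): the category with the maximum summed score is the prediction.
    return best_label


# Toy exemplar classifiers (hypothetical weights), three per category.
classifiers = {
    "backpack": [([1.0, 0.0], 0.0), ([0.8, 0.1], -0.1), ([-1.0, 0.0], 0.0)],
    "mug": [([0.0, 1.0], 0.0), ([0.1, 0.9], 0.0), ([0.0, -1.0], 0.0)],
}
print(ensemble_predict(classifiers, [2.0, 0.5], top_p=2))
```

Summing only the top P scores means a category is not penalized by exemplars that are unrelated to the test sample, which is why the poorly matching third classifier in each toy category has no effect when P is 2.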

Figure 1: Example images from the backpack category in Amazon and DSLR ((a), from left to right) and in Webcam and Caltech-256 ((b), from left to right). The images differ across domains in style, background, and source.

Figure 2: The accuracy of DAESVMs improves as the accuracy of the pseudo labels improves. The results verify the effectiveness of the pseudo label method.

Figure 3: When the dimension is 20 or 40, the prediction accuracy is higher than for the other dimensions.
definition in transfer learning, and the definition considers only the case of one source domain and one target domain. First, we need to define the domain and the task. A domain $\mathcal{D}$ is composed of a feature space $\mathcal{X}$ and a marginal probability distribution $P(x)$, namely, $\mathcal{D} = \{\mathcal{X}, P(x)\}$, $x \in \mathcal{X}$. A task $\mathcal{T}$ is composed of a label space $\mathcal{Y}$ and a prediction model $f(x)$, namely, $\mathcal{T} = \{\mathcal{Y}, f(x)\}$, $y \in \mathcal{Y}$. From the viewpoint of probability, $f(x) = P(y \mid x)$. Notations that are frequently used in this paper are summarized in the Notations and Descriptions section. The definition of transfer learning is as follows: Given source domain data $\mathcal{D}_S = \{(x_{S_1}, y_{S_1}), \ldots$

Table 1: Classification accuracies of different methods on the different domain adaptation tasks. We conduct the experiments with conventional transfer learning methods. Compared with traditional methods, DAESVMs gain a large improvement in prediction accuracy. They also improve on the recently proposed LRESVM approach [average ± standard error of accuracy (%)].

Table 2: Results on the multidomain tasks, where we gain an improvement over previously proposed methods. The experiments adopt the same strategy as single domain adaptation: we treat the multiple domains as one source or target domain and find the shared features in a latent space. However, the complexity of the multidomain shared features limits the accuracy of these tasks [average ± standard error of accuracy (%)].