Domain Adaption Based on ELM Autoencoder

We propose a new domain adaption algorithm based on the ELM Autoencoder (ELM-AE). It describes the subspaces of the source and target domains with ELM-AE and then carries out subspace alignment to project the different domains into a common new space. By leveraging the nonlinear approximation ability and the efficient one-pass learning of ELM-AE, the proposed algorithm can efficiently seek a better cross-domain feature representation than linear feature representation approaches such as PCA, and thereby improve domain adaption performance. Extensive experimental results on the Office/Caltech-256 datasets show that the proposed algorithm achieves better classification accuracy than the PCA subspace alignment algorithm and other state-of-the-art domain adaption algorithms in most cases.


Introduction
With the rapid development of the Internet and social networks, a huge amount of data (e.g., web data and social data [1]) is being generated at every moment [2,3]. With this explosive growth, data processing becomes more and more essential. Within data processing, feature extraction is one of the most important technologies: it represents the sample data by extracting their most useful characteristics. The performance of a machine learning algorithm depends on whether the extracted features represent the data well. When the data dimension is too large, many problems arise; for example, computational efficiency decreases and overfitting may occur. Principal Component Analysis (PCA) [4][5][6] is one of the state-of-the-art methods of feature extraction. PCA reduces the dimension of the data in such circumstances while trying to keep as much useful information as possible. In particular, it uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. However, PCA has inherent limitations when it is employed to represent data: (1) it requires that the data be linear and (2) the features of the PCA decomposition are orthogonal. In real applications, many datasets do not conform to these properties. For this reason, we introduce a new method, ELM Autoencoder (ELM-AE), to extract the feature space of the data and learn the projecting function of the subspace. The extracted feature space is not based on an orthogonal transformation, and the data space can be either linear or nonlinear.
Domain adaptation (DA) mainly aims to train a robust classifier that can cope with data drawn from different distributions. It is widely used in computer vision and pattern recognition. The usual DA methods are divided into two classes [7]: (1) semisupervised DA algorithms [8], which have a small amount of labeled data from the target domain, and (2) unsupervised DA algorithms, in which there is no labeled data from the target domain. Here we mainly study the unsupervised DA setting and propose a DA algorithm based on ELM-AE.
Recently, DA has been applied to many fields, including speech and language processing [9][10][11], computer vision [12][13][14], statistics, and machine learning [12][13][14]. A robust classifier can be trained with DA to deal with multidomain mixed classification tasks. DA is particularly suitable for handling unsupervised data with no class labels. A typical implementation of a DA method learns a new space in which the differences in feature representation between the source and target domains are minimized.
DA methods based on PCA have been extensively researched [4][5][6]; through PCA we can find a common subspace in which the differences between the source and target distributions are minimized. In [10], Blitzer et al. proposed a method to learn a new feature space through the feature relationships between different domains. According to Jhuo et al. [15], a source data representation can be obtained by a linear transformation of the target data. In [16], Gong et al. proposed a geodesic flow kernel (GFK), which mainly accounts for the changes of both the source data and the target data in geometry and statistics. Fernando et al. proposed a DA method based on PCA in [7]. They obtained the feature spaces of the source and target data by applying PCA to each of them. Then, the representative features of the source data were projected into the feature space of the target data, and the representative features of the target data were projected into the feature space of the source data. Fernando et al. also considered three methods in [7]: DA-SA1, where the subspace $W_S$ is built from the source domain using PCA; DA-SA2, where PCA is used to build the subspace of the target domain, denoted by $W_T$; and NA, where only the original input space is used without learning a subspace.
The rest of the paper is organized as follows. We present the related work in Section 2. Section 3 is devoted to the presentation of the ELM-AE method and the consistency theorem on the similarity measure deduced from the learned projecting function. In Section 4, the subspace alignment algorithm based on ELM-AE is introduced. We carry out our experiments on various datasets in Section 5. Section 6 concludes the paper.

Related Work
In this section, we review a feature extraction method (PCA) that has been used in subspace alignment domain adaption algorithms. In applications involving many related features, a large number of features not only increases the complexity of the problem but also makes it difficult to analyze the problem reasonably. In general, although each feature provides some information, the features differ in importance. In many cases, the features are correlated, so the information they provide overlaps to some extent. Therefore, it is desirable to represent these features by a small number of new, uncorrelated features that reflect the vast majority of the information provided by the original features, and then to solve the problem better through the new features.
PCA was invented in 1901 by Karl Pearson [17] as an analogue of the principal axis theorem in mechanics. It can be mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some projection of the data comes to lie on the first coordinate, the second greatest variance on the second coordinate, and so on. In particular, it mainly includes the following steps [17].
(2) Compute the covariance matrix

$$Z = \begin{bmatrix} \mathrm{cov}(x_1, x_1) & \cdots & \mathrm{cov}(x_1, x_D) \\ \vdots & \ddots & \vdots \\ \mathrm{cov}(x_D, x_1) & \cdots & \mathrm{cov}(x_D, x_D) \end{bmatrix},$$

where $\mathrm{cov}(x_i, x_j)$ represents the covariance between the $i$th column and the $j$th column of the data matrix. Each entry of the covariance matrix can be obtained by the following equation:

$$\mathrm{cov}(x_i, x_j) = E\big[(x_i - E[x_i])(x_j - E[x_j])\big].$$

(3) Calculate the eigenvalues and eigenvectors of the covariance matrix and decompose the covariance matrix accordingly [18]:

$$Z = U \Lambda U^{T},$$

where $\Lambda$ is a diagonal matrix composed of the eigenvalues of the covariance matrix $Z$ and $U$ is the orthogonal matrix whose columns are the eigenvectors of $Z$; these columns are the principal coordinates of the new variables. Each eigenvalue represents the variance of the corresponding new variable, and the eigenvalues are arranged in decreasing order.

(4) When an eigenvalue is small enough, the corresponding component contributes little to the representation of the source data. Thus, we choose the eigenvectors of the $m$ largest eigenvalues to constitute the projecting space $U_{D \times m}$.

(5) By using (5), get the new data representation:

$$F = X U_{D \times m}, \tag{5}$$

where $X$ denotes the original data matrix. Each row of $F$ is the projection of the corresponding row of the original matrix onto the principal component axes. These projected vectors can be used to express the source data.
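The PCA steps above can be sketched in NumPy as follows. This is an illustrative sketch rather than the implementation used in the experiments: the function name `pca_project`, the column centering, the $1/(N-1)$ normalization of the covariance, and the toy data are all assumptions made here.

```python
import numpy as np

def pca_project(X, m):
    """Project the rows of X (N samples x D features) onto the top-m principal axes."""
    # Center each column (assumed preprocessing for this sketch).
    Xc = X - X.mean(axis=0)
    # Sample covariance matrix Z (D x D).
    Z = Xc.T @ Xc / (Xc.shape[0] - 1)
    # Z is symmetric, so eigh applies; it returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(Z)
    order = np.argsort(eigvals)[::-1]   # indices sorted by decreasing eigenvalue
    U_m = eigvecs[:, order[:m]]         # D x m projecting space
    F = Xc @ U_m                        # N x m new data representation
    return F, U_m, eigvals[order]

# Toy data with correlated columns (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
F, U_m, eigvals = pca_project(X, m=2)
```

In this sketch the variance of each column of `F` equals the corresponding eigenvalue, which matches the statement that each eigenvalue measures the variance of a new variable.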
PCA can find the most important variable combinations of the original data. By exposing the directions of greatest variance, it can effectively and intuitively reflect the relationship between samples. Moreover, it can approximately express the original data by projecting onto the largest principal components. However, PCA has its limitations: (1) it requires that each principal component be a linear combination of the original data and (2) it requires that the principal components be uncorrelated. These limitations prevent PCA from solving some practical problems well.

Figure 1: ELM-AE has the same network structure as ELM except that its target output is the same as input X.

ELM-AE
In this section, we introduce a feature representation algorithm, ELM-AE, which is based on a very fast and effective neural network named Extreme Learning Machine (ELM) [19,20]. Just like the traditional ELM [21][22][23], ELM-AE contains three layers: an input layer, a hidden layer, and an output layer. The difference is that in ELM-AE the target output is the same as the input. Figure 1 shows ELM-AE's network structure for compressed, sparse, and equal dimension representations.
Suppose that there are $L$ hidden neurons in the hidden layer and $D$ neurons (the data dimension) in the input layer and the output layer. According to the data dimension $D$ and the number of hidden neurons $L$, ELM-AE can be divided into three kinds of structures [24,25]:

(1) Compressed ($D > L$): the representative features are projected from a high-dimensional input data space into a low-dimensional feature space.

(2) Sparse ($D < L$): the representative features are projected from a low-dimensional input data space into a high-dimensional feature space.

(3) Equal ($D = L$): the representative features are generated by a projection in which the dimension of the input data space is equal to that of the feature space.

According to ELM theory [21,26,27], the hidden neurons of ELM can be randomly generated. In general, we choose orthogonal randomly generated hidden parameters to improve ELM-AE's generalization performance. In ELM-AE, the orthogonal random weights and biases of the hidden nodes project the input data into a different or equal dimension space, as shown by the Johnson-Lindenstrauss lemma [28], and are calculated as follows:

$$\mathbf{h} = g(\mathbf{a} \cdot \mathbf{x} + \mathbf{b}), \qquad \mathbf{a}^{T}\mathbf{a} = I, \quad \mathbf{b}^{T}\mathbf{b} = 1,$$

where $\mathbf{a} = [a_1, \ldots, a_L]$ are the orthogonal random weights and $\mathbf{b} = [b_1, \ldots, b_L]$ are the orthogonal random biases between the input and hidden nodes; $g(\cdot)$ is the activation function of ELM-AE.
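ELM-AE's output weight $\beta$ is our transformation matrix; we can project the input data into the subspace by $\beta$. For sparse and compressed ELM-AE representations, we calculate the output weights $\beta$ as follows:

$$\beta = \left( \frac{I}{C} + H^{T} H \right)^{-1} H^{T} X,$$

where $H = [\mathbf{h}_1, \ldots, \mathbf{h}_N]$ are ELM-AE's hidden layer outputs, $X = [\mathbf{x}_1, \ldots, \mathbf{x}_N]$ are both the input data and the output data, and $C$ is a regularization parameter. For equal dimension ELM-AE representations, we calculate the output weights $\beta$ as follows:

$$\beta = H^{-1} X, \qquad \beta^{T} \beta = I.$$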
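The computation of the ELM-AE output weights can be sketched in NumPy as follows, using the regularized least-squares solution $\beta = (I/C + H^{T}H)^{-1}H^{T}X$. This is a minimal sketch and not the authors' code: the function name `elm_ae_output_weights`, the QR-based orthogonalization, the sigmoid activation, the default value of `C`, and the one-sample-per-row layout are assumptions, and the regularized formula is applied for every hidden-layer size, including the equal dimension case.

```python
import numpy as np

def elm_ae_output_weights(X, L, C=1e3, seed=0):
    """Return ELM-AE output weights beta (L x D) and hidden outputs H (N x L)."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    # Random input weights, orthogonalized with a QR decomposition.
    A = rng.normal(size=(D, L))
    if D >= L:
        A, _ = np.linalg.qr(A)      # D x L with orthonormal columns
    else:
        Q, _ = np.linalg.qr(A.T)    # L x D with orthonormal columns
        A = Q.T                     # D x L with orthonormal rows
    b = rng.normal(size=L)
    b /= np.linalg.norm(b)          # unit-norm bias vector
    # Sigmoid hidden layer outputs, one row per sample.
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
    # Regularized least squares: beta = (I/C + H^T H)^{-1} H^T X.
    beta = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ X)
    return beta, H

# Toy data (hypothetical): 100 samples of dimension 20, compressed to L = 8.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))
beta, H = elm_ae_output_weights(X, L=8)
Z = X @ beta.T                      # samples projected into the L-dimensional subspace
```

With samples stored as rows, the $D \times L$ projection matrix corresponds to `beta.T` in this sketch; that orientation is a convention chosen here.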

Subspace Alignment Based on ELM-AE
In this section, we mainly introduce the subspace alignment algorithm based on ELM-AE and illustrate the principle of the algorithm in detail.

Subspace Alignment.
In classification or regression, we generally assume that the labeled data and the unlabeled data come from the same distribution. However, in practical applications this assumption is challenged, since the data we need to process come from different domains. Because of this change in the data, a model trained on the former data does not achieve good results when it is used to classify the later test data. In order to obtain a robust machine learning model, it is necessary to take into account the shift between these two domains (we refer to these different but related marginal distributions as domains). The purpose of DA is to make full use of the information generated during the transformation between the source domain and the target domain and then adapt to it automatically.
In this paper, we project each domain into its own subspace. We then bring the subspace of the source domain close to that of the target domain in order to reduce the difference between the two structures. This requires a transformation function that projects the subspace of the source domain into the subspace of the target domain. Based on this idea, the subspace alignment algorithm is described in detail as follows.

Subspace Generation.
To learn the shift between the two domains, we process the raw data using ELM-AE to represent the original data and make full use of the information of both domains. First, we transform every source and target sample into a $D$-dimensional vector. Then, we use ELM-AE to calculate the output weights of the $L$ hidden layer nodes; these give the subspaces of the source domain and the target domain, denoted by $W_S \in \mathbb{R}^{D \times L}$ and $W_T \in \mathbb{R}^{D \times L}$, respectively.

The Subspace Alignment Based on the ELM-AE.
In this section, we introduce how to learn the transformation function between the two subspaces and use it to achieve subspace alignment. We suggest projecting each source sample $\mathbf{x}_S$ and target sample $\mathbf{x}_T$ (where $\mathbf{x}_S \in \mathbb{R}^{1 \times D}$, $\mathbf{x}_T \in \mathbb{R}^{1 \times D}$) into its respective subspace, giving $\tilde{\mathbf{x}}_S$ and $\tilde{\mathbf{x}}_T$, by the operations $\mathbf{x}_S W_S$ and $\mathbf{x}_T W_T$, respectively. Then, we learn a linear transformation function that aligns the source subspace coordinate system to the target one. This step allows us to directly compare source and target samples in their respective subspaces without unnecessary data projections. To achieve this task, we align the basis vectors by using a transformation matrix $M$ from $W_S$ to $W_T$. $M$ is learned by minimizing the following Bregman matrix divergence [7]:

$$F(M) = \| W_S M - W_T \|_F^2, \qquad M^{*} = \arg\min_{M} F(M), \tag{9}$$

where $\| \cdot \|_F^2$ is the squared Frobenius norm. Because the Frobenius norm is invariant to orthogonal operations (that is, assuming $W_S^{T} W_S = I$), (9) can be written as follows:

$$F(M) = \| W_S^{T} W_S M - W_S^{T} W_T \|_F^2 = \| M - W_S^{T} W_T \|_F^2.$$

In this way, we can calculate the optimal transformation matrix:

$$M^{*} = W_S^{T} W_T.$$

By applying the matrix $M^{*}$, we get the new coordinate system:

$$\tilde{W}_S = W_S M^{*} = W_S W_S^{T} W_T,$$

where $\tilde{W}_S$ is the target aligned source coordinate system. The subspace alignment algorithm based on ELM-AE (SA-ELM-DA for short) is summarized in Algorithm 1.
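The alignment step can be sketched in NumPy as follows, using the closed form $M = W_S^{T} W_T$ of subspace alignment [7]. The sketch is illustrative only: the function name `align_subspaces`, the variable names, and the random toy bases are assumptions, and the bases are built with orthonormal columns so that the closed form can be checked against a generic least-squares solution of $\min_M \| W_S M - W_T \|_F$.

```python
import numpy as np

def align_subspaces(W_S, W_T):
    """Return M = W_S^T W_T and the target-aligned source basis W_S M."""
    M = W_S.T @ W_T
    return M, W_S @ M

rng = np.random.default_rng(0)
D, L = 10, 3
# Toy bases with orthonormal columns (an assumption of this sketch).
W_S, _ = np.linalg.qr(rng.normal(size=(D, L)))
W_T, _ = np.linalg.qr(rng.normal(size=(D, L)))
M, W_S_aligned = align_subspaces(W_S, W_T)

# Generic least-squares minimizer of ||W_S M - W_T||_F, for comparison.
M_lstsq, *_ = np.linalg.lstsq(W_S, W_T, rcond=None)

# Project one source and one target sample (row vectors) for comparison.
x_S = rng.normal(size=(1, D))
x_T = rng.normal(size=(1, D))
z_S = x_S @ W_S_aligned   # source sample in target-aligned coordinates
z_T = x_T @ W_T           # target sample in the target subspace
```

When the columns of `W_S` are not orthonormal, the two solutions generally differ; the closed form is exact only under the orthogonality assumption used in the derivation.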
The ELM-AE activation function can be the "sigmoid," "RBF," or "Sin" function, among others. $L$ represents the number of neurons in the hidden layer. $C$ is a regularization parameter that improves the generalization performance of ELM-AE. $W_S$ and $W_T$ are not only the output weights of ELM-AE but also the projecting matrices of the subspaces. The advantage of the proposed algorithm is twofold: (1) by generating the subspaces, we can train a robust classifier that achieves high accuracy when classifying data from different distributions; (2) compared with PCA, our algorithm also performs well on nonlinear data.

Experimental Study
In this section, we carry out experiments on a variety of data to evaluate our method and compare it with other state-of-the-art DA algorithms.

DA Datasets.
We compare all DA algorithms using the Office [29] dataset and the Caltech-256 [30] dataset, which together contain four domains. The Office dataset consists of webcam images (denoted by $W$), DSLR images (denoted by $D$), and Amazon images (denoted by $A$). The Caltech-256 images are denoted by $C$. $A$ contains 958 images, $C$ contains 1123 images, $D$ contains 157 images, and $W$ is a collection of 295 images. Following the same setup as in [30], we use each source of images as a domain; consequently we get four domains ($A$, $C$, $D$, and $W$) and 12 DA problems. A DA problem is described by the notation $S \rightarrow T$, in which $S$ is the source domain and $T$ is the target domain. We randomly select two dimensions of the original datasets, and one example is visualized in Figures 2-5. From Figures 2-5, we can see that the points from different categories mix together. However, a 2-dimensional visualization alone cannot establish that the original datasets are nonlinear. To better show the nonlinear ability of the proposed algorithm, we use a linear activation function and a nonlinear activation function (sigmoid), respectively, in the experiments. From Tables 1 and 2, we can see that when the linear activation function is used, the average classification accuracy of SA-ELM-DA is significantly reduced with both the SURF feature and the CNN feature; this further indicates that our datasets are nonlinear and illustrates the nonlinear ability of the proposed algorithm.
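As a quick check of the count of 12 DA problems, the ordered source/target pairs can be enumerated as follows. The sketch uses the letters A (Amazon), C (Caltech-256), D (DSLR), and W (webcam) for the four domains; the variable names are illustrative.

```python
from itertools import permutations

# Number of images per domain, as reported in the text.
domain_sizes = {"A": 958, "C": 1123, "D": 157, "W": 295}

# Every ordered (source, target) pair of distinct domains is one DA problem.
da_problems = [f"{s} -> {t}" for s, t in permutations(domain_sizes, 2)]
print(len(da_problems))   # 4 * 3 ordered pairs
print(da_problems)
```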

Experimental Setup.
We use the SURF feature and the CNN feature with our algorithm SA-ELM-DA, respectively, and denote the resulting methods as SA-ELM-DA(SURF) and SA-ELM-DA(CNN). SURF is a scale-invariant feature extraction algorithm. It first uses the Hessian at each pixel to build a scale space and then uses nonmaximum suppression to select initial feature points. A 3-dimensional linear interpolation method is used to locate the feature points at the subpixel level, and the points whose values are less than a certain threshold are removed. The main direction of each feature point is obtained by accumulating Haar wavelet responses in the neighborhood of the feature point. Finally, the SURF feature point descriptor is constructed, giving an 800-dimensional representation. The convolutional neural network (CNN) [31] is an algorithm commonly used in deep learning; we use the imagenet-vgg-f CNN model in MatConvNet (a convolutional network toolkit for MATLAB developed by Andrea Vedaldi) to extract the output of the fully connected layer at layer 20 and get a 4096-dimensional representation of the images.
We compare our ELM-AE subspace DA approach with the PCA subspace DA algorithm and three other baselines as follows.
DA-SA1 [7]. The subspace $W_S$ is built from the source domain using PCA; $W_S$ is used to project the source data and the target data to complete the domain adaptation, and the average accuracy of domain adaptation is calculated by cross-validation.
DA-SA2 [7]. PCA is used to build the subspace of the target domain, denoted by $W_T$; $W_T$ is then used to project the source data and the target data, and the average accuracy of domain adaption is calculated by cross-validation.
NA [7]. No projection is generated; the classification of multidomain data is completed in the original space without learning a subspace. The average classification accuracy is calculated by cross-validation.
GFK [16]. By integrating an infinite number of subspaces that characterize changes in geometric and statistical properties from the source to the target domain, it can automatically infer important algorithmic parameters without requiring extensive cross-validation.
SA-ELM-DA. The feature representations of the source and target data are learned by ELM-AE to generate their respective subspaces, subspace alignment is completed by the transformation matrix to achieve domain adaption, and the average classification accuracy is calculated.
In the experiments, we compare the performance of subspace generation for each DA algorithm using a 1-Nearest-Neighbor (NN) classifier and an SVM classifier. The number of hidden nodes is determined by cross-validation.
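For illustration, a 1-Nearest-Neighbor classifier of the kind used for evaluation can be sketched in NumPy as follows. This is a minimal sketch on hypothetical toy data, not the evaluation code of the experiments: the function name `one_nn_predict`, the Euclidean distance, and the arrays are assumptions. In the DA setting, `Z_train` would hold the projected labeled source samples and `Z_test` the projected target samples.

```python
import numpy as np

def one_nn_predict(Z_train, y_train, Z_test):
    """Label each test row with the label of its nearest training row (Euclidean)."""
    # Pairwise squared distances, shape (n_test, n_train).
    d2 = ((Z_test[:, None, :] - Z_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[np.argmin(d2, axis=1)]

# Hypothetical 2-dimensional toy samples with two classes.
Z_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y_train = np.array([0, 0, 1, 1])
Z_test = np.array([[0.2, 0.4], [5.5, 4.8]])
y_test = np.array([0, 1])

y_pred = one_nn_predict(Z_train, y_train, Z_test)
accuracy = np.mean(y_pred == y_test)   # fraction of correctly labeled test samples
```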

Experimental Results.
To evaluate domain adaptation based on ELM-AE, we use the Office [29]/Caltech-256 [30] datasets, which consist of four domains ($A$, $C$, $D$, and $W$), and compare with different DA algorithms. The results of the unsupervised DA problems using an NN classifier are shown in Table 1. In 7 out of the 12 DA problems, our algorithm has an advantage over the others, and it improves the average accuracy significantly. The results obtained with an SVM [32] classifier in the unsupervised DA case are shown in Table 2. Our method performs better than the others on 10 DA problems, and its remaining results are close to the best performance. SA-ELM-DA(SURF, linear) and SA-ELM-DA(CNN, linear) appear in Tables 1 and 2 only to show that the datasets are nonlinear; we do not analyze their results further here. In Table 2, when we use SA-ELM-DA(CNN) to complete the domain adaptation, the average accuracy improves significantly. This shows that our algorithm outperforms the other DA algorithms not only with NN-like local classifiers but also with more global SVM classifiers.

In Table 1, SA-ELM-DA(SURF) increases the average accuracy by 15.20% compared to NA [7]; compared to DA-SA1 [7] and DA-SA2 [7], our algorithm improves the average accuracy by 2.30% and 4.00%, respectively. The GFK [16] algorithm has been shown to perform well: compared to NA [7], it increases the average accuracy by 11.6%, but our algorithm achieves a further increase of 4.20% over GFK [16]. In Table 2, our algorithm improves the average accuracy by 4.70% over DA-SA1 [7] and by 4.40% over DA-SA2 [7]. Compared to GFK [16], our algorithm improves the average accuracy by 3.40%. PCA [7] gives the best result among the baselines, but our algorithm still increases the average accuracy by 1.10% compared to PCA [7].

Our algorithm performs better on multidomain data. The DA algorithm based on PCA [7] has been shown to perform well on unsupervised data; compared with it, SA-ELM-DA handles the unsupervised setting better. Compared with NA [7], which performs no domain transformation, our algorithm greatly improves the average accuracy. Therefore, SA-ELM-DA is more suitable for data from different domains. Compared to GFK [16], the performance of the algorithm still has much room for improvement. As we can conclude from Tables 1 and 2, SA-ELM-DA(CNN) has the best average accuracy; when we use the CNN feature to complete the domain adaptation, the average accuracy improves significantly except for the two problems between $D$ and $W$. There are only a few samples in $D$ and $W$, which may lead to inaccurate features when using CNN and hence to a significant decline in domain adaption performance between $D$ and $W$. SA-ELM-DA improves the average accuracy significantly with both the SURF feature and the CNN feature. Because the datasets are nonlinear, the average accuracy improves significantly when a nonlinear activation function (sigmoid) is used to project the data. Unlike DA based on PCA [7], the proposed algorithm can handle nonlinear data.

We believe that our method can be widely applied to various types of data processing, such as image processing, machine learning, and computer vision. For example, in medical image processing, the same disease presents differently in different individuals. Through domain adaption, we can learn a common model that narrows these differences within the same type of disease and improves the diagnosis rate.

Conclusion
We propose a new domain adaption algorithm based on ELM-AE and evaluate it with different features. To further enhance the accuracy of domain adaption, we use a convolutional neural network (CNN) [31] to extract high-dimensional features of images and complete the domain adaptation with them. The experimental results show that SA-ELM-DA performs better than PCA and other state-of-the-art domain adaption algorithms on average accuracy. In addition, ELM-AE can extract nonlinear features well for data with nonlinear relationships. Given the good performance of our algorithm on data from different distributions and its simple principle, we believe that it can have a wide range of applications. In future work, we will extend our algorithm to cross-modal retrieval for large-scale data.

Table 1: Recognition accuracy with DA using an NN classifier (Office dataset + Caltech-256).

Table 2: (a) Recognition accuracy with DA using an SVM classifier (Office dataset + Caltech-256).