Joint Transfer Extreme Learning Machine with Cross-Domain Mean Approximation and Output Weight Alignment

Abstract. With fast learning speed and high accuracy, the extreme learning machine (ELM) has achieved great success in pattern recognition and machine learning. Unfortunately, it fails when the labeled samples available for training are insufficient, and labeled samples are often difficult to obtain due to their high cost. In this paper, we solve this problem with transfer learning and propose the joint transfer extreme learning machine (JTELM). First, it applies cross-domain mean approximation (CDMA) to minimize the discrepancy between domains, thus obtaining one ELM model. Second, subspace alignment (SA) and weight approximation are introduced together into the output layer to enhance the capability of knowledge transfer and learn another ELM model. Third, the prediction of test samples is jointly determined by the two learned ELM models. Finally, a series of experiments is carried out to investigate the performance of JTELM, and the results show that it efficiently achieves the task of transfer learning and performs better than the traditional ELM and other transfer and nontransfer learning methods.


Introduction
The fast development of the mobile Internet, the Internet of Things, and high-performance computing has caused a large amount of data to emerge. How to mine the information in these data to help people make decisions has become a challenge. Machine learning uses numerous labeled data to train a statistical model for automatic prediction, and this has become a hot topic in artificial intelligence (AI). As a high-performance model in machine learning, ELM has achieved success in pattern recognition, computational science, and machine vision. It has the following two merits [1-4]: fast learning speed and outstanding generalization performance. ELM does not need to tune its input weights and biases; it only needs to optimize the output weights by solving a least-squares problem. Therefore, it has been widely adopted for classification and regression in various fields, including industrial fault diagnosis [5, 6], medical diagnosis [7], hyperspectral imagery classification [8, 9], facial expression recognition [10], and brain-computer interfaces [11, 12]. However, like traditional machine learning models, ELM performs less satisfyingly when the training samples are insufficient.
Transfer learning (TL) can handle this problem: labeled samples (data) from other domains (the source domain) related to the current domain (the target domain) are used to train an efficient model that helps with the target task [13-15]. TL not only reduces the cost of collecting training samples through data reuse but also enhances the generalization performance of the model; it is an expression of advanced intelligence. TL is commonly divided into three categories [14, 16], namely, instance-based transfer [17-19], feature-based transfer [20-23], and classifier (or parameter)-based transfer [24-26]. Moreover, with the success of deep learning and adversarial networks in computer vision and machine learning, deep transfer learning [27] and transfer adversarial learning approaches [28] have appeared, further enriching transfer learning in both theory and application.
TL can help ELM overcome the shortage of available training samples, and many variant ELMs with the ability of knowledge transfer have appeared. Depending on how adaptation between domains is performed, we divide transfer ELMs (TELMs) into the following three types. (1) Target-supervised methods: these usually require a few labeled samples from the target domain to adjust a model trained on the source domain. The domain adaptation extreme learning machine (DAELM) [29] was put forward to enable ELM to handle domain adaptation problems in the E-nose system. The online domain adaptation extreme learning machine (ODAELM) [30] and the online weighted domain transfer extreme learning machine (OWDTELM) [31] extend DAELM to online tasks. To further improve DAELM, Xia et al. [32] proposed boosting for DAELM (BDAELM), which introduces boosting technology to ensemble DAELMs. (2) Parameter transformation or approximation: this method realizes knowledge transfer across domains through a transformation matrix or output weight approximation, as in the transfer extreme learning machine with output weight alignment (TELM-OWA) [33], the parameter transfer ELM (PTELM) [34], and ELM-based domain adaptation (EDA) [35]. Li et al. [36] designed transfer learning based on the ELM algorithm (TL-ELM) by adding a constraint that forces the output weights of the two domains to be close to each other. (3) Statistical adaptation: this usually introduces a statistical distribution metric, such as MMD [37], into ELM to reduce the domain shift. Many methods, including cross-domain extreme learning machines (CdELMs) [38], the extreme learning machine based on maximum weighted mean discrepancy (ELM-MWMD) [39], and the domain space transfer ELM (DST-ELM) [40], apply MMD to reduce the distribution discrepancy of the hidden-layer outputs from the source and target domains.
In this paper, we propose a novel ELM called the joint transfer extreme learning machine (JTELM) for transfer learning. It first obtains one ELM model by introducing cross-domain mean approximation (CDMA) [41] into ELM, where CDMA effectively minimizes the marginal and conditional distribution differences between the two domains. Second, we apply subspace alignment technology [42] to align the output weights of the two domains and simultaneously add an approximation term to force the output weights to be close to each other, which boosts knowledge transfer; we thereby obtain the other ELM model. Finally, the target samples are tested by the two learned ELMs. JTELM is illustrated in Figure 1. We carry out experiments on public datasets for transfer learning tasks to estimate the performance of JTELM, and the results demonstrate the superiority of our method.
We summarize our contributions as follows: (1) The CDMA measure is added to the objective function of ELM to reduce the distribution discrepancy of the hidden-layer outputs in the source and target domains, which yields one transfer ELM model.
(2) We apply output weight alignment and the approximation of the output weights from the two domains to improve the efficiency of knowledge transfer and simultaneously obtain the other transfer ELM. (3) We use the two obtained transfer ELMs to jointly predict test samples, which enhances the robustness of JTELM. To estimate the performance of our approach, we conduct classification experiments on object recognition and text datasets, and the results demonstrate that JTELM has a remarkable knowledge transfer ability.
We organize the rest of this paper as follows. ELM, CDMA, and SA are briefly described in Section 2. JTELM is described in detail in Section 3. Then, the experiments are analyzed in Section 4, and the conclusion of this paper is presented in Section 5.

Related Work
In this section, we briefly introduce ELM, CDMA, and SA.

Extreme Learning Machine (ELM)
ELM, as a single-hidden-layer feedforward network, randomly initializes the input weights and biases and then solves for the optimal output weights, which leads to its fast learning speed and high accuracy. Given a labeled dataset $\{(x_i, y_i)\}_{i=1}^{N}$ with $N$ samples $x_i$ and corresponding labels $y_i$, a classic ELM model with $L$ hidden-layer nodes is constructed in the following manner:

$$o_i = \sum_{j=1}^{L} \beta_j\, g(w_j \cdot x_i + b_j), \quad i = 1, \ldots, N, \qquad (1)$$

where $o_i$ is the output of ELM for the input sample $x_i$, $w_j$ and $b_j$ are the input weights and bias, which are often randomly initialized, $g(\cdot)$ is the activation function, and $\beta = [\beta_1, \ldots, \beta_L]^T$ is the output weight matrix. To obtain an optimal $\beta$, the following loss function is solved:

$$\min_{\beta}\; \|\beta\|^2 + \lambda \sum_{i=1}^{N} \|o_i - y_i\|^2, \qquad (2)$$

where $\|\beta\|^2$ is a parameter sparsity constraint that avoids model overfitting, $\sum_{i=1}^{N} \|o_i - y_i\|^2$ is the classification error, and $\lambda$ is its tradeoff parameter. We then convert equation (2) into the following matrix form:

$$\min_{\beta}\; \|\beta\|^2 + \lambda \|H\beta - Y\|^2, \qquad (3)$$

where $H = [g(w_j \cdot x_i + b_j)]_{N \times L}$ is the hidden-layer output matrix and $Y = [y_1; \ldots; y_N]$ is the label matrix.

According to [2], we get the optimal $\beta$ as

$$\beta^{*} = \Big(\frac{I}{\lambda} + H^{T}H\Big)^{-1} H^{T} Y. \qquad (4)$$

Finally, we predict the testing sample $x_{Te}$ as

$$o_{Te} = h_{Te}\, \beta^{*}, \qquad (5)$$

where $h_{Te} = g(x_{Te})$ is the hidden-layer output for $x_{Te}$.
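To make the training procedure concrete, the following minimal NumPy sketch implements the regularized ELM of equations (1)-(5). The sigmoid activation, one-hot label encoding, and all function and variable names are illustrative assumptions rather than details fixed by the paper.

```python
import numpy as np

def train_elm(X, Y, L=500, lam=1.0, rng=None):
    """Minimal regularized ELM (equations (1)-(5)).

    X: (N, d) inputs; Y: (N, C) one-hot labels; L: hidden nodes;
    lam: tradeoff parameter lambda. Returns (W, b, beta)."""
    rng = np.random.default_rng(rng)
    N, d = X.shape
    W = rng.standard_normal((d, L))          # random input weights (never tuned)
    b = rng.standard_normal(L)               # random biases (never tuned)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # sigmoid hidden-layer output, (N, L)
    # Closed-form output weights: beta = (I/lam + H^T H)^{-1} H^T Y
    beta = np.linalg.solve(np.eye(L) / lam + H.T @ H, H.T @ Y)
    return W, b, beta

def predict_elm(X_te, W, b, beta):
    """Predict class indices for test samples (equation (5))."""
    H_te = 1.0 / (1.0 + np.exp(-(X_te @ W + b)))
    return np.argmax(H_te @ beta, axis=1)
```

The only learned quantity is beta; this is what gives ELM its one-shot, least-squares training speed.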

Cross-Domain Mean Approximation (CDMA).
The distribution discrepancy measure is very critical in transfer learning. Zang et al. [41] presented CDMA, which is nonparametric, easy to understand, efficient, and beneficial for mining local information. In transfer learning, there are two datasets: $D_S = \{(x_{Si}, y_{Si})\}_{i=1}^{n_S}$ from the source domain and $D_T = \{(x_{Tj}, y_{Tj})\}_{j=1}^{n_T}$ from the target domain, where $n_S$ ($n_T$) is the number of samples $x_{Si}$ ($x_{Tj}$) in $D_S$ ($D_T$) and $y_{Si}$ ($y_{Tj}$), which belongs to one of $C$ classes, is the label of $x_{Si}$ ($x_{Tj}$). Then, the CDMA measure is

$$d_{CDMA} = \sum_{i=1}^{n_S} \|x_{Si} - \mu_T\|^2 + \sum_{j=1}^{n_T} \|x_{Tj} - \mu_S\|^2, \qquad (6)$$

where $\mu_{T(S)}$ is the mean vector of the target (source) domain samples. If we further consider the label information of the samples, CDMA can also be represented as

$$d_{CDMA}^{c} = \sum_{c=1}^{C} \bigg( \sum_{x_{Si} \in D_S^{(c)}} \|x_{Si} - \mu_T^{(c)}\|^2 + \sum_{x_{Tj} \in D_T^{(c)}} \|x_{Tj} - \mu_S^{(c)}\|^2 \bigg), \qquad (7)$$

where $\mu_{T(S)}^{(c)}$ is the mean vector of the target (source) domain subset $D_{T(S)}^{(c)}$ with category $c$.
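Under the reconstruction of equations (6) and (7) above, a small sketch of the measure follows. The exact normalization in [41] may differ, and the target pseudo-labels used for the conditional form are an assumption, as are the function names.

```python
import numpy as np

def cdma(Xs, Xt):
    """Marginal CDMA (equation (6)): each sample is pulled toward
    the mean of the other domain. Xs: (n_s, d); Xt: (n_t, d)."""
    mu_s = Xs.mean(axis=0)
    mu_t = Xt.mean(axis=0)
    return ((Xs - mu_t) ** 2).sum() + ((Xt - mu_s) ** 2).sum()

def cdma_conditional(Xs, ys, Xt, yt_pseudo):
    """Class-conditional CDMA (equation (7)), with target pseudo-labels."""
    total = 0.0
    for c in np.unique(ys):
        Xs_c, Xt_c = Xs[ys == c], Xt[yt_pseudo == c]
        if len(Xs_c) and len(Xt_c):
            total += cdma(Xs_c, Xt_c)
    return total
```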

Subspace Alignment (SA).
In transfer learning, especially feature-based transfer, SA aligns two feature subspaces from the source and target domains (obtained by other feature extraction methods) and realizes distribution consistency between the two domains. If the two subspace transformation matrices $A_S$ and $A_T$ have been learned, then a transformation matrix $M$ is obtained by solving

$$M^{*} = \arg\min_{M} \|A_S M - A_T\|_F^2, \qquad (8)$$

where $\|\cdot\|_F$ is the Frobenius norm. Adding the orthogonality of $A_S$ (i.e., $A_S^T A_S = I$) to equation (8) gives

$$M^{*} = A_S^{T} A_T. \qquad (9)$$

From equation (9), setting $A = A_S M^{*} = A_S A_S^{T} A_T$, it is clear that the sample distribution in the subspace $A$ is more similar to the one in $A_T$ than the one in $A_S$ is, which facilitates knowledge transfer.
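For illustration, here is a sketch of this alignment step, assuming PCA bases for the two subspaces; the choice of feature extractor and the parameter k are assumptions, since the text above leaves them open.

```python
import numpy as np

def subspace_alignment(Xs, Xt, k=20):
    """Align a k-dim source PCA subspace to the target one (eqs. (8)-(9))."""
    # Columns of A_s, A_t: top-k principal directions of each domain.
    A_s = np.linalg.svd(Xs - Xs.mean(0), full_matrices=False)[2][:k].T
    A_t = np.linalg.svd(Xt - Xt.mean(0), full_matrices=False)[2][:k].T
    M = A_s.T @ A_t            # closed-form minimizer of ||A_s M - A_t||_F
    A = A_s @ M                # aligned source basis, A = A_s A_s^T A_t
    return Xs @ A, Xt @ A_t    # each domain projected into its (aligned) subspace
```

Because the bases are orthonormal, the closed form of equation (9) drops out of the Frobenius objective directly, with no iterative optimization.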

Joint Transfer Extreme Learning Machine (JTELM)
In response to the shortcoming that ELM has no ability of knowledge transfer, we propose a novel transfer ELM, abbreviated as JTELM, for handling unsupervised transfer learning tasks in which no labeled target samples appear. In unsupervised transfer learning, the source domain $D_S$ and the target domain $D_T$ are given, but the labels $y_{Tj}$ are unavailable in $D_T$; we therefore expect JTELM, learned from $D_S$, to precisely predict the samples in $D_T$.

Extreme Learning Machine with CDMA.

Introducing the CDMA measure, applied to the hidden-layer outputs, into the ELM objective gives

$$L_{ELM-CDMA}(\beta_S) = \|\beta_S\|^2 + \lambda \|H_S \beta_S - Y_S\|^2 + \alpha_1\, d_{CDMA}(H_S \beta_S, H_T \beta_S). \qquad (10)$$

In equation (10), the first two terms are the loss of ELM, and the third term is the CDMA loss in the output layer; $\alpha_1$ is a tradeoff parameter between the two losses.
Here $H_S$ and $H_T$ are the hidden-layer output matrices of the source and target samples, $H_S^{(c)}$ and $H_T^{(c)}$ collect the rows with category $c$ (target categories come from pseudo-labels), $H_S^{av}$ and $H_T^{av}$ are the mean vectors of $H_S$ and $H_T$, respectively, and $H_{T(S)}^{(c)av}$ is the mean vector of the target (source) rows $H_{T(S)}^{(c)}$ with category $c$. Writing the marginal part of the CDMA term through $E_S = H_S - \mathbf{1}_{n_S} H_T^{av}$ and $E_T = H_T - \mathbf{1}_{n_T} H_S^{av}$ (the conditional part of equation (7) enters analogously through the per-class means), we can obtain one ELM with knowledge transfer ability by setting $\partial L_{ELM-CDMA}/\partial \beta_S = 0$ according to [2], which yields

$$\beta_1^{*} = \Big( I + \lambda H_S^{T} H_S + \alpha_1 \big( E_S^{T} E_S + E_T^{T} E_T \big) \Big)^{-1} \lambda H_S^{T} Y_S. \qquad (11)$$
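A sketch of this first model under the reconstruction above; restricting attention to the marginal CDMA term, as well as the function and variable names, are assumptions.

```python
import numpy as np

def train_elm_cdma(Hs, Ys, Ht, lam=1.0, alpha1=0.1):
    """Closed-form beta_1* (equation (11)) with the marginal CDMA term.

    Hs: (n_s, L) source hidden outputs; Ys: (n_s, C) one-hot labels;
    Ht: (n_t, L) target hidden outputs."""
    L = Hs.shape[1]
    Es = Hs - Ht.mean(axis=0)   # each source row minus the target mean
    Et = Ht - Hs.mean(axis=0)   # each target row minus the source mean
    A = np.eye(L) + lam * Hs.T @ Hs + alpha1 * (Es.T @ Es + Et.T @ Et)
    return np.linalg.solve(A, lam * Hs.T @ Ys)
```

The class-conditional terms of equation (7) would add analogous per-class outer products to the matrix A, using pseudo-labels for the target rows.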

Extreme Learning Machine with Output Weight Alignment and Approximation.

Suppose that there is an output weight $\beta_T$ in the target domain; then we can construct a loss function as follows:

$$J(\beta_S, \beta_T) = \|\beta_T\|^2 + \alpha_2 \|H_S \beta_S - Y_S\|^2 + c\, \|\beta_S - \beta_T\|^2, \qquad (12)$$

where $\|H_S \beta_S - Y_S\|^2$ denotes the classification error in the source domain, $\|\beta_S - \beta_T\|^2$ denotes the output weight approximation that forces $\beta_S$ to be close to $\beta_T$ to facilitate knowledge transfer, and $\alpha_2$ and $c$ are the balance parameters.
As the next step, we apply SA to align the output layer of the source ELM with the target one. First, analogously to equation (9), we obtain a transformation matrix $M^{*} = \beta_S^{T} \beta_T$ and set $\beta_{temp} = \beta_S M^{*} = \beta_S \beta_S^{T} \beta_T$; we then replace $\beta_S$ with $\beta_{temp}$ and substitute it into equation (12) to get

$$J(\beta_T) = \|\beta_T\|^2 + \alpha_2 \|H_S \beta_{temp} - Y_S\|^2 + c\, \|\beta_{temp} - \beta_T\|^2. \qquad (13)$$

At this moment, $\|H_S \beta_{temp} - Y_S\|^2$ becomes the source classification error under output weight alignment. Substituting $\beta_{temp} = \beta_S \beta_S^{T} \beta_T$ into equation (13) and setting $P = H_S \beta_S \beta_S^{T}$ and $Q = \beta_S \beta_S^{T} - I$, equation (13) can be simplified as

$$J(\beta_T) = \|\beta_T\|^2 + \alpha_2 \|P \beta_T - Y_S\|^2 + c\, \|Q \beta_T\|^2. \qquad (14)$$

Letting $\partial J(\beta_T)/\partial \beta_T = 0$, we obtain

$$\beta_2^{*} = \big( I + \alpha_2 P^{T} P + c\, Q^{T} Q \big)^{-1} \alpha_2 P^{T} Y_S. \qquad (15)$$
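A sketch of this second model under the reconstruction above; the term ordering in the loss and the helper names P and Q are assumptions introduced for the simplification.

```python
import numpy as np

def train_elm_owa(Hs, Ys, beta_s, alpha2=1.0, c=0.1):
    """Closed-form beta_2* (equation (15)): output weight alignment plus
    approximation. Hs: (n_s, L); Ys: (n_s, C); beta_s: (L, C), e.g., a
    previously learned source output weight such as beta_1*."""
    L = beta_s.shape[0]
    B = beta_s @ beta_s.T              # (L, L) alignment operator beta_S beta_S^T
    P = Hs @ B                         # aligned source predictor, (n_s, L)
    Q = B - np.eye(L)                  # residual of the approximation term
    A = np.eye(L) + alpha2 * P.T @ P + c * Q.T @ Q
    return np.linalg.solve(A, alpha2 * P.T @ Ys)
```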

Discussion.
Inspired by TELM-OWA [33], we put forward JTELM to address the problem of unsupervised transfer learning. It has the following characteristics: (1) Similar to TELM-OWA, output weight alignment (equation (8)) and the weight approximation $\|\beta_S - \beta_T\|^2$ are used to learn a transfer ELM parameter $\beta_2^{*}$, but JTELM is an unsupervised TL method in which no labeled samples exist in the target domain. Therefore, JTELM faces a higher difficulty and challenge.
(2) The authors in [41] have proved that CDMA is a more efficient distribution discrepancy metric than MMD. We apply it to ELM to add the ability of transferring knowledge from the source domain to the target domain. Thus, $\beta_1^{*}$ in equation (11) is a parameter of the shared model between domains.
(3) JTELM utilizes $\beta_1^{*}$ and $\beta_2^{*}$ to jointly make decisions for test samples, which not only unifies statistical adaptation and parameter transformation into one learning framework to improve knowledge transfer but also enhances the robustness of our approach, similar to ensemble learning.
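The fusion rule for the joint decision is not spelled out above, so the sketch below assumes the simplest choice, averaging the two models' output scores before the argmax; the function name is likewise an assumption.

```python
import numpy as np

def predict_jtelm(Ht_test, beta1, beta2):
    """Joint decision of the two learned transfer ELMs.

    Ht_test: hidden-layer outputs of the test (target) samples.
    Score averaging is an assumed fusion rule."""
    scores = 0.5 * (Ht_test @ beta1 + Ht_test @ beta2)
    return np.argmax(scores, axis=1)
```

In a full pipeline, beta1 would come from a routine like train_elm_cdma and beta2 from one like train_elm_owa in the sketches above, both sharing the same randomly initialized hidden layer.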

Experiment and Analysis
In this section, we demonstrate the validity of JTELM by performing experiments on image and text datasets commonly used in transfer learning for classification tasks. All experiments are run on a PC with 8 GB of memory, the Windows 10 operating system, and MATLAB 2017b. Every experiment is run 20 times, and the average value is recorded. We evaluate all algorithms by classification accuracy, as in [21].

Datasets Description.

Office31 + Caltech256 (shown in Figure 2): These datasets were first published in [43]. They comprise two collections, namely, Office31 and Caltech256. Office31 consists of 4,652 images in 31 categories, collected from 3 subdomains, that is, Amazon (A), DSLR (D), and Webcam (W). Caltech (C) is also an object image dataset, consisting of 30,607 images from 256 categories.
During the experiment, we select 1,410 images from 10 categories of Office31 and 1,123 images from 10 categories of Caltech. Every image is represented by 800-dimensional SURF features. Two subdomains among A, W, D, and C are chosen as the source and target domain datasets, and 12 cross-domain tasks are built: C⟶A, C⟶W, C⟶D, ..., and D⟶W (shown in Table 1).
USPS + MNIST (shown in Figure 3): USPS and MNIST are two image datasets describing the digits 0 to 9, so they share 10 categories but have different distributions. USPS consists of 9,298 images of 16 × 16 pixels, and MNIST has 70,000 images of 28 × 28 pixels. During the experiment, 1,800 images from USPS and 2,000 images from MNIST are selected as the source and target domains (shown in Table 1). Every image is converted to 16 × 16 pixels, and two cross-domain tasks, i.e., USPS vs. MNIST and MNIST vs. USPS, are constructed for transfer learning.

MSRC + VOC2007 (shown in Figure 4): MSRC is an object image dataset consisting of 4,323 images from 18 categories. VOC2007 is an image dataset of Flickr photos, consisting of 5,011 images from 20 categories. They have similar but different distributions, as can be seen in Figure 4. In this experiment, we collect samples from the 6 categories shared by the two datasets, including aircraft, birds, cows, family cars, sheep, and bicycles; 1,269 images are selected from MSRC and 1,530 from VOC2007. In addition, we rescale all images to 256 gray levels and extract 240-dimensional features as the new representation. Then, we construct two transfer learning tasks: MSRC vs. VOC and VOC vs. MSRC (shown in Table 1).
Reuters-21578: Reuters-21578 is a text dataset commonly used for text mining and analysis. It has 21,578 news documents organized under 5 top categories, such as "exchanges," "orgs," "people," "places," and "topics." In this experiment, we select the three largest categories, "orgs," "people," and "place," and construct 6 transfer learning tasks, i.e., orgs vs. people, people vs. orgs, orgs vs. place, place vs. orgs, people vs. place, and place vs. people, as shown in Table 1.

Results and Analysis.
To investigate the performance of JTELM, we carry out classification tasks on image and text datasets, including the Office + Caltech, USPS + MNIST, MSRC + VOC2007, and Reuters-21578 datasets, and the results are reported in Tables 2 and 3. From the results, we make the following observations: (1) JTELM has the highest total average accuracy of all algorithms in Tables 2 and 3. It gains improvements of 10.54% and 8.73%, respectively, over the baseline ELM in Tables 2 and 3, indicating that our method has a better ability of knowledge transfer with the help of CDMA, output weight alignment, and weight approximation; it enriches ELM in both theory and application. (2) TELM-OWA and DAELM, as supervised transfer learning methods that require part of the labeled target samples, are not ideal under unsupervised learning. TCA, JDA, ARRLS, and CdELM-C apply MMD to reduce the distribution discrepancy between the two domains and obtain good results. SSELM utilizes graph regularization to exploit the information of the unlabeled target samples, and it performs well. (3) TCA1(2) and JDA1(2) implement the classification task by combining the transfer feature extraction methods (TCA and JDA) with a baseline classifier; therefore, they outperform 1NN and SVM. ELM performs slightly better than 1NN and SVM due to its good generalization ability.
In Table 4, we report the running times of several compared algorithms and JTELM. It can be seen that (1) ELM has the least running time among all methods since it does not tune the input weights and biases, and (2) DAELM_S and DAELM_T take slightly more time than ELM because part of the target samples participates in the training process. We also evaluate variants of ELM equipped with CDMA, output weight alignment, and weight approximation, and the result in Table 6 shows that CDMA, OWA, and WA help ELM perform better in transfer learning, but they need more time cost.

Parameter Analysis.

We investigate the sensitivity of JTELM to the parameters $\alpha_1$, $\alpha_2$, $c$, and $\lambda$ and to the number of hidden nodes $L$, as shown in Figure 5. (1) The tradeoff parameters for CDMA, output weight alignment, and weight approximation, when adjusted to the appropriate range, can improve the accuracy and the knowledge transfer ability of ELM in the transfer learning mechanism. (2) As shown in Figure 5(b), the accuracy first increases and then slightly decreases on 4 datasets as $L$ grows. When $L$ increases, the nonlinear approximation ability of our network improves; however, for some datasets, a larger $L$ may enlarge the distribution discrepancy of the hidden-layer output data from the two domains, leading to the model's poor performance. (3) We also observe the accuracy varying with the iteration number in Figure 5(f): the accuracy of JTELM gradually becomes stable and finally converges after 10 iterations.

Conclusion
In this paper, we propose JTELM to address the problem that ELM degrades in transfer learning. It first applies CDMA to ELM, and one transfer ELM model is learned. Then, similar to TELM-OWA, it uses output weight alignment and output weight approximation to learn the other transfer ELM on the source domain. Finally, it adopts the two learned transfer ELMs to predict the samples from the target domain. Extensive experiments have been performed on open image and text datasets, and the results show that JTELM has higher accuracy and stronger knowledge transfer ability than several state-of-the-art classifiers.

Figure 1: An illustration of JTELM. (1) CDMA is combined with ELM to minimize the distribution discrepancy of the hidden-layer output data from the two domains. (2) Output weight alignment and approximation enhance the knowledge transfer ability of JTELM. (3) The joint decision of the two ELMs improves the robustness of JTELM.

Table 1: Description of the image and text datasets.

Table 2: Accuracy of different algorithms on the USPS + MNIST and Office + Caltech datasets. Bold values indicate the best result in each row.