TSTELM: Two-Stage Transfer Extreme Learning Machine for Unsupervised Domain Adaptation

As a single-hidden-layer feedforward network (SLFN), the extreme learning machine (ELM) has been successfully applied to classification and regression in machine learning due to its fast training speed and good generalization. However, it performs poorly for domain adaptation, in which the distributions of the training data and the testing data are inconsistent. In this article, we propose a novel ELM called the two-stage transfer extreme learning machine (TSTELM) to solve this problem. At the statistical matching stage, we adopt maximum mean discrepancy (MMD) to narrow the distribution difference of the output layer between domains. At the subspace alignment stage, we align the source and target model parameters, design a target cross-domain mean approximation, and add an output weight approximation to further promote knowledge transfer across domains. Moreover, the prediction of a test sample is jointly determined by the ELM parameters generated at the two stages. Finally, we investigate the proposed approach on classification tasks and conduct experiments on four public domain adaptation datasets. The results indicate that TSTELM effectively enhances the knowledge transfer ability of ELM and achieves higher accuracy than existing transfer and non-transfer classifiers.


Introduction
In the current era of big data, classification models constructed by machine learning can help humans quickly identify and annotate the large amounts of image, text, audio, and signal data rapidly generated by the Internet, sensors, and computers. Mining information from these data helps in understanding the relationships between things. Support vector machines (SVM) [1], k-nearest neighbors (kNN) [2], naive Bayes [3,4], decision trees [5], logistic regression [6], and many other classifiers with high accuracy have appeared and attracted much attention. Huang et al. [7] proposed the extreme learning machine (ELM), a classifier with powerful nonlinear fitting and approximation capabilities [8,9], which has been widely studied and applied in brain-computer interfaces [10,11], medical diagnosis [12,13], fault diagnosis [14], hyperspectral image analysis [15], and other fields.
ELM randomly initializes the input weights and biases and obtains the optimal output weights by solving a least-squares problem [7-9,16]. It has the advantages of fast learning speed and good generalization and has therefore become a hot research topic. Many variants of ELM have been put forward, in both theory and applications, to enhance its performance in different situations. To address the problem that ELM is sensitive to the input weights and biases, Li et al. [17] proposed the WOA-ELM algorithm, which applies the whale optimization algorithm (WOA) to optimize the input weights and biases of ELM and improve its performance. In response to the class imbalance problem, the weighted extreme learning machine (WELM) [18-20] was proposed, in which different weights are assigned to each training sample based on two different strategies. SMOTE-based class-specific extreme learning machine (SMOTE-CSELM) [21] was also presented, exploiting the benefits of both minority oversampling and class-specific regularization. To improve the generalization power and prevent overtraining of ELM, some methods combine ensemble learning with it to improve robustness, including the voting-based extreme learning machine (V-ELM) [22,23], AdaBoost extreme learning machine [24-26], and the extreme ensemble of ELMs (EEoELMs) [27]. Moreover, influenced by deep learning, ELMs with deep structures have emerged in pursuit of higher accuracy. ML-ELM [28,29] was presented to resolve the time-consuming training of deep learning and achieved faster speed and higher generalization than stacked autoencoders, deep belief networks, and deep Boltzmann machines. The hierarchical ELM (H-ELM) [30,31] was proposed to enhance the universal approximation capability of ELM. The kernel-based multilayer ELM (ML-KELM) [32] integrated the kernel learning technique into ML-ELM and achieved a faster learning speed and better recognition performance. Although the above ELM models have achieved great success in classification and regression tasks, they degrade when the training samples and test samples are drawn from different domains with different distributions (i.e., cross-domain tasks).
To handle this problem, domain adaptation (DA) [33-35], an important branch of transfer learning, has attracted wide attention; an efficient classifier is obtained with the help of knowledge from a source domain that is different from, but related to, the target domain. L. Zhang and D. Zhang [36] put forward the domain adaptation extreme learning machine (DAELM) framework by extending ELM to handle domain adaptation problems for gas identification and drift compensation in E-nose systems. The adaptive ELM (AELM) [37] was proposed by introducing a manifold regularization term into ELM for image classification. Zang et al. [38] proposed a supervised extreme learning machine called transfer extreme learning machine with output weight alignment (TELM-OWA), which aligns the output weight matrices of the ELMs between domains and adds an approximation between the inter-domain ELM parameters for knowledge transfer. However, these approaches address semi-supervised domain adaptation problems because they require a few labeled samples from the target domain. Due to the high cost of collecting and labeling samples, cross-domain ELM (CDELM) [39], domain space transfer ELM (DST-ELM) [40], cross-domain extreme learning machine (CdELM) [41], and extreme learning machine based on maximum weighted mean discrepancy (ELM-MWMD) [42] have been proposed for unsupervised domain adaptation by minimizing the classification loss and applying the maximum mean discrepancy (MMD) strategy to the prediction results. Among the above methods, the supervised ELM models usually outperform the unsupervised ones with the help of a few labeled samples from the target domain.
In this article, inspired by the pioneering works [38,42], we propose a novel method denoted as the two-stage transfer extreme learning machine (TSTELM), which involves two stages of domain adaptation: statistical matching and subspace alignment. At the statistical matching stage, we first learn a domain adaptation ELM classifier by utilizing MMD to simultaneously minimize the marginal and conditional distribution discrepancy between domains. At the subspace alignment stage, we use a transformation matrix to align the output weights of the inter-domain ELM models, put forward target cross-domain mean approximation, and add an output weight approximation term into the objective function, from which we obtain another domain adaptation ELM. Finally, we fuse the DAELM parameters from the two stages to predict the labels of test samples. TSTELM is illustrated in Figure 1. Extensive experiments have been conducted on real-world image and text datasets, and the results verify that our approach outperforms several existing domain adaptation and non-domain adaptation methods.
In this article, TSTELM realizes knowledge transfer at two stages, and its contributions are summarized as follows: (1) Similar to [39], our method uses MMD, which has been proven to be a general measure of statistical distribution discrepancy, to minimize the marginal and conditional distribution discrepancy of the hidden-layer outputs of the ELMs from the two domains, which effectively extends ELM to unsupervised domain adaptation. Therefore, we can obtain one DAELM. (2) Based on the first DAELM and inspired by [42], we introduce output weight alignment, design target cross-domain mean approximation, and add an output weight approximation constraint into the traditional ELM to enhance knowledge transfer across domains. Hence, we can learn the other DAELM. It is worth emphasizing that we present target cross-domain mean approximation, following [35], to adapt the distribution of the target domain for consistency with the source domain. (3) At the prediction stage, the above two DAELMs jointly determine the category of test samples. We evaluate the proposed method with classification experiments on object recognition and text datasets, and the results verify its effectiveness and advantages. (4) Compared with other state-of-the-art DAELMs, our approach has some distinct properties: (a) several techniques, including MMD, output weight alignment, output weight approximation, and target cross-domain mean approximation, are jointly utilized to realize efficient knowledge transfer across domains at two stages; (b) output weight alignment organically bridges the DAELMs from the statistical matching stage and the subspace alignment stage; (c) the joint decision of the two DAELMs gives our approach robustness and high accuracy.

The rest of this article is organized as follows: In Section 2, we briefly review domain adaptation and ELM. We then present the proposed TSTELM in Section 3. In Section 4, the experiments and analysis are presented. Finally, Section 5 concludes this article.

In the past decades, many studies have been conducted to address domain adaptation problems in classification tasks, and they are mainly divided into three categories [35,43]: (1) Sample-based adaptation. It directly assigns weights to each sample of the two domains so as to adapt and minimize the distribution gap between domains. Representative approaches include PRDA [44], TrAdaBoost [45], and Kernel Mean Matching (KMM) [46]. (2) Feature-based adaptation. It seeks a shared subspace between domains in which the distribution discrepancy is alleviated and knowledge is more easily transferred across domains. Transfer component analysis (TCA) [47] and joint distribution adaptation (JDA) [48] take the MMD metric as an objective function to find an optimal projection matrix for a shared low-dimensional subspace. Liang et al. [49] designed a relaxed domain-irrelevant class-clustering (DICE) term and combined it with MMD to obtain a domain-irrelevant projection that reduces the distribution discrepancy between domains. Moreover, DICE was extended to ensemble learning with multiple projections obtained from sampled subsets of the source and target domains, which helps it achieve better performance. Progressive learning with Confidence-wEighted Targets (PACET) [50] improved DICE by adding a confidence-weighting strategy based on the posterior probability of target instances. (3) Classifier-based (or parameter-based) adaptation. Its purpose is to find an optimal classifier, or classifier parameters, with good generalization ability between the source and target domains.
Yang et al. [51] presented the adaptive support vector machine (Adapt-SVM), which designs a regularizer to minimize the discrepancy between the parameters of two classifiers trained on source and target labeled samples and adds it into the SVM objective function. Multi-model knowledge transfer (Multi-KT) [52], following the idea of Adapt-SVM, constructs a regularizer that forces the parameters of the target SVM to be close to those of multiple weighted source SVMs. In multiple kernel learning, Wang et al. [53] introduced multiple-kernel MMD into the objective function to adapt the distribution discrepancy between training and test samples, which prevents the performance degradation caused by inconsistent dataset distributions and simultaneously yields a multiple kernel classifier with strong generalization ability. Recently, deep network adaptation and adversarial learning adaptation have become successful in computer vision and machine learning. Based on the assumption that samples of the same category are close to each other and that the local geometric property of the data can be maintained in a neural embedding subspace, Wang et al. [54] proposed neural embedding matching (NEM), which reduces the cross-domain distribution divergence by projecting the source and target domains into a common subspace using a deep neural network embedding model. In [55], a deep neural network with weighted MMD and manifold embedding was proposed to handle domain adaptation for hyperspectral image classification. To address the problem of unsupervised partial domain adaptation (PDA), Liang et al. [56] put forward a domain adversarial neural network called BA3US, which presents balanced adversarial alignment (BAA) and adaptive uncertainty suppression (AUS) to overcome the negative transfer and uncertainty propagation that usually appear in PDA.

Figure 1: The framework of TSTELM, which consists of the statistical matching stage, the subspace alignment stage, and fusion decision.

A Brief Review of Domain Adaptation and ELM
Among the abovementioned approaches, sample-based adaptation methods are the most efficient ones for knowledge transfer because they directly utilize the source samples, while feature-based adaptation methods are the most widely applied. Classifier-based (or parameter-based) adaptation has great potential because past related domain knowledge or experience is integrated into the shared parameters of the classifier. Deep network adaptation and adversarial learning adaptation strictly belong to feature-based adaptation, but they can extract deep domain-invariant features with strong discriminative power. However, these methods also have their own shortcomings. In sample-based adaptation methods, finding an effective mechanism to evaluate sample importance is a challenge. Obtaining generic shared features from different domains is also difficult for feature-based adaptation methods. Since the useful information and knowledge from the auxiliary domain is not applied to the target domain directly, classifier-based (or parameter-based) adaptation is less efficient than the former two. Deep network adaptation usually needs massive labeled samples and sufficient computing resources to train a deep model, which can hinder its application. Class misalignment and the simultaneous effectiveness of the feature extractor and the discriminator are challenges for adversarial learning adaptation. Our approach belongs to classifier-based (or parameter-based) adaptation; it seeks two output weights of shared ELM models across domains for knowledge transfer.
In this article, we propose TSTELM to address problems in unsupervised domain adaptation, in which the training data come from a source domain with labeled samples and the test data come from a target domain with unlabeled samples. Suppose the source domain dataset is denoted as $D_S = \{(x_{S_i}, y_i)\}_{i=1}^{n_S}$ and the target domain dataset is denoted as $D_T = \{x_{T_j}\}_{j=1}^{n_T}$, where n_S and n_T represent the numbers of source and target samples, respectively. The source data and the target data belong to the same feature space, X_S = X_T, and the same label space, Y_S = Y_T. The data distributions of the source and target domains are related but different, that is, the marginal distributions P(X_S) ≠ P(X_T) and the conditional distributions P(Y_S|X_S) ≠ P(Y_T|X_T). In TSTELM, we hope to construct an ELM model that obtains high accuracy on the target domain data. Table 1 summarizes other related notations used in domain adaptation problems.

Extreme Learning Machine.
Unlike conventional feedforward neural networks, ELM has two characteristics: (1) the hidden layer parameters (i.e., the input weights and biases) are randomly initialized; (2) the output layer weights are solved as a least-squares problem. As a result, it yields faster learning speed and better generalization performance compared with other classifiers.
Given a training set $\{(x_i, y_i)\}_{i=1}^{N}$ with N samples, where $y_i$ is the label corresponding to $x_i$ and C is the number of categories, the structure of an ELM with L hidden nodes and activation function h(x) is shown in Figure 2. In Figure 2, $x_i$ is the input sample, w is the input layer weight, b is the hidden layer bias, h(·) is the nonlinear activation function of the hidden layer, L is the number of hidden nodes, and $\beta_j$ is the output weight of the j-th hidden node. The outputs of the network are given by
$$f(x_i) = \sum_{j=1}^{L} \beta_j\, h(w_j \cdot x_i + b_j), \quad i = 1, \ldots, N. \quad (1)$$
The above formula can be written in matrix form as
$$H\beta = Y, \quad (2)$$
where $H \in \mathbb{R}^{N \times L}$ is the hidden-layer output matrix with entries $H_{ij} = h(w_j \cdot x_i + b_j)$, $\beta = [\beta_1^T; \ldots; \beta_L^T] \in \mathbb{R}^{L \times C}$, and $Y = [y_1^T; \ldots; y_N^T] \in \mathbb{R}^{N \times C}$.

By adopting parameter regularization, ELM can avoid the overfitting problem. Its corresponding objective function can be formulated as
$$\min_{\beta}\ \frac{1}{2}\|\beta\|^2 + \frac{\theta}{2}\|H\beta - Y\|^2, \quad (3)$$
where θ is a penalty constant on the training errors and ‖·‖ denotes the L2-norm of a matrix or a vector. The minimization of equation (3) is a regularized least-squares problem. By setting the gradient of equation (3) with respect to β to zero, we have
$$\beta + \theta H^{T}(H\beta - Y) = 0. \quad (4)$$
The output weight vector β is then obtained according to the Moore-Penrose principle. If N > L, the optimal solution of equation (3) is
$$\beta = \left(H^{T}H + \frac{I_L}{\theta}\right)^{-1} H^{T} Y, \quad (5)$$
where $I_L$ is the L-dimensional identity matrix.

If N ≤ L, the optimal solution of equation (3) is
$$\beta = H^{T}\left(HH^{T} + \frac{I_N}{\theta}\right)^{-1} Y, \quad (6)$$
where $I_N$ is the N-dimensional identity matrix.
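For concreteness, the following is a minimal NumPy sketch of the regularized ELM described above: a random hidden layer followed by the closed-form output weights of equations (5) and (6). The function names, the tanh activation, and the one-hot label encoding are illustrative assumptions, not the authors' implementation (the experiments in this article were run in MATLAB).

```python
import numpy as np

def train_elm(X, Y, L=500, theta=1.0, seed=0):
    """Regularized ELM (Eqs. (3), (5)-(6)): random hidden layer plus
    ridge-style closed-form output weights.
    X: (N, d) feature matrix, Y: (N, C) one-hot target matrix."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = rng.standard_normal((d, L))          # random input weights
    b = rng.standard_normal((1, L))          # random hidden biases
    H = np.tanh(X @ w + b)                   # hidden-layer output matrix, (N, L)
    if N > L:                                # Eq. (5)
        beta = np.linalg.solve(H.T @ H + np.eye(L) / theta, H.T @ Y)
    else:                                    # Eq. (6)
        beta = H.T @ np.linalg.solve(H @ H.T + np.eye(N) / theta, Y)
    return w, b, beta

def predict_elm(X, w, b, beta):
    """Class prediction: argmax over the C output nodes."""
    return np.argmax(np.tanh(X @ w + b) @ beta, axis=1)
```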

Proposed Methods
In this section, the overall architecture of the proposed TSTELM is introduced in detail. As shown in Figure 1, TSTELM consists of three parts: the statistical matching stage, the subspace alignment stage, and prediction based on output weight fusion.

Statistical Matching Stage.
At the statistical matching stage, we hope to obtain an ELM for domain adaptation using the labeled samples of the source domain and the unlabeled samples of the target domain. First, the source data and target data are mapped into the hidden-layer space of the ELM, yielding $H_S = \{h_{S_i}\}_{i=1}^{n_S}$ with $h_{S_i} = h(x_{S_i} w + b)$ (paired with the labels $Y_S$) and $H_T = \{h_{T_j}\}_{j=1}^{n_T}$, where $n = n_S + n_T$, $w \in \mathbb{R}^{d \times L}$ and $b \in \mathbb{R}^{1 \times L}$ are the randomly generated input weights and biases, and d is the original feature dimension of the data.
For the labeled source data, we can learn an ELM, that is,
$$\min_{\beta_S}\ \frac{1}{2}\|\beta_S\|^2 + \frac{\theta}{2}\|H_S\beta_S - Y_S\|^2, \quad (7)$$
where β_S is the output weight of the ELM learned on (H_S, Y_S). Since equation (7) only learns an ELM classifier from the labeled source samples, it cannot perform well on the target domain because of the distribution difference between the source and target domains. Therefore, we adopt MMD between H_S and H_T to reduce the marginal and conditional distribution differences between domains [42]. The MMD minimization is formulated as
$$\min_{\beta}\ \mathrm{tr}\!\left(\beta^{T} H^{T}\Big(M_0 + \sum_{c=1}^{C} M_c\Big) H\beta\right), \quad (8)$$
where $H = [H_S^T, H_T^T]^T$ stacks the hidden-layer outputs of the two domains, and M_0 and M_c are the MMD matrices, which are defined as follows:
$$(M_0)_{ij} = \begin{cases} 1/n_S^2, & x_i, x_j \in D_S \\ 1/n_T^2, & x_i, x_j \in D_T \\ -1/(n_S n_T), & \text{otherwise} \end{cases} \quad (9)$$
$$(M_c)_{ij} = \begin{cases} 1/(n_S^{(c)})^2, & x_i, x_j \in D_S^{(c)} \\ 1/(n_T^{(c)})^2, & x_i, x_j \in D_T^{(c)} \\ -1/(n_S^{(c)} n_T^{(c)}), & x_i \in D_S^{(c)}, x_j \in D_T^{(c)} \text{ or } x_j \in D_S^{(c)}, x_i \in D_T^{(c)} \\ 0, & \text{otherwise} \end{cases} \quad (10)$$
where $D_S^{(c)}$ and $D_T^{(c)}$ denote the sets of samples belonging to class c in D_S and D_T, respectively, $n_S^{(c)}$ and $n_T^{(c)}$ are their sample numbers, and the target class memberships are given by pseudo labels predicted in the previous iteration. Here, replacing β_S and β with β_1 and incorporating equations (7) and (8), we can obtain the DAELM at the statistical matching stage, whose objective function is
$$\min_{\beta_1}\ \frac{1}{2}\|\beta_1\|^2 + \frac{\theta}{2}\|E(H\beta_1 - Y)\|^2 + \frac{\alpha}{2}\,\mathrm{tr}\!\left(\beta_1^{T} H^{T}\Big(M_0 + \sum_{c=1}^{C} M_c\Big) H\beta_1\right). \quad (11)$$
By setting the gradient of equation (11) with respect to β_1 to zero, we have
$$\beta_1 = \left(I_L + \theta H^{T}EH + \alpha H^{T}\Big(M_0 + \sum_{c=1}^{C} M_c\Big)H\right)^{-1}\theta H^{T}E\,Y, \quad (12)$$
where Y stacks the source label matrix Y_S with zero rows for the target samples, and E is a diagonal label indicator matrix with each element E_ii = 1 if x_i ∈ D_S and E_ii = 0 otherwise.
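The following NumPy sketch shows how the MMD matrices of equations (9) and (10) and the closed-form weight β_1 of equation (12), as reconstructed above, could be computed. The helper names, the zero-padded label matrix Y, and the use of pseudo labels for the target class memberships are assumptions made for illustration.

```python
import numpy as np

def mmd_matrix(n_s, n_t, y_s=None, y_t_pseudo=None, n_classes=None):
    """Build M = M_0 + sum_c M_c (Eqs. (9)-(10)). y_s / y_t_pseudo are integer
    label vectors; target pseudo labels may come from the previous iteration."""
    n = n_s + n_t
    e = np.concatenate([np.full(n_s, 1.0 / n_s), np.full(n_t, -1.0 / n_t)])
    M = np.outer(e, e)                        # marginal MMD matrix M_0
    if y_s is not None and y_t_pseudo is not None:
        for c in range(n_classes):            # conditional MMD matrices M_c
            e_c = np.zeros(n)
            src = np.where(y_s == c)[0]
            tgt = n_s + np.where(y_t_pseudo == c)[0]
            if len(src) and len(tgt):
                e_c[src] = 1.0 / len(src)
                e_c[tgt] = -1.0 / len(tgt)
                M += np.outer(e_c, e_c)
    return M

def solve_beta1(H, Y, E, M, theta=1.0, alpha=1.0):
    """Closed form of Eq. (12): beta_1 = (I_L + theta H^T E H + alpha H^T M H)^{-1}
    theta H^T E Y, with E the diagonal source indicator and Y zero-padded for
    the target rows."""
    L = H.shape[1]
    A = np.eye(L) + theta * H.T @ E @ H + alpha * H.T @ M @ H
    return np.linalg.solve(A, theta * H.T @ E @ Y)
```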

Subspace Alignment Stage.
At the subspace alignment stage, we train a DAELM on the labeled samples of the source domain and the unlabeled samples of the target domain.
For the target samples, we can learn an ELM from the objective in equation (13), in which, inspired by [35], we introduce cross-domain mean approximation to replace the prediction loss. When there are no labeled samples in the target domain (i.e., when c = 0), we force the target data H_T close to the source data mean point H_S_av, which has been shown to promote domain adaptation [35]. Once the target samples obtain pseudo labels, each of them is drawn to the source data mean point of the same category, H_S_av^(c). To further improve cross-domain knowledge transfer, similar to [38], we introduce a transformation matrix M to align the output weights of the ELMs between the source domain and the target domain. The alignment function is established as
$$\min_{M}\ \|\beta_1 M - \beta_T\|_F^2, \quad (14)$$
where ‖·‖_F is the Frobenius norm. Equation (14) is invariant to the orthogonalization operation, so it can be rewritten as
$$\min_{M}\ \|M - \beta_1^{T}\beta_T\|_F^2. \quad (15)$$
Then, we can get the optimal $M^* = \beta_1^{T}\beta_T$. Letting $\beta_a = \beta_1 M^* = \beta_1\beta_1^{T}\beta_T$, we know that β_a is closer to β_T than β_1 is, which facilitates cross-domain knowledge transfer.
To align the output layer of the source ELM with that of the target ELM, we combine the training error ‖H_S β_a − Y_S‖², equation (13), and a regularization term, and replace β_1 with β_a to obtain the objective in equation (16), where ‖β_T − β_a‖² is a parameter approximation term that facilitates knowledge transfer and prevents negative transfer, and λ and γ are the balance parameters. We then substitute β_a = β_1β_1^Tβ_T into equation (16) to obtain equation (17). After simplification (equations (18) and (19)), setting the gradient with respect to β_T to zero yields the closed-form output weight β_2 given in equation (20).
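A small sketch of two ingredients of this stage, under the same assumptions as above: the output weight alignment β_a = β_1β_1^Tβ_T of equations (14)-(15), and a hypothetical helper that builds the class-wise source mean matrix H_S_av used by the cross-domain mean approximation. The closed form of β_2 in equation (20) is not reproduced here.

```python
import numpy as np

def align_output_weights(beta1, betaT):
    """Output weight alignment (Eqs. (14)-(15)): M* = beta1^T betaT, and
    beta_a = beta1 @ M* = beta1 @ beta1.T @ betaT, which moves the source
    output weights toward the target ones."""
    return beta1 @ (beta1.T @ betaT)

def class_mean_hidden(H_src, y_src, y_tgt_pseudo, n_classes):
    """Hypothetical helper for the cross-domain mean approximation: assign to
    every target sample the hidden-layer mean of the source samples of its
    (pseudo) class; without pseudo labels, the overall source mean is used."""
    H_S_av = np.tile(H_src.mean(axis=0), (len(y_tgt_pseudo), 1))
    for c in range(n_classes):
        idx_s = np.where(y_src == c)[0]
        idx_t = np.where(y_tgt_pseudo == c)[0]
        if len(idx_s) and len(idx_t):
            H_S_av[idx_t] = H_src[idx_s].mean(axis=0)
    return H_S_av
```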

Prediction Based on Output Weight Fusion.
In the classification task, a test sample x_Te is given. After β_1 and β_2 are obtained, the output weight of the final ELM model is determined by β* = β_2 + pβ_1, and the classification result of x_Te is obtained as
$$y_{Te} = \arg\max_{1 \le c \le C}\ \big(h_{Te}\,\beta^{*}\big)_c, \quad (21)$$
where h_Te = h(x_Te) is the hidden-layer output of x_Te and p is a scale factor balancing β_1 and β_2. TSTELM is summarized in Algorithm 1.
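For illustration, a minimal sketch of the fusion and prediction step, consistent with equation (21) and the helper sketches above (the function name is an assumption):

```python
import numpy as np

def fused_prediction(H_test, beta1, beta2, p=0.5):
    """Eq. (21): fuse the two output weights, beta* = beta2 + p*beta1, and take
    the argmax over the C output nodes as the predicted class."""
    return np.argmax(H_test @ (beta2 + p * beta1), axis=1)
```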

Discussion.
In order to solve the problem that the traditional ELM does not perform well in cross-domain tasks, we propose TSTELM, whose objective functions are given in equations (8) and (17). It can be seen that: (1) Compared with the classical ELM, TSTELM reduces the distribution difference between domains and transfers knowledge across domains by adopting MMD, output weight alignment, parameter approximation, and the target cross-domain mean approximation term. (2) Although TELM-OWA proposed by Zang et al. [38] also applies output weight alignment and parameter approximation for domain adaptation, it is a supervised domain adaptation algorithm requiring a few labeled target samples, unlike TSTELM. In addition, TSTELM replaces the target prediction loss with the target cross-domain mean approximation term, which is different from TELM-OWA.

The time complexity of computing the output weight of a standard ELM on the source domain (equations (5) and (6)) is
O(n_S³ + 2L n_S² + mL n_S) when n_S < L, and O(L³ + L² n_S + mL n_S) when n_S > L. (22)
According to Algorithm 1, the main computation cost of our method is in steps 3 and 4.
In step 3, we need to compute M_0, the M_c matrices, and β_1 in each iteration, where T is the number of iterations. In step 4, the output weight is determined according to equation (20), and Q has the same size as H. Therefore, the time complexity of step 4 is
O(TL³ + TL²N + TCLN) when N > L. (24)
Given that TELM-OWA also computes its output weight in two stages, its time complexity in the first stage is
O(n_S³ + 2L n_S² + mL n_S) when n_S < L, and O(L³ + L² n_S + mL n_S) when n_S > L. (25)
The time complexity of TELM-OWA in the second stage is as shown in equation (26). The above analysis indicates that the computational complexity of TSTELM is higher than those of ELM and TELM-OWA.

Algorithm 1: TSTELM.
Input: Source domain D_S, source labels Y_S, target domain D_T, maximum number of iterations T.
(1) Randomly initialize the input weights w and biases b of the ELM network with L hidden nodes; set the trade-off parameters p, α, θ, and λ.
(2) Calculate the matrices H_S and H_T; obtain M_0 using equation (9).
(3) Compute the optimal weights β_1 by equation (12).
(4) Calculate the matrix H_S_av and compute the optimal weights β_2 using equation (20).
(5) Update the class-wise mean matrices H_S_av^(c) according to the current target pseudo labels.
(6) Compute the optimal weights β* = β_2 + pβ_1 and obtain the prediction Y_T of H_T using equation (21).
(7) Repeat steps 2-6 until the number of iterations reaches T or Y_T no longer changes.
Output: The output weight β* and the predicted output Y_T.
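To tie the pieces together, below is an illustrative skeleton of the iterative procedure in Algorithm 1, reusing the helper sketches given earlier (mmd_matrix, solve_beta1, class_mean_hidden, fused_prediction). The function solve_beta2 is deliberately left as a caller-supplied argument standing in for the closed form of equation (20), which is not reproduced in this text; everything else follows steps 2-7 of Algorithm 1 under the assumptions stated above.

```python
import numpy as np

def tstelm_iterate(H_S, Y_S, y_S, H_T, n_classes, solve_beta2,
                   T=10, p=0.5, theta=1.0, alpha=1.0):
    """Skeleton of Algorithm 1. H_S: (n_S, L) source hidden outputs,
    Y_S: (n_S, C) one-hot source labels, y_S: (n_S,) integer source labels,
    H_T: (n_T, L) target hidden outputs, solve_beta2: callable implementing
    the closed form of Eq. (20)."""
    n_S, n_T = H_S.shape[0], H_T.shape[0]
    H = np.vstack([H_S, H_T])
    Y = np.vstack([Y_S, np.zeros((n_T, Y_S.shape[1]))])          # zero-padded targets
    E = np.diag(np.concatenate([np.ones(n_S), np.zeros(n_T)]))   # label indicator
    y_T = None
    for _ in range(T):
        # steps 2-3: MMD matrix (conditional part only once pseudo labels exist)
        M = (mmd_matrix(n_S, n_T, y_S, y_T, n_classes)
             if y_T is not None else mmd_matrix(n_S, n_T))
        beta1 = solve_beta1(H, Y, E, M, theta=theta, alpha=alpha)
        # steps 4-5: class-wise source means and the second output weight
        y_guess = np.argmax(H_T @ beta1, axis=1) if y_T is None else y_T
        H_S_av = class_mean_hidden(H_S, y_S, y_guess, n_classes)
        beta2 = solve_beta2(H_S, Y_S, H_T, H_S_av, beta1)
        # step 6: fuse the two output weights and predict the target labels
        y_new = fused_prediction(H_T, beta1, beta2, p=p)
        if y_T is not None and np.array_equal(y_new, y_T):       # step 7 stop rule
            break
        y_T = y_new
    return beta2 + p * beta1, y_T
```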

Experiment and Analysis
In this section, experiments are conducted on four cross-domain datasets, including Office + Caltech object recognition, USPS and MNIST handwritten digits, MSRC and VOC2007 object recognition, and the Reuters-21578 text dataset, for classification; the image datasets are described in Table 2. We compare our approach with several related unsupervised classification methods as well as semi-supervised and unsupervised domain adaptation methods. To be objective, the experiments are implemented on a PC with 8 GB memory, the Windows 10 operating system, and MATLAB 2017b. Every experiment runs 20 times and the average value is recorded. We adopt the accuracy rate to evaluate the performance of every algorithm:
Accuracy = (number of correctly classified samples / total number of samples) × 100%.

Dataset Description.
Office + Caltech256 (shown in Figure 3): Office is a widely used dataset for visual cross-domain learning, which contains 4,652 images in 31 categories. These images come from three real-world object datasets: Amazon (images downloaded from the online merchant https://www.amazon.com), DSLR (high-resolution images taken by a digital SLR camera in realistic environments), and Webcam (low-resolution images taken by a simple webcam). Caltech256 is also a standard object recognition dataset, which contains 30,607 images from 256 categories. In this article, we employ the Office + Caltech dataset released by Gong et al. [57]. SURF features are extracted and quantized into an 800-bin histogram with codebooks computed by K-means on a subset of images from Amazon. Then, the histograms are standardized by z-score. We select four domains, C (Caltech256), A (Amazon), W (Webcam), and D (DSLR), for the experiments; two different domains are selected as the source and target domain datasets, and 12 cross-domain tasks are constructed for evaluation, namely C ⟶ A, C ⟶ W, C ⟶ D, . . ., and D ⟶ W.

USPS + MNIST (shown in Figure 4): USPS and MNIST are two different but related handwritten digit datasets with 10 categories (0-9). The USPS dataset contains 7,291 training samples and 2,007 test samples with 16 × 16 pixels. There are 60,000 training images and 10,000 test images with 28 × 28 pixels in the MNIST database. In this experiment, we randomly select 1,800 pictures from USPS and 2,000 pictures from MNIST and convert them into 16 × 16 pixels. We construct two cross-domain tasks, that is, USPS as the source domain and MNIST as the target domain (USPS vs MNIST) and vice versa (MNIST vs USPS).
MSRC + VOC2007 (shown in Figure 5): The MSRC dataset is provided by Microsoft Research Cambridge and consists of 4,323 images of 18 object classes. The VOC2007 dataset contains 5,011 images annotated with 20 concepts. We can see from Figure 5 that MSRC and VOC2007 follow related but different distributions: MSRC consists of standard images released as benchmark data for evaluation, whereas VOC2007 was constructed from images randomly collected from web albums.
In our experiments, we construct the domain adaptation dataset MSRC vs VOC, in which six shared categories are selected: aircraft, birds, cows, family cars, sheep, and bicycles. Among them, 1,269 images are selected from the MSRC dataset as the source domain and 1,530 images are selected from the VOC2007 dataset as the target domain. Then, the source domain and the target domain are exchanged to construct a new domain adaptation dataset, VOC vs MSRC. We convert all images to 256-level grayscale and extract 240-dimensional features as the sample representation.
Reuters-21578: The Reuters-21578 text dataset is a common dataset for text classification. It contains 21,577 news documents from Reuters in 1987. These documents have been manually labeled by Reuters into five top categories, namely "exchanges," "orgs," "people," "places," and "topics," each including multiple categories and subcategories. Among them, the three largest categories are "orgs," "people," and "place," which can be used to construct six cross-domain text classification tasks: orgs vs people, people vs orgs, orgs vs place, place vs orgs, people vs place, and place vs people.
This article evaluates the algorithm on these six cross-domain classification tasks.

Experimental Settings.
To validate the efficiency of TSTELM, we compare it with some other classifiers.

Experimental Results and Analysis.
We test TSTELM on the Office + Caltech256, USPS + MNIST, MSRC + VOC2007, and Reuters-21578 datasets, and the comparison results are displayed in Tables 3 and 4 (Table 3: accuracy of different algorithms on the Office + Caltech256 and USPS + MNIST datasets; the bold values in Table 3 are the best results in each column).

We also check the execution times of several methods on MNIST vs USPS, and the results are reported in Table 5. It can be seen that: (1) The methods based on ELM are significantly faster than the other methods, and ELM is the fastest. (2) TSTELM consumes more time than TELM-OWA, ELM, SSELM, DAELM_S, and DAELM_T because of its iterative label refinement process; TELM-OWA is more time-consuming than ELM, SSELM, DAELM_S, and DAELM_T as a result of solving β*_S and β*_T. (3) Since constructing the Laplacian matrix is the most time-consuming step, SSELM is relatively inefficient. (4) TCA(1,2) and JDA(1,2) cost more time than 1NN and SVM because of the additional feature extraction process. (5) JDA(1,2) has the highest time cost because it refines the target pseudo labels and extracts cross-domain shared features in an iterative manner.

Parameter Analysis.
To evaluate the effects of the scale factor (p), the number of hidden layer nodes (L), and the parameters α, λ, and θ on TSTELM, we conduct experiments on orgs vs people, MSRC vs VOC, MNIST vs USPS, and A vs D. The results are shown in Figures 9(a)-9(f) (Figure 9: classification accuracy of TSTELM with respect to the scale factor (p), the number of hidden layer nodes (L), the parameters α, λ, and θ, and the number of iterations). It can be seen that: (1) With the increase of p, the TSTELM accuracy first goes up and then goes down on all test datasets and achieves optimal results when p ∈ [0.1, 1], as shown in Figure 9(a). This indicates that the joint decision of β_1 and β_2 is better than their separate decisions. (2) As shown in Figure 9(b), the TSTELM accuracy first increases and then decreases with L on all test datasets. Although a larger network allows the ELM to approximate the output function better, the time cost of the algorithm and the required memory become large, and too many hidden nodes hurt the domain adaptation performance of ELM because the network fits the output function too closely. (3) In Figures 9(c)-9(e), with the gradual increase of the parameters α, λ, and θ, the accuracy first increases and then decreases, and different optimal values are taken on different test datasets, which indicates that the terms controlled by these parameters are beneficial to TSTELM when the parameter values are reasonable. (4) We also provide the classification accuracy varying with the iteration number, and the result is shown in Figure 9(f). It shows that the accuracy increases with the number of iterations and finally converges after several iterations, which verifies that TSTELM is robust.

Conclusion
To handle the problem that the traditional ELM does not perform well in unsupervised domain adaptation, in this article we propose TSTELM, which includes two domain adaptation stages. At the statistical matching stage, MMD is introduced into the ELM learning framework to simultaneously minimize the marginal and conditional distribution discrepancy between domains. At the subspace alignment stage, a subspace alignment strategy, cross-domain mean approximation, and output weight approximation are adopted to further adjust the distribution consistency between domains. Finally, the parameters of the ELM models learned at the two stages are fused and used to predict test samples. Extensive experiments have been conducted on real-world image and text datasets, and the results show that TSTELM achieves higher accuracy and better generalization performance. In the future, we will investigate improving TSTELM by stacking it into a deep-structured model to extract deep features.

Data Availability
The data used to support the findings of this study can be found at https://github.com/jindongwang/transferlearning/blob/master/data/dataset.md.

Conflicts of Interest
The authors declare that they have no conflicts of interest.