Transfer Extreme Learning Machine with Output Weight Alignment

Extreme Learning Machine (ELM) is a fast and efficient neural network model for pattern recognition and machine learning, but its performance declines when labeled training samples are insufficient. Transfer learning helps the target task learn a reliable model by exploiting plentiful labeled samples from a different but related domain. In this paper, we propose a supervised Extreme Learning Machine with knowledge transferability, called Transfer Extreme Learning Machine with Output Weight Alignment (TELM-OWA). Firstly, it reduces the distribution difference between domains by aligning the output weight matrices of the ELMs trained on the labeled samples from the source and target domains. Secondly, an approximation term between the interdomain ELM output weight matrices is added to the objective function to further realize cross-domain knowledge transfer. Thirdly, we formulate the objective function as a least-squares problem and transform it into a standard ELM model that can be solved efficiently. Finally, the effectiveness of the proposed algorithm is verified by classification experiments on 16 image datasets and 6 text datasets, and the results demonstrate the competitive performance of our method with respect to other ELM models and transfer learning approaches.


Introduction
Neural networks, which have powerful nonlinear fitting and approximation capabilities, have been widely researched for classification problems in recent years [1,2]. Extreme Learning Machine (ELM), a Single-Layer Feedforward Network (SLFN), has been proven to be an effective and efficient algorithm for pattern classification and regression [3,4]. It randomly generates the input weights and biases of the hidden layer without tuning and only updates the weights between the hidden layer and the output layer. With regularized least squares (or ridge regression) as the prediction error, the output weights can be obtained efficiently in closed form via the Moore-Penrose generalized inverse [3]. As a result, ELM has strong generalization ability and fast training speed, and it has therefore been widely used in various applications, such as face recognition [5], brain-computer interfaces [6][7][8][9], hyperspectral image classification [10], and malware hunting [11].
Although the learning speed and generalization ability of ELM are significant advantages, the model still has several shortcomings, and many algorithms have been put forward to improve it in both theory and applications. Since ELM can be highly affected by the random selection of the input weights and biases of the SLFN, Eshtay et al. [12] proposed a new model that uses a Competitive Swarm Optimizer (CSO) to optimize the values of the input weights and hidden neurons of ELM. For imbalanced data classification, Raghuwanshi and Shukla [13] presented a novel SMOTE-based Class-Specific Extreme Learning Machine (SMOTE-CSELM), a variant of the Class-Specific Extreme Learning Machine (CS-ELM), which exploits the benefits of both minority oversampling and class-specific regularization and has computational complexity comparable to the Weighted Extreme Learning Machine (WELM) [14]. To reduce storage space and test time, the Sparse Extreme Learning Machine (Sparse ELM) [15] and the multilayer sparse Extreme Learning Machine [16] were proposed for classification. To overcome the bias problem of a single Extreme Learning Machine, Voting-based Extreme Learning Machine (V-ELM) [17,18] and AdaBoost Extreme Learning Machine [19][20][21] were proposed to reduce the risk of selecting a wrong model by aggregating all candidate models. Moreover, some semisupervised ELM [22][23][24][25] and unsupervised ELM [26][27][28] algorithms were designed to utilize the large number of available unlabeled samples to improve the performance of ELM for classification and clustering. However, the above models rest on the typical assumption that the training and testing data are sampled from an identical distribution [29], which may not hold in many real-world settings; the performance of ELM then degrades for lack of sufficient identically distributed training samples, and labeling new samples is expensive and time-consuming [30].
Domain adaptation [31][32][33], as an important branch of transfer learning, solves the above problems with the help of knowledge from a source domain that is different from but related to the target domain, resolving the inconsistency of sample distributions between the source and target domains. Zhang and Zhang [34] extended ELM to handle domain adaptation problems with very few labeled guide samples in the target domain, overcoming the generalization disadvantages of ELM in multidomain applications. Li et al. [35] proposed TL-ELM (transfer-learning-based ELM), which uses a small number of labeled target samples and a large number of labeled source samples to construct a high-quality classifier. Motivated by the biological learning mechanism, an Adaptive ELM (AELM) algorithm [36] was put forward for transfer learning, which introduces a manifold regularization term into ELM for image classification on deep convolutional features and representations. AELM is semisupervised transfer learning because it requires labels in the target domain. Because collecting labels is difficult, unsupervised methods are more desirable. Chen et al. [37] presented a transfer ELM framework that bridges the source-domain and target-domain parameters by a projection matrix, in which informative source-domain features are selected for knowledge transfer and the L2,1-norm is applied to the source parameters. Li [38] and Chen [39], respectively, proposed two unsupervised domain adaptation Extreme Learning Machines by minimizing the classification loss and applying the Maximum Mean Discrepancy (MMD) strategy on the prediction results. Among the above approaches, supervised ELM for transfer learning is superior to unsupervised approaches because it efficiently utilizes target labels.
In this paper, we focus on supervised transfer learning and propose a supervised ELM model with the ability of knowledge transfer, called Transfer Extreme Learning Machine with Output Weight Alignment (TELM-OWA), in which a small number of labeled target samples and a large number of labeled source samples are used to build a high-quality classification model. Firstly, it builds two ELM models using the labeled source and target samples. Secondly, we use a mapping function that transforms the output weight of the source ELM into that of the target ELM to align the distributions between the domains. Thirdly, a regularization constraint on the approximation between the interdomain ELM output weight matrices is added to the objective function to improve cross-domain knowledge transfer. Finally, we transform the objective function into a standard ELM form for solving and classification. Our approach is illustrated in Figure 1. Extensive experiments conducted on 16 image datasets and 6 text datasets demonstrate significant advantages of our method over traditional ELM and state-of-the-art transfer learning methods. The main contributions of this paper are as follows: (1) An idea of subspace alignment is adopted to reduce the distribution discrepancy between domains. (2) We apply the approximation constraint between the interdomain ELM output weight matrices to realize efficient knowledge transfer across domains. (3) The objective function is solved in standard ELM form, which is efficient and easy to understand. (4) The proposed method is evaluated in classification experiments on object recognition and text datasets, and the results verify its effectiveness and advantages. The remainder of this paper is organized as follows: in Section 2, we briefly introduce domain adaptation and ELM; in Section 3, we present TELM-OWA; in Section 4, the experiments and analysis verifying the validity of TELM-OWA are presented; finally, Section 5 concludes the paper.

Domain Adaptation.
Transfer learning aims to learn a classifier for the target domain by leveraging knowledge from one or multiple well-labeled source domains. However, if the distributions of the source and target domains differ greatly, its performance suffers. In transfer learning, domain adaptation accelerates cross-domain knowledge transfer by minimizing the discrepancy between domains. According to how the interdomain distribution mismatch is corrected, domain adaptation can be roughly divided into three categories: sample weighting, subspace and manifold alignment, and statistical distribution alignment [33].
Sample weighting methods weigh each sample from the source domain to better match the target-domain distribution and minimize the distribution divergence between the two domains [40,41]; estimating the weights of the source samples is the key to this technique. The most classic sample-based transfer algorithm is TrAdaBoost proposed by Dai et al. [42]. It extends the AdaBoost algorithm and applies boosting technology to weigh the source and target samples. Many algorithms extend TrAdaBoost, such as DTrAdaBoost [43], Multisource-TrAdaboost (MTrA) and Task-TrAdaboost (TTrA) [44], and Multi-Source Tri-Training Transfer Learning (MST3L) [45].
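As a concrete illustration of the sample weighting idea, the following sketch shows one TrAdaBoost-style reweighting step: target-domain mistakes are upweighted as in AdaBoost, while source-domain mistakes are downweighted so that source samples inconsistent with the target task gradually fade away. Function and variable names are illustrative, not taken from the cited implementations.

```python
import numpy as np

def tradaboost_weight_update(w_src, w_tgt, err_src, err_tgt, n_rounds):
    """One TrAdaBoost-style reweighting step (sketch).

    w_src, w_tgt     : current sample weights for source/target samples
    err_src, err_tgt : 0/1 arrays, 1 where the weak learner misclassified
    n_rounds         : total number of boosting rounds
    """
    # Target samples follow the usual AdaBoost rule: upweight mistakes.
    eps = np.sum(w_tgt * err_tgt) / np.sum(w_tgt)
    eps = np.clip(eps, 1e-10, 0.499)          # keep beta_t well-defined
    beta_t = eps / (1.0 - eps)
    w_tgt = w_tgt * beta_t ** (-err_tgt)      # mistakes grow by 1/beta_t

    # Source samples use the fixed factor beta: downweight mistakes,
    # so source samples that disagree with the target lose influence.
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(len(w_src)) / n_rounds))
    w_src = w_src * beta ** err_src
    return w_src, w_tgt
```

After each round the weights are typically renormalized before training the next weak learner.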
Subspace and manifold alignment methods try to align the subspace or manifold representations so as to preserve important properties of the data while reducing the distribution discrepancy across domains. Subspace Alignment (SA) [46][47][48] first projects the source and target samples into their respective subspaces and then learns a linear mapping that aligns the source subspace with the target one, reducing the cross-domain distribution difference for knowledge transfer.
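The SA procedure just described can be sketched in a few lines of NumPy: compute PCA bases for both domains, form the closed-form alignment matrix M = P_S^T P_T, and project. This is a minimal sketch (centering and SVD-based bases are simplifying choices), not the reference implementation.

```python
import numpy as np

def subspace_align(Xs, Xt, k):
    """Subspace Alignment sketch: PCA bases plus linear alignment M = Ps.T @ Pt."""
    # PCA bases via SVD of centered data; columns of Ps/Pt are orthonormal.
    Ps = np.linalg.svd(Xs - Xs.mean(0), full_matrices=False)[2][:k].T
    Pt = np.linalg.svd(Xt - Xt.mean(0), full_matrices=False)[2][:k].T
    M = Ps.T @ Pt                  # closed-form optimal alignment matrix
    Za = Xs @ Ps @ M               # source features in the aligned subspace
    Zt = Xt @ Pt                   # target features in their own subspace
    return Za, Zt
```

Any standard classifier can then be trained on Za and applied to Zt, since both now live in comparable coordinates.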
Statistical distribution adaptation methods aim to explicitly evaluate and minimize the divergence of statistical distributions between the source and target domains to reduce the difference in the marginal distribution, the conditional distribution, or both. To this end, many statistical distances, such as Maximum Mean Discrepancy (MMD) [49], Bregman divergence [50], and KL divergence [51], have been adopted for domain adaptation. Transfer Component Analysis (TCA) [52], Joint Distribution Adaptation (JDA) [53], Weighted Maximum Mean Discrepancy (WMMD) [54], Transfer Subspace Learning (TSL) [55], and so forth were proposed to simultaneously tackle feature mapping, adaptation, and classification.
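For concreteness, a minimal empirical estimate of the squared MMD with an RBF kernel, one of the statistical distances listed above, can be written as follows (the biased V-statistic estimator is used for brevity; the bandwidth gamma is an illustrative parameter):

```python
import numpy as np

def mmd_rbf(Xs, Xt, gamma=1.0):
    """Biased empirical estimate of squared MMD with an RBF kernel (sketch)."""
    def k(A, B):
        # Pairwise squared Euclidean distances via broadcasting.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    # MMD^2 = E[k(s,s')] + E[k(t,t')] - 2 E[k(s,t)]
    return k(Xs, Xs).mean() + k(Xt, Xt).mean() - 2.0 * k(Xs, Xt).mean()
```

Identical samples give an estimate of zero, and the value grows as the two domains drift apart, which is why it serves as a minimization target in TCA- and JDA-style methods.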

Extreme Learning Machine (ELM)
ELM is a fast learning algorithm for single-hidden-layer neural networks. Compared with traditional neural network learning, it has two characteristics: (1) the hidden-layer parameters (i.e., the input weights and biases) can be randomly initialized; (2) the output-layer weight can be solved as a least-squares problem. As a result, ELM has a faster learning speed and better generalization performance than traditional learning algorithms while maintaining high accuracy.
Suppose we are given a training dataset {(x_i, y_i)}_{i=1}^N with N samples, where y_i ∈ R^{C×1} is the label corresponding to x_i and C is the number of categories. The structure of the ELM is shown in Figure 2.
In Figure 2, x_i is the input sample, w_i is the input-layer weight, b_i is the hidden-layer bias, g(·) is the nonlinear activation function, L is the number of hidden-layer nodes, and β_i is the hidden-layer output weight. The goal of ELM is to solve for the optimal output weight β* by minimizing the sum of the squared prediction errors. The objective function is

min_{β, e_i} (1/2)‖β‖² + (θ/2) Σ_{i=1}^N ‖e_i‖²  s.t.  h(x_i)β = y_i^T − e_i^T, i = 1, …, N,  (1)

where h(x_i) = [g(w_1·x_i + b_1), …, g(w_L·x_i + b_L)] is the hidden-layer output row vector of x_i, the first term is a regularization term to prevent overfitting, e_i is the error vector corresponding to the i-th sample, and θ is the tradeoff coefficient between the training error and the regularization term.
Substituting the constraints into the objective function yields

min_β (1/2)‖β‖² + (θ/2)‖Hβ − Y‖²,  (2)

where H = [h(x_1)^T, …, h(x_N)^T]^T is the hidden-layer output matrix and Y = [y_1, …, y_N]^T is the label matrix. The objective function is a ridge regression, i.e., a regularized least-squares problem. Setting its gradient with respect to β to zero, we have

β + θH^T(Hβ − Y) = 0.  (3)

There are two cases in solving β from equation (3). If N ≥ L, equation (3) is overdetermined [20], and the optimal solution is

β* = (I_L/θ + H^T H)^{-1} H^T Y,  (4)

where I_L is the L-dimensional identity matrix. If N < L, equation (3) is underdetermined [23], and the optimal solution is

β* = H^T (I_N/θ + H H^T)^{-1} Y,  (5)

where I_N is the N-dimensional identity matrix.
In the classification task, given a sample x_Te to be tested, the classification result is obtained as

label(x_Te) = arg max_{c ∈ {1,…,C}} (h_Te β*)_c,  (6)

where h_Te = g(x_Te) is the hidden-layer output row vector of x_Te.
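Putting the two closed-form solutions and the classification rule together, a minimal ELM classifier can be sketched as below; the activation (tanh), node count, and all names are illustrative choices, not the authors' code.

```python
import numpy as np

def elm_train(X, Y, L=200, theta=1.0, rng=np.random.default_rng(0)):
    """Train a basic regularized ELM (sketch). Y is one-hot, shape (N, C)."""
    W = rng.normal(size=(X.shape[1], L))    # random input weights, never tuned
    b = rng.normal(size=L)                  # random hidden biases, never tuned
    H = np.tanh(X @ W + b)                  # hidden-layer output matrix
    N = X.shape[0]
    if N >= L:                              # overdetermined case (L x L solve)
        beta = np.linalg.solve(np.eye(L) / theta + H.T @ H, H.T @ Y)
    else:                                   # underdetermined case (N x N solve)
        beta = H.T @ np.linalg.solve(np.eye(N) / theta + H @ H.T, Y)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Classify by the arg max of the network output."""
    return np.argmax(np.tanh(X @ W + b) @ beta, axis=1)
```

The whole training step is a single linear solve, which is the source of ELM's speed advantage over gradient-based training.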

TELM-OWA
In the past few years, the theory and application of ELM have received extensive attention, and great progress has been made in this field. However, when training samples are few, the performance of ELM decreases [34]. Transfer learning draws on knowledge from a related domain to improve learning in the target domain [31]. Therefore, through transfer learning, the performance of ELM can be improved when labeled samples are insufficient. In transfer learning, there are two different but related datasets: the source domain D_S = {(x_s(i), y_s(i))}_{i=1}^{n_S} and the target domain D_T = D_Tr ∪ D_Te, where x_s(i) and y_s(i) are a source-domain sample and its label, respectively, and n_S is the number of samples in D_S. Accordingly, x_T(j) ∈ D_Tr and y_T(j) ∈ D_Tr are a labeled target sample and its corresponding label, x_Te(k) ∈ D_Te is an unlabeled target sample, n_T and n_Te are the numbers of labeled and unlabeled samples in D_T, and n_T ≪ n_S. In this section, we construct an ELM model using D_S and D_Tr to classify the unlabeled samples in D_Te.

Output Layer Weight Alignment.
By using the source-domain labeled samples and the target-domain labeled samples, respectively, two ELMs can be built as

H_S β_S = Y_S,  H_T β_T = Y_T,  (7)

where H_S is the hidden-layer output matrix of D_S and β_S is the output-layer weight of the ELM trained on D_S; accordingly, H_T is the hidden-layer output matrix of D_Tr and β_T is the output-layer weight of the ELM trained on D_Tr. Owing to the distribution difference between D_S and {(x_T(j), y_T(j))}_{j=1}^{n_T}, in general β_S ≠ β_T. Inspired by the literature [46,47], a transformation matrix M is used to align the output layer of the source-domain ELM with that of the target domain in order to achieve cross-domain knowledge transfer. The alignment function is established as

f(M) = ‖β_S M − β_T‖²_F,  (8)

where ‖·‖_F is the Frobenius norm, and the optimal transformation is M* = arg min_M f(M).
Since the Frobenius norm is invariant under orthogonal transformations [46], with the columns of β_S orthonormalized, equation (8) can be rewritten as

f(M) = ‖β_S^T β_S M − β_S^T β_T‖²_F = ‖M − β_S^T β_T‖²_F,  (9)

which gives M* = β_S^T β_T. From equation (9), the optimal β_a = β_S M* can be regarded as the output-layer weight after the output layer of the source-domain ELM is aligned to the target domain, as shown in Figure 3.
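Under the orthonormality assumption, the aligned weight β_a = β_S β_S^T β_T is the projection of β_T onto the column space of β_S, which guarantees ‖β_T − β_a‖_F ≤ ‖β_T − β_S‖_F. A minimal NumPy sketch of this alignment step (names are illustrative):

```python
import numpy as np

def align_output_weights(beta_S, beta_T):
    """Output weight alignment sketch: beta_a = beta_S @ beta_S.T @ beta_T.

    M = beta_S.T @ beta_T is the closed-form minimizer of
    ||beta_S @ M - beta_T||_F when beta_S has orthonormal columns.
    """
    M = beta_S.T @ beta_T          # closed-form alignment matrix M*
    return beta_S @ M              # aligned source output weight beta_a
```

Because β_a is a projection of β_T, it is never farther from β_T than β_S itself is, which is the property the transfer argument later relies on.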

Objective Function of TELM-OWA.
In order to realize the transfer of the Extreme Learning Machine, the following objective function can be established:

min_{β_T} (1/2)‖β_T‖² + (θ/2)‖H_T β_T − Y_T‖² + (λ/2)‖H_S β_S − Y_S‖² + (c/2)‖β_T − β_S‖²,  (10)

where (c/2)‖β_T − β_S‖² is a regularization term that facilitates knowledge transfer and prevents negative transfer, and λ and c are balance parameters.
To align the output layer of the source ELM to the target one, we replace β_S with β_a and substitute it into equation (10) to get

min_{β_T} (1/2)‖β_T‖² + (θ/2)‖H_T β_T − Y_T‖² + (λ/2)‖H_S β_a − Y_S‖² + (c/2)‖β_T − β_a‖².  (11)

Because β_a = β_S β_S^T β_T, equation (11) becomes

min_{β_T} (1/2)‖β_T‖² + (θ/2)‖H_T β_T − Y_T‖² + (λ/2)‖H_S β_S β_S^T β_T − Y_S‖² + (c/2)‖(I − β_S β_S^T) β_T‖².  (12)

Letting Q = H_S β_S β_S^T and A = I − β_S β_S^T, the objective function of TELM-OWA can be simplified as

min_{β_T} (1/2)‖β_T‖² + (θ/2)‖H_T β_T − Y_T‖² + (λ/2)‖Q β_T − Y_S‖² + (c/2)‖A β_T‖²,  (13)

and, letting T = θH_T^T Y_T + λQ^T Y_S and setting the gradient with respect to β_T to zero,

(I + θH_T^T H_T + λQ^T Q + cA^T A) β_T = T,  (14)

so that

β_T = (I + θH_T^T H_T + λQ^T Q + cA^T A)^{-1} T.  (15)

After β_T with knowledge transferability is obtained, the test samples are classified by equation (6). The complete classification procedure of TELM-OWA is summarized in Algorithm 1.
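The closed-form solve of this objective fits in a few lines of NumPy. Here Q = H_S β_S β_S^T and A = I − β_S β_S^T follow from β_a = β_S β_S^T β_T as in the text; treat this as a sketch of the stated least-squares problem with illustrative names, not the authors' exact code.

```python
import numpy as np

def telm_owa_solve(H_S, Y_S, H_T, Y_T, beta_S, theta=1.0, lam=1.0, c=1.0):
    """Closed-form TELM-OWA output weight (sketch)."""
    L = H_T.shape[1]
    P = beta_S @ beta_S.T      # alignment projector: beta_a = P @ beta_T
    Q = H_S @ P                # so ||H_S beta_a - Y_S|| = ||Q beta_T - Y_S||
    A = np.eye(L) - P          # so ||beta_T - beta_a|| = ||A beta_T||
    lhs = np.eye(L) + theta * H_T.T @ H_T + lam * Q.T @ Q + c * A.T @ A
    T = theta * H_T.T @ Y_T + lam * Q.T @ Y_S
    return np.linalg.solve(lhs, T)          # single linear solve, as in ELM
```

The test below verifies the algebra by checking that the gradient of the objective vanishes at the returned solution.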

Discussion.
To improve the classification performance of ELM in a transfer learning environment, we propose TELM-OWA, whose objective function is given by equations (11) to (14). Several observations can be made: (1) Compared with the traditional ELM, TELM-OWA adopts ‖H_S β_a − Y_S‖² to exploit source-domain knowledge to help the target ELM obtain the optimal parameter β*_T, and it also increases the fitness of β*_T to the target-domain data by ‖H_T β_T − Y_T‖².
(2) DAELM-S proposed by Zhang and Zhang [34] also applies ‖H_S β_S − Y_S‖² to help the target task. Though DAELM-S uses ‖H_S β_S − Y_S‖² to transfer knowledge from the source domain and increases the fitness of β_S to the source data, it thereby decreases the fitness to the target domain compared with TELM-OWA, in which β_a approximates β_T more closely than β_S does thanks to the subspace alignment mechanism. Therefore, ‖H_S β_a − Y_S‖² increases the fitness of β*_T to the target data more than ‖H_S β_S − Y_S‖² does. (3) DAELM-T [34] uses ‖H_Tu β_T − H_Tu β_S‖² to promote the approximation of β_S and β_T. However, ‖β_T − β_a‖ < ‖β_T − β_S‖ according to equation (9) and Figure 3; therefore, TELM-OWA has a better knowledge transfer effect than DAELM-T. (4) Because TELM-OWA and DAELM-T must first solve β_S before solving the optimal parameter β*_T, they incur an additional computational complexity of O(L³) compared with ELM and DAELM-S, where L is the number of hidden-layer nodes. (5) In [37], PTELM also adopted output weight alignment based on ELM for knowledge transfer, but there are two differences between PTELM and TELM-OWA. On the one hand, PTELM is suited to unsupervised transfer learning, in which no target labels are needed, whereas TELM-OWA is a supervised transfer learning algorithm requiring only a few target labels. On the other hand, PTELM solves the projection matrix for output weight alignment and the output weight by the coordinate descent method in an alternating optimization manner, while in TELM-OWA the output weight only needs to be solved as in a standard ELM.

Step 1: Use D_S = {(x_s(i), y_s(i))}_{i=1}^{n_S} to calculate β_S according to equation (4) or (5).
Step 2: Solve Q, T, and A by using D_S, D_Tr, and β_S.
Step 3: Solve the output weight β_T according to equation (15).
Step 4: Use β_T to predict D_Te and obtain its labels.
ALGORITHM 1: TELM-OWA.

Experiment and Analysis

Caltech-256 is a benchmark dataset for object recognition; it contains 30,607 images in 256 categories. The Office + Caltech dataset released by Gong [56] contains four domains, C (Caltech-256), A (Amazon), W (Webcam), and D (DSLR), sharing 10 common classes. In the experiments, two different domains are randomly selected as the source and target datasets, so that 12 cross-domain tasks can be constructed, namely, C⟶A, C⟶W, C⟶D, ..., and D⟶W. (iv) Reuters-21578: the Reuters-21578 text dataset, a common benchmark for text categorization, contains 21,577 news articles from Reuters in 1987 that were manually labeled with 5 top categories, including "exchanges," "orgs," "people," "places," and "topics," each divided into multiple major classes and subclasses. The three largest categories, shown in Table 1, are "orgs," "people," and "place," from which 6 cross-domain text classification tasks can be constructed: orgs versus people, people versus orgs, orgs versus place, place versus orgs, people versus place, and place versus people. We conduct a more intensive evaluation on these 6 tasks.

Experimental Results and Analysis.
We compared the proposed algorithm with several classifiers to evaluate its performance. In the experiments, we set the SVM penalty parameter from {0.1, 0.5, 1, 5, 10, 50, 100} and the penalty parameter θ ∈ [0.001, 0.1] in ELM, SSELM, DAELM_S, DAELM_T, and TELM-OWA. TCA and JDA are feature transfer algorithms, combined with PCA to extract a shared feature subspace based on MMD; in these feature transfer algorithms, the dimension of the feature subspace is 100, and the range of the balance-constraint parameter of the projection matrix in TCA and JDA is [0.1, 1]. The ARRLS algorithm combines JDA with structural risk minimization and graph regularization terms to improve the knowledge transfer effect; its parameters are set according to [57].

In each dataset, 20% of the target-domain samples are randomly selected as the small set of labeled target samples, which are used for training together with the source-domain samples, and the remaining target samples serve as the test set. In 1NN, SVM, ELM, SSELM, TCA + (1NN, SVM), JDA + (1NN, SVM), and ARRLS, the labeled samples from the source and target domains are used together to train the classifier. Table 2 shows the classification results of the algorithms on the image and text datasets. The standard machine learning methods, i.e., 1NN, SVM, and ELM, suffer from the domain shift problem and thus obtain unsatisfactory performance, although ELM performs noticeably better than 1NN and SVM because of its good fitness and generality to data. (5) The semisupervised method SSELM performs better than ELM by exploring the geometric property of the domain, but worse than TELM-OWA, DAELM_S, and DAELM_T because it does not consider the domain shift problem. (6) Owing to the lower accuracy of 1NN, TCA + 1NN and JDA + 1NN are worse than SVM, ELM, TCA + SVM, and JDA + SVM but better than 1NN. (7) The accuracy of feature extraction algorithms with transfer capability, such as TCA + SVM and JDA + SVM, is higher than SVM, and similarly with 1NN as the classifier, indicating the importance of feature transfer learning when labeled samples are few or not identically distributed. (8) The accuracy of JDA + 1NN and JDA + SVM is generally higher than that of TCA + 1NN and TCA + SVM, which indicates the superiority of reducing the marginal and conditional distribution discrepancies at the same time. Moreover, from Tables 2-3 and Figures 4-7 we can see the following: (1) TELM-OWA, as an extension of ELM to transfer learning, also has faster learning speed and higher accuracy than the non-ELM methods, because it maintains the advantages of the good fitness of a neural network and of a ridge regression model with a closed-form solution. (2) Although TELM-OWA has higher accuracy than ELM, SSELM, DAELM_S, and DAELM_T, it also requires more learning time.
When L > 2000, if the number of hidden-layer nodes is reduced, the learning speed improves while the accuracy drops only slightly (see Figure 8). (1) With the increase of the number of labeled target samples used for training, the accuracy of TELM-OWA increases, as shown in Figure 8(a). This shows that when the number of target labels is small, source-domain knowledge can help the target task; as the number of labeled target samples grows, the trained model fits the target data better and achieves higher accuracy.
(2) As shown in Figure 8(b), the accuracy of TELM-OWA increases with the number of hidden-layer nodes on the 4 datasets. This verifies that a large number of hidden nodes is beneficial, as it allows the ELM network to better approximate the output function.
(3) In Figure 8(c), as λ gradually increases, the accuracy first increases and then slightly decreases. When λ is too small, the helpful information from the source domain is underutilized, leading to low performance; when λ is too large, the trained model overfits the source-domain samples, degrading performance. TELM-OWA achieves good results when λ ∈ [10, 100]. The dataset orgs versus people is robust to changes in λ. (4) In Figure 8(d), the accuracy exhibits a slightly rising and then declining tendency as c increases, with better accuracy obtained when c ∈ [10, 100]. When c is small, performance is somewhat low because β_S is far from β_T; when c is too large, ‖β_T − β_a‖² reduces the influence of the empirical risk of the labeled samples from the source and target domains, and the accuracy degrades. (5) As shown in Figure 8(e), the accuracy first increases and then decreases with increasing θ, which controls the quality of β_S, and better classification results are achieved when θ ∈ [10⁻⁴, 10⁻³].

Conclusion
To solve the problem of performance degradation of the traditional Extreme Learning Machine when reliable training samples are few, in this paper we propose TELM-OWA, an Extreme Learning Machine with the ability of knowledge transfer. It reduces the distribution difference across domains by aligning the ELM output weight matrices between domains and by introducing the approximation between the interdomain ELM output weight matrices into the objective function. Moreover, the objective function is transformed into the standard ELM form for solving. Extensive experiments were designed to compare our proposed algorithm with related algorithms, and the results show that TELM-OWA has higher accuracy and better generalization performance. TELM-OWA still has some limitations: (1) it still needs some labeled samples in the target domain, so it is not suitable for unsupervised transfer learning environments.
(2) It reduces the distribution difference across domains by aligning the ELM output weight matrices between domains but ignores the overall distribution differences at the output layer, where the divergence of statistical distributions between the source and target domains still varies across dimensions. (3) Its shallow architecture fails to find higher-level representations and thus cannot capture relevant higher-level abstractions.
As a result, future research will focus on three aspects to improve TELM-OWA: firstly, reliable sample selection will be introduced for unsupervised transfer learning; secondly, the effectiveness of knowledge transfer will be further promoted by aligning the ELM output weight matrices and minimizing the divergence of statistical distributions together; thirdly, similarly to deep learning, TELM-OWA will be improved by stacking it into a deep structural model for extracting deep features.

Conflicts of Interest
The authors declare that they have no conflicts of interest.