Weighted Domain Transfer Extreme Learning Machine and Its Online Version for Gas Sensor Drift Compensation in E-Nose Systems

Machine learning approaches have been widely used to tackle the problem of sensor array drift in E-Nose systems. However, labeled data are rare in practice, which makes supervised learning methods hard to apply. Meanwhile, current solutions require updating the analytical model offline, which hampers their use in online scenarios. In this paper, we extend the Target Domain Adaptation Extreme Learning Machine (DAELM_T) to achieve high accuracy with fewer labeled samples by proposing a Weighted Domain Transfer Extreme Learning Machine, which uses clustering information as prior knowledge to help select proper labeled samples and to calculate a sensitive matrix for weighted learning. Furthermore, we convert DAELM_T and the proposed method into online learning versions for the scenario in which the labeled data are selected beforehand. Experimental results show that, in the batch learning version, the proposed method uses around 20% fewer labeled samples while achieving approximately equivalent or better accuracy. The online versions maintain almost the same accuracy as their offline counterparts, but their time cost stays roughly constant, whereas that of the offline versions grows with the number of samples.


Introduction
The wide spread of wireless sensor environments has provided abundant sources to help improve the convenience and prosperity of human life. Accordingly, research has shifted from the original construction and routing problems [1][2][3] to more specific sensory data processing and analysis tasks [4][5][6]. With the fast development of sensor technology, E-Nose systems comprising gas sensor arrays have been widely used in air quality monitors, security checkpoints, and other gas compound identification scenarios. Such devices rely on direct or indirect reactions between their materials and gas compounds. Taking a metal-oxide sensor array as an example, one that uses chemiresistors will have different electroconductivities when exposed to diverse gases [7]. Partly due to the mechanism of gas sensors, the detection of such reactions may degrade after some time or after the sensor has been exposed to a compound for too long. This phenomenon is called sensor drift; it can hinder the performance not only of the sensors themselves but also of the pattern recognition techniques used to determine the compounds. Therefore, an effective way of dealing with the problem is essential for industry.
Drift has two sources: first-order drift, which is due to interaction processes such as long exposure to gas, sensor aging, or poisoning, and second-order drift, which arises from the experimental setting or system noise [8][9][10][11]. Researchers have approached the problem from several directions, and various drift compensation or calibration techniques have been proposed. Choosing gas sensors that degrade slowly was the first option when building an E-Nose system [12,13]. With improvements in materials technology, durable sensor materials and suitable selection methods then emerged [7,14,15,16,17]. However, these two types of approaches rely on the resilience of materials or on individual differences among sensors; they do not tackle the degradation directly and have their limits. An alternative is to improve postprocessing techniques so that the model continues to work after the degradation happens.
In the postprocessing of sensor readings, techniques that can track the patterns are suitable for classifying gases when drift occurs. Sensor readings are first preprocessed into multiple features reflecting different aspects of the readings. Taking the data used in this paper as an example, the discrete readings $r[k]$ from a single sensor over time are transformed into a 6-dimensional feature vector $x$ reflecting the steady-state, absorption, and desorption responses. The pattern refers to the distribution of a specific gas label in the feature space, that is, $P(y \mid x)$, where $y$ is the label and $x$ is a $d$-dimensional feature. The analytical model, which takes $x$ as input, outputs the probability of $x$ belonging to a certain label $y$. Ideally, the probability of $x$ belonging to its correct label is 1 and that for the other labels is 0. Under drift, the distributions before and after the drift differ; that is, $P(y_{\text{before drift}} \mid x) \neq P(y_{\text{after drift}} \mid x)$. In this case, the analytical model that works perfectly before drift is no longer reliable. In current research on sensor drift compensation, it is commonly accepted that drift is a slow process and can be compensated for by tracking the changes in the sensor readings. Therefore, a detection mechanism is required that is based on the postprocessing features and can adapt to the distribution changes. Ideally, we wish to obtain a mechanism under which the distributions of a specific gas before and after drift are the same; that is, $P(y_{\text{before drift}} \mid x) = P(y_{\text{after drift}} \mid x)$.
From the perspective of model learning, the sensor drift problem can be viewed as concept drift, in which the distribution of gas labels in the feature space changes over time. In the past few decades, some researchers have combined different classification models to build a robust one that resists the drift to some extent [10,19,20], while others have attempted to map the unknown response to a properly tuned model [21,22]. Although both types of models can alleviate the effect of sensor drift, ensemble-based methods may require careful choice of the subclassifiers, and transfer learning-based methods require manual labeling of some samples in the unknown domain. Additionally, few of the methods consider the imbalanced nature of the samples when building classifiers. In applications, supervised learning is mature and generally accurate, but manual labeling is time-consuming. Unsupervised approaches require no such labor but are less accurate than their supervised counterparts. To balance accuracy and time efficiency, more accurate models with less human involvement are needed.
In this paper, we aim to build adaptive semisupervised models that can train themselves and learn patterns when the dataset changes. In particular, we target two scenarios: the first is offline semisupervised learning with a few manually labeled samples, and the second is online semisupervised learning after the labeled samples have been selected and fixed. Our goal is to provide classification models that require less human effort while keeping the accuracy of gas identification at a relatively high level. The contributions are threefold.
(i) The samples selected for labeling play key roles in the models.
To evaluate the effectiveness of the proposed methods, we conducted experiments on the two aforementioned scenarios. The improvements in accuracy with fewer labeled samples verified the feasibility of CSS and the weighted learning mechanism. In addition, ODAELM and OWDTELM were shown to update themselves in a time-efficient way and to achieve approximately equivalent overall classification performance compared with their batch learning versions. To avoid ambiguity, we use DAELM to refer to DAELM_T in L. Zhang and D. Zhang's work [21] in the remainder of the paper.
The remainder of the paper is organized as follows. Section 2 provides some preliminaries on drift compensation and the extreme learning machine. Section 3 describes the dataset used in the paper and details CSS, WDTELM, and the online learning process. Experimental comparisons of the classification performance are provided in Section 4. Discussions of the proposed methods and the conclusions are given in Sections 5 and 6, respectively. For ease of reference, a list of frequently used abbreviations and their full names is given at the end of the paper.

Analytical Model for Drift Compensation.
Sensor drift is one of the major obstacles that prevent E-Nose systems from being effective over long periods of their lifetimes. Therefore, various analytical models have been devised to address the issue.
Ensemble methods are among the most popular methodologies in the field. They use a group of different classifiers to classify diverse gas compounds so that the sensor drift problem can be mitigated [10,19,20]. By doing this, the lifetime of the sensor array can be prolonged as well. However, the learning procedures are supervised and require the samples to be labeled first. Moreover, the method and its variants make assumptions about the gas data, for example, that the drift direction remains the same for different gases, which are sometimes not true.
Unsupervised methods such as Sequential Minimal Optimization- (SMO-) based ones have been shown to be effective [23,24]. Nevertheless, these methods can sometimes mistakenly update the pattern by following the wrong reference class. Component Correction- (CC-) based methods are reported to give good results [25][26][27]. However, they assume that the gases behave in a similar way in the drift process, while the truth is quite the opposite. Drift is a slow process; therefore, adaptive methods can be used for it [28]. In [21], the authors proposed a domain adaptive ELM that uses a limited number of manually labeled samples and a semisupervised training procedure to achieve one of the highest accuracies. However, the performance drops rapidly when the number of labeled samples becomes small, and the imbalanced nature of the data is not considered either.
It is worth mentioning that some techniques, although targeting different problems, can also help transfer the model from one dataset to another using transfer learning [22,29]. Moreover, instead of detecting and learning the drift directly, some researchers have contributed by detecting sensors with degrading performance so as to replace them [10,30,31]. Although these methods are not included in the discussion of this paper, they do relieve the drift by maintaining the performance of the E-Nose system at a certain level.

Extreme Learning Machine.
In general, ELM is a three-layer feedforward neural network with fully connected nodes between layers. Unlike other neural networks, ELM randomizes the connections between the input and hidden layers, while leaving those between the hidden and output layers to be tuned. This randomness lightens the burden of computing the optimal parameters. Together with the generalized inverse used in the learning process, it has made ELM a favored rapid learning algorithm with good generalization ability [32,33].
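A typical ELM with $L$ hidden layer nodes can be formulated as (1), where $H$ is the hidden layer output of the $N$ training samples (see (2), in which $h_j(x_i)$ is the output of the $j$th hidden node for the $i$th sample), $\beta$ is the output weight matrix, and $T$ is the target.

$$H\beta = T \tag{1}$$

$$H = \begin{bmatrix} h_1(x_1) & \cdots & h_L(x_1) \\ \vdots & \ddots & \vdots \\ h_1(x_N) & \cdots & h_L(x_N) \end{bmatrix}_{N \times L} \tag{2}$$

The fast training speed resides in the fact that only $\beta$ needs to be determined. To calculate it, ELM solves the optimization problem (3). The solution can be given as $\beta = H^{\dagger}T = (H^{\top}H)^{-1}H^{\top}T$, where $H^{\dagger}$ is the generalized inverse of $H$, also known as the Moore-Penrose generalized inverse. The good generalization of the method is largely attributed to the Moore-Penrose generalized inverse, which replaces the recursive calculation of $\beta$ in traditional neural network algorithms.

$$\min_{\beta} \; \|H\beta - T\|^{2} \tag{3}$$

Additionally, to help the algorithm balance the empirical errors against the smallest norm of the weights, the optimization problem is modified as (4), where $C$ is a preset parameter or penalty factor.

$$\min_{\beta} \; \frac{1}{2}\|\beta\|^{2} + \frac{C}{2}\|H\beta - T\|^{2} \tag{4}$$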
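The training procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' MATLAB implementation: it assumes a Gaussian RBF hidden layer with randomly drawn centers and widths, and it solves the regularized problem in closed form as $\beta = (I/C + H^{\top}H)^{-1}H^{\top}T$. The function names `rbf_hidden`, `train_elm`, and `predict_elm`, the parameter ranges, and the toy data are all hypothetical.

```python
import numpy as np

def rbf_hidden(X, centers, widths):
    """Gaussian RBF hidden layer: H[i, j] = exp(-||x_i - c_j||^2 / widths[j])."""
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / widths[None, :])

def train_elm(X, T, n_hidden=50, C=100.0, seed=0):
    """Draw random hidden parameters, then solve only for the output weights."""
    rng = np.random.default_rng(seed)
    centers = rng.uniform(-1.0, 1.0, size=(n_hidden, X.shape[1]))  # assumed range
    widths = rng.uniform(0.5, 2.0, size=n_hidden)                  # assumed range
    H = rbf_hidden(X, centers, widths)
    # Minimizer of 0.5*||beta||^2 + 0.5*C*||H beta - T||^2
    beta = np.linalg.solve(np.eye(n_hidden) / C + H.T @ H, H.T @ T)
    return centers, widths, beta

def predict_elm(X, centers, widths, beta):
    """Class scores for each row of X; argmax over columns gives the label."""
    return rbf_hidden(X, centers, widths) @ beta

# Toy usage on synthetic two-class data with one-hot targets.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
labels = (X[:, 0] > 0).astype(int)
T = np.eye(2)[labels]
centers, widths, beta = train_elm(X, T)
pred = predict_elm(X, centers, widths, beta).argmax(axis=1)
```

Only the linear solve depends on the training targets, which is why training is fast compared with iterative weight updates.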
Over the past decade, a number of ELM variants have appeared. Incremental ELM (IELM) was proposed to change the network structure by adding more hidden layer neurons [34,35]. Online sequential ELM (OSELM) enables the network to change its output weight matrix so as to adapt to changes in the data [36]. The kernel trick is not new in machine learning, and the concept has been widely used in the Support Vector Machine (SVM). In ELM, the kernel is defined as $\Omega_{\mathrm{elm}} = HH^{\top}$, and, according to Huang et al.'s work in [37], it improves the generalization performance. In recent years, it has also been used to speed up the training process [38].
In the past few years, ELM has been regarded as an effective solution for various applications, such as active recognition [39], speech emotion recognition [40], and medical data classification [41], to name a few. With the fast development of big data and distributed systems, there are also studies dedicated to adapting the algorithm to large-scale datasets [42] or to the Map-Reduce framework [43,44].

Semisupervised Methods for Gas Sensor Drift Compensation
In this section, we first perform a brief analysis of the dataset used in the paper and then propose CSS to improve the selection of samples for labeling. Subsequently, we target two application scenarios, namely, offline training and online training, and propose online versions of DAELM and WDTELM, respectively.

Specification of Dataset.
The chemical gas sensor dataset used in the paper has been published on UCI repository [45].
To properly characterize the features of such data, techniques that transform the time-continuous raw data into discrete values are commonly used [46][47][48]. In this paper, the data are conductivity measurements of a metal gas sensor array's responses to several gas compounds over 36 consecutive months, preprocessed using the Exponential Moving Average (EMA). Each sample consists of readings from an array of 16 metal gas sensors. For one gas sensor, each sample has 2 steady-state features and 6 dynamic features. In total, the dataset has a 128-dimensional feature space. Table 1 gives the detailed data distribution of the six gas compounds over the 10 batches in the dataset. As shown in the table, some batches (e.g., batch (1)) contain all six gas compounds while others (e.g., batch (3)) contain only five. In addition, the number of samples of one gas may be 20 times larger than that of another (e.g., Toluene and Acetone in batch (2)). Moreover, the distributions of the samples in the feature space are also imbalanced.
Figure 1 shows the distributions of the 10 batches after Principal Component Analysis (PCA) [49]. The three axes represent the first three dimensions after performing PCA on the dataset. Different colors represent different labels. For some labels, such as the purple one, the samples cover a large area in most batches, while for others, such as the yellow one, the samples occupy only small areas. It can also be seen that the data distributions of the classes are scarcely alike across batches. However, the relative position of each class's distribution remains stable. For example, the purple samples always stay on the right while the navy ones stay on the left. This phenomenon confirms that using a model trained offline on a single batch is highly unrealistic. Nevertheless, some knowledge acquired on one batch may be applied to other batches, and semisupervised learning with a limited number of representative samples may help in capturing the differences.

Clustering-Aided Sample Selection.
Because sensor data come from various sources and are usually redundant, a small set of samples can effectively approximate the distribution of the data with little information loss [50,51]. These data are called representative data. Therefore, in order to perform semisupervised learning, a group of representative samples should be selected for manual labeling. The selection of the to-be-labeled samples in L. Zhang and D. Zhang's paper [21] uses the Kennard and Stone (KS) algorithm, called SSA in that paper, which is based on the Euclidean distance between features. The effectiveness of L. Zhang and D. Zhang's work confirms that distance-based measurements can be used to distinguish the samples. However, the method treats all samples equally during selection, which might explain why its performance degrades substantially when the number of selected samples is small. To further improve the performance without extra human labor, we use another unsupervised distance-based method before KS to provide additional information for pruning the selection process.
To achieve this goal, clustering is the first choice. By using a proper clustering method, we can group the unlabeled samples in an unsupervised way without extra human involvement. If the clustering is accurate enough, it provides class information. Although the exact labels are still unknown, the difference between samples from different classes is certain.
Following this intuition, we first examined different clustering strategies for grouping the gas sensor data. Although this part belongs to performance evaluation, we place it here for better illustration. The performance was evaluated by the accuracy of classification. Even though clustering has no concept of classification accuracy, we can define one here for examination purposes. In this article, we assume that the label of the majority of the samples in a cluster is the label of the cluster. Therefore, we can define the accuracy of each clustering method using (5), where $N$ is the number of samples in the batch, $c$ is the number of clusters, and $n_{m,l}$ is the number of samples with label $l$ in cluster $m$.

$$\text{Accuracy} = \frac{1}{N}\sum_{m=1}^{c} \max_{l} \, n_{m,l} \tag{5}$$

For each batch of the dataset, we ran the clustering methods on it and determined the majority label of the samples in each cluster. The samples carrying that majority label in the cluster are considered correctly labeled. In total, 7 built-in hierarchical clustering methods in MATLAB were used in this evaluation, that is, unweighted average distance (Average), furthest distance (Furthest), centroid distance (Centroid), weighted center of mass distance (Median), shortest distance (Shortest), weighted average distance (Weighted), and inner squared distance (Inner). The average accuracies of the methods are as follows: Average: 53.3%; Furthest: 40.8%; Centroid: 52.7%; Median: 51.4%; Shortest: 40.8%; Weighted: 57.7%; and Inner: 69.1%. Although some methods surpass Inner for specific batches (for example, Centroid surpasses Inner in batch (2)), the margin is not significant (around 0.5%), and those methods perform poorly in other batches. To sum up, Inner is the best choice among the methods in general. We also ran the same clustering process using the first 3 features after PCA on the dataset, and the result remains the same.
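The majority-label accuracy described above can be sketched as follows. The function name `clustering_accuracy` and the toy cluster assignments are hypothetical; the cluster IDs could come from any clustering method.

```python
from collections import Counter

def clustering_accuracy(cluster_ids, true_labels):
    """Fraction of samples whose label equals the majority label of their cluster."""
    per_cluster = {}
    for cid, label in zip(cluster_ids, true_labels):
        per_cluster.setdefault(cid, Counter())[label] += 1
    # In each cluster, only the samples carrying the majority label count as correct.
    correct = sum(max(counts.values()) for counts in per_cluster.values())
    return correct / len(true_labels)

# Toy usage: three clusters over eight samples.
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 2]
true_labels = ["A", "A", "B", "B", "B", "B", "C", "C"]
acc = clustering_accuracy(cluster_ids, true_labels)  # (2 + 3 + 1) / 8
```

Note that this measure needs the true labels of all samples, so it serves only to compare clustering methods and cannot be used during deployment.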
Although the clustering performance is relatively good, clustering alone is not applicable in a real scenario, since we cannot decide the exact label of each cluster without manually labeling all the samples or at least a large number of them. Meanwhile, since the data are unlabeled, we cannot find the majority class to help determine the label of the cluster either. Fortunately, we approach the problem in a semisupervised way. If we can select the most representative samples from each cluster, the problem can be solved to some extent.
In this paper, CSS uses KS on each cluster to select the samples to be labeled. The process is illustrated in Figure 2. The cloud-like items represent different clusters, and the circles in the figure are the selected samples. The number in each circle gives the order of selection. The dataset in the figure is divided into four clusters, and the number of selected samples is 5. Different locations in a cluster represent different values in the feature space. In Figure 2, the upper left and lower right clusters have 3 and 4 samples, respectively. The other two clusters may contain hundreds or even thousands of samples. As shown in the figure, KS first selects the pair with the farthest distance in each cluster, that is, the circles labeled 1 and 2, as labeled samples. Then the sample whose nearest distance to the selected samples is the largest is selected, that is, the circles labeled 3, 4, and 5 in sequence. The process continues until the maximum number of samples ($k$) has been selected in each cluster.
Note that the number of samples in some clusters may be less than 5 (see the upper left and lower right clusters in Figure 2). In this case, we directly choose all the samples as selected samples. The pseudocode of CSS is listed in Algorithm 1. For the semisupervised learning scenario described in this paper, the exact or estimated number of labels should be set first. For the dataset in this paper, we set it to 6 because there are 6 different compounds. The maximum value $k$ should be larger than 2 for KS to be effective. Although some batches have fewer than 6 gas compounds, this does not hinder the effectiveness of the method, because the selected samples will be labeled. Moreover, the weighting process described later ensures that the unlabeled samples are not affected by labels that do not exist in the batch.

Algorithm 1: Clustering-aided sampling.
Input: $k$ := the number of samples selected for each cluster; $c$ := the number of clusters;
Output: $S$ := the selected samples; $U$ := the unselected samples;
(1) Cluster the samples into $c$ clusters using inner squared distance clustering;
(2) for each cluster do
(3)     if the number of samples is less than $k$ then
(4)         Put all the samples into $S$;
(5)     else
(6)         Calculate the distances between the samples in this cluster;
(7)         Select the two samples with the largest distance and put them in $S$;
(8)         Initialize $i$ := 2;
(9)         while $i < k$ do
(10)            Find the nearest distances of the remaining samples to the selected ones;
(11)            Choose the one with the largest distance and put it in $S$;
(12)            $i = i + 1$;
(13)        end while
(14)        Put the unselected samples in $U$;
(15)    end if
(16) end for
(17) return $S$, $U$;
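The per-cluster Kennard-Stone selection can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the cluster assignments are taken as given (the clustering step itself is omitted), samples are tuples compared with Euclidean distance, and the function names `kennard_stone` and `clustering_aided_selection` are hypothetical. It assumes $k \geq 2$.

```python
import math
from itertools import combinations

def kennard_stone(points, k):
    """Return indices of up to k points chosen by the Kennard-Stone rule."""
    n = len(points)
    if n <= k:
        return list(range(n))  # small cluster: keep every sample
    # Start with the two mutually farthest points.
    first, second = max(
        combinations(range(n), 2),
        key=lambda pair: math.dist(points[pair[0]], points[pair[1]]),
    )
    selected = [first, second]
    while len(selected) < k:
        remaining = [i for i in range(n) if i not in selected]
        # For each remaining point take its distance to the nearest selected
        # point, then pick the point for which that distance is largest.
        nxt = max(
            remaining,
            key=lambda i: min(math.dist(points[i], points[s]) for s in selected),
        )
        selected.append(nxt)
    return selected

def clustering_aided_selection(points, cluster_ids, k):
    """Apply Kennard-Stone inside each cluster; return (selected, unselected) indices."""
    selected, unselected = [], []
    for cid in sorted(set(cluster_ids)):
        members = [i for i, c in enumerate(cluster_ids) if c == cid]
        chosen = {members[j] for j in kennard_stone([points[i] for i in members], k)}
        selected += [i for i in members if i in chosen]
        unselected += [i for i in members if i not in chosen]
    return selected, unselected

# Toy usage: one cluster of five samples and one of two, with k = 3.
points = [(0, 0), (1, 0), (4, 0), (10, 0), (5, 0), (0, 5), (0, 6)]
cluster_ids = [0, 0, 0, 0, 0, 1, 1]
selected, unselected = clustering_aided_selection(points, cluster_ids, k=3)
```

In the toy data the first cluster contributes its two end points and then the sample farthest from both, while the second cluster is smaller than $k$ and is selected whole.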

Weighted Domain Transfer Extreme Learning Machine.
The objective is to train a new classifier on the labeled samples while leveraging the unlabeled ones. In L. Zhang and D. Zhang's paper, the unlabeled samples are treated equally in the objective function. However, leveraging the unlabeled samples requires distinguishing between correctly and incorrectly classified samples. More specifically, to learn a more accurate model, the samples that are incorrectly classified should weigh less than the correctly classified ones. If all the samples were treated equally, the negative effects of the wrongly classified samples would be amplified, which would cause the learning to follow a wrong pattern and reduce the classification accuracy. This explains why the accuracy of DAELM degrades quickly when the labeled samples are few. In this paper, our proposed method employs weighted learning to emphasize the samples that are less likely to be incorrectly classified by the base classifier. Therefore, the optimization problem becomes (6) by incorporating a sensitive matrix $W$.
In (6), $H_T$ and $H_{Tu}$ are the hidden layer outputs of the labeled and unlabeled samples, respectively. $t_T$ denotes the labels of the labeled samples. $C_T$ and $C_{Tu}$ are two preset regularization parameters. $\beta_S$ is the output weight of the ELM trained in the source domain (base ELM), and $\beta_T$ is that of the target domain classifier to be learned. Note that the labeled samples are fixed and reliable, since their labels are manually examined and determined. For the unlabeled samples, the third term learns their information based on the output of a classifier trained in the source domain. However, $\beta_S$ is not 100% reliable, and the labels could be wrong. Therefore, the learning algorithm should learn from the samples whose labels are most unlikely to be wrong and ignore the ones whose labels are most likely to be incorrect. In this paper, we address this in the optimization by adding $W$ to the third term, in which the symbol $\circ$ represents the Hadamard product.
In detail, $W$ has the same number of rows as $H_{Tu}$, and each row holds the probabilities of the sample belonging to specific labels. For example, let there be 3 labels, namely, 1, 2, and 4, in the cluster to which the $i$th sample belongs, and let the total number of labels be 6. Equation (7) is an example of the $i$th row of $W$. The values in columns 1, 2, and 4 are the probabilities of the $i$th sample belonging to each class, denoted by $p_1$, $p_2$, and $p_4$. Since there are no other labels in the cluster, the remaining columns are 0.

$$W_i = [\, p_1 \;\; p_2 \;\; 0 \;\; p_4 \;\; 0 \;\; 0 \,] \tag{7}$$
The ideal value of $W$ is unknown in the semisupervised scenario because we do not know the exact labels of the unlabeled samples. However, we can estimate it with the information collected from the clustering process. In the labeling process, the selected samples are assigned definite labels. In clustering, the closer two samples are, the more likely they share the same label. We extend this idea as follows: the closer a sample is to a certain labeled sample, the more likely it carries that sample's label. The idea is illustrated in Figure 3. With the help of CSS, the target domain is split into labeled and unlabeled sets. The labeled samples are representative, and we assume that the labels of the unlabeled samples are among those of the labeled ones. Meanwhile, we partially trust $\beta_S$ trained in the source domain. Therefore, the output obtained using $\beta_S$, say label $\hat{y}$, may or may not be true. Note that the representative samples are chosen based on the distances among samples. Since, under the aforementioned assumption, the real label is that of one of the labeled samples, say label $y$, the probability of $\hat{y} = y$ can be calculated based on the distance.
In this paper, we use the reciprocal of the distance to a certain labeled sample as the degree of membership in its label. If more than one selected sample in a cluster has the same label, we use the one with the smallest distance. The probability of an unlabeled sample belonging to a given label can then be calculated from the degrees for all the labels in the cluster. Figure 4 gives an example of calculating each value in (7). There are 3 different labels in the example. The center circle is the unlabeled sample to be estimated, and the circles labeled 1, 2, and 4 are the selected samples, with their labels being the numbers. The lines are the distances between the unlabeled and the selected samples, tagged by $d_i$, where $i$ is the label number. In this process, for a specific unlabeled sample in one cluster, we assume that it can only belong to the labels of the selected samples. If more than one selected sample has the same label, as for label 1 in the figure, we choose the smallest distance; that is, $d_1$ is the smaller of $d_1^{(1)}$ and $d_1^{(2)}$. By doing so, we can calculate the distance of the unlabeled sample to each label, written as dist(unlabeled, $j$), $j = 1, 2, 4$. Subsequently, the probability of the unlabeled sample belonging to each label $j$ is $p_j$, where $j \in \{1, 2, 4\}$. For the labels (3, 5, and 6) that do not appear in this cluster, we set the probability $p_k$ to 0, where $k \in \{3, 5, 6\}$. Eventually, we have each estimated value of (7). The calculation of the probabilities is summarized as (8), in which the sum runs over the labels $m$ of the selected samples. In the same way, we can calculate each row of $W$.

$$p_j = \begin{cases} \dfrac{1/d_j}{\sum_{m} 1/d_m}, & \text{if } j \in \text{the labels of selected samples} \\ 0, & \text{otherwise.} \end{cases} \tag{8}$$
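As a worked illustration of this probability calculation, assume (hypothetically) that the nearest distances from an unlabeled sample to the selected samples of labels 1, 2, and 4 are $d_1 = 1$, $d_2 = 2$, and $d_4 = 4$. Normalizing the reciprocal distances gives the row of the sensitive matrix:

```latex
% Hypothetical distances: d_1 = 1, d_2 = 2, d_4 = 4
\begin{aligned}
\tfrac{1}{d_1} + \tfrac{1}{d_2} + \tfrac{1}{d_4} &= 1 + 0.5 + 0.25 = 1.75,\\
p_1 &= \tfrac{1}{1.75} = \tfrac{4}{7},\quad
p_2 = \tfrac{0.5}{1.75} = \tfrac{2}{7},\quad
p_4 = \tfrac{0.25}{1.75} = \tfrac{1}{7},\\
W_i &= \left[\tfrac{4}{7},\ \tfrac{2}{7},\ 0,\ \tfrac{1}{7},\ 0,\ 0\right].
\end{aligned}
```

The nearest label receives the largest share, the absent labels 3, 5, and 6 receive zero, and the row sums to 1.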
We call $W$ the sensitive matrix in this paper, and the pseudocode for calculating $W$ is given in Algorithm 2. For each cluster generated in the clustering phase, the distances between unlabeled and labeled samples are calculated within the cluster. $W$ has one column for each label in the dataset, and each value in a column represents the probability of an unlabeled sample belonging to that label. The calculation ensures that, in each cluster, the unlabeled samples belong only to the labels of the labeled samples in that cluster; the probability of belonging to a label that is not in the cluster is therefore 0, and each row sums to 1. In CSS, the number of clusters for each batch is set to 6, which may split the samples of a specific label into two or more clusters. In this case, the sensitive matrix calculation can still determine the probability of each unlabeled sample belonging to a certain label, because each split cluster has its own representative samples of that label.
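The sensitive matrix calculation for one cluster can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: labels are assumed to be integers 1 to `n_labels`, samples are tuples compared with Euclidean distance, and the function name `sensitive_matrix` and the small floor on distances (to avoid division by zero) are assumptions.

```python
import math

def sensitive_matrix(selected, selected_labels, unlabeled, n_labels):
    """One row per unlabeled sample; column j-1 holds the probability of label j."""
    W = []
    for u in unlabeled:
        # Nearest distance from u to the selected samples of each label present.
        nearest = {}
        for x, label in zip(selected, selected_labels):
            d = max(math.dist(u, x), 1e-12)  # floor is an assumption, guards d = 0
            nearest[label] = min(d, nearest.get(label, math.inf))
        # Reciprocal distance is the degree of membership; normalize to sum to 1.
        degrees = {label: 1.0 / d for label, d in nearest.items()}
        total = sum(degrees.values())
        row = [0.0] * n_labels  # labels absent from the cluster keep probability 0
        for label, degree in degrees.items():
            row[label - 1] = degree / total
        W.append(row)
    return W

# Toy usage: labels 1, 2, and 4 are present; label 1 has two selected samples.
selected = [(1, 0), (5, 0), (0, 1), (0, -2)]
selected_labels = [1, 1, 2, 4]
W = sensitive_matrix(selected, selected_labels, unlabeled=[(0, 0)], n_labels=6)
```

Here the unlabeled sample is at distance 1 from the nearest samples of labels 1 and 2 and at distance 2 from label 4, so the row is approximately [0.4, 0.4, 0, 0.2, 0, 0].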

Algorithm 2: Sensitive matrix calculation.
Input: $S$ := the selected samples in a cluster; $U$ := the unlabeled samples in a cluster;
Output: $W$ := the sensitive matrix;
(1) Initialize $W$ to a zero matrix;
(2) for each sample $u \in U$ do
(3)     Find the unique labels in $S$ as $Y$;
(4)     for each value $j \in Y$ do
(5)         Find the samples with label $j$ in $S$ as $S_j$;
(6)         Calculate the distances between $u$ and the samples in $S_j$;
(7)         Set $d_j$ to the nearest distance;
(8)         Store $1/d_j$ in $W$;
(9)     end for
(10)    Replace the values with probabilities in $W$ using (8);
(11) end for
(12) return $W$;
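Let $H_u$ denote the hidden layer output of the unlabeled samples after weighting by $W$, and let $L$ denote the objective function in (6). Its derivative with respect to $\beta_T$ is (9). We can solve the problem by setting (9) to zero, which gives (10), in which $I$ is the identity matrix.

$$\frac{\partial L}{\partial \beta_T} = \beta_T - C_T H_T^{\top}(t_T - H_T\beta_T) - C_{Tu} H_u^{\top}(H_u\beta_S - H_u\beta_T) \tag{9}$$

$$(I + C_T H_T^{\top}H_T + C_{Tu} H_u^{\top}H_u)\,\beta_T = C_T H_T^{\top}t_T + C_{Tu} H_u^{\top}H_u\beta_S \tag{10}$$

If $H_T$ has more rows than columns, (6) is an overdetermined problem and the least squares solution is unique. According to the work in [38,52], a special RBF function, that is, the Gaussian RBF function, ensures that the inverses of $H^{\top}H$ and $HH^{\top}$ exist. Therefore, we use this type of function in the paper. Note that $C_T$ and $C_{Tu}$ are positive. By using the formulas for the generalized inverse of a sum of matrices [53], it can be verified that $I + C_T H_T^{\top}H_T + C_{Tu} H_u^{\top}H_u$ has a unique generalized inverse, which is also its inverse. In this case, $\beta_T$ can be written as follows:

$$\beta_T = (I + C_T H_T^{\top}H_T + C_{Tu} H_u^{\top}H_u)^{-1}(C_T H_T^{\top}t_T + C_{Tu} H_u^{\top}H_u\beta_S) \tag{11}$$

For the case where $H_T$ has fewer rows than columns, (6) becomes an underdetermined problem. In L. Zhang and D. Zhang's work [21], the problem is solved by using Lagrange multipliers. That method is equivalent to adding the assumption that $\beta_T$ is a linear combination of the columns of $H_T^{\top}$ and $H_{Tu}^{\top}$; that is, $\beta_T = H_T^{\top}\alpha_T + H_{Tu}^{\top}\alpha_{Tu}$. It is hard to determine whether this applies to all cases. However, since the two cases are separated based on the numbers of rows and columns of $H_T$, the unlabeled term $H_{Tu}$ should be excluded. Therefore, in this paper, we assume that $\beta_T$ is a linear combination of the columns of $H_T^{\top}$, written as (12), so as to get a unique solution. The problem then becomes solving for $\alpha$.

$$\beta_T = H_T^{\top}\alpha \tag{12}$$

By multiplying both sides of (10) by $(H_T H_T^{\top})^{-1}H_T$, we get

$$(H_T H_T^{\top})^{-1}H_T\,(I + C_T H_T^{\top}H_T + C_{Tu} H_u^{\top}H_u)\,\beta_T = (H_T H_T^{\top})^{-1}H_T\,(C_T H_T^{\top}t_T + C_{Tu} H_u^{\top}H_u\beta_S) \tag{13}$$

To solve for $\alpha$, we substitute (12) into (13) and get

$$(H_T H_T^{\top})^{-1}H_T\,(I + C_T H_T^{\top}H_T + C_{Tu} H_u^{\top}H_u)\,H_T^{\top}\alpha = (H_T H_T^{\top})^{-1}H_T\,(C_T H_T^{\top}t_T + C_{Tu} H_u^{\top}H_u\beta_S) \tag{14}$$

For simplicity, let $A$ be $H_T H_T^{\top}$ and $B$ be $H_T H_u^{\top}$. Similarly, we can verify that $I + C_T A + C_{Tu} A^{-1}BB^{\top}$ has an inverse. Subsequently, we can get $\alpha$ as (15), and $\beta_T$ can be written accordingly as (16).

$$\alpha = (I + C_T A + C_{Tu} A^{-1}BB^{\top})^{-1}(C_T t_T + C_{Tu} A^{-1}B H_u\beta_S) \tag{15}$$

$$\beta_T = H_T^{\top}(I + C_T A + C_{Tu} A^{-1}BB^{\top})^{-1}(C_T t_T + C_{Tu} A^{-1}B H_u\beta_S) \tag{16}$$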
The pseudocode for WDTELM with CSS is listed as Algorithm 3. Before the training begins, we set the number of clusters ($c$) to the number of gases in the dataset. The algorithm trains a new ELM with a preset number of hidden layer neurons using the source domain samples (line (1)). For the target domain, it uses CSS to select a group of labeled samples (lines (3)-(7)) and then initializes a new ELM network with the same number of hidden layer neurons (line (8)). The output weight $\beta_T$ of the ELM in the target domain is then calculated according to the aforementioned cases (lines (9)-(13)).
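The two cases for the target output weight can be sketched as below. This is a sketch under explicit assumptions rather than the authors' implementation: the objective is taken to be $\frac{1}{2}\|\beta_T\|^2 + \frac{C_T}{2}\|t_T - H_T\beta_T\|^2 + \frac{C_{Tu}}{2}\|H_u\beta_S - H_u\beta_T\|^2$, where `H_u` is the weighted hidden layer output of the unlabeled samples. How the weights enter `H_u` is an assumption here (one hypothetical confidence value per sample, scaling the corresponding row), and the function name `solve_beta_t` is illustrative.

```python
import numpy as np

def solve_beta_t(H_T, t_T, H_u, beta_S, C_T=100.0, C_Tu=0.001):
    """Closed-form target output weights for the two cases described above."""
    n_labeled, n_hidden = H_T.shape
    if n_labeled >= n_hidden:
        # Stationarity condition:
        # (I + C_T H_T'H_T + C_Tu H_u'H_u) beta = C_T H_T't_T + C_Tu H_u'H_u beta_S
        K = np.eye(n_hidden) + C_T * H_T.T @ H_T + C_Tu * H_u.T @ H_u
        rhs = C_T * H_T.T @ t_T + C_Tu * H_u.T @ (H_u @ beta_S)
        return np.linalg.solve(K, rhs)
    # Fewer labeled samples than hidden nodes: restrict beta_T = H_T' alpha.
    A = H_T @ H_T.T
    B = H_T @ H_u.T
    A_inv_B = np.linalg.solve(A, B)
    M = np.eye(n_labeled) + C_T * A + C_Tu * A_inv_B @ B.T
    rhs = C_T * t_T + C_Tu * A_inv_B @ (H_u @ beta_S)
    alpha = np.linalg.solve(M, rhs)
    return H_T.T @ alpha

# Toy usage with random matrices standing in for hidden layer outputs.
rng = np.random.default_rng(0)
beta_S = rng.normal(size=(10, 3))
H_Tu = rng.normal(size=(40, 10))
w = rng.uniform(0.0, 1.0, size=(40, 1))  # hypothetical per-sample confidence
H_u = w * H_Tu                           # assumed form of the weighting
H_T_tall, t_tall = rng.normal(size=(30, 10)), rng.normal(size=(30, 3))
H_T_wide, t_wide = rng.normal(size=(5, 10)), rng.normal(size=(5, 3))
beta_tall = solve_beta_t(H_T_tall, t_tall, H_u, beta_S)
beta_wide = solve_beta_t(H_T_wide, t_wide, H_u, beta_S)
```

Both branches use `np.linalg.solve` instead of forming explicit inverses, which is a numerical choice of this sketch and does not change the result.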

Online Domain Transfer Extreme Learning Machine.
For DAELM and WDTELM to be applicable, an initial set of data is required. However, in real application scenarios, the data may not be accessible in full. More commonly, data arrive in a one-by-one or chunk-by-chunk manner. Retraining the classifier offline whenever new data arrive would be unrealistic and time-consuming. As DAELM and WDTELM are both based on batch training and updating, they share this problem. Therefore, an online learning process is needed.
In this section, we consider a simple online scenario in which the labeled samples have been determined or provided beforehand. The unlabeled samples are then fed into the model in a one-by-one or chunk-by-chunk manner, and we wish to use the semisupervised method of the previous subsections without retraining the network from scratch.
ODTELM is illustrated in Figure 5. The unlabeled samples are organized in a sequence, shown as the left rectangle. The target classifier is initialized with the labeled samples using (4). In the online learning phase, we wish to update the target classifier using the current $\beta_T$, the incremental value $h$, and some intermediate result(s).
In this case, we have insufficient data to perform clustering before training. However, given the labeled samples, we can still calculate the probability of each unlabeled sample. To derive the proper formulas, we begin with the unweighted version, that is, DAELM, and use the same objective function as in the batch setting. Assume an incremental batch is added to the unlabeled data in the target domain and let the corresponding hidden-layer output be $h$. The output weight $\beta_T^{(k+1)}$ can be calculated for the two cases described above as (18), where $A = H_T H_T^{\top}$ and $B = H_T H_{Tu}^{\top}$.
Consequently, the output weight $\beta_T^{k+1}$ can be derived from the previous output weight and the intermediate result. The pseudocode for the updating procedure of ODAELM is given in Algorithm 4. Whenever a new sample arrives, the algorithm takes the intermediate result and the output weight calculated in the last update and updates both of them according to the two cases. Because each update reuses the results of previous updates, training remains fast even when the unlabeled set becomes very large. This is essential for seamless service in scenarios such as antiterrorism and security checkpoints.
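One way such an update could look is sketched below for the overdetermined case of the unweighted objective. This is an illustration under stated assumptions, not the formulas of Algorithm 4: it keeps the inverse $P_k = K_k^{-1}$ with $K_k = I + C_T H_T^{\top}H_T + C_{Tu} H_{Tu}^{\top}H_{Tu}$ as the intermediate result, refreshes it with the standard Woodbury identity, and uses $\beta^{k+1} = \beta^{k} + C_{Tu} P_{k+1} h^{\top} h(\beta_S - \beta^{k})$. All function names and the toy data are hypothetical.

```python
# Sketch of a sequential update (overdetermined, unweighted case).
# All names are hypothetical; matrices are plain lists of rows.

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def add(A, B, s=1.0):
    """Return A + s * B elementwise."""
    return [[a + s * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def identity(n):
    return [[float(i == j) for j in range(n)] for i in range(n)]

def inverse(M):
    """Gauss-Jordan inversion with partial pivoting (adequate for a small demo)."""
    n = len(M)
    eye = identity(n)
    aug = [[float(v) for v in row] + eye[i] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def init_target(H_T, t_T, C_T):
    """Initialize from labeled samples only: P = (I + C_T H_T^T H_T)^-1."""
    HtT = transpose(H_T)
    P = inverse(add(identity(len(H_T[0])), matmul(HtT, H_T), C_T))
    beta = matmul(P, [[C_T * v for v in row] for row in matmul(HtT, t_T)])
    return P, beta

def online_update(P, beta, h, beta_S, C_Tu):
    """Absorb one chunk h (rows = new unlabeled samples) without retraining."""
    hT = transpose(h)
    # Woodbury identity: only a (chunk size x chunk size) matrix is inverted.
    S = add(matmul(matmul(h, P), hT), identity(len(h)), 1.0 / C_Tu)
    gain = matmul(matmul(P, hT), inverse(S))
    P_new = add(P, matmul(gain, matmul(h, P)), -1.0)
    # beta_{k+1} = beta_k + C_Tu * P_{k+1} h^T h (beta_S - beta_k)
    residual = matmul(h, add(beta_S, beta, -1.0))
    beta_new = add(beta, matmul(P_new, matmul(hT, residual)), C_Tu)
    return P_new, beta_new

def batch_solve(H_T, t_T, H_Tu, beta_S, C_T, C_Tu):
    """Reference solution that retrains from scratch on all data."""
    HtT, HuT = transpose(H_T), transpose(H_Tu)
    K = add(add(identity(len(H_T[0])), matmul(HtT, H_T), C_T),
            matmul(HuT, H_Tu), C_Tu)
    rhs = add([[C_T * v for v in row] for row in matmul(HtT, t_T)],
              matmul(HuT, matmul(H_Tu, beta_S)), C_Tu)
    return matmul(inverse(K), rhs)

# Toy data: 3 labeled samples, 2 hidden neurons, 4 unlabeled samples, 1 output.
H_T = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
t_T = [[1.0], [-1.0], [0.0]]
H_Tu = [[0.7, 0.3], [0.1, 0.9], [0.6, 0.4], [0.3, 0.7]]
beta_S = [[0.8], [-0.6]]
C_T, C_Tu = 100.0, 0.001

P, beta_online = init_target(H_T, t_T, C_T)
for row in H_Tu:  # feed the unlabeled samples one by one
    P, beta_online = online_update(P, beta_online, [row], beta_S, C_Tu)
beta_batch = batch_solve(H_T, t_T, H_Tu, beta_S, C_T, C_Tu)
max_gap = max(abs(a[0] - b[0]) for a, b in zip(beta_online, beta_batch))
print(max_gap)  # difference between online and batch weights
```

In this sketch the per-update cost depends on the number of hidden neurons and the chunk size, not on how many unlabeled samples have already been absorbed, which matches the roughly constant update time reported for the online versions.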
Note that the formulas are derived from DAELM. To perform online learning for WDTELM, taking Figure 5 as an example, the weight $W$ can be calculated by comparing the output of the ELM with $\beta_S$ against the labeled samples. Let $w$ be the weight for the incoming sample. Then, we can simply replace $H_{Tu}$ and $h$ in this subsection with $W \circ H_{Tu}$ and $w \circ h$, respectively, and the formulas still hold.

Experimental Setup.
The experiments in the paper were conducted in MATLAB on a Linux workstation with an E5 2.6-GHz CPU and 32 GB of RAM. We followed the setup in [21]. By default, the ELM network had 1000 hidden-layer neurons with the RBF function as their activation function. The coefficients $C_S$, $C_T$, and $C_{Tu}$ were set to 0.001, 100, and 0.001, respectively. We also tested networks with 100 hidden-layer neurons to show the performance of the proposed methods.
For WDTELM and its counterparts, the experiments used the current batch as the source domain and the next batch as the target domain. For example, batch one was first used as the source domain and batch two as the target domain. After the classification of batch two was finished, batch two became the source domain and batch three the target domain. The performance was evaluated by the classification accuracy using (5). The proposed method was compared with SVM with an RBF kernel (SVM-rbf), ELM with an RBF activation function (ELM-rbf), ELM-based ensemble methods (Ensemble ELM), SVM-based ensemble methods (Ensemble SVM), and DAELM with diverse numbers of selected samples. The ensemble methods trained a subclassifier on each batch using the corresponding algorithm and combined it with the previously learned subclassifiers to form a single classifier. For example, when batch (3) was the target domain, batches 1 and 2 were each used to train a subclassifier. The weight for each subclassifier followed the work in [20]; that is, it was the training accuracy on the corresponding batch.
For online learning, the source and target domains were divided in the same way as described in the previous paragraph. The difference was that, after the representative samples were fixed, the unlabeled samples were tested in sequence and fed to the classification model one by one for updating. The overall accuracy was recorded after all the unlabeled samples had been used for updates.
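The rolling source/target protocol and the accuracy measure in (5) can be sketched as follows. This is a minimal illustration with hypothetical helper names (`accuracy`, `rolling_domains`); batch contents and classifiers are left out.

```python
def accuracy(predicted, actual):
    """Accuracy as in (5): correctly classified samples over all samples."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

def rolling_domains(batch_ids):
    """Pair each batch (source domain) with the next one (target domain)."""
    return list(zip(batch_ids[:-1], batch_ids[1:]))

# Ten batches give nine (source, target) pairs, one per evaluated batch.
pairs = rolling_domains(list(range(1, 11)))
for source, target in pairs:
    print(f"train on batch {source}, adapt and test on batch {target}")

# Toy labels for one target batch (hypothetical gas names).
demo_accuracy = accuracy(["ethanol", "acetone", "ammonia", "ethanol"],
                         ["ethanol", "acetone", "ethanol", "ethanol"])
```

With ten batches the first batch only ever serves as a source domain, which is why the result tables report nine batches.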
Performance Evaluation.
Table 2 shows the overall classification performance of the different models used in the experiments. The last row, labeled "Ave.", gives the average accuracy of each method; this convention applies to Tables 2, 3, 5, and 6. Note that the performance of WDTELM surpasses that of all its counterparts except for batch (3), where it is around 3% to 4% lower than that of DAELM. For batches (5) and (8), DAELM beats WDTELM by only 0.1% to 0.7%, which is not a meaningful difference. The accuracies of ELM and SVM are more than 20% lower than those of DAELM and WDTELM. The ensemble-based methods have better classification accuracy in general, owing to the increasing amount of training data used for generating subclassifiers. Although they alleviate the drift to some extent, they do not perform as well as the semisupervised methods in the paper. Although WDTELM performs slightly worse than DAELM on specific batches, such as batch (3), it beats DAELM overall. For batches (7) and (10), where the classification accuracy of DAELM is lower than on other batches, WDTELM improves the accuracy by 5% to 10%. On the other batches, WDTELM performs slightly better. In general, therefore, WDTELM captures the changes in the data more accurately and can better help the gas sensors recover from the degradation caused by drift. Moreover, the labeled samples for WDTELM are chosen with the selection parameter equal to 5 and 8, and the selection process chooses no more than 30 and 48 samples, respectively. Therefore, WDTELM never uses more selected samples than DAELM. In real application scenarios, the manual labeling process is more time-consuming than training an ELM from scratch. In this sense, WDTELM saves more time than DAELM.
The overall performance shown in Table 2 confirms that a proper choice of selected samples helps the model achieve better classification performance. To better show the performance improvements of WDTELM over DAELM, we experimented with different numbers of selected samples for DAELM and WDTELM. Table 3 shows the performance of the two methods using 6 different numbers of selected samples each. The columns correspond to the number of selected samples and the rows show the classification accuracies for the different batches; the last row shows the average classification performance for each number of selected samples. Note that for DAELM this number is the exact number of selected samples, while for WDTELM it is an upper bound because of the CSS process. In general, the average performance increases with the number of selected samples, and this trend holds for almost all rows, with a few exceptions such as batch (2) with 45 selected samples. Increasing the number of selected samples almost guarantees a performance increase. However, the price is the manual labeling process. If that process were quick, there would be no need for classification models. Therefore, maximum performance with minimum labeling is the ideal for semisupervised learning scenarios. In the table, the classification accuracy of WDTELM is around 10% higher in those settings below 30 selected samples where DAELM performs poorly, for example, 20 selected samples for DAELM versus fewer than 18 for WDTELM in batch (2). DAELM approaches its maximum classification performance beyond 30 selected samples. As an exception, DAELM surpasses WDTELM in batch (3) by around 4% beyond 40 selected samples. This difference is small compared with the gaps on other batches where DAELM has low accuracy, and the accuracies of both methods exceed 90% as the number of selected samples increases. Therefore, we consider both feasible for application scenarios. In general, together with Table 2, DAELM does not surpass WDTELM unless the number of selected samples is large, say 50. If manually determining the gas type takes 1 h per sample, DAELM would need 10 more hours before its accuracy outperforms that of WDTELM for some batches.
The number of labeled samples is chosen from [12, 18, 24, 30, 36, 42, 48] and the overall classification accuracies are recorded accordingly.

Wireless Communications and Mobile Computing
To better illustrate the performance differences between DAELM and WDTELM, we conducted experiments with the same number of selected samples for each batch. The number of hidden-layer neurons is set to 100 so as to observe the effects of a smaller ELM network. The results are shown in Figure 6, in which the red curves represent the classification performance of WDTELM and the blue ones that of DAELM. As can be observed from the figure, the red curves show a significant accuracy increase over the blue ones in batches (2), (6), (7), and (10). The maximum increase exceeds 20%, for example, with 24 selected samples in batch (6). For batches (4), (5), and (8), the curves intertwine, which indicates that the performance is similar. For batches (3) and (9), DAELM is better than WDTELM by around 5% to 10%. However, the curves begin to overlap beyond 42 selected samples. Taking batch (9) as an example, the difference beyond 42 selected samples is hard to distinguish. The reason for this phenomenon may be that the distribution of the clusters in these two batches changes more severely than in other batches, which makes it hard for CSS in WDTELM to capture representative samples in some clusters. Fortunately, the performance difference can be reduced by increasing the number of selected samples. Meanwhile, where DAELM beats WDTELM, the accuracies of both methods are relatively high, say over 90%. Note that, when the number of selected samples is small, for example, 12 or 18, DAELM performs poorly. For example, for batches (2), (6), and (8), its accuracy drops below 60%, while WDTELM maintains an accuracy above 70%.
To further assess the significance of the difference in classification accuracy between the two methods, we performed Friedman's test. The results are grouped by the number of selected samples; that is, each group holds the corresponding average accuracies of the two methods on the 9 batches. Therefore, there are 7 groups in total. The $p$ value of Friedman's test is $9.0214 \times 10^{-18}$, which means that the classification performances differ significantly. To show which method surpasses the other, we summarize the average classification accuracy in Table 4, together with that of other batch learning algorithms. To further show that the improvements are not only due to the labeled samples in the target domain, we train ELM and SVM with the labeled target-domain samples combined with the source domain. Although the average accuracy of SVM-rbf is higher than that in Table 2, it is still over 10% lower than those of DAELM and WDTELM. The ELM-rbf method performs poorly, which is probably due to the insufficient number of hidden-layer nodes. Hence, the labeled samples are not the only reason for the higher accuracy of the semisupervised methods. It can also be seen that WDTELM is better than DAELM in terms of average classification performance. Therefore, WDTELM is considered more effective than the other tested methods in general.
For the online learning process, we aim for accuracies approximately equal to, if not better than, those of the batch learning versions. To evaluate the overall performance, we conducted experiments on both DAELM and WDTELM. For comparison purposes, we use SSA and CSS on DAELM and OWDTELM, respectively, so as to see the difference between them and their batch versions. Table 5 shows the results of the online version of DAELM. The online version reaches classification accuracy similar to that of DAELM. For some batches, for example, batches (3) and (10), it surpasses its batch learning counterpart. This may be because the online learning process updates the model whenever a new sample arrives, which may help it learn the differences between samples.
Table 6 shows the classification accuracy of OWDTELM. The accuracies were recorded after the method had been updated with all the unlabeled samples in each batch. As shown in the table, the accuracy of OWDTELM largely matches that of its batch learning version. Although small deviations occur, for example, for batches (2) and (3) when the selection parameter is 8, the gap is not significant. For batches (2) and (3), the accuracy decreases by less than 2%, while for batch (10) it increases by 3%. The average accuracy of OWDTELM is higher than that of WDTELM for some values of the selection parameter, such as 6. The difference is below 1.5%, which can be partly attributed to the experimental error caused by the randomization of the ELM neurons. In general, OWDTELM achieves performance similar to that of its batch learning version.
The processing time was recorded whenever ODAELM and OWDTELM updated the model. On average, an update of ODAELM takes 0.0948 seconds and fluctuates within 0.0139 seconds. OWDTELM takes slightly more time, 0.0958 seconds on average, and fluctuates within 0.0125 seconds. This differs from batch learning, whose processing time increases with the number of samples. We also replaced the online updates with their batch learning counterparts, that is, DAELM and WDTELM, respectively. In this case, the methods retrain the model from scratch whenever a new sample arrives, and the processing time increases drastically. With 400 unlabeled samples, the updating procedure took over 0.2 seconds for both methods, which is more than twice the time of the online versions. If the number of samples grows too large, the updating will take minutes, or even hours, to complete.

Discussion
In evaluating the proposed methods, we compared their performance in terms of classification accuracy. In general, the proposed CSS helps in choosing more representative samples than KS in DAELM does, and the weighted strategy further helps the learning process to be more accurate. The online versions of DAELM and WDTELM achieve performance similar to that of their batch learning versions and are suitable for scenarios where the unlabeled samples are not accessible before training.
It should also be noted that the proposed methods are not limited to E-Nose systems. Other domains with similar characteristics are also worth trying. Meanwhile, the online version of the proposed method considers only a simple scenario as a starting point. More sophisticated cases, including, but not limited to, incremental learning of labeled samples, samples switching from unlabeled to labeled, and changes in data distribution, should be considered in future research.

Conclusions
In this paper, we proposed WDTELM to reduce the number of labeled samples needed by DAELM while achieving similar or better performance. Then, aiming at the online learning process, we proposed online learning versions of DAELM and WDTELM, named ODTELM and OWDTELM, respectively, which allow the unlabeled samples to be added to the classifier in a one-by-one or chunk-by-chunk manner after the labeled samples are given. The experimental results show that WDTELM outperforms DAELM in classification accuracy with fewer labeled samples. Meanwhile, it also performs better than other commonly used approaches such as SVM, ELM, and ensemble-based methods. The experiments on the online versions of DAELM and WDTELM verify that each achieves nearly the same classification accuracy as its batch learning version. The processing time also confirms that the online versions save time compared with their batch learning versions.
$\text{Accuracy} = \dfrac{\text{Number}_{\text{pos}}}{\text{Total Number}}$, where $\text{Number}_{\text{pos}}$ represents the number of correctly labeled samples and Total Number is the number of all the samples.

Figure 1: Data distributions for 10 batches without normalization. The three axes represent the first 3 features after PCA transformation. Labels of the samples are painted in different colors for better observation.

Figure 2: Sample selection in each cluster.

Figure 3: Demonstration of determining the effects of different unlabeled samples.

Figure 4: The probability of a sample belonging to a certain label.

Figure 5: Demonstration of online domain transfer extreme learning machine.
Input: the output weight matrix of the ELM for the target domain, $\beta_T^{k}$; the number of hidden-layer neurons; the intermediate result from the previous update; the labeled samples; the increment of unlabeled samples.
Output: the updated output weight matrix, $\beta_T^{k+1}$; the updated intermediate result.
(1) Calculate the hidden-layer output for the labeled samples as $H_T$;

Table 2: Comparisons of classification accuracy in 9 batches.

Table 3: Comparisons of DAELM and WDTELM using different numbers of selected samples.