Learning Deep Attention Network from Incremental and Decremental Features for Evolving Features

In many real-world machine learning problems, the features change over time: some old features vanish, some new features are added, and the remaining features survive. In this paper, we propose the cross-feature attention network to handle incremental and decremental features. This network is composed of multiple cross-feature attention encoding-decoding layers. In each layer, the data samples are first encoded as combinations of other samples' vanished/augmented features, weighted by attention weights calculated from the survived features. Then, the samples are encoded as combinations of other samples' survived features, weighted by attention weights calculated from the encoded vanished/augmented feature data. The encoded vanished/augmented/survived features are then decoded and fed to the next cross-feature attention layer. In this way, the incremental and decremental features are bridged by paying attention to each other, and the gap between data samples with different sets of features is filled by the attention mechanism. The outputs of the cross-feature attention network are further concatenated and fed to the class-specific attention and global attention networks for the purpose of classification. We evaluate the proposed network on benchmark data sets from computer vision, IoT, and bio-informatics, with incremental and decremental features. Encouraging experimental results show the effectiveness of our algorithm.


Background.
In machine learning problems, a basic assumption is that the data samples have consistent and stable features. These features are usually generated by a set of sensors and used by the machine learning models as inputs. However, in many real-world applications, this assumption does not hold, and the features change, with some old features vanishing and some new features being added. For example, in environmental monitoring, different sensors are deployed, including gravimetric, optical, and electrochemical sensors [1][2][3][4]. These sensors have different life cycle lengths and different working conditions. Thus, some sensors expire sooner than others, and the corresponding features vanish sooner. Meanwhile, some other sensors can be used for a long time and continue to generate features. Their features survive throughout the data collection process. Moreover, with the development of sensor technology, new sensors are produced and deployed and begin to generate newly augmented features. As a result, the set of working sensors evolves over time and the features change accordingly. Some old features vanish and some new features are augmented, while the remaining features survive. This scenario makes the features unstable and challenges the stable-feature assumption of most popular machine learning settings [5][6][7][8]. This problem is called the incremental and decremental feature (IDF) problem. Despite the importance of the IDF problem, surprisingly few works have tried to solve it directly [6], and their performance is not satisfying.
Meanwhile, the deep attention network has become a popular method in machine learning. An attention mechanism represents a data instance not only by itself but also by paying attention to other instances, weighted by attention weights. The attention weights are usually calculated from the instance features and then normalized by the softmax function [9][10][11][12]. There are two types of attention networks: self-attention [13][14][15][16] and cross-attention networks [17][18][19][20]. The self-attention mechanism calculates the attention weight of each instance from itself, while the cross-attention mechanism usually calculates attention weights according to the similarity between itself and the other instances. However, most existing attention networks only pay attention to instances and assume that the features are stable. Thus, the attention mechanisms of existing methods cannot be applied to the IDF problem.
In this paper, we propose a novel solution for the IDF problem with a cross-feature attention network. Our solution pays attention to the vanished, survived, and augmented features to bridge the gaps among the features of evolving sensors. This is the first work on an attention mechanism across features, and it fits the nature of the IDF problem.

Related Works.
In this section, we review the related works on IDF, although only a few such works exist.
(i) Hou and Zhou [6] developed an algorithm to handle incremental and decremental features together with streaming data instances. This algorithm has two stages. The first stage compresses the vanished features by learning a classifier in the vanished feature space so that the important information of the vanished features is embedded in the trained classifier. The second stage is an expanding stage. It not only includes the augmented features but also tries to balance the vanished, survived, and augmented features. The balancing is conducted by imposing consistency on the classification responses with/without the augmented features. Moreover, the learning strategy is one-pass learning, which takes only one training sample to update the model in each iteration.
(ii) Ma et al. [7] designed a transfer learning method for domains where only a part of the features are shared, while the other features are different. This problem setting is similar to the IDF problem, given the partially shared feature space. To be specific, the target domain not only has the source domain's features but also has some newly augmented features. To solve this problem, this method also imposes that the target domain's classification responses of data with/without augmented features be consistent with each other. Moreover, the features shared across the source and target domains are jointly regularized to be consistently sparse, i.e., the importance of the same feature should be consistent across the two domains.
(iii) Wu et al. [21] proposed a feature selection algorithm to handle streaming features. In this scenario, the features are not known from the very beginning of the learning process but arrive one by one, while the number of training samples remains the same. The algorithm is designed to select the most important features from the streaming feature set. The selected features should be not only relevant but also nonredundant. The feature selection is performed iteratively. When the algorithm receives a new feature, it first determines whether the feature is relevant to the class. If the feature is irrelevant or redundant, it is dropped. Otherwise, the feature is selected. The problem of streaming features is a special case of the IDF problem, which handles only the streaming incremental features and ignores the decremental features.
Among these existing methods, both [6] and [7] solve the IDF problem by imposing consistency of classification responses with/without augmented features. The interaction among the vanished/augmented/survived features is not explored directly. Thus, the cross-feature information is not utilized effectively to boost the learning performance.

Our Contribution.
To fill this gap, in this paper, we propose the first attention network that pays attention from one feature set to another. Our motivation is the belief that, even though the features change across sample batches collected at different times, the feature sets are intrinsically related and are complementary for the purpose of classifying the samples. Sensor evolution changes the observed features, but the features would be complete and consistent in an ideal situation where no sensor expires and all sensors are deployed from the very beginning. Thus, we would like to recover the vanished features for the new batch of data and the augmented features for the old batch of data. For this purpose, we encode each sample by paying attention to the vanished/augmented features. To explore the feature relationship, we calculate the attention weights from the survived features. In this way, we obtain a vanished/augmented feature-attention code vector for each sample, even if the sample itself has no vanished/augmented features, by bridging it to the samples that do have them with the help of survived-feature attention. With the vanished/augmented feature-attention code vectors, we pay attention back to the survived features by encoding each sample as a combination of other samples' survived features. The attention weights are in turn calculated from the codes of the preceding cross-attention layers. Decoders are also applied to recover the original features from the code vectors, and the recovered features are the inputs of the next cross-attention layers. In this way, we design a deep cross-feature attention network to represent samples with IDF. The encoded vectors of the network are further represented by a set of class-specific attention networks and a global attention network for the purpose of classification.
Our contribution is threefold:
(1) We design a novel deep neural network with new cross-attention layers for the purpose of learning from evolving features
(2) We propose a novel learning algorithm to optimize the parameters of the network in a supervised way
(3) We evaluate the proposed algorithm experimentally regarding parameter sensitivity, running time, and comparison to other algorithms

Paper Organization.
This paper is organized as follows. In Section 2, we describe the new network of cross-feature attention. In Section 3, we evaluate the proposed method experimentally. In Section 4, the conclusion is given.

Problem Setting.
In this section, we discuss a learning problem with changing features. Suppose we have a training set of $n$ data samples, and these samples belong to two batches. One batch has $n_o$ data samples generated from a set of historical sensors, and the other batch has $n_c = n - n_o$ data samples generated from the current set of sensors. Compared to the historical sensor set, some old sensors have vanished, some new sensors have been added to the current sensor set, and the remaining sensors remain the same. The old data batch is denoted as $X_o = \{(x_i, y_i, \psi_i)\}_{i=1}^{n_o}$, where $x_i \in \mathbb{R}^{d_1}$ is the ith sample's feature vector of the $d_1$ vanished sensors, $y_i \in \mathbb{R}^{d_2}$ is its feature vector of the $d_2$ survived sensors, and $\psi_i$ is its class label. The current data batch is denoted as $X_c = \{(y_i, z_i, \psi_i)\}_{i=n_o+1}^{n}$, where $y_i \in \mathbb{R}^{d_2}$ is the ith sample's feature vector of the $d_2$ survived sensors, while $z_i \in \mathbb{R}^{d_3}$ is its feature vector of the $d_3$ newly added sensors, and $\psi_i$ is its class label. The overall training data set is $X = X_o \cup X_c$, and the learning problem is to learn a model to predict the class label of a test data sample with features of the current sensors.

Network Architecture.
To represent each data sample of both the current and old batches, we propose a deep cross-feature attention representation network and a discriminative network to separate samples of different classes.

Cross-Feature Attention Layers.
Given the ith data sample, to represent it, we propose to pay attention from itself to three feature spaces, which are the vanished sensor space, survived sensor space, and newly added sensor space.
(1) Attention to Vanished Sensor Space. We first pay attention from the ith sample to the old batch $X_o$, even if the ith sample is from the current batch. To this end, we calculate its similarity to the jth sample of $X_o$ in the feature space of the survived sensors shared by both batches. The similarity between the ith and jth samples is calculated as
$$s_{ij} = y_i^\top A y_j, \tag{1}$$
where $A \in \mathbb{R}^{d_2 \times d_2}$ is the parameter matrix of the similarity function. The attention score from the ith sample to the jth sample regarding the survived sensors is obtained by applying a softmax function to the similarity scores:
$$\alpha_{ij} = \frac{\exp(s_{ij})}{\sum_{j' \in X_o} \exp(s_{ij'})}. \tag{2}$$
With the attention scores calculated from the survived sensor features, we represent the ith sample by combining the transformed vanished sensor features weighted by these attention scores:
$$f_i = \sum_{j \in X_o} \alpha_{ij} \Theta x_j, \tag{3}$$
where $\Theta$ is the transforming matrix. Please note that, in this attention-based representation of the ith sample, the attention scores are calculated in the survived sensor space, while the attention base vectors are in the vanished sensor feature space.
(2) Attention to Newly Added Sensor Space. We also pay attention to the current batch of training samples in the space of newly added sensors. To this end, we calculate the attention weights in the space of the survived sensors and use them to weigh the samples of the current batch in the new sensor space. We first calculate the similarity between the ith sample ($i \in X$) and the jth sample of the current batch:
$$t_{ij} = y_i^\top B y_j, \tag{4}$$
where $B$ is the similarity function parameter matrix. Accordingly, we apply a softmax function to the similarities to calculate the attention weights:
$$\beta_{ij} = \frac{\exp(t_{ij})}{\sum_{j' \in X_c} \exp(t_{ij'})}. \tag{5}$$
The new representation of the ith sample, obtained by attention to the current training batch in the space of newly added sensors, is the combination of the transformed samples with the above weights:
$$h_i = \sum_{j \in X_c} \beta_{ij} \Phi z_j, \tag{6}$$
where $\Phi$ is the transforming parameter matrix. The cross-feature mapping is thus performed on the newly added features with weights from the survived features.
(3) Attention to Survived Sensor Space. The attention to the samples of the entire data set is weighted by the representations of the above two layers. Given the ith sample, we first concatenate the two vectors of the last two attention layers, $f_i$ and $h_i$, into a longer vector, $[f_i; h_i]$. With this vector, we calculate the similarity between the ith and jth samples:
$$r_{ij} = [f_i; h_i]^\top E \, [f_j; h_j], \tag{7}$$
where $E$ is the similarity function parameter matrix. The attention weights are calculated by softmax:
$$c_{ij} = \frac{\exp(r_{ij})}{\sum_{j' \in X} \exp(r_{ij'})}. \tag{8}$$
The attention layer output vector of the ith sample is the combination of the features of the survived sensors weighted by the attention weights in (8):
$$g_i = \sum_{j \in X} c_{ij} \Psi y_j, \tag{9}$$
where $\Psi$ is the transforming matrix.
(4) Decoding Layer. With the above three layers of cross-attention, we have three representation vectors $f_i$, $g_i$, and $h_i$. We can further decode the sensor features from these vectors as the inputs of the next layers in a deep network architecture. The decoding layers are dense layers with activation layers:
$$\hat{x}_i = \varphi(W f_i), \quad \hat{y}_i = \varphi(V g_i), \quad \hat{z}_i = \varphi(R h_i), \tag{10}$$
where $W$, $V$, and $R$ are the dense layer parameter matrices and $\varphi(\cdot)$ is the activation function. Given the above base layers, we build a multiple-layer cross-feature attention network by feeding the outputs of the decoding layer to the old and newly added sensor attention layers of the next layer. In the lth layer, the outputs of the (l − 1)th layer for the ith sample are $x_i^{l-1}$, $y_i^{l-1}$, and $z_i^{l-1}$, and they are used to estimate $f_i^l$ and $h_i^l$ of this layer according to (3) and (6):
$$f_i^l = \sum_{j \in X_o} \alpha_{ij}^l \Theta^l x_j^{l-1}, \quad h_i^l = \sum_{j \in X_c} \beta_{ij}^l \Phi^l z_j^{l-1}, \tag{11}$$
where $\alpha_{ij}^l$ and $\beta_{ij}^l$ are the attention weights of (2) and (5) computed from the survived features $y^{l-1}$. Then, $f_i^l$ and $h_i^l$ are used to recalculate the weights of (8), $c_{ij}^l$, and finally to estimate $g_i^l$ as
$$g_i^l = \sum_{j \in X} c_{ij}^l \Psi^l y_j^{l-1}. \tag{12}$$
The decoding layer is applied to generate the outputs of this layer:
$$x_i^l = \varphi(W^l f_i^l), \quad y_i^l = \varphi(V^l g_i^l), \quad z_i^l = \varphi(R^l h_i^l). \tag{13}$$
The cross-feature attention layer is shown in Figure 1. We can see that, in this layer, the input data has three sets of features, and the attention is paid from one feature set to another. To be more specific, the new representation in one feature space is the combination of samples in this feature space, but the attention weights are estimated from another feature space.
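To make the data flow of one cross-feature attention layer concrete, the following is a minimal sketch in plain Python. It assumes bilinear similarity scores followed by a softmax, which is one reading of the description above; the helper names (`attend`, `cross_feature_layer`, `eye`) and the toy data are illustrative assumptions, and the decoding step is omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(M, v):
    """Matrix (list of rows) times vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def attend(query, keys, values, sim, proj):
    """Weight proj @ values[j] by softmax_j(query^T sim keys[j])."""
    scores = [sum(a * b for a, b in zip(query, matvec(sim, k))) for k in keys]
    weights = softmax(scores)
    projected = [matvec(proj, v) for v in values]
    return [sum(w * p[d] for w, p in zip(weights, projected))
            for d in range(len(proj))]

def cross_feature_layer(y_all, x_old, y_old, y_cur, z_cur,
                        A, B, E, Theta, Phi, Psi):
    # f_i: weights from survived features vs. the old batch,
    #      values from the vanished features of the old batch.
    f = [attend(y, y_old, x_old, A, Theta) for y in y_all]
    # h_i: weights from survived features vs. the current batch,
    #      values from the newly added features of the current batch.
    h = [attend(y, y_cur, z_cur, B, Phi) for y in y_all]
    # g_i: weights from the concatenated codes [f_i; h_i],
    #      values from the survived features of all samples.
    fh = [fi + hi for fi, hi in zip(f, h)]
    g = [attend(c, fh, y_all, E, Psi) for c in fh]
    return f, g, h

def eye(n):
    """Identity matrix, used here as a placeholder parameter (assumption)."""
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]

# Toy data: two old samples (vanished + survived), one current sample
# (survived + newly added). Dimensions d1 = 2, d2 = 2, d3 = 1.
x_old = [[1.0, 0.0], [0.0, 1.0]]
y_old = [[1.0, 0.0], [0.0, 1.0]]
y_cur = [[1.0, 1.0]]
z_cur = [[2.0]]
y_all = y_old + y_cur
f, g, h = cross_feature_layer(y_all, x_old, y_old, y_cur, z_cur,
                              eye(2), eye(2), eye(3), eye(2), eye(1), eye(2))
```

Every sample, old or current, receives a code in all three spaces, which is the point of the layer: the current sample obtains a vanished-feature code `f` even though it never had vanished features.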
Suppose we have $L$ layers of cross-feature attention and the outputs of the last attention layers are $f_i^L$, $g_i^L$, and $h_i^L$. In our implementation, we set the layer number $L$ to 12. They are further concatenated as a long vector to represent the ith sample as follows:
$$u_i = [f_i^L; g_i^L; h_i^L]. \tag{14}$$
This vector is the input of the subsequent class-specific attention network for the purpose of classification.

Class-Specific Attention Layers.
Given the ith sample's cross-feature attention representation, $u_i$, and its class label, $\psi_i$, we use two attention layers to map it to the space of its own class and to the space of the entire data set of all classes.
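(1) Class-Specific Attention Layer. To represent the ith sample, we pay attention to the samples of the same class, $j: \psi_i = \psi_j$. The attention weight is again calculated according to the similarity between $u_i$ and $u_j$:
$$s_{ij}^{\psi} = \varphi\left(\iota_\psi^\top [u_i; u_j]\right), \tag{15}$$
where the similarity function is based on the concatenation of $u_i$ and $u_j$, a dense layer parameterized by $\iota_\psi$, and an activation layer $\varphi(\cdot)$. The attention weights from the ith sample to class $\psi$ are calculated by softmax normalization over the samples of the class $\psi$:
$$a_{ij}^{\psi} = \frac{\exp(s_{ij}^{\psi})}{\sum_{j': \psi_{j'} = \psi} \exp(s_{ij'}^{\psi})}. \tag{16}$$
The class-specific attention representation of the ith sample regarding class $\psi$ is the combination of the weighted samples of class $\psi$:
$$p_i^{\psi} = \sum_{j: \psi_j = \psi} a_{ij}^{\psi} \Omega_\psi u_j, \tag{17}$$
where $\Omega_\psi$ is the projection matrix.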
(2) Global Attention Layer. Besides the class-specific attention layers, we also build a global attention layer to represent the ith sample with respect to all the data samples of the entire data set. The attention weights are calculated from the ith sample to all samples, $j \in X$. The similarity between the ith and jth samples is also based on a concatenation, a dense layer, and an activation layer:
$$s_{ij} = \varphi\left(\pi^\top [u_i; u_j]\right), \tag{18}$$
where $\pi$ is the dense layer parameter. Accordingly, the attention weights are normalized by a softmax:
$$b_{ij} = \frac{\exp(s_{ij})}{\sum_{j' \in X} \exp(s_{ij'})}. \tag{19}$$
The global attention representation of the ith sample is
$$q_i = \sum_{j \in X} b_{ij} \Upsilon u_j, \tag{20}$$
where $\Upsilon$ is the projection matrix. With these layers, for each data sample, we have two representations: a class-specific attention vector, $p_i^{\psi_i}$, and a global attention vector, $q_i$.
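The two attention layers above differ only in the pool of samples they attend to, which the following plain-Python sketch illustrates. The function name `concat_attention`, the choice of tanh as the activation, and the toy parameters are assumptions for illustration; the source does not specify the activation function.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def concat_attention(u_i, pool, dense, proj):
    """Attend from u_i to every vector in pool.

    Score:  tanh(dense . [u_i; u_j])   (tanh is an assumed activation)
    Output: sum_j weight_j * (proj @ u_j)
    """
    scores = [math.tanh(sum(w * v for w, v in zip(dense, u_i + u_j)))
              for u_j in pool]
    weights = softmax(scores)
    projected = [[sum(a * b for a, b in zip(row, u_j)) for row in proj]
                 for u_j in pool]
    return [sum(w * p[d] for w, p in zip(weights, projected))
            for d in range(len(proj))]

# Toy codes u_i for three samples with class labels 0, 0, 1.
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
labels = [0, 0, 1]
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_dense = [0.0] * 4  # all-zero dense layer gives uniform attention

# Class-specific representation: the pool is restricted to the sample's class.
same_class = [u for u, c in zip(U, labels) if c == labels[0]]
p_0 = concat_attention(U[0], same_class, zero_dense, identity)

# Global representation: the pool is the entire data set.
q_0 = concat_attention(U[0], U, zero_dense, identity)
```

With the all-zero dense layer the attention is uniform, so `p_0` is the mean of the class-0 codes and `q_0` is the mean of all codes; trained parameters would skew these weights toward similar samples.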

Network Training.
The proposed network has many parameters. To learn these parameters, we first formulate a minimization problem over the training data set and then develop an iterative optimization algorithm to solve it.

Objective Function.
To train the parameters of the cross-feature attention network and the class-specific/global attention network, we consider the following two problems:
(1) Minimization of Within-Class Scattering. For each class $\psi$, we hope that the representations of its samples are not scattered, so that they are gathered as closely as possible. The samples' class-specific representations are $\{p_i^{\psi} \mid i: \psi_i = \psi\}$. To measure the within-class scattering, we first calculate the mean vector of this class as
$$\bar{p}_\psi = \frac{1}{n_\psi} \sum_{i: \psi_i = \psi} p_i^{\psi}, \tag{21}$$
where $n_\psi$ is the number of samples of the class $\psi$. The within-class scattering measure of class $\psi$ is calculated as
$$S_w^{\psi} = \mathrm{Tr}\left( \sum_{i: \psi_i = \psi} (p_i^{\psi} - \bar{p}_\psi)(p_i^{\psi} - \bar{p}_\psi)^\top \right), \tag{22}$$
where $\mathrm{Tr}(\cdot)$ is the trace of a matrix. The following minimization problem is modeled to optimize the parameters, so that the within-class scattering is minimized jointly for all the classes:
$$\min \sum_{\psi} S_w^{\psi}. \tag{23}$$
(2) Maximization of Interclass Scattering. We also propose to maximize the scattering between different classes. For this purpose, we first calculate an overall mean vector over the entire data set, using the global attention representations:
$$\mu = \frac{1}{n} \sum_{i \in X} q_i. \tag{24}$$
Meanwhile, in the global attention representation space, we also calculate the mean vector of each class $\psi$:
$$\mu_\psi = \frac{1}{n_\psi} \sum_{i: \psi_i = \psi} q_i. \tag{25}$$
The interclass scattering is measured as
$$S_b = \mathrm{Tr}\left( \sum_{\psi} (\mu_\psi - \mu)(\mu_\psi - \mu)^\top \right). \tag{26}$$
To separate different classes, we propose to maximize it:
$$\max S_b. \tag{27}$$
The overall objective of this problem is the combination of (23) and (27).
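A minimization problem is proposed as follows to learn the parameters of the network:
$$\min_{\Pi} \; \sum_{\psi} S_w^{\psi} - S_b + C \|\Pi\|_F^2, \tag{28}$$
where $S_w^{\psi}$ is the within-class scattering of class $\psi$ in (22), $S_b$ is the interclass scattering in (26), $\Pi = \{ (A^l, B^l, E^l, \Theta^l, \Phi^l, \Psi^l, W^l, V^l, R^l)|_{l=1}^{L}, (\iota_\psi, \Omega_\psi)|_{\psi=1}^{C}, \pi, \Upsilon \}$ is the set of network parameters, $\|\Pi\|_F^2$ is the squared $\ell_2$ norm of the parameters, used to prevent overfitting, and $C$ is the weight of the squared $\ell_2$ norm term.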
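The scattering terms can be evaluated without forming any scatter matrix, because the trace of a sum of outer products equals the sum of squared Euclidean distances. The sketch below uses that identity in plain Python. The function names, the unweighted sum over classes in the interclass term, and the way the three terms are combined are assumptions for illustration, not a statement of the exact loss used in the experiments.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sq_dist(a, b):
    """Squared Euclidean distance, i.e. Tr((a - b)(a - b)^T)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def within_class_scatter(P, labels):
    """P[i] is assumed to be the class-specific code of sample i for its own class."""
    total = 0.0
    for c in set(labels):
        members = [p for p, l in zip(P, labels) if l == c]
        center = mean(members)
        total += sum(sq_dist(p, center) for p in members)
    return total

def inter_class_scatter(Q, labels):
    """Q[i] is the global code of sample i; classes are summed unweighted (assumption)."""
    overall = mean(Q)
    total = 0.0
    for c in set(labels):
        members = [q for q, l in zip(Q, labels) if l == c]
        total += sq_dist(mean(members), overall)
    return total

def objective(P, Q, labels, params_sq_norm, C):
    """Within-class scatter minus interclass scatter plus weighted parameter norm."""
    return (within_class_scatter(P, labels)
            - inter_class_scatter(Q, labels)
            + C * params_sq_norm)

# Toy example: two classes with two samples each.
P = [[0.0, 0.0], [2.0, 0.0], [5.0, 5.0], [5.0, 7.0]]
Q = P
labels = [0, 0, 1, 1]
s_within = within_class_scatter(P, labels)   # 2 + 2
s_between = inter_class_scatter(Q, labels)   # 13 + 13
loss = objective(P, Q, labels, params_sq_norm=10.0, C=0.1)
```

Tight, well-separated classes drive the value down, while a larger `C` penalizes large parameter norms more strongly.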

Objective Optimization.
To solve the problem of (28), we employ the Adam algorithm [22]. This algorithm updates the parameters by gradient descent and is designed for stochastic objectives. The lower-order moments of the gradients are estimated adaptively.
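For reference, a single Adam update for one scalar parameter can be sketched as follows. The update rule is the standard one from the Adam paper; the toy quadratic objective, the step size of 0.05, and the name `adam_step` are assumptions chosen only to keep the example short, and do not reflect the settings used for the network.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step index."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Toy objective (theta - 3)^2 with gradient 2 * (theta - 3).
theta, m, v = 0.0, 0.0, 0.0
history = []
for t in range(1, 501):
    grad = 2.0 * (theta - 3.0)
    theta, m, v = adam_step(theta, grad, m, v, t)
    history.append(theta)
```

The first step moves the parameter by almost exactly the step size regardless of the gradient magnitude, which is the scale invariance that makes Adam convenient when different parameter groups have gradients of very different sizes.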

Model Inference.
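With the trained parameters of the network, we can infer the class label of a test sample. Its survived feature vector is $y$, and its newly added feature vector is $z$. We first represent it by the trained cross-feature network as $u$, according to (14). Then, for each class $\psi$, we calculate its class-specific representation $p^{\psi}$, together with a global representation $q$. Next, we calculate the distance between the class-specific representation of the test sample and the class mean of class $\psi$:
$$s_w(\psi) = \| p^{\psi} - \bar{p}_\psi \|_2^2, \tag{29}$$
where $\bar{p}_\psi$ is the mean vector of the class-specific representations of class $\psi$. Moreover, we also update the mean vector of this class in the global representation space by
$$\mu'_\psi = \frac{n_\psi \mu_\psi + q}{n_\psi + 1}, \tag{30}$$
where $\mu_\psi$ is the mean vector of class $\psi$ in the global representation space and $n_\psi$ is the number of training samples of this class. With the updated class mean, we recalculate its distance to the overall mean vector $\mu$ as follows:
$$s_b(\psi) = \| \mu'_\psi - \mu \|_2^2. \tag{31}$$
The overall score of assigning the test sample to the class $\psi$ is the difference of $s_b(\psi)$ and $s_w(\psi)$:
$$s(\psi) = s_b(\psi) - s_w(\psi). \tag{32}$$
It measures how close the test sample is to the class $\psi$ and how far it pushes the class $\psi$ away from the other classes. The test sample is assigned to the class which gives the largest score:
$$\psi^{*} = \arg\max_{\psi} s(\psi). \tag{33}$$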
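The scoring rule described above can be sketched in plain Python as follows. The running-mean update used for the class mean, the dictionary-based interface, and all names (`classify`, `class_means_p`, `class_means_q`) are assumptions made for illustration; the representations `p` and `q` are taken as already computed by the trained network.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(p_by_class, q, class_means_p, class_means_q, class_counts, mu):
    """Return (best class, per-class scores) for one test sample.

    p_by_class[c]    : class-specific code of the test sample for class c
    q                : global code of the test sample
    class_means_p[c] : training mean of class-specific codes of class c
    class_means_q[c] : training mean of global codes of class c
    class_counts[c]  : number of training samples in class c
    mu               : overall training mean of global codes
    """
    scores = {}
    for c in p_by_class:
        s_w = sq_dist(p_by_class[c], class_means_p[c])
        n = class_counts[c]
        # Class mean after hypothetically adding the test sample (assumed update).
        updated = [(n * m + qi) / (n + 1) for m, qi in zip(class_means_q[c], q)]
        s_b = sq_dist(updated, mu)
        scores[c] = s_b - s_w
    return max(scores, key=scores.get), scores

# Toy example with two classes.
label, scores = classify(
    p_by_class={0: [1.0, 0.0], 1: [4.0, 4.0]},
    q=[1.0, 0.0],
    class_means_p={0: [1.0, 0.0], 1: [5.0, 6.0]},
    class_means_q={0: [1.0, 0.0], 1: [5.0, 6.0]},
    class_counts={0: 2, 1: 2},
    mu=[3.0, 3.0],
)
```

In this toy case the test sample sits exactly on the class-0 mean, so class 0 has zero within-class distance and keeps its mean far from the overall mean, while assigning the sample to class 1 would pull that class toward the center.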

Experiments
In this section, we evaluate the proposed method, the cross-feature attention network (CFAN), experimentally. We first introduce the data sets and the experimental setting, then give the experimental results, and finally summarize the observations from the results.

Data Sets.
In the experiment, we use four data sets as benchmarks, including two computer vision data sets, an Internet of things (IoT) data set, and a bio-informatics data set. The statistics of the data sets are given in Table 1. The details of the data sets are as follows.
(i) Satimage is an image data set. It has 6,431 images, and the problem is to categorize each image into one of the 7 categories, including red soil, cotton crop, grey soil, damp grey soil, soil with vegetation stubble, mixture class (all types present), and very damp grey soil. Each image is represented by 36 features, which are the 9 pixels of a neighborhood in each of 4 spectral bands [23].
(ii) MNIST is a hand-written digit data set. It has a training set of 60,000 images and a test set of 10,000 images. Each image has 28 × 28 pixels and a class label of one of ten digits; thus, it is a 10-class classification problem [24].
(iii) SensIT Vehicle is an acoustic data set. It has 78,823 training samples and 19,705 test samples. Each sample has 50 features. The learning problem of this data set is a 3-class classification problem [25].
(iv) Protein is a bio-informatics data set. It has 32,661 training data samples, 6,621 test data samples, and 2,871 evaluation samples. Each sample has 357 features. The problem for this data set is a 3-class classification problem [26].

Protocol.
To perform the experiments, we split the feature set of each data set into three subsets of equal size. The three feature sets are the vanished features, survived features, and newly added features, respectively. To create the training and test data sets, we use the 10-fold cross-validation protocol. A data set is split into 10 folds of equal size, and each fold is used in turn as a test set, while the other 9 folds are combined to form a training set. We first train the model over the training set and then test it over the test set. The performance is measured by classification accuracy, calculated as the proportion of correctly classified samples among all test samples.
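The protocol can be sketched in a few lines of plain Python. The contiguous assignment of feature indices to the three subsets, the interleaved fold assignment, and the handling of a feature count not divisible by three (the remainder is simply left out here) are assumptions, since the text does not specify them; all function names are illustrative.

```python
def split_features(num_features):
    """Split feature indices into vanished, survived, and newly added thirds."""
    third = num_features // 3
    idx = list(range(num_features))
    return idx[:third], idx[third:2 * third], idx[2 * third:3 * third]

def k_fold_indices(num_samples, k=10):
    """Yield (train, test) index lists; each fold serves as the test set once."""
    folds = [list(range(i, num_samples, k)) for i in range(k)]
    for t in range(k):
        test = folds[t]
        train = [i for j, fold in enumerate(folds) if j != t for i in fold]
        yield train, test

def accuracy(predicted, truth):
    """Proportion of correctly classified test samples."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)

# Satimage has 36 features, so each subset receives 12 of them.
vanished, survived, added = split_features(36)
splits = list(k_fold_indices(100, k=10))
acc = accuracy([0, 1, 1, 2], [0, 1, 2, 2])
```

A feature count such as 50 (SensIT Vehicle) does not divide evenly by three, so a real implementation would need an explicit rule for the two leftover features.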

Experimental Results.
In this section, given the data and protocol, we perform the experiments and report the results. We evaluate the proposed method from three different aspects, including the sensitivity to the tradeoff parameter, the running time, and the performance compared to the state of the art.

Parameter Sensitivity.
In the objective of our model in (28), there is a tradeoff parameter C to balance the regularization term and the other terms. It controls the weight of the model complexity. To evaluate how it affects the performance of the model, we plot the curves of the accuracy against the changing values of C in Figure 2. From the figure, we observe that, as the weight of the regularization term increases, the accuracy improves slightly. However, generally speaking, the performance remains stable with respect to the value of C. A simpler model with a larger value of C can improve the quality of the model, but the improvement is not significant.

Running Time Analysis.
The running time of the training process of the model is also studied in the experiment. Moreover, the running time of the classification of the test samples is also reported. The running time over the four benchmark data sets is given in Figure 3.
(1) From the figure, we can see that the training time is longer than the test time for each data set. This is natural because the training algorithm scans each training sample over many iterations, and the number of training samples is also larger than the number of test samples. Meanwhile, in the test process, each test sample is only scanned once.
The compared methods are OPID, HF-SAR, and OSFS [21]. The accuracy of these methods is reported in Figure 4. We have the following observations from this figure:
(1) In all the cases, the CFAN algorithm outperforms the compared methods, especially on the most challenging data sets, SensIT Vehicle and Protein. This is a strong indicator of the effectiveness and advantage of CFAN over the compared methods. The main reason is the power of the cross-feature attention layers, which take advantage of the essential nature of the changing features due to the evolving sensors.
(2) The second best algorithm is OPID, which is also specially designed for the IDF problem. However, because it uses a linear model to model the evolving features, it fails to capture the complex patterns of the features. In contrast, CFAN uses deep attention layers for this purpose, thus giving much better results.
(3) HF-SAR and OSFS give the worst results. They can handle some special cases of IDF but are not complete solutions for this problem.

Conclusion
In this paper, we proposed a novel solution for the machine learning problem with evolving features. In the process of learning, old features vanish, new features are augmented, and the remaining features survive. To handle the evolving features, we use a deep network structure with a newly designed cross-feature attention layer. This layer fills the gap between the survived features and the vanished/augmented features by paying attention from/to different feature sets. In this way, data samples with different sets of features are mapped to a common feature space. To learn the attention network parameters, we proposed to construct the class-specific attention layer to minimize the within-class scattering and the global attention layer to maximize the interclass scattering. Experimental results show that the algorithm is stable and outperforms the other methods.

Data Availability
The data sets used in this paper are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.