Instance-Wise Denoising Autoencoder for High Dimensional Data

The Denoising Autoencoder (DAE) is one of the most popular models in recent neural network research, with significant reported success. Specifically, DAE randomly corrupts some features of the data to zero so as to exploit feature co-occurrence information while avoiding overfitting. However, existing DAE approaches do not fare well on sparse, high-dimensional data. In this paper, we present a Denoising Autoencoder, labeled here the Instance-Wise Denoising Autoencoder (IDA), designed for high-dimensional and sparse data: it exploits the instance-wise co-occurrence relation instead of the feature-wise one. IDA operates under the following corruption rule: if any nonzero feature of an instance vector is selected, the entire instance is forced to become a zero vector. To avoid serious information loss when too many instances are discarded, an ensemble of multiple independent autoencoders built on different corrupted versions of the data is considered. Extensive experimental results on high-dimensional, sparse text data show the superiority of IDA in efficiency and effectiveness. IDA is also evaluated in heterogeneous transfer learning and cross-modal retrieval settings to study its generality on heterogeneous feature representations.


Introduction
Denoising Autoencoder (DAE) [1][2][3][4][5] is an extension of the classical autoencoder [6,7], where feature denoising is key to generating better features. In contrast to the classic autoencoder, the input vector in DAE is first corrupted by randomly setting some of its features to zero. The model then attempts to reconstruct the uncorrupted input from the corrupted version. Operating on the principle of predicting uncorrupted values from corrupted input, DAE has been shown to generalize well even with noisy input. However, DAE and its variants do not fare well on high-dimensional, sparse data: many features are already zero in nature, so further resetting features has little additional effect on the original data. Moreover, high dimensionality also leads to an uneven distribution of uncorrupted features. To address these challenges, we propose in this paper a denoising scheme designed for high-dimensional, sparse data, labeled here the Instance-Wise Denoising Autoencoder (IDA). To be more specific, if any nonzero feature of an instance is chosen, that instance is removed entirely. This means many instances will be removed from the data, which obviously leads to serious information loss. Therefore, a recovery strategy is further adopted in which multiple independent autoencoders are constructed on different corrupted versions of the input and then combined to obtain the final solution. Because IDA drops instances directly, it can reduce the training data size significantly, which is considerably useful for large-scale data analytics. Additionally, the autoencoders in the model are fully independent of one another, so IDA is naturally parallelized on a single multicore CPU or a distributed computing platform. In this paper, we verify the performance on classic high-dimensional, sparse text data. The
experimental results show that the proposed autoencoder is very fast and effective.
Furthermore, we study IDA's application to heterogeneous feature representation and propose Heterogenous-SIDA, based on the heterogeneous feature fusion framework [8]. Experiments on transfer learning and cross-modal retrieval show that IDA obtains better performance than mSDA, the autoencoder embedded in the fusion framework [8].
The core contributions of this paper are as follows: (i) an Instance-Wise Denoising Autoencoder (IDA) method is proposed that improves both generalization performance and efficiency; (ii) a procedure for rapidly building a deep learning structure by stacking IDA for large-scale, high-dimensional problems is proposed; (iii) the deep learning approach is further applied to two heterogeneous feature learning tasks: cross-language classification and cross-modal retrieval.

Review on Denoising Autoencoder
In a classic autoencoder [6], the aim is to learn a distributed representation that captures the coordinates along the core factors of variation in the data. As shown in Figure 1(a), an autoencoder takes an input x and maps it to a hidden representation y = σ(Wx) through a deterministic mapping with weights W, in a step called encoding. Then, in the decoding step, the latent representation (or code) y is mapped back to a reconstruction z = σ(W^T y) of the same shape as x through a similar transformation. The parameter W of this model (W^T denotes the transpose of W) is optimized so that the average reconstruction error is minimized. The reconstruction error can be the cross-entropy loss [9] or the squared error loss L(x, z) = ||x − z||^2. However, the basic autoencoder alone is not a sufficient basis for a deep architecture because it has a tendency to overfit. In other words, the reconstruction criterion alone cannot guarantee the extraction of useful features, as it can lead to the obvious solution of simply copying the input, or to similarly uninteresting solutions that trivially maximize mutual information. The Denoising Autoencoder (DAE) [1] is an extension of the classical autoencoder introduced specifically to address this phenomenon. As shown in Figure 1(b), DAE is trained to reconstruct a "clean" or "repaired" version of a corrupted input. This is achieved by first corrupting the original input x into x̃ through a stochastic corruption process that randomly sets some values of the input vector to zero [1]. The corrupted input x̃ is then mapped, as with the basic autoencoder, to a hidden representation y = σ(x̃W), from which we reconstruct z = σ(yW^T). The parameter W is trained to minimize the average reconstruction error over the training set, that is, to make z as close as possible to the uncorrupted input x. A crucial limitation of DAE is its high computational cost, due to the expensive nonlinear optimization process. To
this end, Chen et al. [4] proposed the Marginalized Denoising Autoencoder (mDAE), which replaces the encoder and decoder with a single linear transformation matrix. mDAE provides a closed-form solution for the parameters and thus eliminates the need for other optimization algorithms such as stochastic gradient descent and backpropagation. Liang and Liu [10] combined the stacked Denoising Autoencoder with dropout and reduced the time complexity of the fine-tuning phase. Moreover, when the input is heavily corrupted during training, the network tends to learn coarse-grained features, whereas when the input is only slightly corrupted, it tends to learn fine-grained features. To address this, Geras and Sutton [3] proposed scheduled Denoising Autoencoders, which learn features at multiple scales by starting with a high noise level that is lowered as training progresses. To reduce the effect of outliers, Jiang et al. [5] proposed a robust ℓ2,1-norm reconstruction error to learn a more robust model. To improve denoising performance, Cho [11] proposed a simple sparsification method for the latent representation found by the encoder. Wang et al.
[12] proposed a probabilistic formulation of the stacked denoising autoencoder (SDAE) and extended it to a relational SDAE (RSDAE) model that jointly performs deep representation learning and relational learning in a principled probabilistic framework. These DAE algorithms address many shortcomings of traditional autoencoders, such as their inability in principle to learn useful overcomplete representations, and have been shown to generalize well even with noisy input. However, DAE and its variants do not fare well on high-dimensional, sparse data: many features are already zero in nature, so further resetting features has little additional effect on the original data. Moreover, high dimensionality also leads to an uneven distribution of uncorrupted features. To address these challenges, we propose in this paper a denoising scheme designed for high-dimensional, sparse data, labeled here the Instance-Wise Denoising Autoencoder (IDA).
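As a concrete illustration of the feature-wise corruption these DAE variants share, the following Python sketch zeroes each feature independently with probability p and measures the squared reconstruction error. The function names and the toy vector are ours, not taken from any cited implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def dae_corrupt(x, p=0.3):
    """Feature-wise DAE corruption: zero each feature independently with probability p."""
    mask = rng.random(x.shape) >= p   # keep a feature with probability 1 - p
    return x * mask

def squared_error(x, z):
    """Squared reconstruction error between the clean input x and a reconstruction z."""
    return float(np.sum((x - z) ** 2))

x = np.array([1.0, 0.5, 2.0, 0.0])
x_tilde = dae_corrupt(x)   # some entries of x forced to zero
```

On a sparse vector like x above, most entries are already zero, so feature-wise corruption often changes nothing; this is exactly the weakness the instance-wise scheme targets.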

Instance-Wise Denoising Autoencoder (IDA).
In this section we introduce a novel denoising method for the autoencoder that preserves its strong feature learning capabilities while alleviating the concerns mentioned above.
Given n original instances X = {x_i | x_i ∈ R^d}_{i=1}^n, we corrupt them by the modified strategy: if any nonzero feature of an instance is selected with a given probability p, the whole instance is reset to zero. To be more specific, we generate a d-bit Boolean vector m = [0, 1, 0, . . ., 1] ∈ {0, 1}^{d×1}, where each element corresponds to a feature and equals 1 with probability p. If the nonzero index set I(m) of m and the nonzero index set I(x) of an instance x overlap (i.e., I(m) ∩ I(x) ≠ ∅), all features of the instance are reset to 0; otherwise, the instance is retained. After this denoising, the resultant input is denoted X̃. We reconstruct the inputs by minimizing the reconstruction loss min_W ||X − g(f(X̃))||_p^q, where f and g denote the encoding and decoding maps and || · ||_p^q is a norm with p > 0 and q = 1/2, 1, 2, . . ., +∞. Different choices of f, g, p, and q yield different codings.
The optimization methods and computation cost are also different.
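The corruption rule above can be sketched in Python as follows (the name `ida_corrupt` and the toy matrix are ours; the rule itself follows the mask-overlap description in the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def ida_corrupt(X, p=0.01):
    """Instance-wise corruption (a sketch of the rule in the text).

    Draw a d-bit Boolean vector m with P(m_j = 1) = p.  Any instance whose
    nonzero feature indices overlap the set {j : m_j = 1} is reset to the zero
    vector; every other instance is kept unchanged.
    """
    n, d = X.shape
    m = rng.random(d) < p                   # selected (corrupted) feature indices
    hit = (X[:, m] != 0).any(axis=1)        # instances overlapping a selected feature
    X_tilde = X.copy()
    X_tilde[hit] = 0.0                      # drop the whole instance
    return X_tilde, ~hit                    # corrupted data and a keep-mask

X = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 3.0]])
X_tilde, kept = ida_corrupt(X, p=0.3)
```

The keep-mask makes it easy to discard the zeroed rows outright, which is where the training-set size reduction discussed later comes from.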
(i) f and g Are Linear, p = 2, and q = 1. The loss function can be rewritten as min_W ||X − X̃WW^T||. The reconstruction error is measured by the Frobenius norm [13], and the solution can then be obtained directly in closed form.
(ii) f Is Nonlinear and g Is Linear, p = 2, and q = 1. The loss function can be rewritten as min_W ||X − σ(X̃W)W^T||; the solution can be obtained by the Extreme Learning Machine based Autoencoder (ELM-AE) [14], where W is first randomly assigned and then replaced by the optimized result.
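For case (i), a minimal sketch of the closed-form computation is given below. For brevity we follow the mDAE-style simplification of a single linear matrix M in place of the factorized WW^T, so this illustrates the ridge least-squares solve rather than the exact factorized objective:

```python
import numpy as np

def linear_dae_map(X, X_tilde, lam=1e-3):
    """Closed-form linear reconstruction map (simplified sketch).

    In the spirit of mDAE we replace encoder and decoder by one linear matrix M
    and solve  min_M ||X - X_tilde M||_F^2 + lam ||M||_F^2, whose solution is
    M = (X_tilde^T X_tilde + lam I)^{-1} X_tilde^T X.  The low-rank factorization
    M = W W^T used in the text is dropped here for brevity.
    """
    d = X.shape[1]
    A = X_tilde.T @ X_tilde + lam * np.eye(d)
    return np.linalg.solve(A, X_tilde.T @ X)
```

With a vanishing regularizer and uncorrupted input, the recovered map approaches the identity, which is a quick sanity check on the algebra.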
As shown in Figure 2, to recover the information lost through instance corruption, a combination of multiple independent autoencoders is adopted. In particular, V > 1 versions of the corrupted input {X̃^(v)}_{v=1}^V and corresponding independent autoencoders are constructed, where X̃^(v) and W^(v) denote the v-th corrupted version of the input and its corresponding encoder. Because the autoencoders are fully independent of one another, they are well suited to parallel execution on a multicore CPU or a distributed computing platform.

After obtaining the code of each autoencoder, we reach the final solution by combining them. In comparison to the feature-wise denoising scheme, where individual features are corrupted, the instance-wise denoising scheme has the following benefits: (i) it tackles the challenge of high-dimensional, sparse data; (ii) it explicitly reduces the number of data instances used. For example, for a problem with 1 million data instances, if only 1% of the instances are retained in the corrupted inputs, the computational cost is reduced to only 0.01% of the original (assuming the widely used O(n²) complexity [13]); and (iii) it is easy to implement in a parallel paradigm.
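One simple way to realize the combination step, assuming the V encoder matrices have already been trained on their respective corrupted inputs, is to encode the full data with each and average the codes. Averaging is our choice here for illustration; concatenation is an equally valid combination rule.

```python
import numpy as np

def combine_codes(X, encoders):
    """Combine V independent IDA encoders into a final code.

    `encoders` holds the V weight matrices learned on different corrupted
    versions of the data; each encodes the full input, and the V hidden
    representations are averaged element-wise.
    """
    codes = [np.tanh(X @ W) for W in encoders]
    return np.mean(codes, axis=0)
```

Since each encoder is applied independently, the list comprehension above is trivially replaced by a parallel map on a multicore machine.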

Stacked Instance-Wise Denoising Autoencoder (SIDA).
IDA can be stacked to build a deep network with more than one hidden layer. A generative deep model built by stacking multiple autoencoder layers can obtain a more useful representation that expresses the multilevel structure of image features or other data. Figure 3 shows a typical SIDA structure with two encoding layers and two decoding layers. Supposing there are L hidden layers in the encoding part, the activation of the l-th encoding layer is y^(l) = σ(W^(l) y^(l−1)), where the input y^(0) is the original data x. The output y^(L) of the last encoding layer is the high-level feature extracted by the SIDA network. In the decoding steps, the output of the first decoding layer is the input of the second decoding layer. The decoding function of the l-th decoding layer is z^(l) = σ(W^(L−l+1)T z^(l−1)), where the input z^(0) of the first decoding layer is the output y^(L) of the last encoding layer. The output z^(L) of the last decoding layer is the reconstruction of the original data. The training process of SIDA is as follows.
Step 1. Train the first IDA, which comprises the first encoding layer and the last decoding layer, to obtain the network weights W^(1) and the output y^(1) of the first encoding layer.
Step 2. Use y^(l) as the input of the (l + 1)-th encoding layer; train the (l + 1)-th IDA to obtain W^(l+1) and y^(l+1), for l = 1, . . ., L − 1, where L is the number of hidden layers in the network.
It can be seen that each IDA is trained independently, and therefore the training of SIDA is called layer-wise training (Algorithm 1).
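The layer-wise procedure of Steps 1-2 can be sketched as follows. The inner layer uses the ELM-AE-style training of case (ii), with a random encoder and a closed-form ridge decoder; instance-wise corruption is omitted here for brevity, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_ida_layer(X, hidden_dim, lam=1e-3):
    """One IDA layer (sketch): ELM-AE-style random encoder, closed-form decoder."""
    d = X.shape[1]
    W = rng.standard_normal((d, hidden_dim)) / np.sqrt(d)   # random encoder weights
    H = np.tanh(X @ W)                                      # hidden code
    # ridge-regularized decoder: min_B ||X - H B||_F^2 + lam ||B||_F^2
    B = np.linalg.solve(H.T @ H + lam * np.eye(hidden_dim), H.T @ X)
    return W, H, B

def train_sida(X, layer_dims):
    """Greedy layer-wise training: the code of layer l is the input of layer l+1."""
    weights, Y = [], X
    for h in layer_dims:
        W, Y, _ = train_ida_layer(Y, h)
        weights.append(W)
    return weights, Y   # Y is the top-layer representation
```

Each call to `train_ida_layer` depends only on the previous layer's code, which is exactly the independence that makes layer-wise training possible.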

SIDA for Transfer Learning.
In the previous sections, we have shown the superiority of our algorithm in computational complexity for single-source data. Nevertheless, with multimedia data becoming the current mainstream of information dissemination on the network, heterogeneous data mining is

more important than ever. In this section, we discuss how to integrate SIDA into the heterogeneous feature learning framework [15] and then apply it to two classical heterogeneous data mining tasks: heterogeneous transfer learning and cross-modal retrieval. We term this SIDA for heterogeneous data mining Heterogenous-SIDA.
Transfer learning has demonstrated success in many applications. Our study focuses on heterogeneous transfer learning, which aims to learn a feature mapping across heterogeneous feature spaces based on cross-domain correspondences. In this field, Shi et al. [16] proposed a spectral-transformation-based heterogeneous transfer learning method that maps cross-domain data into a common feature space through linear projection. Duan et al. [17] used two projection matrices to transform data from two domains into a common subspace, and then two new feature mapping functions to augment the transformed data with their original features and zeros. Kulis et al. [18] proposed learning an asymmetric nonlinear kernel transformation that maps points from one domain to another. Zhou et al. [8] proposed a multiclass heterogeneous transfer learning algorithm that learns a sparse feature transformation matrix to map the weight vectors of classifiers learned in the source domain to the target. Glorot et al. [19] trained a stacked denoising autoencoder (SDAE) to reconstruct the input (ignoring the labels) on the union of the source and target data, after which a classifier is trained on the resulting feature representation. Chen et al. [20] proposed the marginalized Stacked Denoising Autoencoder (mSDA) for domain adaptation, which achieves a closed-form solution for SDAE. Zhou et al. [15] further applied mSDA to learn both the deep structure and the feature mappings between cross-domain heterogeneous features, reducing the bias caused by the cross-domain correspondences.
The Heterogenous-SIDA model can be trained on the multilayer heterogeneous data fusion framework [15], as in Figure 4. In particular, given a set of data pairs {(x_s^i, x_t^i)}_{i=1}^n from two different domains, the objective is to learn the weight matrices {W_s^(l), W_t^(l)}_{l=1}^L that project the source and target data to the l-th hidden layers, H_s^(l) = σ(W_s^(l) X_s) and H_t^(l) = σ(W_t^(l) X_t), respectively, together with two feature mappings {G_s^(l), G_t^(l)}_{l=1}^L that map the data to a common space such that the disparity between source and target domain data is minimized, where γ_s > 0 and γ_t > 0 are regularization parameters used to avoid overfitting.
{G_s^(l)}_{l=1}^L and {G_t^(l)}_{l=1}^L can be computed by an alternating optimization algorithm. However, for simplicity, we fix one mapping to the identity matrix I and learn only a single feature mapping {G^(l)}_{l=1}^L by minimizing ||H_s^(l) − H_t^(l) G^(l)||^2 + γ||G^(l)||^2, where γ > 0 is a regularization parameter.
The closed-form solution is then G^(l) = (H_t^(l)T H_t^(l) + γI)^{−1} H_t^(l)T H_s^(l). Sometimes the correlation between the two domains is nonlinear, so we extend the linear mapping to a nonlinear one via the kernel method; in the dual form, a kernel function K such as the RBF kernel replaces the inner products.
In the kernel case, the feature mapping G^(l) and the representation H_s^(l) are obtained analogously. After learning the multilevel features and mappings, for each source domain instance x_s^i, denoting h_s^(l),i as its representation at the l-th layer, one can define a new representation z_s^i by augmenting the original features with the high-level features of all the layers, z_s^i = [h_s^(1),i, . . ., h_s^(L),i]; a standard classification (or regression/logistic regression) algorithm is then applied to (Z_s, T_s) to train a target predictor.
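A minimal sketch of learning the single feature mapping G^(l) between the two domains' hidden representations, under the ridge least-squares reading of the objective above (the function name and the regularization default are ours):

```python
import numpy as np

def learn_mapping(H_s, H_t, gamma=1e-3):
    """Closed-form feature mapping between two domains' hidden layers (sketch).

    Solves  min_G ||H_s - H_t G||_F^2 + gamma ||G||_F^2, giving
    G = (H_t^T H_t + gamma I)^{-1} H_t^T H_s.
    """
    k = H_t.shape[1]
    return np.linalg.solve(H_t.T @ H_t + gamma * np.eye(k), H_t.T @ H_s)
```

When the target representations actually are a linear transform of the source ones, the mapping is recovered exactly as the regularizer vanishes.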

SIDA for Cross-Modal Retrieval.
Cross-modal retrieval is another important heterogeneous feature representation application. Many cross-modal retrieval works address this issue by learning a common space for the two modality feature spaces. Rasiwasia et al. [21,22] applied CCA to learn a common space between image and text co-occurrence data (image and text occurring in one document). Semantic matching (SM) [21,22] uses logistic regression in the image and text feature spaces to extract semantically similar features for better matching. The bilinear model (BLM) [23] is a simple and efficient learning algorithm for bilinear models based on the familiar techniques of SVD and EM. LCFS [24] learns two projection matrices to map multimodal data into a common feature space, in which cross-modal data matching can be performed. GMLDA [25] adopts LDA under the multiview feature extraction framework, and GMMFA [25] uses MFA for cross-modal retrieval under the same framework.
To apply the proposed approach to cross-modal retrieval, instead of training a classifier, we compute the similarity between cross-modal data in the common space. In particular, given a database D = {d_1, . . ., d_|D|} of documents comprising image and text components, we consider the case where each document consists of a single image and its corresponding text.
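Retrieval in the learned common space then reduces to ranking database codes by their similarity to the query code. The cosine-similarity sketch below is one natural choice; the similarity measure is our assumption, as the text does not fix one.

```python
import numpy as np

def retrieve(query_code, db_codes, top_k=4):
    """Rank database items by cosine similarity to the query in the common space
    and return the indices of the top_k matches."""
    q = query_code / np.linalg.norm(query_code)
    D = db_codes / np.linalg.norm(db_codes, axis=1, keepdims=True)
    return np.argsort(-(D @ q))[:top_k]
```

For image-to-text retrieval the query code comes from the image encoder and `db_codes` from the text encoder, and vice versa for text-to-image retrieval.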

Experimental Study
In this section, we present the experimental study of IDA on three popular machine learning tasks including text classification, cross-language sentiment classification, and image-text cross-modal retrieval to verify the performance of IDA from multiple aspects.

Results on High Dimensional Sparse Data.
To compare the performance of IDA and SIDA (including serial and parallel implementations) on high-dimensional sparse data, we select two popular datasets, News20.bin and Rcv1.mul, as benchmarks; their detailed statistics are given in Table 1, and the experimental results in Tables 2 and 3.
We find that, with nearly the same accuracy, SIDA is significantly faster than SDAE, by up to two orders of magnitude. For example, on News20.bin, SIDA (2-layer, 4500-1200) needs 503.45 seconds while SDAE needs about 18,000 seconds, roughly 36 times longer.
When the autoencoders in SIDA are run in parallel, the speed improves by about a factor of two. For example, on News20.bin SIDA obtains a 2x speedup, while on Rcv1.mul the improvement is around 2.3x. Our machine has a 4-core CPU, and no optimized parallelization strategy is adopted; we simply change "for" to "parfor" in our Matlab implementation. SIDA can obtain an even more significant advantage over SDAE on a more efficient distributed computing platform.

Cross-Modal Retrieval on Wikipedia Data.
The Wikipedia dataset (http://www.svcl.ucsd.edu/projects/crossmodal/), which has 2,866 image-text pairs, is a challenging image-text dataset with large intraclass variations and small interclass discrepancies. The text of each article describes people, places, or events closely relevant to the content of the corresponding image. There are 10 semantic categories in the Wikipedia dataset: art & architecture, geography & places, history, literature & theatre, biology, media, music, sports & recreation, royalty & nobility, and warfare, as shown in Table 4. We follow the data partitioning procedure of [21,22], where the original dataset is split into a training set of 2,173 pairs and a testing set of 693 pairs. We then evaluate our proposed method against the following state-of-the-art cross-modal retrieval approaches.
(i) Correlation Matching (CM) [21,22]. This method applies CCA to learn a common space in which one can measure how likely it is that two data items from different modalities represent the same semantic concept.
(ii) Semantic Matching (SM) [21,22].This method applied Logistic regression in the image and text feature space to extract semantically similar feature to facilitate better matching.
(iii) Semantic Correlation Matching (SCM) [21,22].This method applied Logistic regression in the space of CCA projected coefficients (a two-stage learning process).
(iv) Bilinear Model (BLM) [23].This method is a suite of simple and efficient learning algorithms for bilinear models, based on the familiar techniques of SVD and EM.
(v) Learning Coupled Feature Spaces (LCFS) [24].This method learns two projection matrices to map multimodal data into a common feature space in which cross-modal data matching can be performed.

(vi) Generalized Multiview Linear Discriminant Analysis (GMLDA) [25]. This method applies LDA under the multiview feature extraction framework.
We use mean average precision (MAP) to measure retrieval performance [30]. Two tasks are considered: text retrieval based on an image query, and image retrieval based on a text query. In the first case, each image is used as a query and produces a ranking of all texts; in the second, the roles of images and texts are reversed. The MAP scores for text retrieval from an image query, image retrieval from a text query, and their average are presented in Table 5. From the results, the following conclusions can be made: (i) the proposed method is superior to simple random retrieval, which forms the baseline for comparison; (ii) the proposed method outperforms PCA, BLM, GMMFA, GMLDA, LCFS, CM, SM, and SCM [21,22] on image retrieval given a text query and vice versa. Figure 5 shows several example image queries and the images corresponding to the top retrieved texts by Heterogenous-SIDA. Due to page limitations, we present only the ground-truth images. The query images are framed in Figure 5(a), and the images associated with the four best text matches are shown in Figure 5(b). Comparing category and text content, each of the top-4 retrieved texts contains one or more words relevant to the image query, or belongs to the category of the query image.
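The MAP metric used above can be computed as follows; this is a standard implementation, and the helper names are ours.

```python
import numpy as np

def average_precision(ranked_labels, query_label):
    """AP of one query: average of precision@k over the relevant positions."""
    rel = (np.asarray(ranked_labels) == query_label).astype(float)
    if rel.sum() == 0:
        return 0.0
    prec_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1.0)
    return float((prec_at_k * rel).sum() / rel.sum())

def mean_average_precision(rankings, query_labels):
    """MAP: the mean of the per-query average precisions."""
    return float(np.mean([average_precision(r, q)
                          for r, q in zip(rankings, query_labels)]))
```

For instance, a ranking whose relevant items sit at positions 1 and 3 scores AP = (1/1 + 2/3) / 2 = 5/6.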
Figure 6 depicts two examples of text queries and the corresponding retrieval results using Heterogenous-SIDA. Each text query is presented along with its ground-truth image, and the top five retrieved images are shown below the text. Comparing category and text content, we find that Heterogenous-SIDA retrieves these images correctly: they belong to the category of the query text ("history" at the top, "sports" at the bottom), or the corresponding text contains one or more words relevant to the text query.
Figure 7 shows the MAP scores achieved per category by the proposed method and state-of-the-art counterparts, SM, CM, and SCM [21,22].Note that, on most categories, the MAP of our method is competitive with those of CM, SM, and SCM.

Transfer Learning Results.
In this section, we present further studies on the performance of IDA for a transfer learning task: cross-language classification. In particular, the cross-language sentiment dataset [31] is considered here. This dataset comprises Amazon product reviews in three product categories: Books (B), DVDs (D), and Music (M). The reviews are written in four languages: English (EN), German (GE), French (FR), and Japanese (JP).
For each language, the reviews are split into training and testing sets of 2,000 reviews per category. We use the English reviews in the training set as the source domain labeled data, and the non-English reviews (in each of the other three languages) in the training file as the target domain unlabeled data. Further, we apply the Google translator to the non-English reviews in the testing set to construct the cross-domain (English versus non-English) unlabeled parallel data. The performance of all methods is then evaluated on the target domain unlabeled data.
Here we focus on cross-language cross-category learning between English and the other three languages (German, French, and Japanese), a more challenging task than cross-language learning alone. For a comprehensive comparison, we constructed 18 cross-language cross-category sentiment classification tasks. For example, the task EN-B-FR-D uses all the Books reviews in French in the testing set and their English translations as the parallel dataset, the DVDs reviews in French as the target-language testing data, and the original English Books reviews as the source domain labeled data. We compare the proposed method with the following baselines.
(i) SVM-SC [15].This method first trains a classifier on the source domain labeled data and then predicts the source domain parallel data.By using the correspondence, the predicted labels for source parallel data can be transferred into target parallel data.Next, it trains a model on the target parallel data with predicted labels to make predictions on the target domain test data.
(ii) CL-KCCA [32]. This method applies cross-lingual kernel canonical correlation analysis on the unlabeled parallel data to learn two projections for the source and target languages, and then trains a monolingual classifier on the projected source domain labeled data.
(iii) HeMap [16]. This method applies heterogeneous spectral mapping to learn projections for the two domains' data. The experimental results of our method and HHTL with the same number of layers show that, whether with 1 layer or 3 layers, our method produces much better performance than HHTL. This indicates that the proposed autoencoder learns useful higher-level features that alleviate the distribution bias at the same depth. Additionally, the training time of these algorithms is reported in Table 7. The proposed algorithm is faster than HHTL; for example, with 3 layers, our method is faster than HHTL(3) in most cases. Compared with the other four transfer learning methods, SVM-SC, CL-KCCA, HeMap, and mSDA-CCA, the proposed method is also very competitive in both testing accuracy and training time. Due to its deep structure, our method with multiple layers is not the fastest algorithm, but it showcases improved prediction accuracy over the counterpart algorithms.

Figure 5 :
Figure 5: Image-to-text retrieval on Wikipedia. Query images are framed in (a). The four most relevant texts, represented by their ground-truth images, are shown in (b).
[Example query text for Figure 6 (category "history"): a passage on the rise of the medieval Chola dynasty under Vijayalaya and its revival under Rajaraja Chola I.]

Figure 6 :
Figure 6: Two examples of text-based cross-modal retrieval using Heterogenous-SIDA on Wikipedia. The query text and ground-truth image are shown at the top; retrieved images are presented at the bottom.

Table 1: Summary of News20.bin and Rcv1.mul. News20.bin contains 1,355,191 features, of which only about 0.0335% are nonzero; Rcv1.mul has 47,236 dimensions, of which only 0.14% are nonzero. All parameters are determined through cross-validation, and a simple linear SVM is used as the classifier. Classification accuracy (%) and training time (in seconds) are shown in Tables 2 and 3.

Table 4 :
Summary of the Wikipedia dataset.