Missing Value Estimation for Microarray Data by Bayesian Principal Component Analysis and Iterative Local Least Squares

Missing values are prevalent in microarray data, they course negative influence on downstreammicroarray analyses, and thus they should be estimated from known values.We propose a BPCA-iLLSmethod, which is an integration of two commonly usedmissing value estimation methods—Bayesian principal component analysis (BPCA) and local least squares (LLS).The inferior row-average procedure in LLS is replaced with BPCA, and the least squares method is put into an iterative framework. Comparative result shows that the proposedmethod has obtained the highest estimation accuracy across all missing rates on different types of testing datasets.


Introduction
Data generated from DNA microarray data is useful for various biological applications; the data is in the form a large matrices.Generally, a row in a matrix represents a gene, and a column represents an experimental condition.But as large matrices, the data often suffer from missing values due to technical reasons such as spotting problems and background noise [1].However, downstream analyses always need full matrices as input; thus these missing values should be estimated from existing values.Various methods to estimate missing values in microarray data have been proposed in the past decades.Generally, methods to estimate missing values can be divided into four categories [2]: (i) global based methods, (ii) local based methods, (iii) hybrid methods, and (iv) knowledge-based methods.Singular value decomposition (SVD) [3] and Bayesian principal component analysis (BPCA) [4] are two major global based approaches.SVD estimates the missing value  in gene  by first regressing this gene against  eigengenes and use the coefficients of the regression to reconstruct  from a linear combination of the  eigengenes.BPCA estimates the target gene (i.e., a gene that contains missing values) by a linear combination of  principal axis vectors, where the parameters are identified by a Bayesian estimation method.Local based category includes some classical and newly proposed methods.The most wellstudied local based method is local least squares (LLS) [5].LLS uses a multiple regression model to estimate the missing values from  nearest neighbor genes of the target gene.Most recently proposed local methods are based on LLS, including iterated Local Least Squares (iLLS), weighted local least squares (wLLS) and iterative bicluster-based least squares (bi-iLS).Hybrid methods aim to capture both global and local correlations in the data.LinCmb [6] and EMDI [7] are two typical hybrid methods which estimate the missing values by a combination of other estimation methods from global approaches and local approaches.In the knowledge-based category, domain biological knowledge or external information is integrated into the estimation process.
Among all kinds of microarray missing value estimation methods, BPCA and local least squares (LLS) are two most widely used approaches.The former is based on the global structure of the matrix, and the latter is based on local similarity of the matrix.According to a survey [8] about different microarray missing value estimation methods, BPCA performs better than LLS on datasets with lower complexity, whereas due to another survey [9], LLS is superior than BPCA in the presence of data with dominant local similarity structures.This phenomenon inspires us to integrate the two methods, with the hope of improving the estimation accuracy and robustness.The idea of iterated local least squares again inspired us to put the integrated method into an iterative framework, which will further improve the estimation accuracy.We will give a brief review of BPCA and LLS in Section 2, the new method will be described in Section 3, comparative test of the proposed method with LLS and BPCA will be given Section 4, and a conclusion is drawn in Section 5.

Brief Review of BPCA and LLS
2.1.Bayesian Principal Component Analysis.Bayesian methods have been widely used in many fields such as face recognition and decision making [10][11][12][13], and it also has successful application in microarray missing value estimation.Bayesian principal component analysis (BPCA) represents the dimensional microarray expression vectors Y as a linear combination of  ( < ) principal axis vectors   (1 ≤  ≤ ): where the coefficient   is called a factor score and  denotes the residual error.The principal axis vectors are obtained by computing the eigenvalues and eigenvectors of the covariance matrix of the dataset Y.As there are missing values in the original matrix Y, the principal axis vectors are separated into two parts as W = (W obs , W miss ), corresponding to the observed part and missing part, respectively.Factor scores  = ( 1 ,  2 , . . .,   ) are obtained by minimizing the residual error of the observed part: Equation ( 2) is a least squares problem which can be solved easily in BPCA.By using the factor scores  and W miss , the missing part of the dataset is estimated as In BPCA, the factor scores  and the residual error  in (1) are assumed to obey normal distributions; BPCA utilizes a probabilistic PCA (PPCA) model [14] to estimate parameters in the normal distribution.The parameter W, along with another two parameters  and  in the normal distribution, forms a parameter set  = {W, , }.BPCA introduces a Bayesian estimation method for the PPCA model, where the posterior distributions of  and Y miss are estimated by a variational Bayes algorithm [15] simultaneously.

Local Least Squares.
Local least squares (LLS) uses the linear correlation of the target gene and its  nearest neighbors to recover unknown entries in the target gene.To explain how LLS works, we take an  ×  microarray matrix as an example.Assuming that gene  has  missing values, take  1 and its  nearest neighbors  1 ,  2 , . . .,   as a column vector, where in finding the nearest neighbors, the measurement can be  2 -norm distance or Pearson's correlation; then, rewrite the vector as (4): In ( 4),  is the vector of unknown entries of the target gene and w  is the vector of known entries of the target gene.B and A are the  neighbors' corresponding columns with  and w  , respectively.A linear coefficient vector X is established as a least squares problem with A  and w: Then the unknown entries of the target gene can be reconstructed by a linear combination of B  and X: where (A  ) † is the pseudoinverse of A  .Repeat the procedure for all rows that have missing values and the full matrix can be recovered.
To estimate a proper  value in finding  nearest neighbors, LLS [5] provides a method like this.First, erase a certain number of known entries as missing values.Then, estimate the artificial missing matrix by using different  neighbors by LLS.At last, compare these estimated matrices with the actual matrix; the  value corresponding to the highest accuracy is chosen to be the optimal parameter.

BPCA-iLLS
Note that in LLS, in order to find  nearest neighbors and to estimate an optimal  value, a complete matrix is needed.However, in many cases, almost all rows in a microarray matrix contain missing values, which makes the distances between the target gene and other genes unable to be measured.To solve this problem, LLS [5] fills all missing values in the target gene by the row's average value first.But in our experiment, we found that row-average cannot reflect the real structure of the dataset.Because row-average only uses the information of an individual row, the missing values in a target gene do not only rely on the known values in its own row.In the proposed BPCA-iLLS method, we replace the rowaverage procedure in LLS with BPCA.The flowchart of the proposed method is shown in Figure 1.
First, the input incomplete matrix is estimated by BPCA, to get a complete matrix.Next, this complete matrix is used as a temporary matrix for a further LLS procedure.In the LLS procedure, the optimal  value is estimated on this temporary matrix, and this  value is used to find matrices A and B. Subsequently, the missing values in every target gene are estimated by matrix B and the coefficient vector X.LLS is put into an iterative framework in the proposed method; that is, the estimated values by LLS are reused to form the temporary matrix in every iteration, and matrices A and B are refined in every iteration.It can be seen from the flowchart that the temporary matrices are different in each iteration.The initial temporary matrix is estimated by BPCA; following that, this matrix turns into the complete matrix that is estimated by LLS in each iteration.It should be mentioned that if the number of complete rows in the original incomplete matrix exceeds a preset threshold (e.g., 400 in LLS [5]), only complete rows are used to form the initial temporary matrix, which will highlight the original information of the matrix.This phenomenon happens only when the missing rates are low (typically below 5%).In most cases, the initial temporary matrices are BPCA-estimated ones in our proposed method.By replacing the row-average procedure in LLS by BPCA, and refining the temporary matrix in each iteration, the proposed method has the advantage over LLS and BPCA to be more robust on all kinds of datasets and has the ability to reduce the estimation error.

Comparative Result
4.1.Methods and Evaluation.We compare the proposed BPCA-iLLS method with BPCA and LLS.The only parameter of BPCA (number of principal axis vectors) is set to its default value, and the only parameter of LLS (number of neighbor genes) is learned by its heuristic method.For the proposed method, the number of iterations is a new parameter, and in our experiments, we set this parameter to be 5 because the estimation results do not change much after 5 iterations.
The accuracy is evaluated by normalized root mean square error (NRMSE): where   is the real value, ŷ is the estimated value, and   is the standard deviation for the  actual values of the missing entries.A smaller NRMSE represents a higher accuracy.The same evaluation criterion was also used in LLS, BPCA, and a survey of different missing value estimation methods [9].4.2.Datasets.Three types of datasets are tested for the proposed method, they are time series data (TS), non-time-series data (NTS), and mixed data (MIX).Table 1 shows details of the testing datasets.Here, CDC15 28 is the same time series data as what was used in survey [9]; SP ALPHA was also used in [5] to test the performance of LLS.NCI60 and Yoshi come from the non-time-series data and mixed data in survey [9], respectively.All original datasets contain missing values.To compute the estimation error rates, only complete rows of these datasets are used.A number of entries are randomly removed from the complete part to get artificial missing values in different missing rates.As the real values of these entries are actually known, the error rates can be calculated following (7).The same testing method was also employed in BPCA, LLS, and surveys [2,8,9].

Experimental
Result.We estimate different rates of simulated missing values on the abovementioned datasets by three comparative methods: LLS, BPCA, and BPCA-iLLS, and calculate NRMSE following (7).Figures 2(a), 2(b), 2(c), and 2(d) provide the NRMSE across different missing rates for the three comparative methods on datasets CDC15 28, SP ALPHA, NCI60, and Yoshi, respectively.Every NRMSE is a mean value of five independent experiments.It can be seen from Figure 2 that on all the four testing datasets, BPCA-iLLS obtains the lowest NRMSE across all missing rates.LLS outperforms BPCA on datasets CDC15 28 and NCI60, and BPCA outperforms LLS on dataset SP ALPHA; this reveals that the two methods are complementary with each other.As an integration of the two methods, BPCA-iLLS shows its robustness on different datasets.
Table 2 shows the computational time of different methods on dataset CDC15 28.The time is obtained from running experiments by Matlab R2011b on an ordinary 64 bit Windows 7 computer with 3.4 GHz quad-core processor and 16 GB internal memory.Intuitively, as an integration of two methods, BPCA-iLLS requires more computational time.It can be seen from Table 2 that the computational time of BPCA-iLLS is indeed longer than that of BPCA and LLS.However the increment of time is within a limited scope.Considering its estimation accuracy, the increment of computational time is acceptable.

Conclusion
Microarray missing value estimation is an important procedure in biology experiments.As two widely used missing value estimation methods, Bayesian principal component analysis (BPCA) and local least squares (LLS) take advantage of the matrix's global structure and local structure, respectively; these two methods are complementary with each other.The proposed BPCA-iLLS method is an integration of BPCA and LLS, which fully exploits the global structure and local structure of the microarray matrix simultaneously, and the iterative scheme also helps to reduce the estimation error.
Experimental results show that BPCA-iLLS has obtained the lowest normalized root mean square error (NRMSE) across all missing rates on all the testing datasets within an acceptable computational time.The performance of BPCA-iLLS also reveals the effectiveness of the integration of both global and local correlations of the microarray data, and such integration is one possible future direction of this field.