Mathematical Problems in Engineering (MPE), ISSN 1563-5147, 1024-123X

Hindawi Publishing Corporation

Research Article, Article ID 162938, DOI 10.1155/2013/162938

Missing Value Estimation for Microarray Data by Bayesian Principal Component Analysis and Iterative Local Least Squares

Shi

Fuxi

0000-0002-0539-0007

Zhang

Dan

² Chen

Jun

0000-0001-7629-3266

Karimi

Hamid Reza

³ Yang

Rongni

College of Mechanical and Electronic Engineering, Northwest A&F University, No. 22 Xinong Road, Yangling, Xi'an, Shaanxi 712100, China (nwsuaf.edu.cn)

School of Electronics and Information Engineering, Xi'an Jiaotong University, No. 28 Xianning West Road, Xi'an, Shaanxi 710049, China (xjtu.edu.cn)

Department of Engineering, Faculty of Technology and Science, University of Agder, Service Box 509, 4898 Grimstad, Norway (uia.no)

2013

31 3 2013

2013 01 03 2013 11 03 2013 13 03 2013

2013

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Missing values are prevalent in microarray data. They have a negative influence on downstream microarray analyses and should therefore be estimated from known values. We propose a BPCA-iLLS method, which integrates two commonly used missing value estimation methods: Bayesian principal component analysis (BPCA) and local least squares (LLS). The inferior row-average procedure in LLS is replaced with BPCA, and the least squares method is put into an iterative framework. Comparative results show that the proposed method obtains the highest estimation accuracy across all missing rates on different types of testing datasets.

1. Introduction

Data generated from DNA microarrays are useful for various biological applications and take the form of large matrices. Generally, a row in a matrix represents a gene, and a column represents an experimental condition. These matrices often suffer from missing values due to technical reasons such as spotting problems and background noise [1]. However, downstream analyses generally require complete matrices as input; thus these missing values should be estimated from existing values. Various methods to estimate missing values in microarray data have been proposed in the past decades. Generally, they can be divided into four categories [2]: (i) global based methods, (ii) local based methods, (iii) hybrid methods, and (iv) knowledge-based methods.

Singular value decomposition (SVD) [3] and Bayesian principal component analysis (BPCA) [4] are two major global based approaches. SVD estimates the missing value j in gene i by first regressing this gene against K eigengenes and then using the regression coefficients to reconstruct j from a linear combination of the K eigengenes. BPCA estimates the target gene (i.e., a gene that contains missing values) by a linear combination of K principal axis vectors, where the parameters are identified by a Bayesian estimation method. The local based category includes both classical and newly proposed methods. The most well-studied local based method is local least squares (LLS) [5], which uses a multiple regression model to estimate the missing values from the K nearest neighbor genes of the target gene. Most recently proposed local methods are based on LLS, including iterated local least squares (iLLS), weighted local least squares (wLLS), and iterative bicluster-based least squares (bi-iLS). Hybrid methods aim to capture both global and local correlations in the data. LinCmb [6] and EMDI [7] are two typical hybrid methods, which estimate the missing values by combining other estimation methods from the global and local approaches. In the knowledge-based category, domain biological knowledge or external information is integrated into the estimation process.

Among all kinds of microarray missing value estimation methods, BPCA and local least squares (LLS) are the two most widely used approaches. The former is based on the global structure of the matrix, and the latter is based on the local similarity of the matrix. According to a survey [8] of different microarray missing value estimation methods, BPCA performs better than LLS on datasets with lower complexity, whereas according to another survey [9], LLS is superior to BPCA on data with dominant local similarity structures. This observation inspired us to integrate the two methods, with the hope of improving the estimation accuracy and robustness. The idea of iterated local least squares further inspired us to put the integrated method into an iterative framework in order to improve the estimation accuracy further. We give a brief review of BPCA and LLS in Section 2, describe the new method in Section 3, compare the proposed method with LLS and BPCA in Section 4, and draw a conclusion in Section 5.

2. Brief Review of BPCA and LLS

2.1. Bayesian Principal Component Analysis

Bayesian methods have been widely used in many fields such as face recognition and decision making [10–13], and they have also been applied successfully to microarray missing value estimation. Bayesian principal component analysis (BPCA) represents a $D$-dimensional microarray expression vector $y$ as a linear combination of $K$ ($K < D$) principal axis vectors $w_l$ ($1 \le l \le K$):

$$y = \sum_{l=1}^{K} x_l w_l + \varepsilon, \tag{1}$$

where the coefficient $x_l$ is called a factor score and $\varepsilon$ denotes the residual error. The principal axis vectors are obtained by computing the eigenvalues and eigenvectors of the covariance matrix of the dataset $Y$. As there are missing values in the original matrix $Y$, the principal axis vectors are separated into two parts as $W = (W^{\mathrm{obs}}, W^{\mathrm{miss}})$, corresponding to the observed part and the missing part, respectively. The factor scores $x = (x_1, x_2, \ldots, x_K)$ are obtained by minimizing the residual error of the observed part:

$$\mathrm{err} = \left\| y^{\mathrm{obs}} - W^{\mathrm{obs}} x \right\|^2. \tag{2}$$

Equation (2) is a least squares problem, which can be solved easily in BPCA. Using the factor scores $x$ and $W^{\mathrm{miss}}$, the missing part of the dataset is estimated as

$$y^{\mathrm{miss}} = W^{\mathrm{miss}} x. \tag{3}$$

In BPCA, the factor scores $x$ and the residual error $\varepsilon$ in (1) are assumed to obey normal distributions; BPCA utilizes a probabilistic PCA (PPCA) model [14] to estimate the parameters of these normal distributions. The parameter $W$, along with two further parameters $\mu$ and $\tau$ of the normal distributions, forms a parameter set $\theta = \{W, \mu, \tau\}$. BPCA introduces a Bayesian estimation method for the PPCA model, in which the posterior distributions of $\theta$ and $Y^{\mathrm{miss}}$ are estimated simultaneously by a variational Bayes algorithm [15].
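As a rough illustration of the least squares steps in (2) and (3) only, the following Python sketch fills the missing entries of a single expression vector from a fixed set of principal axes. It is not the BPCA algorithm itself: the variational Bayes estimation of the parameters is omitted, the axes `W` are simply assumed to be given, no mean offset is modeled, and the function name `reconstruct_missing` is hypothetical.

```python
import numpy as np

def reconstruct_missing(y, W):
    """Fill the NaN entries of y using fixed principal axes W (shape D x K)."""
    obs = ~np.isnan(y)
    # Equation (2): least squares factor scores from the observed part
    x, *_ = np.linalg.lstsq(W[obs], y[obs], rcond=None)
    filled = y.copy()
    # Equation (3): the missing part is W_miss @ x
    filled[~obs] = W[~obs] @ x
    return filled

# Toy example with hypothetical axes; y lies exactly in their span
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])
y = W @ np.array([2.0, -1.0])
y_incomplete = y.copy()
y_incomplete[2] = np.nan
y_filled = reconstruct_missing(y_incomplete, W)
```

Because the toy vector has no residual error, the removed entry is recovered exactly; with real data the quality of the estimate depends on how well the axes describe the dataset.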

2.2. Local Least Squares

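Local least squares (LLS) uses the linear correlation between the target gene and its $k$ nearest neighbors to recover unknown entries in the target gene. To explain how LLS works, we take an $m \times n$ microarray matrix as an example. Assume that the target gene $g_1$ has $p$ missing values. Stack $g_1$ and its $k$ nearest neighbors $g_{s_1}, g_{s_2}, \ldots, g_{s_k}$ as the rows of a matrix, where the nearest neighbors can be found using either the $l_2$-norm distance or Pearson's correlation, and reorder the columns so that the $p$ missing entries of $g_1$ come first. The matrix can then be written as

$$
\begin{pmatrix} g_1 \\ g_{s_1} \\ g_{s_2} \\ \vdots \\ g_{s_k} \end{pmatrix}
=
\begin{pmatrix} \alpha & w^T \\ B & A \end{pmatrix}
=
\begin{pmatrix}
\alpha_1 & \alpha_2 & \cdots & \alpha_p & w_1 & w_2 & \cdots & w_{n-p} \\
B_{1,1} & B_{1,2} & \cdots & B_{1,p} & A_{1,1} & A_{1,2} & \cdots & A_{1,n-p} \\
B_{2,1} & B_{2,2} & \cdots & B_{2,p} & A_{2,1} & A_{2,2} & \cdots & A_{2,n-p} \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
B_{k,1} & B_{k,2} & \cdots & B_{k,p} & A_{k,1} & A_{k,2} & \cdots & A_{k,n-p}
\end{pmatrix}. \tag{4}
$$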

In (4), $\alpha$ is the vector of unknown entries of the target gene and $w^T$ is the vector of known entries of the target gene. $B$ and $A$ contain the entries of the $k$ neighbors in the columns corresponding to $\alpha$ and $w^T$, respectively. A linear coefficient vector $X$ is obtained by solving the least squares problem defined by $A^T$ and $w$:

$$\min_{X} \left\| A^T X - w \right\|. \tag{5}$$

Then the unknown entries of the target gene can be reconstructed by a linear combination of $B^T$ and $X$:

$$\alpha^T = (\alpha_1 \cdots \alpha_p)^T = B^T X = B^T (A^T)^{\dagger} w, \tag{6}$$

where $(A^T)^{\dagger}$ is the pseudoinverse of $A^T$. Repeating the procedure for all rows that have missing values recovers the full matrix.
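The computation in (5) and (6) for a single target gene can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the k neighbor genes are assumed to be already selected and complete, and the function name `lls_impute_gene` is hypothetical rather than part of any published LLS implementation.

```python
import numpy as np

def lls_impute_gene(target, neighbors):
    """target: length-n vector with NaN at the missing positions.
    neighbors: complete k x n matrix holding the k nearest neighbor genes."""
    miss = np.isnan(target)
    w = target[~miss]              # known entries of the target gene
    A = neighbors[:, ~miss]        # k x (n - p), columns aligned with w
    B = neighbors[:, miss]         # k x p, columns aligned with alpha
    X = np.linalg.pinv(A.T) @ w    # least squares problem (5)
    out = target.copy()
    out[miss] = B.T @ X            # reconstruction (6)
    return out

# Toy example: the target is an exact linear combination of two neighbors
neighbors = np.array([[1.0, 2.0, 3.0, 4.0],
                      [0.0, 1.0, 0.0, 1.0]])
full_target = 2.0 * neighbors[0] + 3.0 * neighbors[1]
target = full_target.copy()
target[0] = np.nan
estimate = lls_impute_gene(target, neighbors)
```

In this toy case the missing entry is recovered exactly because the target is a noise-free combination of its neighbors; boolean indexing takes the place of the explicit column reordering in (4).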

To choose a suitable k for the nearest-neighbor search, LLS [5] uses the following heuristic. First, a certain number of known entries are erased and treated as missing values. Then, the artificially incomplete matrix is estimated by LLS using different numbers of neighbors k. Finally, the estimated matrices are compared with the actual matrix, and the k value that gives the highest accuracy is chosen as the optimal parameter.
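The heuristic above might be sketched in Python as follows. The details are assumptions made for illustration and are not taken from the LLS implementation in [5]: one known entry is erased in each of a few randomly chosen rows, neighbors are drawn only from rows with nothing erased, accuracy is measured by root mean square error on the erased entries, and the names `lls_fill` and `choose_k` are hypothetical.

```python
import numpy as np

def lls_fill(G, mask, k):
    """Re-estimate the entries flagged in mask from the k nearest complete rows."""
    est = G.copy()
    complete = np.flatnonzero(~mask.any(axis=1))   # rows with nothing erased
    for i in np.flatnonzero(mask.any(axis=1)):
        known = ~mask[i]
        # l2 distance to every complete row, measured on the known columns
        dist = np.linalg.norm(G[complete][:, known] - G[i, known], axis=1)
        nbrs = complete[np.argsort(dist)[:k]]
        A, B = G[nbrs][:, known], G[nbrs][:, mask[i]]
        X = np.linalg.pinv(A.T) @ G[i, known]      # least squares problem (5)
        est[i, mask[i]] = B.T @ X                  # reconstruction (6)
    return est

def choose_k(G, candidates, n_rows=5, seed=0):
    """Erase one known entry in each of n_rows random rows and return the best k."""
    rng = np.random.default_rng(seed)
    rows = rng.choice(G.shape[0], size=n_rows, replace=False)
    cols = rng.integers(0, G.shape[1], size=n_rows)
    mask = np.zeros(G.shape, dtype=bool)
    mask[rows, cols] = True
    errors = {}
    for k in candidates:
        est = lls_fill(G, mask, k)
        errors[k] = float(np.sqrt(np.mean((est[mask] - G[mask]) ** 2)))
    return min(errors, key=errors.get), errors

# Toy data of rank 3: three neighbors are enough to reproduce any row
rng = np.random.default_rng(1)
G = rng.normal(size=(40, 3)) @ rng.normal(size=(3, 8))
best_k, errors = choose_k(G, candidates=[1, 3])
```

On this synthetic rank-3 matrix a single neighbor cannot reproduce a row, whereas three neighbors can, so the larger candidate is selected. On real data the candidate list would cover a wider range of k.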

3. BPCA-iLLS

Note that LLS needs a complete matrix in order to find the k nearest neighbors and to estimate an optimal k value. However, in many cases almost all rows in a microarray matrix contain missing values, so the distances between the target gene and other genes cannot be measured. To solve this problem, LLS [5] first fills all missing values in the target gene with the row's average value. In our experiments, however, we found that the row average cannot reflect the real structure of the dataset: it uses only the information of an individual row, whereas the missing values in a target gene do not depend solely on the known values in its own row. In the proposed BPCA-iLLS method, we therefore replace the row-average procedure in LLS with BPCA. The flowchart of the proposed method is shown in Figure 1.

Figure 1: Flowchart of BPCA-iLLS.

First, the incomplete input matrix is estimated by BPCA to obtain a complete matrix. Next, this complete matrix is used as a temporary matrix for a subsequent LLS procedure. In the LLS procedure, the optimal k value is estimated on this temporary matrix, and this k value is used to find matrices A and B. Subsequently, the missing values in every target gene are estimated from matrix B and the coefficient vector X. In the proposed method, LLS is put into an iterative framework; that is, the values estimated by LLS are reused to form the temporary matrix in every iteration, and matrices A and B are refined in every iteration. As the flowchart shows, the temporary matrix differs in each iteration: the initial temporary matrix is estimated by BPCA, and thereafter it is the complete matrix estimated by LLS in the previous iteration.

If the number of complete rows in the original incomplete matrix exceeds a preset threshold (e.g., 400 in LLS [5]), only complete rows are used to form the initial temporary matrix, which preserves the original information of the matrix. This situation arises only when the missing rate is low (typically below 5%); in most cases, the initial temporary matrix in our proposed method is the BPCA-estimated one. By replacing the row-average procedure in LLS with BPCA and refining the temporary matrix in each iteration, the proposed method is more robust than LLS and BPCA across different kinds of datasets and reduces the estimation error.
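The iterative part of the procedure can be sketched in Python as below. This is an illustrative sketch rather than the authors' implementation: the initial complete matrix is passed in as an argument (in the proposed method it would come from BPCA), k is assumed to have been chosen beforehand, the complete-row threshold rule is omitted, neighbors are selected by l2 distance on the temporary matrix, and the name `iterative_lls` is hypothetical.

```python
import numpy as np

def iterative_lls(Y, initial, k, n_iter=5):
    """Y: m x n matrix with NaN for missing entries.
    initial: complete m x n estimate used as the first temporary matrix."""
    miss = np.isnan(Y)
    temp = np.where(miss, initial, Y)          # initial temporary matrix
    all_rows = np.arange(Y.shape[0])
    for _ in range(n_iter):
        new = temp.copy()
        for i in np.flatnonzero(miss.any(axis=1)):
            known = ~miss[i]
            others = np.delete(all_rows, i)
            # neighbors are searched on the current temporary matrix
            dist = np.linalg.norm(temp[others] - temp[i], axis=1)
            nbrs = others[np.argsort(dist)[:k]]
            A, B = temp[nbrs][:, known], temp[nbrs][:, miss[i]]
            X = np.linalg.pinv(A.T) @ Y[i, known]
            new[i, miss[i]] = B.T @ X
        temp = new                             # refined temporary matrix
    return temp

# Toy usage: rank-3 data with one missing entry; the row-mean fill below is
# only a stand-in for a BPCA estimate, which is not implemented here.
rng = np.random.default_rng(0)
G = rng.normal(size=(30, 3)) @ rng.normal(size=(3, 8))
Y = G.copy()
Y[4, 2] = np.nan
row_means = np.nanmean(Y, axis=1, keepdims=True)
initial = np.where(np.isnan(Y), row_means, Y)
result = iterative_lls(Y, initial, k=3)
```

Only the originally missing entries are ever overwritten, so observed values are preserved across iterations; the default of five iterations mirrors the setting reported in Section 4.1.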

4. Comparative Result

4.1. Methods and Evaluation

We compare the proposed BPCA-iLLS method with BPCA and LLS. The only parameter of BPCA (the number of principal axis vectors) is set to its default value, and the only parameter of LLS (the number of neighbor genes) is learned by its heuristic method. The proposed method introduces one new parameter, the number of iterations, which we set to 5 in our experiments because the estimation results change little after 5 iterations.

The accuracy is evaluated by normalized root mean square error (NRMSE):

$$\mathrm{NRMSE} = \frac{\sqrt{\frac{1}{N}\sum_{j=1}^{N} (y_j - \hat{y}_j)^2}}{\sigma_y}, \tag{7}$$

where $y_j$ is the real value, $\hat{y}_j$ is the estimated value, and $\sigma_y$ is the standard deviation of the $N$ actual values of the missing entries. A smaller NRMSE represents a higher accuracy. The same evaluation criterion was also used in LLS, BPCA, and a survey of different missing value estimation methods [9].
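For concreteness, (7) can be computed with a few lines of Python. The sketch assumes the population standard deviation (NumPy's default, `ddof=0`) for σy, which the text does not specify, and the function name `nrmse` is chosen here for illustration.

```python
import numpy as np

def nrmse(y_true, y_est):
    """Normalized root mean square error over the missing entries, as in (7)."""
    y_true = np.asarray(y_true, dtype=float)
    y_est = np.asarray(y_est, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_est) ** 2))
    # Assumption: population standard deviation of the true values
    return rmse / np.std(y_true)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
perfect = nrmse(y_true, y_true)
baseline = nrmse(y_true, np.full(4, y_true.mean()))
one_off = nrmse(y_true, np.array([1.0, 2.0, 3.0, 6.0]))
```

Under this convention a perfect estimate gives 0, and replacing every missing entry by the mean of the true values gives exactly 1, which provides a natural reference point for reading the NRMSE curves.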

4.2. Datasets

Three types of datasets are used to test the proposed method: time series data (TS), non-time-series data (NTS), and mixed data (MIX). Table 1 shows details of the testing datasets. CDC15_28 is the same time series dataset as the one used in survey [9]; SP_ALPHA was also used in [5] to test the performance of LLS. NCI60 and Yoshi are the non-time-series and mixed datasets from survey [9], respectively.

Table 1: Testing datasets.

Dataset	Reference	Original size (genes × conditions)	Complete size (genes × conditions)	Type
CDC15_28	[16]	6178 × 41	869 × 41	TS
SP_ALPHA	[16]	6178 × 18	4489 × 18	TS
NCI60	[17]	9706 × 60	2266 × 60	NTS
Yoshi	[18]	6166 × 24	4380 × 24	MIX

All original datasets contain missing values. To compute the estimation error rates, only the complete rows of these datasets are used. A number of entries are randomly removed from the complete part to create artificial missing values at different missing rates. As the real values of these entries are known, the error rates can be calculated following (7). The same testing method was also employed in BPCA, LLS, and surveys [2, 8, 9].
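The simulation of artificial missing values described above can be sketched as follows. This is one simple way to do it and not necessarily the exact procedure used in the experiments: entries are removed uniformly at random over the whole complete matrix, no check is made that every row keeps some known values, and the function name `make_missing` is hypothetical.

```python
import numpy as np

def make_missing(complete, rate, seed=0):
    """Return a copy of `complete` with a fraction `rate` of entries set to NaN,
    together with the boolean mask of removed positions."""
    rng = np.random.default_rng(seed)
    n_remove = int(round(rate * complete.size))
    idx = rng.choice(complete.size, size=n_remove, replace=False)
    mask = np.zeros(complete.size, dtype=bool)
    mask[idx] = True
    mask = mask.reshape(complete.shape)
    incomplete = complete.astype(float)   # astype returns a new array
    incomplete[mask] = np.nan
    return incomplete, mask

# Toy usage: remove 5% of the entries of a 20 x 10 matrix
complete = np.arange(200.0).reshape(20, 10)
incomplete, mask = make_missing(complete, rate=0.05)
```

The true values at `complete[mask]` are kept aside, so any estimate of `incomplete` can be scored against them with (7).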

4.3. Experimental Result

We estimate different rates of simulated missing values on the abovementioned datasets by three comparative methods: LLS, BPCA, and BPCA-iLLS, and calculate NRMSE following (7). Figures 2(a), 2(b), 2(c), and 2(d) provide the NRMSE across different missing rates for the three comparative methods on datasets CDC15_28, SP_ALPHA, NCI60, and Yoshi, respectively. Every NRMSE is a mean value of five independent experiments.

Figure 2: NRMSE on the four testing datasets: (a) CDC15_28, (b) SP_ALPHA, (c) NCI60, and (d) Yoshi.

Figure 2 shows that, on all four testing datasets, BPCA-iLLS obtains the lowest NRMSE across all missing rates. LLS outperforms BPCA on datasets CDC15_28 and NCI60, and BPCA outperforms LLS on dataset SP_ALPHA, which indicates that the two methods are complementary to each other. As an integration of the two methods, BPCA-iLLS is robust across the different datasets.

Table 2 shows the computational time of the different methods on dataset CDC15_28. The times were obtained by running the experiments in MATLAB R2011b on an ordinary 64-bit Windows 7 computer with a 3.4 GHz quad-core processor and 16 GB of memory. Intuitively, as an integration of two methods, BPCA-iLLS requires more computational time, and Table 2 confirms that its computational time is indeed longer than that of BPCA and LLS. However, the increase is limited: BPCA-iLLS takes less than 50 seconds at every missing rate, roughly 1.5 to 2 times the time of BPCA alone. Considering its estimation accuracy, this increase in computational time is acceptable.

Table 2: Computational time (seconds) on CDC15_28.

Missing rate	BPCA	LLS	BPCA-iLLS
3%	20.90	12.78	41.04
5%	27.86	12.43	47.90
8%	31.33	11.99	49.76
10%	27.23	11.74	46.85
15%	25.14	10.65	37.16
20%	22.25	9.57	34.29

5. Conclusion

Microarray missing value estimation is an important procedure in biology experiments. As two widely used missing value estimation methods, Bayesian principal component analysis (BPCA) and local least squares (LLS) take advantage of the matrix's global structure and local structure, respectively; the two methods are complementary to each other. The proposed BPCA-iLLS method integrates BPCA and LLS so that the global and local structures of the microarray matrix are exploited together, and the iterative scheme further helps to reduce the estimation error. Experimental results show that BPCA-iLLS obtains the lowest normalized root mean square error (NRMSE) across all missing rates on all the testing datasets within an acceptable computational time. The performance of BPCA-iLLS also demonstrates the effectiveness of integrating the global and local correlations of microarray data, and such integration is one possible future direction for this field.

References

[1] Jörnsten, Wang H. Y., Welsh W. J., and Ouyang, "DNA microarray data imputation and significance analysis of differential expression," Bioinformatics, vol. 21, no. 22, pp. 4155–4161, 2005. doi:10.1093/bioinformatics/bti638

[2] Liew A. W., Law N. F., and Yan, "Missing value imputation for gene expression data: computational techniques to recover missing data from available information," Brief Bioinform, vol. 12, no. 5, pp. 498–513, 2011. doi:10.1093/bib/bbq080

[3] Troyanskaya, Cantor, Sherlock, Brown, Hastie, Tibshirani, Botstein, and Altman R. B., "Missing value estimation methods for DNA microarrays," Bioinformatics, vol. 17, no. 6, pp. 520–525, 2001. doi:10.1093/bioinformatics/17.6.520

[4] Oba, Sato M. A., Takemasa, Monden, Matsubara K. I., and Ishii, "A Bayesian missing value estimation method for gene expression profile data," Bioinformatics, vol. 19, no. 16, pp. 2088–2096, 2003. doi:10.1093/bioinformatics/btg287

[5] Kim, Golub G. H., and Park, "Missing value estimation for DNA microarray gene expression data: local least squares imputation," Bioinformatics, vol. 21, no. 2, pp. 187–198, 2005. doi:10.1093/bioinformatics/bth499

[6] Jörnsten, Wang H. Y., Welsh W. J., and Ouyang, "DNA microarray data imputation and significance analysis of differential expression," Bioinformatics, vol. 21, no. 22, pp. 4155–4161, 2005. doi:10.1093/bioinformatics/bti638

[7] Pan X. Y., Tian, Huang, and Shen H. B., "Towards better accuracy for missing value estimation of epistatic miniarray profiling data by a novel ensemble approach," Genomics, vol. 97, no. 5, pp. 257–264, 2011. doi:10.1016/j.ygeno.2011.03.001

[8] Brock G. N., Shaffer J. R., Blakesley R. E., Lotz M. J., and Tseng G. C., "Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes," BMC Bioinformatics, vol. 9, article 12, 2008. doi:10.1186/1471-2105-9-12

[9] Brás L. P. and Menezes J. C., "Dealing with gene expression missing data," IEE Systems Biology, vol. 153, no. 3, pp. 105–119, 2006. doi:10.1049/ip-syb:20050056

[10] Daliri M. R. and Saraf, "A Bayesian framework for face recognition," International Journal of Innovative Computing, Information and Control, vol. 8, pp. 4591–4603, 2012.

[11] Nguyen H. T. T., Luong H. N., and Ahn C. W., "An entropy approach to evaluation relaxation for Bayesian optimization algorithm," International Journal of Innovative Computing, Information and Control, vol. 8, pp. 6371–6388, 2012.

[12] Hsieh, "A Bayesian approach in making mastery decisions: comparison of two loss functions," International Journal of Innovative Computing, Information and Control, vol. 8, pp. 7427–7435, 2012.

[13] Eren-Dogu Z. F. and Celikoglu C. C., "Information security risk assessment: Bayesian prioritization for AHP group decision making," International Journal of Innovative Computing, Information and Control, vol. 8, pp. 8001–8018, 2012.

[14] Tipping M. E. and Bishop C. M., "Mixtures of probabilistic principal component analyzers," Neural Computation, vol. 11, no. 2, pp. 443–482, 1999. doi:10.1162/089976699300016728

[15] Attias, "Inferring parameters and structure of latent variable models by variational Bayes," in Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence (UAI '99), pp. 21–30, 1999.

[16] Spellman P. T., Sherlock, Zhang M. Q., Iyer V. R., Anders, Eisen M. B., Brown P. O., Botstein, and Futcher, "Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization," Molecular Biology of the Cell, vol. 9, no. 12, pp. 3273–3297, 1998.

[17] Scherf, Ross D. T., Waltham, Smith L. H., Lee J. K., Tanabe, Kohn K. W., Reinhold W. C., Myers T. G., Andrews D. T., Scudiero D. A., Eisen M. B., Sausville E. A., Pommier, Botstein, Brown P. O., and Weinstein J. N., "A gene expression database for the molecular pharmacology of cancer," Nature Genetics, vol. 24, no. 3, pp. 236–244, 2000. doi:10.1038/73439

[18] Yoshimoto, Saltsman, Gasch A. P., H. X., Ogawa, Botstein, Brown P. O., and Cyert M. S., "Genome-wide analysis of gene expression regulated by the calcineurin/Crz1p signaling pathway in Saccharomyces cerevisiae," Journal of Biological Chemistry, vol. 277, no. 34, pp. 31079–31088, 2002. doi:10.1074/jbc.M202718200