
Missing values are prevalent in microarray data. They negatively affect downstream microarray analyses and should therefore be estimated from the known values. We propose BPCA-iLLS, a method that integrates two commonly used missing value estimation methods: Bayesian principal component analysis (BPCA) and local least squares (LLS). The inferior row-average procedure in LLS is replaced with BPCA, and the least squares method is placed in an iterative framework. Comparative results show that the proposed method obtains the highest estimation accuracy across all missing rates on different types of testing datasets.

Data generated from DNA microarrays are useful for various biological applications; the data take the form of large matrices. Generally, a row in a matrix represents a gene, and a column represents an experimental condition. However, these large matrices often suffer from missing values due to technical reasons such as spotting problems and background noise [

Among all microarray missing value estimation methods, BPCA and local least squares (LLS) are the two most widely used approaches. The former is based on the global structure of the matrix, and the latter is based on the local similarity of the matrix. According to a survey [

Bayesian methods have been widely used in many fields such as face recognition and decision making [

In BPCA, the factor scores

Local least squares (LLS) uses the linear correlation of the target gene and its

In (

Then the unknown entries of the target gene can be reconstructed by a linear combination of

To estimate a proper

Note that in LLS, in order to find

Flowchart of BPCA-iLLS.

First, the input incomplete matrix is estimated by BPCA, to get a complete matrix. Next, this complete matrix is used as a temporary matrix for a further LLS procedure. In the LLS procedure, the optimal

We compare the proposed BPCA-iLLS method with BPCA and LLS. The only parameter of BPCA (the number of principal axis vectors) is set to its default value, and the only parameter of LLS (the number of neighbor genes) is learned by its heuristic method. The proposed method introduces one new parameter, the number of iterations; in our experiments we set it to 5 because the estimation results change little after 5 iterations.
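
The iterative least squares stage described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `lls_pass` and `iterative_lls`, the Euclidean neighbor ranking, and the fixed neighbor count `k` are assumptions made for illustration (the method itself learns the number of neighbors heuristically). The argument `initial_fill` stands for any complete matrix produced by a first-stage estimator; in BPCA-iLLS that role is played by the BPCA output.

```python
import numpy as np


def lls_pass(incomplete, temp, k):
    """One LLS sweep: re-estimate each missing entry from k neighbor rows of `temp`."""
    missing = np.isnan(incomplete)
    result = temp.copy()
    for g in np.flatnonzero(missing.any(axis=1)):
        miss, obs = missing[g], ~missing[g]
        if not obs.any():
            continue  # no observed entries in this row: keep the current estimate
        # Rank the other genes by Euclidean distance on the target's observed columns.
        dist = np.linalg.norm(temp[:, obs] - incomplete[g, obs], axis=1)
        dist[g] = np.inf  # a gene is never its own neighbor
        nbrs = np.argsort(dist)[:k]
        A = temp[nbrs][:, obs]   # neighbors on observed columns, shape (k, n_obs)
        B = temp[nbrs][:, miss]  # neighbors on missing columns, shape (k, n_miss)
        # Least squares: coefficients x such that A.T @ x approximates the observed part.
        x, *_ = np.linalg.lstsq(A.T, incomplete[g, obs], rcond=None)
        result[g, miss] = B.T @ x
    return result


def iterative_lls(incomplete, initial_fill, k=10, n_iter=5):
    """Repeat LLS, feeding each pass's output back in as the temporary matrix."""
    # Observed values always come from the input; only missing cells are estimated.
    temp = np.where(np.isnan(incomplete), initial_fill, incomplete)
    for _ in range(n_iter):
        temp = lls_pass(incomplete, temp, k)
    return temp
```

Here `n_iter=5` mirrors the setting reported above, while `k=10` is an arbitrary placeholder. Passing a row-average fill as `initial_fill` would correspond to an iterated form of plain LLS rather than to the proposed method.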

The accuracy is evaluated by normalized root mean square error (NRMSE):
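
A minimal sketch of how NRMSE might be computed, assuming the form commonly used in the missing value estimation literature: the root mean squared difference between estimated and true values over the removed entries, divided by the standard deviation of the true values at those entries. The function name `nrmse` and the boolean-mask convention are illustrative assumptions.

```python
import numpy as np


def nrmse(estimated, truth, removed):
    """NRMSE over the entries flagged True in the boolean mask `removed`."""
    diff = estimated[removed] - truth[removed]
    rmse = np.sqrt(np.mean(diff ** 2))
    # Normalize by the (population) standard deviation of the true values.
    return rmse / np.std(truth[removed])
```

Under this definition a perfect estimate scores 0, and replacing every removed entry with the mean of the true removed values scores 1, so values well below 1 indicate a useful estimator.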

Three types of datasets are used to test the proposed method: time series data (TS), non-time-series data (NTS), and mixed data (MIX). Table

Testing datasets.

Dataset | Reference | Original size | Complete size | Type |
---|---|---|---|---|
CDC15_28 | [ | 6178 × 41 | 869 × 41 | TS |
SP_ALPHA | [ | 6178 × 18 | 4489 × 18 | TS |
NCI60 | [ | 9706 × 60 | 2266 × 60 | NTS |
Yoshi | [ | 6166 × 24 | 4380 × 24 | MIX |

All original datasets contain missing values. To compute the estimation error rates, only complete rows of these datasets are used. A number of entries are randomly removed from the complete part to get artificial missing values in different missing rates. As the real values of these entries are actually known, the error rates can be calculated following (
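
The evaluation protocol in this paragraph (keep only complete rows, then remove entries at random at a chosen missing rate) can be sketched as follows. The helper names `complete_rows` and `remove_entries`, the use of NaN to mark missing values, and the seeding scheme are assumptions for illustration only.

```python
import numpy as np


def complete_rows(matrix):
    """Keep only the rows (genes) that contain no missing value."""
    return matrix[~np.isnan(matrix).any(axis=1)]


def remove_entries(complete, rate, seed=0):
    """Blank out a fraction `rate` of entries uniformly at random.

    Returns the matrix with NaN holes and the boolean mask of removed entries.
    """
    rng = np.random.default_rng(seed)
    n_remove = int(round(rate * complete.size))
    idx = rng.choice(complete.size, size=n_remove, replace=False)
    removed = np.zeros(complete.shape, dtype=bool)
    removed[np.unravel_index(idx, complete.shape)] = True
    holed = complete.astype(float)  # astype returns a copy, so the input is untouched
    holed[removed] = np.nan
    return holed, removed
```

Because the removed values are still available in `complete`, the returned mask can be used to score any estimator against the known truth.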

We estimate simulated missing values at different missing rates on the abovementioned datasets using the three compared methods (LLS, BPCA, and BPCA-iLLS) and calculate NRMSE following (

NRMSE on the four testing datasets. (a) CDC15_28, (b) SP_ALPHA, (c) NCI60, and (d) Yoshi.

It can be seen from Figure

Table

Computational time (seconds) on CDC15_28.

Missing rate | BPCA | LLS | BPCA-iLLS |
---|---|---|---|
3% | 20.90 | 12.78 | 41.04 |
5% | 27.86 | 12.43 | 47.90 |
8% | 31.33 | 11.99 | 49.76 |
10% | 27.23 | 11.74 | 46.85 |
15% | 25.14 | 10.65 | 37.16 |
20% | 22.25 | 9.57 | 34.29 |

Microarray missing value estimation is an important procedure in biological experiments. Two widely used missing value estimation methods, Bayesian principal component analysis (BPCA) and local least squares (LLS), take advantage of the matrix's global structure and local structure, respectively, and are therefore complementary to each other. The proposed BPCA-iLLS method integrates BPCA and LLS so that the global and local structures of the microarray matrix are exploited simultaneously, and the iterative scheme further helps to reduce the estimation error. Experimental results show that BPCA-iLLS obtains the lowest normalized root mean square error (NRMSE) across all missing rates on all the testing datasets within an acceptable computational time. The performance of BPCA-iLLS also demonstrates the effectiveness of integrating the global and local correlations of microarray data, and such integration is one possible future direction for this field.