Prediction and Characterization of Missing Proteomic Data in Desulfovibrio vulgaris

Proteomic datasets are often incomplete due to identification range and sensitivity issues. It becomes important to develop methodologies to estimate missing proteomic data, allowing better interpretation of proteomic datasets and metabolic mechanisms underlying complex biological systems. In this study, we applied an artificial neural network to approximate the relationships between cognate transcriptomic and proteomic datasets of Desulfovibrio vulgaris, and to predict protein abundance for the proteins not experimentally detected, based on several relevant predictors, such as mRNA abundance, cellular role and triple codon counts. The results showed that the coefficients of determination for the trained neural network models ranged from 0.47 to 0.68, providing better modeling than several previous regression models. The validity of the trained neural network model was evaluated using biological information (i.e. operons). To seek understanding of mechanisms causing missing proteomic data, we used a multivariate logistic regression analysis and the result suggested that some key factors, such as protein instability index, aliphatic index, mRNA abundance, effective number of codons (N c) and codon adaptation index (CAI) values may be ascribed to whether a given expressed protein can be detected. In addition, we demonstrated that biological interpretation can be improved by use of imputed proteomic datasets.


Introduction
Application of various experimental "omics" tools has enhanced our understanding of complex biological systems in the past decade [1][2][3][4][5][6][7][8]. However, due to technical limitations of these high throughput technologies and constraints on experimental design, most of these high-throughput "omics" datasets still suffer from significant missing values. The incomplete data available has impeded scientists from extracting correct information regarding cell metabolism. To address the issue, attempts to apply computational tools to impute missing values in various high throughput "omics" datasets have been made [9][10][11]. The most successful examples of such efforts were for missing transcriptomic data [12].
Compared with transcriptomic dataset, proteomic datasets suffer even more serious missing data since its identification range and sensitivity are still not fully comparable with transcriptomic measurements [13]. In particular, when partial proteomics dataset was used in various integrated "omics" studies, the undetected proteins were simply assigned a "zero" value and were excluded from relationship modeling, which could bias any conclusion resulted from the integrated studies [14]. To overcome this problem, several methods have been adapted from the estimation of missing values in transcriptomic data to estimate the missing proteomics values by using the available measurements from other proteins, such as the k nearest neighbor and Bayesian Principal Component Analysis (BPCA) methods for imputing missing proteomic values in gel-based proteomics dataset [15,16], and by integrating the GO (Gene Ontology) information into the proteomic data imputation [17]. Based on the assumption that there exists meaningful correlation between two types of datasets [14], in recent years we have developed the Zero-inflated Poisson (ZIP) linear regression model [18] and a stochastic Gradient Boosted Trees (GBT) nonlinear model [19] to 2 Comparative and Functional Genomics uncover possible relationships between transcriptomic and proteomic data and to predict protein abundance for the proteins not experimentally detected.
As part of a continuous effort to develop statistical tools for missing value imputation, we in the study applied an artificial neural network (ANN) approach to predict abundance for experimentally undetected proteins for several cognate transcriptomic and proteomic datasets of sulfate reducing bacterium Desulfovibrio vulgaris [20][21][22][23]. The advantages of artificial neural networks include ability to tackle complex relationships and to deal with noisy data with missing values, and its capability of generalization [24,25]. In the approach, a multilayer perceptron (MLP) with sigmoid activation function in the single hidden layer was used to capture the nonlinear relationship between transcriptomic data and protein abundance. The top ten relevant predictors for protein abundance identified in a previous study [19] were used as inputs to the MLP for training the output of the MLP to approximate the corresponding protein abundance of each gene. The trained MLP was validated using biological information. In addition, by using a multiple logistic regression analysis, we quantified the contribution of some sequence-based factors, such as mRNA abundance, protein instability index, gene length, "effective number of codons" (N c ), and CAI (Codon Adaptation Index) values to missing proteomic values. We also demonstrated that biological interpretation can be improved by using the imputed proteomic datasets.

Datasets.
Two datasets from Desulfovibrio vulgaris were analyzed in this study. The datasets covers the same species, although the experimental conditions are different and they were obtained by two independent studies [19][20][21][22][23]. The raw intensity values from both datasets were normalized using a quantile normalization in an R package (caret) available through the R project (http://www.r-project.org/). Absolute fluorescence intensity and peptide hits were used as abundance for transcriptomic and proteomic data, respectively [26]. These datasets, labeled Dataset 1 and Dataset 2, will be used throughout the article.

Genome Information.
Gene annotated attributes, such as sequence length, protein length, molecular weight, GC content, triple codon counts, and cellular functional categories of all genes in the D. vulgaris genome were downloaded from the comprehensive microbial resource (CMR) of TIGR (http://cmr.tigr.org/) [22]. To define operons, we used the distance-only approach, a relatively low threshold, in this study to cover more of the possible operons. The genes that are transcribed in the same direction and are with intergenic region less than 15 base pairs were defined as one operon [19].

Construction of Relationship between
Proteomics and mRNA through Artificial Neural Network. We trained an artificial neural network to learn a particular function by adjusting the values of the weights between neurons. It has been shown that a multilayer perceptron with a single hidden layer can approximate any continuous function to arbitrary precision by increasing the number of neurons in the hidden layer [27][28][29]. The weights connecting the input signals as well as the number of neurons can be determined by minimizing the sum of the squared difference between Predicted protein abundance   the network outputs and target outputs for each set of input signals. Nonlinear optimization methods in combination with a backpropagation algorithm [30] were implemented to find the optimal weights. The inputs to the network and the targets were usually rescaled to [−1, 1]. The outputs of the network were then rescaled back to approximate the targets in original scale. In this work, we trained a multilayer perceptron to learn the nonlinear relationship between transcriptomic data and proteomic data. The inputs were the transcriptomic data and the target was the protein abundance.
A feed-forward network with a hidden layer can approximate any continuous function with arbitrary precision, which provides ANN the power to deal with nonlinear relationship, also open a door to overfitting if too many neurons are used in the hidden layer. To address the possible overfitting, we implemented a cross-validation to prevent overfitting. Through this method the data was partitioned in K equal parts where K − 1 sets were used to train the model and the other set was used to calculate prediction errors [31]. This was repeated K times, hence K prediction errors values were computed one at every fold. An average and standard deviation can be extracted to select the most representative model for future prediction. Cross-validation was generally effective to prevent overfitting. Beside crossvalidation to prevent overfitting, we also used some statistical tools to diagnose the selected model. Once the best model has been selected, based on cross-validation, the model was evaluated based on its coefficient of determination (R 2 ). This coefficient represents the variation explained by the model. Furthermore, to alternatively assess the goodness of the model we studied the predictions of small sets of genes which are grouped based on operon information. In order to describe the variation within a data set, such as "molar abundance" of proteins within one operon, we computed the coefficient of variation (CV) for each set of proteins. The CV is defined as the ratio of the standard deviation and the mean of the "molar abundance" for a set of proteins [18,19,32] where the calculation of CV score is independent of the sample size. These coefficients of variations were computed for all operons and compared to a distribution of permuted CVs where permutation of genes was performed.

Analysis of the Factors Ascribed to Missing Protein
Measurements. For DNA and protein sequence features, the BioPerl module was used to batch query and parse results from the ExPASy ProtParam program, returning "AliphaticIndex" and "InstabilityIndex". CodonW was used to perform correspondence analysis of both codon and amino acid usages, returning the first four axes representing the major trends (AA axis1-4 for amino acid usage, CR axis1-4 for Relative Synonymous Codon Usage). The frequency of each base at the third codon position (synonymous site, designated as A3s, T3s, C3s, and G3s), as well as two additional measures on protein property such as Grand Average of Hydropathy (GRAVY) and AROMO, were also computed with CodonW. The Codon Adaptation Index (CAI) was computed with an in-house perl script using D. vulgaris ribosomal protein genes as reference for highly expressed genes. The calculation and analysis of these features were described in several of our previous works [33][34][35].
A logistic regression model was used to assess how each of the features attributes to whether a given protein will be detected [36,37]. Assume the features for a gene i are x 1i , . . . , x ki . In logistic regression, the logit, natural log of the odds, of the unknown probability, p i , that the protein is not detected is modeled as a linear function of x 1i , . . . , x ki : The model is equivalent to p i = exp β 0 + β 1 x 1i + · · · + β k x ki exp β 0 + β 1 x 1i + · · · + β k x ki + 1 .
The unknown coefficients are usually estimated by maximum likelihood using a method common to all generalized linear models [35]. In the model, the parameter β j is the additive effect on the log of the odds for a unit change in the jth explanatory feature variable. That is, one unit change of feature j, that is, x ji indicates the odds changes e βj times. Positive β j indicated that the probability that the protein will not be detected is increased with increasing the jth feature. By using a logistic regression analysis, the effects of relevant features on whether a protein will be detected were quantified and can be compared. And the fitted model can also be used to predict the chance that any other proteins will not be detected.

Determine the Variable Importance and Partial Variable
Dependence. In a previous study, we used the gradient boosted trees (GBT) model to measure the contribution of each of the 70 variables input variables to the RNAprotein correlation, and the 10 top-ranked variables were identified to be mRNA expression level, cellular role, and several triple codon usages for the D. vulgaris datasets we used (See Table 1 of Supplmentary Material available online at doi: 10.1155/2011/780973) [19]. In this study, we used the same top ten predictors as inputs to the artificial neural network approach. The datasets used are consistent with previous work for comparison. The results showed that for all datasets, we observe a smooth relationship between protein abundance and the abundance of corresponding mRNA controlling the other predictors at their mean values for each growth condition. The partial dependence between protein and mRNA abundance tends to be smoother when the abundance of mRNAs are in the high range ( Figure 1). Figure 2 shows the scatter plots of predicated protein abundance versus mRNA for proteins detected and undetected for datasets 1 and dataset 2, respectively. For both datasets, under each growth condition, there are much fewer genes with very high mRNA or protein abundance. Those proteins undetected are mostly with very low predicted protein abundance and observed mRNA abundance as shown in Figures 2(b) and 2(d) for dataset 1 and dataset 2, respectively. The proteins undetected have much smaller variance in predicted values especially for dataset 2 as shown in Figure 2(d). In comparison, those detected proteins have a larger range of protein abundance and are of very high variability.
In the previous work of using GBT model to uncover the nonlinear relationship between transcriptomic and proteomic datasets, an interesting pattern, a plateau effect, was observed in the partial dependence plot of mRNA abundance versus protein abundance for high values of mRNA. This is potentially due to the fact that protein abundance data is sparse, with high variance, where the tree models do not generate splits among the predictors [19]. In the neural network model, the output of an MLP is an inherent smooth function of the network inputs, hence a relatively smooth prediction is always guaranteed. However, we can still see the similar pattern that the curve becomes flat when the variance for the mRNA abundance is high. For example, in the region of high mRNA values there are a small number of genes/peptides whose protein values ranges from (0, 5 40) for Dataset 1 and (0, 500) for Dataset 2. This could reflect problems with the sensitivity of the current proteomic technologies for high abundant proteins.

Construction of the Nonlinear Artificial Neural Network
Model. A feed-forward network with a single hidden layer was trained for each condition for both datasets with mRNA abundance and other predictors as inputs and protein abundance as target. The neural networks were trained using the functions in the neural network toolbox provided in standard software Matlab (7.0.1). The activation function for the neurons in the hidden layer was the hyperbolic tangent function, which is given by The activation function of the output neuron is a pure linear function, which is the weighted sum of the outputs from the neurons of the hidden layer. The optimal number of neurons in the hidden layer and the optimal weights were determined using 10-fold cross-validation for training a neural network. For each dataset under each experimental condition, the networks with the same inputs but with different number of neurons (from 5 to 12) in the hidden layer were each trained using cross-validation to learn the relationship between the mRNA features and protein abundance. The optimal number of neurons in the layer was determined based on the one generating the minimum averaged validation error. Both transcriptomic and proteomic data were logtransformed. For dataset 1, the genes with no missing value for three out of four replicates were selected for model training. The numbers of genes used for training were 354, 312, and 308 for the FL, LL, and LS conditions, respectively. For each growth condition, the inputs to the multilayer perceptron were the top ten predictors identified previously [19] (See Table 1 at Supplementary Material). The optimal network had 6, 6, and 9 neurons in the hidden layer for FL, LL, and LS conditions, respectively. Coefficients of determinations R 2 were 0.47, 0.52, and 0.51 for FL, LL, LS conditions, respectively. For dataset 2, a total of 986, 1001, and 986 genes with no missing protein measurements were used for model training for CT0, CT120, and ST120 conditions, respectively. For each of the conditions, the inputs to the multilayer perceptron were the top ten predictors identified previously [19] (See Table 1 at Supplementary material). The optimal network had 8, 7, and 6 neurons in the hidden layer for CT0, CT120, and ST120 conditions, respectively. Coefficients of determination R 2 were 0.65, 0.62, and 0.68 for CT0, CT120, and ST120 conditions, respectively.

Validation of the Prediction through the Nonlinear
Artificial Neural Network Model. The validation of the model prediction was conducted through the calculation of coefficient of variation (CV) across conditions for every operon of D. vulgaris as described before [18,19]. The genes in the operons are considered to have less dispersion than any random set of genes in terms of their gene expression and/or protein abundance, because of their intrinsic biological relationship. With relatively low threshold for operon identification as described in Section 2, a total of 609 operons were identified in D. vulgaris, which ranged from 2 to 13 genes per operon. To compare these CV values, we have used a permutation test in the following manner: a CV is computed from the predicted protein values for a permuted number of genes. This step is repeated 2000 times through resampling of genes without replacement. The number of genes randomly selected is equal to the size of the gene group to be compared to. For dataset 1, 82% to 89% of the operon groups had smaller CV values than the permuted groups, and for dataset 2, 79% to 89% of the operon groups had smaller CV values than the permuted, demonstrating in general good performance of the model prediction. The validation results are summarized in Table 1.
To further demonstrate how the current measurement of dispersion (CV) compares with the distribution of the permuted coefficients of variation, we also calculated the percentile score for all coefficients of variations, based on the biological knowledge that genes from the same operon should be less dispersed than permuted sets of genes and the percentile scores are thus expected to be low. The results showed that even though the calculated CV for most groups was less than the mean CV value for permuted sets of genes, the percentile scores shows that only a small collection of these groups fall within a percentile less than 0.20. The percentage of groups with percentile scores less than 0.2 is 43% to 57%, and 57% to 61% for dataset 1 and dataset 2, respectively.

Comparison of Artificial Neural Network to Previous
Methods (GBT and ZIP). Similar to the GBT method, the artificial neural network method aims to identify an unknown and potentially nonlinear relationship between proteomics and mRNA abundance data, which distinguish them from the ZIP method, which carries zero-inflated linear assumptions [18]. Another key difference is that GBT and neural network methods utilized additional sequence based information, such as cellular role, gene length, amino acid usage, and codon usage known to be important for translation efficiency to model the relationship between transcriptomic and proteomic dataset, while the ZIP method establishes the relationship between proteomic and mRNA solely based on experimental transcriptomic and proteomic dataset.
Some improvements were observed in terms of prediction validation by the neural network model in comparison to the GBT model. The percentage of operon groups with CV smaller than permuted CV was 82% to 89% for dataset 1 in comparison to 75% to 79% for the GBT model. The percentage of operon groups with CV smaller than permuted CV was 79% to 89% for dataset 2 by the neural network model in comparison to 75% to 82% by the GBT model. For dataset 1, the percentage of groups with percentile scores less than 0.2 is 43% to 57% in the neural network model in comparison to 36% to 50% in the GBT model. For dataset 2, the percentage of groups with percentile scores less than 0.2 is 57% to 61% in the neural network model in comparison to 32% to 54% in the GBT model. The

Analysis of Factors Ascribed to Missing Protein Measurements.
Through the neural network model, the majority of the proteins that were experimentally undetected were predicted to have protein abundance with greater than 1.0 peptide hit. This thus tempted for us to look into the possible factors which may be ascribed to the missing protein values.
To do so, we collected some of the protein or mRNA/DNA sequence-based features which may affect protein detectability for use along with experimental transcriptomic data in a logistic regression procedure to determine their relative contribution to detectability of proteins in the D. vulagris genome. The features/measurements we used were from the following several categories: (i) features related to gene expressivity: experimental mRNA abundance data, N c and CAI values [33,34], codon usage such as CR axis1-CR axis4, T3s, C3s, A3s, G3s, GC3s, and GC content [35]; (ii) features related to protein and RNA stabilities: protein instability index and AliphaticIndex, and minimal free energy (MFE) of RNA [35]; (iii) features related to protein solubility: protein solubility indices such as Grand Average of Hydropathy (GRAVY) [33]; (iv) amino acid usage such as AA axis1-AA axis4, and others such as gene length [35]. The logistic regression procedure in the SAS software package was used to model the probability of whether a protein will be experimentally detected given the features and measurements for the gene/protein. The stepwise selection method was used to select those most influential factors for indicating missing proteomic value. Those highly correlated features were removed automatically. The 0.05 significance threshold was used for the stepwise selection. The mRNA abundance  and gene length values were both log2 transformed before being used in the procedure. The protein stability index, aliphaticIndex, and N c values were divided by 10 and the CAI, N MFE, GRAVY, AA axis1-AA axis4, and CR axis1-CR axis4 were all multiplied by 10 so that they scaled to 1. This standardization will not change the results, but make illustration of the results easier. Tables 2 and 3 present the odds ratio of each selected factor for dataset 1 and dataset 2, respectively. The results showed that some of the features are highly correlated with probability of missing proteomic values. For dataset 1, seven factors were indentified consistently by the regression model. Among the features that are positively correlated with probability of missing proteomic values (i.e., when the features/measurement increase), protein instability index which provides an estimate of the stability of a given protein in a test tube [38] was found as the most significant factor for all three growth conditions, and aliphaticIndex which is regarded as a positive factor for the increase of thermostability of globular proteins [39], was found significant for growth conditions LL and FL, but not for condition LS. In addition, AA axis2 and CR axis3 correlated most strongly with GRAVY and G3s, respectively, and they were also found significant in all three growth conditions, while the N MFE was only significant for LS condition (Table 2). For dataset 2, the features that are kept in the regression model are more the same for the three growth conditions in dataset 1, suggesting that these factors could be universal across experimental platforms. Six features that are significant for dataset 1 are also found to be critical for all the conditions and affect the probability in the same direction for dataset 2 include mRNA abundance, protein instability index, gene length, N c , CAI, and AA axis2 corresponding to GRAVY (Table 3). In general, the analysis suggested that those proteins with smaller mRNA expression values, smaller N c value, higher protein instability index, or high value of aliphaticIndex, have a greater probability to have missing proteomic values in the D. vulgaris datasets. Based on this preliminary analysis, it may be possible to apply a logistic regression model to predict whether a given protein will or will not be detected by choosing a proper threshold for the predicted probability.

Data Interpretation Using Imputed Datasets.
A total of 308-354 proteins were experimentally identified for dataset 1, and 986-1001 proteins were experimentally identified for dataset 2, which are slightly over 10-30% of   Protein abundance is in peptide counts; RNA abundance is in raw fluorescence measurements in microarrays. "x" indicated proteins are undetected experimentally. x Flagellar basal-body rod protein x L-lysine exporter  Table 4, even for proteins involved in energy metabolism such as ATP synthase which are typically expressed at relatively high levels, significant missing proteomic data was still obvious in the experimental datasets, only two out of nine putative ATP synthase proteins were experimentally detected under three growth conditions. By analysis of the imputed protein abundance for all putative ATP synthase proteins, a clear trend of downregulation was observed for LS and FL condition when compared with LL condition, which is consistent with the previous data suggesting lactate is a favorable electron donor for D. vulgaris [22]. When comparing lactate-based growth with formate-based at the exponential phase, we found that enzymes involved for formate utilization were predicted to be upregulated during growth on formate (Table 4). It has been reported that c-type cytochromes are highly expressed in sulphate-reducing bacteria [20], however, only low amounts of c-type cytochrome proteins have been identified from D. vulgaris by the proteomic (Table 4), probably due to the fact that c-type cytochromes undergo a complex posttranslational maturation process involving covalent attachment of heme groups [40]. Through the ANN model, significant expression of various cytochrome c proteins was predicted in D. vulgaris grown under three conditions. In general, cytochrome c proteins were predicted to have higher expression at the exponential phase in both lactate-and formate-based media than the stationary phase, consistent with their important roles in electron transfer during fast growth [20]. Two cytoplasmic hydrogenases (Ech and Coo) were previously assigned the putative function of generating hydrogen in the cytoplasm [20]. In agreement with previous prediction from the ZIP model [18], the ANN model predicted that all subunits of Ech hydrogenase (EchACDE) were upregulated, while all subunits of Coo hydrogenase (CooHLUX) were down-regulated on formatebased growth, suggesting their differential roles in H 2 generation when grown with different carbon sources. We also used the imputed data to analysis the stress responses when the D. vulgaris cells were treated under salt stress. The analysis allowed identification of some proteins which were not experimentally detected, however, were predicted to be upregulated under the salt treatment condition (ST120) as compared with control (CT120) ( Table 5). Among them were transporters and various multidrug resistance proteins and some signal transduction proteins. This list of proteins can serve as putative candidates for further experimental validation. In general, our analysis indicates that biological interpretation may benefit from imputing missing data using computational methods by integrating temporal transcrip-tomic and proteomic data. The full list of all proteinabundance predictions for all proteins has been provided in the Table 2 of the supplementary material.
In conclusion, missing values in large-scale proteomic analysis is a frequent problem and in some cases it has created difficulty for accurate interpretation of proteomic data for complex biological systems. However, in spite of its obvious importance, the methodologies to deal with missing proteomic data are still lagging well behind the methods for transcriptomic analysis. In this study, we applied an artificial neural network approach to approximate the relationships between cognate transcriptomic and proteomic datasets of D. vulgaris, and to predict protein abundance for the proteins not experimentally detected based on relevant predictors, such as mRNA abundance, cellular role, and triple codon counts. The results showed that the coefficients of determination for the trained neural network models ranged from 0.47 to 0.68, providing better description than several previous models [18,19]. In an attempt to identify possible causes for missing proteomic values, a logistic regression analysis of various experimental measurements and sequence-based features was performed for the experimentally undetected proteins but with high predicted protein abundance values. Although the results are in general consistent with previous knowledge that proteins with low possible expressivity and high instability will have higher chance to be missed, in the study, we for the first time quantified the odds ratio of each of those possible factors in the D. vulgaris datasets. In addition, in a preliminary analysis we demonstrated that biological interpretation could be improved by using the imputed proteomic datasets. Finally, although the model was well validated statistically, caution needs to be exercised when interpreting experimental data based on predicted protein expression values because the predicted abundance values are constrained by the quality of experimental proteomic data used as input, and a dearth of large-scale quantitative predictors which can be included in the model. Nevertheless, the initial success in applying the ANN method is encouraging, and the model could serve as a basis for developing more sophisticated models to further improve the biology interpretation.