Is Subcellular Localization Informative for Modeling Protein-Protein Interaction Signal?

1 Division of Biometrics, The Cancer Institute of New Jersey, 195 Little Albany Street, New Brunswick, NJ 08901, USA 2 Department of Biostatistics, School of Public Health, University of Medicine and Dentistry of New Jersey, 683 Hoes Lane West, Piscataway, NJ 08854, USA 3 Department of Epidemiology and Public Health, Yale University School of Medicine, 60 College Street, New Haven, CT 06520, USA 4 Department of Statistics, West Virginia University, P.O. Box 6330, Morgantown, WV 26506, USA 5 Department of Electronic and Computer Engineering, The Hong Kong University of Sciences and Technology, Clear Water Bay, Kowloon, Hong Kong, China


Two-way PPI count contingency table
We extracted 2712 PPIs from MIPS [1], which were available at http://hto-b.usc.edu/~msms/AssessInteraction/MIPSMatchYPD.txt as of 2005 and were used by Lin and Zhao [2] for a study of PPI network robustness. We use the 1641 PPIs with complete PSL information in Huh et al. [3]. For example, protein A has a 22-dimensional PSL vector $L_A = (L_{A,1}, L_{A,2}, \ldots, L_{A,22})$, where $L_{A,i} = 1$ represents presence of protein A at PSL $i$ and $L_{A,i} = 0$ represents its absence. For proteins A and B, we create the 44-dimensional PSL vector $L_{AB} = (L_A, L_B)$ along with its exchanged counterpart $L_{BA} = (L_B, L_A)$ for naive balance. Since a log-linear model with a large number ($2^{44}$) of cross-classified cells may lack power when the total PPI count is relatively small (<10 000), we instead explore an alternative two-way ($22^2$) contingency table whose rows (compartments $i = 1, \ldots, 22$) and columns (compartments $j = 1, \ldots, 22$) jointly assign each PPI to cell $(i, j)$, with one protein in compartment $i$ and the other in compartment $j$ (Figure 1). Note that one PPI may be counted in several cells because a protein can occupy multiple PSLs. Cytoplasm and nucleus likely play crucial roles, since these two compartments hold most PPI entries and the other compartment pairs have far fewer entries. A negative binomial model, which accommodates overdispersion, shows that ER to Golgi, lipid particle, and nucleus may be significant effects for this two-way contingency table.
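As a minimal sketch of how such a two-way count table could be tabulated, the following Python fragment adds each PPI to every compartment pair that its two proteins occupy. The function name `two_way_ppi_table`, the toy proteins, and the three-compartment layout are hypothetical. Counting the exchanged pair (B, A) alongside (A, B), which makes the table symmetric and counts shared compartments twice on the diagonal, is an assumption and not necessarily the convention behind Figure 1.

```python
def two_way_ppi_table(psl, ppis):
    """Count PPIs per compartment pair (i, j).

    psl  : dict mapping protein name -> list of 0/1 PSL indicators
    ppis : iterable of (protein_a, protein_b) pairs
    A PPI contributes to every cell (i, j) with A in i and B in j, and
    (assumption) also to the exchanged assignment with B in i and A in j.
    """
    n_comp = len(next(iter(psl.values())))
    table = [[0] * n_comp for _ in range(n_comp)]
    for a, b in ppis:
        la, lb = psl[a], psl[b]
        for i in range(n_comp):
            for j in range(n_comp):
                table[i][j] += la[i] * lb[j] + lb[i] * la[j]
    return table

# Toy data with 3 compartments (hypothetical, not taken from MIPS or Huh et al.)
toy_psl = {
    "P1": [1, 0, 1],  # occupies compartments 0 and 2
    "P2": [0, 1, 0],
    "P3": [1, 0, 0],
}
toy_ppis = [("P1", "P2"), ("P1", "P3")]
toy_table = two_way_ppi_table(toy_psl, toy_ppis)
for row in toy_table:
    print(row)
```

Because P1 occupies two compartments, each of its interactions lands in more than one cell, which is the redundant counting noted above.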

Retrospective logistic regression
We propose a realistic model for quantifying PPI tendency from the fused PSLs of proteins A and B (with exchanging). Each protein pair contributes the binary PPI indicator $I_{AB}$ ($= I_{BA}$) as the response and the joined PSL vectors $L_{AB}$ and $L_{BA}$ as predictors. In the proposed logistic regression model, $\beta_0$ represents the default PPI probability, $\beta_i$ the PPI tendency of a single protein with PSL $i$, $\beta_{ij}$ the PPI tendency of a single protein with PSLs $i$ and $j$, and $\beta'_{ij}$ the PPI tendency of two proteins with PSLs $i$ and $j$, where $i = j$ describes the PPI tendency of two proteins with common PSL $i$. The number of model parameters is $1 + 2\binom{22}{1} + 2\binom{22}{2} = 507$. For efficiency we consider the reduced model $\beta_0 + \sum_{i \le j} \beta'_{ij} (L_{A,i} L_{B,j} + L_{B,i} L_{A,j})$, which incorporates second-order PSL effects between two proteins.
The yeast interactome and proteome are inherent libraries and are not subject to arbitrary experimental design, which indicates a retrospective (case-control) study. In addition, we have about $18 \times 10^6$ total protein pairs and only about $2 \times 10^3$ PPIs in our data. To overcome computer memory limitations and achieve reasonable sample sizes for both the case (PPI) and control (non-PPI) groups, we need to select a sample subset with statistical justification. For a logistic model with responses $y_i$ and predictors $x_i$, let $Z_i$ indicate whether subject $i$ is selected and assume that $\rho_1 = \Pr(Z_i = 1 \mid y_i = 1)$ and $\rho_0 = \Pr(Z_i = 1 \mid y_i = 0)$ are both free of $x_i$. If the logistic model based on all subjects has $\operatorname{logit}(\Pr(y_i = 1 \mid x_i)) = \alpha + \beta x_i$, then the retrospective logistic regression (RLR) after selection probability adjustment is $\operatorname{logit}(\Pr(y_i = 1 \mid x_i, Z_i = 1)) = \alpha + \log(\rho_1/\rho_0) + \beta x_i$ (Chapter 4.3.3, McCullagh and Nelder [4]).
We apply case selection probability 1 and control selection probability $2 \times 10^{-3}$ (3282 PPIs, 38 338 entries, and 254 parameters) and identify around 60 significant effects. The resulting prospective PPI probabilities are then adjusted according to the foregoing theory.
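The two ingredients of this section can be sketched in a few lines of Python: the between-protein second-order terms of the reduced model, and the removal of the $\log(\rho_1/\rho_0)$ offset to recover a prospective probability. This is only an illustration under stated assumptions. The function names `between_protein_features` and `prospective_probability` are hypothetical, the example PSL vectors are invented, and the fitted retrospective probability of 0.5 is an arbitrary input rather than a result from our data.

```python
import math
from itertools import combinations_with_replacement

def between_protein_features(la, lb):
    """Terms L_A,i * L_B,j + L_B,i * L_A,j for all compartment pairs i <= j."""
    n = len(la)
    return [la[i] * lb[j] + lb[i] * la[j]
            for i, j in combinations_with_replacement(range(n), 2)]

def prospective_probability(p_retro, rho1, rho0):
    """Convert a probability fitted on the selected (case-control) sample
    to the full-population scale by subtracting log(rho1 / rho0) on the
    logit scale."""
    logit_retro = math.log(p_retro / (1.0 - p_retro))
    logit_prosp = logit_retro - math.log(rho1 / rho0)
    return 1.0 / (1.0 + math.exp(-logit_prosp))

# Invented PSL vectors: protein A in compartment 0, protein B in compartment 1.
la = [1] + [0] * 21
lb = [0, 1] + [0] * 20
features = between_protein_features(la, lb)
n_features = len(features)  # 22 * 23 / 2 terms; one intercept is added on top

# Selection probabilities from the text: all cases, 2e-3 of the controls.
p_adjusted = prospective_probability(0.5, rho1=1.0, rho0=2e-3)
print(n_features, p_adjusted)
```

The features are symmetric in A and B by construction, so the exchanged pair yields the same predictor vector, and the feature count plus the intercept matches the 254 parameters of the reduced model.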

PPI prediction from PSL
After fitting the preceding model, we classify protein pairs by a simple threshold rule with threshold $\tau$. We randomly divide the whole dataset for the retrospective study into 10 disjoint portions. Each portion (which includes PPIs and non-PPIs in proportion) acts in turn as the testing set, and the other nine portions are combined into the training set for 10-fold cross-validation. We classify each protein-protein pair in the testing set as PPI or non-PPI by comparing its calculated probability (from the trained model parameters) with the threshold $\tau$. We find that the median PPI probability of the non-PPI subset in the training set is always equal to that of the PPI subset in the training set, and that the median PPI probability ($1.88 \times 10^{-4}$) of the PPI subset also equals that of the non-PPI subset for the whole dataset in the retrospective study. For the retrospective study with the median PPI probability as threshold, we have specificity around 98% and sensitivity around 15%; RandomForest (Breiman [5]) in R reaches specificity around 99% and sensitivity around 20%, and a support vector machine (SVM) in R reaches specificity around 50% and sensitivity around 90%.
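To make the threshold rule concrete, the sketch below computes sensitivity and specificity for a given $\tau$ in Python. The function name `sensitivity_specificity`, the `strict` switch, and the toy probabilities are hypothetical. Whether a probability exactly equal to $\tau$ counts as PPI is treated as an assumption, exposed through `strict`, because many pairs share the median value when the two subsets have equal medians.

```python
from statistics import median

def sensitivity_specificity(probs, labels, tau, strict=True):
    """Threshold classification with labels 1 = PPI and 0 = non-PPI.

    strict=True  predicts PPI when prob >  tau
    strict=False predicts PPI when prob >= tau
    Returns (sensitivity, specificity).
    """
    tp = fn = tn = fp = 0
    for p, y in zip(probs, labels):
        pred = (p > tau) if strict else (p >= tau)
        if y == 1:
            tp += pred
            fn += not pred
        else:
            fp += pred
            tn += not pred
    return tp / (tp + fn), tn / (tn + fp)

# Toy fitted probabilities (invented): both subsets share the median 0.2.
probs = [0.2, 0.2, 0.2, 0.9, 0.2, 0.2, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
tau = median(p for p, y in zip(probs, labels) if y == 1)

strict_result = sensitivity_specificity(probs, labels, tau, strict=True)
loose_result = sensitivity_specificity(probs, labels, tau, strict=False)
print(strict_result, loose_result)
```

In this toy example the strict rule is highly specific but insensitive, while the non-strict rule reverses the trade-off, simply because of ties at the threshold.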
The PPI probabilities from the retrospective study dataset and from 10-fold cross-validation are plotted in Figure 3. The logistic model-based classification results are found to be sensitive to the threshold. If we use "$\Pr(\mathrm{PPI} \mid L_{AB} \text{ or } L_{BA}) \ge \tau \Rightarrow$ PPI" and "$\Pr(\mathrm{PPI} \mid L_{AB} \text{ or } L_{BA}) < \tau \Rightarrow$ non-PPI", where $\tau$ equals the median PPI probability, then we obtain very different classification results. After prospective PPI probability adjustment, the threshold-based classification (4) is applied to the complete PPI and PSL data ([Sets 1, 2, 3, 4], Section 1.2), and the resulting ROC curve is given in Figure 4, with area under the curve (AUC) less than 0.5. Since this classifier would have to be inverted to make the AUC greater than 0.5, Figure 4 indicates that the proposed logistic regression model ((3) in Section 1.3) may not be adequate even if this model is carefully chosen. We also observe the following: the selection procedure in the retrospective study may involve some bias; the joined PSL patterns (from two proteins) are finite, with uncertain overlap between the PPI set and the non-PPI set; and false positives and false negatives may exist in both the PPI and PSL data, among other issues. From a statistical point of view, the interprotein PSL pattern may not independently determine PPI tendency, and a threshold-based PPI prediction rule may not discriminate PPI from non-PPI either. The former conclusion is also a major concern of biologists, who consider the PPI mechanism to go far beyond PSL information alone.
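The remark on inverting a classifier with AUC below 0.5 can be checked numerically. The sketch below, in Python, computes the AUC as the probability that a randomly chosen PPI scores higher than a randomly chosen non-PPI (ties counted as one half) and shows that negating the scores gives one minus the original AUC. The function name `auc` and the toy scores are hypothetical and unrelated to Figure 4.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 = PPI (positive), 0 = non-PPI (negative).
    Each positive/negative pair scores 1 if the positive ranks higher,
    0.5 on a tie, and 0 otherwise.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy scores (invented): the PPIs tend to receive the lower scores.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [1, 1, 0, 0]

auc_original = auc(scores, labels)
auc_inverted = auc([-s for s in scores], labels)
print(auc_original, auc_inverted)
```

Inversion only relabels the ranking; it does not add information, which is why an AUC below 0.5 still reflects on the adequacy of the underlying model.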

DISCUSSION
In this article, we proposed a statistical analysis of the association between PPI and PSL, with the possibility of offering clues for further specific biological experiments. The aforementioned model is only one of many possible approaches. It is likely that a totally different approach based on PSL information may lead to disparate results. As an alternative, if we could describe the distribution of the 44-dimensional joined binary PSL vectors given PPI or non-PPI, $\Pr(L_{AB} \mid \mathrm{PPI})$ and $\Pr(L_{AB} \mid \text{non-PPI})$, then, armed with some prior PPI probability, say $\Pr(\mathrm{PPI}) = 3 : (1.8 \times 10^4)$, we could predict the PPI probability for the joined PSL pattern $L_{AB}$ by Bayes' rule, $\Pr(\mathrm{PPI} \mid L_{AB}) = \Pr(L_{AB} \mid \mathrm{PPI}) \Pr(\mathrm{PPI}) / Q$, where $Q = \Pr(L_{AB} \mid \mathrm{PPI}) \Pr(\mathrm{PPI}) + \Pr(L_{AB} \mid \text{non-PPI}) \Pr(\text{non-PPI})$. Section 1.2 is essentially an attempt to work on either the PPI or the non-PPI set to study PSL patterns without considering the counterpart set, which may be only a matter of exploring $\Pr(L_{AB} \mid \mathrm{PPI})$ or $\Pr(L_{AB} \mid \text{non-PPI})$ separately. However, the explicit probability of a high-dimensional binary vector is difficult to construct. Empirical approaches (Sections 1.1 and 1.2, Huh et al. [3]) offer informative results from different perspectives. On the other hand, Liu et al. [6] modeled PPI based on domain-domain interaction information, and computational PSL predictions from other sources are also feasible; readers are referred to Lu et al. [7], Szafron et al. [8], Höglund et al. [9], Horton et al. [10], Guda [11], Yu et al. [12], and Zhang et al. [13,14], among many others.
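The Bayes-rule alternative discussed above can be illustrated with a short Python sketch. The function name `posterior_ppi` and the two likelihood values are hypothetical; in practice the class-conditional probabilities of a 44-dimensional binary pattern are exactly what is difficult to construct. Reading the prior $3 : (1.8 \times 10^4)$ as the probability $3 / (1.8 \times 10^4)$ is also an assumption.

```python
def posterior_ppi(lik_ppi, lik_non, prior_ppi):
    """Posterior Pr(PPI | pattern) from class-conditional likelihoods.

    lik_ppi   : Pr(pattern | PPI)
    lik_non   : Pr(pattern | non-PPI)
    prior_ppi : Pr(PPI)
    """
    q = lik_ppi * prior_ppi + lik_non * (1.0 - prior_ppi)
    return lik_ppi * prior_ppi / q

# Prior read as a probability (assumption).
prior = 3 / 1.8e4

# Invented likelihoods: the pattern is 20 times more common among PPIs.
post = posterior_ppi(lik_ppi=0.02, lik_non=0.001, prior_ppi=prior)
print(prior, post)
```

Even a pattern that is twenty times more frequent among PPIs yields a small posterior under such a low prior, which illustrates why threshold choice matters so much on the prospective scale.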

Figure 3: (Top panel) PPI probabilities from the retrospective study. (Lower panel) PPI probabilities from 10-fold cross-validation (after prospective adjustment); each pair of consecutive boxplots corresponds to one testing set, with the non-PPI subset first and the PPI subset second.