Prediction of Cancer Proteins by Integrating Protein Interaction, Domain Frequency, and Domain Interaction Data Using Machine Learning Algorithms

Many proteins are known to be associated with cancer, yet their precise functional roles in disease pathogenesis often remain unclear. One strategy for gaining a better understanding of the function of these proteins is to combine different types of proteomics data. In this study, we extended Aragues's method by employing protein-protein interaction (PPI) data, domain-domain interaction (DDI) data, weighted domain frequency scores (DFS), and cancer linker degree (CLD) data to predict cancer proteins. Performance was benchmarked in three kinds of experiments: (I) using individual algorithms, (II) combining algorithms, and (III) combining algorithms of the same classification type. Compared with Aragues's method, our proposed methods, that is, machine learning algorithms and voting with the majority, are significantly superior in all seven performance measures. We demonstrated the accuracy of the proposed method on two independent datasets. The best algorithm achieves a hit ratio of 89.4% for the lung cancer dataset and 72.8% for the lung cancer microarray study. It is anticipated that the current research could help in understanding disease mechanisms and diagnosis.


Introduction
It has long been known that cancer results from the loss of cell cycle control. This loss of control results from a series of genetic mutations involving the activation of proto-oncogenes to oncogenes and the inactivation of tumor suppressor genes. Oncogenes and tumor suppressors may cause cancer by altering transcription factors, such as the p53 and ras oncoproteins, which in turn control the expression of other genes. Therefore, understanding how oncoproteins interact with one another and how they drive the cell division cycle is indispensable for the study of molecular oncology. Predicting novel cancer-related proteins is an important topic in biomedical research; experimental techniques such as microarrays are being used to characterize cancer. However, such experiments can be time-consuming and labor-intensive. Nagaraj and Reverter [1] proposed a Boolean logic based approach to predict colorectal cancer genes. Li et al. [2] took GO enrichment scores and KEGG enrichment scores as features to predict retinoblastoma-related genes. These two studies are confined to predicting specific cancers. For general types of cancer, Hosur et al. [3] used a linear programming formulation for interface alignment to predict cancer-related PPIs. Aragues et al. [4] used PPI data to predict cancer-related proteins. In this study, we extended Aragues's study by employing PPI data and domain information to attain improved performance.
Protein-protein interactions are inherent in almost every cellular process. In fact, PPIs are at the core of the entire interactomics system in living cells. A PPI occurs when two or more proteins bind together and perform a biological function [5]. Almost all major research topics in molecular biology involve PPIs, such as cellular function [6], genetic diseases [7], conserved patterns [8], and homologous relationships [9]. The recent availability of PPI data has made it possible to study human disease at a system level. It has been reported that, since disease genes exhibit an increased tendency for their protein products to interact with one another, they tend to be coexpressed in specific tissues and display coherent functions [10]. In a review article [11], Ideker and Sharan described applications of PPI networks to the study of disease in four major areas: (i) identifying new disease genes, (ii) studying their network properties, (iii) identifying disease-related subnetworks, and (iv) performing network-based disease classification. Another study [12] investigated the human cancer PPI network from a structural perspective, that is, protein interactions through their interfaces. Its findings indicated that cancer-related proteins have smaller, more planar, more charged, and less hydrophobic binding sites than noncancer proteins.
It is known that proteins are composed of multiple functional domains. A domain is a unit of function associated with different catalytic functions or binding sites, as found in enzymes or regulatory proteins. It is hypothesized that cancer proteins, also known as tumor associated genes, may share common functional domains [13], and thus a weighted domain score for each tumor associated gene's domain is determined. Novel cancer proteins are determined by translating full cDNA sequences to the corresponding protein sequences and calculating the weighted domain scores. Another work [14] used established methods to identify the network topology of a cancer protein network. They showed that cancer proteins contain a high ratio of structural domains, which have a high propensity for mediating protein interactions. Recently, Clancy et al. designed a statistical method to infer the physical interactions between two complexes for the human and yeast species [15]. Domains such as the immunoglobulin domain, Zinc-finger, and the protein kinase domains are the top three most frequently observed cancer protein domains. Many other works also employed PPI and DDI to characterize disease networks [7, 16-20]. In our previous works [21, 22], a one-to-one DDI model was proposed to obtain specific sets of DDI for oncoproteins and tumor suppressor proteins, respectively. Three specific sets of DDI, that is, oncoprotein and oncoprotein, tumor suppressor protein and tumor suppressor protein, and oncoprotein and tumor suppressor protein, are derived from their PPIs.
Weka (http://www.cs.waikato.ac.nz/~ml/weka/) [23] is a well-known software tool that provides environments for machine learning, data mining, text mining, predictive analysis, and business analysis. Machine learning and data mining algorithms have been widely used in bioinformatics and computational biology [24-28]. The present authors adopted amino acid composition profile information with the SVM classifier to improve the classification of protein complexes [29]. Additionally, we proposed identifying microRNA targets in Arabidopsis thaliana by integrating prediction scores from the PITA, miRanda, and RNAHybrid algorithms [30]. Recently, Li et al. [31] used the random forest machine learning algorithm and topology features to identify the functions of protein complexes.
In this research, we began by collecting cancerous protein interaction data. That is, only interactions in which both proteins are cancer proteins are considered. These interactions are known as cancerous PPIs. A noncancerous protein interaction is one in which either one or both of the proteins are not yet identified in relation to cancer. Given the cancerous PPIs, a set of DDI rules for cancer proteins is derived. In addition to this set of DDI, we also considered other features: the weighted domain frequency scores (DFS), namely, DFS C for cancer proteins and DFS X for noncancer proteins, and the cancer linker degree (CLD) score. A total of four features (DDI, DFS C, DFS X, and CLD) are adopted to make novel cancer protein predictions by using 39 machine learning algorithms from the Weka tool. In addition, we verified the accuracy of the predictive model on two independent datasets. Finally, using differentially expressed genes found in lung cancer microarray data as a case study, we discovered some potential cancer genes for further experimental investigation.

System Flowchart.
In this study, a system was set up to predict cancer proteins by integrating four types of features. First, cancer and noncancer PPIs were collected from biological databases, and those interactions were then annotated with domain information. Next, we determined four feature scores; data normalization is needed to ensure consistency in their distributions. The system flowchart of this study is illustrated in Figure 1. Cancer proteins were collected from the Memorial Sloan-Kettering Cancer Center (MSKCC, http://cbio.mskcc.org/research/cancergenomics/index.html). Noncancer proteins were from BioGrid (http://thebiogrid.org/) [32]. On the other hand, cancer proteins of the independent test dataset for Case Study 1 were obtained from two resources, that is, Online Mendelian Inheritance in Man (OMIM, http://www.ncbi.nlm.nih.gov/omim) and the human lung cancer database (HLungDB) [33].

Data Sources and Datasets
Protein domain information was downloaded from PFam database (http://pfam.xfam.org/) [34]. PPIs for both cancer and noncancer proteins were retrieved from BioGrid database. The Swiss ID of proteins was obtained from Swiss-Prot Database (http://www.expasy.org/sprot/).
Three classes of protein interaction are adopted as input data in the current study: "C-C" indicates a cancer-cancer protein interaction, "C-X" indicates a cancer-noncancer protein interaction, and "X-X" indicates a noncancer-noncancer protein interaction. Figure 2 depicts the input data sources and dataset generation of this research.

Feature Scores Generation.
We derived four feature scores based on three different approaches: domain-domain interaction, weighted domain frequency, and cancer linker degree. For the domain-domain interactions, some relationships may be derived from homologous sequences, which may bias the 10-fold cross validation; therefore, redundancies (by a certain homology degree that includes domains) should be removed. By extracting the homologous PPIs using the "UniRef50" dataset obtained from the UniProt Reference Clusters (UniRef, http://www.uniprot.org/uniref/), we found that, among the 123751 edges derived from the BioGrid database, only 431 edges (approximately 0.348%) are homologous PPIs, of which 86 edges are cancer PPIs; that is, most of the PPIs are not homologous PPIs. We removed the 431 PPIs, and the remaining 123320 PPIs are used in our experiment. In addition, the 431 homologous PPIs comprise 180 proteins, of which 179 still appear in other nonhomologous PPIs; therefore, the total number of domains, that is, 3970, remains unchanged.

One-to-One Domain Interaction Model.
Assume that proteins $P_i$ and $P_j$ contain $M$ and $N$ domains, respectively. Given an interacting protein pair $(P_i, P_j)$, there are $M \times N$ possible domain pairs. The set of domain pairs of the two proteins $P_i$ and $P_j$, $D_{i,j}$, is defined by
$$D_{i,j} = D(P_i) \times D(P_j),$$
where $D(P_i)$ and $D(P_j)$ denote the sets of protein domains in proteins $P_i$ and $P_j$, respectively, and $\times$ denotes the Cartesian product of the two sets $D(P_i)$ and $D(P_j)$.
To measure the likelihood of a DDI combination, a DDI pair interaction matrix $W$ is introduced. The element $W_{m,n}$ denotes the weighted combination probability of a domain pair $(d_m, d_n)$ for a given protein pair $(P_i, P_j)$, and it is given by
$$W_{m,n} = \sum_{(P_i, P_j)} \frac{1}{|D(P_i)| * |D(P_j)|},$$
where $|D(P_i)|$ and $|D(P_j)|$ denote the set sizes of $D(P_i)$ and $D(P_j)$, respectively, and $*$ is the multiplication operation; the summation is over all protein pairs $(P_i, P_j)$ such that $d_m$ and $d_n$ are elements of $D(P_i)$ and $D(P_j)$, respectively. Subsequently, protein domains are randomized while keeping the number of domain assignments for each protein the same as in the original set. The randomized counterpart of $W_{m,n}$, $\langle W^{\mathrm{rand}}_{m,n} \rangle$, is computed in order to justify the protein domain pair calculation. Then, the domain pair score of the domain pair $(d_m, d_n)$, $S_{m,n}$, is defined by
$$S_{m,n} = \frac{W_{m,n}}{\langle W^{\mathrm{rand}}_{m,n} \rangle},$$
where $\langle W^{\mathrm{rand}}_{m,n} \rangle$ denotes the ensemble average of the randomized counterpart of $W_{m,n}$ (we randomized the data 20 and 40 times, and it was found that the results converge after 40 times). This result provides a criterion for ranking the domain pairs. If the ratio $S_{m,n}$ is larger than one, then the correlation is stronger than in the randomized counterpart, so the domain pair $(d_m, d_n)$ is a preferred DDI relation.
Given the set of domain annotations for any two proteins, one can in turn compute a score that signifies a PPI based on the set of $S_{m,n}$ values for DDI. This derived PPI score addresses the question of whether any two proteins interact, given their domain components. The DDI score for the protein pair $(P_i, P_j)$, $\mathrm{DDI}_{i,j}$, is computed from the $S_{m,n}$ values of the domain pairs $(d_m, d_n)$ with $d_m \in D(P_i)$ and $d_n \in D(P_j)$, where $D(P_i)$ and $D(P_j)$ denote the sets of domains in proteins $P_i$ and $P_j$, respectively.
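The domain pair weight and its randomized ratio can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: the weight of a domain pair is taken as the sum of 1 / (|D(P_i)| * |D(P_j)|) over the PPIs that contain it, domain pairs are treated as unordered, and the randomization draws the same number of distinct domains for each protein from the pool of all domain types. The function names `domain_pair_weights`, `shuffle_domains`, and `domain_pair_scores` are hypothetical.

```python
import random
from collections import defaultdict
from itertools import product


def domain_pair_weights(ppis, domains):
    """Sum 1 / (|D(P_i)| * |D(P_j)|) over every PPI that contains the domain pair."""
    weights = defaultdict(float)
    for p_i, p_j in ppis:
        d_i, d_j = domains.get(p_i, set()), domains.get(p_j, set())
        if not d_i or not d_j:
            continue  # proteins without domain annotation contribute nothing
        share = 1.0 / (len(d_i) * len(d_j))
        for d_m, d_n in product(d_i, d_j):
            # Assumption: domain pairs are unordered, so (a, b) equals (b, a).
            weights[tuple(sorted((d_m, d_n)))] += share
    return dict(weights)


def shuffle_domains(domains, rng):
    """Reassign domains at random, keeping each protein's number of domains."""
    pool = sorted({d for ds in domains.values() for d in ds})
    return {p: set(rng.sample(pool, len(ds))) for p, ds in domains.items()}


def domain_pair_scores(ppis, domains, n_rand=40, seed=0):
    """Ratio of the observed weight to the mean weight over n_rand randomizations."""
    rng = random.Random(seed)
    observed = domain_pair_weights(ppis, domains)
    rand_total = defaultdict(float)
    for _ in range(n_rand):
        shuffled = domain_pair_weights(ppis, shuffle_domains(domains, rng))
        for pair, w in shuffled.items():
            rand_total[pair] += w
    # Assumption: pairs never produced by the randomization are skipped,
    # because their ratio is undefined.
    return {pair: w / (rand_total[pair] / n_rand)
            for pair, w in observed.items() if rand_total[pair] > 0}
```

Domain pairs with a score above one would be read as preferred DDI relations. How the per-pair scores are aggregated into a single score for a protein pair is left out of the sketch.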

Weighted Domain Frequency Score (DFS).
The two feature scores (DFS C and DFS X) are defined in this section as variations of the scores used in the study by Chan [35]. Among the 3970 collected human domain types, 381 appear only in cancer proteins, 2750 appear only in noncancer proteins, and 839 appear in both cancer proteins and noncancer proteins. This result supports the notion that certain domain types tend to reside in either cancer or noncancer proteins. Let $C = (c_1, c_2, \ldots, c_p)$ represent the set of cancer proteins, and let $D^C = (d^C_1, d^C_2, \ldots, d^C_{N_C})$ be the set of domain types that appear in the cancer proteins; similarly, let $X = (x_1, x_2, \ldots, x_q)$ denote the set of noncancer proteins, and let $D^X = (d^X_1, d^X_2, \ldots, d^X_{N_X})$ be the set of domain types that appear in noncancer proteins. For each domain $d$, let $f_C(d)$ and $f_X(d)$ denote the numbers of occurrences of domain $d$ in cancer proteins and noncancer proteins, respectively. A higher value suggests that the domain has a high propensity to reside in cancer or noncancer proteins. Then, two weighted DFS values for the protein pair $(P_i, P_j)$, $\mathrm{DFS}^C_{i,j}$ and $\mathrm{DFS}^X_{i,j}$, are defined from these occurrence counts, where $N_C$ and $N_X$ are the total numbers of domain types that appear in cancer and noncancer proteins, respectively, and $D(P_i)$ and $D(P_j)$ denote the sets of protein domains in proteins $P_i$ and $P_j$, respectively. The weighted DFS is adopted to measure the propensity of domain occurrence in cancer and noncancer proteins.
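The split of domain types into cancer-only, noncancer-only, and shared groups, and the per-domain occurrence counts, can be sketched with plain set operations. This is only an illustration: the helper names are hypothetical, the integer IDs are synthetic and chosen solely to reproduce the reported counts, and an occurrence is assumed to mean the number of proteins carrying the domain. The weighting that turns counts into DFS values is not shown.

```python
from collections import Counter


def partition_domain_types(cancer_types, noncancer_types):
    """Return (cancer-only, noncancer-only, shared) domain type sets."""
    c, x = set(cancer_types), set(noncancer_types)
    return c - x, x - c, c & x


def domain_occurrences(proteins, domains):
    """Count, for each domain type, how many of the given proteins carry it."""
    counts = Counter()
    for protein in proteins:
        counts.update(set(domains.get(protein, ())))
    return counts


# Synthetic integer IDs chosen only to reproduce the reported counts.
cancer_types = range(0, 381 + 839)      # 1220 types seen in cancer proteins
noncancer_types = range(381, 3970)      # 3589 types seen in noncancer proteins
only_c, only_x, shared = partition_domain_types(cancer_types, noncancer_types)
print(len(only_c), len(only_x), len(shared))  # prints 381 2750 839
```

The three group sizes sum to 3970, the total number of domain types stated in the text.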

Cancer Linker Degree (CLD).
The last feature is the cancer linker degree (CLD) score, which was adopted from the model proposed by Aragues et al. [4]. In organisms, proteins interact with each other to form protein complexes in order to perform special functions. We can conjecture the category of function and the level of activity by observing their interaction partners. For a given protein pair $(P_i, P_j)$, let $n_C(i,j)$ and $n_X(i,j)$ denote the numbers of adjacent cancer proteins and noncancer proteins in the PPI network, respectively. Then, the cancer linker degree score for the protein pair $(P_i, P_j)$, $\mathrm{CLD}_{i,j}$, is defined by
$$\mathrm{CLD}_{i,j} = \frac{n_C(i,j)}{n_C(i,j) + n_X(i,j)}.$$
As an illustration, an example is presented in Figure 3, in which two of the proteins are cancer proteins and the other four are noncancer proteins. The CLD score represents the fraction of interaction partners of a specified PPI that are cancer proteins. If the CLD score is close to one, the interaction edge connects many cancer nodes and could be located in the core of the cancer-related protein clusters.
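A minimal sketch of the CLD computation for one interaction edge follows. It assumes that the "adjacent" proteins of a pair are the union of both proteins' interaction partners, excluding the pair itself, and that an edge with no other partners scores zero; both choices are assumptions, and the names `build_neighbors` and `cancer_linker_degree` are hypothetical.

```python
from collections import defaultdict


def build_neighbors(ppis):
    """Build an undirected adjacency map from a list of (protein, protein) edges."""
    neighbors = defaultdict(set)
    for a, b in ppis:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


def cancer_linker_degree(p_i, p_j, neighbors, cancer_proteins):
    """Fraction of the pair's interaction partners that are cancer proteins."""
    # Assumption: partners are the union of both neighborhoods minus the pair.
    partners = (neighbors.get(p_i, set()) | neighbors.get(p_j, set())) - {p_i, p_j}
    if not partners:
        return 0.0  # assumption: an isolated edge gets a score of zero
    n_c = len(partners & set(cancer_proteins))
    return n_c / len(partners)


edges = [("A", "B"), ("A", "C1"), ("A", "X1"),
         ("B", "C2"), ("B", "X2"), ("B", "X3")]
cld = cancer_linker_degree("A", "B", build_neighbors(edges), {"C1", "C2"})
```

In this toy network the edge A-B has five partners, two of which are cancer proteins, so its score is 2/5.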

Data Normalization.
The four feature scores consist of the DDI score ($\mathrm{DDI}_{i,j}$), the weighted domain frequency scores ($\mathrm{DFS}^C_{i,j}$ and $\mathrm{DFS}^X_{i,j}$), and the cancer linker degree score ($\mathrm{CLD}_{i,j}$). Because the distributions of the numerical values of these four features are not consistent with one another, data normalization is needed. The value after normalization, $F'_{k,i,j}$, is defined by
$$F'_{k,i,j} = \frac{F_{k,i,j} - \min_k}{\max_k - \min_k},$$
where $F_{k,i,j}$ denotes the unnormalized value of feature $k$ for the protein pair $(P_i, P_j)$, and $\max_k$ and $\min_k$ are the maximum and minimum values of $F_{k,i,j}$, respectively.
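The normalization described here is ordinary min-max scaling, applied to one feature column at a time. A short sketch, with a hypothetical function name and an assumed convention of mapping a constant column to zeros:

```python
def min_max_normalize(values):
    """Rescale one feature column to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


normalized = min_max_normalize([2, 4, 6])  # [0.0, 0.5, 1.0]
```

Each of the four feature columns would be scaled independently, so that all of them share the same range before being passed to the classifiers.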

Machine Learning Algorithms and Performance Statistical Measures.
Since different choices of machine learning algorithms result in different prediction performance, we conducted several comprehensive experiments to determine the optimal combinations of the algorithms. Thirty-nine machine learning algorithms in Weka are evaluated in this study. Readers may refer to [23] for detailed descriptions of these algorithms. According to Weka, machine learning algorithms are divided into six classification types, that is, "Bayes" (6 algorithms), "functions" (6 algorithms), "Misc" (2 algorithms), "lazy" (4 algorithms), "rules" (9 algorithms), and "trees" (12 algorithms). A rigorous 10-fold cross validation test is performed to assess the classification performance. Six statistical measures are introduced to quantify the prediction performance, that is, accuracy (ACC), specificity (SPE), sensitivity (SEN), F-score (F1), Matthews correlation coefficient (MCC), and positive predictive value (PPV), which are defined in terms of TP, TN, FP, and FN, denoting true positive, true negative, false positive, and false negative events, respectively. Their definitions are listed in (9). SEN and SPE measure how well a true cancer protein and a true noncancer protein are identified, respectively. F1 conveys the balance between SPE and SEN. ACC and MCC provide an integrative measure of correct identification. PPV is the positive predictive fraction. In addition, the AUC (area under the curve) score, which provides a global performance evaluation, is also included.
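For reference, the six measures can be computed from a confusion matrix as in the sketch below. It uses the conventional textbook definitions, which are an assumption here: in particular, F1 is taken as the harmonic mean of PPV and SEN, and the paper's own definitions in (9) may differ in detail. The function name `classification_measures` is hypothetical.

```python
import math


def classification_measures(tp, tn, fp, fn):
    """Conventional confusion-matrix measures; returns a dict of floats."""
    def ratio(num, den):
        return num / den if den else 0.0  # guard against empty classes

    sen = ratio(tp, tp + fn)          # sensitivity (recall)
    spe = ratio(tn, tn + fp)          # specificity
    ppv = ratio(tp, tp + fp)          # positive predictive value (precision)
    acc = ratio(tp + tn, tp + tn + fp + fn)
    f1 = ratio(2 * ppv * sen, ppv + sen)
    mcc = ratio(tp * tn - fp * fn,
                math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"ACC": acc, "SPE": spe, "SEN": sen, "F1": f1, "MCC": mcc, "PPV": ppv}


measures = classification_measures(tp=80, tn=70, fp=30, fn=20)
```

AUC is not included because it depends on the ranking of prediction scores rather than on a single confusion matrix.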

Results
A total of 123320 PPIs, composed of 15214 cancer PPIs and 108106 noncancer PPIs, together with 2863 cancer proteins and 3970 domains, were used in our experiment. A 10-fold cross validation test was conducted to determine the optimal threshold settings for each classifier. We took the "C-C" type PPIs as the positive set and the rest as the negative set. According to our previous work [30], a balanced training dataset usually gives better performance than an unbalanced one; hence, the algorithms are trained with a size ratio of 1 : 1 for the positive and negative datasets. Since the sizes of the original positive and negative sets differ by a factor of about 7 (an unbalanced learning set), a balanced learning set was generated by keeping the 15214 positive target interactions (cancer PPIs) and randomly selecting 15214 noncancer PPIs from the negative set. The above-mentioned seven statistical measures were then determined. For comparison, the corresponding results for the unbalanced dataset are listed in Supplementary File 1, Appendix Tables S1 to S5, in Supplementary Material available online at http://dx.doi.org/10.1155/2015/312047, where the MCC and PPV are much worse because of the very large TN and very small TP. Therefore, the use of balanced datasets is preferable.
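The balancing step amounts to random undersampling of the negative set. A minimal sketch, assuming a fixed random seed for reproducibility and a hypothetical function name; the sampling procedure actually used in the study is not specified beyond "randomly selected":

```python
import random


def balance_by_undersampling(positives, negatives, seed=0):
    """Keep all positives and draw an equally sized random sample of negatives."""
    positives, negatives = list(positives), list(negatives)
    if len(negatives) < len(positives):
        raise ValueError("need at least as many negatives as positives")
    rng = random.Random(seed)  # assumption: fixed seed for a repeatable split
    return positives, rng.sample(negatives, len(positives))


pos, neg = balance_by_undersampling(range(5), range(100, 135))
```

With the dataset sizes reported above, this would keep all 15214 cancer PPIs and draw 15214 of the 108106 noncancer PPIs without replacement.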

Performance Comparison by Individual Machine Learning Algorithm and Voting with the Majority.
The performance comparison for the individual algorithms is listed in Table 1. The LMT algorithm of the "trees" type achieved the highest ACC (0.772), F1 (0.774), and MCC (0.548) among the 39 algorithms. Interestingly, whether ranked by ACC, F1, or MCC, the top six algorithms (LMT, SimpleCart, J48, J48graft, REPTree, and FT) are all of the "trees" type. On the other hand, the LWL algorithm of the "lazy" type, the VFI algorithm of the "misc" type, the ConjunctiveRule algorithm of the "rules" type, and the DecisionStump algorithm of the "trees" type achieved the highest SPE and PPV (1.000), and the NNge algorithm of the "rules" type achieved the highest SEN (0.858), while the Ridor algorithm of the "rules" type achieved the highest AUC (0.780). We also tried using subsets of the four features to predict the cancer proteins, but the predictive performance was not markedly improved.
Each individual classifier has its own strengths and weaknesses; this motivates integrating multiple classifiers through a voting with the majority system to improve the classification performance. The thirty-nine machine learning algorithms in Weka were integrated using various voting with the majority systems. In this study, each voting system involved an odd number of algorithms. For example, "top 5" in Table 2 denotes the combination of the top 5 algorithms from Table 1, which are LMT, SimpleCart, J48, J48graft, and REPTree. The performance comparison by voting with the majority is listed in Table 2. The best voting performance is attained when the top 23 algorithms are selected, giving the highest F1 (0.786) and PPV (0.890). Comparing Table 1 with Table 2, the top voting system (top 23) is better than the top individual algorithm (LMT) in six performance measures, the exception being SEN, and it also outperforms all thirty-nine individual algorithms in ACC, F1, MCC, and AUC. We also noted that the best voting system (top 23) produced fewer FP events, that is, 186, than the top three individual algorithms, which produced 303, 297, and 307, respectively. In other words, the voting approach does not introduce spurious events.
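Majority voting over an odd number of classifiers can be sketched as follows. This is an illustration rather than Weka's own voting implementation; the function name and the 0/1 label encoding (1 for a predicted cancer PPI) are assumptions.

```python
def majority_vote(predictions):
    """Combine per-classifier 0/1 label lists into one list by majority vote."""
    n_classifiers = len(predictions)
    if n_classifiers % 2 == 0:
        raise ValueError("use an odd number of classifiers to avoid ties")
    # zip(*predictions) groups the votes cast for each sample.
    return [1 if 2 * sum(votes) > n_classifiers else 0
            for votes in zip(*predictions)]


combined = majority_vote([
    [1, 0, 1],   # classifier 1
    [1, 1, 0],   # classifier 2
    [0, 0, 1],   # classifier 3
])  # [1, 0, 1]
```

Requiring an odd number of voters, as in the study, guarantees that every sample receives a strict majority for one of the two classes.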

Performance Comparison with Group Voting with the Majority.
Performance comparison by voting with the majority for each group is listed in Table 3 (the Misc type is omitted here, because it contains only two algorithms). For instance, under the "functions" type, the "Top: 3" classifier in Table 3 represents the combination of "functions" type algorithms, that is, the MultilayerPerceptron, Logistic, and SimpleLogistic algorithms (see Table 1). The results indicated that the "trees" type achieved the highest ACC (0.781), SEN (0.749), F1 (0.786), MCC (0.567), MCC (0.513), and AUC (0.788), while the "rules" type had the highest SPE (0.838), and the "lazy" type achieved the highest PPV (0.883).
For clarity, Figure 4 illustrates the majority voting results for various types of algorithms; the results suggest that "trees" type algorithms perform better in most of the performance measures.

Performance Comparison with the Competing Study.
Since there are four features that are considered in this study, it is necessary to study their significance in classification performance. To study the prediction performance of the four features individually, we evaluated the feature importance by the area under the curve (AUC) value [36,37]. Features with a higher AUC score are ranked as more important than features with a low score. The results of AUC values for the four features are given in Table 4. The DFS C feature ranks at the top in AUC value, while the DFS X feature has the lowest AUC value. These results suggest that DFS C feature has the greatest discrimination information between positive datasets and negative datasets.
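Ranking a single feature by AUC can be illustrated with the rank-based (Mann-Whitney) formulation: the AUC equals the probability that a randomly chosen positive example has a higher feature value than a randomly chosen negative one, with ties counted as one half. The sketch below is a simple quadratic-time version with a hypothetical function name; it is not the procedure of [36, 37].

```python
def feature_auc(pos_values, neg_values):
    """AUC of one feature: P(positive value > negative value), ties count 0.5."""
    wins = 0.0
    for p in pos_values:
        for n in neg_values:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_values) * len(neg_values))


auc = feature_auc([0.9, 0.3], [0.5, 0.1])  # 0.75
```

Computing this value for each of the four normalized features and sorting in descending order would give a ranking of the kind reported in Table 4.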
To demonstrate the effectiveness of the present study, we compared our results with the work by Aragues et al. [4], which uses the CLD feature only. As shown in Table 5 and Figure 5, for a single classifier, our method achieved better performance than Aragues's in all seven performance measures. From Tables 5 and 6, we can see that, using the CLD feature only, the best performance in voting is attained when

Discussion
In this section, we present two case studies: the first demonstrates the performance of the proposed method, and the second shows how to discover potential cancer proteins.

Case Study 1.
To investigate the performance of the proposed method, we retrieved lung cancer protein data from the OMIM and HLungDB databases. Among the 2599 experimentally confirmed lung cancer proteins, 1302 do not appear in our original training dataset and could therefore be used as the independent test dataset. The list of the 1302 cancer proteins can be found in Supplementary File 2. The LWL, VFI, ConjunctiveRule, and DecisionStump algorithms were excluded from Case Study 1 because their intrinsically high PPV may bias the performance estimation. In Table 7, the hit number and hit ratio denote the number and the fraction of cancer proteins that are true positive events, respectively. Most of the algorithms had a hit ratio consistently over 75%; in particular, the Ridor algorithm achieved 89.4%. Compared with classifiers with higher ranks in F1 (Table 1), the results appear to suggest that classifiers with high PPV achieve better hit ratios.

Case Study 2.
It is known that interacting proteins are often coexpressed; one can identify differentially expressed genes (DEGs) among a large number of gene expressions and understand the mechanism of lung cancer formation induced by these DEGs [38]. We further explored the potential cancer genes from DEGs in microarray data. Four sets of lung cancer microarray data were downloaded from the GEO database [39] and are summarized in Table 8. Experiments GSE7670 [40] and GSE10072 [41] use the HG-U133A array, whereas GSE19804 [42] and GSE27262 [43] use the HG-U133 plus 2.0 chip. The tested DEGs are collected from the intersection of the above four microarray datasets. Among the 1345 common DEGs in the four microarray datasets, 360 DEGs were excluded because they appear in the original training set; another 209 DEGs were removed because they lack domain data or PPI data. The remaining 776 DEGs serve as input data.
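The gene filtering described above reduces to a set intersection followed by two exclusions. A minimal sketch with toy gene names; the function name and the inputs are hypothetical, and the real DEG lists are not reproduced here.

```python
def candidate_degs(deg_sets, training_genes, annotated_genes):
    """Common DEGs that are new to the training set and have domain and PPI data."""
    common = set.intersection(*(set(s) for s in deg_sets))
    novel = common - set(training_genes)
    return novel & set(annotated_genes)


# Toy example: four DEG lists, one gene already in training, one without data.
deg_sets = [
    {"g1", "g2", "g3", "g4"},
    {"g1", "g2", "g3", "g5"},
    {"g1", "g2", "g3"},
    {"g1", "g2", "g3", "g6"},
]
candidates = candidate_degs(deg_sets, training_genes={"g1"},
                            annotated_genes={"g1", "g2"})

# The reported counts follow the same arithmetic: 1345 - 360 - 209 = 776.
remaining = 1345 - 360 - 209
```

In the toy example, g1, g2, and g3 are common to all four lists, g1 is dropped because it is in the training set, and g3 is dropped because it lacks annotation, leaving only g2.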
Five classifiers were selected for evaluating potential cancer genes: the top three classifiers according to the F1 measure (the LMT, SimpleCart, and J48 algorithms), the top classifier according to the PPV measure (the LWL algorithm), and the top classifier in Case Study 1 (the Ridor algorithm). Selection was made under strictly uniform voting; that is, only genes that received votes from all five classifiers were retained.
To validate our findings, we conducted a literature study by randomly selecting five genes (DBF4, MCM2, ID3, EXOSC4, and CDKN3) from the 565 potential cancer genes. The results indicated that, while the role of EXOSC4 remains unclear, the others are in agreement with our predictions. Barkley proposed that miR-29a targets the 3'-UTR of DBF4 mRNA in lung cancer cells [44], and Bonte stated that most cell lines with increased Cdc7 protein levels also had increased DBF4 abundance, and some tumor cell lines had extra copies of the DBF4 gene [45]. Alexandrow noted that Stat3-P and the proliferative marker MCM2 were expressed in mouse lung tissues in vivo [46]. Yang et al. observed that patients with higher levels of MCM2 and gelsolin experienced shorter survival times than patients with low levels of MCM2 and gelsolin [47]. Langenfeld et al. [48] indicated that Oct4 cells give rise to lung cancer cells expressing nestin and/or NeuN, and that BMP signaling is an important regulator of ID1 and ID3 in both Oct4 and nestin cell populations. Tang claimed that CDKN3 has significant biological implications in tumor pathogenesis [49]. In [50], a metasignature was identified in eight separate microarray analyses spanning seven types of cancer, including lung adenocarcinoma; it included many genes associated with cell proliferation, and CDKN3 is among them.
Given a single protein as testing data, we can first treat this protein as X. If both PPI and domain information are available, one can then apply the present method to classify the interaction type "C-X".
If a "C-X" type interaction is classified as "C-C", there are two possible explanations: (i) the classifier is not completely specific, so the prediction is an FP event, or (ii) the prediction is a TP event. If one can exclude the first explanation, then the present calculation provides a potential way to assign X as C; in other words, it provides a feasible solution for predicting cancer proteins.
If the PPI information is missing but the FASTA sequence is available, one can make use of the STRING database [51]. STRING is a database that provides known and predicted PPIs derived from four sources: genomic context, high-throughput experiments, conserved coexpression, and published literature. For domain prediction, one can use the online tool "SEQUENCE SEARCH" under PFam [34] to find matching domains. Then, given the PPI and domain information, one can conduct the same analysis as described in the last paragraph; otherwise, one faces a difficult task, which requires further discussion or work.

Conclusion
Identifying cancer proteins is a critical issue in treating cancer; however, identifying cancer proteins experimentally is extremely time-consuming and labor-intensive. Alternative methods must be developed to discover cancer proteins. We have integrated several proteomic data sources to develop a model for predicting cancer protein-cancer protein interactions on a global scale based on domain-domain interactions, weighted domain frequency scores, and cancer linker degree. A one-to-one interaction model was introduced to quantify the likelihood of cancer-specific DDI. The weighted DFS is adopted to measure the propensity of domain occurrence in cancer and noncancer proteins. Finally, the CLD is defined to gauge the interaction partners of cancer and noncancer proteins. As a result, the voting with the majority system achieved ACC (0.774), SPE (0.855), SEN (0.721), F1 (0.786), MCC (0.562), PPV (0.890), and AUC (0.787) when the top 23 algorithms were selected, which is better than the best single classifier (LMT) in six performance measures, the exception being SEN. We compared our performance with the previous work [4]. The present approach outperformed Aragues's in all seven performance measures, both for individual algorithms and for combined algorithms. The effectiveness of the current research was further evaluated on two independent datasets; the experimental results demonstrated that the proposed method can identify cancer proteins with high hit ratios. The current research not only significantly improves the prediction performance for cancer proteins but also discovers some potential cancer proteins for future experimental investigation. It is anticipated that the current research could provide some insight into disease mechanisms and diagnosis.
Figure 6: The performance comparison of the voting with the majority for Aragues (blue) and the proposed method (red).