Protein–protein interactions (PPIs) play an essential role in the growth, reproduction, and metabolism of all living organisms. A thorough investigation of PPIs can uncover how proteins carry out their functions. In this study, we used gene ontology (GO) terms and biological pathways to study an extended version of PPI (protein–protein functional associations) and to identify essential GO terms and pathways that can indicate the difference between protein pairs with and without functional associations. The protein–protein functional associations validated by experiments were retrieved from STRING, a well-known database that collects associations between proteins from multiple sources, and were termed positive samples. The negative samples were constructed by randomly pairing two proteins. Each sample was represented by several features based on the GO and KEGG pathway information of its two proteins. Mutual information was then adopted to evaluate the importance of all features, and the most important features were selected, from which a number of essential GO terms or KEGG pathways were identified. The final analysis of some important GO terms and one KEGG pathway can partly uncover the difference between proteins with and without functional associations.
Protein is the material foundation of all living things [
PPI has been thoroughly studied both experimentally and computationally. In experimental studies, coimmunoprecipitation, Western blot, and yeast two-hybrid systems are generally adopted [
Gene ontology (GO) is a bioinformatic concept that was originally proposed to unify the representation of genes and gene products of many species [
In this study, we investigated an extended version of PPI (protein–protein functional associations) by using GO terms and KEGG pathways. Because few computational PPI studies have investigated which GO terms are highly related to the determination of PPIs, the purpose of this study was to identify key GO terms or KEGG pathways that can indicate the difference between two proteins with and without functional associations. We first extracted protein–protein functional associations with experimental validation reported in Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) [
All human protein–protein functional associations used in this study were retrieved from STRING (
To extract the difference between positive associations and pairs of proteins without functional associations, negative associations are necessary. Given that negative associations substantially outnumber positive ones, we constructed 211,176 distinct pairs of proteins, three times the number of positive associations, and each of them was produced as follows:
The overall procedure for analyzing protein–protein functional associations based on gene ontology (GO) and KEGG pathways. The raw 2,425,314 human PPIs were retrieved from STRING and refined by excluding similar proteins and selecting those validated by experiments, resulting in 70,392 PPIs. The 6,623 proteins involved in the investigated PPIs were used to construct ten sets of protein pairs, each of which was combined with the 70,392 PPIs to constitute one of ten datasets. Each sample was represented by GO and KEGG features, which were evaluated by mutual information, producing ten feature lists, from which we extracted the most important features, corresponding to 134 GO terms and one KEGG pathway.
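The construction of negative samples by random pairing can be illustrated with a minimal sketch. The exact pairing steps are not reproduced in this text, so the sketch assumes uniform random pairing of distinct proteins, with known positive pairs and duplicates rejected. The function name `sample_negative_pairs` and the toy protein identifiers are hypothetical.

```python
import random


def sample_negative_pairs(proteins, positive_pairs, n_pairs, seed=0):
    """Draw n_pairs distinct unordered protein pairs that are not positives.

    Assumption: pairs are drawn uniformly at random; the original
    procedure may apply additional rules not shown here.
    """
    pool = sorted(set(proteins))  # fixed order keeps the draw reproducible
    pool_set = set(pool)
    positives = {frozenset(p) for p in positive_pairs}

    # Count how many candidate pairs remain once positives are excluded.
    total = len(pool) * (len(pool) - 1) // 2
    blocked = sum(1 for p in positives if len(p) == 2 and p <= pool_set)
    if n_pairs > total - blocked:
        raise ValueError("not enough non-positive pairs to sample from")

    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < n_pairs:
        a, b = rng.sample(pool, 2)  # two different proteins
        pair = frozenset((a, b))
        if pair not in positives:
            negatives.add(pair)  # the set silently drops repeated pairs
    return sorted(tuple(sorted(p)) for p in negatives)


# Toy usage with hypothetical identifiers: three negatives per positive.
toy_proteins = ["P1", "P2", "P3", "P4", "P5", "P6"]
toy_positives = [("P1", "P2"), ("P2", "P3")]
toy_negatives = sample_negative_pairs(toy_proteins, toy_positives, 3 * len(toy_positives))
```

Rejection sampling is adequate here because the number of requested pairs is small relative to all possible pairs among several thousand proteins.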
GO terms [
Moreover, according to KEGG [
By integrating the GO term and KEGG pathway information of proteins into
As mentioned in Section
Given a dataset in which each sample is represented by
To compute MI efficiently, we adopted the program of the minimum redundancy maximum relevance (mRMR) method [
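As an illustration of how a relevance-only (MaxRel) ranking could be computed, the sketch below estimates MI between a discrete feature and the class label from empirical frequencies and then sorts features by that value. It is not the mRMR program itself. The logarithm base (2 here) is an assumption and affects the scale of the MI values, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical MI (in bits) between two equally long discrete sequences."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log( p(x,y) / (p(x) p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi


def maxrel_ranking(feature_columns, labels):
    """Rank features by decreasing MI with the labels.

    feature_columns maps a feature name to its list of values per sample.
    """
    scored = [(name, mutual_information(col, labels))
              for name, col in feature_columns.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy usage: "f1" copies the label, "f2" is independent of it.
toy_labels = [1, 1, 0, 0]
toy_ranking = maxrel_ranking({"f1": [1, 1, 0, 0], "f2": [0, 1, 0, 1]}, toy_labels)
```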
As mentioned in Section
Features with high ranks (large MI values) in the MaxRel feature list are more important than those with low ranks (small MI values). We set an MI threshold of 0.01 to select important features in each MaxRel feature list, thus producing 10 feature sets denoted as
Number of selected features in each MaxRel feature list.
Dataset | Number of selected features |
---|---|
| 154 |
| 154 |
| 153 |
| 155 |
| 149 |
| 150 |
| 155 |
| 152 |
| 153 |
| 153 |
Distribution of the 158 selected features: 146, 2, 1, 2, 3, and 4 features appear in 10, 9, 8, 7, 6, and fewer than 6 of the feature sets derived from the 10 datasets, respectively.
Heat map of the MI values of the 158 features in the 10 datasets. The X-axis represents the ten datasets; the Y-axis represents the 158 features.
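The thresholding step and the occurrence counts summarized above can be sketched as follows. This is a minimal illustration on made-up MI values. Whether the 0.01 threshold is applied as strictly greater than or as greater than or equal to is not stated, so the strict comparison used here is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def select_features(mi_by_feature, threshold=0.01):
    """Keep features whose MI exceeds the threshold (strict '>' is assumed)."""
    return {name for name, mi in mi_by_feature.items() if mi > threshold}


def occurrence_distribution(feature_sets):
    """Count, for each feature, how many feature sets contain it.

    Returns (occurrences, distribution) where distribution[k] is the
    number of features that appear in exactly k feature sets.
    """
    occurrences = Counter(f for s in feature_sets for f in s)
    distribution = Counter(occurrences.values())
    return occurrences, distribution


# Toy usage with three hypothetical datasets instead of ten.
toy_mi_lists = [
    {"a": 0.5, "b": 0.02, "c": 0.001},
    {"a": 0.4, "b": 0.005, "c": 0.002},
    {"a": 0.3, "b": 0.03, "c": 0.2},
]
toy_sets = [select_features(m) for m in toy_mi_lists]
toy_union = set().union(*toy_sets)  # all features selected at least once
toy_occurrences, toy_distribution = occurrence_distribution(toy_sets)
```

Applied to ten feature sets, the size of the union corresponds to the total number of selected features, and the distribution gives the number of features shared by 10, 9, 8, and fewer sets.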
A careful check showed that the 158 important features were derived from 134 GO terms and one KEGG pathway (Supplementary Material
The distribution of the rating scores of 134 selected GO terms.
As mentioned in Section
The performance of the random forest (RF) on the ten datasets, in which samples were represented by the 158 selected features or by 158 randomly selected features, evaluated by tenfold cross-validation. The box plot indicates the distribution of MCCs yielded by RF with 158 randomly selected features, and the circles represent the MCCs yielded by RF with the 158 selected features on the ten datasets. With the 158 selected features, RF performed much better, implying strong associations between these features and PPIs.
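For reference, the MCC reported in this comparison is taken here to be the Matthews correlation coefficient for binary classification, which can be computed from the confusion matrix as in the sketch below. The function name `mcc`, the 0/1 label encoding, and the convention of returning 0 when the denominator vanishes are assumptions made for illustration; the classifier itself is not reproduced.

```python
from math import sqrt


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels encoded as 0/1."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (assumed): report 0 when a row or column of the
    # confusion matrix is empty and the ratio is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy usage: predictions pooled over cross-validation folds.
toy_true = [1, 1, 0, 0]
toy_pred = [1, 0, 1, 0]
toy_score = mcc(toy_true, toy_pred)
```

MCC ranges from -1 to 1 and remains informative when negatives outnumber positives, which is the case when negative pairs are sampled at a multiple of the positive ones.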
As mentioned in Section
Analyzing the above-mentioned 134 GO terms one by one is difficult. Here, we selected the 21 most important GO terms, those with rating scores larger than 0.5, for detailed analysis; they are listed in Table
Information on the 21 most important GO terms.
GO term ID | GO term | Rating score | Group |
---|---|---|---|
GO:0044260 | cellular macromolecule metabolic process | 0.688 | Biological process |
GO:0043170 | macromolecule metabolic process | 0.640 | Biological process |
GO:0044428 | nuclear part | 0.618 | Cellular component |
GO:1901363 | heterocyclic compound binding | 0.600 | Molecular function |
GO:0032991 | protein-containing complex | 0.593 | Cellular component |
GO:0097159 | organic cyclic compound binding | 0.591 | Molecular function |
GO:0031981 | nuclear lumen | 0.590 | Cellular component |
GO:0044238 | primary metabolic process | 0.589 | Biological process |
GO:0003676 | nucleic acid binding | 0.583 | Molecular function |
GO:0090304 | nucleic acid metabolic process | 0.569 | Biological process |
GO:0071704 | organic substance metabolic process | 0.556 | Biological process |
GO:0044237 | cellular metabolic process | 0.552 | Biological process |
GO:0005634 | nucleus | 0.549 | Cellular component |
GO:0044446 | intracellular organelle part | 0.547 | Cellular component |
GO:0044424 | intracellular part | 0.537 | Cellular component |
GO:0044422 | organelle part | 0.536 | Cellular component |
GO:0070013 | intracellular organelle lumen | 0.529 | Cellular component |
GO:0005622 | intracellular | 0.523 | Cellular component |
GO:0043233 | organelle lumen | 0.521 | Cellular component |
GO:0031974 | membrane-enclosed lumen | 0.514 | Cellular component |
GO:0006139 | nucleobase-containing compound metabolic process | 0.506 | Biological process |
Distribution of the 21 GO terms across three groups: cellular component, molecular function, and biological process.
The cellular component GO term with the highest rating score was GO:0044428, which describes the nuclear part of eukaryotic cells and is involved in housing and replicating chromosomes. Such processes involve multiple effective PPIs, like Esc2 and Rad51 [
Apart from the nuclear region of the cell, our results indicate that cellular regions associated with functional organelles may also be related to PPIs. GO:0044422 (organelle part), GO:0070013 (intracellular organelle lumen), GO:0044446 (intracellular organelle part), and GO:0043233 (organelle lumen) were all identified as potential cellular components that may be associated with positive PPIs [
Apart from such specific GO terms, we also identified some more general ones, such as GO:0032991 (protein-containing complex), GO:0044424 (intracellular part), GO:0005622 (intracellular), and GO:0031974 (membrane-enclosed lumen). They all describe regions in which important biological processes of the cell are concentrated. Actual PPIs therefore tend to be enriched in such regions, revealing a specific PPI distribution pattern in eukaryotic cells.
On the basis of these analyses, all 21 GO terms are involved in different aspects of PPI and can be used to mark proteins with functional associations. For the remaining GO terms shown in Supplementary Material
As for the other GO terms extracted in this study, although they are not as relevant to PPIs as the GO terms described in Section
Furthermore, one KEGG pathway, hsa03010, was obtained in our study. It describes the ribosome pathway. Because genes/proteins that participate in this pathway may interact with each other to form the ribosome complex, this KEGG pathway may also contribute to the distinction between positive and negative PPIs.
This study investigated protein–protein functional associations based on GO terms and KEGG pathways. By using mutual information, we identified important GO terms and KEGG pathways that can describe the difference between actual associations and pairs of proteins without associations and help understand the mechanisms of protein interactions. A possible future research direction is to further use these GO terms and KEGG pathways to build a computational method for inferring novel associations between proteins, enriching the biological functional annotation of proteins.
The original data used to support the findings of this study are available from the STRING database and in the supplementary information files.
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This study was supported by the National Natural Science Foundation of China (31701151), Natural Science Foundation of Shanghai (17ZR1412500), National Key R&D Program of China (2018YFC0910403), Shanghai Sailing Program (16YF1413800), The Youth Innovation Promotion Association of Chinese Academy of Sciences (CAS) (2016245), the fund of the Key Laboratory of Stem Cell Biology of Chinese Academy of Sciences (201703), and Science and Technology Commission of Shanghai Municipality (STCSM) (18dz2271000).
70392 protein–protein functional associations.
A part of MaxRel feature list on ten datasets obtained by mutual information of each feature.
Selected 158 features and their occurrences in the 10 feature sets of the 10 datasets (√: feature is in the feature set;
Extracted important GO terms and their rating scores.