Observing the phenotype caused by overexpression or knockdown of a gene is the basic method of investigating gene function. Many advanced biotechnologies, such as RNAi, have been developed to study gene phenotypes, but many limitations remain. Besides the time and cost, the knockdown of some genes may be lethal, which makes the observation of other phenotypes impossible. For ethical and technological reasons, the knockdown of genes in complex species, such as mammals, is extremely difficult. Thus, we proposed a new sequence-based computational method called
Recognition of gene phenotypes of proteins is a central challenge of modern genetics to modulate protein functions and biological processes, and many well-known diseases, such as HIV [
During the past decades, numerous efforts have been made in the prediction of gene phenotype of yeast protein based on the following approaches: experimental methods and computational methods. As for experimental approaches, the high-throughput phenotype assays [
In this study, 6,732 proteins of yeast were taken from CYGD (the MIPS Comprehensive Yeast Genome Database [
Breakdown of 1462 budding yeast proteins according to their 11 phenotypes.
| Tag | Phenotype category | Number of proteins |
|---|---|---|
| | Conditional phenotypes | 536 |
| | Cell cycle defects | 272 |
| | Mating and sporulation defects | 198 |
| | Auxotrophies, carbon, and nitrogen utilization defects | 266 |
| | Cell morphology and organelle mutants | 535 |
| | Stress response defects | 147 |
| | Carbohydrate and lipid biosynthesis | 46 |
| | Nucleic acid metabolism defects | 219 |
| | Sensitivity to amino acid analogs and other drugs | 124 |
| | Sensitivity to antibiotics | 43 |
| | Sensitivity to immunosuppressants | 14 |
| Total | - | 2,400 |
The first important step in building an efficient prediction model is to encode each sample as a numeric vector. Here, to capture the information related to protein phenotype, Gene Ontology (GO) and KEGG enrichment scores were employed to represent each protein; these scores have been used in several biological problems [
Each protein was represented by 4,682 features, comprising 4,583 GO enrichment scores and 99 KEGG enrichment scores. However, some of the 4,682 features had little relationship to the target and could introduce noise into the prediction model, so they were removed. Before removing the irrelevant features, the following formula was used to adjust all features to a standard scale:
After the transformation, the correlation coefficient between each feature and the target vector was computed, and features with a correlation coefficient less than 0.1 were discarded. Finally, 989 features remained: 947 Gene Ontology (GO) enrichment scores and 42 KEGG enrichment scores. Thus, each protein
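The scaling and filtering steps can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the standard scale is a z-score (the original formula is not reproduced here), that the correlation is Pearson's r, and that the 0.1 cutoff is applied to its absolute value. The function names are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def standardize(column):
    """Z-score one feature column (assumed form of the scaling step)."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either vector is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def filter_features(columns, target, threshold=0.1):
    """Return indices of features with |r| >= threshold (absolute value is an assumption)."""
    kept = []
    for j, column in enumerate(columns):
        if abs(pearson(standardize(column), target)) >= threshold:
            kept.append(j)
    return kept


# Toy data: three feature columns measured on four samples.
columns = [[1, 2, 3, 4], [1, 1, 1, 1], [1, -1, 1, -1]]
target = [1, 2, 3, 4]
kept = filter_features(columns, target)
```

In this toy example the constant second column is dropped, while the other two columns pass the threshold.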
Minimum Redundancy Maximum Relevance (mRMR), first proposed by Peng et al. [
mRMR has been widely used in the areas of bioinformatics [
The nearest neighbor algorithm is effective in solving classification and optimization problems in bioinformatics owing to its simplicity. It is adopted here to construct the multilabel prediction classifier.
Within
For a query protein, Identifying the Then, the following formula:
The corresponding category labels of the category scores are denoted as
In the ranking by pairwise comparison (RPC) method, for each pair of labels, an instance is allocated to the pair if it belongs to one and only one of the two labels (not both). Given
Given a new instance, all trained pairwise classifiers are applied to predict its label, and the ranking of the labels is obtained by counting the votes for each label: whenever a classifier assigns the instance to a label, that label receives one vote.
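The pairwise allocation and voting described above can be sketched as follows. This is an illustration under stated assumptions: the pairwise classifiers are passed in as plain callables (in practice they would be trained models such as Dagging, RandomForest, or SMO), ties in the vote count keep the original label order, and all function names are hypothetical. With 11 labels, the pairing step yields 55 label pairs.

```python
from collections import Counter
from itertools import combinations


def pair_training_sets(samples, labels):
    """Build one training set per label pair.

    samples: list of (features, set_of_labels).
    An instance joins pair (a, b) only if it carries exactly one of a and b.
    """
    pairs = {}
    for a, b in combinations(labels, 2):
        rows = []
        for features, label_set in samples:
            has_a, has_b = a in label_set, b in label_set
            if has_a != has_b:  # one and only one of the two labels
                rows.append((features, a if has_a else b))
        pairs[(a, b)] = rows
    return pairs


def rank_labels(instance, pairwise_classifiers, labels):
    """Rank labels by the number of pairwise votes they receive."""
    votes = Counter({label: 0 for label in labels})
    for classify in pairwise_classifiers.values():
        votes[classify(instance)] += 1
    # sorted() is stable, so tied labels keep their original order.
    return sorted(labels, key=lambda label: -votes[label])


labels = ["A", "B", "C"]
samples = [([0], {"A"}), ([1], {"A", "B"}), ([2], {"C"})]
training_sets = pair_training_sets(samples, labels)

# Stand-in classifiers (hypothetical): each always returns a fixed winner.
classifiers = {
    ("A", "B"): lambda x: "A",
    ("A", "C"): lambda x: "C",
    ("B", "C"): lambda x: "C",
}
ranking = rank_labels([5], classifiers, labels)
```

Here the sample carrying both "A" and "B" is excluded from the ("A", "B") training set, and the final ranking places "C" first because it wins two of the three pairwise votes.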
Each dataset contains those examples of
The
Incremental feature selection (IFS) is often used to search for the feature subset that performs best. Specifically, features in the ranked feature list are added one by one from higher to lower rank, and the first
Each feature subset is used to make prediction and the feature subset (first
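The IFS loop can be sketched as follows. This is a minimal illustration: `evaluate` is a hypothetical callable standing in for whatever accuracy estimate is used on each subset (for example, the first-order accuracy of a jackknife test), and the first best-scoring subset is kept when several subsets tie.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Evaluate the top-k prefixes of a ranked feature list and keep the best one.

    Returns the best prefix and the IFS curve as a list of (k, score) pairs.
    """
    best_k, best_score = 0, float("-inf")
    curve = []
    for k in range(1, len(ranked_features) + 1):
        score = evaluate(ranked_features[:k])
        curve.append((k, score))
        if score > best_score:  # strict '>' keeps the smallest k on ties
            best_k, best_score = k, score
    return ranked_features[:best_k], curve


# Toy scoring function (hypothetical): reward useful features, penalize size.
useful = {"f1", "f3"}


def toy_score(subset):
    return len(useful & set(subset)) - 0.1 * len(subset)


best_subset, ifs_curve = incremental_feature_selection(
    ["f1", "f3", "f2", "f4"], toy_score
)
```

Plotting `ifs_curve` gives the IFS curve, and its peak marks the size of the selected subset.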
We applied the mRMR method to the dataset and obtained two tables for the features (see Supplementary Material). One, the MaxRel feature table, ranks the features by their relevance to the class of samples; the other, the mRMR feature table, ranks the features by maximum relevance to the class of samples and minimum redundancy among the features. This ranked feature list was then used in the IFS procedure to select the optimal feature set.
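The difference between the two rankings can be illustrated with a small sketch. It is not the mRMR program used in the study: it assumes the features have already been discretized, estimates mutual information from simple frequency counts, and uses the difference form (relevance minus mean redundancy) as the selection criterion. The function names are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items()
    )


def maxrel_ranking(features, target):
    """Rank features by relevance to the target only."""
    return sorted(features, key=lambda name: -mutual_information(features[name], target))


def mrmr_ranking(features, target):
    """Greedy ranking by relevance minus mean redundancy with already selected features."""
    relevance = {name: mutual_information(col, target) for name, col in features.items()}
    selected, remaining = [], list(features)
    while remaining:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: "a_copy" duplicates "a", while "b" carries complementary information.
target = [0, 0, 1, 1, 2, 2, 3, 3]
features = {
    "a": [0, 0, 0, 0, 1, 1, 1, 1],
    "a_copy": [0, 0, 0, 0, 1, 1, 1, 1],
    "b": [0, 0, 1, 1, 0, 0, 1, 1],
}
maxrel_order = maxrel_ranking(features, target)
mrmr_order = mrmr_ranking(features, target)
```

All three toy features are equally relevant, so MaxRel cannot separate them, whereas the mRMR ordering pushes the duplicated feature to the end because it adds nothing beyond the feature already selected.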
The first-order prediction accuracy of Jackknife test is 62.38%, while
The 11 order prediction accuracies by
| Method (accuracy in %, by order) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | 62.38 | 30.44 | 22.16 | 14.09 | 9.03 | 6.43 | 5.75 | 2.8 | 3.08 | 3.49 | 4.51 |
The curve showing the trend of the 11 order prediction accuracies.
30 IFS curves of
The peak and its coordinate of these IFS curves.
Firstly, we classify the total labels into 55(
The 11 order prediction accuracies by RPC-based methods (Dagging, RandomForest, SMO).
| Method (accuracy in %, by order) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Dagging | 60.05 | 33.58 | 21.96 | 13.75 | 10.53 | 8.28 | 6.57 | 3.56 | 2.6 | 1.85 | 1.44 |
| RandomForest | 58.62 | 34.2 | 22.3 | 14.7 | 9.92 | 7.66 | 5.95 | 5.2 | 3.28 | 1.5 | 0.82 |
| SMO | 56.16 | 34.68 | 21.55 | 14.84 | 10.88 | 7.8 | 6.36 | 4.65 | 3.21 | 2.26 | 1.78 |
We compared the first-order prediction accuracy of our method with that of the RPC-based method. The first-order prediction accuracies of the RPC-based method using Dagging, RandomForest, and SMO are all lower than our
To illustrate the biological meaning of the selected optimal feature subset, we first classified GO terms into three kinds: biological process, cellular component, and molecular function GO terms. The 622 GO terms in the mRMR feature list were mapped to the children of the three root GO terms. The figures show the frequency of each GO term in the feature subset and the ratio of that frequency to the number of its child terms.
In BP frequency, the top five GO biological process terms are GO:0009987: cellular process (399), GO:0008152: metabolic process (316), GO:0019740: nitrogen utilization (216), GO:0065007: biological regulation (136), and GO:0050789: regulation of biological process (131). In BP percentage, the top five GO biological process terms are GO:0019740: nitrogen utilization (4.20%), GO:0071840: cellular component organization or biogenesis (3.57%), GO:0000003: reproduction (2.94%), GO:0022414: reproductive process (2.88%), and GO:0009987: cellular process (2.04%). In both the term number and the percentage distribution analyses, the GO terms nitrogen utilization (GO:0019740) and cellular process (GO:0009987) appear within the top five. This indicates that proteins annotated with these two GO terms may strongly affect protein phenotype determination. This conclusion is consistent with the common knowledge that specific cellular biological activities of proteins confer specific phenotypes. Granek and Magwene also reported that two key signaling networks, the filamentous growth MAP kinase cascade and the Ras-cAMP-PKA pathway, can regulate the yeast colony morphology response [
The prominence of nitrogen utilization (GO:0019740) suggests that nitrogen utilization, which is essential for survival and development, may have a more definite effect on protein phenotype. Nutrient stresses trigger a variety of developmental switches in the budding yeast
In CC frequency, the top six GO cellular component terms are GO:0005623: cell (171), GO:0044464: cell part (169), GO:0043226: organelle (135), GO:0044422: organelle part (103), GO:0032991: macromolecular complex (84), and GO:0031974: membrane-enclosed lumen (39). In CC percentage, the top six GO cellular component terms are GO:0031974: membrane-enclosed lumen (12.4%), GO:0044422: organelle part (8.42%), GO:0043226: organelle (8.4%), GO:0032991: macromolecular complex (5.20%), GO:0044464: cell part (4.77%), and GO:0005623: cell (4.20%). In both the term number and the percentage distribution analyses, the GO terms organelle (GO:0043226) and organelle part (GO:0044422) appear within the top six. It may be concluded that proteins located in all cellular organelles should be considered. This suggests that organelles, which have specific structural and functional attributes, may possess more definite protein phenotypes to carry out their specific functions. It also implies that proteins assigned to these GO terms could contribute relatively more to overall protein phenotype determination. For example, the communication between mitochondrial and nuclear loci (i.e.,
In MF frequency, the top six GO molecular function terms are GO:0003824: catalytic activity (79), GO:0005488: binding (69), GO:0001071: nucleic acid binding transcription factor activity (40), GO:0000988: protein binding transcription factor activity (14), GO:0065009: regulation of molecular function
Tao Zhang and Min Jiang contributed equally to this research.
This work was supported by grants from the National Basic Research Program of China (2011CB510101, 2011CB510102), the National Natural Science Foundation of China (31371335), the Innovation Program of Shanghai Municipal Education Commission (12ZZ087), the Leading Academic Discipline Project of Shanghai Municipal Education Commission “Molecular Physiology,” the grant of “The First-class Discipline of Universities in Shanghai,” and the Foundation for The Excellent Youth (SHU10022).