Identification and Analysis of Driver Missense Mutations Using Rotation Forest with Feature Selection

Identifying cancer-associated mutations (driver mutations) is critical for understanding the cellular functions of the cancer genome that lead to activation of oncogenes or inactivation of tumor suppressor genes. Many proposed approaches use supervised machine learning for prediction, with features obtained from various databases. However, it is often unclear which features are important for driver mutation prediction. In this study, we propose a novel feature selection method (called DX) applied to a set of 126 candidate features. To obtain the best performance, the rotation forest algorithm was adopted for the experiments. On the training dataset, which was collected from the COSMIC and Swiss-Prot databases, we obtained high prediction performance with 88.03% accuracy, 93.9% precision, and 81.35% recall when the 11 top-ranked features were used. Comparison with other techniques on the TP53, EGFR, and Cosmic2plus datasets shows the generality of our method.


Introduction
Recent developments in large-scale sequencing of cancer genomes have uncovered hundreds to thousands of mutations of various types [1], including DNA sequence alterations such as point mutations, nucleotide mutations, and genomic rearrangements [2]. Although many somatic mutations have been discovered, only a small fraction (drivers that drive tumor evolution, <1%) promote cancer progression, and the majority are likely to be "passengers" that have no effect on tumor cell selection [3-5]. Many methods are used to explore the mechanisms of different mutations. For example, Purohit et al. [6] studied drug resistance through docking and binding analysis and found that the mutation S315T has a high docking score: it decreases the flexibility of the binding residues and makes them rigid by altering the conformational changes, which in turn hampers INH activity. Lamin A/C proteins are the major components of a thin proteinaceous filamentous meshwork, and the structural and functional consequences of the mutation R482W cause FPLD [7]. The structures and relationships of mutant proteins have also been studied, such as cancer-associated E17K [8], SH2-containing protein (NSP3) and Crk-associated substrate (p130Cas) [9], TMC114 [10,11], PncA of Mycobacterium tuberculosis [12], and KIT receptor [13]. Among these analyses, the missense mutation, a point mutation that changes a codon so that it encodes a different amino acid, has received wide attention [14,15]. Accordingly, various data-driven methods are used to identify which missense mutations are drivers and which are passengers [16].
So far, several approaches have been developed to identify driver mutations, and they can be roughly classified into two categories. The first category relies on biological differences, under the hypothesis that driver genes are mutated at a higher frequency than genes carrying only passenger mutations [1, 17-19]. Parmigiani et al. developed a software package (CancerMutationAnalysis, Bioconductor) to identify driver mutations at the gene level; this software can calculate the passenger mutation rate. Carter et al. proposed a novel method for estimating the passenger mutation rate from three aspects: the number of nonsilent somatic single-base variants, the reduction of known driver mutations, and the frequency of the nonsilent somatic single-base variants (24 categories) [20]. Zhang et al. [17] computed the Mahalanobis distance of a gene from known cancer genes with four features: gene size, background nonsynonymous mutation rate, somatically acquired events, and the rate of these events in carriers. MutSig tools are also used to compute a score for each gene in the tumor. The second category uses features related to missense mutations to train a classifier with a learning algorithm; the resulting model is then applied to a test dataset. To date, several groups have proposed methods for distinguishing driver mutations from the many passenger mutations [15, 20-30]. These methods differ in the features and algorithms used for prediction, especially in their feature spaces.
Recently, Tan et al. [30] proposed a novel feature extraction scheme for driver mutation identification. They selected 126 features comprising physicochemical properties of amino acids (AARC) and scoring mutation matrices (SSM) from the AAIndex database [31], 2-gram features from the sequence (PSS), and annotated features (AF) from other databases. They then used the DX score to rank the 126 features and finally selected 70 features according to the accuracy of a support vector machine (SVM). This work is interesting and shows how to select efficient features for recognition.
In this study, inspired by Tan et al.'s method, we developed a novel method to predict driver mutations among candidate passenger mutations using DX-RF (the rotation forest (RF) algorithm combined with the DX method). In order to utilize more features, we also adopt the four kinds of features used by Tan et al. A novel scoring system (DX) was employed to evaluate the performance of each feature in identifying driver mutations. Our experiments achieved 87.97% average accuracy with the DX-RF method using the 11 top-ranked features combined. We also tested the classifier on other datasets and obtained higher accuracy than the other methods.

Data Collection.
The driver-passenger mutations dataset was retrieved from Tan et al. [30]. This dataset is composed of cancer-associated variants (driver mutations), collected from the COSMIC database, and neutral polymorphisms (passenger mutations), collected from the Swiss-Prot Variant Pages (humsavar.txt) using only records of type "Polymorphism." Based on this dataset, a training dataset with 4193 driver mutations and 4193 passenger mutations was constructed. The test dataset contains three disjoint driver mutation sets (EGFR, TP53, and Cosmic2plus) and a passenger mutation set collected from humsavar.txt after removing the mutations that appear in the training dataset. In this study, driver mutations are labeled as the positive class and passenger mutations as the negative class.

Feature Extraction.
The candidate features were collected from Tan et al.'s paper and fall into four types: AARC features (physicochemical properties); SSM features (scoring mutation matrix, from AAIndex); PSS features, which were produced according to Wu et al. [32] and Wang et al. [33] using the 2-gram and 6-letter methods; and annotated features, which were collected from several databases including UniProt KnowledgeBase, Swiss-Prot Variant Page, and the COSMIC database. Among the annotated features there are 14 binary categorical features, which may be unavailable for some of the mutations concerned.

Feature Coding.
Machine learning-based techniques such as support vector machine (SVM) and rotation forest (RF) need a fixed number of inputs for training. So, before training, the features must be converted to numeric values. The AARC feature value for a missense mutation is defined for each sample from its wild-type residue, its mutant residue, and the corresponding AARC feature value. The SSM feature value for a missense mutation is assigned as the element of the scoring mutation matrix indexed by the wild-type and mutant residues. The 2-gram method extracts every pair of consecutive amino acid residues in a protein sequence and counts the number of occurrences of each residue pair; it produces a 400-dimension vector for a protein sequence. DX is used to calculate the score of each feature, and the 30 top-ranked features are selected for prediction. The 6-letter method classifies the 20 amino acids into six groups according to physicochemical properties [34]. Table 1 shows the six groups.
The 6-letter method first rewrites a protein sequence in the 6-letter group alphabet and then encodes the new sequence using the 2-gram method, which yields 36 additional pair counts. Thus, the PSS feature value for a missense mutation is assigned as a 436-dimension vector (400 + 36).
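The 2-gram and 6-letter encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the six-group mapping in `SIX_GROUPS` is a commonly used exchange-group alphabet chosen here as an assumption (the paper's own grouping is the one in its Table 1), and the function names and the example sequence are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# ASSUMPTION: an illustrative six-group alphabet, not necessarily the
# grouping given in the paper's Table 1.
SIX_GROUPS = {
    "a": "HRK",
    "b": "DENQ",
    "c": "C",
    "d": "STPAG",
    "e": "MILV",
    "f": "FYW",
}
RESIDUE_TO_GROUP = {aa: g for g, members in SIX_GROUPS.items() for aa in members}


def two_gram_counts(sequence, alphabet):
    """Count every ordered pair of consecutive symbols drawn from `alphabet`."""
    counts = {a + b: 0 for a, b in product(alphabet, repeat=2)}
    for left, right in zip(sequence, sequence[1:]):
        pair = left + right
        if pair in counts:  # pairs with non-standard symbols are skipped
            counts[pair] += 1
    return counts


def pss_vector(sequence):
    """Return the 400 + 36 = 436 pair counts for one protein sequence."""
    full = two_gram_counts(sequence, AMINO_ACIDS)
    # Rewrite the sequence in the six-letter alphabet, then count pairs again.
    reduced_seq = "".join(RESIDUE_TO_GROUP.get(aa, "?") for aa in sequence)
    reduced = two_gram_counts(reduced_seq, "abcdef")
    return list(full.values()) + list(reduced.values())


# Hypothetical toy sequence, used only to show the shape of the output.
vec = pss_vector("MKVLAAGIVK")
print(len(vec))
```

A sequence of length n contributes n - 1 pairs to each of the two count tables, so long proteins produce larger raw counts; whether and how the counts are normalized is not stated in the text and is left out of this sketch.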
To reduce information loss, a linear correlation coefficient (LCC) is computed over the 436-dimension vector from each 2-gram feature value and the mean value of the corresponding 2-gram feature. Finally, we obtained 31 PSS features. The annotated features were collected from different databases including UniProt KnowledgeBase, Swiss-Prot, and COSMIC; 29 of them were used in this study.

Feature Selection Method.
In many pattern recognition applications, feature selection is very important. Here we use two methods to solve this problem: the DX score [33] and minimum redundancy maximal relevance (mRMR) [35]. The authors of the DX method used it to pick out the most relevant 2-gram features. Intuitively, the DX score assesses a feature's discriminative power in the general case. According to [36], the DX score can be defined as follows:
$$\mathrm{DX} = \frac{(\mathrm{average}_{\mathrm{pos}} - \mathrm{average}_{\mathrm{neg}})^2}{\mathrm{var}_{\mathrm{pos}} + \mathrm{var}_{\mathrm{neg}}}, \tag{3}$$
where $\mathrm{average}_{\mathrm{pos}}$ denotes the mean value of the feature over the positive samples (driver mutations) of the training dataset and $\mathrm{average}_{\mathrm{neg}}$ denotes the mean value of the feature over the negative samples (passenger mutations) of the training dataset; $\mathrm{var}_{\mathrm{pos}}$ and $\mathrm{var}_{\mathrm{neg}}$ denote the variance of the feature over the positive and negative samples of the training dataset, respectively.
The mRMR method selects good features according to the maximal statistical dependency criterion based on mutual information. A smaller index of a feature denotes that it has a better trade-off between maximum relevance to the target and minimum redundancy to the other features. The mutual information of random variables $x$ and $y$ is defined as follows:
$$I(x, y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \, dx \, dy. \tag{4}$$
Here $x$ and $y$ are vectors, and $p(x, y)$, $p(x)$, and $p(y)$ are probability density functions. Max-Relevance finds features satisfying (5), and the Min-Redundancy condition (6) is added to select mutually exclusive features:
$$\max D(S, c), \quad D = \frac{1}{|S|} \sum_{x_i \in S} I(x_i, c), \tag{5}$$
$$\min R(S), \quad R = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i, x_j), \tag{6}$$
where $x_i$ and $x_j$ denote features, $S$ denotes the whole feature set, and $c$ denotes the target class. The mRMR feature evaluation uses an incremental search for optimal features and loops $N$ rounds when given a feature set with $N$ features. After the mRMR feature evaluation, a ranked feature set is obtained.
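As an illustration of how a DX-style ranking can be computed, the sketch below scores each feature as the squared difference of the class means divided by the sum of the class variances. This form is an assumption based on the quantities named above (mean and variance of the feature in the positive and negative samples); the use of the population variance, the function names, and the toy values are likewise assumptions and are not taken from the paper.

```python
from statistics import mean, pvariance


def dx_score(pos_values, neg_values):
    """Squared mean difference divided by the summed class variances.

    ASSUMPTION: population variance is used; the text does not say whether
    the sample or the population variance is intended.
    """
    diff = mean(pos_values) - mean(neg_values)
    spread = pvariance(pos_values) + pvariance(neg_values)
    if spread == 0:
        # Degenerate case: constant feature in both classes.
        return float("inf") if diff != 0 else 0.0
    return diff * diff / spread


def rank_features(pos_rows, neg_rows):
    """Return feature indices sorted by decreasing DX score, plus the scores."""
    n_features = len(pos_rows[0])
    scores = [
        dx_score([row[j] for row in pos_rows], [row[j] for row in neg_rows])
        for j in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


# Hypothetical toy data: 3 positive and 3 negative samples, 2 features each.
pos_rows = [[1.0, 0.2], [1.2, 0.9], [0.8, 0.4]]
neg_rows = [[0.1, 0.3], [0.3, 0.8], [0.2, 0.5]]
order, scores = rank_features(pos_rows, neg_rows)
print(order, scores)
```

In this toy example the first feature separates the two classes clearly while the second does not, so the first feature is ranked on top. Each feature is scored independently, which is why such a ranking is cheap to compute but, unlike mRMR, does not account for redundancy between features.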

Model Construction.
The classification model for identifying driver mutations was based on rotation forest (RF) [37], and the software Weka [38] was used to implement the classification. The final training dataset comprises 4193 driver mutations and 4193 passenger mutations. In statistical prediction, the subsampling test and the jackknife test are two commonly used cross-validation methods. The jackknife test is considered more objective and has been widely adopted by many researchers to validate the power of various classifiers, but it takes much longer to perform. Therefore, considering the large number of samples used in this study, 5-fold cross-validation is used to evaluate the importance of the features on the training dataset. This process is repeated five times, and the average accuracy is used to evaluate the features.
An RF model was constructed on the training dataset with default parameters. To find good features for identifying driver mutations, 126 training datasets were built following the incremental feature selection (IFS) [39,40] approach, based on the ranked features obtained by the DX method and the mRMR method, respectively. The 126 training datasets were then evaluated with 5-fold cross-validation, and this process was repeated five times. Thus, 126 × 5 × 2 models were generated. Five measures, precision, recall, accuracy, F-measure, and Matthews correlation coefficient (MCC), were employed to measure the performance of the combined features on the training dataset:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$
$$F\text{-measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad \mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$
where TP denotes true driver mutations, TN denotes true passenger mutations, FP denotes false driver mutations, and FN denotes false passenger mutations.
The cross-validation results for the two classifiers can be seen in Supplemental Materials S3. This feature selection process is illustrated in Figure 1, which shows that the DX-RF predictor achieved its highest accuracy, 87.97%, with the 11 top-ranked features, and the mRMR-RF predictor reached a similar highest accuracy, 88.18%, with the 76 top-ranked features. In order to compare with Tan

Feature Analysis.
We investigated the distribution of the optimal features selected by DX-RF, mRMR-RF, and Tan et al.'s method. As shown in Figure 2, for DX-RF, mRMR-RF, and Tan et al., respectively, 0, 6, and 1 features were derived from amino acid residue change features (AARC); 0, 12, and 40 from substitution scoring matrix features (SSM); 7, 31, and 21 from protein sequence-specific features (PSS); and 4, 27, and 8 from annotated features (AF).

Comparison of the Prediction Performance on the Train Dataset.
After the optimal feature subsets were determined, an experiment was performed to evaluate whether the DX-RF method is better than the other methods. Using the features ranked by DX and by mRMR, the 5-fold cross-validation experiments on the training dataset were performed again, and this process was repeated 10 times. Table 2 shows the average results of the DX-RF and mRMR-RF methods. As Table 2 shows, the performance of the DX-RF method is almost the same as that of the mRMR-RF method. However, the DX-RF method needs only 11 features, while the mRMR-RF method needs 76 features.
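How a feature count such as 11 or 76 can be arrived at from a ranked list may be sketched as an incremental feature selection loop: evaluate the top-1, top-2, ..., top-N prefixes of the ranking and keep the prefix with the best score. The sketch below is an assumption about the general procedure, not the authors' code; `evaluate` stands in for "mean accuracy of repeated 5-fold cross-validation with a rotation forest", and the toy scoring function and feature names are hypothetical.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Score every prefix of the ranking and return the best-scoring one.

    `evaluate` is any callable that maps a list of feature names to a score
    (higher is better). Ties keep the smaller subset.
    """
    best_k, best_score = 0, float("-inf")
    history = []
    for k in range(1, len(ranked_features) + 1):
        score = evaluate(ranked_features[:k])
        history.append((k, score))
        if score > best_score:
            best_k, best_score = k, score
    return ranked_features[:best_k], best_score, history


# Hypothetical stand-in for cross-validated accuracy: two features are
# informative, and every extra feature costs a small penalty.
INFORMATIVE = {"f3", "f7"}


def toy_evaluate(subset):
    return len(INFORMATIVE.intersection(subset)) - 0.01 * len(subset)


ranked = ["f3", "f7", "f1", "f9"]
best_subset, best_score, history = incremental_feature_selection(ranked, toy_evaluate)
print(best_subset, best_score)
```

With N ranked features the loop trains and evaluates N candidate subsets, which matches the 126 training datasets built per ranking method in this study. The quality of the result depends on the ranking: a ranking that places informative features early reaches its best score with a short prefix.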

Comparison of the Prediction Performance with Different Methods on the Independent Set.
To determine whether the set of 11 top-ranked features contributes to the prediction of driver mutations, we compared DX-RF with Tan et al.'s method on the independent sets and constructed four classifiers, called DX-SVMLight, DX-LibSVM, DX-RF, and mRMR-RF. False positives should be avoided. In the experiment, DX-SVMLight (651 false driver mutations), DX-LibSVM (895 false driver mutations), and mRMR-RF (620 false driver mutations) all produced many false positives (FP), whereas the DX-RF method produced only 597 false driver mutations. Table 4 gives detailed results for the four classifiers on the three datasets. From Tables 3 and 4, we conclude that DX-RF is more reliable than DX-SVMLight, DX-LibSVM, and mRMR-RF according to the results on the three independent sets.
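Since the comparison above rests on confusion counts such as the number of false driver mutations, it may help to see how the five measures used in this study (precision, recall, accuracy, F-measure, and MCC) follow from TP, TN, FP, and FN. The sketch below uses the standard textbook definitions; the function name and the counts in the example are made up for illustration and are not results from the paper.

```python
from math import sqrt


def classification_metrics(tp, tn, fp, fn):
    """Compute the five measures from the confusion counts.

    Undefined ratios (zero denominators) are reported as 0.0 by convention.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_measure": f_measure,
        "mcc": mcc,
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Lowering FP raises precision without changing recall, which is why a classifier with fewer false driver mutations can be preferred even when its accuracy is close to that of its competitors.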

Conclusion
In this study, we propose a novel feature extraction method for identifying driver mutations. The model was constructed from the optimal feature set with rotation forest. The 5-fold cross-validation experiments on the training dataset achieved high prediction performance, with 93.9% precision and 81.35% recall, when the 11 top-ranked features were used. On the independent sets of missense mutations, DX-RF achieved accuracies of 89.28%, 87.18%, and 85.53% on TP53, EGFR, and Cosmic2plus, respectively, higher than those of the other methods. Although our method achieved the best performance, further improvements are both necessary and possible. In the future, on the one hand, we will explore more correlated features to describe the difference between driver mutations and passenger mutations. On the other hand, a new, faster algorithm will be considered for driver mutation prediction.