A Low Computational Complexity Scheme for the Prediction of Intrinsically Disordered Protein Regions

We employ Rayleigh entropy maximization to develop a novel IDP scheme which requires computing only five features for each residue of a protein sequence: the Shannon entropy, the topological entropy, and the weighted average values of three propensities. Furthermore, our scheme is a linear classification method, so its decision curves are simpler and more robust and require fewer learning samples to compute. Simulation results comparing our scheme with several existing schemes demonstrate its effectiveness.


Introduction
Accurately identifying intrinsically disordered proteins (IDPs), which contain at least one region that lacks a unique 3D structure and instead exists as a dynamic conformational ensemble [1,2], is vital for more effective drug design, better protein expression, and functional annotation. Some of these intrinsically disordered proteins are known to be involved in some of the most important regulatory functions in the cell [3] and have a great impact on diseases such as Alzheimer's disease, Parkinson's disease, and certain types of cancer [4]. It is essential to investigate IDPs computationally from the amino acid sequence of a protein [4], because disordered protein regions are often difficult to purify and crystallize [5], which creates great problems for identifying them with experimental approaches. Furthermore, experimental approaches for identifying disordered protein regions are usually both expensive and time-consuming [4].
Many IDP schemes have been proposed in the past decades, which can be roughly classified into two categories. (1) The first category exploits the amino acid propensity scales of protein sequences; examples include FoldIndex [6], GlobPlot [7], IUPred [8], and FoldUnfold [9]. These schemes use the propensity scales to compute parameters such as the ratio of the mean net charge to the mean hydropathy, the relative propensity of an amino acid residue, and the interresidue contacts. These IDP schemes are simple but in general not accurate enough [10].
(2) The second category employs machine learning techniques. Examples include PONDRs [11], RONN [12], DISOPRED2 [13], BVDEA [4], and DisPSSMP [14]. Many of these schemes are based on artificial neural networks or support vector machines (SVMs), which in general require computing a large number of features of a given protein sequence. Computing these features can be expensive and time-consuming. More recently, MetaPrDOS [15] and the Meta-Disorder predictor [16], which combine several different predictors and trade off among them to yield an optimal decision, have also been reported.
In this paper, we employ Rayleigh entropy maximization to develop a novel IDP scheme which requires computing only five features for each residue of a protein sequence: the Shannon entropy, the topological entropy, and the weighted average values of three propensities. In contrast with most existing IDP schemes, which need to compute no fewer than 30 features for each residue of a protein sequence, our scheme achieves a similar performance at a greatly reduced computational complexity. Furthermore, because our scheme is based on a linear classification method, it has simpler decision curves which are more robust and require fewer learning samples to compute. Our scheme is first trained and tested on the dataset DIS803 with 10-fold cross-validation. The dataset DIS803 comprises 803 protein sequences with 1254 disordered regions and 1343 ordered regions, which include 92423 disordered and 315503 ordered residues. As a comparison, we run our scheme together with some existing schemes, such as PONDR [11], FoldIndex [6], DISOPRED2 [13], RONN [12], and DISPRO [17], on the datasets PU159 and R80, which comprise 239 protein sequences with 183 disordered regions and 231 ordered regions. These datasets contain 18111 disordered and 46477 ordered residues. The simulation results suggest that only our scheme, BVDEA [4], and DisPSSMP [14] have PE (probability excess) values exceeding 0.5 for both datasets PU159 and R80. Our scheme is at least as accurate as BVDEA [4] and DisPSSMP [14] and requires computing only 5 features for each residue of a protein sequence, while the other two need to compute 188 and 120 features for each residue, respectively. In addition, both BVDEA [4] and DisPSSMP [14] are based on nonlinear classification methods, which require computing complex decision curves that are in general less robust.

A Brief Review of Some Notations
In a protein sequence, the complexity describes how many different ways the sequence can be rearranged [18]. It has been demonstrated that low complexity regions are more likely to be disordered than ordered [12]. Shannon entropy and topological entropy are two parameters that measure the complexity of a sequence. To begin with, let us first recall some notations.
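Given a protein sequence $w = \{w(j),\ 1 \le j \le N\}$ of length $N$, the Shannon entropy is

$$H(w) = -\sum_{i=1}^{20} f_i \log_2 f_i, \tag{1}$$

where $f_i$ for $1 \le i \le 20$ is defined as

$$f_i = \frac{1}{N} \sum_{j=1}^{N} \delta\big(w(j), a_i\big), \tag{2}$$

with $\Phi = \{a_1, a_2, \ldots, a_{20}\}$ being an ordered set of 20 amino acid symbols, $\delta(a, b)$ equal to 1 if $a = b$ and 0 otherwise, and the convention $0 \log_2 0 = 0$.

The complexity function $p_w(n)$ representing the total number of different $n$-length subwords of $w$ ($1 \le n \le N$) is defined as [19]

$$p_w(n) = \big|\{u : u \text{ is a subword of } w,\ |u| = n\}\big|, \tag{3}$$

where a subword $u$ of length $n$ is any run of $n$ consecutive symbols of $w$, $|u|$ denotes the length of $u$, and the outer bars denote the number of elements of the set. For example, a sequence $w$ that has five distinct subwords of length 2 yields $p_w(2) = 5$.

Given a finite protein sequence $w$ of length $N$, let $n$ be the unique integer satisfying

$$20^{n} + n - 1 \le N < 20^{n+1} + n. \tag{4}$$

The topological entropy of $w$ is then

$$H_{\mathrm{top}}(w) = \frac{\log_{20} p_{w_1^{20^n+n-1}}(n)}{n}, \tag{5}$$

where $w_1^{20^n+n-1}$ denotes the first $20^n + n - 1$ symbols of $w$ and $p_{w_1^{20^n+n-1}}(\cdot)$ is defined in (3). Thus, we have $H_{\mathrm{top}}(w) = 1$ when the subwords of $w_1^{20^n+n-1}$ run over all the possible subwords of length $n$. On the other hand, when $w_1^{20^n+n-1}$ is a repetition sequence comprising a single letter, $H_{\mathrm{top}}(w) = 0$. Similar to [19], we also compute the average of the topological entropy of $w$ as

$$\bar{H}_{\mathrm{top}}(w) = \frac{1}{N - (20^n + n - 1) + 1} \sum_{i=1}^{N - (20^n + n - 1) + 1} H_{\mathrm{top}}\big(w_i^{\,i + 20^n + n - 2}\big), \tag{6}$$

where $w_i^{\,k}$ denotes the subword of $w$ running from the $i$th to the $k$th symbol.

The Rayleigh entropy maximization [20] of

$$X = [x_1, x_2, \ldots, x_{N_s}], \tag{7}$$

where $N_s$ represents the total number of all the samples and $x_i$ ($1 \le i \le N_s$) represents the features of the $i$th sample, is to compute the projection direction $W$ which maximizes the cost function

$$J(W) = \frac{W^{T} S_b W}{W^{T} S_w W}. \tag{8}$$

$S_w$ and $S_b$ in (8) are, respectively, defined as

$$S_w = \sum_{c=1}^{2} \sum_{x \in X_c} (x - m_c)(x - m_c)^{T}, \tag{9}$$

$$S_b = (m_1 - m_2)(m_1 - m_2)^{T}, \tag{10}$$

$$m_c = \frac{1}{N_c} \sum_{x \in X_c} x, \tag{11}$$

where $N_c$ is the number of samples in the $c$th class and $X_c$ is the set of samples in the $c$th class.

Using the Lagrange method, the optimal $W$ and the corresponding optimal projection $Y$ of $X$ on the direction of $W$ are given as

$$W = S_w^{-1}(m_1 - m_2), \tag{12}$$

$$Y = W^{T} X. \tag{13}$$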
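The three quantities recalled in this section can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names and the `alphabet_size` parameter are hypothetical, the logarithm base 2 for the Shannon entropy is an assumption, and the binary strings in the usage note are made-up inputs.

```python
import math
from collections import Counter


def shannon_entropy(w):
    """Shannon entropy (base 2, an assumption) of the symbol frequencies in w."""
    n = len(w)
    counts = Counter(w)
    # Symbols that never occur contribute nothing (0 * log 0 = 0 convention).
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def subword_complexity(w, n):
    """p_w(n): number of distinct subwords of length n contained in w."""
    return len({w[i:i + n] for i in range(len(w) - n + 1)})


def topological_entropy(w, alphabet_size=20):
    """Topological entropy of a finite sequence w over k = alphabet_size symbols.

    n is the unique integer with k**n + n - 1 <= len(w) < k**(n + 1) + n,
    and only the first k**n + n - 1 symbols of w are used.
    """
    k = alphabet_size
    if len(w) < k:
        raise ValueError("sequence must contain at least alphabet_size symbols")
    n = 1
    while k ** (n + 1) + n <= len(w):
        n += 1
    prefix = w[:k ** n + n - 1]
    return math.log(subword_complexity(prefix, n), k) / n
```

For instance, with a two-letter alphabet, `topological_entropy("00110", alphabet_size=2)` uses n = 2 and sees all four subwords of length 2, while a constant string such as `"00000"` sees only one.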

The Computation of the Optimal Projection Direction
In this section, we compute the Shannon entropy and the topological entropy of the dataset DIS803 from DisProt [21] (http://www.disprot.org/). Then, choosing the Remark 465, Deleage/Roux, and Bfactor(2STD) propensities provided by the GlobPlot NAR paper [7] (http://globplot.embl.de/html/propensities.html), we compute the weighted average values of these propensities for the dataset DIS803. Finally, utilizing the computed Shannon entropy, topological entropy, and the weighted average values of the three propensities of the dataset DIS803, we derive the optimum projection direction $W$ defined in (8). The procedure proceeds as follows.
(1) Let $w$ be a segment of a protein sequence. We choose a window of length $N$ to extract $N$ consecutive residues from the sequence, so the length of $w$ is $N$.
Using (1), we can compute the Shannon entropy of $w$. To compute the topological entropy of $w$, we first map $w$ to a binary sequence as follows. We map the bulky hydrophobic (I, L, V) as well as the aromatic (F, W, Y) amino acid residues defined in [10] to 1 and the rest of the residues to 0. We use $\bar{w}$ to represent the mapped sequence of $w$. Table 1 lists all the amino acid residues and their corresponding mapping values.
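The residue mapping described above is simple enough to state as code. The sketch below assumes one-letter amino acid codes; the names `HYDROPHOBIC_OR_AROMATIC` and `map_to_binary` are hypothetical, and the input sequence in the usage note is a made-up example.

```python
# Bulky hydrophobic (I, L, V) and aromatic (F, W, Y) residues map to 1.
HYDROPHOBIC_OR_AROMATIC = frozenset("ILVFWY")


def map_to_binary(w):
    """Map a one-letter amino acid string to a string of '1' and '0'."""
    return "".join(
        "1" if residue in HYDROPHOBIC_OR_AROMATIC else "0"
        for residue in w.upper()
    )
```

For example, `map_to_binary("MILKFAWY")` gives `"01101011"`.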
(2) For this protein sequence $w$ of length $N$, we also compute the weighted average values of the Remark 465, Deleage/Roux, and Bfactor(2STD) propensities defined in the GlobPlot NAR paper [7]:

$$M_k(w) = \frac{1}{N} \sum_{j=1}^{N} \ln(j + 1)\, w_k(j), \qquad k = 1, 2, 3, \tag{18}$$

where $w_k(j)$ with $1 \le j \le N$ represents the value of the $k$th propensity of the $j$th residue of $w$. We use the $k$th propensity of $w$ with $k = 1, 2, 3$ to denote the Remark 465, Deleage/Roux, and Bfactor(2STD) propensities, respectively. The weight $\ln(j + 1)$ in (18) is identical to that of the sum function of the GlobPlot NAR paper [7].
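As a rough illustration of step (2), the sketch below computes a log-weighted average of per-residue propensity values. It assumes the weighted average is the sum of ln(j + 1) times the j-th propensity value, divided by the window length N. That normalization is an assumption, as are the function name and the input values, which are placeholders and not entries of the GlobPlot scales.

```python
import math


def weighted_propensity(values):
    """Log-weighted average of per-residue propensity values.

    values[j - 1] is the propensity of the j-th residue in the window;
    the j-th residue carries the weight ln(j + 1). Dividing by the
    window length is an assumed normalization.
    """
    n = len(values)
    total = sum(math.log(j + 1) * v for j, v in enumerate(values, start=1))
    return total / n
```

Because the classifier that follows is linear, a different constant normalization for a fixed window length would only rescale this feature.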
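(3) For a general protein sequence $s$ of length $L$, we use a sliding window of length $N$ ($N < L$) to extract $N$ consecutive residues $w_i = s(i) \cdots s(i + N - 1)$, $1 \le i \le L - N + 1$. For this sliced $w_i$, we compute the Shannon entropy $H(w_i)$, the topological entropy $H_{\mathrm{top}}(w_i)$, and $M_k(w_i)$ for $k = 1, 2, 3$ defined in (18). Define a $5 \times 1$ vector $v_i$ to be

$$v_i = \big[H(w_i),\ H_{\mathrm{top}}(w_i),\ M_1(w_i),\ M_2(w_i),\ M_3(w_i)\big]^{T}. \tag{19}$$

Thus, we can compute the feature matrix of the protein sequence $s$ of length $L$ as

$$F = [x_1, x_2, \ldots, x_L], \tag{20}$$

where the vector $x_j$ ($1 \le j \le L$) is the average of $v_i$ over all the windows that contain the $j$th residue,

$$x_j = \begin{cases} \dfrac{1}{j} \displaystyle\sum_{i=1}^{j} v_i, & 1 \le j < N, \\[2ex] \dfrac{1}{N} \displaystyle\sum_{i=j-N+1}^{j} v_i, & N \le j \le L - N + 1, \\[2ex] \dfrac{1}{L - j + 1} \displaystyle\sum_{i=j-N+1}^{L-N+1} v_i, & L - N + 1 < j \le L, \end{cases} \tag{21}$$

with $v_i$ for $1 \le i \le L - N + 1$ being defined in (19).
For the protein sequence $s$ of (15), we choose the window size $N = 20$ and compute the features of the 10th and 30th residues of $s$. $x_{10}$ and $x_{30}$ are

$$x_{10} = \frac{1}{10} \sum_{i=1}^{10} v_i, \qquad x_{30} = \frac{1}{20} \sum_{i=11}^{30} v_i, \tag{22}$$

where $v_i$ is defined in (19).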
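The sliding-window step can be sketched as follows. The sketch assumes that the per-residue feature vector is the average of the window vectors over every window that contains that residue. The function names are hypothetical, and `feature_fn` stands in for whatever computes the five features of one window.

```python
def window_features(seq, n, feature_fn):
    """Apply feature_fn to every window of n consecutive residues of seq."""
    return [feature_fn(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def residue_features(window_vecs, seq_len, n):
    """Average the window vectors over all windows containing each residue.

    Residue j (1-based) lies in windows max(1, j - n + 1) .. min(j, L - n + 1),
    where L = seq_len; this averaging rule is an assumption of the sketch.
    """
    last_window = seq_len - n + 1
    features = []
    for j in range(1, seq_len + 1):
        lo = max(1, j - n + 1)
        hi = min(j, last_window)
        vecs = window_vecs[lo - 1:hi]
        dim = len(vecs[0])
        features.append([sum(v[d] for v in vecs) / len(vecs) for d in range(dim)])
    return features
```

With a window of 20, the 10th residue is covered by windows 1 to 10 and the 30th residue by windows 11 to 30, provided the sequence is long enough for window 30 to exist.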
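(4) Utilizing 10-fold cross-validation [22], we randomly divide the dataset DIS803 into ten subsets of approximately equal size. The protocol uses nine subsets as the training dataset to build a model and the remaining subset for testing. Using the training dataset of the 10-fold cross-validation [22], we can compute the feature matrix

$$X = [F_1, F_2, \ldots, F_{N_p}], \tag{23}$$

where $N_p$ is the total number of the protein sequences of the training dataset. $F_k$ defined in (20) with $1 \le k \le N_p$ is the feature matrix of the $k$th protein sequence, whose length is $L_k$. We divide all the residues of the training dataset obtained from DIS803 through the 10-fold cross-validation described above into two disjoint subsets: one comprising all the disordered residues and the other all the ordered residues of the training dataset. Let $N_{\mathrm{dis}}$ and $N_{\mathrm{ord}}$, respectively, denote the number of all the disordered and all the ordered residues of the training dataset. $X_{\mathrm{dis}}$ and $X_{\mathrm{ord}}$, respectively, represent the feature matrices defined in (23) corresponding to all the disordered and all the ordered residues of the training dataset. From (11), it follows that

$$m_{\mathrm{dis}} = \frac{1}{N_{\mathrm{dis}}} \sum_{j=1}^{N_{\mathrm{dis}}} X_{\mathrm{dis}}^{j}, \qquad m_{\mathrm{ord}} = \frac{1}{N_{\mathrm{ord}}} \sum_{j=1}^{N_{\mathrm{ord}}} X_{\mathrm{ord}}^{j}, \tag{24}$$

where $X_{\mathrm{dis}}^{j}$ and $X_{\mathrm{ord}}^{j}$ represent the $j$th column vector in $X_{\mathrm{dis}}$ and $X_{\mathrm{ord}}$, respectively. Using $m_{\mathrm{dis}}$ and $m_{\mathrm{ord}}$, $S_w$ in (9) can be calculated as

$$S_w = \sum_{j=1}^{N_{\mathrm{dis}}} \big(X_{\mathrm{dis}}^{j} - m_{\mathrm{dis}}\big)\big(X_{\mathrm{dis}}^{j} - m_{\mathrm{dis}}\big)^{T} + \sum_{j=1}^{N_{\mathrm{ord}}} \big(X_{\mathrm{ord}}^{j} - m_{\mathrm{ord}}\big)\big(X_{\mathrm{ord}}^{j} - m_{\mathrm{ord}}\big)^{T}. \tag{25}$$

From (12), the projection direction is

$$W = S_w^{-1}\big(m_{\mathrm{dis}} - m_{\mathrm{ord}}\big). \tag{26}$$

The projection $Y$ can be computed by (13). Finally, using a linear search in $Y$, we can obtain the classification threshold.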
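The training step can be sketched in pure Python as a two-class linear discriminant: class means, the within-class scatter matrix, the direction obtained by solving a linear system, and a linear search for the threshold. All function names are hypothetical. The text does not state which criterion the threshold search optimizes, so maximizing sensitivity plus specificity minus one is an assumption of this sketch, as is the rule that a residue is called disordered when its projection is at or above the threshold.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equally long feature vectors."""
    n = len(rows)
    return [sum(r[d] for r in rows) / n for d in range(len(rows[0]))]


def scatter(rows, m):
    """Sum of outer products (x - m)(x - m)^T over all rows."""
    dim = len(m)
    s = [[0.0] * dim for _ in range(dim)]
    for r in rows:
        diff = [r[d] - m[d] for d in range(dim)]
        for a in range(dim):
            for b in range(dim):
                s[a][b] += diff[a] * diff[b]
    return s


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[piv][col]) < 1e-12:
            raise ValueError("within-class scatter matrix is singular")
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fisher_direction(x_dis, x_ord):
    """W = S_w^{-1} (m_dis - m_ord) for two classes of feature vectors."""
    m_dis, m_ord = mean_vector(x_dis), mean_vector(x_ord)
    s_dis, s_ord = scatter(x_dis, m_dis), scatter(x_ord, m_ord)
    s_w = [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(s_dis, s_ord)]
    return solve(s_w, [p - q for p, q in zip(m_dis, m_ord)])


def project(w, rows):
    """Project every feature vector onto the direction w."""
    return [sum(wi * xi for wi, xi in zip(w, r)) for r in rows]


def best_threshold(y_dis, y_ord):
    """Linear search over observed projections (assumed criterion: max PE)."""
    best_t, best_pe = None, -2.0
    for t in sorted(set(y_dis + y_ord)):
        sens = sum(y >= t for y in y_dis) / len(y_dis)
        spec = sum(y < t for y in y_ord) / len(y_ord)
        if sens + spec - 1 > best_pe:
            best_t, best_pe = t, sens + spec - 1
    return best_t, best_pe
```

Solving the linear system avoids forming the inverse of the scatter matrix explicitly, which is numerically safer and, for five features, costs almost nothing.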

The Simulation Results
We employ the Rayleigh entropy maximization shown in the previous sections to develop an IDP scheme which requires computing only five features for each residue of a protein sequence: the Shannon entropy, the topological entropy, and the weighted average values of three propensities. In contrast, most existing schemes, such as PONDR [11], DISOPRED2 [13], RONN [12], DISPRO [17], BVDEA [4], and DisPSSMP [14], require computing no fewer than 30 features for IDP identification. Furthermore, our scheme is based on a linear classification method, which requires fewer learning samples to compute its simple decision curves, and these are more robust.
To train and test our scheme, the sequences in the dataset DIS803 are randomly split into ten subsets of approximately equal size to conduct a 10-fold cross-validation. The dataset DIS803 comprises 803 protein sequences. The results of our scheme with different window sizes are shown in Table 2. We use Sens., Spec., PE, and MCC to abbreviate sensitivity, specificity, probability excess, and Matthews' correlation coefficient, respectively. In addition, the values of the probability excess and Matthews' correlation coefficient for different window sizes are shown in Figure 1. When the window size is larger than 35, the values level off. Thus, we present our results with a window size of 35 in subsequent simulations.
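The evaluation protocol can be sketched as a random 10-fold split of sequence indices together with the four reported measures. The function names and the seed are hypothetical, and the sketch assumes that disordered residues are the positive class. Probability excess is computed as sensitivity plus specificity minus one, and Matthews' correlation coefficient follows its standard confusion-matrix definition.

```python
import math
import random


def k_fold_splits(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) for a random k-fold split."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]


def metrics(tp, fn, tn, fp):
    """Sens., Spec., PE and MCC from residue-level confusion counts."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    pe = sens + spec - 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sens": sens, "spec": spec, "pe": pe, "mcc": mcc}
```

Splitting by sequence rather than by residue keeps all residues of one protein in the same fold, which matches the description of dividing the 803 sequences into ten subsets.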
We abbreviate our scheme as DIS-REM, after the classification method used. The simulation results listed in Tables 3 and 4 show that the IDP identification accuracy of our scheme is approximately the same as those of BVDEA [4] and DisPSSMP [14], whose performance exceeds that of the rest of the schemes mentioned above on the datasets PU159 and R80. Tables 3 and 4 suggest that only our scheme, BVDEA [4], and DisPSSMP [14] have PE (probability excess) values exceeding 0.5 for both datasets PU159 and R80. To achieve these PE values, our scheme requires computing only 5 features for each residue, while DisPSSMP [14] and BVDEA [4] require computing 188 and 120 features for each residue of a protein sequence, respectively. Furthermore, unlike the nonlinear classification methods of DisPSSMP [14] and BVDEA [4], which require computing complex decision curves, our scheme is based on Rayleigh entropy maximization, which is a linear classification method. Therefore, the decision curves of our scheme are simpler to compute, more robust, and require fewer learning samples than those of DisPSSMP [14] and BVDEA [4].

Conclusions
In this paper, we compute the Shannon entropy, the topological entropy, and the weighted average values of three propensities to develop a criterion based on Rayleigh entropy maximization for predicting the intrinsically disordered regions of a protein. Compared with several existing schemes, the identification accuracy of our scheme is at least as high as that of the best-performing schemes among them. In particular, in contrast with schemes that require computing no fewer than 30 features, our scheme relies on computing only five features.

Table 1: Bulky hydrophobic and aromatic amino acids.

Table 2: Performance on dataset DIS803 with different window sizes.

Table 3: Performance comparison with existing schemes on dataset PU159.

Table 4: Performance comparison with existing schemes on dataset R80.