Efficient Data Mining Algorithms for Screening Potential Proteins of Drug Target

The past few decades have witnessed both a boom in pharmacology and a persistent dilemma in drug development. Screening potential human drug target proteins from open-access databases with well-measured physical and chemical properties plays a crucial role in drug design and is a challenging but significant task. In this paper, we study the screening of potential drug target proteins (DTPs) from a carefully collected dataset containing 5376 unlabeled proteins and 517 known DTPs; our objective is to screen potential DTPs from the 5376 unlabeled proteins. We propose two strategies that assist the construction of a dataset of reliable nondrug target proteins (NDTPs) and then employ bagging of decision trees for the final prediction. These two-stage algorithms showed their effectiveness and superior performance on the testing set. Both algorithms maintained high recall ratios of DTPs, 93.5% and 97.4%, respectively. In one turn of experiments, the Strategy 1-based bagging of decision trees screened about 558 possible DTPs, while the Strategy 2-based algorithm predicted 1782 potential DTPs. The two strategy-based algorithms also agreed in their predictions, with approximately 442 potential DTPs in common. These selected DTPs provide reliable choices for further verification based on biomedical experiments.


Background
In the domains of biotechnology, pharmacology, and medicine development, the identification of drug targets aims to discover new candidate molecules that are active in the process of remedies with drugs. As noted in [1], the drug target is a broad concept ranging from molecular entities such as ribonucleic acids (RNAs), genes, and proteins to biological phenomena like phenotypes or pathways.
The history of drug development has confirmed that most failures in drug exploration can be attributed to the pursuit of inappropriate targets [2, 3]. It is widely acknowledged that identifying potential targets for intervention is the first and foremost step in a modern drug campaign [1, 4-7], which has attracted increasing attention from both academia and industry. Once a molecule is predicted as a drug target, the engineering of drug design would begin in clinical trials. Since such programs, involving huge investments from pharmaceutical corporations and governments, are time-consuming and labor-intensive, the choice of potential targets for experiments is crucial.
Our dataset falls into a special case in which a limited number of drug target proteins are known while the labels of the rest are uncertain, which complicates the screening of potential drug target proteins from the unlabeled ones. Prior information supporting our research is the low ratio of "druggable" genomes in humans, approximately 10% [8]. In light of this, the nondrug target proteins (NDTPs) would by inference dominate the unlabeled proteins. Our ultimate objective is to screen several reliable drug target proteins (DTPs) from the unlabeled; for more detailed information about our dataset, see Materials and Methods. Previous methodologies for the identification of drug target proteins (IDTPs) required some specific biological hypotheses, such as side-effect similarity [9], chemical structure, and genomic sequence information [10].
For a further review, refer to [4]. To overcome the limits on the reliability of such hypotheses and to explore a robust way to address the problem, we have developed a novel paradigm combining the biochemical characteristics of proteins with booming data mining techniques. Figure 1 shows the process of drug discovery using data mining techniques.
Inspired by a family of algorithms for positive and unlabeled learning, we transferred the existing knowledge into the domain of bioinformatics. A two-stage paradigm was adopted for the screening task, and the final results show the efficiency of our algorithms.

Data Collection and Preliminary Analysis
2.1.1. Data Collection. Proteins, as one of the main sources of drug targets, have long been a heated topic for researchers from various domains. Some of them interact with each other, forming the basis of signal transduction pathways and transcriptional regulatory networks. As the focus of our research, drug target proteins are those functional biomolecules addressed and controlled by some active compounds. In this paper, we collected proteins from the DrugBank Database (Version 3.0), in which 1604 proteins were annotated as drug targets [11]. Further data cleaning was imposed by removing the nonhuman proteins as well as those sequences with similarity larger than 20% using PISCES [12]. Because proteins are compounds of atoms and molecules, whether a protein can be a candidate drug target is frequently determined by factors like water solubility, hydrogen ion concentration (pH), trait of bases, and its structure. Though the interaction relations provide additional information for the screening, they are not entirely reliable. Other properties of proteins also originate, in essence, from their basic chemical or physical properties, so the properties selected in our research were basic chemical or physical properties of proteins. We followed the extracting process in [13]. Properties of significance for our task were then extracted, such as peptide cleavages [14], N-glycosylation [15], O-glycosylation [16], low complexity regions [17], transmembrane helices [18], and some other influential physical or chemical characteristics. These properties are important clues in deciding the biological activity of proteins. We made use of pepstats, an online tool from EMBOSS [19], to calculate statistics of the properties. In this article, we also call the unlabeled proteins uncertain NDTPs because of the prior information about the proportion of DTPs in the dataset. The uncertain NDTPs are those proteins for which we do not know whether any of them would be drug target candidates. Finally, a dataset of 517 known DTPs and 5376 unlabeled proteins was collected, and each property was normalized as
$$z = \frac{x - \mu}{\sigma},$$
where $z$ is the normalized value of some property $x$, $\mu$ is the mean of the population, and $\sigma$ is the standard deviation of the property.
After the preprocessing, we apply hypothesis tests to check whether the information of each property is beneficial for our screening task. More specifically, the two-sided Kolmogorov-Smirnov test was chosen, with the DTPs and the unlabeled proteins treated as two classes. Since the unlabeled were dominated by the NDTPs, it was reasonable to consider that the traits of the NDTPs can be well approximated by the distribution of the unlabeled, with some noise from the potential DTPs. We denote the list of properties in the following order: Ala, Cys, Asp, Glu, Phe, Gly, His, Ile, Lys, Leu, Met, Asn, Pro, Gln, Arg, Ser, Thr, Val, Trp, Tyr, Tiny, Small, Aliphatic, Aromatic, Nonpolar, Polar, Charged, Basic, Acidic, Hydrophobicity, SignalP, LowComplexityRegions, Ogly S, Ogly T, Ngly, and Trans Helices. All of these have been elaborated in the former work [13], together with the detailed process of property extraction. The final results in Table 1 show significant differences between the two classes, suggesting that almost all of these properties in our dataset are discriminating; the effectiveness of the properties further supports the following experiments.
Another factor that affects the predicting performance is the correlation between the properties. After computing the correlation values, the correlation matrix is visualized in Figure 3. In the figure, the labels on both axes follow the order of the list of properties above, from top to bottom and from left to right. As shown in the figure, the properties are only weakly correlated with each other, indicating little information redundancy in the properties. Up to now, the task appears learnable since the properties are quite informative. It must be emphasized that we only make use of continuous properties in our experiments to relieve the curse of dimensionality that comes from the nominal properties. Further experimental results would confirm our induction.
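The preprocessing and screening of properties described above can be illustrated with a minimal sketch. It assumes the population standard deviation in the normalization and computes only the two-sample Kolmogorov-Smirnov statistic (not its p-value); the function names `zscore` and `ks_statistic` are hypothetical and are not taken from the paper.

```python
import math
from bisect import bisect_right

def zscore(values):
    """Normalize one property: z = (x - mean) / std.

    Assumption: population standard deviation; the property is not constant.
    """
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    return [(v - mu) / sigma for v in values]

def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:  # the largest gap is attained at an observed value
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

Here `sample_a` would hold the values of one property for the known DTPs and `sample_b` the values for the unlabeled proteins. In practice a library routine such as `scipy.stats.ks_2samp` also returns the p-value needed for the significance decision.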

Two-Stage Methodologies.
The identification itself is the arduous part of the task. It is a one-class classification problem, which can also be viewed as a type of transductive learning [20]. To establish a classifier, some negative samples, namely NDTPs, are necessary. Here, we employ anomaly detection techniques [21] to convert the one-class classification problem into a general binary classification problem, so the task is addressed in a two-stage paradigm. Specifically, the first stage screens some reliable negative samples to form the training dataset for binary classification, and the second stage constructs a classifier from the obtained dataset. The flowchart in Figure 4 illustrates our framework in detail.

Strategies in the First Stage.
Constructing the negative set from the collected unlabeled proteins is a nontrivial task; in some sense, it amounts to screening some reliable NDTPs. Though prior knowledge indicates that NDTPs occupy a large share of the unlabeled, discriminating criteria between the two classes are hard to formulate. Here, statistical analyses with proper techniques are employed for the fine extraction of NDTPs, and we devise two strategies for choosing reliable NDTPs from the perspective of statistical anomaly detection. Both of them mine the inner discrimination between the distributions of the DTPs and the uncertain NDTPs.
Figure 4: The flowchart of our framework.
Strategy 1. This strategy is nonparametric: the computations in the initial process rely only on the information on the cumulative distribution of the known DTPs, and those proteins whose values of the property fall out of the range are more likely to share similar patterns with the reliable NDTPs. In our experiments, the reliable range for the DTPs regarding one continuous property is defined as an interval $[q_1, q_2]$, where $q_1$ and $q_2$ are quantiles of the property; $q_1$ is set as the 10% quantile while $q_2$ is the 90% quantile. As displayed in Figure 5, any unlabeled sample whose value of the property is lower than the lower threshold or higher than the upper threshold is judged to violate the frequent DTPs' pattern on that property. Another crucial definition is the extent to which an unknown sample violates the frequent DTPs' pattern, which is complicated to determine. To simplify the process and maintain the anomaly information, for each unlabeled protein we count how many of the 31 property values do not conform to the reliable ranges and use the count as the index measuring the reliability of the sample being an NDTP. After a series of computations, a statistical result is given in Figure 6. For the Selection Algorithm in reliable interval, the threshold to screen likely NDTPs is set as $T = 14$ to make a trade-off between class balance in the training dataset and reliability of the NDTPs. In this way, 441 proteins are selected from the unlabeled as the most likely NDTPs for further training.
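The selection step of Strategy 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the linear-interpolation quantile rule and the function names (`quantile`, `reliable_intervals`, `select_reliable_negatives`) are assumptions, while the 10% and 90% quantiles and the counting rule follow the description above.

```python
def quantile(sorted_vals, q):
    """Linear-interpolation quantile of pre-sorted values (0 <= q <= 1).

    Assumption: the paper does not state which quantile rule was used.
    """
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def reliable_intervals(positives, q_low=0.10, q_high=0.90):
    """One [q_low, q_high] quantile interval per property, from the known DTPs."""
    n_props = len(positives[0])
    intervals = []
    for j in range(n_props):
        col = sorted(row[j] for row in positives)
        intervals.append((quantile(col, q_low), quantile(col, q_high)))
    return intervals

def select_reliable_negatives(unlabeled, intervals, threshold):
    """Indices of unlabeled samples violating at least `threshold` intervals."""
    chosen = []
    for idx, row in enumerate(unlabeled):
        # Count the properties whose value falls outside the reliable interval.
        count = sum(1 for v, (lo, hi) in zip(row, intervals) if v < lo or v > hi)
        if count >= threshold:
            chosen.append(idx)
    return chosen
```

With the setting reported in the text, `threshold` would be 14 and each row would hold the 31 continuous property values of one protein.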
Strategy 2. The unlabeled dataset is capable of approximating the distribution of NDTPs, but this approximation is biased because of the potential DTPs it contains. Meanwhile, the distribution of DTPs is easily captured with the help of the 517 known DTPs. When the unlabeled data are combined with the labeled data, a semisupervised learning framework can exploit the additional information in the unlabeled data, which reduces the bias of the probability density estimation.
Expectation maximization (EM) [22] is the algorithm we employed for learning the mixture of probability distributions. Gaussian distributions are frequently used in mixture models to approximate the component distributions.
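The model can be described as
$$p(x) = \sum_{k} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),$$
where $\pi_k$ are the mixing weights, which sum to one. The equivalent objective is the maximization of the log likelihood over the samples $x_i$ ($i = 1, 2, \ldots, n$):
$$\max \sum_{i=1}^{n} \log \sum_{k} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k).$$
For our problem, we denote $\{\mu_0, \Sigma_0\}$ and $\{\mu_1, \Sigma_1\}$, respectively, as the parameters of the DTPs and NDTPs distributions.
As some samples have been determined as DTPs, it is better to incorporate such partial label information into the model. Denoting $\{\tilde{x}_j \mid j = 1, 2, \ldots, m\}$ as the known DTPs and $\{x_i \mid i = 1, 2, \ldots, n\}$ as the unlabeled samples, the objective in our problem is adapted as
$$\max \sum_{i=1}^{n} \log \sum_{k=0}^{1} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k) + \sum_{j=1}^{m} \log \bigl( \pi_0 \, \mathcal{N}(\tilde{x}_j \mid \mu_0, \Sigma_0) \bigr).$$
Applying the EM algorithm to optimize the objective, we obtain the final parameters. Once the parameters are learned, the mixture model is derived. As the model is generative, the probability of assigning a sample to each class can be computed. The probability of assigning a sample to the NDTPs, which we care about most, can be calculated as
$$P(\mathrm{NDTP} \mid x) = \frac{\pi_1 \, \mathcal{N}(x \mid \mu_1, \Sigma_1)}{\pi_0 \, \mathcal{N}(x \mid \mu_0, \Sigma_0) + \pi_1 \, \mathcal{N}(x \mid \mu_1, \Sigma_1)}.$$
This calculation is simply the posterior probability obtained by Bayesian inference.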
After ranking the above probabilities in decreasing order, the top 441 samples are selected as reliable NDTPs, so that the number matches that of Strategy 1.
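The semisupervised EM procedure of Strategy 2 can be sketched in a deliberately simplified form. The sketch below assumes a single continuous property (one-dimensional Gaussians instead of the multivariate case), a fixed number of iterations, and an initialization from the labeled and unlabeled sample statistics; all function names are hypothetical. Known DTPs are kept in the DTP component with weight one, while unlabeled samples receive soft assignments.

```python
import math

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mean_var(values):
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, max(var, 1e-6)  # floor keeps the density well defined

def ndtp_posterior(x, m):
    """Posterior probability that x belongs to the NDTP component (index 1)."""
    p0 = (1.0 - m["pi1"]) * gaussian_pdf(x, m["mu0"], m["var0"])
    p1 = m["pi1"] * gaussian_pdf(x, m["mu1"], m["var1"])
    total = p0 + p1
    return p1 / total if total > 0.0 else 0.5  # both densities underflowed

def fit_mixture(unlabeled, positives, n_iter=100):
    """EM for a two-component 1-D mixture; known DTPs stay in component 0."""
    mu0, var0 = mean_var(positives)  # DTP component, initialized from labels
    mu1, var1 = mean_var(unlabeled)  # NDTP component, initialized from unlabeled
    n_total = len(unlabeled) + len(positives)
    m = {"pi1": len(unlabeled) / n_total,
         "mu0": mu0, "var0": var0, "mu1": mu1, "var1": var1}
    for _ in range(n_iter):
        # E-step: soft NDTP assignment for unlabeled samples only.
        resp = [ndtp_posterior(x, m) for x in unlabeled]
        n1 = sum(resp)
        n0 = n_total - n1
        if n1 < 1e-9:  # NDTP component vanished; stop updating
            break
        # M-step: labeled DTPs contribute with weight 1 to component 0.
        mu0 = (sum(positives)
               + sum((1.0 - r) * x for r, x in zip(resp, unlabeled))) / n0
        mu1 = sum(r * x for r, x in zip(resp, unlabeled)) / n1
        var0 = (sum((x - mu0) ** 2 for x in positives)
                + sum((1.0 - r) * (x - mu0) ** 2
                      for r, x in zip(resp, unlabeled))) / n0
        var1 = sum(r * (x - mu1) ** 2 for r, x in zip(resp, unlabeled)) / n1
        m = {"pi1": n1 / n_total, "mu0": mu0, "var0": max(var0, 1e-6),
             "mu1": mu1, "var1": max(var1, 1e-6)}
    return m

def top_reliable_negatives(unlabeled, positives, n_select):
    """Indices of the n_select unlabeled samples most likely to be NDTPs."""
    m = fit_mixture(unlabeled, positives)
    ranked = sorted(range(len(unlabeled)),
                    key=lambda i: ndtp_posterior(unlabeled[i], m), reverse=True)
    return ranked[:n_select]
```

In the setting of the text, `n_select` would be 441; the multivariate case replaces the scalar variances with covariance matrices.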

Classifier Establishment in the Second Stage.
In the first stage, several reliable NDTPs are screened to constitute part of the training dataset. Then, bagging of decision trees [23], a traditional but efficient model, is used for the further identification. Bagging applies the bootstrapping technique [24] to the training dataset to generate a series of meta models that differ from one another. Benefiting from this randomness, the learned meta models, here decision trees, are aggregated to capture the complex boundary of the concept. For our task in particular, each extracted property has been shown to discriminate between the classes and the information redundancy is rather low, so a meta decision tree learned on a random subset is easily established and is beneficial and effective in practice. In the experimental process, bagging is performed with the scikit-learn package [25].
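The bagging idea can be illustrated with a small self-contained sketch. To keep it short, the base learner here is a depth-one threshold rule (a decision stump) standing in for a full decision tree, which is an assumption of this sketch and not the configuration reported in the paper; the names `bootstrap`, `fit_stump`, `fit_bagging`, and `predict` are hypothetical.

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Draw len(data) samples with replacement from the training data."""
    return [rng.choice(data) for _ in data]

def fit_stump(data):
    """Stand-in base learner: best single-threshold rule on one property."""
    best = None
    n_props = len(data[0][0])
    for j in range(n_props):
        for thr in sorted({x[j] for x, _ in data}):
            for left_label in (0, 1):
                errors = sum(
                    (left_label if x[j] <= thr else 1 - left_label) != y
                    for x, y in data
                )
                if best is None or errors < best[0]:
                    best = (errors, j, thr, left_label)
    _, j, thr, left_label = best
    return lambda x: left_label if x[j] <= thr else 1 - left_label

def fit_bagging(data, n_estimators=25, seed=0):
    """Train one base learner per bootstrap replicate."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap(data, rng)) for _ in range(n_estimators)]

def predict(models, x):
    """Aggregate the base learners by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]
```

Each element of `data` is a pair `(properties, label)` with label 1 for a DTP and 0 for an NDTP. An odd `n_estimators` avoids tied votes. In scikit-learn, the corresponding building blocks are `BaggingClassifier` and `DecisionTreeClassifier`.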
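In our experiment, the Gini index was chosen as the partition criterion, as follows.
Define the Gini value of the dataset $D$ as
$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{|\mathcal{Y}|} p_k^2,$$
where $p_k$ is the proportion of samples belonging to class $k$ ($k = 1, 2, \ldots, |\mathcal{Y}|$).
Then the Gini index of property $a$ can be computed as
$$\mathrm{Gini\_index}(D, a) = \sum_{v=1}^{V} \frac{|D^v|}{|D|} \, \mathrm{Gini}(D^v),$$
where $\{D^v \mid v = 1, 2, \ldots, V\}$ corresponds to the samples belonging to the branch nodes derived from the $V$ types of property $a$.
The property with the smallest Gini index, that is, the largest decrease in impurity, is chosen for the partition. Besides, the minimum number of samples for splitting was set as 2 and the minimum number of samples in a leaf was 1.

Input: The positive dataset Pos, the unlabeled dataset $U$, the threshold $T$ to measure the extent of violation
(1) Initialize the reliable negative dataset RN = NULL
(2) For each property $p_i$ ($i = 1, 2, \ldots, m$):
(3)     Compute the reliable interval of Pos corresponding to $p_i$
(4) End for; obtain a series of reliable intervals {interval$_i$ | $i = 1, 2, \ldots, m$}
(5) For each sample $u$ in $U$:
(6)     count = 0
(7)     For each property $p_i$ in $u$:
(8)         If $p_i$ locates out of the corresponding reliable interval$_i$:
(9)             count = count + 1
(10)    If count ≥ $T$:
(11)        RN = RN ∪ {$u$}
Output: The set of reliable negative samples RN
Algorithm 1: The Selection Algorithm in reliable interval.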
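The Gini-based partition criterion used for the decision trees can be made concrete with a short sketch. It assumes the standard definitions (Gini value of a label set, and the size-weighted Gini value of the branches of a nominal property); the function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini value of a label list: 1 - sum_k p_k^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index(values, labels):
    """Size-weighted Gini value of the branches induced by one property."""
    branches = {}
    for v, y in zip(values, labels):
        branches.setdefault(v, []).append(y)
    n = len(labels)
    return sum(len(b) / n * gini(b) for b in branches.values())
```

A property that separates the classes perfectly gives an index of 0, and the property with the smallest index is preferred. For a continuous property, the same computation applies after splitting the values at a threshold into two branches.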

Experiments and Results Analysis
Input: The unlabeled dataset $U$, the positive dataset $P$, the number of selection $L$
(1) Initialize the reliable negative set RN = NULL
(2) Run EM on the mixture model using $U$ and $P$ to derive the mixture probability distributions
(3) For each sample $u$ in $U$:
(4)     Compute the probability of the sample assigned as the negative
(5) End for
(6) Rank the above probability likelihood in decreasing order
(7) Select the top $L$ samples to append to RN
Output: The reliable negative samples RN
Algorithm 2: EM for negative Selection Algorithm.

Here, 70% of the known DTPs in random selection (361 randomly selected known DTPs) and the well-picked reliable NDTPs (441 NDTPs) from the first stage were merged into the training set. The rest of the dataset, including 156 known DTPs as the positive and 4935 uncertain NDTPs as the negative, served to evaluate our two-stage models. That is, the 4935 uncertain NDTPs were used for the final screening of the potential DTPs. To eliminate the randomness of the partition, we averaged the results over 10 independent turns during result analysis. Algorithm 1 makes use of the reliable intervals to detect reliable NDTPs. Algorithm 2 forms the dataset of reliable NDTPs in a semisupervised style. For the process of the meta decision tree, see Algorithm 3. Algorithm 4 illustrates the bagging method.
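The data partition described above can be sketched as follows. The identifiers are placeholders, the function name `partition` is hypothetical, and truncating 70% of 517 to 361 is an assumption chosen because it reproduces the counts stated in the text.

```python
import random

def partition(dtp_ids, unlabeled_ids, reliable_ndtp_ids, train_frac=0.7, seed=0):
    """Split known DTPs 70/30 and move reliable NDTPs into the training set."""
    rng = random.Random(seed)
    shuffled = list(dtp_ids)
    rng.shuffle(shuffled)
    n_train = int(train_frac * len(shuffled))  # truncation: 0.7 * 517 -> 361
    train_pos, test_pos = shuffled[:n_train], shuffled[n_train:]
    reliable = set(reliable_ndtp_ids)
    # Unlabeled proteins not chosen as reliable NDTPs are screened at test time.
    test_unlabeled = [u for u in unlabeled_ids if u not in reliable]
    return train_pos, list(reliable_ndtp_ids), test_pos, test_unlabeled
```

Repeating the call with ten different seeds and averaging the resulting metrics mirrors the 10 independent turns mentioned in the text.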
In our research, we accomplished the screening task by learning directly in a supervised style. Furthermore, the metrics for binary classification can also be employed for performance evaluation. The confusion matrix in (9) presents the result in an intuitive way.

$$\begin{array}{l|cc} & \text{Predicted Positive} & \text{Predicted Negative} \\ \hline \text{Actual Positive} & TP & FN \\ \text{Actual Negative} & FP & TN \end{array} \tag{9}$$
FN stands for the number of DTPs mistakenly identified as nontargets, and the rest can be understood in a similar way.
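Of great importance in our task is the recall ratio of DTPs, which is defined as
$$\mathrm{Recall}_{\mathrm{DTP}} = \frac{TP}{TP + FN}.$$
To maintain a low ratio of incorrectly recognizing NDTPs as DTPs, the recall ratio of the NDTPs is also important:
$$\mathrm{Recall}_{\mathrm{NDTP}} = \frac{TN}{TN + FP}.$$
Meanwhile, the precision of the NDTPs should be monitored as well:
$$\mathrm{Precision}_{\mathrm{NDTP}} = \frac{TN}{TN + FN}.$$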
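The three metrics monitored in this study follow directly from the entries of the confusion matrix in (9). A minimal sketch, with a hypothetical function name and toy counts in the usage note:

```python
def screening_metrics(tp, fn, fp, tn):
    """Recall of DTPs, recall of NDTPs, and precision of NDTPs from (9)."""
    return {
        "recall_dtp": tp / (tp + fn),      # known DTPs recovered
        "recall_ndtp": tn / (tn + fp),     # uncertain NDTPs kept as negative
        "precision_ndtp": tn / (tn + fn),  # predicted negatives that are NDTPs
    }
```

For example, `screening_metrics(9, 1, 20, 80)` gives a recall of DTPs of 0.9 and a recall of NDTPs of 0.8; these toy counts are illustrative only and are not results from the paper.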
Besides, the accuracy is estimated as
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}.$$

Input: The training dataset $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ and the properties set $P = \{p_1, p_2, \ldots, p_m\}$
Process: Function Tree Generator($D$, $P$)
(1) Generate a node;
(2) If all of the samples belong to the same class $C$ then
(3)     Assign the node as the leaf node of class $C$; Return
(4) End if
(5) If $P = \emptyset$ Or the samples in $D$ achieve the same values on $P$ then
(6)     Assign the node as the leaf node of the class $C$ to which most of the samples belong; Return
(7) End if
(8) Choose the best partition property from $P$ as $p^*$;
(9) For each value $\tilde{p}$ in $p^*$:
(10)    Generate a branch for the node; let $\tilde{D}$ be the subset of $D$ in which the sample holds the value $\tilde{p}$;
(11)    If $\tilde{D}$ is empty then
(12)        Assign the node of the branch as the leaf node of the class $C$ to which most of the samples belong; Return
(13)    Else:
(14)        Set Tree Generator($\tilde{D}$, $P \setminus \{p^*\}$) as the node of the branch
(15)    End if
(16) End for
Output: A decision tree rooted in the node.
Algorithm 3: The process of the meta decision tree.

In some sense, due to the dominance of the uncertain NDTPs, the accuracy seems less important than the former two metrics. The uncertainty of the testing set also leaves room for tolerance regarding the precision of the negative. More specifically, a relatively but not extremely high recall ratio of NDTPs contributes to the final decision on the screening of DTPs. During the process of bagging, the decisive parameter is the number of meta decision trees, denoted as n_estimators in scikit-learn [25]. To explore the optimal parameter for the bagging of decision trees, we varied n_estimators from 5 to 2000 with a step width of 5. The criterion for choosing the optimal n_estimators was the recall ratio of DTPs.
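The recursive Tree Generator procedure for the meta decision tree can be sketched in Python for nominal properties. This is an illustrative reading of the pseudocode, not the scikit-learn implementation used in the experiments: rows are dictionaries, `domains` lists the possible values of each property, an empty branch becomes a majority-class leaf and the loop then continues with the next value, and all names are hypothetical.

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_property(rows, labels, props):
    """Property with the smallest size-weighted Gini value of its branches."""
    def index(p):
        groups = {}
        for r, y in zip(rows, labels):
            groups.setdefault(r[p], []).append(y)
        return sum(len(g) / len(labels) * gini(g) for g in groups.values())
    return min(props, key=index)

def tree_generator(rows, labels, props, domains):
    if len(set(labels)) == 1:                      # all samples in one class
        return labels[0]
    same_values = all(all(r[p] == rows[0][p] for p in props) for r in rows)
    if not props or same_values:                   # nothing left to split on
        return majority(labels)
    best = best_property(rows, labels, props)
    node = {"property": best, "branches": {}}
    for value in domains[best]:
        subset = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        if not subset:                             # empty branch: majority leaf
            node["branches"][value] = majority(labels)
        else:
            sub_rows = [r for r, _ in subset]
            sub_labels = [y for _, y in subset]
            remaining = [p for p in props if p != best]
            node["branches"][value] = tree_generator(
                sub_rows, sub_labels, remaining, domains)
    return node

def classify(tree, row):
    """Follow branches until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["property"]]]
    return tree
```

Class labels must not themselves be dictionaries, since a dictionary marks an internal node in this sketch.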
The predicting process is our main concern. Since the prior information indicates a small ratio of DTPs in the unlabeled, the predicted FP in the testing set can be taken as the main source of candidate DTPs. The mechanism behind this prediction pipeline is that the known DTPs and the potential DTPs share the same statistical distribution traits, so the FP may contain most of the candidate DTPs if the recall ratio of DTPs stays at a high level.

Case Analysis in One Turn.
In one turn of the experiments, we derived the confusion matrices (14a) and (14b). The layout is the same as in (9), with the DTPs as the positive and the NDTPs as the negative. The bold number is the number of predicted DTPs. Equation (14a) is the result using Strategy 1, while (14b) is the result using Strategy 2. Notably, bagging of decision trees based on either strategy achieved high recall ratios of DTPs, reaching 93.5% and 97.4%, respectively. It is also worth noticing that, with the Strategy 1-based bagging method, the recall ratio of the uncertain NDTPs reached about 88.7%. Such results conform to the prior information that the actual NDTPs dominate the dataset. However, the confusion matrix of Strategy 2 showed a relatively lower recall ratio of NDTPs, approximately 63.9%, indicating that the Strategy 2-based method provides a broad but rough scope for the final recommendation.
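The ratios quoted for this turn can be cross-checked against the counts stated elsewhere in the paper (156 known DTPs and 4935 uncertain NDTPs in the testing set; about 558 and 1782 predicted DTPs). The sketch below is only a consistency check: treating 558 and 1782 as exact FP counts is an assumption, and the function name is hypothetical.

```python
def implied_counts(n_pos, n_neg, recall_pos, predicted_fp):
    """Confusion-matrix entries implied by a recall ratio and an FP count."""
    tp = round(recall_pos * n_pos)
    fn = n_pos - tp
    tn = n_neg - predicted_fp
    return {"TP": tp, "FN": fn, "FP": predicted_fp, "TN": tn,
            "recall_neg": tn / n_neg}

strategy1 = implied_counts(156, 4935, 0.935, 558)
strategy2 = implied_counts(156, 4935, 0.974, 1782)
```

Under these assumptions, the implied recall ratios of NDTPs round to 0.887 and 0.639, which agrees with the percentages quoted in the text.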
During the prediction process, we directly took the FP samples in the confusion matrix as the potential DTPs. The consistency of the two strategies was verified in this turn; Figure 7 is the Venn diagram of the proportions of DTPs predicted by the two strategies. About 442 proteins were predicted as potential DTPs by both methods, accounting for most of the potential DTPs predicted by the Strategy 1-based method.
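The consensus between the two strategies amounts to a set intersection of the predicted candidates. A minimal sketch with hypothetical function names; the counts in the usage note are the approximate figures quoted in the text:

```python
def common_candidates(predicted_a, predicted_b):
    """Proteins predicted as potential DTPs by both methods."""
    return sorted(set(predicted_a) & set(predicted_b))

def overlap_summary(n_a, n_b, n_common):
    """Share of each prediction set covered by the common candidates."""
    union = n_a + n_b - n_common
    return {
        "share_of_a": n_common / n_a,
        "share_of_b": n_common / n_b,
        "jaccard": n_common / union,
    }
```

With about 558 and 1782 predictions and about 442 in common, `overlap_summary(558, 1782, 442)` puts roughly 79% of the Strategy 1 predictions inside the Strategy 2 predictions.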
The detailed information about the potential drug target proteins commonly predicted by the two strategy-based bagging of decision trees methods has been uploaded to the website http://pan.baidu.com/s/1c1SB2EG.

Sensitivity Analysis to Data Partition.
For comparison with our strategies, a random sampling method for constructing the negative set, which is a prevailing practice [26], was also performed in our research. In other words, 441 proteins were randomly picked to form the set of most likely NDTPs in the training dataset. Bagging of decision trees was again used for classification.
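The random sampling baseline can be sketched in a few lines. The function name and the seed handling are assumptions of this sketch; only the number of sampled proteins (441) comes from the text.

```python
import random

def random_negatives(unlabeled_ids, n_select=441, seed=0):
    """Baseline: pick n_select unlabeled proteins uniformly without replacement."""
    rng = random.Random(seed)
    return rng.sample(list(unlabeled_ids), n_select)
```

Unlike the two proposed strategies, this baseline does not inspect the property values, so hidden DTPs among the unlabeled proteins may end up in the negative training set.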
Table 2 reports the results of the above experiments over 10 turns, including the fit on the training dataset. By averaging the metrics and computing the respective variances, an evident but valuable conclusion was drawn: bagging of decision trees using our strategies worked steadily with low variance. Compared with RS-bagging, S1-bagging and S2-bagging achieved higher recall ratios of DTPs, recall ratios of NDTPs, and precisions of NDTPs. This suggests that S1-bagging can finely detect the potential drug target proteins while S2-bagging offers a broader range for further screening. Another interesting fact in Table 2 is that, for RS-bagging, overfitting on the training dataset severely damaged the testing results, leading to low recall ratios for the DTPs. We confirmed that random sampling for the selection of NDTPs as training data does not find reliable ones, even though the actual DTPs occupy a small proportion of the unlabeled.
Besides, the higher performance on the training dataset using the random sampling technique introduced inevitable bias into the predicting process. This did not happen when employing S1-bagging and S2-bagging. What is more, the performance on the training dataset using the two strategies was superior. In Table 3, we carried out Student's paired t-test to check the significance of the results. For each metric, the 10 independent results were compared in pairs between S1-bagging, S2-bagging, and RS-bagging. As shown in the table, both strategies were significantly superior to RS-bagging in three metrics, namely, recall ratios of DTPs, recall ratios of NDTPs, and precisions of NDTPs. Interestingly, for the precision of NDTPs, S2-bagging was not significantly better than S1-bagging. Overall, the results of Student's paired t-test have verified the effectiveness of the two proposed methods in the sense of significance. Figure 8 provides the 10 independent experimental results for the recall ratios and precisions on the testing dataset. Three radar figures further

Figure 1: Process of drug target discovery using data mining techniques.

Figure 3: Correlation graph. The darker the color, the stronger the correlation between two properties.

Figure 6: Results of violating statistics. Count is the number of properties violating the reliable interval for some sample, and cardinality is the number of such samples.

3.1. Experimental Settings and Some Metrics. A persuasive manipulation in the experiments is to partition the dataset into the training set and the testing set. Here 70% known DTPs

Input: The positive dataset Pos, the unlabeled dataset $U$, the threshold $T$ to measure the extent of violation
(1) Initialize the reliable negative dataset RN = NULL
(2) For each property $p_i$ ($i = 1, 2, \ldots, m$):
(3)

(6) Rank the above probability likelihood in decreasing order
(7) Select the top $L$ samples to append to RN
Output: The reliable negative samples RN
Algorithm 2: EM for negative Selection Algorithm.

Figure 8: Results on the testing dataset in 10 independent experiments. Only the recall ratio of DTPs, the recall ratio of NDTPs, and the precision of NDTPs on the testing dataset are involved in the figure, denoted as (a), (b), and (c), respectively.

Table 1: Statistical results after the K-S test. Most of the properties are significant in the test.
information on the accumulated distribution of the known DTPs and those proteins whose values of the property fall out of the range are more likely to share similar patterns with the reliable NDTPs. In our experiments, the reliable range for the DTPs regarding one continuous property is defined as an interval

Table 2: Results on 10 independent experiments: averaged results and variance. RS-bagging stands for random sampling to obtain NDTPs for training. S1-bagging represents the bagging classifier established using Strategy 1 in the first stage, and likewise S2-bagging for Strategy 2.

Table 3: Significance values of the paired t-test. The two-sided significance level is set as 0.05.
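The paired comparison behind Table 3 can be sketched with the standard paired t statistic. The function name is hypothetical, the constant is the textbook two-sided critical value for 9 degrees of freedom (10 paired runs) at the 0.05 level, and the sketch assumes the paired differences are not all identical.

```python
import math

T_CRIT_DF9 = 2.262  # two-sided 0.05 critical value, 9 degrees of freedom

def paired_t(results_a, results_b):
    """Paired t statistic for two methods evaluated on the same runs."""
    diffs = [a - b for a, b in zip(results_a, results_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

def significantly_different(results_a, results_b, t_crit=T_CRIT_DF9):
    """Two-sided decision; t_crit must match the number of paired runs."""
    return abs(paired_t(results_a, results_b)) > t_crit
```

Each argument would hold the 10 per-turn values of one metric for one method. A library routine such as `scipy.stats.ttest_rel` returns the same statistic together with its p-value.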