Novel Design Strategy for Checkpoint Kinase 2 Inhibitors Using Pharmacophore Modeling, Combinatorial Fusion, and Virtual Screening

Checkpoint kinase 2 (Chk2) has a great effect on DNA-damage and plays an important role in response to DNA double-strand breaks and related lesions. In this study, we will concentrate on Chk2 and the purpose is to find the potential inhibitors by the pharmacophore hypotheses (PhModels), combinatorial fusion, and virtual screening techniques. Applying combinatorial fusion into PhModels and virtual screening techniques is a novel design strategy for drug design. We used combinatorial fusion to analyze the prediction results and then obtained the best correlation coefficient of the testing set (r test) with the value 0.816 by combining the BesttrainBesttest and FasttrainFasttest prediction results. The potential inhibitors were selected from NCI database by screening according to BesttrainBesttest + FasttrainFasttest prediction results and molecular docking with CDOCKER docking program. Finally, the selected compounds have high interaction energy between a ligand and a receptor. Through these approaches, 23 potential inhibitors for Chk2 are retrieved for further study.


Introduction
DNA-damage is induced by ionizing radiation, genotoxic chemicals, or collapsed replication forks, and when DNA was damaged or the responses of cells were failure, the mutation associated with the breast or ovarian cancer of genes may occur. To prevent and repair the DNA-damage, mammalian cells will control and stabilize the genome by cell cycle checkpoint. The checkpoint pathway consists of several kinases, such as ataxia telangiectasia mutated protein (ATM [1,2]), ataxia telangiectasia and Rad3-related protein (ATR [1,2]), checkpoint kinase 1 (Chk1 [3,4]), and checkpoint kinase 2 (Chk2 [5][6][7][8]). ATM and ATR are upstream kinases passing messages to downstream kinases and phosphorylating several proteins that initiate the activation of the DNAdamage checkpoint. Moreover, ATM is a primarily pathway to activate p53 (protein 53 [9]) by Chk2, and ATR may influence the phosphorylation of Chk1. Both Chk1 and Chk2 are key components in DNA-damage; however, their cellular activities are different. Chk1 is involved in S and G2 phases of the cell cycle with ATR pathway. By contrast, Chk2 is activated in all phases through ATM-dependent pathway and plays an important role in response to DNA double-strand breaks and related lesions. Furthermore, Chk1 is an unstable protein and lacks the forkhead-associated domain (FHA) which was involved in several processes that protect against cancer and can be found in Chk2. Therefore, we concentrate on Chk2 in this study.
Chk2 is a protein containing 543 amino acid residues and the structure of Chk2 consists of some functional elements, including the N-terminal SQ/TQ cluster domain (SCD), FHA, and the N-terminal serine/threonine kinase domain (KD) [5][6][7][8]. The SCD is known to be the preferred site with the residue Thr68 for phosphorylation to respond to DNA-damage by ATM/ATP kinases. The FHA domain is a phosphopeptide recognition domain found in many regulatory proteins and thought to bind to the phosphoThr68 segment of SCD [5][6][7][8][10][11][12][13][14]. Hence it is a good candidate  for interactions of Chk2 with its upstream regulators or downstream targets in the cell-cycle-checkpoint signaling. The KD occupies almost the entire carboxy-terminal half of Chk2 and has been identified based on their homology with serine/threonine kinases. Some studies reported that when DNA was damaged, Chk2 is activated by ATM/ATR through the phosphorylation of residue Thr68. Moreover, Chk2 induces transautophosphorylation of residues Thr383 and Thr387 and then cis-phosphorylation of residue Ser516 [5][6][7][8][10][11][12][13][14]. After that, Chk2 will phosphorylate several downstream substrates, such as BRCA1 (breast cancer 1, early onset [15,16]), Cdc25A (cell division cycle 25 homolog A), Cdc25C, and p53 [7,8,10]. Several researches indicated that Chk2 phosphorylates Cdc25A which is considered an oncogene on the residue Ser123 in S phase of cell cycle, and it also phosphorylates Cdc25C on the residue Ser216 in G2 phase helping prevent mitotic entry in cells with damaged DNA [5]. Furthermore, BRCA1 and p53 are involved in DNA repair process in the breast or ovarian cancer. BRCA1 is a human caretaker gene and helps repair damaged DNA or destroys cells which cannot be repaired. The p53 is a tumor suppressor protein involved in preventing cancer in human and plays an important role in the G1 checkpoint in response to DNA damaging agents. We consider that the sites of the phosphorylations are important in the drug design for cell survival when DNA is damaged.
Recently, several studies identified the inhibitors of Chk2 [6][7][8][10][11][12][13][14], and they also showed the crystal structures of Chk2 complex, such as PDB: 1GXC, 2W7X, and, and so forth. They are selective, reversible, and ATP-competitive Chk2 inhibitors demonstrating that they effectively restrain the radiation-induced phosphorylation of Chk2. In addition, several selective Chk2 inhibitors have been also identified (two examples were shown in Figure 1) and the researches indicated that they are potential and selective inhibitors of Chk2 with chemotherapeutic and radiosensitization potential. On structure-based drug design, several developments of Chk2 were published [17,18]. Quantitative structureactivity relationship model (QSAR model) is a regression or classification model and is an important technique in the rational drug design. It is used to correlate the structure properties of compounds with their biological activities. The method to predict the quality by QSAR was improved by considering the three-dimensional structure QSAR (3D-QSAR) [19][20][21][22][23][24] of targeted inhibitor. Therefore, the compound structure can be directly optimized in the 3D space.
Pharmacophore [20][21][22][23][24][33][34][35] is a set of structural features responsible for the biological activity of a molecule. It allowed compounds with diverse structures to find the common chemical features by ligand pharmacophore mapping, and that is different from CoMFA and CoMSIA with the common structure constraint. Thus, pharmacophore can explain how diverse ligands bind to a receptor site by these features and visualize the feature of potential chemical interactions between ligands and receptors. Moreover, pharmacophore can easily and quickly identify candidate inhibitors for a target protein based on 3D query. Therefore, in this work, we first used 3D-QSAR study to build pharmacophore hypotheses (denoted as PhModels) for Chk2 inhibitors by HypoGen Best, Fast, and Caesar algorithms, respectively. Then we used the combinatorial fusion to select and combine prediction results for improving the predictive accuracy in biological activities of inhibitors. Virtual screening is a computational technique used in drug discovery research. There are two categories of screening techniques: ligandbased and structure-based. In this work, for ligand-based virtual screening, we used the selected PhModels as 3D structure query by pharmacophore hypothesis screening that each compound in National Cancer Institute (NCI) database will be mapped onto the pharmacophoric features of selected PhModels. When the chemical features of a compound fit the generated PhModels, it will be selected. All of feasible compounds in NCI database were selected in this work. Finally, the potential inhibitors were retrieved from selected compounds by using molecular docking program to predict the conformation and interaction energy between Chk2 and ligand. Applying combinatorial fusion into PhModels and virtual screening techniques is a novel design strategy for drug design and can help medicinal chemists to identify or design new Chk2 inhibitors. Besides, the potential inhibitors of Chk2 retrieved in this work can be estimated by biologists for further study.

Biological Data Collection.
In order to construct the PhModels, at first, we collected the Chk2 inhibitors with two-dimensional structures and the biological activity values from the ChEMBL database [36]. Then, according to the structure variations and chemical differences in the kinase inhibitor activity, 158 known Chk2 inhibitors were selected and retrieved. The biological activity of 158 known Chk2 inhibitors was represented as IC 50 (nanomolar, nM). There are 260,071 compounds from the NCI database (release version 3, http://cactus.nci.nih.gov/download/nci/) which were used in the database screening and molecular docking approach in this work.

Training and Testing Sets Selection.
Before generating PhModels, we should divide the 158 Chk2 inhibitors into the training set and testing set, respectively. The rules used to select training set inhibitors are according to the following requirements as suggested by the Accelrys Discovery Studio.
(1) All selected inhibitors should have clear and concise information including structure features and activity range.
(2) At a minimum, 16 diverse inhibitors for training set were selected to ensure the statistical significance. (3) The training set should contain the most and the least active inhibitors. (4) The biological activities of the inhibitors spanned at least 4 orders of magnitude. Based on the above four rules, the 158 Chk2 inhibitors were divided, and the scatter diagram of training set and testing set inhibitors was shown in Figure 2. Figure 2 demonstrates the distribution of the inhibitors in the training set and testing set, and the representative points of the testing set are close to those of the training set. The training set with 25 inhibitors is used to construct PhModels, and the IC 50 values of these 25 inhibitors are ranged from 2.3 to 100,000 nM ( Table 1). The testing set with remaining 133 inhibitors is used to test the predictive ability of generated PhModels, and the IC 50 values of the 133 testing set inhibitors are ranged from 3.4 to 74,000 nM ( Table 2). After selecting the training set and testing set inhibitors, we established PhModels at first, and then we used the correlation analysis to estimate the prediction abilities of PhModels.

Pharmacophore Generation.
The workflow of PhModel generation for Chk2 inhibitors was shown in Figure 3. In this study, we used the HypoGen program [37] in Accelrys Discovery Studio 2.1 to generate PhModels. At the initial step, 3D conformations of the training set inhibitors were generated by using "3D-QSAR Pharmacophore Generation protocol" with the Best, Fast, and Caesar generating algorithms, respectively, based on the CHARMm-like force field. The conformational-space energy was constrained ≤20 kcal/mol which represented the maximum allowed energy above the global minimum energy. For each training set inhibitor, the number of the diverse 3D conformations was set to ≤255. All other parameters were set as default values. Following the above rules, the 3D conformations were generated, and then we can construct the PhModel by using "Ligand Pharmacophore Mapping protocol. " Each of the ten PhModels using HypoGen Best, Fast, and Caesar algorithms were generated in this study.

Combinatorial Fusion.
In this study, we use a combinatorial fusion technique to facilitate prediction results selection and combination for improving predictive accuracy in biological activities of inhibitors. The combinatorial fusion we take is analogous to that used in information retrieval [38,39], pattern recognition [40], molecular similarity searching and structure-based screening [41], and microarray gene expression analysis [42]. These works have demonstrated the following remark [43].
Remark 1. For a set of multiple scoring systems, each with a score function and a rank function, we have that (a) the combination of multiple scoring systems would improve the prediction accuracy only if (1) each of the systems has a relatively high performance, and (2) the individual systems are distinctive (or diversified), and (b) rank combination performs better than score combination under certain conditions.
Given an inhibitor and for each prediction result , let be a function as the predicted biological activity and it is represented as a real number. We view the function as the score function. Since only assigns a number not a set of numbers, in this work, no rank function would be used for an inhibitor. Therefore, the rank combination and the rule (b) in Remark 1 are not considered in the study. Suppose we have prediction results ( scoring functions). There are combinatorially 2 − 1 combinations for all individual prediction results (∑ =1 ( ) = 2 − 1) with score functions. The total number of combinations to be considered for predicting biological activity of an inhibitor is 2 − 1. This number of combinations can become huge when the number of prediction results is large. Moreover, we have to evaluate the predictive power of each combination across all inhibitors. This study would start with combining only two prediction results which still retain fairly good prediction power.
Suppose prediction results , = 1, 2, . . . , , are given with score function ; there are several different ways of combination. Among others, there are score combination, voting, linear average combination, and weighted combination [38][39][40][41][42]. Voting is computationally simple and better than simple linear combinations when applied to the situation with large number of prediction results. However, a better alternative is to reduce the number of prediction results to a smaller number and then these prediction results are combined. In this paper, we reduce the set of prediction results to those which perform relatively well and then use the rank/score function to decide whether to combine by score. In this paper, we use the rules (a) (1) and (a) (2) stated in Remark 1 as our guiding principle to select prediction results and to decide on the method of combination. After generating each of the ten PhModels by using HypoGen Best, Fast, and Caesar algorithms for training set inhibitors, each of the best PhModel (denoted as Best train , Fast train , and Casear train ) was evaluated by its correlation coefficient of the   training set ( train ). Then these best PhModels were used to predict the biological activities of testing set inhibitors by using HypoGen Best, Fast, and Caesar algorithms. Therefore, there are nine prediction results (denoted as train × test , = {Best, Fast, Caesar}, that is, Best train Best test ) generated for testing set inhibitors. Using data fusion, results from various prediction results are combined to obtain predictions with larger accuracy rate. The diversity rank/score function is used to select the most suitable prediction results for combination. If these three best PhModels were selected, there are nine prediction results and then there are 2 9 − 1 = 511 combinations. According to the rule (a) (1) in Remark 1, the train of Casear train is far less than those of Best train and Fast train (Table 1); then, the Casear train was not considered in the combinations. Therefore, there are six prediction results ( 1train × 2test , 1 = {Best, Fast} and 2 = {Best, Fast, Caesar}) and 2 6 − 1 = 63 combinations. A special diversity rank/score graph was used to choose the best discriminating prediction results for further combination.
For an inhibitor in the testing set = { 1 , 2 , . . . , } and the pair of prediction results and , the diversity score function ( , ) is defined as ( , ) = ∑ | − |. When there are prediction results selected (in this study, = 6), there are ( 2 ) = ( − 1)/2 (in this study, the number is 15) diversity score functions. If we let vary and fix the prediction result pair ( , ), then ( , ) is the diversity score function ( , ) from = { 1 , 2 , . . . , }. Sorting ( , ) into descending order would lead to the diversity rank function ( , ) . Consequently, the diversity rank/score function ( , ) is defined as ( , ) = ( ( , ) ∘ −1 ( , ) )( ) = ( , ) ( −1 ( , ) ( )), where is in = {1, 2, 3, . . . , }. We note that the set is different from the set which is the testing set considered. The set is used as the index set for the diversity rank function value and | | = is indeed the cardinality of . The diversity rank/score function ( , ) so defined exhibits the diversity trend of the prediction result pair ( , ) across the whole spectrum of input set of inhibitors and is independent of the specific inhibitor under study. For two prediction results and , the graph of the diversity rank/score function ( , ) ( ) is called the diversity rank/score graph. This study aims to examine all the ( − 1)/2 diversity rank/score graphs to see which pair of prediction results would give the larger diversity measurement according to the rule (a) (2) in Remark 1.

Database Screen.
After examining 15 diversity rank/score graphs, the PhModels and determined from the best prediction result pair were used to screen the NCI database for new Chk2 inhibitor candidates. Under the PhModel, pharmacophore hypothesis screening can be used to screen small molecule database to retrieve the compounds as potential inhibitors that fit the pharmacophoric features.
In this study, the "Search 3D Database protocol" with the Best/Fast/Casear Search option in Accelrys Discovery Studio 2.1 was employed to search the NCI database with 260,071 compounds. We could filter out and select the compounds in the NCI database based on the estimated activity and chemical features of PhModel.

Molecular Docking.
After the database screening approach, the selected compounds can be further estimated according to the interaction energy between a receptor and a ligand through the molecular docking approach. In this study, selected compounds in the NCI database were docked into Chk2 active sites by CDOCKER docking program, and then their CDOCKER interaction energies were estimated. Finally, new potential candidates were retrieved from the NCI database with high interaction energy. The workflow of database screening and molecular docking approach was shown in Figure 4.

Combinatorial Fusion Analysis.
Under the six prediction results, the diversity score function ( , ) was calculated for each testing set inhibitor by a pair of prediction results ( , ). There are 15 diversity score functions ( , ) that were    Figure 6: The test for all of 63 combinations from six prediction results.
performed at first and then these diversity score functions were sorted to become the diversity rank function ( , ) , respectively. Finally, 15 diversity rank/score functions ( , ) were represented as diversity rank/score graphs shown in Figure 5. Among 15 diversity rank/score graphs, there are several combinations (gray color) that have less diversity scores than those by others, such as BB + BC, BB + BF, and FB + FB, shown in Figure 5. It means that these combinations may have less test than those by others according to rule (a) (2) in Remark 1. In other words, several combinations, such as BB + FC (purple color), BB + FF (blue color), and BF + FF (orange color), may have larger test than those by others due to larger diversity scores. For the six prediction results, all of the 63 combinations were preformed and evaluated by its test , respectively, as shown in Figure 6. In Figure 6  value of less than 2 nM were studied in a molecular docking study (Figure 4).

Molecular Docking
Results. 23 drug-like compounds along with the training set compounds were docked into the  Table 3.
The structures and characteristics of the top 2 compounds are given in Table 4, and we can find that some active site residues were identified from the Chk2 complex. The interaction sites of NSC136954 were Leu226, Val234, Ala247, Lys249, Ile251, Glu273, Ile274, Leu277, Ile286, Ile288, Ile299, Leu301, Leu303, Met304, Glu305, Gly306, Gly307, Glu308, Leu354, Thr367, Asp368, Phe369, and Gly370. On the other hand, the interaction sites of NSC70804 were Leu226, Leu227, Val234, Ala247, Ile248, Lys249, Ile251, Glu273, Leu277, Ile299, Leu301, Leu303, Met304, Gly307, Glu308, Glu351, Asn352, Leu354, Thr367, and Asp368. Several studies indicated that they are involved in hydrophobic interactions with Val234, Ile251, Leu354, Ile299, and the aliphatic portions of the side chains of Lys249, Thr367, and Asp368, in addition to several interactions of van der Waals or hydrophobic with Leu226, Val234, Leu303, Gly307, Leu354, and the aliphatic portions the side chains of Met304 and Glu308 [10,11]. Furthermore, the quinazoline was sandwiched between the lipophilic side chains of Val234 and Leu354, with the side chains of Ala247, Leu301, and Leu303 also contributing to a hydrophobic surface surrounding the core and an interaction between the pyrazole and Lys249 is likely to account for the increase in Chk2 potency [12]. And residue Thr367 of Chk2 is a serine in Chk1. Portions of the glycine-rich P-loop in Chk2, which is located directly above the inhibitor, are disordered (residues 229-231), whereas this loop is well defined in the structure of Chk1, and Leu301 in Chk2 corresponds to the "gatekeeper" residue in many kinases, which has been found to form contacts with bound inhibitors and is poorly conserved [44].

Conclusions
In this study, a novel design strategy for drug design was proposed to apply combinatorial fusion into PhModels and virtual screening techniques. 158 Chk2 inhibitors were divided into the training set and testing set, respectively. For 25 training set inhibitors, three best PhModels, Best train , Fast train , and Casear train , were generated at first, and then six prediction results for 133 testing set inhibitors were used for calculating 15 diversity rank/score functions. Finally, the combination Best train Best test and Fast train Fast test prediction results achieved the best test of value 0.816 among 63 combinations. Through these approaches, 23 potential Chk2 inhibitors with IC 50 value less than 2 nM and interaction energy value larger than 37.786 (kal/mol) are retrieved from NCI database. This study can help medicinal chemists to identify or design new Chk2 inhibitors. Besides, the potential inhibitors of Chk2 retrieved in this work can be estimated by biologists for further study.