Prediction on the Inhibition Ratio of Pyrrolidine Derivatives on Matrix Metalloproteinase Based on Gene Expression Programming

Quantitative structure-activity relationships (QSAR) were developed to predict the inhibition ratio of pyrrolidine derivatives on matrix metalloproteinase via heuristic method (HM) and gene expression programming (GEP). The descriptors of 33 pyrrolidine derivatives were calculated by the software CODESSA, which can calculate quantum chemical, topological, geometrical, constitutional, and electrostatic descriptors. HM was also used for the preselection of 5 appropriate molecular descriptors. Linear and nonlinear QSAR models were developed based on the HM and GEP separately and two prediction models lead to a good correlation coefficient (R 2) of 0.93 and 0.94. The two QSAR models are useful in predicting the inhibition ratio of pyrrolidine derivatives on matrix metalloproteinase during the discovery of new anticancer drugs and providing theory information for studying the new drugs.


Introduction
The tumor cell metastasis is a complex process that involves a series of processes such as the adhesion, enzymatic degradation, chemotaxis, and blood vessel hyperplasia in matrix [1]. Although there are many factors that affect the metastasis process of the malignant tumor cells, the interactive proteindegrading enzyme of the tumor cells and the surrounding microenvironment plays a key role in the deterioration of the tumor, which cannot be ignored [2]. Matrix metalloproteinases (MMPs) are one of them [3]. MMPs are a kind of endoenzyme depending on the zinc ion, playing an important role in the degradation and reconstruction of the extracellular matrix [4]. It turns out that MMPs play a crucial role in the tumor growth, invasion, metastasis, and angiogenesis in cancer tissue, in which gelatinase (MMP-2, MMP-9) is closely related to malignant tumors. Gelatinase (MMP-2, MMP-9) is an important target spot for antineoplastic drug research [5]. At present, it has been the hotspot of cancer drug research to develop and find the selective inhibitors of these target spots.
As a kind of alkaloid, with the derivative widely applied, pyrrolidine can be used as an important intermediate of fine chemicals and widely applied to the fields such as pharmaceutical [6,7], food, pesticides, daily chemicals [8], paints, textiles, printing and dyeing, papermaking, photographic materials, and polymer materials. Recent studies have found that it has anticancer activity with its mechanism of action to inhibit the activity of MMP-2 and MMP-9 and thereby to inhibit the tumor growth, invasion, metastasis, and angiogenesis in cancer tissues. IC 50 (the molar concentration of the compound leading to 50% enzyme inhibition) is often used to evaluate the effectiveness of the drug, as the action mechanism and therapeutic role of the drug after entering the body are closely related to its chemical structure and nature. However, these natures can be calculated or predicted by various methods. Quantitative structure-activity relationship (QSAR) and its variations have become a potentially effective way to predict the drug activity parameters [9][10][11][12]. The advantages of QSAR lie in that once the model is established, the nature of the compound can be predicted by the compound structure, and reasonable explanation can be made on the action mechanism of the drugs [13][14][15]. The method extends the range of rational drug screening and is helpful for finding new drugs according to the action mechanism [16][17][18][19][20].
Gene expression programming (GEP) [21] is a high efficient exploration algorithm based on the genetic evolution mechanism of natural population. Regarding the possible solutions in the problem domain as an individual or a chromosome of the group, coding the individual into the form of symbol string, carrying out repeated operation on the group based on the genetics (genetics, intersection, and heteromorphosis), evaluating the individuals according to the scheduled target fitness function, constantly obtaining better groups according to the evolution rule of "survival of the fittest, " and, meanwhile, searching the optimum individual with the searching approach in the overall situation to obtain the satisfied and optimal solutions, it has extremely strong generalization ability and has been used for the QSAR study of the drug [22][23][24][25].
This study adopts the heuristic algorithm (HM) and GEP to establish the QSAR model of pyrrolidine derivatives, gelatinase: IC 50 , establish linear and nonlinear models, predict the IC 50 of 33 pyrrolidine compounds, and also discuss the structural factors that affect the IC 50 .

Data Set.
The structures of the 33 pyrrolidine compounds (Figure 1) adopted and their corresponding IC 50 values are from [26] and are listed in Table 1 with the logarithm collected. In the study of HM and GEP, the data set is randomly divided into two sets: the training set contains 21 compounds and is used to establish the models; and the test set contains 12 compounds used to evaluate the stability and predictive ability of the established models.

Generation of Molecule Descriptors.
The two-dimensional structure of the molecules is drawn with the software ISISDRAW2.4. In the software Hyperchem7.0, all compounds shall be primarily optimized with the molecule mechanics method MM+, experiencing the geometry optimization with the semiempirical AM1 method on this basis to obtain the lowest energy conformation. The optimized molecule structure shall be calculated in the program MOPAC 7.0, with the resulting file of the MOPAC transferred into the program CODESSA to compute the five categories of descriptors, namely, the structure, topology, geometry, electrostatic, and quantum chemical descriptors, with totally 496 descriptors obtained. The stability of the model shall be inspected by the leave-one-out (LOO) crossvalidation correlation coefficient 2 CV . In this study, the HM regression result is represented with the root mean square (RMS).

GEP Method.
GEP is a new genetic algorithm invented by a Portuguese scientist in 1999 based on the genome (genome, GA) and phenotype (phenotype, GP). GEP mainly includes two aspects: chromosomes and expression trees (ETs). ET is mainly used to express the genetic coding information of the chromosome. In GEP, there are two languages used: the language of genes and ETs. The implementation techniques of GEP mainly include encoding scheme, expressions, selection operator, mutation operator, insert string operation, gene inversion, restructuring operator, polygene chromosome and the contiguous function, the standard function set and users-defined functions based on the frequent function set, and fitness function selection (Table 2). There are three kinds of fitness functions for the classic GEP method, and this paper adopts the fitness function based on the absolute error: where is the selection range, ( ) is the predicted value by the individual program for fitness case (out of fitness cases), and is the target value for fitness case .

Results and Discussions
3.1. Calculation Results of HM. All 33 compounds obtain 496 descriptors in total through the computing of the software CODESSA with all computed descriptors to establish the linear model for predicting log (IC 50 ). To determine the appropriate number of descriptors, this research studies different sets of the descriptors. When there is no significant improvement in the statistical performance of the model to add another descriptor, it means that the descriptor number is proper. The 2 increase of less than 0.02 or 2 CV decrease shall be selected as the limit standard to avoid the "over parameterization" of the model. In this study, the five descriptors closely related to the inhibition rate are finally selected ( Table 3). The correlation matrix of five descriptors is showed in Table 4. Seen from Table 4, the correlation coefficients between each of the two descriptors are less than 0.80, which means that they are interactively independent [27]. Figure 2 shows the correlation diagram of the predicted and experimental values of multiple linear regression models, which includes a total of 33 compounds of the training and test sets. The predicted log (IC 50 ) of these compounds is also   Train set Test set The predicted values The experimental values Test set: 2 = 0.85, 2 CV = 0.50, = 21.13, and = 0.36.

Calculation Results of GEP.
After the establishment of the linear model, the same descriptors, as the variables of GEP, establish the nonlinear model. In order to obtain satisfactory results, the parameters affecting the GEP are optimized. Automatic problem solver (APS), the software package used by GEP, is easy to control, and therefore, the evolutionary model can be tested by the test set. In the course of evolution, good selection has been made for the functions with 7 functions selected, namely, subtract, multiply, divide, index, sin, and tan and the fitting function is MSE. Through fitting, the five descriptors selected establish the best QSAR model with the prediction values and residua listed in Table 1 and     LUMO reflects the electron affinity of the molecule [28], with the coefficient positive in the model. When the target is fixed, the electrophilicity of the molecules is stronger, and the log (IC 50 ) value is greater. When 3 side chain is the aliphatic chain, the longer the chain, the greater the LUMO value, and the compound inhibition of enzyme activity of MMP-2 and MMP-9 will be increasing; the aromatics substituent is obviously stronger than the aliphatic substituent in side chain activity, which may be resulting from the large conjugation  system of the aromatic ring, increasing the LUMO value with stronger inhibition rate on the gelatinase activity. Generally, the substituent compound with branched chains is greater than that with a ring substituent, which means that the carbonyl reaction activity with open loop structure is stronger.
MRECO represents the minimum resonance energy of the C-O bond [29]. With the increase of the substituent, the three series of A, B, and C compounds keep an overall downward trend. The smaller the value, the lower the minimum resonance energy of the C-O bond, and the molecule is in a relatively stable state, highly reactive, and easy for the target combination. As its coefficient in the model is negative, with the decreasing of the MRECO, the value of log (IC 50 ) is gradually increased.
KSIND represents the three connectivity indexes of the molecule [30], represents the molecule size, shape, and degree of branching, and reflects the dispersion force between the molecule volume and the molecules to a certain extent. The larger the molecule volume, the greater the molecule dispersion force. Table 2 shows that the KSIND value increases along with the increase of the atom number and structure of the substituent, and, therefore, the steric hindrance and dispersion force of the molecule also increase. The introduction of the group with large volume and strong rigidity is against the activity and the combination with the target decreases accordingly, leading to the log (IC 50 ) value decrease, which is in line with the negative coefficient in the model. ZX represents the relative area of the projection part on the ZX plane of the molecule van der Waals [31], with Z and X representing the maximum and minimum inertial axes of the molecule, respectively. The appearing of the model descriptor means that the size of the molecule has great impacts on the log (IC 50 ) value of the drug, and the van der Waals force is an important part of the interaction energy between the subjects and objects. With negative coefficient in the model, the absolute value is relatively large, and, therefore, its increase results in the decrease of the log (IC 50 ) value of the drug. However, the compounds with structures similar to butterfly have higher flexibility and high activity.
MASEOAT [32] represents the minimum atomic state energy of the O atoms in the molecule and is related to the location of the oxygen atoms in the molecule, the molecule structure, and the steric hindrance. The lower the energy states of the oxygen atom, the higher its reactivity, and the easier the target molecule interactions. The description shows that the oxygen atoms in the molecule are related to the biological activity. In the model, the coefficient is positive, indicating that the energy state of the oxygen atom is positively correlated to the log (IC 50 ) value.
In summary, by comparing the data of in vitro inhibitory activities of the three series of A, B, and C, it can be seen that as A, B, and C molecule increases, the activity tends to decrease, suggesting that the smaller the side chain molecule of the 1 is, the more active the molecule is. The series of pyrrolidine compounds have good gelatinase inhibiting activity, and it is found that within a certain range, the larger the side chain of pyrrolidine ring C4, the better the flexibility, and the higher the activity; the activity of aromatic ring substituent is obviously higher than that of the aliphatic hydrocarbon substituent; and the compound with butterfly structure has higher activity.

Conclusions
This study proposes a method to predict the activity inhibition rate of pyrrolidine derivatives on gelatinase (MMP-2, MMP-9) based on HM and GEP. By calculating the molecule structure descriptors and establishing linear and nonlinear QSAR models by HM and GEP, the prediction results are satisfactory. Comparing the results of the two methods, we can see that both the linear HM method and nonlinear GEP method have strong predictive ability and better model stability in the activity inhibition rate of pyrrolidine derivatives on gelatinase (MMP-2, MMP-9), providing a theoretical basis for the in vitro screening of antitumor pyrrolidine derivatives.