Computational Study of Estrogen Receptor-Alpha Antagonist with Three-Dimensional Quantitative Structure-Activity Relationship, Support Vector Regression, and Linear Regression Methods

Human estrogen receptor (ER) isoforms, ERα and ERβ, have long been an important focus in the field of biology. To better understand the structural features associated with the binding of ERα ligands to ERα and modulate their function, several QSAR models, including CoMFA, CoMSIA, SVR, and LR methods, have been employed to predict the inhibitory activity of 68 raloxifene derivatives. In the SVR and LR modeling, 11 descriptors were selected through feature ranking and sequential feature addition/deletion to generate equations to predict the inhibitory activity toward ERα. Among four descriptors that constantly appear in various generated equations, two agree with CoMFA and CoMSIA steric fields and another two can be correlated to a calculated electrostatic potential of ERα.


Introduction
Estrogens are critical in the physiology of the female reproductive system, the maintenance of bone density, and cardiovascular health [1,2]. Estrogen receptors are classified into two isoforms, ER and ER , both of which are members of the nuclear receptor superfamily of ligand-modulated transcription factors [3,4]. When the natural ligand estradiol or other ligands bind to ER , complex signaling networks lead to a conformational change, specifically in the activation function (AF)-2 helix (H12), allowing estradiol to bind to chromatin; this, in turn, activates or inhibits responsive genes [5,6]. ER and ER are the targets of pharmaceutical agents used to fight cancers of the reproductive organs, for example, prostate, uterine, and breast cancer [6,7]. These pharmaceutical agents are divided into three distinct categories: (i) receptor agonists such as 17 -estradiol, the estrogen receptor's natural ligand; (ii) antiestrogens, such as the compound ICI 164,384 [5,8]; and (iii) raloxifene (arylbenzothiophene) [5,9] and tamoxifen [10], both of which act as agonists as well as antagonists. Raloxifene (compound 25 in Table 1) is a selective estrogen receptor modulator (SERM) providing a safer alternative to estrogen because it is an ER antagonist in mammary tissue and the uterus and also mimics the agonist effects of estrogen on bone and in the cardiovascular system [11]. The U.S. Food and Drug Administration (FDA) recently approved raloxifene for the treatment of osteoporosis [12], and it is also being tested as a preventive drug against breast cancer and coronary heart disease [5,9]. Because drug resistance and serious side effects, such as venous thromboembolism and fatal stroke, have been reported [13], there is a crucial need for new therapeutic agents. Two major 2 International Journal of Medicinal Chemistry strategies to achieve this are indirect ligand-based and direct receptor-based approaches, both of which could provide a deeper understanding of the structure-activity associations, thereby enabling the development of new compounds with increased activity and selectivity profiles for specific therapeutic targets. Support vector machine (SVM) is a statistic approach developed for classification and regression. When this tool is applied to the regression, it is commonly called support vector regression for clarity. Because of its prominent prediction and generalization capability, it is widely adopted in various fields. Lately it has been applied to QSAR field in evaluating physicochemical parameters such as solubility, lipophilicity, polarity, and steric properties and further predicting affinity [14][15][16][17][18]. The linear regression model seeks a linear combination for input variables x 0 = ( 1 , 2 , . . . , ) that best fits the target variable . The model can be formulated as = + = 0 + 1 1 + 2 2 + ⋅ ⋅ ⋅ + + , where variable is the predicted value and is the prediction error. The weight parameters 1 , 2 , . . . , are associated with 1 , 2 , . . . , , and the parameter 0 imposes an offset on the model. This represents the simplest form for regression.
In this work, a number of models capable of predicting the inhibitory activity of 68 raloxifene derivatives [19] were constructed. 3D-QSAR models, adopting the widely used approaches CoMFA [20,21] and CoMSIA [22], provide spatially specific pharmacophoric features for future synthesis. 2D-QSAR models on the base of physicochemical descriptors selected by SVR and LR methods were also performed to seek an alternative approach in relating structural features to affinity between ER and the raloxifene derivatives. In all, this information provides clear guidelines for the synthesis of additional compounds accelerating combinatory chemistry in the development of new drugs.

Data Set and Biological
Activity. This study considered 68 compounds of raloxifene derivatives in a core of arylbenzothiophene [19]. Structural information and bioactivity associated with MCF-7 cells are listed in Table 1. In 3D-QSAR modeling, 56 compounds formed a training set and 10 compounds formed a test set to externally examine the models. Compounds 9 and 37, both with estimated IC 50 = 1000 nM, were removed because they were always outliers in the training or test set, and retaining them made the models unacceptably unstable. It is likely that their exact IC 50 values lie somewhere between 600 and 1000 nM. The test set compounds and compounds not included in modeling are marked in Table 1. In SVR and LR modeling, all 68 compounds were included to choose descriptors for model construction.

Structure Preparation and Alignment.
Gasteiger-Hückel charge assignment and a Tripos force field were used to prepare the structure of the compound. The geometry of each arylbenzothiophene derivative was minimized using the simplex algorithm followed by the Powell algorithm to an energy convergence criterion of 0.05 kcal/molÅ. The alignment of compounds is an essential step in determining the structure-activity relationship because the maximized overlap of pharmacophoric features responsible for producing a biological response greatly increases the correlation between structure and activity. A ligand-based approach was adopted in this study, in which each compound in its energetically minimized geometry was aligned according to the core structure, as illustrated in Figure 1(a). The alignment results are given in Figure 1(b). It is notable that the 68 compounds were aligned in 3D space such that most of structural features common to all of the compounds had the same Cartesian coordinates.

CoMFA and
CoMSIA. This study used molecular modeling software Sybyl 8.1 (Tripos International, St Louis, MO) for the CoMFA and CoMSIA models. Two CoMFA descriptors, steric (Lennard-Jones 6-12 potential) and electrostatic (Columbic potential) field energies, were calculated using an sp3 carbon atom carrying a +1.0 charge set at default parameters, to serve as a probe atom. In addition to steric and electrostatic fields, CoMSIA also considers hydrophobic and hydrogen bond donor/acceptor interaction. These five similarity indices were calculated using a Gaussian-type distancedependent function using a default attenuation factor of 0.3. The probe atom was set to the same default parameters used in CoMFA. Both CoMFA and CoMSIA use pIC 50 as the target variable in partial least squares (PLS) regression [23] to derive 3D-QSAR models. The predictive value of the model was evaluated by calculating the leave-one-out cross-validated (LOOCV) coefficients, 2 [24], using the following equation: where pred is predicted affinity (calculated by model), actual is actual affinity (obtained by experiment), and is mean actual affinity. The term ∑ ( pred − actual ) 2 is the predictive sum of squares (PRESS). The number of components giving the lowest PRESS value determines the optimum number of component (ONC) to generate the final PLS regression model. The conventional coefficient, or the non-crossvalidated correlation coefficient, 2 , was subsequently calculated to characterize the statistics of the built model. In general, a 2 > 0.5 is an indication of internal predictability [25,26], whereas an 2 > 0.5 indicates that the constructed model is fairly good and interpretative [26,27]. 2.5. Feature Selection Procedure. The feature selection method for choosing proper descriptors is composed of feature ranking and sequential feature addition or deletion. We adopt the idea of maximal correlation and minimal redundancy.
The objective formula is given as follows: where T denotes any feature subset, T * represents the optimal feature subset, ( , ) denotes the correlation function between variables and , and denotes the universal set consisting of all available features, = { 1 , 2 , . . . , }. The value of is a weight that can be adjusted to represent the relative importance of these two terms.
Since solving T * is an optimization problem, it will inevitably involve a combinatorial search. If an exhaustive search is applied, O(2 |F| ) cases should be examined. In order to avoid an exhaustive search, we followed the idea of Peng et al. [28] and adopted a sequential and greedy search approach. We defined the ranking score of an unselected feature as where T denotes the selected feature subset and denotes the target value. After the feature ranking is obtained, the RMSE (root mean square error) √ ∑ ( − ) 2 / was tested by crossvalidation in a sequential forward manner. The next step is to locate where the minimal RMSE takes place, say , and select the top ranking features. Subsequently, a sequential feature deletion and a sequential feature addition procedure were applied for rounds. Finally, assuming not too many features are kept, the reserved features are subject to an exhaustive search and export the top feature subsets. The entire procedure is given as follows.
Procedure: Feature Subset Selection for Regression.
Input. The independent variable is X and target variable is y. The round number is for sequential feature deletion and addition procedure, and is for the top ranking feature subsets. Assume the linear regression method [29] is adopted to evaluate RMSE.
Output. The top ranking feature subsets from the reserved feature set T .
Step 1. Apply a sequential search approach to determine the feature ranking.
Step 2. Locate the feature subset associated with the minimal RMSE.
Step 3. For = 1 to do Step 3.1. Apply a sequential feature deletion process to the selected features and determine which features are to be removed.
Step 3.2. Apply a sequential feature addition process to append unselected features to the selected features and determine which features are to be added.
Step 4. Assume the reserved feature set to be T . Apply an exhaustive search to the reserved features and export the top ranking feature subsets among T . Table 2 are the statistic results of 3D-QSAR modeling. We used the partial least squares regression method [23] with the leave-one-out cross-validation procedure [24] to determine the optimum number for the principal components. In the two models created, the leave-one-out cross-validated correlation coefficients ( 2 ) all reached the criterion 2 ≥ 0.5, and all statistics with the conventional, non-cross-validated correlation coefficients were greater than 0.8. In the CoMFA   Figures 2(a) and 2(b)).

Statistics for CoMFA and CoMSIA Models. Listed in
Most of the absolute residual values, particularly for the training set data points, were less than 1 logarithm unit.

Statistics of SVR and LR Models.
The original data set contains 68 instances, each of which consists of one pIC 50 value and 231 descriptors (features). Since our goal is to use the descriptors to predict the pIC 50 value, it is reasonable to involve descriptors that are highly correlated with the pIC 50 value. Any descriptor that has very few distinct values is regarded as invariant to the pIC 50 value and thus would not 7 : Shadow area fraction: plane (spatial descriptors). 8 : Shadow ratio (spatial descriptors). 9 : Dipole moment (spatial descriptors). 10 : SASA (jurs descriptors). 11 : RPCS (jurs descriptors).
facilitate the prediction. The checking method is to calculate the median absolute deviation (MAD), which is given by Med(Abs(x − Med(x))), where Med and Abs denote median and absolute operators, respectively. There are totally 120 descriptors whose MAD values are equal to zero. Consequently, only 111 descriptors are employed for the subsequent processing. Before performing the regression process, a normalization procedure is applied to the reserved descriptors, that is, = ( − x)/ x , where x and x represent mean and standard deviation for the descriptor , respectively. We applied the feature selection procedure with = 1 on the data set. During the feature selection process, the linear regression was used to evaluate RMSE. Because only 11 descriptors are left for the exhaustive search, we set = 2 11 − 1 to let the program export all combinations. The intercorrelations between the selected 11 features, as well as the intercorrelations between each feature and pIC 50 , are listed in Table 3. Figure 3(a) is a superimposed image of two steric fields generated using CoMFA and CoMSIA on MCF-7 cell inhibition. Both steric models indicate that the regions around C2 and C3 are steric-favorable. This explains why the activity of compound 55 (IC 50 = 0.8 nM), the 1 -naphthyl of which is in contact with the green contour, was 100 times higher than that of compound 66 (IC 50 = 80 nM), the 2 -naphthyl of which is not in contact with the green contour but in contact with a steric-unfavorable region in yellow. Likewise, compound 59 (IC 50 = 3 nM and an isopropyl group to replace the phenyl ring) was more potent than compound 64 (IC 50 = 20 nM and a smaller ethyl group to replace the phenyl ring); compound 24 (IC 50 = 2.5 nM) was more active than compound 34 (IC 50 = 100 nM, with a phenyl group on C4 and being in contact with the yellow steric-unfavorable contour). Near C6 a steric-favorable contour was observed in CoMFA. This tiny green contour explains why compound 4 (IC 50 = 20 nM, with an ethynyl group on C6) was more active than compound 8 (IC 50 = 300 nM, with a methyl group on C6).

Electrostatic Fields Determined by CoMFA and CoMSIA
Models. Figure 3(b) shows two electrostatic fields generated by CoMFA and CoMSIA. Although the two electrostatic models were not identical, there was no conflict. In CoMSIA an electronegativity favorable red contour surrounds the phenyl moiety, indicating that a heteroatom with a partial negative charge would have a positive effect on inhibitory activity. This explains why compounds 27 (IC 50 = 1nM, with a chlorine) and 28 (IC 50 = 2.3 nM, with a fluorine) are more active than compound 33 (IC 50 = 50 nM, with a methyl group). In the vicinity of the CoMSIA's red contour, a blue contour was observed in CoMFA. Together these two contours suggest that a hydroxyl group attached to C4 increases activity.
Both CoMFA and CoMSIA show a contour favorable to a negative charge near C6 and a contour favorable to a positive charge farther away, indicating that a hydroxyl group herein would increase activity. The activities of compounds 25 (IC 50 = 0.2 nM, with a hydroxyl group), 4 (IC 50 = 20 nM, with an ethynyl group), and 8 (IC 50 = 300 nM, with a methyl group), differing in the substituent on C6, varied according to this electrostatic feature.

Hydrogen Bond Donor and Acceptor Fields Determined by CoMSIA Model. Preferences of hydrogen bond donors and acceptors are presented in Figures 3(c) and 3(d), respectively.
A number of hydrogen bond donor favorable/unfavorable contours are in the vicinity of C6 (Figure 3(c)). The activity of compound 25 (IC 50 = 0.2 nM), whose hydroxyl hydrogen atom is in contact with one cyan contour, is higher than that of compound 8 (IC 50 = 300 nM), with a methyl group on C6. Two hydrogen bond acceptor favorable contours surround C4 and C6. Accordingly, compound 28 (IC 50 = 2.3 nM, with a fluorine on C4 ) is more potent than compound 33 (IC 50 = 50 nM, with a methyl group on C4 ), and compound 7 (IC 50 = 250 nM, with a methoxy group on C6) is slightly more active than compound 8 (IC 50 = 300 nM, with a methyl group on C6). Meanwhile, the characteristic of favoring hydrogen bond acceptors near C4 and C6 confirms the red electronegative contours in Figure 3(b). Figure 4, we superimposed CoMSIA fields onto the activity site of ER (PDB code: 1ERR) [30] to reveal the correlation between the observed fields and ER 's amino acids involved in the binding of modulators. The raloxifene structure used in our 3D-QSAR modeling was obtained by energy minimization and therefore was slightly different from the ER bound that one retrieved from PDB (1ERR). The RMSD between the two raloxifene structures is 0.66Å, with a minor deviation caused by the orientation of the long chain extended from C3. Since the contour maps in CoMFA and CoMSIA models are about the phenyl and benzothiophene moieties, projecting the contour maps onto the ER binding cavity for discussion is proper. As shown in Figure 4(a), the green, steric favorable contour matches the empty area around Leu525 and Leu428, whereas the yellow, steric unfavorable contour corresponds to the corner surrounded by residues of His524, Ile424, and Met421. In Figure 4(b), the negative and positive charge favorable contours on C6 point toward the positively charged guanidinio of Arg394 and negatively charged carboxylic group of Glu353, respectively. Moreover, the blue contour above the phenyl ring moiety is related to the C4 red contour. That is, a reduction in phenyl ring electronegativity caused by the electron-withdrawing heteroatom adjacent to C4 benefits the interaction of the inhibitor and ER . Consequently, the resulting positive charge of the phenyl ring increases the interaction between the inhibitor and Met421 sulfur atom, carrying a partial negative charge. Such electrostatic attractions help discriminate the binding of the inhibitor to ER from ER , as pointed out earlier in Salum's CoMFA model in ligand binding selectivity over ER and ER [31]. ER and ER isoforms share an overall 58% sequence identity in binding domain, particularly their ligand-binding cavities, which differ by only two amino acids of highly conserved characteristics-Leu384 and Met421 on ER and Met336 and Ile373 on ER . Met421 in ER and Ile373 on ER are highly involved in the accommodation of ligands, and are regarded as pivotal in the process of selectivity [32,33]. Figure 4(c) shows that the contour near C6 favorable to the hydrogen bond donor points toward the carboxylic oxygen atoms of Glu353. In Figure 4(d), the contour favorable to the hydrogen bond acceptor on C6 points toward the guanidinio hydrogen atoms of Arg394 and a contour favorable to the hydrogen bond acceptor on C4 points toward His524 amide hydrogen. In all, the hydroxyl groups located on C4 and C6 in conjunction with the residues of Glu353, Arg394, and His524 were demonstrated to form a stable hydrogen bonding network.

SVR and LR Results.
For the previously mentioned selected 11 features, totally 2047 cases were to be examined. We applied the linear regression and leave-one-out cross validation (LOOCV) techniques to evaluate all the 2047 cases. The upper part of Table 4 gives the top 10 feature subsets, including formulas and the corresponding LOOCV RMSEs, based on LR. The feature subsets are ranked in terms of RMSEs. It is shown that the best RMSE is 0.7364, which is associated with eight features. To compromise between the model complexity and prediction capability, we adopted the 7th LR model equation, which consists of six features and whose RMSE is 0.7484, to demonstrate the prediction of pIC 50 listed in LR column, Table 1. Figure 2(c) plots the actual pIC 50 values against the predicted values based on this model equation.
In addition to the linear regression (LR), we also applied the linear support vector regression (SVR) [34,35]

Comparison of SVR Prediction with
CoMFA and CoM-SIA Models. Models equations derived from LR and SVM approaches indicate that a number of features consistently provide contributions in determining the target variable. Variables 6 , 7 , 9 , and 11 appear in all the derived equations in both SVM and LR models, whereas variables 3 and 4 appear in most of the derived equations. Meanwhile, variables 6 , 9 , and 11 are quite stable in being positively correlated to the target variable, while 3 , 4 , and 7 are negatively correlated to the target variable. Variable 6 represents the molecular shadow area projected on ZX plane; variable 7 represents the shadow area fraction on YZ plane. These two descriptors are found in accordance with CoMFA and CoMSIA steric fields shown in Figure 3(a), where the Cartesian coordinate frame is specified. The positive sign of 6 coefficients in the derived equations suggests that an increase in molecular shadow area on ZX plane enhances inhibitory activity, and this is in agreement with Figure 3(a)'s green, steric-favorable contours. That is, along the -axis point-of-view, the shadow area on YZ plane can be extended by adding bulky groups in contact with the green contours. Compound 55 (IC 50 = 0.8 nM) with 1 -naphthyl modification is a good example. Variable 7 , the shadow area fraction on YZ plane, is negatively correlated to inhibitory activity and can be correlated to the yellow steric-unfavorable contour in Figure 3(a). That is, an elongated side chain attached on C4 would increase the shadow area projected on YZ plane (which can be seen with a view point along the -axis) and reduce the activity.
Variables 9 , the dipole moment about the -axis, and 11 , a Jurs descriptor that is associated with relative positive charge surface area, are both positively correlated to the activity. Analysis on dipole moment about -axis shows that compounds with positive values possess higher activity, which implies that the activity can be boosted by positive charges distributed on compound surface. Together, features 9 and 11 suggest that the positive electrostatic potential benefits the inhibitory activity. These findings could be related to the electronegativity of the gate of ER binding pocket ( Figure 5) in which an inhibitor with a partial positive charge enters more easily. The electrostatic potential shown in Figure 5 is based on the solved X-ray structure in PDB code 1ERR [30].

Conclusion
Our results have shown that the hydroxyl groups on both C6 and C4 are irreplaceable, due to the strong hydrogen bonding network linking to Glu353 and Arg394 on C6 side and to His524 on C4 side. Accordingly, compounds 25 (raloxifene), 26, 45, and 55, possessing two hydroxyl groups at C4 and C6 sites, have satisfactory IC 50 values. Earlier results from the literature showed that in cases of E1 (estrone), E2 (17estradiol), and E3 (estriol) replacing the hydroxyl groups with methoxy eliminated the affinity toward both ER subtypes [36][37][38]. Likewise, compounds 7, 20, 21, 22, and 23 with a methoxy group on C6 held poor IC 50 values because of disruption to the hydrogen bond network and steric disfavor.
Comparison of RMSEs among different feature combinations suggests that if all 231 features are adopted for regression, the RMSEs are not good. On the other hand, if the appropriate feature selection method is used, the performance gets improved. From the results, we can see that most of the RMSEs obtained by SVR outperform those of the LR. This may be attributed to the well-selected features and prominent prediction capability of SVR, because the selected features are not specialized to the evaluation method. In summary, the best RMSE is 0.7580 when ten features are  adopted to perform SVR. If the subsets of only 5 features are considered, the best RMSE of SVR is 0.7273. In the present study models built on different methods were successfully employed to gain detailed insights on the structure of ER modulators. Accordingly, the clues derived from contour analysis can be used for further design work based on arylbenzothiophene and for screening large chemical databases for compounds with potential ER activity.