This paper proposes a new method that combines the Kennard and Stone algorithm (Kenstone, KS), hierarchical clustering (HC), and ant colony optimization (ACO) with an extreme learning machine (ELM), denoted KS-HC/ACO-ELM, and applies it to density functional theory (DFT) B3LYP/6-31G(d) results to improve the accuracy of calculated Y-NO homolysis bond dissociation energies (BDE). In this method, Kenstone divides the whole data set into a training set and a test set; HC and ACO perform the cluster analysis on the molecular descriptors; correlation analysis selects the most correlated molecular descriptors in the classes; and ELM is the nonlinear model that establishes the relationship between the DFT calculations and the experimental homolysis BDE values. The results show that the standard deviation of the homolysis BDE for the test set is reduced from 4.03 kcal mol−1 with the DFT B3LYP/6-31G(d) method to 0.30, 0.28, 0.29, and 0.32 kcal mol−1 with the KS-ELM, KS-HC-ELM, and KS-ACO-ELM methods and with an artificial neural network (ANN) combined with KS-HC, respectively. The method predicts accurate values much more efficiently than DFT calculations with larger basis sets and may also achieve similarly accurate results for larger molecules.
In the past few decades, quantum chemistry has attracted remarkable attention and made significant improvements along with the corresponding fields of physics, mathematics, and computer science [
Artificial intelligence methods can address the accuracy problem of quantum chemical calculations of molecular bond properties by establishing the relationship between calculated and experimental values instead of solving the wave function rigorously at very high cost. In this way, high-precision results can be obtained without resorting to time-consuming advanced quantum chemical methods and ultra-large basis sets: the combined approach draws on the strengths of both methods, with the artificial intelligence model using simple physical parameters to reduce the systematic errors induced by defects of the theories and functions. Low-level quantum chemical methods with small basis sets are therefore sufficient to supply valid input data for the artificial intelligence methods. Small basis sets also make quantum chemical calculations applicable to larger molecules, which further increases the chance of designing novel lead molecules.
Although the artificial intelligence method is a relatively new multidisciplinary research field, it has received a lot of attention in the past decade and has developed rapidly. In the combination strategy, the quantum chemical method is first used to obtain the molecular properties, and the quantum chemical results are then taken as inputs for statistical methods that establish the relationship between the experimental and calculated values. There are many choices of statistical methods, including linear methods such as linear regression [
Nitric oxide (NO) plays an important physiological role in the human life cycle [
The content of this paper is arranged as follows. First, it describes the HC, ACO, and ELM methods. Second, it illustrates the technology roadmap of the KS-HC/ACO-ELM method. Third, it discusses the calculation results of the Kenstone, HC, ACO, and ELM methods (i.e., the appropriate division into the training set and the test set, the clustering analyses of the molecular descriptors with both known and unknown class numbers, and the establishment of the ELM model, respectively). Finally, the results and the discussion are summarized.
The basic idea of the hierarchical clustering (HC) method is to classify samples according to the distances between them. First, the distances between samples and between classes are defined, with each sample initially forming its own class. The two nearest classes are then combined into a new class, the distances between the new class and the other classes are recalculated, and the classes with the minimum distance are merged again. Each merge reduces the number of classes by one, and the process is repeated until all of the samples fall into a single class. The chart showing the clustering process is called a clustering chart. One feature of the HC method is that once samples are placed in the same class at a certain level, they always belong to the same class in the subsequent levels.
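The merging loop described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration rather than the implementation used in this work: the function name `agglomerative`, the choice of the group-average class distance, and the stopping parameter `n_clusters` are assumptions made for the example.

```python
import numpy as np

def agglomerative(points, n_clusters=1):
    """Repeatedly merge the two nearest classes until n_clusters remain.

    Class distance is the group average of pairwise Euclidean distances
    (an assumed choice). Returns (clusters, merges), where each merge is
    (members_a, members_b, distance) in the order the merges happened.
    """
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    clusters = [[i] for i in range(len(points))]  # every sample starts as its own class
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], float(d)))
        merged = clusters[a] + clusters[b]
        # One class fewer after every merge.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters, merges
```

Calling `agglomerative(points, n_clusters=1)` runs the process to the end, and the returned merge list contains the information that a clustering chart displays.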
There are a variety of methods to define the distance between classes. Different definitions generate different HC analysis methods, which include the following: the shortest-distance method, the longest-distance method, the middle-distance method, the center of gravity method, the group-average method, the variable group-average method, the variable method, and the variance and sum method. The recurrence formula for these methods is
One problem in clustering is that the number of clusters is often unknown in advance. In the ant colony optimization (ACO) clustering method, the samples are regarded as ants with different attributes, and the cluster centers are the food sources that the ants seek. Sample clustering is thus treated as the process of ants searching for food sources.
If
(1) Initially, allocate
(2) Calculate Euclidean distances between classes
(3) Calculate the amount of pheromone on each path. Let
(4) Calculate the probability of
(5) If
(6) Determine whether there is merging or not. If there is no merging, stop the cycle; otherwise, go to step
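The six steps can be illustrated with a small deterministic sketch in Python. The formulas for the pheromone and the merging probability are not reproduced here, so the code assumes a commonly used variant: a path carries pheromone 1 when the two class centers are closer than a radius `radius` and 0 otherwise, the merging probability is the pheromone-weighted inverse distance normalized over each target class, and a pair is merged when its probability reaches a threshold `p0`. The names `aco_cluster`, `radius`, `p0`, `alpha`, and `beta` are hypothetical and chosen only for this example.

```python
import numpy as np

def aco_cluster(samples, radius, p0=0.5, alpha=1.0, beta=1.0, max_iter=1000):
    """Merge classes until no pair reaches the probability threshold p0 (0 < p0 <= 1)."""
    samples = np.asarray(samples, dtype=float)
    clusters = [[i] for i in range(len(samples))]        # step 1: one class per sample
    for _ in range(max_iter):
        centers = np.array([samples[c].mean(axis=0) for c in clusters])
        # Step 2: Euclidean distances between class centers.
        d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)                       # a class never merges with itself
        # Step 3 (assumed rule): pheromone 1 on paths shorter than radius, else 0.
        tau = (d <= radius).astype(float)
        eta = 1.0 / (d + 1e-12)                           # visibility: closer means larger
        weight = tau ** alpha * eta ** beta
        # Step 4 (assumed rule): probability that class i merges into class j.
        col = weight.sum(axis=0, keepdims=True)
        prob = weight / np.where(col > 0, col, 1.0)
        # Steps 5 and 6: merge the most probable pair, or stop if none qualifies.
        i, j = np.unravel_index(np.argmax(prob), prob.shape)
        if prob[i, j] < p0:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters
```

The number of classes is not supplied by the caller; it follows from `radius` and `p0`, which is the property that distinguishes this approach from HC with a fixed class number.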
The extreme learning machine (ELM) was proposed by Huang et al. in 2006 [
A typical structure of a single-hidden layer feed-forward neural network (SLFN).
Briefly, suppose the connection weight
The connection weight matrix
The threshold
The input matrix
The activation function of the neuron in the hidden layer is
Equation (
Because the number of neurons in the hidden layer is equal to the number of samples in the training set, SLFN can approach the training samples with zero error for any
However, when the number of the training set
Therefore, when the activation function
From the above analysis, ELM can randomly generate
(1) Determine the number of the neurons in the hidden layer and randomly set the connection weight
(2) Select an infinitely differentiable function as the activation function of the neurons in the hidden layer, and then calculate the matrix
(3) Calculate the output layer weight
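The training procedure above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the code used in this work: the sigmoid activation, the uniform range of the random weights, and the function names `hidden`, `elm_fit`, and `elm_predict` are choices made for the example. The output-layer weights are obtained with the Moore-Penrose pseudoinverse of the hidden-layer output matrix, which is the standard ELM least-squares solution.

```python
import numpy as np

def hidden(X, W, b):
    """Hidden-layer output matrix H for inputs X, using a sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-(np.asarray(X, dtype=float) @ W + b)))

def elm_fit(X, T, n_hidden, rng):
    """Train an ELM: random input weights and thresholds, analytic output weights."""
    X = np.asarray(X, dtype=float)
    T = np.asarray(T, dtype=float)
    # Input weights and thresholds are drawn once and never trained
    # (the range [-1, 1] is an assumption).
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    # Hidden-layer output matrix for the training set.
    H = hidden(X, W, b)
    # Least-squares output weights via the pseudoinverse of H.
    beta = np.linalg.pinv(H) @ T
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Apply a trained ELM to new inputs."""
    return hidden(X, W, b) @ beta
```

A typical call would be `W, b, beta = elm_fit(X_train, y_train, n_hidden=20, rng=np.random.default_rng(0))` followed by `elm_predict(X_test, W, b, beta)`, where the hidden-layer size of 20 is only a placeholder. Because nothing is iterated, training reduces to a single linear solve.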
The technology roadmap of the ELM method based on KS-HC/ACO to improve the accuracy of the molecular bond energies calculated by DFT is shown in Figure
The flow chart of the KS-HC/ACO-ELM model.
The basic idea of the Kenstone method [
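The Kennard-Stone selection is commonly described as a max-min distance procedure: the two most distant samples are selected first, and the sample farthest from its nearest already-selected neighbor is then added repeatedly. The sketch below follows that common description. The function name `kennard_stone`, the use of the Euclidean distance in descriptor space, and the convention that the unselected samples form the test set are assumptions made for the example.

```python
import numpy as np

def kennard_stone(X, n_train):
    """Return (train_indices, test_indices) chosen by max-min distance; n_train >= 2."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Start with the two samples that are farthest apart.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_train:
        # Distance from each remaining sample to its nearest selected sample.
        min_d = dist[np.ix_(remaining, selected)].min(axis=1)
        # Add the sample that is farthest from everything selected so far.
        pick = remaining[int(np.argmax(min_d))]
        selected.append(pick)
        remaining.remove(pick)
    return selected, remaining
```

Because the procedure has no random component, the same data always give the same split.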
It is critical for HC to adopt an appropriate distance and clustering method. The distance should preferably exclude the dimensional effect (differences in units and scale among the descriptors), and the standard Euclidean distance is usually a reasonable choice. As for the clustering method, the shortest-distance method contracts the space too much and the longest-distance method expands it too much. These two methods are simple, but they are extreme classification methods. The more appropriate methods are therefore the average-distance method, the weighted-average method, and the center of gravity method.
There is no established standard for evaluating the quality of clustering results, and applying a clustering method well depends on the researcher's skill and experience with the objects being classified. If the clustering method is applied properly, the critical scale (the distance at which samples are merged in the clustering chart) and the original distance matrix should be highly correlated. This correlation is measured by the cophenetic correlation coefficient, which can be used to evaluate the clustering results. The coefficient usually lies between zero and one, and the closer it is to one, the more faithfully the clustering preserves the original distances. However, a high value is only statistically significant; it does not mean that the result is physically meaningful, and sometimes the result has no practical meaning.
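As an illustration of how such a coefficient can be computed, the sketch below builds an average-distance hierarchy, records the merge distance for every pair of samples, and correlates those values with the original pairwise distances. It is a minimal example under stated assumptions: the function name `cophenetic_correlation` is hypothetical, and only the average-distance method with the Euclidean distance is covered.

```python
import numpy as np

def cophenetic_correlation(points):
    """Pearson correlation between original and cophenetic (merge) distances."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    coph = np.zeros_like(dist)
    clusters = [[i] for i in range(n)]
    while len(clusters) > 1:
        # Average distance between every pair of current classes.
        pairs = [(dist[np.ix_(clusters[a], clusters[b])].mean(), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        height, a, b = min(pairs)
        # Samples joined by this merge get the merge distance as cophenetic distance.
        for i in clusters[a]:
            for j in clusters[b]:
                coph[i, j] = coph[j, i] = height
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    upper = np.triu_indices(n, k=1)   # each pair of samples counted once
    return float(np.corrcoef(dist[upper], coph[upper])[0, 1])
```

Repeating the calculation with other class-distance definitions gives one coefficient per clustering method, which is the kind of comparison used to choose among them.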
The cophenetic correlation coefficients are shown in Table
Cophenetic correlation coefficients of various clustering methods based on the Euclidean distance.
Clustering methods | Longest-distance method | Shortest-distance method | Average-distance method | Weighted-average method | Center of gravity method | Middle-distance method | Sum of squares |
---|---|---|---|---|---|---|---|
Cophenetic correlation coefficients based on the Euclidean distance | 0.7726 | 0.6685 | 0.8381 | 0.8246 | 0.8113 | 0.7739 | 0.7995 |
Using the average-distance method with the Euclidean distance, the 12 molecular descriptors are clustered into classes, with the number of classes ranging from 2 to 11. Only the 7-class case is taken as an example here. The clustering pedigree chart of the average-distance method is shown in Figure
Clustering pedigree chart of the average-distance method.
The HC method needs a predefined number of clusters, so every possible number of classes has to be tried. This is feasible when the sample (i.e., the number of molecular descriptors) is small, but when the sample is large, it is neither necessary nor practical to list every classification exhaustively. Therefore, in this paper, an ACO method that does not need a predefined number of clusters is also applied to the screening of molecular descriptors, so that the ELM model generalizes to either a small or a large number of descriptors.
According to the definition of the ACO, the optimization function is the minimum ratio of the within-class distance to interclass distance. This means the distance among the classes should be large, but the distance between two samples within a class should be small:
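One way to evaluate a ratio of this kind for a given partition is sketched below. The exact definition is an assumption: here the within-class distance is the mean distance of samples to their class center, averaged over classes, and the interclass distance is the mean distance between class centers. The function name `within_between_ratio` is hypothetical, and at least two classes are required.

```python
import numpy as np

def within_between_ratio(samples, labels):
    """Ratio of mean within-class distance to mean distance between class centers."""
    samples = np.asarray(samples, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    centers = np.array([samples[labels == c].mean(axis=0) for c in classes])
    # Mean distance of the samples to their own class center, averaged over classes.
    within = np.mean([
        np.linalg.norm(samples[labels == c] - centers[k], axis=1).mean()
        for k, c in enumerate(classes)
    ])
    # Mean distance between distinct class centers.
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    between = d[np.triu_indices(len(classes), k=1)].mean()
    return float(within / between)
```

A compact, well-separated partition gives a small ratio, so a smaller value indicates a better clustering under this criterion.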
The results of the ACO calculation of the clustering molecular descriptors (
The correlation coefficients between the 12 molecular descriptors and homolysis BDE experimental values were calculated, and they are 0.64, 0.46, 0.49, 0.02, 0.12, 0.18, 0.28, 0.51, 0.17, 0.43, 0.05, and 0.27. According to the magnitude of the correlation coefficients (e.g., by HC calculation) when
Errors of the calculated homolysis BDE and experimental values for the test set. (a) Errors calculated by the B3LYP/6-31G(d), KS-ELM (including all the molecular descriptors), KS-HC-ELM, and KS-ACO-ELM methods for twelve molecules in the test set. (
Horizontal axis: test set (number).
To assess the KS-HC/ACO-ELM method, the results of KS-ELM (using all molecular descriptors without screening), KS-HC-ELM (the best results are obtained when the molecular descriptors are clustered into 7 classes), and KS-ACO-ELM are compared with the results calculated by B3LYP/6-31G(d). A three-layer artificial neural network calculation (KS-HC-ANN) was also performed for comparison. The ANN uses the same number of inputs and 6 hidden neurons (shown in Figure
The STD after correction by the KS-ELM method is much smaller than that of the B3LYP/6-31G(d) results, and the STDs obtained by KS-HC-ELM and KS-ACO-ELM are smaller still. This indicates that screening the molecular descriptors to eliminate redundant ones is worthwhile. The STD is reduced from 4.03 to 0.30, 0.28, 0.29, and 0.32 kcal mol−1 after correction by the KS-ELM, KS-HC-ELM, KS-ACO-ELM, and KS-HC-ANN methods, respectively, so the results corrected by the KS-HC-ELM method are the closest to the experimental values. If trivial features are introduced into ELM, the accuracy of the model may decrease, and if the parameters are selected inappropriately, overfitting may also appear. Although the KS-HC-ELM results are slightly better than the KS-ACO-ELM results in this experiment, this cannot be assumed to hold under every circumstance because each method has its own advantages. Clustering methods can be divided into two categories: those that require a predefined number of classes and those that do not. Because the ELM inputs affect its performance and the best combination of molecular descriptors is difficult to determine, both the HC method and the ACO method are suitable when the number of molecular descriptors to be clustered is small, whereas the ACO method is more appropriate when the number is large.
Nitric oxide (NO) is a very important signaling molecule in mammalian systems. When the release of NO in the body is impaired, NO carrier molecules may be taken as drugs to treat the resulting disease. Measuring the homolysis BDE of the Y-NO bond is complicated with the various experimental methods, and chemical accuracy is rather difficult to achieve. The DFT method has been very popular in computational chemistry for the past two decades because many computational studies have shown that DFT calculations can capture the physical essence of molecules. For small molecules, the accuracy can reach a high level, but for large molecules, the calculation is expensive and the accuracy is quite poor.
In this paper, the combination of the KS-HC/ACO-ELM and DFT calculation methods succeeds in improving the accuracy of the DFT-calculated Y-NO homolysis BDE. The results show that with the KS-ELM, KS-HC-ELM, and KS-ACO-ELM methods, the STD of homolysis BDE of 92 organic molecules in the test set decreases from 4.03 to 0.30, 0.28, and 0.29 kcal mol−1, respectively. This shows that the KS-HC/ACO-ELM method based on B3LYP/6-31G(d) can remarkably improve homolysis BDE calculations, and its results may serve as high-accuracy reference values for experiment. The adoption of Kenstone makes the model more generalized. Although the model is not transparent and the descriptor space is complex, the reproducibility of the calculation remains very important. Selecting the training set and test set with the Kenstone method makes the model more robust and reproducible and makes the calculation results more convincing. Selecting appropriate molecular descriptors with HC/ACO can eliminate subjective bias toward particular input descriptors. Compared to the ANN method, the nonlinear model created by ELM not only avoids sensitive parameters and local minimum problems but also achieves more accurate results with less calculation time and fewer computing resources.
The KS-HC/ACO-ELM method expands the feasibility and applicability of the B3LYP/6-31G(d) method, and the more experimental data and molecular descriptors are available, the higher the accuracy the KS-HC/ACO-ELM method is expected to achieve. Such a high-precision method can be used to design novel NO-releasing drug molecules. We believe that the KS-HC/ACO-ELM method can effectively improve the accuracy of DFT B3LYP calculations even with a smaller basis set (e.g., STO-3G). The method can also be applied to correct other properties such as heterolytic bond dissociation energies, absorption energies, ionization energies, and heats of formation. Further studies are ongoing. This study provides an effective tool (the combination of the KS-HC/ACO-ELM and DFT methods) to improve and predict highly accurate homolysis BDE of molecular NO carrier systems.
The authors gratefully acknowledge the financial support from the Science and Technology Development Planning of Jilin Province (20130522109JH, 20110364, and 20125002) and the Fundamental Research Funds for the Central Universities (11QNJJ008 and 11QNJJ027).