An Accurate and Efficient Method to Predict Y-NO Bond Homolysis Bond Dissociation Energies

Thepaper suggests a newmethod that combines theKennard and Stone algorithm (Kenstone, KS), hierarchical clustering (HC), and ant colony optimization (ACO)-based extreme learning machine (ELM) (KS-HC/ACO-ELM) with the density functional theory (DFT) B3LYP/6-31G(d) method to improve the accuracy of DFT calculations for the Y-NO homolysis bond dissociation energies (BDE). In this method, Kenstone divides the whole data set into two parts, the training set and the test set; HC and ACO are used to perform the cluster analysis on molecular descriptors; correlation analysis is applied for selecting the most correlated molecular descriptors in the classes, and ELM is the nonlinearmodel for establishing the relationship betweenDFT calculations and homolysis BDE experimental values. The results show that the standard deviation of homolysis BDE in the molecular test set is reduced from 4.03 kcalmol calculated by the DFT B3LYP/6-31G(d) method to 0.30, 0.28, 0.29, and 0.32 kcalmol by the KS-ELM, KS-HCELM, and KS-ACO-ELM methods and the artificial neural network (ANN) combined with KS-HC, respectively. This method predicts accurate values with much higher efficiency when compared to the larger basis set DFT calculation and may also achieve similarly accurate calculation results for larger molecules.


Introduction
In the past few decades, quantum chemistry has attracted remarkable attention and made significant improvements along with the corresponding fields of physics, mathematics, and computer science [1][2][3][4][5][6][7].Quantum chemical methods have captured the physical essence of molecules through numerous research efforts, to the point where various molecular properties (including geometry, dipole moment, polarizability, thermodynamics, and excited states) can be obtained without experiments.However, quantum chemical calculations for large molecules can be quite demanding, in particular for high-accuracy calculations that need to solve the Schrödinger equation thoroughly.Therefore, accuracy often has to be sacrificed for practical reasons, for example, introducing approximations when solving the wave function for large molecular systems.Despite these approximations, the size of molecule is still limited.For a large system with a regular structure, the quantum chemical methods can only give a qualitative description.For complex systems, it is hard to calculate with existing quantum chemical methods; even if the calculation can be carried out, the accuracy might be very low and the error is far beyond the specifications in scientific research [8,9].Therefore, alternative methods for large or complex systems need to be developed to meet the challenges of molecular calculations.
Artificial intelligence methods can resolve the accuracy problem of quantum chemical calculations for molecular bond properties by establishing the relationship between the calculated and experimental values instead of strictly solving wave functions with very high cost.By this method, highprecision calculation results can be obtained without wasting time on advanced quantum chemical methods and ultra-large basis sets because this combination-type calculation method can take mutual advantages of two methods to reduce the systematic errors induced by defects of theories and functions with simple physical parameters adopted in artificial intelligence methods.Thus quantum chemical calculations with low-level methods and small basis sets are sufficient for obtaining valid data from artificial intelligence methods.

Mathematical Problems in Engineering
Meanwhile, small basis sets enable quantum chemical calculations to apply to larger molecules, further increasing the chance to design novel lead molecules.
Although the artificial intelligence method is a relatively new multidisciplinary research field, in the past decade, it has received a lot of attention and thus has developed rapidly.The combination strategy consists of the quantum chemical method first being used to obtain the molecular properties and then take the quantum chemical calculation results as inputs for the statistical methods to establish the relationship between the experimental and calculation values.There are many choices on statistical methods including linear methods such as linear regression [10][11][12][13][14], nonlinear methods such as neural networks , and support vector machines [38][39][40].Usually, the linear regression method is simple and intuitive, but problems in molecular systems are mainly nonlinear; for the same descriptor inputs, neural networks can better solve complicated nonlinear problems that might be too difficult to be modeled by mathematical equations [27].However, the traditional neural networks have the following major drawbacks.(1) The speed of training is slow.The gradient descent method generally takes multiple iterations to converge, so the training process takes a long time.(2) Because it is easy to fall into a local minimum, the global minimum can be missed during the calculation.(3) The selection of learning rate  is sensitive.The learning rate  has a great impact on the performance of the neural network, so the proper  must be chosen to obtain a stable network.If  is too small, the rate of algorithm convergence is very slow and the training process takes a long time; on the other hand, if it is too large, the training process might be unstable (divergent).Therefore, it is important to explore a training algorithm where the learning rate is comparably high, which increases the possibility of acquiring the global optimum solution.This leads to better generalization for the prediction.The extreme learning machine (ELM) method avoids these defects, thus exhibiting a better performance than neural networks [3,4].
The content of this paper is arranged as follows.First, it describes the HC, ACO, and ELM methods.Second, it illustrates the technology roadmap of the KS-HC/ACO-ELM method.Third, it discusses the calculation results of the Kenstone, HC, ACO, and ELM methods (i.e., the appropriate classification of the training set and the test set, clustering analyses for molecular descriptors with both the known and unknown class numbers, and the establishment of the ELM model, resp.).Finally, the results and the discussions are summarized.

Hierarchical Clustering (the Known Number).
The basic idea of the hierarchical clustering (HC) method is to classify samples in terms of distances.Firstly, it defines the distance between samples and classes and combines the nearest two classes into classes of samples, and then it recalculates the distance between the new class and other classes and classifies them according to the minimum distance.By reducing one class at a time, this process is repeated until all of the samples come into one class.The chart showing the clustering process is called a clustering chart.One of the features of the HC method is that when samples are classified into a class in a certain level, they will always belong to the same class in the subsequent divisions.
There are a variety of methods to define the distance between classes.Different definitions generate different HC analysis methods, which include the following: the shortestdistance method, the longest-distance method, the middledistance method, the center of gravity method, the groupaverage method, the variable group-average method, the variable method, and the variance and sum method.The recurrence formula for these methods is where   is the distance between classes.The values of parameters , , and  vary according to the different methods and are adopted in the program.The default values of the parameters are used in the calculations.

Ant Colony Optimization (the Unknown Number).
One of the clustering problems is that the number of clusters is unknown.In the ant colony optimization (ACO) clustering method, the samples are thought of as ants with different attributes, and the cluster center is the food sources that the ants seek.Therefore, the sample clustering is the mechanism of ants looking for food sources.
If  is a data set to be classified and  is the feature set of the samples, the clustering algorithm is as follows.
(1) Initially, allocate  samples into  classes because one sample belongs to one class.
(2) Calculate Euclidean distances between classes   and   : where   represents the Euclidean distance between classes  and ,    is the cluster center vector, and   is the number of samples in class   .
(3) Calculate the amount of pheromone on each path.
Let  be the radius of the cluster.  () is the residual pheromone at time  on the path from   to   .The pheromone on the path (, ) is where  =  +  min + ( max −  min ) •  and  and  are constants.
(4) Calculate the probability of   merging into   : where  = { |   ≤ ,  = 1, 2, . . ., −1, +1, . . ., },  stands for a number of a certain class,  represents a collection of all numbers of classes, where the distance is less than or equal to  in the class ,  is the total number of classes, and   () is the weight parameters.
(5) If   () ≥  0 , then   is subsumed in the   , the number of classes minus 1.  0 is a given probability value.Recalculate the merged cluster center.
(6) Determine whether there is merging or not.If there is no merging, stop the cycle; otherwise, go to step (2) and continue the iteration.

Extreme Learning Machine.
The extreme learning machine (ELM) was proposed by Huang et al. in 2006 [62].
The core principle of ELM is a single-hidden layer feedforward neural network (SLFN), a typical structure of which is shown in Figure 1.This network consists of an input layer, a hidden layer, and an output layer.The neurons in the former two and latter two are fully connected.There are  neurons corresponding to the input variables in the input layer; there are  neurons in hidden layer; there are  neurons corresponding to  output variables in the output layer.Briefly, suppose the connection weight  between the input layer and the hidden layer is given by where   represents the weight between neuron  in the input layer and neuron  in the hidden layer.
The connection weight matrix  between the hidden layer and the output layer is given by where   represents the weight value between the  neuron in the hidden layer and the  neuron in the output layer.The threshold  of the neuron in the hidden layer is given by The input matrix  and the output matrix  in the training set with  samples are given by The activation function of the neuron in the hidden layer is (); the output of net  can be seen in Figure 1. where Equation ( 10) can be expressed as where   is the transpose of the matrix  and  is the output matrix of the hidden layer of the neural network.Consider Because the number of neurons in the hidden layer is equal to the number of samples in the training set, SLFN can approach the training samples with zero error for any  and ; that is, where   = [ 1 ,  2 , . . .,   ]  ,  = 1, 2, . . .,  However, when the number of the training set  is relatively large, the number of the hidden layer neurons  is usually smaller than  to reduce the computational complexity.Meanwhile, the training error of SLFN can approach any  > 0; that is, Therefore, when the activation function () can be infinitely differentiable, not all of the parameters of SLFN need to be adjusted.The values of  and  can be chosen randomly before training and remain the same in the rest of the selection process.The connection weights  between the hidden layer and output layer can be obtained by solving the equations to obtain the least squares solution: the estimated solution is β =  +   , where  + is the Moore-Penrose generalized inverse of the output matrix of the hidden layer .
From the above analysis, ELM can randomly generate  and  before training.It is only needed to determine the number of hidden layer neurons and the activation function of the hidden layer neuron (infinitely differentiable), and then  can be calculated.Generally, ELM learning algorithms follow the next steps.
(1) Determine the number of the neurons in the hidden layer and randomly set the connection weight  between the input layer and the hidden layer and the bias  of the neurons in the hidden layer.(2) Select an infinitely differentiable function as the activation function of the neurons in the hidden layer, and then calculate the matrix  of the hidden layer output.

Technology Roadmap
The    the weighted average method, and the center of gravity method.

Results and Discussion
There is no existing standard for evaluating the quality of clustering results, and the application of the clustering method depends on the researchers' application skills and experiences with the classification objects.If the clustering method is applied properly, the critical scale and the distance matrix could be highly correlated.The correlation is calculated by cophenetic correlation coefficients, which can be used to evaluate the clustering results.The value of the cophenetic correlation coefficient is between zero and one.If the clustering is effective, the value is large.However, the high value is only statistically significant; it does not mean that the result is physically effective.Sometimes the result is not meaningful in practice.
The cophenetic correlation coefficients are shown in Table 1.The clustering methods based on the Euclidean distance include the longest-distance method, the shortestdistance method, the average-distance method, the weightedaverage method, the center of gravity method, the middledistance method, and the sum of squares.
Based on the average-distance method of the Euclidean distance, 12 molecular descriptors are clustered into classes, and the numbers of classes are given from 2 to 11.This paper only takes for an example the 7 classes.The clustering pedigree chart of the average-distance method is shown in Figure 3. Numbers 1 through 12 on the horizontal axis in Figure 3

Ant Colony Optimization Calculation Results.
The HC method needs a predefined number of clusters, which includes every type of classification.The advantage is obvious when the sample (i.e., the number of molecule descriptors) is small, but when the sample is large, it is neither necessary nor practical to list every classification using an exhaustive method.Therefore, in this paper, an ACO without a predefined number of clusters is also applied to the screening of molecular descriptors to arrive at a generalized model of the ELM method for either a small or large number of descriptors.According to the definition of the ACO, the optimization function is the minimum ratio of the within-class distance to interclass distance.This means the distance among the classes should be large, but the distance between two samples within a class should be small: where ,  = 1, 2, . . ., ;  = 1, 2, . . ., , where   is the  factor of sample  and   is the  factor of the center of the class .

ELM Calculation Results
. The correlation coefficients between the 12 molecular descriptors and homolysis BDE experimental values were calculated, and they are 0.64, 0.46, 0.49, 0.02, 0.12, 0.18, 0.28, 0.51, 0.17, 0.43, 0.05, and 0.27.According to the magnitude of the correlation coefficients (e.g., by HC calculation) when   and   cluster into a class,   is correlated higher with homolysis BDE than   and will therefore be selected as the input of ELM.Similarly, when   and  are classified into one class,  will be selected as the input of ELM.In the cluster consisting of  HOMO ,  LUMO ,  HOMO−1 , and  LUMO+1 ,  HOMO is selected as the input of ELM.Δ homo ,   , , and Δ are clustered into one class.Therefore, Δ homo ,   , , Δ,   , , and  HOMO are selected as the final inputs of ELM by using the HC method.In the same way, Δ homo ,   ,   , , , and  HOMO are selected as the final inputs of ELM after using the ACO clustering method.The errors of the test set in different methods are shown in Figure 4(a).
To assess the KS-HC/ACO-ELM method, the results calculated by the molecular descriptors without screening KS-ELM (using all descriptors), KS-HC-ELM (best results are obtained when the molecular descriptors cluster into 7 classes), and KS-ACO-ELM are compared with the results calculated by B3LYP/6-31G(d).A three layered artificial neural network (KS-HC-ANN) calculation was also performed for comparison.The ANN structure uses the same number of inputs and 6 hidden neurons (shown in Figure 4(b)).The corrected results show that the systematic errors of the calculation value are eliminated, that the homolysis BDE are significantly improved by the KS-ELM, KS-HC-ELM, and KS-ACO-ELM methods, the range of error is narrowed down, the STD calculated by the KS-ELM, KS-HC-ELM, and KS-ACO-ELM methods is much smaller than the calculation error by DFT B3LYP/6-31G(d), the errors are in a Gaussian distribution, and the values of most of the errors are approximately 0. As is clearly seen in Figure 4(b), the results from KS-HC-ELM approach the homolysis BDE experimental values better than those obtained from the ANN, and the calculation time is also much less than that in the ANN because the ANN learning algorithm requires a number of network training parameters set by manual work.ANN also easily leads to a local optimal solution, and its calculated results are very unstable.The ELM method only requires setting the number of hidden layer nodes of the network.It does not need to adjust the network input weights and hidden bias in the implementation process of the algorithm, and it can generate a unique optimal solution, so it has both high learning speed and good generalization performance.
The STD corrected by the KS-ELM method is much smaller than the error of the B3LYP/6-31G(d), and the STD calculated by KS-HC-ELM and KS-ACO-ELM is smaller than that by KS-ELM.That means that it is necessary to screen molecular descriptors to eliminate redundant descriptors.The STD is reduced from 4.03 to 0.30, 0.28, 0.29, and 0.32 kcal mol −1 after correction by the KS-ELM, KS-HC-ELM, KS-ACO-ELM, and KS-HC-ANN methods, respectively.It is quite simple to see that the result corrected by the KS-HC-ELM method is much closer to the experimental results.If some trivial features are introduced into ELM, the accuracy of the model might decrease, and if the parameters are selected inappropriately, overfitting could also appear.In this experiment, although the calculation results using the KS-HC-ELM method are superior to those by the KS-ACO-ELM method, it cannot be assumed that this will be the case under every circumstance because each method has its own advantages.The clustering methods can be divided into two categories, where one requires a predefined number of categories, while the other does not.Because the ELM inputs affect its performance and the best combination of molecular descriptors is difficult to determine, it is recommended that when the number of clustering molecular descriptors is smaller, both the HC method and the ACO method are suitable and that when the number of the molecular descriptors is larger, the ACO method is more appropriate.

Conclusions
Nitric oxide (NO) is a very important signaling molecule in mammalian systems.When there are problems resulting in the release of NO molecules in the body, NO carrier molecules may be taken as a drug to treat the disease.It is complicated to measure NO molecular bond homolysis BDE in various experimental methods, and it is rather difficult to achieve chemical accuracy.The DFT method has been very popular in computational chemistry for the past two decades because many computational studies have shown that DFT calculations can capture the physical essence of the molecules.For the calculation of small molecules, the accuracy can reach a high level, but for large molecules, the cost of calculation is too expensive and the calculation accuracy is quite poor.
In this paper, the combination of KS-HC/ACO-ELM and DFT calculation methods succeeds in improving the DFT calculation accuracy of NO bond homolysis BDE.The results show that for three KS-HC/ACO-ELM methods, the STD of homolysis BDE of 92 organic molecules in the test set decreases from 4.03 to 0.30, 0.28, and 0.29 kcal mol −1 .This proves that the KS-HC/ACO-ELM method based on B3LYP/6-31G(d) can remarkably improve homolysis BDE calculations, and its result may be used as the reference value to experimental results with high accuracy.The adoption of Kenstone makes the model more generalized.Although the model is ambiguous and the space of the descriptors is complex, the reproducibility of the calculation is very important.The selection of the training set and test set using the Kenstone method makes the model more robust and reproducible and makes the calculation results more convincing.The selection of appropriate molecular descriptors using the HC/ACO can eliminate the subjective bias on some input descriptors.Compared to the ANN method, the nonlinear model created by ELM can not only avoid sensitive parameters and local minimum problems, but it can also achieve more accurate calculation results with less calculation time and computing resources.
The KS-HC/ACO-ELM method expands the feasibility and applicability of the B3LYP/6-31G(d) method, and the more experimental data and molecular descriptors there are, the higher calculation accuracy the KS-HC/ACO-ELM will achieve, leading to more obvious advantages.It is important that such a high-precision method can be used to design novel NO-releasing drug molecule.We believe that it is significantly effective for the KS-HC/ACO-ELM method to improve the accuracy of DFT B3LYP/6-31G(d) calculations, even with a smaller basis set (e.g., STO-3G, etc.).Meanwhile, this method can be applied to correct other properties such as Heterolytic Bond Dissociation Energies, Absorbed Energies, Ionization Energies, and heat of formation.Further studies are ongoing.This study provides an effective tool (the combination of the KS-HC/ACO-ELM and DFT methods) to improve and predict highly accurate homolysis BDE of the molecular NO carrier systems.

Figure 1 :
Figure 1: A typical structure of a single-hidden layer feed-forward neural network (SLFN).

Figure 2 :
Figure 2: The flow chart of the KS-HC/ACO-ELM model.

Figure 3 :
Figure 3: Clustering pedigree chart of the average-distance method.
[34]nology roadmap of the ELM method based on KS-HC/ACO to improve the accuracy of the molecular bond energies calculated by DFT is shown in Figure2.It consists of three parts.Part one is the collection of the data, including molecular bond data and the molecular descriptors.The second part describes the theoretical calculation.Ninetytwo important organic NO carrier molecules are studied, 12 molecular descriptors are calculated by DFT B3LYP/6-31G(d) (Δ homo ,   ,   ,   ,   , , ,  HOMO ,  LUMO ,  HOMO−1 ,  LUMO+1 , and Δ), and homolysis BDE experimental values are included[34].The Kenstone method is used to divide the training set and the test set.HC and ACO are taken as different classification methods; the former requires a predefined number of the classes to select the appropriate molecular descriptors that will be modeled by the ELM method, while the latter does not need to specify the number of classes.The third part aims at testing and analyzing the models.When the calculation results are not satisfied, the algorithm should be optimized repeatedly up to a reasonable accuracy.At the same time, it is suggested to determine the comparison test.

Table 1 :
Cophenetic correlation coefficients of various clustering methods based on the Euclidean distance.