An Improved QSPR Modeling of Hydrocarbon Dipole Moments

Dipole moments of hydrocarbons are not an easy property to model with conventional 2D descriptors. A comparison of the performance of the most commonly used sets of topological descriptors is presented: descriptors derived from the regular and Detour distance matrices, Electrotopological State Indices, and the basic counts of atoms of each type and of bonds. The models are built on a representative set of 35 previously reported hydrocarbon dipole moments, using classical multivariable regression analysis.


INTRODUCTION
The permanent electric dipole moment (DM) is an important physical chemistry property. By definition, the electric dipole moment vector points from the center of the positive charge distribution to the center of the negative charge distribution. It furnishes valuable information about the structure and polarity of the molecule under consideration. Evidently, a necessary condition for a suitable theoretical method is its capability to reproduce the experimental dipole moment.
Among the theoretical methods that are employed with significant success in the prediction of molecular physical chemistry properties and/or biological activities, the mathematical procedures based on topological concepts are very important. They are normally known as Quantitative Structure-Property/Activity Relationships (QSPR/QSAR). These methods are mainly based on nonempirical, structure-related graph theoretical molecular descriptors, where molecules are considered as a set of vertices attached to each other by a set of nonmetrical connections. These computational procedures allow one to forecast properties or activities of specific chemical compounds once the properties or activities of a set of those compounds have been modeled [1,2,3,4,5,6,7].
A rather successful mathematical topological method employed in QSPR/QSAR studies is the molecular connectivity method, which permits one to derive a set of graph-theoretical indices of wide applicability by means of a well-defined set of rules [8,9,10,11,12,13,14]. The use of the so-called "pure" topological indices, i.e., those that have information only about the relationships of connectivity between atoms or bonds in the molecules, has proved to be quite simple and direct, and these indices provide a suitable tool for accurate predictions of properties and activities. Classical examples of these indices are the Randić branching index [8], the Hosoya Z index [15], the Balaban J index [16], the Wiener index [17], and the bond connectivity index [18].
In a recent paper [19], we presented some results for dipole moments via QSPR modeling by means of a flexible set of molecular descriptors, comparing the predictions with available experimental and other theoretical results in order to extend the study of this physical chemistry property and, at the same time, to analyze the capabilities of this sort of molecular index. Although the final results revealed an improvement with respect to previous data, the statistical parameters associated with the regression equations were not totally satisfactory, so we have deemed it convenient to try the analysis via some pure topological indices.
It is our purpose to analyze how the different types of topological descriptors perform in modeling the set of 35 hydrocarbon dipole moments. This article is organized as follows: the next section deals with the definition of the topological indices, and then we describe the calculation procedure. The presentation of the results is accompanied by a comparison with other similar data and a discussion of them. Finally, we state the main conclusions derived from the present analysis.

TOPOLOGICAL INDICES
The two most important matrices that describe the labeled chemical graph are the adjacency ($A$) and the distance ($D$) matrices [8,20,21]. In $A$, $A_{ij} = 1$ if vertices $i$ and $j$ are adjacent and 0 otherwise. In $D$, $D_{ij} = l_{ij}$ if $i \neq j$ and 0 otherwise, where $l_{ij}$ is the shortest edge count between vertices $i$ and $j$ [15]. The $D^{-2}$ matrix is the matrix whose elements are the squares of the reciprocal distances, $D_{ij}^{-2}$ [22], with $i, j = 1, 2, \ldots$ The molecular topological index (MTI), also called the Schultz index after its originator [23,24], is a molecular descriptor with structural significance and it has a number of attractive features [25,26,27]. Originally, it was defined for hydrocarbon molecules [23], but later on it was generalized for heterosystems [28]. The extended version of the MTI was tested in QSPR modeling of the boiling points of alkyl alcohols, and it proved to be a topological index worthy of further study regarding its properties and applicability in structure-property/activity relationships.
The Schultz MTI is defined as
$$\mathrm{MTI} = \sum_{i=1}^{N} \left[ v (A + D) \right]_i \qquad (1)$$
where $v$ is the valence ($1 \times N$) matrix of a structure $G$, $A$ is the adjacency ($N \times N$) matrix, and $D$ is the distance ($N \times N$) matrix [29,30,31]. The entries in the valence row matrix $v$ are the graph-theoretical valences.
The MTI may be extended to heterosystems in a straightforward way by substituting the elements of the $A$ and $D$ matrices corresponding to heteroatoms and heterobonds with values that account for the corrections arising from the replacement of a carbon atom by a heteroatom.
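As an illustration of these definitions, the following minimal sketch builds A and D for a hydrogen-suppressed carbon skeleton and evaluates the MTI as the sum of the entries of the row vector v(A + D). The function names and the choice of 2-methylbutane as an example are ours and are not part of the original calculations.

```python
from collections import deque

def adjacency_matrix(n, edges):
    """Adjacency matrix A of a hydrogen-suppressed graph with n vertices."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = A[j][i] = 1
    return A

def distance_matrix(A):
    """Distance matrix D: shortest edge count between every vertex pair (BFS)."""
    n = len(A)
    D = [[0] * n for _ in range(n)]
    for s in range(n):
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in range(n):
                if A[u][w] and w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        for t, d in dist.items():
            D[s][t] = d
    return D

def schultz_mti(A, D):
    """MTI = sum_i [v (A + D)]_i, with v the row vector of vertex degrees."""
    n = len(A)
    v = [sum(row) for row in A]
    return sum(v[k] * (A[k][i] + D[k][i]) for k in range(n) for i in range(n))

# Carbon skeleton of 2-methylbutane: chain C0-C1-C2-C3 with C4 attached to C1
edges = [(0, 1), (1, 2), (2, 3), (1, 4)]
A = adjacency_matrix(5, edges)
D = distance_matrix(A)
print(schultz_mti(A, D))
```

Because only the carbon skeleton enters the graph, the valences in v are simply the vertex degrees (row sums of A).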
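The connectivity index $\chi = \chi(G)$ of a structure $G$ is defined as [8]:
$$\chi = \sum_{\text{edges } i\text{-}j} (v_i v_j)^{-1/2}$$
where $v_i$ is the valence of a vertex $i$ in $G$. For heterosystems, the connectivity index is given in terms of the valence delta values $\delta^v(i)$ and $\delta^v(j)$ of atoms $i$ and $j$, and it is written as $\chi^v$. This connectivity index is called the valence connectivity index and it is defined as [9,32,33]:
$$\chi^v = \sum_{\text{edges } i\text{-}j} \left[ \delta^v(i)\, \delta^v(j) \right]^{-1/2}, \qquad \delta^v(i) = \frac{Z_i^v - H_i}{Z_i - Z_i^v - 1}$$
with $Z_i$ the atomic number of atom $i$, $Z_i^v$ the number of valence electrons of atom $i$, and $H_i$ the number of hydrogen atoms bonded to atom $i$.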
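A short numerical sketch may help to fix these two definitions. It assumes the standard Kier-Hall valence delta, (Z^v - H)/(Z - Z^v - 1), and uses two small hydrocarbons chosen by us as examples; the helper names are hypothetical.

```python
from math import sqrt

def delta_valence(Z, Zv, H):
    """Kier-Hall valence delta (assumed form): (Zv - H) / (Z - Zv - 1)."""
    return (Zv - H) / (Z - Zv - 1)

def connectivity_index(edges, deltas):
    """Sum over edges i-j of (delta_i * delta_j)^(-1/2)."""
    return sum(1.0 / sqrt(deltas[i] * deltas[j]) for i, j in edges)

# 2-methylbutane skeleton: simple connectivity index from vertex degrees
iso_edges = [(0, 1), (1, 2), (2, 3), (1, 4)]
iso_degrees = [1, 3, 2, 1, 1]
chi = connectivity_index(iso_edges, iso_degrees)

# Propene CH2=CH-CH3: valence connectivity index from valence deltas
# (carbon: Z = 6, Zv = 4; hydrogens bonded to each carbon: 2, 1, 3)
prop_edges = [(0, 1), (1, 2)]
prop_dv = [delta_valence(6, 4, h) for h in (2, 1, 3)]
chi_v = connectivity_index(prop_edges, prop_dv)

print(round(chi, 4), round(chi_v, 4))
```

For a saturated hydrocarbon the two indices coincide, since the valence delta of each carbon then equals its vertex degree; they differ only when multiple bonds are present.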
The Wiener number $W$ is defined as [34]:
$$W = \frac{1}{2} \sum_{i} \sum_{j} D_{ij} \qquad (6)$$
and it is the simplest first-generation topological index. The Harary index $H$ is based on the concept of reciprocal distance and is defined, in parallel to the Wiener index, as the half sum of the off-diagonal elements of the reciprocal molecular distance matrix $D_{ij}^{-1}$ [35]:
$$H = \frac{1}{2} \sum_{i} \sum_{j} D_{ij}^{-1} \qquad (7)$$
It should be noted that the diagonal elements $D_{ii}^{-1}$ are all equal to zero by definition. The Balaban index $J$ (average distance connectivity index) is based on distasums (distance sums, $d_i = \sum_j D_{ij}$) [36], in order to reduce the degeneracy of previous indices, because distasums (unlike vertex degrees) are seldom identical for nonequivalent vertices in nonisomorphic graphs. The definition involves a sum of "weighted average topological distances" for all edges $i$-$j$:
$$J = \frac{q}{\mu + 1} \sum_{\text{edges } i\text{-}j} (d_i d_j)^{-1/2} \qquad (8)$$
The numbers $q$ of graph edges and $\mu$ of cycles in the graph are introduced into the above formula in order to avoid the automatic increase of $J$ with graph size and cyclicity [37]. Kier et al. [38] have generalized the definition [3] for longer paths than edges (which can be considered as paths of length $L = 1$ in the graph). The three carbon atoms in a propane graph form a path of length $L = 2$, an n-butane graph is a path of length $L = 3$, and so on. An isobutene graph is a cluster on four vertices. Thus, we have a sum of "weighted paths", yielding connectivity indices of higher order $L$:
$${}^{L}\chi = \sum_{\text{paths } i\text{-}j\text{-}k \cdots} (v_i v_j v_k \cdots)^{-1/2} \qquad (9)$$
where the product runs over the $L + 1$ vertices of each path. Two other first-generation topological indices, advocated by the Zagreb group [40], are the sum of squares of vertex degrees,
$$M_1 = \sum_{i} v_i^2 \qquad (10)$$
and the sum of products of vertex degrees over all edges $i$-$j$,
$$M_2 = \sum_{\text{edges } i\text{-}j} v_i v_j \qquad (11)$$
The zeroth-, first-, and second-order molecular valence connectivity indices ${}^{L}\chi^v$ ($L$ = 0, 1, 2) are defined in analogy with [9], and the Zagreb group indices based on valence vertex degrees, $M_l^v$ ($l$ = 1, 2), are likewise defined on the basis of Eqs. 10 and 11.
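The distance-based and degree-based indices of this section can be evaluated together from the edge list of the carbon skeleton. The sketch below is only illustrative: it assumes a connected graph, takes the number of cycles as µ = q - n + 1, and uses 2-methylbutane as an example of our own choosing.

```python
from math import sqrt

def distance_matrix(n, edges):
    """All-pairs shortest edge counts (Floyd-Warshall) for a connected graph."""
    INF = float("inf")
    D = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for i, j in edges:
        D[i][j] = D[j][i] = 1
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if D[i][k] + D[k][j] < D[i][j]:
                    D[i][j] = D[i][k] + D[k][j]
    return D

def indices(n, edges):
    """Wiener, Harary, Balaban J, and Zagreb M1, M2 of a connected graph."""
    D = distance_matrix(n, edges)
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    W = sum(D[i][j] for i, j in pairs)          # half sum of D
    H = sum(1.0 / D[i][j] for i, j in pairs)    # half sum of reciprocal D
    q = len(edges)
    mu = q - n + 1                              # number of cycles
    dsum = [sum(row) for row in D]              # distance sums (distasums)
    J = q / (mu + 1) * sum(1.0 / sqrt(dsum[i] * dsum[j]) for i, j in edges)
    M1 = sum(d * d for d in deg)
    M2 = sum(deg[i] * deg[j] for i, j in edges)
    return {"W": W, "H": H, "J": J, "M1": M1, "M2": M2}

# Carbon skeleton of 2-methylbutane: chain C0-C1-C2-C3 with C4 attached to C1
result = indices(5, [(0, 1), (1, 2), (2, 3), (1, 4)])
print(result)
```

Summing over pairs i < j is equivalent to taking half of the double sum over the full symmetric matrix.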
Finally, $W_{\max}$, $H_{\max}$, $J_{\max}$, and $\mathrm{MTI}_{\max}$ are defined analogously to Eq. 1 and Eqs. 6-8 by resorting to the Detour matrix [41], whose elements are the lengths of the longest paths between pairs of vertices, instead of the regular distance matrix.
A particularly simple set of topological descriptors is formed by the number of atoms of each kind and the number of classical chemical bonds. In spite of the naïveté of these descriptors, they have proved to be very valuable for this sort of calculation [42,43,44,45], so we have deemed it suitable to employ them here.
As conventional topological descriptors characterize the molecule as a whole, i.e., molecular size or shape, it may be fruitful to employ atom-type topological indices. These are expected to reflect the role of each atom type or group in a molecule, thus depicting more accurately the intermolecular interactions that influence the physical property of a molecule. This interaction between groups can be of the Van der Waals or the hydrogen bonding type. One such index is the so-called Electrotopological State Index ($S_i$), based on the electronic state of each atom type and its topological nature in the molecular graph, introduced by Kier et al. and used successfully in a variety of QSAR/QSPR studies of complex molecules [46,47,48,49,50,51].
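The Detour-based analogues differ from the regular indices only for cyclic structures, since in an acyclic graph the longest and the shortest path between two vertices coincide. The following sketch, with hypothetical function names, enumerates simple paths by depth-first search (adequate for small molecular graphs only) and evaluates the Wiener-type half sum on the Detour matrix of methylcyclopropane, an example chosen by us.

```python
def detour_matrix(n, edges):
    """Detour matrix: entry (i, j) is the length of the longest simple path i..j."""
    nbrs = [[] for _ in range(n)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    L = [[0] * n for _ in range(n)]

    def walk(start, u, visited, length):
        # Keep the longest simple path found so far from start to u
        if length > L[start][u]:
            L[start][u] = length
        for w in nbrs[u]:
            if w not in visited:
                visited.add(w)
                walk(start, w, visited, length + 1)
                visited.remove(w)

    for s in range(n):
        walk(s, s, {s}, 0)
    return L

def half_sum(M):
    """Half sum of the off-diagonal elements of a symmetric matrix."""
    return sum(M[i][j] for i in range(len(M)) for j in range(i + 1, len(M)))

# Methylcyclopropane skeleton: ring 0-1-2 with the methyl carbon 3 on vertex 0
edges = [(0, 1), (1, 2), (2, 0), (0, 3)]
L = detour_matrix(4, edges)
w_max = half_sum(L)
print(L, w_max)
```

Here the ordinary Wiener number of the same skeleton is 8, so the Detour-based value carries additional information about the ring.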
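The $S_i$ for atom $i$ is defined from its Intrinsic State $I_i$ plus a perturbation term $\Delta I_i$ that accounts for the molecular environment of that atom:
$$S_i = I_i + \Delta I_i, \qquad I_i = \frac{(2/N)^2\, \delta^v(i) + 1}{\delta(i)}, \qquad \Delta I_i = \sum_{j \neq i} \frac{I_i - I_j}{r_{ij}^2}$$
with $N$ being the principal quantum number, $r_{ij}$ the number of atoms in the shortest path between atoms $i$ and $j$, and $\delta(i)$ the delta value for atom $i$:
$$\delta(i) = \sigma_i - H_i$$
with $\sigma_i$ equal to the count of electrons in sigma orbitals and $H_i$ the number of hydrogen atoms bonded to atom $i$.
It is worthwhile to cite three recently published books on topological indices, which are rather valuable sources of significant information on this issue [52,53,54].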
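To make the construction concrete, the sketch below evaluates the intrinsic states and the electrotopological state values for propene. It assumes the usual Kier-Hall forms, namely I = [(2/N)^2 δ^v + 1]/δ and a perturbation term summed over all other atoms with r taken as the graph distance plus one; the function names and the example molecule are ours.

```python
def intrinsic_state(N, delta_v, delta):
    """Assumed Kier-Hall intrinsic state: ((2/N)^2 * delta_v + 1) / delta."""
    return ((2.0 / N) ** 2 * delta_v + 1.0) / delta

def estate(I, D):
    """S_i = I_i + sum_{j != i} (I_i - I_j) / r_ij^2, with r_ij = D_ij + 1."""
    n = len(I)
    S = []
    for i in range(n):
        pert = sum((I[i] - I[j]) / (D[i][j] + 1) ** 2
                   for j in range(n) if j != i)
        S.append(I[i] + pert)
    return S

# Propene CH2=CH-CH3 (hydrogen-suppressed), all atoms carbon (N = 2)
sigma = [3, 3, 4]      # electrons in sigma orbitals of each carbon
hydrogens = [2, 1, 3]  # hydrogens bonded to each carbon
delta = [s - h for s, h in zip(sigma, hydrogens)]   # simple delta values
delta_v = [4 - h for h in hydrogens]                # valence deltas for carbon
D = [[0, 1, 2],
     [1, 0, 1],
     [2, 1, 0]]                                     # topological distances

I = [intrinsic_state(2, dv, d) for dv, d in zip(delta_v, delta)]
S = estate(I, D)
print(I, S)
```

The perturbation terms cancel pairwise, so the sum of the S values over the molecule equals the sum of the intrinsic states.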

CALCULATION PROCEDURE
The QSPR models were built via multilinear regression analysis based on first-, second-, and third-order polynomials. Higher-order polynomial fitting equations are used because they usually improve the prediction of physical chemistry properties and/or biological activities.
We try to minimize the standard deviation $S$, defined as follows:
$$S = \sqrt{\frac{1}{N - k - 1} \sum_{j=1}^{N} r_j^2}$$
where $r_j$ are the residuals, $N$ is the number of molecules, and $k$ is the number of descriptors in the model. The best model for a given set of descriptors according to the criterion of smallest $S$ can be obtained by trying all the combinations of $k$ descriptors out of $n$ for $k = 1, 2, \ldots, n$. This procedure is time consuming (especially in the case of exact rational calculations) because the total number of calculations is $2^n - 1$. For that reason, we resorted to a different strategy: first we perform a linear regression with all the descriptors and remove the one with the greatest relative error $\Delta c_j / c_j$ in its regression coefficient $c_j$. Then, we repeat the calculation with the remaining $n - 1$ descriptors, and again we remove the descriptor with the largest relative error. We proceed in exactly the same way for $n - 2$, $n - 3$, … remaining descriptors until we have just one descriptor and the constant. Then, one has to choose the set with the minimum $S$-value. This procedure, which we name after R. Maronna, requires only $n$ calculations and enables us to single out an optimum model that is very close to or in complete agreement with the one obtained by means of the exhaustive search mentioned before.
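The elimination strategy just described can be sketched in a few lines. The version below is only an illustration under several assumptions of ours: the coefficient error Δc_j is taken as the usual least-squares standard error, S² is computed with N - k - 1 degrees of freedom, ties in S are resolved in favor of the smaller model, and the data are synthetic. Exact rational arithmetic is used so that round-off plays no role; squared relative errors are compared to avoid square roots.

```python
from fractions import Fraction

def solve_normal(X, y):
    """Exact least squares: returns coefficients and the inverse of X^T X."""
    p = len(X[0])
    # Augmented matrix [X^T X | I], reduced by Gauss-Jordan elimination
    M = [[sum(r[a] * r[b] for r in X) for b in range(p)] +
         [Fraction(int(a == b)) for b in range(p)] for a in range(p)]
    for c in range(p):
        piv = next(r for r in range(c, p) if M[r][c] != 0)
        M[c], M[piv] = M[piv], M[c]
        pv = M[c][c]
        M[c] = [x / pv for x in M[c]]
        for r in range(p):
            if r != c and M[r][c] != 0:
                f = M[r][c]
                M[r] = [x - f * z for x, z in zip(M[r], M[c])]
    inv = [row[p:] for row in M]
    Xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(p)]
    coef = [sum(inv[a][b] * Xty[b] for b in range(p)) for a in range(p)]
    return coef, inv

def fit(data, y, cols):
    """Regression with a constant on the chosen columns: S^2 and (dc/c)^2."""
    X = [[Fraction(1)] + [Fraction(row[c]) for c in cols] for row in data]
    coef, inv = solve_normal(X, y)
    res = [yi - sum(c * x for c, x in zip(coef, r)) for r, yi in zip(X, y)]
    s2 = sum(r * r for r in res) / (len(y) - len(cols) - 1)
    rel2 = {}
    for k, c in enumerate(cols, start=1):
        # A vanishing coefficient is treated as having infinite relative error
        rel2[c] = s2 * inv[k][k] / coef[k] ** 2 if coef[k] != 0 else float("inf")
    return s2, rel2

def maronna_elimination(data, y):
    """Drop the descriptor with the largest relative error until none is left."""
    cols = list(range(len(data[0])))
    best = None
    while cols:
        s2, rel2 = fit(data, y, cols)
        if best is None or s2 <= best[0]:
            best = (s2, list(cols))
        cols.remove(max(cols, key=lambda c: rel2[c]))
    return best

# Synthetic example: 7 "molecules", 3 descriptors; y is close to 2*x0 - x1
data = [(1, 2, 5), (2, 1, 3), (3, 4, 1), (4, 3, 6), (5, 6, 2), (6, 5, 4), (7, 8, 7)]
y = [Fraction(1, 10), Fraction(29, 10), Fraction(2), Fraction(26, 5),
     Fraction(19, 5), Fraction(71, 10), Fraction(6)]
best_s2, best_cols = maronna_elimination(data, y)
print(best_cols, float(best_s2))
```

Only n regressions are performed for n descriptors, in contrast with the 2^n - 1 regressions of the exhaustive search.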
Since the set of molecules under study is rather small, we need to assess the predictive power of the models through some kind of validation technique. We use the "leave-one-out" cross-validation [55].
In the present paper, the quality of the models is assessed by the following statistical parameters: $R_{cal}$, $S_{cal}$, and $F$ are the correlation coefficient, standard deviation, and Fisher test of the calibration set, respectively; absdv is the mean absolute deviation of the model; and $R_{lou}$ and $S_{lou}$ stand for the correlation coefficient and standard deviation of the leave-one-out cross-validation method.
We chose the models by first applying Maronna's method and then selecting the generated model that had similar values for $S_{cal}$ and $S_{lou}$. That is to say, we searched for an approximate solution near Maronna's minimum rather than for the model with minimum $S$. The method is justified since the purpose of the QSAR-QSPR theories is to look for a model that can fit both the calibration data set and the validation data set with the same accuracy. One possible advantage of doing this is that the number of descriptors in the model usually decreases, at the cost of a slight increase in $S_{cal}$ and a slight decrease in $R_{cal}$.
In order to avoid round-off errors in our linear-regression calculations, we resorted to the computer algebra system DERIVE [56], which enables one to solve the least-squares equations in exact rational mode if necessary. Although this kind of calculation is commonly slow, it is sufficiently fast for our present purposes.
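The leave-one-out statistics can be illustrated with a one-descriptor model. In this sketch the leave-one-out standard deviation is taken as the root mean square of the cross-validated residuals and the correlation coefficient as the Pearson correlation between observed and cross-validated values; both conventions, the function names, and the numerical data are assumptions made for illustration only.

```python
from math import sqrt

def fit_line(xs, ys):
    """Least-squares line y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def leave_one_out(xs, ys):
    """Predict each point from a model fitted to all the other points."""
    preds = []
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(a + b * xs[i])
    return preds

def rms_deviation(ys, preds):
    """Root mean square of the residuals (assumed definition of S_lou)."""
    return sqrt(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

def pearson_r(u, v):
    """Pearson correlation coefficient between two sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / sqrt(suu * svv)

# Synthetic descriptor values and property values (not taken from Table 1)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.12, 0.35, 0.41, 0.62, 0.70, 0.88]
preds = leave_one_out(xs, ys)
print(rms_deviation(ys, preds), pearson_r(ys, preds))
```

Comparing these cross-validated values with the corresponding calibration statistics indicates how far the model depends on individual data points.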

RESULTS AND DISCUSSION
First, we used a set of topological indices composed of the following 18 descriptors: {$W$, $H$, $J$, MTI, $W_{\max}$, $H_{\max}$, $J_{\max}$, $\mathrm{MTI}_{\max}$, ${}^{0}\chi$, ${}^{1}\chi$, ${}^{2}\chi$, $M_1$, $M_2$, ${}^{0}\chi^v$, ${}^{1}\chi^v$, ${}^{2}\chi^v$, $M_1^v$, $M_2^v$}. Since the initial set of descriptors is large, we resorted to a first-order model. From this set of descriptors, the best model found with Maronna's method contains 9 descriptors (model 1). Within this model, there are no outliers with an absolute deviation exceeding 2.5 $S$. The predicted values for the property are given in Table 1. As can be seen, model 1 does not have enough predictive ability, which means that the topological indices derived from the regular and Detour distance matrices are not suitable enough for modeling dipole moments of hydrocarbons.
A next step for improving the description of the property's space could be achieved by including the electronic aspects together with the topological description of the molecular structure. This would be well accomplished with the $S_i$'s. The predictor variables were calculated for each fragment $i$ of the molecule by taking the mean of its $S_i$ values. However, the resulting linear, quadratic, and cubic models with the set of $S_i$ descriptors were of inferior quality in comparison with the statistical parameters of model 1. Including nonlinearities such as cross-terms in the original $S_i$'s did not have any remarkable effect.
Then we tried a set containing the basic topological descriptors, i.e., the number of atoms of each kind and the number of chemical bonds. The set is constructed by taking the linear, quadratic, and cubic powers of the following descriptors: {C, H, C-C, C=C, C-C$_{triple}$, C-C$_{aromatic}$, C$_{primary}$, C$_{secondary}$, C$_{tertiary}$, C$_{quaternary}$}, thus leading to a total of 30 descriptors.
If we study atoms and bonds separately, it is not possible to obtain a good model. Applying Maronna's method to the whole set, we obtained a model with 11 descriptors (model 2) that performs better than the previous ones. There are no outliers with an absolute deviation exceeding 2.5 $S$. Model 2 clearly outperforms model 1 and the previously reported model based on a flexible set of molecular descriptors [19]. The predicted values of DM are given in Table 1.
A further step in modeling the DM was to combine the different sets discussed above into a single set. Joining the conventional topological indices, the $S_i$'s, and atoms and bonds, we were not able to obtain a model better than model 2. In fact, we found that the best one-descriptor model of the total set is given by the descriptor C-C$_{triple}$, and that the best two-descriptor model includes the descriptor C-C$_{aromatic}$, thus suggesting the relative importance of the bond descriptors for the property. However, this conclusion depends on the size of the molecular set used.
The DERIVE programs used in this work are available on request from one of us (PRD).
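The construction of this descriptor set is straightforward, as the following sketch suggests. The dictionary keys, the function name, and the example molecule (2-methylbutane, with the atom and bond counts we would assign to it) are illustrative assumptions rather than the actual input used in the calculations.

```python
BASE = ["C", "H", "C-C", "C=C", "C-C triple", "C-C aromatic",
        "C primary", "C secondary", "C tertiary", "C quaternary"]

def expand(counts):
    """Return the 30 descriptors: each basic count raised to powers 1, 2, 3."""
    out = {}
    for power in (1, 2, 3):
        for name in BASE:
            out[f"{name}^{power}"] = counts[name] ** power
    return out

# 2-methylbutane, C5H12: four single C-C bonds, three CH3, one CH2, one CH
isopentane = {
    "C": 5, "H": 12, "C-C": 4, "C=C": 0, "C-C triple": 0, "C-C aromatic": 0,
    "C primary": 3, "C secondary": 1, "C tertiary": 1, "C quaternary": 0,
}
descriptors = expand(isopentane)
print(len(descriptors), descriptors["C^2"], descriptors["H^3"])
```

Each molecule then contributes one row of 30 values to the regression matrix, from which the elimination procedure selects a subset.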

CONCLUSIONS
The simplest and most elementary set of topological descriptors, the number of atoms of each kind and the number of chemical bonds, gives better results for modeling the molecular charge distribution than the more elaborate topological indices, including those that can explain some electronic effects such as the $S_i$'s. This result can be easily understood. It is a well-known fact that a statistical study does not allow one to derive unambiguous physical chemistry interpretations on the basis of the molecular descriptors employed to derive a model. In other words, there are no universal rules to draw conclusions on the basis of a particular set of molecular and/or topological descriptors, because a relatively good predictive power of the model does not imply cause-effect relationships between the property and the independent variables. Thus, the naïveté of these descriptors does not lead to a poorer model than that obtained with more sophisticated indices.
One must take into account the extremely simple computational procedure needed to derive the basic descriptors: just counting and identifying atoms and chemical bonds. Although the use of these basic molecular descriptors does not provide any physical insight into the molecular behavior, they make clear the fundamental value of the atom and bond concepts within the framework of the molecular structure paradigm. After all, other more elaborate molecular descriptors always contain some degree of arbitrariness. This is unavoidable, owing to the strong reduction involved in trying to describe an issue as complex and subtle as the molecular structure concept by means of a single number.
The conclusions obtained in this paper are in line with other similar findings reported before [42,43,44,45]. At present, we are studying other molecules and different activities and properties to support the motivating idea of this work. Results will be given elsewhere in the near future.