Nanoquantitative Structure-Property Relationship Modeling on C 42 Fullerene Isomers

The interest of scientists in nanostructures has been increased in the last years and proper methods for their assessment are needed. In silicomethods found their usefulness in the replacement of experimental evaluation and are successfully used as efficient alternatives for estimation and prediction of compound’s properties or activities. In this paper, it is shown that a Quantitative Structure-Property Relationship method is proper to be applied also on nanostructures. Based on computational experiment, several models to describe the total strain energy of C 42 fullerene isomers were obtained and their characteristics are presented. Furthermore, the best performing model obtained on C 42 fullerene isomers was validated on C 40 fullerene isomers.


Introduction
Since their discovery in 1985 [1], fullerenes attracted interest in different fields of science, including medical field (e.g. for potential use as antibiotics [2,3,4], as inhibitors of erythroid cells -fullerenol [5], as drugdelivery system [6], or as inhibitors of inflammatory mediators [7]). Fullerene molecules are constructed from carbon atoms and take the shape of sphere (also known as buckyballs), ellipsoid or tube [8]. First spherical fullerene, C 60 , was discovered in 1985 [1]. Fullerenes have different properties and showed different number of associated isomers (Table 1) [9]. The smallest fullerene (C 28 ) was stabilized by metal encapsulation (with Ti, Zr, and U) by Dunk et al. [10]. Chen et al. showed that C 32 fullerene has stronger aromaticity compared with C 30 and C 34 respectively [11]. Fifteen distinct isomers with different energies were reported by Manna and Ghanty who encapsulate U into various C 36 cages [12]. Muhammad et al. showed that C 20 is a closed-shell fullerene, fullerenes C 26 and C 30 are pure open-shell compounds, whereas C 36 , C 40 , and C 42 are intermediate open-shell compounds [13]. The C 42 fullerenes are small, not necessary spherical cages. The C 42 cages enclosed high pentagon/hexagon ratio [14]. Fullerene C 42 along with C 60 showed highest values of the main peak on Matrix-Assisted Laser Desorption Ionization Time-of-Flight (MALDI-T OF) on mass spectrometric measurement [15]. Some activities of fullerenes have been modeled using quantitative structure-activity relationship (QSAR) approaches (such as anti HIV protease inhibition activity [16], antiviral activity [17], drug delivery system [18]). However, C 60 received the main attention while other fullerenes were neglected in regards of QSAR/QSPR (Quantitative Structure-Property Relationship) modeling. The aim of our research was to model the total strain energy of the isomers of C 42 fullerene using the structural information.

Materials and Methods
All C 42 fullerene isomers were included into the analysis. Data related to continuum elasticity expressed as Total Strain Energy (TSE in eV) and the structures as *.xyz files of C 42 fullerene isomers were taken from [19] (Table 2). The analysis was conducted on the downloaded file of the C 42 isomers without any modification on the available geometry. According to [19], the fullerene geometries were based on the geometry of the structures in the Yoshida's Fullerene Library (http://www.cochem2.tutkie.tut.ac.jp/Fuller/Fuller.html, UNIX files) and re-optimized using Dreiding-like force-field [20]. Here are used the obtained geometry. The steps applied in the analysis are depicted in Scheme 1.

Scheme 1. Flowchart of the applied methods. The pool of filtered SMPI (Szeged Matrix Property Indices) descriptors contain those descriptors with absolute values between 10 -7 and 10 7 .
In the first step of the analysis the downloaded filed were translated into *.mol file with Spartan software (https://www.wavefun.com/products/spartan.html). In the second step the *.mol file is transformed as *.hin file using Babel software (http://openbabel.org). The partial charges were calculated in the third step using HyperChem software (http://www.hyper.com/) by applying PM3 (Parameterized Model number 3 [21]) single point (energy) semi-empirical calculations. The structural features of the investigated nano class of compounds were extracted using unsymmetrical Szeged set, an extension of corresponding Szeged Matrix [22] (forth step). The calculated values of the structural descriptors and the collected values of total strain energy were included in nano-QSPR modeling in the fifth step of the analysis and models with the highest goodness-of-fit (defined as highest correlation coefficients) were analyzed and validated in leave-one-out and leave-many out analyses [23,24].
Leave-one-out analysis retrieve valid models if determination coefficient (Q 2 ) takes values higher than 0.5.

Nano-QSPR C 42 fullerene models
Leave-many-out analysis was conducted for the models with highest abilities in estimation expressed as the highest value of the correlation coefficient. The set was split using a simple random technique [25] in training and test with 2/3 of compounds in training set. The models obtained in training sets were used to predict the TSE in the test sets. The leave-many-out analysis was run five times for equations identified as being with highest estimation and internal prediction abilities. The assessment of the prediction ability was done on an external data set represented by C 40 isomers considering the same property. The TSE values and the structures for external validation were taken from the same source as C 42 isomers: http://nanotube.msu.edu/fullerene/fullerene.php?C=40 (accessed December 20, 2015). Several metrics were used to assess the prediction ability of the model [23,24]: determination coefficient on the external set (R 2 ext ), predictive square correlation coefficient on external set (Q 2 F2 , [26]), external prediction ability (Q 2 F3 ), root mean square error of predicted (RMSEP), mean absolute error of predicted (MAEP), percentage predictive error (%PredErr), and concordance correlation coefficient (CCC [27]).

Results and Discussion
Structural information of the investigated C 42 isomers was obtained by calculation of the pool of descriptors given by Szeged Matrix Property Indices (SMPI) method [28]. Performing models in regards of goodness-offit (highest correlation coefficient) with one, 2, 3 and 4 SMPI descriptors were obtained and are given in Eq(1)-Eq (4): 16 (3) Ŷ TSE(4) = -199.61 -IFETB×21.63 + IFUGB×40.90 -IIUGF×2.62×10 -3 + IJUGE×1.56 (4) where Ŷ TSE = Total Strain Energy estimated by the model; IJUGE, IIUGF, IFEGE, IFETB, and IFUGB = SMPI descriptors. Two descriptors (IFETB and IFUGB) account the atomic number as atomic property, other two descriptors account electronegativity (IJUGE and IFEGE), while one account the first ionization energy (IIUGF). The investigated property is related with the geometry of compounds (fourth letter 'G' in the name of descriptors) with one exception that relates with topology (IFETB descriptor). The other letters reflect the linearization operator (first letter), matrix operation (second letter) and interaction descriptor (third letter). As expected, the determination coefficient increase as number of descriptors in the models increases while the standard error of the estimate decreases (Table 3). = adjusted determination coefficient; se = standard error of estimate; F (p) = Fisher's statistic (p-value); |t min | = the minimum of absolute t-statistic associated to the intercept and coefficients of the model %PredErr = percentage prediction error; Q 2 = determination coefficient in leave-one-out analysis; loo = leave-one-out analysis The distance between determination coefficient of the model and determination coefficient obtained in leaveone-out analysis varied from 0.0027 to 0.0227, the smallest distance being obtained by Eq(3) ( Table 3). On the other hand, the smallest difference between standard errors (estimation model and leave-one-out model) is obtained by the same model (Eq(3)). The analysis of the results presented in Table 3 showed that the model with four descriptors is the one with smallest percentage of prediction error. Furthermore, the data on the scatter closest to the straight line is observed for the model given by Eq(4) (Fig. 1). Fig. 1 shows the absence of the differences between models from Eq(3) and Eq(4), with the dispersion of the point in the scatter closest to the line for model given by Eq(4).

Figure 1: Observed vs. estimated TSE by Eq(1)-Eq(4).
The main characteristic of the models given by Eq (3) and Eq(4) obtained in leave-many-out analysis (training vs. test analysis; 2/3 of compounds in training set run 5 times) are presented in Table 4. The results presented in Table 4 showed the stability of the models, with internal prediction power (defined as determination coefficient in test sets) closed to the estimation power (determination coefficient in training set) from both investigated models. Therefore, the results obtained in training sets closely follow the results on the whole sample for Eq(3) with R 2 in the same range when two decimals are of interest. The R 2 obtained in test set in all five runs of the leave-many-out analysis was equal with 0.99, so slightly higher than the R 2 obtained in training sets (0.98). In three cases out of five, the R 2 in training sets for Eq(4) were on the same range for two decimals with the R 2 value given in Table 3. However, without any exception, the R 2 in test sets was smaller than the R 2 in training sets for Eq (4), with values that varied from 0.0005 (id7 in Table 4) to 0.0264 (id6 in Table 4). These results showed that Eq(3) perform slightly better in terms of determination coefficients in leave-many-out analysis. The plots of the models obtained in the fourth run for Eq(3) and fifth run for Eq (4), as examples, are given in Fig. 2 Figure 2: Internal prediction vs. estimation power in training and test analysis for Eq (3) and Eq (4).
The equations identified with estimation power and internal prediction abilities, namely Eq(3) and Eq(4), were further applied on C 40 isomers to test the external prediction abilities. The prediction power of Eq(4) proved better compared with prediction power of Eq(3) (see Fig. 3 and Table 5).   6.49 R 2 ext = determination coefficient on the external set; Q 2 F2 = predictive square correlation coefficient on external set; Q 2 F3 = external prediction ability; RMSEP = root mean square error of predicted; MAEP = mean absolute error of predicted; %PredErr = percentage predictive error; NR = not reliable value Despite the fact that the predictive square correlation coefficient on external set is higher for Eq(3) compared with the value obtained with Eq(4), all other calculated metrics sustain that the model given by Eq(4) has better prediction abilities (highest determination coefficient on external set, lowest mean absolute error of predicted, lowest percentage predictive error, see Table 5). Furthermore, the analysis of the overall spread of the points in the scatter-plot lead to the conclusion that Eq(4) had better prediction abilities compared with Eq(3). Nevertheless, the mean of residuals proved significantly different than the expected value (zero). It could be concluded that the model given by Eq(4) fit to the data on which was constructed. Nevertheless, are the structural feature extracted by SMPI descriptors on C 42 isomers able to predict the TSE on C 40 isomers? SMPI descriptors used by Eq(3) and respectively Eq(4) were used to predict the TSE on C 40 isomers. One out the three descriptors from Eq(3) proved to have the slope not significantly different by zero and was not included in further analysis. The identified models obtained on C 40 isomers are given in Eq(5) and Eq(6): Ŷ TSE ( where Ŷ TSE = Total Strain Energy estimated by the model; IJUGE, IIUGF, IFETB, and IFUGB = SMPI descriptors. Two descriptors (IFETB and IFUGB) account the atomic number as atomic property, one descriptors account electronegativity (IJUGE), and one account the first ionization energy (IIUGF). The investigated property is related with the geometry of compounds (fourth letter 'G' in the name of descriptors) with one exception that relates with topology (IFETB descriptor). The other letters reflect the linearization operator (first letter), matrix operation (second letter) and interaction descriptor (third letter). Note that both models have the mean of residual not significantly different by zero (p>0.49). The analysis of the metrics associated to Eq(5) and Eq(6) lead to the conclusion that model given by Eq (6) is more performing than the model given by Eq (5). The same conclusion is obtained by analyzing the plots of observed versus predicted TSE (Fig. 4). 25 Figure 4: Analysis of Eq (5) and Eq(6) on external dataset represented by C 40 isomers.
The results of our study showed that the identified nano-QSPR models fit to the data based on which the model was identified (C 42 isomers) but could be used for selection of those structural descriptors with fair abilities in prediction on external data set (C 40 isomers). To sum up, equations relating electronegativities, ionization potential and energy has been identified on C 42 isomers and proved to work also on C 40 isomers. Note that eletronegativities and ionization potential are atomic properties and since the investigated set contains just C and H atoms, the identified relation between the three properties could be assigned alto to the topology and geometry of the investigated compounds.
To the best of our knowledge, structure-property relationship approaches were not applied on C 42 or C 40 fullerene isomers. The small-diameter fullerene (C 20 , C 34 , C 42 , and C 60 ) were mainly investigated in regards of properties (such as adsorption [29], distribution of CC distance [14], Schlegel diagrams of molecular structures [30]). Therefore, this is the first report of a quantitative relationship between structure and property of C 42 fullerene. Undoubtedly, the advancement from theoretical to experimental studies is desired.

Conclusions
The C 42 fullerene isomers were successfully model and the total strain energy was characterize as function of information extracted from structure of the compounds. The models with goodness of fit in leave-one-out (Q 2 =0.9768) and leave-many-out analyses that proved also prediction power is the one with four descriptors. The total strain reaction proved a function of electronegativity and first ionization energy, and in relation with geometry of compounds. The structural descriptors able to fairly explain the total strain energy on C 42 isomers proved also able to explain the same property on C 40 fullerene isomers.