Predicting Flash Point of Organosilicon Compounds Using Quantitative Structure Activity Relationship Approach

The flash point (FP) of a compound is the primary property used in the assessment of fire hazards for flammable liquids and is amongst the crucial information that people handling flammable liquids must possess as far as industrial safety is concerned. In this work, the FPs of 236 organosilicon compoundswere collected and used to construct a quantitative structure activity relationship (QSAR) model for predicting their FPs.The CODESSA PRO software was adopted to calculate the required molecular descriptors, and 350molecular descriptors were developed for each compound. Amodified stepwise regression algorithmwas applied to choose descriptors that were highly correlated with the FP of organosilicon compounds.The proposedmodel was a linear regressionmodel consisting of six descriptors. This 6-descriptor model gave an R2 value of 0.9174, Q2 LOO value of 0.9106, and Q 2 value of 0.8989. The average fitting error and the average predictive error were found to be of 10.34 K and 11.22 K, respectively, and the average fitting error in percentage and the average predictive error in percentage were found to be of 3.30 and 3.60%, respectively. Compared with the known reproducibility of FP measurement using standard test method, these predicted results were of a satisfactory precision.


Introduction
Flash point (FP) is the primary characteristic of a compound used in the classification of flammable liquids for assessing their fire and explosion hazards in most countries and also indispensable information required legally for the handling and transport of liquid chemical of safety concern.Although experimental FP data are desirable, there is often a significant gap between the demand for such data and their availability due to the fast evolution of technology in discovering or synthesizing new compounds.This gap originates from the following two difficulties in experimentally measuring FP: (1) the significant size of test sample required in FP measurement often is unrealistic for high-value chemical or for newly synthesized chemical of only little inventory; (2) some chemicals may also possess toxicity or radioactivity, which present risk of exposure to people engaged in the testing.Thus, an alternative method of sufficient reliability for determination of the FP is in need.
Many correlations have been proposed to predict the FP of organic compounds in the literature and review [1][2][3].The quantitative structure activity relationship (QSAR) approach is one of the predominant strategies practiced in predicting the FP.In this approach, molecular characteristics-based parameters, often referred to as "molecular descriptors, " are used to predict the FP.Typically, these molecular descriptors are directly calculated based on specific features or moieties in the molecular structure of a chemical rather than measures of physicochemical properties.Thus, the QSAR once successfully established can be easily applied to predict the FP of a novel substance.Because of this advantage, the development of QSAR for prediction of chemical or biological properties of chemical has been in focus of many research areas in recent years [4][5][6][7][8].The former value was retrieved from The Chemical Database and the latter from the Sigma-Aldrich's website; the numbers in the parentheses were the pressures applied in testing.b The former value was retrieved from The Chemical Database and the latter from the Sigma-Aldrich's website.
While the QSARs for predicting the FP of organic compounds are rigorously explored, those aiming to predict the FP of organosilicon compounds are, to the authors' best knowledge, rarely reported in the literature.Chen et al. proposed an FP model of organosilicon compounds based on a data set of 230 compounds [9].In their work, the data collected for 230 compounds were randomly divided into a training set of 184 compounds and a testing set of 46 compounds for the purpose of building and validating their model, respectively.Chen et al. proposed two models consisting of different numbers of molecular descriptors.The six-descriptor model provided a fitting performance of  2 = 0.8981 and predicting performance of  2 = 0.7601, and the thirteen-descriptor one gave a fitting performance of  2 = 0.9293 and predictive performance of  2 = 0.8268.Wang et al. employed the same dataset Chen et al. used to build an FP model of organosilicon compounds.However, they adopted the structure group contribution (SGC) approach rather than the QSAR approach [10].In Wang et al., 39 molecular groups were identified as affecting the FP and the individual contributions for each molecular group were determined.The model Wang et al. presented was of a fitting performance of  2 = 0.9330 and predictive performance of  2 = 0.8868.A major disadvantage with the SGC approach is that its application is limited to predicting for compounds of functional groups predefined in the SGC model's group contribution table.Thus, a model developed by the SGC approach does not apply to a novel substance of new functional groups.Pan et al. also proposed a 5-descriptor model to predict the FP based on a dataset of 207 organosilicon compounds, in which 166 compounds were used to build the model and 41 compounds used to validate the model.The fitting performance and predictive performance were reported to be of 0.909 and 0.908, respectively [11].It should be noted that all aforementioned models only considered constitutional descriptors, geometrical descriptors, and topological descriptors in the FP prediction.However, as other molecular descriptors, such as the quantum chemical descriptors, charged partial surface area descriptors, and molecular orbital related descriptor, have been found to be more effective in describing the structure of a molecule, it seems imperative to revisit the QSARs applied in the FP prediction, with these new descriptors incorporated as a dimension of the equation.

Methodology
The FP of an organosilicon compound as reported in the literature might be measured using two different test methods, the ASTM D 93 (the Pensky-Martens Tester) and the ASTM D 3828 (the Setaflash Tester) methods [12,13].Between these two methods, the Pensky-Martens Tester has been accepted worldwide in the measurement of FP and frequently the preferred method.However, the quantity of the sample required in a single test for the Pensky-Martens Tester is more than 50 mL, whereas the quantity required in a test using the Setaflash Tester is below 4 mL.Thus, in the occasions where the test chemicals were difficult or expensive to obtain, for example, the organosilicon compounds, the Setaflash Tester method may appear to be an attractive option.However, the FPs for the same compound measured by these two methods have been found to be inconsistent in many cases.For example, the difference in the FPs determined for 3methacryloxypropyltrimethoxysilane by these two methods was reported to be greater than 40 ∘ C [14,15].Therefore, the FPs collected in this study were limited to those measured using the ASTM D 93 method.
The variation in the measured FPs could also originate from the testing condition applied in the FP measurement.For an organosilicon compound, the FP might be measured in vacuum rather than at atmospheric pressure.Table 1 provides examples for such situations.Using vinyl tris (trimethylsiloxy) silane as an example, the boiling points (BPs) of vinyl tris (trimethylsiloxy) silane were reported to be about 40 ∘ C at 1 torr and 87 ∘ C at 20 hPa (note: one atmospheric pressure equals 76 torr or 1013 hPa).The FP of this compound was reported to be 30 ∘ C in The Chemical Database and 66 ∘ C in the Sigma-Aldrich's website; in both cases the experimental conditions were not clearly specified.However, both these two FPs were probably measured at the specified vacuum conditions, not at atmospheric pressure, because the FP as currently displayed in the Sigma-Aldrich's website was between the two reported BPs and the FP reported in The Chemical Database was lower than the BP observed at 1 torr.The validation of the FP values reported in the databases required a sufficient understanding of the experimental conditions applied in the original measurement.In cases where the experimental conditions were unavailable from the data compilations, the corresponding FP data were excluded from our dataset.In this study, the values of FP for 236 organosilicon compounds were collected from the literature [14,16].Among these organosilicon compounds, two hundred and thirty were the same as those applied in Chen et al. in the development of FP QSAR [9], and the other six compounds in the current dataset were retrieved from the DIPPR database [16].These organosilicon compounds varied widely in both the FP and the molecular weight (MW).The range of FP for these compounds was between 204.15 K and 446.15K and that of MW was between 60.19 and 607.44.For convenience of later comparison, we partitioned the 230 compounds from Chen et al.'s dataset into a training set and a testing set as the same way they were divided in the original work and added the six FP data from the DIPPR database into the testing set.As a result, there were 184 compounds included in the training set and 52 compounds in the testing set for the development of FP QSAR in this study.
Figure 1 compares the distribution of MW for the compounds in the training set with that for those in the testing set, and Figure 2 compares the distribution of FP the compounds in the training set with that for those in the testing set.As the results show, the distribution of FP and MW both were similar between the training and the testing set, suggesting that the compounds selected in the testing set in this study were a reasonable representation of those included in the training set.The molecular structure of all 236 organosilicon compounds was drawn in the computer program HyperChem and preoptimized using the MM+ and then AM1 molecular mechanics force field to provide an accurate molecular geometry [17].Since the numerical values of some types of molecular descriptors were dependent on the bonds length and bonds angle in the structure, the aforementioned procedures of optimization were necessary to avoid errors in calculating these descriptors.In the next step, the software CODESSA PRO was used to calculate values of the molecular descriptors for all explored compounds according to their optimized chemical structures [18].CODESSA PRO was calculated up to 1,098 descriptors for every molecule.However, some of these molecular descriptors would return the same numerical values for the explored compounds.These molecular descriptors were removed, and a total of 350 molecular descriptors remained in the calculation.
The molecular descriptors retained for further calculation were the candidates of the regressor variables for constructing the multiple linear regression (MLR) model for FP prediction.Mathematically the MLR model was expressed as where the variable  was the molecular descriptor considered influential to FP prediction and selected into the MLR model and the coefficient  was the partial coefficient of the variable defining the weight of influence.As the MLR model depicted in (1) was built up from a large number of regressors, there might be dependence between the chosen regressors, and the correlations between these regressors should be examined to remove redundant (collinear) regressors and ensure that only descriptors of significant effects in FP prediction be identified.The statistical treatment applied here aimed to develop a QSAR model that predicted the desired property with the least number of molecular descriptors at the highest accuracy.
Selecting a subset of regressors to create a model with smaller number of regressors is a problem of feature selection.The selection criteria typically involve the minimization of a specific measure of the predictive error.Once the selection criteria are established, feature selection algorithms are applied to search for the group of regressors that optimally model the measured response, subject to the constraints specified in the criteria such as the required or excluded features and the size of the subset.In the literature many algorithms have been proposed to accomplish this work.
The stepwise regression, which is adopted in this work, is a systematic method for adding and removing regressors from a MLR model based on their statistical significance [19].The stepwise regression method begins with an initial model and then compares the explanatory power of models with incrementally larger or smaller numbers of regressors.At each step, the  value of an -statistic is computed to test models with and without a potential regressor.If a regressor is not currently in the model, the null hypothesis is that the regressor would have a zero coefficient if it is added to the model.If there is sufficient evidence to reject the null hypothesis, the regressor is added to the model.Conversely, if a regressor is currently in the model, the null hypothesis is that the regressor has a zero coefficient.If there is insufficient evidence to reject the null hypothesis, the regressor is removed from the model.However, depending on the regressors included in the initial model and the order in which regressors are moved in and out, this method may build up different models from the same set of potential regressors.In this sense, the models obtained by the stepwise regression method are just locally optimal but are not globally optimal.To overcome this drawback, the random search technique was implemented to automatically set up the initial model in the present work to ensure the representativeness of the final model, and the details could be found in our previous work [9].

Results and Discussions
Using the FP values for 236 compounds collected in the datasets, the following 6-descriptor MLR model for predicting the FP of organosilicon compounds was established: The definitions for all the molecular descriptors present in (2) are summarized in Table 2.Among these descriptors, NF is the constitutional type of descriptor that describes the number of fluorine atoms in the molecular structure.The descriptor 1 , Randic index of order 1, is the Randic connectivity index interpreting the contribution of one molecule  to the bimolecular interaction arising from the encounters of bonds of two identical molecules.The maximum partial charge for all atom types, the descriptor  max , is the maximum positive charge of atoms in a molecule.The descriptor DPSA2 indicates the difference in the total charge weighted surface area, which is defined by the total charge weighted positive solvent-accessible surface area minus the total charge weighted negative solvent-accessible surface area.The relative principal moment of inertia , the descriptor   , is the principal moment of inertia which gives the inertia between the maximum and minimum values.The final descriptor in the model,  2 tot , introduces the difference (Pos-Neg) in the charged part of charged surface area and is defined by the partial positive solvent-accessible surface area minus the partial negative solvent-accessible surface area.These descriptors were well documented in the literature [20].The numbers in parentheses in (2) were the estimated standard error for each individual parameter.Table 3 summarized the statistics from the goodness-of-fit analysis for the model presented in (2).As the results show, all of the molecular descriptors selected and integrated into the final MLR model were highly significant.The goodness-of-fit ( 2 ) and leaveone-out cross-validations ( 2 LOO ) were 0.9174 and 0.9106, respectively.Thus, the model described in (2) provided a satisfactory performance in endpoint fitting with reasonably robustness in the model parameters.The robustness in the model parameters was also verified by the estimated standard errors as shown in Table 3.
Table 4 shows for the proposed QSAR model the commonly examined indices of model performance in addition to the goodness-of-fit, including the external validation ( 2 ), the average absolute error (AAE), the maximum absolute error (MAE), the average absolute error in percentage (AAEP), and the maximum absolute error in percentage (MAEP).As the results show, for the FP-predicting model established in this study, the  2 , AAE, and MAE were 0.9174, 0.34 K, and 44.40 K, respectively.When the proposed model in (2) was validated externally using the testing set, the predictive performance,  2 , was 0.8989, and the AAE and MAE were 11.22 K and 31.48K, respectively.Compared with the reproducibility known in the measurements generated using the ASTM D 93 test method [12], the FPpredicting model proposed in this study was able to provide predicted FPs with reasonable accuracy.The experimental values of the FP for the organosilicon compounds collected in the training and the testing datasets in this study were further compared with their corresponding values predicted by the proposed model and shown in Figures 3(a) and 3(b), respectively.As these figures revealed, no outliers of statistical significance were present to the regressions.
Figure 4 shows the distribution of fitting errors in percentage among the compounds in the training set (Figure 4(a)) and the distribution of predictive errors in percentage among those in the testing set (Figure 4(b)), respectively.As the results show, the maximum error in the percentage error was less than 10%, except for four compounds in the  training set.The largest percentage error was 16.26% in the training set and 9.77% in the testing set.These results demonstrated a sufficient performance of the proposed model in both the fitting and predictive ability.
To verify the absence of chance correlation between the endpoint and the modeling descriptors, the -scrambling technique was employed as a test of susceptibility for the model proposed in the current study.Figure 5 shows the result.In Figure 5, the -axis is the number of scrambling and the -axis is the leave-one-out ( 2 LOO ) cross-validation.Overall, the values of  2 LOO were insignificant in all shuffled cases, attesting to a very small possibility of chance correlation if any for the proposed model.Table 4 also compares the performance of the model established in this study with those of the models developed in Chen et al. [9] and Pan et al. [11].When the model proposed in this study was compared with Chen et al. 's work, the  2 value increased from 0.8981 to 0.9174 and the  2 from 0.8535

Conclusions
In this study, a QSAR model of six descriptors is proposed to predict the FP of organosilicon compounds.The performance of the established model was examined, and the results indicated a  2 value of 0.9174,  2 value of 0.8989, an average fitting error in percentage of 3.30%, and an average predictive error in percentage of 3.60%.The average fitting error and predicting error were 10.34 K and 11.22 K, respectively.Compared with the known experimental reproducibility in FP measurement, this model delivered satisfactory performance in FP prediction with reasonable precisions.
Given the limitations inherent in evaluating the FP of a compound via experimental determination, the proposed QSAR model provides an alternative mechanism in compliance with the REACH regulations [21] for the determination of FP required in fire hazard assessment.In addition, as only the calculated molecular descriptors are required in the proposed model, this model may be readily applied to predict the FPs of novel compounds that are usually constrained

Figure 1 :Figure 2 :
Figure 1: Distributions of molecular weight of compounds included in the (a) training set and (b) testing set in this study for application in development of quantitative structure activity relationship for flash point prediction.

Figure 3 :
Figure 3: Relative distribution of experimental values of flash point versus values predicted by the quantitative structure activity relationship established in this study for organosilicon compounds in the (a) training set and (b) testing set.

Figure 4 :
Figure 4: Distribution of fitting errors in percentage among compounds in the training set (a) and of predictive errors in percentage among compounds in the testing set (b).
these six compounds do not appear in the training dataset for these three studies, the predictive errors in them do reflect the predictivity of individual models.As the results show, the predictive errors were higher than 150 K for five of these six compounds when their FPs were predicted using the model by Pan et al., while the predicted value of FP for hexadecamethylheptasiloxane was also a outlier when predicted using the model by Chen et al.In comparison, there were no obvious outliers found for the proposed model.It should also be noted that both Pan et al. 's and Chen et al. 's works used molecular descriptors generated by the computer program Dragon, in which there were only a small number of electrostatic types of molecular descriptors available for model construction.As the development of flash point is driven by both the physical and chemical behaviors (vaporization and combustion) of the compound, the aforementioned results elucidate the influence of electrostatic properties and the necessity of considering such effects in explaining the chemical behaviors in FP development.

Table 1 :
Boiling point and flash point for selected organosilicon compounds.

Table 2 :
Definitions of molecular descriptors in the quantitative structure activity relationship established for prediction of flash point for organosilicon compounds.

Table 3 :
Statistics summarizing goodness-of-fit and robustness analysis for the quantitative structure activity relationship established for prediction of flash point for organosilicon compounds.

Table 4 :
[11]istics summarizing goodness-of-fit ( 2 ), external validation ( 2 ), average absolute error (AAE), maximum absolute error (MAE), average absolute error in percentage (AAEP), and maximum absolute error in percentage (MAEP) for the quantitative structure activity relationships established in the current study and in Chen et al.[9]and Pan et al.[11]for prediction of flash point for organosilicon compounds.

Table 5 :
[11]rimental values of flash point and values predicted by the models in Chen et al.[9], Pan et al.[11], and in the current study for the six organosilicon compounds added in the testing set of this study from DIPPR database.It should be noted the training dataset of this work is the same as that of them, and there are six newadded chemicals in present testing dataset.Thus the aforementioned improvement suggests actual improvement in the model performance for the model proposed in this study.When the current model was compared with the announced performance described in Pan et al., the performance of the current model seemed not to gain improvement.However, it should be noted that both numbers of explored compounds in their training dataset and testing dataset are smaller than those of this study, and this will overestimate its performance.To illuminate on this point, Table5lists the experimental FPs and the FPs predicted by the models in Chen et al., Pan et al., and in the current study for the six organosilicon compounds added in the testing set of this study from the DIPPR database.These six compounds are[3-(mercapto)propyl] triethoxysilane, 3-(trimethoxysilyl)-1-propanethiol, hexadecamethylheptasiloxane, gamma-aminopropyl-triethoxysilane, dichlorodimethylsilane, and dimethylchlorosilane.Because