Prediction of Gas Chromatography-Mass Spectrometry Retention Times of Pesticide Residues by Chemometrics Methods

A quantitative structure-retention relationships (QSRRs) method is employed to predict the retention time of 300 pesticide residues in animal tissues separated by gas chromatography-mass spectroscopy (GC-MS). Firstly, a six-parameter QSRR model was developed by means of multiple linear regression. e six molecular descriptors that were considered to account for the effect of molecular structure on the retention time are number of nitrogen, Solvation connectivity index-chi 1, Balaban Y index, Moran autocorrelation-lag 2/weighted by atomic Sanderson electronegativity, total absolute charge, and radial distribution function6.0/unweighted. A 6-7-1 back propagation arti�cial neural network (ANN) was used to improve the accuracy of the constructed model. e standard error values of ANN model for training, test, and validation sets are 1.559, 1.517, and 1.249, respectively, which are less than those obtained reveals by multiple linear regressions model (2.402, 1.858, and 2.036, resp.). Results obtained the reliability and good predictability of nonlinear QSRR model to predict the retention time of pesticides.


Introduction
Pesticides are used on a large scale for agricultural purposes.e adverse effects of pesticides on both human health and the environment are a matter of public concern.us, both the actual state and the transition of pesticide residues in various matrices including water, soil, and agricultural products should be extensively monitored.ese researches should be undertaken using an efficient analytical system with a laborsaving and cost-effective device, as pesticides as well as applicable �elds of research rang over a broad spectrum.Conventional sample preparation methods used to analyze pesticide residues in various matrices require expensive instrumentation, an expert analyst [1][2][3][4].Besides the above mentioned, the experimental determination of chromatographic retention parameters of pestocides is time consuming and expensive.Alternatively, quantitative structure-retention relationship (QSRR) provides a promising method for the estimation of retention time based on descriptors derived solely from the molecular structure to �t experimental data.e advantages of this approach lie in the fact that it requires only the knowledge of chemical structure and is not dependent on any experiment properties.
QSRR studies [5,6] started from the calculation and selection of descriptors, to �nding their relation to retention times and derivation of mathematical models that involve these multivariate data in order to be used for predictive purposes in chromatographic system.Multivariate data consist of the results of observations of many different variables (molecular descriptors) for a number of individuals (molecules).Known methods for this include the multiple regression analysis, experimental design techniques, and nonlinear regression.e drawback, sometimes, of these very popular techniques is their inability to give highly predictive models due to hidden nonlinearity inside the data variables or the prerequisite to specify the mathematical model before the �tting of the data.So there is a need to improve further such kind of models in order to extract the most accurate prediction.�o this end, arti�cial neural networks (ANNs) could be used successfully in QSRR studies providing better results than the conventional regression models.
�oday arti�cial neural networks [7][8][9][10] have become an important modeling technique for QSAR and QSPR, and also this technique has been applied in numerous application areas of chemistry and pharmacy [11][12][13][14][15][16].e mathematical adoptability of ANN commends them as powerful tool for pattern classi�cation and building predictive models.A particular advantage of ANNs is their inherent ability to incorporate nonlinear dependencies between the dependent and independent variables without using an explicit mathematical function.ere are few reports about the application of QSRR in the chromatographic studies: D' Archivio et al. modeled the combined effect of solute structure and eluent composition on the retention behavior of 26 pesticides in isocratic reversed-phase high-performance liquid chromatography using multilinear regression and arti�cial neural networks [17].In another study, they applied a six-parameter nonlinear QSRR model to predict the retention behavior of 26 pesticides including commonly used insecticides, herbicides, and fungicides as well as some metabolites in reversedphase high-performance liquid chromatography [18].Also Ghasemi and his coworkers used multiple linear regression and partial least squares regression to QSRR study of the gas chromatography retention time of 38 diverse chlorinated pesticides, herbicides, and organohalides by using molecular descriptors [19].In the present study, the application of ANN is being described in order to predict accurately the retention time values of 300 pesticides in four groups with different molecular structures [20].

Dataset. Development of the multiple linear regression
and arti�cial neural networks in the present work relies on a data set taken from reference [20].is dataset (Table 1) consists of 300 pesticides in animal tissues such as beef, mutton, pork, chicken, and rabbit ranging in retention time from 5.62 to 35.77 min.All of these 300 pesticides divided into four groups, depending upon properties and retention time of each pesticide.Each group is consisting different kind of pesticides such as acaricides, insecticides, and Fungicides.To apply the ANN modeling, the dataset was randomly divided into three groups of training, test, and validation sets consisting of 256, 22, and 22 pesticides, respectively.e training set was used for the model generation.e test set plays a different role in the cases of the MLR and the ANN models.For the ANN model, this set was used for early stopping to optimize learning iteration and avoid overtraining.e validation set was used to assess the accuracy of the ANN predictions.On the other hand, in the case of the MLR model, the test set and the validation set were used to evaluate the model.As can be seen from Table 1, the pesticides in the test and validation sets were chosen in a way that adequately represents the training set in term of retention time.

Molecular Descriptors.
All structures were generated with the HyperChem (Version 7) [21] and optimized with the classic potential MM + included.Molecular geometry was optimized with the Austin Model 1 (AM1) method [22], and then the molecular descriptors were calculated by the soware Dragon 3.0 [23].Overall more than 1400 theoretical descriptors were calculated for each molecule by this soware.ese descriptors can be classi�ed into several groups: 0D: constitutional descriptors; 1D: functional groups, atom-centered fragments, empirical descriptors and molecular properties; 2D: topological descriptors, molecular walk counts, BCUTs descriptors, Galvez topological charge indices, and 2D autocorrelations; 3D: aromaticity indices, Randic molecular pro�les from the geometry matrix, geometrical, RDF, 3D-MORSE, WHIMs, and GETAWAYs descriptors.Molecular descriptor meanings and their calculation procedure are explained in Handbook of molecular descriptors by Todeschini.ese molecular descriptors of different kinds were used to describe compound chemical diversity.

Regression Analysis
. e main goal of the generation of the MLR model was to choose a set of suitable descriptors that can be used as inputs for construction of the ANN model.Linear models were formed by a stepwise selection of important descriptors and MLR model construction [24].e best MLR model is one that has high correlation coefficient and -value, low standard error, and high prediction power.e statistics of the constructed MLR model is presented in Table 2.

2.�. Arti�cial �eural �et�or�.
A detailed description of theory behind arti�cial neural networks has been adequately described in several publications [25][26][27][28][29][30][31].An ANN program was written in MATLAB 7 in our laboratory.is network was feed-forward fully connected that has three layers with sigmoidal transfer function.Descriptors appearing in the MLR models were used as inputs of network and signal of the output node represent the retention time of interested compound.us, this network has six nodes in input layer and one node in output layer.e value of each input was divided into its mean value to bring them into dynamic range of the sigmoid transfer function of the network.e back-propagation algorithm was used for the training of the network [32,33].Before training the network, the parameters of the number of nodes in the hidden layer, weights and biases learning rates, and momentum values were optimized [34][35][36].Optimized values of these parameters were numbers of nodes in the input layer (=6), numbers of nodes in the hidden layer (=7), numbers of nodes in the output layer (=1), weights learning rates (=0.1),bias learning rate (=0.4), momentum (=0.5), and transfer function was sigmoid.e ANN-calculated values of permeability coefficient for training, test, and prediction sets, are shown in Table 1.e correlation between these descriptors is shown in Table 3.As shown in this table, there are no signi�cant correlations between these descriptors.e �rst descriptor with the larger mean effect is solvation connectivity index-chi 1 (1sol) that de�ned in order to model solvation entropy and describe dispersion interactions in solution.Taking into account the characteristic dimension of the molecules by atomic parameters, they are de�ned as

Results and Discussion
where   is the principal quantum number (2 for C, N, O atoms, 3 for Si, S, Cl, …) of the th atom in the th subgraph and   the corresponding vertex degree;  is the total number of th order subgraphs;  is the number of vertices in the subgraph.e normalization factor 1/() 1 is de�ned in such a way that the indices   and    for compounds containing only second-row atoms coincide.e �rst-order solvation connectivity index is where  runs over all the bonds;   and   are the principal quantum numbers of the two vertices incident to the considered bond.is index coincides with the Randic connectivity index 1  for the hydrocarbons;    for all the atoms.ese molecular descriptors are de�ned for an H-depleted molecular graph [37].e positive sign for the mean effect of this descriptor reveals that molecules have higher numerical value of 1sol, therefore, they have longer retention time.For example, HCH-epsilon (compound 164) with 6 chloride atoms in its structure has bigger 1sol (7.75) than Allidochlor (compound 1) with one chloride atom in its structure (5.61) and also has longer retention time than Allidochlor (20.78 and 8.78 min, resp., Figure 1).Compounds that have more atoms in their structure have larger numerical value of 1sol and so they have longer retention time, for example, compound 194 (Alpha-cypermethrin) has 1sol value of 13.32 and retention time of 33.35 min but compound 4 (Chlormephos), which has less atoms and shorter structure than Alphacypermethrin, has 1sol and retention time of 7.21 and 10.53 min, respectively (Figure 2).e second important descriptor is Balaban  index which was calculated by the same formula as the Balaban distance connectivity index , but by using atomic information indices instead of vertex distance degrees [38].e  index is de�ned based on atomic information indices   calculated for vertices of a H-depleted molecular graph as follows: where  runs over all of the different topological distances from the th vertex,    is the number of distances from the th vertex equal to , and   is the th atom eccentricity (i.e., the maximum topological distance from the considered atom) [38].So, the Balaban  index is calculated as where  is the number of bonds and  is the cyclomatic number.As the number of cycles, and branching in molecular structures increased (so   in (2) increased), the value of  index decreased.For example, compounds 70 (EPTC),      e total absolute charge is the other descriptor which, also known as the electronic charge index (ECI), is the sum of absolute charge over all atoms in a molecule and is a measure of molecule polarity [37].erefore, compounds with polar bonds have more numerical value of Qtot and have longer retention on a polar stationary phase than others.For example, compound 288 (Pirimiphos-methyl) has more retention and polarity (20.30min and 10.14, resp.)than Dichlobenil (compound 72) that has Qtot of 1.02 and retention time of 9.75 min (Figure 3), also Compounds 9 (Ethalfuralin), 21 (Dinitramine), and 175 (Flumetralin) that contain CF3 and two nitro groups in their structure have retention time of 15, 19.35, 24.10, and Qtot value of 6.66, 6.86, and 7.33, respectively.e other molecular descriptors with lower mean effects are number of nitrogen atoms, Moran autocorrelation-lag 2/weighted by atomic sanderson electronegativity and radial distribution-6.0/unweighted.As number of nitrogen atoms increase, in molecule's structure due to increasing the interaction of molecules with polar stationary phase, the retention time of compound increases.For example, in etridiazol, atrazine-desethyl, and �uquinconazole (compounds 3, 11, and 255, resp.) which have 2, 5, and 6 nitrogen atoms in their structure, their retention time is 10.42, 16.76, and 32.62, respectively.A Moran coefficient is a general index of spatial autocorrelation that, if applied to a molecular graph, can be de�ned as where   is any atomic property (here Sanderson electronegativity),  is its average value on the molecule,  is the atom number,  is a Kronoker delta (  = 1 if   = , zero, otherwise).Δ is the sum of the Kronoker deltas, that is, the number of vertex pairs at distance equal to  [39].So the Moran autocorrelation-lag 2/weighted by atomic sanderson electronegativity can be a factor of electronegativity of moleculs.e last descriptor is radial distribution-6.0/unweighted.e 3D coordinates of the atoms of molecules can be transformed into a structure code that has a �xed number of descriptors irrespective of the size of a molecule.is task is performed by a structure coding technique referred to as radial distribution function code (RDF code) [40].In general, there are some prerequisites for a structure code: independence from the number of atoms, that is, the size of a molecule, unambiguity regarding the three-dimensional arrangement of the atoms, and invariance against translation and rotation of the entire molecule.
Formally, the radial distribution function of an ensemble of  atoms can be interpreted as the probability distribution to �nd an atom in a spherical volume of radius  [41].e equation represents the radial distribution function code as it is used in this investigation: where  is a scaling factor and  is the number of atoms.By including characteristic atomic properties  of the atoms  and , the RDF codes can be used in different tasks to �t the requirements of the information to be represented.e exponential term contains the distance   between the atoms  and  and the smoothing parameter  that de�nes the probability distribution of the individual distances.() was calculated at a number of discrete points with de�ned intervals.e atomic properties   and   used in this equation enable the discrimination of the atoms of a molecule for almost any property that can be attributed to an atom.Such distribution function provides, besides information about interatomic distances in a whole molecule, the opportunity to gain access to other valuable information, for example, bond distance, ring types, planar and nonplanar systems, and atoms types.is fact is a most valuable consideration for a computer-assisted code elucidation.e radial distribution function in this form meets the entire requirement mentioned above, especially invariance against linear translations.As RDF060u has negative sign for its mean effect, as molecules became larger their RDF factor in 6 Å radius reduces.For example Tebufenpyrad (compound 250) which has larger structure than Pyrimitate (compound 224) has lower RDF factor (18.79) than Pyrimitate (29.76); so the retention time of Tebufenpyrad is higer (29.06 min) than Pyrimitate (20.59).
From the above discussion, it can be seen that all descriptors involved in the QSRR model has physical meaning, and these descriptors can account for the structural features that affect the retention time of under studied pesticides.

Comparison of Neural Network and MLR Models.
A graphical comparison of ANN and MLR analysis is given in Figure 4, where the retention time values calculated by means of the respective models are plotted against the experimental values.e statistical parameters obtained by ANN and MLR models for these sets are shown in Table 4. e standard errors of training, test and validation sets for the MLR model are 2.402, 1.858, and 2.036, respectively, which would be compared with the values of 1.559, 1.517, and 1.249, respectively, for the ANN model.Comparison between these values and other statistical parameters in Table 4 reveals the superiority of the ANN model over MLR ones.Figure 5 shows the plot of the residuals against the experimental values of retention time, for the ANN model.Since the residuals are propagated on both sides of the zero line, there is no systematic error in developing of ANN model.

Conclusions
Few structure-activity relationships involving pesticides have been published.In this study, we use MLR and ANN to predict the retention time of 300 pesticides that were different in molecular structure.e results of this study demonstrate that QSRRs method using ANN techniques can generate a suitable model for prediction of gas chromatographic retention of pesticides.
Also the results obtained in this work indicate that the regression and ANN models exhibit reasonable prediction capabilities.Descriptors which appeared in the obtained QSRR models reveal that electronic interactions as well as steric parameters can be affected on the gas chromatographic retention time of pesticides.

3 F 2 : 3 F 3 :F 4 :
Comparison of 1sol values for -cypermethrin and allidochlor.Comparison of Qtot values for pirimiphos-methyl and dichlobenil.Plot of the calculated   by MLR method versus the experimental values for training, test, and validation sets (a) and the plot of retention time values predicted by 6-7-1 feed forward ANN (  (ANN)) against the experimental values (  (EXP)) (b).

4 F 5 :
Plot of the residuals versus the experimental values of   for training, test and validation sets.
Experimental, ANN, and MLR predicted values of the retention times (  ) of training, test, and validation sets.  ,   ,   , SE  .
F 1: Comparison of 1sol values for HCH-epsilon and allidochlor.
80 (Cis-Diallate), and 97 (iobenarb) which have amine group in their structures with retention time of 8.54, 14.75, 20.63 min have  index values of 2.27, 2.15, 1.02, respectively.is descriptor has negative sign for its mean effect.erefore, as the numerical value of this descriptor increased 3: Correlation matrix for descriptors applying in this work.