Prediction of Parallel Artificial Membrane Permeability Assay of Some Drugs from their Theoretically Calculated Molecular Descriptors

Parallel artificial membrane permeation assays (PAMPA) have been extensively used to determine the permeation potential of drugs. In the present work, the permeation of 94 miscellaneous drugs, measured as flux by PAMPA (logF), is predicted by quantitative structure-property relationship (QSPR) modeling based on a variety of calculated theoretical descriptors, which were screened and selected by a genetic algorithm (GA) variable subset selection procedure. These descriptors were used as inputs for the generated artificial neural networks (ANN). After generation, optimization and training, the artificial neural network (5:3:1) was used to predict logF for the training, test and validation sets. The standard errors of the GA-ANN calculated logF for the training, test and validation sets are 0.17, 0.028 and 0.15, respectively, which are smaller than those obtained by the GA-MLR model (0.26, 0.051 and 0.22, respectively). The results reveal the reliability and good predictability of the neural network model in predicting the membrane permeability of drugs.


Introduction
Determination of intestinal permeability is a key step in the selection of compounds for drug discovery. To evaluate the absorption of drugs with diverse structures across a membrane via the transcellular route, their permeability was measured using the parallel artificial membrane permeation assay (PAMPA). PAMPA, proposed by Kansy et al. in 1998 [1], consists of hydrophobic filters coated with lecithin in an organic solvent solution. It is a rapid in vitro assay system that evaluates transcellular permeation and is applicable to high throughput screening.
PAMPA has been used for the prediction of oral absorption and blood-brain barrier penetration [2][3][4][5][6][7]. A recent PAMPA variant (PAMPA-BBB) using porcine brain lipids has been successfully used to improve the prediction of blood-brain barrier (BBB) penetration. The PAMPA-BBB assay is high throughput, accurate, low cost and reproducible, and consumes a minimal amount of sample (<0.5 mg) [8]. Previously, the PAMPA permeability of peptide-related compounds such as protected peptides, cyclic peptides and indole compounds was measured as models of peptidomimetics [9][10][11]. One approach to calculating physical properties from molecular structural descriptors is the quantitative structure-activity/property relationship (QSAR/QSPR) method [13]. QSAR methods yield explicit or implicit correlations which can be applied to compounds structurally related to (but not identical with) the compounds in the original training set. The QSAR procedure was applied to the quantitative analysis of the relationship between chemical structures and PAMPA permeability coefficients. The classical QSAR, the so-called Hansch-Fujita approach, is representative of QSAR methods and has been widely used in medicinal chemistry [12]. The result of the QSAR analysis showed that the hydrogen-accepting ability of the molecules, in addition to hydrophobicity at a particular pH, was significant in determining the variations in PAMPA permeability coefficients.
Kansy et al. [13] reported QSAR models for the PAMPA flux of a large set of miscellaneous drugs with or without the addition of glycolic acid. Verma et al. [14] developed a QSAR model for the prediction of the permeation of drugs by PAMPA at pH 7.4. They correlated drug permeation by PAMPA with the partition coefficient in the 1-octanol/water system (a measure of the overall hydrophobicity of a molecule) and with the presence or absence of certain structural features with unusual effects. The functional groups with hydrogen bonding capabilities include the -COOH, -SO2NH2, aromatic -OH and -N(CH3)2 moieties. Their QSAR model for the permeation of 94 drugs gave R² = 0.721 and a standard error of 0.206.
In the present work, a QSPR model based on genetic algorithm (GA) and artificial neural network (ANN) techniques was developed to investigate the permeation of miscellaneous drugs measured as flux by PAMPA [14,15].

Data set
The data set of drug permeation through PAMPA was taken from the values reported by Verma et al. [14] and is shown in Table 1. This data set consists of the logarithm of the permeation flux (logF) for 94 compounds. The permeation of these miscellaneous drugs was measured as flux by PAMPA, with or without the addition of glycolic acid, at pH 7.4. The data set was randomly divided into training, test and validation sets, which consist of 70, 12 and 12 drugs, respectively. The training set was used for model generation, the test set was used to monitor the extent of overtraining, and the validation set was used to evaluate the predictive power of the obtained model.
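The random 70/12/12 partition can be sketched as follows. This is a minimal illustration in Python, not the authors' procedure: the function name `split_dataset` and the fixed seed are assumptions introduced here so that the split is reproducible.

```python
import random


def split_dataset(n_compounds=94, sizes=(70, 12, 12), seed=0):
    """Randomly partition compound indices into training, test and validation sets."""
    if sum(sizes) != n_compounds:
        raise ValueError("set sizes must add up to the number of compounds")
    indices = list(range(n_compounds))
    # The seed is a hypothetical choice; the paper does not report one.
    random.Random(seed).shuffle(indices)
    n_train, n_test, _ = sizes
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    validation = indices[n_train + n_test:]
    return train, test, validation


train, test, validation = split_dataset()
print(len(train), len(test), len(validation))
```

Each compound index lands in exactly one of the three sets, so the validation compounds play no part in either fitting or early stopping.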

Molecular descriptors
One important step in QSAR modeling is the numerical representation of molecular structures (often called the molecular descriptors of the chemical structure). The performance of the model and the accuracy of the results depend strongly on how this structural representation is performed. First, the structures of the compounds were drawn with the Hyperchem 7.0 program [15]. The geometry optimization was performed with the semiempirical quantum method AM1 [16]. The Hyperchem output files were used by the Dragon program to calculate five classes of descriptors: topological, electrostatic, constitutional, geometrical and quantum chemical [17]. Some of the descriptors generated for each compound encoded similar information about the molecule of interest. Therefore, it was desirable to test each descriptor and eliminate those that showed high correlation (R > 0.90) with each other. A total of 167 out of 542 descriptors showed high correlation and were removed from further consideration. Subsequently, the GA-MLR variable subset selection method was used to select the important descriptors. The names of the descriptors and the statistical values of the constructed GA-MLR model are shown in Table 2. These descriptors were used as inputs for the generated artificial neural network.
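The correlation screen described above can be sketched in a few lines of Python. The sketch rests on two assumptions that the text does not settle: the threshold is applied to the absolute value of the Pearson coefficient, and of two correlated descriptors the one encountered first is kept. The descriptor names and values in `toy` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(columns, threshold=0.90):
    """Keep a descriptor only if |R| with every already kept descriptor is <= threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical descriptor columns for five compounds.
toy = {
    "d1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "d2": [2.1, 3.9, 6.2, 8.0, 9.9],  # almost exactly 2 * d1, so it is redundant
    "d3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
print(drop_correlated(toy))
```

Constant descriptors would make the denominator zero and should be removed before this step.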

Genetic algorithm
Genetic algorithm (GA) is a stochastic optimization method inspired by evolutionary principles [18][19][20]. The distinctive aspect of a GA is that it investigates many possible solutions simultaneously, each of which explores a different region of parameter space [21]. At present, one of the best available tutorials on variable selection using genetic algorithms is that published by Leardi et al. [22]. In the present paper, the genetic optimization method followed the studies of Rogers et al. [23] and Luke [24] with a few minor modifications. In our GA program, which was written in MATLAB 7.5, two modifications were made. The first is a stochastic remainder method, which allows an individual with above-average fitness to be reproduced at least once [21]. The second is the inclusion of elitism, which protects the fittest individual in any given generation from crossover or mutation during reproduction. The genetic content of this individual simply moves on to the next generation intact. In the original studies, the fitness of an individual was determined by a function related to the residual error in the regression analysis of the training data. Here we used fitness functions that depend on the residual errors of the training and prediction sets and on the number of selected variables, according to the following equation:

$$\mathrm{Fitness} = \frac{1}{\mathrm{SEC} + \mathrm{SEP} + (m)^{w}} \qquad (1)$$

In this equation, SEC and SEP are the standard errors of the training and prediction sets, respectively [25], m is the number of variables in the represented model and w is a numerical value that sets the weight of m in the value of the fitness. In effect, the value of w determines the number of variables present in the chromosome. Some experiments were done using different values of w. To calculate the fitness of each chromosome, a linear model was constructed separately from the variables contained in that chromosome by multiple linear regression (MLR), and the values of SEC and SEP were calculated using this model. The descriptors selected by GA and the specifications of the constructed GA-MLR model are shown in Table 2. Since the linear model could not provide an appropriate model for the prediction of logF, we decided to use an ANN as a nonlinear feature mapping for QSAR model construction.
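To make the roles of m and w concrete, the sketch below evaluates the fitness of a few chromosomes and picks the elite individual. It reads Equation 1 in the reciprocal form Fitness = 1 / (SEC + SEP + m^w), which is an interpretation of the printed equation rather than a confirmed formula. The chromosomes, the (SEC, SEP) pairs and the value of w are hypothetical; in the actual procedure SEC and SEP would come from an MLR fit on each descriptor subset.

```python
def fitness(sec, sep, m, w):
    """Fitness of one chromosome, assuming the form 1 / (SEC + SEP + m**w)."""
    return 1.0 / (sec + sep + m ** w)


def elite(population, scores):
    """Return the fittest chromosome, which elitism copies unchanged to the next generation."""
    best = max(range(len(population)), key=lambda i: scores[i])
    return population[best]


# Hypothetical chromosomes: 1 marks a selected descriptor, 0 an unselected one.
population = [
    [1, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]
# Hypothetical (SEC, SEP) pairs standing in for the MLR results of each subset.
errors = [(0.26, 0.22), (0.24, 0.21), (0.30, 0.25)]
w = 0.5  # hypothetical weight on the number of variables

scores = [fitness(sec, sep, sum(ch), w) for ch, (sec, sep) in zip(population, errors)]
print(elite(population, scores))
```

With these toy numbers the six-variable chromosome has the smallest errors but still loses, because the m^w term penalizes model size; raising w strengthens that penalty.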

Artificial neural network
A detailed description of the theory behind neural networks has been given elsewhere [18][19][20][21][22][23][24]. In addition, we have reported some relevant principles of ANNs in our previous papers [25]. The program for the feed-forward neural network, trained by the back-propagation strategy, was written in MATLAB 7.5 in our laboratory. The descriptors appearing in the GA model were used as inputs for generation of the ANN. Therefore the number of inputs in the ANN was five and the number of nodes in the output layer was set to one. The number of nodes in the hidden layer was optimized. The initial weights were randomly selected from a uniform distribution ranging between -0.3 and +0.3. The initial bias values were set to one. These values were optimized during the training of the network. The value of each input was divided by its mean value to bring the input variables into the dynamic range of the sigmoid transfer function in the ANN. Before training, the network was optimized for the number of nodes in the hidden layer, the learning rates and the momentum. To evaluate the performance of the ANN, the standard error of calibration (SEC) and the standard error of prediction (SEP) were used [26]. Then the network was trained on the training set by the back-propagation strategy to optimize the values of the weights and biases.
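The initialization, input scaling and forward pass of a 5:3:1 network as described above can be sketched as follows. This is an illustrative Python sketch, not the authors' MATLAB program. It assumes sigmoid hidden nodes and a linear output node (the text does not specify the output transfer function), and all function names, the seed and the descriptor rows are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_network(n_in=5, n_hidden=3, seed=0):
    """Weights drawn uniformly from [-0.3, +0.3]; all biases start at one."""
    rng = random.Random(seed)  # the seed is an assumption, for reproducibility
    w_hidden = [[rng.uniform(-0.3, 0.3) for _ in range(n_in)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.3, 0.3) for _ in range(n_hidden)]
    b_hidden = [1.0] * n_hidden
    b_out = 1.0
    return w_hidden, b_hidden, w_out, b_out


def scale_by_mean(rows):
    """Divide every input column by its mean over the data set (means must be nonzero)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x / m for x, m in zip(row, means)] for row in rows]


def forward(x, net):
    """One forward pass: sigmoid hidden layer, linear output node (assumed)."""
    w_hidden, b_hidden, w_out, b_out = net
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(w_hidden, b_hidden)]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


# Two hypothetical rows of five descriptor values each.
rows = [[1.2, 0.1, 30.0, 0.2, 2.0], [1.4, 0.3, 70.0, 0.4, 4.0]]
net = init_network()
print(forward(scale_by_mean(rows)[0], net))
```

Back-propagation training would then adjust these weights and biases; that step is omitted here to keep the sketch short.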

Results and Discussion
The data set and the corresponding observed and GA-ANN predicted values of the permeation of the drugs studied in this work are shown in Table 1. A genetic algorithm was used to select the most important descriptors, which were then used as inputs for the generated ANN. The applied GA contained a population of 100 individuals, which evolved for 300 generations. The best model was then chosen by comparing the fitness values of the individuals. It can be seen from Table 2 that five descriptors appeared in the best GA-MLR model. These descriptors are: mean information content on atomic composition (AAC), R autocorrelation of lag 6/weighted by atomic polarizabilities (R6p), QXX COMMA2 value/weighted by atomic masses (QXXm), Moran autocorrelation-lag 8/weighted by atomic Sanderson electronegativities (MATS8e) and number of hydrogens attached to heteroatoms (H-050). Table 3 presents the correlation matrix of these descriptors, which shows that there is no significant correlation between them.

To inspect the relative importance and contribution of each descriptor in the model, the mean effect (ME) was calculated for each descriptor; the values are shown in the last column of Table 2 and plotted in Figure 1. By interpreting the descriptors in this model, it is possible to gain some insight into the factors that are likely related to the permeation of drugs.

Figure 1. Plot of the mean effects of descriptors

The first descriptor, a topological one, is the mean information content on atomic composition (AAC). This descriptor is the mean value of the total information content and is calculated as:

$$\mathrm{AAC} = \bar{I}_{AC} = -\sum_{g} \frac{A_g}{A^h} \log_2 \frac{A_g}{A^h} = -\sum_{g} p_g \log_2 p_g$$

where $A^h$ is the total number of atoms (hydrogen included), $A_g$ is the number of equal-type atoms in the g-th equivalence class, and $p_g$ is the probability of randomly selecting a g-th type atom [27]. The positive value of the mean effect for AAC (+1.550) in the GA-MLR model indicates that this descriptor contributes positively to the logF of drugs. For example, the values of AAC for tacrine and alprenolol are 1.29 and 1.34, respectively, and their logF values are 1.73 and 1.81, respectively.
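The AAC values quoted for tacrine and alprenolol can be checked with a short calculation. The sketch below assumes that the equivalence classes are simply the chemical elements and that AAC is the Shannon entropy (base 2) of the atomic composition, i.e. the negative sum of p_g log2 p_g over elements; the molecular formulas C13H14N2 (tacrine) and C15H23NO2 (alprenolol) are supplied here and are not taken from the text. The function name is illustrative.

```python
import math


def mean_information_content(atom_counts):
    """Mean information content on atomic composition: -sum(p_g * log2(p_g))."""
    total = sum(atom_counts.values())  # A^h, hydrogens included
    return -sum((n / total) * math.log2(n / total) for n in atom_counts.values())


tacrine = {"C": 13, "H": 14, "N": 2}             # C13H14N2
alprenolol = {"C": 15, "H": 23, "N": 1, "O": 2}  # C15H23NO2

print(round(mean_information_content(tacrine), 2))     # 1.29
print(round(mean_information_content(alprenolol), 2))  # 1.34
```

Both results agree with the values reported above, which supports reading AAC as an entropy over element counts.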
The second descriptor is the R autocorrelation of lag 6/weighted by atomic polarizabilities (R6p). This descriptor is one of the GETAWAY (GEometry, Topology and Atom-Weights AssemblY) type descriptors. These molecular descriptors are based on the autocorrelation function $AC_l$, defined as:

$$AC_l = \int_a^b f(x)\, f(x+l)\, dx$$

where f(x) is any function of the variable x and l is the lag representing an interval of x; a and b define the total studied interval of the function. A property of the autocorrelation function is that it does not change when the origin of the x variable is shifted. To obtain spatial autocorrelation molecular descriptors, the function $f(x_i)$ is any physicochemical property calculated for each atom of the molecule, here the polarizability [28]. As the polarizability, and hence this index, increases, logF decreases, which is consistent with the calculated mean effect of this descriptor (-0.811). For example, the values of R6p for coumarin and suprofen are 0.09 and 0.27 and the logF values for these molecules are 1.95 and 1.12, respectively.
The next descriptor is QXXm (QXX COMMA2 value/weighted by atomic masses), a descriptor from the comparative molecular moment analysis (CoMMA) method, which is based on the 3D geometrical representation of the molecule. This method calculates different molecular moments with respect to the center of mass, center of charge and center of dipole of the molecule [29]. The quadrupole component QXX is calculated with respect to a translated initial reference frame whose origin coincides with the center of dipole [30]. As the dipole of the molecule increases, the value of this descriptor increases. In accordance with the positive mean effect of this descriptor (+0.275), logF increases with increasing magnitude of the quadrupole moment. For example, the values of QXXm for alprenolol and atenolol are 70.23 and 29.56 and the logF values for these molecules are 1.81 and 1.04, respectively.
The next descriptor is MATS8e, the Moran autocorrelation-lag 8/weighted by atomic Sanderson electronegativities, which is a 2D autocorrelation descriptor. The general Moran index of spatial autocorrelation can be defined as:

$$I(d) = \frac{\frac{1}{\Delta}\sum_{i=1}^{A}\sum_{j=1}^{A}\delta_{ij}\,(w_i-\bar{w})(w_j-\bar{w})}{\frac{1}{A}\sum_{i=1}^{A}(w_i-\bar{w})^2}$$

where $w_i$ is any atomic property, $\bar{w}$ is its average value over the molecule, A is the number of atoms, d is the considered topological distance (i.e. the lag in autocorrelation terms) and $\delta_{ij}$ is a Kronecker delta ($\delta_{ij} = 1$ if $d_{ij} = d$, zero otherwise). $\Delta$ is the sum of the Kronecker deltas, i.e. the number of vertex pairs at a distance equal to d [36]. The Moran coefficient usually takes a value in the interval [-1, +1]. Positive autocorrelation corresponds to positive values of the coefficient, whereas negative autocorrelation produces negative values. In accordance with the negative mean effect of this descriptor (-0.015), logF decreases as the Moran coefficient increases. For example, the values of MATS8e for naproxen and chlorthalidone are -0.15 and 0.33 and the logF values for these molecules are 1.95 and 0.79, respectively.
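A Moran coefficient at a given topological lag can be computed on a small graph as sketched below. The sketch follows the usual definition, the mean product of property deviations over atom pairs at distance d divided by the property variance; returning 0.0 when no atom pair lies at the requested lag is a convention chosen here, not something stated in the text. The four-atom chain and its property values are hypothetical.

```python
from collections import deque


def topological_distances(adjacency):
    """All-pairs shortest path lengths (in bonds) by breadth-first search."""
    n = len(adjacency)
    dist = [[None] * n for _ in range(n)]
    for start in range(n):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in adjacency[i]:
                if dist[start][j] is None:
                    dist[start][j] = dist[start][i] + 1
                    queue.append(j)
    return dist


def moran(adjacency, w, lag):
    """Moran autocorrelation of atomic property w at topological distance lag (>= 1)."""
    n = len(w)
    mean = sum(w) / n
    dev = [x - mean for x in w]
    dist = topological_distances(adjacency)
    pairs = [(i, j) for i in range(n) for j in range(n) if dist[i][j] == lag]
    if not pairs:
        return 0.0  # convention assumed here for lags longer than the molecule
    numerator = sum(dev[i] * dev[j] for i, j in pairs) / len(pairs)
    denominator = sum(d * d for d in dev) / n  # zero if all atoms share one value
    return numerator / denominator


# Hypothetical linear chain of four atoms with an alternating property.
chain = [[1], [0, 2], [1, 3], [2]]
values = [1.0, -1.0, 1.0, -1.0]
print(moran(chain, values, 1), moran(chain, values, 2))  # -1.0 1.0
```

Neighbouring atoms in the toy chain carry opposite values, so the lag-1 coefficient reaches -1, while atoms two bonds apart carry equal values and give +1.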
The last descriptor is H-050 (number of hydrogens attached to heteroatoms). The atom-centered fragment code is a short-range atom-centered code that describes each atom by its own atom type and by the bond types and atom types of its first neighbors. Functionalities in a molecule can be represented by two to five atoms (corresponding to one to four bonds), which consist of a central atom and its neighboring bonded atoms [31][32][33][34]. Each fragment is represented by a single-value atom-centered fragment descriptor. All atom-centered fragment descriptors representing all fragments in the data set molecules are recorded in an arbitrary but fixed order in a uniform-length multidimensional vector. The use of these substructure descriptors greatly increases the specific chemical information regarding different functional groups, but cannot discriminate between different arrangements of functional groups within a molecule [31]. In general, substructure descriptors are counts of predefined structural features in the molecules or binary variables specifying their presence or absence [35]. As the number of hydrogens attached to heteroatoms increases, the value of H-050 increases and therefore the value of logF increases. For example, the values of H-050 for chlorpropamide and dapsone are 2 and 4 and the logF values for these molecules are 1.65 and 1.73, respectively.
From the above discussion, it can be seen that all descriptors in the QSAR model have chemical meaning and can account for the structural features that affect the permeation of the drugs of interest.
The next step was the construction of the artificial neural network. The selected descriptors were used as inputs for the generated ANNs. Before training the network, the number of nodes in the hidden layer, the weight and bias learning rates and the momentum values were optimized. Procedures for the optimization of these parameters were reported in our previous papers [37]. Table 4 shows the architecture and specification of the optimized network. After optimization of the ANN parameters, the network was trained to adjust the values of the weights and biases. Then the trained network was used to evaluate the logF values for the molecules in the validation set. The ANN and GA-MLR predicted values of logF for all molecules in the data set are shown in Table 1. The statistical parameters obtained for the GA-MLR and ANN models are shown in Table 5. The standard errors of the ANN model for the training, test and validation sets are 0.17, 0.028 and 0.15, respectively, which can be compared with the values of 0.26, 0.051 and 0.22, respectively, for the GA-MLR model. Figures 2a and 2b show plots of the GA-MLR and ANN calculated values versus the experimental values of logF for the data set molecules. Correlation coefficients (R) of 0.858 and 0.906 for these plots confirm the suitability of the GA-MLR and ANN models, respectively, for the prediction of logF. Comparison between these values and the other statistical parameters in Table 5 reveals that the nonlinear ANN model produced better results, with better predictive ability, than the linear model. The residuals of the GA-ANN calculated values of logF are plotted against their experimental values in Figure 3. The scatter of the residuals on both sides of the zero line indicates that no systematic error exists in the development of the neural network. Acetaminophen, acyclovir and chlorpropamide have high residual values; Figure 3 shows that the predicted values for these molecules are not very good. Examination of the structures of these molecules shows that acetaminophen has a very simple structure (compared with the training set), acyclovir contains nitrogen-bearing five- and six-membered rings, and chlorpropamide is an amide bonded to chlorine- and sulfur-containing groups, so that their structures differ from those of the other molecules. The overall root-mean-square error for the GA-ANN model was 0.17, which was less than those obtained by Gramatica et al. [38] and by the GA-MLR model (0.26). The results reveal that there is some nonlinear relation between the permeation of drugs and the selected structural molecular descriptors.
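The statistics used in this comparison (correlation coefficient, root-mean-square error and residuals) can be computed as in the sketch below. The exact standard-error formula behind Table 5 is not given in the text, so the plain root-mean-square error is used here as a stand-in, and the sign convention for the residuals (calculated minus experimental) is likewise an assumption. The logF values in the example are hypothetical, not taken from Table 1.

```python
import math


def correlation_coefficient(observed, calculated):
    """Pearson R between experimental and calculated values."""
    n = len(observed)
    mo, mc = sum(observed) / n, sum(calculated) / n
    soc = sum((o - mo) * (c - mc) for o, c in zip(observed, calculated))
    soo = sum((o - mo) ** 2 for o in observed)
    scc = sum((c - mc) ** 2 for c in calculated)
    return soc / math.sqrt(soo * scc)


def rmse(observed, calculated):
    """Root-mean-square error of the calculated values."""
    n = len(observed)
    return math.sqrt(sum((c - o) ** 2 for o, c in zip(observed, calculated)) / n)


def residuals(observed, calculated):
    """Calculated minus experimental value for each compound (assumed convention)."""
    return [c - o for o, c in zip(observed, calculated)]


# Hypothetical experimental and calculated logF values.
log_f_exp = [1.0, 2.0, 3.0]
log_f_cal = [1.5, 2.5, 3.5]
print(correlation_coefficient(log_f_exp, log_f_cal), rmse(log_f_exp, log_f_cal))
```

The toy data show why both statistics are reported: a constant offset leaves R at 1 while the error is 0.5, and residuals that all fall on one side of zero are the signature of a systematic error.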

Conclusion
In the present study, genetic algorithm techniques were combined with artificial neural network approaches to develop a QSAR model for the prediction of the permeation of drugs by PAMPA. The effectiveness of the evolutionary algorithm is demonstrated by its selection of the best set of descriptors. The key strength of neural networks is their ability to allow flexible mapping of the selected features by manipulating their functional dependence implicitly, unlike regression analysis. A neural network handles both linear and nonlinear relationships without adding complexity to the model. The results suggest that a small number of chemically meaningful descriptors provides the most predictive QSPR. The statistical results showed that the best model was GA-ANN, which combines a genetic algorithm as the variable selection technique with an artificial neural network as the feature mapping method. From the analysis, we can conclude that the topological, geometrical, GETAWAY, 2D autocorrelation and atom-centered fragment descriptors have an overall good modeling capability, proving their usefulness in QSPR studies. These descriptors contain local or distributed information on molecular structure, so in most cases more than one type of descriptor is needed to reach an acceptable modeling power. Finally, the descriptors appearing in these QSPR models are related to different molecular properties that can participate in the permeation of drugs by PAMPA.

Figure 2. Plot of the (a) GA-MLR and (b) GA-ANN calculated logF against experimental values for all drugs in the data set.

Figure 3. Plot of the residuals versus experimental values of logF for all molecules in the data set

Table 1. Data set and corresponding observed, GA-MLR and GA-ANN predicted values of logF

Table 2. Specification of GA-MLR models

Table 3. Correlation matrix for the descriptors used in this work

Table 4. Architecture and specification of the generated ANNs

Table 5. Comparison of the statistical parameters obtained using the ANN and MLR models a
a F is the statistical F value, R is the correlation coefficient and S.E. is the standard error of the model; c refers to the calibration (training) set; t refers to the test set; v refers to the validation set.