
Multiple linear regression analysis is widely used to link an outcome with predictors in order to better understand the behaviour of the outcome of interest. Usually, under the assumption that the errors follow a normal distribution, the coefficients of the model are estimated by minimizing the sum of squared deviations. A new approach based on maximum likelihood estimation is proposed for finding the coefficients of linear models with two predictors without any restrictive assumptions on the distribution of the errors. The algorithm was developed, implemented, and tested as a proof of concept using fourteen sets of compounds by investigating the link between activity/property (as outcome) and structural feature information incorporated by molecular descriptors (as predictors). The results on real data demonstrated that, in all investigated cases, the power of the error is significantly different from the conventional value of two when the Gauss-Laplace distribution was used to relax the restrictive assumption of normally distributed errors. Therefore, the Gauss-Laplace distribution of the error could not be rejected, while the hypothesis that the power of the error from the Gauss-Laplace distribution is normally distributed also failed to be rejected.

The first report on multiple linear regression appeared in 1885 [

In his first published paper, Fisher introduces the method of likelihood maximization [

A multiple linear regression model involves more than two variables, one (

The least squares method is the standard approach for regression analysis, the method being credited to Legendre [

Iteratively applying local quadratic approximation to the likelihood (through the Fisher information [

Generalized Gauss-Laplace distribution is the natural extension [

A more general result regarding the maximum likelihood estimation can be found in [

The problem of estimating the parameters of a multiple linear regression under assumption of generalized Gauss-Laplace distribution of the error is a hard problem which can be solved only numerically and it involves an optimization problem with

To provide a proof of concept for the proposed method of relaxing the distribution of the error when linear regression is used to link chemical information and biological measurements, ten previously reported datasets were considered, all with a significant role in human medicine or ecology.

One may define the generalized Gauss-Laplace (GL) distribution as

This definition will be used here for the Gauss-Laplace distribution to relax the normal distributed constraint for the distribution of the error (
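As a minimal sketch, assuming the common parameterization of the generalized Gauss-Laplace density by mean, standard deviation, and power, the density can be evaluated as shown here. The function name `gl_pdf` and the argument names `mu`, `sigma`, and `q` are illustrative choices, not notation taken from the text. Under this assumed form, `q = 2` gives the normal density and `q = 1` gives the Laplace density, both with standard deviation `sigma`.

```python
import math


def gl_pdf(x, mu=0.0, sigma=1.0, q=2.0):
    """Density of a generalized Gauss-Laplace law (assumed parameterization).

    mu    -- location (mean)
    sigma -- standard deviation, must be positive
    q     -- power of the error, must be positive (2 = normal, 1 = Laplace)
    """
    if sigma <= 0.0 or q <= 0.0:
        raise ValueError("sigma and q must be positive")
    # c0 rescales the argument so that sigma is the standard deviation.
    c0 = math.sqrt(math.gamma(3.0 / q) / math.gamma(1.0 / q))
    # c1 normalizes the density so that it integrates to one.
    c1 = q * c0 / (2.0 * math.gamma(1.0 / q))
    return c1 / sigma * math.exp(-abs(c0 * (x - mu) / sigma) ** q)
```

A single function covering both special cases makes it easy to check an implementation against the familiar normal and Laplace formulas before the power is left free.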

Multiple linear regressions under assumption of GL distribution (see (

The case with intercept (

Let us take a sample of

Doing the substitution given in (

The likelihood is at maximum when all its partial derivatives are zero:

Finding the maximum of the likelihood is a typical extremum problem, but it is not easy to solve because it depends on a large number of variables. The easiest way is to eliminate one variable, namely,
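To illustrate how one variable can be eliminated, the following sketch assumes that the eliminated variable is the scale of the error and that the density has the standard-deviation parameterization written on the first line. The symbols $e_i$ (residuals), $a_0, a_1, a_2$ (coefficients), $c_0$, and $c_1$ are notation introduced here for illustration and may differ from the notation of the original equations.

```latex
\begin{align*}
\mathrm{GL}(x;\mu,\sigma,q) &= \frac{c_1}{\sigma}
  \exp\!\left(-\left|\frac{c_0\,(x-\mu)}{\sigma}\right|^{q}\right),
  \quad c_0 = \left(\frac{\Gamma(3/q)}{\Gamma(1/q)}\right)^{1/2},
  \quad c_1 = \frac{q\,c_0}{2\,\Gamma(1/q)} \\
e_i &= y_i - \left(a_0 + a_1 x_{1,i} + a_2 x_{2,i}\right), \qquad i = 1,\dots,n \\
\ln L &= n\ln c_1 - n\ln\sigma
  - \left(\frac{c_0}{\sigma}\right)^{q}\sum_{i=1}^{n}|e_i|^{q} \\
\frac{\partial \ln L}{\partial \sigma} &= -\frac{n}{\sigma}
  + q\,c_0^{q}\,\sigma^{-q-1}\sum_{i=1}^{n}|e_i|^{q} = 0
  \;\Longrightarrow\;
  \sigma^{q} = \frac{q\,c_0^{q}}{n}\sum_{i=1}^{n}|e_i|^{q} \\
\ln L\,\Big|_{\sigma\ \text{eliminated}} &= n\left[\ln q - \ln 2 - \ln\Gamma(1/q)
  - \frac{1}{q}\left(1 + \ln\!\left(\frac{q}{n}\sum_{i=1}^{n}|e_i|^{q}\right)\right)\right]
\end{align*}
```

For $q = 2$ the fourth line reduces to $\sigma^2 = \frac{1}{n}\sum e_i^2$, the usual maximum likelihood estimate of the variance under normal errors, which is a quick consistency check on the derivation.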

Please note that

On the other hand, only

At this point only the expression of the likelihood function (see (

A convenient notation was used in (

Some obstacles stand in the way of a smooth application of fixed-point theory. One of them is that obtaining the maximum of the LMLRGLS function (see (

start from some initial values of the regression coefficients (

use initial values to obtain the likelihood function LMLRGLS (from (

find the maximum (let this be

prepare starting of a loop on

it is possible, especially at the beginning of the iteration (when

do a loop (

repeat until (
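The steps above can be sketched as an alternating scheme, with loud caveats. The sketch below is not the fixed-point functional of the text: it swaps in a golden-section search for the power `q` and a damped, safeguarded iteratively reweighted least squares step for the coefficients. All names (`wls2`, `profile_loglik`, `argmax_golden`, `fit_gl`), the search interval `[q_lo, q_hi]`, and the tolerances are hypothetical. The likelihood used assumes a generalized Gauss-Laplace density with the scale already eliminated.

```python
import math


def wls2(x1, x2, y, w):
    """Weighted least squares for y ~ a0 + a1*x1 + a2*x2 (closed form)."""
    sw = sum(w)

    def mean(v):
        return sum(wi * vi for wi, vi in zip(w, v)) / sw

    def cross(u, mu, v, mv):
        return sum(wi * (ui - mu) * (vi - mv) for wi, ui, vi in zip(w, u, v))

    m1, m2, my = mean(x1), mean(x2), mean(y)
    s11, s22, s12 = cross(x1, m1, x1, m1), cross(x2, m2, x2, m2), cross(x1, m1, x2, m2)
    s1y, s2y = cross(x1, m1, y, my), cross(x2, m2, y, my)
    det = s11 * s22 - s12 * s12  # zero when the two predictors are collinear
    a1 = (s22 * s1y - s12 * s2y) / det
    a2 = (s11 * s2y - s12 * s1y) / det
    return (my - a1 * m1 - a2 * m2, a1, a2)


def residuals(coef, x1, x2, y):
    a0, a1, a2 = coef
    return [t - (a0 + a1 * u + a2 * v) for u, v, t in zip(x1, x2, y)]


def profile_loglik(res, q):
    """Assumed log-likelihood with the scale eliminated (needs a nonzero residual)."""
    n = len(res)
    s = sum(abs(e) ** q for e in res)
    return n * (math.log(q) - math.log(2.0) - math.lgamma(1.0 / q)
                - (1.0 + math.log(q * s / n)) / q)


def argmax_golden(f, lo, hi, tol=1e-6):
    """Golden-section search for a maximum of f on [lo, hi] (assumes one peak)."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:  # the maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:        # the maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2.0


def fit_gl(x1, x2, y, q_lo=1.0, q_hi=8.0, tol=1e-8, max_iter=200):
    coef = wls2(x1, x2, y, [1.0] * len(y))  # start from classical least squares
    q, prev, cur = 2.0, -math.inf, -math.inf
    for _ in range(max_iter):
        res = residuals(coef, x1, x2, y)
        # Step 1: update q for the current residuals, keeping the old q
        # if the search result does not improve the likelihood.
        q_new = argmax_golden(lambda t: profile_loglik(res, t), q_lo, q_hi)
        if profile_loglik(res, q_new) >= profile_loglik(res, q):
            q = q_new
        # Step 2: one damped reweighted least squares step at fixed q.
        w = [max(abs(e), 1e-8) ** (q - 2.0) for e in res]
        target = wls2(x1, x2, y, w)
        base = sum(abs(e) ** q for e in res)
        step = 1.0 if q <= 2.0 else 1.0 / (q - 1.0)
        while step > 1e-6:  # halve the step until sum |e|^q does not increase
            trial = tuple(c + step * (t - c) for c, t in zip(coef, target))
            if sum(abs(e) ** q for e in residuals(trial, x1, x2, y)) <= base:
                coef = trial
                break
            step /= 2.0
        cur = profile_loglik(residuals(coef, x1, x2, y), q)
        if abs(cur - prev) < tol:
            break
        prev = cur
    return coef, q, cur
```

Because each half-step is accepted only when it does not lower the likelihood, the returned value is never worse than the least squares starting point evaluated at `q = 2`. The sketch offers no guarantee of reaching the global maximum.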

On arriving at the stationary point, all criteria for maximizing the likelihood are met; namely, the equations obtained by setting all derivatives to zero are satisfied. The great advantage of the proposed method is that it reduces the problem of finding the maximum of a function with

The disadvantage is that the iteration proceeds through a contracting functional whose contraction cannot always be guaranteed. This is why different strategies exist for finding such a contracting functional (see example 6.1 in [

Some calculations are the same regardless of the strategy used and are given next as Algorithms


One strategy is to use the equations obtained by setting the derivatives with respect to the regression coefficients to zero (see (


Another strategy that is required to be specified is that if (


The contingency of 2 × 2 strategies given above was tested on sampled data (see Section


In order to assure the numerical stability of the calculations, Algorithm

Therefore, in all scenarios, the initial (starting) values of the estimates to be determined will be those given by the classical multiple linear regression model, as follows:
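As a minimal sketch of such a starting point, the ordinary least squares estimates for a model with an intercept and two predictors can be obtained in closed form from the centered normal equations. The function names `ols2` and `r_squared` and the coefficient names `a0`, `a1`, `a2` are illustrative and may not match the notation of the original formulas.

```python
def ols2(x1, x2, y):
    """Ordinary least squares for y ~ a0 + a1*x1 + a2*x2."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    # Centered sums of squares and cross-products.
    s11 = sum((u - m1) ** 2 for u in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    s1y = sum((u - m1) * (t - my) for u, t in zip(x1, y))
    s2y = sum((v - m2) * (t - my) for v, t in zip(x2, y))
    det = s11 * s22 - s12 * s12  # zero when the two predictors are collinear
    if det == 0.0:
        raise ValueError("predictors are collinear")
    a1 = (s22 * s1y - s12 * s2y) / det
    a2 = (s11 * s2y - s12 * s1y) / det
    a0 = my - a1 * m1 - a2 * m2
    return a0, a1, a2


def r_squared(coef, x1, x2, y):
    """Determination coefficient of the fitted model."""
    a0, a1, a2 = coef
    my = sum(y) / len(y)
    sse = sum((t - (a0 + a1 * u + a2 * v)) ** 2 for u, v, t in zip(x1, x2, y))
    sst = sum((t - my) ** 2 for t in y)
    return 1.0 - sse / sst
```

For example, `ols2([0, 1, 2, 3, 4], [1, 0, 2, 1, 3], [-2, 3, -1, 4, 0])` recovers, up to rounding, the coefficients of the exact relation y = 1 + 2·x1 − 3·x2 used to build that data.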

Ten datasets of chemical compounds with different sample sizes (Table

Dataset characteristics.

Set | Sample size | Class | Property/activity | Reference |
---|---|---|---|---|
1 | 132 | Estrogens | Estrogen binding affinity— | [ |
2 | 37 | Carboquinone derivatives | Minimum effective dose (MED)— | [ |
3 | 33 | Organic pollutants | Oxidative degradation— | [ |
4 | 97 | Benzotriazoles | Fish toxicity—pEC_{50} | [ |
5 | 136 | Thiophene and imidazopyridine derivatives | Inhibition of polo-like kinase 1—pIC_{50} | [ |
6 | 14 | Substituted phenylaminoethanones | Average antimicrobial activity—pMICam | [ |
7 | 110 | Acetylcholinesterase inhibitors | Inhibition activity—pIC_{50} | [ |
8 | 107 | Polychlorinated biphenyl ethers | 298 K supercooled liquid vapor pressures— | [ |
9 | 107 | Polychlorinated biphenyl ethers | Aqueous solubility— | [ |
10 | 47 | Para-substituted aromatic sulphonamides | Carbonic anhydrase II inhibitors— | [ |

For all datasets, the experimental values of the dependent variable (

Reported bivariate models.

Set | Model under assumption of normal errors | Determination coefficient |
---|---|---|
1 | | 0.3976 |
2 | | 0.7700 |
3 | | 0.6859 |
4 | | 0.7161 |
5 | | 0.5101 |
6 | | 0.8357 |
7 | | 0.6838 |
8 | | 0.9880 |
9 | | 0.9619 |
10 | | 0.7058 |

Different descriptors (independent variables) were used to explain the activity/property of interest on models presented in Table

TIE: state topological parameter [

TIC1: total information content index (neighborhood symmetry of 1) [

IHDMkMg and IHDDFMg: MDF descriptors [

SAG: molecular surface area grid;

TPSA(NO): topological polar surface area expressed by nitrogen and oxygen contributions; Aeigm: Dragon descriptor [

RDF035m: radial distribution function on a spherical volume of a 3.5 Å radius weighted by atomic mass; small-RSI-mol: the smallest value of atomic steric influence in a molecule [

nR10: number of 10-membered rings; N-070: number of Ar-NH-Al fragments [

FNSA1: fractional partial positive surface area 1, PPSA1/TMSA, where PPSA = partial positive surface area and TMSA = total molecular surface area.

All sets subjected to analysis converged maximizing the likelihood and Table

Differences between values of coefficients obtained by classical linear regression approach compared to the proposed approach.


The results presented in Table

The power of the error follows different patterns according to the model, decrease-fluctuation-plateau (set 1, Figure

Evolution of the power of the errors (

A question (hypothesis) can be raised about the power of the error:

The proposed algorithm (Algorithm

The authors declare that there is no conflict of interests regarding the publication of this paper.