An Extension of the Quadratic Error Function for Learning Imprecise Data in Multivariate Nonlinear Regression



Introduction
Let X and Y be, respectively, R^p and R^q valued random vectors defined on the same probability space (Ω, F, P), and let (X_t, Y_t)_{t∈N} be independent identically distributed (i.i.d.) replications of (X, Y). One seeks to build the best estimator of the regression function E(Y | X); its bias and variance must be conjointly minimized, according to the bias-variance trade-off, to quantify the accuracy of the estimator. Let S = {F_θ(X), θ ∈ Θ, X ∈ R^p} be a parametric, not necessarily linear, regression model family with Y = F_θ(X) + ε, where F_θ: R^p → R^q is a continuous function of θ for fixed X and measurable in X for every fixed θ, and Θ is a compact (i.e., closed and bounded) subset of the set of possible parameters of the regression model family. We assume that D_n = {(x_1, y_1), ..., (x_n, y_n)} is an observed dataset coming from a true model (X_t, Y_t)_{t∈N} whose true regression function is F_{θ_0}, for a θ_0 (possibly not unique) in the interior of Θ. Nonlinear regression is often used to model empirical data for forecasting purposes, where the relationships underlying the measurements can be strongly nonlinear, nonunivocal, noisy, and dynamic in nature [1][2][3][4][5][6]. Nonlinear regression techniques such as projection-pursuit regression, support vector machines, and multilayer perceptrons are applied in various fields [7][8][9][10]. For instance, in agronomy, crop yield can be predicted with Y = {agricultural yields, insurance policy} and X = {agronomic, edaphic, climatic, and economic parameters}; in climatology, long-wave atmospheric radiation can be estimated with Y = {heating rate profile, radiation fluxes} and X = {temperature, moisture, O3, CO2, cloud parameter profiles, surface fluxes, etc.}; and in finance, financial characteristics can be estimated with Y = {future stock index, exchange rate, option price, etc.} and X = {financial data up to a certain time}.
Estimation of the parameters is done using learning techniques. It consists in estimating the true parameter θ_0 based on the observations D_n. This can be done by minimizing the mean square error (MSE) function [11]:

MSE(θ) = (1/n) Σ_{t=1}^{n} ‖y_t − F_θ(x_t)‖².

The minimization of this quadratic error function leads to a biased estimate of θ_0, under the assumption ε ∼ N(0, σ²I_q), when D_n is imprecise. To address this misspecification of the quadratic error function, Badran and Thiria [1] proposed another function, assuming ε ∼ N(0, Σ(x)) and using a Bayesian approach. This hypothesis is arbitrary, if not false, given the nature of the data in some fields of application [12]. In this work, we propose an extension of the results of Badran and Thiria [1] using new hypotheses on ε. One of these hypotheses is that ε follows the multivariate skew-normal distribution. This distribution refers to a parametric class of probability density functions (p.d.f.) that extends the normal distribution with an additional shape parameter regulating the skewness, allowing a continuous variation from normality to nonnormality. In addition, we compare the newly established error function with that of Badran and Thiria [1] for an arbitrary covariance matrix, in particular when the matrix is diagonal with its diagonal elements constant on the one hand and nonidentical on the other hand. This comparison is made on simulated and real data in order to evaluate the performance of the proposed error function.
In Section 2 of the paper, we recall some existing concepts. In Section 3, we establish the main results. Section 4 presents a simulation study and an application to a real dataset. The conclusion is given in Section 5. The proofs of the main results are presented in the Appendix.

Imprecise Data and Multivariate Skew-Normal Distribution
Definition 1. Imprecise data are data inconsistent in the way they are collected and/or recorded (rounding, grouping, etc.). They are data exhibiting asymmetry or showing a light or heavy tail in their distribution.

Definition 2.
A random multivariate variable ε = (ε_1, ..., ε_q)⊤ of dimension q ∈ N* is said to follow a multivariate skew-normal distribution, denoted ε ∼ SN(0, Ω, α), if it is continuous and its density function is given by (Azzalini and Dalla Valle [13])

f_ε(z) = 2 ϕ_q(z; Ω) Φ(α⊤ z),  z ∈ R^q,   (1)

where ϕ_q(·; ·) is the density function of a q-dimensional multivariate normal distribution with standardized marginals and correlation matrix Ω, Φ(·) is the cumulative distribution function of the standard normal distribution, and α is a q-dimensional vector that regulates the departure from symmetry. Consider the variance-covariance matrix Σ = D_{1/σ}⁻¹ Ω D_{1/σ}⁻¹ with D_{1/σ} = diag(1/σ_1, ..., 1/σ_q), the diagonal matrix of dimension q of the inverses of the standard deviations. The expression of the probability density function of ε is then

f_ε(z) = 2 (2π)^{−q/2} |Σ|^{−1/2} exp(−(1/2) z⊤ Σ⁻¹ z) Φ(α⊤ z),   (2)

where |Σ| denotes the determinant of the matrix Σ. For more details, see Azzalini and Dalla Valle [13].
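As a numerical illustration of (2), the following R sketch evaluates the density at a single point, assuming the `sn` package and arbitrary illustrative values of Σ, α, and z (none of them taken from the paper); for a unit-diagonal Σ, the closed-form expression above coincides with the parameterization used by `dmsn`.

```r
# Minimal sketch: evaluate the density (2) of Definition 2 at one point.
# Sigma, alpha and z are illustrative values, not taken from the paper.
library(sn)

q     <- 2
Sigma <- matrix(c(1, 0.5, 0.5, 1), q, q)  # assumed unit-diagonal covariance matrix
alpha <- c(-4, 4)                         # assumed shape (skewness) vector
z     <- c(0.3, -0.1)

# Closed-form density: 2 * N_q(z; 0, Sigma) * Phi(alpha' z)
f_manual <- 2 * (2 * pi)^(-q / 2) * det(Sigma)^(-1 / 2) *
  exp(-0.5 * t(z) %*% solve(Sigma) %*% z) * pnorm(sum(alpha * z))

# Same density through the sn package (unit-diagonal Sigma => same parameterization)
f_sn <- dmsn(z, xi = rep(0, q), Omega = Sigma, alpha = alpha)

c(manual = as.numeric(f_manual), sn = as.numeric(f_sn))
```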

Bayes Approach.
Suppose that x⊤ = (x_1, ..., x_n) is a vector of n observations whose probability density function p(x, θ) depends on the values of the k-dimensional parameter θ = (θ_1, ..., θ_k)⊤. Suppose also that θ itself has a probability density function π(θ). Then, denoting by p(x) and p(x | θ) the marginal probability density function of x and the conditional probability density function of x given θ, respectively, one has

p(x) = ∫ p(x | θ) π(θ) dθ,

so that, given the observed data x, the conditional probability density function of θ when p(x) ≠ 0 is

p(θ | x) = p(x | θ) π(θ) / p(x).

Therefore, the Bayes formula in (5) is often written as

p(θ | x) ∝ p(x | θ) π(θ).

The prior distribution of θ, π(θ) in (5), is the probability distribution of θ without knowledge of x. In contrast, the posterior distribution of θ given x, p(θ | x) in (5), is the probability distribution of θ given x.
A Bayesian statistical model is built from a parametric statistical model p(x | θ) and a prior distribution π(θ) on the parameter.

Learning.
Learning consists in estimating the parameter θ of the regression function F_θ := E(Y | X = x) that minimizes the loss function R(θ) = E(L(θ, x, y)). A simple expression of this function is the generalized least squares (GLS) error function given by

R(θ) = ∫ (y − F_θ(x))⊤ Σ⁻¹(x) (y − F_θ(x)) dP(x, y),   (6)

where dP(x, y) = p(x, y) dx dy with p(x, y) the joint probability density function of (X, Y), and Σ⁻¹(x) is the inverse of the covariance matrix Σ(x) of the conditional random variable Y | X = x, which is a symmetric positive definite matrix of order q.

Existence of the Minimum of the GLS Error Function.
The following result provides the solution of the minimization problem of (6).
Theorem 1. Let F_θ(x) ∈ R^q be the output vector of the regression function, and let E(Y | X = x) be the conditional mean vector of the observations y ∈ R^q. For any value of (x, Σ(x)), argmin_θ R(θ) exists and is given by

argmin_θ R(θ) = argmin_θ ∫ (F_θ(x) − E(Y | X = x))⊤ Σ⁻¹(x) (F_θ(x) − E(Y | X = x)) p(x) dx.   (7)

The proof of Theorem 1 is given in the Appendix. However, θ cannot be obtained analytically from (7) because Σ⁻¹(x) and the marginal probability density function p(x) are unknown. We consider the case where Σ⁻¹(x) is known and does not depend on x.
Thus, the solution of the minimization problem of (6) leads to the following theorem.

Theorem 2. Let U = (u_1, ..., u_q) be the orthogonal matrix of the eigenvectors of Σ(x) and σ_k² the eigenvalue associated with u_k. Thus, we have from (6) the existence of argmin_θ R(θ), given by

argmin_θ R(θ) = argmin_θ Σ_{k=1}^{q} (1/σ_k²) ∫ (u_k⊤ (F_θ(x) − E(Y | X = x)))² p(x) dx.   (8)

The proof of Theorem 2 is given in the Appendix.

Remark 1.
Under the conditions of Theorem 1 or 2, the minimum of problem (6) with respect to θ is reached when F_θ(x) = E(Y | X = x). The argument minimizing the theoretical cost function R given by (6) with respect to θ is difficult to obtain because p(x) and Σ(x) are unknown. However, we can obtain a finite set of independent observations D_n = {(y_t, x_t), t = 1, ..., n}, n ∈ N, which allows us to define a discrete approximation of the theoretical cost function.
This approximation is obtained by applying the optimization procedure to the empirical cost function R_n(θ, x, y) given by

R_n(θ, x, y) = (1/n) Σ_{t=1}^{n} R_t(θ),   (9)

with

R_t(θ) = (y_t − F_θ(x_t))⊤ Σ⁻¹(x_t) (y_t − F_θ(x_t)),   (10)

which is a discrete approximation of the theoretical cost function R given by (6).
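To fix ideas, here is a minimal R sketch of the empirical cost (9)-(10) for a known constant covariance matrix; `F_theta` is a hypothetical placeholder with signature `F_theta(x, theta)` returning a length-q vector, not a function defined in the paper.

```r
# Minimal sketch of the empirical GLS cost R_n(theta) of (9)-(10):
# the mean over t of (y_t - F_theta(x_t))' Sigma^{-1} (y_t - F_theta(x_t)).
gls_cost <- function(theta, X, Y, F_theta, Sigma) {
  Sigma_inv <- solve(Sigma)
  resid <- Y - t(apply(X, 1, F_theta, theta = theta))  # n x q residual matrix
  mean(rowSums((resid %*% Sigma_inv) * resid))         # quadratic form averaged over t
}
```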
In the next subsection, we use a Bayesian approach to take into account the uncertainties appearing during the regression process, by considering some hypotheses linked to the nature of ε. In this case, the maximum likelihood model provides an extension of the empirical cost function (9).

Extension of GLS considering the Output Noise.
The results considered below are established under the following assumptions related to the random noise ε_t.

A3. x_t is not spoiled and does not depend on θ.

Theorem 3. Assume that conditions A1-A3 hold. Let R_n(θ, x, y) be the empirical cost function, p(θ | D_n) the posterior distribution of the model parameters (θ ∈ Θ) given the finite set of independent observations D_n = {(y_t, x_t), t = 1, ..., n}, n ∈ N*, π(θ) the prior distribution of θ, which is assumed to be uniform, and p(D_n | θ) the likelihood of D_n given θ. Thus,

argmin_θ R_n(θ, x, y) = argmin_θ (−2 ln p(θ | D_n)) = argmin_θ Σ_{t=1}^{n} [ (y_t − F_θ(x_t))⊤ Σ⁻¹(x_t) (y_t − F_θ(x_t)) − ln|Σ⁻¹(x_t)| − 2 ln Φ(α⊤(y_t − F_θ(x_t))) ].   (11)

The proof of Theorem 3 is given in the Appendix. Expression (11) appears as an extension of the empirical expression (9). The additional terms ln Φ(α⊤(y_t − F_θ(x_t))) and ln|Σ⁻¹(x_t)| relative to (9) contribute to taking the imprecise data into account during the learning process. They can be known or not; if not, they have to be considered and estimated during the minimization process.
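For illustration, a minimal R sketch of the extended cost in (11) is given below, assuming a known constant covariance matrix Σ and shape vector α; as above, `F_theta(x, theta)` is a hypothetical placeholder for the regression function, and the resulting sum could be passed to a generic optimizer such as `optim`.

```r
# Minimal sketch of the extended empirical cost of Theorem 3 (NEGLS):
# sum over t of (y_t - F)' Sigma^{-1} (y_t - F) - log|Sigma^{-1}| - 2 log Phi(alpha'(y_t - F)).
negls_cost <- function(theta, X, Y, F_theta, Sigma, alpha) {
  Sigma_inv <- solve(Sigma)
  log_det   <- as.numeric(determinant(Sigma_inv, logarithm = TRUE)$modulus)
  resid <- Y - t(apply(X, 1, F_theta, theta = theta))        # n x q residual matrix
  quad  <- rowSums((resid %*% Sigma_inv) * resid)            # quadratic terms
  skew  <- pnorm(as.vector(resid %*% alpha), log.p = TRUE)   # log Phi(alpha' residual)
  sum(quad - log_det - 2 * skew)
}
```

Setting `alpha` to the zero vector makes the last term a constant, recovering the cost of Corollary 1 up to an additive constant.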

Corollary 1. Under the conditions of Theorem 3, if the vector α is equal to the null vector, we obtain the same result as Badran and Thiria [1], which is

argmin_θ R_n(θ, x, y) = argmin_θ Σ_{t=1}^{n} [ (y_t − F_θ(x_t))⊤ Σ⁻¹(x_t) (y_t − F_θ(x_t)) − ln|Σ⁻¹(x_t)| ].   (12)

The simulation study presented below is based on a multilayer perceptron neural network (MLP) with one hidden layer. The output variable is continuous and bivariate. This output variable has a deterministic part (its true value) plus an additional random part which is asymmetric. The same MLP architecture is used to predict the bivariate response from data simulated at different sample sizes. Three different empirical cost functions have been used in the learning process of the imprecise data generated for prediction purposes.
These are (i) the classical generalized least squares function (CGLS) (see (10)), (ii) the extension of the generalized least squares function proposed by [1] (EGLS_B) (see (12)), and (iii) the new extension of the generalized least squares function that we propose (NEGLS) (see (11)). The coefficient of determination (R²) and the mean absolute percentage error (MAPE) are used to compare the performance of the estimates of the nonlinear regression function based on these three empirical cost functions.

Simulation Study and Real Data Applications
In this section, we consider the regressor F θ as a multilayer perceptron neural network (MLP).
An MLP is a set of units (neurons) organized in successive layers (first layer = input layer, last layer = output layer, and the middle layer(s) = hidden layer(s)), which are interconnected by weights (θ). The connections are always directed from lower layers to upper layers, and the neurons of the same layer are not interconnected. The MLP model (F_θ) considered here has four input neurons, noted X (X ∈ R⁴), three hidden units, which represent 75% of the input neurons, and two output neurons, denoted Y (Y ∈ R²) (see Figure 1). Consider w⊤ := (w_11, ..., w_14, ..., w_31, ..., w_34; w_10, ..., w_30) ∈ R^15 as the vector of parameters between the input and hidden layers, β⊤ := (β_11, β_12, β_13, β_21, β_22, β_23; β_10, β_20) ∈ R^8 as the vector of parameters between the hidden and output layers, and Θ_m ⊂ R^23 as a compact subset of possible parameters of the regression model family S = {F_θ(X), θ ∈ Θ_m, X ∈ R⁴}. g and f (real-valued functions) are considered in the model as the output and hidden-unit activation functions, respectively.
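As a sketch of the regressor just described, the forward pass of this 4-3-2 MLP can be written directly in R; the packing order of θ into w and β and the activation choices (tanh hidden units, linear output) are illustrative assumptions, not the exact settings of the paper.

```r
# Minimal sketch of the 4-3-2 MLP F_theta (23 parameters: 15 in w, 8 in beta).
mlp_forward <- function(x, theta, f = tanh, g = identity) {
  W  <- matrix(theta[1:12], nrow = 3, ncol = 4, byrow = TRUE)  # w_kj, k = 1..3, j = 1..4
  w0 <- theta[13:15]                                           # hidden biases w_k0
  B  <- matrix(theta[16:21], nrow = 2, ncol = 3, byrow = TRUE) # beta_ik, i = 1..2, k = 1..3
  b0 <- theta[22:23]                                           # output biases beta_i0
  h  <- f(W %*% x + w0)        # hidden layer: 3 units
  as.vector(g(B %*% h + b0))   # output layer: 2 units
}
```

A function of this form can play the role of `F_theta` in the cost sketches above, with θ estimated by numerical minimization of the chosen empirical cost.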

Simulation Plan.
Let X ∈ R⁴ be a vector of input variables and Y ∈ R² a bivariate output variable. We have Y = G(X) + ε, where G: R⁴ → R² and ε ∈ R² is a vector of residual variables. To guarantee the nonlinearity of G, we chose a nonlinear form for G (equation (13)). The coefficients used in the simulation study were prespecified as c_1 = 3.06, c_2 = 0.15, c_3 = 5.47, and c_4 = −0.07. Concerning the vector X = (X_1, X_2, X_3, X_4)⊤, its components respectively follow normal, log-normal, negative binomial, and Poisson laws, with X_4 ∼ P(λ = 2.4). The set of distributions of the different components of X makes the probability distribution of X unknown, as assumed in the previous theorems. The additional random part added to G(X) is ε = (ε_1, ε_2) with ε ∼ SN(0, Σ(x), α), where α = (−4, 4)⊤, corresponding to a bivariate skew-normal distribution. Consider any covariance matrix Σ(x) assumed to be known. Thus, we arbitrarily chose Σ(x) = [1 0; 0 1] so that the errors ε = (ε_1, ε_2) are centered, reduced, and uncorrelated, and Σ(x) = [1 0.8; 0.8 1] so that the errors ε = (ε_1, ε_2) are centered, reduced, and correlated (see Figure 2). Conditions A1-A3 and the previous information made it possible to randomly generate a population of size N = 10,000.
The simulation algorithm is presented in the following steps:

Step 1. Generate the input variables X_1 to X_4 and ε from their respective distributions, for N = 10,000.

Step 2. Calculate G(X) using (13) from the values of the coefficients c_1 to c_4 and the generated input variables X_1 to X_4.

Step 3. Generate the outcome variable Y for N = 10,000 by adding each ε distribution to G(X) (a minimal sketch of Steps 1-3 is given after this list).

Step 4. Draw a multivariate Bootstrap sample of size n (n = 50, 150, 1000) from the population generated at the previous steps.

Step 5. For each Bootstrap sample, run the MLP model on 70% of the data (training dataset); the remaining 30% (test dataset) are intended for the predictive performance analysis. The characteristics considered for the model are as follows:

Step 6. For each combination of n and ε distribution, repeat the previous step 1000 times and compute the mean of the performance criteria. The R software (v. 3.3.6) [14] is used for data generation and model implementation.
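A minimal R sketch of Steps 1-3 follows. The parameters of the distributions of X_1 to X_3 and the nonlinear form of G are placeholders chosen for illustration only (the paper specifies them in (13) and in the text above), and the skew-normal errors are drawn with `rmsn` from the `sn` package.

```r
# Minimal sketch of Steps 1-3 of the simulation plan (illustrative placeholder values;
# the function G below is NOT the one used in the paper).
library(sn)
set.seed(123)
N <- 10000

# Step 1: inputs (normal, log-normal, negative binomial, Poisson) and skew-normal errors
X1 <- rnorm(N)                              # assumed parameters
X2 <- rlnorm(N)                             # assumed parameters
X3 <- rnbinom(N, size = 5, mu = 3)          # assumed parameters
X4 <- rpois(N, lambda = 2.4)
Sigma <- matrix(c(1, 0.8, 0.8, 1), 2, 2)    # correlated-error scenario
eps   <- rmsn(N, xi = c(0, 0), Omega = Sigma, alpha = c(-4, 4))

# Step 2: deterministic part G(X) (placeholder nonlinear form with the stated coefficients)
G <- cbind(3.06 * sin(X1) + 0.15 * X2,
           5.47 * log1p(X3) - 0.07 * X4^2)

# Step 3: observed bivariate response
Y <- G + eps
```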

Performance Criteria.
Two performance criteria are used on test data: coefficient of determination, R 2 , and mean absolute percentage error, MAPE [15,16].
An R² close to 1 indicates good performance. The MAPE is a measure of forecast accuracy; a value close to 0 indicates good accuracy.
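For one output variable, both criteria can be computed on the test set as follows (standard formulations, given here only for reference; in the bivariate case they are computed for each response).

```r
# Coefficient of determination and mean absolute percentage error for one output.
r_squared <- function(y, y_hat) 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)
mape      <- function(y, y_hat) 100 * mean(abs((y - y_hat) / y))
```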

Findings Related to Simulation Study.
Comparing the performance of the nonlinear regression models based on multilayer perceptron neural networks, with the three different quadratic error functions used in the learning process, reveals that the larger the sample size, the more accurate the model (low values of MAPE and high values of R²). Moreover, when the errors are centered, reduced, and uncorrelated, the error functions GLS and EGLS_B give the same results, but NEGLS gives a better result than both (Table 1). For centered, reduced, and correlated errors, the three error functions used in the learning process lead to different results. The best performance, regardless of sample size, is obtained with the quadratic error function NEGLS, followed by EGLS_B and finally GLS (Table 1).

Real Data Applications: AIS Dataset.
In this subsection, an application of the new quadratic error function and the two other ones to real-world data is shown. The data refer to 13 biological characteristics of 102 male and 100 female athletes recorded at the Australian Institute of Sport, courtesy of Richard Telford and Ross Cunningham. The data have been presented and examined by [17] and are available in the package "sn" of the R software under the name "ais." This dataset is repeatedly used in the literature on skewed distributions. In our application, we only use a subset of these characteristics, used by Azzalini [18] for fitting linear models with a skew-elliptical error term, in addition to some biological characteristics. The bivariate output variable considered is the body mass index (BMI) and the lean body mass (LBM), with α =
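The dataset can be loaded directly from the `sn` package; the short R sketch below only shows the loading step and the extraction of the bivariate response (BMI, LBM), the choice of covariates being left as described above.

```r
# Load the AIS data shipped with the sn package and extract the bivariate response.
data(ais, package = "sn")
Y <- as.matrix(ais[, c("BMI", "LBM")])  # output: body mass index and lean body mass
str(ais)                                # inspect the recorded characteristics of the 202 athletes
```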

Conclusion
For the prediction of imprecise data, nonlinear regression models are more appropriate. In the context of parametric models with a multivariate response, parameter estimation based on the traditional generalized least squares error function (GLS) leads to biases, and the resulting models are less efficient. The results obtained from our simulation study show the importance of developing a new quadratic error function to learn these types of data rather than using the traditional GLS error function. Our proposed error function provides more flexibility to take into account the nature of the noise in the imprecise data during the learning process. The next step will be to develop a learning algorithm and a package to learn imprecise data for the prediction of nonlinear systems.

A. Proof of Theorem 1
Let

R(θ) = ∫ (y − F_θ(x))⊤ Σ⁻¹(x) (y − F_θ(x)) dP(x, y)   (A.1)

be the least squares error function (cost function) in the case of a nonlinear regression model F_θ. Replacing dP(x, y) by p(x, y) dx dy in equation (A.1), we have

R(θ) = ∫∫ (y − F_θ(x))⊤ Σ⁻¹(x) (y − F_θ(x)) p(x, y) dx dy.   (A.2)

Using the Bayes rule p(x, y) = p(y | x) p(x) in equation (A.2) gives

R(θ) = ∫ [∫ (y − F_θ(x))⊤ Σ⁻¹(x) (y − F_θ(x)) p(y | x) dy] p(x) dx.   (A.3)

Writing

y − F_θ(x) = (y − E(Y | X = x)) + (E(Y | X = x) − F_θ(x)),   (A.4)

we can now rewrite (A.2) as

R(θ) = ∫ [∫ ((y − E(Y | X = x)) + (E(Y | X = x) − F_θ(x)))⊤ Σ⁻¹(x) ((y − E(Y | X = x)) + (E(Y | X = x) − F_θ(x))) p(y | x) dy] p(x) dx.   (A.5)

Using the distributivity rule with respect to Σ⁻¹(x) and doing some additional simplifications (the cross terms vanish because ∫ (y − E(Y | X = x)) p(y | x) dy = 0), equation (A.5) becomes

R(θ) = ∫ (E(Y | X = x) − F_θ(x))⊤ Σ⁻¹(x) (E(Y | X = x) − F_θ(x)) p(x) dx + ∫ [∫ (y − E(Y | X = x))⊤ Σ⁻¹(x) (y − E(Y | X = x)) p(y | x) dy] p(x) dx.   (A.6)

In equation (A.6), the second term does not depend on θ, and then we have

argmin_θ R(θ) = argmin_θ ∫ (F_θ(x) − E(Y | X = x))⊤ Σ⁻¹(x) (F_θ(x) − E(Y | X = x)) p(x) dx.   (A.7)

B. Proof of Theorem 2
Replacing Σ⁻¹(x) in (6) by its spectral decomposition U diag(1/σ_1², ..., 1/σ_q²) U⊤, the error function becomes

R(θ) = Σ_{k=1}^{q} (1/σ_k²) ∫ (u_k⊤ (y − F_θ(x)))² dP(x, y),

and the discrete form of R(θ) can be written as follows:

R_n(θ) = (1/n) Σ_{t=1}^{n} Σ_{k=1}^{q} (1/σ_k²) (u_k⊤ (y_t − F_θ(x_t)))².

Let us consider the inner term B of this expression. With some development inside B, by adding and subtracting the conditional mean E(Y | X = x), we obtain (B.6), which establishes the result.

C. Proof of Theorem 3
Consider the following expression:

−2 ln p(θ | D_n).   (C.1)

Using the Bayes formula in (4), we have

ln p(θ | D_n) = ln p(D_n | θ) + ln π(θ) − ln p(D_n).   (C.2)

From this equation, p(D_n) is constant with respect to θ (see (5)). ln(π(θ)) is a regularization term and corresponds to an additional constraint on the distribution of θ. Considering θ ∼ U[a, b], with a and b not depending on θ, we can remove p(D_n) and π(θ) from the minimization of −ln p(θ | D_n). Considering the independent observations, we get

ln p(D_n | θ) = Σ_{t=1}^{n} ln p(y_t, x_t | θ),   (C.3)

so that

ln p(D_n | θ) = Σ_{t=1}^{n} [ ln p(y_t | x_t, θ) + ln p(x_t | θ) ].   (C.4)

The first term of (C.4) is the probability that y_t comes from x_t when the conditional probability is generated by the nonlinear regression model F_θ(x_t). The second term, under hypothesis A3, can be removed from the minimization process.
Recall that ε is the stochastic part of Y (ε_t = Y_t − Y_v), depending on X, with Y_v being the deterministic part of Y; it satisfies Definition 1. Assuming that hypotheses A1 and A2 hold, we have from (2) of Definition 2:

p(y_t | x_t, θ) = 2 (2π)^{−q/2} |Σ(x_t)|^{−1/2} exp(−(1/2) (y_t − y_vt)⊤ Σ⁻¹(x_t) (y_t − y_vt)) Φ(α⊤(y_t − y_vt)).   (C.6)

Assume an acceptable nonlinear regression model such that y_vt = F_θ(x_t) ≈ E(Y | X = x_t); then, we can simplify the term ln(p(y_t | x_t, θ)) by replacing y_vt with F_θ(x_t) and obtain

−2 ln p(y_t | x_t, θ) = (y_t − F_θ(x_t))⊤ Σ⁻¹(x_t) (y_t − F_θ(x_t)) − ln|Σ⁻¹(x_t)| − 2 ln Φ(α⊤(y_t − F_θ(x_t))) − 2 ln 2 + q ln 2π.   (C.7)

Hence, we can define the empirical risk from (C.7) as

argmin_θ R_n(θ, x, y) = argmin_θ (−2 ln p(θ | D_n)) = argmin_θ [Σ_{t=1}^{n} R_t(θ)],

where R_t(θ) collects the θ-dependent terms of (C.7).

Data Availability

The Australian Institute of Sport (AIS) data used to support the results of this study have been posted at https://rdrr.io/cran/MoEClust/man/ais.html or https://www.rdocumentation.org/packages/sn/versions/1.5-4/topics/ais. In addition, these data were made public in the book by Cook and Weisberg [17].

Conflicts of Interest
The authors declare that they have no conflicts of interest.