An Improved Approach for Estimating the Hyperparameters of the Kriging Model for High-Dimensional Problems through the Partial Least Squares Method

KPLS and kriging. The developed method is validated especially by using a multimodal academic function, known as Griewank function in the literature, and we show the gain in terms of accuracy and computer time by comparing with KPLS and kriging.


Introduction
During the last years, the kriging model [1][2][3][4], which is referred to as the Gaussian process model [5], has become one of the most popular methods in computer simulation and machine learning.It is used as a substitute of highfidelity codes representing physical phenomena and aims to reduce the computational time of a particular process.For instance, the kriging model is used successfully in several optimization problems [6][7][8][9][10][11]. Kriging is not well adapted to high-dimensional problem, principally due to large matrix inversion problems.In fact, the kriging model becomes much time consuming when a large number of input variables are used since a large number of sampling points are required.Indeed, it is recommended in [12] to use 10 sampling points, with  the number of dimensions, for obtaining a good accuracy of the kriging model.As a result, we need to increase the size of the kriging covariance matrix which becomes computationally very expensive to invert.Moreover, this inversion's problem induces difficulty in the classical hyperparameters estimation through the maximization of the likelihood function.
A recent method, called KPLS [13], is developed to reduce computational time which uses, during a construction of the kriging model, the dimensional reduction method "Partial Least Squares" (PLS).This method is able to reduce the number of hyperparameters of a kriging model, such that their number becomes equal to the number of principal 2 Mathematical Problems in Engineering components retained by the PLS method.The KPLS method is thus able to rapidly build a kriging model for highdimensional problems (100+) while maintaining a good accuracy.However, it has been shown in [13] that the KPLS model is less accurate than the kriging model in many cases, in particular for multimodal functions.
In this paper, we propose an extra step that supplements [13] in order to improve its accuracy.Under hypothesis that kernels used for building the KPLS model are of exponential type with the same form (all Gaussian kernels, e.g.), we choose the hyperparameters found by the KPLS model as an initial point to optimize the likelihood function of a conventional kriging model.In fact, this approach is performed by identifying the covariance function of the KPLS model as a covariance function of a kriging model.The fact of considering the identified kriging model, instead of the KPLS model, leads to extending the search space where the hyperparameters are defined and thus to making the resulting model more flexible than the KPLS model.
This paper is organized in 3 main sections.In Section 2, we present a review of the KPLS model.In Section 3, we discuss our new approach under the hypothesis needed for its applicability.Finally, numerical results are shown to confirm the efficiency of our method followed by a summary of what we have achieved.

Construction of the Kriging Model.
For building the kriging model, we assume that the deterministic response (x) is realization of a stochastic process [14][15][16][17]: The presented formula, with  0 an unknown constant, corresponds to ordinary kriging [8] which is a particular case of universal kriging [15].The stochastic term (x) is considered as realization of a stationary Gaussian process with E[(x)] = 0 and a covariance function, also called kernel function, given by where  2 is the process variance and  xx  is the correlation function between x and x  .However, the correlation function  depends on hyperparameters  which are considered to be known.We also denote the  × 1 vector as r xX = [ xx (1) , . . .,  xx () ]  and the  ×  correlation matrix as R = [r x (1) X , . . ., r x () X ].We use ŷ(x) to denote the prediction of the true function (x).Under the hypothesis above, the best linear unbiased predictor for (x), given the observations y, is where 1 denotes an -vector of ones and In addition, the estimation of  2 is given by Moreover, ordinary kriging provides an estimate of the variance of the prediction, which is given by Note that the assumption of a known covariance function with known parameters  is unrealistic in reality and they are often unknown.For this reason, the covariance function is typically chosen from among a parametric family of kernels.In this work, only the covariance functions of exponential type are considered, in particular the Gaussian kernel.Indeed, the Gaussian kernel is the most popular kernel in kriging metamodels of simulation models, which is given by We note that the parameters   , for  = 1, . . ., , can be interpreted as measuring how strongly the input variables  1 , . . .,   , respectively, affect the output .If   is very large, the kernel (x, x  ) given by (7) tends to zero and thus leads to a low correlation.In fact, we see in Figure 1 how the correlation curve rapidly varies from a point to another when  = 10.
However, the estimator of the kriging parameters ( β0 , σ2 , and  1 , . . .,   ) makes the kriging predictor, given by (3), nonlinear and makes the estimated predictor variance, given by ( 6), biased.We note that the vector r and the matrix R should get hats above but it is ignored in practice [18].

Partial Least Squares.
The PLS method is a statistical method which searches out the best multidimensional direction X that explains the characteristics of the output y.It finds a linear relationship between input variables and output variable by projecting input variables onto principal components, also called latent variables.The PLS technique reduces dimension and reveals how inputs depend on output.In the following, we use ℎ to denote the number of principal components retained which are a lot lower than  (ℎ ≪ ); ℎ does not generally exceed 4, in practice.In addition, the principal components can be computed sequentially.In fact, the principal component t () , for  = 1, . . ., ℎ, is computed by seeking the best direction w () which maximizes the squared covariance between t () = X (−1) w () and y (−1) : w ()  X (−1)  y (−1) y (−1)  X (−1) w ()   such that w ()  w () = 1, where X = X (0) , y = y (0) , and, for  = 1, . . ., ℎ, X () and y () are the residual matrix from the local regression of X (−1) onto the principal component t () and from the local regression of y ()  onto the principal component t () , respectively, such that where p () (a 1 ×  vector) and   (a coefficient) contain the regression coefficients.For more details of how PLS method works, please see [19][20][21].
The principal components represent the new coordinate system obtained upon rotating the original system with axes,  1 , . . .,   [21].For  = 1, . . ., ℎ, t () can be written as This important relationship is mainly used for developing the KPLS model which is detailed in Section 2.3.The vectors w () * , for  = 1, . . ., ℎ, are given by the following matrix W * = [w (1)   * , . . ., w (ℎ) * ] which is obtained by (for more details, see [22]) where W = [w (1) , . . ., w (ℎ) ] and P = [p (1)  , . . ., p (ℎ)  ].For building KPLS, coefficients given by vectors w () * will be considered as measuring of the influence of the input variables  1 , . . .,   on the output .By some elementary operations on the kernel functions, we define the KPLS kernel by where   :  ×  → R is an isotropic stationary kernel and More details of such construction are given in [13].Considering the example of the Gaussian kernel given by (7), we obtain Since a small number of principal components are retained, the estimation of the hyperparameters  1 , . . .,  ℎ is faster than the hyperparameters  1 , . . .,   given by (7), where  is very high (100+).

Transition from the KPLS Model to the Kriging Model Using the Exponential Covariance Functions
In this section, we show that if all kernels   , for  = 1, . . ., ℎ, used in (12) are of the exponential type with the same form (all Gaussian kernels, e.g.), then the kernel  KPLS1:ℎ given by ( 12) will be of the exponential type with the same form as   (Gaussian if all   are Gaussian).; we have In the same way, we can show this equivalence for the other exponential kernels where  1 = ⋅ ⋅ ⋅ =  ℎ : However, we must caution that the above proof shows equivalence between the covariance functions of KPLS and kriging only on a subspace domain.More precisely, the KPLS covariance function is defined in a subspace from R +  whereas the kriging covariance function is defined in the complete R +  domain.Thus, our original idea is to extend the space where the KPLS covariance function is defined for the complete space R +  .

A New
Step during the Construction of the KPLS Model: KPLS+K.By considering the equivalence shown in the last , for  = 1, . . ., , as a starting point of the local optimization by considering the solution   , for  = 1, . . ., ℎ, found by the KPLS method.Thus, this optimization is done in the complete space, where the vector  = {  } ∈ R +  .This approach, called KPLS+K, aims to improve the MLE of the kriging model equivalent to the associated KPLS model.In fact, the local optimization of the equivalent kriging offers more possibilities for improving the MLE, by considering a wider search space, and thus it will be able to correct the estimation of many directions.These directions are represented by   for the th direction which is badly estimated by the KPLS method.Because estimating the equivalent kriging hyperparameters can be time consuming, especially when  is large, we improve the MLE by a local optimization at the cost of a slight increase of computational time.
Figure 2 recalls the principal stages of building a KPLS+K model.

Numerical Simulations
We now focus on the performance of KPLS+K by comparing it with the KPLS model and the ordinary kriging model.For this purpose, we use the academic function, named Griewank, over the interval [−5, 5] which is studied in [13].20 and 60 dimensions are considered for this function.In addition, an engineering example, done at Snecma for a multidisciplinary optimization, is used.This engineering case is chosen since it was shown in [13] that KPLS is less accurate than ordinary kriging.The Gaussian kernel is used for all surrogate models used herein, that is, ordinary kriging, KPLS, and KPLS+K.For KPLS and KPLS+K using ℎ principal  components, for ℎ ≤ , will be denoted by KPLSℎ and KPLSℎ+K, respectively, and this ℎ is varied from 1 to 3. The Python toolbox Scikit-learn v.014 [23] is used to achieve the proposed numerical tests, except for ordinary kriging used on the industrial case, where the Optimus version is used.The training and the validation points used in [13] are reused in the following.[−5, 5].The Griewank function [13,24] is defined by

Griewank Function over the Interval
Figure 3 shows the degree of complexity of such function which is very multimodal.As in [13], we consider  = 20 and  = 60 input variables.For each problem, ten experiments based on the random Latin-hypercube design are built with  (number of sampling points) equal to 50, 100, 200, and 300.
To better visualize the results, boxplots are used to show the CPU time and the relative error RE given by where ‖ ⋅ ‖ 2 represents the usual  2 norm and Ŷ and Y are the vectors containing the prediction and the real values of 5000 randomly selected validation points for each case.
The mean and the standard error are given in Tables 2 and  3, respectively, in Appendix.However, the results of the ordinary kriging model and the KPLS model are reported from [13].
For 20 input variables and 50 sampling points, the KPLS models always give a more accurate solution than ordinary kriging and KPLS+K, as shown in Figure 4(a).Indeed, the best result is given by KPLS3 with a mean of RE equal to 0.51%.However, the KPLS+K models give more accurate models than ordinary kriging in this case (0.58% for KPLS2+K and KPLS3+K versus 0.62% for ordinary kriging).For the KPLS model, the rate of improvement with respect to the number of sampling points is less than for ordinary kriging and KPLS+K (see Figures 4(b)-4(d)).As a result, KPLSℎ+K, for ℎ = 1, . . ., 3, and ordinary kriging give almost the same accuracy (≈0.16%) when 300 sampling points are used (as shown in Figure 4(d)), whereas the KPLS models give a RE of 0.35% as a best result, when ℎ = 3.
Nevertheless, the results shown in Figure 5 indicate that the KPLS+K models lead to an important reduction in CPU time for the various number of sampling points compared to ordinary kriging.For instance, 20.49 s are required for building KPLS3+K when 300 training points are used, whereas ordinary kriging is built in 94.31 s; in this case, KPLS3+K is thus approximately 4 times cheaper than the ordinary kriging model.Moreover, the computational time required for building KPLS+K is more stable than the computational time for building ordinary kriging; standard deviations of approximately 3 s for KPLS+K and 22 s for the ordinary kriging model are observed.
For 60 input variables and 50 sampling points, a slight difference of the results occurs compared to the 20 input variables case (Figure 6(a)).Indeed, the KPLS models remain always better, with a mean of RE approximately equal to 0.92%, than KPLS+K and ordinary kriging.However, the KPLS+K models give more accurate results than ordinary kriging with an accuracy close to that of KPLS (≈0.99% versus 1.39%).Increasing the number of sampling points, the accuracy of ordinary kriging becomes better than the accuracy given by the KPLS models, but it remains less accurate than for the KPLSℎ+K models, for ℎ = 2 or 3.For instance, we obtain a mean of RE with 0.60% for KPLS2+K against 0.65% for ordinary kriging (see Figure 6(d)), when 300 sampling points are used.
As we can observe from Figure 7(d), a very important reduction in terms of computational time is obtained.Indeed, a mean time of 2894.56 s is required for building ordinary kriging, whereas KPLS2+K is built in 23.03 s; KPLS2+K is approximately 125 times cheaper than ordinary kriging in this case.In addition, the computational time for building  KPLS+K is more stable than ordinary kriging, except the KPLS3+K case; a standard deviation of approximately 0.30 s for KPLS1+K and KPLS2+K is observed, against 728.48 s for ordinary kriging.However, the relatively large standard of deviation of KPLS3+K (26.59 s) is probably due to the dispersion caused by KPLS3 (26.59 s).But, it remains too lower than the standard deviation of the ordinary kriging model.
For the Griewank function over the interval [−5, 5], the KPLS+K models are slightly more time consuming than the KPLS models, but they are more accurate, in particular when the number of observations is greater than the dimension , as is implied by the rule-of-thumb  = 10 in [12].They seem to perform well compared to the ordinary kriging model with an important gain in terms of saving CPU time.

Engineering Case.
In this section, let us consider the third output,  3 , from tab 1 problem studied in [13].This test case is chosen because the KPLS models, from 1 to 3 principal components, do not perform well (see Table 1).We recall that this problem contains 24 input variables.99 training points and 52 validation points are used and the relative error (RE) given by ( 18) is considered.
As we see in Table 1, we improve the accuracy of KPLS by adding the step for building KPLS+K.This improvement is verified whatever the number of principal components

Figure 1 :
Figure 1: Theta smoothness can be tuned to adapt spatial influence to our problem.The magnitude of  dictates how quickly the squared exponential function variates.
RE (%) for 20 input variables and 50 sampling points RE (%) for 20 input variables and 100 sampling points RE (%) for 20 input variables and 300 sampling points

Figure 4 :
Figure 4: RE of the Griewank function in 20D over the interval [−5, 5].The experiments are based on the 10-Latin-hypercube design.

Table 1 :
(18)lts for tab 1 experiment data (24 input variables, output variables  3 ) obtained by using 99 training points, 52 validation points, and the error given by(18)."Kriging" refers to the ordinary kriging Optimus solution and "KPLSℎ" and "KPLSℎ+K" refer to KPLS and KPLS+K with ℎ principal components, respectively.Best results of the relative error are highlighted in bold type.
}, for  = 1, ..., , given by (7) are found by maximum likelihood estimation (MLE) method.Their estimation becomes more and more expensive when  increases.The vector  can be interpreted as measuring how strongly the variables  1 , . . .,   affect the output , respectively.

Table 2 :
Results of the Griewank function in 20D over the interval [−5, 5].Ten trials are done for each test (50, 100, 200, and 300 training points).Best results of the relative error are highlighted in bold type for each case.

Table 3 :
Results of the Griewank function in 60D over the interval [−5, 5].Ten trials are done for each test (50, 100, 200, and 300 training points).Best results of the relative error are highlighted in bold type for each case.