Kernel Parameter Optimization for Kriging Based on Structural Risk Minimization Principle

An improved kernel parameter optimization method based on the Structural Risk Minimization (SRM) principle is proposed to enhance the generalization ability of the traditional Kriging surrogate model. This article first analyses the importance of generalization ability as an assessment criterion for surrogate models from the perspective of statistics and shows that it is applicable to Kriging. A kernel parameter optimization method is then used to improve the fitting precision of the Kriging model. Using the smoothness measure of generalization ability and an anisotropic kernel function, the modified Kriging surrogate model and its analysis process are established. Several benchmarks are tested to verify the effectiveness of the modified method under two sampling states: uniform distribution and nonuniform distribution. The results show that the proposed Kriging has better generalization ability and adaptability, especially for nonuniformly distributed samples.


Introduction
Computer-intensive optimization problems are becoming more common as the demand for high-fidelity models grows in industry, especially in aerospace engineering. Improving computational efficiency is therefore urgent [1,2]. Surrogate modeling combines Design of Experiments (DOE), mathematical statistics, and optimization techniques. It approximates a complicated and time-consuming physical model with an analytical mathematical model in order to shorten the analysis process and smooth the design space. It has become one of the most effective methods for computer-intensive problems and has been widely applied to high-fidelity design and optimization [3].
The fitting precision of a surrogate model with respect to the physical model is one of its most important indicators. Because computational resources are limited, it is not practical to evaluate the physical model over the entire design space; the most effective approach is to estimate the fitting precision from the error between the surrogate model and the physical model at the sample points. The estimated fitting precision is a function of the sample size. It converges to the true precision only when the sample size is large enough for the law of large numbers to apply. In practice, sample size and fitting precision are conflicting goals: we want fewer sample points and higher fitting precision. Vapnik proposed statistical learning theory (SLT) in the late 1970s [4,5], and in the 1990s it developed into a machine learning approach for limited samples. Its core idea is to control the generalization ability of the learning machine by controlling the capacity of the machine. This is the idea behind improving the fitting precision of a surrogate model by applying the SRM principle.
Many surrogate models exist, such as the Response Surface Model (RSM), Kriging, Radial Basis Function (RBF), and Support Vector Regression (SVR). Kriging is an interpolation method based on statistical theory; its main idea is to build an approximate function of the object from the dynamic construction of the design space and to use it to predict information at unknown points [6,7]. Kriging predicts an unknown point with a linearly weighted combination of the information at nearby points; the weights are determined by minimizing the variance of the estimation error, which makes Kriging a best linear unbiased estimator [8]. The Kriging surrogate model is simple and stable compared with other methods and has been widely used in many fields [9][10][11][12]. Improved Kriging models have also been studied, such as Gradient-Enhanced Kriging [13], CoKriging [14], and Hierarchical Kriging [15].
Kriging is sensitive to the sample points, and its predictions degrade when few sample points are available. The amount of information in the correlation matrix and the parameters of the basis function have an obvious influence on fitting precision, so it is necessary to optimize the basis function parameters to improve the fitting capacity. SRM is a central theory for evaluating the generalization ability of a surrogate model and has been widely used in kernel parameter optimization for SVM [16], fuzzy models [17], and chaotic systems [18]. However, there are only a few studies for other surrogate models. Zhu [19] proposed an automatic method to optimize the basis function of RBF using the SRM principle. Chen [20] applied the SRM principle to RBF network learning to improve generalization ability. These studies indicate that using the SRM principle to optimize the parameters of a surrogate model can improve its fitting precision. This paper proposes an anisotropic basis function parameter optimization method for Kriging based on the SRM principle. The parameters of the Kriging correlation functions are optimized using the smoothness measure as the objective function. Benchmarks of different scales are used to verify the proposed method. The influence of the distribution pattern of the sample points, uniform or nonuniform, is also studied.

Assessment Criteria of Surrogate Model
The purpose of machine learning is to find the internal dependencies in given data in order to predict unknown data or estimate model characteristics. From the perspective of statistics, machine learning can be viewed as predicting the unknown input-output relation as accurately as possible from given training samples. To minimize the risk functional, we should find the optimal function $f(x, w^*)$ in the function set $\{f(x, w)\}$:
$$R(w) = \int L(y, f(x, w))\, dF(x, y),$$
where $\{f(x, w)\}$ is the prediction function set, $w$ is the parameter of the prediction function, $L$ is the loss function, and $F(x, y)$ is the joint distribution of inputs and outputs. Machine learning is thus designed to minimize the risk functional of the squared error loss function:
$$L(y, f(x, w)) = (y - f(x, w))^2.$$

Empirical Risk Minimization
ERM has been widely used in the least squares method for regression problems, the maximum likelihood method for probability density estimation, and neural network learning [21]. However, ERM relies on the law of large numbers and depends entirely on the sum of squared errors over the samples. With few samples, a small empirical risk cannot guarantee a small expected risk. RBF and neural network models, which are based on the ERM principle, may therefore overfit [22]: the training error at the sample points is small, but the testing error at other points is large.
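The gap between empirical risk and test error can be illustrated with a small sketch. It is not taken from the paper: the Runge function, the 11 equally spaced samples, and the interpolating polynomial are assumptions chosen only because they make the effect easy to see.

```python
def runge(x):
    """Illustrative target function (an assumption, not one of the paper's benchmarks)."""
    return 1.0 / (1.0 + 25.0 * x * x)


def lagrange_predict(xs, ys, x):
    """Evaluate at x the polynomial that passes exactly through all (xs, ys)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def mean_squared_error(xs, ys, points, targets):
    """Mean squared error of the interpolant built on (xs, ys) at the given points."""
    errors = [(lagrange_predict(xs, ys, p) - t) ** 2 for p, t in zip(points, targets)]
    return sum(errors) / len(errors)


n = 11
train_x = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
train_y = [runge(x) for x in train_x]
test_x = [-1.0 + 2.0 * i / 200 for i in range(201)]
test_y = [runge(x) for x in test_x]

# Empirical risk: error at the training samples only.
empirical_risk = mean_squared_error(train_x, train_y, train_x, train_y)
# Test risk: error on a dense grid that stands in for the whole design space.
test_risk = mean_squared_error(train_x, train_y, test_x, test_y)
print(empirical_risk, test_risk)
```

The interpolant reproduces every training sample, so its empirical risk vanishes, yet it oscillates between the samples near the interval ends and its error on the dense grid is far from zero.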

Structural Risk Minimization Principle.
Generalization ability is the capacity of a learning machine to predict or estimate unknown phenomena. The ERM principle cannot guarantee a minimal expected risk. Therefore, the SRM principle is proposed based on statistical learning theory. It provides the relationship between empirical risk and real risk, which is known as the bound of the generalization ability. The SRM principle divides the real risk of machine learning into the empirical risk and a confidence interval [4]:
$$R(w) \le R_{\mathrm{emp}}(w) + \Phi\left(\frac{h}{n}\right), \tag{3}$$
where $R_{\mathrm{emp}}(w)$ is the empirical risk, $\Phi(h/n)$ is the confidence interval, $n$ is the number of samples, and $h$ is the Vapnik-Chervonenkis (VC) dimension, which is the most important theoretical basis of statistical learning theory. The VC dimension defines the capacity of the function set and reflects the generalization ability of the learning machine. It is the best descriptive indicator of the learning capacity of a function set known so far. From (3), we know that the ERM principle is unreasonable with limited samples. When $h/n$ becomes larger, the empirical risk decreases but the confidence interval grows, so the real risk does not always become lower. When the number of samples is fixed, reducing $h$ decreases the confidence interval, so the empirical risk comes close to the real risk. In this case, a smaller empirical risk stands for a smaller expected risk. Increasing the number of samples $n$ also decreases the confidence interval and reduces the real risk, but it is difficult to obtain plenty of samples because of the computational cost.
Therefore, decreasing the VC dimension is the most suitable way to reduce the real risk. For a given observation set, the SRM principle chooses the subset $S_k$ of the function set with the minimum risk bound and, within it, the function $f(x, w_k)$ that minimizes the empirical risk of the loss $(y - f(x, w_k))^2$. The SRM principle is thus a compromise that balances the fitting precision and the complexity of the fitting function.
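How the confidence interval reacts to the VC dimension and to the number of samples can be sketched numerically. The explicit form used below is Vapnik's classical confidence term for bounded losses, holding with probability 1 - eta; that it is the form intended in the text is an assumption, and the names `vc_confidence` and `risk_bound` are hypothetical.

```python
import math


def vc_confidence(h, n, eta=0.05):
    """Classical VC confidence term sqrt((h (ln(2n/h) + 1) - ln(eta/4)) / n).

    h: VC dimension, n: number of samples (n > h), eta: 1 - confidence level.
    """
    return math.sqrt((h * (math.log(2.0 * n / h) + 1.0) - math.log(eta / 4.0)) / n)


def risk_bound(empirical_risk, h, n, eta=0.05):
    """Upper bound on the real risk: empirical risk plus confidence term."""
    return empirical_risk + vc_confidence(h, n, eta)


# Fixed sample size: a smaller VC dimension tightens the bound.
for h in (5, 20, 50):
    print("h =", h, "confidence =", round(vc_confidence(h, 100), 3))

# Fixed VC dimension: more samples tighten the bound.
for n in (100, 1000, 10000):
    print("n =", n, "confidence =", round(vc_confidence(5, n), 3))
```

With samples limited by computational cost, only the first lever (a smaller capacity h) is available in practice, which is the motivation for the SRM-based selection described above.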

SRM Index Based on Smoothness Measure.
The smoothness measure is used to evaluate the generalization ability. In the projection space formed by the kernel function, the smoothness measure has a more direct and natural definition. It can be described as the norm in the Hilbert space of the kernel function, which is defined as follows.
Every function $f(x)$ in the Hilbert space of the kernel function $K$ can be expressed as
$$f(x) = \sum_{i} c_i \varphi_i(x).$$
Then the norm of $f(x)$ in the Hilbert space is
$$\|f\|_K^2 = \sum_{i} \frac{c_i^2}{\lambda_i},$$
where $\lambda_i$ is the eigenvalue of the kernel $K$ and $\varphi_i(x)$ is the corresponding eigenfunction. The generalization ability can be measured by evaluating $\|f\|_K^2$. The optimal regression problem becomes
$$\min \|f\|_K^2,$$
where $\|f\|_K^2$ is the smoothness measure of the sample points.
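For a function written as a finite kernel expansion over the sample points, this norm reduces to a quadratic form in the kernel matrix. The short derivation below is a sketch under two assumptions that are not stated explicitly in the text: the kernel has the eigen-expansion $K(x, x') = \sum_k \lambda_k \varphi_k(x) \varphi_k(x')$, and $f$ has the form $f(x) = \sum_i \alpha_i K(x, x_i)$ with coefficients $\alpha_i$ (a notation introduced here for illustration).

```latex
\begin{aligned}
f(x) &= \sum_{i=1}^{n} \alpha_i K(x, x_i)
      = \sum_{k} \Big( \lambda_k \sum_{i=1}^{n} \alpha_i \varphi_k(x_i) \Big) \varphi_k(x)
      \quad\Longrightarrow\quad
      c_k = \lambda_k \sum_{i=1}^{n} \alpha_i \varphi_k(x_i), \\
\|f\|_K^2 &= \sum_{k} \frac{c_k^2}{\lambda_k}
      = \sum_{k} \lambda_k \Big( \sum_{i=1}^{n} \alpha_i \varphi_k(x_i) \Big)^{2}
      = \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j K(x_i, x_j)
      = \alpha^{T} K \alpha .
\end{aligned}
```

The smoothness measure of such a function can therefore be evaluated from the kernel matrix at the sample points alone, without computing the eigenvalues $\lambda_k$.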

Mercer Theorem.
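The criterion for determining whether a function is a kernel function is as follows.
$X$ is the domain of the independent variables $x$ and $x'$. Let $K(x, x')$ be a symmetric real function that is continuous on $X \times X$. The necessary and sufficient condition for $K(x, x')$ to be expressed as the series
$$K(x, x') = \sum_{i=1}^{\infty} \lambda_i \varphi_i(x) \varphi_i(x')$$
is that every function $g$ satisfying $\int g^2(x)\,dx < \infty$, $g \neq 0$, also satisfies the following condition:
$$\iint K(x, x')\, g(x)\, g(x')\, dx\, dx' \ge 0,$$
where the series converges on $X \times X$, $\lambda_i$ is an eigenvalue ($\lambda_i > 0$) of $K(x, x')$, and $\varphi_i(x)$ is the corresponding eigenfunction of $K(x, x')$. A function satisfying the above definition is known as a Mercer kernel and can be expressed as an inner product in Hilbert space,
$$K(x, x') = \langle \Phi(x), \Phi(x') \rangle,$$
where $\langle \cdot, \cdot \rangle$ is the inner product operator and $\Phi : x \to K(\cdot, x)$ is the nonlinear transformation induced by the kernel function.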
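A discrete analogue of this condition can be checked numerically: on any finite set of points, the double integral becomes a quadratic form of the kernel matrix, which must be nonnegative for a Mercer kernel. The sketch below is illustrative only; the Gaussian kernel with parameter `theta`, the random points, and the counterexample kernel |x - z| are assumptions.

```python
import math
import random


def gauss_kernel(x, z, theta=1.0):
    """Gaussian kernel, a standard example of a Mercer kernel."""
    return math.exp(-theta * (x - z) ** 2)


def abs_kernel(x, z):
    """Symmetric and continuous, but not a Mercer kernel."""
    return abs(x - z)


def quadratic_form(kernel, points, g):
    """Discrete version of the double integral of K(x, x') g(x) g(x')."""
    n = len(points)
    return sum(g[i] * g[j] * kernel(points[i], points[j])
               for i in range(n) for j in range(n))


rng = random.Random(0)
points = [rng.uniform(-2.0, 2.0) for _ in range(8)]

# Random coefficient vectors g never drive the Gaussian quadratic form negative.
forms = []
for _ in range(200):
    g = [rng.uniform(-1.0, 1.0) for _ in points]
    forms.append(quadratic_form(gauss_kernel, points, g))
min_form = min(forms)

# For |x - z| a single pair of points with g = (1, -1) already violates the condition.
violation = quadratic_form(abs_kernel, [0.0, 1.0], [1.0, -1.0])
print(min_form, violation)
```

A random search like this can only expose violations; it does not prove that a kernel satisfies the theorem.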

Kriging Based on SRM Principle
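Kriging Model.
The Kriging model includes two parts:
$$y(x) = f^T(x)\beta + z(x),$$
where $f^T(x)\beta$ is a polynomial regression model and $z(x)$ is a deviation model. The covariance matrix of $z(x)$ is
$$\mathrm{Cov}[z(x_i), z(x_j)] = \sigma^2 R, \qquad R = [R(\theta, x_i, x_j)],$$
where $R$ is the correlation coefficient matrix and $R(\theta, x_i, x_j)$ is the spatial correlation function of any two sample points $x_i$ and $x_j$ in the sample set. It plays a decisive role in the fitting precision. $\theta$ is the parameter of $R$. With $d_k = x_i^{(k)} - x_j^{(k)}$, the common correlation functions include the following [23].
Exponential function is
$$R_k(\theta_k, d_k) = \exp(-\theta_k |d_k|).$$
Gaussian function is
$$R_k(\theta_k, d_k) = \exp(-\theta_k d_k^2).$$
Linear function is
$$R_k(\theta_k, d_k) = \max\{0,\ 1 - \theta_k |d_k|\}.$$
Spherical function is
$$R_k(\theta_k, d_k) = 1 - 1.5\,\xi_k + 0.5\,\xi_k^3, \qquad \xi_k = \min\{1,\ \theta_k |d_k|\}.$$
Cubic function is
$$R_k(\theta_k, d_k) = 1 - 3\,\xi_k^2 + 2\,\xi_k^3, \qquad \xi_k = \min\{1,\ \theta_k |d_k|\}.$$
A linear combination of the responses $Y = [y_1, \ldots, y_n]^T$ of the known samples is used to predict the response $\hat{y}(x)$ at any given point:
$$\hat{y}(x) = c^T(x)\, Y = \sum_{i=1}^{n} c_i(x)\, y_i.$$
The interpolation coefficients $c(x)$ of Kriging are obtained from the unbiasedness condition and the minimum variance principle. The mean of the error must be zero, and the variance $\sigma^2$ of the error is minimized subject to this unbiasedness constraint. The corresponding Lagrangian minimization problem is
$$L(c, \lambda) = \sigma^2 \left(1 + c^T R\, c - 2\, c^T r(x)\right) - \lambda^T \left(F^T c - f(x)\right),$$
where $r(x) = [R(\theta, x_1, x), \ldots, R(\theta, x_n, x)]^T$ and $F$ is the regression matrix with rows $f^T(x_i)$. The parameter $\theta$ of the correlation coefficient matrix can be obtained by Maximum Likelihood Estimation (MLE) [23].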
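The five correlation functions named above can be sketched in a few lines. The formulas below follow the forms commonly used in Kriging toolboxes such as DACE; treating them as the exact forms of [23] is an assumption, and the function names are hypothetical.

```python
import math


def corr_exp(theta, d):
    """Exponential correlation for one coordinate difference d."""
    return math.exp(-theta * abs(d))


def corr_gauss(theta, d):
    """Gaussian correlation."""
    return math.exp(-theta * d * d)


def corr_lin(theta, d):
    """Linear correlation, truncated at zero."""
    return max(0.0, 1.0 - theta * abs(d))


def corr_spherical(theta, d):
    """Spherical correlation."""
    xi = min(1.0, theta * abs(d))
    return 1.0 - 1.5 * xi + 0.5 * xi ** 3


def corr_cubic(theta, d):
    """Cubic correlation."""
    xi = min(1.0, theta * abs(d))
    return 1.0 - 3.0 * xi ** 2 + 2.0 * xi ** 3


def correlation(corr, thetas, x_i, x_j):
    """Correlation of two points as a product over coordinates."""
    value = 1.0
    for theta, a, b in zip(thetas, x_i, x_j):
        value *= corr(theta, a - b)
    return value


x_a, x_b = [0.0, 0.0], [0.3, 0.1]
for corr in (corr_exp, corr_gauss, corr_lin, corr_spherical, corr_cubic):
    print(corr.__name__, round(correlation(corr, [2.0, 2.0], x_a, x_b), 4))
```

Every function equals 1 at zero distance and decays as the distance grows; the linear, spherical, and cubic forms reach exactly zero at a finite distance, while the exponential and Gaussian forms only approach it.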

Kriging Process.
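The output of the Kriging model can be expressed as
$$\hat{y}(x) = f^T(x)\beta^* + r^T(x) R^{-1} (Y - F\beta^*), \tag{19}$$
where $r(x) = [R(\theta, x_1, x), \ldots, R(\theta, x_n, x)]^T$, $Y$ is the vector of sample responses, $F$ is the regression matrix, and $\beta^*$ is the generalized least squares estimate of $\beta$. From the Mercer theorem (Section 2.4), if $R(\theta, x_i, x)$ satisfies the Mercer theorem, it can be expressed as an inner product in Hilbert space:
$$R(\theta, x_i, x) = \langle \Phi(x_i), \Phi(x) \rangle.$$
With $\gamma = R^{-1}(Y - F\beta^*)$, Equation (19) is replaced with
$$\hat{y}(x) = f^T(x)\beta^* + \sum_{i=1}^{n} \gamma_i \langle \Phi(x_i), \Phi(x) \rangle. \tag{22}$$
Equation (22) is simplified as
$$\hat{y}(x) = \langle w, \Phi(x) \rangle + b, \tag{23}$$
where
$$w = \sum_{i=1}^{n} \gamma_i \Phi(x_i), \qquad b = f^T(x)\beta^*.$$
The transformed equation (23) has the same form as the kernel-based model of SVR. It constructs a fixed nonlinear mapping $\Phi : \mathbb{R}^m \to H$ from the input space $\mathbb{R}^m$ to the feature space $H$. The input variable $x$ is mapped into the feature space $H$, and a linear model is generated for linear learning in the feature space. The correlation coefficient matrix $R(\theta, x_i, x_j)$ of Kriging plays the same role as the kernel matrix of SVR; it is therefore also called the kernel matrix in this paper. Then $\|w\|^2 = \gamma^T R\, \gamma$ can be used to evaluate the smoothness of the fitting function; it is called the smoothness measure. From the above derivation we know that the SRM principle can also be used to minimize the VC dimension of the Kriging model to obtain a minimized expected risk [4].
The kernel matrix $R$ must be a symmetric positive definite matrix or a conditionally positive definite symmetric matrix. The kernel matrix formed by the Gaussian basis function satisfies the Mercer theorem. Many other admissible functions can be found in [24].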
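The Kriging process can be sketched for the simplest case. The code below is an illustration under stated assumptions, not the authors' implementation: a constant regression term (F is a column of ones), a Gaussian kernel, a hand-picked theta, and a small hand-written linear solver; all names are hypothetical.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def gauss_corr(theta, a, b):
    """Gaussian kernel with one parameter per input dimension."""
    return math.exp(-sum(t * (u - v) ** 2 for t, u, v in zip(theta, a, b)))


def fit_kriging(xs, ys, theta):
    """Return (beta, gamma, smoothness) for a constant-trend Kriging model."""
    n = len(xs)
    R = [[gauss_corr(theta, xs[i], xs[j]) for j in range(n)] for i in range(n)]
    r_inv_y = solve(R, ys)
    r_inv_one = solve(R, [1.0] * n)
    beta = sum(r_inv_y) / sum(r_inv_one)                        # generalized least squares
    gamma = [a - beta * b for a, b in zip(r_inv_y, r_inv_one)]  # R^{-1} (Y - F beta)
    # gamma^T R gamma equals gamma^T (Y - F beta) because R gamma = Y - F beta.
    smoothness = sum(g * (y - beta) for g, y in zip(gamma, ys))
    return beta, gamma, smoothness


def predict(xs, theta, beta, gamma, x):
    """Kriging output: trend plus kernel expansion over the samples."""
    return beta + sum(g * gauss_corr(theta, xi, x) for g, xi in zip(gamma, xs))


xs = [[0.0], [0.3], [0.6], [1.0]]
ys = [math.sin(2.0 * math.pi * x[0]) for x in xs]
theta = [10.0]
beta, gamma, smoothness = fit_kriging(xs, ys, theta)
print(predict(xs, theta, beta, gamma, [0.45]), smoothness)
```

The model reproduces the sample responses, so the empirical risk is zero for any admissible theta; the smoothness measure is the quantity that still varies with theta.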

Anisotropic Kernel Function.
The basis function of Kriging can be expressed with one parameter $\theta_i$ per sample point. The parameter $\theta_i$ represents the contribution rate of the $i$th sample point to the Kriging model. For a problem with $n$ samples, the size of the parameter $\theta$ is also $n$. As the number of samples increases, the scale of the kernel parameter optimization problem also grows, which may increase the Kriging modeling time considerably. Usually the parameter $\theta_i$ is given the same value for all samples, so that the basis function reduces to an isotropic model. This simplification reduces the design space of the Kriging model but weakens the fitting effect. The following anisotropic kernel function, with a different contribution from each dimension, is used to balance the fitting effect and the computational cost:
$$R(\theta, x_i, x_j) = \prod_{k=1}^{m} R_k(\theta_k, d_k), \qquad d_k = x_i^{(k)} - x_j^{(k)},$$
where $m$ is the dimension of the sample $x$ and $\theta_k$ is the contribution of the $k$th dimension. For the Gaussian function this gives $R(\theta, x_i, x_j) = \exp\left(-\sum_{k=1}^{m} \theta_k d_k^2\right)$. The advantage of this kernel function is that, for a given problem, the scale of the kernel parameter optimization problem stays constant for any number of samples.
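The constant size of the anisotropic parameter vector can be sketched as follows. The Gaussian form and the numerical values are assumptions made for illustration; the names are hypothetical.

```python
import math


def anisotropic_gauss(theta, a, b):
    """Gaussian kernel with one parameter theta[k] per input dimension k."""
    return math.exp(-sum(t * (u - v) ** 2 for t, u, v in zip(theta, a, b)))


def kernel_matrix(theta, xs):
    """n-by-n kernel matrix; theta has length m, independent of n."""
    return [[anisotropic_gauss(theta, xi, xj) for xj in xs] for xi in xs]


# Five samples in two dimensions: theta still has only two entries.
xs = [[0.0, 0.0], [0.0, 5.0], [0.5, 1.0], [1.0, 0.2], [0.3, 0.7]]

isotropic = [2.0, 2.0]      # same contribution from every dimension
anisotropic = [4.0, 0.0]    # second dimension does not contribute at all

K_iso = kernel_matrix(isotropic, xs)
K_aniso = kernel_matrix(anisotropic, xs)

# Samples 0 and 1 differ only in the second coordinate.
print(K_iso[0][1], K_aniso[0][1])
```

With the anisotropic parameters the two samples that differ only in the ignored coordinate are fully correlated, whereas the isotropic kernel treats them as almost unrelated. Adding more samples enlarges the matrix but leaves the number of parameters to optimize at m.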

SRM-Kriging.
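The kernel parameter optimization problem of the Kriging model with the smoothness measure is
$$\min_{\theta}\ \|w\|^2,$$
where $\theta$ is the kernel parameter and $\|w\|^2$ is the smoothness measure used as the objective function:
$$\|w\|^2 = (Y - F\beta^*)^T R^{-1} (Y - F\beta^*),$$
where $\beta^*$ is the generalized least squares solution of the polynomial problem $F\beta = Y$. The derivative of $\|w\|^2$ with respect to $\theta$ is
$$\frac{\partial \|w\|^2}{\partial \theta} = (Y - F\beta^*)^T \frac{\partial R^{-1}}{\partial \theta} (Y - F\beta^*) = -(Y - F\beta^*)^T R^{-1} \frac{\partial R}{\partial \theta} R^{-1} (Y - F\beta^*);$$
the terms involving $\partial \beta^* / \partial \theta$ vanish because $F^T R^{-1}(Y - F\beta^*) = 0$. Let $\gamma = R^{-1}(Y - F\beta^*)$; the derivative can be written as
$$\frac{\partial \|w\|^2}{\partial \theta} = -\gamma^T \frac{\partial R}{\partial \theta} \gamma.$$
The derivative for each $\theta_k$ is
$$\frac{\partial \|w\|^2}{\partial \theta_k} = -\gamma^T \frac{\partial R}{\partial \theta_k} \gamma.$$
Then we get
$$\frac{\partial \|w\|^2}{\partial \theta_k} = -\sum_{i=1}^{n} \sum_{j=1}^{n} \gamma_i \gamma_j \frac{\partial R(\theta, x_i, x_j)}{\partial \theta_k},$$
where $\{\partial R(\theta, x_i, x_j)/\partial \theta_k\}$ is the matrix composed of the derivative values at every $(i, j)$. Its entries follow by the chain rule from $\partial R/\partial d$, the derivative of the basis function with respect to the distance between samples. Based on this derivative information, the efficient Sequential Quadratic Programming (SQP) method is used to solve the kernel parameter optimization problem of the Kriging model.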
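The derivative can be checked against finite differences before it is handed to an optimizer. The sketch below assumes a constant regression term (F is a column of ones), the Gaussian anisotropic kernel, and made-up sample data; for that kernel the derivative of R(θ, x_i, x_j) with respect to θ_k is the entry itself multiplied by minus the squared k-th coordinate difference. All names are hypothetical.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def smoothness_and_grad(theta, xs, ys):
    """Smoothness measure gamma^T R gamma and its gradient -gamma^T dR/dtheta_k gamma."""
    n = len(xs)
    R = [[math.exp(-sum(t * (u - v) ** 2 for t, u, v in zip(theta, xs[i], xs[j])))
          for j in range(n)] for i in range(n)]
    r_inv_y = solve(R, ys)
    r_inv_one = solve(R, [1.0] * n)
    beta = sum(r_inv_y) / sum(r_inv_one)
    gamma = [a - beta * b for a, b in zip(r_inv_y, r_inv_one)]
    value = sum(g * (y - beta) for g, y in zip(gamma, ys))
    grad = []
    for k in range(len(theta)):
        # dR_ij/dtheta_k = -(x_ik - x_jk)^2 R_ij, so the two minus signs cancel.
        grad.append(sum(gamma[i] * gamma[j] * (xs[i][k] - xs[j][k]) ** 2 * R[i][j]
                        for i in range(n) for j in range(n)))
    return value, grad


xs = [[0.0, 0.1], [0.2, 0.9], [0.5, 0.4], [0.7, 0.8], [0.9, 0.2], [0.35, 0.65]]
ys = [math.sin(3.0 * x[0]) + x[1] ** 2 for x in xs]
theta = [3.0, 6.0]

value, grad = smoothness_and_grad(theta, xs, ys)

# Central finite differences as an independent check of the analytic gradient.
step = 1e-5
fd_grad = []
for k in range(len(theta)):
    plus = theta[:]
    minus = theta[:]
    plus[k] += step
    minus[k] -= step
    fd_grad.append((smoothness_and_grad(plus, xs, ys)[0]
                    - smoothness_and_grad(minus, xs, ys)[0]) / (2.0 * step))
print(grad, fd_grad)
```

A function returning the value and gradient in this form could be passed to an SQP-type solver, for example `scipy.optimize.minimize` with `method="SLSQP"` and `jac=True`; whether that solver matches the SQP implementation used by the authors is not stated in the text.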

Test Cases
Nine standard test functions of different dimensions are used to validate the effectiveness of the proposed method.
The improved Kriging model is also evaluated with uniformly and nonuniformly distributed sample points.

Standard Test Functions.
The nine standard test functions are denoted Fun1 to Fun9. They range from one dimension to thirty dimensions with slightly different degrees of nonlinearity, so they can be used to evaluate the performance of the surrogate model under different scales of variables.
These test functions are divided into three scales according to their dimension: small scale, middle scale, and large scale. The sampling information of the above functions is given in Table 1.
Three methods are tested on the above test cases: Kriging with SRM (SRM-Kriging), Kriging with MLE (MLE-Kriging), and Kriging with a constant kernel parameter (CON-Kriging). The Gaussian function is chosen as the kernel function for all three methods. The multiple correlation coefficient $R^2$ and the root-mean-square error (RMSE) are used to evaluate the fitting precision.
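The two precision indicators can be sketched as follows. The definitions below are the common ones (coefficient of determination and root-mean-square error over a set of test points); that the paper uses exactly these forms is an assumption.

```python
import math


def r_squared(y_true, y_pred):
    """1 minus the ratio of residual sum of squares to total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root of the mean squared prediction error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]
print(r_squared(y_true, y_pred), rmse(y_true, y_pred))
```

An R² close to 1 and an RMSE close to 0 both indicate high fitting precision. To measure generalization rather than empirical risk, both should be evaluated on test points that were not used to build the surrogate.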

Samples with Uniform Distribution.
The optimal Latin hypercube sampling method is used to generate uniformly distributed samples. Figure 1 and Table 2 compare the generalization ability for the different test functions with uniform distribution sampling. As the dimension of the variables increases, the fitting precision of CON-Kriging decreases rapidly, especially for the large-scale cases (Fun7 to Fun9). MLE-Kriging has better fitting precision than CON-Kriging because it uses an optimized kernel parameter, but its fitting precision is still poor for the large-scale cases. SRM-Kriging has better fitting precision than both CON-Kriging and MLE-Kriging; even for the large-scale cases, it has the smallest RMSE.
Figure 2 shows the fitting results of Fun1 and Fun2 under uniform distribution sampling. The fitting result of Fun1 shows that CON-Kriging and MLE-Kriging have poor fitting precision; CON-Kriging in particular oscillates badly. This illustrates that kernel function optimization is very important for improving the fitting precision of a surrogate model. Compared with CON-Kriging and MLE-Kriging, SRM-Kriging is very close to the original function and has better generalization ability than the others.
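The stratification idea behind Latin hypercube sampling can be sketched in a few lines. This is the basic random variant on the unit hypercube, not the optimal Latin hypercube used in the experiments, which additionally optimizes a space-filling criterion; the function name is hypothetical.

```python
import random


def latin_hypercube(n, dim, rng):
    """n points in [0, 1)^dim with exactly one point per stratum in each dimension."""
    samples = [[0.0] * dim for _ in range(n)]
    for k in range(dim):
        strata = list(range(n))
        rng.shuffle(strata)                  # independent permutation per dimension
        for i in range(n):
            # Random position inside stratum [s/n, (s+1)/n).
            samples[i][k] = (strata[i] + rng.random()) / n
    return samples


rng = random.Random(42)
design = latin_hypercube(10, 3, rng)
for point in design[:3]:
    print([round(v, 3) for v in point])
```

Each coordinate axis is covered evenly by construction, which is the property that makes such designs a reasonable stand-in for uniformly distributed samples.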

Samples with Nonuniform Distribution.
In practical applications, the samples often do not follow a uniform distribution, which affects the construction of the surrogate model. A typical example is the sequential surrogate model [25]: as points are added, the distribution of the samples changes continuously and the samples become clustered. The performance of a surrogate model with nonuniformly distributed samples is therefore also important. At present there are only a few studies on surrogate models with nonuniformly distributed samples [26]. In this section, the normal distribution is used to generate nonuniformly distributed samples, and the above three surrogate models are tested with these points. The mean and standard deviation of the normal distribution for each test function are listed in Table 3. Random normal sampling is repeated 30 times to test the three methods, and the average values of $R^2$ and RMSE are used for the comparison. Figure 3 and Table 4 compare the generalization ability for the different test functions with nonuniform distribution sampling.
For the test cases with nonuniform distribution sampling, the fitting precision of all methods decreases remarkably, especially for large-scale problems. CON-Kriging performs very poorly, even for small-scale problems. MLE-Kriging is better than CON-Kriging. SRM-Kriging still has better fitting precision than the others. Although all three methods fail for large-scale problems, SRM-Kriging still has the smallest RMSE.
Figure 4 shows the fitting results of Fun1 and Fun2 with nonuniform distribution sampling. All fitting curves for Fun1 and Fun2 are distorted. For Fun1, the fitting curves generated by CON-Kriging and MLE-Kriging cannot even follow the basic trend; SRM-Kriging is clearly better than the others and follows the basic trend except in the left region, where there are few samples. This indicates that generalization ability is more suitable for evaluating the overall performance of a surrogate model. For Fun2, all three methods are distorted in the right region, where there are few samples, but the fitting performance of SRM-Kriging and MLE-Kriging is much better than that of CON-Kriging, which shows the importance of kernel parameter optimization.
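The sampling protocol of this section can be sketched as follows. The mean, standard deviation, bounds, and the rule for handling draws outside the design space are assumptions for illustration; the actual values are those of Table 3, and the text does not say how out-of-range draws were treated. The names are hypothetical.

```python
import random


def normal_samples(n, dim, mean, std, lower, upper, rng):
    """Draw n points with normally distributed coordinates inside [lower, upper].

    Draws outside the bounds are rejected and redrawn (an assumption).
    """
    samples = []
    for _ in range(n):
        point = []
        for _ in range(dim):
            value = rng.gauss(mean, std)
            while not lower <= value <= upper:
                value = rng.gauss(mean, std)
            point.append(value)
        samples.append(point)
    return samples


def average_over_runs(metric, runs, rng):
    """Average a metric over repeated random sampling, as done with 30 repetitions."""
    return sum(metric(rng) for _ in range(runs)) / runs


rng = random.Random(7)
# Hypothetical settings: mean 0.5 and standard deviation 0.15 on the unit interval.
clustered = normal_samples(300, 2, 0.5, 0.15, 0.0, 1.0, rng)

# Example metric: the mean of the first coordinate of a fresh 20-point sample.
mean_first = average_over_runs(
    lambda r: sum(p[0] for p in normal_samples(20, 2, 0.5, 0.15, 0.0, 1.0, r)) / 20,
    30, rng)
print(mean_first)
```

In the experiments the metric would be the R² or RMSE of a surrogate built on each random sample; the placeholder above only shows the averaging over repeated draws.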

Conclusion
As surrogate models become more widely used, more attention is being paid to improving their fitting precision. This paper studies the assessment criteria of surrogate models and discusses the importance of generalization ability. On this basis, a kernel function optimization method is proposed to improve the fitting precision. Standard benchmarks are tested to verify the effectiveness of the improved method with uniformly and nonuniformly distributed sample points. The conclusions are as follows.
(1) The SRM-Kriging method based on anisotropic kernel function optimization is proposed; it replaces the coefficient of each sample with a coefficient for each component of the samples, which reduces the computational cost.
(2) A comparison is carried out among CON-Kriging, SRM-Kriging, and MLE-Kriging. The results show that kernel function optimization is very important for improving the fitting precision of a surrogate model. The SRM principle provides a more effective evaluation of the generalization ability. SRM-Kriging performs better than the others, especially for problems with nonuniform distribution sampling.
(3) The Kriging process shows that Kriging can be regarded as a special case of SVR with zero empirical risk. By setting the insensitive loss function of SVR to zero, the SVR optimization problem with regularization parameters and kernel parameters is transformed into a simple optimization problem over the kernel parameters alone, together with a deterministic problem for evaluating the coefficients. This reduces the design dimension and simplifies the implementation.
The distribution of samples may become more and more complicated in practice. This paper also studies the fitting precision of the Kriging model with uniform and nonuniform sampling. The results show that Kriging with the SRM principle has better fitting performance, but its precision is still low for high-dimensional problems. Further research may perform a full analysis of surrogate models under different nonuniform sampling distributions in order to find a more suitable selection mechanism for the kernel function and a more efficient kernel parameter optimization method.

Figure 2: Fitting results of Fun1 and Fun2 with uniform distribution sampling.

Figure 3: Comparison of $R^2$ between different scales of test functions with nonuniform distribution sampling.

Figure 4: Fitting results of Fun1 and Fun2 with nonuniform distribution sampling.

Table 1: Sampling information of the test functions.

Table 2: Comparison of the generalization ability between different test functions with uniform distribution sampling.

Table 3: Mean and standard deviation of the normal distribution of test functions.

Table 4: Comparison of the generalization ability between different test functions with nonuniform distribution sampling.