A Design Optimization Method with Sparse Scattered Data and Evolutionary Computation

Engineering design can be regarded as an iterative optimization process. This process is difficult because of two main problems: the first is that computer-aided engineering (CAE) is time-consuming in terms of evaluating design solutions, while the second is the high dimensionality of design solutions. In the research community, surrogate models have been proposed to deal with the first problem, while evolutionary algorithms are adopted for the second. In this work, we develop a new method that requires only sparse scattered data, which is very common in many practical scenarios. The surrogate model also assigns a penalty factor to each predicted value, and this penalty factor can be used as one of the targets of the evolutionary algorithm to balance global exploration and local exploitation. We also adopt a new evolutionary strategy capable of searching high-dimensional spaces. Three groups of experiments are conducted to validate the proposed methods. The results show that the surrogate model can predict performance together with the corresponding penalty factor, that the evolutionary strategy outperforms other evolutionary strategies in searching high-dimensional spaces, and that the whole method can generate new design solutions close to the known ones. These results indicate that the method can be applied in practical scenarios, especially those where only sparse scattered data is available.


Introduction
Engineering design is a complex process involving different design activities, and it can generally be regarded as the iteration of design and validation. Currently, with the help of well-developed computer-aided design (CAD) and computer-aided engineering (CAE) systems, the design model can be parameterized and the validation process can be simulated computationally. Therefore, engineering design can be regarded as an optimization problem [1], and different aspects of a product can be optimized, such as shape [2] and reliability [3].
x* = arg max_{x ∈ Ω} f(x),  (1)

where x = [x_1, x_2, ⋯, x_n] is a parameter vector representing a design solution, Ω is the feasible solution space of x, x* is the optimal solution, and f(x) is the evaluation function of design solutions. CAE has played the role of f(x) successfully in engineering design for many years. However, two characteristics of such a function make the optimization extremely complex and difficult. The first is that derivatives are unavailable, which means gradient-descent methods are invalid [4]; the second is that evaluation is time-consuming [5], requiring extensive computational resources. In addition to the above difficulties, the high dimensionality of the design space is another problem making the optimization difficult. The response surface model (RSM) [6] is a method adopted to deal with the first two difficulties; it is a statistical method that explores the relationships between the design variables x and one or more response variables f(x) [7]. The RSM is also called a surrogate model, metamodel, or emulator in different scenarios [8], and the underlying idea is to replace the computationally expensive model with a computationally cheaper one [8, 9]. In this work, we use the term "surrogate model" for this idea.
Many technical methods can be used to build a surrogate model, such as linear or nonlinear interpolation, Kriging [10], neural networks, radial basis functions [11], and Gaussian processes [12]. Although these methods have been successfully applied to many engineering problems, some remaining issues hinder the application of surrogate models in engineering design. In most scenarios where surrogate models are applied successfully, either the available data is sufficient or the data collection is deliberately designed, yielding "grid" or "orthogonal" data. However, there are many scenarios where the data is "sparse," meaning the amount of available data is small, and "scattered," meaning the data is located randomly in the feasible design space.
The main contributions of this paper are a surrogate model method and a high-dimensional design space exploration method. We first build a surrogate model based on "sparse scattered" data and then use a new evolutionary strategy to explore and exploit the high-dimensional space. Combining the two enables rapid generation of design solutions from sparse scattered data. The rest of the paper is structured as follows. Section 2 provides related works. Section 3 explains the technical details of both the surrogate model construction and the evolutionary strategy. Section 4 conducts several groups of experiments to validate the proposed method. Section 5 summarizes this research and identifies possible future works.

Related Works
In this section, some existing key techniques related to this work will be explained, including the surrogate model and evolutionary strategies.
2.1. Surrogate Model. There are many methods for building a surrogate model, and they can be categorized based on different criteria. From the perspective of data characteristics, there are methods for both grid data and scattered data. If the data is collected through Design of Experiments (DOE) [13], we get regular grid data, and the surrogate model is easy to build. For low-dimensional problems, Delaunay triangulation [14], natural neighbor interpolation [15], spline interpolation [16], and so on can be used to build the surrogate model, while for high-dimensional problems, simple nearest-neighbor interpolation [17], Kriging [10], and so on can be used. Zhou et al. presented the nearest-neighbor value (NNV) interpolation algorithm for the improved novel enhanced quantum representation of digital images (INEQR); experiments show that the proposed interpolation method has higher performance in high-resolution image recognition [18]. Qian et al. proposed a general sequential constraint updating approach based on the confidence intervals from the Kriging surrogate model (SCU-CI); results illustrate that the SCU-CI approach can generally ensure the feasibility of the optimal solution under a reasonable computational cost [19]. If the data is collected randomly, we can only get irregular grid data, also called scattered data. In this situation, the surrogate model becomes difficult to construct. There are already some methods to deal with this problem, such as triangulated irregular networks, radial basis functions [20, 21], and Kriging [10]. de Oliveira et al. proposed a new approach for occlusion detection, the surface-gradient-based method (SGBM), applied to a triangulated irregular network (TIN) representation; experimental results demonstrated the feasibility of the SGBM for occlusion detection in true orthophoto generation [22]. She et al. present a novel battery aging assessment method based on incremental capacity analysis (ICA) and a radial basis function neural network (RBFNN) model [23].
From the perspective of key techniques, the methods can also be categorized as interpolation and fitting. The former attempts to build a hypersurface that passes exactly through the existing data, while the latter treats the existing data as noise-contained and attempts to find a hypersurface that minimizes the errors; this hypersurface need not pass through the existing data. From the perspective of complexity, the methods can be categorized as linear or nonlinear models.
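In one dimension the distinction is easy to see: an interpolant reproduces every sample exactly, while a least-squares fit only minimizes overall error. A minimal Python sketch with invented sample data:

```python
import numpy as np

# Five noisy samples of an underlying linear trend (invented data for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.2, 2.8, 4.1])

# Interpolation: the piecewise-linear interpolant reproduces every sample exactly
y_interp = np.interp(x, x, y)          # evaluate the interpolant at the samples
assert np.allclose(y_interp, y)        # passes through all existing data

# Fitting: a degree-1 least-squares fit minimizes overall error and generally
# does not pass through the samples
coeffs = np.polyfit(x, y, deg=1)
y_fit = np.polyval(coeffs, x)
residual = np.max(np.abs(y_fit - y))   # nonzero: the fit misses individual points
```

The same trade-off carries over to hypersurfaces in higher dimensions.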
Although surrogate models have been used widely in many domains, successful applications always rely on either grid data or sufficient data. However, in many practical scenarios, the existing data is neither grid data nor sufficient. The data is sparse and scattered, and methods suited to such situations should be developed.

2.2. Evolutionary Computation.
Evolutionary computation is a commonly used method to search for optimal solutions in a design space. Most evolutionary computation methods follow the same framework, but they improve the algorithm by developing different evolutionary strategies, including crossover, mutation, and selection.
Many evolutionary strategies have been developed to improve the algorithm from two perspectives. The first perspective develops new crossover and mutation methods to generate new offspring, such as differential evolution- (DE-) based methods [24, 25], immune-based methods [26], particle swarm optimization- (PSO-) based methods [27], and probabilistic model-based methods [28]. The second perspective develops new selection methods, such as decomposition-based methods [29], preference-based methods [30], indicator-based methods [31], and hybrid methods [32]. Although these improvements contribute to solving many design problems, there remains a need to develop evolutionary strategies for optimizing high-dimensional problems.

Methodology
In this section, we explain the proposed method in detail. Generally, the method has two main parts: surrogate model construction and design solution searching. In the first part, a surrogate model, which receives design solutions as input and returns performances as output, is constructed based on a two-stage interpolation process. In the second part, we adopt a new mutation strategy to search the high-dimensional space. The two parts are detailed in the following two sections, respectively.

3.1. Surrogate Model Construction.
We first define the data structure used to build a surrogate model and then explain the underlying ideas of the two-stage interpolation process, and finally, we explain the technical detail of this method. The mathematical symbols used in this work are summarized in Table 1.
Typically, there are two kinds of data in many engineering design problems: the design solution and the corresponding performance. A design solution can be represented by a vector x = [x_1, x_2, ⋯, x_d], where x_1, x_2, ⋯, x_d are parameters that determine the design solution, and d is the dimensionality of the design solution x.
A performance is a measurement of a design solution x, and a design solution commonly has several performances for evaluation. Besides, the performance can differ under different working conditions c = [c_1, c_2, ⋯, c_l], in which c_1, c_2, ⋯, c_l are parameters that determine the working conditions; for example, temperature and velocity can jointly define a two-dimensional (l = 2) working condition. Therefore, the performance can be represented by a k × m matrix, where k is the number of kinds of performance and m is the total number of working conditions:

p = [ p_11  p_12  ⋯  p_1m
      p_21  p_22  ⋯  p_2m
      ⋮     ⋮         ⋮
      p_k1  p_k2  ⋯  p_km ].

Generally, we have an n × k × m array as the database D for building the surrogate model, where n is the number of known design solutions. As shown in Figure 1, we know the k kinds of performance under m different working conditions of n known design solutions. It is noteworthy that the n known design solutions are scattered in the high-dimensional space.
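Under our reading of this layout, the database can be held as an n × k × m array alongside an n × d design matrix and an m × l condition matrix; the array names below are ours, and the random values are placeholders for real data:

```python
import numpy as np

n, d = 20, 4        # known design solutions and design parameters (as in Section 4)
k, m, l = 4, 28, 2  # performance kinds, working conditions, condition parameters

rng = np.random.default_rng(0)
X = rng.random((n, d))     # design solutions: one row per known solution
C = rng.random((m, l))     # working conditions: one row per condition
D = rng.random((n, k, m))  # database: D[i, j, t] is the j-th performance of
                           # solution i under working condition t

p = D[3, :, 7]             # all k performances of solution 3 under condition 7
```

Indexing D along its first axis recovers the k × m performance matrix of one design solution.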

3.1.1. Underlying Considerations and Assumptions. Based on the above data, the goal is to construct a surrogate model that receives a design solution x′ and a group of working conditions c′ as input and outputs the corresponding performances of x′ under c′. It is noteworthy that the design solutions x′ and working conditions c′ are generally not contained in the database D.

We can regard the surrogate model as a function p = f(x, c). This function maps values from the (d + l)-dimensional space R^(d+l) to the k-dimensional space R^k, and such a high-dimensional mapping requires more data to train. Since we only have sparse scattered data, two critical assumptions are made before learning the surrogate model. Based on Assumption 1, we do not need to learn a dedicated mapping (from working condition to performance) for each design solution, which would be necessary if the surrogate model were built directly as p = f(x, c). Therefore, we can divide the surrogate model into two stages: the first stage maps a design solution x′ to its performances p_c under the existing working conditions c by p_c = g(x′), while the second stage maps c′ to the performance under an unknown working condition by p = h(c′). By this assumption, some dependences among design solutions, working conditions, and performances are ignored, and this lowers the data requirement for training the surrogate model. Based on Assumption 2, the surrogate model assigns a penalty factor to each predicted performance. If the new design solution x′ is far from the known design solutions, the penalty factor should be high, and vice versa. This penalty factor is used during the design optimization process; it helps shrink the search space and keeps new design solutions close to regions where design solutions are known.

3.1.2. Two-Stage Interpolation. The surrogate model can be implemented by two-stage interpolation. In both stages, inverse distance weighting (IDW) [33] is adopted, which computes an unknown value by

ŷ = (Σ_{i=1}^{Q} w_i y_i) / (Σ_{i=1}^{Q} w_i),  (4)

where Q is the total number of data points used to predict the new value ŷ and w_i is the weight of the i-th data point. In the IDW method, the weight is calculated by

w_i = 1 / ‖x′ − x_i‖^P,  (5)

where x′ is the data point whose performance is unknown; x_i is the i-th data point; ‖·‖ calculates the distance between two data points in the high-dimensional space; and P is the order of the distance, a metaparameter that controls the weights.

Based on this, given an unknown design solution x′ and an unknown working condition c′, we can first interpolate to get the performances of x′ under the known working conditions c and then interpolate again to get the performances of x′ under the unknown working conditions c′. We also need to assign penalty factors to the predicted performances of unknown design solutions and working conditions. In this work, we simply use the Euclidean distance as the measurement of the penalty factors, where PF_1 is the penalty factor of the first stage, obtained by (7), and PF_2 is the penalty factor of the second stage, obtained by (8):

PF_1 = (1/Q) Σ_{i=1}^{Q} ‖x′ − x_i‖,  (7)

where Q is the total number of design solutions used to predict the performance, x′ is the design solution whose performance is unknown, and x_i is the i-th design solution used in the prediction;

PF_2 = (1/R) Σ_{j=1}^{R} ‖c′ − c_j‖,  (8)

where R is the total number of known working conditions used to predict the performance under unknown working conditions, c′ is the working condition whose performance is unknown, and c_j is the j-th working condition used in the prediction.
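The two-stage prediction can be sketched as below. The IDW steps follow (4) and (5); the averaged-distance form of the penalty factors is our reading of (7) and (8), and the function names are ours:

```python
import numpy as np

def idw_predict(query, points, values, P=1):
    """IDW interpolation (equations (4), (5)): weights w_i = 1/||query - x_i||^P,
    prediction = sum(w_i * y_i) / sum(w_i)."""
    dists = np.maximum(np.linalg.norm(points - query, axis=1), 1e-12)  # avoid /0
    w = 1.0 / dists ** P
    return (w[:, None] * values).sum(axis=0) / w.sum()

def two_stage_predict(x_new, c_new, X, C, D, P=1):
    """Stage 1: interpolate over the n known designs for every known condition.
    Stage 2: interpolate over the m known conditions. Penalty factors are the
    mean Euclidean distances to the data used (our assumed form of (7), (8))."""
    n, k, m = D.shape
    p_c = idw_predict(x_new, X, D.reshape(n, k * m), P).reshape(k, m)  # stage 1
    p = idw_predict(c_new, C, p_c.T, P)                                # stage 2
    pf1 = np.linalg.norm(X - x_new, axis=1).mean()  # penalty of stage 1
    pf2 = np.linalg.norm(C - c_new, axis=1).mean()  # penalty of stage 2
    return p, pf1, pf2
```

At a known design solution and known condition, the prediction collapses to the stored value because the corresponding weight dominates the sum.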

3.2. Design Solution Searching. Based on the surrogate model construction method in Section 3, a surrogate model can be obtained that receives a design solution and working conditions as input and predicts performances and their corresponding penalty factors as output. Based on this surrogate model, we adopt an evolutionary algorithm to explore the design space. The penalty factor is taken as one of the fitness functions during the evolutionary process. Basically, most evolutionary algorithms follow the same framework. As shown in Figure 2, the critical difference is the operation of "generating offspring population," which is marked by a bold rectangle. The traditional crossover and mutation operations are replaced by a simple ergodic system; the detail can be found in Algorithm 1. As shown in the algorithm, the ergodic system adopted in this work is a logistic map, which is used to generate a number sequence for producing new offspring. The ergodic evolution method has the advantages of ergodicity and regularity and performs better in dealing with high-dimensional design spaces.
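Since the text does not reproduce Algorithm 1, the sketch below shows one plausible offspring-generation step driven by a logistic map; the way the chaotic variable perturbs a parent (and the `scale` parameter) are our assumptions:

```python
import numpy as np

def logistic_map(z, mu=4.0):
    """One step of the logistic map z <- mu * z * (1 - z); with mu = 4 the
    sequence is chaotic and ergodic on (0, 1)."""
    return mu * z * (1.0 - z)

def ergodic_offspring(parents, z, lower, upper, scale=0.1):
    """Generate one offspring per parent by perturbing each gene with the
    logistic-map sequence instead of a random mutation. `scale` sets the
    perturbation magnitude relative to the variable range (our assumption)."""
    offspring = np.empty_like(parents)
    for i in range(parents.shape[0]):
        for j in range(parents.shape[1]):
            z = logistic_map(z)  # advance the chaotic driver
            step = scale * (upper[j] - lower[j]) * (2.0 * z - 1.0)
            offspring[i, j] = np.clip(parents[i, j] + step, lower[j], upper[j])
    return offspring, z  # return z so the sequence continues across generations
```

The seed z should avoid the map's fixed points (0 and 0.75) so the sequence actually wanders; returning z lets successive generations reuse one uninterrupted chaotic trajectory.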

Experiment
In this section, we conduct three groups of experiments to validate the proposed method. The first group tests whether the surrogate model can predict performance; the second group tests the ergodic evolution strategy in terms of searching high-dimensional space; and the third group validates the whole method. During the experiments, we use a small dataset provided by an aerodynamics research institute. This dataset includes 20 known design solutions with 4 design parameters in each design solution and 4 kinds of performance under 28 working conditions. Since the 4-dimensional design solutions cannot be shown directly in a figure, we process the 20 design solutions by Principal Component Analysis (PCA) and plot the first two components. From Figure 3, we can see that the data is sparsely scattered in the design space. We regard data of this kind, with few training samples whose main parameters are sparsely distributed, as sparse scattered data.
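The projection used for Figure 3 can be reproduced with a standard SVD-based PCA; the institute's dataset is not public, so random stand-in data is used here:

```python
import numpy as np

def pca_first_two(X):
    """Project the rows of X onto the first two principal components via SVD."""
    Xc = X - X.mean(axis=0)                  # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                     # scores on the top two components

rng = np.random.default_rng(1)
X = rng.random((20, 4))   # stand-in for the 20 x 4 design matrix (data not public)
scores = pca_first_two(X)  # 20 points to plot in the plane of Figure 3
```

The rows of `scores` are the 2-D coordinates plotted in the figure; by construction the first component carries at least as much variance as the second.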

4.1. Experiment on the Surrogate Model.
Based on the surrogate model construction method in Section 3, a surrogate model can be built from sparse scattered data. This section conducts several experiments to test the model and to find the optimal metaparameters of the surrogate model, such as Q in (4) and P in (5).
Since the proposed method involves two stages, we test the two stages separately. For the first stage, we conduct experiments based on 10-fold cross-validation. This method splits the 20 data points into 10 groups, using 9 groups as training data and the remaining group as test data in each experiment. In each experiment, we calculate an averaged error by (9), and after 10 experiments, the error of the first stage is averaged again.
where P̂_ij is the predicted value of the i-th performance under the j-th working condition, while P_ij is the real value of the i-th performance under the j-th working condition. Table 2 shows the results for different configurations of Q and P, and we find that the surrogate model has the smallest error when P = 1 and Q = 5. For the second stage, we also conduct experiments based on 10-fold cross-validation. In each experiment, we calculate an averaged error by (9), and after 10 experiments, the error of the second stage is averaged again. Table 3 shows the results for different configurations of R and P, and we find that the surrogate model has the smallest error when P = 1 and R = 2.
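The 10-fold procedure can be sketched as below; the mean-absolute form of the error and the nearest-neighbour stand-in predictor are our assumptions, not the paper's (9):

```python
import numpy as np

def ten_fold_error(X, Y, predict, n_folds=10):
    """Average prediction error over folds. `predict(X_train, Y_train, x)` returns
    a prediction for x; each fold's mean absolute error is averaged over folds."""
    idx = np.arange(len(X))
    errors = []
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)          # 9 groups train, 1 group tests
        preds = np.array([predict(X[train], Y[train], X[t]) for t in fold])
        errors.append(np.mean(np.abs(preds - Y[fold])))
    return float(np.mean(errors))

def nn_predict(X_train, Y_train, x):
    """Trivial nearest-neighbour stand-in for the surrogate's first stage."""
    return Y_train[np.argmin(np.linalg.norm(X_train - x, axis=1))]

rng = np.random.default_rng(3)
X = rng.random((20, 4))
Y = X.sum(axis=1)        # synthetic scalar performance for illustration
err = ten_fold_error(X, Y, nn_predict)
```

Sweeping Q and P would wrap this loop over the metaparameter grid, as Tables 2 and 3 report.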

4.2. Experiment on Ergodic Evolution.
In this section, we validate the ergodic evolution algorithm in terms of the capability of searching high-dimensional space. In this experiment, 16 benchmark functions are adopted as the test problems, including SCH, FON, POL, KUR, ZDT1, ZDT2, ZDT3, ZDT4, ZDT6, DTLZ1, DTLZ2, DTLZ3, DTLZ4, DTLZ5, DTLZ6, and DTLZ7. These benchmark functions are well known, and the detail can be found in [34].
We compare the performance of ergodic evolution with NSGA-II, which is a well-known evolutionary algorithm proposed in [35], and random evolution (RE), which is the same as Algorithm 1 except that the ergodic system is replaced by a general random generator. The crossover rate and mutation rate of NSGA-II are set to 0.9 and 0.1, respectively.
For each benchmark function, we run the three algorithms 30 times. In each run, the algorithms evolve 1000 generations with 100 individuals per generation. After all experimental results are obtained, we conduct a nondominance analysis: first, all individuals in the same generation of the three algorithms are combined, and then we count the number of individuals in the nondominated frontier. Figure 4 shows the averaged result (30 runs) of the top 100 generations. From the figure, we can clearly see that the proposed method is superior to NSGA-II and RE for most of the benchmark functions except DTLZ2, DTLZ4, and DTLZ5.
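The per-generation nondominance count can be computed as below, using the standard Pareto-dominance definition for minimization:

```python
import numpy as np

def dominates(a, b):
    """a Pareto-dominates b (minimization): no worse in every objective and
    strictly better in at least one."""
    return bool(np.all(a <= b) and np.any(a < b))

def nondominated_count(points):
    """Number of rows of `points` (objective vectors) on the nondominated frontier."""
    return sum(
        not any(dominates(q, p) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    )
```

Applying `nondominated_count` to the combined population of the three algorithms, then attributing each frontier member back to its algorithm, yields the per-algorithm counts plotted in Figure 4.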
In addition, we plotted the Pareto frontiers of the 5th, 10th, 20th, 50th, 100th, 200th, 500th, and 1000th generations of the three algorithms. Considering the length limitation, we only show the figures for ZDT3 (Figure 5) and DTLZ2 (Figure 6) in this paper.

4.3. Experiment on the Whole Method.
In this section, we validate the whole proposed method. For the ergodic evolution, two targets are adopted. The first is a function of the 4 performances provided by the aerodynamics institute, while the second is calculated by equation (5). The surrogate model is constructed with P = 1, Q = 5, and R = 2.
We run the whole algorithm 20 times, and Figure 7 shows the evolutionary process of the two targets. For each run, the algorithm evolves 200 generations, and there are 100 individuals in each generation. The figures include 20 lines indicating the evolutionary process of the 20 runs, and each point in the line indicates the minimum value of targets in one generation. As we can see, the algorithm converges to minimum targets almost within 100 generations.
We randomly select 4 runs and list the 20 design solutions on the nondominated Pareto front in Table 4. We plot the generated design solutions and the known design solutions in a single figure, both processed by PCA. From Figure 8, we can find that most of the generated design solutions are near the known design solutions, which shows that the proposed method can achieve rapid exploration of high-dimensional space. This result is reasonable, since the second target of the ergodic evolution restricts the algorithm to exploiting new design solutions in the space near the known design solutions.
From the extensive experimental results above, we find the following. (1) The adoption of basic assumptions or prior knowledge can relax the data requirement for training the surrogate model. The prior knowledge here represents designers' views on the design problem. In this work, we adopt two basic assumptions, the whole model is divided into two parts, and some unimportant relationships are ignored by the surrogate model. Therefore, we believe that transferring basic assumptions or prior knowledge into a computational form and integrating them with the surrogate model is a feasible way to build a surrogate model with only sparse and scattered data. (2) The simple IDW method can train the surrogate model. However, IDW may not always be feasible in different scenarios, and advanced methods such as Kriging and RBF networks should be adopted and validated. (3) One of the merits of the proposed method is that the search space expands as more known design solutions are added. This means that the method fits both sparse scattered data and relatively larger datasets.

Conclusion
To train a surrogate model with only sparse and scattered data and to find new design solutions based on it, this work proposed a two-stage interpolation-based surrogate model and adopted ergodic evolution to explore the high-dimensional design space. This paper uses PCA to reduce the dimensionality of the data and to demonstrate its sparse and scattered character. Then, by combining the two-stage interpolation and the ergodic evolution method, new design solutions are generated from the sparse scattered data. Three groups of experiments show that the surrogate model can predict performances and that the ergodic evolution is efficient in exploring high-dimensional space.
Although this work proposes a feasible method to deal with the problems, some remaining problems require extensive research works. In the future, we will enhance the method from the following aspects.
(i) This work only adopts basic assumptions to relax the data requirement; the requirement can be further relaxed by incorporating prior knowledge, such as experts' experience and physical laws. The main challenge will be the technique of embedding prior knowledge into surrogate models.

(ii) In this work, the penalty factor is important, and the simple Euclidean distance is used. In the future, some nonlinear function of the Euclidean distance could be used to measure the penalty factor. The difficulty will be how to predefine or learn such a nonlinear function.

Data Availability
No data were used to support this study.