Parameter Control Framework for Multiobjective Evolutionary Computation Based on Deep Reinforcement Learning

International Journal of Intelligent Systems


Introduction
Multiobjective optimization problems (MOPs) are common in practical applications, such as margin trading [1], energy system design [2], scheduling [3], and water resources management [4]. There are two significant characteristics of MOPs. Instead of a single optimization objective, where only one goal is pursued, two or more optimization goals must be considered. Moreover, these objectives cannot all be optimized at the same time: because of the intrinsic conflicts between targets, improving one goal comes at the cost of degrading the others. To capture these characteristics and formulate such practical requirements mathematically, MOPs are modeled as follows:

\[
\begin{aligned}
\min\ & F(x) = \bigl(f_1(x), \ldots, f_n(x)\bigr), \quad x \in \omega, \\
\text{s.t.}\ & g_i(x) \le 0, \quad i = 1, \ldots, q, \\
& h_j(x) = 0, \quad j = q + 1, \ldots, p,
\end{aligned}
\tag{1}
\]

where F(x) is the objective function with n mutually conflicting objectives, and g_i(x) and h_j(x) are the constraints on solution x. Different from the single optimum of single-objective optimization problems (SOPs), an MOP admits a set of optimal solutions (Pareto optimal solutions). Therefore, how to adjust parameters to find more Pareto optimal solutions becomes one of the key problems.
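To make formulation (1) concrete, the sketch below evaluates a made-up biobjective problem (not from the paper) and checks Pareto dominance between solutions; the two objectives pull the decision vector in opposite directions, so no single solution minimizes both.

```python
# Illustrative sketch (toy problem, not the paper's benchmarks): evaluating a
# biobjective MOP and checking Pareto dominance between objective vectors.

def evaluate(x):
    """Toy MOP with two conflicting objectives, both minimized."""
    f1 = sum(v * v for v in x)            # pulls every component toward 0
    f2 = sum((v - 2.0) ** 2 for v in x)   # pulls every component toward 2
    return (f1, f2)

def dominates(fa, fb):
    """True if objective vector fa Pareto-dominates fb (minimization)."""
    return all(a <= b for a, b in zip(fa, fb)) and any(a < b for a, b in zip(fa, fb))

fa = evaluate([0.0, 0.0])   # (0.0, 8.0)
fb = evaluate([1.0, 1.0])   # (2.0, 2.0)
# Neither vector dominates the other: both are Pareto optimal trade-offs.
print(dominates(fa, fb), dominates(fb, fa))  # False False
```

A set of mutually nondominating solutions like these is exactly the Pareto set an MOEA tries to approximate.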
Most single-objective and multiobjective optimization algorithms have problem-specific parameters. Currently, there are mainly the following approaches to parameter tuning: using general-purpose tuning algorithms, such as parameter tuning with a chess rating system (CRS-Tuning) [14], F-Race [15], and REVAC [16], or incorporating different strategies during the evolutionary iteration process to adapt the parameters to the problem or to the iteration process. For an MOP, uncertainty can occur in the objective function, decision variables, and function parameters [17]. Over the years, there have been several attempts to improve evolutionary algorithms by making parameters adaptive to the problem or to the iterative process. Basically, they can be organized into three categories: rule-based, iteration memory-based, and learned knowledge-based.

The first is the rule-based parameter control method. This kind of method specifies a fixed way of changing certain parameters in an EA; for example, parameters are designed to rise or decline with iterations, as in [18, 19]. The second is the iteration memory-based parameter control method. This kind of method records information such as the success rate of the policy, as in [20, 21]. These two methods control the parameters within a single run, and the retained information and its subsequent influence also stay within that single run. The third is the learned knowledge-based parameter control method. This kind of method maintains information learned from different problems to formulate a decision model; every time a new problem is solved, the model is updated. Currently, reinforcement learning (RL) and deep reinforcement learning (DRL) are used to store such experience, and both belong to learning-based artificial intelligence methods. This type of method retains past information and learns from different problems; after training, it can select appropriate parameters for specific problems under different circumstances. In recent years, this kind of method has been discussed in [22-24]. Most references considered the single-objective problem, such as [22, 23], while few extended the idea to MOEAs [24]. The combination of the learned knowledge-based parameter control method and MOEAs is still in its infancy, and a comprehensive framework is an urgent requirement.
In this article, we focus on developing an MOEA parameter control framework based on DRL. This learned knowledge-based parameter control framework can be applied to different MOEAs and improves the algorithms' efficiency and robustness on different optimization problems. The contributions of this article can be concluded as follows:

The remainder of this article is arranged as follows. Section 2 presents the related work. Then, we introduce RL as the preliminaries for parameterized knowledge representation in Section 3. Section 4 proposes the parameter control framework. Section 5 illustrates the efficiency of the proposed framework through comparisons of the reinforced algorithms against the corresponding original MOEAs. Section 6 summarizes this article.

Related Work
2.1. Rule-Based Parameter Control Method. This category can be further divided into two subcategories. The first involves parameters changing with iterations, which is a common dynamic parameter-changing strategy. For instance, the authors of [18] proposed a framework that adjusts the parameters in MOPSO for individual particles based on knowledge extracted from the belief space. The authors of [25] proposed time-variant MOPSO, where the acceleration coefficients and inertia weights vary with iterations. The second subcategory involves updates by a fixed formula, emphasizing inherent rules. The authors of [26] proposed MOEA/D-AWA with an adaptive weight vector adjustment strategy.
The shared characteristic of these two subcategories is that the parameter control scheme is formulated before the iteration process and does not interact with information generated during it. The advantage is that the randomness of the parameters is increased, and different parameter values are assigned at different iteration stages to better adapt to the problem and the iterative process. The disadvantage is that this kind of parameter control requires continuous trial and error to find a suitable control strategy. At the same time, the specific strategy also needs to be adjusted for each different problem to meet its requirements.

2.2. Iteration Memory-Based Parameter Control Method.
This kind of method stores information from a single run and uses it to adjust subsequent parameters. The commonly used first-order reference indicators for MOEAs are changes in the dominance relationship.
In [27], a binary space partitioning tree structure was selected to store the evaluated solutions' positions and fitness values with a fast fitness function. In this algorithm, the variation operator is parameter-free and adapted according to the current state. The author of [28] took feedback from the current state to modify the parameters. Moreover, the authors of [29] dynamically adjusted parameters based on average feedback. The author of [30] recorded the parameter values of successful crossover and mutation operators and updated the next generation's parameter values by averaging the feedback of the successful ones. In addition, an adaptive velocity strategy based on the evolutionary state was proposed for PSO in [31], which realizes adaptive control of traffic signals.
The advantage of this type of method is that it can make full use of the information produced during the iteration process and accordingly make real-time, specific adjustments to the parameters of the next generation or the next individual. However, since the information is extracted from experience within a single run, it cannot be saved or transferred to new scenarios. For different problems, the parameters have to be rearranged to adapt to different characteristics.

2.3. Transfer Learning-Based Parameter Control Method.
The transfer learning-based parameter control method stores and transfers knowledge learned from different problems. These methods take full advantage of reinforcement learning or deep reinforcement learning and learn from past information.
The authors of [24] applied Q-learning to MOPSO to optimize the primary control parameters, including the cognitive acceleration coefficient, inertia weight, and social acceleration coefficient. Similarly, the authors of [32] also combined Q-learning with MOPSO to realize parameter control, using the distance between the previous best position and the best position of the current population as the state for parameter selection. Based on NSGA-II, the authors of [33] utilized Q-learning to adjust the crossover and mutation probabilities with population diversity, evolutionary iteration number, and average fitness, thereby enhancing population diversity. The authors of [34] proposed a general framework of parameter control with reinforcement learning for single-objective evolutionary computation; this framework designs parameter sets in advance for each evolutionary algorithm, and Q-learning helps choose one parameter set, based on the state, in each iteration. The authors of [35] combined DRL with an MOEA for solving constrained multiobjective optimization problems, taking both the population's convergence and its diversity into account in the inputs to DRL.
The advantages of these methods can be summarized as follows. First, they can store and summarize past experience of adjusting parameters in different states and transfer the summarized experience to different problems. Second, for a new optimization problem, the parameters can be updated directly according to the current iteration information and past experience, improving the efficiency and accuracy of the parameter selection process.
Remark 1. The coupling between the state and these three classes of methods increases incrementally: the methods range from completely random to fully state-based. Meanwhile, the third class, the transfer learning-based parameter control method, has the ability to process high-dimensional evolutionary states and is more scalable and transferable with respect to problem characteristics.

Preliminaries
Reinforcement learning (RL) is a machine learning method. Unlike supervised or unsupervised learning, RL interacts with and learns from the environment to obtain an optimal policy that maximizes the reward. RL mainly involves the following components: state s, action a, reward r, and transition probability p. Together, these elements form a Markov decision process (MDP) [36], denoted 〈s, a, r, p_a〉. A typical RL algorithm proceeds as follows: (i) The agent performs an action a to interact with the environment. (ii) After the action, the state transforms from s to a new state s′. (iii) Then, the agent receives a reward r_a according to the action a and the reward rule. (iv) From the reward r_a, the agent recognizes whether the action selection has a positive or negative effect. (v) If a positive reward is returned, the agent will perform that action with higher probability; otherwise, the agent will try another action to obtain a better reward under the state s.
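Steps (i)-(v) above can be sketched with tabular Q-learning on a hypothetical two-state, two-action MDP (the environment, reward, and hyperparameters below are made up for illustration; they are not part of the paper's framework):

```python
# Minimal sketch of the agent-environment loop: tabular Q-learning on a
# hypothetical 2-state, 2-action MDP where action 1 leads to the rewarding state.
import random

random.seed(0)
n_states, n_actions = 2, 2
Q = [[0.0] * n_actions for _ in range(n_states)]  # Q-table, all zeros initially
alpha, gamma, eps = 0.5, 0.9, 0.2                  # learning rate, discount, exploration

def step(s, a):
    """Hypothetical environment: action 1 moves to state 1, which pays off."""
    s_next = 1 if a == 1 else 0
    r = 1.0 if s_next == 1 else 0.0
    return s_next, r

s = 0
for _ in range(500):
    # (i) choose an action (epsilon-greedy); (ii)-(iii) act and observe s', r
    if random.random() < eps:
        a = random.randrange(n_actions)
    else:
        a = max(range(n_actions), key=lambda i: Q[s][i])
    s_next, r = step(s, a)
    # (iv)-(v) a positive reward raises Q[s][a], making that action more likely
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
    s = s_next

print(Q[0][1] > Q[0][0])  # the rewarding action ends up preferred in state 0
```

After enough interactions, the Q-table encodes the learned policy: in every state, the greedy action is the one that leads to reward.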

Deep learning (DL) has achieved success in natural language processing (NLP), image classification, and many other fields. The representational power of deep learning relies heavily on multilayer neural networks with neurons [37] as the basic units. The perceptron [38] is the earliest prototype of the neural network and is known as the single-layer neural network (without hidden layers); it can only perform the simplest linear classification tasks. Improvements in computational power and data processing techniques have gradually made deep learning the most popular branch of machine learning in recent years, in both academia and industry. With the birth of some famous network structures, such as the convolutional neural network (CNN) [39], generative adversarial network (GAN) [40], and recurrent neural network (RNN) [41], DL has further expanded its applications in different fields.
Deep reinforcement learning (DRL) combines RL and DL. In DRL, neural networks are embedded in RL and are commonly used to store knowledge about an environment and make a preferred decision based on the current situation. The deep Q-learning network [42, 43] pioneered this kind of algorithm, and it can be applied in many areas [44], such as games [45] and vehicle network design [46]. Later, its variant DDQN [47] was proposed to overcome the drawback of overestimating the Q value (defined as the expected return starting from state s, taking action a, and then following policy π). To reduce complexity and improve training efficiency, the continuous deep Q-learning network with model-based acceleration (CDQN) was proposed in [48], extending DRL from discrete to continuous action spaces. Building on the advantages of CDQN, we propose the learned knowledge-based parameter control framework to realize automatic parameter tuning and improve the efficiency of MOEAs. At the same time, considering the temporal properties of the states in the evolutionary process, and since RNNs can learn and perform complex transformations of data over long time scales, we choose an RNN to handle the state during evolution. Figure 1 shows the graphical representation of an RNN. In the figure, I_t represents the value of the input layer at the t-th generation, S_t is the hidden layer at the t-th generation, and O_t represents the value of the output layer at the t-th generation.
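The reason CDQN handles continuous actions cheaply is its normalized advantage function: Q(s, a) = V(s) + A(s, a) with A(s, a) = -1/2 (a - μ(s))ᵀ P(s) (a - μ(s)). Since P(s) is positive definite, A is never positive and the greedy continuous action is simply a = μ(s), with no inner maximization. A minimal numeric sketch (the values of V, μ, and P below are made-up placeholders standing in for network outputs):

```python
# Sketch of the normalized advantage function used by CDQN-style methods:
# Q(s, a) = V(s) + A(s, a),  A(s, a) = -0.5 * (a - mu)^T P (a - mu).

def quadratic_advantage(a, mu, P):
    """A(s, a) for a 2-D action; P is a 2x2 positive-definite matrix."""
    d = [a[0] - mu[0], a[1] - mu[1]]
    Pd = [P[0][0] * d[0] + P[0][1] * d[1],
          P[1][0] * d[0] + P[1][1] * d[1]]
    return -0.5 * (d[0] * Pd[0] + d[1] * Pd[1])

V = 3.0                       # state value (placeholder network output)
mu = [0.4, 0.7]               # greedy continuous action proposed by the network
P = [[2.0, 0.0], [0.0, 1.0]]  # positive-definite precision matrix

q_at_mu = V + quadratic_advantage(mu, mu, P)         # advantage is 0 at mu
q_off = V + quadratic_advantage([0.9, 0.2], mu, P)   # strictly smaller
print(q_at_mu, q_off)  # 3.0 2.625
```

Because the maximizing action is available in closed form, no discretization of the continuous parameter space is needed.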
The notations utilized in this paper are stated in Table 1.

Learned Knowledge-Based Parameter Control Framework via Deep Reinforcement Learning
4.1. The General Framework of Parameter Control. For an MOEA, the parameters are usually set before iteration. After initialization, the population is evaluated by the objective function, also known as the fitness function. Then, the population starts an evolutionary process based on various designed methods. After that, the population is evaluated again and generates a new Pareto set. If the termination condition is not satisfied, the population begins the next iteration. This is the general process of MOEAs, although some specific MOEAs may deviate slightly from these steps. The flowchart of MOEAs is presented in Figure 2(a).
In this article, we embed DRL into the general process of MOEAs, as illustrated in Figure 2(b). The steps preceding the first termination-condition check are similar to the general process, with the parameters for this round determined before iteration. After that, the population information is transferred to DRL, which learns from this information and chooses proper parameter sets for the next population P_{t+1} or individual x_{t+1}^i. The chosen parameter set is then applied in the next iteration. In this process, the information transferred to DRL is defined as the state, and the chosen parameter sets are the actions. Fitness evaluation and Pareto set generation are considered the environment. Subsection 4.2 introduces the details of the parameter control process and models the general evolutionary algorithm as an MDP.
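Structurally, the embedding amounts to one extra hook in the MOEA loop: before each variation step, population information goes to the controller, and the controller's action (a parameter set) configures the next step. The sketch below shows only that control flow; `Controller`, `evaluate`, and `evolve` are stand-ins, not the paper's implementation:

```python
# Structural sketch of Figure 2(b): a generic MOEA loop with a parameter
# controller plugged in. The controller here picks parameters at random; in
# the proposed framework it would be the trained CDQN.
import random

random.seed(1)

class Controller:
    """Placeholder agent: maps an evolution state to a parameter set."""
    def choose(self, state):
        return {"eta_c": random.uniform(5, 30), "eta_m": random.uniform(5, 30)}

def evaluate(x):
    """Toy biobjective fitness."""
    return (sum(v * v for v in x), sum((v - 1.0) ** 2 for v in x))

def evolve(x, params):
    """Stand-in variation step; a real MOEA applies crossover/mutation here."""
    sigma = 1.0 / params["eta_m"]
    return [v + random.gauss(0.0, sigma) for v in x]

controller = Controller()
pop = [[random.uniform(0, 1) for _ in range(3)] for _ in range(8)]
for t in range(5):
    fits = [evaluate(x) for x in pop]
    state = {"gen": t, "fits": fits}        # population info passed to the agent
    params = controller.choose(state)       # action = parameter set for next step
    pop = [evolve(x, params) for x in pop]  # next iteration uses chosen params
print(len(pop))  # 8
```

Everything outside the `choose` call is an unmodified MOEA, which is what makes the framework applicable to different base algorithms.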

4.2. Modeling the Parameter Control Process as an MDP.
This subsection clarifies the components needed in the parameter control process and provides the rationale for every feature used in the proposed framework.

4.2.1. Environment. The environment includes an evolutionary computation process and a set of optimization problems (training functions). The optimization problems, represented by objective functions, are used to evaluate the performance of the optimizer. Note that these objective functions should share some common features, such as the number of objectives and whether the problem is an integer or continuous programming problem. These common features facilitate learning and application.

4.2.2. State S. The state is the feature describing the evolution process and provides the evidence for the agent to choose proper parameters. In real-world problems, the scope of the decision space and the Pareto fronts (PFs) are difficult to obtain. In this article, we therefore select features and process information that require no prior knowledge about the decision space or PFs for parametric decision-making. Besides some basic information about the considered problem, we choose the relative position in the decision space, the distribution of the fitness values over the past n generations, and the grid-based inverted generational distance (grid-IGD) [49] of the individual.
(1) The Basic Information about the Considered Problem. This kind of feature includes the number of objectives and the number of dimensions. This information helps clarify the difficulty of the problem.
(2) The Relative Position of x_t^i. This is the basic feature of the state. The relative position in the decision space can be described by (x_t^i − x_l)/(x_u − x_l), where x_u and x_l denote the upper and lower bounds of the decision space, respectively.
(3) The Distribution of Fitness Values. This feature reflects the distribution of the fitness values of the whole population over the past n generations. The index divides [f_min, f_max] into n equal parts and counts the number of individuals on the PFs in each part, where f_min and f_max represent the minimum and maximum fitness values found in the current round. This feature helps clarify the scope of the fitness space.
(4) Grid-IGD. Grid-IGD is introduced to steer the evolutionary direction for problems with unknown PFs. Grid-IGD generates a set of reference points to estimate the PFs of the considered problem. Since grid-IGD generates representative nondominated solutions in the grid environment, it helps the agent assess the quality of the current solution set without knowing the true Pareto set.
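Features (2) and (3) above are straightforward to compute; the sketch below assembles them for one individual (the helper names, bin count, and the trailing `[2, 2]` basic-information entries are illustrative choices, not the paper's exact encoding):

```python
# Sketch of assembling state features: normalized position in the decision
# space, plus a histogram of population fitness values over [f_min, f_max].

def relative_position(x, lower, upper):
    """Feature (2): position normalized to [0, 1] per dimension."""
    return [(xi - lo) / (up - lo) for xi, lo, up in zip(x, lower, upper)]

def fitness_histogram(fitnesses, n_bins=5):
    """Feature (3): split [f_min, f_max] into n_bins equal parts and count."""
    f_min, f_max = min(fitnesses), max(fitnesses)
    width = (f_max - f_min) / n_bins or 1.0   # guard against a zero-width range
    counts = [0] * n_bins
    for f in fitnesses:
        counts[min(int((f - f_min) / width), n_bins - 1)] += 1
    return counts

pos = relative_position([2.0, 5.0], lower=[0.0, 0.0], upper=[4.0, 10.0])
hist = fitness_histogram([0.1, 0.2, 0.2, 0.9, 1.1])
state = [2, 2] + pos + hist   # [objectives, dimensions] + position + distribution
print(pos, hist)  # [0.5, 0.5] [3, 0, 0, 0, 2]
```

Both features are scale-free, which is what lets the same trained controller be reused across problems with different decision-space bounds.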

4.2.3. Action A and Policy π. For an MDP, the agent can sample or choose an action from the policy π, defined as a probability distribution p(A_t | S_t; θ_r) under the state S_t, where θ_r is the parameter of the policy. In this article, the action A_t^i is the parameter set chosen in the t-th generation for the i-th individual. For different MOEAs, the parameters that need to be adaptively modified differ. For A_t^i, the number and range of each parameter are defined before training and should remain the same while testing on real problems.
The policy π is a distribution over actions under different states. It can be described by the following formula:

\[ \pi(a \mid s) = p(A_t = a \mid S_t = s; \theta_r). \tag{2} \]

4.2.4. Reward R. The reward for an MDP is the expected return R_s^a = E[R_{t+1} | S_t = s, A_t = a] gained under the state s with the action a. For multiobjective optimization problems, when a solution transfers from a dominated solution to a nondominated one, this transition is considered a success and should be rewarded. At the same time, if the number of nondominated solutions in the archive increases in this iteration, this situation should also be rewarded. We take these two factors into consideration and design a feedback reward with memory: the reward not only considers the current situation but also compares it with the history. Thus, the reward proposed in this article is described by the following formula:

\[
r_t^i =
\begin{cases}
10, & \text{if the } i\text{-th individual becomes nondominated in the } t\text{-th generation}, \\
5, & \text{if the } i\text{-th individual remains nondominated in the } t\text{-th generation}, \\
0, & \text{otherwise}.
\end{cases}
\tag{3}
\]
The reward designed in (3) encourages the agent to keep the individual nondominated and to evolve more nondominated solutions.
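The memory-aware reward of formula (3) depends on the individual's current dominance status and its status in the previous generation; a minimal sketch (function and argument names are ours, not the paper's):

```python
# Sketch of the memory-aware reward in formula (3): the payoff depends on
# whether the individual is nondominated now and whether it already was.

def reward(nondominated_now, nondominated_before):
    if nondominated_now and not nondominated_before:
        return 10   # became nondominated in this generation: biggest reward
    if nondominated_now and nondominated_before:
        return 5    # kept its nondominated status: smaller reward
    return 0        # dominated: no reward

print(reward(True, False), reward(True, True), reward(False, True))  # 10 5 0
```

Giving the transition a larger payoff than mere retention pushes the agent toward parameter sets that actively produce new nondominated solutions.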

4.2.5. Transition Probability P. The transition probability P represents the probability of transferring from the state s to a new state s′ and can be described by the following formula:

\[ P_{s s'}^{a} = p\bigl(S_{t+1} = s' \mid S_t = s, A_t = a\bigr). \tag{4} \]

In this article, since the state space of MOPs is too large to measure, P is hard to forecast. Thus, we choose model-free reinforcement learning methods to make decisions under different circumstances.

4.3. Embedding Continuous Q-Learning with Normalized Advantage Functions

4.3.1. Training Phase. The model-free RL method is suitable for problems where the environment is unknown or difficult to describe and explore accurately. It has been extended with policy value functions and large neural networks, which makes it possible to pass raw representations directly to neural networks to obtain policies for complex problems. In this article, based on the features of the state space and the continuity of the parameters, we choose CDQN as the parameter controller, identifying the environment and making parametric decisions. With the embedded CDQN, Algorithm 1 presents the pseudocode of the proposed framework in the training phase. Lines 1-7 complete the initialization of CDQN and the evolutionary algorithm. Then, the parameter set a_t^i is chosen by the μ model, and the action that maximizes the expected reward is always given by μ(x | θ^μ). The evolutionary algorithm then updates the individual x_t^i according to a_t^i, after which the state and reward are updated. Finally, the target network is updated according to the updated θ^Q (lines 16-17). In this process, the termination condition can be set according to the demands of testing or training; for example, it could be defined by the number of iterations, the maximum number of evaluations, or a time budget.
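Two training-phase ingredients referenced above, the experience replay buffer and the soft target-network update with rate τ, can be sketched as follows (the "networks" are flat weight lists here purely for illustration; the real CDQN uses neural networks):

```python
# Sketch of training-phase machinery: a replay buffer of (s, a, r, s')
# transitions and a soft target-network update theta' <- tau*theta + (1-tau)*theta'.
import random
from collections import deque

random.seed(2)
buffer = deque(maxlen=1000)            # replay buffer of (s, a, r, s') tuples

for t in range(50):                    # fill with dummy transitions
    buffer.append((t, random.random(), random.random(), t + 1))

batch = random.sample(list(buffer), 8)  # minibatch for one gradient step

def soft_update(theta_q, theta_target, tau=0.01):
    """theta' <- tau * theta + (1 - tau) * theta', applied elementwise."""
    return [tau * q + (1 - tau) * tgt for q, tgt in zip(theta_q, theta_target)]

theta_q, theta_target = [1.0, -2.0], [0.0, 0.0]
theta_target = soft_update(theta_q, theta_target)
print(len(batch), theta_target)  # 8 [0.01, -0.02]
```

Sampling minibatches from the buffer decorrelates consecutive evolutionary states, and the small τ keeps the target network changing slowly, which stabilizes the Q-value targets.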

4.3.2. Testing Phase. Since the knowledge from the training phase is utilized in the testing phase, the procedure for the testing phase loads the parameters of the CDQN obtained in the training stage. The pseudocode is summarized in Algorithm 2. The parameters of the CDQN obtained from the training stage are loaded before the testing stage starts. However, in contrast to the training stage, the network parameters are not updated through the iterations.

4.4. Reinforced MOEAs. In this subsection, we apply the proposed framework to four classical MOEAs to realize adaptive parameter control, namely, reinforced NSGA-II (R-NSGA-II), reinforced MOEA/D (R-MOEA/D), reinforced MOPSO (R-MOPSO), and reinforced MODE (R-MODE). The rationale for each of the four algorithms is presented, followed by a description of the parameters incorporated into the framework's tuning. The structure of the four reinforced algorithms is summarized in Figure 3.

4.4.1. Reinforced NSGA-II. Different from NSGA-II [7], where the crossover probability η_c and mutation probability η_m are set before iteration, the proposed R-NSGA-II sets these two parameters with the proposed framework to realize adaptive tuning. In R-NSGA-II, the evolutionary process relies mainly on the crossover and mutation operators: the crossover operator recombines randomly selected individuals according to the crossover probability η_c, and the mutation operator changes components of the individuals according to the mutation probability η_m. After each iteration, the retained individuals are selected by sorting based on the nondomination rank and the crowding distance. The algorithm divides the population into a sequence of Pareto nondominated sets; an individual in one nondominated set is not dominated by any individual in the current or any later nondominated set. The method selects all nondominated individuals that are not dominated by any other individual, removes this nondominated set from the population, and repeats the process on the remainder until the termination condition is met. The population is then ranked by crowding distance, which is the sum of the distances between adjacent individuals in each dimension.

4.4.2. Reinforced MOEA/D. In R-MOEA/D, the Pareto front is approached by optimizing subproblems collaboratively, exploiting the neighborhood relationship between subproblems. Simulated binary crossover (SBX) and polynomial mutation serve as the variation operators. In SBX, the two offspring are created using the following equations:

\[ u_{k_a}^i = 0.5\bigl[(1+\beta)\, p_{k_a}^i + (1-\beta)\, p_{k_b}^i\bigr], \qquad u_{k_b}^i = 0.5\bigl[(1-\beta)\, p_{k_a}^i + (1+\beta)\, p_{k_b}^i\bigr], \tag{6} \]

where u_{k_a}^i and u_{k_b}^i are the offspring after SBX, p_{k_a}^i and p_{k_b}^i are randomly selected parent individuals, and β is a random expansion factor whose value is determined by the following equation:

\[ \beta = \begin{cases} (2r)^{1/(\eta_c+1)}, & r \le 0.5, \\ \bigl(1/(2(1-r))\bigr)^{1/(\eta_c+1)}, & \text{otherwise}, \end{cases} \tag{7} \]

where r is a random value in [0, 1] and η_c is the distribution index of SBX, tuned by the proposed framework. When η_c is larger, the offspring are more similar to their parents; conversely, when η_c is smaller, the offspring tend to differ from their parents. The formula for polynomial mutation is as follows:

\[ v_t^i = p_t^i + \delta\,(u_t - l_t), \tag{8} \]

where p_t^i is the individual before mutation, v_t^i is the individual after mutation, u_t and l_t denote the upper and lower bounds of the individual, respectively, and the perturbation δ is given by

\[ \delta = \begin{cases} (2r)^{1/(\eta_m+1)} - 1, & r < 0.5, \\ 1 - \bigl(2(1-r)\bigr)^{1/(\eta_m+1)}, & \text{otherwise}. \end{cases} \tag{9} \]

Input: the population size N, the number of iterations T, the number of episodes M, the discounting rate τ
Output: the trained network parameters θ^Q
(1) Initialize the normalized Q network with weights θ^Q
(2) Initialize the target network Q′ with weights θ^{Q′} ← θ^Q
(3) Initialize the replay buffer
(4) for episode = 1, M do
(5)   Initialize a random process N for action exploration
(6)   Initialize the population P randomly
(7)   Receive the initial observation state s_1^1 ∼ p(x_1^1)
(8)   for t = 1, T do
(9)     for i = 1, N do
(10)      Select action a_t^i = μ(s_t^i | θ^μ) + N_t^i
(11)      Apply a_t^i to update the individual x_t^i
(12)      Evaluate the population and perform selection
(13)      Observe the reward r_t^i and the new state; store the transition in the replay buffer
(14)      Update θ^Q from a minibatch of the replay buffer
(15)      Update the target network: θ^{Q′} ← τ θ^Q + (1 − τ) θ^{Q′}
(16)    end for
(17)    t = t + 1
(18)  end for
(19)  episode = episode + 1
(20) end for

ALGORITHM 1: Pseudocode for the proposed framework in the training phase.

Input: the population size N, the number of iterations T, the trained network parameters θ
Output: the Pareto set of the MOP
(1) Initialize the normalized Q network with the fixed trained weights θ
(2) Initialize the population P randomly
(3) Receive the initial observation state s_1^1 ∼ p(x_1^1)
(4) for t = 1, T do
(5)   for i = 1, N do
(6)     Select action a_t^i = μ(s_t^i | θ^μ)
(7)     Apply a_t^i to update the individual
(8)     Evaluate the population and perform selection
(9)   end for
(10)  t = t + 1
(11) end for

ALGORITHM 2: Pseudocode for the proposed framework in the testing phase.

Figure 3: The framework of the four reinforced algorithms. The CDQN updates the parameters through the operators of each algorithm: NSGA-II, the mutation and crossover operators; MOEA/D, eqs. (6) and (8); MOPSO, eqs. (10) and (11); MODE, eqs. (12) and (13).

4.4.3. Reinforced MOPSO. R-MOPSO combines the original MOPSO [5] with the proposed parameter control framework. In R-MOPSO, the velocity and position of each particle are described by the following equations:

\[ v_{t+1}^i = w\, v_t^i + r_1\bigl(P_{\mathrm{best}}(i) - x_t^i\bigr) + r_2\bigl(\mathrm{REP}(h) - x_t^i\bigr), \tag{10} \]

\[ x_{t+1}^i = x_t^i + v_{t+1}^i, \tag{11} \]

where w is the inertia weight, r_1, r_2 ∈ [0, 1] are the dynamic parameters adjusted by the proposed framework, P_best(i) is the best position of particle i, and REP(h) is a value taken from the repository. The index h is selected as follows: each hypercube containing more than one particle is assigned a fitness value by dividing a fixed number (greater than zero) by the number of particles it contains, and h is then chosen according to these fitness values.

4.4.4. Reinforced MODE. R-MODE is composed of the original MODE [9] and the proposed parameter control framework. As in MODE, the main idea of R-MODE is to balance exploration and exploitation in the evolution process by selecting the learning particles and the learning ratios. The mutation and crossover operators are utilized to create the offspring. The individual formed through mutation can be expressed by the following equation:

\[ v_t^i = x_t^i + F\bigl(p_{k_a}^i - p_{k_b}^i\bigr), \tag{12} \]

where F is the scaling factor of the disturbance, adaptively tuned by the parameter control framework, p_{k_a}^i and p_{k_b}^i are randomly selected parent individuals, and v_t^i is the particle generated by mutation. Binomial crossover is one of the most frequently used crossover operators and can be described by the following equation:

\[ u_t^i = \begin{cases} v_t^i, & \mathrm{rand}_i[0,1] \le \mathrm{CR}, \\ x_t^i, & \text{otherwise}, \end{cases} \tag{13} \]

where rand_i[0, 1] is a uniformly distributed random number. When it is no greater than the crossover rate CR, the component generated by the mutation operator is chosen; otherwise, the original component of x_t^i is retained. For MODE, F and CR are the parameters related to the algorithm's efficiency, and we apply the framework to tune them adaptively according to the state.
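The R-MODE variation step, DE mutation with scaling factor F followed by binomial crossover with rate CR, can be sketched as follows (the `j_rand` guarantee of at least one mutant component is a standard DE convention we add for robustness; it is not stated in the text above):

```python
# Sketch of a DE variation step: mutation v = x + F*(pa - pb), then binomial
# crossover with rate CR, the two parameters the framework tunes adaptively.
import random

random.seed(3)

def de_mutation(x, pa, pb, F):
    """Mutant vector v = x + F * (pa - pb), with pa, pb random parents."""
    return [xi + F * (ai - bi) for xi, ai, bi in zip(x, pa, pb)]

def binomial_crossover(x, v, CR):
    """Take the mutant component when rand <= CR, else keep the original."""
    j_rand = random.randrange(len(x))  # force at least one mutant component
    return [vi if (random.random() <= CR or j == j_rand) else xi
            for j, (xi, vi) in enumerate(zip(x, v))]

x  = [0.5, 0.5, 0.5]
pa = [0.9, 0.1, 0.4]
pb = [0.3, 0.7, 0.2]
v  = de_mutation(x, pa, pb, F=0.7)
u  = binomial_crossover(x, v, CR=0.7)
print([round(vi, 2) for vi in v])  # [0.92, 0.08, 0.64]
```

A larger F widens the disturbance (exploration), while a larger CR lets more mutant components through (exploitation of the mutant direction), which is exactly the exploration-exploitation trade-off the controller adjusts.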

Experimental Study
In this section, the implementation details are presented first. The comparison results between the classical MOEAs and their reinforced counterparts are presented afterwards. (The code will be published at https://github.com/velvet999 after the paper is accepted.)

5.1. Test Functions. The ZDT [7], DTLZ [50], and walking fish group (WFG) [51] benchmarks are used to train and test the proposed framework. Specifically, in this article, we choose ZDT1-ZDT4 and ZDT6 as the training sets and DTLZ1-DTLZ4 and WFG1-WFG8 as the testing sets.

5.2. Measure Metrics. Two widely used performance indicators, the inverted generational distance (IGD) [52] and hypervolume (HV) [53], are used to evaluate the quality of the obtained nondominated solution set; they account for both convergence (closeness to the true Pareto front) and the distribution of the achieved nondominated solutions.

5.2.1. Inverted Generational Distance. IGD is an integrated performance index that evaluates the distribution and convergence of solutions simultaneously. It does so by computing, for each point (individual) sampled from the true Pareto front, the distance to the nearest solution obtained by the algorithm, and averaging these distances. The better the comprehensive performance of the algorithm, the smaller the value of IGD:

\[ \mathrm{IGD}(P^{*}, P) = \frac{1}{|P^{*}|} \sum_{v \in P^{*}} d(v, P), \]

where P* is a set of points sampled from the true Pareto front, P is the obtained solution set, and d(v, P) is the minimum Euclidean distance from v to the points in P.
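The IGD computation described above is a few lines of code; a minimal sketch with a made-up reference front and solution set:

```python
# Sketch of IGD: average distance from each reference point on the (sampled)
# true Pareto front to its nearest obtained solution. Smaller is better.
import math

def igd(reference_front, obtained):
    total = 0.0
    for v in reference_front:
        total += min(math.dist(v, x) for x in obtained)
    return total / len(reference_front)

ref = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]   # points sampled from the true front
sol = [(0.0, 1.0), (1.0, 0.0)]               # obtained nondominated set
print(round(igd(ref, sol), 4))  # 0.2357 -- only (0.5, 0.5) is uncovered
```

Because the average is taken over the reference front, a solution set that misses a region of the front (here, the middle) is penalized, which is why IGD captures distribution as well as convergence.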

5.2.2. Hypervolume. The HV indicator (or S-metric) is a performance metric that indicates the quality of a nondominated approximation set, described as the "size of the space covered or size of dominated space":

\[ \mathrm{HV}(f^{\mathrm{ref}}, X) = \Lambda\Bigl(\bigcup_{x_n \in X} \bigl[f_1(x_n), f_1^{\mathrm{ref}}\bigr] \times \cdots \times \bigl[f_m(x_n), f_m^{\mathrm{ref}}\bigr]\Bigr), \]

where HV(f^ref, X) measures the size of the space covered by an approximation set X, f_m^ref ∈ R refers to a chosen reference point in the m-th dimension, f_m(x_n) is the fitness value of individual n in the m-th dimension, and Λ(·) refers to the Lebesgue measure.
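In the biobjective case, the union of dominated boxes can be computed exactly by sorting the front and summing rectangles; this small sketch (minimization, assuming the input points are mutually nondominated) illustrates the definition:

```python
# Sketch of an exact 2-D hypervolume (minimization): sort the nondominated
# points by f1 and accumulate the rectangles they dominate up to the
# reference point. Valid only for the biobjective case.

def hv_2d(points, ref):
    pts = sorted(points)           # ascending f1 => descending f2 if nondominated
    volume, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        volume += (ref[0] - f1) * (prev_f2 - f2)
        prev_f2 = f2
    return volume

front = [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0)]
print(hv_2d(front, ref=(5.0, 5.0)))  # 11.0
```

A larger HV means the front dominates more of the objective space below the reference point; for three or more objectives, exact computation needs more elaborate algorithms (or Monte Carlo approximation of the Lebesgue measure).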

5.3. Algorithms and Parameter Settings. In this subsection, the general parameters are stated first; then, we list the specific parameters used in the experiments for each compared algorithm. After that, the parameters of the proposed framework used in the experiments are provided.

5.3.1. Common Parameter Settings
(1) Number of variables and objectives: The number of objectives for both DTLZ and WFG is 3, which is a common setting in multiobjective experiments. The number of variables is 30 for DTLZ and 10 for WFG; the different numbers of variables also test the adaptability and robustness of the algorithms.
(2) Statistical approach: Due to the heuristic nature of evolutionary algorithms, each algorithm is run independently 30 times on each function to overcome randomness. The Mann-Whitney-Wilcoxon rank-sum test [54] is employed for comparison, with a significance level of 5%.
(3) Population size and number of evaluations: These are the same for all algorithms. The population size N is 100, and the maximum number of evaluations (MAXNFE) is 10,000.
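The statistical protocol in item (2) compares the 30 per-run metric values of an original algorithm against those of its reinforced version. The sketch below uses synthetic IGD samples and a pure-Python Wilcoxon rank-sum statistic (normal approximation, no tie correction) standing in for a stats-library call such as SciPy's `mannwhitneyu`; the data are fabricated for illustration only:

```python
# Sketch of the rank-sum comparison protocol on two synthetic samples of
# 30 IGD values. |z| > 1.96 corresponds to the 5% significance level.
import math
import random

def rank_sum_z(sample_a, sample_b):
    """z-score of the Wilcoxon rank-sum statistic for sample_a."""
    pooled = sorted((v, i) for i, v in enumerate(sample_a + sample_b))
    n_a, n_b = len(sample_a), len(sample_b)
    w = sum(rank + 1 for rank, (_, idx) in enumerate(pooled) if idx < n_a)
    mean = n_a * (n_a + n_b + 1) / 2.0
    sd = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12.0)
    return (w - mean) / sd

random.seed(4)
igd_original   = [0.30 + random.gauss(0, 0.02) for _ in range(30)]
igd_reinforced = [0.25 + random.gauss(0, 0.02) for _ in range(30)]
z = rank_sum_z(igd_reinforced, igd_original)
print(z < -1.96)  # reinforced IGD values rank significantly lower
```

A strongly negative z means the reinforced runs consistently achieve smaller IGD, which is the direction of improvement reported in the tables below.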

5.3.2. Parameter Settings for Classical MOEAs.
All of the code and details of the comparison algorithms can be found in pymoo [55], a multiobjective optimization tool in Python.
For NSGA-II, the crossover probability is set to 1.0, and the mutation probability is 1/n, where n is the number of variables.
For MOEA/D, SBX is chosen for crossover with a probability of 0.9; polynomial mutation is used with a distribution index of 20 and a probability of 0.2.
For MOPSO, c_1 and c_2 are set to 1.49618 and ω = 0.729844. Polynomial mutation is used with a mutation index μ_m = 20 and a probability of 1/N, where N is the population size.
For MODE, the crossover rate is 0.7, the mutation rate is 1/30, and the child variability factor is 0.7.

5.3.3. Parameter Settings for R-MOEAs.
The parameters of the R-MOEAs are randomly generated at first and adaptively adjusted by the agent during the process. Figure 4 gives the details of the CDQN, the parameter controller in the proposed framework.

5.4. Comparison.
In this section, we present the results obtained by the comparison algorithms and the proposed framework-embedded algorithms on the DTLZ and WFG benchmarks. The statistical results of the IGD metric on the 4 DTLZ test problems and 8 WFG test problems are summarized in Table 2, and those of HV in Table 3. The mean values, standard deviations, and results of the Mann-Whitney-Wilcoxon rank-sum test (in parentheses) are provided; bold values represent the best performance for each problem. For each test problem, the rank-sum test is performed between the results obtained by an algorithm's original version and its reinforced version, rather than across the whole table. For example, the rank-sum test for NSGA-II is performed between the metric values obtained by NSGA-II and R-NSGA-II.
As Table 2 shows, the average IGDs obtained by the reinforced algorithms are smaller than those of their original versions. Since IGD measures both the quality and the uniformity of the distribution of the solutions, this indicates that the solutions obtained by the reinforced versions are closer to the true Pareto front. R-MOEA/D achieves the best performance on DTLZ1 among the 4 original and 4 reinforced algorithms, while R-NSGA-II achieves the best on DTLZ3. For the WFG benchmark, R-MOEA/D achieves the best results on 5 of the 8 problems. The other reinforced algorithms may not achieve the best results on every problem, but they still obtain better results than their original versions. For DTLZ3, a multimodal problem, no algorithm performs exceptionally well, indicating that while the framework can enhance the performance of algorithms, this improvement is limited on problems that the algorithms inherently struggle with. Figures 5 and 6 show the boxplots of the IGD metric over 30 runs. It can be observed that the IGD obtained by the reinforced algorithms is more stable and superior to that of their classical counterparts; for all problems, a reinforced algorithm obtained the best results. From the figures, we can clearly observe that the spread of the IGD values obtained by the reinforced algorithms is smaller than that of the original ones, which shows that the reinforced algorithms are more stable over the 30 runs.
HV measures the convergence and diversity of an algorithm simultaneously. The means and standard deviations of HV obtained by the four classical algorithms and their reinforced versions on DTLZ and WFG over 30 runs are presented in Table 3. The reinforced algorithms improve on most problems compared with their original versions.
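For intuition, in the two-objective minimization case HV reduces to a sweep that accumulates rectangles dominated by the solution set and bounded by a reference point. The following generic sketch (assuming a mutually non-dominated input set, not the exact implementation used here) computes it:

```python
import numpy as np

def hypervolume_2d(points: np.ndarray, ref: np.ndarray) -> float:
    """Hypervolume for a two-objective minimization problem: the area
    dominated by `points` and bounded by the reference point `ref`.
    Assumes the points are mutually non-dominated."""
    # Sort by the first objective; the second then decreases monotonically.
    pts = points[np.argsort(points[:, 0])]
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        # Rectangle contributed by this point only.
        hv += (ref[0] - f1) * (prev_f2 - f2)
        prev_f2 = f2
    return hv

pts = np.array([[0.0, 1.0], [1.0, 0.0]])
print(hypervolume_2d(pts, np.array([2.0, 2.0])))  # 3.0
```

Larger HV is better: it grows both when solutions move toward the ideal point and when they cover more of the front.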
In addition to verifying the transferability of the framework across different problems, we further validated its performance on the same problem with different numbers of decision variables. We chose NSGA-II, MOPSO, and their reinforced versions to run on WFG with 5 (6 for WFG2 and WFG3), 20, and 30 variables. The results are summarized in Table 4, from which we can see that the reinforced algorithms are more likely to succeed on the same problem even with different numbers of variables.
Meanwhile, we also applied the Friedman test [56] to the results. Table 5 and Figure 7 summarize the average ranking of each of the eight algorithms on all problems from the two test suites, where differences in their performance are detected. The lower the ranking, the better the performance of an algorithm. It is worth noting that the Mann-Whitney-Wilcoxon rank-sum test compares the performance of only two algorithms at a time, while the Friedman test ranks all algorithms by their overall performance. The reinforced algorithms clearly show better performance than their classical counterparts.
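The Friedman test and the average rankings can also be reproduced with SciPy; the score matrix below (4 algorithms on 6 problems) is an illustrative placeholder, not our measured data:

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Rows: test problems; columns: algorithms. Lower IGD is better.
scores = np.array([
    [0.010, 0.008, 0.012, 0.007],
    [0.020, 0.015, 0.022, 0.014],
    [0.031, 0.030, 0.033, 0.028],
    [0.009, 0.011, 0.010, 0.008],
    [0.017, 0.013, 0.018, 0.012],
    [0.025, 0.021, 0.027, 0.019],
])
# Friedman test over all algorithms at once (one sample per algorithm).
stat, p = friedmanchisquare(*scores.T)
# Average rank per algorithm (rank 1 = best on a problem).
avg_ranks = rankdata(scores, axis=1).mean(axis=0)
```

A small p-value indicates that at least one algorithm ranks consistently differently from the others, after which the average ranks order the algorithms overall.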

Further Analysis.
From the boxplots, it is not hard to observe that the algorithms with the embedded framework outperform the original algorithms in both mean value and standard deviation. This further validates the generality and flexibility of the proposed framework; from this perspective, the framework is a meaningful innovation. Deep reinforcement learning can make the right choices under the complex circumstances arising during the iteration process. Compared with rule-based or iteration memory-based parameter control methods, this framework with deep reinforcement learning offers more scalability and flexibility and can be further adjusted for specific problems and algorithms. At the same time, it can be concluded that with the designed framework, automatically selected parameters improve both the convergence and robustness of the algorithm.
While training does require a certain amount of time and computational resources, early offline training enhances the results in subsequent applications. As for computation time in the testing phase, embedding DRL increases the computation time. However, in some practical scenarios, such as power system optimization [57] and supply chain management [58], accuracy is more important than timeliness. At the same time, the continuing growth of available computing power provides more room for pursuing accuracy.

Conclusion
This paper presents a novel parameter control framework for MOEAs. The framework utilizes the ability of deep reinforcement learning to choose proper parameters from high-dimensional state features. We specify every component of the Markov decision process, including the environment, state, action, reward, and transition probability, and employ a classic and well-recognized deep reinforcement learning algorithm to process the state and make choices in a continuous space.
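To make this decomposition concrete, the sketch below shows one plausible shape of such an MDP wrapper around an MOEA generation; the specific state features and reward used here are illustrative assumptions, not the exact definitions of the proposed framework:

```python
import numpy as np

class ParameterControlEnv:
    """Illustrative MDP wrapper around one MOEA generation: the state is a
    feature vector summarizing the current population's objective values,
    the action would be a continuous parameter vector (e.g. operator
    distribution indices), and the reward is the improvement of a quality
    indicator between generations. All definitions here are assumptions."""

    def __init__(self, evaluate_indicator):
        self.evaluate_indicator = evaluate_indicator  # e.g. an IGD function
        self.prev_value = None

    def state(self, objectives: np.ndarray) -> np.ndarray:
        # Hand-crafted features: per-objective mean and spread of the population.
        return np.concatenate([objectives.mean(axis=0), objectives.std(axis=0)])

    def reward(self, objectives: np.ndarray) -> float:
        # Reward = decrease of the indicator since the previous generation.
        value = self.evaluate_indicator(objectives)
        r = 0.0 if self.prev_value is None else self.prev_value - value
        self.prev_value = value
        return r

# Usage with a toy indicator: mean distance of the population to the origin.
env = ParameterControlEnv(lambda F: float(np.linalg.norm(F, axis=1).mean()))
r0 = env.reward(np.array([[1.0, 1.0], [2.0, 2.0]]))  # first call: no baseline yet
r1 = env.reward(np.array([[0.5, 0.5], [1.0, 1.0]]))  # indicator dropped: positive reward
```

Defining the reward as an indicator improvement gives the DRL agent a dense signal aligned with the optimization goal of each generation.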
We introduced four reinforced MOEAs based on classical MOEAs with the proposed framework to verify its universality and validity. R-MOEA/D, R-MOPSO, R-NSGA-II, and R-MODE were trained and compared with their original algorithms. The experimental results demonstrate that the proposed framework can adapt to different algorithms, improving their efficiency and robustness on various test problems after training. As observed from the boxplots, the efficiency of each improved algorithm is consistently better than that of its original algorithm, which further proves the universality of the proposed framework.
Regarding future studies, the applicability of the parameter control framework to different kinds of problems, such as integer optimization, will be studied. Moreover, some real-world problems will be considered as training and testing sets. Since real-world problems are more complex, designing their state features will be a challenge.

Figure 2: The flowchart of MOEA with and without parameter control. (a) The general flowchart of MOEA. (b) The flowchart of MOEA with parameter control.

4.4.2. Reinforced-MOEA/D. Reinforced-MOEA/D is constructed based on MOEA/D [8] and the proposed framework. In R-MOEA/D, the multiobjective problem is decomposed into a set of single-objective subproblems or several multiobjective subproblems. Then, the framework is utilized to adjust the parameters of the simulated binary crossover (SBX) and polynomial mutation (PM) operators.
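As an illustration of what the framework tunes, a generic SBX operator with the distribution index exposed as the controlled parameter might look as follows (a sketch of the standard operator, not the exact code of R-MOEA/D; variable bounds handling is omitted for brevity):

```python
import numpy as np

def sbx_crossover(p1, p2, eta_c, rng):
    """Simulated binary crossover. eta_c is the distribution index that the
    parameter control framework would adjust: a small eta_c produces
    offspring far from the parents (exploration), while a large eta_c keeps
    them close (exploitation)."""
    u = rng.random(p1.shape)
    # Spread factor beta drawn from the SBX probability distribution.
    beta = np.where(
        u <= 0.5,
        (2.0 * u) ** (1.0 / (eta_c + 1.0)),
        (1.0 / (2.0 * (1.0 - u))) ** (1.0 / (eta_c + 1.0)),
    )
    c1 = 0.5 * ((1.0 + beta) * p1 + (1.0 - beta) * p2)
    c2 = 0.5 * ((1.0 - beta) * p1 + (1.0 + beta) * p2)
    return c1, c2

rng = np.random.default_rng(1)
p1, p2 = np.array([0.2, 0.8]), np.array([0.6, 0.4])
c1, c2 = sbx_crossover(p1, p2, eta_c=15.0, rng=rng)
```

A useful property of SBX is that the offspring are symmetric about the parents' midpoint, so c1 + c2 always equals p1 + p2.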

Figure 4: The structure of the neural network used in CDQN.

Figure 5: The boxplot of IGD obtained by the four original algorithms and their reinforced versions on DTLZ.

Figure 6: The boxplot of IGD obtained by the four original algorithms and their reinforced versions on WFG.

Figure 7: Average rankings of all algorithms obtained by the Friedman test on all the test functions.

Table 1: The notations. p_t^i: the i-th individual in the t-th generation; P_t ∈ R^{N×n}: the population of the t-th generation; F_t ∈ R^N: the fitness values of the t-th generation.

where r is a random value in [0, 1], η_m represents the distribution index in PM, tuned adaptively by the parameter control framework, δ_1 = (p_t^i − l_t)/(u_t − l_t), and δ_2 = (u_t − p_t^i)/(u_t − l_t).
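Combining these definitions with the standard polynomial-mutation update of Deb and Goyal gives the following generic sketch (variable names mirror the text; the per-variable mutation probability is omitted for brevity, and this is not the exact operator code used here):

```python
import numpy as np

def polynomial_mutation(p, lower, upper, eta_m, rng):
    """Polynomial mutation of one decision vector. eta_m is the distribution
    index adjusted by the parameter control framework; d1 and d2 follow the
    definitions of delta_1 and delta_2 in the text."""
    r = rng.random(p.shape)
    d1 = (p - lower) / (upper - lower)
    d2 = (upper - p) / (upper - lower)
    # Standard Deb-Goyal perturbation: small with large eta_m, large with small eta_m.
    dq = np.where(
        r < 0.5,
        (2.0 * r + (1.0 - 2.0 * r) * (1.0 - d1) ** (eta_m + 1.0))
        ** (1.0 / (eta_m + 1.0)) - 1.0,
        1.0 - (2.0 * (1.0 - r) + 2.0 * (r - 0.5) * (1.0 - d2) ** (eta_m + 1.0))
        ** (1.0 / (eta_m + 1.0)),
    )
    return np.clip(p + dq * (upper - lower), lower, upper)

lower, upper = np.zeros(3), np.ones(3)
child = polynomial_mutation(np.array([0.2, 0.5, 0.9]), lower, upper,
                            eta_m=20.0, rng=np.random.default_rng(0))
```

Because δ_1 and δ_2 normalize the distance to the bounds, the perturbation shrinks as a variable approaches its limits, keeping mutants feasible.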

Table 2: Means and standard deviations of IGD obtained by NSGA-II, MOEA/D, and their reinforced versions on DTLZ and WFG.

Table 3: Means and standard deviations of HV obtained by NSGA-II, MOEA/D, and their reinforced versions on DTLZ and WFG.
Bold value represents the best performance for each problem.

Table 4: Means and standard deviations of IGD obtained by NSGA-II, MOPSO, and their reinforced versions on WFG with different numbers of variables.

Table 5: Average rankings of IGD and HV by the Friedman test.