An Improved Teaching-Learning-Based Optimization Algorithm with Reinforcement Learning Strategy for Solving Optimization Problems

This paper presents an improved teaching-learning-based optimization (TLBO) algorithm, called RLTLBO, for solving optimization problems. First, a new learning mode considering the effect of the teacher is presented. Second, the Q-Learning method in reinforcement learning (RL) is introduced to build a switching mechanism between the two different learning modes in the learner phase. Finally, random opposition-based learning (ROBL) is adopted after both the teacher and learner phases to improve the local optima avoidance ability of RLTLBO. These strategies effectively enhance the convergence speed and accuracy of the proposed algorithm. RLTLBO is analyzed on 23 standard benchmark functions and eight CEC2017 test functions to verify its optimization performance. The results reveal that the proposed algorithm provides effective and efficient performance in solving benchmark test functions. Moreover, RLTLBO is also applied to solve eight industrial engineering design problems. Compared with the basic TLBO and seven state-of-the-art algorithms, the results illustrate that RLTLBO has superior performance and promising prospects for dealing with real-world optimization problems. The source code of RLTLBO is publicly available at https://github.com/WangShuang92/RLTLBO.


Introduction
In recent years, real-world optimization problems have become increasingly complex and diverse across a wide range of fields and disciplines. Traditional (mathematical) optimization methods, such as Newton's method and the gradient descent method, can no longer meet the needs of solving current optimization problems.
Teaching-learning-based optimization (TLBO) is a meta-heuristic algorithm proposed by Rao et al. in 2011 [17]. The TLBO method is inspired by the teaching-learning process in a class and simulates the influence of a teacher on learners. Due to its rapid convergence, absence of algorithm-specific parameters, and easy implementation, TLBO has become a popular optimization algorithm and has been successfully applied to real-world problems in diverse fields. Aouf et al. [18] applied TLBO to optimize the parameters of an ANFIS structure to obtain the optimal trajectory and traveling time, addressing the navigation problem of a mobile robot in an unknown environment. Singh et al. [19] studied the application of TLBO for the optimal coordination of directional overcurrent relays (DOCRs) in a looped power system. Multiobjective TLBO was applied to solve the motif discovery problem (MDP) in the bioinformatics field by Gonzalez-Alvarez et al. [20] and obtained better solutions than other biology-based multiobjective evolutionary algorithms. All of the above applications suggest that TLBO can be effectively applied to many optimization problems in various fields. Improved and hybrid TLBO algorithms and their applications have also been studied by several researchers [21]. Kumar and Singh [22] developed a chaotic version of TLBO with different chaotic mechanisms. A local search method was also incorporated to guide the search direction between local and global search and to improve the quality of solutions. Its application to clustering problems proved the effectiveness of this algorithm. Taheri et al. [23] proposed a balanced TLBO with three modifications, called BTLBO. A weighted mean replaced the mean value in the teacher phase to maintain diversity. A tutoring phase was added as a powerful local search mechanism for exploiting regions around the best solution.
A restarting phase was introduced to improve the exploration ability by replacing inactive learners with randomly initialized learners. Ma et al. [24] proposed a modified TLBO (MTLBO) by introducing a population group mechanism into the basic TLBO. All students were divided into two groups and updated by different updating strategies. MTLBO was also applied to establish the NOx emission model of a circulating fluidized bed boiler. Xu et al. [25] introduced a dynamic-opposite learning (DOL) strategy into TLBO to overcome premature convergence. The asymmetric search space and the dynamically changing characteristics of DOL help DOLTLBO to holistically improve its exploitation and exploration capabilities. Dong et al. [26] presented a KTLBO algorithm for computationally expensive constrained optimization. A kriging-assisted two-phase optimization framework was used to alternately conduct global and local searches, accelerating the search. KTLBO was also adopted to design the structure of a blended-wing-body underwater glider. Ren et al. [27] developed a multiobjective elitist feedback TLBO (MEFTO) for multiobjective optimization problems. The elitism strategy was used to store the best solutions obtained thus far. The proposed feedback phase allowed students to choose whether to study directly with the teacher or to motivate themselves, providing a novel way for students to improve. Zhang et al. [28] proposed a hybrid algorithm based on TLBO and a neural network algorithm (NNA), named TLNNA, to solve engineering optimization problems. The experimental results suggested that TLNNA has improved global search ability and a fast convergence speed. By combining the features of the WOA and TLBO, Lakshmi and Mohanaiah [29] proposed a hybrid WOA-TLBO algorithm. This hybrid was also applied to solve the facial emotion recognition (FER) problem, and the reported results showed its effectiveness and high accuracy.
The TLBO variants proposed previously have improved searchability and accelerated the convergence process, but they still struggle with premature convergence and insufficient learning processes. Thus, in this paper we propose an improved TLBO algorithm to solve industrial engineering optimization problems. Given the characteristics of TLBO, reinforcement learning (RL) is introduced into the learner phase, enabling the algorithm to choose a more suitable learning mode and training the search agents to perform more beneficial actions. In addition, a random opposition-based learning (ROBL) strategy is added after the learner phase to accelerate convergence and avoid local optima. The proposed improved TLBO with RL and ROBL strategies is called RLTLBO. The standard and CEC2017 benchmark functions and eight engineering design problems are used to test the exploration and exploitation capabilities of the proposed method.
The RLTLBO algorithm is compared with several existing algorithms: the basic TLBO and the Salp Swarm Algorithm (SSA), which are classical algorithms; the Aquila Optimizer (AO), Harris Hawks Optimization (HHO) [30], and the Horse herd Optimization Algorithm (HOA) [31], which are recent methods; and the memory-based Grey Wolf Optimizer (mGWO) [32], the modified Ant Lion Optimizer (MALO) [33], and the dynamic Sine Cosine Algorithm (DSCA) [34], which are the latest improved algorithms. The experimental results show that the proposed RLTLBO method is superior to these state-of-the-art algorithms in exploration and exploitation capabilities. Moreover, eight industrial engineering design problems are used to evaluate the effectiveness of the algorithm in solving real-world optimization problems. The rest of this paper is organized as follows: Section 2 provides a brief overview of the basic TLBO, RL, and ROBL strategies. Section 3 describes the proposed RLTLBO algorithm in detail. Simulations, experiments, and an analysis of the results are presented in Section 4. Section 5 describes the industrial engineering design problems. Finally, Section 6 concludes the paper.

Teaching-Learning-Based Optimization.
The TLBO algorithm mimics the influence of a teacher on the output of learners, which is reflected by the learners' grades. As a highly learned person, the teacher imparts knowledge to the learners, so the outcome of the learners is affected by the quality of the teacher: learners trained by a good teacher can achieve better grades. The optimization process of TLBO is divided into two phases: the teacher phase and the learner phase.

Teacher Phase.
The teacher phase simulates the teaching process of a teacher. The best individual in the class is selected as the teacher, who then tries their best to improve the overall level of the class. The teaching process can be formulated as follows:

Xnew = Xold + rand × (Xteacher − TF × Mean),  (1)

where Xnew and Xold represent the positions of an individual after and before learning, that is, the candidate solutions after and before updating. Xteacher is the position of the teacher, which is the best individual in the population. Mean indicates the average level of the search agents in the population. TF is a teaching factor that determines the change of the mean value, and rand is a random number between 0 and 1. The value of TF can be either 1 or 2, a heuristic step decided randomly with equal probability as TF = round(1 + rand(0, 1) × (2 − 1)).
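The teacher-phase update of equation (1) can be sketched in Python with NumPy (a minimal illustration; variable names follow the text, and greedy acceptance of improved solutions is assumed to follow this step, as in the basic TLBO):

```python
import numpy as np

def teacher_phase(X, fitness):
    """One TLBO teacher-phase step for a population X of shape (N, D)."""
    N, D = X.shape
    teacher = X[np.argmin(fitness)]           # best individual acts as the teacher
    mean = X.mean(axis=0)                     # average level of the class
    X_new = X.copy()
    for i in range(N):
        TF = np.round(1 + np.random.rand())   # teaching factor, 1 or 2 with equal probability
        # X_new = X_old + rand * (X_teacher - TF * Mean)
        X_new[i] = X[i] + np.random.rand(D) * (teacher - TF * mean)
    return X_new
```

After this update, each new solution would replace the old one only if its fitness improves.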

Learner Phase.
In addition to learning new knowledge from the teacher, learners can also increase their knowledge through interaction. In the mutual learning process, a learner can randomly learn from another learner with a better grade. The learner phase can be written as follows:

Xnew = Xold + rand × (Xr1 − Xr2), if f(Xr1) < f(Xr2),
Xnew = Xold + rand × (Xr2 − Xr1), otherwise,  (2)

where Xr1 and Xr2 indicate the positions of two learners randomly selected from the population and f(·) is the fitness value. The comparison between the two learners determines the learning direction: the individual with the poorer grade learns from the individual with the better grade. The new individual is accepted if it improves after learning; otherwise, it is rejected. The flow chart of the TLBO algorithm is shown in Figure 1.
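The learner-phase update described above can be sketched as follows (a minimal NumPy illustration; the greedy accept/reject step mentioned in the text would follow this update):

```python
import numpy as np

def learner_phase(X, f):
    """One TLBO learner-phase step; f is a vector of fitness values (minimization)."""
    N, D = X.shape
    X_new = X.copy()
    for i in range(N):
        r1, r2 = np.random.choice(N, 2, replace=False)   # two random learners
        if f[r1] < f[r2]:                                # learn toward the better one
            X_new[i] = X[i] + np.random.rand(D) * (X[r1] - X[r2])
        else:
            X_new[i] = X[i] + np.random.rand(D) * (X[r2] - X[r1])
    return X_new
```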

Reinforcement Learning (RL).
Machine learning algorithms are also widely used to solve various optimization problems [35]. Machine learning methods generally fall into four categories, as shown in Figure 2: supervised learning, unsupervised learning, semisupervised learning, and reinforcement learning (RL). In RL algorithms, an agent is trained to learn optimal actions in a complex environment. The agent is trained in different ways and uses its training experience in subsequent actions. RL methods generally consist of model-free and model-based approaches. The model-free approaches can be divided into two subgroups: value-based and policy-based methods. The value-based algorithms are convenient to coordinate with meta-heuristic algorithms because they are model-free and policy-free, providing higher flexibility [36]. In the value-based RL approaches, the reinforcement agent learns from its actions and experience in the environment, such as through reward and penalty. The agent measures the success of an action in completing the task goal through the reward or penalty and then makes a decision based on its achievement. The Q-Learning method is one of the representative value-based RL methods. In Q-Learning, the agent takes random actions and then obtains a reward or penalty, and experience is gradually constructed from these actions. Throughout the process of building experience, a table called the Q-Table is maintained [37]. The agent considers all possible actions and updates its state according to the Q-Table values, selecting the action that maximizes the reward in the current state. Therefore, the agent's actions determine whether it explores or exploits the environment.
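To make the value-based idea concrete, a minimal Q-Learning loop on a toy two-state, two-action task might look like this (the states, actions, rewards, and hyperparameters here are invented for illustration and are not part of RLTLBO):

```python
import numpy as np

n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))   # Q-Table starts at zero
lr, gamma = 0.5, 0.9                  # learning rate and discount factor

def reward(state, action):
    # toy reward: action 1 is good in state 0, action 0 is good in state 1
    return 1 if action != state else -1

state = 0
rng = np.random.default_rng(0)
for _ in range(200):
    # epsilon-greedy: mostly exploit the Q-Table, sometimes explore
    action = rng.integers(n_actions) if rng.random() < 0.1 else int(np.argmax(Q[state]))
    r = reward(state, action)
    next_state = action               # toy transition
    Q[state, action] += lr * (r + gamma * np.max(Q[next_state]) - Q[state, action])
    state = next_state
```

After a few hundred updates, the Q-Table favors the rewarding action in each state, mirroring how the agent "builds experience" from reward and penalty.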
Compared to RL methods, meta-heuristic algorithms often require deep expert knowledge to balance their different phases. RL methods can help discover optimal parameter designs and more balanced strategies, allowing an algorithm to switch between the exploration and exploitation phases. Meta-heuristic methods usually operate with specific policies in certain situations, and thus their dynamism is lower than that of RL algorithms, especially value-based methods. The agent in value-based methods is online and performs beneficial actions through a reward-penalty mechanism without following any fixed policy. Many studies in the literature combine meta-heuristics and RL [38-44].

Random Opposition-Based Learning (ROBL).
Random opposition-based learning (ROBL), proposed by Long et al. in 2019 [46], is a variant of opposition-based learning (OBL) [45]. OBL is a powerful optimization tool that simultaneously considers the fitness of an estimate and its corresponding opposite estimate to obtain a better candidate solution. In contrast to the basic OBL, ROBL utilizes a random term to improve the OBL strategy, which is defined as follows:

x̄j = lj + uj − rand × xj,  (3)

where x̄j and xj indicate the opposite and original solutions, and uj and lj are the upper and lower bounds of the problem in the jth dimension. The opposite solution is randomly selected in the opposite half of the search space. This solution is not only opposite but also random, with a wider range of distributions. An example of ROBL solutions is shown in Figure 3. The opposite solution with a random term described by equation (3) is more stochastic than that of the basic OBL and can effectively help the algorithm jump out of local optima.

The Proposed RLTLBO Algorithm

New Learning Mode.
In the basic TLBO, students only learn from each other in the learner phase. However, in the actual learning process, how students learn from each other varies from person to person. Different students might choose different learning modes, such as formal communication, group discussion, or presentations. Moreover, students might adjust their learning mode according to their learning situation during the learning process. Therefore, in this paper, we introduce another learning mode to diversify the students' learning methods, which can be described as follows:

where Xr3 is the position of a learner randomly selected from the population, and t and T are the current and maximum numbers of iterations. In this learning mode, the effect of the teacher is introduced. Mutual learning between students is not always beneficial, and partial intervention by the teacher is more helpful to the students' improvement: students not only learn from each other but also ask the teacher for help. At the beginning of the iterations, the weight of mutual learning among students is larger, and the algorithm pays more attention to random learning, which maintains population diversity and increases global searchability. In the later iterations, students consult the teacher more and approach the teacher, enhancing the algorithm's local searchability.
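Since equation (4) is not reproduced here, the time-varying behavior described above can only be sketched under an assumption: the linear weight w = 1 − t/T and the way the mutual-learning and teacher terms are combined below are illustrative choices matching the qualitative description, not the paper's exact equation.

```python
import numpy as np

def learning_mode_2(x_old, x_teacher, x_r1, x_r2, x_r3, t, T):
    """New learning mode with teacher effect (illustrative form only).

    Early on (small t) the mutual-learning term dominates, preserving
    diversity; later (t close to T) the teacher term dominates, sharpening
    the local search. The linear weight w = 1 - t/T is an assumed form.
    """
    D = x_old.size
    w = 1.0 - t / T                                       # assumed time-decreasing weight
    mutual = np.random.rand(D) * (x_r1 - x_r2)            # random mutual learning
    teacher = np.random.rand(D) * (x_teacher - x_r3)      # guidance from the teacher
    return x_old + w * mutual + (1.0 - w) * teacher
```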

Learner Phase with RL Strategy.
To enable students to adjust their learning modes more effectively, Q-Learning in RL is introduced to switch between the two learning modes. The student uses the Q-Table values as a guide to decide between the learning modes, and the Q-Table is updated using a reward-penalty mechanism. The student selects the best state by calculating the benefit of each possible state and taking the learning mode with the highest Q-value for the next step. The student obtains a reward or a penalty according to its actions after each step. The general pattern of the RL agent and environment framework is shown in Figure 4.
In the Q-Learning method, a reward table, which users can provide, is used to reward or penalize the agent for its state-action compositions. The reward table in this work contains a positive (+1) or negative (−1) reward for each state-action couple. The Q-Table can be considered the agent's experience and is initialized to zero for all entries. The student then updates the Q-Table using the Bellman equation (5) and prepares the Q-Table for the next iteration [44].

Qt+1(st, at) = Qt(st, at) + λ[rt+1 + γ maxa Qt(st+1, a) − Qt(st, at)],  (5)

where st and st+1 indicate the current and next states, respectively, Qt and Qt+1 are the current Q-value and the estimated Q-value for the next state st+1, and at represents the current action. λ and γ are the learning rate and the discount factor, respectively, both numbers between 0 and 1. The learning rate determines how fast the algorithm learns and controls the convergence of the learning process. The discount factor defines how much the algorithm learns from mistakes and controls the importance of future rewards. rt+1 indicates the immediate reward or penalty the agent gets for taking the current action.
In each iteration, the agent uses equation (5) to calculate and weigh each possible state and action for the next step before choosing the best action (learning mode 1 or learning mode 2) with the highest likelihood of getting closer to the optimal solution. Examples of the reward table and Q-Table are displayed in Figure 5. This RL strategy helps establish a switching mechanism between the different learning modes in the learner phase and find the most suitable decision scheme. Four optional actions can occur, as listed below: (1) when the student is learning in learning mode 1, they decide to stay in learning mode 1; (2) when the student is learning in learning mode 2, they decide to stay in learning mode 2; (3) when the student is learning in learning mode 1, they decide to switch to learning mode 2; (4) when the student is learning in learning mode 2, they decide to switch to learning mode 1. The key value of the RL strategy is that it helps the algorithm switch between the learning modes as needed during the learner phase. For this reason, the algorithm can find better solutions faster and more effectively in the search space, considerably increasing search efficiency. Therefore, the convergence speed of the algorithm is improved effectively.
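The switching mechanism can be sketched as follows (an illustrative reduction to two modes; the +1/−1 rewards follow the text, while the specific λ and γ values are assumptions):

```python
import numpy as np

N_MODES = 2                       # learning mode 1 and learning mode 2
Q = np.zeros((N_MODES, N_MODES))  # Q-Table: rows = current mode, cols = next mode
LR, GAMMA = 0.1, 0.9              # learning rate (lambda) and discount factor (assumed values)

def choose_mode(current_mode):
    """Pick the next learning mode with the highest Q-value."""
    return int(np.argmax(Q[current_mode]))

def update_q(current_mode, next_mode, improved):
    """Bellman update of equation (5) with a +1/-1 reward-penalty."""
    r = 1 if improved else -1     # reward if the new fitness improved
    Q[current_mode, next_mode] += LR * (
        r + GAMMA * np.max(Q[next_mode]) - Q[current_mode, next_mode]
    )
```

In RLTLBO, `improved` would be determined by comparing the new and current fitness values after applying the chosen learning mode.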

The Details of RLTLBO.
In the improved TLBO algorithm, the teacher phase of the basic TLBO is carried out first. Then, the learner phase with the RL strategy is implemented to achieve an effective and efficient investigation of the search space. Finally, ROBL is added to enhance the local optima avoidance ability; the random opposite solution increases the probability of the algorithm finding a better solution. This variant of TLBO, which incorporates RL and ROBL, is named RLTLBO. The pseudocode and flowchart of the proposed RLTLBO algorithm are shown in Algorithm 1 and Figure 6, respectively.
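The ROBL step applied after each phase can be sketched as follows (following equation (3); the greedy selection between the original and opposite solutions and the boundary clipping are assumptions of this sketch):

```python
import numpy as np

def robl_step(X, f, objective, lower, upper):
    """Random opposition-based learning: x_opp_j = l_j + u_j - rand * x_j."""
    X_opp = lower + upper - np.random.rand(*X.shape) * X
    X_opp = np.clip(X_opp, lower, upper)          # keep opposite solutions in bounds
    f_opp = np.apply_along_axis(objective, 1, X_opp)
    better = f_opp < f                            # greedy selection: keep the better one
    X[better], f[better] = X_opp[better], f_opp[better]
    return X, f
```

Because the selection is greedy, the population fitness never worsens after this step.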

Computational Complexity Analysis.
RLTLBO mainly consists of three components: initialization, fitness evaluation, and position updating. In the initialization phase, the computational complexity of generating positions is O(N). Then, the computational complexity of fitness evaluation is O(2 × N) per iteration. Finally, ROBL is utilized to keep the algorithm from falling into local optima; thus, the computational complexity of position updating in RLTLBO is O(2 × N × D) per iteration, where D is the dimension size of the problem. Therefore, over T iterations, the total computational complexity of the proposed RLTLBO algorithm is O(N + 2 × N × T + 2 × N × D × T) ≈ O(N × T × D).

Numerical Experiments and Results
In this section, two different kinds of benchmark functions are used to evaluate the performance of the proposed RLTLBO algorithm. The standard benchmark functions are tested first to assess the algorithm on twenty-three simple numerical problems. Then, the CEC2017 benchmark functions are utilized to evaluate the algorithm on complex numerical problems. RLTLBO is compared with three types of existing algorithms: the classic methods TLBO and SSA; the recently proposed algorithms HOA [31], AO, and HHO [30]; and the improved algorithms mGWO [32], MALO [33], and DSCA [34]. For the consistency of all tests, we set the population size to N = 30, the dimension size to D = 30, and the maximum number of iterations to T = 500. All algorithms are run 30 times independently, and the average values and standard deviations are presented as the final experimental results. All experiments are implemented in MATLAB R2020b on a PC with an Intel(R) Core(TM) i5-9500 CPU @ 3.00 GHz and 16 GB of RAM running Windows 10.

Standard Benchmark Function Experiments.
Standard benchmark functions [47] can be divided into three types: unimodal, multimodal, and fixed-dimension multimodal functions. Unimodal functions have only one global optimum and no local optima, which makes them suitable for evaluating an algorithm's convergence rate and exploitation capability. Multimodal and fixed-dimension multimodal functions have one global optimum and multiple local optima; this characteristic makes them effective for testing the exploration and local optima avoidance abilities of an algorithm.
The benchmark function details are listed in Tables 1-3. For the multimodal and fixed-dimension multimodal functions F8-F23, it can be seen from Table 4 that RLTLBO achieves the smallest average values and standard deviations on 12 of the 16 test functions compared to the other methods, which indicates very high accuracy and stability. Several poorer results appear on F8 and F12-F14, but they are not the worst results. The satisfying results on the multimodal and fixed-dimension multimodal functions prove that the exploration and local optima avoidance capabilities of RLTLBO are excellent, which may derive from the ROBL strategy. Figure 7 provides the convergence curves of RLTLBO and the comparative algorithms on the 23 standard benchmark functions. The convergence rate reflected by these curves shows the improvement in exploration and exploitation more intuitively. For F1-F4, F7, F9-F11, and F15-F21, RLTLBO presents a faster convergence speed than the other meta-heuristic algorithms, and its convergence accuracy is also the best. RLTLBO ranks second in terms of convergence speed for F22 and F23. For benchmark functions F5-F6, F8, and F12-F14, RLTLBO does not perform very well, consistent with the results in Table 4.

The Wilcoxon Test.
The Wilcoxon rank-sum test [48] results are listed in Table 5, which assesses the statistical performance differences between the RLTLBO algorithm and the comparative algorithms. A p-value less than 0.05 indicates a substantial difference between the two compared methods. The overwhelming majority of the p-values in Table 5 are less than 0.05, indicating statistically substantial differences between RLTLBO and the other methods. Combined with the results in Table 4, it can be concluded that the RLTLBO algorithm outperforms the others. These competitive results indicate that RLTLBO has high exploration and exploitation capabilities. In summary, the RLTLBO algorithm provides better results than the other comparative algorithms. For the CEC2017 test functions, the Wilcoxon rank-sum test results are listed in Table 8. Only seven p-values are greater than 0.05 across all test functions, which indicates considerable differences between RLTLBO and the compared methods. These results suggest that RLTLBO can achieve great results on complex problems as well.

Experiments on Industrial Engineering Design Problems
In this section, eight well-known constrained industrial engineering design problems, including the welded beam design problem, pressure vessel design problem, tension/compression spring design problem, speed reducer design problem, three-bar truss design problem, car crashworthiness design problem, tubular column design problem, and frequency-modulated sound wave design problem, are solved to further verify the performance of the proposed RLTLBO algorithm. The results of RLTLBO are compared to various optimization methods proposed in previous studies.

Welded Beam Design Problem.
The purpose of this problem is to minimize the cost of the welded beam (Figure 8). Four variables need to be optimized: the thickness of the weld (h), the thickness of the bar (b), the length of the bar (l), and the height of the bar (t). The objective is to minimize f(z) = 1.10471 × z1² × z2 + 0.04811 × z3 × z4 × (14.0 + z2), subject to the problem's constraints, with each variable within its prescribed range. RLTLBO is compared to the SMA [50], WOA, MPA [51], MVO [52], GA, and HS [53] methods. The comparison results presented in Table 9 show the superiority of the RLTLBO algorithm, which obtains a smaller cost than the other algorithms.
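The cost function above can be written directly in code (only the objective is shown since the constraint expressions are not reproduced here; the mapping z = (h, l, t, b) to z1..z4 is the conventional one for this problem and is an assumption of this sketch):

```python
def welded_beam_cost(z):
    """Welded beam fabrication cost: 1.10471*h^2*l + 0.04811*t*b*(14.0 + l)."""
    h, l, t, b = z
    return 1.10471 * h**2 * l + 0.04811 * t * b * (14.0 + l)
```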

Pressure Vessel Design Problem.
The objective of this problem is to minimize the fabrication cost of a cylindrical pressure vessel that meets the pressure requirements. As shown in Figure 9, four structural parameters need to be optimized: the thickness of the shell (Ts), the thickness of the head (Th), the inner radius (R), and the length of the cylindrical section without the head (L). The problem is subject to four optimization constraints.
From the results in Table 10, it is obvious that RLTLBO can obtain superior optimal values compared to AO, SMA, WOA, GWO, MVO, GA, and ES [54].

Tension/Compression Spring Design Problem.
This problem aims to minimize the weight of the tension/compression spring (Figure 10). Three variables need to be optimized: the wire diameter (d), the number of active coils (N), and the mean coil diameter (D). RLTLBO is compared to the AO, SSA, WOA, GWO, PSO, GA, and HS algorithms. The results listed in Table 11 show that RLTLBO obtains the best weight among all the algorithms.

Speed Reducer Design Problem.
Compared to AO, PSO, AOA, GA, SCA [55], HS, and FA [56], RLTLBO achieves better results in the speed reducer problem, as shown in Table 12.

Three-Bar Truss Design Problem.
The three-bar truss design problem aims to minimize the weight of a truss with three bars by controlling the cross-sectional areas of the three bars (A1, A2, and A3) (Figure 12). Three main constraints need to be satisfied: deflection, stress, and buckling. The results of RLTLBO are listed in Table 13 and compared to AO, SSA, AOA, MVO, and GOA [57]. It can be observed that RLTLBO outperforms the other algorithms in the literature.

Car Crashworthiness Design Problem.
The car crashworthiness design problem aims to minimize the weight by optimizing eleven variables [58]: the thicknesses of the B-pillar inner (x1), B-pillar reinforcement (x2), floor side inner (x3), cross members (x4), door beam (x5), door beltline reinforcement (x6), and roof rail (x7); the materials of the B-pillar inner (x8) and floor side inner (x9); the barrier height (x10); and the barrier hitting position (x11). RLTLBO, DE, GA, FA, CS [59], GOA, and EOBL-GOA [58] are applied to solve the car crashworthiness problem. As shown in Table 14, the proposed RLTLBO achieves the best result compared to the other methods.

Tubular Column Design Problem.
The main intention is to find the minimum cost of a uniform column with a tubular section that can carry a compressive load P = 2,500 kgf. The column is made of a material with a yield stress (σy) of 500 kgf/cm², a modulus of elasticity (E) of 0.85 × 10^6 kgf/cm², and a density (ρ) of 0.0025 kgf/cm³. The length (L) of the column is 250 cm. The cost of the column consists of material and construction costs. This problem is shown in Figure 13.

Frequency-Modulated Sound Waves Design Problem.
This problem aims to optimize the parameters of a frequency-modulated (FM) sound wave synthesizer in six dimensions [60]. The sound wave is parameterized as X = (a1, ω1, a2, ω2, a3, ω3), where ai (i = 1, 2, 3) are the amplitudes and ωi (i = 1, 2, 3) are the angular frequencies. This problem has a minimum value of f(Xsol) = 0. The objective function is calculated from the squared errors between the target wave and the estimated wave.
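The squared-error objective can be sketched as follows (the nested-sine wave form and the target parameter vector below are the ones commonly used for this benchmark and are assumptions here, since the exact model is not reproduced in the text):

```python
import numpy as np

THETA = 2 * np.pi / 100
# target wave parameters commonly used for this benchmark (assumed here)
TARGET = (1.0, 5.0, -1.5, 4.8, 2.0, 4.9)

def fm_wave(params, t):
    """FM sound wave: a1*sin(w1*t*theta + a2*sin(w2*t*theta + a3*sin(w3*t*theta)))."""
    a1, w1, a2, w2, a3, w3 = params
    return a1 * np.sin(w1 * t * THETA
                       + a2 * np.sin(w2 * t * THETA
                                     + a3 * np.sin(w3 * t * THETA)))

def fm_objective(X):
    """Sum of squared errors between the estimated and target waves over t = 0..100."""
    t = np.arange(101)
    return float(np.sum((fm_wave(X, t) - fm_wave(TARGET, t)) ** 2))
```

By construction, the objective reaches its minimum of 0 at the target parameter vector.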
RLTLBO is compared with the GWO, MFO [61], PSO, TSA [62], and FFA [63] algorithms, and the comparison results are listed in Table 16. The proposed method clearly finds a much better solution than the comparative algorithms.
In general, the excellent performance in solving industrial engineering design problems suggests that RLTLBO can be widely used in real-world optimization problems.

Conclusion
This study presents an improved teaching-learning-based optimization algorithm (RLTLBO) that incorporates reinforcement learning (RL) and random opposition-based learning (ROBL) strategies. To address the insufficient learning process of the basic algorithm, a new learning mode is proposed in the learner phase. The two learning modes, the new mode and the inherent one, are switched through the Q-Learning mechanism in RL. This mechanism helps the individuals learn thoroughly, accelerating the convergence of RLTLBO. To improve the local optima avoidance ability, the ROBL strategy is appended after the teacher and learner phases. The proposed RLTLBO algorithm is tested on 23 standard and eight CEC2017 benchmark functions to analyze its search performance. Experimental results illustrate competitive performance compared to other state-of-the-art meta-heuristic algorithms. To further verify the superiority of RLTLBO, eight industrial engineering design problems are solved; the results are also very competitive with those of the other comparative algorithms. The code for RLTLBO is provided at https://github.com/WangShuang92/RLTLBO and can be used for more practical problems. However, the algorithm still suffers from premature convergence on several benchmark functions, which can be studied in the future. Moreover, RLTLBO can currently only solve single-objective problems; binary and multiobjective versions can be considered in future research. More applications of this algorithm in different fields are also valuable directions, including text clustering, scheduling problems, appliance management, parameter estimation, feature selection, text classification, image segmentation, network applications, and sentiment analysis.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
On behalf of all authors, the corresponding author states that there are no conflicts of interest.