Improved Q-Learning Method for Multirobot Formation and Path Planning with Concave Obstacles

Aiming at the formation and path planning of multirobot systems in an unknown environment, a path planning method for multirobot formation based on improved Q-learning is proposed. Based on the leader-following approach, the leader robot uses an improved Q-learning algorithm to plan the path, and the follower robot achieves a tracking strategy based on the gravitational potential field (GPF) by designing a cost function to select actions. Specifically, to improve Q-learning, the Q-values are initialized with environmental guidance from the target's GPF. Then, a virtual obstacle-filling avoidance strategy is presented, which fills non-obstacle grids that are judged to tend toward concave obstacles with virtual obstacles. Besides, the simulated annealing (SA) algorithm, whose control temperature is adjusted in real time according to the learning progress of Q-learning, is applied to improve the action selection strategy. The experimental results show that the improved Q-learning algorithm reduces the convergence time by 89.9% and the number of convergence rounds by 63.4% compared with the traditional algorithm. With the help of this method, multiple robots have a clear division of labor and quickly plan a globally optimized formation path in a completely unknown environment.


Introduction
As robots become more widely used across industries, a single robot is often not competent for complex tasks. Therefore, multirobot formation [1] and path planning have become research hotspots, with good applications [2,3] in collaborative search, exploration, handling, rescue, and group operations. Path planning of a multirobot formation requires multiple robots to form a formation and maintain this positional relationship while moving to the target. It is necessary not only to avoid obstacles safely but also to find a better path. In addition, compared with the simpler case of path planning in a known environment, an unknown environment places higher demands on the robots' path-planning ability. There have been many implementation methods for multirobot formation, including the behavior-based method [4], the virtual structure method [5], and the leader-following method [6]. The behavior-based method designs sub-behaviors in advance and chooses which behavior to execute according to changes in the situation, but its accuracy is insufficient to integrate various behaviors in a complex environment. The virtual structure method regards the formation as a fixed rigid structure and cannot effectively avoid obstacles. The leader-following method, with the advantage of a simple and flexible structure, mainly realizes collaboration by sharing the leader's information. For a robot's path planning, the A* algorithm [7] and reinforcement learning (RL) algorithms [8] are commonly used for global path planning; the former can effectively solve for the optimal path but requires all environmental information in advance, while the latter can learn autonomously in the environment but takes more time. The artificial potential field (APF) method [9] is widely used in local path planning; it can cope with an environment changing in real time but lacks global planning ability.
For the problem of multirobot formation and path planning, Chen et al. [10] proposed a new leader-following control framework by introducing the RH method, which enables fast convergence of a formation task for a group of mobile robots. Building on single-robot path planning, Sruthi et al. [11] designed a nonlinear tracking controller to achieve multirobot formation. These two methods require rigorous system modeling and cumbersome theory, which limits their practical application. Mixing formation control with leader-following and priority methods, Sang et al. [12] used the MTAPF algorithm with an improved A* algorithm for path planning. Das and Jena [13] implemented collision-free path planning for multiple robots using an improved particle swarm algorithm and evolutionary operators. Qu et al. [14] used a modified genetic algorithm with an added co-evolution mechanism to plan paths for multiple robots. Lazarowska [15] used the discrete APF to find crash-free paths for robots in dynamic and static environments. Some of the above methods cannot operate in an unknown environment, and some cannot plan an optimal path.
At present, Q-learning is a widely applied reinforcement learning algorithm. Its limitation is that it is trial-and-error learning, which requires constant iteration and is time-consuming. Thus, it needs to be improved to quickly plan a globally optimal path. Maoudj and Hentout [16] initialized the Q-table with a distance-based reward function to accelerate convergence. Soong et al. [17] integrated prior knowledge gained from FPA into traditional Q-learning, which provided a good exploration basis for accelerating the learning of mobile robots. Xu and Yuan [18] increased the robot's step length and movement directions to plan a fast and smooth path. Oh et al. [19] specified the initial Q-values of traditional Q-learning through fuzzy rule-based Q-learning, which sped up learning and stabilized convergence. Yan and Xiang [20] initialized the Q-table using the inverse Euclidean distance from the current position to the goal position, which improved the efficiency of Q-learning. The above methods all initialize the Q-values simply from some prior information, without considering the avoidance of concave obstacles or the adjustment of the action selection strategy.
In summary, there are still many difficult problems in the formation and path planning of multiple robots in unknown environments. In this paper, we adopt the leader-following approach to study the multirobot dynamic formation problem. The innovation in this paper is as follows: The improved Q-learning algorithm is presented to plan paths, in which environmental guidance and virtual obstacle-filling avoidance strategy are added to accelerate convergence and the SA algorithm is applied to improve the action selection strategy; the follower robot can achieve the tracking strategy of GPF by designing the cost function to select actions.

Q-Learning Algorithm
The Q-learning algorithm [21] is an RL algorithm based on temporal difference, which combines the Monte Carlo sampling method and the bootstrapping idea of dynamic programming. It is described with the Markov decision process as follows. Firstly, a finite state space and action space are given. When the robot needs to accomplish a certain task, it selects and performs an action in the current state, which interacts with the environment. Then, the robot enters the next state and is given an instant reward as feedback by the environment. Finally, the value function is updated according to the update rule using the reward passed back to it. One round consists of continuing the above process until the robot reaches the target, and rounds are iterated until the cumulative reward is maximized. The update equation of the Q-value function is

Q(s_t, a_t) = Q(s_t, a_t) + α[r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)],

where s_t is the state and a_t is the action at the current time t, s_{t+1} is the state and a is an action at the next time t + 1, r_t is the reward obtained by performing action a_t at state s_t, α ∈ (0, 1) is the learning rate, and γ ∈ (0, 1) is the discount factor. In order to ensure the exploratory nature of the algorithm, the ε-greedy strategy is usually adopted: with probability 1 − ε the action that maximizes the value function is selected, and a small probability ε is still reserved for random exploration. The mathematical equation of the strategy is

π(s_t) = a_random, if δ ≤ ε;  π(s_t) = arg max_a Q(s, a), otherwise,

where π(s_t) is the selected strategy, ε ∈ (0, 1) is the greedy factor, δ ∈ (0, 1) is a random number, a_random is a randomly selected action, and arg max_a Q(s, a) is the action that maximizes the Q-value at state s. The classical Q-learning algorithm is described in Algorithm 1.
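The temporal-difference update and the ε-greedy selection described above can be sketched concretely in Python; the state representation, action set, and parameter values here are illustrative assumptions, not the paper's exact settings.

```python
import random
from collections import defaultdict

ACTIONS = ["up", "down", "left", "right"]
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2   # α, γ, ε (illustrative values)

Q = defaultdict(float)  # Q[(state, action)], zero-initialized as in the classical algorithm

def epsilon_greedy(state):
    """With probability ε select a random action; otherwise exploit the best one."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(s, a, r, s_next):
    """Q(s,a) <- Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]."""
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
```

Repeating `epsilon_greedy` and `q_update` until the target state is reached constitutes one round of the classical algorithm.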

APF Method.
The APF method is a virtual potential field established artificially, consisting of the GPF and the repulsive potential field. The target generates a gravitational force on the robot to make the robot move towards it. The GPF function is

U_gra(q) = (1/2) ξ d²(q, q_goal),

where ξ is the GPF factor, q is the state position of the robot, U_gra(q) is the GPF at q, q_goal is the state position of the target to be reached by the robot, and d(q, q_goal) is the distance between the robot and the target, which can be measured by specific sensors in practice and is one-dimensional. Gravitation is the negative gradient of the GPF, and the gravitational function is defined as

F_gra(q) = −∇U_gra(q) = ξ (q_goal − q).

Journal of Sensors
Obstacles generate a repulsive force on the robot to make the robot move away from them. The repulsive potential field function is

U_rep(q) = (1/2) η (1/d(q, q_obst) − 1/d_0)², if d(q, q_obst) ≤ d_0;  U_rep(q) = 0, if d(q, q_obst) > d_0,

where η is the repulsive potential field factor, q is the state position of the robot, U_rep(q) is the repulsive potential field at q, q_obst is the obstacle state position, and d(q, q_obst) is the distance between the robot and the obstacle, which can be measured by specific sensors in practice and is one-dimensional. d_0 is the influence radius of the obstacle, which is set artificially according to the experimental requirements. Repulsion is the negative gradient of the repulsive potential field, and the repulsion function is defined as

F_rep(q) = −∇U_rep(q) = η (1/d(q, q_obst) − 1/d_0) (1/d²(q, q_obst)) ∇d(q, q_obst), if d(q, q_obst) ≤ d_0, and 0 otherwise.

Therefore, the traditional APF method guides the robot's direction of movement by the combined force of gravitation and repulsion, but it has the following shortcomings: (1) when the robot is far from the target, the gravitational force is much greater than the repulsive force, and the robot may hit an obstacle; (2) when the robot is close to the target, nearby obstacles repel the robot too strongly for it to reach the target; (3) when the two forces exactly cancel out, local optima or oscillation may appear. Because of these shortcomings, the APF method generally cannot be used directly and needs to be improved.
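The two potentials above can be sketched as follows; the quadratic attractive form and the bounded repulsive form are the standard APF choices, and the factor values and coordinates are illustrative assumptions.

```python
import math

XI, ETA, D0 = 1.0, 100.0, 2.0   # ξ, η, and influence radius d0 (illustrative)

def dist(q, q_ref):
    """Euclidean distance between two planar positions."""
    return math.hypot(q[0] - q_ref[0], q[1] - q_ref[1])

def u_gra(q, q_goal):
    """Attractive (gravitational) potential: U = 0.5 * ξ * d²."""
    return 0.5 * XI * dist(q, q_goal) ** 2

def u_rep(q, q_obst):
    """Repulsive potential, active only within the influence radius d0."""
    d = dist(q, q_obst)
    if d > D0:
        return 0.0
    return 0.5 * ETA * (1.0 / d - 1.0 / D0) ** 2
```

Note how `u_rep` vanishes outside `D0`, so distant obstacles exert no influence, while `u_gra` grows with distance, pulling the robot toward the goal.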

SA Algorithm.
The idea of the SA algorithm comes from the solid annealing process; it is an algorithm that jumps out of local optima to obtain the global optimum. The algorithm uses a temperature parameter T to control convergence in finite time. Firstly, the initial temperature and the end temperature are set. The algorithm starts from the initial state and takes it as the current state. Then, it generates a new state in its neighborhood and determines whether to accept the new state based on the Metropolis criterion. This generation of new states iterates while T decays, until T reaches the end temperature. Finally, the algorithm ends with a globally approximate optimal solution.
The Metropolis criterion is that when a system enters a state s_new due to a certain change from state s_old, the energy of the system correspondingly changes from E(s_old) to E(s_new), and the accepted probability of the transition from s_old to s_new is

P(s_old → s_new) = 1, if E(s_new) ≤ E(s_old);  P(s_old → s_new) = e^(−(E(s_new) − E(s_old))/T), otherwise.

begin
  Initialization: Q(s, a) = {0}, ∀s ∈ S, a ∈ A(s) = {up, down, left, right}  % Initialize Q-values to 0; determine the state set and the action set of four actions
  for (episode < m)  % The episode cannot exceed m, the maximum number of episodes
    Given initial state s_0;
    while (s_t ≠ target state)
      (1) Select an action a_t at state s_t according to ε-greedy;  % ε-greedy is the action selection strategy
      (2) Execute action a_t, enter state s_{t+1}, and get reward r_t;  % Get the immediate reward by interacting with the environment
      (3) Update Q(s_t, a_t) using Q(s_t, a_t) = Q(s_t, a_t) + α[r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)];  % Update the value function according to the update equation
      (4) s_t ← s_{t+1};  % Update state
    end-while
    episode = episode + 1;  % Update episode
  end-for
end
Algorithm 1: Classical Q-learning algorithm.

When E(s_new) ≤ E(s_old), the new state is accepted as the current state. When E(s_new) > E(s_old), if e^(−(E(s_new) − E(s_old))/T) > δ, the new state is accepted as the current state; otherwise, the new state is not accepted and the system remains in the current state. δ ∈ (0, 1) is a randomly generated number.
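The Metropolis acceptance rule above can be sketched as a small helper; the function name and energy values are illustrative.

```python
import math
import random

def metropolis_accept(e_old, e_new, T):
    """Always accept an improvement; accept a worse state with
    probability exp(-(E_new - E_old)/T), compared against a random δ in (0, 1)."""
    if e_new <= e_old:
        return True
    return math.exp(-(e_new - e_old) / T) > random.random()
```

At high T almost any new state is accepted (broad exploration); as T decays toward the end temperature, worse states are rejected almost surely and the search settles into an approximate optimum.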

Improved Q-Learning Proposed for Path Planning of Leader Robot

Environmental Guidance Based on GPF of Target.
The traditional Q-learning algorithm has no prior knowledge. In the early learning process, the robot's aimless exploration causes many invalid iterations and slow convergence, so the idea of the APF method is introduced to guide the movement. In this paper, the robot plans a path in an unknown environment where only the start and the target of the task are known. Because of the limited environmental information and the shortcomings of the traditional APF method, only the GPF of the target is introduced to initialize the Q-values, without considering the effect of the repulsive potential field. In order to make the target direction consistent with the direction of increasing Q-value, the GPF function is constructed as

U′_gra(s) = ξ · d_aim(s) / (d_aim(s) + η),

where ξ is the GPF factor, which is negative so that the value increases as the distance decreases, d_aim(s) is the distance from the current position to the target, and η is a positive constant that prevents the denominator from being 0. When the robot moves, the instant reward is detected by sensors and the Q-table is initialized at the same time, so the instant reward of environmental information is added to the Q-value initialization. The purpose of RL is to maximize the cumulative reward by maximizing the Q-value. The robot always tends to choose the action with the maximum Q-value, which guides the robot toward the target while avoiding obstacles. The mathematical equation of the Q-value initialization with environmental guidance based on the GPF of the target is

Q(s, a) = k_q (r_q + γ U′_gra(s′)),

where s′ is the next state reached by performing action a at state s, r_q = 1 at the target and r_q = 0 elsewhere, k_q is the scale coefficient adjusted according to the actual algorithm, γ is the discount factor, and U′_gra(s′) is the GPF at state s′.

Virtual Obstacle-Filling Avoidance Strategy.
There will be concave obstacles in a more complex environment. The traditional Q-learning algorithm can escape from such obstacles through continuous exploration, which greatly extends the learning time. In addition, after adding GPF guidance, the robot is even more likely to fall into concave obstacles and be unable to escape. In the grid map environment, the obstacle grids are the infeasible area and the rest are feasible areas. Setting certain key grids, which are feasible but lie on paths possibly tending toward concave obstacles, as infeasible areas can effectively fill and avoid concave obstacles. Therefore, a virtual obstacle-filling avoidance strategy is established for concave obstacles. The strategy judges whether the current grid possibly tends toward a concave obstacle by adding real-time detection information based on the target tendency before the robot takes the next step. Then, it fills non-obstacle grids on the path possibly tending toward the concave obstacle with virtual obstacles until the concave shape is filled. The filled concave obstacle as a whole becomes an infeasible area, so the robot will not fall into it in subsequent iterations. This strategy makes full use of the sensors and the environmental information that has been learned. It not only prevents the robot from falling into concave obstacles but also reduces invalid exploration of infeasible positions, which improves the efficiency of path planning. The specific implementation of the virtual obstacle-filling avoidance strategy is as follows.
Firstly, the sensor is used to detect the position status and distance in real time, and a current position-action array is established to store the feasible adjacent positions of the current position. Before the robot moves, the Euclidean distance from each of the 3 × 3 grid positions adjacent to the robot's current position to the target position is calculated in turn. The following steps then judge whether the current grid possibly tends toward a concave obstacle according to the calculated distances.
If the distance is less than the distance from the current position to the target position, it is further judged whether this adjacent position is an obstacle. If the adjacent position is not an obstacle, it is feasible and is added to the corresponding position of the current position-action array.
If the adjacent position is farther from the target, or it is judged to be an obstacle, it is an infeasible position and is not added to the array.
If the final current position-action array is empty, it indicates that the current position completely tends to infeasible areas which may be in a concave obstacle. The current position will be filled with a virtual obstacle.
Finally, each step of the robot is judged until concave obstacles are filled.
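The judgement steps above can be sketched as a single check applied before every move; the grid coordinates, target cell, and obstacle set are illustrative assumptions.

```python
import math

TARGET = (19, 19)  # assumed target cell

def d(a, b):
    """Euclidean distance between two grid cells."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def fill_if_trapped(pos, obstacles):
    """Build the current position-action array of feasible neighbours that are
    strictly nearer the target; if it ends up empty, mark pos as a virtual
    obstacle (return True), otherwise leave it feasible (return False)."""
    feasible = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            if dx == dy == 0:
                continue
            nb = (pos[0] + dx, pos[1] + dy)
            # keep only neighbours nearer the target and not (virtual) obstacles
            if d(nb, TARGET) < d(pos, TARGET) and nb not in obstacles:
                feasible.append(nb)
    if not feasible:          # position-action array is empty
        obstacles.add(pos)    # fill the current cell with a virtual obstacle
        return True
    return False
```

Calling this check at every step gradually fills a concave pocket from the inside out, until the whole concavity is treated as one infeasible block.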
One-step filling of the virtual obstacle-filling avoidance strategy is shown in Figure 1. In the figure, the red grid is the robot, and the yellow grid is the target. As Figure 1(a) shows, the robot enters the grey concave obstacle during the path planning process. According to the distance calculation, the adjacent positions down, right, and lower right of the robot's current position are determined to be nearer the target; they are the dark grey grids in Figure 1(b). These three adjacent positions are further judged to be obstacles, i.e., infeasible positions, indicating that the current position tends completely toward the infeasible area and may be inside a concave obstacle. Thus, the current position is filled with a light grey virtual obstacle, as seen in Figure 1(c).

Action Selection Strategy Improved by SA.
In the process of path planning with Q-learning, the robot expands its range of movement by exploring the environment and accumulates knowledge of environmental rewards and punishments. Finally, it uses the value function to select the optimal action. In the robot's iterative learning process, more exploration is required in the early stage, but too much or too long exploration greatly extends the learning time and reduces learning efficiency. Conversely, too little exploration leads to insufficient experience, and the finally selected action may be sub-optimal. Thus, it is necessary to balance exploration and exploitation. The traditional ε-greedy strategy often used in the Q-learning algorithm balances exploration and exploitation to a certain extent by setting ε. However, the fixed greedy factor makes random actions be selected with the same probability each time throughout learning, which causes slow convergence and fluctuations after convergence. Therefore, the greedy factor needs to be adjusted dynamically as learning proceeds. One method commonly used in experiments is to decrease ε at a fixed rate, but a fixed decrease rate set from experience is not universal.

begin
  Initialization: empty Q-table; state set S; action set A(s); maximum number of episodes m
  for (episode < m)
    If episode % n == 0 Then use ε = e^((Q(s, a_random) − Q(s, a_max) − q)/T) to calculate and update ε;  % Adjust ε dynamically using T of the SA algorithm
    Given initial state s_0;
    while (s_t ≠ target state)
      (1) If s_t exists in the Q-table then continue to the next step; Else use Q(s, a) = k_q (r_q + γ ξ d_aim(s′)/(d_aim(s′) + η)) to initialize Q(s_t, a);  % Initialize the Q-table
      (2) If (s_t, a) leads to a feasible area towards the target then add it to the corresponding position of current(s_t, a); Else keep the corresponding position of current(s_t, a) empty;  % Store the feasible adjacent positions in the current position-action array
      (3) If current(s_t, a) is empty then s_t tends completely toward the infeasible area of a possible concave obstacle: fill it with a virtual obstacle; Select action a_t in state s_t according to ε-greedy improved by SA;  % Fill concave obstacles using the virtual obstacle-filling avoidance strategy while selecting actions
      (4) Execute a_t in s_t, enter s_{t+1}, and get r_t;
      (5) Update Q(s_t, a_t) using Q(s_t, a_t) = Q(s_t, a_t) + α[r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)];
      (6) s_t ← s_{t+1};
    end-while
    episode = episode + 1;
  end-for
end
Algorithm 2: Improved Q-learning algorithm.
In response to the above problems, the ε-greedy strategy improved by SA, which adjusts ε dynamically, is proposed. The control temperature of SA is adjusted in real time according to the learning situation of the Q-learning algorithm. The algorithm explores as much as possible in the early stage of path planning to gain more prior knowledge and prevent local optima, and cancels unnecessary exploration when approaching convergence. The steps of the action selection strategy improved by the SA algorithm are as follows:

(1) Define the temperature parameter T and set the initial value T_0. Then, use the sample standard deviation of the step numbers over n consecutive iterations to control the cooling temperature. The mathematical equation of T is

T = k · sqrt( Σ_{j=1}^{n} (step_{m+j} − step_avg)² / (n − 1) ) + i,

where step_{m+1}, …, step_{m+n} are the numbers of steps of n consecutive iterations, step_avg is their average, k is the control factor, obtained by repeated adjustment according to the experimental effect, which keeps T in a suitable range, and i is a small non-zero constant that prevents T from being 0 after convergence.

(2) Calculate the accepted probability of a randomly selected action according to the Metropolis criterion, and use it to redefine the greedy factor ε at temperature T. The mathematical equation of ε is

ε = e^((Q(s, a_random) − Q(s, a_max) − q)/T),

where Q(s, a_random) is the Q-value of the random action selected at state s, Q(s, a_max) is the Q-value of the optimal action at state s, q is a non-zero constant that prevents the numerator from being 0, and T is the temperature parameter.

The improved algorithm differs from the original in three respects. Firstly, the Q-table of the original Q-learning algorithm is initially a zero-value table without any prior knowledge. The improved Q-learning algorithm uses the GPF of the target known in the task to initialize the Q-table, which adds environmental guidance and reduces invalid exploration.
Secondly, in the original algorithm the robot moves immediately after selecting an action. The improved algorithm applies the virtual obstacle-filling avoidance strategy for judgment before each step and fills non-obstacle grids that are judged to tend toward concave obstacles with virtual obstacles.
Finally, the original algorithm uses the traditional ε-greedy strategy to select actions. The strategy improved by the SA algorithm is proposed in the new algorithm. It adjusts ε dynamically by adjusting the temperature in real time according to the learning situation of Q-learning.
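The SA-controlled greedy factor can be sketched as follows: T tracks the sample standard deviation of recent episode step counts, and ε follows the Metropolis-style expression above; the constants k, i, and q take the values used later in the experiments, and the step-count lists are illustrative.

```python
import math
import statistics

K, I, Q_CONST = 0.03, 0.1, 1.0  # control factor k, floor constant i, constant q

def temperature(recent_steps):
    """T = k * stdev(step_{m+1}..step_{m+n}) + i; i keeps T > 0 after convergence."""
    return K * statistics.stdev(recent_steps) + I

def greedy_factor(q_random, q_max, T):
    """ε = exp((Q(s, a_random) - Q(s, a_max) - q) / T); since the exponent is
    negative, ε stays in (0, 1) and shrinks as T cools."""
    return math.exp((q_random - q_max - Q_CONST) / T)
```

Early in learning, the step counts fluctuate widely, so T is high and ε stays large (more exploration); near convergence, the step counts stabilize, T falls toward i, and ε collapses toward 0 (pure exploitation).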
The steps of the improved Q-learning algorithm are shown in Algorithm 2.

Tracking Strategy Based on GPF for Follower Robot.
The steps of the tracking strategy based on GPF for the follower robot are as follows: Step 1: if the follower robot obtains the coordinates broadcast by the leader robot, it will determine the next target state according to the formation, i.e., the desired target position at this time. Otherwise, it means that the formation has reached the target position and the path planning ends.
Step 2: the follower robot moves to the target position. Firstly, the robot uses the cost function to calculate the cost of each candidate action and then selects the action with the minimum cost. The cost function equation is

C(s_t, a_t) = d_attr + c · R_static(s_t, a_t), with d_attr = sqrt((x_s − x_goal)² + (y_s − y_goal)²),

where d_attr is the GPF which is measured, x_s is the horizontal coordinate of the current state, y_s is the vertical coordinate of the current state, x_goal and y_goal are the horizontal and vertical coordinates of the target at this moment, C(s_t, a_t) is the cost function at time t, s_t is the state at t, a_t is the action at t, c is the adjustment coefficient, and R_static(s_t, a_t) is the penalty function.
Step 3: if the state with the minimum cost entered is the target state at this time, return to step 1 and continue. If the state is not the target state at this time, go to step 2 and continue.
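Since the printed form of the cost function is partly garbled in the source, the sketch below assumes C(s_t, a_t) combines the measured Euclidean distance to the desired formation position with the penalty term c · R_static (c = 0.09 and R_static = 1 as in the later experiments); the eight-action grid and obstacle set are illustrative.

```python
import math

C_COEF, R_STATIC = 0.09, 1.0  # adjustment coefficient c and static penalty
ACTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
           (0, 1), (1, -1), (1, 0), (1, 1)]  # eight grid actions

def cost(nxt, goal):
    """Assumed form C(s_t, a_t) = d_attr + c * R_static, with d_attr the
    Euclidean distance from the successor cell to the desired position."""
    d_attr = math.hypot(nxt[0] - goal[0], nxt[1] - goal[1])
    return d_attr + C_COEF * R_STATIC

def follow_step(s, goal, blocked):
    """Pick the feasible neighbouring cell with the minimum cost (step 2)."""
    feasible = [(s[0] + dx, s[1] + dy) for dx, dy in ACTIONS
                if (s[0] + dx, s[1] + dy) not in blocked]
    return min(feasible, key=lambda nxt: cost(nxt, goal))
```

Iterating `follow_step` until the desired position is reached, then requesting the leader's next broadcast, mirrors the loop of steps 1-3.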

Design Scheme of Path Planning for Leader-Following Formation.
Adopting the leader-following method, the design scheme of path planning for leader-following formation proposed in this paper includes three parts. (1) Initialization: the grid environment is adopted, and the starting and target positions of the multiple robots are determined. A leader-following formation is designed and the robots are divided into two types: leader and follower. One robot is selected as the leader, or a virtual robot is assumed to act as the leader, and the rest are follower robots. Each robot has eight actions: up, down, left, right, upper left, upper right, lower left, and lower right. Each robot is equipped with a sensor that can detect the environmental information of the 3 × 3 grids centered on its position. The multirobot actions, step length, and detection range of the sensor are shown in Figure 2. (2) Path planning of the leader robot: the leader robot is responsible for planning the path. It uses the improved Q-learning algorithm to plan a globally optimal path, avoiding simple and concave obstacles, after trial-and-error training. At the same time, it broadcasts the position of each step and some environmental information to the follower robots. The process of the leader robot's path planning is shown in Figure 3. (3) Local following of the follower robot: the follower robot is responsible for following the leader robot to maintain the formation according to the requirements. When the follower robot receives the position information broadcast by the leader robot, it determines the desired target depending on the formation. Then, it follows locally using the tracking strategy based on the GPF and avoids obstacles autonomously. The process of the follower robot's path planning is shown in Figure 4.

Experiments Analysis
According to the design scheme of path planning for multiple robots, the method is tested experimentally. The experiments use the Python standard GUI toolkit Tkinter to build the simulation environments.

Comparison Experiments of Improved Q-Learning Algorithm.
For the improved Q-learning algorithm, a grid map with three elements, starting point, target point, and obstacles, is first established. The map size is set to 20 × 20 grids, and the resolution of each grid is 26 × 26 pixels. The starting position of the robot, represented by a red grid, is set at (0, 0), and the target position, represented by a yellow grid, is at (19, 19). Obstacles, which are black grids, are randomly placed on the map, including concave and simple obstacles. To distinguish actual obstacles from virtual obstacles filled during the algorithm's operation, virtual obstacles are shown as gray grids.
The experiment is carried out in a comparative way, and five algorithms are implemented: Q-L1 is the traditional Q-learning algorithm; Q-L2 is the Q-learning algorithm with the dynamic greedy factor of SA; Q-L3 adds the environmental guidance of GPF on the basis of Q-L2; Q-L4 is the Q-learning algorithm proposed in this paper with all three improvements (environmental guidance, virtual obstacle filling, and SA-improved action selection); and Q-L5 is the Q-learning algorithm with a modified reward function based on Q-L4. The implementation details of the Q-L1 to Q-L5 algorithms are shown in Table 1.

The same parameter settings are used across the algorithms: the maximum number of iteration rounds is 10000, the learning rate α is 0.01, and the discount factor γ is 0.9. When the traditional ε-greedy strategy is used, the greedy factor ε is 0.2, and convergence is determined when the standard deviation of the step numbers over 10 consecutive iterations is less than 5. The parameter settings for SA are as follows: the initial temperature T_0 is set to 10, the number of consecutive iterations n is set to 10, the constant i is set to 0.1, the control factor k is set to 0.03, and the non-zero constant q is set to 1. In the algorithms improved with the GPF method, the GPF factor ξ …

Figure 5(c) shows that the robot is trapped in a concave obstacle and cannot escape. Figure 6(c) shows that the cumulative reward curve of path planning changes irregularly during the iterative process, and Figure 7(c) shows that the step number curve also changes irregularly. These three results indicate that Q-L3 does not converge and the robot cannot reach the target when only the GPF of the target is added and concave obstacles are encountered. In contrast, the curves in Figures 6(e) and 7(e) become smooth within fewer rounds of iteration.
The above shows that by adding the improvements proposed in this paper to the Q-L4 and Q-L5 algorithms, the experiments achieve better results. The robot moves while initializing the Q-values through the environmental guidance based on the GPF of the target, so the robot is guided by the target direction all the time, which removes invalid movement and speeds up convergence. When the robot encounters a concave obstacle, it effectively identifies the infeasible area and fills it with light gray virtual obstacles to prevent the robot from falling into the concave obstacle. The SA method is used to adjust the greedy factor dynamically, which accelerates convergence and keeps the algorithm stable after convergence. In addition, the Q-L5 algorithm adjusts the reward function on the basis of the Q-L4 algorithm by giving each step a small penalty, so the robot learns the optimal path with the maximum cumulative reward. Table 2 compares the performance of the five algorithms after path planning. The data in the table are the average results of several experiments. The analysis is as follows: based on the traditional Q-learning algorithm, Q-L2 uses the SA algorithm to improve the ε-greedy strategy. Although the number of convergence rounds increases, the convergence time is shortened and the overall stability of path planning is improved. Comparing the algorithms Q-L3 and Q-L4, if the environmental guidance of GPF is added without the virtual obstacle-filling avoidance strategy, the algorithm has difficulty converging when encountering concave obstacles. Comparing the algorithms Q-L2 and Q-L4, by adding environmental guidance and the virtual obstacle-filling avoidance strategy, the convergence time is reduced by 98.5% and the convergence rounds are reduced by 96.6%; the total number of steps and the length of the path are stabilized at 26 and 32.6274, respectively.
Comparing the algorithms Q-L4 and Q-L5, reasonably adjusting the reward function on top of the improved Q-learning algorithm proposed in this paper makes the robot plan the optimal path quickly, with a convergence time 89.9% shorter than that of the traditional Q-learning algorithm. The number of convergence rounds is reduced by 63.4%, the number of steps is reduced to 22, and the length of the path is reduced to 28.6274.

Experiments of Path Planning for Multirobot Formation.
After experimenting with the improved algorithm for the leader robot's path planning, the follower robots are added to verify the effectiveness of the path planning method for multirobot formation in this paper. The experiment uses a triangular formation and three robots. The leader robot is represented by a red grid, and its initial position is (2, 2).
The follower robots are represented by a blue grid and a green grid, with initial positions (0, 2) and (2, 0), respectively. The target position of the leader robot is (19, 19), which also determines the target positions of the follower robots. The leader robot uses the improved Q-learning algorithm Q-L5 to plan the optimal path, with the same parameter settings as in the simulation experiments of Section 5.1. The two follower robots use the tracking strategy based on the GPF to follow, with the penalty function R_static = 1 and the adjustment coefficient c = 0.09. As Figure 8(a) shows, the three robots quickly reach the target in a fixed triangular formation in an obstacle-free environment. In an environment with static obstacles, the leader robot moves first, and the two follower robots immediately move to the corresponding positions in the formation. The green follower robot first encounters a black lateral obstacle. It moves along the obstacle toward the target and smoothly avoids it. Then, it continues to accelerate toward its corresponding position in the formation at the current time to maintain the formation. Finally, the leader robot plans a red path, and the two follower robots avoid obstacles by themselves during the following process, planning a green path and a blue path, respectively. The three robots reach the target at the same time and complete the formation task, as shown in Figure 8(b).

Conclusion
In this paper, by combining the improved Q-learning algorithm and the idea of the GPF method, a method for multirobot formation and path planning is proposed. The division of labor among the multiple robots is clear. The leader robot uses the improved Q-learning algorithm to plan the path; adding environmental guidance from the target's GPF and the virtual obstacle-filling avoidance strategy effectively accelerates iterative convergence and avoids concave obstacles, and the action selection strategy improved by the SA method is stable and efficient. At the same time, the follower robot uses a tracking strategy based on the GPF to follow in real time, which is simple and efficient. This formation method effectively solves the formation and path planning problems of multiple robots in an unknown environment with concave obstacles. In the future, multirobot formation will be further studied in the context of dynamic environments and privacy protection.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.