Multirobot Coverage Path Planning Based on Deep Q-Network in Unknown Environment

Aiming at the problems of security, high repetition rate, and many restrictions of multirobot coverage path planning (MCPP) in an unknown environment, Deep Q-Network (DQN) is selected as a part of the method in this paper after considering its powerful approximation ability to the optimal action value function. ­en, a deduction method and some environments handling methods are proposed to improve the performance of the decision-making stage. ­e deduction method assumes the movement direction of each robot and counts the reward value obtained by the robots in this way and then determines the actual movement directions combined with DQN. For these reasons, the whole algorithm is divided into two parts: oine training and online decisionmaking. Online decision-making relies on the sliding-viewmethod and probability statistics to deal with the nonstandard size and unknown environments and the deduction method to improve the eciency of coverage. Simulation results show that the performance of the proposed online method is close to that of the oine algorithm which needs long time optimization, and the proposed method is more stable as well. Some performance defects of current MCPP methods in an unknown environment are ameliorated in this study.


Introduction
With the continuous development of robot technology, robots gradually appear in a variety of tasks to improve the e ciency of task or reduce labor costs. As a part of robot application, coverage task has been widely used in area cleaning [1,2], disaster detection [3], postdisaster rescue, and other elds. In the rst place, researchers focus on coverage path planning (CPP) for individual robot and achieve good results [4,5]. Due to the di erent methods of map representation, model building, path generation, and their cross application, there are various methods of CPP. Some methods use trapezoidal decomposition or boustrophedon decomposition to decompose the map into several simple subregions and then cover them in sequence [6,7]. e grid-based method is another basic method, including the spanning tree coverage (STC) algorithm [8], wavefront algorithm [9], etc.
However, with the increase of coverage area and diculty of the task, a single robot is di cult to achieve the task requirements.
en, multirobot coverage path planning (MCPP) rises in response to the proper time and conditions [10]. e di culties of MCPP lie in the communication, cooperation, and collision avoidance among robots [11,12]. In order to solve these problems, some methods based on the single-robot CPP algorithm make great improvements. For example, the map is decomposed into subregions, and then the STC algorithm is applied in these subregions [13]. Some methods are also based on map decomposition [14], where simulation results show the e ectiveness of the proposed method. e template method is also a common method in MCPP. With the study of some optimization algorithms [15], heuristic algorithms are also used to optimize the coverage problem [16]. It works well in simple environments, while in complex environments, robots may get stuck in dead zones. Aiming at this problem, the A * algorithm is applied to escape the dead zones [17]. Another basic method worth mentioning is the biologically inspired neural network (BINN) method, which uses activity of grids that help robots select actions. Some improvements are made in [18], and the problems about large amount of calculation and big cost to escape from dead zones are solved.
All of these methods require prior information on the environment. Nevertheless, in some application scenarios, it is difficult to obtain all the environmental information and plan the path offline before the task starts, such as large-scale search and rescue [19] and mine clearance [20]. erefore, online MCPP methods are significant in unknown environment. In contrast, there are few researches on MCPP in unknown environment, and the current methods are immature. Robots are divided into exploration and coverage robots in [21], and the explorers cover the boundaries of the current region, while coverage robots move back and forth between the explorers. When the multirobot team meets reverse-convex (RCV) obstacles, it will split into two subteams. e subteams cannot split again. However, this method has requirements for both the start points of robots and the number of robots. A method proposed in [22] can map out the approximate path and provides ideas for future researches. e multirobot hex decomposition exploration (M-HDE) method has been proposed in [23], in which the energy restriction is taken into consideration. In this method, the map environment is represented as hexagons, and then smooth, continuous paths are generated for the robots formation. But the simulation results show the paths occasionally crossed the obstacles, which may bring some danger.
Focusing on above problem, extensive researches and applications of reinforcement learning (RL) in robotics [24] catch our attention. Nevertheless, these studies usually focus on the pathfinding problem or the singlerobot coverage problem in the known environment [25]. We think this is because the target of the coverage problem is to cover all nonbarrier areas, which is more complicated than pathfinding, especially in unknown environment. Consequently, using RL to solve the MCPP problem in unknown environment requires a huge amount of calculation, which is difficult to meet in practical applications. erefore, this paper tries to reduce the amount of RL training and solves the difficulties in MCPP in unknown environment. Compared with the current methods, the main contributions of this paper can be summarized as follows: (i) At present, most of the coverage algorithms are offline algorithms, and the planning takes a long time. e coverage problem is divided into offline training and online decision-making in the proposed method. e robots acquire prior knowledge in offline training and are able to make decisions quickly in the face of unknown environments online because different target environments only affect the decision-making stage.
(ii) Drawing on the Monte Carlo tree search method, the multirobot deduction method is proposed. is deduction method selects some possible better action groups for deduction according to the training network and judges the quality of the action groups by rewarding feedback. rough this method, under the condition of decentralized training, multirobot cooperation and collision avoidance are well solved, thereby reducing the amount of training. (iii) In order to save the robot's storage space and reduce the amount of training, during the training phase, the environment is known and fixed in size. In practice, however, the target environment is unknown, and it is highly likely that the size does not match the training environment. In view of this situation, the global image of the environment is generated multiple times according to the probability of the area that has been explored, and the deductive method is used in multiple images, and finally, each robot action is selected by voting. Second, if the environment size does not match the training size, the sliding-view method is used, and each robot is as centered as possible on itself, intercepting the environment information as a network input.
In order to implement this algorithm, the robot needs to be equipped with a basic computing microcomputer (such as Raspberry Pi) and a laser distance sensor (LDS). e LDS needs to sense the obstacle information in each direction, so the recommended installation position is the top of the robot. e rest of the paper is organized as follows. Section 2 introduces the related works. e proposed method is provided in section 3. e comparison results and the conclusion are presented in sections 4 and 5, respectively.

Deep Q-Network.
In 2013, the concepts of deep reinforcement learning (DRL) and the Deep Q-Network (DQN) algorithm were proposed on Neural Information Processing Systems (NIPS) by the DeepMind team [26]. Forty-nine classic Atari2600 games demonstrated the powerful performance of DQN. Since then, both the DeepMind team and other researchers have made improvements on this basis to improve the performance of DQN [27,28]. Some researchers consider double DQN, fuzzy systems [29], or prioritized experience replay. Double DQN has better ability to alleviate the overestimate problem; for this reason, double DQN is applied in this paper. e main formulas of DQN are (1) and (2): s means the state, and a means the action; s ′ and a ′ mean the state and action in the next time; the very important change from double DQN is that a ′ is selected by the main network but not the target network. r is reward given by environments according to R 1 ∼ R 4 . c is the discount factor. Q is the value of DQN, and L i (θ i ) is the loss function in iteration i and is used to train the network. y is the target value, and it is equal to r + cmax a″ Q(s ′ , a ″ ; θ − i ), which is influenced by the previous iteration parameter θ − i .

Monte Carlo Tree Search.
is paper is inspired by the excellent performance of the Monte Carlo tree search (MCTS) method in the chess game [30,31] and try to reduce the consumption of training, meanwhile ensuring the quality of the coverage by this way. First, the basic processes of MCTS are introduced here, then the difficulties to apply this algorithm to CPP problems are analyzed, and finally, the modification method is described in the later section. If states are represented as nodes and actions as directed edges, then MCTS can be represented as a tree construction, as shown in Figure 1. MCTS has four basic processes: selection, expansion, simulation, and backpropagation: (i) Selection: choose the action that is most likely to reach the unevaluated state. (ii) Expansion: in a certain state S t , after the action A, the new state S t+1 is obtained; then S t expands S t+1 as its child node. (iii) Simulation: start from a certain state and choose actions by some strategy, until the end. (iv) Backpropagation: at the end of the simulation of a certain state, the information of each parent node is updated along the way according to the outcome, until the root node.

Map Presentation.
In this paper, the grid-based method is used to represent the target area. e grid size depends on the effective range of the coverage sensor carried by the robot. As shown in Figure 2, r s represents the coverage radius of the sensor, and l represents the length of grid sides. l has to follow: Once the grid size is determined, the task area can be rasterized. g i,j is used to describe the grid in row i and columns j. In the rasterization process, if a grid contains any obstacle area, the whole grid will be regarded as an obstacle grid. Conversely, a grid is considered to be a grid waiting for covering which consists entirely of reachable areas. e set S � g 1,1 , g 1,2 , . . . , g 1,j , g 2,1 , . . . , g i,j represents all grids in a two-dimensional map with i rows and j columns; obstacle grids, grids waiting for covering, and covered grids are represented by S o , S w , and S d , respectively. S � S o ∪ S w ∪ S d and the intersections of S o , S w , and S d are empty sets. e grids currently occupied by n robots are represented by P r � p 1 , p 2 , . . . p n ⊂ S d . As shown in Figure 3, black represents the obstacle area S o , and white represents the feasible area S w . e robot will start from the start point and move to the feasible direction, so as to cover the feasible area. e current position of the robot is represented by a red circle, and the covered grids S d are represented by blue. e conditions for ending the task are S w � ∅, or the robots are stuck in a dead zone from which it cannot escape.
In the CPP problem, the main metrics we pay attention to are coverage rate, time cost, energy consumption, and so on. e repetition rate can be a good indicator of time cost, energy consumption, and other indicators, especially in the case of complete coverage. In the later simulation, the repetition rate is used frequently to represent the performance of the method. e coverage rate r c and the repetition rate r r are grid-based metrics and related to the path of robots. Hence, there are some definitions that should be made first. Because the paths of robots consist of ordered, repeatable, and adjacent grids, they are represented by variable-length arrays rather than sets. e path of ith robot can be denoted as P i � [g i x1 ,i y1 , g i x2 ,i y2 , . . . , g i xt ,i yt ]. L P i means the length of array P i ; t � L P i is because ith robot has taken t steps so far. en the calculation formulas of r c and r r can be defined as shown in (4) and (5).
If the robot achieves complete coverage, S � S o ∪ S d , and S w � ∅. erefore r r is only related to total length of paths in this condition.
In order to transform the index of the robot coverage problem into the RL model, the following settings are made: whenever the robot visits a reachable grid g i,j ∈ S w , the robot can get a reward R 1 and then alter the assembly state of  Journal of Robotics If the robot achieves the complete coverage of the task area (S w � ∅), a larger reward R 2 will be obtained. On account of speeding up the convergence of training, a small negative reward R 3 is added to each step of the robot. For processing of illegal areas S o (such as boundaries and obstacles), a negative reward R 4 is given if a robot accesses the illegal grid g i x2 ,i y2 and then makes the robot stay in the place g i x1 ,i y1 . R 1 is set as a unit and pick other reward parameters in view of simulation results. e simulation environment is a 6 × 6 map (i � j � 6), and the maximum number of robot steps is limited to 180. e ϵ-greedy strategy is used in training.
ere are 8 sets of parameters that have been picked as representatives, and the average of the last 10 episodes' results are taken and shown in Table 1. R 1 � 1, R 2 � 10, R 3 � −0.01, and R 4 � −1 are the best sets of parameters in our simulations.

3.2.
Training. Multiagent reinforcement learning is an important part of RL. ere are many discussions on Nash equilibrium. e fully centralized method and centralized training with the decentralized execution method are adopted primarily. Naturally, in these methods, there is more resource consumption than in decentralized training. For the MCPP problem, the huge computing consumption of training multiple centralized robots cannot be afforded, especially when the number of robots is uncertain. Finally, the single-agent reinforcement learning method (DQN) is selected for these reasons above, and how to resolve multiagent conflicts is explained later in this article. e state is the network input in DQN. In the coverage problem of the given environment, each grid has three possibilities: S o , S d , and S w . In the initial simulations, a matrix is used, and the matrix size is corresponding to the size of the environment to represent the state, and different numbers represent the different situations of the grids (0: S w ; 1: S o ; 2: S d ; 3: P r ). However, it is found that this state representation method leads to sluggish convergence. In our opinion, it is owing to the concentrated representation of multiple grid situations that the convolution layer has more burden of feature extraction. After this, one-hot encoding is used to modify the input states; 001, 010, and 100 represent S o , S d , and P r , respectively. If a grid is neither an obstacle nor has it been accessed or is being visited, it must be a grid waiting for the robot's coverage. erefore, in this way, there is no need to independently encode the grid waiting for covering (or it can be regarded as 000). Finally, each bit of the encoding is set to a different channel, and thus, the matrix is transformed into a three-channel input, as shown in Figure 4. Simulations demonstrate the effectiveness of this state representation: all other conditions being equal and i � j � 6, there are 3000 episodes of training for both methods, and the environment maps for each episode are randomly generated. Some of the other training parameters are shown in Table 2. In order to observe repetition rate variation clearly, the averages of repetition rates per 100 episodes are calculated. e results are shown in Figure 5. e results in Figure 5(c) are the results from Figure 5(a) minus the results from Figure 5(b). Both these methods can achieve complete coverage. e difference of repetition rates demonstrate that the threechannel input method gains the edge after 500 episodes of training. As a result, all simulations without a detailed description in this paper adopt this state representation way. In order to prove that the proposed method is applicable to a larger environment, the same method is used to train environments when i � j � 10, and it can be seen in Figure 6 that although the repetition rate is high, the repetition rate curve still has a downward trend. ese results prove that the wider environment requires a greater amount of training.  AlphaGo can adopt MCTS in the execution stage, and there are some primary reasons: first, AlphaGo has multiple threads, central processing units (CPUs), and graphics processing units (GPUs). As a result, there is a large capacity of calculation, and an enormous amount of computation can be borne. Second, AlphaGo has both policy network and value network. e choice of action and the assessment of state use different networks, respectively. Reciprocally, computing resources are limited in CPP. What is more, the DQN-based method only has the value network. After considering these difficulties and focusing on the idea of deduction, which is the key of MCTS, in the game of Go, an action is made in a certain state, but the next state is affected by the opponent's move. In a CPP problem, the next state is determined after the action choice. A simpler environment allows the simulation stage to use the value network and get acceptable results. On the other hand, if simulation is continued until the end of the task, the task may be impossible to complete due to the inaccuracy of the value network: the robot may move back and forth between certain grids. Another reason for no continuous simulation until the end of the task is that not only the complete coverage ability but also the repetition rate are concerned, and the repetition rate is even more important. For these reasons, the number of simulation steps is set as the parameter N max step . Finally, the final decision of the MCTS is the action which is selected the most times (the more times selected, the more likely to win). In the CPP problem, the robot only has four actions at most (front, back, left, and right), and it is much less than Go. We can evaluate all the actions by the rewards in the simulation stage. e algorithm is displayed in Algorithm 1.
Comparative simulation demonstrates the effectiveness of the proposed method. e experimental group uses the deduction method, while the control group uses the value network for execution. In Table 3, for each simulation, a random map is generated for two groups to cover. Both groups use the same network which is under 3000 episodes of training. e simulation was repeated 100 times, and each time, the environment was randomly generated. Each grid has a 5% chance of being an obstacle grid, and if the connectivity of all nonobstacle grids cannot be guaranteed, all obstacle grids are regenerated. Finally, the starting point location of the robot is randomly selected in the waiting coverage area. If the map is covered completely, the corresponding data in this table are the repetition rates. e character string "NULL" is filled in the form otherwise. e experimental group achieves complete coverage in 100  experiments, and the average repetition rate is 6.92%. In contrast, the control group fails to cover completely in 17 experiments, and in other experiments, the average repetition rate is 15.05%. is simulation emphasizes that the deduction method is much better than the way it uses the value network only. e next step is to derive single-robot decisions from multirobot decisions. e difficulty of MCPP is how to allocate tasks reasonably and avoid collision. Due to the result of limited computing resources in training, extra computation is not expected to see, so the existing value network is used to solve the problem of task allocation and collision avoidance. rough experiments, it can be found that the deduction method proposed in the previous section can achieve task allocation and collision avoidance problems when the distance between robots is close. However, the robots bring little influence and can be ignored in far distance. Figure 7 illustrates how the deduction method works with two close robots. If the decision is made by the value network, robots R a and R b will often choose the same grid B as the target to cover. After grid B is covered, grid A is selected as the target by two robots. is leads to an increase in repetition rates, which represents an increase in energy expenditure and task time. e deduction method works by  the rewards obtained for each step and the discount factor. Obviously, one robot aiming at grid A and the other robot aiming at grid B will get the highest total reward. In order to make the deduction method suitable for MCPP, some modifications have been made: (i) In multirobot coverage path planning, the objects to score are action groups instead of actions, which contain the actions selected by each robot and the priority of each robot (after decision robots run in the same time). (ii) e number of possible actions for a single robot is 4; the number of possible action groups is 4 n with n robots. is exponential complexity is obviously unacceptable, so at most, 64 action groups are chosen according to the value network in the beginning. (iii) e action groups are evaluated by the total rewards of all robots and the discount factor. (iv) Finally, select the action group with the highest score, and let each robot perform the corresponding action.
e algorithm is shown in Algorithm 2. e complexity of the whole algorithm is determined by the number of robots n and the maximum simulation steps N max step . As mentioned above, in order to avoid the rapid increase in the time complexity of the algorithm with a large number of robots, the number of action groups is limited. Otherwise, the number of action groups is equal to the cubed number of robots. In later simulations, N max step tends to take 3, and the discount factor c equals to 0.95. ( e discount factor in Algorithm 2 is not equal to the discount factor at the time of training.) By this algorithm, the multirobot can finish coverage tasks in a known environment.

Unknown Environment: Probability and Statistics.
Although the proposed method of the unknown environment in this paper can be used in most cases, it is better for random or uniform distribution of obstacles, such as the field environment where trees grow randomly. In an unknown environment, robots should follow these two basic requirements: (i) In grid-based map, the sensor used by the robot for perception can at least explore the information of neighboring grids around it (including vertex adjacency). (ii) e robot can record the information of previously explored grids. Figure 8 shows the unknown environment in the gridbased map. e grids are represented by gray color if they have never been explored. In the beginning, the robot only knows the information about the surrounding grids. As it continues to cover, observe, and record, the robot can collect more information about the grids, and as a result, the number of unexplored grids is reduced. In known grids, there are three possible situations that are discussed before: S o , S w , and S d . However, in unknown grids, there are only two situations: S o and S w . erefore, when converting the current state to a threechannel input, the second channel and third channel are fixed. Only the first channel is influenced by the unexplored grids. e contents of the unknown part of the channel can be regarded as a two-point distribution that obeys probability P o .
where S ko , S kw , and S kd mean S o , S w , and S d grids within a known range, respectively. Equation (6) signifies that the grids are recorded by the robot and viewed as a sample of the entire environment. e probability of obstacles in the entire environment is estimated by treating these explored and recorded grids. All unexplored grids in the first channel are evaluated according to probability P o . A total of M times are repeated to generate M maps, and actions are selected and voted by applying the deduction method in the previous section in M maps. Finally, the action with the most votes is executed really.    Similarly, experiments are designed to verify the effectiveness of the method. M is set to 1, 2, and 5 to observe the influence on coverage performance. ere are four groups in experiment. In group 1, the environment is known, and group 2 uses the proposed method, and parameter M is different. Group 3 and group 4 regrade unexplored grids as the feasible grids and obstacle grids, respectively. A total of 120 experiments are conducted in these four conditions. Table 4 shows the results.

Journal of Robotics
When M is small, the results may be unstable but acceptable. is is because the voting session becomes meaningless when generated maps are not enough. For example, a random action will be chosen in the "1 : 1" conditions (and that is why it gets bad results when M equals 2). Randomness can also lead to better results: a map generated based on probability may get an appropriate action (of course, the results could be worse, with an average repetition rate of 0.1 percent higher in group 2 than in the third group over an additional 40 experiments). However, larger M will get more stable results with a longer response time. e parameter M can be set according to the real condition.

Nonstandard Size Environments: Sliding View Method.
In the previous simulations, the training and execution are carried out in 6 × 6 environments; however, the situation must be considered that the practical size is bigger or smaller than 6 × 6 size. A simple case is discussed first, when the representation of the target environment is smaller than the training size, for example, the target environment is expressed having 5 × 5 size. A simple approach is to treat the boundary as an obstacle, as shown in Figure 9(a). e situation is more complicated when the environment representation is larger than the training size. e first method is to train in a larger environment as much as possible. Although the coverage performance of this method may be better, the training time will increase geometrically as the training size increases. As shown in Figure 6, during training, repetition rates decreased much more slowly in the 10 × 10 environment than in the 6 × 6 environment. Another processing method named sliding view is proposed in this paper. e sliding-view method not only records all the explored grids information but also generates robots' own Initialization: maximum number of simulation steps N max step , discount factor c, temporary environment information E temp , action of each robot a, total rewards from ith action group R i , current reward from selected action r now Choose k(k < � 64) action groups for n robots A k � (a k1 , a k2 , . . . , a kn ) for i←1 to k (k action groups) do while the number of current step N step < N max step do for j←1 to n do if a ij is feasible then Choose a feasible next action according to the value network, r ij � r ij + c N step − 1 r now Update E temp according to selected action else r ij � −1 end if end for N step � N step + 1 end while R i � r i1 + r i2 + · · · + r in end for Choose an action group by R ALGORITHM 2: Deduction method in MCPP.  views with the training size. e method keeps the robot in the center of the view as much as possible and provides special treatment if the robot is close to the boundary, as shown in Figure 9(c). If there is no grid to be covered in some robots' views, such as R a in Figure 9(c), the pathfinding algorithm is called to drive this robot go to the nearest grid waiting for covering. Because the size of view is the same as the training size, the DQN can work successfully, and other robots can select the optimal action group according to Algorithm 2. When affordable training costs are low, this is a suited compromise method without additional cost. e disadvantage of this approach is that the action value function does not contain all known information about the environment; in other words, the sliding-view method is a greedy method that focuses on the local. Different size environments are dealt with the slidingview method, with the method of regarding boundary as an obstacle. When the environment is small, two robots participate in the simulation, and three robots participate in larger environment. A set of results is picked and as shown in Figure 10, the different colored lines in the figure represent the trajectories of different robots. e robots complete the coverage task in different environments, and these results prove that the size of the environment that can be processed breaks through the limitation of the training size by the proposed methods.

Comparison
As far as the authors know, under the similar model in this paper, there is no other MCPP method in unknown environment. Since there are two methods in known environment that are selected to contrast, the Glasius bioinspired neural network (GBNN) [18] method enables the robots to select actions by transferring activity among the grids. e parameters set in the simulation are consistent with those in that article: β � 0.6, α � 2, E � 100, c � 0.5, and R � � 2 √ . e cell potential and motion pattern driven coverage (CPMPC) [17] method uses the heuristic algorithm to select a pattern and directs robots motion. e crossover rate (65%) and mutation rate (5%) parameters in CPMPC are the same in that article. Since the environment map is smaller than environment in that article, the population size (25) and maximum number of iterations (100) are reduced accordingly. Two kinds of environments are considered. One is the artificial partition environments, which simulates different rooms in the space; the other is the generated random environments to simulate an outdoor situations. In random environments, each grid has a 5% probability of being an obstacle grid, and the robot starting points are randomly generated in the nonobstacle grids. For different numbers of robots, 10 sets of multipartition environments and 40 sets of random environments are generated. It should be ensured that the initial conditions for each method of each set of experiments are the same. Because the simulation takes place in a large number of environments, showing all environments in figures is difficult. ese environments are similar to those shown in Figure 10. Results are shown in Table 5.
e results verify that although the proposed method does not yield the best results in all cases, the results are stable and acceptable. A more important fact is that this method can work in unknown environment and complete the plan in a short time. us, the proposed method is more applicable in practical task. e repetition rates are 10% bigger than Table 4 because multirobot stops exploring until all the grids are covered except obstacles. is means that when the last few remaining grids can be covered by one robot, other robots will still approach. In a small environment, this approach will increase the repetition rate significantly. GBNN is a simple and rapid method, which combines activity of grids and templates method. Templates dominate the coverage process, while activity of grids helps robots escape from dead zones. GBNN is faster than the other two methods as there are no complicated calculations in the whole process. Nevertheless, this simple principle is restricted in some environment.
e results show the unstable state of GBNN. e performance of CPMPC is ideal in most situations with the help of the powerful search capabilities from GA. ere is no gain without a loss. Application of heuristic algorithms means that it will cost a lot of time searching for the optimal solution with given environment. As a result, this method will not work in an urgent task. What is more, the number of selected patterns is set by users, and in that article, it was 2. In fact, the number of patterns should be set to a parameter and be changed with the complexity of the environments. In complicated artificial environments CPMPC may achieve wrong results with few selected patterns.

Conclusion
In this paper, taking into account large-scale and urgent tasks such as mine clearance and postdisaster relief, a gridbased MCPP method in unknown environment is proposed. DQN is used for planning in view of the great performance. However, the value network is hard to train accurately under the huge number of possible states. So the idea of deduction in MCTS is considered, and then the deduction method in MCPP is proposed. Modifying the deduction method not only reduces the time and cost of training but also completes the task assignment and avoids collision in MCPP. On the other hand, the explored grids are regarded as sampling from the whole environment to solve the problem about unknown environment. Since the real environment may be different in size from the training environment, the slidingview method is proposed. Finally, this total method can achieve stable and decent results rapidly. e method in this article can be divided into two parts: training and decisionmaking, both of which require computation, so we need to balance the computational cost and algorithm performance. For example, how many episodes of training should be carried out, the number of action groups considered in decision-making, the larger the value of these parameters, the better the coverage performance of the robot, but the more time it takes to train or make decisions. In partial comparisons, the proposed method performance in the repetition rate is a bit poor than in CPMPC; however, as an online method, the applicability is much stronger than other offline algorithms. Strong robustness is also a highlight of this approach. First, if the training environment is big enough, this method is valid in any case: the areas outside the real environment can be treated as obstacles. Second, this method is applicable to any environment, any number of robots, and any initial positions. ird, this frame may be suitable in other planning problems such Journal of Robotics 13 as pathfinding problems. ere are two points we are interested in and can be improved in the future. One is trying to use the actor-critic method to replace DQN and comparing the results. e other is reducing the repetition rate.
Data Availability e raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Conflicts of Interest
e authors declare that they have no conflicts of interest.