Intelligent Path Planning for AGV-UAV Transportation in 6G Smart Warehouse

Recently, deep reinforcement learning (DRL) has attracted increasing interest in the field of intelligent navigation and path planning in smart warehousing. The latest imitation augmented DRL (IADRL) model has achieved good performance for the cooperative transportation tasks of automatic guided vehicles (AGVs) and unmanned aerial vehicles (UAVs). However, this model cannot always transport target cargoes with the optimized policy due to premature convergence. Therefore, we propose an intelligent path planning model for AGV-UAV transportation in this paper. The proposed model utilizes the proximal policy optimization with covariance matrix adaptation (PPO-CMA) in the imitation learning and DRL networks, which enables the AGV-UAV coalition to plan the optimal transportation route at a lower cost. Experiments conducted in simulated warehousing scenarios demonstrate that the proposed model improves the accumulated training reward by more than 10%, outperforming the existing state-of-the-art models in terms of effectiveness and efficiency.


Introduction
With the rapid deployment of 5G networks worldwide, 6G and its applications in industry have attracted more and more attention from researchers [1][2][3]. The number of materials stored in smart warehouses has increased significantly in recent years. Maximizing warehouse space utilization is one way to make 6G smart warehouses more common in the future. In modern intelligent warehousing, the transportation of goods is mainly completed by automatic guided vehicles (AGVs) [4]. Because of their limited reachable height, AGVs cannot transport goods stored at higher positions, which constrains the height of the storage racks in the warehouse and wastes warehouse space. When the quantity of goods exceeds the capacity of the warehouse, additional warehouse space must be opened to store the goods, and the new warehouse space means an increase in cost and a decrease in profit.
With the development of hardware devices, more and more unmanned aerial vehicles (UAVs) have been developed for various operations, e.g., monitoring, ground target tracking, optical remote sensing, and precision agriculture [5][6][7][8]. The most significant advantage of UAVs is that they can conduct tasks at high positions. Applying UAVs to cargo transportation tasks can overcome the limitation of AGVs. However, the energy consumption of UAVs is much higher than that of AGVs. Thus, the working time and operational distance of UAVs are compromised, making it impossible to carry out long-distance transportation tasks [9]. Therefore, we cannot directly replace AGVs with UAVs in cargo transportation tasks.
Based on these facts, we intend to combine the AGV and the UAV into a cooperative AGV-UAV coalition for cargo transportation tasks, solving problems that neither can complete alone. During transportation, UAVs can target goods at higher positions, while AGVs can target those at lower positions. For goods at a long distance and a high position, the AGV can carry the UAV to the location of the goods, and then the UAV can fly up to process the goods. In this way, the UAV makes up for the height limitation of the AGV, effectively improving the space utilization of the warehouse, while being carried by the AGV reduces the working time and the power consumption of the UAV.
For AGV-UAV transportation, path planning is an essential part of the navigation process. Selecting the shortest transportation route can reduce transportation costs in terms of time and energy. Recent years have witnessed the emergence of various path planning algorithms, including traditional path planning algorithms (e.g., the Dijkstra algorithm [10], the A* algorithm [11], and the artificial potential field algorithm [12]) and intelligent path planning algorithms (e.g., the genetic algorithm [13], the particle swarm algorithm [14], and the ant colony algorithm [15]). These algorithms have achieved certain successes in path planning but are easily disturbed by environmental factors and cannot handle large-scale state spaces. With the popularity of artificial intelligence, deep reinforcement learning (DRL) is playing an increasingly important role in intelligent navigation and path planning due to its excellent perception and decision-making capabilities [16]. In particular, Zhang et al. proposed an imitation augmented deep reinforcement learning (IADRL) model for transportation tasks in complex environments [17]. Compared with the traditional algorithms, IADRL enables the AGV-UAV coalition to accomplish cargo tasks at a lower cost.
However, IADRL may converge prematurely and fall into a local optimum during training [18]. To address this problem, we propose an intelligent path planning model for AGV-UAV transportation in this paper. By introducing the proximal policy optimization with covariance matrix adaptation (PPO-CMA) [19] into the policies of the imitation learning (IL) and DRL networks, our model can not only learn the latent behavioral features of the AGV-UAV coalition from the demonstration data but also provide behavioral decisions for the coalition with a better-optimized policy. Experimental results show that our model is superior to its rivals because it solves the premature convergence problem, enabling the AGV-UAV coalition to complete the transportation task at a lower cost.
The remainder of this paper is organized as follows. Section 2 discusses the related work, and the proposed approach is detailed in Section 3. Section 4 presents the experimental results and Section 5 concludes this paper.

Related Works
Path planning has recently been a hot issue in robotics research, and the core requirement is to find an optimal path from the starting point to the endpoint with the lowest cost (e.g., distance, time, and energy). Existing algorithms can be mainly divided into three categories: (1) traditional algorithms, (2) intelligent algorithms, and (3) DRL-based algorithms.

Traditional Algorithms.
Traditional path planning algorithms include the Dijkstra algorithm [10], the A* algorithm [11], and the artificial potential field algorithm [12]. The Dijkstra algorithm is a classic algorithm in the field of path planning, which uses a greedy policy to expand one node at a time, traversing the nodes in the environment to obtain the shortest path from the start to the end. Building on Dijkstra's algorithm, the A* algorithm adds heuristic rules so that it converges faster as nodes are expanded. Although the A* algorithm has been widely used in many fields, its application scenarios are limited to discrete spaces. The artificial potential field algorithm sets an attractive force between the agent and the target and a repulsive force between the agent and the obstacle so that the agent can reach the target position along the direction of the resultant force. However, the force ratio for different scenes can only be tuned manually, making the optimal configuration difficult to obtain, which limits its applications in complex environments.
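To make the relationship between the Dijkstra and A* algorithms concrete, the following is a minimal sketch of A* on a 4-connected grid. The grid encoding (`'#'` for obstacles), the function name `astar`, and the unit step cost are illustrative assumptions, not taken from the cited works. Replacing the heuristic `h` with a function that always returns zero reduces the search to Dijkstra's algorithm.

```python
import heapq


def astar(grid, start, goal):
    """Return the length of a shortest 4-connected path, or None if unreachable.

    grid is a list of equal-length strings; '#' marks an obstacle
    (a hypothetical encoding chosen for this sketch).
    """
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        # Manhattan distance: admissible and consistent for unit-cost grid moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    best = {start: 0}                  # cheapest known cost to reach each cell
    frontier = [(h(start), 0, start)]  # entries are (f = g + h, g, cell)
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if cell == goal:
            return g
        if g > best[cell]:
            continue                   # stale queue entry, a cheaper route was found
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                ng = g + 1
                if ng < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = ng
                    heapq.heappush(frontier, (ng + h((nr, nc)), ng, (nr, nc)))
    return None


warehouse = ["....",
             ".##.",
             "...."]
print(astar(warehouse, (0, 0), (2, 3)))  # 5
```

The sketch also illustrates the limitation noted above: the search operates on a discrete set of cells, so a continuous workspace must first be discretized.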

Intelligent Algorithms.
Intelligent path planning algorithms are a family of algorithms inspired by natural phenomena and animal habits, including the genetic algorithm [13], the particle swarm optimization (PSO) algorithm [14], and the ant colony algorithm [15]. The genetic algorithm imitates the selection and genetic mechanisms of nature to seek the optimal solution. However, it depends on the initial population selection, and its convergence speed is slow when solving large-scale problems. The ant colony algorithm and the PSO algorithm imitate the swarm intelligence behavior of ant colonies and bird flocks and have good parallelism and fast convergence speed. Nevertheless, the parameter setting affects the performance of these two algorithms, making them prone to falling into a local optimum.

DRL-Based Algorithms.
Reinforcement learning can optimize the agent's action policy by maximizing long-term returns without background knowledge. It can find the optimal path through continuous trial and error in a completely unknown environment [20]. Therefore, researchers have applied DRL to path planning problems. Mirowski et al. proposed a DRL method to train agents to navigate within large and visually rich environments by introducing memory and auxiliary learning targets [21]. Sallab et al. applied the DQN algorithm for discrete actions and the deep deterministic actor-critic algorithm for continuous actions to lane-keeping assistance [22]. Chen et al. designed a time-efficient navigation policy based on socially aware collision avoidance with DRL, which enables fully autonomous navigation of a robotic vehicle in an environment with many pedestrians [23]. Kendall et al. applied DRL to a full-sized autonomous vehicle, which can learn a policy for lane following in a handful of training episodes with a single monocular image as input [24]. By combining imitation learning (IL) and DRL, Zhang et al. proposed an IADRL model for the AGV-UAV coalition [17] to accomplish tasks cooperatively and cost-effectively. However, the IADRL model suffers from the local optimum problem due to premature convergence. Therefore, there is still room to enhance the path planning performance in AGV-UAV transportation tasks.

Motivation and Challenges.
As discussed in the Introduction section, the IADRL model combines deep reinforcement learning and imitation learning to learn the cooperative and complementary behavior mode of the AGV-UAV transportation alliance from expert data and interactive data. However, the defects of proximal policy optimization (PPO), which IADRL adopts as its action policy, may cause IADRL to fall into a local optimum during learning and thus fail to find the optimal path.
To better analyze the shortcomings of PPO, we create an environment containing only two-dimensional actions, which makes it easy to visualize the distribution of the actions chosen by the policy during the iterative process. In this environment, the reward is negatively correlated with the sum of squares of the actions chosen by the policy, so the policy reaches the optimum when both action components are zero. In Figure 1, we visualize the distribution of actions selected by different policies at different iterations, where green represents positive-advantage actions and red represents negative-advantage actions.
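A toy environment of this kind can be sketched in a few lines. The snippet below is an illustrative assumption rather than the code used for Figure 1: it takes the reward to be exactly the negative sum of squared action components and uses a mean-reward baseline to label sampled actions as positive- or negative-advantage. The names `quadratic_reward` and `sample_and_label` are hypothetical.

```python
import numpy as np


def quadratic_reward(actions):
    """Negative sum of squared action components; the maximum (0) is at a = (0, 0)."""
    actions = np.asarray(actions, dtype=float)
    return -np.sum(actions ** 2, axis=-1)


def sample_and_label(mean, std, n, rng):
    """Sample n two-dimensional Gaussian actions and compute their advantages.

    The advantage is estimated as reward minus the batch-mean reward,
    a simple baseline assumed here for illustration.
    """
    actions = rng.normal(loc=mean, scale=std, size=(n, 2))
    rewards = quadratic_reward(actions)
    advantages = rewards - rewards.mean()
    return actions, advantages


rng = np.random.default_rng(0)
actions, advantages = sample_and_label(np.array([1.0, 1.0]), 0.5, 200, rng)
positive = actions[advantages > 0]   # would be drawn in green
negative = actions[advantages <= 0]  # would be drawn in red
```

Because the reward depends only on the distance from the origin, the positive-advantage samples are those that land closer to the optimum than the batch average.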
In the first row of Figure 1, when the policy performs multiple minibatch gradient descent steps on the same data in PPO style without the clipping loss, the actions chosen by the policy at iteration 9 deviate from the optimal point. This happens because the negative-advantage actions push the policy away from themselves, whereas the positive-advantage actions pull the policy towards themselves. Each update step moves the policy further away from the negative-advantage actions, eventually causing the policy to deviate from the optimal point.
As shown in the second row of Figure 1, in contrast to the first row, PPO does not deviate during the iterative process but approaches the optimal point as the iterations proceed. However, the final policy still does not exactly reach the optimal point. This is because PPO limits the update range of the policy through the clipping loss to prevent the policy from deviating, but the clipping loss also causes the policy to converge early and fall into a local optimum [22].
Based on our study of reinforcement learning algorithms, we noted that PPO-CMA [25] can solve these problems of PPO well. PPO-CMA prevents the early convergence of the policy by using the standard policy gradient loss instead of the clipping loss and by updating the policy's variance and mean with separate networks. Moreover, PPO-CMA avoids the policy deviation problem caused by negative-advantage actions by converting them into positive-advantage actions through a mirroring method. As seen in the third row of Figure 1, PPO-CMA starts to converge only when it is close to the optimal point and finally reaches the optimal point exactly.
All these observations inspired us to propose a new model based on PPO-CMA to solve the premature convergence problem presented in IADRL and to provide path planning for AGV-UAV alliances in transportation tasks.

The Proposed Model.
To deal with the problem of premature convergence in IADRL, we propose a new model for path planning of the AGV-UAV alliance using PPO-CMA as the action policy. Specifically, the clipping loss is first replaced by the standard policy gradient loss to prevent premature convergence. Afterward, the mean and variance of the policy are updated separately using separate networks to further extend the variance in the optimal search direction. Moreover, the negative-advantage action is turned into a positive-advantage action by a mirroring method.
The AGV-UAV transportation coalition can be described by the tuple $\langle \epsilon, o, a, r, c, M \rangle$, where $\epsilon$ represents the environment, $r$ is the reward function, $c \in (0, 1]$ is the discount factor for future rewards, and $M$ is the complementary cooperation model of the AGV-UAV. The observation $o = (o_1, o_2)$ represents the observed values of the coalition on the environment, consisting of $o_1$ for the observation value of the AGV and $o_2$ for the observation value of the UAV. The action $a = (a_1, a_2) \sim M$ is the action of the transportation coalition, which consists of the action $a_1$ taken by the AGV and the action $a_2$ taken by the UAV. The goal is to learn a joint action-value function $Q^{\pi}_{c}(o, a; \theta)$ that enables the AGV-UAV coalition to achieve the maximum overall reward (or minimum overall cost) while accomplishing various tasks.
According to the generative adversarial imitation learning (GAIL) model [25], the IL model in this paper includes a generator $G$ and a discriminator $D$. The generator $G$, which is also the policy $\pi$ in the DRL model, is responsible for producing actions close to the distribution of the expert data for a given observation $o$ so as to pass the detection of the discriminator $D$. The discriminator $D$ distinguishes the expert data from the data produced by the generator $G$. During the training process, the following value function should be maximized [17]:

$$\mathbb{E}_{\pi}\left[\log\left(D(o, a; \omega)\right)\right] + \mathbb{E}_{\tau_E}\left[\log\left(1 - D(o, a; \omega)\right)\right] - \lambda H(\pi). \tag{1}$$

Here, $\omega$ is the weight of $D$, $H(\pi)$ is the entropy of the policy $\pi$ [26], $\lambda \geq 0$ is the discount factor for $H$, and $\tau_E$ is the expert policy provided by the demonstrated data.
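As a rough numerical illustration of a GAIL-style discriminator objective, the sketch below evaluates the two expectation terms and the entropy term from arrays of discriminator outputs. It assumes the convention of the original GAIL formulation, in which $D$ outputs values near 1 for generated pairs and near 0 for expert pairs. The function name, the argument names, and the clipping constant `eps` are hypothetical and are not taken from the IADRL implementation.

```python
import numpy as np


def gail_discriminator_objective(d_policy, d_expert, entropy=0.0, lam=0.0, eps=1e-8):
    """Monte Carlo estimate of E_pi[log D] + E_expert[log(1 - D)] - lam * H(pi).

    d_policy: discriminator outputs in (0, 1) on generated (o, a) pairs.
    d_expert: discriminator outputs in (0, 1) on expert (o, a) pairs.
    entropy:  an externally computed estimate of the policy entropy H(pi).
    """
    # Clip to keep the logarithms finite (a numerical safeguard, assumed here).
    d_policy = np.clip(np.asarray(d_policy, dtype=float), eps, 1.0 - eps)
    d_expert = np.clip(np.asarray(d_expert, dtype=float), eps, 1.0 - eps)
    return (np.mean(np.log(d_policy))
            + np.mean(np.log(1.0 - d_expert))
            - lam * entropy)


# An undecided discriminator scores lower than one that separates the two sources.
undecided = gail_discriminator_objective([0.5, 0.5], [0.5, 0.5])
separating = gail_discriminator_objective([0.9, 0.8], [0.1, 0.2])
```

Under this convention the discriminator is trained to increase the value, while the generator is trained to make its own pairs indistinguishable from the expert pairs.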
The value function $Q^{\pi}_{c}$ in the DRL model is used to process the received rewards and evaluate the current action selected by policy $\pi$. The training of the DRL model aims to maximize the value function $Q^{\pi}_{c}$ of the AGV-UAV, defined by

where $\theta$ is the parameter of the function $Q^{\pi}_{c}$, $c \in (0, 1]$ is the discount factor for future rewards, and $r_{au}$ is the augmented reward function.
To prevent premature policy convergence, the following standard policy gradient loss is used as the loss function of the policy π instead of the clipping loss.
where $\varphi$ is the parameter of the value function $J_{\varphi}$, $i$ is the mini-batch sample index, $j$ indexes the action variables, and $K$ is the number of sample batches. $A^{\pi}(o_i, a_i)$ represents the advantage function, which measures the payoff of taking action $a_i$ in state $o_i$. In addition, the mean and variance of the policy are generated using separate networks so that the variance can be updated before the mean is updated. This allows the policy to find the optimal point more quickly by elongating the exploration distribution along the optimal search direction rather than shrinking the variance prematurely [27].
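For intuition, a standard advantage-weighted policy-gradient loss for a diagonal Gaussian policy can be sketched as follows. This is a generic textbook form, assumed here for illustration; the exact normalization and the use of separate mean and variance networks in the proposed model are not reproduced. The names `gaussian_log_prob` and `policy_gradient_loss` are hypothetical.

```python
import numpy as np


def gaussian_log_prob(actions, mean, var):
    """Log-density of a diagonal Gaussian, summed over the action dimensions j."""
    actions, mean, var = (np.asarray(x, dtype=float) for x in (actions, mean, var))
    return -0.5 * np.sum((actions - mean) ** 2 / var + np.log(2.0 * np.pi * var), axis=-1)


def policy_gradient_loss(actions, advantages, mean, var):
    """Negative advantage-weighted log-likelihood, averaged over the samples i."""
    advantages = np.asarray(advantages, dtype=float)
    return -np.mean(advantages * gaussian_log_prob(actions, mean, var))


# A positive-advantage action lowers the loss when the mean moves towards it.
far = policy_gradient_loss([[1.0, 1.0]], [1.0], mean=[0.0, 0.0], var=[1.0, 1.0])
near = policy_gradient_loss([[1.0, 1.0]], [1.0], mean=[1.0, 1.0], var=[1.0, 1.0])
```

Unlike the clipped surrogate, this loss places no bound on how far a single update can move the policy, which is why the variance handling and the treatment of negative-advantage actions matter.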
Considering that negative-advantage actions may cause policy deviation, a mirroring technique is employed to convert negative-advantage actions into positive ones. Assuming that the advantage is approximately linear around the current policy mean $\mu(s_i)$, negative-advantage actions can be mirrored about the mean into positive-advantage actions. Specifically, we set $a_i' = 2\mu(s_i) - a_i$ and $A^{\pi}(a_i') = -A^{\pi}(a_i)\,\psi(a_i, s_i)$, where $\psi(a_i, s_i)$ is a Gaussian kernel that assigns less weight to actions far from the mean.
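The mirroring step can be sketched with NumPy as below. The reflection and the sign flip follow the description above; the specific kernel, a Gaussian of the variance-normalized distance with a hypothetical `kernel_scale` parameter, is an assumption, since the text only states that the kernel down-weights actions far from the mean.

```python
import numpy as np


def mirror_negative_advantages(actions, advantages, mean, var, kernel_scale=1.0):
    """Reflect negative-advantage actions about the policy mean.

    Actions with non-negative advantage are returned unchanged. For the rest,
    a' = 2 * mean - a and A' = -A * psi, where psi is a Gaussian kernel of the
    distance from the mean (the kernel form is assumed for this sketch).
    """
    actions = np.asarray(actions, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    mean = np.asarray(mean, dtype=float)
    var = np.asarray(var, dtype=float)

    negative = advantages < 0
    sq_dist = np.sum((actions - mean) ** 2 / (kernel_scale * var), axis=-1)
    psi = np.exp(-0.5 * sq_dist)  # in (0, 1], smaller for far-away actions

    mirrored_actions = np.where(negative[:, None], 2.0 * mean - actions, actions)
    mirrored_advantages = np.where(negative, -advantages * psi, advantages)
    return mirrored_actions, mirrored_advantages


acts, advs = mirror_negative_advantages(
    actions=[[1.0, 0.0], [2.0, 2.0]],
    advantages=[-1.0, 3.0],
    mean=[0.0, 0.0],
    var=[1.0, 1.0],
)
# acts[0] is the reflection (-1, 0); advs[0] equals exp(-0.5); the second sample is unchanged.
```

After mirroring, every training sample carries a non-negative weight, so the update only pulls the policy towards actions and never pushes it away from them.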

Experimental Results and Analysis
In this section, we first conducted experiments with PPO and PPO-CMA in the Gym environment provided by OpenAI. After that, we built an experimental environment for the AGV-UAV problem and detailed the environment configuration. Based on this, we demonstrated the effectiveness and superiority of the proposed model by comparing the experimental results with other models.

Gym Experiment.
From Figure 1, we can see that PPO-CMA solves the problem of PPO's early convergence and is no longer disturbed by negative-advantage actions. To better demonstrate the advantages of PPO-CMA, we further compare the two algorithms in the Gym environment.
As can be seen from Figure 2, the experiments in MountainCar-v0 and BipedalWalker-v3 show that PPO-CMA achieves higher rewards than PPO. Moreover, PPO-CMA converges noticeably faster than PPO. Figure 3 gives the variance of the two policies during training. It can be seen that the sampling variance of PPO-CMA decreases to its minimum value more slowly than that of PPO, which effectively expands the exploration variance, prevents the policy from falling into a local optimum, and gives PPO-CMA a better final training result.

Experimental Configuration.
We designed a virtual simulation scenario for the proposed model based on the Unity3D ML-Agents platform [28] and deployed an AGV-UAV coalition in a scene measuring 50 m × 50 m × 10 m. The mission of the coalition was to complete the transportation of goods along the shortest path. As shown in Figure 4, the cyan-blue squares represent the AGV, the yellow square represents the UAV, and the green, red, and purple spheres represent the target cargoes at different heights and positions.
In the experiments, each agent collects the environment states through the ray-cast sensor provided by Unity3D. The ray-cast sensor casts rays into the surrounding environment and returns the positions of all detected objects and their distances. The rays of the AGV only detect the environment in the horizontal direction, while the rays of the UAV swing up and down by 45 degrees to detect the environment. The detection range of all rays is set to 20 meters. The observation $o$ of an AGV-UAV coalition is a vector containing the environmental information combined from all of its detected ray returns.
The action of the AGV is expressed as $a_1 = [a_x, a_y]$, and the action of the UAV is expressed as $a_2 = [a_z]$, where $a_x$, $a_y$, and $a_z$ represent the agent's acceleration in the $x$, $y$, and $z$ directions. The action of the AGV-UAV coalition is composed of the actions of the AGV and the UAV, $a = (a_1, a_2)$.
In the proposed model, the discriminator is set up with two hidden layers of 128 neural units each. Meanwhile, the value function is set up with three hidden layers with 512 units per layer, and the policy π is set up with three hidden layers with 512 units per layer. In addition, the initial positions of the AGVs, the UAVs, and the target cargoes are random.
The environmental reward is designed based on the situations that the AGV-UAV coalition may encounter. For the coalition to learn the least expensive path, we set a small penalty of 0.01 for each step of the coalition. Since the battery life of the AGV is 5 to 10 times that of the UAV, we set the penalty for each step of the UAV to be 6 times that of the AGV. Therefore, under normal circumstances, the UAV should be carried by the AGV to the destination, and only then does the UAV start to work. We set the reward for each goal to 120 to encourage the coalition to complete the task. Considering that there may be obstacles in practice, we set up obstacles in the scene and imposed a large penalty of −30 whenever the coalition collides with an obstacle. A final reward of 120 is obtained when the coalition has achieved all objectives.
In the experiments, we manually control the agent to complete some simple tasks and record the data as expert data for training the model. We collected the running data of the agent for 10,000 steps, covering all basic scenarios in which the AGV and the UAV cooperate to complete the task. It should be noted that the expert data enables the model to learn the cooperative and complementary relationship between the AGV and the UAV, not the optimization policy of the path. Therefore, our demonstration data only needs to reflect the behavioral characteristics of the AGV-UAV coalition. That is to say, the AGV first carries the UAV to the target position, and then the UAV takes off and starts to work. Moreover, there is no need to manually optimize the route from the coalition to the target.
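The reward design above can be summarized in a small function. This is a sketch under stated assumptions: the 0.01 step penalty is taken to be the AGV's per-step penalty, the UAV's per-step penalty (6 × 0.01) is assumed to apply on steps where the UAV is flying, and all names are hypothetical rather than taken from the Unity3D implementation.

```python
# Constants taken from the reward description; their mapping to agents is assumed.
AGV_STEP_PENALTY = 0.01
UAV_STEP_PENALTY = 6 * AGV_STEP_PENALTY
GOAL_REWARD = 120.0
COLLISION_PENALTY = 30.0
FINAL_REWARD = 120.0


def step_reward(uav_flying, goals_reached, collided, all_goals_done):
    """Environmental reward for one coalition step.

    uav_flying:     True if the UAV (rather than the AGV) moves on this step.
    goals_reached:  number of target cargoes reached on this step.
    collided:       True if the coalition hit an obstacle on this step.
    all_goals_done: True if this step completed the last remaining goal.
    """
    reward = -(UAV_STEP_PENALTY if uav_flying else AGV_STEP_PENALTY)
    reward += GOAL_REWARD * goals_reached
    if collided:
        reward -= COLLISION_PENALTY
    if all_goals_done:
        reward += FINAL_REWARD
    return reward


ground_step = step_reward(uav_flying=False, goals_reached=0, collided=False, all_goals_done=False)
flight_step = step_reward(uav_flying=True, goals_reached=0, collided=False, all_goals_done=False)
```

Because a flight step costs six times as much as a ground step, a return-maximizing policy is pushed towards letting the AGV carry the UAV for as much of the route as possible.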

Experimental Results.
In the AGV-UAV transportation task, the maximum number of training steps for each episode is set to 20,000. If the coalition collects all the goods, the episode terminates immediately; otherwise, training continues until the agent reaches the maximum step. In the experiments, we compare the proposed model with four models, namely PPO, behavior cloning (BC) [29], GAIL, and IADRL, for performance evaluation. To ensure a fair comparison, we use the same parameters, i.e., the number of targets, the learning rate, and the maximum step, for all models.
Figure 5 first compares the rewards obtained by all five models. The proposed model obtains the highest rewards, indicating the best optimization ability for path planning. According to the results, the highest reward of the proposed model is 4400, whereas the highest reward of IADRL is less than 4000, an improvement of more than 10%. IADRL outperforms the PPO, GAIL, and BC models due to the combination of IL and DRL. The PPO model can learn policies from the environment, so it quickly learns to avoid obstacles at the beginning of the training process. However, without the guidance of demonstration data, it cannot learn the behavior characteristics of the AGV-UAV coalition, so its training speed and final reward are lower than those of IADRL and the proposed model. In addition, the PPO, IADRL, and proposed models tend to converge in the end, but the GAIL and BC models fail to do so. This is consistent with the theoretical expectation that GAIL and BC only replicate the actions and policies provided by the demonstration data rather than seeking optimal policies by pursuing higher rewards. As shown in Figure 6, the number of moving steps in each episode of the BC and GAIL models is always equal to the maximum step of 20,000, which means that both of them fail to complete the transportation task for all targets. This is because these two models depend heavily on expert data and cannot adapt to complicated environments.
During the training process, the agents collect data by continuously interacting with the environment. The number of training samples grows as the number of moving steps increases. To evaluate how well the proposed model performs with different sizes of training samples, Figure 7(a) shows the accumulated training reward values for different periods according to the training sample size. In particular, we colored the reward curve by period: green for the period with a small number of training samples, orange for a medium number, and red for a large number. The reward is low but increases faster when the training sample size is small. In this case, our model still outperforms the IADRL and PPO models with a higher reward, as shown in Figure 7(b).
Figure 8 shows the number of collisions between the AGV-UAV coalition and obstacles in each episode. The PPO, IADRL, and proposed models incur many collisions in the early stage of training but quickly reduce the number of collisions to a minimum. In contrast, the BC and GAIL models keep a high number of collisions because they lack environmental rewards.
To better show the superiority of the proposed model, we use the following metrics for evaluation: (1) the mission completion rate, that is, the ratio of the goals reached by the coalition in each episode to the total number of goals, and (2) the number of steps required to complete an episode of tasks. The IADRL and proposed models are further compared in Figure 9, where (a) shows the task completion rate and (b) shows the number of moving steps. The completion rate of the proposed model is superior to that of the IADRL model. Moreover, after about 250 episodes, the completion rate of the proposed model reaches one and remains stable in the following episodes.
According to Figure 9(b), in the early learning stage, it is difficult for the models to complete the transportation of all goods without a suitable policy. Therefore, the number of steps consumed by the models in each episode reaches the maximum of 20,000. As training proceeds, the policy is gradually optimized, and the number of steps needed to complete an episode decreases. Overall, the proposed model outperforms IADRL in terms of both task completion rate and the number of moving steps.
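The two evaluation metrics can be computed per episode as in the following sketch. The function name and argument names are hypothetical; the 20,000-step cap comes from the episode configuration described earlier.

```python
def episode_metrics(goals_reached, total_goals, steps_taken, max_steps=20_000):
    """Return (completion rate, moving steps) for one episode.

    The completion rate is the fraction of goals reached, and the step count
    is capped at max_steps because an episode ends at the maximum step.
    """
    if total_goals <= 0:
        raise ValueError("total_goals must be positive")
    if not 0 <= goals_reached <= total_goals:
        raise ValueError("goals_reached must lie between 0 and total_goals")
    completion_rate = goals_reached / total_goals
    moving_steps = min(steps_taken, max_steps)
    return completion_rate, moving_steps


# An episode that reaches 3 of 4 goals and runs out of steps.
print(episode_metrics(3, 4, 20_000))  # (0.75, 20000)
```

An episode with a completion rate below one necessarily reports the maximum step count, which is why the two curves in Figure 9 move together during early training.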

Discussion.
In this paper, we have taken into account the energy constraints of the UAV and outlined the operational guidelines for the AGV-UAV alliance. The AGV initially transports the UAV to the desired location, where the UAV then takes off to the required altitude to complete the task. This approach restricts the UAV's operational radius to the area directly above the AGV. Consequently, when the target is beyond the AGV's reach or can only be accessed via a lengthy detour, the UAV's limited range of motion increases the overall cost of completing the mission for the alliance.
For example, as shown in Figure 4, the purple target is located directly above the obstacle. Although the UAV can reach the height at which this target is located, it cannot complete the transportation mission because the AGV cannot reach the position directly below the target. In addition, the red target in Figure 4 is located on the other side of the obstacle, which requires the AGV to go around the obstacle to reach the position directly below the target before the UAV can take off to handle it. In this case, if the UAV could move horizontally, then as soon as the AGV reached the vicinity of the obstacle, the UAV could take off to handle the target, and the alliance could accomplish the task at a lower cost.

Conclusion
In this paper, an intelligent path planning model was proposed for the AGV-UAV transportation task in 6G smart warehouse environments. The proposed model utilizes PPO-CMA in the IL and DRL networks to prevent premature convergence of the policy. This enables the AGV-UAV coalition to learn behavior patterns and complete transportation tasks at a lower cost. The experiments conducted in a simulated warehouse environment demonstrate that the proposed model outperforms the baselines. In the future, the focus will be on enabling the AGV-UAV alliance to accomplish transport missions in complementary and cooperative working modes. In addition, exploring ways to allow the UAV to move horizontally to further reduce costs will be a topic of interest.

Data Availability
The data used to support the findings of this study are included in the article.