Optimal Path Planning for Wireless Power Transfer Robot Using Area Division Deep Reinforcement Learning

This paper aims to solve the optimization problems in far-field wireless power transfer systems using deep reinforcement learning techniques. The Radio-Frequency (RF) wireless transmitter is mounted on a mobile robot, which patrols near the harvested energy-enabled Internet of Things (IoT) devices. The wireless transmitter intends to continuously cruise on the designated path in order to fairly charge all the stationary IoT devices in the shortest time. The Deep Q-Network (DQN) algorithm is applied to determine the optimal path for the robot to cruise on. When the number of IoT devices increases, the traditional DQN cannot converge to a closed-loop path or achieve the maximum reward. In order to solve these problems, an area division Deep Q-Network (AD-DQN) is invented. The algorithm can intelligently divide the complete charging field into several areas. In each area, the DQN algorithm is utilized to calculate the optimal path. After that, the segmented paths are combined to create a closed-loop path for the robot to cruise on, which can enable the robot to continuously charge all the IoT devices in the shortest time. The numerical results prove the superiority of the AD-DQN in optimizing the proposed wireless power transfer system.


Introduction
The wireless power transfer technique has proved to be the most effective solution to the charging problem as the number of IoT devices grows drastically, since it is impossible to replace the batteries of all IoT devices [1]. In recent years' Consumer Electronics Show (CES), a large number of wireless power transfer products have come into consumers' sight. There are two types of wireless power transmission products: near-field and far-field. In near-field wireless power transfer, the IoT devices, which are charged by resonant inductive coupling, have to be placed very close to the wireless transmitters (less than 5 cm) [2]. In far-field wireless power transfer, the IoT devices use the electromagnetic waves from transmitters as the power resource, and the effective charging distance ranges from 50 centimeters to 1.5 meters [3][4][5]. Compared to near-field transmitters, far-field wireless power transmitters can charge IoT devices (including mobile IoT devices) that are deployed over a larger space.
However, far-field wireless power transfer is still in its infancy for two reasons. First, the level of power supply is very low due to the long distance between the power transmitters and the energy harvesters. In [6], the authors mentioned that the existing far-field RF energy harvesting technologies can only achieve nanowatt-level power transfer, which is too tiny to power up high-power-consuming electronic devices. In [3], the authors investigated RF beamforming in the radiative far field for wireless power transfer. The authors demonstrated that, with beamforming techniques, the level of energy harvesting can be boosted. However, as the distance between the transceivers increases to 1.5 meters, the amount of harvested energy is less than 5 milliwatts, which is still not enough to power up high-energy-consuming devices. Second, most of the existing wireless charging systems can only effectively charge stationary energy harvesters. In [7], a set of wireless chargers (Powercast [8]) are deployed on a square area. The Powercast transmitters can adjust the transmission strategies to optimize the energy harvested at the stationary energy harvesters. In [9], the Powercast wireless charger is mounted on a moving robot. Therefore, the charger is a mobile wireless charger, which can adjust the transmission patterns of the stationary sensors while moving. However, the number of IoT devices to be charged is too small. In order to wirelessly charge multiple IoT devices, some researchers proposed using Unmanned Aerial Vehicles (UAVs) to implement wireless power transfer [10][11][12][13]. The UAV is designed to plan the optimal path to charge the designated IoT devices. However, it is very inefficient to use a UAV to charge IoT devices, since a UAV has very high power consumption and very short operational time. Installing the wireless power emitter on the UAV will further shorten its operational time.
In order to enhance the level of energy harvesting and the efficiency in charging a large number of energy-hungry IoT devices, in this paper, we assembled a wireless power transfer robot and applied a deep reinforcement learning algorithm to optimize its performance. In the system, the goal is to find the optimal path for the wireless power transfer robot. The robot cruises on that path, which allows it to charge each IoT device in the shortest time. DQN has been widely used to play complicated games which have a large number of system states, even when the environment information is not entirely available [14]. Lately, many researchers have started to implement DQN to solve complicated wireless communication optimization problems, because the systems are very complicated and the environment information is time-varying and hard to capture [15][16][17][18]. In particular, researchers applied deep reinforcement learning to plan the optimal path for autodrive robots [19][20][21][22], and the robots can quickly converge to the optimal path. Hence, we found that DQN is a perfect match for our proposed optimization problem. However, those papers either only proposed theoretical models or could not implement wireless power transfer functions. To the best of our knowledge, we are the first to implement an automatic far-field wireless power transfer system in a test field and to invent a DQN algorithm to solve it. In our system, the entire test field is evenly quantized into square spaces. Time is slotted with equal intervals. We consider the relative location of the robot in the test field as the system state, while the action is defined as the direction to move in the next time slot. At the beginning of each time slot, the wireless power transfer robot generates the system state and takes it as the input to the DQN.
The DQN generates the Q values for each possible action, and the one with the maximum Q value is picked to guide the robot's move during the current time slot.
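The per-slot decision described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tiny two-layer network, its sizes, and the parameter names are all hypothetical stand-ins for the trained DQN.

```python
import numpy as np

# Hypothetical sketch: the state is the robot's (h, v) grid position and the
# network maps it to one Q value per movement action.
ACTIONS = ["U", "D", "L", "R"]

def q_values(state, W1, b1, W2, b2):
    """Tiny two-layer network: state (2,) -> Q values (4,)."""
    hidden = np.maximum(0.0, W1 @ state + b1)   # ReLU hidden layer
    return W2 @ hidden + b2                     # one Q value per action

def greedy_action(state, params):
    """Pick the action with the maximum Q value, as at each time slot."""
    q = q_values(state, *params)
    return ACTIONS[int(np.argmax(q))]

rng = np.random.default_rng(0)
params = (rng.normal(size=(16, 2)), np.zeros(16),
          rng.normal(size=(4, 16)), np.zeros(4))
action = greedy_action(np.array([2.0, 0.0]), params)
```

In the real system this forward pass would be repeated at the start of every time slot with the robot's current position as input.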
As the number of IoT devices increases and the test field becomes more complicated, the traditional DQN cannot generate a closed-loop path for the robot to cruise on, which fails the requirement of charging all devices at regular intervals. In order to deal with this problem, area division deep reinforcement learning is proposed in this paper. At first, the algorithm divides the whole test field into several areas. In each area, DQN is utilized to calculate the optimal path. Next, the entire path is formed from the paths of the separate areas. In this way, a closed loop is guaranteed, and the numerical results prove that the calculated path is also the optimal path.

System Model
The symbols used in this paper and their explanations are listed in Table 1.
As shown in Figure 1, a mobile robot that carries two RF wireless power transmitters cruises on the calculated path to radiate RF power to K nearby RF energy harvesters. Both the power transmitter and the RF power harvesters are equipped with one antenna. The power received at harvester k can be expressed as

p_rx,k = μ η p_tx G_tx G_rx λ² l_p / (4πL)², (1)

where p_tx is the transmit power; G_tx is the gain of the transmitter's antenna; G_rx is the gain of the receiver's antenna; L is the distance between the transmitter and harvester k; η is the rectifier efficiency; λ denotes the wavelength of the transmitted signal; l_p denotes the polarization loss; μ is the adjustable parameter due to Friis's free space equation.
Since the effective charging area is critical in determining the level of energy harvesting and it is the parameter to be adjusted at the transmitter, equation (1) is reexpressed using the effective area:

p_rx,k = μ η p_tx S_tx cos(α) S_rx l_p / (λ² L²), (2)

where S_tx is the maximum effective transmit area; S_rx is the effective receive area; α is the angle between the transmitter and the vertical reference line.
Since we consider mobile energy harvesters in the system, the distance and effective charging area may vary over time. We assume that time is slotted and the position of any mobile device within one time slot is constant. In time slot n, the power harvested at receiver k can be denoted as

p_rx,k(n) = μ η p_tx S_tx cos(α(n)) S_rx l_p / (λ² L(n)²). (3)

For a mobile energy harvester, the power harvested in different time slots is determined by the angle between the transmitter and the vertical reference line, α(n), together with the distance between the transmitter and the harvester, L(n), in that time slot.
In our model, the mobile transmitter is free to adjust the transmit angle α(n) and the distance L(n) as it can move around the IoT devices. We assume that effective charging is counted only when α(n) = 0 and L(n) ≤ 45 cm.
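The power model and the effective-charging condition above can be sketched together. The function names and the default values of η, l_p, and μ are placeholders, not measured system parameters; only the Friis-style form and the α(n) = 0, L(n) ≤ 45 cm rule come from the text.

```python
import math

# Sketch of the far-field power model under the stated assumptions; the
# constants eta, l_p, and mu are illustrative placeholders.
def received_power(p_tx, g_tx, g_rx, wavelength, distance,
                   eta=0.5, l_p=0.5, mu=1.0):
    """Friis-style harvested power at distance L (linear units throughout)."""
    return mu * eta * p_tx * g_tx * g_rx * (wavelength ** 2) * l_p / (
        (4 * math.pi * distance) ** 2)

def effective_charging(alpha_n, distance_cm):
    """Charging counts only when alpha(n) = 0 and L(n) <= 45 cm."""
    return alpha_n == 0 and distance_cm <= 45
```

The quadratic falloff with distance is what makes the robot's path, i.e., how close and how squarely it passes each harvester, the dominant factor in the harvested energy.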

Problem Formulation
In this paper, the optimization problem is formulated as a Markov Decision Process (MDP), and a reinforcement learning (RL) algorithm is utilized to solve it. Furthermore, the DQN algorithm is applied to handle the large number of system states.

Problem Formulation.
In order to model our optimization problem as an RL problem, we define the test field as a grid of identical unit squares, each with a side length of 30 cm. K = 8 harvested energy-enabled IoT devices are deployed in the test field, indexed 0, 1, 2, 3, 4, 5, 6, and 7, respectively. The map is shown in Figure 2. The system state s_n at time slot n is defined as the position of the square where the robot is currently located in the test field, specified as s_n = pos(h, v), where h is the distance between the present square and the leftmost edge, counted in squares, and v is the distance between the present square and the upmost edge, counted in squares. For example, the No. 5 IoT device can be denoted as o_5 = pos(2, 0). The shaded area adjacent to the No. k IoT device indicates its effective charging area, denoted as eff_k. For example, the boundary of the effective charging area of the No. 6 IoT device is highlighted in red. We define the direction of movement in a particular time slot n as the action a_n. The set of possible actions A consists of 4 elements, A = {U, D, L, R}, where U is moving upward one unit, D is moving downward one unit, L is moving left one unit, and R is moving right one unit.
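The grid state and four-action model above can be sketched as a simple transition function. The field dimensions here are assumptions for illustration; only the pos(h, v) state and the {U, D, L, R} action set come from the text.

```python
# Illustrative grid model of the test field: states are pos(h, v) squares and
# the four actions move one unit; the field size is hypothetical.
GRID_W, GRID_H = 8, 8   # assumed field dimensions in unit squares

MOVES = {"U": (0, -1), "D": (0, 1), "L": (-1, 0), "R": (1, 0)}

def step(state, action):
    """Apply action a_n to state s_n = pos(h, v), clamped to the field."""
    h, v = state
    dh, dv = MOVES[action]
    return (min(max(h + dh, 0), GRID_W - 1),
            min(max(v + dv, 0), GRID_H - 1))
```

Clamping at the field boundary is one possible convention; the paper does not state how moves off the edge are handled.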
Given the above, the mobile wireless charging problem can be formulated as minimizing the time duration T for the robot to complete one loop, subject to the robot passing through at least one of the effective charging areas of each IoT device.

min T, s.t. s_0 = s_T; for each k ∈ {0, 1, ..., K−1}, there exists n ≤ T with s_n ∈ eff_k. (4)
The time duration for the robot to complete one loop is defined as T. The starting position is the same as the last position, since the robot cruises in a loop. In the loop, the robot has to pass through at least one square of the effective charging area of each IoT device.
Adapting to different positions, the agent chooses a different action at each time slot. Hence, we can model our proposed system as a Markov chain. In the system, we use the current position to specify a particular state s. S denotes the system state set. The starting state s_0 and the final state s_T are the same, since the robot needs to move and return to the starting point. The MDP proceeds as follows: the agent chooses an action a from A at a specific system state s; after that, the system transits into a new state s′. p_{s,s′}(a), with s, s′ ∈ S and a ∈ A, denotes the probability that the system state transits from s to s′ under action a.
The reward of the MDP is denoted as w(s, a, s′), defined for the transition from state s to state s′ under action a. The optimization problem is formulated as reaching s_T in the fewest time slots; hence, the reward has to be defined to discourage the mobile robot from repeatedly passing through the effective charging area of any IoT device. Besides, the rewards at different positions are interconnected with each other, since the goal of the optimization is to pass through the effective charging areas of all the IoT devices. We assume that the optimal order to pass through all the IoT devices is defined as o_0, o_1, ..., o_{K−1}. Specifically, the reward function is given in equation (5), where acc_{o_{k−1}} = 1 if the robot has already passed through an effective area of the o_{k−1}th IoT device, and ζ denotes the unit price of the harvested energy.
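As a hypothetical sketch of a reward with the properties described above (pay the unit price ζ only for a first-time visit that respects the optimal order, and charge a small cost per time slot), one could write the following. The exact form of equation (5) may differ; the step cost and branching here are assumptions.

```python
# Hypothetical reward consistent with the description: zeta is earned only
# when entering the effective area of the next device in the order
# o_0, ..., o_{K-1} for the first time; every time slot costs step_cost.
def reward(next_state, k, effective_areas, visited, zeta=4, step_cost=1):
    """visited[j] is 1 once device o_j's effective area has been reached."""
    if k > 0 and not visited[k - 1]:
        return -step_cost          # the previous device in order is unvisited
    if next_state in effective_areas[k] and not visited[k]:
        return zeta - step_cost    # first visit to the next device in order
    return -step_cost              # ordinary move
```

The per-slot cost makes shorter loops accumulate a higher total reward, matching the shortest-time objective.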
As we have defined all the necessary elements of the MDP, we can characterize the formulated problem as a stochastic shortest-path search that starts at s_0 and ends at s_T. At each system state s, we derive the best action a*(s), which generates the maximum reward. The optimal policy set is defined as π = {a(s): s ∈ S}.

Optimal Path Planning with Reinforcement Learning.
If the system dynamics obey a specific transition probability, reinforcement learning is a perfect match for solving the optimization problem. In this section, Q-learning [23] is first introduced to solve the proposed problem. After that, to address the large state and action sets, the DQN algorithm [14] is utilized to determine the optimal action for each particular system state.

Q-Learning
Method. The traditional Q-learning method is widely used to solve dynamic optimization problems provided that the number of system states is moderate. Corresponding to each particular system state, the best action can be determined to generate the highest reward.
Q(s, a) denotes the cost function, which uses a numerical value to describe the cost of taking action a at state s. At the beginning of the algorithm, all cost functions are zero, Q(s, a) = 0, since no action has ever been taken to generate any consequence. All the Q values are saved in the Q table. Only one cost function is updated in each time slot, as one action is taken and the corresponding reward is calculated.
The cost function is updated as

Q(s, a) ← Q(s, a) + σ(s′, a)[w(s, a, s′) + γ max_{a′∈A} Q(s′, a′) − Q(s, a)], (7)

where the learning rate is defined as σ(s′, a) and γ is the reward decay. When the algorithm initializes, the Q table is empty, since no exploration has been made to obtain any useful cost function to fill it. Since the agent has no experience of the environment, random action selection is implemented at the beginning of the algorithm. A threshold ϵ_c ∈ [0.5, 1] is designed to control the exploration. In each time slot, a numerical value p ∈ [0, 1] is generated and compared with the threshold. If p ≥ ϵ_c, action a is picked as

a = argmax_{a∈A} Q(s, a). (8)

However, provided that p < ϵ_c, an action is randomly selected from the action set A.
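The tabular update and threshold-based exploration above can be sketched as follows. The numeric defaults (σ = 0.1, ϵ_c = 0.8) are illustrative; γ = 0.99 matches the reward decay reported in the experimental section.

```python
import random
from collections import defaultdict

# Tabular Q-learning sketch with the epsilon-threshold exploration described
# in the text; sigma and eps_c values are illustrative.
def q_update(Q, s, a, r, s_next, actions, sigma=0.1, gamma=0.99):
    """Q(s,a) <- Q(s,a) + sigma * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += sigma * (r + gamma * best_next - Q[(s, a)])

def choose_action(Q, s, actions, eps_c=0.8):
    """Greedy if the random draw p >= eps_c, otherwise explore randomly."""
    if random.random() >= eps_c:
        return max(actions, key=lambda a: Q[(s, a)])
    return random.choice(actions)

Q = defaultdict(float)
q_update(Q, (0, 0), "R", 4.0, (1, 0), ["U", "D", "L", "R"])
```

With an empty table, a single update moves Q((0,0), R) from 0 toward the observed reward by one learning-rate step.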
After iteratively updating the values in the Q table, the Q values converge. We can then calculate the best action corresponding to each state by

π*(s) = argmax_{a∈A} Q*(s, a), (9)

which corresponds to finding the optimal moving direction for each system state explored during the charging process.

DQN Algorithm.
The increase in the number of IoT devices leads to an increase in the number of system states. Suppose that the Q-learning algorithm were used; a very large Q table would have to be created, and convergence would be too slow. The DQN algorithm is more suitable, since there is a deep neural network in the structure that can be well trained and immediately determine the best action to take. The deep neural network takes the system state as the input, and the Q value for each action is defined as the output. Hence, the function of the neural network is to generate the cost function for a particular state and action. We describe the cost function as Q(s, a, θ), where θ is the weight on the neuron nodes in the structure. As we collect data when different actions are taken in different time slots, the neural network is trained to update its weights, which lets it output a more precise Q value:

Q(s, a, θ) ≈ Q*(s, a). (10)

There are two identical neural networks in the structure of DQN [24]: one is called the evaluation network eval net, and the other is called the target network target net. Since these two deep neural networks have the same structure, multiple hidden layers are defined for each network. We use the current system state s and the next system state s′ as the inputs to eval net and target net, respectively. We use Q_e(s, a, θ) and Q_t(s′, a, θ′) to denote the outputs of the two deep neural networks eval net and target net. In the structure, in order to update the weights of the neuron nodes, we only continuously train the evaluation network eval net. The target network is not trained; it periodically duplicates the weights of the neurons from the evaluation network (i.e., θ′ = θ).
The loss function used to train eval net is described as

Loss(θ) = E[(y − Q_e(s, a, θ))²], (11)

where y represents the real Q value, which can be expressed as

y = w(s, a, s′) + γ max_{a′∈A} Q_t(s′, a′, θ′), (12)

with γ denoting the reward decay. We denote the learning rate as ϵ. The idea of backpropagation is utilized to update the weights of eval net; as a result, the neural network is trained. The experience replay method is utilized to improve the training effect, since it can effectively eliminate the correlation among the training data. Each single experience includes the system state s, the action a, and the next system state s′, together with the reward w(s, a, s′). We define the experience set as ep = {s, a, w(s, a, s′), s′}. In the algorithm, D individual experiences are saved and, in each training epoch, only D_s (with D_s < D) experiences are selected from D. As the training process is completed, target net copies the weights of the neurons from the evaluation network (i.e., θ′ = θ). D different experiences are generated from ep, while only D_s are picked to train the evaluation network eval net. The total number of training iterations is denoted as U. Both the evaluation network and the target network share the same structure, in which the deep neural networks have N_l hidden layers.
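The eval net / target net interplay with experience replay can be sketched compactly. This is a minimal illustration, not the authors' TensorFlow implementation: linear per-state Q tables stand in for the multi-layer networks, and the hyperparameter values are placeholders.

```python
import random
from collections import deque

import numpy as np

# Minimal sketch of DQN training with experience replay and a periodically
# copied target network; "networks" are linear tables for illustration.
class TinyDQN:
    def __init__(self, n_states, n_actions, lr=0.01, gamma=0.99,
                 pool=2000, batch=10, copy_every=100):
        self.theta = np.zeros((n_states, n_actions))   # eval net weights
        self.theta_t = self.theta.copy()               # target net weights
        self.pool = deque(maxlen=pool)                 # experience pool ep
        self.lr, self.gamma = lr, gamma
        self.batch, self.copy_every = batch, copy_every
        self.steps = 0

    def train_step(self):
        if len(self.pool) < self.batch:
            return
        # Sample D_s experiences and descend on the squared TD error.
        for s, a, r, s_next, done in random.sample(self.pool, self.batch):
            y = r if done else r + self.gamma * self.theta_t[s_next].max()
            self.theta[s, a] += self.lr * (y - self.theta[s, a])
        self.steps += 1
        if self.steps % self.copy_every == 0:
            self.theta_t = self.theta.copy()           # theta' <- theta
```

Sampling uniformly from the pool breaks the temporal correlation between consecutive experiences, which is the point of replay.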

Dueling Double DQN.
In order to improve the performance of DQN, which must effectively select the optimal action to charge multiple harvesters under time-varying channel conditions, we redesign the structure of the deep neural network using Dueling Double DQN. Double DQN is an advanced version of DQN which prevents the overestimation problem that appears during training [24]. In the same number of training epochs, Dueling Double DQN is proved to outperform the original DQN in learning efficiency.
In traditional DQN, as shown in equation (12), the target network target net is designed to derive the cost function for a particular system state. Nevertheless, because we do not update the weights of the target network target net in each training epoch, the training error increases during training, hence prolonging the training procedure. In Double DQN, both the target network target net and the evaluation network eval net are used to calculate the cost function: the evaluation network eval net selects the best action for system state s′, while the target network evaluates it.
The latest research proves that the training error can be dramatically reduced using the Double DQN structure [24].
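The Double DQN target computation described above can be sketched in a few lines. The Q-value arrays here are hypothetical outputs of the two networks for a single next state s′.

```python
import numpy as np

# Double DQN target sketch: eval net selects the best next action, target net
# evaluates it; q_eval_next and q_target_next are the networks' Q rows for s'.
def double_dqn_target(r, q_eval_next, q_target_next, gamma=0.99):
    """y = r + gamma * Q_t(s', argmax_a Q_e(s', a))."""
    a_star = int(np.argmax(q_eval_next))      # action chosen by eval net...
    return r + gamma * q_target_next[a_star]  # ...but valued by target net

y = double_dqn_target(1.0, np.array([0.2, 0.9]), np.array([0.5, 0.3]))
```

Decoupling selection from evaluation is what suppresses the overestimation bias of the plain max in equation (12).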
In traditional DQN, we only define the cost function (the Q value) as the output of the deep neural network. Dueling DQN is invented to speed up the convergence of the deep neural network by designing 2 individual output streams for the deep neural network. We use the output value V(s, θ, β) to represent the first stream of the neural network; it denotes the cost function for a specific system state. We name the second stream of the output the advantage output A(s, a, θ, α), which is utilized to illustrate the advantage of taking a specific action in a system state [25]. We define α and β as the parameters that correlate the outputs of the two streams with the neural network. The cost function can be denoted as

Q(s, a, θ, α, β) = V(s, θ, β) + (A(s, a, θ, α) − (1/|A|) Σ_{a′∈A} A(s, a′, θ, α)).

The latest research proves that Dueling DQN can speed up the training procedure by efficiently eliminating the additional degree of freedom while training the deep neural network [25].
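The dueling aggregation can be sketched directly from the value and advantage streams; the mean subtraction is the standard way to remove the extra degree of freedom between V and A.

```python
import numpy as np

# Dueling aggregation sketch: combine the state-value stream V and the
# advantage stream A with the mean-subtracted form.
def dueling_q(value, advantages):
    """Q(s,a) = V(s) + A(s,a) - mean_a' A(s,a')."""
    advantages = np.asarray(advantages, dtype=float)
    return value + advantages - advantages.mean()

q = dueling_q(5.0, [1.0, 3.0])
```

Because a constant added to all advantages cancels out, the network cannot trade value against advantage arbitrarily, which speeds up convergence.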

Area Division Deep Reinforcement Learning.
In this paper, the optimization problem can be seen as calculating the optimal closed-loop path which generates the maximum accumulated reward. However, the traditional DQN has difficulty converging to the optimal path because of the complicated experimental field. In order to improve on the traditional DQN, we invent the AD-DQN in this paper. At first, the experimental field is divided into multiple separate parts. DQN is run on each part individually to obtain the optimal path for the robot in that part. Finally, the entire closed-loop path is formed from the path of each part. In the area division, the whole area is denoted as W. The whole area is divided at multiple specific division points p_i ∈ P.
The criterion for picking p_i is to find the squares that lie in more than one effective charging area of the IoT devices; the set of IoT devices whose effective charging areas contain such shared squares is denoted as K_e ⊆ {o_j}, j = 0, 1, ..., K−1. In the clockwise direction, we find the IoT device o_i that has the shortest distance to p_i, and then add both o_i and the effective charging area of o_i to N_i. Next, we find the IoT device with the shortest distance to the IoT device o_i that was just added to set N_i, and add both the new IoT device and its effective charging area to N_i. Iterating in this way, all the IoT devices besides the ones in K_e are included in one N_i. Finally, all the remaining squares are classified to the nearest N_i, so that ∪_i N_i = W.
In each area, DQN is run to determine the optimal path for the robot. In each area, the starting point is the position of p_i, and the end point is one of the effective charging squares of the IoT device furthest from the starting point in the same area. After the optimal path is calculated for each individual area, the closed-loop optimal path for the entire area can be synthesized. The algorithm is shown in Algorithm 1. The number of areas to be divided is |P| + 1.
In the clockwise direction, find the IoT device that has the shortest distance to r_1; the order of that IoT device is g = argmin_{o_i ∉ K_e} |o_i − r_1|, and N_i is updated accordingly. else (xi) In the counterclockwise direction, find the IoT device that has the shortest distance to r_2; the order of that IoT device is g = argmin_{o_i ∉ K_e} |o_i − r_2|, and N_i is updated accordingly. (xii) end (xiii) end while (xiv) end (xv) for i = 1, 2, ..., |P| + 1: (xx) for j = 1, 2, ..., |J|: (xxi) The starting point is defined as p_i. The end point is defined as e_j ∈ eff_c, j ∈ J. (xxii) The weights of the neuron nodes θ are randomly generated for eval net, and the weights are copied by target net (θ′ = θ). else (xxviii) Randomly choose the action from action set A.
(xxix) end (xxx) while s′ ≠ s_T: (xxxi) The state transits into s′ after taking the action. Save the experience {s, a, w(s, a, s′), s′} into the experience pool; D stays unchanged if the pool has reached its limit, otherwise D is incremented. t = t + 1; s = s′. After enough data has been collected in the experience pool, eval net is trained using D_s of the D experiences; minimize the loss function Loss(θ) using backpropagation; target net copies the weights from eval net periodically. (xxxii) end while (xxxiii) end while (xxxiv) The optimal path of the entire test field is synthesized from the optimal paths in each W_i.
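The division-and-chaining idea in Algorithm 1 can be sketched as follows. This is a simplified illustration of the nearest-neighbor chaining only: the distance metric, tie-breaking, round-robin claiming order, and data layout are assumptions, not the authors' exact procedure.

```python
# Simplified sketch of area division: starting from each division point p_i,
# devices are claimed greedily by nearest-neighbor chaining (each area's
# anchor moves to the device it just claimed). Geometry is illustrative.
def divide(devices, division_points):
    """devices: {name: (h, v)}; returns {area index: [device names]}."""
    def dist(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])   # grid (Manhattan) distance

    unclaimed = dict(devices)
    areas = {i: [] for i in range(len(division_points))}
    anchors = list(division_points)
    while unclaimed:
        for i, anchor in enumerate(anchors):
            if not unclaimed:
                break
            name = min(unclaimed, key=lambda n: dist(unclaimed[n], anchor))
            anchors[i] = unclaimed.pop(name)   # chain from the newest device
            areas[i].append(name)
    return areas
```

Each area thus grows outward along a chain of mutually nearest devices, which is the behavior the iterative step of Algorithm 1 describes.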

Experimental Results
The implementation of the proposed wireless power transfer system is shown in Figure 3.
In the test field, 8 harvested energy-enabled IoT devices are placed as Figure 2 indicates. The top view of the test field can be seen as a 2D map. Hence, the map is modeled and input into the computer. Then the AD-DQN algorithm is implemented in the computer using Python, and the optimal charging path is derived. At the same time, a wireless power transfer robot is assembled. Two Powercast RF power transmitters TX91501 [8] are mounted on two sides of the Raspberry Pi [26] enabled intelligent driving robot. Each transmitter is powered by a 5 V power bank and continuously emits 3 W of RF power. The infrared patrol module is installed on the robot to implement autodrive on the test field; hence, the robot can automatically cruise along the path and continuously charge the multiple IoT devices, as shown in Figure 1. To the best of our knowledge, we are the first to implement an automatic wireless power transfer system in a test field and to invent the AD-DQN algorithm to design the optimal path for the wireless power transfer robot. Since we are the first to design and implement a mobile far-field wireless power transfer system, there is no hardware reference design we can refer to and use for validation, so the validation of our work is done on the software side. Referring to the flowchart, our mobile wireless power transfer system can be replicated.
For the software, we use TensorFlow 0.13.1 together with Python 3.8 in Jupyter Notebook 5.6.0 as the simulation environment to train the AD-DQN. The number of hidden layers is 4, and each hidden layer has 100 nodes. The learning rate is less than 0.1. The mini-batch size is 10. The learning frequency is 5. The training starting step is 200. The experience pool holds more than 20000 experiences. The exploration interval is 0.001. The target network replacement interval is greater than 100. The reward decay is 0.99.
First, different reward functions are tested to find the optimal one. Reward one, reward_1, is defined using equation (5). The unit price is defined as ζ = 4. Reward two, reward_2, is defined as in equation (6), where ζ = 4. Reward three, reward_3, is defined with equation (5) but with ζ = 2. Two factors are observed for the performance of the different rewards: the average reward during training and the average time consumption during training. Based on the procedure of AD-DQN in Algorithm 1, the experimental field is divided into two areas along the only shared effective charging area of device 2 and device 3. Area I contains IoT devices 2, 3, 4, 5, and 6, while area II contains IoT devices 0, 1, 2, 6, and 7.
In area I, the performances of three different rewards are compared in Figures 4 and 5.
In area II, the performances of three different rewards are compared in Figures 6 and 7.
From Figures 4 and 5, we can observe that reward_1 is optimal. Since all three rewards perform similarly on time consumption, and reward_1 achieves the highest average reward among all, reward_1 can effectively charge the most IoT devices compared with the other two rewards. From Figures 6 and 7, we can observe that reward_3 performs best on the time consumption to complete one episode; however, reward_1 achieves a much higher average reward than reward_3. That can be explained as follows: compared with reward_1, reward_3 can only effectively charge a smaller number of the IoT devices.
Overall, reward 1 has optimal performance in both areas I and II; henceforth, reward 1 is used to define the reward for AD-DQN.
In Figures 8 and 9, the performances of four different algorithms are compared. The random action selection randomly selects the action in the experimental test field. As with AD-DQN, reward_1 is used as the reward of Q-learning and DQN.
We define the successful charging rate as the number of IoT devices that can be successfully charged in one complete charging episode over the total number of IoT devices. From Figure 8, we can observe that random action selection has the worst successful charging rate. That can be explained as follows: random action selection never converges to either a suboptimal or an optimal path. Q-learning performs better than random action selection; however, it is outperformed by the other two algorithms, since Q-learning can only deal with simple reinforcement learning models. DQN performs better than Q-learning and random action selection; however, it is outperformed by AD-DQN: since the rewards for different states are defined as interconnected, even with a reward decay of 0.99, DQN still cannot learn the optimal solution. When the total number of IoT devices decreases, DQN and AD-DQN perform the same, since the decrease in the number of IoT devices weakens the interconnections between different system states. From Figure 9, we can observe that, compared with the other algorithms, AD-DQN is not the one consuming the fewest time slots to complete one charging episode; however, AD-DQN is still the optimal algorithm, since none of the other algorithms achieve a 100% effective charging rate, which is why they consume fewer time slots to complete one charging episode. In Figure 10, the optimal path determined by AD-DQN is shown as the bold black line. The arrows on the path show the direction for the robot to move, as we assume that the robot is regulated to cruise on the path in the counterclockwise direction. In this way, the robot can continuously charge all the IoT devices. The experimental demonstration is shown in Figure 1.

Conclusions
In this paper, we invent a novel deep reinforcement learning algorithm, AD-DQN, to determine the optimal path for a mobile wireless power transfer robot to dynamically charge harvested energy-enabled IoT devices. The invented algorithm can intelligently divide a large area into multiple subareas, implement an individual DQN in each area, and finally synthesize the entire path for the robot. Compared with the state of the art, the proposed algorithm can effectively charge all the IoT devices on the experimental field.
The whole system can be used in many application scenarios, such as charging IoT devices in dangerous areas and charging medical devices.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Authors' Contributions
Yuan Xing designed the proposed wireless power transfer system, formulated the optimization problem, and proposed the innovative reinforcement learning algorithm. Riley Young, Giaolong Nguyen, and Maxwell Lefebvre designed, built, and tested the wireless power transfer robot on the wireless power transfer test field. Tianchi Zhao optimized the performance of the proposed deep reinforcement learning algorithm. Haowen Pan implemented the comparison on the system performance between the proposed algorithm and the state of the art. Liang Dong provided the theoretical support for far-field RF power transfer technique.