Optimal Wireless Information and Power Transfer Using Deep Q-Network

In this paper, a multiantenna wireless transmitter communicates with an information receiver while radiating RF energy to surrounding energy harvesters. The channel between the transceivers is known to the transmitter, but the channels between the transmitter and the energy harvesters are unknown to the transmitter. By designing its transmit covariance matrix, the transmitter fully charges the energy buffers of all energy harvesters in the shortest amount of time while maintaining the target information rate toward the receiver. At the beginning of each time slot, the transmitter determines the particular beam pattern to transmit with. Throughout the whole charging process, the transmitter does not estimate the energy harvesting channel vectors. Due to the high complexity of the system, we propose a novel deep Q-network algorithm to determine the optimal transmission strategy for complex systems. Simulation results show that the deep Q-network is superior to the existing algorithms in terms of the time needed to complete the wireless charging process.


Introduction
For a wireless transceiver pair with multiple antennas, optimizing the transmit covariance matrix can achieve high data-rate communication over the multiple-input multiple-output (MIMO) channel. Meanwhile, the radiated radio frequency (RF) energy can be acquired by the nearby RF energy harvesters to charge the electronic devices [1].
The problem of simultaneous wireless information and power transfer (SWIPT) has been widely discussed in recent years. SWIPT systems are divided into two categories: (1) the receiver splits the received signals for information decoding and energy harvesting [2,3]; (2) separated and dedicated information decoders (ID) and RF energy harvesters (EH) exist in the systems [4]. For the second type of system, different transmission strategies have been proposed to achieve good performance points in the rate-energy region [1,2,5]. For multiple RF energy harvesters in the vicinity of the wireless transmitter, the covariance matrix at the transmitter is designed to either maximize the net energy harvesting rate or fairly distribute the radiated RF energy among the harvesters [6,7]. The achievable information rate of the wireless transmitter-receiver pair is kept above a minimum requirement for reliable communication. Most of the existing works assume that the channel state information (CSI) is completely known. Given the complete CSI, the transmitter designs the transmit covariance matrix to achieve the maximum information rate while satisfying the RF energy harvesting requirement [4,8].
However, in practice, it is difficult for the transmitter to obtain the channel state information of the nearby RF energy harvesters because the harvesters are scattered and hardware-limited, which makes channel estimation at the RF energy harvesters challenging [9,10]. The analytic center cutting plane method (ACCPM) was proposed for the transmitter to approximate the channel information iteratively with a few bits of feedback from the RF energy receiver [10]. Since this method is implemented by solving a convex optimization problem, the algorithm leads to high computational complexity. To reduce complexity, channel estimation based on Kalman filtering was proposed [11]. Nevertheless, the disadvantage of this approach is the slow convergence rate. In order to effectively deal with the CSI acquisition problem, in our paper, we use a deep learning algorithm to solve the optimization problem in the SWIPT system with only partial channel information.
The partial CSI is easy to acquire and is already enough to achieve superior system performance using the deep Q-network. To the best of our knowledge, we are the first to use the deep Q-network to optimize the SWIPT system performance and validate its superiority.
In our model, the transmitter intends to fully charge the energy buffers of all surrounding energy harvesters in the shortest time while maintaining a target information rate toward the receiver. The communication link is defined as a strong line-of-sight (LOS) transmission, which is supposed to be invariant, but the energy harvesting channel conditions vary over time. Due to current hardware limitations, we assume that the estimation of the energy harvesting channel vectors cannot be implemented under the fast varying channel conditions. As a result, the wireless charging problem can be modeled as a high-complexity discrete-time stochastic control process with unknown system dynamics [12]. In [13], a similar problem has been explored, and a multiarmed bandit algorithm is used to determine the optimal transmission strategy. In our paper, we apply a deep Q-network to solve the optimization problem, and the simulation results demonstrate that the deep Q-network algorithm outperforms the multiarmed bandit algorithm. Historically, the deep Q-network has a strongly proven record of attaining mastery over complex games with a very large number of system states and unknown state transition probabilities [12]. More recently, the deep Q-network has been applied to deal with complex communication problems and has been shown to achieve good performance [14][15][16]. For this reason, we found the deep Q-network fitting for our model. In our model, we consider the accumulated energy of the energy harvesters as the system state, while we define the action as the transmit power allocation. At the beginning of each time slot, each energy harvester sends feedback about its accumulated energy level to the wireless transmitter, and the transmitter collects all the information in order to generate the system state and inputs it into a well-trained deep Q-network. The deep Q-network outputs the Q values corresponding to all possible actions. The action with the maximum Q value is selected as the beam pattern to be used for the transmission during the current time slot.
Based on the traditional deep Q-network, the double deep Q-network and dueling deep Q-network algorithms are applied in order to reduce the observed overestimations [17] and improve the learning efficiency [18]. Henceforth, we apply the dueling double deep Q-network to solve the wireless charging problem with varying channels and multiple energy harvesters. The novelties of this paper are summarized as follows:
(i) The simultaneous wireless information and power transfer problem is formulated as a Markov decision process (MDP) in an unknown varying channel condition for the first time.
(ii) The deep Q-network algorithm is applied to solve the proposed optimization problem for the first time. We demonstrate that, compared to the other existing algorithms, the deep Q-network is superior in efficient and stable wireless power transfer.
(iii) Multiple experimental scenarios are explored. By varying the number of transmission antennas and the number of energy harvesters in the system, the performance of both the deep Q-network and the other algorithms is compared and analyzed.
(iv) The evaluation of the algorithms is based on real experimental data, which validates the effectiveness of the proposed deep Q-network in real-time simultaneous wireless information and power transfer systems.
The rest of the paper is organized as follows. In Section 2, we describe the simultaneous wireless information and power transfer system model. In Section 3, we model the optimization problem as a Markov decision process and present a deep Q-network algorithm to determine the optimal transmission strategy. In Section 4, we present our simulation results for different experimental environments. Section 5 concludes the paper.

System Model
As shown in Figure 1, an information transmitter communicates with its receiver while being perceived by $K$ nearby RF energy harvesters [8]. Both the transmitter and the receiver are equipped with $M$ antennas, while each RF energy harvester is equipped with one receive antenna. The baseband received signal at the receiver can be represented as

$$y = Hx + z,$$

where $H \in \mathbb{C}^{M \times M}$ denotes the normalized baseband equivalent channel from the information transmitter to its receiver, $x \in \mathbb{C}^{M \times 1}$ represents the transmitted signal, and $z \in \mathbb{C}^{M \times 1}$ is the zero-mean circularly symmetric complex Gaussian noise with $z \sim \mathcal{CN}(0, \rho^2 I)$. The transmit covariance matrix is denoted by $Q$, i.e.,

$$Q = \mathbb{E}[x x^H].$$

The covariance matrix is Hermitian positive semidefinite, i.e., $Q \succeq 0$. The transmit power is restricted by the transmitter's power constraint $P$, i.e., $\mathrm{Tr}(Q) \leq P$. For the information transmission, we assume that a Gaussian codebook with infinitely many code words is used for the symbols and that the expectation of the transmit covariance matrix is taken over the entire codebook. Therefore, $x$ is zero-mean circularly symmetric complex Gaussian with $x \sim \mathcal{CN}(0, Q)$. With transmitter precoding and receiver filtering, the capacity of the MIMO channel is the sum of the capacities of the parallel noninterfering single-input single-output (SISO) channels (eigenmodes of channel $H$) [19]. We convert the MIMO channel to $M$ eigenchannels for information and energy transfer [20,21]. A singular value decomposition (SVD) of $H$ gives $H = U \Sigma V^H$, where $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_M)$ contains the $M$ singular values of $H$. Since the MIMO channel is decomposed into $M$ parallel SISO channels, the information rate can be given by

$$r = \sum_{m=1}^{M} \log_2\left(1 + \frac{\sigma_m^2 q_m}{\rho^2}\right),$$

where $q_m$ are the diagonal elements of $\tilde{Q}$ with $\tilde{Q} = V^H Q V$. The RF energy harvester received power specifies the harvested energy normalized by the baseband symbol period and scaled by the energy conversion efficiency. The received power at the $i$th energy harvester is

$$p_i = g_i^H Q g_i,$$

where $g_i \in \mathbb{C}^{M \times 1}$ is the channel vector from the transmitter to the $i$th energy harvester. With MIMO channel decomposition, the received power at energy harvester $i$ is denoted as

$$p_i = \sum_{m=1}^{M} |\tilde{g}_{im}|^2 q_m,$$

where $\tilde{g}_{im}$ are the elements of vector $\tilde{g}_i$ with $\tilde{g}_i = V^H g_i$. We define the simplified channel vector from the transmitter to the $i$th RF energy harvester as

$$c_i = \left[|\tilde{g}_{i1}|^2, |\tilde{g}_{i2}|^2, \ldots, |\tilde{g}_{iM}|^2\right]^T$$

for each $i \in \mathcal{K} = \{1, 2, \ldots, K\}$. The simplified channel vector contains no phase information. The $K$ simplified channel vectors compose matrix $C \in \mathbb{R}^{M \times K}$ as

$$C = [c_1, c_2, \ldots, c_K].$$

In what follows, we assume that time is slotted, that each time slot has a duration $T$, and that each energy harvester is equipped with an energy buffer of size $B_i \in [0, B_{\max}]$, $i \in \mathcal{K}$. Without loss of generality, we assume that, at $t = 0$, all harvesters' buffers are empty, which corresponds to system state $s_0 = [0, 0, \ldots, 0]$. At a generic time slot $t$, the transmitter transmits with one of the designed beam patterns. Each harvester $i$ can harvest the specific amount of power $p_i$, and its energy buffer value increases to $B_i^{t+1} = B_i^t + p_i T$. Therefore, each state of the system includes the accumulated harvested energy information of all $K$ harvesters, i.e.,

$$s_t = [B_1^t, B_2^t, \ldots, B_K^t],$$

where $B_i^t$ denotes the $i$th energy harvester's accumulated energy up to time slot $t$.
Once all harvesters are fully charged, we assume that the system arrives at a final goal state denoted as

$$s_G = [B_{\max}, B_{\max}, \ldots, B_{\max}].$$

We note that the energy buffer level $B_{\max}$ also accounts for situations in which $B_i > B_{\max}$.
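The per-slot bookkeeping of the system model can be illustrated with a minimal Python sketch. It assumes the standard eigenchannel rate expression (sum of log2(1 + sigma_m^2 q_m / rho^2)) and the per-harvester power sum over m of |g_im|^2 q_m described above; the function names and the example numbers in the usage note are hypothetical and are not taken from the paper's experiments.

```python
import math


def information_rate(sigma, q, noise_power):
    """Sum rate in bit/s/Hz over the M parallel eigenchannels.

    sigma: singular values of H, q: power per eigenchannel,
    noise_power: rho^2 (same linear units as q).
    """
    return sum(math.log2(1.0 + (s ** 2) * p / noise_power)
               for s, p in zip(sigma, q))


def harvested_power(c_i, q):
    """Received power at one harvester: sum_m c_im * q_m,
    where c_im = |g_im|^2 is the simplified (phase-free) channel."""
    return sum(c * p for c, p in zip(c_i, q))


def update_buffers(buffers, channels, q, slot_duration, b_max):
    """Advance every energy buffer by one time slot and clip at b_max.

    channels is a list of K simplified channel vectors (the columns of C).
    """
    return [min(b_max, b + harvested_power(c_i, q) * slot_duration)
            for b, c_i in zip(buffers, channels)]
```

For instance, `update_buffers([0.0, 0.9], [[0.5, 0.25], [1.0, 0.0]], [2, 4], 0.1, 1.0)` adds 0.2 units of energy to each buffer and clips the second one at the assumed buffer limit of 1.0, which mirrors the remark that the level $B_{\max}$ also covers the case $B_i > B_{\max}$.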

Problem Formulation for Time-Varying Channel Conditions
In this section, we suppose that the communication link is characterized by strong LOS transmission, which results in an invariant channel matrix H, while the energy harvesting channel vector g varies over time slots. We model the wireless charging problem as a Markov decision process (MDP) and show how to solve the optimization problem using reinforcement learning (RL). When the number of system states is very large, we apply a deep Q-network algorithm to acquire the optimal strategy at each particular system state.

Problem Formulation.
In order to model our optimization problem as an RL problem, we define the beam pattern chosen in a particular time slot $t$ as the action $a_t$. The set of possible actions $A$ is determined by equally generating $L$ different beam patterns with power allocation vector $q = [q_1, \ldots, q_M]$ that satisfies the power and information rate constraints, i.e.,

$$A = \left\{ q : \sum_{m=1}^{M} q_m \leq P, \ \sum_{m=1}^{M} \log_2\left(1 + \frac{\sigma_m^2 q_m}{\rho^2}\right) \geq R \right\}.$$

Each beam pattern corresponds to a particular power level $p_i$, which depends not only on the action $a_t$ but also on the channel condition experienced by the harvester during time slot $t$.
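A discrete action set of this kind can be sketched as follows. The sketch assumes that the power budget is split over the eigenchannels on a uniform grid (a 2 W step is suggested by allocations such as [2, 6, 4] that appear later in the paper, but the exact generation rule is not stated), that the full budget is always used, and that the rate follows the standard eigenchannel expression. The function name and all arguments are hypothetical.

```python
import math
from itertools import product


def generate_actions(num_channels, total_power, step, sigma, noise_power, min_rate):
    """Enumerate grid power allocations that use the full power budget
    and meet the minimum information rate.

    total_power and step are integers so that the budget check is exact.
    """
    levels = range(0, total_power + 1, step)
    actions = []
    for q in product(levels, repeat=num_channels):
        if sum(q) != total_power:
            continue  # assumption: the transmitter always spends the whole budget
        rate = sum(math.log2(1.0 + (s ** 2) * p / noise_power)
                   for s, p in zip(sigma, q))
        if rate >= min_rate:
            actions.append(q)
    return actions
```

With three unit-gain eigenchannels, a 12 W budget, a 2 W step, and no rate requirement, the grid contains 28 allocations; raising the rate requirement prunes the unbalanced allocations first, since the sum rate is maximized by the even split.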
Given the above, the simultaneous wireless information and energy transfer problem for a time-varying channel can be formulated as minimizing the time consumption $n$ needed to fully charge all the energy harvesters while maintaining the information rate between the information transceivers; we refer to this problem as $P_1$. In general, the action selected at each time slot will be different in order to adapt to the current channel conditions and the current energy buffer state of the harvesters. Therefore, the evolution of our system can be described by a Markov chain, where the generic state $s$ is identified by the current buffer levels of the harvesters, i.e., $s = [B_1, B_2, \ldots, B_K]$. The set of all states is denoted by $S$. Among all states, we are interested in the state in which all harvesters' buffers are empty, namely, $s_0 = [0, \ldots, 0]$, and the state $s_G$ in which all the harvesters are fully charged, i.e., $s_G = [B_{\max}, \ldots, B_{\max}]$. If we suppose that we know all the channel coefficients at each time slot, problem $P_1$ can be seen as a stochastic shortest path (SSP) problem from state $s_0$ to state $s_G$. At each time slot, the system is in a generic state $s$, the transmitter selects a beam pattern or action $a \in A$, and the system moves to a new state $s'$. The dynamics of the system are captured by the transition probabilities $p_{s,s'}(a)$, with $s, s' \in S$ and $a \in A$, describing the probability that the harvesters' energy buffers reach the levels in $s'$ after a transmission with beam pattern $a$. We note that the goal state $s_G$ is absorbing, i.e., $p_{s_G, s_G}(a) = 1$, $\forall a \in A$.

Figure 1: Wireless information transmitter and receiver surrounded by multiple RF energy harvesters.
Each transition also has an associated reward, $w(s, a, s')$, that denotes the reward when the current state is $s \in S$, action $a \in A$ is selected, and the system moves to state $s' \in S$. Since we aim at reaching $s_G$ in the fewest transmission time slots, we consider that the action entails a positive reward related to the difference between the current energy buffer level and the full energy buffer level of all harvesters. When the system reaches state $s_G$, we set the reward as 0. In this way, the system not only tries to fully charge all harvesters in the shortest time but will also uniformly charge all the harvesters. In detail, the reward function is defined in terms of the unit price $\lambda$ of the harvested energy. It is noted that different reward functions can also be selected. As an example, it is also possible to set a constant negative reward (e.g., a unitary cost) for each transmission after which the system does not reach the goal state and a big positive reward only for the states and actions that bring the system to the goal state $s_G$. We note that this reward formulation, given in equation (11), is actually equivalent to minimizing the number of time slots required to reach state $s_G$ starting from state $s_0$.
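The alternative reward with a constant per-slot cost and a bonus at the goal can be sketched in a few lines of Python. The default values (10 for reaching the goal, -1 otherwise) are borrowed from the reward labelled Reward 2 in the simulation section; the function names are hypothetical. The second function illustrates, under the assumption of an undiscounted return, why maximizing this reward amounts to minimizing the number of time slots.

```python
def sparse_reward(next_state, goal_state, goal_reward=10.0, step_cost=-1.0):
    """Constant negative reward per slot, positive reward on reaching the goal."""
    return goal_reward if tuple(next_state) == tuple(goal_state) else step_cost


def episode_return(num_slots, goal_reward=10.0, step_cost=-1.0):
    """Undiscounted return of an episode that reaches the goal in num_slots slots:
    (num_slots - 1) non-goal transitions followed by one goal transition."""
    return (num_slots - 1) * step_cost + goal_reward
```

Because the return decreases by one unit of cost for every extra slot, an episode that finishes in 43 slots always scores higher than one that finishes in 60 slots, whatever the goal bonus is.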
Using the above formulation, the optimization problem $P = (S, A, p, w, s_0, s_G)$ can then be seen as a stochastic shortest path search from state $s_0$ to state $s_G$ on the Markov chain with states $S$, probabilities $p_{s,s'}(a)$, actions $a \in A$, and rewards $w(s, a, s')$. Our objective is to find, for each possible state $s \in S$, an optimal action $a^*(s)$ so that the system will reach the goal state following the path with maximum average reward. A generic policy can be written as $\pi = \{a(s) : s \in S\}$. Different techniques can be applied to solve problem $P_1$, as it represents a particular class of MDPs. In this paper, however, we assume that the channel conditions at each time slot are unknown, which corresponds to not knowing the transition probabilities $p_{s,s'}(a)$. Therefore, in the next section, we describe how to solve the above problem using reinforcement learning.

Optimal Power Allocation with Reinforcement Learning.
Reinforcement learning is suitable for solving optimization problems in which the system dynamics follow a particular transition probability function but the probabilities $p_{s,s'}(a)$ are unknown. In what follows, we first show how to apply the Q-learning algorithm [22] to solve the optimization problem and then show how we can combine the reinforcement learning approach with a neural network to approximate the system model in the case of large state and action sets, using the deep Q-network [12].

Q-Learning Method.
If the number of system states is small, we can depend on the traditional Q-learning method to find the optimal strategy at each system state, as defined in the previous section.
To this end, we define the cost function of action $a$ on system state $s$ as $Q(s, a)$, with $s \in S$, $a \in A$. The algorithm initializes with $Q(s, a) = 0$ and then iteratively updates the Q values, where $\alpha(s', a)$ denotes the learning rate. In each time slot, only one Q value is updated, and hence, all the other Q values remain the same. At the beginning of the learning iterations, since the Q-table does not have enough information to choose the best action at each system state, the algorithm randomly explores new actions. Hence, we first define a threshold $\varepsilon_c \in [0.5, 1]$, and we then randomly generate a probability $p \in [0, 1]$. In the case that $p \geq \varepsilon_c$, we choose the action $a$ as $a = \arg\max_{a \in A} Q(s, a)$. On the contrary, if $p < \varepsilon_c$, we randomly select one action from the action set $A$.
When $Q^*$ converges, the optimal strategy at each state is determined as

$$a^*(s) = \arg\max_{a \in A} Q^*(s, a),$$

which corresponds to finding the optimal beam pattern for each system state during the charging process.
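The tabular procedure can be sketched in Python as follows. The update rule used here is the standard textbook Q-learning step, Q(s, a) <- Q(s, a) + alpha (w + gamma max Q(s', a') - Q(s, a)); it is an assumption that the paper's update has this form, and the constant learning rate `alpha`, the discount factor `gamma`, and the function names are hypothetical. The action selection follows the threshold rule described above: explore when p < eps_c, otherwise act greedily.

```python
import random
from collections import defaultdict


def select_action(q_table, state, actions, eps_c, rng=random):
    """Threshold exploration: random action if p < eps_c, greedy action otherwise."""
    p = rng.random()
    if p < eps_c:
        return rng.choice(actions)
    return max(actions, key=lambda a: q_table[(state, a)])


def q_update(q_table, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    """Standard one-step Q-learning update of a single table entry."""
    best_next = max(q_table[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])
    return q_table[(state, action)]


# Q(s, a) = 0 for every unseen pair, matching the initialization in the text.
q_table = defaultdict(float)
```

Only the entry for the visited state-action pair changes in each call to `q_update`, which matches the statement that a single Q value is updated per time slot.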

Deep Q-Network.
When considering a complex system with multiple harvesters, large energy buffers, and time-varying channel conditions, the number of system states dramatically increases. In order to learn the optimal transmit strategy at each system state, the Q-learning algorithm described before requires a Q-table with a large number of elements, making it very difficult for all the values in the Q-table to converge. Therefore, in what follows, we describe how to apply the deep Q-network (DQN) approach to find the optimal transmission policy. The main idea of DQN is to train a neural network to find the Q function of a particular system state and action combination. When the system is in state $s$ and action $a$ is selected, the Q function is denoted as $Q(s, a, \theta)$, where $\theta$ denotes the parameters of the Q-network. The purpose of training the neural network is to make $Q(s, a, \theta)$ approximate the optimal Q function $Q^*(s, a)$. According to the DQN algorithm [17], two neural networks are used to solve the problem: the evaluation network and the target network, which are denoted as eval_net and target_net, respectively. Both the eval_net and the target_net are set up with several hidden layers. The inputs of the eval_net and the target_net are the current system state $s$ and the next system state $s'$, respectively. The outputs of the eval_net and the target_net are denoted as $Q_e(s, a, \theta)$ and $Q_t(s', a, \theta')$, respectively. The evaluation network is continuously trained to update the value of $\theta$; however, the target network only copies the weight parameters from the evaluation network intermittently (i.e., $\theta' = \theta$). In each neural network learning epoch, the loss function loss($\theta$) is defined with respect to a target value $y$, which represents the real Q value, and $\varepsilon$ denotes the learning rate. As the loss function updates, the values are backpropagated through the neural network to update the weights of the eval_net. In order to better train the neural network, we apply the experience replay method to remove the correlation between different training data. Each experience consists of the current system state $s$, the action $a$, the next system state $s'$, and the corresponding reward $w(s, a, s')$. The experience is denoted by the set $ep = \{s, a, w(s, a, s'), s'\}$. The algorithm records $D$ experiences and randomly selects $D_s$ (with $D_s < D$) experiences from them for training. After the training is finished, the target_net clones all the weight parameters from the eval_net (i.e., $\theta' = \theta$). The algorithm used for the DQN training process is presented in Algorithm 1. In each training iteration of the algorithm, we generate $D$ usable experiences $ep$ and select $D_s$ of them for training the eval_net. In total, we suppose there are $U$ training iterations. We consider that, for both the eval_net and the target_net, there are $N_l$ layers in the neural network. In the learning process, we use $C$ to denote all energy harvesters' channel conditions in a particular time slot.
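The experience pool described above can be sketched with the Python standard library. This is a minimal illustration rather than the authors' implementation: the class name, the field names, and the choice of a fixed-capacity queue that discards the oldest experience are assumptions.

```python
import random
from collections import deque, namedtuple

# One experience ep = {s, a, w(s, a, s'), s'}.
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])


class ReplayBuffer:
    """Fixed-capacity experience pool with uniform random mini-batch sampling."""

    def __init__(self, capacity):
        # deque with maxlen silently drops the oldest entry when full
        self.memory = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.memory.append(Experience(state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        """Draw batch_size distinct experiences uniformly at random."""
        return rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)
```

Sampling uniformly from a pool that is much larger than the mini-batch means that consecutive, highly correlated transitions rarely end up in the same training batch, which is the purpose of experience replay stated in the text.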

Dueling Double Deep Q-Network.
Since more harvesters and time-varying channel conditions incur more system states, even if we utilize the original DQN, it is hard to learn the transmit rules for the transmitter. Therefore, we apply the dueling double DQN in order to deal with the overestimation problem during the training process and improve the learning efficiency of the neural network. Double DQN is a technique that strengthens the traditional DQN algorithm by preventing overestimation [17]. In traditional DQN, as shown in equation (18), we utilize the target_net to predict the maximum Q value of the next state. However, the target_net is not updated at every training episode, which may lead to an increase in the training error and therefore complicate the training process. In double DQN, we utilize both the target_net and the eval_net to predict the Q value. The eval_net is used to determine the optimal action to be taken for the system state $s'$. It can be shown that, following this approach, the training error considerably decreases [17].
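The difference between the two targets can be made concrete with a short Python sketch. It uses the commonly published forms of the DQN and double DQN targets; the discount factor `gamma`, the `done` flag, and the function names are assumptions, and the Q values are passed in as plain lists (one entry per action) rather than produced by a neural network.

```python
def dqn_target(reward, q_target_next, gamma, done=False):
    """Standard DQN target: the target_net both selects and evaluates the next action."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, q_eval_next, q_target_next, gamma, done=False):
    """Double DQN target: the eval_net selects the next action,
    the target_net supplies the value of that action."""
    if done:
        return reward
    best_action = max(range(len(q_eval_next)), key=q_eval_next.__getitem__)
    return reward + gamma * q_target_next[best_action]
```

With `q_eval_next = [1.0, 3.0, 2.0]` and `q_target_next = [5.0, 0.5, 4.0]`, the standard target trusts the largest target_net value (5.0), whereas the double DQN target uses the target_net value of the action preferred by the eval_net (0.5), which is how the decoupling dampens overestimation.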
In traditional DQN, the neural network only has the Q value as the output. In order to speed up the convergence, we apply dueling DQN by setting up two output streams from the neural network. The first stream is the value output $V(s, \theta, \beta)$ of the neural network, which represents the value of each system state. The second stream is called the advantage output $A(s, a, \theta, \alpha)$ and describes the advantage of applying each particular action to the current system state [18]. Here, $\alpha$ and $\beta$ are parameters that relate the two streams to the neural network output. Dueling DQN can efficiently eliminate the extra training freedom, which speeds up the training [18].
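How the two streams are merged into Q values can be sketched as below. The sketch uses the mean-subtracted aggregation that is commonly used for dueling networks; whether the paper uses this variant or another one is an assumption, and the function name is hypothetical.

```python
def dueling_q_values(state_value, advantages):
    """Combine the value stream V(s) and the advantage stream A(s, a):
    Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + adv - mean_advantage for adv in advantages]
```

Subtracting the mean advantage removes the freedom to shift a constant between the two streams, so the ranking of actions is unchanged while the decomposition into value and advantage becomes unique.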

Simulation Results
We simulate a MIMO wireless communication system with nearby RF energy harvesters. The wireless transmitter has at most $M = 4$ antennas. The $4 \times 4$ communication MIMO channel matrix $H$ is measured by two Wireless Open-Access Research Platform (WARP) v3 boards. Both WARP boards are mounted with the FMC-RF-2X245 dual-radio module, which operates in the 5.805 GHz frequency band. The Xilinx Virtex-6 FPGA operates as the central processing system, and WARPLab, which is compiled by MATLAB, is used for rapid physical layer prototyping [23]. We deploy the two transceivers with line-of-sight transmission. The maximum transmitted power is $P = 12$ W, and the noise power is $\rho^2 = -70$ dBm. The information rate requirement $R$ is 53 bps/Hz. The average channel gain from the transmitter to the energy harvester is −30 dB. The energy conversion efficiency is 0.1. The duration of one time slot is defined as $T = 100$ ms.
DQN is trained to solve for the optimal transmit strategy at each system state. The simulation parameters used for DQN are presented in Table 1.
As described in Section 3.2, the exploration rate $\varepsilon_c$ determines the probability that the network selects an action randomly or follows the values of the Q-table. Initially, we set $\varepsilon_c = 1$ because the experience pool has to accumulate a reasonable amount of data to train the neural network. The exploration rate then decreases by 0.001 at each training interval and finally stops at $\varepsilon_{ch} = 0.1$, once the experience pool has collected enough training data.
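The exploration schedule can be written as a one-line function. This is a minimal sketch using the values quoted above (start at 1, decrease by 0.001 per training interval, stop at 0.1); the function name and its keyword arguments are hypothetical.

```python
def exploration_rate(interval, start=1.0, step=0.001, floor=0.1):
    """Linearly decaying exploration rate, clipped at the floor value."""
    return max(floor, start - step * interval)
```

Under these values the floor is reached after 900 training intervals, after which one action in ten is still chosen at random.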
Refer to [24]. The dueling double DQN used in our paper is shown in Figure 2. The software environment for simulation is TensorFlow 0.12.1 with Python 3.6 in Jupyter Notebook 5.6.0.
For the energy harvesters' channel, to show an example of the performance achievable by the proposed algorithm, we consider the Rician channel fading model [25]. We suppose that the channel is invariant within each time slot $t$ and varies across different time slots [26]. At the end of each time slot, the energy harvester feeds the current energy level back to the transmitter. For the Rician fading channel model, the total gain of the signal is denoted as $g = g^s + g^d$, where $g^s$ is the invariant LOS component and $g^d$ denotes a zero-mean Gaussian diffuse component. The channel between the transmit antenna $m$ and the energy harvester $i$ can be denoted as $g_{im} = g^s_{im} + g^d_{im}$.
The magnitude of the faded envelope can be modeled using the Rice factor $K^r$ such that $K^r_{im} = \rho^2_{im} / 2\sigma^2_{im}$, where $\rho^2_{im}$ denotes the average power of the main LOS component between the transmit antenna $m$ and energy harvester $i$ and $\sigma^2_{im}$ denotes the variance of the scatter component. We can derive the magnitude of the main LOS component as $|g^s_{im}|$. The mean and the variance of $g_{im}$ are denoted as $\mu_{g_{im}} = g^s_{im}$ and $\sigma^2_{g_{im}} = \sigma^2_{im}$, respectively. In polar coordinates, $g_{im} = r_{im} e^{j\theta_{im}}$.
First, we explore the optimal deep Q-network structure under fading channels. We suppose the number of antennas is $M = 3$ and the number of energy harvesters is $K = 2$. The channel between each antenna of the transmitter and each harvester is individually Rician distributed. The action set $A$ contains 13 actions satisfying the information rate requirement. The LOS amplitude components of all channel links are defined as $r_{im} = 0.5$, with $i = 1, 2$ and $m = 1, 2, 3$. The LOS phase components of all channel links are defined as $\theta_{11} = \pi/4$, $\theta_{12} = \pi/2$, $\theta_{13} = -\pi/4$, $\theta_{21} = -\pi/2$, $\theta_{22} = 0$, and $\theta_{23} = 3\pi/4$. The standard deviations of the $g_{im}$ amplitude and phase are denoted as $\sigma_{im}$ and $1/\sqrt{2K^r_{im}}$, respectively. We suppose $\sigma_{im} = 0.05$, $\forall i, m$.
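One way to draw a channel coefficient under this model is sketched below. It assumes the high-Rice-factor approximation in which the amplitude is Gaussian around the LOS amplitude with standard deviation sigma and the phase is Gaussian around the LOS phase with standard deviation 1/sqrt(2 K_r), with K_r = r^2 / (2 sigma^2). The function names are hypothetical, and the sketch does not reproduce the paper's measured channels.

```python
import cmath
import math
import random


def rice_factor(los_amplitude, sigma):
    """Rice factor K_r = rho^2 / (2 sigma^2), with rho the LOS amplitude."""
    return los_amplitude ** 2 / (2.0 * sigma ** 2)


def sample_channel(los_amplitude, los_phase, sigma, rng=random):
    """Draw one complex channel coefficient g = r * exp(j * theta).

    Amplitude ~ N(los_amplitude, sigma), phase ~ N(los_phase, 1/sqrt(2 K_r)).
    """
    k_r = rice_factor(los_amplitude, sigma)
    amplitude = rng.gauss(los_amplitude, sigma)
    phase = rng.gauss(los_phase, 1.0 / math.sqrt(2.0 * k_r))
    return cmath.rect(amplitude, phase)
```

For the values used in this section (LOS amplitude 0.5 and sigma 0.05), the Rice factor is 50 and the phase standard deviation is 0.1 rad, i.e., twice the amplitude standard deviation.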
Using the fading channel model above, in Figure 3, we show how the structure of the neural network together with the learning rate can affect the performance of the DQN for a fixed number of training episodes (i.e., 40000). The performance of DQN is measured by the average number of time slots required to fully charge all harvesters.

Algorithm 1: DQN training process.
(1) Randomly generate the weight parameters $\theta$ for the eval_net. The target_net clones the weight parameters, $\theta' = \theta$. Set $u = 1$ and $s = s_0$.
(2) At the beginning of the time slot, randomly generate a probability $p \in [0, 1]$. If $D > 200$ and $p \geq \varepsilon_{ch}$, choose the action $a = \arg\max_{a \in A} Q(s, a)$. Else, if $p < \varepsilon_{ch}$, randomly choose the action from the action set $A$. The transmitter transmits with the selected beam pattern.
(3) Throughout the whole time slot, the RF energy is accumulated in the harvesters' energy buffers as $s_i' = s_i + \sum_{m=1}^{M} |g^t_{im}|^2 a_m T$, $\forall i \in \mathcal{K}$. At the end of each time slot, each harvester feeds its energy level back to the transmitter, and the system state is updated to $s'$.
(4) Store the experience $ep(d) = \{s, a, w(s, a, s'), s'\}$. After the experience pool accumulates enough data, randomly select $D_s$ experiences from the $D$ experiences to train the neural network eval_net. The backpropagation method is applied to minimize the loss function loss($\theta$). Clone the weight parameters from the eval_net to the target_net after several time intervals.

Figure 3 shows that if the deep Q-network has multiple hidden layers, a smaller learning rate is necessary to achieve better performance. When the learning rate is 0.1, the DQN with 4 hidden layers performs worse than a neural network with 2 or 3 hidden layers. On the other hand, when the learning rate decreases, we can see that the neural network with 4 hidden layers and a learning rate of 0.00005 achieves the best overall performance. We do not see a monotonic decrease in the average number of time slots because the stochastic nature of the channel causes some fluctuations in the DQN optimization. After an initial improvement, decreasing the learning rate results in a slight increase in the average number of charging steps for all three neural network structures. This is due to the fixed number of training episodes. As a result, for all the simulations presented in this section, we consider a DQN algorithm using a deep neural network with 4 hidden layers, 100 nodes in each layer, and a learning rate of 0.00005.
In Figure 4, we can observe that the size of the experience pool also affects the performance of DQN (40000 training episodes). To eliminate the correlation between the training data, we select part of the experience pool for training. In our simulation, this parameter, called mini-batch, is set to 10.
A larger experience pool contains more training data; hence, selecting the mini-batch from it for training can eliminate the correlation between the training data. However, we need to balance the size of the experience pool and the target_net weight replacement interval. If the experience pool is large but the replacement iteration interval is small, even if we address the correlation problem between the training data, the neural network does not have enough training episodes to reduce the training error before the weights of the target_net are replaced. From Figure 4, we can observe that a large replacement iteration interval may not be the best choice either. Therefore, we determine that, for our problem, DQN achieves the best performance when the size of the experience pool and the neural network replacement iteration interval are 60000 and 1000, respectively.

Figure 5 shows the impact of the reward function (see Section 3.1) on the DQN performance. In this figure, we consider the following reward functions: Reward 1: the buffer-level-based reward with unit price $\lambda$ defined in Section 3.1; Reward 2: $w(s, a, s') = 10$ if $s' = s_G$ and $w(s, a, s') = -1$ otherwise; Reward 3: $w(s, a, s') = 1$ if $s' = s_G$ and $w(s, a, s') = -1$ otherwise. Here, $K = 2$ and $\lambda = 0.25$. All three reward functions are designed to minimize the number of time slots required to fully charge all the harvesters. However, from Figure 5, we can observe that the best performance is obtained using Reward 1. In this case, the energy level accumulated by each harvester increases uniformly, which makes the DQN converge faster to the optimal policy. Both Reward 2 and Reward 3, instead, do not penalize states that unevenly charge the harvesters and therefore require more iterations to converge to the optimal solution (not shown in the figure) due to the large number of system states to explore.
Therefore, in the following simulations, we use the reward function Reward 1. In both Figures 5 and 6, we average the 40000 training steps every 100 steps in order to better show the convergence of the algorithm. Figure 6 shows that when each energy harvester in the system is equipped with a larger energy buffer, the number of system states increases, and therefore, DQN requires a longer training period to converge to the steady transmit strategy for each system state. We can observe that when $B_{\max} = 1.6$ mJ, the system needs fewer than 5000 training episodes to converge to the optimal strategy, and when $B_{\max} = 3.2$ mJ, the system needs around 12000 training episodes to converge to the optimal policy. However, for $B_{\max} = 4.8$ mJ, the system needs as many as 20000 training episodes to converge to the optimal strategy.
In the following simulations, we explore the impact of the channel model on optimization problem P 1 . For the Rician fading channel model, we consider K r im ≥ 10 and to be the same for all i, m. In this way, we can approximate the Rician distribution as a Gaussian distribution. We fix r im � 0.5, ∀i, m, but allow the standard deviation of both the amplitude and the phase of the channel to change to evaluate the performance on the system under different channel conditions. Since r im � 0.5 and ���� � 2K r im � r im /σ im , 1/ ���� � 2K r im � 2σ im . We define σ im ≤ 0.1 to guarantee K r im ≥ 10. In Figure 7, we express the standard deviation σ amp � σ im , ∀i, m of the phase and amplitude of the channel, and we compare the performance attained by the optimal policy with the performance of different other algorithms. e multiarmed bandit (MAB) algorithm is also implemented to compare with the DQN. In MAB, each bandit arm represents a particular transmission pattern. e upper confidence bound (UCB) algorithm [27] is implemented to     Wireless Power Transfer maximize the reward w(s, a, s ′ ) and determine the optimal action. Once the action is selected from the action space A, it will be used for transmission continuously. e myopic algorithm is another machine learning algorithm that can be compared with DQN. Myopic solution has the same structure as the DQN; however, the reward discount is defined as c � 0. As a result, the optimal strategy is determined only according to the current observation instead of considering the future consequence. Myopic solution has been widely used to solve the complex optimization in wireless communication problem and achieve good system performance [28]. Besides two machine learning algorithms, another two heuristic algorithms are also used for system performance comparison. For even power allocation, the transmit power P is evenly allocated on parallel channel for transmission. 
Random action selection is also applied for performance comparison. Random action selection has the worst performance, while DQN performs best. Compared to the best existing algorithm, the multiarmed bandit algorithm, DQN consumes 20% fewer time slots to complete charging. In some channel conditions, the myopic solution achieves a performance similar to that of the DQN. However, the myopic solution does not perform stably. For example, when the standard deviation of the channel amplitude is $\sigma_{amp} = 0.025$, DQN outperforms the myopic solution by 45%. The instability arises because the myopic solution makes its decision based only on the current system state and the current reward, without considering future consequences. Hence, the training effects cannot be guaranteed. Overall, the DQN is superior in both charging time consumption and performance stability across different channel conditions.
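The difference between the myopic solution and the DQN can be illustrated with the usual one-step Q-learning target. The sketch below is generic and assumes a negative per-slot reward and made-up next-state Q-values; the function name `td_target` and all numbers are hypothetical and do not come from the simulations.

```python
def td_target(reward, next_q_values, gamma, done=False):
    """One-step Q-learning target: r + gamma * max over a' of Q(s', a')."""
    if done:
        return reward                        # no future term at a terminal state
    return reward + gamma * max(next_q_values)

# Hypothetical Q(s', a') estimates for three actions in the next state.
next_q = [-12.0, -7.5, -9.0]

# Myopic solution: discount 0, so the target is just the immediate reward.
myopic = td_target(-1.0, next_q, gamma=0.0)

# DQN with a positive discount (0.9 is an illustrative value): the target
# also reflects how many slots are still expected after this one.
dqn = td_target(-1.0, next_q, gamma=0.9)

print(myopic, dqn)
```

With a zero discount, two actions that yield the same immediate reward are indistinguishable, even if one of them leaves the system in a state that takes much longer to finish charging.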
To better explain the performance of the optimal policy, in Figure 8 we plot the action selected by DQN at each particular system state when $\sigma_{amp} = 0.05$. When $\sigma_{amp} = 0.05$, the optimal action selected by the multiarmed bandit is the third action $a_3 = [2, 6, 4]^T$, which finishes charging both harvesters in around 60 time slots. Meanwhile, the optimal policy determined by DQN finishes charging in around 43 time slots. Figure 8 shows that the charging process can actually be divided into two parts: before harvester 1 accumulates 1.2 mJ of energy and harvester 2 accumulates 0.8 mJ of energy, action 4, $a_4 = [2, 8, 2]^T$, is mostly selected. After that, action 1, $a_1 = [2, 2, 8]^T$, is mostly selected. As defined above, when the Gaussian-distributed amplitude and phase of the channel have zero standard deviation, $g^0_1 = [0.05, 0.59, 0.11]^T$ and $g^0_2 = [0.04, 0.19, 0.51]^T$. So when both the amplitude and the phase of the channel change, the simplified channel state information is distributed around $g^0_1$ and $g^0_2$. As a result, it can be shown that a policy that selects either action 1 or action 4 with different probabilities can perform better than a policy that only selects action 3. Hence, the DQN can consume 40% fewer time slots to fully charge the two energy harvesters.
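A rough numerical check helps to see why actions 1 and 4 complement each other. The sketch below assumes, purely for illustration, that the energy harvested per slot by harvester $i$ scales with the inner product of its simplified channel vector $g^0_i$ and the action vector; the energy model actually used in this paper is the one defined in the system model and may differ. The helper `per_slot_gain` is hypothetical.

```python
import numpy as np

# Simplified channel vectors quoted in the text (zero standard deviation case).
g = np.array([[0.05, 0.59, 0.11],    # g_1^0, harvester 1
              [0.04, 0.19, 0.51]])   # g_2^0, harvester 2

# The three actions discussed in the text.
actions = {"a1": np.array([2, 2, 8]),
           "a3": np.array([2, 6, 4]),
           "a4": np.array([2, 8, 2])}

def per_slot_gain(action):
    """ASSUMPTION: harvested energy per slot is proportional to g_i . a."""
    return g @ action

gains = {name: per_slot_gain(a) for name, a in actions.items()}
for name, gain in gains.items():
    # One row per action: [relative gain at harvester 1, at harvester 2]
    print(name, np.round(gain, 2))
```

Under this assumption, action 4 gives harvester 1 its largest per-slot gain and action 1 gives harvester 2 its largest per-slot gain, while action 3 is a compromise between the two. This is consistent with the two-phase behavior observed in Figure 8.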
In Figure 9, the performance of the DQN is compared with the other four algorithms by varying the number of energy harvesters in the system. In general, as the number of energy harvesters increases, all algorithms consume more time slots to complete the wireless charging process. Compared to random action selection, DQN consumes at least 58% fewer time slots to complete the charging. The performance of the multiarmed bandit and the even power allocation is very similar, because the optimal action determined by the multiarmed bandit algorithm is close to the even power allocation strategy. Compared with the two fixed action selection strategies (multiarmed bandit and even power allocation), DQN reduces the time consumption by up to 72% (when the number of energy harvesters is $N = 3$). The myopic solution is still not the optimal strategy. From the figure, we can observe that the myopic solution outperforms the two fixed action selection algorithms. Even though under some conditions ($N = 6$) the performance difference between DQN and the myopic solution is very small, the myopic solution consumes on average more than 15% more time slots than DQN. Overall, the DQN is the optimal algorithm, consuming the fewest time slots to fully charge all the energy harvesters regardless of the number of energy harvesters.
In Figure 10, the number of transmit antennas is increased from $M = 3$ to $M = 4$. The number of energy harvesters varies from $N = 2$ to $N = 6$. Although the channel conditions between the transmitter and the energy harvesters become more complicated as the number of antennas increases, DQN still outperforms all the other four algorithms. Compared with the myopic solution, multiarmed bandit, even power allocation, and random action selection, DQN consumes up to 27%, 54%, 55%, and 76% fewer time slots to fulfill the wireless charging, respectively. As the number of energy harvesters increases, the superiority of the DQN over the two fixed action selection algorithms becomes more obvious, because selecting one fixed action is increasingly inefficient in a more complicated, varying channel environment. Even though under some conditions the performance of the myopic solution and DQN is similar, the myopic solution is not stable across different energy harvester conditions.

Conclusions
In this paper, we design the optimal wireless power transfer system for multiple RF energy harvesters. Deep learning methods are used to enable the wireless transmitter to fully charge the energy buffers of all energy harvesters in the shortest time while meeting the information rate requirement of the communication system.
As the channel conditions between the transmitter and the energy harvesters are time-varying and unknown, we model the problem as a Markov decision process. Due to the large number of system states in the model and the difficulty of training, we adopt a deep Q-network approach to find the best transmit strategy for each system state. In the simulation section, multiple experimental environments are explored. The measured real-time data are used to run the simulation. Deep Q-network is compared with the other four existing algorithms.
The simulation results validate that the deep Q-network is superior to all the other algorithms in terms of the time consumption for fulfilling wireless power transfer.

Data Availability
The simulation data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.