Data Transmission Evaluation and Allocation Mechanism of the Optimal Routing Path: An Asynchronous Advantage Actor-Critic (A3C) Approach

Delay tolerant networks (DTNs) have special features that distinguish them from traditional networks, and they frequently suffer disruptions during transmission. Many routing algorithms have been proposed for data transmission in DTNs, such as "Minimum Expected Delay," "Earliest Delivery," and "Epidemic," but none of them takes buffer management and memory usage into account. With the development of intelligent algorithms, Deep Reinforcement Learning (DRL) can adapt better to such network transmission. In this paper, we first build optimization models for different scenarios that jointly consider the behaviors and the buffers of the communication nodes, aiming to improve the data transmission process; we then apply the Deep Q-learning Network (DQN) and Asynchronous Advantage Actor-Critic (A3C) approaches in these scenarios to obtain optimal end-to-end paths for services and improve transmission performance. Finally, we compare the algorithms over different parameters and find that the models built for the different scenarios achieve a 30% reduction in end-to-end delay and an 80% improvement in throughput, which shows that our algorithms are effective and the results are reliable.


Introduction
The delay tolerant network (DTN), which has high delay and a low delivery rate, is a newly developing network framework aiming to realize interconnection and stable asynchronous data transmission in hybrid environments. DTN has a wide range of applications, such as sensor networks and mobile networks, and has attracted the attention and deep research of academia and industry.
Although DTN can be applied in many challenged scenarios, its reliability cannot be guaranteed because of the discontinuity and randomness of connections. Many scholars have therefore proposed routing algorithms based on "carry-store-forward" to improve transmission quality. These algorithms fall into two types of strategies. The first type achieves a better delivery rate through message copies; for example, the "Epidemic" algorithm forwards data by flooding, but too many copies of a message occupy much memory and increase network overhead. The second type forwards data selectively; for example, "First Contact" chooses end-to-end paths randomly and takes no account of prior data, while "Minimum Expected Delay" uses the Dijkstra algorithm to find the path of minimum delay, but it considers only limited prior knowledge and is not necessarily globally optimal. Although the above algorithms provide great convenience, they also increase the risk if the security of a Software Defined Network (SDN) is compromised, so a new authentication scheme called the hidden pattern (THP) was proposed, which combines a graphical password and a digital challenge value to prevent multiple types of authentication attacks at the same time [1]. DTN can complete message delivery in complex environments with frequent interruptions precisely because the nodes can store messages. However, the above routing algorithms do not consider memory management, so it is important to determine optimal end-to-end paths while effectively managing and using node capacity.
In this paper, we first study the DTN and formulate three different scenarios for when the communication links break down; then, we apply the DQN algorithm and the A3C algorithm to our proposed optimization models; finally, we compare the algorithms over different parameters and find that the models built for the different scenarios achieve a 30% reduction in end-to-end delay and an 80% improvement in throughput.
The main innovations of this paper are as follows: (i) we study various scenes and different node actions and build optimization models for different scenarios; (ii) we adopt the DQN algorithm and the A3C algorithm in the optimization models, with the aim of optimizing the throughput of the service data; (iii) we compare the DRL algorithms with other DTN routing algorithms over different parameters. The composition of this paper is as follows. Section 1 outlines the characteristics of DTN and the related work. The optimization models built over different scenarios can be found in Section 2.
Section 3 states the procedure and structure of the DQN algorithm and the A3C algorithm. Section 4 gives the simulation topology and parameters. Section 5 shows the performance of the algorithms over different simulation parameters and gives the analysis results. Finally, Section 6 states the conclusions and future improvements.

The Outline of DTN
Owing to the Bundle Layer in DTN, it can implement store-and-forward message switching and the custody transfer service. These two functions are described in detail below.
When forwarding messages from the source node to the destination node in a TCP/IP network, the messages query routes to find paths through the relay nodes and cannot be stored permanently, because this network relies on continuous connections to complete the transmission. In DTN networks, however, a node can store messages for a period of time and move while carrying them, forwarding them as message bundles when it meets an appropriate node, as shown in Figure 1.
After a node sends messages to the next relay node in the form of bundles, if it has not received receipt confirmation from the next node, it will choose an appropriate time to forward the bundle again. As shown in Figure 2, the relay node returns a receipt to the previous node, and when the relay node forwards messages to the next hop, the next relay node also sends a receipt back; this procedure continues until the destination node receives the bundle [2]. The purpose of custody transfer is to increase the reliability of data transmission: a node deletes a message only when it receives the receipt from the next hop, when the message expires, or when its memory is full.
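The custody-transfer behavior just described, keeping the bundle and retransmitting until a receipt arrives from the next hop, can be sketched as follows; this is a minimal illustration, and the callback interface and retry limit are our own assumptions, not part of the DTN specification.

```python
def custody_forward(send_bundle, max_tries=5):
    """Keep custody of a bundle and re-forward it until the next hop
    returns a receipt; `send_bundle(attempt)` returns True on receipt.
    The bundle is released only on receipt (shown here); otherwise it is
    deleted only on expiry or when the node's memory is full."""
    for attempt in range(1, max_tries + 1):
        if send_bundle(attempt):
            return attempt  # receipt received: custody transferred
    return None  # no receipt yet: the node keeps carrying the bundle

# Example: the receipt arrives on the third forwarding attempt.
print(custody_forward(lambda attempt: attempt == 3))  # 3
```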
Ensuring that DTN completes service data transmission is important, so scholars have studied and improved routing algorithms for specific scenarios and proposed many routing algorithms [3].
Depending on whether infrastructure is required in the process of data forwarding, DTN routing algorithms are divided into infrastructure-aided algorithms and non-infrastructure-aided algorithms, as shown in Figure 3.
To address the above problems, some recent studies [4][5][6] have proposed efficient cooperative caching schemes, in which data is cached at proper nodes or router nodes with limited sizes, but these approaches need a long time and large memory to broadcast the services. In [7], a joint optimization framework for caching, computation, and security of delay-tolerant data in M2M communication networks was proposed, adopting a deep Q-network (DQN) in the model. In [8], a shortest weighted path-finding problem is formulated to identify the optimal route for secure data delivery between a source-destination pair, which can be solved with Dijkstra's or the Bellman-Ford algorithm.
In [9], the reliability of travel time is used as the weight for path selection, and solving with the Dijkstra algorithm reflects actual vehicle path selection more accurately; this method is a beneficial improvement to the static path selection problem. A dynamic routing algorithm based on energy-efficient relay selection (RS), referred to as DRA-EERS, is proposed in [10] to adapt to the higher dynamics of time-varying software-defined wireless sensor networks. In [11], a solution to the data advertising problem based on random linear network coding was provided; the simulation results show that the proposed approach is highly scalable and can significantly decrease the time for advertisement message delivery. A routing architecture and algorithm based on deep neural networks was proposed in [12], which can help routers make packet forwarding decisions based on the current conditions of their surroundings. A limited-copy algorithm, MPWLC, based on service probability was provided in [13]; not only is the number of copies limited, but the storage resources of the satellite are also taken into account, and the simulation results show that the proposed algorithm can effectively improve the efficiency of the network and ensure reliable data transmission. In [14], a mathematical framework for DTN is introduced and applied to a space network simulated using an orbital analysis toolkit. In [15], the problem of autonomously avoiding memory overflows in a delay tolerant node was considered, and reinforcement learning was proposed to automate buffer management, given that the relative rates of data coming in and out of the DTN node can easily be measured.

Scenarios and System Model
Definition 1 (connected directed graph). We use G = (V, E) to denote the connected graph if (i) G is a directed graph and (ii) whenever a connection exists between node v_i ∈ V and node v_j ∈ V, there is an edge e_{i,j} ∈ E. The connected directed graph can be seen in Figure 4: the communication nodes are responsible for forwarding the messages, the schedule nodes are responsible for scheduling the service data, and the connections among the nodes are affected by the actual environment.
Assume the graph has N nodes, M links, and K services. The communication nodes and schedule nodes are collected in V = {v_i, i = 1, 2, ⋯, N}, and the broadband links and narrow-band links are collected in E = {e_j, j = 1, 2, ⋯, M}. The service data S = {s_k, k = 1, 2, ⋯, K} are transmitted from the initial node v_s to the end node v_d; the end-to-end paths are expressed by P = {p_k, k = 1, 2, ⋯, K} = {V_{p_k}, E_{p_k}}, and the time slots are expressed by T = {t, t = 1, 2, ⋯}.
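As a concrete illustration of this notation, a shortest end-to-end path p_k in the directed graph G = (V, E) can be computed with Dijkstra's algorithm; the toy topology and delay weights below are assumptions for the example, not the paper's simulation topology.

```python
import heapq

def dijkstra(graph, src, dst):
    """Shortest end-to-end path p_k in a directed graph G = (V, E).

    graph: {node: [(neighbor, delay), ...]} with illustrative delay weights.
    Returns (total_delay, path); (inf, []) if dst is unreachable.
    """
    dist = {src: 0.0}
    prev = {}
    pq = [(0.0, src)]
    seen = set()
    while pq:
        d, v = heapq.heappop(pq)
        if v in seen:
            continue
        seen.add(v)
        if v == dst:
            break
        for nxt, w in graph.get(v, []):
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = v
                heapq.heappush(pq, (nd, nxt))
    if dst not in dist:
        return float("inf"), []
    path, v = [dst], dst
    while v != src:
        v = prev[v]
        path.append(v)
    return dist[dst], path[::-1]

# Toy graph: A->B->D costs 4.0, the detour A->C->D costs 6.0.
G = {"A": [("B", 2.0), ("C", 1.0)], "B": [("D", 2.0)], "C": [("D", 5.0)]}
print(dijkstra(G, "A", "D"))  # (4.0, ['A', 'B', 'D'])
```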
Because of the exceptional application environments of DTN, which often involve long transmission delays and uncertain end-to-end paths, we study the "carry-store-forward" and "custody transfer" mechanisms and build optimization models for the following scenarios.
During service data communication, assume all service data fragments start along the shortest end-to-end path p_k. Each source node of a service can send s_η fragments altogether, but can deliver only one fragment f_k(t) in time slot t, while simultaneously tracking the transmission status of the fragments delivered earlier. I_{f_k}(t) indicates the connection status from a node to its next-hop node in time slot t.

Wireless Communications and Mobile Computing
Definition 2 (only-consider-cache scenario). We define the scenario that only considers caching, shown in Figure 5, as follows: (i) all fragments choose the shortest paths when sent from the source node; (ii) if the fragments encounter an interruption, they only store at the interrupted nodes and wait for the nodes to return to normal. In this scenario, every fragment f_k(t) simply stores at the interrupted node, but the cached data adds the node cache processing delay α_f(t) and the link-interrupt waiting delay β_f(t); otherwise, when no communication link is interrupted (I_{f_k}(t) = 0), the fragment only incurs the delivery delay γ_e on the link. This process iterates until all fragments reach the destination, at which point the throughput is calculated. Suppose the total delay used to complete the service transmission is λ_1, m_k is the interrupted node in the shortest path p_k, the total length of the fragments that reach the terminal nodes when all services complete transmission is ω_1, and the throughput is ε_1. In the resulting optimization model, D_{v_i} denotes the total delay when the fragments are stored in the nodes in (1), D_{e_j} denotes the transmission delay on the path in (2), (3) states that the sum of the processing delay D_{v_i} and the transmission delay D_{e_j} cannot exceed the bound s_k^R, (4) states that the bandwidth should be sufficient for the data transmission, and (5) gives the maximum caching space of every node, so the total cached fragments cannot exceed that value.
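A sketch of how λ_1 accumulates in this scenario: each hop contributes its delivery delay γ_e, and a hop whose outgoing link is interrupted additionally contributes the cache processing delay α_f(t) and the waiting delay β_f(t). The per-hop values below are illustrative, not taken from the paper.

```python
def cache_scenario_delay(path_delays, interrupts):
    """Total delay for one fragment in the only-consider-cache scenario.

    path_delays: per-hop delivery delays gamma_e along the shortest path.
    interrupts: {hop_index: (alpha, beta)} mapping each interrupted hop
        (I_f(t) = 1) to its cache processing delay alpha_f(t) and
        link-interrupt waiting delay beta_f(t).
    """
    total = 0.0
    for j, gamma in enumerate(path_delays):
        if j in interrupts:
            alpha, beta = interrupts[j]
            total += alpha + beta  # fragment stored at the interrupted node
        total += gamma             # delivery delay on the link
    return total

# Three hops; the second hop's link is interrupted (alpha=0.5, beta=3.0).
print(cache_scenario_delay([1.0, 2.0, 1.0], {1: (0.5, 3.0)}))  # 7.5
```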
Definition 3 (only-consider-detour scenario). We define the scenario that only considers choosing a detour path, shown in Figure 6, as follows: (i) all fragments choose the shortest paths when sent from the source node; (ii) if the fragments encounter an interruption, they only choose other available paths that share more identical nodes with the initial shortest path, rather than storing and waiting for the nodes to return to normal. In this scenario, every fragment f_k(t) simply chooses another detour path p_s at the interrupted nodes (I_{f_k}(t) = 1), the shortest path being p_k; otherwise, when the links are connected (I_{f_k}(t) = 0), the fragment is transmitted along the initial shortest path and only incurs the delivery delay γ_{e_j}. The source nodes must continuously monitor the transmission until all services complete data delivery. We assume that the fragments run into an interruption at the u_k-th node; the transmission delay of every service on the alternate path is D_s(f_k^R(t)). Assume the total delay of all services is λ_2.
When all the services complete transmission, the total length of the fragments that reach the terminal nodes is ω_2 as in formula (3), and the throughput is ε_2, so the optimization model follows.
(Figure 6: only-consider-detour scenario — with the end-to-end path A->B->D and the link between node A and node B interrupted, node A chooses node C to forward the information on to node D instead of waiting for the link to reconnect.)
(Figure 5: only-consider-cache scenario — with the link between node A and node B interrupted, node A keeps the information and waits for the link to reconnect before forwarding it to node D.)

Here, (1) states the transmission delay D_k^R on the initial shortest path, D_s^R denotes the transmission delay of every service on the alternate path, (2) states that the total transmission of the entire path cannot exceed the target value s_k^R, and (3) and (4) state that the bandwidth of the entire path should be sufficient for the data transmission.
Definition 4 (comprehensive scenario). We define the comprehensive end-to-end-path scenario, shown in Figure 7, as follows: (i) all fragments choose the shortest paths when sent from the source node; (ii) if the fragments encounter an interruption, they jointly consider choosing other available paths that share more identical nodes with the initial shortest path or storing at the interrupted nodes, and finally choose the option with the minimum end-to-end delay after comparing the two. In this scenario, every fragment f_k(t) takes both storage and detour paths into account; U_f(t) indicates the choice of the fragment. If the fragment chooses to wait at the node (U_f(t) = 1), this generates the waiting and transmitting delay λ_1; if the fragment chooses a detour path (U_{f_k}(t) = 0), trying to ensure that the available paths share more identical nodes with the initial shortest path, the total delivery delay is λ_2 = ∑_{k=1}^{K} ∑_{l=1}^{η} ∑_{j=1}^{m} (I_f(t)(j_k − 1) · γ_{e_j} + D_s^R f(t)). The fragment compares the above delays and chooses the path with the minimum delay. The source nodes must continuously monitor the transmission until all services complete data delivery. Assume the total delay of all services is λ_3.
When all the services complete transmission, the total length of the fragments that reach the terminal nodes is as in formula (3), so the optimization model follows.
(Figure 7: comprehensive scenario — with the link between node A and node B interrupted, node A compares the end-to-end delays of forwarding via node B or node C and chooses the path with the minimum end-to-end delay.)
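The decision U_f(t) in the comprehensive scenario reduces to comparing the two candidate delays per fragment, which can be sketched as follows (the function and variable names are ours):

```python
def comprehensive_choice(wait_delay, detour_delay):
    """U_f(t) = 1: store at the interrupted node (delay lambda_1);
    U_f(t) = 0: take the detour path (delay lambda_2).
    Returns (U_f, chosen_delay) for the minimum end-to-end delay."""
    if wait_delay <= detour_delay:
        return 1, wait_delay
    return 0, detour_delay

# Waiting would cost 12.0, detouring 9.5: the fragment detours.
print(comprehensive_choice(12.0, 9.5))  # (0, 9.5)
```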

Input: the initial locations of the service nodes
Output: the optimal path and throughput of every service
1 Initialize the topology of all the nodes and the start and end nodes of every service;
2 Initialize thread step counter t ← 1;
3 while T ≤ T_max do
4   Reset gradients: dθ ← 0 and dθ_v ← 0;
5   Synchronize thread-specific parameters θ′ = θ and t_start = t;
6   Get state s_t, i.e., the start node of every service;
7   while s_t is not the end node of every service and t − t_start ≠ t_max do
8     Perform a_t, the next hop, according to policy π(a_t | s_t; θ′);
9     If all the constraints in the models are satisfied, then, in consideration of the order in which the fragments stay at the interrupted node and according to the choice of every fragment (continue to store at the node or choose another detour path, which builds the different scenarios), obtain the next state s_{t+1} and reward r_t;
10  end
11  Perform asynchronous update of θ using dθ and of θ_v using dθ_v;
12 end
Algorithm 1: Data transmission evaluation of the optimal routing path with the A3C algorithm.
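The asynchronous update pattern in steps 4, 5, and 11 — each worker synchronizes θ′ = θ, accumulates gradients locally, and applies them to the shared parameters — can be sketched as below; the two-parameter "network" and constant toy gradients stand in for the real neural networks.

```python
import threading

class GlobalNet:
    """Shared (global) parameters theta and theta_v updated by all workers."""
    def __init__(self):
        self.theta = [0.0, 0.0]   # policy parameters (toy)
        self.theta_v = [0.0]      # value-function parameters (toy)
        self.lock = threading.Lock()

    def apply(self, d_theta, d_theta_v, lr=0.01):
        with self.lock:  # asynchronous but atomic gradient application
            self.theta = [p - lr * g for p, g in zip(self.theta, d_theta)]
            self.theta_v = [p - lr * g for p, g in zip(self.theta_v, d_theta_v)]

def worker(net, steps):
    for _ in range(steps):
        theta_local = list(net.theta)         # synchronize theta' = theta
        d_theta = [1.0 for _ in theta_local]  # toy accumulated gradient
        d_theta_v = [1.0]
        net.apply(d_theta, d_theta_v)         # update the global net

net = GlobalNet()
threads = [threading.Thread(target=worker, args=(net, 100)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# 4 workers x 100 steps, each subtracting lr * 1.0 = 0.01 from every parameter.
print(net.theta)
```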

Here, D_{v_i} denotes the total delay when the fragments are stored in the nodes in (1), D_{e_j} denotes the transmission delay on the path in (2), D_s indicates the transmission delay of the alternate path in (3), where the total transmission of the entire path cannot exceed the target value s_k^R, (4) states that the sum of the processing delay D_{v_i} and the transmission delay D_{e_j} cannot exceed the bound s_k^R, (5) and (6) state that the bandwidth of the entire path should be sufficient for the data transmission with B_k = min B_{v_i}, ∀v_i ∈ V_{p_k}, and in (7), C_{v_i} denotes the maximum caching space of every node, so the total cached fragments cannot exceed that value.

DRL Algorithm Procedure and Structure
Deep Reinforcement Learning (DRL) is a machine learning method that uses environment feedback as input and learns a mapping from environment states to behavior; reinforcement learning (RL) maximizes the cumulative return of system behavior from the environment and mainly consists of agents and the external environment. Traditional reinforcement learning has a bottleneck: it uses a table to save every state and the Q value of every action in that state [16]. The deep Q-learning network (DQN) adopts neural networks to solve this problem, taking the state and action as the input of a neural network; it obtains the Q value through the neural network instead of the table, reducing memory consumption. DQN uses experience replay to avoid correlation among data samples, but every interaction of the agent with the environment requires large memory and computing power, and experience replay can only generate data from the old policy. A3C therefore uses CPU multithreading to realize parallel actor-learners for multiagent instances, with each thread corresponding to a different exploration strategy. This parallelization decorrelates the data and replaces experience replay, saving storage cost.
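The experience replay mechanism that A3C's parallelism replaces can be sketched as a fixed-capacity transition buffer; the capacity of 2000 matches the simulation setting given later in the paper, while the interface itself is our own sketch.

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay as used by DQN: store transitions and sample a
    decorrelated random minibatch for training."""
    def __init__(self, capacity=2000):
        self.buf = deque(maxlen=capacity)  # oldest transitions fall out

    def store(self, state, action, reward, next_state):
        self.buf.append((state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(list(self.buf), min(batch_size, len(self.buf)))

buf = ReplayBuffer()
for i in range(10):
    buf.store(i, "next_hop", 1.0, i + 1)
print(len(buf.sample(4)))  # 4
```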

DQN Algorithm.
Deep learning has proved to be a powerful tool for solving nonconvex, high-complexity problems and has been widely used in many areas. Reinforcement learning pays more attention to the maximal reward over a long period of time, obtained by interacting with the environment and carrying out the optimal action. Deep Q-learning adopts a deep neural network to develop an action plan and behaves well when dealing with dynamic time-varying environments, so DQN provides a promising technique for data transmission in delay tolerant networks.
With regard to reinforcement learning, the agent interacts with the environment: it inspects the environment to obtain the state s(t) and then takes action a(t) based on the state at time slot t. Next, the external environment observes the action taken by the agent and delivers the latest state s(t + 1) and the reward r(t) to the agent. This process aims at finding the maximal reward value via the optimal policy π*. DQN uses neural networks to approximate the value function Q(s, a).
Input: the initial locations of the service nodes
Output: the optimal path and throughput of every service
1 Initialize replay memory D to capacity N;
2 Initialize action-value function Q with random weights θ;
3 Initialize target action-value function Q̂ with weights θ⁻ = θ;
4 for episode = 1 to N do
5   Initialize the topology of all the nodes;
6   Get the initial state X_k of all nodes and the distances between nodes;
7   Set sequence s_1 ← e_1 and preprocess φ_1 ← φ(s_1);
8   for t = 1 to T do
9     With probability ε select a random action a_t for every node i; otherwise select a_t = argmax_a Q(φ(s_t), a; θ);
10    Execute action a_t in the emulator and observe reward r_t;
11    If all the constraints in the models are satisfied, then, in consideration of the order in which the fragments stay at the interrupted node and according to the choice of every fragment (continue to store at the node or choose another detour path, which builds the different scenarios), set s_{t+1} ← (s_t, a_t, e_{t+1}) and preprocess φ_{t+1} = φ(s_{t+1}); otherwise go back to step 7;
12    Store transition (φ_t, a_t, r_t, φ_{t+1}) in D;
13    Sample a random minibatch of transitions (φ_j, a_j, r_j, φ_{j+1}) from D;
14    Set y_j = r_j if the episode stops at step j + 1; otherwise y_j = r_j + γ max_{a′} Q̂(φ_{j+1}, a′; θ⁻);
15    Perform a gradient descent step on (y_j − Q(φ_j, a_j; θ))² with respect to the network parameters θ;
16    Every C steps reset Q̂ = Q;
17  end
18 end
Algorithm 2: Data transmission evaluation of the optimal routing path with the DQN algorithm.
The reward value Q*(s, a) can be obtained according to the Bellman optimality equation Q*(s, a) = E[r + γ max_{a′} Q*(s′, a′) | s, a], in which the reward r is computed based on the state s and the action a, γ represents the discount factor that weights the future impact on the present calculation, and E[·] denotes the expectation. Hence, DQN chooses the action that maximizes the Q value.
The Q value is updated at every step in DQN as follows: Q(s, a) ← Q(s, a) + σ[r + γ max_{a′} Q(s′, a′) − Q(s, a)], in which σ denotes the learning rate and should be in the range [0, 1]; as the learning rate increases, the influence of the past on the present becomes smaller and smaller. The process of DQN is shown in Figure 8.
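The tabular form of this update rule, from which DQN's neural approximation departs, can be written directly; the state and action names below are illustrative.

```python
def q_update(Q, s, a, r, s_next, sigma=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + sigma * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[s_next].values()) if Q.get(s_next) else 0.0
    Q[s][a] += sigma * (r + gamma * best_next - Q[s][a])
    return Q[s][a]

Q = {"s1": {"a": 0.0}, "s2": {"a": 1.0}}
# 0.1 * (1.0 + 0.9 * 1.0 - 0.0), i.e. approximately 0.19
print(q_update(Q, "s1", "a", r=1.0, s_next="s2"))
```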
In DQN, <A, S, R, P> is the typical quadruple [17], in which the action set A contains the actions taken by the agent, the state set S contains the states observed from the environment, the reward set R contains the reward values, and P denotes the deep learning model over the probability state space of agent learning. Based on this quadruple, the specific definition of DQN is as follows: (1) State Space. S = (X_1, X_2, ⋯, X_K), a vector denoting the location of the source node of the k-th service; the vector has K dimensions, and only one dimension is 1. (2) Action Space. A = (A_1, A_2, ⋯, A_K), a vector denoting the nodes to which the k-th service node can be connected; the vector has K dimensions, and only one dimension is 1.

(3) System Reward. After each time slot t, the system gets the immediate reward r(t) based on the action a(t) taken. In our paper, we define the reward in terms of the cost of the current node to other nodes: the reward r(t) is smaller if the distance between the nodes is longer; otherwise, the reward is bigger. With the above analysis, the DQN procedure for the optimal path and throughput is shown in Algorithm 2.
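A minimal encoding of this state/action space and reward: the one-hot vectors follow the definitions above, while the 1/cost reward form is our assumption, since the paper only specifies that r(t) decreases as the inter-node distance grows.

```python
def one_hot(index, dims):
    """State/action vector with `dims` dimensions, only one of which is 1."""
    v = [0] * dims
    v[index] = 1
    return v

def reward(cost):
    """Illustrative r(t): smaller when the distance (cost) between nodes
    is longer, bigger when shorter. The exact 1/cost form is assumed."""
    return 1.0 / cost

print(one_hot(2, 5))                  # [0, 0, 1, 0, 0]
print(reward(4.0) > reward(8.0))      # True: the shorter hop earns more
```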

A3C Algorithm.
A3C uses multithreading: it interacts with the environment in multiple threads at the same time, and each thread summarizes its learning results to the global net. In addition, each thread regularly takes back the results of the common learning from the global net to guide the learning interaction between the thread and the environment. Through this method, A3C avoids the strong correlation of experience replay and achieves an asynchronous, concurrent learning model.
The A3C algorithm is based on the actor-critic architecture, consisting of the value function V(s_t; θ_v) and the policy function π(a | s_t; θ); it does not use traditional Monte Carlo updates that wait until the end of an episode, but uses temporal-difference learning to update the parameters at each step. The actor-critic has two networks: the actor network is responsible for choosing actions according to the policy π(a | s_t; θ), and the critic network is responsible for evaluating each action from the actor network. After the actor network obtains the score of an action, it optimizes the policy to get the maximal reward over the algorithm executions. The critic network minimizes the squared temporal-difference error (R − V(s_t; θ_v))², in which R denotes the reward of taking the action a. By calculating the gradient ∇_θ log π(a | s_t; θ)A(s, a), the actor network updates the parameter θ. A3C defines a new function called the advantage function, A(s, a) = R − V(s_t; θ_v). It expresses that if the chosen action is better than the average, the advantage function is positive; otherwise, it is negative. Figure 9 shows the process of A3C: it has one global network, which includes the functions of the actor network and the critic network, and n workers; each worker has the same network structure as the global neural network, and each worker interacts with the environment independently to get experience data. These workers do not interfere with each other and run independently.
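The advantage A(s, a) = R − V(s_t; θ_v) and the actor's log-probability gradient can be sketched for a softmax policy over discrete next-hop preferences; the softmax parameterization is an assumption for the example, not taken from the paper.

```python
import math

def softmax(prefs):
    """Policy pi over discrete next hops from preference scores."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def actor_grad(prefs, a, advantage):
    """grad_theta log pi(a|s) * A(s,a) for a softmax policy:
    d log pi(a) / d pref_i = 1[i == a] - pi_i, scaled by the advantage.
    A positive advantage pushes probability toward the chosen action."""
    pi = softmax(prefs)
    return [((1.0 if i == a else 0.0) - pi[i]) * advantage
            for i in range(len(prefs))]

# Two equally preferred next hops; the chosen one beat the baseline (A > 0),
# so its preference is pushed up and the other's down.
print(actor_grad([0.0, 0.0], a=0, advantage=1.0))  # [0.5, -0.5]
```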
After each worker interacts with the environment for a certain amount of data, it calculates the gradient of the neural network loss function in its own thread; these gradients do not update the neural network in its own thread but update the global neural network. In other words, the n workers independently use their accumulated gradients to update the shared parameters of the neural network model. Every so often, a thread sets the parameters of its own neural network to the parameters of the common neural network and then uses them to guide the subsequent environment interaction.
The specific description of A3C algorithm is presented in Algorithm 1.

Simulation Parameters.
The simulation scenario is shown in Figure 10. We assume two services are sent in total from different source nodes, and every service sends hundreds of data segments. The topology has both broadband links and narrow-band links, so the two services have different shortest paths; when the service segments encounter a disruption during service data transmission, they can choose to cache at the interrupted nodes or take other detour paths to reach the target nodes. The purple and yellow links denote the two services' end-to-end paths, respectively; the red links denote interruptions in the shortest end-to-end path of each service.
In the simulation, we assume the transmission rate of service data is the same at every source node, while the data-segment cache size differs among the nodes. The transmission rate of service segments is the constant value 3 × 10⁴ bps.
In the above topology, the source nodes may send many fragments, which can cause blocking at the interrupted node when too many fragments are stored there, making the DQN algorithm consume more time analyzing the queueing problem at the node. If the simulation topology is more complicated, the DQN algorithm needs more time to train and determine the end-to-end paths of the services. In the DQN simulation, we therefore assume the DQN network consists of three layers of neurons, the learning rate of DQN is 0.001, the discount factor used to calculate the reward is 0.09 [18], and the size of the memory pool used for experience replay is 2000. Other specific parameters are given in Table 1; based on the above settings, the DQN algorithm can find the optimal path of the services.
For the A3C algorithm, the simulation environment is the same in every worker, each of which is an independent kernel of the computer; we set the learning rates of the actor and the critic to 0.001 and the discount factor used to calculate the reward to 0.09; other specific parameters are given in Table 1. Based on these parameters and the special structure of the algorithm, A3C has higher execution speed and performance than the DQN algorithm.
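The hyperparameters stated above can be collected in one place; the values come from the text, and anything the text does not state (e.g., the number of neurons per layer) is deliberately left out.

```python
# Simulation hyperparameters as stated in the text (Table 1 not reproduced).
DQN_PARAMS = {
    "hidden_layers": 3,          # three layers of neurons
    "learning_rate": 0.001,
    "discount_factor": 0.09,
    "replay_memory_size": 2000,  # experience replay pool size
}
A3C_PARAMS = {
    "actor_learning_rate": 0.001,
    "critic_learning_rate": 0.001,
    "discount_factor": 0.09,
}
SEGMENT_RATE_BPS = 3e4  # constant transmission rate of service segments

print(DQN_PARAMS["replay_memory_size"])  # 2000
```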

We will compare the performance of the algorithms in the following aspects; the simulation parameters are shown in Table 1.
The delivery rate can be expressed as the ratio of the number of fragments that reach the destination nodes to the number of fragments sent from the source nodes: R_d = ∑_{k=1}^{K} Ld_k / Ls, in which Ld_k denotes the number of fragments that reach the destination node of service k and Ls denotes the total number of fragments sent from the source nodes.
The end-to-end delay can be represented as the delay from the time the source nodes start to send fragments to the time the last fragment reaches the destination: De = t_d − t_s, in which t_d is the time the last fragment reaches the destination and t_s is the time the service data starts to transmit. The throughput of a service can be expressed as the amount of service data successfully transmitted from the source node to the destination node per unit time: ∑_{k=1}^{K} Ld_k / De, in which ∑_{k=1}^{K} Ld_k is the total service data reaching the destination nodes and De is the end-to-end delay.
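These three metrics follow directly from their definitions; the fragment counts and times below are invented example values.

```python
def delivery_rate(Ld, Ls):
    """sum_k Ld_k / Ls: fraction of sent fragments that arrived."""
    return sum(Ld) / Ls

def end_to_end_delay(t_s, t_d):
    """De = t_d - t_s: first send time to last arrival time."""
    return t_d - t_s

def throughput(Ld, De):
    """sum_k Ld_k / De: delivered service data per unit time."""
    return sum(Ld) / De

Ld = [80, 70]  # fragments of services 1 and 2 that reached the destination
print(delivery_rate(Ld, Ls=200))                 # 0.75
print(throughput(Ld, end_to_end_delay(0, 50)))   # 3.0
```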
The equilibrium of nodes is the average number of services carried by each node: ∑_i N_i / P, in which N_i is the number of services carried by the i-th node and P is the total number of nodes in the topology.

The equilibrium of links is the average number of services carried by each link: ∑_i L_i / Q, in which L_i is the number of services carried by the i-th link and Q is the total number of links in the topology.
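Both equilibrium metrics are the same plain average over the topology; the per-element counts below are invented for the example.

```python
def equilibrium(carried, total_elements):
    """Average service carrying per element: sum N_i / P for nodes,
    or sum L_i / Q for links, depending on what `carried` lists."""
    return sum(carried) / total_elements

# Four nodes carrying 2, 1, 3, and 2 services respectively.
print(equilibrium([2, 1, 3, 2], 4))  # 2.0
```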

Simulation Results
6.1. Simulation Algorithms. The "Epidemic" algorithm belongs to spread routing; it forwards data by flooding, that is, all nodes encountered are "infected" with the message [19], so there are many copies of the message in the network, which occupy much memory and increase the network overhead.
In circumstances with sufficient network resources, however, the Epidemic algorithm shows a faster delivery rate and is the preferred algorithm. If there are too many messages to transmit, some messages may be discarded, resulting in a higher packet loss rate.
The ED algorithm does not take the queueing problem into consideration, and the routing path is determined when the source nodes send the service data, so the ED algorithm belongs to source routing. When there are more fragments and queueing, however, the computation of the weights in the ED algorithm is affected; thus, the computation of the source route produces errors and cannot yield the optimal path.
The MED algorithm takes the transmission delay, the propagation delay, and the average waiting delay into consideration; its goal is to find the path of minimum delay, and the path adopted is identical whenever the source and destination nodes are the same. After the source-routing path is determined, this algorithm will not change the routing choice even if a better choice emerges [20], so it is only the optimal path over the limited prior knowledge and not necessarily globally optimal; hence, the MED algorithm is a "time-invariant" algorithm. Owing to the different operating mechanisms of the algorithms, they choose different paths when encountering interrupted nodes. We can see that only the DQN algorithm and the A3C algorithm can change the route when facing different scenarios, and they tend to choose the paths with the minimum end-to-end delay. In the comprehensive scenario, the DQN algorithm and the A3C algorithm compare the end-to-end delays of the above scenarios and obtain the optimal path among the three scenarios.

6.2.2. Comparison of Delivery Rate.
In this paper, we broadcast 2 services in this topology and record the delivery rate under different link-break delays for each algorithm, assuming the minimum required delivery rate is 0.75. As shown in Figure 14, the delivery rate decreases for the majority of algorithms as the link-break delay increases, but in the only-consider-detour scenario and the comprehensive scenario of DQN, the delivery rate remains unchanged and is the maximum, because in these scenarios the source nodes choose the detour path and are not affected by the interrupted links, which raises the delivery rate. From the results of all algorithms shown in Table 2, we find that the DQN algorithm has a higher delivery rate in the majority of circumstances, which demonstrates the delivery-rate improvement of our proposed models and the DQN algorithm. The A3C algorithm has the highest delivery rate in every scenario, because A3C has many subthreads that can find the optimal paths in a very short time. Both the A3C and DQN algorithms satisfy the delivery-rate constraint in most cases.
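The delivery-rate constraint above amounts to a simple feasibility check: the fraction of fragments delivered must meet the 0.75 threshold. The counts below are hypothetical and serve only to illustrate the computation.

```python
def delivery_rate(delivered, sent):
    """Fraction of sent service fragments that reach the destination."""
    return delivered / sent if sent else 0.0

MIN_DELIVERY_RATE = 0.75  # minimum delivery rate assumed in the experiments

# Hypothetical (delivered, sent) counts for the two broadcast services.
results = {"service1": (9, 10), "service2": (8, 10)}
rates = {s: delivery_rate(d, n) for s, (d, n) in results.items()}
feasible = all(r >= MIN_DELIVERY_RATE for r in rates.values())
```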

6.2.3. Comparison of End-to-End Delay.
In a DTN, due to the particularity of data connections, a link between two nodes may be open, so data have to be stored at a node while waiting for the link to be reconnected. However, the storage space at a node is limited: when greedy algorithms are adopted, storing multiple data items may exhaust the node space, so that data arriving afterwards are lost.

Table 3: End-to-end delay (ms) under different interrupted-link delays.
Interrupted delay 30: 110, 100, 100, 80, 80, 80, 80
Interrupted delay 40: 110, 130, 110, 100, 80, 80, 80
Interrupted delay 50: 110, 140, 110, 120, 80, 80, 80
Interrupted delay 60: 110, 150, 110, 120, 80, 80, 80
Interrupted delay 70: 110, 160, 110, 140, 80, 80
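The limited node storage described above can be sketched as a bounded FIFO buffer that drops arrivals once full; this is a hypothetical illustration of the data-loss mechanism, not the paper's simulator.

```python
from collections import deque

class NodeBuffer:
    """FIFO buffer with finite capacity; arrivals are dropped when full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()
        self.dropped = 0

    def store(self, fragment):
        if len(self.queue) < self.capacity:
            self.queue.append(fragment)
            return True
        self.dropped += 1  # data loss once the node space is exhausted
        return False

    def forward(self):
        # Release the oldest stored fragment once the link reconnects.
        return self.queue.popleft() if self.queue else None

buf = NodeBuffer(capacity=2)
accepted = [buf.store(f) for f in ("f1", "f2", "f3")]  # "f3" is dropped
```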

Wireless Communications and Mobile Computing
The waiting delay at a node is reflected in the total end-to-end delay of data transmission, which includes not only the transmission delay on the links but also the waiting delay at the nodes. When Epidemic, ED, and similar algorithms are used, the algorithms copy multiple replicas of the data to be transmitted in the network, so compared with DQN and other reinforcement-learning algorithms they increase the waiting delay at the nodes.
Because the transmission of multiple copies of data causes congestion at the nodes, continued transmission increases the queuing delay. Intelligent algorithms such as DQN do not transmit multiple copies of the same data; instead, they take the reward in the algorithm as guidance, minimize the end-to-end delay in the network, and reduce the occurrence of congestion at the nodes. The end-to-end delay of each algorithm is shown in the figures below. The total transmission delay of service 1 and service 2 is shown in Figures 15-18. In the three scenarios the minimum transmission delay is 100 ms; the A3C algorithm achieves this minimum, and the DQN algorithm also has a low delay. In scenario 1, however, the total transmission delay of service 1 and service 2 increases as the interrupted delay of the link increases (the specific data are shown in Table 3), because in this scenario fragments that encounter an interruption are stored at the node, so the delay keeps rising. Because we assume in this paper that node capacity is always available, under the Epidemic algorithm the fragments can arrive at the destination node smoothly, and the total transmission delay of Epidemic is not too high. The ED algorithm, by contrast, has the highest transmission delay: since ED is a source-routing algorithm, it fixes the transmission path when the source node sends the fragments, so fragments that encounter an interruption do not change paths, which results in high delay.
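The delay decomposition above (per-link transmission delay plus per-node waiting delay) can be written as a simple sum over hops. The three-hop path and its delay values are illustrative assumptions, not measurements from the paper.

```python
def end_to_end_delay(hops):
    """Total delay = per-link transmission delay + per-node waiting delay.

    `hops` is a list of (transmission_delay, waiting_delay) pairs, one
    per hop on the chosen path; all values in milliseconds (illustrative).
    """
    return sum(tx + wait for tx, wait in hops)

# A 3-hop path where the middle node stores the fragment during a link
# interruption, inflating the waiting delay at that node.
path = [(20, 0), (20, 40), (20, 0)]
total = end_to_end_delay(path)
```

A longer link interruption enters the sum only through the waiting term, which is why the scenario-1 curves rise with the interrupted delay.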
6.2.4. Comparison of Throughput. The throughput of the services is shown in Figures 19-21. In the three scenarios, the A3C algorithm has the maximum throughput, better than the DQN algorithm. In the consider-cache scenario the throughput decreases, because in this scenario the transmission delay is somewhat higher and not many fragments reach the destination node. The specific throughput data of all algorithms are shown in Table 4; the throughput of the ED and MED algorithms for service 1 and service 2 also decreases, because the delay increases as the interrupted delay of the link increases. In the only-consider-detour and comprehensive scenarios, however, the throughput is the maximum, which shows that our models and the adopted algorithms have improved the transmission.
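The inverse relation between delay and throughput discussed above can be made concrete: for a fixed amount of delivered data, throughput falls as the end-to-end delay grows. The data size and delays below are assumed values for illustration.

```python
def throughput_bps(bits_delivered, delay_ms):
    """Delivered bits divided by the total end-to-end delay, in bit/s."""
    return bits_delivered * 1000.0 / delay_ms

# As the interrupted-link delay grows, delay rises and throughput falls.
base = throughput_bps(100_000, 100)  # 100 kbit delivered over 100 ms
slow = throughput_bps(100_000, 160)  # same data, longer delay
```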

6.2.5. Comparison of Node Equilibrium and Link Equilibrium. The node equilibrium of the services can be seen in Figure 22. The Epidemic algorithm has the maximum node equilibrium because it forwards fragments in a flooding manner: when a node comes into the communication scope of other nodes and finds that they do not yet hold a fragment, it sends the fragment to them. This leads to many copies of fragments in the network, so every node may store every fragment of every service, and the equilibrium is therefore the highest. The node equilibrium of the ED and MED algorithms is a little lower, while the average node equilibrium of DQN is a little higher, which has to be improved. For the link equilibrium, the Epidemic algorithm again has the maximum value, for the same reason as the node equilibrium: it forwards too many duplicates in the network. The average link equilibrium of DQN is a little higher, though not as large as its node equilibrium, so it also has to be improved. The A3C algorithm improves both the node and link equilibrium, achieving lower equilibrium than the DQN algorithm.
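One plausible way to quantify the load comparison above is the mean number of stored fragment copies per node; this is a hypothetical stand-in metric, since the paper's exact definition of node equilibrium is not restated here, and the per-node counts below are invented for illustration.

```python
def average_load(loads):
    """Mean number of stored fragment copies per node (hypothetical
    stand-in for the node-equilibrium metric)."""
    return sum(loads) / len(loads)

# Flooding (Epidemic): every node eventually holds every fragment.
epidemic = average_load([4, 4, 4, 4])
# Single-copy routing (e.g., a DQN-selected path): copies exist only
# along the one chosen path.
single_path = average_load([1, 1, 0, 0])
```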
6.2.6. Comparison of Total Reward and Loss. From Figure 23, we can see that the A3C algorithm has a higher reward and converges very quickly. At first the reward value jitters randomly, because the exact value cannot be obtained in a short time. The reward of A3C approaches its top value within about 400 episodes, whereas DQN needs about 700 episodes. These results show that our models and the adopted algorithms can reach an optimal value and converge.
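The episode counts above come from a convergence criterion on the reward curve; one simple heuristic is the first episode whose trailing-window mean reaches a fixed fraction of the final reward. The synthetic curves below are illustrative, not the experiment's actual reward traces.

```python
def episodes_to_converge(rewards, top_fraction=0.95, window=10):
    """First episode whose trailing-window mean reward reaches
    `top_fraction` of the final reward (a simple convergence heuristic)."""
    target = top_fraction * rewards[-1]
    for i in range(window, len(rewards) + 1):
        if sum(rewards[i - window:i]) / window >= target:
            return i
    return None

# Synthetic reward curves: the faster learner plateaus sooner.
fast = [min(100, e * 0.5) for e in range(1, 1001)]  # plateaus near ep. 200
slow = [min(100, e * 0.2) for e in range(1, 1001)]  # plateaus near ep. 500
c_fast = episodes_to_converge(fast)
c_slow = episodes_to_converge(slow)
```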

Conclusions
In this paper, we have proposed optimal models based on different scenarios, consisting of the only-consider-cache scenario, the only-consider-detour scenario, and the comprehensive scenario; the models are intended to jointly consider the behavior and the buffer of the nodes so as to improve the performance of data transmission. Owing to the different choices of the nodes, three scenarios are formed, and we adopted the DQN algorithm to solve the complex nonlinear optimization problem and obtain the optimal solutions, which consist of lower end-to-end delay, higher throughput, and better data-delivery guarantees. The simulation results show that, compared to other algorithms such as Epidemic, ED, and MED, the DQN algorithm we adopted achieves better performance.
As future work, we plan to improve the optimal models and decrease the overhead of nodes and links, and we expect that the application of DQN in delay tolerant networks can be studied further.

Conflicts of Interest
The authors declare no conflict of interest.