Stochastic Adaptive Forwarding Strategy Based on Deep Reinforcement Learning for Secure Mobile Video Communications in NDN

Named Data Networking (NDN) can effectively deal with the rapid development of mobile video services. For NDN, selecting a suitable forwarding interface according to the current network status can improve the efficiency of mobile video communication and can also avoid attacks to improve communication security. For this reason, we propose a stochastic adaptive forwarding strategy based on deep reinforcement learning (SAF-DRL) for secure mobile video communications in NDN. For each available forwarding interface, we introduce the twin delayed deep deterministic policy gradient algorithm to obtain a more robust forwarding strategy. Moreover, we conduct various numerical experiments to validate the performance of SAF-DRL. Compared with BR, RFA, SAF, and AFSndn forwarding strategies, the results show that SAF-DRL can reduce the delivery time and the average number of lost packets to improve the performance of NDN.


Introduction
Recently, with the development of technology and the increase of mobile devices, the proportion of mobile video services in network communications has increased rapidly. At the same time, users pay more attention to the acquired video content itself and no longer care about its storage location. This brings about huge challenges to the current host-based TCP/IP network architecture [1][2][3]. Although many research studies aim to improve current network performance [4][5][6], they do not change the architecture in essence. To deal with these challenges, Named Data Networking (NDN) [7] has been proposed as one of the most promising candidate network architectures for the future.
Because the routing nodes in NDN support in-network caching, the efficiency of video communication can be effectively improved in NDN [8].
In addition, by decoupling information content and communications. In addition, Yi et al. [12,13] proved that the forwarding plane in NDN is stateful, where routing only provides a guiding role for adaptive forwarding. Therefore, designing an effective adaptive forwarding strategy for NDN can greatly improve the efficiency of secure mobile video communications.
Adaptive forwarding in NDN is a dynamic and complex process. When the router receives an Interest packet, it can deterministically select one or more available interfaces for forwarding [14][15][16][17]. Because they always forward on the same selected interfaces, these strategies lack exploration of unknown links and cannot find better links in time, which may cause network load imbalance. Thus, some researchers propose probability-based adaptive forwarding strategies [18][19][20][21]. In these strategies, the Interest packet stochastically selects the interface for forwarding according to the forwarding probability of each available interface. However, these strategies are less robust to emergencies in the network, such as network congestion or link failure. Therefore, assigning a suitable and robust forwarding probability to each available forwarding interface is the main research problem in designing a stochastic adaptive forwarding strategy in NDN.
In recent years, the rise and development of reinforcement learning theory has brought about new ideas for the design of NDN adaptive forwarding strategies [22]. Reinforcement learning aims to obtain the optimal strategy by learning independently through interaction with the environment [23]. In this paper, we use the advantages of reinforcement learning to design a more suitable and robust stochastic adaptive forwarding strategy in NDN. When a user requests mobile video data in NDN, the user cannot obtain all the mobile video data by sending one Interest packet to the server. It is necessary to continuously send multiple Interest packets with a common prefix to request all the mobile video data. We introduce the twin delayed deep deterministic policy gradient (TD3) [24] algorithm to solve this continuous control problem. Then, we take the throughput, delay, and error rate of each interface as the state of the algorithm, the total utility of the node as the reward function of the algorithm, and the forwarding probability of each interface as the action of the algorithm. Through continuous iterative training, an adaptive forwarding strategy with maximum network utility can finally be obtained.
We summarize our main contributions as follows:
(1) We propose an adaptive forwarding framework based on deep reinforcement learning in NDN. This framework can assign a suitable and robust forwarding probability to each available interface.
(2) We propose a stochastic adaptive forwarding strategy for secure mobile video communications based on deep reinforcement learning (SAF-DRL), which introduces the TD3 algorithm into NDN.
(3) We conduct numerical experiments on the SAF-DRL algorithm under different topologies. By comparison with four other adaptive forwarding strategies, the experimental results show that SAF-DRL can achieve higher network performance.
The rest of the paper is arranged as follows: Section 2 introduces related work on adaptive forwarding strategies in NDN. Section 3 discusses the system model of adaptive forwarding in NDN. Section 4 presents and describes the SAF-DRL algorithm. Section 5 evaluates the performance of SAF-DRL. Section 6 summarizes this paper and proposes future work.

Related Work
In recent years, NDN has received more attention and there are a large number of researchers studying this field. Many of them are making great efforts to study adaptive forwarding strategies. In this section, we review some typical works.
Yi et al. [15] proposed a framework for adaptive forwarding in NDN networks and also proposed an adaptive forwarding strategy based on the color (GREEN, YELLOW, and RED) of the forwarding interface. After this, adaptive forwarding in NDN is mainly divided into two categories. One is the early use of mathematical optimization methods, the most important of which is the adaptive forwarding strategy based on probability, and the other is the use of reinforcement learning method in recent years through continuous iterative training to get the optimal adaptive forwarding strategy.
Probability-based adaptive forwarding was studied extensively in the early literature. Qian et al. [18] proposed an adaptive forwarding strategy based on probability. This strategy assigns a forwarding probability to each available interface and takes minimizing the delay of Interest packet transmission on each interface as the optimization goal. The objective function is optimized through the ant colony optimization algorithm to finally achieve the optimal adaptive forwarding strategy. Lei et al. [19] proposed a maximizing deviation based probabilistic forwarding strategy. This forwarding strategy comprehensively considers multiple related attributes of the node and uses the maximizing deviation method to assign the optimal weight to each attribute. On this basis, the comprehensive score of each available forwarding interface can be obtained, and this score is taken as the forwarding probability of the interface. Lei et al. [20] proposed an entropy-based probabilistic forwarding strategy. This forwarding strategy uses the entropy weight theory to assign weights among multiple attributes. The node combines its own performance and the assigned weights to calculate the availability of each available interface and then uses this availability as the forwarding probability of the interface. Posch et al. [21] proposed stochastic adaptive forwarding (SAF). SAF imitates a real-world water pipe system. It can guide and distribute Interest packets intelligently in network nodes to avoid link congestion. SAF adopts an overpressure valve, takes the throughput of the link as an important measure, divides the Interest packets into satisfied and unsatisfied, and allocates the forwarding probability of each interface, so that congested nodes can reduce the pressure independently.
As the advantages of reinforcement learning gradually manifest, some researchers use reinforcement learning to find the optimal adaptive forwarding strategy. Yao et al. [25] proposed an adaptive forwarding strategy called SMDPF. This strategy regards request forwarding in the network as a Semi-Markov Decision Problem (SMDP). Then, based on SMDP theory and considering the randomness of network requests, an optimal adaptive forwarding strategy is designed to handle request forwarding by combining Q-learning with an artificial neural network. Akinwande [26] proposed an adaptive forwarding strategy based on reinforcement learning and a random neural network. Based on a dynamic self-aware strategy layer, the strategy can reply to the requested content quickly through the local Content Store (CS). At the same time, it uses probe Interest packets combined with local routing information to actively seek new available delivery paths to a controllable degree. Zhang et al. [27] proposed an intelligent forwarding strategy using reinforcement learning. The strategy does not rely on a model programmed in advance but trains a neural network model to select the interface by collecting information from nodes. At the same time, it learns new decisions only by observing the results of past decisions. Zhang et al. [28] proposed an adaptive forwarding strategy using improved Q-learning. The strategy is divided into two phases, exploration and exploitation. In the exploration phase, the information in the network is collected, and this information is then used to guide the forwarding of Interest packets in the exploitation phase.
Probability-based adaptive forwarding can greatly reduce the resource waste caused by deterministically selecting one or more interfaces in mobile video communications. At the same time, it can avoid attacks due to the selection of certain interfaces, thereby improving security. However, adaptive forwarding based on reinforcement learning has higher robustness, especially in the case of link failures. We use the advantages of both and use reinforcement learning to train the forwarding probability assigned to each available forwarding interface, finally obtaining the adaptive forwarding strategy with high robustness in NDN.

System Model
In this section, we introduce the system model in detail. We summarize the major variables and expressions, which are depicted in Table 1.
We use a directed graph $G(V, E)$ to depict the network model. The directed graph consists of two parts, the set of nodes $V$ and the set of links $E$, where $E \subseteq V \times V$. Each node $i \in V$ may maintain several physical interfaces $F_{(i,j)}$, where $j$ is a neighbor node of node $i$ and the tuple $(i, j) \in E$. We define $F_i$ as the set of all physical interfaces of node $i$, $F_i := \cup_{(i,j) \in E} F_{(i,j)}$, and $N_i$ as the number of interfaces of node $i$, $N_i := |F_i|$. For node $i$, we define $F_i^+ \subseteq F_i$ as the set of in-interfaces, which receive the Interest packet, and $F_i^- \subseteq F_i \setminus F_i^+$ as the set of out-interfaces, which return the requested data packet or forward the Interest packet to the next hop node found in the FIB. $F_L \notin F_i$ is the lose-interface, to which an Interest packet that has to be dropped is assigned. As above, we want to learn an adaptive forwarding (AF) strategy for each node. Since the algorithm proposed in this paper only needs local information to train the AF strategy, the algorithm is installed on each node, and no communication is required between the nodes. We focus on a single node only, so we omit the node subscript in the following discussion.
The content catalogue of this system is denoted by a set $K$. For $k \in K$, which represents the content with the common prefix requested by the user, we define $p_k(F_f)$ as the forwarding probability for the interface $F_f$ with the common prefix $k$ and $p_k(F_L)$ as the packet loss rate with prefix $k$ when the network is congested or a link fails. In this paper, we mainly focus on a single common prefix, so we omit the prefix subscript in the remainder of the paper; we will consider different prefixes in our future work. We show the system model in Figure 1. A node in this system has many interfaces on which an Interest packet can be adaptively forwarded. For example, a mobile device (User1 and/or User2) wants to obtain the video /pre1/pre2/n1.mp4, stored on a server (Server1 and/or Server2). The mobile device sends an Interest packet to router R1. Router R1 has two interfaces ($F_1$ and $F_2$) and forwards the Interest packet with probability 2/3 on $F_1$ and 1/3 on $F_2$ (confirm forwarding). Forwarding then continues until the Interest packet encounters the requested content. Finally, the data packet with the video is returned to the mobile device via the reverse path.
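The forwarding step in the R1 example can be sketched as a weighted random choice. This is only an illustration under stated assumptions: the function name `choose_interface`, the dictionary layout, and the fixed seed are hypothetical and not part of the paper; only the probabilities 2/3 and 1/3 come from the example.

```python
import random
from collections import Counter

def choose_interface(probabilities, rng):
    """Pick one interface name at random, weighted by its forwarding probability."""
    names = list(probabilities)
    weights = [probabilities[name] for name in names]
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("forwarding probabilities must be non-negative and sum to 1")
    return rng.choices(names, weights=weights, k=1)[0]

# Forwarding probabilities of router R1 from the example in the text.
r1_probabilities = {"F1": 2 / 3, "F2": 1 / 3}

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
num_interests = 30000
counts = Counter(choose_interface(r1_probabilities, rng) for _ in range(num_interests))

# Empirical share of Interest packets sent on F1; it should be close to 2/3.
share_f1 = counts["F1"] / num_interests
```

Over many Interest packets with the same prefix, the share forwarded on each interface approaches its forwarding probability, while any single packet may still take either interface.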
According to the α-fairness model [29,30], which is widely used for Network Utility Maximization (NUM), the utility function is defined as

$$U_\alpha(x) = \begin{cases} \dfrac{x^{1-\alpha}}{1-\alpha}, & \alpha \neq 1, \\ \log x, & \alpha = 1, \end{cases}$$

where $\alpha$ is a positive fairness tuning parameter. For $\alpha > 0$, the function is strictly increasing. If $\alpha = 1$, the function corresponds to proportional fairness, which is widely used in NUM. Similar to [30], we define the utility function of interface $F_f$ as

$$U(x_f, y_f, z_f) = U_{\alpha_1}(x_f) - \beta\, U_{\alpha_2}(y_f) - \gamma\, U_{\alpha_3}(z_f),$$

where $x_f$, $y_f$, and $z_f$ are the throughput, delay, and error rate of the $f$th interface, respectively; $\beta$ and $\gamma$ represent the relative importance of throughput versus delay versus error rate, with $\beta, \gamma \in [0, 1]$. In particular, if an Interest packet has to be discarded, the corresponding utility is defined as 0; that is, $U(x_L, y_L, z_L) = 0$, where $x_L$, $y_L$, and $z_L$ represent the throughput, delay, and error rate of the lose-interface $F_L$, respectively. For proportional fairness, we set $\alpha_1 = \alpha_2 = \alpha_3 = 1$, and the utility function becomes

$$U(x_f, y_f, z_f) = \log x_f - \beta \log y_f - \gamma \log z_f.$$
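As a rough illustration of the proportional-fairness case, the sketch below assumes that the interface utility takes the log form $\log x_f - \beta \log y_f - \gamma \log z_f$ (throughput rewarded, delay and error rate penalized). That form, the function name `interface_utility`, and the sample measurements are assumptions made for illustration, not values from the paper.

```python
import math

def interface_utility(throughput, delay, error_rate, beta=1.0, gamma=1.0):
    """Assumed log-form utility: log(x) - beta * log(y) - gamma * log(z)."""
    if min(throughput, delay, error_rate) <= 0:
        raise ValueError("all measurements must be positive for the logarithm")
    return math.log(throughput) - beta * math.log(delay) - gamma * math.log(error_rate)

# Utility of the lose-interface, which the text defines as 0.
LOSS_UTILITY = 0.0

# Two hypothetical interfaces: one fast and clean, one slow and lossy.
fast = interface_utility(throughput=5.0, delay=10.0, error_rate=0.01)
slow = interface_utility(throughput=1.0, delay=30.0, error_rate=0.05)
```

Under this assumed form, an interface with higher throughput, lower delay, and a lower error rate always receives a larger utility, which is the ordering the forwarding probabilities are meant to reflect.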

Security and Communication Networks
According to [31], the total utility of each node in the network is denoted by $U_{all}$. Therefore, the objective of optimizing the AF strategy is to maximize the total network utility $U_{all}$.

Stochastic Adaptive Forwarding Based on Deep Reinforcement Learning
In this section, we first introduce the background knowledge about the TD3 algorithm and then propose the SAF-DRL algorithm and describe the algorithm in detail.

Background of TD3.
Table 1 (excerpt): major variables and expressions.
$x_L$, $y_L$, $z_L$: throughput, delay, and error rate of the lose-interface $F_L$.
$U(x_L, y_L, z_L)$: the network utility of the lose-interface $F_L$.
$U_{all}$: the total utility of each node.

In this subsection, to make the SAF-DRL algorithm easier to understand, we introduce some necessary background knowledge about the TD3 algorithm. In standard reinforcement learning (RL), the agent interacts with the environment during each decision epoch and gradually obtains the optimal strategy. Specifically, at epoch $t$, the agent observes a state $s_t$ and selects an action $a_t$ according to this state. After the action is executed, the environment state changes from $s_t$ to $s_{t+1}$, and, at the same time, the agent obtains the single-epoch reward $r(s_t, a_t)$ given by the environment. The goal of reinforcement learning is to find an optimal policy $\pi^*$ that maximizes the discounted future reward, in which later rewards are weighted by a discount factor. At epoch $k$, the goal for RL is to acquire an optimal policy $\pi_\omega$ under parameterization $\omega$ that maximizes the expected reward over states $s \sim p_\pi$ and actions $a \sim \pi_\omega$,
where $p_\pi$ is the sampling space of the state. To handle continuous decision-making problems, Sutton et al. [32] proposed the policy gradient (PG) method. This method uses a probability distribution function $P(a|s; \omega)$ to represent the policy at each epoch and samples an action from this distribution at each epoch: $\pi_\omega(a|s) = P(a|s; \omega)$. We can use the gradient of the reward $\nabla J(\omega)$ to update the parameterized strategy $\pi_\omega$. Therefore, we can get

$$\nabla_\omega J(\omega) = \mathbb{E}_{s \sim p_\pi,\, a \sim \pi_\omega}\left[\nabla_\omega \log \pi_\omega(a|s)\, Q^\pi(s, a)\right],$$

where, according to the expected return given by equation (5), $Q^\pi(s, a)$ is the expected return obtained after taking action $a$ in state $s$. Since the process of generating actions is a stochastic process, the strategy that is finally learned is also a stochastic policy. However, since the action space is usually a high-dimensional vector, frequent sampling in the high-dimensional action space is computationally very expensive.
Therefore, Silver et al. [33] proposed the deterministic policy gradient (DPG). At each epoch, the action is obtained directly as a deterministic value through the function $\mu$, $a = \mu_\omega(s)$. Therefore, we can get

$$\nabla_\omega J(\omega) = \mathbb{E}_{s \sim p_\pi}\left[\nabla_a Q^\mu(s, a)\big|_{a = \mu_\omega(s)}\, \nabla_\omega \mu_\omega(s)\right].$$

In order to deal with high-dimensional input problems, DeepMind introduced deep learning into Q-learning and proposed the Deep Q-Network (DQN) [34]. At epoch $k$, DQN uses a greedy strategy $\pi(s_k) = \arg\max_{a_k} Q(s_k, a_k)$ and then trains by minimizing the squared difference between $Q(s_k, a_k)$ and $Tar_k$, where $Tar_k$ is the target Q value, that is, the reward $r(s_k, a_k)$ plus the discounted maximum Q value of the target network at the next state $s_{k+1}$.
For continuous control problems, the actor-critic method [35] is often used. This method usually uses the policy gradient method to find the optimal policy. DeepMind combined DQN and DPG and proposed a new actor-critic method, deep deterministic policy gradient (DDPG) [36]. The training of the critic network in DDPG is based on the DQN method, and the training of the actor network is based on the DPG method: the critic is updated by minimizing the squared error against a target value computed from the target actor and target critic networks, and the actor is updated along the deterministic policy gradient. Although DDPG has achieved certain success, it still suffers from problems such as overestimation of the Q value and excessive variance.
Therefore, Fujimoto et al. [24] proposed TD3, an improved DDPG algorithm. The first improvement addresses the overestimation problem. TD3 borrows the idea of Double DQN [37] and uses two critic networks. In actor-critic practice, the current network and the target network are highly similar, so they cannot provide independent evaluations. To solve this problem, a double update is used in which two actors $\pi(\cdot|\omega^{\phi_1})$ and $\pi(\cdot|\omega^{\phi_2})$ are optimized with the critics $Q(\cdot|\omega^{q_1})$ and $Q(\cdot|\omega^{q_2})$, respectively, and the target value of each critic is computed with the other critic. In practice, the two actor networks end up performing similarly, so a single actor network is kept. The update targets of both critic networks are then the same, $y_2 = y_1$. Because the actor network favors highly evaluated actions, overestimated values gradually accumulate; therefore, the smaller of the two critic evaluations is chosen, and the final target is the reward plus the discounted minimum of the two target-critic Q values at the next state. The second improvement addresses excessive variance: TD3 delays policy updates, that is, the critic networks are updated more frequently than the actor network. This avoids the blind iteration of the actor network that can result while the critic is still changing constantly. Finally, when the target Q value is calculated, the action is perturbed by random noise clipped to a certain range, which smooths the policy and reduces the influence of false peaks. In this way, the problem of an incorrect strategy caused by an inaccurate Q function in DDPG can be alleviated.
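The target computation described above (target policy smoothing followed by the minimum over two critics) can be sketched in a few lines. This is a minimal illustration of the standard TD3 target from [24], not the authors' implementation: the networks are replaced by toy callables, and the names `td3_target`, `clipped_noise`, and the default values of `sigma` and `k` are hypothetical.

```python
import random

def clipped_noise(rng, sigma, k):
    """Gaussian noise N(0, sigma) clipped to the range [-k, k]."""
    return max(-k, min(k, rng.gauss(0.0, sigma)))

def td3_target(reward, next_state, target_actor, target_critic_1, target_critic_2,
               discount, rng, sigma=0.2, k=0.5):
    """Reward plus the discounted minimum of the two target-critic estimates."""
    # Target policy smoothing: perturb the target action with clipped noise.
    next_action = [a + clipped_noise(rng, sigma, k) for a in target_actor(next_state)]
    q1 = target_critic_1(next_state, next_action)
    q2 = target_critic_2(next_state, next_action)
    # Taking the smaller estimate counteracts overestimation of the Q value.
    return reward + discount * min(q1, q2)

# Toy stand-ins for the target networks (hypothetical, for illustration only).
def toy_actor(state):
    return [0.5, 0.5]

def optimistic_critic(state, action):
    return 10.0

def pessimistic_critic(state, action):
    return 4.0

y = td3_target(1.0, [0.0], toy_actor, optimistic_critic, pessimistic_critic,
               discount=0.9, rng=random.Random(1))
```

With these toy critics the target follows the pessimistic estimate, so an inflated value from one critic does not propagate into the update.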

SAF-DRL Algorithm.
In this subsection, we present the stochastic adaptive forwarding strategy based on DRL (SAF-DRL) for secure mobile video communications in NDN. For all DRL approaches, the state space, the action space, and the reward function are important components:
STATE: the state space mainly consists of three parts: the throughput, delay, and error rate of each interface in the network. Therefore, the state at epoch $t$ is $s_t = [(x_1^t, y_1^t, z_1^t), (x_2^t, y_2^t, z_2^t), \ldots]$.
ACTION: the action space consists of the forwarding probability of each interface and the Interest packet loss rate $p_L$ for specific content. Therefore, the action at epoch $t$ is $a_t = [p_t(F_1), p_t(F_2), \ldots, p_t(F_L)]$.
REWARD: the reward function is the objective of the AF strategy, which is the total utility of each node in the network. Formally, $r_t = U_{all}$.
Please note that the design of the state space, action space, and reward function strongly affects the performance of the SAF-DRL algorithm. Our design is based on full consideration of mobile video communications and is intended to be close to the real situation. Moreover, selecting the interface by probability can improve the security of mobile video communications, especially when encountering link failures or congestion, and reinforcement learning shows higher robustness than mathematical calculations, thus ensuring the performance of mobile video communications.
Since each node receives a large number of Interest packets requesting mobile video content, we assume that the Interest packets received are continuous. Nodes need to continuously make forwarding decisions for these Interest packets. At the same time, to make the forwarding probability assigned to each available interface in mobile video communication highly robust, we use the TD3 algorithm, which belongs to the actor-critic family. In addition, as the TD3 algorithm is also a deep reinforcement learning [34] algorithm, it can deal with the high-dimensional problems caused by the large number of entries in a real FIB. The situation faced by all interfaces of the router constitutes the training environment of the DRL Agent. We collect the throughput, delay, and error rate of each interface, which constitute the Agent's state space. This information is obtained mainly from local storage during the forwarding process and requires no communication with other nodes in the network. The Interest packet forwarding probability of each interface in the node constitutes the action space of the Agent. Actions are then performed to update the forwarding probability of the interface, and finally a more suitable and robust forwarding probability is obtained by training. A suitable and robust forwarding probability on the interface can effectively improve the forwarding efficiency, thereby improving the overall performance of the node. The reward has been discussed in Section 3. We show the detailed framework of SAF-DRL in Figure 2. For each available forwarding interface of the node, we collect the local information of throughput, delay, and error rate as the states. Then the Agent uses the TD3 method to train on the collected information. Through training, we get an AF strategy that maximizes rewards.
We propose the SAF-DRL algorithm as Algorithm 1. First, the algorithm initializes two critic networks $Q(s, a|\omega^{q_1})$ and $Q(s, a|\omega^{q_2})$ and the actor network $\mu(s|\omega^u)$ with weights $\omega^{q_1}$, $\omega^{q_2}$, and $\omega^u$, together with the corresponding target networks. To make the target networks change slowly, a delayed update method is used. On the basis of soft update, the target networks are updated with the update rate $\varphi$ after every $d$ critic updates (line 16). We use the uniform distribution $U(0, |F^-|)$ to initialize the forwarding probability of each available forwarding interface, $p(F^-) = 1/|F^-|$, where $|F^-|$ is the number of out-interfaces, and we initialize the Interest packet loss rate to 0, $p(F_L) = 0$ (line 4).
We use a replay buffer $B$ to store the transition samples and initialize it in line 3. We first store the experience gained through interaction with the environment in $B$ (lines 6-9), and then we train the actor network and the critic networks on $M$ random transition samples drawn from $B$ (lines 10-16). The function $\mathrm{clip}(N(0, \sigma), -k, k)$ limits the value of $N(0, \sigma)$ to the range between $-k$ and $k$: when the value is less than $-k$, it is set to $-k$, and when the value is greater than $k$, it is set to $k$ (line 11). We calculate the minimum Q value of the two critic networks (line 12) and compute the critic update by minimizing the commonly used mean square error loss (line 13). The computation of the actor network update uses the DPG approach [33] (line 15). According to the SAF-DRL algorithm, we can get an optimal Interest packet forwarding probability for each available interface in the node. Finally, we can get an adaptive forwarding strategy that maximizes network utility.
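The delayed soft update described above (target networks move toward the online networks with rate $\varphi$ once every $d$ critic updates) can be sketched as follows. The weights are plain lists of floats rather than neural networks, and the names `soft_update`, `train`, `delay`, and `rate` are hypothetical stand-ins for $d$ and $\varphi$; the critic and actor updates themselves are omitted.

```python
def soft_update(target, online, rate):
    """Move each target weight a fraction `rate` of the way toward the online weight."""
    return [rate * w + (1.0 - rate) * t for t, w in zip(target, online)]

def train(num_steps, delay, rate):
    """Run a toy training loop and return the target weights and actor update count."""
    online = [1.0, -2.0]   # stand-in for the online network weights
    target = [0.0, 0.0]    # stand-in for the target network weights
    actor_updates = 0
    for step in range(1, num_steps + 1):
        # The critic update would happen here on every step (omitted).
        if step % delay == 0:
            # Delayed part: actor update (omitted) and soft target update.
            actor_updates += 1
            target = soft_update(target, online, rate)
    return target, actor_updates

target, actor_updates = train(num_steps=6, delay=2, rate=0.5)
```

A small update rate keeps the targets nearly fixed between updates, which is what lets the critics be trained against a slowly moving objective.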

Numerical Experiment
We conducted numerical experiments on the SAF-DRL method in the NDN environment. The node has certain computing capabilities and can adaptively forward requests for users. At the same time, the node can move to a certain extent, and the main content requested is the video service. Thus, we use one node in the network as an agent to study. Then, a comparison is performed with four other adaptive forwarding strategies in terms of delivery time, the average number of lost packets, load balancing factor, and hop count.
We use the Python language to generate three topologies based on the Erdős-Rényi (ER) [38] model, the Barabási-Albert (BA) [39] model, and a random model, as shown in Figure 3. Each topology is composed of 100 nodes. The probability of creating a link between two nodes in the ER topology is set to 0.08, one edge is added between two nodes in the BA topology at a time, and the distance threshold between two nodes in the random topology is set to 0.15. For each link of the three topologies, we allocate a bandwidth of 1 Mbps to 5 Mbps, and the delay of each link is within [10 ms, 30 ms]. We randomly selected 10 nodes in the network as users, 10 nodes as servers, and the rest as routers. To better study the adaptive forwarding strategy, we set the CS capacity on each node to 0. We set $\beta = \gamma = 1$ to balance the importance of throughput, delay, and error rate. Each cycle of the experiment runs 1500 time slots of iterative training, and we take the average value over multiple experiments. We run and train the SAF-DRL algorithm on the Windows 10 operating system with an Intel(R) Core(TM) i5 2.4 GHz CPU and 8 GB of memory.
To explore the reward convergence of different agents under different network topologies, we adopted the ER, BA, and random topologies and performed experiments on five agents (1, 10, 30, 50, and 70) on each topology. In Figure 4, we can see that, under the same topology, although the agents converge at different speeds, they eventually converge to approximately the same stable value. Because there are certain differences among the three topologies (such as the degree of connectivity), the final stable convergence values obtained on the different topologies are not exactly equal, but the difference among them is very small. This shows that our scheme converges and can be used in different topologies. Therefore, in the follow-up experiments, we only report the comparative experiments under the ER topology; the trend is the same in the other topologies.
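The ER part of this setup can be sketched with the standard library alone. The paper does not say which generator or library was used, so the helper names `erdos_renyi_links` and `build_topology`, the seed, and the choice of uniform distributions for bandwidth and delay are assumptions; only the node count, link probability, value ranges, and role counts come from the text.

```python
import random

def erdos_renyi_links(num_nodes, link_probability, rng):
    """Return undirected links (i, j), each created independently with the given probability."""
    return [(i, j)
            for i in range(num_nodes)
            for j in range(i + 1, num_nodes)
            if rng.random() < link_probability]

def build_topology(num_nodes=100, link_probability=0.08, seed=0):
    """Build an ER topology with per-link attributes and node roles."""
    rng = random.Random(seed)
    links = {}
    for link in erdos_renyi_links(num_nodes, link_probability, rng):
        links[link] = {
            "bandwidth_mbps": rng.uniform(1.0, 5.0),  # assumed uniform in [1, 5]
            "delay_ms": rng.uniform(10.0, 30.0),      # assumed uniform in [10, 30]
        }
    # Randomly assign 10 users and 10 servers; the remaining nodes are routers.
    nodes = list(range(num_nodes))
    rng.shuffle(nodes)
    roles = {"users": nodes[:10], "servers": nodes[10:20], "routers": nodes[20:]}
    return links, roles

links, roles = build_topology()
```

With 100 nodes and a link probability of 0.08, about 400 of the 4950 possible links are expected, so most routers end up with several candidate out-interfaces.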
Algorithm 1: SAF-DRL (excerpt, lines 4-11).
(4) Initialize the forwarding probability of the out-interface $F^-$, and $p(F_L) = 0$;
(5) Receive the observed initial state $s_1$;
/** Decision Epoch **/
(6) for $t = 1$ to $T$ do
(7) Obtain the action $a_t = \mu(s_t|\omega^u) + N_t$ by the current policy $\mu(s_t|\omega^u)$ and exploration noise $N_t \sim N(0, \sigma)$;
(8) Execute action $a$, receive reward $r$ and observe next state $s'$;
(9) Store transition sample $(s, a, r, s')$ in $B$;
/** Training Transition Sampling **/
(10) Sample a mini-batch of $M$ transitions $(s, a, r, s')$ from $B$;
(11) Execute action $a = \mu(s|\omega^u) + N$, where $N \sim \mathrm{clip}(N(0, \sigma), -k, k)$;

In order to evaluate the performance of our algorithm in many ways, we compare our SAF-DRL algorithm with four other adaptive forwarding strategies:
BestRoute (BR) [14]: Interest packets are forwarded through the available interface with the lowest cost (e.g., hop count).
Stochastic Adaptive Forwarding (SAF) [21]: Interest packets are forwarded through the interface with the largest throughput-based measurement.
Adaptive Forwarding Strategy in NDN (AFSndn) [28]: Interest packets are forwarded through the interface with the lowest delay.
Request Forwarding Algorithm (RFA) [40]: Interest packets are forwarded through the interface with the least count of pending Interest packets.
We mainly compare our algorithm with other algorithms in four aspects.

Delivery Time.
The delivery time is the average time it takes for an Interest packet to find the specific content it requests. The delivery time can be specifically defined as $\frac{1}{K}\sum_{i=1}^{K}(get_i - send_i)$. Here, $send_i$ represents the moment when the $i$-th Interest packet is sent, $get_i$ represents the moment when the target node receives the $i$-th Interest packet, and $K$ represents the total number of Interest packets requested.

The Average Number of Lost Packets.
The average number of lost packets indicates the average number of Interest packets lost for any reason (e.g., not finding the target node or network congestion) over all episodes. The average number of lost packets can be specifically defined as $\frac{1}{T}\sum_{i=1}^{T} lost_i$. Here, $lost_i$ represents the number of Interest packets lost in the $i$-th episode and $T$ represents the number of episodes.

Load Balancing Factor.
The load balancing factor represents the degree of dispersion of the number of Interest packets forwarded by each node in the network. We use the coefficient of variation for calculation, so the load balancing factor can be specifically defined as $\mathrm{stdev}[NI(v)] / \mathrm{mean}[NI(v)]$. Here, $NI(v)$ represents the number of Interest packets forwarded by node $v$, $\mathrm{stdev}[NI(v)]$ represents the standard deviation of $NI(v)$, and $\mathrm{mean}[NI(v)]$ represents its mean.

Hop Count.
The hop count represents the average number of hops experienced by all Interest packets when they find their target node. The hop count can be specifically defined as $\frac{1}{K}\sum_{i=1}^{K} h_i$. Here, $h_i$ represents the number of hops taken by the $i$-th Interest packet before finding the target node.
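The four metrics can be computed directly from logged per-packet and per-episode values. The sketch below follows the verbal definitions above; the function names and the sample numbers are hypothetical, and the use of the population standard deviation (`pstdev`) for the coefficient of variation is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, pstdev

def delivery_time(send_times, get_times):
    """Average of (get_i - send_i) over all requested Interest packets."""
    return mean([g - s for s, g in zip(send_times, get_times)])

def average_lost_packets(lost_per_episode):
    """Average number of Interest packets lost per episode."""
    return sum(lost_per_episode) / len(lost_per_episode)

def load_balancing_factor(interests_per_node):
    """Coefficient of variation of the per-node forwarded Interest counts."""
    return pstdev(interests_per_node) / mean(interests_per_node)

def hop_count(hops_per_interest):
    """Average number of hops taken by an Interest packet."""
    return mean(hops_per_interest)

# Small made-up logs, only to exercise the functions.
dt = delivery_time(send_times=[0, 1, 2], get_times=[3, 5, 4])
lost = average_lost_packets([2, 0, 4])
balanced = load_balancing_factor([10, 10, 10])
skewed = load_balancing_factor([5, 15])
hops = hop_count([3, 5, 4])
```

A load balancing factor of 0 means every node forwarded the same number of Interest packets; larger values indicate that forwarding is concentrated on fewer nodes.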
We describe in detail the performance of each aspect as follows.
From Figure 5, we can see the delivery time of the five algorithms under different bandwidths (1 Mbps, 3 Mbps, and 5 Mbps). The delivery time of SAF-DRL is lower than the delivery times of the other four algorithms. This is because the SAF-DRL algorithm uses delay as one type of link status information, and delay is also one of the indicators of the reward function, so the learned strategy tends to reduce the delay. Therefore, the delay of SAF-DRL is lower than those of SAF, RFA, and BR. Although the AFSndn algorithm also considers the delay information as an indicator of the reward function, it needs to spend a certain amount of time on exploration in the early stage of forwarding. At the same time, because the AFSndn algorithm uses Q-learning, which has certain limitations in processing high-dimensional data, its delivery time becomes higher than that of our SAF-DRL algorithm as the number of entries in the FIB increases. Because the BR algorithm always selects a single link for forwarding, it causes more network congestion than the other algorithms, and its delivery time is the longest. The RFA algorithm can avoid link congestion through load balancing, which can reduce the delivery time to a certain extent.
Figure 6 shows the average number of lost packets of the five algorithms under different Link Failure Rates (LFR) (10%, 30%, 50%, and 70%). As can be seen from the figure, as the LFR gradually increases, the average number of lost packets of all five algorithms increases. However, the SAF-DRL algorithm always has the lowest average number of lost packets, and the BR algorithm has the largest. This is because the BR algorithm uses the least hop count as the forwarding basis, which may cause network congestion and eventually lead to a large number of Interest packets being discarded. The RFA algorithm only uses the count of pending Interest packets as the basis for the forwarding probability. Although network congestion can be avoided as much as possible, interfaces with a small number of pending Interest packets may have poor link status, so its number of lost Interest packets is lower only than that of the BR algorithm. The SAF algorithm considers information such as link throughput and can select an effective interface for forwarding, thereby reducing the number of lost packets.

e AFSndn algorithm is more robust than the SAF algorithm through reinforcement learning training. However, due to the large number of Interest packets sent in the previous exploration phase, only a few are effective, and a large number of unused Interest packets are discarded. e SAF-DRL algorithm has relatively the lowest average number of lost packets, because it considers multiple different types of link state information, and training through reinforcement learning has high robustness. At the same time, each interface can be assigned a higher efficiency forwarding probability, which reduces the average number of lost Interest packets. Figure 7 shows the results of the load balancing factor under different LFR. It can be seen from the figure that, in the case of four link failures, the SAF-DRL algorithm has a lower load balancing factor. When the user retrieves the content, the BR algorithm only considers the shortest path for forwarding. With the increase of LFR, the congestion on this link intensifies, and then the resources on other links are idle, making the network load unbalanced and the load balancing ability is poor. e SAF algorithm selects links for forwarding. After the Interest packet cannot be satisfied, the SAF algorithm will distribute the traffic on the failed link to other links according to throughput-based measure, which can appropriately improve the load capacity of the network. AFSndn is based on the information in the early exploration phase, and when guiding the forwarding of Interest packets, it tries to avoid network congestion, ensuring the load capacity of the network. Compared with the SAF algorithm and AFSndn algorithm, the SAF-DRL algorithm is more robust due to the reinforcement learning training, which makes the forwarding probability distribution allocated on each available forwarding interface more robust, especially when the link fails. 
The RFA algorithm has the lowest load balancing factor, because it uses the count of pending Interest packets as the reference basis for forwarding. The count of pending Interest packets can reflect the load status over a coming period of time, so the RFA algorithm can effectively balance the load.
Figure 8 shows the results of the average hop count. The average hop count of the BR algorithm is the lowest, and that of the RFA algorithm is the highest. The average hop counts of the SAF algorithm, AFSndn algorithm, and SAF-DRL algorithm lie between the two, and the average hop count of the AFSndn algorithm is higher than those of the SAF algorithm and the SAF-DRL algorithm. This is mainly because the BR algorithm forwards along the path with the fewest hops and does not consider other indicators, so its average hop count is the lowest. The RFA algorithm, by contrast, only considers the count of pending Interest packets and does not take information such as delay into account, so its hop count is the highest. As for the AFSndn algorithm, in the early exploration phase Interest packets are forwarded through all available interfaces, and some of these lead to very long paths, which raises the average hop count. Since the SAF algorithm and the SAF-DRL algorithm are adaptive forwarding strategies based on probability, a link with better performance is selected with greater probability, but the link with better performance is not necessarily the shortest. The SAF algorithm only uses throughput as its measure. The SAF-DRL algorithm, however, takes delay and other information into account, which is equivalent to considering the length of the link to a certain extent, so that it can find the target node faster with fewer hops.
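The observation that probability-based forwarding does not minimize hop count can be made concrete with a small sketch. The interface names, hop counts, and probabilities below are hypothetical values chosen for illustration; they are not results from the experiments.

```python
import random


def choose_interface(probabilities, rng):
    """Sample one forwarding interface according to its probability."""
    faces = list(probabilities)
    weights = [probabilities[f] for f in faces]
    return rng.choices(faces, weights=weights, k=1)[0]


def expected_hop_count(probabilities, hops):
    """Expected hops when interfaces are chosen with the given probabilities."""
    return sum(p * hops[face] for face, p in probabilities.items())


# Hypothetical hop counts to the content source via each interface.
hops = {"face_a": 3, "face_b": 5, "face_c": 4}

# A shortest-path strategy always uses the 3-hop interface.
shortest_only = {"face_a": 1.0, "face_b": 0.0, "face_c": 0.0}

# A probability-based strategy may favor face_b because it performs best
# on some other measure (e.g. throughput), even though it is longer.
probabilistic = {"face_a": 0.2, "face_b": 0.5, "face_c": 0.3}

rng = random.Random(0)  # seeded so repeated runs sample the same sequence
sample = [choose_interface(probabilistic, rng) for _ in range(5)]
print("sampled interfaces:", sample)
print("shortest-only expected hops:", expected_hop_count(shortest_only, hops))
print("probabilistic expected hops:", expected_hop_count(probabilistic, hops))
```

With these assumed numbers the probabilistic strategy averages more hops than shortest-path forwarding, which matches the ordering described for BR versus the probability-based strategies. A strategy whose measure also reflects delay would shift probability toward shorter links and narrow that gap.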

Conclusion and Future Work
In this paper, we have proposed a stochastic adaptive forwarding strategy based on deep reinforcement learning (SAF-DRL), a novel adaptive forwarding strategy for secure mobile video communications in NDN. SAF-DRL forwards each Interest packet with a common prefix according to the forwarding probability. To obtain a more robust forwarding probability on each available interface, we have also introduced the twin delayed deep deterministic policy gradient algorithm to NDN for adaptive forwarding.
The numerical experiments have shown that the SAF-DRL algorithm eventually reaches stability under the ER topology, BA topology, and random topology. Compared with BR, RFA, SAF, and AFSndn, SAF-DRL has clear advantages in delivery time and the average number of lost packets. Since we only considered a single video prefix in this paper, in future work we will consider the priority between different video content prefixes requested by mobile devices. In addition, different applications require different weights for different status information; for example, the transmission of a live broadcast service requires lower delay. We will combine the content priority and the weights of the interface status to improve the security and efficiency of mobile video communications.

Data Availability
No data were used to support this study.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.