Adaptive Routing Strategy Based on Improved Double Q-Learning for Satellite Internet of Things

Satellite Internet of Things (S-IoT), which integrates satellite networks with IoT, is a new mobile Internet that provides services for social networks. However, because the topology structure and node status change dynamically, forwarding data packets efficiently and securely in S-IoT is challenging. To address this problem, this paper proposes an adaptive routing strategy based on improved double Q-learning for S-IoT. First, the whole S-IoT is regarded as a reinforcement learning environment, and satellite nodes and ground nodes in S-IoT are both regarded as intelligent agents. Each node in the S-IoT maintains two Q tables, which are used for selecting the forwarding node and for evaluating the forwarding value, respectively. In addition, the next hop node of data packets is determined by the mixed Q value. Second, in order to optimize the Q value, this paper improves the mixed Q value, the reward value, and the discount factor based on the congestion degree, the hop count, and the node status, respectively. Finally, we perform extensive simulations to evaluate the performance of this adaptive routing strategy in terms of delivery rate, average delay, and overhead ratio. Evaluation results demonstrate that the proposed strategy achieves more efficient and secure routing in the highly dynamic environment than the state-of-the-art strategies.


Introduction
Satellite Internet of Things (S-IoT) is an integration of satellite networks [1] and IoT [2]. S-IoT not only strengthens communication by using relay satellites, but also forms a new mobile Internet [3] oriented toward the integrated satellite-terrestrial information network architecture [4]. S-IoT has the advantages of wide coverage and high robustness and can provide ubiquitous services for social networks, so it has attracted considerable attention [5,6].
As the foundation of the communication protocol for S-IoT, the routing strategy is responsible for data packet forwarding and is of great significance to communication security [7-9]. Compared with terrestrial networks, S-IoT has the following characteristics. (1) The high-speed movement of satellite nodes and the frequent failure of sensor nodes result in a dynamic topology structure, causing unstable end-to-end paths in S-IoT. (2) The complex space environment and the uneven amount of terrestrial access data lead to dynamic node status. (3) Due to the limited energy of satellite nodes and sensor nodes, energy efficiency must be taken into account in the routing strategy to reduce the overhead ratio. (4) The large number of nodes and the heterogeneity among nodes impose specific requirements on the efficiency and security of data packet forwarding. With all these characteristics in mind, we conclude that routing strategies designed for terrestrial networks are not applicable to S-IoT.
We study S-IoT as a delay tolerant network (DTN) without intersatellite links. Since S-IoT carries heavy data service workloads that generally do not require very low delay, the satellite nodes forward data packets with the store-carry-forward mechanism of DTN, which can cope with the dynamic topology structure in S-IoT. In recent years, DTN has attracted extensive attention from researchers, and many routing strategies for DTN have been proposed. Existing routing strategies can usually be classified into three categories: flood-based, utility-based, and mobility model-based routing strategies. We select several representative routing strategies from each category and discuss them briefly. The Epidemic routing strategy proposed by Vahdat et al. [10] is one of the flood-based routing strategies, in which a node forwards data packets to every node it encounters.
This virus-like propagation mode results in an excessive overhead ratio. In order to improve Epidemic, Spyropoulos et al. [11] proposed the Spray-and-Wait routing strategy. The process of data packet forwarding consists of two phases, i.e., spraying and waiting. In the spraying phase, a limited number of copies of each data packet are diffused into the network. In the waiting phase, these copies are forwarded only directly to the destination node. This strategy reduces the overhead ratio and achieves transmission performance similar to that of Epidemic. As one of the utility-based routing strategies, the Prophet routing strategy was proposed by Lindgren and Doria [12]. In this strategy, a data packet is copied to an encountered node only if that node has a high encountering probability, for the purpose of reducing the amount of replication and the overhead ratio. Sharma et al. [13] proposed the machine learning routing strategy based on Prophet (MLRSP). This strategy takes the speed and location of nodes into account and uses a decision tree as well as a neural network to calculate the encountering probability, achieving better performance than Prophet does. Among mobility model-based routing strategies, the contact graph routing strategy, which was proposed by Araniti et al. [14], is capable of reducing the average delay by selecting the next hop node based on the minimum hop count and the shortest path.
However, the aforementioned routing strategies for DTN cannot quickly adapt to frequent changes of node status, and the copies they create pose a challenge to communication security in S-IoT. To tackle these challenges, we employ reinforcement learning, building on our previous work [15], to develop a novel adaptive routing strategy for S-IoT. Since reinforcement learning can obtain optimal results even if the system environment changes frequently, it has been successfully applied in a variety of fields such as industrial manufacturing, analogue simulation, game competition, and scheduling management. As a reinforcement learning algorithm, double Q-learning [16] chooses a better next hop node by self-learning to cope with the dynamic changes of topology structure and node status in S-IoT while satisfying the communication security requirement.
In view of the dynamic topology structure and the dynamic node status, this paper presents an adaptive routing strategy based on improved double Q-learning for S-IoT. The main contributions of this paper are as follows: (1) We apply reinforcement learning to the S-IoT routing strategy to make it adapt to the dynamic changes of topology structure and node status in S-IoT. (2) We improve the forwarding performance by optimizing the mixed Q value, the reward value, and the discount factor based on the congestion degree, the hop count, and the node status, respectively. (3) We establish an S-IoT model, which consists of a ground layer, a LEO layer, and a MEO layer, to perform simulation experiments. Simulation results demonstrate that the proposed strategy improves the performance of data packet forwarding, in terms of delivery rate, average delay, and overhead ratio, compared with the state-of-the-art strategies.
The rest of this paper is organized as follows. Section 2 introduces the related work. The proposed adaptive routing strategy is detailed in Section 3. Section 4 discusses how to improve the Q value in double Q-learning. Simulation results and the associated analysis are given in Section 5. Section 6 concludes this paper.

Routing Strategy for Satellite Networks.
Satellite networks not only provide remote transmission capability for IoT, but also provide cloud computing capability [17-19], so satellite networks have a direct impact on the overall performance of S-IoT. The routing strategy for satellite networks is responsible for data transmission and distribution between satellites under various security requirements. In recent years, routing strategies for satellite networks have been extensively studied in the literature.
Some researchers paid attention to the dynamic changes of the topology structure caused by the high-speed movement of satellites. Gounder et al. [20] proposed a routing strategy based on snapshot sequence. Mauger and Rosenburg [21] proposed a routing strategy based on virtual nodes. Hashimoto and Sarikaya [22] proposed a routing strategy based on division of the coverage area. Wang et al. [23] proposed a routing strategy based on position and velocity of the nodes.
Though simple and easy to implement for routing computation, these strategies often need high storage capacities. Some researchers focused on the limited energy caused by the lack of continuous energy supply. Ekici et al. [24] proposed a routing strategy for saving the energy cost. Yang et al. [25] proposed an energy-efficient routing strategy. Marchese and Patrone [26] proposed an energy-aware routing strategy. These strategies can reduce energy consumption, but they induce a high computational burden. Some other researchers are concerned about the poor QoS caused by long distances between nodes and unstable links. Mao et al. [27] proposed a routing strategy separating the collection and calculation of QoS. Huang et al. [28] proposed a routing strategy under guaranteed delay constraints. Xu et al. [29] proposed a routing strategy based on asynchronous transfer mode. However, these strategies focused on improving the QoS of voice and multimedia services and failed to consider data services.
It is worth emphasizing that all the routing strategies mentioned above use intersatellite links. Among existing low earth orbit (LEO) and medium earth orbit (MEO) constellation systems, only Iridium is equipped with intersatellite links, owing to their high cost and system complexity. Other constellation systems, such as Orbcomm, Globalstar, and O3b, have no intersatellite links [30]. For this reason, it is more reasonable to construct the S-IoT based on constellation systems without intersatellite links, which is the setting considered in this work.

Routing Strategy Based on Reinforcement Learning.
In recent years, reinforcement learning has attracted widespread attention. As a classic reinforcement learning algorithm, Q-learning [31] obtains the sample data sequence (state, action, and reward value) through interacting with the environment and uses the state-action function value (Q value) to find the best action for the current state. In addition, Q-learning ensures communication security by the self-learning mechanism. Q-learning has been applied in many fields. Deng et al. [32] applied Q-learning to the task allocation of edge computing. Zhao et al. [33] applied Q-learning to the DoS attack of many-core systems.
In the routing field, Elwhishi et al. [34] proposed a Q-learning routing strategy for DTN. In this strategy, nodes collaborate with each other and make forwarding decisions based on connections. However, node status is not considered in this work. Plate and Wakayama [35] proposed a Q-learning routing strategy based on kinematics and sweep features. This strategy can adapt to the continual changes of the topology structure caused by node mobility and energy consumption. Rolla and Curado [36] proposed an enhanced Q-learning routing strategy for DTN. This strategy calculates the reward value based on the distance between nodes such that more data packets in densely populated areas can be delivered. Wu et al. [37] proposed an adaptive Q-learning routing strategy based on anycast (ARSA). This strategy focuses on anycast communication from a node to multiple destination nodes, while considering the encountering probability and the relative speed of nodes.
However, the abovementioned Q-learning routing strategies suffer from the overestimation issue in certain cases. The reason is that the Q-learning algorithm uses the same Q value for both action selection and action evaluation and uses the maximum Q value as an approximation of the maximum expected Q value. Q-learning tends to produce a positive estimation bias, since an overestimated Q value has a higher chance of being selected. The double Q-learning algorithm, which was proposed by Hasselt [16], uses two Q values to separate action selection from action evaluation. Double Q-learning has been applied in many fields. Zhang et al. [38] applied double Q-learning to the speed control of autonomous vehicles. Vimal et al. [39] applied double Q-learning to improve the energy efficiency of cognitive radio networks. Zhang et al. [40] applied double Q-learning to the energy-saving scheduling of edge computing. So far, double Q-learning has rarely been used in the routing field.
The core idea of the double Q-learning algorithm is that the action is selected greedily in each step and the two Q values are adaptively updated as the environment changes. One Q value selects the action, and the other one evaluates the selected action. The selection is decoupled from the evaluation to reduce the positive bias. Furthermore, the double Q-learning algorithm has a computational efficiency similar to that of the Q-learning algorithm. Therefore, we use double Q-learning to avoid selecting neighbor nodes whose values are overestimated.
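The overestimation effect described above can be illustrated with a small numerical experiment. The following is a minimal sketch, not part of the proposed strategy: it assumes ten actions whose true value is zero and whose estimates are corrupted by unit Gaussian noise, so the true maximum is zero. The function names and parameters are illustrative.

```python
import random


def single_estimate(samples):
    # Q-learning style: the same noisy estimates both pick the best action
    # and supply its value, so positive noise is systematically selected.
    return max(samples)


def double_estimate(samples_a, samples_b):
    # Double Q-learning style: estimates A pick the action,
    # independent estimates B supply the value of that action.
    best = max(range(len(samples_a)), key=lambda i: samples_a[i])
    return samples_b[best]


def average_bias(num_actions=10, trials=5000, seed=0):
    # Every action has true value 0, so any nonzero average is pure bias.
    rng = random.Random(seed)
    single_total = 0.0
    double_total = 0.0
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(num_actions)]
        b = [rng.gauss(0.0, 1.0) for _ in range(num_actions)]
        single_total += single_estimate(a)
        double_total += double_estimate(a, b)
    return single_total / trials, double_total / trials


single_bias, double_bias = average_bias()
print(f"single estimator bias: {single_bias:.3f}")
print(f"double estimator bias: {double_bias:.3f}")
```

Under these assumptions the single estimator reports a clearly positive value although no action is actually better than another, while the decoupled estimator stays close to zero.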

Proposed Strategy
The whole S-IoT is regarded as a reinforcement learning environment in this paper. Satellites in S-IoT are regarded as satellite nodes, whereas sensors and data centers are regarded as ground nodes. For each individual node, all other nodes it can encounter constitute its neighbor node set. In particular, ground nodes generate and receive data packets, and satellite nodes use the store-carry-forward mechanism to forward data packets.
Both satellite nodes and ground nodes are considered as intelligent agents. Each node learns the network environment of the whole S-IoT through interacting with the other nodes it encounters. Furthermore, all nodes together form the state set of reinforcement learning. A ground node or satellite node selects one node from its neighbor node set to forward data packets. This procedure is considered as an action selection of reinforcement learning. In this manner, the neighbor node set of a node can be regarded as its possible action set. A state transition is equivalent to forwarding data packets from one node to a neighbor node.
In the proposed strategy, each node is assigned two Q tables ($Q^A$ and $Q^B$) to store the Q values of the action of selecting a neighbor node to forward data packets toward the destination node. Each node updates only its own two Q tables and shares its local information only with its neighbor nodes. The two Q values stored in the corresponding Q tables are used to determine and evaluate the greedy strategy, respectively. More importantly, the two Q values are decoupled to address the issue of overestimation, which may trap routing in local optima. The two Q values change with the topology structure and node status such that the proposed strategy can adapt to the highly dynamic environment.
Initially, a new node has no knowledge of the whole S-IoT environment, and its two Q tables are empty. When this node encounters other nodes, it records their identities and initializes the corresponding Q values to 0 in both Q tables. The selection of a neighbor node for each data packet updates the two Q values. Each data packet has a destination node. When the data packet reaches its destination node, the Q values of all nodes on this forwarding path are updated by a rewarding procedure. In the proposed strategy, the two Q values are intensively learned from two different experience sets of the S-IoT. The mixed Q value, which depends on the two Q values, decides which node should be selected to forward data packets.

Security and Communication Networks

Figure 1 illustrates the general routing process of a specific node. If the destination node is in its neighbor node set, this node forwards data packets to the destination node to complete data transmission. Otherwise, this node selects the neighbor node with the largest mixed Q value to forward data packets. It stores and carries these data packets until it encounters the selected node. Such operations are repeated until the simulation is terminated. The greedy algorithm ensures the largest cumulative future rewards. Taking node $c$ as an example, the node selected from its neighbor node set can be expressed as

$$\arg\max_{x \in N_c} \hat{Q}_c(d, x) \qquad (1)$$

where $N_c$ is the neighbor node set of node $c$, node $x$ is one of the neighbor nodes in $N_c$, $\hat{Q}_c(d, x)$ is the mixed Q value of the node selection action, and node $d$ is the destination node of the data packets. The improved method for calculating $\hat{Q}_c(d, x)$ will be given in the next section. If two nodes have identical mixed Q values, we select one of them at random.
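The routing decision described above (direct delivery when the destination is a neighbor, otherwise the neighbor with the largest mixed Q value, with random tie-breaking) can be sketched as follows. This is a minimal illustration under assumed data structures: the function name `select_next_hop` and the dictionary layout of the mixed Q values are hypothetical, and unknown entries are treated as 0, matching the initialization described in the text.

```python
import random


def select_next_hop(neighbors, destination, mixed_q, rng=random):
    """Pick the next hop for a packet addressed to `destination`.

    neighbors: iterable of neighbor node ids (the set N_c).
    mixed_q:   dict mapping (destination, neighbor) to a mixed Q value;
               missing entries count as 0.
    rng:       object with a `choice` method, used for tie-breaking.
    """
    neighbors = list(neighbors)
    if not neighbors:
        return None  # nobody in range: keep storing and carrying the packet
    if destination in neighbors:
        return destination  # direct delivery completes the transmission
    best = max(mixed_q.get((destination, x), 0.0) for x in neighbors)
    candidates = [x for x in neighbors
                  if mixed_q.get((destination, x), 0.0) == best]
    return rng.choice(candidates)  # random choice among equal mixed Q values


q_values = {("d", "a"): 0.2, ("d", "b"): 0.7}
print(select_next_hop(["a", "b"], "d", q_values))
```

Returning `None` for an empty neighbor set is one way to model the store-carry-forward behavior: the caller simply retries at the next encounter.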
As the learning task is assigned to each node, the learning process is accordingly the updating process of the Q tables. If the topology of node $c$ changes, the Q values in $Q^A_c$ will be updated. If the status of node $c$ changes, the Q values in $Q^B_c$ will be updated. In this sense, $Q^A_c$ and $Q^B_c$ represent an experience set of topology change and an experience set of status change, respectively. $Q^A_c$ and $Q^B_c$ learn from each other. The updates of $Q^A_c$ and $Q^B_c$ are given by equations (2) and (3), where $N_x$ is the neighbor node set of node $x$ and $\alpha$ is the learning rate controlling the updating speed of the Q values. $R_c(d, x)$ is the instant reward value (R value) and $\gamma_c(d, x)$ is the discount factor of the node selection action. $y^*$ and $z^*$ are the nodes with the largest Q value in $Q^B_x$ and $Q^A_x$, respectively. The improved method for calculating $R_c(d, x)$ and $\gamma_c(d, x)$ will be given in the next section.
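A per-node update in the spirit of the description above can be sketched as follows. This is an illustration under stated assumptions rather than a reproduction of equations (2) and (3): it follows the generic double Q-learning rule of [16], in which one table of the next node selects the greedy neighbor and the other table evaluates it, and it blends old and new values with the learning rate. The class `NodeTables` and the function `update` are hypothetical names.

```python
from collections import defaultdict


class NodeTables:
    """Two Q tables of one node, keyed by (destination, neighbor)."""

    def __init__(self):
        self.qa = defaultdict(float)  # unknown entries start at 0
        self.qb = defaultdict(float)


def greedy_neighbor(table, destination, neighbors):
    # Neighbor with the largest Q value for this destination.
    return max(neighbors, key=lambda y: table[(destination, y)])


def update(current, nxt, destination, x, next_neighbors,
           reward, discount, alpha, table="A"):
    """Update one table of `current` after forwarding to neighbor `x`.

    Assumption: when table A is updated, table A of the next node selects
    the greedy neighbor and table B of the next node evaluates it; the
    roles are swapped when table B is updated.
    """
    if table == "A":
        own, select, evaluate = current.qa, nxt.qa, nxt.qb
    else:
        own, select, evaluate = current.qb, nxt.qb, nxt.qa
    future = 0.0
    if next_neighbors:
        best = greedy_neighbor(select, destination, next_neighbors)
        future = evaluate[(destination, best)]
    key = (destination, x)
    own[key] = (1 - alpha) * own[key] + alpha * (reward + discount * future)
    return own[key]


c, x = NodeTables(), NodeTables()
x.qa[("d", "y")], x.qa[("d", "z")] = 1.0, 0.5
x.qb[("d", "y")], x.qb[("d", "z")] = 0.4, 2.0
print(update(c, x, "d", "x", ["y", "z"], reward=1.0, discount=0.9, alpha=0.8))
```

In the usage example, table A of the next node prefers `y`, but the value credited to the forwarding action is the one table B assigns to `y`, which is what keeps a single inflated entry from propagating.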

Mixed Q Value Based on the Congestion Degree.
The next hop node of data packets is determined according to the mixed Q value. Because network congestion has an important impact on routing, we use the congestion degree to weight the two Q values when calculating the mixed Q value. Taking node $c$ as an example, if node $c$ selects neighbor node $x$ to forward data packets, the mixed Q value $\hat{Q}_c(d, x)$ is calculated as a weighted combination of the two Q values, where node $d$ is the destination node of the data packets and $Q^A_c(d, x)$ and $Q^B_c(d, x)$ are the Q values provided by $Q^A$ and $Q^B$, respectively, indicating the Q values of the action in which node $c$ selects node $x$ to forward data packets. $\beta(x)$ is the congestion factor of node $x$, and it is calculated from the congestion degree $con_d(x)$. In particular, the smaller $con_d(x)$ is, the larger $\beta(x)$ is, so that the influence of topology change is greater. In the reverse situation, the influence of status change is greater. $con_d(x)$ is calculated from the buffer occupancy of the neighbor nodes of node $x$, where $S(y)$ is the size of all data packets currently in the buffer of neighbor node $y$ and $B_y$ is the buffer size of neighbor node $y$. In addition, $N_x$ is the neighbor node set of node $x$, and $C(x)$ is the number of neighbor nodes of node $x$.
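One possible reading of this weighting is sketched below. The functional forms are assumptions made for illustration, not the paper's equations: the congestion degree is taken as the mean buffer occupancy $S(y)/B_y$ over the $C(x)$ neighbors of $x$, the congestion factor as one minus that degree, and the mixed Q value as a convex combination in which $\beta(x)$ weights $Q^A$ (topology experience) and $1-\beta(x)$ weights $Q^B$ (status experience). All function names are hypothetical.

```python
def congestion_degree(neighbor_buffers):
    """Assumed form: mean buffer occupancy over the neighbors of x.

    neighbor_buffers: list of (used_bytes, capacity_bytes) pairs,
                      one pair per neighbor y of node x.
    """
    if not neighbor_buffers:
        return 0.0
    occupancy = [used / capacity for used, capacity in neighbor_buffers]
    return sum(occupancy) / len(occupancy)


def congestion_factor(degree):
    # Assumed form: any decreasing map into [0, 1] matches the text,
    # which only requires a smaller degree to give a larger factor.
    return 1.0 - min(max(degree, 0.0), 1.0)


def mixed_q(q_a, q_b, beta):
    # Assumed convex combination: beta weights the topology table Q^A,
    # (1 - beta) weights the status table Q^B.
    return beta * q_a + (1.0 - beta) * q_b


degree = congestion_degree([(2, 10), (6, 10)])
beta = congestion_factor(degree)
print(degree, beta, mixed_q(1.0, 0.5, beta))
```

With light congestion the factor approaches 1 and the topology table dominates; with heavy congestion the status table dominates, which mirrors the qualitative behavior stated in the text.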

Reward Value Based on the Hop Count.
An important component in the Q value updating rule (refer to equations (2) and (3)) is the calculation of the R value, which defines the instant reward value after forwarding data packets. The R value reflects the advantages and disadvantages of a single forwarding. Limited by the energy capacity of the S-IoT, the hop count is taken into account in the calculation of the reward value to control energy consumption and to reduce the overhead ratio.
Taking node $c$ as an example, if node $c$ has forwarded the data packets to neighbor node $x$, the reward value $R_c(d, x)$ for the node selection action is calculated from the weighted hop counts, where node $d$ is the destination node of the data packets, $h_1, h_2, \ldots, h_i, \ldots, h_k$ are the hop counts on different satellite orbits, and $w_1, w_2, \ldots, w_i, \ldots, w_k$ are the weights of different satellite orbits satisfying $\sum_{i=1}^{k} w_i = 1$. A higher satellite orbit implies greater energy consumption for data transmission between the ground node and the satellite node. Hence, we set a relatively higher $w_i$ value for a satellite node on a higher orbit. As a result, the reward value for forwarding data packets to a satellite on a higher orbit is lower.
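The qualitative behavior described above (weights summing to one, higher orbits weighted more heavily, and therefore rewarded less) can be sketched as follows. The reciprocal form of the reward is an assumption chosen only because it decreases with the weighted hop count; it is not the paper's equation. The weights 0.3 and 0.7 are the LEO and MEO values used in the simulation setup, and the function name is hypothetical.

```python
def hop_reward(hop_counts, weights):
    """Assumed reward: decreases as the weighted hop count grows.

    hop_counts: dict mapping orbit name to the hop count h_i on that orbit.
    weights:    dict mapping orbit name to its weight w_i (must sum to 1).
    """
    if set(hop_counts) != set(weights):
        raise ValueError("hop counts and weights must cover the same orbits")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("orbit weights must sum to 1")
    weighted_hops = sum(weights[orbit] * hop_counts[orbit] for orbit in weights)
    return 1.0 / (1.0 + weighted_hops)  # assumed reciprocal form


orbit_weights = {"LEO": 0.3, "MEO": 0.7}
via_leo = hop_reward({"LEO": 1, "MEO": 0}, orbit_weights)
via_meo = hop_reward({"LEO": 0, "MEO": 1}, orbit_weights)
print(via_leo, via_meo)
```

Under this assumed form, one hop through a LEO satellite is rewarded more than one hop through a MEO satellite, in line with the energy argument in the text.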

Discount Factor Based on the Node Status.
The discount factor is a multiplicative coefficient for the sum of subsequent reward values, which affects the possibility of reselecting a previously selected neighbor node to forward data packets. In order to adapt to the node status, the distance, direction, and buffer occupancy are considered in the calculation of the discount factor. Taking node $c$ as an example, if node $c$ has forwarded the data packets to neighbor node $x$, the discount factor $\gamma_c(d, x)$ for the node selection action is calculated from a preset value $\gamma$ and three factors, where node $d$ is the destination node of the data packets and $\gamma$ is subject to $0 < \gamma < 1$. $Dir_F(d, x)$, $Dis_F(d, x)$, and $Buf_F(d, x)$ denote the direction factor, the distance factor, and the buffer factor, respectively. The larger these factors are, the larger the discount factor is and accordingly the larger the updated Q value is. As such, the possibility of reusing this node to forward data packets next time will be larger.
The direction factor is calculated from $\theta(x, d)$, the angle between neighbor node $x$ and destination node $d$. The smaller $\theta(x, d)$ is, the larger $Dir_F(d, x)$ is. The distance factor is calculated from $D(x, d)$, the distance from node $x$ to destination node $d$, and $D_{max}$, the maximum distance between the nodes in the network. The smaller $D(x, d)$ is, the larger $Dis_F(d, x)$ is. The buffer factor is calculated from $S(x)$, the size of all data packets currently in the buffer of neighbor node $x$, and $B_x$, the buffer size of neighbor node $x$. The smaller $S(x)$ is, the larger $Buf_F(d, x)$ is.
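The monotonic relations stated above can be sketched in code. Every functional form below is an assumption chosen to satisfy those relations (a smaller angle, distance, or buffer occupancy gives a larger factor, and larger factors give a larger discount factor); the paper's own equations are not reproduced here, and all function names are hypothetical.

```python
import math


def direction_factor(theta):
    # Assumed form: 1 when x lies straight toward d (theta = 0),
    # 0 when it lies in the opposite direction (theta = pi radians).
    return (1.0 + math.cos(theta)) / 2.0


def distance_factor(distance, max_distance):
    # Assumed form: linear in the normalized distance D(x, d) / D_max.
    return 1.0 - distance / max_distance


def buffer_factor(used, capacity):
    # Assumed form: free share of the buffer of neighbor x.
    return 1.0 - used / capacity


def discount_factor(gamma, dir_f, dis_f, buf_f):
    # Assumed form: preset gamma scaled by the mean of the three factors,
    # so the result stays within [0, gamma].
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must satisfy 0 < gamma < 1")
    return gamma * (dir_f + dis_f + buf_f) / 3.0


ideal = discount_factor(0.9, direction_factor(0.0),
                        distance_factor(0.0, 100.0), buffer_factor(0.0, 50.0))
loaded = discount_factor(0.9, direction_factor(0.0),
                         distance_factor(0.0, 100.0), buffer_factor(25.0, 50.0))
print(ideal, loaded)
```

The preset value 0.9 matches the simulation setting; a neighbor whose buffer is half full receives a smaller discount factor than an otherwise identical neighbor with an empty buffer.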

Simulation Environment.
We use the ONE simulator to analyze and evaluate the proposed routing strategy. The S-IoT model used in the simulation experiments is shown in Figure 2. The ground layer is composed of 110 ground nodes, which are uniformly distributed over the Earth's surface. The LEO layer consists of 48 satellite nodes, as in the Globalstar constellation system. The MEO layer consists of 24 satellite nodes, as in the GPS constellation system. Table 1 lists the node parameters in each layer. Ground nodes generate and receive data packets, and both the source node and the destination node are randomly selected among ground nodes. Since we assume no intersatellite links in this S-IoT model, data packets cannot be forwarded between any two satellite nodes, which move periodically along their orbits. The network environment parameters used in the simulation experiments are shown in Table 2. Regarding the double Q-learning procedure, the learning rate is set to 0.8, and $\gamma$ in the discount factor is set to 0.9. The weights of the hop count on LEO and MEO satellite orbits are set to 0.3 and 0.7, respectively. The delivery rate, average delay, and overhead ratio are used to evaluate the routing strategies at different data packet generation intervals with different failure probabilities.
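The three evaluation metrics can be computed from simple counters. The sketch below uses definitions that are common in DTN evaluations and are assumed here, since the text does not spell out its formulas: delivery rate as delivered over created packets, average delay as the mean delay of delivered packets, and overhead ratio as the number of relay operations in excess of deliveries, per delivered packet. The function names are illustrative.

```python
def delivery_rate(created, delivered):
    # Share of generated packets that reached their destination.
    return delivered / created if created else 0.0


def average_delay(delays):
    # Mean creation-to-delivery time over the delivered packets.
    return sum(delays) / len(delays) if delays else 0.0


def overhead_ratio(relayed, delivered):
    # Relay operations beyond the deliveries, per delivered packet.
    if delivered == 0:
        return float("inf") if relayed else 0.0
    return (relayed - delivered) / delivered


print(delivery_rate(200, 150))      # 0.75
print(average_delay([10, 20, 30]))  # 20.0
print(overhead_ratio(450, 150))     # 2.0
```

With these definitions, a strategy that creates many copies raises the relay count and therefore the overhead ratio, even when its delivery rate is unchanged.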

Simulation Results.
We compare the proposed adaptive routing strategy based on improved double Q-learning for S-IoT (ARSIDQL) with the adaptive routing strategy based on original double Q-learning (ARSDQL), the adaptive routing strategy based on original Q-learning (ARSQL), the Spray-and-Wait routing strategy [11], MLRSP [13], and ARSA [37] in terms of delivery rate, average delay, and overhead ratio with different failure probabilities.

Delivery Rate.
Figure 3 shows the comparison of delivery rates achieved by all routing strategies at different data packet generation intervals with different failure probabilities. On the whole, MLRSP achieves the lowest delivery rate, since MLRSP calculates the encountering probability of each node and copies data packets only to the node with the largest encountering probability. However, MLRSP fails to take into account the data packet loss caused by the high buffer occupancy of nodes. The delivery rate of Spray-and-Wait is higher than that of MLRSP because it takes advantage of flooding. To be specific, the data packets are diffused into several copies to increase the probability of data packets arriving at the destination node. The delivery rates of ARSA and ARSQL are higher than that of Spray-and-Wait; since the Q-learning algorithm is self-learning and self-adaptive, ARSA and ARSQL can explore a suitable path in a highly dynamic environment. However, the encountering probabilities of nodes in S-IoT are fixed. ARSA considers the encountering probability, resulting in a lower delivery rate than ARSQL. The delivery rate of ARSDQL is higher than that of ARSQL. The reason is that ARSDQL decouples data packet forwarding from the Q value evaluation of this forwarding, and the node used for forwarding is determined by the mixed Q value without positive bias. Built upon ARSDQL, ARSIDQL incorporates the congestion degree and node status. Hence, data packets are more likely to arrive at the destination node before their TTLs expire, so ARSIDQL achieves the highest delivery rate.
With the increase of the data packet generation interval, the delivery rate of MLRSP improves significantly. At low generation intervals there are a large number of data packets in the network while the buffer size of each node is limited, which causes many data packet losses. The delivery rate of Spray-and-Wait remains unchanged, since Spray-and-Wait limits the number of data packet replicas, which reduces the buffer occupancy rate and in turn the number of data packet losses. The delivery rates of ARSA, ARSQL, and ARSDQL are relatively stable, because they can find the best action in the current state depending on the Q value through interacting with the environment. The delivery rates of ARSIDQL are relatively stable and remain high even at low generation intervals. This strategy can adapt to the buffer occupancy and forward data packets to nodes with low buffer occupancy rates to reduce the number of data packet losses and to achieve good performance.
With the increase of the failure probability, the delivery rate of MLRSP decreases. Since MLRSP forwards data packets depending on the encountering probability even if node failures have taken place, MLRSP cannot adapt to the changes of topology structure. The delivery rate of Spray-and-Wait decreases slightly. Because the data packets are diffused into some copies, the delivery rate can be guaranteed with insignificant degradation. The delivery rates of ARSA, ARSQL, ARSDQL, and ARSIDQL are relatively stable and high even with high failure probabilities owing to their abilities of self-learning. Since the Q value of forwarding data packets to a failed node would be smaller, these strategies can avoid forwarding data packets to the failed node and thus can adapt to the dynamic topology structure.

Average Delay.
Figure 4 shows the comparison of average delays of routing strategies at different data packet generation intervals with different failure probabilities. On the whole, the average delay of Spray-and-Wait is the highest, due to the fact that in this strategy each node can only move and cannot forward data packets until it encounters the destination node in the waiting phase. The average delay of MLRSP is also high, since MLRSP only takes into account the encountering probability when each node forwards data packets. However, MLRSP cannot find an appropriate path as the encountering probability cannot reflect the node status. ARSQL can learn by itself to find the next hop node with a relatively low average delay. The average delay of ARSA is lower than that of ARSQL, since ARSA considers the relative speed of nodes. The average delay of ARSDQL is low, since ARSDQL solves the overestimation problem through two Q values and can find the global optimal path to reduce the average delay. Built upon ARSDQL, ARSIDQL can adapt to the congestion degree and hop count to achieve the lowest average delay.
With the increase of the data packet generation interval, the total number of data packets in S-IoT decreases such that the waiting time in the buffer and the average delay of Spray-and-Wait can be reduced. The average delay of MLRSP is reduced to a greater extent, because the large number of data packets and copies made by MLRSP in S-IoT at low packet generation intervals leads to node congestion and long waiting times in the buffer. The average delays of ARSA, ARSQL, ARSDQL, and ARSIDQL decrease slightly with low failure probabilities, because the total number of data packets in S-IoT decreases with the increase of the data packet generation interval. In the cases of high failure probabilities, the average delays remain stable, since these strategies have already found a suitable path at low packet generation intervals. In addition, a high failure probability leads to fewer nodes in the network, so the change of generation interval no longer affects the average delay.
With the increase of the failure probability, the average delays of Spray-and-Wait and MLRSP worsen accordingly, due to the fact that these strategies cannot adjust to failed nodes in a timely fashion. The average delays of ARSA, ARSQL, ARSDQL, and ARSIDQL also degrade slightly. However, because the update of the Q value reflects the changes of topology structure, the routes chosen by these strategies can bypass failed nodes, and the degradation of average delay is not significant.

Overhead Ratio.
Figure 5 shows the comparison of overhead ratios of various routing strategies at different data packet generation intervals with different failure probabilities. The overhead ratio, which depends on the forwarding time, reflects the energy efficiency. On the whole, the overhead ratio of Spray-and-Wait is the highest. As a flood-based routing strategy, Spray-and-Wait increases the forwarding time because of the large number of copies of data packets in the network. Compared with Spray-and-Wait, MLRSP achieves a lower overhead ratio, since MLRSP copies data packets only to the node with the largest encountering probability to restrict the forwarding time. ARSQL and ARSDQL, which are not flood-based routing strategies, result in less forwarding time due to fewer data packets in the network. The overhead ratio of ARSA is lower than that of ARSDQL. The reason is that ARSA reduces the forwarding time since it considers multiple destination nodes as the same virtual destination. Built upon ARSDQL, ARSIDQL takes the hop count and node status into consideration, thus achieving the lowest overhead ratio.
With the increase of the data packet generation interval, the overhead ratios of all strategies decrease slightly. As the total number of data packets in S-IoT decreases as data packet generation interval increases, the forwarding time is reduced. As a consequence, lower energy consumption and overhead ratio are achieved.
With the increase of the failure probability, the overhead ratios of Spray-and-Wait and MLRSP increase. The reason is that, under the circumstance of node failures, Spray-and-Wait retransmits data packets in order to maintain a fixed number of copies, whereas MLRSP still forwards data packets depending on the encountering probability. The overhead ratios of ARSA, ARSQL, ARSDQL, and ARSIDQL increase slightly. These strategies are capable of bypassing failed nodes. The bypassing procedure would inevitably lead to the increase of forwarding time, energy consumption, and overhead ratio.
In summary, compared with ARSDQL, ARSIDQL can improve the delivery rate, average delay, and overhead ratio by taking into account the congestion degree, hop count, and node status in the S-IoT model. Also, compared with ARSQL and ARSA, ARSIDQL can find the best next hop node of data packets due to the decoupling of selection and evaluation. Compared with traditional routing strategies, such as the flood-based routing strategy and the utility-based routing strategy, ARSIDQL can significantly improve the delivery rate, average delay, and overhead ratio with the integration of reinforcement learning.

Conclusions
S-IoT is a new mobile Internet that provides services for social networks. The routing strategy determines the communication performance of S-IoT. Traditional routing strategies cannot cope with frequent changes of topology structure and node status and cannot meet the requirement of communication security in S-IoT. This paper proposes an adaptive routing strategy based on improved double Q-learning for S-IoT. The proposed strategy selects the next hop node of data packets relying on the mixed Q value. Moreover, in order to optimize the Q value, this paper improves the mixed Q value, the reward value, and the discount factor based on the congestion degree, the hop count, and the node status, respectively. Simulation experiments show that the proposed strategy not only can operate efficiently and securely in complex environments but also can increase the delivery rate and reduce the average delay and overhead ratio. Considering the large sizes of the two Q tables due to the increasing number of nodes in S-IoT, future work can be directed toward replacing the two Q tables with two neural networks.

Data Availability
The simulated evaluation data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.