Reinforcement Learning-Based Routing Algorithm in Satellite-Terrestrial Integrated Networks

The satellite-terrestrial integrated network (STIN) is an indispensable component of the Next Generation Internet (NGI) owing to its wide coverage, high flexibility, and seamless communication services. It uses the satellite network to provide communication services to users who cannot communicate directly over the terrestrial network. However, existing satellite routing algorithms ignore both the resources requested by users and the state of the satellite network. As a result, they cannot manage network resources effectively during routing, and the satellite network becomes congested prematurely. To solve this problem, we model the routing problem in the satellite network as a finite-state Markov decision process and formulate it as a combinatorial optimization problem. We then put forth a Q-learning-based routing algorithm (QLRA). By maximizing users' utility, QLRA selects the optimal paths according to the dynamic characteristics of the satellite network. Because QLRA converges slowly owing to routing loops and the ping-pong effect during routing, we propose a split-based speed-up convergence strategy and design a speed-up Q-learning-based routing algorithm, termed SQLRA. In addition, we update the Q value of each node from back to front in the learning process, which further accelerates the convergence of SQLRA. Experimental results show that SQLRA greatly enhances the performance of the satellite network in terms of throughput, delay, and bit error rate compared with other routing algorithms.


Introduction
As 5th generation communication technologies become widely used, the terrestrial network can provide high-bandwidth, low-delay communication services to users within the coverage of base stations [1]. However, in remote areas where base stations are not deployed, or where base stations have been destroyed by natural disasters, the terrestrial network usually cannot meet the communication needs of users. The infrastructure of the satellite network is rarely damaged by natural disasters, and it has wide coverage [2][3][4], so it is usually regarded as an essential complement to the terrestrial network. The satellite-terrestrial integrated network (STIN) has wide coverage and high flexibility and is compatible with the existing 5G network. Thus, it is a reliable paradigm for providing Internet services and is receiving much attention from researchers [5][6][7][8]. In particular, when users in aviation or navigation are unable to communicate through the terrestrial network, STINs can provide them with communication services.
Satellite routing is an important technique in STINs, and it has been studied extensively. Because the number of satellites was small and the structure of traditional satellite networks was simple, existing satellite routing algorithms were developed from terrestrial routing algorithms such as OSPF [9], RIP [10], and AODV [11]. However, most of these algorithms depend on the shortest path or the minimum cost, so the satellite network is prone to congestion during routing. In addition, with the expansion of satellite networks (e.g., Starlink), these routing algorithms cannot converge quickly within the limited communication time, which seriously degrades the communication performance of the satellite network and wastes the communication resources of satellites. Moreover, most work on satellite routing ignores the impact of user request resources on the performance of the satellite network. Liu et al. [12] proposed a fragment-based load balancing route scheme to control the traffic of the LEO satellite network. Qi et al. [13] improved the quality of service (QoS) of users by jointly optimizing the rate and routing in the LEO satellite network. Considering the traffic distribution density in different areas, the authors in [14] proposed a distributed routing algorithm based on traffic prediction. However, the above algorithms do not take into account the impact of the current user's routing on subsequent users' routing, which degrades network performance. In addition, given the limited power and computing resources of LEO satellites, it is not appropriate to deploy these algorithms on satellites. Therefore, it is very challenging to design an efficient satellite routing algorithm.
Recently, machine learning and deep learning [15,16] have been used extensively, and some researchers have begun to use these methods to solve network communication problems [17][18][19]. The authors in [20] regarded the satellite network topology as a series of snapshots and used a particle swarm optimization algorithm for routing in each snapshot. However, deep learning is an approximate algorithm, which is not suitable for sequential decision problems. Reinforcement learning is a method based on trial and error. In the process of learning, the agent interacts with the environment and gets a corresponding reward, and this reward guides the agent to find the best strategy. Moreover, reinforcement learning is very suitable for sequential decision problems and has achieved better results than human beings [21], and it has been applied in resource allocation [22], capacity management [23], and combinatorial optimization [24,25]. Compared with other reinforcement learning algorithms, Q-learning is a simple and efficient method with a fast convergence speed. In addition, it is very suitable for solving discrete problems. Inspired by the above references, we regard the satellite routing problem as a turn-based game and model it as a finite-state Markov decision process. Then, we put forward a Q-learning-based routing algorithm to solve satellite routing problems.
In this paper, we mainly investigate the satellite routing problem in STINs. Choosing a path from source node to destination node can be regarded as a turn-based game, which is a finite-state Markov decision process. We therefore model the routing problem as a Markov decision process and define its state space, action space, and reward function. We propose the QLRA algorithm to make full use of satellite network resources and improve the quality of service of users. In addition, to accelerate the convergence of QLRA, we propose a split-based speed-up convergence strategy and design a speed-up Q-learning-based routing algorithm (SQLRA). Moreover, we update the Q value from back to front to further improve the speed of SQLRA. The contributions of this paper are summarized as follows:
(1) We model satellite routing as a Markov decision process and define its action space, state space, and reward function. We propose a Q-learning-based satellite routing algorithm (QLRA), which selects the optimal paths according to the current states of the satellite network and the users' request resources.
(2) To address the slow convergence of QLRA, we analyse the problem and propose a split-based speed-up convergence strategy. Based on QLRA, we design a speed-up Q-learning-based routing algorithm (SQLRA). In addition, we update the Q value from back to front to further improve the convergence speed of SQLRA. Experimental results show that SQLRA converges faster than QLRA.
(3) Our proposed SQLRA algorithm makes full use of network resources while meeting the requirements of users. Numerical simulation results show that SQLRA effectively enhances network performance compared with other algorithms.
The remainder of this paper is organized as follows. In Section 2, the related work is reviewed. In Section 3, we introduce the network model and problem formulation. In Section 4, the satellite routing algorithm based on Q-learning and the speed-up Q-learning-based routing algorithm are presented. In Section 5, we evaluate the performance of SQLRA in two different scenarios and analyse and discuss the experimental results. Section 6 concludes this paper and gives future research issues.

Related Work
There is much research on satellite network routing. In order to reduce link congestion and the imbalance of load distribution, Liu et al. [26] proposed an iterative Dijkstra algorithm to optimize the satellite communication path. Traditional LEO satellite routing ignores the delay of links, which leads to an incomplete evaluation of satellite network performance. To solve this problem, a satellite routing algorithm taking delay into account was proposed in [27]. The authors in [28] proposed a routing algorithm based on cooperative game theory to solve the problem of propagation delay and traffic load imbalance in the LEO satellite network. Jiang et al. [29] designed a routing algorithm based on fuzzy theory to meet the multilevel needs of users. By leveraging orbit prediction information, Pan et al. [30] put forward a dynamic on-demand routing scheme to reduce the routing convergence time and the communication overhead. Hao et al. [31] proposed an energy-aware and load-balancing routing strategy to meet the different communication service requirements of users.

Wireless Communications and Mobile Computing
STINs are a reliable communication paradigm, and much work has been done on them. In order to improve the power utilization of satellites, the authors in [32] proposed a data offloading scheme in STINs to jointly allocate the power and resources of satellites. Zhang et al. [33] used edge computing techniques to improve the QoS of STINs. In order to reduce the cost of gateway deployment and data routing in STINs, the authors in [34] proposed a joint satellite gateway deployment and routing scheme. Xu et al. [35] proposed a hybrid routing algorithm to realize the seamless integration of STINs. The authors in [36] presented an end-to-end routing method based on a heuristic strategy to improve the QoS of STINs.
Reinforcement learning is an effective method to cope with sequential decision problems, and it has been applied in many fields. Liu et al. [37] used Q-learning to solve the content caching problem in a dynamic cloud content distribution network. To improve the efficiency of the Internet of Things, Pan et al. [38] used Q-learning to identify blocked links. A Q-learning method is proposed in [39] to improve the network performance and reduce the energy consumption of wireless sensor networks. Qiao et al. [40] proposed a joint optimization scheme of cache content placement and bandwidth resource allocation based on deep reinforcement learning in the Internet of Vehicles. Although Q-learning is widely used, there is little research on routing with Q-learning in satellite networks. In this research, we use reinforcement learning to solve the routing problem in the satellite network.

System Model and Problem Formulation
3.1. Network Model. The STINs used in this paper are shown in Figure 1. The STINs are composed of a terrestrial network and a LEO satellite network. The terrestrial network consists of base stations, routers, satellite gateways, and user terminals, and the satellite network consists of a large number of LEO satellites. The terrestrial network is able to connect with the satellite network with the aid of satellite gateways. Considering the high-speed mobility of the satellites in satellite constellation, each satellite is only connected to its neighbour satellites or satellites in its adjacent orbits. The communication link between satellites is bidirectional, and the specific structure is shown in Figure 2.
When users communicate with their peers, the system first determines whether the peers can be reached through the terrestrial network. If they can, the data are transmitted directly to the peers through the terrestrial network. Otherwise, the terrestrial network transmits the data to the LEO satellites through the satellite gateways, and the data are then forwarded to the peers through the LEO satellite network. As the number of satellites increases, existing satellite routing algorithms become unsuitable, which seriously degrades the performance of the satellite network. Therefore, we focus on the satellite routing problem in this paper.
Considering that the topology of the satellite network changes with time, and inspired by [20], we divide the whole operation time $T$ of the satellite network into $N_T$ time slices, each of duration $T_t$. We assume that the satellite network topology is fixed in each time slice, so the total time can be obtained by

$$T = N_T \cdot T_t.$$

The number of snapshots is related to the number of orbits and the number of satellites in each orbit. The time interval $T_t$ of the snapshot is related to the inclination of the orbits. The smaller the time interval, the higher the accuracy of the snapshot. However, a small time interval generates a large number of topologies, which complicates the network structure. In practice, the time interval is no more than the minimum visible time of the satellite links. We define $T_t$ as

$$T_t \le \min_{(u, v)} \tau(u, v),$$

where $\tau(u, v)$ represents the visible time between satellite $u$ and satellite $v$. Here, we set the time interval $T_t$ to 4 minutes.
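The snapshot division above can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the visibility windows in `tau` are hypothetical values chosen so that the shortest window equals the 4-minute slice used in the paper, and the helper names `slice_duration` and `time_slices` are ours, not the authors'.

```python
import math

def slice_duration(visible_times_min):
    """Longest admissible slice: no more than the shortest link-visibility window."""
    return min(visible_times_min)

def time_slices(total_min, slice_min):
    """Return (start, end) intervals that cover [0, total_min] with fixed-length slices."""
    n_slices = math.ceil(total_min / slice_min)
    return [(k * slice_min, min((k + 1) * slice_min, total_min))
            for k in range(n_slices)]

# Hypothetical visibility windows (minutes) for three inter-satellite links.
tau = {("s1", "s2"): 9.5, ("s2", "s3"): 4.0, ("s1", "s3"): 6.2}

T_t = slice_duration(tau.values())            # slice length bounded by the shortest window
slices = time_slices(total_min=60, slice_min=T_t)
print(T_t, len(slices), slices[:2])
```

Within each returned interval the topology would be treated as fixed, and routing would be recomputed at the slice boundaries.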
Here, we use an undirected graph $G = (V, E)$ to represent the satellite network topology, where $V = \{1, 2, \cdots, N\}$ is the set of satellites, $N$ is the number of satellites, and $E$ is the set of links between satellites. We assume that the network structure is a connected graph. We define $E$ as

$$E = \{\mathrm{link}(u, v) \mid u, v \in V,\ v \text{ is a neighbour of } u\},$$

where $\mathrm{link}(u, v)$ represents the link between satellite $u$ and satellite $v$. Considering that the state of a satellite link consists of many parameters, we redefine $\mathrm{link}(u, v)$ as

$$\mathrm{link}(u, v) = (\mathit{bandwidth}, \mathit{delay}, \mathit{error}, \mathit{time}),$$

where the variables bandwidth, delay, error, and time represent the available bandwidth, the propagation delay, the bit error rate, and the available time of $\mathrm{link}(u, v)$, respectively. In addition, because the duration of each time interval is short, we assume that the communication time of each user is greater than the duration of each time slice.
When transmitting data, it is necessary to find an optimal path from source satellite to destination satellite according to the current link states of the satellite network. We assume that satellite 0 is the source satellite and satellite 5 is the destination satellite in Figure 3. There are multiple paths from satellite 0 to satellite 5. However, as a large number of users access the satellite network, the resources of some links are exhausted because of load imbalance, which leads to premature congestion of satellite links. Therefore, the performance of the satellite network is seriously degraded. For example, in the beginning, the optimal path from satellite 0 to satellite 5 is 0-3-4-5 in Figure 3. As the number of users increases, link 0-3 becomes congested because its bandwidth resources are consumed, so the next user cannot choose link 0-3 in the optimal path. Therefore, the path from satellite 0 to satellite 5 changes from path 0-3-4-5 to path 0-1-4-5 and finally to path 0-1-4-3-5.
The specific process of path changing is shown in Figure 3.
In Figure 3, the dotted lines with different colours represent different selected paths. From Figure 3, we see that the optimal path from satellite 0 to satellite 5 changes gradually with the consumption of communication resources of satellite links.
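The effect described above, where the selected path shifts as earlier users consume link bandwidth, can be reproduced with a small Python sketch. The topology and capacities below are hypothetical and do not reproduce Figure 3, and the fewest-hop search is a plain baseline used only to show the effect; it is not the routing algorithm proposed in this paper.

```python
from collections import deque

def fewest_hop_path(residual, src, dst, demand):
    """Breadth-first search over links whose residual bandwidth can carry `demand`."""
    parents = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:          # walk the parent pointers back to src
                path.append(u)
                u = parents[u]
            return path[::-1]
        for v, bandwidth in sorted(residual[u].items()):
            if v not in parents and bandwidth >= demand:
                parents[v] = u
                queue.append(v)
    return None                            # no feasible path is left

def admit(residual, path, demand):
    """Consume `demand` units of bandwidth on every (bidirectional) link of the path."""
    for u, v in zip(path, path[1:]):
        residual[u][v] -= demand
        residual[v][u] -= demand

# Hypothetical capacities; link 0-3 is the bottleneck.
capacities = {(0, 3): 2, (3, 5): 10, (0, 1): 10, (1, 4): 10, (4, 5): 10}
residual = {node: {} for node in range(6)}
for (u, v), cap in capacities.items():
    residual[u][v] = cap
    residual[v][u] = cap

chosen = []
for _ in range(3):                         # three users, one bandwidth unit each
    path = fewest_hop_path(residual, 0, 5, demand=1)
    admit(residual, path, 1)
    chosen.append(path)
print(chosen)
```

The first two users take the two-hop path through satellite 3; once link 0-3 is exhausted, the third user is pushed onto the longer path through satellites 1 and 4.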

Problem Formulation.
We assume that the bandwidth capacity of link $\mathrm{link}(u, v)$ is $c(u, v)$, and the variable $u_i^{req}$ represents the bandwidth resource requested by the $i$th user. Before transmitting data, we need to find a path from source satellite $v_s$ to destination satellite $v_d$ for the $i$th user. The path is defined through an indicator function $y$, which indicates whether a link is in the selected path. If satellite link $\mathrm{link}(u, v)$ is in the selected path, then $y(\mathrm{link}(u, v)) = 1$; otherwise, $y(\mathrm{link}(u, v)) = 0$.
Here, we use functions $B(x)$, $D(x)$, $E(x)$, and $T(x)$ to represent the average bandwidth, the delay, the bit error rate, and the available time of the user in path $x$, respectively.

Variable $path_i$ represents the path of the $i$th user from source satellite to destination satellite.
Here, function $\mathrm{length}(path_i)$ is the length of the path $path_i$, and function $\mathrm{band}(\mathrm{link}(u, v))$ is the bandwidth of link $\mathrm{link}(u, v)$. Similarly, functions $\mathrm{delay}(\mathrm{link}(u, v))$, $\mathrm{error}(\mathrm{link}(u, v))$, and $\mathrm{time}(\mathrm{link}(u, v))$ are the delay, the bit error rate, and the available time of link $\mathrm{link}(u, v)$, respectively. Our goal is to maximize the utility of all users by considering the bandwidth, the delay, the bit error rate, and the available time of network links in the process of routing.
Here, $M$ is the number of users accessing the satellite network. Equation (12) ensures that the bandwidth resource requested by users is less than the total bandwidth resources of each link. Equation (13) is an indicator function which indicates whether link $\mathrm{link}(u, v)$ is in the selected path: if it is, $y(\mathrm{link}(u, v)) = 1$; otherwise, $y(\mathrm{link}(u, v)) = 0$. Equation (14) ensures that, for any intermediate node, the incoming traffic and the outgoing traffic are equal. Equation (11) is a combinatorial optimization problem, and we use reinforcement learning to solve it. In the experiment, we use the analytic hierarchy process (AHP) to judge the influence of the weight of each parameter on the performance of the satellite network [41].

A Satellite Routing Algorithm Based on Reinforcement Learning
Reinforcement learning is mainly composed of the agent and the environment. The agent interacts with the environment and learns the optimal strategy according to the feedback of the environment. The reinforcement learning framework is shown in Figure 4. In the current state $s_t$, the agent chooses an action $a_t$ according to the policy $\pi$ and executes it. The environment then returns a corresponding reward to the agent and moves from state $s_t$ to the next state $s_{t+1}$. The agent interacts with the environment continuously until the episode ends or the number of interaction steps reaches a threshold set in advance.
In STINs, the environment is the link states of the satellite network, and it is time-varying. The agent is deployed in the ground control center. In the route discovery phase, the agent chooses a valid action according to the users' request and jumps to the satellite whose index is the valid action value. The environment gives the agent a corresponding reward, and the link states of satellites change simultaneously. The agent interacts with the environment until a path from source satellite to destination satellite is selected. The routing process is modelled as a Markov decision process (MDP) and represented by $M = (S, A, P, R)$, where $S$ is the state space, $A$ is the action space, and $R$ is the reward value. Furthermore, $P$ is the state transition probability function, $P(s' \mid s, a) = P(s_{t+1} = s' \mid s_t = s, a_t = a)$. The specific details are defined as follows:
(1) State Space. The satellite link state considered in this paper includes available bandwidth, propagation delay, bit error rate, and available time. We define the state of a link as

$$link_{i,j} = (b_{i,j}, d_{i,j}, e_{i,j}, t_{i,j}), \quad i, j \in \{1, 2, \cdots, N\},$$

where $N$ is the number of satellites, and $link_{i,j}$ denotes the link state between satellite $i$ and satellite $j$. In addition, variables $b_{i,j}$, $d_{i,j}$, $e_{i,j}$, and $t_{i,j}$ represent the available bandwidth, the propagation delay, the bit error rate, and the available time of the link between satellite $i$ and satellite $j$, respectively. The variable $S_t$ represents the states of all links in the satellite network:

$$S_t = \{link_{i,j} \mid i \in V,\ j \in N(i)\},$$

where $N(i)$ represents the set of neighbours of satellite $i$. The variable $S_t$ is the environment of reinforcement learning.
(2) Action Space. In the satellite network, the action describes the process of the agent moving from one satellite to another. For example, taking action $a$, the agent moves from the current satellite to the satellite whose index is $a$. The number of satellites is $N$, so the action set is denoted by $A = \{1, 2, \cdots, N\}$.
For the convenience of calculation, the action is coded by one-hot coding in our simulations.
(3) Reward Value. The rewards are used to motivate the agent to search for the optimal strategy. The agent obtains the rewards from the states of satellite links, and different link states give different rewards. In order to avoid the impact of the different scales of these rewards on the accuracy of results, we use the min-max operation to normalize them. Function $\max(b)$ is the maximum of variable $b$, and function $\min(b)$ is the minimum of variable $b$. For the bandwidth, the normalized reward is

$$r(b) = \frac{b - \min(b)}{\max(b) - \min(b)}, \tag{18}$$

where $r(b)$ is the reward generated by the bandwidth of the selected link. Similarly, $r(d)$, $r(e)$, and $r(t)$ represent the rewards generated by the delay, the bit error rate, and the available time of the selected link, respectively. The link delay and the bit error rate count against the selection of a link. Therefore, we use monotonically decreasing functions to produce the corresponding rewards in the process of normalization. In this way, the total reward generated by the selected link is shown in Equation (22):

$$r = \theta\, r(b) + \beta\, r(d) + \lambda\, r(e) + \omega\, r(t), \tag{22}$$

where variables $\theta$, $\beta$, $\lambda$, and $\omega$ are weight coefficients, which represent the importance of each reward. Here, we use the analytic hierarchy process (AHP) to determine the value of these parameters. In addition, variable $R_t$ is the cumulative reward that the agent gets by taking action $a_t$ in current state $s_t$. We define it as

$$R_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1},$$

where $\gamma$ is the discount factor. We use the state-action value $Q(s, a)$ to represent the cumulative reward value obtained by the agent taking action $a_t$ in current state $s_t$. This value indicates the quality of each action in the current state. We define it as

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[R_t \mid s_t = s,\ a_t = a\right].$$

Choosing different strategy functions yields different state-action values. Our goal is to find the best strategy function so that the agent chooses the appropriate action in each state:

$$Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a), \qquad \pi^{*}(s) = \arg\max_{a} Q^{*}(s, a). \tag{25}$$

According to Equation (25), we maximize the state-action value $Q(s, a)$ to find the optimal strategy. The strategy with the symbol "*" is the optimal strategy. Finally, the best action in each state can be selected by searching the Q-table.
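The reward construction described above can be sketched in Python as follows. Several points are assumptions rather than facts taken from the paper: the decreasing rewards for delay and bit error rate are taken to be one minus the min-max score, the weights 0.30, 0.15, 0.18, and 0.37 reported in the experimental settings are paired with bandwidth, delay, bit error rate, and available time in that order, and the two link states are made-up values.

```python
def min_max(x, lo, hi, decreasing=False):
    """Scale x into [0, 1]; flip the scale for cost-like metrics (assumed form)."""
    if hi == lo:
        return 1.0                     # degenerate range: every link scores the same
    score = (x - lo) / (hi - lo)
    return 1.0 - score if decreasing else score

# Assumed pairing of the AHP weights with the four link parameters.
WEIGHTS = {"bandwidth": 0.30, "delay": 0.15, "error": 0.18, "time": 0.37}
COST_LIKE = {"delay", "error"}         # larger values make a link less attractive

def link_reward(link, ranges, weights=WEIGHTS):
    """Weighted sum of the normalized per-parameter rewards of one link."""
    total = 0.0
    for name, weight in weights.items():
        lo, hi = ranges[name]
        total += weight * min_max(link[name], lo, hi, decreasing=name in COST_LIKE)
    return total

# Hypothetical link states: bandwidth (Mbps), delay (ms), bit error rate, time (s).
links = {
    (0, 1): {"bandwidth": 80.0, "delay": 12.0, "error": 1e-6, "time": 240.0},
    (0, 3): {"bandwidth": 40.0, "delay": 20.0, "error": 5e-6, "time": 120.0},
}
ranges = {name: (min(l[name] for l in links.values()),
                 max(l[name] for l in links.values()))
          for name in WEIGHTS}
rewards = {edge: link_reward(state, ranges) for edge, state in links.items()}
print(rewards)
```

Because the weights sum to one, a link that is best on every parameter scores close to 1 and a link that is worst on every parameter scores 0, which keeps the rewards of different links on a common scale.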
4.1. Path Checking Algorithm. When the communication resources of satellite links are exhausted, the links become congested. If there is no path from source node to destination node in the satellite network, the routing algorithm cannot find a suitable path to the destination node. To avoid this situation, we first judge whether there is a reachable path to the destination node before looking for a path. If there is no such path, links in the satellite network are congested or disconnected. In this case, the routing algorithm stops looking for paths, which reduces the waste of computing resources. We use pseudo code to describe the details of the path checking algorithm.
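A minimal Python sketch of such a reachability check is given below. It follows the idea of the path checking algorithm (collect neighbours, skip nodes already seen, stop when the destination is found), but it is not the authors' listing: the function name `path_exists` and the adjacency-dictionary input, in which congested or broken links are assumed to have been removed already, are our own choices.

```python
def path_exists(graph, start_node, end_node):
    """Return flag = 1 if end_node is reachable from start_node, otherwise 0."""
    if start_node == end_node:
        return 1
    visited = {start_node}
    neighbours_list = list(graph.get(start_node, ()))
    while neighbours_list:
        node = neighbours_list.pop()
        if node in visited:
            continue                   # already expanded, skip it
        visited.add(node)
        if node == end_node:
            return 1                   # destination reached
        neighbours_list.extend(graph.get(node, ()))
    return 0                           # search exhausted without reaching end_node

# Hypothetical topology; satellite 6 has lost all of its links.
graph = {0: [1, 3], 1: [0, 4], 3: [0], 4: [1, 5], 5: [4], 6: []}
print(path_exists(graph, 0, 5), path_exists(graph, 0, 6))
```

Running the check before any learning starts means that an unreachable destination costs one graph traversal instead of a full set of training episodes.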
In the path checking algorithm, flag = 1 indicates that there is a path from start_node to end_node, and flag = 0 indicates that there is no path from start_node to end_node.
Because Q-learning is a model-free method, it does not need prior knowledge in the process of learning. In addition, Q-learning learns the optimal strategy by trial and error, which makes it very suitable for the dynamic satellite network. Therefore, we use Q-learning to select the optimal route. The main idea of the Q-learning algorithm is to first initialize the Q-table and then update it iteratively according to

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right], \tag{26}$$

where $\alpha$ denotes the learning rate, $\gamma$ is the discount rate, and $s'$ represents the next state. Furthermore, variable $r$ denotes the immediate reward obtained from the environment.

$$a = \begin{cases} \text{select an action randomly}, & r' < \varepsilon, \\ \arg\max_{a} Q(s, a), & \text{otherwise}. \end{cases} \tag{27}$$

In the process of training, we use the $\varepsilon$-greedy strategy shown in Equation (27) to avoid falling into a local optimum. The strategy achieves a trade-off between exploration and exploitation, where $r'$ represents the random number generated in the process of selecting the action and $\varepsilon$ represents the probability of action exploration. Furthermore, in order to speed up the convergence of Q-[...]
We first judge whether there is a feasible path. If there is a path, the optimal path is selected by means of the Q-learning routing algorithm. Then, the reward matrix and the structure of the satellite network are updated. The specific Q-learning-based routing algorithm is described as follows, where the function PCA in QLRA represents the path checking algorithm proposed above.
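The update rule of Equation (26) and the ε-greedy selection of Equation (27) can be put together in a compact Python sketch. It is a simplification, not the QLRA listing itself: the state is reduced to the index of the current satellite, the per-link rewards are fixed made-up numbers, the terminal bonus `GOAL_BONUS` and all hyperparameter values are hypothetical, and the four-node topology is only there to make the example runnable.

```python
import random

GOAL_BONUS = 10.0  # hypothetical terminal bonus, not taken from the paper

def epsilon_greedy(Q, state, actions, eps, rng):
    """Eq. (27): explore with probability eps, otherwise take the best known action."""
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(Q, s, a, r, next_actions, alpha, gamma):
    """Eq. (26) with s' = a: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max((Q[(a, a2)] for a2 in next_actions), default=0.0)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def train(graph, reward, src, dst, episodes=5000, alpha=0.5, gamma=0.9,
          eps=0.2, max_steps=50, seed=0):
    rng = random.Random(seed)
    Q = {(u, v): 0.0 for u in graph for v in graph[u]}
    for _ in range(episodes):
        s = src
        for _ in range(max_steps):       # step limit set in advance
            if s == dst:
                break
            a = epsilon_greedy(Q, s, graph[s], eps, rng)
            done = a == dst
            r = reward[(s, a)] + (GOAL_BONUS if done else 0.0)
            q_update(Q, s, a, r, [] if done else graph[a], alpha, gamma)
            s = a
    return Q

def greedy_path(Q, graph, src, dst, max_hops=10):
    """Read the route out of the Q-table by always taking the best action."""
    path = [src]
    while path[-1] != dst and len(path) <= max_hops:
        s = path[-1]
        path.append(max(graph[s], key=lambda a: Q[(s, a)]))
    return path

# Hypothetical four-satellite topology with symmetric per-link rewards in [0, 1].
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
base = {(0, 1): 0.9, (1, 3): 0.8, (0, 2): 0.2, (2, 3): 0.3}
reward = {}
for (u, v), value in base.items():
    reward[(u, v)] = reward[(v, u)] = value

Q = train(graph, reward, src=0, dst=3)
print(greedy_path(Q, graph, 0, 3))
```

During the early episodes this agent bounces between satellites 0 and 1 before exploration finds the destination, which is a small-scale instance of the ping-pong behaviour that slows convergence.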

Speed-Up Q-Learning-Based Routing Algorithm (SQLRA).
In the process of selecting the next hop, the agent may jump from the current state back to a previous state. This operation produces repeated and invalid sequences in the selected path, for example when selecting the path from source satellite node B to destination satellite node J. The specific details are shown in Figure 5. In path B->E->F->C->B->E->H->G->H->G->H->G->J, the sequences in the magenta dashed box indicate that a routing loop has occurred, and the sequences in the blue dashed box indicate that a ping-pong effect has occurred. From Figure 5, we can see that sequences E->F->C->B and G->H are repeated and invalid. These repeated and invalid sequences not only waste computing resources but also slow the convergence of the QLRA algorithm.
Although the QLRA algorithm uses the ε-greedy strategy to select effective actions in the learning process, it still generates invalid sequences. To avoid the routing loop and the ping-pong effect, we must prevent the agent from jumping from the current state to a previously visited state when it selects an effective action. To solve this problem, we propose a split-based speed-up convergence strategy. The specific split process is shown in Figure 6.
Similar to the broadcast mechanism, we split the satellite network according to the neighbour information of nodes. As shown in Figure 6, for destination node J, we regard its neighbour nodes as the first layer, and nodes with the same colour belong to the same layer. We update the Q value of all [...]
Traditional reinforcement learning updates the Q value of each node from front to back. But in the satellite network, we can obtain all the states of the satellite network in advance. Therefore, no matter what state the agent is in, we know the next state of the agent from the action it takes. In addition, according to Equation (26), updating the nodes' Q values from back to front makes the Q-table converge faster. Figure 7 illustrates how the agent updates the Q value of each node. Based on the QLRA algorithm, we propose the SQLRA algorithm with the speed-up convergence strategy. The specific pseudo code of SQLRA is as follows, where the function BFS in SQLRA represents the breadth-first search algorithm. We search from the last node end_node to get the traversal sequences of the whole network, and we use function Neighbour to get the neighbour information of each node from the satellite network structure. Figure 8 shows the convergence speed of SQLRA and QLRA. We observe from Figure 8 that SQLRA converges faster than QLRA: SQLRA needs 30 episodes to converge, whereas QLRA needs 60 episodes. The main reason is that, in the process of routing, our split-based speed-up convergence strategy reduces the invalid sequences. In addition, we update the Q value of the nodes from back to front, which further accelerates the convergence of SQLRA.

Path checking algorithm (PCA).
Input: start_node, end_node, graph.
Output: the flag which indicates whether there is a valid path from start_node to end_node.
Initialize flag = 0, path = {}.
Get the neighbours of start_node based on the network structure graph, neighbours_list.
Let path = path ∪ {start_node}.
while neighbours_list:
    Pop a node from neighbours_list, node.
    if node not in path:
        Get the neighbours of node, node_neighbours.
        path = path ∪ {node}.
        [...]

Q-learning-based routing algorithm (QLRA).
[...]
for i = 1 to Episodes:
    current_state = start_node.
    [...]
    Select the action a based on Eq. (27).
    [...]
    The agent moves to the next state s'.
    [...]
    Update the Q-[...]
[...]
There is no path from start_node to end_node.
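One plausible reading of the split-based strategy and the back-to-front update is sketched below in Python. It should be taken as an interpretation under stated assumptions, not as the SQLRA listing: layers are obtained by a breadth-first search from the destination, every sweep updates the layers nearest the destination first using the rule of Equation (26), the state is reduced to the current satellite index, and the terminal bonus, link rewards, and hyperparameter values are hypothetical.

```python
from collections import deque

def bfs_layers(graph, end_node):
    """Group nodes by hop distance from end_node (layer 0 is end_node itself)."""
    dist = {end_node: 0}
    queue = deque([end_node])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    layers = {}
    for node, d in dist.items():
        layers.setdefault(d, []).append(node)
    return [layers[d] for d in sorted(layers)]

def layered_sweeps(graph, reward, end_node, sweeps=30, alpha=0.5, gamma=0.9,
                   goal_bonus=10.0):
    """Apply the Eq. (26) update layer by layer, starting next to the destination."""
    Q = {(u, v): 0.0 for u in graph for v in graph[u]}
    layers = bfs_layers(graph, end_node)
    for _ in range(sweeps):
        for layer in layers[1:]:                 # the destination has no outgoing move
            for node in layer:
                for neighbour in graph[node]:
                    if neighbour == end_node:
                        target = reward[(node, neighbour)] + goal_bonus
                    else:
                        best_next = max(Q[(neighbour, a)] for a in graph[neighbour])
                        target = reward[(node, neighbour)] + gamma * best_next
                    Q[(node, neighbour)] += alpha * (target - Q[(node, neighbour)])
    return Q

def greedy_path(Q, graph, src, dst, max_hops=10):
    path = [src]
    while path[-1] != dst and len(path) <= max_hops:
        s = path[-1]
        path.append(max(graph[s], key=lambda a: Q[(s, a)]))
    return path

# Hypothetical four-satellite topology with symmetric per-link rewards in [0, 1].
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
base = {(0, 1): 0.9, (1, 3): 0.8, (0, 2): 0.2, (2, 3): 0.3}
reward = {}
for (u, v), value in base.items():
    reward[(u, v)] = reward[(v, u)] = value

Q = layered_sweeps(graph, reward, end_node=3)
print(bfs_layers(graph, 3), greedy_path(Q, graph, 0, 3))
```

Because the values next to the destination are refreshed before the values that depend on them, information about the destination reaches the source within a single sweep, and no step is spent on loops or ping-pong moves.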

Performance Evaluation
In this section, we verify the effectiveness of the proposed algorithm SQLRA. First, the simulation environment and related parameters are introduced. Then, we compare SQLRA with QLAODV [42], QSR [43], OSPF [30], and ACO [44] in the performance of throughput, delay, bit error, and visible time. At last, we analyse and discuss the simulation results.

Experimental Parameter Settings.
We conduct numerical simulations to verify the effectiveness of our proposed routing algorithm SQLRA. As for the satellite network used in this paper, we use satellite tool kit (STK) to simulate it.
The satellite constellation adopts the Walker delta model. The satellite network consists of eight orbital planes, and every orbit has six satellites, so there are 48 LEO satellites in total. The inclination angle of each satellite orbit is 45 degrees, and the altitude of the satellite orbit is 650 km. Each satellite is only connected to its particular neighbour satellites. Please refer to Section 3 for more details. Due to the long distance between satellites, the delay of satellite communication is mainly determined by the propagation delay of satellite links. Therefore, we mainly consider the propagation delay of satellite links in this paper. Furthermore, the propagation delay of satellite links in the same orbit is also different. For simplicity, we assume that the bit error rate of each satellite link follows a uniform distribution and that the bandwidth resources requested by each user follow a Poisson distribution. The specific parameters of the satellite constellation are shown in Table 1. In our simulations, we use PyCharm as the development tool. The environment is a Windows 10 operating system with 16 GB RAM and a 3.2 GHz CPU.

Speed-up Q-learning-based routing algorithm (SQLRA).
[...]
Step 2
for i = 1 to Episodes:
    [...]
    for neighbour in neighbours:
        [...]
        Get the corresponding reward value generated by each parameter according to Eqs. (18), (19), (20), and (21).
        [...]
[...]
There is no path from start_node to end_node.
For QLRA and SQLRA, the learning rate α affects the convergence speed of algorithms. And the discount factor represents the impact of future rewards on the current result, which can prevent the agent from falling into the local optimum. The discount factor is between 0 and 1. The higher the value is, the more critical the future reward is. Through the analysis of experimental results, when the learning rate α and discount factor γ are set to 0.001 and 0.9, respectively, the convergence effect of the algorithms is the best. At this time, the convergence results of QLRA and SQLRA are the same. The weight of each parameter in the reward value can be obtained by analytic hierarchy process. Here, we set the values of θ, β, λ, and ω in Equation (11) as 0.30, 0.15, 0.18, and 0.37, respectively.

Results Analysis and Discussion.
In this section, we evaluate the performance of the SQLRA algorithm in two different scenarios: in the first, all users communicate through the same source satellite node and destination satellite node; in the second, users communicate through different source satellite nodes and destination satellite nodes.
(1) Performance in communication scenario with same node pair The users' requests in this scenario have the same source satellite node and destination satellite node. Because the source satellite node and destination satellite node of the selected paths are fixed, here we use average throughput, average delay, average bit error rate, and average visible time to measure the performance of routing algorithms. We compare the performance of algorithms with varying numbers of users.
(A) Average throughput analysis
Figure 9 shows the average throughput of different algorithms with varying numbers of users. From Figure 9, we can see that as the number of users increases, the average throughput obtained by all algorithms increases. We can also draw the following conclusions from Figure 9. First, the QLAODV algorithm has the worst performance. Compared with AODV, QLAODV considers not only the number of hops and the delay but also the bit error rate and the available bandwidth. However, QLAODV still prefers the path with fewer hops when selecting the next hop. In a satellite network, the distance between satellites in the same orbit differs from that between satellites in different orbits, so the path with the fewest hops is not always the optimal path. Second, although both ACO and OSPF consider the same characteristics of satellite links in the process of routing, ACO performs better than OSPF. The main reason is that OSPF is based on a greedy strategy and easily falls into a local optimum, while the ACO algorithm tries to find the global optimum as far as possible by using a positive feedback mechanism. Lastly, compared with the other routing algorithms, our proposed SQLRA has the best performance. The reason is that in the process of path selection, SQLRA considers not only the states of the current satellite links but also the impact of future rewards on the currently selected links. In addition, the Q-table of SQLRA converges after a certain number of iterations. Therefore, SQLRA mostly finds the optimal solution.
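The difference between a purely greedy next-hop rule and a rule that accounts for future rewards can be seen on a toy example. The four-node topology, the link costs, and the training settings below are hypothetical and far simpler than a satellite constellation; the sketch uses plain tabular Q-learning with random exploration, not the SQLRA procedure itself.

```python
import random

# Hypothetical toy topology: node -> {neighbor: link cost}, lower cost is better.
EDGES = {"A": {"B": 1.0, "C": 2.0}, "B": {"D": 10.0}, "C": {"D": 1.0}, "D": {}}


def greedy_path(src, dst):
    """Always take the cheapest outgoing link, ignoring what follows it."""
    path, node = [src], src
    while node != dst:
        node = min(EDGES[node], key=EDGES[node].get)
        path.append(node)
    return path


def train_q(dst, episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning with random exploration; reward is the negative link cost."""
    rng = random.Random(seed)
    q = {n: {m: 0.0 for m in nbrs} for n, nbrs in EDGES.items()}
    starts = [n for n in EDGES if n != dst]
    for _ in range(episodes):
        node = rng.choice(starts)
        while node != dst:
            nxt = rng.choice(list(EDGES[node]))
            future = max(q[nxt].values(), default=0.0)
            reward = -EDGES[node][nxt]
            q[node][nxt] += alpha * (reward + gamma * future - q[node][nxt])
            node = nxt
    return q


def q_path(q, src, dst):
    """Follow the highest Q value at every node."""
    path, node = [src], src
    while node != dst:
        node = max(q[node], key=q[node].get)
        path.append(node)
    return path


def path_cost(path):
    return sum(EDGES[a][b] for a, b in zip(path, path[1:]))


greedy = greedy_path("A", "D")
learned = q_path(train_q("D"), "A", "D")
print(greedy, path_cost(greedy))
print(learned, path_cost(learned))
```

In this example the greedy rule commits to the cheap first link and is then forced onto an expensive second link, whereas the learned Q values propagate the cost of the second link back to the first decision.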

(B) Average delay analysis
We show in Figure 10 the average delay with different numbers of users. We observe that with the number of users
QLAODV prefers to select paths with fewer hops. In addition, QLAODV starts to select a new path only when the current path is disconnected or saturated by users' consumption. Therefore, QLAODV does not choose new paths frequently. This explains why the curve of QLAODV is mostly unchanged at the beginning. QSR, OSPF, ACO, and SQLRA all consider the delay of links when calculating the cost function. Therefore, they prefer to choose the path with lower delay. As the number of users increases, the current paths can no longer meet the needs of users, and these algorithms begin to select new paths. The delay of the newly selected paths is greater than that of the previously selected paths. This is why the average delay of the paths selected by QSR, OSPF, ACO, and SQLRA increases. Furthermore, we observe that SQLRA performs better than ACO. The main reason is that ACO always attains a suboptimal solution, whereas SQLRA mostly attains the global optimum.
(C) Average bit error rate analysis
Figure 11 plots how the average bit error rate changes with varying numbers of users. As the number of users increases, the average bit error rate obtained by all algorithms shows an upward trend. Because these algorithms consider the bit error rate of the satellite links when selecting a path, the paths with lower bit error rates are selected first. As link resources are consumed by the growing number of users, these algorithms begin to choose new paths with higher bit error rates. SQLRA has the best performance, ahead of ACO. Even when the number of users is largest, the average bit error rate of the SQLRA algorithm is the lowest.

(D) Average visible time analysis
We show in Figure 12 how the average visible time varies with the number of users. We observe from Figure 12 that the average visible time of QSR, OSPF, ACO, and SQLRA shows a downward trend to varying degrees as the number of users increases. Because the visible time has an important impact on the performance of the satellite network, these algorithms prefer to select paths with long visible time at the beginning. As more users access the network, the current paths can no longer meet the users' requests, and these algorithms start to select new paths with shorter visible time. This explains why the average visible time of these algorithms gradually decreases. QLAODV does not consider the visible time of links when selecting a path and therefore does not prefer links with long visible time.
(2) Performance in communication scenario with different node pairs
The users' requests in this scenario have different source satellite nodes and destination satellite nodes. Because the users' destination nodes differ, we use cumulative average throughput, cumulative average delay, cumulative average bit error rate, and cumulative average visible time as metrics, which evaluate the overall performance of the algorithms more clearly and reasonably.
(A) Cumulative average throughput analysis
Figure 13 shows the relationship between the cumulative average throughput and the number of users. We note from Figure 13 that as the number of users increases, the cumulative average throughput of all algorithms shows an upward trend to varying degrees. Moreover, SQLRA has the best performance and QLAODV has the worst. The main reason is that when selecting the next hop, SQLRA considers the impact of the next link on the current link and attains the global optimum, while QLAODV does not consider the visible time of links, which has a great impact on network performance. When selecting the next hop, OSPF considers the available bandwidth, bit error rate, delay, and visible time. However, it adopts a greedy strategy to select the next hop and thus easily falls into a local optimum. In contrast to OSPF, the ACO algorithm is initialized by a random strategy and approaches the global optimum as far as possible in the process of routing.
(B) Cumulative average delay analysis
Figure 14 presents the change of the cumulative average delay with varying numbers of users. As the number of users increases, the cumulative average delay obtained by all algorithms increases. Because users have different destinations, the paths selected by the algorithms differ. As the number of paths increases, the total delay of the selected paths in the network increases. Therefore, the cumulative delay of the paths also increases. We note that SQLRA performs better than the other algorithms. The main reason is that, provided the user's request is satisfied, SQLRA tends to choose the path with less delay and nearly attains the global optimum. We also observe that the performance of QSR, OSPF, and ACO is almost the same. The main reason is that although these algorithms all consider the delay of satellite links when computing the cost function, they use heuristic strategies to select the optimal path.
(C) Cumulative average bit error rate analysis
Figure 15 shows the cumulative average bit error rate with varying numbers of users. We find from Figure 15 that QSR performs worst and OSPF performs best. Although SQLRA does not perform best, its performance is close to that of OSPF. Moreover, in contrast to its curve in Figure 11, QLAODV does not perform worst here. The main reason is that, because users' paths differ, the QLAODV algorithm chooses a new path whenever a new user arrives. This reduces the probability of selecting a path with a high bit error rate. In addition, the QLAODV algorithm considers the bit error rate when selecting a new path.
(D) Cumulative average visible time analysis
Figure 16 shows the cumulative average visible time with different numbers of users. From Figure 16, we see that as the number of users increases, the cumulative average visible time of all algorithms shows an upward trend to varying degrees. We also find that SQLRA performs better than the other algorithms. When the number of users is 5, ACO and OSPF have the same performance. However, as the number of users increases, ACO performs better than OSPF. The main reason is that OSPF falls into a local optimum more easily than ACO. In the second scenario, the source nodes and destination nodes requested by users are different. As the number of paths increases, the total visible time of the paths also increases. Therefore, the cumulative average visible time of the selected paths increases gradually.
By testing our proposed SQLRA algorithm in two different scenarios, we find that SQLRA generally performs better than the other algorithms and that it is also robust. In practice, we train the SQLRA algorithm at the ground control center, which reduces the consumption of computing and storage resources on the satellites. Therefore, the SQLRA algorithm is well suited to dynamic satellite networks.

Conclusions
In this paper, we investigated the routing problem in STINs. We considered that the selection of satellite routes has an essential impact on the QoS of users and on satellite network performance. We modelled the routing problem as a finite-state Markov decision process and proposed a routing algorithm based on Q-learning (QLRA). In addition, to address the slow convergence of the QLRA algorithm, we proposed a split-based speed-up convergence strategy and designed a speed-up Q-learning-based routing algorithm (SQLRA). We also adopted a back-to-front update scheme to further improve the convergence speed of the SQLRA algorithm. Moreover, we evaluated the performance of SQLRA in two different scenarios. Experimental results show that the SQLRA algorithm has the best communication performance compared with other routing algorithms.
Although the SQLRA algorithm performs better than other routing algorithms in two different scenarios, it does not support online learning. When the number of users in the satellite network changes or some satellites do not work well, SQLRA needs to retrain the model. In future research, we are going to use deep neural networks to design a routing algorithm with online learning ability. When the number of users or the link state of the satellite network changes, such an algorithm would be able to update the existing model and reduce the training time as much as possible. In addition, traffic scheduling and satellite handoff management are also topics for our future research.

Data Availability
The simulation data used to support the findings of this study are available from the corresponding author upon request.