Vehicle-Mounted Self-Organizing Network Routing Algorithm Based on Deep Reinforcement Learning

This paper studies routing in vehicle-mounted self-organizing networks that operate without the cooperation of roadside auxiliary communication units and proposes a vehicle network routing algorithm based on deep reinforcement learning. To address the large number of vehicle nodes and the multiple performance evaluation indexes in a vehicular ad hoc network, the paper proposes a time prediction model of vehicle communication that reduces the probability of communication interruption, and it develops a vehicle network routing technique based on deep reinforcement learning. This technique can quickly select routing nodes and plan the optimal route according to the required performance evaluation indicators.


Introduction
With growing public concern for the intelligence and safety of vehicles, artificial intelligence technology for driverless vehicles has emerged. Driverless technology can replace human operation of the vehicle [1][2][3], freeing the occupant's hands during travel. It also lets the car react faster when it encounters danger while driving, helping to ensure the safety of passengers [4,5]. Although major technology companies focus their driverless research mainly on automatic vehicle control, intelligent vehicle control is not the only application field of driverless technology [6][7][8][9]. Driverless vehicles can also be equipped with sensors and communication systems to cruise along a set route in a military theater. When such vehicles form a vehicular ad hoc network, the absence of drivers also avoids casualties if the enemy attacks the vehicular system [10][11][12]. A vehicle self-organizing network built from driverless vehicles can likewise be deployed in extreme environments such as deserts and disaster areas to perform tasks such as collecting environmental, animal, and plant data, emergency rescue, and disaster relief communication. The collected data are transmitted through the vehicle network to a terminal, where processing and decision-making take place, which reduces manpower and avoids risks to personnel during task execution. When unmanned vehicles are used to build communication networks in war zones or extreme environments with limited communication, the technical problems include not only automatic cruising of the unmanned vehicles but also the deployment of the vehicle network and the realization of communication routes [13][14][15].
In a vehicular ad hoc network, vehicle nodes move fast, so the link topology changes dramatically during network deployment. The connection state of some vehicle nodes keeps changing, communication links are interrupted, and an effective communication route cannot be built. Routing is therefore a major technical problem of vehicular ad hoc networks. Moreover, vehicles that carry a fixed amount of fuel in a theater or an extreme environment may run short of fuel while performing tasks and may be unable to supply energy continuously to the on-board communication nodes. When energy is insufficient, node death causes topology changes, communication link interruption, and data transmission interruption. The energy of the transmitting node must therefore be considered when performing data transmission tasks: when data is transmitted through an effective routing node, the node's energy, given its transmission rate, must last until the data is successfully transmitted [16][17][18]. To solve these routing problems, this paper proposes an adaptive routing technique for the vehicle self-organizing network based on deep neural network technology. It considers energy consumption and data transmission efficiency indicators and automatically constructs the node data transmission route to complete the transmission of task data.

Related Work
In the field of routing for vehicle ad hoc networks, many researchers have studied the transmission efficiency and performance optimization of multihop routing. Zhang et al. proposed a link duration model that can evaluate link reliability and use it as a key parameter for designing a new routing protocol. The new routing protocol can dynamically adjust the routing path through interaction with the surrounding environment [19]. Li et al. proposed an optimization model for the low-carbon vehicle routing problem under multigraph time-varying networks. Starting from the multipath attributes of the real road network, they designed a time division method suited to carbon emission calculations in time-varying networks. Taking into account the impact of driving speed changes and vehicle load on emissions, they established a low-carbon vehicle path optimization model under a multichannel time-varying network [20]. Ahmed et al. proposed a highly secure QoS-aware routing algorithm that uses an optimal trust management scheme. Multihop clusters are implemented through an improved whale optimization algorithm, and the trust value is used to complete intercluster routing. This method can quickly discover routes and reduce the packet loss rate [21]. Silva proposed a new routing protocol called adaptive, which considers routing performance indicators such as transmission rate, average delay, and average number of hops. The protocol is based on a predictable connection concept and uses the history of encountered nodes to determine the best way to route and discard data packets on the network; this method can effectively improve the transmission efficiency of routing data [22]. David and Vanathi proposed a vehicle-mounted self-organizing network clustering model that reduces data packet loss and uses clustering algorithms to cope with the frequent topology changes and high mobility of vehicle networks, manage vehicles effectively, and provide uninterrupted intervehicle communication [23]. Meng et al. designed a control strategy for official vehicles in the traffic road network. They improved the K-means algorithm so that the nodes of the official vehicle network adapt to the route, increased the weight of the backpressure strategy according to traffic pressure conditions, and improved, through parameter optimization, the adaptability of official vehicles in the traffic road network [24]. Beirigo et al. proposed dual-mode vehicle routing in hybrid autonomous and nonautonomous regional networks and introduced a new mathematical programming model to achieve coordinated route planning for autonomous and conventional vehicles [25].

Time Prediction Model for Vehicular Ad Hoc Networks
For a vehicle-mounted ad hoc network without the cooperation of roadside auxiliary communication units, the biggest problem is that a vehicle communication link may be interrupted at any time because of changes in vehicle speed, which greatly affects the communication quality of the network. Research on vehicle network routing must therefore first consider link interruption in the vehicle self-organizing network [26,27]. Based on the sensing range of the onboard sensors, the distance between end-to-end vehicle nodes, the vehicle speed, the range of speed variation, the task packet size, and the network transmission rate, this paper proposes a time prediction model with which the current vehicle node predicts a batch of candidate next-hop nodes [28,29]. The candidate nodes selected by the time prediction model satisfy one condition: when the current vehicle node selects one of them as the relay point of the next hop, the probability of link interruption is small [30,31]. The time prediction model therefore needs to consider jointly the transmission time of task packets, the sensing range, the vehicle spacing, the vehicle speed, and possible sudden speed changes. As shown in Figure 1, the model selects the candidate nodes that satisfy the time prediction function. In the schematic diagram in Figure 1, car A is the current node, and the range that car A's on-board sensor can sense is assumed to be the black circle. Car B, car C, and car D are the candidate nodes selected by car A. Although other nodes are also within the sensing range of car A, they do not satisfy the time prediction function, so they cannot be used as candidate nodes.
Assume that car A needs to select the next-hop node to forward a data packet, that the sensing radius of car A is $R$, and that the initial distance between car A and a certain vehicle (say, car B) is $L_0$ ($L_0 < R$). The communication time car A needs to transmit the complete data is $t$. The time prediction model judges whether car B can be a candidate node for the next hop. Assume that every vehicle drives at constant speed, accelerates, or decelerates, and that the maximum acceleration is $\alpha_{max}$. The condition for car B to be a candidate node is as follows: if car A runs at a constant speed equal to the minimum speed $v_0$, car B starts from the initial speed $v_0$ and accelerates at the maximum acceleration $\alpha_{max}$, and after time $t$ car B is still within the sensing radius $R$ of car A, then car B can be a candidate node for car A's next hop. To check this, the distances driven by car A and car B during time $t$ are first calculated; if the resulting distance between the two cars after $t$ seconds still does not exceed $R$, car B can be a next-hop candidate node of car A.
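The candidate test described above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the paper's exact equations: it assumes car B is ahead of car A in the same lane, so that $S_A = v_0 t$, $S_B = v_0 t + \frac{1}{2}\alpha_{max} t^2$, and the condition reduces to $L_0 + \frac{1}{2}\alpha_{max} t^2 \le R$. The function names and the sample numbers are hypothetical.

```python
def is_candidate(l0, radius, t, a_max):
    """Return True if car B can be a next-hop candidate for car A.

    Assumed worst case (not the paper's exact equations): car A holds the
    constant speed v0 while car B starts at v0 and accelerates at a_max,
    so the gap after t seconds is l0 + 0.5 * a_max * t**2 (v0 cancels out).
    """
    if not (0 <= l0 < radius):
        return False  # car B must start inside car A's sensing range
    gap_after_t = l0 + 0.5 * a_max * t ** 2
    return gap_after_t <= radius


def candidate_set(gaps, radius, t, a_max):
    """Filter a {vehicle name: initial gap in meters} mapping to candidates."""
    return [name for name, l0 in gaps.items()
            if is_candidate(l0, radius, t, a_max)]


# Hypothetical numbers: 300 m sensing radius, 2 s transfer, 3 m/s^2 max accel.
candidates = candidate_set({"B": 50.0, "C": 120.0, "E": 296.0},
                           radius=300.0, t=2.0, a_max=3.0)
```

In this example car E is inside the sensing range at the start but is rejected, because the worst-case gap grows by 6 m during the transfer and exceeds the radius.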

Utility Functions of Candidate Vehicle Routing Nodes
For the current vehicle node x1 to successfully transmit a data packet to y1 through relay nodes, as shown in Figure 2, a relay node must be chosen at each hop. Selecting the best relay node from the candidate node set of each hop yields the multihop optimal route of the vehicle network. In this paper, deep reinforcement learning is used to select a node from each candidate node set. To train the deep reinforcement learning algorithm, the reward (or penalty) that the system receives for a node selection behavior must be determined. In this section, a comprehensive utility function is used to evaluate the reward (or penalty) of the node selection behavior.
The comprehensive utility function mainly considers the energy loss and the transmission rate of data transmission between nodes. Suppose that all vehicular sensor nodes in the network communicate wirelessly. The link loss between two nodes depends on $G_l$, the antenna gain of the transmitting vehicle sensor; $G_r$, the antenna gain of the receiver; $h_l$, the antenna height of the transmitter; and $h_r$, the antenna height of the receiver.
The direct data transmission rate between the current node and the next-hop node is then computed from the transmitting power of the current node, the unit noise power $P_c$, and the interference noise $n_0$.
The comprehensive utility function of the current node and the next-hop node combines these two quantities: the higher the transmission rate and the lower the link loss between the current node and the next-hop node, the larger the comprehensive utility function.
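To make the roles of these quantities concrete, the following sketch computes a link loss, a rate, and a utility for one hop. Every formula in it is an assumption chosen for illustration, not the paper's definition: the link loss uses the common two-ray ground-reflection form $d^4 / (G_l G_r h_l^2 h_r^2)$, the rate uses a Shannon-style logarithm, and the utility is a hypothetical weighted difference that grows with rate and shrinks with loss. The names `p_t`, `bandwidth`, `w_rate`, and `w_loss` are hypothetical.

```python
import math


def link_loss(d, g_l, g_r, h_l, h_r):
    """Two-ray ground-reflection loss factor (assumed model), dimensionless."""
    return d ** 4 / (g_l * g_r * h_l ** 2 * h_r ** 2)


def transmission_rate(p_t, loss, p_c, n_0, bandwidth=1.0):
    """Shannon-style rate (assumed form); p_t is a hypothetical transmit power."""
    snr = (p_t / loss) / (p_c + n_0)
    return bandwidth * math.log2(1.0 + snr)


def utility(d, p_t, p_c, n_0, g_l=1.0, g_r=1.0, h_l=1.5, h_r=1.5,
            w_rate=1.0, w_loss=0.01):
    """Hypothetical utility: increases with rate, decreases with loss in dB."""
    loss = link_loss(d, g_l, g_r, h_l, h_r)
    rate = transmission_rate(p_t, loss, p_c, n_0)
    return w_rate * rate - w_loss * 10.0 * math.log10(loss)


# A nearer next hop should score higher than a farther one, all else equal.
u_near = utility(d=50.0, p_t=1.0, p_c=1e-2, n_0=1e-3)
u_far = utility(d=100.0, p_t=1.0, p_c=1e-2, n_0=1e-3)
```

The weighted difference is only one way to satisfy the stated monotonicity; a ratio of rate to loss would satisfy it as well.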

Vehicle Routing Based on Deep Reinforcement Learning
In the vehicle network, the system state considers only the channel state of the nodes. In Figure 2, the vehicular sensor node x1 transmits data to y1 through relay nodes, and the system switches to the next state each time a relay node is selected. We model the relay node selection scheduling problem as a Markov decision process.
In the Markov decision process, the next state $S_{t+1}$ of the system depends only on the current state $S_t$ and the action taken:

$$P_{ss'}^{a} = P(S_{t+1} = s' \mid S_t = s, A_t = a). \tag{6}$$

Let $A_{x1} = \{a_{t,1}, a_{t,2}, \cdots, a_{t,n}\}$ be the next-hop candidate node set of the starting node x1 in time slot $t$; that is, $A_{x1}$ is also the action set in time slot $t$. The action $a_{t,i}$ ($a_{t,i} \in A_{x1}$) means that the $i$-th node is selected as the relay node from the candidate node set of x1 in time slot $t$.
The current system performs node selection in state $S_t$. To successfully transmit data from the original node x1 to y1, assuming that $k$ relay nodes are needed, the profit of the future $k$ steps is obtained through the state value function. In this function, $\gamma^t$ is the discount factor at step $t$, and $RE_{t+1}$ represents the comprehensive utility function value of the current node and the selected next-hop node.
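As a small illustration of the quantity that a state value function averages, the sketch below sums the discounted utilities collected along one route of $k$ hops. It assumes the standard discounted-return form $\sum_t \gamma^t RE_{t+1}$; the function name and the sample utilities are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum gamma**t * rewards[t] over the relay selections of one route."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Three hops with hypothetical utility values and a discount factor of 0.5.
g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

With a discount factor below 1, utilities earned at later hops contribute less to the total, so early relay choices dominate the value of a route.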
To evaluate the state and behavior of the system, the value $Q^{\pi}(s, a)$ is used to represent the action value function. Combined with the state value function, this action value function is used to evaluate the system's profitability.

Here, $R_s^a$ represents the sum of rewards accumulated in all states after performing a set of actions.
To obtain better node selection behavior data in the network, iterative updating is required, and an iterative formula is used to optimize the learned action value function. In it, $\lambda$ represents the learning rate, and $\gamma$ represents the impact of future returns on current behavior.
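The iterative update can be illustrated with the standard tabular Q-learning rule, $Q(s,a) \leftarrow Q(s,a) + \lambda\,[RE + \gamma \max_{a'} Q(s',a') - Q(s,a)]$. This is a sketch of the textbook rule with $\lambda$ and $\gamma$ in the roles described above; whether it matches the paper's formula term by term is an assumption, and the node names and values are hypothetical.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, next_actions, lam, gamma):
    """Apply one standard Q-learning update in place; return the new value."""
    # Value of the best action available in the next state (0 if terminal).
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    td_target = reward + gamma * best_next
    q[(state, action)] += lam * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical example: node x1 picks relay B, and B can then reach y1.
q = defaultdict(float)
q[("B", "y1")] = 2.0
new_value = q_update(q, "x1", "B", reward=1.0, next_state="B",
                     next_actions=["y1"], lam=0.5, gamma=0.6)
```

Here the state is identified with the node currently holding the packet and the action with the chosen relay, which is one simple way to map the routing problem onto a Q table.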
The traditional Q-learning reinforcement learning method relies on past states and statistics to iterate the Q value. The state and action space that Q-learning can handle is therefore very small, and Q-learning cannot handle a state that has never appeared. For this reason, we use a deep reinforcement learning algorithm that replaces the Q table with a neural network to obtain the Q value corresponding to a state and an action. Given the state and the node selection behavior, the Q value of each node selection action is output through the convolutional layers and the fully connected layer, as shown in Figure 3.
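The idea of replacing the Q table with a trainable function can be sketched without a deep learning library. The example below swaps the convolutional network for a linear approximator $Q = w \cdot \phi(s, a)$ and performs one gradient-descent step on the squared temporal-difference error. It is a simplified stand-in under that assumption, not the network of Figure 3; the feature vector and step size are hypothetical.

```python
def q_value(weights, features):
    """Linear stand-in for the network: Q = w . phi(s, a)."""
    return sum(w * x for w, x in zip(weights, features))


def td_loss_and_step(weights, features, reward, next_q_max, gamma, lr):
    """One gradient-descent step on the squared TD error.

    The target reward + gamma * max_a' Q(s', a') is treated as a constant,
    as is common in DQN-style training.
    """
    target = reward + gamma * next_q_max
    error = target - q_value(weights, features)
    loss = error ** 2
    # d(loss)/dw_i = -2 * error * x_i, so descending adds 2 * error * x_i.
    new_weights = [w + lr * 2.0 * error * x
                   for w, x in zip(weights, features)]
    return loss, new_weights


# Hypothetical features for one (state, action) pair.
phi = [1.0, 2.0]
loss_before, w1 = td_loss_and_step([0.0, 0.0], phi, reward=1.0,
                                   next_q_max=0.0, gamma=0.6, lr=0.1)
loss_after, _ = td_loss_and_step(w1, phi, reward=1.0,
                                 next_q_max=0.0, gamma=0.6, lr=0.1)
```

Because the weights are shared across states, an update for one state also changes the estimates for states that were never visited, which is what lets this approach generalize where a table cannot.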
The optimization objective of deep reinforcement learning is to minimize the loss, and the gradient descent method is used to update the weights.
Select multi-hop routing for vehicle network based on deep reinforcement learning
Suppose the source node is $Node_{source}$, the target node is $Node_{aim}$, and there are $N$ nodes between the source node and the target node, which together form the node set.
Step 1: According to formulas (1.1) and (1.2), calculate all possible multi-hop routes from the source node $Node_{source}$ to the destination node $Node_{aim}$ to form a candidate multi-hop route set $R_N$.
Step 2: Calculate the comprehensive utility value of all candidate multi-hop routes in $R_N$ according to formula (1.5).
Step 3: The comprehensive utility value is used as a reward, and according to formulas (1.6)-(1.12), deep reinforcement learning is used to adaptively select the best multi-hop route.
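The three steps can be sketched end to end on a toy network. The sketch assumes that the candidate next-hop sets are already available as a `neighbors` mapping and that per-hop utilities are given in `hop_utility`; both structures and all values are hypothetical. For brevity, Step 3 is replaced by an exhaustive search for the route with the highest total utility, which is a stand-in for the learned selection and not the deep reinforcement learning procedure itself.

```python
def enumerate_routes(neighbors, source, target, path=None):
    """Step 1: depth-first enumeration of loop-free routes to the target."""
    path = (path or []) + [source]
    if source == target:
        return [path]
    routes = []
    for nxt in neighbors.get(source, []):
        if nxt not in path:  # never revisit a node on the same route
            routes.extend(enumerate_routes(neighbors, nxt, target, path))
    return routes


def route_utility(route, hop_utility):
    """Step 2: total utility of a route as the sum of its hop utilities."""
    return sum(hop_utility[(a, b)] for a, b in zip(route, route[1:]))


def best_route(neighbors, hop_utility, source, target):
    """Stand-in for Step 3: pick the route with the highest total utility."""
    routes = enumerate_routes(neighbors, source, target)
    return max(routes, key=lambda r: route_utility(r, hop_utility),
               default=None)


# Toy network: x1 can relay through B or C to reach y1.
neighbors = {"x1": ["B", "C"], "B": ["y1"], "C": ["y1"]}
hop_utility = {("x1", "B"): 0.4, ("B", "y1"): 0.5,
               ("x1", "C"): 0.7, ("C", "y1"): 0.1}
routes = enumerate_routes(neighbors, "x1", "y1")
chosen = best_route(neighbors, hop_utility, "x1", "y1")
```

In this example a greedy choice at x1 would pick C because of its higher first-hop utility, while the route through B has the larger total. Exhaustive enumeration grows quickly with the number of nodes, which is one motivation for learning the selection instead.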

Wireless Communications and Mobile Computing
The deep reinforcement learning method is used to update the Q value iteratively, so that the vehicle network can adaptively construct the optimal multihop routing from the source node to the destination node according to the comprehensive utility value.
The algorithm implementation process is shown in Algorithm 1:

Simulation Results
In the simulation experiment, the simulated vehicular ad hoc network scenario contains 200 vehicular nodes. The nodes use wireless communication and data relay mode. The computing power of the nodes is assumed to be sufficient, so the computing time is ignored. The noise power is set to $P_c = 1 \times 10^{-2}$ W, the discount factor is $\gamma = 0.6$, and the deep network is trained until its error falls below $Loss = 1 \times 10^{-4}$. The simulated road is a straight passage 16 meters wide, each vehicle node is 2 meters wide, the minimum vehicle speed is set to 20 km/h, and the maximum speed is 110 km/h.
To verify the performance of this algorithm, the experiment includes two comparison algorithms: one selects routing nodes by minimum transmission energy consumption, and the other selects routing nodes by minimum transmission packet loss rate. Both are tested under the same conditions as the algorithm in this paper.
To evaluate the proposed algorithm in terms of transmission energy consumption and packet loss rate, we ran 100 simulation experiments and averaged the results. In each experiment, we calculate the total transmission energy consumption and the packet loss rate when the network transmits 100 to 800 packets from a random source node to a random destination node.
Figure 4 shows the total energy consumption of system transmission, and Figure 5 shows the packet loss rate of system transmission. The more packets are transmitted, the greater the total energy consumption of the system, whereas the packet loss rate shows no obvious trend as the number of packets increases. The minimum energy routing method saves the most transmission energy but has a higher packet loss rate, because it selects routing nodes by considering only the transmission link distance between the current node and the next-hop node. Figure 5 shows that minimum packet loss routing reduces the total packet loss rate of the system, but Figure 4 shows that its total energy consumption is higher. The vehicle routing method based on deep reinforcement learning proposed in this paper selects candidate next-hop nodes according to the time prediction model, so packets may be lost when no next-hop node can be selected. Nevertheless, the experimental results in Figure 5 show that this method maintains a low packet loss rate overall, and because the comprehensive utility serves as the reward and punishment factor of deep reinforcement learning, the routing also takes transmission energy consumption into account when selecting nodes. From the results of Figure 5, this method can keep transmission loss low. In general, therefore, although the proposed algorithm achieves neither the lowest transmission energy consumption nor the lowest packet loss rate, it gives the best overall performance across transmission energy consumption and packet loss rate.
To verify the data transmission efficiency of the proposed routing algorithm, the experiment increases the number of nodes and records the time the vehicle network needs to transmit data from a random source node to a random destination node. The experimental results in Figure 6 show that, as the number of nodes increases, the time required to transmit data from the source node to the destination node increases gradually. The minimum energy routing method needs the least data transmission time, because it uses shortest-path data transmission, which delivers data to the destination node quickly without considering packet loss. The method in this paper uses the time prediction model and therefore takes the risk of link interruption into account, yet its transmission time is shorter than that of the minimum packet loss routing algorithm, which is a clear advantage.
In order to verify the performance of this algorithm in reducing the probability of communication interruption between nodes, in the experiment, we increase the total number of nodes tested and count the average probability of communication interruption between nodes in the vehicle network, as shown in Figure 7. From the figure, we can see that with the increase of vehicle nodes, the average outage probability of communication between nodes will continue to decrease, because the increase of vehicle nodes means that more nodes can act as relay nodes, reducing the possibility of link outage.It can be seen from the comparison in Figure 7 that the algorithm in this paper has a better effect on reducing the average probability of communication interruption between nodes.Because the method in this paper adopts the time prediction model, it can select the better next hop node to complete the data transmission task.

Conclusion
To reduce the probability of link interruption and to improve the energy consumption and transmission efficiency of vehicular ad hoc networks, this paper proposes a routing algorithm based on deep reinforcement learning. The algorithm includes a time prediction model for vehicular ad hoc networks, which effectively reduces the probability of communication interruption between vehicle nodes. The algorithm also uses deep reinforcement learning to select multihop routes, which reduces the transmission loss of vehicle network routing and improves transmission efficiency.

Figure 1: Schematic diagram of candidate node selection by the current node.

Figure 2: Schematic diagram of vehicle network packet transmission route.

Figure 3: Obtaining Q value through deep network.

Algorithm 1: Algorithm implementation steps.

Figure 4: Total energy consumption of system transmission (system total data transmission energy consumption in J versus number of data packets).

Figure 5: System transmission packet loss rate.

Figure 6: The total time for the source node to transmit data to the destination node.

Figure 7: Average probability of communication interruption between nodes.