Advances in Computational Intelligence Techniques-Based Multi-Intersection Querying Theory for Efficient QoS in the Next Generation Internet of Things

An environment of physically linked, technologically networked things that can be found online is known as the "Internet of Things." With the use of various devices connected to a network that allows data transfer between them, this includes the creation of intelligent communication and computational environments, such as intelligent homes, smart transportation systems, and intelligent FinTech. A variety of learning and optimization methods form the foundation of computational intelligence; therefore, incorporating new learning techniques such as opposition-based learning, optimization strategies, and reinforcement learning is the key growing trend for the next generation of IoT applications. In this study, a collaborative control system based on multiagent reinforcement learning with intelligent sensors for variable-guidance sections at various junctions is proposed. For the next generation of Internet of Things (IoT) applications, this study provides a multi-intersection variable steering lane control approach that uses intelligent sensors to reduce traffic congestion at many junctions, since the complicated traffic flow of the multi-intersection scene cannot be accommodated by the conventional variable steering lane management approach. The priority experience replay algorithm is also included to improve the efficiency with which transition sequences in the experience replay pool are used and to speed up the algorithm's convergence for effective quality of service in upcoming IoT applications. The experimental investigation demonstrates that the multi-intersection variable steering lane with intelligent sensors is an appropriate control mechanism, successfully reducing queue length and delay time.
Its performance on waiting time and other indicators is superior to that of other control methods: it efficiently coordinates the strategy switching of variable steerable lanes and enhances the traffic capacity of the road network under multiple intersections for effective quality of service in upcoming IoT applications.


Introduction
With the continuous increase in the number of motor vehicles in China, the contradiction between the supply and demand of road traffic is intensifying. Especially in the intersection scene, the traffic flow of each turn at the intersection is unevenly distributed across different time periods, which easily leads to congestion and wasted lane resources. To solve this problem, variable steerable lane technology using intelligent sensors came into being: it treats the lane as a variable space resource and dynamically allocates it according to the needs of each turning traffic flow, on the basis of communication between the wireless sensors, so as to improve the utilization rate of road space resources. An intelligent sensor is a new type of sensor that can sense and detect information from a specific item, learn, judge, and receive signals, and that has management and communication features. The intelligent sensor is capable of autonomously calibrating, compensating, and collecting data. Its capabilities give it high precision and resolution, stability and dependability, and flexibility, and it provides a good performance-to-price ratio compared to standard sensors. Smart sensors come in three varieties: those that can judge, those that can learn, and those that can create. Intelligent velocity sensors, intelligent acceleration sensors, intelligent flow sensors, intelligent position sensors, intelligent attitude sensors, intelligent displacement sensors, and intelligent dimension sensors are a few examples of intelligent sensor types.
At a single intersection, the traditional control method for variable steerable lanes can effectively alleviate the problem of unbalanced steering, but as the number of variable steering lanes increases, the traffic flow changes between multiple intersections become more complex; the traditional method struggles to keep up and cannot coordinately control multiple lanes. Therefore, how to make the cooperation between multi-intersection variable-guidance lanes more efficient becomes a new problem. One invention in the field of road traffic signs and marking lines discloses a technique for determining the length of a changeable guide lane on a signal-controlled junction approach. The process first obtains traffic data on the variable guide lane's steering in the desired direction, then translates the received traffic volume into an equivalent standard small-vehicle traffic volume. Using a queuing-theory model, the number of vehicles queuing behind the stop line on the signal-controlled intersection approach is then determined with a multipath, multichannel queuing system combined with an approach-lane queuing imbalance coefficient. Finally, the vehicle number is converted into the length, in meters, of the variable guide lane for the signal-controlled intersection approach using the average per-vehicle queuing length. Approach traffic efficiency may be increased by using this technique for estimating the length of the variable guide lane, which allows cars upstream of the variable guide lane to enter the intersection approach smoothly.
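As a rough illustration of the lane-length estimation procedure described above, the following sketch stands in a simple M/M/1 queue for the invention's multipath, multichannel queuing model; the function name, the handling of the imbalance coefficient, and the per-vehicle queuing length parameter are illustrative assumptions, not the patented method.

```python
def variable_guide_lane_length(arrival_rate, service_rate,
                               imbalance_coeff, avg_vehicle_queue_len_m):
    """Estimate the variable guide lane length (metres):
    queued vehicles (queueing model) x imbalance coefficient
    x average per-vehicle queuing length."""
    rho = arrival_rate / service_rate
    assert rho < 1, "approach is oversaturated"
    queued = rho ** 2 / (1 - rho)   # mean vehicles waiting (M/M/1 stand-in)
    queued *= imbalance_coeff       # approach-lane queuing imbalance coefficient
    return queued * avg_vehicle_queue_len_m
```

For example, with an arrival rate half the service rate, an imbalance coefficient of 1.2, and 7 m per queued vehicle, the estimate is 0.5 x 1.2 x 7 = 4.2 m.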
The term "variable-guided vehicle route" describes a track whose travel direction is not fixed to left turns, right turns, or continuing straight ahead, but instead takes on different turning functions depending on the time of day. The overall traffic efficiency in the crossing can be greatly increased thanks to the way this track is laid out. It suits signalized crossings whose entrance driveway cannot be widened or heavily canalized and whose vehicle flow lacks uniformity, a condition that a signal control device alone cannot solve. It has been applied in Chinese cities such as Hangzhou, Wuxi, Tangshan, and Yantai. Variable lanes are now being installed at crossroads in several places in order to better use road resources and increase traffic capacity. The variable lane control method primarily relies on manual observation of traffic conditions for switching, or on fixed-time switching based on statistics of historical data. The manual switching mode is inefficient, and the statistical technique does not accurately reflect current traffic conditions. However, the rapid development of artificial intelligence and vehicle-road cooperation technologies can supply more real-time and precise data for variable lane control. It is feasible to regulate the variable lane more efficiently by making proper use of these data.
In this paper, we present a collaborative control system for variable-guidance sections at various intersections based on multiagent reinforcement learning and intelligent sensors. This study proposes a multi-intersection variable steering lane control strategy that leverages intelligent sensors to minimize traffic congestion at multiple junctions in the next generation of Internet of Things (IoT) applications. The priority experience replay algorithm is also incorporated to increase the efficiency of the transition sequence's utilization in the experience replay pool and accelerate the algorithm's convergence for effective quality of service in forthcoming IoT applications.

Related Work
The traffic flow at an intersection changes dynamically in time and space. For example, in the morning and evening rush hours, the traffic flow of each turn at the intersection shows obvious regular changes, and there is a serious imbalance in the queue length of vehicles in different guide lanes. In order to improve the traffic capacity of the intersection and relieve urban road congestion, traffic researchers have studied dynamic control methods for variable steering lanes, focusing on three aspects: traditional control methods, intelligent control methods, and reinforcement learning methods. Supervised learning is the machine learning approach of constructing a function by learning from a number of related samples; it is the method of learning a broad notion from a small number of comparable examples. Reinforcement learning, by contrast, is a branch of machine learning that builds on ideas from behavioral psychology and focuses on interacting with the environment directly; it is an important part of the field of artificial intelligence. Regression and classification are the two major tasks in supervised learning, whereas exploitation versus exploration, Markov decision processes, policy learning, deep learning, and value learning are among the diverse tasks in reinforcement learning. Basic reinforcement is specified in the Markov decision process model in reinforcement learning, whereas supervised learning examines the training data and generates a generalized formula. In supervised learning, each example is a pair of an input object and an output with the desired value, whereas in reinforcement learning the agent interacts with the environment in discrete steps, making an observation at each time step t, receiving a reward for each observation, and then attempting to accumulate as much reward as possible.
The queuing length of the lanes can be measured with standard queueing formulas. For the M/M/1 queue, the mean number of customers in the queue is L_q = ρ²/(1 − ρ), and for the M/G/1 queue the Pollaczek-Khinchine formula gives L_q = (ρ² + λ²σ_s²)/(2(1 − ρ)), where L_q is the mean number of customers in the queue, ρ = λ/(cµ) is the server utilization (the likelihood that the server is busy, i.e., the percentage of time it is busy), λ = 1/E[Interarrival-Time] is the mean arrival rate (E denotes the expectation operator), µ is the mean service rate, c is the number of servers, and σ_s² is the variance of the service time.
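The two queue-length formulas above can be checked numerically; the sketch below is a direct transcription for a single server (c = 1), with illustrative function names. Note that plugging the exponential service-time variance σ_s² = 1/µ² into the M/G/1 form recovers the M/M/1 result.

```python
def mm1_queue_length(lam, mu):
    """Mean queue length for an M/M/1 queue: L_q = rho^2 / (1 - rho)."""
    rho = lam / mu
    assert rho < 1, "queue is unstable when rho >= 1"
    return rho ** 2 / (1 - rho)

def mg1_queue_length(lam, mu, sigma_s2):
    """Pollaczek-Khinchine mean queue length for an M/G/1 queue:
    L_q = (rho^2 + lam^2 * sigma_s^2) / (2 * (1 - rho))."""
    rho = lam / mu
    assert rho < 1, "queue is unstable when rho >= 1"
    return (rho ** 2 + lam ** 2 * sigma_s2) / (2 * (1 - rho))
```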

Traditional Control Method.
Traditional variable lane control research uses methods based on experience or historical data to set the control plan in advance and design the rules for the steering of the variable steerable lane at the intersection. Literature [1] proposed a signal scheme for lane optimization at the intersection based on experience with phase-integrated design. Literature [2, 3] proposed a steering control model for a variable-steering lane at an intersection based on empirical rules for the setting conditions of the variable-steering lane. Literature [4] more closely matched the change characteristics of traffic flow and the actual traffic steering demand, comprehensively considered the real-time traffic factors at a single intersection, and evaluated the pre-implementation plan.
The specific lane function and signal timing switching scheme are deployed, but the preplan must be repeatedly tested and its accuracy is not high. Literature [5, 6] carried out integer nonlinear programming according to multiple road constraints associated with the target intersection; the model optimization achieves the smallest critical flow ratio after optimizing a single intersection. Literature [7, 8] integrated the road conditions of key intersections and downstream adjacent intersections to build an associated control model. The above-mentioned work only considers the impact of a key intersection on a single adjacent intersection and does not design a comprehensive optimization scheme for associated intersections. Literature [9] proposed a control method that coordinates the design of multiple intersection variable signs and corresponding signal groups based on collected data rules to better reduce the average delay of vehicles.
The abovementioned methods of presetting the steering rules for variable steering lanes through experience or summarized historical data rules can adapt to regular traffic state changes to a certain extent, but they struggle to adapt dynamically to road traffic conditions and to sudden, abnormal changes in traffic supply and demand.

Intelligent Control Method.
Research on intelligent control methods for variable steering lanes makes intelligent decisions based on various traffic flow data collected in real time and improves adaptability to real-time traffic flow changes at intersections. Some works use real-time traffic flow data collected on roads, such as the space of each turning lane, occupancy rate [10], and the traffic flow, speed, queue length, and other characteristics obtained through video detection [11], to make dynamic decisions about the variable steering lane switching strategy, but their adaptability to subsequent traffic flow changes is poor when relying only on real-time collected data. Literature [12] predicted each turning traffic flow as the basis for judging lane direction switching and minimized the average delay time at the intersection. Literature [13] combined a short-term traffic flow prediction model, built around least-squares dynamic weighting and a fusion algorithm, with a traffic state prediction model built around fuzzy data theory and neural network system theory, to realize automatic control of variable steering lane steering. Literature [14] constructed a mixed-integer, two-layer programming model, solved by a particle swarm algorithm, to minimize the total travel time of variable lanes based on the prediction model.
The above-related research work has two limitations: (1) it is mainly applied to intelligent control decision-making of variable-steering lane steering at a single intersection, and (2) the prediction-based intelligent algorithms rely mainly on historical and real-time data and cannot quickly update rules to adapt to the dynamics of traffic flow variation.

Reinforcement Learning Methods.
In recent years, reinforcement learning technology has developed rapidly. It has low requirements for prior knowledge of the environment and can achieve good learning and optimization performance in complex nonlinear systems. Therefore, it is suitable for complex and changeable multi-intersection, variable-guided lane intelligent control scenarios. In the multi-intersection collaborative control problem, traffic signal optimization research has widely used the reinforcement learning method. Literature [15] combined deep reinforcement learning with the traffic signal control problem, defining the state, action space, and reward function and using the DQN (Deep Q-Network) model; extensive experiments on synthetic and real datasets demonstrate the superiority of reinforcement learning methods. Literature [16] used multiagent reinforcement learning technology to define the joint Q-value as the weighted sum of local Q-values, minimizing the gap between the weighted sum of individual Q-values and the global Q-value to ensure that a single agent can take into account the learning process of other agents and realize automatic control of large-scale traffic signals. Literature [17] proposed that different agents exchange strategies after each round of learning to achieve a zero-sum game, realized the signal control strategy of autonomous vehicles on this basis, and designed a rewarding method that combines individual efficiency with overall efficiency. In the multi-intersection collaborative control scenario, traffic signal control optimizes the traffic situation in the time dimension, while the intelligent variable guide lane acts in the spatial dimension. Both directions are suitable for global optimization research using reinforcement learning methods.

Variable Steering Lane Cooperative Control Method Employing Intelligent Sensors
3.1. Overall Structure. This research proposes a multi-intersection variable-direction lane cooperative control algorithm using intelligent sensors based on multiagent reinforcement learning. The method mainly includes a multiagent reinforcement learning model merged with intelligent sensors, a global reward decomposition algorithm, and a priority experience playback algorithm. The proposed multiagent reinforcement learning model is constructed on the value function decomposition of the QMIX algorithm [18]. The QMIX algorithm adopts the strategy of centralized training and distributed execution and uses the global reward function to optimize the joint action-value function during training with the aid of intelligent sensors. This achieves the effect of multiagent cooperative control: each agent constructs and extracts a corresponding local strategy from the joint action-value function, which not only deals with environmental nonstationarity through centralized training but also, through backpropagation of the joint action-value function, learns the local "best" policy for each agent, enabling decentralized multiagent execution. By providing a more flexible version of the constraint, QMIX enhances the VDN algorithm. The constraint is described as follows: ∂Q_tot/∂Q_a ≥ 0 for every agent a, where Q_tot denotes the total value function and Q_a denotes the value function of each agent. An intuitive explanation is that the weight of each individual value function Q_a in Q_tot should be positive: if the weights of the individual value functions Q_a were negative, the agents would be discouraged from cooperating, since a greater Q_a would lower the joint value Q_tot. For consistency between the centralized and decentralized policies, QMIX comprises agent networks that represent each Q_a and a mixing network that combines them into Q_tot rather than simply adding them up as in VDN. It imposes the constraint by requiring the mixing network to have positive weights.
Since QMIX's factored representation scales well with the number of agents, it can represent complicated centralized action-value functions and makes it simple to extract decentralized policies via individual argmax operations in linear time.
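The monotonicity constraint can be illustrated with a minimal, framework-free mixing sketch: per-agent Q-values pass through a small network whose weights are forced non-negative via an absolute value, so increasing any agent's Q_a can never decrease Q_tot. This toy uses fixed weights and a ReLU in place of QMIX's hypernetwork-generated weights and ELU activations; it is not the paper's implementation.

```python
def monotonic_mix(agent_qs, raw_w1, b1, raw_w2, b2):
    """Combine per-agent Q-values into Q_tot through one hidden layer whose
    weights are made non-negative via abs(), enforcing dQ_tot/dQ_a >= 0."""
    n_agents = len(agent_qs)
    hidden_dim = len(b1)
    hidden = []
    for h in range(hidden_dim):
        s = b1[h]
        for a in range(n_agents):
            s += abs(raw_w1[h][a]) * agent_qs[a]  # non-negative mixing weights
        hidden.append(max(s, 0.0))                # monotone activation (ReLU)
    q_tot = b2
    for h in range(hidden_dim):
        q_tot += abs(raw_w2[h]) * hidden[h]       # non-negative output weights
    return q_tot
```

Because every weight is non-negative and the activation is monotone, raising any single agent's Q-value can only raise (or leave unchanged) Q_tot, which is exactly what makes per-agent argmax consistent with the joint argmax.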
The global reward decomposition algorithm improves the global reward distribution method in the value function decomposition algorithm and imposes constraints between the global value function and the value function of a single agent. In some complex scenarios, the globally optimal joint action may require an intelligent sensor to take actions that sacrifice individual interests. Decomposition techniques change a difficult problem into a simpler one in which only binary constraints, whose scopes form a directed acyclic graph, are present. Each variable in the new problem represents a set of variables from the original problem; these sets cover the initial variables even though they are not necessarily disjoint. For each set of variables, the translation enumerates all partial solutions, and the translated problem reflects the interplay between local solutions. A decomposition approach by definition creates a binary acyclic problem, and such problems may be solved in time polynomial in their size. In response, this study decomposes the global reward into two parts: one part is the basic reward, whose distribution to the different agents is realized through the QMIX hybrid network; the other part is the performance reward, assigned hierarchically to each agent (the IoT devices) according to the agent's state, so that a single agent can maximize the global reward while taking into account its own reward, realizing a secondary distribution of the global reward. In RL, the agent receives a reward that is often a sum of many reward components, each designed to encode some aspect of the desired agent behavior, and from this composite reward it learns a single composite value function. Using value decomposition, an agent instead learns a component value function for each reward component.
To perform policy optimization, the composite value function is recovered by taking a weighted sum of the component value functions; prior work has proposed such value decomposition methods for discrete-action Q-learning. The development of autonomous agents is frequently done via reinforcement learning. In the RL framework, the agent is permitted to act in the environment and is rewarded numerically at each step rather than being explicitly programmed. The goal of RL algorithms is to discover a strategy that maximizes the overall expected reward (or some related criterion); the reward function therefore implicitly describes optimal behavior. In order to assess actions in terms of trade-offs among the reward categories, the technique decomposes rewards into sums of semantically relevant categories. To concisely describe why one action is preferable over another in terms of these categories, we particularly propose the idea of minimal adequate explanations.
In the priority experience replay algorithm, in view of the uneven quality of randomly sampled experience, which results in low training efficiency and slow algorithm convergence, the joint value function in the value function decomposition algorithm is used to calculate the error, which is combined with the number of extractions to calculate the priority of samples and speed up convergence.

Multiagent Sensors Reinforcement Learning Models.
The multiagent sensor reinforcement learning model based on value function decomposition is shown in Figure 1. Based on the value-decomposition networks (VDN) algorithm [19], the original linear mapping is replaced by a nonlinear mapping, and a hypernetwork is introduced to add additional global state information to the mapping process, improving the performance of the algorithm [20]. Using the current observation state of each agent (the intelligent sensors), which performs the action of the previous time step, as input, a global action-value function is learned through the hybrid network by these intelligent agents, where τ is the global state and α is the global action. In the multi-intersection variable steering lane scenario, the elements involved, such as the state space, action space, and reward function, are defined. To make the input state more realistic and richer, the queue length of the lanes in each direction, the average waiting time of vehicles, and the average delay-time ratio are adopted as indicators. In addition, in order to accurately describe the position distribution of vehicles, the variable steering lane area is discretized and encoded to obtain the vehicle mapping matrix, as shown in Figure 2. The lanes are divided into grid cells of the same size, and the grid covers the entire road section. Each cell represents the presence of a vehicle: a value of 1 means that a vehicle exists at that cell, and 0 means it does not. Compared with using the intersection image directly as input, this method compresses the data dimension and removes redundant information, thereby speeding up training.
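The discretized vehicle mapping matrix of Figure 2 can be sketched as a simple binary occupancy grid; the cell size, coordinate convention, and function name below are illustrative assumptions.

```python
def vehicle_position_matrix(vehicle_positions, section_length,
                            lane_count, cell_length):
    """Discretize a road section into equal-size cells per lane and mark a
    cell 1 if a vehicle's position falls inside it, else 0."""
    n_cells = int(section_length // cell_length)
    matrix = [[0] * n_cells for _ in range(lane_count)]
    for lane, pos in vehicle_positions:  # (lane index, distance from stop line)
        cell = min(int(pos // cell_length), n_cells - 1)
        matrix[lane][cell] = 1
    return matrix
```

A 20 m section with 5 m cells and two lanes thus becomes a 2x4 binary matrix, far smaller than a raw camera image of the same intersection.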
In the multi-intersection scenario, the state space is defined over the indicators above (queue length, average waiting time, and delay-time ratio) together with the vehicle position matrix. In the variable steering lane scenario, the straight-left variable steering lane is mainly studied and applied, and the right-turn direction is not considered, so the action space is {left turn, straight}.
Te global reward function is defned as the weighted sum of the following metrics.
(1) The average queue length L of vehicles in all lanes
(2) The average delay time ratio D of vehicles in all lanes

The single-lane delay time ratio D_i is D_i = 1 − v_lane/v_max, where v_lane is the average speed of vehicles on lane i and v_max is the maximum speed limit of road i, with v_lane = (1/n) Σ_j v_j, where v_j is the average speed of each vehicle.
Corresponding weights are assigned to the above-mentioned traffic indicators, and the global reward is finally calculated as their weighted sum. In the formula, k1, k2, k3, k4, and k5 are the weight parameters, and the effect of the final traffic condition optimization is tuned by analyzing the global reward results over 3200 experiments.
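Under the assumption that the global reward is a negatively signed weighted sum of the listed indicators, a minimal sketch follows; only the two metrics defined above are reproduced, and the weights are placeholders rather than the tuned k1...k5.

```python
def delay_time_ratio(vehicle_speeds, v_max):
    """Single-lane delay-time ratio D_i = 1 - v_lane / v_max, where v_lane is
    the mean of the per-vehicle average speeds on the lane."""
    v_lane = sum(vehicle_speeds) / len(vehicle_speeds)
    return 1.0 - v_lane / v_max

def global_reward(queue_lengths, delay_ratios, weights=(0.5, 0.5)):
    """Assumed form: negated weighted sum of the network-wide average queue
    length L and average delay-time ratio D (placeholder weights)."""
    L = sum(queue_lengths) / len(queue_lengths)
    D = sum(delay_ratios) / len(delay_ratios)
    k1, k2 = weights
    return -(k1 * L + k2 * D)  # shorter queues and lower delay -> higher reward
```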

Global Reward Decomposition Algorithm.
The global reward R is decomposed into two parts, the basic reward R_b and the performance reward R_p, based on a proportion; the structure of the global reward decomposition algorithm is shown in Figure 3. The performance reward is an additional reward that is distributed to the agents with larger contributions.
The global reward decomposition function is given as R = R_b + R_p.
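A proportional split of the global reward, as described above, can be sketched as follows; the split ratio alpha is an assumed parameter, not a value reported in the paper.

```python
def decompose_global_reward(R, alpha=0.8):
    """Split global reward R into a basic reward R_b (distributed through the
    QMIX mixing network) and a performance reward R_p (allocated to the agents
    that contribute most). alpha is the assumed basic-reward proportion."""
    R_b = alpha * R
    R_p = (1.0 - alpha) * R
    return R_b, R_p
```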

Computational Intelligence and Neuroscience
The traditional hybrid network method is used to distribute the basic rewards to each agent (the intelligent sensors). The performance reward is used to motivate the agents that contribute more in the process of regional cooperative control; the expression for the performance reward obtained by each agent at the current time is given below. Real-world situations like strategic conflict resolution, coordination between autonomous vehicles, and agent cooperation in defensive escort squads all involve cooperative multiagent challenges. Such problems may be modeled as dual-interest situations, in which each agent simultaneously seeks to maximize its individual payoff (local reward) and the team's performance as a whole (global reward). Two different families of modern, state-of-the-art MARL algorithms exist: algorithms like MADDPG and M3DDPG concentrate on optimizing local rewards without any explicit notion of coordination, while algorithms like COMA and QMIX strive to maximize the global reward for the success of the group. We first define multiagent cooperation as a joint optimization over reward assignment and demonstrate that each agent has an approximately optimal strategy that decomposes into two parts: one that depends only on the agent's own state and another that is connected to the states of adjacent agents. CollaQ decomposes each agent's Q-function into a self term and an interactive term, using a multiagent reward attribution (MARA) loss to regularize training. CollaQ was tested on multiple StarCraft maps and surpasses existing state-of-the-art approaches (such as QMIX, QTRAN, and VDN), increasing the win rate by 40% while using the same number of samples.
In the formula, R_p^i is the performance reward obtained by the i-th agent; L′ is the ratio of the average queuing length of the lane group in a certain direction of the agent to the overall length of the lane, where L is the average queuing length of the lane group during the execution of the previous decision; L_S′ is this ratio for the straight lane group and L_L′ the ratio for the left-turn lane group; L_T is the threshold of the determination level of the lane; V_out is the maximum traffic flow that can be discharged during the green light of the lane; and V_max is the maximum traffic flow, i.e., the capacity of the lane [21].
All agents are graded when the performance reward R_p is allocated, and the performance reward corresponding to each level is different [22, 23]. As shown in formula (8), when the queuing length ratios L_L′ of the left-turn lane group and L_S′ of the straight lane group are both less than the threshold L_T, the traffic flow of the road section is determined to be small, and the performance reward is allocated in proportion to L_L′ and L_S′: the larger the ratio relative to the average, the more performance reward should be allocated [24].

Priority Experience Replay Algorithm.
Deep learning uses target networks to increase training stability. The main training network and the target network are the two networks that the DQN method trains. The loss the algorithm trains on is the squared difference between the two networks' outputs (often replaced with the Huber loss nowadays). The weights of the primary training network periodically replace the target network's weights as training progresses. For each data sample, the target network predicts the optimal Q value over all actions that may be taken from the next state; this is the target Q value. To train the Q network, the loss is computed using the predicted Q value, the target Q value, and the observed reward from the data sample. In single-agent reinforcement learning, to solve the problem of the uneven quality of training samples extracted during training, a priority experience playback algorithm [25] was proposed, using the temporal-difference (TD) error to measure the importance of samples: samples with larger errors are set to high priority, and high-priority samples are extracted for training to improve learning efficiency. In multiagent reinforcement learning based on the value function decomposition algorithm, where the agents are the intelligent sensors, the joint value function can be used to calculate the TD error, which is then used to calculate the priority. To realize the priority experience playback algorithm, the target network loss L_n must be calculated: L_n = [R + γ max_{A′} Q_tot(S′, A′) − Q_tot(S, A)]². In the formula, S and A are the joint state and joint action of the multiagent system, γ is the attenuation (discount) coefficient, and S′ and A′ are the joint state and joint action at the next moment. The larger the value of L_n, the higher the priority of the corresponding experience sequence.
Using L_n as the only indicator of sample importance may cause some samples to be drawn less frequently simply because their loss is small. Therefore, this study combines the target network loss with the number of times a sample has been drawn, N_sam, as the indicator of sample importance. At the same time, considering that the target network loss values of different experiences differ greatly, the loss is converted into a dimensionless rank f_Top(L_n), its position in increasing order. The final priority is PR_i = f_Top(L_n) + f_Bot(N_sam^i) + ε, where f_Bot(N_sam^i) is the position of the number of extractions in descending order and ε ∈ (0, 1.0) is a probability offset used to correct cases where the priority would otherwise be so small that the probability of sample selection becomes too low.
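A minimal sketch of the priority computation follows. Combining the two ranks by summation, and the exact rank conventions, are assumptions where the text leaves the composition implicit; the TD loss is the standard squared temporal-difference error of the joint value function.

```python
def td_loss(q_tot, reward, q_tot_next_max, gamma=0.95):
    """Squared TD error of the joint value function (target-network loss L_n)."""
    return (reward + gamma * q_tot_next_max - q_tot) ** 2

def sample_priorities(losses, draw_counts, eps=0.1):
    """Rank-based priority PR_i = f_Top(L_n) + f_Bot(N_sam) + eps.
    f_Top ranks losses in increasing order (larger loss -> larger rank);
    f_Bot ranks draw counts in decreasing order (fewer draws -> larger rank);
    eps offsets near-zero priorities so no sample is starved."""
    n = len(losses)
    order_by_loss = sorted(range(n), key=lambda i: losses[i])
    loss_rank = {idx: r for r, idx in enumerate(order_by_loss, start=1)}
    order_by_draws = sorted(range(n), key=lambda i: -draw_counts[i])
    draw_rank = {idx: r for r, idx in enumerate(order_by_draws, start=1)}
    return [loss_rank[i] + draw_rank[i] + eps for i in range(n)]
```

With this scheme, the sample with the largest loss and the fewest previous draws receives the highest priority, which is the behavior the text describes.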

Experimental Results and Analysis
In order to verify the effectiveness of the cooperative control algorithm that employs intelligent sensors to alleviate traffic congestion at multiple intersections for efficient quality of service in next-generation IoT applications, experiments using reinforcement learning are conducted. In the multi-intersection variable steerable lane scenario, the cooperative control BASE algorithm is compared with fixed-time control (FT), a traditional adaptive control algorithm for multi-intersections (MTAC), a single-agent reinforcement learning adaptive algorithm (DQN), a multiagent reinforcement learning adaptive algorithm (QMIX), and other methods, and the performance of each algorithm on the dataset is analyzed, including the algorithm-level reward value and the traffic-level metrics of average queue length, average delay-to-time ratio, average waiting time, and average travel time. In cooperative multiagent systems, agents work together to complete tasks in exchange for a group reward as opposed to individual benefits. Credit assignment techniques are often used to differentiate the contributions of various agents in the absence of individual reward signals in order to promote successful cooperation. As credit assignment has recently been widely implemented using the value decomposition paradigm, QMIX has emerged as the leading technology. Robot swarms, autonomous vehicles, sensor networks, and cooperative multiagent reinforcement learning are just a few areas where this technique has found widespread use. In these tasks, each agent must learn a decentralized policy through a shared team reward signal because individual incentives are not available. In order to achieve successful collaboration, agents must allocate credit in a discriminating manner. Credit assignment techniques using cooperative multiagent reinforcement learning have made significant strides in recent years; among them, value-based approaches have demonstrated cutting-edge performance on difficult challenges.
Value factorization, which is based on the centralized training with decentralized execution (CTDE) paradigm, has recently gained much popularity. Specifically, it integrates the separate value functions Q_i during centralized training to factorize the joint value function Q_tot. Decentralized policies are then easily obtained during execution by greedily picking individual actions from the local value function Q_i. Because the Q_i are learned by minimizing the overall temporal-difference error on a single global reward signal, an implicit multiagent credit assignment is accomplished.
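The key property of this factorization can be illustrated with a toy sketch (illustrative sizes and random values, not the paper's networks): when Q_tot is a monotonic, non-negative-weighted mix of the local Q_i, each agent greedily maximizing its own Q_i jointly recovers the action that maximizes Q_tot.

```python
from itertools import product

import numpy as np

# Toy illustration of CTDE value factorization: each agent i keeps a
# local value function Q_i(s, .); centralized training mixes them into
# Q_tot with non-negative weights, so maximization decomposes.
rng = np.random.default_rng(0)
n_agents, n_actions = 3, 4
Q = rng.standard_normal((n_agents, n_actions))  # local Q_i(s, .) for one state
w = np.abs(rng.standard_normal(n_agents))       # non-negative mixing weights

def q_tot(joint_action):
    """Monotonic mixing of local values for one joint action."""
    return sum(w[i] * Q[i, a] for i, a in enumerate(joint_action))

# Decentralized execution: each agent greedily maximizes its own Q_i.
greedy = tuple(int(np.argmax(Q[i])) for i in range(n_agents))

# Monotonicity guarantees the decentralized greedy joint action also
# maximizes Q_tot over all joint actions.
best = max(product(range(n_actions), repeat=n_agents), key=q_tot)
assert best == greedy
```

This is exactly why decentralized policies come "for free" after centralized training: no joint search over the exponentially large joint-action space is needed at execution time.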

Experimental Setup.
The experimental equipment configuration of this research is as follows: the CPU is an AMD 2.10 GHz processor, the memory is 16 GB, and the operating system is Windows 10 (64-bit). Simulation experiments are carried out on the microscopic traffic simulation platform SUMO v1.7.0, whose interface is used to interact with the simulation environment, obtain the traffic status in real time, and adaptively adjust the control strategy of the variable steering lanes. As shown in Figure 4, the experimental environment includes 4 intersections and a total of 24 road sections. The sections numbered 1~9 carry the 9 variable steering lanes; each such section consists of 5 lanes: a fixed left-turn lane, a variable left-turn/straight lane, two straight lanes, and a right-turn lane. Sections 10~24 are conventional road sections, 15 in total, each with a conventional fixed three-lane configuration. The signal cycle of each intersection is the same.
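The observe-decide-act loop described above can be sketched as follows. Since a live SUMO/TraCI session is outside the scope of this section, a hypothetical stub simulator stands in for the real-time traffic-status queries, and the decision rule is a deliberately simplified placeholder for the learned policy:

```python
import random

class StubSimulator:
    """Stand-in for the SUMO simulation interface: reports per-approach
    queue lengths (vehicles) for the section with the variable lane."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def queue_lengths(self):
        # (left-turn queue, straight queue) observed this signal cycle
        return self.rng.randint(0, 20), self.rng.randint(0, 20)

def choose_lane_function(left_q, straight_q):
    """Placeholder adaptive rule: assign the shared variable lane
    to whichever movement currently has the heavier queue."""
    return "left" if left_q > straight_q else "straight"

sim = StubSimulator()
history = []
for cycle in range(5):              # one decision per signal cycle
    left_q, straight_q = sim.queue_lengths()
    history.append(choose_lane_function(left_q, straight_q))

assert all(h in ("left", "straight") for h in history)
```

In the actual experiments this loop is driven by the reinforcement learning agents rather than a threshold rule, and the observations come from the intelligent sensors through the simulator interface.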
The experimental data set is collected from traffic capture data of different streets and roads, including upstream and downstream road section codes, capture time, lane number, and license plate number.
In this experiment, traffic flow data were collected at intersections in the urban area. The vehicle types mainly included cars, station wagons, buses, and large passenger cars. The actually collected vehicle types and counts were input into the simulation system: 2,417,592 cars, 156,008 station wagons, 91,089 buses, and 25,113 large passenger cars, accounting for 89.88%, 5.80%, 3.39%, and 0.93% of the total flow, respectively. In the calculation process of the simulation system, in order to standardize the input, the vehicle position information at the intersection is converted into a discretized, encoded vehicle position matrix, which serves as the quantitative input of the reinforcement learning model. In addition, the actually input vehicle types and quantities are converted into standard cars according to a conversion standard; the equivalent conversion coefficients relative to a standard car are 1.2 for station wagons and 2.0 for buses and large passenger cars. Traffic noise conversion used to be calculated with the prediction model in the "Specifications for the Environmental Impact Assessment of Highways," whereas its levels or grades, as well as its equivalent conversion, can now be calculated in a variety of ways using the equivalent conversion coefficient. The findings of the speed survey and their analysis show that the speeds of all vehicle types in free traffic flow follow a Gaussian distribution, which allows a speed discretization approach to be proposed. This approach helps to convert vehicles with varying speeds into car-equivalent numbers, while the overall traffic volume may be converted into that of passenger cars by evaluating vehicles at the same noise levels.
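The standard-car conversion described above can be reproduced in a few lines. The counts and percentage shares are the ones stated in the text; the coefficient for large passenger cars is assumed here to equal the bus coefficient of 2.0, following the reading of the sentence above:

```python
# Collected vehicle counts (from the text) and their passenger-car-unit
# (PCU) equivalent conversion coefficients relative to a standard car.
counts = {"car": 2_417_592, "station_wagon": 156_008,
          "bus": 91_089, "large_passenger_car": 25_113}
pcu = {"car": 1.0, "station_wagon": 1.2,
       "bus": 2.0, "large_passenger_car": 2.0}   # large-bus value assumed

total = sum(counts.values())
# Percentage share of each vehicle type in the total flow.
shares = {k: round(100 * v / total, 2) for k, v in counts.items()}
# Total flow expressed in standard-car equivalents.
equivalent_flow = sum(counts[k] * pcu[k] for k in counts)

assert shares == {"car": 89.88, "station_wagon": 5.80,
                  "bus": 3.39, "large_passenger_car": 0.93}
assert equivalent_flow > total   # heavier vehicles inflate the PCU flow
```

The recovered shares match the percentages reported in the text, which is a useful consistency check on the collected counts.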
In order to ensure a fair comparison, the network structure and hyperparameter settings of all the reinforcement learning algorithms are the same. The discount factor is 0.95, the learning rate is 0.001, the greedy-strategy ε is 0.05, and the size of the memory bank is 1000. The number of samples per update is 32, and the model update step is 5 signal cycles. In order to improve the stability of the algorithm, the target network is updated with a delay: the weight replacement step of the target network is set to 30 signal cycles, and RMSprop (root mean square propagation) is used as the update algorithm.
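The update schedule implied by these hyperparameters can be sketched as follows. The "network" is reduced to a single weight vector and the TD gradient is a random placeholder, so only the RMSprop step and the delayed target replacement are meaningful; everything else is a simplifying assumption:

```python
import numpy as np

# Hyperparameter values from the text.
GAMMA, LR, EPSILON = 0.95, 0.001, 0.05
MEMORY_SIZE, BATCH_SIZE = 1000, 32
UPDATE_EVERY, TARGET_REPLACE_EVERY = 5, 30   # in signal cycles

rng = np.random.default_rng(0)
online_w = rng.standard_normal(8)            # stand-in for network weights
target_w = online_w.copy()
sq_grad = np.zeros_like(online_w)            # RMSprop running average

def rmsprop_step(w, grad, sq, lr=LR, rho=0.9, eps=1e-8):
    """One RMSprop update: scale the gradient by the root of the
    exponentially averaged squared gradient."""
    sq[:] = rho * sq + (1 - rho) * grad ** 2
    return w - lr * grad / (np.sqrt(sq) + eps)

replacements = 0
for cycle in range(1, 121):
    if cycle % UPDATE_EVERY == 0:            # gradient step every 5 cycles
        grad = rng.standard_normal(8)        # placeholder TD gradient
        online_w = rmsprop_step(online_w, grad, sq_grad)
    if cycle % TARGET_REPLACE_EVERY == 0:    # delayed target replacement
        target_w = online_w.copy()
        replacements += 1

assert replacements == 4                     # 120 cycles / 30-cycle step
```

The point of the delayed replacement is that the bootstrapped TD target changes only every 30 signal cycles, which stabilizes learning relative to tracking the online weights directly.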
In this study, the weight of the global reward decomposition in the BASE algorithm is set to λ = 0.4. Queue length is used as the indicator of road traffic diversion efficiency under unbalanced traffic flow, and 32 combinations of weight values are tested. During training and testing, the accumulated vehicle queue length is recorded; 100 experiments are carried out for each scheme, and the average queue length is obtained by averaging the results, so a total of 3,200 experiments were conducted using the numerous sensors. The smaller the average queue length, the better the traffic diversion effect. Based on these results, the weights of the influencing factors in the global reward function, k1, k2, k3, k4, and k5, are set to −1.0, −0.5, −0.5, 0.5, and 1.0, respectively.
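A minimal sketch of the resulting global reward is given below. The five feature names are illustrative assumptions, since the text fixes only the weights k1~k5; the sign pattern simply penalizes congestion-type features and rewards throughput-type features:

```python
# Weights k1..k5 from the text; negative weights penalize congestion
# indicators, positive weights reward flow indicators.
K = (-1.0, -0.5, -0.5, 0.5, 1.0)

def global_reward(queue_len, delay_ratio, wait_time, throughput, speed):
    """Weighted sum of (hypothetical) influencing factors; only the
    weights K are taken from the paper, the feature names are assumed."""
    features = (queue_len, delay_ratio, wait_time, throughput, speed)
    return sum(k * f for k, f in zip(K, features))

# Sanity check with made-up states: longer queues and waits lower the
# reward, higher throughput and speed raise it.
r_congested = global_reward(12.0, 0.8, 30.0, 5.0, 4.0)
r_free      = global_reward(2.0, 0.2, 5.0, 9.0, 11.0)
assert r_free > r_congested
```

Under the BASE algorithm this global signal is then decomposed across agents with the weight λ = 0.4 noted above, rather than handed to each agent unchanged.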

Experiment Analysis.
The test set contains data from 6 periods, namely the morning peak (periods 1 and 2), the evening peak (periods 3 and 4), and the off-peak (periods 5 and 6). The performance of each control method is shown in Figures 5 and 6; in Figure 5, R is the reward index. On the data sets of the multiple periods, the BASE algorithm performs best overall, followed by the QMIX algorithm, indicating that the multiagent collaborative algorithm remains effective in real scenarios. The performance of BASE, QMIX, IQL, and MTAC is significantly better than that of the fixed-time control scheme (FT), indicating that the adaptive algorithms can effectively adapt to the traffic flow changes of the real scene and make appropriate decisions according to the real-time traffic state. Figure 5 and Table 1 show that the BASE algorithm outperforms the other algorithms on the data sets of all periods, and its lead is larger in the morning and evening peak hours than in the off-peak hours.
As shown in Figure 6 and Table 2, compared with the other algorithms, the average queue length of the BASE algorithm is reduced by 25.76%~54.97% in the morning and evening peak hours and by 49.00%~70.67% in the off-peak hours, indicating that the BASE algorithm performs well in congested scenes. Compared with the other algorithms, the average delay time of the BASE algorithm is reduced by 15.54%~55.09%, as shown in Figure 7 and Table 3. In the off-peak test sets of periods 5 and 6, the road traffic in the network is light, the demand for the variable steering lane function is weak, and the performance of the algorithms is close; even so, the BASE algorithm of this study still maintains a slight lead. As shown in Figure 8 and Table 4, the average waiting time of the BASE algorithm is reduced by 9.28%~42.39% compared with the other algorithms, and in peak hours such as period 3, where the traffic state is more congested, it still maintains the leading performance, which further proves the effectiveness of the improved algorithm. The average waiting time of the IQL algorithm on this data set is slightly better than that of the QMIX algorithm. As shown in Figure 9 and Table 5, the average travel time of the BASE algorithm is reduced by 6.44%~29.93% compared with the other algorithms, and its performance is stable across the test sets of all 6 periods, always remaining the best.
The comparison results of each traffic index on the test set are shown in Figures 6 to 9.
The superior performance of the algorithm in this study verifies the improvement that the multiagent collaborative algorithm brings to the multi-intersection variable-guidance lane scene through global reward decomposition. To show that the proposed algorithm learns the policy better than the QMIX algorithm, the BASE algorithm is compared with the QMIX algorithm on the performance indicators of the training process. The comparison of the average accumulated reward value index is shown in Figure 10 and Table 6. In Figure 10, E is the number of iterations.

Conclusion
Computational intelligence is built on several learning and optimization techniques. Therefore, an emerging trend of major significance in the next generation of IoT applications is the integration of new learning methods, such as opposition-based learning, optimization approaches, and reinforcement learning. This research proposes a multiagent reinforcement learning-based collaborative control method employing intelligent sensors for variable-guidance lanes at multiple intersections. The method improves performance in congested scenarios through a global reward decomposition algorithm and improves learning efficiency through a priority experience replay algorithm. For the cooperative control of variable-guidance lanes at multiple intersections, compared with other control methods, this method is more effective at reducing the average queue length, average delay time, and average travel time, while converging faster, thereby enhancing the quality of service in the next generation of IoT applications. Follow-up work includes combining the algorithm with traffic signal control and performing joint optimization in the two dimensions of time and space to further improve the traffic capacity of multi-intersection scenarios.

Data Availability
The data shall be made available on request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.