Online Self-Organizing Network Control with Time Averaged Weighted Throughput Objective

We study an online multisource multisink queueing network control problem characterized by a self-organizing network structure and self-organizing job routing. We decompose the self-organizing queueing network control problem into a series of interrelated Markov Decision Processes and construct a control decision model for them based on the coupled reinforcement learning (RL) architecture. To maximize the mean time averaged weighted throughput of the jobs through the network, we propose a reinforcement learning algorithm with time averaged reward to deal with the control decision model and obtain a control policy integrating the job routing selection strategy and the job sequencing strategy. Computational experiments verify the learning ability and the effectiveness of the proposed reinforcement learning algorithm applied to the investigated self-organizing network control problem.


Introduction
Queueing network optimization problems arise widely in manufacturing, transportation, logistics, computer science, communication, healthcare [1], and other fields. With the rapid development of the Internet of Things, large-scale logistics distribution networks, wireless sensor networks [2][3][4], new-generation wireless communication networks, and other network technologies, more and more new network structures and new network optimization problems emerge. Optimization of network control is an important factor affecting the efficiency of network operation.
Self-organizing networks are a new kind of queueing network system. In self-organizing networks, each station or node can establish a link with its adjacent stations or nodes, receive jobs from other stations or nodes, and transfer them to other stations or nodes. Due to the complex link relationships among stations or nodes, the paths and the sequence in which the jobs go through the network are very complicated. Consequently, the control problem of this kind of network is very complicated. In the literature, researchers concentrate on the control of multihop networks, which are a kind of network with self-organizing characteristics. The research methods of multihop network control mainly fall into two categories. The first is to decompose the problem into a series of single-station queueing problems or tandem queueing network problems [5]. The second is to simplify the multihop network control problem into a link scheduling problem [6] or a queue management problem [7]. The main task of link scheduling is to establish links between the stations and select appropriate paths for job transfer. He et al. [8] proposed a load-based scheduling algorithm to optimize the link scheduling between stations so as to balance the load of each station and reduce path congestion. Pinheiro et al. [9] studied link scheduling and path selection by fuzzy control. Augusto et al. [10] simultaneously optimized link scheduling and routing planning. Nandiraju et al. [11] studied the problem of restricting the length of the transmission path and improved the efficiency of long-path transmission. In order to enlarge the network capacity, Gupta and Shroff [12] optimized link scheduling and path selection by solving the maximum weighted matching problem subject to the $K$-hop interference constraints. The main task of queue management is to classify the jobs into job groups and to determine the transmission order of the job groups. Fu and Agrawal [7] focused on the problem of job classification in queue management and improved the efficiency by batch processing of the jobs. Nieminen et al. [13] and Wang et al. [14] studied optimization of energy management and queue management in multihop networks. Liu et al. [15] reduced the transmission delay and shortened the queue length by modeling and analysis based on Markov chains. Kim et al. [16] considered the fairness of customer services and improved the efficiency of the network while reducing the difference in customers' waiting times. Vučević et al. [17] and Zhou et al. [18] used a reinforcement learning (RL) algorithm to optimize queue management that allocates the data packets to the queues.
In this paper, we study an online multisource multisink queueing network control problem limited by the queue length. We consider the inherent self-organization characteristic of the queueing network problem, transform the problem into Markov Decision Processes (MDP), and then construct an RL system to deal with them. An optimized control strategy and a globally optimized solution are obtained by the proposed RL system. The rest of this paper is organized as follows: we introduce the self-organizing queueing network control problem in Section 2, formulate the problem into an RL model in Section 3, present the detailed RL algorithm in Section 4, conduct computational experiments in Section 5, and draw conclusions in Section 6.

Problem Statement
The online self-organizing network control problem concerned in this paper is described as follows. There are $n$ stations in the network and $m$ types of jobs arrive at the network. Let $\mathcal{N} = \{N_i \mid 1 \le i \le n\}$ denote the set of stations in the network and let $J_{j,h}$ ($1 \le j \le m$) denote the $h$th job of type $j$. Take the network in Figure 1 as an example. As shown in Figure 1, the self-organizing queueing network is composed of three types of stations. The first type is arrival stations. Each type of jobs has a specific arrival station. The specified arrival station for type $j$ jobs is $N^a_j \in \mathcal{N}^A$ ($\mathcal{N}^A \subset \mathcal{N}$), where $\mathcal{N}^A$ denotes the set of arrival stations. The second type is transfer stations, which receive jobs and send them to other transfer stations or destination stations. The set of transfer stations is denoted by $\mathcal{N}^T = \{N^t_i \mid 1 \le i \le n_t\}$, where $N^t_i$ denotes the $i$th transfer station, $n_t$ denotes the number of transfer stations, and $\mathcal{N}^T \subset \mathcal{N}$. The third type is destination stations. Each type of jobs has a specific destination station and the jobs of the same type aim to arrive at the same destination station. The specified destination station for type $j$ jobs is $N^d_j \in \mathcal{N}^D$, where $\mathcal{N}^D$ ($\mathcal{N}^D \subset \mathcal{N}$) denotes the set of destination stations. Once a job is processed by its specified destination station, it has passed through the entire network.
Jobs of type $j$ arrive at their arrival station $N^a_j$ following a Poisson process with rate parameter $\lambda_j$. The arriving jobs wait in the queue of station $N^a_j$ for transferring. Let $\Omega(N_i)$ ($\Omega(N_i) \subset \mathcal{N}$) denote the set of stations visible to station $N_i$. Specifically, each arrival station corresponds to a set of visible transfer stations: $\Omega(N^a_j)$ ($\Omega(N^a_j) \subset \mathcal{N}^T$) denotes the set of transfer stations visible for the arrival station $N^a_j$ of type $j$ jobs. Each transfer station also corresponds to a set of visible stations, and each station $N_i$ is qualified to transfer only the jobs of the types in a set $\Phi(N_i)$ ($\Phi(N_i) \subseteq \{1, 2, \ldots, m\}$). One station transfers only one job at a time. From the arrival station to the destination station, each job needs to pass through at least one transfer station. Under certain conditions, the arrival station $N^a_j$ ($1 \le j \le m$) can establish a link with transfer station $N^t_i$ if $N^t_i \in \Omega(N^a_j)$ and then send a type $j$ job to station $N^t_i$. The job waits in the queue to be transferred to another station. The arrival station $N^a_j$ ($1 \le j \le m$) cannot send a job to transfer station $N^t_i$ if $N^t_i \notin \Omega(N^a_j)$. Similarly, transfer station $N^t_i$ can establish a link with station $N_p$ if $N_p \in \Omega(N^t_i)$, select a job from its queue, and send this job to station $N_p$. If station $N_p \in \mathcal{N}^D$, then transfer station $N^t_i$ sends a type $j$ job to station $N_p$ only if $N_p$ is the destination station for type $j$ jobs, that is, station $N^d_j$.

There exist many feasible paths for a job from its arrival station to its destination station. Take the network in Figure 1 as an example. Assume that the arrival station for type $j$ jobs is station $N_1$ and the destination station for type $j$ jobs is station $N_6$. The sets of visible stations of stations $N_1 \sim N_5$ are $\{N_2, N_3, N_7\}$, $\{N_4, N_5\}$, $\{N_4, N_5, N_7\}$, $\{N_6, N_8, N_9, N_{10}\}$, and $\{N_6, N_9, N_{10}\}$, respectively. Thus, many paths are feasible for type $j$ jobs from the arrival station to the destination station, such as $N_1 \to N_2 \to N_4 \to N_6$ and $N_1 \to N_3 \to N_5 \to N_6$.
The queue capacity of each transfer station is limited and is denoted by $L_{\max}$. The maximum number of jobs being transferred simultaneously to each station is also limited; that is, the number of jobs being transferred to a station cannot exceed a predetermined quantity $C_{\max}$. Though a station can be linked by more than one upstream station, it is allowed to link to at most one downstream station. In order to establish a link between two stations to send jobs, the following conditions must be met: (1) the downstream station is visible for the upstream station; (2) the number of stations transferring jobs to the downstream station is less than the predetermined maximum number; (3) the queue length of the downstream station has not reached the upper limit.
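The path example and the three link conditions above can be sketched in a few lines of Python. This is only an illustrative sketch: stations are referred to by their indices, the visibility sets of stations 6 to 10 are not given in the text and are treated here as empty, and the names `feasible_paths` and `can_link` are hypothetical helpers introduced for illustration.

```python
from typing import Dict, List

# Visibility sets of stations 1-5 from the Figure 1 example.
# Stations 6-10 have no listed sets and are treated as having none (assumption).
VISIBLE: Dict[int, List[int]] = {
    1: [2, 3, 7],
    2: [4, 5],
    3: [4, 5, 7],
    4: [6, 8, 9, 10],
    5: [6, 9, 10],
}


def feasible_paths(source: int, sink: int, visible: Dict[int, List[int]]) -> List[List[int]]:
    """Enumerate all simple paths from source to sink by depth-first search."""
    paths: List[List[int]] = []

    def extend(path: List[int]) -> None:
        current = path[-1]
        if current == sink:
            paths.append(path)
            return
        for nxt in visible.get(current, []):
            if nxt not in path:  # never revisit a station on the same path
                extend(path + [nxt])

    extend([source])
    return paths


def can_link(downstream_visible: bool, inbound_transfers: int, max_inbound: int,
             queue_length: int, queue_capacity: int) -> bool:
    """Check the three link-establishment conditions stated in the text."""
    return (downstream_visible
            and inbound_transfers < max_inbound
            and queue_length < queue_capacity)


if __name__ == "__main__":
    for path in feasible_paths(1, 6, VISIBLE):
        print(" -> ".join(f"N{i}" for i in path))
```

Under these assumptions the enumeration yields four paths from station 1 to station 6, each passing through one of stations 2 and 3 and then one of stations 4 and 5.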
A station is not allowed to send a job to another station if a link is not established between the two stations. An upstream station is allowed to send one or more jobs after establishing a link with a downstream station until it establishes a new link with another downstream station. Assume that a station can send only one job to another station at a time. The time required for establishing a link between two stations is a random variable. Let $t^e(N_i, N_p)$ denote the time required for establishing a link between stations $N_i$ and $N_p$ ($1 \le i, p \le n$), which is exponentially distributed, $\exp(\mu_e(N_i, N_p))$, with parameter $\mu_e(N_i, N_p)$. The time consumed in transferring a job depends on the station and the job type. The transferring time of a type $j$ job by station $N_i$ is denoted by $t_{i,j}$, which is exponentially distributed, $\exp(\mu_t(N_i, j))$, with parameter $\mu_t(N_i, j)$.
The task of network control is to control the routes and the transferring sequence of the jobs. Based on the dynamic status of the queueing network, each station selects an appropriate job from its queue and sends it to an appropriate transfer station or its destination station. The control objective is to maximize the time averaged weighted throughput (i.e., the weighted throughput rate) of the jobs across the network, which is defined as

$$\frac{1}{T} \sum_{i=1}^{M_T} w_i, \tag{1}$$

where $T$ is the running time, $M_T$ is the total number of jobs of all types passing across the network by time $T$, and $w_i$ ($1 \le i \le M_T$) is the weight of the $i$th job passing through the network.
The problem addressed above is a new queueing network problem with the following characteristics. (1) The problem is an online dynamic control problem for multisource multisink networks with limited queue length. (2) The jobs' transferring paths are self-organizing. There are multiple kinds of jobs with different destination stations. For an arbitrary job, many alternative paths exist from the arrival station to its destination station. The most suitable path is not necessarily the shortest one or the one with the fewest transfer stations. Moreover, the more complex the network structure is, the more flexible the path selection is. Network control needs to consider factors such as the global situation, the transferring time of each job on each station, the efficiency of each station, and the length of each station's queue. (3) The network structure is self-organizing. The topological structure of the network is complex and may be dynamic; that is, the location and the number of stations and the relationships among the stations may vary over time. The control approach for the queueing network should be able to adapt to changes in the network topology.
In the following sections, an RL model is constructed to depict the above network control problem and an RL algorithm is proposed to deal with it.

The Reinforcement Learning Model
To depict the size of the feasible solution space and the difficulty of the self-organizing network control problem, we use a tandem queueing network control problem as an extremely simple example. This tandem queueing network is composed of $n$ tandem stations ($N_1, N_2, \ldots, N_n$) and $m$ jobs of different types that need to be processed on each station in the order $N_1, N_2, \ldots, N_n$. Suppose that each station processes only one job at a time and the processing sequences of the jobs on different stations are independent. For each station, there are $m!$ possible permutations of the jobs. Thus the number of feasible solutions to this $n$-station network control problem is $(m!)^n$. Even when these two parameters take values as small as 7 and 4, $(m!)^n$ is an enormous figure larger than $10^9$. Moreover, the general online self-organizing network control problem is much more complicated than the above tandem queueing network control problem with the same number of stations and job types.
Due to the large scale of the self-organizing network, it is difficult to formulate the whole system as a unified model and solve it. We therefore formulate the RL model of the self-organizing queueing network problem described in the previous section as follows. According to the characteristics of the self-organizing queueing network control problem and following the decomposition-association strategy, the whole queueing network is decomposed into a number of closely connected small-scale subnetworks and a Markov Decision Process (MDP) model is constructed for each subnetwork. That is, the whole queueing network control problem is converted into a plurality of interconnected MDP problems. The subnetworks are connected by the coupling mechanism. This method reduces the size of the problem while keeping the essential structure of the original problem, and it enhances the adaptability and robustness of the model to changes in the topology of self-organizing networks.
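The size estimate for the tandem example can be checked directly. The short sketch below evaluates $(m!)^n$ for both possible assignments of the values 7 and 4 to the number of jobs and the number of stations; the function name `tandem_solution_count` is a hypothetical helper, not part of the original formulation.

```python
from math import factorial


def tandem_solution_count(num_jobs: int, num_stations: int) -> int:
    """Number of feasible schedules: one independent job permutation per station."""
    return factorial(num_jobs) ** num_stations


if __name__ == "__main__":
    print(tandem_solution_count(7, 4))  # 7 jobs on 4 stations
    print(tandem_solution_count(4, 7))  # 4 jobs on 7 stations
```

Either assignment gives a count above $10^9$, so exhaustive enumeration is already impractical for this very small tandem case.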
Construct a subnetwork for each station in the self-organizing queueing network. For each station, its corresponding subnetwork is centered on this station and contains its adjacent stations linked with this station. Each subnetwork corresponds to an RL subsystem which is used to solve the MDP model formulating the control problem for this subnetwork. The state transition of an RL subsystem directly causes the state variation of its adjacent RL subsystems; thus the adjacent RL subsystems are coupled in state transition.
Reinforcement learning (RL) is a machine learning method proposed to solve large-scale multistage decision problems or Markov Decision Processes with incomplete probability information. In the following we convert the self-organizing queueing network control problems of the subnetworks into RL problems, mainly including representation of states, construction of actions, and definition of the reward function. In this section we also introduce the coupling mechanism of the RL subsystems.
In the reward function (3), $r(s_{i,k}, a_{i,k}) = w_j [l_j(N_i) - l_{j,2}] / L_j$, $w_j$ is the weight of the $j$th job type, $l_j(N_i)$ is the label of the central station $N_i$ for the $j$th job type, $l_{j,2}$ is the label of the downstream station to which the type $j$ job has just been transferred from station $N_i$, and $L_j$ is the length of the shortest path from the arrival station to the destination station of a type $j$ job (i.e., the least amount of flow time for a type $j$ job across the whole network from its arrival station to its destination station). The label of a station for the $j$th job type is defined as the length of the shortest path from that station to the destination station of the $j$th job type, that is, the least amount of flow time for a type $j$ job from the current station to the destination station. The immediate reward $r(s_{i,k}, a_{i,k})$ represents the progress of a job's passage through the network during the time between two state transitions. In the following we show the property of the reward function.
Lemma 1. For a type $j$ job ($1 \le j \le m$), assume that this job attains a transfer station $N^t_i$ with label $l_j(N^t_i)$. Whichever path this job takes to reach transfer station $N^t_i$, the accumulated reward $R(N^t_i)$ of all RL subsystems caused by this job is

$$R(N^t_i) = w_j \frac{L_j - l_j(N^t_i)}{L_j}, \tag{4}$$

where $w_j$ is the weight of the $j$th job type and $L_j$ is the length of the shortest path from the arrival station to the destination station of the $j$th job type.
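Proof. Without loss of generality, assume that the job starts from its arrival station $N^a_j$ and, before it attains station $N^t_i$, it is transferred successively by transfer stations $N_{j,1}, N_{j,2}, \ldots, N_{j,h}$. Let $l_j(N_{j,g})$ ($1 \le g \le h$) denote the label of station $N_{j,g}$ for the $j$th job type. Since the label of the arrival station is $l_j(N^a_j) = L_j$, the reward caused by the type $j$ job during the process of being transferred from station $N^a_j$ to station $N_{j,1}$ is

$$w_j \frac{L_j - l_j(N_{j,1})}{L_j}. \tag{5}$$

Similarly, the reward caused by the job during the process of being transferred from station $N_{j,g}$ ($1 \le g \le h - 1$) to station $N_{j,g+1}$ is

$$w_j \frac{l_j(N_{j,g}) - l_j(N_{j,g+1})}{L_j}. \tag{6}$$

Consequently, the accumulated reward caused by the job during the process of being transferred from station $N^a_j$ to station $N^t_i$ is

$$R(N^t_i) = w_j \frac{L_j - l_j(N_{j,1})}{L_j} + \sum_{g=1}^{h-1} w_j \frac{l_j(N_{j,g}) - l_j(N_{j,g+1})}{L_j} + w_j \frac{l_j(N_{j,h}) - l_j(N^t_i)}{L_j} = w_j \frac{L_j - l_j(N^t_i)}{L_j}, \tag{7}$$

since the labels of the intermediate stations cancel in the telescoping sum.
By Lemma 1 we obtain the following lemma.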
Lemma 2. For a type $j$ job ($1 \le j \le m$), assume that this job attains its destination station $N^d_j$. Whichever path this job takes through the network from its arrival station to its destination station, the accumulated reward of all RL subsystems caused by this job during the whole process is $w_j$.
Proof. By Lemma 1, the accumulated reward caused by this job during the whole process of passing through the network is

$$R(N^d_j) = w_j \frac{L_j - l_j(N^d_j)}{L_j}, \tag{8}$$

where $l_j(N^d_j)$ denotes the label of station $N^d_j$. Since $N^d_j$ is the destination station of type $j$ jobs, by the definition of a station's label we get $l_j(N^d_j) = 0$. Hence,

$$R(N^d_j) = w_j. \tag{9}$$

A state transition takes place when a new job arrives at the network or a job is completely transferred by any station. Without loss of generality, we assume that the sojourn time of any arriving job in the network is finite. Hence, the total number of jobs staying in the network at any time is finite. According to Lemmas 1 and 2 we prove the following theorem.

Theorem 3. If there exists a positive integer 𝑈 such that the total number of jobs staying in the network is less than or equal to 𝑈, then maximizing the time averaged weighted throughput of the network (i.e., the control objective function (1)) is equivalent to maximizing the time averaged reward of all RL subsystems over infinite time.
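Proof. Assume that the jobs arriving at the network are divided into two sets $\Theta_1$ and $\Theta_2$: $\Theta_1$ is the set of jobs that have passed through the network by time $T$ and $\Theta_2$ is the set of jobs still staying in the network at time $T$. Let $J_i$ denote the $i$th arriving job. Thus, according to Lemmas 1 and 2, the accumulated reward of all RL subsystems by time $T$ is

$$R_T = \sum_{J_i \in \Theta_1} w_i + \sum_{J_i \in \Theta_2} w_i \frac{L_i - l_i}{L_i}, \tag{10}$$

where $w_i$ is the weight of job $J_i$, $L_i$ is the length of the shortest path from the arrival station of job $J_i$ to the destination station of job $J_i$, and $l_i$ denotes the label of the station at which job $J_i$ is at time $T$. By definition, the time averaged weighted throughput by time $T$ is

$$\mathrm{WT}_T = \frac{1}{T} \sum_{J_i \in \Theta_1} w_i. \tag{11}$$

It follows from (10) and (11) that

$$\frac{R_T}{T} - \mathrm{WT}_T = \frac{1}{T} \sum_{J_i \in \Theta_2} w_i \frac{L_i - l_i}{L_i}. \tag{12}$$

Because the total number of jobs staying in the network is less than or equal to $U$, that is, $|\Theta_2| \le U$, and because each term $w_i (L_i - l_i)/L_i$ is bounded in absolute value by a finite constant $C$ (the network has finitely many stations and job types, so the weights and labels are bounded), we have

$$\left| \sum_{J_i \in \Theta_2} w_i \frac{L_i - l_i}{L_i} \right| \le U C. \tag{13}$$

It follows from (12) and (13) that

$$\left| \frac{R_T}{T} - \mathrm{WT}_T \right| \le \frac{U C}{T}, \tag{14}$$

and, since $U$ and $C$ do not depend on $T$,

$$\lim_{T \to \infty} \frac{U C}{T} = 0. \tag{15}$$

It follows from (14) and (15) that

$$\lim_{T \to \infty} \left( \frac{R_T}{T} - \mathrm{WT}_T \right) = 0. \tag{16}$$

Consequently, maximizing the time averaged weighted throughput is equivalent to maximizing the time averaged reward over infinite time. This links the long-term average reward of the RL system to the optimization of the objective function value for the network control problem.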
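The path independence stated in Lemmas 1 and 2 can be checked numerically. The following sketch assumes a small hypothetical network in which each directed edge carries a fixed flow time, computes the label of every station as its shortest flow time to the destination (Dijkstra's algorithm on the reversed graph), and sums the per-transfer rewards $w_j [l_j(\text{upstream}) - l_j(\text{downstream})] / L_j$ along a path. The graph, the station names, and the helper names `labels_to` and `accumulated_reward` are illustrative assumptions.

```python
import heapq
from typing import Dict, List, Tuple

# station -> list of (downstream station, flow time); a hypothetical example network
Graph = Dict[str, List[Tuple[str, float]]]

EXAMPLE_GRAPH: Graph = {
    "A": [("B", 2.0), ("C", 1.0)],
    "B": [("D", 1.0)],
    "C": [("D", 5.0)],
}


def labels_to(destination: str, graph: Graph) -> Dict[str, float]:
    """Label of each station: shortest flow time from that station to the destination."""
    reverse: Graph = {}
    for upstream, edges in graph.items():
        for downstream, cost in edges:
            reverse.setdefault(downstream, []).append((upstream, cost))
    dist: Dict[str, float] = {destination: 0.0}
    heap: List[Tuple[float, str]] = [(0.0, destination)]
    while heap:
        d, station = heapq.heappop(heap)
        if d > dist.get(station, float("inf")):
            continue  # stale heap entry
        for upstream, cost in reverse.get(station, []):
            candidate = d + cost
            if candidate < dist.get(upstream, float("inf")):
                dist[upstream] = candidate
                heapq.heappush(heap, (candidate, upstream))
    return dist


def accumulated_reward(path: List[str], weight: float, label: Dict[str, float]) -> float:
    """Sum of per-transfer rewards along a path that starts at the arrival station."""
    shortest = label[path[0]]  # L_j, the label of the arrival station
    return sum(weight * (label[u] - label[v]) / shortest for u, v in zip(path, path[1:]))


if __name__ == "__main__":
    label = labels_to("D", EXAMPLE_GRAPH)
    for path in (["A", "B", "D"], ["A", "C", "D"]):
        print(path, accumulated_reward(path, 0.7, label))
```

In this example the detour through station C first earns a negative reward, because C is farther from the destination than A, and the deficit is recovered on the final transfer, so both paths accumulate the job weight up to floating-point error.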

The State Transition Mechanism and the Coupling Mechanism.
The trigger events for state transition in an RL subsystem are completion of transferring a job to the central station of this subsystem and completion of transferring a job from the central station of this subsystem. Take the $i$th RL subsystem as an example to illustrate the state transition mechanism. Currently the $i$th RL subsystem is at the $k$th decision-making state $s_{i,k}$. This subsystem takes an action $a_{i,k}$ and it transfers to an interim state $s^*_{i,k}$ immediately. When a trigger event for the $i$th RL subsystem takes place, the system transfers to a new decision-making state $s_{i,k+1}$ and receives a reward $r_{k+1}$, which is computed from $s_{i,k}$ and $a_{i,k}$. The above procedure is repeated until a terminal state is attained; that is, all the jobs reach their destination stations. An episode is a trajectory from the initial state to a terminal state of a schedule horizon. With the state representation defined as (2), the decision process is a Markov Decision Process.
An RL subsystem is coupled with another RL subsystem if a link is established between the central stations of these two subsystems. For example, for any two stations $N_i$ and $N_p$, if there is a link from $N_i$ to $N_p$ or from $N_p$ to $N_i$, then the two RL subsystems with central stations $N_i$ and $N_p$ are coupled with each other. The coupling mechanism of the RL subsystems (as shown in Figure 2) mainly contains the following two aspects. (1) The first is the coupled correlation of the state transitions of coupled RL subsystems. One action in a subsystem can change the state of its corresponding coupled subsystem; that is, the state transition of a subsystem directly causes the variation of state variables of its corresponding coupled subsystem. (2) The second is the broadcast mechanism of reward signals in the coupled RL subsystems and the coupled iteration of state values of RL subsystems.
To describe the coupling mechanism more precisely and explain the overlap among subsystems, we give an illustrative example. Suppose that $N_p$ is a visible station to station $N_i$ and that they are the central stations of the $p$th RL subsystem and the $i$th RL subsystem, respectively. Currently the $i$th RL subsystem is at the $k$th decision epoch and station $N_i$ is idle. The $i$th RL subsystem is at the $k$th decision-making state $s_{i,k}$ and the $p$th RL subsystem is at the $v$th state $s_{p,v}$, where $s_{i,k} = [0;\ q_{i,j,k}\ (1 \le j \le m);\ d_{i,k};\ u_{i,k}]$ and $s_{p,v} = [\theta_{p,v};\ q_{p,j,v}\ (1 \le j \le m);\ d_{p,v};\ u_{p,v}]$. Here the first component is the type of the job being transferred (zero if the station is idle), $q$ is the number of waiting jobs of each type, $d$ is the linked downstream station, and $u$ is the number of linked upstream stations. At this decision epoch, the $i$th subsystem takes an action $a_{i,k}$ which selects a job of type $j^*$ waiting in its queue, establishes a link with station $N_p$, and transfers the job to station $N_p$. The $i$th subsystem transfers to an interim state $s^*_{i,k} = [j^*;\ q_{i,j,k}\ (1 \le j < j^*),\ q_{i,j^*,k} - 1,\ q_{i,j,k}\ (j^* < j \le m);\ N_p;\ u_{i,k}]$ immediately. The difference between states $s^*_{i,k}$ and $s_{i,k}$ is that the job type being transferred on station $N_i$ is $j^*$, the number of type $j^*$ jobs waiting in the queue of station $N_i$ decreases by one, and the downstream station to which station $N_i$ is linked is station $N_p$. The $p$th subsystem transfers to an interim state $s^*_{p,v} = [\theta_{p,v};\ q_{p,j,v}\ (1 \le j \le m);\ d_{p,v};\ u_{p,v} + 1]$ immediately. The difference between states $s^*_{p,v}$ and $s_{p,v}$ is that the number of upstream stations linking station $N_p$ increases by one. When station $N_i$ completes transferring the type $j^*$ job, the $i$th subsystem receives a reward $r(s_{i,k}, a_{i,k})$ and it transfers to the next decision-making state $s_{i,k+1} = [0;\ q_{i,j,k}\ (1 \le j < j^*),\ q_{i,j^*,k} - 1,\ q_{i,j,k}\ (j^* < j \le m);\ N_p;\ u_{i,k}]$. The $p$th subsystem transfers to the next decision-making state $s_{p,v+1} = [\theta_{p,v};\ q_{p,j,v}\ (1 \le j < j^*),\ q_{p,j^*,v} + 1,\ q_{p,j,v}\ (j^* < j \le m);\ d_{p,v};\ u_{p,v} + 1]$ simultaneously. For each RL subsystem, the state transition process continues as above until all the jobs reach their destination stations.
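The coupled transitions in this example can be mirrored by a small data structure. The sketch below is illustrative only: the class `SubsystemState`, its field names, the use of 0 for "no downstream link", and the three helper functions are assumptions introduced here, and link establishment time and capacity checks are omitted.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class SubsystemState:
    transferring_type: int   # type of the job being transferred, 0 if the station is idle
    queue: Tuple[int, ...]   # queue[j - 1] = number of waiting type-j jobs
    downstream: int          # index of the linked downstream station (0 = none, assumption)
    upstream_links: int      # number of upstream stations linked to this station


def start_transfer(sender: SubsystemState, job_type: int, downstream: int) -> SubsystemState:
    """Interim state of the sender after it picks a type-j* job and a downstream station."""
    if sender.transferring_type != 0:
        raise ValueError("central station is busy")
    if sender.queue[job_type - 1] == 0:
        raise ValueError("no waiting job of this type")
    queue = list(sender.queue)
    queue[job_type - 1] -= 1
    return replace(sender, transferring_type=job_type, queue=tuple(queue), downstream=downstream)


def accept_link(receiver: SubsystemState) -> SubsystemState:
    """Interim state of the receiver when a new upstream link is established."""
    return replace(receiver, upstream_links=receiver.upstream_links + 1)


def finish_transfer(sender: SubsystemState,
                    receiver: SubsystemState) -> Tuple[SubsystemState, SubsystemState]:
    """Next decision-making states of both subsystems when the transfer completes."""
    job_type = sender.transferring_type
    queue = list(receiver.queue)
    queue[job_type - 1] += 1  # the job joins the receiver's queue
    return replace(sender, transferring_type=0), replace(receiver, queue=tuple(queue))
```

For instance, starting from a sender with queue (2, 1) and an idle, empty receiver at station 5, calling `start_transfer(sender, 1, 5)`, `accept_link(receiver)`, and then `finish_transfer` leaves the sender idle with queue (1, 1) and the receiver with queue (1, 0) and one upstream link.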
For the $i$th RL subsystem, when its central station completes transferring a type $j$ job at the $(k + 1)$th decision epoch, the state value of the $i$th RL subsystem is updated with its immediate reward $r(s_{i,k}, a_{i,k})$ following the proposed RL algorithm. The state values of all the subsystems coupled with the $i$th subsystem are also updated. Assume that the $p$th subsystem is coupled with the $i$th subsystem; then a virtual reward, denoted by $r(s_{i,k}, a_{i,k}, p)$, is used for updating the state value of the $p$th subsystem. Its definition involves $L_j$, the length of the shortest path from the arrival station to the destination station of the $j$th job type, and $t_{i,j}$, the transfer time of a type $j$ job from station $N_i$. The detailed computation procedure of the RL algorithm is shown in Section 4.

A Reinforcement Learning Algorithm with Time Averaged Reward
The online self-organizing queueing network control problem is converted into an RL problem in the previous section. We apply reinforcement learning to solve the RL problem and use the $\varepsilon$-greedy policy to balance exploration and exploitation. The $\varepsilon$-greedy policy means that the algorithm selects the greedy action with probability $1 - \varepsilon$ and selects an available action randomly with probability $\varepsilon$, where $\varepsilon$ ($0 \le \varepsilon \le 1$) is usually a small positive number. We propose the following reinforcement learning algorithm (Algorithm 4), where $n$ is the number of RL subsystems, $S_i$ denotes the state space of the $i$th RL subsystem, $V_i(s_{i,k})$ denotes the value of state $s_{i,k}$, $\alpha$ is the learning rate for state values, $\rho_i$ denotes the estimated reward rate of the $i$th RL subsystem, $\beta$ is the learning rate for the estimated reward rates, $M$ is the total number of jobs required to pass through the network, and $M_c$ is the number of jobs that have currently passed through the network.
Algorithm 4 (a reinforcement learning algorithm with time averaged reward for the online self-organizing queueing network control problem).
Step 1. For each job type $j$ ($1 \le j \le m$), create a network $G_j$ composed of the stations qualified to transfer type $j$ jobs. For each station $N_i$ ($1 \le i \le n$) in network $G_j$, compute its label $l_j(N_i)$.
Step 2. Set parameters $\alpha$, $\beta$, $\varepsilon$, and $M$ to the predetermined values and initialize $M_c$ to be zero. For each $i$ ($1 \le i \le n$), set $k = 0$ and current time $t_{i,k} = 0$ and initialize $\rho_i$ to be one. For each $i$ ($1 \le i \le n$), set current state $s_{i,k}$ as the initial state $s_{i,0}$ and initialize $V_i(s_i) = 1$ for all $s_i$ ($s_i \in S_i$); that is, initialize the state value function for all states of all RL subsystems.
Step 3. Determine the station where the trigger event occurs (e.g., station $N_i$). Determine the current state $s_{i,k}$ for station $N_i$ and the set of available actions for the $i$th RL subsystem at state $s_{i,k}$.
Step 4. Select action $a_{i,k}$ based on the current state value $V_i(s_{i,k})$ of the $i$th RL subsystem following the $\varepsilon$-greedy control policy.
Step 5. Implement action $a_{i,k}$ and determine the next time $t_{i,k+1}$, the time at the $(k + 1)$th decision-making epoch of the $i$th RL subsystem, following the state transition mechanism. Determine the state $s_{i,k+1}$ at time $t_{i,k+1}$ and calculate reward $r(s_{i,k}, a_{i,k})$ by (3).
Step 6. Update the state value $V_i(s_{i,k})$ and then the estimated reward rate $\rho_i$, where $a$ denotes an available action which may be taken at state $s_{i,k+1}$.
Step 8. Set $k = k + 1$ and adjust the current time by $t_{i,k+1}$ for the $i$th subsystem. Update the current states of all subsystems following the state transition mechanism.
Step 9. If the trigger event is that a job finishes passing through the network, then set $M_c = M_c + 1$. If the number of jobs across the network is $M$, then the algorithm terminates; otherwise go to Step 3.
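The action selection in Step 4 and a value update of the kind used in Step 6 can be sketched as follows. This is a hedged illustration, not the update rule of Algorithm 4: the update shown is the textbook tabular differential TD(0) rule for average-reward problems, it ignores the elapsed time between decision epochs and the coupled virtual rewards, and the names `epsilon_greedy`, `AverageRewardLearner`, and the `score` callback (how an action is scored from state values) are assumptions introduced for illustration. Only the initial values $V_i(s_i) = 1$ and $\rho_i = 1$ are taken from Step 2.

```python
import random
from collections import defaultdict
from typing import Callable, DefaultDict, Hashable, Optional, Sequence, TypeVar

Action = TypeVar("Action")


def epsilon_greedy(actions: Sequence[Action], score: Callable[[Action], float],
                   epsilon: float, rng: Optional[random.Random] = None) -> Action:
    """Pick a random available action with probability epsilon, else the best-scoring one."""
    if not actions:
        raise ValueError("no available action")
    rng = rng if rng is not None else random.Random()
    if rng.random() < epsilon:
        return rng.choice(list(actions))
    return max(actions, key=score)


class AverageRewardLearner:
    """Tabular differential TD(0) for one RL subsystem (assumed update form)."""

    def __init__(self, alpha: float, beta: float) -> None:
        self.alpha = alpha  # learning rate for state values
        self.beta = beta    # learning rate for the estimated reward rate
        self.rho = 1.0      # estimated reward rate, initialized to one as in Step 2
        # every state value starts at one, as in Step 2
        self.values: DefaultDict[Hashable, float] = defaultdict(lambda: 1.0)

    def update(self, state: Hashable, reward: float, next_state: Hashable) -> float:
        """Apply one update and return the temporal-difference error."""
        delta = reward - self.rho + self.values[next_state] - self.values[state]
        self.values[state] += self.alpha * delta
        self.rho += self.beta * delta
        return delta
```

With `epsilon = 0` the selection is purely greedy and with `epsilon = 1` it is completely random, which are the two limiting policies between which the $\varepsilon$-greedy policy interpolates.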

Computational Experiments
In this section, we conduct computational experiments to examine the learning ability and the performance of the proposed reinforcement learning algorithm (Algorithm 4). We first use a queueing network with four types of jobs as the test bed. A test problem specifies the number of jobs to be scheduled, and 50 instances are generated for each problem. To verify the convergence of the state value function during the learning process, a test problem with 10000 jobs is used. An instance is the whole process of generating a schedule for the 10000 jobs, from the initial state to the state when all the jobs have passed through the network. The weights of all types of jobs $w_j$ ($1 \le j \le m$) are in the range [0.1, 1] and the parameters of transferring times $\mu_t(N_i, j)$ and link establishing times $\mu_e(N_i, N_p)$ are in the range [1, 10]. The transferring times and link establishing times are exponentially distributed. The parameters of the proposed algorithm are set with  = 0.01,  = 0.05,  = 0.01, and $\rho_i = 1$ ($1 \le i \le n$). The maximum size of the queues of transfer stations, $L_{\max}$, is set to be 5.
To investigate the convergence of the state value function, we examine the variation of the state values during the learning process. The experiment results in this section are averaged over 50 instances. Let $n_{s,i}$ denote the number of states in the state space of the $i$th subsystem and $\bar{V}$ denote the mean value of all states. $\bar{V}$ is defined as

$$\bar{V} = \frac{\sum_{i=1}^{n} \sum_{s_i \in S_i} V_i(s_i)}{\sum_{i=1}^{n} n_{s,i}}, \tag{21}$$

where $S_i$ denotes the state space of the $i$th subsystem and $V_i(s_i)$ denotes the value of state $s_i$ ($s_i \in S_i$). Figure 3 shows the variation of $\bar{V}$ with respect to the number of jobs having crossed the network. When the number of finished jobs is larger than 3000, $\bar{V}$ decreases slowly and gradually converges to −20.53.
Let $V_i^c(s_i)$ denote the value of state $s_i$ ($s_i \in S_i$) at the time when the $c$th job completely passes through the network. Let $\mathrm{ADV}_c$ denote the average absolute difference of the values of all states between the two time points when two successive jobs completely cross the network. For example, at the time when the $c$th job has completely passed through the network, $\mathrm{ADV}_c$ is defined as

$$\mathrm{ADV}_c = \frac{\sum_{i=1}^{n} \sum_{s_i \in S_i} \left| V_i^c(s_i) - V_i^{c-1}(s_i) \right|}{\sum_{i=1}^{n} n_{s,i}}. \tag{22}$$

Figure 4 shows the variation of $\mathrm{ADV}_c$ with respect to the number of finished jobs. Although both the curves in Figures 3 and 4 converge asymptotically, the curve in Figure 4 is not as smooth as that in Figure 3. When the number of finished jobs is larger than 3000, $\mathrm{ADV}_c$ is less than 0.2. Figures 3 and 4 show that as the number of finished jobs increases, $\mathrm{ADV}_c$ asymptotically converges to zero, which indicates that the state values gradually stabilize.
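The two convergence indicators described above, the mean state value and the average absolute difference between successive snapshots, can be computed as in the following sketch. It assumes that each subsystem's state values are stored in a dictionary and that both snapshots contain the same states; the function names are hypothetical.

```python
from typing import Dict, Hashable, List

# one state-value table per RL subsystem
ValueTables = List[Dict[Hashable, float]]


def mean_state_value(tables: ValueTables) -> float:
    """Mean of all state values over all subsystems."""
    total_states = sum(len(table) for table in tables)
    return sum(value for table in tables for value in table.values()) / total_states


def average_absolute_difference(previous: ValueTables, current: ValueTables) -> float:
    """Average absolute change of all state values between two snapshots.

    Assumes both snapshots contain exactly the same states for every subsystem.
    """
    total_states = sum(len(table) for table in current)
    total_change = sum(
        abs(cur[state] - prev[state])
        for prev, cur in zip(previous, current)
        for state in cur
    )
    return total_change / total_states
```

A snapshot would be taken each time a job finishes passing through the network, and the second function applied to each pair of consecutive snapshots.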
For a given problem, we can draw a "Learning Curve" to examine the learning ability of the proposed reinforcement learning algorithm. In a Learning Curve, the objective function values are averaged over 50 instances. As shown in Figure 5, the $x$-coordinate represents the number of finished jobs and the $y$-coordinate represents the time averaged weighted throughput from the initial time to the current time. For example, the point (2000, 0.196) on the Learning Curve indicates that the weighted throughput rate from the initial time to the time when the 2000th job is finished is 0.196. The Learning Curve increases rapidly over the first 3000 jobs and then fluctuates over the later jobs. This curve shows that the RL system learns quickly through interaction and finds a good policy within the first 3000 jobs. Thereafter, the improvement of the control policy gradually slows down.
To validate the adaptability of Algorithm 4 for various problems and examine its effectiveness, more extensive test problems are also randomly generated and solved. We consider networks with different topology structures corresponding to test problems with different numbers of stations ($n$ takes 10, 15, 20, 25, 30, 35, and 40, respectively). For a specific number of stations, the values of the relative arrival intensity index (RAII) are, respectively, 0.50, 0.75, 1.00, 1.50, and 2.00. The RAII index indicates the number of jobs arriving at the arrival stations in these extensive test problems relative to the above test problem. The larger the RAII index is, the more jobs arrive at the arrival stations. Each extensive test problem generates 50 instances. We use three approaches, the $\varepsilon$-greedy approach (Algorithm 4), the completely random routing (CRR) approach, and the purely greedy routing (PGR) approach, to solve the test problems. The CRR approach and the PGR approach correspond to $\varepsilon = 1$ and $\varepsilon = 0$, respectively. Table 1 shows the time averaged weighted throughput (AWT) of the jobs across the network obtained by the three approaches when 10000 jobs have passed through the network. The experiment results in Table 1 are also averaged over 50 instances. As shown in Table 1, the AWT index increases with the RAII index. When the RAII index is larger than one, the AWT index grows more slowly than when the RAII index is less than one, since a high intensity of arriving jobs leads to congestion of the networks. For each test problem, the $\varepsilon$-greedy approach obtains a larger AWT index than the CRR approach and the PGR approach. Table 2 lists the relative AWT values of the $\varepsilon$-greedy approach and the PGR approach. The relative AWT value of a problem for an approach is defined as the AWT index obtained by this approach divided by the AWT index obtained by the CRR approach.
As shown in Table 2, the relative AWT values obtained by the PGR approach range from 1.178 to 1.303 and the relative AWT values obtained by the $\varepsilon$-greedy approach range from 1.212 to 1.351. For the test problems with the RAII index taking 0.50, 0.75, 1.00, 1.50, and 2.00, respectively, the average relative AWT values obtained by the PGR approach are 1.223, 1.226, 1.249, 1.256, and 1.265, respectively, and the average relative AWT values obtained by the $\varepsilon$-greedy approach are 1.262, 1.269, 1.291, 1.305, and 1.316, respectively. Compared with the CRR approach, the PGR approach and the $\varepsilon$-greedy approach improve the weighted throughput rate by an average of 24.4% and 28.8%, respectively. Experiment results show that the $\varepsilon$-greedy approach is superior to both the CRR approach and the PGR approach. Experiment results validate the adaptability and robustness of the proposed algorithm for test problems of various scales and different topology structures. Experiment results also indicate that the reinforcement learning system learns to select an appropriate action on different occasions, links stations and schedules jobs flexibly in an online environment, and obtains optimized results.
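The average improvements can be recomputed from the per-RAII averages quoted above. The sketch below uses only those rounded values, so its results agree with the reported percentages only to within rounding; the function name `average_improvement_percent` is a hypothetical helper.

```python
from typing import Sequence

# Average relative AWT values for RAII = 0.50, 0.75, 1.00, 1.50, 2.00 (rounded, from the text)
PGR_RELATIVE_AWT = [1.223, 1.226, 1.249, 1.256, 1.265]
EPS_GREEDY_RELATIVE_AWT = [1.262, 1.269, 1.291, 1.305, 1.316]


def average_improvement_percent(relative_awt: Sequence[float]) -> float:
    """Mean improvement over the CRR baseline, in percent."""
    return (sum(relative_awt) / len(relative_awt) - 1.0) * 100.0


if __name__ == "__main__":
    print(f"PGR: {average_improvement_percent(PGR_RELATIVE_AWT):.2f}%")
    print(f"epsilon-greedy: {average_improvement_percent(EPS_GREEDY_RELATIVE_AWT):.2f}%")
```

From the rounded inputs the two averages come out at roughly 24.4% and 28.9%, consistent with the reported figures given that the quoted values are rounded to three decimals.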

Figure 1: An example of self-organizing network with multitype stations.

Figure 2: The constitution of the RL architecture.

Figure 3: Variation of   with respect to the number of finished jobs.

Figure 5: The learning curve of the weighted throughput rate.
3.1. States Representation. State variables describe the primary characteristics of the RL subsystem. The $k$th ($k = 0, 1, 2, \ldots$) state of the $i$th ($1 \le i \le N$) RL subsystem is represented by the vector $s_{i,k}$, which is composed of state variables and defined as

$$s_{i,k} = \left[\, \varphi_{i,k};\ q_{i,k,j}\ (1 \le j \le J);\ l_{i,k,d};\ l_{i,k,u} \,\right], \tag{2}$$

where $\varphi_{i,k}$ denotes the type of the job being transferred from the central station of the $i$th RL subsystem at the $k$th state ($\varphi_{i,k}$ equals zero if the central station is idle), $q_{i,k,j}$ ($1 \le j \le J$) denotes the number of type $j$ jobs waiting in the queue of the central station of the $i$th RL subsystem at the $k$th state, $l_{i,k,d}$ denotes the downstream station to which the central station of the $i$th RL subsystem is linking at the $k$th state, and $l_{i,k,u}$ denotes the number of upstream stations linking the central station of the $i$th RL subsystem at the $k$th state. The trigger events for state transition are the arrival of a job and the completion of transferring a job on a station.

3.2. Actions. When the central station of an RL subsystem is idle and a trigger event occurs, the RL subsystem selects an action. Since the task of the queueing network control is to control the routes and the transferring sequence of the jobs, an action of an RL subsystem contains two decisions: one is to determine which station the central station is going to connect to, and the other is to select a job to transfer from the jobs waiting in its queue. For the $i$th subsystem, the number of available actions is $\sum_{m_h \in \Omega(m_i)} |\Psi(m_i, m_h)|$, where $\Omega(m_i)$ denotes the set of stations visible to station $m_i$, $\Psi(m_i, m_h)$ denotes the set of qualified job types that station $m_i$ may select and transfer to station $m_h$, and $|\Psi(m_i, m_h)|$ denotes the cardinality of set $\Psi(m_i, m_h)$. For an RL subsystem (e.g., the $i$th subsystem) where a trigger event occurs, a feasible action for this subsystem is to select station $m_h$ to link with and send a type $j$ job if all the following conditions are satisfied: (1) $m_h \in \Omega(m_i)$; that is, station $m_h$ is a visible station for station $m_i$; (2) the number of upstream stations currently transferring jobs to station $m_h$ is less than the predetermined number $L_h$; (3) the queue length of station $m_h$ is less than $B_h - D_h$; (4) job type $j$ is a qualified job type for station $m_i$ to select and transfer to station $m_h$. Trivially, a null action (i.e., selecting no job) is selected if no job is waiting in the queue of station $m_i$ or one of the above conditions is not satisfied.

3.3. The Reward Function. The reward function indicates the instant and long-term impact of an action; that is, the immediate reward indicates the instant impact of an action and the average reward indicates the optimization of the objective function value. Thus, the whole RL system receives a larger time averaged reward for a larger time averaged weighted throughput. Let $t_{i,k}$ denote the time at the $k$th decision-making epoch of the $i$th RL subsystem, that is, the time when the state of the $i$th RL subsystem transfers from $s_{i,k-1}$ into $s_{i,k}$. Let $r(s_{i,k}, a_{i,k})$ denote the reward that the $i$th RL subsystem receives at time $t_{i,k+1}$ for selecting action $a_{i,k}$ at state $s_{i,k}$. Without loss of generality, assume that the central station $m_i$ of the $i$th RL subsystem transfers a type $j$ job during time interval $(t_{i,k}, t_{i,k+1}]$ and completes transferring the job at time $t_{i,k+1}$; then $r(s_{i,k}, a_{i,k})$ is defined as

$$r(s_{i,k}, a_{i,k}) = \ldots$$

Here $s'$ denotes the state to which the $i$th subsystem is transferred if action $a'$ is taken at state $s_{i,k+1}$, $t'$ denotes the time when the next state transition takes place if action $a'$ is taken at state $s_{i,k+1}$, $a$ denotes an available action which may be taken at state $s_{i,k}$, $s$ denotes the state to which the $i$th subsystem is transferred if action $a$ is taken at state $s_{i,k}$, and $t$ denotes the time when the next state transition takes place if action $a$ is taken at state $s_{i,k}$. Update the RL subsystems coupled with the $i$th RL subsystem as follows. For each RL subsystem coupled with the $i$th RL subsystem (e.g., the $h$th subsystem at state $s_{h,v}$), compute $\delta(s_{i,k}, a_{i,k}, h)$ by (17) and then update $V_h(s_{h,v})$ and $\rho_h$ as

$$V_h(s_{h,v}) = V_h(s_{h,v}) + \alpha \left\{ 1 - \frac{\rho_h \,(t_{i,k+1} - t_{i,k})}{r(s_{i,k}, a_{i,k})} \right\} \cdot \delta(s_{i,k}, a_{i,k}, h),$$

$$\rho_h = \rho_h + \frac{\beta}{t_{i,k+1} - t_{i,k}} \left\{ 1 - \frac{\rho_h \,(t_{i,k+1} - t_{i,k})}{r(s_{i,k}, a_{i,k})} \right\} \cdot \delta(s_{i,k}, a_{i,k}, h).$$
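The feasibility conditions (1) to (4) for actions in Section 3.2 can be sketched as a small enumeration routine. This is an illustration under stated assumptions, not the paper's implementation: the `Station` class, its field names, and the `qualified` callback are hypothetical. The field `queue_limit` stands in for the threshold of condition (3), which the source gives as a difference of two station parameters. The sketch also assumes that a job type can only be sent if at least one such job is waiting at the central station.

```python
from dataclasses import dataclass


@dataclass
class Station:
    name: str
    visible: set              # names of stations visible to this station (Omega)
    queue: dict               # job type -> number of waiting jobs
    active_upstream: int = 0  # upstream stations currently transferring jobs here
    max_upstream: int = 1     # assumed predetermined limit used in condition (2)
    queue_limit: int = 10     # assumed threshold used in condition (3)


def feasible_actions(central, stations, qualified):
    """List feasible (target station, job type) pairs for the central station.

    qualified(src_name, dst_name) returns the set of job types the source
    station may transfer to the target station (Psi in the text).
    Returns [None], the null action, when no pair is feasible.
    """
    actions = []
    for dst_name in sorted(central.visible):                  # condition (1)
        dst = stations[dst_name]
        if dst.active_upstream >= dst.max_upstream:            # condition (2)
            continue
        if sum(dst.queue.values()) >= dst.queue_limit:         # condition (3)
            continue
        for job_type in sorted(qualified(central.name, dst_name)):  # condition (4)
            if central.queue.get(job_type, 0) > 0:             # a job must be waiting
                actions.append((dst_name, job_type))
    return actions or [None]


if __name__ == "__main__":
    stations = {
        "A": Station("A", {"B", "C"}, {1: 2, 2: 0}),
        "B": Station("B", set(), {1: 1}),
        "C": Station("C", set(), {}, active_upstream=1, max_upstream=1),
    }
    print(feasible_actions(stations["A"], stations, lambda src, dst: {1, 2}))
```

When every visible station passes conditions (2) and (3) and every qualified job type is waiting, the number of pairs returned equals the sum over visible stations of the number of qualified job types, matching the action count given in Section 3.2.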

Table 1: The time averaged weighted throughput (AWT) of the extensive test problems.

Table 2: The relative AWT value of the three approaches.