
We study an online multisource multisink queueing network control problem characterized by a self-organizing network structure and self-organizing job routing. We decompose the self-organizing queueing network control problem into a series of interrelated Markov Decision Processes and construct a control decision model for them based on a coupled reinforcement learning (RL) architecture. To maximize the mean time averaged weighted throughput of the jobs through the network, we propose an RL algorithm with time averaged reward to solve the control decision model and obtain a control policy that integrates the job routing strategy and the job sequencing strategy. Computational experiments verify the learning ability and the effectiveness of the proposed RL algorithm on the investigated self-organizing network control problem.

Queueing network optimization problems arise widely in manufacturing, transportation, logistics, computer science, communication, and healthcare [

Self-organizing networks are a new kind of queueing network system. In a self-organizing network, each station or node can establish a link with its adjacent stations or nodes, receive jobs from other stations or nodes, and transfer them onward. Because of the complex link relationships among stations or nodes, the paths and the sequence in which jobs traverse the network are complicated, and so is the control problem for this kind of network. In the literature, researchers have concentrated on the control of multihop networks, a class of networks with self-organizing characteristics. Research methods for multihop network control fall mainly into two categories. The first is to decompose the problem into a series of single-station queueing problems or tandem queueing network problems [

In this paper, we study an online multisource multisink queueing network control problem with limited queue lengths. We take into account the inherent self-organization of the queueing network, transform the problem into Markov Decision Processes (MDPs), and then construct an RL system to solve them. The proposed RL system yields an optimized control strategy and a globally optimized solution. The rest of this paper is organized as follows: we introduce the self-organizing queueing network control problem in Section

The online self-organizing network control problem considered in this paper is described as follows. There are

An example of self-organizing network with multitype stations.

Jobs of type

There exist many feasible paths for a job from its arrival station to its destination station. Take the network in Figure

The queue capacity of each transfer station is limited, which is denoted by

A station is not allowed to send a job to another station unless a link has been established between the two stations. After establishing a link with a downstream station, an upstream station may send one or more jobs over it until it establishes a new link with another downstream station. Assume that a station can send only one job to one station at a time. The time required for establishing a link between two stations is a random variable. Let

The task of network control is to control the routes and the transferring sequence of the jobs. Based on the dynamic status of the queueing network, each station selects an appropriate job from its queue and sends it to an appropriate transfer station or its destination station. The control objective function is to maximize the time averaged weighted throughput (i.e., the weighted throughput rate) of the jobs across the network, which is defined as

The problem addressed above is a new queueing network problem with the following characteristics. (1) It is an online dynamic control problem for multisource multisink networks with limited queue lengths. (2) The jobs' transfer paths are self-organizing. There are multiple types of jobs with different destination stations. For an arbitrary job, many alternative paths exist from its arrival station to its destination station, and the most suitable path is not necessarily the shortest one or the one with the fewest transfer stations. Moreover, the more complex the network structure, the more flexible the path selection. Network control must therefore account for factors such as the global situation, the transfer time of each job at each station, the efficiency of each station, and the length of each station's queue. (3) The network structure itself is self-organizing. The topology of the network is complex and may be dynamic; that is, the location and the number of stations and the relationships among the stations may vary over time. The control approach for the queueing network should be able to adapt to such changes in network topology.
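As a concrete illustration of the control objective, the following minimal Python sketch computes a time averaged weighted throughput from a log of job completions; the names (`completions`, `weights`, `horizon`) are illustrative assumptions, not the paper's notation.

```python
def weighted_throughput_rate(completions, weights, horizon):
    """Time averaged weighted throughput over [0, horizon].

    completions: list of (completion_time, job_type) pairs for jobs that
    reached their destination station; weights: weight per job type.
    """
    total = sum(weights[jtype] for t, jtype in completions if t <= horizon)
    return total / horizon

# Toy example: three completed jobs within a horizon of 10 time units.
rate = weighted_throughput_rate(
    completions=[(2.0, "A"), (3.5, "B"), (7.0, "A")],
    weights={"A": 1.0, "B": 2.0},
    horizon=10.0,
)
print(rate)  # 0.4
```

The controller's task is to choose routes and transfer sequences so that this quantity is maximized in the long run.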

In the following sections, an RL model is constructed to depict the above network control problem and an RL algorithm is proposed to deal with it.

To illustrate the size of the feasible solution space and the difficulty of the self-organizing network control problem, we use a tandem queueing network control problem as an extremely simple example. This tandem queueing network is composed of ^{9}. Moreover, the general online self-organizing network control problem is much more complicated than this tandem queueing network control problem with the same number of stations and job types. Owing to the large scale of the self-organizing network, it is difficult to formulate the whole system as a unified model and solve it directly. Instead, we formulate an RL model of the self-organizing queueing network problem described in the previous section. According to the characteristics of the problem and following a decomposition-association strategy, the whole queueing network is decomposed into a number of closely connected small-scale subnetworks, and a Markov Decision Process (MDP) model is constructed for each subnetwork. That is, the whole queueing network control problem is converted into a plurality of interconnected MDP problems, with the subnetworks connected by a coupling mechanism. This method reduces the size of the problem while preserving the essential structure of the original problem, and it enhances the adaptability and robustness of the model with respect to changes in the topology of self-organizing networks.

We construct a subnetwork for each station in the self-organizing queueing network. The subnetwork corresponding to a station is centered on that station and contains the adjacent stations linked with it. Each subnetwork corresponds to an RL subsystem that solves the MDP model formulating the control problem of that subnetwork. A state transition of an RL subsystem directly causes state variation in its adjacent RL subsystems; thus adjacent RL subsystems are coupled in their state transitions.
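A minimal sketch of this decomposition, assuming the network is given as an adjacency mapping (hypothetical names, not the paper's notation): each station's subnetwork is simply the station itself plus its linked neighbours.

```python
def build_subnetworks(adjacency):
    """One subnetwork per station: the central station plus the adjacent
    stations it can establish links with."""
    return {s: {s} | set(neighbours) for s, neighbours in adjacency.items()}

# Toy topology: station 1 links with 2 and 3; station 2 links with 1 and 4.
adjacency = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2]}
subnetworks = build_subnetworks(adjacency)
print(subnetworks[1])  # {1, 2, 3}
```

Note that adjacent subnetworks overlap (station 1 appears in the subnetworks of stations 2 and 3), which is precisely why their state transitions are coupled.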

Reinforcement learning (RL) is a machine learning method for solving large-scale multistage decision problems or Markov Decision Processes with incomplete probability information. In the following we convert the self-organizing queueing network control problems of the subnetworks into RL problems, which mainly involves the representation of states, the construction of actions, and the definition of the reward function. In this section we also introduce the coupling mechanism of the RL subsystems.

State variables describe the primary characteristics of the RL subsystem. The

When the central station of an RL subsystem is idle and a trigger event occurs, the RL subsystem selects an action. Since the task of queueing network control is to control the routes and the transfer sequence of the jobs, an action of an RL subsystem comprises two decisions: one determines which station the central station will connect to, and the other selects a job to transfer from the jobs waiting in its queue. For the
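A hedged sketch of this two-part action space, pairing a downstream station with a job from the queue; the capacity check reflects the queue-length limit of the transfer stations described earlier, and all names here are illustrative assumptions.

```python
def feasible_actions(queue, neighbours, queue_len, capacity):
    """Enumerate (downstream_station, job_id) action pairs for the central
    station. A job may be sent to its destination station directly, or to a
    transfer station whose queue is not yet full."""
    actions = []
    for job in queue:
        for station in neighbours:
            if station == job["dest"] or queue_len[station] < capacity[station]:
                actions.append((station, job["id"]))
    return actions

# Toy example: two waiting jobs, two linked neighbours, station 5's queue full.
queue = [{"id": "j1", "dest": 5}, {"id": "j2", "dest": 6}]
actions = feasible_actions(queue, neighbours=[5, 6],
                           queue_len={5: 3, 6: 1}, capacity={5: 3, 6: 3})
print(actions)  # [(5, 'j1'), (6, 'j1'), (6, 'j2')]
```

Job `j2` cannot be routed through station 5 because its queue is full, but `j1` can still reach station 5 since that is its destination.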

The reward function indicates both the instant and the long-term impact of an action: the immediate reward captures the instant impact of an action, while the average reward reflects the optimization of the objective function value. Thus, the whole RL system receives a larger time averaged reward for a larger time averaged weighted throughput. Let

For a type

Without loss of generality, assume that the job starts from its arrival station

Similarly, the reward caused by the job during the process of being transferred from station

Consequently, the accumulated reward caused by the job during the process of being transferred from station

By Lemma

For a type

By Lemma

A state transition takes place when a new job arrives at the network or a job is completely transferred by any station. Without loss of generality, we assume that the sojourn time of any arriving job in the network is finite. Hence, the total number of jobs staying in the network at any time is finite. According to Lemmas

If there exists a positive integer

Assume that the jobs arriving at the network are divided into two sets

It follows from (

Because the total number of jobs staying in the network is less than or equal to

It follows from (

Since

It follows from (

Consequently, maximizing the time averaged weighted throughput is equivalent to maximizing the time averaged reward over an infinite horizon. This links the long-term average reward of the RL system to the optimization of the objective function value of the network control problem.
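In illustrative notation (not the paper's exact symbols, whose constants come from the lemmas above), this equivalence can be written as

```latex
\lim_{T \to \infty} \frac{1}{T} \sum_{j \,:\, C_j \le T} w_{k(j)}
  \;=\;
\lim_{T \to \infty} \frac{1}{T} \int_{0}^{T} r(t)\, \mathrm{d}t ,
```

where \(C_j\) denotes the completion time of job \(j\), \(w_{k(j)}\) the weight of its type, and \(r(t)\) the instantaneous reward accumulated by all subsystems.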

The trigger events for state transition in an RL subsystem are completion of transferring a job to the central station of this subsystem and completion of transferring a job from the central station of this subsystem. Take the

An RL subsystem is coupled with another RL subsystem if a link is established between the central stations of these two subsystems. For example, for any two stations

The constitution of the RL architecture.

To describe the coupling mechanism more precisely and explain the overlap among subsystems, we give an illustrative example. Suppose that

For the

The online self-organizing queueing network control problem was converted into an RL problem in the previous section. We now apply reinforcement learning to solve the RL problem and use

We propose the following reinforcement learning algorithm (Algorithm
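As a hedged sketch of what a tabular average-reward update can look like, the following follows the style of R-learning; the paper's algorithm may differ in its value-function form, step sizes, and exploration scheme, and all names here are illustrative assumptions.

```python
import random
from collections import defaultdict

def select_action(Q, state, actions, eps=0.1):
    """Epsilon-greedy selection over the feasible actions of a subsystem."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def average_reward_update(Q, rho, state, action, reward, next_state,
                          next_actions, alpha=0.1, beta=0.01):
    """One R-learning-style update of the relative value function Q and of
    the time averaged reward estimate rho; returns the updated rho."""
    best_next = max(Q[(next_state, a)] for a in next_actions)
    delta = reward - rho + best_next - Q[(state, action)]
    Q[(state, action)] += alpha * delta
    # Nudge the average-reward estimate toward the observed relative gain.
    rho += beta * (reward + best_next - Q[(state, action)] - rho)
    return rho

# Toy transition: reward 1.0, all values initially zero.
Q = defaultdict(float)
rho = average_reward_update(Q, 0.0, "s0", "a0", 1.0, "s1", ["a0", "a1"])
```

In the coupled architecture, each RL subsystem would maintain its own value table while the shared average-reward estimate ties the subsystems to the global throughput objective.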

In this section, we conduct computational experiments to examine the learning ability and the performance of the proposed reinforcement learning algorithm (Algorithm

To investigate the convergence of the state value function, we examine the variation of the state values during the learning process. The experimental results in this section are averaged over 50 instances. Let

Variation of

Let

Figure

Variation of

For a given problem, we can draw a “Learning Curve” to examine the learning ability of the proposed reinforcement learning algorithm. In a Learning Curve, the objective function values are averaged over 50 instances. As shown in Figure

The Learning Curve of weighted throughput rate.

To validate the adaptability of Algorithm

The time averaged weighted throughput (AWT) of the extensive test problems.

| RAII | | 10 | 15 | 20 | 25 | 30 | 35 | 40 |
|---|---|---|---|---|---|---|---|---|
| 0.50 | CRR | 0.076 | 0.079 | 0.091 | 0.102 | 0.117 | 0.131 | 0.147 |
| | PGR | 0.089 | 0.101 | 0.112 | 0.127 | 0.143 | 0.161 | 0.175 |
| | | 0.092 | 0.104 | 0.115 | 0.131 | 0.147 | 0.166 | 0.183 |
| 0.75 | CRR | 0.105 | 0.117 | 0.130 | 0.153 | 0.174 | 0.198 | 0.230 |
| | PGR | 0.129 | 0.142 | 0.165 | 0.185 | 0.215 | 0.242 | 0.276 |
| | | 0.134 | 0.147 | 0.170 | 0.192 | 0.222 | 0.251 | 0.286 |
| 1.00 | CRR | 0.171 | 0.195 | 0.221 | 0.258 | 0.282 | 0.304 | 0.338 |
| | PGR | 0.209 | 0.248 | 0.275 | 0.317 | 0.348 | 0.389 | 0.419 |
| | | 0.218 | 0.257 | 0.284 | 0.329 | 0.363 | 0.403 | 0.431 |
| 1.50 | CRR | 0.228 | 0.255 | 0.285 | 0.344 | 0.383 | 0.428 | 0.469 |
| | PGR | 0.285 | 0.323 | 0.357 | 0.424 | 0.481 | 0.537 | 0.597 |
| | | 0.295 | 0.335 | 0.373 | 0.439 | 0.502 | 0.561 | 0.618 |
| 2.00 | CRR | 0.267 | 0.296 | 0.324 | 0.398 | 0.446 | 0.503 | 0.571 |
| | PGR | 0.338 | 0.375 | 0.415 | 0.493 | 0.574 | 0.642 | 0.706 |
| | | 0.352 | 0.391 | 0.433 | 0.511 | 0.598 | 0.669 | 0.735 |

The relative AWT value of the three approaches.

| RAII | | 10 | 15 | 20 | 25 | 30 | 35 | 40 |
|---|---|---|---|---|---|---|---|---|
| 0.50 | CRR | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | PGR | 1.178 | 1.278 | 1.226 | 1.239 | 1.219 | 1.229 | 1.206 |
| | | 1.212 | 1.314 | 1.260 | 1.280 | 1.252 | 1.268 | 1.241 |
| 0.75 | CRR | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | PGR | 1.231 | 1.215 | 1.269 | 1.208 | 1.234 | 1.222 | 1.199 |
| | | 1.277 | 1.256 | 1.310 | 1.255 | 1.276 | 1.268 | 1.241 |
| 1.00 | CRR | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | PGR | 1.218 | 1.271 | 1.245 | 1.230 | 1.234 | 1.303 | 1.266 |
| | | 1.272 | 1.318 | 1.285 | 1.275 | 1.287 | 1.351 | 1.315 |
| 1.50 | CRR | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | PGR | 1.252 | 1.267 | 1.256 | 1.231 | 1.256 | 1.255 | 1.246 |
| | | 1.296 | 1.312 | 1.310 | 1.275 | 1.310 | 1.311 | 1.292 |
| 2.00 | CRR | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | PGR | 1.269 | 1.268 | 1.281 | 1.238 | 1.286 | 1.277 | 1.237 |
| | | 1.320 | 1.321 | 1.336 | 1.283 | 1.341 | 1.330 | 1.287 |

As shown in Table

We decomposed the investigated online self-organizing queueing network control problem with a time averaged weighted throughput objective into a series of cross-correlated Markov Decision Processes and converted them into a coupled reinforcement learning model. In this reinforcement learning system, maximizing the time averaged weighted throughput of all jobs across the network is equivalent to maximizing the time averaged reward of all subsystems. An online reinforcement learning algorithm with time averaged reward is adopted to solve the reinforcement learning model, yielding a control policy that integrates the job routing strategy and the job sequencing strategy. Computational experiments verify the convergence of the state value function through the learning process and the effectiveness of the proposed algorithm on the self-organizing queueing network control problem. In the test problems, the proposed algorithm improves the weighted throughput rate of the networks by a remarkable average proportion through the online learning process. The experimental results show that the reinforcement learning system adapts to different network topologies and learns an optimized policy through interaction with the control process.

The authors declare that there are no conflicts of interest regarding the publication of this paper. The funding mentioned in the "Acknowledgments" did not lead to any conflicts of interest regarding the publication of this manuscript.

This work is supported by Natural Science Foundation of Guangdong Province (no. 2015A030313649, no. 2015A030310274), Science and Technology Planning Project of Guangdong Province, China (no. 2015A010103021), Key Platforms and Scientific Research Program of Guangdong Province (Featured Innovative Projects of Natural Science, no. 2015KTSCX137), and National Natural Science Foundation of China (Grant no. 61703102).