Video Stream Session Migration Method Using Deep Reinforcement Learning in Cloud Computing Environment

In the resource scheduling of the streaming Media Edge Cloud (MEC), in order to balance migration cost and load, this paper proposes a video stream session migration method based on deep reinforcement learning in a cloud computing environment. First, combined with the currently popular OpenFlow technology, a novel MEC architecture is designed that separates streaming media service processing in the application layer from forwarding path optimization in the network layer. Second, taking the state information of the system as the attribute feature, a session migration computation model is built, and deep learning is combined with deterministic policy gradient reinforcement learning for video stream session migration to solve the user request access problem. The experimental results show that the method achieves a better request access effect: it can effectively improve the request acceptance rate and reduce the migration cost while shortening the running time.


Introduction
In recent years, with the maturity of cloud computing technology, streaming media services have gradually been transforming into the cloud form, that is, the streaming media cloud. The streaming media cloud pushes the content requested by users to the edge of the network by placing media edge clouds in different geographical locations, thereby reducing user response delay and the traffic load on the backbone network [1]. At the same time, the subcloud can adapt to changes in system load and the volume of user requests, effectively solving the problems of traditional streaming media services [2].
In the streaming Media Edge Cloud (MEC), system resources are virtualized into resource pools to ensure service transparency. Cloud resource allocation is automatically adjusted by the cloud platform according to the scale of actual demand, so the key question is how to allocate system resources in real time to meet user needs. Under the condition of limited resource allocation, the fluctuation and randomness of user request patterns will make the system load unbalanced and affect the access effect of user requests [3,4].
In order to solve the above problems, domestic and foreign scholars have proposed migration-based task scheduling methods for streaming media. Ref. [5] proposed a session migration strategy based on dynamic threshold allocation (SMS-DTA): according to the popularity distribution, the session allocation thresholds of all kinds of videos on each server are determined, and user request access is guided by these thresholds. Ref. [6] proposed a resource dispatch based on data priority (RDDP) algorithm. However, although it considers the impact of the urgency and scarcity of data blocks on priority, no quantitative calculation is given: only a balance factor is used to measure the quantitative relationship between them, and the influence of the time factor on urgency quantification is omitted. Ref. [7] proposed a direct access storage device (DASD) hopping algorithm that migrates sessions between nodes with different loads in order to maintain the load balance of the hard disks. However, due to the lack of self-adaptability, it is difficult to adjust the strategy according to the system operating scenario. Moreover, the mathematical model is relatively complex and computationally expensive, so it cannot solve large-scale resource allocation problems.
Ref. [8] explored how to make the streaming media edge cloud admit more requests via online session migration and proposed an adaptive strategy of online session migration. Besides the load information, the video popularity is used to obtain the allocation thresholds of different videos on each server, and a new request is admitted under the guidance of the obtained threshold distribution. Specifically, when the video popularity varies, the allocation thresholds are recalculated. Ref. [9] proposed a joint optimization algorithm of session migration and video deployment; the proposed strategy is more adaptive to dynamic fluctuations of video popularity and thus achieves a flexible balance between service cost and quality. Trace-driven experiments verified the effectiveness of the proposed method.
For the resource allocation of the streaming media edge cloud, in order to balance migration cost and load, and considering migration cost, load balancing, and other constraints, this paper proposes a video stream session migration method based on deep reinforcement learning. Based on the currently popular OpenFlow technology, a novel MEC architecture is designed that separates streaming media service processing in the application layer from forwarding path optimization in the network layer to ensure service transparency. The main innovations are as follows: (1) this paper improves resource utilization by effectively using the state information of the MEC system, combining deep learning and a deterministic policy for video stream session migration; (2) this paper proposes a session migration computation model to process user requests more scientifically, maximize the access rate of user requests, and control the migration cost appropriately, while at the same time making the system achieve load balance as far as possible.

Streaming Media Edge Cloud Architecture
The streaming Media Edge Cloud is located at the edge of the network and is responsible for local video services. As shown in Figure 1, combined with the currently popular OpenFlow technology, this paper designs a novel MEC architecture. The whole MEC is composed of streaming media servers, a business management server, and OpenFlow controllers and switches. The streaming media servers are responsible for providing media streams to users. The business management server is mainly responsible for the access scheduling of user requests, generating migration strategies and sending them to the OpenFlow controller. The OpenFlow controllers and switches, on the one hand, constitute a media stream distribution network and, on the other hand, are responsible for the actual implementation of session migration: the OpenFlow controller generates flow tables according to the migration strategy and sends them to the switches, and the OpenFlow switches complete the modification and forwarding of data packets according to the flow tables. By introducing this MEC architecture, streaming media service processing in the application layer is separated from forwarding path optimization in the network layer, and transparency of the video service is realized.

Session Scheduling Strategy Based on Deep Reinforcement Learning
Assume that the MEC system provides I kinds of video content, with the i-th video denoted by v_i. Each video is encoded at a constant bit rate and served at the same bit rate [10,11]. Assume that the total number of MEC streaming media servers is J, with the j-th server denoted by M_j. D is defined as the video deployment matrix of size I × J, and the element d_ij ∈ {0, 1} represents whether or not a copy of v_i is deployed on M_j. Assume that all servers are homogeneous and that a single server can provide up to C streaming sessions at the same time, as well as store up to S videos [12]. K is defined as a session distribution matrix of size I × J; a single element k_ij ∈ [0, 1] represents the ratio of all sessions of video v_i on server M_j to the total service capacity (JC) of the system. G is defined as a server adjacency matrix of size J × J; the element g_nj ∈ {0, 1} denotes whether there are sessions on server M_n that can be migrated to server M_j, where n = 1, 2, ⋯, J.
Define l_j as the load of streaming media server M_j, that is, the total number of access sessions on it:

l_j = JC · Σ_{i=1}^{I} k_ij.

Define L̄ as the average load of all streaming media servers in the system:

L̄ = (1/J) Σ_{j=1}^{J} l_j.

In this paper, the state information of the MEC system is taken as the attribute feature, and the decision-maker and value function are fitted by deep convolutional neural networks combined with reinforcement learning elements such as the state space, action set, and return function. In order to improve the efficiency of the algorithm, the deterministic policy gradient is used to train the neural networks.
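The load definitions above follow directly from the session distribution matrix K: since each k_ij is a fraction of the total system capacity JC, summing column j and scaling by JC recovers the session count on M_j. A minimal sketch (function names are ours, not from the paper):

```python
import numpy as np

def server_loads(K, J, C):
    # l_j = JC * sum_i k_ij : session ratios scaled back to session counts
    return (J * C) * K.sum(axis=0)

def average_load(K, J, C):
    # L-bar = (1/J) * sum_j l_j, i.e., the mean over all J servers
    return server_loads(K, J, C).mean()
```

For example, with J = 2, C = 10 and K = [[0.1, 0.0], [0.2, 0.1]], server M_1 carries 20 · 0.3 = 6 sessions and M_2 carries 20 · 0.1 = 2, giving an average load of 4.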
3.1. Session Scheduling Model. For the streaming media edge cloud system, the goal of reinforcement learning is to access each video request to the most suitable server autonomously, according to the current MEC system state and the video request, following the learned strategy. Then, according to the load state of the servers, the optimal migration method for the currently arriving user request is obtained by using the video migration strategy to perform either a request access or a one-step session migration action [13,14].
In this paper, the deep reinforcement learning method is applied to session scheduling in streaming media edge cloud, and its session migration method is shown in Figure 2.

Wireless Communications and Mobile Computing
In Figure 2, for the current-step video request v_i, the decision-making action of the decision-maker is to connect the video request to a server. Assuming that the accessed server is M_j, the move-out strategy is as follows: if M_j is not full, no video needs to be moved out, and the number of moved-out videos is set to 0, corresponding to a request access; if M_j is full, some video must be moved out, and the set of candidate videos to move out is {v_m | k_mj ≥ 1, m ≠ i, m ≤ I}, corresponding to a one-step session migration.

Reinforcement Learning Model of Session Migration.
According to the characteristics of the problem, the state of time step t in the MEC system is defined as

s_t = (G_t, R_t, D_t, K_t),

where G_t is the server adjacency matrix of size J × J, which indicates whether session migration can be carried out between servers; R_t is the video request matrix of time step t, of size I × J, in which the row corresponding to the currently requested video is all 1s and all other elements are 0; D_t is the video deployment matrix of size I × J, which reflects the deployment of video copies in the MEC system; and K_t is the session distribution matrix of size I × J, which reflects the distribution of video sessions in the MEC system. Every time a new video request is processed, the system undergoes a state transition [15-17].
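The state components described above can be assembled as follows; this is a hedged sketch (helper names are ours), keeping G separate because its J × J shape differs from the I × J matrices:

```python
import numpy as np

def make_request_matrix(i, I, J):
    # R_t: the row of the requested video v_i is all ones, others zero
    R = np.zeros((I, J))
    R[i, :] = 1.0
    return R

def make_state(G, R, D, K):
    # s_t = (G_t, R_t, D_t, K_t); shapes are checked, not flattened,
    # since the adjacency matrix is J x J while the rest are I x J
    I, J = R.shape
    assert G.shape == (J, J) and D.shape == (I, J) and K.shape == (I, J)
    return {"G": G, "R": R, "D": D, "K": K}
```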
Since the task is to decide which server should accept the request, or to reject it, based on the current MEC system state and video request, the action is defined as the number j of the server M_j to which the video request is accessed, where j = 1, 2, ⋯, J. For the current-step video request v_i, the optional action set is given by Formula (4):

A_t = {j | d_ij = 1, j ≤ J} ∪ {0}.

When v_i accesses the MEC directly or through session migration, the set of optional actions is the set of servers on which video v_i is deployed; when rejecting video request v_i, the corresponding action is 0.
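The optional action set described above, i.e., the servers holding a copy of v_i plus the reject action 0, can be sketched as (function name is ours):

```python
import numpy as np

def optional_actions(i, D):
    # A_t = {j | d_ij = 1} plus action 0 (reject); servers are 1-based,
    # matching M_1 ... M_J in the text
    servers = [j + 1 for j in range(D.shape[1]) if D[i, j] == 1]
    return [0] + servers
```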
If video request v_i is accessed to server M_j according to the decision-making action, this paper chooses the deployment of video v_i on server M_j, the load of server M_j, and the load-balance variance of the MEC system after executing the action as the components of the immediate return function. Since the scales of the deployment indicator, the server load value, and the system load-balance variance differ, the load-balance variance of the system is normalized. The load-balance variance function is defined as

σ = 1 / ( (1/J) Σ_{j=1}^{J} (l_j − L̄)² ),

that is, the reciprocal of the load variance, so the larger the load-balance variance σ is, the more balanced the system load.
If the video requested in time step t is v_i, the immediate reward returned by action a_t of that step is

r(s_t, a_t) = ω₁ d_ij + ω₂ (1 − l_j/C) + ω₃ σ − 1_mig,  for a_t = j ∈ A_t \ {0},
r(s_t, a_t) = −1,  for a_t = 0,

where i = 1, 2, ⋯, I and j = 1, 2, ⋯, J. When video v_i is deployed on the server M_j corresponding to decision action a_t, there is a corresponding reward component weighted by ω₁; if video v_i is not deployed on server M_j, the action is not a reasonable access action, it is not in the optional action set, and the reward is 0. The component 1 − l_j/C represents the remaining service capability of the server; when the server is full, that is, the remaining service capability is 0, this component is 0. When migration occurs, because session migration has a certain cost, the reward is reduced by 1 (the indicator 1_mig) as the corresponding penalty. When the action is to reject the video request, the return value is set to −1. ω₁, ω₂, and ω₃ represent the weights of the returns from the three optimization objectives, respectively. The weights can be set according to the importance of the optimization objectives, but they must satisfy Σ_{i=1}^{3} ω_i = 1.

In MEC system state s_t, after taking action a_t, if strategy μ is continuously followed, the expected value of the cumulative return is the action-value function, whose Bellman equation is

Q^μ(s_t, a_t) = r(s_t, a_t) + γ Q^μ(s_{t+1}, μ(s_{t+1})),

where r(s_t, a_t) is the immediate return after taking action a_t in state s_t of the MEC system. In the whole session scheduling process, the optimal scheduling strategy is obtained by solving this equation.

Under an unbalanced load distribution, fully loaded servers can continue to access new requests only if some sessions are moved out. Therefore, whether the load is balanced or not indirectly affects the cost of migration. In practice, due to the fluctuation of the request distribution, the various video requests do not arrive strictly according to popularity.
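One possible reading of the immediate return described above can be sketched as follows. The weights, the function names, and the +1 in the variance denominator (to avoid division by zero) are our illustrative assumptions, not values from the paper:

```python
import numpy as np

# illustrative weights for the three objectives; they must sum to 1
W1, W2, W3 = 0.4, 0.4, 0.2

def load_balance_score(loads):
    # reciprocal of the load variance: larger means more balanced; the
    # +1 in the denominator is our assumption to avoid division by zero
    return 1.0 / (np.var(loads) + 1.0)

def reward(d_ij, l_j, loads, C, migrated=False, rejected=False):
    # one reading of the immediate return described in the text
    if rejected:
        return -1.0           # rejecting a request
    if d_ij == 0:
        return 0.0            # not a legal access action
    r = W1 * d_ij + W2 * (1.0 - l_j / C) + W3 * load_balance_score(loads)
    if migrated:
        r -= 1.0              # penalty: session migration has a cost
    return r
```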
The scheme of optimizing the acceptance rate mentioned above can easily lead to an unbalanced load and increase the cost. Therefore, a goal of load balancing maintenance is introduced.
(1) For a newly arriving request r, due to the guiding allocation threshold, r can only be connected to nodes that are not yet fully loaded, in order to minimize the load imbalance. Therefore, the following new optimization objective is added, where A^r is a constant matrix computed as follows: for v_r, assuming that M_j is the node deploying v_r with the smallest load, the corresponding element is a^r_rj = k_rj + 1, while on the other nodes the corresponding element is a^r_rj = k_rj. In addition, for the remaining videos v_i, i ≠ r, the corresponding element is a^r_ij = k_ij, ∀j ≤ J.
(2) For all subsequently arriving requests, in order to connect them to the minimum-load node, it is necessary to ensure that each allocation threshold is larger than the current number of sessions [18]. In addition, considering the continuity and randomness of request arrivals, the difference between the allocation threshold and the number of sessions should also be related to the arrival of requests and other factors [19,20]. Therefore, the following new optimization objective is added, where A* is a constant matrix. Assuming that R_i represents the number of requests for v_i that have not yet arrived, it can be approximated as R_i = max(JC·p_i − Σ_{j=1}^{J} k_ij, 0). For the node set N_i = {j | d_ij = 1, j ≤ J} deploying v_i, the corresponding element is a*_ij = k_ij + R_i / Σ_{j=1}^{J} d_ij, and for the remaining nodes, a*_ij = 0.
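The threshold matrix A* just described spreads the expected remaining requests R_i evenly over the copies of v_i. A sketch under the paper's formula (the function name is ours):

```python
import numpy as np

def threshold_matrix(K, D, p, J, C):
    # R_i = max(J*C*p_i - sum_j k_ij, 0): requests of v_i still expected;
    # spread them evenly over the nodes deploying v_i on top of k_ij
    R = np.maximum(J * C * p - K.sum(axis=1), 0.0)
    A = np.zeros_like(K)
    for i in range(K.shape[0]):
        copies = D[i].sum()
        if copies > 0:
            A[i] = np.where(D[i] == 1, K[i] + R[i] / copies, 0.0)
    return A
```

For example, with one video deployed on both of J = 2 servers, C = 1000, popularity p = 0.001, and current allocation k = [0.5, 0.0], the remaining demand is R = max(2.0 − 0.5, 0) = 1.5, split as 0.75 per copy.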
θ_{1×I} and γ_{J×1} are weight vectors. Considering that popular videos are more likely to affect the load distribution, the weight θ_i can be taken as p_i; considering that lightly loaded nodes should be allocated larger thresholds, the weight γ_j can be taken as (1 − l_j/C). In effect, the smaller the load of a node, the larger its allocation threshold, so that it can accept more request accesses. Moreover, since this optimization strategy is adopted after the MEC starts, the load of each node is basically balanced, so the above evenly distributed allocation can still achieve the desired effect.
In addition to the limitation on the cost of a single migration, the following constraints should be considered: the service capacity limit of each server; the value range limit of a_ij; and the principle of "no reduction in the number of actual sessions." In summary, with the session assignment matrix A as the decision variable, the migration computation model can be expressed with A^r and A* as constant matrices, subject to the above constraints.

The behavior of each step is obtained by the policy function μ, which is simulated by a convolutional neural network. This network is a policy network with parameters θ^μ. A function J(μ) is used to measure the performance of strategy μ, defined as

J(μ) = E[Q^μ(s_t, μ(s_t))],

where s_t is the state of the system and Q^μ(s_t, μ(s_t)) is the Q value generated in each state when the action a_t is selected according to policy μ; that is, J(μ) is the expected value of Q^μ(s_t, μ(s_t)) under policy μ. Therefore, the optimal behavior strategy is the strategy that maximizes J(μ):

μ* = arg max_μ J(μ).

The network input is the MEC system state, that is, the video request matrix, video deployment matrix, and session distribution matrix of size I × J, and the server adjacency matrix of size J × J. The eigenvectors of the video request matrix, the video deployment matrix, and the session distribution matrix represent the deployment information of the current video on the servers; the size of each eigenvector is I × J. The three eigenvectors are connected through a concat layer. Finally, the probability distribution over server numbers is obtained by a Softmax classifier; its dimension is d, and the decision-making action is the server number with the maximum probability.
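The concat-then-Softmax head of the policy network can be illustrated with a heavily simplified stand-in: here a single linear layer replaces the paper's convolutional stack, and all names (W, b, output size) are our assumptions:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the action logits
    e = np.exp(z - z.max())
    return e / e.sum()

def policy_forward(R, D, K, W, b):
    # stand-in for the policy network: flatten and concatenate the three
    # I x J feature matrices (the "concat layer"), apply one linear layer
    # (our simplification of the convolutional stack), then Softmax
    x = np.concatenate([R.ravel(), D.ravel(), K.ravel()])
    return softmax(W @ x + b)

def decide(probs):
    # decision-making action = server number with the maximum probability
    return int(np.argmax(probs))
```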

[Flow chart: initialize parameters; call the trained policy network to obtain the server j to be accessed; if video v is not deployed on server j, or j = 0, the request is rejected; otherwise, if server j is not fully loaded, request v is accessed to server j.]

In order to make the strategy more exploratory, behavior exploration is added on top of the deterministic strategy: 30% of the actions are selected randomly from the optional action space, and the remaining actions are the output of the policy network.
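This 70/30 mix of exploitation and random exploration can be sketched as (function name is ours):

```python
import random

def select_action(policy_action, action_set, epsilon=0.3):
    # with probability epsilon (30% in the text), explore by sampling
    # uniformly from the optional action set; otherwise exploit the
    # policy network's deterministic output
    if random.random() < epsilon:
        return random.choice(action_set)
    return policy_action
```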

Iterative Value Calculation.
In this paper, a convolutional neural network is used to approximate the Q function. This network is called the Q network, and its parameters are θ^Q. The model of the Q network is shown in Figure 3.
The input of the Q network is the MEC system state and the action vector; the action vector is obtained by transforming the probability distribution vector output by the policy network into a one-hot vector of size I × J. In network training, the input sample data are highly correlated in time, and direct training does not converge easily. In order to break the correlation between the data, the method of "experience replay" is used: the generated sample data are saved into a buffer, and the sample data used in training are randomly drawn from the buffer.
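The experience replay mechanism just described amounts to a bounded buffer with uniform random sampling; a minimal sketch (class and method names are ours):

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay: transitions are stored in a bounded buffer and
    training minibatches are drawn uniformly at random, breaking the
    temporal correlation of consecutive samples."""

    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)  # oldest entries evicted first

    def add(self, state, action, reward, next_state):
        self.buf.append((state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(self.buf, batch_size)

    def __len__(self):
        return len(self.buf)
```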
In the process of network training, this paper uses the target network method: copies θ^{μ′} and θ^{Q′} of the policy network and Q network parameters are established to calculate the target values, and these target networks are slowly updated toward the online networks in the proportion τ. With this learning method, the learning process is more stable and convergence is better guaranteed. The flow chart of the reinforcement learning algorithm based on the deterministic policy gradient is shown in Figure 4.
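The slow update in proportion τ is the standard Polyak averaging step; a sketch (function name is ours, parameters treated as flat lists of floats for clarity):

```python
def soft_update(target_params, online_params, tau):
    # Polyak averaging: theta' <- tau * theta + (1 - tau) * theta',
    # so the target networks track the online networks slowly (rate tau)
    return [tau * o + (1.0 - tau) * t
            for t, o in zip(target_params, online_params)]
```

With τ = 0.1, a target parameter at 0.0 moves to 0.1 after one update toward an online value of 1.0, then to 0.19 after a second update.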

Experiment and Analysis
4.1. Parameter Setting. In the MEC system simulation environment, the environment parameters are as follows: the total number of streaming media servers J = 30, the video capacity of each streaming media server S = 40, the maximum number of service sessions per server C = 100, and the number of video types I = 350. At the same time, it is assumed that the arrival rate of user requests obeys a Poisson distribution of ζ requests per minute, with ζ ranging from 58 to 65. The average playback time is set to 30 minutes, and the system can support 2000 user video requests concurrently within one playback time. Therefore, when ζ = 65 (65 × 30 = 1950 ≈ 2000), the system approaches full load. The video content requested by users obeys a Zipf distribution and a random uniform distribution, respectively.
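A request workload with these properties can be generated as follows; this is a sketch under stated assumptions (the Zipf exponent α = 1 and all function names are ours, not from the paper):

```python
import numpy as np

def zipf_popularity(I, alpha=1.0):
    # normalized Zipf popularity over I videos: p_i proportional to 1/i^alpha
    ranks = np.arange(1, I + 1)
    w = 1.0 / ranks ** alpha
    return w / w.sum()

def generate_requests(minutes, zeta, I, alpha=1.0, seed=0):
    # per-minute request counts ~ Poisson(zeta); each request picks a
    # video index according to the Zipf popularity distribution
    rng = np.random.default_rng(seed)
    p = zipf_popularity(I, alpha)
    requests = []
    for _ in range(minutes):
        n = rng.poisson(zeta)
        requests.append(rng.choice(I, size=n, p=p))
    return requests
```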
In order to analyze the effectiveness and practicability of the deep reinforcement learning algorithm, it was implemented on the TensorFlow platform and applied to the MEC system session scheduling strategy. The parameters of the algorithm are as follows: the learning rate of the policy network is 0.0001, the learning rate of the Q network is 0.001, the discount coefficient is 0.95, the capacity of the buffer is 100,000, the preheating size of the buffer is 1,000, the number of iterations is 100,000, and the upper limit of time steps is 60.

Result Analysis.
In this paper, the deep neural network is trained according to the parameters set in Section 4.1. The trained network model is used in the MEC system simulation experiment, with the simulation time set to 300 minutes. In order to better reflect the effect of the optimization, under the same experimental conditions, the proposed algorithm is compared with the algorithms of Ref. [8] and Ref. [9]. The user request acceptance rate, the total number of migrated sessions, and the running time are used as performance evaluation indicators.
The video content requested by users is set to follow the Zipf distribution. Figures 5-7 show, under this condition, the relationship between the system load and, respectively, the user request acceptance rate, the total number of migrated sessions, and the running time of the algorithms.
As can be seen from Figures 5 and 6, in the case of low load (ζ ≤ 61), the user request acceptance rate and the total number of migrated sessions of this method are basically the same as those of the algorithms of Ref. [8] and Ref. [9]. In the case of high load (ζ > 61), the user request acceptance rate and the total number of migrated sessions of this method are lower than those of the algorithms of Ref. [8] and Ref. [9].
Compared with the algorithms of Ref. [8] and Ref. [9], the average user request acceptance rate of this method is reduced by 0.85% and 1.72%, respectively, and the total number of migrated sessions is reduced by 3.55% and 5.29%, respectively. This result shows the advantage of reinforcement learning: because session migration incurs a cost, in order to obtain a greater return the decision-maker constantly adjusts its decision-making actions and ultimately reduces the migration cost while still maintaining a high request acceptance rate.
As can be seen from Figure 7, for both low load and high load, the running time of the proposed algorithm is better than that of the algorithms of Ref. [8] and Ref. [9], shortened by 39.98% and 54.54% on average, respectively. This is because, in the process of user request access, the algorithm of Ref. [8] needs to constantly update the session allocation thresholds, which leads to a great deal of computation, and the algorithm of Ref. [9] must iterate repeatedly to find the optimal solution. The deep reinforcement learning method used in this paper only needs to make scheduling decisions through the trained policy network, which has lower computational complexity and improves efficiency.
In order to evaluate the adaptability of the proposed algorithm, the video content requested by users is set to follow a random uniform distribution. Figures 8-10 show, under this condition, the relationship between the system load and, respectively, the request acceptance rate, the total number of migrated sessions, and the running time of the algorithms. Compared with the algorithms of Ref. [8] and Ref. [9], the average user request acceptance rate of this algorithm is reduced by 0.41% and 1.19%, the total number of migrated sessions is reduced by 3.64% and 6.57%, and the running time is reduced by 45.28% and 56.03%, respectively. The experimental results show that the proposed algorithm has a certain degree of self-adaptability: when the distribution of user requests changes, the scheduling strategy can still be adjusted in the training process, resulting in a lower migration cost and a high user request acceptance rate.
In summary, the proposed deep reinforcement learning-based session scheduling strategy for the streaming media edge cloud not only achieves a good request access effect but also has a lower migration cost. More importantly, it has a great speed advantage, that is, a shorter running time. At the same time, it is strongly adaptable in an uncertain MEC system environment.

Conclusion
In order to achieve efficient and smooth resource scheduling for streaming media service systems in the cloud mode, this paper proposes a video stream session migration method based on deep reinforcement learning. The method transforms the session migration problem into a reinforcement learning problem; defines the state space, action set, and return function; calculates the session volume according to load; and uses convolutional neural networks to fit the behavior selection strategy function and the action-value function. The experimental results show that, compared with the methods of Ref. [8] and Ref. [9], this strategy can reduce the migration cost and shorten the running time. This paper only considers the server accessed by a video session request as the output of the Q network. Future research will focus on taking both the server accessed by a video session request and the server from which a video session is moved out as outputs of the Q network, in order to improve the streaming media edge cloud session migration method and extend the application to dynamic video sessions.

Data Availability
The data included in this paper are available without any restriction.

Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.