Composition of Web Services Using Markov Decision Processes and Dynamic Programming

We propose a Markov decision process model for solving the Web service composition (WSC) problem. Iterative policy evaluation, value iteration, and policy iteration algorithms are used to experimentally validate our approach, with artificial and real data. The experimental results show the reliability of the model and the methods employed, with policy iteration being the best one in terms of the minimum number of iterations needed to estimate an optimal policy, with the highest Quality of Service attributes. Our experimental work shows that a WSC problem involving a set of 100,000 individual Web services, in which a valid composition requires the selection of 1,000 services from the available set, can be solved in the worst case in less than 200 seconds, using an Intel Core i5 computer with 6 GB RAM. Moreover, a real WSC problem involving only 7 individual Web services requires less than 0.08 seconds, using the same computational power. Finally, a comparison with two popular reinforcement learning algorithms, sarsa and Q-learning, shows that these algorithms require one or two orders of magnitude more time than policy iteration, iterative policy evaluation, and value iteration to handle WSC problems of the same complexity.


Introduction
A Web service is a software system designed to support interoperable machine-to-machine interaction over a network, with an interface described in a machine-processable format called Web Services Description Language [1]. A Web service is typically modeled as a software component that implements a set of operations. The emergence of this type of software component has created unprecedented opportunities to establish more agile collaborations between organizations, and as a consequence, systems based on Web services are growing in importance for the development of distributed applications designed to be accessed via the Internet.
When a Web service is requested, all available Web services descriptions must be matched with the requested description, so that an appropriate service with the desired functionality can be found. However, since the number of available Web services is continuously growing year by year, finding the best match is not a trivial problem anymore, especially if we take into account that the matching criteria must consider not only the desired functionality, but also other attributes such as execution cost, security, performance, and so forth.
If individual Web services are not able to meet complex requirements, they can be combined to create composite services [2]. A composite Web service has one initial task and one ending task, and between the initial and the ending tasks there can be $n \in \{0, 1, 2, \ldots, N\}$ individual tasks connected in sequential order. To create a composite Web service it is necessary to discover and select the most suitable services. The complexity of WSC involves three main factors: (1) the large number of dynamic Web service instances with similar functionality that may be available to a complex service; (2) the different possibilities of integrating service instance components into a complex service process; (3) various performance requirements (e.g., end-to-end delay, service cost, and reliability) of a complex service.
The Scientific World Journal
Some others have proposed to use optimization methods specially designed for solving constraint satisfaction problems, such as integer programming [9], linear programming [10], or methods for solving the knapsack problem [11]. Artificial intelligence methods such as planning algorithms [12][13][14], ant colony optimization [15], fuzzy sets [2], and binary search trees [16] have been used too.
The use of methods based on Markov decision processes (MDPs) for the composition problem is certainly not new. In [17], the problem of workflow composition is modeled as an MDP and a Bayesian learning algorithm is used to estimate the true probability models involved in the MDP. In [18], the WSC is solved using QoS attributes in an MDP framework with two versions of the value iteration algorithm: one backward and recursive and one forward version. In [19], the authors proposed the use of what they call value of changed information. Their approach uses MDPs focusing on changes of the state transition function, in order to anticipate values of the service parameters that do not change the WSC. In [20], a combination of MDPs and HTN (Hierarchical Task Network) planning is proposed.
Solutions based on reinforcement learning are also relevant. For instance, in [21], reinforcement learning and preference logic were employed together to solve the WSC problem, obtaining a qualitative solution. The authors argue that computing a qualitative solution has many advantages over a quantitative one. Other methods using Q-learning are given in [22][23][24]. It is important to remember that reinforcement learning methods [25] belong to a family of algorithms closely related to MDPs. The main difference is that in reinforcement learning the state transition function is assumed to be unknown, and therefore the agents need to explore their state and action spaces by executing different actions in different states and observing the numerical rewards obtained after each state transition.

Contributions of This Paper. The goal of automatic WSC is to determine a sequence of Web services that can be combined to satisfy a set of predefined QoS constraints. For problems where we need to find the sequence of actions maximizing an overall performance function, MDPs are one of the most robust mathematical tools that we can use. Therefore, in this paper we propose an MDP model to solve the WSC problem. To show the reliability of our model, we conducted experiments with three of the most studied algorithms: policy iteration, iterative policy evaluation, and value iteration. Although all three algorithms provided good solutions, the policy iteration algorithm required the minimum number of iterations to converge to the optimal solutions. We also compared these three algorithms against sarsa and Q-learning, showing that the latter methods require one or two orders of magnitude more time to solve composition problems of the same complexity.
This paper is structured as follows. Section 2 provides the basics of the MDP framework and introduces the three algorithms that we tested. Section 3 introduces our MDP model for solving the WSC problem. Section 4 describes the experimental setup and presents the most relevant results. Section 5 presents comparative experiments with the sarsa and Q-learning algorithms. Finally, Section 6 concludes this paper by discussing the main findings and providing some advice for future research.

Markov Decision Processes
The WSC problem can be abstracted as the problem of selecting a sequence of actions in such a way that we maximize an overall evaluation function. Sequential decision problems of this kind can be defined and solved in an MDP framework. An MDP is a tuple $(S, A, P, \gamma, R)$, where $S$ is a set of states, $A$ is a set of actions, $P(s_{t+1} \mid s_t, a_t)$ are the state transition probabilities for all states $s_t, s_{t+1} \in S$ and actions $a_t \in A$, $\gamma \in [0, 1)$ is a discount factor, and $R : S \times A \to \mathbb{R}$ is the reward function.
The MDP dynamics are as follows. An agent in state $s_t \in S$ performs an action $a_t$ selected from the set of actions $A$. As a result of performing action $a_t$, the agent receives a reward with expected value $R(s_t, a_t)$ and the current state of the MDP transitions to some successor state $s_{t+1}$, according to the transition probability $P(s_{t+1} \mid s_t, a_t)$. Once in state $s_{t+1}$ the agent chooses and executes an action $a_{t+1}$, receiving reward $R(s_{t+1}, a_{t+1})$ and moving to state $s_{t+2}$. The agent keeps choosing and executing actions, creating a path of visited states $s_t, s_{t+1}, s_{t+2}, \ldots$.
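As the agent goes through states $s_0, s_1, s_2, \ldots$, it obtains the following rewards:

$$R(s_0, a_0) + \gamma R(s_1, a_1) + \gamma^2 R(s_2, a_2) + \cdots \qquad (1)$$

The reward at timestep $t$ is discounted by a factor of $\gamma^t$. By doing so, the agent gives more importance to those rewards obtained sooner. In an MDP we try to maximize the sum of expected rewards obtained by the agent:

$$E\left[R(s_0, a_0) + \gamma R(s_1, a_1) + \gamma^2 R(s_2, a_2) + \cdots\right] \qquad (2)$$

A policy is defined as a function $\pi : S \to A$ mapping from the states to the actions. A value function for a policy $\pi$ is the expected sum of discounted rewards, obtained by always performing the actions provided by $\pi$:

$$V^{\pi}(s) = E\left[R(s_0, a_0) + \gamma R(s_1, a_1) + \gamma^2 R(s_2, a_2) + \cdots \mid s_0 = s, \pi\right] \qquad (3)$$

$V^{\pi}(s)$ is the expected sum of discounted rewards that the agent would receive if it starts in state $s$ and takes actions given by $\pi$. Given a fixed policy $\pi$, its value function $V^{\pi}$ satisfies the Bellman equation:

$$V^{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi(s)) V^{\pi}(s') \qquad (4)$$

The optimal value function is defined as

$$V^{*}(s) = \max_{\pi} V^{\pi}(s) \qquad (5)$$

This function gives the best possible expected sum of discounted rewards that can be obtained using any policy $\pi$. The Bellman equation for the optimal value function is

$$V^{*}(s) = \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V^{*}(s') \right] \qquad (6)$$

The optimal value function is such that we have

$$V^{*}(s) = V^{\pi^{*}}(s) \geq V^{\pi}(s) \qquad (7)$$

Algorithm 1: Iterative policy evaluation.
Algorithm 2: Policy iteration algorithm.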
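As a minimal sketch of how the Bellman equation for a fixed policy can be used as an update rule, the following Python code repeatedly applies $V(s) \leftarrow R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) V(s')$ until the values stop changing. It is not a reproduction of Algorithm 1; the nested-list layout `P[a][s][s2]` and `R[s][a]`, the stopping threshold `theta`, and the two-state MDP are assumptions made only for illustration.

```python
def evaluate_policy(P, R, policy, gamma=0.9, theta=1e-8):
    """Iteratively evaluate a fixed policy.

    P[a][s][s2] is the probability of moving from s to s2 under action a,
    R[s][a] is the expected reward, and policy[s] is the action taken in s.
    """
    n_states = len(R)
    V = [0.0] * n_states
    while True:
        V_new = []
        for s in range(n_states):
            a = policy[s]
            expected_next = sum(P[a][s][s2] * V[s2] for s2 in range(n_states))
            V_new.append(R[s][a] + gamma * expected_next)
        delta = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if delta < theta:          # values have (numerically) stopped changing
            return V

# Hypothetical two-state, two-action MDP (not taken from the paper).
P = [
    [[1.0, 0.0], [1.0, 0.0]],     # action 0 always leads to state 0
    [[0.0, 1.0], [0.0, 1.0]],     # action 1 always leads to state 1
]
R = [[0.0, 1.0],                  # rewards in state 0 for actions 0 and 1
     [0.0, 2.0]]                  # rewards in state 1 for actions 0 and 1

V = evaluate_policy(P, R, policy=[1, 1], gamma=0.5)
```

For the policy that always takes action 1, state 1 earns 2 per step forever, so its value is $2 / (1 - 0.5) = 4$, and state 0 is worth $1 + 0.5 \cdot 4 = 3$.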
2.1. Dynamic Programming Algorithms for MDPs. When the state transition probabilities are known, dynamic programming can be used to solve (6). Next, we present three efficient algorithms for solving finite-state MDPs by means of dynamic programming. The first one is iterative policy evaluation (given in Algorithm 1). The second one is the policy iteration algorithm (given in Algorithm 2). This algorithm repeatedly computes the value function for the current policy and then updates the policy using the current value function. The third one, shown in Algorithm 3, is called value iteration and can be thought of as an iterative update of the estimated value function using the Bellman equation (6). The last two algorithms are known to usually converge faster than the first one. Moreover, policy iteration and value iteration are standard algorithms for solving MDPs, and there is currently no universal agreement over which algorithm is better [26,27].
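The difference between value iteration and policy iteration can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's Algorithms 2 and 3: the data layout `P[a][s][s2]` and `R[s][a]`, the helper names `q_value` and `greedy_policy`, the threshold `theta`, and the toy MDP are all assumptions.

```python
def q_value(P, R, V, s, a, gamma):
    """One-step lookahead: R(s, a) + gamma * sum over s2 of P(s2 | s, a) * V(s2)."""
    return R[s][a] + gamma * sum(p * v for p, v in zip(P[a][s], V))

def greedy_policy(P, R, V, gamma):
    """Pick, in every state, the action with the largest one-step lookahead value."""
    n_actions = len(R[0])
    return [max(range(n_actions), key=lambda a: q_value(P, R, V, s, a, gamma))
            for s in range(len(R))]

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Repeatedly apply the Bellman optimality update, then read off a greedy policy."""
    n_states, n_actions = len(R), len(R[0])
    V = [0.0] * n_states
    while True:
        V_new = [max(q_value(P, R, V, s, a, gamma) for a in range(n_actions))
                 for s in range(n_states)]
        delta = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if delta < theta:
            return V, greedy_policy(P, R, V, gamma)

def policy_iteration(P, R, gamma=0.9, theta=1e-8):
    """Alternate policy evaluation and greedy policy improvement until the policy is stable."""
    n_states = len(R)
    policy = [0] * n_states
    V = [0.0] * n_states
    while True:
        while True:  # evaluate the current policy
            V_new = [q_value(P, R, V, s, policy[s], gamma) for s in range(n_states)]
            delta = max(abs(x - y) for x, y in zip(V_new, V))
            V = V_new
            if delta < theta:
                break
        improved = greedy_policy(P, R, V, gamma)
        if improved == policy:
            return V, policy
        policy = improved

# Hypothetical two-state, two-action MDP (not taken from the paper).
P = [[[1.0, 0.0], [1.0, 0.0]],    # action 0 always leads to state 0
     [[0.0, 1.0], [0.0, 1.0]]]    # action 1 always leads to state 1
R = [[0.0, 1.0], [0.0, 2.0]]      # R[s][a]

V_vi, pi_vi = value_iteration(P, R, gamma=0.5)
V_pi, pi_pi = policy_iteration(P, R, gamma=0.5)
```

On this toy problem both functions return the same policy (always take action 1). Policy iteration needs only two improvement rounds, at the cost of an inner evaluation loop in each round.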

Web Service Composition Model
In this section we define the MDP model used to represent and solve the Web service composition problem by means of dynamic programming algorithms.
We begin by describing the WSC problem in more detail. Individual Web services can be categorized in classes by their functionality, input data, and output data. Given $n$ different classes of individual Web services, the WSC problem consists in finding a sequence of length $n$ of individual Web services $\langle w_1, w_2, \ldots, w_n \rangle$, such that $w_i \in C_i$, for $i = 1, 2, \ldots, n$, where $C_i$ is the set of all available Web services of class $i$. Thus, we are making the assumption that a valid composite Web service needs a Web service from each of the existing classes. We are also making the assumptions that all available Web services have been previously categorized into classes and that the ordering of the classes $C_1 \prec C_2 \prec \cdots \prec C_n$ has been predefined.
$C_i \prec C_j$ means that a Web service from set $C_i$ must be executed before a Web service from set $C_j$ to ensure the correct operation of the selected Web services. The correct operation depends basically on their functionality and input and output data. Therefore, the output of $w_i$ must be fully compatible with the input of $w_j$. Now, we are ready to introduce our model. We define a Web service composition problem as an MDP $(S, A, P, \gamma, R)$, where $S$ is the set of states, $A$ is the set of actions, $P$ is the state transition probability function, $\gamma$ is a discount factor such that $\gamma \in [0, 1)$, and $R$ is the reward function. Elements $S$, $A$, $P$, and $R$ are defined next.

Algorithm 3: Value iteration algorithm.

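3.1. States. $S$ is the set of states. Given a WSC problem with $n$ classes, $S$ consists of all compositions of length at most $n$. Thus, for $n = 1$, $S = \{\langle w_1 \rangle\}$, with $w_1 \in C_1$. A composition of length $l = 1$ is not really a composition; it is just a single Web service; however, we will relax the meaning of the word composition and will call it a composition of length $l = 1$. For $n = 3$, $S = \{\langle w_1 \rangle, \langle w_1, w_2 \rangle, \langle w_1, w_2, w_3 \rangle\}$, with $w_1 \in C_1$, $w_2 \in C_2$, and $w_3 \in C_3$. In general, for a WSC problem with $n$ classes, $S = \{\langle w_1 \rangle, \langle w_1, w_2 \rangle, \ldots, \langle w_1, w_2, \ldots, w_n \rangle\}$.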
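To make the state set concrete, the following Python sketch enumerates every composition of length 1 up to the number of classes, taking one service per class in the predefined order. The function name `enumerate_states` and the service names are hypothetical and chosen only for illustration.

```python
from itertools import product

def enumerate_states(classes):
    """Return all compositions of length 1..n, one service per class, in class order."""
    states = []
    for length in range(1, len(classes) + 1):
        # Cartesian product of the first `length` classes gives all prefixes of that length.
        states.extend(product(*classes[:length]))
    return states

# Hypothetical problem with two classes of services.
classes = [["w1a", "w1b"], ["w2a", "w2b", "w2c"]]
states = enumerate_states(classes)
```

With 2 services in the first class and 3 in the second, there are 2 compositions of length 1 and 2 × 3 = 6 of length 2, so 8 states in total. The count grows multiplicatively with each additional class.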

Actions. $A$ is the set of all actions. Given a state $s$, the set of actions available from $s$ is denoted by $A(s)$; thus $A = \{A(s)\}_{s \in S}$. An action consists of selecting a Web service to be included in the current composition. If the current composition is of length $l = k$, all the possibilities of selecting a Web service of class $c = k + 1$ will constitute the set of current available actions.
Formally, we say that $A = \{A(s_{l=0}), A(s_{l=1}), A(s_{l=2}), \ldots, A(s_{l=n-1})\}$, where $A(s_{l=k})$ is read as the set of actions available from a state representing a composition of length $l = k$. Note that $A(s_{l=0})$ refers to the set of actions available from a composition of length $l = 0$, which corresponds to the state where none of the Web services has been selected yet.
For example, if the current state represents the composition $\langle w_1, w_2 \rangle$, which is of length $l = 2$, then $A(s_{l=2})$ is given by all the possibilities of selecting a Web service of class $c = 3$. In other words, we are in a situation where we have already selected Web services from class $c = 1$ and class $c = 2$, and now we need to select a Web service from class $c = 3$.
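The rule that the available actions depend only on the length of the current composition can be sketched as follows. The tuple representation of a state, the function name `available_actions`, and the service names are assumptions made for illustration.

```python
def available_actions(state, classes):
    """Services that may be appended to a partial composition (a tuple of services)."""
    k = len(state)                 # length of the current composition
    if k >= len(classes):
        return []                  # complete composition: nothing left to select
    return list(classes[k])        # every service of the next class (class k + 1)

# Hypothetical problem with three classes of services.
classes = [["w1a", "w1b"], ["w2a"], ["w3a", "w3b", "w3c"]]

from_start = available_actions((), classes)
from_length_two = available_actions(("w1a", "w2a"), classes)
from_complete = available_actions(("w1a", "w2a", "w3b"), classes)
```

The empty tuple plays the role of the state in which no Web service has been selected yet.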

Transition Probabilities. $P(s' \mid s, a)$ are the state transition probabilities for all states $s, s' \in S$ and actions $a \in A(s)$ currently available from $s$. Note that the probability of going from a state $s = \langle w_1 \rangle$ to the state $s' = \langle w_1, w_2 \rangle$ is 1. Meanwhile, the probability of going from the same state $s = \langle w_1 \rangle$ to a state $s' = \langle w_1, w_2, w_3 \rangle$ is 0. In other words, we can only go from a composition state of length $l = k$ to another composition state of length $l = k + 1$.

Reward Function. $R(s' \mid s, a)$ is the reward received when action $a$ is executed and the environment makes a transition from $s$ to $s'$. The reward function for our model is computed using three QoS attributes, as indicated in (8), which was originally proposed in [22]. The QoS attributes employed are availability, throughput, and execution time. In (8), $AV$, $\mathit{time}$, and $Th$ are the availability, average execution time, and throughput values for the last Web service added to the composition represented by state $s'$, and $AV_{\min}$, $\mathit{time}_{\min}$, $Th_{\min}$ and $AV_{\max}$, $\mathit{time}_{\max}$, $Th_{\max}$ are the minimum and maximum values over all the Web services.
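The following Python sketch illustrates a deterministic transition together with a reward built from min-max normalized QoS values. The reward formula here is an assumption, not the paper's equation (8): it adds normalized availability and throughput and subtracts normalized execution time with equal weights. The function names and the QoS numbers are hypothetical.

```python
def normalize(value, lo, hi):
    """Min-max scale a QoS value to [0, 1]; a constant attribute maps to 0."""
    return 0.0 if hi == lo else (value - lo) / (hi - lo)

def reward(service, qos, bounds):
    """Assumed reward: higher availability and throughput help, higher execution time hurts."""
    av, exec_time, th = qos[service]
    return (normalize(av, *bounds["av"])
            + normalize(th, *bounds["th"])
            - normalize(exec_time, *bounds["time"]))

def step(state, service, qos, bounds):
    """Deterministic transition: appending a service gives the next state with probability 1."""
    return state + (service,), reward(service, qos, bounds)

# Hypothetical QoS table: service -> (availability, execution time in seconds, throughput).
qos = {"fast": (0.99, 0.2, 50.0), "slow": (0.90, 1.0, 10.0)}
bounds = {
    "av": (min(q[0] for q in qos.values()), max(q[0] for q in qos.values())),
    "time": (min(q[1] for q in qos.values()), max(q[1] for q in qos.values())),
    "th": (min(q[2] for q in qos.values()), max(q[2] for q in qos.values())),
}

next_state, r_fast = step((), "fast", qos, bounds)
r_slow = reward("slow", qos, bounds)
```

Under this assumed formula the service that is best on all three attributes receives a reward of 2 and the worst one receives -1. Any other weighting of the normalized attributes could be substituted without changing the transition logic.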

Experimental Evaluation
In this section we provide the results of our experimental comparison using two scenarios, one real and one artificial. The experiments that we present in this section were performed running the policy iteration, iterative policy evaluation, and value iteration algorithms on an Intel Core i5 2.5 GHz processor with 6 GB RAM, running the 64-bit Windows 8.1 operating system.

Real Scenario. The WSC problem considered as our first experimental scenario consists of 2 classes of Web services. One class is about weather services that can be used to obtain the current temperature in a city. The other class is about Web services that can be used to convert temperatures from one unit to another, for example, from Fahrenheit to Celsius. In the class of weather services we considered 3 different Web services.
In the class of metric units conversion services we considered 4 different Web services.
we can use subtraction, multiplication, and division operations for the temperature conversion.
We obtained the QoS attribute values of all 7 Web services using a Java program designed to compute them from repeated calls to each service. The formulas use the number of successful calls to the Web service, the total number of calls, and the total execution time for all the calls, with $N = 50$ total calls. In order to obtain representative QoS values for the Web services, we took measurements over several days at different times of the day. We obtained the values for each parameter and measurement, and then we calculated the average values for the QoS parameters.
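A minimal Python sketch of how such QoS values could be derived from a log of calls is given below. The formulas are assumptions based on the quantities named above: availability as successful calls over total calls, average execution time as total time over total calls, and throughput as calls per second. The function name `qos_from_calls` and the sample log are hypothetical, and the original measurement program was written in Java.

```python
def qos_from_calls(calls):
    """calls: list of (succeeded, elapsed_seconds) pairs, one entry per call."""
    total = len(calls)
    successes = sum(1 for ok, _ in calls if ok)
    total_time = sum(elapsed for _, elapsed in calls)
    availability = successes / total        # fraction of calls that succeeded
    avg_time = total_time / total           # mean execution time per call
    throughput = total / total_time         # assumed definition: calls per second
    return availability, avg_time, throughput

# Hypothetical log of 50 calls: 45 succeed, 5 fail, each taking 0.5 seconds.
calls = [(True, 0.5)] * 45 + [(False, 0.5)] * 5
availability, avg_time, throughput = qos_from_calls(calls)
```

Averaging the three values over several measurement sessions, as described in the text, would then amount to calling this function once per session and taking the mean of each component.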
Once we gathered the information of the QoS attributes we used all 3 dynamic programming algorithms to learn the best composite Web service. With 7 Web services belonging to 2 different classes, there are 12 possible compositions. All these possibilities are represented with the graph illustrated in Figure 1.
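The count of 12 follows from choosing one of 3 weather services and one of 4 conversion services. The Python sketch below enumerates these compositions and, purely as an illustration, ranks them by discounted return. The service names, the per-service rewards, and the discount factor are hypothetical and are not the measured values from the experiments.

```python
from itertools import product

# Hypothetical service names for the two classes.
weather = ["weather_1", "weather_2", "weather_3"]
converters = ["convert_1", "convert_2", "convert_3", "convert_4"]
compositions = list(product(weather, converters))   # 3 x 4 ordered pairs

# Hypothetical reward for adding each service to the composition.
rewards = {"weather_1": 0.4, "weather_2": 1.3, "weather_3": 0.9,
           "convert_1": 0.2, "convert_2": 0.7, "convert_3": 1.8, "convert_4": 1.1}
gamma = 0.9

def discounted_return(composition):
    """Sum of rewards along the composition, discounted by position."""
    return sum(gamma ** t * rewards[w] for t, w in enumerate(composition))

best = max(compositions, key=discounted_return)
```

Brute-force enumeration is feasible only for very small problems like this one, which is why dynamic programming is used for the larger scenario.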
The graph of the real scenario illustrates each class of Web services as a layer. In this graph, each node represents an individual Web service. The initial node represents the state where none of the Web services has been selected yet. The final node represents the state where a full composition of Web services has been accomplished. A path from the initial node to the final node implies that a valid composite Web service has been generated.
Results with the real Web services scenario are plotted in Figure 2. All 3 algorithms found the solution for the Web service composition very quickly, in less than 0.07 seconds, with policy iteration being the winner.

Artificial Scenario. As our second scenario to test all 3 dynamic programming algorithms, we simulated data for three QoS attributes: availability, execution time, and throughput. We created a maximum of 100,000 individual Web services, classified into up to 1,000 hypothetical classes of 100 Web services each. We assumed that every Web service in class $i$ can access all the Web services in class $i + 1$. Each of these classes is represented as a layer in Figure 3. Each layer contains 100 nodes or individual Web services.
As in the first scenario, the initial node of the graph represents a state where none of the Web services has been selected yet. The final node is reached when a valid composition has been accomplished. The nodes between the initial and final nodes represent the available Web services. A route from the initial node to the final node gives a possible composite Web service.
Each layer in the graph represents 100 Web services belonging to the same class. Therefore, when the number of nodes to be selected for a valid Web service composition is 1,000, we are really solving a problem with 100 × 1,000 = 100,000 Web services. We can see from the learning curves that the time needed to solve the MDP problem increases as the number of nodes is increased. Again, all 3 algorithms found the optimal solution, but policy iteration found it in less time. The best performances of the algorithms were obtained for $\gamma = 0.8$ and $\gamma = 0.9$, requiring less than 180 seconds to find the optimal composition using iterative policy evaluation and value iteration and less than 120 seconds in the case of policy iteration.
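The layered structure of this scenario can be sketched in Python as a single backward dynamic programming sweep. This is not the implementation used in the experiments and says nothing about its running times. It assumes that every service in one layer can reach every service in the next layer and that the reward depends only on the service being selected. The function name `solve_layered` and the random rewards are hypothetical.

```python
import random

def solve_layered(rewards, gamma=0.9):
    """Backward sweep over layers; rewards[i][j] is the reward for selecting service j of class i."""
    n = len(rewards)
    # value_from[i]: best discounted return from selecting services of classes i..n-1.
    value_from = [0.0] * (n + 1)
    choice = [0] * n
    for i in range(n - 1, -1, -1):
        # Score of each service in layer i: its reward plus the discounted best continuation.
        scores = [r + gamma * value_from[i + 1] for r in rewards[i]]
        choice[i] = max(range(len(scores)), key=scores.__getitem__)
        value_from[i] = scores[choice[i]]
    return value_from[0], choice

# Hypothetical instance with 1,000 layers of 100 services each (100,000 services).
random.seed(0)
rewards = [[random.random() for _ in range(100)] for _ in range(1000)]
best_return, composition = solve_layered(rewards, gamma=0.9)

# Tiny instance that can be checked by hand.
small_return, small_choice = solve_layered([[1.0, 2.0], [3.0, 0.5]], gamma=0.5)
```

In the small instance the best second-layer service has reward 3.0, so the first-layer scores are 1.0 + 1.5 and 2.0 + 1.5, and the sweep selects indices `[1, 0]` with return 3.5. Because the sweep touches each service once, its cost grows linearly with the total number of services under these assumptions.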

Comparison with Sarsa and Q-Learning
In some related works [22][23][24], reinforcement learning algorithms were proposed to solve the Web service composition problem. In this section we compare the learning times required by sarsa and Q-learning against policy iteration, iterative policy evaluation, and value iteration.

Sarsa. Sarsa [25] is an on-policy temporal difference control algorithm which continually estimates the state-action value function $Q^{\pi}$ for the behavior policy $\pi$ and at the same time changes $\pi$ toward greediness with respect to $Q^{\pi}$. Algorithm 4 presents the sarsa algorithm as taken from [25].
If the policy is such that each action is executed infinitely often in every state, every state is visited infinitely often, and it is greedy with respect to the current action-value function in the limit, then by decaying $\alpha$, the algorithm converges to $Q^{*}$ [28].

Q-Learning. Q-learning [29] is an off-policy temporal difference control algorithm which directly approximates the optimal action-value function, independently of the policy being followed. It is one of the most popular algorithms in reinforcement learning. Algorithm 5 reproduces the Q-learning algorithm as taken from [25]. If in the limit the action-values of all state-action pairs are updated infinitely often, with a decaying $\alpha$, then the algorithm converges to $Q^{*}$ with probability 1 [26,30].
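The on-policy versus off-policy distinction comes down to the target used in a single update, which the following Python sketch makes explicit. It shows only the standard update rules, not the full Algorithms 4 and 5; the dictionary-based table `Q`, the function names, and the numbers are assumptions made for illustration.

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy target: uses the action actually chosen in the next state."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, next_actions, alpha=0.1, gamma=0.9):
    """Off-policy target: uses the best action value in the next state."""
    best_next = max((Q[(s_next, b)] for b in next_actions), default=0.0)
    target = r + gamma * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Hypothetical action values for the next state "s1".
Q_sarsa = defaultdict(float, {("s1", "x"): 1.0, ("s1", "y"): 2.0})
Q_qlearn = defaultdict(float, {("s1", "x"): 1.0, ("s1", "y"): 2.0})

# Same transition (reward 1.0), but the behavior policy happened to pick "x" in s1.
sarsa_update(Q_sarsa, "s0", "a", 1.0, "s1", "x", alpha=0.5, gamma=0.5)
q_learning_update(Q_qlearn, "s0", "a", 1.0, "s1", ["x", "y"], alpha=0.5, gamma=0.5)
```

Sarsa bootstraps from the value of the action taken (1.0), giving an updated estimate of 0.75, while Q-learning bootstraps from the largest value (2.0), giving 1.0. Both rules update only the single state-action pair that was visited.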

Learning Time Analysis. We have implemented the sarsa and Q-learning algorithms to solve the real scenario problem defined previously in the experimental section. A comparison graph illustrating the time required by sarsa, Q-learning, policy iteration, iterative policy evaluation, and value iteration is given using a logarithmic scale in Figure 7. From this graph we can clearly see that sarsa and Q-learning required two orders of magnitude more time to find the optimal composition.
Additionally, we ran experiments with a second artificially created scenario, with 3 layers of 20 Web services each. Once more, reinforcement learning methods required much more time than the dynamic programming algorithms. The logarithmic time curves given in Figure 8 show that sarsa and Q-learning required one order of magnitude more time than the dynamic programming algorithms. Furthermore, in some of the experiments, reinforcement learning algorithms failed to find the optimal solution, getting stuck in suboptimal compositions.
Dynamic programming methods converge faster than reinforcement learning methods simply because dynamic programming methods update every single state value at each iteration. Reinforcement learning methods only update the values of the states that they happen to visit, given their exploration policy, that is, epsilon-greedy.
Furthermore, in terms of the deployment of an automatic Web service composition system, it is worth mentioning that the gathering of QoS information can be performed at specific time intervals by a dedicated module of such a system. Once we have gathered this information, which is fundamental for the evaluation of the reward function, there is no need to explore the state space of Web services as reinforcement learning methods do. We can simply run a dynamic programming algorithm to estimate the value function of the Web services and then compute the optimal composition of Web services.

Conclusion
In this paper we have proposed an MDP model to address the Web service composition problem. We used three dynamic programming algorithms, namely, iterative policy evaluation, value iteration, and policy iteration, to show the reliability of our approach. Experiments were conducted with both artificially created data and a set of real data involving seven publicly available Web services.
Our experimental results show that policy iteration is the best one in terms of the minimum number of iterations needed to estimate an optimal policy. The optimal policy indicates the sequence of combined individual Web services making up a composite Web service with the highest evaluation of their QoS attributes.
Although some approaches using reinforcement learning have also been proposed, we argue that dynamic programming methods are better suited for the Web service composition problem than reinforcement learning methods. The reason is that reinforcement learning methods such as sarsa and Q-learning require a lot of exploration of the state space and consequently they need more iterations to make a good estimation of the optimal policy. To illustrate this, we compared sarsa and Q-learning against policy iteration, iterative policy evaluation, and value iteration. The result of this comparison is that sarsa and Q-learning required one or two orders of magnitude more time than the dynamic programming methods to handle problems of the same complexity. Moreover, in some of the artificially created experiments, reinforcement learning algorithms got stuck in suboptimal Web service compositions.
None of the related works proposing the use of MDP-based methods to solve the Web service composition problem have provided a comparison study involving the five algorithms that we have analyzed in this work: iterative policy evaluation, value iteration, policy iteration, sarsa, and Q-learning. Moreover, we present experimental results using both a real scenario and a Web service composition scenario with artificially generated data. All other related works report experiments performed only with artificially created data.
Future research on this topic must address real Web services composition involving more nodes. Another interesting subject that deserves to be further investigated is the design of complex reward functions capable of handling an increasing number of QoS factors.