Improving ELM-Based Service Quality Prediction by Concise Feature Extraction

Web services often run in highly dynamic and changing environments, which generate huge volumes of data. Thus, it is impractical to monitor the change of every QoS parameter in order to trigger precautions in a timely manner, owing to the high computational cost of doing so. To address the problem, this paper proposes an active service quality prediction method based on extreme learning machine (ELM). First, we extract web service trace logs and QoS information from the service log and convert them into feature vectors. Second, we propose EC rules, which enable a QoS precaution to be triggered as early as possible with high confidence. An efficient prefix tree based mining algorithm, together with some effective pruning rules, is developed to mine such rules. Finally, we study how to extract a set of diversified features as the representative of all mined results. The problem is proved to be NP-hard, and a greedy algorithm is presented to approximate the optimal solution. Experimental results show that ELM trained on the selected feature subsets can efficiently improve the reliability and the earliness of service quality prediction.


Introduction
The advantage of composite Web services is that they realize a complex application by connecting multiple component services seamlessly. However, in real applications, a Web service lives in a highly dynamic environment, and both the network condition and the operational status of each component Web service (WS) may change during the lifetime of a business process. The instability brought by various uncertain factors often causes composite services to fail or be interrupted temporarily. Therefore, it is very important to ensure the normal execution of composite service applications and provide a reliable software system [1].
As one of the promising technologies to address the above issue, Web service quality prediction has become an important research problem and has attracted a lot of attention in recent years. The goal is to perceive in advance whether the invoked services will fail or be interrupted by monitoring and evaluating service quality fluctuation. In an SOA infrastructure, Web service prediction aims to select high-quality services in advance to ensure the reliable execution of the system. A number of Web service prediction models have been proposed, such as ML-based methods [2][3][4], QoS-aware methods [5], and collaborative filtering based methods [6,7]. These models are often implemented by monitoring and evaluating the quality of composite services. Although they improve the quality of composite services to some extent, these methods still have three major drawbacks. First, most of the traditional ML-based prediction models [3], such as support vector machines (SVM) and artificial neural networks (ANN), are sensitive to user-specified parameters. Second, the prediction models based on QoS monitoring, such as Naive Bayes and Markov models [8,9], often assume that sequences in a class are generated by an underlying model and that the probability distributions are described by a set of parameters. However, these parameters are obtained by predicting QoS during the whole lifecycle of the services and therefore lead to high overhead costs. Third, in sequence distance based prediction methods, such as collaborative filtering [6,7], a function measuring the similarity between a pair of sequences is necessary. However, selecting an optimal similarity function is far from trivial, as it introduces numerous parameters and distance measures which may be rather subjective.

Mathematical Problems in Engineering
As a powerful prediction model, extreme learning machine (ELM for short) was originally developed based on single-hidden layer feedforward neural networks (SLFNs) in [10]. Compared with conventional learning machines, it offers extremely fast learning and good generalization capability. Thus, ELM, with its variants, has been widely applied in many fields. For example, in [11], ELM was applied to plain text classification by using the one-against-one (OAO) and one-against-all (OAA) decomposition schemes. In [12], an ELM-based XML document classification framework was proposed to improve classification accuracy by exploiting two different voting strategies. A protein secondary structure prediction framework based on ELM was proposed in [13,14] to provide good performance at extremely high speed. References [15,16] evaluated the multicategory classification performance of ELM on three microarray datasets. The results indicate that ELM produces comparable or better classification accuracies with reduced training time and implementation complexity compared to artificial neural network methods and support vector machine methods.
In this paper, we introduce ELM into Web service QoS prediction. To the best of our knowledge, this has not been addressed by any previous work. However, it is not trivial to integrate ELM into Web service quality prediction. Some issues need further consideration, for example, how to model the execution information of Web services so that ELM can be applied to the data, and how to train ELM in as short a time as possible while obtaining a model of high prediction accuracy, so that we can conduct on-line Web service QoS prediction.
Our contributions include that (1) we devise a method to extract web service trace logs and QoS information from the service log and convert them into feature vectors; (2) we propose a concept, namely, EC rule, based on which we are enabled to trigger precaution as soon as possible with high confidence; (3) we develop an efficient prefix tree based mining algorithm together with some effective pruning rules to mine such rules; (4) we further study how to extract a set of diversified features as the representative of all mined results based on ELM.
The rest of this paper is organized as follows. Section 2 gives a brief overview of ELM. Section 3 presents the ELM-based QoS prediction framework. Section 4 studies the feature vector representation of Web services. Section 5 defines the EC rules and proposes the mining algorithm. In Section 6, we study the problem of diversified feature selection and present the greedy solution. In Section 7, the experimental evaluation results are reported. Finally, Section 8 concludes this paper.

A Brief Introduction to ELM
ELM (extreme learning machine) is a generalized single hidden-layer feedforward network. In ELM, the hidden-layer node parameters are mathematically calculated instead of being iteratively tuned; thus, it provides good generalization performance at a speed thousands of times faster than traditional popular learning algorithms for feedforward neural networks [12].
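Given $N$ arbitrary distinct samples $(\mathbf{x}_j, \mathbf{t}_j)$, where $\mathbf{x}_j = [x_{j1}, x_{j2}, \ldots, x_{jn}]^T \in \mathbb{R}^n$ and $\mathbf{t}_j = [t_{j1}, t_{j2}, \ldots, t_{jm}]^T \in \mathbb{R}^m$, standard SLFNs with $L$ hidden nodes and activation function $g(x)$ are mathematically modeled as

$$\sum_{i=1}^{L} \boldsymbol{\beta}_i\, G(\mathbf{a}_i, b_i, \mathbf{x}_j) = \mathbf{o}_j, \quad j = 1, \ldots, N, \tag{1}$$

where $\mathbf{a}_i$ and $b_i$ are the learning parameters of hidden nodes and $\boldsymbol{\beta}_i$ is the weight connecting the $i$th hidden node to the output node. $G(\mathbf{a}_i, b_i, \mathbf{x})$ is the output of the $i$th hidden node with respect to the input $\mathbf{x}$. In our case, sigmoid type of additive hidden nodes is used. Thus, (1) is given by

$$\sum_{i=1}^{L} \boldsymbol{\beta}_i\, g(\mathbf{w}_i \cdot \mathbf{x}_j + b_i) = \mathbf{o}_j, \quad j = 1, \ldots, N, \tag{2}$$

where $\mathbf{w}_i = [w_{i1}, w_{i2}, \ldots, w_{in}]^T$ is the weight vector connecting the $i$th hidden node and the input nodes, $\boldsymbol{\beta}_i = [\beta_{i1}, \beta_{i2}, \ldots, \beta_{im}]^T$ is the weight vector connecting the $i$th hidden node and the output nodes, $b_i$ is the bias of the $i$th hidden node, and $\mathbf{o}_j$ is the output for the $j$th sample [10].
If an SLFN with activation function $g(x)$ can approximate the $N$ given samples with zero error, that is, $\sum_{j=1}^{N} \|\mathbf{o}_j - \mathbf{t}_j\| = 0$, there exist $\boldsymbol{\beta}_i$, $\mathbf{w}_i$, and $b_i$ such that

$$\sum_{i=1}^{L} \boldsymbol{\beta}_i\, g(\mathbf{w}_i \cdot \mathbf{x}_j + b_i) = \mathbf{t}_j, \quad j = 1, \ldots, N. \tag{3}$$

Equation (3) can be expressed compactly as follows:

$$\mathbf{H}\boldsymbol{\beta} = \mathbf{T}, \tag{4}$$

where $\mathbf{H}$, the $N \times L$ matrix with entries $H_{ji} = g(\mathbf{w}_i \cdot \mathbf{x}_j + b_i)$, is called the hidden layer output matrix of the network, $\boldsymbol{\beta} = [\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_L]^T$, and $\mathbf{T} = [\mathbf{t}_1, \ldots, \mathbf{t}_N]^T$. The $i$th column of $\mathbf{H}$ is the $i$th hidden node's output vector with respect to inputs $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$, and the $j$th row of $\mathbf{H}$ is the output vector of the hidden layer with respect to input $\mathbf{x}_j$.
For binary classification applications, the decision function of ELM [17] is

$$f(\mathbf{x}) = \operatorname{sign}\bigl(h(\mathbf{x})\,\boldsymbol{\beta}\bigr),$$

where $h(\mathbf{x}) = [G(\mathbf{a}_1, b_1, \mathbf{x}), \ldots, G(\mathbf{a}_L, b_L, \mathbf{x})]$ is the output vector of the hidden layer with respect to the input $\mathbf{x}$. $h(\mathbf{x})$ actually maps the data from the $n$-dimensional input space to the $L$-dimensional hidden layer feature space $H$.
In ELM, the parameters of hidden layer nodes, that is, $\mathbf{a}_i$ and $b_i$, can be chosen randomly without knowing the training datasets. The output weight $\boldsymbol{\beta}$ is then calculated with the matrix computation formula $\boldsymbol{\beta} = \mathbf{H}^{\dagger}\mathbf{T}$, where $\mathbf{H}^{\dagger}$ is the Moore-Penrose inverse of $\mathbf{H}$.
ELM tends to reach not only the smallest training error but also the smallest norm of weights [18]. Given a training set $\aleph = \{(\mathbf{x}_j, \mathbf{t}_j) \mid \mathbf{x}_j \in \mathbb{R}^n, \mathbf{t}_j \in \mathbb{R}^m, j = 1, \ldots, N\}$, activation function $g(x)$, and hidden node number $L$, the pseudocode of ELM [10] is given in Algorithm 1.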
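The training procedure described in this section (random hidden-node parameters, hidden layer output matrix, Moore-Penrose solution for the output weights) can be sketched in a few lines. The following is a minimal NumPy illustration under our own assumptions, not a reproduction of Algorithm 1: the function names `elm_train` and `elm_predict`, the uniform initialization range, and the toy XOR data are all hypothetical choices made for the example.

```python
import numpy as np


def elm_train(X, T, n_hidden, seed=0):
    """Fit a sigmoid-additive ELM; returns (W, b, beta).

    X: (N, n) inputs, T: (N, m) targets. The initialization range
    [-1, 1] is an assumption of this sketch, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))  # input weights w_i
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                # hidden biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))                   # hidden layer output matrix
    beta = np.linalg.pinv(H) @ T                             # beta = H^dagger T
    return W, b, beta


def elm_predict(X, W, b, beta):
    """Network output h(x) beta for each row of X."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta


# Toy binary problem (XOR) with targets in {-1, +1}.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([[-1.0], [1.0], [1.0], [-1.0]])
W, b, beta = elm_train(X, T, n_hidden=20)
pred = np.sign(elm_predict(X, W, b, beta))  # binary decision: sign of the network output
```

Because only the output weights are solved for, training reduces to a single pseudoinverse computation, which is the source of the speed advantage noted above.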

The ELM-Based QoS Prediction Framework
To convey the idea, Figure 1 illustrates the whole process of ELM-based Web service QoS prediction. As shown, the process consists of four major phases: (1) preprocessing, which records the composite service execution log information, extracts multidimensional QoS attributes, and converts them into service feature vectors; (2) EC rule mining, where a prefix tree based algorithm is proposed to mine the candidate feature sets, namely, the EC rules; (3) diversified feature selection, where a small subset of diversified features is extracted from all the rules to construct a classifier of high prediction accuracy, that is, F-ELM; (4) feature updating, where the process periodically updates the prefix tree as the QoS values change.
(1) Preprocess. At first, the system needs to collect large amounts of composite service execution information, aiming to mine useful knowledge for prediction. The original service log includes a variety of structured and unstructured data, such as service trace logs, quality of service (QoS) information, service invocation relationships, and Web service description language (WSDL) documents. These sets of information are typically heterogeneous, of multiple data types, and highly dynamic. Thus, in order to extract useful feature vectors, a preprocessing step is necessary. This part will be discussed in Section 4.
(2) The EC Rules Mining. Since the goal is to conduct on-line Web service QoS prediction, the rules should be concise so that the predictor can respond as early as possible. The proposed EC rules enable the prediction to be triggered as early as possible with high confidence. An efficient prefix tree based mining algorithm, together with some effective pruning rules, is developed to mine such rules. This part will be described in Section 5.
(3) Diversified Feature Selection.Too many rules increase the chance for model overfitting and decrease the generalization performance of a model.Thus, in this step, we study how to extract a small subset of diversified features as the representative of all mined results.By an ELM-based evaluation, the feature subset of the highest score is utilized to construct the predictor, that is, F-ELM.This part will be described in Section 6.
(4) Features Updating.Further, when a new service sequence is input, the update module judges QoS status of the service sequences.If the status of a service attribute changes greatly, the update module sends the updating request to the prefix tree according to a certain strategy.Besides judging the status, the feature values of each node in the prefix tree are recalculated periodically.In this paper, we exploit the strategy mentioned in [19] to address the issue.
In what follows, we focus on steps (1) to (3) one by one.

Preprocessing
Once a service-oriented application or a composite service is deployed in a running environment, the application can be executed in many execution instances. Each execution instance is uniquely identified by an identifier (i.e., id). In each execution instance, a set of service components can be triggered. Owing to various uncertain factors on the Internet, a large amount of potential exception status information may arise. We record the triggered events of the Web service failure information in a log. Extracting execution status information from the execution log to predict service reliability is helpful for service quality management.
Web service QoS information often includes many attributes. For example, the literature [20] lists twelve attributes to depict service QoS, for example, response time, availability, throughput, successability, reliability, compliance, latency, service name, WSDL address, documentation, and service classification. To simplify the explanation, we assume that there are just two QoS attributes for each component service in this paper, that is, the availability attribute (av) and the execution time attribute (exe). We further suppose that there are three possible states for av, that is, inaccessible, intermittently accessible, and accessible, denoted by $av_0$, $av_{0.5}$, and $av_1$, respectively, and two states for exe, that is, delayed execution and normal execution, denoted by $exe_0$ and $exe_1$, respectively. As such, we obtain five possible groups of service execution statuses as shown in Table 1:

$\langle av_0, exe_0 \rangle$, denoted by $s^0$: server unavailable and runtime delay;
$\langle av_{0.5}, exe_0 \rangle$, denoted by $s^1$: server intermittently available and runtime delay;
$\langle av_1, exe_0 \rangle$, denoted by $s^2$: server available and runtime delay;
$\langle av_{0.5}, exe_1 \rangle$, denoted by $s^3$: server intermittently available but normal execution;
$\langle av_1, exe_1 \rangle$, denoted by $s^4$: server available and normal execution.

Note that the status $\langle av_0, exe_1 \rangle$, denoted by $s^5$, does not exist in practice, because an unavailable service is not executed. Given the QoS status representation in Table 1, we extract Web service trace logs and QoS information from the service log and convert them into feature vectors in the following way. Let $S = \{s_1, s_2, \ldots, s_n\}$ be the candidate service component set and $s_i^j$ the status $j$ of service $s_i$. For every record in the web service log, we replace each individual component service by the corresponding status, such that every record is converted into a sequence of feature vectors. Table 2 exemplifies a service execution dataset of 15 failed executions, 5 successful executions, and 3 failure types. For example, column 2 in row 5 denotes an execution sequence "$s_1^4 s_2^2 s_3^2 s_4^2$," which first invokes service $s_1$ with status $s^4$ and then service $s_2$ with status $s^2$, service $s_3$ with status $s^2$, and service $s_4$ with status $s^2$. Column 3 indicates that the sequence was executed twice, and column 4 gives the error type with which this execution failed.
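The conversion from log records to status sequences can be illustrated with a short sketch. The record layout `(service_index, av, exe)`, the `STATUS` lookup table, and the helper names are hypothetical; only the mapping from (availability, execution) pairs to the five status groups follows the description of Table 1 above.

```python
# Hypothetical lookup: (availability state, execution state) -> status index,
# following the five status groups described for Table 1.
STATUS = {
    (0.0, 0): 0,  # unavailable, delayed
    (0.5, 0): 1,  # intermittently available, delayed
    (1.0, 0): 2,  # available, delayed
    (0.5, 1): 3,  # intermittently available, normal
    (1.0, 1): 4,  # available, normal
}


def encode_record(record):
    """Map one log record to a sequence of (service, status) pairs.

    `record` is assumed to be a list of (service_index, av, exe) tuples
    in invocation order. The pair (0.0, 1) is deliberately absent from
    STATUS, since an unavailable service is never executed.
    """
    return [(svc, STATUS[(av, exe)]) for svc, av, exe in record]


def format_sequence(seq):
    """Render a sequence in a plain-text form such as 's1^4 s2^2'."""
    return " ".join(f"s{svc}^{status}" for svc, status in seq)


# The example sequence from the text: s1 normal, then s2, s3, s4 delayed.
record = [(1, 1.0, 1), (2, 1.0, 0), (3, 1.0, 0), (4, 1.0, 0)]
encoded = encode_record(record)
text = format_sequence(encoded)
```

Keeping the service index and the status as a pair makes each item directly usable as a node label in the prefix tree discussed in the next section.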

The EC Rules Mining
In the last step, we modeled the data as a sequence dataset. Next, we detail how to mine the candidate features for on-line Web service QoS prediction from the sequence dataset. The mined features should satisfy two requirements: (1) earliness and (2) conciseness. This is because on-line Web service QoS prediction is a temporal process, where the prediction should be triggered as soon as possible.

Basic Definition.
In this section, we first give some basic concepts and the problem statement.
Definition 1 (feature). Let $D$ be a service execution log with service set $S = \{s_1, s_2, \ldots, s_n\}$, and let $s_i^j$ ($i = 1, \ldots, n$) be a component service with status information or a subset of execution sequences containing status information. We call a set of component services with status information a feature. Note that $s_i^j$ is the status $j$ of service $s_i$.
Given a feature F =   1   2 , . . .,    , we say feature F appears in a sequence S if there exists However, a feature F may appear several times in a sequence.For example, F = {} appears twice in sequence S = {aacdf gacde}.Below, we give the minimum prefix length definition.
Definition 2 (minimum prefix length).Given feature F and a sequence S, where F ⊆ S, the minimum prefix length is the length from initial position of S to the first matched position of F (MPL(F, S) for short).
Definition 3 (weight intraclass support with early factor). Let $F$ be a feature and let $c_i$ be a class. The weight intraclass support with early factor of feature $F$ in class $c_i$, abbreviated wis, is the ratio of the sum of the reciprocals of the minimum prefix lengths of the sequences in $c_i$ containing $F$ to the number of sequences in class $c_i$, that is, the quantity

$$\frac{1}{|c_i|} \sum_{S \in c_i,\; F \subseteq S} \frac{1}{\mathrm{MPL}(F, S)}.$$

Here $\mathrm{wis}(F \cup c_i)$ denotes the support of $F$ and $c_i$ emerging simultaneously, and $\mathrm{wis}(F \cup \neg c_i)$ denotes the support of $F$ and $\neg c_i$ emerging simultaneously; that is, $\mathrm{wis}(F \cup \neg c_i) = \mathrm{wis}(F) - \mathrm{wis}(F \cup c_i)$. We say that $F$ is frequent if $\mathrm{wis}(F) \ge \theta$, where $\theta$ is a user-specified minimum frequent threshold.
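Definitions 2 and 3 can be made concrete with a small sketch. Two points are assumptions of this illustration rather than statements from the text: a feature is matched as an ordered, not necessarily contiguous, subsequence, and the minimum prefix length is taken to be the 1-based position at which the earliest match completes. The function names `mpl` and `wis` are likewise our own.

```python
def mpl(feature, sequence):
    """Minimum prefix length of `feature` in `sequence`, or None if absent.

    Assumption: `feature` is a non-empty tuple matched as an ordered
    subsequence; the result is the 1-based position where the earliest
    match is completed (greedy left-to-right matching finds it).
    """
    matched = 0
    for position, item in enumerate(sequence, start=1):
        if item == feature[matched]:
            matched += 1
            if matched == len(feature):
                return position
    return None


def wis(feature, class_sequences):
    """Sum of 1/MPL over the sequences of one class, divided by the class size."""
    total = 0.0
    for sequence in class_sequences:
        length = mpl(feature, sequence)
        if length is not None:
            total += 1.0 / length  # earlier matches contribute more
    return total / len(class_sequences)


# Toy class of three sequences over single-character items.
toy_class = ["abc", "bca", "xyz"]
value = wis(("a",), toy_class)  # (1/1 + 1/3 + 0) / 3
```

The reciprocal weighting is what encodes earliness: a feature that is fully observed after one invocation counts three times as much as one observed only after three.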
Definition 4 (discriminative feature). Let $F$ be a feature and let $c_i$ be a class. The discriminative power of $F$, denoted by DF, is calculated as follows:

$$\mathrm{DF}(F) = \log \frac{\varepsilon + \mathrm{wis}(F \to c_i)}{\varepsilon + \mathrm{wis}(F \to \neg c_i)}, \tag{8}$$

where $\varepsilon$ is a regulation factor. Specifically, we say that $F$ is a discriminative feature if $\mathrm{DF}(F) \ge \delta$, where $\delta$ is a user-specified minimum discriminative threshold.
The rationale behind Definition 4 is intuitive. If a feature $F$ often occurs in class $c_i$ but rarely in other classes (i.e., $\neg c_i$), we consider it a feature that discriminates $c_i$ well from the other classes. Moreover, since $\mathrm{Supp}(F \to \neg c_i)$ may be zero, we add a regulation factor $\varepsilon$ to avoid a zero denominator.
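To illustrate the role of the regulation factor, the sketch below assumes that the discriminative power is the logarithm of the ratio of the regularized in-class support to the regularized out-of-class support, which is how the expression in Pruning Rule 3 can be read. The function name, the argument names, and the default value of `eps` are hypothetical.

```python
import math


def discriminative_power(wis_in_class, wis_in_others, eps=1e-3):
    """Log-ratio discriminative power (assumed form).

    `eps` plays the role of the regulation factor: it keeps the
    denominator positive when the feature never occurs outside the class.
    """
    return math.log((eps + wis_in_class) / (eps + wis_in_others))


balanced = discriminative_power(0.5, 0.5)    # equally frequent in both: 0
exclusive = discriminative_power(0.45, 0.0)  # never seen outside the class
shared = discriminative_power(0.45, 0.1)     # also seen elsewhere
```

A feature that is equally frequent inside and outside the class scores zero, and the score grows as the out-of-class support shrinks, so a threshold on this value separates class-specific features from generally frequent ones.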
Definition 5 (concise feature).Given a specific class label   and a discriminate feature F, the discriminative power of F is no less than that of a longer feature F  and conf(F  →   ) ≥ conf(F →   ), we say that F is concise with respect to   , where conf(F →   ) = wis  (F  )/wis  (F).
Definition 5 is also understandable.This is because if we have a shorter feature F  , the discriminative power of which is no less than that of a longer feature F such that F  ⊑ F, there is no need to use F instead of F  for classification.That is, we prefer a feature of shorter size but stronger discriminative power.In this sense, we refer to such a feature as a concise feature.
Problem Statement. Given a Web service execution log $D$, a minimum frequent threshold $\theta$, a regulation factor $\varepsilon$, and a minimum discriminative threshold $\delta$, our goal is to find all sequence rules satisfying both Definitions 3 and 5, that is, the EC rules.

The EC-Miner Algorithm.
In this section, we detail the proposed EC rule mining algorithm, namely, EC-Miner. The main idea is formalized in Algorithm 2. The mining process is exemplified by a prefix tree as shown in Figure 2, which is built on Table 2 with the minimum wis threshold set to 0.1. As seen from Figure 2, there are four services $s_1$, $s_2$, $s_3$, and $s_4$ at level 1. At level 2, we obtain the different status information for each service in descending order. Different from the traditional support computing method, we consider both concise and early characteristics. Therefore, the order obtained using wis for each item also differs from that of traditional approaches. For example, at first, the algorithm scans Table 2 once and computes the wis of each item. After this computation, we generate the candidate early features for the second level. We can see 11 candidate 1-features at level 2. The wis values of the items are $s_1^4$: 0.45, $s_1^1$: 0.3, $s_1^0$: 0.15, $s_2^2$: 0.15, $s_2^3$: 0.125, $s_3^1$: 0.125, $s_1^2$: 0.1, $s_3^2$: 0.1, and so on, where the number after the colon denotes the wis value of the item. Next, we generate the candidate early 2-features for the third level. For example, $s_1^4 s_2^3$: 0.25 denotes that the wis of $s_1^4 s_2^3$ is 0.25. A feature $F$ with a solid box represents the corresponding rule. For example, the class label with a solid box under $s_1^4 s_2^3$ means that the rule $s_1^4 s_2^3 \to$ successful can be deduced.
At last but not the least, discriminative-based pruning rule 3 is very important but not difficult to be understood (see Section 5.3).For example, candidate feature  4  1 is removed by line 14 because wis  ( 4  1 → ) = wis  ( 4 1 → successful).In Figure 2, the pruning rule 3 is applied which is marked by C. A complete pseudocode for mining optimal EF sets is presented in Algorithm 2.
Algorithm 2 covers wis-based pruning, concise-based pruning, and discriminative-based pruning. Most of the existing algorithms find an interesting rule set by postpruning. However, this may be very inefficient, especially when the minimum support is low, since a large number of redundant rules will be generated. Our EC-Miner algorithm makes use of the interestingness measure property to efficiently prune uninteresting rules and saves only the maximal interesting rules instead of all of them. This distinguishes it from other association rule mining algorithms.
Function 1 generates candidate item sets. All generated candidates are built on the prefix tree structure. We adopt the $F_{k-1} \times F_{k-1}$ merge strategy [21] to obtain the candidate item sets. After rules have been formed, we can prune many redundant rules.
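As a rough illustration of candidate generation by merging, the sketch below joins two features of length k-1 that share their first k-2 items into one candidate of length k. It is a generic, itemset-style version written for this explanation: the function name is hypothetical, the subsequent frequency and pruning checks are omitted, and the prefix tree bookkeeping of Function 1 is not modeled.

```python
def generate_candidates(features):
    """Merge (k-1)-features that share a common (k-2)-prefix into k-candidates.

    `features` is an iterable of equal-length tuples. After sorting,
    features with the same prefix are adjacent, so the inner loop can
    stop at the first prefix mismatch.
    """
    ordered = sorted(features)
    candidates = []
    for index, left in enumerate(ordered):
        for right in ordered[index + 1:]:
            if left[:-1] != right[:-1]:
                break  # no later feature shares this prefix
            candidates.append(left + (right[-1],))
    return candidates


# Three 2-features; only the two starting with "a" can be merged.
two_features = [("a", "b"), ("a", "c"), ("b", "c")]
three_candidates = generate_candidates(two_features)
```

Merging only features with a shared prefix mirrors how siblings under the same parent in a prefix tree are combined, which avoids enumerating every possible extension.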

The Pruning Strategies.
To improve the efficiency of EC-Miner, we devise a series of pruning rules.
Pruning Rule 1 (pruning by wis). Given the minimum frequent threshold $\theta$, a feature $F$ with any of its proper supersets $F'$, and a class $c_i$, if $0 \le \mathrm{wis}(F \to c_i) \le \theta$, then neither $F \to c_i$ nor $F' \to c_i$ is an EC rule.
Proof. Once $0 \le \mathrm{wis}(F \to c_i) \le \theta$ is observed, it is not necessary to search for more specific rules $F' \to c_i$, because $\mathrm{wis}(F' \to c_i) \le \mathrm{wis}(F \to c_i) \le \theta$. So the search for target $c_i \in C$ terminates at candidate rule $F \to c_i$.
Instead of the global support, pruning rule 1 uses the intraclass support of a feature with respect to a specific class. This is because a feature $F$ in $c_i$ is hardly frequent if $c_i$ is rare in the service execution log. Thus, pruning rule 1 can greatly reduce the number of redundant rules. This is different from association rule mining.
Pruning Rule 2 (pruning by conciseness). Given a feature $F$ with any of its proper supersets $F'$ and a class $c_i$, if $\mathrm{wis}(F) = \mathrm{wis}(F')$, then feature $F'$ and all its proper supersets can be pruned.
Proof. In the proof, we show that confidence(F
Pruning Rule 3 (pruning by discrimination). Given a feature $F$, if $\log\bigl((\varepsilon + \mathrm{wis}(F \to c_i))/(\varepsilon + \mathrm{wis}(F \to \neg c_i))\bigr) < \delta$, then $F$ is not a discriminative prediction rule. Here $\varepsilon$ and $\delta$ are specified by the user.
Proof. If $\log\bigl((\varepsilon + \mathrm{wis}(F \to c_i))/(\varepsilon + \mathrm{wis}(F \to \neg c_i))\bigr) < \delta$, then $F$ is relatively frequent in different classes. Hence, feature $F$ does not have the ability to distinguish between classes, because it does not satisfy Definition 4.
The above pruning rules are very efficient since they only generate a subset of frequent features with great interestingness instead of all ones.Finally, the EC rules set is significantly smaller than an association rule set but is still too large for decision practitioners to review them all.Next, we give an ELM-based diversified feature selection method to further reduce the size of EC rules.

ELM-Based Diversified Feature Selection
As mentioned earlier, the EC-Miner algorithm generates a set of optimal feature sets (rules); however, their number may still be rather large. An enormous number of features makes it difficult to understand and further analyze the classification or prediction results. In this section, we study how to construct a classifier of high classification (prediction) accuracy by extracting a small number of feature sets as the representative of all mined results.
In feature selection for data analysis, most current methods rank the attributes according to their individual discriminative power with respect to the target class and then select the top-$k$ ranked attributes. These methods cannot remove redundant features. A number of studies [22] point out that simply combining highly ranked features often does not form a better feature set, because these features may be highly correlated. The drawback of redundancy among selected features is twofold. On one hand, the selected feature set can have a less comprehensive representation of the target class than one of the same size but without redundant features; on the other hand, redundant features may unnecessarily increase the size of the selected feature set, which may reduce the classifier performance. Besides being unable to handle redundant features, most ranking based methods determine the number of features to be selected arbitrarily.
To address the above issues, we propose an ELM-based diversified feature selection method in this section.Before describing it, we first give a diversity function as follows: where ( 1 ) is the set of samples which contain  1 as a significant chain, ( 1 ) is the set of items involved in  1 , LCS( 1 ,  2 ) is the longest common feature of  1 and  2 , and symbol "| |" denotes the length of a pattern.In (10), the diversity between two early rules  1 and  2 , that is, Div( 1 ,  2 ), is measured from two aspects: support sequences and involved items.If  1 and  2 have few common support sequences, they should have high diversity.Similarly, if LCS of  1 and  2 is short, they should have high diversity.
Based on (10), we can construct a diversity graph in the following way. For a list of results $FS = \{r_1, r_2, \ldots\}$, the corresponding diversity graph, denoted as $G(FS) = (V, E)$, is an undirected graph such that, for any result $r_i \in FS$, there is a corresponding node $v_i \in V$ and, for any two results $r_i \in FS$ and $r_j \in FS$, there is an edge $(v_i, v_j) \in E$ if and only if $\mathrm{Div}(r_i, r_j) \le \gamma$ (a user-specified threshold). The problem of finding a set of diversified rules that represent all mined results is now equivalent to finding an independent dominating set of $G(FS)$. Further, we require the number of selected features to be as small as possible to reduce the complexity of the classifier. Thus, the problem of diversified feature selection can be viewed as an instance of finding a minimum independent dominating set of $G(FS)$, which is NP-hard [23].
Since it is difficult to find the optimal solution, we adopt a greedy algorithm to address this problem. Given the result set $FS$ and a set of selected results $FS'$, the algorithm incrementally selects patterns from $FS - FS'$ with a diversity guarantee. A pattern $r_i \in FS - FS'$ is selected if $\forall r_j \in FS'$, $\mathrm{Div}(r_i, r_j) > \gamma$. If there are several such alternative $r_i$'s in a selection, the one corresponding to the node with the most neighbors is selected. Note that, at the beginning, the set $FS'$ is empty. The algorithm picks the most significant pattern, that is, an irreducible sequence of the largest confidence value, and inserts it into $FS'$. As follows from the above, the process may yield more than one final $FS'$. For example, there may be several sequences of the largest confidence value, and there may be more than one node with the same number of neighbors. In such cases, we use ELM to evaluate every possible candidate. The one with the largest prediction accuracy is selected. Algorithm 3 formalizes the process.

Input: a set of feature sets ($FS$)
Output: the selected feature subset $FS'$
(1) Let $r$ be the feature set of the largest confidence
(2) $FS' = \{r\}$
(3) while there is a node in $FS - FS'$ not dominated by $FS'$ do
(4)     Find a pattern $r_i \in FS - FS'$ s.t. $\forall r_j \in FS'$, $\mathrm{Div}(r_i, r_j) > \gamma$, and the number of the neighbors of node $r_i$ is largest; $FS' = FS' \cup \{r_i\}$
(5) end while
(6) use ELM to evaluate every possible $FS'$
(7) return the $FS'$ of the highest accuracy on ELM
Algorithm 3: The FS algorithm.

Table 3 (fragment). Attribute: WSDL address; description: location of the Web service definition language (WSDL) file on the Web; unit: none.
The greedy algorithm can be viewed as a hybrid of the filter model and the wrapper model in feature selection, which achieves a better trade-off between the two.Better than the filter model, it explicitly removes redundancy among the selected features and determines the number of the selected features automatically.Compared with the wrapper model, it is of less computation cost.
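The greedy selection loop can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the diversity function is passed in as a symmetric callable `div`, the threshold is called `gamma`, ties are broken by input order, and the final step of evaluating alternative subsets with ELM is left out. The Jaccard-style `toy_div` exists only to make the example runnable and is not the paper's diversity function.

```python
def greedy_select(features, confidence, div, gamma):
    """Greedily pick an independent dominating set of the diversity graph.

    Two features are neighbors when div(f, g) <= gamma. The first pick is
    the feature of largest confidence; each later pick is an undominated
    feature with the most neighbors.
    """
    neighbors = {
        f: {g for g in features if g != f and div(f, g) <= gamma}
        for f in features
    }
    first = max(features, key=lambda f: confidence[f])
    selected = [first]
    dominated = {first} | neighbors[first]
    while len(dominated) < len(features):
        # Undominated features are not adjacent to any selected one,
        # so their diversity to every selected feature exceeds gamma.
        remaining = [f for f in features if f not in dominated]
        best = max(remaining, key=lambda f: len(neighbors[f]))
        selected.append(best)
        dominated |= {best} | neighbors[best]
    return selected


def toy_div(f, g):
    """Toy diversity: one minus the Jaccard similarity of the item sets."""
    a, b = set(f), set(g)
    return 1.0 - len(a & b) / len(a | b)


toy_features = ["ab", "abc", "xy"]
toy_confidence = {"ab": 0.9, "abc": 0.8, "xy": 0.7}
chosen = greedy_select(toy_features, toy_confidence, toy_div, gamma=0.5)
```

In the toy run, "abc" is close to the already selected "ab" and is therefore represented by it, while "xy" shares nothing with either and is selected as a second representative.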

Experiments Result Analysis
In this section, we design a series of experiments to verify the performance of the proposed method.For brevity, we refer to the algorithm of diversified feature selection based ELM as F-ELM.We select two different scenarios: one is Web service quality prediction and the other is Web service fault diagnosis prediction.
We use two kinds of datasets. For the real dataset, we use E. Al-Masri and Q. H. Mahmoud's QoS dataset [20] (downloaded from http://www.uoguelph.ca/~qmahmoud/qws/), which includes twelve attributes (x1 to x12) as shown in Table 3, where the attributes x1 to x9 are used as explanatory variables and the attribute x10 is used as the target variable. Attributes x11 and x12 are ignored, as they do not contribute to the analysis.
For the artificial datasets, we obtain the Web service datasets by simulating a network environment and a general network topology graph with ERITETool, using three input parameters: the number of network nodes (#Web service), the number of embedded classes (#), and the percentage of embedded faults (%). The system selects composite services by matching I/O operations.

Analysis of Efficiency.
In this set of experiments, we refer to the diversified feature selection based ELM as F-ELM and the original ELM as ELM. The efficiency of F-ELM is studied by showing how response time varies with service nodes and service categories. In Figure 3, we compare the training time and the testing time of F-ELM and ELM for the same number of categories (3) as the number of Web services increases from 50 to 300. In Figures 4(a) and 4(b), we compare the training time and the testing time of F-ELM and ELM, respectively, where the number of service categories varies from 2 to 8 while the number of Web services is fixed to 200.
As seen from Figures 3(a) and 4(a), training time decreases with service nodes increasing. We note that the total training time of F-ELM is a bit longer than that of the original ELM as the number of service nodes increases. The same holds when the number of service categories increases. This is because increasing the number of service nodes (service categories) may lead to more rules to be evaluated and pruned in ELM-based diversified feature selection.
However, both Figures 3(b) and 4(b) show that F-ELM outperforms ELM in testing time, and the advantage becomes more substantial with a larger dataset (more categories). This is because ELM has to perform a time-consuming check of all feature sets, whereas F-ELM only checks a series of early and concise interesting features. Although the test time changes little, F-ELM is still consistently faster than ELM.

Classification Accuracy.
The following evaluation criteria are used to measure the performance of F-ELM, ELM, and SVM.
Accuracy denotes the proportion of correctly classified service sequences in the whole service sequence set:

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$

Recall denotes the proportion of correctly classified service sequences with respect to a specific class:

$$\text{recall} = \frac{TP}{TP + FN}.$$

The $F$-measure ($F_1$ score) is the harmonic mean of precision and recall, where precision $= TP/(TP + FP)$. Since precision and recall cannot both reach their optimum simultaneously, the $F_1$ score measures both criteria and assumes that the weight of precision is equal to the weight of recall:

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$$

For the artificial dataset, Figure 5 presents the classification comparison of F-ELM, the original ELM algorithm, and SVM as the number of categories changes. The precision comparison is presented in Figure 5(a), the recall comparison in Figure 5(b), and the comparison of $F_1$ scores in Figure 5(c). Figure 6 presents the classification comparison of F-ELM, ELM, and SVM as the datasets change. The precision comparison is shown in Figure 6(a), the recall comparison in Figure 6(b), and the comparison of $F_1$ scores in Figure 6(c). All six figures demonstrate that F-ELM is better than ELM and SVM on each of the three criteria, both when the categories vary and when the datasets change.
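For reference, the evaluation criteria can be computed from confusion-matrix counts as in the sketch below. The function name and the example counts are illustrative only and do not come from the experiments; the function assumes that none of the denominators is zero.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts.

    Assumes tp + fp > 0, tp + fn > 0, and precision + recall > 0.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1


# Illustrative counts: 8 true positives, 2 false positives,
# 2 false negatives, 8 true negatives.
accuracy, precision, recall, f1 = binary_metrics(tp=8, fp=2, fn=2, tn=8)
```

Because the harmonic mean is dominated by the smaller of its two arguments, a classifier cannot obtain a high $F_1$ score by trading recall for precision or vice versa.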
The features selected by ELM-based diversified feature selection involve just six attributes out of the original twelve. To show how the selected features of the six attributes affect the performance of a classifier, we compared the training time, the testing time, and the accuracies of six different classifiers, that is, ELM, SVM, CART, J48, Treenet, and BPNN, on the features of the six attributes and on all the original attributes, respectively. The results are shown in Table 4. As seen, the performance of a classifier on the features of the six attributes is always better than that on the features of all the original attributes. This confirms that these classifiers can benefit from the selected features. Moreover, F-ELM performs best among all the introduced classifiers under the same attribute setting. This is because F-ELM exploits the relationship among attributes as the features and removes the redundancy among the selected features, while the other methods do not. A $t$-test is utilized to evaluate whether the accuracy difference between F-ELM and a comparative method is statistically significant. Since 10-fold cross-validation is used and $t_{0.01}(49)$ is about 2.678, values larger than 2.678 indicate a statistically significant difference. Thus, F-ELM does outperform the comparative methods on effectiveness. We also compare the accuracy of different algorithms on a real microarray dataset, that is, the Leukemia dataset, which contains 7129 genes, 38 training samples, and 34 testing samples. The results are reported in Table 5. Since all the $t$-test values are larger than $t_{0.01}(33) = 2.733$, F-ELM still outperforms the other comparative methods on accuracy with statistical significance. Thus, it is reasonable to say that the proposed method could be applied in a wider range of applications. Additionally, we conducted a set of experiments comparing the proposed feature selection method with six other feature selection methods, which are often used as comparative
methods in machine learning for feature selection studies.The six methods are information gain (IG), twoing rule (TR), sum minority (SM), max minority (MM), Gini  space leads to the less running time.However, this does not indicate that we should choose  as large as possible.Figure 8(a) shows that too large  may deteriorate classification accuracy.This is because many rules of potentially high usability will be pruned at a high  level.Also, the accuracy at a too low  level is not very good due to the "overfitting"  The results can also be explained in a similar way as those for Figures 7(a) and 8(a), respectively.Differently, Figures 7(c) and 8(c) show that  rarely affects the running time of the feature selection and classification accuracy.This is because  is introduced just for avoiding the case where the denominator of ( 8) is zero. is often set to a very low value, the effect of which is dominated by other values setting in (8).
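The significance check used earlier in this section, where a t-test value is compared against a critical value such as 2.678 for 49 degrees of freedom, can be sketched as a paired t-test on accuracy measurements. This is only an illustration under stated assumptions: the text does not say how the 50 paired measurements arise from 10-fold cross-validation, the function names are hypothetical, and the accuracy values below are synthetic rather than the paper's results.

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(acc_a: Sequence[float], acc_b: Sequence[float]) -> float:
    """t statistic of the paired differences acc_a[i] - acc_b[i] (n - 1 degrees of freedom)."""
    if len(acc_a) != len(acc_b) or len(acc_a) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    if std_err == 0:
        raise ValueError("all differences are identical; t is undefined")
    return statistics.mean(diffs) / std_err


def is_significant(t_value: float, critical_value: float) -> bool:
    """True when the t statistic exceeds the critical value for the chosen level."""
    return t_value > critical_value


# Critical value quoted in the text for 49 degrees of freedom.
T_CRIT_DF49 = 2.678

# Synthetic accuracies for 50 paired runs (an assumption, for illustration only).
acc_other = [0.80 + 0.001 * (i % 10) for i in range(50)]
acc_f_elm = [b + 0.02 + 0.001 * (i % 5) for i, b in enumerate(acc_other)]

t_value = paired_t_statistic(acc_f_elm, acc_other)
print(f"t = {t_value:.3f}, significant: {is_significant(t_value, T_CRIT_DF49)}")
```

The critical value must match the degrees of freedom of the sample at hand; for the Leukemia comparison the text uses 2.733 with 33 degrees of freedom instead.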

Conclusions
In this paper, we propose an ELM-based service quality prediction framework. Because services run in highly dynamic and uncontrollable circumstances, the framework is required to trigger service quality prediction as early as possible. Using the prefix tree based algorithm EC-Miner, a series of candidate rule sets is found first, taking both the earliness and the conciseness of the rules into account. Then, an ELM-based diversified feature selection algorithm is proposed to refine the candidate rule set: a small subset of high-quality features is selected as the representative of the whole candidate rule set, and a greedy algorithm is presented to approximate the optimal solution. Experimental results show that the proposed approach significantly improves the efficiency and the effectiveness of ELM with respect to some widely used feature selection techniques.
) and 4(b), we compare the training time and the testing time of F-ELM and ELM, respectively, where the number of service categories varies from 2 to 8 while the number of Web services is fixed at 200.

Figure 3: Training/testing time comparison between F-ELM and ELM.

Figure 5: Performance comparison versus number of categories.

Figure 6: Performance comparison versus number of nodes.

Table 1: Composite QoS status information.

Table 2: An example of service execution instances.

Table 3: QWS dataset attributes and their descriptions.