Distributed Learning over Massive XML Documents in ELM Feature Space

With the exponentially increasing volume of XML data, centralized learning solutions are unable to meet the requirements of mining applications with massive training samples. In this paper, a solution to distributed learning over massive XML documents is proposed, which provides distributed conversion of XML documents into a representation model in parallel based on MapReduce and a distributed learning component based on Extreme Learning Machine (ELM) for classification or clustering tasks. Within this framework, training samples are converted from raw XML datasets with better efficiency and representation ability and fed to distributed learning algorithms in ELM feature space. Extensive experiments are conducted on massive XML document datasets to verify the effectiveness and efficiency of both classification and clustering applications.


Introduction
Classification and clustering are two major problems of XML document mining tasks. One of the most important parts of mining XML documents is to convert them into a representation model. Most traditional representation models are designed for plain text mining applications, taking no account of the structural information of XML documents. Vector Space Model (VSM) [1] is one of the most classic and popular representation models of plain text. Previous work proposed approaches that consider both semantic and structural information for XML document classification, among which Structured Link Vector Model (SLVM) [2] extends VSM to generate a matrix by recording the attribute values of each element in an XML document. Reduced Structured Vector Space Model (RS-VSM), proposed in [3], achieves higher performance due to its feature subset selection method; different weights are assigned to different elements according to their priority and representation ability. Distribution based Structured Vector Model (DSVM) [4] further improves the calculation of traditional Term Frequency Inverse Document Frequency (TFIDF) values and takes two factors into consideration, namely, Among Classes Discrimination and Within Class Discrimination.
Extreme Learning Machine (ELM) was proposed by Huang et al. in [5, 6] based on generalized single-hidden layer feedforward networks (SLFNs). With its variants [7-11], ELM achieves extremely fast learning and good generalization in many application fields, including text classification [12], multimedia recognition [13-15], bioinformatics [16], and mobile objects [17]. Recently, Huang et al. in [18] pointed out that (1) the maximal margin property of Support Vector Machine (SVM) [19] and the minimal norm of weights theory of ELM are consistent and (2) from the standard optimization point of view, ELM for classification and SVM are equivalent. Furthermore, it is proved in [20] that (1) ELM provides a unified learning platform with a widespread type of feature mappings and (2) ELM can be applied to regression and multiclass classification directly with a single formula. ELM can be linearly extended to SVMs [18], and SVMs can apply the ELM kernel to get better performance due to its universal approximation capability [10, 11, 21] and classification capability [20].
It is generally believed that all ELM based algorithms consist of two major stages [22]: (1) random feature mapping and (2) output weight calculation. The first stage, generating the feature mapping randomly, is the key concept of ELM theory that distinguishes it from other feature learning algorithms. In view of the good properties of the ELM feature mapping, most existing ELM based classification algorithms can be viewed as supervised learning in ELM feature space. In [23], unsupervised learning in ELM feature space is studied, drawing the conclusion that the proposed ELM k-Means algorithm and ELM NMF (nonnegative matrix factorization) clustering obtain better clustering results than traditional algorithms in the original feature space.
Recently, the volume of XML documents has kept explosively increasing in various kinds of web applications. Since the larger the training sample is, generally the better the learning model will be trained [24], it is a great challenge to implement distributed learning solutions that process massive XML datasets in parallel. MapReduce [25], introduced by Google to process parallelizable problems across huge datasets on clusters of computers, provides tremendous parallel computing power without concerns for the underlying implementation and technology. However, the MapReduce framework requires distributed storage of the datasets and no communication among mappers or reducers, which brings challenges to (1) converting XML datasets into a global representation model and (2) implementing learning algorithms in ELM feature space.
To the best of our knowledge, this paper is the first to discuss massive XML document mining problems. We present a distributed solution to XML representation and learning in ELM feature space. Since the raw XML datasets are stored on a distributed file system, we propose algorithm DXRC to convert the XML documents into training samples in the form of an XML representation model using a MapReduce job. With the converted training samples, we apply PELM [26] and POS-ELM [27] to realize supervised learning and propose a distributed k-Means in ELM feature space based on the ELM k-Means proposed in [23]. The contributions can be summarized as follows.
The remainder of this paper is structured as follows. Section 2 introduces XML document representation models and proposes a distributed converting algorithm to represent XML documents stored on a distributed file system. Extreme Learning Machine feature mapping is presented in Section 3. Section 4 presents classification algorithms based on distributed ELMs, and Section 5 proposes a distributed clustering algorithm in ELM feature space based on MapReduce. Section 6 compares the performance of the distributed classification algorithms and evaluates the proposed distributed clustering algorithm. Section 7 draws conclusions.

Distributed XML Representation
In this section, we first introduce the representation model of XML documents and then propose a distributed converting algorithm, which is able to generate global feature vectors for all the XML documents stored on a distributed file system.

XML Representation Model.
For learning problems on texts, such as XML and plain documents, the first important task is to convert the original documents into a representation model. Vector Space Model (VSM) [1] is often used to represent plain text documents, taking term occurrence statistics as feature vectors. However, representing an XML document in VSM directly loses the structural information. Structured Link Vector Model (SLVM) is proposed in [2] based on VSM to represent semistructured documents, containing both semantic and structural information. SLVM is defined as

d_slvm = [d_1, d_2, ..., d_m],

where d_i is the feature vector of the i-th XML element, calculated as

d_i = Σ_{j=1}^{n} TF(w_j, doc.e_i) · IDF(w_j) · ε_i,

where w_j is the j-th term and ε_i is a unit vector corresponding to the element e_i.
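To make the construction concrete, the sketch below assembles an SLVM-style term-by-element matrix from a toy document. The element names, the term list, and the use of raw term frequencies only (omitting the IDF weighting) are simplifying assumptions for illustration, not the exact model of [2].

```python
import xml.etree.ElementTree as ET
from collections import Counter

def slvm_matrix(xml_text, terms, elements):
    """Build a |terms| x |elements| SLVM-style matrix of raw term
    frequencies: one VSM column per XML element (sketch only; the
    full SLVM also multiplies in IDF weights)."""
    root = ET.fromstring(xml_text)
    matrix = [[0] * len(elements) for _ in terms]
    for j, name in enumerate(elements):
        # concatenate the text of all elements with this tag name
        text = " ".join(e.text or "" for e in root.iter(name))
        counts = Counter(text.lower().split())
        for i, term in enumerate(terms):
            matrix[i][j] = counts[term]
    return matrix

doc = "<doc><title>elm learning</title><body>distributed elm mining</body></doc>"
m = slvm_matrix(doc, terms=["elm", "mining"], elements=["title", "body"])
# m[0] counts "elm" per (title, body); m[1] counts "mining"
```

Each column of the resulting matrix is an ordinary VSM for one element, matching the "array of VSMs" view of SLVM.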
In SLVM, each d_slvm is a feature matrix in R^{n×m}, which can be viewed as an array of VSMs. Each d_i consists of the feature terms corresponding to the same XML element and is an n-dimensional feature vector for that element unit.
Based on SLVM, in [3], we proposed Reduced Structured Vector Space Model (RS-VSM), which not only inherits the advantages of representing structural information from SLVM, but also achieves better performance due to feature subset selection based on information gain. We also proposed Distribution Based Structured Vector Model (DSVM) in [4] to further strengthen the representation ability. Two improved interacting factors were designed: Among Classes Discrimination (ACD) and Within Class Discrimination (WCD). A revised IDF was also introduced to indicate the importance of a feature term in other classes more precisely.
Here d_i is the i-th term feature, where m is the number of elements in document doc, doc.e_i is the i-th element of doc, and ε_i is the corresponding unit vector.

The map function in Algorithm 1 accepts key-value pairs and the MapReduce job context as input. The key of each key-value pair is the XML document ID and the value is the corresponding XML document content. A HashMap (Line 1) is used to cache all the elements of one XML document (Lines 2-11), using the element name as key and another HashMap (Line 4) as value; the inner HashMap caches the TF values of all the words in one element (Lines 5-10). That is, for each XML document, the number of items in the outer HashMap equals the number of XML elements; for each element, there are as many items in the inner HashMap as there are distinct words in this element. Each cached item is then emitted as output keyed by word, with a value recording the document ID, the element, and the TF statistics (Lines 12-17).
After the key-value pairs are emitted by the map function, all the pairs with the same key, that is, all the pairs for the same word across the XML documents, are combined and passed to the same reduce function (Algorithm 2) as input. For each key-value pair processed by the reduce function, two HashMaps are initiated (Lines 1, 2): one caches the weighted TF values of the word in each element of the corresponding XML document, and the other caches the number of documents containing this word. The total number of documents (Line 3) and the vector of element weights (Line 4), which indicates the weights of all the elements in each XML document, are obtained through the distributed cache defined in the MapReduce job configuration. Since reduce now has all the TF values grouped by XML element along with their weights, the weighted TF values (Line 6) and the number of documents containing each word are calculated and cached in the two HashMaps, respectively (Lines 5-12). The IDF value can then be calculated (Line 14) and multiplied by each cached TF item. The output of reduce is a set of key-value pairs whose key, a ⟨document, element⟩ pair, indexes an entry of the DSVM matrix and whose value is the entry itself. Finally, the XML representation DSVM can be built from this matrix and the factor F_CD, uploaded onto the distributed file system, and used as the input of the training model.
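The map/reduce flow above can be imitated in a single process. In the sketch below, `dxrc_map`, `dxrc_reduce`, and the element-weight table are hypothetical names; the shuffle stage is simulated with a dictionary, and the IDF formula and the F_CD factor of DSVM are deliberately simplified.

```python
import math
from collections import defaultdict

# map phase: one call per document, emits <word, (doc_id, element, tf)>
def dxrc_map(doc_id, elements):            # elements: {name: [words]}
    for name, words in elements.items():
        n = len(words)
        for w in set(words):
            yield w, (doc_id, name, words.count(w) / n)

# reduce phase: one call per word, sees every occurrence of that word
def dxrc_reduce(word, values, total_docs, weights):
    docs = {d for d, _, _ in values}
    idf = math.log(total_docs / len(docs)) + 1      # simplified IDF
    return {(d, e, word): weights[e] * tf * idf for d, e, tf in values}

corpus = {"d1": {"title": ["elm", "elm"], "body": ["xml"]},
          "d2": {"title": ["xml"], "body": ["elm"]}}
grouped = defaultdict(list)                 # "shuffle": group by word
for doc_id, elems in corpus.items():
    for word, v in dxrc_map(doc_id, elems):
        grouped[word].append(v)
model = {}
for word, vals in grouped.items():
    model.update(dxrc_reduce(word, vals, len(corpus),
                             weights={"title": 2.0, "body": 1.0}))
```

The per-element weights mirror DSVM's idea that some elements (here, `title`) carry more representation ability than others.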

ELM Feature Mapping
Extreme Learning Machine (ELM) randomly generates the parameters of single-hidden layer feedforward networks without iterative tuning, gaining extremely fast learning speed. The output weights can be calculated by matrix operations after the training samples are mapped into ELM feature space.
Given N arbitrary samples (x_i, t_i) ∈ R^n × R^m, ELM is modeled as

Σ_{i=1}^{L} β_i g(w_i · x_j + b_i) = t_j, j = 1, ..., N,

where L is the number of hidden layer nodes, β_i is the output weight from the i-th hidden node to the output nodes, w_i = [w_{i1}, w_{i2}, ..., w_{in}]^T is the input weight vector, and b_i is the bias of the i-th hidden node. g(x) is the activation function used to generate the mapping neurons, which can be any nonlinear piecewise continuous function [22], including the Sigmoid function (5) and the Gaussian function (6):

Sigmoid(w, b, x) = 1 / (1 + exp(−(w · x + b))),
Gaussian(w, b, x) = exp(−b ‖x − w‖²).
Figure 1 shows the structure of ELM with multiple output nodes and the feature mapping process (Figure 1: ELM structure and ELM feature mapping). The three layers of the ELM network are the input layer, the hidden layer, and the output layer. The n input nodes correspond to the n-dimensional data space of the original samples, while the L hidden nodes correspond to the L-dimensional ELM feature space. With the m-dimensional output space, the decision function outputs the class label of the samples.
The ELM feature mapping, denoted by H, is calculated as

H = [h(x_1); h(x_2); ...; h(x_N)], h(x) = [g(w_1 · x + b_1), g(w_2 · x + b_2), ..., g(w_L · x + b_L)],

where h(x) is the hidden layer output row vector for sample x, so that H ∈ R^{N×L} stacks the images of all N training samples in ELM feature space.

Distributed Classification in ELM Feature Space
In this section, we introduce the learning procedure of classification problems in ELM feature space and distributed implementations based on two existing representative distributed ELM algorithms, PELM [26] and POS-ELM [27]. With the feature mapping H, ELM is written compactly as

H β = T,

where T ∈ R^{N×m} is the matrix of class labels. The matrix β is the output weight, which is calculated as

β = H† T,

where H† is the Moore-Penrose generalized inverse of H. ELM for classification is presented as Algorithm 3.
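The two stages (random feature mapping, then β = H†T) can be sketched in a few lines of NumPy. This is an illustration of the generic ELM procedure, not the authors' implementation; the network size, seed, and toy data are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_train(X, T, L=50):
    """ELM: random hidden layer, output weights via Moore-Penrose inverse."""
    n = X.shape[1]
    W = rng.normal(size=(n, L))              # random input weights (never tuned)
    b = rng.normal(size=L)                   # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # sigmoid feature mapping
    beta = np.linalg.pinv(H) @ T             # beta = H^dagger T
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return np.argmax(H @ beta, axis=1)       # class = largest output node

# toy two-class problem with one-hot targets
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.eye(2)[[0, 0, 1, 1]]
pred = elm_predict(X, *elm_train(X, T))
```

Because the hidden layer is never tuned, the only learned quantity is the linear map β, which is why the whole training step reduces to one pseudoinverse.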
The output weight of ELM can also be calculated as

β = H^T (I/C + H H^T)^{-1} T,

where, according to the ridge regression theory [29], the diagonal of a symmetric matrix can be incremented by a biasing constant 1/C to gain better stability and generalization performance [18].

For the case in which the number of training samples is much larger than the dimensionality of the feature space, considering the computation cost, the output weight calculation equation can be rewritten as

β = (I/C + H^T H)^{-1} H^T T.

Distributed Implementations.
Some existing works have introduced distributed implementations of various ELM algorithms. The original ELM was parallelized by PELM in [26]; Online Sequential ELM (OS-ELM) was implemented on MapReduce as POS-ELM in [27].

Parallel ELM.
In the original ELM algorithm, when the number of training samples is much larger than the dimensionality of the ELM feature space and H^T H is nonsingular, the major cost is the calculation of the Moore-Penrose generalized inverse of matrix H, where the orthogonal projection method is used as in (11). The matrix products U = H^T H and V = H^T T can therefore be calculated by a MapReduce job. In the map function, each term of U and V can be expressed as follows [26]:

u_{ij} = Σ_{k=1}^{N} h_{ki} h_{kj}, v_{ij} = Σ_{k=1}^{N} h_{ki} t_{kj}.

In the reduce function, all the intermediate results are merged and added up according to the corresponding elements of the result matrices. Since the training input matrix X is stored sample by sample on different machines, the calculation can be parallelized and executed by the MapReduce job. The calculation procedure is demonstrated in Figure 2.
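The decomposition above rests on the fact that U = H^T H and V = H^T T split row-wise across workers. A minimal single-process imitation, with hypothetical function names standing in for the real map and reduce tasks:

```python
import numpy as np

def pelm_map(H_block, T_block):
    """map: each worker holds a horizontal block of rows of H and T
    and emits its partial products."""
    return H_block.T @ H_block, H_block.T @ T_block

def pelm_reduce(partials):
    """reduce: element-wise sum of the partial U and V matrices."""
    U = sum(p[0] for p in partials)
    V = sum(p[1] for p in partials)
    return U, V

H = np.arange(12, dtype=float).reshape(6, 2)
T = np.ones((6, 1))
# three "mappers", each holding two rows of H and T
partials = [pelm_map(H[i:i + 2], T[i:i + 2]) for i in range(0, 6, 2)]
U, V = pelm_reduce(partials)
assert np.allclose(U, H.T @ H) and np.allclose(V, H.T @ T)
```

Only the small L×L and L×m matrices cross the network, which is what makes the job communication-light.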

Parallel Online Sequential ELM.
The basic idea of Parallel Online Sequential ELM (POS-ELM) is to calculate H_1, ..., H_k in parallel. Taking advantage of OS-ELM's calculation of a partial ELM feature matrix H_k from a chunk of training data, POS-ELM computes each H_k from its own data chunk in the map phase on each machine. The reduce function collects all the H_k and calculates β_{k+1} as

β_{k+1} = β_k + P_{k+1} H_{k+1}^T (T_{k+1} − H_{k+1} β_k),

where

P_{k+1} = P_k − P_k H_{k+1}^T (I + H_{k+1} P_k H_{k+1}^T)^{-1} H_{k+1} P_k.

The calculation procedure of POS-ELM is shown in Figure 3.
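The recursion above can be checked numerically. The sketch below runs the OS-ELM update chunk by chunk, as the single POS-ELM reducer would, and confirms it reproduces the batch least-squares solution; the matrix sizes and random data are arbitrary illustrative choices.

```python
import numpy as np

def os_elm_update(P, beta, H_k, T_k):
    """One OS-ELM step: fold the feature-matrix chunk H_k (computed by one
    mapper in POS-ELM) into the running P and beta."""
    I = np.eye(H_k.shape[0])
    P = P - P @ H_k.T @ np.linalg.inv(I + H_k @ P @ H_k.T) @ H_k @ P
    beta = beta + P @ H_k.T @ (T_k - H_k @ beta)
    return P, beta

rng = np.random.default_rng(1)
H = rng.normal(size=(40, 8))                 # full hidden-layer output matrix
T = rng.normal(size=(40, 2))
# initialize from the first chunk, then stream the rest chunk by chunk
H0, T0 = H[:10], T[:10]
P = np.linalg.inv(H0.T @ H0)
beta = P @ H0.T @ T0
for k in range(10, 40, 10):
    P, beta = os_elm_update(P, beta, H[k:k + 10], T[k:k + 10])
assert np.allclose(beta, np.linalg.pinv(H) @ T)   # matches batch ELM
```

Because P and β are updated in one place, this step is inherently sequential, which is the centralized bottleneck discussed in the evaluation section.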

Distributed Clustering in ELM Feature Space
In this section, in order to improve the efficiency of clustering massive XML datasets, we propose a parallel implementation of the ELM k-Means algorithm, named Distributed ELM k-Means (DEK).


Unsupervised Learning in ELM Feature Space.
It is believed that transforming nonlinear data into some high dimensional feature space increases the probability of linear separability. However, many Mercer kernel based clustering algorithms are not computationally efficient, since the feature mapping is implicit and cannot be guaranteed to satisfy the universal approximation condition. Thus, [23] holds that an explicit feature mapping like the ELM feature mapping is more appropriate.
Generally, the k-Means algorithm in ELM feature space, ELM k-Means for short, has two major steps: (1) transform the original data into ELM feature space and (2) run the traditional clustering algorithm directly. Clustering in ELM feature space is much more convenient than kernel based algorithms.

Distributed ELM k-Means.
For massive XML document clustering applications, implementing unsupervised learning methods on MapReduce is a key part of the problem. Since ELM feature mapping is extremely fast with good generalization performance and universal approximation capability, in this section, we propose Distributed ELM k-Means (DEK) based on ELM k-Means [23].
In the DEK algorithm, the training samples of XML documents are stored on a distributed file system. Each (x_i, t_i) represents a training sample x_i in ELM feature space with its corresponding class label t_i. With a set of k initial cluster centroids, in the map phase, the distances between each centroid and each training sample stored on the local site are calculated; each sample is then assigned to the centroid with the shortest distance. In the reduce phase, all the samples assigned to the same centroid are collected in the same reducer, and a new centroid of each cluster is calculated, with which the set of k cluster centroids is updated. That is, one round of the MapReduce job updates the set of k cluster centroids once. In the next round, the set of centroids is updated again, and this procedure is repeated until convergence or until the maximum number of iterations is reached (Figure 4). Algorithm 4 presents the map function of DEK. For each sample stored on this mapper (Line 1), the distance between the sample and each cluster centroid is calculated (Lines 2, 3). Each sample is then assigned to the cluster whose centroid is nearest to this sample (Line 4). The intermediate key-value pair is emitted in the form of ⟨centroid, x_i⟩ (Line 5), in which x_i is the specific sample and the centroid identifies the assigned cluster of x_i.
Algorithm 5 presents the reduce function of DEK. We add up the sum of all the samples x_i in list(x) in Euclidean space (Lines 1, 2) and then calculate the mean value to represent the new version of the centroid (Line 3).
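One DEK round (map-side assignment, reduce-side centroid update) can be sketched in plain Python as below; the partition layout, function names, and the in-memory "shuffle" dictionary are illustrative stand-ins for the MapReduce machinery, and the samples are assumed to be already mapped into ELM feature space.

```python
import numpy as np

def dek_map(samples, centroids):
    """map: assign every local sample (already in ELM feature space)
    to its nearest centroid, emitting <cluster, sample> pairs."""
    for x in samples:
        d = np.linalg.norm(centroids - x, axis=1)
        yield int(np.argmin(d)), x

def dek_reduce(cluster_samples):
    """reduce: the new centroid of a cluster is the mean of its samples."""
    return np.mean(cluster_samples, axis=0)

def dek_iteration(partitions, centroids):
    buckets = {}
    for part in partitions:                      # each "mapper"
        for c, x in dek_map(part, centroids):
            buckets.setdefault(c, []).append(x)
    new = centroids.copy()
    for c, xs in buckets.items():                # each "reducer"
        new[c] = dek_reduce(xs)
    return new

pts = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
cents = np.array([[0., 0.5], [9., 9.]])
cents = dek_iteration([pts[:2], pts[2:]], cents)   # two mappers, one round
```

In the real algorithm this iteration is one MapReduce job, and the driver reruns jobs until the centroids stop moving or the iteration limit is hit.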

Performance Evaluation
All the experiments are conducted to compare the performance of the proposed algorithms. We also fetched RSS feeds of news and articles in XML format from the IBM DeveloperWorks and ABC News official web sites. Each XML document of the RSS feeds is composed of elements such as title, author, summary, and publication information. In order to compare the performance of the algorithms over different datasets, we choose the same number of XML documents from each of the three datasets: 6 classes with 500 documents in each class.

Parameters.
According to the universal approximation conditions and classification capability of ELM, a large number of hidden nodes guarantees that the data can be linearly separated [23], especially for learning problems on high-dimensional training samples such as XML documents. Thus, after a set of parameter-setting experiments, the only parameter of the learning algorithms in ELM feature space, the number of hidden nodes, is set to 800.

Evaluation Criteria.
To clearly evaluate the performance, three sets of evaluation criteria are utilized.
(1) For scalability evaluation, we compare the criteria of speedup, sizeup, and scaleup. Speedup indicates the scalability when increasing the number of running machines, which is measured as

Speedup(m) = running time on one machine / running time on m machines.

Sizeup indicates the scalability when increasing the data size, which is measured as

Sizeup(m) = running time over m units of data / running time over one unit of data.

Scaleup measures the scalability of processing m-times larger data on an m-times larger cluster, which is calculated as

Scaleup(m) = running time over one unit of data on one machine / running time over m units of data on m machines.
(2) For classification problems, accuracy, recall, and F-measure are used to evaluate supervised learning performance in ELM feature space. Accuracy indicates the overall ratio of correctly classified samples. Recall is the ratio of the samples with a specific class label to the ones classified into this class. F-measure measures the overall performance considering both accuracy and recall, calculated as

F-measure = 2 × accuracy × recall / (accuracy + recall).

Algorithm 5 (reduce function of DEK). Input: ⟨centroid, samples list(x)⟩. Output: updated set of centroids. (1) foreach x_i ∈ list(x) do (2) add x_i to the running sum; (3) set the new centroid to the mean.

(3) For clustering problems, since each sample in the datasets used in our experiments is assigned a class label, we treat this class label as the cluster label. Thus, the same evaluation criteria are used for clustering problems as for classification problems.
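Assuming the definitions above, the scalability criteria reduce to simple ratios, and the F-measure combines the paper's accuracy/recall pair (the exact original formula was lost, so the standard harmonic-mean form is an assumption here):

```python
def speedup(t_one_machine, t_m_machines):
    # > 1 and ideally equal to m on an m-machine cluster
    return t_one_machine / t_m_machines

def sizeup(t_m_units, t_one_unit):
    # ideally equal to m when processing m units of data
    return t_m_units / t_one_unit

def scaleup(t_one_on_one, t_m_on_m):
    # ideally stays at 1 as data and machines grow together
    return t_one_on_one / t_m_on_m

def f_measure(accuracy, recall):
    # harmonic mean of the paper's two classification criteria (assumed form)
    return 2 * accuracy * recall / (accuracy + recall)

# ideal linear scaling on an 8-machine cluster
assert speedup(80.0, 10.0) == 8.0
assert scaleup(10.0, 10.0) == 1.0
```

These helpers make explicit that speedup and sizeup measure against a single-machine or single-unit baseline, while scaleup fixes the per-machine load.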

Scalability of DXRC.
The scalability of the representation converting algorithm DXRC is evaluated first. Figure 5(a) demonstrates the speedup of DXRC. As the number of slave nodes varies from one to eight, the speedup is approximately linear at first, but the growth slows down due to the increasing cost of network communication among the growing number of working machines. In general, however, DXRC gains good speedup. Figure 5(b) presents the sizeup of DXRC.
The x-axis denotes the fraction of the whole dataset: 1 is the full size of the original dataset, and 0.5 indicates half of the original dataset with randomly chosen samples. With a fixed number of slave machines, which is eight, the evaluation result shows good sizeup of DXRC. Since the scaleup of a distributed implementation cannot stay at 1 in practice, the scaleup of DXRC in Figure 5(c) drops slowly as the number of slave nodes and the size of the dataset increase, which indicates good scaleup of DXRC. Note that the representation ability of DSVM as applied in DXRC, and its influence on XML document classification performance, can be found in our previous work [4].

Scalability of Massive XML Classification in ELM Feature Space.
With the training samples converted by algorithm DXRC, a classifier of massive XML documents can be trained based on MapReduce. The speedup comparison between PELM and POS-ELM on the three datasets is presented in Figure 6.
Algorithm PELM, which implements the original ELM on MapReduce, requires calculating the inverse of the ELM feature space matrix, while POS-ELM uses the idea of online sequential processing to realize parallel computation without communication but must calculate the output weight β and the auxiliary matrix P iteratively in a single reducer. This centralized calculation reduces the scalability of both PELM and POS-ELM to some degree, especially POS-ELM. Thus, the speedup of PELM is better than that of POS-ELM.
Figure 7 demonstrates the sizeup comparison between PELM and POS-ELM. From this figure, we find that the sizeup of PELM is better than that of POS-ELM on all three datasets.
For the scaleup comparison, Figure 8 demonstrates that both PELM and POS-ELM have good scaleup performance, and PELM outperforms POS-ELM on each of the three datasets.
In summary, both PELM and POS-ELM have good scalability for massive XML documents classification applications, but PELM has better scalability than POS-ELM.

Performance of Massive XML Classification in ELM Feature Space.
The parallel implementations of PELM and POS-ELM do not alter the computation theory of the original ELM and OS-ELM, respectively; that is, the classification performance of PELM and POS-ELM is nearly the same as that of their corresponding centralized algorithms. The classification results are shown in Table 1.
From the table we can see that PELM slightly outperforms POS-ELM, because the iterative matrix operations on the output weight β in POS-ELM cause a loss of calculation accuracy. However, for massive XML document classification applications, where both the extraction and the reduction of XML document features are complicated, both PELM and POS-ELM provide satisfactory classification performance.

Scalability of Massive XML Clustering in ELM Feature Space.
In this set of experiments, we evaluate the proposed distributed clustering algorithm in ELM feature space, that is, Distributed ELM k-Means. In theory, the scalability of distributed k-Means in ELM feature space and in the original feature space is the same, since the only difference is the feature space of the training samples, which has no influence on the computational complexity. Thus, we only present the scalability of DEK, without comparison against distributed k-Means in the original feature space. The scalability of DEK is evaluated on all three datasets in terms of speedup in Figure 9(a), sizeup in Figure 9(b), and scaleup in Figure 9(c). The experimental results all demonstrate good scalability of DEK for massive XML document clustering applications.

Performance of Massive XML Clustering in ELM Feature Space.
In this set of experiments, the clustering performance of distributed clustering in ELM feature space is compared with clustering in the original feature space for massive XML document clustering applications. Note that, since manual relabeling of the massive XML dataset is infeasible, we only evaluate the clustering quality with the original number of classes, which is six. The comparison results on the three datasets are presented in Table 2. It can be seen from the comparison results that DEK achieves better clustering performance due to its ELM feature mapping.

Conclusion
This paper addresses the problem of distributed XML document learning in ELM feature space, for which, to the best of our knowledge, there is no previous work. The problem of converting XML documents into a parallel representation based on MapReduce is solved by the proposed DXRC algorithm.
Each d_i, corresponding to doc.e_i in SLVM, is now the dot product of an n-dimensional unit vector and an n-dimensional weight vector. IDF_ex is the revised IDF. The factor F_CD is the distribution modifying factor, which equals the reciprocal of the arithmetic product of WCD and ACD.

Distributed Converting Algorithm.
In this section, we propose a distributed converting algorithm, named Distributed XML Representation Converting (DXRC), to calculate the TFIDF [28] values of DSVM based on MapReduce. Since the volume of XML documents is so large that the representation model cannot be generated on a single machine, DXRC realizes the representation of XML documents in the form of DSVM in parallel. The map function and the reduce function of DXRC are presented as Algorithms 1 and 2, respectively.

Algorithm 2 (fragment). (3) totalDocsNum = DistributedCache.get("totalDocsNum"); (4) weights = DistributedCache.get("elementWeightsVector"); (5) foreach itr in the value list do (6) weightedDocEleTF = weights[docId, element] * itr.times / itr.sum; (7) mapDocEleTF.put(⟨docId, element⟩, weightedDocEleTF); (8) if the word is already cached then (9) newTimes = mapTF.get(docId) + itr.getValue();
Algorithm 4 (map function of DEK). Input: training samples X, k centroids C. Output: ⟨centroid, sample x_i⟩. (1) foreach x_i ∈ X do (2) foreach c_j ∈ C do (3) calculate the distance d_j between x_i and c_j; (4) assign x_i to the cluster with the nearest centroid and emit ⟨centroid, x_i⟩.

When all the k cluster centroids are updated in this MapReduce job, if this version of the centroids is the same as the previous one, or if the maximum number of iterations is reached, DEK holds that the clustering job is done; otherwise, DEK continues to the next iteration of the MapReduce job until convergence.

Table 1:
Classification performance comparison between PELM and POS-ELM.

Table 2:
Clustering performance of DEK compared with parallel k-Means.

The problem of massive XML document classification in ELM feature space is studied by implementing PELM and POS-ELM, while, for the problem of massive XML document clustering in ELM feature space, a distributed ELM k-Means algorithm, DEK, is proposed. Experimental results demonstrate that distributed XML learning in ELM feature space shows good scalability and learning performance.