Stream-Based Extreme Learning Machine Approach for Big Data Problems

1Graduate Program in Electrical Engineering, Federal University of Minas Gerais, Avenida Antônio Carlos 6627, 31270-901 Belo Horizonte, MG, Brazil 2Institute of Science and Technology, Federal University of Jequitinhonha and Mucuri Valleys, Rodovia MGT 367, Km 583, 5000 Alto da Jacuba, 39100-000 Diamantina, MG, Brazil 3Department of Electrical Engineering, Federal University of Minas Gerais, Avenida Antônio Carlos 6627, 31270-901 Belo Horizonte, MG, Brazil 4Department of Electronics Engineering, Federal University of Minas Gerais, Avenida Antônio Carlos 6627, 31270-901 Belo Horizonte, MG, Brazil


Introduction
The induction of Supervised Learning models relies on a large enough set of $(x_i, y_i)$ pairs obtained by sampling $x_i$ from the input space $\chi$ according to a probability function $P(X)$ and by querying an oracle function $f_g(x_i)$ for the labels $y_i$. The final goal of learning is to obtain the parameters Z and w of the approximation function $f(x, Z, w)$ so that $f(x, Z, w) \approx f_g(x)$. Convergence conditions to guarantee that $f(x, Z, w) \to f_g(x)$ in $\chi$ depend on the representativeness and size of the learning set $D_L = \{x_i, y_i\}_{i=1}^{N_L}$. Reliable labeling of the input samples $D_U = \{x_i\}_{i=1}^{N_U}$ is of paramount importance to guarantee robustness of the approximation function $f(x, Z, w)$. In any case, $N_L$ should be large enough to guarantee convergence conditions.
In Big Data problems, however, the availability of a large amount of data reveals itself as another important challenge for the induction of supervised models [1]. Learning using the entire dataset can be impracticable for most current supervised classifiers due to their time-consuming training procedures. The problem becomes more evident when labeling is difficult or expensive and data is presented to the learner via data streams [2]. Dealing with Big Data requires some technique to circumvent the need to consider the entire dataset in the learning process. In this context, the sampling probability $P(X)$ can be controlled in order to induce good learning models using fewer patterns.

Mathematical Problems in Engineering
In Supervised Learning it is assumed that the learner has no control of the sampling probability $P(X)$. Nonetheless, the construction of learning machines that may influence $P(X)$ has become a central problem in recent years. This new subfield of Machine Learning, known as Active Learning [2], has received special attention due to the new learning settings that have appeared in application areas such as bioinformatics [3], electronic commerce [4], and video classification [5].
In the new setting, the learner is active and may actually choose the samples from a stream [6] or pool [7] of data to be labeled. The sample selection strategy embodied in the active learner determines the probability $P(X)$ of an input sample being selected for labeling and learning. In the end, the goal of Active Learning is similar to that of Supervised Learning: to induce a learning function $f(x, Z, w)$ that is valid in the whole input domain by behaving as similarly as possible to the label-generator function $f_g(x)$. The goal is to select more representative samples that will result in $f(x, Z, w) \to f_g(x)$. For instance, in a classification problem those samples that are near the separation margin between classes may suffice [8,9] if discriminative models like Support Vector Machine (SVM) [10] and Perceptron-based neural networks [11] are used.
Margin-based Active Learning has usually been accomplished by assuming simple linear separability of patterns in the input space [12-14]. Once a linear separator is obtained from the initial samples, further labeling is accomplished according to a preestablished criterion, usually related to sample proximity to the separator, which is simpler to calculate if the separator is linear. In a more realistic and general approach, however, a nonlinear separator should be considered, which requires that linearization be carried out by mapping the input data into a feature space, where sample selection is actually accomplished. The overall function $f(x, Z, w)$ is composed of the hidden layer mapping function $\psi(x, Z)$ and the output function $\phi(\psi(x, Z), w)$. Since both functions are single layer, $\phi(\cdot, \cdot)$ can only perform linear separation and $\psi(x, Z)$ is expected to linearize the problem. Nevertheless, the difficulty with the nonlinear approach is that in order to obtain $\psi(x, Z)$ some sort of user interaction may be required.
In order to overcome the difficulty of obtaining a user-independent nonlinear feature space mapping, in this paper we present a method that is based on the principles of Extreme Learning Machines (ELMs) [15] to obtain the mapping function $\psi(x, Z)$. The basic principle of ELM is to randomly sample the elements of Z and to expand the input space into a higher dimension in order to obtain $\psi(x, Z)$. This is the most fundamental difference between ELM and both feedforward neural networks and SVM, since in the latter two models the function $\psi(x, Z)$ is obtained by minimizing the output error. In practice, the only parameter required by the ELM projection is the dimension (number of neurons) of the feature space, to which its final performance is not very sensitive.
Although both ELM and SVM are based on two-layer mapping, SVM's kernel provides an implicit mapping whereas ELM is based on the explicit mapping by the hidden layer sigmoidal functions [16,17]. The two models also differ in the way that smoothing of the approximation function is treated, since SVM's performance relies on support vectors and ELM's output is computed considering the whole dataset. The Lagrangian solution of SVM's quadratic programming learning problem yields Lagrange multipliers that, in practice, point to the patterns (the support vectors) that will be used to compute SVM's output. In fact, SVM's output $\hat{y}_j$ for the input pattern $x_j$ is a linear combination of the labels $y_i$ weighted by the kernel $K(x_i, x_j)$ between $x_j$ and all other learning patterns $x_i$: $\hat{y}_j = \mathrm{sign}\left(\sum_{i=1}^{N} \alpha_i y_i K(x_i, x_j) + b\right)$. The linear combination coefficients are the Lagrange multipliers $\alpha_i$ resulting from the solution of the quadratic programming problem, which was formulated with the objective of minimizing the empirical risk and maximizing the separation margin [10]. Since only those patterns with nonzero Lagrange multipliers effectively contribute to the computation of $\hat{y}_j$, SVM's learning can be seen as a sample selection problem. Given the proper kernel parameters, the selection of margin patterns $x_i$ and the Lagrange multipliers $\alpha_i$ yields error minimization and margin maximization [10]. In such a scenario, "discarded" samples are those assigned null Lagrange multipliers $\alpha_i = 0$ and the "selected" ones, the support vectors, are those with Lagrange multipliers in the range $0 < \alpha_i \le C$, where $C$ is a regularization parameter.
SVM's learning approach, however, cannot be directly applied to the Active Learning problem since the whole dataset must be available at learning time so that the optimization problem can be solved. Active Learning methods should be capable of dealing with incremental and online learning [2,14,18], which is particularly convenient for Big Data problems. Nonetheless, the selection strategy presented in this paper aims at patterns near the class separation boundaries, which is expected to result in large-margin separators and in output function smoothing that may compensate for overparametrization [19] of the projection function $\psi(x, Z)$.
The mapping H of X into the feature space is expected to embody a linearly separable problem given a large enough number of projection neurons $p$ [20]. Once the mapping matrix H is obtained, the learning problem is reduced to selecting patterns and to inducing the parameters of the linear separator.
The pseudoinverse approach adopted by the original formulation of ELM [15] to obtain the linear separator results in overfitting when the number of selected patterns tends to the number of neurons ($N \to p$). In such a situation, the number of equations is the same as the number of unknowns and the pseudoinverse yields a zero-error solution [15]. Consequently an overfitted model is obtained due to the large number of hidden neurons $p$ required to separate the data with a random projection. Since in Active Learning the training set size will most likely reach the number of neurons as more patterns are labeled, the zero-error solution effect of the pseudoinverse is unwanted because it may result in a sudden decline in performance near the limit $N \approx p$. Because of that, an alternative to the pseudoinverse solution should be adopted in Active Learning problems.
Recently, Huang et al. [21] proposed a regularized version of ELM that can avoid the zero-error solution for $N \approx p$. For this formulation a regularization parameter should be fine-tuned, which can increase the cost of Active Learning, because some labeled patterns must be set aside for parameter tuning. In addition, since Active Learning is incremental, relearning the whole dataset for every new pattern can be prohibitive. At first sight, the Online Sequential Extreme Learning Machine (OS-ELM) [22] could be a good candidate for Active Learning, because it can learn data one by one or chunk by chunk. However, its formulation demands that the initial model be calculated using at least $N = p$ patterns. In this case $p$ can be large, which implies that the initial learning set should also be large. So, this is not the best option, because the main objective of Active Learning is to minimize the number of labeled patterns necessary to learn [2]. Because of that, in this paper we propose a new incremental learning approach to replace the pseudoinverse-based solutions. The method has an inherent residual term that compensates for the zero-error solution of the pseudoinverse and that can be viewed as implicit regularization.
The method presented in this paper is a classifier composed of a large ELM hidden layer and an output layer learned via a Hebbian Learning Perceptron with normalized weights [23]. The Active Learning strategy relies on a convergence test adapted from the Perceptron Convergence Theorem [11,24]. The learning process is stream-based and each pattern is analyzed once. It is also incremental and online, which is particularly suitable for Big Data. Experimental results have shown that the proposed Active Learning strategy achieved a performance similar to linear SVM with ELM kernel and to regularized ELM. Our approach, however, achieved this performance while learning only a small part of the dataset.
The remainder of this paper is organized as follows: Section 2 describes the foundations of Extreme Learning Machines. Section 3 presents the Hebbian Learning. Section 4 discusses how overfitting can be controlled using Hebbian Learning. Section 5 extends the Perceptron Convergence Theorem [11,24] to the Hebbian Learning with normalized weights [23]. Section 6 presents the principles of our Active Learning strategy. Experimental results are shown in Section 7. Finally, the final discussions and conclusions are provided in Section 8.

Extreme Learning Machines
ELM can be seen as a learning approach to train a two-layer feedforward neural network, Multilayer Perceptron (MLP) type [15].The method has basically the following main characteristics: (1) number of hidden neurons is large, (2) training of hidden and output layers is made separately, (3) hidden nodes parameters are not learned according to a general objective function but randomly chosen, and (4) output weights are not learned iteratively but obtained directly with the pseudoinverse method.
The input matrix X with $N$ rows and $n$ columns contains the input training data, where $N$ is the number of samples and $n$ is the input space dimension. The rows of the $N \times 1$ vector y contain the corresponding labels of each one of the $N$ input samples of X:

$$X = \begin{bmatrix} x_1^T \\ \vdots \\ x_N^T \end{bmatrix}_{N \times n}, \qquad y = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}_{N \times 1} \tag{1}$$

Function $\psi(x, Z, b)$, with argument x, matrix of weights Z, and vector of bias b, maps each one of the rows of X into the rows of the $N \times p$ mapping matrix H, where $p$ is the number of hidden layer neurons ($H = \psi(X, Z, b)$). Activation functions of all neurons are regular sum-and-sigmoid functions:

$$H = \begin{bmatrix} \mathrm{sigmoid}(x_1^T z_1 + b_1) & \cdots & \mathrm{sigmoid}(x_1^T z_p + b_p) \\ \vdots & \ddots & \vdots \\ \mathrm{sigmoid}(x_N^T z_1 + b_1) & \cdots & \mathrm{sigmoid}(x_N^T z_p + b_p) \end{bmatrix}_{N \times p} \tag{2}$$

In the particular case of ELM, since the elements of Z and b are randomly sampled, the number of hidden neurons $p$ is expected to be large enough to meet the linear separability conditions of Cover's theorem [20], so the projected data from X into H is assumed to be linearly separable.
Matrix H is then mapped into the output space by the function $\phi(H, w)$ in order to approximate the label-vector y. The $p \times 1$ vector w contains the parameters of the linear separator in the hidden layer and is obtained by solving a linear system of $N$ equations:

$$Hw = y \tag{3}$$

The smallest norm least-squares solution of the above linear system is [15]

$$w = H^{\dagger} y \tag{4}$$

where $H^{\dagger}$ is the Moore-Penrose pseudoinverse. The network response ŷ to an input pattern x is obtained by first calculating H and then by estimating the output as ŷ = sign(Hw) [21] for binary classification. For multiclass classification the output is estimated by choosing the highest output neuron. In this paper we focus only on binary classification. As in any function approximation problem, the resulting general function $f(x, Z, b, w)$ is expected to be robust to $x_i \notin X$. However, the pseudoinverse zero-error least-squares solution results in overfitting of the oversized ELM when $N$ is close to $p$. Since the learning set is formed incrementally in Active Learning, $N$ will eventually reach $p$ as learning develops, which makes the use of the original formulation of ELM in this context impracticable, even if the heuristics of Schohn and Cohn [9] and Tong and Koller [8] are applied.
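The two training stages described above, a random sigmoid projection followed by a pseudoinverse solve for the output weights, can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function names, the uniform sampling range for Z and b, and the toy data are all assumptions made for the example.

```python
import numpy as np

def elm_hidden(X, Z, b):
    """Map the rows of X (N x n) to hidden-layer rows H (N x p) with sigmoid units."""
    return 1.0 / (1.0 + np.exp(-(X @ Z + b)))

def elm_train(X, y, p, rng):
    """Sample Z and b at random, then solve Hw = y with the pseudoinverse."""
    n = X.shape[1]
    Z = rng.uniform(-1.0, 1.0, size=(n, p))  # random input weights (assumed range)
    b = rng.uniform(-1.0, 1.0, size=p)       # random biases (assumed range)
    H = elm_hidden(X, Z, b)
    w = np.linalg.pinv(H) @ y                # minimum-norm least-squares solution
    return Z, b, w

def elm_predict(X, Z, b, w):
    """Binary prediction: sign of the linear output in the hidden space."""
    return np.sign(elm_hidden(X, Z, b) @ w)

# Toy data (hypothetical): 20 two-dimensional samples with XOR-like labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)

Z, b, w = elm_train(X, y, p=50, rng=rng)
H = elm_hidden(X, Z, b)
train_acc = np.mean(elm_predict(X, Z, b, w) == y)
```

Because the hidden layer here has more neurons than there are training samples, the least-squares fit on the training rows is typically very tight, which is the regime where the text warns about overfitting.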
In order to show performance degradation when using the pseudoinverse, a sample selection strategy using ELM with 100 hidden neurons was applied to the dataset of Figure 1, which is a nonlinear binary classification problem with 180 samples of each class. The experiment was performed with 10-fold cross-validation and 10 runs. The learning process started using only one randomly chosen pattern, which was then projected into the hidden layer with random bias and weights. Output weights were obtained with the pseudoinverse followed by the calculation of Area Under the ROC Curve (AUC) [25] performance on the test set. Active Learning continues with the random selection strategy and, as new random patterns are added to the learning set, the projection procedures and pseudoinverse calculation are repeated. Figure 2 shows the average AUC over all experiments. As can be observed, AUC performance degrades sharply in the region $N \approx p$, where the number of equations reaches the number of unknowns and the pseudoinverse solution of the linear system is exact.
Active Learning could benefit from the linearization accomplished by the hidden layer projection of ELM, since it is practically a parameterless mapping. The number of hidden neurons $p$ does not require fine-tuning and hidden node parameters are randomly sampled, so no optimization parameters need to be set and no interaction with the user is required at learning time. The trade-off is that the large value required for $p$ and the pseudoinverse solution will result in the unwanted behavior of Figure 2.
The OS-ELM seems to be a good candidate to perform Active Learning, because it can learn data one by one or chunk by chunk. However, its formulation demands that in the initialization phase the number of training patterns be at least equal to the number of hidden nodes ($N = p$) [22]. As discussed before, the learning set is formed incrementally in Active Learning and $N$ could be less than $p$ at the beginning, so OS-ELM is not the best option to perform Active Learning.
The new regularized formulations of ELM proposed by Huang et al. [21] can generalize well even if $N \le p$. The classification problem for the new regularized formulations of ELM with a single-output node can be formulated as [21]

$$\text{Minimize: } L_{P} = \frac{1}{2}\|w\|^2 + \frac{C}{2}\sum_{i=1}^{N} \xi_i^2 \quad \text{subject to: } \psi(x_i, Z, b)\,w = y_i - \xi_i, \quad i = 1, \ldots, N \tag{5}$$

where $\xi_i$ is the training error with respect to the training pattern $x_i$ and $C$ is a regularization parameter. In this formulation, the ELM training is equivalent to solving the following dual optimization problem [21]:

$$L_{D} = \frac{1}{2}\|w\|^2 + \frac{C}{2}\sum_{i=1}^{N} \xi_i^2 - \sum_{i=1}^{N} \alpha_i \left(\psi(x_i, Z, b)\,w - y_i + \xi_i\right) \tag{6}$$

where each Lagrange multiplier $\alpha_i$ corresponds to the $i$th training pattern [21]. Huang et al. [21] proposed two solutions for this optimization problem:

(1) For the case where the learning set has small or medium size:

$$w = H^T \left(\frac{I}{C} + H H^T\right)^{-1} y \tag{7}$$

(2) For the case where the learning set has large size:

$$w = \left(\frac{I}{C} + H^T H\right)^{-1} H^T y \tag{8}$$

For binary classification the output for an input pattern $x_i$ is estimated as $\hat{y}_i = \mathrm{sign}(\psi(x_i, Z, b)\,w)$ [21]. For this ELM formulation a regularization parameter should be fine-tuned. In this case, some patterns should be separated and labeled for that purpose, which can increase the Active Learning costs. The main objective of Active Learning is to induce a Supervised Learning model using the fewest labeled patterns, so in this case the parameter tuning can be prohibitive.
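The two closed-form solutions of the regularized formulation differ only in which linear system is solved: one has the size of the learning set and the other has the size of the hidden layer. The sketch below, with hypothetical function names and synthetic data, assumes the standard ridge-style expressions $w = H^T(I/C + HH^T)^{-1}y$ and $w = (I/C + H^TH)^{-1}H^Ty$ and checks numerically that they agree.

```python
import numpy as np

def relm_small(H, y, C):
    """w = H^T (I/C + H H^T)^{-1} y: solves an N x N system (few samples)."""
    N = H.shape[0]
    return H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, y)

def relm_large(H, y, C):
    """w = (I/C + H^T H)^{-1} H^T y: solves a p x p system (many samples)."""
    p = H.shape[1]
    return np.linalg.solve(np.eye(p) / C + H.T @ H, H.T @ y)

# Synthetic hidden-layer matrix and labels (illustrative values only).
rng = np.random.default_rng(1)
H = rng.uniform(size=(30, 10))
y = rng.choice([-1.0, 1.0], size=30)

w_small = relm_small(H, y, C=10.0)
w_large = relm_large(H, y, C=10.0)
```

Both forms give the same weights up to rounding, so the choice between them is a matter of cost: the first inverts an $N \times N$ matrix and the second a $p \times p$ matrix.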
With the purpose of showing the characteristics of these ELM alternatives (all ELM implementations are accomplished using the source codes available in the website http://www.ntu.edu.sg/home/egbhuang/elmcodes.html), the same problem of Figure 2 was applied to OS-ELM and to the new regularized formulation of ELM, which we call ELM2012. For ELM2012, thirty percent of the training set was used for fine-tuning the regularization parameter $C$. For OS-ELM, the initialization phase used a number of labeled patterns equal to the number of hidden nodes (100). The subsequent patterns were learned one by one in random order. The results are presented in Figure 3. As can be seen, the OS-ELM results are affected by the overfitted pseudoinverse solution of the initialization phase ($N = p$). This problem is propagated during the one by one learning. ELM2012 solves the pseudoinverse limitations when $N \approx p$, but a regularization parameter should be fine-tuned, which can be prohibitive in the Active Learning scenario [18].
In this paper we present a Hebbian Learning [23,26] approach to compensate the unwanted behavior of Figure 2 and to avoid fine-tuning regularization parameters as will be discussed in the next sections.

Hebbian Learning
In neural network learning, the update of the weight $w_{ij}$ that connects neurons $i$ and $j$ according to Hebb's rule [26] is proportional to the cross product of their activation values for any input-output association $k$, or in other words $w_{ij} \propto x_{ik} y_{jk}$. The rule can be written in matrix form as in

$$w = X^T y \tag{9}$$

where w is the $n \times 1$ weight vector, X is the $N \times n$ input data matrix, and y is the $N \times 1$ vector containing the output values.
Since the estimation ŷ of vector y is given by ŷ = Xw, the substitution of w from (9) into this expression leads to $\hat{y} = XX^T y$. Therefore, in order to perfectly retrieve y we must have $XX^T = I$. If such a condition is met the training error is zero, since ŷ = y, and the learning rule is equivalent to the pseudoinverse method of (4). This condition can only happen if $x_i \perp x_j$ for all $i \ne j$ and $x_i^T x_i = 1$ for all $i$. In such a situation X forms an orthonormal basis and $X^T = X^{-1}$ [27], which makes (9) and (4) equivalent. However, in most real situations X will not be an orthonormal basis and a residual term due to the product $XX^T$ will cause a shift of ŷ in relation to y. Since an individual $\hat{y}_k$ is obtained as $\hat{y}_k = \sum_{i=1}^{N} x_k^T x_i y_i$, the term corresponding to $i = k$ can be separated from the summation, which leads to the following:

$$\hat{y}_k = x_k^T x_k\, y_k + \sum_{i=1,\, i \ne k}^{N} x_k^T x_i\, y_i \tag{10}$$

In the particular situation when the input data is normalized, that is, $x_k^T x_k = 1$, (10) can be simplified into

$$\hat{y}_k = y_k + \sum_{i=1,\, i \ne k}^{N} x_k^T x_i\, y_i \tag{11}$$

The interpretation of (11) [28] is that the estimated response $\hat{y}_k$ for the input pattern $x_k$ is the exact response $y_k$ plus the residual crosstalk term.

[Figure caption: … OS-ELM [22], and the new regularized formulation of ELM [21] here called ELM2012. Thirty percent of the learning set was used to fine-tune the regularization parameter of ELM2012.]
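The decomposition of the Hebbian response into the exact label plus a crosstalk term can be checked numerically. The following sketch uses randomly generated, row-normalized patterns (illustrative data, not taken from the paper) and compares the direct output $XX^T y$ with the sum of the labels and the off-diagonal inner-product contributions.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # normalize rows so x_k^T x_k = 1
y = np.array([1.0, -1.0, 1.0, 1.0, -1.0])

w = X.T @ y      # Hebbian rule: w = X^T y
y_hat = X @ w    # network response, equal to X X^T y

G = X @ X.T                            # inner products x_k^T x_i
crosstalk = (G - np.eye(len(y))) @ y   # sum over i != k of x_k^T x_i y_i

# With an orthonormal set of patterns the crosstalk vanishes and y is retrieved exactly.
X_orth = np.eye(5)
y_hat_orth = X_orth @ (X_orth.T @ y)
```

For the random patterns, `y_hat` equals `y + crosstalk`; for the orthonormal patterns, `y_hat_orth` equals `y`, which is the zero-error case discussed in the text.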
The crosstalk of (11) is inherent to Hebbian Learning and is due to the nonorthogonality of the input data, so it depends on how the input samples are spatially related to each other. For usual inductive learning approaches, data is provided to learning without any interference on the sampling probability of a given $x_i$. Therefore, crosstalk is innate to a given dataset and can be calculated directly given the samples. From (11) it can be noted that it represents a shift in relation to the zero-error pseudoinverse solution, which is in fact unwanted in ELM learning since it yields overfitting, as discussed in Section 2. The degree to which the Hebbian solution differs from the zero-error solution is in fact determined by the sample selection strategy in Active Learning. This has a penalization effect in relation to the pseudoinverse outcome that is expected to result in a smoother learning curve than the one presented in Figure 2.
We argue in this paper that the use of Hebbian Learning as represented in (9) in place of (4) yields better generalization due to the residual term, which alleviates the overfitting effects of the pseudoinverse, as will be discussed in the next sections. In addition, Hebbian Learning is particularly suitable for Active Learning, since the selection of a new pattern does not require that previously selected ones be learned again. The contribution of a new pattern can be simply added to the current weight vector, as shown in (12). Unlearning, which is appropriate to remove redundant patterns in Big Data problems, can also be accomplished by removing the data from the summation:

$$w = \sum_{i=1}^{N} y_i x_i \tag{12}$$

It is clear that the crosstalk may increase as new patterns are selected to be learned and are added to the summation of (12). The estimation of the crosstalk term according to varied sampling scenarios was the subject of many papers that aimed at estimating the storage capacity of Hopfield Networks [29] in the late 1980s [30-33]. The limit number of associations that can be stored is clearly dependent on the crosstalk. In Active Learning, however, the magnitude of the crosstalk is a result of the patterns selected for learning, so the selection strategy has an indirect control of how much $\hat{y}_k$ deviates from $y_k$. In the next sections a selection strategy as well as a method to estimate the maximum number of patterns that can be stored in ELMs trained with Hebbian Learning in the Active Learning scenario is presented. It is also shown in the next section that crosstalk has a regularization effect in Hebbian Learning.
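Because the Hebbian weight vector is a plain sum of label-weighted patterns, learning and unlearning reduce to adding or subtracting one term. The sketch below illustrates this under the assumption that the summation has the form $w = \sum_i y_i x_i$; the helper names `hebbian_add` and `hebbian_remove` are hypothetical.

```python
import numpy as np

def hebbian_add(w, x, y):
    """Learn one pattern: add its contribution y * x to the weight vector."""
    return w + y * x

def hebbian_remove(w, x, y):
    """Unlearn one pattern: subtract the contribution it added earlier."""
    return w - y * x

rng = np.random.default_rng(4)
X = rng.normal(size=(6, 4))
y = np.array([1.0, -1.0, -1.0, 1.0, 1.0, -1.0])

# Incremental learning, one pattern at a time.
w = np.zeros(4)
for x_i, y_i in zip(X, y):
    w = hebbian_add(w, x_i, y_i)

w_batch = X.T @ y                                # batch Hebbian solution
w_without_last = hebbian_remove(w, X[-1], y[-1]) # drop the last pattern
```

The incremental result matches the batch product $X^T y$, and removing a pattern gives the same weights as never having learned it, with no retraining on the remaining data.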

Avoiding Overfitting with Hebbian Learning
As discussed in the previous sections, the very nature of ELM's design results in an oversized network that is likely to yield overfitting with pseudoinverse learning. This happens because the pseudoinverse solution is optimal and results in null error when $N = p$, so in order to avoid the associated overfitting effect of single-objective error minimization learning, many methods aim at displacing the network solution from the null error region. Regularization methods [34], for instance, include a regularization term in the objective function, which is usually represented as a linear combination of the square error and an additional penalization function. The objective function is usually described in the form $J(w) = \sum e(w)^2 + \lambda \|w\|$, with the norm of the weights $\|w\|$ often used as penalization [35]. The effect of the regularized objective function is to shift the solution from the null training set error for $\lambda \ne 0$. In other words, the training set error should increase in order to avoid the overfitting effect of pseudoinverse learning combined with an intrinsically oversized network.
The crosstalk term of (11) implicitly contributes a penalization term to the zero-error solution, which can be made clearer by the expression that follows. After some algebraic manipulations, which are presented in Appendix A, the square error $e_k^2 = (y_k - \hat{y}_k)^2$ due to an arbitrary association $k$ trained with Hebbian Learning, as described in Section 3, can be represented as in (13). The penalization term of (13) was obtained from the original error $(y_k - \hat{y}_k)^2$ with the objective of showing the residual effect of Hebbian Learning. The first term is proportional to the square error, whereas the second one has a penalization effect due to the spatial relation of patterns within the learning set, since it is related to the crosstalk of (11). As more patterns are selected for learning, the interference among them is likely to increase and so is the magnitude of the penalization term of (13). The sample selection strategy in Active Learning has, therefore, a direct impact on penalization and on overfitting smoothing.
The graph of Figure 4 shows ELM's training set error due to Hebbian and pseudoinverse learning, as the number of (random) learning patterns increases. Training was accomplished with (4) and (9). As expected, Hebbian Learning error increases quadratically, due to the first term of (13), whereas pseudoinverse error is null regardless of the number of selected learning patterns. This kind of behavior changes drastically when the error is calculated for the test set, as can be seen in Figure 5. It is important to emphasize at this point that Figures 4 and 5 use different $y$-axis scales, because of the discrepancy in the magnitude of the errors; however, the scales do not affect the current qualitative analysis. It can be observed that the pseudoinverse error increases drastically near $N = p$, a behavior that is compatible with the one presented also for the AUC in Figure 2. Although Hebbian test error still has a quadratic behavior, which is very smooth in the figure because of the scale, pseudoinverse error for the test set is much higher, especially near $N = p$, due to overfitting.
This kind of behavior indicates that crosstalk, which is often considered a limitation of Hebbian Learning [30-33], may have a positive effect on ELM learning. Its contribution to output smoothing will, however, depend on the learning method's ability to control its magnitude, which is one of the goals of the proposed method that will be detailed in the next sections.

Limit Number of Training Patterns
The resulting model from (9) is in fact a Simple Perceptron [36] trained with Hebbian Learning. A variation with normalized weights, as presented in (14), has been shown to yield margin maximization [23]. In our context the question that remains, however, is to estimate the minimum number of patterns that are needed for learning. This is particularly important to smooth the effect of crosstalk, to control the labeling cost in Active Learning, and to reduce training set size in Big Data problems.
The Perceptron Convergence Theorem states that Rosenblatt's algorithm [36] converges with a limit number of iterations that is less than or equal to the maximum number of misclassifications if the problem is linearly separable [11]. In our context, for ELM learning, we assume that the resulting problem in ELM's hidden layer is linearly separable due to ELM's hidden layer nonlinear projection. Based on this assumption, we show next that Nilsson's proof of convergence [11,24] can be extended to Hebbian Learning. The proof of the theorem that follows, which is presented in Appendix B, gives an estimate of the limit number of patterns that is sufficient to find a linear separator and, consequently, yields the upper limit that ensures convergence. With such a reduced training set, learning can be interrupted to avoid an increase in the penalization term of (13) due to crosstalk. As will be shown in the next sections, the limit number of patterns given by the following theorem does in fact minimize the crosstalk.

Theorem 1. For two linearly separable classes the maximum number of labels needed for convergence of a Hebbian Perceptron is given by

$$t_{\max} = \frac{\beta + 2\theta}{\alpha^2} \tag{15}$$

where $\beta$ is the maximum square norm of the training patterns, $\theta$ is the margin of the most distant pattern from the separating hyperplane, and $\alpha$ is the margin of the closest pattern to the separating hyperplane.
The deductions of $\alpha$, $\beta$, and $\theta$ are presented in Appendix B, where they are defined with respect to the learning set $D_L$. Equation (15) indicates that the Hebbian Learning Perceptron with normalized weights [23] converges using at most $(\beta + 2\theta)/\alpha^2$ labels. As will be shown in the next section, the value of $t_{\max}$ can be estimated during learning and used as part of the sample selection criterion and also to determine the number of patterns to be learned. This is especially interesting for Big Data problems because only the most informative patterns need to be selected and learned.
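A small worked example may help to see how the bound on the number of labels behaves. The sketch below assumes the bound has the form (maximum square norm + 2 × largest margin) / (smallest margin)², which follows the order of the three quantities named in the theorem; the symbol names `alpha`, `beta`, and `theta` and the numeric values are chosen here for illustration only.

```python
def t_max(alpha, beta, theta):
    """Assumed label bound: (beta + 2 * theta) / alpha**2.

    alpha: margin of the pattern closest to the hyperplane (must be positive)
    beta:  maximum square norm of the training patterns
    theta: margin of the pattern most distant from the hyperplane
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive for the bound to be finite")
    return (beta + 2.0 * theta) / alpha ** 2

# Unit-norm patterns, largest margin 1.0, smallest margin 0.5.
labels_needed = t_max(alpha=0.5, beta=1.0, theta=1.0)

# Halving the smallest margin quadruples the bound.
labels_needed_tight = t_max(alpha=0.25, beta=1.0, theta=1.0)
```

Under these assumed values the bound is 12 labels, and it grows to 48 when the closest pattern is twice as near to the hyperplane, which shows how strongly the estimate depends on the smallest margin.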

Convergence Test as Active Learning Strategy
The theorem presented in the previous section can be used as a labeling criterion for Active Learning since it indicates the number of labels necessary to ensure convergence of the Hebbian Learning. The decision on labeling is based on successive estimates of $t_{\max}$ along the learning process.
Since the convergence should occur with a reduced number of patterns, the influence of crosstalk term can be controlled and smoothing can be achieved.Our classifier is composed of a single ELM hidden layer and an output layer learned via Hebbian Learning Perceptron with normalized weights [23].Fernández-Delgado et al. [23] demonstrated that this Perceptron works as an SVM in which all training patterns are considered as support vectors with Lagrange multipliers equal to 1.
In order to select only the most informative patterns, our Active Learning strategy uses a convergence test as a labeling criterion each time a new pattern is presented. Each new pattern is propagated through the ELM hidden layer and then its margin is calculated in relation to the current hyperplane. This value is assigned to $\alpha$ in (15). The variable $\theta$ corresponds to the maximum between the calculated $\alpha$ and the previous values of $\theta$. The variable $\beta$ is the maximum between the norm of the current vector and the largest norm of the previous training vectors. Using these variables, $t_{\max}$ is calculated using (15). The convergence test is accomplished using $t_{\max}$. If $t_{\max}$ is larger than the number of used training patterns, then the algorithm has not converged for the current pattern, so this pattern must be learned. Its label is queried from the specialist and the Perceptron weights are adjusted with the normalized Hebbian update [23]. If $t_{\max}$ is lower than or equal to the number of used training patterns, then the algorithm has converged and therefore it is not necessary to query the current label, since it is probably redundant. The process continues while new patterns are presented or until the maximum number of labels is reached. Algorithm 1 presents the pseudocode of our approach.
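The labeling loop described above can be sketched as follows. This is a rough illustration under several assumptions that are not confirmed by the text: the margin of an unlabeled pattern is taken as the absolute value of its projection on the normalized weight vector, the squared norm is used for the norm-related quantity, the bound is computed as (norm term + 2 × largest margin) / (current margin)², and the weight update adds the label-weighted hidden vector to a running sum that is normalized on use. The function name, the `oracle` callback, and the toy data are hypothetical.

```python
import math
import numpy as np

def stream_active_learning(H_stream, oracle, max_labels):
    """Convergence-test labeling over a stream of hidden-layer vectors (sketch)."""
    s = np.zeros(H_stream.shape[1])  # running Hebbian sum of y_i * h_i
    beta = 0.0                       # largest squared norm seen so far
    theta = 0.0                      # largest margin seen so far
    queried = []                     # indices whose labels were requested
    for i, h in enumerate(H_stream):
        norm_s = np.linalg.norm(s)
        w = s / norm_s if norm_s > 0 else s   # normalized weights (assumed form)
        alpha = abs(float(w @ h))             # margin w.r.t. current hyperplane
        theta = max(theta, alpha)
        beta = max(beta, float(h @ h))
        bound = math.inf if alpha == 0 else (beta + 2.0 * theta) / alpha ** 2
        if bound > len(queried):              # not converged: query and learn
            s = s + oracle(i) * h
            queried.append(i)
            if len(queried) >= max_labels:
                break
    norm_s = np.linalg.norm(s)
    return (s / norm_s if norm_s > 0 else s), queried

# Toy stream (illustrative): 200 points projected by a random sigmoid layer.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
labels = np.where(X[:, 0] > 0, 1.0, -1.0)
Z = rng.uniform(-1.0, 1.0, size=(2, 30))
b = rng.uniform(-1.0, 1.0, size=30)
H = 1.0 / (1.0 + np.exp(-(X @ Z + b)))

w, queried = stream_active_learning(H, lambda i: labels[i], max_labels=50)
```

In this sketch the first pattern is always queried because the initial weight vector is zero and its margin is therefore zero. Each pattern is visited once and its margin is computed once, which matches the stream-based setting described in the text.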
The proposed Active Learning strategy is able to find a solution that maximizes AUC and controls the crosstalk term.In order to illustrate these characteristics, three sample selection strategies using ELM with 100 hidden neurons were applied to the dataset of Figure 1.The first strategy uses the Hebbian Learning with normalized weights to learn, iteratively, the patterns closest to the separating hyperplane as proposed in the heuristic of Schohn and Cohn [9] (shown in Figure 6 as ELMPCP).The second strategy uses the Hebbian Learning with normalized weights to learn patterns selected at random (shown in Figure 6 as ELMPRP).The last strategy is our Active Learning method based on convergence test, named here as Extreme Active Learning Machine (EALM).
We also included the ELMs results of Figure 3.The experiment was performed with 10-fold cross-validation and 10 runs.The results are presented in Figure 6 in terms of average AUC.
Considering a reduced training set, the heuristic of Schohn and Cohn [9] achieved better results than random selection. Nevertheless, Schohn's heuristic is considerably more time-consuming since it requires, at each iteration, the hyperplane distance estimates for all patterns. Our Active Learning strategy, in turn, achieves a good AUC performance while calculating the margin only once for each pattern. It is worth noting that, although our strategy is stream-based [6], its performance was similar to that of Schohn's pool-based strategy [9]. The regularized ELM achieved better results than the other strategies, but for this method a regularization parameter was fine-tuned using thirty percent of the learning set, which can be prohibitive in a real Active Learning scenario.
In order to illustrate the effect of crosstalk control, average square errors were calculated for each training set via (13). Each term of this equation is presented as a separate curve in Figure 7. As can be seen, the error term related to the crosstalk is dominant for all strategies. In contrast with the random pattern selection strategy, Schohn's heuristic was able to reduce the total error (square error) for reduced training sets. Our Active Learning strategy found a solution with nonzero total error and a reduced crosstalk term, as expected. The next section presents experimental results for real problems, including Big Data, and compares our strategy with others reported in the literature.
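To make the notion of crosstalk concrete, the sketch below splits the output of a Hebbian-trained unit for one stored pattern into the contribution of that pattern itself and the interference from all other stored patterns. It assumes the textbook unnormalized Hebbian rule (the weight vector is the sum of label-weighted hidden vectors); (13) is not reproduced in this section, so its exact terms, and the normalized variant used in the experiments, may differ from this illustration. The function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def hebbian_weights(H, y):
    """Unnormalized Hebbian rule: w = sum_i y_i * h_i."""
    return [sum(y_i * h_i[j] for h_i, y_i in zip(H, y))
            for j in range(len(H[0]))]


def output_terms(H, y, k):
    """Split the output for pattern k into its own term and the crosstalk."""
    own = y[k] * dot(H[k], H[k])
    crosstalk = sum(y[i] * dot(H[i], H[k]) for i in range(len(H)) if i != k)
    return own, crosstalk


# Three hidden-layer vectors with labels in {-1, +1}.
H = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
y = [1, 1, -1]
w = hebbian_weights(H, y)
own, crosstalk = output_terms(H, y, k=1)
```

By construction, the two terms add up to the unit's output `dot(w, H[k])`, so a large crosstalk term directly perturbs the response to a pattern that has already been learned.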

Experimental Results
In this section we report the results of four experiments. Experiment 1 demonstrates the limitation of the original formulation of ELM when the number of training patterns is close to the hidden layer size. It also demonstrates the effect of using the Hebbian Learning Perceptron with normalized weights [23] and compares all strategies with the regularized formulation of ELM. Experiment 2 shows that the number of hidden neurons does not need to be fine-tuned if it is much larger than the input size. Experiment 3 compares our Active Learning strategy with known Active Learning methods in the literature, with a linear SVM (all linear SVM implementations use LibLinear [38], available in the Shogun package [39]), and also with the regularized formulation of ELM. These three experiments were performed over datasets of the UCI repository [40], which are listed in Table 1 along with their characteristics. Finally, Experiment 4 shows the effectiveness of our Active Learning strategy on Big Data problems. The datasets were extracted from [41] and have their characteristics listed in Table 2. As can be observed, they contain a large number of patterns, which may hinder the learning of traditional supervised models. Our results demonstrate, however, that it is possible to obtain effective models for Big Data with a reduced number of informative patterns.

Results with UCI Datasets.
The following models were compared in Experiment 1: the original formulation of ELM (original ELM); the regularized version of ELM (ELM2012); the ELM hidden layer with Hebbian Learning Perceptron and normalized weights [23] (ELMP); and our Active Learning strategy (Extreme Active Learning Machine, EALM). Data was presented at random for the ELM and EALM models. In the particular case of ELMP, two tests were performed: (i) learning from data presented at random (ELMPRP); (ii) learning from the patterns closest to the separating hyperplane, as proposed in Schohn and Cohn [9] (ELMPCP). For all models, the ELM hidden layer was set with the same random parameters (weights and bias) and 100 hidden neurons. This number was chosen to demonstrate ELM's limitation when the number of training patterns is close to the hidden layer size. For ELM2012, thirty percent of the learning set was used to fine-tune the regularization parameter.
The average performance of each algorithm was calculated over 10 runs of 10-fold cross-validation with AUC. The results of Experiment 1 are presented in Figures 8 and 9. As expected, the original formulation of ELM did not present good generalization when the training set size is near the hidden layer size. This is because ELM's output weights are calculated by solving a system of linear equations [15]. One can also observe that our Active Learning solution achieved a performance similar to ELMP trained with the closest patterns (ELMPCP). This shows that our labeling criterion based on the convergence test indeed works as a "margin-based filter" but without the need to calculate the distances for all patterns. The EALM solutions are also close to the best solutions obtained using ELM2012, which indicates that our model uses the crosstalk term as implicit regularization.
Experiment 2 compares EALMs with different hidden layer sizes: 100, 500, 1000, 2500, 5000, and 10000. Results are listed in Table 3 in terms of the average number of labels selected by EALM (AL Labels) and the average AUC (AUC). The best values are marked in bold. From Table 3, one can notice that when the number of hidden neurons exceeds 500, the performance of all models becomes similar. This demonstrates that EALM has low sensitivity to the hidden layer size and therefore it is not necessary to tune this parameter. A large value such as 1000 is enough, as suggested in [17,42].
Experiment 3 [37] compares EALM with the following methods: Perceptron of Dasgupta et al. (PDKCM) [13]; Perceptron of Cesa-Bianchi et al. (PCBGZ) [12]; SVM of Tong and Koller (SVMTK) [8]; a linear SVM trained with all patterns (SVMALL); and the regularized formulation of ELM trained with all patterns (ELM2012). All these methods were used as linear outputs for an ELM hidden layer projection with 1000 neurons and with weights and bias randomly selected in the range [−3, 3], as proposed by [17]. Thirty percent of each dataset was set aside in order to set the free parameters of PDKCM, PCBGZ, ELM2012, and the SVMs. This was accomplished via a grid search with a 10-fold cross-validation procedure, as suggested in Monteleoni and Kääriäinen [14]. PCBGZ was also tested with its parameter set to the optimal value $(\max_{\mathbf{x}} \lVert \mathbf{x} \rVert^{2})/2$ [12], and this variation was named PCBGZ-OPT. The regularization parameter of the SVM was selected from the range $\{2^{-5}, 2^{-4}, 2^{-3}, 2^{-2}, 2^{-1}, 1, 2, 2^{2}, \ldots, 2^{14}\}$, as suggested in [23]. The regularization parameter of ELM2012 was selected from the range $\{2^{-24}, 2^{-23}, \ldots, 2^{24}, 2^{25}\}$, as suggested in [21].
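The two search ranges quoted above are plain powers of two, so they can be generated rather than typed out. The sketch below builds both grids and shows the shape of a grid search; `power_of_two_grid`, `select_best`, and the scoring function are hypothetical names, and in the experiments the score would be the cross-validated performance on the thirty percent of the data set aside for tuning.

```python
def power_of_two_grid(low_exp, high_exp):
    """Return [2**low_exp, ..., 2**high_exp], both ends included."""
    return [2.0 ** e for e in range(low_exp, high_exp + 1)]


def select_best(grid, score):
    """Return the grid value with the highest score (score is user-supplied)."""
    return max(grid, key=score)


svm_grid = power_of_two_grid(-5, 14)    # range used for the linear SVM
elm_grid = power_of_two_grid(-24, 25)   # range used for ELM2012

# Placeholder score for illustration only: peaks at the value 8.0.
best = select_best(svm_grid, lambda c: -abs(c - 8.0))
```

The SVM grid has 20 candidate values and the ELM2012 grid has 50, each of which requires a full 10-fold cross-validation, which is the tuning cost that EALM avoids.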
Ten runs of 10-fold cross-validation were performed in order to calculate the average accuracy (Ac), the average AUC (AUC), and the average number of selected labels (AL Labels). All datasets were normalized to mean 0 and standard deviation 1. EALM, PDKCM, PCBGZ, and PCBGZ-OPT were initialized with one pattern selected at random. SVMTK was initialized using a set composed of two randomly chosen patterns, one of each class, as proposed by [8].
Results of Experiment 3 are shown in Table 4. For a particular model, the effective number of labels consists of those used to configure its free parameters added to the labels selected by Active Learning (AL Labels). Table 4 shows the number of labels selected by Active Learning (AL Labels) and the number of labels effectively used (Effective Labels). As can be observed, the best results were obtained for EALM, SVMTK, SVMALL, and ELM2012. Although SVMTK selected a smaller number of labels during Active Learning, its computational cost is higher than that of EALM. Furthermore, it was verified that the effective number of labels used by EALM is smaller than that used by the other models. The results of EALM are close to those of ELM2012, which indicates that the crosstalk term of EALM has a regularization effect similar to that of the regularization parameter of ELM2012, with the advantage that it does not need to be fine-tuned because the crosstalk term is automatically controlled by our Active Learning strategy.
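The evaluation protocol described above (standardization, repeated 10-fold cross-validation, AUC) can be sketched with the standard library alone. This is a minimal sketch under stated assumptions rather than the code used in the experiments: the population standard deviation is used, constant features are left at zero, AUC is computed by direct pairwise comparison with ties counted as one half, and all function names are hypothetical.

```python
import math
import random


def standardize(X):
    """Scale each feature to mean 0 and standard deviation 1."""
    n = len(X)
    cols = list(zip(*X))
    means = [sum(c) / n for c in cols]
    # A constant feature has zero deviation; divide by 1.0 in that case.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly.

    Assumes labels in {-1, +1} and at least one pattern of each class.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l != 1]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def repeated_kfold(n_samples, k=10, runs=10, seed=0):
    """Yield (train_idx, test_idx) for `runs` shuffled repetitions of k-fold CV."""
    rng = random.Random(seed)
    for _ in range(runs):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for f in range(k):
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield train, folds[f]
```

With the default arguments the generator yields 100 train/test splits, and the reported figures correspond to averages of the metric over such splits.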

Results with Big Data.
Experiment 4 shows the ability of our Active Learning strategy (EALM) to deal with problems containing a large amount of data (see Table 2). As mentioned earlier, EALM has inherent properties that make it Although the regularized formulations of ELM (ELM2012) and linear SVM do not implement such aspects, their results under the same conditions were considered here as a baseline for comparison. The regularization parameters of the linear SVM and of ELM2012 were configured using a grid search with a 10-fold cross-validation procedure on 30% of the training set. All datasets were normalized to mean 0 and standard deviation 1. Table 5 lists the average values achieved by EALM, ELM2012, and SVM for the test subsets. These values were calculated over 10 runs (training/test cases) with the metrics accuracy (Ac) and AUC. The average number of labels selected by EALM (Used Labels) is also shown.
As can be verified in Table 5, EALM, ELM2012, and linear SVM achieved similar performances for the Splice and Adult datasets. Nevertheless, EALM used a reduced number of labels (42.28% and 18.88%, respectively), which can be highly positive in a Big Data setting in which labels have a high cost. Datasets IJCNN1 and Web Gaussian, in turn, are highly imbalanced and this issue may have affected the AUC performance of all models. In the particular case of Web Gaussian, linear SVM achieved better results than EALM. This is because SVM used all labels in the optimization

Conclusion
Big Data problems demand data models able to handle time-varying, massive, and high-dimensional data. In this scenario, Active Learning emerges as an attractive technique for the development of high-performance classifiers using few data. The importance of Active Learning for Big Data becomes more evident when labeling cost is high and when data is presented to the learner via data streams. Some studies have developed linear models to perform Active Learning, but these make unrealistic assumptions about the data distribution and require the setting of free parameters. These aspects directly affect the main purpose of Active Learning, which is cost reduction, and their use could be almost prohibitive for massive datasets.
Our stream-based Active Learning strategy has low sensitivity to parameter setting. This is achieved by using an ELM hidden layer projection whose size is larger than the dimension of the input data, a property already reported in other ELM-based studies in the literature [17,42]. Overfitting is inherently avoided via the Hebbian Learning crosstalk term, which leads to smooth linear separations in the ELM output layer. This was confirmed by the similar performances achieved by our Active Learning strategy and by large-margin strategies such as ELM+linear SVM and ELM+Schohn's heuristic. Our results are also close to those of regularized ELM, which indicates that the crosstalk term of our model has a regularization effect similar to that of the regularization parameter of ELM, with the advantage that the crosstalk term is automatically controlled by our Active Learning strategy.
The use of the convergence test as a labeling criterion enables the iterative selection of the most informative patterns without the need to calculate the distances for any previous pattern, as is usually done by margin-based Active Learning approaches. This means that our strategy does not store any previously learned pattern, so it is particularly suitable for scenarios that require online learning, such as stream-based Big Data domains.
The results of Experiment 4 show that our approach is effective when applied to massive data. With a reduced training dataset, our approach was able to achieve a performance similar to that of an SVM trained with all patterns. This fact highlights the importance of our approach to current Big Data applications.

A. Deduction of (13)
The squared error $e^{2} = (y - \hat{y})^{2}$ due to an arbitrary sample, after expansion, is given by

Figure 1: Binary classification dataset used in the random selection Active Learning example of Figure 2.

Figure 2: The use of the pseudoinverse causes a sharp decline in performance when the number of selected patterns approaches the number of hidden neurons.

Figure 3: Comparison between the original formulation of ELM [15], OS-ELM [22], and the new regularized formulation of ELM [21], here called ELM2012. Thirty percent of the learning set was used to fine-tune the regularization parameter of ELM2012.

Figure 9: Average results of 10 runs of 10-fold cross-validation for different ELM methods in the datasets: AUST, LIV, GER, and SPAM.
Input: Initial size of the training set , maximum number of labels , ELM random weights Z and random bias b, generator function 
Output: Weights vector w
Method:
Take at random  patterns x  from the generator function ;
Propagate the  patterns through the ELM layer ((x   , Z, b)) and query their labels   ;
 , Z, b))/‖w + (x  , Z, b)‖;
 =  + 1;
end
Until ( = ) or ( = ⌀);
Algorithm 1: Active Learning strategy.
Figure 8: Average results of 10 runs of 10-fold cross-validation for different ELM methods in the datasets: HRT, WBCO, WBCD, PIMA, SNR, and ION.

Table 3: Comparison of the hidden layer size.

Table 5: Big Data results.