Cascade Support Vector Machines with Dimensionality Reduction

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Cascade support vector machines have been introduced as an extension of classic support vector machines that allows fast training on large data sets. In this work, we combine cascade support vector machines with dimensionality reduction based preprocessing. The cascade principle allows fast learning based on the division of the training set into subsets and the union of cascade learning results based on support vectors in each cascade level. The combination with dimensionality reduction as preprocessing results in a significant speedup, often without loss of classifier accuracy, while the high-dimensional counterparts of the low-dimensional support vectors are used in each new cascade level. We analyze and compare various instantiations of dimensionality reduction preprocessing and cascade SVMs with principal component analysis, locally linear embedding, and isometric mapping. The experimental analysis on various artificial and real-world benchmark problems covers cascade-specific parameters such as intermediate training set sizes and dimensionalities.


Introduction
Large data sets require the development of machine learning methods that are able to efficiently compute supervised learning solutions. State-of-the-art methods in classification are support vector machines (SVMs) [1]. However, due to the cubic runtime and quadratic space requirements of support vector learning with respect to the number of patterns, their applicability is restricted. Cascade machines are machine learning methods that divide the training set into subsets to reduce the computational complexity; they employ the principle of dividing a large problem into smaller subproblems that can be solved more efficiently. The cascade principle is particularly successful for SVMs [2], as their learning results (the support vectors) can hierarchically be used as training patterns for the following cascade level. The objective of this paper is to show that a further speedup can be achieved via dimensionality reduction (DR) from space $\mathbb{R}^d$ to space $\mathbb{R}^q$ with $q < d$ as a preprocessing step without a significant loss of accuracy. In each cascade level of the novel cascade variant, which we will call extreme cascade machines (ECMs), the original patterns (with original pattern dimensionality $d$) are employed. The support vector learning process on the patterns with reduced dimensionality $q$ turns out to choose a set of support vectors similar to that of the original SVM learning process on patterns with dimensionality $d$. The advantage of the proposed combined cascade learning and dimensionality reduction procedure is that the final SVM is also trained on a subset of the patterns with original dimensionality $d$. Hence, an independent test set does not have to be mapped to a low-dimensional space before it can be subject to the ECM classification process.
In this work, we present the approach to divide the training set into subsets, reduce its dimensionality, and employ the cascade principle. The approach turns out to depend on various parameters that are analyzed experimentally. The hybridization of the cascade approach with DR methods belongs to the successful line of research on DR-based preprocessing in supervised learning. At the same time, cascades share similarities with ensemble methods, which combine the learning results of multiple learners and have proven to be an effective means of strengthening computational intelligence methods. For example, Ye et al. [3] introduced a k-nearest neighbor based bagging algorithm with pruning for ensemble SVM classification. Ensembles have also performed well in other applications like visualization of high-dimensional data with neural networks [4].
This paper is structured as follows. In Section 2, the new ECM approach is presented. It is experimentally analyzed in Section 3 with respect to training set sizes, the choice of parameters, and the employment of various DR methods. Conclusions are drawn in Section 4.

Extreme Cascade Machines
In this section, we introduce the concept of ECMs. They are based on the combination of classic cascade SVMs with DR preprocessing.
2.1. Support Vector Machines. SVMs for classification place a hyperplane in data space to separate patterns of different classes. Given a set of $N$ observed patterns $\mathbf{x}_1, \ldots, \mathbf{x}_N$ with $\mathbf{x}_i \in \mathbb{R}^d$ and corresponding label information $y_1, \ldots, y_N$ with $y_i \in \{-1, 1\}$, the task in supervised classification is to train a model $f$ for the prediction of the label of an unknown pattern $\mathbf{x}' \in \mathbb{R}^d$. SVMs are successful models for such supervised learning tasks. They are based on maximizing the margin of a hyperplane that separates patterns of different classes. The dual SVM optimization problem is to maximize

$$\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$$

with respect to $\alpha_i$ subject to the constraints $\sum_{i=1}^{N} \alpha_i y_i = 0$ and $\alpha_i \geq 0 \; \forall i$. Once the optimization problem is solved for $\alpha_i^*$, in many scenarios the majority of patterns vanish with $\alpha_i = 0$ and only few have $\alpha_i > 0$. Patterns $\mathbf{x}_i$ for which $\alpha_i > 0$ holds are called support vectors. The separating hyperplane $H$ is defined with these support vectors:

$$\mathbf{w} = \sum_{i=1}^{N} \alpha_i^* y_i \mathbf{x}_i.$$

The support vectors satisfy $y_i(\mathbf{w}^T \mathbf{x}_i + w_0) = 1$; that is, they lie on the border of the margin. With any support vector $\mathbf{x}_i$, we can compute $w_0 = y_i - \mathbf{w}^T \mathbf{x}_i$ and the resulting discriminant $f(\mathbf{x}') = \operatorname{sign}(\mathbf{w}^T \mathbf{x}' + w_0)$, which is called SVM. An SVM that is trained with the support vectors computes the same discriminant function as the SVM trained on the original training set, a principle that is used in cascade SVMs. Extensions of SVMs have been proposed that allow learning with large data sets such as core vector machines [5].

Each SVM returns the support vectors as learning result. In each iteration, the training set is reduced to the corresponding set of support vectors. This process is stopped when the final number of support vectors is smaller than or equal to $k^*$. The resulting support vector set is the basis of the final SVM that can be employed as final estimator.
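The principle that an SVM trained only on the support vectors reproduces the discriminant of the SVM trained on all patterns can be checked with a small experiment. The following is a minimal sketch using scikit-learn, not the setup of the paper: the blob data, the linear kernel, and the value C = 1.0 are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated Gaussian blobs (illustrative data, not from the paper).
X, y = make_blobs(n_samples=400, centers=[[-3.0, -3.0], [3.0, 3.0]],
                  cluster_std=1.0, random_state=0)

# SVM trained on the full training set.
full_svm = SVC(kernel="linear", C=1.0).fit(X, y)

# Indices of the support vectors, i.e. the patterns with alpha_i > 0.
sv_idx = full_svm.support_
n_support = len(sv_idx)

# Second SVM trained only on those support vectors.
sv_svm = SVC(kernel="linear", C=1.0).fit(X[sv_idx], y[sv_idx])

# Fraction of training patterns on which both models predict the same label.
agreement = float(np.mean(full_svm.predict(X) == sv_svm.predict(X)))
print(f"{n_support} support vectors, agreement {agreement:.3f}")
```

The second model sees only a small fraction of the patterns, which is what makes the reduction of the training set in each cascade level possible.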

Cascade
For C-SVMs, the runtime on the first level is reduced from $N^3$ for a single SVM on all $N$ patterns to $(N/k) \cdot k^3 = N \cdot k^2$ for $N/k$ SVMs on subsets of size $k$, with $N^3 > N \cdot k^2$ for $k < N$. A similar argument holds for all subsequent levels. The speedup can be increased by parallelizing the training process on multicore machines. However, small cascade training set sizes $k$ result in larger support vector sets for each level and consequently more cascade levels.
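The first-level runtime argument can be made concrete with a few lines of arithmetic. The sketch below assumes the idealized cubic cost model stated above and ignores constants and all later cascade levels; the function names are hypothetical, and the values N = 30,000 and k = 3,000 are taken from the parameter analysis only as an example.

```python
def svm_cost(n):
    """Idealized cubic training cost of one SVM on n patterns (arbitrary units)."""
    return n ** 3


def first_level_cost(n, k):
    """Cost of training N/k SVMs on subsets of size k: (N/k) * k^3 = N * k^2."""
    return (n // k) * svm_cost(k)


N, k = 30_000, 3_000
full = svm_cost(N)                 # single SVM on all patterns
cascade = first_level_cost(N, k)   # first cascade level only
speedup = full / cascade           # equals (N/k)^2 under this cost model
print(full, cascade, speedup)
```

Under this model the first-level speedup grows quadratically with the number of subsets, which is why small values of k look attractive until the additional cascade levels they cause are taken into account.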
Figure 2 illustrates the learning results of a C-SVM with radial basis function (RBF) kernel that divides the training set of patterns from the XOR problem into two parts. Figure 2(a) shows the learning result of an SVM with RBF kernel. Figure 2(b) shows the learning result of an SVM trained with the support vectors computed by an SVM that has been trained on the first half of the XOR data set. The corresponding SVM trained with the support vectors of the second half of the data set is shown in Figure 2(c). Figure 2(d) shows the decision boundary of the C-SVM trained with the union of both support vector sets. The figures show that the original SVM and the C-SVM learn the same decision boundary.
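A one-level cascade of the kind illustrated in Figure 2 can be sketched as follows. This is an illustrative reconstruction, not the code behind the figure: the way the XOR-like data is sampled, the kernel settings C = 10 and gamma = 2, and the helper name fit_rbf are all assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# XOR-like data: the label is +1 if both coordinates share the same sign.
X = rng.uniform(-1.0, 1.0, size=(400, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)


def fit_rbf(X_part, y_part):
    """Train an RBF-kernel SVM (hyperparameters are illustrative assumptions)."""
    return SVC(kernel="rbf", C=10.0, gamma=2.0).fit(X_part, y_part)


# (a) Classic SVM trained on all patterns.
svm_full = fit_rbf(X, y)

# (b), (c) One SVM per half; keep the indices of their support vectors.
halves = [np.arange(0, 200), np.arange(200, 400)]
sv_indices = [idx[fit_rbf(X[idx], y[idx]).support_] for idx in halves]

# (d) C-SVM trained on the union of both support vector sets.
union = np.concatenate(sv_indices)
c_svm = fit_rbf(X[union], y[union])

# Compare both decision functions on fresh points from the same square.
probe = rng.uniform(-1.0, 1.0, size=(2000, 2))
agreement = float(np.mean(svm_full.predict(probe) == c_svm.predict(probe)))
print(f"union size {len(union)}, agreement {agreement:.3f}")
```

Remaining disagreements, if any, are expected to concentrate close to the decision boundary, where small differences between the two support vector sets matter most.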

Extreme Cascading.
Often, not all features are important to efficiently solve classification problems; some may be uncorrelated with the label or redundant. The reduction of the number of features to a relevant subset in supervised learning is a common approach in machine learning [8,9]. Plastria et al. [10] have shown that a proper choice of methods and number of dimensions can yield a significant increase in classifier performance.
The extreme cascade model we propose in this work combines cascading with dimensionality reduction. Algorithm 1 shows the pseudocode of the ECM approach. Each training subset is subject to a DR preprocessing resulting in reduced $q$-dimensional subsets $\hat{T}_{[i,j]} = \{(\hat{\mathbf{x}}_i, y_i), \ldots, (\hat{\mathbf{x}}_j, y_j)\}$ with $\hat{\mathbf{x}}_i, \ldots, \hat{\mathbf{x}}_j \in \mathbb{R}^q$. The reduced training sets $\hat{T}_{[\cdot,\cdot]}$ are each subject to SVM training. The assumption of ECM is that the dimensionality reduction process maintains most properties of the high-dimensional data space and that the set of support vectors of the low-dimensional space is similar to the support vector set of the high-dimensional data. This assumption will be analyzed in the experimental part.
The final SVM* is trained on the last training set $T$ consisting of the patterns $\mathbf{x} \in \mathbb{R}^d$ that correspond to the support vectors of the last cascade level. The ECM employs many parameters that can be tuned to define the complete ECM model, from DR choice and target dimensionality $q$ to DR method parameters and SVM parameters like kernel type, bandwidth parameters, and regularization parameter $C$. Some of the parameters are analyzed in the following experimental part of this work.
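The ECM procedure described above can be sketched in Python with scikit-learn. This is a simplified illustration under stated assumptions, not a transcription of Algorithm 1: the function name ecm_fit, the random shuffling of patterns before splitting, the stopping rule for a stalled cascade, the RBF settings, and the small synthetic data set are all hypothetical choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.svm import SVC


def ecm_fit(X, y, q=5, k=200, k_star=300, max_levels=10, seed=0):
    """Sketch of a PCA-based extreme cascade.

    Returns the final SVM (trained on original-dimensional patterns) and the
    indices of the patterns it was trained on.
    """
    rng = np.random.default_rng(seed)
    active = np.arange(len(X))  # indices of the current training set
    for _ in range(max_levels):
        if len(active) <= k_star:
            break
        rng.shuffle(active)
        survivors = []
        for start in range(0, len(active), k):
            idx = active[start:start + k]
            # Tiny or single-class subsets cannot be reduced; pass them on.
            if len(idx) <= q or len(np.unique(y[idx])) < 2:
                survivors.append(idx)
                continue
            # DR preprocessing of the subset, then SVM in the q-dim space.
            Z = PCA(n_components=q).fit_transform(X[idx])
            svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(Z, y[idx])
            # Keep indices: they identify the high-dimensional counterparts.
            survivors.append(idx[svm.support_])
        new_active = np.concatenate(survivors)
        stalled = len(new_active) >= len(active)
        active = new_active
        if stalled:  # no further reduction, stop the cascade
            break
    final_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X[active], y[active])
    return final_svm, active


# Small synthetic problem with 5 informative and 15 redundant features.
X, y = make_classification(n_samples=1200, n_features=20, n_informative=5,
                           n_redundant=15, random_state=0)
X_train, y_train, X_test, y_test = X[:1000], y[:1000], X[1000:], y[1000:]

final_svm, train_idx = ecm_fit(X_train, y_train)
accuracy = final_svm.score(X_test, y_test)
print(f"final training set size {len(train_idx)}, test accuracy {accuracy:.3f}")
```

Because only indices are passed between levels, the final SVM is trained on original-dimensional patterns, so the test set is classified without any mapping to the low-dimensional space.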

Support Vector Analysis.
We start the experimental part with an analysis of the assumption that the support vectors learned by the SVM in the low-dimensional space are the same as the support vectors an SVM learns in the high-dimensional space. Figure 3 shows the ratio between the number of support vectors common to both SVMs and the number of support vectors of the SVM in the high-dimensional space with respect to an increasing dimensionality $q$ of the low-dimensional space. Each curve shows the average, maximum, and minimum ratios of 20 runs with new instances of the MakeClass data set (cf. Appendix for a detailed description) with increasing $q$. The blue curve shows the results for less structured instances of MakeClass with Δ = 0.15, while the red curve employs more informative features with Δ = 0.2. As expected, we can observe that the ratio of common support vectors increases with the dimensionality of the low-dimensional space until a value of $q$ is reached beyond which the ratio remains 1.0. This state is reached later for Δ = 0.2, as more support vectors are necessary for data sets that employ more informative features and less redundancy.
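The ratio plotted in Figure 3 can be computed from the index sets of the support vectors of both SVMs. The helper below is a minimal sketch; the function name and the example index lists are hypothetical and only illustrate the definition.

```python
def common_sv_ratio(sv_low, sv_high):
    """Share of the high-dimensional SVM's support vectors that the
    low-dimensional SVM also selects (both given as pattern indices)."""
    high = set(sv_high)
    if not high:
        raise ValueError("the high-dimensional SVM has no support vectors")
    return len(high & set(sv_low)) / len(high)


# Hypothetical index sets for illustration only.
ratio = common_sv_ratio(sv_low=[0, 3, 5, 8], sv_high=[0, 3, 5, 9])
print(ratio)
```

With scikit-learn, the index sets would typically come from the `support_` attribute of two fitted `SVC` models trained on the same patterns, one on the reduced and one on the original representation.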

Parameter Analysis.
The ECM approach depends on a proper choice of parameters. We analyze parameter settings for the PCA-ECM in Figure 4(a) on the MakeClass [14] data set with $N$ = 30,000, $d$ = 100, and Δ = 0.2. To set the SVM parameters $C$ and $\gamma$ for the RBF kernel, we employ a 5-fold cross-validation in all the following experiments. The plot shows accuracy and training time with various settings for $q$, $k$, and $k^*$. We can observe that the accuracy of all ECM variants lies between 0.9 and 0.99. The highest accuracies have been achieved with setting $q$ = 40, while the fastest run with a high accuracy has been achieved with settings $q$ = 40, $k$ = 3,000, and $k^*$ = 5,000. This cascade is comparatively sensitive concerning the choice of .

A similar comparison is shown in Figure 4(b), where LLE-ECM and ISOMAP-ECM variants with various settings are compared on the MakeClass data set. The results show that ISOMAP is a faster preprocessing method and often achieves high accuracies, but it is at the same time less robust than LLE, as bad parameter choices can result in comparatively low classification accuracies.

For a closer look at the LLE and ISOMAP variants, Table 1 shows runtime and mean squared error (MSE) results on the MakeClass data set with $N$ = 100,000, $d$ = 40, and Δ = 0.3 and ECM settings $k$ = 1,000 and $k^*$ = 5,000. Various settings for neighborhood size $K$ of LLE and ISOMAP and for target dimensionality $q$ are employed. The results show that the approaches are faster for smaller , an effect that is more significant for ISOMAP than for LLE. Surprisingly, we can observe a tendency for better accuracies if smaller values for $q$ are employed. This may be due to the fact that the manifold learning process is forced to concentrate on the most important features of the data set, while higher dimensions introduce noise to the classification problem. The best result (highest accuracy with best runtime) has been achieved by ISOMAP with settings $K$ = 20 and $q$ = 20.
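The 5-fold cross-validation used to set the RBF-kernel SVM parameters can be sketched with scikit-learn's grid search. The grid values and the small synthetic data set below are assumptions for illustration; the paper does not state the search ranges.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small stand-in data set (not one of the paper's benchmark instances).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=15, random_state=0)

# Hypothetical search grid for regularization C and RBF bandwidth gamma.
param_grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.001, 0.01, 0.1]}

# cv=5 selects 5-fold cross-validation; accuracy is the default score for SVC.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

In a cascade, such a search could be run once on a subset and the selected parameters reused for all intermediate SVMs, which keeps the tuning cost small compared to the training cost.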
A typical development of the number of support vectors during a run is shown in Figure 5(a). The number of support vectors decreases approximately linearly over successive cascade levels.

Table 2 shows the test error (normalized MSE) achieved on an independent test set after training and the runtime of the training and test phases of the three classifiers. The runtime depends on the machine (2.7 GHz Intel Core i5), the operating system (Apple OS X), and the programming language and packages (Python and scikit-learn). The cascade classifiers PCA-ECM and C-SVM employ the settings for $k$ and $k^*$; PCA-ECM uses the specified latent space dimensionality $q$. These parameters have been found in manual and automatic tuning processes (with grid search), but we tried to use similar settings to draw parameter-independent conclusions.

The MakeClass benchmark problem is an artificial data set with setting Δ = 0.22 for the balance between informative and redundant features and training set size $N$ = 50,000 with $d$ = 100. PCA-ECM turns out to be the fastest variant, while the classic SVM is the slowest but achieves the lowest error. PCA-ECM achieves a larger error. The loss of accuracy of the PCA-ECM (about 2 percentage points, with the error rising from 0.022 to 0.041) may be acceptable, considering that the PCA-ECM only requires 18% of the SVM runtime. The Hastie data set is an artificial data set from Hastie et al. [8], with $N$ = 100,000 and $d$ = 10, which is not high-dimensional, but large. Here, the SVM still achieves the lowest error, but the situation changes when we compare PCA-ECM and C-SVM. Now, the PCA-ECM is faster, but the C-SVM achieves lower test error.
On Faces, the PCA-ECM achieves a higher accuracy than the C-SVM but is slightly worse than the classic SVM. Although the SVM is already fast, the PCA-ECM is slightly faster. On the Blobs data set, DR-based preprocessing seems to be important, as the C-SVM completely fails, while the PCA-ECM achieves very good results, almost as good as the classic SVM. The latter is about six times slower than the PCA-ECM. On the regression data sets, a similar picture emerges. The PCA-ECM achieves a lower error than the C-SVM on Friedman 1. Although the SVM is better again, it requires almost six minutes to compute the solution. On Friedman 3, the situation is similar: the SVM achieves the lowest error but requires much more time than the cascade variants, which are only slightly worse in accuracy. On the Wind data set, a similar observation can be made as on Friedman 1, Hastie, Faces, and Blobs. The DR process introduces an advantage in accuracy and also in runtime. Taking into account all seven benchmark data sets, we can observe that the SVM always achieves the lowest test error but requires the longest training and test time. The cascade variants are always faster than the classic SVM, with the expected tradeoff in accuracy.

Conclusions
ECMs allow the application of SVMs to large data sets by cascading training sets and at the same time reducing the dimensionality of patterns. We experimentally analyzed fast variants that are based on preprocessing with DR and C-SVMs, in particular PCA-based ECMs, concentrating on parameters like intermediate and final cascade training set size. Both parameters have a significant influence on the final classification result. Further comparisons of DR methods for preprocessing have shown that ISOMAP outperforms LLE in terms of runtime. Most cascade variants lead to a fast training process while maintaining the same classifier accuracy or paying only with a slight decrease in accuracy. As the computation of intermediate SVMs can be parallelized on each level, the distribution to multiple cores allows a further significant speedup. This will be the subject of future investigations. Further, an extensive comparison to core SVMs, which achieve the speedup based on approximations using a minimum enclosing ball [5], will allow a rigorous comparison of SVMs for large data sets.

Appendix Benchmark Problems
In the following, the employed benchmark data sets are briefly described. We employ a data set of size $N$ and use the last $N_t$ patterns as the test set.
(i) MakeClass is a classification data set ($N$ = 50000, $N_t$ = 1000, and $d$ = 100) generated with the scikit-learn [14] method make_classification with $d$ = 100 dimensions and two centers. The structure Δ determines the ratio of informative features; the remaining ones are redundant.
(iii) The Faces data set is called Labeled Faces in the Wild.
(iv) The Gaussian Blobs data set ($N$ = 20000, $N_t$ = 1000, and $d$ = 100) is generated with the scikit-learn [14] method make_blobs and the following settings. Two centers, that is, two classes, are generated, each with a standard deviation of $\sigma$ = 10.0 and variable .
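The two artificial classification data sets can be generated along the following lines. This is a sketch under assumptions: the mapping of Δ to the `n_informative` argument, the reduced sample sizes, the random seeds, and the helper name train_test_tail_split are illustrative and not taken from the paper.

```python
from sklearn.datasets import make_blobs, make_classification


def train_test_tail_split(X, y, n_test):
    """Use the last n_test patterns as test set and the rest for training."""
    return X[:-n_test], y[:-n_test], X[-n_test:], y[-n_test:]


d, delta = 100, 0.2
# Assumption: delta is the share of informative features, the rest redundant.
n_informative = int(round(delta * d))

# MakeClass-like data: two classes, informative plus redundant features.
X_mc, y_mc = make_classification(n_samples=2000, n_features=d,
                                 n_informative=n_informative,
                                 n_redundant=d - n_informative,
                                 n_classes=2, random_state=0)

# Blobs-like data: two Gaussian centers with standard deviation 10.0.
X_bl, y_bl = make_blobs(n_samples=2000, n_features=d, centers=2,
                        cluster_std=10.0, random_state=0)

X_tr, y_tr, X_te, y_te = train_test_tail_split(X_mc, y_mc, n_test=200)
print(X_tr.shape, X_te.shape, X_bl.shape)
```

The sample sizes are kept small here so that the sketch runs quickly; the sizes listed in the appendix would be passed as `n_samples` and `n_test` instead.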

Figure 1: Illustration of the C-SVM and ECM learning scheme. In each level, the training set is divided into $N/k$ subsets. This procedure is repeated consecutively until the target size $k^*$ of the training set of support vectors is reached. In the case of ECM, the SVM learning process in the low-dimensional space returns low-dimensional support vectors. In each next level, their high-dimensional counterparts are employed.
Figure 2: Comparison of SVM learning results of (a) a classic SVM trained on a set of patterns (XOR data set), (b) an SVM trained with the support vectors of an SVM that has been trained on the 1st half of the XOR data set, (c) an SVM trained with the support vectors of an SVM that has been trained on the 2nd half of the XOR data set, and (d) the C-SVM trained on the support vectors of both SVMs.

3.3. Benchmark Data Sets. As the PCA-ECM is the fastest ECM variant with strong accuracies, we concentrate in the following on a comparison of PCA-ECM, C-SVM, and a standard SVM on a larger set of benchmark problems; see Appendix. The benchmark set contains the four classification problems MakeClass, Hastie, Faces, and Blobs. The problems Friedman 1, Friedman 3, and Wind are regression problems.

Figure 5: Illustration of ECM training characteristics: (a) development of the number of support vectors during a typical run, which decreases approximately linearly, and (b) runtime comparison of SVM and ECM for increasing training set size with $k$ = 1,000 and $k^*$ = 5,000 on MakeClass with  = 20 and Δ = 0.2.

The Labeled Faces in the Wild data set [15] ($N$ = 1088, $N_t$ = 200, and $d$ = 1850) has been introduced for studying the face recognition problem. The data set source is http://vis-www.cs.umass.edu/lfw/. It contains JPEG images of famous people collected from the internet. The faces are labeled with the name of the person pictured.
(vii) The Wind data set ($N$ = 50000, $N_t$ = 5000, and  = 11) is based on spatiotemporal time series data from the National Renewable Energy Laboratory (NREL) Western Wind data set. The whole data set comprises time series of 32,043 sites, each holding ten 3 MW turbines, over a timespan of three years in a 10-minute resolution. The dimensionality is $d$ = 22.
and a target training set size $k^*$ (corresponding to the final number of support vectors) have to be defined. We employ the following cascade variant; see Figure 1. The training set $T_{[1,N]} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}$ of patterns $\mathbf{x}_i$ with corresponding labels $y_i$ is divided into subsets of size $k$; that is, * is used for all SVMs.

Table 1: Comparison of ISOMAP-ECM and LLE-ECM with various choices for $K$ and $q$ in terms of MSE on the MakeClass data set with $N$ = 100,000, $d$ = 40, and Δ = 0.3.

Table 2: Comparison between PCA-ECM, C-SVM, and SVM in terms of MSE on benchmark problems.