Cross-Validation, Bootstrap, and Support Vector Machines

This paper considers the application of resampling methods to support vector machines (SVMs). We use leaving-one-out cross-validation (CV) to determine the optimum tuning parameters and bootstrap the deviance in order to summarize the goodness-of-fit of SVMs. The leaving-one-out CV is also adapted in order to provide estimates of the bias of the excess error in a prediction rule constructed with training samples. We analyze the data from a mackerel-egg survey and a liver-disease study.


Introduction
In recent years, support vector machines (SVMs) have been intensively studied and applied to practical problems in many fields of science and engineering [1][2][3]. SVMs have many merits that distinguish them from many other machine learning algorithms, including the nonexistence of local minima, the speed of calculation, and the use of only two tuning parameters. There are at least two reasons to use a leaving-one-out cross-validation (CV) [4]. First, the criterion based on the method is demonstrated to be favorable when determining the tuning parameters. Second, the method can estimate the bias of the excess error in prediction. No standard procedures exist by which to assess the overall goodness-of-fit of the model based on SVM. By introducing the maximum likelihood principle, the deviance allows us to test the goodness-of-fit of the model. Since no adequate distribution theory exists for the deviance, we provide bootstrapping on the null distribution of the deviance for the model having optimum tuning parameters for SVM with a specified significance level [5][6][7][8].
The remainder of this paper is organized as follows. In Section 2, using the leaving-one-out CV, we focus on the determination of the tuning parameters and the evaluation of the overall goodness-of-fit with the optimum tuning parameters based on bootstrapping. The leaving-one-out CV is also adapted in order to provide estimates of the bias of the excess error in a prediction rule constructed with training samples [9]. In Section 3, the one-against-one method is used to estimate a vector of multiclass probabilities for each pair of classes and then to couple the estimates together [3,10]. In Section 4, the methods are illustrated using mackerel-egg survey and liver-disease data. We discuss the relative merits and limitations of the methods in Section 5.

Support Vector Machines.
Given $n$ training pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where $x_i$ is an input vector and $y_i \in \{-1, +1\}$, the SVM solves the primal problem (1), in which $\beta$ is a unit vector (i.e., $\lVert \beta \rVert = 1$), $T$ denotes the transposition of the matrix, $K(\cdot, \cdot)$ is the kernel function, $C$ is the tuning parameter denoting the tradeoff between the margin width and the training data error, and $\xi_i \ge 0$ are slack variables. For an unknown input pattern $x$, the decision function $f(x)$ is given by (2), in which $\{\alpha_i, i = 1, 2, \ldots, n; \alpha_i \ge 0\}$ are the Lagrange multipliers.
We employ the Gaussian radial basis function as the kernel function [3,11,12],

$$K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2), \tag{3}$$

where $\gamma > 0$ is a fixed parameter. Binary classification is performed by using the decision function $f(x)$: the input $x$ is assigned to the positive class if $f(x) \ge 0$, and to the negative class otherwise.
Platt [13] proposed one method for producing probabilistic outputs from a decision function by using the logistic link function

$$P(y_i = 1 \mid f_i) = \frac{1}{1 + \exp(A f_i + B)}, \tag{5}$$

where $f_i = f(x_i)$ and $y_i$ represent the output of the SVM and the target value for the sample, respectively [14]. This is equivalent to fitting a logistic regression model to the estimated decision values. The unknown parameters $A$, $B$ in (5) can be estimated by minimizing the cross-entropy

$$-\sum_{i=1}^{n} \{ t_i \ln p_i + (1 - t_i) \ln (1 - p_i) \}, \tag{6}$$

where

$$p_i = \frac{1}{1 + \exp(A f_i + B)}. \tag{7}$$

Combining (6), (7), and (8) gives the objective (9). Lin et al. [15] observed that the problem of ln(0) never occurs for (9).
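The kernel and the probabilistic output described above can be illustrated with a minimal sketch. It assumes the usual Gaussian kernel $\exp(-\gamma \lVert x - x' \rVert^2)$, the logistic link $p = 1/(1 + \exp(Af + B))$, targets $t_i \in \{0, 1\}$, and the overflow-free rewriting of the cross-entropy that is commonly attributed to [15]. The function names and the plain gradient-descent optimizer are illustrative choices, not the procedure used in [13] or [15].

```python
import math

def rbf_kernel(x, z, gamma):
    """Gaussian radial basis function kernel exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def platt_probability(f, A, B):
    """Logistic link p = 1 / (1 + exp(A*f + B)), evaluated without overflow."""
    z = A * f + B
    if z >= 0:
        return math.exp(-z) / (1.0 + math.exp(-z))
    return 1.0 / (1.0 + math.exp(z))

def cross_entropy(decision_values, targets, A, B):
    """Cross-entropy of targets t in {0, 1} (assumed), written so ln(0) cannot occur."""
    total = 0.0
    for f, t in zip(decision_values, targets):
        z = A * f + B
        if z >= 0:
            total += t * z + math.log1p(math.exp(-z))
        else:
            total += (t - 1.0) * z + math.log1p(math.exp(z))
    return total

def fit_platt(decision_values, targets, steps=5000, rate=0.01):
    """Estimate A and B by plain gradient descent (illustrative optimizer only)."""
    A, B = 0.0, 0.0
    for _ in range(steps):
        grad_A = grad_B = 0.0
        for f, t in zip(decision_values, targets):
            p = platt_probability(f, A, B)
            grad_A += (t - p) * f   # derivative of the cross-entropy w.r.t. A
            grad_B += (t - p)       # derivative of the cross-entropy w.r.t. B
        A -= rate * grad_A
        B -= rate * grad_B
    return A, B
```

With decision values that tend to be positive for the positive class, the fitted $A$ is negative, so that the estimated probability increases with $f$.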

Leaving-One-Out Cross-Validation
2.2.1. CV Score. We must determine the optimum values of the tuning parameters $C$ and $\gamma$ in (1) and (3), respectively. This can be done by means of the leaving-one-out CV; as a byproduct, the excess error rate of incorrectly predicting the outcome is estimated.
Let the initial sample $X = \{X_1, X_2, \ldots, X_{i-1}, X_i, X_{i+1}, \ldots, X_n\}$ with $X_i = (x_i, t_i)$ be independently distributed according to an unknown distribution. The leaving-one-out CV algorithm is then given as follows (see, e.g., [5]).
Step 1. From the initial sample $X$, delete $X_i$ in order to form the training sample $X_{[i]} = \{X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n\}$.
Step 2. Using each training sample, fit an SVM and predict the decision value $f_{[i]}$ for $X_i$.
Step 3. From the decision value $f_{[i]}$, predict $p_{[i]}$ for the deleted $i$th sample using (7) and calculate the predicted log-likelihood $t_i \ln p_{[i]} + (1 - t_i) \ln (1 - p_{[i]})$.
Step 4. Repeat Steps 1 to 3 for $i = 1, 2, \ldots, n$.
Step 5. Compute the CV score (i.e., the averaged predicted log-likelihood) from the $n$ predicted log-likelihoods.
Step 6. Carry out a grid search over the tuning parameters $C$ and $\gamma$, taking the tuning parameters with minimum CV as optimal.
It should be noted that the CV score is asymptotically equivalent to AIC (Akaike information criterion) and EIC (extended information criterion) [16][17][18].
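The six steps above can be sketched as follows. To keep the example self-contained, the SVM is replaced by a stand-in classifier (`kernel_vote`, a Gaussian-kernel-weighted class frequency) that has only the parameter `gamma`; with an SVM the grid would span pairs $(C, \gamma)$. The score is taken here as the negative mean predicted log-likelihood so that smaller values are better; this sign and scaling are assumptions, as are all function names.

```python
import math

def kernel_vote(train, x_new, gamma):
    """Stand-in probabilistic classifier (NOT an SVM): RBF-weighted class frequency."""
    num = den = 0.0
    for x, t in train:
        w = math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, x_new)))
        num += w * t
        den += w
    p = num / den if den > 0 else 0.5
    return min(max(p, 1e-6), 1.0 - 1e-6)   # clip to keep the log-likelihood finite

def loo_cv_score(sample, predict, **params):
    """Negative mean predicted log-likelihood over the n deleted observations."""
    total = 0.0
    for i, (x_i, t_i) in enumerate(sample):
        train = sample[:i] + sample[i + 1:]      # Step 1: delete X_i
        p_i = predict(train, x_i, **params)      # Steps 2 and 3: predict p_[i]
        total += t_i * math.log(p_i) + (1 - t_i) * math.log(1.0 - p_i)
    return -total / len(sample)                  # Step 5 (sign and scaling assumed)

def grid_search(sample, predict, grid):
    """Step 6: return the parameter setting with the smallest CV score."""
    return min(grid, key=lambda params: loo_cv_score(sample, predict, **params))
```

Here `sample` is a list of `(x, t)` pairs with `x` a tuple of predictors and `t` in {0, 1}, and `grid` is a list of keyword dictionaries such as `[{"gamma": 0.01}, {"gamma": 1.0}]`.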

Excess Error Estimation.
Let the actual error rate be the probability of incorrectly predicting the outcome of a new observation, given a discriminant rule on the initial sample $X$; this is useful for performance assessment of a discriminant rule. Given a discriminant rule based on the initial sample, the error rates of discrimination are also of interest. Because the same observations are used for forming and assessing the discriminant rule, the resulting proportion of errors, called the apparent error rate, underestimates the actual error rate. The estimate of the error rate is seriously biased when the initial sample is small. This bias for a given discriminant rule is called the excess error of that rule. To correct this bias and estimate the error rates, we provide a bias correction of the apparent error rate associated with a discriminant rule, which is constructed by fitting an SVM to the training sample.
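The optimism of the apparent error rate can be seen in a minimal sketch. The discriminant rule is a stand-in 1-nearest-neighbour classifier rather than an SVM, chosen because it reproduces its own training labels. The excess error is estimated as the leaving-one-out error rate minus the apparent error rate, which is the usual cross-validation estimator and is assumed here. All function names are illustrative.

```python
def nearest_neighbour_rule(train):
    """Stand-in discriminant rule (NOT an SVM): 1-nearest-neighbour classifier."""
    def rule(x_new):
        nearest = min(
            train,
            key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x_new)),
        )
        return nearest[1]
    return rule

def apparent_error(sample, build_rule):
    """Proportion of the initial sample misclassified by the rule built on it."""
    rule = build_rule(sample)
    return sum(rule(x) != t for x, t in sample) / len(sample)

def loo_error(sample, build_rule):
    """Proportion of deleted observations misclassified by the rule built without them."""
    errors = 0
    for i, (x_i, t_i) in enumerate(sample):
        rule = build_rule(sample[:i] + sample[i + 1:])
        errors += (rule(x_i) != t_i)
    return errors / len(sample)

def excess_error_estimate(sample, build_rule):
    """Cross-validation estimate of the excess error (assumed: LOO minus apparent)."""
    return loo_error(sample, build_rule) - apparent_error(sample, build_rule)
```

For a rule that memorizes the training data, the apparent error rate is zero even when the leaving-one-out error rate is large, so the whole leaving-one-out error is attributed to excess error.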
By applying a discriminant rule to the initial sample $X$, we can form the realized discriminant rule $\eta_X$. Let $\eta_{X_{[i]}}$ be the discriminant rule based on $X_{[i]}$. Given a subject with $x_i$, we predict the response by $\eta_{X_{[i]}}(x_i)$. The algorithm for leaving-one-out CV that estimates the excess error rate when fitting an SVM is given as follows [9].
Step 1. Generate the training sample $X_{[i]}$, and construct the realized discriminant rule $\eta_{X_{[i]}}$ for $i = 1, 2, \ldots, n$. The leaving-one-out CV error rate is then the proportion of the $n$ deleted observations whose response is incorrectly predicted by $\eta_{X_{[i]}}(x_i)$.
Step 2. Calculate the apparent error rate, that is, the proportion of observations in the initial sample $X$ whose response is incorrectly predicted by $\eta_X$.
Step 3. The cross-validation estimator of the expected excess error is the leaving-one-out CV error rate minus the apparent error rate.
2.3. Bootstrapping. Introducing the maximum likelihood principle into the SVM, the deviance allows us to test the goodness-of-fit of the model:

$$\mathrm{Dev} = 2(\ln L_f - \ln L_c) = -2 \ln L_c, \tag{15}$$

where $\ln L_c$ denotes the maximized log likelihood under some current SVM, and the log likelihood for the saturated model, $\ln L_f$, is zero. The deviance given by (15), however, does not even approximately follow a $\chi^2$ distribution for the case in which ungrouped binary responses are available [19,20].
The number of degrees of freedom (d.f.) required for the test for significance using the assumed $\chi^2$ distribution for the deviance is a contentious issue. No adequate distribution theory exists for the deviance. The reason for this is somewhat technical (for details, see Section 3.8.3 in [19]). Consequently, the deviance on fitting a model to binary response data cannot be used as a summary measure of the goodness-of-fit of the model. The percentile of the deviance needed for a goodness-of-fit test can in principle be calculated. However, the calculations are usually too complicated to perform analytically, so a Monte Carlo method can be employed [6,7].
Step 1. Generate $B$ bootstrap samples $X^*$ from the original sample $X$. Let $X^*_b$ denote the $b$th bootstrap sample.
Step 2. For the bootstrap sample $X^*_b$, compute the deviance of (15), denoted by $\mathrm{Dev}^*(b)$.
Steps 1 and 2 are repeated independently $B$ times, and the computed values are arranged in ascending order.
Step 3. Take the value of the $j$th order statistic of the $B$ replications of $\mathrm{Dev}^*(b)$ as an estimate of the quantile of order $j/(B + 1)$.
Step 4. The estimate of the $100(1 - \alpha)$th percentile of $\mathrm{Dev}^*(b)$ is used to test the goodness-of-fit of a model at a specified significance level $\alpha = 1 - j/(B + 1)$. A value of the deviance of (15) greater than the estimate of the percentile indicates that the model fits poorly. Typically, the number of replications $B$ is in the range $50 \le B \le 400$.
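The bootstrap steps above can be sketched as follows. Two assumptions are made, since the text does not fix them: bootstrap samples are drawn by resampling the pairs $(x_i, t_i)$ with replacement, and the SVM is replaced by a stand-in kernel-weighted classifier with a fixed `gamma`, so that the example needs only the standard library. The deviance is computed as $-2 \ln L_c$ for binary responses. All names are illustrative.

```python
import math
import random

def fitted_probability(train, x_new, gamma=5.0):
    """Stand-in fitted probability (NOT an SVM): RBF-weighted class frequency."""
    num = den = 0.0
    for x, t in train:
        w = math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, x_new)))
        num += w * t
        den += w
    p = num / den if den > 0 else 0.5
    return min(max(p, 1e-6), 1.0 - 1e-6)

def deviance(sample, gamma=5.0):
    """Dev = -2 ln L_c for binary responses, with the model fitted to `sample` itself."""
    log_lik = 0.0
    for x, t in sample:
        p = fitted_probability(sample, x, gamma)
        log_lik += t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return -2.0 * log_lik

def bootstrap_percentile(sample, alpha=0.05, B=199, seed=0):
    """Estimate the 100(1 - alpha)th percentile of the bootstrapped deviance."""
    rng = random.Random(seed)
    n = len(sample)
    devs = []
    for _ in range(B):
        boot = [sample[rng.randrange(n)] for _ in range(n)]  # Step 1 (resampling assumed)
        devs.append(deviance(boot))                          # Step 2
    devs.sort()                                              # ascending order
    j = round((1.0 - alpha) * (B + 1))                       # alpha = 1 - j/(B + 1)
    j = min(max(j, 1), B)                                    # keep j a valid rank
    return devs[j - 1]                                       # Steps 3 and 4
```

Choosing $B$ so that $(1 - \alpha)(B + 1)$ is an integer, for example $B = 199$ with $\alpha = 0.05$, makes $j$ exact. The observed deviance would then be compared with the returned value.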

Influential Analysis.
Assessing the discrepancies between $t_i$ and $p_i$ at the $i$th observation in (15), the influence measure provides guides and suggestions that may be carefully applied to an SVM [19]. The effect of the $i$th observation on the deviance can be measured by computing

$$\Delta \mathrm{Dev}_{[i]} = \mathrm{Dev} - \mathrm{Dev}_{[i]},$$

where $\mathrm{Dev}_{[i]}$ is the deviance with the $i$th observation deleted. The distribution of $\Delta \mathrm{Dev}_{[i]}$ will be approximated by $\chi^2$ with d.f. = 1 when the fitted model is correct. An index plot is a reasonable rule of thumb for graphically presenting the information contained in the values of $\Delta \mathrm{Dev}_{[i]}$. The key idea behind this plot is not to focus on a global measure of goodness-of-fit but rather on local contributions to the fit. An influential observation is one that greatly changes the results of the statistical inference when deleted from the initial sample.
Platt [13] proposed the threefold CV for estimating the decision values in (9). However, the value of $\Delta \mathrm{Dev}_{[i]}$ may be negative because three SVMs are trained on three split parts of the training pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. Therefore, in the present paper, we train a single SVM on the training pairs in order to evaluate the decision values $f_i$ and estimate probabilistic outputs according to [15].
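The influence measure can be illustrated with a minimal sketch that computes the change in deviance caused by deleting each observation and flags values above a $\chi^2$ critical value with 1 d.f. The model is a stand-in kernel-weighted classifier rather than an SVM, and the critical value 6.635 (upper 1% point) is an illustrative choice. All names are assumptions.

```python
import math

def fitted_probability(train, x_new, gamma=5.0):
    """Stand-in fitted probability (NOT an SVM): RBF-weighted class frequency."""
    num = den = 0.0
    for x, t in train:
        w = math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, x_new)))
        num += w * t
        den += w
    p = num / den if den > 0 else 0.5
    return min(max(p, 1e-6), 1.0 - 1e-6)

def deviance(sample, gamma=5.0):
    """Dev = -2 ln L_c for binary responses, with the model fitted to `sample` itself."""
    log_lik = 0.0
    for x, t in sample:
        p = fitted_probability(sample, x, gamma)
        log_lik += t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return -2.0 * log_lik

def delta_deviance(sample, gamma=5.0):
    """Dev - Dev_[i] for every i, where Dev_[i] omits the i-th observation."""
    full = deviance(sample, gamma)
    return [
        full - deviance(sample[:i] + sample[i + 1:], gamma)
        for i in range(len(sample))
    ]

CHI2_1DF_UPPER_1PCT = 6.635  # upper 1% point of chi-square with 1 d.f.

def influential(sample, critical=CHI2_1DF_UPPER_1PCT, gamma=5.0):
    """Indices whose deletion changes the deviance by more than `critical`."""
    return [i for i, d in enumerate(delta_deviance(sample, gamma)) if d > critical]
```

Plotting the values returned by `delta_deviance` against the observation index gives the index plot described above.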

Multiclass SVM
We consider the discriminant problem with $K$ classes and $n$ training pairs $(x_1, t_1), (x_2, t_2), \ldots, (x_n, t_n)$, where $x_i$ is an input vector and $t_i = (t_{i1}, t_{i2}, \ldots, t_{iK})$ [10,21,22]. Let $p_{i1}, p_{i2}, \ldots, p_{iK}$ denote the response probabilities, with $\sum_{k=1}^{K} p_{ik} = 1$, for multiclass classification. The log-likelihood is given by

$$\ln L = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik} \ln p_{ik}.$$

For multiclass classification, the one-against-one method (also called pairwise classification) is used to produce a vector of multiclass probabilities for each pair of classes, and then to couple the estimates together [10]. The earliest implementation of the multiclass SVM is probably the one-against-one method of [21]. This method constructs $K(K - 1)/2$ classifiers, each trained on the data from the $k$th and $l$th classes of the training set.
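As a small illustration of the two quantities just introduced, the sketch below enumerates the $K(K - 1)/2$ class pairs and evaluates the multinomial log-likelihood. It assumes that each $t_i$ is an indicator (one-hot) vector; the function names are illustrative.

```python
import math
from itertools import combinations

def class_pairs(K):
    """All K(K-1)/2 unordered class pairs (k, l) with k < l."""
    return list(combinations(range(K), 2))

def multinomial_log_likelihood(targets, probabilities):
    """Sum over i and k of t_ik * ln p_ik, assuming one-hot target vectors t_i."""
    return sum(
        t * math.log(p)
        for t_row, p_row in zip(targets, probabilities)
        for t, p in zip(t_row, p_row)
        if t > 0
    )
```

For the four liver-disease classes analyzed later, `class_pairs(4)` contains six pairs, so six binary SVMs would be trained.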
For each pair of classes, the SVM solves the primal formulation given in [3,10,23]. Given $K$ classes of data, for any $x$ the goal is to estimate the class probabilities $p_k$, $k = 1, \ldots, K$. We first estimate the pairwise class probabilities $r_{kl}$ by using (22), where $A$ and $B$ are estimated by minimizing the cross entropy using training data and the corresponding decision values $f$. Hastie and Tibshirani [21] proposed minimizing the Kullback-Leibler (KL) distance between $r_{kl}$ and $p_k/(p_k + p_l)$, weighted by $n_{kl}$, where $r_{lk} = 1 - r_{kl}$ and $n_{kl}$ is the number of training data in the $k$th and $l$th classes. Wu et al. [10] proposed a second approach to obtain $p_k$ from all these $r_{kl}$'s by optimizing

$$\min_{p} \frac{1}{2} \sum_{k=1}^{K} \sum_{l: l \ne k} (r_{lk} p_k - r_{kl} p_l)^2 \quad \text{subject to} \quad \sum_{k=1}^{K} p_k = 1, \; p_k \ge 0.$$

Thus, we can adopt the leaving-one-out CV similar to the method in Section 2.2.
Step 1. From the initial sample $X$, delete $X_i$ in order to form the training sample $X_{[i]}$.
Step 2. Using each training sample, fit an SVM in order to estimate $r_{kl[i]}$ by (22), and predict the class probabilities for the deleted $i$th sample.
Step 3. Repeat Steps 1 and 2 for $i = 1, 2, \ldots, n$.
Step 4. Compute the CV score from the predicted log-likelihoods of the $n$ deleted samples, as in Section 2.2.
Step 5. Determine the optimal tuning parameters as those with minimum CV by carrying out a grid search over $C$ and $\gamma$.
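The coupling of pairwise estimates used inside Step 2 can be sketched as follows, assuming the quadratic objective $\frac{1}{2} \sum_k \sum_{l \ne k} (r_{lk} p_k - r_{kl} p_l)^2$ with $\sum_k p_k = 1$. The sketch further assumes that the non-negativity constraints are inactive at the solution, so they are dropped and the resulting linear optimality system is solved by Gaussian elimination. The function name and the input layout (`r[k][l]` is the estimated probability of class $k$ given class $k$ or $l$) are illustrative.

```python
def couple_pairwise(r):
    """Couple pairwise probabilities r[k][l] into class probabilities p (sum to 1)."""
    K = len(r)
    # Quadratic form of the objective: Q_kk = sum_{l != k} r_lk^2, Q_kl = -r_lk * r_kl.
    Q = [[0.0] * K for _ in range(K)]
    for k in range(K):
        for l in range(K):
            if l != k:
                Q[k][k] += r[l][k] ** 2
                Q[k][l] = -r[l][k] * r[k][l]
    # Optimality system [[Q, e], [e^T, 0]] [p, b]^T = [0, 1]^T as an augmented matrix.
    M = [Q[k] + [1.0, 0.0] for k in range(K)] + [[1.0] * K + [0.0, 1.0]]
    size = K + 1
    for col in range(size):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, size), key=lambda row: abs(M[row][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for row in range(size):
            if row != col:
                factor = M[row][col] / M[col][col]
                M[row] = [a - factor * b for a, b in zip(M[row], M[col])]
    return [M[k][size] / M[k][k] for k in range(K)]
```

When the pairwise estimates are exactly consistent, that is $r_{kl} = p_k/(p_k + p_l)$ for some probability vector $p$, the objective is zero at $p$ and the function recovers it.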

Examples
4.1. Mackerel-Egg Survey Data. We consider data consisting of 634 observations from a 1992 mackerel egg survey [24]. The predictors of egg abundance are the location (longitude and latitude) at which samples were taken, the depth of the ocean, the distance from the 200 m seabed contour, and, finally, the water temperature at a depth of 20 m. We first fit an SVM. In the same manner as described in [11], we determine the tuning parameters $C$ and $\gamma$. The optimum values of the tuning parameters are $(C, \gamma) = (28, 0.09)$.
The bootstrap estimator of the percentile for the deviance is $\mathrm{Dev}^*(b) = 444.31$. A comparison with the deviance $\mathrm{Dev} = 443.132$ from (15) suggests that the SVM fits the data fairly well. For reference purposes, the histogram of the bootstrapped $\mathrm{Dev}^*(b)$ for $B = 400$ is provided in Figure 1.
We can estimate the apparent error rates of incorrectly predicting the outcome and the leaving-one-out CV error rates for several models, as shown in Table 1. The smoothing parameters in generalized additive models (GAM) [24] and the number of hidden units in a neural network in Table 1 are determined using the leaving-one-out CV. From Table 1, the leaving-one-out CV error rate for the SVM is the smallest among all models, but the apparent error rate is the smallest for the neural network. The CV scores are 477.04, 509.44, and 541.61 for the SVM, logistic discriminant, and neural network, respectively. This implies that the SVM is the best among these three models from the point of view of CV. Figure 2 shows the index plot of $\Delta \mathrm{Dev}_{[i]}$, which indicates that observations no. 399 and no. 601 are influential at the 0.01% level of significance.

Liver Disease Data.
We apply the proposed method to laboratory data collected from 218 patients with liver disorders [25][26][27]. Four liver diseases were observed: acute viral hepatitis (57 patients), persistent chronic hepatitis (44 patients), aggressive chronic hepatitis (40 patients), and postnecrotic cirrhosis (77 patients). The covariates consist of four liver enzymes: aspartate aminotransferase (AST), alanine aminotransferase (ALT), glutamate dehydrogenase (GIDH), and ornithine carbamyltransferase (OCT). For each $(C, \gamma)$ pair, the CV performance is measured by training on 70% of the data and testing on the remaining 30%. We then train on the whole training set using the pair $(C, \gamma) = (93, 0.20)$, which achieves the minimum CV score (= 187.93), and predict the test set. The apparent and leaving-one-out CV error rates for the training and test samples for several models are shown in Table 2. As shown, the apparent error rate of the SVM on the training sample and the error rate of the SVM on the test sample are the smallest among all models, but the leaving-one-out CV error rate of the SVM on the training sample is larger than that of the multinomial logistic discriminant model.

Concluding Remarks
We considered the application of resampling methods to SVMs. Statistical inference based on the likelihood approach for SVMs was discussed, and the leaving-one-out CV was suggested for determining the tuning parameters and for estimating the bias of the excess error in prediction. Bootstrapping is used to focus on the evaluation of the overall goodness-of-fit with the optimum tuning parameters. Data from a mackerel-egg survey and a liver-disease study are used to evaluate the resampling methods.
There is one broad limitation to our approach: the SVM assumed the independence of the predictor variables. More generally, it may be preferable to visualize interactions between predictor variables. The smoothing spline ANOVA models [28] can provide an excellent means for handling data of mutually exclusive groups and a set of predictor variables. We expect that flexible methods for a discriminant model using machine learning theory [1], such as penalized smoothing splines, will be very useful in these real-world contexts.

Table 1: Error rates for mackerel-egg survey data.

Table 2: Error rates for liver disease data.