Capped Asymmetric Elastic Net Support Vector Machine for Robust Binary Classification

Recently, there has been much work on improving the robustness of the SVM by constructing nonconvex loss functions, but the robustness of the constructed functions is seldom studied theoretically. In this paper, building on our recent work, we present a novel capped asymmetric elastic net (CaEN) loss and equip the SVM with it, yielding the CaENSVM. We derive the influence function of the estimators of the CaENSVM to theoretically explain the robustness of the proposed method; our results can be easily extended to other similar nonconvex loss functions. We further show that the influence function of the CaENSVM is bounded, so that the robustness of the CaENSVM can be explained theoretically. Additional theoretical analysis demonstrates that the CaENSVM satisfies the Bayes rule, and the corresponding generalization error bound based on Rademacher complexity guarantees its good generalization capability. Since the CaEN loss is nonconvex, we implement an efficient DC procedure based on the stochastic gradient descent algorithm Pegasos to solve the optimization problem. A host of experiments are conducted to verify the effectiveness of the proposed CaENSVM model.


Introduction
The support vector machine (SVM), first proposed by Cortes and Vapnik [1], is a powerful binary classification tool and has been widely used in various fields, such as bioinformatics analysis [2, 3], industrial flaw detection [4], and financial forecasting [5]. On the one hand, the SVM can be easily understood from a geometric point of view, i.e., it seeks a single separating hyperplane for classifying the data. On the other hand, there is a solid statistical theory behind it that guarantees the classification performance of the SVM [6-8]. Thus, the SVM has been drawing much attention [9-15]. Although a host of studies demonstrate the advantages of support vector classifiers, there is still room for improvement.
One drawback is the sensitivity to feature noise or, more specifically, the instability under resampling. In fact, the SVM fits the regularization framework of loss + penalty by adopting the hinge loss, i.e., l_hinge(u) = max(u, 0). Huang et al. [11] pointed out that the hinge loss-based SVM lacks resistance to feature noise and that the final separating hyperplane is severely disturbed by feature noise around the decision boundary. To tackle this problem, motivated by the quantile concept in statistics, Huang et al. [11] constructed the so-called pinball loss and applied it to the SVM to propose the PinSVM. Later, Xu et al. [16] extended this idea to the twin support vector machine, which simultaneously obtains a pair of nonparallel separating hyperplanes. There are two typical defects of the pinball loss. The first is the heavy optimization burden caused by the singularity of the pinball loss function at zero. Huang et al. [17] considered an asymmetric least squared loss, which is also stable under resampling but is smooth everywhere. A similar idea was studied by Liu et al. [18]. Unlike them, Li and Lv [19] utilized the Chen-Harker-Kanzow-Smale function to construct a smooth approximation of the pinball loss. The second defect is the lack of sparseness. Under the pinball loss, correctly classified training samples still produce losses, which is completely different from the hinge loss and increases the training cost. To enhance sparseness, Huang et al. [11] introduced the pinball loss with an ϵ-insensitive zone, which achieves sparsity while maintaining stability. Shen et al. [20] truncated the left part of the pinball loss and proposed pin-SVM, providing a more flexible framework for the tradeoff between sparsity and stability. Yang and Xu [21] applied a safe screening rule to accelerate the PinSVM. More PinSVM-related work can be found in [22-27].
Another drawback is the sensitivity to label noise (outliers). Since l_hinge(u) tends to infinity as u → ∞, outliers often produce large losses, indicating that the resulting decision hyperplane may deviate considerably. Wu and Liu [28] suggested a truncated hinge loss, termed the ramp loss, to suppress the influence of outliers. Based on the ramp loss and motivated by the Huberized scheme, Wang et al. [29] proposed a smooth ramp loss function, which is twice differentiable and can be solved efficiently. Liu et al. [30] applied the ramp loss to a nonparallel support vector machine. Tang et al. [13] combined the pinball loss with the ramp loss to propose the valley loss, which is both stable under resampling and robust to outliers. For more related literature, one can refer to [31-34]. Another strategy to improve robustness is to use the correntropy-induced loss (C-loss) [35]. Based on the C-loss, Xu et al. [36] proposed the rescaled hinge loss. Due to the properties of the exponential function, the rescaled hinge loss is bounded and the induced support vector classifier is insensitive to label noise. Yang and Dong [37] applied the idea of the pinball loss to the C-loss and proposed a generalized quantile loss, which can be viewed as a rescaled version of the pinball loss. The generalized quantile loss inherits the stability under resampling from the pinball loss and the robustness to outliers from the C-loss. Similarly, we recently proposed a joint rescaled asymmetric least squared (RaLS) loss and applied it to a nonparallel support vector machine [38]. The proposed RaLS loss is smooth everywhere and enjoys both stability and robustness.
Inspired by the previous work, in this paper we construct a novel capped asymmetric elastic net (CaEN) loss and apply it to the SVM (CaENSVM). As a generalization of the pinball loss and the ramp loss, the designed CaEN loss is bounded and asymmetric as well, and it is more flexible than previous truncated pinball loss functions. To demonstrate its advantages, we theoretically investigate several properties of the CaEN loss, including noise insensitivity, the Bayes rule, and a generalization error bound. The main contributions of this work can be summarized as follows: (i) A novel capped asymmetric elastic net (CaEN) loss is proposed to achieve stability under resampling and robustness to outliers simultaneously. The advantages of the elastic net (EN) loss for the SVM were theoretically discussed in our recent work [3], where the derived VTUB significantly characterizes the advantage of the EN loss. Thus, it is meaningful to improve the performance of the EN loss under the framework of the SVM.
(ii) We derive the influence function (see Theorem 1) of the estimators of the CaENSVM to theoretically explain the robustness of the proposed method.

The remainder of this paper is organized as follows: In Section 2, we introduce several related studies. In Section 3, we first formulate the proposed CaENSVM; then, an efficient DC procedure based on the stochastic gradient descent algorithm is implemented to optimize the CaENSVM problem. Theoretical analysis of the properties of the CaENSVM, including noise insensitivity, the Bayes rule, and the generalization error bound, is carefully discussed in Section 4. We conduct extensive experiments in Section 5 to investigate the performance of the CaENSVM. In Section 6, a conclusion summarizes the main contributions and further potential directions.

Background
In this section, we review several related works. Consider a binary classification problem with n training samples and p features, and let x_i ∈ R^{p×1} and y_i ∈ {+1, −1} be the i-th instance and its corresponding label, respectively. All samples are organized into a data matrix X ∈ R^{n×p}. Unless otherwise stated, all vectors are column vectors.

SVM with Elastic Net Loss.
Recently, Qi et al. [39] proposed the so-called elastic net (EN) loss, which involves two positive tuning parameters c_1 and c_2. The EN loss is a fusion of the standard hinge loss and the squared hinge loss [9]. Figure 1(a) shows the shapes of the EN loss with different sets of parameters.
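For intuition, a loss that fuses the hinge and squared hinge losses with positive weights can be sketched as follows; this particular parameterization is an illustrative reading of the description above (chosen to match the limiting cases noted for the ENSVM below), and the authoritative definition is the one given by Qi et al. [39]:

l_EN(u; c_1, c_2) = c_2 max(u, 0) + c_1 (max(u, 0))^2,    c_1, c_2 > 0.

Under this reading, c_1 = 0 leaves a hinge-type loss (the standard-SVM case) and c_2 = 0 leaves a squared-hinge-type loss (the Lagrangian-SVM case).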

Based on the EN loss, Qi et al. [39] constructed the elastic net support vector machine (ENSVM) given in (2), where w ∈ R^{p×1} and b ∈ R are the normal vector and the intercept of the separating hyperplane, respectively. After obtaining w and b from (2), the decision function for a new sample x_new is f(x_new) = sign(w^T x_new + b), where sign(·) maps a real number to its sign and zero to zero. We can equivalently transform (2) into the constrained optimization problem (3). For the ENSVM (3), one can clearly see that it resembles the standard SVM [1] when c_1 = 0 and reduces to the Lagrangian SVM [9] when c_2 = 0; thus, the ENSVM is more flexible. Moreover, in our recent work [3], we derived the so-called VTUB for the SVM with the EN loss to demonstrate its unique advantages. Thus, it is meaningful to improve the performance of the EN loss under the framework of the SVM.

SVM with Pinball Loss.
Huang et al. [10] proved that support vector classifiers with hinge-type losses, including the ENSVM, are sensitive to feature noise or, more specifically, are unstable under resampling. Motivated by the quantile concept in statistics, Huang et al. [10] proposed a novel pinball loss, in which the parameter τ ∈ [0, 1] controls the level of stability under resampling. Figure 1(b) illustrates the shapes of the pinball loss with different values of τ. As the figure shows, unlike the hinge-type losses, l_Pin also produces losses for u < 0, which helps the classifier balance the disturbance of feature noise around the decision boundary (see subsection 3.3 in the study by Huang et al. [10] for details).
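For reference, the pinball loss of Huang et al. can be written in the margin variable u as

l_Pin(u; τ) = u          if u ≥ 0,
l_Pin(u; τ) = −τ u       if u < 0,        τ ∈ [0, 1],

so that τ = 0 recovers the hinge loss, while a larger τ penalizes correctly classified samples more heavily.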
Following the method used to formulate the hinge loss-based SVM, Huang et al. [10] constructed the pinball loss SVM (PinSVM) as problem (5), where c ≥ 0 is a tuning parameter. Problem (5) can also be equivalently rewritten in the constrained form (6). Note that if τ = 0, the second constraint of problem (6) becomes ξ ≥ 0, and the PinSVM reduces to the standard SVM with the hinge loss; in other words, the PinSVM can be viewed as a generalization of the standard SVM.

SVM with Ramp Loss.
Since the losses produced by the hinge-type loss functions [1, 9, 39] and the pinball-type loss functions [11, 17, 18] tend to infinity as u → ∞, the support vector classifiers induced by these losses are sensitive to label noise (outliers). To reduce the influence of label noise, Wu and Liu [28] truncated the hinge loss and proposed the so-called ramp loss, in which s > 0 controls the truncation level. Figure 1(c) shows different shapes of the ramp loss. Compared with the hinge-type losses, the ramp loss is upper bounded, so it can weaken the disturbance of label noise.
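In the truncation convention used in this paper, where s directly caps the loss value (this reading is an assumption on our part; Wu and Liu parameterize the truncation point slightly differently), the ramp loss can be written as

l_ramp(u; s) = min(max(u, 0), s),        s > 0,

i.e., the hinge loss clipped at the level s, which makes the contribution of any single outlier at most s.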
Applying the ramp loss to the SVM, we obtain the RampSVM optimization problem, where c > 0 is a tuning parameter.

Capped Asymmetric Elastic Net Loss-Based SVM
3.1. The CaENSVM Model. In our recent work [3], we derived the so-called VTUB to demonstrate the unique advantages of the EN loss under the framework of the SVM; thus, it is meaningful to further improve the performance of the EN loss. Motivated by the pinball loss, we design the asymmetric elastic net (aEN) loss (9), where θ ∈ [0, 1] controls the tradeoff between the L1 norm and the L2 norm, and τ ∈ [0, 1] is a tuning parameter controlling the asymmetry of the penalization for positive and negative losses.
Figure 2(a) illustrates the shapes of the aEN loss with different sets of tuning parameters. According to definition (9), l_aEN can be regarded as a generalization of the elastic net loss, the pinball loss, and the asymmetric least squared loss [17], since l_aEN reduces to l_EN for τ = 0, becomes l_Pin for θ = 0, and is equivalent to l_aLS for θ = 1.
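For intuition, one loss with exactly these limiting behaviors is sketched below; this particular parameterization is an illustrative assumption on our part, chosen to be consistent with the subgradient expressions appearing in Section 4, and the authoritative definition remains (9):

l_aEN(u; τ, θ) = (θ/2) u^2 + (1 − θ) u              if u ≥ 0,
l_aEN(u; τ, θ) = τ [ (θ/2) u^2 − (1 − θ) u ]        if u < 0.

Indeed, θ = 0 gives the pinball loss, θ = 1 gives an asymmetric least squared loss, and τ = 0 gives an EN-type loss that vanishes for u < 0.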
According to (9), l_aEN goes to infinity as u → ∞, so the proposed aEN loss is also sensitive to outliers (label noise). To improve its robustness against outliers, we further use the capping trick to propose a capped asymmetric elastic net (CaEN) loss function, defined as

l_CaEN(u; τ, θ, s) = min(l_aEN(u; τ, θ), s),    (10)

where s > 0 is a thresholding parameter. Figure 2(b) illustrates the shapes of the CaEN loss with different sets of tuning parameters. Equipping the SVM with the CaEN loss yields the CaENSVM model (11), where c > 0 is the tuning parameter. We remove the intercept term b in (11) for simplicity, which can be achieved before training by centering the features, just like [10, 13]. After obtaining w from (11), the decision function of the linear CaENSVM for a new sample x_new is f(x_new) = sign(w^T x_new).

For the nonlinear CaENSVM, we consider the kernel-generated separating hyperplane [40, 41], where ϕ(x) maps x to a high-dimensional Hilbert space. In applications, we often utilize the kernel trick, i.e., ϕ(x) = K(X, x), where K(X, x) = (K(x_1, x), ..., K(x_n, x))^T for a chosen kernel function K(·, ·). Then, by replacing w^T x_i with w^T K(X, x_i) in (11), we obtain the nonlinear CaENSVM model. To determine the class of a new sample x_new, we only need to replace x_new with K(X, x_new) in the linear decision function.

DC Algorithm for CaENSVM.
The truncation of the aEN loss results in a nonconvex loss, indicating that solving the CaENSVM (11) involves nonconvex minimization, which is often difficult. Note that, although the CaEN loss is nonconvex, we can decompose l_CaEN into the difference of two convex functions, i.e.,

l_CaEN(u; τ, θ, s) = l_aEN(u; τ, θ) − l_aEN1(u; τ, θ, s),    (13)

where l_aEN1 corresponds to the so-called aEN1 loss given in (14). Figure 2(c) depicts this convex decomposition of the CaEN loss. As the figure shows, both the aEN and aEN1 losses are convex, and taking their difference yields the nonconvex CaEN loss. Using this property of the CaEN loss, we apply the DC (difference of convex functions) algorithm [42] to optimize problem (11).
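The second component of the decomposition follows directly from the capping definition (10) and identity (13); writing it out (this is an algebraic consequence, not necessarily the exact wording of (14)):

l_aEN1(u; τ, θ, s) = l_aEN(u; τ, θ) − min(l_aEN(u; τ, θ), s) = max(l_aEN(u; τ, θ) − s, 0),

so l_aEN1 vanishes wherever the aEN loss stays below the cap s and increases one-to-one with l_aEN beyond it, which keeps it convex.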
Consider an optimization problem of the form (15), namely minimizing g(Θ) − h(Θ), where g and h are both convex functions on R^m. To solve problem (15), the DC algorithm minimizes a sequence of convex subproblems. A general framework of the DC algorithm for (15) is illustrated as Algorithm 1.
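To make the iteration concrete, a minimal Python sketch of the generic DC step is given below; the function names and the convergence test are our assumptions and are not part of Algorithm 1.

import numpy as np

def dc_algorithm(solve_convex_subproblem, grad_h, theta0, max_iter=100, tol=1e-3):
    """Generic DC iteration for min_theta g(theta) - h(theta).

    solve_convex_subproblem(z): returns argmin_theta { g(theta) - z^T theta },
        i.e., the convex subproblem obtained by linearizing h at the current point.
    grad_h(theta): a (sub)gradient of the convex function h at theta.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        z = grad_h(theta)                       # linearize the subtracted convex part h
        theta_new = solve_convex_subproblem(z)  # solve the resulting convex subproblem
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new                    # stop when successive iterates are close
        theta = theta_new
    return theta

The objective value is nonincreasing along the iterates, and the loop terminates when successive iterates change little.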
Recalling (11) and based on the decomposition (13), we can reformulate the CaENSVM optimization problem as (16), where g and h are both convex functions on R^p.
According to the DC algorithm, we need the derivative of h with respect to w. Since l_aEN1 in (16) has sharp points, h is nondifferentiable; thus, we use the subgradient instead of the derivative. For a set of tuning parameters (τ, θ, s), the subgradient of l_aEN1 with respect to w involves the two sharp points u_1 and u_2 (u_1 < 0 < u_2) of both l_aEN1(u) and l_CaEN(u) other than zero, which can be easily calculated from (9) and (14).

ALGORITHM 1: A general framework of the DC algorithm for (15).

Therefore, by Algorithm 1 and given w^(k), the main subproblem to be optimized is (19). For the sake of scalability and efficiency, we apply a stochastic gradient descent algorithm, i.e., Pegasos [43], to solve problem (19).
Let A_t be a subset of k samples, randomly chosen from the whole dataset at the t-th iteration when optimizing problem (19). We thus consider the approximate objective function F(v; A_t). The subgradient of F(v; A_t) with respect to v at v^(t) is given by (21) and (22), where v^(t) is the iterate at the t-th iteration. In line with Pegasos, the update can be written as v^(t+1) = v^(t) − η_t ∇F(v^(t)), where η_t = c/t is the step size. Finally, based on the aforementioned results about the DC algorithm and Pegasos, we design a Pegasos-based DC procedure to solve the CaENSVM optimization problem (11), which is shown in Algorithm 2. Note that if we substitute the training sample matrix X with its kernelized form, we can directly apply Algorithm 2 to solve the nonlinear CaENSVM.

ALGORITHM 2: Pegasos-based DC procedure for CaENSVM.
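For concreteness, a simplified Python sketch of this Pegasos-based DC procedure is shown below. The subgradient routines, the assumed objective form (1/2)||w||^2 + c Σ_i l_CaEN(1 − y_i w^T x_i), and the stopping rules are illustrative assumptions on our part; the authors' reference implementation is the R code cited in the conclusion.

import numpy as np

def pegasos_dc_caensvm(X, y, subgrad_aen, subgrad_aen1,
                       c=1.0, T1=10, T2=500, batch_size=32, seed=0):
    """Illustrative Pegasos-based DC procedure for the linear CaENSVM.

    subgrad_aen(u):  elementwise subgradient of the aEN loss at margin residual u.
    subgrad_aen1(u): elementwise subgradient of the aEN1 loss at margin residual u.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(T1):                                  # outer DC loop
        # linearize h at the current iterate w^(k) using the full sample
        u_full = 1.0 - y * (X @ w)
        grad_h = -c * (X * (y * subgrad_aen1(u_full))[:, None]).sum(axis=0)
        v = w.copy()
        for t in range(1, T2 + 1):                       # inner Pegasos loop
            idx = rng.choice(n, size=min(batch_size, n), replace=False)
            u = 1.0 - y[idx] * (X[idx] @ v)
            # stochastic subgradient of the convex subproblem g(v) - <grad_h, v>
            grad = (v
                    - c * (n / len(idx))
                    * (X[idx] * (y[idx] * subgrad_aen(u))[:, None]).sum(axis=0)
                    - grad_h)
            v -= (c / t) * grad                          # step size eta_t = c / t
        w = v
    return w

With the kernel trick, X can be replaced by its kernelized form, exactly as described above for the nonlinear CaENSVM.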

Properties of CaENSVM
In this section, we theoretically investigate several properties of the proposed CaENSVM, including noise insensitivity, the Bayes rule, and the generalization error bound. Since the following analysis involves the statistical distribution of the training data, we first introduce some notation and assumptions. Suppose that the training samples are drawn independently from an underlying distribution, and let Prob(·) and Prob(·|·) denote the probability and the conditional probability, respectively.

Noise Insensitivity.
By construction, the CaEN loss inherits robustness to label noise (outliers) from the ramp loss and resampling stability under feature noise from the pinball loss. Therefore, we examine the noise insensitivity of the CaENSVM from two aspects: robustness to label noise and resampling stability under feature noise.

Robustness to Label Noise.
For robustness to label noise, we establish this property by proving the boundedness of the influence function, which was first introduced by Hampel [44]. The influence function measures the stability of an estimator against an infinitesimal contamination, and the influence function of a robust estimator should be bounded [44, 45]. Before giving the main result, we make the following assumption on the distribution of the training data, which is common in statistical analysis.
Assumption 1. The random variable x ∈ X has a finite second moment.
We denote by (x_0^T, y_0)^T a sample point with point-mass distribution Δ_{x_0, y_0}. Given the distribution F of (x^T, y)^T in R^{p+1}, let the mixed distribution of F and Δ_{x_0, y_0} be F_ε = (1 − ε)F + εΔ_{x_0, y_0}, where ε ∈ (0, 1) is the contamination proportion. Fixing τ, θ, and s, let w*_ε denote the population minimizer of the CaENSVM objective under F_ε (so that w*_0 corresponds to the uncontaminated distribution F). Then, the influence function at a sample point (x_0^T, y_0)^T is defined as the limit below, provided that the limit exists.
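Written out in display form, this is the standard Hampel-type definition (reconstructed here from the surrounding text):

IF(x_0, y_0; w*_0) = lim_{ε → 0+} (w*_ε − w*_0) / ε.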

Theorem 1 (influence function). For the linear CaENSVM (11) with τ, θ, and s fixed, the influence function IF(x_0, y_0; w*_0) at a sample point (x_0^T, y_0)^T admits an explicit closed-form expression.

Proof. According to the KKT conditions, w*_ε must satisfy (29). Since F_ε = (1 − ε)F + εΔ_{x_0, y_0}, equation (29) can be rewritten as (30). Differentiating both sides of (30) with respect to ε and letting ε → 0, we obtain (31), where the involved quantities follow from the optimality condition (36). Combining (30) and (31), we obtain the stated expression, where I is an identity matrix of proper size.

□
Remark 1. According to Corollary 1, the derivative of the loss, i.e., ∇l_CaEN(u), largely determines the behavior of the influence function. In fact, because ∇l_CaEN(u) is bounded, the bounds of the influence function can be easily deduced. For convex losses such as the elastic net loss and the pinball loss, the derivatives are unbounded, which means the corresponding estimators are not robust. In other words, Corollary 1 reveals the source of the robustness of the CaEN loss and of similar bounded nonconvex losses.

Resampling Stability to Feature Noise.
For resampling stability under feature noise, we follow the line of Huang et al. [11]. Recalling the CaEN loss (10), the subgradient of l_CaEN with respect to u can be written explicitly. Then, by the KKT (Karush-Kuhn-Tucker) conditions, the solution of the CaENSVM satisfies (37), where 0 is a vector of proper length with all components equal to zero. For a given w, the training sample index set can be partitioned into the seven sets given in (38). Using the notation of (38), the optimality conditions (37) can be equivalently rewritten as (39). Since S^w_1, S^w_3, and S^w_5 are defined through equalities, it is reasonable to expect that they are much smaller than S^w_2 or S^w_4; thus, their contributions to (39) are considerably weak. In other words, we can roughly determine w from S^w_2 and S^w_4, i.e., via (40). With the parameters τ and θ properly selected, equation (40) indicates that τ controls the sensitivity of the CaENSVM to feature noise. In fact, by (38), ((1 − θ) − θ(1 − y_i w^T x_i)) and (θ(1 − y_i w^T x_i) + (1 − θ)) are both positive, which means that a large τ (close to 1) can well balance the sizes of S^w_2 and S^w_4 under zero-mean feature noise. Therefore, the effect of zero-mean feature noise is weakened, and the final separating hyperplane of the CaENSVM is stable under resampling. As τ decreases (close to 0), by (38), the final separating hyperplane becomes gradually dominated by the instances in S^w_4; as a result, the classification results are significantly disturbed by zero-mean feature noise around the decision boundary.

Bayes Rule.
Let P(x) = Prob(Y = 1 | X = x) be the conditional probability of the positive class given X = x. Lin [46] showed that sign(P(x) − 1/2) is the decision-theoretic optimal classification rule with the smallest generalization error, i.e., the so-called Bayes rule, which we denote by f_C. For any loss function l(·), we define the expected risk of a classifier f: X → Y in the usual way, and by minimizing the expected risk over all measurable functions we obtain f_{l,ρ}(x), where ρ(y|x) is the conditional distribution of y given x.
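In display form, the Bayes rule and the expected risk can be written as follows, using the margin variable u = 1 − y f(x) adopted throughout; this writing is our paraphrase of the corresponding definitions rather than a verbatim reproduction:

f_C(x) = sign(P(x) − 1/2),
R_{l,ρ}(f) = E_{(x,y)~ρ}[ l(1 − y f(x)) ],

and f_{l,ρ} denotes a minimizer of R_{l,ρ}(f) over all measurable functions f.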
Note that ρ(y|x) is a binary distribution determined by Prob(y = +1|x) and Prob(y = −1|x). Huang et al. [11] proved that the pinball loss leads to the Bayes classifier. In the following, we demonstrate that the Bayes rule also holds for the proposed capped asymmetric elastic net loss.
According to (10), it follows that (45) and (46) hold. Let P(+1) = Prob(y = +1|x) and P(−1) = Prob(y = −1|x), respectively, and introduce the shorthand r for convenience. We find that the expected risk depends on the value of u_2. Thus, based on (45) and (46) and the continuity of l_CaEN, we discuss all cases as follows. For the case of 0 < u_2 ≤ 1, the minimal value is easily verified to be unaffected. Supposing that P(+1) > P(−1), the minimizer μ is +1 according to the increase-decrease behavior of l_CaEN with respect to μ on each interval. Similarly, the minimizer μ is −1 if P(+1) < P(−1), and the minimizer μ is either +1 or −1 if P(+1) = P(−1). Consequently, we have f_{l_CaEN}(x) = f_C(x) when 0 < u_2 ≤ 1.

For the case of 1 < u_2 < 2, two of the intervals, including [1, u_2 − 1], disappear from the corresponding expression, which has no effect on the minimum value. After simple calculations, one finds that the minimal value is the same as in the case 0 < u_2 ≤ 1, and thus the same conclusion holds. For the case of 2 < u_2 < 2 − u_1, the intervals including [u_2 − 1, 1 − u_1] disappear, which again has no effect on the minimum value, so the minimal value is the same as in the case 0 < u_2 ≤ 1. For the case of u_2 > 2 − u_1, one easily obtains the same minimal value as in the case 0 < u_2 ≤ 1. With the above results, minimizing the l_CaEN-based expected risk over all measurable functions leads to the Bayes rule. Thus, Theorem 2 is proved. □

4.3. Generalization Error Bound. We have proved that the proposed CaENSVM is consistent with the Bayes rule, i.e., it is classification-calibrated in the sense of Bartlett et al. [7], indicating that the CaENSVM enjoys many good properties. Here, we further give the generalization error bound of the CaENSVM based on the empirical Rademacher complexity [6], which is defined as follows.

Definition 1. Suppose that x_1, ..., x_n are independently selected from X according to a probability distribution ν, and let F be a real-valued function class mapping from X to R. The empirical Rademacher complexity of F is the random variable defined in (51), where σ = (σ_1, ..., σ_n)^T is a vector of independent uniform {±1}-valued (Rademacher) random variables and E_σ[·] denotes the expectation over σ. The Rademacher complexity of F, defined in (52), takes the further expectation E_ν[·] over ν.
According to (51), the empirical Rademacher complexity R̂_n(F) can be viewed as a correlation between f = (f(x_1), ..., f(x_n))^T and σ for given x_1, ..., x_n; thus, the higher R̂_n(F) is, the more complex F is. By (52), R_n(F) describes the average complexity of F under ν rather than over a particular set of samples.
In line with the above definitions and the lemmas in the study of Bartlett and Mendelson [6], we provide the following theorem (Theorem 3) to yield the generalization error bound of the CaENSVM.

Proof. Define the Heaviside function Ξ(·) in the usual way; then, the misclassification error can be expressed through Ξ(·). By defining Ψ(u) = 2 l_CaEN(1 + u; τ, θ, s = 1/2), we can verify that u_2 ≤ 1 and that Ψ(u) ∈ [0, 1] dominates Ξ(·) on the support of ρ, so the corresponding result of Bartlett and Mendelson [6] can be applied. According to the CaEN loss (10) with the optimal w*_0, the empirical Ψ-risk can be controlled, and according to definition (51), the relevant empirical Rademacher complexity can be bounded further by Lemma 22 in the study of Bartlett and Mendelson [6]. Finally, combining (56)-(58) and (60), we reach the result of the theorem. □
Remark 3. According to the proof of Theorem 3, s = 1/2 leads to a tighter generalization error bound. However, experimental results indicate that a larger s may produce a more satisfactory classifier.

Numerical Studies
In this section, we conduct extensive experiments to investigate the performance of the proposed CaENSVM on both synthetic and benchmark datasets. For a fair assessment, we compare it with several well-known or recent related SVMs, including ENSVM [39], PinSVM [11], RampSVM [28], Rhinge-SVM [36], and Valley-SVM [13]. Note that Tang et al. [13] first proposed the valley loss and applied it to the SVM; the loss parameters of the compared methods are selected as in [36], and the parameter θ of the CaENSVM is tuned over {0.1, 0.5, 0.9}. For nonlinear cases, we consider the Gaussian kernel, i.e., K(x_1, x_2) = exp(−c‖x_1 − x_2‖), where c > 0 is the kernel parameter. Unless otherwise specified, all remaining parameters, including the kernel parameter, are optimized over {2^−8, 2^−7, ..., 2^7, 2^8}. We use the five-fold cross-validation strategy to search for the optimal parameters. Note that, for the implemented Pegasos-based DC algorithm, eps is fixed to 10^−3, T_1 = 10, and T_2 = 500 throughout our experiments; according to the numerical studies, this setting often leads to satisfactory results.

Synthetic Datasets.
We generate a two-dimensional synthetic dataset to test the robustness of the CaENSVM. The training dataset consists of 60 samples split equally between the positive and negative classes. The positive samples are drawn independently from a two-dimensional normal distribution with mean vector μ_+ = (−0.4, 1.0)^T and covariance matrix V = diag(0.02, 0.06). The negative samples are drawn independently from a similar normal distribution with mean vector μ_− = (0.4, 0.2)^T and the same covariance matrix.
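This synthetic dataset can be reproduced with a few lines of NumPy; the random seed is arbitrary, and the outliers for Cases 1 and 2 below are added separately:

import numpy as np

rng = np.random.default_rng(42)        # arbitrary seed for illustration
n_per_class = 30                       # 60 training samples in total, equally split
V = np.diag([0.02, 0.06])              # shared covariance matrix

X_pos = rng.multivariate_normal(mean=[-0.4, 1.0], cov=V, size=n_per_class)
X_neg = rng.multivariate_normal(mean=[0.4, 0.2], cov=V, size=n_per_class)

X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(n_per_class), -np.ones(n_per_class)])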
Case 1. In this case, we add three extra outliers to the positive samples only. Figure 3 shows the training samples and the separating lines (black solid lines) obtained by the six SVMs. Note that the green "+" symbol in each figure is the midpoint between the centers of the two classes of samples. Since the distributions of the two classes differ only in the location of their centers, a reasonable separating line should pass through this midpoint. In other words, we can measure the level of disturbance caused by outliers by comparing the relative location of the obtained separating line and the corresponding midpoint.
According to Figure 3, our proposed CaENSVM is more robust to outliers and produces the most satisfactory classifier. Due to the unboundedness of the EN loss, the ENSVM is easily attracted by outliers. As Figure 3(a) shows, the ENSVM distinguishes the two classes of samples worst, and the obtained separating line is clearly away from the midpoint, indicating its high sensitivity to outliers. Although the pinball loss is unbounded, correctly classified samples also produce losses, which act as a counterweight and can reduce the attraction of outliers. From Figure 3(b), the PinSVM performs slightly better than the ENSVM; note, however, that the separating line induced by the PinSVM is still close to the outliers, which means the PinSVM remains sensitive to them. The ramp, Rhinge, and valley losses are all nonconvex and bounded, so they can limit the influence of outliers and contribute to robust classifiers. The performances of RampSVM and Rhinge-SVM closely resemble each other, but both produce classifiers with totally different shapes compared with the others. According to Figures 3(c) and 3(d), the corresponding classifiers indicate overfitting as well as sensitivity to outliers. Valley-SVM and our proposed CaENSVM share comparable performance; according to Figures 3(e) and 3(f) and based on the midpoint, our CaENSVM appears slightly better than Valley-SVM.

Case 2. In this case, we add three extra outliers to both the positive and negative samples. Figure 4 shows the training samples and the final decision lines (black solid lines) of the six SVMs. The midpoint between the centers of the two classes of samples is also marked in each figure to help measure the robustness of each classifier.
According to Figure 4, our proposed CaENSVM still performs best compared with the other five SVMs. ENSVM, RampSVM, and Rhinge-SVM present similar performances: they are all severely attracted by outliers and provide unsatisfactory decision lines. Although the decision line obtained by the PinSVM is also deviated by outliers, it is clearly better than those given by the ENSVM, RampSVM, and Rhinge-SVM. In our opinion, since the pinball loss produces losses for correctly classified training samples, it can balance and reduce the influence of outliers to some extent. Valley-SVM performs competitively like the PinSVM, but its decision line is also drawn toward the outliers. In comparison with the other five SVMs, our proposed CaENSVM again shows the best robustness against outliers.
Case 3. In this case, we investigate the time cost of the implemented Pegasos-based DC procedure for the proposed CaENSVM. Specifically, we generate from 50 to 5000 training samples, equally split between the positive and negative classes. For a fair assessment, we compare with the PinSVM solved by clipDCD [47] and the Rhinge-SVM solved by a clipDCD-based half-quadratic optimization algorithm [36]. All tuning parameters are fixed to 0.5 for simplicity. Figures 5 and 6 show the one-run CPU time of PinSVM, Rhinge-SVM, and CaENSVM with the linear and Gaussian kernels, respectively. According to Figures 5 and 6, our implemented Pegasos-based DC procedure clearly runs the fastest in comparison with PinSVM and Rhinge-SVM. For the linear kernel, PinSVM and Rhinge-SVM consume similar amounts of time, even though Rhinge-SVM needs to iterate a clipDCD chunk many times; the adaptive weighting scheme and the sparsity of Rhinge-SVM may help reduce its training cost. Our CaENSVM runs fast, and its time cost is nearly independent of the sample size, indicating the scalability of Pegasos [43]. For the Gaussian kernel, Rhinge-SVM is clearly more time-consuming than PinSVM due to the outer iteration of the half-quadratic procedure. The training time of the CaENSVM now depends on the sample size, but it is still efficient. According to these experimental results, our proposed CaENSVM can be readily applied to large-scale data classification problems.

Experimental Settings and Results.
We select twelve UCI datasets to further demonstrate the advantages of our proposed CaENSVM. Detailed information on the chosen datasets is listed in Table 1. To investigate label noise insensitivity, we artificially add 15% and 25% label noise to the raw datasets, i.e., we randomly select 15% and 25% of the training samples and flip their labels. The experimental results with the linear and Gaussian kernels based on the five-fold cross-validation criterion are shown in Tables 2 and 3, respectively.
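The label-noise injection described above amounts to the following simple procedure (a sketch; the helper name and the seed are ours):

import numpy as np

def add_label_noise(y, ratio, seed=0):
    """Randomly flip the labels of a given fraction of the training samples."""
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    flip_idx = rng.choice(len(y), size=int(ratio * len(y)), replace=False)
    y_noisy[flip_idx] = -y_noisy[flip_idx]   # exchange +1 and -1 labels
    return y_noisy

# e.g., 15% and 25% label noise as in the experiments:
# y_15 = add_label_noise(y_train, 0.15); y_25 = add_label_noise(y_train, 0.25)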
From Table 2, our proposed CaENSVM with the linear kernel outperforms the others in most cases according to the average prediction accuracies. For the case without label noise, our CaENSVM performs slightly better than the ENSVM, followed by Valley-SVM; Rhinge-SVM and RampSVM have competitive performances, while PinSVM appears to be the worst. As the ratio of label noise increases, the performance of all SVMs roughly decreases. Specifically, the prediction accuracy of the ENSVM is affected most by label noise, since the elastic net loss lacks robustness. Although the PinSVM is somewhat stable owing to the asymmetric pinball loss, its prediction accuracy is often unsatisfactory and only competitive with the ENSVM in the cases with moderate and high label noise. Due to the robustness of the ramp, Rhinge, and valley losses, RampSVM and Rhinge-SVM together with Valley-SVM present clearly higher prediction accuracies under label noise in comparison with ENSVM and PinSVM. Because our designed CaEN loss enjoys both outlier insensitivity and resampling stability, the CaENSVM always achieves the highest average prediction accuracies in the cases with label noise. Therefore, the advantage of the CaENSVM becomes more pronounced, indicating its good robustness to label noise.
From Table 3, our proposed CaENSVM with the Gaussian kernel performs slightly better than the others according to the average prediction accuracies. For the raw datasets, the CaENSVM gives the higher average prediction accuracy on more than half of the datasets. The PinSVM performs well here, second only to the CaENSVM. RampSVM also performs well, but clearly worse than PinSVM, and the remaining three SVMs are competitive with one another. For the datasets with moderate and high label noise, the CaENSVM still holds a slight superiority, followed by Rhinge-SVM and PinSVM, while ENSVM and Valley-SVM appear to be the worst on most datasets. Compared with the linear cases, the superiority of our proposed CaENSVM is less pronounced. In our opinion, this is possibly caused by the Pegasos procedure, since the efficiency of Pegasos is reduced in nonlinear cases [43]. We will consider designing efficient algorithms for nonlinear cases in the future.

Comparisons by Statistical Test.
In this part, we conduct a statistical test on the experimental results for the UCI datasets to further demonstrate the advantage of the CaENSVM. Specifically, we utilize the well-known Friedman test with the corresponding post hoc test [48] to study whether there is a statistically significant difference among the CaENSVM and the other five compared SVMs. Firstly, we calculate the average ranks of each method with respect to the different kernels and ratios of label noise in Tables 2 and 3; the results are presented in Table 4.
Secondly, we need to obtain the Friedman statistic based on Table 4. Let D_n and C_k be the total numbers of compared datasets and classifiers, respectively. By Tables 2 and 3, we have D_n = 15 and C_k = 6 for each type of kernel and ratio of label noise. Then, we consider the F statistic F_F, where χ²_F is the raw Friedman statistic and R_i is the average rank of the i-th classifier for each type of kernel and ratio of label noise (the standard expressions for these quantities are recalled at the end of this subsection). The obtained F_F statistic obeys an F distribution with (C_k − 1) and (C_k − 1)(D_n − 1) degrees of freedom. When the significance level is set to 0.1, we have F_0.1(5, 70) = 1.93. According to Table 4, the values of F_F are 8.57, 8.34, and 8.87 with respect to the different ratios of label noise for the linear cases, and 19.27, 21.21, and 12.69 for the nonlinear cases. All values of the F_F statistic are larger than the critical value 1.93, which means that there is indeed a statistical difference among the six compared SVMs.

Thirdly, we apply the Nemenyi test as the post hoc test to further distinguish the detailed differences among the six classifiers. The critical difference (CD) for the gap between the average ranks of two SVMs given by the Nemenyi test is defined with q_0.1 = 2.589. If the absolute difference of the average ranks of two SVMs is larger than the CD, they perform statistically differently; otherwise, there is no statistical difference between them. Figure 7 shows the comparison of the average ranks of each SVM for the different types of kernels and ratios of label noise. According to Figure 7, there is no significant difference between the CaENSVM and the ENSVM with the linear kernel. However, by Figures 7(a)-7(c), the difference between the CaENSVM and the ENSVM becomes larger as the ratio of label noise increases, which means the proposed CaENSVM remains better than the others. Rhinge-SVM, RampSVM, and PinSVM show no significant differences among themselves in the linear cases, but they are all significantly worse than the CaENSVM. Valley-SVM shows good robustness for a high ratio of label noise, but there is still a gap between it and the CaENSVM. For the cases with the Gaussian kernel, our CaENSVM always presents slightly better performance than the others, though the differences between it and Rhinge-SVM and PinSVM are not significant. The performances of Valley-SVM and ENSVM are significantly worse than those of the other four methods, especially for a high ratio of label noise. In short, the CaENSVM enjoys good performance from the statistical viewpoint.
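For completeness, the standard expressions for these quantities in the usual Friedman/Nemenyi formulation (see [48]), with which the numbers reported above are consistent, are:

χ²_F = (12 D_n / (C_k (C_k + 1))) [ Σ_{i=1}^{C_k} R_i² − C_k (C_k + 1)² / 4 ],

F_F = ((D_n − 1) χ²_F) / (D_n (C_k − 1) − χ²_F),

CD = q_α sqrt( C_k (C_k + 1) / (6 D_n) ).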

Handwritten Digit Recognition.
In this subsection, we apply the CaENSVM to a real problem, i.e., handwritten digit recognition. The test dataset is the PMU-UD dataset from the study of Alghazo et al. [49], containing handwritten Urdu/Arabic numerals from 0 to 9. Each handwritten digit is standardized as a 120 × 80 image. Figure 8 shows four selected images for each handwritten digit. Each time, we choose the subsets corresponding to two different handwritten digits to conduct binary classification. The information on all considered datasets is listed in Table 5.
For a fair evaluation, we also compare our CaENSVM with ENSVM, PinSVM, RampSVM, Rhinge-SVM, and Valley-SVM. The average prediction accuracies with the linear and Gaussian kernels based on the five-fold cross-validation criterion are presented in Tables 6 and 7, respectively. According to Tables 6 and 7, our CaENSVM achieves the highest prediction accuracy in more than half of the cases, indicating its excellent performance on handwritten digit recognition problems.

Conclusion
In this paper, we have proposed a novel robust support vector classifier (CaENSVM) with a capped asymmetric elastic net loss function. Theoretical analysis is conducted to thoroughly demonstrate the properties of the CaENSVM, including noise insensitivity, the Bayes rule, and the generalization error bound based on Rademacher complexity. It is worth noting that we use the influence function to explain the robustness of the CaENSVM. Although the constructed CaEN loss is nonconvex, the implemented Pegasos-based DC algorithm can efficiently solve the CaENSVM optimization problem. The results of the numerical studies indicate the following: (1) our CaENSVM is robust to outliers and performs better than many similar state-of-the-art SVMs in terms of prediction accuracy, and this superiority is also supported by the statistical test; (2) the performance of the CaENSVM in nonlinear cases is not as dominant as in the linear cases in comparison with other methods, indicating that there may be room for improving the efficiency of the algorithm. In fact, although the Pegasos algorithm is scalable, its performance with a nonlinear kernel is somewhat unsatisfactory. Further work will focus on designing a more stable and efficient algorithm to achieve higher prediction accuracy, especially for nonlinear cases. Note that the R code of the proposed CaENSVM was developed by the authors and is available at https://github.com/jiandan94/CaENSVM.

Figure 1:
Different types of loss functions: (a) "EN loss" is the elastic net loss, (b) "pin loss" is the pinball loss, and (c) "ramp loss" is the ramp loss.

Theorem 2.
The decision function f_{l_CaEN,ρ} obtained by minimizing the l_CaEN-based expected risk over all measurable functions f: X → Y is equivalent to the Bayes rule, i.e., f_{l_CaEN,ρ}(x) = f_C(x) for all x ∈ X. Proof. By simple calculation, we obtain the expressions used in the case analysis of Section 4.2.

Theorem 3.
Fix ζ ∈ (0, 1) and B ∈ R_+, and consider the binary classification problem on {(x_i^T, y_i)}_{i=1}^n drawn independently from a probability distribution F. Let F = {f | f: x ↦ w^T ϕ(x), ‖w‖ ≤ B} and G = {g | g: (y, f(x)) ↦ −y f(x), f ∈ F} be function classes, respectively. If the CaEN loss (10) with s ≥ 1/2 and the optimal w*_0 are used, then, with probability at least 1 − ζ, the generalization error of the CaENSVM is bounded in terms of the empirical l_CaEN-risk and the empirical Rademacher complexity of F.

Figure 5:
The one-run CPU time (in seconds) of PinSVM, Rhinge-SVM, and CaENSVM with the linear kernel. The x-coordinate is log10(sample size), and the y-coordinate is the training time (in seconds).

Figure 6:
The one-run CPU time (in seconds) of PinSVM, Rhinge-SVM, and CaENSVM with the Gaussian kernel. The x-coordinate is log10(sample size), and the y-coordinate is the training time (in seconds).

Table 1:
The information of the UCI datasets.

Table 2:
The mean accuracy (Acc.) and standard deviation (sd) with the linear kernel for the UCI datasets.

Table 3:
The mean accuracy (Acc.) and standard deviation (sd) with the Gaussian kernel for the UCI datasets.

Table 4:
Average ranks of each SVM with respect to the different kernels and ratios of label noise for the UCI datasets.

Table 5:
The information of the PMU-UD dataset.

Table 6:
The mean accuracy (Acc.) and standard deviation (sd) with the linear kernel for the PMU-UD dataset.


Table 7:
The mean accuracy (Acc.) and standard deviation (sd) with the Gaussian kernel for the PMU-UD dataset.