Approximation Analysis of Learning Algorithms for Support Vector Regression and Quantile Regression

We study learning algorithms generated by regularization schemes in reproducing kernel Hilbert spaces associated with an $\epsilon$-insensitive pinball loss. This loss function is motivated by the $\epsilon$-insensitive loss for support vector regression and the pinball loss for quantile regression. Approximation analysis is conducted for these algorithms by means of a variance-expectation bound when a noise condition is satisfied for the underlying probability measure. The rates are explicitly derived under a priori conditions on the approximation ability and capacity of the reproducing kernel Hilbert space. As an application, we obtain approximation orders for support vector regression and quantile regularized regression.


Introduction and Motivation
In this paper, we study a family of learning algorithms serving the purposes of both support vector regression and quantile regression. Approximation analysis and learning rates will be provided, which also helps in better understanding some classical learning methods.
Support vector regression is a classical kernel-based algorithm in learning theory introduced in [1]. It is a regularization scheme in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ associated with an $\epsilon$-insensitive loss $\psi^{\epsilon}: \mathbb{R} \to \mathbb{R}$ defined for $\epsilon \ge 0$ by
\[
\psi^{\epsilon}(u) = \max\{|u| - \epsilon,\, 0\}. \tag{1.1}
\]

Here, for learning functions on a compact metric space $X$, $K: X \times X \to \mathbb{R}$ is a continuous, symmetric, and positive semidefinite function called a Mercer kernel. The associated RKHS $\mathcal{H}_K$ is defined [2] as the completion of the linear span of the set of functions $\{K_x = K(x, \cdot) : x \in X\}$ with the inner product $\langle \cdot, \cdot \rangle_K$ satisfying $\langle K_x, K_y \rangle_K = K(x, y)$. Let $Y = \mathbb{R}$ and $\rho$ be a Borel probability measure on $Z := X \times Y$. With a sample $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^m \in Z^m$ independently drawn according to $\rho$, the support vector regression is defined as
\[
f^{SVR}_{\mathbf{z}} = \arg\min_{f \in \mathcal{H}_K} \left\{ \frac{1}{m} \sum_{i=1}^m \psi^{\epsilon}(f(x_i) - y_i) + \lambda \|f\|_K^2 \right\}, \tag{1.2}
\]
where $\lambda = \lambda(m) > 0$ is a regularization parameter.
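As a concrete illustration (not part of the original analysis), the finite-dimensional problem behind (1.2) can be solved numerically: by the representer theorem the minimizer has the form $f = \sum_{i=1}^m c_i K(x_i, \cdot)$, so that $\|f\|_K^2 = c^\top \mathbb{K} c$ for the kernel matrix $\mathbb{K} = (K(x_i, x_j))_{i,j}$. The following sketch assumes a Gaussian kernel and a derivative-free SciPy optimizer (the $\epsilon$-insensitive loss is not differentiable); all names are our own.

```python
# A minimal numerical sketch of the regularization scheme (1.2), assuming the
# representer-theorem form f = sum_i c_i K(x_i, .). The Gaussian kernel and the
# SciPy optimizer are illustrative choices, not taken from the paper.
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(x1, x2, sigma=0.3):
    return np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2 * sigma ** 2))

def svr_fit(x, y, lam=1e-2, eps=0.1):
    K = gaussian_kernel(x, x)                      # kernel matrix (K(x_i, x_j))
    def objective(c):
        residual = K @ c - y                       # f(x_i) - y_i
        data = np.mean(np.maximum(np.abs(residual) - eps, 0.0))  # psi^eps term
        return data + lam * c @ K @ c              # + lambda * ||f||_K^2
    res = minimize(objective, np.zeros(len(x)), method="Powell")
    return res.x, K

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)
c, K = svr_fit(x, y)
print("mean |f(x_i) - y_i|:", np.mean(np.abs(K @ c - y)))
```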
When $\epsilon > 0$ is fixed, convergence of (1.2) was analyzed in [3]. Notice from the original motivation [1] for the insensitive parameter $\epsilon$, introduced to balance the approximation ability and sparsity of the algorithm, that $\epsilon$ should change with the sample size, and usually $\epsilon = \epsilon(m) \to 0$ as the sample size $m$ increases. Mathematical analysis for this original algorithm is still open. We will solve this problem as a special case of our approximation analysis for general learning algorithms. In particular, we show how $f^{SVR}_{\mathbf{z}}$ approximates the median function $f_{\rho,1/2}$, which is one of the purposes of this paper. Here, for $x \in X$, the median function value $f_{\rho,1/2}(x)$ is a median of the conditional distribution $\rho(\cdot \mid x)$ of $\rho$ at $x$.
Quantile regression, compared with least squares regression, provides richer information about response variables, such as stretching or compressing tails [4]. It aims at estimating quantile regression functions. With a quantile parameter $0 < \tau < 1$, a quantile regression function $f_{\rho,\tau}$ is defined by requiring its value $f_{\rho,\tau}(x)$ to be a $\tau$-quantile of $\rho(\cdot \mid x)$, that is, a value $t \in Y$ satisfying
\[
\rho(\{y \le t\} \mid x) \ge \tau, \qquad \rho(\{y \ge t\} \mid x) \ge 1 - \tau. \tag{1.3}
\]
Quantile regression has been studied by kernel-based regularization schemes in the learning theory literature (e.g., [5-8]). These regularization schemes take the form
\[
f^{QR}_{\mathbf{z}} = \arg\min_{f \in \mathcal{H}_K} \left\{ \frac{1}{m} \sum_{i=1}^m \psi_\tau(f(x_i) - y_i) + \lambda \|f\|_K^2 \right\}, \tag{1.4}
\]
where $\psi_\tau: \mathbb{R} \to \mathbb{R}$ is the pinball loss shown in Figure 1, defined by
\[
\psi_\tau(u) = \begin{cases} (1-\tau)\, u, & u > 0, \\ -\tau u, & u \le 0. \end{cases} \tag{1.5}
\]
Motivated by the $\epsilon$-insensitive loss $\psi^{\epsilon}$ and the pinball loss $\psi_\tau$, we propose the $\epsilon$-insensitive pinball loss $\psi^{\epsilon}_\tau: \mathbb{R} \to \mathbb{R}$ with an insensitive parameter $\epsilon \ge 0$, shown in Figure 1 and defined as
\[
\psi^{\epsilon}_\tau(u) = \begin{cases} (1-\tau)(u - \epsilon), & u > \epsilon, \\ 0, & -\epsilon \le u \le \epsilon, \\ -\tau (u + \epsilon), & u < -\epsilon. \end{cases} \tag{1.6}
\]
This loss function has been applied to online learning for quantile regression in our previous work [8]. It is applied here to a regularization scheme in the RKHS as
\[
f_{\mathbf{z}} = \arg\min_{f \in \mathcal{H}_K} \left\{ \frac{1}{m} \sum_{i=1}^m \psi^{\epsilon}_\tau(f(x_i) - y_i) + \lambda \|f\|_K^2 \right\}. \tag{1.7}
\]
The main goal of this paper is to study how the output function $f_{\mathbf{z}}$ given by (1.7) converges to the quantile regression function $f_{\rho,\tau}$ and how explicit learning rates can be obtained with suitable choices of the parameters $\lambda = m^{-\alpha}$, $\epsilon = m^{-\beta}$, based on a priori conditions on the probability measure $\rho$.
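The three losses can be written down directly; the following sketch (with our own function names, and assuming the piecewise forms (1.1), (1.5), (1.6) given above) checks the two special cases noted later: $\epsilon = 0$ recovers the pinball loss, and $\tau = 1/2$ gives half of the $\epsilon$-insensitive loss.

```python
# Sketch of the losses psi^eps (1.1), psi_tau (1.5), and psi^eps_tau (1.6).
import numpy as np

def psi_eps(u, eps):
    # eps-insensitive loss: max(|u| - eps, 0)
    return np.maximum(np.abs(u) - eps, 0.0)

def pinball(u, tau):
    # pinball loss: (1 - tau) u for u > 0, -tau u for u <= 0
    return np.where(u > 0, (1 - tau) * u, -tau * u)

def psi_eps_tau(u, tau, eps):
    # eps-insensitive pinball loss: zero on [-eps, eps], pinball slopes outside
    return np.where(u > eps, (1 - tau) * (u - eps),
                    np.where(u < -eps, -tau * (u + eps), 0.0))

u = np.linspace(-2, 2, 401)
assert np.allclose(psi_eps_tau(u, 0.3, 0.0), pinball(u, 0.3))        # eps = 0
assert np.allclose(psi_eps_tau(u, 0.5, 0.4), 0.5 * psi_eps(u, 0.4))  # tau = 1/2
```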

Main Results on Approximation
Throughout the paper, we assume that the conditional distribution $\rho(\cdot \mid x)$ is supported on $[-1, 1]$ for every $x \in X$. Then, we see from (1.3) that we can take values of $f_{\rho,\tau}$ to be on $[-1, 1]$.
So, to see how $f_{\mathbf{z}}$ approximates $f_{\rho,\tau}$, it is natural to project values of the output function $f_{\mathbf{z}}$ onto the same interval by the projection operator introduced in [9].
Definition 2.1. The projection operator $\pi$ on the space of functions on $X$ is defined by
\[
\pi(f)(x) = \begin{cases} 1, & \text{if } f(x) > 1, \\ -1, & \text{if } f(x) < -1, \\ f(x), & \text{if } -1 \le f(x) \le 1. \end{cases} \tag{2.1}
\]
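On sampled function values, the projection of Definition 2.1 is simply a clipping operation; a one-line sketch (numpy assumed):

```python
import numpy as np

def project(f_values):
    # pi(f)(x): clip values of f into [-1, 1]
    return np.clip(f_values, -1.0, 1.0)
```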
Our approximation analysis aims at establishing bounds for the error $\|\pi(f_{\mathbf{z}}) - f_{\rho,\tau}\|_{L^{p^*}_{\rho_X}}$ in the space $L^{p^*}_{\rho_X}$ with some $p^* > 0$, where $\rho_X$ is the marginal distribution of $\rho$ on $X$.

Our error bounds and learning rates are presented in terms of a noise condition and an approximation condition on $\rho$.
The noise condition on $\rho$ is defined in [5, 6] as follows.
Definition 2.2. Let $p \in (0, \infty]$ and $q \in [1, \infty)$. We say that $\rho$ has a $\tau$-quantile of $p$-average type $q$ if for every $x \in X$ there exist a $\tau$-quantile $t^* \in \mathbb{R}$ and constants $a_x \in (0, 2]$, $b_x > 0$ such that for each $u \in [0, a_x]$,
\[
\min\left\{ \rho\big( (t^* - u,\, t^*) \mid x \big),\ \rho\big( (t^*,\, t^* + u) \mid x \big) \right\} \ge b_x u^{q-1}, \tag{2.2}
\]
and that the function on $X$ taking values $(b_x a_x^{q-1})^{-1}$ lies in $L^p_{\rho_X}$.

Note that condition (2.2) tells us that $f_{\rho,\tau}(x) = t^*$ is uniquely defined at every $x \in X$. The approximation condition on $\rho$ is stated in terms of the integral operator $L_K: L^2_{\rho_X} \to L^2_{\rho_X}$ given by
\[
L_K f(x) = \int_X K(x, y) f(y)\, d\rho_X(y), \qquad x \in X.
\]
Since $K$ is positive semidefinite, $L_K$ is a compact positive operator and its $r$-th power $L_K^r$ is well defined for any $r > 0$. Our approximation condition is given as
\[
f_{\rho,\tau} = L_K^r(g_\rho) \quad \text{for some } r > 0 \text{ and } g_\rho \in L^2_{\rho_X}. \tag{2.3}
\]
Let us illustrate our approximation analysis by the following special case, which will be proved in Section 5.
If $p \ge 1/(2\eta) - 2$ for $0 < \eta < 1/4$, we see that the power exponent for the learning rate (2.4) is at least $1/2 - 2\eta$. This exponent can be arbitrarily close to $1/2$ when $\eta$ is small enough.
In particular, if we take $\tau = 1/2$, Theorem 2.3 provides rates for the output function $f^{SVR}_{\mathbf{z}}$ of the support vector regression (1.2) to approximate the median function $f_{\rho,1/2}$. If we take $\beta = \infty$, leading to $\epsilon = 0$, Theorem 2.3 provides rates for the output function $f^{QR}_{\mathbf{z}}$ of the quantile regression algorithm (1.4) to approximate the quantile regression function $f_{\rho,\tau}$.
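To make Definition 2.2 concrete, consider a toy example of our own: if $\rho(\cdot \mid x)$ is the uniform distribution on $[-1, 1]$, its $\tau$-quantile is $t^* = 2\tau - 1$ and both intervals in (2.2) carry mass $u/2$, so the condition holds with $q = 2$, $b_x = 1/2$, and $a_x = 2\min\{\tau, 1-\tau\}$. A quick numerical check:

```python
# Check of condition (2.2) for the uniform conditional distribution on [-1, 1]
# (a toy example of ours, not from the paper).
import numpy as np

tau = 0.3
t_star = 2 * tau - 1                    # tau-quantile of Uniform[-1, 1]
q, b_x = 2, 0.5
a_x = 2 * min(tau, 1 - tau)

for u in np.linspace(0.0, a_x, 7):
    mass_below = 0.5 * u                # rho((t* - u, t*) | x), density is 1/2
    mass_above = 0.5 * u                # rho((t*, t* + u) | x)
    assert min(mass_below, mass_above) >= b_x * u ** (q - 1) - 1e-12
```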

General Approximation Analysis
To state our approximation analysis in the general case, we need the capacity of the hypothesis space measured by covering numbers.
Definition 2.4. For a subset $S$ of $C(X)$ and $u > 0$, the covering number $\mathcal{N}(S, u)$ is the minimal integer $l \in \mathbb{N}$ such that there exist $l$ disks with radius $u$ covering $S$.
The covering numbers of balls $B_R = \{f \in \mathcal{H}_K : \|f\|_K \le R\}$ with radius $R > 0$ of the RKHS have been well studied in the learning theory literature [10, 11]. In this paper, we assume that for some $s > 0$ and $C_s > 0$,
\[
\log \mathcal{N}(B_1, u) \le C_s u^{-s}, \qquad \forall u > 0. \tag{2.5}
\]
Now we can state our main result, which will be proved in Section 5. For $p \in (0, \infty]$ and $q \in [1, \infty)$, we denote
\[
\theta = \min\left\{ \frac{2}{q},\ \frac{p}{p+1} \right\}. \tag{2.6}
\]

Theorem 2.5. Assume (2.3) with $0 < r \le 1/2$ and (2.5) with $s > 0$. Suppose that $\rho$ has a $\tau$-quantile of $p$-average type $q$ for some $p \in (0, \infty]$ and $q \in [1, \infty)$. Take $\lambda = m^{-\alpha}$ with $0 < \alpha \le 1$ and $\alpha < (2+s)/(s(2+s-\theta))$, $\epsilon = m^{-\beta}$ with $0 < \beta \le \infty$, and let $\eta > 0$ satisfy (2.7). Then, with $p^* = pq/(p+1) > 0$, for any $0 < \delta < 1$, with confidence $1 - \delta$, one has
\[
\left\| \pi(f_{\mathbf{z}}) - f_{\rho,\tau} \right\|_{L^{p^*}_{\rho_X}} \le \widetilde{C}\, m^{-\vartheta}, \tag{2.8}
\]
where $\widetilde{C}$ is a constant independent of $m$ or $\delta$ and the power index $\vartheta$ is given in terms of $r$, $s$, $p$, $q$, $\alpha$, and $\eta$ by (2.9).
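The exponents entering Theorem 2.5 are elementary functions of $p$ and $q$; a small helper (our own naming) following (2.6) and the definition of $p^*$:

```python
# Exponents from (2.6) and Theorem 2.5: theta = min{2/q, p/(p+1)}, p* = pq/(p+1).
def theta(p, q):
    return min(2.0 / q, p / (p + 1.0))

def p_star(p, q):
    return p * q / (p + 1.0)

print(theta(2, 2), p_star(2, 2))   # p = q = 2: theta = 2/3, p* = 4/3
```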
The index $\vartheta$ can be viewed as a function of the variables $r$, $s$, $p$, $q$, $\alpha$, $\eta$. The restriction $0 < \alpha < (2+s)/(s(2+s-\theta))$ on $\alpha$ and (2.7) on $\eta$ ensure that $\vartheta$ is positive, which guarantees a valid learning rate in Theorem 2.5.
Assumption (2.5) is a measurement of the regularity of the kernel $K$ when $X$ is a subset of $\mathbb{R}^n$. In particular, $s$ can be arbitrarily small when $K$ is smooth enough. In this case, the power index $\vartheta$ in (2.8) can be arbitrarily close to $\frac{1}{q} \min\left\{\frac{\alpha r}{1-r}, \frac{1}{2-\theta}\right\}$. Again, when $\beta = \infty$ and $\epsilon = 0$, algorithm (1.7) corresponds to algorithm (1.4) for quantile regression. In this case, Theorem 2.5 provides learning rates for the quantile regression algorithm (1.4).
Error analysis has been carried out for the quantile regression algorithm (1.4) in [5, 6], under the assumptions that $\rho$ satisfies (2.2) with some $p \in (0, \infty]$ and $q > 1$ and that the $\ell^2$ empirical covering number $\mathcal{N}_{\mathbf{z}}(B_1, \eta, 2)$ of $B_1$ (see [5] for more details) is suitably bounded. It was proved in [5] that, with confidence $1 - \delta$, an error bound holds with constants $C_p$ and $K_{s, C_p}$ independent of $m$ or $\lambda$. Here, $\mathcal{D}_\tau(\lambda)$ is the regularization error defined [12] as
\[
\mathcal{D}_\tau(\lambda) = \min_{f \in \mathcal{H}_K} \left\{ \mathcal{E}_\tau(f) - \mathcal{E}_\tau(f_{\rho,\tau}) + \lambda \|f\|_K^2 \right\}, \tag{2.12}
\]
and $\mathcal{E}_\tau(f)$ is the generalization error associated with the pinball loss $\psi_\tau$ defined by
\[
\mathcal{E}_\tau(f) = \int_Z \psi_\tau(f(x) - y)\, d\rho. \tag{2.13}
\]
Note that $\mathcal{E}_\tau(f)$ is minimized by the quantile regression function $f_{\rho,\tau}$. Thus, when the regularization error $\mathcal{D}_\tau(\lambda)$ decays polynomially as $\mathcal{D}_\tau(\lambda) = O(\lambda^{r/(1-r)})$, which is ensured by Proposition 3.3 below when (2.3) is satisfied, and $\lambda$ is chosen accordingly, the learning rate of [5] takes the form given in (2.14).
Since $(p+1)/(p+2) = 1/(2-\theta)$ when $\theta = p/(p+1)$, we see that this learning rate is comparable to our result in (2.8).

Comparison with Least Squares Regression
There is a large literature in learning theory, described in [12], for the least squares algorithm
\[
f^{LS}_{\mathbf{z}} = \arg\min_{f \in \mathcal{H}_K} \left\{ \frac{1}{m} \sum_{i=1}^m (f(x_i) - y_i)^2 + \lambda \|f\|_K^2 \right\}. \tag{2.15}
\]
It aims at learning the regression function $f_\rho(x) = \int_Y y\, d\rho(y \mid x)$. A crucial property for its error analysis is the identity
\[
\mathcal{E}^{ls}(f) - \mathcal{E}^{ls}(f_\rho) = \left\| f - f_\rho \right\|^2_{L^2_{\rho_X}}
\]
for the least squares generalization error $\mathcal{E}^{ls}(f) = \int_Z (f(x) - y)^2\, d\rho$, where $f: X \to Y$ is an arbitrary measurable function. This identity yields, for the random variable $\xi = (f(x) - y)^2 - (f_\rho(x) - y)^2$ with bounded $f$ and $y$, a variance-expectation bound of the form $E(\xi^2) \le C\, E(\xi)$. Such a variance-expectation bound, with $E\xi$ possibly replaced by its positive power $(E\xi)^\theta$, plays an essential role in analyzing regularization schemes, and the power exponent $\theta$ depends on the strong convexity of the loss; see [13] and references therein. However, the pinball loss in the quantile regression setting has no strong convexity [6], and we would not expect a variance-expectation bound for a general distribution $\rho$. When $\rho$ has a $\tau$-quantile of $p$-average type $q$, the following variance-expectation bound, with $\theta$ given by (2.6), can be found in [5, 7] (derived by means of Lemma 3.1 below).

Lemma 2.6. If $\rho$ has a $\tau$-quantile of $p$-average type $q$ for some $p \in (0, \infty]$ and $q \in [1, \infty)$, then for every measurable function $f: X \to [-1, 1]$,
\[
E\left\{ \left( \psi_\tau(f(x) - y) - \psi_\tau(f_{\rho,\tau}(x) - y) \right)^2 \right\} \le C_\theta \left( \mathcal{E}_\tau(f) - \mathcal{E}_\tau(f_{\rho,\tau}) \right)^\theta, \tag{2.16}
\]
where the power index $\theta$ is given by (2.6) and $C_\theta$ is a constant independent of $f$. Lemma 2.6 overcomes the difficulty of quantile regression caused by the lack of strong convexity of the pinball loss. It enables us to derive satisfactory learning rates, as in Theorem 2.5.
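The least squares identity quoted above is easy to verify by simulation; the following Monte Carlo sketch (a toy model of ours, with $f_\rho(x) = \sin(\pi x)$) compares the excess generalization error with the squared $L^2$ distance.

```python
# Monte Carlo check of E_ls(f) - E_ls(f_rho) = ||f - f_rho||^2_{L^2(rho_X)}.
import numpy as np

rng = np.random.default_rng(1)
m = 200_000
x = rng.uniform(-1, 1, m)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(m)  # so f_rho(x) = sin(pi x)

f = lambda t: 0.5 * t                                  # arbitrary test function
f_rho = lambda t: np.sin(np.pi * t)

excess = np.mean((f(x) - y) ** 2) - np.mean((f_rho(x) - y) ** 2)
l2_sq = np.mean((f(x) - f_rho(x)) ** 2)
print(excess, l2_sq)   # the two values agree up to Monte Carlo error
```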

Insensitive Relation and Error Decomposition
An important relation for quantile regression observed in [5] asserts that the error $\pi(f_{\mathbf{z}}) - f_{\rho,\tau}$, measured in a suitable $L^{p^*}_{\rho_X}$ space, can be bounded by the excess generalization error $\mathcal{E}_\tau(\pi(f_{\mathbf{z}})) - \mathcal{E}_\tau(f_{\rho,\tau})$ when the noise condition is satisfied.
Lemma 3.1. Let $p \in (0, \infty]$ and $q \in [1, \infty)$. Denote $p^* = pq/(p+1) > 0$. If $\rho$ has a $\tau$-quantile of $p$-average type $q$, then for any measurable function $f$ on $X$, one has
\[
\|f - f_{\rho,\tau}\|_{L^{p^*}_{\rho_X}} \le C_{q,\rho} \left( \mathcal{E}_\tau(f) - \mathcal{E}_\tau(f_{\rho,\tau}) \right)^{1/q}, \tag{3.1}
\]
where $C_{q,\rho} = 2^{1-1/q} q^{1/q} \big\| \{ (b_x a_x^{q-1})^{-1} \} \big\|^{1/q}_{L^p_{\rho_X}}$.

By Lemma 3.1, we only need to bound the excess generalization error $\mathcal{E}_\tau(\pi(f_{\mathbf{z}})) - \mathcal{E}_\tau(f_{\rho,\tau})$. This will be done by conducting an error decomposition, which has been developed in the literature for regularization schemes [9, 13-15]. A technical difficulty arises for our problem here because the insensitive parameter $\epsilon$ changes with $m$. This can be overcome [16] by the following insensitive relation:
\[
\psi^{\epsilon}_\tau(u) \le \psi_\tau(u) \le \psi^{\epsilon}_\tau(u) + \epsilon, \qquad \forall u \in \mathbb{R}. \tag{3.2}
\]
Now, we can conduct an error decomposition. Define the empirical error $\mathcal{E}_{\mathbf{z},\tau}(f)$ for $f: X \to \mathbb{R}$ as
\[
\mathcal{E}_{\mathbf{z},\tau}(f) = \frac{1}{m} \sum_{i=1}^m \psi_\tau(f(x_i) - y_i). \tag{3.3}
\]

Lemma 3.2. Let $\lambda > 0$, $f_{\mathbf{z}}$ be defined by (1.7), and $f^0_\lambda$ be a minimizer of (2.12). Then
\[
\mathcal{E}_\tau(\pi(f_{\mathbf{z}})) - \mathcal{E}_\tau(f_{\rho,\tau}) + \lambda \|f_{\mathbf{z}}\|_K^2 \le \mathcal{S}_1 + \mathcal{S}_2 + \mathcal{D}_\tau(\lambda) + 2\epsilon, \tag{3.5}
\]
where
\[
\begin{aligned}
\mathcal{S}_1 &= \left\{ \mathcal{E}_\tau(\pi(f_{\mathbf{z}})) - \mathcal{E}_\tau(f_{\rho,\tau}) \right\} - \left\{ \mathcal{E}_{\mathbf{z},\tau}(\pi(f_{\mathbf{z}})) - \mathcal{E}_{\mathbf{z},\tau}(f_{\rho,\tau}) \right\}, \\
\mathcal{S}_2 &= \left\{ \mathcal{E}_{\mathbf{z},\tau}(f^0_\lambda) - \mathcal{E}_{\mathbf{z},\tau}(f_{\rho,\tau}) \right\} - \left\{ \mathcal{E}_\tau(f^0_\lambda) - \mathcal{E}_\tau(f_{\rho,\tau}) \right\}.
\end{aligned} \tag{3.6}
\]
Proof. The regularized excess generalization error can be decomposed as
\[
\mathcal{E}_\tau(\pi(f_{\mathbf{z}})) - \mathcal{E}_\tau(f_{\rho,\tau}) + \lambda \|f_{\mathbf{z}}\|_K^2 = \mathcal{S}_1 + \left\{ \mathcal{E}_{\mathbf{z},\tau}(\pi(f_{\mathbf{z}})) + \lambda \|f_{\mathbf{z}}\|_K^2 - \mathcal{E}_{\mathbf{z},\tau}(f_{\rho,\tau}) \right\}. \tag{3.7}
\]
The fact $|y| \le 1$ implies that $\mathcal{E}_{\mathbf{z},\tau}(\pi(f_{\mathbf{z}})) \le \mathcal{E}_{\mathbf{z},\tau}(f_{\mathbf{z}})$. The insensitive relation (3.2) and the definition of $f_{\mathbf{z}}$ tell us that
\[
\mathcal{E}_{\mathbf{z},\tau}(f_{\mathbf{z}}) + \lambda \|f_{\mathbf{z}}\|_K^2 \le \mathcal{E}_{\mathbf{z},\tau}(f^0_\lambda) + \lambda \|f^0_\lambda\|_K^2 + 2\epsilon. \tag{3.8}
\]
Then, by subtracting and adding $\mathcal{E}_\tau(f_{\rho,\tau})$ and $\mathcal{E}_{\mathbf{z},\tau}(f_{\rho,\tau})$, we see that the desired inequality in Lemma 3.2 holds true.
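The insensitive relation (3.2) used in the proof can be checked on a grid, reusing the loss sketches from the introduction:

```python
# Grid check of (3.2): psi^eps_tau(u) <= psi_tau(u) <= psi^eps_tau(u) + eps.
import numpy as np

u = np.linspace(-3, 3, 601)
for tau in (0.1, 0.5, 0.9):
    for eps in (0.0, 0.2, 1.0):
        lower = psi_eps_tau(u, tau, eps)   # from the sketch in the introduction
        mid = pinball(u, tau)
        assert np.all(lower <= mid + 1e-12)
        assert np.all(mid <= lower + eps + 1e-12)
```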
In the error decomposition (3.5), the first two terms are called the sample error. The last term is the regularization error defined in (2.12). It can be estimated as follows.

Proposition 3.3. Assume (2.3) with $0 < r \le 1/2$. Then
\[
\mathcal{D}_\tau(\lambda) \le C_0 \lambda^{r/(1-r)}, \qquad \|f^0_\lambda\|_K \le \sqrt{C_0}\, \lambda^{(2r-1)/(2(1-r))}, \tag{3.10}
\]
where $C_0 = \|g_\rho\|_{L^2_{\rho_X}} + \|g_\rho\|^2_{L^2_{\rho_X}}$.
Proof. It can be found in [17, 18] that when (2.3) holds, the function $f_\mu = (L_K + \mu I)^{-1} L_K f_{\rho,\tau}$ satisfies, for every $\mu > 0$,
\[
\|f_\mu - f_{\rho,\tau}\|_{L^2_{\rho_X}} \le \mu^r \|g_\rho\|_{L^2_{\rho_X}}, \qquad \|f_\mu\|_K \le \mu^{r-1/2} \|g_\rho\|_{L^2_{\rho_X}}. \tag{3.11}
\]
Since $\psi_\tau$ is Lipschitz, we know by taking $f = f_\mu$ with $\mu = \lambda^{1/(1-r)}$ in (2.12) that
\[
\mathcal{D}_\tau(\lambda) \le \|f_\mu - f_{\rho,\tau}\|_{L^2_{\rho_X}} + \lambda \|f_\mu\|_K^2 \le \lambda^{r/(1-r)} \left( \|g_\rho\|_{L^2_{\rho_X}} + \|g_\rho\|^2_{L^2_{\rho_X}} \right). \tag{3.12}
\]
This verifies the desired bound for $\mathcal{D}_\tau(\lambda)$. By taking $f = f^0_\lambda$ in (2.12), we have
\[
\lambda \|f^0_\lambda\|_K^2 \le \mathcal{E}_\tau(f^0_\lambda) - \mathcal{E}_\tau(f_{\rho,\tau}) + \lambda \|f^0_\lambda\|_K^2 = \mathcal{D}_\tau(\lambda). \tag{3.13}
\]
Then the bound for $\|f^0_\lambda\|_K$ is proved.
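The choice $\mu = \lambda^{1/(1-r)}$ in the proof is exactly the one balancing the two terms of (3.12); a short check of the exponent arithmetic:
\[
\mu^r = \lambda\, \mu^{2r-1} \iff \mu^{1-r} = \lambda \iff \mu = \lambda^{1/(1-r)},
\qquad
\lambda\, \mu^{2r-1} = \lambda^{1 + \frac{2r-1}{1-r}} = \lambda^{\frac{r}{1-r}} = \mu^r.
\]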

Estimating Sample Error
This section is devoted to estimating the sample error. This is conducted by using the variance-expectation bound in Lemma 2.6. Denote $\kappa = \sup_{x \in X} \sqrt{K(x, x)}$ and $\mathcal{W}_R = \{\mathbf{z} \in Z^m : \|f_{\mathbf{z}}\|_K \le R\}$ for $R \ge 1$.

Proposition 4.1. Assume (2.3) and (2.5). Let $R \ge 1$ and $0 < \delta < 1$. If $\rho$ has a $\tau$-quantile of $p$-average type $q$ for some $p \in (0, \infty]$ and $q \in [1, \infty)$, then there exists a subset $V_R$ of $Z^m$ with measure at most $\delta$ such that the sample error bound (4.2) holds for any $\mathbf{z} \in \mathcal{W}_R \setminus V_R$, where $\theta$ is given by (2.6) and $C_1$, $C_2$, $C_3$ are constants given by (4.3).
Proof. Let us first estimate the second part $\mathcal{S}_2$ of the sample error. It can be decomposed into two parts $\mathcal{S}_2 = \mathcal{S}_{2,1} + \mathcal{S}_{2,2}$, where
\[
\begin{aligned}
\mathcal{S}_{2,1} &= \left\{ \mathcal{E}_{\mathbf{z},\tau}(f^0_\lambda) - \mathcal{E}_{\mathbf{z},\tau}(\pi(f^0_\lambda)) \right\} - \left\{ \mathcal{E}_\tau(f^0_\lambda) - \mathcal{E}_\tau(\pi(f^0_\lambda)) \right\}, \\
\mathcal{S}_{2,2} &= \left\{ \mathcal{E}_{\mathbf{z},\tau}(\pi(f^0_\lambda)) - \mathcal{E}_{\mathbf{z},\tau}(f_{\rho,\tau}) \right\} - \left\{ \mathcal{E}_\tau(\pi(f^0_\lambda)) - \mathcal{E}_\tau(f_{\rho,\tau}) \right\}.
\end{aligned} \tag{4.4}
\]
For bounding $\mathcal{S}_{2,1}$, we take the random variable $\xi(z) = \psi_\tau(f^0_\lambda(x) - y) - \psi_\tau(\pi(f^0_\lambda)(x) - y)$. Applying the one-sided Bernstein inequality [12], we know that there exists a subset $Z_{1,\delta}$ of $Z^m$ with measure at least $1 - \delta/3$ on which the bound (4.5) holds. For $\mathcal{S}_{2,2}$, we apply the one-sided Bernstein inequality again to the random variable $\xi(z) = \psi_\tau(\pi(f^0_\lambda)(x) - y) - \psi_\tau(f_{\rho,\tau}(x) - y)$, bound its variance by Lemma 2.6 with $f = \pi(f^0_\lambda)$, and find that there exists another subset $Z_{2,\delta}$ of $Z^m$ with measure at least $1 - \delta/3$ on which the bound (4.6) holds.
Next, we estimate the first part $\mathcal{S}_1$ of the sample error. Consider the function set
\[
\mathcal{G} = \left\{ \psi_\tau(\pi(f)(x) - y) - \psi_\tau(f_{\rho,\tau}(x) - y) : f \in B_R \right\}.
\]
Each $g \in \mathcal{G}$ satisfies $\|g\|_\infty \le 2$, $Eg \ge 0$, and $E(g^2) \le C_\theta (Eg)^\theta$ by (2.16). Also, the Lipschitz property of the pinball loss yields $\mathcal{N}(\mathcal{G}, u) \le \mathcal{N}(B_1, u/R)$. Then, we apply a standard covering number argument with a ratio inequality [12, 13, 19, 20] to $\mathcal{G}$ and find from the covering number condition (2.5) a high-probability bound for $\sup_{g \in \mathcal{G}} \{ Eg - \frac{1}{m} \sum_{i=1}^m g(z_i) \}$. Setting the confidence to be $1 - \delta/3$, we take $u^* = u^*(R, m, \delta/3)$ to be the positive solution to the equation (4.9). Then, there exists a third subset $Z_{3,\delta}$ of $Z^m$ with measure at least $1 - \delta/3$ such that the bound (4.11) holds for every $\mathbf{z} \in Z_{3,\delta}$.
Here, we have used the elementary inequality $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ and Young's inequality. Putting this bound and (4.5), (4.6) into (3.5), we know that for $\mathbf{z} \in \mathcal{W}_R \cap Z_{3,\delta} \cap Z_{1,\delta} \cap Z_{2,\delta}$, there holds the estimate (4.12), which together with Proposition 3.3 implies (4.13).
Here, we have used the reproducing property in $\mathcal{H}_K$, which yields $\|f\|_\infty \le \kappa \|f\|_K$ for every $f \in \mathcal{H}_K$ [12]. Equation (4.9) can be expressed as (4.14). By Lemma 7.2 in [12], the positive solution $u^*(R, m, \delta/3)$ to this equation can be bounded as in (4.15). Thus, for $\mathbf{z} \in \mathcal{W}_R \cap Z_{3,\delta} \cap Z_{1,\delta} \cap Z_{2,\delta}$, the desired bound (4.2) holds true. Since the measure of the set $Z_{3,\delta} \cap Z_{1,\delta} \cap Z_{2,\delta}$ is at least $1 - \delta$, our conclusion is proved.
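Both $\mathcal{S}_{2,1}$ and $\mathcal{S}_{2,2}$ above are handled by the one-sided Bernstein inequality, which states that for i.i.d. variables bounded by $M$ with variance $\sigma^2$, with probability at least $1 - \delta$, $E\xi - \frac{1}{m}\sum_{i=1}^m \xi_i \le \frac{2M \log(1/\delta)}{3m} + \sqrt{\frac{2\sigma^2 \log(1/\delta)}{m}}$. A Monte Carlo sketch with a toy variable of ours:

```python
# Monte Carlo illustration of the one-sided Bernstein inequality.
import numpy as np

rng = np.random.default_rng(2)
m, delta, trials = 500, 0.05, 2000
xi = rng.uniform(-1, 1, (trials, m))    # |xi| <= M = 1, E[xi] = 0
sigma2 = 1.0 / 3.0                      # variance of Uniform[-1, 1]
bound = 2 * np.log(1 / delta) / (3 * m) + np.sqrt(2 * sigma2 * np.log(1 / delta) / m)
deviation = 0.0 - xi.mean(axis=1)       # E[xi] - empirical mean, per trial
print((deviation <= bound).mean())      # empirical coverage, well above 1 - delta
```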

Deriving Convergence Rates by Iteration
To apply Proposition 4.1 for the error analysis, we need some $R \ge 1$ with $\mathbf{z} \in \mathcal{W}_R$. One may choose $R = \lambda^{-1/2}$ according to the bound $\|f_{\mathbf{z}}\|_K \le \lambda^{-1/2}$, which is seen by taking $f = 0$ in (1.7). This choice is too rough. Recall from Proposition 3.3 that $\|f^0_\lambda\|_K \le \sqrt{C_0}\, \lambda^{(2r-1)/(2(1-r))}$, which is a bound for the noise-free limit $f^0_\lambda$ of $f_{\mathbf{z}}$. It is much better than $\lambda^{-1/2}$. This motivates us to try similarly tight bounds for $f_{\mathbf{z}}$. This target will be achieved in this section by applying Proposition 4.1 iteratively. The iteration technique has been used in [13, 21] to improve learning rates.

Lemma 5.1. Assume (2.3) with $0 < r \le 1/2$ and (2.5) with $s > 0$. Take $\lambda = m^{-\alpha}$ with $0 < \alpha \le 1$ and $\epsilon = m^{-\beta}$ with $0 < \beta \le \infty$. Let $0 < \eta < 1$. If $\rho$ has a $\tau$-quantile of $p$-average type $q$ for some $p \in (0, \infty]$ and $q \in [1, \infty)$, then for any $0 < \delta < 1$, with confidence $1 - \delta$, there holds the bound (5.2) on $\|f_{\mathbf{z}}\|_K$, where $\theta_\eta$ is given by (5.3).

Proof. Putting $\lambda = m^{-\alpha}$ with $0 < \alpha \le 1$ and $\epsilon = m^{-\beta}$ with $0 < \beta \le \infty$ into Proposition 4.1, we know that for any $R \ge 1$ there exists a subset $V_R$ of $Z^m$ with measure at most $\delta$ such that (5.6) holds for every $\mathbf{z} \in \mathcal{W}_R \setminus V_R$.

Let us apply (5.6) iteratively to a sequence $\{R_j\}_{j=0}^J$ defined by $R_0 = \lambda^{-1/2}$ and $R_j = a_m R_{j-1}^{s/(2+2s)} + b_m$, where $J \in \mathbb{N}$ will be determined later. Then $Z^m \setminus \bigcup_{j=0}^{J-1} V_{R_j} \subseteq \mathcal{W}_{R_J}$. As the measure of $V_{R_j}$ is at most $\delta$, we know that the measure of $\bigcup_{j=0}^{J-1} V_{R_j}$ is at most $J\delta$. Hence $\mathcal{W}_{R_J}$ has measure at least $1 - J\delta$.
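The effect of the iteration is easy to see numerically: the map $R \mapsto a_m R^{\Delta} + b_m$ with $\Delta = s/(2+2s) < 1/2$ is a contraction for large $R$, so starting from the rough bound $R_0 = \lambda^{-1/2}$ the sequence collapses to a much smaller fixed point within a few steps. A sketch with made-up constants $a_m$, $b_m$ standing in for those in (5.6):

```python
# Iterating R_j = a_m * R_{j-1}^Delta + b_m from the rough bound R_0 = lambda^{-1/2}.
lam, s = 1e-6, 1.0
a_m, b_m = 5.0, 2.0        # hypothetical constants standing in for those in (5.6)
Delta = s / (2 + 2 * s)    # here Delta = 1/4 < 1/2

R = lam ** (-0.5)          # R_0 = 1000
for j in range(1, 7):
    R = a_m * R ** Delta + b_m
    print(j, round(R, 3))  # settles near the fixed point R = a_m R^Delta + b_m
```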
Denote $\Delta = s/(2+2s) < 1/2$. The definition of the sequence $\{R_j\}_{j=0}^J$ tells us that $R_J$ is bounded by a sum of two terms, one involving $R_0^{\Delta^J} = \lambda^{-\Delta^J/2}$ and one involving the constants $a_m$ and $b_m$ from (5.6); see (5.10). Let us bound these two terms.
Take $J$ to be the smallest integer greater than or equal to $\log(1/\eta)/\log 2$. The first term can then be bounded by $C_3\, m^{(\alpha(2+s-\theta)-1)(1+s)/((2+s-\theta)(2+s)) + \eta}$. The second term equals the quantity in (5.14).
Then our conclusion follows by replacing $\delta$ by $\delta/J$ and noting $J \le 2\log(3/\eta)$.

Now we can prove our main result, Theorem 2.5.
Proof of Theorem 2.5. Take $R$ to be the right-hand side of (5.2). By Lemma 5.1, there exists a subset $V_R$ of $Z^m$ with measure at most $\delta$ such that $Z^m \setminus V_R \subseteq \mathcal{W}_R$. Applying Proposition 4.1 to this $R$, we know that there exists another subset $V'_R$ of $Z^m$ with measure at most $\delta$ such that the bound (5.16) holds for any $\mathbf{z} \in \mathcal{W}_R \setminus V'_R$.
Since the set $V_R \cup V'_R$ has measure at most $2\delta$, after scaling $2\delta$ to $\delta$ and setting the constant
\[
\widetilde{C} = 2 C_{q,\rho} \left( 2 C_1 + C_2 + C_4 \right), \tag{5.17}
\]
we see that the above estimate together with Lemma 3.1 gives the error bound $\|\pi(f_{\mathbf{z}}) - f_{\rho,\tau}\|_{L^{p^*}_{\rho_X}} \le \widetilde{C}\, m^{-\vartheta}$ with confidence $1 - \delta$ and the power index $\vartheta$ given by (2.9). This completes the proof of Theorem 2.5.