ERM Scheme for Quantile Regression



Introduction
In this paper, we study the empirical risk minimization scheme (ERM) for quantile regression. Let X be a compact metric space (input space) and Y = ℝ. Let ρ be a fixed but unknown probability distribution on Z := X × Y which describes the noise of sampling. Conditional quantile regression aims at producing functions to estimate quantile regression functions. With a prespecified quantile parameter τ ∈ (0, 1), a quantile regression function f_{τ,ρ} is defined by taking its value f_{τ,ρ}(x) to be a τ-quantile of ρ(⋅|x), that is, a value t ∈ Y satisfying

ρ((−∞, t] | x) ≥ τ,  ρ([t, ∞) | x) ≥ 1 − τ,  x ∈ X,  (1)

where ρ(⋅|x) is the conditional distribution of ρ at x ∈ X.
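As a small numerical companion to definition (1) (our illustration, not part of the paper), the following sketch checks the two defining inequalities for an empirical τ-quantile of a sample standing in for ρ(⋅|x); the sample and parameter values are hypothetical.

```python
import numpy as np

# Illustrative check of definition (1): for a discrete sample standing in for
# rho(.|x), the empirical tau-quantile t satisfies P((-inf, t]) >= tau and
# P([t, inf)) >= 1 - tau.
rng = np.random.default_rng(0)
y = rng.normal(size=10_000)        # stand-in for the conditional distribution rho(.|x)
tau = 0.3

t = np.quantile(y, tau)
left_mass = np.mean(y <= t)        # empirical analogue of rho((-inf, t] | x)
right_mass = np.mean(y >= t)       # empirical analogue of rho([t, inf) | x)
print(left_mass >= tau, right_mass >= 1 - tau)   # both True
```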
We consider a learning algorithm generated by the ERM scheme associated with the pinball loss and a hypothesis space H. The pinball loss ψ_τ : ℝ → [0, ∞) is defined by

ψ_τ(u) = τu if u ≥ 0,  and  ψ_τ(u) = (τ − 1)u if u < 0.  (2)

The hypothesis space H is a compact subset of C(X), so there exists some M > 0 such that ‖f‖_{C(X)} ≤ M for any f ∈ H. We assume without loss of generality that ‖f‖_{C(X)} ≤ 1 for any f ∈ H. The ERM scheme for quantile regression is defined with a sample z = {(x_i, y_i)}_{i=1}^m ∈ Z^m drawn independently from ρ as follows:

f_z = arg min_{f ∈ H} E_{z,τ}(f),  where E_{z,τ}(f) := (1/m) ∑_{i=1}^m ψ_τ(y_i − f(x_i)).  (3)

A family of kernel based learning algorithms for quantile regression has been widely studied in a large literature [1-4] and the references therein. These algorithms take the form of a regularized scheme in a reproducing kernel Hilbert space H_K (RKHS, see [5] for details) associated with a Mercer kernel K. Given a sample z, the kernel based regularized scheme for quantile regression is defined by

f_{z,λ} = arg min_{f ∈ H_K} { E_{z,τ}(f) + λ‖f‖²_K },  λ > 0.  (4)

In [1, 3, 4], error analysis for general H_K has been carried out. Learning with varying Gaussian kernels was studied in [2]. The ERM scheme (3) is very different from the kernel based regularized scheme (4). The output function f_z produced by the ERM scheme has a uniform bound, under our assumption, ‖f_z‖_{C(X)} ≤ 1. However, we cannot expect this for f_{z,λ}. It is easy to see that ‖f_{z,λ}‖²_K ≤ (1/(λm)) ∑_{i=1}^m |y_i| by choosing f = 0 in (4). It happens often that ‖f_{z,λ}‖_K → ∞ as λ → 0. The lack of a uniform bound for f_{z,λ} has a serious negative impact on the learning rates. So, in the literature on kernel based regularized schemes for quantile regression, the values of the output function f_{z,λ} are always projected onto the interval [−1, 1], and error analysis is conducted for the projected function, not f_{z,λ} itself.
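To make scheme (3) concrete, here is a minimal sketch assuming a finite hypothesis space of functions bounded by 1 in sup-norm; the finite class, the candidate functions, and the synthetic data are illustrative assumptions on our part (the paper only requires H to be compact in C(X)).

```python
import numpy as np

# Minimal sketch of the ERM scheme (3) over a finite hypothesis space H.
def pinball(u, tau):
    # pinball loss (2): tau*u for u >= 0, (tau-1)*u for u < 0
    return np.where(u >= 0, tau * u, (tau - 1.0) * u)

def erm_quantile(H, x, y, tau):
    """Return the hypothesis minimizing the empirical pinball risk over the sample."""
    risks = [pinball(y - f(x), tau).mean() for f in H]
    return H[int(np.argmin(risks))]

# Hypothetical candidates: clipped low-degree polynomials on X = [0, 1].
H = [lambda x, c=c: np.clip(c[0] + c[1] * x + c[2] * x**2, -1.0, 1.0)
     for c in [(-0.5, 1.0, 0.0), (0.0, 0.5, 0.3), (0.2, 0.0, 0.0), (-0.2, 1.2, -0.4)]]

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 500)
y = np.clip(0.5 * x - 0.2 + 0.3 * rng.standard_normal(500), -1.0, 1.0)

f_z = erm_quantile(H, x, y, tau=0.5)
print(pinball(y - f_z(x), 0.5).mean())   # empirical risk of the ERM output
```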
In this paper, we aim at establishing convergence and learning rates for the error ‖f_z − f_{τ,ρ}‖_{L^r_{ρ_X}} in the space L^r_{ρ_X}. Here r > 0 depends on the pair (p, q), which will be determined in Section 2, and ρ_X is the marginal distribution of ρ on X. In the rest of this paper, we assume Y = [−1, 1], which in turn implies that the values of the target function f_{τ,ρ} lie in the same interval. For least squares regression, the excess generalization error of a function coincides with its squared L²_{ρ_X}-distance to the regression function; this identity provides both the space in which convergence is measured and a variance-expectation bound. However, this identity relation and variance-expectation bound fail in the setting of quantile regression. The reason is that the pinball loss lacks strong convexity. If we impose a noise condition on the distribution ρ, called the τ-quantile of p-average type q (see Definition 1), we can recover a similar relation which in turn enables us to obtain a variance-expectation bound, stated in the following and proved by Steinwart and Christmann [1].
Definition 1 (the τ-quantile of p-average type q condition) requires, in particular, that each conditional distribution ρ(⋅|x) has a unique τ-quantile and that the function γ on X defined through ρ(⋅|x) by γ(x) = γ_{ρ(⋅|x)} satisfies an integrability condition with exponent p; this yields the bound (5) used below.

We also need a capacity assumption on the hypothesis space H for our learning rates. In this paper, we measure the capacity by empirical covering numbers.

Definition 2. Let (M, d) be a pseudometric space and S be a subset of M. For every ε > 0, the covering number N(S, ε, d) of S with respect to ε and d is defined as the minimal number of balls of radius ε whose union covers S, that is,

N(S, ε, d) = min{ l ∈ ℕ : S ⊂ ⋃_{j=1}^{l} B(t_j, ε) for some {t_j}_{j=1}^{l} ⊂ M },

where B(t_j, ε) = {t ∈ M : d(t, t_j) ≤ ε}.

For x = (x_1, …, x_m) ∈ X^m and a set F of functions on X, write F|_x = {(f(x_1), …, f(x_m)) : f ∈ F} ⊂ ℝ^m. The empirical covering number of F is N_2(F, ε) = sup_{m ∈ ℕ} sup_{x ∈ X^m} N(F|_x, ε, d_2). Here d_2 is the normalized ℓ²-metric on the Euclidean space ℝ^m given by

d_2(a, b) = ( (1/m) ∑_{i=1}^{m} (a_i − b_i)² )^{1/2},  a = (a_i)_{i=1}^m, b = (b_i)_{i=1}^m ∈ ℝ^m.

Assumption. Assume that the empirical covering number of the hypothesis space H is bounded for some a > 0 and ζ ∈ (0, 2) as

log N_2(H, ε) ≤ a ε^{−ζ},  ∀ε > 0.  (9)

Theorem 4. Assume that ρ satisfies (5) with some p ∈ (0, ∞] and q ∈ [1, ∞). Denote θ = min{2/q, p/(p + 1)}. One further assumes that f_{τ,ρ} is uniquely defined. If f_{τ,ρ} ∈ H and H satisfies (9) with ζ ∈ (0, 2), then for any 0 < δ < 1, with confidence 1 − δ, one has

‖f_z − f_{τ,ρ}‖_{L^r_{ρ_X}} ≤ C̃ m^{−θ̃} log(2/δ),

where the power index θ̃ > 0 is determined explicitly by p, q, and ζ (see Remark 5 for the case q ≤ 2), and C̃ is a constant independent of m and δ.
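As a rough numerical companion to Definition 2 and the capacity assumption (9) (our illustration, not the paper's construction), a greedy cover in the metric d_2 gives an upper bound on N(F|_x, ε, d_2); the random function values below are placeholders for a projected hypothesis set H|_x.

```python
import numpy as np

# Greedy epsilon-cover in the normalized l2-metric d_2 of Definition 2.
# The size of any particular cover upper-bounds the covering number, so this
# yields an upper estimate, not the exact value.
def d2(a, b):
    return np.sqrt(np.mean((a - b) ** 2))

def greedy_cover_size(points, eps):
    """points: array of shape (n, m), the projected set F|x; returns a cover size."""
    centers = []
    for p in points:
        if all(d2(p, c) > eps for c in centers):
            centers.append(p)
    return len(centers)

# Placeholder projected values H|x: 50 hypotheses evaluated at m = 100 points.
vals = np.random.default_rng(2).uniform(-1.0, 1.0, size=(50, 100))
print(greedy_cover_size(vals, eps=0.5))
```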
Remark 5. In the ERM scheme, we can choose H so that f_{τ,ρ} ∈ H, which in turn makes the approximation error described by (23) equal to zero. However, this is impossible for the kernel based regularized scheme because of the appearance of the penalty term λ‖f‖²_K. If q ≤ 2, all conditional distributions around the quantile behave similarly to the uniform distribution. In this case 2/q ≥ 1 > p/(p + 1), so θ = min{2/q, p/(p + 1)} = p/(p + 1) for all p > 0. Hence the power index equals 2(p + 1)/((2 + ζ)p + 4). Furthermore, when p is large enough, the parameter θ tends to 1 and the power index of the above learning rate arbitrarily approaches 2/(2 + ζ); that is, the learning rate power index is arbitrarily close to 2/(2 + ζ), independent of p. In particular, ζ can be arbitrarily small when H is smooth enough. In this case, the power index 2/(2 + ζ) of the learning rates can be arbitrarily close to 1, which is the optimal learning rate for least squares regression.
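For concreteness, the following short computation evaluates the exponents appearing in Remark 5; the parameter values p = 4, q = 2, ζ = 1 are ours, chosen only for illustration.

```python
# Evaluate the exponents in Remark 5 for illustrative values p = 4, q = 2, zeta = 1.
p, q, zeta = 4.0, 2.0, 1.0

theta = min(2.0 / q, p / (p + 1.0))                        # = 0.8
power_index = 2.0 * (p + 1.0) / ((2.0 + zeta) * p + 4.0)   # = 0.625
limit_index = 2.0 / (2.0 + zeta)                           # = 0.666..., the large-p limit

print(theta, power_index, limit_index)
```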
Let us take some examples to demonstrate the above main result.
Example 6. Let H be the unit ball of the Sobolev space H^s with s > 0. Observe that the empirical covering number is bounded above by the uniform covering number defined through Definition 2 with the metric of C(X). Hence we have (see [6, 7])

log N_2(H, ε) ≤ C_s (1/ε)^{d/s},  ∀ε > 0,

where d is the dimension of the input space X and C_s > 0.
Under the same assumptions as in Theorem 4, we get, by replacing ζ by d/s, that for any 0 < δ < 1, with confidence 1 − δ,

‖f_z − f_{τ,ρ}‖_{L^r_{ρ_X}} ≤ C̃ m^{−θ̃} log(2/δ),  (13)

where θ̃ is the power index of Theorem 4 with ζ = d/s and C̃ is a constant independent of m and δ.
We can carry out the same discussion as in Remark 5 for the case q ≤ 2 and p large enough. Therefore the power index of the learning rates is arbitrarily close to 2/(2 + d/s), independent of p. Furthermore, s can be arbitrarily large if the Sobolev space is smooth enough. In this special case, the learning rate power index arbitrarily approaches 1.
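The following short computation (with illustrative values of d and s chosen by us) shows how the limiting power index 2/(2 + d/s) from the Sobolev example approaches 1 as the smoothness s grows.

```python
# Limiting power index for the Sobolev example, 2 / (2 + d/s), for large p.
def sobolev_index(d, s):
    return 2.0 / (2.0 + d / s)

print(sobolev_index(3, 6))     # 0.8
print(sobolev_index(3, 60))    # ~0.976, approaching 1 as s grows
```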

Example 7.
Let H be the unit ball of the reproducing kernel Hilbert space H_K generated by a Gaussian kernel (see [5]). Reference [7] shows that

log N_2(H, ε) ≤ c_{σ,d} (log(1/ε))^{d+1},  ∀ε ∈ (0, 1),  (15)

where c_{σ,d} > 0 depends only on σ and d > 0. Obviously, the right-hand side of (15) is bounded by c_{σ,d}(1/ε)^{d+1}. So from Theorem 4 we can get learning rates with a power index depending on d. If q ≤ 2 and p is large enough, the power index of the learning rates behaves like 2/(d + 3), which is very slow if d is large. However, in most data sets the data are concentrated on a much lower dimensional manifold embedded in the high dimensional space. In this setting, an analysis that replaces d by the intrinsic dimension of the manifold would be of great interest (see [8] and the references therein).

Error Analysis
Define the noise-free error, called the generalization error, associated with the pinball loss ψ_τ as

E_τ(f) = ∫_Z ψ_τ(y − f(x)) dρ.

Then the measurable function f_{τ,ρ} is a minimizer of E_τ. Obviously, f_{τ,ρ}(x) ∈ [−1, 1]. We need the following results from [1] for our error analysis.
Proposition 8. Let ψ_τ be the pinball loss. Assume that ρ satisfies (5) with some p ∈ (0, ∞] and q ∈ [1, ∞). Then, with r = pq/(p + 1), the norm ‖f − f_{τ,ρ}‖_{L^r_{ρ_X}} of every measurable function f : X → [−1, 1] is bounded by a power of its excess generalization error E_τ(f) − E_τ(f_{τ,ρ}).

The above result implies that we can get convergence rates of f_z in the space L^r_{ρ_X} by bounding the excess generalization error E_τ(f_z) − E_τ(f_{τ,ρ}).
To bound E_τ(f_z) − E_τ(f_{τ,ρ}), we need a standard error decomposition procedure [6] and a concentration inequality.

Lemma 9. Let ψ_τ be the pinball loss, f_z be defined by (3), and f_H ∈ H by (22). Then

E_τ(f_z) − E_τ(f_{τ,ρ}) ≤ {[E_τ(f_z) − E_τ(f_{τ,ρ})] − [E_{z,τ}(f_z) − E_{z,τ}(f_{τ,ρ})]} + {[E_{z,τ}(f_H) − E_{z,τ}(f_{τ,ρ})] − [E_τ(f_H) − E_τ(f_{τ,ρ})]} + {E_τ(f_H) − E_τ(f_{τ,ρ})}.

Proof. The excess generalization error can be written as

E_τ(f_z) − E_τ(f_{τ,ρ}) = [E_τ(f_z) − E_{z,τ}(f_z)] + [E_{z,τ}(f_z) − E_{z,τ}(f_H)] + [E_{z,τ}(f_H) − E_τ(f_H)] + [E_τ(f_H) − E_τ(f_{τ,ρ})].

The definition of f_z implies that E_{z,τ}(f_z) − E_{z,τ}(f_H) ≤ 0. Furthermore, by subtracting and adding E_τ(f_{τ,ρ}) and E_{z,τ}(f_{τ,ρ}) in the first term and third term, we see that Lemma 9 holds true.
We call the term (23) the approximation error; it has been studied in [9].
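As a toy numerical check of the decomposition in Lemma 9 (our illustration, with a hypothetical finite class and synthetic data; the true risk is approximated by Monte Carlo), the bound reduces to the inequality E_{z,τ}(f_z) ≤ E_{z,τ}(f_H), which holds by the definition of f_z.

```python
import numpy as np

# Toy check of Lemma 9: compare both sides of the error decomposition.
def pinball(u, tau):
    return np.where(u >= 0, tau * u, (tau - 1.0) * u)

rng = np.random.default_rng(4)
tau = 0.5
def sample(n):
    x = rng.uniform(0.0, 1.0, n)
    y = np.clip(0.5 * x - 0.2 + 0.3 * rng.standard_normal(n), -1.0, 1.0)
    return x, y

H = [lambda x, c=c: np.clip(c[0] + c[1] * x, -1.0, 1.0)
     for c in [(-0.2, 0.5), (0.0, 0.3), (-0.5, 1.0)]]
f_target = lambda x: np.clip(0.5 * x - 0.2, -1.0, 1.0)   # conditional median of this toy model

x_big, y_big = sample(200_000)                 # Monte Carlo proxy for the true risk E_tau
E = lambda f: pinball(y_big - f(x_big), tau).mean()
x, y = sample(300)                             # training sample z
Ez = lambda f: pinball(y - f(x), tau).mean()   # empirical risk E_{z,tau}

f_z = H[int(np.argmin([Ez(f) for f in H]))]    # ERM output over H
f_H = H[int(np.argmin([E(f) for f in H]))]     # best-in-class function

lhs = E(f_z) - E(f_target)
rhs = (E(f_z) - E(f_target) - (Ez(f_z) - Ez(f_target))) \
    + (Ez(f_H) - Ez(f_target) - (E(f_H) - E(f_target))) \
    + (E(f_H) - E(f_target))
print(lhs <= rhs + 1e-12)                      # True: the decomposition bound holds
```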

Concentration Inequality and Sample Error.
Let us recall the one-sided Bernstein inequality as follows. Let ξ be a random variable on Z with mean E(ξ) and variance σ²(ξ), satisfying |ξ − E(ξ)| ≤ M_ξ almost surely. Then for any ε > 0,

Prob{ (1/m) ∑_{i=1}^m ξ(z_i) − E(ξ) ≥ ε } ≤ exp( − mε² / (2(σ²(ξ) + M_ξ ε/3)) ).
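As a quick numerical sanity check (ours, not the paper's), one can compare the empirical tail probability of the sample mean of bounded variables with the one-sided Bernstein bound; the distribution and parameters below are illustrative.

```python
import numpy as np

# Compare the empirical tail probability of a sample mean with the one-sided
# Bernstein bound exp(-m*eps^2 / (2*(sigma^2 + M*eps/3))).
rng = np.random.default_rng(3)
m, eps, trials = 200, 0.1, 20_000
xi = rng.uniform(-1.0, 1.0, size=(trials, m))   # mean 0, variance 1/3, |xi| <= 1
sigma2, M = 1.0 / 3.0, 1.0

empirical = np.mean(xi.mean(axis=1) > eps)
bernstein = np.exp(-m * eps**2 / (2.0 * (sigma2 + M * eps / 3.0)))
print(empirical, bernstein)                      # empirical tail stays below the bound
```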
Let us now estimate the first sample error term in Lemma 9, the one involving the function f_z; this term runs over a set of functions since z is itself a random sample. To estimate it, we use the concentration inequality below involving empirical covering numbers [10-12].

Lemma 12. Let F be a class of measurable functions on Z. Assume that there are constants B, c > 0 and α ∈ [0, 1] such that ‖f‖_∞ ≤ B and E f² ≤ c(E f)^α for every f ∈ F. If (7) holds (that is, log N_2(F, ε) ≤ a ε^{−ζ} for all ε > 0 with some a > 0 and ζ ∈ (0, 2)), then there exists a constant c′_ζ depending only on ζ such that for any t > 0, with probability at least 1 − e^{−t}, there holds

E f − (1/m) ∑_{i=1}^m f(z_i) ≤ (1/2) η^{1−α} (E f)^α + c′_ζ η + 2(ct/m)^{1/(2−α)} + 18Bt/m,  ∀f ∈ F,

where

η := max{ c^{(2−ζ)/(4−2α+αζ)} (a/m)^{2/(4−2α+αζ)}, B^{(2−ζ)/(2+ζ)} (a/m)^{2/(2+ζ)} }.

We apply Lemma 12 to the function set

F_H = { g : Z → ℝ : g(z) = ψ_τ(y − f(x)) − ψ_τ(y − f_{τ,ρ}(x)), f ∈ H },  where z = (x, y) ∈ Z.
The resulting bound on the sample error follows directly from Propositions 11 and 13, together with the fact that f_z ∈ H.

Further Discussions
In this paper, we studied the ERM algorithm (3) for quantile regression and provided convergence and learning rates. We showed some essential differences between the ERM scheme and the kernel based regularized scheme for quantile regression. We also pointed out the main difficulty in dealing with quantile regression: the lack of strong convexity of the pinball loss. To overcome this difficulty, a noise condition on ρ is imposed, which enables us to obtain a variance-expectation bound similar to the one for least squares regression.
In our analysis we only considered f ∈ H with ‖f‖_{C(X)} ≤ 1. The case ‖f‖_{C(X)} ≤ M for M ≥ 1 would be interesting for future work. The approximation error can then be estimated using the theory of interpolation spaces.
In our setting, the sample is drawn independently from the distribution ρ. However, in many practical problems the i.i.d. condition is somewhat demanding, so it would be interesting to investigate the ERM scheme for quantile regression with nonidentical distributions [13, 14] or dependent sampling [15].