Abstract and Applied Analysis, Volume 2013, Article ID 927827, doi:10.1155/2013/927827. Hindawi Publishing Corporation. ISSN 1687-0409 (online), 1085-3375 (print). Research Article. Regularized Ranking with Convex Losses and ℓ¹-Penalty. Heng Chen and Jitao Wu, Department of Mathematics, Beijing University of Aeronautics and Astronautics, Beijing 100191, China. Received 26 September 2013; Accepted 13 November 2013; Published 5 December 2013. Academic Editor: Yiming Ying. Copyright © 2013 Heng Chen and Jitao Wu. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In the ranking problem, one has to compare two different observations and decide the ordering between them. Ranking has received increasing attention in both the statistical and the machine learning literature. This paper considers ℓ¹-regularized ranking rules with convex loss. Under some mild conditions, a learning rate is established.

1. Introduction

In the ranking problem, one has to compare two different observations and decide the ordering between them. Ranking has become an active field for researchers in the machine learning community and has received increasing attention in both the statistical and the machine learning literature.

The problem of ranking may be modeled in the framework of statistical learning (see [1, 2]). Let (X,Y) be a pair of random variables taking values in 𝒳×ℝ. The random observation X models some object and Y denotes its real-valued label. Let (X′,Y′) denote a pair of random variables identically distributed with (X,Y) (with respect to the probability ℙ) and independent of it. In the ranking problem one observes X and X′ but not their labels Y and Y′. We say that X is “better” than X′ if Y>Y′. We are to construct a measurable function f:𝒳×𝒳→ℝ, called a ranking rule, which predicts the ordering between objects in the following way: if f(X,X′)≥0, we predict that X is better than X′. A ranking rule f has the property f(x,x′)=−f(x′,x). The performance of a ranking rule f is measured by the ranking error
(1) L(f) = ℙ(sign(Y−Y′)f(X,X′)<0),
that is, the probability that f ranks two randomly drawn instances incorrectly. It is easily seen that L(f) attains its minimum L*, over the class of all measurable functions, at the ranking rule
(2) f*(x,x′) := sgn(2η(x,x′)−1) with η(x,x′) = ℙ(Y>Y′ | X=x, X′=x′).
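On a finite toy distribution, the ranking error (1) and the Bayes rule (2) can be evaluated exactly. The following is a minimal Python sketch; the discrete distribution and all helper names are hypothetical, chosen only to illustrate the definitions:

```python
import itertools, math

# Toy discrete distribution: X uniform on {0, 1, 2}; labels are
# deterministic here, y(x) = x, so eta in (2) is either 0 or 1.
xs = [0, 1, 2]
y = {0: 0.0, 1: 1.0, 2: 2.0}

def ranking_error(f):
    """Ranking error (1): probability that f mis-ranks an independent pair."""
    bad = 0.0
    for x, xp in itertools.product(xs, repeat=2):
        s = math.copysign(1.0, y[x] - y[xp]) if y[x] != y[xp] else 0.0
        if s * f(x, xp) < 0:
            bad += (1.0 / len(xs)) ** 2
    return bad

# Bayes rule (2): f*(x,x') = sgn(2*eta(x,x') - 1); here eta = 1{y(x) > y(x')}.
f_star = lambda x, xp: 1.0 if y[x] > y[xp] else (-1.0 if y[x] < y[xp] else 0.0)
f_flip = lambda x, xp: -f_star(x, xp)   # reverses every ordering

print(ranking_error(f_star))  # 0.0: the Bayes rule attains L* = 0 here
print(ranking_error(f_flip))  # 6/9: every unequal pair is mis-ranked
```

Since the labels are noise-free, L* = 0 in this example; with noisy labels, L* would be the mass of pairs on which even f* errs.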

In practice, the best rule f* is unknown since the probability ℙ is unknown. A widely used approach for approximating L* is empirical risk minimization with a convex loss.

Definition 1.

One says that ϕ:ℝ→[0,∞) is a ranking loss (function) if it is convex, differentiable at 0 with ϕ′(0)<0, and the smallest zero of ϕ is 1.

Examples of ranking losses include the least squares loss ϕ(t)=(1−t)² and the q-norm SVM loss ϕq(t)=((1−t)+)^q, where q≥1 and t+=max{0,t} for t∈ℝ.
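The two example losses are easy to write down directly. A quick Python check (helper names hypothetical) confirms the defining properties of Definition 1: ϕ′(0)<0 and smallest zero at t=1:

```python
def phi_ls(t):
    """Least squares ranking loss: phi(t) = (1 - t)^2."""
    return (1.0 - t) ** 2

def phi_q(t, q=2.0):
    """q-norm SVM loss: phi_q(t) = max(0, 1 - t)^q, q >= 1."""
    return max(0.0, 1.0 - t) ** q

# phi'(0) < 0 via a central difference; the smallest zero is t = 1.
h = 1e-6
dphi0 = (phi_ls(h) - phi_ls(-h)) / (2 * h)
print(dphi0)                           # approximately -2
print(phi_ls(1.0), phi_q(1.0, 3.0))    # both 0 at t = 1
print(phi_q(2.0))                      # 0 beyond t = 1 as well
```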

The risk of a measurable function f is defined as Q(f)=𝔼[ϕ(sign(Y−Y′)f(X,X′))]. Denote by fϕ a minimizer of Q(f) over the set of all measurable and antisymmetric functions. For example, as in the classification case (see [3, 4]), fϕ1=f*, and, for q>1,
(3) fϕq(x,x′) = [(1+f*(x,x′))^{1/(q−1)} − (1−f*(x,x′))^{1/(q−1)}] / [(1+f*(x,x′))^{1/(q−1)} + (1−f*(x,x′))^{1/(q−1)}], x,x′∈𝒳.

The following inequality holds for any f:
(4) L(f)−L* ≤ Q(f)−Q* if ϕ(t)=(1−t)+, and L(f)−L* ≤ c√(Q(f)−Q*) if ϕ″(0)>0,
where Q*=Q(fϕ) and c is some constant.

Before proceeding further, we introduce the notion of a Reproducing Kernel Hilbert Space (RKHS). Recall that a continuous function K(σ,σ′) is a Mercer kernel on a set Σ if K(σ,σ′)=K(σ′,σ) for all σ,σ′∈Σ and, for any finite set {σ1,…,σn}⊂Σ, the matrix 𝕂=(K(σi,σj))_{i,j=1}^n is positive semidefinite. The RKHS ℋK associated with the Mercer kernel K is the completion of span{Kσ := K(σ,·) : σ∈Σ} with respect to the inner product given by ⟨Kσ,Kσ′⟩K = K(σ,σ′). See [5] and [6, Ch. 4] for details.

For convenience, we assume hereafter that the Mercer kernels K on 𝒳²×𝒳² are symmetric in the sense that
(5) K((u,u′),(x,x′)) = K((u′,u),(x′,x)), ∀u,u′,x,x′∈𝒳.
Examples are Mercer kernels K of either form K(s,t)=k(|s−t|₂) or K(s,t)=k(⟨s,t⟩), s,t∈𝒳², where |·|₂ and ⟨·,·⟩ are the Euclidean norm and inner product, respectively.

Since the best ranking rule f* is antisymmetric, it is reasonable to restrict ourselves to the subspace ℋKas of antisymmetric functions in ℋK; that is,
(6) ℋKas = {f : f∈ℋK, f(x,x′)=−f(x′,x), ∀x,x′∈𝒳}.
For any xi∈𝒳, i=1,…,n, and αij∈ℝ with αij=−αji, i,j=1,…,n, it is easily seen that
(7) f = Σ_{i,j=1}^n αij K_{(xi,xj)} ∈ ℋKas.
Conversely, any antisymmetric function f∈span{K_{(xi,xj)}}_{i,j=1}^n with the above expression must satisfy αij=−αji, i,j=1,…,n, provided det𝕂>0.

For a set of samples z={Z1,…,Zn}, Zi=(xi,yi)∈𝒳×ℝ, let
(8) Ωz(f) = inf{Σ_{i,j=1}^n |αij| : f = Σ_{i,j=1}^n αij K_{(xi,xj)}};
ℋK,z = {Σ_{i,j=1}^n αij K_{(xi,xj)} : αij∈ℝ};
ℋK,zas = {Σ_{i,j=1}^n αij K_{(xi,xj)} : αij=−αji}.

For λ>0, the ℓ¹-penalty regularized ranking rule fz,λ is the solution of the minimization problem
(9) fz,λ = argmin_{f∈ℋK,zas} {Qn(f)+λΩz(f)},
where Qn(f), known as the empirical risk, is given by
(10) Qn(f) = (1/(n(n−1))) Σ_{i≠j} ϕ(sign(yi−yj)f(xi,xj)).
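For illustration, the regularized problem (9)-(10) can be solved approximately on toy data. The sketch below uses the least squares loss, a Gaussian kernel (which satisfies the symmetry condition (5)), and a plain ISTA-style proximal gradient iteration in place of a production solver; every data point and constant is hypothetical:

```python
import math, itertools

# Toy data: x_i in R with noise-free labels y_i = x_i.
X = [0.0, 0.3, 0.7, 1.0]
Y = X[:]
n = len(X)
lam, step, iters = 1e-3, 0.03, 800

def K(s, t):
    """Gaussian kernel on X^2; it satisfies the symmetry condition (5)."""
    return math.exp(-((s[0] - t[0]) ** 2 + (s[1] - t[1]) ** 2))

def f_val(alpha, x, xp):
    """f(x, x') = sum_ij alpha_ij K((x_i, x_j), (x, x'))."""
    return sum(alpha[i][j] * K((X[i], X[j]), (x, xp))
               for i in range(n) for j in range(n))

def Qn(alpha):
    """Empirical risk (10) with phi(t) = (1 - t)^2."""
    tot = 0.0
    for i, j in itertools.permutations(range(n), 2):
        s = math.copysign(1.0, Y[i] - Y[j])
        tot += (1.0 - s * f_val(alpha, X[i], X[j])) ** 2
    return tot / (n * (n - 1))

# ISTA-style proximal gradient: a gradient step on Qn followed by
# soft-thresholding, the proximal map of the penalty lam * sum |alpha_ij|.
alpha = [[0.0] * n for _ in range(n)]
for _ in range(iters):
    grad = [[0.0] * n for _ in range(n)]
    for i, j in itertools.permutations(range(n), 2):
        s = math.copysign(1.0, Y[i] - Y[j])
        r = 2.0 * (s * f_val(alpha, X[i], X[j]) - 1.0) * s / (n * (n - 1))
        for a in range(n):
            for b in range(n):
                grad[a][b] += r * K((X[a], X[b]), (X[i], X[j]))
    for a in range(n):
        for b in range(n):
            v = alpha[a][b] - step * grad[a][b]
            alpha[a][b] = math.copysign(max(abs(v) - step * lam, 0.0), v)

print(Qn(alpha) < 0.9)                 # risk decreased from Qn(0) = 1
print(f_val(alpha, 0.0, 1.0) < 0.0)    # x = 0 is ranked below x' = 1
```

Starting from alpha = 0 (which is antisymmetric), the gradient and the soft-threshold both preserve the antisymmetry αij = −αji of (7), so the iterate stays in the antisymmetric span throughout.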

Associated with any ranking rule f, we construct another ranking rule as follows:
(11) π(f)(x,x′) = sign(f(x,x′)) if |f(x,x′)|>1, and π(f)(x,x′) = f(x,x′) if |f(x,x′)|≤1.
Clearly, π(f) gives the same ranking rule as f, and it satisfies
(12) Q(π(f)) ≤ Q(f), Qn(π(f)) ≤ Qn(f).
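The truncation (11) is simply a clip of the rule's values to [−1,1]; a one-function Python sketch (names hypothetical):

```python
def truncate(f):
    """pi(f) of (11): clip the ranking rule's values to [-1, 1].
    It induces the same ordering as f and never increases the risk (12)."""
    def pf(x, xp):
        v = f(x, xp)
        return max(-1.0, min(1.0, v))   # equals sign(v) whenever |v| > 1
    return pf

g = truncate(lambda x, xp: 3.5 * (x - xp))
print(g(1.0, 0.0), g(0.1, 0.0), g(0.0, 1.0))  # 1.0 0.35 -1.0
```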

Hereafter, we denote gn=π(fz,λ). The goal of this paper is to bound the excess risk Q(gn)−Q*, which, together with (4), upper-bounds the excess ranking error L(gn)−L*. The main result of this paper establishes, under mild conditions, a learning rate for ℓ¹-penalty regularized ranking rules with convex loss.

Classification with convex loss, in particular for q-norm SVMs, has been the subject of many theoretical investigations in recent years. SVMs for ranking with the RKHS norm as regularizer were investigated in [2, 7]. The ℓ¹-penalty has been used in [8, 9] for classification problems in the framework of SVMs. It is well known that ℓ¹-regularization usually leads to solutions with sparse representations (see, e.g., [10]). In this paper, we consider ranking with convex loss and an ℓ¹-penalty.

In [2], an RKHS-norm SVM for ranking was proposed. But it was implemented over a ball BR={f : ‖f‖K≤R} of ℋK, not the whole RKHS ℋK; that is, it solves the minimization problem
(13) fn = argmin_{f∈BR} Qn(f)+λ‖f‖K².
A convergence rate for Q(fn)−inf_{f∈BR}Q(f) was established there for the Gaussian kernel. The approximation error inf_{f∈BR}Q(f)−Q* was not considered. The asymptotic behavior of the same algorithm implemented over the whole RKHS ℋK was investigated in [7]. Moreover, a fast learning rate O(1/n) is obtained there under some conditions.

We would also like to mention the recent paper [14], where the error ℰ(f) of a function f is defined by ℰ(f)=𝔼((Y−Y′−f(X)+f(X′))²). A convergence rate for the minimizer of the regularized empirical error was established there. The author made use of the technique of estimation via integral operators developed in [15].

The rest of the paper is organized as follows. In Section 2, after stating some assumptions, we present the main result, an upper bound for Q(gn)−Q*+λΩz(fz,λ). As usual, it is decomposed into a sum of three terms: the sample error, the hypothesis error, and the approximation error. Sections 3 and 4 are devoted to the estimation of the hypothesis error and the sample error, respectively. A proof of the main result is given in Section 5.

2. Assumptions and Main Results

For the statement of the main results, we need to introduce some notions and make some assumptions.

Denote gf = ϕ(sign(y−y′)f(x,x′)) − ϕ(sign(y−y′)fϕ(x,x′)). The following assumption is a bound for the variance of gf, which has been adopted by many authors.

Assumption 2.

There is a constant α∈[0,1] such that, for any M>0,
(14) 𝔼(gf²) ≤ c1(𝔼(gf))^α, ∀f:𝒳²→[−M,M],
where c1=c1(M) is a constant.

For ϕq(t)=((1−t)+)^q, the assumption is satisfied with
(15) α = 1 if 1<q≤2, and α = 2/q if q>2.

It is known from [16] that if there is some positive constant C such that
(16) ℙ(|2η(X,X′)−1|≤ξ) ≤ Cξ^{α/(1−α)}, ∀ξ>0,
then the assumption is satisfied for ϕ1(t)=(1−t)+.

Suppose hereafter that κ := sup_{s∈𝒳²} √K(s,s) < ∞. We note that, for any Mercer kernel K,
(17) sup_{s∈𝒳²} K(s,s) = sup_{s,s′∈𝒳²} |K(s,s′)|.
We now construct a set of functions which contains ℋK,z and is independent of the samples.

Definition 3.

The Banach space ℬ is defined as the set of functions on 𝒳² of the form
(18) f = Σ_{i=1}^∞ ai K_{si}, {ai}_{i=1}^∞ ∈ ℓ¹, si∈𝒳²,
with the norm
(19) ‖f‖ℬ := inf{Σ_{i=1}^∞ |ai| : f = Σ_{i=1}^∞ ai K_{si}}.

Obviously, ℋK,z ⊂ ℬ for every z. By the definition of κ and (17), one has
(20) ‖Σ_{i=N1}^{N2} ai K_{si}‖K² = Σ_{i,j=N1}^{N2} ai aj K(si,sj) ≤ κ²(Σ_{i=N1}^{N2} |ai|)²,
which implies that the series Σ_{i=1}^∞ ai K_{si} converges in ℋK. Consequently, f∈ℋK and ‖f‖K ≤ κ‖f‖ℬ. The following also holds: ‖f‖C ≤ κ‖f‖ℬ, ∀f∈ℬ, where ‖f‖C = sup_{s∈𝒳²} |f(s)| for f∈C(𝒳²).

Denote ℬas = {f∈ℬ : f(x,x′)=−f(x′,x)}. The approximation error of Q* by Q(f) with f∈ℬas is measured by
(21) D(λ) := inf_{f∈ℬas} {Q(f)−Q*+λ‖f‖ℬ}.

Denote the minimizer
(22) fλ := argmin_{f∈ℬas} {Q(f)+λ‖f‖ℬ}, λ>0.

The next assumption concerns the approximation power of ℬas with respect to fϕ.

Assumption 4.

There are positive constants c2 and β such that
(23) D(λ) = Q(fλ)−Q*+λ‖fλ‖ℬ ≤ c2 λ^β, ∀λ>0.

Recall that fϕ(x,x′)=−fϕ(x′,x). The above assumption is thus not too restrictive.

Assumption 5.

(i) The kernel K satisfies a Lipschitz condition of order γ with 0<γ<1; that is, there exists some c3>0 such that
(24) |K(s,t)−K(s,t′)| ≤ c3|t−t′|₂^γ.
(ii) The ranking loss has increment exponent θ≥1; that is, there exist constants θ≥1 and c4>0 such that
(25) ϕ(t) ≤ c4(1+|t|)^θ, |ϕ±′(t)| ≤ c4(1+|t|)^{θ−1}, ∀t∈ℝ,
where ϕ+′ and ϕ−′ denote the right- and left-sided derivatives of ϕ, respectively.

Assumption 6.

The marginal distribution ρX satisfies condition Lτ with 0<τ<∞; that is, for some c5>0 and every ball B(x,δ):={u∈𝒳 : |u−x|₂<δ}, one has
(26) ρX(B(x,δ)) ≥ c5 δ^τ, ∀x∈𝒳, 0<δ≤1.

The last assumption concerns covering numbers. For a subset 𝒮 of a space with pseudometric ρ and δ>0, the covering number 𝒩(δ,𝒮,ρ) is defined to be the minimal number l such that there exist l balls of radius δ covering 𝒮. When 𝒮 is compact, this number is finite.
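For small finite sets, a covering number can be estimated directly; a greedy Python sketch (hypothetical names, integer grid to keep the arithmetic exact) gives an upper bound on 𝒩(δ,𝒮,ρ), since greedy center selection is not always minimal:

```python
def covering_number_upper(points, delta, dist):
    """Greedy upper bound on the covering number N(delta, S, rho):
    repeatedly pick an uncovered point as a new center and discard
    everything within distance delta of it."""
    centers = []
    uncovered = list(points)
    while uncovered:
        c = uncovered[0]
        centers.append(c)
        uncovered = [p for p in uncovered if dist(p, c) >= delta]
    return len(centers)

# S = the integer grid {0, 1, ..., 100} standing in for a compact metric space.
S = list(range(101))
print(covering_number_upper(S, 10, lambda a, b: abs(a - b)))    # 11
print(covering_number_upper(S, 101, lambda a, b: abs(a - b)))   # 1
```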

Assumption 7.

(i) There are constants α>0 and c6>0 such that
(27) 𝒩(δ,𝒳,|·|₂) ≤ c6(1/δ)^α, ∀δ>0.
(ii) For R>0, let BR={f∈ℬas : ‖f‖ℬ≤R}. There are constants s∈(0,1) and c7>0 such that
(28) log𝒩(δ,B1,‖·‖C) ≤ c7 δ^{−s}, ∀δ>0.

It was shown in [17], under Assumptions 5(i), 6, and 7(i), that the following holds:
(29) log𝒩(δ,B1,‖·‖C) ≤ c6(4c3/δ)^{α/γ} log(2+4κ/δ), 0<δ<1.
Therefore, (ii) in Assumption 7 holds (with any s∈(α/γ,1)) provided that α/γ<1.

We are in a position to state the main result of this paper. The proof is given in Section 5.

Theorem 8.

For any ε∈(0,β), under Assumptions 2–7 and with λ=n^{−μ}, one has, with confidence at least 1−Cε e^{−t},
(30) Q(gn)−Q*+λΩz(fz,λ) ≤ C(max{(log n+t)^{γ/τ}, t})^{(2−α+s)/(2−α)} n^{−(β−ε)μ},
where μ = min{γ/(τ(β+(1−β)θ)), 1/(β+(1−β)θ), 1/((2−α)β+s)} and Cε, C are constants independent of t and n.

The first step of the proof is to decompose Q(gn)−Q*+λΩz(fz,λ) into errors of different types as follows:
(31) Q(gn)−Q*+λΩz(fz,λ) = S(z,λ)+P(z,λ)+D(λ),
where
(32) S(z,λ) = {Q(gn)−Qn(gn)}+{Qn(fλ)−Q(fλ)},
referred to as the sample error, and
(33) P(z,λ) = {Qn(gn)+λΩz(fz,λ)}−{Qn(fλ)+λ‖fλ‖ℬ},
referred to as the hypothesis error. We bound the hypothesis error and the sample error in the next two sections, respectively.

In the estimation of the sample error, Hoeffding's decomposition of a U-statistic, which breaks the U-statistic into a sum of i.i.d. random variables and a degenerate U-statistic (see Section 4 for details), is a useful tool.

3. Hypothesis Error

In this section, we bound the hypothesis error P(z,λ). This error arises because we switch from the minimizer fz,λ of Qn(f)+λΩz(f) in ℋK,zas to the minimizer fλ of Q(f)+λ‖f‖ℬ in ℬas. Errors of this kind have been estimated in several papers, for example, [7, 18]. We note that, differently from [18, 19], the underlying spaces ℋK,zas and ℬas here are sets of antisymmetric functions. We begin with representations of such functions.

Lemma 9.

Let f∈ℬas. For any η>0, one has a representation
(34) f = (1/2) Σ_{i=1}^∞ ai(K_{(xi,ui)}−K_{(ui,xi)}), xi,ui∈𝒳, with Σ_{i=1}^∞ |ai| ≤ ‖f‖ℬ+η.

Proof.

For any η>0, there are sequences {si}_{i=1}^∞ ⊂ 𝒳² and {ai}_{i=1}^∞ ∈ ℓ¹ such that f = Σ_{i=1}^∞ ai K_{si} and Σ_{i=1}^∞ |ai| ≤ ‖f‖ℬ+η.

Write si=(xi,ui), xi,ui∈𝒳, i=1,2,…. It follows from (5) that
(35) f(x,x′) = Σ_{i=1}^∞ ai K((ui,xi),(x′,x)).
The proof is completed by f(x,x′) = (1/2)(f(x,x′)−f(x′,x)).

A set {xi}_{i=1}^n ⊂ 𝒳 is said to be Δ-dense in 𝒳 if, for any x∈𝒳, there exists some 1≤j≤n such that |x−xj|₂<Δ.
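The Δ-density of a sample is straightforward to check numerically on a discretization of 𝒳; a sketch (names hypothetical):

```python
def is_delta_dense(sample, space, delta, dist):
    """True if every point of `space` lies within distance < delta of `sample`."""
    return all(any(dist(x, s) < delta for s in sample) for x in space)

grid = [i / 10.0 for i in range(11)]          # a stand-in for a compact X
d = lambda a, b: abs(a - b)
print(is_delta_dense([0.0, 0.5, 1.0], grid, 0.3, d))   # True
print(is_delta_dense([0.0, 1.0], grid, 0.3, d))        # False: 0.4 is far
```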

Proposition 10 ([<xref ref-type="bibr" rid="B21">19</xref>], Proposition 3.1).

Let {Xj}_{j=1}^n be drawn independently according to ρX. Then, for any t>1, under Assumption 6 and (i) of Assumption 7, with confidence at least 1−e^{−t}, {Xj}_{j=1}^n is cα,τ((log n+t)/n)^{1/τ}-dense in 𝒳, where cα,τ is a constant depending only on α and τ.

The hypothesis error P(z,λ) is bounded by the following proposition.

Proposition 11.

Assume Assumptions 5 and 6. Then, for any t>1, with confidence at least 1−e^{−t}, there holds
(36) P(z,λ) ≤ C‖fλ‖ℬ((log n+t)/n)^{γ/τ} × (1+‖fλ‖ℬ+‖fλ‖ℬ((log n+t)/n)^{γ/τ})^{θ−1},
where C is a constant independent of z, λ, n, and t. (Hereafter, C and C′ denote constants independent of R, t, n, and λ, which may change from line to line.)

Proof.

The proof follows the lines of [19, 20]. For any η>0, let fλ = Σ_{i=1}^∞ bi(K_{(xi,ui)}−K_{(ui,xi)}) with Σ_{i=1}^∞ |bi| ≤ (‖fλ‖ℬ+η)/2 be the representation given by Lemma 9.

By Proposition 10, with confidence at least 1−e^{−t}, for every i=1,2,…, there are some Xji, Xki ∈ {Xl}_{l=1}^n such that max{|xi−Xji|₂, |ui−Xki|₂} ≤ cα,τ((log n+t)/n)^{1/τ}. For an integer N, which will be determined later, denote f̃ = Σ_{i=1}^N bi(K_{(Xji,Xki)}−K_{(Xki,Xji)}) ∈ ℋK,zas. Then, by Assumption 5(i),
(37) ‖f̃ − Σ_{i=1}^N bi(K_{(xi,ui)}−K_{(ui,xi)})‖C ≤ C′(‖fλ‖ℬ+η)((log n+t)/n)^{γ/τ},
where C′ = c3 cα,τ^γ.

Choose N such that Σ_{j=N+1}^∞ |bj| < η/2. Therefore
(38) ‖Σ_{i=1}^N bi(K_{(xi,ui)}−K_{(ui,xi)}) − fλ‖C ≤ 2κ Σ_{j=N+1}^∞ |bj| ≤ κη.
Consequently, ‖f̃−fλ‖C ≤ Dη := C′(‖fλ‖ℬ+η)((log n+t)/n)^{γ/τ}+κη, which together with
(39) |Qn(f1)−Qn(f2)| ≤ c4(1+max{‖f1‖C,‖f2‖C})^{θ−1}‖f1−f2‖C
yields, with confidence at least 1−e^{−t},
(40) Qn(f̃) ≤ Qn(fλ)+c4(1+‖fλ‖C+Dη)^{θ−1}Dη.
On the other hand, since f̃∈ℋK,zas, the following holds by (12) and (9), with confidence at least 1−e^{−t}:
(41) Qn(gn)+λΩz(fz,λ) ≤ Qn(fz,λ)+λΩz(fz,λ) ≤ Qn(f̃)+2λΣ_{j=1}^N |bj| ≤ Qn(f̃)+λ(‖fλ‖ℬ+η),
which together with (40) completes the proof upon letting η→0.

The above bound for P(z,λ) is the same as in [19], both making use of the same density of {Xi}_{i=1}^n in 𝒳. However, the functions considered there are defined on 𝒳, instead of 𝒳² as here.

4. Sample Error

As in [2, 7], Hoeffding's decomposition plays an important role in the estimation of the sample error. For any f and z=(x,y), z′=(x′,y′), denote
(42) φf(z,z′) = ϕ(sign(y−y′)f(x,x′)).
By Hoeffding's decomposition of U-statistics, we have
(43) Q(f−fϕ)−Qn(f−fϕ) = 2(Q(f−fϕ)−Pn(Pφf−Pφfϕ)) − Un(hf−hfϕ),
where hf(z,z′) = φf(z,z′)−Pφf(z)−Pφf(z′)+Q(f), Q(f−fϕ) is shorthand for Q(f)−Q(fϕ), and, for any g(z,z′),
(44) Pg(z) = 𝔼(g(Z,Z′) | Z=z), Pn(g) = (1/n)Σ_{i=1}^n g(Zi), Un(g) = (1/(n(n−1)))Σ_{i≠j} g(Zi,Zj).
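Hoeffding's decomposition is an exact algebraic identity once Pg and 𝔼g are computable. The Python sketch below verifies it for a generic symmetric kernel g standing in for φf, on a hypothetical discrete distribution where the conditional means are available in closed form:

```python
import itertools, random

support = [0, 1, 2]                      # Z uniform on a small set
g = lambda z, zp: (z - zp) ** 2          # a symmetric toy kernel

Pg = lambda z: sum(g(z, zp) for zp in support) / len(support)   # E[g(z, Z')]
Eg = sum(Pg(z) for z in support) / len(support)                 # E[g(Z, Z')]

def U_n(k, sample):
    """U-statistic of a kernel k, as in (44)."""
    n = len(sample)
    return sum(k(sample[i], sample[j])
               for i, j in itertools.permutations(range(n), 2)) / (n * (n - 1))

# Degenerate part: h(z, z') = g(z, z') - Pg(z) - Pg(z') + Eg.
h = lambda z, zp: g(z, zp) - Pg(z) - Pg(zp) + Eg

random.seed(0)
sample = [random.choice(support) for _ in range(30)]
lhs = U_n(g, sample)
rhs = Eg + 2 * sum(Pg(z) - Eg for z in sample) / len(sample) + U_n(h, sample)
print(abs(lhs - rhs) < 1e-9)             # the decomposition is exact
# Degeneracy: E[h(z, Z') | z] = 0 for every z.
print(all(abs(sum(h(z, zp) for zp in support)) < 1e-12 for z in support))
```

The i.i.d. part 2·Pn(Pg−𝔼g) carries the fluctuations of a sample mean, while the degenerate remainder Un(h) is of smaller order, which is what the estimates below exploit.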

Moreover, for any ranking rule f, we denote gf(z) = Pφf(z)−Pφfϕ(z). Then
(45) Q(f−fϕ)−Pn(Pφf−Pφfϕ) = 𝔼(gf)−Pn(gf).
This is the deviation of a sum of independent random variables from its mean.

As seen in Hoeffding's decomposition (43), the first term is a sum of i.i.d. variables, and the second term Un(hf−hfϕ) is a degenerate U-statistic. The degeneracy means that 𝔼(hf(Z,Z′)−hfϕ(Z,Z′) | Z′) = 0 almost surely.

Denote
(46) S1 = Q(gn)−Q(fϕ)−(Qn(gn)−Qn(fϕ)) = Q(gn−fϕ)−Qn(gn−fϕ),
S2 = Qn(fλ)−Qn(fϕ)−(Q(fλ)−Q(fϕ)) = Qn(fλ−fϕ)−Q(fλ−fϕ),
so that the sample error satisfies Q(gn)−Qn(gn)+Qn(fλ)−Q(fλ) = S1+S2.

We first estimate S1. Since gn depends on z, by (43) and (45) we need to consider the suprema of the sets
(47) {𝔼(gπ(f))−Pn(gπ(f)) : f∈ℱ}, {|Un(hπ(f)−hfϕ)| : f∈ℱ},
where ℱ, containing gn, is (a subset of) a ball in ℬas.

With the above decomposition and the methods in [21], the following proposition was established in [7] for the ball {f∈ℋKas : ‖f‖K≤R} and ϕ(t)=(1−t)+. Assumption 2 and the condition on the covering number of {f∈ℋKas : ‖f‖K≤R} play a crucial role there. The arguments also work in the present setting, that is, for the ball BR={f∈ℬas : ‖f‖ℬ≤R}. To see this, we note that ϕ satisfies |ϕ(t)−ϕ(t′)| ≤ C|t−t′| for t,t′∈[−1,1]. Therefore, under (ii) of Assumption 5 and (ii) of Assumption 7, the covering number 𝒩(δ,𝒢,‖·‖C) of 𝒢={gπ(f) : f∈BR} satisfies log𝒩(δ,𝒢,‖·‖C) ≤ C(R/δ)^s. The interested reader may refer to [7, 21] for the details.

Proposition 12.

Let R>0 and t>0. Under Assumption 2, (ii) of Assumption 5, and (ii) of Assumption 7, one has, with confidence at least 1−e^{−t},
(48) 𝔼(gπ(f))−Pn(gπ(f)) ≤ δ0+δ0^{1−α/2}(𝔼(gπ(f)))^{α/2}, ∀f∈BR,
where δ0 is bounded by
(49) δ0 ≤ C((t/n)^{1/(2−α)}+(R^s/n)^{1/(2−α+s)}),
with C a constant independent of R, t, and n.

The estimation of the supremum sup_{f∈BR} |Un(hπ(f)−hfϕ)| of the U-process is more involved. The suprema of U-processes have been studied in a few papers. The following lemma follows from the proof of [22, Theorem 3.2] (with m=2).

Lemma 13.

Suppose that a function class ℱ satisfies the following conditions:

(i) for any f∈ℱ, f(z,z′)=f(z′,z) and 𝔼(f(Z,Z′) | Z′)=0;

(ii) ℱ is uniformly bounded by a universal constant C0;

(iii) Cℱ := ∫₀^∞ log𝒩(δ,ℱ,‖·‖C) dδ < ∞.

Then
(50) 𝔼 exp(λΓn) ≤ C exp(C′λ²Cℱ), ∀λ>0,
where C, C′ are constants independent of ℱ, and
(51) Γn = sup_{f∈ℱ} |(n−1)Un(f)|.

Proposition 14.

Suppose that (ii) of Assumption 7 holds. Then one has, with confidence at least 1−Ce^{−t},
(52) sup_{f∈BR} |Un(hπ(f)−hfϕ)| ≤ C′(R^s+1)t/n,
where ℱR := {hπ(f) : f∈BR} and C, C′ are some positive constants.

Proof.

We first claim that ℱR satisfies the conditions of Lemma 13. Indeed, condition (i) holds by the definition of hπ(f) and f(x,x′)=−f(x′,x) for any f∈BR. Also, by the definition of hf, one has ‖hπ(f)‖C ≤ 4ϕ(−1) for any f∈BR, implying (ii). Moreover, by (ii) of Assumption 5,
(53) ‖hπ(f1)−hπ(f2)‖C ≤ C‖π(f1)−π(f2)‖C ≤ C‖f1−f2‖C,
so we have
(54) log𝒩(δ,ℱR,‖·‖C) ≤ log𝒩(δ/C,BR,‖·‖C) ≤ c7(CR)^s δ^{−s},
by (ii) of Assumption 7. This establishes (iii), as claimed.

Applying Markov's inequality to exp(λΓn) with λ = t/(2C′R^s) and appealing to Lemma 13, we have ℙ(Γn>t) ≤ Ce^{−t/(4C′R^s)} for all t>0; that is,
(55) ℙ(Γn > 4C′R^s t) ≤ Ce^{−t}, ∀t>0.
For the single function hfϕ, it is known from [23, Proposition 2.3] that
(56) ℙ(|Un(hfϕ)| ≥ Ct n^{−1}) ≤ C′e^{−t},
which together with (55) completes the proof.

The estimation of S2 is easier since fλ does not change with the sample set z.

Proposition 15.

Assume Assumption 2. For any t>0, one has, with confidence at least 1−Ce^{−t},
(57) S2 ≤ C′t((1+κ‖fλ‖ℬ)^θ/(3n)+(1/n)^{1/(2−α)})+D(λ),
where C, C′ are constants independent of λ, t, and n.

Proof.

Clearly, the function gfλ(Z) satisfies ‖gfλ‖C ≤ C(1+‖fλ‖C)^θ. Then, by Assumption 2, we conclude, as in [20, 21], with confidence at least 1−e^{−t}, that
(58) Pn(gfλ)−(Q(fλ)−Q*) ≤ C′t((1+‖fλ‖C)^θ/(3n)+(1/n)^{1/(2−α)})+D(λ).

It remains to estimate Un(hfλ−hfϕ). For any single function g with g(z,z′)=g(z′,z) and 𝔼[g(Z,Z′) | Z′]=0 almost surely, we have, by [23, Proposition 2.3],
(59) ℙ(|Un(g)| ≥ Ct‖g‖C n^{−1}) ≤ C′e^{−t},
which together with ‖hfλ−hfϕ‖C ≤ C(1+‖fλ‖C)^θ implies, with confidence at least 1−C′e^{−t}, that
(60) |Un(hfλ−hfϕ)| ≤ C(1+‖fλ‖C)^θ t/n,
where C and C′ are constants. The proof is complete.

5. Proof of Theorem <xref ref-type="statement" rid="thm1">8</xref>

Theorem 16.

For R such that Ωz(fz,λ)≤R and any t>1, under Assumptions 2–7, one has, with confidence at least 1−Ce^{−t},
(61) Q(gn)−Q*+λΩz(fz,λ) ≤ C′(((log n+t)/n)^{γ/τ}λ^{(β−1)θ}+(t/n)^{1/(2−α)}+(R^s/n)^{1/(2−α+s)}+(R^s+1)t/n+tλ^{(β−1)θ}/n+λ^β),
where C, C′ are constants independent of R, t, and n.

Proof.

We note that
(62) ‖fλ‖C ≤ κ‖fλ‖ℬ ≤ κD(λ)/λ ≤ Cλ^{β−1}.
Given Ωz(fz,λ)≤R, a combination of Propositions 11, 12, 14, and 15 yields, with confidence at least 1−Ce^{−t},
(63) Q(gn)−Q*+λΩz(fz,λ) ≤ C1((log n+t)/n)^{γ/τ}λ^{(β−1)θ}+δ0+δ0^{1−α/2}{Q(gn)−Q*}^{α/2}+C2(R^s+1)t/n+C3t(λ^{(β−1)θ}/(3n)+(1/n)^{1/(2−α)})+2λ^β,
where Ci, i=1,2,3, are constants and δ0 is bounded by (49).

Putting x = Q(gn)−Q*+λΩz(fz,λ) and ν = α/2 into the implication
(64) x ≤ a x^ν + b, a,b,x>0 ⟹ x ≤ max{(2a)^{1/(1−ν)}, 2b},
we obtain (61) from (63). Therefore, the conditional probability of the event that inequality (61) holds, given the event Ωz(fz,λ)≤R, is at least 1−Ce^{−t}. The proof is complete.

For any R>0, denote by ξR the random event Ωz(fz,λ)≤R. Obviously, ℙ(ξR)=1 for R=ϕ(0)/λ. However, to prove Theorem 8, a smaller R with ℙ(ξR) close to 1 is desired. To this end, we apply an iteration technique for the estimation of Ωz(fz,λ).

Recall that μ is given in Theorem 8. It is easily seen that, for λ=n^{−μ} and R≤n^{1/s},
(65) (1/n)^{γ/τ}λ^{(β−1)θ} ≤ λ^β, (1/n)λ^{(β−1)θ} ≤ λ^β, (1/n)^{1/(2−α)} ≤ λ^β, (R^s+1)/n ≤ C(R^s/n)^{1/(2−α+s)} ≤ Cλ^β(λ^{1−β}R)^{s/(2−α+s)}.
Therefore, by Theorem 16, we have, with conditional probability at least 1−Ce^{−t} given ξR,
(66) Q(gn)−Q*+λΩz(fz,λ) ≤ C max{(log n+t)^{γ/τ}, t} λ^β((λ^{1−β}R)^{s/(2−α+s)}+2).
If λ^{β−1}R^{−1}=O(1), the above inequality becomes
(67) Q(gn)−Q*+λΩz(fz,λ) ≤ C max{(log n+t)^{γ/τ}, t} λ^β(λ^{1−β}R)^{s/(2−α+s)}.
Consequently, given the event ξR with λ^{β−1}R^{−1}=O(1) and R≤n^{1/s}, we have, with confidence at least 1−Ce^{−t},
(68) ‖fz,λ‖ℬ ≤ Ωz(fz,λ) ≤ r(R) := C max{(log n+t)^{γ/τ}, t} × λ^{(β−1)(1−s/(2−α+s))} R^{s/(2−α+s)}.

Let R(0)=ϕ(0)/λ and R(k)=r(R(k−1)), k=1,2,…. By induction, it is easy to show that λ^{β−1}(R(k))^{−1}=O(1) and R(k)≤n^{1/s}. Since ℙ(ξ_{R(0)})=1, we have, with confidence at least 1−kCe^{−t}, that Ωz(fz,λ) ≤ R(k), k=1,2,…. Clearly,
(69) R(k) ≤ (C max{(log n+t)^{γ/τ}, t})^{(1−ν^{k+1})/(1−ν)} × λ^{β−1} λ^{−(1/(2−α+s))^{k+1}}, k=1,2,…,
where ν = s/(2−α+s) < 1.
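Numerically, the recursion R(k)=r(R(k−1)) converges geometrically: writing R(k)=λ^{−e_k}, the exponent obeys e_k = νe_{k−1}+(1−β)(1−ν), whose fixed point is 1−β, i.e., R of order λ^{β−1}. A Python sketch with purely illustrative constants (none are from the paper):

```python
# Iterate R <- r(R) with r(R) = A * lam**((beta-1)*(1-nu)) * R**nu, the shape
# of the bound (68); the log/t factor is absorbed into the constant A here.
beta, s, alpha_par, lam, A = 0.5, 0.5, 0.5, 1e-3, 2.0
nu = s / (2 - alpha_par + s)        # nu = s/(2 - alpha + s) < 1
R = 1.0 / lam                       # R(0) = phi(0)/lam with phi(0) = 1
for k in range(50):
    R = A * lam ** ((beta - 1) * (1 - nu)) * R ** nu

limit = lam ** (beta - 1)           # the target order lam**(beta - 1)
print(R / limit)                    # converges to the constant A**(1/(1-nu))
```

The ratio R/λ^{β−1} stabilizing at a constant is exactly the statement λ^{β−1}(R(k))^{−1}=O(1) used in the induction above.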

For any ε>0, let k be the smallest integer such that (1/(2−α+s))^{k+1} < ε. Substituting R=R(k) into (67), we bound the right-hand side of (67) by (C max{(log n+t)^{γ/τ}, t})^{1/(1−ν)} λ^{β−ε}, with confidence at least 1−Cεe^{−t}, where Cε=kC. This completes the proof.

Acknowledgment

The authors thank Professor Di-Rong Chen for his help.

References

[1] S. Clémençon, G. Lugosi, and N. Vayatis, "Ranking and empirical minimization of U-statistics," The Annals of Statistics, vol. 36, no. 2, pp. 844–874, 2008.
[2] W. Rejchel, "On ranking and generalization bounds," Journal of Machine Learning Research, vol. 13, pp. 1373–1392, 2012.
[3] Y. Lin, "Support vector machines and the Bayes rule in classification," Data Mining and Knowledge Discovery, vol. 6, no. 3, pp. 259–275, 2002.
[4] D.-R. Chen, Q. Wu, Y. Ying, and D.-X. Zhou, "Support vector machine soft margin classifiers: error analysis," Journal of Machine Learning Research, vol. 5, pp. 1143–1175, 2004.
[5] N. Aronszajn, "Theory of reproducing kernels," Transactions of the American Mathematical Society, vol. 68, pp. 337–404, 1950.
[6] F. Cucker and D.-X. Zhou, Learning Theory: An Approximation Theory Viewpoint, Cambridge University Press, Cambridge, UK, 2007.
[7] H. Chen and J. T. Wu, "Support vector machine for ranking," submitted.
[8] P. S. Bradley and O. L. Mangasarian, "Feature selection via concave minimization and support vector machines," in J. Shavlik (Ed.), Proceedings of the 15th International Conference on Machine Learning (ICML '98), Morgan Kaufmann, 1998.
[9] M. Song, C. M. Breneman, J. Bi, N. Sukumar, K. P. Bennett, S. Cramer, and N. Tugcu, "Prediction of protein retention times in anion-exchange chromatography systems using support vector regression," Journal of Chemical Information and Computer Sciences, vol. 42, no. 6, pp. 1347–1357, 2002.
[10] D. L. Donoho, "For most large underdetermined systems of linear equations the minimal ℓ¹-norm solution is also the sparsest solution," Communications on Pure and Applied Mathematics, vol. 59, no. 6, pp. 797–829, 2006.
[11] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani, "1-norm support vector machines," Advances in Neural Information Processing Systems, vol. 16, pp. 49–56, 2004.
[12] B. Tarigan and S. A. van de Geer, "Classifiers of support vector machine type with ℓ¹ complexity regularization," Bernoulli, vol. 12, no. 6, pp. 1045–1076, 2006.
[13] I. Daubechies, M. Defrise, and C. De Mol, "An iterative thresholding algorithm for linear inverse problems with a sparsity constraint," Communications on Pure and Applied Mathematics, vol. 57, no. 11, pp. 1413–1457, 2004.
[14] H. Chen, "The convergence rate of a regularized ranking algorithm," Journal of Approximation Theory, vol. 164, no. 12, pp. 1513–1519, 2012.
[15] S. Smale and D.-X. Zhou, "Learning theory estimates via integral operators and their approximations," Constructive Approximation, vol. 26, no. 2, pp. 153–172, 2007.
[16] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, "Convexity, classification, and risk bounds," Journal of the American Statistical Association, vol. 101, no. 473, pp. 138–156, 2006.
[17] I. Steinwart and C. Scovel, "Fast rates for support vector machines using Gaussian kernels," The Annals of Statistics, vol. 35, no. 2, pp. 575–607, 2007.
[18] Q.-W. Xiao and D.-X. Zhou, "Learning by nonsymmetric kernels with data dependent spaces and ℓ¹-regularizer," Taiwanese Journal of Mathematics, vol. 14, no. 5, pp. 1821–1836, 2010.
[19] H. Tong, D.-R. Chen, and F. Yang, "Support vector machines regression with ℓ¹-regularizer," Journal of Approximation Theory, vol. 164, no. 10, pp. 1331–1344, 2012.
[20] H. Tong, D.-R. Chen, and F. Yang, "Learning rates for ℓ¹-regularized kernel classifiers," Journal of Applied Mathematics, vol. 2013, Article ID 496282, 11 pages, 2013.
[21] H. Tong, D.-R. Chen, and L. Peng, "Analysis of support vector machines regression," Foundations of Computational Mathematics, vol. 9, no. 2, pp. 243–257, 2009.
[22] M. A. Arcones and E. Giné, "U-processes indexed by Vapnik-Červonenkis classes of functions with applications to asymptotics and bootstrap of U-statistics with estimated parameters," Stochastic Processes and their Applications, vol. 52, no. 1, pp. 17–38, 1994.
[23] M. A. Arcones and E. Giné, "Limit theorems for U-processes," The Annals of Probability, vol. 21, no. 3, pp. 1494–1542, 1993.