Journal of Applied Mathematics, Volume 2012, Article ID 189753, doi:10.1155/2012/189753

Research Article

Approximation Analysis of Gradient Descent Algorithm for Bipartite Ranking

Hong Chen,¹ Fangchao He,²,³ and Zhibin Pan¹

¹ College of Science, Huazhong Agricultural University, Wuhan 430070, China
² School of Science, Hubei University of Technology, Wuhan 430068, China
³ Faculty of Mathematics and Computer Science, Hubei University, Wuhan 430062, China

Academic Editor: Yuesheng Xu

Received 9 March 2012; Revised 12 May 2012; Accepted 26 May 2012; Published 14 July 2012

Copyright © 2012 Hong Chen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We introduce a gradient descent algorithm for bipartite ranking with general convex losses. The implementation of this algorithm is simple, and its generalization performance is investigated. Explicit learning rates are presented in terms of suitable choices of the regularization parameter and the step size. The result fills a theoretical gap in the learning rates for the ranking problem with general convex losses.

1. Introduction

In this paper we consider a gradient descent algorithm for bipartite ranking generated from a Tikhonov regularization scheme with general convex losses in reproducing kernel Hilbert spaces (RKHS).

Let 𝒳 be a compact metric space and 𝒴 = {-1, 1}. In the bipartite ranking problem, the learner is given positive samples S⁺ = {x_i⁺}_{i=1}^m and negative samples S⁻ = {x_j⁻}_{j=1}^n, drawn independently at random from distributions ρ⁺ and ρ⁻, respectively. Given the training set S := (S⁺, S⁻), the goal of bipartite ranking is to learn a real-valued ranking function f : 𝒳 → ℝ that ranks future positive samples higher than negative ones.

The loss incurred by a ranking function f on a pair of instances (x⁺, x⁻) is I{f(x⁺) - f(x⁻) ≤ 0}, where I{t} is 1 if t is true and 0 otherwise. However, due to the nonconvexity of I, empirical minimization based on I is NP-hard. Thus, we consider replacing I by a convex upper bound ϕ(f(x⁺) - f(x⁻)), where ϕ is a convex loss function. Typical choices of ϕ include the hinge loss, the least square loss, and the logistic loss.

The expected convex risk is

(1.1) E(f) = ∫_𝒳 ∫_𝒳 ϕ(f(x⁺) - f(x⁻)) dρ⁺(x⁺) dρ⁻(x⁻).

The corresponding empirical risk is

(1.2) E_S(f) = (1/(mn)) ∑_{i=1}^m ∑_{j=1}^n ϕ(f(x_i⁺) - f(x_j⁻)).
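As a concrete illustration (our own sketch, not part of the paper), the empirical risk (1.2) can be evaluated directly for a toy scorer; the hinge and logistic surrogates below are two of the typical choices of ϕ mentioned above, and all helper names are ours.

```python
import numpy as np

# Illustrative sketch: empirical ranking risk (1.2) for two surrogate losses.

def hinge(u):
    return np.maximum(0.0, 1.0 - u)          # phi(u) = (1 - u)_+

def logistic(u):
    return np.log1p(np.exp(-u))              # phi(u) = log(1 + e^{-u})

def empirical_risk(f, S_plus, S_minus, phi):
    """E_S(f) = (1/mn) * sum_{i,j} phi(f(x_i^+) - f(x_j^-))."""
    gaps = f(S_plus)[:, None] - f(S_minus)[None, :]   # m x n score differences
    return float(np.mean(phi(gaps)))

# Toy 1-D data where the identity scorer ranks all positives above negatives.
S_plus, S_minus = np.array([2.0, 3.0]), np.array([0.0, 1.0])
score = lambda x: x
```

With these data every score gap is at least 1, so the hinge risk vanishes while the logistic risk stays small but positive.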

Let ϑ = {f ∈ 𝔉 : f = arg min_{g∈𝔉} E(g)} be the target function set, where 𝔉 is the space of measurable functions on 𝒳. Observe that the target function need not be unique. In particular, for the least square loss, the regression function is one element of this set.

The ranking algorithm we investigate in this paper is based on a Tikhonov regularization scheme associated with a Mercer kernel. We call a continuous, symmetric, and positive semidefinite function K : 𝒳 × 𝒳 → ℝ a Mercer kernel. The RKHS ℋ_K associated with the kernel K is defined (see [1]) to be the closure of the linear span of the set of functions {K_x := K(x, ·) : x ∈ 𝒳} with the inner product ⟨·, ·⟩_K given by ⟨K_x, K_{x'}⟩_K = K(x, x'). The reproducing property takes the form f(x) = ⟨f, K_x⟩_K for all x ∈ 𝒳 and f ∈ ℋ_K. The reproducing property together with the Schwarz inequality yields |f(x)| ≤ √(K(x, x)) ‖f‖_K. Then ‖f‖_∞ ≤ κ‖f‖_K, where κ := sup_{x∈𝒳} √(K(x, x)).
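The RKHS facts above can be checked numerically on a small example; the Gaussian kernel and the particular centers and coefficients below are our own illustrative assumptions (any Mercer kernel would do).

```python
import numpy as np

# Hedged numerical check of the reproducing-kernel facts above,
# using a Gaussian Mercer kernel (our choice for illustration).

def K(x, y, sigma=1.0):
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))

# A function in the span of {K_x}: f = sum_i c_i K_{x_i}, with
# ||f||_K^2 = c^T G c for the Gram matrix G_{ij} = K(x_i, x_j).
centers = np.array([-1.0, 0.0, 2.0])
c = np.array([0.5, -1.0, 0.3])
G = K(centers[:, None], centers[None, :])
norm_f = float(np.sqrt(c @ G @ c))

def f(x):
    return float(c @ K(centers, x))

# Reproducing property + Schwarz: |f(x)| <= sqrt(K(x,x)) * ||f||_K.
# Here K(x,x) = 1, so kappa = 1 and ||f||_inf <= ||f||_K.
grid = np.linspace(-4.0, 4.0, 161)
bound_ok = all(abs(f(x)) <= np.sqrt(K(x, x)) * norm_f + 1e-12 for x in grid)
```

The flag `bound_ok` confirms the pointwise bound on a grid; the inequality itself holds everywhere by the reproducing property.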

The regularized ranking algorithm is implemented by the offline regularization scheme [2] in ℋ_K:

(1.3) f_{z,λ} = arg min_{f∈ℋ_K} { E_S(f) + λ‖f‖²_{ℋ_K} },

where λ > 0 is the regularization parameter. A data-free limit of (1.3) is

(1.4) f_λ = arg min_{f∈ℋ_K} { E(f) + λ‖f‖²_{ℋ_K} }.

Though the offline algorithm (1.3) has been well understood in [2], it might be practically challenging when the sample size m or n is large. The same difficulty for classification and regression algorithms is overcome by reducing the computational complexity through a stochastic gradient descent method. Such algorithms have been proposed for online regression in [3, 4], online classification in [5, 6], and gradient learning in [7, 8]. In this paper, we use the idea of gradient descent to propose an algorithm for learning a target function in ϑ.

Since ϕ is convex, its left derivative ϕ'_- is well defined and nondecreasing on ℝ. By taking functional derivatives in (1.3), we introduce the following algorithm for ranking.

Definition 1.1.

The stochastic gradient descent ranking algorithm is defined for the sample S by f_1^S = 0 and

(1.5) f_{t+1}^S = (1 - η_t λ) f_t^S - (η_t/(mn)) ∑_{i=1}^m ∑_{j=1}^n ϕ'_-(f_t^S(x_i⁺) - f_t^S(x_j⁻)) (K_{x_i⁺} - K_{x_j⁻}),

where t ∈ ℕ and {η_t} is the sequence of step sizes.
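Since each iterate of (1.5) stays in the span of the kernel sections at the training points, the algorithm can be run by updating a coefficient vector. The following is a minimal sketch under assumptions of ours: a Gaussian kernel, the hinge loss ϕ(u) = (1-u)₊ (whose left derivative is -1 for u ≤ 1 and 0 otherwise), and illustrative step-size constants.

```python
import numpy as np

# Hedged sketch of algorithm (1.5); kernel, loss, and constants are our choices.

def gauss(x, y, sigma=1.0):
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))

def sgd_rank(S_plus, S_minus, T=100, lam=0.05, eta1=0.1, theta=0.5):
    m, n = len(S_plus), len(S_minus)
    X = np.concatenate([S_plus, S_minus])
    G = gauss(X[:, None], X[None, :])        # Gram matrix on training points
    c = np.zeros(m + n)                      # f_1^S = 0, stored as coefficients

    for t in range(1, T + 1):
        eta = eta1 * t ** (-theta)           # step size eta_t = eta_1 * t^{-theta}
        fX = G @ c                           # scores f_t^S at training points
        margins = fX[:m, None] - fX[m:][None, :]
        g = np.where(margins <= 1.0, -1.0, 0.0)   # phi'_- of hinge at the margins
        c *= 1.0 - eta * lam                 # (1 - eta_t * lambda) * f_t^S
        c[:m] -= (eta / (m * n)) * g.sum(axis=1)  # coefficients on K_{x_i^+}
        c[m:] += (eta / (m * n)) * g.sum(axis=0)  # coefficients on K_{x_j^-}

    return lambda x: float(c @ gauss(X, x))  # final iterate as a callable scorer
```

On well-separated toy data the learned scorer ranks most positive points above most negative ones, which is the behavior the generalization analysis below quantifies.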

In fact, Burges et al. [9] investigated gradient descent methods for learning ranking functions and introduced a neural network to model the underlying ranking function. Based on the idea of maximizing the generalized Wilcoxon-Mann-Whitney statistic, a ranking algorithm using gradient approximation was proposed in [10]. However, these approaches differ from ours, and their analysis focuses on computational complexity. Recently, for the least square loss, numerical experiments with a gradient descent algorithm were presented in [11]. The aim of this paper is to provide generalization bounds for the gradient descent ranking algorithm (1.5) with general convex losses. To the best of our knowledge, there is no error analysis in this case; this is why we conduct our study.

We mainly analyze the errors ‖f_t^S - f_λ‖_{ℋ_K} and inf_{f∈ϑ} ‖f_t^S - f‖_{ℋ_K}, which differs from previous error analyses for ranking algorithms based on uniform convergence (e.g., [15, 16]) and stability analysis in [2, 17, 18]. Though the convergence rates in the ℋ_K-norm for classification and regression algorithms have been elegantly investigated in [19, 20], there is no such analysis in the ranking setting. The main difference in the formulation of the ranking problem, as compared to classification and regression, is that the performance or loss in ranking is measured on pairs of examples rather than on individual examples. This means in particular that, unlike the empirical error in classification or regression, the empirical error in ranking cannot be expressed as a sum of independent random variables. This makes the convergence analysis in the ℋ_K-norm difficult and renders previous techniques invalid. Fortunately, a similar difficulty has been overcome for gradient learning in [7, 21, 22] by introducing some novel techniques. In this paper, we develop an elaborate analysis in terms of these techniques.

2. Main Result

In this section we present our main results on learning rates of algorithm (1.5) for learning ranking functions. We assume that ϕ ∈ C¹(ℝ) satisfies

(2.1) |ϕ(u)| ≤ C₀(1 + |u|)^q, |ϕ'(u)| ≤ C₀(1 + |u|)^{q-1}, u ∈ ℝ,

for some C₀ > 0 and q ≥ 1. Denote the constant

(2.2) Δ* = 1 + 4κ²( sup_{0<|u|≤1} |ϕ'_-(u) - ϕ'_-(0)|/|u| + |ϕ'_-(0)| + C₀(8κ²|ϕ'_-(0)|)^{q-1} ).

Theorem 2.1.

Assume ϕ satisfies (2.1), and choose the step size as

(2.3) η_t = η* λ^{max{q-2,0}} t^{-θ} for some 0 < θ < 1 and 0 < η* ≤ 1/Δ*.

For 0 < γ < (1-θ)/max{q-1, 1} and s > 0, take λ = t^{-γ} with (mn/(m+n)^{3/2})^s ≤ t ≤ 2(mn/(m+n)^{3/2})^s. Then, for any 0 < δ < 1, with confidence at least 1-δ, one has

(2.4) ‖f_t^S - f_λ‖²_{ℋ_K} ≤ C (mn/(m+n)^{3/2})^{-α},

where C is a constant independent of m and n, and

(2.5) α = min{ sθ - sγ min{q+1, 2q-1}, 1 - sγ(1+q) }.

Theorem 2.1 will be proved in the next section, where the constant C can be obtained explicitly. The explicit parameters in Theorem 2.1 are described in Table 1 for some special loss functions ϕ. Note that the step sizes and the number of iterations depend on the sample sizes m and n. When m = O(n) and m → ∞, we have t → ∞ and η_t → 0.

Table 1: The values of parameters for different convex losses.

Loss function        | C₀ | q | Δ*       | α
ϕ(t) = t²            | 1  | 2 | 1 + 12κ² | min{(1/2)(θ - 3γ), γβ}
ϕ(t) = log(1 + e⁻ᵗ)  | 2  | 1 | 1 + 11κ² | min{θ - γ, γβ}
ϕ(t) = (1 - t)₊      | 1  | 1 | 1 + 8κ²  | min{(1/2)(θ - 3γ), γβ}

From the results in Theorem 2.1, we know that the balance between the positive and negative sample sizes is crucial to reach fast learning rates. For m = O(n) and the least square loss, the approximation order is O(m^{-(1/2) min{sθ-3sγ, 1-3sγ}}). Moreover, when sθ ≥ 1 and sγ → 0, we have ‖f_t^S - f_λ‖²_{ℋ_K} → 0 with the order O(m^{-1/2}).
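To make the parameter choices of Theorem 2.1 concrete, the following sketch computes the iteration count t ∈ [(mn/(m+n)^{3/2})^s, 2(mn/(m+n)^{3/2})^s], the regularization λ = t^{-γ}, and the step sizes η_t = η* λ^{max{q-2,0}} t^{-θ}. The numeric inputs are illustrative assumptions, and η* must be supplied so that η* ≤ 1/Δ*.

```python
# Hedged helper for the parameter schedule of Theorem 2.1; all numeric
# inputs below are illustrative, and eta_star must satisfy eta_star <= 1/Delta*.

def schedule(m, n, s, gamma, theta, q, eta_star):
    base = m * n / (m + n) ** 1.5
    t = int(base ** s)                      # any integer in [base^s, 2*base^s] is allowed
    lam = t ** (-gamma)                     # lambda = t^{-gamma}
    etas = [eta_star * lam ** max(q - 2, 0) * k ** (-theta) for k in range(1, t + 1)]
    return t, lam, etas

t, lam, etas = schedule(m=400, n=400, s=1.0, gamma=0.1, theta=0.5, q=1, eta_star=0.05)
```

Note how slowly t grows: for m = n = 400 and s = 1 the schedule runs only t = ⌊mn/(m+n)^{3/2}⌋ = 7 iterations, reflecting that the iteration budget is tied to the effective pairwise sample size.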

Now we present estimates of inf_{f∈ϑ} ‖f_t^S - f‖_{ℋ_K} under an approximation condition.

Corollary 2.2.

Assume that there is f* ∈ ϑ such that ‖f_λ - f*‖²_{ℋ_K} ≤ C_β λ^β for some 0 < β < 1. Under the conditions of Theorem 2.1, for any 0 < δ < 1, with confidence at least 1-δ, one has

(2.6) inf_{f∈ϑ} ‖f_t^S - f‖²_{ℋ_K} ≤ C̃ (mn/(m+n)^{3/2})^{-α̃},

where C̃ is a constant independent of m and n, and

(2.7) α̃ = min{ sθ - sγ min{q+1, 2q-1}, 1 - sγ(1+q), sγβ }.

For m = O(n) and the least square loss, setting s = 1/(γ(3+β)) yields the learning rate O(m^{-(1/(γ(6+2β))) min{θ-3γ, γβ}}). Moreover, if β < (θ-3γ)/γ, we get the approximation order O(m^{-β/(6+2β)}).

For the least square loss, the regression function is an optimal predictor in ϑ. The bipartite ranking problem can then be reduced to a regression problem. Based on the theoretical analysis in [19, 20], the approximation condition in Corollary 2.2 can be achieved when the regression function lies in the range of the (β+1)/2 power of the integral operator associated with the kernel K.

The highlight of our theoretical analysis is the estimate of the distance between f_t^S and the target function set ϑ in the RKHS. This differs from previous error analyses, which focus on estimating |E(f) - E_S(f)|. Compared with previous theoretical studies, the approximation analysis in the ℋ_K-norm is new and fills the gap in learning rates for the ranking problem with general convex losses.

We also note that previous error estimates for the ranking problem mainly rely on stability analysis in [2, 17], concentration estimates based on U-statistics in [14], and uniform convergence bounds based on covering numbers [15, 16]. Our analysis presents a novel capacity-independent procedure for investigating the generalization performance of ranking algorithms.

3. Proof of Main Result

We first introduce a special property of E(f) + (λ/2)‖f‖²_{ℋ_K}. Since the proof is the same as that in [23], we omit it here.

Lemma 3.1.

Let λ > 0. For any f ∈ ℋ_K, there holds

(3.1) (λ/2)‖f - f_λ‖²_{ℋ_K} ≤ { E(f) + (λ/2)‖f‖²_{ℋ_K} } - { E(f_λ) + (λ/2)‖f_λ‖²_{ℋ_K} }.

Denote

(3.2) f_t^λ = (1/(mn)) ∑_{i=1}^m ∑_{j=1}^n ϕ'_-(f_t^S(x_i⁺) - f_t^S(x_j⁻))(K_{x_i⁺} - K_{x_j⁻}) + λ f_t^S,

so that f_{t+1}^S = f_t^S - η_t f_t^λ.

Now we give the one-step analysis.

Lemma 3.2.

For t ≥ 1, one has

(3.3) ‖f_{t+1}^S - f_λ‖²_{ℋ_K} ≤ (1 - η_t λ)‖f_t^S - f_λ‖²_{ℋ_K} + η_t²‖f_t^λ‖²_{ℋ_K} + 2η_t φ(S, t),

where φ(S, t) = E_S(f_λ) - E(f_λ) + E(f_t^S) - E_S(f_t^S).

Proof.

Observe that

(3.4) ‖f_{t+1}^S - f_λ‖²_{ℋ_K} = ‖f_t^S - f_λ‖²_{ℋ_K} + η_t²‖f_t^λ‖²_{ℋ_K} + 2η_t ⟨f_λ - f_t^S, f_t^λ⟩_K.

Note that

(3.5) ⟨f_λ - f_t^S, f_t^λ⟩_K = (1/(mn)) ∑_{i=1}^m ∑_{j=1}^n ϕ'_-(f_t^S(x_i⁺) - f_t^S(x_j⁻))(f_λ(x_i⁺) - f_λ(x_j⁻) - f_t^S(x_i⁺) + f_t^S(x_j⁻)) + λ⟨f_λ - f_t^S, f_t^S⟩_K
≤ E_S(f_λ) - E_S(f_t^S) - λ‖f_t^S‖²_{ℋ_K} + λ⟨f_λ, f_t^S⟩_K
≤ { E_S(f_λ) + (λ/2)‖f_λ‖²_{ℋ_K} } - { E_S(f_t^S) + (λ/2)‖f_t^S‖²_{ℋ_K} },

where the first and second inequalities follow from the convexity of ϕ and the Schwarz inequality, respectively.

By Lemma 3.1, we know that

(3.6) { E_S(f_λ) + (λ/2)‖f_λ‖²_{ℋ_K} } - { E_S(f_t^S) + (λ/2)‖f_t^S‖²_{ℋ_K} } ≤ { E_S(f_λ) - E(f_λ) + E(f_t^S) - E_S(f_t^S) } - (λ/2)‖f_t^S - f_λ‖²_{ℋ_K}.

Thus, the desired result follows by combining (3.5) and (3.6) with (3.4).

To handle the sample error iteratively via (3.3), we need to bound the quantity φ(S, t) by the theory of uniform convergence. To this end, a bound on the norm of f_t^S is required.

Definition 3.3.

One says that ϕ'_- is locally Lipschitz at the origin if the local Lipschitz constant

(3.7) M(λ) = sup{ |ϕ'_-(u) - ϕ'_-(0)|/|u| : 0 < |u| ≤ 4κ²|ϕ'_-(0)|/λ }

is finite for any λ > 0.
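For the logistic loss, ϕ'_-(u) = -1/(1 + eᵘ) is smooth, and the ratio in (3.7) is largest near u = 0, where it approaches ϕ''(0) = 1/4. The grid search below (with κ = 1, an assumption of ours) checks this numerically.

```python
import numpy as np

# Hedged numerical estimate of M(lambda) in (3.7) for the logistic loss
# phi(u) = log(1 + e^{-u}), assuming kappa = 1. Here |phi'_-(0)| = 1/2, and
# the ratio |phi'_-(u) - phi'_-(0)|/|u| is symmetric in u, so a positive
# grid suffices; the supremum is approached as u -> 0, tending to 1/4.

def dphi(u):
    return -1.0 / (1.0 + np.exp(u))          # phi'_-(u)

def M(lam, kappa=1.0):
    R = 4 * kappa ** 2 * abs(dphi(0.0)) / lam   # radius 4*kappa^2*|phi'_-(0)|/lam
    u = np.geomspace(1e-8, R, 4000)             # log-spaced positive grid
    ratio = np.abs(dphi(u) - dphi(0.0)) / u
    return float(ratio.max())
```

So for the logistic loss M(λ) is bounded by 1/4 uniformly in λ, which is why the step-size condition of the next lemma is easy to satisfy for this loss.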

Now we estimate the bound of f_t^S, following the ideas given in [5].

Lemma 3.4.

Assume that ϕ'_- is locally Lipschitz at the origin. If the step size satisfies η_t(4κ²M(λ) + λ) ≤ 1 for each t, then ‖f_t^S‖_{ℋ_K} ≤ 2κ|ϕ'_-(0)|/λ.

Proof.

We prove the bound by induction. It holds trivially for f_1^S = 0.

Suppose that the bound holds for f_t^S, that is, ‖f_t^S‖_{ℋ_K} ≤ 2κ|ϕ'_-(0)|/λ. Consider

(3.8) f_{t+1}^S = (1 - η_t λ) f_t^S - (η_t/(mn)) ∑_{i=1}^m ∑_{j=1}^n ϕ'_-(f_t^S(x_i⁺) - f_t^S(x_j⁻))(K_{x_i⁺} - K_{x_j⁻})
= (1 - η_t λ) f_t^S - (η_t/(mn)) ∑_{i=1}^m ∑_{j=1}^n [ (ϕ'_-(f_t^S(x_i⁺) - f_t^S(x_j⁻)) - ϕ'_-(0)) / (f_t^S(x_i⁺) - f_t^S(x_j⁻)) ] (f_t^S(x_i⁺) - f_t^S(x_j⁻))(K_{x_i⁺} - K_{x_j⁻}) - (η_t/(mn)) ∑_{i=1}^m ∑_{j=1}^n ϕ'_-(0)(K_{x_i⁺} - K_{x_j⁻}).

Let L_{ij}f = ⟨f, K_{x_i⁺} - K_{x_j⁻}⟩_K (K_{x_i⁺} - K_{x_j⁻}). Since

(3.9) 0 ≤ ⟨L_{ij}f, f⟩_K = |⟨f, K_{x_i⁺} - K_{x_j⁻}⟩_K|² ≤ 4κ²‖f‖²_{ℋ_K},

we have ‖L_{ij}‖ ≤ 4κ².

Meanwhile, (ϕ'_-(f_t^S(x_i⁺) - f_t^S(x_j⁻)) - ϕ'_-(0)) / (f_t^S(x_i⁺) - f_t^S(x_j⁻)) ≤ M(λ) by the induction hypothesis. Then

(3.10) (1/(mn)) ∑_{i=1}^m ∑_{j=1}^n [ (ϕ'_-(f_t^S(x_i⁺) - f_t^S(x_j⁻)) - ϕ'_-(0)) / (f_t^S(x_i⁺) - f_t^S(x_j⁻)) ] L_{ij}

is a positive linear operator on ℋ_K whose norm is bounded by 4κ²M(λ).

Since η_t(4κ²M(λ) + λ) ≤ 1, the operator

(3.11) A := (1 - η_t λ)I - (η_t/(mn)) ∑_{i=1}^m ∑_{j=1}^n [ (ϕ'_-(f_t^S(x_i⁺) - f_t^S(x_j⁻)) - ϕ'_-(0)) / (f_t^S(x_i⁺) - f_t^S(x_j⁻)) ] L_{ij}

on ℋ_K is positive and satisfies A ≤ (1 - η_t λ)I.

Thus,

(3.12) ‖f_{t+1}^S‖_{ℋ_K} ≤ (1 - η_t λ)‖f_t^S‖_{ℋ_K} + (η_t/(mn)) ∑_{i=1}^m ∑_{j=1}^n |ϕ'_-(0)| ‖K_{x_i⁺} - K_{x_j⁻}‖_{ℋ_K} ≤ (1 - η_t λ)(2κ|ϕ'_-(0)|/λ) + 2η_t κ|ϕ'_-(0)| = 2κ|ϕ'_-(0)|/λ.

This proves the lemma.
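As a sanity check on Lemma 3.4, the simulation below (a hedged toy setup of ours: Gaussian kernel with κ = 1, hinge loss with |ϕ'_-(0)| = 1 and M(λ) ≤ 1, step sizes satisfying η_t(4κ²M(λ) + λ) ≤ 1) tracks the RKHS norm of the iterates via the Gram matrix and confirms it never exceeds 2κ|ϕ'_-(0)|/λ.

```python
import numpy as np

# Hedged simulation of Lemma 3.4 on toy data; all constants are our choices.
K = lambda x, y: np.exp(-(x - y) ** 2 / 2.0)   # Gaussian kernel, kappa = 1
Sp, Sn = np.array([0.5, 1.5]), np.array([-1.5, -0.5])
X = np.concatenate([Sp, Sn])
G = K(X[:, None], X[None, :])
m, n, lam = len(Sp), len(Sn), 0.1
c = np.zeros(m + n)                             # f_1^S = 0

norms = []
for t in range(1, 201):
    eta = 0.1 * t ** (-0.5)                     # eta_t * (4*M + lam) <= 0.41 <= 1
    fX = G @ c
    margins = fX[:m, None] - fX[m:][None, :]
    g = np.where(margins <= 1.0, -1.0, 0.0)     # left derivative of the hinge loss
    c *= 1.0 - eta * lam
    c[:m] -= (eta / (m * n)) * g.sum(axis=1)
    c[m:] += (eta / (m * n)) * g.sum(axis=0)
    norms.append(float(np.sqrt(c @ G @ c)))     # ||f_{t+1}^S||_K via the Gram matrix

radius = 2.0 / lam                               # 2*kappa*|phi'_-(0)|/lambda
```

In this run the norms stay far below the radius 2/λ = 20, illustrating that the lemma's ball is a crude but safe envelope.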

For r > 0, denote 𝔉_r = {f ∈ ℋ_K : ‖f‖_{ℋ_K} ≤ r}. Meanwhile, denote L_r = max{|ϕ'_-(2κr)|, |ϕ'_-(-2κr)|} and M_r = max{|ϕ(2κr)|, |ϕ(-2κr)|}.

Based on the analysis techniques in [21, 23], we derive capacity-independent bounds for W(S, r) := sup_{f∈𝔉_r} |E_S(f) - E(f)|.

Lemma 3.5.

For every r > 0 and ε > 0, one has

(3.13) Prob_S{ |W(S, r) - EW(S, r)| > ε } ≤ 2 exp{ -2m²n²ε² / ((m+n)³M_r²) }, EW(S, r) ≤ (4L_r κr + 2ϕ(0))(√m + √n)/√(mn).

Proof.

Because of the structure of S, four cases of sample change must be taken into account to use McDiarmid's inequality. Denote by S^k the sample coinciding with S except that x_k⁺ (or x_k⁻) is replaced by x̃_k⁺ (or x̃_k⁻). It is easy to verify that

(3.14) |W(S, r) - W(S^k, r)| = | sup_{f∈𝔉_r} |E_S(f) - E(f)| - sup_{f∈𝔉_r} |E_{S^k}(f) - E(f)| | ≤ sup_{f∈𝔉_r} |E_S(f) - E_{S^k}(f)| ≤ ((m+n)/(mn)) M_r.

Based on McDiarmid's inequality in [24], we can derive the first result in Lemma 3.5. To derive the second result, we denote ξ(x⁺, x⁻) = ϕ(f(x⁺) - f(x⁻)). Then E(f) = E_{x⁺}E_{x⁻}ξ(x⁺, x⁻) and E_S(f) = (1/(mn)) ∑_{i=1}^m ∑_{j=1}^n ξ(x_i⁺, x_j⁻). Observe that

(3.15) W(S, r) ≤ sup_{f∈𝔉_r} |E(f) - (1/n) ∑_{j=1}^n E_{x⁺}ξ(x⁺, x_j⁻)| + sup_{f∈𝔉_r} |(1/n) ∑_{j=1}^n E_{x⁺}ξ(x⁺, x_j⁻) - E_S(f)|
≤ E_{x⁺} sup_{f∈𝔉_r} |E_{x⁻}ξ(x⁺, x⁻) - (1/n) ∑_{j=1}^n ξ(x⁺, x_j⁻)| + (1/n) ∑_{j=1}^n sup_{f∈𝔉_r} sup_{x⁻} |E_{x⁺}ξ(x⁺, x⁻) - (1/m) ∑_{i=1}^m ξ(x_i⁺, x⁻)| =: W₁ + W₂.

Denote G_{x⁺} = {h(x⁻) = f(x⁺) - f(x⁻) : f ∈ 𝔉_r}. Then

(3.16) EW₁ = E_{x⁺} E sup_{h∈G_{x⁺}} |E_{x⁻}ϕ(h(x⁻)) - (1/n) ∑_{j=1}^n ϕ(h(x_j⁻))| ≤ 2 sup_{x⁺} E sup_{h∈G_{x⁺}} |(1/n) ∑_{j=1}^n ε_j ϕ(h(x_j⁻))|,

where {ε_j} are independent Rademacher variables.

Since |ϕ(h(x_j⁻)) - ϕ(0)| ≤ L_r |h(x_j⁻)|, we have

(3.17) EW₁ ≤ 2L_r sup_{x⁺} E sup_{h∈G_{x⁺}} |(1/n) ∑_{j=1}^n ε_j h(x_j⁻)| + (2ϕ(0)/n) E|∑_{j=1}^n ε_j|
≤ 4L_r κr/√n + (2ϕ(0)/n)(E|∑_{j=1}^n ε_j|²)^{1/2}
= 4L_r κr/√n + (2ϕ(0)/n)(∑_{j,j'=1}^n E ε_j ε_{j'})^{1/2}
= (4L_r κr + 2ϕ(0))/√n.

In the same fashion, we can also derive

(3.18) EW₂ ≤ (4L_r κr + 2ϕ(0))/√m.

Thus, the second desired result follows by combining (3.17) and (3.18).

Now we can derive the estimate of φ(S,t).

Lemma 3.6.

If η_t satisfies η_t(4κ²M(λ) + λ) ≤ 1 for each t and r = 2κ|ϕ'_-(0)|/λ + √(2ϕ(0)/λ), then with confidence at least 1-δ one has

(3.19) φ(S, t) ≤ B_λ := (4L_r κr + 2ϕ(0) + M_r √(2 log(2/δ))) (m+n)^{3/2}/(mn).

Proof.

By Lemma 3.5, we have, with confidence at least 1-δ,

(3.20) W(S, r) ≤ (4L_r κr + 2ϕ(0))(√m + √n)/√(mn) + ((m+n)^{3/2} M_r/(2mn)) √(2 log(2/δ)).

By taking f = 0 in the definition of f_λ, we see that

(3.21) (λ/2)‖f_λ‖²_{ℋ_K} ≤ E(0) + λ · 0 = ϕ(0).

Then, for any λ > 0, we have ‖f_λ‖_{ℋ_K} ≤ √(2ϕ(0)/λ). Thus, f_t^S, f_λ ∈ 𝔉_r for r = 2κ|ϕ'_-(0)|/λ + √(2ϕ(0)/λ). So φ(S, t) ≤ 2W(S, r) for each t. This completes the proof.

We are now in a position to bound the sample error. We need the following elementary inequalities, which can be found in [3, 5].

Lemma 3.7.

(1) For α ∈ (0, 1] and θ ∈ [0, 1],

(3.22) ∑_{i=1}^t (1/i^θ) ∏_{j=i+1}^t (1 - α/j^θ) ≤ 3/α.

(2) Let v ∈ (0, 1] and θ ∈ (0, 1]. Then

(3.23) ∑_{t=1}^{T-1} (1/t^{2θ}) exp{ -v ∑_{j=t+1}^T j^{-θ} } ≤ { 18/(vT^θ) + (9T^{1-θ}/((1-θ)2^{1-θ})) exp{ -(v(1-2^{θ-1})/(1-θ))(T+1)^{1-θ} } if θ < 1; (8/(1-v))(T+1)^{-v} if θ = 1 }.

(3) For any t < T and θ ∈ (0, 1], there holds

(3.24) ∑_{j=t+1}^T j^{-θ} ≥ { (1/(1-θ))[(T+1)^{1-θ} - (t+1)^{1-θ}] if θ < 1; log(T+1) - log(t+1) if θ = 1 }.
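The inequalities of Lemma 3.7 are elementary but easy to misread, so a brute-force spot check of part (1) over a few parameter values (our own verification, not part of the proof) is reassuring:

```python
# Hedged brute-force check of Lemma 3.7(1): for alpha in (0,1], theta in [0,1],
# sum_{i=1}^t i^{-theta} * prod_{j=i+1}^t (1 - alpha * j^{-theta}) <= 3/alpha.
# The parameter values below are arbitrary spot checks.

def lhs(t, alpha, theta):
    total = 0.0
    for i in range(1, t + 1):
        prod = 1.0
        for j in range(i + 1, t + 1):
            prod *= 1.0 - alpha * j ** (-theta)
        total += i ** (-theta) * prod
    return total

checks = [(alpha, theta, lhs(150, alpha, theta))
          for alpha in (0.2, 0.5, 1.0) for theta in (0.0, 0.5, 1.0)]
```

For instance, at θ = 1 the product telescopes approximately like (i/t)^α, and the sum stays near 1/α, comfortably below the bound 3/α.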

Proposition 3.8.

Let η_t = η_1 t^{-θ} for some θ ∈ [0, 1], where η_1 satisfies η_1(4κ²M(λ) + λ) ≤ 1. Set r and B_λ as in Lemma 3.6, and denote B̃_λ = 2κL_r + 2κ|ϕ'_-(0)|. Then, with confidence at least 1-δ, the following bounds hold for t ≥ 1: when θ < 1,

(3.25) ‖f_t^S - f_λ‖²_{ℋ_K} ≤ ‖f_λ‖²_{ℋ_K} exp{ -(η_1λ/(1-θ))(t^{1-θ} - 1) } + 18B̃_λ²η_1/(λt^θ) + (9B̃_λ²η_1²t^{1-θ}/((1-θ)2^{1-θ})) exp{ -(η_1λ(1-2^{θ-1})/(1-θ))(t+1)^{1-θ} } + 6B_λ/λ;

when θ = 1,

(3.26) ‖f_t^S - f_λ‖²_{ℋ_K} ≤ ‖f_λ‖²_{ℋ_K} t^{-η_1λ} + (8B̃_λ²η_1²/(1-η_1λ))(t+1)^{-η_1λ} + 6B_λ/λ.

Proof.

Since ‖f_t^S‖_{ℋ_K} ≤ 2κ|ϕ'_-(0)|/λ, we have |f_t^S(x_i⁺) - f_t^S(x_j⁻)| ≤ 2κ‖f_t^S‖_{ℋ_K} ≤ 2κr. From the definition of f_t^λ, we know that ‖f_t^λ‖_{ℋ_K} ≤ 2κL_r + 2κ|ϕ'_-(0)| = B̃_λ. Thus, when φ(S, t) ≤ B_λ, we have from Lemma 3.2

(3.27) ‖f_{t+1}^S - f_λ‖²_{ℋ_K} ≤ (1 - η_tλ)‖f_t^S - f_λ‖²_{ℋ_K} + η_t²B̃_λ² + 2η_tB_λ.

Applying this relation iteratively, we have

(3.28) ‖f_t^S - f_λ‖²_{ℋ_K} ≤ ∏_{i=1}^{t-1}(1 - η_iλ)‖f_λ‖²_{ℋ_K} + ∑_{i=1}^{t-1} ∏_{j=i+1}^{t-1}(1 - η_jλ)(η_i²B̃_λ² + 2η_iB_λ).

Since η_i = η_1 i^{-θ}, by Lemma 3.7(2) we have, for θ < 1,

(3.29) ∑_{i=1}^{t-1} ∏_{j=i+1}^{t-1}(1 - η_jλ)η_i² ≤ η_1² ∑_{i=1}^{t-1} i^{-2θ} exp{ -η_1λ ∑_{j=i+1}^{t-1} j^{-θ} } ≤ 18η_1/(λt^θ) + (9η_1²t^{1-θ}/((1-θ)2^{1-θ})) exp{ -(η_1λ(1-2^{θ-1})/(1-θ))(t+1)^{1-θ} }

and, for θ = 1,

(3.30) ∑_{i=1}^{t-1} ∏_{j=i+1}^{t-1}(1 - η_jλ)η_i² ≤ (8η_1²/(1-η_1λ))(t+1)^{-η_1λ}.

Lemma 3.7(1) yields

(3.31) ∑_{i=1}^{t-1} ∏_{j=i+1}^{t-1}(1 - η_jλ)η_i ≤ η_1 ∑_{i=1}^{t-1} (1/i^θ) ∏_{j=i+1}^{t-1}(1 - η_1λ/j^θ) ≤ 3/λ.

By Lemma 3.7(3), we also have, for θ < 1,

(3.32) ∏_{i=1}^{t-1}(1 - η_iλ) ≤ exp{ -∑_{i=1}^{t-1} η_iλ } ≤ exp{ (η_1λ/(1-θ))(1 - t^{1-θ}) }

and, for θ = 1,

(3.33) ∏_{i=1}^{t-1}(1 - η_iλ) ≤ t^{-η_1λ}.

Combining the above estimates with Lemma 3.6, we derive the desired results.

Now we present the proof of Theorem 2.1.

Proof of Theorem 2.1.

First we derive explicit expressions for the quantities in Proposition 3.8. Since λ = t^{-γ}, we have r ≤ C₃t^γ, where C₃ = 2κ|ϕ'_-(0)| + √(2ϕ(0)). By (2.1), we find that

(3.34) L_r ≤ C₀(1 + 2κr)^{q-1} ≤ C₀(1 + 2κ)^{q-1}C₃^{q-1}t^{(q-1)γ}, M_r ≤ C₀(1 + 2κ)^q C₃^q t^{qγ}.

Then,

(3.35) B_λ ≤ C₄ √(log(2/δ)) t^{qγ} (m+n)^{3/2}/(mn), B̃_λ ≤ C₅ t^{(q-1)γ},

where C₄ = 4C₀κ(1 + 2κ)^{q-1}C₃^{q-1} + C₀(1 + 2κ)^q C₃^q + 2ϕ(0) and C₅ = 2κC₀(1 + 2κ)^{q-1}C₃^{q-1} + 2κ|ϕ'_-(0)|.

Next, we bound M(λ). When 1 ≤ |u| ≤ 4κ²|ϕ'_-(0)|/λ, we have

(3.36) |ϕ'_-(u) - ϕ'_-(0)|/|u| ≤ |ϕ'_-(0)| + 2^{q-1}C₀(4κ²|ϕ'_-(0)|)^{q-1}λ^{2-q}.

Hence,

(3.37) M(λ) ≤ ( sup_{0<|u|≤1} |ϕ'_-(u) - ϕ'_-(0)|/|u| + |ϕ'_-(0)| + 2^{q-1}C₀(4κ²|ϕ'_-(0)|)^{q-1} ) λ^{min{2-q, 0}}.

It follows that the condition η_1(4κ²M(λ) + λ) ≤ 1 in Proposition 3.8 holds true when η_t = η_1 t^{-θ} and η_1 = η*λ^{max{q-2,0}}. Based on Proposition 3.8 and (mn/(m+n)^{3/2})^s ≤ t ≤ 2(mn/(m+n)^{3/2})^s, we have, with confidence at least 1-δ,

(3.38) ‖f_t^S - f_λ‖²_{ℋ_K} ≤ ( c̃₁‖f_λ‖²_{ℋ_K} exp{ η*/(1-θ) } + (c̃₂η*²/((1-θ)2^{1-θ})) (mn/(m+n)^{3/2})^{s(1-θ-2(q-1)γ)} ) × exp{ -(η*(1-2^{θ-1})/(1-θ)) t^{1-θ-γ max{q-1,1}} } + c̃₃ (mn/(m+n)^{3/2})^{sγ(1+q)-1} + c̃₄ η* (mn/(m+n)^{3/2})^{sγ min{q+1, 2q-1} - sθ},

where c̃₁, c̃₂, c̃₃, and c̃₄ are constants independent of m, n, and t.

Thus, when 1 - θ - γ max{q-1, 1} > 0, the exponential term decays faster than any polynomial in t, and we derive the desired result in Theorem 2.1.

Acknowledgments

This work was supported partially by the National Natural Science Foundation of China (NSFC) under Grant no. 11001092 and the Fundamental Research Funds for the Central Universities (Program no. 2011PY130, 2011QC022). The authors are indebted to the anonymous reviewers for their constructive comments.

References

[1] N. Aronszajn, "Theory of reproducing kernels," Transactions of the American Mathematical Society, vol. 68, pp. 337-404, 1950.
[2] S. Agarwal and P. Niyogi, "Stability and generalization of bipartite ranking algorithms," in Proceedings of the 18th Annual Conference on Learning Theory (COLT), 2005.
[3] S. Smale and Y. Yao, "Online learning algorithms," Foundations of Computational Mathematics, vol. 6, no. 2, pp. 145-170, 2006.
[4] S. Smale and D. X. Zhou, "Online learning with Markov sampling," Analysis and Applications, vol. 7, no. 1, pp. 87-113, 2009.
[5] Y. Ying and D. X. Zhou, "Online regularized classification algorithms," IEEE Transactions on Information Theory, vol. 52, no. 11, pp. 4775-4788, 2006.
[6] X. M. Dong and D. R. Chen, "Learning rates of gradient descent algorithm for classification," Journal of Computational and Applied Mathematics, vol. 224, no. 1, pp. 182-192, 2009.
[7] J. Cai, H. Wang, and D. X. Zhou, "Gradient learning in a classification setting by gradient descent," Journal of Approximation Theory, vol. 161, no. 2, pp. 674-692, 2009.
[8] X. Dong and D. X. Zhou, "Learning gradients by a gradient descent algorithm," Journal of Mathematical Analysis and Applications, vol. 341, no. 2, pp. 1018-1027, 2008.
[9] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender, "Learning to rank using gradient descent," in Proceedings of the 22nd International Conference on Machine Learning, 2005.
[10] V. C. Raykar, R. Duraiswami, and B. Krishnapuram, "A fast algorithm for learning a ranking function from large-scale data sets," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 7, pp. 1158-1170, 2008.
[11] H. Chen, Y. Tang, L. Q. Li, and X. Li, "Ranking by a gradient descent algorithm," manuscript.
[12] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer, "An efficient boosting algorithm for combining preferences," Journal of Machine Learning Research, vol. 4, no. 6, pp. 933-969, 2004.
[13] S. Agarwal, T. Graepel, R. Herbrich, S. Har-Peled, and D. Roth, "Generalization bounds for the area under the ROC curve," Journal of Machine Learning Research, vol. 6, pp. 393-425, 2005.
[14] S. Clémençon, G. Lugosi, and N. Vayatis, "Ranking and empirical minimization of U-statistics," The Annals of Statistics, vol. 36, no. 2, pp. 844-874, 2008.
[15] C. Rudin and R. E. Schapire, "Margin-based ranking and an equivalence between AdaBoost and RankBoost," Journal of Machine Learning Research, vol. 10, pp. 2193-2232, 2009.
[16] C. Rudin, "The P-norm push: a simple convex ranking algorithm that concentrates at the top of the list," Journal of Machine Learning Research, vol. 10, pp. 2233-2271, 2009.
[17] S. Agarwal and P. Niyogi, "Generalization bounds for ranking algorithms via algorithmic stability," Journal of Machine Learning Research, vol. 10, pp. 441-474, 2009.
[18] D. Cossock and T. Zhang, "Statistical analysis of Bayes optimal subset ranking," IEEE Transactions on Information Theory, vol. 54, no. 11, pp. 5140-5154, 2008.
[19] S. Smale and D. X. Zhou, "Shannon sampling II: connections to learning theory," Applied and Computational Harmonic Analysis, vol. 19, no. 3, pp. 285-302, 2005.
[20] S. Smale and D. X. Zhou, "Learning theory estimates via integral operators and their approximations," Constructive Approximation, vol. 26, no. 2, pp. 153-172, 2007.
[21] S. Mukherjee and Q. Wu, "Estimation of gradients and coordinate covariation in classification," Journal of Machine Learning Research, vol. 7, pp. 2481-2514, 2006.
[22] S. Mukherjee and D. X. Zhou, "Learning coordinate covariances via gradients," Journal of Machine Learning Research, vol. 7, pp. 519-549, 2006.
[23] H. Chen and L. Q. Li, "Learning rates of multi-kernel regularized regression," Journal of Statistical Planning and Inference, vol. 140, no. 9, pp. 2562-2568, 2010.
[24] C. McDiarmid, "On the method of bounded differences," in Surveys in Combinatorics 1989, pp. 148-188, Cambridge University Press, Cambridge, UK, 1989.