Convergence Analysis of an Empirical Eigenfunction-Based Ranking Algorithm with Truncated Sparsity

Abstract and Applied Analysis


Introduction
Motivated by various applications including problems related to information retrieval, user-preference modeling, and computational biology, the problem of ranking has recently gained much attention in machine learning (see, e.g., [1-4]). This paper proposes a kernel-based ranking algorithm that searches for a ranking function in a data-dependent hypothesis space. The space is spanned by certain empirical eigenfunctions, which we select by means of a truncated parameter. The notion of empirical eigenfunctions, first studied for learning algorithms in [5], has been used to develop classification and regression algorithms that are shown to have strong learning ability [6, 7]. We use the same idea here to develop learning algorithms for ranking.
1.1. The Ranking Problem. The problem of ranking is distinct from both classification and regression. In ranking, one learns a real-valued function that assigns scores to instances, but the scores themselves do not matter; what is important is the relative ranking of instances induced by those scores.
Formally, the problem of ranking may be modeled in the framework of statistical learning theory (see, e.g., [8] for more details). Assume $\rho$ is a Borel probability measure on $Z = X \times Y$, where $X$ is a compact metric space (the input or instance space) and $Y = [0, M]$ (the output space) for some $M > 0$. Let $\rho_X$ be its marginal distribution on $X$ and let $\rho(\cdot \mid x)$ be the conditional distribution on $Y$ at a given $x$. The learner is given a set of samples $\mathbf{z} = \{z_i\}_{i=1}^m = \{(x_i, y_i)\}_{i=1}^m \in Z^m$ drawn independently and identically according to $\rho$, and the goal is to find a function $f_{\mathbf{z}} : X \to \mathbb{R}$ that ranks future instances with larger labels higher than those with smaller labels. In other words, $x$ is to be ranked as preferred over $x'$ if $f_{\mathbf{z}}(x) > f_{\mathbf{z}}(x')$ and lower than $x'$ if $f_{\mathbf{z}}(x) < f_{\mathbf{z}}(x')$ ($f_{\mathbf{z}}(x) = f_{\mathbf{z}}(x')$ indicates that there is no difference in ranking preference between the two instances). In this setting, the penalty of a ranking function $f$ on a pair of instances $(x, x')$ with corresponding labels $y$ and $y'$ can be taken to be the least squares ranking loss
$$\ell\big(f, (x, y), (x', y')\big) = \big( (y - y') - (f(x) - f(x')) \big)^2, \tag{1}$$
and, as a result, the quality of $f$ can be measured by its expected ranking error
$$\mathcal{E}(f) = \int_Z \int_Z \ell\big(f, (x, y), (x', y')\big) \, d\rho(x, y) \, d\rho(x', y'). \tag{2}$$
Let $L^2_{\rho_X}$ be the space of square integrable functions on $X$ with respect to the measure $d\rho_X$. Let $\mathcal{G}$ be the collection of target functions, defined to be the functions minimizing the error $\mathcal{E}(f)$ over $L^2_{\rho_X}$; that is, $\mathcal{G} = \{f \in L^2_{\rho_X} : f = \arg\min_{g \in L^2_{\rho_X}} \mathcal{E}(g)\}$. It is apparent from (2) that the error of any ranking function of the form $f + c$ is the same as that of $f$, where $c \in \mathbb{R}$ is some constant. Therefore, unlike the target function in classification or regression, the target function in ranking is not unique in general. It is easy to show that the regression function of $\rho$, defined by
$$f_\rho(x) = \int_Y y \, d\rho(y \mid x), \quad x \in X, \tag{3}$$
is a minimizer of the error (2), which indicates that any function of the form $f_\rho + c$ is also a target function. On the other hand, any target function in $L^2_{\rho_X}$ must have the form $f_\rho + c$, as can be checked from Lemma 11 in [9]. Thus we conclude that the collection $\mathcal{G}$ consists exactly of functions of the form $f_\rho(x) + c$ with $c \in \mathbb{R}$ an arbitrary constant; that is, $\mathcal{G} = \{f \in L^2_{\rho_X} : f = f_\rho(x) + c, \; c \in \mathbb{R}\}$.
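The shift invariance just described is easy to verify numerically with the empirical analogue of the least squares ranking loss. The following sketch (scores and labels are hypothetical) confirms that adding a constant to every score leaves the pairwise error unchanged.

```python
import numpy as np

def pairwise_ls_error(scores, labels):
    """Empirical least squares ranking error over all ordered pairs:
    (1/m^2) * sum_{i,j} ((y_i - y_j) - (f(x_i) - f(x_j)))^2."""
    y = np.asarray(labels, dtype=float)
    f = np.asarray(scores, dtype=float)
    dy = y[:, None] - y[None, :]   # matrix of label differences
    df = f[:, None] - f[None, :]   # matrix of score differences
    return np.mean((dy - df) ** 2)

y = np.array([0.2, 1.0, 0.5, 2.0])   # hypothetical labels
f = np.array([0.1, 0.9, 0.4, 1.8])   # hypothetical scores

# Shifting every score by a constant does not change the ranking error,
# so minimizers of the error form a family f + c.
e0 = pairwise_ls_error(f, y)
e1 = pairwise_ls_error(f + 3.7, y)
assert abs(e0 - e1) < 1e-9
```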

1.2. The Mercer Kernel and Empirical Eigenpairs.
Our algorithm is based on a Mercer kernel, so we first introduce some notions related to kernels (see [10, 11] for more details). A Mercer kernel is a continuous, symmetric, and positive semidefinite function $K : X \times X \to \mathbb{R}$; it generates a reproducing kernel Hilbert space $(\mathcal{H}_K, \|\cdot\|_K)$ in which every $f \in \mathcal{H}_K$ satisfies $\|f\|_\infty \le \kappa \|f\|_K$, where $\kappa = \sqrt{\sup_{x \in X} |K(x, x)|}$.
Let $K$ be a Mercer kernel. The integral operator $L_K : \mathcal{H}_K \to \mathcal{H}_K$, defined by
$$L_K f = \frac{1}{2} \int_X \int_X \big( f(x) - f(u) \big) \big( K_x - K_u \big) \, d\rho_X(x) \, d\rho_X(u)$$
with $K_x = K(x, \cdot)$, is introduced in [12] to analyze a regularized ranking algorithm. This operator is compact, positive, and self-adjoint. In particular, it has at most countably many nonzero eigenvalues, and all of these eigenvalues are nonnegative. Let us arrange these eigenvalues $\{\lambda_l\}$ (with multiplicities) as a nonincreasing sequence tending to $0$ and take an associated sequence of eigenfunctions $\{\phi_l\}$ to be an orthonormal basis of $\mathcal{H}_K$. In the remainder of this paper, we will use the general assumption that
$$f_\rho = L_K^r (g_\rho) \quad \text{for some } r > 0 \text{ and } g_\rho \in \mathcal{H}_K, \tag{7}$$
where the power $L_K^r$ of $L_K$ is defined in terms of $\{\lambda_l\}$ and $\{\phi_l\}$ by $L_K^r f = \sum_l \lambda_l^r \langle f, \phi_l \rangle_K \phi_l$. The assumption (7) quantifies the regularity of $f_\rho$: the larger $r$ is, the smoother $f_\rho$ is with respect to the kernel. The empirical counterpart of $L_K$ is the operator $L_{K,\mathbf{x}} : \mathcal{H}_K \to \mathcal{H}_K$ given by
$$L_{K,\mathbf{x}} f = \frac{1}{2m^2} \sum_{i,j=1}^m \big( f(x_i) - f(x_j) \big) \big( K_{x_i} - K_{x_j} \big),$$
which is self-adjoint and positive with rank at most $m$. We denote its eigensystem by $\{(\lambda_l^{\mathbf{x}}, \phi_l^{\mathbf{x}})\}$, where the eigenvalues $\{\lambda_l^{\mathbf{x}}\}$ are arranged in nonincreasing order with $\lambda_l^{\mathbf{x}} = 0$ whenever $l > m$, and the corresponding eigenfunctions $\{\phi_l^{\mathbf{x}}\}$ form an orthonormal basis of $\mathcal{H}_K$. It can be proved that $E_{\mathbf{x}}(L_{K,\mathbf{x}}) = L_K$, which means that the eigenfunctions $\{\phi_l\}$ can be approximated by the empirical eigenfunctions $\{\phi_l^{\mathbf{x}}\}$. This fact indicates that the first $m$ empirical eigenfunctions are reasonably promising for ranking.
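The empirical operator can be explored numerically through its matrix counterpart. The sketch below (the Gaussian kernel, the design points, and the scaling $\mathbf{B} = \frac{1}{m}\mathbf{H}\mathbf{K}\mathbf{H}$ are illustrative assumptions, not prescriptions from the text) computes the spectrum of the doubly centered kernel matrix and checks that it is nonnegative and nonincreasing, as the positivity of $L_{K,\mathbf{x}}$ requires.

```python
import numpy as np

def gaussian_kernel(X, sigma=0.5):
    """Gram matrix K(x_i, x_j) = exp(-|x_i - x_j|^2 / (2 sigma^2))."""
    d2 = (X[:, None] - X[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def centered_operator_eigs(K):
    """Eigenvalues of B = (1/m) H K H with H = I - (1/m) 1 1^T,
    the sample analogue of the centered integral operator."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    B = H @ K @ H / m
    vals = np.linalg.eigvalsh(B)[::-1]   # nonincreasing order
    return np.clip(vals, 0.0, None)      # clip tiny negative round-off

x = np.linspace(0.0, 1.0, 60)            # deterministic design on X = [0, 1]
lam = centered_operator_eigs(gaussian_kernel(x))
assert lam[0] > 0                              # positive leading eigenvalue
assert np.all(lam[:-1] >= lam[1:] - 1e-12)     # nonincreasing spectrum
```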

1.3. The Computation of Empirical Eigenpairs.
Before proceeding further, we need to show how the empirical eigenpairs $\{(\lambda_l^{\mathbf{x}}, \phi_l^{\mathbf{x}})\}$ can be found explicitly. The main difficulty here is that the kernel of $L_{K,\mathbf{x}}$ is not symmetric even though $L_{K,\mathbf{x}}$ is a self-adjoint operator on $\mathcal{H}_K$, which makes the computation of the empirical eigenfunctions relatively difficult. (We refer the reader to [13] for some results on regression learning with indefinite kernels.) Denote the symmetric matrix $(K(x_i, x_j))_{i,j=1}^m$ by $\mathbf{K}$ and define
$$\mathbf{A} = \frac{1}{m} \Big( \mathbf{I} - \frac{1}{m} \mathbf{1}\mathbf{1}^T \Big) \mathbf{K}, \qquad \mathbf{B} = \mathbf{A} \Big( \mathbf{I} - \frac{1}{m} \mathbf{1}\mathbf{1}^T \Big),$$
where $\mathbf{I}$ is the $m$th-order unit matrix and $\mathbf{1} = (1, \dots, 1)^T \in \mathbb{R}^m$. The proofs of Lemmas 1-5 can be found in [14].
It is easy to see that $\mathbf{B}$ is a positive semidefinite matrix. Denote its eigenvalues, arranged in nonincreasing order, as $\tilde\lambda_1 \ge \cdots \ge \tilde\lambda_{r_{\mathbf{x}}} > \tilde\lambda_{r_{\mathbf{x}}+1} = \cdots = \tilde\lambda_m = 0$, with $r_{\mathbf{x}}$ being the rank of $\mathbf{B}$, and the corresponding orthonormal eigenvectors as $\tilde V_1, \dots, \tilde V_{r_{\mathbf{x}}}, \tilde V_{r_{\mathbf{x}}+1}, \dots, \tilde V_m$.
Lemma 2. Let $\{(\tilde\lambda_l, \tilde V_l)\}_{l=1}^m$ be the eigenpairs of $\mathbf{B}$. Then, for $1 \le l \le r_{\mathbf{x}}$, one has
$$\Big( \mathbf{I} - \frac{1}{m} \mathbf{1}\mathbf{1}^T \Big) \big( \phi_l^{\mathbf{x}} |_{\mathbf{x}} \big) = \sqrt{m \tilde\lambda_l} \, \tilde V_l,$$
where $\phi_l^{\mathbf{x}}|_{\mathbf{x}}$ is the vector in $\mathbb{R}^m$ obtained by restricting the function $\phi_l^{\mathbf{x}}$ onto the sampling points.
Based on the above arguments, we can now prove the following theorem, which yields a method for computing the empirical eigenpairs $\{(\lambda_l^{\mathbf{x}}, \phi_l^{\mathbf{x}})\}_{l=1}^{r_{\mathbf{x}}}$ explicitly.
Theorem 6. The number of positive eigenvalues of $L_{K,\mathbf{x}}$ is equal to that of $\mathbf{B}$. Moreover, the empirical eigenpairs can be computed from the eigenpairs $\{(\tilde\lambda_l, \tilde V_l)\}_{l=1}^{r_{\mathbf{x}}}$ of $\mathbf{B}$ as
$$\lambda_l^{\mathbf{x}} = \tilde\lambda_l, \qquad \phi_l^{\mathbf{x}} = \frac{1}{\sqrt{m \tilde\lambda_l}} \sum_{i=1}^m (\tilde V_l)_i \Big( K_{x_i} - \frac{1}{m} \sum_{j=1}^m K_{x_j} \Big),$$
for $l = 1, \dots, r_{\mathbf{x}}$, with $r_{\mathbf{x}}$ denoting the rank of $L_{K,\mathbf{x}}$.
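Theorem 6 translates directly into a small numerical routine. The sketch below (the kernel, the data, and the scaling $\mathbf{B} = \frac{1}{m}\mathbf{H}\mathbf{K}\mathbf{H}$ are assumptions for illustration) builds $\mathbf{B}$, extracts its positive eigenpairs, and verifies that the resulting empirical eigenfunctions are orthonormal in $\mathcal{H}_K$, i.e., that $\langle \phi_k^{\mathbf{x}}, \phi_l^{\mathbf{x}} \rangle_K = \delta_{kl}$.

```python
import numpy as np

def empirical_eigenpairs(K):
    """Eigenpairs of the doubly centered matrix B = (1/m) H K H and the
    RKHS coefficient vectors of the empirical eigenfunctions
    phi_l = (1/sqrt(m lam_l)) sum_i (V_l)_i (K_{x_i} - mean_j K_{x_j}).
    (Scaling conventions are an illustrative assumption.)"""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    B = H @ K @ H / m
    lam, V = np.linalg.eigh(B)
    order = np.argsort(lam)[::-1]          # nonincreasing eigenvalues
    lam, V = lam[order], V[:, order]
    pos = lam > 1e-8                       # keep the positive part (rank of B)
    # Coefficients of phi_l in the kernel sections K_{x_i}: columns of C.
    # sum_i v_i (K_{x_i} - mean K) has coefficient vector H v.
    C = (H @ V[:, pos]) / np.sqrt(m * lam[pos])
    return lam[pos], C

rng = np.random.default_rng(0)
X = rng.uniform(size=40)
K = np.exp(-(X[:, None] - X[None, :]) ** 2)
lam, C = empirical_eigenpairs(K)

# Orthonormality in H_K: <phi_k, phi_l>_K = C[:,k]^T K C[:,l] = delta_{kl}.
G = C.T @ K @ C
assert np.allclose(G, np.eye(len(lam)), atol=1e-5)
```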
1.4. The Ranking Algorithm. Prompted by the above analysis, we propose the following learning algorithm for ranking. Let $t$ be a positive number (called a truncated parameter) and let $D_{m,t}$ denote the set of empirical eigenfunctions whose corresponding eigenvalues are not less than $t$; that is, $D_{m,t} = \{\phi_l^{\mathbf{x}} : 1 \le l \le m, \; \lambda_l^{\mathbf{x}} \ge t\}$. Let $n$ be the number of eigenfunctions in $D_{m,t}$. Our ranking algorithm now takes the form
$$f_{\mathbf{z}}^t = \arg\min_{f \in \operatorname{span} D_{m,t}} \frac{1}{m^2} \sum_{i,j=1}^m \big( (y_i - y_j) - (f(x_i) - f(x_j)) \big)^2, \tag{15}$$
and the output function is $f_{\mathbf{z}}^t = \sum_{l=1}^n c_{\mathbf{z},l} \phi_l^{\mathbf{x}}$. We are concerned in this paper with the representer theorem, that is, the explicit solution to problem (15), and with the convergence analysis of the above algorithm in the $\mathcal{H}_K$-norm. Previous work on error analysis of ranking algorithms, such as [3, 8], deals only with generalization properties of the algorithms. Though convergence analysis of classification and regression algorithms has been well studied (see, e.g., [15, 16]), little research has been conducted toward establishing similar results in the setting of ranking. Perhaps the first work is that of Chen [12], who derives the convergence rate of a regularized ranking algorithm by means of the technique of operator approximation. Our results can be considered another attempt in this direction. It should be pointed out that, for the sake of simplicity, rather than taking all target functions into consideration, we restrict ourselves to the regression function $f_\rho$. In other words, we consider convergence bounds for $\|f_{\mathbf{z}}^t - f_\rho\|_K$ instead of $\inf_{g \in \mathcal{G}} \|f_{\mathbf{z}}^t - g\|_K$. Notice that, compared with classification or regression problems, the main difference in the formulation of ranking problems is that performance or loss is measured on pairs of examples rather than on individual examples. This results in the double-index summation in algorithm (15), which prevents us from directly applying the standard Hoeffding inequality used to obtain convergence bounds for classification and regression. We tackle this problem with a McDiarmid-Bernstein type probability inequality for vector-valued random variables [17], as is done in [8, 12]. Finally, we show that when the eigenvalues decay polynomially, the algorithm produces sparse representations with respect to the empirical eigenfunctions for a suitable choice of the parameter $t$.
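The truncation step defining $D_{m,t}$ is straightforward to implement once the empirical eigenvalues are available; a minimal sketch (the eigenvalues are hypothetical):

```python
import numpy as np

def truncate(eigenvalues, t):
    """Indices of empirical eigenfunctions kept in D_{m,t}:
    those whose eigenvalues satisfy lam_l >= t."""
    lam = np.asarray(eigenvalues)
    return np.flatnonzero(lam >= t)

lam = np.array([0.9, 0.3, 0.05, 0.004, 1e-5])   # hypothetical spectrum
assert truncate(lam, 0.01).tolist() == [0, 1, 2]
assert len(truncate(lam, 2.0)) == 0   # too large a threshold keeps nothing
```

The number of retained indices is the dimension $n$ of the hypothesis space, so $t$ directly controls model size.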

The Representer Theorem
In this section we provide the representer theorem for algorithm (15). The key point in proving the representer theorem is an equality involving the empirical eigenfunctions and eigenvalues.
Theorem 7. The solution to problem (15) is given by $f_{\mathbf{z}}^t = \sum_{l=1}^n c_{\mathbf{z},l} \phi_l^{\mathbf{x}}$, where
$$c_{\mathbf{z},l} = \frac{1}{2 m^2 \lambda_l^{\mathbf{x}}} \sum_{i,j=1}^m (y_i - y_j) \big( \phi_l^{\mathbf{x}}(x_i) - \phi_l^{\mathbf{x}}(x_j) \big), \quad l = 1, \dots, n. \tag{17}$$

Proof. Write $f = \sum_{l=1}^n c_l \phi_l^{\mathbf{x}} \in \operatorname{span} D_{m,t}$. The empirical error part takes the form
$$\frac{1}{m^2} \sum_{i,j=1}^m \Big( (y_i - y_j) - \sum_{l=1}^n c_l \big( \phi_l^{\mathbf{x}}(x_i) - \phi_l^{\mathbf{x}}(x_j) \big) \Big)^2.$$
A routine computation gives rise to
$$\frac{1}{2 m^2} \sum_{i,j=1}^m \big( \phi_k^{\mathbf{x}}(x_i) - \phi_k^{\mathbf{x}}(x_j) \big) \big( \phi_l^{\mathbf{x}}(x_i) - \phi_l^{\mathbf{x}}(x_j) \big) = \langle L_{K,\mathbf{x}} \phi_k^{\mathbf{x}}, \phi_l^{\mathbf{x}} \rangle_K = \lambda_l^{\mathbf{x}} \delta_{kl}. \tag{19}$$
By using (19), one can carry on with the expansion of the squared error: the cross terms between distinct eigenfunctions vanish, and we obtain an equivalent form of (15) as
$$(c_{\mathbf{z},1}, \dots, c_{\mathbf{z},n}) = \arg\min_{c \in \mathbb{R}^n} \sum_{l=1}^n \Big( 2 \lambda_l^{\mathbf{x}} c_l^2 - \frac{2 c_l}{m^2} \sum_{i,j=1}^m (y_i - y_j) \big( \phi_l^{\mathbf{x}}(x_i) - \phi_l^{\mathbf{x}}(x_j) \big) \Big).$$
Each component $c_{\mathbf{z},l}$ can then be found by solving a one-dimensional quadratic minimization problem, which has the solution given by (17). This proves the theorem.
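The closed-form coefficients (17) can be cross-checked against a direct least squares solution of the pairwise problem (15). In the sketch below (kernel, data, and the scaling conventions for $\mathbf{B}$ and $\phi_l^{\mathbf{x}}$ are illustrative assumptions), both routes give the same coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 25
X = np.sort(rng.uniform(size=m))
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(m)
K = np.exp(-(X[:, None] - X[None, :]) ** 2 / 0.1)

# Empirical eigenpairs of B = (1/m) H K H (same conventions as above).
H = np.eye(m) - np.ones((m, m)) / m
lam, V = np.linalg.eigh(H @ K @ H / m)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]
keep = lam > 1e-3                      # truncation D_{m,t} with t = 1e-3
lam, V = lam[keep], V[:, keep]
Phi = K @ (H @ V) / np.sqrt(m * lam)   # phi_l evaluated at the samples

# Closed-form coefficients from the representer theorem:
# c_l = (1 / (2 m^2 lam_l)) * sum_{i,j} (y_i - y_j)(phi_l(x_i) - phi_l(x_j)).
dy = y[:, None] - y[None, :]
dPhi = Phi[:, None, :] - Phi[None, :, :]
c_closed = np.einsum('ij,ijl->l', dy, dPhi) / (2 * m**2 * lam)

# Cross-check: solve the pairwise least squares problem directly.
A = dPhi.reshape(m * m, -1)
b = dy.reshape(m * m)
c_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(c_closed, c_lstsq, atol=1e-8)
```

The agreement reflects the diagonalization in (19): the pairwise normal equations decouple into one scalar equation per retained eigenfunction.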

The Error Analysis
In order to derive an error bound for algorithm (15), we need some preliminary inequalities. The following Hoffman-Wielandt inequality establishes the relationship between the eigenvalue differences $\{\lambda_l - \lambda_l^{\mathbf{x}}\}$ and the operator difference $L_K - L_{K,\mathbf{x}}$; it has been investigated in [18-21].

Lemma 8. One has
$$\sum_l \big( \lambda_l - \lambda_l^{\mathbf{x}} \big)^2 \le \| L_K - L_{K,\mathbf{x}} \|_{HS}^2,$$
where $\|L_K - L_{K,\mathbf{x}}\|_{HS}$ is the norm of $HS(\mathcal{H}_K)$, the Hilbert space of all Hilbert-Schmidt operators on $\mathcal{H}_K$, with inner product $\langle A, B \rangle_{HS} = \operatorname{Tr}(B^* A)$. Here $\operatorname{Tr}$ denotes the trace of a linear operator.
The inner product in $HS(\mathcal{H}_K)$ can also be written as $\langle A, B \rangle_{HS} = \sum_l \langle A e_l, B e_l \rangle_K$, where $\{e_l\}$ is an orthonormal basis of $\mathcal{H}_K$. The space $HS(\mathcal{H}_K)$ is a subspace of the space of bounded linear operators on $\mathcal{H}_K$, denoted $(\mathcal{B}(\mathcal{H}_K), \|\cdot\|)$, with the norm relations $\|A\| \le \|A\|_{HS}$ and $\|AB\|_{HS} \le \|A\|_{HS} \|B\|$.
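The finite-dimensional analogues of these norm relations, with the Hilbert-Schmidt norm realized as the Frobenius norm, can be confirmed numerically; the matrices below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))
B = rng.standard_normal((6, 6))

op = lambda M: np.linalg.norm(M, 2)        # operator (spectral) norm
hs = lambda M: np.linalg.norm(M, 'fro')    # Hilbert-Schmidt (Frobenius) norm

assert op(A) <= hs(A) + 1e-12              # ||A|| <= ||A||_HS
assert hs(A @ B) <= hs(A) * op(B) + 1e-12  # ||AB||_HS <= ||A||_HS ||B||
assert abs(hs(A) ** 2 - np.trace(A.T @ A)) < 1e-9   # <A,A>_HS = Tr(A^T A)
```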
To bound the quantity $\|L_K - L_{K,\mathbf{x}}\|_{HS}$, we introduce the following McDiarmid-Bernstein type probability inequality for vector-valued random variables, established in [17].

Lemma 9. Let $\mathbf{z} = \{z_i\}_{i=1}^m$ be independently drawn according to a probability distribution $\rho$ on $Z$, let $(\mathcal{H}, \|\cdot\|)$ be a Hilbert space, and let $F : Z^m \to \mathcal{H}$ be measurable. If there is $\widetilde{M} > 0$ such that $\|F(\mathbf{z}) - E_{z_i}(F(\mathbf{z}))\| \le \widetilde{M}$ for each $1 \le i \le m$ and almost every $\mathbf{z} \in Z^m$, then for every $\varepsilon > 0$,
$$\operatorname{Prob}_{\mathbf{z} \in Z^m} \big\{ \| F(\mathbf{z}) - E(F(\mathbf{z})) \| \ge \varepsilon \big\} \le 2 \exp \Big( - \frac{\varepsilon^2}{2 m \widetilde{M}^2} \Big).$$

From the fact $\operatorname{Tr}(\langle \cdot, K_x \rangle_K K_x) = K(x, x)$ (see [22]) and after tedious calculations, one can derive that, for each $i \in \{1, \dots, m\}$, $\|L_{K,\mathbf{x}} - E_{z_i}(L_{K,\mathbf{x}})\|_{HS} \le (6/m) \kappa^2$. By Lemma 9, we obtain the following.

Lemma 10. For any $0 < \delta < 1$, with confidence at least $1 - \delta$,
$$\| L_K - L_{K,\mathbf{x}} \|_{HS} \le \frac{6 \sqrt{2} \, \kappa^2 \sqrt{\log(2/\delta)}}{\sqrt{m}}.$$

With the help of the preceding five lemmas, we are now in a position to derive an error estimate for the algorithm. We conduct the analysis for the error in the $\mathcal{H}_K$-metric, which makes the corresponding error estimate stronger than one performed in the $L^2_{\rho_X}$-metric [16].

Theorem 12. Assume (7). For any $0 < \delta < 1$ and $0 < t \le 1$, with confidence at least $1 - \delta$,
$$\| f_{\mathbf{z}}^t - f_\rho \|_K \le \frac{C_1 \log(4/\delta)}{t \sqrt{m}} + C_2 t^r + \frac{C_3 \log(4/\delta)}{\sqrt{m}},$$
where $C_1$, $C_2$, and $C_3$ are constants independent of $m$ and $\delta$ (given explicitly in the proof).
Proof. By Lemmas 10 and 11, we know that for any $0 < \delta < 1/2$ there exists a subset $Z_\delta$ of $Z^m$ of measure at least $1 - 2\delta$ such that both (26) and (28) hold for each $\mathbf{z} \in Z_\delta$. Let $\mathbf{z} \in Z_\delta$. It follows from the orthogonal expansion in terms of the orthonormal basis $\{\phi_l^{\mathbf{x}}\}$ that $\|f_{\mathbf{z}}^t - f_\rho\|_K^2$ splits into the two terms $\Delta_1$ and $\Delta_2$ of (30). We bound the first term $\Delta_1$ on the right-hand side of (30) by decomposing it further into two parts, one with $\sum_{l=1}^n$ and one with $\sum_{l=n+1}^\infty$. The part with $\sum_{l=n+1}^\infty$ is easy to deal with since $\{\phi_l^{\mathbf{x}}\}$ is an orthonormal basis; the last inequality in its estimate follows from $\|\{a_l\}\|_{\ell^2} = \|g_\rho\|_K$. The part with $\sum_{l=1}^n$ can be estimated by the Schwarz inequality. Using the definition of the Hilbert-Schmidt norm together with Lemma 8, we distinguish two cases.

Case 1 ($r \ge 1$). The estimate follows directly from Lemma 8.

Case 2 ($r < 1$). We notice that $\lambda^{2r} \le t^{2r-2} \lambda^2$ for $\lambda \ge t$ and obtain the analogous estimate.

The bounds for the two cases together with (32) give a bound for $\Delta_1$. Now we turn to the second term $\Delta_2$ on the right-hand side of (30). For any $0 < \delta < 1$, with confidence $1 - \delta$, $\Delta_2$ admits the corresponding sample-error bound. Putting the bounds for $\sqrt{\Delta_1}$ and $\sqrt{\Delta_2}$ into (26), we know that, with confidence $1 - 2\delta$, $\|f_{\mathbf{z}}^t - f_\rho\|_K$ can be bounded as stated in Theorem 12. Let $C_1 = 32\kappa$, $C_2 = 2^{\max\{r,1\}} \|g_\rho\|_K$, and let $C_3$ be the remaining constant multiple of $\|g_\rho\|_K$. Then the conclusion of Theorem 12 follows by scaling $2\delta$ to $\delta$.
Remark 14. The truncated parameter $t$ in algorithm (15) plays the role of the regularization parameter $\lambda$. Thus error bounds for our algorithm are closely related to the truncated parameter. Note that our learning rates are given in terms of special choices of the truncated parameter, which depend on the a priori condition (7). However, methods for determining the truncated parameter directly from the data would be preferable for practical learners. This will be a direction of our future research.
Remark 15. Note that when $r$ is large enough (meaning that $f_\rho$ has high regularity), the learning rate behaves like $m^{-1/2}$. Moreover, when the eigenvalues decay polynomially as $\lambda_l = O(l^{-q})$ for some $q > 1/2$, the nonzero coefficients in $f_{\mathbf{z}}^t = \sum_{l=1}^n c_{\mathbf{z},l} \phi_l^{\mathbf{x}}$ number at most $n = \lfloor m^{1/(2q-1)} \rfloor$, which is much smaller than the sample size $m$ when $q$ is large. Thus, our algorithm produces sparse representations with respect to the empirical eigenfunctions under a mild condition.
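The counting behind this sparsity claim can be illustrated with a synthetic spectrum. Assuming the polynomial decay $\lambda_l = l^{-q}$ (a hypothetical spectrum, used only for illustration), the size of $D_{m,t}$ grows like $t^{-1/q}$, so a moderate threshold already discards almost all of the $m$ candidate eigenfunctions.

```python
import numpy as np

q = 2.0  # polynomial eigenvalue decay: lam_l = l^{-q} (hypothetical)
lam = np.arange(1, 10001, dtype=float) ** (-q)

def n_kept(t):
    """Number of eigenfunctions with lam_l >= t (the size of D_{m,t})."""
    return int(np.sum(lam >= t))

# lam_l = l^{-q} >= t  <=>  l <= t^{-1/q}, so n_kept(t) ~ t^{-1/q}.
for t in [1e-2, 1e-4, 1e-6]:
    assert abs(n_kept(t) - t ** (-1 / q)) <= 1.0

# Even at t = 1e-6 only about a thousand of the 10000 candidates survive.
assert n_kept(1e-6) < 0.2 * len(lam)
```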