A Model Based on Cocitation for Web Information Retrieval

Based on the relationship between authority and cocitation in HITS, we propose a new hyperlink weighting scheme to describe the strength of the relevancy between any two webpages. We then combine hyperlink weight normalization and random surfing schemes, as used in PageRank, to justify the new model. In the new model based on cocitation (MBCC), pages with stronger relevancy are assigned higher values, rather than values that depend only on the outlinks. This model combines features of both HITS and PageRank. Finally, we present the results of some numerical experiments, showing that the MBCC ranking agrees with the HITS ranking, especially in the top 10. Meanwhile, MBCC keeps the advantage of PageRank, that is, the existence and uniqueness of ranking vectors.


Introduction
In the past, search engines ranked pages by word frequency or similar measures. However, the relevancy of the webpages returned by this traditional approach to web information retrieval is often lacking, because webpages are created with varying quality. More recent algorithms greatly improve rankings. One popular idea is to use hyperlinks to determine the value of different webpages. The hyperlink graph contains useful information: if webpage $i$ has a link pointing to webpage $j$, it usually indicates that the creator of $i$ considers $j$ to contain information relevant to $i$. Such opinions and knowledge are registered in the form of an adjacency matrix, denoted by $L$: $L_{ij} = 1$ if there is a link from $i$ to $j$, and $L_{ij} = 0$ otherwise.
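As a small illustration of this encoding, the following sketch builds the adjacency matrix for a hypothetical four-page web. The page indices, the edge list, and the helper name `adjacency_matrix` are assumptions made for the example, not part of the original text.

```python
def adjacency_matrix(n, edges):
    """Return the n x n 0/1 matrix L with L[i][j] = 1 iff page i links to page j."""
    L = [[0] * n for _ in range(n)]
    for i, j in edges:
        L[i][j] = 1
    return L


# Hypothetical web of 4 pages with links 0 -> 1, 0 -> 2, 1 -> 2, 3 -> 2.
edges = [(0, 1), (0, 2), (1, 2), (3, 2)]
L = adjacency_matrix(4, edges)
for row in L:
    print(row)
```

Row $i$ of the matrix lists the outlinks of page $i$, and column $j$ lists the in-links of page $j$.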
The two most popular ranking algorithms based on hyperlink analysis are the PageRank algorithm [1,2] and the HITS (Hypertext Induced Topic Selection) algorithm [3]. Generally, PageRank considers the hyperlink weight normalization and the equilibrium distribution of random surfers as the citation score. For more information about the calculation methods of PageRank, refer to [4–6]. HITS makes the distinction between hubs and authorities and then computes them in a mutually reinforcing way. For each of these two algorithms, the ranking vector is the dominant eigenvector of some matrix describing the network; how this matrix is defined differs in each method. Other works have also recognized that the hyperlink structure can be very valuable for locating information [3,7,8].
This paper is organized as follows. In Section 2, we introduce the PageRank and HITS algorithms and briefly discuss the limitations of HITS. Then in Section 3, we emphasize the role of cocitation (Figure 1) and provide a hyperlink weighting scheme to describe the strength of the relevancy between any two webpages. In order to ensure the existence and uniqueness of solutions in the new model (MBCC), we also combine ideas from PageRank. In Section 4, some experiments are presented. The results show that the MBCC ranking is close to the HITS ranking. Conclusions are given in Section 5.

PageRank and HITS
We treat the web as a directed graph $G = (V, E)$: the nodes in $V$ correspond to the pages, and a directed edge $(i, j) \in E$ indicates the existence of a link from $i$ to $j$. The out-degree of a node $i$, denoted by $d_{out}(i)$, is the number of nodes it links to, and the in-degree of $i$, denoted by $d_{in}(i)$, is the number of nodes that link to it. We also denote the number of pages by $n = |V|$.
In the random-surfer model of PageRank (Section 2.1), if the page $i$ has no outlink, that is, $d_{out}(i) = 0$, then, at time $t + 1$, the surfer chooses any page with probability $1/n$. Thus, we replace $d_{out}(i) = 0$ with $d_{out}(i) = n$. Then the stationary distribution $p$ is determined by the following matrix form:

$$p^{(t+1)} = (L + d e^{T})^{T} D_{out}^{-1}\, p^{(t)}. \tag{3}$$

Here $p = (p_1, \ldots, p_n)^T$, $L$ is the adjacency matrix of the directed web graph, $D_{out} = \mathrm{diag}(d_{out})$, and $e = (1, \ldots, 1)^T$. In the vector $d$, the element $d_i = 1$ if the $i$th row of $L$ corresponds to a dangling node ($d_{out}(i) = 0$), and $d_i = 0$ otherwise.
To iterate the above recursion and obtain a unique stationary probability distribution, it is important to guarantee that (3) converges. Convergence is guaranteed if the directed graph $G$ is strongly connected, which is generally not the case for the web graph. In the context of computing PageRank, the standard way of ensuring this property is to add a new set of complete outgoing transitions, with small transition probabilities (in this work, we set each of them to $1/n$), to all nodes in $V$. The modified transition probability matrix, called the Google matrix, is

$$P_{\alpha} = \alpha\,(L + d e^{T})^{T} D_{out}^{-1} + (1 - \alpha)\,\frac{1}{n}\, e e^{T}, \tag{4}$$

where $\alpha$ is typically between 0.8 and 0.9. Here $e = (1, \ldots, 1)^T$; thus $e e^T$ is a matrix of all 1's. The PageRank algorithm computes the dominant eigenvector of the Google matrix $P_{\alpha}$, which is stochastic and irreducible. PageRank thus models two types of random jumps on the Internet: with probability $1 - \alpha$ the surfer randomly chooses a new page; otherwise, the surfer follows one of the directed edges from the present node.
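The iteration described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `pagerank`, the default `alpha = 0.85` (chosen inside the 0.8 to 0.9 range mentioned above), the uniform starting vector, and the four-page example graph are all choices made for the example. Dangling pages spread their rank uniformly, and a uniform jump is taken with probability $1 - \alpha$.

```python
def pagerank(L, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power iteration on the Google matrix built from adjacency matrix L."""
    n = len(L)
    out_deg = [sum(row) for row in L]
    p = [1.0 / n] * n  # assumed uniform starting vector
    for _ in range(max_iter):
        # Rank currently held by dangling pages (no outlinks) is spread uniformly.
        dangling = sum(p[i] for i in range(n) if out_deg[i] == 0)
        new_p = []
        for j in range(n):
            # Rank arriving at j along links i -> j, split evenly over i's outlinks.
            linked = sum(p[i] / out_deg[i] for i in range(n) if L[i][j])
            new_p.append(alpha * (linked + dangling / n) + (1.0 - alpha) / n)
        residual = sum(abs(u - v) for u, v in zip(new_p, p))  # L1 norm
        p = new_p
        if residual < tol:
            break
    return p


# Hypothetical web: 0 -> 1, 0 -> 2, 1 -> 2, 3 -> 2 (page 2 is dangling).
L = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]
p = pagerank(L)
print([round(v, 4) for v in p])
```

Because every column of the Google matrix sums to 1, the entries of `p` keep summing to 1 throughout the iteration, and every page receives a strictly positive rank.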

Review of HITS.
In the HITS algorithm [3], each webpage $i$ has both a hub score $h_i$ (based on the links going from the page) and an authority score $a_i$ (based on the links going to the page). Let $a = (a_1, \ldots, a_n)^T$ denote the vector of all authority weights, let $h = (h_1, \ldots, h_n)^T$ denote the vector of all hub weights, and let $L$ be the adjacency matrix of the directed web graph. HITS performs two operations at each iteration. Operation I sets the authority vector to $a = L^T h$; it expresses that a good authority is pointed to by many good hubs. Operation O sets the hub vector to $h = L a$; it expresses that a good hub points to many good authorities. This mutually reinforcing relationship can be written in the following matrix representations:

$$a^{(t+1)} = L^{T} L\, a^{(t)}, \qquad h^{(t+1)} = L L^{T} h^{(t)}.$$

The final authority and hub scores are the principal eigenvectors of $L^T L$ and $L L^T$, corresponding to the dominant eigenvalue $\lambda^*$. Since $L^T L$ and $L L^T$ determine the authority ranking and hub ranking, we call $L^T L$ the authority matrix and $L L^T$ the hub matrix.
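The two operations can be sketched as follows. This is an illustrative sketch only: the function names, the all-ones starting vectors, the Euclidean normalization after each step (standard practice for HITS, though not spelled out above), and the small example graph are assumptions made for the example.

```python
import math


def normalize(v):
    """Scale v to unit Euclidean length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def hits(L, tol=1e-10, max_iter=1000):
    """Alternate operation I (a = L^T h) and operation O (h = L a)."""
    n = len(L)
    a = normalize([1.0] * n)
    h = normalize([1.0] * n)
    for _ in range(max_iter):
        # Operation I: authority of j sums the hub scores of pages pointing to j.
        new_a = normalize([sum(L[i][j] * h[i] for i in range(n)) for j in range(n)])
        # Operation O: hub score of i sums the authorities of pages i points to.
        new_h = normalize([sum(L[i][j] * new_a[j] for j in range(n)) for i in range(n)])
        change = sum(abs(u - v) for u, v in zip(new_a, a))
        change += sum(abs(u - v) for u, v in zip(new_h, h))
        a, h = new_a, new_h
        if change < tol:
            break
    return a, h


# Hypothetical web: 0 -> 2, 0 -> 3, 1 -> 3.
L = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
a, h = hits(L)
print("authorities:", [round(v, 4) for v in a])
print("hubs:", [round(v, 4) for v in h])
```

In this toy graph, page 3 is cited by both hubs and therefore ends up with the larger authority score, while pages 0 and 1, which receive no links, get authority zero.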
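In the fields of citation analysis and bibliometrics, it has been shown that the authority matrix has interesting connections to cocitation [3]. Here the cocitation of two pages $i$ and $j$ is defined as the number of webpages that cocite $i$ and $j$ [9]. In the authority matrix,

$$(L^{T} L)_{ij} = \sum_{k=1}^{n} L_{ki} L_{kj}.$$

This implies that

$$(L^{T} L)_{ii} = \sum_{k=1}^{n} L_{ki} L_{ki} = \sum_{k=1}^{n} L_{ki} = d_{in}(i).$$

For $i \neq j$, $(L^T L)_{ij} = \sum_{k=1}^{n} L_{ki} L_{kj}$ is the number of webpages that cocite $i$ and $j$, which is denoted by $C_{ij}$. Therefore, the authority matrix $L^T L$ is the sum of in-degree and cocitation [10,11]:

$$L^{T} L = D_{in} + C, \tag{9}$$

where $D_{in} = \mathrm{diag}(d_{in})$. The self-cocitation $C_{ii}$ in $C$ is not defined and is usually set to 0.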
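The decomposition of the authority matrix into in-degree plus cocitation can be checked numerically on a small graph. In the sketch below, the helper names and the five-page example are assumptions made for illustration; cocitation is counted directly from its definition (pairs of distinct pages cited by the same page) and then compared entry by entry with $L^T L$.

```python
def authority_matrix(L):
    """Return L^T L, with entry (i, j) equal to sum_k L[k][i] * L[k][j]."""
    n = len(L)
    return [[sum(L[k][i] * L[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def cocitation_matrix(L):
    """C[i][j] = number of pages citing both i and j (i != j); C[i][i] = 0."""
    n = len(L)
    C = [[0] * n for _ in range(n)]
    for k in range(n):
        cited = [j for j in range(n) if L[k][j]]
        for i in cited:
            for j in cited:
                if i != j:
                    C[i][j] += 1
    return C


def in_degrees(L):
    n = len(L)
    return [sum(L[k][i] for k in range(n)) for i in range(n)]


# Hypothetical web: 0 -> 2, 0 -> 3, 1 -> 2, 1 -> 3, 1 -> 4.
L = [
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
A = authority_matrix(L)
C = cocitation_matrix(L)
d_in = in_degrees(L)
n = len(L)
identity_holds = all(
    A[i][j] == (d_in[i] if i == j else 0) + C[i][j]
    for i in range(n) for j in range(n)
)
print("L^T L == D_in + C:", identity_holds)
```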

Existence and Uniqueness of Ranking Vectors.
In this section, we discuss the existence and uniqueness of ranking vectors in the above two algorithms. Since the Google matrix in (4) is stochastic and irreducible, the PageRank ranking vector exists, and it is unique and positive; see the equivalent theorem in [12, Theorem 3.8]. For the HITS algorithm, it has been proved that the hub and authority ranking vectors exist but may not be unique. It is shown in [12] that the HITS algorithm is badly behaved on certain networks, meaning that (i) it can return ranking vectors that are not unique but depend on the initial seed vector or (ii) it can return ranking vectors that inappropriately assign zero weights to parts of the network.
There are also other limitations of HITS; see [12,13]. To address these limitations, a modification of HITS is needed, for example, the exponentiated input method in [12]. In the next section, we combine features of both HITS and PageRank. The ranking produced by the new model is expected to be unique and close to the HITS ranking.
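The seed dependence in case (i) can be reproduced on a toy graph. The example below is our own construction, not one taken from [12]: two disconnected links, 0 -> 1 and 2 -> 3, give an authority matrix whose dominant eigenvalue is repeated, so the normalized iteration simply preserves whatever ratio the seed assigns to pages 1 and 3. The function name, the fixed iteration count, and the seeds are assumptions made for illustration.

```python
import math


def hits_authority(L, seed, iterations=50):
    """Iterate a <- L^T L a from the given seed, normalizing at each step."""
    n = len(L)
    a = list(seed)
    for _ in range(iterations):
        h = [sum(L[i][j] * a[j] for j in range(n)) for i in range(n)]  # h = L a
        a = [sum(L[i][j] * h[i] for i in range(n)) for j in range(n)]  # a = L^T h
        norm = math.sqrt(sum(x * x for x in a))
        if norm == 0:
            break
        a = [x / norm for x in a]
    return a


# Toy graph with two disconnected links: 0 -> 1 and 2 -> 3.
L = [
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
a_uniform = hits_authority(L, [1.0, 1.0, 1.0, 1.0])
a_biased = hits_authority(L, [1.0, 3.0, 1.0, 1.0])
print("uniform seed:", [round(v, 4) for v in a_uniform])
print("biased seed: ", [round(v, 4) for v in a_biased])
```

With the uniform seed, pages 1 and 3 tie; with the biased seed, page 1 ranks strictly above page 3, although the graph is unchanged. Pages 0 and 2 receive weight zero in both runs.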

A Model Based on Cocitation (MBCC)
In HITS, according to (9), the authority ranking value $a_i$ can be expressed as

$$\lambda^{*} a_i = d_{in}(i)\, a_i + \sum_{j \neq i} C_{ij}\, a_j, \tag{10}$$

revealing the close relationship between authorities and cocitations. It also implies that, if two distinct webpages $i$, $j$ are cocited by many other webpages $k$, as shown in Figure 1, then $i$, $j$ are likely to be related in some sense. In this paper, we present a property of HITS corresponding to (10).
Property 1 (relationship between authority value and cocitation). The larger the number $C_{ij}$ of webpages that cocite webpages $i$ and $j$, the more authority value the page $i$ can receive from the page $j$, even though there are no links between $i$ and $j$.
The fact that webpages cocite two distinct webpages $i$ and $j$ indicates that $i$ and $j$ have a certain commonality. Therefore, we say that the number of cocitations represents the relevancy among the pages. In the following, we focus on the use of cocitation for analyzing the relevancy among the pages.
Note that, in Section 2.1, the rank of a page in PageRank is divided evenly among its forward links; see (2). That is, a web surfer chooses among the outlinks uniformly at random. However, dividing the rank equally may be unrealistic: a web surfer may have an a priori idea of the value of pages, favoring pages from relevant sites. Since the number of cocitations can represent the relevancy among the pages, we say that the number of cocitations between two pages can influence the behavior of web surfers. Therefore, we define a new hyperlink weighting scheme based on cocitation as follows.
Definition 2 (hyperlink weighting scheme based on cocitation). Let $C_{ij}$ be the number of webpages that cocite two webpages $i$, $j$. In particular, $C_{ii} = d_{in}(i)$, where $d_{in}(i)$ is the in-degree of webpage $i$. Then we define the following function as the value that $j$ receives from $i$:

$$f(i, j) = \frac{C_{ij}}{Q_i}, \tag{11}$$

where

$$Q_i = \sum_{k=1}^{n} C_{ik}.$$

Under this assignment method, the rank value for the page $j$ is determined by

$$x_j = \sum_{i=1}^{n} f(i, j)\, x_i.$$

The matrix form of the above equation is $x = Mx$, where $x = (x_1, \ldots, x_n)^T$ and $M_{ji} = f(i, j)$. The problem is that, if at least one page has zero in-degree, that is, no in-links and $Q_i = 0$, then the matrix $M$ is absorbing and its dominant eigenvector does not exist. To resolve this, similarly to PageRank, we assume that, if the page $i$ has no link that points to it, then at time $t + 1$ the page $i$ divides its value equally among all pages, each with probability $1/n$. Since, by (9), $L^T L$ has the entries $C_{ij}$ off the diagonal and $d_{in}(i)$ on the diagonal, the modified matrix $\bar{M}$ is given by

$$\bar{M} = (L^{T} L + e V^{T})\, D_Q^{-1},$$

where we replace $Q_i = 0$ with $Q_i = n$, $D_Q = \mathrm{diag}(Q)$, and $Q = (Q_1, \ldots, Q_n)^T$. $Q$ can be computed as

$$Q = (L^{T} L + V e^{T})\, e.$$

In the vector $V$, the element $V_i = 1$ if the page $i$ has zero in-degree, and $V_i = 0$ otherwise. Therefore, the modified matrix $\bar{M}$ becomes a stochastic matrix; that is, each column of $\bar{M}$ sums to 1.
To obtain a unique stationary probability distribution, it is important to guarantee that the graph underlying $\bar{M}$ is strongly connected. Similarly to PageRank, we add a new set of complete outgoing transitions. The final transition probability matrix based on using cocitation as a hyperlink weighting scheme is

$$M_{\alpha} = \alpha \bar{M} + (1 - \alpha)\,\frac{1}{n}\, e e^{T}, \tag{15}$$

where $0 < \alpha < 1$ and $e = (1, \ldots, 1)^T$. The model based on cocitation (MBCC) is to solve the following equation:

$$x = M_{\alpha}\, x. \tag{16}$$

We take the solution of (16), denoted by $x^* = (x_1^*, \ldots, x_n^*)$, as the MBCC authority ranking vector, and $y = L x^*$ as the MBCC hub ranking vector. Since the matrix $M_{\alpha}$ in (15) is stochastic and irreducible, just like the Google matrix in PageRank, the solution of (16) exists, and it is unique and positive.
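A minimal sketch of the MBCC iteration follows, under one reading of Definition 2 that should be treated as an assumption: page $i$ passes the fraction $C_{ij} / Q_i$ of its value to page $j$, with $C_{ii} = d_{in}(i)$ and $Q_i = \sum_k C_{ik}$, so that the weights are the entries of $L^T L$ normalized by its row sums. Pages with $Q_i = 0$ spread their value uniformly, a uniform jump is taken with probability $1 - \alpha$, and the hub vector is obtained by applying $L$ to the authority vector. The function name, the uniform start, and the example graph are also illustrative choices.

```python
def mbcc(L, alpha=0.9, tol=1e-10, max_iter=1000):
    """Return (authority, hub) vectors under the cocitation weighting sketch."""
    n = len(L)
    # S = L^T L: S[i][j] = C_ij for i != j and S[i][i] = d_in(i).
    S = [[sum(L[k][i] * L[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    Q = [sum(row) for row in S]  # Q_i = sum_k C_ik (zero iff no in-links)
    x = [1.0 / n] * n
    for _ in range(max_iter):
        # Value held by pages without in-links is spread uniformly.
        orphan = sum(x[i] for i in range(n) if Q[i] == 0)
        new_x = []
        for j in range(n):
            # Value flowing to j from every i, weighted by f(i, j) = C_ij / Q_i.
            flow = sum(S[i][j] * x[i] / Q[i] for i in range(n) if Q[i] > 0)
            new_x.append(alpha * (flow + orphan / n) + (1.0 - alpha) / n)
        residual = sum(abs(u - v) for u, v in zip(new_x, x))  # L1 norm
        x = new_x
        if residual < tol:
            break
    hubs = [sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]
    return x, hubs


# Hypothetical web: 0 -> 2, 0 -> 3, 1 -> 3.
L = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
x, y = mbcc(L)
print("MBCC authorities:", [round(v, 4) for v in x])
print("MBCC hubs:", [round(v, 4) for v in y])
```

On this toy graph the ordering matches what HITS produces (page 3 above page 2 as an authority, page 0 above page 1 as a hub), while every authority value stays strictly positive because of the uniform jump.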

Numerical Experiments
First, we present an example to describe the assignment process in Definition 2.
Example 1. Suppose that there are six webpages $V = (P_1, P_2, P_3, i, j, k)$, and the directed graph is shown in Figure 2. The results are given in Table 1 and Figure 2. In Table 1, $C(i, j)$ is the number of webpages that cocite webpages $i$ and $j$; $f(i, j)$ is obtained by (11). In Figure 2, the left panel shows the original link structure of PageRank, where the value of the page $P_1$ is divided equally among the pages that it points to, and the right panel divides the value of $P_1$ based on cocitation.
Then, we compare the MBCC model with HITS and PageRank, experimenting with a dataset from http://www.cs.toronto.edu/~tsap/experiments/datasets/. The dataset concerns the topic "computational geometry" and contains a total of 1100 webpages. We set $\alpha = 0.9$. We use $\epsilon = 10^{-10}$ as the convergence tolerance and measure the convergence rates of the three algorithms using the $L_1$ norm of the residual vector. Table 2 lists the top 20 authorities under HITS, MBCC, and PageRank. Table 3 lists the top 20 hubs under HITS and MBCC. It shows that MBCC authority ranking is closer to HITS

Conclusion
In this work, we emphasize the role of cocitation in defining authorities. First, we observe that, in the HITS algorithm, if two distinct webpages $i$, $j$ are cocited by many other webpages $k$, then $i$, $j$ are likely to be related in some sense or have a certain commonality. From this close relationship, we conclude that the higher the number of webpages that cocite webpages $i$ and $j$, the stronger the relevancy between the two pages. The page $j$ with stronger relevancy should obtain more value from page $i$. Therefore, we develop a hyperlink weighting scheme for extracting information from the link structure. Then we combine hyperlink weight normalization and random surfing schemes, as used in PageRank, to justify the model. The experimental results show that the MBCC authority (hub) ranking is close to the HITS authority (hub) ranking in the top 20, and in general a surfer seldom browses beyond the top 20 webpages [11]. Moreover, MBCC keeps the advantage of PageRank: the authority vector of MBCC in (16) exists, and it is unique and positive, while the authority and hub vectors of HITS may not be unique. Therefore, we can use the authority (hub) ranking vector of MBCC in place of the authority (hub) ranking vector of HITS.

Mathematical Problems in Engineering

2.1. Review of PageRank.
PageRank [1,2] uses a web surfing model based on a random walk process. Suppose there is a link from page $i$ to page $j$; that is, $(i, j) \in E$. Consider a random surfer visiting page $i$ at time $t$. Then at the next time $t + 1$, the surfer lands at page $j$ with probability $1/d_{out}(i)$. Accordingly, the PageRank algorithm assigns a rank value $p_j$ to the page $j$ as a function of the rank of the pages that point to it:

$$p_j = \sum_{i:\,(i, j) \in E} \frac{p_i}{d_{out}(i)}. \tag{2}$$

Table 1: The data of Example 1.

Table 2: HITS authority ranking, MBCC authority ranking, and PageRank ranking in the top 20.

Table 3: HITS hub ranking and MBCC hub ranking in the top 20.

Table 4: Comparison between MBCC and HITS ranking vectors; for example, "top 10" represents a ranking vector agreeing with another ranking vector in the top 10.

The comparison between MBCC and HITS ranking vectors in Table 4 indicates that the MBCC ranking agrees well with the HITS ranking, especially in the top 10.
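One simple way to quantify agreement in the top $k$ is to count the pages that two rankings share among their first $k$ entries. The sketch below illustrates that idea only; it is an assumption about how such a comparison could be made, not necessarily the measure behind Table 4, and the score vectors are made-up values rather than experimental results.

```python
def top_k(scores, k):
    """Indices of the k highest-scoring pages (ties broken by lower index)."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]


def top_k_overlap(scores_a, scores_b, k):
    """Number of pages that appear in the top k of both score vectors."""
    return len(set(top_k(scores_a, k)) & set(top_k(scores_b, k)))


# Made-up score vectors for five pages (not experimental data).
scores_a = [0.40, 0.30, 0.20, 0.10, 0.00]
scores_b = [0.35, 0.10, 0.30, 0.20, 0.05]
overlap = top_k_overlap(scores_a, scores_b, 3)
print("pages shared in the top 3:", overlap)
```

A set-based count ignores the order within the top $k$; a position-by-position comparison would be a stricter alternative.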