An Improved Approach to the PageRank Problems

We introduce a partition of the web pages particularly suited to PageRank problems in which the web link graph has a nested block structure. Based on a partition of the web pages into dangling nodes, common nodes, and general nodes, the hyperlink matrix can be reordered into a simpler block structure. Then, using parallel computation, we propose an algorithm for the PageRank problem. In this algorithm, the dimension of the linear system becomes smaller, and the vector for the general nodes in each block can be calculated separately in every iteration. Numerical experiments show that this approach speeds up the computation of PageRank.


Introduction
The rapid growth of the World Wide Web has created a need for search tools. One of the best-known algorithms in web search is Google's PageRank algorithm [1]. It is based on a random surfer model [1], in which the PageRank vector can be viewed as the stationary distribution of a Markov chain. At about the same time as the random surfer model, a different but closely related approach, the HITS algorithm, was introduced in [2]. Another model, SALSA [3], incorporated ideas from both HITS and PageRank to create another ranking of web pages.
In this paper, we focus on Google's PageRank algorithm. Let us introduce some notation. We can model the web as a directed graph with the web pages as the nodes and the hyperlinks as the directed edges. In the graph, if there is a link from page $P_i$ to page $P_j$, then page $P_i$ has an outlink to page $P_j$, and page $P_j$ has an inlink from page $P_i$. Then we can define the elements of a hyperlink matrix $H$ as follows.
If the web page $P_i$ has $l_i \ge 1$ outlinks, then, for each link from page $P_i$ to another page $P_j$, the element $h_{i,j}$ of the matrix $H$ is $1/l_i$. If there is no link from page $P_i$ to page $P_j$, then the element $h_{i,j}$ of $H$ is 0. The scalar $l_i$ is the number of outlinks from the page $P_i$. Thus, each nonzero row of $H$ sums to 1. If the page $P_i$ has no outlinks at all (such as a pdf, image, or audio file), it is called a dangling node, and all elements in the $i$th row of $H$ are set to 0.
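The construction of the hyperlink matrix described above can be sketched in a few lines. This is only an illustration on a toy web of four pages: the function names `hyperlink_matrix` and `dangling_nodes`, the dense list-of-rows storage, and the edge list are assumptions made for the example, not part of the original method.

```python
def hyperlink_matrix(n, links):
    """Return the n x n hyperlink matrix H as a list of rows.

    links is an iterable of (i, j) pairs meaning "page i links to page j".
    """
    out = [set() for _ in range(n)]
    for i, j in links:
        out[i].add(j)                      # duplicate links are counted once
    H = [[0.0] * n for _ in range(n)]
    for i, targets in enumerate(out):
        for j in targets:
            H[i][j] = 1.0 / len(targets)   # h_ij = 1 / l_i
    return H


def dangling_nodes(H):
    """Indices of the zero rows of H, i.e. pages without outlinks."""
    return [i for i, row in enumerate(H) if not any(row)]


# Toy web with 4 pages (hypothetical data); page 3 has no outlinks.
H = hyperlink_matrix(4, [(0, 1), (0, 2), (1, 2), (2, 0), (2, 3)])
```

Every nonzero row of `H` sums to 1, and the row of page 3 stays zero, which is exactly the situation that makes the modification discussed next necessary.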
The problem is that if at least one node has zero outdegree, that is, no outlinks, then the Markov chain is absorbing, so a modification to $H$ is needed. In order to resolve this, the founders of Google, Brin and Page, suggest replacing each zero row (corresponding to a dangling node) of the sparse hyperlink matrix with a dense nonnegative vector $v^T$ ($v^T e = 1$; $e$ is the column vector of all ones, and $v^T$ could also be a personalized vector, see [4,5]), creating a new stochastic matrix denoted by $S$, $S = H + a v^T$. In the vector $a$, the element $a_i = 1$ if the $i$th row of $H$ corresponds to a dangling node, and 0 otherwise. Another problem is that nothing in our definition so far guarantees the convergence of the PageRank algorithm or the uniqueness of the PageRank vector with the matrix $S$. In general, if the matrix $S$ is irreducible, this problem can be settled. Thus, Brin and Page added another dense perturbation matrix $e v^T$ that creates direct connections between each pair of pages and forces the matrix to be irreducible. The resulting stochastic, irreducible matrix is called the Google matrix $G$ and is given by
$$G = \alpha S + (1 - \alpha) e v^T, \tag{1}$$
where $0 < \alpha < 1$ (a typical value for $\alpha$ is between 0.85 and 0.95; it is shown in [6] that $\alpha$ controls the convergence rate of the PageRank algorithm). Mathematically, the PageRank vector $\pi$ is the stationary distribution of the Google matrix $G$.

Many methods are available for computing the PageRank vector $\pi$, such as the famous Power Method [1,7,8]. Due to the sheer size of the web (over 3 billion pages), this computation can take several days. In [9], Arasu et al. used values from the current iteration as they become available, rather than using only values from the previous iteration. They also suggested that exploiting the "bow-tie" structure of the web [10] would be useful in computing PageRank. In [11], Kamvar et al. presented a variety of extrapolation methods. In [12], Avrachenkov et al. showed that Monte Carlo methods already provide a good estimate of the PageRank of relatively important pages after one iteration. Gleich et al. in [13] presented an inner-outer iterative algorithm for accelerating PageRank computations. Focusing instead on the presence of dangling nodes, Lee et al. [14] partitioned the web into dangling and nondangling nodes and applied an aggregation method to this partition.
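As a point of reference for the methods surveyed above, the following is a minimal sketch of the power method applied to the Google matrix $G = \alpha S + (1 - \alpha) e v^T$ with $S = H + a v^T$. It never forms $G$ explicitly. The uniform personalization vector, the function name `pagerank_power`, and the toy matrix are assumptions made for this example.

```python
def pagerank_power(H, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power method for pi^T = pi^T G, using only the sparse part H."""
    n = len(H)
    v = [1.0 / n] * n                                # uniform v (assumption)
    a = [0.0 if any(row) else 1.0 for row in H]      # dangling indicator vector
    pi = v[:]
    for it in range(1, max_iter + 1):
        # pi^T G = alpha * pi^T H + (alpha * pi^T a + 1 - alpha) * v^T,
        # which holds because pi^T e = 1.
        dangling_mass = sum(p * ai for p, ai in zip(pi, a))
        new = [alpha * sum(pi[i] * H[i][j] for i in range(n))
               + (alpha * dangling_mass + 1.0 - alpha) * v[j]
               for j in range(n)]
        diff = sum(abs(x - y) for x, y in zip(new, pi))
        pi = new
        if diff < tol:
            break
    return pi, it


# Hypothetical 4-page web; page 3 is a dangling node.
H = [[0.0, 0.5, 0.5, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.5, 0.0, 0.0, 0.5],
     [0.0, 0.0, 0.0, 0.0]]
pi, iterations = pagerank_power(H)
```

The rank-one terms $a v^T$ and $e v^T$ are applied through two scalars per iteration, so the cost of one step is dominated by the product with the sparse matrix $H$.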
Recently, the structure of the web link graph has attracted attention. Kamvar et al. in [4] exploited the block structure of the web for computing PageRank. They also exploited the fact that pages with lower PageRank tend to converge faster and proposed adaptive methods in [15]. Based on the characteristics of the web link graph, research on the parallelization of PageRank can be found in [16–21]. In [21], Manaskasemsak and Rungsawang discussed a parallelization of the power method. In [17], Gleich et al. introduced a method to compare the various linear system formulations in terms of parallel runtime performance. Cevahir et al. in [16] proposed site-based partitioning and repartitioning techniques for parallel PageRank computation. Some special models for parallel PageRank were proposed in [18–20].
In this paper, we combine ideas based on the existence of dangling nodes and on the block structure of the web, and we exploit a new structure for the hyperlink matrix $H$. Parallel computation is then applied to speed up the computation of PageRank by using a partition of the nodes. First, we show that our target is to compute the PageRank of the nondangling nodes in the linear system for the Google problem [22] (Section 2). Second, according to the partition of the web pages, we obtain a special structure of the hyperlink matrix, and we propose an algorithm (Section 3). Finally, we analyze our algorithm and give some numerical results (Sections 4 and 5).

The Problem
Generally, the Google problem is to solve for the eigenvector $\pi$ of the matrix $G$ in the following equation:
$$\pi^T = \pi^T G. \tag{2}$$
Here, we introduce some theorems to show that the Google problem can be turned into a linear system problem in which only the unnormalized PageRank subvector of the nondangling nodes needs to be computed. In the following, the matrix $I$ denotes the identity matrix.
Theorem 1 (see [22, linear system for Google problem]). Suppose that the matrix $H$ is a hyperlink matrix. Solving the linear system
$$x^T (I - \alpha H) = v^T \tag{3}$$
and letting $\pi^T = x^T / \|x^T\|_1$ produces the PageRank vector.
Since the coefficient matrix $(I - \alpha H)$ in (3) is an $M$-matrix (Theorem 8.(4.2) in [23]) as well as nonsingular and irreducible, the solution of the linear system in Theorem 1 exists and is unique.
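A short derivation, given here only as a reading aid, indicates why the linear system of Theorem 1 reproduces the eigenvector problem. It uses the definitions $S = H + a v^T$ and $G = \alpha S + (1 - \alpha) e v^T$ together with the normalization $\pi^T e = 1$; the scalar $\gamma$ is notation introduced for this sketch.

```latex
\begin{align*}
\pi^T &= \pi^T G
       = \alpha \pi^T H + \bigl(\alpha\, \pi^T a + 1 - \alpha\bigr) v^T, \\
\pi^T (I - \alpha H) &= \gamma\, v^T,
       \qquad \gamma = \alpha\, \pi^T a + 1 - \alpha > 0, \\
\pi^T &= \gamma\, v^T (I - \alpha H)^{-1} = \gamma\, x^T .
\end{align*}
```

Hence $\pi^T$ is a positive multiple of $x^T$, and dividing $x^T$ by its 1-norm fixes the multiple so that the entries sum to 1.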
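The rows of the matrix $H$ corresponding to the dangling nodes are zero. It is natural as well as efficient to exclude the dangling nodes from the PageRank computation. This can be done by partitioning the web nodes into nondangling nodes and dangling nodes, which is similar to the method of "lumping" all the dangling nodes into a single node [24]. Suppose that the rows and columns of $H$ are permuted according to this partition, so that the rows corresponding to the dangling nodes are at the bottom of the matrix:
$$H = \begin{pmatrix} \hat{H}_{11} & \hat{H}_{12} \\ 0 & 0 \end{pmatrix}, \tag{4}$$
where the first block row and column correspond to the set $ND$ of nondangling nodes and the second to the set $D$ of dangling nodes.
Then, the coefficient matrix $(I - \alpha H)$ in (3) becomes
$$I - \alpha H = \begin{pmatrix} I - \alpha \hat{H}_{11} & -\alpha \hat{H}_{12} \\ 0 & I \end{pmatrix}, \tag{5}$$
and the inverse of this matrix is
$$(I - \alpha H)^{-1} = \begin{pmatrix} (I - \alpha \hat{H}_{11})^{-1} & \alpha (I - \alpha \hat{H}_{11})^{-1} \hat{H}_{12} \\ 0 & I \end{pmatrix}. \tag{6}$$
Therefore, the unnormalized PageRank vector $x^T = v^T (I - \alpha H)^{-1}$ from (3) can be written as
$$x^T = \bigl( v_1^T (I - \alpha \hat{H}_{11})^{-1} \;\big|\; \alpha v_1^T (I - \alpha \hat{H}_{11})^{-1} \hat{H}_{12} + v_2^T \bigr), \tag{7}$$
where $v^T = (v_1^T, v_2^T)$ is partitioned into nondangling and dangling sections. Based on this, Langville and Meyer [22] proposed two reordered PageRank algorithms for computing the PageRank vector.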
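One is called reordered PageRank Algorithm 1, and the other is called reordered PageRank Algorithm 2. Unfortunately, reordered PageRank Algorithm 2 is not necessarily an improvement over Algorithm 1 in some cases.

In reordered PageRank Algorithm 1, the only system that must be solved is $x_1^T (I - \alpha \hat{H}_{11}) = v_1^T$. Reordered PageRank Algorithm 2 is based on a process of locating zero rows, which can be repeated recursively on smaller and smaller submatrices of $\hat{H}_{11}$, continuing until a submatrix is created that has no zero rows. For interested readers, the details of the reordered PageRank algorithms can be found in [22]. However, the structure of the web that they exploit in reordered PageRank Algorithm 2 is not practical, as reordering the web matrix according to this structure requires depth-first search, which is prohibitively costly on the web.

Put another way, even though some hyperlink matrices $H$ are suited to reordered PageRank Algorithm 2, the required structure may not exist for other hyperlink matrices. In this worst case, reordered PageRank Algorithm 2 has no advantage over Algorithm 1, and the same conclusion can be drawn from the experiments in [22]. Thus, we come back to (4) and reorder the structure of the matrix $\hat{H}_{11}$ to speed up the computation of the PageRank vector. The problem becomes solving
$$x_1^T (I - \alpha \hat{H}_{11}) = v_1^T, \tag{8}$$
where the coefficient matrix $(I - \alpha \hat{H}_{11})$ is a nontrivial leading principal submatrix of $(I - \alpha H)$ and is nonsingular (Theorem 6.(4.16) of [23]).

Algorithm 1: Reordered PageRank algorithm [22].
(1) Partition the web nodes into dangling and nondangling nodes, so that the hyperlink matrix $H$ has the structure of (4).
(2) Solve for $x_1^T$ in $x_1^T (I - \alpha \hat{H}_{11}) = v_1^T$.
(3) Compute $x_2^T = \alpha x_1^T \hat{H}_{12} + v_2^T$.
(4) Normalize $\pi^T = (x_1^T, x_2^T) / \|(x_1^T, x_2^T)\|_1$.

Algorithm 2: An algorithm based on a separation of the common nodes.
(1) Partition the web nodes, which form $N$ blocks $W = (B_1, B_2, \ldots, B_N)$, into $N + 2$ blocks $W = (B_1^2, B_2^2, \ldots, B_N^2, C, D)$, so that the hyperlink matrix $H$ has the structure of (12).
(2) Partition the given vector and the PageRank vector according to the size of the $N + 2$ blocks: $v^T = (u_1^T, \ldots, u_N^T, w^T, v_2^T)$ and $x^T = (y_1^T, \ldots, y_N^T, z^T, x_2^T)$.
(3) Compute the limiting vector of the general part $(y_1^T, \ldots, y_N^T)$ of $x_1^T$ by iterating, independently for $i = 1, \ldots, N$,
$$(y_i^{(k)})^T = \alpha (y_i^{(k-1)})^T H_{i,i} + (f_i^{(k)})^T, \qquad (f_i^{(k)})^T = u_i^T + \alpha \Bigl( w^T + \alpha \sum_{j=1}^{N} (y_j^{(k-1)})^T H_{j,C} \Bigr) (I - \alpha H_{C,C})^{-1} H_{C,i},$$
where $H_{i,i}$, $H_{i,C}$, $H_{C,i}$, and $H_{C,C}$ are the blocks of $\hat{H}_{11}$ in (13).
(4) Compute the vectors for the common nodes and the dangling nodes directly: $z^T = \bigl( w^T + \alpha \sum_{j} y_j^T H_{j,C} \bigr) (I - \alpha H_{C,C})^{-1}$ and $x_2^T = \alpha x_1^T \hat{H}_{12} + v_2^T$ with $x_1^T = (y_1^T, \ldots, y_N^T, z^T)$.
(5) Normalize $\pi^T = x^T / \|x^T\|_1$.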
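The following sketch illustrates the idea behind reordered PageRank Algorithm 1 on a toy example: only the nondangling subvector $x_1$ is iterated, and the dangling subvector $x_2$ is obtained afterwards in one step. It is not the implementation of [22]. The simple fixed-point iteration used to solve (8), the uniform vector $v$, and the name `reordered_pagerank` are assumptions made for illustration.

```python
def reordered_pagerank(H, alpha=0.85, tol=1e-12, max_iter=10000):
    """PageRank via the nondangling system x1^T (I - alpha*H11) = v1^T."""
    n = len(H)
    nd = [i for i in range(n) if any(H[i])]        # nondangling nodes
    d = [i for i in range(n) if not any(H[i])]     # dangling nodes
    v = 1.0 / n                                    # uniform v (assumption)

    def times_block(x1, cols):
        """Entries of x1^T H[nd, cols]."""
        return [sum(x * H[i][j] for x, i in zip(x1, nd)) for j in cols]

    # Fixed-point iteration x1 <- alpha * x1^T H11 + v1; it converges
    # because alpha < 1 and the rows of H11 sum to at most 1.
    x1 = [v] * len(nd)
    for _ in range(max_iter):
        new = [v + alpha * s for s in times_block(x1, nd)]
        done = sum(abs(p - q) for p, q in zip(new, x1)) < tol
        x1 = new
        if done:
            break
    x2 = [v + alpha * s for s in times_block(x1, d)]   # x2 = alpha*x1^T H12 + v2
    total = sum(x1) + sum(x2)
    pi = [0.0] * n
    for x, i in zip(x1 + x2, nd + d):
        pi[i] = x / total                              # normalize in the 1-norm
    return pi


# Hypothetical 4-page web; page 3 is a dangling node.
H = [[0.0, 0.5, 0.5, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.5, 0.0, 0.0, 0.5],
     [0.0, 0.0, 0.0, 0.0]]
pi = reordered_pagerank(H)
```

The iterated system has one unknown per nondangling node, so the saving grows with the fraction of dangling nodes in the graph.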

PageRank Algorithms Based on a Separation of the Common Nodes
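Then, we separate the dangling nodes from each of the blocks. Thus, we get the new blocks $B_i^1$, $i \in \{1, \ldots, N\}$, which are the original blocks $B_i$ with the dangling nodes removed. The set of nodes $W = (B_1, B_2, \ldots, B_N)$ becomes $W = (ND, D)$, where $ND = (B_1^1, B_2^1, \ldots, B_N^1)$ and $D$ is the set of the dangling nodes. The rows and columns of $H$ can be permuted so that the rows corresponding to the dangling nodes are at the bottom of the matrix, just like (4) in Section 2:
$$H = \begin{pmatrix} H_{1,1} & H_{1,2} & \cdots & H_{1,N} & H_{1,N+1} \\ H_{2,1} & H_{2,2} & \cdots & H_{2,N} & H_{2,N+1} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ H_{N,1} & H_{N,2} & \cdots & H_{N,N} & H_{N,N+1} \\ 0 & 0 & \cdots & 0 & 0 \end{pmatrix}. \tag{10}$$
In the above equation, the submatrix $\hat{H}_{11}$ is
$$\hat{H}_{11} = \begin{pmatrix} H_{1,1} & H_{1,2} & \cdots & H_{1,N} \\ H_{2,1} & H_{2,2} & \cdots & H_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ H_{N,1} & H_{N,2} & \cdots & H_{N,N} \end{pmatrix}. \tag{11}$$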

A Separation of the Common Nodes.
To investigate the web structure in detail, we refer to the experiments in [4]. They used the LARGEWEB link graph [25] and considered the version of LARGEWEB with dangling nodes removed, which contains roughly 70 M nodes, with over 600 M edges, and requires 3.6 GB of storage. They partitioned the links in the graph into "intrahost" links, which are links from a page to another page in the same host, and "interhost" links, which are links from a page to a page in a different host. Counting the two kinds of links separately, Table 2 in [4] shows that 93.6% of the links in the datasets are intrahost links and 6.4% are interhost links, which means that the large majority of links are intrahost links and only a minority are interhost links. They also found the same result by partitioning the links according to different domains. This result leads to a deeper study of the structure of the hyperlink matrix $H$. That is, if the pages are grouped by domain, host, or another criterion, the graph of the pages exhibits a block structure. In each block, the minority of nondangling nodes that are linked with nodes of other blocks are called common nodes. If a node in a web link graph is neither a dangling node nor a common node, then we call it a general node. The nodes in a web link graph are thus divided into three classes: dangling nodes, common nodes, and general nodes. In particular, the common nodes and the general nodes together form the nondangling nodes.
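There is no dangling node in the blocks $B_1^1, B_2^1, \ldots, B_N^1$, so we consider separating all the common nodes from the blocks $B_1^1, B_2^1, \ldots, B_N^1$, which yields new blocks $B_1^2, B_2^2, \ldots, B_N^2$ that contain only general nodes. In Figure 1, a simple example is shown to illustrate the change after a separation of the common nodes. In Figure 1(a), there are four blocks $B_1$, $B_2$, $B_3$, and $B_4$ in a web link graph, and each of them has links to the others. However, in Figure 1(b), after separating the common nodes from the four blocks and lumping the common nodes into a block denoted by $C$, there are no links among the four new blocks. The links exist only between $C$ and the four new blocks. Once the above is done, the hyperlink matrix $H$ corresponding to the partition of the web nodes, $W = (B_1^2, B_2^2, \ldots, B_N^2, C, D)$, has the following structure, where the blocks $H_{i,j}$ are now taken with respect to the new partition:
$$H = \begin{pmatrix} H_{1,1} & & & H_{1,C} & H_{1,D} \\ & \ddots & & \vdots & \vdots \\ & & H_{N,N} & H_{N,C} & H_{N,D} \\ H_{C,1} & \cdots & H_{C,N} & H_{C,C} & H_{C,D} \\ 0 & \cdots & 0 & 0 & 0 \end{pmatrix}. \tag{12}$$
Then the submatrix $\hat{H}_{11}$, corresponding to the hyperlinks among the nondangling nodes, turns out to be
$$\hat{H}_{11} = \begin{pmatrix} H_{1,1} & & & H_{1,C} \\ & \ddots & & \vdots \\ & & H_{N,N} & H_{N,C} \\ H_{C,1} & \cdots & H_{C,N} & H_{C,C} \end{pmatrix}. \tag{13}$$
After the separation of the common nodes, the structure of the above matrix $\hat{H}_{11}$ is much simpler than the former one in (11).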

A PageRank Algorithm.
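Notice that the matrix in (13) has nonzero submatrices only on the diagonal, in the last block row, and in the last block column. This special structure can reduce the computation in every iteration. Let
$$M = \operatorname{diag}(H_{1,1}, \ldots, H_{N,N}), \quad U = \begin{pmatrix} H_{1,C} \\ \vdots \\ H_{N,C} \end{pmatrix}, \quad L = (H_{C,1}, \ldots, H_{C,N}), \quad K = H_{C,C}, \tag{14}$$
where $H_{i,i}$ contains the links among the general nodes of block $i$, $H_{i,C}$ and $H_{C,i}$ contain the links between these nodes and the common nodes, and $H_{C,C}$ contains the links among the common nodes. Then
$$\hat{H}_{11} = \begin{pmatrix} M & U \\ L & K \end{pmatrix}. \tag{15}$$
The coefficient matrix $(I - \alpha \hat{H}_{11})$ has the following structure:
$$I - \alpha \hat{H}_{11} = \begin{pmatrix} I - \alpha M & -\alpha U \\ -\alpha L & I - \alpha K \end{pmatrix}. \tag{16}$$
Therefore, after Gaussian elimination, $x_1^T (I - \alpha \hat{H}_{11}) = v_1^T$ can be written as
$$y^T \bigl( I - \alpha M - \alpha^2 U (I - \alpha K)^{-1} L \bigr) = u^T + \alpha w^T (I - \alpha K)^{-1} L, \qquad z^T = (w^T + \alpha y^T U)(I - \alpha K)^{-1}, \tag{17}$$
where $x_1^T = (y^T, z^T)$ and $v_1^T = (u^T, w^T)$ are divided into general and common sections. The only system that must be solved is the system for $y^T$ in (17).
Notice that the matrix $M$ is a block diagonal matrix. Therefore, the subvectors $y_i^T$ of $y^T = (y_1^T, y_2^T, \ldots, y_N^T)$, which are partitioned according to the number and size of the blocks, can be calculated independently in each iteration. For example, in the $k$th iteration, calculate
$$(f^{(k)})^T = \bigl( w^T + \alpha (y^{(k-1)})^T U \bigr) (I - \alpha K)^{-1} \alpha L + u^T \tag{18}$$
and divide it into $(f_1^{(k)}, f_2^{(k)}, \ldots, f_N^{(k)})$ according to the number and size of the blocks; then, for the vectors $y_i$, we have
$$(y_i^{(k)})^T (I - \alpha H_{i,i}) = (f_i^{(k)})^T \tag{19}$$
or
$$(y_i^{(k)})^T = \alpha (y_i^{(k-1)})^T H_{i,i} + (f_i^{(k)})^T. \tag{20}$$
As a result, the PageRank system in (8) can be reduced to the smaller linear system in (17), in which the subvectors can be calculated independently in each iteration by (20). In summary, we now have an algorithm based on the separation of the common nodes. This algorithm is an extension of the dangling node method in Section 2.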
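The block iteration described above can be sketched on a toy example with two blocks of general nodes and one common node. This is a serial illustration under stated assumptions, not the authors' implementation: the names `algorithm2_sketch`, `H_diag`, `H_gc`, `H_cg`, and `H_cc`, the dense Gaussian elimination, and the data are all hypothetical, and the loop over blocks is where a parallel implementation would distribute the work.

```python
def vecmat(x, A):
    """Row vector times matrix: returns x^T A as a list."""
    return [sum(x[i] * A[i][j] for i in range(len(x))) for j in range(len(A[0]))]


def solve_left(A, b):
    """Solve z^T A = b^T by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Row j of the augmented matrix is column j of A followed by b[j].
    M = [[A[i][j] for i in range(n)] + [b[j]] for j in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def algorithm2_sketch(H_diag, H_gc, H_cg, H_cc, u, w,
                      alpha=0.85, tol=1e-10, max_iter=10000):
    """Iterate the general-node subvectors block by block.

    H_diag[i]: links inside general block i; H_gc[i]: general block i -> common;
    H_cg[i]: common -> general block i; H_cc: common -> common.
    """
    c = len(w)
    A = [[(1.0 if r == s else 0.0) - alpha * H_cc[r][s] for s in range(c)]
         for r in range(c)]                      # I - alpha * H_cc

    def common_part(y):
        """z^T = (w^T + alpha * sum_i y_i^T H_gc[i]) (I - alpha*H_cc)^{-1}."""
        t = list(w)
        for yi, Ui in zip(y, H_gc):
            t = [tj + alpha * s for tj, s in zip(t, vecmat(yi, Ui))]
        return solve_left(A, t)

    y = [list(ui) for ui in u]
    for k in range(1, max_iter + 1):
        z = common_part(y)
        new_y = []
        for yi, Hii, Li, ui in zip(y, H_diag, H_cg, u):   # independent per block
            fi = [a + alpha * b for a, b in zip(ui, vecmat(z, Li))]
            new_y.append([alpha * s + f for s, f in zip(vecmat(yi, Hii), fi)])
        diff = sum(abs(p - q) for yn, yo in zip(new_y, y) for p, q in zip(yn, yo))
        y = new_y
        if diff < tol:
            break
    return y, common_part(y), k


# Hypothetical data: two general blocks of two nodes and one common node.
H_diag = [[[0.0, 0.5], [1.0, 0.0]], [[0.0, 1.0], [0.5, 0.0]]]
H_gc = [[[0.5], [0.0]], [[0.0], [0.5]]]
H_cg = [[[0.25, 0.25]], [[0.25, 0.0]]]
H_cc = [[0.0]]
u = [[1 / 6, 1 / 6], [1 / 6, 1 / 6]]
w = [1 / 6]
y, z, iterations = algorithm2_sketch(H_diag, H_gc, H_cg, H_cc, u, w)
```

In this sketch the small system with $I - \alpha H_{C,C}$ is solved anew in every iteration for simplicity; factoring it once and reusing the factors would be the natural refinement when the number of common nodes is not tiny.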

Analysis of Algorithm 2
As we know, some web link graphs appear to have a nested block structure. According to the definition of a common node, it is then not difficult to find the common nodes among the different blocks. This can be done by locating the nonzero entries of the off-diagonal submatrices $H_{i,j}$ in (10) ($i \ne j$, $1 \le i \le N$, $1 \le j \le N$). For example, if the $(i_1, i_2)$th entry of $H_{i,j}$ is nonzero, then the $i_1$th node of block $i$ and the $i_2$th node of block $j$ are common nodes. This process can be carried out on different submatrices $H_{i,j}$ at the same time by using separate computers. At the end, the common nodes found by the different computers are gathered and duplicates are removed, which gives the final set of common nodes. Since the dimension of each $H_{i,j}$ is much smaller than that of $H$ and the search can be done in parallel, step 1 of Algorithm 2 does not take much time.
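The search for common nodes described above can be sketched as follows. The sketch assumes that the dangling nodes have already been removed and that a block index is known for every node; the function name `separate_common_nodes`, the edge-list input, and the toy data are assumptions made for illustration. A link between two different blocks corresponds to a nonzero entry of an off-diagonal submatrix, and both of its endpoints are marked as common.

```python
def separate_common_nodes(links, block_of):
    """Split nondangling nodes into general nodes per block and common nodes.

    links: iterable of (i, j) pairs; block_of[i]: block index of node i.
    Returns (general, common, order), where order lists the general nodes
    block by block followed by the common nodes.
    """
    common = set()
    for i, j in links:
        if block_of[i] != block_of[j]:      # nonzero entry of an off-diagonal block
            common.update((i, j))
    n_blocks = max(block_of) + 1
    general = [[node for node in range(len(block_of))
                if block_of[node] == b and node not in common]
               for b in range(n_blocks)]
    order = [node for blk in general for node in blk] + sorted(common)
    return general, sorted(common), order


# Hypothetical graph: two blocks of three nodes with two cross-block links.
block_of = [0, 0, 0, 1, 1, 1]
links = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3), (4, 1)]
general, common, order = separate_common_nodes(links, block_of)
```

Permuting the rows and columns of the hyperlink matrix by `order` places the general nodes of each block first and the common nodes last, which is the layout assumed by the block structure of $\hat{H}_{11}$ after the separation.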
Note that there are no links among the new blocks $B_1^2, B_2^2, \ldots, B_N^2$ after the separation of the common nodes, which corresponds to the zero submatrices in the matrix $H$ in (12). In effect, step 3 of Algorithm 2 reduces the time needed for large matrices by turning the large matrix $\hat{H}_{11}$ into many smaller submatrices $H_{i,i}$. The subvectors $y_i^{(k)}$ of the general nodes, $i = 1, \ldots, N$, can be computed separately from $(y_i^{(k)})^T (I - \alpha H_{i,i}) = (f_i^{(k)})^T$, where $f_i^{(k)}$ is the corresponding section of the right-hand side, and the results are used together to yield a new vector for the next iteration. The parallel computation in this step can save much time.
Since the subvectors $y_i^{(k)}$ are not required to be accurate in each iteration, we can compute $y_i^{(k)}$ by $(y_i^{(k)})^T = (y_i^{(k-1)})^T (\alpha H_{i,i}) + (f_i^{(k)})^T$. Moreover, the small systems can be solved by any appropriate direct or iterative method. In addition, it was found in [22] that acceleration methods [9,11,15,26], such as extrapolation and preconditioners, can be applied to the small $H_{i,i}$ systems to achieve even greater speedups.

Numerical Experiments
5.1. Experiment Foundation. In this section, we give an example to illustrate our algorithm.
Example. We consider three experiments based on three web link graphs: graph 1, graph 2, and graph 3. We assume that each of the graphs contains 200 nodes and four blocks; moreover, the blocks have the same size in each graph. Based on our classification of web pages, there are three classes of pages in a web: dangling nodes, common nodes, and general nodes. In order to make comparisons among the experiments, we suppose that the number of dangling nodes is the same in the three graphs. Then we set different proportions of general nodes to common nodes in the three graphs. We consider three proportions: 3 : 7 in graph 1, 5 : 5 in graph 2, and 7 : 3 in graph 3, so that from graph 1 to graph 3 the number of common nodes decreases and the number of general nodes increases. We also assume that, in each graph, the proportion of general nodes to common nodes in each subblock is similar to the proportion in the whole web link graph. In all three web link graphs, the common nodes and the links within and between the subblocks are chosen at random.
In the dot plot of each of these three web link graphs, if there exists a link from node $i$ to node $j$, then the point $(i, j)$ is colored; otherwise, the point $(i, j)$ is white. We ensure that these three web link graphs satisfy three characteristics reported in [4].
(1) There is a definite block structure to the web.
(2) The individual blocks are much smaller than the entire web.
For example, Figure 2 shows graph 3, which contains 200 pages and has a nested block structure of four blocks. The proportion of general nodes to common nodes is 7 : 3 in the whole graph.
Then, in each experiment, we separate the nodes into dangling nodes, common nodes, and the rest (general nodes). The result of this process is a decomposition of the matrix $H$. Figure 3 shows the change in the structure of $\hat{H}_{11}$ in (4) after this process, based on the dataset of Figure 2. Figure 3(a) is the web link graph of $\hat{H}_{11}$ before reordering, and Figure 3(b) is the web link graph of $\hat{H}_{11}$ after reordering. This process amounts to a simple reordering of the indices of the Markov chain. The figure shows that the new structure is simpler than the original one.

Experimental Results and Analysis.
Based on the three experimental datasets, we compare Algorithm 2 with two other algorithms: original PageRank and reordered PageRank. We set the scaling factor $\alpha = 0.85$ and the convergence tolerance to $10^{-10}$. The experimental results are shown in Figure 4 and Table 1. Figures 4(a), 4(b), and 4(c) compare the convergence of the three algorithms in the three separate experiments. They show that Algorithm 2 computes the PageRank vector reliably and converges quickly in comparison with reordered PageRank. That is because the dimension of the linear system for Algorithm 2 is smaller than the dimension of the linear system for reordered PageRank. The results in Table 1 imply that Algorithm 2 needs more iterations than the Power Method. However, because the block computations in Algorithm 2 can be carried out in parallel, Algorithm 2 can substantially reduce the computation time of PageRank. In future work, we will experiment on real data.

Conclusion
It has been observed in [4] that the hyperlink graphs of some web pages have a nested block structure. We exploit a reordered block structure and present an algorithm to compute PageRank quickly. Algorithm 2 has basically two stages. In Stage 1, the focus is on the partition of the nodes in a web. In Stage 2, the vector of general nodes in each block for the next iteration is computed independently. Then we calculate the unnormalized PageRank vectors for the common nodes and the dangling nodes directly. Finally, we normalize the vector to obtain the PageRank. The numerical experiments suggest that Algorithm 2 can outperform the other two algorithms, as long as an appropriate block structure of the web exists. However, in real data, the number of common nodes may increase as the number of blocks increases, and the dimension of the submatrix $H_{C,C}$ of links among the common nodes could become larger. Then it will take much time to calculate $(I - \alpha H_{C,C})^{-1}$. In this case, similar to Algorithm 2, we will consider calculating the vector for the common nodes first and then calculating the vector for the general nodes in each block independently. We also need to experiment on real data and make comparisons with more existing methods in future work.

Figure 1: A separation of the common nodes for a web link graph which has four blocks.

Figure 2: One of the three web link graphs, where the proportion between general nodes and common nodes is 7 : 3 in each subblock.

Figure 4: Comparison among the three algorithms which are run on three datasets.
Then in each subblock, a minority of nodes have links to other blocks, and in this paper we call them common nodes. The definition of common node is given as follows.

Table 1: Comparison of original PageRank, reordered PageRank, and Algorithm 2.