Connecting Patterns Inspire Link Prediction in Complex Networks

Link prediction uses observed data to predict future or potential relations in complex networks. An underlying hypothesis is that two nodes have a high likelihood of connecting if they share many common characteristics, so the key issue is to develop different similarity-evaluating approaches. However, in this paper, by characterizing the differences between the similarity scores of existing and nonexisting links, we find an interesting phenomenon: two nodes with certain particular low similarity scores also have a high probability of connecting. We therefore put forward a new framework that uses an optimal one-variable function to adjust the similarity scores of two nodes. Theoretical analysis suggests that our method can correctly predict more links of low similarity score (long-range links) without losing accuracy. Experiments on real networks reveal that our framework not only enhances precision significantly but also predicts more long-range links than state-of-the-art methods, deepening our understanding of the structure of complex networks.


Introduction
Modern science and engineering techniques give us access to various kinds of data, including online social networks, scientific collaboration networks, and power grid networks [1-5]. Many interesting phenomena can be uncovered from these networks. For example, analyzing Facebook and Twitter data helps find lost friends by simply counting common friends [6,7] and supports recommendation systems in online stores [8,9]. Restricted by instrument accuracy and other obstacles, we only obtain a small fraction or a snapshot of the complete networks [10,11], prompting us to filter the information in complex networks [12-14]. Link prediction is a straightforward approach to restoring networks by predicting missing links and distinguishing spurious ones [15-17]; thus, great efforts have been devoted to link prediction in recent years [16,18]. Link prediction is used in different kinds of networks, including unipartite and bipartite networks, where unipartite networks consist of nodes of a single type (e.g., social networks and neural networks) and bipartite networks consist of nodes of two types (e.g., user-object purchasing networks and user-movie networks) [19,20].
In classical link prediction approaches, similarity scores are first computed for pairs of disconnected nodes, and then the nonexisting links at the top of the score list are predicted as potential ones [16]. Consequently, the key issue is to search for effective score-assigning methods, which fall mainly into three categories [16,21]: similarity-based algorithms, Bayesian algorithms, and maximum likelihood algorithms. First, similarity-based algorithms [22-24] suppose that similar nodes have a high probability of linking together; similarities are evaluated by common neighbors, random walks, resource allocation, and other local and global indices. Second, Bayesian algorithms [25-27] abstract the joint probability distribution from the observed networks and then use conditional probability to estimate the likelihood of a nonexisting link. Third, maximum likelihood algorithms [28,29] presuppose that some underlying principles rule the structure of a network, with the detailed rules and specific parameters obtained by maximum likelihood estimation; scores of nonexisting links are acquired through the details of these principles. Most of these methods favor predicting links with high similarity scores and perform badly in detecting long-range links with low similarities.
In the aforementioned methods, the basic hypothesis that two nodes with a high similarity score have a high likelihood of connecting lacks an in-depth justification. Recent works have demonstrated that long-range links exist extensively in complex networks and play an important role in routing, epidemic diffusion, and other dynamics [30,31]. However, in practice, the endpoints of a long-range link usually have weak interaction and low similarity [30], which prevents traditional methods from detecting long-range links [32,33]. Hence, the structural patterns underlying networks are of great importance to study.
Our study takes a different but complementary approach to the link prediction problem. By analyzing the score distributions of existing and nonexisting links separately, we find an interesting phenomenon: existing and nonexisting links follow different connecting patterns with respect to their similarity scores. Then, inspired by precision-recall curves [34-36], we propose a metric, named the precision-to-noise ratio (PNR), to characterize the ability to distinguish potential links at different scores; PNR describes the local precision of a given set of links with the same score. Based on PNR, we put forward a novel framework that applies a one-variable function to adjust the scores of a given method. We argue that the framework finds the optimal transforming function, which exploits the full capacity of traditional link prediction methods and improves their performance both in precision and in the detection of long-range links. Experiments on six real-world networks demonstrate the effectiveness of our method.
The rest of the paper is organized as follows. In Section 2, we first review the link prediction problem and then introduce our proposed method. In Section 3, we compare the performance of our method with that of the classical methods. Finally, the conclusion is given.

Materials and Methods
We give the link prediction formulation in Section 2.1 and the baseline methods in Section 2.2. Our proposed framework is introduced in Section 2.3.

Network Formation and Metrics
Consider an undirected network $G(V, E)$, where $V$ is the set of nodes and $E$ is the set of links, described by an adjacency matrix $A = (a_{xy})_{N \times N}$ with $a_{xy} = 1$ if node $x$ connects to $y$ and $a_{xy} = 0$ otherwise. When evaluating the prediction performance, we usually divide the links randomly into a $(1-p_H)$ fraction training set $E^T$ and a $p_H$ fraction probe set $E^P$ ($p_H \in (0,1)$), with $E^T \cap E^P = \emptyset$ and $E^T \cup E^P = E$. The goal is to accurately predict the links in the probe set using only the information in the training set.
We first assign a score to each nonexisting link and then choose the top-$L$ highest-scoring links as potential ones. State-of-the-art similarity evaluation methods can be used to carry out link prediction, including common neighbors (CN), the Jaccard index (JB), the resource allocation index (RA), the local path index (LP), and the structural perturbation method (SPM) (see the Baseline part and [38]).
There are two popular metrics to characterize accuracy: the area under the receiver operating characteristic curve (AUC) [39] and precision [40,41]. AUC can be interpreted as the probability that a randomly chosen missing link (i.e., a link in $E^P$) has a higher score than a randomly chosen nonexisting link. AUC is estimated by $n$ independent comparisons: each time, we randomly choose a missing link and a nonexisting link and compare their scores. After $n$ comparisons, we record $n_1$ times where the missing link has the higher score and $n_2$ times where the two scores are equal. The final AUC is calculated as
$$\mathrm{AUC} = \frac{n_1 + 0.5\,n_2}{n}. \quad (1)$$
If all scores were drawn independently from one identical distribution, AUC would be around 0.5; a higher AUC corresponds to a more accurate prediction.
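As a concrete illustration, the sampling estimate of AUC described above can be sketched in Python (the function and variable names are ours, not from the paper):

```python
import random

def auc_score(probe_scores, nonexisting_scores, n=10000, seed=0):
    """Estimate AUC by n independent comparisons of a random probe-link
    score against a random nonexisting-link score."""
    rng = random.Random(seed)
    n1 = n2 = 0  # wins and ties of the probe (missing) links
    for _ in range(n):
        sp = rng.choice(probe_scores)
        sn = rng.choice(nonexisting_scores)
        if sp > sn:
            n1 += 1
        elif sp == sn:
            n2 += 1
    return (n1 + 0.5 * n2) / n

# If every missing link outscores every nonexisting link, AUC = 1;
# identical score lists give AUC = 0.5.
```

With many comparisons, the estimate converges to the probability stated in the text.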
The other metric, precision, characterizes the ratio of correctly predicted links in a given prediction list: if the prediction list has length $L$, among which $L_r$ links are truly potential (probe) links, then
$$\mathrm{Precision} = \frac{L_r}{L}. \quad (2)$$
Clearly, higher precision means higher prediction accuracy. Intuitively, higher accuracy should mean both higher AUC and higher precision; in the experiments, however, we will see that precision has little correlation with AUC and that improving precision may not improve AUC.
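The top-$L$ precision of (2) amounts to a set intersection; a minimal sketch (names are ours):

```python
def precision_at_L(ranked_links, probe_links, L):
    """Precision = L_r / L: the fraction of the top-L ranked nonexisting
    links that turn out to be probe-set links."""
    probe = set(probe_links)
    hits = sum(1 for link in ranked_links[:L] if link in probe)
    return hits / L

# Links are represented as consistently ordered node pairs.
ranked = [(1, 2), (3, 4), (5, 6), (7, 8)]  # descending score order
probe = {(1, 2), (5, 6)}
```

Here `precision_at_L(ranked, probe, 4)` counts two hits among four predictions, giving 0.5.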

Baseline Prediction Methods.
There exists a large number of score-assigning approaches for the link prediction problem, and all of them can be introduced into our framework. Though we only investigate some state-of-the-art score-assigning approaches, the results and conclusions also apply to other score-assigning methods. The five score-assigning approaches [6,16] are as follows.
(i) Common Neighbors (CN). This metric supposes that if two nodes $x$ and $y$ have more common neighbors, they are more likely to connect. The neighborhood overlap of the two nodes is
$$s_{xy} = |\Gamma(x) \cap \Gamma(y)|, \quad (3)$$
where $\Gamma(x)$ is the neighbor set of node $x$ and $|\cdot|$ denotes the size of a set. The drawback of CN is that it favors large-degree nodes: even if two large-degree nodes are not particularly similar, they may still share many common neighbors.
(ii) Jaccard Coefficient (JB). Jaccard is a conventional similarity metric that aims to suppress the influence of large-degree nodes:
$$s_{xy} = \frac{|\Gamma(x) \cap \Gamma(y)|}{|\Gamma(x) \cup \Gamma(y)|}. \quad (4)$$
Since the similarity is normalized by the size of the union of the two nodes' neighbor sets, two large-degree nodes receive a low similarity even if they share many common neighbors.
(iii) Resource Allocation (RA). This index is inspired by resource allocation dynamics in complex networks. Given a pair of unconnected nodes $x$ and $y$, suppose that node $x$ needs to allocate some resource to $y$, using their common neighbors as transmitters. Each transmitter starts with a single unit of resource and distributes it equally among all its neighbors. The similarity between $x$ and $y$ is the amount of resource $y$ receives from their common neighbors:
$$s_{xy} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{k_z}, \quad (5)$$
where $k_z$ is the degree of node $z$. Compared with the Jaccard method, RA also suppresses the influence of large-degree nodes, but in a finer-grained way: different neighbors contribute differently to the similarity.
If two nodes connect mainly through low-degree common neighbors, they have a higher probability of sharing common interests or characteristics, whereas many node pairs share high-degree neighbors, so high-degree nodes should play a weak role when evaluating similarity. Based on this idea, the Adamic-Adar (AA) index is obtained by using $\log k_z$ instead of $k_z$ in (5).
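The four neighborhood-based indices above can be sketched compactly; the graph is represented as an adjacency dict of neighbor sets, and all helper names are ours:

```python
import math

def cn(G, x, y):
    """Common neighbors: |Γ(x) ∩ Γ(y)|."""
    return len(G[x] & G[y])

def jaccard(G, x, y):
    """CN normalized by the size of the neighbor-set union."""
    union = G[x] | G[y]
    return len(G[x] & G[y]) / len(union) if union else 0.0

def ra(G, x, y):
    """Resource allocation: each common neighbor z contributes 1/k_z."""
    return sum(1.0 / len(G[z]) for z in G[x] & G[y])

def aa(G, x, y):
    """Adamic-Adar: 1/log(k_z); assumes every common neighbor has degree >= 2."""
    return sum(1.0 / math.log(len(G[z])) for z in G[x] & G[y])

# A toy 4-node graph as an adjacency dict of neighbor sets.
G = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}
```

For nodes 1 and 4 in the toy graph, both common neighbors (2 and 3) have degree 3, so RA gives 2/3 while CN gives 2.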
(iv) Local Path (LP). CN considers the intersection of neighborhoods, which effectively uses paths of length two to characterize similarity. LP takes a more general view by also counting paths of length three:
$$S = A^2 + \epsilon A^3, \quad (6)$$
where $A$ is the adjacency matrix of the network and $\epsilon$ is a small positive number, so that shorter paths contribute more to the similarity than longer ones. LP is the low-order part of the Katz method ($S^{\mathrm{Katz}} = A^2 + \epsilon A^3 + \epsilon^2 A^4 + \cdots$), but with much lower computational complexity.
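Equation (6) is just two matrix products; a self-contained sketch on small dense matrices (function names are ours):

```python
def matmul(A, B):
    """Dense matrix product for small adjacency matrices (lists of lists)."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def local_path(A, eps=1e-3):
    """LP score matrix S = A^2 + eps * A^3: A^2 counts length-2 paths
    (common neighbors) and A^3 counts length-3 paths, down-weighted by eps."""
    A2 = matmul(A, A)
    A3 = matmul(A2, A)
    n = len(A)
    return [[A2[i][j] + eps * A3[i][j] for j in range(n)] for i in range(n)]

# Path graph 1-2-3: nodes 1 and 3 share one length-2 path.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
S = local_path(A)
```

For realistic network sizes one would use sparse matrix libraries instead, but the entries computed are the same.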
(v) Structural Perturbation Method (SPM). Lü et al. [6] suppose that network structure remains consistent under small random perturbations. In SPM, the training set $E^T$ is divided into a small perturbation set $\Delta E$ and a remaining set $E^R$ ($E^T = E^R \cup \Delta E$), with corresponding adjacency matrices $\Delta A$ and $A^R = A^T - \Delta A$. The matrix $A^T$ is assumed to share the eigenvectors of $A^R$ but with shifted eigenvalues. For the $k$th largest eigenvalue $\lambda_k$ of $A^R$ with eigenvector $x_k$, the first-order correction under the perturbation $\Delta A$ is
$$\Delta\lambda_k = \frac{x_k^T \Delta A\, x_k}{x_k^T x_k}. \quad (7)$$
The similarity matrix $\tilde{A} = (\tilde{a}_{xy})_{N \times N}$ is then
$$\tilde{A} = \sum_{k=1}^{N} (\lambda_k + \Delta\lambda_k)\, x_k x_k^T. \quad (8)$$
SPM first divides a network into training and probe sets and then further divides the training set into a perturbation set and a remaining set. For a given division of training and probe sets, we average 10 independent realizations of (8) to obtain the similarity matrix.
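A minimal sketch of one perturbation round, assuming the standard first-order correction (7)-(8); the function name and random edge sampling are our own choices, and degenerate eigenvalues are ignored for simplicity:

```python
import numpy as np

def spm_round(A_train, frac=0.1, seed=0):
    """One SPM perturbation round: remove a random fraction of training
    links (ΔA), diagonalize the remainder A^R, and rebuild the perturbed
    matrix with first-order-corrected eigenvalues."""
    rng = np.random.default_rng(seed)
    n = A_train.shape[0]
    edges = [(i, j) for i in range(n) for j in range(i + 1, n) if A_train[i, j]]
    picked = rng.choice(len(edges), size=max(1, int(frac * len(edges))),
                        replace=False)
    dA = np.zeros_like(A_train, dtype=float)
    for t in picked:
        i, j = edges[t]
        dA[i, j] = dA[j, i] = 1.0
    A_R = A_train - dA
    vals, vecs = np.linalg.eigh(A_R)  # orthonormal eigenpairs x_k of A^R
    # Δλ_k = x_k^T ΔA x_k  (denominator is 1 for orthonormal eigenvectors)
    dvals = np.einsum('ik,ij,jk->k', vecs, dA, vecs)
    # perturbed matrix (8), used as the similarity matrix
    return (vecs * (vals + dvals)) @ vecs.T
```

In the full method this round is repeated (e.g., 10 times) and the resulting matrices averaged.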
Apart from the five similarity metrics introduced above, for more similarity-evaluating methods, please refer to [42,43].

The Proposed Method.
We start our framework by reinvestigating the definition of precision. Suppose that $s_{xy}$ is the similarity score of nodes $x$ and $y$ obtained by a prediction method $\digamma$ based only on the training set $E^T$, $f_E(s)$ is the probability that a randomly chosen existing link in the training set has score $s$, and $f_N(s)$ is the probability that a randomly chosen nonexisting link in the training set has score $s$. Due to the random division into training and probe sets, links in the probe set should, with high confidence, follow the same similarity distribution as those in the training set by the law of large numbers [44,45]; thus we do not differentiate the similarity distributions of existing links in the training and probe sets in the rest of the paper. The assumption becomes exact as the sample size goes to infinity [44,45]. Since classical methods only predict links with high scores, i.e., scores above a threshold $s_0$, the estimated precision of method $\digamma$ can be written as
$$P_\digamma = \frac{|E^P| \int_{s_0}^{s_{\max}} f_E(s)\, ds}{(|U| - |E^T|) \int_{s_0}^{s_{\max}} f_N(s)\, ds}, \quad (9)$$
where $|E^P|$ is the size of $E^P$, $s_0$ is a constant, $U$ is the set of all possible links ($|U| = \frac{1}{2}N(N-1)$), and $s_{\max}$ is the maximum score. In real scenarios, the length of the prediction list is usually the size of the probe set [16], which requires $s_0$ to satisfy
$$(|U| - |E^T|) \int_{s_0}^{s_{\max}} f_N(s)\, ds = |E^P|. \quad (10)$$
If $f_E(s) \ll f_N(s)$ for $s > s_0$, the precision $P_\digamma \to 0$; conversely, $f_E(s) \gg f_N(s)$ gives rise to a high precision. Since only the top-$L$ highest-scoring links are predicted as potential links, precision can also be calculated by (2) [6,16], which is a much simpler expression than (9).
The main concern is to replace the interval $[s_0, s_{\max}]$ in (10) by a general score set $\Omega$ that maximizes the precision. We propose the precision-to-noise ratio (PNR) to determine $\Omega$, where $\mathrm{PNR}(s)$ measures the ability to distinguish real links among candidates with the same score. Note that a nonexisting link in the training set may be an existing link in the probe set. Given a nonexisting link in the training set with similarity $s$, the probability that it is an existing link in the probe set (i.e., the local precision) is
$$\frac{|E^P|\, f_E(s)}{(|U| - |E^T|)\, f_N(s)} \propto \mathrm{PNR}(s) = \frac{f_E(s)}{f_N(s)}. \quad (11)$$
The central issue of our framework is to use $\mathrm{PNR}(s)$ to determine the optimal score set $\Omega$. We first calculate the similarity scores of all links based only on the training set by a traditional method. Second, $f_E(s)$, $f_N(s)$, and $\mathrm{PNR}(s)$ are computed. Third, we reassign the score of each link as $s'_{xy} = \mathrm{PNR}(s_{xy})$, where $s_{xy}$ is the original similarity score from the first step. Finally, we sort the links in descending order of $s'_{xy}$ and predict the top-$L$ as potential links [16,18]. The optimal score set $\Omega_{\mathrm{opt}}$ consists of the original similarity scores whose reassigned scores rank in the top-$L$ list.
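The rescoring step can be sketched for a method with discrete scores such as CN, using histogram estimates of $f_E$ and $f_N$ (all names are ours):

```python
from collections import Counter

def pnr_rescore(existing_scores, candidate_scores):
    """Reassign each candidate (nonexisting) link the score
    PNR(s) = f_E(s) / f_N(s), with both score distributions
    estimated empirically from the training set.

    existing_scores:  {(x, y): s} for links present in the training set
    candidate_scores: {(x, y): s} for nonexisting links in the training set
    """
    fE = Counter(existing_scores.values())   # histogram of existing-link scores
    fN = Counter(candidate_scores.values())  # histogram of nonexisting-link scores
    nE, nN = len(existing_scores), len(candidate_scores)

    def pnr(s):
        return (fE[s] / nE) / (fN[s] / nN) if fN[s] else 0.0

    return {link: pnr(s) for link, s in candidate_scores.items()}

# Rank candidates by the reassigned score and predict the top L:
# top_L = sorted(rescored, key=rescored.get, reverse=True)[:L]
```

In practice, continuous scores would be binned before forming the histograms; the ranking step is unchanged.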
Different kinds of similarity evaluations can be introduced into the framework. Taking the CN similarity method as an example, our framework proceeds as follows: (1) divide the links of a network into a $(1-p_H)$ fraction training set $E^T$ and a $p_H$ fraction probe set $E^P$; (2) compute the CN score of every node pair using only $E^T$; (3) compute $f_E(s)$, $f_N(s)$, and $\mathrm{PNR}(s)$; (4) reassign each nonexisting link the score $s'_{xy} = \mathrm{PNR}(s_{xy})$ and predict the top-$L$ links in descending order of $s'_{xy}$. An important property of our framework is that if $\Omega$ is determined according to $\mathrm{PNR}(s)$, that is, $\mathrm{PNR}(a) > \mathrm{PNR}(b)$ for all $a \in \Omega$ and all $b \in \mathbb{R} \setminus \Omega$, the precision $P_\digamma$ exploits the full capacity of the given similarity-evaluating method: $\mathrm{PNR}(s)$ is the optimal transforming function, $g_{\mathrm{opt}}(s) = \mathrm{PNR}(s)$. No matter how we transform the similarity by another one-variable function $s'_{xy} = g(s_{xy})$, its precision cannot outperform that of the proposed $\mathrm{PNR}(s)$. For the proof of the optimality of $\mathrm{PNR}(s)$, please see Part I of the supplementary materials.

Experimental Results
We first describe the six real networks in Section 3.1. The precision comparison between our method and the baseline methods is given in Section 3.2. Finally, the characteristics of the links predicted by different methods are investigated in Section 3.3.

Datasets.
To verify the effectiveness of the proposed method, we measure the performance of our framework on six empirical networks from diverse disciplines and backgrounds: (1) email [46]: the Enron email communication network, covering a dataset of around half a million emails; nodes are email addresses, and if an address $x$ sent at least one email to address $y$, the graph contains an undirected link between $x$ and $y$; (2) PDZBase [47]: an undirected network of protein-protein interactions from PDZBase; (3) Euroad [48]: the international E-road network, located mostly in Europe; the network is undirected, with nodes representing cities and links denoting E-roads between two cities; (4) neural [49]: a directed and weighted neural network of C. elegans; (5) USair [6]: a network of flights between US airports in 2010, where each link represents a connection from one airport to another; (6) roundworm [49]: a metabolic network of C. elegans.
Different real networks contain directed or undirected, weighted or unweighted links. To simplify the problem, we treat all links as undirected and unweighted. Besides, only the giant connected component of each network is taken into account, because for a pair of nodes located in two disconnected components, the similarity score is zero under most prediction methods. Table 1 shows the basic statistics of these networks.

Precision Evaluation.
In the experiments, we set $p_H = 10\%$, meaning that the networks are randomly divided into a 90% training set and a 10% probe set. All experiments are averaged over 50 independent simulations. Figure 2 shows the AUC and precision of five different methods on the USair network. In Figure 2, the CN method achieves low AUC yet high precision, whereas the RA method achieves AUC similar to that of CN, JB, and SPM but much lower precision. Apart from the USair network, the deviation between AUC and precision also exists in other real-world networks (see FIG. S1 in the supplementary materials). The main reason is that AUC characterizes the score difference between existing and nonexisting links over the whole network, whereas precision only counts the links with the top-$L$ highest scores. Specifically, from the perspective of score distributions,
$$\mathrm{AUC} = \int_{-\infty}^{+\infty} f_E(s) \int_{-\infty}^{s} f_N(s')\, ds'\, ds.$$
Compared with (10), the definitions of the two metrics are completely different, resulting in little correlation between them.
Figure 3 shows PNR and the score distributions of existing and nonexisting links for the USair network under the CN method. In Figure 3(a), the scores of both existing and nonexisting links roughly follow power-law distributions. High scores sometimes correspond to low PNR, especially at similarity ≈ 60 (see Figure 3(b)). Conversely, some low scores achieve high PNR, indicating that a nonexisting link in the training set with such a score is likely to be an existing link in the probe set, whereas a nonexisting link with a high score but low PNR has a high probability of not being in the probe set. A similar phenomenon also exists in other networks (see FIG. S2 in the supplementary materials). In consequence, the foundation of traditional methods, which suppose that similar nodes have a high likelihood of forming links, faces great challenges in precisely predicting links of low similarity.
Figure 6 shows the precision difference between the proposed PNR methods and the baseline methods. Our proposed method enhances precision remarkably compared with the original methods in most cases; some fluctuation exists, due to the limited size of the networks. Table 2 gives the maximal precision increase in the six networks, where precision is the maximum over the traditional methods and over the PNR methods, respectively, that is, max{CN, Jaccard, RA, LP, SPM} and max{PNR CN , PNR Jaccard , PNR RA , PNR LP , PNR SPM }. Our method outperforms the state-of-the-art methods in all six networks. Besides, Figure 4 shows the influence of the probe set size on precision. We find that our method outperforms the classical methods when the training fraction $1 - p_H > 0.85$ (for the JB method, already when $1 - p_H > 0.6$). Other networks yield similar results (see FIG. S3 in the supplementary materials). According to the theoretical analysis (see the first part of the supplementary materials), our method should perform better than, or at least equally to, the classical methods; the analysis, however, supposes that the network structure is not influenced by the random division into training and probe sets, so that the training subnetwork has a structure similar to that of the original network. This assumption is rational when $p_H$ is small. If the probe fraction $p_H$ is large, the training set differs considerably from the entire network, which violates the assumption; therefore, our method performs best when the fraction of the probe set is small.

Figure 5: Comparison of the edges predicted by JB and by the corresponding PNR method in the USair network. We predict 10 edges for both the JB and PNR JB methods. The USair network is divided into communities by the method in [37]; nodes in the same community have the same color and short geographical distances. Our method (blue lines) predicts more edges between faraway nodes in different communities, while the original JB method (red lines) only predicts edges between close nodes.

Characteristics of the Predicted Links.
Long-range links play an important role in network dynamics, and it is of much significance to predict them [32,50]. Figure 5 compares the links predicted by JB and by the corresponding PNR method in the USair network, using the community detection method in [37]. In Figure 5, our method predicts more links between faraway nodes in different communities, while the original JB method only predicts links between close nodes. However, it is difficult to evaluate long-range links solely through community divisions. Since long-range links usually have long distances and low similarities, we investigate the average distance and average similarity of the links predicted by our framework.
The distance $d_{xy}$ of a link $(x, y)$ is the shortest-path distance between nodes $x$ and $y$ based only on the training set. Since the endpoints of predicted links are not directly connected, $d_{xy} \geq 2$. The average distance of the predicted links is
$$\langle d \rangle = \frac{1}{L} \sum_{(x,y)} d_{xy},$$
where the sum runs over the $L$ predicted links. Analogously, the average similarity of the predicted links is
$$\langle s \rangle = \frac{1}{L} \sum_{(x,y)} s_{xy},$$
where $s_{xy}$ is the similarity of nodes $x$ and $y$ in the training set.
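The per-link distance is an unweighted shortest path, which breadth-first search computes directly; a self-contained sketch on the training graph (names are ours):

```python
from collections import deque

def shortest_distance(G, src, dst):
    """BFS shortest-path length in the training graph (None if unreachable)."""
    seen, frontier, d = {src}, deque([src]), 0
    while frontier:
        d += 1
        for _ in range(len(frontier)):  # expand one BFS level at a time
            for nb in G[frontier.popleft()]:
                if nb == dst:
                    return d
                if nb not in seen:
                    seen.add(nb)
                    frontier.append(nb)
    return None

def average_distance(G, predicted_links):
    """<d> averaged over predicted links whose endpoints are connected in G."""
    ds = [shortest_distance(G, x, y) for x, y in predicted_links]
    ds = [d for d in ds if d is not None]
    return sum(ds) / len(ds)

# Path graph 1-2-3-4 as an adjacency dict of neighbor sets.
G = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
```

Pairs separated in the training graph are simply excluded from the average here; how to treat them is a design choice the text leaves open.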
Figure 7 shows the difference between the average distances obtained by the PNR methods and the corresponding original methods. Generally, the PNR method achieves a higher average distance than the corresponding original method in the six networks, especially for SPM in the Email network and LP in the USair network, whereas in many cases PNR and the original methods yield the same average distance $\langle d \rangle = 2$. This is because the distance between most unconnected node pairs is 2, revealing that most commonly used methods tend to predict links that close triangles; in those cases our method has little influence on the average distance. For some sparser networks, however, such as the neural and USair networks, the average distance is improved by our framework, especially for LP in the USair network. Previous works show that the two endpoints of a long-range link usually have a long distance or low similarity. Since the PNR framework increases the average distance of the predicted links, it can be conjectured that more long-range links are predicted. Moreover, combining Figures 6 and 7, we find that our framework predicts more long-range links correctly.
Furthermore, Figure 8 shows the difference between the average similarities obtained by the PNR methods and the corresponding original methods. In Figure 8, the PNR method achieves a lower average similarity than the corresponding original method in the six networks, except for the RA method in the roundworm network; this exception arises because PNR fluctuates considerably in networks of limited size. Similar to the analysis of average distance, this shows that PNR methods are beneficial to the prediction of long-range links, which agrees with the conclusion drawn from Figure 7.

Conclusion
In summary, we systematically study the drawbacks of similarity-based link prediction methods and show that some methods achieve high AUC yet low precision. Based on the differences between the similarity distributions of existing and nonexisting links, we propose a metric (PNR) to explain the coexistence of high AUC and low precision.

Figure 1 :
Figure 1: Schematic of the proposed framework based on CN. (a) A snapshot of a large network. (b) Scores of nonexisting links calculated by the CN method. (c) The top panel shows the score distributions of existing and nonexisting links, $f_E(s)$ and $f_N(s)$; the bottom panel shows $\mathrm{PNR}(s) = f_E(s)/f_N(s)$. (d) Predicted links. State-of-the-art prediction methods follow the path (a)→(b)→(d), while our proposed framework follows the path (a)→(b)→(c)→(d), with the additional PNR(s) step.

Figure 1 depicts the proposed framework based on the CN method. After obtaining the similarity scores of links (Figure 1(a)→1(b)), the traditional CN method directly predicts potential links according to the scores (Figure 1(b)→1(d)), while the proposed framework calculates $\mathrm{PNR}(s)$ (Figure 1(b)→1(c)) and then predicts links according to the reassigned scores (Figure 1(c)→1(d)).

Figure 2 :
Figure 2: AUC and precision on the USair network obtained by five different approaches: common neighbors (CN), Jaccard index (JB), resource allocation index (RA), local path index (LP), and structural perturbation method (SPM). The results are averaged over 50 independent simulations. The SPM method achieves high precision yet low AUC, and JB has low precision but high AUC (>0.9).

Figure 3 :
Figure 3: Similarity distributions and the corresponding $\mathrm{PNR}(s)$ for the USair network, where similarity is obtained by the CN method. (a) Similarity distributions of the existing and nonexisting links, $f_E(s)$ and $f_N(s)$, respectively. (b) $\mathrm{PNR}(s)$ as a function of similarity.

Figure 6 :
Figure 6: Precision comparison of the proposed methods (red) and traditional high-similarity-based methods (cyan) on six real-world networks. (a) Email network. (b) PDZBase network. (c) Euroad network. (d) Neural network. (e) Roundworm network. (f) USair network. Results are averaged over 50 independent simulations. Our proposed framework increases precision in most cases.

Table 2 :
Maximal precision comparison of the proposed methods and traditional high-similarity methods on six real-world networks. Traditional precision is the maximum over the traditional methods, that is, max{CN, Jaccard, RA, LP, SPM}. Proposed precision is the maximum over our framework's methods, that is, max{PNR CN , PNR Jaccard , PNR RA , PNR LP , PNR SPM }.