Link prediction uses observed data to predict future or potential relations in complex networks. An underlying hypothesis is that two nodes are likely to connect if they share many common characteristics, so the key issue has been to develop different similarity-evaluating approaches. However, in this paper, by characterizing the differences between the similarity scores of existing and nonexisting links, we find an interesting phenomenon: two nodes with certain low similarity scores also have a high probability of connecting. We therefore put forward a new framework that uses an optimal one-variable function to adjust the similarity scores of two nodes. Theoretical analysis suggests that our method can correctly predict more links with low similarity scores (long-range links) without losing accuracy. Experiments on real networks reveal that our framework not only enhances precision significantly but also predicts more long-range links than state-of-the-art methods, which deepens our understanding of the structure of complex networks.
Modern science and engineering techniques have increased the availability of various kinds of data, including online social networks, scientific collaboration networks, and power grid networks [
In classical link prediction approaches, similarity scores are first computed for each pair of disconnected nodes, and the nonexisting links at the top of the score list are then predicted as potential ones [
In the aforementioned methods, the basic hypothesis that two nodes with a high similarity score are likely to connect lacks an in-depth justification. Recent works have demonstrated that long-range links exist extensively in complex networks and play an important role in routing, epidemic spreading, and other dynamics [
Our study takes a different but complementary approach to the link prediction problem. By analyzing the score distributions of existing and nonexisting links separately, we find that the two kinds of links follow different connecting patterns with respect to their similarity scores. Then, inspired by the precision-recall curves [
The rest of the paper is organized as follows. In Section
We give the link prediction formalism in Section
Given a network
We first assign a score to each nonexisting link and then choose links with the highest top
There are two popular metrics to characterize the accuracy: area under the receiver operating characteristic curve (AUC) [
The other metric is precision, which is the fraction of correctly predicted links in a given prediction list. That is to say, if the length of the prediction list is
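To make the two metrics concrete, the following is a minimal sketch under the standard definitions used in the link prediction literature: AUC is taken as the probability that a probe (held-out) link receives a higher score than a nonexisting link, with ties counted as one half, and precision as the fraction of probe links among the top-ranked candidates. The function names, the `scores` dictionary, and the exhaustive pairwise comparison are illustrative assumptions; in large networks the comparison is usually estimated by random sampling.

```python
def auc_score(scores, probe_links, nonexistent_links):
    """Exact pairwise AUC: chance that a probe link outscores a nonexisting link.

    Ties contribute 0.5. `scores` maps a link (a node pair) to its score.
    """
    total = 0.0
    for p in probe_links:
        for q in nonexistent_links:
            if scores[p] > scores[q]:
                total += 1.0
            elif scores[p] == scores[q]:
                total += 0.5
    return total / (len(probe_links) * len(nonexistent_links))


def precision_at(scores, probe_links, candidates, list_length):
    """Fraction of the top `list_length` candidates that are probe links.

    `candidates` holds every unobserved link (probe and nonexisting).
    """
    ranked = sorted(candidates, key=lambda link: scores[link], reverse=True)
    probe = set(probe_links)
    hits = sum(1 for link in ranked[:list_length] if link in probe)
    return hits / list_length
```

With two probe links and two nonexisting links, for instance, a score assignment in which one probe link ranks first and the other ranks third gives an AUC of 0.75 and a precision of 0.5 for a list of length 2.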
A large number of score-assigning approaches exist for the link prediction problem, and all of them could be introduced into our framework. Although we investigate only some state-of-the-art score-assigning approaches, the results and conclusions also apply to other score-assigning methods. The five score-assigning approaches [
SPM first divides a network into a training set and a probe set and further divides the training set into a perturbation set and the remaining set. For a given division into training and probe sets, we calculate the average over 10 independent simulations of (
For similarity-evaluating methods beyond the five introduced above, please refer to [
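As an illustration of the local indices named in this paper (CN, JB, RA, and LP), the sketch below uses their commonly cited textbook forms on an adjacency structure stored as a dictionary of neighbor sets. The function names, the `adj` representation, and the default value of `epsilon` are assumptions made for the example; SPM is omitted because it requires an eigen-decomposition of the perturbed adjacency matrix.

```python
def common_neighbors(adj, x, y):
    """CN: number of neighbors shared by x and y."""
    return len(adj[x] & adj[y])


def jaccard(adj, x, y):
    """JB (Jaccard index): shared neighbors divided by the size of the neighbor union."""
    union = adj[x] | adj[y]
    return len(adj[x] & adj[y]) / len(union) if union else 0.0


def resource_allocation(adj, x, y):
    """RA: each shared neighbor z contributes 1 / degree(z)."""
    return sum(1.0 / len(adj[z]) for z in adj[x] & adj[y])


def local_path(adj, x, y, epsilon=0.01):
    """LP: (A^2)_xy + epsilon * (A^3)_xy; epsilon here is an assumed value."""
    paths2 = len(adj[x] & adj[y])
    # (A^3)_xy = sum over neighbors u of x of the number of 2-step walks from u to y
    paths3 = sum(len(adj[u] & adj[y]) for u in adj[x])
    return paths2 + epsilon * paths3
```

For the four-node graph with links 0-1, 0-2, 1-2, 1-3, and 2-3, the disconnected pair (0, 3) has two common neighbors, a Jaccard index of 1, an RA score of 2/3, and an LP score of 2.02 with `epsilon=0.01`.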
We start our framework by reinvestigating the definition of precision. Supposing that
Most previous link prediction methods only predict links with high similarity scores. We generalize (
The main concern is to select appropriate set
The central issue of our framework is to use PNR
Different kinds of similarity evaluations could be introduced into the framework. Taking the CN similarity method as an example, our framework proceeds as follows:
Divide the links of a network into
Calculate the similarity scores of all existing and nonexisting links with the CN method, using only the training set.
Calculate PNR
Obtain the readjusted scores of the nonexisting links in the training set by
Determine the prediction list by choosing links with
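The steps above can be sketched as follows. This is only an illustration under a stated assumption: PNR at a score value s is taken here to be the ratio between the fraction of existing (training) links whose score equals s and the fraction of nonexisting links whose score equals s. The exact definition should be taken from the equations of this section. The function names and the `adj` dictionary of neighbor sets (built from the training set only) are hypothetical, and the division of links into sets is assumed to have been done beforehand.

```python
from collections import Counter
from itertools import combinations


def cn_score(adj, x, y):
    """CN similarity computed on the training graph."""
    return len(adj[x] & adj[y])


def pnr_table(adj):
    """Map each CN score value to an assumed PNR: share of existing links
    with that score divided by share of nonexisting links with that score."""
    existing, nonexisting = Counter(), Counter()
    for x, y in combinations(sorted(adj), 2):
        s = cn_score(adj, x, y)
        if y in adj[x]:
            existing[s] += 1
        else:
            nonexisting[s] += 1
    n_e, n_n = sum(existing.values()), sum(nonexisting.values())
    if n_e == 0 or n_n == 0:
        return {}
    return {s: (existing[s] / n_e) / (count / n_n)
            for s, count in nonexisting.items()}


def predict(adj, list_length):
    """Rank nonexisting links by their readjusted score PNR(CN) and keep the top ones."""
    pnr = pnr_table(adj)
    candidates = [(x, y) for x, y in combinations(sorted(adj), 2)
                  if y not in adj[x]]
    ranked = sorted(candidates,
                    key=lambda p: pnr.get(cn_score(adj, *p), 0.0),
                    reverse=True)
    return ranked[:list_length]
```

Because the ranking depends on the readjusted score rather than on the raw similarity, a nonexisting link with a low CN value can be placed ahead of one with a high CN value whenever existing links are relatively more frequent at that low value. The sketch enumerates all node pairs and is therefore only practical for small networks.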
Figure
Schematic of the proposed framework based on CN. (a) A snapshot of a large network. (b) Scores of nonexisting links calculated by the CN method. (c) The top panel shows the score distributions of existing and nonexisting links,
An important property of our framework is that if
We first describe the six real networks in Section
To verify the effectiveness of the proposed method, we measure the performance of our framework in six empirical networks from diverse disciplines and backgrounds: (1) email [
Real networks may contain directed or undirected, weighted or unweighted links. To simplify the problem, we treat all links as undirected and unweighted. In addition, only the giant connected component of each network is taken into account, because for a pair of nodes located in two disconnected components, the similarity score is zero according to most prediction methods. Table
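The preprocessing described above can be sketched as follows: directions and weights are discarded, and only the largest connected component is kept. The function name and the edge-list input format are assumptions made for the example, and dropping self-loops is an additional assumption that the text does not state.

```python
from collections import deque


def giant_component(edges):
    """Return the giant connected component as a dict of neighbor sets.

    `edges` is an iterable of (u, v) or (u, v, weight) tuples; direction and
    weight are ignored. Self-loops are skipped (an assumption of this sketch).
    """
    adj = {}
    for u, v, *_ in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        # Breadth-first search collects the component containing `start`.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adj[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return {node: adj[node] for node in best}
```

The returned dictionary can be passed directly to similarity functions that expect a mapping from each node to its set of neighbors.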
Structural properties of the different real networks. Structural properties include network size (
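| Network |  |  |  |  |  |  | Sparsity |
|---|---|---|---|---|---|---|---|
| Email | 33696 | 180811 | 6.070 |  | 0.170 | 4.08 |  |
| PDZBase | 161 | 209 | 2.263 |  | 0.001 | 5.11 |  |
| Euroad | 1039 | 1305 | 1.228 | 0.090 | 0.004 | 18.39 |  |
| Neural | 297 | 2148 | 1.81 |  | 0.292 | 2.46 |  |
| Roundworm | 453 | 2025 | 4.485 |  | 0.647 | 2.66 |  |
| USair | 332 | 2126 | 3.464 |  | 0.625 | 2.74 |  |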

In the experiments, we set
Figure
AUC and precision on the USair network obtained by five different approaches: common neighbors (CN), Jaccard index (JB), resource allocation index (RA), local path index (LP), and structural perturbation method (SPM). The results are averaged over 50 independent simulations. SPM achieves high precision yet low AUC, whereas JB has low precision but high AUC (>0.9).
Figure
Similarity distributions and the corresponding PNR
Figure
Maximal precision comparison of the proposed methods and traditional high-similarity methods for six real-world networks. Traditional precision is obtained by the maximum of traditional methods, that is,

|  |  |  |  |  |  | USair |
|---|---|---|---|---|---|---|
| Traditional | 0.0171 | 0.0032 | 0.0052 | 0.0107 | 0.2651 | 0.4670 |
| Proposed |  |  |  |  |  |  |

The precision difference
Long-range links play an important role in the dynamics of networks, so predicting them is of considerable significance [
The comparison of the predicted edges between JB and the corresponding
Precision comparison of the proposed methods (red) and traditional high-similarity-based methods (cyan) for six real-world networks. (a) Email network. (b) PDZBase network. (c) Euroad network. (d) Neural network. (e) Roundworm network. (f) USair network. Results are the average of 50 independent simulations. Our proposed framework increases precision in most cases.
The distance
Figure
Comparison of the average distance of the PNR-predicted links with that of the corresponding original methods for different networks. (a) Email network. (b) PDZBase network. (c) Euroad network. (d) Neural network. (e) Roundworm network. (f) USair network. Results are the average of 50 independent simulations. Our proposed framework increases the average distance on the whole, which indicates that more long-range links are predicted correctly.
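The average distance of a set of predicted links can be computed as in the following sketch. It assumes that the distance of a predicted link is the shortest path length between its two end nodes in the training network, and that pairs with no connecting path are left out of the average. Both points, as well as the function names, are assumptions made for illustration.

```python
from collections import deque


def shortest_path_length(adj, source, target):
    """Breadth-first search distance from source to target; None if unreachable."""
    if source == target:
        return 0
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                if neighbor == target:
                    return dist[neighbor]
                queue.append(neighbor)
    return None


def average_distance(adj, predicted_links):
    """Mean shortest path length over predicted links, skipping unreachable pairs."""
    lengths = []
    for x, y in predicted_links:
        d = shortest_path_length(adj, x, y)
        if d is not None:
            lengths.append(d)
    return sum(lengths) / len(lengths) if lengths else float("nan")
```

On a path graph 0-1-2-3, the predicted links (0, 2) and (0, 3) have distances 2 and 3, so their average distance is 2.5.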
Furthermore, Figure
Comparison of the average similarity of the PNR-predicted links with that of the corresponding original methods for different networks. (a) Email network. (b) PDZBase network. (c) Euroad network. (d) Neural network. (e) Roundworm network. (f) USair network. Results are the average of 50 independent simulations. Our proposed framework reduces the average similarity on the whole, which indicates that more long-range links are predicted correctly.
In summary, we systematically study the drawbacks of similarity-based link prediction methods and show that some link prediction methods achieve high AUC yet low precision. Based on the differences between the similarity distributions of existing and nonexisting links, we propose a metric (PNR) to explain the problem of high AUC and low precision: two nodes with certain low scores also have a high likelihood of forming a link. Furthermore, we prove that PNR is the optimal one-variable function to adjust the likelihood scores of links. Experiments on real networks demonstrate the effectiveness of PNR, with greatly enhanced precision. Additionally, the proposed framework reduces the average similarity and increases the average distance of the predicted links, which indicates that more missing long-range links can be detected correctly.
Although the proposed approach investigates link prediction in unipartite networks, it could also be generalized to bipartite and other kinds of networks. Moreover, our method provides a novel way to explore the connecting patterns of real networks, which may inspire better score-assigning methods in the future.
The authors declare no competing financial interests.
The authors thank Dr. Alexandre Vidmer for his fruitful discussion and comments. This work is jointly supported by the National Natural Science Foundation of China (61703281, 11547040), the Ph.D. StartUp Fund of Natural Science Foundation of Guangdong Province, China (2017A030310374 and 2016A030313036), the Science and Technology Innovation Commission of Shenzhen (JCYJ20160520162743717, JCYJ20150625101524056, JCYJ20140418095735561, JCYJ20150731160834611, JCYJ20150324140036842, and SGLH20131010163759789), Shenzhen Science and Technology Foundation (JCYJ20150529164656096, JCYJ20170302153955969), the Young Teachers StartUp Fund of Natural Science Foundation of Shenzhen University, and Tencent Open Research Fund.
In the supplementary materials, we prove that PNR