Link Prediction in Complex Network via Penalizing Noncontribution Relations of Endpoints

Similarity-based link prediction algorithms have become a focus of complex network research. Although endpoint degree, as a source of influence diffusion, plays an important role in link prediction, some of the links counted in the endpoint degree, called noncontribution relations, contribute nothing to the similarity between the two nonadjacent endpoints. In this paper, we propose a novel link prediction algorithm, the noncontribution relations penalization algorithm (NRP), which penalizes endpoint degrees that include many such null links in influence diffusion. Seven mainstream baselines are compared on nine benchmark datasets, and numerical analysis shows a great improvement in accuracy, measured by the area under the ROC curve (AUC). Finally, we briefly discuss the complexity of our algorithm.


Introduction
Many systems can be properly described as complex networks, with nodes representing individuals or organizations and links representing the relations among them [1][2][3]. Link prediction, the problem of forecasting potential relations between nonadjacent nodes, helps to explore the evolution of a network and to unveil its development mechanism, and it has attracted the interest of many researchers [4]. In real-world applications, link prediction helps to discover potential network links [5], recommend friends in online social networks [6], explore protein-to-protein interactions [7], reconstruct airline networks [8], and boost e-commerce scales [9, 10].
Most conventional methods model link prediction as estimating the probability that two nonadjacent nodes will be linked, which is believed to be positively correlated with the similarity between them [11]. Mainstream methods take into account topological similarity based on network structure and can be classified into three major classes [11]. The first class calculates topological similarity from global structural information, such as the Katz Index, which counts all paths connecting two nodes with shorter paths preferred [12]. Such global indices show fair prediction performance but suffer from high computational complexity [11]. The second class, defined on local structures, typically includes the Common Neighbors (CN) Index [13], which counts the number of common neighbors; the Preferential Attachment (PA) Index [14], which prefers links between endpoints of high degree; and methods that penalize high-degree common neighbors, such as the Adamic-Adar (AA) Index [15], Resource Allocation (RA) Index [16], Optimized AA (OAA) Index, Optimized RA (ORA) Index [17], and Centrality based Indices (DC-CN, BD-CN, and CC-CN) [18]. Although they successfully reduce the computational expense, local indices suffer from relatively poor prediction performance [11]. To find a good tradeoff between performance and complexity, the third class of similarity indices is defined on quasilocal structures. The Local Path (LP) Index ignores the long-path terms in the Katz Index [19], and its bounded version (BLP) relates local paths in an elaborate way [20]. The Significant Path (SP) Index jointly considers paths and intermediate node degrees [21]. The Local Random Walk (LRW) Index limits a random walk to a local range [22], while the Superposed Random Walk (SRW) Index continuously releases a random walker at the starting node to emphasize the nodes near the target node [22].
The influence of a node plays a very important role in link prediction: a node with powerful influence can attract and affect other nodes in a network, as a very important person (VIP) in a social network has strong power to spread his or her reputation and attract followers. In a network, a node with high degree is believed to have more powerful influence and to be more likely to form links in the future. However, when carefully examining the influence of node degree in link prediction, we find that not all of the influence is significant to the counterpart endpoint. In other words, there are noncontribution relations in node degree influence. For example, in Figure 1, we define the influence degree as the number of relations of an endpoint, the valid influence degree as the number of those relations that can eventually reach the counterpart endpoint, and the connectivity chance as the proportion of the valid influence degree in the influence degree of an endpoint, which characterizes how much of the influence emanating from the endpoint is valid. Despite an apparently high influence degree of 3 in subgraph (b), $V_4$ and $V_6$ in fact have a valid influence degree of only 1 and a connectivity chance of only 1/3 because of two noncontribution relations. In contrast, $V_1$ and $V_3$ in subgraph (a), although they have a lower influence degree of 1, cast all of their influence degree onto each other and have a greater connectivity chance of 1. Furthermore, in subgraph (c), $V_7$ and $V_{11}$ obtain the highest valid influence degree of 3 and, like $V_1$ and $V_3$ in subgraph (a), a connectivity chance of 1 without any useless relations. Therefore, we deduce the link likelihood order (c) > (a) > (b) according to the valid influence degree and connectivity chance. From this analysis and illustration, we conclude that a high connectivity chance enhances link likelihood, whereas noncontribution relations harm it.
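The three quantities defined above can be made concrete with a short sketch. The edge lists below are assumptions that mimic the three subgraphs as they are described in the text (the labels `p`, `q`, `r`, `s` for the extra neighbors in subgraph (b) are hypothetical), and the helper names are illustrative only.

```python
from collections import deque

def build_adjacency(edges):
    """Turn an undirected edge list into a dict of neighbor sets."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def reaches(adj, start, target, blocked):
    """Breadth-first search from start to target that never steps onto blocked."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in adj[node]:
            if nxt != blocked and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def influence_summary(adj, x, y):
    """Return (influence degree, valid influence degree, connectivity chance) of x toward y."""
    influence = len(adj[x])
    # A relation x-z is valid if z can still reach y without passing back through x.
    valid = sum(reaches(adj, z, y, blocked=x) for z in adj[x])
    return influence, valid, valid / influence

# Assumed toy versions of subgraphs (a), (b), and (c).
graph_a = build_adjacency([("V1", "V2"), ("V2", "V3")])
graph_b = build_adjacency([("V4", "V5"), ("V5", "V6"),
                           ("V4", "p"), ("V4", "q"), ("V6", "r"), ("V6", "s")])
graph_c = build_adjacency([("V7", "V8"), ("V7", "V9"), ("V7", "V10"),
                           ("V8", "V11"), ("V9", "V11"), ("V10", "V11")])

summary_a = influence_summary(graph_a, "V1", "V3")
summary_b = influence_summary(graph_b, "V4", "V6")
summary_c = influence_summary(graph_c, "V7", "V11")
```

Under these assumed edge lists, the connectivity chance of `V4` toward `V6` comes out as 1/3, while the endpoints in (a) and (c) both reach a connectivity chance of 1, matching the values quoted in the text.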
In this paper, we propose a novel similarity index, the noncontribution relations penalization (NRP) index, which penalizes noncontribution relations with a tunable penalization factor $\beta$ and uses random walk based valid influence diffusion. NRP differs clearly from the recently published SP Index [21], which is also a link prediction algorithm: NRP focuses on penalizing noncontribution relations at the endpoints and adds the bidirectional valid influence diffusions together to form the model, whereas SP considers only single-direction paths and focuses on the heterogeneity of different paths with respect to the intermediate nodes, without considering the endpoints. Besides its novelty, in experiments on nine benchmark datasets NRP shows a great improvement over seven mainstream baselines.
The rest of the paper is organized as follows: Section 2 introduces the new model based on penalization of noncontribution relations and significant connectivity; Sections 3 and 4 describe, respectively, the experimental materials (nine benchmark datasets) and the methods, including the metric and seven mainstream baselines; Section 5 presents the results and discussion; and the final section concludes.

Method
An undirected network $G(V, E)$ is considered, where $V$ and $E$ stand for the sets of nodes and links, respectively. Multiple links and self-connections are not allowed. For each pair of nodes $x, y \in V$, every algorithm referred to in this paper assigns a score $s_{xy}$. This score can be viewed as a measure of similarity between nodes $x$ and $y$, and hereinafter we do not distinguish between similarity and score. All the nonexistent links are sorted in decreasing order of their scores, and the links at the top are the most likely to exist. To test an algorithm's accuracy, the set of links $E$ is randomly divided into two parts: the training set $E^T$ is treated as known information, while the testing set $E^P$ is used for testing, and no information in this set may be used for training. Clearly, $E^T \cup E^P = E$ and $E^T \cap E^P = \emptyset$.
As described in the introduction and Figure 1, the probability of a link between two nonadjacent endpoints is based on the similarity established through valid influence diffusion, which can be regarded as a resource diffusion process [22]. Here, to emphasize the common transfer capability in bidirectional resource diffusion, we first extract the intermediate connectivity $C$, which ignores the transfer probabilities $1/k_x$ and $1/k_y$ emanating from the endpoints, in Definition 1 as follows.
Definition 1. On an undirected unweighted network $G(V, E)$, the intermediate connectivity $C(p)$ of an $l$-step path $p = \{v_0 = x, v_1, \ldots, v_{l-1}, v_l = y\}$ connecting $x$ and $y$ equals the product of the transfer probabilities from $v_1$ to $v_l$, or equivalently from $v_{l-1}$ to $v_0$, ignoring the transfer probability from the endpoint:

$$C(p) = \prod_{i=1}^{l-1} \frac{1}{k_{v_i}}, \qquad (1)$$

where $k_{v_i}$ denotes the degree of the intermediate node $v_i$.

Although the numbers of links differ across networks, the total resource in every network is the same and is simply set to one, so the resource of endpoint $x$ is normalized as $k_x/2|E|$, with $k_x$ denoting the influence degree of endpoint $x$ and $|E|$ denoting the number of links [22]. Importantly, the more valid resource one endpoint can deliver to its counterpart, the more similar the two endpoints are. Furthermore, we add together the bidirectional resource diffusion quantities on paths with length from 2 to $L$ and simplify the ultimate formalism in Definition 2 as our similarity index, the noncontribution relations penalization (NRP) index.

Definition 2. On an undirected unweighted network $G(V, E)$, the initial resource is usually assigned according to the importance of nodes; here, one simply sets the initial resource of node $x$ proportional to its normalized degree $k_x/2|E|$ [22]. The link prediction similarity index $\mathrm{sim}(x, y)|_l$ on paths of length $l$ is defined in (2). After merging, we obtain the processed equation (3), where $\beta$ is the penalty factor in $[0, +\infty)$, $|\mathrm{paths}(l)|$ is the number of paths of length $l$, $C(x, y)$ represents the intermediate connectivity with the impact of the endpoint removed, and the endpoint-degree term emphasizes the depressed influence of endpoints. Finally, we obtain the NRP index (4) by summing over path lengths from 2 to $L$. Since paths longer than three are expensive to compute and contribute little to link prediction, we consider only paths with $l = 2$ and $3$ in practice and in the later experiments [19, 22].
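Because the displayed forms of (2)-(4) are not available here, the following is only a sketch under an explicit assumption: that the merged index takes the form $\mathrm{sim}(x, y) = \frac{k_x^{\beta-1} + k_y^{\beta-1}}{2|E|} \sum_{l=2}^{3} \sum_{p} C(p)$, where the inner sum runs over the $l$-step paths between $x$ and $y$. This form is chosen to agree with the normalized resource $k_x/2|E|$, the intermediate connectivity of Definition 1, and the exponent $\beta - 1$ mentioned with the results; it should not be read as the authors' exact formula. The function name `nrp_scores` and the dense NumPy representation are illustrative choices.

```python
import numpy as np

def nrp_scores(adj, beta):
    """Score all node pairs under the ASSUMED merged NRP form described in the lead-in.

    adj  : symmetric 0/1 array with zero diagonal (the training network)
    beta : penalization factor, assumed to enter as the exponent beta - 1 on degree
    """
    adj = np.asarray(adj, dtype=float)
    degree = adj.sum(axis=1)
    num_links = adj.sum() / 2.0

    # 1 / k for every node; isolated nodes get 0 so they never relay resource.
    inv_degree = np.divide(1.0, degree, out=np.zeros_like(degree), where=degree > 0)

    # relay = A D^-1: column j of the adjacency matrix scaled by 1 / k_j.
    relay = adj * inv_degree
    # (A D^-1 A)[x, y] sums C(p) over the 2-step paths between x and y.
    two_step = relay @ adj
    # One more relay step covers 3-step walks; for nonadjacent x and y in a
    # simple graph every such walk is a simple path.
    three_step = relay @ two_step
    path_term = two_step + three_step

    # Assumed endpoint factor k^(beta - 1); isolated nodes contribute 0.
    endpoint = np.zeros_like(degree)
    endpoint[degree > 0] = degree[degree > 0] ** (beta - 1.0)

    return (endpoint[:, None] + endpoint[None, :]) * path_term / (2.0 * num_links)

# Hypothetical toy network: nodes 0 and 2 share neighbor 1 and each has two
# further neighbors that lead nowhere (noncontribution relations).
edges = [(0, 1), (1, 2), (0, 3), (0, 4), (2, 5), (2, 6)]
toy = np.zeros((7, 7))
for u, v in edges:
    toy[u, v] = toy[v, u] = 1.0

score_penalized = nrp_scores(toy, beta=0.0)[0, 2]
score_neutral = nrp_scores(toy, beta=1.0)[0, 2]
```

Under the assumed form, `beta = 1` makes the endpoint factor constant, so the score depends only on the intermediate connectivity, while values below 1 lower the scores of pairs whose endpoints have high degree.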

Experiments Data
Experiments are performed on nine real-world networks (datasets are freely downloaded from the following academic websites: http://vlado.fmf.uni-lj.si/pub/networks/data, http://wiki.gephi.org/index.php?title=Datasets, http://lovro.lpt.fri.uni-lj.si/support.jsp, http://konect.uni-koblenz.de/networks/, and http://www.linkprediction.org/index.php/link/resource/data). We converted arcs into undirected links and removed loops and multilinks to make them simple networks.
(i) Network US Air97 (USAir) [23] is the network of the US air transportation system.
(ii) Network Yeast PPI (Yeast) [24] is the protein-protein interaction network of yeast.
(iii) Network NetScience (NS) [25] is the network of coauthorships between scientists publishing on the topic of networks.
(iv) Network Jazz [26] is the network of Jazz musicians.
(v) Network C.elegans (CE) [27] is the neural network of the nematode worm C. elegans.
(vi) Network Slavko [28] is the Facebook friendship network of Slavko Žitnik.
(vii) Network Email (Email) [29] is the email communication network of University of Rovira i Virgili (URV) in Tarragona, Spain.
(viii) Network Infectious (Infec) [30] is the face-to-face contact network of people during the exhibition "Infectious: Stay Away" in 2009 at the Science Gallery in Dublin.
(ix) Network EuroSiS (ES) [31] is the mapping network between Science in Society actors on the Web of 12 European countries.
Table 1 reports the basic topological features of these networks.
Each dataset is randomly divided into a training set $E^T$ containing 90% of the links and a testing set $E^P$ containing the remaining 10%.
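A random 90%-10% division of this kind can be sketched as follows. The function name `split_links` and its parameters are hypothetical, and the sketch makes no attempt to keep the training network connected, since the text does not say whether that was enforced.

```python
import random

def split_links(links, test_fraction=0.1, seed=None):
    """Randomly split an undirected link list into (training, testing) parts."""
    rng = random.Random(seed)
    shuffled = list(links)
    rng.shuffle(shuffled)
    n_test = int(round(test_fraction * len(shuffled)))
    testing = shuffled[:n_test]    # held out, never used for training
    training = shuffled[n_test:]   # treated as known information
    return training, testing

# Hypothetical toy link list: a chain of 21 nodes (20 links).
toy_links = [(i, i + 1) for i in range(20)]
training_set, testing_set = split_links(toy_links, test_fraction=0.1, seed=0)
```

Repeating the call with different seeds gives the independent random divisions over which the reported AUC values are averaged.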

Experiments Metrics
The area under the ROC curve (AUC) [32], an accuracy metric, can be interpreted as the probability that a randomly chosen missing link (a link in $E^P$) is given a higher score than a randomly chosen nonexistent link (a link in $U \setminus E$, where $U$ denotes the universal link set). In the implementation, if among $n$ independent comparisons the missing link has the higher score $n'$ times and the two links have the same score $n''$ times, AUC can be calculated as follows:

$$\mathrm{AUC} = \frac{n' + 0.5\,n''}{n}.$$

AUC estimates the accuracy of the index globally: if all the scores were generated from an independent and identical distribution, the AUC should be about 0.5. Therefore, the degree to which the AUC exceeds 0.5 indicates how much better the algorithm performs than pure chance.
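The sampling procedure described above can be sketched in a few lines. The function name `sampled_auc`, the callable `score`, and the default of 10000 comparisons are assumptions made for illustration; the text does not state how many comparisons were drawn.

```python
import random

def sampled_auc(score, testing_links, nonexistent_links, n=10000, seed=None):
    """Estimate AUC from n independent comparisons.

    score             : callable taking two nodes and returning a similarity score
    testing_links     : list of (x, y) pairs held out as missing links
    nonexistent_links : list of (x, y) pairs that are links in neither set
    """
    rng = random.Random(seed)
    higher = 0  # times the missing link scores strictly higher
    same = 0    # times both links receive the same score
    for _ in range(n):
        missing = rng.choice(testing_links)
        absent = rng.choice(nonexistent_links)
        s_missing = score(*missing)
        s_absent = score(*absent)
        if s_missing > s_absent:
            higher += 1
        elif s_missing == s_absent:
            same += 1
    return (higher + 0.5 * same) / n
```

A score function that cannot distinguish the two kinds of link yields exactly 0.5 here, because every comparison is a tie.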

Baselines.
For comparison, we introduce seven classical algorithms as follows.
(1) Common Neighbors (CN) Index [13]: if two endpoints are similar, they are likely to have many common neighbors, so the index counts the common neighbors:

$$s_{xy}^{\mathrm{CN}} = |\Gamma(x) \cap \Gamma(y)|,$$

where $\Gamma(x)$ is the set of neighbors of endpoint $x$ and $\Gamma(x) \cap \Gamma(y)$ denotes the set of common neighbors of endpoints $x$ and $y$.
(2) Preferential Attachment (PA) Index [14]: the probability that a new link will connect $x$ and $y$ is assumed to be proportional to $k_x \times k_y$:

$$s_{xy}^{\mathrm{PA}} = k_x \times k_y,$$

where $k_x$ and $k_y$ denote the degrees of nodes $x$ and $y$, respectively.
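The two baselines defined above translate directly into code. The adjacency dictionary `toy` below is a hypothetical example, not one of the benchmark networks.

```python
def common_neighbors(adj, x, y):
    """CN score: number of neighbors shared by x and y."""
    return len(adj[x] & adj[y])

def preferential_attachment(adj, x, y):
    """PA score: product of the degrees of x and y."""
    return len(adj[x]) * len(adj[y])

# Hypothetical toy network: a and d share neighbors b and c; d also links to e.
toy = {
    "a": {"b", "c"},
    "b": {"a", "d"},
    "c": {"a", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}

cn_ad = common_neighbors(toy, "a", "d")
pa_ad = preferential_attachment(toy, "a", "d")
```

For the nonadjacent pair `a` and `d`, CN counts the two shared neighbors, while PA uses only the two degrees and therefore cannot tell whether any of the relations of `d` actually lead toward `a`.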

Results and Discussions
Because long paths contribute little and are expensive to compute, the experiments on NRP in the nine networks consider only paths of length up to 3 (following [19, 22]). After obtaining the AUC values from the experiments, in Section 5.1 we demonstrate the importance of penalizing noncontribution relations in NRP through Figure 2, whose nine subgraphs correspond to the nine datasets. We also explain the performance in the nine subgraphs for different values of the penalization factor $\beta$, searching for the optimal values in [0, 5], a range wide enough to contain them. Then, we compare our algorithm with CN, PA, AA, RA, LP, BLP, and SRW in Table 2. Finally, the computational complexity is discussed in Section 5.3.

Performances of Penalization on Noncontribution Relations.
The nine subgraphs in Figure 2 show the AUC of NRP as a function of the penalization on noncontribution relations of endpoint degrees, with the penalization factor $\beta$ varying in the range [0, 5]. Each AUC value is the average over 10 independent runs with random divisions into training and testing sets. The horizontal axis represents the degree of penalization on endpoint degrees. According to (2) and (3), $\beta < 1$, that is, $\beta - 1 < 0$, means that endpoints with noncontribution relations are penalized; the smaller $\beta$ is, the more severely the noncontribution relations are suppressed, and vice versa. In Figure 2, the optimal $\beta$ values of all nine datasets, at which the AUC curves peak, lie between 0 and 1: USAir at 0.38, Yeast at 0.77, NS at 0.86, CE at 0.11, Jazz at 0.36, Infec at 0.5, Slavko at 0.63, Email at 0.84, and ES at 0.63. Thus CE receives the strongest penalization and NS the weakest. When $\beta > 1$, the AUC decreases severely compared with the optimal AUC. That the optimum consistently falls at $\beta < 1$ across nine different kinds of datasets confirms that the noncontribution relations of endpoint degrees should be penalized, and that performance worsens if they are promoted with $\beta > 1$.

Comparison on AUC.
To demonstrate the prediction ability, we report the performance of the NRP index with the optimal $\beta$ value on each of the nine datasets (see Table 2). Notice that those datasets represent different kinds of networks with heterogeneous topological features (see Table 1) and disparate organization principles; the comparison highlights that NRP works well consistently in various situations. Analyzing the difference in performance between NRP and the baselines, we find that it is explained by the penalization on large-degree nodes with many noncontribution relations and by the emphasis on connectivity based on random walk. PA simply assumes that the higher the degrees of the two endpoints, the more similar they are, ignoring both the noncontribution relations in the endpoint degrees and the significant connectivity between endpoints, which leads to the poorest accuracy, especially in Yeast. CN instead considers connectivity by counting the common neighbors on 2-hop paths but ignores the influence of endpoint degrees and the connectivity of long paths, resulting in improved but still inferior performance. Further, AA and RA extend CN by penalizing intermediate large-degree nodes and, not surprisingly, perform better than CN. However, like CN, they ignore the influence of endpoint degrees and long paths, so AA and RA still perform unsatisfactorily, for example, in Yeast. In contrast, LP and BLP take long paths into account and thus perform better on many networks, such as ES. However, lacking consideration of the influence of endpoint degree, LP and BLP predict missing links less accurately than NRP, for example, in USAir. An exception is BLP, which performs worse even than CN in USAir, Jazz, and CE because the form of its coefficient cannot reach an optimal value that would let it clearly outperform CN. SRW, which considers the influence of high endpoint degree and of long paths based on random walk, further improves on the former algorithms, but its neglect of penalization on endpoint degrees with noncontribution relations prevents it from achieving better prediction performance. Overall, NRP, which leverages the valid influence of endpoint degree with penalization on noncontribution relations and connectivity based on random walk, outperforms all the other mainstream algorithms on the nine datasets.

Complexity Discussion.
In applications, low computational complexity is another important concern in the design of a prediction algorithm. As is well known, the time complexity of the product of two $N \times N$ matrices is $O(N^3)$.
From the definitions of CN, PA, AA, RA, LP, BLP, and SRW, PA has time complexity $O(N^2)$; CN, AA, and RA have time complexity $O(N^3)$; and LP, BLP, and SRW have complexity $c \times O(N^3)$ with coefficient $c \ll N^3$. Our index has the same time complexity, $c \times O(N^3)$, as LP, BLP, and SRW, which is higher than that of CN, PA, AA, and RA, yet it shows stronger performance than all of them.
Overall, our index, NRP, achieves the best performance with little increase in complexity.

Conclusions
In research on topological similarity based link prediction, we find that not all relations counted in the degree of an endpoint contribute to valid influence diffusion, because noncontribution relations exist. Accordingly, we expect that penalizing the noncontribution relations in endpoint degrees with a penalization factor $\beta$ in the valid influence diffusion process can improve link prediction performance. Therefore, this paper proposes a model named NRP, based on noncontribution relations penalization of endpoint degree and random walk based valid influence diffusion.
To evaluate our model, experiments are carried out on nine benchmark networks, and the comparison with CN, PA, AA, RA, LP, BLP, and SRW confirms our expectation that penalization on noncontribution relations of endpoint degree can greatly improve the accuracy of link prediction. Given the various structures and properties of the experimental datasets, our model can be used in many applications of link prediction, such as transportation planning, biological reactions, friend recommendation in social networks, and disease prevention.

Figure 1: Illustration of noncontribution relations and valid influence diffusion. With the same valid influence degree of 1 as subgraph (a) but a smaller connectivity chance (1/3 versus 1), subgraph (b) has a lower link likelihood than (a) because of the two noncontribution relations attached to $V_4$ and $V_6$; meanwhile, subgraph (c), with the highest valid influence degree of 3 and the same connectivity chance as (a), obtains the greatest link likelihood. Overall, we rank the link likelihood as (c) > (a) > (b).

Table 2: Prediction accuracy measured by AUC on the nine benchmark networks. Each data point is an average over 10 independent runs, each corresponding to a random 90%-10% division into training set and testing set. All the presented results are optimal values. Numbers in brackets are the standard deviations.
Table 2 reports the average AUC values of NRP and the baselines. The results in the table are all optimal, and the best values are emphasized in bold. NRP achieves the best performance on all nine datasets (see boldface in Table 2).