Research Article Predicting Missing Links Based on a New Triangle Structure

With the rapid growth of various complex networks, link prediction has become increasingly important because it can discover missing information and predict future interactions between nodes in a network. Recently, the CAR and CCLP indexes have been presented for link prediction by means of different triangle structure information. However, both indexes may lose the contributions of some shared neighbors. We propose in this work a new index to make up for this weakness and thereby improve the accuracy of link prediction. The proposed index focuses on a new triangle structure, i.e., the triangle formed by one seed node, one common neighbor, and one other node. It emphasizes the importance of these triangles but does not ignore the contribution of any common neighbor. In addition, the proposed index adopts the theory of resource allocation by penalizing large-degree neighbors. The results of comparison with CN, AA, RA, ADP, CAR, CAA, CRA, and CCLP on 12 real-world networks show that the proposed index outperforms the compared methods in terms of AUC and ranking score.


Introduction
As a fundamental research topic in complex network analysis, link prediction has a wide range of applications in both theory and practice, such as analysis of network evolution [1,2], recommendation systems [3], and checking potential interactions between proteins in biological networks [4,5]. The basic task of link prediction is to estimate the missing or latent links between unconnected nodes in a network [6,7]. To date, a host of algorithms and models have been proposed for link prediction [6,8,9]. Reference [8] groups them into two categories: similarity-based approaches and learning-based approaches. A similarity-based approach computes similarity scores between unconnected nodes based on the known information. Then, a list of node pairs ranked in descending order of similarity score is obtained, and the node pairs at the top are considered most likely to have links. A learning-based approach formalizes the link prediction problem as a binary classification task [10] and uses machine learning methods to solve it. The key job in a learning-based approach is to construct the feature vectors of node pairs. In general, learning-based approaches are more complicated than similarity-based ones.
The hypothesis behind similarity-based approaches is that the more similar two nodes are, the more likely a link exists between them [8]. This idea is simple and intuitive. Thus, the study of this kind of approach has become the mainstream [6,9]. The Common Neighbors (CN) index [11], as its name suggests, simply counts the number of common neighbors between two nodes. The Adamic-Adar (AA) [12] and Resource Allocation (RA) [13] indexes are two variants of the CN index; they penalize the contributions of large-degree common neighbors. These indexes are called local methods because they only use local structure information. In addition, some global and quasilocal methods have been proposed, such as Katz [14], SimRank [15], Random Walks with Restart [16], Local Path [17], FriendLink [18], and Local Random Walk [19].
With the growing size of complex networks, local methods remain good candidates because they run faster than global and quasilocal methods. Therefore, we focus in this study on local methods. Recently, Cannistraci et al. proposed the CAR index [20], which suggests that links between the common neighbors, i.e., local-community-links (LCLs), are more valuable than common neighbors in link prediction. In the CAR index, a local community is a triangle passing through two common neighbors and one seed node. In the example network shown in Figure 1(a), the two seed nodes have four common neighbors, and there is one LCL between these common neighbors (see Figure 1(b)). Thus, the CAR index assigns a similarity score of four to the seed nodes. However, if we remove that LCL, CAR will assign a zero similarity score to the seed nodes, even though they still have four common neighbors. In addition, the idea of LCL is also plugged into the AA, RA, and Jaccard indexes [20]. Later, Wu et al. proposed the CCLP index [24] based on the clustering coefficients of common neighbors. This index considers all triangles passing through a common neighbor. For the example network in Figure 1(a), triangles pass through three of the four common neighbors (see Figure 1(c)). Thus, the CCLP index accumulates the clustering coefficients of these three nodes when calculating the similarity between the seed nodes but entirely neglects the contribution of the fourth. In real-world networks, it is possible that no triangles pass through some or even all shared neighbors of a node pair. Thus, the CAR and CCLP indexes may assign a very low or even zero similarity score to the node pair, even if it has many common neighbors.
In this paper, we define a new type of triangle structure, called the TRA-triangle, which is formed by one seed node, one common neighbor, and one other node (see Figure 1(d)). Based on the TRA-triangle, a new similarity index, namely, the TRA index, is proposed for link prediction. This index suggests that the common neighbors that can form TRA-triangles with a seed node are more important than others. In addition, the proposed index also penalizes large-degree neighbors, as done in the RA index [13]. Although the TRA, CAR-based, and CCLP indexes are all based on triangle structures, the intuitions behind them are different. The CAR-based indexes hold that LCLs are more valuable than common neighbors. The CCLP index is inspired by the CAR index but employs all triangles passing through common neighbors, while the TRA index, which only uses the TRA-triangles, strikes a balance between CAR and CCLP. Furthermore, as mentioned above, the CAR-based and CCLP indexes lose the contribution of those common neighbors with no triangles passing through them, whereas the TRA index counts the contribution of all kinds of common neighbors. Therefore, the TRA index can achieve better prediction accuracy than the CAR-based indexes and the CCLP index. The accuracy of the TRA index is evaluated on 12 real-world networks from various fields. The experimental results show that our index outperforms the CAR-based indexes and the CCLP index. Take the HEP network, which is very sparse, as an example: the improvements made by TRA over CAR and CCLP, under the AUC metric, are up to 26.9% and 4.2%, respectively.
The rest of the paper is structured as follows. In Section 2, we give the description of the link prediction problem and the evaluation metrics, list the compared methods and networks, and describe the Wilcoxon signed-ranks test. Section 3 introduces the proposed method. In Section 4, the experimental results and performance analysis of the proposed method are presented. Finally, Section 5 concludes this work.

Preliminaries
2.1. Problem Description and Metric. Consider an undirected and unweighted network $G(V, E)$, in which $V$ and $E$ are the node set and link set, respectively; in this study, multilinks and self-loops are not allowed. Let $N = |V|$ be the number of nodes in the network, and let $U$ be the universal set of possible links, which contains $N(N-1)/2$ links. Then, the set of nonobserved or nonexistent links is $U - E$. Supposing that some links in $U - E$ are in fact missing links, the task of link prediction is to find them. A similarity-based approach assigns a similarity score to each node pair in $U - E$ and assumes that the higher the score of a node pair, the more likely there is a link between the two nodes.
To test the performance of a similarity index, we randomly divide the link set $E$ into two parts: the training set $E^T$ and the testing set $E^P$, such that $E = E^T \cup E^P$ and $E^T \cap E^P = \emptyset$. $E^T$ is treated as the observed information, and $E^P$ is used for testing. Two parameter-free metrics are employed to quantify the accuracy of link prediction algorithms: AUC [6] and ranking score [21,22]. In this setting, the AUC score can be interpreted as the probability that a randomly selected missing link (i.e., a link in $E^P$) is given a higher score than a randomly selected nonexistent link (i.e., a link in $U - E$). In practice, if we perform $n$ independent comparisons, and there are $n_1$ times that the missing link has the higher score and $n_2$ times that the two have the same score, the AUC value is computed as
$$\mathrm{AUC} = \frac{n_1 + 0.5\,n_2}{n}.$$
The ranking score (RS) takes into consideration the ranks of the links in the testing set after sorting in descending order of similarity score. Let $H = U - E^T$ be the set of nonobserved links. Let $e_i$ be a missing link in $E^P$ and $r_i$ be its rank. The ranking score of $e_i$ is defined as $\mathrm{RS}(e_i) = r_i / |H|$, and the ranking score of the link prediction result is
$$\mathrm{RS} = \frac{1}{|E^P|} \sum_{e_i \in E^P} \mathrm{RS}(e_i) = \frac{1}{|E^P|} \sum_{e_i \in E^P} \frac{r_i}{|H|}.$$
Note that a higher AUC value is better, whereas a smaller ranking score is better.
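The two metrics can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the function names `auc_score` and `ranking_score`, the `score` callable, and the toy scores are assumptions made for the example, and ties in the ranking are broken by sort order, which the text does not specify.

```python
import random


def auc_score(score, probe_links, nonexistent_links, n=10000, seed=0):
    """Estimate AUC from n random comparisons as (n1 + 0.5 * n2) / n."""
    rng = random.Random(seed)
    n1 = n2 = 0
    for _ in range(n):
        s_missing = score(*rng.choice(probe_links))
        s_absent = score(*rng.choice(nonexistent_links))
        if s_missing > s_absent:
            n1 += 1
        elif s_missing == s_absent:
            n2 += 1
    return (n1 + 0.5 * n2) / n


def ranking_score(score, probe_links, nonexistent_links):
    """Average of r_i / |H| over the probe links, where H holds all candidates."""
    candidates = list(probe_links) + list(nonexistent_links)
    ranked = sorted(candidates, key=lambda link: score(*link), reverse=True)
    rank = {link: position for position, link in enumerate(ranked, start=1)}
    total = sum(rank[link] / len(candidates) for link in probe_links)
    return total / len(probe_links)


# Hypothetical similarity scores for four candidate node pairs.
toy_scores = {("a", "b"): 3.0, ("c", "d"): 1.0, ("a", "c"): 2.0, ("b", "d"): 0.0}
probe = [("a", "b"), ("c", "d")]     # missing links (testing set)
absent = [("a", "c"), ("b", "d")]    # nonexistent links


def toy_score(u, v):
    return toy_scores[(u, v)]


print("ranking score:", ranking_score(toy_score, probe, absent))
print("sampled AUC:", auc_score(toy_score, probe, absent))
```

In this toy case the probe links occupy ranks 1 and 3 of four candidates, and three of the four possible probe/nonexistent comparisons favour the probe link, so the sampled AUC should be close to 0.75.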

Local Similarity Indexes
To date, many similarity indexes have been proposed for link prediction [6,8,9]. Here, we list some local similarity indexes that will be used in our experiments for the purpose of comparison.
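(1) The Common Neighbors (CN) index [11] defines the similarity between $x$ and $y$ as the number of their common neighbors:
$$\mathrm{CN}(x, y) = |\Gamma(x) \cap \Gamma(y)|,$$
where $\Gamma(x)$ denotes the set of neighbors of node $x$.
(2) The Adamic-Adar (AA) index [12] is a variant of the CN index, which holds that small-degree neighbors contribute more than large-degree neighbors when computing similarity. Its definition is
$$\mathrm{AA}(x, y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{\log k_z},$$
where $k_z$ is the degree of node $z$.
(3) The Resource Allocation (RA) index [13] defines the similarity between $x$ and $y$ as the amount of resource that $y$ receives from $x$ through their common neighbors:
$$\mathrm{RA}(x, y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{k_z}.$$
(4) The Adaptive Degree Penalization (ADP) index [23] penalizes a common neighbor according to its degree and the average clustering coefficient of the network. Therefore, it can automatically adapt to the network. The definition of the ADP index is
$$\mathrm{ADP}(x, y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} k_z^{-\beta C},$$
where $\beta$ is a constant and $C$ is the average clustering coefficient of the network. We set $\beta = 2.5$, as suggested by the authors.
(5) The CAR index [20] suggests that two seed nodes are more likely to link together if there are links between their common neighbors. It is defined as
$$\mathrm{CAR}(x, y) = \mathrm{CN}(x, y) \cdot \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{\gamma(z)}{2},$$
where $\gamma(z)$ is the number of links between $z$ and the other common neighbors of $x$ and $y$.
(6) The CAA and CRA indexes [20] are generated by plugging the idea of the CAR index into the AA and RA indexes, respectively. They are defined as
$$\mathrm{CAA}(x, y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{\gamma(z)}{\log k_z}, \qquad \mathrm{CRA}(x, y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{\gamma(z)}{k_z}.$$
(7) The CCLP index [24] computes the similarity between $x$ and $y$ by employing the clustering coefficients of their common neighbors:
$$\mathrm{CCLP}(x, y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} CC_z,$$
where $CC_z$ denotes the clustering coefficient of node $z$, which is
$$CC_z = \frac{t_z}{k_z (k_z - 1)/2},$$
in which $t_z$ is the number of triangles passing through node $z$.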
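As an illustration of how these local indexes behave, the following Python sketch implements CN, AA, RA, CAR, CRA, and CCLP on adjacency sets. It rests on several assumptions: the function names are hypothetical, the natural logarithm is used for AA, and the toy graph is constructed for this example (it is not the network of Figure 1), with seed nodes `x`, `y` and common neighbors `a`, `b`, `c`, `d`.

```python
import math


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbors}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def common(adj, x, y):
    return adj[x] & adj[y]


def cn(adj, x, y):
    return len(common(adj, x, y))


def aa(adj, x, y):
    # A common neighbor has degree >= 2, so the logarithm is positive.
    return sum(1 / math.log(len(adj[z])) for z in common(adj, x, y))


def ra(adj, x, y):
    return sum(1 / len(adj[z]) for z in common(adj, x, y))


def gamma(adj, x, y, z):
    """Number of links between z and the other common neighbors of x and y."""
    return len(adj[z] & common(adj, x, y))


def car(adj, x, y):
    lcl = sum(gamma(adj, x, y, z) for z in common(adj, x, y)) / 2
    return cn(adj, x, y) * lcl


def cra(adj, x, y):
    return sum(gamma(adj, x, y, z) / len(adj[z]) for z in common(adj, x, y))


def triangles(adj, z):
    """Number of triangles passing through z."""
    return sum(len(adj[z] & adj[w]) for w in adj[z]) // 2


def cclp(adj, x, y):
    total = 0.0
    for z in common(adj, x, y):
        k = len(adj[z])
        total += triangles(adj, z) / (k * (k - 1) / 2)
    return total


# Hypothetical toy graph: seeds x, y share the neighbors a, b, c, d.
EDGES = [("x", "a"), ("x", "b"), ("x", "c"), ("x", "d"),
         ("y", "a"), ("y", "b"), ("y", "c"), ("y", "d"),
         ("a", "b"), ("a", "g"), ("x", "g"),
         ("d", "e"), ("d", "f"), ("e", "f")]
adj = build_adjacency(EDGES)
print(cn(adj, "x", "y"), car(adj, "x", "y"), cclp(adj, "x", "y"))

# Removing the only link between common neighbors drives CAR to zero.
adj_no_lcl = build_adjacency([e for e in EDGES if e != ("a", "b")])
print(cn(adj_no_lcl, "x", "y"), car(adj_no_lcl, "x", "y"))
```

On this toy graph, CAR depends entirely on the single link between `a` and `b`: once it is removed, the pair keeps its four common neighbors but its CAR score drops to zero, while CCLP ignores `c`, which lies on no triangle.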

Networks.
In this study, we use 12 real-world networks drawn from various fields to evaluate the effectiveness of link prediction methods.
(1) Advogato (ADV): a social network whose users are mainly free and open source software developers [25].
(3) Dolphin: a social network of 62 dolphins in a community living off Doubtful Sound, New Zealand [27].
(4) Email: a network of email interchanges between members of a university [28].
(7) HEP: the coauthorships network of scientists who posted preprints on the high-energy theory archive from 1995 to 1999 [31].
(8) Karate: the social network of a karate club at a US university [32].
(10) USAir: a network of the US air transportation system [6].
(11) Word: an adjacency network of common adjectives and nouns in the novel "David Copperfield" by Charles Dickens [34].
In this work, all the aforementioned networks are treated as undirected and unweighted networks, and only the giant component of each network is used. Table 1 lists the basic statistics of the giant components of these networks.
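Given a network $G(V, E)$, let $x$ and $y$ be two seed nodes. $(x, y)$ is called a seed node pair with common neighbors if $x$ and $y$ have at least one common neighbor. $S_\Lambda$ denotes the set of seed node pairs with common neighbors, formally
$$S_\Lambda = \{(x, y) \mid \Gamma(x) \cap \Gamma(y) \neq \emptyset\},$$
where $(x, y)$ ranges over the seed node pairs. Let $x$, $y$ be two seed nodes, and let $z$ be one of their common neighbors. If $t_z = 0$, we call $z$ a zero-triangle-neighbor; otherwise, $z$ is a triangle-neighbor. If $\gamma(z) \neq 0$, $z$ is called a CAR-triangle-neighbor, and if $\triangle(x, y; z) \neq 0$ (see (18)), $z$ is called a TRA-triangle-neighbor. Let $\Omega_\triangle$ be the set of triangle-neighbors, and let $\Omega_{CAR}$ and $\Omega_{TRA}$ denote the sets of CAR- and TRA-triangle-neighbors, respectively. Clearly, $\Omega_{CAR} \subseteq \Omega_{TRA} \subseteq \Omega_\triangle$. Let $S^{\exists}(\Omega_\triangle)$ and $S^{\forall}(\Omega_\triangle)$ be two subsets of $S_\Lambda$. For any pair in $S^{\exists}(\Omega_\triangle)$, at least one of the shared neighbors is not a triangle-neighbor, and for any pair in $S^{\forall}(\Omega_\triangle)$, none of the shared neighbors is a triangle-neighbor. More explicitly,
$$S^{\exists}(\Omega_\triangle) = \{(x, y) \in S_\Lambda \mid \exists z \in \Gamma(x) \cap \Gamma(y),\ z \notin \Omega_\triangle\},$$
$$S^{\forall}(\Omega_\triangle) = \{(x, y) \in S_\Lambda \mid \forall z \in \Gamma(x) \cap \Gamma(y),\ z \notin \Omega_\triangle\}.$$
Similarly, we define $S^{\exists}(\Omega_{CAR})$, $S^{\forall}(\Omega_{CAR})$, $S^{\exists}(\Omega_{TRA})$, and $S^{\forall}(\Omega_{TRA})$ by replacing $\Omega_\triangle$ with $\Omega_{CAR}$ and $\Omega_{TRA}$, respectively. Correspondingly, the ratios of those subsets to $S_\Lambda$ are defined as
$$\rho^{\exists}(\Omega) = \frac{|S^{\exists}(\Omega)|}{|S_\Lambda|}, \qquad \rho^{\forall}(\Omega) = \frac{|S^{\forall}(\Omega)|}{|S_\Lambda|}, \qquad \Omega \in \{\Omega_\triangle, \Omega_{CAR}, \Omega_{TRA}\}.$$
Table 2 lists these ratios over the 12 networks.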
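The classification of shared neighbors and the resulting ratios can be sketched as follows. This is only an illustration under stated assumptions: seed node pairs are taken to be unconnected pairs with at least one common neighbor, the names `neighbor_kinds` and `subset_ratios` are hypothetical, and the toy graph is made up for the example rather than taken from Figure 1 or Table 2.

```python
from itertools import combinations


def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def triangles_through(adj, z):
    """Number of triangles passing through z."""
    return sum(len(adj[z] & adj[w]) for w in adj[z]) // 2


def neighbor_kinds(adj, x, y):
    """Return the triangle-, CAR-triangle-, and TRA-triangle-neighbors of (x, y)."""
    shared = adj[x] & adj[y]
    tri = {z for z in shared if triangles_through(adj, z) > 0}
    car = {z for z in shared if adj[z] & shared}
    tra = {z for z in shared if (adj[x] & adj[z]) or (adj[y] & adj[z])}
    return tri, car, tra


def subset_ratios(adj):
    """Share of seed pairs with at least one ('exists') or only ('forall')
    shared neighbors outside each neighbor set."""
    kinds = ("triangle", "CAR", "TRA")
    counts = {(q, kind): 0 for q in ("exists", "forall") for kind in kinds}
    n_pairs = 0
    for x, y in combinations(sorted(adj), 2):
        shared = adj[x] & adj[y]
        if y in adj[x] or not shared:
            continue  # assumption: seed pairs are unconnected and share a neighbor
        n_pairs += 1
        for kind, members in zip(kinds, neighbor_kinds(adj, x, y)):
            outside = shared - members
            if outside:
                counts["exists", kind] += 1
            if outside == shared:
                counts["forall", kind] += 1
    if n_pairs == 0:
        raise ValueError("no seed node pair with common neighbors")
    return {key: value / n_pairs for key, value in counts.items()}


# Hypothetical toy graph: seeds x, y share the neighbors a, b, c, d.
EDGES = [("x", "a"), ("x", "b"), ("x", "c"), ("x", "d"),
         ("y", "a"), ("y", "b"), ("y", "c"), ("y", "d"),
         ("a", "b"), ("a", "g"), ("x", "g"),
         ("d", "e"), ("d", "f"), ("e", "f")]
adj = build_adjacency(EDGES)
print(neighbor_kinds(adj, "x", "y"))
ratios = subset_ratios(adj)
for key in sorted(ratios):
    print(key, round(ratios[key], 3))
```

Because every CAR-triangle-neighbor is also a TRA-triangle-neighbor, and every TRA-triangle-neighbor is a triangle-neighbor, the CAR ratios can never be smaller than the TRA ratios, which in turn can never be smaller than the triangle ratios.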

Wilcoxon Signed-Ranks Test
The Wilcoxon signed-ranks test is a nonparametric statistical hypothesis test used to check whether two methods perform equally well over multiple networks [38,39]. Let $d_i$ be the difference in performance scores of two link prediction methods on the $i$th network.
The differences are ranked in accordance with their absolute values; in case of ties, average ranks are assigned. Let $R^+$ be the sum of ranks for the networks on which the second method outperformed the first, and $R^-$ the sum of ranks for the opposite. For a larger number of networks, the statistic
$$z = \frac{T - \frac{1}{4} m (m + 1)}{\sqrt{\frac{1}{24} m (m + 1)(2m + 1)}} \tag{16}$$
is distributed approximately normally [39]. In (16), $T = \min(R^+, R^-)$ and $m$ is the number of networks.
With $\alpha = 0.05$, if $z$ is smaller than $-1.96$, we reject the null hypothesis, which states that both methods perform equally well.
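The test statistic can be computed with a short Python sketch. The function names are hypothetical, the difference is taken as the second method's score minus the first's, and zero differences are split evenly between the two rank sums; the last choice is an assumption, since the text does not say how zero differences are handled.

```python
import math


def average_ranks(values):
    """Rank values in ascending order, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1          # positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def wilcoxon_z(first, second):
    """Return (R+, R-, z) for the paired scores of two methods."""
    d = [b - a for a, b in zip(first, second)]
    ranks = average_ranks([abs(v) for v in d])
    r_plus = sum(r for r, v in zip(ranks, d) if v > 0)
    r_minus = sum(r for r, v in zip(ranks, d) if v < 0)
    zero_ranks = sum(r for r, v in zip(ranks, d) if v == 0)
    r_plus += zero_ranks / 2               # assumption: split zero differences evenly
    r_minus += zero_ranks / 2
    t = min(r_plus, r_minus)
    m = len(d)
    z = (t - m * (m + 1) / 4) / math.sqrt(m * (m + 1) * (2 * m + 1) / 24)
    return r_plus, r_minus, z


# Hypothetical scores on 12 networks: the second method always wins.
first = [0.0] * 12
second = [0.01 * i for i in range(1, 13)]
r_plus, r_minus, z = wilcoxon_z(first, second)
print(r_plus, r_minus, round(z, 4), "reject" if z < -1.96 else "keep")
```

With 12 networks and the second method better on every one of them, all ranks fall into one sum, so the other sum is zero and the statistic falls well below the −1.96 threshold.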

The Proposed Index
The link prediction problem is closely related to the network evolving mechanism [2,40]. A recently proposed triangle growth mechanism demonstrates that various key features observed in most real-world networks can be generated in simulated networks [41]. Therefore, triangle structure information has an important effect on link formation.
In this work, we focus on a new triangle structure, namely, the TRA-triangle. A TRA-triangle passes through one seed node, one common neighbor, and one other node. In our opinion, the common neighbors that can form TRA-triangles are more important than others. Given two nodes $u$ and $v$, we denote the number of triangles passing through them as $\triangle(u, v)$, which is
$$\triangle(u, v) = |\Gamma(u) \cap \Gamma(v)|. \tag{17}$$
For the example network in Figure 1(a), the triangles used for the two seed nodes $x$ and $y$ are shown in Figure 1(d). For one of their common neighbors, say $z$, $\triangle(x, z) = 2$ and $\triangle(y, z) = 1$; thus, $z$ is in closer contact with $x$ than with $y$. Given seed nodes $x$ and $y$, let $z$ be one of their common neighbors. The function $\triangle(x, y; z)$ sums the numbers of TRA-triangles formed by $x$, $z$ and by $y$, $z$, which is
$$\triangle(x, y; z) = \triangle(x, z) + \triangle(y, z). \tag{18}$$
In this paper, we propose a new similarity index by combining the aforementioned triangle structure and the idea of the RA index [13]. For convenience, we name our new method the TRA index. Its definition is
$$\mathrm{TRA}(x, y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1 + \triangle(x, y; z)/2}{k_z}, \tag{19}$$
where $k_z$ is the degree of node $z$. In (19), the numerator is $1 + \triangle(x, y; z)/2$. Therefore, the TRA index does not miss the effect of any common neighbor. If all common neighbors are zero-triangle-neighbors, TRA degenerates to RA. For the example network in Figure 1(a), $\mathrm{TRA}(x, y) = (1 + 3/2)/4 + (1 + 2/2)/3 + (1 + 0/2)/2 + (1 + 0/2)/4 = 49/24$.
As Table 1 shows, FW and Karate are both dense networks. Roughly speaking, TRA-triangle-neighbors are more likely to exist between seed nodes on dense networks than on sparse ones.
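The definition of the TRA index can be checked numerically with a small Python sketch. The toy graph below is hypothetical: it is not claimed to be the network of Figure 1, but was built so that the four common neighbors have degrees 4, 3, 2, 4 and TRA-triangle counts 3, 2, 0, 0, which reproduces the arithmetic of the worked example. The function names `tri`, `tra`, and `ra` are likewise assumptions of this sketch, and exact fractions are used so that the result can be compared with 49/24 directly.

```python
from fractions import Fraction


def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def tri(adj, u, v):
    """Number of triangles passing through the linked nodes u and v."""
    return len(adj[u] & adj[v])


def tra(adj, x, y):
    """TRA(x, y): sum over common neighbors z of (1 + tri(x, y; z) / 2) / k_z."""
    total = Fraction(0)
    for z in adj[x] & adj[y]:
        t = tri(adj, x, z) + tri(adj, y, z)
        total += Fraction(2 + t, 2 * len(adj[z]))   # equals (1 + t/2) / k_z
    return total


def ra(adj, x, y):
    return sum((Fraction(1, len(adj[z])) for z in adj[x] & adj[y]), Fraction(0))


# Hypothetical toy graph: seeds x, y share the neighbors a, b, c, d.
EDGES = [("x", "a"), ("x", "b"), ("x", "c"), ("x", "d"),
         ("y", "a"), ("y", "b"), ("y", "c"), ("y", "d"),
         ("a", "b"), ("a", "g"), ("x", "g"),
         ("d", "e"), ("d", "f"), ("e", "f")]
adj = build_adjacency(EDGES)
print(tri(adj, "x", "a"), tri(adj, "y", "a"))   # triangles through (x, a) and (y, a)
print(tra(adj, "x", "y"))

# Without any triangle, TRA reduces to RA.
flat = build_adjacency([("x", "c"), ("y", "c"), ("x", "h"), ("y", "h")])
print(tra(flat, "x", "y"), ra(flat, "x", "y"))
```

Node `d` illustrates the difference from CCLP: it lies on a triangle that involves neither seed node, so it adds nothing beyond its resource-allocation share to TRA, yet it is still counted.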

Experimental Results
To check whether the proposed index differs significantly from the compared methods, we applied the Wilcoxon signed-ranks test [39] to the results in Table 3. The pairwise test results are presented in Figure 2. From the statistical point of view, our index is significantly better than the others except the ADP index, because the ADP index can adapt to the structure of a network automatically. Although there is no statistical difference between our index and the ADP index according to the Wilcoxon signed-ranks test, our index achieves a higher AUC than the ADP index on nine of the 12 networks (Table 3). Figure 3 shows the changes of AUC on the 12 networks when the proportion of $E^P$ in $E$ increases from 10% to 20%. It is evident from Figure 3 that the AUC values of all indexes show downward trends when the proportion increases from 10% to 20%, except on FW. The reason is that a larger $E^P$ shrinks the training set $E^T$, which reduces the number of common neighbors between seed nodes. Consequently, link prediction becomes more difficult. The FW network, which possesses a high average degree, a small average shortest distance, and low degree heterogeneity, is a very dense network. Therefore, shrinking the training set has only a slight influence on accuracy on FW. In addition, we can observe from Figure 3 that the performance of all indexes on ADV, CE, Dolphin, Email, Hamster, HEP, Karate, Word, and Yeast is very similar. On these nine networks, the AUC values of the CAR-based indexes are obviously lower than those of the others. On the FW network, the results of the CAR-based indexes are better than those of the CN, AA, RA, and ADP indexes, because FW is a very dense network in which the ratio of CAR-triangle-neighbors is very high (see Table 2). On PB and USAir, the performance of the CAR-based indexes is not as bad as on the other nine networks. The reason is that both networks have high average degrees, small average shortest distances, and high ratios of CAR-triangle-neighbors.
Furthermore, we list the AUC values of different methods on the 12 networks when $|E^P|/|E| = 0.2$ in Table 4. Our index outperforms the others on eight of the 12 networks, while the CCLP index achieves the highest value on CE.
Table 5 gives the results in terms of ranking score. These results are similar to those in Table 3. In terms of ranking score, the TRA index outperforms the others except on Dolphin, HEP, and USAir. The pairwise Wilcoxon signed-ranks test results are shown in Figure 4. Similar to the test in Figure 2, the TRA index is significantly better than the compared methods except the ADP index. As noted above, ADP has the adaptive capability and hence performs better than the other compared methods.
Figure 5 describes the changes of ranking score on the 12 networks when $|E^P|/|E|$ increases from 10% to 20%. Clearly, all indexes yield higher ranking scores as $E^P$ grows. Recall that a higher ranking score means lower accuracy. As analyzed above, FW is very dense. Thus, the changes of AUC on FW are very slight (see Figure 3). However, the changes of ranking score on FW are more evident, especially for the CAA and CRA indexes. The reason is that the calculation of ranking score considers all missing links. In addition, as seen in Figure 5, the CAA and CRA indexes perform worse than the CAR index according to ranking score.
From the definitions of these three indexes, we find that the CAA and CRA indexes can suffer a greater negative impact from zero-triangle-neighbors than the CAR index. Finally, the ranking scores of all methods on the 12 networks with $|E^P|/|E| = 0.2$ are listed in Table 6. Our index outperforms all other indexes except on HEP and USAir in terms of ranking score. These results are consistent with those of AUC. In contrast with that on FW, the influence of TRA-triangles on HEP and USAir is small.
From the above results, we can conclude that the TRA index is superior to the CAR-based indexes and the CCLP index and performs better than common-neighbor-based methods on most networks.

Conclusion and Discussion
Link prediction is an important research topic in complex network analysis and has a wide range of applications in various fields. Inspired by the triangle growth mechanism in network evolution [41], this paper proposed the TRA index for link prediction. When computing the similarity between two seed nodes, the proposed index not only counts the contributions of all common neighbors but also emphasizes the importance of the neighbors that can form TRA-triangles. To some extent, TRA-triangles reflect the close relationships between neighbors and seed nodes. In addition, the proposed index also adopts the theory of resource allocation [13] due to its effectiveness.
The accuracy of the TRA index is experimentally evaluated over 12 real-world networks from various fields in terms of AUC and ranking score. The experimental results show that the proposed index performs far better than the CAR-based indexes. Meanwhile, our index outperforms the CCLP index, because it counts the contribution of every common neighbor, including those with no triangles passing through them. Compared with common-neighbor-based methods, the proposed index yields some improvements of accuracy on most networks. These results indicate that combining the information of TRA-triangles and the theory of resource allocation in a similarity index is a helpful idea for link prediction. Several directions for improving our index remain for future work. One is to analyze the degree of influence of TRA-triangles on different networks and then to set the weight of TRA-triangles adaptively for each network. The second is to study the application of the TRA index to other topics, such as community detection and anomaly detection. In addition, for learning-based link prediction approaches, the TRA index can be used as a feature of a node pair.

Figure 3: The changes of AUC when $|E^P|/|E|$ increases from 10% to 20% on 12 networks. Each point is obtained by averaging over 50 independent realizations.

Table 3: The AUC of different methods on 12 networks. The results are the average of 50 independent implementations with $|E^P|/|E| = 0.1$. The best performance for each network is emphasized by boldface.

Table 3 lists the prediction results of different methods in terms of AUC on the 12 networks. The results are obtained by averaging over 50 independent realizations for each network, with the testing set containing 10% of the links. The highest AUC value for each network is highlighted in boldface. Clearly, the TRA index obtains nine best results over the 12 networks. Meanwhile, the TRA index outperforms the CAR, CAA, CRA, and CCLP indexes on all networks. We can see from Table 2 that, on most of the networks, a varying share of the seed node pairs with common neighbors has at least one shared neighbor, or only shared neighbors, that are not triangle-neighbors. As stated in the Introduction, the CCLP index gives lower or zero similarity scores to those pairs. Furthermore, the ratios of pairs with at least one shared neighbor, and of pairs with only shared neighbors, that are not CAR-triangle-neighbors are both very high on most of the networks. In particular, on Dolphin, Email, Hamster, HEP, and Yeast, the ratio of pairs none of whose shared neighbors is a CAR-triangle-neighbor is greater than 0.8. This indicates that only a very small fraction of seed node pairs with common neighbors on those networks can be assigned similarity scores by the CAR-based indexes. Although some seed node pairs have shared neighbors that are not TRA-triangle-neighbors, the TRA index can still assign reasonable similarity scores to them. Therefore, the results of the TRA index in Table 3 are better than those of the CAR, CAA, CRA, and CCLP indexes. Among the CN, AA, RA, and ADP indexes, the ADP index performs the best, since it can penalize common neighbors by automatically adapting to the network. On Dolphin, HEP, and USAir, the ADP index obtains the best accuracy, and the performance of our index is close to the best. In addition, the TRA index achieves much better AUC scores than the others on FW and Karate. This result suggests that TRA-triangles play an important role on these two networks.
Figure 2: The results of the Wilcoxon signed-ranks test based on Table 3. With $\alpha = 0.05$, if $z \le -1.96$, the null hypothesis is rejected.
Figure 4: The results of the Wilcoxon signed-ranks test based on Table 5. With $\alpha = 0.05$, if $z \le -1.96$, the null hypothesis is rejected.
Figure 5: The changes of ranking score when $|E^P|/|E|$ increases from 10% to 20% on 12 networks. Each point is obtained by averaging over 50 independent realizations.

Table 4: The AUC of different methods on 12 networks. The results are the average of 50 independent implementations with $|E^P|/|E| = 0.2$. The best performance for each network is emphasized by boldface.

Table 5: The ranking score of different methods on 12 networks. The results are the average of 50 independent implementations with $|E^P|/|E| = 0.1$. The best performance for each network is emphasized by boldface.

Table 6: The ranking score of different methods on 12 networks. The results are the average of 50 independent implementations with $|E^P|/|E| = 0.2$. The best performance for each network is emphasized by boldface.