The Comprehensive Contributions of Endpoint Degree and Coreness in Link Prediction

In past studies, researchers have found that endpoint degree, H-index, and coreness can quantify the influence of endpoints in link prediction; in particular, synthesizing endpoint degree and H-index improves prediction performance compared with traditional link prediction models. However, neither endpoint degree nor H-index can describe the aggregation degree of neighbors, which results in an inaccurate expression of the endpoint influence intensity. Through extensive investigation, we find that researchers have ignored the importance of coreness for the influence of endpoints. We also find that the synthesis of endpoint degree and coreness can not only describe the maximal connected subgraph of endpoints accurately but also express the endpoint influence intensity. In this paper, we propose the DCHI model, which synthesizes endpoint degree and coreness, and the HCHI model, which synthesizes H-index and coreness, both built on the SRW model. Extensive simulations on twelve real benchmark datasets show that, in most cases, DCHI achieves better prediction performance than HCHI and other traditional models.

To reveal the structure of complex networks, researchers have proposed a large number of link prediction models. In particular, models based on local information have gained more attention. For example, Kossinets [23] finds that two strangers are more likely to become friends if they have more common friends in social networks. Newman [24] finds that two scientists are more likely to cooperate in the future if they have more common collaborators. Based on this phenomenon, researchers propose the common neighbors model (CN). Building on CN, some researchers propose improved models, such as Salton [25] and LHN-I [26]. Furthermore, according to the different similarity contributions of common neighbors, Adamic and Adar propose the AA model [27], and Zhou et al. [28] propose the resource-allocation model (RA). Moreover, Cannistraci et al. [29] propose that CN, AA, RA, and other algorithms can be weighted by local community information, which further improves the performance of these models. However, the models based on common neighbors only consider the influence of endpoints on one-step paths. Through further research, Lü et al. [30] propose the local path model (LP), which considers the influence of endpoints on three-step paths. In addition, some models consider global information, such as Katz [31] and the hierarchical structure model [32]. Other models consider quasi-local information, which compensates for the low accuracy of local information and the high computational complexity of global information. For example, the local random walk model (LRW) [33] considers a random walker within a quasi-local range, and the superposed random walk model (SRW) [33] considers the effects of LRW with different path lengths. Based on SRW, the HSRW [34] and CSRW [34] models consider the roles of H-index [35] and coreness [36] with different path lengths, respectively.
Simple hybrid influence model (SHI) [37] synthetically considers the role of endpoint degree and H-index as hybrid influence with different path lengths.
At present, many link prediction models consider only the degree [33] of endpoints, such as Sørensen [38], LHN [26], LRW [33], and SRW [33]. These models illustrate that the source endpoint can effectively spread its influence to the target endpoint if the source endpoint has more neighbors connecting it to the target endpoint. Lü et al. [39] find that the H-index quantifies the influence of an endpoint better than degree and coreness. Zhu et al. [37] find that an endpoint with a large synthetical degree and H-index has a more extensive maximal connected subgraph, which helps the endpoint attract other nodes. Through further investigation, we find that the endpoint influence can be expressed by the aggregation degree of neighbors. A large aggregation degree of neighbors indicates that the endpoint has an extensive maximal connected subgraph and therefore attracts more nodes. The aggregation degree of neighbors can only be quantified by the coreness of endpoints. Thus, we synthesize the endpoint degree and coreness (or H-index and coreness) to quantify the endpoint influence and build new link prediction models. Although the SHI model based on the synthetical degree and H-index has been explored, the synthetical endpoint degree and coreness (or the synthetical H-index and coreness) has not been fully verified. Figure 1 gives an illustration. In Figure 1, endpoint b has degree = 6, H-index = 3, and coreness = 3. Considering only degree, the influence intensity of endpoint b is 6. However, degree alone cannot accurately express the depth and scope of the influence of endpoints. Because coreness contributes to the influence of endpoints, synthesizing degree and coreness, or H-index and coreness, can better quantify the maximal connected subgraph of endpoints and the aggregation degree of neighbors. For endpoint b, the product of degree (H-index) and coreness is 18 (9).
Clearly, degree and H-index indicate different sizes of the maximal connected subgraph belonging to endpoint b under the same coreness, leading to different endpoint influences. Therefore, the prediction performance of endpoint influence based on the different quantification indices needs to be further explored.
In the real world, we find many phenomena that support our idea. For example, on Weibo, an ordinary individual has limited influence because his or her followers, however many, are individual colleagues, classmates, relatives, or friends, indicating that he or she only has a large degree.
By contrast, public figures have extensive and strong influence because they have a large number of fan clubs, indicating that they have a large coreness that strengthens their influence. In a scientific collaboration network, if a scientist merely cooperates with many scholars, meaning that he or she has a large degree but a small coreness, the scientist does not become known to more researchers and can hardly attract them to cooperate. In an e-commerce network, the applicability of products depends on purchase groups with similar identities, such as male/female groups, student groups, and teacher groups, which shows the importance of the aggregation degree. In a paper-citation network, the value of a paper depends on citations from researchers in the same field, not on citations from researchers in different fields.
In summary, in this paper, we define the hybrid influence of synthetical degree and coreness (and of synthetical H-index and coreness) to redefine SRW, and we propose two improved models, DCHI and HCHI, to further explore the accuracy of link prediction. Experimental results on twelve real networks show that DCHI exhibits better link prediction performance. The rest of this paper is organized as follows. In Section 2, we build two models based on the synthetical degree and coreness and the synthetical H-index and coreness, respectively. In Section 3, the twelve benchmark experimental datasets are introduced. In Section 4, a link prediction metric and eight mainstream baselines are described. In Section 5, the experimental results are discussed. In Section 6, the conclusions are presented.

Models Based on Hybrid Influence of Endpoints
Firstly, we study link prediction models in an undirected simple network $G(V, E)$, where $E$ is the set of links ($|E|$ denotes the number of links) and $V$ is the set of nodes. Multiple links and self-connections are eliminated. For every pair of nodes $x, y \in V$, a score $s_{xy}$ is assigned to estimate the probability of their future connection. In this paper, we use the similarity value directly as the score, and a larger score indicates that the potential link is more likely to exist. Secondly, we present two existing models, one based on the degree (SRW [33]) and one based on the synthetical degree and H-index (SHI [37]), as follows.

SRW Model.
Liu et al. [33] build a similarity model using random walks, which visits the intermediate nodes between two endpoints sequentially according to a Markov chain with one-step transition probability $p_{xy} = a_{xy}/k_x$, where $k_x$ represents the degree of node $x$, and $a_{xy} = 1$ if node $x$ is connected to $y$ and $a_{xy} = 0$ otherwise. The sequence of nodes on a $t$-step walk between $x$ and $y$ is expressed as $x = x_0, x_1, \ldots, x_t = y$. Thus, the $t$-step transition probability from $x$ to $y$ is denoted by $\pi_{xy}(t) = \prod_{i=0}^{t-1} p_{x_i x_{i+1}}$, and likewise $\pi_{yx}(t) = \prod_{i=0}^{t-1} p_{y_i y_{i+1}}$. Importantly, Liu et al. use the degrees $k_x$ and $k_y$ to quantify the influence of endpoints and define the SRW as

$$s_{xy}^{\mathrm{SRW}}(t) = \sum_{\tau=1}^{t} \left[ \frac{k_x}{2|E|}\,\pi_{xy}(\tau) + \frac{k_y}{2|E|}\,\pi_{yx}(\tau) \right],$$

where $k_x$ and $k_y$ denote the degrees of endpoints $x$ and $y$, respectively, and $|E|$ indicates the number of links in the network.
The factors $k_x/2|E|$ and $k_y/2|E|$ describe the influence of endpoints $x$ and $y$, respectively.
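
The SRW definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the standard LRW reading in which $\pi_{xy}(\tau)$ is the probability that a walker starting at $x$ is located at $y$ after $\tau$ steps (that is, summed over all $\tau$-step walks), and the names `build_adjacency`, `walk_distributions`, and `srw_score` are hypothetical.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Return {node: set(neighbors)} for an undirected simple graph."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # self-connections are eliminated
            adj[u].add(v)
            adj[v].add(u)
    return dict(adj)


def walk_distributions(adj, source, t_max):
    """Yield the walker's distribution after 1, 2, ..., t_max steps from source."""
    dist = {source: 1.0}
    for _ in range(t_max):
        nxt = defaultdict(float)
        for node, prob in dist.items():
            share = prob / len(adj[node])  # p_xy = a_xy / k_x
            for nb in adj[node]:
                nxt[nb] += share
        dist = dict(nxt)
        yield dist


def srw_score(adj, x, y, t_max):
    """Sum over tau = 1..t_max of q_x * pi_xy(tau) + q_y * pi_yx(tau)."""
    two_m = sum(len(nbrs) for nbrs in adj.values())  # equals 2|E|
    q_x = len(adj[x]) / two_m  # degree-based influence of x
    q_y = len(adj[y]) / two_m  # degree-based influence of y
    score = 0.0
    for from_x, from_y in zip(walk_distributions(adj, x, t_max),
                              walk_distributions(adj, y, t_max)):
        score += q_x * from_x.get(y, 0.0) + q_y * from_y.get(x, 0.0)
    return score
```

Propagating sparse dictionaries rather than a dense matrix keeps the sketch dependency-free; for large networks a sparse matrix library would be the more usual choice.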

SHI Model.
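Zhu et al. [37] find that the H-index can represent the maximal connected subgraph of endpoints and describe the influence intensity. Thus, Zhu et al. simply synthesize degree and H-index as the hybrid influence of endpoints and replace the degree in SRW to define a simple hybrid influence model (SHI) as

$$s_{xy}^{\mathrm{SHI}}(t) = \sum_{\tau=1}^{t} \left[ \frac{\sqrt{k_x \times h_x}}{2|E|}\,\pi_{xy}(\tau) + \frac{\sqrt{k_y \times h_y}}{2|E|}\,\pi_{yx}(\tau) \right],$$

where $h_x$ and $h_y$ denote the H-index of nodes $x$ and $y$, and $\sqrt{k_x \times h_x}/2|E|$ and $\sqrt{k_y \times h_y}/2|E|$ denote the hybrid influence of nodes $x$ and $y$ based on the synthetical degree and H-index, respectively.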
Although the endpoint degree and H-index can quantify the endpoint influence, they only represent the number of neighbors and the maximal connected subgraph of endpoints, respectively, and ignore the influence intensity of endpoints. The influence intensity of endpoints can be expressed by the coreness of endpoints, because the coreness quantifies the aggregation degree of neighbors, which represents the endpoint influence intensity.
Thus, we consider the role of coreness in endpoint influence. Finally, we build two models based on the synthetical degree and coreness (DCHI) and the synthetical H-index and coreness (HCHI) as follows.

DCHI Model.
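Following the explanation in Section 1 and the illustration in Figure 1, we synthesize degree and coreness to quantify the influence of endpoints and replace the degree in SRW to build a new model, DCHI, as

$$s_{xy}^{\mathrm{DCHI}}(t) = \sum_{\tau=1}^{t} \left[ \frac{\sqrt{k_x \times c_x}}{2|E|}\,\pi_{xy}(\tau) + \frac{\sqrt{k_y \times c_y}}{2|E|}\,\pi_{yx}(\tau) \right],$$

where $c_x$ and $c_y$ denote the coreness of nodes $x$ and $y$, and $\sqrt{k_x \times c_x}/2|E|$ and $\sqrt{k_y \times c_y}/2|E|$ denote the hybrid influence of nodes $x$ and $y$ based on the synthetical degree and coreness, respectively.

HCHI Model.
Similarly, we synthesize H-index and coreness to quantify the influence of endpoints and replace the degree in SRW to build a new model, HCHI, as

$$s_{xy}^{\mathrm{HCHI}}(t) = \sum_{\tau=1}^{t} \left[ \frac{\sqrt{h_x \times c_x}}{2|E|}\,\pi_{xy}(\tau) + \frac{\sqrt{h_y \times c_y}}{2|E|}\,\pi_{yx}(\tau) \right],$$

where $h_x$ and $h_y$ denote the H-index of nodes $x$ and $y$, and $\sqrt{h_x \times c_x}/2|E|$ and $\sqrt{h_y \times c_y}/2|E|$ denote the hybrid influence of nodes $x$ and $y$ based on the synthetical H-index and coreness, respectively.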
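
To make the three endpoint quantities concrete, the following sketch computes the degree, H-index, and coreness of every node and combines them into a hybrid influence. It is an illustration under stated assumptions rather than the authors' code: the hybrid influence is assumed to take the square-root form $\sqrt{k_x \times c_x}/2|E|$ (or $\sqrt{h_x \times c_x}/2|E|$), the H-index of a node is taken as the largest $h$ such that the node has at least $h$ neighbors of degree at least $h$, and the function names are hypothetical. The resulting value plays the role that $k_x/2|E|$ plays in the SRW sum.

```python
import math


def h_index(adj, node):
    """Largest h such that node has at least h neighbors of degree >= h."""
    degs = sorted((len(adj[nb]) for nb in adj[node]), reverse=True)
    h = 0
    for rank, d in enumerate(degs, start=1):
        if d >= rank:
            h = rank
        else:
            break
    return h


def coreness(adj):
    """k-shell index of every node, by repeatedly peeling a minimum-degree node."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        v = min(remaining, key=degree.__getitem__)
        k = max(k, degree[v])  # coreness never decreases during peeling
        core[v] = k
        remaining.remove(v)
        for nb in adj[v]:
            if nb in remaining:
                degree[nb] -= 1
    return core


def hybrid_influence(adj, model="DCHI"):
    """Assumed form: sqrt(k*c)/2|E| for DCHI, sqrt(h*c)/2|E| for HCHI."""
    two_m = sum(len(nbrs) for nbrs in adj.values())  # equals 2|E|
    core = coreness(adj)
    influence = {}
    for v in adj:
        first = len(adj[v]) if model == "DCHI" else h_index(adj, v)
        influence[v] = math.sqrt(first * core[v]) / two_m
    return influence
```

The peeling loop is quadratic in the number of nodes and is written for readability; bucket-based k-core algorithms run in linear time and would be preferable on large networks.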

Experimental Data
In this section, we introduce the 12 real network datasets used in the following experiments. (1) US Air97 (USAir) [40] represents the US airline network. (2) Yeast PPI (Yeast) [41] represents the network of interactions between proteins in yeast. (3) Food Web (Food) [42] represents the carbon exchanges in the cypress wetlands ecosystem of Florida. (4) Power Grid (Power) [43] represents the electrical power transmission network of the western US. (5) NetScience (NS) [44] represents collaborations between scientists publishing papers on the subject of networks. (6) Jazz [45] represents the network of jazz musicians. (7) E-mail network (e-mail) [46] represents the e-mail communication network of University Rovira i Virgili (URV) in Spain. (8) Slavko [47] represents the friendship network of Slavko Zitnik on Facebook. (9) UC Irvine social network (UCsocial) [48] represents an online social network of students at the University of California, Irvine. (10) Infectious (Infec) [49] represents the offline contact network of visitors to the exhibition "Infectious: Stay Away" at the Science Gallery in Dublin in 2009. (11) EuroSiS web (EuroSiS) [50] represents the network of interactions between Science in Society actors from twelve European countries. (12) C. elegans (CE) [43] represents the network of neurons in the worm C. elegans. Table 1 lists the fundamental topological features of these networks.
For preprocessing, directed arcs are converted into undirected links, and loops and multiedges are eliminated so that each network is unweighted and undirected. Subsequently, the largest connected component of the simplified network is extracted to ensure connectivity.
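
The preprocessing steps can be sketched as follows. This is a minimal illustration under the assumption that each raw dataset is available as a list of (source, target) pairs; the function names `simplify` and `largest_component` are hypothetical.

```python
from collections import deque


def simplify(arcs):
    """Convert directed, possibly repeated arcs into an undirected simple edge set."""
    edges = set()
    for u, v in arcs:
        if u != v:  # drop loops
            edges.add(frozenset((u, v)))  # direction and multiplicity are discarded
    return edges


def largest_component(edges):
    """Return the edges that lie inside the largest connected component."""
    adj = {}
    for e in edges:
        u, v = tuple(e)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:  # breadth-first search from start
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in comp:
                    comp.add(nb)
                    queue.append(nb)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return {e for e in edges if e <= best}
```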
In the beginning, the set of network links is randomly divided into a training set $E^T$ containing 90% of the links and a testing set $E^P$ containing the remaining 10%, while the connectivity of $E^T$ is ensured [1]. This division is carried out 30 times, identically and independently, on each network. The experiments are then performed on the 30 pairs of training and testing sets, and the reported accuracy is the average over these 30 realizations.

Metric.
AUC [36], a metric of accuracy, can be interpreted as the probability that a potential link (a link in $E^P$) receives a higher score than a nonexistent link (a link in $U \setminus E$, where $U$ denotes the universal link set). In the specific implementation, among $n$ independent comparisons, if the potential link scores higher $n'$ times and scores the same as the nonexistent link $n''$ times, the total score accumulates to $n' + 0.5n''$. AUC then expresses the averaged score over the $n$ comparisons as

$$\mathrm{AUC} = \frac{n' + 0.5\,n''}{n}.$$

AUC evaluates the performance of a model globally. If all scores originate from an independent and identical distribution, the value should equal 0.5. Therefore, the extent to which the accuracy exceeds 0.5 suggests how much better a model performs than pure chance.
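
The sampling-based AUC estimate can be sketched as below. This is an illustrative sketch only: the function name `auc`, the default number of comparisons, and the fixed random seed are assumptions made for reproducibility of the example, and `score` stands for any similarity function taking the two endpoints of a link.

```python
import random


def auc(score, test_links, nonexistent_links, n=10000, seed=0):
    """Estimate AUC as (n' + 0.5 * n'') / n over n random comparisons."""
    rng = random.Random(seed)
    test_links = list(test_links)
    nonexistent_links = list(nonexistent_links)
    higher = ties = 0
    for _ in range(n):
        s_test = score(*rng.choice(test_links))        # a link in the testing set
        s_none = score(*rng.choice(nonexistent_links)) # a link outside E
        if s_test > s_none:
            higher += 1  # contributes to n'
        elif s_test == s_none:
            ties += 1    # contributes to n''
    return (higher + 0.5 * ties) / n
```

A model that assigns the same score to every pair yields exactly 0.5 under this estimate, which matches the pure-chance baseline described above.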

Baselines.
For comparison, we introduce eight fundamental models as follows:

(1) Common neighbors (CN) [24] describes the similarity between endpoints by counting their common neighbors, defined as

$$s_{xy}^{\mathrm{CN}} = |\Gamma(x) \cap \Gamma(y)|,$$

where $\Gamma(X)$, $X \in \{x, y\}$, represents the set of neighbors of endpoint $X$ and $|\Gamma(x) \cap \Gamma(y)|$ is the number of common neighbors of endpoints $x$ and $y$.

(2) Adamic/Adar (AA) [27], based on CN, suppresses the contributions of common neighbors with large degree by applying the inverse logarithm, defined as

$$s_{xy}^{\mathrm{AA}} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{\log k_z},$$

where $k_z$ represents the degree of node $z$.

(3) Resource-Allocation (RA) [28], analogous to AA, suppresses common neighbors with large degree by applying the reciprocal of their degrees, defined as

$$s_{xy}^{\mathrm{RA}} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{k_z}.$$

(4) Local Path Index (LP) [30] considers the similarity on two-step and three-step paths between endpoints simultaneously, with the two-step paths preferred, defined as

$$S^{\mathrm{LP}} = A^2 + \varepsilon A^3,$$

where $A$ represents the adjacency matrix and $\varepsilon$ is a punishment parameter.

(5) Superposed Random Walk (SRW) [33] is introduced in Section 2.

(6) CSRW [34] exploits the coreness to quantify the influence of endpoints and replaces the degree influence in SRW, defined as

$$s_{xy}^{\mathrm{CSRW}}(t) = \sum_{\tau=1}^{t} \left[ \frac{c_x}{2|E|}\,\pi_{xy}(\tau) + \frac{c_y}{2|E|}\,\pi_{yx}(\tau) \right],$$

where $c_x$ and $c_y$ represent the coreness of nodes $x$ and $y$, respectively.

(7) HSRW [34] exploits the H-index to quantify the influence of endpoints and replaces the degree influence in SRW, defined as

$$s_{xy}^{\mathrm{HSRW}}(t) = \sum_{\tau=1}^{t} \left[ \frac{h_x}{2|E|}\,\pi_{xy}(\tau) + \frac{h_y}{2|E|}\,\pi_{yx}(\tau) \right],$$

where $h_x$ and $h_y$ represent the H-index of nodes $x$ and $y$, respectively.

(8) Simple hybrid influence (SHI) [37] is introduced in Section 2.
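
The four local baselines are simple enough to state directly in code. The sketch below assumes the network is stored as a dictionary mapping each node to the set of its neighbors; the function names and the default value `eps=0.01` for the punishment parameter are illustrative assumptions, not values taken from this paper.

```python
import math


def common_neighbors(adj, x, y):
    """CN: number of shared neighbors."""
    return len(adj[x] & adj[y])


def adamic_adar(adj, x, y):
    """AA: shared neighbors weighted by 1 / log(degree)."""
    # A shared neighbor is linked to both x and y, so its degree is at least 2.
    return sum(1.0 / math.log(len(adj[z])) for z in adj[x] & adj[y])


def resource_allocation(adj, x, y):
    """RA: shared neighbors weighted by 1 / degree."""
    return sum(1.0 / len(adj[z]) for z in adj[x] & adj[y])


def local_path(adj, x, y, eps=0.01):
    """LP: (A^2)_xy + eps * (A^3)_xy, computed from neighbor sets."""
    walks2 = len(adj[x] & adj[y])                       # (A^2)_xy
    walks3 = sum(len(adj[z] & adj[y]) for z in adj[x])  # sum_z A_xz (A^2)_zy
    return walks2 + eps * walks3
```

Working with neighbor sets avoids forming $A^2$ and $A^3$ explicitly, which matters on sparse networks where only a subset of node pairs needs to be scored.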

Results and Discussion
To explore the prediction performances of the proposed models, extensive simulations are conducted on 12 real datasets. Through comparisons with several main baselines in terms of the accuracy metric, we obtain the experimental results for the models and discuss the findings below. The SHI, HCHI, and DCHI models mainly consider two aspects: random walks on paths and hybrid influences of endpoints.
The simulations show that the number of random-walk steps between endpoints affects the accuracy of link prediction. To illustrate how the prediction accuracy changes with the number of steps t, we plot the relation curves in Figure 2.
In Figure 2, the SHI (synthetical degree and H-index), HCHI (synthetical H-index and coreness), and DCHI (synthetical degree and coreness) models show their prediction performances as functions of the number of random-walk steps t, and each reaches its optimal accuracy at a particular number of steps. Specifically, SHI shows optimal AUC values at t = 15 in Food, Power, NS, e-mail, UCsocial, and EuroSiS, t = 5 in USAir and CE, t = 3 in Yeast, t = 2 in Jazz, t = 6 in Slavko, and t = 9 in Infec. Clearly, the optimal number of steps for SHI mainly appears at the long path length t = 15, illustrating that long paths further facilitate the spreading of the hybrid influence based on the degree and the H-index. However, HCHI and DCHI both show optimal AUC values at t = 5 in USAir, Food, Slavko, Infec, EuroSiS, and CE, illustrating that quasi-local paths further facilitate the spreading of the hybrid influence based on H-index and coreness or on degree and coreness. Importantly, we find that the influence related to coreness easily leaks during the random-walk process on longer paths, which weakens the intensity of influence spreading between endpoints. In Power, however, the prediction performances of HCHI and DCHI reach their optimal values at t = 15 because the power network includes a large number of long paths, with an average distance 〈d〉 = 15.87 that is much longer than in the other datasets (see Table 1). In addition, DCHI, compared with SHI and HCHI, has a larger maximal connected subgraph and more paths along which to spread the hybrid influence of endpoints. Therefore, DCHI shows the best prediction performance in ten datasets (black mark on each dataset), the exceptions being Yeast and CE.

Figure 1: Illustration of the influences of endpoint b based on degree, H-index, and coreness. Endpoint b has degree = 6, H-index = 3, and coreness = 3. Nodes a and c represent two target endpoints. The blue nodes represent the intermediate nodes between endpoint b and the two target endpoints.
In addition, we compare HCHI and DCHI with eight link prediction models: CN, AA, RA, LP, SRW, CSRW, HSRW, and SHI. Table 2 lists the AUC values of all models, averaged over 30 simulations. The underlined bold values are the best AUC values in each dataset, and the numbers in parentheses indicate the optimal number of random-walk steps t. Together, HCHI and DCHI obtain the optimal AUC values in eight datasets.
As can be seen from Table 2, DCHI achieves the optimal values on seven datasets: Power, NS, Jazz, e-mail, Slavko, Infec, and EuroSiS. In contrast, the local models CN, AA, and RA show the worst prediction performances because they only consider local paths and ignore the influence of endpoints. LP achieves the optimal values on three datasets, Yeast, Food, and UCsocial, illustrating that quasi-local paths can improve the prediction performance to a limited extent. SRW, CSRW, and HSRW also perform poorly because they consider the contributions of degree, coreness, and H-index separately, meaning that none of degree, coreness, and H-index alone can quantify the influence of endpoints comprehensively. Finally, we focus on the performances of SHI, HCHI, and DCHI. Of the twelve datasets, DCHI achieves the optimal performance in seven. DCHI, compared with SHI and HCHI, captures the effective influence of endpoints (e.g., the extensive maximal connected subgraph of endpoints and the aggregation degree of neighbors) and finds sufficient paths between two unconnected endpoints. Therefore, because the synthetical degree and coreness is a good quantification index for hybrid influence, DCHI enhances prediction accuracy more than SHI and HCHI in many cases of link prediction.
Besides, low computational complexity is a necessary condition in link prediction.

Conclusions
At present, researchers pay more attention to the contribution of endpoint influence to link prediction based on local, quasi-local, or global similarity. To quantify the influence of endpoints, researchers consider the degree, H-index, or coreness separately, none of which can evaluate the influence of endpoints comprehensively. Specifically, the endpoint degree only represents the number of neighbors of an endpoint and cannot describe the maximal connected subgraph. The H-index can express the maximal connected subgraph of endpoints and thus quantify the influence scope. However, the endpoint degree and H-index cannot quantify the influence intensity of endpoints, which results in an incomplete expression of influence. We find that the coreness can represent the aggregation degree of endpoints, which quantifies the influence intensity of endpoints accurately.
Through extensive investigation, we find that the synthetical degree and coreness and the synthetical H-index and coreness can quantify the influence of endpoints accurately and comprehensively. Therefore, we synthesize degree (H-index) and coreness as the hybrid influence of endpoints and replace the degree in SRW to build two models, DCHI and HCHI.
We explore the prediction performances of DCHI and HCHI by comparing them with CN, AA, RA, LP, SRW, CSRW, HSRW, and SHI on twelve real datasets. The results show that DCHI outperforms the other models on the AUC metric in most cases and does not increase computational complexity. The improvement in accuracy illustrates that the synthetical degree and coreness, as the hybrid influence of endpoints, can describe the endpoint influence intensity accurately and can attract more nodes to produce links.
Although our models have been verified on the datasets, they only make a simple synthesis of endpoint degree, H-index, and coreness. We find that degrees differ across networks, and so do H-indices and coreness. The network heterogeneity characterized by heterogeneous degrees, H-indices, and coreness directly results in heterogeneous influences. We also find that endpoints in networks with less heterogeneous influence are more likely to attract each other. Given this characteristic, we will carry out further research on a heterogeneous hybrid influence model based on DCHI and HCHI. In future research, the impact of heterogeneity in complex networks will become a crucial problem.
In addition, our study may provide new findings relating to similarity-based link prediction in the future. Our research results can be applied to friend recommendation, product recommendation, scientific cooperation, biological experiments, and so on.

Data Availability
The data used to support the findings of the study are available at http://vlado.fmf.uni-lj.si/pub/networks/data/ and http://snap.stanford.edu/data/index.html.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Table 2: AUC on the twelve benchmark networks with L = 100, where L denotes the number of candidate links used to measure the prediction accuracy in each dataset. Every data point is an average over 30 independent realizations, each corresponding to a random 90%-10% division into training set and testing set. The values in parentheses indicate the corresponding optimal number of steps. All results represent the optimal cases obtained by adjusting the coefficients (if any).