Adaptive Similarity Function with Structural Features of Network Embedding for Missing Link Prediction

Link prediction is a fundamental problem of data science, which usually calls for unfolding the mechanisms that govern the micro-dynamics of networks. In this regard, using features obtained from network embedding for predicting links has drawn widespread attention. Though edge features-based or node similarity-based methods have been proposed to solve the link prediction problem, many technical challenges still exist due to the unique structural properties of networks, especially when the networks are sparse. From the graph mining perspective, we first give empirical evidence of the inconsistency between heuristic and learned edge features. Then we propose a novel link prediction framework, AdaSim, by introducing an Adaptive Similarity function using features obtained from network embedding based on random walks. The node feature representations are obtained by optimizing a graph-based objective function. Instead of generating edge features using binary operators, we perform link prediction solely leveraging the node features of the network. We define a flexible similarity function with one tunable parameter, which serves as a penalty of the original similarity measure. The optimal value is learned through supervised learning thus is adaptive to data distribution. To evaluate the performance of our proposed algorithm, we conduct extensive experiments on eleven disparate networks of the real world. Experimental results show that AdaSim achieves better performance than state-of-the-art algorithms and is robust to different sparsities of the networks.


Introduction
Networks have recently emerged as an important tool for representing and analyzing many kinds of interacting systems ranging from biological to social science [1]. As technological innovation and data explosion gather pace, we are now moving into the era of big data, and hence the reach of and participation in these networks are rapidly expanding. Studying these complex, interlocking networks can help us understand the operation mechanism of real-world systems. Therefore, in the past years, much work has been dedicated to studying the evolution [2,3], topologies [4,5], and characteristics [6] of networks, attracting researchers from physics, sociology, and computer science. Under many circumstances, however, the current observations of various network data are substantially incomplete [7]. For example, in protein-protein interaction and metabolic networks, whether two nodes have a link must be determined experimentally, which is very costly. As a result, the known links may represent fewer than 1% of the actual links [8]. Besides, in social networks like Facebook, only part of the friendships among users are shown by the observed network, and there still exist user pairs who already know each other but are not connected through Facebook. Due to this, it is always a challenging yet meaningful task to identify which pairs of nodes not connected in the current network are likely to be connected in the actual network, i.e., predicting missing links. Acquiring such knowledge is useful: in the biological domain, it gives invaluable guidance for carrying out targeted experiments, and in the social network domain, it can be used to recommend promising friendships, thus enhancing users' loyalty to web services. The ways to solve the link prediction problem [9][10][11][12][13][14] can be roughly divided into two categories, i.e., unsupervised methods and supervised methods.
Current research on unsupervised link prediction mainly focuses on defining a similarity metric $s_{uv}$ for unconnected node pairs $(u, v)$ using information extracted from the network topology. The defined metrics represent different kinds of proximity between a pair of nodes and perform differently across networks, and no single metric dominates the others. Most of the metrics are easy to compute and interpret, but they are so invariant that they are fundamentally unable to cope with dynamics, interdependencies, and other properties in networks [15]. The link prediction problem can also be posed as a supervised binary classification task from a machine learning perspective [16]. Since then, research on supervised methods for link prediction has become prominent [15,17,18,19], and the results of this research provide confirmatory evidence that a supervised approach can enhance link prediction performance.
Choosing an appropriate feature set is crucial for any supervised machine learning task [20][21][22]. For link prediction, each sample in the dataset corresponds to a pair of nodes. A typical solution is using multiple topological similarities as features and this is the most intuitive way. But all these features are handcrafted and cost much human labor. Besides, they often rely on domain knowledge, thus restricting the generalization across different fields.
An alternative method is learning the features automatically for the network. By treating networks as special documents consisting of a series of node sequences, the node features can be learned by solving an optimization problem [23]. After obtaining the features of nodes, the link prediction task is traditionally conducted using two approaches. The first is the similarity-based ranking method [24]; for example, cosine similarity is used to measure the similarity of pairs of nodes. For two unconnected nodes, the larger the similarity value, the higher their connection probability. The other is the edge-feature-based classification method [25,26]. In this method, the edge features are generated by heuristic binary operators such as the Hadamard operator and the Average operator. Then a classifier is trained using these features and is used to decide whether a link will form between two unconnected nodes.
As the features learned through network embedding preserve the network's local structure, the cosine similarity works well for strongly assortative networks but fails to capture the disassortativity of the network, i.e., nodes prefer to build connections on large scales than on small scales [7].
Thus, using cosine similarity for link prediction suffers from statistical performance drawbacks. Besides, the edge features obtained through binary operators will potentially lose node information, since the features of nodes are learned by solving an optimization problem but the edge features are not (see Figure 1 for an illustration; the details are discussed in Section 3.3). Furthermore, the edge and node features have the same dimensionality, which is usually on the scale of several hundreds. This means that even a linear model such as logistic regression still needs to learn hundreds of parameters, which raises the question of feasibility, especially when the data size is large. How to design a simple, general, yet efficient link prediction method using the node features directly learned from network embedding still remains an open problem.
To solve the abovementioned issues, we propose a novel link prediction method, AdaSim (Adaptive Similarity function), for large-scale networks. The node feature representations are obtained by optimizing a graph-based objective function using stochastic gradient descent techniques. Instead of generating edge features using heuristic binary operators, we perform link prediction solely leveraging the node features of the network. Our essential contribution lies in defining a flexible node similarity function with only one tunable parameter, which serves as a penalty of the original similarity. The optimal value can be obtained through supervised learning and thus is adaptive to the data distribution, which gives AdaSim the ability to capture the various link formation mechanisms of different networks. Compared with the original cosine similarity, the proposed method generalizes well across various network datasets.
In summary, our main contributions are listed as follows: (i) We propose AdaSim, a novel link prediction method, by introducing an adaptive similarity function using features learned from network embedding. (ii) We show that AdaSim is flexible enough with only one tunable parameter. It is adjustable with respect to the network property. This flexibility endows AdaSim with the power of capturing the link formation mechanisms of different networks. (iii) We demonstrate the effectiveness of AdaSim by conducting experiments on various disparate real-world networks. The results show that the proposed method can boost the performance of link prediction to different degrees. Besides, we find that AdaSim works particularly well for highly sparse networks. The rest of the paper is structured as follows. Section 2 reviews research related to link prediction. The problem definition of link prediction and feature learning are described in Section 3, and some empirical findings on the datasets are also given in this section. Section 4 illustrates the proposed link prediction method AdaSim with detailed explanations of each component. The experimental results and analysis are presented in Section 5. Finally, Section 6 concludes the paper.

Related Work
Early works on link prediction mainly focus on exploring topological information derived from graphs. Liben-Nowell and Kleinberg [27] studied several topological features such as common neighbors, Adamic-Adar, PageRank, and Katz and found that topological information is beneficial in predicting links compared with a random predictor. Subsequently, some topology-based predictors were proposed for link prediction, e.g., resource allocation [28], community-enhanced predictors [8], and clustering coefficient-based link prediction [29]. Al Hasan et al. were the first to model link prediction as a binary classification problem from a machine learning perspective [16]. Various similarity metrics between node pairs are extracted from the network and treated as features in a supervised learning setup; then, a classifier is built with these features as inputs to distinguish positive samples (links that form) and negative samples (links that do not form). Thereafter, the supervised classification approach has been prevalent in the link prediction domain. Lichtenwalter et al. proposed a high-performance link prediction framework called HPLP. Some new perspectives for link prediction, e.g., the generality of an algorithm, topological causes, and sampling approaches, were included in Ref. [15]. Later, a supervised random walk-based algorithm was proposed by Backstrom and Leskovec [17] to effectively incorporate the information from the network structure with rich node and edge attribute data.
In addition, the link prediction problem is also extended to heterogeneous information networks [30][31][32][33]. Among these works, a core concept based on network schema was proposed, namely, meta-path. Multiple information sources can be effectively fused into a single path and different meta paths have different physical meanings. Some similarity measures can be calculated using meta paths; then, they are treated as features of a classifier to discriminate positive and negative links.
All the works mentioned above on supervised link prediction use handcrafted features, which require expensive human labor and often rely on domain knowledge. To alleviate this, one can use the latent features learned automatically through representation learning [34]. For networks, the unsupervised feature learning methods typically use the spectral properties of various matrix representations of graphs, such as adjacency and Laplacian matrices. From the perspective of linear algebra, this kind of method can be regarded as a dimensionality reduction technique. Several works [35,36] have aimed to acquire the node features of graphs, but the eigendecomposition of a matrix is costly, which makes these methods impractical to scale up to large networks.
Perozzi et al. [23] extended the skip-gram model to graphs and proposed a framework, DeepWalk, by representing a network as a special "document" consisting of a series of node sequences, which are generated by random walks. DeepWalk can learn features for nodes in the network, and the representation learning process is irrelevant to downstream tasks like node classification and link prediction. Later, Node2vec was proposed by Grover and Leskovec [26]. Compared with DeepWalk, Node2vec uses a biased random walk to control the sampling space of node sequences.
The network properties such as homophily and structural equivalence can be captured by Node2vec. The link prediction was performed using edge features obtained through heuristic binary operators on node features. In [37], the authors proposed a deep model called SDNE to capture the highly nonlinear property of networks. The first-order and second-order proximity were jointly exploited to capture the local and global network structure, respectively. More recently, Wang et al. [38] proposed a novel Modularized Nonnegative Matrix Factorization (M-NMF) model to incorporate not only the local and global network structure but also the community information into network embedding. To model the diverse roles that nodes play when interacting with other nodes, Tu et al. [39] presented a Context-Aware Network Embedding (CANE) method by introducing a mutual attention mechanism. CANE can model the semantic relationships between nodes more precisely. In order to save computation time, in [24], link prediction was directly carried out using cosine similarity of node features instead of edge features. The above works mainly focus on the network embedding techniques and ignore the typical characteristics of link formation. The main difference between existing work and our efforts is that we consider an adaptive similarity function with a learning-based idea, making our model flexible enough to capture the various link formation patterns of different networks. For example, a negative value of the tunable parameter $p$ can weaken the role of "structural equivalence" and enhance the score of dissimilar node pairs, thus capturing the disassortativity in link formation.

Problem Statement and Feature Learning Framework
In this section, we first give the formal definition of the link prediction problem. Then the feature learning framework for networks is presented. Finally, we introduce the empirical findings on several network datasets when using node features for link prediction.
Consider a network $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of links; no multiple links or self-links are allowed for any two nodes in the network. It is assumed that some of the links in the network are unobserved or missing at the present stage. The link prediction task aims to predict the likelihood of a link between two unconnected nodes using information intrinsic to the network. Since here we are considering a supervised approach for link prediction, we first need to construct a labeled dataset $D = \{(\Phi(\mathbf{u}_i, \mathbf{v}_i), y_i)\}$, in which $\mathbf{u}_i$ and $\mathbf{v}_i$ denote the features of nodes $u_i$ and $v_i$, respectively. The node features are learned from network representation learning. $\Phi(\cdot)$ is a mapping function from node features to node pair features. For any node pair in $D$, $y_i = 1$ indicates that the pair is a positive sample; otherwise it is a negative sample. Positive samples are the edges, $E_p$, chosen randomly from the network $G$. We delete $E_p$ from $G$ and keep the obtained subnetwork $G_s$ connected. To generate negative samples, we sample an equal number of node pairs from $G$ which have no edge connecting them. The dataset $D$ is split into two parts: training dataset $D_T$ and test dataset $D_P$. A classification model $M$ can be learned with dataset $D_T$; this model is then used to predict whether a pair of nodes in dataset $D_P$ should have a link connecting them. Our algorithms are typical methods from the field of graph mining. Hence, in contrast to part of our previous papers [9,10], we follow conventions in the field of artificial intelligence in which $E_p$ (positive samples) has 50% of the observed links, and scores are based on the combination of $E_p$ and the same number of nonobserved links (negative samples). For the highly sparse sexual contact network, which has only a small number of nodes, $E_p$ instead comprises all observed links.

Feature Learning of Network Embedding.
For a given network $G = (V, E)$, a mapping function $f: V \to \mathbb{R}^{|V| \times d}$ from nodes to feature vectors can be learned for link prediction. Here $d$ is a user-specified parameter that denotes the number of dimensions of the feature vectors, and $f$ is a matrix of $|V| \times d$ parameters. The mapping function $f$ is learned through a series of document-like node sequences, using optimization techniques that originated in language modeling. The purpose of language modeling is to evaluate the likelihood of a sentence appearing in a document. The model is built using a corpus $C$. More formally, it aims to maximize $\Pr(w \mid \mathrm{context}(w))$ over all the training corpus, where $w$ is a word of the vocabulary and $\mathrm{context}(w)$ is the context of $w$, which includes the words that appear on both the left side and the right side of $w$. Recent research on representation learning has paid much attention to leveraging probabilistic neural networks to build a general representation of words, extending the scope of language modeling beyond its original goals. Each word is represented by a continuous and low-dimensional feature vector. The problem then is to maximize $\Pr(w \mid f(\mathrm{context}(w)))$, where $f(\cdot)$ denotes the latent representation of a word. The social representation of networks can be learned analogously through a series of node sequences generated by a specific sampling strategy $S$. Similar to the context of word $w$ in language modeling, $N_S(u)$ is defined to be the neighborhood of node $u$ under sampling strategy $S$. The node representation of networks can be obtained by optimizing the analogous objective, $\Pr(u \mid f(N_S(u)))$, over all nodes $u \in V$. The learned representations can capture the shared similarities in local graph structure among nodes in the networks. Nodes that have similar neighborhoods will acquire similar representations.

Empirical Findings on Several Network Datasets.
After learning the representations for the nodes in the network, there are two approaches to the link prediction task, i.e., the node-similarity-based method and the edge-feature-based method. The former is simple and scalable, and the latter is complex yet powerful. But both methods have their limitations in effectively characterizing the link formation patterns of node pairs. Since the node-similarity-based method involves no learning, it cannot be aware of the effects of global network properties in link prediction. The edge-feature-based method cannot describe the node pair relationship very well at the feature level using a heuristic binary operator, as information is lost in the mapping from node features to edge features. We show the empirical evidence for the limitations of these two kinds of methods in the following subsections. Figure 1(a) shows a toy network (the Krackhardt kite graph) with 10 nodes and 18 edges. Each node and edge is marked with a unique label. Given a specific sampling strategy $S$, we can obtain the node sequences and the corresponding edge sequences simultaneously after performing $S$ on the network. Hence, both node representations and edge representations of the network can be learned using optimization techniques. For a specific pair of nodes, the learned edge representation is called the "true" features and the edge representation generated by a binary operator is called the "heuristic" features. If a binary operator is good enough, it should be able to accurately characterize the relationship of pairs of nodes, i.e., the correlation between heuristic features and true features should be as strong as possible.
For the 18 edges in the toy network, five different kinds of heuristic binary operators [26] (see Table 1) are chosen to generate edge features (for the Division operator, we omit the variant $f_i(v)/f_i(u)$ since its results are very similar to those of $f_i(u)/f_i(v)$), and their correlations with the true edge features are displayed in Figure 1(b).
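The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Table 1 is not reproduced here, so the Weighted-L1 and Weighted-L2 definitions are taken to follow those in [26]; the text does not say which correlation measure is used, so Pearson correlation is assumed; and the names `OPERATORS`, `edge_features`, and `pearson` are hypothetical.

```python
import math
from typing import Callable, Dict, List

# Element-wise binary operators mapping two node vectors to one edge vector.
# Average, Hadamard, and Division are named in the text; the Weighted-L1 and
# Weighted-L2 forms are assumed to follow [26].
OPERATORS: Dict[str, Callable[[float, float], float]] = {
    "average": lambda a, b: (a + b) / 2.0,
    "hadamard": lambda a, b: a * b,
    "weighted_l1": lambda a, b: abs(a - b),
    "weighted_l2": lambda a, b: (a - b) ** 2,
    "division": lambda a, b: a / b,  # undefined when b == 0
}


def edge_features(fu: List[float], fv: List[float], operator: str) -> List[float]:
    """Build a heuristic edge feature vector from two node feature vectors."""
    if len(fu) != len(fv):
        raise ValueError("node feature vectors must have the same dimension")
    op = OPERATORS[operator]
    return [op(a, b) for a, b in zip(fu, fv)]


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation between two equally long vectors (assumed measure)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0.0 or std_y == 0.0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / (std_x * std_y)
```

Given a learned ("true") edge vector `true_edge` for the edge between nodes `u` and `v`, a value such as `pearson(edge_features(f_u, f_v, "hadamard"), true_edge)` would correspond to one point of the kind plotted in Figure 1(b).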

Limitation of Heuristic Binary Operators.
On the basis of the evidence from Figure 1, we can tell that different operators have different results in representing features of pairs of nodes and no one can dominate the others. Some of the edges, e.g., edges 10 and 16, can be well characterized by the Hadamard operator, while others, for example, edges 12, 14, and 17, can be characterized by the Average operator. Furthermore, most values are less than 0.5, which means a weak correlation between the heuristic edge features and true edge features. This verifies our claim that edge features obtained through heuristic binary operators may cause the loss of information of node features.

Limitation of Similarity-Based Method.
Given an unconnected node pair $(u, v)$, several metrics can be used to measure their similarity, for example, the common neighbors of $u$ and $v$, and the number of reachable paths between $u$ and $v$. But here we only consider the metric of cosine similarity, since we have the node pair's feature vectors $\mathbf{u}$ and $\mathbf{v}$, respectively. The cosine similarity is used to characterize the link formation probability and it is defined as $s_{uv} = \mathbf{u}^{T}\mathbf{v} / (\|\mathbf{u}\| \|\mathbf{v}\|)$, where $(\cdot)^{T}$ denotes the transpose and $\|\cdot\|$ means the $l_2$-norm of a vector. The cosine similarity measures the cosine of the angle between two $d$-dimensional vectors obtained from network representation learning. In fact, the idea of cosine similarity has been used for link prediction in several works [24,40,41]. But there are a few issues when directly using cosine similarity for link prediction. The first is that it does not consider the label information of node pairs. Thus, it belongs to the category of unsupervised learning. However, many works have demonstrated that supervised learning approaches to link prediction can enhance the performance [15,18,19]. The other is that cosine similarity is too rigid to capture the different link formation mechanisms of different networks.
In the phase of representation learning for networks, it is assumed that two nodes have similar representations if they have similar contexts in the node sequences sampled by strategy $S$. For networks, this indicates that if two nodes are structurally close to each other (for three nodes $(v_1, v_2, v_3)$ in the graph, if the geodesic distance of $(v_1, v_2)$ is 2 and that of $(v_1, v_3)$ is 5, we say $v_2$ is closer to $v_1$ than $v_3$ is), then they have a high probability of occurring in the same sequence, which results in a high cosine similarity. But in real-world networks, whether two nodes will form a link is not simply influenced by this kind of structural closeness. Two nodes far from each other in the network will also have a high chance of building a relationship if they are structurally equivalent [26]. For two nodes, "the closer the graph distance, the easier for them to build a link" does not necessarily hold, especially when the network is sparse and disassortative.
As shown in Figure 2, different networks have different patterns in building new connections (the datasets are described in Section 5.1). These patterns are closely related to the network properties, such as clustering coefficient, graph density, and assortativity. In some networks with high assortativity, two unconnected nodes tend to be connected if they are structurally close, while in others they do not. More specifically, the link formation probability for two unconnected nodes decreases sharply as the geodesic distance increases in the C.elegans dataset, and 97.8% of new links span a geodesic distance less than 3. But for the Gnutella dataset, as the geodesic distance increases, the link formation probability first increases and then decreases, and most of the new links (62.4%) are generated by node pairs with distance equal to 5 or 6. For the Router dataset, the new links span a wide range of geodesic distances from 2 to 34 and almost half of the new links (48.67%) span a distance larger than 5. The distribution of link formation probabilities is more complex than in the other two datasets. The cosine similarity function assigns higher scores to pairs of nodes if they are close to each other and vice versa. It can capture link formation patterns in the case of Figure 2(a), i.e., the shorter the distance between two unconnected nodes, the higher the probability of being connected. But cosine similarity fails to capture the patterns in the cases of Figures 2(b) and 2(c), especially Figure 2(b), in which the link formation pattern follows a Gaussian distribution. For the pattern of Figure 2(b), nodes prefer to build connections with those that are relatively farther from them. When performing a link prediction task in cases like this, pairs of nodes with a relatively longer distance should be more similar than those with a shorter one.
Thus, we need to design a flexible similarity function for link prediction that captures the various patterns of link formation. Besides, the similarity function should be concise and scalable. This can be achieved by adjusting the similarity of node pairs and balancing link formation probabilities among different distances. Inspired by this, we propose a modified similarity function in which $p$ is a balance factor that controls the similarity of node pairs with different geodesic distances. As we have the labels of node pairs, the optimal value of $p$ can be learned in a supervised way.
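As a concrete illustration of similarity-based ranking, the following minimal Python sketch scores candidate node pairs by the cosine similarity of their feature vectors. The function names and the `features` dictionary layout are hypothetical choices made for this example.

```python
import math
from typing import Dict, Hashable, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors: u^T v / (||u|| ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def rank_pairs(
    features: Dict[Hashable, Sequence[float]],
    candidates: List[Tuple[Hashable, Hashable]],
) -> List[Tuple[Hashable, Hashable]]:
    """Sort unconnected node pairs from most to least similar."""
    return sorted(
        candidates,
        key=lambda pair: cosine_similarity(features[pair[0]], features[pair[1]]),
        reverse=True,
    )
```

Note that nothing in this ranking depends on labels, which is why the approach is unsupervised: the ordering is fixed once the embeddings are learned.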
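A distribution of this kind can be estimated with a breadth-first search. The sketch below is one plausible procedure, not necessarily the one used to produce Figure 2: it assumes that the held-out links are removed first and that the geodesic distance of each held-out pair is then measured in the remaining graph. All function names are hypothetical.

```python
from collections import Counter, deque
from typing import Dict, Hashable, Iterable, Optional, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def build_adjacency(edges: Iterable[Edge]) -> Dict[Hashable, Set[Hashable]]:
    """Adjacency sets of an undirected graph."""
    adj: Dict[Hashable, Set[Hashable]] = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def geodesic_distance(adj, source, target) -> Optional[int]:
    """Shortest-path length by BFS; None if target is unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in adj.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def distance_histogram(observed_edges, held_out_edges) -> Dict[Optional[int], float]:
    """Fraction of held-out links at each geodesic distance in the observed graph."""
    adj = build_adjacency(observed_edges)
    counts = Counter(geodesic_distance(adj, u, v) for u, v in held_out_edges)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {distance: count / total for distance, count in counts.items()}
```

A histogram concentrated at distance 2 would correspond to the pattern of Figure 2(a), whereas mass at larger distances would indicate patterns like those of Figures 2(b) and 2(c).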

The Proposed Framework
In this work, we propose a novel link prediction framework, AdaSim, based on an adaptive similarity function using the features learned from network representation. The whole framework is illustrated in Figure 3. It can be divided into three parts: subgraph generation, feature representation, and similarity function learning. First, the positive and negative node pair indexes are obtained through random sampling. The corresponding subgraph $G_s$ is generated via edge removal. Then we learn the representation of nodes in the network in an unsupervised way. Finally, a similarity function is defined and the optimal parameter is determined through supervised learning. The obtained similarity function with optimal penalty can be directly used to solve the link prediction problem.

Subgraph Generation.
Unlike other tasks such as link clustering or node classification, in which the complete structural information is available, a certain fraction of the links needs to be removed before performing network representation learning for link prediction [42][43][44]. In order to achieve this, one can iteratively select one link and determine whether it is removable or not. But this operation is less effective and very time consuming, especially when the network is very sparse since it needs to traverse almost all the nodes in the graph.
Instead, we propose a fast positive sampling method based on a minimum spanning tree (MST) in this paper. An MST is a subset of the edges in the original graph $G$ that connects all the nodes together. That means all the edges are removable except those that belong to the MST, and their deletion will not break the connectivity of $G$. Lines 1-4 in Algorithm 1 show the core of our approach. We first generate an MST of $G$, denoted as $G_{mst} = (V, E_{mst})$, using Kruskal's algorithm. The positive samples $E_p$ are randomly selected from $E - E_{mst}$. To generate negative samples $E_n$, we sample an equal number of node pairs from $G$ with no edge connecting them (lines 5-7). Then we delete all the edges in $E_p$ from $G$ and obtain the subgraph $G_s$ (line 8).
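The sampling procedure can be sketched as follows. Algorithm 1 itself is not reproduced here, so this is a minimal illustration rather than the authors' implementation: it assumes an unweighted, connected, undirected graph with sortable node identifiers (for an unweighted graph, Kruskal's algorithm reduces to a union-find pass over the edges in any order), and the function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Edge = Tuple[int, int]


def kruskal_spanning_tree(nodes: Sequence[int], edges: Sequence[Edge]) -> List[Edge]:
    """Spanning tree via union-find (Kruskal with all edge weights equal)."""
    parent = {n: n for n in nodes}

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # edge joins two components, so keep it
            parent[root_u] = root_v
            tree.append((u, v))
    return tree


def sample_link_prediction_data(nodes, edges, num_positive, seed=0):
    """Return (E_p, E_n, edges of G_s) with G_s guaranteed to stay connected."""
    rng = random.Random(seed)
    edges = [tuple(sorted(e)) for e in edges]
    edge_set = set(edges)
    tree = set(kruskal_spanning_tree(nodes, edges))
    removable = [e for e in edges if e not in tree]  # E - E_mst
    node_list = sorted(nodes)
    max_non_edges = len(node_list) * (len(node_list) - 1) // 2 - len(edge_set)
    if num_positive > len(removable) or num_positive > max_non_edges:
        raise ValueError("not enough removable edges or unconnected pairs")
    positives = rng.sample(removable, num_positive)
    negatives = set()
    while len(negatives) < num_positive:  # rejection-sample unconnected pairs
        u, v = rng.sample(node_list, 2)
        pair = (u, v) if u < v else (v, u)
        if pair not in edge_set:
            negatives.add(pair)
    positive_set = set(positives)
    subgraph_edges = [e for e in edges if e not in positive_set]
    return positives, sorted(negatives), subgraph_edges
```

Because positives are drawn only from edges outside the spanning tree, no per-edge connectivity check is needed, which is the source of the speed-up over testing removability one link at a time.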

Feature Representation.
Now we proceed to perform the feature learning task on subgraph $G_s$. This task consists of two core components, i.e., a node sequence sampling strategy and a language model.

Node Sequence Sampling.
In terms of node sequence sampling, the most classical strategies are Breadth First Search (BFS) and Depth First Search (DFS) [26]. BFS starts at a specific node and explores the neighbors first before moving to the next level. In contrast, DFS starts at one node and explores as far as possible along each branch before backtracking. BFS and DFS represent two extreme sampling strategies with respect to the search space they explore, which has valuable implications for the learned representations. In fact, the neighborhood sampled by BFS can reflect the structural equivalence of the networks, and the nodes sampled by DFS can reflect a macro-view of the neighborhoods, which is essential in inferring communities based on homophily [26]. Although both are of paramount significance for producing interesting representations, neither can simultaneously reveal the complex properties of networks. We need a sampling strategy that can smoothly interpolate between DFS and BFS, a requirement that can be fulfilled by random walks on graphs.
A random walk of length $l$ on $G_s$ rooted at node $u$ is a stochastic process with random variables $(v_1, v_2, \ldots, v_l)$ such that $v_1 = u$ and $v_{i+1}$ is a node chosen uniformly at random from the neighbors of $v_i$. Random walks arise in a variety of models for large-scale networks, such as computing node similarities [19,45], learning to rank nodes [46,47], and estimating network properties [48]. Besides, they are the foundation of a class of output-sensitive algorithms that employ them to calculate local information about community structure. This connection is the reason that motivates us to use random walks as the node sequence sampling strategy for extracting network information.
Lines 1-6 in Algorithm 2 show the procedure of node sequence sampling. As shown in Figure 3, we can obtain a series of node sequences using random walks. For example, if we want a random walk of length $l = 5$ rooted at $A$ on the toy network, we may get the result $W = \{A, D, F, E, F\}$. The other sequences are obtained similarly.
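The sampling step can be illustrated with a short Python sketch of uniform random walks. Algorithm 2 is not reproduced here, so two details are assumptions: each of the `walks_per_node` passes visits the nodes in a freshly shuffled order, and a walk stops early at a node with no neighbors. The toy adjacency below is hypothetical (it merely admits the example walk A, D, F, E, F) and is not the network of Figure 3.

```python
import random
from typing import Dict, Hashable, List


def random_walk(adj: Dict[Hashable, List[Hashable]], start, length: int,
                rng: random.Random) -> List[Hashable]:
    """Uniform random walk containing `length` nodes, rooted at `start`."""
    walk = [start]
    while len(walk) < length:
        neighbors = adj[walk[-1]]
        if not neighbors:  # dead end; cannot occur in a connected G_s
            break
        walk.append(rng.choice(neighbors))
    return walk


def sample_walks(adj: Dict[Hashable, List[Hashable]], walks_per_node: int,
                 walk_length: int, seed: int = 0) -> List[List[Hashable]]:
    """Generate `walks_per_node` walks rooted at every node of the graph."""
    rng = random.Random(seed)
    nodes = list(adj)
    walks = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)  # assumed: a new node order on every pass
        for node in nodes:
            walks.append(random_walk(adj, node, walk_length, rng))
    return walks


# Hypothetical toy graph given as neighbor lists.
toy_adjacency = {
    "A": ["B", "D"],
    "B": ["A", "D"],
    "D": ["A", "B", "F"],
    "E": ["F"],
    "F": ["D", "E"],
}
toy_walks = sample_walks(toy_adjacency, walks_per_node=2, walk_length=5)
```

The resulting list of walks plays the role of the "document" on which the language model in the next step is trained.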

Language Model.
In order to get the representations of networks, the objective is to maximize $\Pr(v_i \mid x_{v_i})$ over the sampled node sequences, where $x_{v_i}$ is the context of node $v_i$, $\lambda$ is the context size, and $f$ is the mapping function from nodes to feature representations. For $\Pr(v_i \mid x_{v_i})$, we can use softmax, which is a log-linear classification model, to get the posterior distribution of nodes. However, softmax involves a summation over all the nodes, and doing such a computation for every training instance is very expensive, making it impractical to scale up to large networks. To solve this problem, an intuitive idea is to limit the number of output vectors updated per training instance. Thus, hierarchical softmax [49] was proposed to improve the learning efficiency, and we adopt it in this work. In the end, we use stochastic gradient descent (SGD) techniques to optimize the objective function (lines 7-12 in Algorithm 2) to get the social representation of each node, i.e., $f(v_i)$, in the graph. As illustrated in Figure 3, for each node in the toy network, we can get a $d$-dimensional representation associated with it.

Similarity Function Learning.
For a node pair $(u_i, v_i) \in D_S$, we use $u_i$ and $v_i$ to denote their features obtained from network representation learning. Considering the distribution bias of real links among different geodesic distances, we propose a novel similarity function (7) with a penalty parameter $p$. For simplicity, we denote $a_i = u_i^{T} v_i$ and $b_i = \|u_i\| \|v_i\|$ and rewrite (7) in terms of $a_i$ and $b_i$. A logistic function is applied to map the node pair similarity to a value in (0, 1), which is the probability that the pair belongs to the positive class rather than the negative class. We use $\hat{y}_i$ to denote this probability. To measure the closeness between the predicted value and the true label, we select the cross-entropy loss $C$ as our objective. Stochastic gradient descent is used to obtain the optimal value of $p$ by minimizing $C$, with the gradient $dC/dp$ obtained through the chain rule. Algorithm 3 shows the core part of the parameter learning process. The training and test datasets are first obtained through line 1 in Algorithm 3. Then the optimal value $p_{opt}$ is learned using SGD on the training dataset $D_t$ (line 2). $p_{opt}$ is then used to measure the similarity of node pairs in $D_p$, and their probabilities of being connected are obtained through lines 3 to 6 in Algorithm 3. Finally, the evaluation results are obtained through line 7.
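The learning step can be illustrated with a short sketch. It rests on several assumptions that are not taken from the text: the penalized similarity is assumed to have the form $a_i / (b_i + p)$ (chosen only because it reduces to cosine similarity at $p = 0$), the loss is averaged over node pairs, and full-batch gradient descent stands in for SGD to keep the code short. The function names, learning rate, and epoch count are hypothetical.

```python
import numpy as np


def adaptive_similarity(U, V, p):
    """Assumed penalized form a / (b + p), with a = u.v and b = |u||v| per row."""
    a = np.sum(U * V, axis=1)
    b = np.linalg.norm(U, axis=1) * np.linalg.norm(V, axis=1)
    return a / (b + p), a, b


def cross_entropy(U, V, y, p):
    """Mean cross-entropy between labels y and logistic(similarity)."""
    sim, _, _ = adaptive_similarity(U, V, p)
    y_hat = 1.0 / (1.0 + np.exp(-sim))
    eps = 1e-12  # guards against log(0)
    return -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))


def learn_penalty(U, V, y, p0=0.0, lr=0.1, epochs=200):
    """Full-batch gradient descent on the mean cross-entropy with respect to p."""
    p = p0
    for _ in range(epochs):
        sim, a, b = adaptive_similarity(U, V, p)
        y_hat = 1.0 / (1.0 + np.exp(-sim))
        # Chain rule: dC/dp = mean((y_hat - y) * dsim/dp),
        # where dsim/dp = -a / (b + p)^2 under the assumed similarity form.
        grad = np.mean((y_hat - y) * (-a / (b + p) ** 2))
        p -= lr * grad
    return p
```

Here `U` and `V` hold the embeddings of the two endpoints of each node pair (one pair per row) and `y` holds the 0/1 labels. Under the assumed form, the denominator can approach zero for strongly negative `p`, so a practical implementation would need to guard against that case.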

Experiments
In this section, we first give a brief description of the datasets used in the experiments. Next, we introduce the baseline models and evaluation metrics for link prediction. Then, the experimental results are presented with a detailed analysis. Finally, because the AdaSim framework involves several parameters, we show how different choices of these parameters affect the performance of link prediction.
Algorithm 2. Input: $G_s = (V, E - E_p)$, window size λ, feature size d, walks per node k, walk length l. Output: node representation f.

Datasets.
To comprehensively evaluate the performance of our proposed link prediction algorithm, we conduct our experiments on eleven real-world datasets that are commonly used in the link prediction domain. These datasets come from various fields, and their details are as follows:
(i) C. elegans [50] is the neural network of the Caenorhabditis elegans worm. The nodes represent the neurons and the edges denote synapses or gap junctions.
(ii) PB [51] is a network of hyperlinks between weblogs on United States politics.
(iii) Wiki-vote [52] is a social network that contains all the Wikipedia voting data from the inception of Wikipedia until January 2008. Nodes in the network represent Wikipedia users, and a directed edge from node i to node j indicates that user i voted on user j.
(iv) Email-enron [52] is a communication network that covers all the e-mail communication within a dataset of around half a million e-mails. Nodes of the network are e-mail addresses, and if there is at least one e-mail from address i to address j, then there is a link between them.
(v) Epinions [52] is a who-trusts-whom online social network. Members of the Epinions site can decide whether to trust each other. If user i trusts user j, then there is a link between them.
(vi) Slashdot [52] is a technology-related news website. The network contains friend/foe links between the users of Slashdot.
(vii) Sexual [53] is a well-known sexual contact network. This network is very sparse and has almost no closed triangles.
(viii) Roadnet [52] is a road network of California, which is a typical sparse and treelike network.
(ix) Power [50] is a traditional sparse network that represents the power grid of the western United States.
(x) Router is a router-level Internet network collected by the Rocketfuel Project [54].
(xi) p2p-Gnutella [52] is a peer-to-peer file sharing network of Gnutella. Nodes in the network represent hosts and edges represent connections among those hosts.
The basic topological information of these networks is listed in Table 2, including the number of nodes and edges, average degree, average clustering coefficient, diameter, and density of the network. We roughly divide the networks into dense and sparse based on the average degree and average clustering coefficient. In summary, we conduct experiments on networks with various properties, i.e., sparse and dense, small and large. Thus, the datasets can comprehensively reflect the characteristics of the proposed method. Unlike for the other networks, $E_p$ contains 100% of the observed links for the highly sparse sexual contact network because of its small number of nodes.

Baseline Methods and Evaluation Metrics.
In order to validate the performance of our proposed algorithm, we compare AdaSim against the following link prediction models.
(i) Common neighbors (CN): for node u, let Γ(u) denote the set of neighbors of u. Two nodes, u and v, have a high probability of being connected if they have many common neighbors [55,56]. The simplest way to measure this neighborhood overlap is to count the common neighbors directly, i.e., $\mathrm{CN}(u, v) = |\Gamma(u) \cap \Gamma(v)|$.
(ii) Resource allocation (RA) [28]: for an unconnected node pair u and v, it is assumed that u can send some resources to v through their common neighbors. The similarity between u and v is defined as the amount of resources received by v from u, i.e., $\mathrm{RA}(u, v) = \sum_{z \in \Gamma(u) \cap \Gamma(v)} 1/k_z$, where $k_z$ is the degree of z.
(iii) Preferential attachment (PA) [57]: the preferential attachment mechanism is used to generate random scale-free networks, in which the probability that a new link connects to u is proportional to $k_u$. Similarly, the probability that a new link connects u and v is proportional to $k_u \times k_v$. The PA similarity index is defined as $\mathrm{PA}(u, v) = k_u \times k_v$.
(iv) Salton Index (SI) [40]: SI is also known as cosine similarity and is defined as $\mathrm{SI}(u, v) = |\Gamma(u) \cap \Gamma(v)| / \sqrt{k_u \times k_v}$.
(v) Clustering coefficient for link prediction (CCLP) [29]: this similarity index considers more local structural information. In this method, the local link information is conveyed by the clustering coefficients of common neighbors, i.e., $\mathrm{CCLP}(u, v) = \sum_{z \in \Gamma(u) \cap \Gamma(v)} \frac{t_z}{k_z (k_z - 1)/2}$, where $t_z$ is the number of triangles passing through node z.
(vi) Heterogeneity Index (HEI) [10]: this method is based on network heterogeneity and is the state of the art for sparse and treelike networks. It is defined as $\mathrm{HEI}(u, v) = |k_u - k_v|^{\alpha}$, where α is a free heterogeneity exponent.
(vii) Node2vec [26]: this is a supervised approach to link prediction using logistic regression. The features used in this method are generated through heuristic binary operators applied to the node pair features learned from network embedding. Two parameters, p and q, control the sampling of node sequences. Note that when p = q = 1, Node2vec is equivalent to DeepWalk [23].
Algorithm 3. Input: node representation f, pairs of nodes $E_p + E_n$, train-test split ratio r′. Output: evaluation results val.
Besides Node2vec, there are other approaches to unsupervised feature learning for graphs, such as spectral clustering [58] and LINE [59]. We exclude them in this work since they have already been shown to be inferior to Node2vec [26]. We also exclude other supervised methods, such as ensemble learning [15] and support vector machines [16]. These methods can achieve relatively better performance, but at the cost of high complexity, which is contrary to our original intention.
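The neighborhood-based baselines are simple enough to sketch directly. The snippet below is a minimal illustration using the standard definitions of CN, RA, PA, SI, and CCLP on an undirected graph stored as a dictionary of neighbor sets. The helper names and the toy graph are hypothetical, and integer node labels are assumed so that node pairs can be ordered.

```python
from math import sqrt


def build_adjacency(edges):
    """Undirected adjacency as a dict mapping each node to its neighbor set."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def common_neighbors(adj, u, v):
    return len(adj[u] & adj[v])


def resource_allocation(adj, u, v):
    # Each common neighbor z passes on 1 / k_z of its resource.
    return sum(1.0 / len(adj[z]) for z in adj[u] & adj[v])


def preferential_attachment(adj, u, v):
    return len(adj[u]) * len(adj[v])


def salton_index(adj, u, v):
    denom = sqrt(len(adj[u]) * len(adj[v]))
    return len(adj[u] & adj[v]) / denom if denom else 0.0


def cclp(adj, u, v):
    """Sum of the clustering coefficients of the common neighbors of u and v."""
    total = 0.0
    for z in adj[u] & adj[v]:
        k = len(adj[z])
        if k < 2:
            continue
        # Triangles through z = edges among the neighbors of z.
        t = sum(1 for x in adj[z] for w in adj[z] if x < w and w in adj[x])
        total += t / (k * (k - 1) / 2)
    return total


# Toy graph (hypothetical): score the unconnected pair (0, 3).
toy = build_adjacency([(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)])
scores = {
    "CN": common_neighbors(toy, 0, 3),
    "RA": resource_allocation(toy, 0, 3),
    "PA": preferential_attachment(toy, 0, 3),
    "SI": salton_index(toy, 0, 3),
    "CCLP": cclp(toy, 0, 3),
}
```

All of these scores except PA vanish when two nodes share no neighbors, which is consistent with the weak performance of such indices on sparse, treelike networks.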
We adopt the area under the receiver operating characteristic curve (AUC) to quantitatively evaluate the performance of link prediction algorithms. The AUC value quantifies the probability that a randomly chosen missing link is given a higher score than a randomly chosen node pair without a link. A higher score means better performance.
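This probabilistic reading of AUC translates directly into code. The sketch below compares every missing link with every nonexistent link and counts a tie as half a win, which is a common convention that the text does not specify. The function name is hypothetical, and the exhaustive comparison is meant for illustration rather than for large datasets.

```python
def auc_score(pos_scores, neg_scores):
    """Probability that a positive pair outscores a negative pair (ties count 0.5)."""
    wins = 0.0
    for s_pos in pos_scores:
        for s_neg in neg_scores:
            if s_pos > s_neg:
                wins += 1.0
            elif s_pos == s_neg:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

A predictor that scores pairs at random yields a value near 0.5, and a predictor that ranks every missing link above every nonexistent link yields 1.0.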

Experimental Results.
To obtain the following results, we set the parameters in line with the typical values in [26], that is, d = 128, k = 10, l = 80, and λ = 10, and the optimization is run for a single epoch. Fifty percent of the edges are removed and treated as positive examples. The negative node pairs, which have no edge connecting them, are randomly sampled from the network. The two parameters of Node2vec, p and q, are selected through a grid search over p, q ∈ {0.25, 0.5, 1, 2}. After the dataset is prepared, we use tenfold cross-validation to evaluate the performance. For the sake of objectivity, the experiment is repeated ten times on each dataset, and the average results are reported in Table 3.
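The data preparation step can be sketched as follows. This is a minimal illustration under stated assumptions: the graph is treated as undirected, node labels are orderable, and the number of sampled negative pairs equals the number of removed edges, a ratio the text does not state. The function name and the seed parameter are hypothetical.

```python
import random


def split_edges(edges, nodes, removed_ratio=0.5, seed=0):
    """Remove a fraction of edges as positives and sample non-edges as negatives."""
    rng = random.Random(seed)
    # Normalize each undirected edge to (small, large) and drop duplicates.
    edge_list = sorted({tuple(sorted(e)) for e in edges})
    edge_set = set(edge_list)
    rng.shuffle(edge_list)
    n_pos = int(len(edge_list) * removed_ratio)
    positives = edge_list[:n_pos]   # removed edges, used as positive examples
    remaining = edge_list[n_pos:]   # edges kept for representation learning
    node_list = list(nodes)
    negatives = set()
    # Assumes the graph has at least n_pos unconnected pairs; otherwise this loops forever.
    while len(negatives) < n_pos:
        u, v = rng.sample(node_list, 2)
        pair = (u, v) if u < v else (v, u)
        if pair not in edge_set:
            negatives.add(pair)
    return remaining, positives, sorted(negatives)
```

Node representations would then be learned on the graph formed by `remaining`, while `positives` and `negatives` together supply the labeled node pairs for training and evaluation.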
A general observation we can draw from these results is that the proposed link prediction algorithm, AdaSim, obtains better performance than all the baseline methods on all datasets. More specifically, the unsupervised similarity-based link prediction methods achieve relatively lower values than the supervised ones, since the label information is not leveraged to boost model performance. However, the PA predictor achieves competitive results compared with AdaSim and even outperforms Node2vec on five out of eleven datasets. This is because preferential attachment is one of the key features in generating power-law scale-free networks. It reflects the mechanism of network evolution that involves the addition of new nodes and edges. Thus, it can obtain better performance on link prediction problems. Similarity-based link prediction methods, however, perform much worse when the network is sparse, since limited or no closed triangular structure exists in these networks. Among all the supervised link prediction methods, AdaSim outperforms both DeepWalk and Node2vec on all eleven networks, with gain ratios of different scales. Compared with Node2vec, the gain ratio in AUC varies from 0.75% to 43.04%.
Table 2: Basic topological information of the datasets. |V| is the number of nodes in the network and |E| is the total number of links. Avg. degree denotes the average node degree. Avg. CC represents the average clustering coefficient, which indicates the probability that the neighbors of a node are connected to each other. Diameter is the longest of all the calculated shortest paths in a network. Density is the ratio of |E| to the number of possible edges.
To show intuitively the influence of the penalty p on link prediction performance, we set p to specific values from −val to val with a fixed increment a (here val = 50 and a = 1 for demonstration) and display the resulting AUC on three datasets, i.e., C. elegans, Router, and Wiki-vote, in Figure 4. Notice that p = 0 corresponds to the original cosine similarity measure. Figure 4 clearly shows that the AUC is considerably affected by the value of p. Compared with the rigid cosine similarity, our proposed AdaSim can substantially improve the link prediction performance.
This also verifies our empirical findings in Section 3.3 that different networks have different link formation patterns; thus, a flexible and adaptive similarity function for link prediction is needed to capture these various patterns.
We select three representative sparse networks, that is, Sex (small), Power (medium), and p2p-Gnutella (large), and report the wall-clock time of CN, CCLP, HEI, Node2vec, and AdaSim in Table 4. We can observe from this table that with the increase of network size, the prediction time needed for all algorithms also increases. Besides, the learning-based algorithms usually take more time than similarity-based ones since vector-vector multiplication takes more time than simply calculating the neighborhood information of two nodes. Although our algorithm requires more time to predict, it has also achieved considerable performance gains, as explained above.

Performance on Networks with Different Sparsities.
Networks in the real world are often sparse, and we know only very limited information about the interactions among their nodes. For example, 80% of the molecular interactions in yeast cells and 99.7% of those in human cells are still unknown [40]. A good link prediction method should have robust performance on networks with different sparsities.
We change the sparsity of the networks by randomly removing a certain percentage of links from the original network, and then follow the aforementioned experimental setup to report the results of different methods. The results on the Wiki-Vote dataset are displayed in Figure 5. Only four baseline methods are shown in the figure, since CN and CCLP perform similarly to RA, and DeepWalk performs similarly to Node2vec.
The results show that the AUC values decrease as the removed edge ratio increases, since it becomes more and more challenging to characterize node similarity using information on network topology. The similarity-based methods perform well when the removed edge ratio is relatively small. Moreover, AdaSim performs consistently well and is robust to different sparsity conditions of networks. Even when eighty percent of the edges are removed, AdaSim still maintains an AUC of around 0.95. Overall, AdaSim is not only robust to different network conditions but also achieves better performance than the baselines.

Parameter Sensitivity.
There are several parameters involved in the AdaSim algorithm, and in Figure 6 we examine how different choices of these parameters influence the performance of AdaSim on the Wiki-Vote dataset. Except for the parameter being tested, all other parameters take their default values.
We measure the AUC as a function of the representation dimension d, the walk length l, and the number of walks per node k. We observe that the dimension of the learned node representations has a limited effect on link prediction performance. As the dimensionality increases, the AUC values increase slightly and saturate when d reaches 128. It can also be observed that larger values of l and k improve the performance; this is because more neighborhood information of the seed node is included in the representation learning process, so the node similarities can be captured more precisely.

Conclusion
In this work, we focus on the link prediction problem with features obtained from network embedding. As the edge features generated through heuristic binary operators are an information-losing projection of the original node features, we have given quantitative evidence of the inconsistency between heuristic edge features and learned ones. Moreover, we have developed a novel link prediction framework, AdaSim, by introducing an adaptive similarity function to deal with the inflexibility of cosine similarity, especially for sparse or treelike networks. AdaSim first learns node representations of networks by solving a graph-based objective function and then adds a penalty parameter, p, to the original similarity function. Finally, the optimal value of p is learned through supervised learning. The proposed AdaSim is flexible, adapts to the data distribution, and can capture the various link formation mechanisms of different networks. We conducted experiments using publicly available real-world network datasets and extensively compared AdaSim with seven well-established representative baseline methods.
The results show that AdaSim achieves better performance than state-of-the-art algorithms on all datasets. It is also robust to the sparsity of the networks and obtains competitive performance even when a large fraction of edges is missing.

Data Availability
All datasets can be obtained from the corresponding author upon request.

Conflicts of Interest
The authors declare no competing interests. Funds for the Central Universities (14370119 and 14390110).