ECLP: Friend Recommendation Using Ensemble Approach for Detecting Communities Performing Link Prediction

Social networks provide a variety of online services that play an important role in creating new connections among members who share their favorite media, documents, and opinions. For each member, these networks should precisely recommend (predict) links to the members with the highest common interests. Because of the huge volume of users with different types of information, these networks encounter challenges such as data dispersion and the accuracy of link prediction. Moreover, networks with numerous users suffer from computational and time complexity, because all the network nodes contribute to the calculations of link prediction and friend suggestion. In order to overcome these drawbacks, this paper presents a new link prediction scheme containing three phases to combine local and global network information. In the proposed method, dense communities with overlap are first detected based on the ensemble node perception method, which retains the most relevant nodes for link prediction and speeds up the algorithm. Then, these communities are optimized by applying the binary particle swarm optimization method to merge close clusters, maximizing the average clustering coefficient (ACC) of the whole network, which results in an accurate and precise prediction. In the last phase, candidate links are predicted by the Adamic/Adar similarity index for each node. The proposed method is applied to the Astro-ph, Blogs, CiteSeer, Cora, and WebKB datasets, and its performance is compared to state-of-the-art schemes in terms of several criteria. The results imply that the proposed scheme achieves a significant accuracy improvement on these datasets.

Social networks play an important role in people's daily lives [27][28][29][30]. Social media creates a substrate for users to share their information, in terms of media, documents, and opinions [31][32][33][34]. These networks provide online services imitating and simulating real-life interactions and relationships. Social networks also enable making connections among members, in a way that each user can select and add new friends from a long list of candidates, suggested by the recommender system module. Research findings show that users usually connect with their friends/colleagues (who they know and see in real life) as well as new friends who are introduced by the link prediction service of social networks [35,36].
With the daily growth of information in social networks, the process of introducing proper friends by the link prediction service has become a very challenging task and requires high precision [37]. Recommender systems, which predict links for users, have been used for more than ten years to offer products and services to users based on their interests, preferences, and online behavior [38][39][40][41][42][43][44][45]. Another challenge in the LP problem is precise link prediction among members of large networks [46,47]. Link suggestion is also challenging when the actual communities are very sparse; in other words, the nodes in these communities are associated with only a small fraction of all network nodes. For example, in the case of Facebook, a user typically connects to only about 100 (out of 500 million) nodes, so a predictor can appear highly accurate while still being useless: out of 500 million possible predictions, only about 100 errors can occur even for a trivial predictor. One of the most important problems that recommender (link prediction) systems face is the cold-start problem, which means there is not sufficient information (ratings) in the system to provide a recommendation [48][49][50][51][52].
In general, link prediction methods are divided into two categories: supervised methods such as decision trees, naive Bayes, and K-nearest neighbors (KNN) [14] and unsupervised methods, which are based on the network's topology information. The unsupervised methods are divided into three categories: local, global, and semilocal [53].
Local criteria are based on neighborhood topological information. These methods rely on the idea that if there are some similarities between nodes m1 and m2, they are more likely to be connected in the near future. The most important local criteria are common neighbors (CN) [48,54], Jaccard's coefficient [55,56], and the Adamic/Adar index (AA) [57]. These criteria successfully reduce the computational cost; however, their forecasting accuracy is relatively low [58]. Semilocal criteria need more information about the network topology than the local criteria but do not require information about all the network connections. They were created to balance the local and global criteria and avoid the challenges of global criteria, such as being time-consuming and requiring the complete network structure; they thus offer a good trade-off between complexity and performance. Examples are the local path index (LPI) and local random walk (LRW) methods. Global criteria, such as methods based on all paths in the network [10], need complete information about the network topology. In this approach, the similarity of two nodes is measured based on all the paths between them and all the information extracted from the entire network topology. Well-known global criteria are Katz [59], random walk, hitting time, and SimRank, all of which suffer from high computational complexity [60,61].
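To make the three local criteria concrete, the following sketch computes CN, Jaccard, and Adamic/Adar scores on a small toy graph; the graph and function names are illustrative examples, not part of the proposed method or any particular library:

```python
import math

# Toy undirected graph as an adjacency dictionary (hypothetical example data).
graph = {
    1: {2, 3},
    2: {1, 3, 4},
    3: {1, 2, 4},
    4: {2, 3, 5},
    5: {4},
}

def common_neighbors(g, x, y):
    """CN index: number of shared neighbors of x and y."""
    return len(g[x] & g[y])

def jaccard(g, x, y):
    """Jaccard's coefficient: shared neighbors over the union of neighbors."""
    union = g[x] | g[y]
    return len(g[x] & g[y]) / len(union) if union else 0.0

def adamic_adar(g, x, y):
    """AA index: shared neighbors weighted inversely by the log of their degree."""
    return sum(1.0 / math.log(len(g[z])) for z in g[x] & g[y])

print(common_neighbors(graph, 1, 4))  # 2 (shared neighbors are nodes 2 and 3)
print(round(jaccard(graph, 1, 4), 3))
print(round(adamic_adar(graph, 1, 4), 3))
```

Note how AA refines CN: both shared neighbors count toward the score, but a low-degree shared neighbor contributes more than a hub would.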
Some link prediction methods are designed based on data mining techniques such as clustering, which can categorize each group of members with mutual interests into one cluster. This technique can greatly limit the search space and, for each member of a cluster, suggests a list of candidate members selected from the corresponding cluster. This process can be used to improve the matching process, provide suggestions on social networks, and also predict future events. Previous researchers clustered the nodes of a social network for their link prediction algorithm and deduced that there was a significant relation between the communities of the network structure and the precision of their algorithm. Moreover, a similarity-based link prediction algorithm that used the clustering information improved the accuracy of link prediction [62]. A gravity-based link prediction method was proposed which included community information of networks. The researchers used the Louvain algorithm for community detection as well as parallelism in the community processing to speed up the prediction globally [59].
None of the previous works considered overlap between the recognized clusters, nor did they check whether merging clusters yields an optimal combination of clusters for this particular problem. Therefore, a three-phase method is proposed in this paper which addresses all the previously discussed issues. In the first phase of the algorithm (community detection), an ensemble of clusterings is applied. In this phase, dense communities are found by the node perception algorithm which, unlike previous works, is a clustering method that allows overlap between the detected clusters, controlled by three hyperparameters. To find proper values for these parameters, a greedy search is applied. In the second phase, the clustering process continues by applying the binary particle swarm optimization (BPSO) algorithm to optimize the clusters by merging near-enough neighbor clusters. The fitness function of BPSO is defined to maximize the average clustering coefficient (ACC) of the network. Then, those communities are merged whose merging enhances the fitness value. Executing this phase leads to more suitable communities because they improve the network ACC. Eventually, in the third phase of our work, a similarity-based link prediction algorithm (Adamic/Adar) is applied with the aim of achieving high performance in link prediction in social networks. The rest of this paper is organized as follows. Section 2 presents previous related works. Our proposed algorithm is described in Section 3 in detail. Section 4 describes the datasets and evaluation criteria. The results and analysis of our work are reported in Section 5, and the last section concludes our paper.

Related Work
Due to the importance of friend suggestions in social networks, this issue has been investigated in several studies. Bastami et al. [59] introduced a new similarity measure called Triadic, which used units called motifs, defined as the different forms of a triadic network of three nodes. Their main proposed method consisted of two phases; in the first phase, a data table was created including the information of each node in the graph. Then, in the second phase, they trained a classifier on the data table entries to predict the links. They applied their work to directed datasets to identify the motifs, which means that undirected networks cannot be analyzed by their algorithm. They also did not consider the problem of imbalanced datasets in their model; datasets in link prediction are very imbalanced because the positive class (existing links) is in the minority compared with nonexisting links.
Rafiee et al. [60] proposed link prediction based on common neighbors degree penalization (CNDP) and used quasilocal techniques instead of local or global techniques, by considering neighbors of neighbors instead of only direct neighbors. They defined a new similarity score determined according to the topological structure of the network, including the common neighbors of each pair of nodes and the average clustering coefficient of the network. They performed their experiments on the BUP, CEG, SMG, and HMT datasets and showed that their proposed method was superior to other methods such as adaptive degree penalization (ADP), the node-coupling clustering coefficient of node (NCCCN), and the Triadic measure. They showed that involving the three triadic nodes x, y, and z, with x and y as the two desired nodes whose similarity is calculated and z as their common neighbor, can affect the prediction performance. They also calculated the degree penalization scores in terms of area under the curve (AUC) and precision metrics using regression over different datasets, which improved their work compared with the ADP algorithm that used a fixed value for degree penalization.
Most of the studies consider a static network to assess their method, while this is a rough approximation because all social networks are updated over time. In this regard, Yao et al. [61] introduced a hybrid link prediction method that is assessed over dynamic networks with three metrics: the time-varied weight (revealing topological structure changes over time), the degree of common neighbors, and the intimacy between common neighbors. The last metric checks whether two common neighbor nodes have mutual relationships, in which case the probability of a link appearing between them increases. They redefined the common neighbors by considering the nodes within two hops to achieve better performance. In their proposed method, they only focused on the topological features of common neighbors that can change over time, ignoring metrics related to other features such as the nodes' attributes, larger-scale graph topology (communities), or the change of paths between two nodes. Moreover, one of their metrics, which checks whether intimacy between the common neighbors exists or not, is independent of time and can be used in static networks. They performed their experiments on DBLP datasets and used the Jaccard similarity criterion. They also used TPR and execution time to evaluate the performance of their method.
Bao et al. [61] proposed an incremental dynamic link prediction method that predicts candidate links in dynamic social networks. Their algorithm, called DLP-IRA, is based on the resource allocation (RA) algorithm, which relies on the nodes' degrees and considers the fact that the smaller the degrees of two nodes, the higher the chance of a link being established between them. Therefore, they defined a weight on each nonexisting link that is the inverse of the maximum degree of the source and destination nodes. They also stated that common neighbors of two nodes have a huge impact on link appearance, so they included the common neighbors' degrees in their calculation. To adapt their algorithm to dynamic networks, they update a changing node set S with all the nodes whose connected links have changed. When no new node is added, subgraphs including the nodes in S and their neighbors are considered as the input of their improved resource allocation algorithm. They claimed that their algorithm can avoid recomputation when the graph structure changes and also speeds up the process. The basis of their algorithm is the nodes' degrees, and no other network information contributes. Even though the statement of an inverse correlation between the nodes' degrees and their chance of link creation is generally true, in many situations the opposite holds, for example, when a TV star becomes famous and ever more fans follow him as his popularity increases, or when an extroverted person tries to connect with more and more people. Furthermore, they did not consider the evolution of graph changes over time and relied only on the last change.

Proposed Method
In this research, a combined three-phase method has been proposed for link prediction (suggesting friends) in social networks. The method combines local, global, and community information in very large networks with millions of users and billions of links, which increases the prediction accuracy and covers all possible modes, thereby addressing the problem of data scattering. It also mitigates the cold-start problem for users and improves the accuracy of the suggestions.
The three stages of the proposed method are (i) community detection, which detects dense communities with the node perception algorithm, (ii) an optimization task with the binary particle swarm optimization (BPSO) algorithm, and (iii) applying a link prediction algorithm to the output of the former phase. The roadmap of the proposed method is illustrated in Figure 1 and Tables 1 and 2. Since the simulations and algorithms are applied to a graph, we briefly discuss the role of graphs in link prediction systems.

Graph Theory.
Interactions on social networks and their structure can be simulated and implemented on a graph. A graph G = (V, E) consists of a set of nodes V and a set of edges E, where the edges connect the nodes. V represents the users of the social network and E the relationships between users. The adjacency matrix A of graph G is a two-dimensional matrix; if two users are connected, the corresponding entry is set to 1; otherwise, it is set to zero, as shown in Figure 2.

Community Detection. As shown in Figure 1, the first phase is community detection. To detect the communities, nodes should be clustered, and each cluster can be considered as a community. For each node x, we determine the clustering coefficient (CC) as follows:

CC_x = 2 · |{E_{y,z} : y, z ∈ Γ_x}| / (|Γ_x| · (|Γ_x| − 1)),   (1)
where Γ_x denotes the neighbors of node x, |Γ_x| is the degree of node x, and E_{y,z} represents an edge that exists between two nodes y and z. Therefore, to calculate the clustering coefficient of node x, all the edges between the neighbors of node x are counted and divided by the total number of possible mutual relationships between those neighbors. In other words, we determine how many people (nodes) are neighbors of our desired node and how many of those neighbors are friends with each other. More friendships between the neighbors of the desired node lead to a higher CC value. After determining the CC value of all nodes in each community, we determine the average clustering coefficient (ACC) of that community as the aggregation according to the following equation:

ACC = (1/|V|) · Σ_{x ∈ V} C_x,   (2)

where C_x represents the clustering coefficient of node x and |V| is the number of nodes in the graph. After determining the ACC of each community, the values are summed and divided by the number of communities. To detect each community by the node perception algorithm, we first identify small communities [23]. This algorithm has acceptable performance even when encountering millions of users and billions of links. Compared to other competitive methods, it has the property that one user can be assigned to more than one cluster. The algorithm uses three hyperparameters to perform the clustering; to set them properly, an ensemble of clusterings is used, which performs a greedy search over the parameter space. The algorithm consists of two phases that run alternately. Suppose that we are dealing with a network with n vertices. First, we take each vertex as its own community.
Hence, at first, we have as many communities as vertices; then, for each vertex i, we find the neighboring community j such that removing i from its community and joining it to community j maximizes the modularity, and then we add vertex i to community j. This is done only if the modularity increases; otherwise, vertex i remains in its community. This is repeated for all vertices until no change happens. At this stage, the first phase is finished. It should be noted that in the first phase, a vertex may be moved several times between different communities. The first phase stops at a locally optimal point. Then, in the second phase, clustering continues by merging small groups that can form larger groups. We continue this process until we reach the desired number of communities or exceed a certain threshold. This process should satisfy some constraints: the size of each community should be lower than a threshold.
We have an overlap threshold that controls the overlap tolerance of communities, and a threshold for the tolerance required to merge communities. The threshold parameter is set between 0.1 and 1, and we change its value in increasing steps of 0.1. We also set the smallest community size that can be detected. To properly set the parameters, we used an ensemble of clusterings: the ensemble is a set of node perception parameter configurations with different values, chosen to increase the aggregation value, namely the ACC. To obtain the ACC value of a community, we must first calculate the clustering coefficient of all nodes in the community based on equation (1). Then, based on equation (2), we determine the ACC of each community, and this value exceeding a threshold implies that it is a valid community.
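Equations (1) and (2) can be sketched in a few lines; the adjacency-dictionary representation and the small example community below are illustrative assumptions, not data from the paper:

```python
from itertools import combinations

def clustering_coefficient(g, x):
    """Equation (1): fraction of pairs of x's neighbors that are linked."""
    neighbors = g[x]
    k = len(neighbors)
    if k < 2:
        return 0.0  # degree-0/1 nodes have no neighbor pairs
    links = sum(1 for y, z in combinations(neighbors, 2) if z in g[y])
    return 2.0 * links / (k * (k - 1))

def average_clustering_coefficient(g):
    """Equation (2): mean clustering coefficient over all nodes."""
    return sum(clustering_coefficient(g, x) for x in g) / len(g)

# Hypothetical community: a triangle (1-2-3) plus a pendant node 4.
community = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(average_clustering_coefficient(community))
```

The triangle nodes 1 and 2 each have CC = 1, node 3 has CC = 1/3 (only one of its three neighbor pairs is linked), and the pendant node contributes 0, so the ACC is (1 + 1 + 1/3 + 0)/4.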

Clusters' Optimization by BPSO.
Particle swarm optimization (PSO) is one of the well-known evolutionary algorithms, introduced by Kennedy and Eberhart (1995). Similar to other metaheuristic algorithms, PSO starts with a population of particles, each having the potential to be a solution to the problem.
These particles are first randomly initialized all over the search space. After determining the fitness value of all particles, the one with the highest fitness score attracts all particles toward itself. After the second iteration, particles move according to two directions: the direction toward the position of the best particle (gbest) and the direction toward the best position that the corresponding particle itself has experienced (pbest). Unlike other evolutionary computational techniques, each particle in PSO has a position, velocity, and acceleration. The particles continuously move in the search space at an acceleration that is dynamically calculated according to their previous behaviors. Meanwhile, through the movement of particles, two random parameters insert a degree of randomness to enhance the exploration of the algorithm while PSO preserves a good exploitation property. If, through the movement of particles, the fitness of one particle exceeds gbest, this particle is considered as the new gbest [24,25].
Due to the discrete nature of our problem, the continuous operators of PSO should be discretized. In binary PSO (BPSO), introduced by Kennedy and Eberhart [24,25], the velocity is used as a probability threshold that determines whether each bit of the position vector is zero or one. Suppose that X_ij is the value of the j-th bit of the binary vector representing the position of the i-th particle. In this case, the following equation describes how the binary PSO algorithm works:

X_ij(t + 1) = 1 if r < s(v_ij(t + 1)), and X_ij(t + 1) = 0 otherwise, where s(z) = 1/(1 + e^(−z)),   (3)

where X_ij is an array with 1 and 0 elements, which respectively indicate whether the corresponding community should be merged with other communities or not, t indicates the iteration, v_ij is the velocity, s(z) is the sigmoid function that takes the velocity of the next time step as input, and r is a uniform random number; if the output of the sigmoid function is greater than the random number, the value 1 is placed in the X_ij array for the corresponding community; otherwise, 0 is placed. In binary PSO, the particle velocity is updated as in the standard PSO algorithm. In addition to the introduced parameters, the number of particles (initial population), which can increase exploration, and the maximum number of iterations are important parameters of the algorithm. BPSO optimizes the communities that were detected in the first stage to achieve a greater ACC. The fitness function of each particle (network) is the summation of the ACC of its corresponding elements (communities). Each particle is thus an array whose size is equal to the number of communities, and the position of each community can be zero or one. It should be noted that for multiple communities integrated by BPSO, ACC plays the main role in merging: a high value indicates that the integration has been done properly; otherwise, we ignore the merge.
For example, if we have a network with 5 communities, we consider a 5 × 1 array whose elements are 0 or 1, initialized at random. If the array elements are [0, 0, 0, 1, 1], this particle means that the fourth and fifth communities should be merged. Then, for all the nodes of the resulting community, we calculate the clustering coefficient and consequently the ACC as the fitness of the integrated community. These operations are repeated until all candidate merge particles are checked. This is where the second phase ends, with the best possible communities selected.
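The binary position update of equation (3) can be sketched as a single particle step. This is a simplified illustration: the inertia and acceleration coefficients (w, c1, c2) and the gbest pattern are assumed values, and the full algorithm would evaluate the ACC-based fitness of the merged communities after each step:

```python
import math
import random

def sigmoid(z):
    """s(z) in equation (3)."""
    return 1.0 / (1.0 + math.exp(-z))

def update_particle(position, velocity, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """One BPSO step: the velocity is updated as in standard PSO, then the
    sigmoid of the velocity acts as a probability threshold for each bit."""
    new_velocity, new_position = [], []
    for j in range(len(position)):
        v = (w * velocity[j]
             + c1 * random.random() * (pbest[j] - position[j])
             + c2 * random.random() * (gbest[j] - position[j]))
        new_velocity.append(v)
        # Equation (3): bit is 1 iff a uniform random number falls below s(v).
        new_position.append(1 if random.random() < sigmoid(v) else 0)
    return new_position, new_velocity

# Hypothetical particle over 5 communities: bits set to 1 mark communities
# selected for merging (as in the [0, 0, 0, 1, 1] example above).
random.seed(0)
position = [0, 0, 0, 1, 1]
velocity = [0.0] * 5
pbest = position[:]
gbest = [0, 1, 0, 1, 1]  # assumed best-known merge pattern
position, velocity = update_particle(position, velocity, pbest, gbest)
print(position)
```

In a full run, this update would loop over a swarm of such particles for a fixed number of iterations, recomputing pbest and gbest from the ACC fitness after every step.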

Link Prediction.
In the first phase, we started from an initial graph and divided the network into smaller, denser communities. Then, in the second phase, we optimized the communities of the first phase by BPSO, creating subgraphs whose merging gives better fitness results, leading to more valuable communities. In this regard, we consider each of these subgraphs as input and use the Adamic/Adar (AA) index method to find the suggested links according to equation (4); for each candidate edge of the graph, a high score obtained from equation (4) indicates that the edge is likely to appear:

AA(x, y) = Σ_{z ∈ Γ_x ∩ Γ_y} 1 / log|Γ_z|,   (4)

where x and y are the two nodes whose similarity we want to calculate and z ranges over their common neighbors. Γ_z denotes the neighbors of node z and |Γ_z| is the degree of node z. In this regard, we find the common neighbors of the two nodes (x, y); then, for each common neighbor, we divide 1 by the logarithm of its degree, and finally, the values obtained over the common neighbors are added together, which yields a score for the candidate edge connecting the two nodes. This score indicates the amount of similarity between the two nodes x and y and the likelihood of a link being created between them. If x and y have a large number of common neighbors, then the AA index assigns a higher score; in other words, there is a high probability that a connection (link) will be established between x and y in the future. At this stage, a threshold is considered; if the score of a candidate edge exceeds this threshold, there is a very high probability of a link between the two nodes and the link is suggested; otherwise, no link is suggested. For example, in Figure 3, AA(2, 5) is obtained by summing 1/log|Γ_z| over the common neighbors z of nodes 2 and 5.
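A minimal sketch of this third phase scores every unconnected pair inside a community subgraph with the AA index of equation (4) and suggests those above a threshold. The subgraph and the threshold value are illustrative assumptions; degree-1 common neighbors are skipped here because log(1) = 0 would cause a division by zero:

```python
import math
from itertools import combinations

def adamic_adar(g, x, y):
    """Equation (4): sum of 1/log(degree) over common neighbors of x and y."""
    return sum(1.0 / math.log(len(g[z]))
               for z in g[x] & g[y] if len(g[z]) > 1)

def predict_links(g, threshold=0.5):
    """Score every unconnected pair in a community subgraph and
    suggest those whose AA score exceeds the threshold."""
    suggestions = []
    for x, y in combinations(sorted(g), 2):
        if y not in g[x]:  # only nonexisting links are candidates
            score = adamic_adar(g, x, y)
            if score > threshold:
                suggestions.append((x, y, score))
    return suggestions

# Hypothetical community subgraph produced by the second phase.
community = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3, 5}, 5: {4}}
for x, y, s in predict_links(community):
    print(x, y, round(s, 3))
```

With this toy subgraph, the pair (1, 4) gets the highest score because it has two common neighbors, while (1, 5) has none and is never suggested.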

Datasets and Evaluation Criteria
To evaluate the proposed method, we divided the edges of the graph into two categories: E_train and E_test. E_train contains the edges used for calculating the similarity score between each pair of nodes in the graph, and E_test contains the edges used to validate the proposed method. Based on this, we use the following data to show the effectiveness of the proposed method. The employed datasets are Astro-ph [26], Blogs [27], CiteSeer [28], Cora [26], and WebKB [28].

Experimental Results
The purpose of the experiments was to evaluate the proposed method (ECLP) against state-of-the-art methods over the introduced datasets in terms of the described criteria. The tests were performed on a system with an Intel Core i7 CPU and 16 GB RAM. The results of the proposed method (ECLP) were compared with the KI, CN, JC, AA, and GLP methods. It should be noted that in the deployed graphs, the edges are considered unweighted.
In order to evaluate our new method, we use fivefold cross-validation, which divides the data into five folds; each time, one fold is picked as test data and the rest is kept as train data. Then, based on the proposed method, the similarity scores are calculated for the nodes in the four training folds, and these scores are evaluated on the test fold, from which the accuracy is determined. This process is repeated for all five folds. Finally, the AUC is calculated based on the five accuracy values obtained in fivefold cross-validation, and this average accuracy is recorded in Table 3.
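The AUC used here can be read as the probability that a randomly chosen held-out (positive) test link receives a higher similarity score than a randomly chosen nonexistent link, with ties counted as one half. A minimal sketch with hypothetical scores (the score values are made up for illustration):

```python
def auc(positive_scores, negative_scores):
    """AUC via pairwise comparison: fraction of (positive, negative) pairs
    where the positive link scores higher; ties count as half a win."""
    wins = ties = 0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1
            elif p == n:
                ties += 1
    return (wins + 0.5 * ties) / (len(positive_scores) * len(negative_scores))

# Hypothetical similarity scores of held-out test links vs. sampled non-links.
print(auc([0.9, 0.8, 0.4], [0.3, 0.8, 0.1]))
```

An AUC of 0.5 corresponds to random guessing, and 1.0 to a predictor that ranks every true test link above every non-link; in practice, the negative pairs are usually sampled rather than enumerated exhaustively.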
As shown in Table 3, the proposed method works well on real-world datasets. The experimental results are evaluated in terms of the AUC metric. Table 3 indicates that our method is significantly improved compared to state-of-the-art methods. A comparison of columns two and three in Table 3 shows that ECLP outperforms the GLP algorithm on all datasets except one, and a comparison of ECLP to the methods of columns four to seven shows that ECLP significantly improved (by more than 4%) the prediction accuracy in terms of AUC. Figure 3 visualizes the quantitative results of Table 3 to better illustrate the differences between the aforementioned algorithms. Figure 3(a) represents the accuracy (AUC) values for a single dataset when different algorithms are applied. As shown in Figure 3, ECLP outperforms the other algorithms on four datasets (Blogs, CiteSeer, Cora, and WebKB). It is noteworthy that ECLP performs better than even GLP on these four datasets, especially on the CiteSeer dataset, where there is a big difference between the ECLP and GLP accuracies. Figure 4 shows the results in a different way, facilitating comparison of the algorithms' performance. It is evident that ECLP performs better than state-of-the-art algorithms such as GLP, AA, CN, JC, and KI in terms of the AUC criterion. Moreover, it can be seen that ECLP performs better on two datasets (Cora and CiteSeer) compared with the other two.
As shown in Figure 4, our proposed method had a lower standard deviation than the other algorithms except CN. This indicates that our proposed method is more robust.

Conclusion
This paper proposes a three-stage scheme that optimizes community detection to improve link prediction precision in social networks. In the first stage of the proposed method, we are faced with an initial graph, which should be divided into smaller dense subgraphs (communities). This is performed by applying ensemble node perception, which incorporates global information. Then, in the second phase, we applied BPSO to optimize the communities of the former phase and merge those communities whose merging improves the fitness function. The second phase preserves semilocal information of the network. The third phase computes the similarity between each pair of nodes by applying the Adamic/Adar similarity-based algorithm, and the edges with a high similarity score are considered as the predicted edges of the algorithm.
This phase enables ECLP to capture local information of the network. Moreover, our method considers the overlap between communities, which means a node may belong to multiple communities, while GLP uses the Louvain algorithm for community detection, which does not consider overlapping communities. Our proposed algorithm was evaluated and analyzed by applying it to different datasets, and the results indicated that it performs better than all rival algorithms.

Data Availability
The data used to support the findings of this study are available from the first author, Hasan Saeidinezhad, upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.