Layer Information Similarity Concerned Network Embedding



Introduction
In the past few decades, network embedding has achieved remarkable progress. The basic idea is to map each node into a low-dimensional space in which the network structure and properties are preserved effectively. In the early period, traditional models such as MDS [1], Isomap [2], LLE [3], and LE [4] were mainly based on dimensionality reduction techniques. These models are not suitable for large networks due to their computational complexity. As Word2Vec [5] came to play a vital role in natural language processing, random walk-based methods that regard nodes in the network as words were proposed, such as DeepWalk [6] and Node2Vec [7]. In recent years, with the continuous development of deep learning, SDNE [8], DNGR [9], and GCNs [10] have brought neural networks into network embedding models. The methods mentioned above are all designed for single-layer networks. Figure 1(a) shows an example of a single-layer network, in which there is only one type of relation. However, many complex real-world scenarios cannot be described by single-layer networks. For example, the same set of individuals may participate in Twitter, Facebook, or Weibo for different purposes. Interactions within each social network can be represented by a single-layer network, and each layer has a specific relationship and specific semantics. However, these single-layer networks do not operate in isolation; there are always connections between them. Such scenarios can instead be represented as a multiplex network, a multilayer network in which all layers share the same set of nodes. Each layer in a multiplex network represents a particular relationship among the nodes, and the structures of the layers are typically correlated.
Figure 1(b) illustrates an example of an undirected multiplex network: each layer has its own structure, while correlations also exist between layers. Unlike the single-layer network, it contains three relationships among the same set of nodes, each of which describes a distinct interaction. Such multiplex relationships cannot be captured by single-layer methods. Therefore, it is necessary to conduct in-depth research on multiplex network embedding.
Compared with the single-layer network, one of the challenges for multiplex network embedding is how to aggregate the diverse types of structure in different layers without destroying their unique properties. To solve this problem, MCGE [11], MANE [12], and MVNE [13] use tensor factorization to concurrently capture the main local structure and the correlations between different layers. MNE [14] and MGCN [15] define one common vector shared by all layers to capture the shared information, together with a low-dimensional node vector in each layer to capture that layer's unique properties. In addition to the common vector, CrossMNA [16] introduces a layer vector to extract the semantic meaning of each layer. One2Multi [17] uses one encoder to encode the most informative network, from which the shared information is extracted, and multiple decoders to reconstruct all layers and learn the specific structure of each layer. DMNE [18] and MrMine [19] take advantage of the links between subgraphs or communities to learn the cross-network relationships.
Each layer in the multiplex network is constructed from different semantics, which makes the structure of each layer different, and the varying relatedness between these semantics leads to diverse structural similarities between layers. For example, Figure 1(b) shows that layer2 and layer3 share more edges with each other than either shares with layer1; that is, the structures of layer2 and layer3 are more similar to each other than to that of layer1. Moreover, the similarity differs from one pair of layers to another, which causes network analysis results to diverge across layers. It has been shown that considering interlayer similarity can significantly improve the performance of link prediction [20] and community detection [21] in multiplex networks. Hence, it is an essential feature that should not be ignored in multiplex network embedding. However, although existing methods can obtain embedded representations of a multiplex network, most of them fail to consider the similarities between different layers.
To incorporate layer similarities when learning node vectors or layer vectors, we propose a novel model, Layer Information Similarity Concerned Network Embedding (LISCNE). Our model takes advantage of the common and local features in multiplex networks and exploits layer similarity at the same time.
Specifically, we first obtain node embeddings by concatenating the common vector of each node, which is shared by all layers, with the layer vector of each layer. Common vectors capture the characteristics shared across layers: all layers are merged into a new single-layer network, and the common vector of each node is trained on this new network. Layer vectors learn the overall semantics of each layer. Then, to model the layer similarity, we define an index that formalizes the similarity between different layers. Under the constraint of layer similarities, we force the layer vectors of more similar layers to be closer.
The major contributions are summarized as follows:
(i) After investigating the existing multiplex network embedding methods, we find that they consider the node connectivity among layers but ignore the inter-layer similarities.
(ii) We propose a novel Layer Information Similarity Concerned Network Embedding (LISCNE) model, which effectively exploits the overall and local structure in multiplex networks and combines the concept of layer vectors with layer similarity at the same time.
(iii) We conduct experiments to evaluate the proposed method using several real-world datasets on link prediction and node classification tasks. Compared with existing benchmark methods, LISCNE achieves better or comparable performance.

Related Work
In this section, we review related work from two main aspects, namely, single-layer network embedding and multiplex network embedding.

Single-Layer Network Embedding.
Under the assumption that nodes with more similar structure should have closer representation vectors, network embedding learns latent low-dimensional representations for the nodes or links in a network. Earlier studies [2, 3, 22-24] were mainly based on matrix factorization. Isomap [2] obtained the shortest path $d_{ij}$ between node i and node j by constructing a neighborhood graph with connectivity algorithms and then obtained the vector representation by minimizing a function of $(d_{ij} - \|u_i - u_j\|)^2$. GraRep [24] defined a node transition probability and preserved k-order proximity. Inspired by Word2Vec [5], new types of methods [6, 7, 25, 26] using the skip-gram model [27] have gradually emerged. The goal of the skip-gram model is to maximize the co-occurrence probability of a word and its context in a sentence. DeepWalk [6] regarded each vertex in the network as a word. It applied a Depth-First Sampling (DFS)-like strategy to obtain walk sequences through random walks and trained the skip-gram model on these sequences. Node2Vec [7] employed a biased random walk strategy when generating the walk sequences. It defined two parameters p and q to interpolate between BFS and DFS during random walks. Topo2Vec [26] used a greedy goal-based searching strategy to generate the node context and obtain the local and global topologically proximal nodes in a network. Because these random walk-based methods cannot model nonlinear structural information, methods based on deep neural networks [8-10, 28-30] have been proposed. Both SDNE [8] and DNGR [9] used deep autoencoders: SDNE used the encoder to preserve the first- and second-order proximity of nodes, while DNGR captured higher-order proximity by using a PPMI matrix derived from the probabilistic co-occurrence matrix produced by random surfing.
GCNs [10] iteratively aggregated previous node embeddings and their neighbor embeddings to learn the new node embeddings. VGAE [28] was an inference model parameterized by a two-layer GCN. Pedronette and Latecki [31] proposed rank-based self-training to improve the accuracy of GCNs on semisupervised classification tasks. Recently, some novel algorithms [32-34] in the field of contrastive self-supervised learning have yielded good results. The core idea is to measure the similarities of sample pairs in a representation space such that the similarity between positive samples is much greater than that between negative samples.
These models operate on single-layer networks. More discussion and methods for network embedding can be found in [35-38].

Multiplex Network Embedding.
To better represent the multiplex networks used to describe the real-world data, there also exist various works for multiplex network embedding.
MCGE [11] applied tensor factorization and defined a multiview kernel tensor to obtain common latent factors that capture the global structure information. Random walks have also been applied in multiplex network embedding [14, 19, 39-42]. MNE [14] learned two vectors for a node at the same time, i.e., a common vector shared by all layers and a lower-dimensional vector for each individual layer. Then, it introduced a transformation matrix to align these two vectors. PMNE [39] comprises three methods: network aggregation and result aggregation are essentially single-layer approaches, whereas the co-analysis method considers the interactions between layers and can traverse between layers with a probability r during a random walk. GATNE [43] proposed a unified framework to address the problem of embedding learning for attributed multiplex heterogeneous networks, and GATNE-T was a generalization of MNE [14] when training edge embeddings directly. MrMine [19] simultaneously learned the multinetwork representation at three resolutions (network, subgraph, and node) and further constructed cross-resolution contexts, including network-subgraph, subgraph-node, and node-node contexts. HMNE [44] defined a heuristic 3D interactive walk and sampled node sequences across layers. It preserved the cross-layer neighborhood of nodes and embedded the information of multitype relations into a unified embedding space.
MVE [45] learned a robust representation by promoting the collaboration of different layers, assigning different weights to the layers during voting. CrossMNA [16] defined a network vector extracting the semantic meaning of the network and an inter-vector reflecting the common features of the anchor nodes in different networks. Then, these two vectors were added to form an intra-vector, which preserved the specific structural feature of a node in its selected network. MGCN [46] extended GCN to multiplex networks and defined a general vector and a dimension-specific vector to capture the common and individual layer information. TCMGC [47] developed a multilayer GCN to capture the structure and multiview information. DMNE [18] used an encoder for all individual networks and regularized the cross-network embeddings through two types of loss functions that penalize embedding inconsistency. DMGI [48] was an unsupervised model based on DGI [49]. In an individual layer, it performed the DGI algorithm to get the relation-type specific embedding and then took advantage of the multiplexity of the network by introducing consensus regularization and multiheaded attention mechanisms. MEGAN [50] was a multiplex GAN that designed a multilayer generator, which models multilayer connectivity to generate fake samples, and a node pair discriminator, which forces the generator to fit the distribution of multilayer network connectivity more accurately. One2Multi [17] used the network with the most information as the input of the encoder to learn the shared information of all the networks and then used multiple decoders to reconstruct the multiplex network from the shared information.
All the single-layer models mentioned above are effective for single network embedding; however, they do not consider the correlations among the layers of a multiplex network. In addition, the GCN-based multiplex network embedding models only consider the local information in the network, while other models ignore the similarities between layers. Our model incorporates the similarities between the layers and simultaneously captures the local and global information in the network and the multiplex relationships between layers.

Notations and Problem Formulation
We begin with a formal definition of multiplex network, followed by the problem formulation. For the sake of clarity, the main notations are summarized in Table 1.
All layers share the same set of nodes V, and the nodes form diverse structures in each layer. The structure of layer l is represented by its edge set $\varepsilon_l$. We denote this multiplex network as $G = \{G_1, G_2, \ldots, G_L\}$, where $G_l = (V, \varepsilon_l)$. Given such a multiplex network with L layers, the goal of our work is to learn a low-dimensional embedding $Z_i^l \in \mathbb{R}^d$ for each node $v_i$ on each individual network $G_l$, where d is the dimension of the embedding. The learned representations can be used as features in a variety of applications such as node classification and visualization, relationship mining, and link prediction. In our experiments, we perform both link prediction and node classification tasks to verify the effectiveness of the learned embedding.

Layer Information Similarity Concerned Network Embedding
As the nodes in each layer of the multiplex network are the same, they share common information, and the same node may show similar features across layers. However, the structure among nodes in each layer is formed by different semantics, which leads to quite diverse local structures of a node in each layer, and the varying relatedness between different semantics also leads to diverse structural similarities between different layers. In this paper, we propose LISCNE, which models the common and local features in multiplex networks and exploits layer similarity at the same time. Figure 2 illustrates the framework of LISCNE for a three-layer multiplex network. The architecture contains three components. The first part models the common vector of each node, which is shared by the counterpart nodes in different layers. The second part learns the node embedding in each layer by integrating the common vector with the layer vector, which is introduced to capture the distinct semantic information of different networks. The third part trains the layer vectors with layer similarities. The embedding for node $v_i$ in layer l is defined as

$$Z_i^l = f(u_i, r_l),$$

where f is the map function that integrates the common vector and the layer vector to get the final node representation. In our model, we use concatenation as the map function. LISCNE specifies the relationship of different networks by the layer similarities, e.g., $S_{12}$ denotes the structural similarity index of network $G_1$ and network $G_2$. By adding layer similarities to the layer vector, the model can associate within-network and cross-network structure information.
Next, we will describe our model LISCNE in detail and introduce it in three parts: common feature modeling, learning node embedding in each layer, and integrating the similarity between layers.

Common Feature Modeling.
Table 1: Main notations.

| Symbol | Definition |
| --- | --- |
| $L$ | The number of layers in the multiplex network |
| $G_l$ | The network for layer l in the multiplex network |
| $N$ | The number of nodes |
| $V$ | The node set of the multiplex network |
| $\varepsilon_l$ | The edge set of the l-th network |
| $r_l$ | The layer vector for the l-th network |
| $u_i$ | The common vector for node $v_i$ |
| $U$ | The common vector matrix for all nodes |
| $Z_i^l$ | The embedding vector for node $v_i$ in network $G_l$ |
| $d_1$ | The dimension of the common vector |
| $d_2$ | The dimension of the layer vector |
| $d$ | The dimension of the final node vector |
| $S$ | The similarity matrix between networks |
| $S_{\alpha\beta}$ | The similarity between networks $G_\alpha$ and $G_\beta$ |

In this part, we learn the common features shared by the counterpart nodes among the different layers of the multiplex network. First, we use a network aggregation method to merge all layers into a new single-layer network in which multiple edges are not allowed. Specifically, we set the new network as $G_{new} = (V, \varepsilon_{new})$, and for every edge in each $\varepsilon_l \in \{\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_L\}$, we add that edge to $\varepsilon_{new}$. The process is shown in Figure 3. Then, over the obtained new network, we learn the common vector matrix U for all nodes. Taking node $v_i$ as an example, to get the common vector $u_i$, our goal is to maximize the probability of its neighbors' context in the given walk sequence:

$$\max \; P(v_{i-w}, \ldots, v_{i+w} \mid v_i),$$

where w is half of the window size and the neighbors of $v_i$ are $v_{i-w}, \ldots, v_{i+w}$. Based on the assumption of conditional independence and using the logarithmic probability, it can be further factorized as

$$\max \sum_{j=i-w,\, j \neq i}^{i+w} \log P(v_j \mid v_i),$$

where $P(v_j \mid v_i)$ can be defined with a softmax function as

$$P(v_j \mid v_i) = \frac{\exp(u_j \cdot u_i)}{\sum_{k=1}^{N} \exp(u_k \cdot u_i)},$$

where $u_i$ and $u_j$ are the common vectors for the input node $v_i$ and the context node $v_j$, respectively.
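The aggregation and context-sampling steps of this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the uniform (unbiased) random walk, and the toy three-layer network are all hypothetical, and the skip-gram training that would consume the resulting pairs is omitted.

```python
import random

def aggregate_layers(layer_edges):
    """Merge undirected edge lists into one edge set; duplicate edges collapse."""
    merged = set()
    for edges in layer_edges:
        for u, v in edges:
            merged.add((min(u, v), max(u, v)))
    return merged

def build_adjacency(edges):
    """Adjacency lists of the aggregated single-layer network."""
    adj = {}
    for u, v in sorted(edges):
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj

def random_walk(adj, start, length, rng):
    """Uniform random walk (an assumption; the walk strategy may differ)."""
    walk = [start]
    while len(walk) < length:
        neighbors = adj.get(walk[-1], [])
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk

def context_pairs(walk, w):
    """(input node, context node) pairs within a window of half-size w."""
    pairs = []
    for i, center in enumerate(walk):
        for j in range(max(0, i - w), min(len(walk), i + w + 1)):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs

# Hypothetical three-layer multiplex network over nodes 0..3.
layers = [
    [(0, 1), (1, 2)],
    [(1, 0), (2, 3)],
    [(0, 1), (1, 2), (2, 3)],
]
merged = aggregate_layers(layers)
adj = build_adjacency(merged)
walk = random_walk(adj, start=0, length=5, rng=random.Random(0))
pairs = context_pairs(walk, w=1)
```

The pairs produced here would be the training samples of a skip-gram model that learns one common vector per node.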

Learning Node Embedding in Each Layer.
As discussed before, each layer in a multiplex network carries distinct information. To capture the specific structure of an individual network, we introduce a layer vector that maps each layer into a latent space, i.e., the layer vector $r_l$ for the individual graph $G_l$. To capture the overall structure of the multiplex network and the semantics of each layer simultaneously, we obtain the node embedding for each layer by concatenating the common vector and the layer vector. For an arbitrary node $v_i$, the embedding in layer $G_l$ is defined as $V_i^l = u_i \,\|\, r_l$, where $\|$ denotes concatenation.
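To preserve the local neighborhoods of nodes in each layer, our goal is to maximize the probability of the specific neighbors' context in each individual layer:

$$\max \sum_{v_j \in C_l(v_i)} \log P(v_j \mid v_i; V_i^l),$$

where $C_l(v_i)$ is the context of node $v_i$ in layer $G_l$ and $P(v_j \mid v_i; V_i^l)$ can be defined as

$$P(v_j \mid v_i; V_i^l) = \frac{\exp(V_j^l \cdot V_i^l)}{\sum_{k=1}^{N} \exp(V_k^l \cdot V_i^l)},$$

where $V_i^l$ and $V_j^l$ are the node embeddings for the input node $v_i$ and the context node $v_j$, respectively.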
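As a small illustration of the layer-wise embedding, the sketch below concatenates a common vector with a layer vector and evaluates a context probability. It assumes a full softmax over all nodes, mirroring the form used for the common vectors; the function names and the toy vectors are hypothetical.

```python
import math

def node_embedding(common_vec, layer_vec):
    """Concatenate the common vector u_i with the layer vector r_l."""
    return list(common_vec) + list(layer_vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def context_probability(i, j, layer, common, layer_vecs):
    """Softmax probability of context node j given input node i in one layer."""
    v_i = node_embedding(common[i], layer_vecs[layer])
    scores = [dot(node_embedding(u, layer_vecs[layer]), v_i) for u in common]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    return exps[j] / sum(exps)

# Hypothetical values: three nodes with d_1 = 2, two layers with d_2 = 1.
common = [[0.1, 0.2], [0.0, -0.3], [0.5, 0.4]]
layer_vecs = [[1.0], [-2.0]]
probs_layer0 = [context_probability(0, j, 0, common, layer_vecs) for j in range(3)]
probs_layer1 = [context_probability(0, j, 1, common, layer_vecs) for j in range(3)]
```

One consequence of this particular sketch is worth noting: because every node in a layer shares the same layer vector, the term contributed by the layer vector is identical for all candidates and cancels in a full softmax, so the two layers above yield the same probabilities. Practical skip-gram implementations usually replace the full softmax with negative sampling, where the term does not cancel; whether that applies here is an assumption.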

Integrating the Similarity between Layer-Networks.
The layer vector learned above captures the distinct structural information within a layer. In a multiplex network, however, there is another essential characteristic: the similarity between layers, which varies from one pair of layers to another. Najari et al. [20] showed that incorporating the inter-layer similarities can improve link prediction performance.
Therefore, inspired by their study, we use layer similarities to enhance the embedding. We constrain each layer vector with the similarities between its layer and the other layers. By integrating the layer similarity, the layer vector captures cross-layer and within-layer information simultaneously.
First, in our model, we use the Global Overlap Rate (GOR) to measure the similarity among the layers of a multiplex network. In detail, given two layers α and β in a multiplex network, an overlapping edge is a node pair that is connected in both layers. The global overlap between layers α and β is denoted by $S_{\alpha\beta}$; it is computed from the total number of overlapping edges observed in layers α and β and from $|\varepsilon_\alpha|$, the total number of edges in layer α. The range of $S_{\alpha\beta}$ is [0, 1], and the higher the value, the more similar the layers. In particular, $S_{\alpha\beta} = 0$ means that there are no overlapping edges between the layers, indicating that the layers are not related, whereas $S_{\alpha\beta} = 1$ means that the layers are completely correlated. The similarity $S_{\alpha\beta}$ between layers α and β is the same as the similarity $S_{\beta\alpha}$ between layers β and α, which can also be seen from equation (8).
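A minimal sketch of an overlap-based layer similarity is given below. The normalization is an assumption: the overlap count is divided by the smaller of the two edge counts, which is one choice that yields a symmetric value in [0, 1]; the normalization used for GOR in the model may differ. The function names and the toy layers are hypothetical.

```python
def canonical_edges(edges):
    """Represent undirected edges as (smaller node, larger node) pairs."""
    return {(min(u, v), max(u, v)) for u, v in edges}

def overlap_similarity(edges_a, edges_b):
    """Overlapping edges divided by the smaller edge count (assumed normalization)."""
    a, b = canonical_edges(edges_a), canonical_edges(edges_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def similarity_matrix(layers):
    """Pairwise similarity matrix S over all layers."""
    count = len(layers)
    return [[overlap_similarity(layers[a], layers[b]) for b in range(count)]
            for a in range(count)]

# Hypothetical three-layer multiplex network over nodes 0..3.
layers = [
    [(0, 1), (1, 2)],
    [(1, 0), (2, 3)],
    [(0, 1), (1, 2), (2, 3)],
]
S = similarity_matrix(layers)
```

With this normalization, a layer whose edges all appear in another layer has similarity 1 with it, which is a stronger notion of "completely correlated" than identical edge sets.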
Having defined the similarity, the next problem is how to incorporate it into the model. To address this issue, we assume that the more similar the structures of two layers are, the closer their representations in the vector space should be. We impose this constraint by minimizing an objective function $I_3$ over the layer vectors (equation (9)), and we employ stochastic gradient descent to minimize $I_3$.
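One way to realize the idea that more similar layers should have closer layer vectors is a similarity-weighted squared-distance penalty. The sketch below assumes the form $\sum_{\alpha<\beta} S_{\alpha\beta}\,\|r_\alpha - r_\beta\|^2$, which is an illustrative choice and not necessarily the exact $I_3$ of the model; it also uses a full (non-stochastic) gradient step for brevity. All names and values are hypothetical.

```python
def similarity_loss(S, R):
    """Assumed penalty: sum over layer pairs of S[a][b] * ||R[a] - R[b]||^2."""
    loss = 0.0
    for a in range(len(R)):
        for b in range(a + 1, len(R)):
            loss += S[a][b] * sum((x - y) ** 2 for x, y in zip(R[a], R[b]))
    return loss

def gradient_step(S, R, lr):
    """One full gradient-descent step on the layer vectors R."""
    updated = []
    for a in range(len(R)):
        grad = [0.0] * len(R[a])
        for b in range(len(R)):
            if a != b:
                for k in range(len(grad)):
                    grad[k] += 2.0 * S[a][b] * (R[a][k] - R[b][k])
        updated.append([x - lr * g for x, g in zip(R[a], grad)])
    return updated

# Hypothetical similarity matrix and two-dimensional layer vectors.
S = [[1.0, 0.5, 1.0], [0.5, 1.0, 1.0], [1.0, 1.0, 1.0]]
R = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
before = similarity_loss(S, R)
R_next = gradient_step(S, R, lr=0.05)
after = similarity_loss(S, R_next)
```

Taken alone, such a penalty would pull all layer vectors together; it is meaningful only when optimized jointly with the layer-wise context objective, which keeps the layer vectors informative.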

Time Complexity Analysis.
Our loss function includes two components. The first part maximizes the probability of the specific neighbors' context in each individual layer to learn the node embedding in each layer; its main costs are generating random walk sequences and skip-gram training, as in an ordinary random walk algorithm. Assume that the number of nodes is N, the number of edges in each layer is M, the walk length is T, and the number of walk sequences per node is t. The complexity of sampling all sequences is $O(M) + O(N)$. In addition, the complexity of optimizing the $N \cdot t$ sequences with the skip-gram model is $O(N \log N)$. Therefore, the time complexity of learning the node embedding in each layer is $O(N \log N)$. The second part integrates the similarity between layer-networks. In this part, we exploit the structural similarity between each pair of layers, and the time complexity is $O(L(L-1))$. In real-world network data, the number of layers L is often very small, so the time complexity of this part is insignificant compared with that of the first part. The overall time complexity of our model is therefore $L \cdot (O(M) + O(N) + O(N \log N))$.

Experiments
In this section, we conduct experiments to validate the proposed LISCNE. To compare our model with state-of-the-art single-layer embedding methods and multiplex network embedding methods, we perform link prediction and node classification tasks on several datasets with different types of networks.

Datasets.
We employ five real-world multinetwork datasets from three different fields: social, co-authorship, and genetic. The basic statistical information of the datasets is presented in Table 2.
All these datasets are downloaded from the CoMuNe lab's website (https://comunelab.fbk.eu/data.php). The detailed descriptions are as follows:
(i) CKM [51]: this dataset was collected by asking physicians in four Illinois towns (Peoria, Bloomington, Quincy, and Galesburg) three questions, which define three types of relationships. Its ground truth is related to node labels; therefore, we also use this dataset to perform the node classification task.
(ii) PIERRE [52]: this dataset maps layers to different working tasks within the Pierre Auger Collaboration. Based on the keywords and contents of all submissions between 2010 and 2012, the multiplex network is divided into 16 layers.
(iii) ARABIDOPSIS [53,54]: based on BioGRID, this multiplex network considers genetic interactions of different types of organisms. The multiplex network used in the paper makes use of the following layers: direct interaction, physical association, additive genetic interaction defined by inequality, suppressive genetic interaction defined by inequality, synthetic genetic interaction defined by inequality, association, and colocalization.
(iv) MUS [53,54]: this dataset is also based on BioGRID. The layers in this dataset are physical association, association, direct interaction, colocalization, additive genetic interaction defined by inequality, synthetic genetic interaction defined by inequality, and suppressive genetic interaction defined by inequality.
(v) Arxiv [52]: built from papers with "networks" in the title or abstract published on arXiv up to May 2014, this dataset is divided into 13 layers corresponding to different categories and contains 14,489 nodes.

Baseline Models.
To show the performance of our model, the following five baseline models are implemented for comparison; they can be classified into single-layer network embedding and multiplex network embedding methods.
(i) DeepWalk [6]: this is a classic single-layer network embedding method, which applies random walks to obtain walk sequences and then trains the skip-gram model on the sequences.
(ii) Node2Vec [7]: this is also a typical single-layer network embedding model, which uses two parameters to control the traversal probability of the random walk.
(iii) PMNE [39]: this is a multiplex network embedding model that consists of three methods, where network aggregation and result aggregation simply merge all networks or the embedding results of all networks into one, while co-analysis takes the interaction among layers into account. PMNE_n, PMNE_r, and PMNE_c are used to denote network aggregation, result aggregation, and co-analysis, respectively.
(iv) MNE [14]: this is a multiplex network embedding model that defines two vectors of different dimensions for a node to capture the common information in the whole network and the specific features in a single layer, respectively.
(v) CrossMNA [16]: this is a multiplex network embedding model and also a model for network alignment. It simultaneously learns an inter-vector shared by the anchor nodes in different networks and a network vector for each single layer.

Experimental Setting.
For our model, we set both the common vector dimension and the layer vector dimension to 100, so that after concatenation the final node embedding dimension is 200. For fairness, we set the final embedding dimension of all compared models to 200. For DeepWalk, we set the number of walks per node to 20 and the walk length to 80. For Node2Vec, we empirically set p = 2 and q = 0.5. For PMNE, we follow the default setting in the original paper, which sets α, p, and q to 0.5. For MNE, we set the additional vector dimension to 10 and the common vector dimension to 200. For CrossMNA, according to the original paper, we set the dimension of the inter-layer vector to 200 and the dimension of the network vector to 100.

Evaluation Metrics.
We perform link prediction and node classification tasks to validate the effectiveness of our model. For the link prediction task, we run experiments in each layer and take the average as the final result. We randomly divide each dataset into a training set and a testing set. For each positive edge to be predicted, we also randomly sample an unconnected node pair as a negative edge. We adopt the ROC-AUC metric to evaluate model performance: the higher the AUC, the better the model performs. For the node classification task, we train on all data to obtain the node embeddings of individual layers with our model and the baseline models, average each node's embeddings over all layers, and then feed the averaged embeddings into a classifier. In our experiment, we use a logistic regression classifier and choose weighted F1 and weighted precision as evaluation metrics.
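The link prediction protocol can be illustrated with the following sketch, which samples unconnected node pairs as negative edges and computes ROC-AUC directly from pairwise score comparisons. The function names are hypothetical, and how an edge is scored from two node embeddings (for example, by their inner product) is an assumption left outside this sketch.

```python
import random

def sample_negative_edges(nodes, edges, k, rng):
    """Sample k distinct unconnected node pairs.

    Assumes at least k unconnected pairs exist; otherwise the loop never ends.
    """
    existing = {(min(u, v), max(u, v)) for u, v in edges}
    negatives = set()
    while len(negatives) < k:
        u, v = rng.sample(nodes, 2)
        pair = (min(u, v), max(u, v))
        if pair not in existing:
            negatives.add(pair)
    return sorted(negatives)

def roc_auc(pos_scores, neg_scores):
    """Probability that a positive edge scores above a negative one (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical layer with four nodes and three edges.
nodes = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3)]
negatives = sample_negative_edges(nodes, edges, k=3, rng=random.Random(0))
auc = roc_auc([0.9, 0.8], [0.1, 0.85])
```

In the setting described above, this computation would be repeated per layer and the AUC values averaged.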

Performance on Link Prediction.
For single-layer methods, we train the node embedding for each layer and use it to predict links in the corresponding layer. For the three PMNE methods, which take different strategies to aggregate the representations of all layers into one, we use the final node embedding to predict links in all layers. For all models, we average the AUC values of all relation types as the final result. In the experiments, we use five-fold cross-validation for all datasets. The results are shown in Table 3, from which we can draw the following observations:
(i) The proposed LISCNE model stably outperforms or achieves performance comparable with all the baseline methods. The results show that incorporating the layer similarity into the model does improve the performance.
(ii) The multiplex network models almost always perform better than the single-layer models. Meanwhile, the performance of the single-layer models varies considerably across datasets, e.g., in PIERRE dataset, DeepWalk and Node2Vec

Performance over Common Vector Embedding Dimension.
Figure 4 shows the performance of our model as the embedding dimension of the common vector increases. It can be clearly seen from the figure that the larger the dimension, the better the prediction effect. When the dimension reaches 10, the curve tends to stabilize. Here, for the sake of both accuracy and computational complexity, we set the common vector dimension $d_1$ to 100.

Performance on Node Classification.
In the node classification task, we choose the CKM dataset, which has reliable node labels, to conduct the experiment and take the companies as the classification label. The vectors fed into the classifier are the averages of each node's embeddings over the individual layers. For single-layer network methods, we train all nodes in each layer and average the node vectors over the layers. For MNE, CrossMNA, and our model, we likewise average the intra-vectors of each node over the individual layers. Then, the node representations and the corresponding node labels are divided into training and testing sets to train the classifier. In our experiment, we use a logistic regression classifier and evaluate the classification performance with accuracy, precision, and F1, which are defined as follows:

$$\text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN}, \quad \text{precision} = \frac{TP}{TP + FP}, \quad F1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}},$$

where $\text{recall} = TP / (TP + FN)$. As shown in Figure 5, the results demonstrate the effectiveness of our model: LISCNE provides the best performance in terms of F1 and precision and achieves accuracy comparable with PMNE_n and PMNE_c. However, the advantage of multiplex network embedding models such as CrossMNA on the node classification task is not obvious. This may be because, for every model, the input to the classifier is the average of the node embeddings over all layers. Averaging acts somewhat like aggregation and retains mainly the information shared by all layers.
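For the multi-class case, the weighted variants of precision and F1 average the per-class values with weights proportional to class frequency. The sketch below illustrates this under the standard definitions; the function name and the toy labels are hypothetical, and it is not tied to any particular library.

```python
from collections import Counter

def weighted_precision_f1(y_true, y_pred):
    """Support-weighted precision and F1 over all classes present in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    precision_w = 0.0
    f1_w = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precision_w += count / total * precision
        f1_w += count / total * f1
    return precision_w, f1_w

# Hypothetical labels for four nodes.
precision_w, f1_w = weighted_precision_f1([0, 0, 1, 1], [0, 1, 1, 1])
```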

Conclusion and Future Work
In this paper, we propose an effective method called LISCNE for multiplex network embedding. LISCNE defines a common vector for all counterpart nodes in the multiplex network and also introduces a layer vector for each layer. Moreover, when learning layer vectors, it incorporates the layer similarities to simultaneously capture intra-layer information and cross-network information. We have performed link prediction and node classification tasks to test LISCNE and conducted extensive experiments to verify the effectiveness of our proposed model. This model is applicable to aligned networks and to networks in which one node in a network is connected to only one node in another network.
Unfortunately, this kind of network cannot cover many real-world scenarios, e.g., the association between a collaboration graph of researchers and a citation graph of papers, where an author can cite papers on multiple topics. In the future, we will extend our model to more general networks, for example, networks in which one node in a network is connected to several nodes in another network with different weights.

Data Availability
The datasets used to support the results of this study are available from https://comunelab.fbk.eu/data.php.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.