Multiplex Network Embedding Model with High-Order Node Dependence

Multiplex networks have been widely used in information diffusion, social networks, transport, and biology multiomics. They contain multiple types of relations between nodes, in which each type of relation is intuitively modeled as one layer. In the real world, the formation of a type of relation may depend on only some attribute elements of nodes. Most existing multiplex network embedding methods focus only on intralayer and interlayer structural information while neglecting this dependence between node attributes and the topology of each layer. Attributes that are irrelevant to the network structure could affect the embedding quality of multiplex networks. To address this problem, we propose a novel multiplex network embedding model with high-order node dependence, called HMNE. HMNE simultaneously considers three properties: (1) intralayer high-order proximity of nodes, (2) interlayer dependence in respect of nodes, and (3) the dependence between node attributes and the topology of each layer. In the intralayer embedding phase, we present a symmetric graph convolution-deconvolution model to embed high-order proximity information as the intralayer embedding of nodes in an unsupervised manner. In the interlayer embedding phase, we estimate the local structural complementarity of nodes as an embedding constraint of interlayer dependence. Through these two phases, we can achieve the disentangled representation of node attributes, which can be treated as fine-grained semantic dependence on the topology of each layer. In the restructure phase of node attributes, we perform a linear fusion of attribute disentangled representations for each node as a reconstruction of original attributes. Extensive experiments have been conducted on six real-world networks. The experimental results demonstrate that the proposed model outperforms the state-of-the-art methods in cross-domain link prediction and shared community detection tasks.


Introduction
The abundant relations and views between entities can be collected from various sources or scenarios, allowing a slew of problems to be better solved in different application domains, e.g., information diffusion [1], social network analysis [2], intelligent transportation [3], biomedicine, and ecology [4,5]. Taken together, these data may give a more accurate and nuanced picture of network structure than any individual network alone [6]. Taking social networks as an example, different online social networks show different views and behavior patterns of people. A user connects with friends on Facebook or WeChat but uses Twitter or Weibo to follow people who interest him/her. Though different online social networks present distinct views and aspects of the social behavior of the same user, abundant user features and social information can facilitate the construction of a more accurate and nuanced user profile. Therefore, these multiple sources and views of network data are worth exploring because they often contain complementary information that improves the quality of analysis results [7].
Intuitively, modeling the information fusion problem of nodes as a feature fusion problem is a straightforward approach. Based on the fused features, we can further mine the network data for node classification, link prediction, node clustering, and visualization. Multiple-relation or multiple-view network data are naturally modeled as a multiplex network (also known as multidimensional, multiview, or multilayer networks) [8][9][10][11][12] in which the same set of nodes is connected by different types of relations. Different from a single network, multiplex networks reflect more complex topological properties. Multiplex networks can not only present the intralayer dependence between nodes but also model the interlayer network dependence well. The analysis of multiplex networks needs to consider not only the interdependence or interaction between nodes within and across layers but also the dependence between node attributes and the topological structure of nodes. In this paper, the high-order node dependence of multiplex networks is defined as intralayer dependence between nodes, interlayer dependence in respect of anchor nodes, and the dependence between node attributes and the topology of each layer. In a multiplex network, the information fusion of multiple layers of nodes is a significant fundamental issue for the joint analysis of networks. A multiplex network, as shown in the middle of Figure 1, is composed of three social networks, which are Douban (https://www.douban.com/), LinkedIn (https://www.linkedin.com/), and Weibo (https://weibo.com/). These three social networks are geared towards different social scenarios; Douban provides book and music services, LinkedIn serves professional networking, and Weibo is geared towards entertainment services. Multiplex network representation learning (also known as multiplex network embedding) is an effective method to analyze and mine the network.
It can project the node (or network) into a continuous low-dimensional space. In this paper, we are motivated to focus on multiplex network representation learning considering the high-order dependence.
Recently, existing methods have achieved excellent performance in capturing the intralayer dependence between nodes. However, few studies have comprehensively focused on the properties unique to multiplex networks. The first challenge is preserving high-order proximity information of nodes. Some state-of-the-art models based on the graph neural network (GNN) [10,13,14] take into account both intralayer and interlayer dependencies of nodes. However, due to the oversmoothing problem of GNN models [15], such methods cannot effectively preserve high-order proximity information. The second challenge is preserving the interlayer dependence property of multiplex networks. Layers with strong interlayer dependence have similar local structure characteristics, while those with weak interlayer dependence show obvious differences in the local topology [16]. From Figure 1, we can see that the nodes in the Douban layer and the Weibo layer have similar local structures. This indicates the interlayer dependence property of nodes in these two layers.
This dependency cannot be preserved by extended random walk-based methods [17][18][19] or GNN-based methods [20][21][22]. The extended random walk-based representation learning methods generate node sequences through cross-layer sampling. In the node sampling process, most of them use a random strategy for cross-layer sampling, which ignores the similarity between layers. For GNN-based methods, nodes are embedded independently in each layer, and the node embeddings of the layers are concatenated at a later stage. Such embedding and fusion processes introduce repetitive and redundant information. However, the LinkedIn layer is dissimilar to the other two layers. In this situation, the fused node embeddings obtained by methods [23][24][25][26] that assume information sharing between layers are inaccurate. The third challenge is preserving the dependence between node attributes and the topology of each layer. Previous studies also ignore the interaction of node attributes with the topology of each layer. Figure 1 illustrates this important property: different social scenarios depend on different attribute information of the user. The formation of friendship in the Douban network mainly depends on the user's preference for music and books. The formation of following relationships in LinkedIn mainly depends on attributes such as the user's job and education level. The formation of relationships in Weibo mainly depends on multiple attributes of the user (books, sports, and music) besides job and education. In support of the dependence between node attributes and the network structure, the interaction between them has been shown in several cases [27][28][29].
Therefore, the embedding of multiplex networks contains not only dependence information between nodes in each layer (intralayer dependence) but also local structure similarity information (interlayer dependence) and dependence between node attributes and the topology of each layer (attribute dependence).
In light of this, we propose a novel hierarchical representation learning model for multiplex networks with node attributes, called HMNE. We propose a symmetric graph convolution-deconvolution (GCD) method with multiple convolution layers to embed the intralayer adjacency information of a node as a low-dimensional dense vector in an unsupervised manner. The graph convolution module (GCM) preserves high-order proximity information, and the graph deconvolution module (GDM) serves as an embedding restriction to alleviate the oversmoothing problem of the GCM. To preserve interlayer local dependence information, inspired by Deep Graph Infomax [30], we use the similarity between the representations obtained by the multilayer convolution and the entire layer embedding as the estimation of complementary information. We fit this estimation to quantify the local structural complementarity of nodes. For the dependence between attributes and the topology of the layer where the node is located, we treat the output of the graph deconvolution module as the disentangled representation of node attributes. Each disentangled representation is the result of the interaction between node attributes and the topology of each layer. The main contributions of this paper are summarized as follows:
(i) We propose a symmetric graph convolution-deconvolution neural network model to achieve intralayer node embedding, which is an unsupervised and general representation learning method. This method can not only flexibly adjust the number of hidden layers to capture high-order structural information but also avoid the oversmoothing problem.
(ii) We present a method to estimate interlayer complementary information. This method can measure the interlayer dependence property [9,10,31] in respect of the topology of the layer where the node is located and constrain the intralayer embedding.
(iii) We design a disentangled representation learning architecture to capture the dependence between node attributes and their local topology. The graph deconvolution component is used to select attribute fragments associated with the semantics of each layer. We use a linear layer to restructure the original node attributes.
(iv) Extensive evaluations on real-world datasets have been conducted, and the experimental results demonstrate the superiority of the proposed HMNE model over the state-of-the-art models.
The rest of the paper is organized as follows. Section 2 describes related work. Section 3 introduces the definitions of the data model we use, the problem formulation, and preliminary knowledge. Section 4 presents HMNE's core modules. Section 5 shows the experimental results. Finally, the summary and outlook are given in Section 6.

Related Work
In this section, to distinguish the two settings, we call the traditional representation learning of one network single-layer network embedding and the embedding of multiple networks multiplex network embedding. For the latter, we introduce related work on joint embedding and coordinated embedding. We first describe the ideas of network embedding for a single-layer network. Then, we introduce related work on multiplex network (mainly multiview, multirelation, multidimensional, and multilayer network) embedding methods. Finally, we summarize the shortcomings of these related works and their similarities to and differences from the proposed model.

Single-Layer Network Embedding
Embedding techniques based on random walks have been proposed to obtain node representations. DeepWalk [32] is the first algorithm based on random walks to learn node representations. Based on breadth-first search and depth-first search, node2vec [33] was proposed to replace the node sampling strategy of DeepWalk. Both algorithms are traditional single-layer network embedding methods. Gu et al. [34] proposed an approach based on the open-flow network model to reveal the underlying flow structure and its hidden metric space for different random walk strategies on networks. It shows that the essence of network embedding by random walk is the latent metric defined on the open-flow network. To learn representations of multirelation heterogeneous information networks, Dong et al. [35] proposed a strategy for random walk sampling from heterogeneous networks, where the random walk is restricted to transitions between particular types of nodes. This strategy allows many methods to be applied to heterogeneous graphs and complements the idea of taking type-specific encoders and decoders into account. Ribeiro et al. [36] presented struc2vec, a novel and flexible framework for learning latent representations of the structural identity of nodes. The framework uses a hierarchy to measure node similarity at different scales and constructs a multilayer graph to encode structural similarities and generate a structural context for nodes.

Graph Neural Network-Based Methods.
Kipf and Welling [37] introduced the variational graph autoencoder (VGAE), a framework for unsupervised learning on graph-structured data based on the variational autoencoder (VAE). This model makes use of latent variables and is capable of learning interpretable latent representations for undirected graphs. Hamilton et al. [38] proposed GraphSAGE, which uses a two-layer deep neural architecture. In each convolution layer, a node computes its representation as an aggregation of its neighbors' representations (from the previous layer). In addition, to achieve unsupervised embedding, the parameters of the aggregation functions are learned using a loss function similar to that of DeepWalk. GraphSAGE is incapable of selective neighbor sampling and lacks memory of known nodes that have been trained. To address these problems, Luo and Zhuo [39] proposed an unsupervised method that samples neighborhood information attended by co-occurring structures and optimizes a trainable global bias as a representation expectation for each node in the given graph. Velickovic et al. [30] presented Deep Graph Infomax (DGI), a general approach for learning node representations within graph-structured data in an unsupervised manner. DGI relies on maximizing mutual information between patch representations and corresponding high-level summaries of graphs, both derived using established graph convolution network architectures. Li et al. [29] proposed a principled unsupervised feature selection framework, ADAPT, to find informative features that can be used to regenerate the observed links and further characterize the adaptive neighborhood structure of the network. Yu et al. [15] proposed KS2L, a novel graph Knowledge distillation regularized Self-Supervised Learning framework, with two complementary regularization modules, for intra- and cross-model graph knowledge distillation. Xiao et al. proposed three rumor propagation models based on evolutionary game and antirumor [40], data enhancement [41], and representation learning [42]. They showed that rumors are influenced not only by antirumor information but also by user behavior and psychological factors. They also studied the user's network structure and historical behavior characteristics in the rumor topic communication space in social networks and predicted user behavior in the next time slice based on the current time slice data. At the same time, they introduced evolutionary game theory and considered the internal and external factors that affect user behavior within rumor propagation.

Figure 1: The illustration of our data structure and the dependence between node attributes and each layer's topology, with a three-layer multiplex network as an example. The first part on the left presents node attributes. The second part in the middle shows the multiplex network data model. The last part shows the different dependence between node attributes and each layer.

Multiplex Network Embedding.
The goal of multiplex network embedding methods is to achieve the information fusion of multiple features of networks. These methods can be divided into joint representation learning and coordinated representation learning [43] (in Figure 2 of [44], an illustration of coordinated and joint representation learning is presented).

Joint Representation Learning.
Zhang et al. [24] proposed a scalable multiplex network embedding (MNE) method, which assumes that the same nodes in multiple networks preserve certain common features as well as unique features of each layer. Thus, the common and unique embeddings of nodes in each layer are learned separately by the DeepWalk algorithm. Ma et al. [25] implemented node embedding for multidimensional networks with hierarchical structure. They simply added up the node embeddings in multiple dimensions as the fusion feature of nodes in multiple networks. Matsuno and Murata [26] presented a multilayer network embedding method (MELL) that captures and characterizes each layer's connectivity. The method utilizes the overall structure to consider the similar or complementary structure of the layers. Finally, the fusion features of nodes in multiplex networks are obtained by combining the node embedding in each layer with layer vectors. Cen et al. [9] focused on embedding learning for attributed multiplex heterogeneous networks, where different types of nodes might be linked by multiple different types of edges, and each node is associated with a set of different attributes. Their model, GATNE, splits the overall node embedding into three parts: base embedding, edge embedding, and attribute embedding. GATNE-T contains only the first two parts. Zhao et al. [45] proposed a novel and principled approach: a multiview adversarial completion model (MV-ACM). Each relation space is characterized in a single viewpoint, which makes it possible to use the topological structural information in each view. Yuan et al. [46] proposed a novel multiview network embedding model with node similarity ensembles. Node similarities are first selected to maximize the represented network information while minimizing the information redundancy. For each combination of the selected node similarities, a latent space is generated as a view of the network.

Coordinated Representation Learning.
In some cases, graphs have multiple "layers" that contain copies of the same nodes. It can be beneficial to share information across layers so that a node's embedding in one layer can be informed by its embedding in other layers. Qu et al. [47] proposed an attention-based method (MVE) to learn the weights of views for different nodes with a few labeled data. MVE can obtain robust node representations across different views by a voting strategy. Recently, Liu et al. [17] extended standard graph mining into the area of multilayer networks. The proposed methods ("network aggregation," "results' aggregation," and "layer coanalysis") can project a multilayer network into a continuous vector space. Zitnik and Leskovec [18] proposed the OhmNet framework to learn the features of proteins in different tissues. They represented each tissue as a network, where nodes represent proteins. Individual tissue networks act as layers in a multilayer network, where a hierarchy is used to model dependencies between the layers (i.e., tissues). Schlichtkrull et al. [20] introduced relational graph convolution networks (R-GCNs) and applied them to two standard knowledge base completion tasks: link prediction (recovery of missing facts, i.e., subject-predicate-object triples) and entity classification (recovery of missing entity attributes). Zhiyuli et al. [48] proposed highly scalable node embedding for link prediction in large-scale networks. The method learns node pairs' co-occurrence features to embed a node into a vector by a damping-based random walk algorithm. In the node sampling process of these existing methods, there is a bias problem: samples are trapped in a local structure. In addition, cross-layer sampling depends heavily on fixed parameters, which makes it inflexible. Sun et al. [49] presented the MNGAN framework for multiview network embedding with a generative adversarial network, aimed at preserving the information from the individual network views while accounting for connectivity across different views. Wei et al. [50] proposed an attributed node random walk framework, which can not only incorporate both topology and attribute information flexibly but also easily deal with missing data, and which can be applied to large networks. For the multiple-network alignment problem, Chu et al. [51] proposed a cross-network embedding method (CrossMNA). It defines two categories of embedding vectors for each node: the intervector and the intravector. The idea of CrossMNA is the same as that of MNE. The authors argue that the intravector contains both the commonness among counterparts and the specific local connections in its selected network due to the semantics. Park et al. [10] presented a simple yet effective unsupervised network embedding method for the attributed multiplex network called DMGI, inspired by Deep Graph Infomax (DGI), which maximizes the mutual information between local patches of a graph and the global representation of the entire graph. Vashishth et al. [21] proposed a novel graph convolutional framework (COMPGCN) which jointly embeds both nodes and relations in a relational graph. COMPGCN leverages a variety of entity-relation composition operations from knowledge graph embedding techniques and scales with the number of relations. Yu et al. [22] proposed a novel GEneralized Multirelational Graph Convolutional Networks framework, which combines the power of GCNs in graph-based belief propagation with the strengths of advanced knowledge-based embedding methods.
In summary, with respect to the challenges presented in this paper, single-layer network embedding methods cannot preserve interlayer dependence information. The joint representation learning methods for multiplex networks assume that nodes have shared embeddings across layers, and information sharing and transfer are realized through these embeddings. However, different levels of dependence between layers can invalidate this assumption (please refer to Figure 3 of [44]). The existing coordinated representation learning methods neglect the dependence between node attributes and their local topology. Aggregating these coarse-grained attributes in the graph neural network can introduce noise and affect the performance of the model. To fill this gap, we propose a hierarchical multiplex network embedding (HMNE) model with high-order node dependence. The specific implementation is described in detail in Section 4.

Data and Problem Formulations
In this section, we describe related symbols, concepts, and definitions in detail. Our data model's basic concepts are introduced in Section 3.1. Then, we formalize a generalized node embedding problem of multiplex networks in Section 3.2. The important notations are summarized in Table 1.

Data Model.
Network data from multiple views and sources are most appropriately represented as multiplex networks. As shown in Figure 1, the three layers of a multiplex network can be derived from three modalities of data, such as a social network, a semantic relation network, and a co-occurrence network. Multiplex networks can not only express intralayer links but also model the dependencies and interactions between networks well [44]. The detailed definitions of multiplex networks are as follows.

Definition 1 (multiplex network architecture). Given a multiplex network of $N$ nodes with the set of layers $L$, in which each node can interact with the other ones through $|L|$ kinds of relations with $|L| \geq 2$, we denote an aligned multiplex network $G = \{G_l(V, E_l), l \in L\}$, which is made up of $|L|$ layers with $N = |V|$ nodes and $E = |\bigcup_{l \in L} E_l|$ edges.

Each layer in a multiplex network has the same node set and a different edge set, as shown in the middle part of Figure 1. Let $i, j \in V$ be two nodes. $i^l$ denotes node $i$ at layer $l$, and $e^l_{i,j} \in E_l$ denotes the edge linking $i^l$ and $j^l$ in layer $l$. $i^l$ and $i^{l'}$ are the duplicates of the same node $i$ in different layers. We assume that nodes $i^l$ and $j^{l'}$ can be implicitly linked across layers $l$ and $l'$ by the duplicate of $i$ in layer $l'$ and the edge $e^{l'}_{i,j}$. Figure 1 shows an illustrative example of a multiplex network with $|L| = 3$ layers (i.e., $L = \{\text{Douban}, \text{LinkedIn}, \text{Weibo}\}$) and a target node User. The dotted line represents an anchor link. $e^{\text{LinkedIn}}_{\text{User},A}$ is an edge between node User and node $A$ in layer LinkedIn. $e^{\text{Douban},\text{LinkedIn}}_{\text{User},A}$ is a cross-layer link between node $\text{User}^{\text{Douban}}$ and node $A^{\text{LinkedIn}}$ through an anchor link.
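As a rough illustration of this data model, the sketch below stores an aligned multiplex network as one shared node set plus one edge set per layer. The class name `MultiplexNetwork`, its method names, and the node `B` are hypothetical and chosen only for this example; the cross-layer check assumes, as in the text, that the anchor link of a node is implicit because every layer shares the same node set.

```python
from dataclasses import dataclass, field


@dataclass
class MultiplexNetwork:
    """Aligned multiplex network: one shared node set V, one edge set E_l per layer."""
    nodes: set = field(default_factory=set)
    layers: dict = field(default_factory=dict)  # layer name -> set of frozenset({i, j})

    def add_edge(self, layer, i, j):
        # Intralayer edge e^l_{i,j}; both endpoints join the shared node set V.
        self.nodes.update((i, j))
        self.layers.setdefault(layer, set()).add(frozenset((i, j)))

    def has_edge(self, layer, i, j):
        return frozenset((i, j)) in self.layers.get(layer, set())

    def neighbors(self, layer, i):
        # Neighbors of node i inside a single layer.
        edges = self.layers.get(layer, set())
        return {n for e in edges if i in e for n in e if n != i}

    def cross_layer_link(self, layer_from, layer_to, i, j):
        # i^{layer_from} reaches j^{layer_to} by following the (implicit) anchor
        # link of i into layer_to and then the intralayer edge e^{layer_to}_{i,j}.
        return self.has_edge(layer_to, i, j)


g = MultiplexNetwork()
g.add_edge("Douban", "User", "B")    # "B" is a made-up neighbor
g.add_edge("LinkedIn", "User", "A")
g.add_edge("Weibo", "User", "B")
```

Here `g.cross_layer_link("Douban", "LinkedIn", "User", "A")` corresponds to the cross-layer link between User in Douban and A in LinkedIn described above.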

Problem Formulation
Definition 2 (multiplex network representation learning). Suppose the methods make use of a real-valued superadjacency matrix $A \in \mathbb{R}^{(N \times |L|) \times (N \times |L|)}$ (e.g., representing text or metadata associated with nodes). Node embedding aims at learning a map function $f: A_i \mapsto h_i$ from $A_i$ to a $d$-dimensional representation $h_i$ of node $i$, where $A_i$ is the group of vectors of node $i$ in the superadjacency matrix of $G$; it can also be understood as being composed of the adjacency matrices of multiple layers. $H$ is a $d$-dimensional vector/tensor, and $d \ll N$. For coordinated representation learning, $h_i$ is a vector for node $i$. For joint representation learning, $h_i$ is a tensor for node $i$. Notice that all the aforementioned definitions can be easily extended to the case of weighted networks. We only focus on coordinated representation learning in this paper.

Proposed Model
In this section, we introduce the overall model of our HMNE by addressing the three major challenges mentioned in Section 1:
(1) Preserving high-order proximity information of nodes: as shown in Figure 2, a symmetric graph convolution-deconvolution network (SGCD) model is designed to solve the oversmoothing problem of the traditional GCN. SGCD includes the graph convolution component (GCC) and the graph deconvolution component (GDC). We formulate a restriction constraint for the GDC to restructure the original input feature of the GCC. The output feature $x_i^k$ of the GCC with $K$ (graph) convolution layers in respect of node $i$ is fed into the GDC to reconstruct the original input feature $x_i$. Even if many graph convolution layers are added to the GCC, the oversmoothing problem can be avoided because of this reconstruction constraint. Therefore, we can conveniently preserve high-order proximity information of nodes by increasing the number of graph convolution layers.
(2) Preserving the interlayer dependence property of multiplex networks: as shown in Figure 2, there are two major components to capture the interlayer dependence property of multiplex networks. We first utilize a structural similarity metric to measure the difference between the target layer $l$ and another layer $l'$ in respect of node $i$. The result serves as a structural complementary information estimation $P_{true}$. Then, the similarity measure between the embedding $h_i^l$ of node $i$ in the target layer $l$ and the global embedding $H^{l'}$ of the other layer $l'$ serves as the complementary information $P_{pred}$ in respect of node $i$. By minimizing the difference between $P_{pred}$ and $P_{true}$, the learned embedding of node $i$ can preserve the dependency property between layers.
(3) Preserving the dependence of node attributes on the topology of each layer: as shown in Figure 2, the input feature is the attribute vector $x_i$ of node $i$. We utilize the idea of disentanglement learning to disentangle $x_i$ into $|L|$ attribute subsets. These attribute subsets, which depend on the topology of each layer, carry different semantic information. The three main processes are as follows: firstly, we use $x_i$ of node $i$ as the input of the GCC. Then, the embedding $h_i^l$ of node $i$ with attribute information and structure information is obtained by the GCC in layer $l$ of the multiplex network. Finally, the disentangled representations of the node's attributes are the output of the GDC in each layer. In the GCC, the attributes associated with the topology of each layer are preserved. In the GDC, the structure information is disentangled from $h_i^l$.

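Table 1: Important notations.
Notation          Explanation
$G$               A multiplex network
$V, E_\alpha$     The sets of nodes and edges in layer $\alpha$, respectively
$G_l, A_l$        The network/adjacency matrix of layer $l$, respectively
$N, E_l$          The node number/edge number of layer $l$, respectively

In graph signal processing, a feature vector of node $i$ of a graph is a graph signal $x_i \in \mathbb{R}^N$. The graph Fourier transform of a signal $x_i$ is defined as $\mathcal{F}(x_i) = U^T x_i$, and the inverse graph Fourier transform is defined as $\mathcal{F}^{-1}(\hat{x}_i) = U \hat{x}_i$, where $U$ is the matrix of eigenvectors of the normalized graph Laplacian and $\hat{x}_i$ represents the resulting signal from the graph Fourier transform. The graph convolution of the input signal $x_i$ with a convolution kernel (filter) $g$ is defined as
$$x_i *_G g = \mathcal{F}^{-1}(\mathcal{F}(x_i) \odot \mathcal{F}(g)) = U (U^T x_i \odot U^T g),$$
where $\odot$ denotes the Hadamard product. If we denote a filter as $g_\theta = \mathrm{diag}(U^T g)$, then the graph convolution is simplified as
$$x_i *_G g_\theta = U g_\theta U^T x_i.$$
The graph convolution component from [52] limits the layerwise convolution operation to alleviate the problem of overfitting on local neighborhood structures for graphs with very wide node degree distributions. The equation simplifies to
$$x_i *_G g_\theta \approx \theta_0' x_i - \theta_1' D^{-1/2} A D^{-1/2} x_i.$$
After constraining the number of parameters with $\theta = \theta_0' = -\theta_1'$, we can obtain the following expression:
$$x_i *_G g_\theta \approx \theta (I_N + D^{-1/2} A D^{-1/2}) x_i.$$
Kipf and Welling [52] introduced the renormalization trick
$$I_N + D^{-1/2} A D^{-1/2} \rightarrow \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}, \quad \tilde{A} = A + I_N, \quad \tilde{D}_{ii} = \sum_j \tilde{A}_{ij}.$$
Finally, we treat $\Theta$ as a convolution kernel (a matrix of filter parameters) and obtain a general definition of graph convolution as follows:
$$Z = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X \Theta. \quad (5)$$
To express the following sections more clearly, we denote $Z^l$ as the node embedding of layer $l$ of a multiplex network $G$ and $Z^{(k)}$ as the node embedding output of the $k$-th layer of the graph convolution neural network.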
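The renormalized propagation rule of Kipf and Welling [52] that this derivation arrives at, $Z = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X \Theta$, can be sketched in a few lines. The following is a minimal pure-Python illustration on dense lists, not the authors' implementation; the function names and the three-node path graph are assumptions made for the example, and no nonlinearity is applied.

```python
import math


def normalized_adjacency(adj):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a dense symmetric 0/1 adjacency matrix."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]  # always >= 1 because of the self-loop
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain dense matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def graph_conv(adj, x, theta):
    """One linear propagation step Z = A_hat X Theta."""
    return matmul(matmul(normalized_adjacency(adj), x), theta)


# Path graph 0 - 1 - 2 with a one-dimensional signal placed on node 0.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
x = [[1.0], [0.0], [0.0]]
theta = [[1.0]]  # identity filter, so Z shows the pure propagation effect
z = graph_conv(adj, x, theta)
```

After one step the signal has reached node 1 but not node 2, which is why capturing higher-order proximity requires stacking several such layers.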

Graph Deconvolution.
To capture the high-order proximity information of the nodes, we can simply stack multiple convolution layers as our HMNE's graph convolution component (GCC) based on equation (5). However, previous studies showed that graph convolution is a type of Laplacian smoothing. They proved that, after repeatedly applying Laplacian smoothing many times, the features of the nodes in a (connected) graph converge to similar values. To avoid this problem and capture the high-order proximity of nodes, we design a graph deconvolution component (GDC). We first take the output $Z^{(k)}$ of the $k$ stacked convolution layers as the input of the GDC. Then, analogous to the definition of deconvolution in the field of computer vision, according to equation (5), a graph deconvolution layer with a deconvolution kernel $\Theta_d$ is defined as
where $A \in \mathbb{R}^{N \times N}$ is an adjacency matrix, $\tilde{A} = A + I_N$, $\tilde{D}$ is a degree matrix, and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. The embedding of nodes $Z^{(k)}$ is an output of the $k$-th layer of the graph deconvolution neural network.
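The oversmoothing effect that motivates the GDC can be observed numerically. The sketch below repeatedly applies a smoothing step to a scalar feature on a small connected graph. It is only an illustration: it assumes the row-normalized operator $\tilde{D}^{-1} \tilde{A}$ rather than the symmetric normalization of equation (5), because with row normalization the limit is exactly a constant vector (with symmetric normalization the limit is instead proportional to the square root of the self-loop-augmented degree). The function names and the four-node path graph are made up for the example.

```python
def smooth(adj, x, steps):
    """Apply x <- D~^{-1} (A + I) x `steps` times; x holds one scalar per node."""
    n = len(adj)
    for _ in range(steps):
        x = [
            (x[i] + sum(adj[i][j] * x[j] for j in range(n))) / (1 + sum(adj[i]))
            for i in range(n)
        ]
    return x


def spread(x):
    """Gap between the largest and smallest node feature."""
    return max(x) - min(x)


# Path graph 0 - 1 - 2 - 3 with all of the signal initially on node 0.
path = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
features = [1.0, 0.0, 0.0, 0.0]

after_one = smooth(path, features, 1)
after_many = smooth(path, features, 200)
```

Comparing `spread(features)`, `spread(after_one)`, and `spread(after_many)` shows the node features collapsing towards a common value as more smoothing steps are stacked, which is the behavior a reconstruction constraint is meant to counteract.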

Intralayer Embedding Loss.
In the initialization of the GDC, the input matrix $Z^{(1)}$ is $Z^{(k)}$, where $k$ is the number of graph convolution layers and $Z^{(k)}$ is the final node embedding matrix according to equation (5). To separate the structure information from the input by the deconvolution kernel, we propose an intralayer embedding loss. We use a symmetric structure containing $k$ convolution layers and $k$ deconvolution layers as our graph convolution-deconvolution component (SGCD). The reconstruction loss formula of SGCD is
We assume the input $Z^{(1)}$ of the GCC is the node attribute matrix $X$, so that the output $Z^{(k)}$ of the GDC is a reconstruction matrix in respect of $X$. This reconstruction process is significant for Section 4.3.
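One common choice for such a reconstruction loss, assumed here purely for illustration and not taken from the model definition, is the mean squared error between the attribute matrix $X$ and the matrix produced by the GDC. The function name `reconstruction_loss` and the toy matrices are hypothetical.

```python
def reconstruction_loss(x, x_rec):
    """Mean squared error between attributes X and their reconstruction (assumed form)."""
    total, count = 0.0, 0
    for row, row_rec in zip(x, x_rec):
        for a, b in zip(row, row_rec):
            total += (a - b) ** 2
            count += 1
    return total / count


x_attr = [[1.0, 0.0], [0.0, 1.0]]        # toy node attributes
x_perfect = [[1.0, 0.0], [0.0, 1.0]]     # exact reconstruction
x_collapsed = [[0.5, 0.5], [0.5, 0.5]]   # every node mapped to the same vector

loss_perfect = reconstruction_loss(x_attr, x_perfect)
loss_collapsed = reconstruction_loss(x_attr, x_collapsed)
```

Under this assumed loss, an oversmoothed output in which all nodes share one vector is penalized (`loss_collapsed` is larger than `loss_perfect`), which matches the role the text assigns to the reconstruction constraint.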

Node Representation Learning.
In order to preserve the attribute and structural information of the node in each layer, we need to aggregate the embedding $h_i^l$ ($l \in L$) of node $i$ in each layer to obtain a more complementary global node embedding $h_i$. We use a sum function to integrate the embeddings of node $i$ in each layer:

$$h_i = \sum_{l \in L} h_i^l.$$

The final embedding of the nodes in the multiplex network is then formed from the vectors $h_i$ of all nodes. The embedding $h_i^l$ of node $i$ in layer $l$ is obtained by the GCC and is a row of $Z^l$, where $Z^l \in \mathbb{R}^{N \times d}$ and $h_i^l \in \mathbb{R}^{1 \times d}$.

Preserving Interlayer Dependence.
To estimate the complementary information between layers, we introduce how to get the true sample $(l', l, i)$, which indicates that layer $l'$ is complementary for node $i$ in layer $l$. This computation of complementary information can effectively measure the interlayer dependence property. The basic idea is that the more dissimilar the local structures in two layers are, the more reason there is to believe that complementary information exists between these two layers. So, we utilize the structural similarity between two layers to produce true samples. Let $P(\cdot \mid i, l)$ denote the true underlying connecting distribution of node $i$ in layer $l$, which is estimated from the connections of node $i$ in that layer. Then, the locally topological structural similarity of node $i$ between layers $l'$ and $l$ can be calculated from the Jensen-Shannon divergence between $P_{l',i} = P(\cdot \mid i, l')$ and $P_{l,i} = P(\cdot \mid i, l)$ as

$$D_{JS}(P_{l',i} \,\|\, P_{l,i}) = \frac{1}{2} D_{KL}(P_{l',i} \,\|\, M) + \frac{1}{2} D_{KL}(P_{l,i} \,\|\, M),$$

where $M = (P_{l,i} + P_{l',i})/2$ and $D_{KL}$ is the Kullback-Leibler divergence:

$$D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}.$$

Note that when the locally topological structures of node $i$ in layers $l'$ and $l$ are identical, $D_{JS}(P_{l',i} \,\|\, P_{l,i}) = 0$; when the two distributions share no support at all, $D_{JS}(P_{l',i} \,\|\, P_{l,i})$ reaches its maximum value of 1 (with base-2 logarithms). So, we get $S_{struc}(l', l \mid i) = 1 - D_{JS}(P_{l',i} \,\|\, P_{l,i})$ as the locally topological structural similarity between layers $l'$ and $l$ regarding node $i$. Finally, we estimate $P_{true}(\cdot \mid l, i)$ according to equation (13) and sample true layers from this distribution; in equation (13), $\Delta$ denotes a function that concatenates the elements successively. This estimate of structural complementarity can also serve as the similarity of node $i$ in layer $l$ with respect to the topology of the layer where the node is located.
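The structural similarity $S_{struc}(l', l \mid i) = 1 - D_{JS}$ can be sketched for a single node as follows. Two points are assumptions rather than statements from the text: the connecting distribution of a node is taken to be its normalized adjacency row (uniform for an isolated node), and base-2 logarithms are used so that the divergence lies in $[0, 1]$. All function names are hypothetical.

```python
import math

def neighbor_distribution(adj_row):
    """Assumed estimate of P(. | i, l): the normalized adjacency row of node i."""
    total = sum(adj_row)
    if total == 0:
        # Isolated node: fall back to a uniform distribution (an assumption)
        return [1.0 / len(adj_row)] * len(adj_row)
    return [v / total for v in adj_row]

def kl_divergence(p, q):
    """Kullback-Leibler divergence with base-2 logs; terms with p = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence using the mixture M = (P + Q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def structural_similarity(row_a, row_b):
    """S_struc = 1 - D_JS between the two local connecting distributions."""
    return 1.0 - js_divergence(neighbor_distribution(row_a),
                               neighbor_distribution(row_b))

# Adjacency rows of the same node in two layers of a 4-node network
same = structural_similarity([0, 1, 1, 0], [0, 1, 1, 0])      # identical neighborhoods
partial = structural_similarity([0, 1, 1, 0], [0, 0, 1, 1])   # one shared neighbor
disjoint = structural_similarity([0, 1, 0, 0], [0, 0, 0, 1])  # no shared neighbor
```

Layers with low similarity to layer $l$ at node $i$ are the ones most likely to carry complementary information for that node.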

The Interlayer Dependence Estimation of Nodes.
In order to realize the interlayer dependence property, inspired by the idea of Deep Infomax in [53], we regard the membership of node $i$ in layer $l$ for layer $l'$ as a measure of the interlayer local dependency of node $i$. Therefore, a layer-level embedding $H^l$ of layer $l$ in multiplex networks is computed by employing a readout function Readout: $\mathbb{R}^{N \times d} \rightarrow \mathbb{R}^{d}$:

$$H^l = \mathrm{Readout}(Z^l) = \sigma\left(\frac{1}{N} \sum_{i=1}^{N} h_i^l\right), \tag{14}$$

where $Z^l$ is the final embedding matrix of layer $l$ in the graph convolution component, $h_i^l$ is the embedding of node $i$ in layer $l$, and $\sigma$ is the logistic sigmoid nonlinearity. Based on the layer-level embedding and the embedding of each node in this layer, we calculate the measure of the interlayer dependence property of node $i$ in layer $l$ on layer $l'$. In this paper, we apply a simple bilinear scoring function of the form $\sigma\big(h_i^l W (H^{l'})^{\top}\big)$, as it empirically performs the best in our experiments, where $\sigma$ is the logistic sigmoid nonlinearity and $W \in \mathbb{R}^{d \times d}$ is a trainable scoring matrix. We estimate the interlayer local dependence measure of the nodes by calculating the scores between the embedding of each node in each layer and the global embedding of each layer. The result, $P_{pred}(\cdot \mid l, i)$ in equation (16), is a vector of interlayer dependence of node $i$ in layer $l$ with respect to the duplicates of node $i$ in each layer; it is obtained by applying $\Delta$, a function that concatenates the elements successively, to these scores.
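The readout and bilinear scoring steps can be sketched as below. This is a hedged illustration: it assumes the readout is the sigmoid of the mean node embedding (the Deep Graph Infomax convention) and that the score of a node against a layer is $\sigma(h W s^{\top})$; the function names and the fixed identity matrix standing in for the trainable $W$ are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def readout(z):
    """Layer-level summary: sigmoid of the column-wise mean of the N x d matrix z."""
    n, d = len(z), len(z[0])
    return [sigmoid(sum(row[k] for row in z) / n) for k in range(d)]

def bilinear_score(h, w, s):
    """Score sigma(h W s^T) between a node embedding h and a layer summary s."""
    hw = [sum(h[a] * w[a][b] for a in range(len(h))) for b in range(len(s))]
    return sigmoid(sum(hw[b] * s[b] for b in range(len(s))))

def dependence_vector(h, w, layer_summaries):
    """Concatenate the scores of one node embedding against every layer summary."""
    return [bilinear_score(h, w, s) for s in layer_summaries]

# Hypothetical 2-dimensional embeddings for two layers with three nodes each
Z1 = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
Z2 = [[2.0, -2.0], [1.0, -1.0], [3.0, -3.0]]
summaries = [readout(Z1), readout(Z2)]
W = [[1.0, 0.0], [0.0, 1.0]]                      # stands in for the trainable matrix
scores = dependence_vector([1.0, -1.0], W, summaries)
```

In a trained model, $W$ and the embeddings would be learned jointly so that the score vector matches the sampled complementary layers.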

Interlayer Dependence Loss.
Comparing equation (13) with equation (16), we design an objective function based on the binary cross-entropy loss (BCELoss) to preserve the interlayer dependence property of nodes; it is given in equation (17).
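As a rough illustration of how a binary cross-entropy objective compares a target dependence vector with a predicted one, consider the sketch below. It is the standard mean-reduced BCE, not necessarily the exact normalization used in equation (17); the clipping constant `eps` and the example vectors are assumptions.

```python
import math

def bce_loss(targets, predictions, eps=1e-12):
    """Mean binary cross-entropy between target and predicted probabilities."""
    total = 0.0
    for t, p in zip(targets, predictions):
        # Clip predictions away from 0 and 1 to keep the logarithms finite
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(targets)

# Hypothetical target: layers 0 and 2 are complementary for the node, layer 1 is not
p_true = [1.0, 0.0, 1.0]
good_prediction = [0.9, 0.1, 0.8]
uninformative_prediction = [0.5, 0.5, 0.5]

loss_good = bce_loss(p_true, good_prediction)
loss_flat = bce_loss(p_true, uninformative_prediction)
```

A prediction that agrees with the target yields a smaller loss than an uninformative one, which is what drives the scoring matrix toward the sampled complementary layers.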

Preserving Dependence between Attributes and Topology.
In order to preserve the dependence between attributes and the topology of each layer, the original attributes of nodes are fed into the GCC. We perform the GCC and GDC processes to disentangle the attributes of nodes into different semantic representations. We believe that the GCC can strengthen the attribute values related to the layer's semantics in the node attributes, and that the GDC can disentangle the attributes of nodes with the structure information of nodes. This is the main advantage of our SGCD (intralayer embedding) method compared with the graph autoencoder and variational autoencoder. Then, in the final graph deconvolution network phase, the final output embeddings of the GDC for all layers of the multiplex network are aggregated by a concatenation function. A linear layer is used to reconstruct the original attributes $x_i$, which gives the overall model framework an autoencoder architecture. Based on the embedding of node $i$ in layer $l$ and the embeddings of node $i$ in the other layers, we construct a simple fusion method to obtain the reconstructed attributes $\hat{x}_i$ of node $i$: a linear map with trainable parameters $W$ is applied to the concatenated GDC outputs $\hat{Z}_i^{l'}$ of node $i$ in each layer $l'$, followed by a sigmoid activation function $\sigma$. We then also use the BCELoss function to calculate the loss between the original attributes $x_i$ and the reconstructed attributes $\hat{x}_i$ of node $i$ (equation (19)), where $L$ is the number of layers of the multiplex network and $x_i$ is the attribute vector of node $i$. Finally, the global loss function of HMNE combines the losses of the different components: we simply sum all the loss functions as the loss of the entire model and use the Adam optimizer for backpropagation and parameter learning.
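The attribute reconstruction step described above (concatenate the per-layer GDC outputs of a node, apply a linear layer, then a sigmoid) can be sketched as follows. The function name, the explicit bias term, and the hand-picked weights are illustrative assumptions; in the model the parameters would be learned.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def reconstruct_attributes(per_layer_outputs, weight, bias):
    """Fuse the per-layer outputs of one node into reconstructed attributes.

    per_layer_outputs: list of |L| vectors, one per layer
    weight: matrix with one row per attribute and one column per concatenated entry
    bias: one value per attribute (an assumption; the text only mentions W)
    """
    concat = [v for output in per_layer_outputs for v in output]
    return [sigmoid(sum(w * c for w, c in zip(row, concat)) + b)
            for row, b in zip(weight, bias)]

# Hypothetical node with 2-dimensional outputs in two layers and 3 attributes
outputs = [[1.0, 0.0], [0.0, 1.0]]
W = [[2.0, 0.0, 0.0, 2.0],     # attribute 0 depends on both layers
     [-2.0, 0.0, 0.0, 0.0],    # attribute 1 depends on layer 0 only
     [0.0, 0.0, 0.0, 0.0]]     # attribute 2 ignores the embeddings
b = [0.0, 0.0, 0.0]
x_hat = reconstruct_attributes(outputs, W, b)
```

Because the output lies in $(0, 1)$, it can be compared with binary or normalized attributes using the same cross-entropy loss as the interlayer term.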

The Optimization and Time Complexity.
We present the node representation learning process of HMNE for multiplex networks in Algorithm 1. The total time complexity of HMNE is $O(TNE|L|^2)$, where $T$ is the number of iterations, $N$ is the number of nodes in each layer, $E$ is the number of edges of the multiplex network, and $|L|$ is the number of layers.

Experiment Analysis
In this section, we study the performance of HMNE in different real-world datasets. We use cross-domain link prediction and shared community detection tasks to verify the performance of HMNE.

Datasets.
For our experiments, we run HMNE and the baseline methods on each of the following multiplex networks. These datasets fall into two categories: public datasets and a private dataset. The public datasets are five multiplex network benchmark datasets covering social, biological, and transportation networks. The private dataset is a semantic network dataset that we constructed.
This dataset combines a network of acknowledgment relationships, extracted from the acknowledgment sections of dissertations, with the coauthor network of the corresponding entities from AMiner (https://www.aminer.cn/). The specific information about the public and private datasets is shown in Table 2.
Vickers classroom social multiplex network: this dataset was collected by Vickers from 29 seventh-grade students in a school in Victoria, Australia. Students were asked to nominate their classmates on a number of relations (class, best friend, and work).
CS-Aarhus social multiplex network: this dataset consists of five kinds of online and offline relationships (Facebook, leisure, work, coauthorship, and lunch) between the employees of the computer science department at Aarhus. These variables cover different types of relations between the actors based on their interactions.
London multiplex transport network: this dataset was collected in 2013 from the official website of Transport for London and manually cross-checked. Nodes are train stations in London, and edges encode existing routes between stations. Tube, overground, and DLR stations are considered.
CKM physicians' innovation multiplex network: this dataset was collected by Coleman, Katz, and Menzel on medical innovation, considering physicians in four towns in Illinois: Peoria, Bloomington, Quincy, and Galesburg. They were concerned with the impact of network ties on the physicians' adoption of a new drug, tetracycline. The three views of this network are advice, discussion, and friend.
Celegans multiplex connectome network: this dataset considered different types of genetic interactions for organisms in the Biological General Repository for Interaction Datasets (BioGRID, thebiogrid.org), a public database that archives and disseminates genetic and protein interaction (ElectrJ, MonoSyn, and PolySyn) data from humans and model organisms.
These networks have been used as benchmark datasets for evaluating multiplex network analysis methods. In addition, the CKM dataset has ground-truth information about the community labels of nodes. Therefore, we evaluate HMNE on the cross-domain link prediction task on all datasets and on the shared community detection task on the CKM dataset.

Private Dataset.
This dataset is a two-layer network constructed from two views. One view is a coauthor network built from author co-occurrence in paper data from AMiner. The other view is built from the acknowledgment chapter of each dissertation: the author of the dissertation is taken as the central node, the named entities identified in the acknowledgment text (including tutors, teachers, classmates, or family members) are taken as neighbor nodes, and the co-occurrence of entities forms the edges of the resulting ego network. Based on the acknowledgment text of the dissertations and the paper data, the acknowledgment layer and the coauthor layer of the Ack-co-author dataset are constructed, respectively.

Baseline Methods.
In these experiments, we test 14 comparison algorithms: 11 baseline methods with the same parameters and dimensions and 3 traditional methods. Some of these methods can be used for both tasks, while others are suited to only one of the two tasks. The details of the baseline methods are as follows:
(i) CN (common neighbor): it captures the notion that two nodes that have a common neighbor may be introduced by that neighbor. It has the effect of "closing a triangle" in the graph and resembles a common mechanism in real life.
(ii) JC (Jaccard coefficient): it is a measure used for gauging the similarity and diversity of sample sets and is defined as the size of the intersection divided by the size of the union of the sample sets.
(iii) AA (Adamic/Adar): it is a measure to predict links according to the number of shared links between two nodes. It is defined as the sum of the inverse logarithmic degree centrality of the neighbors shared by the two nodes.
(iv) AAMT [54]: it is a link prediction method for multiplex networks based on the Adamic/Adar neighbor similarity, which considers the intensity and structural overlap of multiplex links simultaneously.
(v) Node2vec [33]: it adds a pair of parameters to realize BFS-like and DFS-like sampling processes on a single-layer network. This makes it better at capturing the roles of nodes, such as hubs or tail users.
(vi) OhmNet [18]: it is a node embedding method for multiplex networks, where hierarchy information is used to model dependencies between the layers.
(vii) PMNE [17]: it has three methods of node embedding, each of which generates a common embedding of each node by merging multiple networks. We compare these three models with the other baseline methods. We denote "network aggregation," "results' aggregation," and "coanalysis model" as PMNE(n), PMNE(r), and PMNE(c), respectively.
(viii) MNE [24]: it is a scalable multiplex network embedding. It contains one high-dimensional common embedding and a lower-dimensional additional embedding for each type of relation. Then, multiple relations can be learned jointly based on a unified network embedding model.
(ix) MELL [26]: it is an embedding method for multiplex networks, which incorporates the idea of a layer vector that captures and characterizes each layer's connectivity. This method exploits the overall structure effectively and embeds both directed and undirected multiplex networks, whether their layer structures are similar or complementary.
(x) GraphSAGE [38]: it is a graph neural network framework for inductive representation learning on graphs. GraphSAGE is used to generate low-dimensional vector representations for nodes and is especially useful for graphs that have rich node attribute information. We use the unsupervised version of GraphSAGE as a baseline method for the link prediction task.
(xi) GATNE-T [9]: it considers the network structure and uses base embeddings and edge embeddings, together with an attention mechanism, to capture the influential factors between different edge types.
(xii) DMGI [10]: it is a simple yet effective unsupervised network embedding method for attributed multiplex networks, inspired by Deep Graph Infomax (DGI), which maximizes the mutual information between local patches of a graph and the global representation of the entire graph.
(xiii) MV-ACM [45]: it is a multiview adversarial completion model. Each relation space is characterized in a single viewpoint, enabling the model to use the topological structural information in each view.
(xiv) GenLouvain [55]: it is a modularity-based multiplex network community detection algorithm. The algorithm considers not only the modularity within each layer but also the modularity between layers. By maximizing the modularity metric, the algorithm completes the community detection task. We only use this algorithm as a baseline method for the node clustering task.

Input: graph $G = \langle V, E, L, X \rangle$; neural layer number $K \geq 3$ for GCC and GDC; graph convolution/deconvolution kernels $\Theta$, $\Theta_d$; iteration times $T$.
Output: $H$: the node embeddings of multiplex network $G$
(1) begin
(2)   Initialize all parameters for GCC and GDC with $K$ neural layers, respectively.
(3)   $t = 1$
(4)   while $t \leq T$ and not converged do
(5)     for $l$ in $L$ do
(6)       Sample nodes and calculate $P(l, \cdot)$ in layer $l$ based on equation (15).
(7)       Generate convolution embedding $Z^l$ using $X$ and $G^l$ by equation (5).
(8)       Read out the embedding $H^l$ of layer $l$ by equation (14).
(9)       Generate disentangled embedding $\hat{Z}^l$ using $Z^l$ and $G^l$ by equation (6).
ALGORITHM 1: HMNE model.
In this paper, we apply CN, JC, AA, node2vec, and GraphSAGE to the link prediction task only on the single layer where the test edges are located. For OhmNet, we randomly construct a hierarchy describing the relationships between different layers. We regard the common embedding in the MNE algorithm as the global embedding of nodes. For MELL, we take the layer-level embedding as the global-level embedding and add it to the node-level embedding of the test node. AAMT uses the multiplexity property of nodes (interlayer information) and the similarity between nodes (intralayer information) to predict the probability of a link. For GATNE-T and MV-ACM, we only use the homogeneous skip-gram model for node representation learning. The categorical multislice network model is selected for GenLouvain. The walk length, the number of walks, and the embedding dimension are set to the same values as in HMNE; all other parameters of the baseline methods, such as PMNE, MELL, and DMGI, are left at their defaults.
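The three neighborhood-based baselines (CN, JC, and AA) applied within a single layer can be sketched as follows. The helper names and the toy edge list are hypothetical; the scores follow the textual definitions given in the baseline list, with the natural logarithm assumed for AA.

```python
import math

def build_neighbors(edges):
    """Map each node of an undirected layer to the set of its neighbors."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    return nbrs

def common_neighbors(nbrs, u, v):
    """CN: number of neighbors shared by u and v."""
    return len(nbrs[u] & nbrs[v])

def jaccard(nbrs, u, v):
    """JC: size of the intersection divided by the size of the union."""
    union = nbrs[u] | nbrs[v]
    return len(nbrs[u] & nbrs[v]) / len(union) if union else 0.0

def adamic_adar(nbrs, u, v):
    """AA: sum of 1 / log(degree) over the shared neighbors."""
    return sum(1.0 / math.log(len(nbrs[w]))
               for w in nbrs[u] & nbrs[v] if len(nbrs[w]) > 1)

# Hypothetical layer: nodes 0 and 3 are not linked but share neighbors 1 and 2
layer_edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
nbrs = build_neighbors(layer_edges)
cn = common_neighbors(nbrs, 0, 3)
jc = jaccard(nbrs, 0, 3)
aa = adamic_adar(nbrs, 0, 3)
```

These scores only see the layer that contains the test pair, which is why they serve as single-layer references for the multiplex methods.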

Experimental Setup.
For implementing the network feature extraction module, we use representation learning of nodes to extract the features of each layer. If the nodes in a dataset have no attributes, we use the adjacency matrix of the merged multiplex network as the attribute information of nodes in the compared experiments. This matrix is the adjacency matrix obtained after the multilayer network is aggregated or flattened (that is, after taking the union of the edges of all layers). The matrix reflects how the topology of a node in the different layers depends on the overall network topology. In other words, the neighbors of a node (which play the role of node attributes) relate to the formation of the node's topology under different semantics. We set p = 2 and q = 1 as default parameters in the biased sampling process of the node2vec method. We set the number of walks to 20 and the walk length to 30 for OhmNet, node2vec, PMNE (n, r, c), MNE, MELL, GraphSAGE (unsupervised version), GATNE-T, and MV-ACM. The embedding dimension is set to 128 for all methods. For GATNE-T, DMGI, MV-ACM, and our HMNE, the optimizer of the model is Adam, the learning rate is selected from {0.0001, 0.002}, and the batch size is 50 (except for the Vickers dataset). For the three heterogeneous network embedding methods, an edge is usually input into the model as a meta-path for training. All the experiments are conducted on a Linux server with sixteen logical CPUs on an Intel Xeon E5 CPU and four GTX 1080Ti GPUs. Notice that, in the community detection task, we uniformly remove the community label from the node attributes for representation learning. Our model can alleviate the oversmoothing problem of current graph neural network algorithms; to verify this, we show the effect of the number of neural layers on the model performance.
According to the experimental results, using a 2-layer graph neural network in both compared experiments is a tradeoff between the performance and the complexity of the model.
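The flattened adjacency matrix used as a substitute for missing node attributes can be sketched as below. The function name and the edge-list input format are assumptions, and the sketch treats every layer as undirected and unweighted.

```python
def flatten_layers(layers, num_nodes):
    """Adjacency matrix of the aggregated network: the union of edges over all layers.

    layers: list of edge lists, one per layer, with nodes indexed 0..num_nodes-1
    """
    agg = [[0] * num_nodes for _ in range(num_nodes)]
    for edges in layers:
        for u, v in edges:
            # An edge present in any layer appears once in the flattened network
            agg[u][v] = 1
            agg[v][u] = 1
    return agg

# Hypothetical two-layer network on four nodes; edge (0, 1) occurs in both layers
layer_a = [(0, 1), (1, 2)]
layer_b = [(0, 1), (2, 3)]
X = flatten_layers([layer_a, layer_b], 4)
# Row i of X serves as the attribute vector of node i
```

Each row then encodes which nodes a given node touches in at least one layer, regardless of the relation type.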

Cross-Domain Link Prediction.
In this section, we perform the cross-domain link prediction task on these multiplex networks, following the experimental settings for multiplex networks in [45]. For the cross-domain link prediction task, we remove 20% of the edges of each layer in the original network and use the area under the curve (AUC) score and the adjusted mutual information (AMI) score to evaluate how well the algorithms predict the missing edges in each layer. We use the remaining 80% of the edges of each layer for training and the 20% of edges randomly selected from each layer for testing. The node pairs in the test edge set are regarded as positive examples. Then, we randomly sample an equal number of node pairs that are not connected by an edge to serve as negative examples in the test set. AUC is the area under the receiver operating characteristic (ROC) curve, which is equal to the probability that a classifier ranks a randomly chosen positive example higher than a randomly chosen negative example. Mutual information (MI) is also used to measure the degree of agreement between two data distributions.

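Assuming that $U$ and $Y$ are two assignments of $N$ sample labels, the entropies of the two distributions are

$$H(U) = -\sum_{i=1}^{|U|} P(i) \log P(i), \qquad H(Y) = -\sum_{j=1}^{|Y|} P'(j) \log P'(j),$$

where $P(i) = |U_i|/N$ and $P'(j) = |Y_j|/N$. The mutual information between $U$ and $Y$ is

$$\mathrm{MI}(U, Y) = \sum_{i=1}^{|U|} \sum_{j=1}^{|Y|} P(i, j) \log \frac{P(i, j)}{P(i) P'(j)},$$

where $P(i, j) = |U_i \cap Y_j|/N$. The adjusted mutual information corrects MI for chance:

$$\mathrm{AMI}(U, Y) = \frac{\mathrm{MI}(U, Y) - E[\mathrm{MI}]}{\operatorname{avg}\big(H(U), H(Y)\big) - E[\mathrm{MI}]},$$

where $E[\mathrm{MI}]$ is the expected value of the mutual information and $\operatorname{avg}$ denotes an average of the two entropies (for example, their arithmetic mean or their maximum). The range of AMI values is $[-1, 1]$, and a larger value means that the result is more consistent with the real situation.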
We calculate the similarity between nodes by the CN, JC, and AA metrics in the layer where the test node pair is located. For the other single-layer network embedding methods, we train a separate embedding for each relation type of the network to predict links on the corresponding edges. This means that they do not have information from other layers of the multiplex network. We aim to verify that the interlayer dependence can provide complementary structure information from other layers. For node embedding methods, we use the cosine similarity of the embedding vectors as the similarity metric. The larger the similarity score of a node pair is, the more likely it is that a link exists between the two nodes.
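The evaluation protocol for embedding methods (cosine similarity as the link score, AUC as the probability that a positive pair outranks a negative pair) can be sketched as follows. The embeddings and node pairs are made up for illustration, and ties are counted as one half, which is the usual convention for the rank-based AUC.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors (0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) score pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical 2-dimensional node embeddings
emb = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [0.1, 0.9]}
positive_pairs = [(0, 1), (2, 3)]   # held-out test edges
negative_pairs = [(0, 2), (1, 3)]   # sampled non-edges

pos = [cosine(emb[u], emb[v]) for u, v in positive_pairs]
neg = [cosine(emb[u], emb[v]) for u, v in negative_pairs]
score = auc(pos, neg)
```

The pairwise loop is quadratic in the number of test pairs; for large test sets a rank-based computation gives the same value more efficiently.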
From Table 3 and Figure 3, we can see that HMNE is significantly better than the other comparison algorithms. Our model performs better on multiplex network datasets than single-layer methods such as CN, JC, AA, node2vec, and GraphSAGE, which shows that fusing different structural information by preserving the interlayer dependence property can improve the accuracy of the cross-domain link prediction task. This property of the multiplex network can provide critical complementary information from other layers. We regard OhmNet, PMNE, MNE, MELL, GATNE-T, DMGI, and HMNE as comparative experimental groups. These compared algorithms are recent multiplex network representation learning methods. OhmNet and PMNE are extensions of traditional single-layer network embedding methods, but they do not directly consider the interlayer dependence property in the final embeddings. This leads to an inevitable loss of information in the embedding process, so the complementary information between layers cannot be well preserved. For the MNE and MELL methods, the common (or layer) embedding is based on the assumption that nodes have similar local structures in different layers. In fact, this assumption rarely holds, which also affects the generalization ability of the algorithm.
This process of interlayer node embedding based on a common embedding can lead to distortion and inaccuracy of information. GATNE-T, DMGI, and MV-ACM are specially designed to handle the scenario in which the nodes have different types and attributes in each layer, so they cannot show excellent performance on the problem we are trying to solve. Moreover, these three methods ignore the dependence property between node attributes and the topology of the layer where the node is located. In contrast, HMNE simultaneously considers the intralayer, interlayer, and attribute dependence properties of nodes in the node embedding process.

Shared Community Detection.
The shared community detection task aims to group similar nodes so that nodes in the same group are more similar to each other than to those in different groups. In other words, each node in a multiplex network has different relations/views but belongs to only one community. In the CKM dataset, nodes have a global community label. For this dataset, the task is usually called a shared community detection task, which is a significant mining task in multiplex network analysis. Therefore, we treat the CKM dataset as the benchmark dataset for the shared community detection task. For the methods based on node representation learning, we use the K-means++ algorithm to cluster the final embeddings of nodes. For a fair evaluation, we set the number of communities (clusters) to 2.

Evaluation Metrics.
Given the ground-truth communities in the real-world datasets, we use normalized mutual information (NMI) to evaluate the performance of the methods. NMI is defined in terms of the normalized conditional entropy $H(X \mid Y)$ of partition $X$ with respect to $Y$, where $X$ and $Y$ denote two partitions of the network and $|C|$ denotes the number of communities. The larger the NMI is, the better the result is. The value of NMI ranges from 0 to 1: it is equal to 1 when the two partitions match perfectly and equal to 0 when they share no information.
In the domain of node clustering, the adjusted Rand index (ARI) is the chance-corrected version of the Rand index. It is known to be less sensitive to the number of parts. Two elements $(y, y')$ of an object set $Y$ are said to be paired in a partition if they belong to the same cluster. Let $Q$ and $U$ be two partitions of the object set $Y$. A formal formulation of the adjusted Rand index is

$$\mathrm{ARI}(Q, U) = \frac{2(ad - bc)}{(a + b)(b + d) + (a + c)(c + d)},$$

where $a$ is the number of pairs $(y, y') \in Y$ that are paired in $Q$ and in $U$; $b$ is the number of pairs $(y, y') \in Y$ that are paired in $Q$ but not paired in $U$; $c$ is the number of pairs $(y, y') \in Y$ that are not paired in $Q$ but paired in $U$; and $d$ is the number of pairs $(y, y') \in Y$ that are paired neither in $Q$ nor in $U$. This index has an upper bound of 1 and takes the value 0 when the Rand index is equal to its expected value.

As shown in Table 4, HMNE shows excellent performance in the shared community detection task, obtaining the largest NMI and ARI scores among all methods. MNE and MELL learn a representation of a node separately in each layer. We sum the representations of a node in the different layers as its global embedding and compare them with our model. The performance of MNE and MELL in this task shows that this kind of joint representation learning algorithm cannot well preserve the shared community information of nodes. Compared with MV-ACM, GATNE-T, and DMGI, which can handle heterogeneous networks, our model also performs better in the shared community detection task. Unlike them, HMNE takes into account the high-order proximity property of nodes, which encourages the embeddings of nodes in the same community to be similar. It should be noted that, owing to its iterative strategy of maximizing modularity, GenLouvain shows competitive performance. However, GenLouvain only considers the topology of the multiplex network. HMNE can capture fine-grained semantic information by preserving the dependence property between node attributes and the topology of each layer. Compared with the other algorithms, the shared community detection task verifies that our model can preserve the global mesoscale information of the multiplex network more effectively. We further validate that our model can more fully consider multiple properties of networks. The execution time of MV-ACM is more than 24 hours, so its final results on the Celegans and Ack-co-author datasets are not reported.

In general, the results of the cross-domain link prediction and shared community detection tasks demonstrate the effectiveness of our model. For the cross-domain link prediction task, the graph convolution-deconvolution component of HMNE ensures that our model can preserve high-order proximity information. When there is a lack of available information within the layer, the interlayer dependence component of HMNE can provide more abundant information. For the shared community detection task, the component preserving the dependence between node attributes and the topology of the layer where the node is located can obtain more fine-grained semantic information related to the layer's topology by disentangling the original attribute information.

As the number of neural layers increases, the AUC score of HMNE first increases and then tends to stabilize. In other words, HMNE does not appear to suffer from oversmoothing as the number of layers increases, unlike other methods [56] based on graph neural networks. Therefore, our proposed HMNE can not only preserve the high-order proximity information of nodes but also avoid the oversmoothing problem caused by stacking multiple neural layers. Figure 4(c) illustrates that the AUC scores of HMNE also first increase with the number of embedding dimensions and then tend to stabilize. When the embedding dimension reaches a certain level, HMNE can capture enough key information.
Within a certain range of embedding dimensions, the node embedding already contains most of the important information needed by the tasks. If the embedding dimension continues to increase, the model learns higher-order or more abstract information, so its performance remains stable over an interval. In this state, because HMNE has a self-supervised, autoencoder-like structure, we believe that, with a further increase of dimensions, the objective function designed in our model purifies the original information, filters out meaningless and redundant information, and preserves fine-grained features. Therefore, as the dimension increases, the performance of the model does not show a further increasing trend within a certain dimension range.

Ablation Experiment.
In this section, we will verify the effectiveness of the two properties separately by ablating the constraints of the corresponding loss function from HMNE.
(1) HMNE-Inter: to verify the effect of the interlayer dependence property on HMNE, we only ablate loss function equation (17). (2) HMNE-Attr: to verify the effect of the dependence between node attributes and the topology of each layer on HMNE, we only ablate loss function equation (19). The experimental results are shown in Figure 5.

The Effectiveness of the Interlayer Dependence Property.
As can be seen from Figure 5(a), the interlayer dependence property is critical for link prediction tasks. After removing loss function equation (17) (called HMNE-Inter), the performance of HMNE decreases more significantly in the cross-domain link prediction task than in the community detection task. The reason is that the structure information of other layers provides effective complementary information for the node pair prediction of the target layer.

The Effectiveness of Dependence between Node Attributes and the Topology of Each Layer.
After removing loss function equation (19) (called HMNE-Attr), Figure 5(b) illustrates that the performance of HMNE-Attr decreases significantly in the community detection task. In the shared community detection task, we believe the performance of HMNE is more dependent on the attribute information of the node. However, in the link prediction task, the information provided by the dependence between node attributes and the topology of each layer is limited.

Conclusion
In this paper, we propose an unsupervised node embedding model for multiplex networks, called HMNE. First, HMNE addresses the problem of preserving the high-order proximity information of nodes through the symmetric graph convolution-deconvolution component (SGCD). SGCD utilizes the designed graph deconvolution component (GDC) to reconstruct the input of the graph convolution component (GCC) with multiple graph convolution neural layers, which can effectively avoid the oversmoothing problem. Second, HMNE preserves the interlayer dependence property with interlayer complementary information of multiplex networks through our designed interlayer dependence component. When there is a lack of available information within the layer, the interlayer dependence component of HMNE can provide more abundant information from other layers (e.g., in the cross-domain link prediction scenario). Finally, HMNE preserves the dependence between the node attributes and the topology of each layer through a disentangled representation of the attributes of nodes. This gives HMNE more fine-grained attributes, with different semantic information of nodes associated with each layer structure. The final representation of nodes with fine-grained attribute information can perform better in downstream tasks (e.g., in the shared community detection scenario). Systematic experiments on six real-world networks show the excellent performance of HMNE on two downstream tasks compared with state-of-the-art baselines. Experiments on large-scale network data based on HMNE will be the focus of our future research.

Data Availability
The compared methods and our code required for replicating the reported results are available at https://github.com/Brian-ning/HMNE/. The public datasets can also be downloaded from https://comunelab.fbk.eu/data.php.

Conflicts of Interest
The authors declare that they have no conflicts of interest.