Variational Approach for Learning Community Structures

Discovering and modeling community structure is a fundamentally challenging task. In domains such as biology, chemistry, and physics, researchers often rely on community detection algorithms to uncover community structures from complex systems, yet no unified definition of community structure exists. Furthermore, existing models tend to be oversimplified, neglecting richer information such as nodal features. Coupled with the surge of user-generated information on social networks, this creates a demand for techniques beyond traditional approaches. Deep learning techniques such as network representation learning have shown tremendous promise. More specifically, supervised and semisupervised learning tasks such as link prediction and node classification have achieved remarkable results. However, unsupervised learning tasks such as community detection remain largely unexplored. In this paper, a novel deep generative model for community detection is proposed. Extensive experiments show that the proposed model, empowered with Bayesian deep learning, can provide insights in terms of uncertainty and exploit nonlinearities, which results in better performance in comparison to state-of-the-art community detection methods. Additionally, unlike traditional methods, the proposed model is agnostic to the definition of community structure. Leveraging low-dimensional embeddings of both network topology and feature similarity, it automatically learns the best model configuration for describing similarities in a community.


Introduction
Real-world complex systems are often projected into networks to observe complex patterns. Entities in a complex system can be represented as nodes (vertices) and their interactions as edges (links). For instance, social interactions between people can be represented in the form of a social network. Publications by authors and their respective publication venues can be represented with a bipartite citation network. The flexibility of networks and the vast literature on graph theory make network science very appealing to researchers. Although networks are merely represented in the form of nodes and edges, a large complex system could easily scale from hundreds to millions of nodes and edges. This poses a very challenging task in machine learning, especially for tasks such as graph clustering, more commonly known as community detection [1] in the network science literature. Given a network (graph) with its node content and structural (link) information, community detection aims to partition the nodes in the network into a number of disjoint groups. These partitions can be formulated depending on the given definition. For example, in modularity maximization [2], each partition is compared against a null model (random network). A partition is classified as good when its modularity score is greater than that obtained by partitioning a random network. On the other hand, statistical methods such as the Stochastic Blockmodel (SBM) introduced a Bayesian treatment of uncertainty when partitioning the network. Nodes that are statistically similar have a higher probability of being clustered together regardless of the cluster's density [3]. This is known as stochastic equivalence. In general, a universal definition of community structure does not exist. Nevertheless, the objective remains the same, i.e., to find a group of nodes that share some form of similarity with one another. In this paper, such similarity is defined as latent similarity; the similarity measure is not predefined. Quantifying such similarity is arguably subjective and difficult, especially when a given network can be feature-rich or structure-only; there is no one-size-fits-all solution for community detection (cf. the no free lunch theorem). Therefore, it is essential that algorithms capture both higher-order information and structural information. To this end, we look at network representation learning [4,5] as a potential solution.
In machine learning, representation learning [6] has been successfully applied to various fields such as natural language processing and computer vision. Notably, deep learning models have surpassed human-level accuracy on some tasks [7]. However, these successes are difficult to explain. More precisely, it is difficult to explain "why" a deep learning model performs so well. In an attempt to solve this problem, researchers bridged the understanding gap by introducing probabilistic deep models (also known as Bayesian Deep Learning) [8]. Using fundamental building blocks from a probabilistic perspective, assumptions are given in the form of noninformative priors and the model is forced to correct these assumptions while learning. Consequently, the models become less opaque than a typical deep learning model, which is commonly regarded as a black box.
Leveraging recent advances in representation learning, network representation learning pursues a similar objective, but from a network perspective. Given a network, the objective is to find a latent representation that generalizes to various machine learning tasks such as classification, link prediction, and clustering of nodes. Finding community structure in networks commonly involves a two-step approach. First, the network is embedded into a latent space (e.g., a Euclidean space). Next, a general clustering algorithm such as Spectral Clustering [9] or k-means is applied to the learned embedding. For instance, Tian et al. proposed a network representation learning model [10] that learns a nonlinear mapping of the original network using a Stacked Autoencoder, by showing that spectral clustering and Autoencoders have the same optimization objectives. Yang et al. considered a Stacked Autoencoder as a modularity optimization problem and further introduced a semisupervised approach through must-pair nodes for increased performance [11]. Assignment of communities is then obtained through k-means clustering from the latent representation that exhibits the highest modularity score. Inspired by Denoising Autoencoders [12], Wang et al. proposed the Marginalized Graph Autoencoder for Graph Clustering (MGAE) [13], which artificially corrupts the feature matrix to increase the amount of training data and provides a closed-form solution for optimization. Spectral Clustering is then applied to the learned latent representation. Clearly, these methods all employ a two-step approach, which is unsuitable for studying network generation or graph modeling [14].
Instead of taking a costly two-step approach that ignores uncertainty in the modeling process, the problem can be solved from a Bayesian point of view, by encoding our latent beliefs and assumptions as probabilistic graphical models. Specifically, one can assume that nodes and edges are generated from a mixture model such as the Gaussian Mixture Model (GMM). This effectively couples the learning of the cluster assignment and of the network representation into a joint probability distribution. Additionally, it helps to capture network properties exhibited by common networks, which in turn supports a better understanding of real-world networks.
Concretely, this paper proposes an extension to the Variational Graph Autoencoder (VGAE) [15]. Originally, VGAE projects graph convolutions into a single (unimodal) Gaussian latent space and has only been considered for semisupervised tasks such as link prediction and graph classification. The proposed model, VGAECD, relaxes this notion by introducing a Mixture of Gaussians. This is desirable as we would like to capture higher-order patterns from community structures and model their generative process. It is worth noting that similar approaches have been applied to VAE in domains such as image recognition [16]. However, these approaches are not readily applicable to networks, especially in a community detection problem.
To summarize, this paper explores the idea of learning network representations using a Bayesian treatment. We extend VGAE to include clustering-aware capability specifically targeting the community detection task. The contributions of this paper are summarized as follows:
(i) This paper proposes a novel generative model for community detection that does not require a predefined community structure definition. Through the process of automatic model selection, nodes are assigned to a community based on the criterion that best reduces the loss function.
(ii) The proposed model inherits the benefits of the Variational Autoencoder framework. The advantages are threefold: (1) it provides a variational lower bound whose optimization is guaranteed to converge to a local optimum, (2) the lower bound is scalable, and (3) the model is generative, allowing generation of synthetic networks.
(iii) The proposed model outperforms the state-of-the-art models in community detection without requiring additional priors (unlike the Degree-Corrected SBM).

Problem Definition
A network with nodes, edges, and node features can be formally defined as $G = (V, E, X)$, where $V = \{v_1, \dots, v_N\}$ is a set of $|V| = N$ nodes, $E = \{e_{ij}\}$ is a set of edges, and $X = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ is the set of node features. Each $\mathbf{x}_i \in \mathbb{R}^{D}$ is a real-valued vector associated with node $v_i$. From an Autoencoder's perspective, the inputs are given in terms of structural information $\mathbf{A} \in \mathbb{R}^{N \times N}$ and node features $\mathbf{X} \in \mathbb{R}^{N \times D}$, where $\mathbf{A}$ denotes the adjacency matrix of $G$, and the node features are content information provided as vector representations. In this work, we consider an undirected and unweighted network $G$, such that $A_{ij} = 1$ if $e_{ij} \in E$ and $A_{ij} = 0$ otherwise.
Given the network $G$, the objective of community detection or graph clustering is to partition the nodes in $G$ into $K$ disjoint groups $\{C_1, C_2, \dots, C_K\}$, such that nodes grouped within the same cluster are close to each other while nodes in different clusters are distant in terms of network structure. Vertices grouped within the same cluster are also more likely to have similar node features.
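The inputs described above can be illustrated with a minimal sketch. The helper names `adjacency_matrix` and `is_partition` are hypothetical and serve only to make the definitions concrete: the first builds the symmetric 0/1 adjacency matrix of an undirected, unweighted network from an edge list, and the second checks that a candidate set of groups is a valid disjoint partition of the nodes.

```python
from typing import List, Sequence, Tuple


def adjacency_matrix(num_nodes: int, edges: Sequence[Tuple[int, int]]) -> List[List[int]]:
    """Build the symmetric 0/1 adjacency matrix of an undirected, unweighted graph."""
    A = [[0] * num_nodes for _ in range(num_nodes)]
    for i, j in edges:
        # An undirected edge e_ij sets both A_ij and A_ji to 1.
        A[i][j] = 1
        A[j][i] = 1
    return A


def is_partition(groups: Sequence[Sequence[int]], num_nodes: int) -> bool:
    """Return True if the groups are disjoint and together cover nodes 0..num_nodes-1."""
    flat = [v for group in groups for v in group]
    return sorted(flat) == list(range(num_nodes))
```

For example, `adjacency_matrix(4, [(0, 1), (1, 2)])` describes a path on three nodes plus one isolated node, and `[[0, 1], [2, 3]]` is a valid partition of those four nodes into K = 2 groups.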
Additionally, we consider the definition of a generative model. The discriminative model, $p(\theta \mid \mathbf{X}, \mathbf{A})$, infers the model parameters $\theta$ from the observed network $G$. Subsequently, a network $G'$ can be generated from the same set of parameters; concretely, $G'$ is drawn from $p(\mathbf{A} \mid \theta)$. Under the model selection criterion, the model is said to be good when $G' \approx G$ and $G'$ satisfies the condition of having community structure; i.e., $G'$ is not an Erdős-Rényi network. By definition, generative models can be considered as an ensemble learning model.

Related Work
Recent work in community detection can be broadly categorized into two types of models, namely, discriminative and generative models. The former includes a class of methods that infer communities given an observed network and, optionally, node features. Meanwhile, the latter considers the reconstruction of the network while exploring plausible models that explain the observed phenomenon.
3.1. Discriminative Methods and Models.
Predominantly, modularity maximization [2,17] has been considered the most successful method for detecting communities. However, it suffers from a resolution limit problem [18] and is known to exhibit degeneracies [19]. In terms of speed, label propagation [20] is capable of detecting communities in large-scale networks in near-linear time, though the solutions are usually nonunique. Other approaches such as Walk-Trap [21], Infomap [22], and Louvain [23] are empirically competitive but are subject to a trade-off between accuracy and scalability [24]. Representation learning methods such as GraRep [25] and CFOND [26] consider the completion of the adjacency matrix and can generally be regarded as matrix factorization problems. Meanwhile, others like DeepWalk [27] and node2vec [28] learn a representation of each node via random walks (biased ones in the case of node2vec). They assume that neighboring nodes are similar to the pivot node. Hence, nodes that are clustered together tend to co-occur on short random walks over the network.
Besides the standard linear methods mentioned previously, recent advances in deep learning revisited Autoencoders for networks. In particular, GraphEncoder, proposed by Tian et al., shows that optimizing the objective function of an Autoencoder is similar to finding a solution for Spectral Clustering [10]. Leveraging deep learning's nonlinearity and recent advances in Convolutional Neural Networks, the works in [29,30] proposed the Graph Neural Network (GNN) and its generalization, the Graph Convolutional Neural Network (GCN) [29]. Defferrard et al. first cast the problem by projecting graph convolutions into spectral space and convolving within this space.
A widely known generative model for capturing networks with group structure is the Stochastic Blockmodel (SBM), also known as the planted partition model. First explored by Snijders and Nowicki [38] two decades ago, the key idea behind SBM is stochastic equivalence. The probability that two nodes $i$ and $j$ are connected depends exclusively on their community memberships: nodes within a community share the same stochasticity. However, the vanilla SBM exhibits a problem where high-degree nodes are clustered into a community of their own. Karrer and Newman proposed the Degree-Corrected (D.C.) SBM [39], which introduces a normalizing prior. Extensions to SBM include the Mixed Membership SBM (MMSBM) [40] for identifying mixed community participation and the bipartite SBM (biSBM) [41] for finding communities in bipartite networks. Today, SBM is well explored and its limitations have been widely studied [42,43]. However, SBM is not a network representation learning model. Instead, SBM learns the latent variables $\Pi$ and $\mathbf{Z}$, which describe the probabilities of cluster connectivity and the cluster assignment of a particular node, respectively; this differs from common representation learning methods.
Contrary to SBM, Autoencoders typically consist of two nongenerative steps (encoder and decoder). Consequently, the learned representation cannot be generalized for the generation of networks. To alleviate this problem, most recent approaches consider generative models for representation learning such as Generative Adversarial Networks (GAN) or the Variational Autoencoder (VAE). For graphs, Kipf and Welling [15] introduced a variant of VAE for link prediction tasks, and, for GAN, Pan et al. [44] recently introduced the adversarially regularized graph autoencoder (ARGA). In this work, we only consider the framework of VAE. We discuss this in Section 4.1.

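Variational Graph Autoencoder.
Variational Graph Autoencoder (VGAE) [15] extends the problem of learning network embedding to a generative perspective by leveraging the Variational Autoencoder (VAE) framework [45]. Consider a given network $G$ with structural information $\mathbf{A}$ and node features $\mathbf{X}$; the inference model of VGAE, parameterized by a two-layer GCN, is defined as

$$q(\mathbf{Z} \mid \mathbf{X}, \mathbf{A}) = \prod_{i=1}^{N} q(\mathbf{z}_i \mid \mathbf{X}, \mathbf{A}), \qquad q(\mathbf{z}_i \mid \mathbf{X}, \mathbf{A}) = \mathcal{N}\big(\mathbf{z}_i \mid \boldsymbol{\mu}_i, \mathrm{diag}(\boldsymbol{\sigma}_i^{2})\big).$$

Here, $\boldsymbol{\mu}_i$ and $\boldsymbol{\sigma}_i$ denote the mean and standard deviation vectors for node $i$, which are obtained from a GCN layer, $\boldsymbol{\mu} = \mathrm{GCN}_{\mu}(\mathbf{X}, \mathbf{A})$ and $\log \boldsymbol{\sigma} = \mathrm{GCN}_{\sigma}(\mathbf{X}, \mathbf{A})$. The two-layer GCN is then defined as

$$\mathrm{GCN}(\mathbf{X}, \mathbf{A}) = \hat{\mathbf{A}}\, h(\hat{\mathbf{A}} \mathbf{X} \mathbf{W}_0)\, \mathbf{W}_1,$$

with $\mathbf{W}_0$ and $\mathbf{W}_1$ representing the weight matrices for the first layer and second layer, respectively. $\mathbf{W}_0$ is shared between $\mathrm{GCN}_{\mu}(\mathbf{X}, \mathbf{A})$ and $\mathrm{GCN}_{\sigma}(\mathbf{X}, \mathbf{A})$. $h(\cdot)$ is a nonlinear function such as $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$ or $\mathrm{sigmoid}(t) = 1/(1 + e^{-t})$. $\hat{\mathbf{A}} = \mathbf{D}^{-1/2} \mathbf{A} \mathbf{D}^{-1/2}$ denotes the symmetric normalized adjacency matrix. The generative model is simply the inner product between the latent variables:

$$p(\mathbf{A} \mid \mathbf{Z}) = \prod_{i=1}^{N} \prod_{j=1}^{N} p(A_{ij} \mid \mathbf{z}_i, \mathbf{z}_j), \qquad p(A_{ij} = 1 \mid \mathbf{z}_i, \mathbf{z}_j) = \mathrm{sigmoid}(\mathbf{z}_i^{\top} \mathbf{z}_j).$$

In accordance with the VAE framework, both models can be tied together and optimized by maximizing the variational lower bound $\mathcal{L}(\cdot)$:

$$\mathcal{L} = \mathbb{E}_{q(\mathbf{Z} \mid \mathbf{X}, \mathbf{A})}\big[\log p(\mathbf{A} \mid \mathbf{Z})\big] - \mathrm{KL}\big[q(\mathbf{Z} \mid \mathbf{X}, \mathbf{A}) \,\|\, p(\mathbf{Z})\big],$$

where $\mathrm{KL}[q(\cdot) \,\|\, p(\cdot)]$ defines the Kullback-Leibler (KL) divergence between $q(\cdot)$ and $p(\cdot)$. The lower bound can be maximized with respect to the variational parameters $\mathbf{W}_0$ and $\mathbf{W}_1$ via stochastic gradient descent, performed with a full-batch size. Here, the prior is defined as $p(\mathbf{Z}) = \prod_{i=1}^{N} \mathcal{N}(\mathbf{z}_i \mid \mathbf{0}, \mathbf{I})$, which is the isotropic Gaussian distribution, and gradients can be backpropagated via the reparametrization trick [45].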
In the absence of node features, X becomes the identity matrix. This relaxation allows the reconstruction of a structure-only network. When provided with node features, the accuracy of VGAE link prediction improves [15].
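Two ingredients of the VGAE description can be sketched in a few lines of plain Python: the symmetric normalization D^{-1/2} A D^{-1/2} stated in the text, and the inner-product decoder that turns two latent vectors into an edge probability. This is only an illustrative sketch, not the reference implementation of [15]; the function names are hypothetical, and treating isolated nodes as all-zero rows is an assumption made here to avoid division by zero.

```python
import math
from typing import List, Sequence


def normalize_adjacency(A: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return D^{-1/2} A D^{-1/2}; rows of isolated nodes stay zero (assumption)."""
    degrees = [sum(row) for row in A]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degrees]
    n = len(A)
    return [[inv_sqrt[i] * A[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]


def identity_features(num_nodes: int) -> List[List[float]]:
    """Feature matrix X used when no node features are available."""
    return [[1.0 if i == j else 0.0 for j in range(num_nodes)] for i in range(num_nodes)]


def sigmoid(t: float) -> float:
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def edge_probability(z_i: Sequence[float], z_j: Sequence[float]) -> float:
    """Inner-product decoder: sigmoid of the dot product of two latent vectors."""
    return sigmoid(sum(a * b for a, b in zip(z_i, z_j)))
```

Latent vectors that point in the same direction with a large norm yield a probability close to 1, while orthogonal vectors yield exactly 0.5, which is why the decoder relies on the geometry of the learned embedding.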

Variational Graph Autoencoder for Community Detection (VGAECD).
A major drawback of VGAE's approach is that it restricts nodes to be projected into a single (unimodal) Gaussian space. This restriction suggests that all generated nodes come from a single clustering space. More specifically, dissimilar nodes tend to stay away from the Gaussian mean (centroid) [15]. Instead, the mean of each Gaussian should be a good representative of its respective community, such that similar nodes stay close to the mean that represents them. Thus, nodes that are well represented by the same mean are equivalent in similarity. We can consider this as a relaxation of SBM, which requires nodes in the same block to uphold stochastic equivalence.
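Utilizing this fact, we consider the unsupervised learning problem of community detection while adhering to the VGAE framework. Suppose that each node originating from a particular community is similar in some way; we can encode their similarity into the node's representation vector $\mathbf{z}$, which is better described by the mixture's mean. The generative process then follows three steps: a community $c \sim \mathrm{Cat}(\boldsymbol{\pi})$ is sampled, a latent vector $\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}_c, \boldsymbol{\sigma}_c^{2} \mathbf{I})$ is drawn from the Gaussian of that community, and a sample $\mathbf{a} \sim \mathrm{Bern}(\boldsymbol{\mu}_x)$ is obtained from the expectation $\boldsymbol{\mu}_x = f(\mathbf{z}; \theta)$. The function $f(\mathbf{z}; \theta)$ is optionally a nonlinear function whose input is $\mathbf{z}$ and which is parameterized by $\theta$. In particular, we use the inner product decoder $\sigma(\mathbf{z}_i^{\top} \mathbf{z}_j)$. $\mathrm{Bern}(\cdot)$ denotes the multivariate Bernoulli distribution parameterized by the vector $\boldsymbol{\mu}_x$. Then, the joint probability $p(\mathbf{a}, \mathbf{z}, c)$ can be factorized as $p(\mathbf{a}, \mathbf{z}, c) = p(\mathbf{a} \mid \mathbf{z})\, p(\mathbf{z} \mid c)\, p(c)$, with $\mathbf{A} = \{\mathbf{a}_1, \dots, \mathbf{a}_N\}$. Since $\mathbf{a}$ and $c$ are conditionally independent given $\mathbf{z}$, the factorized probabilities can be defined as $p(c) = \mathrm{Cat}(c \mid \boldsymbol{\pi})$, $p(\mathbf{z} \mid c) = \mathcal{N}(\mathbf{z} \mid \boldsymbol{\mu}_c, \boldsymbol{\sigma}_c^{2} \mathbf{I})$, and $p(\mathbf{a} \mid \mathbf{z}) = \mathrm{Bern}(\mathbf{a} \mid \boldsymbol{\mu}_x)$. Substituting (6) and (11) into (10), $\mathcal{L}_{\mathrm{ELBO}}(\mathbf{x})$ can be rewritten as $\mathbb{E}_{q(\mathbf{z}, c \mid \mathbf{x}, \mathbf{a})}[\log p(\mathbf{a} \mid \mathbf{z}) + \log p(\mathbf{z} \mid c) + \log p(c) - \log q(\mathbf{z} \mid \mathbf{x}, \mathbf{a}) - \log q(c \mid \mathbf{x}, \mathbf{a})]$. The inference model $q(\mathbf{z} \mid \mathbf{x}, \mathbf{a})$ is then modeled using a two-layer GCN as $q(\mathbf{z} \mid \mathbf{x}, \mathbf{a}) = \mathcal{N}(\mathbf{z} \mid \boldsymbol{\mu}, \boldsymbol{\sigma}^{2} \mathbf{I})$, with $\boldsymbol{\mu} = \mathrm{GCN}_{\mu}(\mathbf{x}, \mathbf{a})$ and $\log \boldsymbol{\sigma} = \mathrm{GCN}_{\sigma}(\mathbf{x}, \mathbf{a})$. Similar to VGAE, the first layer's weight matrix $\mathbf{W}_0$ is shared between $\boldsymbol{\mu}$ and $\log \boldsymbol{\sigma}$. Substituting the terms, $\mathcal{L}_{\mathrm{ELBO}}(\mathbf{x})$ can be further rewritten as a Monte Carlo estimate, with $L$ being the total number of samples drawn using the Monte Carlo Stochastic Gradient Variational Bayes (SGVB) estimator [45]. $\mathbf{x}_i$ is the vector of node $i$, $K$ is the number of clusters with $\pi_c$ denoting the prior probability of cluster $c$, and $\gamma_c$ denotes $q(c \mid \mathbf{x}, \mathbf{a})$ for brevity. $\boldsymbol{\mu}_x^{(l)}$ is computed as $\boldsymbol{\mu}_x^{(l)} = f(\mathbf{z}^{(l)}; \theta)$, where $\mathbf{z}^{(l)}$ is the $l$th sample from $q(\mathbf{z} \mid \mathbf{x}, \mathbf{a})$ as written in (13). To allow gradient backpropagation through the stochastic layer, the reparameterization trick is used; $\mathbf{z}^{(l)}$ can then be obtained via $\mathbf{z}^{(l)} = \boldsymbol{\mu} + \boldsymbol{\sigma} \circ \boldsymbol{\epsilon}^{(l)}$, where, according to [45], $\boldsymbol{\epsilon}^{(l)} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and $\circ$ is the Hadamard product operator. $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ are obtained through $\mathrm{GCN}(\cdot)$.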
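As a worked sketch of what the substituted bound looks like, the following assumes that both the encoder output and every mixture component are diagonal Gaussians over $J$ latent dimensions, as in [16]; the index $j$ and the symbol $J$ are notational assumptions introduced here. Under these assumptions the Gaussian expectations can be evaluated in closed form, and only the reconstruction term needs Monte Carlo samples:

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{ELBO}}(\mathbf{x}) \approx{}&
  \frac{1}{L}\sum_{l=1}^{L} \log p\big(\mathbf{a} \mid \mathbf{z}^{(l)}\big)
  - \frac{1}{2}\sum_{c=1}^{K} \gamma_c \sum_{j=1}^{J}
    \left( \log \sigma_{c,j}^{2}
         + \frac{\sigma_{j}^{2}}{\sigma_{c,j}^{2}}
         + \frac{(\mu_{j} - \mu_{c,j})^{2}}{\sigma_{c,j}^{2}} \right) \\
 &+ \sum_{c=1}^{K} \gamma_c \log \frac{\pi_c}{\gamma_c}
  + \frac{1}{2}\sum_{j=1}^{J} \left( 1 + \log \sigma_{j}^{2} \right)
\end{aligned}
```

The second term comes from the expected log-density of each mixture component under the encoder distribution, the third from the categorical prior and the entropy of $\gamma_c$, and the last from the entropy of the encoder Gaussian; the $\log 2\pi$ constants cancel because the $\gamma_c$ sum to one.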
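If we regroup $\mathcal{L}_{\mathrm{ELBO}}(\mathbf{x})$ by like terms, (12) can be rewritten as $\mathcal{L}_{\mathrm{ELBO}}(\mathbf{x}) = \int q(\mathbf{z} \mid \mathbf{x}, \mathbf{a}) \log \frac{p(\mathbf{a} \mid \mathbf{z})\, p(\mathbf{z})}{q(\mathbf{z} \mid \mathbf{x}, \mathbf{a})}\, d\mathbf{z} - \int q(\mathbf{z} \mid \mathbf{x}, \mathbf{a})\, \mathrm{KL}\big[q(c \mid \mathbf{x}, \mathbf{a}) \,\|\, p(c \mid \mathbf{z})\big]\, d\mathbf{z}$. The first term in (17) has no dependency on $c$ and, from the definition of the KL divergence, the second term is nonnegative. Therefore, $\mathcal{L}_{\mathrm{ELBO}}(\mathbf{x})$ is maximized when $\mathrm{KL}[q(c \mid \mathbf{x}, \mathbf{a}) \,\|\, p(c \mid \mathbf{z})] = 0$. From that, we follow [16] by defining $q(c \mid \mathbf{x}, \mathbf{a})$ as $q(c \mid \mathbf{x}, \mathbf{a}) = p(c \mid \mathbf{z}) \equiv \frac{p(c)\, p(\mathbf{z} \mid c)}{\sum_{c'=1}^{K} p(c')\, p(\mathbf{z} \mid c')}$. From (18), the information loss induced by the mean-field approximation can be mitigated by forcing its dependency on the posterior $p(c \mid \mathbf{z})$ and the noninformative prior $p(c)$. The complete VGAECD algorithm can be found in Algorithm 1, and Figure 1 illustrates the conceptual idea of VGAECD.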
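The cluster posterior used in place of q(c | x, a) is the usual mixture responsibility, which is straightforward to compute for a single latent vector. The sketch below assumes diagonal Gaussian components and works in log space for numerical stability; the function names and the per-dimension variance lists are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def log_gaussian_diag(z: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """Log-density of a diagonal Gaussian evaluated at z."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(z, mean, var)
    )


def responsibilities(
    z: Sequence[float],
    pis: Sequence[float],
    means: Sequence[Sequence[float]],
    variances: Sequence[Sequence[float]],
) -> List[float]:
    """Return p(c | z) for every cluster c: prior times likelihood, normalized."""
    log_w = [
        math.log(p) + log_gaussian_diag(z, m, v)
        for p, m, v in zip(pis, means, variances)
    ]
    # Subtract the maximum before exponentiating (log-sum-exp trick).
    top = max(log_w)
    w = [math.exp(lw - top) for lw in log_w]
    total = sum(w)
    return [x / total for x in w]
```

A node whose latent vector lies next to one component mean receives nearly all of its probability mass from that cluster, while a vector halfway between two equally weighted components is split evenly; taking the arg max of the returned list gives a hard community assignment.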

Experiments
Community detection algorithms are often evaluated against two kinds of networks: synthetic and empirical datasets. These are discussed in detail in the following subsections.
5.1. Synthetic Datasets.
Two synthetic networks are used in our evaluation. We consider the two most common benchmark graphs for community detection algorithms, namely, the Girvan-Newman (GN) benchmark graph [1,34,46] and the LFR benchmark graph [35].
The GN benchmark graph is a variant of the planted partition model. In our experiment, we vary the value of $z_{\mathrm{out}}$ over the range $\{1, \dots, 8\}$. Each node has an average degree of 16, and the graph has 4 communities of 32 nodes each (128 nodes in total).
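The GN setup above can be sketched as a planted partition generator. The sketch assumes the usual reading of the benchmark, in which each node has on average z_out links to other communities and 16 - z_out links inside its own community, so that edges are placed independently with the corresponding probabilities; the function name and the `seed` argument are hypothetical.

```python
import random
from typing import List, Tuple


def gn_benchmark(
    z_out: float,
    seed: int = 0,
    n_communities: int = 4,
    community_size: int = 32,
    avg_degree: float = 16.0,
) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Sample a GN-style planted partition graph; returns (edges, ground-truth labels)."""
    rng = random.Random(seed)
    n = n_communities * community_size
    labels = [v // community_size for v in range(n)]
    # Expected intra-community degree is avg_degree - z_out, spread over 31 peers.
    p_in = (avg_degree - z_out) / (community_size - 1)
    # Expected inter-community degree is z_out, spread over the other 96 nodes.
    p_out = z_out / (n - community_size)
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if labels[u] == labels[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return edges, labels
```

Increasing `z_out` from 1 to 8 gradually blurs the communities: at `z_out = 8` a node has, on average, as many links leaving its community as staying inside it.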
The LFR benchmark graph is an extension of the GN benchmark graph. It is considered to be more realistic than the GN benchmark graph. It introduces a skewed degree distribution and accounts for network heterogeneity, resulting in communities of different sizes. The LFR benchmark graph is generated using the default parameters suggested by Lancichinetti et al. [35]. These parameters are the number of nodes ($N = 1000$), the average degree (15), and the minimum (30) and maximum (50) number of nodes per community. The generation follows the scale-free parameter settings with exponents $\tau_1 = -2$ and $\tau_2 = -1$, respectively. On average, between 20 and 30 communities are generated.

Empirical Datasets.
The empirical datasets are divided into two kinds: networks with features and without features. The datasets are as follows:
(i) Karate: a social network representing friendship among 34 members of a karate club at a US university [47].
(ii) PolBlogs: a network of political blogs assembled by Adamic and Glance [48]. The nodes are blogs and the edges represent web links between them. These blogs have known political leanings and were labelled by hand by Adamic and Glance.
(iii) Cora: a citation network with 2,708 nodes and 5,429 edges. Each node corresponds to a document and the edges are citation links [49].
(iv) PubMed: a network consisting of 19,717 scientific publications from the PubMed database pertaining to diabetes, each classified into one of three classes ("Diabetes Mellitus, Experimental", "Diabetes Mellitus Type 1", "Diabetes Mellitus Type 2"). The citation network consists of 44,338 links. Each publication in the dataset is described by a TF-IDF weighted word vector from a dictionary of 500 unique words.
For starters, experiments are performed on the datasets used by Karrer and Newman. These networks (Karate and PolBlogs) are featureless and only contain structural information. The Karate network is a commonly studied empirical benchmark network for community detection. Similar to [39], only the largest connected component of PolBlogs, in its undirected form, is considered. Next, two networks containing features are used (Cora and PubMed) [30,50]. Table 1 summarizes the list of datasets and their respective properties.

Baseline Methods.
We establish a baseline by comparing against several state-of-the-art methods. These methods are divided into two categories. The first category comprises discriminative methods and the second category comprises generative methods.

Discriminative Methods
(i) Spectral Clustering [9] is a commonly used approach for performing graph clustering. By identifying the Fiedler vector of the graph Laplacian, we can divide the network into two components. Repeating this process, the graph can be subdivided further, giving more clusters in the process.
(ii) Louvain [23] is a greedy optimization method for maximizing the modularity score.
(iii) DeepWalk [27], proposed by Perozzi et al., is a network embedding method that performs truncated random walks on a given network.
(iv) node2vec [28] is a generalization of DeepWalk. It leverages homophily and structural roles in the embedding.

Generative Methods
(i) Stochastic Blockmodel (SBM) [38,39] is a state-of-the-art generative model. It models the likelihood of two nodes forming an edge on the basis of stochastic equivalence. Degree Correction (D.C.) penalizes the formation of single-node modules by normalizing the node degrees.

Evaluation Metrics.
Some of the common approaches to evaluating detected communities are Normalized Mutual Information (NMI), Variation of Information (VI), and Modularity. In some cases, accuracy can be reliably measured (e.g., when the number of clusters $K$ is 2). Furthermore, measures such as accuracy, NMI, and VI are only possible when ground truth exists. Hence, we include other forms of measures which consider the quality of a partition without ground truth.

Ground Truth
(i) Accuracy measures the fraction of nodes whose detected community label matches the ground truth. Formally, given two sets of community labels, i.e., $C$ being the ground truth and $C'$ the detected community labels, the accuracy can be calculated by $\mathrm{ACC}(C, C') = \sum_{i} \delta(c_i, c'_i) / |C|$, with $c_i \in C$ and $c'_i \in C'$, where $\delta(\cdot)$ denotes the Kronecker delta, $\delta(c_i, c'_i) = 1$ when both labels match and 0 otherwise, and $|\cdot|$ denotes the cardinality of a set. For clustering tasks, accuracy is usually not emphasized, as labels are known to oscillate between clusters.
(ii) NMI and VI are based on information theory. Essentially, NMI measures the "similarity" between two community covers, while VI measures their "dissimilarity" in terms of uncertainty. Correspondingly, a higher NMI indicates a better match between both covers, while a higher VI indicates the opposite. Formally [51], $\mathrm{NMI}(C, C') = 2\, I(C, C') / (H(C) + H(C'))$ and $\mathrm{VI}(C, C') = H(C) + H(C') - 2\, I(C, C')$, where $H(\cdot)$ is the entropy function and $I(C, C') = H(C) + H(C') - H(C, C')$ is the mutual information function.
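The ground-truth measures above can be sketched with the standard library alone. The entropy-based functions follow the definition I(C, C') = H(C) + H(C') - H(C, C'); the normalization 2I / (H + H') is one common convention for NMI and is an assumption here. Because cluster labels are arbitrary, `best_match_accuracy` additionally searches over relabelings of the detected clusters, which is an assumption added for illustration and is only practical for a small number of clusters.

```python
import math
from collections import Counter
from itertools import permutations
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in nats) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """I(a, b) = H(a) + H(b) - H(a, b)."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))


def nmi(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Normalized mutual information, 2 I / (H(a) + H(b))."""
    denom = entropy(a) + entropy(b)
    return 1.0 if denom == 0 else 2.0 * mutual_information(a, b) / denom


def variation_of_information(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """VI = H(a) + H(b) - 2 I(a, b); zero for identical partitions."""
    return entropy(a) + entropy(b) - 2.0 * mutual_information(a, b)


def raw_accuracy(truth: Sequence[int], detected: Sequence[int]) -> float:
    """Fraction of positions where the two labels coincide (Kronecker delta)."""
    return sum(1 for t, d in zip(truth, detected) if t == d) / len(truth)


def best_match_accuracy(truth: Sequence[int], detected: Sequence[int]) -> float:
    """Accuracy under the best one-to-one relabeling of detected clusters."""
    det_ids = sorted(set(detected))
    true_ids = sorted(set(truth))
    # Pad with None so that surplus detected clusters map to no true label.
    targets = true_ids + [None] * max(0, len(det_ids) - len(true_ids))
    best = 0
    for perm in permutations(targets, len(det_ids)):
        mapping = dict(zip(det_ids, perm))
        hits = sum(1 for t, d in zip(truth, detected) if mapping[d] == t)
        best = max(best, hits)
    return best / len(truth)
```

With `truth = [0, 0, 1, 1]` and `detected = [1, 1, 0, 0]`, the raw accuracy is 0 even though the two partitions are identical up to naming, whereas NMI, VI, and the best-match accuracy all recognize the perfect agreement. This is the label oscillation issue mentioned above.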

Community Quality
(i) Modularity (Q) [17] measures the quality of a particular community structure when compared to a null (random) model. Intuitively, intracommunity links are expected to be stronger than intercommunity links. Specifically, $Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$, where $k_i$ is the degree of node $i$, $m$ is the total number of edges, $A_{ij} - k_i k_j / 2m$ measures the actual edge connectivity versus the expectation at random, and $\delta(c_i, c_j)$ defines the Kronecker delta, where $\delta(c_i, c_j) = 1$ when both nodes $i$ and $j$ belong to the same community, and 0 otherwise. Essentially, Q approaches 1 when the partitions are considered good.
(ii) Conductance (CON) [52,53] measures the separability of a community as the fraction of the outgoing local volume of links in the community. In its definition, the numerator defines the total number of edges within community $C$, and $\mathrm{Vol}(C)$ defines the volume of the set $C \subseteq V$. A better local separability of a community is achieved when the overall conductance value is smallest.
(iii) Triangle Participation Ratio (TPR) [53] measures the fraction of triads within the community $C$. In its definition, $m$ denotes the total number of edges in the graph $G$. A larger TPR value indicates a denser community structure.
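As an illustration of the first quality measure, the sketch below computes modularity directly from an adjacency matrix and a label vector. It assumes the standard Newman form, in which the observed entry A_ij is compared with the expected value k_i k_j / 2m and only same-community pairs contribute; the function name is hypothetical, and the quadratic loop is meant for small graphs only.

```python
from typing import Sequence


def modularity(A: Sequence[Sequence[int]], labels: Sequence[int]) -> float:
    """Newman modularity Q of a partition of an undirected, unweighted graph."""
    degrees = [sum(row) for row in A]
    two_m = sum(degrees)  # each undirected edge is counted twice
    if two_m == 0:
        return 0.0
    n = len(A)
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                # Observed connectivity minus the expectation under the null model.
                q += A[i][j] - degrees[i] * degrees[j] / two_m
    return q / two_m
```

For two disconnected triangles, assigning each triangle to its own community gives Q = 0.5, while placing all six nodes in one community gives Q = 0, since a single group is no better than the null model.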

Experiment Settings.
For discriminative models such as node2vec and DeepWalk, the latent representation is learned first. Next, k-means is applied to the learned latent vectors with $K$ given a priori. The parameters used for node2vec were obtained by exhaustive search over $p, q \in \{0.25, 0.5, 1, 2, 4\}$ as suggested in [28]. Specifically, the parameters obtained were ($p = 0.5$, $q = 4$), ($p = 0.25$, $q = 0.25$), ($p = 1$, $q = 0.25$), and ($p = 0.5$, $q = 1$) for Karate, PolBlogs, Cora, and PubMed, respectively. As for DeepWalk, the parameters used are an embedding dimension of 128, 10 walks per node, a walk length of 80, and a window size of 10, which were the suggested values [27]. On the other hand, generative models like SBM (and D.C.) have several optimization strategies. In this case, we applied the Expectation-Maximization (EM) algorithm as suggested in [39]. For a fair comparison between VGAE and VGAECD, we used identical layer configurations for both models. The layer configurations are , , , and (32-8) for Karate, PolBlogs, Cora, and PubMed, respectively. These configurations are determined empirically as suggested in [15]. Generally, we found the results to be insensitive to the size of the first layer and sensitive to the size of the second layer. By reducing the size of the second layer with respect to the number of nodes, we found that 8 was ideal for Cora and PubMed. The hyperparameter $K$ is given a priori for all methods. For a fair comparison, the average of 10 runs was taken for both discriminative and generative models. All experiments were conducted on an Ubuntu 16.04 LTS machine with 64 GB of RAM and two GeForce GTX 1080 Ti graphics cards.

Experiment Results.
We first compare our results against 8 baseline methods, which are state-of-the-art methods that employ unsupervised network embedding, except SBM, the only generative model that does not learn a network embedding. Since VGAE is nonclustering, the two-step approach for clustering was applied, i.e., obtaining the latent vectors and subsequently applying k-means. The * symbol denotes methods that were confined to structural information only.
5.6.1. Synthetic Dataset Performance.
Figure 2 depicts the performance of the proposed model in comparison to other methods. In Figure 2(a), VGAECD can be seen as a strong performer when $z_{\mathrm{out}} \geq 4$. On the LFR benchmark graph in Figure 2(b), the performance of VGAECD is comparable to other methods. When $\mu < 0.4$, VGAECD is capable of outperforming other methods. When $\mu > 0.55$, VGAECD exhibits similar performance to other methods.
In both cases, the performance was as expected, since the behavior with respect to the mixing parameters ($z_{\mathrm{out}}$ and $\mu$) is consistent with studies of the recoverability limit in planted partitions [42,43].

Empirical Dataset Performance.
Experiments performed on four different empirical datasets are shown in Tables 2, 3, 4, and 5 for Karate, PolBlogs, Cora, and PubMed, respectively. We measure the performance of the clusters found using the metrics proposed in Section 5.4, and the best values are marked in bold.
Generally, the experiments revealed that our method outperforms other methods when ground truth is given. In terms of cluster quality, VGAECD performs relatively well on the modularity score (Q). It also remains competitive on the Conductance (CON) and Triangle Participation Ratio (TPR) measures. Since datasets such as Cora and PubMed have more than 2 clusters ($K > 2$), the accuracy of labels can be affected by label oscillation. Therefore, accuracy is a less reliable measure of cluster labels than it is in classification. However, reliable accuracy values can still be obtained for datasets with only two clusters such as Karate and PolBlogs, which revealed that the proposed method is better than the baseline methods. In most cases, the results of our method are comparable to SBM (D.C.). This is plausible since SBM (D.C.) has an advantage due to its prior knowledge on degree normalization. Regardless, when more than two clusters are given, the modularity score of VGAECD is higher than that of SBM (D.C.), as shown on the Cora and PubMed datasets.

Time Complexity Analysis.
Since the proposed model follows the VAE framework, it employs a similar optimization method using SGVB. Therefore, it has linear time complexity for one epoch, but requires a number of epochs to achieve convergence. The convergence of NMI with respect to the number of epochs can be observed in Figure 3 in comparison to VGAE. In contrast to VGAE, the proposed method converges at a faster rate.

Synthetic Network Generation.
The implication of a generative model is its ability to generate a graph when prescribed a certain set of parameters. Therefore, a synthetic network can be generated using the proposed VGAECD model. Given the parameters $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$, we can generate a network simply by following the generative process specified in Section 4.2. However, in order to vary the community structure, we can follow the planted partition's approach by including the mixing of a random network model: a planted component defines the number of actual draws from the Gaussian Mixture Model, and a random component defines the number of draws from the random model. For instance, the planted component can be specified by the number of draws from each mixture component with its $\boldsymbol{\mu}_c$ and $\boldsymbol{\sigma}_c$. In (26), we specify the number of nodes drawn for four different communities. A generated matrix $\tilde{\mathbf{A}}$ can be obtained as shown in the decoder part of Algorithm 1. Ideally, each node is represented by $\mathbf{z}$, and the inner product between $\mathbf{z}_i$ and $\mathbf{z}_j$ determines the likelihood of edge connectivity between nodes $i$ and $j$, which is obtained after applying the nonlinearity $\sigma(\cdot)$.
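The generation procedure can be sketched end to end for a toy mixture. The sketch assumes a diagonal Gaussian per community and the sigmoid inner-product decoder, and it omits the mixing with a random network model described above; the function name, the argument layout, and the example parameter values are all hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def _sigmoid(t: float) -> float:
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def sample_network(
    pis: Sequence[float],
    means: Sequence[Sequence[float]],
    sigmas: Sequence[Sequence[float]],
    n_nodes: int,
    seed: int = 0,
) -> Tuple[List[int], List[List[float]], List[List[int]]]:
    """Draw communities, latent vectors, and a symmetric adjacency matrix."""
    rng = random.Random(seed)
    # Step 1: one community per node from the categorical prior.
    communities = rng.choices(range(len(pis)), weights=pis, k=n_nodes)
    # Step 2: one latent vector per node from that community's Gaussian.
    Z = [
        [rng.gauss(m, s) for m, s in zip(means[c], sigmas[c])]
        for c in communities
    ]
    # Step 3: one Bernoulli draw per node pair from the decoded edge probability.
    A = [[0] * n_nodes for _ in range(n_nodes)]
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            p = _sigmoid(sum(a * b for a, b in zip(Z[i], Z[j])))
            if rng.random() < p:
                A[i][j] = 1
                A[j][i] = 1
    return communities, Z, A
```

For instance, `sample_network([0.5, 0.5], [[3.0, 0.0], [-3.0, 0.0]], [[0.1, 0.1], [0.1, 0.1]], n_nodes=20)` places the two community means on opposite sides of the origin, so pairs within a community have a large positive inner product and are almost always linked, while pairs across communities are almost never linked.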

Network Visualization.
Community assignments for the Cora dataset are visualized in Figure 4. Since Cora has several disconnected nodes, only the largest connected component is visualized. Among the methods, VGAECD most closely resembles the ground truth's cluster assignment. Notably, VGAECD is able to recover a community structure in the center of the network. Additionally, it has less tendency to cluster together nodes that are far apart, which is seen in VGAE + k-means and SBM (D.C.). DeepWalk, however, appears to have a resolution problem, resulting in larger clusters merging together. This can be seen from the number of clusters depicted in the largest component, which is fewer than $K = 7$. This problem is not observed in node2vec since its sampling strategies are a generalization of DeepWalk. This generalization through $p$ and $q$ allows node2vec to explore more locations. In contrast, DeepWalk is highly restricted to visiting nodes within the pivot node's vicinity. However, to achieve the observed results, node2vec requires a costly parameter search, which is not ideal. Among the baseline methods, Spectral Clustering and Louvain appear to struggle in finding a community structure, even though they performed very well on synthetic benchmark graphs. Louvain in particular had a very competitive NMI score, but visually, the results are not very satisfactory.

Conclusion
In this paper, we propose a novel community detection algorithm termed Variational Graph Autoencoder for Community Detection (VGAECD). It generalizes VGAE for community detection tasks. The model is capable of learning both features and structural information and encodes them into a community-aware latent representation. The low-dimensional representation learned differs from previous network representation methods. Concretely, the latent representations themselves are parameters of a probabilistic graphical model, i.e., the Gaussian Mixture Model. Therefore, this allows us to draw samples from the learned model itself and generate synthetic graphs, as with SBM. Additionally, the flexibility of the proposed method shows that, by leveraging more feature information, it is capable of outperforming other methods in community structure recovery. Unlike other representation learning methods which require a two-step approach (applying k-means as the second step), VGAECD is a generative model capable of recovering communities in a single step. Moreover, in comparison to existing state-of-the-art generative models such as SBM, VGAECD is agnostic to the definition of community structure. Specifically, nodes are not forced to be similar under a specific similarity measure. This is an advantage over other community detection algorithms, where a definition of community structure is always assumed. This is a desirable feature in cases where networks can have a mixture of community structures, e.g., multilayer networks.

Data Availability
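All data used in our research are publicly available. Upon request, we can point readers to their respective sources.

Generative process of VGAECD:
(i) For communities $C = \{c_1, \dots, c_K\}$:
(a) Obtain a sample $c \sim \mathrm{Cat}(\boldsymbol{\pi})$,
(b) where $K$ is the number of clusters (a hyperparameter) and $\pi_c$ is the prior probability for cluster $c$, with $\boldsymbol{\pi} \in \mathbb{R}_{+}^{K}$ and $\sum_{c=1}^{K} \pi_c = 1$. $\mathrm{Cat}(\boldsymbol{\pi})$ is the categorical distribution parameterized by $\boldsymbol{\pi}$.
(ii) For nodes $\mathbf{Z} = \{\mathbf{z}_1, \dots, \mathbf{z}_N\}$:
(a) Obtain a latent vector $\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}_c, \boldsymbol{\sigma}_c^{2} \mathbf{I})$,
(b) where $\boldsymbol{\mu}_c$ and $\boldsymbol{\sigma}_c^{2}$ are the mean and variance of the multivariate Gaussian distribution corresponding to cluster $c$.
(iii) Obtain a sample $\mathbf{a}$ by
(a) computing the expectation $\boldsymbol{\mu}_x = f(\mathbf{z}; \theta)$,
(b) sampling $\mathbf{a} \sim \mathrm{Bern}(\boldsymbol{\mu}_x)$.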

Figure 1: Conceptual illustration of the Variational Graph Autoencoder Framework for Community Detection (VGAECD). In the encoding phase, VGAECD first convolves on the network, learning structural and nodal features in the process. These pieces of information are then mapped into a latent representation given by the means and standard deviations that parameterize the Mixture of Gaussians model. Subsequently, we can sample to obtain a latent representation $\mathbf{z}$ for each node. Finally, $\tilde{\mathbf{A}}$ can be reconstructed using a decoding function. The loss is calculated and backpropagated to the latent variables.

Figure 2: Comparative performance of VGAECD against other methods on synthetic networks.

Figure 4: Visualization of community assignment on Cora Dataset (largest connected component).

Table 2: Experimental results on Karate dataset.

Table 4: Experimental results on Cora dataset.

Table 5: Experimental results on PubMed dataset.