Enhanced Unsupervised Graph Embedding via Hierarchical Graph Convolution Network

Graph embedding aims to learn low-dimensional representations of the nodes in a network and has recently attracted increasing attention in many graph-based tasks. Graph Convolution Network (GCN) is a typical deep semi-supervised graph embedding model that can acquire node representations from a complex network. However, GCN usually needs a large amount of labeled data and additional expressive features during embedding learning, so it cannot be effectively applied to undirected graphs that carry only network structure information. In this paper, we propose a novel unsupervised graph embedding method based on a hierarchical graph convolution network (HGCN). First, HGCN builds the initial node embedding and pseudo-labels for the undirected graph; it then uses GCNs to learn the node embedding and update the labels; finally, it combines the HGCN output representation with the initial embedding to obtain the graph embedding. Furthermore, we improve the model to match different undirected networks according to the number of node label types. Comprehensive experiments demonstrate that the proposed HGCN and HGCN* significantly improve performance on the node classification task.


Introduction
Graph embedding aims to map high-dimensional sparse network data into a low-dimensional dense real-valued vector space. The learned representations adaptively extract features and can be applied to network analysis tasks such as node classification, link prediction, and visualization [1,2]. Graph embedding models fall broadly into shallow embedding models and deep embedding models. The shallow embedding model, represented by Deepwalk, learns directly by defining a loss function for the network representation and is usually an unsupervised learning method [3]. This model is suitable for undirected graphs, but it ignores the feature information in graphs. In contrast, the deep embedding model, represented by GCNs, uses the additional feature and label information of the graph at the same time and is generally a semi-supervised or supervised learning method. However, the deep embedding model is not well suited to undirected graphs with only structure information. Moreover, many real networks do not include node features and labels because of privacy protection and annotation difficulty, so a deep unsupervised graph embedding method for real undirected graphs is urgently needed.
To help address these issues, we propose a novel unsupervised graph embedding method with a hierarchical graph convolution network, which needs neither extra features nor labeled data. The general process is shown in Figure 1 and comprises three layers: the initial layer, the update layer, and the output layer. The input of the model is a graph, and the output is a low-dimensional vector representation of the nodes that is related to the number of label types; the detailed structure of the model is partly omitted.
Our contributions are as follows: (1) We propose an enhanced hierarchical graph convolution network for undirected graphs with only structure information. (2) According to the number of label types in the different experimental datasets, we introduce max-pooling to improve the proposed model. (3) We conduct node classification experiments on the Wikipedia, American air-traffic, Cora, and Citeseer datasets, and the proposed method achieves significant improvements over other state-of-the-art baselines.

Related Work
Graph embedding methods can be classified in different ways. In this section, we first introduce the traditional graph embedding methods. Then, we review graph embedding methods in the field of machine learning. Finally, we classify graph embedding models according to the structure of the embedding method and analyze the disadvantages of the existing methods. We also introduce existing GCN variants for graphs.

Traditional Graph Embedding Methods.
Traditional network representation learning methods are realized by dimension reduction techniques [4]. Classic dimension reduction methods include Principal Component Analysis (PCA) [5] and Multidimensional Scaling (MDS) [6]. Both methods can capture linear structure information, but they cannot capture nonlinear structure in the input data.
From a linear algebraic perspective, unsupervised graph embedding methods are generally expressed in terms of graph matrices, especially the Laplacian matrix and the adjacency matrix [7]. In terms of computational efficiency, however, the eigendecomposition of the data matrix is expensive.
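To make the matrix view concrete, the following is a minimal sketch of a generic spectral embedding built from the graph Laplacian. It is an illustration only, not a method proposed in this paper; the function name `laplacian_embedding` and the toy graph are assumptions, and NumPy is assumed to be available.

```python
import numpy as np

def laplacian_embedding(adj, dim):
    """Embed nodes with eigenvectors of the unnormalized Laplacian L = D - A."""
    adj = np.asarray(adj, dtype=float)
    degree = np.diag(adj.sum(axis=1))
    laplacian = degree - adj
    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
    # A dense eigendecomposition costs O(n^3) time, which is why it is expensive on large graphs.
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)
    # Skip the first eigenvector (eigenvalue 0, constant on a connected graph).
    return eigenvalues, eigenvectors[:, 1:dim + 1]

# Toy path graph 0 - 1 - 2 (hypothetical data).
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
eigenvalues, embedding = laplacian_embedding(adj, dim=1)
```

On this toy graph the one-dimensional embedding places the two end nodes on opposite sides of the middle node, which reflects the path structure.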

Graph Embedding Methods for Machine Learning.
In representation learning for machine learning, the semi-supervised graph embedding method requires a set of features that can distinguish nodes [8]. A typical model uses manual feature extraction to learn node representations for specific professional fields, which suffers from inaccuracy and poor robustness. Another kind of method, such as MMDW [9], learns node representations by solving optimization problems, which improves the accuracy of the extracted features. However, the number of parameters to estimate is large and the time complexity is high [10].
As for the unsupervised graph embedding method, to balance accuracy and computational efficiency, the feature representation needs an objective function that is independent of downstream tasks, as in Deepwalk and LINE. These unsupervised methods use only network structure information to obtain low-dimensional embeddings. Other graph embedding methods also learn from the rich content of node attributes and edge attributes. TADW [11] incorporates text features of vertices into network representation learning under the framework of matrix factorization. CENE [12] integrates text modelling and structure modelling in a general framework by treating the content information as a special kind of node. HSCA [13] is a network embedding method for attributed graphs that simultaneously models homophily, network topology, and node features. Inspired by pre-training methods, pre-training models based on GCN have been proposed. DeepGraphInfoMax [14] relies on maximizing mutual information between patch representations and corresponding high-level summaries of graphs, both derived using established graph convolutional network architectures. The pre-trained GCN model proposed by Hu et al. [15] can capture generic graph structural information that is transferable across tasks.
Compared with manual feature extraction, representation methods that optimize an objective function can obtain more comprehensive features, which are closely related to the downstream prediction task [16]. Therefore, the unsupervised graph embedding method overcomes disadvantages of the semi-supervised machine learning method, such as poor extensibility and high training complexity.

Graph Embedding Methods with Different Structures.
According to the structure of the representation methods [17], graph embedding methods are divided into shallow embedding methods and deep embedding methods. Among shallow embedding models, inspired by the successful application of skip-gram [18] in natural language processing [19], a series of skip-gram based models have been proposed in recent years to encode network structures into continuous vector representations, such as Deepwalk, LINE, and Node2vec.
Deepwalk [20] generates node sequences by random walks starting from each node, in analogy to word sequences, and these node sequences form a "corpus". Deepwalk then sets the size of the context window and feeds the "corpus" into the skip-gram model to obtain the node representations. LINE [21] optimizes the first-order similarity of direct connections and the second-order similarity of shared neighbor nodes separately and combines the two at the output. Node2vec [22] uses two additional parameters to control the direction of the random walk in the "corpus" generation step of Deepwalk. Struc2vec [23] identifies nodes that are not structurally adjacent but have the same structural roles. Shallow embedding models usually use unsupervised machine learning methods, so they have the disadvantage of ignoring node features.
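The "corpus" generation step of Deepwalk can be sketched as follows. This is a minimal illustration with uniform random walks over adjacency lists; the function names, the toy graph, and the parameter values are assumptions and are not taken from the Deepwalk implementation.

```python
import random

def random_walk(adjacency, start, walk_length, rng):
    """Return one uniform random walk with at most walk_length nodes, starting at start."""
    walk = [start]
    while len(walk) < walk_length:
        neighbors = adjacency[walk[-1]]
        if not neighbors:  # isolated node: the walk cannot continue
            break
        walk.append(rng.choice(neighbors))
    return walk

def build_corpus(adjacency, walks_per_node, walk_length, seed=0):
    """Start walks_per_node walks from every node; the walks play the role of sentences."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(walks_per_node):
        nodes = list(adjacency)
        rng.shuffle(nodes)  # visit the start nodes in a fresh random order each pass
        for node in nodes:
            corpus.append(random_walk(adjacency, node, walk_length, rng))
    return corpus

# Toy undirected graph stored as adjacency lists (hypothetical data).
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
corpus = build_corpus(graph, walks_per_node=2, walk_length=5)
```

Node2vec would replace the uniform `rng.choice` with a choice biased by its two extra parameters; that variant is not shown here.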
In the deep embedding model, the mainstream methods use deep learning models to capture the nonlinear relationships between nodes [24]. Typical deep learning models are NE-FLGC, SDNE, GCN, and a series of GCN variants. NE-FLGC [25] studies representation learning for networks with textual information and aims to learn low-dimensional node vectors by leveraging network structure and textual information. SDNE [26] automatically captures the local relationships of the nodes with an unsupervised learning method and takes the second-order neighbors of the nodes as input to learn low-dimensional representations of graphs, but the model still does not consider node features.
GCN [27] is a deep semi-supervised graph embedding model that incorporates extra features and labeled data into the embedding learning process, so it cannot be easily applied to undirected graphs with only structure information. In terms of applicability, existing GCN variants fall into two groups. The first group assumes that the graph is attributed, such as GraphSAGE [28], GAT [29], N-GCN [30], and Fast-GCN [31]. GraphSAGE [28] can generate node embeddings for previously unseen nodes or entirely new input graphs, as long as these graphs have the same attribute schema as the training data. GAT [29] uses an attention mechanism to address the shortcomings of prior methods based on graph convolutions or their approximations. N-GCN [30] improves the scalability of GCN on the whole graph by setting the size of the convolution kernel. Fast-GCN [31] is a batch training algorithm that combines importance sampling; it makes GCN training more efficient and also generalizes well for inference. Generally, these models take the structure information or the node degree information as features and then use the correct labels to train GCN to learn the graph embedding. However, they still need to incorporate correctly labeled data into the graph embedding process. The second group is proposed to solve the label problem, such as M3S and GMNN. M3S [32] uses the correct extra features and enlarges the labeled data by self-training to learn the undirected graph embedding. GMNN [33] similarly uses the correct extra features and neighbor nodes to generate pseudo-labels for unsupervised learning. As a consequence, previous methods still cannot handle the absence of both node features and node labels at the same time.

Model
In this section, we describe the proposed method and give a detailed framework for the model. First, we generate the initial node embedding using Deepwalk. Next, the pseudo-labels are built from this embedding by the AP clustering algorithm [34]. Then, we build the hierarchical GCN to update the node embedding and labels. Finally, we introduce max-pooling to enhance the model according to the number of label types in the different experimental datasets.

Initial Node Embedding.
The first step of the model is to generate the initial node embedding using Deepwalk. Deepwalk is a shallow embedding model based on random walks, and it learns the global structure and the context of each node in the undirected graph well. Therefore, Deepwalk is used as a pre-training method to generate the initial node embedding in this paper.
Let $G = (V, E)$ be a given undirected graph with vertex set $V$ and edge set $E$, where $n = |V|$ denotes the number of nodes in the graph. Our formal definition is general and applies to any undirected and unweighted network. First, we acquire a random walk sequence $w(v_i)$ of a specific length for the node $v_i$, as given in formula (1), where $w(v_i)^n$ is the $n$-th node of the walk. We then use a language model to learn from these random walk sequences and denote the initial embedding as in formula (2), where $w_{u_i}$ is the word sequence in the language model, skip-gram is the language model, and $E_M(\cdot)$ denotes the output function. We use $w_{u_i} = (w_1, w_2, \ldots, w_n)$ to denote the word sequence, where $w_i$ is a specific word in the language model. The detailed processing flow is shown in Figure 2.
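To illustrate how a random walk sequence is handed to the skip-gram language model, the following minimal sketch lists the (center, context) training pairs that a fixed window produces. The function name `skipgram_pairs`, the example walk, and the window sizes are assumptions for illustration; the training of the skip-gram model itself is omitted.

```python
def skipgram_pairs(walk, window):
    """Return (center, context) pairs for all nodes within `window` steps of each other."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a node is not its own context
                pairs.append((center, walk[j]))
    return pairs

# One random walk, treated as a "sentence" of node identifiers (hypothetical data).
walk = [0, 2, 3, 2, 1]
pairs = skipgram_pairs(walk, window=1)
```

A larger window yields more pairs per walk and therefore lets the embedding reflect more distant context.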

Pseudolabels.
The second step of the model is to construct the pseudo-labels and predict labels with GCN. After obtaining the initial node embedding of the undirected graph, we use a clustering method to acquire a pseudo-label $y_i$ for each unlabeled node $i$, where $y_i \in Y$ and $Y$ is the pseudo-label set. The specific processing flow is shown in Figure 3.
The graph $G$ is now updated as given in formula (3). In this paper, the AP (affinity propagation) clustering algorithm, also called near-neighbor propagation clustering, is adopted to generate pseudo-labels, mainly because of the following three advantages.
(1) Because the number of label categories is unknown, the model needs a clustering method that does not require the number of clusters to be specified. Unlike the k-means and k-center algorithms, AP clustering does not need the final number of clusters to be set in advance.
(2) The cluster centers of the AP algorithm are actual nodes in the dataset, each representative of its class, so AP clustering is not sensitive to the initial values of the node embedding.
(3) If the sum of squared errors is used to measure the performance of the algorithms, the sum of squared errors of AP clustering is lower than that of other methods. This evaluation index shows that the AP clustering method is effective.
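The sum of squared errors mentioned in advantage (3) can be computed as in the following minimal sketch. The source does not state whether the error is measured to the cluster exemplar or to the cluster mean; this sketch assumes the cluster mean. The function name and the toy points are hypothetical.

```python
def sum_squared_error(points, labels):
    """Sum of squared distances from every point to the mean of its cluster."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        for p in members:
            total += sum((p[d] - centroid[d]) ** 2 for d in range(dim))
    return total

# Four 2-D embeddings forming two well-separated groups (hypothetical data).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = sum_squared_error(points, [0, 0, 1, 1])  # groups nearby points together
bad = sum_squared_error(points, [0, 1, 0, 1])   # groups distant points together
```

A lower value means that the points lie closer to their cluster centers, so the first assignment is the better clustering under this index.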
Based on this, we further use GCN to predict the label of each unlabeled node and take the class with the maximum predicted value as the label $y_i'$, where $y_i' \in Y'$ and $Y'$ is the predicted label set. Formula (3) is then updated to formula (4).
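The label update described above, taking the class with the maximum predicted value, can be sketched as follows. The class scores are assumed to be the raw outputs of the GCN for each node; the softmax normalization, the function names, and the example scores are assumptions for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over the class scores of one node."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_labels(class_scores):
    """Return, for every node, the index of the class with the highest probability."""
    labels = []
    for scores in class_scores:
        probs = softmax(scores)
        labels.append(max(range(len(probs)), key=probs.__getitem__))
    return labels

# Pseudo-labels from clustering and hypothetical GCN scores for three nodes.
pseudo_labels = [0, 1, 1]
class_scores = [[2.0, 0.1], [1.5, 0.2], [0.3, 0.9]]
predicted = predict_labels(class_scores)
changed = sum(old != new for old, new in zip(pseudo_labels, predicted))
```

In this example the second node moves from pseudo-label 1 to predicted label 0, so one label is updated.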

Hierarchical GCN.
In this section, we focus on the hierarchical GCN model, which includes two GCN [27] models. We first introduce GCN. Next, we set up the hierarchical graph convolution network and propose the HGCN model and the HGCN* model by analyzing the updated undirected graphs. We also explain the reason for the improved model.

Graph Convolution Network.
Different from conventional convolutional neural networks (CNNs), the convolution operation on a graph is defined as the weighted average of the neighbors of a particular node. Mathematically, the graph convolution layer is defined as
$$H^{(l+1)} = \sigma\left(P H^{(l)} W^{(l)}\right), \tag{5}$$
where $P = D^{-1/2}(A + I)D^{-1/2}$ is the symmetrically normalized adjacency matrix of the graph $G$ with self-loops added, and $D$ is the diagonal degree matrix of $A + I$. $H^{(l)}$ is the input of the $l$-th hidden layer of GCN, i.e., the output of the $(l-1)$-th hidden layer. $W^{(l)}$ is the trainable weight matrix of the $l$-th hidden layer, $A$ is the adjacency matrix of the undirected graph, and $\sigma(\cdot)$ is the activation function. For any given node, a GCN layer aggregates the previous layer's embeddings of its neighbors through $A$, followed by the linear transformation $W$ and the nonlinear activation $\sigma$, to obtain a contextualized node representation. We denote an $L$-layer GCN by $F_w(\cdot)$, parametrized by $\{w_i\}_{i=1}^{L}$. For each given graph $G = (V, A)$ with input features $H^{(0)}$, $F_w(\cdot)$ yields the node representations as given in formula (6).
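The graph convolution layer described above can be sketched in a few lines. This is a minimal forward pass only, assuming ReLU as the activation $\sigma$ (the source does not fix the activation) and NumPy as the array library; the function names and the toy graph are hypothetical, and training is omitted.

```python
import numpy as np

def normalize_adjacency(adj):
    """Return P = D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    a_hat = np.asarray(adj, dtype=float) + np.eye(len(adj))
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(p, h, w):
    """One graph convolution: ReLU(P H W)."""
    return np.maximum(p @ h @ w, 0.0)

def gcn_forward(adj, h0, weights):
    """Apply an L-layer GCN, using one weight matrix per layer."""
    p = normalize_adjacency(adj)
    h = np.asarray(h0, dtype=float)
    for w in weights:
        h = gcn_layer(p, h, w)
    return h

# Two nodes joined by one edge, one-hot input features, identity weights (toy values).
adj = [[0, 1], [1, 0]]
h0 = np.eye(2)
out = gcn_forward(adj, h0, [np.eye(2)])
```

With these toy values every entry of `out` equals 0.5, because each node averages its own feature vector with that of its single neighbor.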

HGCN and HGCN*.
After the initial node embedding step and the pseudo-label step, the graph information is given by formula (4). After predicting the labels, we also update the graph embedding with GCN, as shown in formula (7), where $E_M'(w_{v_i})$ is the node embedding updated by the first GCN.
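The overall flow of this subsection (a first GCN pass over the initial embedding, a second GCN pass, and a final splicing with the initial embedding) can be sketched as follows. Several points are assumptions rather than statements of the source: the second GCN is assumed to take the output of the first GCN as input, the pooling of the HGCN* variant is assumed to be a non-overlapping 1-D max-pooling over the spliced features, ReLU is assumed as the activation, and the weights are untrained toy values. All function names are hypothetical.

```python
import numpy as np

def gcn(adj, h, weights):
    """Forward pass of a GCN with ReLU activations (training omitted)."""
    a_hat = np.asarray(adj, dtype=float) + np.eye(len(adj))
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    p = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    for w in weights:
        h = np.maximum(p @ h @ w, 0.0)
    return h

def hgcn_embedding(adj, initial, first_weights, second_weights, pool_window=None):
    """Two GCN passes, then splicing with the initial embedding; optional max-pooling."""
    initial = np.asarray(initial, dtype=float)
    updated = gcn(adj, initial, first_weights)            # output of the first GCN
    final = gcn(adj, updated, second_weights)             # output of the second GCN (assumed input: updated)
    spliced = np.concatenate([final, initial], axis=1)    # vector splicing
    if pool_window is None:
        return spliced
    # Assumed HGCN* variant: non-overlapping max-pooling along the feature axis.
    n, d = spliced.shape
    usable = d - d % pool_window  # drop trailing features that do not fill a window
    return spliced[:, :usable].reshape(n, usable // pool_window, pool_window).max(axis=2)

# Toy graph with two connected nodes, one-hot initial embedding, identity weights.
adj = [[0, 1], [1, 0]]
initial = np.eye(2)
plain = hgcn_embedding(adj, initial, [np.eye(2)], [np.eye(2)])
pooled = hgcn_embedding(adj, initial, [np.eye(2)], [np.eye(2)], pool_window=2)
```

In the toy example the two GCN passes make both node vectors identical, and only the spliced initial embedding keeps the nodes distinguishable, which is the motivation the text gives for the splicing step.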
According to formula (6), $E_M'(w_{v_i})$ is defined as in formula (8), where $H^{(0)} = E_M(w(v_i))$ is the input feature, i.e., the initial node embedding produced by Deepwalk. In this paper, we use 3-layer GCNs with 128-dimensional hidden vectors to accomplish the pre-training and adaptation. We then use the second GCN to learn the undirected graph again. The final embedding $E_M''(w_{v_i})$ is given by formula (9). After this second learning process, however, GCN makes the embedding too smooth, which can harm node classification. Therefore, to relieve over-smoothing, we combine the second-stage representation $E_M''(w_{v_i})$ with the initial node embedding $E_M(w_{v_i})$. Finally, we obtain the ultimate embedding $O(G)$ of the undirected graph as
$$O(G) = E_M''(w_{v_i}) \oplus E_M(w_{v_i}), \tag{10}$$
where $\oplus$ is the vector splicing (concatenation) operation.
The HGCN method above learns a better embedding when the undirected graph has only a few node label types. However, when the undirected graph has a large number of node label types, the model also decomposes the labels into multidimensional features and adds them to the learning process, which can lead to feature redundancy. Motivated by this analysis, we introduce max-pooling to improve the embedding model for graphs with a large number of label types, namely HGCN*. The exact processing flow is shown in Figure 4.

Experiments
We use the following four datasets:
(i) The Cora dataset [35] consists of 2,708 machine learning papers, each classified into one of seven classes, and 5,429 links between them.
(ii) The Citeseer dataset [35] is a link dataset built with permission from the Citeseer web database. It contains 3,327 publications from six classes and 4,732 links among them.
(iii) The Wikipedia dataset [36] consists of 2,405 Wikipedia pages from 17 categories and 17,981 links between them. It is much denser than the Citeseer and Cora datasets.
(iv) The American air-traffic dataset [23] is from the United States Air Traffic Network. It contains 1,190 airports and 13,599 relationships between airports, and the airport labels are divided into four groups.
The numbers of nodes and edges of the experimental datasets are shown in Table 1.
Note that all datasets are used as undirected graphs with only network structure information, and the correct labels in the datasets are used only in the node classification verification task in this paper.

Baselines. For comparison, we adopt six kinds of baselines as follows:
(i) Deepwalk [20] is a typical unsupervised graph embedding method that adopts the skip-gram language model.
(ii) LINE [21] is also a popular unsupervised method, which considers first-order and second-order proximity information.
(iii) Node2vec [22] learns low-dimensional node representations by optimizing a neighborhood-preserving objective; it explores the structure of the network by controlling the parameters p and q and learns the d-dimensional feature representation by simulating biased random walks.
(iv) GraRep [37] captures relationships between long-distance nodes and uses matrix decomposition to learn the d-dimensional feature representation of the nodes.
(v) HOPE [38] introduces higher-order similarity and uses matrix decomposition to learn the d-dimensional feature representation of the nodes.
(vi) SDNE [26] learns the d-dimensional node representation by capturing the nonlinear structure of the network with multilayer nonlinear functions; it is a semi-supervised deep graph embedding model.
These baselines cover shallow embedding models, matrix decomposition models, and deep embedding models. Note that since some existing work ([27][28][29]) utilizes extra node features, we do not compare with it directly.

Node Classification Experiments.
To verify the effectiveness of the model, we follow the same settings used in Hamilton et al. (2016) [27] and consider the following benchmark tasks: (1) using HGCN to classify nodes on the American air-traffic, Cora, and Citeseer datasets; (2) using HGCN* to classify nodes on the Wikipedia dataset.

Experimental Settings.
In this experiment, Micro-F1, Macro-F1, and Accuracy are used as measurements. The model uses the same proportion of training and test data as the existing published papers, i.e., 50% of the data for training and 50% for testing.
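The three measurements can be computed as in the following minimal sketch for single-label multi-class predictions. The function name and the example labels are hypothetical; in this single-label setting Micro-F1 coincides with Accuracy, whereas Macro-F1 weights every class equally.

```python
def classification_scores(y_true, y_pred):
    """Return (accuracy, micro_f1, macro_f1) for single-label multi-class predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in classes}
    fp = {c: 0 for c in classes}
    fn = {c: 0 for c in classes}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_c, fp_c, fn_c):
        denom = 2 * tp_c + fp_c + fn_c
        return 2 * tp_c / denom if denom else 0.0

    accuracy = sum(tp.values()) / len(y_true)
    # Micro-F1 pools the counts of all classes before computing F1.
    micro_f1 = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro-F1 averages the per-class F1 scores without weighting by class size.
    macro_f1 = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    return accuracy, micro_f1, macro_f1

# Hypothetical labels for six test nodes and three classes.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]
accuracy, micro_f1, macro_f1 = classification_scores(y_true, y_pred)
```

Here the rare class 2 is never predicted correctly, which lowers Macro-F1 well below Micro-F1; this is why both are reported on datasets with many label types.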
To facilitate the comparison, on the Cora and Citeseer datasets we report the results of a single run for comparison with the baseline models. On the American air-traffic dataset, however, we repeat the experiment five times with random training samples and report the average performance. The experimental results are given in Table 2, where "*" denotes results published in existing papers [29] and bold numbers represent the best results.
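The repeated evaluation protocol (a random 50%/50% split, repeated five times and averaged) can be sketched as follows. The callback `evaluate` stands for training a classifier on the training nodes and scoring it on the test nodes; it is a hypothetical placeholder, as are the function names and the seeds.

```python
import random
from statistics import mean

def split_half(nodes, seed):
    """Shuffle the nodes with a fixed seed and split them 50%/50% into train and test."""
    shuffled = list(nodes)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) // 2
    return shuffled[:cut], shuffled[cut:]

def average_over_runs(nodes, evaluate, runs=5):
    """Average the score returned by `evaluate(train, test)` over several random splits."""
    scores = []
    for seed in range(runs):  # one seed per run keeps the experiment reproducible
        train, test = split_half(nodes, seed)
        scores.append(evaluate(train, test))
    return mean(scores)

# Ten node identifiers and a dummy evaluation that only reports the test fraction.
nodes = list(range(10))
train, test = split_half(nodes, seed=0)
score = average_over_runs(nodes, lambda tr, te: len(te) / len(nodes))
```

Fixing one seed per run makes every split reproducible while still varying the training sample between runs.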

Mathematical Problems in Engineering
On the Wikipedia dataset, we use the same measurements to compare with the results of the open-source framework OpenNE [39]. The best experimental results are presented in Table 3.

Results and Discussion. As shown in Table 2, our proposed model achieves the best results compared with all shallow baseline models, and its outcomes are very close to those of the deep baseline model SDNE. On the Cora and Citeseer datasets, the experimental results of our model are higher than those of all baseline models in all measurements, which shows that the model can effectively learn the undirected graphs.
On the American air-traffic dataset, the performance of HGCN is close to that of SDNE. The main reason is that the nodes in the American air-traffic dataset are labeled according to the global structure of the network. In other words, the node labels of this dataset are more related to the structural identity of the nodes than to the labels of their neighbors. HGCN uses the label information of neighbor nodes to update and learn the labels of the unlabeled data, so its node classification performance is significantly affected. To display the performance of HGCN visually, the accuracy is presented in Figure 5.
As illustrated in Figure 5, we compare the accuracy of all baselines, including the shallow and the deep embedding models, on the three undirected graph datasets. The average performance of the shallow embedding models is higher than that of the SDNE model on the Cora and Citeseer datasets, whereas SDNE is better than the shallow baselines on the American air-traffic dataset. This suggests that SDNE learns well on networks with a special structure, in which node labels are determined by the global structure of the network. The proposed HGCN model shows a significant increase on most experimental datasets, which further verifies the applicability of our model to undirected graphs.
As for the model improved according to the number of label types, we verify its effectiveness on the Wikipedia dataset and compare it with the results of the open-source framework OpenNE [39]. The experimental results are shown in Table 3.
According to the results in Table 3, when the HGCN model uses only vector splicing, the performance degrades in Micro-F1. In Macro-F1, the results are better than those of all other baseline models except Deepwalk. To help address this problem, the max-pooling layer is introduced into the HGCN model, which can select the features to improve node classification on the Wikipedia dataset. As can be seen from the experimental results in Table 3, compared with Deepwalk, LINE (2nd), and Node2vec, the Micro-F1 of HGCN* is improved by 4.8%, 14.1%, and 6.6%, and the Macro-F1 is improved by 3%, 20.3%, and 4.9%, respectively. Compared with GraRep, HOPE, and SDNE, the Micro-F1 of HGCN* is improved by 8.4%, 11.6%, and 7.4%, and the Macro-F1 is improved by 11.4%, 15.2%, and 9.2%, respectively.
To further illustrate these trends, we plot the Micro-F1 and Macro-F1 of the different methods in Figure 6.
As can be seen from the results in Figure 6, compared with shallow embedding models, matrix decomposition models, and deep embedding models, the Micro-F1 and Macro-F1 of HGCN* are all improved on the Wikipedia dataset. Figure 6 also shows that the Micro-F1 and Macro-F1 of HGCN are significantly improved by using max-pooling. The experimental results further indicate that the improved model obtains a better graph embedding for datasets with a large number of label types.

Conclusion
In this study, we explore an unsupervised graph embedding method for undirected graphs. In contrast to conventional graph embedding models, we introduce a hierarchical graph convolution network and propose the HGCN and HGCN* methods. We come to the following conclusions:
(1) A hierarchical GCN for undirected graphs with only structure information is proposed. The model does not need extra features or labeled data, which improves the applicability of GCN to real networks.
(2) We introduce max-pooling to improve the proposed model according to the number of label types. Experimental results show that our model achieves considerable improvement over all the baselines.
However, due to the limitations of the pre-training and clustering methods, the proposed model still depends on the quality of the initial node embedding and suffers from limited pseudo-label accuracy. In the future, we will focus on these problems and explore the impact of the initialization models and clustering methods on the proposed model, with the goal of providing a better method of unsupervised graph embedding.

Data Availability
All experimental data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.