Hybrid Low-Order and Higher-Order Graph Convolutional Networks

With the higher-order neighborhood information of a graph network, the accuracy of graph representation learning classification can be significantly improved. However, the current higher-order graph convolutional networks have a large number of parameters and high computational complexity. Therefore, we propose a hybrid lower-order and higher-order graph convolutional network (HLHG) learning model, which uses a weight sharing mechanism to reduce the number of network parameters. To reduce the computational complexity, we propose a novel information fusion pooling layer to combine the high-order and low-order neighborhood matrix information. We theoretically compare the computational complexity and the number of parameters of the proposed model with those of the other state-of-the-art models. Experimentally, we verify the proposed model on large-scale text network datasets using supervised learning and on citation network datasets using semisupervised learning. The experimental results show that the proposed model achieves higher classification accuracy with a small set of trainable weight parameters.


Introduction
Convolutional neural networks (CNNs) have achieved great success on grid-structured data such as images and videos [1,2]. This success is attributed to the series of filters in the convolutional layers, which extract locally invariant features. In contrast to a regular grid, the number of neighbors of a node in a graph network may vary from node to node. Therefore, it is difficult to directly apply such filters to an irregular network structure [3].
In the graph network, the nodes and the connecting edges between them contain abundant network characteristic information. A graph convolutional network (GCN) aggregates the neighborhood nodes to realize continuous information transmission based on a graph network. By making full use of this information, a GCN can effectively achieve tasks such as classification, prediction, and recommendation.
In the spatial domain, to simulate the convolution operation of the traditional CNN on an image, the convolution operation aggregates the information of the neighborhood nodes [7][8][9][10]. Henaff et al. [11] proposed a smoothed parametric spectral filter to realize localization and to preserve the parameters of filters independent of the input dimension. One of the key challenges is that the number of neighborhood nodes in the network irregularly changes.
In the frequency domain, Bruna et al. [5] were the first to extend CNN-type architectures to graphs. Cao et al. [12] applied a generalized convolutional network to the graph frequency domain using the Fourier transform. In this method, eigenvalue decomposition is performed on the neighborhood matrix. To reduce the computational complexity, Defferrard et al. [13] proposed using Chebyshev polynomials of the eigenvalues of the graph Laplacian to obtain efficient and localized graph convolutional filters. Kipf and Welling [6] proposed the classical GCN, which is approximated by a first-order Chebyshev polynomial. This approach reduces the computational complexity but introduces truncation errors. As a result, the model cannot capture high-level interaction information between the nodes in the graph, which limits its capabilities. The information propagation process in the graph is related not only to the first-order neighborhood of a node but also to its higher-order neighborhood.
Abu-El-Haija et al. [14,15] proposed the high-order convolutional network layer on a graph that used linear combination of the high-order neighborhood basis of the GCN [6]. Tiao et al. [16] proposed a Bayesian estimation approach via the stochastic variational inference in the adjacency matrix of the graph. Levie et al. [17] proposed Cayley polynomials to compute the localized regular filters of the interest frequency bands of graphs. Therefore, the rational use of second-order neighborhoods, third-order neighborhoods, and other high-order neighborhood information will be beneficial to classification prediction accuracy [14][15][16][18][19][20].
Based on the classical GCN [6], to make full use of the high-order and low-order neighborhood information, we propose a novel hybrid low-order and higher-order graph convolutional network (HLHG). As shown in Figure 1, the graph convolutional layer of our model is simple and effective at capturing the high-order neighborhood information, nonlinearly combining the neighborhood information of different orders. The contributions are summarized as follows:
(1) We propose a new fusion pooling layer to fuse the high-order neighborhood with the low-order neighborhood of graph networks.
(2) We propose a weight sharing mechanism between the low-order and high-order neighborhoods to reduce the computational complexity and the number of parameters of the model.
(3) The experimental results show that our HLHG achieves state-of-the-art performance in both text network classification with supervised learning and citation network classification with semisupervised learning.
The rest of the paper is organized as follows. In Section 2, the related theoretical basis, including graph convolution and high-order graph convolution, is introduced. In Section 3, the general information fusion pooling for the high-order neighborhood is presented, followed by the proposed model and its variant. The computational complexity and the number of parameters of the proposed model are also analyzed theoretically. In Section 4, the proposed model is verified experimentally and the corresponding analysis is presented. Finally, Section 5 concludes the paper.

Related Theoretical Background
In this section, the related theoretical basis will be introduced, including the graph convolutional network (GCN).
2.1. Graph. Given a graph $G$ with node set $V$ and edge set $E$, the graph is represented as $G = (V, E)$. If nodes $V_i$ and $V_j$ are connected, then $E_{ij} = 1$; otherwise, $E_{ij} = 0$. The information in the graph propagates along the edges $E$. This also applies when node self-loops are considered, which means that $E_{ii} = 1$. Assuming that the information propagated by each node in the graph network is $x \in \mathbb{R}^{r}$, the information matrix of the graph is $X \in \mathbb{R}^{n \times r}$, where $n$ is the total number of nodes in the graph network and $r$ is the dimension of the information feature. If the graph network with self-loops is denoted $\tilde{G}$, then its adjacency matrix is $\tilde{A} = A + I$.
The degree matrix of $\tilde{A}$ is the diagonal matrix $\tilde{D}$ with $\tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}$.
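As a small illustration of these definitions, the following sketch builds the adjacency matrix with self-loops (the original adjacency matrix plus the identity) and its diagonal degree matrix for a toy three-node path graph. The function names and the toy graph are assumptions made for illustration; they do not come from the authors' code.

```python
import numpy as np

def add_self_loops(A):
    """Return the adjacency matrix with self-loops, A + I."""
    return A + np.eye(A.shape[0])

def degree_matrix(A_loop):
    """Return the diagonal degree matrix whose i-th entry is the i-th row sum."""
    return np.diag(A_loop.sum(axis=1))

# Toy undirected path graph 0 - 1 - 2 (hypothetical example data).
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])

A_loop = add_self_loops(A)      # every diagonal entry becomes 1
D_loop = degree_matrix(A_loop)  # degrees now count the self-loop as well
```

In this toy graph the end nodes have degree 2 and the middle node has degree 3 once the self-loops are counted.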

Graph Convolutional Network.
In the given graph $G$, there are two signals $f = (f_1, \ldots, f_n)^{T}$ and $g = (g_1, \ldots, g_n)^{T}$. Their graph Fourier transforms are defined as $\hat{f} = \Phi^{T} f$ and $\hat{g} = \Phi^{T} g$, where $\Phi$ is the matrix of orthonormal eigenvectors of the graph Laplacian of $G$. As in Euclidean space, the spectral graph convolution of $f$ and $g$ is given by an elementwise product in the Fourier domain:

$$g * f = \Phi(\hat{g} \odot \hat{f}) = \Phi \hat{G} \Phi^{T} f, \quad (1)$$

where $\hat{G} = \mathrm{diag}(\hat{g}_1, \ldots, \hat{g}_n)$ is the diagonal matrix of $\hat{g}$. Defferrard et al. [13] used $k$-th order polynomial filters based on Chebyshev polynomials to represent the graph convolution, $\hat{G} = \sum_{i} \alpha_i \Lambda^{i}$, where $\alpha_i$ denotes the coefficients and $\Lambda$ is the diagonal matrix of the eigenvalues of the Laplacian.
Kipf and Welling [6] proposed the classical graph convolutional neural network model based on the Fourier transform, $g * f = \alpha A f$. The GCN approximates the spectral filter using a first-order Chebyshev polynomial. The propagation model in the graph network is as follows:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right), \quad (2)$$

where $H^{(l)}$ denotes the information propagation matrix; $W^{(l)}$ represents the trainable weight of layer $l$; when $l = 0$, $H^{(0)} = X \in \mathbb{R}^{n \times r}$ is the initial input of the GCN; and $\sigma(\cdot)$ denotes the activation function. To reduce the computational complexity, the convolution operator in the graph is defined by a simple neighborhood average. However, such convolutional filters are too simple to capture the high-level interaction information between the nodes in the graph. Therefore, the classification accuracy on citation network datasets is relatively low.

Computational Intelligence and Neuroscience

Abu-El-Haija et al. [14,15] proposed a high-order graph convolutional layer model based on the GCN for semisupervised node classification. The propagation model of the high-order graph convolution is shown in formula (3). In this model, the transfer function of the $(l+1)$-th layer is a column concatenation of the first-order to the $p$-th order terms of the $l$-th layer, which is a linear combination of the high-order neighborhoods. In the propagation model, the different order neighborhoods of the same layer use different weight parameters:

$$H^{(l+1)} = \Big\Vert_{j=1}^{p} \sigma\left(B^{j} H^{(l)} W_j^{(l)}\right), \quad (3)$$

where $B = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ and $\Vert$ denotes column concatenation. However, as the network becomes deeper, the dimension of $H^{(l+1)}$ increases and propagates between layers. Therefore, the number of trainable weight parameters grows, and more training resources are required to learn the correspondingly larger weight matrices.
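To make the difference between the two propagation rules concrete, the sketch below implements one classical GCN layer and one high-order layer with column concatenation in NumPy. It is a minimal illustration under stated assumptions: the helper names (`normalize_adjacency`, `gcn_layer`, `high_order_layer`), the ReLU activation, and the random toy data are all hypothetical and are not taken from [6] or [14,15].

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetrically normalise A + I by the inverse square root of its degrees."""
    A_loop = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_loop.sum(axis=1))
    return A_loop * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def relu(Z):
    return np.maximum(Z, 0.0)

def gcn_layer(B, H, W):
    """First-order propagation: sigma(B H W)."""
    return relu(B @ H @ W)

def high_order_layer(B, H, weights):
    """Column-concatenate sigma(B^j H W_j) for j = 1..p, one weight matrix per order."""
    outputs = []
    BH = H
    for W_j in weights:
        BH = B @ BH                      # B^j H, built from B^(j-1) H
        outputs.append(relu(BH @ W_j))
    return np.concatenate(outputs, axis=1)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)        # toy path graph (assumed data)
B = normalize_adjacency(A)
X = rng.normal(size=(4, 5))                      # n = 4 nodes, r = 5 features
W = rng.normal(size=(5, 3))                      # one weight matrix for the GCN layer
weights = [rng.normal(size=(5, 3)) for _ in range(2)]  # p = 2 separate weight matrices

H_gcn = gcn_layer(B, X, W)                # shape (4, 3)
H_mix = high_order_layer(B, X, weights)   # shape (4, 6): width grows with p
```

With two orders, the concatenated output is twice as wide as the GCN output and the layer holds twice as many weights, which is the growth in dimension and parameters that the paragraph above describes.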

Method
When a message passes through the graph network, each node receives latent representations from its first-hop neighbors and from its N-hop neighbors at every step. In this section, we propose a model with trainable parameters that nonlinearly aggregates these messages and can thus choose how to mix latent messages from nodes at various hops.

General Information Fusion Pooling.
Information in the graph network is passed along the edges between the vertices. Assume that the graph network $G = (V, E)$ is an undirected graph. The general procedure of fusion pooling is as follows. Let the $k$-th order neighborhood matrix be $A^{(k)} = (a^{(k)}_{ij})$. The result of the fusion pooling operator is the matrix whose $(i, j)$ entry is obtained by pooling the entries $(a^{(1)}_{ij}, a^{(2)}_{ij}, \ldots, a^{(k)}_{ij})$, where $k$ represents the hop distance from the given node. As an example of how to fuse neighborhoods of different orders, for a given adjacency matrix $A$, let $h_1$ denote the first-order neighborhood and $h_2$ denote the second-order neighborhood.
In the information dissemination and fusion process, both the first-order neighborhood features and the high-order neighborhood features are fully considered. Therefore, the classification accuracy should be improved.

Our Proposed Model.
In Figure 2, we propose the high-order graph convolutional network model to fuse the high-order messages that pass through the graph network. The model consists of an input layer, two graph convolutional layers, and an information fusion pooling layer that is connected to each graph convolutional layer. The softmax function is used for the multiclass classification output.
The proposed model extends the classical GCN model [6] to a graph neural network over higher-order neighborhoods. Each node in the model obtains its representation from its neighborhood and integrates the received messages. The system model is as follows:

$$H^{(l+1)} = \sigma\left(P_m\left(A H^{(l)} W_{l+1}, A^{2} H^{(l)} W_{l+1}, \ldots, A^{p} H^{(l)} W_{l+1}\right)\right), \quad (4)$$

where $p$ is the order of the neighborhoods and $\sigma(\cdot)$ is the activation function; in the output layer, $\sigma(\cdot)$ is replaced by $F(\cdot)$, which denotes the softmax function. Parameter $W_{l+1}$ is the trainable weight parameter of layer $(l+1)$ in the graph network, and function $P_m(\cdot)$ represents $P_{\max}(\cdot)$, which denotes the hybrid high-order and low-order information fusion. When $l$ is equal to 0, $H^{(1)} = \sigma\left(P_m\left(A X W_1, A^{2} X W_1, \ldots, A^{p} X W_1\right)\right)$, which is the output of the first convolutional layer of the graph propagation model. In addition, $H^{(0)} = X \in \mathbb{R}^{n \times r}$ is the initial input of our model.
In a preliminary experiment, we found that the two-layer hybrid high- and low-order graph convolution performs better than the one-layer version, and that stacking more layers does not significantly improve the accuracy of the graph recognition task. Therefore, this paper uses two graph convolutional layers. In further experiments, we validate $p = 2$ and $p = 3$ in equation (4) for our HLHG models. In the supervised and semisupervised classification tasks, our HLHG models perform well and achieve a good balance between classification accuracy and computational complexity. We also find that, for $p = 4$ and $p > 4$, the classification accuracy is not significantly improved. Therefore, we only analyze and implement our model for $p = 2$ and $p = 3$ in the following sections.
In equation (4), the model with $p = 2$, that is, the hybrid model of the 1st and 2nd order neighborhoods, is called the HLHG-2 model. The model with $p = 3$, that is, the hybrid model of the 1st, 2nd, and 3rd order neighborhoods, is called the HLHG-3 model.
In the HLHG-2 model, assume that the graph convolutional network has 2 convolutional layers and that the activation function is Relu. Then, the output $Y$ of the HLHG-2 model can be expressed as follows:

$$Y = \mathrm{softmax}\left(P_m\left(A\,\mathrm{Relu}(M_2) W_2, A^{2}\,\mathrm{Relu}(M_2) W_2\right)\right), \quad (5)$$

where $M_2 = P_{\max}(A X W_1, A^{2} X W_1)$ and $P_m$ denotes the fusion pooling $P_{\max}$. Similarly, the output $Y$ of the HLHG-3 model can be expressed as follows:

$$Y = \mathrm{softmax}\left(P_m\left(A T, A^{2} T, A^{3} T\right)\right), \quad (6)$$

where $T = \mathrm{Relu}(M_3) W_2$ and $M_3 = P_{\max}(A X W_1, A^{2} X W_1, A^{3} X W_1)$. For a large-scale graph network, it is unacceptable to compute the powers of $A$ directly. Instead, the products are evaluated from right to left, as $A(AX)$ rather than $(AA)X$. In general, the dimension of $AX$ is less than that of $A$, and this procedure avoids large-scale matrix multiplication operations.
Therefore, our HLHG model has a 2-layer graph network, and the iterative expression for the 2nd order neighborhood is as follows:

$$Y = \mathrm{softmax}\left(P_{\max}\left(A\left(\mathrm{Relu}(H) W_2\right), A\left(A\left(\mathrm{Relu}(H) W_2\right)\right)\right)\right), \quad (7)$$

where $H = P_{\max}(A X W_1, A^{2} X W_1)$. We use $P_{\max}$ as our fusion pooling operator, which takes the maximum value of the corresponding elements. Algorithm 1 shows how to fuse the neighborhoods of different orders. We use the multiclass cross entropy as the loss function of our HLHG model, $L = -\sum_{i} y_i \log(q_i)$, where $Y$ denotes the labeled samples. The trainable weights $W_1$ and $W_2$ of the graph neural network are trained using gradient descent. In each training iteration, we perform batch gradient descent.
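The two-layer model with max fusion pooling and shared weights can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' TensorFlow implementation: the function names, the toy row-normalized matrix `A`, the random weights, and the use of a boolean mask to select labeled nodes are assumptions introduced here.

```python
import numpy as np

def relu(Z):
    return np.maximum(Z, 0.0)

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # subtract the row maximum for stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def fuse_max(A, T, p):
    """Elementwise maximum of A T, A^2 T, ..., A^p T, evaluated right to left."""
    term = A @ T
    fused = term
    for _ in range(p - 1):
        term = A @ term                    # next order, reusing the previous product
        fused = np.maximum(fused, term)
    return fused

def hlhg_forward(A, X, W1, W2, p=2):
    """Two fused layers; all orders in a layer share the same weight matrix."""
    H = fuse_max(A, X @ W1, p)             # fused first-layer pre-activation
    T = relu(H) @ W2
    return softmax(fuse_max(A, T, p))

def cross_entropy(Y_true, Q, labeled):
    """Sum of -y_i log(q_i) over the labeled nodes only."""
    logQ = np.log(np.clip(Q, 1e-12, 1.0))
    return -np.sum(Y_true[labeled] * logQ[labeled])

rng = np.random.default_rng(0)
A = np.array([[0.5, 0.5, 0.0],
              [1/3, 1/3, 1/3],
              [0.0, 0.5, 0.5]])            # toy normalised adjacency with self-loops
X = rng.normal(size=(3, 4))
W1 = rng.normal(size=(4, 5))
W2 = rng.normal(size=(5, 2))

Q = hlhg_forward(A, X, W1, W2, p=2)        # class probabilities per node
Y_true = np.array([[1, 0], [0, 1], [1, 0]], dtype=float)
labeled = np.array([True, True, False])
loss = cross_entropy(Y_true, Q, labeled)
```

Setting `p=3` in `hlhg_forward` gives the three-order variant with the same two weight matrices, since the number of weights does not depend on the order.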

Computational Complexity and Parameter Quantity.
In a large-scale graph network, the adjacency matrix is $A \in \mathbb{R}^{n \times n}$. It is difficult to calculate $A^{(p)}$ directly. To reduce the computational complexity, we calculate the products involving $A^{(p)}$ iteratively. For higher orders, the right-to-left iterative multiplication procedure is used, so that each order is obtained from the previous one by one further multiplication by $A$. In the proposed model, the input feature of the graph network is $X \in \mathbb{R}^{n \times r}$. The weight of the first convolutional layer is $W_1 \in \mathbb{R}^{r \times r_1}$, and the weight of the second layer is $W_2 \in \mathbb{R}^{r_1 \times r_2}$. The input of the first convolutional layer is $H^{(0)} = X \in \mathbb{R}^{n \times r}$, where $r$ represents the dimension of the input feature. Here, $r_1$ denotes the number of hidden neurons in the first convolutional layer and $r_2$ denotes the number of hidden neurons in the second convolutional layer. In our HLHG model, the trainable weight parameters are shared within the same convolutional layer. Therefore, in the first convolutional layer, the output dimension after the convolutional operator is the same for every order. That is, $A^{k} X W_1 \in \mathbb{R}^{n \times r_1}$, where $k$ is the order of the adjacency matrix $A$.
In the $l$-th convolutional layer, $A^{k} H^{(l-1)} W_l \in \mathbb{R}^{n \times r_l}$, where $r_l$ denotes the number of hidden neurons in the $l$-th convolutional layer. Assume that $A$ is a sparse matrix with $m$ nonzero elements. For the $l$-th convolutional layer of our HLHG, the computational complexity is $O(r_l \times k \times m \times r_{l-1})$ and the number of trainable weights is $O(r_l \times r_{l-1})$. The total computational complexity of our HLHG model is $O\left(\sum_{l=1}^{j} (r_l \times k \times m \times r_{l-1})\right)$, and the total number of trainable parameters is $O\left(\sum_{l=1}^{j} (r_l \times r_{l-1})\right)$, where $j$ denotes the total number of convolutional layers and $l$ indexes the $l$-th convolutional layer. When $l = 1$, $r_0$ represents the feature dimension of the dataset and $r_l$ represents the number of hidden neurons in the $l$-th convolutional layer. For all the datasets, $r_0 \gg r_l$; therefore, we only consider the first convolutional layer when we compare the computational complexity and the number of parameters.

Figure 2: HLHG model. The graph convolutional network layer of the HLHG model consists of two convolutional layers and information fusion pooling, followed by softmax classification. The inputs are the first-order to the $n$-th order neighborhoods. When $n = 1$, the model degenerates into the classical GCN model. When the neighborhood order is $n = 2$, it is called the HLHG-2 model, and its inputs are the 1st order and 2nd order neighborhoods. When the neighborhood order is $n = 3$, it is called the HLHG-3 model, and its inputs are the 1st order, 2nd order, and 3rd order neighborhoods.
Compared to [14], we use fewer filters to maintain a similar computational complexity, and weight sharing between the lower-order and higher-order convolutions reduces the number of parameters.
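The two points of this analysis, iterative right-to-left multiplication and weight sharing across orders, can be checked on a small example. The sketch below is illustrative only: the function names, the dense random matrices, and the layer sizes are hypothetical, and the per-order parameter count corresponds to a layer that keeps a separate weight matrix for each order.

```python
import numpy as np

def right_to_left(A, X, W, p):
    """Return [A X W, A^2 X W, ..., A^p X W] without forming any power of A."""
    terms = []
    term = X @ W                 # thin n x r1 matrix
    for _ in range(p):
        term = A @ term          # one product of A with a thin matrix per order
        terms.append(term)
    return terms

def explicit_powers(A, X, W, p):
    """Same terms, but built from explicit n x n powers of A (for comparison only)."""
    terms = []
    A_k = np.eye(A.shape[0])
    for _ in range(p):
        A_k = A_k @ A            # n x n product; a sparse A generally fills in
        terms.append(A_k @ X @ W)
    return terms

def shared_params(r_in, r_out):
    """Weights in one layer when all orders share a single matrix."""
    return r_in * r_out

def per_order_params(r_in, r_out, p):
    """Weights in one layer when each of the p orders has its own matrix."""
    return p * r_in * r_out

rng = np.random.default_rng(1)
A = rng.random((6, 6))
X = rng.normal(size=(6, 10))
W = rng.normal(size=(10, 4))

fast = right_to_left(A, X, W, p=3)
slow = explicit_powers(A, X, W, p=3)
```

Both routines return the same matrices, but the first one only ever multiplies $A$ by an $n \times r_1$ matrix, and with shared weights the layer size stays at $r \times r_1$ regardless of the order.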

Experiments
We conduct experiments in order to verify that our HLHG model can be applied to supervised learning and semisupervised learning. On the text network datasets, we compare our model with the state-of-the-art methods using supervised learning. On the citation network datasets, we compare our model with the state-of-the-art methods using semisupervised learning. For all experiments, we construct a 2-layer graph convolutional network of our model using TensorFlow. The code and data are available on GitHub.

Supervised Text Network Classification.
We conduct supervised learning on five benchmark text graph datasets to compare the classification accuracy of HLHG with the graph convolutional neural network and other deep learning approaches.

Datasets.
In our supervised experiments, the 20-Newsgroups (20NG), Ohsumed, R52 and R8 of Reuters 21578, and Movie Review (MR) are used to verify the proposed models. These datasets are publicly available on the web and are widely used as test-verified datasets. The summary statistic features of the text network are shown in Table 1.
These benchmark text datasets were processed by Yao et al. [21], who converted the text datasets into graph network structures. Then, they used preprocessing to construct the adjacency matrix of the graph network input and input parameters. The dataset is divided into a training dataset and a test dataset in the same way.
In our HLHG-2 model, we set the dropout rate to 0.2. The learning rate is adapted by the Adam optimizer [28] during training. We set the L2 loss weight to 0, and we adopt early stopping. We set the learning rate to 0.02 for the R8 dataset and to 0.01 for the remaining datasets. We use a different number of epochs for different datasets: 350 for R52, 200 for OH and 20NG, and 60 for R8 and MR. In the HLHG-2 model, we set the number of hidden neurons in the 1st convolutional layer to 128 for all datasets.
For our HLHG-3 model, we set the number of hidden neurons in the first convolutional layer to 128, except for the MR dataset, for which it is set to 64. To obtain better training results, we set hyperparameters such as the dropout rate, learning rate, and number of epochs separately for each dataset (see Table 2). All other parameters of HLHG-3 are the same as those of HLHG-2.
We construct the graph network for our HLHG-2 and HLHG-3 models, and the feature matrix and other parameters are the same as those by Yao et al. [21].
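For reference, the HLHG-2 settings stated above can be collected into a single configuration object. The dictionary layout, key names, and the `settings_for` helper are hypothetical and only restate the values given in the text; the HLHG-3 values from Table 2 are not reproduced here.

```python
# HLHG-2 hyperparameters for the text datasets, as stated in the text.
# The structure and key names are illustrative assumptions.
HLHG2_TEXT_CONFIG = {
    "common": {
        "dropout": 0.2,
        "l2_weight": 0.0,
        "hidden_units": 128,     # first convolutional layer
        "optimizer": "adam",
        "early_stopping": True,
    },
    "per_dataset": {
        "R52":  {"learning_rate": 0.01, "epochs": 350},
        "OH":   {"learning_rate": 0.01, "epochs": 200},
        "20NG": {"learning_rate": 0.01, "epochs": 200},
        "R8":   {"learning_rate": 0.02, "epochs": 60},
        "MR":   {"learning_rate": 0.01, "epochs": 60},
    },
}

def settings_for(dataset):
    """Merge the common settings with the dataset-specific ones."""
    return {**HLHG2_TEXT_CONFIG["common"], **HLHG2_TEXT_CONFIG["per_dataset"][dataset]}

r8_settings = settings_for("R8")
```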

Results.
We show the supervised text classification accuracies for the five datasets in Table 3. We demonstrate how our model performs on the common splits taken from Yao et al. [21]. Table 3 presents the classification accuracies and standard deviations of our models and the benchmarks on the text network data. In general, our HLHG-2 and HLHG-3 achieve high levels of performance. Specifically, they achieve the best performance on R52, OH, 20NG, and R8. On the MR dataset, the proposed models yield lower accuracy than the best-performing approach. The HLHG-3 and HLHG-2 models perform comparably: the 3rd order HLHG has slightly better classification accuracy than the 2nd order HLHG on most datasets, but the difference is not large. Overall, the proposed architecture with hybrid high- and low-order neighborhoods has good classification performance, which indicates that it effectively preserves the topological information of the graph and obtains a high-quality representation of the nodes.
The benchmark results are copied from [8]. The mean and standard deviation for our model are computed over 100 runs. Table 4 compares the network complexity and the number of parameters with those of the Text GCN [21]. Our HLHG can match the Text GCN with respect to computational complexity while requiring fewer parameters. As described in Section 3.3, the number of features in the dataset is much larger than the number of neurons in the hidden convolutional layer. Therefore, we only compare the computational complexity and the number of parameters of the first convolutional layer in our HLHG model. In Table 4, Comp. and Params represent the computational complexity and the number of parameters in the first layer of the graph convolutional network, respectively. In the computational complexity results, the first constant denotes the number of neurons in the first convolutional layer and the second constant denotes the order of the adjacency matrix. The parameter $m$ denotes the number of nonzero entries of the sparse regularized adjacency matrix. The parameter $r$ denotes the feature dimension of the nodes in the graph network.
In the Text GCN [21], the number of hidden neurons in the first convolutional layer is 200; therefore, the constant in both Comp. and Params is 200. In our HLHG-2 model, 128 denotes the number of hidden neurons in the first convolutional layer and 2 represents the highest order of HLHG-2. In our HLHG-3 model, 128 and 64 denote the number of hidden neurons in the first convolutional layer. Table 4 shows that our HLHG-3 model has lower computational complexity on the MR dataset. Because of the weight sharing across the different order neighborhoods, our HLHG models require fewer trainable weight parameters. On the MR dataset in particular, the number of parameters is only about 1/3 of that of the Text GCN [21].
Note to Table 1: C indicates the category, D is the total number of texts, Tr is the training set, Te is the test set, and N is the number of vertices of the graph network.
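The constants quoted above can be checked with a short calculation. The worked example below assumes the first-layer cost $O(r_1 \times k \times m \times r)$ and parameter count $O(r_1 \times r)$ from Section 3.3, and it assumes that the Text GCN corresponds to order $k = 1$. It restates the constants given in the text and is not a copy of the entries of Table 4.

```latex
\begin{align*}
\text{Text GCN:}      \quad & \text{Comp.} = 200 \times 1 \times m \times r = 200\,mr, & \text{Params} &= 200\,r \\
\text{HLHG-2:}        \quad & \text{Comp.} = 128 \times 2 \times m \times r = 256\,mr, & \text{Params} &= 128\,r \\
\text{HLHG-3:}        \quad & \text{Comp.} = 128 \times 3 \times m \times r = 384\,mr, & \text{Params} &= 128\,r \\
\text{HLHG-3 (MR):}   \quad & \text{Comp.} = 64 \times 3 \times m \times r = 192\,mr,  & \text{Params} &= 64\,r \\[4pt]
\text{MR parameter ratio:} \quad & \frac{64\,r}{200\,r} = 0.32 \approx \frac{1}{3}
\end{align*}
```

Under these assumptions, the MR configuration of HLHG-3 has a smaller first-layer constant than the Text GCN (192 versus 200) and roughly one third of its parameters, while the parameter count of either HLHG variant does not grow with the order because the weights are shared.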

Semisupervised Node Classification.
We conduct semisupervised learning on three benchmark citation network datasets to compare the node classification accuracy of HLHG with some classical approaches and with some graph convolutional neural network approaches. The graph semisupervised learning corresponds to the process of "label" spreading on citation networks.

Datasets.
In semisupervised node classification, we use the CiteSeer, Cora, and PubMed citation network datasets [29]. In these citation datasets, the nodes represent the articles that were published in the corresponding journal. The edges between two nodes represent references from one article to another, and the labels represent the topics of the articles. The citation links define the adjacency matrix. These datasets have low label rates. The summary statistics of the citation graphs are shown in Table 5.
For the HLHG-3 model, we set different numbers of hidden neurons for the different datasets. We set 8 hidden neurons for the CiteSeer dataset to reduce the computational complexity and the number of parameters, and we set 10 hidden neurons for the Cora and PubMed datasets to capture richer features. The hyperparameters of the HLHG-3 are shown in Table 6.

Results.
In the semisupervised experiments, we train and test our models on those citation network datasets following the methodology that was proposed by Yang et al. [30]. The classification accuracy is the average of 100 runs with random weight initializations.
The benchmark results were copied from [15,30]. The mean and standard deviation for our model are computed over 100 runs.
In Table 7, the node classification accuracies above the line are copied from Abu-El-Haija et al. [14,15] and Yang et al. [30]. The values below the line are those of our HLHG models. The ± values represent the standard deviation over 100 runs with different random initializations. These splits use only 20 labeled nodes per class during training. We achieve the best test accuracies of 82.7% and 71.5% on the Cora and CiteSeer datasets, respectively. The other high-order graph convolutional neural networks [14,15] obtain the high-order information on the same datasets using linear combinations of features from farther distances, whereas our HLHG model obtains the high-order neighborhood information nonlinearly.
In Table 8, we compare the network complexity and the number of parameters with those of the other high-order graph convolutional networks and the classical GCN. The results show that our model has the same computational complexity as the other approaches. With respect to the number of parameters, our HLHG-3 model has fewer parameters than the GCN [6]. The reason is that our model shares the weights within the same layer among the neighborhood matrices of different orders.

Conclusion
In this paper, we propose a hybrid lower-order and higher-order GCN model for the supervised classification of text network datasets and for semisupervised classification in citation networks. In our model, we propose a novel nonlinear information fusion layer to combine the low- and higher-order neighborhoods. To reduce the number of parameters, we propose sharing the weights within the same convolutional layer across neighborhoods of different orders. Experiments on the two kinds of network datasets suggest that HLHG has the capability to fuse higher-order neighborhoods for supervised and semisupervised classification. Our model significantly outperforms the benchmarks. We also find that the computational complexity and the number of parameters are lower than those of the high-order method. To obtain more neighborhood information, we could use adjacency matrices of even higher order. However, the direct use of higher orders may lead to oversmoothing problems. Therefore, in future research, we will extend our HLHG models by fusing them with graph attention networks [36] to develop a deeper graph convolutional network.

Data Availability
The supervised text network classification data used to support the findings of this study have been deposited in the repository DOI:10.1609/aaai.v33i01.33017370. The semisupervised node classification data used to support the findings of this study have been deposited in the repository DOI:10.1609/aimag.v29i3.2157.

Disclosure
The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; nor in the decision to publish the results.