GNEA: A Graph Neural Network with ELM Aggregator for Brain Network Classification

Brain networks provide essential insights into the diagnosis of functional brain disorders, such as Alzheimer's disease (AD). Many machine learning methods have been applied to learn from brain images or networks in Euclidean space. However, it remains challenging to learn the complex network structures and connectivity of brain regions in non-Euclidean space. To address this problem, in this paper, we approach brain network classification from the perspective of graph learning. We propose an aggregator based on extreme learning machine (ELM) that boosts the aggregation ability and efficiency of graph convolution without iterative tuning. We then design a graph neural network named GNEA (Graph Neural Network with ELM Aggregator) for the graph classification task. Extensive experiments are conducted on a real-world AD detection dataset to evaluate and compare the graph learning performance of GNEA with those of state-of-the-art graph learning methods. The results indicate that GNEA achieves excellent learning performance, with the best graph representation ability in brain network classification applications.


Introduction
In recent years, researchers have generated functional brain networks from resting-state functional magnetic resonance imaging (rs-fMRI) data [1]. The brain network gives researchers the possibility of analyzing brain regions and the connections among them. However, most brain network analysis methods [2,3] learn from either manually extracted shallow features or deep features in Euclidean space. There is still an urgent demand for incorporating graph learning methods into brain analysis and disease detection. Thoroughly exploiting the graph structure and the connectivity among brain regions can significantly improve the comprehensiveness and depth of brain analyses. Thus, in this paper, we leverage graph learning methods to study the brain network classification problem for the detection of Alzheimer's disease. The graph neural network (GNN) has become one of the most popular graph representation and learning methods.
Sperduti and Starita first applied neural networks to graphs in [4], which motivated the initial outline [5] and detailed description [6] of the GNN. However, these early networks were based on recurrent neural networks (RNNs) [7], which are computationally expensive. Subsequently, the concept of convolution on graph data was studied. Bruna et al. first developed a graph convolution based on spectral graph theory in [8], which was followed by improvements and extensions such as ChebNet [9] and GCN [10]. In general, spectral methods face high computational costs due to eigendecomposition [11]. On the other hand, spatial convolution was first studied in [12], in which the NN4G network performed sum aggregation on neighbor information directly. In follow-up works, GraphSAGE [13] adopted a sampling strategy to improve the convolutional efficiency, GAT [14] adopted an attention mechanism to learn edge weights, and CGMM [15] studied probabilistic interpretability based on NN4G.
Although several recent works have applied GNNs to brain-related problems, many open problems remain. Ktena et al. [16] studied similarity metric learning for brain networks, but the classification performance on brain networks was not explored. Rajchl et al. [17] realized the prediction of spectrum disorders and AD by running node classification on a patient network, in which each node denoted a patient. However, this work focused on disease prediction among patients, and the brain connectivity of each individual patient was not studied. Lee and Huang [18] applied GCN to study the brain connectome, but because the precise structure of the graph is unknown, the graph learning procedure relied on iterative graph generation and was thus very time-consuming.
In this paper, to improve the efficiency of graph learning, we propose a graph learning neural network named GNEA (Graph Neural Network with ELM Aggregator) for the graph classification problem. The graph learning procedure of GNEA is presented in Figure 1.
The aggregation performance is boosted by an aggregator based on extreme learning machine (ELM) [19,20]. The extremely fast training speed and good generalization performance of ELM have been proven in various applications, for example, time-series learning [21-23], text mining [24,25], biomedical data analysis [26-29], graph classification [29,30], and game strategy [31]. The ELM aggregator learns a more complex aggregation function than those of other aggregators, which provides an extremely fast learning speed and a powerful aggregation ability. The contributions of this paper are summarized as follows: (1) An ELM-based aggregator is proposed, which achieves high aggregation ability and training efficiency. (2) A graph learning neural network named GNEA is designed, which possesses a powerful learning ability for graph classification tasks. (3) We apply GNEA to a real-world brain network classification problem to verify its graph representation learning and classification ability. The remainder of this paper is organized as follows. Section 2 introduces the functional brain network. Following the framework overview of GNEA in Section 3, Section 4 presents the graph convolution of GNEA, including propagation based on correlation-biased sampling and aggregation based on ELM. Section 5 presents the results of the performance evaluation, comparison, and discussion. Section 6 concludes this paper.

Functional Brain Network
A functional brain network is represented as a graph, where each node denotes a brain region and each edge denotes the functional connection between two brain regions [32].
A 4-dimensional fMRI image is a sequence of 3-dimensional brain volumes, each of which is a stack of 2-dimensional brain image slices; the 4th dimension is the temporal dimension. The brain regions of each 3-dimensional brain volume are mapped using an Atlas. For example, the automated anatomical labeling (AAL) Atlas maps 116 brain regions in the rs-fMRI images. With the extracted brain regions, the connectivity of each pair of brain regions can be estimated from the correlations of their signals along the temporal dimension. The Pearson correlation coefficient is one of the most popular statistics for measuring the linear correlation between two normally distributed variables [33]. The Pearson correlation coefficient of two brain regions x and y is calculated as

$$\rho_{x,y} = \frac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},$$

where x and y have the same length n, cov(x, y) is the covariance of x and y, σ_x and σ_y are the standard deviations of x and y, and x̄ and ȳ are the mean values of x and y. The correlation coefficient calculated between each pair of brain regions gives the weight of the edge between the two corresponding nodes in the brain network. Since a brain network is theoretically a sparse graph with dense local connectivity [34], a weight threshold is set to remove irrelevant edges from the complete graph of the brain network. Edges with weights below the threshold are considered inactive connections between brain regions.
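As an illustration, the following minimal sketch builds a thresholded brain network from region time series; the function name, the synthetic input, and the use of the absolute correlation value for thresholding are our assumptions rather than details taken from the paper.

```python
import numpy as np

def build_brain_network(region_series, threshold=0.5):
    """Return the binary adjacency matrix A and the Pearson matrix C."""
    # np.corrcoef computes the pairwise Pearson correlation coefficients
    # between the rows (brain regions) of the input matrix.
    C = np.corrcoef(region_series)
    # Edges whose absolute weight falls below the threshold are treated
    # as inactive connections and removed from the complete graph.
    A = (np.abs(C) >= threshold).astype(int)
    np.fill_diagonal(A, 0)  # no self-loop from a region to itself
    return A, C

# Example: 116 AAL regions with 140 time points of synthetic data.
A, C = build_brain_network(np.random.randn(116, 140), threshold=0.5)
```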
An example of rs-fMRI data and the corresponding functional brain network is presented in Figure 2. Figure 2(a) presents the original rs-fMRI images in three cuts, namely, frontal, axial, and lateral cuts. Figure 2(b) presents the three cuts of the AAL Atlas, which maps 116 brain regions in the rs-fMRI images. Figure 2(c) shows the graph matrix generated from the AAL-mapped brain regions and the calculated correlation coefficients. Figure 2(d) presents the final brain network generated from the graph matrix with a determined weight threshold.

Framework Overview of GNEA
To boost the performance of brain network classification by GNNs, we improve the graph learning ability by designing a graph learning model named GNEA, which learns graph embeddings with graph convolution based on the ELM aggregator. A brain network is represented in a graph format, where each node denotes a brain region and each edge denotes a strong connection between two brain regions. The edge weight represents the correlation coefficient. A brain network consists of three matrices, namely, the adjacency matrix A, the node embedding matrix X, and the correlation coefficient matrix C.
The training data and targets of the ELM aggregator are generated from the graphs. With the trained aggregator, three layers of graph convolution are applied to update the node embeddings by propagation and aggregation based on the structures of the graphs. With the learned embeddings of nodes and graphs, the fully connected layer and Softmax output the classification results for the input graph.
The structure of GNEA is presented in Figure 3.
GNEA accepts three graph-formatted matrices as inputs. The adjacency matrix A represents the connectivity of the brain network, where element A_ij = 1 indicates a connection between the corresponding i-th and j-th nodes, while element A_ij = 0 indicates a disconnection. The embedding matrix X concatenates all the node embeddings by rows. The correlation coefficient matrix C keeps all the weights of the edges. In the case of the brain network, each weight C_ij is the calculated Pearson correlation coefficient between the i-th and j-th brain regions. Note that we keep both A and C instead of a single weighted adjacency matrix, so that all the original correlation coefficients between brain regions can be stored.
Three layers of graph convolution are conducted sequentially to learn both the semantic and structural feature embeddings. For each node v, the graph convolution collects information from the neighbor node set N(v) and the sampled node set S(v). Then, the collected information is aggregated by the pretrained ELM aggregator. The convolved node embeddings of each convolutional layer are activated by the ReLU activation function, and the sizes of the node embeddings are reduced by graph pooling. The graph embedding generated from the node embeddings by the readout operation is flattened into a vector and classified by a fully connected layer. Finally, the Softmax activation function transforms the neural outputs into a class-label distribution for the input graph.
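To make the data flow concrete, the following is a deliberately simplified numpy sketch of this pipeline; the linear propagation step is a stand-in for the ELM-based graph convolution detailed in Section 4, and all names and shapes are illustrative assumptions.

```python
import numpy as np

def gnea_forward_sketch(A, X, W_convs, W_fc):
    """Simplified forward pass: conv layers -> ReLU -> readout -> FC -> Softmax."""
    H = X
    A_hat = A + np.eye(A.shape[0])          # include each node's own embedding
    for W in W_convs:                       # three graph convolutional layers
        H = np.maximum(A_hat @ H @ W, 0.0)  # propagation stand-in + ReLU
    g = H.mean(axis=0)                      # readout: a single graph embedding
    logits = g @ W_fc                       # fully connected classification layer
    e = np.exp(logits - logits.max())
    return e / e.sum()                      # Softmax class distribution

# Toy usage: 116 nodes, 8-dimensional embeddings, 3 classes (NC/MCI/AD).
rng = np.random.default_rng(0)
A = np.triu((rng.random((116, 116)) > 0.9).astype(float), 1)
A = A + A.T                                 # symmetric adjacency, no self-loops
X = rng.normal(size=(116, 8))
probs = gnea_forward_sketch(A, X, [rng.normal(size=(8, 8)) * 0.1] * 3,
                            rng.normal(size=(8, 3)) * 0.1)
```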

Graph Convolution
Graph convolution, as the key module for learning embeddings in GNEA, consists of two major operations, namely, propagation and aggregation. In this section, after an overview of graph convolution in GNEA, we propose our sampling-based propagation method and the ELM aggregator.

Spatial Convolution. A graph is defined as

$$G = (V, E),$$

where V denotes the node set and E ⊆ V × V denotes the edge set. In the case of an undirected graph, such as a brain network, the condition (i, j) ∈ E iff (j, i) ∈ E holds. A graph is often represented by an adjacency matrix A ∈ R^{|V|×|V|}.
In a spectral convolutional solution, the normalized graph Laplacian L is defined as

$$L = I - D^{-\frac{1}{2}} A D^{-\frac{1}{2}},$$

where D is the diagonal degree matrix and I is the identity matrix. The representation H^{(l+1)} ∈ R^{N×F} of the (l+1)-th layer, with C input channels and F filters, can be calculated as

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W\right),$$

where W ∈ R^{C×F} is a matrix of filter parameters, Ã = A + I, and D̃ is the diagonal degree matrix of Ã. However, this spectral convolution requires a stable and complete matrix A and has a computational cost of O(|E|FC). Different from aggregating information from the spectral perspective, spatial convolution aggregates a node's information with the information of its spatial neighbors. Increasing the number of spatial convolution layers propagates information from further neighbor nodes. The spatial convolution used to learn the node representations is described in Algorithm 1.
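For reference, here is a minimal numpy sketch of the propagation rule above; it is our illustration of the renormalized rule from [10], not the paper's code.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN-style spectral propagation step with ReLU activation."""
    A_tilde = A + np.eye(A.shape[0])              # adjacency with self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))  # D~^{-1/2}
    return np.maximum(d_inv_sqrt @ A_tilde @ d_inv_sqrt @ H @ W, 0.0)
```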
Given a graph G, the graph convolution algorithm first generates the adjacency matrix A, the node embedding matrix X, and the correlation coefficient matrix C. The neighborhood sample size m and the number of graph convolutional layers K are also accepted as hyperparameters.
The input embedding X is taken as the initialized embedding H^0 (Line 1).
Iteration through the K aggregators (Line 2) can be viewed as propagation through K layers of spatial convolution. In each layer, for each node v ∈ V (Line 3), the function ϕ(v, m, A, C) returns the top-m neighbors of node v according to the adjacency matrix A and the correlation coefficients C (Line 4).
The aggregator function f^k_agg(·) in the k-th spatial convolution layer aggregates the embeddings of all the sampled neighbors and generates an aggregated neighbor embedding h^k_{N(v)} (Line 5). Then, the embedding of node v and the aggregated neighbor embedding are concatenated, and dimension reduction is performed by calculating the elementwise average of the concatenated embeddings (Line 6). The graph convolution algorithm finally returns H^K, which contains the K-th-layer embedding of every node (Line 8).
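A compact Python rendering of Algorithm 1 follows; it is a sketch under stated assumptions (each aggregator is passed in as a callable that maps a stack of neighbor embeddings to one vector), with the sampling logic of Line 4 inlined.

```python
import numpy as np

def graph_convolution(A, X, C, aggregators, m):
    """Algorithm 1 sketch: K layers of sample-and-aggregate spatial convolution."""
    H = X.astype(float).copy()                       # Line 1: H^0 = X
    for f_agg in aggregators:                        # Line 2: iterate the K layers
        H_next = np.zeros_like(H)
        for v in range(A.shape[0]):                  # Line 3: each node v in V
            nbrs = np.flatnonzero(A[v])              # Line 4: phi(v, m, A, C),
            top = nbrs[np.argsort(-C[v, nbrs])][:m]  #   the top-m neighbors by C
            h_nbr = f_agg(H[top]) if len(top) else np.zeros(H.shape[1])  # Line 5
            H_next[v] = (H[v] + h_nbr) / 2.0         # Line 6: elementwise average
        H = H_next
    return H                                         # Line 8: return H^K

# Toy usage with a mean aggregator in each of K = 3 layers.
rng = np.random.default_rng(0)
A = np.triu((rng.random((10, 10)) > 0.6).astype(int), 1); A = A + A.T
C = (lambda M: (M + M.T) / 2)(rng.random((10, 10)))
H_K = graph_convolution(A, rng.normal(size=(10, 4)), C,
                        aggregators=[lambda E: E.mean(axis=0)] * 3, m=3)
```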
In the following subsections, we present the two major operations in the graph convolution of GNEA: the sampling-based propagation method and the ELM aggregator.

Propagation Based on Correlation-Biased Sampling.
Figure 1: Graph learning for brain network classification for AD detection.

Previous graph learning methods aggregate node embeddings from the entire graph. In our sampling-based propagation method, to maintain the stability of propagation, we sample m nodes from the neighbors of each node. Since the sampling is biased by the correlation coefficients, we sample the m neighbors with the highest relevance according to the correlation coefficient matrix C. The sampling function ϕ(·) (Line 4 in Algorithm 1) is described as

$$\phi(v, m, A, C) = \{u_1, \ldots, u_m\} = \underset{u \in N_A(v)}{\operatorname{top}\text{-}m}\; C_{v,u},$$

where v is the current node for the embedding update, m is the sample size, A is the adjacency matrix of the current graph for retrieving neighbor information, and C is the correlation coefficient matrix serving as the sampling bias criterion. This sampling function retrieves m nodes u_i, i = 1, ..., m, from the neighbors N_A(v) of node v, guaranteeing that these m nodes are the top-m most relevant neighbors of node v according to their correlation coefficients with v.
Note that if the number of neighboring nodes is less than m, the collected node embeddings are zero-padded. Then, the GNEA network tunes the weight matrices W of all K aggregators and the other trainable parameters by using stochastic gradient descent.
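The sampling function with zero-padding can be sketched as follows; the function and parameter names are ours, and H is assumed to hold the current node embeddings row by row.

```python
import numpy as np

def phi(v, m, A, C, H):
    """Correlation-biased sampling: return m neighbor embeddings of node v."""
    nbrs = np.flatnonzero(A[v])                     # neighbor set N_A(v)
    top = nbrs[np.argsort(-C[v, nbrs])][:m]         # top-m most relevant by C
    emb = H[top]
    if len(top) < m:                                # fewer than m neighbors:
        pad = np.zeros((m - len(top), H.shape[1]))  # zero-pad the collected
        emb = np.vstack([emb, pad])                 # node embeddings
    return emb
```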

Aggregation Based on ELM
4.3.1. Symmetry Aggregators. Since the neighbor nodes N(v) of node v have no ordering in non-Euclidean space, the aggregator function of the spatial convolution process should be symmetric, that is, invariant to permutations of its input representation vectors; examples include the mean aggregator and the pooling aggregator in GraphSAGE [13].
(1) Mean Aggregator. Taking the elementwise mean of the embedding tensors is nearly equivalent to the convolutional propagation rule used in transductive spectral convolution [13]. The mean aggregator applies the mean operator to the concatenated embeddings of node v and the propagated neighbor embeddings. Thus, the k-th mean aggregator function f^k_agg-mean is written as

$$f^{k}_{agg\text{-}mean} = \operatorname{MEAN}\left(\left\{h_v^{k-1}\right\} \cup \left\{h_u^{k-1}, \forall u \in N(v)\right\}\right).$$

(2) Pooling Aggregator. The pooling aggregator of a graph convolutional network takes the average or maximum element of an embedding. It was found in [13] that there is no significant difference between max-pooling and mean-pooling in practice. Thus, we apply only the max-pooling strategy to realize the pooling aggregator of our rival methods. By performing the max-pooling operation on each of the activated and weighted features, the aggregator is able to capture different aspects of the neighbor nodes. The k-th pooling aggregator function f^k_agg-pool is described as

$$f^{k}_{agg\text{-}pool} = \max\left(\left\{\sigma\!\left(W_{pool}\, h_u^{k-1} + b\right), \forall u \in N(v)\right\}\right).$$
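Below are minimal numpy sketches of the two symmetry aggregators, following the GraphSAGE formulations [13]; the weight matrices, the bias, and the choice of ReLU as the nonlinearity are illustrative assumptions.

```python
import numpy as np

def mean_aggregator(h_v, h_nbrs, W):
    """Elementwise mean over v's own and its neighbors' embeddings, then ReLU(W.)."""
    stacked = np.vstack([h_v[None, :], h_nbrs])   # {h_v} united with {h_u}
    return np.maximum(stacked.mean(axis=0) @ W, 0.0)

def pool_aggregator(h_nbrs, W_pool, b):
    """Transform and activate each neighbor embedding, then take the
    elementwise maximum across neighbors (max-pooling)."""
    return np.max(np.maximum(h_nbrs @ W_pool + b, 0.0), axis=0)
```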

The ELM Aggregator.
The trainable parameters of symmetry aggregators are learned through backpropagation iterations according to the total loss, so the aggregator must be tuned along with the other trainable parameters of the graph neural network. For a neural network-based aggregator, however, the weighted mapping of the embeddings can be performed by the input mapping within the network.
Thus, to boost both the learning efficiency and performance of the aggregator, we propose an aggregator based on extreme learning machine [19,20]. The ELM aggregator is capable of powerful aggregation and efficient training because it avoids iteratively tuning the weights with the total loss. In each graph convolutional layer, the ELM aggregator learns an ELM feature mapping from the neighboring embedding space into the aggregated embedding space of the central node.
The ELM aggregator is presented in Figure 4. Given N arbitrary samples (x_i, t_i), x_i ∈ R^n, t_i ∈ R^m, the ELM feature mapping matrix H_ELM is calculated as

$$H_{ELM} = \begin{bmatrix} g(\omega_1, b_1, x_1) & \cdots & g(\omega_L, b_L, x_1) \\ \vdots & \ddots & \vdots \\ g(\omega_1, b_1, x_N) & \cdots & g(\omega_L, b_L, x_N) \end{bmatrix}_{N \times L},$$

where L is the number of hidden-layer nodes, ω_i is the input weight vector from the input nodes to the i-th hidden node, and b_i is the bias of the i-th hidden node. g(ω_i, b_i, x) is the activation function that generates the mapping neurons, and it can be any nonlinear piecewise-continuous function. The ELM aims to minimize both the training error and the norm of the output weights. Thus, the output weight matrix β can be calculated as

$$\beta = H_{ELM}^{\dagger}\, T,$$

where H†_ELM is the Moore-Penrose inverse of H_ELM and T is the training target matrix. The training procedure of the ELM is presented in Algorithm 2.
During the feedforward phase, the aggregation result of the ELM aggregator is calculated as

$$f_{agg\text{-}ELM}(X) = H_{ELM}\, \beta.$$
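A self-contained sketch of the ELM aggregator follows, under stated assumptions: sigmoid hidden neurons, np.linalg.pinv for the Moore-Penrose inverse, and illustrative names throughout.

```python
import numpy as np

class ELMAggregator:
    """ELM aggregator sketch: random hidden layer + analytic output weights."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.omega = rng.normal(size=(n_in, n_hidden))  # random input weights
        self.b = rng.normal(size=n_hidden)              # random hidden biases
        self.beta = None                                # output weights (trained)

    def _h_elm(self, X):
        # Sigmoid hidden-layer mapping g(omega, b, x); any nonlinear
        # piecewise-continuous activation would serve.
        return 1.0 / (1.0 + np.exp(-(X @ self.omega + self.b)))

    def fit(self, X, T):
        # beta = pinv(H_ELM) @ T: one pseudo-inverse, no iterative tuning.
        self.beta = np.linalg.pinv(self._h_elm(X)) @ T
        return self

    def aggregate(self, X):
        # Feedforward phase: f_agg-ELM(X) = H_ELM @ beta.
        return self._h_elm(X) @ self.beta

# Toy usage: map 8-dimensional embeddings to 8-dimensional targets.
rng = np.random.default_rng(1)
agg = ELMAggregator(n_in=8, n_hidden=32).fit(rng.normal(size=(200, 8)),
                                             rng.normal(size=(200, 8)))
out = agg.aggregate(rng.normal(size=(5, 8)))
```

Because β is obtained in closed form, retraining the aggregator after a target update is cheap compared with backpropagating through an extra set of weights.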

Supervised Learning of the ELM Aggregator.
The ELM aggregator is trained with supervised learning, so the target T must be specified before the training procedure. However, only the target labels of the downstream task, that is, the labels for graph classification, are known. Thus, we obtain T according to the total loss of GNEA, which is calculated using the categorical cross-entropy loss. The ELM target matrix T is first initialized randomly, where each row in T denotes the target embedding of a node. Then, the corresponding targets of the sample graph are updated according to the total loss. The partial derivative of the total loss with respect to T is used for the update, which is calculated as

$$T^{(k)} \leftarrow T^{(k)} - \lambda\, \frac{\partial J(\theta, y, \hat{y})}{\partial T^{(k)}},$$

where T^(k) is the updated target of the k-th layer, λ is the update rate, and J(θ, y, ŷ) is the total loss of the downstream task, which is calculated as

$$J(\theta, y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log \hat{y}_{ij},$$

where y is the target label, ŷ is the output label, M is the number of classes, and N is the number of samples.
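The two pieces of this procedure can be sketched as follows; how the gradient ∂J/∂T^(k) is obtained (e.g., from a framework's automatic differentiation) is an assumption, and the names are illustrative.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # J(theta, y, y_hat) = -(1/N) * sum_i sum_j y_ij * log(y_hat_ij)
    return -np.mean(np.sum(y_true * np.log(y_pred + eps), axis=1))

def update_targets(T_k, grad_T_k, lam=0.01):
    """One update step on the layer-k ELM target matrix T^(k)."""
    return T_k - lam * grad_T_k
```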

Experiments
In this section, we first introduce our generated brain network dataset and experimental setup. Then, the performance of GNEA is evaluated and compared with those of state-of-the-art graph neural networks.

5.1. Dataset. The brain networks in our dataset are generated using resting-state fMRI data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) via the LONI Image and Data Archive (IDA). Three patient types are included in our dataset: patients with Alzheimer's disease (AD), patients with mild cognitive impairment (MCI), and normal control (NC) subjects. We select 118 samples of each patient type from all four project phases of the ADNI, resulting in a total of 354 samples in our dataset. The samples of the different patient types have similar age and gender distributions, which are presented in Figure 5. All rs-fMRI images are processed using the Nilearn and DPABI (Data Processing and Analysis for Brain Imaging) [35] toolboxes. The AAL Atlas [36] is applied to map the brain into 116 brain regions. Then, the brain networks are generated, each in the form of a 116 × 116 square matrix. The dataset is split into training data and testing data at a ratio of 4:1. Performance evaluation and comparison are conducted on the testing samples.

Experimental Setup.
We use the AUC (area under the curve) with a 95% CI (confidence interval) to evaluate the classification performance. The 95% CI is calculated as

$$\mathrm{CI} = \bar{x} \pm t_{1-\alpha/2,\, n-1}\, \frac{s_d}{\sqrt{n}},$$

where n is the number of cross-validation folds, x̄ is the mean result of the n-fold cross-validation process, α = 1 − 0.95 = 0.05, t_{1−α/2, n−1} is the quantile of Student's t-distribution with n − 1 degrees of freedom, and s_d is the standard deviation, which is calculated as

$$s_d = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}.$$

We evaluate the testing performance on three binary classification problems, namely, NC-AD, NC-MCI, and AD-MCI. Furthermore, we also evaluate the 3-class classification problem, in which the macroaverage strategy is applied in the AUC calculation.
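The CI computation can be reproduced directly from the formula; below is a small sketch using scipy.stats for the t quantile, with illustrative fold results.

```python
import numpy as np
from scipy.stats import t

def auc_confidence_interval(fold_aucs, alpha=0.05):
    """Mean AUC and its 95% CI over n cross-validation folds."""
    x = np.asarray(fold_aucs, dtype=float)
    n = len(x)
    s_d = x.std(ddof=1)                    # sample standard deviation
    half = t.ppf(1.0 - alpha / 2.0, n - 1) * s_d / np.sqrt(n)
    return x.mean(), (x.mean() - half, x.mean() + half)

# Example with five illustrative fold results.
mean_auc, ci = auc_confidence_interval([0.91, 0.89, 0.93, 0.90, 0.92])
```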
All the experiments are conducted on a PC with a 3.7 GHz Intel Core CPU, an NVIDIA GeForce RTX 2080 Ti graphics card, 32 GB of 2400 MHz DDR4 RAM, and a 500 GB solid-state drive. The proposed method is implemented using MATLAB R2018a and Python 3.6. The deep learning frameworks are TensorFlow-GPU 1.8 and Keras 2.2.

Results.
We first evaluate the performance of GNEA with varied hyperparameter settings, namely, the graph convolutional layers and the classification layers. Then, we compare the overall performance of GNEA with those of several state-of-the-art methods.

Evaluation of Graph Convolutional Layers.
The number of graph convolutional layers can be viewed as the distance of information propagation. In other words, k layers of graph convolution collect information from k-hop neighbors.
The neighborhood sample size determines |N′(v)|, the number of nodes sampled from the neighbors N(v) of node v. Thus, we evaluate the AUC (Figure 6(a)) and the runtime (Figure 6(b)) with varied layer numbers and sample sizes.
(1) The Number of Graph Convolutional Layers. On our dataset, although more graph convolutions lead to a longer runtime, the network with three graph convolutional layers achieves the highest AUC. We believe that more than three layers of graph convolution may lead to oversmoothing of the graph embeddings, which decreases the discrimination among the brain networks of different class labels.
(2) The Neighborhood Sample Size. More sampled nodes lead to a higher matrix computation cost, and the increase is nonlinear due to the matrix operations in the information propagation and aggregation process. Although the runtime continues to grow nonlinearly, the growth in AUC stagnates once the neighborhood sample size reaches 20. Since the sampling is biased by the correlation coefficients, when the number of sampled neighbors exceeds this value, the extra information from the neighbors N(v) of node v contributes little to the embedding h_v of node v.
(3) The ELM Aggregator. We evaluate the AUCs and runtimes obtained with varied numbers of hidden nodes in the ELM aggregator, along with varied numbers of convolutional layers. The evaluation results presented in Figure 7 show that the runtime continues to increase as the number of hidden nodes of the ELM aggregator grows. However, the AUC begins to drop when the number of ELM hidden nodes exceeds 400. Given a fixed dimension of the input space, extra hidden nodes do not yield a more powerful learning ability.

Evaluation of Classification Layers.
The fully connected layers in GNEA provide the ability to classify the representations of brain networks. We evaluate the AUCs (Figure 8(a)) and runtimes (Figure 8(b)) with varied numbers of fully connected layers and nodes per layer. Larger numbers of both nodes and layers lead to longer runtimes; however, since the major computational cost lies in the graph convolution operations, the runtime differences between the various fully connected layer settings are small. Regarding the AUC, a single fully connected layer achieves strong classification performance due to the high quality of the graph representations generated by GNEA. Specifically, the AUC reaches its maximum when the number of nodes increases to approximately 100 and begins to drop afterward; a larger number of either layers or nodes may lead to overfitting and poor testing performance.

Performance Comparison.
Comparisons of the graph classification performances achieved by GNEA and state-of-the-art methods, namely, a convolutional neural network (CNN) [37], a long short-term memory (LSTM) [38] network, a graph convolutional network (GCN) [10], and GraphSAGE [13] with three aggregators (denoted as GS-mean, GS-pool, and GS-LSTM, respectively), are given in Table 1. GCN, GraphSAGE, and GNEA are graph convolutional networks, of which GCN applies spectral convolution, while GraphSAGE and our GNEA perform spatial convolution.
To thoroughly measure the AD detection performance, we decompose the original 3-class classification problem into three binary classification problems. The NC-AD task distinguishes between normal controls and Alzheimer's disease patients; the NC-MCI task distinguishes between normal controls and patients with mild cognitive impairment; and the MCI-AD task distinguishes between patients with mild cognitive impairment and patients with Alzheimer's disease. The comparison results presented in Table 1 indicate that (1) the graph learning neural networks achieve higher performance than the deep neural networks operating in Euclidean space, namely, the CNN and LSTM; (2) within the group of graph neural networks, spatial convolution outperforms spectral convolution; (3) for the general three-class problem, our proposed GNEA has the best AUC performance; (4) compared with their performances in the other binary classification tasks, all the methods have higher AUC scores in the NC-AD task, since the distinctions involving MCI are subtle and implicit; (5) for the NC-AD problem, both GraphSAGE and GNEA exhibit satisfactory learning ability due to the relatively explicit distinction between the brain networks of NC subjects and AD patients; (6) for the MCI-AD problem, which has the most implicit distinction, GNEA achieves a dominant position due to the powerful representation ability of the ELM aggregator.

Conclusion
To address the graph learning problem for brain network classification, we propose a graph convolution aggregator based on extreme learning machine. The ELM aggregator exhibits efficient and powerful aggregation ability. We then design a graph neural network named GNEA, which achieves high performance in graph embedding and graph classification. The results of extensive experiments on a real-world Alzheimer's disease detection task indicate that our proposed GNEA outperforms state-of-the-art rival methods in the application of brain network classification.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.