SCMAG: A Semisupervised Single-Cell Clustering Method Based on Matrix Aggregation Graph Convolutional Neural Network

Clustering analysis is one of the most important techniques for single-cell data mining. It is widely used in the division of different gene sequences, the identification of functional genes, and the detection of new cell types. Although traditional unsupervised clustering methods do not require labeled data, the distribution of the original data, the setting of hyperparameters, and other factors all affect their effectiveness. In some cases, however, the types of some cells are known in advance, and higher accuracy can be expected if this prior information is exploited sufficiently. In this study, we propose SCMAG (a semisupervised single-cell clustering method based on a matrix aggregation graph convolutional neural network), which takes full advantage of the prior information in single-cell data. To evaluate its performance, we test the proposed semisupervised clustering method on different single-cell datasets and compare it with current semisupervised clustering algorithms for recognizing cell types on various real scRNA-seq data; the results show that SCMAG is a more accurate and effective model.


Introduction
Analysis of the gene expression matrix of a single-cell dataset is the critical step in determining cell types [1][2][3]. The categories of cells are usually unknown in advance, and detecting the type of each cell manually takes a great deal of time and money. How to obtain the best classification results by applying a semisupervised learning algorithm effectively, while using as few known cell types as possible, is therefore a research direction worthy of exploration [4,5].
The most common semisupervised learning algorithms include generative semisupervised models [6], self-training [7], collaborative training (co-training) [8], semisupervised support vector machines (S3VMs) [9], and methods based on graph theory [10,11]. Generative semisupervised models assign unlabeled data according to the distribution estimated from the previously labeled data, modify the model parameters to better adjust the decision boundary [12], and iterate this process to optimize the model. Self-training uses the existing labeled data to train a classifier, uses this classifier to assign pseudolabels or soft labels to the unlabeled data [13], selects the pseudolabeled samples judged reliable according to certain criteria and adds them to the training set, and iterates to produce the final classification. Co-training is a variant of self-training that assumes each sample can be classified from different views; classifiers trained from these different views classify the unlabeled samples, and the predictions considered credible are added to the training set. Because the classifiers are trained from different views, they complement each other and improve classification accuracy. Supervised support vector machines classify by structural risk minimization [14]; semisupervised support vector machines additionally use the spatial distribution of the unlabeled data [15]. Here, the decision hyperplane should pass through low-density regions where the distributions of the unlabeled and labeled data are consistent [16]. If this assumption does not hold, however, the spatial distribution of the unlabeled data can mislead the decision hyperplane and yield worse performance than using the labeled data alone.
In recent years, with the rise of artificial neural networks [17][18][19], semisupervised clustering algorithms have made breakthrough progress. Among them, the label propagation algorithm is a graph-based method [20,21]. Label propagation constructs a graph over the training data to connect labeled and unlabeled samples; through edge-to-edge connectivity, label information flows from the labeled data to the unlabeled data during propagation, and the edges among the unlabeled data are then used to obtain new labels and the classification results [22]. Because a single cell contains a large number of genes, the feature dimension of each cell is extremely high, and a single classic classifier cannot learn all the high-dimensional features. We therefore use a graph convolutional neural network to handle the high-dimensional, complex connections [23][24][25]. A graph convolutional neural network encodes the similarity between cells as edges in a graph and then uses convolution operations to further extract classification features from these connections. Owing to its powerful feature extraction, this approach performs strongly in semisupervised clustering. However, it requires tuning many parameters in practice; in particular, how to transform the gene expression matrix into a connection graph that effectively reflects the similarity between cells is a key issue. To solve this problem, we propose SCMAG. The framework of the proposed method is presented in Figure 1. We demonstrate through tests on different datasets that its performance is better than that of other semisupervised clustering algorithms.

Data Description and Data Preprocessing.
To verify the effectiveness of the method, we evaluate it on four datasets, summarized in Table 1. These datasets were downloaded from the NCBI Gene Expression Omnibus (GEO) repository (https://www.ncbi.nlm.nih.gov/geo).
The datasets take the form of a matrix X(g × n), with g genes in the rows and n cells in the columns. Since the amount of gene expression varies greatly in each single cell, we use min-max normalization [30] to normalize the data to (0, 1):

X_std = (X − X_min(axis=0)) / (X_max(axis=0) − X_min(axis=0)),
X_scaled = X_std × (max − min) + min,

where X_min(axis=0) represents the row vector composed of the minimum value in each column, X_max(axis=0) is the row vector composed of the maximum value in each column, max represents the maximum value of the interval to be mapped to (the default value is 1), and min represents the minimum value of that interval (the default value is 0). X_std(g × n) is the standardized result and X_scaled(g × n) is the normalized result. We then use cosine similarity to measure the relationship between cells [31]:

H(i, j) = (X_scaled(i, :) ⊗ X_scaled(j, :)) / (‖X_scaled(i, :)‖ · ‖X_scaled(j, :)‖),

where X_scaled(i, :) represents the i-th row of X_scaled(g × n), ⊗ represents the inner product, ‖X_scaled(i, :)‖ is the modulus of X_scaled(i, :), and H(i, j) is the value in the i-th row and j-th column of the similarity matrix H(n × n).

Computational and Mathematical Methods in Medicine

Figure 1: The workflow of SCMAG. The input is a gene expression matrix; the algorithm comprises five steps: (1) the similarity matrix is calculated with the cosine similarity formula; (2) the incidence matrices are obtained by thresholding; (3) the consensus matrix is constructed by matrix aggregation; (4) the consensus matrix is saved as a graph; (5) the graph is used as input to the GCN classifier for training.
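The two preprocessing steps above can be sketched with NumPy as follows (the function names are illustrative, not from the paper; the sketch assumes cells are passed as rows when computing similarity, and that no column of X is constant, so no division by zero occurs):

```python
import numpy as np

def minmax_scale(X, lo=0.0, hi=1.0):
    """Column-wise min-max normalization: map each column of the
    genes-by-cells matrix X into the interval (lo, hi)."""
    X_min = X.min(axis=0)            # row vector of per-column minima
    X_max = X.max(axis=0)            # row vector of per-column maxima
    X_std = (X - X_min) / (X_max - X_min)
    return X_std * (hi - lo) + lo

def cosine_similarity_matrix(M):
    """Pairwise cosine similarity between the rows of M:
    H(i, j) = <M[i], M[j]> / (||M[i]|| * ||M[j]||)."""
    unit = M / np.linalg.norm(M, axis=1, keepdims=True)
    return unit @ unit.T
```

With a g × n expression matrix X, `cosine_similarity_matrix(minmax_scale(X).T)` yields the n × n cell-to-cell similarity matrix H used in the next step.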

Data Division by Threshold.
We divide H into multiple incidence matrices by threshold:

S_n^{ij} = 1 if H(i, j) ≥ K_t, and S_n^{ij} = 0 otherwise,

where K_t is the threshold, S_n is the incidence matrix after threshold division, and S_n^{ij} represents the value in the i-th row and j-th column of S_n; a value of 1 means that two cells are correlated and 0 means that they are not.
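As a minimal sketch of the threshold division (the function name, the toy matrix, and the four example threshold values are illustrative assumptions, not values taken from the paper):

```python
import numpy as np

# Toy matrix standing in for the n x n cosine-similarity matrix H.
H = np.array([[1.0, 0.8, 0.3],
              [0.8, 1.0, 0.6],
              [0.3, 0.6, 1.0]])

def threshold_incidence(H, K_t):
    """Binarize the similarity matrix: cells i and j are connected (1)
    when H(i, j) >= K_t, and unconnected (0) otherwise."""
    return (H >= K_t).astype(np.int8)

# One incidence matrix S_n per threshold; the four values are illustrative.
S_list = [threshold_incidence(H, K_t) for K_t in (0.5, 0.6, 0.7, 0.8)]
```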

Graph Convolutional Neural Network Construction.
To construct the graph convolutional neural network, we first save each incidence matrix S_n as a graph G_n(V, E), using the DGL package in Python [32]. The number of vertices |V_n(G)| equals the number of cells, and the number of edges |E_n(G)| equals the number of elements in S_n whose value is 1. Whether two vertices in the graph are directly connected is determined by the corresponding value in the incidence matrix: 1 means a direct connection and 0 means no connection. We then build a graph convolutional neural network with two hidden layers, whose structure is shown in Figure 2.
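The layer-wise propagation of such a network can be illustrated with a self-contained NumPy sketch (this stands in for the DGL implementation; the function names, the use of the symmetrically normalized adjacency with self-loops, and the softmax output are standard GCN conventions assumed here, not details confirmed by the paper):

```python
import numpy as np

def normalized_adjacency(S):
    """Standard GCN propagation matrix: A_hat = D^{-1/2} (S + I) D^{-1/2},
    where self-loops are added and D is the resulting degree matrix."""
    A = S + np.eye(S.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return (A * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

def gcn_forward(S, X, W1, W2):
    """Two-layer GCN forward pass: ReLU activation in the hidden layer
    and a row-wise softmax producing the probability matrix I, where
    I[i, j] is the probability that cell i belongs to type j."""
    A_hat = normalized_adjacency(S)
    H1 = np.maximum(A_hat @ X @ W1, 0.0)             # hidden layer + ReLU
    Z = A_hat @ H1 @ W2                              # output logits
    expZ = np.exp(Z - Z.max(axis=1, keepdims=True))  # stable softmax
    return expZ / expZ.sum(axis=1, keepdims=True)
```

The predicted type of cell i is then `np.argmax(I[i])`. In the actual pipeline, the edge list of S_n would be handed to DGL and the weights trained on the labeled fraction of cells.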
Using the threshold division above, we obtain 4 initial incidence matrices S_n and take each S_n as an input. We randomly select 10% of the cell labels as the true labels; the remaining 90% of the cells have no labels. In the Chu dataset, the input dimension is 1018 × 1018, the activation function is ReLU, the hidden layer dimension is 256, and the dimension of the final output probability matrix I_n is 1018 × 7, where I(i, j) represents the probability that the i-th cell belongs to the j-th type. Finally, for each cell i we select the type j that maximizes I(i, j), i.e., I_max(i) = max{I(i, 1), I(i, 2), ⋯, I(i, 7)}, as the prediction. Table 2 shows the classification accuracy under different epochs and thresholds.
From Table 2, we can see that the GCN performs well at 75 epochs; from 75 to 100 epochs it shows a trend of convergence, with classification accuracy close to 90%. We then ask whether there is a way to make full use of the different S_n to obtain better performance.

GCN Based on Matrix Aggregation.
To solve this problem, we build a consensus matrix P that minimizes the distance to the incidence matrices obtained under the different thresholds [33,34]:

P = argmin_P Σ_n Σ_{i,j} |P_ij − S_n^{ij}|,

where P_ij is the value of the i-th row and j-th column of the consensus matrix P. Because the matrix dimension is high, directly searching for the minimum-distance matrix costs a great deal of time and memory. Since the values of each incidence matrix S_n are all 0 or 1, we can convert the problem of finding the minimum-distance matrix P into counting, for each position, the number of occurrences of 0 and 1 across the S_n. We use count_0 and count_1 to record the total occurrences of 0 and 1, and set each entry of P to the value that occurs more often.
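Under the assumption that the aggregation amounts to an element-wise majority vote (which is what the count_0/count_1 description implies; the function name is illustrative), the consensus matrix can be computed as:

```python
import numpy as np

def consensus_matrix(S_list):
    """Element-wise majority vote over binary incidence matrices:
    P_ij = 1 when 1 occurs at position (i, j) more often than 0,
    which minimizes the summed element-wise distance to the S_n.
    Ties (possible with an even number of matrices) resolve to 0 here."""
    count_1 = np.sum(S_list, axis=0)       # occurrences of 1 at each (i, j)
    count_0 = len(S_list) - count_1        # occurrences of 0 at each (i, j)
    return (count_1 > count_0).astype(np.int8)
```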
We take the minimum-distance matrix P as the input of the graph convolutional neural network for training and compare it with currently common semisupervised learning methods; the classification accuracy under different epochs is shown in Figure 3.
On the Chu dataset, SCMAG shows better performance than the other semisupervised methods. We also compare it with the GCN without matrix aggregation; the result suggests that matrix aggregation increases classification accuracy by nearly 5%.

Experiments and Results
To further demonstrate the performance of the proposed method, we apply SCMAG to the Patel, Xin, and Usoskin datasets. We train four classic semisupervised learning algorithms, label propagation, label spreading, self-training, and GCN, and compare SCMAG against them. After 25, 50, and 75 iterations, we obtain the final results; the classification accuracy is shown in Table 3. In the Patel and Xin datasets, at 25, 50, and 75 iterations, the accuracy of the GCN is higher than that of label propagation, label spreading, and self-training. When the number of iterations is small, the accuracy of SCMAG is lower than that of the GCN, but as the number of iterations increases, the accuracy of SCMAG gradually approaches and finally exceeds the GCN. In the Usoskin dataset, label spreading has the highest accuracy after 25 iterations, followed by SCMAG, but as the number of iterations increases, the GCN outperforms the first three methods, and SCMAG ultimately achieves the highest accuracy among the five methods. Overall, SCMAG is the most accurate of the compared methods for cell identification.

Conclusion
Single-cell RNA sequencing technology has contributed greatly to the identification of single-cell types, but single-cell datasets often contain large amounts of high-dimensional data, and identifying cell types usually takes a great deal of time. Whether the remaining cell labels can be inferred from only a portion of labeled single-cell data is therefore a direction worthy of research. In recent years, some semisupervised learning methods have begun to be used for single-cell data analysis.
In this study, we have proposed SCMAG for the classification of cells. In contrast to the conventional graph convolutional neural network, we divide the similarity matrix by different thresholds to obtain different incidence matrices and then construct a minimum-distance consensus matrix, which makes full use of the high-dimensional information in the cells and better reflects their characteristics. We also test the cell classification accuracy of several commonly used semisupervised learning methods, label propagation, label spreading, self-training, and plain GCN, under the same conditions. SCMAG shows the best average classification accuracy compared with the four competing approaches.
Although SCMAG considerably improves the identification of cell types, there remains room for improvement, and several problems are still open. For example, when a single-cell dataset contains a large number of cells, saving the incidence matrix as a graph costs a lot of time, and how to choose the thresholds is also a question worth studying. In future work, we will focus on these questions and hope to achieve more promising results.

Data Availability
The datasets supporting the conclusions of this article are available in the GEO database repository under accession numbers GSE75748, GSE57872, GSE81608, and GSE59739. The Python codes for our SCMAG method are available from the corresponding author on reasonable request.

Conflicts of Interest
The authors declare no conflict of interest.