Network Threat Detection Based on Group CNN for Privacy Protection



Introduction
Application scenarios for the IoT are becoming increasingly mature, which brings people to a new digital lifestyle by connecting everything [1]. However, as the IoT scope and scale continue to expand, the threat of network intrusion has become more serious than ever before [2,3]. Malicious software, DDoS attacks, vulnerability attacks, and other attacks always occur in the IoT cyberspace, which inevitably leads to privacy leaks [4][5][6]. These attacks harm not only physical terminal equipment but also people's lives and property [7].
In the IoT, there are three major security and privacy challenges: terminal authentication, network attack prevention, and personal data protection [8,9]. In terms of privacy challenges, blockchain-enabled technology that uses encryption algorithms can protect private data from leakage [10][11][12]. In terms of network attack prevention, network threat detection technology is required to find network intrusions and meet the demand for IoT assurance. In this situation, intrusion detection [13], malicious code detection [14], malware detection [15], malicious URL detection [16], and vulnerability mining [17] based on machine learning algorithms are considered to be effective measures for detecting network threat behavior. With the upgrading of attacks and the increase in network security data, traditional machine learning methods are no longer suitable. At the same time, data analysis techniques and deep learning algorithms have developed rapidly and have been successfully applied to natural language processing, image recognition, and video detection [18,19]. In the field of network security, many research studies have used deep learning technology to detect network threats and have achieved good results [13,20].
Big data analysis techniques include oversampling imbalanced datasets, dimension reduction of high-dimensional data, and correlation analysis between features [21]. Correlation analysis studies the correlation coefficients among two or more random variables [22]. In probability theory, the correlation coefficient reflects how closely variables are linearly related. The range of the correlation coefficient is [−1, 1]. The closer the absolute value of the correlation coefficient is to 1, the stronger the linear relationship between the two variables is. In contrast, the closer the absolute value of the correlation coefficient is to 0, the weaker the linear relationship between the two variables is. Therefore, we use a correlation coefficient matrix to measure the relationships among the column vectors in the data matrix.
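As a minimal sketch of such a correlation coefficient matrix, the following pure-Python example computes Pearson coefficients among the columns of a small data matrix. The function names, the toy data, and the convention of returning 0 for a constant column are assumptions made for illustration, not part of the proposed method.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumed convention for constant columns
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(rows):
    """Correlation coefficients among the column vectors of a data matrix."""
    columns = list(zip(*rows))
    return [[pearson(ci, cj) for cj in columns] for ci in columns]


# Toy data (hypothetical): 4 samples, 3 features.
data = [
    [1.0, 2.0, 9.0],
    [2.0, 4.1, 7.0],
    [3.0, 6.2, 5.5],
    [4.0, 7.9, 3.0],
]
R = correlation_matrix(data)
for row in R:
    print([round(v, 3) for v in row])
```

In this toy matrix the first two columns are almost proportional, so their coefficient is close to 1, while the third column decreases as the first increases, so its coefficient with the first is close to −1.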
Machine learning algorithms are divided into shallow learning and deep learning [13,22,23]. Shallow learning is treated as a traditional machine learning technique that performs well on small amounts of data. Shallow learning algorithms, including support vector machines (SVMs), random forests (RFs), decision trees (DTs), and K-means algorithms, have been employed to distinguish abnormal data from network activities [13,20]. Compared with rule-based intrusion detection systems (IDSs), shallow learning methods do not rely on domain knowledge and can extend their generalization ability to detect attack variants and unknown attacks. However, shallow learning is no longer suitable for addressing the complexity of the dataset and the diversity of the features [13]. In this situation, deep learning is required.
Deep learning, also known as deep neural networks (DNNs), is designed from hierarchical structures composed of multiple neural layers [24]. Deep learning can extract and learn information to generate the reconstruction features from the input raw data through layer-by-layer neural processing. Benefitting from their feature reconstruction characteristics, deep learning algorithms, including CNNs, recurrent neural network (RNNs), and generative adversarial networks (GANs) [25][26][27][28][29][30][31] have been widely used not only for visual recognition and language understanding but also for network threat detection. Studies in [32] show that deep learning algorithm-based methods can achieve better performance when working on reconstructed features.
CNN, as one of the typical DNN models, was first proposed to solve the problem of 2D image recognition. 2D CNNs have been successfully used to learn and reconstruct features from raw data and have developed into the dominant approach for recognition and detection tasks in image and speech analysis [33]. Owing to the good learning characteristics of CNNs, the 1D CNN has been derived from the 2D CNN to address 1D signals and has achieved superior performance with high efficiency [34,35]. To adapt to the data characteristics of 1D signals, the hierarchical architecture of 1D CNNs is simplified compared with that of 2D CNNs [36]. For example, in a 1D CNN, the convolution kernels and pooling filters are 1D, whereas in a 2D CNN, they are 2D. Therefore, in both structure and running process, a 1D CNN is simpler than a 2D CNN [34], and we build a 1D CNN for analyzing network security data.
However, with the deepening of the network layers, the number of parameters increases exponentially [37]. For example, in a traditional basic 2D CNN, suppose that the size of the input image feature is $C \times H \times W$, the number of convolution kernels is $N$, the size of each convolution kernel is $K \times K$, and the size of the feature map is $M \times M$. The total number of parameters in all convolution layers is $C \times N \times (K \times K + 1) \times (M \times M)$. Obviously, the number of parameters is large. To reduce the number of parameters and improve the efficiency of the CNN, a group CNN is proposed to group the convolution kernels separately [38]. Suppose that the convolution kernels are divided into $T$ groups, the number of convolution kernels in each group is $N/T$, the size of each convolution kernel is $K' \times K'$, and the size of the feature map is $M' \times M'$. The total number of parameters in the convolution layers is $C \times (N/T) \times (K' \times K' + 1) \times (M' \times M')$. When grouping, the sizes of the convolution kernels and the feature maps are considered to be smaller, and the total number of parameters of all of the convolution layers is reduced. At the same time, the performance of the algorithm is improved. Therefore, we use a group CNN to address the big network data. When analyzing the security data for network threat detection, we determined that each threat behavior had 1D characteristics, which makes the threats similar to 1D signals. Additionally, the group CNN can improve the efficiency. Therefore, learning from the successful experience of using 1D CNNs to process 1D signals, we build a 1D group CNN model to perform feature learning and reconstruction of the security dataset.
In this paper, we combine shallow learning and deep learning algorithms to build a network threat detection model. First, correlation coefficients are computed to measure the relationships of the features. Then, we sort the correlation coefficients in descending order and group the data by columns. Second, a 1D group CNN model with multiple 1D convolution kernels and 1D pooling filters is built for feature learning and reconstruction. In each convolution layer and pooling layer, the convolution kernels and pooling filters are grouped. Third, the reconstructed features are input to the shallow learning models for threat prediction.
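To make the two parameter expressions given above for the basic and the grouped 2D CNN concrete, the following sketch evaluates them for one set of sizes. All numeric values (3 channels, 64 kernels, 4 groups, and the kernel and feature-map sizes) are hypothetical and chosen only for illustration; they are not settings used in this paper.

```python
def conv_params(channels, kernels, kernel_size, map_size):
    """C * N * (K*K + 1) * (M*M), the expression given for the basic 2D CNN."""
    return channels * kernels * (kernel_size * kernel_size + 1) * (map_size * map_size)


def group_conv_params(channels, kernels, groups, kernel_size, map_size):
    """C * (N/T) * (K'*K' + 1) * (M'*M'), the expression given for the group CNN."""
    if kernels % groups != 0:
        raise ValueError("the number of kernels must be divisible by the number of groups")
    per_group = kernels // groups
    return channels * per_group * (kernel_size * kernel_size + 1) * (map_size * map_size)


# Hypothetical sizes: C = 3, N = 64, K = 5, M = 28 for the basic CNN;
# T = 4 groups with smaller K' = 3 and M' = 14 for the group CNN.
basic = conv_params(channels=3, kernels=64, kernel_size=5, map_size=28)
grouped = group_conv_params(channels=3, kernels=64, groups=4, kernel_size=3, map_size=14)
print(basic, grouped)
```

With these assumed sizes, the grouped expression is smaller both because each group holds only N/T kernels and because the kernel and feature-map sizes are assumed to shrink.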
The proposed method includes the following advantages:
(1) Compared with the traditional basic 1D CNN, the proposed group CNN model with grouped convolution kernels and pooling filters reconstructs the features layer by layer and reduces the FLOP, parameters, and running time.
(2) The proposed data grouping, which is based on correlation coefficients between the features, can [...]
The remainder of this paper is organized as follows. Section 2 discusses the related work using shallow learning algorithms and deep learning technology in network threat detection. A description of the 1D group CNN model is provided in Section 3. Experimental results and analysis are presented in Section 4. The work is concluded in Section 5.

Related Work
Machine learning techniques, including shallow learning and deep learning algorithms, have been used for anomaly detection since the early 2000s and can automatically mine hidden information on the differences between normal and malicious behaviors.
Shallow learning algorithms, i.e., traditional machine learning algorithms, were previously applied to analyze system logs, malicious software, and network traffic and to output the predicted labels of the input data. By comparing the predicted labels with the true labels, the performance of the shallow learning algorithms can be evaluated. The most widely used algorithms include SVM, DT, naive Bayes (NB), and K-means [39,40]. Buczak et al. [39] surveyed machine learning and data mining methods, such as DT, SVM, RF, and NB, that were used for cybersecurity intrusion detection. Kruczkowski and Szynkiewicz [41] used SVM with kernels to build a malware detection model. The results revealed that SVM was a robust and efficient method for data analysis and that it increased the efficiency of malware detection. Bilge et al. [42] presented the EXPOSURE system to analyze large-scale, passive domain name service (DNS) data. The classifier was built with a J48 DT. The experimental results suggested that the minimum error was achieved by a decision tree. Aung and Min [43] used K-means and classification and regression tree (CART) algorithms to mine the KDD'99 dataset for intrusion detection. The experimental results showed that the hybrid data mining method could achieve good accuracy in performance analysis with time complexity. Mo et al. [44] discussed three data clustering algorithms, including K-means, fuzzy C-means (FCM), and expectation maximization (EM), to capture abnormal behavior in communication networks. The experimental results showed that FCM was more accurate.
More recently, deep learning technology has developed rapidly and has been successfully applied to a variety of tasks, such as natural language processing, image recognition, and computer vision [45]. CNNs, as typical DNN models, are feedforward neural networks with convolution calculations and deep structures, which can learn and reconstruct features more accurately and efficiently. According to the type of raw data, either a 1D CNN or a 2D CNN model is built. A 1D CNN is constructed to process one-dimensional sequence signal data and natural language, and a 2D CNN is constructed to address two-dimensional image and video data [36]. Because the CNN can learn and reconstruct features, both 1D CNN and 2D CNN are used for network threat detection.
Xiao et al. [46] proposed a network intrusion detection model based on a CNN. The original traffic data were reduced in dimensions through principal component analysis (PCA) and an autoencoder (AE), and then, the data were converted into a 2D image format. Next, the 2D data were input to the CNN model to evaluate the performance. Wang et al. [47] proposed a method that represented raw flow data as an image and used the CNN for classification and identification without manually selecting and extracting features. Experimental results showed that this method had high availability and high accuracy in malicious traffic identification. Zhang et al. [48] proposed a feature-hybrid malware variant detection approach based on 2D CNN and 1D BPNN. A 2D CNN was designed to compute the dot product and compress the dimension of the PCA-initialized opcode matrix. Experimental results showed that the method achieved more than 95% malware detection accuracy. Zhang et al. [49] proposed converting opcodes into a 2D matrix and adopted the CNN to train the 2D opcode matrix for malware recognition. Experimental results showed that their approach could significantly improve the detection accuracy by 15%. Yan et al. [50] proposed converting Android opcode into 2D gray images with fixed size and adopted a CNN to train and detect Android malware. Through the above literature, we can determine that the input data of the 2D CNN model must be converted into 2D data first.
Ma et al. [51] proposed a hybrid neural network composed of a 1D CNN and a DNN to learn the characteristics of high-dimensional network flows for network anomaly detection. Experimental results showed that the proposed method was better than other algorithms in comprehensive performance. Azizjon et al. [52] proposed a 1D CNN model to serialize the TCP/IP packets in a predetermined time range as an invasion Internet traffic model for the IDS. Experimental results showed that 1D CNN and its variant architectures had the capability to extract high-level feature representations and outperformed the traditional machine learning classifiers. Wei et al. [53] proposed a 1D CNN-based model to identify phishing websites from the URL address text, which was converted to a one-hot character-level representation. This model applied the 1D CNN in the same way as for analyzing natural language. Experimental results showed that the method was faster at detecting zero-day attacks. Zhang et al. [54] designed a flow-based intrusion detection model called SGM-CNN, which first integrated SMOTE and GMM to handle class imbalance and then used a 1D CNN to detect the network traffic data with high accuracy. Experimental results showed that SGM-CNN was superior to the state-of-the-art methods and effective for large-scale imbalanced intrusion detection.
The group convolution was first proposed and used in AlexNet by Krizhevsky et al. [37] to distribute the model over two GPUs and thereby handle the problem of insufficient memory.

Wireless Communications and Mobile Computing
AlexNet was designed as a [...] the group convolution method could increase the diagonal correlations between the convolution kernels, reduce the training parameters, and make the model less prone to overfitting. Zhang et al. [38] proposed interleaved group CNNs called IGCNets, which contained a pair of successive interleaved group convolutions, i.e., the primary group convolution and the secondary group convolution. IGCNets were wider than a regular convolution. Experimental results demonstrated that IGCNets were more efficient in parameters and computation complexity. Xie and Girshick [55] proposed a simple, highly modularized network architecture named ResNeXt, which was based on AlexNet and constructed by repeating a building block. The idea of ResNeXt was consistent with group convolutions. Without increasing the complexity of the parameters, the accuracy of the model could be improved, and the number of hyperparameters could be reduced. Lu et al. [39] proposed a novel repeated group convolutional kernel (RGC) to remove the filter's redundancy from group extent. SRGC-Nets worked well in not only reducing the model size and computational complexity but also decreasing the testing and training running time.
In the 2D CNN-based model, the input data are converted to the image format. In the 1D CNN-based model, the input data are treated as timing serial signals, similar to natural language. Compared with a 2D CNN, the structure of a 1D CNN is simpler, which makes the computational complexity lower. Therefore, we intend to learn from the experience of applying the 1D CNN to address the data and to construct a network threat detection model for feature learning and reconstruction.

Proposed Solution
The architecture of the proposed network threat detection model, which combines the 1D group CNN algorithm and machine learning classification methods, is shown in Figure 1. First, correlation coefficients are computed to measure the relationships between the features. Then, we sort the correlation coefficients in descending order and group the data. Second, a group CNN model with multiple groups of convolution kernels and pooling filters is built for feature learning and reconstruction. In the group CNN model, the input data are divided into multiple groups. Similarly, convolution kernels and pooling filters in each layer are divided into multiple groups. Each group of data is computed by each convolution kernel and is then computed by each pooling filter.
Finally, a concatenating layer is used to concatenate multiple groups of data to form one group of reconstructed data. Third, the reconstructed data are input to the shallow machine learning model for threat prediction. In the shallow machine learning model, traditional machine learning algorithms are used to identify normal or abnormal samples from the reconstructed data. Then, the accuracy, precision, recall, and F1, which are the detection performance indicators, are computed according to the statistics of the confusion matrix.

Group Convolutional Neural Network for Feature Reconstruction.
The convolutional neural network (CNN) is one of the representative algorithms for deep learning. It is a type of deep feedforward neural network that has convolution calculations [56]. CNNs have the capability of representation learning to generate reconstruction features. At the same time, through the convolution operation and pooling operation, a CNN can reduce the dimensions of the input data [57]. Additionally, grouped convolution kernels and pooling filters can reduce the number of parameters and improve the performance [39]. Therefore, we use 1D group convolution kernels to build a 1D group CNN model in this work.
The 1D group CNN includes multiple convolutional layers, multiple pooling layers, a full connection layer, a concatenating layer, and an output layer. In each convolutional layer, the convolutional kernels are divided into multiple groups. At the same time, in each pooling layer, the pooling filters are divided into multiple groups. The fully connected layer determines the dimensions of the reconstruction features of each group. The concatenating layer is used to concatenate the reconstruction features of each group to form the final results. The combination of multiple layers makes the group CNN output the low-dimensional

Feature Correlation.
In this work, we assume that the input data are $X$, containing $N$ independent $D$-dimensional samples. Usually, the malicious samples have similar values for the same features, and so do the benign samples. Thus, there are certain correlations between the features and the labels.
First, we calculate the correlations between the data features and labels based on the correlation coefficients to form a correlation coefficient matrix $R$. Then, we randomly select one row vector $R_i$ and rank its correlation elements in descending order. Furthermore, we divide the data by columns into $T$ equal groups according to the descending correlations. Usually, each group has the same number of features, which is $D/T$. The input data in the $t$th ($0 < t \le T$) group are expressed as $X_t = (x_{1,t}, x_{2,t}, \cdots, x_{n,t}, \cdots, x_{N,t})$. Thus, the correlation coefficients of the first group are the largest, and those of the last group are the smallest.
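The grouping step described above can be sketched as follows. This is an illustrative reading rather than the exact implementation used in the experiments: the function names are hypothetical, the correlation row is toy data, and the sketch assumes that $D$ is divisible by $T$.

```python
def group_features_by_correlation(corr_row, num_groups):
    """Sort feature indices by descending correlation and split them into equal groups."""
    d = len(corr_row)
    if d % num_groups != 0:
        raise ValueError("this sketch assumes D is divisible by T")
    order = sorted(range(d), key=lambda j: corr_row[j], reverse=True)
    size = d // num_groups
    return [order[i * size:(i + 1) * size] for i in range(num_groups)]


def split_columns(rows, groups):
    """Build one data block X_t per group by selecting that group's columns."""
    return [[[row[j] for j in group] for row in rows] for group in groups]


# Hypothetical row R_i of the correlation coefficient matrix for D = 6 features.
corr_row = [0.10, 0.90, -0.30, 0.55, 0.75, 0.05]
groups = group_features_by_correlation(corr_row, num_groups=3)

# Two toy samples with six features each.
rows = [
    [10, 11, 12, 13, 14, 15],
    [20, 21, 22, 23, 24, 25],
]
x_groups = split_columns(rows, groups)
print(groups)
print(x_groups[0])
```

Here the first group collects the two features with the largest coefficients (indices 1 and 4), and the last group collects those with the smallest.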

Group CNN.
After the data are grouped, we start to establish the group CNN model, which contains L convolution layers, L pooling layers, a full connection layer, a concatenating layer, and an output layer. Like the group counts of the input data, the convolution kernels and pooling filters in each layer are also divided into T groups. Further, there are M convolution kernels in each group.
Suppose that the $m$th ($0 < m \le M$) convolution kernel in the $t$th ($0 < t \le T$) group of the $l$th ($0 < l \le L$) convolution layer is expressed as $K_l^{m,t}$. Convolution operations are conducted between the convolution kernel and either the grouped data $X_t$ or the output $R_{l-1}^{m,t}$ of the previous pooling layer. Then, the activation function is applied to generate the feature maps. Suppose the feature map in the $t$th group of the $l$th convolution layer by the $m$th convolution kernel is $S_l^{m,t}$, which is expressed as follows:
$$S_l^{m,t} = \mathrm{ReLU}\left(\mathrm{conv1D}\left(K_l^{m,t}, R_{l-1}^{m,t}\right) + b_l^{m,t}\right), \quad (1)$$
where $\mathrm{ReLU}(\cdot)$ is the nonlinear activation function, $\mathrm{conv1D}(\cdot)$ is the 1D convolution function, $R_{l-1}^{m,t}$ is the output of the $m$th pooling filter in the $t$th group of the $(l-1)$th pooling layer, and $b_l^{m,t}$ is the bias of the $t$th group in the $l$th convolution layer.
After the convolution layer, a pooling layer not only reduces the dimensions of the feature maps from the preceding convolution layer to reduce the computational cost but also provides basic translation invariance. The $l$th pooling layer is immediately after the $l$th convolution layer. Suppose the $m$th pooling filter of the $t$th group in the $l$th pooling layer is $P_l^{m,t}$. The input data of the $l$th pooling layer are the output of the $l$th convolution layer, and the output data of the $t$th group in the $l$th pooling layer are $R_l^{m,t}$, which is expressed as follows:
$$R_l^{m,t} = \mathrm{maxpooling}\left(P_l^{m,t}, S_l^{m,t}\right), \quad (2)$$
where $\mathrm{maxpooling}(\cdot)$ is the pooling function. Max pooling is adopted in this paper.
The last pooling layer is connected to the full connection layer. After the convolution operations and pooling operations, the original data are converted into the feature maps. In the full connection layer, the $t$th feature map is mapped to the group reconstruction features $X_t'$ by a global convolution operation:
$$X_t' = \mathrm{conv1D}\left(K_{full}^{m,t}, R_L^{m,t}\right) + b_L^{m,t}, \quad (3)$$
where $K_{full}^{m,t}$ is the convolution kernel of the full connection layer and $b_L^{m,t}$ is the bias of the full connection layer.
Further, the fully connected layer is connected to the concatenating layer. The $T$ groups of the reconstructed features $X_t'$ are concatenated to form the final reconstructed features $X'$:
$$X' = \mathrm{concatenate}\left(X_1', X_2', \cdots, X_T'\right), \quad (4)$$
where $\mathrm{concatenate}(\cdot)$ is the function that concatenates the reconstruction features. The size of $X'$ is $N \times D'$. When $D'$ is less than $D$, the dimension of the reconstructed features is lower than that of the original data. In other words, the 1D CNN realizes the generation of reconstruction features and the dimension reduction of features.
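The grouped convolution, pooling, and concatenation steps described above can be illustrated with a small pure-Python forward pass for a single sample. This is a simplified sketch under stated assumptions: it uses one convolution layer and one pooling layer per group, fixed hypothetical kernel weights instead of trained ones, and it omits the full connection layer by simply flattening each group's pooled maps before concatenation.

```python
def conv1d(signal, kernel, bias=0.0):
    """Valid 1D convolution (cross-correlation form) with stride 1 and no padding."""
    width = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(width)) + bias
        for i in range(len(signal) - width + 1)
    ]


def relu(values):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


def group_forward(x_t, kernels, biases, pool_size):
    """Apply every kernel of one group to that group's data, then pool each feature map."""
    pooled_maps = []
    for kernel, bias in zip(kernels, biases):
        feature_map = relu(conv1d(x_t, kernel, bias))           # S in the text
        pooled_maps.append(max_pool1d(feature_map, pool_size))  # R in the text
    return [v for pooled in pooled_maps for v in pooled]        # flatten (simplification)


def group_cnn_forward(groups, params, pool_size=2):
    """Process each group independently and concatenate the group outputs."""
    outputs = [
        group_forward(x_t, kernels, biases, pool_size)
        for x_t, (kernels, biases) in zip(groups, params)
    ]
    return [v for out in outputs for v in out]


# One sample with D = 12 features split into T = 2 groups (toy values).
groups = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
]
# One hypothetical kernel per group: (list of kernels, list of biases).
params = [
    ([[-1.0, 0.0, 1.0]], [0.0]),
    ([[0.5, 0.5]], [0.0]),
]
x_reconstructed = group_cnn_forward(groups, params)
print(x_reconstructed)
```

In this toy run, 12 input features are reduced to 4 output values, which illustrates how grouped convolution and pooling shorten the representation before concatenation.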

Floating Point of Operations and Parameters.
Floating point of operations (FLOP) counts the multiplications and additions performed by the model, which is related to the overall running time of the model [58]. In this section, we calculate the FLOP and parameter counts of the group CNN. Because the group CNN is built on the basic 1D CNN, we first calculate the FLOP and parameter counts of the basic 1D CNN and then calculate those of the group CNN based on them.

FLOP and Parameter Counts of the Basic 1D CNN.
Suppose that the basic 1D CNN with fully connected layers is used for feature reconstruction. First, FLOP is computed. We assume that the input data are $X$, containing $N$ independent $D$-dimensional samples. In the basic 1D CNN, the number of the input convolution channels is $C_{in}$, the number of the convolution kernels is $M'$, and the size of the convolution kernels is $1 \times W_1'$. The size of the feature map of the convolution operation is $1 \times W_2'$. The number of the output convolution channels is $C_{out}$. The FLOP performed by a convolution layer is as follows: [...] where $(1 \times W_1' + 1)$ means that a multiplication is performed by one convolutional kernel sampling the input data, and $(+1)$ adds the bias.
$\times W_2'$ means the number of multiplications performed by one convolutional kernel to get the feature maps of the output convolution operation. The definition of $W_2'$ is $W_2' = (D + 2\,\mathrm{padding} - W_1')/\mathrm{stride} + 1$, where padding = 0 and stride = 1.
$\times M'$ means multiple convolutional kernels computing in the operation.
$\times M' \times W_1'$ means the number of additions from the feature map of the convolution operation to the output feature map of the convolution layer.
It is noted that the operations of $\mathrm{ReLU}(\cdot)$ and the pooling layers do not contain multiplication and addition operations. Therefore, the FLOP does not consider the operations of $\mathrm{ReLU}(\cdot)$ and the pooling layers.
$\times C_{in}$ and $\times C_{out}$ mean repeating the calculation over multiple input channels and output channels.
$\times N$ means repeating the calculation for all the samples.
The basic 1D CNN has $L$ convolution layers, so the FLOP of the basic 1D CNN model equals the sum of the FLOP of each convolution layer, which can be computed as follows: [...] Then, the bias term is ignored and the FLOP calculation formula (6) is written as follows: [...] It can be seen that FLOP is determined by the number of the samples, the number of the convolutional layers, the number of the convolutional kernels per layer, the size of each convolutional kernel, the length of the feature map of the convolution operation, and the number of the input and output convolution channels.
Next, we compute the parameter count of the basic 1D CNN. The parameter count is the total number of parameters, including weight parameters and bias parameters, that appear in the running process of the model. In the above basic 1D CNN, in the case of a single channel and a single convolution kernel, the number of parameters is $(W_1' + 1)$. When the number of the convolution kernels is $M'$ and the number of the convolution layers is $L$, the parameter count of the model is $\sum_{l=1}^{L} N \times C_{l,in} \times M_l' \times (W_{l,1}' + 1) \times C_{l,out}$. Then, the bias term is ignored and the parameter count calculation formula is written as follows:
$$\mathrm{Params} = \sum_{l=1}^{L} N \times C_{l,in} \times M_l' \times W_{l,1}' \times C_{l,out}. \quad (8)$$
It can be seen that the parameter count is determined by the number of the samples, the number of the convolutional layers, the number of the convolutional kernels per layer, the size of each convolutional kernel, and the number of the input and output convolution channels.
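As a small worked example of the parameter count expression given above for the basic 1D CNN, the following sketch sums the per-layer terms with and without the bias. The two-layer configuration and the dictionary keys are hypothetical and serve only to illustrate the arithmetic.

```python
def basic_cnn_param_count(layers, num_samples, include_bias=True):
    """Sum over layers of N * C_in * M' * (W1' + 1) * C_out, as written in the text.

    With include_bias=False the "+ 1" bias term is dropped.
    """
    total = 0
    for layer in layers:
        width = layer["kernel_width"] + (1 if include_bias else 0)
        total += (
            num_samples
            * layer["c_in"]
            * layer["kernels"]
            * width
            * layer["c_out"]
        )
    return total


# Hypothetical two-layer configuration (not the settings used in the experiments).
layers = [
    {"c_in": 1, "kernels": 8, "kernel_width": 5, "c_out": 1},
    {"c_in": 1, "kernels": 16, "kernel_width": 3, "c_out": 1},
]
with_bias = basic_cnn_param_count(layers, num_samples=1)
without_bias = basic_cnn_param_count(layers, num_samples=1, include_bias=False)
print(with_bias, without_bias)
```

For this configuration the count is 8 × 6 + 16 × 4 = 112 with the bias and 8 × 5 + 16 × 3 = 88 without it.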

FLOP and Parameter Counts of the Group CNN.
Like those of the basic 1D CNN, the FLOP and parameter count of the group CNN can be computed. Suppose that the input data are $X$, containing $N$ independent $D$-dimensional samples, which are divided into $T$ groups. This means that the dimension of each group is $D/T$. The numbers of the input and output convolution channels are $C_{in}$ and $C_{out}$. The structure of the group CNN contains $L$ convolution layers and $L$ pooling layers. There are $T$ groups of convolution kernels in each convolution layer, and the same holds for the pooling filters in each pooling layer. There are $M$ convolution kernels in each group. The size of each convolution kernel is $1 \times W_1$. The size of the feature map of the convolution operation is $1 \times W_2$. Therefore, the FLOP of each group is [...] where $W_2 = ((D/T) + 2\,\mathrm{padding} - W_1)/\mathrm{stride} + 1$, with padding = 0 and stride = 1.
The total FLOP of the model equals the sum of the FLOP of each convolution layer, which can be computed as follows: [...] Then, the bias term is ignored and the FLOP in formula (6) is optimized as follows: [...] Similarly, the parameter count of the group CNN can be computed as follows: [...] It can be seen that the FLOP and parameter count are determined not only by the number of the samples, the number of the convolutional layers, the number of the convolutional kernels per layer, the size of each convolutional kernel, and the number of the input and output convolution channels, but also by the number of groups.
Now, let us compare the FLOP and parameter count of the group CNN with those of the basic 1D CNN. From formula (7), formula (8), formula (11), and formula (12), we can find that many parameters determine the FLOP and parameter count, so we cannot compare them directly. However, we can assume some comparison conditions. Because the ratio of the input data length in the group CNN to that in the basic 1D CNN is $1/T$, we assume that the ratio of the convolutional kernel length in each layer of the group CNN to that of the basic 1D CNN is also $1/T$, that is, $W_{l,1} = W_{l,1}'/T$. According to the comparison of formula (7) and formula (11), it can roughly be seen that the FLOP of the group CNN is smaller than that of the 1D CNN. Similarly, according to the comparison of formula (8) and formula (12), it can roughly be seen that the parameter count of the group CNN is smaller than that of the 1D CNN. Actually, in the experiments, we set completely different values of the parameters for the two models to achieve the best feature representation effect. A more specific comparison of the results is given in Section 4.3.5.
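To illustrate the feature-map length expressions used above for the basic 1D CNN and for one group of the group CNN, the following sketch evaluates both with padding = 0 and stride = 1. The values of D, T, and the kernel width are hypothetical, the kernel width of the group CNN is set to 1/T of that of the basic 1D CNN as in the comparison condition, and integer division is an assumption of the sketch.

```python
def feature_map_length(input_length, kernel_width, padding=0, stride=1):
    """(input_length + 2 * padding - kernel_width) / stride + 1, with integer division."""
    return (input_length + 2 * padding - kernel_width) // stride + 1


# Hypothetical sizes: D = 40 features, T = 4 groups, basic kernel width W1' = 8.
D, T = 40, 4
w1_basic = 8
w1_group = w1_basic // T  # comparison condition: kernel width scaled by 1/T

w2_basic = feature_map_length(D, w1_basic)       # basic 1D CNN sees all D features
w2_group = feature_map_length(D // T, w1_group)  # each group sees D/T features
print(w2_basic, w2_group)
```

With these assumed sizes, each group produces a feature map of length 9, compared with 33 for the basic 1D CNN.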

Shallow Machine Learning Classifier.
Shallow machine learning has good performance and high efficiency. Therefore, in this work, we use SVM as a shallow machine learning algorithm to build the classification model and identify the malicious samples in the dataset.
Shallow machine learning consists of two stages: the training stage and the testing stage [59]. In the training stage, the high-dimensional original dataset is reconstructed into low-dimensional features by training the group CNN. Then, the dataset containing low-dimensional reconstruction features is input to the shallow machine learning classifier to train and obtain the optimal model structure. In the testing stage, the high-dimensional original testing dataset is input to the trained group CNN model to obtain the low-dimensional reconstructed features [60,61]. Then, the dataset containing low-dimensional reconstructed features is input to the trained shallow machine learning classifier to get the predicted labels of the testing data.
In the experiment, the true labels of the testing dataset have been known, so the performance of the shallow machine learning models, such as accuracy, precision, recall, and F1, can be obtained by comparing the true labels with the predicted labels and calculating the confusion matrix.
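As a rough illustration of this evaluation step, the following sketch counts the confusion matrix entries from true and predicted labels and derives accuracy, precision, recall, and F1 using their standard definitions. The label vectors are toy data, the function names are hypothetical, and returning 0 when a denominator is zero is an assumed convention.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FN, FP, TN) for binary labels."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def evaluation_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall, and F1 from the confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels: 1 = abnormal (positive class), 0 = normal.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
scores = evaluation_metrics(*counts)
print(counts, scores)
```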
The confusion matrix for binary classification includes four index items: true positive (TP), false negative (FN), false positive (FP), and true negative (TN). Then, the other evaluation metrics are defined as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$
$$\mathrm{Precision} = \frac{TP}{TP + FP},$$
$$\mathrm{Recall} = \frac{TP}{TP + FN},$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
[...] Table 1.
KDDCUP99 [61] is the most famous and frequently cited dataset on intrusion detection. The whole dataset is very large and is classified into 5 classes. In our work, we randomly extract a small part of it and use only 2 classes, consisting of the normal and abnormal samples. The dataset contains 41 features, which are divided into 4 categories: 9 basic features of the TCP, 13 content features of the TCP, 9 statistical features of the traffic based on time, and 10 statistical features of the traffic based on host.
CICMalDroid2020 [62] is downloaded from the website of the Canadian Institute for Cybersecurity datasets. The original dataset contains 5 categories of Android samples. In our work, we use the whole banking dataset, which contains 2100 malware samples, and the whole benign dataset, which contains 1795 benign samples. CICMalDroid2020-139 consists of 139 extracted features including the frequencies of system calls. CICMalDroid2020-470 consists of 470 extracted features including the frequencies of system calls, binders, and composite behaviors.
For most machine learning-based classification tasks, imbalanced datasets can bias the classification surfaces of the classifiers toward the majority class, which leads to the misclassification of the minority class. Generally, the network threat data are treated as the minority class. Therefore, in our experiment, the ratios of "Normal" to "Abnormal" instances in all three datasets are close to 1, which avoids the imbalance problem.

Machine Learning Classifiers.
There are many shallow machine learning classifiers, e.g., NB, RF, and LR. Through our previous experimental results and analysis of the existing literature, we find that SVM is the most commonly used classifier.
SVM has many advantages: (1) It has good stability, which in many cases can maintain good classification performance. (2) It can deal with noise and outlier data well by introducing relaxation variables. (3) It can effectively solve the problem of nonlinear and high-dimensional data. (4) It can keep good classification efficiency and effect for small datasets.
In this section, the performances of the reconstructed features at different ratios are compared. According to the output size of the fully connected layer, the dimensions of the reconstructed features are different. To identify the performance of the reconstructed features, the lengths of the reconstructed features are set according to different situations. Specifically, the ratios of the reconstructed feature length to the original data length are set to 5%, 10%, 15%, 20%, 25%, and 30%. First, the original data are input to group CNN models to generate the reconstructed features. Second, the data composed of reconstructed features are input to SVM, and then, the accuracy, precision, recall, and F1 are computed to evaluate the performance of the reconstructed features. The performances of the reconstructed features at different ratios are plotted in Figure 2. In addition, it should be noted that the number of iterations of the group CNN algorithms is 1000. The recorded results are the average of 5 experiments.
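To give a sense of the reconstructed feature lengths implied by these ratios, the following sketch converts each ratio into a feature count for the three dataset dimensions named in this paper (41, 139, and 470). The rounding rule (round half up, with a minimum of one feature) is an assumption of the sketch, since the exact lengths used in the experiments are not stated here.

```python
RATIOS_PERCENT = (5, 10, 15, 20, 25, 30)
DATASET_DIMS = {
    "KDDCUP99": 41,
    "CICMalDroid2020-139": 139,
    "CICMalDroid2020-470": 470,
}


def reconstructed_length(original_length, ratio_percent):
    """Feature count for a given ratio, rounded half up (assumed rule), at least 1."""
    return max(1, (original_length * ratio_percent + 50) // 100)


lengths = {
    name: [reconstructed_length(dim, r) for r in RATIOS_PERCENT]
    for name, dim in DATASET_DIMS.items()
}
for name, values in lengths.items():
    print(name, values)
```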
From the curves of accuracy, precision, recall, and F1 at different ratios in Figure 2, we can draw some conclusions. First, the performance of the reconstructed features at some low ratios is better than that of the original data, whose ratio is 100%. This is most evident on the KDDCUP99 dataset. Therefore, it is necessary to reduce the data dimensions by using the group CNN to reconstruct the features, which cannot reduce the [...] For example, KDDCUP99 is a low-dimensional dataset, whose highest accuracy and F1 are at 15%. CICMalDroid2020-139 is a middle-high-dimensional dataset, whose highest accuracy and F1 are at 10%. Meanwhile, CICMalDroid2020-470 is a high-dimensional dataset, whose highest accuracy and F1 are at 5%. To sum up, we can conclude that the reconstructed features help to reduce the data dimensions and improve the performance.
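The four evaluation metrics used above, and the averaging over repeated experiments, can be sketched as follows. This is a minimal illustration in plain Python, not the code used in the experiments; the function names `binary_metrics` and `average_runs` and the choice of label 1 as the positive ("Abnormal") class are assumptions made for the example.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary task.

    `positive` is the label treated as the positive class
    (assumed here to be the "Abnormal" class).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def average_runs(run_metrics):
    """Average each metric over repeated runs (e.g., 5 experiments)."""
    keys = run_metrics[0].keys()
    n = len(run_metrics)
    return {k: sum(m[k] for m in run_metrics) / n for k in keys}
```

Calling `binary_metrics` once per run and passing the resulting list to `average_runs` mirrors the reporting of averaged results described in the text.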

Comparison of the Group CNN and the Basic 1D CNN.
Both the group CNN and the basic 1D CNN can reconstruct features. In this part, we compare the performance of the features reconstructed by these two methods. First, the original data are input to the group CNN and the basic 1D CNN models, respectively, and reconstructed features at ratios from 5% to 30% are generated. Second, the data composed of the reconstructed features are input to SVM, and the accuracy is computed to evaluate the performance of the reconstructed features. The parameters of the network structures are shown in Table 2. The performance of the reconstructed features at different ratios is recorded in Table 3. The number of iterations of the CNN algorithms is 1000, and the recorded results are the average of 5 experiments.
The original data are directly input to SVM, and the accuracy is recorded in the last column of Tables 3(a) and 3(b). The accuracy of the reconstructed features at ratios from 5% to 30% is recorded in the other columns. Comparing the results in Table 3(a), we find that in some situations the accuracy of the features reconstructed by the basic 1D CNN is higher than that of the original data. KDDCUP99 achieves the highest accuracy at 25%, and CICMalDroid2020-139 achieves the highest accuracy at 10%, whereas CICMalDroid2020-470 achieves the highest accuracy with the original data. Comparing the results in Table 3(b), we find that the accuracy of the features reconstructed by the group CNN is higher than that of the original data. KDDCUP99 achieves the highest accuracy, 0.9764, at 15%. CICMalDroid2020-139 achieves the highest accuracy, 0.8091, at 10%. CICMalDroid2020-470 achieves the highest accuracy, 0.8111, at 5%. Comparing the results in Table 3(a) with those in Table 3(b), we find that the accuracy of the group CNN is generally higher than that of the basic 1D CNN, and the highest accuracy of each dataset in Table 3(b) is higher than that in Table 3(a). Furthermore, the datasets reach the highest accuracy with the group CNN at lower ratios. For example, KDDCUP99 reaches the highest accuracy with the group CNN at 15%, but with the basic 1D CNN at 25%. Finally, we can conclude that the group CNN performs better than the basic 1D CNN, mainly because grouping the data based on feature correlation improves the cohesion of the data within each group.

Training Loss of the Group CNN.
During the training stage, the training loss is computed with the cross-entropy loss function, which measures how close the predicted label probabilities of the reconstructed features are to the real labels. The smaller the training loss is, the closer the predicted labels are to the true labels of the data. In this section, we study the trend of the training loss of the group CNN. KDDCUP99 and CICMalDroid2020-139 are each divided into two groups, while CICMalDroid2020-470 is divided into four groups. The grouped data are separately input to the group CNN to train the models. Then, reconstructed features at ratios from 5% to 30% are generated. During the training of the group CNN, the loss of each iteration is recorded and plotted in Figure 3. The number of iterations in the training stage is 1000.
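To make the behavior of the cross-entropy loss concrete, the following minimal sketch computes the mean loss from raw class scores (logits). It uses only the standard library; the function names and the use of integer class indices as labels are assumptions for illustration, and the sketch is not the training code of the group CNN.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax of one score vector."""
    m = max(logits)
    log_sum = math.log(sum(math.exp(z - m) for z in logits))
    return [z - m - log_sum for z in logits]


def cross_entropy(logits_batch, labels):
    """Mean cross-entropy over a batch.

    logits_batch: list of per-sample score vectors (one score per class).
    labels: list of integer class indices (the true labels).
    """
    losses = [-log_softmax(logits)[label]
              for logits, label in zip(logits_batch, labels)]
    return sum(losses) / len(losses)
```

A confident correct prediction yields a loss near 0, while an uninformative prediction over two classes yields log 2, which is why curves approaching 0 indicate predictions close to the true labels.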
From the curves in Figure 3, on the one hand, we find that some training loss curves of the grouped data are close to each other and approach 0. For example, in Figure 3(a), the training loss curves of the 20% reconstructed feature data of KDDCUP99, which are divided into two groups, are close to each other. So are the training loss curves of the 15% reconstructed feature data of CICMalDroid2020-139 in Figure 3(b), and the training loss curves of the 5% reconstructed feature data of CICMalDroid2020-470 in Figure 3(c). Furthermore, the ratios of the closer training loss curves in Figure 3 are the same as those of the highest accuracy in Table 3(a). On the other hand, we find that when the curves converge, the training loss curve of group 1 lies below that of group 2 in Figures 3(a) and 3(b), and the same ordering holds in Figure 3(c), where the loss curve of group 1 is at the bottom and the loss curve of group 4 is at the top. That is because the data are grouped based on the feature correlation. We first calculate the correlations between features and rank them in descending order. Then, we divide the data equally into several groups according to the descending correlation coefficients. Thus, the correlation coefficients of the first group are the largest, and those of the last group are the smallest. Therefore, the loss of the reconstructed features is smaller when the correlation coefficients are larger.
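The grouping step described above can be sketched with NumPy as follows. The text does not state how a single ranking score is derived from the correlation matrix, so this sketch assumes each feature is scored by its mean absolute Pearson correlation with all other features; that scoring rule and the function name `group_features_by_correlation` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def group_features_by_correlation(X, n_groups):
    """Split column indices of X into n_groups by descending correlation.

    X: array of shape (n_samples, n_features), with n_features >= 2.
    Returns a list of index lists; the first group holds the features
    with the largest scores, the last group the smallest.
    """
    corr = np.corrcoef(X, rowvar=False)      # (d, d) Pearson matrix
    corr = np.nan_to_num(corr)               # constant columns give NaN
    np.fill_diagonal(corr, 0.0)              # ignore self-correlation
    d = corr.shape[0]
    # Assumed scoring rule: mean absolute correlation with other features.
    score = np.abs(corr).sum(axis=1) / (d - 1)
    order = np.argsort(-score, kind="stable")  # descending by score
    # Nearly equal-sized groups, in ranking order.
    return [g.tolist() for g in np.array_split(order, n_groups)]
```

Each returned index list can then be used to slice the columns of the data matrix, so that every group is fed to its own branch of the model.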

Comparison of the Dimension Reduction Algorithms.
The group CNN can reconstruct features and reduce the dimensions of the features. Therefore, the group CNN can be seen as a dimension reduction algorithm. At present, there are many dimension reduction algorithms, such as PCA, FA, ICA, and SVD. In this section, we choose PCA and SVD to compare with the basic 1D CNN and the group CNN. As in Section 4.3.1, the dimensions of the original data are first reduced by the dimension reduction algorithms to 5%, 10%, 15%, 20%, 25%, and 30%, separately. Then, the dimension-reduced data are input to SVM. Accuracy and F1 are calculated to evaluate the performance of the low-dimensional data. The accuracy and F1 of the dimension reduction algorithms are shown in Figure 4. The number of iterations of the basic 1D CNN and the group CNN algorithms is 1000, and the recorded results are the average of 5 experiments.
From the accuracy and F1 of the different dimension reduction algorithms in Figure 4, we can draw some conclusions. First, for a low-dimensional dataset, such as KDDCUP99, the highest accuracy and F1 occur at high ratios. For a high-dimensional dataset, such as CICMalDroid2020-470, the highest accuracy and F1 occur at low ratios. Furthermore, the highest accuracy and F1 at the low ratios are even higher than those of the original data. Therefore, we think that it is necessary to reduce the data dimensions with dimension reduction algorithms. Second, the accuracy and F1 of the group CNN are the highest at the different ratios. Therefore, we conclude that the group CNN is the best dimension reduction algorithm among those compared. At the same time, the accuracy and F1 of the basic 1D CNN are lower than those of the group CNN, but higher than those of PCA and SVD, which are traditional methods. Thus, the results of the deep learning methods are better than those of the traditional methods, and we suggest applying deep learning algorithms to reduce the dimensions.
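For reference, the two traditional baselines can be sketched with NumPy as below, reducing the data to a given ratio of the original dimension. The rounding rule in `target_dim` and the function names are assumptions, since the text does not specify how fractional dimensions are handled; this is an illustration of PCA and truncated SVD, not the code used to produce Figure 4.

```python
import numpy as np


def target_dim(n_features, ratio):
    """Reduced dimension for a ratio such as 0.05 (assumed: round, min 1)."""
    return max(1, int(round(n_features * ratio)))


def pca_reduce(X, ratio):
    """Project mean-centered data onto the top-k principal directions."""
    k = target_dim(X.shape[1], ratio)
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T


def svd_reduce(X, ratio):
    """Truncated SVD: like PCA but without mean-centering."""
    k = target_dim(X.shape[1], ratio)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:k].T
```

The reduced matrices returned by either function would then be passed to the classifier in place of the original features.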

Comparison of Running Time.
In theory, we have already proved that the parameter counts and FLOP of the group CNN are smaller than those of the basic 1D CNN. In this section, we compare the FLOP, parameter counts, and running time of the basic 1D CNN and the group CNN. The basic 1D CNN and the group CNN are built with different structures to analyze the datasets. In particular, the numbers of layers and the parameters of each layer are shown in Table 4.
The basic 1D CNN and the group CNN have similar structures when dealing with the same dataset. It should be noted that the count of convolutional kernels in each layer of the basic 1D CNN is equal to that of the group CNN, which means that the count of convolutional kernels in each layer of the basic 1D CNN is equal to the number of groups multiplied by the count of convolutional kernels in each group. While the models analyze the data, the running time is recorded, and the FLOP and parameters are computed. Table 4 shows the structures, FLOP, parameters, and running time of the basic 1D CNN and the group CNN. It is easy to see that the more layers a structure has, the larger the FLOP, parameters, and running time are in Tables 4(a) and 4(b). Moreover, the FLOP, parameters, and running time of the group CNN in Table 4(b) are less than those of the basic 1D CNN in Table 4(a) when the two CNN models deal with the same datasets. In particular, the FLOP, parameter counts, and running time of the group CNN decrease more on CICMalDroid2020-470 than on KDDCUP99 and CICMalDroid2020-139. We may therefore infer that the larger the group count is, the greater the reduction in FLOP, parameter counts, and running time. It should be noted that the structures of the basic 1D CNN and the group CNN in this section are set only to compare the running time and are not used in other sections. In the other sections, the structures of the basic 1D CNN and the group CNN are set to obtain the highest performance and are entirely different from those in this section.
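The effect of grouping on cost can be illustrated with the standard accounting for a single grouped 1D convolution layer. The sketch below assumes no padding, counts multiply-accumulate operations as the FLOP measure, and uses the hypothetical helper name `conv1d_cost`; it follows the textbook grouped-convolution formulas and may differ from the derivation and the values reported in Table 4.

```python
def conv1d_cost(in_len, in_ch, out_ch, kernel, groups=1, stride=1, bias=True):
    """Parameter count and multiply-accumulate count of one 1D conv layer.

    With `groups` > 1, each output channel only sees in_ch / groups
    input channels, so weights and operations shrink by that factor.
    """
    if in_ch % groups or out_ch % groups:
        raise ValueError("in_ch and out_ch must be divisible by groups")
    out_len = (in_len - kernel) // stride + 1       # no padding assumed
    weights = (in_ch // groups) * out_ch * kernel   # kernel weights
    params = weights + (out_ch if bias else 0)
    macs = out_len * weights                        # one MAC per weight per output position
    return {"out_len": out_len, "params": params, "macs": macs}
```

For example, with 32 input channels, 64 kernels of size 3, and an input length of 100, moving from one group to four groups divides the weight count and the operation count by four while keeping the total number of kernels unchanged.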

Conclusions
In this paper, we present a 1D group CNN model to reconstruct features and reduce dimensionality. Its main characteristic is that the data are grouped based on feature correlations, which means that the data are grouped by column, while in the CNN model the grouping is applied to the convolution kernels. In summary, first, compared to using all features, our group CNN achieves the best performance with fewer features. Second, the group CNN outperforms the basic 1D CNN on the reconstructed features at different ratios. Third, compared to the dimension reduction algorithms, the accuracy and F1 of the group CNN are the highest. Fourth, compared to the basic 1D CNN, the FLOP, parameters, and running time of the group CNN are lower. Therefore, across all of the evaluated aspects, the group CNN spends less time but achieves better performance with fewer features.

Data Availability
The datasets used to support the findings of this study can be downloaded from the public websites whose references are provided in this paper. The datasets are also available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.