A Multimodal Network Security Framework for Healthcare Based on Deep Learning

As networks become closely tied to people's daily lives, network security has become an important factor affecting human physical and mental health. Network flow classification is the foundation of network security: it underpins network services such as security maintenance, network monitoring, and quality of service (QoS). This field has therefore long been a hot spot of academic and industrial research. Existing studies have shown that, with appropriate data preprocessing, machine learning methods can classify network flows; most of them, however, rely on manually crafted, expert-originated feature sets, which are time-consuming and laborious to build. Moreover, using only the features extracted by a single model can easily make the model inefficient and prone to overfitting. To solve these problems, this study proposes a multimodal automatic analysis framework based on spatial and sequential features. The framework is based entirely on deep learning and extracts both types of features automatically, which makes it well suited to processing large volumes of flow information and improves the efficiency of network flow classification. Two variants of the framework are presented, based on pretraining and joint training, respectively, and their advantages and disadvantages in practice are analyzed. In the evaluation, the experimental results show that, compared with previous methods, the framework performs well in both accuracy and stability.


Introduction
The rapid development of the Internet has brought Internet technology into all aspects of people's lives. The quality of the network environment is closely related to the physical and mental health of human beings. At present, various types of malicious website traffic or apps push harmful information or use network viruses to infringe on privacy, so it is very important to classify network applications in a timely and effective way. Network flow classification refers to the use of an algorithm to construct a classification model that can classify the network flows of various applications. It is fundamental to providing network services such as network security, network monitoring, and quality of service (QoS). Therefore, this field has always been a hot spot in academic and industrial research [1]. With the continuous development of the Internet of Things, many devices are connected to the network, and network flow classification has become an important part of this scenario [2]. New network applications based on different devices keep emerging, and the information exchanged between applications has left the network facing high throughput and increasingly difficult flow classification. It is urgent to handle large volumes of complex network flows and to improve the efficiency of classification.
So far, scholars have proposed many different network flow classification techniques. These techniques fall into four main categories: port-based, deep packet inspection (DPI)-based [3], machine learning (ML) [4][5][6], and deep learning methods (deep learning is listed separately from traditional machine learning for clearer discussion). As network technology has developed, earlier classification techniques such as port detection are no longer adequate for current network flow classification. As data privacy has grown in importance, deep packet inspection is no longer favored by researchers and engineers. With the rise and vigorous development of artificial intelligence, intelligent classification has become an important research direction, and network flow classification based on machine learning and deep learning has emerged as the main approach in current research. This study summarizes the different machine learning methods and analyzes the main deep learning methods in order to propose a multimodal framework, which not only improves classification accuracy but also enhances the stability of the model. The main contributions of this study are as follows:
(i) Network flow classification based on spatial features and time series features is studied using visualization methods
(ii) A multimodal network flow classification method is proposed, which integrates different network flow features to improve the stability and accuracy of the classification model
(iii) A comprehensive analysis is given of the differences in structure and training methods of the two types of models in the multimodal framework, together with their advantages and disadvantages, supported by solid experiments on the ISCXVPN2016 dataset
The study is structured as follows. The second part summarizes the related work in the field of network flow classification, and the third part discusses the research methodology, including the data processing, the structure of the frameworks, and the differences between them. The fourth part presents the comparison and analysis of the experimental results. The fifth part gives the conclusion and future work. For convenience, the acronyms used in this article are listed in Table 1.

Related Work
As the basic task of many network services, network flow classification has always been a focus of research in academia and engineering. So far, network flow classification technologies fall into four main categories, namely, port-based, deep packet inspection, machine learning, and deep learning methods.
The earliest network flow classification technology uses the port number of the UDP or TCP protocol at the transport layer. This method is easy to implement and has low algorithmic complexity, so it is often used to detect applications on certain specified ports. However, with the diversification of applications and protocols, as well as the emergence of port hopping and port masquerading, this method is no longer reliable and can only be used as an auxiliary method. Many current network applications use port numbers that differ from the common ports to bypass operating system access control [7], while other network applications encapsulate different services into well-known streams such as HTTP-based streams or conversations. These operations usually reduce the accuracy of port-based network flow classification [7]. In fact, Madhukar and Williamson [8] showed that nearly 70% of network flows cannot be classified correctly using port-based identification.
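As a minimal sketch of the port-based idea described above, the following Python snippet looks up a flow's ports in a small table. The table `WELL_KNOWN_PORTS` and the function `classify_by_port` are hypothetical names introduced only for illustration, and the table is a tiny subset of standard port assignments; it is not taken from the cited works.

```python
# Hypothetical lookup table: a tiny illustrative subset of well-known ports.
WELL_KNOWN_PORTS = {
    22: "ssh",
    53: "dns",
    80: "http",
    443: "https",
}


def classify_by_port(src_port: int, dst_port: int) -> str:
    """Return the label of the first well-known port found, else 'unknown'."""
    for port in (dst_port, src_port):
        label = WELL_KNOWN_PORTS.get(port)
        if label is not None:
            return label
    return "unknown"


# A flow to port 80 is labeled by its port alone.
web_flow = classify_by_port(51514, 80)
# An application on a non-standard port escapes the lookup entirely.
hopping_flow = classify_by_port(51514, 6881)
# Any service tunnelled over port 443 receives the same label.
tunnelled_flow = classify_by_port(51514, 443)
```

The last two calls illustrate why the method is only auxiliary: the lookup cannot see what is actually carried on a port, so non-standard ports and encapsulated services defeat it.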
DPI refers to identifying the unique fingerprint characteristics reflected in the payload of each packet and then detecting specific network flows [9,10]. If the payload of a network flow matches certain features of a known application or protocol, the flow can be considered, with high probability, to belong to that application or protocol. For example, some traditional payload fingerprint features include "\GET" for HTTP, "0x13Bit" for BitTorrent P2P, "PNG"0x0d0a for MSN Messenger, "USERHOST" for IRC, "ARTICLE" for NNTP, and "SSH" for SSH Internet traffic [11]. Compared with the port-based method, accuracy is greatly improved. Although the DPI method is very accurate for network flow classification, it requires researchers to extract characteristic fingerprints of network flows and to maintain and update the existing fingerprint database from time to time, which is a very resource-consuming task.
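The payload matching described above can be sketched as a prefix search over a signature list. The three signatures below are assumptions chosen for illustration (the HTTP `GET` method, the BitTorrent handshake prefix, and the SSH version banner); they are not the fingerprint database of [11], and `SIGNATURES` and `classify_by_payload` are hypothetical names.

```python
# Illustrative payload signatures only; a real DPI fingerprint database is
# far larger and has to be maintained continuously.
SIGNATURES = [
    (b"GET ", "http"),
    (b"\x13BitTorrent protocol", "bittorrent"),
    (b"SSH-", "ssh"),
]


def classify_by_payload(payload: bytes) -> str:
    """Return the label of the first signature that prefixes the payload."""
    for prefix, label in SIGNATURES:
        if payload.startswith(prefix):
            return label
    return "unknown"
```

A payload that matches no signature, for instance an encrypted record, falls through to "unknown", which reflects the maintenance burden and the encryption limits noted in the text.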
In recent years, researchers have tried to use the statistical characteristics of network flows together with machine learning algorithms to classify network flows. Different network flows produce different traffic characteristics that can be used to distinguish them, such as the distribution of packet sizes. Among them, Moore and Zuev [12] proposed a method based on the naive Bayes principle that builds a Bayesian classifier for supervised learning; combined with the fast correlation-based filter (FCBF) algorithm and kernel estimation, the method can achieve 95% accuracy.
In [13], the authors used nearest neighbour and linear discriminant analysis (LDA) to classify various applications. The experimental results showed that supervised ML algorithms are also able to separate traffic into classes with encouraging accuracy. In [4], the authors used Bayesian network, C4.5 decision tree, naive Bayes, and naive Bayes tree methods to derive a common feature set for classification with different feature selection algorithms. In [14], the authors used a variety of algorithms to filter mislabeled data out of the original dataset to obtain a more accurate training set, and then retrained on the filtered data to obtain a more accurate and stable classifier. In [15], the authors extracted the unique characteristics of various applications during information exchange and realized a lightweight classification method for network applications.
Although classic machine learning has solved problems that port-based or DPI-based methods cannot, it also faces many new problems. The first problem is feature selection: machine learning methods rely on manually crafted, expert-originated feature sets, and choosing a feature set requires a lot of manpower. The second problem is feature extraction: a feature set that performs well on a specific dataset is not universally applicable in practice.
With the rapid development of deep learning in the field of artificial intelligence, researchers have tried to transfer deep learning methods that excel in computer vision and natural language recognition to the field of network flow classification. In [16], the authors used a neural network (NN) and a sparse autoencoder (SAE) network to classify traffic of specific network protocols and achieved high accuracy. In [17], the authors explored online traffic detection methods. The basic idea of that study is to employ a compact nonparametric kernel embedding method to convert early flow sequences into images that can be used to train a convolutional neural network (CNN); its accuracy exceeds 99%. Classification tasks can also be accomplished using the sequential information of network flows [18]. In [19], the authors investigated the classification and prediction performance of LSTM networks using real server-generated traffic streams; the experimental results showed that LSTM is able to classify and predict the occurrence of highly intensive traffic flows accurately. In [20], the authors used CNN and LSTM networks and their various combinations to detect network flows, and the classification accuracy for applications reached 96%. In [21], CNN and CNN & LSTM were used to classify mobile applications, mainly using the payload of the first few packets to achieve high accuracy. In [22], the authors focused on three practical problems, namely network bandwidth, network flow duration, and network flow detection, and proposed a multitask training method: the CNN network is first trained on the bandwidth and duration tasks and then on the network flow classification task. This training method achieved better results than the previous CNN & RNN method. In [23], the authors introduced the capsule network into the field of network flow classification and combined it with the advantages of the CNN & LSTM network to achieve high classification accuracy. Aceto et al. [24] provided a broad experimental analysis of a multimodal framework (CNN + LSTM) for the classification of encrypted mobile traffic. This work provides guidance for the subsequent exploration of multimodel fusion. Then, in [25], the authors further proposed a novel multimodel multitask deep learning approach and the DISTILLER classifier, which can solve different traffic classification problems simultaneously. Liu et al. [26] proposed a method that applies RNN to encrypted traffic classification. Moreover, their framework adds a multilayer structure that can explore sequential characteristics deeply, and the experimental results outperformed the state-of-the-art methods. In [27], the authors tried to use explainable artificial intelligence to improve multimodel behavior; the experimental results showed that the proposed method provides global interpretations rather than sample-based ones. Table 2 gives an overview of the literature cited above.

Research Methodology
3.1. Dataset. In this article, we used two different datasets: the USTC dataset provided in [28] and the ISCXVPN2016 dataset [29] provided by the Canadian Institute for Cybersecurity. The USTC dataset has 10 categories of normal traffic, such as FaceTime and Gmail, generated using IXIA BPS (a professional traffic simulator); this study uses this dataset for feature analysis of network flow classification. The ISCXVPN2016 dataset was captured at the University of New Brunswick and contains raw pcap files of several traffic types. The dataset provides labels at different levels of categorization, such as AIM chat, Gmail, Facebook, chat, and streaming. The ISCXVPN2016 dataset is publicly available to researchers, and this study uses it to conduct experiments and compare the experimental results. For more details on the captured traffic and the traffic generation process, refer to [29].

Method Background.
In the following sections, the background of the proposed framework is presented.

Feature Selection and Classification.
Network flow has an obvious hierarchical architecture: according to the general TCP/IP structure, the network flow is packaged into data units that are specific to each layer. The frame of the data link layer is the lowest-level data that can be studied: this layer receives data from the upper layer and breaks it down into a bit stream. The frame therefore contains the different types of distinguishable features in the network flow, so it is very important to take the data link layer frame as the basic research object for network flow classification. In practice, frames are easy to obtain, and all protocol packets passing through the network card can be captured directly. For example, Wireshark and tcpdump can capture any data packet of interest. In the field of traffic classification, machine learning methods based on traffic characteristics have greatly improved classification accuracy compared with previous methods [12,30]. Research on this type of method shows that the key to improving the accuracy of network flow classification lies in using a suitable classifier and in designing a flow feature set, based on the different types of traffic, that meets the classification specifications, as shown in Figure 1.
Computational Intelligence and Neuroscience

Spatial and Sequence Features. This part mainly analyzes the spatial and sequence features of network flow. In [16], the author proposed that deep learning methods can extract network flow features automatically, which suits the classification requirements better than manually crafted, expert-originated feature sets.
Inspired by the successful application of CNN in the field of image processing, the authors in [28] stated that network data can be represented by a matrix, as shown in Figure 2; that is, the flow sequence F_p(x) is transformed into a matrix M_p. In this way, there is a one-to-one correspondence between the specific matrix M_p and the network flow F_p(x). Each element of M_p at position (i, j) takes a value in the range 0-255, which is the same as the range of each pixel value of a grayscale image, so there is a one-to-one correspondence between the matrix M_p and the grayscale image T_p. Network flow classification is thus transformed into grayscale image classification, and network flow feature extraction is transformed into grayscale image spatial feature extraction.
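The flow-to-matrix mapping described above can be sketched in a few lines of Python. The function `flow_to_matrix`, its `side` parameter, and the truncate-or-zero-pad policy are assumptions made for illustration; the source states only that the flow bytes are arranged as a matrix whose entries lie in 0-255.

```python
def flow_to_matrix(flow_bytes: bytes, side: int) -> list[list[int]]:
    """Reshape a byte sequence into a side x side matrix of values 0-255.

    Assumption: the sequence is truncated or zero-padded to side * side
    bytes and filled row by row.
    """
    size = side * side
    fixed = flow_bytes[:size].ljust(size, b"\x00")
    return [list(fixed[row * side:(row + 1) * side]) for row in range(side)]


# Ten bytes placed in a 4 x 4 grid; the unused cells are zero (black pixels).
matrix = flow_to_matrix(bytes(range(10)), side=4)
```

Because every entry is a single byte, the matrix can be rendered directly as an 8-bit grayscale image, which is what lets an image classifier consume it.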
It should be noted that although the network flow classification task can be transformed into image classification, the extracted features are only the feature representation in the network flow graph, not the characteristics of the network byte stream. However, the graph structure information is still very useful for the feature analysis in this paper. On the one hand, if a feature is unique to the network flow, it must be expressed as specific pixels in the map that form a specific spatial structure, and this is the basis on which the CNN extracts the spatial features of the network flow. On the other hand, the area formed by the feature also reflects the focus of the model in the classification task, which provides a reference for the analysis of classification features in this paper.
The parameters of the convolutional layer consist of a set of learnable filters (convolution kernels), which capture the image features of the previous layer's output; another layer, the pooling layer, is mainly responsible for subsampling. Finally, a set of fully connected layers is often used to capture high-level features of the input.
This article uses the above architecture to learn features and classify the processed network flow. The model contains four convolutional layers (conv2d), two pooling layers (max pooling), three densely connected layers (dense), and one flattening layer (flatten); each data transformation in the model uses a normalization step (batch normalization).
(2) HAN Model Construction. Compared with converting the network flow into a graph and extracting spatial feature information to complete the classification task, it is more straightforward to use a recurrent network to classify the network flow and extract its sequence features. The experiments in this section follow the processing ideas of [31]: they classify the network flow with a hierarchical attention network (HAN) [32] and use visualization to display the feature distribution during classification.
This section adopts the byte-packet-stream processing mode, that is, one network flow label corresponds to a three-level data flow. This clearly hierarchical data structure is similar to the token-sentence-article structure. The processing mode of network flow classification is shown in Figure 3.
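One way to picture the byte-packet-stream hierarchy is as a fixed-size grid in which each row is a packet and each cell is a byte, analogous to sentences and tokens. The sketch below is an assumption for illustration only: the function `flow_to_hierarchy` and the limits `max_packets` and `max_bytes` are hypothetical, and the source does not state how packets are truncated or padded.

```python
def flow_to_hierarchy(packets, max_packets, max_bytes):
    """Arrange a flow as max_packets rows of max_bytes byte values each.

    Assumption: long packets and long flows are truncated, and short ones
    are zero-padded, so that every flow has the same shape.
    """
    rows = []
    for packet in packets[:max_packets]:
        rows.append(list(packet[:max_bytes].ljust(max_bytes, b"\x00")))
    while len(rows) < max_packets:
        rows.append([0] * max_bytes)
    return rows


# A flow of two packets shaped as 3 packets x 2 bytes.
flow = flow_to_hierarchy([b"\x01\x02\x03", b"\xff"], max_packets=3, max_bytes=2)
```

A hierarchical model can then attend first over the bytes within each row and afterwards over the rows, which mirrors the token-sentence-article analogy in the text.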
This experiment uses the USTC dataset and divides it into a training set, a validation set, and a test set, which account for 60%, 20%, and 20% of the resampled dataset, respectively.
Table 3 lists the classification results of the two models on the USTC test set. The table shows that both types of models achieve high accuracy on the classification task.
This study randomly selects 120 samples from the test data and uses Grad-CAM [33] to visualize the spatial features. The visualization results are shown in Figure 4, in which red represents features with a higher degree of activation and blue those with a lower degree. From the class activation maps, it can be seen that most of the features involved in classifying the network flow images are concentrated in the network flow header information, some are concentrated in the tail of the data, and a small number of maps also include features in the middle. Comparison with the original grayscale images shows that, for MySQL, the black tail indicates that there is no network flow data in that area; the data information is not used for feature extraction because MySQL does not have a unified representation in the grayscale image space, that is, feature extraction is more difficult. Compared with the information in the data, the structural information displayed by the map itself is more prominent, so the features used for classification are concentrated in the tail. A small number of features in the middle show the same distribution. For example, the middle features of Gmail are mostly concentrated on the boundary between the blank area and the data area, which also reflects the structural information of the graph. Furthermore, in the WoW class, although the network graph also presents obvious structural features, the structural features of the information of the WoW class are more obvious than those of the blank data area.
The next experiment randomly selects 10 samples of each type of network flow for attention visualization with the HAN model. Table 4 lists the visualization results for the 10 types of network flow in the dataset. The packet attention column shows the weight of each packet in the network flow (the shade of red indicates the value), and the byte attention column shows the internal data of each network stream in hexadecimal together with the weight of each byte, where blue indicates a larger weight and green a smaller weight.
It can be seen from the table that sequence features behave differently from the spatial features of convolutional networks when used to classify network flows. When there are multiple data packets in the same flow, a specific packet has a greater impact on the final classification, but the sequence characteristics of the other packets with less impact are similar to those of the packets with greater impact.
By visualizing the features of the two types of models, it can be found that the spatial features based on the network flow graph are similar to the network flow sequence features: from the perspective of feature distribution, the features with high weights are located in the first half of the network flow data. At the same time, there are differences between the two types of features. When the spatial features of the data are not enough to distinguish the types of network flows, the model prefers to use the spatial structure information of the graph, whereas the sequence information can only be extracted from the network flow itself. Therefore, both spatial features and sequence features can be used for network flow classification tasks, and the two types of features are distinct from each other.

Proposed Classification Method.
After obtaining the spatial and sequential features of the network flow, a natural question is whether classification accuracy can be improved if both types of features are used in the classification task. Based on the above two types of models, this study proposes a multimodal framework that uses the two types of features to classify the various types of network flow.
In [22,34,35], the authors proposed that, for the same network, finding the best set of auxiliary tasks improves traffic classification and should be treated similarly to hyper-parameter tuning. This idea of multitask training can be expressed as follows: for the same network, training on similar tasks separately can improve the efficiency with which the network completes the final task. Conversely, for the same task, a combination of different methods can be used to improve the training efficiency of the target task.

Multimodal Network Flow Classification Model.
Through the above analysis, we can extract spatial features with the CNN model and sequential features with the GRU model to build the multimodal framework. There are two main ways to build the framework.
(1) Model Pretraining (Model A): The Framework Based on Pretraining. The framework based on pretraining trains the networks separately and performs a preliminary selection of features, then integrates the selected features of each network and sends them to a secondary network for further screening, as shown in Figure 5.
The model in Figure 5(a) shows the structure of the multimodal classification model based on pretraining. First, the pcap file is segmented and the data is preprocessed into the input formats (image and byte stream), which are then sent to the convolutional layer and the downsampling layer to extract and simplify the data features. The spatial features and sequence features of the network flow are extracted through the GAP layer and the GRU layer, respectively, and the two types of extracted features are then sent to the feature fusion module to form fusion features. The dense layer and softmax are used to output the classification.
Figure 5(b) shows that the features used in Figure 5(a) are extracted from the two submodels through pretraining. Figure 5(c) shows that the model parameters in the first half of the model in Figure 5(a) are actually frozen and do not participate in the training of the entire model. The part that is trained consists mainly of the parameters of the feature fusion part.
Strategy of combination: during training, features of the same type are extracted layer by layer, and the model finally focuses on the features that are useful for the task. For example, in the field of image recognition and classification, a heat map [33] makes it easy to find the important features that help with the final task. However, this type of feature set is still redundant for the entire training task. Each feature in the mixed feature set does not necessarily contribute to the final classification task, and the same feature may have different weights in different models. Therefore, it is necessary to further filter the extracted features. To filter the combined features, this article adds a weight learning layer in the second step to assign the weight of each feature automatically.
Assuming that the feature set is θ and each feature is denoted θ_i, a weight is learned for each θ_i. We then calculate the dot product of the weights and the original features and recombine the result into a new feature set θ_τ. The classification task is implemented through a fully connected layer and a softmax layer.
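The reweighting step can be illustrated with a small Python sketch. Everything specific here is an assumption: the weights are taken to be a softmax over per-feature scores (in the framework these scores would be produced by a learned layer), the product is applied element by element, and the names `softmax` and `reweight` are hypothetical. It should not be read as the exact formulation used in the framework.

```python
import math


def softmax(scores):
    """Normalize raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def reweight(features, scores):
    """Scale each feature by its weight to form the new feature set."""
    weights = softmax(scores)
    return [weight * feature for weight, feature in zip(weights, features)]


# Equal scores give equal weights, so each feature is halved.
theta_tau = reweight([2.0, 4.0], [0.0, 0.0])
```

Under this assumption, a feature with a larger score keeps a larger share of its original magnitude, which is the filtering effect the weight learning layer is meant to provide.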
With this framework design, on the one hand, the framework can filter feature sets automatically, which meets the requirement of allocating different feature sets to different types of network flow. On the other hand, by analyzing the attention given to each feature, we can further study the importance of each feature to the classification task.
In summary, two types of features are extracted from the previously trained networks; after integration, they are sent to the second learner, which is trained to complete the final task. This training method is divided into two steps, and the performance of the first-step learners directly affects the second.
(2) Model Joint-Training (Model B): The Framework Based on Joint-Training. The main idea of this method is the combination of models, that is, the features learned by the two types of models are combined directly in one network to construct a wide and deep large-scale network model, as shown in Figure 6. The extracted features are more diverse and accurate than those of a single network and can be used directly for classification tasks. The framework needs only single-step training to obtain a usable model.
(3) Comparison of Two Types of Frameworks. Although both models are based on the idea of integrating mixed features, they differ considerably in framework construction and training method.
Framework Construction. Model pretraining can actually be divided into three models: two basic models, namely the CNN model for extracting spatial features and the GRU model for extracting sequential features, and a secondary model. In the secondary model, it is necessary to design a proper extraction strategy for the input data. In this study, an attention-like mechanism is used to learn the feature weights automatically. Model joint-training is essentially a single model. Its defining characteristic is that it is wide and deep: in terms of width, multiple models are integrated to expand the longitudinal direction of the framework, and in terms of depth, the features extracted by the basic models are relearned and trained to expand its horizontal direction.
Training Method. Model pretraining is trained in two steps. The first step is to train the basic models to complete the preliminary feature extraction. The second step is to send the extracted features to the secondary model to complete the final training task. In fact, this can be regarded as one training step, since the training task is ultimately completed on the secondary model, not on the basic models. Model joint-training needs only one training step, that is, the data is sent directly to the network for training, which makes it an end-to-end training model.
From a practical point of view, although model pretraining requires training three models, the basic models can be trained separately and in parallel, which is more flexible in actual operation. In contrast, although model joint-training appears to need only one training step, its time and hardware requirements are higher than those of model pretraining because of the high complexity of joint training and the inability to train in parallel.
It is worth mentioning that the multimodal framework idea differs from the ensemble strategies that are widely adopted in machine learning competitions. Ensemble strategies improve the efficiency of the ensemble model by reducing the bias and variance between basic models through adjusting the dataset or combining training results, as in boosting [36,37], bagging [38], and stacking. The multimodal framework instead integrates patterns extracted from different dimensions to produce the final result. It integrates the data from different perspectives to make the collected information more comprehensive, and it automatically assigns different weights to different features, which makes the framework more robust and efficient.

The Framework Cross-Validation Criteria.
To verify the reliability of the proposed framework, we used k-fold cross-validation on the training data to conduct experiments on different datasets. The specific implementation is given in Algorithm 1.
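As a generic illustration of k-fold cross-validation (not a reproduction of Algorithm 1), the following sketch yields the train and validation index lists for each fold. The function name `k_fold_indices`, the shuffling, and the fixed `seed` are assumptions made for this example.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Every sample is used for validation exactly once across the 5 folds.
splits = list(k_fold_indices(10, k=5))
```

In use, a model would be trained on `train` and scored on `validation` for each fold, and the k scores averaged to estimate stability as well as accuracy.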

Experiment
The experiment is divided into four steps.
(i) Converting raw data to trainable data
(ii) Network flow spatial feature learning, mainly using CNN to classify the grayscale images
(iii) Network flow sequential feature learning, mainly using a GRU network to classify network flow digital sequences
(iv) Hybrid feature learning, which uses the multimodal framework to classify network flow
TensorFlow [43] is used as the experimental software framework, running on Windows 10 Home Edition with an Intel(R) Core(TM) i5-9300H CPU @ 2.40 GHz and 8 GB of memory. An Nvidia GeForce GTX1650 GPU is used as an accelerator. The mini-batch size is 256 and the cost function is categorical cross-entropy. The Adam optimizer built into TensorFlow is used as the optimizer, and training runs for about 70 epochs.

Dataset and Preprocessing.
In this article, we used the ISCXVPN2016 dataset [29] described in Section 3.1. To better compare the experimental results, this study relabels the dataset according to the classification scheme in [44]. Because the number of samples differs greatly between classes, under-sampling is also applied; sampling is a simple way to tackle this imbalance. Hence, to train the proposed framework, we randomly select samples from the major classes until the classes are relatively balanced.
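Random under-sampling of the major classes can be sketched as follows. The function `under_sample`, the cap `max_per_class`, and the fixed `seed` are hypothetical choices for illustration; the source does not state the exact per-class target it used.

```python
import random
from collections import defaultdict


def under_sample(samples, labels, max_per_class, seed=0):
    """Randomly keep at most max_per_class samples of every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    balanced = []
    for label, items in by_class.items():
        if len(items) > max_per_class:
            # Only classes above the cap are reduced.
            items = rng.sample(items, max_per_class)
        balanced.extend((item, label) for item in items)
    rng.shuffle(balanced)
    return balanced


# Ten "chat" flows and two "voip" flows, capped at three per class.
samples = list(range(12))
labels = ["chat"] * 10 + ["voip"] * 2
balanced = under_sample(samples, labels, max_per_class=3)
```

Minor classes are left untouched, so the result is only relatively balanced, in line with the description above.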
The dataset above is captured at the data link layer. From a hierarchical perspective, the frame header at the data link layer contains physical connection information, such as the MAC address and other protocol content, and the network layer also contains IP address information. These data play a key role in transmitting the network stream, but they provide no valuable information for network flow classification; worse, a trained network may use the address information to classify the network flow, which is meaningless in practice. Therefore, in the data preprocessing stage, the MAC and IP addresses are removed to eliminate the effect of differing addresses on the training task.
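A minimal sketch of the address removal step is given below, under several stated assumptions: frames are untagged Ethernet II frames carrying IPv4, the whole 14-byte Ethernet header (which holds both MAC addresses) is dropped, and the IPv4 source and destination addresses are overwritten with zeros rather than cut out so that the remaining byte offsets stay aligned. The function name `anonymize_frame` and the constants are hypothetical, and the original preprocessing may differ in these details.

```python
ETH_HEADER_LEN = 14    # destination MAC (6) + source MAC (6) + EtherType (2)
IPV4_SRC_OFFSET = 12   # offset of the source address inside the IPv4 header
IPV4_ADDR_BYTES = 8    # source address (4) + destination address (4)
IPV4_MIN_HEADER = 20   # minimum IPv4 header length


def anonymize_frame(frame: bytes) -> bytes:
    """Drop the Ethernet header and zero the IPv4 addresses (assumed layout)."""
    ethertype = frame[12:14]
    packet = bytearray(frame[ETH_HEADER_LEN:])
    if ethertype == b"\x08\x00" and len(packet) >= IPV4_MIN_HEADER:
        end = IPV4_SRC_OFFSET + IPV4_ADDR_BYTES
        packet[IPV4_SRC_OFFSET:end] = bytes(IPV4_ADDR_BYTES)
    return bytes(packet)
```

Frames with another EtherType only lose their Ethernet header in this sketch; handling VLAN tags or IPv6 would need additional cases.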
Te second step is fle cleaning, which is to clean up duplicate network stream fles and empty fles to avoid bias when training the network.
Finally, because a deep learning network is used to train on the data, the data length needs to be unified. TCP and UDP protocol headers have different lengths. To unify the length of the transport layer header, we inject zeros at the end of each UDP segment's header to make it equal in length to a TCP header. In addition, according to [16], most of the valuable information is at the head of the payload data. In this article, the first 1225 bytes (35 × 35) are taken as the research object in the ISCXVPN2016 dataset to balance the accuracy and simplicity of the experiment.
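The length unification described above can be illustrated with a short sketch. Several details are assumptions rather than statements from the paper: the TCP header is taken to be 20 bytes (no options), the input is taken to start at the transport-layer header, and inputs shorter than 1225 bytes are zero-padded at the end. All function names are hypothetical.

```python
TCP_HEADER_LEN = 20   # assumed: TCP header without options
UDP_HEADER_LEN = 8    # fixed size of a UDP header
TARGET_LEN = 35 * 35  # 1225 bytes, as stated in the text


def pad_udp_header(segment: bytes) -> bytes:
    """Insert zeros after the UDP header so it is as long as a TCP header."""
    padding = bytes(TCP_HEADER_LEN - UDP_HEADER_LEN)
    return segment[:UDP_HEADER_LEN] + padding + segment[UDP_HEADER_LEN:]


def to_fixed_length(data: bytes, length: int = TARGET_LEN) -> bytes:
    """Keep the first `length` bytes; zero-pad at the end if shorter (assumption)."""
    return data[:length].ljust(length, b"\x00")


def to_matrix(data: bytes, side: int = 35):
    """Reshape the first side*side bytes into a side x side grid of values 0..255."""
    fixed = to_fixed_length(data, side * side)
    return [list(fixed[r * side:(r + 1) * side]) for r in range(side)]


# Toy UDP segment: an 8-byte header followed by a short payload.
segment = bytes(range(1, 9)) + b"payload"
padded = pad_udp_header(segment)
matrix = to_matrix(padded)
```

Each cell of `matrix` holds one byte value, so the grid can be read directly as a 35 × 35 grayscale image.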
Analyzing the number of data samples, we found that the number of samples varies greatly across the data types. Experimental results in [45, 46] show that the performance of a learner decreases when the numbers of samples are unbalanced, so this study adopts random sampling until the numbers of the various flows are relatively balanced. According to the application type that generated the network flow, this study relabels the dataset and divides the network flow into 17 categories, as shown in Table 5.
The dataset is divided into a training set, a validation set, and a test set, which account for 60%, 20%, and 20% of the resampled dataset, respectively.
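A 60/20/20 split of this kind can be sketched as below. The function name `split_dataset`, the shuffling step, and the seed are assumptions. The total of 188340 flows is not stated in the paper; it is derived here from the 113004 training flows and 37668 validation flows quoted for Figure 13, assuming those correspond exactly to the 60% and 20% shares.

```python
import random


def split_dataset(items, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the items and cut them into training, validation, and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains (about 20%)
    return train, val, test


# 188340 is a derived, assumed total (113004 / 0.6), used only for illustration.
train, val, test = split_dataset(range(188340))
```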

Results of the Multimodal Framework.
Since the framework combines CNN and GRU, the network flow must be converted into trainable grayscale images and digital sequences, respectively. In the spatial and sequential features section above, we discussed how to convert the network flow into a matrix and then into a grayscale image, as in [28], which is the input to the CNN for training. Figure 7 shows some of the grayscale images obtained from the ISCXVPN2016 dataset after network flow conversion. The images show clearly that different types of network flow have distinguishable texture characteristics. From this we can infer that, after conversion to grayscale images, different types of network flow have distinguishable features and flows of the same type have consistent ones.
For the GRU input, the processing is similar to text processing: the value of each byte of the network flow plays the role of a token after text tokenization, which can be called a network flow token, and each token is then associated with a vector by an embedding method. These vectors are combined into a sequence tensor, which is input to the GRU.
Table 6 shows the performance of the four types of models (CNN, GRU, model pretraining, and model joint-training) on the test set. The experiments show that the multimodal framework extracts network flow characteristics well enough to distinguish each application accurately.
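The byte-to-vector step can be sketched as a simple lookup table. The embedding dimension `EMBED_DIM` is a hypothetical value, since the paper does not give one here, and the table is filled with random numbers only for illustration; in a real model these vectors would be learned during training.

```python
import random

VOCAB_SIZE = 256  # one network flow token per possible byte value
EMBED_DIM = 8     # hypothetical; the embedding size is not stated in the text

rng = random.Random(0)
# Embedding table: one vector per byte value (random here, trained in practice).
embedding_table = [
    [rng.uniform(-0.05, 0.05) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed_flow(flow_bytes: bytes):
    """Map each byte (token) to its vector: result has shape [len(flow), EMBED_DIM]."""
    return [embedding_table[token] for token in flow_bytes]


# Three tokens; the first and third are the same byte, so they share a vector.
sequence = embed_flow(bytes([0x45, 0x00, 0x45]))
```

The resulting list of vectors plays the role of the sequence tensor that is fed to the GRU.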
To show the experimental results of the model on the test set in more detail, this study draws a heat map based on the prediction results for each type of network flow. At the same time, hierarchical clustering is used to analyze the relationships among the types of network flow [47]. This method uses Euclidean distance as the distance metric and average linkage as the agglomeration method, and reveals the relationships among the different types of network flows. Here, we show only the results of model pretraining and model joint-training, as shown in Figures 8 and 9. The figures indicate that the four models achieve high classification accuracy. In particular, model pretraining and model joint-training have similar accuracy and are better than the basic models. The clustering reveals similarities between AIMchat and ICQ, and between Skype and e-mail. Furthermore, there are similarities among Gmail, AIMchat, and ICQ. This is consistent with practice, as both AIM and ICQ provide online chat functions, and Skype and e-mail also provide online chat services. The conclusion is similar to that of [44] but more accurate, and it reveals the relationships among the different types of network flows. The classification results reflect these relationships because of the functional similarity of the applications mentioned above, which shows that the network flows are classified on the basis of accurately extracted features. Following [44], we compared the results on the ISCXVPN2016 dataset for different machine learning methods, where the depth parameter of the decision tree is 2, that of random forests is 4, regression uses c = 0.1, and naive Bayes uses default parameters. As shown in Table 8, combined with Table 6, classification based on deep learning performs better than the various machine learning methods. This shows the power of deep learning tools, especially for big data tasks, and is consistent with the analysis in III-B1.
In this part, we explore why the multimodal framework performs better. Basic learners with relatively good performance observe the data from different dimensions, and each obtains part of the true information, but not all of it. Only by gathering all kinds of information together can we get a more accurate description of the data. The effectiveness of the multimodal framework therefore depends on the diversity of the learners. Different learners deviate from different perspectives; after the learners are integrated, these deviations cancel each other out within the framework, so the result is more stable and accurate.
The two models mentioned above are the CNN and GRU models, which extract spatial and sequential features. Because the two types of models are different, their deviations in classification can offset each other, giving more stable and accurate classification results.
Figure 11 shows our attention analysis of the features extracted by model pretraining. The two types of basic models extract 256 and 128 features for each training object; these features are combined into the input data and sent to the secondary learner for training. As can be seen from Figure 11, on the one hand, the secondary classifier uses a much larger proportion of sequence features than of spatial features, because the GRU network using sequence features classifies more accurately than the CNN. On the other hand, looking at overall feature usage, both spatial and sequence features participate in the final classification task in the secondary classifier, which indicates that both types of features contribute to the classification task and improve the classification efficiency of the model.
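The fusion of the two feature sets before the secondary learner can be sketched as follows. Several points are assumptions: the text does not say which base model yields 256 features and which yields 128, so the assignment below is a guess; the secondary learner is shown as a single softmax layer with random weights purely for illustration; and all names are hypothetical.

```python
import math
import random

CNN_FEATURES = 256  # assumed to be the spatial (CNN) feature count
GRU_FEATURES = 128  # assumed to be the sequential (GRU) feature count
NUM_CLASSES = 17    # number of flow categories in the relabelled dataset


def fuse(spatial, sequential):
    """Concatenate the two feature vectors into one input for the secondary learner."""
    return list(spatial) + list(sequential)


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def secondary_learner(features, weights, biases):
    """One linear layer followed by softmax (an illustrative stand-in)."""
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


rng = random.Random(0)
spatial = [rng.random() for _ in range(CNN_FEATURES)]
sequential = [rng.random() for _ in range(GRU_FEATURES)]
fused = fuse(spatial, sequential)
weights = [[rng.gauss(0, 0.01) for _ in fused] for _ in range(NUM_CLASSES)]
biases = [0.0] * NUM_CLASSES
probs = secondary_learner(fused, weights, biases)
```

Because both feature blocks sit in the same input vector, the secondary learner is free to weight sequence and spatial features differently, which is the kind of usage pattern the attention analysis examines.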
Figure 12 shows the results of models trained on different data using k-fold cross-validation. It can be seen from the figure that the accuracy (Acc) of the CNN model varies from 98.07% to 98.52%, and that of the GRU model from 99.01% to 99.21%. The scores of model pretraining range from 99.29% to 99.41%, and those of model joint-training from 99.23% to 99.45%. Although the performance of the basic models differs slightly across data sets, the multimodal framework improves on the performance of the basic models.
In III-C1, we compared the two types of models and found that model joint-training requires more demanding training conditions and more time than model pretraining. Table 9 shows the specific parameters of each model.

Conclusions
To address network security problems, this study proposes a new deep learning method for network flow classification: the multimodal framework, in two variants based on pretraining and joint-training, respectively. To the best of our knowledge, this is the first time that such a framework has been proposed in the field of network flow classification. Experimental results show that the framework outperforms similar work from recent years on the ISCXVPN2016 dataset. At the same time, the multimodal framework is a deep learning network, which is well suited to processing big data. More data helps improve the performance of the framework; moreover, the framework extracts network flow features automatically, saving considerable manpower and time, which is of practical significance. We believe that the multimodal framework is a meaningful attempt in the field of network security, and that it is also useful for building a system that supports human physical and mental health.
As a next step, we can adopt more methods suited to network flow classification and expand the framework to build a better classification model. It is also possible to explore new networks for network flow classification tasks. For example, the author of [23] proposed that the capsule network can improve the efficiency of classification tasks, which can be considered a new direction for future research. The continuous improvement and use of the transformer in feature extraction also offers inspiration for extracting network flow features.
In the experiment, we extracted high-level information features of the network flow and studied the impact of feature extraction on the classification task. In fitting the experimental results, high-level features may lose some important information, so that the extracted features do not fully reflect the unique information of the network flow. At the same time, because a CNN is used, it is difficult to extract the global information of the features, which may also affect the classification results. New technologies, such as the Conformer structure [49], can be applied to network flow classification, and we can explore other feasible technologies as well. Finally, this article mainly discusses the feasibility of the framework, using unencrypted datasets to conduct experiments. In practice, as information security has received much attention, network encryption has become mainstream. The next step is to conduct experiments on encrypted network flow for better application in practice.

originated features. In the article, the authors used NN and SAE networks to extract and classify the features of the processed network flow. Compared with previous machine learning algorithms, the accuracy is greatly improved. In the data preprocessing stage, the data packet obtained at the data link layer is used as the processing object, and each packet is represented as a byte stream using a data packet extraction tool. After preprocessing, the data is sent to the neural network for training, with the purpose of extracting features automatically.

(1) CNN Model Construction. CNN is widely used in the field of image pattern recognition. It is a kind of deep learning model that contains a large number of convolution operations. A complete CNN includes several convolutional layers (CONV), pooling layers (POOL), and fully connected layers (FC). A common architecture pattern is as follows: INPUT ⟶ [[CONV]×N ⟶ POOL]×M ⟶ [FC]×K.
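The architecture pattern above can be unrolled into an explicit layer sequence. The sketch below only expands the notation; the function name and the example values N = 2, M = 2, K = 1 are hypothetical and are not the configuration used in the paper.

```python
def cnn_layer_pattern(n_conv: int, m_blocks: int, k_fc: int):
    """Expand INPUT -> [[CONV]xN -> POOL]xM -> [FC]xK into a flat list of layer names."""
    layers = ["INPUT"]
    for _ in range(m_blocks):
        layers.extend(["CONV"] * n_conv)  # N convolutional layers per block
        layers.append("POOL")             # one pooling layer closes each block
    layers.extend(["FC"] * k_fc)          # K fully connected layers at the end
    return layers


pattern = cnn_layer_pattern(n_conv=2, m_blocks=2, k_fc=1)
print(" -> ".join(pattern))
# INPUT -> CONV -> CONV -> POOL -> CONV -> CONV -> POOL -> FC
```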

Figure 2: Schematic diagram of network flow transformed into a grayscale image.

(1) Begin:
(2) Break the data into k folds. Data capacity per fold is N;
(3) Construct model joint-training based on CNN model and GRU model;
(4) for each i in [0, k) do
(5)     validation data = data[i × N : (i + 1) × N];
(6)     training data = rest of data;
(7)     train CNN model [i] on the training data, observe it on the validation data;
(8)     train GRU model [i] on the training data, observe it on the validation data;
(9)     construct model pretraining [i] based on CNN model [i] and GRU model [i];
(10)    train model pretraining [i] and model joint-training;
(11)    Model pretraining [i] prediction on the test data;
(12)    Model joint-training prediction on the test data;
(13) end for
(14) End
Algorithm 1: k-fold cross-validation used to test the stability.
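The fold construction in steps (2) and (4) to (6) of Algorithm 1 can be sketched in Python as below. This covers only the index bookkeeping, not model training; the function name is hypothetical, and the sketch assumes the i-th validation fold is the slice from i × N to (i + 1) × N, with i running from 0 to k − 1.

```python
def k_fold_indices(num_samples: int, k: int):
    """Yield (training, validation) index lists for each of the k folds.

    If num_samples is not divisible by k, the leftover samples at the end
    are never used for validation and always stay in the training part.
    """
    fold_size = num_samples // k  # N in Algorithm 1
    for i in range(k):
        start, stop = i * fold_size, (i + 1) * fold_size
        validation = list(range(start, stop))
        training = list(range(0, start)) + list(range(stop, num_samples))
        yield training, validation


folds = list(k_fold_indices(num_samples=10, k=5))
```

Each sample index appears in exactly one validation fold in this example, so every base model is observed on data it was not trained on.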

4.3. Comparison.
In this section, we compare experimental results for network flow classification on the ISCXVPN2016 dataset from recent years. Among them, Yamansavascilar et al. [48] used the k-NN method for classification, and Lotfollahi et al. [44] used a method called "Deep Packet," which uses SAE and CNN to classify network flow. Table 7 shows that both model pretraining and model joint-training outperform the two methods above. It should be noted that Yamansavascilar et al. used machine learning methods to implement classification based on manual, expert-originated feature sets, whereas the method of Lotfollahi et al., model pretraining, and model joint-training all use deep learning methods. This article has analyzed the shortcomings of classification based on manual, expert-originated features; automatic extraction of network flow features based on deep learning is more suitable for practical applications as intelligent systems develop.

Figure 10 further shows the experimental results of the deep learning methods on the test set. It indicates that the two types of the multimodal framework outperform the "Deep Packet" method in network flow classification.

Figure 7: Grayscale images of different types of network flows.

Figure 8: Heat map of test results for each model; the abscissa represents the true label of each type of network flow, and the ordinate represents the predicted label. The color represents the predicted probability. (a) CNN model, (b) GRU model, (c) model pretraining, (d) model joint-training.
Figure 13 shows how much time each epoch takes to build a model on a training set of 113004 flows and a validation set of 37668 flows, in the training environment described above. As shown in the figure, model pretraining requires less training time than model joint-training, and the basic models in model pretraining can be trained in parallel. Model pretraining therefore seems to have more advantages in practice. However, model joint-training is an end-to-end model: if the sample data is relatively insufficient and data augmentation is allowed for the training task, model joint-training is more powerful, whereas model pretraining is limited to the dataset used by the basic models. From this perspective, model joint-training is more flexible than model pretraining in its choice of dataset. Therefore, in practice, the proper model must be selected according to the specific training task.

Figure 13: Time consumption of different models.

Table 1: List of the acronyms used in this study.

Table 2: Summary of methods in the literature employing machine and deep learning models.
Figure 1: The processing of machine learning.

Table 3: Accuracy (%) of models on the test set. As shown in the table, the information column lists the bytes with larger weights. The top ones include tcp.option_kind, tcp.option_len, ip.hdr_len, ip.len, and other bytes. These features are related to network flow environment variables.

Table 5: Types of network flow in ISCXVPN2016.

Table 6: Performance of the four models for network flow classification.

Table 7: Comparison between the multimodal framework and other methods on the "ISCXVPN2016" dataset.

Table 8: Comparison between the multimodal framework and other machine learning methods on the "ISCXVPN2016" dataset.

Table 9: Comparison of model parameters.