Effective Feature Selection for 5G IM Applications Traffic Classification

Recently, machine learning (ML) algorithms have been widely applied to Internet traffic classification. However, because of inappropriate feature selection, ML-based classifiers are prone to misclassifying Internet flows as the traffic class that occupies the majority of flows. To address this problem, a novel feature selection metric named weighted mutual information (WMI) is proposed. We develop a hybrid feature selection algorithm named WMI_ACC, which first filters out most of the features with the WMI metric and then uses a wrapper method with an accuracy (ACC) metric to select features for a given ML classifier. We evaluate our approach using five ML classifiers on traces captured in two different network environments. Furthermore, we apply the Wilcoxon pairwise statistical test to the results of our proposed algorithm to identify the robust features within the selected set. Experimental results show that our algorithm gives promising results in terms of classification accuracy, recall, and precision, and that it can achieve 99% flow accuracy.


Introduction
Accurate network traffic classification is extremely important for network management, including IP network management, deploying QoS-aware mechanisms, monitoring security, bandwidth management, and intrusion detection. For instance, it helps Internet Service Providers (ISPs), network operators, and network administrators to understand the traffic composition and to prioritize bandwidth-sensitive traffic such as video conferencing and voice over IP (VoIP). Moreover, from the perspective of network security, traffic classification can help in blocking unwanted or attack traffic.
In the last few years, several traffic classification models [1,2] have been proposed in this regard. Traditionally, the port-based technique was used, which relies on well-known port numbers for traffic classification. This technique is easy to deploy and implement. However, many Internet applications use dynamic port numbers for their communication instead of well-known port numbers, which makes it difficult for the network operator to identify the traffic composition by port number. Moore and Papagiannaki [3] showed that port-based traffic classification does not give more than 50-70 percent accuracy.
To address the above problems, the payload-based technique was proposed [4,5], which inspects packet payload signatures. Though this technique is easy and accurate, it is ineffective for encrypted applications, as numerous applications such as Skype use encryption to protect their data from being detected. Furthermore, inspecting the packet payload conflicts with privacy laws. In recent years, machine learning (ML) algorithms have been presented to classify Internet traffic flows using features extracted from flow statistics. This ML technique is user privacy friendly because it does not inspect the packet payload. However, it is challenged by the problem of selecting accurate features. Feature selection refers to selecting, or filtering, the most relevant features from the larger set of available features. For instance, Moore et al. in 2005 [6] presented the most widely used feature extraction and selection method, in which they selected 248 statistical features based on a whole traffic flow.
(iii) Our third contribution is to select the robust features from those selected by our proposed WMI_ACC algorithm. We use the Wilcoxon statistical test to identify the robust features among the selected features.
The rest of the paper is organized as follows: Section 2 gives an overview of the related work. Section 3 elaborates our proposed WMI_ACC algorithm. Section 4 describes the evaluation methodology in detail, together with the experimental work and the datasets used. Analysis and discussion are given in Section 5. Finally, conclusions and future work are presented in Section 6.

Related Work
Recently, machine learning (ML) algorithms have been widely applied to Internet traffic classification [11-19]. Some of them are applied to traffic flow classification and others to bandwidth management. However, most of these methods aim at improving classification performance through the ML algorithms themselves. These proposed methods are able to achieve 80% accuracy. For feature extraction and selection, most works use the method presented by Moore et al. in 2005 [6], who, based on the whole traffic flow, selected 248 statistical features, such as RTT and the minimum, maximum, and average packet size. Using these statistical features, the applied classifiers can achieve very effective performance in Internet traffic classification. In real circumstances, however, this feature set is not well suited to traffic classification. Thus, we must select accurate features for Internet traffic classification so that subsequent management and security policies can be applied. In 2012, Zhang et al. [7] proposed two algorithms for feature selection to optimize traffic classification. They evaluated their results using the true positive rate (TPR) and false positive rate (FPR) and showed that their algorithm can achieve greater than 90% flow accuracy. Similarly, Peng et al. in 2016 [20] evaluated the effectiveness of statistical features, but their study was limited to early stage Internet traffic classification. Bernaille et al. [21] studied the problem of effective features in network traffic classification. In their study, they used K-Means, GMM, and HMM models for Internet traffic classification. They selected packet size as a feature and extracted more features for early stage Internet traffic classification. Lim et al. [22] used packet size, connection-level, and statistical features for Internet traffic classification. Van Der Putten and Van Someren [23] revealed that feature selection matters more for performance optimization than the choice of classifier.
Zheng et al. [24] showed how the types of selected feature metrics affect identification performance. For accurate feature selection, Chen and Wasikowski [25] proposed a new feature selection metric that uses the area under the ROC curve to evaluate features for Internet traffic classification. Kamal et al. [26] proposed three different filtering techniques, Balanced Minority Repeat (BMR), Differential Minority Repeat (DMR), and Higher Weight (HW), for the identification of effective features.
Over the last few years, the daily use of Internet applications has increased steadily because instant messaging (IM) applications are available free of cost. It is therefore important to classify IM applications accurately. For instance, WeChat is an IM and free calling application developed by Tencent Holdings in China. After the launch of WeChat, its online users reached 300 million [27] and, in November 2015, its active users reached 650 million all over the world. Outside China, its online users reached 100 million [28]. The growing number of active users and the growing traffic of this application can affect network performance. It is also important to classify WeChat message, audio call, and video call traffic accurately in order to manage the quality of service (QoS). Huang et al. [29] proposed a measurement tool, ChatDissect, to measure WeChat traffic, and distinguished 150K users and 16 GB of WeChat traffic in real-world network traces. In 2013, Church and Rodrigo de Oliveira [30] studied the performance of mobile instant messaging services compared with traditional short messages. In 2014, O'Hara et al. [31] studied the instant messaging application WhatsApp on smartphones and conducted interviews and a survey to study user activity in WhatsApp. In 2014, Fiadino et al. [32] also studied WhatsApp flow streams and collected data in a European network, consisting of millions of data flow streams. They also studied audio and video flow data streams. In 2014, Liu and Guo [33] studied video messaging services in WeChat and WhatsApp, capturing the traffic on mobile devices. Furthermore, in our previous work [8], we classified IM applications; however, we classified only the flow traffic of the WeChat text message service. Selecting accurate features for IM application traffic classification therefore remains important.

Proposed Method
In this section, we explain our proposed WMI_ACC algorithm in detail. First, we examine the problem of feature selection. Then we introduce the WMI-based feature metric and design the WMI_ACC algorithm, which combines WMI with the ACC metric to select effective features for Internet traffic classification.

Mutual Information Based Metric.
In information theory, mutual information is extensively used for feature selection [9,34], image processing [35], speech recognition [36], and so forth. It measures the mutual dependency between two random variables $X$ and $Y$, that is, the amount of information that one random variable holds about the other. The mutual information between two random variables is described as

$$I(X;Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X,Y). \tag{1}$$

In (1), the marginal entropies of $X$ and $Y$ are $H(X)$ and $H(Y)$, the conditional entropies are $H(X \mid Y)$ and $H(Y \mid X)$, and the joint entropy of $X$ and $Y$ is $H(X,Y)$. The relationship between $H(X)$, $H(Y)$, $H(X \mid Y)$, $H(Y \mid X)$, $H(X,Y)$, and $I(X;Y)$ is shown in Figure 1.

Figure 1: The relationship between mutual information and entropies.

According to Shannon's definition of entropy, we have

$$H(X) = -\sum_{x} p(x)\log p(x), \qquad H(X \mid Y) = -\sum_{x}\sum_{y} p(x,y)\log p(x \mid y), \qquad H(X,Y) = -\sum_{x}\sum_{y} p(x,y)\log p(x,y), \tag{2}$$

where $p(\cdot)$ indicates the probability distribution of a random variable. As in [11], we use the three equations in (1) to obtain the computational formula for mutual information, following the same method that the authors therein used:

$$I(X;Y) = \sum_{x}\sum_{y} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}. \tag{3}$$

In the case of continuous random variables, the summation is replaced by a definite double integral.
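As an illustration of the computational formula for mutual information, the following minimal Python sketch estimates $I(X;Y)$ from paired discrete observations and checks it against the entropy identity in (1). The function names and the toy data are assumptions made for this example only; they are not part of the toolbox used in this study, and continuous features would first need to be discretized.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy in bits, estimated from observed frequencies."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X;Y) = sum over (x, y) of p(x,y) * log2(p(x,y) / (p(x) p(y)))."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


# Hypothetical toy data: a discretized packet-size feature and flow labels.
feature = ["small", "small", "large", "large"]
label = ["IM", "IM", "P2P", "P2P"]

mi = mutual_information(feature, label)

# Cross-check with the identity I(X;Y) = H(X) + H(Y) - H(X,Y).
identity = entropy(feature) + entropy(label) - entropy(list(zip(feature, label)))
```

In this toy case the feature determines the label completely, so the mutual information equals the entropy of the label (one bit); for a feature that is independent of the label it would be zero.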

Weighted Mutual Information (WMI) Metric.
To address the problem of accurate feature selection, we propose weighted mutual information based on weighted entropy.
The weight value of a feature is calculated from the total number of features and the number of features assigned to the feature set, as given in (4). The weighted mutual information (WMI) between two random variables $X$ and $Y$ is then defined in (5), in which the weighted marginal entropies of $X$ and $Y$ are $H_w(X)$ and $H_w(Y)$, and so on. Again, $p(\cdot)$ is the probability distribution of a random variable. By using the three equations in (6), the weighted mutual information is obtained as in (7). Several software packages for computing mutual information are publicly available; we select H. Peng's mutual information Matlab toolbox [37] for our study.

ACC Metric.
After applying the WMI metric, it is essential to select the features that are effective for a specific ML classifier in order to obtain good performance. For this purpose, a wrapper method based on the accuracy (ACC) metric is applied. The AUC metric, by contrast, is not suitable for ranking features when the goal is high classification accuracy across applications. The highest ACC implies that the ML classifier can obtain effective performance. Thus, we rank the features by the ACC metric and select those with the highest ACC values. We used the C4.5, Random Forest, and Random Tree machine learning classifiers for the ACC metric.

WMI_ACC Algorithm.
In this section, we propose an effective feature selection algorithm named WMI_ACC. WMI_ACC is a hybrid feature selection algorithm based on WMI combined with the ACC metric. It first filters out most of the features with the WMI metric and then selects the effective features with the ACC metric for a specific classifier. Algorithm 1 shows the detailed pseudocode of our feature selection algorithm. Algorithm 1 has two steps and takes as input a dataset with its classes and features. In the first step (lines (1)-(10)), WMI_ACC filters out most of the features by their WMI value. The weight value of each feature (line (3)) is calculated according to (4) (see Section 3.1.2). A good feature has greater mutual information values in relation to other features. WMI_ACC first calculates the WMI value between each of the features (line (6)). If the WMI value is greater than the predetermined threshold (line (7)), it inserts the feature into the list in descending order. A greater threshold value speeds up the feature selection process but decreases the classification accuracy [20]. Thereafter, in line (11), the algorithm obtains the list of WMI-selected features.
In the second step (lines (13)-(26)), WMI_ACC selects effective features with the ACC metric for a particular ML classifier. It takes the features from the list one by one and finds the features that produce a high ACC (accuracy) value. Specifically, in lines (13)-(16), it first computes the ACC value of the wrapper set, which initially consists of the first feature in the list; it then takes the next feature from the list and inserts it into the wrapper set. If the ACC value with the newly inserted feature is low, WMI_ACC removes that feature (line (21)). At the end, the wrapper set contains the effective feature set.
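The two steps described above can be sketched in Python as follows. This is a hedged illustration rather than a transcription of Algorithm 1: the function name `filter_then_wrap`, the `relevance` scores (a stand-in for the WMI values), and the `evaluate_acc` callable (a stand-in for training and testing a classifier) are assumptions made for this example, and "low ACC" is interpreted here as "accuracy drops when the feature is added".

```python
def filter_then_wrap(features, relevance, evaluate_acc, threshold):
    """Two-stage hybrid selection: filter by score, then wrap by accuracy.

    features:     iterable of feature names
    relevance:    dict mapping feature -> filter score (stand-in for WMI)
    evaluate_acc: callable(list of features) -> accuracy of one classifier
    threshold:    minimum filter score needed to survive the first step
    """
    # Step 1 (filter): keep features above the threshold, best score first.
    candidates = sorted(
        (f for f in features if relevance[f] > threshold),
        key=lambda f: relevance[f],
        reverse=True,
    )
    if not candidates:
        return []

    # Step 2 (wrapper): start from the top-ranked feature, try the others
    # one by one, and keep an addition only if accuracy does not drop.
    selected = [candidates[0]]
    best_acc = evaluate_acc(selected)
    for f in candidates[1:]:
        acc = evaluate_acc(selected + [f])
        if acc >= best_acc:
            selected.append(f)
            best_acc = acc
    return selected


# Hypothetical scores and a toy accuracy function, for illustration only.
scores = {"mean_fpktl": 0.9, "max_fiat": 0.6, "std_bpktl": 0.4, "noise": 0.05}
gain = {"mean_fpktl": 0.70, "max_fiat": 0.20, "std_bpktl": -0.05}


def toy_acc(subset):
    """Pretend accuracy: the sum of per-feature gains (not a real classifier)."""
    return sum(gain.get(f, 0.0) for f in subset)


chosen = filter_then_wrap(list(scores), scores, toy_acc, threshold=0.1)
```

In practice `evaluate_acc` would train and test the chosen classifier (for example C4.5 or Random Forest) on the candidate subset, which is why the wrapper step is run only on the features that survive the cheaper filter step.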

Statistical Test.
To select the robust features from the feature list chosen by our proposed WMI_ACC algorithm, and to determine whether the results of the applied methods differ significantly, statistical tests are conducted. In this study, we executed the Wilcoxon pairwise statistical test on the results of the methods [38,39]. The Wilcoxon signed-rank test is a nonparametric method used for pairwise comparison between two methods [40] and is used in many research areas [9]. Let $d_i$ be the difference between the two methods' performance scores on the $i$th out of $N$ problems; if the scores lie in different ranges, they can be normalized to the interval [0, 1] as in [41]. The differences are then ranked by their absolute values, and ties are handled as in [42]. Positive values indicate that the method performed well, and vice versa.
$R^+$ denotes the sum of ranks of the positive differences and $R^-$ the sum of ranks of the negative differences. If the difference between $R^-$ and $R^+$ is very high, the hypothesis is rejected. Like the Friedman test, this statistical test is used to determine whether the hypothesis is rejected at a specific significance level $\alpha$.
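The rank sums used by the test can be computed with a short Python sketch. The function name and the example scores below are hypothetical, and two conventions are assumed here that may differ from the treatment in [42]: zero differences are dropped, and tied absolute differences receive the average of their ranks.

```python
def wilcoxon_rank_sums(scores_a, scores_b):
    """Return (R_plus, R_minus) for paired scores of two methods.

    Assumptions of this sketch: zero differences are discarded and tied
    absolute differences share the average of the ranks they span.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]

    # Order the differences by absolute value.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)

    i = 0
    while i < len(order):
        # Find the run of equal absolute differences starting at position i.
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    r_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    r_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return r_plus, r_minus


# Hypothetical accuracy scores (in percent) of two feature sets on five problems.
set_a = [95, 97, 90, 99, 93]
set_b = [93, 97, 94, 98, 90]
r_plus, r_minus = wilcoxon_rank_sums(set_a, set_b)
```

The smaller of the two sums is commonly taken as the test statistic and compared with a critical value (or converted to a $p$ value) at the chosen significance level; with only a handful of pairs, as in this toy example, the test has very little power.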

Evaluation Methodology
This section presents the traffic traces, the evaluation criteria, and the analysis of experimental results.

Performance Measures.
To evaluate the performance of the five machine learning (ML) classifiers, classification accuracy, recall, and precision are employed. The metrics are described as follows:
(i) Classification accuracy: the number of correctly classified traffic flows divided by the total number of classified flows.
(ii) Recall: the percentage of traffic flows of Class Z that are correctly classified as belonging to Class Z.
(iii) Precision: the percentage of traffic flows that actually belong to Class Z among all those that were classified as Class Z.
These performance evaluation metrics are important for flow-based traffic classification in network traffic identification. Flow accuracy, in particular, is used to measure the overall performance of an ML classifier.
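The three measures above can be computed from a list of true and predicted flow labels as in the following Python sketch. The function name and the toy labels are assumptions made for illustration; the class names are borrowed from the HIT Trace 1 dataset, but the values are not experimental results.

```python
from collections import Counter


def flow_metrics(true_labels, predicted_labels):
    """Return (accuracy, recall per class, precision per class)."""
    pairs = list(zip(true_labels, predicted_labels))

    # Accuracy: correctly classified flows over all classified flows.
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    correct = Counter(t for t, p in pairs if t == p)  # true positives per class
    actual = Counter(true_labels)                     # flows truly in each class
    predicted = Counter(predicted_labels)             # flows assigned to each class

    # Recall: share of a class's flows that were labeled as that class.
    recall = {c: correct[c] / actual[c] for c in actual}
    # Precision: share of flows labeled as a class that truly belong to it.
    precision = {c: correct[c] / predicted[c] for c in predicted}
    return accuracy, recall, precision


# Hypothetical labels for six flows.
truth = ["WTCP", "WTCP", "WUDP", "IM", "IM", "FTP"]
guess = ["WTCP", "WTCP", "WUDP", "IM", "FTP", "FTP"]
accuracy, recall, precision = flow_metrics(truth, guess)
```

Here one IM flow is mislabeled as FTP, which lowers the recall of IM and the precision of FTP while the overall accuracy stays fairly high; this is why per-class recall and precision are reported alongside flow accuracy when the traffic composition is imbalanced.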

Experimental Results and Analysis.
In this section, our objective is to evaluate the performance of our proposed algorithm by comparing the results on the HIT Trace 1 dataset with those on the NIMS dataset. Our experiments include three phases. First, on the HIT and NIMS datasets, we validate that our proposed feature selection algorithm is effective with respect to accuracy. Then, we validate that it is effective with respect to precision and recall. The classifiers used are the Bayes Net, Naïve Bayes, C4.5 decision tree, Random Forest, and Random Tree machine learning algorithms. Figure 2 shows the detailed accuracy chart for the HIT Trace 1 dataset. We use the Weka application for our experiments, with a training and testing method, to classify IM application traffic.
It is clear from Table 3 and Figure 2 that the applied machine learning classifiers give maximum classification [...] From these experimental results, it is evident that the Random Forest and C4.5 classifiers give better classification accuracy than the other machine learning classifiers on the NIMS dataset, with Random Forest giving the most effective results. The details are shown in Table 4 and Figure 3. In the NIMS dataset, the SFTP application is classified with 100% accuracy, better than the other traffic applications, by all the applied classifiers. The applied classifiers give very accurate results on the NIMS dataset overall, although the Naïve Bayes classifier gives slightly lower accuracy for all traffic applications except SFTP. Nevertheless, all the traffic applications are classified accurately by the five machine learning classifiers.
Similarly, the Bayes Net classifier gives better recall values for the HIT Trace 1 dataset, as shown in Table 5 and Figure 4. Most traffic applications are classified accurately with respect to recall, but the IM and IMAP applications give very poor recall, particularly IM, although the C4.5, Bayes Net, and Random Tree classifiers give effective recall results overall. From the table, the WTCP application is classified very effectively by the ML classifiers: Random Forest gives the maximum recall, followed by Random Tree and C4.5, which give the same recall. For the WUDP application, only the Naïve Bayes classifier gives low results, and all other classifiers give 100% recall. For P2P, Random Forest and Bayes Net give good recall, while, for IM, Random Forest gives the maximum recall. Similarly, for FTP, Bayes Net and Random Tree give the maximum recall.
For precision, as shown in Table 6 and Figure 5, the Random Forest classifier gives effective results on the HIT Trace 1 dataset. It is clear from the experimental results on the HIT Trace 1 dataset that all the selected machine learning classifiers achieve high values of classification accuracy, recall, and precision.
Though most traffic applications are classified very efficiently by the machine learning classifiers, the IM and IMAP applications give very low precision on the HIT Trace 1 dataset compared with the other traffic applications. The WTCP application is classified efficiently, and the applied classifiers give precision above 99%. Similarly, the WUDP application is classified very accurately: the applied classifiers give more than 99% precision, and the Bayes Net classifier gives 100%. For the P2P application, the Random Forest, Random Tree, and Bayes Net classifiers give 100% precision, while IM application traffic is classified poorly: the applied classifiers do not exceed 63% precision. Similarly, the IMAP application is poorly classified with respect to precision. FTP traffic, however, is classified well, and most classifiers give 100% precision on the HIT Trace 1 dataset. From the experimental results on the NIMS dataset, it is again evident that the C4.5 and Random Forest classifiers give very effective precision results. All the applied classifiers give good results, but those of C4.5 and Random Forest are the most promising in terms of precision. For the GTALK TCP application, all the applied classifiers give very effective precision, and Random Forest, C4.5, and Bayes Net give 100%. For the GTALK UDP application, all the applied classifiers give good precision except Naïve Bayes, which gives low precision. Similarly, for DNS, FTP, HTTP, and SFTP, all the applied classifiers give promising precision, with only Naïve Bayes and Random Tree slightly lower. Overall, the precision results on the NIMS dataset are good.
As for recall on the NIMS dataset, the C4.5, Random Forest, and Random Tree classifiers give accurate results. The detailed results are shown in Tables 7 and 8 and Figures 6 and 7. All the traffic applications of the NIMS dataset are classified accurately with respect to recall, and SFTP and GTALK TCP reach 100% recall with the applied classifiers. From the classifiers' point of view, only the Naïve Bayes classifier gives lower recall than the others. For the GTALK TCP application, the traffic is classified very accurately, as all the applied classifiers give 100% recall, while for GTALK UDP only Naïve Bayes gives slightly lower recall. For the DNS application, only Naïve Bayes gives a poor recall value, and the remaining classifiers give 100% recall. Similarly, the FTP, HTTP, and SFTP applications are classified with about 100% recall, with only Naïve Bayes slightly lower. Table 9 shows the Wilcoxon pairwise test results for the robust feature selection from the features selected by the proposed WMI_ACC algorithm. From Table 9, the $p$ value of the features is greater than 0.05 for the accuracy results. Thus, we conclude that there is no significant difference between the results of the 9 features and the other features among the selected features.
Mobile Information Systems

Analysis and Discussion
Although the results of the five applied machine learning classifiers differ with respect to accuracy, recall, and precision on the HIT Trace 1 and NIMS datasets, some observations can be drawn from the experimental study for IM traffic classification:
(i) It is clear from this study that our proposed algorithm selects an effective feature set for IM traffic classification on two datasets from different network environments, in terms of classification accuracy, recall, and precision.
(ii) In the experimental results, all the applied machine learning classifiers give very effective performance for all applications, but the FTP and Telnet applications are classified slightly less well than the other applications in both datasets.
(iii) Our proposed algorithm gives effective feature sets, and it is evident that the selected features carry enough identification information for IM traffic classification.
(iv) Through the accuracy results, the classification performance for instant messaging (IM) traffic can be easily evaluated. In some cases, however, some classifiers achieve high identification performance and in other cases they do not. This is due to the imbalanced traffic composition of the datasets.
(v) All the applied ML classifiers give very effective performance. However, the C4.5 decision tree and Random Forest classifiers give more accurate results than the other machine learning classifiers.

Table 1: Characteristics of HIT Trace 1 dataset.

In this paper, we select two sets of network traces for our experimental study. One dataset is a set of traces collected in our lab, while the other is an open network trace dataset. The two traces come from different network environments. We applied our proposed feature selection algorithm to both datasets, rather than to only one, for a better understanding of the composition of Internet traffic. The datasets also differ in content: in our trace dataset we capture mostly WeChat instant messaging traffic, while in the NIMS dataset GTALK IM traffic is traced. We captured the traffic for a duration of one hour at the laboratory of the School of Computer Science & Information Technology, Harbin Institute of Technology, Harbin, China, on 27 December 2015 and 28 April 2016. It should be noted that we only trace traffic with nonzero-payload packets and that we are only interested in the TCP and UDP traffic of the WeChat IM application and in P2P, IM, IMAP, and FTP traffic. In this dataset, WTCP traffic and WUDP traffic denote the TCP traffic and UDP traffic of the WeChat application. The detailed characteristics of the HIT Trace 1 dataset are shown in Table 1.

HIT Trace 1 Dataset.
In this paper, we used the same dataset that we used in our previous paper [9]. To build the HIT Trace 1 dataset, we captured WeChat instant messaging (IM) application traffic. The WeChat IM application includes multiple functions, but we only trace text message, picture message, and audio and video call traffic; in addition, we trace the traffic of IM, IMAP, and FTP applications for our research study. In this study, we are interested in finding the effective features for IM application traffic classification. Thus, we trace only WeChat text message, picture message, and audio and video call traffic, respectively, with the Wireshark tool [43].

[...] traffic such as DNS, HTTP, SFTP, and P2P traffic. However, we are also interested in instant messaging application traffic classification. For this reason, we also added the NIMS GTALK trace traffic, which includes TCP GTALK traffic and UDP GTALK traffic. Moreover, in the NIMS dataset, we select only DNS, HTTP, SFTP, GTALK TCP, and GTALK UDP traffic for our research study. The detailed characteristics of the NIMS data are shown in Table 2.

Table 3: Accuracy result of HIT Trace 1 dataset.

Table 3 depicts the classification accuracy results of the HIT trace dataset using the Bayes Net, Naïve Bayes, C4.5 decision tree, Random Forest, and Random Tree machine learning algorithms.

Table 5: Recall results of HIT dataset.

Table 6: Precision result of HIT dataset.

Table 7: Precision result of NIMS dataset.

Table 8: Recall results of NIMS dataset.

Table 10: Selected features of our proposed algorithm.

Table 11: Datasets' average accuracy comparison.

Conclusion
This paper proposed a feature selection algorithm named WMI_ACC to select effective features for IM traffic classification. The performance of WMI_ACC is very promising for 5G IM traffic classification. The experimental results showed that our approach is able to improve classification accuracy, recall, and precision, mostly in 5G high-dimension traffic. Furthermore, the ten flow-based features selected by our approach are very important for 5G IM traffic classification. They are (1) min_fpktl, (2) mean_fpktl, (3) max_fpktl, (4) std_fpktl, (5) min_bpktl, (6) mean_bpktl, (7) max_bpktl, (8) std_bpktl, (9) max_fiat, and (10) total_fpacket. Using the Wilcoxon pairwise statistical test, it is evident from the experimental study that these features carry enough classification information. Moreover, all the applied ML classifiers perform well, but we found that the C4.5 and Random Forest classifiers with WMI_ACC-selected features perform better than the other applied machine learning classifiers. In our experiments, some ML classifiers achieve very good classification accuracy, recall, and precision, while others give slightly lower results. This is due to the imbalance of the datasets. There is still room for further research in 5G instant messaging (IM) traffic classification. A new approach should be designed to select robust features for IM application traffic classification, and this is our future research work.