Ever-expanding communication requirements demand extensive, efficient network systems with equally efficient and reliable security features for safe data transfer. Providing effective security protocols for any network environment is therefore of paramount importance, and more efficient and dynamic network intrusion detection models are continuously being designed. In this work, an approach based on Hotelling’s T2 method, a multivariate statistical analysis technique, is employed for intrusion detection in network environments. The multivariate Hotelling’s T2 statistical model comprises preprocessing, multivariate statistical analysis, and attack detection components, and the necessary profiles are generated from T-square distance metrics. With a threshold range obtained using the central limit theorem, observed traffic profiles are classified as either normal or attack. The model, validated and tested on the KDD Cup’99 dataset, shows very high detection rates for all classes, with low false alarm rates for all classes except U2R. Its accuracy is found to be better than that of the existing models considered for comparison.
Sophisticated security policies and tools are continuously designed to ensure the integrity, availability, and confidentiality of data for legitimate users in a network environment. Security tools such as firewalls, cryptographic techniques, and authentication mechanisms are designed based on the attacks existing at the time of their development [
Intrusion detection systems generally analyze and dynamically monitor network traffic patterns and log information. The analysis helps in deploying suitable detection methodologies to identify whether the events have any signature of attacks or are legitimate profiles [
Attempts have been made to enhance the detection performance and efficiency of intrusion detection systems (IDSs) for anomaly detection using a wide range of algorithms. These algorithms are largely based on data mining [
Network traffic profiles are often characterized by multiple features, and deviations in any of these attributes need to be considered while analyzing the network for intrusions. Profiles represented by multiple attributes therefore call for multivariate analysis techniques. This approach can eliminate the problem of comparing a predicted event with an observed event directly [
Hotelling’s T2 test, a multivariate statistical technique, has been developed as a process control tool used for hypothesis testing [
Ye et al. have carried out multivariate statistical analysis of audit trails for detecting intrusions in host systems using Hotelling’s T2 technique and detected both counterrelationship and mean shift anomalies. For smaller datasets, all intrusions are detected with zero false alarm rates whereas, for larger datasets, the detection rate has been 92% with zero false alarm rates [
Though numerous intrusion detection systems have been developed to secure network environments, it is often reported that their false alarm rates need to be considerably reduced or eliminated. Since the Multivariate Hotelling’s T2 Statistical (MHT2S) technique for intrusion detection in host machines has been reported to produce zero false alarm rates, it may be possible to employ this approach for providing security in a dynamic network environment as well. To our knowledge, studies employing the MHT2S model for anomaly detection in a network environment are very rare. Therefore, in this work, a network anomaly detection system based on the MHT2S technique is developed with the objective of achieving high detection rates combined with low false alarm rates.
The MHT2S model is built with legitimate traffic profiles, and the statistical deviation of an observed traffic profile from the legitimate ones is measured. If this deviation falls outside the specified threshold range, the observed traffic profile is flagged as anomalous. The threshold range is calculated using the central limit theorem for multivariate analysis. The performance of the proposed anomaly detection system is evaluated using the benchmark KDD Cup’99 dataset.
The paper is organized as follows: Section
The KDD Cup’99 dataset [
The KDD Cup’99 dataset was collected from a simulated environment, and the information it contains needs to be processed before it is used for developing any intrusion detection system. Four preprocessing steps have been carried out to make the dataset suitable for developing the MHT2S model: redundancy removal, numeric value assignment, normalization, and feature selection. Eliminating redundant traffic profiles from the data source keeps the model from being biased towards any recurring traffic profile. Table
Description of redundancy in dataset (10%_corrected_subset_KDD Cup’99).
Class | Original records: number of samples | Original records: % | After redundancy removal: number of samples | After redundancy removal: % |
---|---|---|---|---|
Normal | 97279 | 19.75 | 87832 | 60.79 |
DoS | 391460 | 79.46 | 54573 | 37.77 |
Probe | 3460 | 0.70 | 1627 | 1.13 |
R2L | 442 | 0.08 | 425 | 0.29 |
U2R | 37 | 0.01 | 37 | 0.03 |
Total | 492678 | 100 | 144494 | 100 |
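The redundancy removal step and the class shares reported in the table above can be illustrated with a minimal sketch. It assumes that a redundant profile is an exact duplicate of an earlier record; the function names `remove_redundant` and `class_shares` are hypothetical and are not taken from the original implementation.

```python
def remove_redundant(records):
    """Drop exact duplicate records, keeping the first occurrence of each.

    Assumption: "redundant" means an identical feature vector and label.
    """
    seen = set()
    unique = []
    for record in records:
        key = tuple(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def class_shares(counts):
    """Return the percentage share of each class, rounded to two decimals."""
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 2) for name, n in counts.items()}


# Record counts after redundancy removal, copied from the table above.
after_removal = {"Normal": 87832, "DoS": 54573, "Probe": 1627, "R2L": 425, "U2R": 37}
shares = class_shares(after_removal)
```

With the counts from the table, the normal class rises to roughly 61% of the records once duplicates are removed, which is the effect the preprocessing step is meant to achieve.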
After assigning numeric values, the range of values for different features is different. Table
Minimum, maximum, and distinct values of some features of KDD Cup’99.
Features | Min | Max | Distinct |
---|---|---|---|
Protocol type | 1 | 3 | 3 |
Flag | 1 | 11 | 11 |
Service | 1 | 66 | 66 |
src_bytes | 0 | 693375640 | 3300 |
dst_bytes | 0 | 5155468 | 10725 |
diff_srv_rate | 0 | 1 | 78 |
dst_host_same_src_port_rate | 0 | 1 | 101 |
Count | 0 | 511 | 490 |
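The widely differing ranges in the table above are the reason for normalizing after numeric value assignment. The sketch below shows one way the two steps could look. It assumes integer codes starting at 1 assigned in sorted order, and min-max scaling to the interval [0, 1]; the text does not specify either choice, and both function names are hypothetical.

```python
def encode_categorical(values):
    """Map each distinct symbolic value to an integer code starting at 1.

    Assumption: codes follow the sorted order of the distinct values.
    """
    codes = {name: i for i, name in enumerate(sorted(set(values)), start=1)}
    return [codes[v] for v in values], codes


def min_max_normalize(column):
    """Scale a numeric column to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    span = hi - lo
    return [(v - lo) / span for v in column]


encoded, mapping = encode_categorical(["tcp", "udp", "icmp", "tcp"])
scaled = min_max_normalize([0, 255.5, 511])  # e.g. the range of the count feature
```

After scaling, a feature such as src_bytes no longer dominates the covariance structure purely because of its magnitude.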
The T-square distance (TSD) method is used in statistics for hypothesis testing in both univariate and multivariate applications. This technique can identify whether an observed profile belongs to a particular group or not. It utilizes first order statistical measures such as the mean along with second order statistical measures such as the variance and the sample covariance matrix. These statistical measures analyze correlations between variables and remove dependencies on the scale of measurement during calculation [
Consider a set of
Consider the following.
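Calculate the sample mean vector.
Generate the sample covariance matrix.
For each training profile, calculate its TSD value.
End for.
Compute the mean of the TSD values.
Compute the standard deviation of the TSD values.
Return the computed statistics.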
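The model-building steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' Java implementation: the TSD is taken to be the quadratic form of the deviation from the sample mean with the inverse sample covariance matrix, the covariance uses the n - 1 denominator, and all function names and the dictionary layout are hypothetical.

```python
def mean_vector(rows):
    """Column-wise sample mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance_matrix(rows, mean):
    """Sample covariance matrix with the n - 1 denominator."""
    n, p = len(rows), len(mean)
    cov = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [x - m for x, m in zip(row, mean)]
        for i in range(p):
            for j in range(p):
                cov[i][j] += d[i] * d[j]
    return [[c / (n - 1) for c in r] for r in cov]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def tsd(x, mean, cov):
    """T-square distance (assumed form): d^T S^-1 d with d = x - mean."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    return sum(di * zi for di, zi in zip(d, solve(cov, d)))


def build_model(rows):
    """Compute the mean vector, covariance matrix, and TSD mean and deviation."""
    mean = mean_vector(rows)
    cov = covariance_matrix(rows, mean)
    distances = [tsd(r, mean, cov) for r in rows]
    mu = sum(distances) / len(distances)
    var = sum((t - mu) ** 2 for t in distances) / (len(distances) - 1)
    return {"mean": mean, "cov": cov, "tsd_mean": mu, "tsd_std": var ** 0.5}


model = build_model([[1, 2], [2, 1], [3, 5], [4, 3], [6, 4]])
```

Solving a linear system for each profile avoids forming the inverse covariance matrix explicitly. A production version would factor the matrix once and would need a strategy for singular covariance matrices, which can arise when selected features are constant or linearly dependent.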
The TSD value is calculated for the observed traffic profile using the sample mean vector and the sample covariance matrix. The TSD value thus obtained is transformed into the T2 statistic by multiplying it by a constant value as given in (
Instead, the central limit theorem is used for detecting multivariate network traffic samples, with the assumption that the TSD values of multivariate profiles approximately follow a normal distribution. Taking the TSD values as samples, their mean and standard deviation are calculated for estimating the threshold range. The threshold range is given by
Consider the following.
Generate the TSD value of the observed traffic profile.
If the TSD value lies within the threshold range
Return normal
Else
Return attack.
End if.
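The detection rule above can be sketched as follows. The exact threshold formula is not recoverable from the text here, so this sketch assumes a symmetric range of the form mean ± k × standard deviation of the training TSD values, with the multiplier `k` as a tunable parameter. The form of the range, the default `k = 3.0`, and the function names are all assumptions made for illustration.

```python
def threshold_range(tsd_mean, tsd_std, k=3.0):
    """Assumed threshold range: mean - k * std to mean + k * std."""
    return (tsd_mean - k * tsd_std, tsd_mean + k * tsd_std)


def classify(tsd_value, tsd_mean, tsd_std, k=3.0):
    """Label an observed profile from its TSD value.

    Returns "normal" when the value lies inside the threshold range
    and "attack" otherwise.
    """
    lower, upper = threshold_range(tsd_mean, tsd_std, k)
    return "normal" if lower <= tsd_value <= upper else "attack"


inside = classify(2.0, tsd_mean=1.6, tsd_std=0.5)    # within the assumed range
outside = classify(10.0, tsd_mean=1.6, tsd_std=0.5)  # far above the assumed range
```

Under this assumption, widening the range by increasing `k` accepts more same-class profiles at the cost of letting more deviating profiles through, which is the trade-off explored with the different thresholds in the results.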
The MHT2S intrusion detection system has been evaluated in terms of system accuracy, attack detection rate, and false alarm rate. Accuracy (acc) of a complete system is the ratio of the sum of normal and abnormal records correctly identified to the total number of records using
Apart from these metrics, the visualization tool used for analyzing the performance of the intrusion detection system is the Receiver Operating Characteristic (ROC) curve. The ROC curve provides a clear trade-off between detection rate and false alarm rate for every model. Values that appear in the upper left triangle of the ROC curve, that is, above the line
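The evaluation metrics described above can be computed from the four confusion-matrix counts. The sketch uses the definition of accuracy given in the text and the conventional definitions of detection rate and false alarm rate. It assumes that attack records are the positive class; the function name `evaluate` is hypothetical.

```python
def evaluate(tp, tn, fp, fn):
    """Return accuracy, detection rate, and false alarm rate in percent.

    tp: attack records flagged as attack    fn: attack records missed
    tn: normal records accepted as normal   fp: normal records flagged as attack
    """
    total = tp + tn + fp + fn
    accuracy = 100.0 * (tp + tn) / total
    detection_rate = 100.0 * tp / (tp + fn)
    false_alarm_rate = 100.0 * fp / (fp + tn)
    return accuracy, detection_rate, false_alarm_rate


# Illustrative counts only, not taken from the paper's experiments.
acc, dr, far = evaluate(tp=90, tn=95, fp=5, fn=10)
```

Each (false alarm rate, detection rate) pair obtained for one threshold setting corresponds to one point on the ROC curve.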
The proposed MHT2S intrusion detection model was developed on a personal computer with an Intel(R) Core i5-2410M CPU @ 2.30 GHz, 5 GB of memory, and the 32-bit Windows 7 Ultimate operating system. The algorithm was implemented in Java SE 7 using the NetBeans IDE 7.0 platform. The model has been evaluated using the KDD Cup’99 dataset. The MHT2S based DoS model utilized 54574 unique DoS profiles, of which 50574 were used for building the model and the remaining 4000 for testing it. In the probe model, out of 1628 unique profiles, 1478 were used for building and the remaining 150 for testing. In the R2L model, 375 unique profiles were used for building the model and the remaining 50 for testing. In the U2R model, 32 profiles were used for building the model and 5 for testing. For the normal model, 50000 unique profiles were selected proportionately from 87832 profiles; of these, 45000 were used for building the model and the remaining 5000 for testing. The number of features selected after preprocessing in the DoS, probe, R2L, U2R, and normal models is 13, 23, 13, 20, and 15, respectively, and the names of the features are listed in Table
Features selected for building MHT2S model.
Class | Selected features |
---|---|
DoS | Protocol_type, service, flag, src_bytes, dst_bytes, count, srv_count, serror_rate, srv_serror_rate, dst_host_count, dst_host_srv_count, dst_host_serror_rate, dst_host_srv_serror_rate. |
Probe | Duration, protocol_type, service, flag, src_bytes, dst_bytes, count, srv_count, srv_serror_rate, rerror_rate, srv_rerror_rate, same_srv_rate, diff_srv_rate, srv_diff_host_rate, dst_host_count, dst_host_srv_count, dst_host_same_srv_rate, dst_host_diff_srv_rate, dst_host_same_src_port_rate, dst_host_srv_diff_host_rate, dst_host_srv_serror_rate, dst_host_rerror_rate, dst_host_srv_rerror_rate. |
R2L | Service, flag, hot, logged_in, is_guest_login, count, same_srv_rate, dst_host_count, dst_host_srv_count, dst_host_same_srv_rate, dst_host_diff_srv_rate, dst_host_same_src_port_rate, dst_host_srv_diff_host_rate. |
U2R | Duration, protocol_type, service, flag, src_bytes, dst_bytes, hot, logged_in, num_compromised, root_shell, num_root, num_file_creations, num_shells, count, srv_count, same_srv_rate, dst_host_count, dst_host_srv_count, dst_host_same_srv_rate, dst_host_same_src_port_rate. |
Normal | Protocol_type, service, flag, src_bytes, dst_bytes, logged_in, count, srv_count, same_srv_rate, srv_diff_host_rate, dst_host_count, dst_host_srv_count, dst_host_same_srv_rate, dst_host_same_src_port_rate. |
The results obtained are discussed in this section. In this study, separate detection models are developed for normal traffic and for the four types of attacks, based on the unique traffic profiles available in the 10% corrected subset of the KDD Cup’99 dataset. Each model is evaluated first by validation and then by testing. Validation measures the generalization capacity of the system on traffic profiles of the same class, whereas testing measures the efficiency of the proposed IDS on both same-class and attack traffic profiles.
Validation of the MHT2S detection system has been carried out using the tenfold cross validation technique. The advantage of this technique is that it reduces variance, which makes the results of the model less sensitive to the choice of training group. In the tenfold cross validation process, legitimate traffic profiles are divided into ten sets, from which a training dataset is created by combining nine randomly selected sets to build the MHT2S detection system. The remaining set is used as the test dataset for evaluating the performance of the model. The process is repeated ten times by combining the sets in ten different ways, and the average detection rate is taken as the result of the system. For example, results obtained using tenfold cross validation of the DoS model are shown in Table
Tenfold cross validation results of DoS model.
Fold | | | | | |
---|---|---|---|---|---|
1 | 100 | 100 | 100 | 100 | 100 |
2 | 99.36 | 99.40 | 99.40 | 99.44 | 99.44 |
3 | 99.80 | 99.86 | 99.88 | 99.88 | 99.92 |
4 | 100 | 100 | 100 | 100 | 100 |
5 | 99.76 | 99.78 | 99.82 | 99.84 | 99.88 |
6 | 100 | 100 | 100 | 100 | 100 |
7 | 99.04 | 99.04 | 99.04 | 99.08 | 99.10 |
8 | 80.16 | 84.70 | 85.74 | 85.82 | 85.90 |
9 | 99.42 | 99.46 | 99.52 | 99.52 | 99.54 |
10 | 99.84 | 99.92 | 99.92 | 99.92 | 99.92 |
Avg. | | | | | |
The average detection rates thus obtained in tenfold validation for all the models with different threshold ranges are given in Table
Average detection rates (%) of different models with 10-fold cross validation technique.
Class | | | | | |
---|---|---|---|---|---|
Normal | 97.34 | 98.16 | 98.97 | 99.60 | 99.76 |
DoS | 97.59 | 98.22 | 98.22 | 98.35 | 98.37 |
Probe | 91.55 | 94.15 | 95.48 | 96.44 | 98.07 |
R2L | 89.50 | 96.00 | 96.25 | 97.50 | 98.25 |
U2R | 45.00 | 50.00 | 60.00 | 60.00 | 60.00 |
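The tenfold cross validation procedure described above can be sketched as follows. The sketch assumes that profiles are shuffled once with a fixed seed and dealt into ten folds, with each fold held out exactly once; the shuffling scheme, the seed, and the function names are assumptions rather than details from the original implementation.

```python
import random


def k_fold_splits(profiles, k=10, seed=0):
    """Yield (train, test) pairs, holding out each of the k folds once."""
    indices = list(range(len(profiles)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = [profiles[i] for i in folds[held_out]]
        train = [
            profiles[i]
            for f, fold in enumerate(folds)
            if f != held_out
            for i in fold
        ]
        yield train, test


def average_detection_rate(fold_rates):
    """Average the per-fold detection rates (in percent)."""
    return sum(fold_rates) / len(fold_rates)


# Per-fold detection rates copied from the second threshold column of the DoS table.
dos_second_column = [100, 99.40, 99.86, 100, 99.78, 100, 99.04, 84.70, 99.46, 99.92]
dos_average = average_detection_rate(dos_second_column)
```

Averaging the second threshold column of the DoS fold table in this way gives about 98.22, which agrees with the corresponding DoS entry in the table of average detection rates.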
Performance testing of the MHT2S detection system has been carried out using a training dataset consisting of 90% of the normal traffic profiles. The remaining 10% of the normal profiles has been combined with 10% of the attack profiles to form the test dataset. For example, out of 54572 unique DoS traffic records, 50572 records are taken as the training dataset and used for developing the MHT2S DoS model, and the remaining 4000 records are combined with an equal number of normal records to form the test dataset.
During the evaluation process, both training and test datasets are kept entirely different in such a way that the model provides a more generalized environment for predicting its efficiency. The performance testing has been carried out by keeping the
Testing performances for five classes.
Class | Evaluation metrics | | | |
---|---|---|---|---|
Normal | DR (%) | 100 | 100 | 100 |
 | FAR (%) | 3.53 | 1.02 | 0.30 |
DoS | DR (%) | 99.74 | 99.75 | 99.77 |
 | FAR (%) | 0.26 | 0.23 | 0.23 |
Probe | DR (%) | 96.73 | 95.52 | 97.32 |
 | FAR (%) | 3.67 | 2.54 | 0.94 |
R2L | DR (%) | 100 | 100 | 100 |
 | FAR (%) | 10.5 | 3.5 | 2.50 |
U2R | DR (%) | 100 | 100 | 100 |
 | FAR (%) | 62 | 52 | 44 |
ROC curve for all classes.
The detection system has been found to be efficient based on the ROC curves which provide a good trade-off between detection rates and false alarm rates for all the classes. Figure
Accuracy (%) achieved by the proposed system for different thresholds.
Threshold | Normal | DoS | Probe | R2L | U2R |
---|---|---|---|---|---|
 | 98.18 | 99.66 | 96.88 | 92.7 | 64 |
 | 99.49 | 99.36 | 72.27 | 98.25 | 69 |
 | 99.85 | 99.25 | 59.31 | 92.88 | 53 |
Performance of MHT2S model in terms of detection rate, false alarm rate, and accuracy for all classes is found to be better than the results obtained with the best detection approaches published. Accuracy of MHT2S model is compared with the results in the literature [
Performance comparison.
A new approach for intrusion detection in network environments has been presented by deploying Hotelling’s T2 statistical test, a multivariate process control technique. The MHT2S detection system is developed in three steps, namely, preprocessing, multivariate Hotelling’s T2 statistics, and attack detection. Redundancy removal, normalization, and feature selection are carried out in the preprocessing step. Using Hotelling’s T2 statistics, profiles are generated based on T-square distance metrics. Attack detection is implemented by determining a threshold range using the central limit theorem. Based on this threshold range, observed profiles are classified as either normal or attack. The MHT2S model is evaluated using the KDD Cup’99 dataset to verify its effectiveness.
Performance of the model has been evaluated through validation and testing. Validation has been performed to analyze the detection rate of the model on traffic profiles of the same class. Testing helped in understanding the significance of the model through unknown and known attack profiles for each class. The results have shown encouraging performance in terms of detection rate and false alarm rate. Detection rates of 100 percent are achieved for the normal, R2L, and U2R classes. For the DoS and probe classes, the detection rates are 99.77 and 97.32 percent, respectively. Very low false alarm rates are achieved for all classes except U2R, for which the false alarm rate is considerably high owing to the small number of traffic profiles. A comparison of accuracy with the existing models shows that the MHT2S based intrusion detection model achieves better performance. Therefore, the MHT2S model could be employed as an effective tool for providing security for network environments. A better mechanism for reducing the false alarm rate of the U2R class could be explored in the future.
The authors declare that there is no conflict of interests regarding the publication of this paper.