A Lightweight Flow Feature-Based IoT Device Identification Scheme

Internet of Things (IoT) device identification is a key step in the management of IoT devices. The devices connected to the network must be controlled by the manager. For this purpose, many schemes have been proposed to identify IoT devices, especially schemes that work on the gateway. However, almost no researchers pay close attention to the cost. Thus, considering the gateway's limited storage and computational resources, a new lightweight IoT device identification scheme is proposed. First, DFI (deep/dynamic flow inspection) technology is utilized to efficiently extract flow-related statistical features based on in-depth studies. Then, combining symmetric uncertainty and the correlation coefficient, we propose a novel filter feature selection method based on NSGA-III to select effective features for IoT device identification. We evaluate our proposed method using a real smart home IoT data set and three different ML algorithms. The experimental results show that our proposed method is lightweight and that the feature selection algorithm is effective: using only 6 features, it achieves 99.5% accuracy with a 3-minute time interval.


Introduction
With the popularization and development of high-speed networks, artificial intelligence, big data, and other technologies, the number of IoT (Internet of Things) devices connected to the Internet has rapidly increased. According to Cisco's forecast, 500 billion IoT devices will access the Internet by 2030 [1]. The mounting number of IoT devices poses threats to the network [2] and brings more challenges to network managers [3]. Cisco's recent comprehensive report on network security [4] stated that an increasing number of hackers exploit the vulnerabilities of IoT devices to carry out cyberattacks. In the current Internet environment, exploiting IoT devices to implement DDoS (distributed denial of service) attacks has become a primary form of attack [5]. Therefore, managing IoT devices and ensuring the security of the IoT network system have become the issues of most concern for network managers.
Presently, there are methods that ensure the security of IoT systems by authenticating IoT devices through cryptographic approaches or deep learning [6]. However, these methods are generally costly and ill-suited to the low energy consumption and low computing power of networked devices, which degrades the performance of the IoT system. At the same time, traditional anomaly detection systems judge whether a device exhibits abnormal behavior by detecting abnormalities in its traffic pattern. However, IoT devices are massive in number and heterogeneous, which makes it impractical to identify abnormal behavior patterns directly. Therefore, identifying the types of IoT devices connected to the network, especially at low cost, is of great significance to the management of IoT devices. With limited gateway computing resources, identifying devices efficiently and accurately is a problem that urgently needs to be solved.
To better identify devices on the gateway, this study proposes a lightweight IoT device identification method based on flow features. This solution studies flow-related statistical characteristics intensively; then, to reduce cost, a novel NSGA-III-based [7,8] filter-type feature selection algorithm is proposed; finally, the extremely randomized trees algorithm is used to build a device recognition model to classify devices. The features used in this paper have the following properties: first, the features are at the transport layer, so this method is suitable for all IoT devices that communicate over the TCP/IP protocol stack; second, they do not include plaintext features, which avoids the feature invalidation caused by encrypted transmission and at the same time allows feature extraction, model construction, and IoT device identification to be performed efficiently; last, the proposed feature selection method also plays an important role in reducing the cost of the device identification process.
Some of the important contributions of our present work are listed below: (1) To solve the problem of IoT device identification in a low-overhead manner, we develop a lightweight IoT device identification scheme based on feature selection and machine learning algorithms. We also demonstrate its ability to identify IoT devices with over 99.5% accuracy at less cost than other schemes. (2) In-depth research is carried out on flow-related statistical characteristics and the time interval of feature extraction. DFI technology is used to build features, which avoids the unavailability of plaintext features due to data encryption and improves the performance of feature extraction, model construction, and device identification. (3) Based on NSGA-III, we introduce symmetric uncertainty and the correlation coefficient and propose a novel low-overhead feature selection method for the flow-related statistical features extracted for IoT device identification; the valid features are retained while the dimensionality of the features is reduced. (4) Experiments are conducted on a real data set. The experimental results show that the proposed feature selection method performs well and that the proposed scheme achieves higher accuracy in a short time window. Its cost is much lower than that of the existing method, and it achieves the same accuracy as the existing scheme.
The remainder of this paper is arranged as follows: Section 2 reviews the related works. In Section 3, we explain our proposed feature selection method and the IoT device identification model. In Section 4, we present the data set and the experimental results. Finally, Section 5 concludes this work.

Related Works
Recently, researchers have proposed a variety of solutions for identifying IoT devices. Current IoT device identification schemes can be classified into two categories according to how fingerprints are acquired: active detection methods and passive traffic analysis methods. The active detection method obtains a response by sending requests to the target device and extracts the banner for device identification by analyzing the content of the response. The passive method extracts features by analyzing the daily traffic generated by the device. Feng et al. [9] proposed an active detection method for device discovery and identification, which uses the application layer response generated by the device to extract the banner, builds a fingerprint database, and then establishes the mapping between the device response and the device type, vendor, and model. They achieved a very fine-grained device identification scheme, but this approach needs to send massive numbers of packets to the network, which imposes a huge cost on the devices. Such methods focus more on device discovery than on management. To better manage IoT devices at low cost, our proposed method extracts features passively.
Miettinen et al. [10] proposed a framework to identify the types of networked devices and restrict the communication of vulnerable ones. They used 23 features generated from the traffic packets of the IoT devices to construct fingerprints for each device. A classifier was trained for each device type to identify vulnerable devices. This method can easily differentiate vulnerable devices from normal devices, but it only checks whether devices are normal when they are first introduced into the network. This approach is not intended for long-term device management. We resolve this problem in our method by continuously collecting the traffic that devices produce. Furthermore, Marchal et al. [11] proposed AuDI, which divides the network traffic into "flows," each of which is a time series. They defined a flow as the traffic that uses a specific protocol to communicate with a MAC address. When a packet is sent within a given second, that second is marked as 1 in the time series. Then, the periodic features of the traffic are obtained with the DFT (discrete Fourier transform), and 33 features related to the traffic cycle are used with the kNN algorithm to classify the devices.
This novel method uses the DFT to construct features, but the features have a high dimension, and the DFT process also adds considerable cost to the identification system. Our method avoids these expensive calculations.
Santos et al. [12] combined four statistical features of traffic with the text information of the user agent extracted from the packet payload and used the random forest algorithm to classify the devices. Le et al. [13] proposed a method for device classification based on DNS traffic. They extract the content of DNS traffic packets and use the TF-IDF algorithm for feature construction to classify and identify the type, vendor, and model of the device. Msadek et al. [14] proposed a dynamic sliding window traffic segmentation method, and they used DPI (deep packet inspection) technology for feature construction and a variety of machine learning algorithms for model construction and evaluation. These three methods use plaintext features for device identification. For encrypted traffic, however, these features become invalid. Our method avoids using plaintext features for this reason. Aksoy and Gunes [15] proposed a method that uses a GA (genetic algorithm) to reduce the dimension of the feature vector and utilized a variety of machine learning algorithms to build a secondary classification model to classify devices at a genre-model granularity. Nonetheless, the GA and the secondary classification introduce more cost into the system. Shahid et al. [16] used the sizes of the first N packets and the N − 1 time intervals between them in the TCP session interaction between devices as features and various machine learning algorithms for device identification. This method is also not suitable for long-term device management. There is also some research that develops device identification schemes based on signal processing, such as [17,18]; this research focuses on the physical layer performance of the devices, which is not our focus, but we also consider these works because they are effective methods for IoT device identification. The main contribution of these studies was to construct special features associated with the device type and to accomplish device type identification with machine learning algorithms.
The features are essential in this type of work. Sivanathan et al. [19] deeply investigated the characteristics of traffic at the flow level, and they constructed a 2-stage classifier for device classification. In the first stage, they extract DNS queries, port numbers, and cipher suites and use these text features to obtain a class and a confidence value. In the second stage, they combine the output of the first stage with flow-level statistics and use random forest to classify devices. We used their method as the baseline for comparison. Based on their work, we optimized the feature selection to reduce the cost of the IoT device identification system, attaining a lightweight method with comparable identification accuracy.

The Proposed Device Identification Model
The system model is pictured in Figure 1. First, we take the captured traffic as input and select a fixed time interval to split the traffic; second, we generate flows from the split traffic, extract flow-level features by a statistical method, and then filter out invalid and redundant features with the proposed feature selection method, which is based on NSGA-III; finally, a variety of machine learning algorithms are combined with the features selected in the previous step to classify devices, and multiple time intervals are selected for experimentation. The most suitable time interval and machine learning algorithm are then selected to build the efficient device classification model.

Feature Description.
The purpose of this article is to build an efficient and accurate IoT device identification scheme based on flow-related statistical features. The first step in device identification is to use flow statistics to represent the behavior of IoT devices. In addition, the method in this paper selects the flows generated in a fixed time window, which avoids the low feature extraction efficiency caused by flows of very long duration. At the same time, we found that when bidirectional flows are used for feature extraction, the features generated from the large amount of flow data produced by frequent mutual access between devices in the LAN decrease classification accuracy. This is mainly because frequent mutual access between devices generates a large amount of identical traffic, which results in similar features.
For example, the traffic between the Belkin Wemo switch and the Belkin Wemo motion sensor in the data set has this problem. Table 1 shows the result of address statistics on the pcap data of the Belkin Wemo motion sensor using Wireshark. DstIpAddress represents the destination IP address of the packets, and Count is the number of packets. 192.168.1.223 is the IP address of the Belkin Wemo switch. 64.14% of the traffic is exchanged between these two devices, which produces a large number of similar features and degrades the device identification model. Given that a large number of network attacks require access to the Internet, the flow features used in this solution are all computed from bidirectional flows in which local devices interact with external network services or devices.
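The restriction to flows between a local device and an external endpoint can be sketched as follows. This is a minimal illustration, not the authors' implementation: the flow records, their `src`/`dst` field names, and the address 192.168.1.240 are hypothetical, and "local" is assumed to mean an address in a private range.

```python
import ipaddress

def is_local(ip: str) -> bool:
    """Return True for private (LAN) addresses such as 192.168.1.223."""
    return ipaddress.ip_address(ip).is_private

def keep_external_flows(flows):
    """Keep only flows with exactly one local and one external endpoint."""
    return [f for f in flows if is_local(f["src"]) != is_local(f["dst"])]

# Hypothetical flow records; only the switch address comes from the text above.
flows = [
    {"src": "192.168.1.240", "dst": "192.168.1.223", "bytes": 120},  # LAN to LAN: dropped
    {"src": "192.168.1.240", "dst": "8.8.8.8", "bytes": 300},        # LAN to Internet: kept
]
external = keep_external_flows(flows)
```

Filtering before feature extraction keeps the repeated device-to-device exchanges out of the statistics entirely, rather than trying to down-weight them afterwards.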
Flow [19] is identified by a five-tuple: source IP address, destination IP address, source port, destination port, and protocol. The flow-related statistical characteristics are the median, mode, maximum, minimum, information entropy, mean, and variance of flowVolume (the sum of the bytes uploaded and downloaded in a bidirectional flow), and the same statistics of flowRate (flowVolume divided by the duration of the flow). At the same time, the port numbers accessed by the device can also serve as part of the basis for classification. To fit the machine learning algorithm, the port number-related features are processed as follows in this scheme. First, the port numbers are divided into three categories: port numbers 0-1023, which are assigned to certain services, form one category, represented as port1; port numbers 1024-49151, which are loosely bound to some services, form a second category, represented as port2; and the dynamic or private ports 49152-65535 form a third category, represented as port3. Binary encoding is performed on these three categories. The number of occurrences of port numbers in each category is recorded, denoted as port1Cnt, port2Cnt, and port3Cnt. Moreover, the number of flows belonging to each protocol (TCP/UDP) is recorded, denoted as tcpCnt and udpCnt.
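The 22 features described above can be sketched for one time window as follows. This is a minimal illustration under stated assumptions: the flow record fields (`bytes`, `duration`, `dst_port`, `proto`) and the feature key names such as `flowVolumeMedian` are hypothetical, the variance is taken as the population variance, entropy is computed over the empirical value distribution, every flow is assumed to have a nonzero duration, and port1 to port3 are read as 0/1 flags for whether a port in that category was accessed.

```python
import math
import statistics
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def describe(values, prefix):
    """The seven statistics named in the text, keyed by a hypothetical prefix."""
    return {
        f"{prefix}Median": statistics.median(values),
        f"{prefix}Mode": statistics.mode(values),
        f"{prefix}Max": max(values),
        f"{prefix}Min": min(values),
        f"{prefix}Entropy": entropy(values),
        f"{prefix}Mean": statistics.fmean(values),
        f"{prefix}Variance": statistics.pvariance(values),  # assumption: population variance
    }

def port_category(port):
    """Map a port number to category 1 (0-1023), 2 (1024-49151), or 3 (49152-65535)."""
    if port <= 1023:
        return 1
    if port <= 49151:
        return 2
    return 3

def window_features(flows):
    """Build the feature vector for the flows observed in one time window."""
    volumes = [f["bytes"] for f in flows]
    rates = [f["bytes"] / f["duration"] for f in flows]  # assumes duration > 0
    features = {**describe(volumes, "flowVolume"), **describe(rates, "flowRate")}
    port_counts = Counter(port_category(f["dst_port"]) for f in flows)
    for cat in (1, 2, 3):
        features[f"port{cat}"] = int(port_counts[cat] > 0)  # binary flag
        features[f"port{cat}Cnt"] = port_counts[cat]
    protocols = Counter(f["proto"] for f in flows)
    features["tcpCnt"] = protocols["TCP"]
    features["udpCnt"] = protocols["UDP"]
    return features

# Hypothetical flows for a single device and window.
flows = [
    {"bytes": 100, "duration": 2.0, "dst_port": 443, "proto": "TCP"},
    {"bytes": 100, "duration": 1.0, "dst_port": 8080, "proto": "TCP"},
    {"bytes": 400, "duration": 4.0, "dst_port": 53, "proto": "UDP"},
]
feats = window_features(flows)
```

All of these values come from flow headers and byte counts, so no payload inspection is needed, which is what keeps the extraction step cheap on a gateway.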
For ease of deployment, this solution extracts flow-related information within a fixed time window as classification features. The choice of time window affects the performance of the solution. When the time window is short, the overhead of storing and extracting features is small. However, over a short period of time, the flow statistics of some devices are highly similar, which decreases the accuracy of the model. When a long time window is selected, storing and extracting the features is costly, but the flow statistics of different devices differ more clearly from each other. Therefore, it is necessary to make a trade-off between the overhead of storing and extracting features and the classification accuracy. The gateway device is sensitive to storage and computation overhead, so the time window should be shortened appropriately.
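Splitting traffic into fixed time windows can be sketched as below. This is only an illustration: the `(timestamp, item)` pair representation and the function name `split_by_window` are assumptions, and windows are assumed to start at the earliest timestamp in the capture.

```python
from collections import defaultdict

def split_by_window(packets, window_seconds):
    """Group (timestamp, item) pairs into consecutive fixed-length windows."""
    if not packets:
        return {}
    start = min(ts for ts, _ in packets)
    windows = defaultdict(list)
    for ts, item in packets:
        index = int((ts - start) // window_seconds)  # window number, starting at 0
        windows[index].append(item)
    return dict(windows)

# Hypothetical capture: timestamps in seconds, items stand in for packets.
packets = [(0.0, "a"), (59.9, "b"), (60.0, "c"), (185.0, "d")]
one_min = split_by_window(packets, 60)     # {0: ['a', 'b'], 1: ['c'], 3: ['d']}
three_min = split_by_window(packets, 180)  # {0: ['a', 'b', 'c'], 1: ['d']}
```

The same capture yields fewer, larger groups with the longer window, so a longer window produces fewer feature vectors per device, each computed from more flows.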

Feature Selection.
The purpose of feature selection is to select a valid subset of attributes and to remove irrelevant or redundant attributes. Traditional feature selection methods can be divided into three categories, namely filter, wrapper, and embedded methods. Compared with the other two types, the filter method does not require training a machine learning algorithm during the feature selection process and is the least expensive of the three. The filter method assumes that the selected optimal feature combination is a set of valid features. How to evaluate the utility of a feature is a key issue in the filter method. To better ensure the quality of the selected features, a feature selection method based on multiple objective functions using NSGA-III is proposed.
To ensure the effectiveness of features, this method models feature selection as a multiobjective optimization problem and uses NSGA-III to search for the optimal solution.
There are three objective functions (evaluation functions). In the following description, F represents the set of all features, SF represents the selected feature subset, and NSF represents the unselected feature subset, which have the following relationship: F = SF ∪ NSF.

Symmetric Uncertainty [20] Based Objective Function.
Mutual information (MI) of two variables is a measure of the degree of interdependence between the variables. The value of mutual information represents the degree to which the uncertainty of one variable is reduced when the other variable is known. The mutual information MI(X; Y) between two random variables X and Y is shown in equation (1).
The value of b is 2, p(X) and p(Y) are the probability density functions of X and Y, respectively, and p(X, Y) is the joint probability density function of X and Y. Symmetric uncertainty is normalized mutual information, which makes the information shared between random variables comparable, and it is commonly used in the feature selection process. The calculation of symmetric uncertainty is shown in equation (2).
The value of SU(X, Y) ranges between 0 and 1. The closer the symmetric uncertainty value is to 1, the more relevant the variables X and Y are. At this point, we obtain the first objective function, which is given by equation (3).
SU(f_i, f_j) is the symmetric uncertainty between feature i and feature j in SF, and SU(f, c) is the symmetric uncertainty between feature f in SF and the class c. The smaller the function value, the better the classification effect of feature set SF.
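Symmetric uncertainty for discrete variables can be computed as in the sketch below. It assumes the standard definitions MI(X; Y) = H(X) + H(Y) − H(X, Y) and SU(X, Y) = 2·MI(X; Y) / (H(X) + H(Y)) with base-2 logarithms, which are taken here to correspond to equations (1) and (2); continuous features would need to be discretized first, and the function names and sample data are illustrative only. The objective function of equation (3) itself is not reproduced.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy H (base 2) of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(x, y):
    """MI(X; Y) = H(X) + H(Y) - H(X, Y) for two equally long discrete sequences."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * MI(X; Y) / (H(X) + H(Y)); defined as 0 when both are constant."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    return 2.0 * mutual_information(x, y) / (hx + hy)

# Hypothetical discretized feature values and device labels.
label = ["cam", "cam", "plug", "plug"]
informative = [0, 0, 1, 1]   # determines the label exactly
uninformative = [0, 1, 0, 1] # independent of the label

su_high = symmetric_uncertainty(informative, label)
su_low = symmetric_uncertainty(uninformative, label)
```

A feature that determines the class gives a value of 1 and an independent feature gives 0, which is the scale on which SU(f, c) and SU(f_i, f_j) are compared.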

Correlation Coefficient-Based Objective Function.
The correlation coefficient is also a measure of the degree of correlation between variables. The difference between symmetric uncertainty and the correlation coefficient is that the latter measures the degree of correlation between variables from the perspective of statistics, while the former measures it from the perspective of information entropy. The calculation of the correlation coefficient is shown in equation (4).
cov(X, Y) is the covariance of random variables X and Y, and σ_X and σ_Y are the standard deviations of X and Y, respectively. From this, we design the second objective function, defined in equation (5).
f_i, f_j, and c have the same meaning as in equation (3). The smaller the function value, the better the classification result of feature set SF. To enable the feature selection method to achieve dimensionality reduction, the third objective function is introduced in equation (6).
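The correlation coefficient defined from cov(X, Y), σ_X, and σ_Y can be sketched as below. The sketch assumes the Pearson form cov(X, Y) / (σ_X σ_Y) with population statistics and assumes that the class label c has been encoded numerically before correlating a feature with it; the function name and sample data are illustrative, and equations (5) and (6) are not reproduced.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    if std_x == 0 or std_y == 0:
        return 0.0  # convention chosen here: a constant variable has no correlation
    return cov / (std_x * std_y)

# Hypothetical feature columns.
tcp_cnt = [1, 2, 3, 4]
doubled = [2, 4, 6, 8]
reversed_cnt = [8, 6, 4, 2]

r_pos = pearson(tcp_cnt, doubled)
r_neg = pearson(tcp_cnt, reversed_cnt)
```

Because the coefficient is signed, a redundancy measure between two features would typically use its absolute value; whether the original objective does so is not stated in the text.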

NSGA-III Algorithm.
The framework of the NSGA-III [7,8] algorithm is roughly the same as that of the NSGA-II algorithm. The main difference lies in the mechanism for selecting individuals for the next generation: NSGA-II selects them based on the crowding distance, whereas NSGA-III uses a method based on reference points. NSGA-III addresses the insufficient convergence and diversity that arise in multiobjective optimization problems with three or more objective functions. The algorithm also makes it easier to find the optimal solution.
To optimize the three proposed objective functions (F_1, F_2, F_3), the steps of the NSGA-III algorithm are as follows: (1) Generate an initial population of N individuals. Each individual is a sequence of random values between 0 and 1. A value larger than 0.5 represents a selected feature; otherwise, the feature is not selected.
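The encoding of individuals in step (1) can be sketched as follows. This covers only population initialization and decoding, not the NSGA-III search itself; the function names, the four-feature list, and the fixed random seed are illustrative assumptions.

```python
import random

def random_individual(n_features, rng):
    """One individual: a random value in [0, 1) for every candidate feature."""
    return [rng.random() for _ in range(n_features)]

def decode(individual, feature_names, threshold=0.5):
    """Return the features whose gene is strictly greater than the threshold."""
    return [name for gene, name in zip(individual, feature_names) if gene > threshold]

# Hypothetical short feature list for illustration.
names = ["port2", "port2Cnt", "tcpCnt", "udpCnt"]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = [random_individual(len(names), rng) for _ in range(40)]

# A hand-written individual: genes 0.91 and 0.73 exceed 0.5, so two features are selected.
selected = decode([0.91, 0.12, 0.50, 0.73], names)
```

Using real-valued genes with a threshold, rather than bits, lets standard real-coded crossover and mutation operators act on the individuals while the objective functions still see a discrete feature subset.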

Machine Learning Algorithm.
To achieve the best results, we selected three machine learning algorithms based on their descriptions in the literature [21], evaluated them in terms of accuracy and training speed, and selected the best performing algorithm so that the method proposed in this article attains high classification accuracy with low overhead. The following briefly introduces the three machine learning algorithms used in the experiment: (1) k-Nearest Neighbor (kNN) Algorithm. kNN is a classification algorithm with no training process. Its most important parameter is k: given an input sample x, x is assigned to the category to which most of the k training samples closest to x belong. kNN is used in the preliminary experimental verification process. (2) Random Forest (RF). RF is an ensemble learning method that contains multiple CART decision trees. Many articles have used RF to construct IoT device identification schemes with excellent results, indicating that it is suitable for the device identification system.

(3) Extremely Randomized Trees (ET). ET is very similar to RF. The difference between this method and RF is that the split attribute at each node of a decision tree in ET is chosen at random, while in RF the split attribute at each node is chosen after calculating the Gini index. Given its high similarity with RF, we select this algorithm as part of the device identification system for comparison with RF's results.
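The kNN decision rule in item (1) above can be sketched in a few lines. This is a plain illustration of majority voting among the k nearest training samples using Euclidean distance, not the scikit-learn implementation used in the experiments; the two-dimensional samples and the device labels are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, x, k=3):
    """Assign x to the most common label among its k nearest training samples."""
    distances = sorted(
        (math.dist(sample, x), label) for sample, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-feature training set.
train_x = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (5.2, 4.9)]
train_y = ["camera", "camera", "plug", "plug"]

prediction = knn_predict(train_x, train_y, (1.1, 1.0), k=3)
```

Because every prediction scans the stored training samples, kNN has no training cost but its behavior depends directly on how many samples are available. RF and ET are available in scikit-learn as `RandomForestClassifier` and `ExtraTreesClassifier`.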

Data Set, Experiment Results, and Analysis
In this section, we analyze in detail the data set used [19] and the features selected by this scheme, and we use different machine learning algorithms at different time intervals to evaluate the classification results and cost. Finally, the best performing ML algorithm is identified, and the model is constructed based on this algorithm. The experimental environment is a personal computer with an Intel Core i5-9400 2.90 GHz CPU, 8 GB of memory, and a 64-bit Windows 10 operating system. The experimental steps are as follows: first, the collected data are split at fixed time intervals, and the joy tool [19] is used to extract the flow information; second, a Python script calculates the relevant statistics from the output of joy and constructs the features for storage; finally, the machine learning algorithms provided by scikit-learn [22] are used to build machine learning models, classify the devices, and evaluate the classification results.

Data Set.
The data set used in this article is the public data set of the paper [19], which was obtained by collecting the traffic of smart home devices in a laboratory under a campus network environment. The IoT devices in the data set include cameras, smart lighting tools, activity sensors, and health monitors. A TP-Link router acts as the gateway through which all devices connect to the Internet. During data collection, the authors connected an additional device to the router, used tools such as tcpdump to passively collect the traffic of all devices, and saved the traffic collected each day as a pcap file on the hard disk attached to that device. This article uses the publicly released 20 days of data for experiments. Because the solution in this paper constructs features and classifies at the transport layer, whereas the provided data set only gives the MAC address corresponding to each device, we also determine the IP address corresponding to each device.

Feature Selection Results.
This solution uses the filter feature selection method based on NSGA-III to remove redundant features while reducing the dimensionality of the features, that is, to reduce the computational cost of the model while preserving the accuracy of the classification. NSGA-III is a variant of the GA. Individuals are constructed as follows: the number of elements in an individual equals the cardinality of the full feature set; initially, the value of each element is a random number between 0 and 1, and an element greater than 0.5 indicates that the corresponding feature is selected. In the experiment, the number of individuals is 40, and the number of iterations is set to 100. We performed feature selection on the 1-min time interval to keep the overhead introduced to the system small. Figure 2 shows the results of running NSGA-III.
As can be seen in Figure 2, the three objective functions appear to reach their minimum values at the same time. The feature selection process brings only a small additional overhead to the system. Through our feature selection method, we select six features from the 22 features described in Section 3.1. Given our objective of being lightweight, this markedly reduces the classification and training overhead.
We also compared the features used in this research with those of the baseline method; the features and their selection status are shown in Table 2. Our purpose is to investigate in depth the applicability of flow-related statistics and to establish a lightweight IoT device identification scheme; therefore, we construct the feature set almost entirely from flow-related statistics, because they are easy to obtain, which means that the feature extraction process adds little cost to the system. The baseline method uses only the mode of flow volume and flow rate and also forms bag-of-words models for the port, domain name, and cipher suite; these text features are fed to a Bayes classifier to generate the class and probability used in the final classification. From a lightweight point of view, we use only a one-level classifier and remove the text features, because text features need additional processing and cause extra cost. In the selection process, the features are also selected carefully to cut the cost further. At the same time, the classification performance is maintained at a high level; the classification details are shown in the following.
Regarding the features chosen by the feature selection process, we attempt to explain why our feature selection algorithm selects them. First, port2 and port2Cnt describe the device's access to ports between 1024 and 49151. Users' customized services usually run on these ports, and because different devices access different services, whether and how often these ports are accessed should discriminate well between devices. The variance and mode of flow volume represent the fluctuation and the quantity of device traffic, respectively, and they describe device communication behavior from the traffic perspective. The TCP and UDP flow counts capture the protocol differences between devices: because different devices access different services, their flows use different protocols, and these features describe device behavior from the protocol perspective. Combining all selected features, we can describe device communication behavior fairly comprehensively, and therefore the classification results can reach a high level of accuracy.

Classification Results.
In this section, we evaluate our scheme from two main points of view. The first is classification performance, which measures the applicability of an IoT identification method. The second, which demonstrates the lightweight character of our scheme, is the cost of our method.

Classification Performance.
The following shows the results of classifying the data set using the three machine learning algorithms mentioned before and the selected features. Because the selected algorithms have hyperparameters, different parameter values affect the accuracy and training speed of the model. RandomizedSearchCV [22] is used for parameter selection to ensure that the model performs as well as possible in each time interval. The accuracy shown in Figure 3 is the result obtained on the test set. In terms of accuracy, the longer the time interval, the greater the deviation between the flow characteristics of different devices, which yields better classification results. When the time interval is longer than 3 min, the accuracy of RF and the baseline method is stable at about 99.5%. However, the accuracy of the kNN algorithm decreases. Inspecting the feature set used in training, we found that as the time segment became longer, features were extracted less frequently, so the feature set became smaller. Unlike the other algorithms, kNN depends strongly on the size of the feature set. Moreover, for a comparable time segment, the performance of kNN is much worse than that of the other algorithms. To show that our scheme is statistically better than the baseline method, we performed 100 rounds of training and prediction on a 1-minute time segment. As shown in Figure 4, the accuracy of our scheme is statistically 1.5% higher than that of the baseline method.
As shown in Figures 5 and 3, in a short time window, our method's classification performance is better than the baseline method's. Inspecting the features, we find that the DNS interval, NTP interval, and sleep time used by the baseline method are meaningless in a short time interval, whereas the features chosen in our method are always meaningful. In other words, with a short time segment, some features in the baseline method, especially the time interval features, become homogenized and are inadequate to discriminate between devices. In contrast, the features used in our method, constructed from flow-related statistics and selected by the NSGA-III-based feature selection method, are adequate to distinguish devices whether the time segment is long or short.
We also present the detailed classification performance on the 3-min time segment because, as shown in Figure 5, the accuracy increases and reaches its peak when the time segment is 3 min. As Table 3 shows, our proposed method based on RF and ET reaches a level comparable with the baseline method. The results clearly show our method's strength: comparable or superior classification performance with much less overhead, which is clarified in the following.

Overhead of Proposed Method.
In terms of training time, ET is always the fastest to train. As the time interval becomes longer, the time needed to train the model becomes shorter, mainly because longer time intervals make the feature set smaller. Note that when evaluating the training time of the baseline method, only the time for the second-level classification is considered. The first-level classification generates a label and a degree of confidence for each sample, and this process incurs a heavy cost, especially for an enormous data set.
Our method also uses less storage space after feature extraction. As shown in Table 4, as the time interval becomes longer, the storage used by the proposed method is much less than that of the baseline method, mainly because of the text features used in the baseline method. Therefore, our method is superior to the baseline method in storage cost.
Whether in terms of training time or feature dimension, our scheme achieves better performance with less cost. We also obtained a detailed evaluation for the 3-min time interval. As shown in Table 2, the performance of ET using the features selected in this article was very close to that of the baseline method, while the overhead was significantly reduced. The accuracy of ET is close to the best accuracy, which RF achieved, but ET trains much faster than RF. Based on this trade-off between time cost and classification accuracy, we showed that ET is also a valid algorithm for constructing an IoT device identification scheme.

Conclusion
As more and more IoT devices are connected to the Internet, managing and annotating these devices is an essential problem for maintaining network security. In this paper, we propose a lightweight IoT device identification scheme based on traffic analysis. This scheme uses flow-related statistical features to represent the behavior of IoT devices and a filter feature selection method based on NSGA-III to select effective features. Machine learning algorithms are used to classify devices. Experimental results showed that our proposed scheme achieves comparable accuracy with much less overhead. Based on the ET algorithm combined with the six attributes port2, port2Cnt, tcpCnt, udpCnt, flowVolume's mode, and flowVolume's variance, the best classification result is achieved, and the training speed is the fastest. When the time interval is 1 min, an accuracy of 95.8% is achieved, while the accuracy of the baseline method is only 94.5%. For a longer time interval such as 3 min, our method achieves an accuracy of 99.3%. At the same time, the overhead is greatly reduced compared with the baseline method. This method is suitable for deployment on the gateway to identify IoT devices. Future work will focus on cloud services: how to integrate the models, ensure the trustworthiness of the gateway, and improve the performance and security of the distributed device identification system.

Data Availability
The data set is the same as that used in the paper "Classifying IoT Devices in Smart Environments Using Network Traffic Characteristics," and the access link is https://iotanalytics.unsw.edu.au/.

Conflicts of Interest
The authors declare that they have no conflicts of interest.