K-Modes Clustering Algorithm Based on Weighted Overlap Distance and Its Application in Intrusion Detection

In order to better apply the K-modes algorithm to intrusion detection, this paper overcomes the problems of the existing K-modes algorithm based on rough set theory. Firstly, for the problem of K-modes clustering in the initial class center selection, an initial class center selection algorithm Ini_Weight based on weighted density and weighted overlap distance is proposed. Secondly, based on the Ini_Weight algorithm, a new K-modes clustering algorithm WODKM based on weighted overlap distance is proposed. .irdly, the WODKM clustering algorithm is applied to intrusion detection to obtain a new unsupervised intrusion detection model. .e model detects the intrusion by dividing the clusters in the clustering result into normal clusters and abnormal clusters and analyzing the weighted average density of the object x to be detected in each cluster and the weighted overlapping distance of x and each center point. We verified the intrusion detection performance of the model on the KDD Cup 99 dataset. .e experimental results of the current study show that the proposed intrusion detection model achieves efficient results and solves the problems existing in the present-day intrusion detection system to some extent.


Introduction
For the threat of network security, there are many corresponding network security defence technologies. For example, traditional network security defence technologies include firewall, antivirus software, digital signature, digital authentication, and data encryption. Traditional network security defence technology is only a static defence method, and it is difficult to effectively protect our network security [1]. Intrusion detection helps systems count the number of errors in the system and analyze the causes of system vulnerabilities, detecting known intrusions and alarms, auditing of new anomalous behaviors, and integrity assessment of critical data files.
In view of the problems existing in existing intrusion detection systems, data mining techniques have been widely used in the field of intrusion detection in recent years. Among them, cluster-based unsupervised intrusion detection has caused a lot of attention. At present, many clustering algorithms have been applied to intrusion detection. As an effective extension of the K-means algorithm, the K-modes algorithm can effectively deal with categorical data. It inherits the efficient features of the K-means algorithm, and the algorithm is simple, easy to implement, and widely used in many fields. erefore, the K-modes algorithm has a very broad application prospect in the field of intrusion detection. However, the research on applying K-modes algorithm to intrusion detection is still rare [2,3].
In order to better apply the K-modes algorithm to intrusion detection, this paper solves the problems of the existing K-modes algorithm based on rough set theory and applies the improved algorithm to intrusion detection [4]. Firstly, for the problem of K-modes clustering in the initial class center selection, an initial class center selection algorithm Ini_Weight based on weighted density and weighted overlap distance is proposed. Secondly, based on the Ini_Weight algorithm, a new K-modes clustering algorithm WODKM based on weighted overlap distance is proposed.
irdly, the WODKM clustering algorithm is applied to intrusion detection to obtain a new unsupervised intrusion detection model UIDM_WODKM.
e model detects the intrusion by dividing the clusters in the clustering result into normal clusters and abnormal clusters and analyzing the weighted average density of the object x to be detected in each cluster and the weighted overlapping distance of x and each center point. We verified the intrusion detection performance of the model on the KDD Cup 99 dataset. e organization of the paper is as follows: Section 2 describes the intrusion detection system. Section 3 briefly discusses the K-modes clustering algorithm based on weighted overlap distance. e application of K-modes clustering algorithm based on weighted overlap distance in intrusion detection is given in Section 4. e paper is concluded in Section 5.

Intrusion Detection
Intrusion detection is a proactive network security defence strategy that can make up for the shortcomings of traditional static security policies and thus becomes a reasonable complement to traditional static defence strategies such as firewalls [5]. System administrators' security management capabilities have been extended through auditing, monitoring, intrusion identification and response, and reducing the workload of system administrators.
rough the intrusion detection system, system administrators can grasp the status of the network system in real time, including the status of programs, files, and devices, and help network system administrators to develop accurate and complete strategies [6]. In addition, the intrusion detection system dilutes the restrictions on professionals in network security, making it easy for nonprofessional personnel to manage network systems. In addition, the intrusion detection system can respond to the discovered network intrusion, illegal operation, timely and proactive response, and achieve the purpose of active defence [7].
Intrusion detection system (IDS) is a collection of network security systems or systems that detect abnormal behaviors in the system or changes in network status and can alert or take proactive responses [8]. It differs from other network security systems in that it is also an active security protection technology [9][10][11]. Intrusion detection in an intrusion detection system is mainly divided into the following steps: (1) Collect message: intrusion detection first collects the corresponding system and network information. e main content includes the content of network traffic, system health, status, and behavior of user connection activities. (2) Signal analysis: the information collected above is generally analyzed by three technical means: pattern matching, statistical analysis, and integrity analysis. e first two methods are used for real-time intrusion detection, while the integrity analysis is used for postmortem analysis.
(3) Real-time recording, alarm, and response: finally, the intrusion detection system can timely respond to detected intrusions or other network attacks. is includes logging intrusion information and alerting the network administrator.
At present, there are many kinds of intrusion detection systems. According to different classification standards, we classify the current intrusion detection system as follows.

Classify according to the Principle of Detection
(1) Anomaly detection: anomaly detection first summarizes the characteristics of normal behavior and then judges whether the user behavior is intrusive according to the user's activities or the use of resources in the system. Anomaly detection needs to establish a model to determine the normal user behavior and activities before detection and then to determine which qualified behaviors can be marked as "exceptions" and to make corresponding processing. (2) Abuse detection: misuse detection first establishes the feature library based on the behavioral characteristics of the abnormal operation collected. When the observed user activity or system resource usage matches the records in the feature library, it is considered an intrusion. When the intrusion behavior matches the normal user activity, the system produces a false alarm; when the feature library does not store the features matching with a new attack behavior, the system produces a false alarm. erefore, the focus of abuse detection is how to update the feature library automatically to reduce the false alarm rate.
(3) Mixed detection: hybrid detection combines the advantages of the above two methods, mainly based on the normal data flow of the system to detect intrusion behavior. And before making a decision, we not only analyze the normal behavior of the system but also observe the suspicious intrusion behavior. erefore, the results of mixed detection are more comprehensive. It is reliable.

Classify according to Data Sources
(1) Host-based intrusion detection system: host-based intrusion detection system obtains data from the host where the system runs. It mainly protects the security of the host where the system is located. Usually, this system is installed on the host which needs to be checked, mainly on the relatively critical system files and executable files. (2) Network-based intrusion detection system: with the development of network technology, the host-based intrusion detection system has been difficult to meet the needs of network security, and the networkbased intrusion detection system came into being. e data source of this kind of system analysis is the data packet on the network. It mainly analyzes and detects the intrusion behavior on the corresponding network segment according to the data flow, data packet, and corresponding protocol on the network.

Classify according to the Architecture of the Intrusion Detection System
(1) Centralized intrusion detection system: centralized intrusion detection systems usually have multiple auditors on different hosts, but only one central server is used for intrusion detection. In this way, the audit program sends the collected data trace to the central server, which then analyzes and detects the data. e scalability and configurability of such intrusion detection systems are relatively poor. (2) Distributed intrusion detection system: the distributed intrusion detection system (DIDS) sends the corresponding tasks to several different host-based IDS by the central detection server, and these hostbased IDS are nonhierarchical. Each of them is responsible for monitoring suspicious activities on the corresponding host. is kind of intrusion detection system has relatively high scalability and security, but its maintenance cost is high and the workload of the host is relatively heavy.

K-Modes Clustering Algorithm Based on Weighted Overlap Distance
e schematic of the intrusion detection system is shown in Figure 1.

K-Modes Clustering Algorithm.
e K-means algorithm is a clustering algorithm commonly used in the field of data mining. However, this algorithm can only process numeric data, but cannot process subtype data [12]. In response to this problem, Huang et al. further proposed the K-modes clustering algorithm.
Clustering is the process of dividing dataset into a plurality of groups or clusters composed of similar objects [13] e similarity between objects in the same group is made as high as possible, and the similarity of objects between different groups is as low as possible. Clustering is an unsupervised machine learning method that does not have any prior knowledge of the datasets that need to be classified [14]. Datasets are automatically divided into groups or clusters based solely on similarity metrics. Try to make the similarity between samples in the same group high and the sample similarity between different groups is low. e groups in the cluster do not need to be defined in advance and are automatically divided according to the inherent similarity of the data according to the actual characteristics of the data [13]. e cluster analysis system inputs the criteria for measuring the similarity between data and the dataset that needs to be classified, and the output is the result of the well-classified class. e additional result of cluster analysis is a comprehensive description of each group, which can be used to provide a more in-depth analysis of the characteristics of the dataset [15]. e K-means clustering algorithm is a popular clustering algorithm with simple and efficient characteristics, but it can only deal with numerical data. e K-modes clustering algorithm is an extension of the K-means algorithm, which can deal with class attribute data. It inherits the characteristics of the K-means clustering algorithm which is efficient and easy to implement, so it is widely used in various fields.
Distance measure (or similarity measure) is the most basic and important part of the K-modes clustering algorithm. As mentioned earlier, the existing geometric distance measurement for numeric data is not suitable for categorical data. It is necessary to design some new distance measurement mechanisms according to the characteristics of categorical data. At present, in the K-modes clustering algorithms, most of them use simple overlap distance to measure the dissimilarity between any two objects X and Y; that is, the number of attributes with different values of X and Y is taken as the distance between X and Y. e process of clustering and segmentation is shown in Figure 2.
When calculating the overlap distance between any two objects, we mainly get the final result by comparing the values of the two objects on each attribute. If we use the same way to deal with all the attributes in the information table without distinguishing them, this approach obviously does not conform to objective reality and eventually will lead to the clustering results deviation, thus affecting the performance of the K-modes algorithm. e K-modes algorithm uses a simple overlapping distance measure to calculate the distance between objects [4,16]. e specific definition of the distance measure is as follows: for any two objects X and y, X and Y in the universe U, the distance d(x, y) � a∈A δ a (x, y), which for any a ∈ A, δ a (x, y) denotes a simple overlap distance between X and Y on the conditional property a, that is, if e schematic of the K-means algorithm is shown in Figure 3.
Obviously, the traditional K-modes clustering algorithm has the following problems [17][18][19][20]: (1) Select the initial center point by random selection: choosing the initial center point in this way will lead to the instability of clustering results. If we want to get relatively good clustering results, we need to repeatedly execute the K-modes clustering algorithm and select the best group of data from it. For intrusion detection, it is inappropriate to select the initial center point by using the above method because the data to be processed by the network intrusion detection system is usually very huge. Repeated K-modes Scientific Programming clustering algorithms on massive, high-dimensional network data are often not feasible in time. (2) Traditional K-modes algorithms usually use simple overlap distance to measure the dissimilarity between objects X and y. Simple overlapping distance is simple and intuitive, but there are still some problems in its practical application. For example, it assumes that all attributes play the same role in calculating the distance between objects; that is, each attribute has a weight equal to 1. However, in many practical cases, the impact of different attributes on distance calculation is likely to be different. erefore, it is necessary to assign a weight value to each attribute and through the size of the weight value to reflect the contribution of each attribute to distance calculation.

Weighted Overlap Distance.
Aiming at the problems existing in the traditional K-modes clustering algorithm, this paper proposes a new K-modes clustering algorithm WODKM based on weighted overlap distance. Different from the traditional K-modes clustering algorithm, the WODKM algorithm uses the initial class center selection algorithm Ini_Weight based on weighted density and weighted overlap distance proposed in Chapter 3 to select the initial center point. In addition, the WODKM algorithm uses a weighted overlap distance metric instead of a simple overlap distance metric when calculating the distance between objects and the distance between the object and the center point (Algorithm 1). e specific steps of the WODKM algorithm are in Algorithm 1.
In the worst case, the time complexity of steps (1) To evaluate the performance of the Ini_Weight algorithm, we experimented with the following four subtype UCI datasets. e basic information of the above four datasets is shown in Table 1.
We compare the Ini_Weight algorithm with four existing initial center selection methods: (1) random method [21]; (2) Cao's algorithm [22]; (3) Wu's algorithm [23]; and (4) Khan's algorithm. e main steps of the experiment are as follows: first, different initialization algorithms are used to select the initial center; secondly, K-modes clustering is performed based on the initial center selected in the previous step (using the classical K-modes algorithm proposed by Huang for clustering); finally, compare the clustering results corresponding to different initialization algorithms. Table 2 shows the K-modes clustering results for different initialization algorithms on the four datasets Soybean, Zoo, Breast, and Mushroom. ere is (1) From Table 2, we can see that the Ini_Weight algorithm performs better than the random method because the AC, PR, and RE of Ini_Weight outperform the random method on the three datasets of Soybean, Breast, and Mashroom. On Zoo, although Ini_Weight's PR is slightly worse than the random method, it is significantly superior to the random method in AC and RE. In addition, the clustering results obtained by random methods are different each time, and our method can get stable clustering results. Although Cao's method and Wu's method are also superior to random methods, Ini_Weight is superior to both methods. On Soybean, Breast, and Mashroom, Ini_Weight yields AC, PR, and RE higher than or equal to these two methods. On Zoo, although Ini_Weight's PR is better than Cao, Wu is low, but it is higher than those 2 methods on AC and RE. In addition, Ini_Weight performs better than Khan's because on any data set, Ini_Weight yields higher AC, PR, and RE than Khan's.
We can also see that Ini_Weight is always higher than other methods on RE.
is shows that our method can strictly control the objects in a cluster without being wrongly assigned to other clusters. In addition, our method produces poorer PR on Zoo than Cao, Wu, and random methods, but better on AC. is is because PR and AC are two different performance metrics. In the clustering results obtained by Ini_Weight, one of the clusters has zero accuracy (i.e., no objects in Zoo are allocated to this cluster correctly), so that the PR obtained by Ini_Weight is lower than that by Cao, Wu, and random methods. However, although the accuracy of the cluster is zero, the number of objects belonging to this cluster in Zoo is very small, so Ini_Weight is still higher on AC than Cao, Wu, and random methods. We also compared the accuracy of several algorithms, as shown in Figure 4.

Application of K-Modes Clustering Algorithm Based on Weighted Overlap Distance in Intrusion Detection
We used the network test environment shown in Figure 5 for testing. Under normal circumstances, Host 5 uses Tcpdump to collect 10 minutes of network packets every 1 h. e data collection was performed six times in succession. e first five sets of data were merged to form the standard normal behavior set dataNormal, and the sixth set of datasets was the normal behavior set dataNatural. en the distributed denial of service attack tool trino is installed on host Host 1, and trino uses master to control host Host 2 and host Host 3 to launch an attack on host Host 5. e network packet of 10 min is also collected to form the abnormal behavior set data Abnoraml (Algorithm 2). e clustering process is shown in Figure 6. During the experiment, we used four attack types. For each of the four types of attacks, random and nonreturning are performed according to different proportions. Each type of attack extracts a certain amount accordingly, and the corresponding category attribute is also removed. e attack datasets of D, P, R, and U are, respectively, obtained, and the attack records of DOS, Probe, R2L, and U2R types are, respectively, stored in D, P, R, and U. e proportion of the four attack types corresponding to the extraction and the corresponding number of records are shown in Table 3.
In the experiment, we set the threshold μ to the proportion of abnormal behavior in the entire training set. In order to verify the performance of the UIDM_WODKM (2) Calculate WDens (x) for any x ∈ U; (3) Select the object y with the largest weighted average density from U as the first initial center, and C � C{y}; (4) If |C| < k, go to step (5), otherwise go to step (10); (5) Assume that C � {c1, c2, ..., cq}, repeated for any x ∈ U -C (5.1) Calculate the weighted overlapping distance wd (x, ci) of x and ci, where ci ∈ C, 1 ≤ i ≤ q; (5.2) Calculate Pos_ Center (x); (6) Select the object y that is the most likely to be the initial center from U-C as the new initial center. And let C � C{y}; (7) If |C| < k, go to step (5), otherwise go to step (8); (8) Return k initial centers in C.
ALGORITHM 1: e initial center selects the Ini_Weight algorithm.    Scientific Programming model in intrusion detection, we compared it with the FCM clustering algorithm and the clustering algorithm proposed by Wu. e FCM algorithm is a fuzzy clustering algorithm based on the objective function, which uses Euclidean distance to measure the similarity of objects. e specific experimental results are shown in Figure 7. In Figure 5, DR (detection rate) represents the detection rate, and FPR (false positive rate) represents the false alarm rate. By comparing the detection results of the three intrusion detection algorithms listed in the table, the following conclusions can be drawn. e UIDM_WODKM model is better for Probe, U2R, and R2L attacks than the FCM and Wu algorithms. e detection effect of the DOS attack is worse than the FCM algorithm, but it is better than the Wu algorithm. In addition, the overall performance of the UIDM_WODKM model is better than the traditional FCM algorithm and Wu algorithm, especially the UIDM_WODKM model has a much lower false alarm rate than the FCM algorithm. Since Wu's algorithm does not analyze the false alarm rate, we cannot compare it. Output: A collection of k clusters.
Step 1: according to the previously proposed Ini_Weight algorithm, select k objects from the universe U as the initial center point: z1, ..., zk.
Step 2: calculate the weighted overlap distance between each object x in U and each center point zi, and divide the object x into the cluster represented by the nearest center point.
Step 3: for each current cluster c, recalculate the center point of c based on the frequency of the object's value on each attribute in c.
Step 4: repeat steps 2 and 3 until the value of the objective function does not change. After each iteration, let t ++.
Step 5: return the clustering results. ALGORITHM 2: e clustering algorithm used in the experiment.

Conclusion
In this paper, a new distance metric and density metric are proposed based on the concepts of attribute importance and rough entropy in rough set theory, the weighted overlap distance and the weighted average density are used, and an initial class center selection algorithm Ini_Weight based on the weighted density and the weighted overlap distance is proposed [21]. e Ini_Weight algorithm takes full account of the different importance of each attribute so that important attributes are given a larger weight, and unimportant attributes are given a smaller weight. e Ini_Weight algorithm distinguishes the influence of different attributes on the clustering result by the weight of the weight, thus solving the problem of the traditional K-modes algorithm in the initial center point selection. We have carried out experiments on multiple subtype UCI datasets.
e experimental results show that the performance of the Ini_Weight algorithm is better than the existing initial center point selection method, which effectively improves the accuracy of the initial center point selection. [24] Data Availability All data are available from the corresponding author upon request.   Scientific Programming