Network Intrusion Detection with Threat Agent Profiling

With the increasing use of computer systems and computer networks, intrusion detection has become an important issue in network security. In this paper, we discuss approaches that simplify the network administrator’s work. We applied clustering methods to security incident profiling, considering the K-means, PAM, and CLARA clustering algorithms. For this purpose, we used data collected in the Warden system from various security tools. We do not aim to differentiate between normal and abnormal network traffic; instead, we focus on grouping similar threat agents based on attributes of security events. We suggest a case of a fine classification and a case of a coarse classification and discuss the advantages of both.


Introduction
In the information age, rapid development exposes networks, network services, and network users to cyber threats such as malware, data breaches, phishing, and social engineering. These threats must be identified before organizations or users lose any data or reputation. Nowadays, attackers use advanced methods, tools, and approaches to avoid detection, such as IP address spoofing, encrypted payloads, and exploiting human failure.
The aim of any administrator of network services is to monitor, collect, and analyse network traffic, users' activities, and system logs. These activities have become fundamental to guarding against cyber threats and ensuring cybersecurity. They are part of the measures that ensure the integrity, availability, and confidentiality of networks, network services, and network users.
Conventional approaches rely on cyber defence systems, which can be defined as security mechanisms that monitor, track, and block malicious network activities or cyberattacks [1]. Examples of these defence systems are firewalls, authentication tools, and detection systems.
Detection tools of cyber defence systems capture security events from the logs of information sources. Security events can be defined as "a low level entity (e.g., TCP packet, system call, and syslog entry) from which an analysis is performed by a security tool" [2]. Depending on their origin, there are host-based security events (e.g., user's computer) and network-based security events (e.g., network devices and NetFlow probes).
One of the most widely used cyber defence systems is the intrusion detection system (IDS). An IDS can be defined as "a defense system, which detects hostile activities or exploits in a network" [3]. There are three main types of IDS based on the detection method used [4,5]. Signature-based (misuse-based) IDS uses signatures of known attacks (a priori knowledge of attacks). It is effective for detecting known types of attacks without generating an overwhelming number of false alarms [3]. The second type is anomaly-based IDS, which monitors the normal behaviour of the network and system and identifies any deviations from it [6]. The last type is hybrid IDS, which combines misuse and anomaly detection. The standard architecture of hybrid IDS consists of "an anomaly detection module, a misuse detection module, and a decision module combining the results of the two detection modules" [3].
The Intrusion Detection Working Group defined a general IDS architecture based on four types of functional modules. These modules are shown in Figure 1 and are as follows [6]:
(i) Event modules are made of sensor elements monitoring the target system and acquiring information events.
(ii) Database modules store information from event modules.
(iii) Analysis modules analyse events and detect potential hostile behaviour, generating an alarm if necessary.
(iv) Response modules execute response to prevent any detected intrusion if it occurs.
Event modules receive an overwhelming amount of data from the monitored environments. The aim of analysis modules is to process the data in a way that simplifies the work of network administrators. This can be achieved by automating activities in the response modules or by allowing administrators to focus only on relevant events.
One solution is to profile network traffic and incidents recorded in event modules. A profiling module, as a part of the analysis modules, can be defined as a module that groups similar network connections, events, or activities and searches for dominant behaviour using various types of algorithms [1]. Profiling is usually used to distinguish between normal and abnormal network traffic [7]. The workflow of the profiling module is shown in Figure 2. It consists of four steps [1].
Researchers outlined two of the largest problems in security profiling [1]:
(i) The huge amount of data and the difficulty in detecting patterns in the data and in the learned patterns
(ii) Visualization ability, which can strengthen the role of security profiling in security administration
In this paper, we focus on the behaviour of threat agents. A threat agent can be defined as "a system entity that performs a threat action or an event that results in a threat action" [8]. The main aim of this paper is to analyse the profiling of security events based on data collected by security sensors. This profiling is closely associated with the prediction of threat agent behaviour and of the attacks themselves. The prediction also helps to protect organizations, since administrators are better informed and can be better prepared for security incidents in their organization. We focus only on clustering methods. To formalize the scope of our work, we state the following research questions:
(i) Analysis of security events' attributes for threat agent profiling
(ii) Analysis of profiling of threat agents based on clustering of security events' attributes
This paper is organized into five sections. Section 2 reviews published research related to clustering methods in cybersecurity and profiling in cybersecurity. Section 3 outlines the methodology of data collection, data preprocessing, and clustering methods. Section 4 presents the results of the analysis and discusses the important points. The last section contains conclusions and our suggestions for future research.

Related Works
This section presents related work carried out by various researchers and research groups. As the paper addresses profiling in the cybersecurity area and uses clustering methods for profiling, we divide related work into two categories:
(i) Clustering methods in cybersecurity
(ii) Profiling in cybersecurity
Clustering is often used in intrusion detection systems to decide whether traffic is normal or anomalous. One of the most used algorithms is K-means. Münz et al. [9] applied the K-means clustering algorithm to feature datasets extracted from flow records. Training data is divided into clusters of time intervals of normal and anomalous traffic. Li and Wang [10] proposed an improved clustering algorithm based on a study of the traditional K-means clustering algorithm. Their experiments showed that the new algorithm could significantly improve the accuracy of data classification and detection efficiency.
Ranjan and Sahoo [11] described a new way of intrusion detection using the K-medoids clustering algorithm and certain modifications of it. The algorithm specified a new way of selecting initial medoids and proved to be better than K-means for anomaly intrusion detection. The proposed approach overcomes several disadvantages of the existing algorithm, mainly the dependency on initial centroids, the dependency on the number of clusters, and irrelevant clusters. Eslamnezhad and Varjani [12] proposed a new detection method based on the MinMax K-means clustering algorithm, which overcomes the sensitivity of the K-means algorithm to initial centers and increases the quality of clustering.
To overcome the disadvantages of misuse detection and anomaly detection, hybrid methods are used. Several papers apply hybrid methods that combine K-means with other techniques. Hybrid classifiers can provide improved accuracy but have a complex structure and high computational cost. Varuna and Natesan [13] introduced a new hybrid learning method, which integrates K-means clustering and Naive Bayes classification. Muda et al. [14] proposed a hybrid learning approach combining K-means clustering and Naive Bayes classifiers. Their approach was evaluated using the commonly used KDD Cup'99 benchmark dataset. The basic idea is to separate potential attacks from normal instances into different clusters during a preliminary stage. Subsequently, the clusters are further classified into more specific categories, namely, Probe, R2L, U2R, DoS, and Normal. Elbasiony et al. [15] introduced data-mining-based network intrusion detection systems. Two data-mining techniques are used in misuse, anomaly, and hybrid detection. First, the random forests algorithm is used as a data-mining classification algorithm for misuse detection. Second, the K-means algorithm is used as a data-mining clustering algorithm in a proposed unsupervised anomaly detection method. Third, the random forests algorithm is used with the weighted K-means algorithm to build a hybrid framework that overcomes the drawbacks of both misuse detection and anomaly detection.
An important research topic in applications of clustering methods is the outlier problem. Several authors [16][17][18] tried to answer the question of which outliers are anomalies. Liao and Vemuri [17] use the Euclidean distance to define the membership of data points to a given cluster. Breunig et al. [18] state that some detection proposals associate a certain degree of being an outlier with each point.
Clustering methods are also important for profiling in cybersecurity based on the behaviour of IP hosts and for anomaly detection. Jakalan et al. [19] focused on IP hosts from the perspective of their communication behaviour patterns. They created behaviour profiles of the observed IP nodes by clustering hosts into groups with similar communication behaviour. They used the DBSCAN clustering algorithm and identified the 14 most important features for representing host communication behaviour patterns (e.g., number of peers, duration of flow, and number of sent SYN-ACK packets). Erman et al. [20] evaluated two different clustering algorithms, K-means and DBSCAN, for the network traffic classification problem. Their analysis was based on each algorithm's ability to produce clusters that have a high predictive power of a single traffic class and each algorithm's ability to generate a minimal number of clusters that contain most of the connections. They compared these clustering algorithms to the AutoClass algorithm. The results showed that the DBSCAN algorithm produces the best overall accuracy. Marchette [7] focused on clustering computers into groups that tend to have similar activity profiles. In the paper, two clustering methods were used: K-means and the method of Cowen and Priebe. Xu et al. [21,22] focused on clustering of hosts in the same IP prefixes. They used bipartite graphs to represent hosts' communications in network traffic and described a spectral clustering algorithm for automatic discovery of behaviour clusters in network prefixes based on hosts' communications.

Methodology
This part of the paper describes the input data and the way they were analysed. We followed the workflow of the profiling module, according to which we also divided this section.

Data Collection.
For the purposes of this research, data were collected over 2 weeks (from 2017-03-16 to 2017-03-31) by the Warden system [26]. Warden is a part of the CESNET Large Infrastructure project. It is designed as a multiclient queue and enables security teams to efficiently exchange information on detected events (threats) from honeypots, intrusion detection systems, network threat probes, and even external sources. The scheme of the Warden system is shown in Figure 3.
Collected data contain approximately 72 million records from various data sources. Table 1 shows the significant sources of collected data and the amount of data collected by each source.
Version 3 of Warden uses a flexible and descriptive event format based on JSON, the Intrusion Detection Extensible Alert (IDEA) format [27]. IDEA is a descriptive data model using a key:value format and JSON structure. The IDEA format is defined as a tree of key:value pairs with a maximum of 2 levels. This allows for just one basic level of indirection when represented in relational models (save for arrays) and avoids the lack of predictability and discoverability of multiple-level or recursive schemes. The keys "Format," "ID," "DetectTime," and "Category" are mandatory. The rest of the keys are optional [28]. The keys that are significant for our research are stated in Table 2.
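As a minimal sketch of the mandatory-key rule stated above, the following Python snippet checks a JSON-encoded event for the four mandatory keys. The function name and every value in the sample event are hypothetical and serve only as an illustration; they are not taken from the collected data.

```python
import json

# Keys the text lists as mandatory in an IDEA event.
MANDATORY_KEYS = ("Format", "ID", "DetectTime", "Category")


def missing_mandatory_keys(raw_event: str) -> list:
    """Return the mandatory IDEA keys that are absent from a JSON-encoded event."""
    event = json.loads(raw_event)
    return [key for key in MANDATORY_KEYS if key not in event]


# Hypothetical event: all values below are invented for illustration.
example = json.dumps({
    "Format": "IDEA0",
    "ID": "00000000-0000-0000-0000-000000000000",
    "DetectTime": "2017-03-16T10:00:00Z",
    "Category": ["Recon.Scanning"],
    "Source": [{"IP4": ["192.0.2.1"], "Proto": ["tcp"]}],
    "Target": [{"IP4": ["198.51.100.7"], "Port": [22]}],
})

print(missing_mandatory_keys(example))
```

An event that passes this check may still be invalid in other respects; the sketch covers only the presence of the mandatory keys.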

Preprocessing.
An analysis of the data collected from the Warden system is difficult without transformation. For this reason, the data had to be preprocessed. Each record from Warden stands for a security event. We consider the IP address as a threat agent. In the context of this paper, a threat agent is therefore a specific system entity with a public IP address, or several system entities of the same private network subnet using that public IP address to communicate with other devices on the Internet (e.g., using NAT), that performs a threat action.
For easier processing, the data were stored in a PostgreSQL database [29]. We selected this storage because PostgreSQL works very effectively with the JSON format: it accesses individual values directly, without additional string parsing. The data were stored in a table with 2 columns, ID and IDEA data, where the values of the IDEA data column are in the IDEA format.
From those data, a table with 12 columns was created by transformation. Each column has its own data type, which makes it easier to perform specific operations, for example, numerical operations that were not possible directly on the JSON format. The columns in this table represent the following properties: ID, source IP address, target IP address, category, category count, protocol, protocol count, port, duration, start timestamp, end timestamp, and ISP. However, this table contains attacks, not threat agents; therefore, another transformation was needed. This transformation consists in merging records with the same source IP address, thus creating one entry per threat agent.
In the final input for clustering, every threat agent is represented by a 41-element vector. The first 22 elements relate to the types of attacks the threat agent performed: for every type, there is a number stating how many times the threat agent performed that type of attack. Of the remaining values, the first 12 relate to the protocols used by the threat agent, counted in the same manner as the attack types. The 13th value expresses how many times the threat agent attacked from a port in the range 0-1023, and the 14th value expresses how many times it attacked from a port in the range 1024-65535. The remaining attributes are the following: overall duration of the threat agent activity, maximal idleness between two subsequent attacks of the threat agent, minimal idleness between two subsequent attacks of the threat agent, number of different networks targeted by the threat agent (determined from the ISP of the target IP address), and number of different targets.
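The merge of per-event records into one entry per threat agent can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: the field names (src, dst, category, start, end) are hypothetical, times are plain seconds, idleness is taken as the gap between the end of one event and the start of the next, and the ISP count is left out because it needs an external lookup.

```python
from collections import defaultdict


def profile_agents(events):
    """Merge per-event rows into one attribute dict per source IP.

    Each event is a dict with hypothetical field names:
    src, dst, category, start and end (times in seconds).
    """
    by_agent = defaultdict(list)
    for event in events:
        by_agent[event["src"]].append(event)

    profiles = {}
    for src, rows in by_agent.items():
        rows.sort(key=lambda r: r["start"])
        counts = defaultdict(int)
        for r in rows:
            counts[r["category"]] += 1
        # Idle time: gap between the end of one event and the start of the next.
        gaps = [max(0, b["start"] - a["end"]) for a, b in zip(rows, rows[1:])]
        profiles[src] = {
            "category_counts": dict(counts),
            "duration": sum(r["end"] - r["start"] for r in rows),
            "max_idle": max(gaps) if gaps else 0,
            "min_idle": min(gaps) if gaps else 0,
            "unique_targets": len({r["dst"] for r in rows}),
        }
    return profiles


# Invented sample events for two source addresses.
events = [
    {"src": "192.0.2.1", "dst": "198.51.100.7", "category": "Recon.Scanning", "start": 0, "end": 10},
    {"src": "192.0.2.1", "dst": "198.51.100.8", "category": "Recon.Scanning", "start": 40, "end": 45},
    {"src": "192.0.2.1", "dst": "198.51.100.7", "category": "Availability.DDoS", "start": 50, "end": 70},
    {"src": "203.0.113.9", "dst": "198.51.100.7", "category": "Recon.Scanning", "start": 5, "end": 6},
]
profiles = profile_agents(events)
```

An agent with a single event has no gaps, so the sketch reports zero for both idleness values; how such agents were treated in the study is not stated in the text.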
For a statistical analysis, we can only exploit information in attributes that attain nonzero values (attribute reduction). In our case, all category types except Recon.Scanning and Availability.DDoS have only zero values. The same holds for all protocols except TCP and UDP. Both groups of ports also have exclusively zero values.
After data transformation and attribute reduction, for each threat agent (IP address), we consider four categories of attributes:
(i) Type of security event is based on the value of the key "Category" in the IDEA format. In the collected data, we consider only two categories: Recon.Scanning and Availability.DDoS.
(ii) Communication-related data is based on the values of the keys "Source:Port," "Source:Proto," "Target:Port," and "Target:Proto" in the IDEA format. In the collected data, these data are identical to the previous category. For this reason, they are omitted in the analysis.
(iii) Temporal-related data is based on the values of the keys "EventTime" and "CeaseTime" in the IDEA format.
(iv) Spatial-related data is based on the values of the key "Target:IP4" in the IDEA format. In the collected data, we consider the number of different targets and the number of Internet service providers.
Vectors representing a threat agent consist of the following attributes.
IP address of the threat agent corresponds to the key "Source:IP4" in the IDEA format. Because of privacy issues, we omitted the IP address from the vector of the threat agent.
Recon.Scanning is a category of security event that corresponds to the key "Recon.Scanning" in the IDEA format. Availability.DDoS is a category of security event that corresponds to the key "Availability.DDoS" in the IDEA format.
The timeline of all events for a threat agent can be seen in Figure 4. On one hand, $t^{E}_{a,i}$ (EventTime) is the start of security event $i$ associated with threat agent $a$. On the other hand, $t^{C}_{a,i}$ (CeaseTime) is the end of security event $i$ associated with threat agent $a$.
Time of event is the time between the start and the end of a security event, $d_{a,i} = t^{C}_{a,i} - t^{E}_{a,i}$.
Duration is the sum of all times of events for the threat agent, $D_{a} = \sum_{i} d_{a,i}$.
Max. idleness is the maximum of all time periods between security events (time of inactivity) for threat agent $a$.
Min. idleness is the minimum of all time periods between security events (time of inactivity) for threat agent $a$.
ISP count is the number of unique networks recorded for the threat agent (IP address) according to Internet service providers (ISP). It was collected using the IP-API service [30], which provides spatial data about an IP address and its ISP. Unique targets is the number of unique targets (hosts with an IPv4 address) attacked by the threat agent. The relationship between ISP count and unique targets can be expressed as ISP count ≤ unique targets.
[32], grid-based clustering methods [33], model-based clustering methods [34], categorical or mixed data clustering methods [35,36], fuzzy clustering methods [37], and others. Some clustering approaches can be sensitive to outliers, so robust modifications [38] have been developed. The general process of partition-based clustering [39] is typically iterative. The first step defines or chooses a predefined number of cluster representatives, and the second step updates the representatives after each iteration if the measure of clustering quality (objective function) has improved. In our research, we decided to use partitioning methods because of their many advantages [40].
First, most of the partitioning methods (moving centres, K-means, K-modes, and K-prototypes) have low computational complexity [40]. Therefore, they can be applied to large volumes of data. Furthermore, the number of iterations needed to minimize the within-cluster sum of squares is generally small, making these methods even more suitable for such applications.
The second advantage [40] is that, unlike hierarchical methods, in which the clusters are not altered once they have been constructed, the reassignment algorithms constantly improve the quality of clusters. Thus, the quality of clusters can quickly reach a high level when the (spherical) form of the data is suitable.
Third, there is the benefit of an easy and intuitive interpretation, in particular in our application. The partitioning methods we employ have uniquely defined representatives. This property is desirable when we want to characterize specific groups of threat agents.
Partitioning methods are not ideal in all aspects, and it is good to be aware of some drawbacks when implementing them. First, the final partition depends greatly on the more or less arbitrary initial choice of the centres. Consequently, we do not obtain a global optimum but simply the best possible partition based on the starting partition. A solution could be to run the clustering algorithm several times with different initial cluster centers. The run with the best value of the clustering quality measure (objective function) is selected as the final clustering solution, which reduces the risk of being stuck in a poor local optimum.
Another challenge [41] is to specify the optimal number of clusters. A solution could be to run the clustering algorithm for a range of $k$ values and then choose the best $k$ by comparing the clustering results obtained for the different $k$ values. We employ some popular criteria to help us choose the optimal number of clusters. They are mentioned in the text below.
We choose three widespread partitioning clustering methods [31,39,42] for our purpose: K-means, PAM (Partitioning Around Medoids), and CLARA (Clustering LARge Application). In the following paragraphs, we introduce the main ideas behind these well-known methods.
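The K-means algorithm [39,41,43], one of the most used clustering algorithms, searches for a partition of a given set of $n$ numeric objects $X = \{X_1, X_2, \ldots, X_n\}$ into $k$ (given) clusters, which minimizes the within-groups sum of squared errors. This process is often formulated [44] as the following mathematical program problem $P$:
$$\text{minimize } P(W, Q) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{i,l}\, d(X_i, Q_l) \quad \text{subject to } \sum_{l=1}^{k} w_{i,l} = 1,\ 1 \le i \le n, \quad w_{i,l} \in \{0, 1\},$$
where $W$ is an $n \times k$ partition matrix, $Q = \{Q_1, Q_2, \ldots, Q_k\}$ is a set of objects in the same object domain, and $d(\cdot, \cdot)$ is the squared Euclidean distance between two objects.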
This optimization problem is solved iteratively [41]. The algorithm starts by randomly selecting $k$ objects from the dataset to serve as the initial centers for the clusters. The selected objects are also known as cluster means or centroids. Next, each of the remaining objects is assigned to its closest centroid, where closeness is based on the Euclidean distance between the object and the cluster mean. After that, the algorithm computes the new mean value of each cluster. When the centers have been recalculated, each observation is checked again to see if it might be closer to a different cluster. All objects are reassigned again using the updated cluster means. These steps repeat until the clusters formed in the current iteration are the same as those obtained in the previous iteration.
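The iteration described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the analysis; the function names, the seed parameter, and the rule of keeping the old centroid when a cluster becomes empty are assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    """Assign and update until the assignment stops changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centers
    labels = None
    for _ in range(max_iter):
        # Assignment step: each object goes to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # same clusters as in the previous iteration
            break
        labels = new_labels
        # Update step: recompute each cluster mean.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # assumption: keep the old centroid if a cluster is empty
                centroids[c] = mean(members)
    return labels, centroids


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = k_means(points, k=2)
```

The sketch performs a single run from one random start; in practice the run would be repeated with several seeds and the partition with the smallest within-cluster sum of squares kept.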
The second algorithm we consider is PAM [39][40][41][45]. The goal of this clustering method [40] is to find $k$ representative objects (medoids among the observations of the dataset) of clusters which minimize the sum of the dissimilarities of the observations to their closest representative object. A medoid is a representative of a cluster, chosen as its most central object. The centrality is tested by systematically swapping one representative with another object of the population chosen at random to see whether the quality of the clustering increases, in other words, whether the sum of the distances of all the objects from their representatives decreases. The algorithm stops when no further swap improves the quality.
The PAM algorithm is known to be more robust to outliers than the K-means algorithm, which follows from the principle of the algorithm: representatives are actual observations rather than means. Its complexity can be considered its main disadvantage.
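The swap test behind PAM can be illustrated with the following sketch. It is a simplified variant, not the reference algorithm: it tries every (medoid, non-medoid) swap exhaustively instead of drawing candidates at random, it omits the usual build phase, and the function names and the seed parameter are assumptions made for the example.

```python
import random
from itertools import product


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def total_cost(points, medoids):
    """Sum of distances from every point to its closest medoid."""
    return sum(min(euclidean(p, points[m]) for m in medoids) for p in points)


def pam(points, k, seed=0):
    """Swap medoids with non-medoids while the total cost decreases."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)  # indices of initial medoids
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for slot, candidate in product(range(k), range(len(points))):
            if candidate in medoids:
                continue
            trial = medoids[:slot] + [candidate] + medoids[slot + 1:]
            cost = total_cost(points, trial)
            if cost < best - 1e-12:  # accept only strict improvements
                medoids, best, improved = trial, cost, True
    return sorted(medoids), best


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
medoid_indices, cost = pam(points, k=2)
```

Every cost evaluation touches all points, and every pass tries all swaps, which makes the computational burden mentioned above visible even in this small sketch.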
To reduce the computing time and RAM storage problem, one can use a modification of the PAM algorithm, namely, the CLARA algorithm [39][40][41][45]. The main idea behind this method [39] is that, instead of taking the whole set of data into consideration, the CLARA algorithm randomly chooses a small portion of the actual data as a representative of the data. Medoids are then chosen from this sample using PAM. If the sample is selected in a fairly random manner, it should closely represent the original dataset. CLARA draws multiple samples of the dataset, applies PAM to each sample, finds the medoids, and then returns its best clustering as the output.
Choosing the best clustering method for given data can be a challenging task for an analyst [41,46]. Therefore, one has to employ measures that compare multiple clustering algorithms simultaneously. In combination with external facts, they help to choose the best performing clustering method with the optimal number of clusters. We follow this approach in our analysis.
More precisely, we compute internal measures [41,47,48] and stability measures [41,47]. Internal measures use intrinsic information in the data to assess the quality of the clustering. As the goal of clustering is to aggregate similar objects within the same cluster and distinct objects in different clusters, internal measures are mostly based not only on compactness and separation of the groups but also on connectivity (see [41,47,48] for more details). To internally validate our choice of the clustering algorithm, we calculate the connectivity, the silhouette coefficient, and the Dunn index in the analysis. Higher values of these measures are desirable, with the exception of connectivity, which should be minimized.
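Of the internal measures named above, the Dunn index is the simplest to write down: it is the ratio of the smallest distance between observations in different clusters to the largest distance between observations in the same cluster. The sketch below is an illustration only (the analysis itself relies on an existing package); the function names are assumptions, and the input must contain at least two clusters.

```python
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def dunn_index(points, labels):
    """Smallest between-cluster distance divided by largest cluster diameter.

    Requires at least two distinct labels.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    groups = list(clusters.values())
    separation = min(
        euclidean(p, q)
        for a, b in combinations(groups, 2)
        for p in a
        for q in b
    )
    diameter = max(
        (euclidean(p, q) for g in groups for p, q in combinations(g, 2)),
        default=0.0,
    )
    return float("inf") if diameter == 0 else separation / diameter


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good = dunn_index(points, [0, 0, 0, 1, 1, 1])
bad = dunn_index(points, [0, 1, 0, 1, 0, 1])
```

A partition that keeps the two visible groups apart scores far higher than one that mixes them, which is the sense in which higher values are desirable.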
Stability measures, a special version of internal measures, evaluate the consistency of a clustering result by comparing it with the clusters obtained after each variable (column) is removed, one at a time. In our analysis, we included the following stability measures: the average proportion of nonoverlap (APN), the average distance (AD), and the average distance between means (ADM) (see [41,47] for more details). The values of APN and ADM lie in [0, 1], whereby smaller values represent highly consistent clustering results. The value of AD lies in [0, ∞) and smaller values are also preferred. These measures for comparing clustering algorithms are implemented in the clValid package [47], which was very helpful in our clustering analysis.
We also used popular approaches such as the elbow method and the silhouette method [45,49] to help us determine the optimal number of clusters.
Notes (Table 3). Percentage is the concordance rate of the particular method with respect to the percentile method we use.
Moreover, in the final stage of our analysis, we implement clustering for a dataset without outliers and check the influence of such objects on our clustering approaches. Although there are various sophisticated techniques to cope with outliers [50] (e.g., clustering algorithms themselves can identify outliers in data (K-means [51], trimmed K-means [52], and DBSCAN [41])), we use a simple and intuitive approach based on percentiles. We identify an observation as an outlier if at least one of its characteristics has a value above the 99th percentile. We do not consider a lower cutoff point as there is a natural zero bound for each variable.
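The percentile rule above can be sketched as follows. The text does not say which percentile definition was used, so the nearest-rank definition in this example is an assumption, as are the function names; other definitions (for instance, with interpolation) may flag slightly different observations.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile (assumption: one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def flag_outliers(rows, q=99):
    """Indices of rows with at least one attribute above the q-th percentile."""
    cutoffs = [percentile(col, q) for col in zip(*rows)]
    return [
        i for i, row in enumerate(rows)
        if any(v > c for v, c in zip(row, cutoffs))
    ]


# Invented data: 200 observations with two attributes.
rows = [(i, 1) for i in range(1, 201)]
rows[0] = (1, 50)  # extreme value in the second attribute only
outliers = flag_outliers(rows)
```

Only an upper cutoff is applied, in line with the natural zero bound of each variable noted above.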
For a comparison with the percentile method described above, we investigate other common methods to identify outliers:
(1) Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [53]
(2) The Invariant Coordinate Selection (ICS) [54]
(3) Local Outlier Factor (LOF) [18]
In Table 3, we report the concordance rate for outliers identified by the above-mentioned methods with respect to the percentile method. The rate can be interpreted as the fraction of outliers identical to those classified by the percentile method.
There is quite good agreement among these methods in identifying outliers. Consequently, the clustering procedures deliver very similar results after removing outliers from the data. Based on this finding, we use the percentile method in the following computations because it is easy to use and not very time-consuming in contrast to the other methods.

Results and Discussion
First, variables in the dataset must be scaled to obtain comparable weights of individual variables in the clustering algorithm. We employed one of the most widespread scaling approaches, scaling by the range. Let $X_j$ be the $j$-th variable (column) and let $x_{ij}$ be its $i$-th element in our dataset. Let $n$ be the number of objects (rows) and let $p$ be the number of variables (columns) in our dataset. Finally, let us denote by $z_{ij}$ the transformed (scaled) data point. Then, for all $i \in \{1, 2, \ldots, n\}$ and $j \in \{1, 2, \ldots, p\}$, we proceed with the following scaling:
$$z_{ij} = \frac{x_{ij} - \min X_j}{\max X_j - \min X_j}.$$
Before applying a clustering method to any dataset, it is important to assess its clustering tendency. In other words, one needs to detect whether the dataset contains meaningful clusters (i.e., a nonrandom structure) or not. If a nonrandom structure is found, the next task is to determine the number of clusters.
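Scaling by the range, as introduced at the start of this section, can be sketched in plain Python. The sketch assumes the common min-max form, in which each column is shifted by its minimum and divided by its range; mapping a constant column to zero is an additional assumption made to avoid division by zero.

```python
def scale_by_range(rows):
    """Min-max scale every column of a list of equal-length rows to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [max(c) - min(c) for c in columns]
    return [
        # Assumption: a constant column (span 0) is mapped to 0.0.
        tuple((v - lo) / span if span else 0.0
              for v, lo, span in zip(row, lows, spans))
        for row in rows
    ]


scaled = scale_by_range([(1, 10), (3, 10), (5, 10)])
```

After this transformation every non-constant variable spans the interval [0, 1], so no single attribute dominates the distance computations because of its units.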
For a specific dataset, the best way is to start with data visualization. In our case, we have multidimensional data that cannot be displayed exactly in their full range. We need to reduce their dimension, for example, by using principal components. We can then obtain an approximate data visualization. For such visualization, we used the factoextra package [55].
In Figure 5, we can observe that the data are largely explained by the first two components. A two-dimensional projection explains more than 90% of the entire variation in the data. In what follows, we aim to better understand the data structure.
For the purpose of assessing the clustering tendency of our data, we calculated the Hopkins statistic [56], which is implemented in the clustertend package [57]. It assesses the clustering tendency of a dataset by measuring the probability that the dataset is generated by a uniform data distribution. Simply said, it tests the spatial randomness of the data. In our case, the value of the Hopkins statistic is 0.0031, which means [41] that our dataset is highly clusterable.
As the initial results indicate the existence of clusters in our data, we proceed with searching for the best method and the optimal number of clusters. We consider the three clustering methods discussed in the previous section, K-means [58], PAM [59], and CLARA [59], and employ the internal and stability measures to assess how appropriate their use is.
Figure 6 shows the values of internal measures for different clustering methods and different numbers of clusters. The number of clusters ranges from 2 to 7, as 7 is taken as the maximum reasonable number of clusters for a classification based on the seven variables in the dataset. Figure 7 reports the corresponding results for stability measures. Furthermore, we consider the elbow method and plot the total within-cluster sum of squares in Figure 8.
Based on the three figures, we can make several observations. First, all internal measures prefer K-means with two clusters (searching for the minimal value of the connectivity measure and the maximal value of the two others). Second, the elbow method suggests using two clusters, indicated by a strong decline at this value. Third, the stability measures do not provide a uniform answer to the questions of which method and which number of clusters are optimal. However, there is a strong pattern across all of them: the stability measures prefer more clusters. Moreover, PAM seems to be least sensitive to the choice of stability measure. Therefore, in addition to K-means with two clusters for a coarse classification, we also implement PAM with 7 clusters for a finer classification.
The internal and stability measures provide guidance on which method (from the set of K-means, PAM, and CLARA) and which number of clusters (from two to seven) deliver the best properties. For example, the results from the initial diagnostics indicate that if we construct 7 clusters using K-means instead of PAM, the clustering will be unstable and uneven. In other words, the decomposition would not be representative. Moreover, CLARA seems to be less appropriate for both coarse and fine classifications. Therefore, we do not use it at any further stage of the analysis.
Overall, the initial diagnostics of the clustering methods and the optimal numbers of clusters support our view of using different levels of refinement in our classification strategy. Based on this, we decided to focus on three different approaches to the profiling module:
(1) One-stage profiling without analysis of outliers
(2) One-stage profiling with analysis of outliers
(3) Two-stage profiling analysis

One-Stage Profiling without Analysis of Outliers.
In the first approach, we use one-stage profiling with two clustering algorithms (K-means and PAM), which are applied independently of each other. The approach is depicted in Figure 9. Here we do not separate any threat agents as outliers; outliers are discussed with the second approach in the next section.
First, we construct two clusters with K-means to obtain a coarse classification of the threat agents. Second, we apply PAM with seven clusters to provide a finer classification and capture a greater variety of non-automated threat agents.
Table 4 gives an overview of the structure with two clusters. The first cluster is large and contains almost 89% of all threat agents. A representative of this cluster (last 7 columns) is characterized by attacking several targets of one ISP. At the same time, its behaviour is characterized by rather short breaks between single security events lasting about 5 hours. An interesting factor is the maximum idle time between the security incidents (about 140 minutes), which suggests that the threat agent does not come back to a particular network after a longer time period. The second cluster is smaller (about 11%) but seems to group more interesting types of threat agents. Threat agents in this cluster are characterized by a larger number of targeted devices in various ISPs. The Availability.DDoS attribute is discussed in more detail in the next subsection. The duration of security events is prolonged and there is a significant rise in the other values as well, which might suggest a longer period of activity for the threat agents. This suggests that we are not able to create appropriate security rules. For this reason, further analysis is needed (clustering with the PAM algorithm).
Notes to Table 4. The second and third columns report the number and percentage of threat agents in a specific cluster, respectively. The last seven columns correspond to the following characteristics: Recon.Scanning, Availability.DDoS, duration, max. idleness, min. idleness, number of ISPs, and number of unique targets.
For a better grasp of the clustering output, we also provide a visualization of the two clusters in two dimensions in Figure 10. We now proceed with an analysis of 7 clusters. The sizes of the individual clusters and the characteristics of their representatives are reported in Table 5. Based on them, we can give an interpretation of the members of each cluster.
The first cluster of threat agents (about 82%) is characterized by attacking one device at one ISP. These are short automated actions, as suggested by the low values of MaxIdleness and MinIdleness. The average duration of security events is 733 seconds (about 12 minutes). In our opinion, this cluster could represent hosts infected with malware that are controlled by command and control servers.
The second cluster of threat agents shows a very short attack duration. The minimal difference between the values of MaxIdleness and MinIdleness suggests that it is a short, automated attack. Unlike the previous cluster, these are security events at multiple devices in multiple ISPs. In this case, we suggest paying further attention to such security events as they do not play any role in aiding the defence of the network.
The third cluster of threat agents is characterized by security events targeted at multiple devices at multiple ISPs. Interestingly, these threat agents attacked each device only once (same values of Recon.Scanning and Targets) and at the same time have the highest value of MinIdleness. Given the other values (Duration and MaxIdleness), it can be concluded that this was a manual attack. These threat agents need to be dealt with further (not only by adding a firewall rule).
Notes to Table 5. The second and third columns report the number and percentage of threat agents in a specific cluster, respectively. The last seven columns correspond to the following characteristics: Recon.Scanning, Availability.DDoS, duration, max. idleness, min. idleness, number of ISPs, and number of unique targets.
The fourth and the seventh clusters of threat agents represent automated attacks, as indicated by the value of MinIdleness, which target multiple devices at multiple ISPs. The two groups differ in the values of Duration and MaxIdleness. Threat agents in the fourth cluster repeated their network scans, as indicated by the value of Recon.Scanning, but with a short attack duration. The high value of MaxIdleness might suggest the existence of a bot and its participation in several campaigns.
The threat agents in the fifth cluster scanned the target device only once (values of Recon.Scanning and Targets). The time values (Duration, MaxIdleness, and MinIdleness) suggest that it was a scan during one campaign or that it could be scanning of the IPv4 address space of countries (in our case the Czech Republic). We suggest treating these threat agents by adding a firewall rule. The threat agents of the sixth cluster behave similarly to those of the fifth cluster; the only difference is in the value of MinIdleness. Threat agents in this cluster are characterized by the largest number of targeted networks at the largest number of ISPs. In our opinion, it could be scanning of the whole IPv4 address space (e.g., by Shadowserver and censys.io). These are periodic automated scans that monitor the available devices on the Internet to discover new threats and assess their impact. It is beneficial to share the security events of these threat agents with other organizations and to find out whether a scanning service targeting the whole address space is involved; if not, a firewall rule should be added.

Figure 9: Scheme of profiling module with one-stage profiling without analysis of outliers.

For a better grasp of the clustering output, we also provide visualization of the seven clusters in two dimensions in Figure 11.

One-Stage Profiling with Analysis of Outliers.
In the second approach, we extend the analysis from the previous approach by one more layer. This approach is depicted in Figure 12. We treat very specific threat agents separately and suggest that an expert devote additional time to analysing them. We identify those threat agents as outliers. In statistics, outliers are specific objects that differ from the core of the dataset in some way. For our purpose, we consider an observation (a threat agent) to be an outlier if at least one of its characteristics has a value above the 99th percentile. Altogether, we found 173 outliers. Table 6 gives an overview of the structure with two clusters. Compared to Table 4, expelling the outliers had a larger impact on the number of individual Recon.Scanning events and on the number of different targets, whose values went down in both clusters. The number of different ISPs did not change. Another change is in the value of Duration, which is significantly lower in the clusters in Table 6. Interestingly, the ratio of the values between the two clusters stays the same.
Notes to Table 6. The second and third columns report the number and percentage of threat agents in a specific cluster, respectively. The last six columns correspond to the following characteristics: Recon.Scanning, duration, max. idleness, min. idleness, number of ISPs, and number of unique targets.
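The outlier rule stated above (at least one characteristic above the 99th percentile) can be illustrated with a short sketch. The nearest-rank percentile definition, the dictionary layout of the input, and the names `percentile` and `split_outliers` are assumptions made for this example; other percentile definitions would shift the thresholds slightly.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile (one of several common definitions; an assumption here)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def split_outliers(agents, q=99):
    """Separate threat agents that exceed the q-th percentile in any characteristic.

    agents: dict mapping an agent identifier to a tuple of numeric characteristics.
    Returns (regular agents, set of outlier identifiers, per-characteristic thresholds).
    """
    columns = list(zip(*agents.values()))
    thresholds = [percentile(column, q) for column in columns]
    outliers = {
        agent
        for agent, row in agents.items()
        if any(value > limit for value, limit in zip(row, thresholds))
    }
    regular = {agent: row for agent, row in agents.items() if agent not in outliers}
    return regular, outliers, thresholds
```

The regular agents would then be passed to the clustering step, while the outlier set is handed over for individual expert analysis.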
The first cluster contains almost 90.6% of all threat agents. A representative of this cluster (last 6 columns) is characterized by low values of MaxIdleness and MinIdleness. In this cluster, security events were recorded in one ISP against two different targets. Because the value of Recon.Scanning is higher than the number of unique targets, the threat agents attacked each device multiple times. The average duration of these events is 700 seconds (about 12 minutes).
As in the previous approach, the second cluster is smaller (about 9.5%) but seems to group more interesting types of threat agents. Threat agents in this cluster are characterized by a larger number of targeted devices in various ISPs. The duration of security events is prolonged and there is a significant rise in the other values as well, which might suggest a longer period of activity for the threat agents. In this case, too, we must conclude that we are not able to create appropriate security rules. For this reason, further analysis is needed (clustering with the PAM algorithm). For a better grasp of the clustering output, we also provide a visualization of the two clusters without outliers in two dimensions in Figure 13.
Further, we proceed with an analysis of 7 clusters. The sizes of the individual clusters and the characteristics of their representatives are reported in Table 7. Based on them, we can give an interpretation of the members of each cluster.
Compared to the first approach, the attributes of clusters 1, 2, 4, and 5 did not change. All clusters, with the exception of clusters 1 and 7, contain fewer threat agents. Small changes can be seen in clusters 3, 6, and 7. In cluster 7, the value of MinIdleness is negative, meaning that another security event was recorded before the previous one generated by these threat agents had finished. This might suggest that the threat agent's IP address is public and that several different hosts behind it participate in these security events.
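To illustrate how a negative MinIdleness can arise, the following sketch computes idle times for one threat agent. It assumes that idleness is the gap between the end of one security event and the start of the next, which matches the interpretation given above but is not a definition taken from the dataset; the function name and the `(start, end)` tuple layout are hypothetical.

```python
def idle_times(events):
    """Gaps between consecutive events of one threat agent.

    events: list of (start, end) timestamps in seconds.
    A negative gap means the next event started before the previous one ended.
    """
    ordered = sorted(events)
    return [
        next_start - previous_end
        for (_, previous_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


# Two overlapping events followed by a later one (made-up timestamps).
gaps = idle_times([(0, 100), (50, 120), (300, 310)])
min_idleness = min(gaps)
max_idleness = max(gaps)
```

In this made-up example the first two events overlap, so the smallest gap is negative, as observed for cluster 7.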
For a better grasp of the clustering output, we also provide visualization of the seven clusters without outliers in two dimensions in Figure 14.
Overall, we conclude that the analysis with outliers not only changed individual clusters but also revealed a group of threat agents that need to be analysed individually. Such a division does not impact the rules for individual clusters. With the K-means algorithm, 99.68% of threat agents are assigned to the same cluster whether clustering is done with or without outliers. With the PAM algorithm, the matching score is slightly lower but still sufficiently close to 100%. Because of this, we advise using profiling based on the analysis with outliers.
Notes to Table 7. The second and third columns report the number and percentage of threat agents in a specific cluster, respectively. The last six columns correspond to the following characteristics: Recon.Scanning, duration, max. idleness, min. idleness, number of ISPs, and number of unique targets.
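One possible way to compute such a matching score is sketched below. How the reported percentages were obtained is not described here, so this is an assumption: cluster numbers from two runs are arbitrary, and the sketch therefore maximises the agreement over all relabelings of one run by brute force, which is feasible for at most seven clusters. The function name is hypothetical.

```python
from collections import Counter
from itertools import permutations


def matching_score(labels_a, labels_b):
    """Percentage of objects placed in corresponding clusters by two clusterings.

    Cluster names need not agree between the runs; the best one-to-one
    correspondence between cluster names is searched exhaustively.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same objects")
    names_a = sorted(set(labels_a))
    names_b = sorted(set(labels_b))
    if len(names_a) < len(names_b):
        return matching_score(labels_b, labels_a)  # the score is symmetric
    overlap = Counter(zip(labels_a, labels_b))  # missing pairs count as 0
    best = max(
        sum(overlap[(a, b)] for a, b in zip(choice, names_b))
        for choice in permutations(names_a, len(names_b))
    )
    return 100.0 * best / len(labels_a)
```

When comparing runs with and without outliers, only the threat agents present in both runs can be compared, so the label lists would first be restricted to the non-outlier agents.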

Two-Stage Profiling Approaches.
We use two-stage profiling with two clustering algorithms (K-means and PAM).
Notes to Table 8. The second and third columns report the number and percentage of threat agents in a specific cluster, respectively. The last seven columns correspond to the following characteristics: Recon.Scanning, Availability.DDoS, duration, max. idleness, min. idleness, number of ISPs, and number of unique targets.
Notes to Table 9. The second and third columns report the number and percentage of threat agents in a specific cluster, respectively. The last six columns correspond to the following characteristics: Recon.Scanning, duration, max. idleness, min. idleness, number of ISPs, and number of unique targets.

Figure 15: Scheme of profiling module with two-stage profiling without analysis of outliers.
The K-means algorithm is used to split the threat agents into two clusters. The first cluster then remains unchanged and the second cluster is divided into 6 clusters using the PAM algorithm. As with the one-stage approaches, we consider two variants. The first is two-stage profiling without analysis of outliers (Figure 15), in which we do not separate any threat agents as outliers. The second is two-stage profiling with analysis of outliers (Figure 16), in which we treat very specific threat agents separately and suggest that an expert devote additional time to analysing them. Unlike in the one-stage approaches, we do not analyse threat agents for two different outcomes; analysis of one summary division is sufficient. Table 8 gives an overview of the structure with seven clusters of threat agents for two-stage profiling without analysis of outliers. The attributes of the clusters for two-stage profiling with analysis of outliers are listed in Table 9.
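The two-stage scheme can be sketched as follows. Several assumptions are made for the sake of a short, self-contained example: the "first" cluster that is kept unchanged is taken to be the larger of the two K-means clusters, a simplified alternating k-medoids update stands in for the full PAM swap search, and a deterministic farthest-first initialisation is used. All function names are hypothetical.

```python
import math


def farthest_first(points, k):
    """Deterministic start (an assumption): first point, then the farthest remaining ones."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c])) for p in points]


def kmeans_labels(points, k, iterations=100):
    """Plain Lloyd iterations; centers are cluster means."""
    centers = farthest_first(points, k)
    for _ in range(iterations):
        labels = assign(points, centers)
        updated = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centers[c])
        if updated == centers:
            break
        centers = updated
    return assign(points, centers)


def kmedoids_labels(points, k, iterations=100):
    """Simplified alternating k-medoids (not the full PAM swap phase)."""
    medoids = farthest_first(points, k)
    for _ in range(iterations):
        labels = assign(points, medoids)
        updated = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                # The medoid is the member with the smallest total distance to the others.
                updated.append(min(members, key=lambda m: sum(math.dist(m, o) for o in members)))
            else:
                updated.append(medoids[c])
        if updated == medoids:
            break
        medoids = updated
    return assign(points, medoids)


def two_stage_profile(points, fine_k=6):
    """Stage 1: two K-means clusters. Stage 2: split the smaller one into fine_k clusters."""
    coarse = kmeans_labels(points, 2)
    keep = 0 if coarse.count(0) >= coarse.count(1) else 1
    to_split = [i for i, label in enumerate(coarse) if label != keep]
    fine = kmedoids_labels([points[i] for i in to_split], fine_k)
    final = [0] * len(points)  # label 0 marks the unchanged first cluster
    for i, sub_label in zip(to_split, fine):
        final[i] = 1 + sub_label  # labels 1..fine_k mark the refined clusters
    return final
```

The result is a single labeling with seven clusters, which is why one summary division suffices for the analysis.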
For a better grasp of the clustering output, we also provide visualization of the seven clusters with outliers (Figure 17) and without outliers (Figure 18) in two dimensions.
We compare the results of one-stage and two-stage profiling. The percentage of threat agents assigned to the same cluster by the one-stage and the two-stage analysis is 71.64%. When outliers are removed, the corresponding percentage for one-stage and two-stage profiling is higher, at 75.91%. Outliers have a greater impact in two-stage profiling: the percentage of threat agents assigned to the same cluster by two-stage profiling with and without outliers is 90.38%, which is relatively low compared to one-stage profiling (99.68% and 98.6%, respectively).

Attribute Availability.DDoS.
The DDoS attribute was recorded in only 1019 security events, which is a very small number compared to the number of all recorded events. Moreover, these values appeared for only three threat agents. These threat agents were either assigned to the same cluster or identified as outliers, as can be seen in Table 10. In all approaches with an analysis of outliers, these threat agents belong to the outlier group. This shows that an analysis with outliers should be favoured.
While analysing threat agents with the DDoS attribute, elementary properties of the K-means and PAM algorithms can be observed. In particular, K-means may choose an imaginary element (one that does not exist in the data) as the centroid. For this reason, the DDoS attribute is listed in Table 3. On the other hand, the PAM algorithm chooses a real element as a medoid. It is now clear that threat agents with DDoS attributes are not such elements (see Tables 5, 7, and 8).
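The difference between a centroid and a medoid can be shown with a small sketch. The data points are made up for illustration (first coordinate: a DDoS event count, second coordinate: number of targets), and the function names are hypothetical.

```python
import math


def centroid(points):
    """Coordinate-wise mean; usually not an element of the data."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def medoid(points):
    """The data point with the smallest total distance to all other points."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))


# Three agents without DDoS events and one agent with nine of them (made-up values).
cluster = [(0, 2), (0, 3), (0, 4), (9, 3)]
center = centroid(cluster)        # carries a non-zero DDoS value although no such agent exists
representative = medoid(cluster)  # an actual agent from the cluster
```

Because the single agent with DDoS events is far from the rest, it raises the DDoS coordinate of the centroid but is not selected as the medoid, which mirrors the observation above.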

Conclusion
In this paper, we discussed the application of clustering algorithms to security event profiling. We used data collected over two weeks in the Warden system, which includes security data from various sensors, tools, and honeypots deployed in CESNET and its partner networks. We applied the K-means and PAM clustering methods to group threat agents based on attributes of security events. We discussed several approaches (one-stage and two-stage profiling, with and without analysis of outliers) to using these clustering algorithms in a profiling module. One-stage profiling with analysis of outliers comes out as the best approach for the profiling module. Future research could address determining the size of the private subnet behind a public IP address that performs a threat action, based on the parameters shown in this paper. Privacy in preprocessing also appears to be a very interesting research issue.

Figure 2: Workflow of profiling in profiling module.

Figure 4: Timeline of events for threat agent.

Figure 5: Scaled data visualization using first two principal components.

Figure 6: Internal measures for all three clustering methods.

Figure 7: Stability measures for all three clustering methods.

Figure 8: Elbow method for all three clustering methods.

Figure 10: Decomposition of threat agents into two clusters. Visualization using the first two principal components.

Figure 11: Decomposition of threat agents into seven clusters. Visualization using the first two principal components.

Figure 12: Scheme of profiling module with one-stage profiling with analysis of outliers.

Figure 14: Decomposition of threat agents into seven clusters without outliers. Visualization using the first two principal components.

Figure 16: Scheme of profiling module with two-stage profiling with analysis of outliers.

Figure 18: Decomposition of threat agents into seven clusters without outliers by two-step clustering. Visualization using the first two principal components.

Table 1: Sources of data.

Table 2: Significant key in IDEA format.

Table 4: Representatives of individual clusters, K-means with 2 clusters.

Table 5: Representatives of individual clusters, PAM with 7 clusters.

Table 6: Representatives of individual clusters, K-means with 2 clusters without outliers.

Table 7: Representatives of individual clusters without outliers, PAM with 7 clusters.

Table 8: Representatives of individual clusters, K-means and PAM with 7 clusters.

Table 9: Representatives of individual clusters without outliers, K-means and PAM with 7 clusters.

Table 10: Clusters and attributes of threat agents with DDoS attribute. Notes. The first column represents attributes. The other six columns correspond to the following profiling approaches: one-stage profiling without analysis of outliers (K-means algorithm), one-stage profiling without analysis of outliers (PAM algorithm), one-stage profiling with analysis of outliers (K-means algorithm), one-stage profiling with analysis of outliers (PAM algorithm), two-stage profiling without analysis of outliers (K-means and PAM algorithms), and two-stage profiling with analysis of outliers (K-means and PAM algorithms). The rows correspond to the following attributes: number of clusters, count of threat agents, percentage of threat agents in cluster to all threat agents, Recon.Scanning, availability, duration, max. idleness, min. idleness, number of ISPs, and number of unique targets. "Out" means outliers.