ICSTrace: A Malicious IP Traceback Model for Attacking Data of Industrial Control System

Because attacks against industrial control systems are mostly organized and premeditated, IP traceback is significant for the security of industrial control systems. Based on the existing infrastructure of the Internet, we have developed a novel malicious IP traceback model, ICSTrace, that requires no new services to be deployed. The model extracts the function codes and their parameters from the attack data according to the format of the industrial control protocol, and employs a short sequence probability method to transform the function codes and their parameters into a vector, which characterizes the attack pattern of malicious IP addresses. Furthermore, a Partial Seeded K-Means algorithm is proposed for clustering the patterns, which helps in tracing the attacks back to an organization. ICSTrace is evaluated on attack data captured by large-scale deployed honeypots for industrial control systems, and the results demonstrate that ICSTrace is effective for malicious IP traceback in industrial control systems.


Introduction
With the rapid development of the Internet of Things (IoT), more and more Industrial Control Systems (ICS) are connected to the Internet. As the key bond between the virtual signal and the real equipment, an Internet-connected ICS makes the production process more accurate and agile. But it also narrows the distance between cyber attacks and the industrial infrastructure. The Stuxnet worm, disclosed in 2010, was the first worm known to attack energy infrastructure [11,23]. In 2014, hackers attacked a steel plant in Germany so that the blast furnace could not be shut down properly [41]. On December 23, 2015, the Ukrainian power network suffered a hacker attack, which was the first successful attack on a power grid, resulting in hundreds of thousands of users suffering a power blackout for hours [42]. In 2017, the security vendor ESET disclosed an industrial control network attack weapon named Win32/Industroyer, which implemented malicious attacks on power substation systems [3].
ICSs are highly interconnected and interdependent with the critical national infrastructure [34], and thus attackers have noticed the high returns of attacking ICS in recent years. The attackers are diverse in identity. They may be hackers, members of organized criminal groups, or even a hostile country. Worse, ICS has become a new target for terrorists who seek influence by damaging the physical world. As traditional ICS is physically isolated from the Internet, most research focuses on the functional safety of the system rather than on network security. There are no special protective measures, not to mention an attribution mechanism for tracing an attack back [22]. Security researchers are now committed to intrusion detection technology for ICS. They want to identify, intercept and alert on threats before a severe attack occurs. These intrusion detection technologies can be divided into several categories: state-based [21], behavior-based [24], rule-based [39], characteristic-based [25], model-based [26], and machine-learning-based (ML-based) [31,43].
Because ICS plays an important role in the critical national infrastructure, cyber attacks against ICS are mostly organized and premeditated actions. It is important not only to determine whether there is a threat to an ICS, but also to trace the attack back. Furthermore, locating the initiators and their motivations before or during an attack is crucial for deterring and cracking down on premeditated and organized attackers.
Attribution is one of the most intractable problems of an emerging field, created by the underlying technical architecture and geography of the Internet [28]. The current dominant IP traceback technologies include the packet marking mechanism [30], the packet logging mechanism [32] and their hybrid [15,16]. Packet marking requires routers to write a tag (for example, an IP address) into some fields of every packet. The target retrieves all the tags from the received packets and reconstructs the routing path. Packet marking falls into two categories: Probabilistic Packet Marking (PPM) [30] and Deterministic Packet Marking (DPM) [8]. Packet logging requires routers to record all forwarded packets so as to reveal the routing path, which consumes a lot of storage space. All of these IP traceback technologies require redesigning the Internet or deploying new services. There is still no practical IP traceback system deployed over the network.
The ultimate goal of attribution is identifying an organization or a government, not individuals [28]. Our study identifies an organization by zooming down to a single IP level and then zooming back out to an organization or a unit level, without changing the Internet architecture or deploying new services. Instead of tracing back to the source of a packet directly, we recognize the malicious IP addresses which belong to the same organization.
In this study, we present a malicious IP traceback model (ICSTrace) for industrial control systems. This model makes the following contributions:
1. Based on a deep analysis of the ICS protocol S7comm, the function codes and their parameters are extracted from the attack data.
2. A feature vector of the function codes and their parameters is designed to represent the attack patterns.
3. The sliding window method is adopted to reduce the dimension of the multidimensional samples.
4. A Partial Seeded K-Means clustering algorithm is proposed based on the K-Means algorithm.
5. ICSTrace is shown to be effective on real attack data captured by large-scale deployed honeypots for ICS.
Section 2 introduces the research background and our previous work on attack data collection. Section 3 gives the details of the S7comm protocol. Section 4 describes the architecture of our IP traceback model. Section 5 and Section 6 introduce the attack pattern extraction method and the Partial Seeded K-Means algorithm for clustering, respectively. In Section 7, we evaluate our IP traceback model on real attack data. Section 8 discusses related work and Section 9 concludes.

Background
An ICS is a business process management and control system composed of various automatic control and process control components. It collects and monitors real-time signals to ensure the function of the automatic operation or the process control. Its application fields include program automation, industrial control, intelligent buildings, power transmission and distribution, smart meters, car communication and so on. An ICS protocol is a communication protocol used in ICS. The most well-known ICS protocols include S7, Modbus, BACnet, and DNP3.
At present, there is no ICS attack data set for security research. Therefore, we developed a high-interaction ICS honeypot named S7commTrace in previous work [37], based on Siemens' S7comm protocol. A honeypot is a kind of security resource that has no business utility and is used to attract attackers into illegal use [20]. Honeypot technology sets up hosts, network services or information as bait to induce attacks, so that the attack behavior can be captured and analyzed [33]. Honeypots can be used to better understand the landscape of where these attacks originate [44].
S7commTrace masquerades as a real PLC device by simulating the S7 protocol to capture the probing and attacking data.It can be divided into four modules, including TCP Communication module, S7comm Protocol Simulation module, Data Storage module and User Template, as shown in Figure 1.
The main function of the TCP Communication module is to listen on TCP port 102, submit the received data to the Protocol Simulation module, and reply to the remote peer. The S7comm Protocol Simulation module first parses the received data according to the protocol format and obtains the valid contents. It then generates the reply data by referring to the User Template. Finally, the reply data are sent back to the TCP Communication module to be packaged. The User Template records all the user-defined information such as PLC serial number, manufacturer, and so on. The Data Storage module handles the request and the response of data storage. 2. This means 573 valid IP addresses belong to four organizations at least. Shodan.io [6] is the domain suffix of Shodan, which is a search engine for cyberspace. In addition to retrieving traditional web services, Shodan uses the ICS protocols directly to crawl the ICS devices on the Internet, and visualizes their location and other information. Eecs.umich.edu is the domain suffix of the Department of Electrical Engineering and Computer Science (EECS) of the University of Michigan, which is one of the agencies developing Censys [1,13]. Censys scans the devices on the Internet and stores the results in its database. It provides not only web and API query interfaces but also raw data for download. Neu.edu.cn is the domain suffix of Northeastern University in China, which develops a search engine named Ditecting [2]. Ditecting is capable of providing accurate information about ICS devices and their locations. Plcscan.org is the domain suffix of Beacon Lab [4], which is committed to research and practice related to ICS security. These four organizations are well-known security research institutes. They scan devices on the Internet all the time, including ICS devices. As shown in Table 2, apart from the 66 IP addresses belonging to the four well-known organizations, there are still 507 IP addresses which resolve to a dynamic domain name or to no domain name.

S7 Protocol
The S7 protocol is a Siemens proprietary protocol [5] running on programmable logic controllers (PLCs) of the Siemens S7-200, 300, and 400 series. It is suitable for Ethernet, PROFIBUS or MPI networks. Because the objects of this study are industrial control systems that are connected to the Internet, we only discuss the TCP-based S7 protocol in Ethernet networks. As shown in Figure 2, S7 protocol packets are encapsulated in COTP, which in turn is encapsulated in TPKT for transport over a TCP connection. As shown in Figure 3, the communication procedure of the S7 protocol is divided into three stages. The first stage establishes the COTP connection, the second stage sets up S7 communication, and the third stage exchanges requests and responses for function codes.
The Magic flag of the S7 protocol is fixed to 0x32, and  4.
In the parameters field, the first byte stands for the function code of S7. Table 3 shows the optional function codes of S7. The Communication Setup code is used to build an S7 connection; the Read code lets the host computer read data from the PLC; the Write code lets the host computer write data to the PLC. The codes Request Download, Download Block, Download End, Download Start, Upload and Upload End are designed for downloading or uploading blocks. The PLC Control code covers the operations of Hot Run and Cool Run, while PLC Stop is used to turn off the device.
When the function code is 0x00, it stands for a system function, which is used to check system settings or status. The details are described by the 4-bit function group code and the 1-byte subfunction code in the parameters field, as shown in Figure 5.
System functions are further divided into 7 groups, as shown in Table 4. The Block function is used to read a block, and the Time function is used to check or set the device clock.
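The encapsulation described in this section (TPKT, then COTP, then the S7 PDU whose parameters field starts with the function code) can be illustrated with a small parser. This is only a sketch under stated assumptions: the byte offsets of the S7 header, the two extra error bytes in acknowledgement PDUs, and the value 0xF0 for Communication Setup follow commonly documented S7comm layouts and are not taken from the tables of this paper. The function name `extract_function_code` is hypothetical.

```python
from typing import Optional, Tuple

S7_MAGIC = 0x32  # magic flag of the S7 protocol, as stated in the text


def extract_function_code(packet: bytes) -> Optional[Tuple[int, bytes]]:
    """Return (function_code, remaining_parameter_bytes), or None if not S7."""
    # TPKT header: version (3), reserved byte, 2-byte big-endian total length.
    if len(packet) < 5 or packet[0] != 3:
        return None
    total_len = int.from_bytes(packet[2:4], "big")
    if total_len > len(packet):
        return None
    # COTP header: the first byte is a length indicator that excludes itself.
    cotp_len = packet[4]
    s7 = packet[5 + cotp_len:total_len]
    # Assumed S7 header layout: magic, ROSCTR, 2 reserved bytes, 2-byte PDU
    # reference, 2-byte parameter length, 2-byte data length.
    if len(s7) < 10 or s7[0] != S7_MAGIC:
        return None
    # Assumption: acknowledgement PDUs (ROSCTR 2 or 3) carry 2 error bytes.
    header_len = 12 if s7[1] in (2, 3) else 10
    param_len = int.from_bytes(s7[6:8], "big")
    params = s7[header_len:header_len + param_len]
    if not params:
        return None
    # The first parameter byte is the function code.
    return params[0], params[1:]


# Hand-built Communication Setup request used only for illustration.
setup_request = bytes([
    0x03, 0x00, 0x00, 0x19,                                      # TPKT, length 25
    0x02, 0xF0, 0x80,                                            # COTP data header
    0x32, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x00, 0x00,  # S7 header
    0xF0, 0x00, 0x00, 0x01, 0x00, 0x01, 0x01, 0xE0,              # parameters
])

code, rest = extract_function_code(setup_request)
print(hex(code), len(rest))
```

Returning the raw remaining parameter bytes keeps the sketch independent of any particular function code, so the same routine can feed both the function code sequence and the parameter sequence used later.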

Structure of ICSTrace Model
When an attacker launches attacks, he usually hides his own IP address by resorting to springboard hosts, VPNs and other measures. As shown in Figure 6, after an ICS suffers an attack from the Internet, the security personnel can only see the last IP address connected to the ICS instead of the real IP address of the attacker, not to mention the organization to which the attacker belongs.
ICSTrace transforms the features of the data from each IP address into a one-dimensional eigenvector. This eigenvector stands for the unique pattern of an attack. Therefore, the problem of attribution turns into a problem of clustering the patterns.
As shown in Figure 7, the input of ICSTrace is a malicious IP and its packets. The output is a cluster containing multiple IP addresses, which indicates an organization. The ICSTrace model consists of three stages: Protocol Resolution, Attack Pattern Extraction and Partial Seeded K-Means clustering. The main function of Protocol Resolution is to parse the packets and extract the function codes and their parameters. Attack Pattern Extraction transforms the function codes and their parameters into a one-dimensional vector as the attack pattern of a certain IP address. Partial Seeded K-Means is used to cluster the attack patterns so that IP addresses with the same patterns are aggregated into one cluster. The cluster is then labeled as a certain organization according to auxiliary information (e.g., domain name or geographical location) about the IP addresses in it.

Attack Pattern Extraction
After an attacker has established a connection with an ICS, he carries out a series of deliberate operations, which are expressed by the function codes and their parameters in Table 3 and Table 4. Therefore, the attack features, which are extracted from the function codes and their parameters in the S7comm protocol data, can reveal the intention of the attacker effectively.
As shown in Figure 8, one attacker may have several IP addresses from which to launch attacks. We define an uninterrupted TCP communication as a session, and one IP address may attack one or more ICSs more than once. Thus a single source IP may build several sessions. We call a packet sent by the attacker a request; since a session involves several packet interactions, it usually contains many requests.
The function codes and their parameters of the S7comm protocol are included in these requests, so we extract them from the packets sent by the attacker to the receiver, and use them as the features of the attacker to construct the IP traceback model.

Mean Count of Function Codes and Parameters
Mean count of function codes (MCFC) refers to the average number of function codes per session from the same IP address. Different attackers have different motivations, objectives and methods when conducting a cyber attack. As a result, the numbers of requests and function codes differ greatly between sessions. For an IP address with $n$ sessions,
$$\mathrm{MCFC} = \frac{1}{n}\sum_{i=1}^{n} (\text{Count of function codes})_{\text{session}_i}, \quad \text{session}_i \in \text{IP}$$
Different attackers may use the same function codes while launching an attack, but the chronological order is different. As shown in Figure 9, the function codes $C_1, C_2, \ldots, C_i$ can be arranged in chronological order to form a Markov chain.
Arrange the function codes in a session in chronological order to form a function code sequence (FCS).
Because several sessions may belong to the same source IP address, we combine the function code sequences and parameter sequences of all sessions from the same IP address into a set of function code sequences.
The number of sessions originating from each source IP differs, and attackers adopt various methods each time, which results in different function code sequences in each session. Therefore, the sets $F_n$ of different source IP addresses are two-dimensional structures with unequal numbers of rows and columns.
These FCSs, with uncertain amount and unequal length, cannot be handled directly, because clustering algorithms like K-Means need samples with the same dimensions. In this study, we propose a method to convert these sequences into vectors of the same length. The detailed process is as follows:
Step 1: Add the start and the end status to the sequence. For a sample set of sequences $F_n$, there are $n$ sequences of unequal length, with lengths $a_1, a_2, \ldots, a_n$, $a_i \ge 1$, $i \in [1, n]$. Add the start and the end status to each sequence in $F_n$ to get $F'_n$. Now the length of each sequence is no less than 3.
Step 2: Get the set of unique short sequences. Setting the window length to 3 and the stride to 1, we use the sliding window method to process each sequence in $F'_n$. This yields $a_1, a_2, \ldots, a_n$ short sequences of length 3 from the respective sequences. We then remove the duplicates and add the short sequences to the set $S = (s_1, s_2, \ldots, s_m)$, $m \le \sum_{i=1}^{n} a_i$.
Step 3: Get the short sequence set of all sample sets. Process all of the sequence sample sets according to Step 1 and Step 2 to get a short sequence set $S = (s_1, s_2, \ldots, s_k)$ without duplication.
Step 4: Express the sequences with uncertain amount and unequal length as a probability vector, according to the probability of each short sequence. For a sequence set $P_n$ corresponding to a certain IP, there are $l$ function code sequences of unequal length, with lengths $b_1, b_2, \ldots, b_l$. By adding the start and the end status to each sequence, we get $P'_n$. We then process all the function code sequences with the sliding window method to construct a feature vector $X_{ip}$ according to the frequency of these short sequences.
The method for FCS feature vector processing is shown in Figure 10. We improve on the short sequence processing method in [13]. The improved method has the following advantages. First, we transform the FCSs with uncertain amount and unequal length from the same IP into feature vectors of the same length, and we retain the information of the function codes and their parameters through the frequency characteristics of the short sequences. Second, when the length of the short sequence is set to 3, we can process sequences of unequal length, including those of length 1 or 2, by adding the start and the end status.
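The four steps of the short sequence method can be sketched as follows. The marker strings for the start and end status, the function names, the example addresses and the choice to normalize counts into relative frequencies are illustrative assumptions; the text only specifies a window length of 3, a stride of 1, and a vector built from short-sequence frequencies.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

START, END = "<start>", "<end>"  # hypothetical markers for the start/end status
Window = Tuple[object, ...]


def short_sequences(seq: Sequence[int], width: int = 3) -> List[Window]:
    """Steps 1 and 2: pad with start/end, then slide a window with stride 1."""
    padded = [START, *seq, END]
    return [tuple(padded[i:i + width]) for i in range(len(padded) - width + 1)]


def build_vocabulary(per_ip: Dict[str, List[Sequence[int]]]) -> List[Window]:
    """Step 3: the duplicate-free set of short sequences over all IPs."""
    vocab: List[Window] = []
    seen = set()
    for sessions in per_ip.values():
        for seq in sessions:
            for window in short_sequences(seq):
                if window not in seen:
                    seen.add(window)
                    vocab.append(window)
    return vocab


def feature_vector(sessions: List[Sequence[int]], vocab: List[Window]) -> List[float]:
    """Step 4: relative frequency of every vocabulary entry for one IP."""
    counts = Counter(w for seq in sessions for w in short_sequences(seq))
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in vocab]


# Made-up function code sequences, one list of sessions per source IP.
per_ip = {
    "192.0.2.1": [[0xF0, 0x04], [0xF0]],
    "192.0.2.2": [[0xF0, 0x04, 0x04]],
}
vocab = build_vocabulary(per_ip)
vectors = {ip: feature_vector(sessions, vocab) for ip, sessions in per_ip.items()}
print(len(vocab), vectors["192.0.2.1"])
```

A sequence of length $a_i$ becomes $a_i + 2$ symbols after padding and therefore produces exactly $a_i$ windows, which is why sequences of length 1 or 2 still contribute features.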
The parameter sequence (PS) indicates how the parameters vary across all the function codes used by the sessions from the same IP, arranged in chronological order. Similar to FCS, we use the same method to process PS.

Partial Seeded K-Means Algorithm
We have tried machine learning methods for malicious IP traceback. Commonly used machine learning methods include decision trees, SVMs and neural networks, but all of these need labeled training samples. In the homology test of attack data, however, the attack source is unknown and therefore the sample data have no labels. Unsupervised learning can reveal the inherent nature and regularities of data by learning from unlabeled training samples. Clustering is the most widely used method in unsupervised learning. It divides the data samples into multiple classes or clusters, so that samples in the same cluster have a higher degree of similarity and samples in different clusters differ more from one another. The K-Means algorithm [19] is one of the most classical partition-based clustering methods. The basic idea is to cluster around K points in space as centers, assigning each sample to the closest center. The value of each cluster center is updated iteratively until the best clustering result is obtained. In practice, the clustering effect of the K-Means algorithm is greatly influenced by the method of selecting the initial centers.

K-Means
Considering that clustering performance can be improved by using labeled samples to assist the initial center selection, Wagstaff et al. [36] proposed the COP K-Means algorithm. By constructing the two constraint sets Must-link and Cannot-link, the samples were constrained when they were added to clusters, but the selection of the initial center points was not constrained. Basu et al. [7] proposed the Seeded/Constrained K-Means algorithm. It constrained the choice of initial centers through seeds, and the constraint was also valid when a sample was added to a cluster. However, in this method, each cluster needs a pre-existing seed.
In the IP traceback process, it is possible to know that some IP addresses belong to a certain organization. However, it is very hard to know all the organizations in advance. That means some clusters do not have a pre-existing seed. Therefore, we designed a Partial Seeded K-Means algorithm to solve this problem.
The Partial Seeded K-Means algorithm uses some sample subsets with a known cluster partition (the partial seeds) to determine the initial center points. Considering that there may be a variety of attack modes within one organization, the seed constraints are not applied when adding a sample to a cluster. That means the samples with a known cluster partition may be assigned to their original cluster or to a new cluster during the clustering process.

Algorithm 1: Partial Seeded K-Means
Input: a sample set $D = \{x_1, x_2, \ldots, x_m\}$, the number of clusters $k$, the number of known clusters $l$ ($l \le k$), the sample subset with known cluster partition $D' = \{x_1, x_2, \ldots, x_n\}$, and the sample subset with unknown cluster partition $D - D'$.
1. Calculate the mean of the samples in each known cluster.
2. Calculate the distance from each sample to the known means, choose the sample with the largest sum of mean distance and minimum distance as the new initial mean $\mu_{l+1}$, and treat $\mu_{l+1}$ as a known mean.
4. Calculate the distance $d_{ji}$ between each sample $x_j$ and each initial mean vector $\mu_i$.
5. Choose the cluster label for the sample $x_j$ according to the nearest initial mean vector, $\lambda_j = \arg\min_{i \in \{1, 2, \ldots, k-l\}} d_{ji}$ $(1 \le j \le m - n)$, and add $x_j$ to the corresponding cluster: $C_{\lambda_j} = C_{\lambda_j} \cup \{x_j\}$.
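Because parts of Algorithm 1 are incomplete in this text, the following is only one possible reading of it, not the authors' implementation. It assumes Euclidean distance, assumes that the seeding step is repeated until $k$ initial means exist, scores each unlabeled sample by its mean distance plus its minimum distance to the current means, and then runs ordinary K-Means iterations in which seed samples are free to change cluster, as described in the prose above. All function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def mean(points: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def initial_means(seeds: Dict[str, List[Vector]],
                  unlabeled: List[Vector], k: int) -> List[List[float]]:
    # Step 1: one initial mean per known (seeded) cluster.
    means = [mean(points) for points in seeds.values()]
    if not means:
        raise ValueError("at least one seeded cluster is required")

    def score(x: Vector) -> float:
        # Assumed reading of step 2: mean distance plus minimum distance.
        distances = [math.dist(x, m) for m in means]
        return sum(distances) / len(distances) + min(distances)

    # Assumption: repeat the selection until k initial means are available.
    while len(means) < k:
        means.append(list(max(unlabeled, key=score)))
    return means


def partial_seeded_kmeans(seeds: Dict[str, List[Vector]], unlabeled: List[Vector],
                          k: int, max_iter: int = 100) -> Tuple[List[int], List[List[float]]]:
    means = initial_means(seeds, unlabeled, k)
    # Seeds are not pinned to their cluster: they are reassigned like any sample.
    samples = [x for points in seeds.values() for x in points] + list(unlabeled)
    labels: List[int] = []
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda i: math.dist(x, means[i]))
                      for x in samples]
        if new_labels == labels:
            break
        labels = new_labels
        for i in range(k):
            members = [x for x, lab in zip(samples, labels) if lab == i]
            if members:
                means[i] = mean(members)
    return labels, means


# Made-up two-dimensional patterns: two seeded clusters, one unseeded group.
seeds = {"A": [(0.0, 0.0), (0.0, 1.0)], "B": [(10.0, 10.0), (10.0, 11.0)]}
unlabeled = [(0.5, 0.5), (10.5, 10.5), (20.0, 0.0), (21.0, 0.0)]
labels, means = partial_seeded_kmeans(seeds, unlabeled, k=3)
print(labels)
```

Scoring candidates by mean plus minimum distance favors samples that are far from every existing mean, so the unseeded clusters start in regions the seeds do not cover.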

IP Recall Rate of the Known Organizations
We use the IP addresses of the four known organizations to check how many IP addresses of the same organization are recalled in the same cluster. The four curves in Figures 11 to 14 show how the recall rate varies with different K values. The IP addresses of Shodan, Censys and Beacon Lab are each grouped into a single cluster when the cluster number K is set between 20 and 25. However, the highest recall rate of Ditecting's IP addresses is about 40%. That means Ditecting's IP addresses are divided into different clusters, and there may be multiple attack modes in the samples of Ditecting.
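One way to compute such a per-organization recall rate is sketched below. The definition used here, the share of an organization's IP addresses that fall into the single cluster holding most of them, is an assumption, since the text does not give a formula. The function name, the addresses and the cluster ids are made up.

```python
from collections import Counter
from typing import Dict, Hashable


def org_recall(org_of_ip: Dict[str, str],
               cluster_of_ip: Dict[str, Hashable]) -> Dict[str, float]:
    """Share of each organization's IPs that land in its largest cluster."""
    per_org: Dict[str, Counter] = {}
    for ip, org in org_of_ip.items():
        per_org.setdefault(org, Counter())[cluster_of_ip[ip]] += 1
    return {org: max(counts.values()) / sum(counts.values())
            for org, counts in per_org.items()}


# Illustrative labels only; they do not reproduce the paper's data.
org_of_ip = {
    "192.0.2.1": "Shodan", "192.0.2.2": "Shodan",
    "198.51.100.1": "Ditecting", "198.51.100.2": "Ditecting",
    "198.51.100.3": "Ditecting", "198.51.100.4": "Ditecting",
    "198.51.100.5": "Ditecting",
}
cluster_of_ip = {
    "192.0.2.1": 0, "192.0.2.2": 0,
    "198.51.100.1": 1, "198.51.100.2": 1,
    "198.51.100.3": 2, "198.51.100.4": 3, "198.51.100.5": 0,
}
recall = org_recall(org_of_ip, cluster_of_ip)
print(recall)
```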

Similarity Between the Predicted Value and the True Value
Given the ground truth class assignments labels_true and our clustering algorithm's assignments of the same samples labels_pred, the Adjusted Rand Index (ARI) [18] is a function that measures the similarity of the two assignments, ignoring permutations and with chance normalization. Mutual Information is a function that measures the agreement of the two assignments, ignoring permutations. Adjusted Mutual Information (AMI) is normalized against chance [35]. We use the 66 IP addresses of the known organizations out of the 573 valid IP addresses to compare the predicted values with the true values. Figure 15 shows how the ARI and AMI scores between the predicted and the true values of the 66 IP addresses vary with different K values. The clustering works best when the number of clusters K is set between 20 and 29.
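To make the ARI concrete, here is a small self-contained sketch using the standard pair-counting formula. It is meant as an illustration of the measure rather than the evaluation code used for Figure 15, and the function name and example labels are made up.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(labels_true: Sequence[Hashable],
                        labels_pred: Sequence[Hashable]) -> float:
    """Pair-counting ARI: 1.0 for identical partitions, about 0 for random ones."""
    n = len(labels_true)
    if n < 2:
        return 1.0
    # Contingency counts and their row/column marginals.
    pairs = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in rows.values())
    sum_b = sum(comb(c, 2) for c in cols.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (sum_ij - expected) / (max_index - expected)


# Renaming the clusters does not change the score.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The first call compares two partitions that differ only in label names, which is the "ignoring permutations" property mentioned above; the second splits every true cluster evenly and scores below zero.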

Clustering Performance
In the previous sections, we have evaluated the clustering effect using the samples with known labels. If the ground truth labels are unknown, evaluation must be performed using the model itself. The Silhouette Coefficient [29] is an example of such an evaluation, where a higher Silhouette Coefficient score relates to a model with better defined clusters. The Calinski-Harabasz index [10] can be used to evaluate the model too, where a higher Calinski-Harabasz score relates to a model with better defined clusters.
Figure 16 and Figure 17 show the curves of the Silhouette Coefficient score and the Calinski-Harabasz score, respectively, for different numbers of clusters K. The clustering works best when K is set to 20.
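As an illustration of label-free evaluation, the mean Silhouette Coefficient can be computed as in the sketch below. Euclidean distance, the convention that singleton clusters contribute a score of 0, and the function name are assumptions made for this example; the points are made up.

```python
import math
from typing import Dict, Hashable, List, Sequence

Vector = Sequence[float]


def silhouette_score(samples: List[Vector], labels: List[Hashable]) -> float:
    """Mean of (b - a) / max(a, b) over all samples."""
    clusters: Dict[Hashable, List[Vector]] = {}
    for x, lab in zip(samples, labels):
        clusters.setdefault(lab, []).append(x)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    total = 0.0
    for x, lab in zip(samples, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(x, y) for y in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(x, y) for y in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(samples)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
print(good, bad)
```

Sweeping K and keeping the value with the highest score is the usual way such a curve is used to pick the number of clusters.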

Attack Pattern Recognition
Figure 18 shows the total number of clusters into which the IP addresses of the four known organizations are grouped. No matter what value K is set to, the maximum number of clusters is always 6. This indicates that there are at most 6 attack patterns in the samples with known organization labels.
The attack pattern of Shodan, Censys, and Beacon Lab is unique when the cluster number K is set between 20 and 25. But Ditecting's attack mode is not unique. The IP addresses of Ditecting belong to three different clusters, except that four IP addresses are labeled as Shodan and two IP addresses are labeled as Censys. The specific distribution of these IP addresses is shown in Figure 19.

Organization Identification
We set the cluster number K to 20 for clustering and obtain 20 clusters. That means we find 20 kinds of attack patterns. However, these 20 attack patterns do not indicate that there are 20 organizations, because an organization may have multiple attack patterns, and different organizations may also share a common attack pattern. The DNS query results and the geographical locations of the IP addresses are helpful for identifying the organizations. If the IP addresses in a cluster point to the same static domain name or are very close geographically, we can name the cluster with these labels.
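The naming rule described here can be sketched as a simple heuristic. Preferring a static domain name over geography, and the threshold `min_geo_share`, are hypothetical choices made for illustration; the text does not state how ties or mixed clusters are resolved. The addresses below are made up.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def name_cluster(ips: Iterable[str],
                 static_domain: Dict[str, str],
                 country: Dict[str, str],
                 min_geo_share: float = 0.8) -> Optional[str]:
    """Label a cluster by static domain suffix, else by a dominant country."""
    ips = list(ips)
    # Prefer a static domain name if any IP in the cluster resolves to one.
    domains = Counter(static_domain[ip] for ip in ips if ip in static_domain)
    if domains:
        return domains.most_common(1)[0][0]
    # Otherwise fall back to geography when one country clearly dominates.
    countries = Counter(country[ip] for ip in ips if ip in country)
    if countries:
        name, count = countries.most_common(1)[0]
        if count / len(ips) >= min_geo_share:
            return name
    return None


static_domain = {"192.0.2.1": "shodan.io"}
country = {"192.0.2.1": "US", "192.0.2.2": "US",
           "198.51.100.1": "NL", "198.51.100.2": "NL",
           "203.0.113.1": "NL", "203.0.113.2": "US"}
print(name_cluster(["192.0.2.1", "192.0.2.2"], static_domain, country))
print(name_cluster(["198.51.100.1", "198.51.100.2"], static_domain, country))
print(name_cluster(["203.0.113.1", "203.0.113.2"], static_domain, country))
```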
As shown in Table 5, there are 20 clusters with no fewer than 9 IP addresses in each. According to the DNS query results, some IP addresses in clusters 1, 2, 3 and 4 point to a static domain name, and some IP addresses in clusters 11, 14, and 17 point to a dynamic domain name. There is no domain name for reference in clusters 15, 18, 19 and 20. However, they are located in a particular country or region, so we can name these clusters with geographical labels. Furthermore, clusters 3 and 13 are labeled as Ditecting, which confirms

Related Work

ICS Intrusion Detection
Khalili and Sami [21] have proposed SysDetect, a systematic approach to critical state determination, to solve the problem of determining the critical states in state-based intrusion detection. The system is built on a well-established iterative data mining algorithm, i.e., Apriori. Kwon et al. [24] have proposed a novel behavior-based IDS for the IEC 61850 protocol using both statistical analysis of traditional network features and specification-based metrics. Yang et al. [39] have presented a rule-based IDS for IEC 60870-5-104 driven SCADA networks using in-depth protocol analysis and a Deep Packet Inspection (DPI) method. McParland et al. [25] have proposed characteristic-based intrusion detection, which is an extension of the specification-based method, by defining a set of good properties and looking for behavior outside those properties. A specification-based intrusion detection model is designed to enhance protection against both outside attacks and inside mistakes by combining the command sequence with the physical device sensor data. Mo et al. [26] have developed model-based techniques capable of detecting integrity attacks on the sensors of a control system. It is assumed that the attacker wishes to disrupt the operation of a control system in steady state, to which end the attacker hijacks the sensors, observes and records their readings for a certain amount of time, and repeats them afterward to camouflage his attack. The model-based techniques can effectively prevent such attacks. Shang et al. [31] have presented a PSO-SVM algorithm which optimizes parameters with an advanced Particle Swarm Optimization (PSO) algorithm. The method identifies anomalies in Modbus TCP traffic according to the frequencies with which mode short sequences appear in the Modbus function code sequence. Zhou et al. [43] have designed a novel multimodel-based anomaly intrusion detection system with embedded intelligence and resilient coordination for the field control system in industrial process automation. In this system, a multimodel anomaly detection method is proposed, and a corresponding intelligent detection algorithm is designed. In addition, in order to overcome the shortcomings of anomaly detection, a classifier based on an intelligent hidden Markov model is designed to distinguish actual attacks from failures.

IP Traceback
Savage et al. [30] have described a general-purpose traceback mechanism based on probabilistic packet marking. Routers probabilistically mark packets with partial path information when they arrive. By combining a modest number of such packets, a victim can reconstruct the entire path. Snoeren et al. [32] have presented a hash-based technique for IP traceback that generates audit trails for traffic within the network, and can trace the origin of a single IP packet delivered by the network in the recent past. Belenky et al. [8] have proposed a Deterministic Packet Marking algorithm, which only requires the border router to mark the 16-bit Packet ID field and the reserved 1-bit Flag in the IP header. The victim can therefore obtain the corresponding entry address and the subnet where the attack source is located. This method is simple and efficient compared to the Probabilistic Packet Marking algorithm. Bellovin et al. [9] have proposed an ICMP Traceback Message. When forwarding packets, routers can, with a low probability, generate a traceback message that is sent along to the destination or back to the source. With enough traceback messages from enough routers along the path, the traffic source and path of forged packets can be determined. Goodrich et al. [17] have presented a new approach to IP traceback based on the probabilistic packet marking paradigm. This approach, which is called randomize-and-link, uses large checksum cords to link message fragments in a way that is highly scalable, for the cords serve both as associative addresses and data integrity verifiers. The main advantages of this approach are that an attacker cannot fabricate a message and that it scales well. Gong et al. [15,16] have presented a novel hybrid IP traceback approach based on both packet logging and packet marking. They maintain the single-packet traceback ability of the hash-based approach and, at the same time, alleviate the storage overhead and access time requirement for recording packet digests at routers. Their work improves the practicability of single-packet IP traceback by decreasing its overhead. Yang et al. [38] have proposed a traceback scheme that marks router interface numbers and integrates packet logging with a hash table (RIHT) to deal with the logging and marking issues in IP traceback. RIHT has the properties of low storage, high efficiency, and zero false positive and zero false negative rates in attack-path reconstruction. Yu et al. [40] have proposed a marking on demand (MOD) scheme based on the DPM mechanism to dynamically assign marking IDs to DDoS attack related routers to perform the traceback task. They set up a global mark distribution server (MOD server) and some local DDoS attack detectors. When suspicious network flows appear, the detector requests unique IDs from the MOD server and embeds the assigned unique IDs to mark the suspicious flows. At the same time, the MOD server stores the IP address of the requesting router and the assigned marks, which are used to identify the IP addresses of the attack sources, in its MOD database. Fadel et al. [14] have presented a new hybrid IP traceback framework. This framework is based on both marking and logging techniques. In the marking algorithm, every router is assigned a 12-bit ID number, which helps in deploying a pushback method that lets legitimate traffic flow smoothly. In the packet logging technique, a logging ratio is managed by changing a value k specified in the traceback system. This framework can save more than 50% of the storage space of routers. Cheng et al. [12] argue that cloud services offer better options for the practical deployment of an IP traceback system. They have presented a novel cloud-based traceback architecture, which possesses several favorable properties encouraging ISPs to deploy traceback services on their networks. This architecture includes a temporal token-based authentication framework, called FACT, for authenticating traceback service queries. Nur et al. [27] exploit the record route feature of the IP protocol, and propose a novel probabilistic packet marking scheme to infer forward paths from attacker sites to a victim site and enable the victim to delegate the defense to the upstream Internet Service Providers (ISPs). Compared to the other techniques, this approach requires fewer packets to construct the paths from attacker sites toward a victim site.

Conclusions
IP traceback for cyber attacks usually requires redesigning the Internet or deploying new services. In this study, we have proposed a malicious IP traceback model, i.e. ICSTrace, for Industrial Control Systems without changing the Internet infrastructure or deploying any new services. By analyzing the characteristics of the attack data, we extract the numeric features and the sequence transformation features from the function codes and their parameters. Those features are expressed by a one-dimensional vector, which stands for the unique pattern of an attack. As a result, the problem of IP traceback turns into a problem of clustering those patterns. We also propose a Partial Seeded K-Means algorithm to cluster the IP addresses with the same pattern into a malicious organization. The effectiveness of ICSTrace is demonstrated by experiments on real attack data. Although ICSTrace cannot recover the whole path of the attack, it is significant in the following aspects:
1. Find out the malicious IP addresses which belong to the same organization.
2. Reveal the unexposed active IP addresses belonging to the known organizations.
3. Collect the springboards used by the same organization for launching attacks.
4. Provide learning samples for subsequent malicious behavior identification by expressing the attack pattern in the form of feature vector.
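The clustering step summarized above can be illustrated with a minimal sketch of a seed-initialized K-Means. This is a generic seeded variant written for illustration only, not the Partial Seeded K-Means algorithm as specified in this paper. The function name `seeded_kmeans`, the `seeds` mapping (index of a feature vector to a known organization label), and the `cluster_<n>` labels for unseeded clusters are all assumptions. Seeded vectors keep their label throughout, and clusters without seeds start from randomly chosen unlabelled vectors.

```python
import math
import random


def seeded_kmeans(vectors, seeds, k, iters=50, rng_seed=0):
    """Cluster feature vectors into k groups, guided by a few known labels.

    vectors: list of equal-length lists of floats (one per IP address).
    seeds:   dict {vector index: known label}; hypothetical input format.
    Returns a list with one cluster label per vector.
    """
    seed_labels = set(seeds.values())
    if k < len(seed_labels):
        raise ValueError("k must be at least the number of seeded labels")
    rng = random.Random(rng_seed)

    # Seeded clusters start at the mean of their labelled vectors.
    centroids = {}
    for label in seed_labels:
        members = [vectors[i] for i, lab in seeds.items() if lab == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]

    # Remaining clusters start at randomly chosen unlabelled vectors.
    unlabelled = [i for i in range(len(vectors)) if i not in seeds]
    for n, i in enumerate(rng.sample(unlabelled, k - len(centroids))):
        centroids[f"cluster_{n}"] = list(vectors[i])

    assign = []
    for _ in range(iters):
        # Seeded vectors keep their label; others go to the nearest centroid.
        assign = [
            seeds[i] if i in seeds
            else min(centroids, key=lambda c: math.dist(v, centroids[c]))
            for i, v in enumerate(vectors)
        ]
        # Recompute each centroid; an empty cluster keeps its old centroid.
        new = {}
        for label, old in centroids.items():
            members = [vectors[i] for i, a in enumerate(assign) if a == label]
            new[label] = (
                [sum(col) / len(members) for col in zip(*members)]
                if members else old
            )
        if new == centroids:  # converged
            break
        centroids = new
    return assign


if __name__ == "__main__":
    # Toy 2-D "attack patterns"; two of six vectors carry a known label.
    patterns = [[0, 0], [0, 1], [10, 10], [10, 11], [20, 0], [21, 0]]
    print(seeded_kmeans(patterns, {0: "OrgA", 2: "OrgB"}, k=3))
```

Because unseeded centroids are chosen at random, the result for unseeded clusters depends on `rng_seed`. In practice, several restarts or a score such as the Silhouette Coefficient would be needed to choose K.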
Future work
In the future, we will improve ICSTrace and apply it to other kinds of ICS protocols, and even to traditional Internet protocols. At the same time, we will use the attack patterns as learning samples to design and validate a machine-learning-based intrusion detection system, to address the difficult problem of unknown threat detection.
Mean count of the parameters (MCP) refers to the average number of parameters used in the function codes of each session from the same IP address. Some function codes do not need parameters, while others need one or more, so different attackers use different numbers of parameters.

$$\mathrm{MCP} = \frac{1}{n}\sum_{i=1}^{n} (\text{Count of parameters})_{\mathrm{session}_i}, \quad \mathrm{session}_i \in IP,$$

where n is the number of sessions from that IP address.
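As a small worked example of the MCP definition, the sketch below averages the per-session parameter counts for one IP address. The session representation, a list of `(function_code, parameters)` pairs, and the function name `mean_count_of_parameters` are assumptions made for illustration, not the data format used by ICSTrace.

```python
def mean_count_of_parameters(sessions):
    """Return the MCP for one IP address.

    sessions: list of sessions; each session is a list of
    (function_code, parameters) pairs, where parameters is a list.
    This layout is hypothetical and chosen only for the example.
    """
    if not sessions:
        raise ValueError("at least one session is required")
    # Count of parameters in a session = total over its function codes.
    counts = [
        sum(len(params) for _code, params in session)
        for session in sessions
    ]
    return sum(counts) / len(counts)


if __name__ == "__main__":
    sessions_from_one_ip = [
        [(0x04, ["p1", "p2"]), (0x00, [])],  # 2 parameters in this session
        [(0x05, ["p1"])],                    # 1 parameter in this session
    ]
    print(mean_count_of_parameters(sessions_from_one_ip))  # 1.5
```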

Figure 10: Method for FCS feature vector processing.

Figure 11: The recall rate of Shodan's IP addresses.

Figure 12: The recall rate of Censys' IP addresses.

Figure 13: The recall rate of Ditecting's IP addresses.

Figure 14: The recall rate of Beacon Lab's IP addresses.

Figure 15: ARI and AMI scores between the predicted and the true values of the 66 IP addresses vary with different K values.

Figure 16: The Silhouette Coefficient score varies with different K values.

Figure 18: The total number of clusters, in which those IP addresses of the four known organizations are grouped.

Table 2: IP statistics by DNS reverse lookup.

Table 3: S7 protocol function codes and the corresponding functions.
the following fields are S7 type, data unit ref, parameters length, data length, result info, parameters and data, as shown in Figure

Table 4: When the function code is 0x00, it is a system function, which is further divided into 7 groups.

Table 5: Clusters and their organization labels.