Malware Detection in Self-Driving Vehicles Using Machine Learning Algorithms

The recent trend for vehicles to be connected to unspecified devices, vehicles, and infrastructure increases the potential for external threats to vehicle cybersecurity. Thus, intrusion detection is a key network security function in vehicles with open connectivity, such as self-driving and connected cars. Specifically, when a vehicle is connected to an external device through a smartphone inside the vehicle or when a vehicle communicates with external infrastructure, security technology is required to protect the software network inside the vehicle. Existing technology with this function includes vehicle gateways and intrusion detection systems. However, it is difficult to block malicious code based on application behaviors. In this study, we propose a machine learning-based data analysis method to accurately detect abnormal behaviors due to malware in large-scale network traffic in real time. First, we define a detection architecture, which is required by the intrusion detection module to detect and block malware attempting to affect the vehicle via a smartphone. Then, we propose an efficient algorithm for detecting malicious behaviors in a network environment and conduct experiments to verify algorithm accuracy and cost through comparisons with other algorithms.


Introduction
As automobiles become more intelligent, so do transportation systems [1]. New business requirements in the automotive market and advances in automotive communication technology are increasing the connectivity of automobiles. This greater connectivity portends the increased likelihood of future automobile cyberattacks [2]. Therefore, it is necessary to prepare countermeasures for various attack vectors to combat threats to vehicle cybersecurity.
For example, in 2015, Miller and Valasek [3] remotely hacked a traveling Jeep Cherokee to control the audio, windshield wipers, steering, and braking, revealing that an unprepared cybersecurity system can threaten driver safety. Furthermore, in 2016 and 2017, Keen Security Lab [4] hacked a Tesla vehicle to demonstrate security threats and potential attacks related to connected vehicles. Typically, connected vehicles are a closed environment that only accepts remote control commands over an authorized communication path, such as a server built by the manufacturer or dedicated applications published by the manufacturer. In a closed environment, unauthorized commands are blocked. However, recent self-driving vehicles share their control signals and internal data with not only the controllers inside the vehicle, but also various unspecified vehicles, infrastructures, and smart devices outside the vehicle in real time. Thus, vehicle network protection should be prioritized in open environments.
The security of a self-driving vehicle is directly related to passenger safety; therefore, it is necessary to comprehensively consider the various attack vectors against vehicles based on the integrity, availability, and confidentiality of their cybersecurity [5]. When a connected vehicle's software is updated, it is essential to verify the integrity of the software. Attackers may use malicious applications to illegally steal privileges or gain access, repackage the software installed in the vehicle by injecting malicious code, and induce the installation of maliciously modified applications. This malicious software looks the same as the authorized software, but malicious code contained in the modified applications can collect the user's input to steal account information, activate abnormal service ports, or retain authorization for the attacker to access later. Such malicious software can even be used as a medium for additional remote attacks through communication with the command and control server. Thus, it is important to protect vehicle software when either the vehicle is connected to an external device such as a smartphone via an interface inside the vehicle, or a communication channel is opened between the vehicle and surrounding infrastructure. Previous research has installed vehicle gateways, which allow only authorized communication to the vehicles, and introduced vehicle Intrusion Detection Systems (IDSs) to detect abnormal behaviors in the Controller Area Network (CAN) [6]. However, it is difficult for a gateway or IDS to block these actions in advance, as most malware and adware are behavior-based. In order to detect unknown threats, it is vital to introduce a technology that can detect abnormal behaviors and analyze anomalous indicators using data analysis technology.
In this study, we review the various security threats to self-driving vehicles imposed by malware in the Android operating system (OS) and discuss a method for detecting such malware. In an embedded environment such as a vehicle, both response time and detection accuracy are key factors because resources are limited and real-time responses are required.
Therefore, we propose a machine learning-based detection model that can reduce analysis time and improve detection accuracy. The specific contributions of this research are as follows: (i) We present a method for detecting adware and malware in a self-driving vehicle environment. (ii) We define the intrusion detection module architecture required to detect malware and prevent it from affecting the vehicle through a smartphone. (iii) We experimentally compare the detection accuracy and cost of different algorithms and present the most efficient algorithm.
First, we describe the security technology protecting the internal and external communication networks of self-driving vehicles. We then propose an architecture for an intrusion detection module that detects malicious behavior in the vehicle network based on machine learning. Then, we present an effective intrusion detection method and compare it with existing algorithms in experiments. Finally, we present the conclusions and future work.

Preliminaries
2.1. Vehicle-to-Device Communication.
In the paradigm of vehicle-to-everything communication, communicating with a specific device is termed vehicle-to-device (V2D) communication [7]. Android-based smartphones are typical devices that communicate with a vehicle. Services that identify vehicle operational information or diagnose vehicle abnormalities via a smartphone are classified as performing V2D communication. Initially, to carry out these functions, vehicles were directly connected to an external device outside the vehicle through a universal serial bus connector or Bluetooth, and the data on the device were used. Because a direct wired connection from the vehicle to the device occurred only if the target vehicle was physically occupied, a hacker could not directly control multiple vehicles remotely, even if the vehicles were successfully stolen. Since then, vehicle manufacturers have installed telematics control units (TCUs) or connectivity control units (CCUs) in vehicles and implemented interfaces for remote control of vehicles that include communication functions. In addition, this service is not limited to the original equipment manufacturer. Global telecommunication companies or Internet of Things device manufacturers can also install Long-Term Evolution communication modules on the on-board diagnostics II terminal to collect and manage various data inside the vehicle. When the vehicle is connected to a server or smartphone through such a communication module, information from the vehicle can be transmitted externally. Similarly, it is also possible to control the vehicle by injecting commands to the vehicle from the outside. A connection to a smartphone or external communication device is used not only for convenience services such as music playback and navigation, but also for important functions such as updating the vehicle software. If a connection is unauthorized or infected by malicious code, it can be a serious security threat to the vehicle network. Therefore, security technology to protect the vehicle software and network is essential in V2D communication.

Android-Based Hacking Attacks.
Malicious code is a widely used attack method at the application level that comes in various forms [8]. Various security threats such as leakage of private information, elevation of application privileges, and denial-of-service (DoS) attacks have been reported. The most common attack in the Android OS is the use of an application containing malicious code imported when a specific web page or email is loaded. Most malicious code is injected into the device without the user's awareness during the attack. When an application containing malicious code is executed on an Android OS, the code collects device and user information and sends it to a remote server. It also configures a backdoor by activating the service port to allow the attacker to reenter the device and elevate the privileges of available accounts. Subsequently, the malicious code can gain full access to the infected device by rooting it. In particular, when an infected Android OS is connected to the inside of a self-driving vehicle, malicious code can infiltrate the vehicle directly to take control of the embedded OS or application software environment. For this reason, we need to detect malicious code in a self-driving vehicle.

Dataset.
Recently, machine learning algorithms have been used to detect malicious code. This study proposes a machine learning-based intrusion detection module using the Android Adware and General Malware (AW&GM) dataset [9], which was developed by the Canadian Institute for Cybersecurity (CIC) in 2017. This publicly available dataset comprises Android sandboxes, Android adware, malware, and normal application traffic. It consists of traffic from 1,900 applications downloaded from Google Play (the official Android application market) and is used to classify normal and malicious code based on network traffic. This dataset is categorized into the following three classes (see Table 1).

Related Work for Protecting Vehicle Communication Networks.
Kwon et al. [10] proposed a method for reconfiguring the electronic control units (ECUs) in a vehicle and deactivating attack packets to defend against network intrusion. In the proposed architecture, an IDS is introduced to detect cyberattacks in the network inside the vehicle, and a control module, called a mitigation manager, is applied to mitigate the damage from detected attacks. They then proposed an architecture to deliver commands to reconfigure ECUs, deactivate packets, reconfigure head units, delete packets in gateways at each domain, or switch domains into a secure mode. However, the framework and algorithms for the methodology were only proposed and not developed, and performance evaluations of the specific shape or architecture were insufficient. Therefore, a testbed and simulation environment should be prepared in order to verify the appropriateness of the architecture based on practical data such as detection accuracy, detection time, and resource utilization.
Han et al. [11] suggested an anomaly intrusion detection method for vehicular networks based on survival analysis. The method is based on an anomaly detection algorithm that detects a suspicious pattern within the usual pattern information. The method aims to detect three typical attack scenarios (flooding attacks, fuzzy attacks, and malfunction attacks) that attempt to manipulate and control the vehicle using malicious packets. The authors noted that the proposed method can detect unknown attacks; however, they did not describe how to detect scenarios other than the three mentioned.
Zhang et al. [12] presented a cloud-assisted vehicle malware defense framework to defend vehicles against malware attacks. Such a service can help defend resource-constrained vehicle systems against malware by detecting new malware and updating onboard malware defense capabilities. Although the method is a cloud-based malware detection service, in-vehicle devices are also required to perform onboard threat defense functions. The premise of this service is that a single gateway should be able to control all external communication interfaces in the vehicle. If the vehicle cannot access the security cloud, it must find another way to inspect malware; however, no alternatives were explicitly suggested by the authors.

Machine Learning-Based Intrusion Detection Module
3.1. Malware Detection in Vehicle Networks.
Study Group 17 (SG17) of the Telecommunication Standardization Sector (ITU-T), the sector of the International Telecommunication Union that develops telecommunications standards, established the Intelligent Transport System (ITS) security investigation branch in order to standardize the ITS [13]. Specifically, X.itssec-4, which covers methodologies for IDSs for in-vehicle systems, defines the system structure and methods. Existing mechanisms for detecting unauthorized access into a CAN, injection of a malicious control message, and DoS attacks include vehicle gateways and vehicle IDSs [14]. Attacks using adware and malware have various user interaction scenarios that can intrude into a vehicle through a smartphone (see Table 2).
Connected or self-driving vehicles are connected to external or public networks outside the vehicle via various interfaces. TCUs or CCUs are equipped with a modem and external communication interfaces to enable receipt of Global Positioning System signals and access to mobile networks. In-vehicle infotainment systems, which provide entertainment and information content, enable various applications by applying an embedded OS, such as QNX OS or Android OS. If security design is not considered in wired or wireless networks, these interfaces can be abused as a path for malware or malicious commands to enter the vehicle network (see Figure 1). In particular, the embedded OS environment can be controlled by malware or malicious commands when these malicious processes bypass OS-level security logic or acquire root authority through self-privilege elevation. Therefore, in order to prevent malicious commands from gaining control of the embedded OS, this paper proposes a CAN gateway architecture that includes an intrusion detection module and detects malicious behaviors when Android OS-based devices are connected to the vehicle.
In this study, a machine learning-based intrusion detection module is installed in the vehicle IDS, which can detect intrusion into the CAN or any abnormalities, so that a head unit or ECU can be protected from malicious code. Such detection methods are implemented in the form of software-based computing modules to monitor malware injection or malicious code behaviors in the vehicle. The software can be installed as a component of the vehicle intrusion detection module or as an anti-virus agent in a head unit. The proposed detection software consists of input, analysis, evaluation, and notification modules. The traffic injected through the CAN is processed through the input module and entered into the analysis module, which is equipped with a machine learning algorithm (see Figure 2). The analysis module evaluates intrusion or abnormal behaviors based on a learned model and provides intrusion behavior information to a user or control center in real time. This machine learning-based intrusion detection module can improve the model's accuracy by repeatedly learning, verifying, and evaluating message patterns. Furthermore, detection rules for malicious behaviors can be updated to the vehicle gateway and each controller to accurately detect malicious code.

Data Preprocessing for Malicious Code Analysis.
The characteristics of the 79 features included in the CIC AW&GM dataset are analyzed using the Waikato Environment for Knowledge Analysis [15]. Feature selection is needed to reduce the dimensionality of the data. First, ten-fold cross-validation is applied by employing correlation-based feature selection (CFS) and an entropy-based information gain (IG) method. Constructing a validated dataset for an efficient experimental environment is important in machine learning. In this paper, we propose the improved feature selection (IFS) method, which combines the higher values derived from the correlation and IG methods. The proposed learning algorithm uses the selected network traffic features to detect malware. Unlike existing feature selection methods, IFS finds both greedy features and the highest correlation. There are two broad categories that can be used to measure the correlation between two random variables, one based on classical linear correlation (i.e., CFS) and the other based on information theory (i.e., the IG method). First, a pair of variables is defined for the CFS method and the linear correlation coefficient is derived [16]. In addition, the IG method decides how important a given attribute of the feature vectors is [17]. These two vectors are combined in order to determine the final features from the dataset.

[Figure labels: Intrusion detection module; Domain controller units. Malware attacks through external ports can infect controllers based on Android OS, such as infotainment systems. When the gateway has the proposed IDS module, it can predict and detect malware attacks from external ports.]

Several studies suggesting IDS architectures have reported that malware can be detected in the network traffic of devices [18,19]. This paper selected nine features using the IFS method and shows that malware can be detected from the network traffic using a machine learning-based IDS module.
As the original data has unique characteristics and distributions, learning from these data may be slow or result in modeling errors. In the case of network traffic, it is essential to perform scaling because each feature has a uniquely defined data range and unit. Scaling is a data preprocessing task that helps prevent underflow and overflow when learning from experimental data. It is performed based on the nine selected features. The F1 score results after applying the MinMaxScaler and StandardScaler are described in Table 4. The MinMaxScaler scales all features to be exactly between zero and one. The StandardScaler, in contrast, does not limit the minimum and maximum values, but ensures that all features have an average of zero and a variance of one. Thus, all features have the same scale. A comparison of the F1 score results of the two scaling methods indicates that MinMaxScaler is more advantageous due to the nature of the network traffic, which comprises a wide range of data. Therefore, in this study, the MinMaxScaler technique is applied to each algorithm.
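The two scaling schemes compared above can be illustrated with a minimal Python sketch. The function names `min_max_scale` and `standard_scale` and the sample packet sizes are illustrative assumptions, not the code used in the experiments; the functions are meant to mirror what MinMaxScaler and StandardScaler compute for a single feature column.

```python
from statistics import fmean, pstdev


def min_max_scale(values):
    """Map values linearly onto [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [0.0 if span == 0 else (v - lo) / span for v in values]


def standard_scale(values):
    """Shift to zero mean and rescale to unit (population) variance."""
    mean = fmean(values)
    std = pstdev(values)
    return [0.0 if std == 0 else (v - mean) / std for v in values]


# Hypothetical packet sizes in bytes, spanning a wide range.
packet_sizes = [40.0, 60.0, 1500.0, 9000.0]
scaled_mm = min_max_scale(packet_sizes)
scaled_std = standard_scale(packet_sizes)
```

Min-max scaling bounds every feature to the same interval, whereas standardization leaves the extremes unbounded, which is one plausible reason the former suits wide-ranging traffic features.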
Next, we analyzed algorithms that detect adware and malware typical in Android OS. In this study, these attack detection techniques are compared by applying six machine learning algorithms to the dataset. Furthermore, we analyzed the results of using a general machine learning algorithm, assuming that the computing power employed in the vehicle-embedded software can analyze traffic data using a general specification rather than a high-performance system. The dataset used in this study consists of three classes: benign, adware, and general malware.
There is a strong imbalance between these classes (see Table 5). When the data modeling results are evaluated with general accuracy, the evaluation result may suggest that performance is good even when it is not.

Notes to Table 3. The selected features have the following meanings:
min_flowpktl: minimum length of a flow
max_flowpktl: maximum length of a flow
max_idle: maximum time a flow was idle before becoming active
bVarianceDataBytes: variance of total bytes used in the backward direction
avgPacketSize: average size of a packet
max_fpktl: maximum size of a packet in the forward direction
max_flowiat: maximum inter-arrival time of a packet
fPktsPerSecond: number of forward packets per second
Init_Win_bytes_forward: total number of bytes sent in the initial window in the forward direction
Notably, the last item is included in both the CFS and IG results.

[Table 3 column headings: Category; Selected Features]

Algorithm 1: Improved feature selection.
Input: the universal set of all features.
Output: Ω*, the subset of features selected by the IFS method.
2: Get all linear correlation coefficients.
4: Choose the top-ranked sets with a high absolute correlation value for the relevant variable.
5: Get the combinations of the chosen features.
6: Determine C_j*, the combination that gives the maximum F1 score.
7: Get the information gain of all features.
8: Get the features related to the highly ranked variable.
9: Choose the top-ranked sets with a high information gain value for the relevant variable.
10: Get the elements of the chosen sets.

The selected features are highly correlated and have a strong impact between classes (see Algorithm 1).
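The idea behind Algorithm 1 (rank features once by linear correlation and once by information gain, then take the union) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it keeps the top `k` features of each ranking instead of searching for the combination with the highest F1 score, it discretizes each feature at its median to compute the information gain, and the helper names and toy values are hypothetical.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Linear correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / math.sqrt(vx * vy)


def entropy(labels):
    """Shannon entropy (in bits) of a label sequence."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def information_gain(xs, ys):
    """Gain in label entropy after splitting a feature at its median (assumed binning)."""
    threshold = sorted(xs)[len(xs) // 2]
    low = [y for x, y in zip(xs, ys) if x < threshold]
    high = [y for x, y in zip(xs, ys) if x >= threshold]
    n = len(ys)
    conditional = sum(len(p) / n * entropy(p) for p in (low, high) if p)
    return entropy(ys) - conditional


def improved_feature_selection(features, labels, k):
    """Union of the top-k features by |correlation| and the top-k by information gain."""
    by_corr = sorted(features, key=lambda f: abs(pearson(features[f], labels)), reverse=True)
    by_ig = sorted(features, key=lambda f: information_gain(features[f], labels), reverse=True)
    return set(by_corr[:k]) | set(by_ig[:k])


# Toy values (hypothetical); the keys reuse feature names from the dataset.
features = {
    "max_idle": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "avgPacketSize": [5.0, 5.1, 4.9, 5.0, 5.2, 4.8],
    "fPktsPerSecond": [9.0, 1.0, 8.0, 2.0, 9.5, 1.5],
}
labels = [0, 0, 0, 1, 1, 1]
selected = improved_feature_selection(features, labels, k=1)
```

Because a feature can rank highly under both criteria, the union can be smaller than `2 * k`, which matches the paper's outcome of nine features from two sets of five.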
In the CFS stage, we derive the linear correlation coefficient between each pair of variables. In this paper, we determine the verified features x_i with a high absolute value of this coefficient. These elements are selected with a relevant variable from CFS. The final set C_j* consists of the elements that yield the highest F1 score (see Table 3 for CFS features). In the IG stage, we derive the IG ranking of the features. The IG method finds the top 20% of features, according to the χ² statistical distribution, from the 79 features. It finds that the statistic result is saturated at around this point. The final IG set likewise consists of the elements that yield the highest F1 score (see Table 3 for IG features). The final feature selection is made by finding the union of the CFS and IG method feature sets. In this paper, each method selected five features; in total, nine features are used as input features (one feature was included in both feature sets). A total of 631,955 elements with these features were used for our model. In-vehicle applications can be infected by Android malware via wireless or wired communication channels, as illustrated in Figure 1.

The gradient boosting classifier (GB), extra tree classifier (ET), and bagging classifier (BC) algorithms are used to analyze the data, and their results are compared to those of the proposed algorithm. In addition, we present hyperparameters for each algorithm for comparative verification of the machine learning used to implement the malware detection module in a self-driving vehicle gateway. It is important to tune hyperparameters for result, performance, and cost optimization when analyzing data using a machine learning algorithm. Indeed, significant differences in the performance and accuracy of analysis results can occur depending on the configuration of the hyperparameters. We present the hyperparameters used in each experiment with the F1 score and elapsed time for each algorithm. These hyperparameters were derived by changing various experimental conditions repeatedly for each algorithm.
The nine input features are defined through the feature selection process, and the output is defined using two classification scenarios to analyze the experimental results. In the first scenario (see Figure 4(a)), benign code, adware, and general malware are detected as three separate classes, whereas the second scenario (see Figure 4(b)) is a binary classification scenario where only benign code and adware are detected because general malware accounts for only 0.8% of the dataset. It is meaningful to compare the results of the binary classification because its impact can be predicted through the first scenario.
Under class imbalance, general accuracy can make evaluation performance look good even when it is not. For example, the overall accuracy can be high if the benign category, which has a high weight in the dataset, is accurately predicted, even if general malware, which has a low weight, is not accurately predicted. Therefore, the F1 score, which uses the harmonic mean of recall and precision, is used to evaluate prediction accuracy.

In this paper, malware detection using machine learning is included to develop the IDS module for self-driving vehicles. The F1 score used in machine learning calculates the accuracy, recall, and precision values for all cases to evaluate the model's performance. This general method, which took about 3.570 s on average to verify the dataset, is not suitable for real-time detection. We applied a faster F1 score method because malware should be detected in real time on autonomous vehicles.
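The gap between accuracy and the F1 score on imbalanced classes can be made concrete with a small sketch. The helper names and the 95/5 toy split are assumptions for illustration, and the macro average used here is one common way to combine per-class F1 scores; the paper does not state which averaging it used.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred, label):
    """Harmonic mean of precision and recall for one class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    classes = sorted(set(y_true))
    return sum(per_class_f1(y_true, y_pred, c) for c in classes) / len(classes)


# Hypothetical sample: 95 benign (0) and 5 general malware (2) flows,
# scored against a model that labels every flow as benign.
y_true = [0] * 95 + [2] * 5
y_pred = [0] * 100
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Here the accuracy stays high although no malware flow is detected, while the macro F1 score drops sharply because the malware class contributes an F1 of zero.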
In summary, the proposed machine learning-based intrusion detection module detects Android malware for a self-driving vehicle and labels its type (i.e., adware or general malware).
The procedure, which is based on the detection of network traffic deviation on Android OS, is divided into three phases, as shown in Figure 3. The first phase focuses on data preprocessing. Feature selection is performed to select the most relevant features from all measured features in the dataset. The second phase consists of modeling. Using ten-fold cross-validation, this phase trains the machine learning model using 75% of the dataset and suggests the most suitable hyperparameters for the retraining model. In addition, this phase uses 25% of the dataset for testing and evaluating the proposed intrusion detection module. Therefore, a machine learning model tuned by hyperparameters is created using the training dataset, and a testing dataset is applied to evaluate the model. In the third phase, the intrusion detection module can detect malicious behaviors in real time when real data flows into the self-driving vehicle. Specifically, the proposed intrusion detection module should be included in the vehicle gateway shown in Figure 2.
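The data partitioning in the second phase (a 75%/25% train/test split followed by ten-fold cross-validation on the training part) can be sketched with index lists. The function names, the sample count of 1,000, and the interleaved fold assignment are illustrative assumptions; the seed of 42 simply echoes the random state reported for the experiments.

```python
import random


def train_test_split_indices(n, test_fraction=0.25, seed=42):
    """Shuffle the indices 0..n-1 and split them into train and test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices, k=10):
    """Yield (train, validation) index lists for k roughly equal folds."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, validation


# Hypothetical dataset of 1,000 flows.
train_idx, test_idx = train_test_split_indices(1000)
folds = list(k_fold_indices(train_idx, k=10))
```

Each candidate hyperparameter setting would be scored on the ten validation folds, and only the chosen model would then be evaluated once on the held-out test indices.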

Simulation Results
Six machine learning algorithms are evaluated, including the random forest (RF), decision tree (DT), and k-nearest neighbors classifier (KC).

In summary, the overall prediction accuracy was 90% or greater with binary classification for all algorithms except GB. Therefore, in this case, an algorithm with a short learning time can be selected. In order to detect malware or adware in an embedded software environment such as a vehicle, high accuracy and a fast response time are very important. Therefore, the ET algorithm, with its learning time of 95 ms and prediction accuracy of 90.6% in binary classification scenarios, would be suitable. However, considering that the attack detection method in the Android OS is classification scenario 1, the RF algorithm, which has the highest prediction accuracy and a learning time of 19,401 ms, would be the most suitable.
We use the receiver operating characteristic (ROC) curve to evaluate the experimental results of each algorithm. The ROC curve, a widely used tool for binary classification, plots the method's sensitivity (true positive rate) against one minus its specificity (false positive rate).

In order to generate a class that calculates and returns a confusion matrix quickly, we proposed a new score function. Through this function, we obtained the F1 score directly when the model was training. The function stored the value of the computed confusion matrix and was reusable when the F1 score was called for performance evaluation. In this case, the elapsed average time was 0.049 s, which is acceptable for real-time detection. Abnormal behavior can therefore be predicted within 0.049 s when new traffic occurs in the self-driving vehicle (see Table 6).
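One way to realize a score function that stores the confusion matrix and reuses it is sketched below. The paper does not show its implementation, so the class name `CachedF1Scorer`, its methods, and the macro averaging are all assumptions; the sketch only demonstrates the caching idea of counting each (true, predicted) pair once and answering every later F1 query from the stored counts.

```python
from collections import Counter


class CachedF1Scorer:
    """Builds the confusion matrix once and reuses it for every F1 query."""

    def __init__(self):
        self._matrix = Counter()  # keyed by (true_label, predicted_label)
        self._labels = ()

    def fit(self, y_true, y_pred):
        """Single pass over the predictions; the counts are stored for reuse."""
        self._matrix = Counter(zip(y_true, y_pred))
        self._labels = tuple(sorted(set(y_true) | set(y_pred)))
        return self

    def f1(self, label):
        """F1 = 2TP / (2TP + FP + FN), read from the stored counts."""
        m = self._matrix
        tp = m[(label, label)]
        fp = sum(m[(other, label)] for other in self._labels if other != label)
        fn = sum(m[(label, other)] for other in self._labels if other != label)
        denominator = 2 * tp + fp + fn
        return 2 * tp / denominator if denominator else 0.0

    def macro_f1(self):
        """Unweighted mean of the per-class F1 scores (assumed averaging)."""
        return sum(self.f1(label) for label in self._labels) / len(self._labels)


# Hypothetical predictions over the three classes 0, 1, and 2.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
scorer = CachedF1Scorer().fit(y_true, y_pred)
macro = scorer.macro_f1()
```

After `fit`, each call to `f1` or `macro_f1` touches only a handful of counters rather than the full prediction list, which is where a speed-up over recomputing the metrics from scratch would come from.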
For the RF algorithm, the highest F1 score is obtained when the random state is 42, the number of estimators is 85, the maximum depth is 24, and the maximum number of features is 4. Although the RF's prediction accuracy is typically higher when using binary classification, in this case, it is higher under scenario 1. Overall, the RF algorithm had the highest prediction accuracy of the machine learning algorithms tested. For the DT algorithm, the highest F1 score was observed when the random state is 42 and the minimum leaf sample is 2. For KC, the highest F1 score was observed under the following conditions: uniform weights and 7 estimators. For both DT and KC, accuracy may decrease in datasets with large data imbalances. Moreover, although the KC algorithm exhibits higher prediction accuracy in binary classification scenarios, its learning time is more than twice that in scenario 1.

In Figure 5, the micro-average and macro-average detection results for GB are 0.97 and 0.84, respectively. However, the detection result of class 2 is low (0.58) due to class 2's scarcity in the dataset. That is, GB can perform well for binary classification, but is not suitable for multiclass classification. In conclusion, the ROC curve and AUC of each classifier show that the RF algorithm detects malware better than the other algorithms.
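The AUC values quoted for each classifier are areas under ROC curves. The following minimal sketch shows how such an area can be computed for a single binary problem with the trapezoidal rule; the function names and the toy scores are assumptions, and the micro- and macro-averaged multiclass variants reported in the paper are not reproduced here.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points for labels in {0, 1}."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples sharing a score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical classifier scores; label 1 marks a malicious flow.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
area = auc(roc_points(scores, labels))
perfect = auc(roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
chance = auc(roc_points([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))
```

A classifier that ranks every malicious flow above every benign one reaches an area of 1, while one that gives all flows the same score collapses to the diagonal and an area of 0.5.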

The area under the ROC curve (AUC), which represents the integral under the curve, is an indicator of the detection performance of each classifier. When the curve approaches the graph of y = x, the classifier is purely random and the AUC is near 0.5; conversely, detection performance is better when the curve bends toward the top left area, far from the random classifier line. The AUC of the perfect classifier is 1.
We compared the performance of four algorithms: RF, DT, KC, and GB (see Figure 5). In the RF algorithm shown in Figure 5(a), the AUC of the macro-average obtained by calculating the measurement of each class is 0.97 and the micro-average for integrated classes is 0.99. For the imbalanced (class 2) malware, the AUC was 0.93, which is relatively good compared to other classifiers. In Figures 5(b) and 5(c), the DT and KC show similar detection results. However, when class 2 malware is detected, the DT is slightly better than the KC.

Conclusion
The increasing connectivity of vehicles has also increased security threats. Malicious code can flow into a vehicle's internal network when a device infected with malicious code is connected to the vehicle through an external communication channel. High accuracy and speed are key for detecting malicious behaviors in the embedded environment of vehicles, where responses must be processed in real time. This study, therefore, analyzed security threats from adware and malware in the Android OS within a self-driving vehicle. Network traffic was analyzed to detect malicious behaviors at the network level in the module. In addition, a machine learning-based intrusion detection module for malware detection was proposed. Finally, we proposed a machine learning algorithm that can detect Android malware for vehicles with high accuracy and in a short time. We compared the algorithm's detection accuracy and speed, using the proposed optimal hyperparameters, with those of six machine learning algorithms. In addition, we found that we can significantly reduce the elapsed time by using the novel score-function model for real-time detection. Our simulation demonstrated that our algorithm is highly accurate (92.9%) and fast (0.049 s), making it suitable for real-time malware detection in a self-driving vehicle environment.

Figure 2: CAN topology with the vehicle gateway and intrusion detection module.

Table 3: Feature selection results.
Figure: Intrusion detection module in a vehicle network.
Journal of Advanced Transportation

For GB, the highest F1 score is observed when the random state is 42, the number of estimators is 50, the maximum depth is 15, and the maximum number of features is 5. Its prediction accuracy is generally high, but its learning time is the longest of all algorithms, at 2,556,517 ms; for comparison, the learning time of the second longest algorithm, KC, is 60,967 ms and that of the shortest, ET, is merely 976 ms. Therefore, although GB is suitable for binary classification, the learning time costs are too large for general classification. For ET, the highest F1 score was observed when the random state is 42, the splitter is random, and the number of estimators is 100. This algorithm shows the shortest learning time under both scenarios.