ACAMA: Deep Learning-Based Detection and Classification of Android Malware Using API-Based Features

As a great number of IoT and mobile devices are used in our daily lives, the security of mobile devices is becoming more important than ever. If mobile devices, which play a key role in connecting devices, are exploited by malware to perform malicious behaviors, this can cause serious damage to other devices as well. Hence, a huge research effort has been made to prevent such situations. Among these efforts, many studies attempted to detect malware based on the APIs that it uses. In general, they showed high accuracy in detecting malware, but they could not classify malware into detailed categories because their detection mechanisms do not consider the characteristics of each malware category. In this paper, we propose a malware detection and classification approach, named ACAMA, that can detect malware and categorize it with high accuracy. To show the effectiveness of ACAMA, we implement it and evaluate it against previously proposed approaches. Our evaluation results demonstrate that ACAMA detects malware with 26% higher accuracy than a previous work. In addition, we show that ACAMA can successfully classify applications that another previous work, AVClass, cannot classify.


Introduction
By 2025, it is expected that there will be 55.9 billion connected devices worldwide and 79.4 ZB of data generated by IoT devices [1], and 9 billion smartphones will be connected by 2024 [2]. Accordingly, the use of mobile devices is increasing rapidly. In addition, the mobile application market is also growing. Unfortunately, attackers exploit the growing ecosystem, and we have observed that the amount of mobile malware is also increasing rapidly [3,4]. Android malware accounts for the largest proportion of mobile malware, as Android has the largest share of the mobile application market. An attacker abuses Android's open market policy to inflict damage on users, such as personal information leaks or financial loss. Therefore, it is critical to protect users from malware by accurately and quickly detecting malicious Android applications. In addition, to quickly analyze and respond to malicious applications, it is very important to identify their behaviors and classify them.
By identifying or categorizing malicious behaviors, we can help analysts further analyze the characteristics of malicious applications. Moreover, identifying malware based on its behavior lets security analysts devote their efforts to more malware. However, to our knowledge, antivirus products cannot detect unknown malicious applications, and, thus, many studies were conducted to detect unknown malware.
Recently, many studies used Application Programming Interfaces (APIs) as features to detect malware [5][6][7][8][9]. Malicious behaviors must be implemented with a series of specific APIs, and, thus, previously proposed systems that analyze APIs could detect malware with high accuracy. However, previous API-based malware detection systems cannot classify malware into detailed categories. Therefore, to develop malware classification techniques, the research community had to conduct other studies. For example, AVClass [10] uses antivirus vendors' reports obtained from VirusTotal. In general, antivirus vendors do not agree on a common analysis result for malware. Consequently, the results of AVClass are not reliable.
In this work, we propose ACAMA, which can identify malware and classify it into specific categories by utilizing the APIs used to implement malicious functions. ACAMA generates deep learning models based on the APIs of Android malware with the CNN algorithm. To evaluate the performance of ACAMA, we compare it with a previous approach proposed by Kim et al. [11]. We also evaluate the effectiveness of ACAMA by using the classification results of AVClass. Overall, the evaluation results show that, although we used the same feature that Kim et al. used, ACAMA detects malware with a higher degree of precision. ACAMA detects malware with 95% accuracy, 26% higher than Kim et al. In addition, ACAMA could successfully classify 72.456% of the malware that AVClass could not classify.
In summary, this paper makes the following contributions:

Background and Related Work
In this section, we introduce categories of mobile malware and their behaviors. In addition, we discuss previous approaches for detecting malware by using APIs as a feature and other related work.
MalDozer [17] tried to detect malware by applying a CNN to API calls extracted from DEX assembly. Nix et al. [9] also detected malware from API calls in applications by using a CNN. However, most of the previously proposed approaches focused on the binary classification problem (i.e., distinguishing malware from benign applications). Here, we discuss two systems closely related to our work. AVIS [5] ensembles 10 types of machine learning algorithms, such as Support Vector Machine [19], Naive Bayes [20], and k-NN, and directly scores each API to create an API ranking. In addition to detecting malicious applications, AVIS evaluates an application quantitatively through the average value to provide a risk indicator. However, machine learning requires algorithms that are appropriate for the data in order to obtain accurate results, and not all of the algorithms used by AVIS can be considered suitable for API data. Also, a bagging ensemble of different algorithms is not suitable for classifying malware into specific categories [5,11]. Kim et al.'s [11] method, like AVIS, directly scores APIs to create API rankings and evaluates applications quantitatively. Unlike AVIS, however, it evaluates an application using a weighted average value. In addition, since the bagging ensemble was performed using only XGBoost [21], the accuracy improved compared to the previous study. However, the XGBoost algorithm itself uses boosting, which is already an ensemble technique. Since bagging is used on top of it, the cost of classifying applications is high. In addition, there may be a problem of objectivity, such as overfitting, because the feature range was specified manually through experiments when selecting the training data.

Categories of Mobile Malware.
Analyzing malware is an error-prone task. Therefore, if the category of malware can be determined automatically, it can provide boundaries for the analysis and help analysts conduct it effectively and efficiently. However, it is challenging to classify malware automatically. Moreover, antivirus systems use different malware categories, and even the same application can be categorized differently by each antivirus system. Among the previous studies, Samra et al. [22] extracted permissions from the manifest files and classified applications into only two categories, business and tools, with the k-means algorithm. On the other hand, DroidMiner [23] proposes a two-level behavioral graph model and extracts sensitive execution paths from Android program logic. Its authors classified malware into 12 families by using the Random Forest algorithm.
In this work, we classify malware based on its behavior as inferred from the APIs used to implement it, which helps security analysts by providing an intuitive understanding of malware behavior. Since there are many categories of malware, we use the categories proposed by Wei et al. [24]. Wei et al. used 24,560 malicious applications and classified them into 71 categories. Table 1 shows the categories proposed by Wei et al.

Labeling Android Malware.
A typical technique to analyze malicious applications and categorize them is to use VirusTotal [25]. From VirusTotal, users can obtain antivirus scanning reports for an application and can categorize the application using words contained in the report.
EUPHONY [26] analyses all labels provided by different vendors for labeling malware families. Then, it builds a graph representing the association links between family names based on the labels that the vendors assigned to the malware samples. Finally, EUPHONY uses Prim's algorithm to transform the graph into a Minimum Spanning Tree. In this way, EUPHONY unifies malware labels into common family groups. In contrast, Li et al. [27] remove legitimate library code from applications before labeling them. Then, they use a malicious payload mining method with 68 malware labels to cluster malware.
AVClass [10] is a malware labeling system based on VirusTotal reports. AVClass does not simply count specific words in the AV scanning reports to determine a category; it creates aliases in advance through word learning so that similar categories can be grouped into the same category. Then, it compares the AV reports of an unknown application with the aliases and arranges the words of the reports. Finally, words that appear twice or more in the same category are set as the application category. AVClass is generally more accurate than specifying categories using simple word counting, but its accuracy is still not high enough because of the limited number of words used to classify malware. Also, AVClass strongly relies on AV reports, and, thus, if antivirus systems cannot generate enough information regarding a malicious application, AVClass cannot categorize it.
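The word-based labeling idea described above can be illustrated with a short sketch. It is not the actual AVClass implementation: the alias table, the list of generic tokens, and the function name `label_from_av_reports` are assumptions made only for illustration. The sketch normalizes the tokens of each AV label through the alias table, counts every token at most once per label, and accepts the most frequent token only if at least two labels agree on it.

```python
import re
from collections import Counter

# Hypothetical alias table and generic-token list (illustrative assumptions).
ALIASES = {"fakeinst": "fakeinstaller"}
GENERIC = {"android", "androidos", "trojan", "malware", "generic", "gen", "a", "b"}


def label_from_av_reports(av_labels, min_votes=2):
    """Return the token shared by at least `min_votes` AV labels, or None."""
    votes = Counter()
    for label in av_labels:
        raw_tokens = re.split(r"[^a-z0-9]+", label.lower())
        # Map aliases to one canonical name; a set counts each token once per label.
        tokens = {ALIASES.get(t, t) for t in raw_tokens if t}
        votes.update(tokens - GENERIC)
    if not votes:
        return None  # nothing informative in the reports
    family, count = votes.most_common(1)[0]
    return family if count >= min_votes else None


reports = [
    "Android.Trojan.FakeInst.a",
    "Trojan:AndroidOS/FakeInstaller.B",
    "Generic.Malware",
]
family = label_from_av_reports(reports)
```

In this toy input two of the three labels agree after alias resolution, so a family is returned; with a single informative label the function returns `None`, which mirrors the case in which the reports do not carry enough information.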

Deep Learning Visualization for Interpretation of Classified Result.
Deep learning has the advantage of performing feature engineering automatically. It is, thus, easy to learn a predictive model using deep learning algorithms. However, it is difficult to know which features the predictive model relies on. Therefore, to justify prediction results, we need to interpret them while the predictive model is processing data.
In this work, we provide the confidence level of the classification results using LIME [28]. LIME is model-agnostic and, thus, can be used with CNN, LSTM, decision trees, and other machine learning algorithms. Even if we replace the underlying machine learning algorithm, we can still use the same visualization model for the interpretation. Furthermore, LIME can explain the predictions of any classifier by approximating it locally with an interpretable model. LIME feeds slightly changed input values into the predictive model and treats an input value whose change causes a significant change in the predicted value as important. We use these values as the interpretation of the prediction results.
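The perturbation idea can be made concrete with a simplified sketch. It is not LIME itself, which additionally samples many perturbed inputs and fits a weighted interpretable surrogate model; here each API is masked in turn and ranked by how much the predicted score drops. The function names, the `<pad>` mask token, and the toy scoring function are assumptions for illustration.

```python
def occlusion_importance(predict, tokens, pad_token="<pad>", top_k=10):
    """Rank tokens by how much masking each one lowers the predicted score."""
    base = predict(tokens)
    drops = {}
    for tok in set(tokens):
        # Replace every occurrence of one API name with the mask token.
        masked = [pad_token if t == tok else t for t in tokens]
        drops[tok] = base - predict(masked)
    ranked = sorted(drops.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


def toy_predict(tokens):
    """Toy stand-in for a category classifier (hypothetical weights)."""
    weights = {"sendTextMessage": 0.6, "getDeviceId": 0.3}
    return min(1.0, sum(weights.get(t, 0.0) for t in set(tokens)))


ranking = occlusion_importance(
    toy_predict, ["getDeviceId", "sendTextMessage", "toString"]
)
```

With the toy scorer, masking `sendTextMessage` causes the largest drop, so it is ranked first, while `toString` contributes nothing and is ranked last.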

Design
In this section, we describe our goal and details of the proposed approach.

Goal.
The main goal of this work is not only to detect malware but also to classify it into specific categories using deep learning. Classifying malware makes it possible to detect malware attacks quickly, because representative applications can be selected and analyzed for each family. Therefore, analysts can protect users from malware attacks. Also, by relying on characteristics of malware that can be deterministically obtained rather than on reports generated by antivirus vendors, we aim to avoid misclassifying malware.

Overview of ACAMA.
In order to achieve the goal, we design and implement a deep learning-based approach, named ACAMA, that uses the APIs of an application as a feature. ACAMA mainly consists of three stages: Preprocessor, Deep Learner, and Categorizer. In the first stage, the Preprocessor extracts APIs from labelled benign and malicious applications using AndroGuard [29] and generates training datasets. In the second stage, the Deep Learner vectorizes the APIs, and then the CNN learns the vectorized dataset and creates a classifier model. In the last stage, the Categorizer feeds unknown applications into the classifier model created in the previous stage to identify malicious applications. When the classifier detects a malicious application, the application is passed to the category classifier. After the category classifier categorizes the malware, ACAMA uses LIME to provide a report that shows the important APIs used to classify the malware. The overall structure of the proposed method is shown in Figure 1.

Preprocessor for Training Dataset.
ACAMA extracts APIs to generate the training dataset using AndroGuard. The API extraction process is performed by parsing the classes.dex file, which contains the actual code of an application. The Method Table and Class Def Table entries in Table 2 show the characteristic information of an API that can be obtained in this way. Since the dimension can become too large due to the number of words, the training set is constructed using only the method name, which expresses the API as much as possible, and excludes the description. We generate two types of training sets: the first dataset consists of 10,000 benign applications and 10,000 malicious applications, and the second dataset consists only of malicious applications whose categories are determined. We use the Android Malware Dataset (AMD) [24] for collecting malware.

Learning the Training Datasets Using CNN.
When using the CNN algorithm, the feature map is extracted through convolution operations by applying filters to the data values. Therefore, we need to vectorize (word embedding) the API features that we extracted. Methods for vectorizing natural language include One-Hot, Word2Vec, GloVe, BOW, TF-IDF, and Tokenize. Among them, we use the tokenize method, which builds a dictionary of the words that exist in the API features and maps each word to an integer. In our dataset, a dictionary of 1,273,251 words in total is used, which includes two additional entries: one for padding, which matches the size of the vectors, and one that covers unknown APIs not seen in the training phase. If we simply map each word to a number, the dimension is too large and the relationship between the APIs cannot be considered, so an embedding layer is used to adjust the vector values. To feed the embedding layer, the size of each application's API vector is adjusted by padding it. After that, ACAMA uses the embedding layer to transform the vector into a dense vector that can contain a lot of information despite its small dimension. In this paper, the size of the dense vector is set to 64. Figure 2 shows an example of the vectorization process.
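The tokenize-and-pad step can be sketched as follows. This is a minimal illustration rather than the implementation used in the paper: the reserved index values (0 for padding, 1 for unknown APIs), the helper names, and the truncation of overly long sequences are assumptions.

```python
# Two reserved dictionary entries; the concrete index values are assumptions.
PAD, UNK = 0, 1


def build_vocab(apps):
    """Assign a unique integer to every API method name seen in training."""
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for api_names in apps:
        for name in api_names:
            vocab.setdefault(name, len(vocab))
    return vocab


def encode(api_names, vocab, max_len):
    """Map names to integers, then truncate or pad to a fixed length."""
    ids = [vocab.get(name, UNK) for name in api_names][:max_len]
    return ids + [PAD] * (max_len - len(ids))


train = [["getDeviceId", "sendTextMessage"], ["openConnection", "getDeviceId"]]
vocab = build_vocab(train)
# "exec" was never seen during training, so it maps to the unknown entry.
encoded = encode(["sendTextMessage", "exec", "getDeviceId"], vocab, max_len=5)
```

The resulting fixed-length integer sequences are what an embedding layer would then turn into dense vectors (64-dimensional in the paper).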

CNN Parameter Locations for Optimization.
After we determined the data format, we optimized the parameters of the CNN algorithm in order to learn datasets effectively. If appropriate parameters are not used based on the characteristics of datasets, the performance of a classifier will be low. Therefore, the parameters should be optimized through repeated experiments. Specifically, in this work, we optimized the embedding dimension, the number and size of filters, the type and size of pooling, and the number of convolutional layers. Figure 3 shows the location of each parameter.

Learning Datasets with CNN.
Since ACAMA uses API features, the convolution operation is performed using the Conv1D layer, which is widely used in Text-CNN. The filter length used in the operation was set to 8, and the embedding dimension is 64. Therefore, the size of each filter was (64, 8), and the number of filters was set to 32. We also used padding and stride with their default values: padding is 0 and stride is 1. Hence, the size of the feature map becomes (7765, 32) because the filter length is 8 and the number of filters is 32. Since the proposed method uses two datasets, the learning phase is also performed twice, once with each dataset. Figure 4 shows the data and filters during the learning process of the category classifier.
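The reported feature-map size follows from the standard output-length formula for a one-dimensional convolution. The sketch below checks the arithmetic; the input sequence length of 7,772 is not stated in the text and is inferred here from the reported feature map, so it should be read as an assumption.

```python
def conv1d_output_length(input_len, kernel_size, padding=0, stride=1):
    """Number of positions at which a 1D filter can be applied."""
    return (input_len + 2 * padding - kernel_size) // stride + 1


# Assumption: padded API sequence length inferred from the (7765, 32) feature map.
seq_len = 7772
num_filters = 32

feature_map_shape = (conv1d_output_length(seq_len, kernel_size=8), num_filters)
```

With padding 0 and stride 1, a filter of length 8 shortens the sequence by 7 positions, and each of the 32 filters contributes one channel.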
After that, we apply an activation function to the feature map created in the previous step. In ACAMA, we used the ReLU function [30] as the activation function.
The ReLU function is one of the most widely used activation functions because it can learn relatively quickly and its computation cost is not high. After that, max-pooling is applied to the generated activation map to select the largest values from the feature vectors. This process allows us to pick the most prominent features used in an application. We set the pooling size to 1.
Lastly, after flattening the output of the pooling layer into a one-dimensional vector, we used the softmax function, which normalizes all outputs to values between 0 and 1. Based on this output, the risk classifier detects malicious applications and the category classifier classifies them.
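The three operations named above can be written out in a few lines. This is a plain-Python sketch of the standard definitions, not the layers used in the paper's implementation; the input values are made up, and the pooling helper assumes non-overlapping windows.

```python
import math


def relu(xs):
    """ReLU keeps positive values and clips negative ones to zero."""
    return [max(0.0, x) for x in xs]


def max_pool1d(xs, pool_size):
    """Max over consecutive, non-overlapping windows of length pool_size."""
    return [
        max(xs[i:i + pool_size])
        for i in range(0, len(xs) - pool_size + 1, pool_size)
    ]


def softmax(logits):
    """Normalize scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


activated = relu([-1.5, 0.2, 3.0, -0.1])
pooled = max_pool1d(activated, pool_size=1)
probs = softmax(pooled)
```

Under the non-overlapping-window assumption, a pooling size of 1 passes the activation map through unchanged, whereas a larger size keeps only the largest value of each window.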

Malware Detection and Categorization.
The classifier created using the training dataset containing both benign and malicious applications is called the risk classifier; it determines whether an application is benign or malicious. If the application is malicious, it is passed to the category classifier, which is created using the dataset of labelled malicious applications. The category classifier outputs a probability vector over the categories for the unknown application it receives.
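The hand-off between the two classifiers can be sketched as a small function. The classifier callables below are toy stand-ins, and the function name, the return format, and the rule of comparing the two risk probabilities are assumptions made for illustration.

```python
def classify_app(features, risk_classifier, category_classifier):
    """Run the risk classifier first; categorize only malicious applications."""
    p_benign, p_malicious = risk_classifier(features)
    if p_malicious <= p_benign:
        return {"verdict": "benign", "categories": None}
    # Only applications flagged as malicious reach the category classifier.
    return {"verdict": "malicious", "categories": category_classifier(features)}


def toy_risk(features):
    """Hypothetical risk classifier returning (p_benign, p_malicious)."""
    return (0.1, 0.9) if "sendTextMessage" in features else (0.8, 0.2)


def toy_category(features):
    """Hypothetical category classifier returning a probability per category."""
    return {"Gumen": 0.7, "OtherCategory": 0.3}


flagged = classify_app(["sendTextMessage", "getDeviceId"], toy_risk, toy_category)
clean = classify_app(["toString"], toy_risk, toy_category)
```

Keeping the two models separate means the category classifier is trained and evaluated only on malicious samples, as described above.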

Identifying Categories of Malware.
Since there are only 71 categories, the probability results can be ambiguous for malicious applications that require more detailed categories. Therefore, in this work, a malicious application that does not have a classification result higher than 0.5 for any category is called an "Unlabelled application." For such applications, ACAMA provides users with a list of the categories whose probability is close to 0.5, together with the main features (APIs) of the application, so that users can at least understand the possible behaviors of the malware.
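The 0.5 threshold rule can be sketched as follows. How many candidate categories are reported and how "close to 0.5" is defined are not specified in the text, so returning the top three categories by probability is an assumption, as are the function name and the placeholder category names other than Gumen.

```python
def label_or_candidates(probs, threshold=0.5, top_k=3):
    """Return (label, candidates); candidates are filled only when unlabelled."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    best, p_best = ranked[0]
    if p_best > threshold:
        return best, []
    # No category is confident enough: report the most likely ones instead.
    return "Unlabelled", ranked[:top_k]


confident = label_or_candidates({"Gumen": 0.8, "CategoryB": 0.2})
ambiguous = label_or_candidates(
    {"Gumen": 0.45, "CategoryB": 0.40, "CategoryC": 0.10, "CategoryD": 0.05}
)
```

In the ambiguous case no single category exceeds the threshold, so the caller receives a ranked shortlist that can be shown alongside the important APIs.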
To this end, ACAMA uses the LIME algorithm to provide the reasons behind the classification results. First, LIME perturbs the input features to create several similar input values. Then, the important features are determined from the classification results obtained by feeding these similar input values into the category classifier. In this work, 10 important APIs are used to provide a confidence indicator for the results; the goal is to let users know what kinds of actions a malicious application can perform even if it has no label (i.e., the application cannot be categorized with high confidence). Figure 5 shows the process by which LIME extracts important APIs from an application.

Evaluation
In this section, we evaluate the proof-of-concept implementation of ACAMA. In addition, we used 10,000 benign applications randomly selected from the Google Play Store [31] for detecting malware, and 10,000 malicious applications from the AMD [24] were used as the training dataset. Also, we used 10,000 malicious applications collected from VirusShare [32] for evaluating the performance of ACAMA.
If the data is too biased, it can interfere with learning; 20,000 out of 24,090 of the AMD were used, and 71 categories specified in the dataset were used as well. Moreover, as the category classification dataset, 10,133 applications that AVClass cannot classify were used to evaluate ACAMA's effectiveness of classifying malicious applications.

Parameter Setup.
In this paper, we set the parameters that maximize the performance of the CNN classifier based on the loss rate and accuracy by using the training datasets as validation data. By changing from the most basic structure to the most commonly used parameter values, we found well-optimized parameters including embedding dimension, number of filters, filter size, type and size of pooling, and number of convolution layers. The CNN parameters used in the proposed method are shown in Table 3.

Embedding Dimension.
Commonly used embedding dimensions are 50, 64, 100, 150, and 200. The results of the experiment are shown in Figure 6. We tested up to 200 dimensions; since the accuracy of the risk classifier gradually decreases and the loss rate increases from 100 dimensions onward, we decided that there was no need to experiment with larger dimensions. The category classifier showed the highest accuracy and lowest loss rate with 64 dimensions. As the two graphs of Figure 6 show, both the risk classifier and the category classifier achieved the lowest loss rate and the highest accuracy when the embedding dimension was 64, and, thus, we set the API embedding dimension to 64.

The Number of Filters.
In text-CNN, the number of filters is usually specified as a related number such as a factor or multiple of the embedding dimension. Since the embedding dimension is 64, we experimented with 16 and 32, and 100, 150, 200, 250, and 300 to find the approximate range. The experimental results are shown in Figure 7. As the graphs show, we obtained the best results when we used 32 filters. As the number of filters increases, the number of parameters increases, and the efficiency decreases. Therefore, the number of filters was determined to be 32 for both classifiers based on the above results.

Size of Filter.
Once the number of filters was determined, we had to determine the size of the filters. Usually, a smaller filter has fewer parameters than a larger one and performs better. As shown in Figure 8, the experimental results demonstrate that, in both classifiers, we obtained the best results when the filter size is 8. Hence, we set the size of the filters to 8 according to the evaluation results.
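To make the relationship between the filter settings and the number of parameters concrete, the sketch below counts the trainable weights of a single one-dimensional convolution layer using the standard formula. The presence of one bias term per filter is an assumption, and the function name is illustrative.

```python
def conv1d_param_count(num_filters, kernel_size, in_channels, bias=True):
    """Weights per filter are kernel_size * in_channels, plus an optional bias."""
    per_filter = kernel_size * in_channels + (1 if bias else 0)
    return num_filters * per_filter


embedding_dim = 64  # embedding dimension chosen in the paper

chosen = conv1d_param_count(32, 8, embedding_dim)    # 32 filters of size 8
largest = conv1d_param_count(300, 8, embedding_dim)  # largest tested filter count
```

Under these assumptions the chosen configuration has 16,416 convolution parameters, compared with 153,900 for 300 filters of the same size, which illustrates why larger filter counts are less efficient.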

Pooling and the Number of Convolution Layers.
It is well known that text-CNN performs better with max-pooling than with average pooling [33]. In addition, in the case of text-CNN, a max-pooling size of 1 showed the best performance [34]. Therefore, ACAMA also used max-pooling and set the pooling size to 1.
The number of convolution layers is also an important parameter. However, since the data and the proposed method already achieve high verification accuracy and a low verification loss rate with one convolution operation, we judged that adding layers, which would make the computation more complex, was not necessary. Consequently, the number of convolutional layers is set to 1.

Evaluation Results.
In this section, we describe the results of detecting malicious applications and of classifying them into categories. Figure 9 compares the accuracy of ACAMA with that of the malicious application detection system proposed by Kim et al. [11], which is closely related to ACAMA and also uses APIs as a feature. The verification accuracy is the detection accuracy obtained by setting aside 10% of the training dataset and applying the classifier to it to verify its effectiveness. On the other hand, the test data accuracy refers to the accuracy obtained with a new dataset that is not contained in the training dataset.

Malware Detection Result.
In the training verification data, we can observe that the accuracy of ACAMA is similar to that of Kim et al.'s approach. However, ACAMA outperforms the previous approach when we used a new dataset. Table 4 shows the classification results of AVClass using malicious applications in VirusShare. The undetected column of Table 4 indicates the number of applications that VirusTotal did not detect as malicious. The unlabelled column indicates the number of applications that were detected as malware by VirusTotal but could not be classified by AVClass.

Category Classification Result.
However, we applied ACAMA to the 10,133 malicious applications that AVClass could not classify, and the results are shown in Table 5. ACAMA classified 7,342 of these 10,133 applications. The detailed classification results are shown in Table 6.
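The classification rate reported elsewhere in the paper (72.456%) follows directly from these two counts, as the short check below shows. The variable names are illustrative.

```python
classified = 7342   # applications that ACAMA classified
total = 10133       # applications that AVClass could not classify

rate_percent = round(100 * classified / total, 3)
remaining = total - classified  # applications left without a category
```

The remaining 2,791 applications are those for which no category was assigned.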
In addition, we checked the classification results by using LIME. To this end, we extracted the main features of each malware sample and manually verified the 7,342 classification results to check whether they are correct. As a result, by verifying the evaluation results with LIME, we found that our category classifier classified these malicious applications without a misclassified result. Table 7 shows the important APIs that LIME found, which are used to implement the malicious functions of malware in the Gumen category, which behaves similarly to the Trojan-SMS malware family.

Conclusions
In this paper, we proposed ACAMA that identifies malware and classifies malware into specific categories based on behavioral characteristics of malware. We evaluated ACAMA by comparing its performance with a previous approach proposed by Kim et al. [11]. We also evaluated the effectiveness of ACAMA with AVClass. In summary, the evaluation results show that ACAMA outperforms the previous approach proposed by Kim et al. [11]. Also, we observed that ACAMA can classify 72.456% of malware that AVClass cannot classify.
However, ACAMA needs a well-labelled dataset to categorize malware because it uses supervised learning. Also, since ACAMA only uses the Android framework API, we cannot avoid the out-of-vocabulary problem (i.e., if malware is obfuscated or uses APIs that ACAMA did not see in the learning phase, ACAMA cannot classify the malware). We leave these limitations as future work.

Data Availability
The data used to support the findings of this study were supplied by Eunbyeol Ko under license and so cannot be made freely available. Requests for access to these data should be made to Eunbyeol Ko (kongstar159@soongsil.ac.kr).