ImageDroid: Using Deep Learning to Efficiently Detect Android Malware and Automatically Mark Malicious Features

The popularity of the Android platform has led to an explosion in malware. Current research on Android malware mainly focuses on malware detection or malware family classification. These studies need to extract a large number of features, which consumes a lot of manpower and material resources. Moreover, some malware uses obfuscation to prevent decompiler tools from extracting features. To address these problems, we propose ImageDroid, a method based on the image format of Android applications that can not only detect and classify malware without prior knowledge but also detect obfuscated malware. Furthermore, we use the Grad-CAM interpretability mechanism of the deep learning model to automatically mark, in a visual way, the parts of the image that play a key role in determining maliciousness. We evaluate ImageDroid on over 10,000 Android applications. Experimental results show that the accuracy of malware detection and multifamily classification reaches 97.2% and 95.1%, respectively, and the detection accuracy on obfuscated malware reaches 94.6%.


Introduction
With the rapid development and popularization of 5G networks, smartphones have become an integral part of our lives, used for chatting, taking photos, electronic payment, and so on.
According to the Counter Point statistical report [1], the number of smartphones sold in 2021 reached 1.35 billion. Android is the most widely used operating system on smartphones. Because of its openness, Android is also considered the biggest target of malware and is highly vulnerable to network attacks.
When a malware variant employs detection-evasion techniques, it can escape detection even if its malicious functionality is unchanged. Therefore, to adapt to the variety of malicious applications, many malware detection methods rely on feature extraction [2][3][4][5]. Feature extraction usually requires a lot of manual work and material resources and relies heavily on prior knowledge. At present, there are two kinds of malware detection methods: static and dynamic [6,7]. With the static method, if the code is encrypted, the detection efficiency is reduced or the detection model may even become invalid. Although the dynamic method can solve this problem, it must be configured with a specific running environment, which means higher requirements for hardware and detection time. Moreover, the dynamic method suffers from incomplete trigger path coverage [8]. If the malicious action is not triggered, the detection efficiency is reduced. Meanwhile, deep learning models have developed rapidly in image-based recognition [9][10][11], and some have achieved good results in Windows malware detection [12].
In this paper, we propose ImageDroid, an image-based Android malware classification method that directly classifies the maliciousness of a Dex file without decompilation. Different from other methods, after analyzing the structure of the Dex file, we retain only the Data Area, which plays an important role in the semantic logic of the code, convert it into an image, and then apply a deep learning model (Inception-ResNet-v2) for classification. The experimental results verify the effectiveness of the method. After extracting the Data Area of the Dex file, the classification performance improves regardless of whether the target Dex file is obfuscated. On this basis, we use the interpretability mechanism of the deep learning model to mark the part of the image that plays a key role in determining maliciousness.
The remainder of the paper is structured as follows. Section 2 gives the detailed design and implementation of ImageDroid. Section 3 evaluates the effectiveness of ImageDroid. Section 4 reviews the related work. We conclude the paper in Section 5.

ImageDroid Design and Implementation
As obfuscation and other mechanisms are increasingly used for code protection, reverse engineering and static analysis of code become more difficult, and many malicious Android applications also use these mechanisms to evade detection. The purpose of ImageDroid is to determine code maliciousness without decompilation. The main idea is to directly extract the part of the Dex file that represents the semantics of the code and then convert it into an image. Finally, the image is fed into a deep learning model for classification.
The specific implementation of ImageDroid is shown in Figure 1 and mainly includes the following two stages. (1) Malicious Classification. We convert the Data Area into an image and feed it into Inception-ResNet-v2 for malicious classification. (2) Marking Key Parts of the Image. Based on the Grad-CAM mechanism, the weights associated with the malicious class are extracted from the model and saved, and the important part of the image is then calculated from these weights and the image blocks obtained from the deep learning model. In the rest of this section, we describe each stage in detail.
The Dex Header describes the information of the Dex file and the index (offset address) of each area. For example, it contains the length field of the Dex file, the version number, the offset address of the string area, and the number of strings. The size of the whole file header is fixed at 112 bytes. The header length does not change across Android applications; only the values of its fields change. However, these values have little to do with the maliciousness of Android applications. Therefore, we choose to remove the Dex Header.

Observation and Analysis of the Dex File.
The Index Area describes the offset address of each area's specific content in the Dex file. For example, in Figure 3, String_Ids records the offset addresses of all strings; it is not real data but only an index through which the data are located. Because these index values do not represent real data, we remove the Index Area. The Data Area retains not only the structure of the entire APK file but also the real data; that is, all the real data used in the entire Android application are in this area, so we keep the Data Area.
Obfuscation of Android application code usually takes one of three forms: substitution obfuscation, hidden obfuscation, and repeated function definitions. These obfuscations are applied to the Dex file through the HackPoint configuration file. The modified HackPoint is saved to the end of the Dex file, and the Data Area does not change during the Dex obfuscation process. Therefore, using the Data Area of the Dex file allows Android applications to be detected in spite of obfuscation.
Through the above analysis, we find that not every part of the Dex file is useful for Android malware detection and classification. We therefore choose the Data Area extracted from the Dex file as the research object, because it not only contains all the structure information and real data of the APK but also remains unaffected when the Android application is obfuscated.
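The extraction described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: it assumes that the Data Area corresponds to the data section located by the data_size and data_off fields of the standard Dex header (at byte offsets 0x68 and 0x6C), that the file is little-endian, and that the APK contains a single classes.dex. The function names read_classes_dex and extract_data_area are hypothetical.

```python
import struct
import zipfile

DEX_HEADER_SIZE = 0x70   # 112 bytes, fixed by the Dex format
DATA_SIZE_OFFSET = 0x68  # header field data_size (uint32, little-endian)
DATA_OFF_OFFSET = 0x6C   # header field data_off (uint32), directly after data_size


def read_classes_dex(apk):
    """Return the bytes of classes.dex from an APK (a path or a file-like object).

    Assumption: only the primary classes.dex is of interest.
    """
    with zipfile.ZipFile(apk) as zf:
        return zf.read("classes.dex")


def extract_data_area(dex: bytes) -> bytes:
    """Drop the Dex Header and Index Area and return only the data section."""
    if len(dex) < DEX_HEADER_SIZE or dex[:4] != b"dex\n":
        raise ValueError("not a Dex file")
    # data_size and data_off are adjacent, so one unpack reads both fields.
    data_size, data_off = struct.unpack_from("<II", dex, DATA_SIZE_OFFSET)
    if data_off < DEX_HEADER_SIZE or data_off + data_size > len(dex):
        raise ValueError("data section lies outside the file")
    return dex[data_off:data_off + data_size]
```

Reading the offsets from the header, rather than hard-coding where the Index Area ends, keeps the sketch independent of how many strings, types, or methods a given application declares.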

Malicious Classification.
Different from existing studies, we need neither prior knowledge nor feature extraction; we only extract the Data Area from the Dex file of the APK to perform the classification.
The implementation of classification consists of the following three steps, as depicted in Figure 3: (1) extracting the Data Area of the APK (corresponding to ① and ② in Figure 3); (2) converting the Data Area to an RGB image (corresponding to ③ and ④ in Figure 3); (3) implementing the classification of Android applications (corresponding to ⑤ and ⑥ in Figure 3).
(1) Extracting the Data Area of APK. We use the unzip tool to directly obtain the Dex file of the APK, as shown in ① in Figure 3. The Dex file is an executable file of the Android virtual machine and is composed of three parts, as shown in Figure 2. Following the observation of the Dex file structure in Section 2.1, we extract the Data Area, the part most effective for classification, from the Dex file as the object of the next operation, as shown in ② in Figure 3.
(2) Converting Data Area to the RGB Image. The Data Area is in the form of bytecode, and it is far longer than a typical image input. Therefore, we need to process the Data Area to make it consistent with the input of the deep learning model. We convert the bytecode into a multidimensional array by replacing each byte with its numeric value. We choose a three-dimensional array of 900 × 900 × 3, which can accommodate the length of most Android applications. This array can be converted into an RGB image. Each pixel in the image corresponds to three consecutive bytes of the original bytecode. If a single channel were used, the resulting grayscale image would be too large, so we use a three-channel color image to reduce the image size. Because the bytecode length varies, we discard the part of the code that exceeds 900 × 900 × 3 bytes and pad shorter code with zeros up to 900 × 900 × 3 bytes. Our approach retains most of the bytecode sequence, but the original spatial structure may be changed during the transformation. This is a shortcoming of our method. The bytecode of the Data Area is shown in Figure 4. Each square (two hexadecimal digits) represents one byte, that is, one channel value of a pixel; for example, 64 is one such value.
Security and Communication Networks
An RGB image corresponds to a three-dimensional array, so converting the bytecode to an RGB image amounts to building such an array. Figure 5 shows the three-dimensional array obtained from the bytecode in Figure 4. For example, the first triple shown in Figure 5 is [64,65,78], where 64, 65, and 78 are assigned to R, G, and B, respectively.
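The truncate-or-pad conversion can be illustrated with a short, standard-library-only sketch. It is an assumption-laden illustration rather than the authors' code: the names to_rgb_buffer and pixel are hypothetical, and the image is kept as a flat row-major byte buffer instead of a real image object.

```python
SIDE = 900                   # image height and width used in the paper
CAPACITY = SIDE * SIDE * 3   # 2,430,000 bytes fit into one 900 x 900 RGB image


def to_rgb_buffer(data: bytes, side: int = SIDE) -> bytes:
    """Truncate or zero-pad the byte sequence to exactly side * side * 3 bytes."""
    capacity = side * side * 3
    # Slicing discards the excess; ljust appends zero bytes to short inputs.
    return data[:capacity].ljust(capacity, b"\x00")


def pixel(buffer: bytes, side: int, row: int, col: int) -> tuple:
    """Return the (R, G, B) triple stored at (row, col) of a row-major buffer."""
    start = (row * side + col) * 3
    return tuple(buffer[start:start + 3])
```

With a 2 × 2 toy image, `pixel(to_rgb_buffer(bytes([0x64, 0x65, 0x78]), side=2), 2, 0, 0)` yields the bytes 64, 65, 78 (hexadecimal) as the R, G, and B values of the first pixel, and the remaining pixels are zero padding. In practice the buffer would typically be reshaped into a 900 × 900 × 3 array by a numerical library before being passed to the model.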
Because malicious applications of the same family share the same malicious behavior and their code is highly similar, their images are also very similar. As an example, Figure 6 shows the images of four malicious applications of the AnserverBot family.
(3) Implementing Classification of Android Applications. To classify Android applications, in this section we select the deep learning model and implement detection or classification. The specific implementation details are given in the following sections. To obtain the best deep learning structure, we selected the model based on the three requirements stated for the model. We tried several classic image classification models, such as VGGNet [13], GoogLeNet [14], and ResNet [15]. After several years of development, these models have been proven to generalize well. These models are described in detail below.
VGGNet uses small convolution kernels throughout. The first several layers of the model are a stack of convolution layers, and the last several layers are fully connected layers (FCL) and a softmax layer. All hidden layers use the ReLU activation function. The model uses several smaller convolution kernels instead of large convolution kernels to reduce the number of parameters and to introduce more nonlinearity, which increases the fitting capability of the network.
GoogLeNet is named in homage to LeNet. At present, there are mainly four versions of Inception (v1 to v4). Each version improves slightly on the previous one and achieves better image classification results. In this series of network structures, convolution kernels of different sizes are used to obtain receptive fields of different sizes. These features of different sizes are then fused to extract better features. In addition, Inception-v2 [16] proposes batch normalization to reduce the variation of the internal data distribution of the neurons. This setting normalizes the output of each layer to an N(0,1) distribution, thus increasing the robustness of the model. It also allows training with a larger learning rate, converges faster, and is less sensitive to weight initialization. In addition, the model uses two 3 × 3 convolution kernels instead of one 5 × 5 convolution kernel to make the network deeper. After that, Inception-v4 and Inception-ResNet use residual connections to improve the previous network structure.
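The benefit of replacing a large kernel with stacked small ones, used by both VGGNet and the later Inception versions, can be checked with simple arithmetic. The sketch below assumes stride-1 convolutions with the same number of input and output channels and ignores biases; the function names are hypothetical.

```python
def conv_weights(kernel: int, channels: int, layers: int = 1) -> int:
    """Weights in `layers` stacked kernel x kernel convolutions (biases ignored),
    each mapping `channels` input channels to `channels` output channels."""
    return layers * kernel * kernel * channels * channels


def receptive_field(kernel: int, layers: int) -> int:
    """Receptive field of `layers` stacked stride-1 convolutions."""
    return 1 + layers * (kernel - 1)


# Two stacked 3 x 3 layers see the same 5 x 5 window as one 5 x 5 layer ...
same_window = receptive_field(3, 2) == receptive_field(5, 1)
# ... but need only 18/25 of the weights, and add one extra nonlinearity.
ratio = conv_weights(3, 64, layers=2) / conv_weights(5, 64)
```

For 64 channels the stacked variant uses 73,728 weights against 102,400 for the single 5 × 5 layer, a ratio of 0.72, while the network becomes one layer deeper.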
Based on the experimental comparison of the above models (see Section 3.2) and on our requirements, we finally chose Inception-ResNet-v2.

The Implementation of Classification or Detection.
To implement detection or classification of Android apps using images, we first convert each APK into a 900 × 900 × 3 RGB image. For example, a dataset of N samples yields an image input of N × 900 × 900 × 3. We feed these data directly into the Inception-ResNet-v2 model for training, as shown in ⑤ in Figure 3. The trained model is then used to detect and classify Android applications.

Marking of Key Parts of the Image.
To explain the neural network features, we used the Grad-CAM method [17], which achieves good results in the interpretation of image classification. The implementation framework of the Grad-CAM method is shown in Figure 7.
Grad-CAM is an extended version of CAM and is commonly used in image classification. The goal of Grad-CAM is to obtain the heatmap for the images.
In particular, the heatmap is the contribution score of every pixel of the image. Grad-CAM assumes that the feature maps generated by the last convolutional layer hold the valuable information of the input data and that the final decision of the model is made on them. The influence of each feature map on the decision differs, however. To reflect this difference, Grad-CAM computes the importance score of each feature map by multiplying the feature map by its corresponding importance weight. It then sums the importance scores to obtain the contribution of the feature maps to the classification result. More specifically, we denote one of the feature maps from the last convolutional layer as $A^{m}$ and the classification result for class $C$ as $L^{C}$. We can calculate the importance weight of $A^{m}$ as
$$\alpha_{m}^{C} = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial L^{C}}{\partial A_{ij}^{m}},$$
where $\alpha_{m}^{C}$ is a constant that represents the importance weight, and $Z$ is the number of elements in $A^{m}$. Assuming that we have $M$ feature maps, the contribution scores of the input data can be calculated using a weighted combination of the feature maps:
$$\mathrm{Score}^{C} = \mathrm{ReLU}\left( \sum_{m=1}^{M} \alpha_{m}^{C} A^{m} \right),$$
where the ReLU is applied to preserve only the features that have a positive influence on the classification result of $C$. Note that $\mathrm{Score}^{C}$ has the spatial size of the feature maps and is therefore smaller than the input data.
(2) The Implementation of Marking Malicious Image.
To implement Grad-CAM, we use global average pooling (GAP) to calculate the weighted class activation map.
As shown in Figure 7, since the model only performs maliciousness detection, we need only two labels: normal and malicious. From the output of the model, we obtain the weights of all features judged as malicious (the red square in Figure 7 denotes maliciousness). Each feature is then multiplied by its corresponding weight to form a heatmap. The brighter a region of the heatmap, the more important that region is for the malicious judgment. The malicious part of the image is thus explained by the heatmap.
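The weighting step of Grad-CAM can be made concrete with a small numeric sketch. The code below is an illustration with toy numbers, not the authors' implementation: the function name grad_cam is hypothetical, the feature maps and gradients are plain nested lists, and in a real pipeline the gradients of the malicious-class score would be supplied by the deep learning framework's automatic differentiation.

```python
def grad_cam(feature_maps, gradients):
    """Combine feature maps into a heatmap.

    feature_maps[m][i][j] is the value of feature map m at position (i, j).
    gradients[m][i][j] is the gradient of the class score with respect to it.
    Returns the ReLU of the weighted sum of the feature maps as a 2-D list.
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    # Importance weight of map m: global average pooling of its gradients.
    alphas = [sum(map(sum, grad)) / (height * width) for grad in gradients]
    heatmap = [[0.0] * width for _ in range(height)]
    for alpha, fmap in zip(alphas, feature_maps):
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += alpha * fmap[i][j]
    # ReLU keeps only positions that push the class score upward.
    return [[max(0.0, value) for value in row] for row in heatmap]


maps = [[[1, 2], [3, 4]], [[1, 0], [0, 1]]]
grads = [[[1, 1], [1, 1]], [[-4, -4], [-4, -4]]]
heat = grad_cam(maps, grads)  # weights are 1.0 and -4.0
```

In this toy case the second map has a negative weight, so positions where it is active are suppressed, and the ReLU clips the resulting negative entry to zero. The heatmap has the resolution of the feature maps and would still need to be resized to 900 × 900 before being overlaid on the input image.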

Evaluation
In this section, we first introduce the datasets and the experimental environment used in the verification process. Then the feasibility of ImageDroid and its effectiveness in classifying malicious applications are verified. The details are as follows.

Experimental Datasets and Environment.
During the experiment, we evaluate ImageDroid using six datasets. The samples in all these datasets are labeled not only as normal or malicious but also with family labels. The details of these datasets are shown in Table 1. Our experiment is based on the TensorFlow framework, and our model is trained on 4 NVIDIA Titan Xp GPUs. In this experiment, we use four indicators to evaluate the performance of the model: accuracy, precision, recall, and F1-score.
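For reference, the four indicators follow directly from the confusion counts of a binary detector. The sketch below uses their standard definitions with "malicious" as the positive class; the function name binary_metrics and the example counts are hypothetical and are not taken from the paper's experiments.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1-score from confusion counts.

    tp/fn: malicious samples detected/missed; fp/tn: benign samples flagged/passed.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts: 120 malicious and 80 benign samples.
example = binary_metrics(tp=90, fp=10, fn=30, tn=70)
```

With these made-up counts the detector reaches an accuracy of 0.80, a precision of 0.90, and a recall of 0.75, which shows why the indicators are reported together: a single number can hide missed malware or false alarms.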

Verify the Performance of Deep Learning Model.
To mark the important parts of images in malicious detection, we first need a good deep neural network model for image classification. Therefore, we verify the effect of different models on image-based malicious detection using dataset 1. As shown in Figure 8, we use four typical neural network structures for validation. As can be seen from the evaluation metrics, Inception-ResNet-v2 performs the best. Therefore, the deep learning model selected for ImageDroid is Inception-ResNet-v2.
To verify whether the Inception-ResNet-v2 model converges during training, we present the ROC curve of the model after multiple iterations, as shown in Figure 9. As the number of iterations increases, the curve moves closer to the upper left corner, and the AUC (area under curve) value also increases. These data indicate that the model converges as the number of iterations grows.
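The AUC value mentioned above has a simple interpretation that can be computed without plotting the curve: it is the probability that a randomly chosen malicious sample receives a higher score than a randomly chosen benign one. The sketch below uses this pairwise definition; the function name roc_auc and the example scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (malicious, benign) pairs ranked correctly.

    labels: 1 for malicious, 0 for benign; scores: model output per sample.
    Ties between a malicious and a benign score count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In the four-sample example, three of the four malicious/benign pairs are ordered correctly, so the AUC is 0.75. The pairwise loop is quadratic in the number of samples, which is fine for illustration but would be replaced by a rank-based computation on large datasets.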

Verify the Validity of the Extracted Data Area.
From the classification evaluation indicators in Table 2, we find that extracting only the Data Area from the Dex file improves the classification results to a certain extent.

Verification of the Effect of Malware Family Classification.
Because Android malicious applications of the same family are highly similar after conversion to images, we classify Android malware families based on the Data Area converted to images. Next, we validate malware family classification on dataset 1.
During malware family classification, we find that the number of samples differs so much between families that it biases the detection results. For example, some malicious application families have only 2 samples. To address this issue, we select the top 20 families from the Drebin dataset for validation. Each family name and the corresponding number of samples are shown in Table 3. The classification results of ImageDroid for each family are shown in Table 4.
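Selecting the largest families is a small filtering step that can be sketched as follows. The function name top_families and the family labels in the example are hypothetical; the sketch only assumes that each sample carries a family label, as stated for the datasets.

```python
from collections import Counter


def top_families(samples, k=20):
    """Keep only samples whose family is among the k most frequent families.

    samples: iterable of (sample_id, family) pairs.
    """
    samples = list(samples)
    counts = Counter(family for _, family in samples)
    keep = {family for family, _ in counts.most_common(k)}
    return [(sid, family) for sid, family in samples if family in keep]


labeled = [("s1", "FamilyA"), ("s2", "FamilyA"), ("s3", "FamilyA"),
           ("s4", "FamilyB"), ("s5", "FamilyB"), ("s6", "FamilyC")]
kept = top_families(labeled, k=2)
```

Here the single-sample family is dropped and the five samples of the two largest families remain. Filtering by rank rather than by a fixed minimum count mirrors the "top 20 families" choice described above.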
As can be seen from Table 4, ImageDroid classifies the families in the dataset well, with an average recall of 96.7%. Among them, ten families are classified nearly perfectly, with recall reaching 99%. The training process of the deep learning model depends on the number of samples; with the families selected above, the experimental data show that our method is also effective in malware family classification.
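Per-family recall and its average can be computed as sketched below. This is an illustration only: the function names are hypothetical, the labels are toy data, and the unweighted (macro) average is an assumption, since the text does not state whether the reported average is weighted by family size.

```python
from collections import defaultdict


def per_family_recall(true_labels, predicted_labels):
    """Recall of each family: correctly predicted samples / samples of that family."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, predicted in zip(true_labels, predicted_labels):
        total[truth] += 1
        correct[truth] += truth == predicted
    return {family: correct[family] / total[family] for family in total}


def macro_recall(recalls):
    """Unweighted mean over families, so small families count as much as large ones."""
    return sum(recalls.values()) / len(recalls)


truth = ["A", "A", "A", "A", "B", "B"]
predicted = ["A", "A", "A", "B", "B", "B"]
recalls = per_family_recall(truth, predicted)
```

In the toy data family A has a recall of 0.75 and family B a recall of 1.0, giving a macro average of 0.875.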

Validation of Detection Validity for Obfuscated Applications.
In this section, we evaluate the robustness of ImageDroid in detecting obfuscated malicious applications. Obfuscation technology on the Android platform is very mature, and many obfuscation frameworks are available. We choose dataset 3, from Android PRAGuard [19], for validation. Five obfuscation techniques are used in dataset 3 (the details are shown in Table 5).
We use ImageDroid for detection on the obfuscated dataset. For comparison, we run both the variant that uses the entire Dex file and the ImageDroid method (which uses only part of the Dex file).
The detection results in Table 5 show that ImageDroid still achieves a good detection rate on malware obfuscated in different ways. This is exactly the benefit of using part of the Dex file, instead of decompilation, to obtain features. In this way, we address the low detection rate of static features on obfuscated malicious applications and make it possible to detect such applications without running them dynamically.

Related Work
Malware detection has always been a focus of Android research. In view of the different current research methods, we divide them into two categories: detection with decompilation and detection without decompilation.
Detection with decompilation. This type of detection method decompiles the Android application and then extracts different features from the decompiled files for malicious detection. Its disadvantage is that the Android application must be decompilable. Liu et al. [20] proposed a malicious application detection method for Android based on a multilevel signature matching algorithm. With this method, the APIs, methods, classes, and the APK itself are signed separately for each APK. Identical signatures are then found by the matching algorithm to detect malicious applications. Arp et al. [4] proposed Drebin, which performs extensive static analysis and collects as many application features as possible, such as permissions, API calls, and strings in the Dalvik code. These features are then embedded into a joint vector space for Android malware analysis. Zhang et al. [21] proposed DroidSIFT, which constructs a weighted contextual API dependency graph database and generates graph-based feature vectors through graph similarity queries. Fan et al. [22] proposed the FalDroid method, which constructs a frequent subgraph database from the call relationships of the function call graph and classifies malicious applications by the frequent subgraphs that characterize their maliciousness. Liu et al. [8] proposed using neighbor signatures to classify Android malware families. Based on neighborhood signature to acquire
Detection without decompilation. With the development of encryption technology, some Android applications cannot be decompiled. For this case, there are methods that detect malice without decompilation by using DEX files directly. Ni et al. [23] transformed the operation codes in disassembled malware code into grayscale images and then classified malware families on the Windows system through a convolutional neural network. Han et al. [24] proposed converting DEX into an image and then extracting the entropy graph from the image as a feature for malicious detection. Bakour et al. [9] extracted local and global features from images converted from DEX files. Multiple local feature descriptors are then extracted from each image to form a feature vector, which is used for malicious detection. Mercaldo et al. [25] used the GIST method to generate a set of features from the image corresponding to each application to detect malicious applications and classify malware families.

Conclusion
In this paper, we propose a method for malicious detection and multifamily classification that requires neither decompiling applications nor prior knowledge. The results show that our method can not only effectively detect malicious Android applications but also classify multiple families. Based on this method, we annotate the classification results with the interpretability mechanism of the deep learning model. This not only provides a good solution for malicious detection of Android applications that cannot be decompiled but also enables further fine-grained analysis by locating the part of the image that is important for determining maliciousness.
Dex File. Since the extraction of the Data Area from Dex files is the key to realizing ImageDroid, in this section we analyze the Dex structure and explain why the Data Area is chosen to represent Dex files. After unzipping the APK file, we can directly obtain the Dex file. The Dex file format is a compressed format designed for Dalvik that stores data in bytecode. The structure of the Dex file is shown in Figure 2. It is composed of three parts: Dex Header, Index Area, and Data Area. In the end, we extract only the Data Area from the Dex file as a representation of the APK. Specific observations and analyses are given below.

Figure 6: Four malicious applications of the AnserverBot family.
Extracted Data Area. To verify the effectiveness of the extraction that ImageDroid applies to Dex files, we use Dataset 1 and Dataset 4 with the Inception-ResNet-v2 model to measure its effect on classification. We refer to the method that converts the entire Dex file to an image for classification as AllImageDroid. The effectiveness of the extraction is verified by comparing the classification results of AllImageDroid and ImageDroid. After the data produced by these two methods are fed into the deep learning model for training, the resulting evaluation indicators are shown in Table 2.

3.6. An Example of Image-Based Feature Marking. We use Grad-CAM to visualize the parts of the image that are important for maliciousness determination. In this section, we demonstrate two APKs from the Geinimi family of the Genome Project. The two hash values are f3736147f7d46c5d96f8ae9f89bfd1f694bb871a and bc3790cdc8ae0ee7da7d6e3fd397d2a720e00e67. These two APKs are used to generate a malicious heatmap based on Grad-CAM. As shown in the heatmap in Figure 10, different colors indicate different importance in determining maliciousness. Orange indicates the most malicious part, and blue indicates the least malicious part. Both malware images are marked in orange around position 400, which corresponds to a part of the image that is important for judging maliciousness.

2.2.1. Selecting the Deep Learning Model. When designing the deep learning model, we need to consider the suitability of different models. According to the needs of our method, we put forward the following three requirements for the deep learning model. (1) The model should be able to handle long inputs, corresponding to our high-resolution images of malicious applications. (2) The model should take both local and global features into account, so that it can still achieve high detection accuracy when the malicious code is separated or discontinuous in the underlying bytecode. (3) We also want to obtain, through the model, the important features learned by the network, which can help the subsequent analysis of Android malicious applications.

Table 1: Datasets used in our experiment.

Table 2: Comparison of classification effects using different regions of Dex.

Table 3: Top malware families used in Drebin.

Table 5: Classification results per malware family.