Learning-Based Detection for Malicious Android Application Using Code Vectorization



Introduction
Among all smartphone operating systems, Android holds over 85% of the market share. In addition, Android-powered devices such as cars, fridges, televisions, point of sale (POS) terminals, and ATM booths are expected to flood the consumer market within a few years. Owing to the popularity of the Android ecosystem, malware writers are specifically targeting Android devices, and the number of malicious Android apps has surged. Android implements a number of security mechanisms, such as the permission mechanism, to protect device resources. However, the permission mechanism of Android is coarse-grained, and users are usually unaware of the permissions being requested. Attacks that bypass the permission mechanism have also been proposed [1][2][3]. As a result, effective malware detection is very important for mitigating security threats in the Android ecosystem. New malicious Android applications continue to emerge, so innovation in Android source code security detection deserves close attention [4,5]. Static analysis and dynamic system-level behavior analysis are the common methods used to detect malicious apps.
Static analysis uses reverse-engineering techniques to analyze the source code of an Android application. It relies on semantic signatures and focuses on analyzing code snippets without executing them [6,7]. It extracts static features from the apps, including string constants [8] and URL addresses in the source code [9,10], function names of all components in the source code [11], and any other static information, to determine whether an app exhibits malicious behavior. These methods are more resilient to changes in malware code. Dynamic analysis uses the behavior of the software while it is running [12,13]. Dynamic analysis is more effective because it can extract features that represent unique execution modes, such as system function call sequences [14], function call frequency [15], and function call combinations [16,17]. Interestingly, over 98% of new malware samples have been reported to be variants of an existing malware family. Google uses a dynamic analysis system called Google Bouncer to analyze APKs submitted to the Google Play Store. Unfortunately, dynamic analysis techniques that execute Android apps inside an emulator suffer from the fact that malware writers can detect emulators and thus evade detection. APIs and permissions are usually the focus of APK analysis because they involve a great deal of information related to user security, such as user passwords, geographic location, and browsing information. Many applications request more permissions than they really need, which exposes users to more permission warnings and increases the risk of permissions being exploited [18][19][20]. In most existing studies, specific API calls [21][22][23] and specific API call sequences [24] are commonly used as dynamic features. To offset the shortcomings of static and dynamic analysis, hybrid methods that combine the two have emerged [25][26][27].
In recent years, deep neural networks, with their strong performance and ease of implementation, have been adopted in various fields, including malware detection. However, the related research above shows that network structures for Android malicious application detection have not been studied in depth. Most scholars simply apply an existing network structure, and such a direct application does not exploit the full potential of the selected structure. Therefore, to address these shortcomings of existing malicious APK detection algorithms, this paper proposes an improved algorithm based on category discrimination that improves the accuracy and recall of malicious APK classification.
In summary, this paper makes the following contributions:
(1) This paper designs a feature data set based on the characteristics of the decompiled code of Android software and also designs a neural network to extract code features.
(2) This paper proposes DTCNN (Deep Text Convolutional Neural Network), a deep convolutional model with multiple receptive fields for Android malicious app detection. This model uses convolution kernels of several sizes to capture both local features and short-distance co-occurrence features. The depth and scope of information acquisition are greatly improved, and information at different granularities can be used to make decisions.
(3) This paper develops the DTCNN-LSTM hybrid model, an approach that extracts the deep features of the program through convolution kernels of multiple sizes and emphasizes the security semantics and the logical order of the code. This method greatly reduces the heavy feature-engineering workload required by traditional methods.
The rest of this paper is organized as follows: Section 2 briefly reviews existing studies that use deep learning models and analyzes their shortcomings. In Section 3, the static and timing features in the core code of the Android application are extracted and vectorized by decompiling APK files. In Section 4, an improved detection model is proposed. In Section 5, the proposed model is evaluated through experiments and result analysis. Finally, Section 6 concludes the paper.

Related Work
In recent years, deep learning and machine learning have become popular branches of data science. Deep learning is currently recognized in the field of machine learning as a very effective approach to classification and detection problems. It is widely used in the detection, recognition, and classification of text, images, and other objects, and it can also be applied to malicious Android application detection.
Zhou et al. [28] used convolutional neural networks (CNN) with a convolution window of a specific size to extract feature information from the sequence of word vectors in a text. A convolutional neural network is effective at extracting local feature information, but it tends to ignore text context information. Pascanu et al. [29] used recurrent neural networks (RNN) to process the input API sequence and added a max pooling layer to extract fixed-length feature sequences from variable-length input sequences. The structure of the RNN makes it best suited to short text objects with a logical order, whereas malicious Android applications have long API sequences. Kolosnjaji et al. [30] also used convolutional networks and long short-term memory networks to analyze call sequences. However, they only model and classify malicious call sequences, so the feature information is not comprehensive enough. Caviglione et al. [31] successfully detected the data communication of suspicious software through deep learning algorithms, but detection based on communication information alone has a higher probability of misjudgment. The above research on malicious Android applications using various deep learning and machine learning methods shows that detection models are developing toward the integration of multiple technologies. Neural networks such as CNN, RNN, and DBN not only play an important role in natural language processing [32], audio analysis [33], and image detection [34] but also have great potential in Android application detection. The program itself is a very long sequence with only one-dimensional semantic dependence. The extracted operation instruction sequence of an Android application can be regarded as a text over a fixed dictionary, and this text has very strong local relevance, which is formed by the stacking of operation instructions.
This paper optimizes the model construction, the design of the hidden layers, the input data structure, and other aspects to significantly improve the overall performance of the model.
To address the shortcomings of the above research, this paper proposes a novel convolutional network structure, DTCNN (Deep Text Convolutional Neural Network), for code vector classification and establishes a hybrid model combining DTCNN and LSTM (Long Short-Term Memory). Longitudinal experiments are used to verify the superiority of the DTCNN model in detecting malicious Android applications.

Code Vectorization for the Deep Learning Model
The first step is to extract a vectorized description of the massive sample data as the input of the model. In this paper, the significant permissions are extracted from the apps, and the extracted information is used to detect malware effectively with deep learning algorithms. The design objective of selecting significant permissions is to detect malware efficiently and accurately. As stated earlier, the number of newly introduced malware samples is growing at an alarming rate. Thus, being able to detect malware efficiently would allow analysts to be much more productive in identifying and analyzing it. The second step is to design the network model. Building on studies of Android malware detection based on deep neural networks, this paper designs an end-to-end information processing model. The disadvantage of a shallow network model is that it has difficulty learning the logic information of the Android application code completely, so the DTCNN, which has excellent feature learning ability, is used to extract high-level abstract features from the code vectors. Although the CNN used by Zhou et al. [28] has a powerful local feature mining capability, malicious Android applications have a complex and special structure. The feature information contained in the core code is discontinuous and contains a lot of redundant information, and it is difficult for a CNN to extract the timing information in the code. Therefore, to improve how well the features express the semantic information of the code and to increase the model's ability to learn and understand the security semantics and the semantic order of the code, an LSTM is added to the deep learning network structure in this paper. LSTM is an improved recurrent neural network that overcomes the inability of the standard recurrent neural network to retain long-term information over long texts.
Therefore, to obtain the global and temporal features of the samples, two feature sets are constructed as the inputs of the two structures in the DTCNN-LSTM model, which further strengthens the feature dimension.
In the third step, the designed network is compared with shallow networks to verify the superiority of the proposed method. The detection process is shown in Figure 1.

Data Set.
When training the model, positive and negative samples should be collected as learning data for the DTCNN-LSTM model to prevent the overfitting problems caused by a small data set. The model proposed in this paper uses the static code of the Android application as the source of sample features. For deep learning, the static code is a richer and more comprehensive source of security semantic information than the dynamic files generated by running the application in a sandbox.
Android applications are developed in the Java language, and their code is executed by the Dalvik virtual machine. Smali is the disassembled form of the core code executed inside the Dalvik virtual machine, and it has its own syntax. Analyzing the program's operating principles and process reveals sensitive information such as the logical structure, the function usage, and the permission calls of the application. Therefore, when the source code of the Smali files is used to represent the semantic features of the Android application, the code logic in the APK becomes more interpretable and can provide the features required for deep learning. Malicious applications of the same type usually have similar malicious behaviors. For example, a privacy-stealing malicious application will collect user privacy data beyond normal requirements, such as search history, GPS location, mail, photo albums, and account passwords, and send it to a malicious terminal through WiFi or SMS without the user's authorization. When this malicious behavior occurs, it triggers sensitive API function calls, underlying virtual machine instruction calls, and so on. Therefore, it is necessary to use deep learning to mine the associations between these features in order to detect unknown malicious applications.
This experiment uses the APK decompiling tool APKTool to process the sample Android applications and obtain the Smali files. A Python script executes the required shell commands so that the APK files are decompiled into Smali files automatically. Malicious Android applications may obtain permissions by calling sensitive API functions in order to implement remote code execution and steal the user's privacy data. For example, the function "getCellLocation()" can obtain the user's geographic location, and the function "getRunningAppProcesses()" can obtain information about the software running on the user's device. Therefore, the calling sequence of the sensitive API functions has practical significance as a model input feature. An exhaustive search of the program source code extracts the call flow of the sensitive API functions, focusing on code statements that use invoke-direct, invoke-virtual, invoke-static, invoke-super, and invoke-interface. After the redundant code is filtered out, the length of the code sequence is unified to 500, and this sequence forms the first part of the static features. The Smali files also contain a large amount of other disassembled code. In addition to the above API call instructions, representative Dalvik instructions are collected, such as jump instructions, data operation instructions, return instructions, and ten other functional categories of instructions. They are taken as the second part of the static features. To reduce information redundancy, the data is filtered while the features are retained, and each Dalvik instruction is described using a custom mapping method. A simplified comparison table of some of the Dalvik instructions is shown in Table 1.
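As a rough illustration of the filtering step, the sketch below pulls the ordered API call sequence out of Smali text by matching the five invoke instructions named above. The regular expression, the function name `extract_api_calls`, and the two-entry `SENSITIVE_METHODS` watch list are assumptions made for this example; the paper does not list its sensitive APIs or publish its script.

```python
import re

# Matches Smali invoke instructions such as
#   invoke-virtual {v0}, Lpkg/Class;->method(args)ret
# The optional "/range" suffix is part of the Dalvik instruction set.
INVOKE_RE = re.compile(
    r"^\s*(invoke-(?:direct|virtual|static|super|interface)(?:/range)?)"
    r"\s+\{[^}]*\},\s*(L[^;]+;)->([\w$<>]+)\("
)

# Hypothetical watch list; only these two names appear in the paper.
SENSITIVE_METHODS = {"getCellLocation", "getRunningAppProcesses"}


def extract_api_calls(smali_text, sensitive_only=False):
    """Return the ordered list of 'Lclass;->method' calls found in Smali text."""
    calls = []
    for line in smali_text.splitlines():
        match = INVOKE_RE.match(line)
        if not match:
            continue
        _opcode, cls, method = match.groups()
        if sensitive_only and method not in SENSITIVE_METHODS:
            continue
        calls.append(f"{cls}->{method}")
    return calls


sample = """
    const/4 v1, 0x0
    invoke-virtual {v0}, Landroid/telephony/TelephonyManager;->getCellLocation()Landroid/telephony/CellLocation;
    move-result-object v2
    invoke-static {v2}, Ljava/lang/String;->valueOf(Ljava/lang/Object;)Ljava/lang/String;
    invoke-direct {p0}, Ljava/lang/Object;-><init>()V
"""
print(extract_api_calls(sample))
print(extract_api_calls(sample, sensitive_only=True))
```

Keeping the calls in file order preserves the sequence information that the LSTM branch is meant to exploit.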
In the experiment, the length of the input sequence of the feature extraction network is 10000. To ensure that every input matrix has the same format, the samples are zero-padded. In summary, the input of this experiment is divided into two parts: Group A is a sensitive API function call sequence of length 500, and Group B is a Dalvik instruction sequence of length 10000. Group A and Group B are used as the inputs of the LSTM module and the DTCNN module, respectively.
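The fixed input lengths can be illustrated with a small helper. This is a minimal sketch: the paper only states that samples are zero-padded, so truncating over-long sequences and padding at the tail are assumptions, and the integer IDs below are made up for the example.

```python
def pad_or_truncate(sequence, length, pad_value=0):
    """Return exactly `length` items: cut long inputs, zero-pad short ones."""
    clipped = list(sequence[:length])
    return clipped + [pad_value] * (length - len(clipped))


GROUP_A_LEN = 500      # sensitive API call sequence (LSTM input)
GROUP_B_LEN = 10000    # Dalvik instruction sequence (DTCNN input)

api_ids = [17, 4, 250]               # hypothetical token IDs, shorter than 500
opcode_ids = list(range(1, 12001))   # hypothetical token IDs, longer than 10000

group_a = pad_or_truncate(api_ids, GROUP_A_LEN)
group_b = pad_or_truncate(opcode_ids, GROUP_B_LEN)
print(len(group_a), len(group_b))
```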

Code Vectorization.
A deep neural network performs a large number of computations and data mapping operations, and only a numeric vector matrix can serve as a reasonable network input. The frequency of API calls in the Smali files and the number of other Dalvik instructions are relatively large. If all code information were encoded directly and converted into vectors, it would cause the "curse of dimensionality," and it would be difficult to express the security semantic information of the sample and the logical order of the code. To solve the problem of data dimension and enable the LSTM network to make better use of the security semantics and the logical order of the code, this paper converts the source text into low-dimensional vectors. The word2vec tool is a word vector training tool released by Mikolov et al. in 2013. Since its release, it has become an important text vectorization tool in the field of natural language processing. Compared with traditional text representations, word2vec places synonyms closer together in the vector space. The word2vec tool has two calculation modes, CBOW and skip-gram, which describe the relationship between words from different angles. This paper uses the skip-gram method, which offers short training time and high accuracy. The model structure is shown in Figure 2. This method infers the context of the target word within a sliding window. The corpus of all data sets is used in the experiment to train the word2vec model.
Given a word sequence $w_1, w_2, \ldots, w_T$, the skip-gram model maximizes the following objective:

$$\frac{1}{T}\sum_{t=1}^{T}\ \sum_{-c \le i \le c,\ i \ne 0} \log p(w_{t+i} \mid w_t), \tag{1}$$

where $c$ is the number of words in the context. As the value of $c$ increases, more of the semantic relationships of the code are captured in the high-dimensional space, and the training time also increases. $p(w_{t+i} \mid w_t)$ is defined using the Softmax function:

$$p(w_{t+i} \mid w_t) = \frac{\exp\left({v'_{w_{t+i}}}^{\top} v_{w_t}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_t}\right)}, \tag{2}$$

where $v'_w$ and $v_w$ represent the output and input vector descriptions of $w$, respectively, and $W$ is the dictionary size. In the experiment, the word vectors are trained with the Gensim library, and the output code vector dimension is 100.
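To make the skip-gram formulation concrete, the sketch below enumerates the (target, context) pairs inside a window of size c and evaluates the Softmax probability of one context token. The function names, the toy opcode sequence, and the 2-dimensional vectors are illustrative assumptions. Libraries such as Gensim normally replace the full Softmax with negative sampling or hierarchical Softmax for speed, so this is not the training procedure used in the experiment.

```python
import math


def skipgram_pairs(tokens, c):
    """Return (target, context) pairs for every offset i with 0 < |i| <= c."""
    pairs = []
    for t, target in enumerate(tokens):
        for i in range(-c, c + 1):
            if i != 0 and 0 <= t + i < len(tokens):
                pairs.append((target, tokens[t + i]))
    return pairs


def softmax_prob(v_in, out_vectors, context_index):
    """p(context | target) with a full Softmax over the whole dictionary."""
    scores = [sum(a * b for a, b in zip(v_out, v_in)) for v_out in out_vectors]
    top = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - top) for s in scores]
    return exps[context_index] / sum(exps)


tokens = ["invoke-virtual", "move-result", "if-eqz", "return"]
pairs = skipgram_pairs(tokens, c=1)
print(pairs)

# Toy dictionary of 3 tokens with 2-dimensional output vectors v'_w.
out_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v_target = [0.5, -0.5]                      # input vector v_w of the target
probs = [softmax_prob(v_target, out_vectors, j) for j in range(3)]
print(probs)
```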

Deep Learning Model
Based on the structural characteristics of the code vector, this paper proposes a novel deep convolutional network model called DTCNN. It is fused with an improved recurrent neural network to obtain the DTCNN-LSTM model, which extracts the deep information of the input code vector. This paper uses the cleaned decompiled code of Android applications as the training data, which is highly similar to the text data in the field of natural language processing: both have logical and local characteristics. Research results in natural language processing indicate that fusion models of convolutional and recurrent neural networks achieve good experimental results. At the same time, the code data of the different modules in this experimental data set differs greatly. If a traditional single model were adopted, the characteristics of some dimensions would be ignored. For this kind of data, a fusion model that accounts for timing while capturing local features will obtain better inference results. Related code in the code sequence data set may be far apart, and the characteristics of the convolutional network are needed to extract it. However, the convolutional neural networks currently used for text classification are mostly shallow networks. To overcome the weakness of shallow networks in learning the security semantic information of Android malware, the DTCNN deep convolutional network is used to extract the sample information. The extracted features characterize the local correlation of the sample data more fundamentally.
Although the LSTM model can overcome the long-term memory problem of the RNN and can process sequence data with a long timeline, its performance decreases when the text length exceeds a certain threshold. Because the API call flow and the other Dalvik instruction data in the Smali files used in this study are large, it is difficult to extract long-distance associations using the LSTM alone. Therefore, this study extracts the deep local features through deep convolutional networks. At the same time, to make up for the shortcomings of convolutional networks in understanding the logical relationships of the code security semantics and to extract the logical information and the temporal order of the application, a long short-term memory (LSTM) layer is added to the model. The feature vectors from the two parts are fused and then classified.

DTCNN.
Inspired by the TextCNN model for text classification, a deep convolutional network, DTCNN, that extracts local code information is proposed. Figure 3 is a schematic diagram of the DTCNN structure. The Group B feature vector set obtained by data processing is used as the input of the DTCNN structure to perform deep local feature extraction.
In a convolutional network, let $x_i \in \mathbb{R}^k$ be the $k$-dimensional word vector of the $i$-th word in a code sentence of length $n$. The convolution operation applies a convolution kernel $w \in \mathbb{R}^{hk}$ to a window of $h$ words to generate a new feature. For example, a new feature $c_i$ is generated from the window $x_{i:i+h-1}$ by

$$c_i = f\left(w \cdot x_{i:i+h-1} + b\right), \tag{3}$$

where $b \in \mathbb{R}$ is the bias term and $f$ is a nonlinear function. The convolution kernel is applied to every possible window in the sequence to form a feature set:

$$c = \left[c_1, c_2, \ldots, c_{n-h+1}\right]. \tag{4}$$

Here, $c \in \mathbb{R}^{n-h+1}$. The network structure of the DTCNN includes an input layer, multiple parallel convolutional layers, and pooling layers. A convolutional layer together with the adjacent pooling layer that follows it is called a convolution module, and the DTCNN consists of multiple convolution modules. This study uses a DTCNN composed of 9 convolution modules to extract the features. The convolutional layers of the network use convolution kernels of four different heights $h$: 2, 3, 4, and 5. To reduce the training time and the memory requirements, the numbers of these convolution kernels are 50, 100, 150, and 200, and the corresponding numbers of convolutional layers are 9, 7, 6, and 5, respectively.
Based on the experimental classification results, the pooling layers use max pooling, and the nonlinear activation function is ReLU. To extract more local features, the convolution stride is always set to 1. The height of the convolution window used in the first convolutional layer is h, and its width is determined by the word vector dimension; the width of the remaining convolution kernels is 1. The size and stride of the pooling window are set to h. The pooling step size of the final pooling layer is determined by the input dimension of the adjacent convolutional layer. The final output of each convolution kernel is a 1 × 1 feature vector. For example, when the height h of the convolution kernel is 2, the size of the convolution kernel is 2 × k, the output after the first convolutional layer is one-dimensional data, the size of the filter in the connected pooling layer is 2 × 1, and the stride is 2. The size and stride of the subsequent convolution kernels remain unchanged. In the last pooling layer, the pooling window dimension is equal to the input dimension of the previous convolutional layer, so a 1 × 1 feature vector is obtained.
The DTCNN module in this experiment finally outputs 500 feature vectors of size 1 × 1.
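The sketch below illustrates the window convolution and the kernel bookkeeping with a single convolution-plus-pooling stage per kernel, in the style of TextCNN. It is deliberately not the 9-module deep stack described above: each kernel here is followed directly by max pooling over the whole feature map. The toy sizes `n = 12` and `k = 4` and the random weights are assumptions for illustration; only the kernel heights and counts come from the text.

```python
import random


def conv_feature_map(x, kernel, bias):
    """Return [c_1, ..., c_{n-h+1}] with c_i = ReLU(w . x_{i:i+h-1} + b)."""
    h, n = len(kernel), len(x)
    feature_map = []
    for i in range(n - h + 1):              # stride 1, no padding (VALID)
        total = bias
        for vector, weights in zip(x[i:i + h], kernel):
            total += sum(a * w for a, w in zip(vector, weights))
        feature_map.append(max(0.0, total))  # ReLU
    return feature_map


# Kernel height -> number of kernels, as listed in the text.
KERNELS_PER_HEIGHT = {2: 50, 3: 100, 4: 150, 5: 200}

rng = random.Random(0)
n, k = 12, 4    # toy sizes; the paper uses n = 10000 and k = 100
x = [[rng.uniform(-1.0, 1.0) for _ in range(k)] for _ in range(n)]

features = []
for h, count in KERNELS_PER_HEIGHT.items():
    for _ in range(count):
        kernel = [[rng.uniform(-1.0, 1.0) for _ in range(k)] for _ in range(h)]
        # Max pooling over the whole map leaves one scalar per kernel.
        features.append(max(conv_feature_map(x, kernel, bias=0.0)))

print(len(features))   # 50 + 100 + 150 + 200 = 500
```

One scalar per kernel is what makes the 50 + 100 + 150 + 200 kernels add up to the 500 output features reported for the DTCNN module.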

LSTM Network.
The RNN (recurrent neural network) is a neural network that can process and predict time series data. An unrolled RNN is equivalent to a multilayer feed-forward neural network that passes information from layer to layer. However, when the training text is relatively complex, it is difficult for the RNN to handle long-term dependence. The LSTM performs better on longer sequences than the ordinary RNN. The LSTM uses a memory unit to replace the traditional hidden neuron node and carries the memory information from the start of the sequence to its end, avoiding the problem of long-term dependence. This memory method is realized through three gating mechanisms, namely, the input gate, the forget gate, and the output gate. The LSTM cell structure is shown in Figure 4. The input gate controls how much of the current input enters the unit state. The output gate controls how much of the current unit state is output. The forget gate controls the degree to which the previous unit state is forgotten. The neuron's state update can be expressed as

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right), \tag{5}$$

$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right), \tag{6}$$

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right), \tag{7}$$

$$c'_t = \tanh\left(W_c \cdot [h_{t-1}, x_t] + b_c\right), \tag{8}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot c'_t, \tag{9}$$

$$h_t = o_t \odot \tanh(c_t). \tag{10}$$

Here, $\sigma$ represents the Sigmoid function; $\tanh$ represents the hyperbolic tangent function; $\odot$ denotes element-wise multiplication; $W_i$, $W_o$, and $W_f$ represent the weight matrices of the input gate, output gate, and forget gate, respectively; $x$ and $h$ are the input and output of the memory unit; $i_t$, $o_t$, and $f_t$ are the input gate, output gate, and forget gate; $c'_t$ and $c_t$ are the candidate value and the new memory cell state; $h_t$ is the final output; $b_i$, $b_o$, and $b_f$ represent the offset vectors of the corresponding gates; and $W_c$ and $b_c$ are the weight matrix and offset vector of the candidate value.
Equations (5)-(7) represent the calculation of the input gate, output gate, and forget gate, respectively. Equations (8)-(9) are used to update the state of the memory cell. Equation (10) first applies tanh to the current state of the memory cell and then determines the final output through the output gate. In this way, the deep extraction and output of the code vector features are completed. The structure of the LSTM neural network gives it an advantage in processing language-like sequence data. It achieves good classification results in some semantic processing tasks and can make up for the shortcomings of convolutional networks in terms of long-term memory, so it is selected as part of the deep feature extraction module.
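A single LSTM update following the gate and memory-cell computations described above can be sketched in plain Python. The parameter dictionary layout, the helper names, and the toy sizes (2 hidden units, 3 inputs, all-zero weights) are assumptions chosen so the result is easy to check by hand; a real model would use a framework implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, vector, bias):
    """Return W v + b, where W is given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM update; params holds W_i, W_o, W_f, W_c and matching biases."""
    z = list(h_prev) + list(x_t)                       # [h_{t-1}, x_t]
    i_t = [sigmoid(a) for a in affine(params["W_i"], z, params["b_i"])]
    o_t = [sigmoid(a) for a in affine(params["W_o"], z, params["b_o"])]
    f_t = [sigmoid(a) for a in affine(params["W_f"], z, params["b_f"])]
    c_cand = [math.tanh(a) for a in affine(params["W_c"], z, params["b_c"])]
    # Keep part of the old state, add part of the candidate state.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_cand)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Toy check: with all-zero parameters every gate equals sigmoid(0) = 0.5
# and the candidate state is tanh(0) = 0, so c_t is half of c_prev.
zeros = [[0.0] * 5 for _ in range(2)]
params = {name: zeros for name in ("W_i", "W_o", "W_f", "W_c")}
params.update({name: [0.0, 0.0] for name in ("b_i", "b_o", "b_f", "b_c")})

h_t, c_t = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -2.0], params)
print(c_t)   # [0.5, -1.0]
print(h_t)
```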

Classification with the Fusion Model of DTCNN-LSTM.
After the DTCNN and LSTM neural networks extract the deep information, the resulting feature vectors are sent to the fully connected layers, which classify them. For the overall consistency of the model, this study uses a fully connected network and a Softmax classifier as the output module of the entire model. Because the network has many layers, the model uses the ReLU activation function in the DTCNN and the fully connected layers to accelerate convergence and shorten the learning cycle. To prevent all the feature detectors from acting together in every iteration and always strengthening or weakening some specific features, a dropout mechanism is used between the two fully connected layers. In each training iteration, 50% of the hidden-layer nodes are randomly selected to participate, so that the weight updates do not depend on particular fixed features and the problems of overfitting and weak generalization are avoided. The fusion model is shown in Figure 5.
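The output module can be sketched as follows. Several details are assumptions because the text does not specify them: the two feature vectors are fused by simple concatenation, the hidden layer has 64 units, the weights are random, and dropout is implemented in its "inverted" form (surviving units are rescaled during training). The feature sizes 500 and 100 follow the DTCNN output count and the LSTM hidden size given in the paper.

```python
import math
import random


def dense(weights, vector, bias):
    """Fully connected layer; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def relu(vector):
    return [max(0.0, v) for v in vector]


def dropout(vector, rate, rng, training):
    """Inverted dropout: zero each unit with probability `rate` when training."""
    if not training:
        return list(vector)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in vector]


def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(dtcnn_features, lstm_features, params, rng, training=False):
    """Fuse both feature vectors and return [p(malicious), p(benign)]."""
    fused = list(dtcnn_features) + list(lstm_features)  # assumed: concatenation
    hidden = relu(dense(params["W1"], fused, params["b1"]))
    hidden = dropout(hidden, 0.5, rng, training)
    return softmax(dense(params["W2"], hidden, params["b2"]))


rng = random.Random(0)
N_DTCNN, N_LSTM, N_HIDDEN = 500, 100, 64   # N_HIDDEN is a hypothetical size


def random_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.05) for _ in range(cols)] for _ in range(rows)]


params = {
    "W1": random_matrix(N_HIDDEN, N_DTCNN + N_LSTM), "b1": [0.0] * N_HIDDEN,
    "W2": random_matrix(2, N_HIDDEN), "b2": [0.0, 0.0],
}
dtcnn_out = [rng.random() for _ in range(N_DTCNN)]
lstm_out = [rng.uniform(-1.0, 1.0) for _ in range(N_LSTM)]
probs = classify(dtcnn_out, lstm_out, params, rng)
print(probs)
```

Dropout is active only when `training=True`; at inference time the hidden activations pass through unchanged.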
According to the previous sections, the malicious Android application detection algorithm using the DTCNN-LSTM fusion model is as follows:
Input: a two-part code vector set
Output: 0 (malicious) or 1 (benign)
Step 1: screen the massive sample data, extract the required API call instructions and Dalvik opcodes, perform code vectorization, and express the result as feature vectors.
Step 2: use the static feature blocks as the input of the DTCNN-LSTM fusion model. The DTCNN extracts local abstract features from the vector set and captures high-level features in the pooling layers. The LSTM neural network further extracts timing information about the logical sequence of the code context.
Step 3: use the TensorFlow framework to fuse the output features of the DTCNN and the LSTM.
Step 4: connect the two fully connected layers and the dropout layer, use the Softmax classifier to classify the fused feature set from Step 3, and complete the classification and detection of malicious Android applications.

The data set needs to ensure a reasonable distribution of positive and negative samples. We select about three times as many benign apps as malicious apps to maintain balance during training, because an imbalanced data set can result in skewed models. To increase the validity of the model results, this experiment collected a sufficient number of Android application installation packages from various sources at home and abroad as the training and test samples.

Experiment Results and Analysis
The benign Android applications in this paper come mainly from the domestic Android application market and the Google Play Store. Although there is no guarantee that the application markets are completely free of malware, this paper collects the Android applications with the most downloads and the highest ratings in the Android application markets in China and abroad. The malicious Android application samples used in this study come from the foreign virus database VirusShare. The total sample set contains 2584 positive samples and 7584 negative samples.

Experimental Environment and Parameters.
The experimental environment is shown in Table 2.
In the experiment, to balance training speed against the risk of failing to converge, the batch size is set to 35. During code vectorization, the word embedding vector dimension of the training output is set to 100, the number of hidden layer units is also set to 100, and the training window size is set to 5. The gradient descent optimization algorithm used to optimize the model parameters is Adam. The loss function is the cross-entropy error. The learning rate is set to 0.1. The padding mode is VALID. The number of hidden units in the LSTM module is set to 100, and the forget bias is set to 2.0.

Evaluation Index.
To evaluate the effectiveness of the proposed model, the experiment combines the actual category of each experimental sample with the category predicted by the model and divides the results into true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The "confusion matrix" of the test results is shown in Figure 6. The accuracy rate and the recall rate are used to evaluate the model test results. For the test data set, the accuracy rate is the ratio of the number of samples correctly identified by the model to the total number of samples, and the recall rate is the ratio of the number of correctly identified positive samples to the total number of positive samples. The accuracy rate and the recall rate are defined as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{11}$$

$$\text{Recall} = \frac{TP}{TP + FN}. \tag{12}$$
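The two metrics follow directly from the four confusion-matrix counts: accuracy is the share of all samples classified correctly, and recall is the share of positive samples that are found. The sketch below treats label 0 (malicious) as the positive class, in line with the output coding used in this paper; the function names and the eight example labels are made up for illustration and are not experimental results.

```python
def confusion_counts(y_true, y_pred, positive=0):
    """Count TP, FP, TN, FN; label 0 (malicious) is the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Correctly classified samples divided by all samples."""
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp, fn):
    """Correctly detected positives divided by all actual positives."""
    return tp / (tp + fn)


# Hypothetical labels: 3 malicious (0) and 5 benign (1) samples.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(tp, fp, tn, fn)              # 2 1 4 1
print(accuracy(tp, fp, tn, fn))    # 0.75
print(recall(tp, fn))
```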

Results and Analysis.
To reduce model overfitting, this paper uses K-fold cross-validation for model tuning in the model realization stage. In K-fold cross-validation, the initial sample is divided into K parts: one part is retained as the validation data, and the other K-1 parts are used as the training set. The cross-validation is repeated K times, and K detection models are obtained. The average result of the K models represents the detection performance of the model under K-fold cross-validation. In this paper, K is set to 4. Figure 7 shows the results obtained for each cross-validation fold. The last set of data in the figure is the average value obtained by the fourfold cross-validation. The accuracy rate is 95.0%, and the recall rate is 93.9%. The DTCNN-LSTM model achieves a good prediction effect.
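The fold construction can be sketched as below. Shuffling before splitting, the fixed seed, and the lack of stratification by class are assumptions; the paper only gives K = 4 and the sample counts.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)         # shuffling is an assumption
    folds = [indices[i::k] for i in range(k)]    # k nearly equal parts
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


N_SAMPLES = 2584 + 7584    # sample counts reported in the paper
splits = list(k_fold_splits(N_SAMPLES, k=4))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))          # 7626 2542 for every fold
```

Each of the K models would be trained on `train_idx` and scored on `val_idx`, and the per-fold accuracy and recall would then be averaged.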
To verify that the improved model proposed in this paper has stronger detection capabilities than other deep learning models that have been used for malware detection, this experiment uses the same batch of samples to compare our model with multiple deep learning models. In the comparative experiment, CNN-LSTM is a hybrid model based on a convolutional network and long short-term memory proposed by Wang [35]. Unlike DTCNN-LSTM, the convolutional neural network in CNN-LSTM is a shallow convolutional network. The DCNN used in the comparison experiment is a two-layer convolutional network for malicious applications proposed by McLaughlin [36]. The convolution kernel in DCNN has a single size. Except for the separate LSTM model and the LSTM structure in CNN-LSTM, the models use the Group B feature vectors as input.
The test results are shown in Table 3. From the experimental data in Table 3, it can be seen that the other models, which are shallower than the DTCNN-LSTM model, show weaker detection performance to varying degrees. These results indicate that the deep convolutional architecture of the DTCNN can characterize malicious applications more effectively than shallow architectures. In the task of detecting malicious Android applications, a single piece of local information may not affect the classification. Maliciousness is determined by the interaction of multiple pieces of local information, so a single convolutional network expresses the logical features of malicious code poorly. The prediction accuracy of the DTCNN-LSTM model is higher. The LSTM network captures the logical relationships of the input features effectively and makes up for the deficiencies of a single DTCNN. Although the LSTM network has a long-term memory function, its ability to extract high-level local information is weak. Used alone, it learns only the security semantic expression of the code context, so its accuracy rate is low. Combining it with other models greatly improves the overall detection effect. This indicates that, in Android malicious application detection tasks, abstracting the features to a higher level while also learning the dependencies within the input feature vectors yields good classification results.

Conclusion
This paper establishes a malicious Android application detection model based on the deep convolutional network DTCNN and the LSTM network and implements the Android malware detection algorithm. The model is horizontally compared with a single convolutional network model, a long short-term memory network model, and a CNN-LSTM fusion model to verify its effectiveness. The positive and negative sample sets used in this paper are from the domestic Android application market and VirusShare. In the experiment, the source code of the applications was first acquired and filtered. Random subsets of the positive and negative sample sets were selected as the training data set and the test data set. Code vectorization was then performed, and the DTCNN-LSTM model parameters were trained using the feature vectors. Finally, the malicious Android applications in the test set were classified and identified. The results show that the fusion model performs well in understanding the security semantics of malicious Android applications and in extracting local information. This result shows that combining modeling methods with different functions and different optimization principles is a way to improve the effectiveness of models based on deep network structures. An end-to-end detection model that can automatically acquire feature representations remains the direction of future development.
There are still some deficiencies in this study and areas that can be improved. Firstly, the number of samples has a large impact on the training results of deep learning models, and the data set used in the experiment is relatively small. More experiments with a larger data set are required to improve the accuracy of the model. Secondly, feature extraction from the static source code of the Android application generates an input vector with an excessively large number of dimensions, and the complete validity of the information it contains cannot be guaranteed. The selection and filtering of static features need to be improved.

Data Availability
The APK sample data used to support the findings of this study are available from the corresponding author upon request. The malicious samples were obtained from https://virusshare.com/.