Power Consumption Based Android Malware Detection

,


Introduction
Smart phones and other mobile terminals have become the most important part of modern life, and Android operating system in the smart phone has more than 50% share.However, not only do open source characteristics of the Android platform provide convenient conditions for software developers, but also security problem has become a hot spot of concern.According to the latest statistics of F-Secure [1], in 2014 the Android platform has 277 species threats, such as unauthorized download applications and GPS location tracking.So distinguishing the mobile normal software and malicious software is a hot problem to be solved.
At present, there are two kinds of methods for malware detection: one is static analysis method, and the other is dynamic analysis method.The static analysis is mainly based on application decompiled and gets source code of malicious behavior.
However, due to the limitation of software code analysis method and decompiled technology, dynamic analysis has gradually become the trend of the mobile malware detection technology development.
In dynamic analysis field, Zheng et al. [2] propose a dynamic analysis system running in the Dalvik virtual machine, but because of the limitation of Dalvik layer in Android, it cannot monitor the detailed information of underlying code.Enck et al. [3] set up a monitoring program running in simulation environment; this method can effectively achieve real-time behavior for a specified application, but it cannot run on Android devices without ROOT access.
For the Android platform application software, there are two ways to get power consumption information.We can use native Android API to calculate power consumption.This method has been used by some application software, such as PowerTutor [4].
Another method is proposed by Curti et al. [5] who rewrite the underlying driving mode for power information.Compared with the first method, this method is more accurate, but it has high intervention to the system.
In the study of power consumption analysis method to detect malicious software, Jacoby et al. [6] propose a model which is based on power consumption to detect intrusion.
Based on this, Buennemeyer et al. [7] and Kim et al. [8] use the Android mobile phone battery configuration files to detect malicious software.But all of the above methods only consider the amount of power consumption, without taking the timing characteristics of power consumption into account.
According to the above problem, this paper puts forward a kind of mobile malware detection model based on mobile phone power consumption.The model structure is shown as in Figure 1, and it mainly includes four function parts:

Battery Power Consumption Monitoring
The method for monitor battery power consumption is the key to this detection model; if battery power consumption data is not accurate, it will directly affect the accuracy of the results.In this paper, we use PowerTutor [4] to monitor battery power consumption information of the specified application.
PowerTutor can get the battery power consumption timeseries data of each mobile application by monitoring their components including CPU, audio, screen, Wi-Fi, 3G, and GPS.
During the power consumption monitoring, we found that power consumption has the same state transition characteristics as sound signal did.But the state of power consumption is much more stable, and 5-minute sampling time can cover all status changes.So, in the rest of this paper, for each application we take 5 minutes to monitor battery consumption.
Figure 2 shows the CPU power consumption of two Android applications ("UC browser" and "iReader") within 5 minutes.It can be found from Figure 2 that the consumption curves of two applications with different functions are significantly different.It also shows that we can effectively distinguish different applications through the analysis of the power consumption.

Waveform Feature Extraction and Analysis
We know that applications with different functions have different power consumption from Figure 2.But when using these consumption data to classify software, we also need to consider the influence of different factors on the power consumption, such as uncertainty user input, different screen resolution, and the influence of different hardware acceleration technology on power consumption.In order to reduce the above factors on the result of application software to identify or test, it is important to choose an effective method to extract the feature information of power consumption waveform.
In the field of waveform characteristics analysis, the most widely used and the most mature way is Mel frequency cepstrum coefficient (MFCC).It has been widely used in speech recognition and has high identification accuracy [9,10].
This paper designs a MFCC calculation process for waveform feature extraction which was mentioned before (see Figure 3).
MFCC computation process is as follows.
(1) Frame Pretreatment.Frame pretreatment in this paper only needs to do framing and window processing for the original input data of ().The purpose is to extract short signal features (similar to audio processing).In order to facilitate the calculation of subsequent steps, set the frame length 5 s; frame offset (the overlap between two frames) is 2.5 s.During the processing, we use Hamming window; the calculation formula is Among them, the 0.54 and 0.46 are the constant which the Hamming window defined,  is sampling points per frame, and () is the short signal feature which is the result of the Hamming window formula.
(2) DFT (Discrete Fourier Transform) Linear Spectrum Calculation.We use Fourier transform to calculate the short-time energy and frequency domain signals.First of all, add some 0 after each frame in the time domain signal () to make sure the frame length is  ( = 256); then, we can get the linear spectrum () through the discrete Fourier transform.The calculation method is (3) Mel Frequency Filter Calculation.The input frequency signal through a number of triangle pass filters and Mel frequency spectrum signal are obtained by filtering calculation.Among them, each of the triangular filter center frequency calculation steps is as follows.
Set the number of filter , signal sampling frequency is   , and the max Mel frequency is Center frequency spacing between each filter is  of triangular filter for ; logarithmic spectrum () of linear spectrum () is calculated as follows: (5) Cepstrum Calculation.Through the discrete cosine transform (DCT), we can transfer the logarithmic spectrum () to cepstrum; the formula is

Waveform Feature Matching
Feature matching includes two processes.Firstly, create the battery power consumption characteristics database.And then match software battery power consumption characteristics of input signal with the feature database.In the feature matching calculation, we choose Gaussian mixture model (GMM) [11].The advantage of GMM is that it can approximate to arbitrary shape distribution probably and the algorithm complexity is low.Battery power consumption feature matching process design is as follows.
(1) Build Characteristic Database.The key to build battery power consumption characteristics database is essentially to calculate Gaussian distribution parameters of the given MFCC feature sequence.Therefore, we can use the maximum likelihood estimation to obtain probability distribution parameters; calculation method is as follows.
Set the MFCC sequence  = {  ,  = 1, 2, . . ., }.Each component of  is independent of each other; the likelihood for the GMM model parameters is The eligible parameters are to find a group λ which makes ( | ) largest: or While using maximum likelihood estimation, we choose to use the expectation maximization (EM) algorithm [12] to get the estimated parameters.The calculation steps are as follows.
For each training sample ,  is the category which this sample may belong to, and   is a certain probability distribution of ; then   (  ) =  (  |   ; ) .(10) Estimated parameters are calculated as follows: The EM algorithm can converge because of Jensen's inequality [13].
According to formula (10) and Jensen's inequality, when calculating expectation we assume that the value of  is unchanged; then we can get the equation ( () ) log  ( () ,  () ;  () ) But in each iteration, the value of  is changed; if we treat  as variable and calculate its derivative, we can get ( () ) log  ( () ,  () ;  () ) Through formula (12) and formula (13) we proved  is monotonically increasing.So one way to convergence is that  no longer changes, or  change very little.
(2) Features Value Matching.After building the database, the given  is a known characteristic.For the existing categories , calculate the probability value ( |   ).Characteristic value matching is finding   , which can make max  ( |   ); then   is the output of feature matching.

Experiments and Result Analysis
In the experiment, first we use the method proposed in Section 3 to build GMM model for game, browser, music player, and malicious software.And for the malicious software, we simulate a floating point number attack for the Android platform.In the construction of the GMM model, we repeatedly monitor the software's power consumption in 5 minutes and calculate the MFCC features for the reset of 3 types of applications and then use the EM algorithm to calculate the GMM model parameters.
Experiments are designed on 6 cell phones, including Samsung GALAXY S5 and LG G2 mobile phones.Battery monitoring tool is PowerTutor [4].In the experiments, we only monitor and collect the CPU power consumption data when the application software is running.The cycle time of monitoring is 5 minutes and sampling frequency is 300 ms.4; it is the test result of the application that is named iReader.The GMM parameters model is shown in Figure 5.
We can see from Figure 4 although power consumption MFCC feature values for the same application in different mobile phones are not completely the same, the distribution is roughly the same.It also proves that the MFCC feature extracting can largely reduce the equipment and other impacts caused by factors.
And, in Figure 5, the GMM model also shows that the estimated parameters can be calculated using the extracted MFCC parameters and the result is convergence.
In application software testing, firstly we build GMM model for four kinds of typical applications, including games, browser, music player, and malicious software.In this test, this paper only simulates one kind of malicious software which is floating point attack.
During the test step, this paper selects 100 applications in each group.Under the normal condition, we monitor the CPU battery consumption during 5 minutes and then classify it through the proposed power analysis model.Four kinds The experiment proves that CPU power consumption characteristics of simulated malicious software have a significant difference from others, and through the analysis of the power consumption we can detect and classify malicious software.
Since our method is based on the Gaussian mixture model, we know that the model order will affect the test results.So during the experiments, we change the model order from 2 to 12 with the same sampling time and the same sample data of music applications and compare the results.And the experimental results are shown in Table 3.
The data in Table 3 can prove that, with the increase of GMM's order number, detection accuracy has a significant rising.But when the GMM's order is greater than 4, detection precision begins to decline.This is because as the GMM's order is larger the model becomes more complex; the matching process and parameters calculating process will take a longer time.Since our experiment takes 5 minutes' sampling time, when GMM's order becomes larger the sample data is too little to calculate accurately the higher order of GMM's parameters.

Conclusions
Based on battery power consumption sequence characteristics of the application software which is collected from mobile terminals such as mobile phone, this paper introduces MFCC feature extraction algorithm and GMM model classification algorithm and then proposes an Android malware detection method.What is more, the experiment proves that the method has effective identification and detection of malicious software.
In the future, we will try to make MFCC parameters extraction and classification of GMM model integrate into the mobile terminal equipment and consider the battery consumption of other hardware when application software is running.

( 4 )
Logarithmic Energy Calculation.In order to have better robustness of noise, it is indispensable to calculate the logarithmic energy of Mel frequency spectrum.Set the number

Table 1 :
Typical applications results.

Table 2 :
Detection rate of statistical.

Table 3 :
Different model orders and results.

Table 2 .
The malware detection rate in Table 2 (negative rate, NR) shows that the proportion of application was classified as malware, benign detection rate (positive rate, PR) shows the proportion of application software was classified as benign, and accuracy rate (AR) shows the proportion of application by accurate classification.Table2data show that the proposed power analysis model to differentiate benign software and malicious software is highly accurate.