The present research proposes applying unsupervised and supervised machine-learning techniques to characterize Android malware families. More precisely, a novel unsupervised neural-projection method for dimensionality reduction, Beta Hebbian Learning (BHL), is applied to visually analyze such malware. In addition, well-known supervised Decision Trees (DTs) are applied for the first time in order to improve the characterization of such families and to compare the original features identified as the most important ones. The proposed techniques are validated on real-life Android malware data from the well-known and publicly available Malgenome dataset. The results support the proposed approach, confirming the validity of BHL and DTs for gaining deep knowledge of Android malware.
Undoubtedly, smartphones are one of the emerging technologies that have revolutionized the use of computing systems. From the very beginning (late 1990s), more and more smartphones are sold every year and it is expected that the number of smartphone users passes the 2.7 billion mark by 2019 [
From the security standpoint, one of the main problems of smartphone apps is malware that is included in software in general and in these apps in particular. Furthermore, “users of mobile devices are increasingly subject to malicious activity pushing malware apps” [
As it can be seen, privacy and security of smartphones still are open challenges [
This pioneering work on collecting Android malware found some interesting statistics [
To improve present knowledge of Android malware families, a novel neural-projection technique from the family of Exploratory Projection Pursuit (EPP) techniques, named Beta Hebbian Learning (BHL) [
Each app (data sample) collected for the Malgenome dataset is defined as a set of features in a binary representation. Apps were grouped according to the family they belong to, and features were recalculated for the whole family, taking into account which features were present in the given apps. The resulting high-dimensional space is then analysed by means of BHL in order to reveal the inner structure of the dataset. The projections obtained are then scrutinized to learn which app features define the organization of the data into groups and subgroups. For comparison purposes, DTs have also been generated on the same feature set, in order to identify the features that best discriminate between the different malware families.
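The family-level recalculation described above can be illustrated with a minimal sketch. It assumes that a family has a feature whenever at least one of its apps has it (a logical OR); the text does not state the exact aggregation rule, so this is only one plausible reading. The function name and the toy data are hypothetical and are not taken from Malgenome.

```python
from collections import defaultdict

def aggregate_by_family(apps):
    """Collapse per-app binary vectors into one binary vector per family.

    Assumption: a family-level feature is 1 if any app of that family has it.
    `apps` is an iterable of (family_name, feature_vector) pairs.
    """
    grouped = defaultdict(list)
    for family, features in apps:
        grouped[family].append(features)
    return {
        family: [int(any(column)) for column in zip(*vectors)]
        for family, vectors in grouped.items()
    }

# Hypothetical toy data with three features per app.
apps = [
    ("FamilyA", [1, 0, 0]),
    ("FamilyA", [1, 1, 0]),
    ("FamilyB", [0, 0, 1]),
]
family_vectors = aggregate_by_family(apps)
```

With this reading, each family ends up as a single binary sample, which matches the one-sample-per-family layout of the dataset analysed later.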
A variety of problems have been addressed by artificial neural networks in recent decades [
Visualization techniques have been previously applied to this problem of analyzing malware [
The rest of this paper is organized as follows; initially BHL and DTs are presented and the analyzed dataset is described in the following section. Then, the proposed experiments are introduced and the obtained results are analyzed in Section
In present research, the EPP BHL algorithm [
The Beta Hebbian Learning technique (BHL) [
Thus, if the PDF of the residuals is known, the optimal cost function can be determined. By using
where
Then, by using the following, gradient descent is performed to maximize the likelihood of the weights:
In the case of BHL, the learning rule allows for fitting the PDF of the residual, by maximizing the likelihood of such residual with the current distribution.
Therefore, the neural architecture for BHL is defined as follows:
Decision Trees (DTs) [
The main objective of a classification DT is to divide a dataset into groups of samples as similar as possible in relation to one of the features. They are made of three main elements: root node (contains all samples of the dataset), decision nodes (represent a decision or rule), and leaf nodes (final label). A dataset is then classified based on subdivisions of the DT nodes to reach one of the final (leaf) nodes whose label corresponds to a class (Figure
Structure of decision trees.
Several algorithms have been proposed so far to build DTs and their efficiency has been proved. The most notable ones [
The Classification and Regression Tree (CART) [
(1) Build the decision tree by splitting nodes according to a given function.
(2) Finish tree construction once the learning fits the stop criteria.
(3) Prune the tree to avoid overfitting.
(4) Select the best tree after the pruning process.
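The growing phase of these steps can be sketched for binary features as follows. This is an illustrative simplification, not the implementation used in the paper: it assumes the standard Gini impurity as the splitting function, uses a hypothetical `max_depth` parameter as the only stop criterion besides node purity, and omits pruning and tree selection entirely. All function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (standard textbook form)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def partition(rows, labels, feature):
    """Split samples on a binary feature into the 0-branch and the 1-branch."""
    branches = {0: ([], []), 1: ([], [])}
    for x, y in zip(rows, labels):
        branches[x[feature]][0].append(x)
        branches[x[feature]][1].append(y)
    return branches

def best_feature(rows, labels):
    """Feature with the largest impurity decrease, or None if none separates the samples."""
    best, best_gain = None, -1.0
    for feature in range(len(rows[0])):
        parts = partition(rows, labels, feature)
        if not parts[0][1] or not parts[1][1]:
            continue  # every sample falls on the same side
        weighted = sum(len(ys) * gini(ys) for _, ys in parts.values()) / len(labels)
        gain = gini(labels) - weighted
        if gain > best_gain:
            best, best_gain = feature, gain
    return best

def build(rows, labels, depth=1, max_depth=10):
    """Grow the tree until a node is pure, unsplittable, or max_depth is exceeded."""
    feature = None
    if len(set(labels)) > 1 and depth <= max_depth:
        feature = best_feature(rows, labels)
    if feature is None:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    parts = partition(rows, labels, feature)
    return {"feature": feature,
            "children": [build(*parts[b], depth + 1, max_depth) for b in (0, 1)]}

def predict(tree, x):
    """Follow decision nodes from the root down to a leaf label."""
    while "leaf" not in tree:
        tree = tree["children"][x[tree["feature"]]]
    return tree["leaf"]

# Hypothetical toy data: four families, one sample each, two binary features.
rows = [[1, 0], [1, 1], [0, 1], [0, 0]]
labels = ["A", "B", "C", "D"]
tree = build(rows, labels)
```

Because every class has a single sample in this toy set, the tree keeps splitting until each leaf holds exactly one family.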
Originally, the splitting function used by CART is the Gini Index
where
For comparison purposes, two other splitting functions have been applied in present paper: Deviance (
Twoing is a splitting function different from Gini and Deviance. Being
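For orientation, the three splitting functions can be sketched in their common textbook forms. These are assumptions rather than the paper's own formulas: Gini as one minus the sum of squared class proportions, Deviance as the cross-entropy with base-2 logarithms, and Twoing in Breiman's form with a 1/4 factor (some implementations drop that constant, which does not change which split is best). Function names are hypothetical.

```python
import math
from collections import Counter

def class_probs(labels, classes):
    """Proportion of each class in `classes` among `labels`."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in classes]

def gini(labels):
    """Node impurity: 1 - sum_i p_i^2 (0 for a pure node)."""
    return 1.0 - sum(p * p for p in class_probs(labels, sorted(set(labels))))

def deviance(labels):
    """Node impurity: -sum_i p_i * log2(p_i) (0 for a pure node)."""
    return -sum(p * math.log2(p) for p in class_probs(labels, sorted(set(labels))))

def twoing(left, right):
    """Split quality: P(L) * P(R) / 4 * (sum_i |p(i|L) - p(i|R)|)^2.

    Unlike Gini and Deviance, this scores a split rather than a single node,
    and larger values are better.
    """
    classes = sorted(set(left) | set(right))
    n = len(left) + len(right)
    diff = sum(abs(l - r) for l, r in zip(class_probs(left, classes),
                                          class_probs(right, classes)))
    return (len(left) / n) * (len(right) / n) / 4.0 * diff ** 2
```

Gini and Deviance are minimized over the child nodes of a candidate split, whereas Twoing is maximized directly over the split.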
On the other hand, in the standard CART algorithm, the split feature selected for a decision node is the one that maximizes the split-criterion gain. Once again, for a more comprehensive comparison, two other criteria have been applied for selecting split features: curvature [
Curvature: it is based on the null hypothesis that two features are unassociated. With this criterion, the best split predictor feature is the one that minimizes the significant
Interaction: it is based on the null hypothesis of no interaction between the label and the predictor features. Therefore, for deep decision trees, standard CART tends to miss important interactions between pairs of features when there are also many other less important features. By means of this criterion, the detection of such important interactions is improved.
The dataset used in this research has been obtained from the Android Malware Genome Project [
This dataset contains malware apps installed on user phones and based on 3 main attack strategies: repackaging, update attack, and drive-by download. Samples of this dataset were manually classified based on different aspects such as installation and activation mechanisms and the nature of the malicious payloads. Collected malware was split into families that were obtained “by carefully examining the related security announcements, threat reports, and blog contents from existing mobile antivirus companies and active researchers as exhaustively as possible and diligently requesting malware samples from them or actively crawling from existing official and alternative Android Markets” [
The different families present in the dataset are ADRD, AnserverBot, Asroot, BaseBridge, BeanBot, BgServ, CoinPirate, Crusewin, DogWars, DroidCoupon, DroidDeluxe, DroidDream, DroidDreamLight, DroidKungFu1, DroidKungFu2, DroidKungFu3, DroidKungFu4, DroidKungFuSapp, DroidKungFuUpdate, Endofday, FakeNetflix, FakePlayer, GamblerSMS, Geinimi, GGTracker, GingerMaster, GoldDream, Gone60, GPSSMSSpy, HippoSMS, Jifake, jSMSHider, Kmin, Lovetrap, NickyBot, Nickyspy, Pjapps, Plankton, RogueLemon, RogueSPPush, SMSReplicator, SndApps, Spitmo, TapSnake, Walkinwat, YZHC, zHash, Zitmo, and Zsone.
Therefore, the final dataset is made of a total of 49 samples, one for each family of malware, defined by a total of 26 binary features divided in 6 categories (Table
Features in the Malgenome Dataset.
Category 1: Installation | 1.Repackaging, 2.Update, 3.Drive-by download, 4.Standalone |
Category 2: Activation | 5.Boot, 6.SMS, 7.Net, 8.Call, 9.USB, 10.PKG, 11.Batt, 12.SYS, 13.Main |
Category 3: Privilege escalation | 14.exploid, 15.RATC/zimperlich, 16.ginger break, 17.asroot, 18.encrypted |
Category 4: Remote control | 19.NET, 20.SMS |
Category 5: Financial charges | 21.phone call, 22.SMS, 23.block SMS |
Category 6: Personal information stealing | 24.SMS, 25.phone number, 26.user account |
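Since several features share a short name (for instance, "SMS" appears in four categories), it can help to qualify each feature with its category. The following sketch encodes the table above as a mapping and derives qualified names in ID order; the variable and function names are hypothetical, and the layout is only one convenient representation.

```python
# Categories and features exactly as listed in the table, in ID order.
FEATURE_CATEGORIES = {
    "Installation": ["Repackaging", "Update", "Drive-by download", "Standalone"],
    "Activation": ["Boot", "SMS", "Net", "Call", "USB", "PKG", "Batt", "SYS", "Main"],
    "Privilege escalation": ["exploid", "RATC/zimperlich", "ginger break",
                             "asroot", "encrypted"],
    "Remote control": ["NET", "SMS"],
    "Financial charges": ["phone call", "SMS", "block SMS"],
    "Personal information stealing": ["SMS", "phone number", "user account"],
}

# Qualified names, so that the feature with ID k is FEATURE_NAMES[k - 1].
FEATURE_NAMES = [
    f"{category}: {name}"
    for category, names in FEATURE_CATEGORIES.items()
    for name in names
]

def describe(vector):
    """Qualified names of the features set to 1 in a 26-element binary vector."""
    return [name for name, bit in zip(FEATURE_NAMES, vector) if bit]
```

For example, feature 6 resolves to "Activation: SMS" and feature 22 to "Financial charges: SMS", which removes the ambiguity when features are referred to by name alone.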
This section presents the experiments performed and the results obtained in the validation process of the proposed solution.
Both BHL (Section
In Figure
BHL: Projection of malware families.
Based on such projection, samples are grouped in 2 main clusters: G1 and G2 (Figure
BHL: Labelling of clusters.
Figure
Schematic clustering and relevant features from BHL projection.
Families allocation in Group 1 and relevant features identified in BHL projection.
Families allocation in Group 2 and relevant features identified in BHL projection.
Based on the analysis of BHL results, the most relevant features, in decreasing order of importance, are “Repackaging” and “Standalone”; “Boot” and “Activation: SMS”; and “Financial Charges: SMS.”
BHL clearly outperforms other algorithms used in previous works [
In addition to the BHL experiments, experiments with DTs were conducted in order to compare and validate the obtained results. As previously mentioned, 3 different splitting functions have been applied in the present paper: Gini, Deviance, and Twoing. In addition, 3 different criteria for selecting split features have been applied: Standard, Curvature, and Interaction.
As an example, one of the obtained DTs is shown in Figure
DT obtained with standard CART split criteria and Deviance function.
To show the most interesting results from the different alternatives to build DT, information has been summarized in Table
Summary table of DT results: minimum depth of decision nodes for each one of the original features.
ID | Feature | Deviance: Standard | Deviance: Curvature | Deviance: Interaction curvature | Gini: Standard | Gini: Curvature | Gini: Interaction curvature | Twoing: Standard | Twoing: Curvature | Twoing: Interaction curvature | Average |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | Repackaging | 1 | 2 | 4 | 1 | 2 | 6 | 1 | 2 | 4 | 2.56 |
5 | BOOT | 2 | 3 | 3 | 4 | 3 | 2 | 2 | 3 | 3 | 2.78 |
18 | Encrypted | 4 | 2 | 4 | 4 | 2 | 3.20 | ||||
9 | USB | 6 | 3 | 5 | 3 | 4 | 6 | 3 | 4.29 | ||
3 | Drive-by Download | 5 | 5 | 4 | 2 | 5 | 6 | 5 | 5 | 4 | 4.56 |
24 | Personal information stealing: SMS | 4 | 3 | 5 | 10 | 3 | 6 | 3 | 3 | 5 | 4.67 |
26 | User Account | 6 | 1 | 8 | 3 | 1 | 8 | 6 | 1 | 8 | 4.67 |
2 | Update | 5 | 7 | 4 | 2 | 6 | 3 | 5 | 7 | 4 | 4.78 |
19 | Remote control: NET | 6 | 2 | 8 | 9 | 2 | 3 | 4 | 2 | 8 | 4.89 |
6 | Activation: SMS | 3 | 6 | 5 | 8 | 7 | 3 | 3 | 6 | 4 | 5.00 |
10 | PKG | 6 | 5 | 4 | 5 | 4 | 6 | 5 | 5.00 | ||
22 | Financial charges: SMS | 3 | 6 | 4 | 10 | 6 | 4 | 3 | 6 | 4 | 5.11 |
4 | Standalone | 5 | 10 | 3 | 3 | 9 | 3 | 5 | 10 | 3 | 5.67 |
8 | CALL | 4 | 7 | 4 | 8 | 4 | 7 | 5.67 | |||
11 | BATT | 4 | 9 | 5 | 4 | 5 | 4 | 9 | 5.71 | ||
16 | Ginger Break | 6 | 6.00 | ||||||||
15 | RATC/Zimperlich | 6 | 8 | 1 | 9 | 9 | 8 | 7 | 8 | 1 | 6.33 |
7 | Activation: NET | 5 | 8 | 6 | 10 | 9 | 2 | 5 | 8 | 6 | 6.56 |
14 | Exploid | 5 | 8 | 6 | 9 | 6 | 6 | 5 | 8 | 6 | 6.56 |
17 | Asroot | 7 | 4 | 7 | 11 | 4 | 9 | 8 | 4 | 7 | 6.78 |
23 | Block SMS | 3 | 8 | 5 | 9 | 9 | 9 | 4 | 8 | 6 | 6.78 |
25 | Phone Number | 2 | 11 | 2 | 12 | 12 | 11 | 2 | 11 | 2 | 7.22 |
12 | SYS | 5 | 10 | 6 | 10 | 11 | 7 | 5 | 10 | 6 | 7.78 |
21 | Phone Call | 4 | 9 | 12 | 13 | 7 | 4 | 9 | 5 | 7.88 | |
13 | MAIN | 6 | 11 | 7 | 10 | 12 | 1 | 6 | 11 | 7 | 7.89 |
20 | Remote control: SMS | 10 | 8 | 9.00 |
The table shows that, for a given feature, the minimum depth varies (slightly or significantly) with the splitting function and the split-selection criterion. Because no general conclusion can be drawn from the individual values, the average depth is calculated for each feature and analyzed further.
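The summary statistic can be sketched as follows, assuming that the root node has depth 1 and that the average is taken only over the trees in which a feature appears. The second assumption is consistent with the Average column (for example, the five values reported for Encrypted, 4, 2, 4, 4, and 2, average to 3.20). The nested-dictionary tree format, the function names, and the toy trees are hypothetical.

```python
def min_depths(tree, depth=1, found=None):
    """Map each feature tested in `tree` to its shallowest depth (root = 1)."""
    found = {} if found is None else found
    if "feature" in tree:
        feature = tree["feature"]
        found[feature] = min(depth, found.get(feature, depth))
        for child in tree["children"]:
            min_depths(child, depth + 1, found)
    return found

def average_min_depth(trees):
    """Average each feature's minimum depth over the trees that use it."""
    collected = {}
    for tree in trees:
        for feature, depth in min_depths(tree).items():
            collected.setdefault(feature, []).append(depth)
    return {feature: sum(ds) / len(ds) for feature, ds in collected.items()}

# Hypothetical toy trees; leaves carry a label, decision nodes a feature.
leaf = {"leaf": "some family"}
tree_a = {"feature": "Repackaging",
          "children": [{"feature": "Boot", "children": [leaf, leaf]}, leaf]}
tree_b = {"feature": "Boot",
          "children": [leaf, {"feature": "Repackaging", "children": [leaf, leaf]}]}
tree_c = {"feature": "Repackaging", "children": [leaf, leaf]}

averages = average_min_depth([tree_a, tree_b, tree_c])

# Cross-check against one table row: Encrypted appears in five trees.
encrypted_depths = [4, 2, 4, 4, 2]
encrypted_average = sum(encrypted_depths) / len(encrypted_depths)
```

A lower average means the feature tends to be tested closer to the root, that is, it separates the families earlier in the tree.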
When analyzing Figure
Additionally, from the DTs results (Table
Results from present paper are consistent with those obtained in previous work [
In this paper, some machine learning techniques have been applied to Android malware data in order to analyse the features of such apps and subsequently identify the ones that better define the organization of malware families. As a result, detection and categorization of malware could be improved and sped up at the same time. Furthermore, by knowing about these features, malware apps could be identified more quickly and precisely and then removed from the official Android market.
Some conclusions can be derived from the obtained results. First of all, the proposed machine-learning techniques proved to successfully address the given challenge. BHL has outperformed the neural projection techniques previously applied to the same data, clearly revealing the structure of the Malgenome dataset. Additionally, the features identified as the most important ones by this EPP technique are also highlighted by DTs as being relevant to better differentiate between malware families.
The obtained results are consistent with those obtained by FS and hence validate the present proposal. Future work will focus on the development of a Hybrid Intelligent System to integrate results from the previously validated machine-learning techniques. In addition, it will be applied to up-to-date malware datasets in order to check its performance when facing 0-day malware.
Dataset used in this research is available in [
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This work is partially supported by Instituto Nacional de Ciberseguridad (INCIBE) and developed by Research Institute of Applied Sciences in Cybersecurity (RIASC).