GFD: A Weighted Heterogeneous Graph Embedding Based Approach for Fraud Detection in Mobile Advertising

Online mobile advertising plays a vital role in the mobile app ecosystem. The mobile advertising frauds caused by fraudulent clicks or other actions on advertisements are considered one of the most critical issues in mobile advertising systems. To combat the evolving mobile advertising frauds, machine learning methods have been successfully applied to identify advertising frauds in tabular data, distinguishing suspicious advertising operations from normal ones. However, such approaches may suffer from labor-intensive feature engineering and limited robustness of the detection algorithms, since the online advertising big data and the complex fraudulent advertising actions generated by malicious codes, botnets, and click farms are constantly changing. In this paper, we propose a novel weighted heterogeneous graph embedding and deep learning-based fraud detection approach, namely, GFD, to identify fraudulent apps for mobile advertising. In the proposed GFD approach, (i) we construct a weighted heterogeneous graph to represent behavior patterns between users, mobile apps, and mobile ads and design a weighted metapath to vector algorithm to learn node representations (graph-based features) from the graph; (ii) we use a time window based statistical analysis method to extract intrinsic features (attribute-based features) from the tabular sample data; (iii) we propose a hybrid neural network to fuse graph-based features and attribute-based features for classifying the fraudulent apps from normal apps. The GFD approach was applied on a large real-world mobile advertising dataset, and experiment results demonstrate that the approach significantly outperforms well-known learning methods.


Introduction
Online mobile advertising plays a vital role in the mobile app ecosystem. One of the popular models in mobile app advertising is known as cost per action (CPA), where payment is based on user action, such as downloading and installing an app on the user's mobile device. This CPA model may incentivize malicious mobile content publishers (typically app owners) to generate fraudulent actions on advertisements to get more financial returns [1][2][3]. Some traditional methods and techniques have been used for detecting and stopping click fraud, such as the threshold-based method [4], CAPTCHA [5], splay tree [6], TrustZone [7], power spectral density analysis [8], and social network analysis [9].
To automatically detect mobile advertising fraud behaviors, machine learning methods have been successfully applied to find fraud patterns in data, distinguishing suspicious advertising operations from normal ones [10][11][12][13][14]. For learning models with attribute features, researchers usually use several attributes from each sample to train a model that identifies the fraud behaviors. Unfortunately, such approaches may suffer from labor-intensive feature engineering and limited robustness of the detection algorithms, since the online advertising big data and the complex fraudulent advertising actions generated by malicious codes, botnets, and click farms are constantly changing. What is more, fraudsters could easily adjust their fraud patterns based on existing fraud detection attributes and rules to avoid being detected. Recently, some researchers have used the relationships between information entities to construct a graph model and then applied graph mining or learning methods to identify the changing fraud behaviors [15][16][17]. All these methods offer useful insights into the learning mechanism for classifying fraud behaviors from normal activities. Intuitively, if we could combine the complementary information from the attributes of sample data and the relationships between entities (e.g., users, apps, and ads), we would be able to improve the accuracy and robustness of fraud detection.
However, to unleash the power of attribute-based information and graph-based information, we have to address a series of challenges. First, to take advantage of the characteristics of graphs, we should construct a suitable graph that can represent the interaction behaviors between information entities such as users, apps, and ads. Second, an efficient graph learning method should be developed to learn useful structural and semantic representation information from the constructed graph [18,19], particularly from a heterogeneous graph [20]. Third, fusing different kinds of information from sample attributes and node representations is difficult because of their inherent heterogeneity and high-order characteristics.
To address the above challenges, in this paper, we propose a weighted heterogeneous graph embedding and deep learning-based fraud detection approach, namely, GFD, to identify fraudulent apps for mobile advertising. In the proposed GFD approach, (i) considering behavior patterns between users, mobile apps, and mobile ads, we construct a weighted heterogeneous graph to represent mobile app advertising behavior and propose a new weighted metapath to vector algorithm, namely, WMP2vec, to learn low-dimensional latent representation (graph-based features) for apps' nodes in the weighted heterogeneous graph; (ii) we use a time window based statistical analysis method to extract intrinsic features (attribute-based features) from the tabular sample data; (iii) we present a hybrid convolutional neural network model to fuse graph-based features and attribute-based features for classifying the fraudulent apps from normal apps.
We evaluate the GFD approach and the WMP2vec algorithm on a real-world dataset from one of the mobile advertising platforms in China. Results show that WMP2vec reaches higher performance than three well-known graph embedding algorithms on the constructed weighted heterogeneous graph, and the GFD approach achieves the highest classification performance compared with Support Vector Machine (SVM), Random Forest (RF), and Fully Connected Neural Networks (FCNN). The rest of the paper is organized as follows. We introduce the GFD approach to detect fraudulent apps with deep neural networks and the heterogeneous graph embedding algorithm WMP2vec in Section 2. We present the experimental results and discussion in Section 3. In Section 4, we introduce the related work. We conclude this paper in Section 5.

Proposed Approach
The flow chart of the proposed GFD approach is shown in Figure 1. First, we propose a weighted heterogeneous graph embedding method to learn the node representation, including constructing the weighted heterogeneous graph and the WMP2vec algorithm. Second, we use a statistical analysis method to extract attribute-based features from the tabular sample data. Third, we introduce deep neural networks to fuse the attribute-based features and graph-based features for identifying fraudulent apps from normal ones.

Data Description.
We collect advertising log data of mobile apps from a mobile advertising platform. Our mobile advertisement dataset contains the following attributes: user ID, a code to identify a unique mobile user; app ID, a code to identify a unique mobile app; ad ID, a code to identify a unique mobile advertisement; geographical attributes, a series of user geographical attributes used to detect anomalies, including encrypted IP and city; action type, user behavior related to the ads, such as viewing, clicking, app downloading start, app downloading completion, and app installation completion; action time, the time-stamp when the action happened; and device attribute, user device related attributes, such as device ID, device system models, and screen size.
A seven-day mobile advertising log dataset in June 2015 was studied in this paper, and some examples of our raw data are shown in Table 1.

Weighted Heterogeneous Graph Embedding.
In this section, we firstly propose the problem definition and construct the weighted heterogeneous graph, and then we present WMP2vec algorithm to learn latent representation of nodes in weighted heterogeneous graph.

Problem Definition
(1) Given. An undirected weighted heterogeneous graph $G = \langle V, E, W \rangle$ is given, where $V$ is a set of app nodes, ad nodes, and user nodes; $E$ is a set of undirected weighted edges between any two types of nodes: app nodes and user nodes, user nodes and ad nodes, and ad nodes and app nodes; and $W$ is the set of edge weights.
The task is to learn $d$-dimensional latent representations $X_e \in \mathbb{R}^{|V| \times d}$ (where $d \ll |V|$) for the nodes, which capture the structural and semantic relations among nodes in the graph $G$ and can be used for classifying fraudulent apps.

Weighted Heterogeneous Graph Construction.
Let $U$ be the set of user nodes, let $P$ be the set of app (publisher) nodes, and let $A$ be the set of advertisement nodes. If there exists an action from user $u \in U$ to advertisement $a \in A$ through app $p \in P$, we form edges from $u$ to $p$, from $u$ to $a$, and from $p$ to $a$, respectively, such that $E_1 = U \times P$, $E_2 = U \times A$, and $E_3 = P \times A$ are the edge sets of the heterogeneous graph $G$. The set of weights is $W = \{w_{up}, w_{ua}, w_{pa}\}$, where the weights $w_{up}$, $w_{ua}$, and $w_{pa}$ are defined proportional to the behavioral centrality of $u$ to $p$, $u$ to $a$, and $p$ to $a$, respectively. The calculation formula of $w_{up}$ is shown in equation (1), and $w_{ua}$ and $w_{pa}$ are calculated analogously.
In equation (1), $C_{up}$ is the number of times user $u$ operates on advertisements through app $p$, and $\Omega(u)$ is the set of operations of user $u$ on all the advertisements.
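The weighting step can be illustrated with a short sketch. Since equation (1) is not reproduced in the text, the normalization used here is an assumption: each weight is taken as the action count of the pair divided by the total number of actions of the edge's first endpoint (for example, $C_{up} / |\Omega(u)|$ for a user-app edge). The function name `edge_weights` and the toy log are hypothetical.

```python
from collections import Counter

def edge_weights(log):
    """Compute normalized edge weights from (user, app, ad) action records.

    Assumption (not from the source): each weight is the pair's action count
    divided by the total number of actions of the edge's first endpoint.
    """
    up, ua, pa = Counter(), Counter(), Counter()
    user_total, app_total = Counter(), Counter()
    for user, app, ad in log:
        up[(user, app)] += 1      # user-app co-occurrences
        ua[(user, ad)] += 1       # user-ad co-occurrences
        pa[(app, ad)] += 1        # app-ad co-occurrences
        user_total[user] += 1     # all actions of this user
        app_total[app] += 1       # all actions seen through this app
    w_up = {k: c / user_total[k[0]] for k, c in up.items()}
    w_ua = {k: c / user_total[k[0]] for k, c in ua.items()}
    w_pa = {k: c / app_total[k[0]] for k, c in pa.items()}
    return w_up, w_ua, w_pa

# Hypothetical toy log: user u1 acts three times, user u2 once.
log = [("u1", "p1", "a1"), ("u1", "p1", "a2"), ("u1", "p2", "a1"), ("u2", "p1", "a1")]
w_up, w_ua, w_pa = edge_weights(log)
```

With this normalization every weight lies in (0, 1], which keeps a weight bias such as $\beta = 0.1$ on a comparable scale across nodes with very different activity levels.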

Graph Embedding Algorithm.
In this section, based on the sequence generation method from metapath based random walk in heterogeneous graph [20], we propose WMP2vec algorithm to generate random walk sequence in weighted heterogeneous graph and embed sequence to representation vector with Skip-Gram [21] for nodes.
(1) Weighted Metapath Based Random Walk. We predefine the number of walks per node $n$, the number of walk sequences $l$, and a metapath $M$. The metapath is defined as a path in the heterogeneous graph $G$ with its metatemplate $T_G = (Z, R)$, where $Z = \{Z_u, Z_p, Z_a\}$ and $R = \{R_u, R_p, R_a\}$. Each node $v$ and each edge $e$ are associated with mapping functions $\varphi(v): V \to Z$ and $\phi(e): E \to R$, respectively. Supposing that the current node is $v_i$, the relationship be…
For walk sequence generation, we go through the metapath scheme $l$ times, and each pass generates one corresponding walk sequence. In the first pass, we use two different selection methods (first phase and second phase), because there is no limit on the edge weight at the beginning. After the first pass, we use the second-phase method to select the next node.
For the first phase, when the length of the walk sequence is less than 2, the next node in the sequence is randomly selected from the neighbor set $N_{t+1}(v_i)$ of the current node, whose members meet the requirements of metapath $M$ [20]. The transition probability from $v_i$ to $v_{i+1}$ in this phase is defined by equation (2).
For the second phase, when the length of the walk sequence is between 2 and $l \cdot |R|$, the transition probability is restricted by a weight bias $\beta$. Supposing that the latest weight of an edge of relationship $R_i$ is $w_i$, the weight of the next edge should be in the range $[w_i - \beta, w_i + \beta]$. The transition probability from $v_i$ to $v_{i+1}$ in this phase is defined by equation (3), where $C(v_i)$ is the set of neighbors meeting this requirement.
(2) Embedding Sequence to Vector with Skip-Gram. Based on the weighted metapath random walk sequences, we use the Skip-Gram model [21] and negative sampling [22] to learn low-dimensional representations of nodes. A description of our proposed WMP2vec algorithm is shown in Algorithm 1.
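The two-phase node selection can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1: the next node is drawn uniformly from the admissible candidates, the "latest weight" is taken to be the weight of the most recently traversed edge, and the walk stops early when no candidate satisfies the weight range. The function name `wmp_walk`, the adjacency layout, and the toy graph are hypothetical.

```python
import random

def wmp_walk(adj, node_type, start, metapath, length, beta, rng):
    """Generate one weighted metapath walk.

    adj[v] is a list of (neighbor, weight); node_type[v] is a one-letter type.
    Assumptions: uniform choice among candidates; the weight constraint uses
    the most recently traversed edge; the walk stops if no candidate remains.
    """
    cycle = metapath[:-1]                 # "PUAUP" repeats as P, U, A, U, P, ...
    walk, last_w = [start], None
    while len(walk) < length:
        want = cycle[len(walk) % len(cycle)]
        cands = [(v, w) for v, w in adj[walk[-1]] if node_type[v] == want]
        if len(walk) >= 2:                # second phase: restrict by weight bias
            cands = [(v, w) for v, w in cands if abs(w - last_w) <= beta]
        if not cands:
            break
        nxt, last_w = rng.choice(cands)   # first phase: plain uniform choice
        walk.append(nxt)
    return walk

# Hypothetical toy graph with two apps (P), two users (U), and one ad (A).
node_type = {"p1": "P", "p2": "P", "u1": "U", "u2": "U", "a1": "A"}
adj = {
    "p1": [("u1", 0.50)],
    "p2": [("u1", 0.20)],
    "u1": [("p1", 0.50), ("p2", 0.20), ("a1", 0.55)],
    "u2": [("a1", 0.90)],
    "a1": [("u1", 0.55), ("u2", 0.90)],
}
walk = wmp_walk(adj, node_type, "p1", "PUAUP", 5, 0.1, random.Random(0))
```

In the toy graph the weight constraint removes the edges with weights 0.90 and 0.20 from consideration, so the walk stays on edges of similar strength. The resulting sequences would then be fed to a Skip-Gram model in place of sentences.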

Attribute-Based Feature Extracting.
From the raw log data (tabular data) of mobile advertising, we define a time window of $t$ hours and divide the original data into $24/t$ data blocks for one day (24 hours). Then, a plain statistical analysis is performed on each field in each data block: the ratio of the number of unique values of the field to the total number of records in the specified time window is computed. The attribute-based feature corresponding to one mobile app can be represented as a feature matrix with $24/t$ rows.
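The time-window statistic described above can be sketched in a few lines. The record layout (a dict with an integer `hour` plus named fields), the function name `window_features`, and the choice to emit zeros for an empty time window are assumptions made for illustration.

```python
def window_features(records, fields, t):
    """Build a (24 / t) x len(fields) matrix of unique-value ratios for one app.

    records: list of dicts with an integer 'hour' (0-23) plus the given fields.
    Assumption: a time window without records yields a row of zeros.
    """
    n_blocks = 24 // t
    blocks = [[] for _ in range(n_blocks)]
    for rec in records:
        blocks[rec["hour"] // t].append(rec)
    matrix = []
    for block in blocks:
        row = []
        for f in fields:
            # ratio of distinct values of the field to the number of records
            ratio = len({rec[f] for rec in block}) / len(block) if block else 0.0
            row.append(ratio)
        matrix.append(row)
    return matrix

# Hypothetical log of one app for one day.
records = [
    {"hour": 0, "user": "u1", "city": "c1"},
    {"hour": 1, "user": "u1", "city": "c1"},
    {"hour": 5, "user": "u2", "city": "c1"},
    {"hour": 13, "user": "u3", "city": "c2"},
]
features = window_features(records, ["user", "city"], t=6)
```

A low ratio in a window means that many records share few distinct values, for example many actions coming from a handful of users.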

Hybrid Neural Network for Classification.
To take advantage of the graph-based features and attribute-based features, we propose a hybrid convolutional neural network (HNN) model to fuse and learn from both kinds of information in the GFD approach. The overview of the hybrid neural network is shown in Figure 2.
In the HNN model, the first layer (input layer) contains the attribute-based feature matrix $X_s \in \mathbb{R}^{N \times t \times m}$ and the graph-based feature $X_e \in \mathbb{R}^{N \times d}$, where $N$ is the number of samples, $t$ is the number of time windows in one day (24 hours), $m$ is the dimension of the attribute-based feature in a time window, and $d$ is the dimension of the node embedding.
The convolutional part includes two convolutional layers. The first convolutional layer convolves its input with the kernel $W_{C_1} \in \mathbb{R}^{w_1 \times w_1}$, adds the bias $b_{c_1} \in \mathbb{R}$, and applies the activation function $\mathrm{relu}(x) = \max(0, x)$, where $w_1$ is the size of the kernel. The second convolutional layer is constructed with the convolution kernel $W_{C_2} \in \mathbb{R}^{w_2 \times w_2}$ and the bias $b_{c_2} \in \mathbb{R}$, where $w_2$ is the size of the kernel.
We concatenate $X_c$ and $X_r$ into a single matrix $X \in \mathbb{R}^{N \times (d_0 + d)}$ to be the input of the first fully connected layer $l_1$. The layer $l_1$ is constructed with the weight $W_1 \in \mathbb{R}^{(d_0 + d) \times d_1}$ and the bias $b_1 \in \mathbb{R}^{d_1}$, where $d_1$ is the number of neurons in the first fully connected layer. The second fully connected layer $l_2$ is constructed with the weight $W_2 \in \mathbb{R}^{d_1 \times d_2}$ and the bias $b_2 \in \mathbb{R}^{d_2}$, where $d_2$ is the number of neurons in the second fully connected layer.
In the output of HNN, $y \in (0, 1)$ is the probability that an application is fraudulent. It is computed with the weight $W_o \in \mathbb{R}^{d_2 \times 1}$ and the bias $b_o \in \mathbb{R}$, and $\sigma(\cdot) = \mathrm{sigmoid}(\cdot)$ is the sigmoid function. The cross-entropy function with L2-regularization is used to calculate the loss of the hybrid convolutional neural network model.
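The layer definitions in this subsection can be written out as one forward pass. The following is a hedged reconstruction from the symbols defined in the text, not the authors' exact equations: the intermediate name $X_{c_1}$, the flattening step, the use of relu in the fully connected layers, the reading of $X_r$ as the graph-based feature, the ground-truth label $\hat{y}_i$, the parameter set $\theta$, and the regularization strength $\lambda$ are all assumptions.

```latex
% Sketch under stated assumptions; X_{c_1}, flatten, \hat{y}_i, \theta, and \lambda are not from the source.
\begin{align}
X_{c_1} &= \mathrm{relu}\left(W_{C_1} * X_s + b_{c_1}\right) \\
X_c &= \mathrm{flatten}\left(\mathrm{relu}\left(W_{C_2} * X_{c_1} + b_{c_2}\right)\right) \in \mathbb{R}^{N \times d_0} \\
X &= \left[X_c, X_r\right] \in \mathbb{R}^{N \times (d_0 + d)} \\
l_1 &= \mathrm{relu}\left(X W_1 + b_1\right) \\
l_2 &= \mathrm{relu}\left(l_1 W_2 + b_2\right) \\
y &= \sigma\left(l_2 W_o + b_o\right) \\
\mathcal{L} &= -\frac{1}{N} \sum_{i=1}^{N} \left[\hat{y}_i \log y_i + (1 - \hat{y}_i) \log\left(1 - y_i\right)\right] + \lambda \lVert \theta \rVert_2^2
\end{align}
```

The shapes chain consistently: $X W_1$ is $N \times d_1$, $l_1 W_2$ is $N \times d_2$, and $l_2 W_o$ is $N \times 1$, so the output holds one probability per app.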

Data Description and Preprocessing.
A real-world dataset was collected from a mobile advertising platform in China. The dataset consists of seven days with around 2 M users, 3.5 K apps, and 1 K advertisements per day. We partition our log data into seven subsets with a one-day period and conduct experiments on each subset to evaluate our model. The proportion of fraudulent apps is about 2-4 percent of the total 3,500 apps each day. More details of the dataset are described in Section 2.1.

Evaluation Metric.
In this paper, we define the fraudulent apps as positive samples and the other apps as negative samples.
The Average Precision (AP) and the Area Under the ROC Curve (AUC) are used to evaluate the proposed algorithm and approach. The AP criterion summarizes the Precision-Recall performance at different threshold levels and corresponds to the area under the Precision-Recall curve. The ROC curve is…
Algorithm 1: WMP2vec.
(1) Input: The weighted heterogeneous information graph $G = \langle V, E, W \rangle$, a metapath scheme $M$, walks per node $n$, longest walk length per walk $l$, embedding dimension $d$, neighborhood size $k$
(2) Output: The latent node embedding $X \in \mathbb{R}^{|V| \times d}$
(26) draw $u$ and $w$ according to equation (3)
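The AP and AUC metrics used in the evaluation can be computed directly from labels and scores. The sketch below uses the common step-wise definition of AP (mean precision at the rank of each positive sample) and the pairwise definition of AUC; whether the authors used exactly these variants is an assumption, and the function names and toy data are hypothetical.

```python
def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank          # precision at this cut-off
    return total / hits if hits else 0.0

def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5.

    Assumes both classes are present in labels.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores for five apps, two of them fraudulent (label 1).
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
ap = average_precision(labels, scores)
auc = roc_auc(labels, scores)
```

Because fraudulent apps make up only a few percent of the samples, AP is the more sensitive of the two metrics: it depends on how many negatives are ranked above each positive, while AUC averages over all positive-negative pairs.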

Evaluation of WMP2vec Algorithm.
In this section, we use the WMP2vec algorithm to learn the embedding vectors of the nodes (apps) from the constructed weighted heterogeneous graph and then take these embedding vectors as the input of a Random Forest (RF) model to classify fraudulent apps. Based on Section 2.2.2, we construct a weighted heterogeneous graph and define a metapath, app-user-ad-user-app (PUAUP), which represents the heterogeneous semantics of fraudulent publishers (apps) that mimic legitimate users to act on the ads from the apps.

Comparison Models and Parameters.
We compare the AP and AUC of the WMP2vec model with three well-known graph embedding models: DeepWalk [23], Node2vec [24], and Metapath2vec [20]. The compared algorithms and their parameters are as follows:
(1) DeepWalk: DeepWalk [23] is the first graph embedding model based on Word2vec. We use the Skip-Gram model [21] and hierarchical softmax [25] with gradient descent to learn the node representation. The negative sampling technique [22] is used to accelerate the Skip-Gram model. The count of random walks is 30, and the walk length is 40.
(2) Node2vec: Node2vec [24] extends the DeepWalk algorithm by introducing a backward probability $p$ and a forward probability $q$. The same random walk parameters as for DeepWalk (count = 30 and length = 40) are used, and the negative sampling technique is also used. In addition, we use $p = 0.5$ and $q = 0.2$ for the backward probability and forward probability, respectively.
(3) Metapath2vec: Metapath2vec [20] uses the metapath based random walk to construct node sequences and then leverages Skip-Gram to perform node embedding. The metapath in this study is PUAUP. The count of random walks is 30, and the walk length is 10.
(4) WMP2vec: We use the same parameters as for Metapath2vec (count = 30, length = 10, and metapath = PUAUP), and additionally the weighted bias $\beta$ is 0.1.
In all the compared models, we train the Skip-Gram model with a window size of 5 and 5 negative samples in negative sampling. The graph-based feature of each node is a 32-dimensional vector. The parameters of the RF model are as follows: the number of weak learners is 150, the maximum depth is 5, and the minimum number of samples per leaf is 5.
Tables 2 and 3 show the experimental results by comparing the AP and AUC over 10-fold cross-validation for seven days. The WMP2vec model reached the highest AP value on six days and the highest AUC value on three days out of all seven days. The Metapath2vec model reached the highest AP value on one day and the highest AUC value on two days out of seven days. Thus, WMP2vec outperforms all other models, such that WMP2vec > Metapath2vec > Node2vec > DeepWalk.

Impacts of Parameters.
In this subsection, we evaluate the impacts of parameters over the classification task: (i) count of random walk, walk length, and window size of Skip-Gram in WMP2vec and Metapath2vec model; (ii) weighted bias β of WMP2vec. We compare the AP and AUC values in the dataset from one day.
(1) Count of Random Walk. Figure 3 shows the experimental results by comparing the AP and AUC with different counts of random walks, with a fixed walk length of 5. When the count of random walks is larger than 30, the WMP2vec and Metapath2vec models each perform better than with count = 10. In addition, the values of AP and AUC have…
(2) Walk Length. Figure 4 shows the experimental results by comparing the AP and AUC with different walk lengths (length = 5, 10, 20, 50, and 80), with a fixed count of random walks of 10. The WMP2vec and Metapath2vec models reach better performance when the walk length is at least 10. In addition, when the length changes from 10 to 20, 50, and 80, the AP values change very little and the AUC values fluctuate somewhat.
(3) Window Size of Skip-Gram. Figure 5 shows the experimental results by comparing the AP and AUC with different window sizes (size = 3, 4, 5, 6, and 7) over the classification task. The best performance of the models is reached when the window size is 5.
(4) Weighted Bias of WMP2vec. Figure 6 shows the experimental results by comparing the AP and AUC with different weighted bias $\beta$ of WMP2vec ($\beta$ = 0.1, 0.3, 0.5, 0.7, and 1.0) over the classification task. As the weighted bias $\beta$ increases, the performance of WMP2vec gets closer to the performance when $\beta = 1.0$. The values of AP and AUC change very little when $\beta \geq 0.5$.

Evaluation of Hybrid Neural Network.
In this section, we evaluate the classification performance of the HNN model for fusing graph-based features and attribute-based features in the GFD approach. Following the flow of the GFD approach in Figure 1, we extract the attribute-based features and the graph-based features and then use the HNN model to fuse the two kinds of features to identify fraudulent apps.

Features Extraction.
Based on Section 2.3, we divide the log data for each app into 24 parts per day; that is, the time window is one hour. We calculate the ratio of records whose attributes take a certain value to all records in each time window, and we calculate this ratio for each of 22 attributes in total, such as anonymized user id, advertisement id, country id, and device operating system. In addition, we calculate the ratio for the browsing behavior and for the other actions of users on ads, respectively. Finally, we get 24 features for a time window (one hour), and the dimension of the attribute-based features of each app is 24 × 24 for one day.
Based on Section 2.2.2 and Section 3.3, for the graph-based feature extraction, we construct the weighted heterogeneous graph of user-app-ad and then extract the graph-based features through training with WMP2vec. The dimension of the graph-based features for each app is 32.

Security and Communication Networks 7
HNN: HNN is the fusing model proposed in this study. The number of convolutional layers is 2, and the kernel size is 3 × 3. The number of fully connected layers is 2 with 100 neurons, using the activation function "ReLU," and the keep probability of dropout is 0.9. The learning rate is 0.0001, the decay factor of the learning rate is 0.98, and the batch size is 100.
In order to make sure that all models can learn the same knowledge from the dataset, when training the comparison models, we flatten the attribute-based features into a 576-dimensional vector. Furthermore, the vector is concatenated with the graph-based features, and the dimension of the total input vector is 576 + 32 = 608.
We randomly divide the negative samples and the positive samples of the dataset into three subsets with a ratio of 8 : 1 : 1, respectively, and combine the corresponding positive and negative subsets into training (80%), validation (10%), and test (10%) sets. In order to handle the class imbalance between fraudulent and nonfraudulent apps, we adopt an upsampling technique during training.
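The per-class 8 : 1 : 1 split and the upsampling step can be sketched as follows. The text does not say which upsampling technique is used, so random oversampling of the minority class with replacement is an assumption here, as are the function names and the toy data.

```python
import random

def stratified_split(samples, labels, rng):
    """Split each class 8 : 1 : 1 and merge into train, validation, and test."""
    splits = ([], [], [])
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        a = len(idx) * 8 // 10            # end of the training part
        b = a + len(idx) // 10            # end of the validation part
        for part, chunk in zip(splits, (idx[:a], idx[a:b], idx[b:])):
            part.extend((samples[i], labels[i]) for i in chunk)
    return splits

def upsample(train, rng):
    """Random oversampling with replacement (an assumed technique).

    Assumes both classes are present in the training set.
    """
    pos = [s for s in train if s[1] == 1]
    neg = [s for s in train if s[1] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return train + extra

# Hypothetical data: 100 apps, 10 of them fraudulent (label 1).
rng = random.Random(0)
samples = list(range(100))
labels = [1] * 10 + [0] * 90
train, val, test = stratified_split(samples, labels, rng)
balanced = upsample(train, rng)
```

Only the training set is upsampled in this sketch; the validation and test sets keep the original class ratio so that AP and AUC reflect the real imbalance.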

Experimental Results.
The experimental results are shown in Tables 4 and 5. The HNN model proposed in this study reaches the highest AP value on six days and the highest AUC value on four days out of all seven days. The FCNN, RF, and SVM models have similar performance in terms of the AUC measure, and Table 4 shows that HNN > FCNN > RF > SVM in terms of the AP measure. Thus, HNN outperforms all other models in terms of the AP and AUC measures.

Comparative Experiments without Graph-Based Features.
To show the contribution of graph-based feature extraction in the proposed GFD approach, we remove the graph-based features from our dataset. When the proposed HNN model has only attribute-based features as input and no graph-based features as input, the HNN model leaves only the fully connected part to work, since the convolution part of the HNN model has no input. This also means that the working HNN model would change to a fully connected neural network, that is, the FCNN model, in this setting. So we use the SVM, RF, and FCNN models in this comparative experiment.
The results are shown in Tables 6 and 7. Comparing the performance of the models with and without graph-based features in Tables 4 and 5 and Tables 6 and 7, we find that the FCNN model with graph-based features reaches better performance than the model without the graph-based features in both AP and AUC measures, while the performance improvement of the SVM and RF models with graph-based features is not obvious.

Impacts of Parameters.
(1). Time Windows t.
The time window in the attribute-based feature extraction of the GFD approach decides the dimension of the attribute-based features.
We designed experiments to show the impact of the time window, and the results are shown in Table 8. The size of the time window is set to 1, 3, and 6 hours. As the size of the time window increases, the AP values of HNN get worse. The other models seem to be insensitive to the size of the time window.

(2). Number of Convolutional Layers in HNN Model.
We compare the effect of the number of convolutional layers of 1, 2, and 3 in HNN model and show the results in Table 9.
The AUC and AP values reach a high level when the number of convolutional layers is 2.

(3). Number of Fully Connected Layers in HNN Model.
We set the number of fully connected layers from 1 to 4, and the experiment results are shown in Table 10. When the number of fully connected layers is 2, the HNN model reaches the highest performance.

(4). Activation Functions in HNN Model.
We compare three well-known activation functions, ReLU, tanh, and Sigmoid, in the HNN model, and the experiment results are shown in Table 11.
The AUC values of the models with different activation functions are similar, and ReLU is slightly better than the others. In terms of AP, ReLU is obviously better than the other two activation functions.

Related Work
Our work is related to existing studies on attribute-based fraud detection and graph-based fraud detection with machine learning. The challenges of the fraud detection problem in mobile advertising systems are summarized as the accuracy requirement, the throughput requirement, and the ability to combat the latest fraud methods [1].
Attribute-based fraud detection approaches have been used in the fraud detection domain. Crussell et al. [26] built decision trees based on the features extracted from their dataset for classification. Liu et al. [27] proposed a binary SVM classifier to determine whether two UIs are likely to lead to equivalent states. This classification is used to simulate user interaction in the context of ad clicking. In order to classify malicious publishers, Mouawi et al. [11] evaluated KNN, SVM, and ANN based on features extracted from a dataset, and the experimental results show that all three classifiers give very promising results. Haider et al. [2] proposed an ensemble-based method to classify each individual ad display as fraudulent or nonfraudulent. Gabriel et al. [28] evaluated the performance of logistic regression, gradient trees, and a deep learning method in credit card fraud detection and showed that the deep learning method outperforms the other compared methods.
Graph-based fraud detection approaches have been studied recently. Hu et al. [15] proposed a weighted graph propagation algorithm to identify the fraudulent apps in user-app bipartite graphs. Vasumati et al. [29] applied decision trees to classify spam publishers based on a constructed feature vector and computed a spam score for each of the spam publishers by constructing a bipartite graph between users and publishers to find fraud publishers. What is more, the natural language processing (NLP) models known as Word2vec [23] have been applied to graph embedding, such as DeepWalk [10], Node2vec [21], and Metapath2vec [22]. Zheng et al. [30] proposed an unsupervised method to detect abnormal users and items through deep joint network embedding. Yu et al. [16] proposed a deep embedding approach for anomaly detection in dynamic networks by learning network representations which can be updated dynamically as the network evolves. Mobile advertising fraud detection is still challenging; however, ensemble learning methods were usually the winning algorithms in fraud detection competitions [10], and deep learning and graph learning are recently the most promising methods in this area.
There are two key differences between our proposed approach and existing works. First, we used the app id, ad id, and user id from the real-world dataset to construct a weighted heterogeneous graph with these three types of nodes and proposed the graph embedding algorithm for mobile advertising fraud detection. The popular existing datasets, such as the TalkingData dataset [31], usually have one or two types of entities (e.g., app id), so there are not enough entities to construct a heterogeneous graph as we did in this paper. Second, we proposed a fusing model to combine attribute-based and graph-based information for mobile advertising fraud detection by graph embedding and deep learning methods.

Conclusion
In this paper, we focus on the fraud detection problem in mobile advertising to detect fraudulent publishers. We propose a novel weighted heterogeneous graph and deep learning-based fraud detection approach, namely, GFD, to identify fraudulent apps for mobile advertising. Based on the relationships of users, publishers, and advertisements in the mobile ad system, we construct a weighted heterogeneous graph and propose a weighted metapath based graph embedding approach, named WMP2vec, to learn structural features of publishers in the graph. Furthermore, we construct a hybrid convolutional neural network to learn high-order features from attribute-based features and graph-based features. The experimental results on a real-world dataset show that our method is effective in classifying fraudulent apps for mobile advertising systems. There are two limitations in the work presented here. First, the dataset is limited to one mobile advertising dataset. In order to be more generalizable, it would be important to see whether the proposed GFD approach excels on more fraud detection datasets. Second, the dataset is limited to seven days. In the complex and dynamic online advertising environment, more time is still needed to evaluate the proposed approach.
Despite being focused on mobile advertising fraud detection in this presentation, the proposed GFD approach could be generalized to benefit many other online applications (e.g., e-commerce) that involve relationship between several types of entities. Future work should focus on the robustness and accuracy of our proposed model for other large-scale online datasets.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest.