Multiview Ensemble Method for Detecting Shilling Attacks in Collaborative Recommender Systems

1 School of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei Province, China; 2 The Key Laboratory for Computer Virtual Technology and System Integration of Hebei Province, Qinhuangdao, Hebei Province, China; 3 Department of Computer Science and Technology, Xinzhou Teachers University, Xinzhou, Shanxi Province, China; 4 School of Software Engineering, Beijing Jiaotong University, Beijing, China


Introduction
Collaborative recommender systems are widely used in e-commerce websites to handle the information overload problem by providing personalized recommendations for their users. However, due to the openness of such systems, attackers are likely to inject a large number of fake profiles in order to increase or decrease the recommendation frequency of particular items (e.g., movies and products). This behaviour is often referred to as shilling attacks or profile injection attacks. According to the purpose of the attack, shilling attacks can be categorized as either push attacks or nuke attacks, which promote or demote a particular item to be recommended. The fake profiles are called attack profiles or shilling profiles; they have a negative impact on the prediction quality of collaborative recommender systems and make users lose trust in the system. According to experimental observations in practical systems, an attack with 3% shilling profiles can result in a prediction shift of around 1.5 points on a five-point scale [1]. Therefore, in the face of shilling attacks, ensuring the recommendation quality of the system has become a problem that cannot be ignored in recommender system research.
By integrating a set of classifiers, some ensemble detection methods have further improved detection precision. However, the base classifiers are trained in the same feature space, and the detection precision cannot meet actual needs, especially for detecting attacks with low filler sizes and attack sizes. Obviously, not all features are relevant and effective for detecting different types of attacks. This means that the correlated errors between base classifiers cannot be adequately reduced by traditional ensemble methods that operate in the same feature space [31, 32].
To tackle the aforementioned challenges, we extract user features by considering the temporal effects of item popularity and rating values. These features offer a more comprehensive perspective for the detection of shilling profiles. Moreover, we use an optimal feature set partition algorithm to divide the user features into several subsets in order to construct multiple optimal classification views. Since ensemble methods have the potential to improve detection performance, we design a multiview ensemble detection framework, called MV-EDM, which integrates multiple classification views and base classifiers into a classification model to detect various shilling profiles.
The main contributions of the paper are summarized as follows: (1) By analysing item popularity from timestamps, we define the item temporal popularity and process it using the wavelet transform method. Based on it, we construct the temporal popularity vector of user ratings and extract 5 user features.
(2) From the observation that the item rating mean changes with time, 2 user features are extracted based on the dynamic mean of item ratings. Moreover, we extract 10 further user features by analysing the item temporal popularity and the dynamic mean of item ratings in different popular item sets.
(3) With the repartition strategy, base classifiers can be trained not only from different classification views but also from different partitioning results. Based on these qualitatively different classifiers, we propose a multiview ensemble detection framework. Experiments on the Netflix and Amazon review datasets show that the proposed method can effectively detect various synthetic attacks and real-world attacks.
The remaining parts of this paper are organized as follows. In Section 2, we introduce the background and related work on shilling attack detection. Section 3 presents the proposed method in detail. In Section 4, we present and discuss the experimental results. Finally, we conclude the paper.

Background and Related Work
2.1. Shilling Attack Models. Typically, an attack is realized by inserting several shilling profiles into a recommender system database to cause bias on selected target items [30]. A shilling profile can be defined over four sets of items [3-6]: the set of selected items, I_S, chosen to form the characteristics of the attack; the set of filler items, I_F, usually chosen randomly to obstruct detection [30]; a unique target item, i_t; and the set of unrated items, I_∅. Table 1 illustrates the common attack models in the literature.
Random attack and average attack are basic attack models, which generate shilling profiles by assigning randomly chosen empty cells ratings around the system overall mean or around each item's mean, respectively [3-6].
AoP (average over popular items) attack is an obfuscated form of average attack, which chooses filler items with equal probability from the top x% of most popular items, rather than from the entire catalogue of items [15].
User shifting and target shifting strategies are used to obfuscate the basic attacks and evade detection [5]. User shifting is designed to reduce the similarity between shilling profiles, while the goal of target shifting is to reduce the extreme ratings of shilling profiles [30]. User shifting attacks include the user random shifting attack and the user average shifting attack; target shifting attacks include the target random shifting attack and the target average shifting attack. To facilitate the presentation, we define the shifting attack as a collection with equal numbers of user random shifting, user average shifting, target random shifting, and target average shifting attacks.
Power items and power users are able to influence the largest groups of items and users, respectively [33, 34]. In the power item attack and power user attack, the strategies for selecting power items and power users include the top-N highest aggregation similarity scores, in-degree centrality, and the highest number of ratings [33, 34].
The above attacks can be used as either push attacks or nuke attacks. However, the love/hate attack and reverse bandwagon attack are specifically used as nuke attacks. In particular, the love/hate attack is an extremely effective nuke attack in which randomly chosen filler items are rated with the highest possible rating while the target item is given the lowest one [30].
When the shilling profiles generated by an attack model are injected into the system, two parameters describe the attack in the reported experimental results: the filler size (fs) is the ratio between the number of items rated by a user and the total number of items, and the attack size (as) is the ratio between the number of attackers and the number of genuine users [12].
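As a minimal illustration of the attack models and parameters described above, the following sketch generates a single push random-attack profile. The function name, parameters, and the Gaussian rating model are illustrative assumptions, not the paper's implementation:

```python
import random

def generate_random_attack(n_items, target_item, filler_size, overall_mean,
                           overall_std, r_max=5, seed=None):
    """Sketch of a push random-attack profile: filler items get ratings drawn
    around the system overall mean; the target item gets the maximum rating."""
    rng = random.Random(seed)
    n_filler = int(filler_size * n_items)        # filler size: rated/all items
    candidates = [i for i in range(n_items) if i != target_item]
    fillers = rng.sample(candidates, n_filler)   # filler items chosen at random
    profile = {i: min(r_max, max(1, round(rng.gauss(overall_mean, overall_std))))
               for i in fillers}
    profile[target_item] = r_max                 # push attack: promote the target
    return profile

# e.g. fs = 3% of a 1000-item catalogue -> 30 filler items plus the target
profile = generate_random_attack(1000, target_item=42, filler_size=0.03,
                                 overall_mean=3.6, overall_std=1.1, seed=7)
```

The attack size would then be controlled by how many such profiles are injected relative to the number of genuine users.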
Based on rating values, machine learning methods are primarily used to detect shilling profiles, including supervised and unsupervised methods [2-18]. Among supervised methods, Williams et al. [6] proposed several features based on user ratings and trained three classifiers to detect shilling profiles. Wu et al. [7, 8] proposed algorithms to select effective features and utilized supervised and semisupervised classifiers to detect attacks. Zhou [9] proposed a supervised approach for detecting AoP attacks, in which term frequency-inverse document frequency was used to extract the features. Zhou et al. [10] proposed a two-phase shilling attack detection method, SVM-TIA, which first uses an SVM-based detection technique to obtain a set of suspicious profiles and then applies target item analysis. Yang et al. [16] utilized an adaptive structure learning method to select features of users and items and, based on the selected features, proposed a two-stage detection method. In [17], a soft coclustering model with propensity similarity was presented to detect shilling attacks. Unfortunately, the above methods assume that the signature or distribution of rating values in shilling profiles will differ significantly from those of genuine profiles. This hypothesis does not always hold, because attackers are likely to look for ways to beat the detector by fabricating the rating values. Therefore, detection methods based only on rating values suffer from poor precision, especially for detecting attacks with low filler sizes and attack sizes.
Based on rating time information, several detectors have been proposed. Tang and Tang [19] used the span, frequency, and mount factor based on user rating time intervals to keep the attacked item out of the top-N list. Zhang et al. [20] presented a detector that calculates the mean and entropy of the samples in time windows after constructing a rating time series. Xia et al. [21] dynamically divided time intervals and used a hypothesis-test-based detection framework to identify anomalous items. Zhou et al. [22] reorganized each item's ratings as a time series to examine rating segments and then used statistical metrics and target item analysis to detect anomalous items. However, these methods assume that the attacks are injected within short periods and thus cannot effectively detect long-term decentralized attacks. Neal et al. [23] monitored the ratings at any time and detected attacks according to statistical attributes of all ratings, user ratings, and item ratings. Nevertheless, this method assumes that the shilling profiles have constant attack parameters, which is difficult to meet in practice.
Based on item popularity information, Zhang et al. [24] constructed a rating series from item novelty and used the Hilbert-Huang transform to extract user features for attack detection. In [25], user features were extracted by mutual information and statistical methods based on the item popularity series, and a C4.5-based ensemble classification was then used to detect the attacks. Karthikeyan et al. [26] utilized the discrete wavelet transform to obtain a feature set based on the rating series of items' popularity and novelty; the user features were then fed to an SVM for detection. Li et al. [27, 28] extracted user features according to item popularity distributions. Unfortunately, in these methods, item popularity is simply computed and treated as a static value, which can be easily influenced by noise and manipulated by attacks.
In practice, shilling profiles are usually far fewer than genuine ones, so supervised detection can be regarded as an imbalanced classification problem [11]. A large number of experimental results [2-10, 26-28] showed that a single classifier performs worse at low filler sizes and attack sizes, so ensemble methods were used to alleviate such problems. Zhang et al. [18, 25] used bagging-based ensemble methods to detect the attacks. Yang et al. [11] used an improved rescale AdaBoost method to detect the attacks. However, in these ensemble methods, the base classifiers were trained in the same feature space without consideration of feature effectiveness. This means that the correlated errors between the base classifiers cannot be adequately reduced.
In [26], a discrete wavelet transform was used to extract user features in both offline training and online detection, which consumes considerable running time. In this paper, by contrast, we apply the discrete wavelet transform only once, to the item time series in the offline preprocessing.
In [7, 8], feature selection algorithms were proposed to estimate the quality of ratings-based features in distinguishing user profiles that are near to each other, and a single classifier was then used to detect shilling profiles. In [16], the features of users and items were selected by adaptive structure learning, and a two-stage detection method was proposed. These methods focused on a feature's ability to determine neighbourhood relationships. In this paper, by contrast, we select effective features by directly evaluating their detection performance and further construct multiple optimal classification views.
In [35], the feature set is divided only once to construct the classification views. Obviously, the constructed views are affected by the initial order of features, and the multiview ensemble is integrated only from these fixed views, so the ensemble result is likely to be unstable. Unlike the work in [35], to construct views with great diversity and reduce the influence of feature order, we optimally repartition the proposed features at regular intervals based on a random order of features to obtain stable ensemble results in MV-EDM. Therefore, in MV-EDM we integrate base classifiers not only from different views but also from different partitioning results.

The Proposed Method
The framework of our method is depicted in Figure 1. To facilitate the discussion, Table 2 gives the descriptions of the notations used in this paper.

Feature Extraction.
In collaborative recommender systems, genuine users rate items according to their preferences, which can be represented by the rating values and the item temporal popularity at rating time. By contrast, attackers rate items to cause bias on selected target items [30]. They pay more attention to fabricating sophisticated rating values to manipulate the system's output. Thus, most existing detection methods are devoted to finding the rating differences between genuine and shilling profiles. However, attackers are likely to imitate the ratings of genuine users to evade detection. Therefore, attackers should be tracked by fusing other information with the ratings, such as rating time and item popularity.
With consideration of the temporal effects of item popularity, the related definitions are presented in Section 3.1.1. Then, the discrete wavelet transform method is introduced to deal with the item temporal popularity in Section 3.1.2. Based on the above preprocessing, 5 user features are extracted in Section 3.1.3. From the observation that the mean of item ratings changes with time, 2 user features are extracted in Section 3.1.4. By analysing the item temporal popularity and the dynamic mean of item ratings in different popular item sets, 10 user features are extracted in Section 3.1.5.

Definitions
Definition 1. Time Interval (TI). The time interval refers to the time bin [20], which is obtained by partitioning the timeline from the beginning time to the detection time into bins of equal width.
When the timeline from the beginning time S to the detection time is partitioned with width T, the jth time interval TI_j can be denoted as TI_j = [S + (j − 1)T, S + jT).
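Mapping a rating timestamp to its time interval index follows directly from Definition 1. A minimal sketch (the function name is illustrative):

```python
def time_interval_index(t, S, T):
    """Map timestamp t to its 1-based time-interval index j (Definition 1):
    TI_j covers [S + (j-1)*T, S + j*T), for a timeline starting at S with
    bins of width T."""
    if t < S:
        raise ValueError("timestamp precedes the beginning time")
    return int((t - S) // T) + 1

# with daily intervals (T = 86400 s), a rating 36 hours after S falls in TI_2
assert time_interval_index(t=36 * 3600, S=0, T=86400) == 2
```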
Definition 2. Popularity of Item (PI). The popularity of an item refers to the popular degree of the item [24], which is measured by the number of ratings the item receives.

Definition 3. Temporal Popularity of Item (TPI). The temporal popularity of an item is its popularity within a time interval, i.e., the number of ratings the item receives in TI_j.

Definition 4. Novelty of Item (NI). The novelty of an item refers to the degree of difference between the item and other items [36], where NI_{u,i} denotes the novelty of item i to user u and is computed from sim(i, j), the similarity between item i and item j [36].
In general, the item with greater novelty is less popular.
Definition 5. User Rating Temporal Popularity Vector (URTPV). The user rating temporal popularity vector is the vector constituted by the TPI values of the items in a user profile, ordered by the item novelty series. The URTPV of user u can be written as URTPV_u = (V_{u,1}, V_{u,2}, ..., V_{u,n}), where each component V_{u,i} is the TPI of the ith item (in novelty order) at the time user u rated it.
Figure 2 illustrates the TPI differences between genuine and shilling profiles in the Netflix dataset. The horizontal axis represents the item novelty series in ascending order, and the vertical axis represents the TPI of ratings. In Figure 2, 10 genuine user profiles are randomly selected from the Netflix dataset, and 10 shilling profiles are generated by the random and average attack models, respectively.
As shown in Figure 2, most ratings of genuine profiles are concentrated on the left side of the horizontal axis. This indicates that genuine users are likely to rate popular items, which is consistent with the observation in [28]. By contrast, the ratings of shilling profiles are uniformly distributed along the horizontal axis, because the filler items are randomly selected in the random and average attack models. Also, when these shilling profiles are injected at attack time, the distributions of TPI differ between genuine and shilling profiles.

Wavelet Decomposition of TPI Series.
The TPI offers valuable insights into the user profiles. However, if we directly measure TPI by counting the number of item ratings over a discrete time series, TPI is disturbed by many factors, such as noise and the burstiness of items [37], resulting in a nonstationary time series.
Figure 3(a) illustrates the original TPI signal of an item, which is randomly selected from the Netflix dataset.The horizontal axis represents the rating days from January 6, 2000, and the vertical axis represents the TPI.
As shown in Figure 3(a), there is high-frequency information in the original signal; i.e., the item popularity fluctuates violently over time.
According to the theory of wavelet analysis, a time series signal can be decomposed into different frequency channels, in which signals with lower complexity than the original can be obtained and noise-like high-frequency patterns can be filtered out. Therefore, the discrete wavelet transform is used to obtain the stable trend of TPI over time. In general, the higher the complexity of the time series, the more layers should be obtained from the decomposition [38]. In this paper, the original TPI signal is decomposed into 6 layers based on daily intervals in the Netflix dataset and the Amazon review dataset. Because we aim to obtain the stable popularity trends of the item, we are mainly interested in the slowest dynamics of the original signal and reconstruct the signal at the 6th layer after the original signal is decomposed into six levels.
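The idea of keeping only the slowest-varying approximation can be sketched with the simplest wavelet, the Haar basis; this is an illustration of the decompose-then-reconstruct step, not the paper's specific wavelet or implementation:

```python
import numpy as np

def haar_trend(signal, levels=6):
    """Sketch: keep only the level-`levels` Haar approximation of a series,
    i.e. the slow trend of the TPI signal, by repeatedly averaging adjacent
    pairs and then upsampling back to the original length. The paper's
    discrete wavelet transform plays the same role."""
    x = np.asarray(signal, dtype=float)
    n = len(x)
    approx = x.copy()
    for _ in range(levels):
        if len(approx) < 2:
            break
        if len(approx) % 2:                           # pad to even length
            approx = np.append(approx, approx[-1])
        approx = (approx[::2] + approx[1::2]) / 2.0   # Haar approximation step
    trend = np.repeat(approx, 2 ** levels)[:n]        # back to original scale
    return trend

# a noisy TPI-like series: the reconstruction is far less variable
noisy = np.sin(np.linspace(0, 3, 256)) + np.random.default_rng(0).normal(0, 0.3, 256)
smooth = haar_trend(noisy, levels=4)
```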

The reconstructed signal of TPI is shown in Figure 3(b); compared with Figure 3(a), the signal becomes smooth after the wavelet transformation. Thus, we adopt the reconstructed signal to reflect the main, stable trends of the TPI changes.

Extracting Features from User Rating Temporal Popularity Vector.
In genuine profiles, each item is rated according to the preferences and temporal tastes of genuine users. By contrast, in shilling profiles, the items are selected and rated by attackers according to a specific strategy. Therefore, the information conveyed by the URTPV differs between genuine and shilling profiles. Moreover, due to the different TPI distributions in the URTPV of genuine and shilling profiles, we can use statistical methods to extract the detection features.
(1) Information Entropy of User Rating Temporal Popularity Vector (IE URTPV). The empirical probability is used to evaluate P(Y_1 = y_1, ..., Y_m = y_m); i.e., the components of URTPV are divided into Q bins between the minimum and maximum values, and the probabilities are determined by the proportions of components falling into each bin [39].

(2) Corrected Conditional Entropy of User Rating Temporal Popularity Vector (CCE URTPV).
Based on the item novelty series, the sample space formed by the components of URTPV, {Y_t = V_{u,t}}, can be regarded as a random process in which the previous state affects the latter one. The conditional entropy is often used to measure the complexity of a random process. Given the previous states, the conditional entropy can be calculated as CE(Y_m | Y_1, ..., Y_{m−1}) = E(Y_1, ..., Y_m) − E(Y_1, ..., Y_{m−1}), where E(Y_1, ..., Y_m) denotes the joint entropy of a length-m state sequence, which can be calculated by (9).

Security and Communication Networks
However, in recommender systems, most users are inclined to rate a small number of items; that is, the random process {Y_t = V_{u,t}} is a short series. Thus, we use CCE (the corrected conditional entropy) to measure the complexity of the short series, which can be calculated as CCE(Y_m | Y_1, ..., Y_{m−1}) = CE(Y_m | Y_1, ..., Y_{m−1}) + perc(Y_m) · EN(Y_1), where perc(Y_m) is the percentage of unique sequences of length m and EN(Y_1) is the entropy estimated with m fixed at 1 [40]. The minimum of CCE over different sequence lengths is used to quantify the regularity level of the random process [40].
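A minimal sketch of the CCE computation described above, assuming the standard corrected-conditional-entropy form of [40] with equal-width quantization into Q bins; the quantization choice and bin count are illustrative:

```python
import numpy as np
from collections import Counter

def corrected_conditional_entropy(series, q_bins=5, max_m=4):
    """Sketch of CCE [40]: quantize the series into q_bins symbols, compute
    CCE(m) = E(m) - E(m-1) + perc(m) * E(1), where perc(m) is the fraction of
    length-m patterns occurring exactly once, and return the minimum over m."""
    x = np.asarray(series, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:
        sym = np.zeros(len(x), dtype=int)
    else:
        sym = np.minimum((q_bins * (x - lo) / (hi - lo)).astype(int), q_bins - 1)

    def block_entropy(m):
        """Empirical joint entropy of all length-m windows of the symbol series."""
        patterns = Counter(tuple(sym[i:i + m]) for i in range(len(sym) - m + 1))
        p = np.array(list(patterns.values()), dtype=float)
        p /= p.sum()
        return float(-np.sum(p * np.log2(p))), patterns

    e1, _ = block_entropy(1)
    best, prev_e = e1, e1                 # CCE at m = 1 is taken as E(1)
    for m in range(2, max_m + 1):
        em, pats = block_entropy(m)
        perc = sum(1 for c in pats.values() if c == 1) / (len(sym) - m + 1)
        best = min(best, em - prev_e + perc * e1)
        prev_e = em
    return best

cce_flat = corrected_conditional_entropy([3, 3, 3, 3, 3, 3])       # regular
cce_varied = corrected_conditional_entropy([1, 5, 2, 4, 3, 5, 1, 2])  # complex
```

A perfectly regular series yields a CCE of zero, while a varied series yields a positive value, matching the use of CCE to separate simple shilling behaviour from complex genuine behaviour.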
(3) Range of User Rating Temporal Popularity Vector (R URTPV). R URTPV represents the difference between the maximum and minimum values of the components in URTPV: R URTPV = max_i V_{u,i} − min_i V_{u,i}.

(4) Mean of User Rating Temporal Popularity Vector (M URTPV). M URTPV is used to measure the overall statistical characteristics of URTPV: M URTPV = (1/n) Σ_{i=1}^{n} V_{u,i}, where n is the number of components.
(5) Variance of User Rating Temporal Popularity Vector (V URTPV). V URTPV is used to measure the fluctuation degree of URTPV: V URTPV = (1/n) Σ_{i=1}^{n} (V_{u,i} − M URTPV)^2.
To intuitively demonstrate the effects of the above detection features, we take 500 genuine profiles and 480 shilling profiles as examples and illustrate their differences on the detection features in Figure 4. The genuine profiles are randomly selected from the Netflix contest dataset. The shilling profiles are generated by the random, average, AoP, shifting, power item, power user, love/hate, and reverse bandwagon attack models with {3%, 5%} filler sizes. We randomly select 60 shilling profiles from each type of attack and finally obtain 480 shilling profiles.
As shown in Figure 4, the features of genuine profiles differ greatly from those of shilling profiles. Take Figure 4(b) as an example: most of the CCE URTPV values of genuine profiles are greater than 2, whereas most of the CCE URTPV values of shilling profiles are below 2, except for the power item attack. This means that the rating behaviours of genuine profiles are more complicated than those of shilling profiles. Take Figure 4(d) as another example: the URTPV of genuine profiles has a relatively high mean, indicating that the ratings of genuine profiles have relatively high TPI from a statistical view. Therefore, based on the definition of TPI, the extracted features can be used to separate shilling profiles from genuine ones.
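The statistical features (1) and (3)-(5) over a user's URTPV can be computed directly. A minimal sketch, assuming equal-width binning for the entropy estimate (the helper name is illustrative):

```python
import numpy as np

def urtpv_features(v, q_bins=10):
    """Sketch of features (1) and (3)-(5) over a user's URTPV components:
    information entropy over Q equal-width bins, range, mean, and variance."""
    v = np.asarray(v, dtype=float)
    counts, _ = np.histogram(v, bins=q_bins)       # Q equal-width bins
    p = counts[counts > 0] / counts.sum()          # empirical probabilities
    return {
        "IE_URTPV": float(-np.sum(p * np.log2(p))),
        "R_URTPV": float(v.max() - v.min()),
        "M_URTPV": float(v.mean()),
        "V_URTPV": float(v.var()),
    }

feats = urtpv_features([0.1, 0.9, 0.4, 0.4, 0.7])
```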

Extracting Features from Dynamic Mean of Item Ratings.
In previous studies, some generic features were designed to capture the differences between genuine and shilling profiles in rating deviations, such as RDMA (rating deviation from mean agreement) [4] and WDMA (weighted deviation from mean agreement) [5], in which the mean of item ratings is a fixed value and serves as the base for measuring the deviations. However, the mean of item ratings is affected by many factors and changes with time [41]. Therefore, when an item is rated with the item mean or system mean in shilling profiles according to the attack models in Table 1, the rating deviations in RDMA and WDMA can be further amplified by using the dynamic mean of item ratings.
(1) Rating Deviation from Dynamic Mean Agreement (RDDMA). RDDMA follows the form of RDMA [4], but measures each rating's deviation from r̄_{i,j}, the mean of ratings for item i in the time interval TI_j in which the rating was given.
(2) Weighted Deviation from Dynamic Mean Agreement (WDDMA). WDDMA is obtained from WDMA [5] in the same way, by replacing the static item mean with the dynamic mean r̄_{i,j}. Figure 5 illustrates the comparisons of RDMA versus RDDMA and WDMA versus WDDMA for genuine and shilling profiles, where the genuine and shilling profiles are the same as those in Figure 4.
As shown in Figure 5, the RDDMA values of shilling profiles in Figure 5(b) are greater than those of genuine profiles. Moreover, the difference is more obvious than that of RDMA in Figure 5(a). Similar results can be found in the comparison of WDDMA and WDMA. These results intuitively demonstrate that the distinguishing abilities of RDMA and WDMA are improved in RDDMA and WDDMA, respectively, by using the dynamic mean of item ratings. Furthermore, we quantitatively evaluate the performance of these features in Section 4.3.4.
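A minimal sketch of RDDMA, assuming the published RDMA form (deviation divided by the item's overall rating count, averaged over the user's items) with the static item mean replaced by the dynamic per-interval mean; data structures and the exact normalization are assumptions for illustration:

```python
def rddma(user_ratings, dynamic_mean, num_ratings):
    """Sketch of RDDMA. user_ratings: {item: (rating, interval_j)};
    dynamic_mean[(item, j)]: mean rating of `item` in time interval TI_j;
    num_ratings[item]: the item's overall rating count (the RDMA weight)."""
    dev = sum(abs(r - dynamic_mean[(i, j)]) / num_ratings[i]
              for i, (r, j) in user_ratings.items())
    return dev / len(user_ratings)

u = {"A": (5, 2), "B": (4, 1)}
dmean = {("A", 2): 3.0, ("B", 1): 3.5}
nr = {"A": 10, "B": 5}
# |5-3|/10 = 0.2 and |4-3.5|/5 = 0.1, averaged -> 0.15
assert abs(rddma(u, dmean, nr) - 0.15) < 1e-9
```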

Extracting Features from Different Novelty Item Sets.
Genuine users have obvious preferences for items of different novelty. However, attackers hardly transfer their preferences by selecting items of different novelty. For example, as listed in Table 1, the filler items are randomly chosen in the average, random, user random shifting, user average shifting, target random shifting, target average shifting, love/hate, bandwagon, and reverse bandwagon attacks.
To reflect users' preferences for items of different novelty, based on Zipf's law, the items are divided into popular and novel item sets according to their novelty. Sorted by novelty in ascending order, the top 20% of items are taken as the popular item set (PIS) and the remaining items are used as the novel item set (NIS). Based on the features in Sections 3.1.3 and 3.1.4, user features can be extracted from PIS and NIS, respectively, as shown in Tables 3 and 4. Figures 6 and 7 illustrate the effects of these detection features, where the genuine and shilling profiles are the same as those in Figure 4.
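The PIS/NIS split described above can be sketched as follows; the helper name and dictionary representation are illustrative:

```python
def split_popular_novel(novelty_by_item, top_ratio=0.2):
    """Sketch: sort items by novelty in ascending order and take the top 20%
    (least novel, i.e. most popular) as PIS; the rest form NIS."""
    ranked = sorted(novelty_by_item, key=novelty_by_item.get)
    cut = max(1, int(top_ratio * len(ranked)))
    return set(ranked[:cut]), set(ranked[cut:])

novelty = {"i1": 0.1, "i2": 0.9, "i3": 0.3, "i4": 0.7, "i5": 0.5}
pis, nis = split_popular_novel(novelty)   # top 20% of 5 items -> 1 item in PIS
```

The same per-user features from Sections 3.1.3 and 3.1.4 would then be computed twice, restricted to the items in PIS and in NIS, respectively.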
As shown in Figures 6 and 7, these features have different discrimination abilities for different attacks. For example, IE URTPVP and M URN have the ability to detect the power item attack and love/hate attack, respectively; however, they cannot effectively detect other attacks. For IE URTPVP, when the power item selection method is based on the number of ratings, as in our experiments, most of the rated items fall in the popular item set. Therefore, with the uncertain distribution of TPI, the profiles of power item attacks have relatively high IE URTPVP in the popular item set. For M URN, because love/hate attacks randomly select items and give them the maximum rating values, the profiles of love/hate attacks have the highest mean of user ratings in NIS.
Let D denote the rating database, M denote the user feature matrix, US denote the rating timestamp matrix, UT denote the user rating temporal popularity matrix, and UR denote the user rating value matrix.The proposed feature extraction algorithm is described as follows.
Algorithm 1 contains two parts. The first part (lines 1-14) preprocesses the rating data. In particular, the UR and US matrices are obtained from the raw rating database D (lines 2-3). Then the item novelty series is constructed to divide PIS and NIS (lines 4-7), based on which the wavelet decomposition is used to determine the item daily popularity (lines 8-14). The second part (lines 15-19) converts the raw rating database D to the user rating temporal popularity matrix UT (line 15) and extracts the user features (lines 16-18). Finally, the user feature matrix is obtained (line 19).

Construction of Multiple Optimal Classification Views.
As discussed in Section 3.1, the extracted features exhibit different abilities in detecting different attacks. Therefore, we have to select and combine the effective features for shilling attack detection. In feature selection, one main challenge is the trade-off between removing irrelevant detection features and keeping useful ones. For this reason, we adopt wrapper methods instead of the filter methods used in [7, 8, 16]. Moreover, the proposed features are extracted by constructing the correspondence among item popularity, ratings, and timestamps, so they can be regarded as heterogeneous data with the same source. That is, from the same source of user behaviour data, there are multiple views (feature subsets) that can be used to separate the shilling profiles.
Based on the above analysis, we propose an optimal feature set partitioning method to construct the views. To eliminate the impact of feature order as much as possible and construct views with great diversity, the feature set is repartitioned with a random order of features at regular intervals.
Suppose that the nonempty feature set X = {x_1, x_2, ..., x_n} can be partitioned into k feature subsets to construct the views, and let X_i (i = 1, 2, ..., k) denote the ith view. Let BsClassifier denote the base classifier, X the feature set, k the number of views, Vdata the validation dataset, and X_opt the feature set partitioning result. The proposed optimal feature set partitioning algorithm is described as Algorithm 2.
Algorithm 2 includes two parts. The first part (lines 1-9) initializes the views (line 5), the accuracy of the views (line 6), and the accuracy difference of the views (line 8). The second part (lines 10-24) generates the optimal feature set partition. Firstly, a distinct and unevaluated feature is randomly selected (line 11). Secondly, the selected feature is temporarily added to each view (line 13), and the accuracy difference of each view is updated (lines 14-15). The feature is then permanently added to the view with the maximum accuracy difference (lines 18-19), and the accuracy of this view is updated (line 20). Finally, after each feature has been temporarily added to each view to decide whether it should be added permanently, the optimal feature set partitioning result is obtained and returned (line 24).
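The greedy assignment loop of Algorithm 2 can be sketched as follows. This is an illustrative simplification, not the paper's Algorithm 2: `accuracy(view)` stands in for training the base classifier on a view and evaluating it on the validation dataset, and the toy accuracy function exists only to make the sketch runnable:

```python
import random

def partition_features(features, k, accuracy, seed=0):
    """Sketch of the optimal feature set partitioning: visit features in a
    random order and permanently place each one in the view whose validation
    accuracy improves the most when the feature is added."""
    rng = random.Random(seed)
    order = list(features)
    rng.shuffle(order)                       # random feature order
    views = [[] for _ in range(k)]
    acc = [0.0] * k                          # current accuracy of each view
    for f in order:
        # try the feature in every view and measure the accuracy difference
        gains = [accuracy(v + [f]) - acc[i] for i, v in enumerate(views)]
        best = max(range(k), key=gains.__getitem__)
        views[best].append(f)                # add where the gain is largest
        acc[best] = accuracy(views[best])
    return views

# toy accuracy: prefers views of size 2 (purely illustrative)
toy_acc = lambda view: 1.0 - abs(len(view) - 2) * 0.1
views = partition_features(["f1", "f2", "f3", "f4"], k=2, accuracy=toy_acc)
```

Repartitioning at regular intervals with a fresh random order then yields the multiple partitioning results that MV-EDM integrates over.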

Generation of Base Classifiers.
In a supervised ensemble method, a predictive classifier is generated by integrating multiple classifiers trained on diverse training sets. Figure 8 illustrates the framework of our method for generating the base training sets. As shown in Figure 8, from the horizontal direction, the various base training sets are generated (lines 1-6). Then each classifier is trained on a base training set from different views, and the accuracy of the different views is calculated (lines 7-8). In lines 9-12, the feature set is repartitioned at intervals of q according to Algorithm 2. Finally, the base training sets TR, the feature set partitioning results X_allopt, the base classifiers' accuracy set Fpre, and the different views' accuracy set Vpre are obtained.

Ensemble Detection.
In this paper, the top 15% of base classifiers by accuracy are selected to yield the ensemble result. The weight of each base classifier is determined by its accuracy on the validation dataset. Similarly, for every base classifier, the weight of each view is determined by the accuracy of the base classifier from this view on the validation dataset. Therefore, the final decision is obtained by weighting the base classifiers, where u_x is the unlabeled user in feature space, H(u_x) is the predictive result for u_x, bs_top is the number of selected base classifiers, and h_p(u_x) is the predictive result of base classifier p for user u_x. The decision of each base classifier is, in turn, obtained by weighted voting over its views. In Algorithm 4, the top 15% of base classifiers with the largest accuracy values are selected to build a predictive classifier by weighted voting. Firstly, the base classifiers are selected (line 2) and the user features of the profiles in the test set are extracted (line 3). Secondly, with the extracted user features, the predictive result of every classifier is obtained by weighted voting from the different views (lines 7-16). Thirdly, the final predictive result is generated by voting over the base classifiers (lines 18-22). When every user profile has been detected in the outer loop, the set of final detection results is obtained.
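The outer voting stage can be sketched as follows, assuming {+1: shilling, −1: genuine} labels and accuracy weights; the function name and the sign-based combination are illustrative, not the paper's Algorithm 4:

```python
def ensemble_predict(classifier_outputs, weights, top_ratio=0.15):
    """Sketch of the final decision: keep the top 15% base classifiers by
    validation accuracy and combine their {+1: shilling, -1: genuine} votes,
    each weighted by its accuracy."""
    ranked = sorted(range(len(weights)), key=lambda p: -weights[p])
    top = ranked[:max(1, int(top_ratio * len(weights)))]   # top 15% (at least 1)
    score = sum(weights[p] * classifier_outputs[p] for p in top)
    return 1 if score >= 0 else -1

# 10 classifiers: with top_ratio 0.15, only the single best classifier votes
outs = [1, -1, 1, 1, -1, 1, -1, 1, -1, 1]
accs = [0.9, 0.6, 0.7, 0.8, 0.5, 0.85, 0.55, 0.75, 0.65, 0.7]
assert ensemble_predict(outs, accs) == 1   # best classifier (acc 0.9) votes +1
```

In the full method, each `classifier_outputs[p]` would itself be produced by an inner accuracy-weighted vote over that classifier's views.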

Experimental Evaluation
In this section, we introduce the experimental settings and evaluation metrics. To compare detection performance under various attacks, we first conduct experiments on the Netflix dataset. Then we conduct experiments on the Amazon review dataset to demonstrate the practical value of the proposed method. Finally, a comparison of running time is provided.

Experimental Data and Setting.
The Netflix dataset and Amazon review dataset are used for experimental evaluation.
(1) Netflix dataset (this dataset was constructed to support participants in the Netflix Prize (http://netflixprize.com)): a contest dataset provided for the Netflix Prize. We randomly select 542,182 ratings on 4000 movies by 5000 users between January 6th, 2000, and December 31st, 2005, as our experimental dataset.
(2) Amazon dataset: it contains 1,205,125 reviews on 136,785 products by 645,072 reviewers, crawled from Amazon.cn up to August 20, 2012, by Xu et al. [44]. Each review includes ReviewerID, ProductID, Product Brand, Rating, Date, and Review Text. In this dataset, 5055 reviewers are labelled. Moreover, information on 8915 reviewer groups is provided [28,44].
Since the Netflix dataset is provided for training recommender algorithms in the Netflix contest, we assume that its original users are genuine. We randomly divide the 5000 genuine users into three groups. The first group, including 3000 genuine users, is used as the training dataset; the second and third groups, each including 1000 genuine users, are used as the test and validation datasets, respectively.
Due to the lack of labelled users in the Netflix dataset, the shilling profiles are generated according to the attack models in Table 1 by reverse engineering [6]. The rating timestamps of the shilling profiles are sampled from genuine ones to ensure the plausibility of the generated profiles. For example, the rating timestamps of a shilling profile are generated as follows. First, we obtain the set of genuine profiles that have more rated items than this shilling profile. Second, we randomly select one profile from this set and then randomly select timestamps from it as the shilling profile's timestamps.
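The two-step timestamp-sampling procedure above can be sketched as follows; the function name and data layout are illustrative, not from the paper.

```python
import random

def sample_timestamps(shilling_len, genuine_profiles):
    """Sample rating timestamps for a generated shilling profile.

    genuine_profiles: list of timestamp lists, one list per genuine user.
    shilling_len: number of ratings in the shilling profile being generated.
    """
    # Step 1: keep only genuine profiles with MORE rated items than
    # the shilling profile.
    candidates = [ts for ts in genuine_profiles if len(ts) > shilling_len]
    # Step 2: pick one genuine profile at random ...
    donor = random.choice(candidates)
    # ... and randomly sample its timestamps (without replacement).
    return random.sample(donor, shilling_len)
```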
In the shilling profiles, the target item is randomly selected. First, we detect 6 push attacks: the random, average, 30% AoP, shifting, power item, and power user attack models shown in Table 1. These attack models can also be used for nuke attacks with the same detection methods. However, attack models that are effective for pushing items are not necessarily as effective for nuking them. Therefore, we additionally detect 2 effective nuke attacks: the love/hate and reverse bandwagon attack models.
As for filler size, more than 83% of genuine profiles in the Netflix dataset have filler sizes at or below 5%. To simulate most genuine users, the filler sizes of the shilling profiles are set to {3%, 5%} in the training and test sets.
As for attack size, the value should not be too large (usually below 20%); otherwise, the shilling profiles would be easily detected and the attack cost would rise [45]. For the purpose of our experiments, the attack sizes of the shilling profiles are set to {3%, 5%, 10%, 12%} in the training and test sets.
In the validation dataset, to balance the proportion of genuine and shilling profiles, the shilling profiles are generated by the random, average, 30% AoP, shifting, power user, power item, love/hate, and reverse bandwagon models in Table 1 with 5% attack size and {3%, 5%} filler sizes.
In our experiments, the values averaged over 20 runs are used as the final evaluation values. All experiments are implemented in Matlab R2015b and Python 2.7 on a personal computer with an Intel i7-5500U 2.40GHz CPU and 8GB of memory. In Algorithm 2, the proposed features are partitioned into 3 views. In Algorithm 3, 100 kNN base classifiers are trained, whose number of neighbors is set to 9 according to cross-validation.

Evaluation Metrics.
To evaluate the performance of the proposed method, we use the precision and recall metrics, defined as follows:

$$\mathrm{recall} = \frac{TP}{TP + FN}, \qquad \mathrm{precision} = \frac{TP}{TP + FP},$$

where TP denotes the number of shilling profiles correctly classified, FN denotes the number of shilling profiles misclassified as genuine ones, and FP denotes the number of genuine profiles misclassified as shilling ones.
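The two metrics above can be computed directly from the label counts; the +1 encoding of shilling profiles below is an illustrative choice.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall as defined above, with `positive` marking shilling."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```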

Comparison of Recall and Precision.
To illustrate the effectiveness of MV-EDM, we compare it with the following six baseline methods.
(1) MEL-once: a multiview ensemble learning method [35]. We use the 17 features proposed in this paper as the feature set and partition them into 3 views only once. 100 kNN base classifiers, whose number of neighbors is set to 9 according to cross-validation, are combined from the 3 fixed views to generate the final detection results.
(2) MV-SVM: a multiview detection method in which the 17 features extracted from multiple views in this paper are used by an SVM for classification. The Gaussian radial basis function is used as the kernel function, and rbf sigma is set to 4 according to 5-fold cross-validation on the training set.
(3) SVM-TIA: a two-phase shilling attack detection method based on SVM and target item analysis [10]. The 7 rating-based features RDMA, DegSim, WDMA, WDA, LengthVar, MeanVar, and DegSim' are used by an SVM for classification. The Gaussian radial basis function is used as the kernel function, and rbf sigma is set to 5 according to 5-fold cross-validation on the training set. In phase 2 of SVM-TIA, the threshold on the number of attack profiles is set to 20.
(4) RAdaBoost: an improved rescale AdaBoost method [11], which uses 18 rating-based features. In RAdaBoost, 100 decision stumps are used as weak classifiers and the number of iterations is set to 50. The shrinkage degree parameter is calculated as in [11], where u is set to 100 times the attack size.
(5) RF-13: a random forest ensemble detection method, which uses the 13 rating-based features in [4]. In the experiments, 100 decision trees are used and the other parameters are set to their Matlab defaults.
(6) DWT-SVM: an item popularity-based method [28]. The 17 user features are extracted from the user's rating series based on item popularity, and an SVM is used for classification. The Gaussian radial basis function is used as the kernel function, and rbf sigma is set to 4 according to 5-fold cross-validation on the training set.
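Several of the baselines above tune an RBF-kernel SVM's sigma via 5-fold cross-validation. A minimal sketch of the two ingredients, the Gaussian radial basis kernel and the fold construction, is given below; the experiments themselves used Matlab's SVM implementation, so this is only an illustration of the tuning setup.

```python
import numpy as np

def rbf_kernel(x, z, sigma):
    """Gaussian radial basis kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return float(np.exp(-np.sum((np.asarray(x) - np.asarray(z)) ** 2)
                        / (2.0 * sigma ** 2)))

def kfold_indices(n, k=5, seed=0):
    """Disjoint (train, test) index pairs for k-fold cross-validation.

    Each candidate sigma would be scored by averaging validation accuracy
    over these folds, keeping the sigma with the best mean score.
    """
    idx = np.random.RandomState(seed).permutation(n)
    folds = np.array_split(idx, k)
    return [(np.concatenate(folds[:i] + folds[i + 1:]), folds[i])
            for i in range(k)]
```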
Tables 5 and 6 list the recall and precision of seven methods under eight attacks with various filler sizes and attack sizes on the Netflix dataset.
As shown in Table 5, the recall of MV-EDM is higher than or equal to that of the other methods when detecting various attacks. The rating-based methods (SVM-TIA, RAdaBoost, RF-13) and the item popularity-based method (DWT-SVM) show relatively high recall under random, average, AoP, shifting, love/hate, and reverse bandwagon attacks. However, under power user attacks, only MEL-once, MV-SVM, and MV-EDM maintain high recall. This is because SVM-TIA, RAdaBoost, RF-13, and DWT-SVM mainly extract detection features from a single view and ignore other useful information; it also illustrates the effectiveness of the proposed features in detecting various attacks. Among the multiview methods (MEL-once, MV-SVM, MV-EDM), MEL-once and MV-EDM have higher recall than MV-SVM. Among the rating-based methods, RAdaBoost and RF-13 have higher recall than SVM-TIA. This is because the ensemble methods (MEL-once, MV-EDM, RAdaBoost, RF-13) combine the outputs of multiple classifiers to reduce generalization errors, which helps detect as many shilling profiles as possible. The item popularity-based method (DWT-SVM) has higher recall than SVM-TIA when detecting random, average, AoP, shifting, love/hate, and reverse bandwagon attacks. A possible reason is that these attacks select filler items in a simple way, which makes the item popularity-based features more effective than the rating-based features.
It can be seen from Table 6 that the precision of MV-EDM is the highest when detecting various attacks. The multiview methods (MEL-once, MV-SVM, MV-EDM) have higher precision than the single-view methods (SVM-TIA, RAdaBoost, RF-13, DWT-SVM). Among the single-view methods, RAdaBoost outperforms RF-13 in precision in most cases, which may be attributed to RAdaBoost's more effective features and its rescale AdaBoost mechanism. For SVM-TIA, suspicious user profiles are first detected by an SVM based on user rating-based features. However, the shilling profiles may not be fully detected at this first stage. At the second stage, the suspicious profiles are further examined by analysing target items, which has no effect on recall but increases precision. Among the multiview methods, although the precision of MEL-once at a given filler size generally increases with the attack size, it is always below the precision of MV-EDM. This is because MEL-once depends on fixed views, and the partitioned views are easily affected by the initial order of the features, so the precision of MEL-once cannot reach the best performance. MV-SVM also uses our features to train a classifier. However, since MV-SVM considers neither the effectiveness of the features nor an ensemble method, it has relatively low precision among the multiview methods. Therefore, given the same features, the proposed multiview ensemble method is superior to SVM and to the traditional multiview ensemble learning method.
Obviously, MV-EDM outperforms the other methods in terms of recall and precision when detecting various attacks. This is because MV-EDM extracts user features from multiple perspectives, which enables it to detect more anomalous rating patterns than the other methods. Moreover, MV-EDM learns the multiview features and the classifier simultaneously, which leads to a classifier trained with diversity. Thus, MV-EDM achieves better performance than the benchmark methods.

Analysis of Information Gain.
Information gain is widely used to evaluate the importance of a feature for a classification system. In general, a larger information gain indicates a more important feature. Table 7 lists the information gain of the proposed features when detecting different types of attacks. In our experiments, we calculate the mean information gain for each type of attack based on the experiments in Section 4.3.1.
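Information gain for a single feature can be computed as the reduction in label entropy after splitting on that feature. The sketch below binarizes a continuous detection feature at a threshold; this binary split is one common discretization and an assumption of the sketch, not necessarily the one used in the paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels, threshold):
    """Entropy of the labels minus the split-conditional entropy."""
    left = [l for v, l in zip(feature_values, labels) if v <= threshold]
    right = [l for v, l in zip(feature_values, labels) if v > threshold]
    n = len(labels)
    cond = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - cond
```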
As shown in Table 7, RDDMA is the most important feature for detecting the eight attacks; its information gain ranks first. WDDMA is also important for the random, average, AoP, shifting, power user, love/hate, and reverse bandwagon attacks. For the power item attack, IE URTPVP is the second most important feature. It is worth noting that M URP and V URP are important for detecting the love/hate attack but not for identifying the other attacks.

Generalization Ability of MV-EDM.
In [6-8, 10-13, 24-28], the detection model was usually trained for some attacks with fixed filler sizes and attack sizes and tested with the same parameters. Based on the experiments in Section 4.3.1, we conduct experiments to further evaluate the overall performance of the proposed method when the attack parameters are modified.
Since precision and recall are two equally important but mutually contradictory metrics, the F1-measure is widely used to evaluate the overall performance of a detection method. The larger the F1-measure, the better the overall performance.
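The F1-measure referred to above is the harmonic mean of precision and recall:

```latex
F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
```

It is high only when both metrics are high, which is why it is used to arbitrate between the two contradictory goals.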
Figure 9 illustrates the F1-measure of MV-EDM under nine attacks with various attack sizes and filler sizes. The training set is the same as that in Section 4.1. However, in the test sets, since the filler sizes of more than 95% of genuine profiles are below 10%, the filler sizes are set to {2%, 4%, 6%, 8%, 10%}. In fact, if the filler size of shilling profiles exceeds that of genuine profiles, the attacks can be easily detected by filler-size-based features such as length variance (LengthVar) [5,6] and filler mean variance (FMV) [46]. Therefore, we set the maximum filler size in the test sets to 10%. The attack sizes are set to {2%, 4%, 6%, 8%, 10%, 12%, 14%, 16%, 18%, 20%}. Moreover, we use the trained model to detect the bandwagon attack, which does not appear in the training set. As shown in Figure 9, MV-EDM can effectively detect random, average, shifting, love/hate, reverse bandwagon, and bandwagon attacks with various attack sizes and filler sizes, but not the AoP, power item, and power user attacks. It is worth noting that, although the bandwagon attack does not appear in the training set, it can be effectively detected by MV-EDM. These experimental results indicate that MV-EDM still works under some attacks (random, average, shifting, love/hate, reverse bandwagon, and bandwagon) when the attack sizes and filler sizes are modified. This means the proposed method has relatively strong generalization ability under attacks with randomly selected filler items. However, since many AoP, power item, and power user shilling profiles have similar or even replicated filler items like the genuine profiles, MV-EDM can hardly detect such attacks when the filler sizes are significantly modified. As shown in Figures 9(c), 9(e), and 9(f), the F1-measure of MV-EDM stays relatively high at {2%, 4%, 6%} filler sizes and decreases at {8%, 10%} filler sizes under the AoP, power item, and power user attacks. Therefore, compared with previous research in which the training and test sets use the same attack parameters [6-8, 10-13, 24-28], the detection performance of MV-EDM is more optimistic even when the attack parameters are modified.
To further test the generalization ability of MV-EDM, we train MV-EDM on one type of attack and evaluate its performance in detecting another. MV-EDM is trained on the random, average, AoP, shifting, power item, power user, love/hate, and reverse bandwagon attacks with {3%, 5%} filler sizes and 10% attack size, respectively. Then we take each type of attack with 4% filler size and 6% attack size as an example to test the F1-measure of MV-EDM. Table 8 lists the experimental results.
As shown in Table 8, under the random, average, shifting, love/hate, and reverse bandwagon attacks, once MV-EDM is appropriately trained on one type of attack, it can effectively detect the others. This means MV-EDM has strong generalization ability under attacks with randomly selected filler items, which is hardly affected by the rating values. Also, when MV-EDM is trained on the AoP attack or the power item attack, it can detect the power item attack or the AoP attack, respectively, in addition to the random, average, shifting, love/hate, and reverse bandwagon attacks. This is because, in our experiments, the way items are selected for rating in the AoP attack is similar to that in the power item attack. It can also be seen from Table 8 that MV-EDM can detect the power user attack only when it has been trained on that same attack. This is because, in the power user attack, the rated items and their ratings are copied from genuine users, so its rating pattern is unique and cannot be learned from other attacks. Therefore, MV-EDM has relatively high generalization ability.

Impact of Dynamic Mean of Item Ratings.
To evaluate the impact of the dynamic mean of item ratings, we compare the performance of MV-EDM built with the traditional RDMA and WDMA (named MV-EDM- for differentiation) against MV-EDM with the proposed RDDMA and WDDMA. Figure 10 illustrates the F1-measure of the two methods under eight attacks with attack sizes {3%, 5%, 10%, 12%} at 5% filler size. MV-EDM- and MV-EDM are trained on the same training set as in Section 4.1.
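The contrast between a static and a dynamic item mean can be sketched as below. The static branch follows the standard RDMA definition (average absolute deviation from the item mean, weighted by the inverse of the item's rating count); the dynamic branch, which restricts the item mean to ratings given up to the user's rating timestamp, is our interpretation of the "dynamic mean of item ratings" and not necessarily the paper's exact RDDMA definition.

```python
def rdma(user_ratings, item_ratings, dynamic=False):
    """RDMA-style deviation feature with a static or dynamic item mean.

    user_ratings: list of (item_id, rating, timestamp) for one user.
    item_ratings: dict item_id -> list of (rating, timestamp) over all users.
    """
    total = 0.0
    for item, r, t in user_ratings:
        history = item_ratings[item]
        if dynamic:
            # Dynamic mean: only ratings up to this rating's timestamp.
            pool = [x for x, tx in history if tx <= t]
        else:
            # Static mean: all ratings the item ever received.
            pool = [x for x, _ in history]
        mean_i = sum(pool) / len(pool)
        # Deviation weighted by the inverse of the item's number of ratings.
        total += abs(r - mean_i) / len(history)
    return total / len(user_ratings)
```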
As shown in Figure 10, the F1-measure of MV-EDM is always higher than that of MV-EDM-. Taking the AoP attack as an example, at 5% filler size across the 3%, 5%, 10%, and 12% attack sizes, the F1-measure of MV-EDM is 0.82, 0.89, 0.91, and 0.95, which is higher than that of MV-EDM- by 39%, 45%, 20%, and 10%, respectively. These results indicate that the features RDDMA and WDDMA help improve the detection performance of MV-EDM, which means the dynamic mean of item ratings is useful for detecting various attacks.

Experiment on the Amazon Review Dataset.
To examine the performance of the proposed method in practice, we conduct experiments on the Amazon review dataset and report the results.
To better simulate actual detection, we use two methods, random sampling and group sampling, to generate the training, validation, and test datasets from the Amazon review dataset. In the random sampling method, we randomly select 500 reviewers as the training set, 300 reviewers as the validation set, and 4255 reviewers as the test set. In the group sampling method, we first randomly select 1000 reviewer groups, whose 1535 distinct reviewers form the training set. Second, we randomly select 500 reviewer groups and find that 102 of them have no common members with the training set; we thus select the 196 reviewers of these 102 groups as the validation set. Finally, we find that 1295 of the remaining groups have no common members with the training and validation sets, so we select the 1970 reviewers of these 1295 groups as the test set.
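The group-sampling filter above, keeping only candidate groups that share no members with the already-selected sets, can be sketched as follows; the function name, API, and group counts are illustrative, not the paper's.

```python
import random

def disjoint_group_split(groups, n_train_groups, n_candidate_groups, seed=0):
    """Split reviewer groups so later sets share no members with earlier ones.

    groups: list of sets of reviewer ids. Returns (train, validation) reviewer
    sets such that every retained validation group is disjoint from the
    training reviewers, mirroring the filtering described above.
    """
    rng = random.Random(seed)
    pool = groups[:]
    rng.shuffle(pool)
    train_groups = pool[:n_train_groups]
    train = set().union(*train_groups)
    # Keep only candidate groups with no common members with the training set.
    candidates = pool[n_train_groups:n_train_groups + n_candidate_groups]
    disjoint = [g for g in candidates if not g & train]
    val = set().union(*disjoint) if disjoint else set()
    return train, val
```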
With the random and group sampling methods on the Amazon review dataset, Table 9 shows the recall, precision, and F1-measure of the seven detectors described in Section 4.3.1. As shown in Table 9, MV-EDM has the highest recall, precision, and F1-measure under both sampling methods in practical detection. In terms of overall performance, the F1-measures of MEL-once and MV-SVM are second only to MV-EDM under both sampling methods, which again illustrates the effectiveness of our proposed features on a real-world dataset. With the random sampling method, SVM-TIA has the highest F1-measure apart from MEL-once, MV-SVM, and MV-EDM. Understandably, as shown in Table 9, although SVM-TIA has the lowest recall, it maintains the highest precision among the single-view methods, which results in its very high F1-measure. With the group sampling method, among the single-view methods, RAdaBoost and DWT-SVM have relatively high F1-measures, while SVM-TIA and RF-13 have the lowest. This may be because group sampling weakens the distinguishing ability of the features DegSim (degree of similarity with top neighbors) and DegSim' (the improved DegSim) in SVM-TIA and RF-13. In fact, DegSim and DegSim' identify attackers by capturing the nearest neighbors. However, the group samples share no common users between the training and test sets, which makes the neighbor information very different between the two; the effectiveness of the SVM-TIA and RF-13 classifiers is therefore reduced by group sampling. We also observe that all seven detectors perform better with random sampling than with group sampling. This may be because user preferences are complex across different groups, and the classifier features cannot be effectively learned from only a few separate groups of users. By contrast, with random sampling, the users are almost uniformly distributed across the training and test datasets, and the classifiers benefit from the uniformly distributed samples.
Regardless of the sampling method, MV-EDM takes into consideration not only multiple sources of information but also the effectiveness of the features and base classifiers, which improves performance significantly. Therefore, MV-EDM always outperforms the other methods on the Amazon review dataset, which shows the practical value of the proposed detection method.

Comparison of Running Time.
To examine the time consumption of MV-EDM, we conduct experiments on the two datasets and measure the running time of the seven methods. On the Netflix dataset, we take the experiments under the AoP attack with 3% filler size and 5% attack size as an example. On the Amazon dataset, the experimental setting is the same as in Section 4.4. We then measure the offline training time and the online detection time separately. Table 10 compares the running time of the seven detection methods.
As shown in Table 10, the training and detection times of MV-EDM are smaller than those of SVM-TIA, RF-13, and DWT-SVM, but greater than those of MEL-once, MV-SVM, and RAdaBoost. Among the multiview methods, compared with MEL-once, MV-EDM needs extra time to repartition the feature set at regular intervals; compared with MV-SVM, both MV-EDM and MEL-once need additional time to integrate the base classifiers. Therefore, MV-EDM has a longer running time than MEL-once and MV-SVM. MV-EDM also shows a longer running time than RAdaBoost, which may be attributed to the view construction and integration in MV-EDM. The running times of RF-13 and SVM-TIA rank top 2, because DegSim and DegSim' need considerable time to compute the similarity between users. Since DWT-SVM utilizes discrete

Figure 3: The original signal and reconstruction signal of TPI for an item.

Figure 4: The feature differences between the genuine and shilling profiles.

Algorithm 3: The algorithm for training base classifiers.

Let d denote the randomly selected profiles in a base training set, bs denote the number of base training sets, and q denote the interval at which the feature set is repartitioned. Let Vdata denote the validation dataset, BsClassifier denote a base classifier, and TR denote the set of base training sets. Let Fpre denote the set of the classifiers' accuracies on the validation dataset, X_allopt denote the set of all feature set partitioning results, and Vpre denote the set of the classifiers' accuracies from different views. Algorithm 3 consists of two parts. The first part (lines 1-4) generates the base training sets: a certain number of user profiles are first selected (line 3), and the detection features are then extracted from these profiles according to Algorithm 1 to form the base training sets (line 4). The second part (lines 5-15) vertically partitions the base training sets and trains the base classifiers from different views on the different base training sets. First, the classifier's accuracy on each base training set is calculated (lines 5-

Table 5: Recall of seven methods under eight attacks.

Figure 9: The F1-measure of MV-EDM under nine attacks with various attack sizes and filler sizes.

Figure 10: The F1-measure of two methods under eight attacks at 5% filler size across various attack sizes.

attack. Moreover, it needs to know the attack size in advance.
, which consists of four stages: feature extraction, view construction, base classifier generation, and detection. At the feature extraction stage, the features of users in the training and test sets are extracted using the proposed feature extraction method. At the view construction stage, the user features

Table 2: Notations and their descriptions.

Table 3: The features in the popular item set.

Table 4: The features in the novelty item set.

Here $\pi_j(x_u)$ denotes the projection of $x_u$ from view $v_j$ and $h_{p,j}$ denotes the weight of view $v_j$ in base classifier $p$, which is calculated as follows. Let Ttest denote the test set, BsClassifier denote a base classifier, X_allopt denote the set of all optimal feature set partitioning results, k denote the number of views in each partitioning result, Fpre denote the set of the classifiers' accuracies, Vpre denote the set of the classifiers' accuracies from different views, and PLabels denote the final detection result. The ensemble detection algorithm is described as Algorithm 4.

Table 6: Precision of seven methods under eight attacks.

Table 7: Information gain of the proposed features.

Table 8: The F1-measure of MV-EDM under untrained attacks.

Table 9: Recall, precision, and F1-measure of seven detectors with random and group sampling methods on the Amazon review dataset.