Deep Forest-Based E-Commerce Recommendation Attack Detection Model



Introduction
With the rapid growth of the Internet, the volume of network information and data has increased exponentially. Consequently, users face a significant challenge in obtaining the desired information, commonly referred to as "information overload." To address this issue and cater to the personalized needs of individual users, personalized recommendation systems have emerged. These systems analyze users' historical data to determine their preferences, shifting users from actively seeking interesting information to discovering relevant areas of interest. Moreover, personalized recommendation systems provide tailored recommendations for individual users, effectively meeting the diverse needs of a large user base.
The introduction of recommendation systems has greatly facilitated users' daily information retrieval and mitigated the problem of information fatigue caused by information overload [1]. These systems are not only extensively used in various e-commerce platforms but are also pervasive across the entire Internet industry.
In recommendation systems, there are two common recall paths, namely, user-to-item (U2I) and item-to-item (I2I). To enhance the timeliness and accuracy of recommendations, platforms continuously update U2I and I2I in real time based on user behavior information gathered from the entire network. Recommendations are made by considering the relevance to users' recent behavioral data. However, this recommendation mechanism can be exploited by unscrupulous merchants who seek to establish an artificial connection between their own products and popular items. These merchants may hire users to click on both popular items and their own products, in an attack commonly known as "Ride Item's Coattails" or "Coattail Attack." This deceptive behavior is characterized by its high level of concealment and its potential to cause significant damage to e-commerce recommendation systems, ultimately affecting the usability and shopping experience of users [2].
E-commerce recommendation systems play a critical role in the functionality of e-commerce platforms. They offer personalized product recommendations based on users' browsing preferences, purchase history, and various other factors. By optimizing the shopping experience and customer satisfaction, these systems contribute to increased sales and profits for e-commerce platforms. However, as recommendation systems continue to advance, new forms of attacks have emerged. These malicious attacks not only negatively impact user experience but also result in substantial economic losses and reputational damage to e-commerce platforms.
Thus, it is crucial to develop an effective model for detecting e-commerce recommendation attacks. This is essential for ensuring platform security, enhancing trustworthiness, and improving user satisfaction. In response to the recent emergence of the "Ride Item's Coattails" or "Coattail Attack" behavior, this paper proposes a recommendation attack detection model based on the deep forest algorithm. The model uses feature selection through random forests to reduce redundancy and select important features. To address the challenge of extreme class imbalance, a symmetric sampling method based on k-means centroids is introduced. This method overcomes the issues of incomplete noise filtering and information loss commonly associated with undersampling algorithms. The proposed algorithm is trained using real recommendation system attack data obtained from Alibaba Cloud Tianchi, specifically the "Coattail Attack" identification dataset. A comparative analysis is conducted between the proposed algorithm and commonly used detection models in the industry. Experimental results indicate that the proposed method outperforms commonly used attack detection algorithms.
In summary, the development of an effective e-commerce recommendation attack detection model holds significant importance in safeguarding platform security, enhancing platform credibility, and improving user satisfaction. The contributions of this paper can be summarized as follows: (

Literature Review
Attack detection has always been a critical research area in the field of network security. It aims to identify network attacks by monitoring network traffic and user behavior to ensure network security and stability. In recent years, the increasing stealthiness of attack behaviors and the emergence of artificial counter-detection techniques have posed increasingly severe network security threats. In response to different types of attacks, the academic community has conducted extensive research, focusing on the following key aspects:
Feature selection and extraction algorithms: Wang et al. [3, 4] have utilized methods such as Bayesian networks and voting mechanisms to select features, yielding favorable results in specific attack scenarios. Hu et al. [5] have proposed a two-stage deep feature selection algorithm for selecting feature subsets, which effectively extracts optimal feature subsets and enhances classification accuracy.
Anomaly detection algorithms: Anton et al. [6] have applied classical machine learning methods to anomaly detection, but these approaches are not effective in detecting new attack types. Cheng et al. [7] have introduced a tuned random forest algorithm for anomaly detection, which exhibits high data requirements and is significantly affected by noise values. He et al. [8][9][10] have conducted dimensionality reduction and anomaly detection on data using feature clustering methods and support vector machines. These methods demonstrate good detection performance for small-sample datasets but may not be ideal for large-sample datasets.
Deep learning algorithms: Lin et al. [11] proposed a deep learning-based intrusion detection model that incorporates both spatial and temporal features of network intrusions, achieving notable performance on the dataset. Kumar and Sniha [12,13] proposed a novel unified intrusion detection algorithm that uses information gain for feature selection but is limited in identifying a specific number of intrusion types. Wang et al. [14,15] applied convolutional neural networks to network intrusion detection, demonstrating higher accuracy compared to traditional machine learning techniques. Tran et al. [16,17] utilized DNN for fault detection in instant messaging and proposed an integrated IoT architecture based on deep neural networks, showcasing superior performance and ease of implementation compared to state-of-the-art algorithms. Additionally, statistical methods have been used to establish equivalence between Kalman filters and specific least squares regression problems, enabling the construction of robust Kalman filters for solving state estimation problems caused by various network attacks such as pulse, ramp, and DoS attacks [18].
Attack detection in e-commerce recommendation systems also falls under the domain of network attack detection. Its primary objective is to ensure fairness and stability within recommendation systems. With the widespread adoption of recommendation systems in e-commerce platforms, detecting attacks has become a prominent research issue for scholars [19]. Notable attacks that significantly impact recommendation systems include shilling attacks, group attacks, and the recently emerged "Ride Item's Coattails" attack.
Shilling attacks refer to deceptive behavior within recommendation systems, where users create fake accounts or manipulate user preferences to obtain favorable recommendations and deceive the system. These fabricated users or preference data can result in inaccurate recommendations. In shilling attack detection, Zhou [20,21] used neural networks that consider centrality features, user rating history, and latent features, demonstrating effective detection capabilities. Jia et al. [22] constructed a classification model based on rating distribution and selected different features as classification attributes based on information gain, achieving effectiveness in shilling attack detection, albeit with increased processing time. Wang et al. [23] proposed a shilling attack detection algorithm based on supervised prototype variational autoencoders (SP-VAE), which performs well even in the cold start scenario of shilling attacks. Zayed et al. [24] proposed an enhanced technique using supervised learning in collaborative recommendation systems to detect shilling attacks, requiring high-quality data but yielding average classification performance when labeled data is limited.
Group attacks involve malicious users or organizations collaborating to deceive recommendation systems for their own gain. These attacks use collective activities, shared interests, and other means to manipulate recommendation results, leading to incorrect recommendations or interference with normal users' recommendations. In group attack detection, Wu et al. [25] constructed features from multiple data sources for group attack detection, which performs well on synthetic datasets. Wu [26,27] detected group attacks by building a graph neural network attack detection model and a semisupervised model based on clustering and graph convolutional networks. These methods exhibit good timeliness in detecting mixed attacks.
"Ride Item's Coattails" refers to the deceptive correlation established between popular products and low-quality products through the creation of false click information. Its aim is to mislead e-commerce platform recommendation systems and promote the sale of low-quality products. To address large-scale fraudulent deception, Zhao et al. [28] proposed the fraud awareness impression regulation (FAIR) system, a data-driven approach that effectively regulates users' fraudulent behavior in real time on large e-commerce platforms. Li et al. [2] analyzed and summarized the characteristics of false click information generated by attackers in "Ride Item's Coattails" attacks. Xu et al. [29] proposed a group shilling attack detection solution based on multidimensional user features and collusion behavior analysis. They designed a set of indicators to measure abnormal user behavior and identified anomalous users using a feature similarity matrix. This method is suitable for detecting multistrategy group attacks.
This paper introduces an e-commerce recommendation attack detection model based on an enhanced deep forest algorithm. It validates and analyzes the model using a large-scale dataset of real false click data, demonstrating the feasibility of the proposed approach.

Random Forest for Evaluating Feature Importance.
Random forest is an ensemble classifier based on the bagging technique, which combines multiple fully grown decision trees. In classification prediction, the class label of a sample is determined by taking a majority vote of the class labels produced by these decision trees [30].
The feature importance evaluation in random forest can be achieved through various methods, one of which is based on the Gini coefficient. The Gini coefficient of node m is calculated as follows:
$$GI_m = 1 - \sum_{k=1}^{K} p_k^2,$$
where K represents the number of classes and $p_k$ represents the weight of the samples in class k. The importance of feature j at node m, which is the change in the Gini coefficient before and after the branching of node m, can be calculated as follows:
$$VIM_{jm} = GI_m - GI_l - GI_r,$$
where $GI_l$ and $GI_r$ represent the Gini coefficients of the two new nodes after branching.
If feature j appears in the nodes of tree i in the set M, then its importance in the i-th tree is as follows:
$$VIM_{ij} = \sum_{m \in M} VIM_{jm}.$$
Assuming that there are n trees in the random forest, then
$$VIM_j = \sum_{i=1}^{n} VIM_{ij}.$$
The obtained importance scores are normalized as follows:
$$VIM_j^{\mathrm{norm}} = \frac{VIM_j}{\sum_{j'} VIM_{j'}},$$
where the denominator is the sum of all feature gains and the numerator is the Gini gain of feature j.
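As a rough illustration of Gini-based importance, the following sketch computes the Gini coefficient of a node, the impurity decrease of a single split, and the normalization over features. It assumes the standard Gini impurity and, as a choice of this sketch rather than something stated in the text, weights each child node by its share of the parent's samples. The function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini coefficient of a node: 1 - sum_k p_k^2 over the class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gain(parent, left, right):
    """Impurity decrease of one split (assumption: children weighted by sample share)."""
    n = len(parent)
    return (gini(parent)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))


def normalize(importances):
    """Scale per-feature gains so that they sum to 1."""
    total = sum(importances.values())
    return {name: (value / total if total else 0.0)
            for name, value in importances.items()}
```

Summing `split_gain` over every node where a feature is used, and then over all trees, gives the per-feature totals that `normalize` rescales.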

Security and Communication Networks
Based on the feature importance ranking of random forest, further analysis and optimization of data features can be conducted. This includes selecting important features and eliminating redundant ones, which can enhance the training effectiveness and prediction accuracy of the model.
In the e-commerce environment, different online retail platforms place varying emphasis on specific features. Consequently, during the actual feature selection process for a model, the platform's significant features are weighted before conducting feature importance analysis and selection.

Centroid Symmetric Sampling.
The centroid is the mean of all data in a cluster and it represents the characteristic value of the data in the cluster. However, the centroid is not necessarily a sample in the cluster and cannot fully reflect the overall distribution characteristics of the cluster. Therefore, we propose centroid symmetric sampling based on the centroid-centered distance to preprocess imbalanced datasets with positive and negative classes.
Centroid sampling first extracts the positive and negative class data from the dataset. For the negative class data, the k-means algorithm is used to cluster and obtain the centroids of each cluster. Then, sampling is performed based on the centroid-centered distance. This method can preserve the data distribution of each cluster to the maximum extent and ensure the effectiveness of sampling. The functionality simulation of centroid symmetric sampling is shown in Figure 1 (Algorithm 1).
The centroid symmetric undersampling algorithm effectively handles the problem of extreme class imbalance in the data. It selects samples based on the distribution of the majority class data, preserving the original data distribution while optimizing it by removing outliers.
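The distance-based selection inside one cluster can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact procedure: the cluster is taken as already produced by k-means, distances are Euclidean, samples at or beyond `scale` times the mean distance are treated as outliers, and the retained samples are picked at even intervals along the distance ranking as a stand-in for the symmetric selection rule. The names `centroid_sample` and `scale` are hypothetical.

```python
import math


def centroid(points):
    """Mean of all points in the cluster, computed per dimension."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def centroid_sample(points, n_keep, scale=2.0):
    """Undersample one cluster based on distance to its centroid."""
    mu = centroid(points)
    dists = [(math.dist(p, mu), p) for p in points]
    # Outlier threshold: a multiple of the mean distance (assumed form).
    sigma = scale * sum(d for d, _ in dists) / len(dists)
    inliers = sorted((dp for dp in dists if dp[0] < sigma),
                     key=lambda dp: dp[0], reverse=True)
    if n_keep >= len(inliers):
        return [p for _, p in inliers]
    # Evenly spaced picks over the ranking keep near and far samples alike.
    step = len(inliers) / n_keep
    return [inliers[int(i * step)][1] for i in range(n_keep)]
```

Applying this to every majority-class cluster and concatenating the results yields the undersampled majority set.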

SMOTE Resampling.
SMOTE is a prevalent resampling algorithm. Its fundamental idea is to balance the dataset by generating new synthetic samples through interpolation between minority class samples, thereby enhancing the model's performance.
The specific steps of the SMOTE algorithm are as follows:
(1) Suppose there is a minority class sample $X_i$. The k nearest neighboring samples for this sample are denoted as $X_{j1}, X_{j2}, \ldots, X_{jk}$.
(2) For each sample $X_i$, select a random number r, where 0 < r < 1.
(3) For each feature j, interpolate between $X_i$ and one of its nearest neighbors $X_{nn}$ chosen at random from step (1): $X_{\mathrm{new}}^{(j)} = X_i^{(j)} + r \, (X_{nn}^{(j)} - X_i^{(j)})$.
Repeat step 2 until an adequate number of synthetic samples are generated.
Considering the severe imbalances that might arise in real-world scenarios, this study uses a combined data processing approach of centroid undersampling and SMOTE resampling. The balance ratio between them is dynamically determined based on the specific dataset, ensuring that the model maintains commendable performance even under extremely imbalanced conditions.
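The SMOTE interpolation step can be sketched in a few lines. This is a simplified illustration that assumes Euclidean nearest neighbors and numeric features; the function name `smote` and its parameters are hypothetical, and a production pipeline would more likely rely on an established library implementation.

```python
import math
import random


def smote(minority, k=2, n_new=4, seed=0):
    """Generate n_new synthetic minority samples by neighbor interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x_i = rng.choice(minority)
        # k nearest minority neighbors of x_i, excluding x_i itself.
        neighbors = sorted((p for p in minority if p is not x_i),
                           key=lambda p: math.dist(p, x_i))[:k]
        x_nn = rng.choice(neighbors)
        r = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + r * (b - a) for a, b in zip(x_i, x_nn)])
    return synthetic
```

Each synthetic point lies on the segment between a minority sample and one of its neighbors, so no point falls outside the region already occupied by the minority class.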

Deep Forest Algorithm.
The deep forest algorithm, proposed by Zhou et al. [31], is a deep learning model that distinguishes itself from traditional deep neural networks (DNNs). Structurally, deep forest can be described as a multigranularity cascading forest. It resembles convolutional neural networks (CNNs) in that it uses multiple scanning windows of various granularities to sample input data and extract features. This approach enhances its capability to learn data features effectively. The multigranularity scanning algorithm further strengthens the feature extraction capability of deep forest for input data. Each layer in deep forest performs feature extraction and dimensionality reduction on the input data, transmitting the extracted features to the subsequent layer. By utilizing the multigranularity scanning algorithm, the input data can undergo feature extraction and dimensionality reduction at different scales, thereby enhancing the model's robustness and generalization ability. The overall architecture of deep forest is depicted in Figure 2.
From Figure 2, it can be observed that the depth structure of the deep forest refers to a tree structure composed of multiple depth levels. This tree structure enhances the expressive power of the model, where each level can capture different features. Deeper levels enable the model to learn more complex features. In another sense, this hierarchical learning aids the model in identifying more significant features since each level extracts higher-level features.
The deep forest algorithm exhibits powerful computational capabilities, high accuracy, and a reduced number of hyperparameters, and it excels on small-sample datasets. It has been extensively applied in various domains such as computer vision, natural language processing, and medical image processing. Moreover, it is well-suited for diverse attack detection environments. Hence, in this study, the deep forest algorithm is selected as the foundational classification model to enhance the performance of attack detection. This improvement is achieved through feature processing, undersampling, data augmentation, and other relevant methods.
$\mathrm{dist}(x_i, \mu)$, where $x_i$ represents a sample point in the cluster and $\mu$ represents the cluster centroid.
(8) Sort the samples within the cluster in descending order based on their distances.
for j = 1, 2, 3, ..., do
If $S_i \geq \sigma$, where $\sigma = k \times \sum_{u=1}^{n} \mathrm{dist}(x_u, \mu) / n$
Sigma represents the mean of the sample distances, and n represents the total number of samples sampled within each cluster.
(9) Identify and remove outliers.
(10) Select the first sample based on Algorithm 1.
Here, $r_{tp}$, $r_{tn}$, $r_{fp}$, and $r_{fn}$ are the reward values for the respective cases.

Policy Design Method Based on Q-Learning.
Initialization: Initialize the action-value function Q(s, a) to zero. Choose an initial state s and an initial action a.
Loop: Execute action a and observe the reward r and new state s′.
Update Q value: Using the ε-greedy policy, select the next action a′.
Update state and action: s ← s′ and a ← a′.
Termination: If a certain condition is met (e.g., the model's performance reaches a predetermined threshold), then terminate the loop.
Using this method, the model can iteratively learn how to adjust its structure and parameters based on different states and received rewards, optimizing its attack detection performance.
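The loop above follows the shape of tabular Q-learning, which can be sketched as below. The update shown is the standard Q-learning rule, Q(s, a) ← Q(s, a) + α[r + γ max Q(s′, a′) − Q(s, a)]; the learning rate `alpha`, discount `gamma`, and the numeric reward values are illustrative assumptions, since the text does not specify them.

```python
import random


def reward(y_true, y_pred, r_tp=1.0, r_tn=0.1, r_fp=-0.5, r_fn=-1.0):
    """Reward per detection outcome (placeholder values, not from the paper)."""
    if y_true == 1:
        return r_tp if y_pred == 1 else r_fn
    return r_fp if y_pred == 1 else r_tn


def epsilon_greedy(q, state, actions, epsilon, rng):
    """Explore with probability epsilon, otherwise take the best-known action."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q.get((state, a), 0.0))


def q_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Standard tabular Q-learning update; unseen entries default to zero."""
    best_next = max(q.get((s_next, a2), 0.0) for a2 in actions)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return q[(s, a)]
```

Storing Q in a dictionary with a zero default matches the zero initialization of the action-value function without enumerating every state in advance.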

Dataset.
In this study, we utilized the "embrace the leg" (i.e., "Coattail Attack") detection dataset from Alibaba Cloud Tianchi. The dataset consists of 1 million offline training data samples and 100,000 test data samples. It includes fields such as UUID, user's product access time, user ID, product ID, product and user attribute features, and labels. The descriptions of each field are presented in Table 1.
In Figure 3, it is evident that the training set contains a notable proportion of duplicate samples, approximately 28.7% in total. This finding highlights the significant presence of duplicated data within the dataset.
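A duplicate rate such as the one reported here can be measured, and the duplicates dropped, by keying each record on its user ID and product ID. The sketch below assumes records are dictionaries with hypothetical field names `user_id` and `product_id`; the actual column names in the dataset may differ.

```python
def drop_duplicates(rows):
    """Keep the first record seen for each (user_id, product_id) pair."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["user_id"], row["product_id"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def duplicate_rate(rows):
    """Fraction of records that repeat an earlier (user_id, product_id) pair."""
    if not rows:
        return 0.0
    return 1.0 - len(drop_duplicates(rows)) / len(rows)
```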

Feature Correlation Analysis.
The given dataset consists of 152 anonymous features representing product and user attributes. The first 72 dimensions correspond to product features, while the remaining 80 dimensions represent user features. These features are named based on alphabetical prefixes and numerical indices. Visualizations of feature correlation matrices for product features and user features are presented in Figures 4 and 5, respectively.
From Figures 4 and 5, it can be observed that the feature correlation coefficient matrices exhibit several dark-colored blocks, indicating the presence of multicollinear features and a significant amount of feature redundancy. The colors in the heatmap of product features (Figure 4) are generally darker compared to Figure 5, indicating a stronger correlation between product features and the target variable. In the heatmap of user features, three noticeable blank areas can be observed, indicating the presence of three ineffective features: u148, u149, and u150.
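One plausible reading of the blank heatmap cells is that the corresponding features are constant, in which case the Pearson correlation is undefined. The sketch below illustrates this under that assumption; the text itself does not state why the cells are blank, and the function name is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation, or None when either input has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # undefined: would render as a blank cell in a heatmap
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)
```

Features for which every pairwise correlation is `None` carry no usable signal and are natural candidates for removal before training.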

Extracting Important Features with Random Forest.
The random forest algorithm is utilized to evaluate the importance of features in the training set and rank them. The top 10 important features selected are shown in Figure 6.
The density distribution curves of these 10 important features are shown in Figures 7-16. From the density distribution plots of the ten important features selected by random forest on the training set and test set, it can be observed that there are significant feature differences between the two sets. The substantial feature differences indicate a higher requirement for the model's generalization ability.

Data Augmentation.
Based on the dataset description provided, it is noted that the label is defined as malicious (label = 1) only when both the user and the product are malicious. In all other cases, it is considered a nonmalicious behavior. The relationship between normal and malicious behaviors is illustrated in Figure 17.
As shown in the relationship graph between normal and malicious users, even if a user is malicious, clicking on a normal product is still considered a normal behavior and vice versa. Following this rule, relevant labels can be assigned to some unlabeled data (label = −1), thereby increasing the training data. Labeling the unlabeled data is primarily ... The process is depicted in Figure 18, with the specific steps outlined as follows:
(1) For samples with label = 1, indicating positive samples where both the user and the product are malicious, mark the user ID and product ID as blacklist IDs.
(2) For samples with label = 0, in conjunction with the blacklist user and product IDs from the positive samples, two scenarios can be derived: the first scenario consists of a blacklist user ID with an unknown product ID, and the second scenario consists of an unknown user ID with a blacklist product ID.
Based on the data interpretation of the labeled samples, one attribute is the blacklist ID, and the ...
The data changes after processing based on the black and white list rule are shown in Figure 19. The data augmentation based on black and white list rules has shown excellent results. A total of 51,113 valid data samples were added to the training dataset through this method. Among them, the positive class data increased by 2,796 samples, accounting for a 28.9% increase, while the negative class data increased by 48,336 samples, which corresponds to a 119% increase.
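The blacklist and whitelist reasoning can be sketched as follows. This is one interpretation of the rule under stated assumptions: because label = 1 requires both sides to be malicious, a label = 0 record involving a blacklisted user implies a normal product, and a label = 0 record involving a blacklisted product implies a normal user. The record layout and the function name `infer_labels` are hypothetical, and conflicts caused by noisy labels are not handled.

```python
def infer_labels(labeled, unlabeled):
    """Assign labels to (user, product) pairs that the rule can decide.

    labeled: iterable of (user_id, product_id, label) with label in {0, 1}.
    unlabeled: iterable of (user_id, product_id).
    """
    labeled = list(labeled)
    bad_users = {u for u, p, y in labeled if y == 1}
    bad_items = {p for u, p, y in labeled if y == 1}
    good_users, good_items = set(), set()
    for u, p, y in labeled:
        if y == 0:
            if u in bad_users:
                good_items.add(p)  # malicious user, label 0: product is normal
            if p in bad_items:
                good_users.add(u)  # malicious product, label 0: user is normal
    inferred = []
    for u, p in unlabeled:
        if u in bad_users and p in bad_items:
            inferred.append((u, p, 1))
        elif u in good_users or p in good_items:
            inferred.append((u, p, 0))
        # otherwise the pair stays unlabeled
    return inferred
```

Pairs that match neither rule are left out rather than guessed, so the augmented set only contains labels that follow from the rule.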

Evaluation Metrics.
For e-commerce attack detection models, the primary objective is to correctly distinguish between malicious and benign activities. Among the plethora of model evaluation metrics, two stand out as particularly relevant. The first is the true positive rate, also known as recall. Recall measures the model's capability to correctly identify attacks. In the context of thwarting e-commerce attacks, a high recall implies the model's proficiency in capturing a vast majority of malicious activities, thus safeguarding the system from potential threats. The second metric is precision. Precision gauges the fraction of instances labeled as attacks by the model that are genuinely malicious. A high precision denotes a lower false alarm rate, minimizing undue interruptions to legitimate users and operations. The F1-score is the harmonic mean of precision and recall, providing a consolidated metric to assess the model's balance between these two parameters. Maintaining a harmonious balance between precision and recall is pivotal when combating e-commerce attacks. Hence, the F1-score serves as a robust representation of the model's actual detection efficacy. Additionally, the area under the receiver operating characteristic curve (AUC-ROC) is noteworthy. AUC-ROC quantifies the model's performance across all conceivable classification thresholds. It offers a holistic perspective on the model's differentiation capacity between positive and negative instances, which is paramount for e-commerce attack detection models.
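For concreteness, precision, recall, and the F1-score can be computed from binary labels as in the sketch below. It assumes labels are encoded as 1 for attacks and 0 for benign activity; the function name is hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for binary labels where 1 marks an attack."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Returning zero when a denominator is empty is a convention chosen here to avoid division errors when the model predicts no attacks at all.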

Harmonic Mean (F1).
The F1-score is the harmonic mean of precision and recall and is an important measure for classification problems. It is calculated as follows:
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
training capabilities. Given that e-commerce attack detection data primarily consists of character-type data, and the multigranularity scanning algorithm exhibits superior performance with image processing, only the cascading forest component of the deep forest was used in the constructed model.

Algorithm Optimization Hyperparameter Tuning.
In the DF21 model, the primary hyperparameters that need adjustment include those of the cascading forest, with each cascading layer comprising multiple random forests and completely random forests. The size and configuration parameters of each random forest and completely random forest influence the model's performance and complexity. The structure of the cascade is also a crucial hyperparameter. Increasing the number of layers may enhance the model's complexity and performance but might also lead to overfitting. However, the number of layers in the cascading structure can be determined adaptively, ensuring that the model's complexity does not have to be a manually set hyperparameter. Instead, it becomes an automatically determined parameter based on the data. In our configuration, if there is no improvement in the F1 score within three layers, the building process stops, with a maximum cap set at 100 layers.
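The adaptive depth rule can be sketched independently of any particular library. In the sketch below, `score_layer` is a hypothetical callback that trains the cascade up to a given layer and returns its validation F1 score; the patience of three layers and the cap of 100 layers follow the configuration described above.

```python
def grow_cascade(score_layer, patience=3, max_layers=100):
    """Add layers until the score has not improved for `patience` layers."""
    best_score = float("-inf")
    best_layer = 0
    for layer in range(1, max_layers + 1):
        score = score_layer(layer)
        if score > best_score:
            best_score, best_layer = score, layer
        elif layer - best_layer >= patience:
            break  # no improvement within the last `patience` layers
    return best_layer, best_score
```

The function returns the layer with the best score, so layers added after the last improvement can be discarded before the model is saved.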
For hyperparameter tuning within the cascading forest's random forests and completely random forests, we used grid search to obtain the optimal parameters.
The optimal hyperparameters are presented in Table 2.
(1) Optimal Threshold Selection. To enable the model to distinguish attack data more accurately, a threshold search method can be used, allowing the model to identify a threshold that yields a higher F1 score. The specific steps are as follows: First, train and save a deep forest model that outputs using a sigmoid function.
Load the saved model and conduct a threshold search on the validation set to identify the optimal threshold where the F1-score is maximized.
Append the optimal threshold to the fnal layer of the model, converting the sigmoid output into 0/1 labels.
Save and freeze the model.
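The threshold search in the steps above can be sketched as a simple scan over candidate cut-offs on the validation set. The candidate grid of 0.01 to 0.99 and the function names are assumptions made for illustration; the text does not specify the search granularity.

```python
def f1_at(y_true, probs, threshold):
    """F1 score when probabilities at or above `threshold` are labeled 1."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, probs, candidates=None):
    """Return the candidate threshold with the highest validation F1."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    return max(candidates, key=lambda t: f1_at(y_true, probs, t))
```

Once chosen, the threshold is applied to the model's probability output to produce the final 0/1 labels.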
(2) Distributed Computing. The complex structure of the deep forest model necessitates significant computational resources, resulting in extended inference times. To reduce these inference times, the "ForestLayer" distributed deep forest system is used, wherein each forest is divided into multiple subforests, with each subforest corresponding to a computational task. "ForestLayer" is an efficient and scalable system specifically designed for training deep forest models on distributed task-parallel platforms. It outperforms both gcForest and tfForest, achieving speed-ups of 7 to 20.9 times on various datasets. Additionally, it boasts near-linear scalability and excellent load balancing. By leveraging "ForestLayer," multiple trees can be trained and used for prediction simultaneously. This not only reduces the model's training time but, when applied in real-world scenarios, also allows for the distribution of the model's computational demands across idle computing resources. This distribution, in turn, minimizes the model's resource consumption. A schematic representation of distributed computing is shown in Figure 21.
The figure illustrates three cascading layers, where each cascading layer encompasses two random forests and two completely random forests.
Data partitioning: Within each learner box, the small rectangles represent data partitions processed by each learner across three distinct computing nodes.Model Training: Each node independently undertakes model training, facilitating parallel execution.Early Stopping Check: After each cascading layer, there's an early stopping check to determine the necessity of adding additional cascading layers.

Result Comparison.
Based on the prepared training and test datasets, deep forest models were constructed for detecting attack behavior. The selected base classifiers are the extremely randomized forest and the random forest with default hyperparameters. In addition, the training process includes dropout techniques to mitigate the overfitting problem and improve the performance of the proposed deep forest model. The performance of the deep forest algorithm is evaluated by comparing it with other machine learning models (DLMP and DNN). All models were compared fairly using the same training and test datasets with default hyperparameters. Figure 22 shows the accuracy of each algorithm.

From Figure 22, it can be observed that the performance of the three algorithms is relatively close on the test set. GCF (the deep forest model) exhibits slightly better overall performance than the DLMP and DNN models. GCF has the highest accuracy, and the F1-scores of the three models are comparable. In terms of precision, the GCF model slightly outperforms the DLMP and DNN models, with the DLMP model having the lowest score.
To verify whether the model's accuracy improves with an increase in the training dataset, we augmented the dataset with additional training data from an online source.Te results showed a slight improvement compared to the model before augmentation, indicating that the overall scores of the model tend to increase with an increase in the training dataset.

Conclusion
In this paper, a deep forest-based model for detecting e-commerce recommendation attacks has been proposed. Feature correlation analysis using random forests has been used to identify and remove irrelevant features. In the feature processing stage, data augmentation was performed on positive class data using the black-white list rule, greatly increasing the effective training data. A centroid undersampling method was proposed to handle severely imbalanced data, along with outlier removal. Furthermore, a comparative analysis with commonly used anomaly detection and classification algorithms in the industry was conducted, demonstrating that the proposed model can be used for detecting e-commerce recommendation attacks. Future work can involve applying this model to real-world recommendation systems for real-time detection of network attack behaviors.

Online Implementation Approach.
To deploy this model online, a series of engineering challenges must be addressed. As a result, the model has not yet been fully implemented online. Key issues to address for online implementation include the following:
Technical component aspect: AI flow can be utilized to define the entire workflow, with Flink used as the real-time computing engine. The core prediction process is accomplished using Cluster Serving.

Future Work.
Although the experiments described in this article have demonstrated that using the deep forest algorithm offers commendable performance for e-commerce recommendation attack detection, it is vital to recognize that in a real e-commerce environment, the real-time detection efficiency of the model is equally crucial. While distributed computing can enhance the prediction efficiency of the model, the significant computational resource consumption remains a challenge to overcome. In this context, as part of future work, this study can be further expanded to reduce the model's resource consumption.
Lastly, the "No Free Lunch" theorem suggests that there is no universally superior learning algorithm. Every algorithm requires continuous learning and refinement. Future work aims to further enhance attack detection accuracy and efficiency by combining larger models and applying them in real-world scenarios.

Figure 1: Function simulation diagram of centroid symmetric sampling.

(1) Extract majority class data from the dataset.
(2) Apply the k-means algorithm to cluster the majority class data.
(3) Output: Set of cluster divisions.
(4) Data within each cluster.
(5) Number of clusters, k.
(6) for i = 1, 2, 3, ..., k do
(7) Compute the distance from each sample in the cluster to the centroid using Algorithm 1.

Figure 4: Heatmap of target correlation matrix for product features.

Data Analysis and Preprocessing.
Removing Duplicate Values. Duplicate samples with the same user ID and product ID are removed.

Figure 5: Heatmap of target correlation matrix for user features.

Figure 6: Top 10 important features extracted by random forest.

Figure 18: Flowchart of the blacklist and whitelist rule.

Table 1: Training set data overview.
This model is a well-packaged version that has undergone extensive optimization. When trained on tabular datasets with tens of millions of records, it significantly reduces the memory footprint. Utilizing this encapsulated model addresses, to some extent, inherent issues of deep forests, such as memory consumption and CPU-only

Figure 20: Deep forest structure with dropout.
Table 2: The hyperparameter settings of deep forest.