Cache Pollution Detection Method Based on GBDT in Information-Centric Network

There is a new cache pollution attack in the information-centric network (ICN), which fills the router cache by sending a large number of requests for nonpopular content. This attack severely reduces the router cache hit rate, so detecting cache pollution attacks is an urgent problem in current information-centric networks. Most existing detection methods rely on manually set thresholds: the accuracy of the detection result depends on the threshold setting, and the adaptability to different network environments is weak. To improve both the accuracy of cache pollution detection and its adaptability to different network environments, this paper proposes a detection algorithm based on the gradient boosting decision tree (GBDT), which learns the detection criterion through model training. In feature selection, the algorithm uses two features, based on node status and path information, as model input, which improves the accuracy of the method. Comparative experiments demonstrate the improvement in detection accuracy.


Introduction
With the popularity of the Internet and Internet of Things technologies and the transition from IPv4 to IPv6, more and more smart devices can access the Internet, and network traffic is growing rapidly. The current application range and scale of the Internet have far exceeded the original design intention. The information-centric network [1][2][3][4], hereinafter referred to as ICN, was proposed in response. Current ICN protocols include NDN [5], PSIRP [6], DONA [7], and NetInf [8]. Although these protocols take different forms, the key point is that content is obtained by content name or ID, and every router on the path supports caching. Among the many ICN implementations, the most mainstream feasible solution is named data networking, hereinafter referred to as NDN. All the solutions discussed in this paper are based on NDN.
Since the original design of the information-centric network uses caching to increase network utilization, the cache is an indispensable part of the information-centric network; without it, the efficiency of the network would be significantly reduced. In IP-based networks, there are various kinds of network attacks, and one of the best known is the DDoS attack [9,10]. Unlike the IP network, the information-centric network is organized around content rather than IP addresses, so an attacker cannot direct a particular packet to a target host. The information-centric network is therefore inherently resistant to such attacks. However, the heavy use of caching to improve network efficiency naturally introduces the cache pollution attack. The attacker sends many requests for nonpopular content through controlled hosts so that the routers on the path cache the nonpopular content. When a normal user then makes a request, the cache lookup misses because the node cache does not hold the corresponding content, and the router can only forward the request to the content producer. This defeats the original design intention of using the cache to optimize network traffic, greatly increases the traffic on backbone links, and leads to network congestion and related problems.
Although ICN rethinks and innovates on a number of design concepts, some central issues have not been completely resolved in the initial ICN framework. This paper mainly discusses the problem of cache pollution detection. The cache pollution attack is one of the most serious attacks in information-centric networks. Most current detection algorithms require manually set thresholds and adapt poorly to different environments. Therefore, this paper proposes a GBDT-based cache pollution detection method, which does not require manual threshold setting and achieves high accuracy.

Related Work
NDN is an information-centric network architecture with high research value and development potential [5,11], and it has attracted great attention from the academic community in recent years. Although the native NDN architecture provides a degree of data security by having content producers encrypt and sign data packets, its network nodes still face considerable risks from malicious attacks in real, complex network environments, for example, Distributed Denial-of-Service (DDoS) attacks and cache pollution. Research by Virgilio et al. [12] shows that DDoS attacks can use a large number of forged interests to exhaust the memory of the pending interest table (PIT) in an NDN node. Tan Nguyen et al. detect and defend against DDoS by calculating the difference in the interest satisfaction rate caused by the attack [13], but this solution does not perform well against cache pollution attacks.
Cache pollution attacks [14] have been widely studied in IP-based networks, where most research focuses on caches for web traffic. Current research divides cache pollution attacks into two categories: the locality-disruption attack, which destroys the content distribution characteristics, and the false-locality attack, which forges them [15]. In a locality-disruption attack, the attacker or the controlled host sends many interest packets for nonpopular content into the network. As a result, the content store (CS) of each router is occupied by a large amount of nonpopular content, so requests for normally popular content cannot benefit from the router's CS, which increases network delay and achieves the purpose of the attack. A false-locality attack differs in that it does not destroy the overall distribution of content in the network; instead, the attacker periodically sends requests for a set of nonpopular content so that the cache is occupied by nonpopular content for a long time, which reduces the caching capability of the network.
Although cache pollution attacks have been extensively studied in IP networks, most of the detection methods for cache pollution in IP networks cannot be applied to ICN networks.
This is because, in ICN networks, the request packet does not contain any information about the requester, so the source of a request packet cannot be traced. The target of cache pollution in ICN is also different from that in the IP network: in the IP network the attack is mostly directed at the cache server, whereas in ICN the target is the routing node in the network. This also means that, in ICN cache pollution attacks, it is difficult for caching routers to perceive the existence of an attack [16,17]. As an information-centric network architecture with in-network caching, NDN is vulnerable to cache poisoning and pollution attacks.
Mauri et al. consider a case in which an attacker uses NDN's routing and caching system to add a large amount of malicious content to the cache storage of network nodes [18]. Literature [19] analyzed the essential characteristics of content pollution attacks in NDN networks based on some cases. Li et al. proposed a lightweight integrity verification and access control mechanism for network cache pollution attacks [20]. Literature [21] studied local cache pollution attacks and proposed a cache shield, which enhances the robustness of the network by increasing the cost of cache pollution attacks.
Park et al. proposed a randomized matrix check method [22], in which content names are mapped to a matrix and each position of the matrix records the number of corresponding requests. Whenever a request arrives, the rank of the matrix is checked; when the rank falls below a certain threshold, the node is considered to be under attack. Conti et al. proposed a lightweight mechanism [23] for detecting cache pollution attacks, which first defines a random sample set of content, monitors the distribution of the current sample set, and dynamically tracks the arrival rates of these requests; a cache pollution attack is reported once the change in arrival rate exceeds a certain threshold.

Analytical Methods
Caching is one of the reasons why information-centric networks are efficient. This chapter discusses the cache pollution detection problem in the information-centric network and studies how machine learning can remove the need for the manually set thresholds of traditional detection models. A GBDT-based cache pollution detection model is then proposed, as shown in Figure 1. First, node status information and path information are collected as features; then the gradient boosting tree algorithm is applied to the two features to build a cache pollution detection model. The model trains quickly and achieves high accuracy. Finally, we evaluate the model through experiments.

GBDT Model.
The cache pollution detection model is essentially a binary classifier: one class means that the current node is being attacked, and the other means that it is not. This chapter uses the GBDT model for cache pollution detection. GBDT is the abbreviation of gradient boosting decision tree. The model applies gradient boosting to decision trees, that is, multiple decision trees are combined according to the gradient boosting method. This section introduces the related background.
(1) Decision trees are divided into classification trees and regression trees. A classification tree is a decision tree used for classification problems. Because this chapter uses the GBDT model to classify the cache pollution problem, only the classification tree is introduced here. A classification tree is a classification model with a tree structure: each leaf node represents a classification result, and each branch represents a decision. As shown in Figure 2, given a new sample with features A and B, each internal node tests one feature and passes the sample to the corresponding child node; when the child node is a leaf, the classification result is obtained. Therefore, for any training set that contains no samples with identical features but different labels, the decision tree can reach 100% accuracy on the training set, because the tree can keep growing until all attributes have been tested. Although the decision tree can reach almost 100% accuracy on the training set, it may perform very poorly on the test set. In that case the tree has merely memorized the training data without learning how to classify data that do not appear in the training set, so its predictions on new data are close to random. This phenomenon is called overfitting. Training a decision tree means determining the feature tested at each branch, that is, finding the optimal feature at each node. The usual criteria are information gain, information gain ratio, and the Gini index. The information entropy is defined as follows:

$$H(U) = -\sum_{i=1}^{n} p_i \log_2 p_i, \tag{1}$$

where $p_i$ is the probability of the $i$th outcome of $U$. For convenience, the logarithm is taken to base 2. For a variable with two outcomes, the value range of H(U) is [0, 1]. Information entropy quantifies the uncertainty of the information.
A larger value indicates greater uncertainty, that is, less information. For example, suppose a box contains red and white balls. Without further information, we can only assume that the probability distribution is 1/2 and 1/2; the entropy is then 1, the maximum value, which indicates the least information. If we know that the box contains only white balls, the probability distribution becomes 0 and 1, and the information entropy is 0, which means that the amount of information is at its maximum and there is no uncertainty. The ID3 algorithm uses information gain, that is, the change in information entropy before and after a split, to select the optimal attribute. The information gain obtained by splitting the data set D on feature A is defined in equation (2); in each iteration, the feature with the largest information gain is selected to split the data into subsets.

$$\mathrm{Gain}(D, A) = \mathrm{Entropy}(D) - \mathrm{Entropy}(D, A), \qquad \mathrm{Entropy}(D, A) = \sum_{v} \frac{|D_v|}{|D|}\,\mathrm{Entropy}(D_v), \tag{2}$$

where the sum runs over the subsets $D_v$ of D induced by the values of feature A.
However, the ID3 algorithm has a significant limitation. Because it chooses the feature with the minimal Entropy(D, A), it tends to select features that have many distinct values and therefore produce purer subsets. The C4.5 algorithm was proposed to address this drawback of ID3. C4.5 uses the information gain ratio as the measure of the best feature. The gain ratio GainRate(D, A) is defined in equation (3); the attribute with the maximum normalized information gain is chosen to make the decision.

$$\mathrm{GainRate}(D, A) = \frac{\mathrm{Gain}(D, A)}{\mathrm{SplitInfo}(D, A)}, \qquad \mathrm{SplitInfo}(D, A) = -\sum_{v} \frac{|D_v|}{|D|}\log_2\frac{|D_v|}{|D|}, \tag{3}$$

where $D_v$ is the subset of D in which feature A takes its $v$th value.
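As an illustration of the criteria above, the following Python sketch computes the base-2 entropy, the information gain used by ID3, and the gain ratio used by C4.5 on a toy data set. The function names and the four toy samples are hypothetical and are not taken from the experiments; the second feature behaves like a unique ID, to show why plain information gain favors many-valued features.

```python
import math
from collections import Counter


def entropy(labels):
    """Base-2 Shannon entropy of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())


def partition(rows, labels, feature_index):
    """Group the labels by the value each row takes for one feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature_index], []).append(label)
    return list(groups.values())


def information_gain(rows, labels, feature_index):
    """ID3 criterion: entropy before the split minus weighted entropy after."""
    groups = partition(rows, labels, feature_index)
    after = sum(len(g) / len(labels) * entropy(g) for g in groups)
    return entropy(labels) - after


def gain_ratio(rows, labels, feature_index):
    """C4.5 criterion: information gain divided by the split information."""
    groups = partition(rows, labels, feature_index)
    # Split information is the entropy of the group-membership variable.
    split_info = entropy([i for i, g in enumerate(groups) for _ in g])
    if split_info == 0:
        return 0.0
    return information_gain(rows, labels, feature_index) / split_info


# Hypothetical toy samples: (traffic level, unique sample ID).
rows = [("high", "a"), ("high", "b"), ("low", "c"), ("low", "d")]
labels = ["attack", "attack", "normal", "normal"]

print(entropy(["red", "white"]))  # 1.0, the two-ball example
for index in (0, 1):
    print(index, information_gain(rows, labels, index),
          gain_ratio(rows, labels, index))
```

Both features reach the same information gain on this toy set, but the ID-like feature gets a gain ratio of 0.5 instead of 1.0, which is the correction that the gain ratio is meant to provide.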
The CART tree can be used for both classification and regression problems. It uses the Gini index (Gini impurity) to determine the optimal split point. The Gini index is defined in equation (4) and, like entropy, represents the uncertainty of the sample set S:

$$\mathrm{Gini}(S) = 1 - \sum_{k} p_k^2, \tag{4}$$

where $p_k$ is the proportion of samples in S that belong to class $k$. Since CART is a binary decision tree, each split divides the set into exactly two parts, so each split uses the $i$th value $A_i$ of attribute A, as shown in equation (5):

$$\mathrm{Gain}_{A,i}(S) = \frac{|S_1|}{|S|}\,\mathrm{Gini}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Gini}(S_2), \tag{5}$$

where $S_1$ and $S_2$ are the two parts of S produced by the split on $A_i$. $\mathrm{Gain}_{A,i}(S)$ represents the uncertainty of the set S after the split on $A_i$. The larger the value, the greater the uncertainty of the sample set S after the split, which is analogous to entropy.

Security and Communication Networks

(2) The boosting algorithm [24] is an important part of ensemble learning. Boosting primarily reduces the bias of the model by strengthening weak models: its principle is to combine weak models into a strong model. First, a weak classifier α is trained with the initial sample weights. The weights of the training samples are then updated according to the predictions of α, so that the samples misclassified by α receive larger weights. These hard samples therefore get more attention, and can be classified more accurately, when the next weak learner β is trained.
Then, the reweighted training set is used to train the weak classifier β, and the process is repeated until the number of weak classifiers reaches the number T specified beforehand. Finally, these T weak classifiers are combined through an ensemble strategy to obtain the final strong classifier.
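The reweighting loop described above can be sketched in a few lines of Python. The text does not fix a particular weight update or combination strategy, so this sketch assumes the AdaBoost rules (exponential reweighting and a weighted vote) with one-dimensional threshold stumps as weak classifiers; all names and the toy data are hypothetical.

```python
import math


def stump_predict(x, threshold, polarity):
    """A weak classifier: predicts polarity when x >= threshold, else -polarity."""
    return polarity if x >= threshold else -polarity


def train_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def boost(xs, ys, rounds):
    """Train `rounds` stumps, increasing the weight of misclassified samples."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # vote of this weak classifier
        ensemble.append((alpha, threshold, polarity))
        # Misclassified samples (y * prediction = -1) get larger weights.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Strong classifier: sign of the weighted vote of all weak classifiers."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


# Toy data that no single threshold can separate (labels are +1 / -1).
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
one = boost(xs, ys, rounds=1)
three = boost(xs, ys, rounds=3)
print([predict(one, x) for x in xs])
print([predict(three, x) for x in xs])
```

On this toy set a single stump misclassifies two samples, while the combination of three reweighted stumps classifies all six correctly.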
(3) GBDT is one of the boosting algorithms in ensemble learning. GBDT is also an iterative model that uses a forward stagewise algorithm, but its weak learner is restricted to the CART regression tree model.
For the cache pollution problem, binary classification is needed. With labels $y \in \{-1, +1\}$, the log loss function can be used:

$$L(y, f(x)) = \log\bigl(1 + \exp(-y f(x))\bigr). \tag{6}$$

The negative gradient of the loss function for the $i$th sample in the $t$th round is

$$r_{ti} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x) = f_{t-1}(x)}. \tag{7}$$

Substituting the loss function of the cache pollution problem gives the negative gradient

$$r_{ti} = \frac{y_i}{1 + \exp\bigl(y_i f_{t-1}(x_i)\bigr)}. \tag{8}$$

Using $(x_i, r_{ti})$, $i = 1, 2, \ldots, m$, we fit a CART regression tree to obtain the $t$th regression tree and its leaf node regions $R_{tj}$, $j = 1, 2, \ldots, J$, where $J$ is the number of leaf nodes.

For the samples in each leaf node, the output value $c_{tj}$ that minimizes the loss function is

$$c_{tj} = \arg\min_{c} \sum_{x_i \in R_{tj}} L\bigl(y_i, f_{t-1}(x_i) + c\bigr). \tag{9}$$

Substituting the loss function of the cache pollution problem, the optimal fitted value of each leaf node is given by equation (10):

$$c_{tj} = \arg\min_{c} \sum_{x_i \in R_{tj}} \log\Bigl(1 + \exp\bigl(-y_i (f_{t-1}(x_i) + c)\bigr)\Bigr). \tag{10}$$

Since equation (10) is difficult to optimize directly, we use the following approximation instead:

$$c_{tj} = \frac{\sum_{x_i \in R_{tj}} r_{ti}}{\sum_{x_i \in R_{tj}} |r_{ti}|\,(1 - |r_{ti}|)}. \tag{11}$$

Thus, the fitting function for each iteration is

$$h_t(x) = \sum_{j=1}^{J} c_{tj}\, I(x \in R_{tj}). \tag{12}$$

Finally, the resulting strong learner is

$$f(x) = f_T(x) = f_0(x) + \sum_{t=1}^{T} \sum_{j=1}^{J} c_{tj}\, I(x \in R_{tj}). \tag{13}$$
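To make one boosting round concrete, the sketch below evaluates the negative gradient of the log loss and the approximate leaf output for a single leaf. It assumes the common label convention y in {-1, +1}, the loss log(1 + exp(-y f)), the residual r = y / (1 + exp(y f)), and the leaf value sum(r) / sum(|r| (1 - |r|)); the function names and the four toy samples are hypothetical, and fitting the CART regression tree itself is left out.

```python
import math


def log_loss(ys, fs):
    """Total log loss for labels in {-1, +1} and raw scores fs."""
    return sum(math.log(1.0 + math.exp(-y * f)) for y, f in zip(ys, fs))


def negative_gradient(y, f):
    """Residual r = y / (1 + exp(y * f)) that the next tree is fitted to."""
    return y / (1.0 + math.exp(y * f))


def leaf_value(residuals):
    """Approximate optimal leaf output: sum(r) / sum(|r| * (1 - |r|))."""
    denominator = sum(abs(r) * (1.0 - abs(r)) for r in residuals)
    return sum(residuals) / denominator if denominator > 0 else 0.0


# Hypothetical leaf with three attacked samples (+1) and one normal sample (-1),
# starting from the initial score f_0(x) = 0 for every sample.
ys = [1, 1, 1, -1]
f_prev = [0.0, 0.0, 0.0, 0.0]

residuals = [negative_gradient(y, f) for y, f in zip(ys, f_prev)]
c = leaf_value(residuals)
f_next = [f + c for f in f_prev]  # all four samples fall in the same leaf

print(residuals, c)
print(log_loss(ys, f_prev), log_loss(ys, f_next))
```

Adding the leaf value to the previous score lowers the total log loss for this leaf, which is the effect each additional tree is meant to have.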

Node Status Information.
Cache pollution attacks in NDN-based and IP-based networks are similar in that, in both, the attacker targets a particular terminal. In the IP-based network, the terminal is a server; in the NDN architecture, it is a router. Consequently, only the attacked node itself is in a position to detect this kind of attack.
In NDN, the most direct indicator that an attack has occurred is the cache hit rate of normal requests. However, the intermediate router responsible for forwarding and caching cannot distinguish normal interest packets from attack interest packets, so this quantity cannot be obtained, directly or indirectly, at the router. Whether an attack has occurred can only be estimated from the quantities that are available. The quantities available in the NDN router are shown in Table 1.
Since NDN is designed according to the "thin waist" principle, a router has little information available. First, the cache pollution attack is carried out by sending a large number of nonpopular interest packets into the network, so quantities related to data packets carry little significance and are not used as features. Second, aggregate quantities such as the total number of interest packets and the total cache hit rate are not meaningful for attack detection and are not suitable as model inputs. In addition, ID-type quantities such as the name of an interest packet or the name of a cached item are essentially independent of the cache attack, so such ID-type variables should not be used as features of the model. Existing research has shown that interest packet requests at routers normally follow a Zipf distribution, that is, the most frequent requests cover only a small part of all the content [25]. Therefore, the extracted features should reflect the distribution of content. Because the number of interest packets per unit time reflects this distribution, the request counts of the K most requested contents per unit time are used to form a K-dimensional feature, which enables the model to learn the current distribution; the cache hit rates of the corresponding K contents are then selected as additional features.
The number of interest packets in these features depends heavily on network usage. For example, the total number of interest packets differs greatly between peak and off-peak periods, but this difference says nothing about whether the node is under attack. If the raw number of interest packets were used as a feature, the model might depend too strongly on the traffic volume in the network. It is therefore necessary to normalize the counts and use proportions rather than quantities as features. Each count is normalized as in equation (14):

$$\frac{cnt_k}{total}, \qquad k = 1, 2, \ldots, K, \tag{14}$$

where $cnt_k$ is the number of interest packets per unit time for the content with the $k$th largest request count and $total$ is the total number of interest packets per unit time. The node features finally selected are shown in Table 2. In Table 2, VHit is the cache replacement rate under the cache replacement policy; it can be obtained from the ndnSIM source code because we use the built-in cache policy, namely LRU. VH denotes the K cache hit ratios corresponding to these interest packets and can also be obtained from the ndnSIM source code.
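The normalized top-K feature can be illustrated with a short Python sketch. It assumes that the interest names observed in one unit of time are available as a list; the function name and the example names are hypothetical, and missing entries are padded with zeros so that the feature always has K dimensions.

```python
from collections import Counter


def top_k_proportions(interest_names, k):
    """Return cnt_k / total for the k most requested names in one time unit."""
    total = len(interest_names)
    if total == 0:
        return [0.0] * k
    top_counts = [count for _, count in Counter(interest_names).most_common(k)]
    top_counts += [0] * (k - len(top_counts))  # pad when fewer than k names
    return [count / total for count in top_counts]


# Hypothetical off-peak window: 10 interests in one time unit.
off_peak = ["/video/a"] * 6 + ["/video/b"] * 3 + ["/doc/c"]
# Hypothetical peak window: the same request mix at ten times the volume.
peak = off_peak * 10

print(top_k_proportions(off_peak, 2))  # [0.6, 0.3]
print(top_k_proportions(peak, 2))      # [0.6, 0.3]
```

The two windows differ by a factor of ten in volume but yield the same feature vector, which is the purpose of using proportions instead of raw counts.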

Path Information Feature.
In the NDN network, in addition to the features based on node state, path-based information can be extracted as an auxiliary feature. To record the path information, a path field is added to the interest packet. This section proposes a hash-based path feature extraction algorithm. The algorithm uses only a few assembly instructions per packet and hardly reduces the packet-processing speed of the router. In terms of memory, it adds only one integer variable to the interest packet, which has almost no effect on network bandwidth. When each NDN router starts, it selects a random integer as its router ID. The path field of an interest packet sent by a consumer is 0, that is, the content consumer does not take part in maintaining the path. If an attacker changes this field to forge path information, the first-hop router can detect the attempt because the value is nonzero. The router algorithm is as follows. When an NDN router receives an interest packet, it updates the value of PATH using equation (15):

$$PATH_{i+1} = PATH_i \oplus ID_{i+1}, \tag{15}$$

where $PATH_{i+1}$ is the PATH value in the interest packet forwarded by the $(i+1)$th router, $ID_{i+1}$ is the ID of the $(i+1)$th router, and $\oplus$ (xor) is the exclusive-or operation. This update costs only one assembly instruction per forwarding step, so it hardly affects the delivery rate of interest packets. The PATH value approximately represents the path taken by the interest packet to a given terminal. Unique(C) is defined as the number of different PATH values among the interest packets requesting content C at the current terminal, and Cnt(C) is defined as the number of interest packets requesting content C at the current terminal.
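The PATH update can be simulated with a few lines of Python. This is a sketch outside ndnSIM: the router names, the 32-bit ID width, and the helper functions are hypothetical, and only the rule itself (consumers send PATH = 0, each router xors in its own random ID) follows the description above.

```python
import random


def assign_router_ids(router_names, bits=32, seed=None):
    """Give every router a random integer ID, as done at router start-up."""
    rng = random.Random(seed)
    return {name: rng.getrandbits(bits) for name in router_names}


def first_hop_accepts(path_from_consumer):
    """A first-hop router expects PATH = 0 from consumers; nonzero is suspect."""
    return path_from_consumer == 0


def forward_path(route, router_ids, initial_path=0):
    """Apply PATH <- PATH xor ID at every router along the route."""
    path = initial_path
    for router in route:
        path ^= router_ids[router]
    return path


# Fixed hypothetical IDs so that the example is reproducible.
ids = {"r1": 0b0101, "r2": 0b0011, "r3": 0b1000}

print(forward_path(["r1", "r2"], ids))  # 6  (0b0110)
print(forward_path(["r1", "r3"], ids))  # 13 (0b1101)
print(first_hop_accepts(0), first_hop_accepts(42))
```

Because xor is commutative, two routes that cross the same routers in a different order produce the same PATH value, which is one reason the value only approximately identifies a path.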
Obviously, when there is no cache pollution, there is a positive correlation between the number of interest packets Cnt(C) and Unique(C). Because Unique(C) grows with the number of requests, it cannot be used directly as a feature and must be normalized. The diversification ratio CP(C) of content C is defined in equation (16):

$$CP(C) = \frac{Unique(C)}{Cnt(C)}. \tag{16}$$

The diversification ratio reflects, to some extent, the richness of the sources requesting content C, and by definition its range is between 0 and 1. The smaller the value, the less diverse the sources of the interest packets and the more likely the requests are part of an attack. Because the feature is negatively correlated with the cache attack, it can increase the accuracy of the model.
It can be seen from equation (16) that calculating the diversification ratio CP(C) of content C requires Cnt(C) and Unique(C), both of which are statistical values. Cnt(C) is the number of interest packets per unit time, which requires only a numeric counter, and Unique(C) is the number of distinct PATH values. In an NDN network, given the traffic volume, interest packets are not stored, so hashing is required to maintain these statistics. The PATH value is hashed, and a bitmap is used to reduce memory usage: one bit represents whether a given PATH hash has appeared, and it is set to 1 when it has. The number of bits set to 1 within the unit time can then be taken as an approximation of the number of distinct PATH values in the network.
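A minimal sketch of the bitmap-based estimate follows, taking the diversification ratio as Unique(C) divided by Cnt(C). The class name, the bitmap size of 1024 bits, and the multiplicative hash constant are hypothetical choices for illustration; the text does not specify the hash function. Hash collisions can only make the estimate of Unique(C) smaller than the true value.

```python
HASH_MULTIPLIER = 2654435761  # hypothetical multiplicative hash constant


class DiversityCounter:
    """Tracks Cnt(C) and a bitmap estimate of Unique(C) for one content C."""

    def __init__(self, bitmap_bits=1024):
        self.bitmap_bits = bitmap_bits
        self.bitmap = 0  # a Python int used as a fixed-size bit array
        self.cnt = 0     # number of interests seen in the current time unit

    def observe(self, path_value):
        """Record one interest packet carrying the given PATH value."""
        self.cnt += 1
        slot = (path_value * HASH_MULTIPLIER) % self.bitmap_bits
        self.bitmap |= 1 << slot

    def unique(self):
        """Approximate number of distinct PATH values (bits set to 1)."""
        return bin(self.bitmap).count("1")

    def ratio(self):
        """Diversification ratio Unique(C) / Cnt(C), in the range [0, 1]."""
        return self.unique() / self.cnt if self.cnt else 0.0


# Hypothetical normal content: 100 interests arriving over 50 different paths.
normal = DiversityCounter()
for path in list(range(1, 51)) * 2:
    normal.observe(path)

# Hypothetical attacked content: 100 interests that all share one path.
attacked = DiversityCounter()
for _ in range(100):
    attacked.observe(7)

print(normal.unique(), normal.ratio())
print(attacked.unique(), attacked.ratio())
```

The counter stores one integer count and one fixed-size bitmap per content, so no interest packet has to be kept in memory.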

Experimental Environment.
The experimental environment of this paper is shown in Table 3.

ndnSIM Simulation.
ndnSIM is an NDN simulation platform built on the NS-3 network simulator, which can simulate a wide variety of NDN scenarios [26]. This paper modifies the source code of the interest packet structure in ndnSIM: it adds the PATH variable, randomly assigns an ID to each NDN router, and adds the PATH update to the routing and forwarding process as described earlier in this chapter.
This chapter conducts simulation experiments on a known complex topology, shown in Figure 3. In each experiment, the attacker randomly selects hosts as controlled hosts, and the controlled hosts send a large number of requests for nonpopular content.
Most researchers believe that requests in the information-centric network follow a Zipf distribution. Therefore, the requests in the simulated network follow a Zipf distribution; the parameter of the normal request distribution is set to a = 1.2, and the request rate is 1000/s. The NDN routers use the LRU cache policy. The environment is built with ndnSIM [24], and the experiment statistics, including the number of arriving interest packets, are collected by modifying the source code. To obtain training data for GBDT, the network is simulated both without and with an attack; in the attack case, the attacker sends a large number of nonpopular interest packets. The statistics for the two cases are recorded and saved separately, and the data are divided into a training set and a test set for multiple experiments. The training and test sets selected in each experiment are shown in Table 4.

Model Training.
This paper uses Python's LightGBM library to build the GBDT model. LightGBM is Microsoft's boosting framework; it offers faster training, lower memory usage, and higher accuracy than XGBoost [27], and it supports parallel learning. In this experiment, the GBDT model is trained on 10,000 sets of data and tested on 2,000 sets of data to analyze its accuracy. To prevent overfitting, the maximum depth of the decision trees, the maximum number of leaf nodes, and the regularization parameters of the GBDT model are set explicitly. For the number of iterations, an early stopping strategy is used: the training data are divided into two disjoint parts, a training set and an evaluation set (so called to distinguish it from the test set), in a ratio of 4 : 1. The loss function on the evaluation set is calculated in each iteration, and training stops when performance on the evaluation set no longer improves, that is, when the loss no longer decreases. The model uses the log loss function. Some of the LightGBM parameter settings used when training the GBDT model are shown in Table 5.
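The 4 : 1 split and the stopping rule can be sketched independently of any boosting library. The snippet below is plain Python and does not call LightGBM; the function names, the random seed, and the patience value (how many non-improving iterations are tolerated before stopping) are assumptions made for illustration, and the loss sequence is invented.

```python
import random


def split_train_eval(samples, eval_fraction=0.2, seed=0):
    """Shuffle and split into disjoint training and evaluation sets (4 : 1)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


def best_iteration(eval_losses, patience):
    """Return (iteration, loss) of the best evaluation loss, stopping early
    once `patience` consecutive iterations bring no improvement."""
    best_iter, best_loss = 0, float("inf")
    for iteration, loss in enumerate(eval_losses, start=1):
        if loss < best_loss:
            best_iter, best_loss = iteration, loss
        elif iteration - best_iter >= patience:
            break
    return best_iter, best_loss


train_set, eval_set = split_train_eval(range(10000))
print(len(train_set), len(eval_set))  # 8000 2000

# Invented evaluation losses: improvement stops after the third iteration.
losses = [0.50, 0.30, 0.20, 0.21, 0.22, 0.19]
print(best_iteration(losses, patience=2))  # (3, 0.2)
```

A small patience stops as soon as the evaluation loss stalls, whereas a larger one tolerates short plateaus at the cost of extra iterations; with `patience=5` the same loss sequence runs to the sixth iteration.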

Evaluating Indicator.
The validity of the method is measured by precision, recall, accuracy, and F-measure. The following concepts are defined:
TP (true positive): cache pollution is classified as cache pollution.
TN (true negative): an unpolluted cache is classified as unpolluted.
FP (false positive): an unpolluted cache is classified as cache pollution.
FN (false negative): cache pollution is classified as unpolluted.
Precision (Pre) is the proportion of samples classified as cache pollution that are actually polluted, and it is calculated as in equation (17):

$$Pre = \frac{TP}{TP + FP}. \tag{17}$$

Accuracy (Acc) is the ratio of correctly classified samples to all samples and measures the overall correctness of the model. It is given by equation (18):

$$Acc = \frac{TP + TN}{TP + TN + FP + FN}. \tag{18}$$

The recall rate (Recall) indicates how many positive samples are correctly classified and is calculated as in equation (19):

$$Recall = \frac{TP}{TP + FN}. \tag{19}$$

The F-measure is the harmonic mean of precision and recall and is calculated as in equation (20):

$$F = \frac{2 \cdot Pre \cdot Recall}{Pre + Recall}. \tag{20}$$
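The four indicators follow directly from the confusion counts, using the standard definitions Pre = TP / (TP + FP), Recall = TP / (TP + FN), Acc = (TP + TN) / (TP + TN + FP + FN), and F = 2 Pre Recall / (Pre + Recall). The function name and the example counts in this sketch are hypothetical and are not experimental results.

```python
def detection_metrics(tp, tn, fp, fn):
    """Compute precision, recall, accuracy, and F-measure from the counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f_measure": f_measure}


# Hypothetical counts: 90 attacks detected, 10 missed, 5 false alarms.
metrics = detection_metrics(tp=90, tn=95, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The zero-denominator guards only matter for degenerate inputs, for example a test window that contains no polluted samples at all.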
In this paper, the training time of the model is used as the main indicator of model performance. The experiment records the relationship between the number of GBDT iterations and the training time on 10,000 sets of data.

Analysis of Results
The GBDT model proposed in this paper uses two types of features: node status and path information. Since normalization is used, all values are in the range [0, 1]. For the NDN cache pollution attack, the attack strength varies, and the feature values change within a certain range as the attack strength changes. The final decision model should therefore be a range-based model, a property that matches the decision tree. GBDT is currently one of the best-performing improvements on the decision tree, so it is adopted here, and the experiments confirm that it achieves good detection results. Figure 4 shows the relationship between the loss function and the number of iterations when the parameters described in Section 3 are used on 10,000 sets of data. As the number of iterations increases, performance on the training set keeps improving, whereas performance on the evaluation set stops improving and begins to deteriorate. If the number of iterations continued to increase, the model would overfit. With the current model parameters, the loss functions of the training set and the evaluation set are best at 736 iterations. At this point, the loss function on the evaluation set is 0.0386, and the loss function on the training set is [...].
As can be seen from Figure 5, training the GBDT model with LightGBM is very fast on 10,000 sets of data. At about 300 iterations, the training time is still less than 1 second, and reaching the best iteration in the simulation experiment took only about 2 seconds. The attack strength θ is defined as the proportion of attack packets among the request packets. The stronger the attack, the greater its impact on the state of the network node, so the accuracy of the model depends on the attack strength. Therefore, the relationship between attack intensity and detection accuracy was analyzed in the simulation.
As shown in Figure 6, when the attack intensity is lower than 15%, the precision (Pre), accuracy (Acc), and recall of the GBDT model all improve steadily as the attack intensity increases. When the attack intensity exceeds 20%, the model distinguishes attacks more clearly and these indicators stabilize, with precision up to 93%, accuracy up to 95%, and recall up to 97%. This is because, as the attack intensity increases, the cache hit ratio of nodes in the network and the proportion distribution of interest packets are affected more and more, which makes it easier for the model to detect attacks. Figure 7 compares the detection accuracy with that of the lightweight mechanism (LWM) method proposed in [17]. The traditional LWM method requires a threshold, which affects its detection accuracy. The GBDT model learns the decision criterion through machine learning, so no threshold needs to be set. The numerical results show that the detection accuracy of the GBDT model is higher than that of the LWM method at every attack intensity tested. At an attack intensity of 2.5%, the detection accuracy of the GBDT model exceeds 85%, which is 5% higher than that of the LWM method. This indicates that the model has a fairly strong ability to perceive cache pollution.

Conclusion
In this paper, a new detection method based on machine learning is proposed for the problem of cache pollution in the information-centric network. First, the node state and the path information are used as features. The node information is obtained through statistics, and the path information is hashed to compute the diversification ratio. In the interest packet delivery process, only one field and one assembly instruction need to be added, so the forwarding efficiency of normal routing is almost unaffected.
Then, the two features are combined and the gradient boosting tree algorithm is used to build a cache pollution detection model. Compared with current detection methods, most of which require thresholds to be set, the method has higher accuracy and better adaptability to different networks. Finally, the model is implemented with ndnSIM and LightGBM, and its advantage in detection accuracy over other detection methods is demonstrated.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.