Anti-Attack Scheme for Edge Devices Based on Deep Reinforcement Learning

The Internet of Things realizes the leap from traditional industry to intelligent industry. However, it makes edge devices more vulnerable to attackers while they process perceptual data in real time. To solve this problem, we model the interactions between attackers and edge devices as a zero-sum game and propose an antiattack scheme based on deep reinforcement learning. Firstly, we design the kNN-DTW algorithm to find a sample that is similar to the current sample and use the weighted moving mean method to calculate the mean and the variance of the samples. Secondly, to solve the overestimation problem, we develop an optimal strategy algorithm to find the optimal strategy of the edge devices. Experimental results show that the new scheme improves the payoff of attacked edge devices and decreases the payoff of attackers, thus forcing the attackers to give up the attack.


Introduction
The Internet of Things (IoT) [1,2] integrates various sensors or controllers with sensing and monitoring capabilities, as well as advanced technologies (e.g., mobile communication technology and intelligent analysis technology), into all aspects of industrial production, realizing the leap from traditional industry to intelligent industry. It has been widely used in logistics [3,4], transportation [5], energy [6], and so on. During the application of IoT, massive perception data is produced in end devices, which requires edge devices to offer better real-time performance, security [7,8], and privacy [9,10]. However, edge devices are usually located near the user or on a routing path to the cloud, making them more vulnerable to attackers. For example, machine learning models on edge devices are vulnerable to well-designed adversarial examples during training [11,12]. UpGuard, an American cybersecurity firm, found that hundreds of millions of Facebook user records stored on Amazon's cloud computing servers could be easily accessed by anyone. Tens of thousands of private Zoom videos were uploaded to public web pages where anyone could watch them online.
The above threats can cause network penetration, personal data theft, and the epidemic spread of intelligent computer viruses. Therefore, preventing attacks and ensuring data security are the key to the efficient application of IoT.
Currently, resistance to malicious attackers in IoT mostly relies on traditional techniques, such as encryption and identity management. Encryption is the most common traditional technique [13,14]. However, due to the limited resources of edge devices in IoT, designing a lightweight encryption program is one of the biggest challenges. Identity management technologies are the first line of defense against malicious attackers. However, existing identity management technology cannot achieve identity authentication across multilayer architectures. In recent years, a few emerging safety precaution technologies, such as trusted execution environments and machine learning, have become widely used in IoT [15][16][17]. However, most machine learning technologies assume that the training data remains constant during training, which is incompatible with an IoT environment where the data changes dynamically in real time [18,19].
Inspired by the above schemes, and from the point of view of attacker payoffs, we model the interactions between edge devices and attackers as a zero-sum game and propose an antiattack scheme for edge devices based on deep reinforcement learning. The structure diagram of the proposed scheme is shown in Figure 1. The major contributions are as follows: (1) To find the optimal strategy of edge devices, we propose the kNN-DTW algorithm to find a sample similar to the current sample and then use the weighted moving mean method to calculate the mean and the variance of the samples; (2) To weaken the influence of the irregularity of time series, we emphasize the influence of the latest data on the forecast value and weight the samples so that samples nearer to the prediction period weigh more and samples farther from it weigh less; (3) To overcome the overestimation problem of the optimal strategy, we design an optimal strategy algorithm that finds the edge device's optimal strategy by maximizing its accumulated payoff, thereby defending against attackers.
The structure of this paper is as follows: Section 2 reviews related work. In Section 3, we define the problems that we seek to solve in this article. In Section 4, we discuss the antiattack scheme for edge devices. In Section 5, we verify the effectiveness of the antiattack scheme for edge devices. Section 6 contains conclusions and future research.

Related Work
This section introduces the latest development and research of antiattack schemes from two aspects: traditional security protection schemes and emerging security protection schemes in IoT.

Traditional Safety Precaution Technologies.
Homomorphic encryption, differential privacy, and identity authentication are three traditional protection technologies. Homomorphic encryption can process sensitive data without decryption to protect data privacy. Lu et al. [20] encrypted structured data by using the homomorphic Paillier cryptographic system. Tan et al. [21] used finite field theory and proposed a private comparison algorithm for encrypted integers based on fully homomorphic encryption. Differential privacy technology is used to ensure the privacy of any single item in the data set under statistical queries. Wang [22] proposed a data-driven spectrum trading solution that could maximize the income of PUs while preserving the differential privacy of SUs. However, the computing resources on edge devices are quite limited and cannot support the huge computing power consumed by encryption schemes. Identity management technologies set access authority through identity management and access control to prevent intrusion by illegal users. Alizadeh et al. [23] summarized authentication technology in mobile cloud computing. Malik et al. [24] proposed a blockchain-based identity authentication and expeditious revocation framework, which can quickly update the status of revoked vehicles in the shared blockchain. Zhang et al. [25] proposed a smart contract framework comprising several access control contracts, a judge contract, and a register contract, which provides a trusted access control strategy. However, an access control strategy makes it hard to clearly define the roles and rights of different users in IoT.

Emerging Safety Precaution Technologies.
Improved traditional security schemes can be used to enhance the security of edge devices in IoT [26]. With the rise of artificial intelligence, some emerging security technologies, such as trusted execution environments and machine learning, are gradually being used to improve the security of edge devices in IoT. The trusted execution environment [27] ensures the security of the environment in which software runs. Running an application in a trusted execution environment can guarantee the security of data even if edge devices are compromised. Trusted execution environments such as TrustZone, Intel Management Engine, and ARM TrustZone are quite popular. Han et al. [28] built a complete framework that supports visibility into encrypted traffic and can be used in secure and functional networks. However, trusted execution environments usually have vulnerabilities of their own, such as the Qualcomm and Trustonic vulnerabilities. Ghaffarian and Shahriari [29] studied software vulnerability analysis. Scandariato et al. [30] explored machine learning-based text mining to predict security vulnerabilities in software source code. However, most machine learning methods assume that the statistical properties of the data remain unchanged during training, whereas the data in IoT changes dynamically in real time.

Problem Definition
In this article, we model the interactions between edge devices and attackers as a zero-sum game [31]. Namely, the payoff of attackers equals the loss of edge devices. In each round of interaction, the attacker attacks the sample from edge devices to gain an illegal payoff, and the edge device plays its strategy to defend against the attacker. Previous studies usually determine the optimal strategy of edge devices by calculating the Nash equilibrium in a handcrafted abstraction of the domain [32]. Currently, some researchers introduce the recursion technique into the neural network to determine players' optimal strategy by predicting human action in a strategic environment. Following [33], the payoff function of edge devices, $U(a, \mu, C)$, is defined in Equation (1), where $a = \{a_1, a_2, \cdots, a_n\}^T$ is the strategy vector of edge devices, $\mu = \{\mu_1, \mu_2, \cdots, \mu_n\}^T$ is the sample mean payoff, and $C$ is the covariance matrix of the sample payoff. As can be seen from Equation (1), if we know the sample mean payoff and the covariance matrix, we can determine the optimal strategy of edge devices by maximizing the payoff function.
That is,

$$a_{\max} = \arg\max_{a} U(a, \mu, C).$$

However, the sample mean payoff and the covariance matrix are unknown. Nevertheless, we can exploit the similarity between historical samples and the current sample to predict the sample mean payoff and the covariance matrix and then determine the optimal strategy of the edge device.
When we determine the optimal strategy of the edge device, we need to resolve the following three problems: (1) when measuring the similarity between samples, we need to avoid improper measuring methods that might cause the optimal solution to be lost; (2) when calculating the sample mean payoff and the covariance matrix, we need to weaken the influence of the irregularity of time series; (3) when finding the optimal strategy of edge devices, we need to break the correlation between training samples and solve the problem of overestimation.

Antiattack Scheme for Edge Devices
This section addresses the following three problems: how to find the samples that are similar to the current sample, how to calculate the mean value and the covariance matrix, and how to calculate the optimal strategy of edge devices to resist attackers.

Measuring Similarity of Sample.
To find a sample similar to the current sample, we propose the kNN-DTW algorithm, which first determines the category of the current sample and then finds the sample most similar to it. The kNN-DTW algorithm combines the k-nearest neighbor (kNN) algorithm with the dynamic time warping (DTW) method: the kNN algorithm classifies the current sample, and the DTW method finds the similar sample among the samples in the same category as the current sample.
In the kNN algorithm, the choice of $k$ has a significant impact on the classification results. We use the bootstrap method to find the optimal value of $k$. Assume that the value of $k$ and the probability $\rho$ of a time series being correctly classified satisfy the following regression model:

$$\rho_i = h(k_i, \beta) + \varepsilon_i, \quad i = 1, \cdots, n,$$

where $h(\cdot)$ is the mapping from $k$ to $\rho$, $\beta$ is a coefficient vector, and $\{\varepsilon_i\}$ is the error vector with distribution function $F(x)$. We use the least squares method to estimate $\beta$, i.e., $\hat{\beta} = g(\rho_1, \cdots, \rho_n)$, use the empirical distribution function of the regression residuals to estimate $F(x)$, and apply the bootstrap method to estimate the covariance matrix $\mathrm{Var}(\hat{\beta})$ of $\hat{\beta}$. If the estimation error of each coefficient (the square root of the corresponding diagonal element of $\mathrm{Var}(\hat{\beta})$) meets the threshold $\varepsilon_0$, the value of $k$ can be determined by maximizing $\rho$.
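The residual bootstrap described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the mapping $h$ is assumed to be a straight line, the function names `fit_line` and `bootstrap_std_errors` are hypothetical, and the sample accuracies and the threshold value are invented for the example.

```python
import random
import statistics


def fit_line(ks, rhos):
    """Least-squares fit of rho = b0 + b1 * k (assumed form of h)."""
    k_bar = statistics.fmean(ks)
    rho_bar = statistics.fmean(rhos)
    sxx = sum((k - k_bar) ** 2 for k in ks)
    sxy = sum((k - k_bar) * (r - rho_bar) for k, r in zip(ks, rhos))
    b1 = sxy / sxx
    return rho_bar - b1 * k_bar, b1


def bootstrap_std_errors(ks, rhos, rounds=500, seed=0):
    """Residual bootstrap estimate of the standard errors of (b0, b1)."""
    rng = random.Random(seed)
    b0, b1 = fit_line(ks, rhos)
    fitted = [b0 + b1 * k for k in ks]
    residuals = [r - f for r, f in zip(rhos, fitted)]
    draws = []
    for _ in range(rounds):
        # Resample residuals from their empirical distribution and refit.
        resampled = [f + rng.choice(residuals) for f in fitted]
        draws.append(fit_line(ks, resampled))
    # Standard deviation of each coefficient across bootstrap refits.
    return tuple(statistics.stdev(column) for column in zip(*draws))


# Hypothetical classification accuracies measured for a few values of k.
ks = [1, 3, 5, 7, 9]
rhos = [0.81, 0.88, 0.86, 0.84, 0.79]
errors = bootstrap_std_errors(ks, rhos)
eps0 = 0.05  # assumed error threshold
print(errors, all(e <= eps0 for e in errors))
```

If every coefficient error meets the threshold, the fitted model is considered reliable enough to select $k$ by maximizing $\rho$; a richer form of $h$ would only change `fit_line`.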
After the category of the current sample is determined by kNN, we use DTW to measure the distance between the current sample and the historical samples. The DTW method locally scales two samples along the time axis to align their morphology, so it can measure the distance between time series of different lengths. Compared with the Euclidean distance, the DTW method is more elastic and tolerates local time shifts and differences in the length of time series, but its time and space complexity is $O(nm)$, where $n$ and $m$ are the lengths of the two time series. To decrease the space and time cost of the DTW method, we apply the early abandoning method to optimize its computation. The detailed process is as follows:

Step 1. Given two time series $X$ and $Y$, where $n$ is the length of time series $X$ and $m$ is the length of time series $Y$.

Step 2. Define the warping path as $P = p_1, p_2, \cdots, p_K$, where $\max(n, m) \le K \le n + m - 1$, $p_k = (i, j)$ is the $k$th element of the warping path $P$, $i$ denotes the $i$th cell of time series $X$, and $j$ denotes the $j$th cell of time series $Y$. The indices $i$ and $j$ are monotonically increasing along the path; for consecutive elements $p_k = (i, j)$ and $p_{k+1} = (i', j')$,

$$0 \le i' - i \le 1, \quad 0 \le j' - j \le 1.$$

In particular, the warping path must involve every coordinate of time series $X$ and $Y$. That is, the calculation starts from $p_1 = (1, 1)$ and ends at $p_K = (n, m)$.

Step 3. Find the warping path between the two time series with the shortest cumulative distance,

$$D(n, m) = \min_{P} \sum_{k=1}^{K} d(p_k), \quad (6)$$

where $d(p_k)$ is the distance between the two cells matched by $p_k$. To obtain the warping path with the shortest cumulative distance, Eq. (6) can be solved iteratively by using the dynamic programming method.

Step 4. Set the distance threshold $\varepsilon$, $\varepsilon > 0$. If the cumulative distance $D(i, j) > \varepsilon$ in cell $(i, j)$, the calculation of the distance between the two time series along that path is terminated.

Step 5. Determine the distance $D(n, m)$ between the two time series.
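Steps 1 to 5 can be sketched as below. This is a minimal sketch rather than the authors' code: the cell distance is assumed to be the absolute difference, only two rows of the cumulative matrix are kept to save memory, and the helper names `dtw_distance` and `most_similar` are hypothetical. A cell whose cumulative distance exceeds the threshold is pruned, and the whole computation is abandoned once an entire row has been pruned.

```python
import math


def dtw_distance(x, y, threshold=math.inf):
    """Return the cumulative DTW distance D(n, m), or math.inf if abandoned."""
    n, m = len(x), len(y)
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0  # virtual cell before (1, 1)
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])  # assumed cell distance
            d = cost + min(prev[j], curr[j - 1], prev[j - 1])
            # Step 4: a path through a cell above the threshold is dropped.
            curr[j] = d if d <= threshold else math.inf
        if all(v == math.inf for v in curr):
            return math.inf  # every path already exceeds the threshold
        prev = curr
    return prev[m]


def most_similar(current, history):
    """Index and distance of the historical sample closest to `current`."""
    best_index, best_distance = -1, math.inf
    for index, sample in enumerate(history):
        # The best distance so far serves as the abandoning threshold.
        distance = dtw_distance(current, sample, threshold=best_distance)
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index, best_distance


history = [[5, 5, 5], [1, 2, 2, 3], [2, 3, 4]]
print(most_similar([1, 2, 3], history))
```

Because cell distances are non-negative, the cumulative distance never decreases along a path, so pruning cannot discard a path whose final distance is within the threshold.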

Calculating the Mean Value and the Covariance Matrix.
To weaken the influence of the irregularity of time series, we emphasize the influence of the latest data on the forecast value and weight the samples so that samples nearer to the prediction period weigh more and samples farther from it weigh less. Namely, the sample elements that are close to the prediction period are given a relatively large weight. We use the weighted moving average method to calculate the sample mean. That is,

$$\mu = \frac{\sum_{i=1}^{t} w_i y_i}{\sum_{i=1}^{t} w_i},$$

where $w_i$ is the weight of sample data $y_i$ and the weight decreases as the distance from the prediction period increases, i.e., $w_t > w_{t-1} > \cdots > w_1$. Accordingly, the covariance matrix $C$ is calculated from the same weighted samples.

Preventing Malicious Attacks.
After finding the similar sample, we first take the mean payoff and covariance matrix of the similar sample as the mean payoff and covariance matrix of the current sample, respectively. Then, to weaken the influence of the irregularity of time series, we emphasize the influence of the latest data on the forecast value and weight the samples so that nearer samples weigh more and farther samples weigh less. Finally, we find the optimal strategy of edge devices by maximizing the payoff function. The detailed process is shown in Algorithm 1. However, this method is prone to overestimation. To solve this problem, we design Algorithm 2 to find the optimal strategy of the edge devices by maximizing their accumulated payoff.
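The weighted estimates of the mean payoff and the covariance matrix used above can be illustrated with a short sketch. Two choices here are assumptions made for the example and are not stated in the paper: the weights grow linearly with recency, and the covariance is the weight-normalized outer product of deviations from the weighted mean. The function names are hypothetical.

```python
def linear_weights(t):
    """Assumed weights 1, 2, ..., t so that the latest sample weighs most."""
    return [float(i) for i in range(1, t + 1)]


def weighted_mean(samples, weights):
    """Weighted moving average of vector samples, oldest sample first."""
    total = sum(weights)
    dim = len(samples[0])
    return [sum(w * y[d] for w, y in zip(weights, samples)) / total
            for d in range(dim)]


def weighted_covariance(samples, weights):
    """Weighted covariance matrix about the weighted mean (assumed form)."""
    mu = weighted_mean(samples, weights)
    total = sum(weights)
    dim = len(mu)
    return [[sum(w * (y[i] - mu[i]) * (y[j] - mu[j])
                 for w, y in zip(weights, samples)) / total
             for j in range(dim)]
            for i in range(dim)]


samples = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]  # oldest to newest
weights = linear_weights(len(samples))
print(weighted_mean(samples, weights))
print(weighted_covariance(samples, weights))
```

With these weights the mean is pulled toward the newest sample, which is the effect the weighting rule is meant to achieve.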
Reinforcement learning aims to maximize the long-term reward in order to find the maximum payoff of the agent. Thus, the players of the game are transformed into separate agents. We use a Deep Q Network to find the agent's optimal strategy. In this algorithm, the state set of the agent is defined as $S = \{s_1, s_2\}$, where $s_1$ means that the current data is normal (not attacked) and $s_2$ means that the current data is abnormal (already attacked); these states can be described by a Markov decision process. The action set is defined as $A = \{a_1, a_2\}$, where $a_1$ means that the agent accepts the current data set and $a_2$ means that the agent rejects the current data set, and the action reward is denoted by $R$. The agent interacts with its fellow agents and stores its experience of strategy transitions $(s_j, a_j, r_j, s_{j+1})$ in replay memory $R$. To break the correlation between training samples, we randomly select samples from the replay memory during training to train the model that finds the optimal strategy of the agent. The detailed process is shown in Algorithm 2, where $Q$ is the payoff when the agent adopts the optimal strategy.
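The interplay of replay memory, random sampling, and overestimation control can be sketched on the two-state, two-action setting above. This is an illustration under several assumptions and not Algorithm 2 itself: a tabular double Q-learning update stands in for the deep network, the reward (+1 for a correct decision, -1 otherwise) is hypothetical, and attacks are assumed to occur independently with a fixed probability `attack_rate`.

```python
import random
from collections import deque

NORMAL, ABNORMAL = 0, 1   # states s1, s2
ACCEPT, REJECT = 0, 1     # actions a1, a2


def reward(state, action):
    """Hypothetical reward: +1 for a correct decision, -1 otherwise."""
    correct = (state == NORMAL) == (action == ACCEPT)
    return 1.0 if correct else -1.0


def greedy(q_row):
    """Index of the action with the largest value in one row of a Q table."""
    return max(range(len(q_row)), key=q_row.__getitem__)


def train(steps=2000, gamma=0.9, lr=0.1, eps=0.2, batch=16,
          attack_rate=0.3, seed=0):
    rng = random.Random(seed)
    q_a = [[0.0, 0.0], [0.0, 0.0]]
    q_b = [[0.0, 0.0], [0.0, 0.0]]
    memory = deque(maxlen=500)  # replay memory of (s, a, r, s') transitions
    state = NORMAL
    for _ in range(steps):
        # Epsilon-greedy action on the sum of both tables.
        if rng.random() < eps:
            action = rng.randrange(2)
        else:
            action = greedy([a + b for a, b in zip(q_a[state], q_b[state])])
        next_state = ABNORMAL if rng.random() < attack_rate else NORMAL
        memory.append((state, action, reward(state, action), next_state))
        state = next_state
        if len(memory) < batch:
            continue
        # Random minibatch breaks the correlation between consecutive samples.
        for s, a, r, s_next in rng.sample(list(memory), batch):
            # One table selects the next action, the other evaluates it,
            # which limits the overestimation of a single max operator.
            select, evaluate = (q_a, q_b) if rng.random() < 0.5 else (q_b, q_a)
            best = greedy(select[s_next])
            target = r + gamma * evaluate[s_next][best]
            select[s][a] += lr * (target - select[s][a])
    return [[(a + b) / 2 for a, b in zip(ra, rb)] for ra, rb in zip(q_a, q_b)]


q = train()
print(greedy(q[NORMAL]), greedy(q[ABNORMAL]))
```

Under the assumed reward, the learned greedy policy accepts normal data and rejects attacked data; replacing the tables with a neural network would not change the structure of the loop.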

Simulation Results
We use the Anaconda integrated development tool to validate the proposal. First, we analyze the feasibility of weakening the influence of the irregularity of time series to show that it is reasonable to weight sample data so that nearer samples weigh more and farther samples weigh less. Second, we compare the DTW method with seven classical distance methods, such as correlation distance, Jaccard distance, and cosine distance, to verify the reasonableness of kNN-DTW. Finally, we apply the optimal strategy to the rock-paper-scissors game [34] and verify its practicability by comparing it with the winner strategy (choose the winner's strategy) and the opponent strategy (choose the opponent's strategy).

Feasibility Analysis of Weakening the Influence of Time Series Irregularity.
Tables 1-3 analyze the influence of the weighting on each parameter of the target payoff function, where the weights follow the rule that nearer samples weigh more and farther samples weigh less. To better describe the complete process, we let each sample dataset contain only 20 data points and assign a weight to each sample, as shown in Table 1. Table 2 shows the prediction error between the weighted mean and the mean under the same strategy profile $(1, 1, 1, 0)$, where 1 means that the edge device accepts the data and 0 means that the edge device rejects the data.

Algorithm 1: Optimal strategy.
Input: Similarity sample set $W$
Output: Optimal strategy $a_{\max}$
for n = 2 : N do
    Calculate similarity with DTW
    if similarity < 1 then
        continue
    else
        $U_P = \max_a U(a, \mu, C)$
    end if
end for
return $a_{\max}$
Table 3 shows the effects of weakening the influence of time series irregularity on the parameters of the objective function (e.g., $C$). According to the table, weighting the data so that nearer samples weigh more and farther samples weigh less has a greater influence on $C$ and a weaker influence on $\log(a^T \mu)$. Therefore, the proposed method can weaken the irregularity of the time series.

Verification of the Reasonableness of kNN-DTW.
To verify the reasonableness of combining the DTW method with the kNN algorithm, Figure 2 compares the DTW method with seven classical distance methods, such as correlation distance, Jaccard distance, and cosine distance. From Figure 2, we can see that the cosine distance and the Chebyshev distance perform worst. For example, when the ratio of identical elements ranges from 20% to 33%, the results of the cosine distance are all 0.79, while when the ratio of identical elements ranges from 6% to 67%, the results of the cosine distance are all 0.66. Therefore, the cosine distance and the Chebyshev distance are not suitable for measuring the distance between the samples in this paper. Although other methods also produce repeated results, they do so less often than the cosine distance and the Chebyshev distance. For example, the results of the DTW method are the same if and only if the ratio of identical elements is 80% or 87%.
By comparing Figures 2(a) and 2(b), it can be seen that the DTW method works best, followed by the Jaccard distance. For example, when the ratio of identical elements is 80%, 87%, and 93%, the results of the Jaccard distance are 0.2, 0.13, and 0.07; the results of the Euclidean distance are 0.52, 0.41, and 0.01; the results of the Manhattan distance are 0.74, 0.42, and 0.01; and the results of the DTW method are 0.27, 0.17, and 0.0004, respectively. From these results, we conclude that the Euclidean distance and the Manhattan distance behave similarly when measuring the distance between samples, and the results of both vary greatly when the ratio of identical data in the two samples changes from 87% to 93%. The Jaccard distance and the DTW method also behave similarly to each other, but their results vary only slightly over the same range. The DTW method is more suitable for measuring the distance between samples because it can also measure the distance between samples of different lengths. Therefore, we combine the DTW method with the kNN algorithm to measure the distance between samples.
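To make the comparison concrete, the sketch below computes several of the classical distances on a pair of equal-length samples. It rests on assumptions: the reported Jaccard values (0.2, 0.13, 0.07 at 80%, 87%, and 93%) are consistent with one minus the share of identical positions in samples of 15 elements, so the Jaccard distance is modeled here as a position-wise mismatch fraction, and the sample values are invented for the example.

```python
import math


def mismatch_fraction(x, y):
    """Share of positions where two equal-length samples differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))


def cosine_distance(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


base = list(range(1, 16))                 # 15 elements
other = base[:12] + [20, 21, 22]          # 12 of 15 positions identical (80%)
for measure in (mismatch_fraction, euclidean, manhattan, chebyshev,
                cosine_distance):
    print(measure.__name__, round(measure(base, other), 4))
```

The Chebyshev distance depends only on the single largest difference, which is one reason it can return the same value for samples that share very different numbers of elements.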

Application of Antiattack Scheme.
First, we need to define the payoff matrix of the rock-paper-scissors game, as shown in Table 4. The rock-paper-scissors game is a typical example of a zero-sum game. In the game, the two players have the same strategy set, which is (rock, paper, scissors). If the two players play the same strategy, both of them get 0 for a draw; otherwise, the winner gets 2 and the loser gets -2.
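The payoff rule just described can be written down directly. The sketch below encodes it and checks the zero-sum property for every strategy pair; the names `BEATS` and `payoff` are illustrative and do not come from the paper.

```python
MOVES = ("rock", "paper", "scissors")
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}


def payoff(move_1, move_2):
    """Payoffs of player 1 and player 2 for one round."""
    if move_1 == move_2:
        return (0, 0)  # draw
    return (2, -2) if BEATS[move_1] == move_2 else (-2, 2)


# Zero-sum check: the two payoffs cancel for every pair of moves.
for m1 in MOVES:
    for m2 in MOVES:
        assert sum(payoff(m1, m2)) == 0

print(payoff("scissors", "rock"))
```

For the pair (scissors, rock), player 1 loses the round and receives -2.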
Figure 3 shows the payoff trends when player 1 and player 2 play the optimal strategy, the winner strategy, and the opponent strategy from the initial states $S_0 = \{\text{scissors}, \text{rock}\}$ and $S_1 = \{\text{rock}, \text{scissors}\}$. In the figure, winner vs. optimal means that player 1 plays the winner strategy and player 2 plays the optimal strategy in the game. The meanings of opponent vs. optimal and winner vs. opponent follow similarly. Figures 3(a)-3(d) show the payoffs of player 1 and player 2 in states $S_0$ and $S_1$, respectively. From Figure 3(a), we can see that because player 1 switches from the scissors strategy to the rock strategy at the start of the second round, the payoff of player 1 is -2 in the first round and 2 in the second round. It is worth noting that the payoff trends of player 1 and player 2 are the same in winner vs. optimal and opponent vs. optimal because the winner strategy and the opponent strategy make the same adjustment in the initial state $S_0$. According to Figures 3(a)-3(d), we can conclude that the optimal strategy performs best in state $S_0$, while in state $S_1$ it is superior to the winner strategy and inferior to the opponent strategy. As can be seen from Table 5, the overall payoff of the players is the same under the opponent strategy and the optimal strategy. To sum up, the optimal strategy scheme can help to determine the player's strategy and maximize the player's payoff.

Conclusion
In IoT, defending against attacks by determining the optimal strategy of the edge device is the key to ensuring data security and improving the effectiveness of IoT. In this article, we propose an antiattack scheme for edge devices based on deep reinforcement learning to solve this issue. The core of this scheme is the optimal strategy algorithm. Detailed simulation experiments verified the effectiveness of the new scheme. In future studies, we will focus on creating a new methodology to determine the similarity between data samples and on using machine learning approaches to solve more data security problems.

Figure 1: The structure diagram of the proposed scheme.

Table 3: The influence of time series irregularity.