DQfD-AIPT: An Intelligent Penetration Testing Framework Incorporating Expert Demonstration Data

The application of reinforcement learning (RL) methods of artificial intelligence to penetration testing (PT) offers a solution to the high labour costs and heavy reliance on expert knowledge of manual PT. To improve the efficiency of RL algorithms for PT, existing research has considered introducing the knowledge of PT experts and combining it with imitation learning methods to guide the agent's decision-making. However, the disadvantage of imitation learning is also obvious: the performance of the strategies learned by the agent rarely exceeds the demonstrated behaviour of the expert, and it can also cause overfitting to expert knowledge. At the same time, the expert knowledge in currently proposed methods is poorly interpretable and highly scenario-dependent; it is not universal. To address these issues, we propose an intelligent PT framework named DQfD-AIPT. The framework encompasses the process of collecting and using expert knowledge and provides a rational definition of the structure of expert knowledge. To solve the overfitting problem, we perform PT path planning based on the deep Q-learning from demonstrations (DQfD) algorithm. DQfD combines the benefits of RL and imitation learning to effectively improve the PT strategy and performance of the agent while avoiding overfitting. Finally, we conducted experiments in a simulated network scenario containing honeypots. The experimental results prove the effectiveness of incorporating expert knowledge. In addition, the DQfD algorithm improves the efficiency of penetration testing more effectively than classical deep reinforcement learning (DRL) methods and obtains a higher cumulative reward. Moreover, owing to the incorporation of expert knowledge, in scenarios with honeypots the DQfD method can effectively reduce the probability of interacting with honeypots compared to classical DRL methods.


Introduction
With the development of the Internet, the network environment is increasingly complex, and the cyber security threats that we face are growing day by day. Globally, protecting modern systems and infrastructure is becoming a challenge in the field of computer security. The traditional approach to system security assessment takes a defender's perspective, hardening and enhancing system security against attackers [1]. Penetration testing is a proactive method that attacks and tests an authorised target network from the attacker's point of view. We can conduct vulnerability detection and security assessment through potential threat paths [2]. However, with the increase in network size and system complexity, the number of hosts in the network, and the complexity of configuration information, the efficiency of performing penetration testing is affected by the AoI (age of information) [3, 4]. Performing PT manually involves many repetitive actions and procedures [5]. As a result, automated and intelligent penetration testing was born out of this need.
Early research included automated penetration testing tools and related theoretical studies. Automated penetration testing tools integrate modules for scanning, penetration attacks, and payload selection, such as Metasploit [6]. However, human intervention is still required for critical target identification and payload selection in this tool. In terms of theoretical research, attack trees, attack graphs, and planning domain definition languages (PDDLs) are representative. Methods such as attack graphs plan attack paths through formal representation of the target network configuration information and state transfer analysis [6, 7]. However, all these methods require in-depth knowledge of the target network's information in advance and cannot model the uncertainty in the penetration testing process. Recent advances in artificial intelligence have provided a new approach to research on automated and intelligent penetration testing. In particular, RL methods are proving to be a general and effective approach. RL can be used to solve the problem of optimal performance of an agent in a given environment. A model-free RL approach, for example, allows an agent to learn a strategy by interacting with the environment, with little reliance on prior knowledge of the environment. The method is analogous to a human player interacting with a game in order to complete the game objectives and learn relevant solution strategies [8].
The Markov decision process (MDP) is a paradigm of RL. Schwartz and Hanna [9] and Zhou et al. [10] trained agents for path planning by formalising PT as an MDP and using DRL algorithms in constructed penetration test simulation environments. The problem is that the training process takes a long time to converge and that the simulation environments have limited fidelity. Zennaro et al. [11] combine RL and imitation learning approaches and classify PT problems according to scenarios. They provide a priori knowledge to the agent for a specific network scenario structure, which better guides the agent to explore its problem space and thus obtain a better solution. On the downside, their expert knowledge is limited by the constructed penetration test scenarios and is less interpretable and generalisable. Chen [12] proposed an intelligent PT framework named GAIL-PT. GAIL-PT collects expert knowledge via Metasploit and uses GAIL (generative adversarial imitation learning) for path planning. However, the complexity of the A3C-GAIL and DPPO-GAIL algorithms used in the experiments is still high. The above research takes full account of the characteristics of the penetration testing problem and uses imitation learning methods for intelligent penetration testing while incorporating expert knowledge. The use of expert knowledge with decision aids as inputs can go some way to improving the PT strategy of the agent, allowing it to be trained in a direction close to the behaviour of the expert [13]. However, the aim of training models by imitation learning is to fit the trajectory distribution of model-generated strategies to the trajectory distribution of the input. Therefore, using imitation learning in combination with RL makes it difficult to improve the policy in those parts of the environment that have not been explored. In conclusion, the following challenges need to be stressed:

(i) Challenge 1: the penetration testing expert knowledge provided to the agent is poorly interpretable, usually dependent on specific network scenarios, and not universally applicable.

(ii) Challenge 2: the use of imitation learning tends to produce overfitting to expert knowledge, making it difficult to balance exploration and exploitation of the environment. The higher complexity of the algorithms used for intelligent penetration testing leads to slower convergence and lower efficiency.
To address these challenges, we first propose an intelligent PT framework called DQfD-AIPT that incorporates expert knowledge. DQfD-AIPT mainly comprises the collection and exploitation of expert knowledge and the training of agents. At the same time, we give a rational definition of expert knowledge. Second, we use the DQfD algorithm based on the reinforcement learning with expert demonstrations (RLED) framework, combining both supervised and unsupervised methods to construct the loss function. The algorithm also uses prioritized experience replay (PER) for experience sampling to balance expert data and interaction data and so prevent overfitting. Finally, experiments were conducted in a simulated network scenario containing honeypots to verify the effectiveness of incorporating expert knowledge and to test the performance of the DQfD algorithm. The experiments show that the DQfD algorithm has better penetration testing performance than classical DRL methods. As the experimental platform, we selected the CyberBattleSim (CBS) platform developed by Microsoft, which features high-fidelity simulation and support for RL algorithms and is currently a well-accepted intelligent PT simulation platform.
The main contributions of this paper are as follows:

(i) To address the problems of poor interpretability of expert knowledge and dependence on specific scenarios, we propose an intelligent PT framework named DQfD-AIPT that incorporates expert knowledge. The expert knowledge base is constructed through two methods: the transformation of abstract expert knowledge and the collection of PT traces in different network scenarios. At the same time, we also define the form and structure of expert knowledge.

(ii) To address the problem of overfitting to expert behaviour caused by imitation learning, we use the DQfD algorithm, which incorporates expert demonstration data, together with a PER mechanism for sampling expert data and interaction data. With the guidance of the expert demonstration data and interaction data, the efficiency and overall performance of the agent's training process are effectively improved.
The organization of this paper is as follows: In Section 2, we provide an overview of research advances in PT, intelligent PT, and the use of expert knowledge. In Section 3, we explain and outline the RL concepts covered in this paper. In Section 4, we describe our method and its specific implementation details. In Section 5, we describe the procedure and hyperparameter settings of the experiments, and the results are analysed and evaluated.

Finally, in Section 6, we summarise our work and further articulate future research directions.

Related Work
In this section, we mainly introduce the basic concepts of PT, the research progress of intelligent PT, and the importance of expert knowledge in the PT process. We conclude with a summary of the problems in current studies.

PT and Its Automation.
PT is a method of simulating real attacks with the aim of assessing the security of computer systems and networks. PT assesses information security from the attacker's perspective. Through PT of companies, organisations, or departments, we understand their information security policies and network vulnerabilities and give possible solutions and remedies to improve network security [14]. As network equipment and defence detection systems continue to be upgraded, the complexity of the PT process has increased dramatically. The entire testing process involves skilled cyber security experts generating attack plans to discover and exploit vulnerabilities in networks and applications. A team of highly experienced testers, who must control all tasks manually, is therefore essential. In addition, there are a large number of repetitive actions and deterministic steps in the PT process, which leads to the high time cost of conducting PT manually. In summary, PT currently appears to be a costly means of assessing the vulnerability of network systems [15, 16].
To address the high cost of and reliance on manual PT, methods and tools to implement automated PT have been proposed. In terms of theoretical research, early studies such as attack trees, attack graphs, and PDDL are representative. These methods plan attack paths through formal representations of target network configuration information and state transfer analysis [3, 4]. On the one hand, these methods require full knowledge of the network topology and the configuration of each machine, which is unrealistic from the attacker's point of view. On the other hand, they focus on a regular representation of known information and then find the attack path by means of planning. The uncertainty of a real PT process is not well modeled; that is, uncertain system knowledge must be obtained using remote tools before a planned attack can be executed.
In terms of the development of automated PT tools, mature tools include the APT2 PT suite, the AutoSploit PT tool, and Awesome-Hacking tools [2]. These PT tools have significantly improved the efficiency of PT and simplified the process of conducting PT manually. However, they still cannot intelligently select the attack payload and can target only a single host. The correct choice of attack methods and means must still be based on the decisions of PT experts. The use of intelligent planning techniques to improve the automation of attack path discovery remains the key to achieving automated PT [17].

Intelligent PT Using RL Methods.
With the development of algorithms in the field of artificial intelligence, there have been new advances in the study of automated PT. Artificial intelligence-driven PT methods are able to intelligently select attack targets and attack payloads based on the current state of the target network. The RL approach learns how to map the current state to an action. The agent learns the PT strategy by interacting with the environment composed of the target network and based on the feedback from the interaction. The process of building a simulated PT environment to train an agent is similar to how a player interacts with a game to discover its solution [9].
A precondition for applying RL to intelligent PT is the need to formalise the PT process into the RL paradigm. Sarraute et al. [18], Schwartz et al. [19], Hu et al. [20], and Zennaro et al. [11] formalised the penetration testing process as a partially observable Markov decision process (POMDP). They incorporated the attacker's observations of the environment into the attack process. However, as the size of the network scenario expands, the computational complexity increases, and the approach is still not applicable to large-scale networks. Durkota and Lisý [21] proposed modelling the penetration test as an MDP, in which the action space consists of specific vulnerabilities and the state space consists of the results of attack actions. The goal of the whole model is to minimize the expected loss value. Hoffmann [22], building on Durkota, ignores the structure of the target network and instead expresses the uncertainty of PT in the form of possible action outcomes. This is essentially a model-free approach [9] that requires minimal prior knowledge of the environment. A POMDP is more realistic in most cases; however, considering the computational complexity and the efficiency of reinforcement learning algorithms, the MDP model is still the better scheme for balancing computational efficiency and modeling rationality.
In recent years, a variety of RL algorithms have been used extensively for intelligent PT. Schwartz and Hanna [9] constructed the Network Attack Simulator (NASim), used known network configurations as states and available scans and exploits as actions, and applied table-based Q-learning and neural network-based DQN methods to achieve intelligent discovery of attack paths. Zhou et al. [10] combined various improvements with the DQN algorithm and proposed the NDSPI-DQN algorithm to optimise PT path discovery. The algorithm effectively reduces the action space of the agent and is experimentally validated on NASim. Zhang [2] introduced a multidomain action selection module on top of intelligent PT. This module can effectively identify the actions that are usable in a specific state, reducing unnecessary exploration by the agent. Finally, this method, combined with the deep deterministic policy gradient (DDPG) algorithm, is verified in a simulated environment.

Use of Expert Knowledge.
Expert guidance often plays a key role in solving real-world problems. As PT is highly dependent on expert knowledge, reference to experienced experts often helps to exploit the vulnerabilities of the target system and thus achieve the goal at a lower cost. In terms of research status, current work mainly focuses on solving the problems of state space explosion, action space explosion, and sparse rewards that arise when performing penetration testing with reinforcement learning algorithms. Most studies focus on the algorithm itself, often ignoring the characteristics of expert decision-making in the penetration test process and the analysis of specific network scene structures.
Zennaro et al. [11] simplified penetration testing problems with different structures in the form of capture-the-flag (CTF) challenges and demonstrated how the performance of an agent can be improved by providing different forms of prior knowledge to the agent. The experiments show that by incorporating prior knowledge, the agent can better explore its problem space and thus effectively obtain solutions. However, the CTF scenarios constructed for the experiments were only simplified versions, and no experiments were conducted on relatively complex scenarios. Chen [12] first proposed a generic intelligent PT framework based on GAIL. GAIL-PT addresses the problem of high labour costs due to the intervention of security experts and high-dimensional discrete action spaces. The study used a variety of algorithms for experiments, but the results showed that the complexity of the A3C-GAIL and DPPO-GAIL algorithms, which combine GAIL, is still relatively high.
The main idea of imitation learning is to match the behavioural strategies of an agent with the behaviour of an expert by means of training. Imitation learning can be divided into behavioural cloning [23], inverse reinforcement learning [24], and generative adversarial imitation learning [25]. However, imitation learning tends to focus on imitating the behaviour and trajectories of experts, which makes it difficult to enhance the performance of the agent in unexplored parts of the environment when combined with RL methods. In essence, the method explores and exploits the environment without enhancing the strategy beyond the demonstrations. Recently, demonstration data have been shown to help solve difficult exploration problems in RL. Subsequently, a framework known as reinforcement learning with expert demonstrations (RLED) was proposed. This framework is suitable for scenarios where rewards are provided by the environment. Hester et al. of the Google DeepMind team [26] proposed the DQfD algorithm based on the RLED framework. The method is pretrained on demonstration data, combines supervised and unsupervised learning objectives to construct a loss function, and uses PER in order to balance the amount of demonstration data in the training data. By training deep neural networks in this way, DQfD outperforms imitation learning, which only imitates expert trajectories, as well as the classical deep Q-network in terms of average overall performance.

Brief Summary.
We summarised the current progress in intelligent penetration testing research in the previous sections. Traditional penetration testing relies heavily on expert knowledge and has high labour costs. In the transition from manual to intelligent PT, the cost of manual time is effectively reduced. This goes some way towards solving one of the major dilemmas of current manual PT. However, as the complexity of the target system increases, the performance of penetration testing using classical RL methods also encounters bottlenecks. The problem is that real-world PT relies on expert experience and the proficiency of the penetration tester. Knowledge reasoning between successive states during real PT is missing from the training of the agent, yet in the course of a real penetration test it is quite important. Integrating expert knowledge into intelligent PT can effectively solve these problems.
In exploring the incorporation of expert knowledge in PT, previous studies have mainly used imitation learning methods. These studies process collected expert knowledge and combine it with imitation learning methods for PT path planning. On the one hand, the expert knowledge collected is usually highly specific to the target network scenario and not universally applicable, and its interpretability is relatively poor. On the other hand, combining imitation learning and RL is effective in guiding the agent to fit the expert knowledge trajectory to some extent. However, imitation learning is more concerned with imitating the behaviour of the expert than with learning the strategies of the testing expert. This leads to overfitting to expert knowledge. As a result, agents trained using imitation learning often struggle to outperform the experts. In conclusion, current research on intelligent penetration testing incorporating expert knowledge can be further enhanced with regard to the interpretability of expert knowledge and the RL methods used.

Classical RL Method and Its Improvements
3.1.1. Q-Learning. Q-learning is a classical value-based RL algorithm that learns an action-value function Q(s, a), which is the expected gain from taking an action a (a ∈ A) in a state s (s ∈ S) at a given time. The main idea of the algorithm is to construct a Q table of states and actions to store Q values and then select the action that yields the greatest benefit based on the Q value. Q-learning uses temporal difference (TD) learning to update Q values, with the update formula shown in the following equation:

Q(s_t, a_t) ← Q(s_t, a_t) + α[r_t + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)],

where α is the learning rate, γ is the discount factor, a_t and s_t are the action and state at time t, respectively, s_{t+1} is the next state after performing action a_t, a_{t+1} is a possible action in state s_{t+1}, and r_t is the immediate reward obtained.
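The update rule above can be sketched in a few lines of Python (a minimal tabular sketch; the dictionary-based Q table and the state/action names are illustrative, not part of the framework described here):

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One temporal-difference update of a tabular Q function.

    Q is a dict mapping (state, action) pairs to values; unseen
    pairs default to 0. Returns the updated Q(s, a)."""
    # max over possible next actions: the bootstrapped target value
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_target = r + gamma * best_next
    td_error = td_target - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return Q[(s, a)]
```

For example, starting from an empty table, a transition with reward 1.0 moves Q(s, a) from 0 towards the TD target by a factor of α.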

3.1.2. Deep Q-Learning. Q-learning takes a tabular approach to storing Q values. Therefore, when facing an RL task with a high-dimensional state space and action space, the limited space of the table cannot store all states and actions, which limits the performance of the algorithm [27]. Algorithms that combine the advantages of deep learning give a better solution to this problem. Mnih et al. [28] proposed the deep Q-network (DQN), an extension of Q-learning. The algorithm replaces the Q table in Q-learning with a neural network. This transforms the original problem of converging the action-value function into a function-fitting problem for neural networks.
During the exploration and exploitation of the environment, the transition data are stored in a replay buffer. DQN trains the neural network to approximate the Q function using data randomly sampled from the replay buffer. This breaks the correlation between the training data and makes the training process stable. The DQN algorithm uses the squared error between the target Q value and the estimated Q value as the loss function when updating the parameters of the neural network, where the target Q value y_i at iteration i is calculated as follows:

y_i = r + γ max_{a′} Q(s′, a′; θ′_i),

where θ′_i are the parameters of the target Q network. The goal of strategy learning is to update the parameters of the policy Q network by the mean square error between the target Q value and the current Q value, where the loss function of the algorithm at iteration i is calculated as follows:

L_i(θ_i) = E_{(s, a) ∼ ρ(·)}[(y_i − Q(s, a; θ_i))²],

where ρ(s, a) is the probability distribution of s and a, and θ_i are the parameters of the policy network. The parameters of the policy network are copied to the target network at fixed intervals of steps. When optimizing the loss function, the parameters of the target network θ′_i are not updated.
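As a rough illustration of the target and loss computations above, the sketch below treats the two networks as plain Python functions mapping a state to a list of Q values (illustrative stand-ins for the actual neural networks, which the text does not specify at this level of detail):

```python
def dqn_targets(batch, q_target, gamma=0.99):
    """Compute y_i = r + gamma * max_a' Q_target(s', a') for a batch.

    `batch` is a list of (s, a, r, s_next, done) transitions and
    `q_target` plays the role of the frozen target network."""
    ys = []
    for s, a, r, s_next, done in batch:
        y = r if done else r + gamma * max(q_target(s_next))
        ys.append(y)
    return ys

def dqn_loss(batch, q_policy, ys):
    """Mean squared error between targets and the policy net's Q(s, a)."""
    errs = [(y - q_policy(s)[a]) ** 2
            for (s, a, _r, _sn, _d), y in zip(batch, ys)]
    return sum(errs) / len(errs)
```

In a real implementation the gradient of this loss updates θ_i only; θ′_i stays fixed between synchronisations, as the text describes.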

Prioritized Experience Replay.
Prioritized experience replay is a technique for prioritizing experiences whereby important state transitions are replayed more frequently. This method is effective because some transition data contain more information that is worth learning. Giving these transitions more opportunities to be replayed helps accelerate the overall learning process. The core idea of PER is to measure the importance of different transition data through the TD error δ: the larger the error of a sample, the larger its value. The sampling probability of state transition i is calculated as follows:

P(i) = p_i^z / Σ_k p_k^z,

where p_i refers to the priority of state transition i, denoted as p_i = |δ_i| + ε, and ε is a small positive number used for numerical stability so that p_i > 0. z is an exponential hyperparameter representing the degree of impact of the TD error on sampling.
It is worth noting that the purpose of using experience replay is to eliminate sample correlation, but prioritized sampling certainly forgoes random sampling. Therefore, it is also necessary to reduce the training weights of the high-priority state transition data. The PER method uses importance-sampling weights to correct for the bias of state transition i. The weights are calculated as follows:

w_i = (1/N · 1/P(i))^β,

where N refers to the capacity of the replay buffer and β is an annealing hyperparameter of the training process. To implement the above method efficiently, we store the priorities in an efficiently queryable segment tree data structure and sample over ranges of segments during the training process. We call this efficient query data structure a sum tree.
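The two formulas above can be sketched directly (a minimal illustration; the hyperparameter names z and β follow the text, while normalising the weights by their maximum is a common implementation convention rather than something the text specifies):

```python
def per_probabilities(td_errors, z=0.6, eps=1e-5):
    """P(i) = p_i^z / sum_k p_k^z, with priority p_i = |delta_i| + eps."""
    ps = [(abs(d) + eps) ** z for d in td_errors]
    total = sum(ps)
    return [p / total for p in ps]

def importance_weights(probs, beta=0.4):
    """w_i = (1/N * 1/P(i))^beta, normalised by max(w) for stability."""
    n = len(probs)
    ws = [(1.0 / (n * p)) ** beta for p in probs]
    w_max = max(ws)
    return [w / w_max for w in ws]
```

With equal TD errors every transition gets the same probability and unit weight; a larger TD error raises P(i) and correspondingly lowers w_i.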

RL Formalisation for PT.
An RL agent must be able to perceive the state of the environment, have one or more goals related to that state, and then take actions that influence the state of the environment. In order to implement intelligent PT in conjunction with an RL method, we begin by modeling the penetration testing process as an RL paradigm. The MDP is a theoretical framework for achieving goals through interactive learning and is the classic formal expression of sequential decision-making. The actions of an agent in an MDP affect not only the current immediate reward but also the subsequent state and future benefit. Thus, the MDP is a mathematically idealised form of the reinforcement learning problem. The interaction process between the agent and the environment in the MDP is shown in Figure 1.
The machines that learn and implement decisions in the MDP are called agents. Everything outside the agent that interacts with it is referred to as the environment. Taking the penetration testing process as an example, the target network is considered a state-variable environment, and the feedback from the environment to the actions of the agent is considered a reward. The whole penetration testing process can then be represented in the form of a 4-tuple ⟨S, A, R, T⟩, where S represents the state space, A the action space, R the reward function, and T the transition function. Detailed definitions for specific penetration testing problems are given in Section 4.3.
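As an illustration of the ⟨S, A, R, T⟩ formalisation, the toy environment below models a single-host PT episode (the states, actions, and reward values are invented for illustration and are far simpler than the CyberBattleSim scenarios used in the paper):

```python
class ToyPTEnv:
    """A minimal <S, A, R, T> sketch of a PT environment: the agent
    must scan and then exploit a single host (names illustrative)."""

    def __init__(self):
        self.state = "undiscovered"

    def step(self, action):
        # T: deterministic transitions; R: reward only for useful actions.
        if self.state == "undiscovered" and action == "scan":
            self.state = "discovered"
            return self.state, 1.0, False
        if self.state == "discovered" and action == "exploit":
            self.state = "owned"
            return self.state, 10.0, True   # episode goal reached
        return self.state, -1.0, False      # wasted action, small penalty
```

An agent interacting with `step` receives exactly the (state, reward) feedback loop of Figure 1.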

Methods
In this section, we first present an intelligent PT framework incorporating expert knowledge and explain the details and processes involved. We then detail the collection and use of expert knowledge and present the RL algorithm, incorporating demonstration data, that we use.

DQfD-AIPT Framework.
Manual PT relies on the experience and knowledge of experts. Expert knowledge and the way decisions are made are also of great importance for intelligent PT. By analysing the PT process and summarising the characteristics of expert knowledge and RL, we propose DQfD-AIPT for intelligent PT incorporating expert knowledge, as shown in Figure 2. DQfD-AIPT consists of three main phases: expert knowledge collection, input and use of demonstration data, and interaction and training of the agent. Our proposed framework is suitable for intelligent PT in simulated network scenarios and is characterised by simplicity of use and generality. In the following, we explain the specifics of each phase of the framework.

Stage 1: Collection of Expert Knowledge.
The first step in combining expert knowledge with intelligent PT is the collection and acquisition of expert knowledge. The acquisition of expert knowledge is an abstract and difficult matter. The difficulty lies in the fact that the generation and design of expert knowledge are often based on the experience and rules of experts. As the process of PT is usually strongly correlated with the structure and vulnerability distribution of the target network scenario, the actions taken by penetration testers facing a specific target network are somewhat unpredictable. Considering the interpretability and validity of expert knowledge, we propose two ways of collecting it:

(i) Method 1: As shown in Figure 3, we first transform the abstract experience of the penetration tester into an executable action that can interact with the simulated environment. This executable action is usually mapped to a specific environmental state. For example, a penetration tester decides to take a certain exploit action based on the current state of the target network. We can abstract this expert knowledge and represent it in the form of a state-action pair. Afterwards, the penetration test simulation environment executes the action, processes the result, and gives feedback. Finally, the reward value R resulting from the execution of action A is integrated with the state S and the new state S′. We obtain a complete set of expert transition data that can be used in the training of the agent and stored in an expert knowledge base.

(ii) Method 2: We collect valid paths and traces of agents completing PT objectives in multiple different simulated network scenarios (scenarios within a fixed order of magnitude, to keep the expert knowledge uniform). These trajectories consist of multiple transition data, obtained from tests of the agent against different network scenarios. Due to the structural specificity of the expert data that we designed, real-world common open ports and services, for example, have specific bitmasks in the transition data. These predefined bitmasks often override the configurations of our simulated network environments. For example, if the agent collects expert data in multiple network scenarios of size less than N, we can apply this expert knowledge to the training of network scenarios of size less than N. In addition, we take valid transition data (reward value > 0 for the action execution) and store them in the expert knowledge base.
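The collection step of Method 2 might be sketched as follows (the `Transition` record and the reward > 0 filter mirror the text; the field types and names are assumptions for illustration):

```python
from dataclasses import dataclass

@dataclass
class Transition:
    """One expert demonstration record tau(S, A, R, S')."""
    state: tuple
    action: int
    reward: float
    next_state: tuple

def collect_valid(trace):
    """Keep only transitions whose action produced a positive reward,
    the validity criterion the text uses for expert data."""
    return [t for t in trace if t.reward > 0]
```

Applying `collect_valid` to each trajectory and accumulating the results would populate the expert knowledge base.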

Stage 2: Input of Expert Knowledge.
The transition data collected in the expert knowledge base can be used as demonstration data to ensure that agents learn the PT strategy of experts through pretraining. The demonstration data input to the agent conform to the standard transition data form of reinforcement learning algorithms and are used for training and processing. The detailed representation and structure of the transition data are shown in Figure 3. Each transition is a four-tuple τ(S, A, R, S′), where S is a valid observation of the current target network state by the agent and also serves as the state input to the neural network in the RL algorithm, A is the vector of actions performed, consisting of the numbers of the agent's actions, R is the reward value for executing action A in state S, and S′ is the new state reached after executing action A in state S. Every transition in the expert knowledge base has the same structure and can be applied to the training of the agent as demonstration data. The observed state contains the agent's perception of the target network environment, which comprises statistically tractable information that influences the action decisions of the PT process. This information includes, among other things, the number of hosts the agent has discovered, the number of hosts it controls, the number of open services it has scanned, the number of connection credentials it has obtained, and the ports it has discovered.
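A possible encoding of the observed state into the fixed-length vector S could look like this (the observation keys are illustrative assumptions based on the quantities the text lists, not the framework's actual field names):

```python
def encode_state(obs):
    """Turn the agent's observation dict into the fixed-length state
    vector S fed to the neural network. Missing fields default to 0."""
    return (
        obs.get("discovered_hosts", 0),   # hosts the agent has found
        obs.get("owned_hosts", 0),        # hosts the agent controls
        obs.get("open_services", 0),      # services found by scanning
        obs.get("credentials", 0),        # connection credentials obtained
        obs.get("discovered_ports", 0),   # ports discovered
    )
```

A fixed layout like this is what lets expert transitions collected in one scenario be replayed as network input in another.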

Stage 3: Interaction and Training of the Agent.
As the agent interacts with the penetration test simulation environment, the agent's actions change the state of the environment, while the environment gives the agent feedback in the form of rewards and penalties. The agent adjusts its PT strategy and actions based on the rewards. At this point, the demonstration data extracted from the expert knowledge base serve to assist the agent in making decisions and to influence the agent's tendency to perform certain actions. The better transition data generated by the agent as it interacts with the environment are also recorded and stored in the expert knowledge base. In this way, the expert knowledge base is continuously expanded with valid transition data.

DQfD Training.
The previous section described the basic process of implementing our DQfD-AIPT framework incorporating expert knowledge, highlighting in particular how expert knowledge data are collected. This section describes the DQfD RL algorithm and the specific implementation details of how the algorithm, combined with expert knowledge data, guides the agent in PT.
Based on the algorithmic structure of DQN, combined with the DQfD framework, we implement the DQfD algorithm, whose structure is shown in Figure 4. First, following the DQN structure, we build a policy network and a target network with an identical architecture. Each network is a multilayer neural network: an input layer, three fully connected hidden layers, and an output layer. At the same time, we implement PER through the tree storage structure of the sum tree. PER draws valuable samples at a higher frequency than uniform random sampling, weighting the importance of the different state transition data by their TD error. This approach balances the proportion of demonstration data and interaction data contained in each sampled mini-batch of transition data. On this basis, we incorporate the demonstration data from the expert knowledge base to improve algorithm performance and learning efficiency.
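The sum tree storage used by PER can be sketched as a binary tree whose leaves hold transition priorities and whose internal nodes hold the sum of their children, so priority-proportional lookup costs O(log N). This is a generic illustrative implementation (a power-of-two capacity is assumed for the flat-array layout), not the framework's actual code.

```python
# Minimal sum-tree sketch for prioritized experience replay (PER).
# Leaves store priorities; each internal node stores the sum of its two
# children, so the root holds the total priority mass.
class SumTree:
    def __init__(self, capacity):
        # capacity is assumed to be a power of two for this flat layout:
        # internal nodes live at indices 1..capacity-1, leaves at
        # capacity..2*capacity-1 (index 0 is unused).
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity)
        self.data = [None] * capacity
        self.write = 0                      # next leaf slot to (over)write
        self.size = 0

    def total(self):
        return self.tree[1]                 # root = total priority

    def add(self, priority, transition):
        idx = self.write + self.capacity
        self.data[self.write] = transition
        self.update(idx, priority)
        self.write = (self.write + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def update(self, idx, priority):
        # propagate the priority change from the leaf up to the root
        change = priority - self.tree[idx]
        while idx >= 1:
            self.tree[idx] += change
            idx //= 2

    def get(self, v):
        """Descend to the leaf whose cumulative priority interval covers v."""
        idx = 1
        while idx < self.capacity:
            left = 2 * idx
            if v <= self.tree[left]:
                idx = left
            else:
                v -= self.tree[left]
                idx = left + 1
        return self.data[idx - self.capacity], self.tree[idx]
```

Sampling a value v uniformly from [0, total()) then calling `get(v)` draws each stored transition with probability proportional to its priority, which is exactly the property PER relies on.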
The detailed process of the DQfD algorithm can be described as follows.

Pretraining Stage.
State transition data from the constructed expert knowledge base are preloaded into the demonstration data area of the sum tree. It is important to emphasise that the size of the demonstration data area and the data filled into it are fixed: throughout the training process, the data in the demonstration area are never overwritten as new state transition data are added to the sum tree. After the expert knowledge is preset, the policy network is pretrained by repeatedly sampling batch-sized state transition data from the demonstration data area. The pretraining process updates the parameters of the Q network using a joint loss that combines the standard 1-step TD loss with three additional losses:

$$J(Q)_s = J_{DQ}(Q) + \lambda_1 J_n(Q) + \lambda_2 J_E(Q) + \lambda_3 J_{L2}(Q),$$

where $J(Q)_s$ is the joint loss containing the supervised loss, $J_{DQ}(Q)$ is the 1-step TD loss, $\lambda_1$, $\lambda_2$, and $\lambda_3$ are weighting constants, $J_n(Q)$ is the n-step return loss, $J_E(Q)$ is a supervised loss, and $J_{L2}(Q)$ is a regularisation term applied to the neural network weights to alleviate overfitting to the demonstration data and to prevent the strategy from overfitting to a small fraction of the experience in the expert data. A detailed explanation of $J_n(Q)$ and $J_E(Q)$ is given below. For the $J_n(Q)$ loss, the agent updates its Q-network with a mixture target of 1-step and n-step returns. Incorporating the n-step loss helps propagate the value of the expert data to earlier states, greatly enhancing learning from the limited demonstration dataset; it also ensures that the pretrained value function estimates satisfy the Bellman equation. $J_n(Q)$ can be expressed as

$$J_n(Q) = \Big( r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_{a} Q(s_{t+n}, a) - Q(s_t, a_t) \Big)^2.$$

$J_E(Q)$ is a supervised loss, and its incorporation is the key to the pretraining process. It can be expressed as

$$J_E(Q) = \max_{a \in A} \big[ Q(S, a) + l(a_E, a) \big] - Q(S, a_E),$$

where $a_E$ is the action corresponding to the expert demonstration data in the state S.
Here, $l(a_E, a)$ is a margin function that measures how well the currently executed action matches the action demonstrated by the expert:

$$l(a_E, a) = \begin{cases} 0, & a = a_E, \\ \mu, & \text{otherwise}, \end{cases}$$

where $\mu$ is a positive margin constant. In this way, the value of any action that differs from the expert action $a_E$ is forced to lie at least a margin below the value of $a_E$. With the supervised loss, the values of actions outside the range of the demonstration data also take reasonable values, resulting in a value-driven ε-greedy strategy that effectively imitates the expert's actions. Pretraining provides a good starting point for learning the task. Once the agent begins to interact with the task, it continues to learn by sampling from both its own generated data and the demonstration data.
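The large-margin supervised loss above can be sketched in a few lines. The Q-values and the margin constant (0.8, a value commonly used for this loss) below are illustrative assumptions.

```python
# Sketch of the large-margin supervised loss
#   J_E(Q) = max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E)
# with the margin l(a_E, a) = 0 when a == a_E, and a constant otherwise.
def margin(a_expert, a, m=0.8):
    # m is an illustrative margin constant
    return 0.0 if a == a_expert else m

def supervised_loss(q_values, a_expert):
    # q_values[a] plays the role of Q(s, a) for each action index a
    best = max(q + margin(a_expert, a) for a, q in enumerate(q_values))
    return best - q_values[a_expert]

# If the expert action already dominates by more than the margin, the loss
# is zero; otherwise the loss pushes the expert action's value upward.
loss = supervised_loss([1.0, 2.0, 1.5], a_expert=0)  # max(1.0, 2.8, 2.3) - 1.0
```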

Training Stage.
After the pretraining phase, we obtain an agent with expert experience. However, the agent does not interact with the environment during pretraining. During the formal training phase, the agent first interacts with the environment to generate transition data, and each transition is stored in the interaction data area of the sum tree structure. In addition, to avoid overfitting to the expert transition data early in the training process, the interaction data area of the sum tree must be filled by the agent's interaction with the environment before formal learning begins. After the maximum storage capacity is reached, newly generated data continuously overwrite the interaction data area of the sum tree structure. The flow of the PER algorithm is presented in Algorithm 1.
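The split between a fixed demonstration region and an overwritable interaction region can be sketched as follows. Class and field names are illustrative assumptions, not the framework's code, and the sketch omits priorities to isolate the storage policy.

```python
# Sketch of a replay buffer split into a fixed demonstration region (never
# overwritten) and a ring-buffer interaction region (oldest entries
# overwritten once capacity is reached), as described in the text.
class SplitReplayBuffer:
    def __init__(self, demo_transitions, interact_capacity):
        self.demo = list(demo_transitions)   # fixed expert demonstration area
        self.interact = []                   # agent-generated interaction area
        self.capacity = interact_capacity
        self.write = 0                       # next slot to overwrite when full

    def push(self, transition):
        if len(self.interact) < self.capacity:
            self.interact.append(transition)
        else:
            # overwrite the oldest interaction transition
            self.interact[self.write] = transition
            self.write = (self.write + 1) % self.capacity

    def __len__(self):
        return len(self.demo) + len(self.interact)
```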
In the pretraining phase, only the expert transition data are extracted from the demonstration data area for training. In the formal training phase, transition data from both the demonstration data area and the interaction data area can be extracted from the sum tree according to the PER method. The difference is that the supervised loss $J_E(Q)$ is removed from the calculation of the joint loss $J(Q)_s$, that is, $\lambda_2 = 0$. In addition, each network update is more computationally expensive than in standard DQN, since the n-step loss requires additional forward passes. This design ensures that the replay buffer stays closer to the state distribution of the current policy and prevents network overfitting. Therefore, the update frequency of the target network in the pretraining phase is measured in steps, and the step interval should not be set too small; in the formal training stage, the target network is updated per episode. In summary, the DQfD algorithm combined with expert knowledge is presented in Algorithm 2.
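The switch from pretraining ($\lambda_2 > 0$) to formal training ($\lambda_2 = 0$) can be illustrated numerically. All component loss values and weights below are made up for illustration; only the weighting scheme follows the joint loss described above.

```python
# Numeric sketch of the joint loss
#   J(Q)_s = J_DQ + λ1·J_n + λ2·J_E + λ3·J_L2
# with illustrative (made-up) component values and weights.
def joint_loss(j_dq, j_n, j_e, j_l2, lam1=1.0, lam2=1.0, lam3=1e-5):
    return j_dq + lam1 * j_n + lam2 * j_e + lam3 * j_l2

# Pretraining: the supervised term is active (λ2 > 0).
pre = joint_loss(j_dq=0.40, j_n=0.25, j_e=0.30, j_l2=100.0)
# Formal training: the supervised loss is removed (λ2 = 0), so the same
# batch yields a smaller joint loss with no imitation pressure.
post = joint_loss(j_dq=0.40, j_n=0.25, j_e=0.30, j_l2=100.0, lam2=0.0)
```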

RL Settings for PT.
In this paper, we model the PT process as an MDP. We use the RL agent as an attacker who penetrates the target system, and the target system constitutes the environment with which the agent interacts. Taking the formalisation of an MDP as a basis and considering the characteristics of the intelligent PT simulation environment used for the experiments, we give the relevant settings for the elements required for RL. The relevant elements in formalising PT as an MDP can be represented as follows:

(i) State space: A state space is a finite set of states with a nonfixed structure. In PT problems, the state space covers the range of changeable states of the target network. In this paper, we take the agent's awareness of the target network environment through observation as the state. The representation of the specific states is shown in Figure 5.

(ii) Action space: The action space contains all the executable actions of the agent and does not change with the current state of the environment. This means that the output dimension of the neural network is always fixed during the state-to-action mapping process. In this paper, there are three main types of actions for an agent: ① Local exploit: the local exploit action exploits the local resources of a target host after taking control of that host. Possible outcomes are privilege elevation, credential information leakage, suspicious link leakage, etc. ② Remote exploit: the remote exploit uses the currently controlled host as a springboard to execute malicious commands submitted from the local browser. Possible outcomes are gaining control of the target host, leaking a suspicious link, etc. ③ Connect: the connect action acts on discovered hosts and connects to the target host by means of the host credential information acquired during lateral movement. The outcome is control of the target host.
(iii) Reward: The reward function is feedback from the environment for the action taken by the agent. The reward during the penetration test can be expressed as

$$R(S, A) = \mathrm{Eval}(\mathrm{Outcome}_S^A) - \mathrm{Cost}(A),$$

where $\mathrm{Eval}(\mathrm{Outcome}_S^A)$ is an evaluation of the outcome of the action executed by the agent and $\mathrm{Cost}(A)$ is the cost of performing the action. The classification of the exploit outcomes of the actions and the corresponding values are shown in Table 1.

(iv) Transfer function: The transfer function describes the probability that the environment makes a particular state transfer under given conditions. For the model-free approach, learning is performed from the experience generated during the interaction, as the agent cannot directly rate the merit of the transformed state. The transfer function is unknown when formalising the PT process as an MDP.

Input: k: batch size; N: capacity of the sum tree; n: the number of transitions currently stored in the sum tree; the initialised sum tree structure.
Output: the updated sum tree structure after sampling the transitions.
(1) if n < N then
(2) push the transition τ(S, A, R, S′) into the sum tree with maximal priority
(3) end if
(4) Start sampling batch-size transitions from the sum tree
(5) Calculate σ ← SumTreeTotalPriority / k
(6) for step t ∈ {1, 2, ..., k} do
(7) sample one transition with priority from the sum tree:
(8) a ← (t − 1)·σ, b ← t·σ
(9) v ← generate a random number between a and b
(10) obtain transition τ_t and its priority according to the random number v
(11) compute the importance sampling weight for transition τ_t
(12) end for
(13) Train this batch of transitions and compute the TD error according to the weights
(14) Update transition priorities according to the TD error
Algorithm 1: Implementing PER with the sum tree structure.
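The stratified sampling step of Algorithm 1 can be sketched directly: the total priority mass is split into k equal segments and one value is drawn uniformly from each, so high-priority transitions are drawn more often. A flat priority list stands in for the sum tree here (both yield the same distribution; the tree only makes lookup O(log N)), and the priorities in the example are made up.

```python
import random

# Sketch of Algorithm 1's stratified sampling over priorities:
#   σ ← total priority / k; segment t covers [(t-1)σ, tσ).
def stratified_sample(priorities, k, rng=random):
    total = sum(priorities)
    seg = total / k
    picks = []
    for t in range(k):
        v = rng.uniform(t * seg, (t + 1) * seg)  # one draw per segment
        # walk the cumulative sums to find the transition covering v
        cum = 0.0
        for i, p in enumerate(priorities):
            cum += p
            if v <= cum:
                picks.append(i)
                break
    return picks

# A transition with dominant priority (index 3) is drawn in most segments.
idx = stratified_sample([1.0, 1.0, 1.0, 97.0], k=4, rng=random.Random(0))
```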

Experiment
First, we build PT simulation network scenarios on the CBS platform developed by Microsoft. Second, we build an expert knowledge base containing transition data for multiple network scenarios by using the expert knowledge collection method introduced in Section 3. Finally, to validate the effectiveness of our proposed method, we use the DQfD algorithm and the DQN algorithm to perform PT path planning in scenarios containing elements of network defence deception (equipped with honeypots), and the performance of the algorithms is evaluated by specific metrics.

Input: D_replay: the experience replay area built on the sum tree; D_demo: the expert demonstration data area in D_replay; D_interact: the interaction data area in D_replay; θ: weights of the policy network (randomly generated); θ′: weights of the target network (randomly generated); f_p: target network update frequency during pretraining; f_f: target network update frequency during formal training; k: batch size; j: number of pretraining gradient updates; E: number of training episodes; S: max steps per episode.
(1) for step t ∈ {1, 2, ..., j} do
(2) sample a batch of k transitions from D_demo with prioritisation
(3) compute the importance sampling weights
(4) calculate the loss J(Q)_s using the target network
(5) perform a gradient descent step to update the policy network weights θ
(6) if t mod f_p = 0 then θ′ ← θ end if
(7) end for
(8) for episode u ∈ {1, 2, ..., E} do
(9) for step v ∈ {1, 2, ..., S} do
(10) sample action A from the behaviour policy
(11) the environment performs A and returns the reward R; the agent observes (S, A, R, S′)
(12) push the transition τ(S, A, R, S′) into D_interact, overwriting the oldest interaction transition if over the capacity of D_interact
(13) sample a batch of k transitions from D_replay with prioritisation
(14) calculate the loss J(Q) using the target network
(15) perform a gradient descent step to update the policy network weights θ
(16) S ← S′
(17) end for
(18) if u mod f_f = 0 then θ′ ← θ end if
(19) end for
Algorithm 2: The DQfD algorithm combined with expert knowledge.

Network Scenario.
The simulated enterprise network scenario consists of a DMZ zone, a Trust-1 zone, and a Trust-2 zone. The Trust-1 and Trust-2 zones provide web services and database services in the enterprise network. We deployed honeypots in the two trust zones. The honeypots are replicas of sensitive servers and hosts: they provide some common services, open more sensitive ports, and contain invalid resources and information. Honeypots are set up to consume the attacker's resources and to mitigate the impact of the attacker's actions on the enterprise network. This is a high-fidelity construction of a real-world network scenario. The firewall between each pair of zones controls the access policy between them. The host configuration information, firewall access policies, and vulnerability information for the simulated enterprise network scenario are shown in Tables 2-4, respectively.

Penetration Testing Goals.
In the simulated enterprise network scenario, the attacker has initially gained control of SpringBoot in the DMZ zone and uses it as a springboard machine for further lateral movement. The goal of the PT process is to gain access to the control commands of the database server in the Trust-2 zone in order to obtain further access to sensitive data and critical information. At the same time, the agent needs to avoid getting caught in the honeypots in the trust zones; that is, the agent aims to obtain as large a cumulative reward as possible at the least cost.

Expert Knowledge Collection.
To construct the expert knowledge base, we convert the human experience of successfully conducting PT into demonstration transition data that can be understood and learnt by the agent. In our experiments, we collected 1000 expert demonstration transitions from each of 10 differently structured network scenarios and pushed them into the expert knowledge base. These network scenarios all fall within a certain size range. We preplace these expert demonstration data in the demonstration data area of the sum tree before pretraining.

Evaluation Metrics
(i) Average cumulative reward: When applying an RL algorithm to train an agent for penetration testing, the cumulative reward earned in each episode directly indicates how training in that episode progresses as the number of steps increases. Therefore, the average cumulative reward over multiple episodes effectively shows the average performance of the agent throughout the whole training process.

(ii) Probability of attacking honeypots: The honeypots in the simulated network scenario are hosts or servers with a cyber deception defence role. We calculate the number of times a honeypot is attacked in each episode as a percentage of the number of actions performed by the agent. This metric assesses the effectiveness of the expert demonstration data for the agent's policy training.
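The two metrics above can be sketched as simple computations over per-episode logs. The episode rewards and counts below are made-up numbers for illustration, not experimental data.

```python
# Sketch of the two evaluation metrics over illustrative per-episode logs.
def average_cumulative_reward(episode_rewards):
    """Mean of the cumulative reward obtained in each episode."""
    return sum(episode_rewards) / len(episode_rewards)

def honeypot_attack_probability(honeypot_hits, actions_taken):
    """Fraction of an episode's actions that targeted a honeypot."""
    return honeypot_hits / actions_taken

rewards = [-120.0, -40.0, 35.0, 80.0, 95.0]   # cumulative reward per episode
avg = average_cumulative_reward(rewards)       # (-120-40+35+80+95)/5 = 10.0
p = honeypot_attack_probability(honeypot_hits=3, actions_taken=60)  # 0.05
```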

Experimental Results and Analysis.
We trained the agent to perform PT using the DQfD and DQN algorithms, respectively. The hyperparameter settings for the algorithms and the DQfD-specific parameter settings are shown in Tables 5 and 6, respectively. The average of the cumulative rewards obtained by the agent over 200 training episodes was recorded to compare the performance of DQfD and DQN. Second, to verify the effectiveness of the incorporated expert knowledge, we recorded the probability of attacking Honeypot-1 and Honeypot-2, respectively. To ensure the credibility of the experimental results, we conducted 10 experiments in the same network scenario and plotted the averaged results of the 10 experiments in Figures 7-9.
As can be seen from the experimental results in Figure 7, the DQfD algorithm incorporating expert knowledge achieves the PT goal in fewer steps than the DQN algorithm (DQfD within 500 steps, DQN within 3000 steps). Redundant repetitions, exploitation failures, and actions that fall into the honeypot during the penetration test incur penalties. Therefore, the larger the reward value accumulated in each episode, the better the agent's PT path and action selection strategy. The DQfD algorithm accomplishes the goal faster while earning a larger cumulative reward in each episode. The experimental results indicate that the DQfD algorithm improves the performance of penetration testing to a certain extent and demonstrate the benefit of fusing expert knowledge.
The results in Figures 8 and 9 show that the agent trained with the DQfD algorithm has a significantly lower probability of attacking the honeypot hosts in the Trust-1 and Trust-2 zones in each episode. Compared to DQN, DQfD maintains a lower attack probability throughout training, always below 0.1. The fluctuations of DQfD in the early stages occur because the exploration rate ε is still decaying, so there remains a high probability of randomly explored actions. DQN has a high probability of attacking the honeypots in the early stages. As training progresses, trial-and-error experience is learned into the network model; however, due to the lack of guidance from expert knowledge, its ability to avoid deception defences is not significantly improved compared to DQfD.
The experimental results effectively reflect the role of incorporating expert knowledge in identifying and evading the deception defence components of the scenario during the PT performed by the agent.
The less the agent interacts with the honeypots in each episode, the lower the cost of completing PT. From another perspective, this greatly weakens the role of honeypot deployments, since expert knowledge can guide and modify the agent's PT strategy. The advantage of the DQfD algorithm is that it combines the benefits of supervised learning and reinforcement learning. The DQfD algorithm makes reasonable use of the transition data from experience replay and avoids overfitting to the expert data. The results of experiments conducted in a network scenario with honeypots also indirectly indicate the validity of the expert demonstration data. The experimental results show that, compared to DQN, the DQfD algorithm not only achieves larger cumulative reward values in network scenarios with honeypots but also makes better use of the expert demonstration data to avoid getting trapped in honeypots. The efficiency and difficulty of PT depend on the complexity of the target network structure. Most of the expert knowledge that we currently collect is gathered through training in network scenarios of a specified size range. The coverage of expert knowledge is therefore relatively small and limited by the representational form of the transition data. In future research, we will consider improving the interpretability of expert knowledge by better converting human PT experience into knowledge that can be accepted and learned by an agent [15, 27, 28].

Figure 1: The agent interacts with the environment.

Figure 3: The process of converting abstract expert knowledge into transition data.

Figure 5: Structure of transition data.

Table 1: Action execution outcome and evaluation.

Table 3: Firewall policies for the enterprise network.