With the rapid development of Internet technology,network security issues have become increasingly prominent.Penetration testing,as a method to actively discover network security vulnerabilities,is of great significance to ensuring network security.However,traditional penetration testing methods have problems such as cumbersome manual operations and low efficiency.Aiming at these problems,this paper conducts research on penetration testing attack path discovery based on deep reinforcement learning.The main research contents include:(1)Aiming at the problems caused by complex attack graphs,such as excessively large state space,complex attack paths,difficulty in subsequent decision-making of the best attack path,and overly long attack paths that easily expose attackers,this paper first proposes a All Shortest Paths-Bidirectional Breadth-First Search(ASP-BBFS)algorithm.Based on the principle of bidirectional search,the algorithm constructs paths by traversing all possible parent nodes of the encountered nodes,so as to find all the shortest paths from the start point to the end point,and complete the simplification of the attack graph.Then apply the Markov Decision Process(MDP)to the attack graph,and use the Common Vulnerability Scoring System(CVSS)to add the reward and state transition probability information required in the MDP to complete the penetration Build test models.Finally,through experiments on two types of attack graphs,the experimental results prove that the ASP-BBFS algorithm is faster in calculating all the shortest attack paths,consumes less memory and is more robust than other attack graph path search algorithms.(2)Aiming at the problems that Deep Q Network(DQN)and its variant Deep Reinforcement Learning(DRL)algorithm are not efficient and difficult to converge in the penetration test attack path discovery scenario,this paper proposes a Normalized Entropic Dynamic Reward Scaled Clipped-Proximal Policy Optimization Algorithm(NHSC-PPO)is used to solve the optimal attack path.This algorithm improves the Proximal Policy Optimization(PPO)algorithm,and introduces advantage function normalization,policy entropy,dynamic reward scaling,and gradient clipping techniques.The NHSC-PPO algorithm first converts the network topology and vulnerability information into a state representation,and then through the continuous exploration and update strategy of the environment by the agent,finally finds the attack path with the highest reward value in the data set.Experimental results in various scenarios show that,compared with PPO,Double Dueling DQN(D3QN)and DQN algorithms,the attack path discovery of NHSC-PPO algorithm can more efficiently and accurately converge to the path with the highest reward value,and then Find the best attack path. |