Tor is one of the most widely used anonymous communication tools. It hides the identities of the communicating parties and the relationship between them. While Tor protects users' communication privacy, it is also abused for illegal activities: for example, attackers use Tor to launch network attacks or to access illegal content. Tor's anonymity mechanism makes such attacks difficult to trace and makes it hard to hold attackers accountable. Therefore, correlating Tor-based illegal activities with attackers from large volumes of communication data has important theoretical significance and practical value for network information security.

obfs4 is the most widely used obfuscation plug-in for Tor. Establishing the communication relationship between attackers and the websites they access through Tor faces several strong countermeasures against traffic correlation analysis: content encryption, multi-hop IP hiding, random packet padding, and randomized inter-packet intervals. To meet these challenges, this thesis proposes a traffic feature classification method based on the logical and temporal relationships of web page elements, and uses it to establish the communication relationship between attackers and the websites they visit through Tor. The main work and contributions are as follows.

(1) To address the restrictions on acquiring obfs4 nodes, their short validity period, and website access control, this thesis proposes a Tor bridge node acquisition method based on web page distribution and e-mail distribution, and an obfs4 traffic collection algorithm based on a simulated browsing tool. Fast, large-scale collection of obfs4 traffic is achieved through obfs4 node selection and updating, control of Tor circuit selection, and an optimized strategy for handling communication exceptions.

(2) To address the problem of locating the boundaries between different web pages that a user visits sequentially in the collected obfs4 traffic, this thesis proposes a web traffic segmentation algorithm based on density clustering of the traffic time series. The clustering algorithm groups packets by arrival time to find the time boundaries of each visited web page, and then divides the traffic of a user who visits multiple web pages in sequence into one traffic sequence per web page according to those boundaries.

(3) To counter anti-traffic-analysis measures such as random packet padding and randomized inter-packet intervals, this thesis proposes a multi-dimensional feature vector analysis method based on the resource requests of web page structural elements and the constraints on the obfs4 random padding length. Three types of features, forming a 124-dimensional feature vector, are constructed to represent the traffic of a single web page: packet length statistics, packet count statistics, and cumulative packet features.

To verify the effectiveness of the proposed method, 10,920 URLs were collected and 172 GB of website access data was obtained. The collected data were used to train several classification models, including decision trees, gradient boosting decision trees, and random forests. Experiments show that in the closed-world setting the best multi-class website classification accuracy is 91.6%, and in the open-world setting the binary website classification accuracy is 89.6%. These results indicate that the obfs4 website fingerprinting algorithm proposed in this thesis is effective and practical.
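
The segmentation step in contribution (2) can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm or its parameters, so this example assumes a DBSCAN-style density clustering over one-dimensional packet arrival times. The function name `segment_by_density` and the parameters `eps` (neighbourhood radius in seconds) and `min_pts` (minimum packets for a dense region) are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left, bisect_right


def segment_by_density(times, eps=1.0, min_pts=3):
    """Split packet arrival times into per-page (start, end) time boundaries.

    DBSCAN-style clustering in one dimension (an assumption, not the
    thesis's exact algorithm). Packets in sparse regions are treated as noise.
    """
    ts = sorted(times)
    n = len(ts)

    # A packet is a core point if at least min_pts packets (itself included)
    # arrive within eps seconds of it.
    core = [
        bisect_right(ts, t + eps) - bisect_left(ts, t - eps) >= min_pts
        for t in ts
    ]

    # Chain core points: a gap larger than eps between consecutive core
    # points starts a new cluster, i.e. a new page visit.
    labels = [-1] * n
    cluster = -1
    prev_core = None
    for i in range(n):
        if not core[i]:
            continue
        if prev_core is None or ts[i] - ts[prev_core] > eps:
            cluster += 1
        labels[i] = cluster
        prev_core = i

    # Border points join the cluster of the nearest core point within eps.
    core_idx = [i for i in range(n) if core[i]]
    core_ts = [ts[i] for i in core_idx]
    for i in range(n):
        if core[i]:
            continue
        j = bisect_left(core_ts, ts[i])
        best = None
        for k in (j - 1, j):
            if 0 <= k < len(core_ts) and abs(core_ts[k] - ts[i]) <= eps:
                if best is None or abs(core_ts[k] - ts[i]) < abs(core_ts[best] - ts[i]):
                    best = k
        if best is not None:
            labels[i] = labels[core_idx[best]]

    # Collapse each cluster to its first and last packet time.
    bounds = {}
    for t, lab in zip(ts, labels):
        if lab >= 0:
            lo, hi = bounds.get(lab, (t, t))
            bounds[lab] = (min(lo, t), max(hi, t))
    return [bounds[k] for k in sorted(bounds)]
```

For example, with arrival times `[0.0, 0.1, 0.2, 0.3, 10.0, 10.1, 10.2, 25.0]`, `eps=0.5`, and `min_pts=3`, the function returns two page intervals, `(0.0, 0.3)` and `(10.0, 10.2)`, and discards the isolated packet at 25.0 as noise. In practice `eps` would need tuning against the idle time between page loads.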
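
The three feature families in contribution (3) can be sketched in the same spirit. The abstract does not list the 124 individual features, so this example computes only a small illustrative subset. The signed-length encoding (positive for outgoing, negative for incoming packets), the function names, and the number of cumulative samples `n_samples` are assumptions rather than the thesis's definitions.

```python
from statistics import mean, pstdev


def length_stats(lengths):
    """Packet length statistics: min, max, mean, population std deviation."""
    if not lengths:
        return [0.0, 0.0, 0.0, 0.0]
    return [float(min(lengths)), float(max(lengths)), mean(lengths), pstdev(lengths)]


def cumulative_samples(signed_lengths, n_samples):
    """Sample the running sum of signed lengths at n_samples even positions."""
    cum, total = [], 0
    for x in signed_lengths:
        total += x
        cum.append(total)
    if not cum:
        return [0.0] * n_samples
    if n_samples == 1:
        return [float(cum[-1])]
    last = len(cum) - 1
    out = []
    for k in range(n_samples):
        pos = k * last / (n_samples - 1)
        lo = int(pos)
        hi = min(lo + 1, last)
        # Linear interpolation between neighbouring cumulative values.
        out.append(cum[lo] + (cum[hi] - cum[lo]) * (pos - lo))
    return out


def extract_features(signed_lengths, n_samples=8):
    """Concatenate length, count, and cumulative features for one page trace."""
    outgoing = [x for x in signed_lengths if x > 0]
    incoming = [-x for x in signed_lengths if x < 0]
    total = len(outgoing) + len(incoming)
    counts = [
        float(total),
        float(len(outgoing)),
        float(len(incoming)),
        len(outgoing) / total if total else 0.0,
    ]
    return (
        length_stats(outgoing)
        + length_stats(incoming)
        + counts
        + cumulative_samples(signed_lengths, n_samples)
    )
```

Calling `extract_features([100, -500, -500, 200], n_samples=4)` yields a 16-dimensional vector. A vector of this kind, one per segmented page trace, is what would be passed to a classifier such as a random forest.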