| Tor is an anonymous communication tool for the darknet based on multi-hop routing. To achieve effective network access, Tor uses bridge nodes, deployed with a bridge protocol, in place of ordinary Tor entry nodes to build virtual circuits into the Tor network. Obfs4 is the most commonly used bridge protocol in the Tor network. It encrypts traffic with an improved elliptic-curve cryptography scheme, resists traffic analysis, and contains no plaintext information, so its traffic cannot be distinguished by simply matching fixed rules. In this paper, we conduct an in-depth study of Obfs4 traffic detection and identification, and design a high-precision traffic identification method that can effectively detect Obfs4 traffic in an open environment. The main work of this paper is as follows. (1) By analyzing the bridge protocol and its traffic and integrating existing related research, this paper builds the most complete feature set to date for Obfs4 traffic identification. The feature set is then reduced using both filter-based and wrapper-based feature selection. Following the filter-based approach, features whose correlation with the class label is too low, or whose correlation with other features is too high, are removed according to the Pearson correlation coefficient. For comparison, feature selection results are also computed with wrapper-based methods combined with random forest, support vector machine, and other machine learning methods. This optimizes the feature set and, without significantly reducing identification precision, improves the time efficiency and data preprocessing efficiency of the Obfs4 traffic identification method, making it usable for identification in high-traffic open environments. (2) To avoid the influence of the base rate fallacy on the traffic 
recognition results, this paper adopts r-precision as the main evaluation metric instead of the commonly used precision metric, where r denotes the ratio of the number of non-target samples to the number of target samples. We optimize traffic identification with the goal of improving r-precision, and apply confidence-based, distance-based, and ensemble learning-based precision optimization methods to Obfs4 traffic identification, respectively. Through experimental analysis, we determine which machine learning methods suit the model identification module in different scenarios, and pair each method with a corresponding precision optimization module. The outputs of the model identification module and the precision optimization module together determine the final identification result. The experimental results show that the proposed high-precision Obfs4 traffic identification method achieves a maximum r-precision of 99.99% and a recall of 92.58% at r = 1 (number of non-target samples / number of target samples = 1). (3) This paper implements a high-precision identification algorithm for Obfs4 traffic, which includes a preprocessing module, a data filtering module, a feature selection module, a model identification module, and a precision optimization module. Data sets collected at several real nodes are combined with public data sets to form a hybrid data set used as the experimental data. The experimental results show that the identification precision of the proposed method is better than that of previous Obfs4 traffic identification algorithms in different environments. At r = 10, the r-precision of the proposed method remains about 90%, much higher than the best r-precision of 63.96% achieved by previous methods in the same environment, so it is reasonable to infer that the proposed method can still ensure the effective identification 
of Obfs4 bridge protocol traffic in the open environment. |
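
The abstract defines r only as the ratio of non-target to target samples and does not give a formula for r-precision. As a minimal sketch, the code below assumes the common binary formulation in which r-precision is the precision a classifier would obtain if non-target samples outnumbered target samples r to 1, that is, TPR / (TPR + r · FPR). The function names `r_precision` and `rates_from_counts`, and the counts in the example, are hypothetical and are not taken from the paper.

```python
def rates_from_counts(tp: int, fn: int, fp: int, tn: int) -> tuple[float, float]:
    """Return (true positive rate, false positive rate) from confusion-matrix counts."""
    tpr = tp / (tp + fn)  # fraction of target (Obfs4) flows correctly flagged
    fpr = fp / (fp + tn)  # fraction of non-target flows wrongly flagged
    return tpr, fpr


def r_precision(tpr: float, fpr: float, r: float) -> float:
    """Precision expected when non-target samples outnumber target samples r to 1.

    Assumed formulation (not stated in the abstract): TPR / (TPR + r * FPR).
    """
    if not (0.0 <= tpr <= 1.0 and 0.0 <= fpr <= 1.0):
        raise ValueError("tpr and fpr must lie in [0, 1]")
    if r < 0:
        raise ValueError("r must be non-negative")
    denominator = tpr + r * fpr
    if denominator == 0.0:
        return 0.0  # nothing is flagged as target; treat precision as 0
    return tpr / denominator


if __name__ == "__main__":
    # Illustrative counts only; these are not results from the paper.
    tpr, fpr = rates_from_counts(tp=900, fn=100, fp=10, tn=990)
    for r in (1, 10, 100):
        print(f"r={r:>3}: r-precision={r_precision(tpr, fpr, r):.4f}")
```

Under this assumed definition, the same classifier loses precision as r grows, which is why a result reported at r = 1 alone can overstate performance in an open environment where non-target traffic dominates.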