Unknown Protocol Format Extraction And Message Classification

Posted on: 2019-07-02
Degree: Master
Type: Thesis
Country: China
Candidate: S Y Ding
Full Text: PDF
GTID: 2428330566970988
Subject: Engineering
Abstract/Summary:
With the development of the Internet, a large number of protocols with unknown specifications now run alongside the standard protocols. These are also known as private protocols, and their specifications are kept closed for reasons of security, economic interest, and privacy. Because no public specification exists to identify them, they pose a high security risk and are hard to supervise. For example, a large amount of malicious software uses undisclosed protocols to communicate with and control computers, causing widespread network paralysis and leakage of secrets. In these security incidents, the lack of a traffic specification causes interception of the malicious traffic to fail. Extracting the characteristics of unknown traffic, and then classifying and identifying that traffic, is therefore indispensable for network supervision.

In a real Internet environment, the collected data is a mixture of many kinds of messages. Because of restrictions on the receiving conditions, the collected messages are often disordered and discrete, with no session constraint relationship among them. In addition, the volume of collected messages is often too large to handle. Given these characteristics of real network messages and the speed required for identification, this thesis extracts format features from packet payloads to perform message-based classification of mixed unknown traffic. The main contributions are summarized as follows:

1. For the problem of constructing clustering feature vectors and performing the clustering, we propose an unknown traffic separation method based on entropy estimation (EMEE). The algorithm uses the degree of variation of the byte entropy vector to measure how likely each offset position is to hold a fixed keyword byte. This removes the drawback of setting the payload interception length from manual experience. A two-stage clustering method then clusters the payloads so that the number of clusters is closer to the number of real message types. The performance of EMEE is tested on the DARPA dataset and on messages collected from the Internet. The experimental results show that EMEE effectively reduces the impact of data fields on the separation of unknown traffic and brings the number of clusters close to the number of real categories. It separates mixed message sets composed of 18 categories, with false negative and false positive rates below 2% for every kind of message. Compared with the K-means and DBSCAN algorithms, clustering based on EMEE is clearly more effective.

2. For the problem of message keyword sequence extraction, we propose an iterative keyword sequence extraction algorithm based on byte links (BLIS). The algorithm uses the link as the basic unit for constructing a message, which effectively reduces the confusion between the link frequency distributions of keyword fields and data fields at each offset position. When mining fixed keywords, the clustering space is first estimated from the link entropy vector, and the frequent links in that space are then obtained. Finally, the fixed-position keywords are selected, under strong constraints, from the fields built from frequent links. After the message clusters are further purified using the fixed-position keywords obtained, variable-position keywords are mined for each message cluster and extracted iteratively. In each iteration, the offset position of the keyword is found and the variable keywords are aligned, and the variable-position keyword is then obtained through the entropy vector model. The experimental results show that the features extracted by this method have advantages over those extracted by traditional methods.

3. For the problem of the low time efficiency of frequent string mining, we design and implement a parallel frequent pattern mining system (Apriori and FP-Growth with position information). Because Hadoop has low cost and is suitable for offline processing, the implementation is based on the Hadoop platform. After the experimental environment is set up and configured, the system is implemented in the following steps: 1) design of the parallelized Apriori and FP-Growth algorithms with position information; 2) development of the parallel Apriori and of the parallel FP-Growth based on Mahout; 3) performance testing. The test results on the distributed platform show that, compared with serial computation, the data processing time of the parallelized scheme grows slowly with the data size, which indicates that the parallel algorithm has higher computational efficiency.
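As an illustration of the byte entropy vector mentioned in the abstract, the following is a minimal Python sketch and not the EMEE algorithm itself. It computes the Shannon entropy of the byte value at each payload offset and lists the offsets whose entropy is low, which are candidates for fixed keyword bytes. The function names, the sample payloads, and the 0.5-bit threshold are assumptions made for this example; the thesis measures the degree of variation of the entropy vector and follows it with two-stage clustering, neither of which is reproduced here.

```python
import math
from collections import Counter


def byte_entropy_vector(payloads, max_len):
    """Return the Shannon entropy (in bits) of the byte at each offset.

    Only payloads long enough to contain a given offset contribute to it.
    The vector stops early if no payload reaches an offset.
    """
    vector = []
    for offset in range(max_len):
        column = [p[offset] for p in payloads if len(p) > offset]
        if not column:
            break
        total = len(column)
        counts = Counter(column)
        # H = sum over byte values of p * log2(1 / p)
        entropy = sum((c / total) * math.log2(total / c) for c in counts.values())
        vector.append(entropy)
    return vector


def low_entropy_offsets(vector, threshold=0.5):
    """Offsets whose entropy is at or below the (assumed) threshold."""
    return [i for i, h in enumerate(vector) if h <= threshold]


# Hypothetical sample: three payloads sharing the prefix "GET /".
sample = [b"GET /a", b"GET /b", b"GET /c"]
vec = byte_entropy_vector(sample, max_len=6)
print(low_entropy_offsets(vec))  # offsets 0-4 are constant in this sample
```

In this toy sample the first five offsets have zero entropy because every payload carries the same byte there, while the last offset varies. Real traffic would need a larger sample and a threshold chosen from the data.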
Keywords/Search Tags: Message format characteristics, Unknown traffic separation, Entropy vector, Keyword sequence, Statistical characteristics, Parallel algorithm