With the development of information technology,the recognition of the network target has urgent practical needs whether it is in the signal detection of the electronic countermeasures or in the security maintenance of the communication networks.Especially with the rise of artificial intelligence and smart homes,unknown protocols are no longer used solely in the military field,and more commercial and civilian communications are also beginning to adopt it.Therefore,the identification of unknown protocols has very important research significance.Especially the link layer protocol which exists in the form of bit stream occupies an important position in the field of protocol recognition.Because it does not have any semantic characteristics,and the high-level protocol is always encapsulated in its protocol frame in the form of data.So in the this paper our object is an unknown protocol in the form of a bit stream at the link layer.The algorithms for link layer unknown protocol identification in the current literatures are mainly based on pattern matching and data mining.The basic processes are: pattern sequence statistics,frequent sequence mining,association rule mining,link frame segmentation and protocol format speculation.Subsequent results always depend on the effects of pattern sequences and frequent sequence statistics.However,in the traditional scheme,when the AC(Aho-Corasick)fast count algorithm is used for pattern sequence statistics,there is still a disadvantage that the number of database operations is too large.Frequent sequence mining algorithms based on enumeration tree pruning also have the drawbacks of difficult pruning cycle selection and complicated storage structure.Therefore,this paper improves the module of pattern sequence statistics,frequent sequence mining and association rule mining respectively.First,for the module of pattern sequence statistics and frequent sequence mining,this paper proposes an improved AC algorithm based on statistics.The algorithm traverses the target sequence only once to count the maximum length pattern sequence,and the remaining short pattern sequences are directly calculated by statistical formulas.As the number of algorithm database operations is reduced,subsequent frequent sequence mining is based directly on the threshold for screening.The algorithm does not need to maintain the "enumeration tree",which greatly simplifies the data storagestructure.The simulation results prove that the time-consuming ratio of the new algorithm to the traditional algorithm is about 18% when performing frequent sequence mining of 3-8 bits on the ARP(Address Resolution Protocol)protocol.The new algorithm is insensitive to the size of the data set,and the statistical time growth rate is only 0.07,which is much smaller than the traditional algorithm's time growth rate of0.4.Next,for the fixed-length frame protocol,this paper first uses the rotate left of the array to complete the frame length prediction based on the concept of self-confidence.After that,on the basis of successfully predicting the frame length,the concept of position entropy is used to complete the de-redundancy operation for frequent sequences.The simulation results prove that the frame length prediction algorithm can correctly predict the frame length of the fixed-length frame protocol,and the smaller the self-confidence threshold is set,the more accurate the result of the frame length prediction.When the self-confidence threshold of the frequent sequence de-redundancy algorithm is 0.04,the keyword candidate set screening rate of the ARP protocol can reach 28%.Finally,this paper directly counts all the dual association rules that satisfy the mutual confidence threshold through the upper difference matrix and the lower difference matrix.After that,based on the principle of merging association rules,the adjacency matrix is used to realize the merging of dual association rules.The simulation results show that the algorithm can effectively calculate some of the association rules of the fixed-length frame ARP protocol,and can also calculate the partial synchronization codes of the non-fixed-length frame 802.11 b protocol. |