Protocol identification is a precondition for traffic monitoring, intrusion detection, and user behavior analysis. As the Internet develops, more and more new protocols are emerging; many of their specifications are no longer open to the public, and many use random ports. In addition, for reasons of information security and privacy, much network traffic is encrypted, which makes the problems that protocol identification must solve more complicated. The search for more efficient protocol identification methods is therefore of significant research value. Identification techniques based on characteristic strings achieve high accuracy, and identification techniques based on dynamic action characteristics mapping can identify encrypted flows with good throughput; both are widely used.

This paper aims to improve the efficiency of protocol identification by improving the performance of the matching algorithms used in identification techniques based on characteristic strings and on dynamic action characteristics mapping. The main work is as follows:

A protocol identification system based on pattern matching and machine learning is proposed. The system combines the advantages of both techniques: pattern matching achieves high accuracy, machine learning can identify encrypted traffic, and the feature base can be updated continually.

Common pattern matching algorithms are analyzed, and an improved Boyer-Moore (BM) algorithm is proposed.
This improved pattern matching algorithm reduces the complexity of the preprocessing stage, increases the maximum jump distance by making full use of mismatch information, and increases the probability of reaching the maximum jump distance by taking more information into account, thereby improving the efficiency of protocol identification.

Based on a comparison and analysis of feature selection methods, a genetic algorithm is used to further filter common statistical features of traffic. Using the feature set selected by the genetic algorithm in machine learning yields good traffic identification results and also improves the efficiency of protocol identification.

Machine learning algorithms are also studied. Because it is difficult to determine the value of K in the K-means clustering algorithm, an optimization is proposed: combining binary search with K-means clustering makes it possible to determine an appropriate value of K more quickly.
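
To make the notion of "jump distance" in BM-style matching concrete, the following is a minimal sketch of the standard Boyer-Moore-Horspool variant applied to characteristic-string identification. It is not the improved algorithm proposed in this paper, whose details are not given here; the function names and the signature strings are illustrative assumptions only.

```python
from typing import Dict, Optional


def build_shift_table(pattern: bytes) -> Dict[int, int]:
    """Horspool preprocessing: for each byte in pattern[:-1], store the
    distance from its last occurrence to the end of the pattern."""
    m = len(pattern)
    # Later occurrences overwrite earlier ones, leaving the smallest safe shift.
    return {byte: m - 1 - i for i, byte in enumerate(pattern[:-1])}


def horspool_search(text: bytes, pattern: bytes) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    shift = build_shift_table(pattern)
    pos = 0
    while pos <= n - m:
        # Compare the current window with the pattern from right to left.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            return pos
        # Jump according to the text byte aligned with the pattern's last
        # position; a byte absent from the table gives the maximum jump m.
        pos += shift.get(text[pos + m - 1], m)
    return -1


def identify(payload: bytes, signatures: Dict[str, bytes]) -> Optional[str]:
    """Return the first protocol whose characteristic string occurs in payload."""
    for name, signature in signatures.items():
        if horspool_search(payload, signature) != -1:
            return name
    return None


# Hypothetical signature base, for illustration only.
SIGNATURES = {"HTTP": b"HTTP/1.", "SSH": b"SSH-2.0"}
print(identify(b"GET /index.html HTTP/1.1\r\n", SIGNATURES))
```

In this baseline the maximum jump equals the pattern length and occurs only when the byte under the pattern's last position does not appear in the shift table; the improvements described above target both the size of that jump and how often it is reached.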