With the rapid development of computer network technology and the advent of the information era, the widespread use of networks has caused explosive growth in Internet traffic, and new applications have led to a more flexible and mixed use of network communication protocols. At the same time, network viruses, hacking, and other malicious attacks have drawn the attention of society and government departments to network security. These problems can be addressed by network traffic identification, which is therefore receiving more and more attention.

As a wide range of traffic identification methods have been proposed, both academia and industry are focusing on the feasibility and effectiveness of traffic identification, that is, how to process huge amounts of data quickly and how to identify the various network applications correctly. To cope with the changing network environment, this paper proposes a real-time network traffic identification method based on machine learning. The method mainly adopts two supervised learning algorithms: the BP (Back Propagation) neural network and the SVM (Support Vector Machine).

The BP neural network is trained on a distributed, parallel mesh structure, which gives it strong fault tolerance and high processing speed. It also has good nonlinear mapping ability and can be used to model the nonlinear relationship between input and output. In addition, the BP neural network is trained by means of global optimization, so it has good generalization ability. SVM is a machine learning algorithm suited to small samples; it realizes the nonlinear mapping from a low-dimensional space to a high-dimensional space through the inner-product kernel function, and it has a solid theoretical foundation. Moreover, it can readily solve nonlinear multi-class classification problems by means of transductive inference.
The optimal separating hyperplane generated by SVM is determined by only a handful of support vectors along the class boundary, which makes SVM not only simple and effective but also robust. Both machine learning algorithms can adapt to the large data volumes and diversity of the network environment, and can quickly and effectively identify the application types of Internet traffic.

The traffic identification system in this paper is based on the home network. By function, the system is divided into the home gateway and the backend server. The home gateway captures packets in real time, extracts features, identifies the traffic with the machine learning method, and then transmits the results to the backend server. The backend server stores the identification results in a database and displays the current application types of the Internet traffic, which makes monitoring easy for the administrator.

The main contributions of the thesis are as follows:

First, research into and analysis of network traffic identification and machine learning indicate that the BP neural network can adapt to the large data volumes and diversity of the Internet, and a traffic identification method based on the BP neural network is proposed. A three-layer BP neural network is chosen as the implementation scheme; it meets the requirements of traffic identification and is simple to implement. The hidden layer uses the sigmoid function as its transfer function, realizing a nonlinear mapping of the input data, such as traffic features. Although the BP neural network easily falls into a local minimum of the error surface, the PSO (Particle Swarm Optimization) algorithm is used to search for globally optimal initial weights, so that training converges to the global minimum.
As the experimental results show, the BP neural network optimized by PSO can quickly find the global minimum of the error surface and accurately identify network application types.

Second, after a careful study of how SVM solves linear and nonlinear classification problems, a traffic identification method based on SVM is proposed and applied to Internet traffic identification. The radial basis function is chosen as the kernel function of the SVM, realizing the nonlinear mapping from the low-dimensional space of network traffic features to a high-dimensional space. A multi-class SVM classifier is constructed with the One-Against-One method, so that SVM can identify a variety of network applications by generating optimal hyperplanes in the high-dimensional space. This is a form of global optimization, so the SVM has good generalization ability. The experimental results show that SVM is well suited to nonlinear classification problems such as network traffic identification; it needs fewer training samples and has low computational complexity, so it can realize real-time traffic identification.

Third, a traffic identification system is designed and implemented in the home network. Following the machine learning system model and the supervised learning approach, the overall architecture of the network traffic identification system is designed and divided into online real-time traffic identification and offline training. It covers capturing packets, extracting traffic features, selecting training and test sets, learning from the training samples, and identifying the traffic accurately. The identification algorithms are then implemented in code and ported to the home gateway (built on a router). At the same time, a Web server and a MySQL database are set up on the Linux platform of the backend server.
As a result, interactive communication between the home gateway and the backend server, together with information processing and storage, is implemented. The administrator can view the identification results for the current traffic through a Web browser connected to the backend server.
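
The One-Against-One construction with a radial basis function kernel described above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the binary deciders are simple kernel-similarity stand-ins rather than trained SVMs, and the function names, the `gamma` value, the class labels, and the two-feature sample vectors are all hypothetical. It only shows how k classes lead to k(k-1)/2 pairwise classifiers whose votes are combined.

```python
import math
from collections import Counter
from itertools import combinations


def rbf_kernel(x, y, gamma=0.5):
    """Radial basis function kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def make_pairwise_decider(samples_a, samples_b, label_a, label_b, gamma=0.5):
    """Stand-in for one trained binary SVM (assumption, not a real SVM).

    It picks the class whose training samples have the higher mean
    RBF similarity to the input vector.
    """
    def decide(x):
        score_a = sum(rbf_kernel(x, s, gamma) for s in samples_a) / len(samples_a)
        score_b = sum(rbf_kernel(x, s, gamma) for s in samples_b) / len(samples_b)
        return label_a if score_a >= score_b else label_b
    return decide


def train_one_against_one(training_sets, gamma=0.5):
    """Build one binary decider per unordered pair of classes."""
    return [
        make_pairwise_decider(training_sets[a], training_sets[b], a, b, gamma)
        for a, b in combinations(sorted(training_sets), 2)
    ]


def predict(deciders, x):
    """Each pairwise decider casts one vote; the most-voted class wins.

    On a tie, the class that received its first vote earliest is returned.
    """
    votes = Counter(decide(x) for decide in deciders)
    return votes.most_common(1)[0][0]


# Made-up feature vectors: (mean packet size in KB, mean inter-arrival time in s).
training_sets = {
    "web": [(0.4, 0.30), (0.5, 0.25)],
    "video": [(1.4, 0.02), (1.3, 0.03)],
    "voip": [(0.1, 0.02), (0.15, 0.02)],
}
deciders = train_one_against_one(training_sets)
print(len(deciders))                      # 3 classes give 3 pairwise deciders
print(predict(deciders, (1.35, 0.025)))   # flow close to the "video" samples
```

In a real system each stand-in decider would be replaced by a binary SVM trained on the samples of its two classes, while the voting step would stay the same.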