
Research And Implementation Of Semi-supervised Machine Learning Algorithms For Classifying The Imbalanced Protocol Flows

Posted on: 2015-03-20
Degree: Master
Type: Thesis
Country: China
Candidate: L L Li
GTID: 2308330482979110
Subject: Communication and Information System
Abstract/Summary:
Network traffic analysis and classification is a fundamental technology for research on network services, network control, and network security, as well as for network operation, management, and upgrading, so research in this area has important practical value. However, with the rapid development of network technology, the number of network users is growing quickly and new applications keep emerging. The resulting network environment is increasingly complex, which makes accurate classification of network traffic more and more difficult. In particular, with the widespread use of dynamic port numbers and encryption, the validity and reliability of traffic identification methods based on port numbers and payload signature matching are declining, and more and more researchers are turning to machine learning for network traffic classification. Such methods classify flows by their statistical characteristics, remove the dependence on port numbers and payload data, and therefore have broader development prospects. This thesis studies two key issues in machine-learning-based traffic classification: class imbalance and the labeling bottleneck. The main work and results are as follows:
1. To address the sample labeling bottleneck and the imbalance among protocol flows, a semi-supervised traffic identification method based on k-means and k-nearest neighbor (KMkNN) is presented. The method characterizes each flow by a high-dimensional statistical feature vector and builds a two-level classifier from the k-means and k-nearest neighbor algorithms. First, the data, which contain a small number of labeled samples and a large number of unlabeled samples, are grouped into several clusters. Then, within each cluster, a k-nearest neighbor classifier is trained on the labeled samples in that cluster and used to classify the cluster's unlabeled samples, with the number of neighbors k adjusted according to the distribution of the labeled samples. This overcomes the problem that minority-class samples are often misclassified as the majority class. Theoretical analysis and experimental results show that the algorithm improves the recognition rate of minority flows when protocol flows are imbalanced, and that it can also discover new applications.
2. Because the flow features are redundant and can be divided into several independent feature subsets, a network traffic classification method based on a random subspace ensemble classifier (RSEC) is proposed. The method first performs feature selection with a forward-selection wrapper to construct the feature set for ensemble classification. It then generates feature subsets by random selection and trains a different classifier on each subset. Finally, it integrates the classification results with a voting scheme that combines absolute majority and relative majority voting. Experimental results show that, compared with KMkNN, this method further improves the recognition rate of both the minority class and the majority class.
3. Based on the actual network environment, an offline traffic analysis system based on machine learning is designed and implemented in C#. The system uses Wireshark to capture data online and save it locally. The flow feature generation module reconstructs flows from five-tuple information. The sample labeling module combines port number matching, payload feature matching, and manual annotation to label the training samples. The classification module provides five methods: C4.5, NBK, semi-supervised k-means, KMkNN, and RSEC. Finally, the system is tested on real data collected in the laboratory, which verifies its effectiveness.
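
The two-level KMkNN procedure described in item 1 can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not state how k is adjusted, so the rule used here (cap k at the size of the rarest labeled class inside the cluster) is an assumption, as are the deterministic farthest-first seeding, the function names, and the toy two-dimensional flows. A cluster with no labeled sample returns None, standing in for a candidate new application.

```python
import math
from collections import Counter

def farthest_first(points, n_clusters):
    """Deterministic seeding (an assumption): start at the first point, then
    repeatedly add the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < n_clusters:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans(points, n_clusters, n_iter=20):
    """Plain Lloyd iterations; returns one cluster index per point."""
    centers = farthest_first(points, n_clusters)
    assign = []
    for _ in range(n_iter):
        assign = [min(range(n_clusters), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        for c in range(n_clusters):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign

def kmknn_predict(labeled, unlabeled, n_clusters, k_max=5):
    """labeled: list of (features, label); unlabeled: list of features.
    Returns one label per unlabeled flow, or None if its cluster has no labels."""
    points = [x for x, _ in labeled] + list(unlabeled)
    assign = kmeans(points, n_clusters)
    predictions = []
    for i, x in enumerate(unlabeled):
        cluster = assign[len(labeled) + i]
        # zip stops at len(labeled), so only labeled samples are inspected here.
        local = [(f, y) for (f, y), a in zip(labeled, assign) if a == cluster]
        if not local:
            predictions.append(None)  # unlabeled cluster: possible new application
            continue
        counts = Counter(y for _, y in local)
        # Assumed adjustment rule: k never exceeds the rarest class's sample count,
        # so a minority class cannot be outvoted purely by majority-class numbers.
        k = min(k_max, min(counts.values()))
        nearest = sorted(local, key=lambda fy: math.dist(x, fy[0]))[:k]
        predictions.append(Counter(y for _, y in nearest).most_common(1)[0][0])
    return predictions

labeled = [((0.0, 0.0), "web"), ((0.2, 0.1), "web"), ((0.1, 0.3), "web"),
           ((1.0, 0.2), "dns"), ((10.0, 10.0), "p2p"), ((10.2, 9.9), "p2p")]
unlabeled = [(0.9, 0.2), (0.1, 0.1), (10.1, 10.0), (20.0, -5.0)]
print(kmknn_predict(labeled, unlabeled, n_clusters=3))
# → ['dns', 'web', 'p2p', None]
```

In this toy data the single labeled "dns" flow shares a cluster with three "web" flows. A fixed k of 3 would let the two nearest "web" neighbors outvote it, while the assumed rule sets k to 1 and keeps the minority label.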
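
The random subspace ensemble in item 2 can be sketched in the same hedged way. The abstract names neither the base classifier nor the exact way absolute and relative majority voting are combined, so the following are assumptions made only for illustration: a 1-nearest-neighbor base learner, a vote that reports whether the winner reached an absolute majority (more than half of the votes) or only a relative majority (plurality), and all function names. The forward-selection wrapper step is omitted.

```python
import math
import random
from collections import Counter

def make_subspaces(n_features, subset_size, n_members, seed=0):
    """Draw one random subset of feature indices per ensemble member."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_features), subset_size))
            for _ in range(n_members)]

def nn_predict(train, x, dims):
    """Assumed base learner: 1-nearest neighbor restricted to the features in dims."""
    def project(v):
        return [v[d] for d in dims]
    _, label = min(train, key=lambda fy: math.dist(project(fy[0]), project(x)))
    return label

def combined_vote(votes):
    """Return (label, rule). rule is 'absolute' when the winner holds more than
    half of the votes, otherwise 'relative' (plurality). Ties go to the label
    that appears first in the vote list."""
    label, top = Counter(votes).most_common(1)[0]
    rule = "absolute" if top > len(votes) / 2 else "relative"
    return label, rule

def rsec_predict(train, x, subspaces):
    """Collect one vote per feature subset, then combine the votes."""
    return combined_vote([nn_predict(train, x, dims) for dims in subspaces])

train = [((0.1, 0.2, 0.0, 0.1), "a"), ((0.3, 0.1, 0.2, 0.0), "a"),
         ((5.0, 5.1, 4.9, 5.2), "b"), ((5.2, 4.8, 5.1, 5.0), "b")]
subspaces = make_subspaces(n_features=4, subset_size=2, n_members=5)
print(rsec_predict(train, (0.2, 0.1, 0.1, 0.1), subspaces))
print(combined_vote(["a", "a", "b", "c"]))  # 2 of 4 votes: plurality only
```

Reporting which rule decided the vote lets a caller treat plurality-only decisions as lower confidence, which is one plausible reason to combine the two schemes.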
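
The flow reconstruction step in item 3 groups captured packets by their five-tuple (source address, destination address, source port, destination port, transport protocol). The sketch below is in Python rather than the C# used by the thesis system, and it rests on assumptions: the packet dictionary field names are hypothetical, both directions of a conversation are merged into one flow, and the four statistics are examples rather than the feature set of the thesis.

```python
from collections import defaultdict

def flow_key(pkt):
    """Direction-independent five-tuple key (bidirectional merging is an assumption)."""
    a = (pkt["src_ip"], pkt["src_port"])
    b = (pkt["dst_ip"], pkt["dst_port"])
    lo, hi = sorted([a, b])
    return (lo, hi, pkt["proto"])

def build_flows(packets):
    """Group packets into flows keyed by their normalized five-tuple."""
    flows = defaultdict(list)
    for pkt in packets:
        flows[flow_key(pkt)].append(pkt)
    return flows

def flow_features(pkts):
    """A few example statistical features for one flow."""
    sizes = [p["size"] for p in pkts]
    times = [p["time"] for p in pkts]
    return {
        "packets": len(pkts),
        "bytes": sum(sizes),
        "mean_size": sum(sizes) / len(sizes),
        "duration": max(times) - min(times),
    }

packets = [
    {"src_ip": "10.0.0.1", "src_port": 50000, "dst_ip": "10.0.0.2",
     "dst_port": 80, "proto": "TCP", "size": 60, "time": 1.0},
    {"src_ip": "10.0.0.2", "src_port": 80, "dst_ip": "10.0.0.1",
     "dst_port": 50000, "proto": "TCP", "size": 1500, "time": 1.5},
    {"src_ip": "10.0.0.3", "src_port": 53000, "dst_ip": "10.0.0.2",
     "dst_port": 53, "proto": "UDP", "size": 80, "time": 2.0},
]
for key, pkts in build_flows(packets).items():
    print(key, flow_features(pkts))
```

The request and the reply between 10.0.0.1 and 10.0.0.2 fall into one TCP flow, and the UDP packet forms a second flow. Each resulting feature dictionary corresponds to one sample for the classifiers.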
Keywords/Search Tags: Traffic Classification, Machine Learning, Semi-supervised Learning, Imbalanced Data, Ensemble Learning