
Research On Key Issues In Internet Traffic Classification

Posted on: 2016-03-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: R Y Wang
GTID: 1108330479993403
Subject: Computer application technology
Abstract/Summary:
As network applications and devices have developed, the Internet carries more and more traffic every day. This traffic plays an important role in social development, but it also poses many challenges for network operation, such as bandwidth management, network security, and network accounting. To handle these challenges, network administrators need techniques that identify the application type of network traffic. Internet traffic classification has therefore become an important research direction. It provides decision support for network management, e.g. controlling P2P traffic, ensuring QoS (Quality of Service) for interactive applications, and blocking abnormal traffic.

Internet traffic classification methods have evolved along with network technology. With the widespread use of dynamic port numbers, port number camouflage, and payload encryption, the traditional methods (port number mapping and payload signature matching) are gradually becoming ineffective. Machine-learning-based traffic classification is a promising alternative and has attracted much research in recent years. However, it still faces several challenges, including class imbalance and concept drift, which degrade classifier performance, especially for minority classes. This dissertation proposes a new Internet traffic classification framework with sub-modules that handle class imbalance and concept drift, and devises new traffic classification methods for each module. The objective is to improve the classification performance of minority classes on dynamic traffic datasets. The main innovations and contributions are as follows.

(1) Fine-tuning resampling method for Internet traffic.
In multi-class traffic datasets, some application classes (majority classes) generate a large number of traffic flows, while others (minority classes) generate only a few. A machine-learning-based classifier trained on such data is biased toward classifying majority-class flows well and neglects the classification performance of minority classes. Minority classes, such as interactive applications, are nevertheless very important. To handle class imbalance, many existing papers in the traffic classification field randomly select the same number of flows for each class when preparing the training set, but this significantly distorts the data distribution of the traffic flows. This dissertation explores the class imbalance property of traffic datasets and proposes a new fine-tuning resampling method for the training set:
1) It samples an initial training set from the original training set according to the original data distribution.
2) It trains a classifier on the training set and tests the classification accuracy of each class on the remaining flows.
3) It selects a certain number of flows for each difficult class from the remaining flows and combines them with the initial training set to form a new training set.
4) It repeats the process: it trains a classifier on the new training set, tests the classification accuracy of each class, and selects a certain number of flows for each difficult class.
A method based on PAC (Probably Approximately Correct) learning is proposed to determine the number of flows to select in each iteration. It avoids the negative influence of labeling noise in the traffic flow set. Experimental results show that the new resampling method improves the classification accuracy of minority classes without significantly changing the original traffic data distribution.

(2) Internet traffic classification method with data cleaning.
When traffic classification is used for real network management activities, such as blocking P2P packets, byte accuracy is more important than flow accuracy. Most classifiers, however, pursue only high flow accuracy. Moreover, bytes are unevenly distributed across flows (elephant flows are far fewer than mice flows), which leads to low byte accuracy. To handle this problem, this dissertation first analyzes the factors correlated with low byte accuracy and then proposes an Internet traffic classification method based on data cleaning. When preprocessing the training set, it uses heuristic rules to remove the mice flows of majority classes that lie near the decision boundary. A theoretical analysis proves that this reduces the complexity of the boundary. Experimental results show that the method significantly improves byte accuracy without sacrificing flow accuracy.

(3) Internet traffic classification framework based on PCDD (Per-class Concept Drift Detect).
Most studies have shown that traffic classification methods can obtain high overall classification accuracy in a static setting. However, as network applications are updated, the application types and statistical flow features change continuously. A classifier cannot effectively classify future traffic collected long after its training set. This is the concept drift problem. This dissertation systematically explores concept drift on real-world traffic datasets. The results show that concept drift occurs more frequently in minority classes than in majority classes. Traditional concept drift detection methods rely only on the overall error rate and therefore have difficulty detecting the concept drift of minority classes. This dissertation proposes an Internet traffic classification framework that handles class imbalance and concept drift concurrently. A new method named PCDD is proposed to detect concept drift for each class, and a new training scheme is proposed to avoid updating the classifier too frequently.
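The idea of monitoring drift per class rather than through the overall error rate can be illustrated with a small sketch. The abstract does not give the statistic PCDD actually uses, so the code below substitutes the well-known DDM rule (flag drift when the error rate plus its standard deviation exceeds the best value seen so far by three standard deviations) and simply applies it separately to each true class. The class name `PerClassDriftMonitor`, its parameters, and the synthetic stream are all hypothetical.

```python
import math


class PerClassDriftMonitor:
    """DDM-style error monitoring kept separately for each true class.

    Illustrative stand-in only: not the dissertation's PCDD statistic.
    """

    def __init__(self, min_samples=30, threshold=3.0):
        self.min_samples = min_samples  # flows needed before testing a class
        self.threshold = threshold      # multiples of s_min that signal drift
        self.n = {}                     # labelled flows seen per class
        self.errors = {}                # misclassified flows per class
        self.best = {}                  # per class: (p, s) with the lowest p + s

    def update(self, true_label, predicted_label):
        """Record one labelled flow; return True if drift is flagged for its class."""
        c = true_label
        self.n[c] = self.n.get(c, 0) + 1
        self.errors[c] = self.errors.get(c, 0) + (predicted_label != true_label)
        n = self.n[c]
        if n < self.min_samples:
            return False
        p = self.errors[c] / n              # running error rate of class c
        s = math.sqrt(p * (1 - p) / n)      # its binomial standard deviation
        p_min, s_min = self.best.setdefault(c, (p, s))
        if p + s < p_min + s_min:
            self.best[c] = (p, s)           # new best operating point
            return False
        return p + s > p_min + self.threshold * s_min


# Synthetic stream: 20 "http" flows (always correct) per "ssh" flow.
# "ssh" is misclassified once in ten flows at first, then always after flow 40.
stream = []
for i in range(60):
    wrong = i >= 40 or i % 10 == 9
    stream.append(("ssh", "http" if wrong else "ssh"))
    stream.extend([("http", "http")] * 20)

monitor = PerClassDriftMonitor()
drifted = set()
for true_label, predicted in stream:
    if monitor.update(true_label, predicted):
        drifted.add(true_label)

overall_error = sum(t != p for t, p in stream) / len(stream)
print(drifted, round(overall_error, 3))
```

In this synthetic stream the overall error rate stays below 2% even though the minority class has become unclassifiable, which is the situation in which a detector based only on the overall error rate is slow to react.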
The framework first trains an initial classifier on a static training set. Then, while classifying Internet traffic, it detects concept drift using PCDD. If concept drift is detected, it updates the classifier according to the new training scheme. The framework can thus update the traffic classifier promptly when concept drift is detected in any class, and it alleviates class imbalance while updating the classifier.

(4) Abnormal traffic identification method based on information entropy.
The preceding parts of this dissertation mainly improve the classification accuracy of multiple minority classes. However, experimental results on real traffic datasets show that the classification performance of the attack class is difficult to improve, because abnormal behavior is complex and changes dynamically, and its flow samples are seriously insufficient. This dissertation therefore goes on to mine unique and stable features of abnormal traffic, from two aspects. The first is the direction of traffic flows: most abnormal traffic flows are one-way flows, also known as Internet background radiation traffic. These flows usually result from network scanning, malicious attacks, or misconfiguration. The second is host communication behavior: the traffic source distribution of inactive hosts (hosts that only receive one-way flows) differs greatly from that of normal active hosts. Based on these findings, this dissertation proposes a new abnormal traffic identification method. First, it proposes a metric that evaluates the randomness of traffic sources and an algorithm that checks whether the traffic sources are uniformly distributed; both are used to detect inactive IPs and malicious source IPs. The new method then identifies abnormal traffic according to communication behavior patterns.
On the IPv4 benchmark traffic dataset, the method achieves 99% precision in identifying abnormal traffic. The detected abnormal traffic can be used to augment the training flow samples of the abnormal class in supervised machine-learning-based traffic classification, or to extract new statistical flow features for improving classifiers. Both help improve the classification accuracy of abnormal traffic.
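As a rough illustration of an entropy-based randomness measure for traffic sources, the sketch below computes the Shannon entropy of the source IPs seen by one destination host and scales it by the maximum possible value for that number of flows. The abstract does not define the dissertation's actual metric, so the function name `normalized_source_entropy`, the normalization, and the example addresses (taken from documentation address ranges) are assumptions.

```python
import math
from collections import Counter


def normalized_source_entropy(source_ips):
    """Shannon entropy of the source-IP distribution, scaled to [0, 1].

    1.0 means every flow came from a different source; values near 0 mean
    the flows came from very few sources. Assumed metric, for illustration.
    """
    total = len(source_ips)
    if total < 2:
        return 0.0  # entropy is not informative for fewer than two flows
    counts = Counter(source_ips)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(total)


# Hypothetical inactive host: eight one-way flows, each from a different source.
inactive_host_sources = [f"198.51.100.{i}" for i in range(8)]
# Hypothetical active host: eight flows, seven of them from the same peer.
active_host_sources = ["203.0.113.5"] * 7 + ["203.0.113.9"]

inactive_score = normalized_source_entropy(inactive_host_sources)
active_score = normalized_source_entropy(active_host_sources)
print(round(inactive_score, 3), round(active_score, 3))
```

Under this assumed metric, a host whose incoming flows come from widely scattered sources scores close to 1, while a host that talks repeatedly to a few peers scores much lower.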
Keywords/Search Tags: Internet traffic classification, multi-class imbalance, multi-class concept drift, abnormal traffic identification, machine learning