Research On Traffic Classification Algorithms Based On Machine Learning

Posted on:2014-12-06

Degree:Doctor

Type:Dissertation

Country:China

Candidate:G Lu

Full Text:PDF

GTID:1318330536980973

Subject:Information security

Abstract/Summary:

Network traffic classification helps Internet service providers optimize network bandwidth, improve quality of service, charge for specific applications, and detect malicious traffic. At present, two main issues challenge traffic classification techniques. First, more and more applications use dynamic ports and payload encryption to evade detection, which undermines classification accuracy. Second, the rapid growth of throughput at the network border places stricter real-time requirements on traffic classifiers. This dissertation addresses these issues with machine learning techniques. It improves the accuracy, stability, and real-time performance of traffic classification from two angles: optimizing feature selection and improving classifiers.

The dissertation first introduces the main techniques in traffic classification and summarizes the challenges and research progress. It then addresses three problems in turn: the class imbalance problem, the bias of statistical features, and the automatic generation of payload-based signatures for LW_DPI. Several algorithms are proposed to improve classification accuracy, stability, and real-time performance. The main contents of the dissertation are as follows:

(1) Under protocol imbalance, Machine Learning (ML) based classifiers achieve lower True Positive Rates (TPRs) for small classes, and the dynamic changes of the flows generated by large classes make ML based classifiers unstable. To address this problem, an improved Bagging algorithm is proposed. First, by means of a single-factor experiment design, the dissertation determines the size of the observation window for TCP flows. Then it compares three algorithms: the C4.5 decision tree, Support Vector Machine (SVM), and Naïve Bayesian Kernel (NBK). Lastly, the improved Bagging algorithm is applied to traffic classification. The experimental results show that, compared with the C4.5 decision tree, SVM, and NBK, the improved Bagging algorithm achieves accurate and stable classification results. Moreover, it achieves satisfactory TPRs for the small classes. Its classification time and training time are short, which makes it suitable for online traffic classification.

(2) Because the number of traffic flows is imbalanced across classes, ML based classifiers are prone to misclassify flows as the traffic type that accounts for the majority of flows on the Internet. A high-dimensional feature space worsens the problem. To address this issue, a novel feature selection metric named Weighted Symmetrical Uncertainty (WSU) is proposed. We design a hybrid feature selection algorithm named WSU_AUC, which prefilters most features with the WSU metric and then uses a wrapper method to select features for a specific classifier with the Area Under the ROC Curve (AUC) metric. Moreover, to overcome the impact of dynamic traffic flows on feature selection, we propose an algorithm named SRSF that Selects the Robust and Stable Features from the results produced by WSU_AUC. We evaluate our approaches using three classifiers on traces captured from entirely different networks. Experimental results validate the efficiency and effectiveness of our algorithms. We also detail the robust and accurate flow statistics.

(3) When ML based classifiers classify several applications, some statistical features may increase the identification accuracy of some applications while reducing that of others. This dissertation calls this effect the bias of statistical features. To address this issue, we propose an accurate Traffic Classification Framework based on Ensemble Clustering (TCFEC), which is composed of multiple classifiers and a decision part. Each of the multiple classifiers is built by clustering in a different feature subspace with the k-means algorithm. A normalized mutual information based algorithm is proposed for optimizing the parameter of the k-means algorithm. The decision part handles inconsistent classification results between base classifiers. It applies two decision approaches, an SVM-based approach and a hash-based approach, to select the most accurate base classifier. On the public dataset, our experimental results show that TCFEC classifies traffic flows accurately and stably.

(4) Automatic signature generation approaches have been widely applied in recent traffic classification. However, they are not suitable for LW_DPI, since the signatures they generate are matched by searching the whole application data. In this dissertation, based on LW_DPI schemes, we present two Hierarchical Clustering algorithms, HC_TCP and HC_UDP, which generate byte signatures from TCP and UDP packet payloads respectively. In particular, HC_TCP and HC_UDP can extract the positions of byte signatures in packet payloads. In addition, to handle the case in which byte signatures cannot be derived, we develop an algorithm for generating bit signatures. Compared with the LASER algorithm and the Suffix Tree based algorithm (ST), our algorithms are better in terms of both classification accuracy and speed. Moreover, the experimental results indicate that, as long as the application-protocol header exists, it is possible to automatically derive reliable and accurate signatures together with their positions in packet payloads.
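As background for the Weighted Symmetrical Uncertainty metric in item (2), the sketch below computes the standard, unweighted symmetrical uncertainty, SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), between a discrete feature and the class label. The abstract does not specify the weighting scheme that turns SU into WSU, so no weighting is attempted here. The function names and the toy flow data are hypothetical and chosen only for illustration; this is not the dissertation's implementation.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy, in bits, of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """Standard SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        # Both variables are constant, so neither carries any information.
        return 0.0
    # Mutual information via the identity I(X; Y) = H(X) + H(Y) - H(X, Y).
    h_xy = entropy(list(zip(x, y)))
    return 2.0 * (h_x + h_y - h_xy) / (h_x + h_y)


# Hypothetical toy data: four flows, two classes, two discretized features.
labels = ["web", "web", "p2p", "p2p"]
informative = [0, 0, 1, 1]      # perfectly aligned with the label
uninformative = [0, 1, 0, 1]    # independent of the label

su_informative = symmetrical_uncertainty(informative, labels)
su_uninformative = symmetrical_uncertainty(uninformative, labels)
```

A filter stage of this kind would rank features by their score against the class label and keep the highest-ranked ones before a wrapper method evaluates the survivors with a specific classifier. Continuous flow statistics would need to be discretized first, a step this sketch leaves out.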

Keywords/Search Tags:

traffic classification, class imbalance, feature selection, machine learning, ensemble clustering, automatic signature generation

Related items

1	Studying Class Imbalance Characteristics And Classification Methods On Internet Traffic Flows
2	An Analysis Of Combining Ensemble Sampling And Feature Selection Methods For Imbalanced Multi-Class Internet Traffic Classification
3	Research On Imbalanced Network Traffic Classification Algorithm Based On Supervised Learning
4	Research On Class Imbalanced Network Encrypted Traffic Identification
5	Instant Messaging Traffic Classification Technology Based On Machine Learning
6	Research Of Ensemble Classification Methods For Class-imbalance And Cost-sensitive Datasets
7	Two-class Imbalanced Data Classification Based On Diverse Data Generation And Ensemble Learning
8	Study On Ensemble One-class Classification And Its Applications
9	Research On Classification Of P2P Traffic Based On Machine Learning
10	Research On The Technology For Network Traffic Identification Based On Machine Learning