Studying Class Imbalance Characteristics And Classification Methods On Internet Traffic Flows

Posted on: 2014-08-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Liu
GTID: 1268330425476710
Subject: Computer application technology
Abstract/Summary:
Internet traffic classification is an important foundation for network management, quality-of-service guarantees, network accounting, network security, and related tasks. Traditional traffic classification methods struggle to keep pace with the rapid development of network applications, and classification based on machine learning (ML) is a promising alternative. However, traffic classifiers are usually optimized for high overall accuracy, which does not take into account the class imbalance of Internet traffic datasets. Classification performance is therefore biased toward the majority classes and neglects the minority classes. In Internet traffic, some minority classes contain signaling flows or real-time communication flows, and their classification performance influences communication quality and user experience. Other minority classes carry a large number of bytes, and their classification performance affects network planning and bandwidth allocation.

At present, systematic research on class imbalance characteristics and the corresponding classification methods for Internet traffic is lacking. This dissertation observes the class distribution of Internet traffic datasets in a selected feature space, analyzes their imbalance characteristics, and then studies Internet traffic classification methods from three angles: data resampling, feature selection, and classification algorithms. The main contributions are as follows.

(1) Class imbalance characteristics of Internet traffic datasets. The class imbalance characteristics of Internet traffic datasets are studied from external and internal aspects. A comparison of the flow count and byte count of each traffic class shows that traffic datasets usually contain multiple majority classes and multiple minority classes, that there is a large gap between the flow counts of majority and minority classes, that a minority class may carry a large number of bytes, and that some classes show an obvious imbalance between large flows and small flows. The distribution of flow samples in the feature space shows that samples of the same class usually form several sub-concepts, some of which contain only a small number of samples, and that the samples of one class overlap those of other classes. An analysis of how these characteristics influence classification performance shows that the presence of multiple sub-concepts is more closely correlated with classification performance than flow-number imbalance or class overlap.

(2) Cost-sensitive learning for traffic datasets with multiple minority classes. When a cost-sensitive learning algorithm is applied to classify traffic flows, a cost matrix based on flow rate does not suit difficult classes, that is, classes that have more flows but are nevertheless hard to identify. This dissertation uses weights to improve the cost matrix. By analyzing the relationship between the degree of class imbalance and the room for increasing the misclassification cost, an evaluation metric for the degree of class imbalance and a method for calculating the weights are proposed. The method aims to appropriately increase the weights of difficult classes without significantly degrading the classification performance of the majority classes.

(3) Data resampling method for Internet traffic datasets. A traffic dataset may exhibit several imbalance-related factors at once, namely flow-number imbalance, class overlap, multiple sub-concepts, and small disjuncts. To handle these problems simultaneously, a hierarchical data resampling method named PSC (partition, sampling and combining) is proposed. First, the original traffic dataset is partitioned into multiple disjoint, dense subsets to reduce the number of sub-concepts. Oversampling is then performed on each cluster, which handles small disjuncts by adding flow samples for minority classes. Next, a heuristic undersampling method is applied to each class, using rules devised for removing majority-class flow samples, so as to alleviate class overlap. PSC can thus build training subsets with lower within-class dispersion, class overlap, and class imbalance.

(4) Selection algorithm for Internet traffic flow features. Considering multiple sub-concepts, class overlap, and multiple minority classes, a balanced feature selection (BFS) algorithm is proposed. To select features under which the flow samples of a class have lower dispersion, a local correlation metric is proposed to evaluate the certainty of a feature over the flow samples of a class. To select features under which the flow samples of different classes overlap less, a global correlation metric is applied to evaluate the certainty of the class variable given a feature. Based on the local and global correlation of each feature, a search algorithm is proposed that selects, for each class, a locally correlated feature that also has high global discrimination power. As a result, the selected feature subset includes features that help discriminate minority classes.

(5) Classification methods for large flows. An imbalance between large flows and small flows exists in some classes, which may cause the classifier to neglect the learning of large flows. The flow-number imbalance between minority and majority classes may cause the classifier to neglect the classification performance of minority classes that carry a large number of bytes. Both cases can make large flows difficult to classify and lead to low byte accuracy. To handle the imbalance between small flows and large flows, a flow size modularization method based on information gain ratio (FSMGR) is proposed. With the objective of minimizing the data complexity of large flows, it searches for a partition threshold (related to bytes). The original traffic training set is partitioned into large-flow and small-flow subsets according to this threshold, and each subset is used to train its own classifier. In this way the large flows are emphasized and the classification problem becomes easier. To handle the imbalance between minority and majority classes, the PSC method in (3) is improved (named BPSC) to alleviate flow-number imbalance while retaining all large flows, and a boosting ensemble learning algorithm is used to improve the stability of the classifier.
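
The threshold search in (5) can be illustrated with a small Python sketch. The abstract does not give the exact FSMGR criterion, so this example makes an assumption: it scores each candidate byte threshold with a C4.5-style information gain ratio of the binary split (small vs. large flows) against the class labels and keeps the best one. The function names (`entropy`, `gain_ratio`, `best_byte_threshold`) and the toy data are hypothetical and are not taken from the dissertation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(sizes, labels, threshold):
    """Gain ratio of the binary split 'size <= threshold' vs. 'size > threshold'."""
    small = [y for s, y in zip(sizes, labels) if s <= threshold]
    large = [y for s, y in zip(sizes, labels) if s > threshold]
    if not small or not large:
        return 0.0  # a one-sided split carries no information
    n = len(labels)
    p_small, p_large = len(small) / n, len(large) / n
    info_gain = entropy(labels) - p_small * entropy(small) - p_large * entropy(large)
    split_info = -(p_small * math.log2(p_small) + p_large * math.log2(p_large))
    return info_gain / split_info


def best_byte_threshold(sizes, labels):
    """Pick the midpoint between adjacent distinct sizes with the highest gain ratio."""
    distinct = sorted(set(sizes))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(candidates, key=lambda t: gain_ratio(sizes, labels, t))


# Made-up flows: byte counts and class labels, for illustration only.
flow_bytes = [200, 350, 500, 800, 50_000, 80_000, 120_000, 300_000]
flow_class = ["web", "dns", "web", "dns", "p2p", "p2p", "p2p", "p2p"]

threshold = best_byte_threshold(flow_bytes, flow_class)

# Partition the training set; each subset would then train its own classifier.
small_set = [(b, c) for b, c in zip(flow_bytes, flow_class) if b <= threshold]
large_set = [(b, c) for b, c in zip(flow_bytes, flow_class) if b > threshold]
```

On this toy data the chosen threshold separates the four large p2p flows from the four small web and dns flows. In a real setting the two subsets would be passed to separate classifiers, so that large flows are not swamped by the far more numerous small flows.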
Keywords/Search Tags: Internet traffic classification, class imbalance characteristics, data resampling, feature selection, cost-sensitive learning, ensemble learning