An Analysis Of Combining Ensemble Sampling And Feature Selection Methods For Imbalanced Multi-Class Internet Traffic Classification

Posted on: 2020-07-09
Degree: Master
Type: Thesis
Country: China
Candidate: P L O e u n g P h u y l a
GTID: 2428330596968182
Subject: Computer Science and Technology
Abstract/Summary:
Internet traffic classification plays an important role in Internet management and security. In recent years, machine learning (ML)-based classification has gained importance as payload-based and port-based approaches have become less effective. However, the imbalanced distribution of Internet traffic severely degrades the performance of ML techniques, because training on it generally yields a biased classifier that achieves higher accuracy on the majority class(es) and lower accuracy on the minority class(es). Several algorithms have been proposed to handle two-class imbalance, yet multi-class imbalance, in which there can be multiple minority classes and multiple majority classes, arises often and is more difficult to solve.
In this work, we propose two new ensemble sampling approaches that combine intelligent under-sampling and over-sampling methods with the Decision Tree (DT) and Random Forest (RF) algorithms to deal with highly imbalanced multi-class learning. The first approach, ADCUT, combines the modified Adaptive Synthetic (MADASYN) over-sampling method with clustering-based under-sampling (CUT) based on mini-batch K-means, both applied to the training dataset before training DT. The second approach, ADTO, improves performance by combining MADASYN and the Tomek Link (Tomek) under-sampling technique with the RF algorithm. ADTO alters the data distribution of each bootstrap sample of RF to make the minority class(es) easier to learn for each base decision tree. In both methods, MADASYN is a modified ADASYN algorithm proposed in this paper, which adds a KNN-based noise filter to exclude noisy samples from the minority classes before generating the synthetic samples. CUT and Tomek serve as the under-sampling methods. CUT is proposed to mitigate the between-class and within-class imbalance problems by considering each subspace of the majority class during under-sampling. Tomek removes the majority samples that lie along the decision boundary, making the boundaries between classes more distinct. Tomek is used instead of CUT in the second method because running CUT on each bootstrap sample of RF is too time-consuming. The two proposed sampling methods aim to handle the multi-class imbalance and concept-drift problems by introducing new knowledge (samples) through MADASYN and removing uninformative majority samples using CUT and Tomek. ADCUT is suggested for network environments with limited computing resources or high real-time requirements, whereas ADTO is suggested for resource-rich network environments or high-performance requirements. The parameters of each proposed method are thoroughly investigated to guide further work on the subject.
In addition, the Ensemble Feature Selection (EFS) method is proposed to further enhance classification performance by removing redundant and irrelevant features. EFS is built on four existing types of feature selection (FS) methods and a wrapper method. The initial feature set is first filtered through the four FS methods to obtain a suboptimal feature subset. EFS then applies the wrapper method, with DT as the guiding classifier, to this subset to select the final optimal subset, using Area Under Curve (AUC) as the evaluation metric.
In this paper, the classifier is built from NetFlow records, which can be extracted at the flow level from NetFlow-enabled devices. This reduces cost and extra work for network personnel compared with the packet-level features used in most studies, whose extraction requires an additional device for collection and computation. The paper compares the proposed methods with popular existing sampling and FS methods using four evaluation metrics: overall accuracy (OA), geometric mean (G-mean), F-measure, and AUC. Experiments on four imbalanced real-world network traffic datasets of different sizes show that the proposed methods achieve higher performance on the minority class(es) than existing methods without affecting overall classification performance.
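To make the two cleaning steps described in the abstract more concrete, the following is a minimal sketch of a KNN-based noise filter for minority samples followed by Tomek-link removal of majority samples. It is not the thesis's implementation. The function names, the toy two-class data, and the noise rule (a minority sample counts as noise when none of its k nearest neighbours shares its class) are assumptions made for illustration. The synthetic-sample generation step of ADASYN and the multi-class handling are omitted.

```python
from math import dist  # Euclidean distance, Python 3.8+


def nearest(points, i, k):
    """Indices of the k nearest neighbours of points[i], excluding i itself."""
    order = sorted((j for j in range(len(points)) if j != i),
                   key=lambda j: dist(points[i], points[j]))
    return order[:k]


def knn_noise_filter(points, labels, minority, k=3):
    """Minority indices to keep (assumed rule: at least one of the
    k nearest neighbours belongs to the same minority class)."""
    return [i for i, y in enumerate(labels)
            if y == minority
            and any(labels[j] == minority for j in nearest(points, i, k))]


def tomek_links(points, labels):
    """Pairs (i, j), i < j, of mutual nearest neighbours with different labels."""
    nn = [nearest(points, i, 1)[0] for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]


# Hypothetical toy data: a majority cluster near the origin, a minority
# cluster near (5, 5), one borderline majority point (index 7) and one
# minority point buried in the majority cluster (index 8).
points = [(0.0, 0.0), (0.0, 1.1), (1.0, 0.0), (1.1, 1.2),
          (5.0, 5.0), (5.0, 6.0), (6.2, 5.0), (4.4, 4.5), (0.4, 0.3)]
labels = ["maj"] * 4 + ["min"] * 3 + ["maj", "min"]

# Step 1: drop noisy minority samples before any over-sampling would happen.
kept_minority = knn_noise_filter(points, labels, "min", k=3)
clean = [i for i, y in enumerate(labels) if y != "min" or i in kept_minority]
clean_points = [points[i] for i in clean]
clean_labels = [labels[i] for i in clean]

# Step 2: find Tomek links on the cleaned data and drop their majority member.
links = tomek_links(clean_points, clean_labels)
drop = {clean[p] for pair in links for p in pair if clean_labels[p] == "maj"}
final_indices = [i for i in clean if i not in drop]
```

In this toy run the filter discards the buried minority point and the Tomek step discards the borderline majority point, which is the effect the abstract attributes to each step. The brute-force neighbour search is quadratic in the number of samples, so a real traffic dataset would call for an indexed nearest-neighbour structure.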
Keywords/Search Tags:Internet traffic classification, multi-class imbalance learning, ensemble sampling, feature selection, NetFlow feature