
Research on Improvement and Parallelization of Classification Algorithms in Imbalanced Data Sets

Posted on: 2019-07-30
Degree: Master
Type: Thesis
Country: China
Candidate: L Wang
Full Text: PDF
GTID: 2348330563954504
Subject: Computer technology
Abstract/Summary:
Imbalanced data set classification refers to classification problems in which the classes of a data set differ greatly in sample size. Most traditional classification algorithms assume a uniform class distribution or equal misclassification costs, so they tend to misclassify minority class samples when dealing with imbalanced data. With the widespread use of the internet, the amount of information stored on networks has become massive. These massive data contain imbalanced data sets, which makes extracting information from them very challenging.

1. In SMOTE (Synthetic Minority Over-sampling Technique), only the nearest neighbors among minority class samples are considered when samples are synthesized, so the density of the minority class samples remains unchanged after oversampling. This thesis proposes an improved algorithm, NKSMOTE (New Kernel Synthetic Minority Over-Sampling Technique), to overcome this shortcoming of SMOTE. First, a nonlinear mapping function maps the samples to a high-dimensional kernel space, and the K nearest neighbors of each minority class sample are then computed from the whole sample set. In addition, different oversampling rates are set for different minority samples, according to the influence that the distribution of minority class samples has on classification performance, in order to change the imbalance ratio. In the experiments, several classical oversampling methods were compared with the proposed method, with Decision Tree (DT), error BackPropagation (BP) and Random Forest (RF) chosen as base classifiers. Experimental results on UCI data sets show that NKSMOTE achieves better classification performance.

2. Building on the ideas of the RareBoost and GMBoost algorithms, an imbalanced data classification algorithm named NIBoost is proposed, which combines cost-sensitive theory with oversampling. The oversampling algorithm is used to resample the data, and the classifiers are trained on the resulting data sets. The sample weights are then adjusted according to the class labels of the classification results. With Decision Tree and Naive Bayes used as weak classifiers, the experimental results on UCI data sets show that NIBoost has some advantages in classifying imbalanced data.

3. In real life, a certain number of imbalanced data sets exist in big data. Based on this, a parallel imbalanced data classification algorithm named PNIBoost is proposed, built on the MapReduce parallel computing framework. The experimental results on UCI data sets show that the algorithm handles imbalanced data better and also has good parallel performance.

4. An imbalanced data set classification system based on a browser/server (B/S) structure is constructed. All the oversampling and classification algorithms mentioned in this thesis are integrated into the system. A cluster management interface is also provided so that users can manage the cluster.
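For readers unfamiliar with the oversampling step that the first contribution builds on, the following is a minimal sketch of baseline SMOTE interpolation, not of NKSMOTE itself. It assumes Euclidean distance in the input space and draws neighbors only from the minority class, which is the behavior the abstract criticizes; the kernel mapping, the neighbor search over the whole sample set, and the per-sample oversampling rates of NKSMOTE are not reproduced here. The function name `smote_sketch` and its parameters are hypothetical and chosen for illustration only.

```python
import numpy as np


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Baseline SMOTE sketch (illustrative, not the thesis's NKSMOTE).

    Each synthetic point lies on the segment between a minority sample
    and one of its k nearest minority-class neighbours.
    """
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)  # a sample has at most n - 1 other neighbours

    # Pairwise squared Euclidean distances between minority samples.
    diff = minority[:, None, :] - minority[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a base sample and one of its neighbours for each synthetic point.
    base = rng.integers(0, n, size=n_synthetic)
    picked = neighbours[base, rng.integers(0, k, size=n_synthetic)]

    # Interpolate with a random gap in [0, 1) along the connecting segment.
    gap = rng.random((n_synthetic, 1))
    return minority[base] + gap * (minority[picked] - minority[base])


if __name__ == "__main__":
    X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
    print(smote_sketch(X_min, n_synthetic=4, k=3))
```

Because every synthetic point is a convex combination of two existing minority samples, the new points stay inside the region the minority class already occupies, which is one way to read the abstract's remark that the minority density is unchanged after oversampling.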
Keywords/Search Tags: Imbalance Dataset, SMOTE Algorithm, Cost Sensitive, MapReduce, Classification System