The Improvement and Application of the SMOTE Algorithm for Unbalanced Data Sampling

Posted on: 2016-11-16
Degree: Master
Type: Thesis
Country: China
Candidate: B Chen
Full Text: PDF
GTID: 2308330464968532
Subject: Computer application technology
Abstract/Summary:
An unbalanced data set is one in which the numbers of samples in the different classes differ greatly. Traditional data mining algorithms handle unbalanced data sets poorly, yielding low accuracy and poor classification of the minority class. The SMOTE algorithm is a preprocessing method for unbalanced data sets. However, its sampling is not very effective, it blurs the boundary between the positive and negative classes, and it alters the distribution of the original data. This dissertation improves the SMOTE algorithm. The main contributions are as follows:

(1) We investigate optimization strategies for the SMOTE algorithm and propose a new variant, KM-SMOTE, based on K-means clustering. KM-SMOTE first clusters the minority data set and then uses the cluster centers and the clustered data points as the basis for sampling. To address the problems of SMOTE, the improved KM-SMOTE sampling formula replaces the original SMOTE sampling formula and confines the newly generated samples to the minority region. Experimental results show that KM-SMOTE improves classification accuracy on minority-class samples.

(2) To further strengthen the sampling ability of SMOTE, we propose a second improved algorithm, RM-SMOTE, which addresses the over-fitting problem. This algorithm is based on the cluster centers and interpolates random values within an h-dimensional space, which further narrows the interpolation region and makes the algorithm more reasonable. Experimental results show that RM-SMOTE has advantages in classifying unbalanced data sets and remains stable across different data sets.

(3) We apply the algorithm to network intrusion detection and verify its effectiveness through experiments on UCI data sets and network intrusion data.
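To make the difference between the two sampling ideas concrete, here is a minimal Python sketch. The abstract does not give the KM-SMOTE sampling formula, so `center_based_oversample` is only an assumption about what sampling between a cluster center and a cluster member could look like; it is not the thesis's method. All function names, the toy K-means routine, and the parameters (`n_new`, `k`, `k_clusters`) are hypothetical and chosen for illustration. The input needs at least two minority points.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def interpolate(start, end, rng):
    """Random point on the segment from start towards end."""
    gap = rng.random()  # uniform in [0, 1)
    return [s + gap * (e - s) for s, e in zip(start, end)]


def smote(minority, n_new, k, rng):
    """Classic SMOTE: interpolate between a minority sample and one of
    its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sq_dist(x, p),
        )[:k]
        synthetic.append(interpolate(x, rng.choice(neighbours), rng))
    return synthetic


def kmeans(points, k, rng, iters=20):
    """Tiny K-means for illustration; returns (centers, clusters)."""
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        # Each center becomes the mean of its members (kept if empty).
        centers = [
            [sum(col) / len(c) for col in zip(*c)] if c else centers[j]
            for j, c in enumerate(clusters)
        ]
    return centers, clusters


def center_based_oversample(minority, n_new, k_clusters, rng):
    """ASSUMED cluster-center variant (not the thesis's formula):
    interpolate between a cluster center and a member of that cluster,
    so new points stay inside the cluster's convex hull."""
    centers, clusters = kmeans(minority, k_clusters, rng)
    populated = [(c, m) for c, m in zip(centers, clusters) if m]
    synthetic = []
    for _ in range(n_new):
        center, members = rng.choice(populated)
        synthetic.append(interpolate(center, rng.choice(members), rng))
    return synthetic
```

A call such as `center_based_oversample(minority, n_new=10, k_clusters=2, rng=random.Random(0))` returns ten synthetic minority points. Because every new point is a convex combination of one cluster's members, it cannot fall between two distant minority clusters. Classic SMOTE can place points there when `k` is large enough for a sample's neighbours to include points from another cluster.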
Keywords/Search Tags: Unbalanced data set, SMOTE, h-dimensional sphere, K-means