
Research On Oversampling Algorithm Of Unbalanced Data Set

Posted on: 2021-02-05
Degree: Master
Type: Thesis
Country: China
Candidate: T Y Zhang
Full Text: PDF
GTID: 2438330620472583
Subject: Computer software and theory
Abstract/Summary:
An imbalanced data set is characterized by significant differences in the number of samples across classes. Basic classifiers trained on imbalanced data sets usually produce predictions biased towards the majority class. This bias arises because traditional classification algorithms tend to optimize overall accuracy on the training set, and on imbalanced data the majority class dominates overall accuracy. Methods for classifying imbalanced data sets therefore fall into two main categories. The first changes the imbalance at the data level through resampling algorithms; the second works at the algorithm level, modifying the classification algorithm so that it handles imbalanced data sets better. Based on a study and analysis of current classification methods for imbalanced data sets, this thesis proposes improvements to classic algorithms as well as new oversampling algorithms:

(1) Improved SMOTE (improved synthetic minority over-sampling technique). SMOTE generates each synthetic sample on the line segment whose endpoints are two minority samples, so the new minority samples it produces lack diversity. Improved SMOTE instead uses more than two real minority samples to generate each new sample. Experimental results show that the improved SMOTE achieves a higher area under the curve (AUC) and better stability.

(2) An oversampling strategy based on K-means and the central limit theorem. Analysis of existing oversampling algorithms shows that an oversampling strategy can be roughly divided into two steps: grouping and synthesizing. In this thesis, the K-means clustering algorithm serves as the grouping strategy; the distribution of the samples is estimated from the statistics of the feature values within each cluster, and new sample points are generated from that distribution. This grouping-and-synthesis approach reduces the probability that new sample points fall within the range of other classes, while also eliminating the imbalance of the data set and improving classification performance.

(3) Solving the machine learning problem of imbalanced data sets with a combined strategy. Combining data-level and algorithm-level methods further optimizes the classification of imbalanced data sets. Resampling increases the classifier's ability to discriminate the minority class, and boosting can greatly improve the overall performance of the classifier. Building on the two oversampling algorithms proposed in this thesis, a boosting algorithm based on decision trees that use the Gini coefficient is proposed. Experiments show the excellent performance of the algorithm.
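
The interpolation idea behind item (1) can be sketched in a few lines of Python. The first function below follows the classic SMOTE rule of picking a point on the segment between a minority sample and one of its nearest minority neighbours. The second is only an illustrative guess at what combining more than two real samples could look like: it draws a random convex combination (Dirichlet weights) of a base sample and several of its neighbours. The abstract does not specify the actual Improved SMOTE rule, so the function names, the parameters k and m, and the Dirichlet weighting are all assumptions, not the thesis's method.

```python
import numpy as np


def _nearest_neighbours(minority, k):
    """Indices of the k nearest minority neighbours of every minority sample."""
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour
    return np.argsort(dists, axis=1)[:, :k]


def smote_pair(minority, n_new, k=5, rng=None):
    """Classic SMOTE: a random point on the segment between two minority samples."""
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    k = min(k, n - 1)
    neighbours = _nearest_neighbours(minority, k)
    base = rng.integers(0, n, size=n_new)                      # base samples
    other = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour each
    gap = rng.random((n_new, 1))                               # position on the segment
    return minority[base] + gap * (minority[other] - minority[base])


def smote_multi(minority, n_new, k=5, m=3, rng=None):
    """Hypothetical variant (assumption): convex combination of m real samples."""
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    k = min(k, n - 1)
    m = min(m, k + 1)  # the base sample plus at most k neighbours
    neighbours = _nearest_neighbours(minority, k)
    out = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        b = rng.integers(0, n)
        picks = rng.choice(neighbours[b], size=m - 1, replace=False)
        idx = np.concatenate(([b], picks))
        weights = rng.dirichlet(np.ones(m))  # non-negative weights that sum to 1
        out[i] = weights @ minority[idx]
    return out
```

A convex combination of m points can land anywhere inside their convex hull rather than only on a segment, which is one plausible way to obtain the extra diversity the abstract mentions.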
Keywords/Search Tags: Imbalanced dataset, Over-sampling, Sample synthesis, Classification