Research On Imbalanced Data Classification Learning Algorithm Based On Mixed Sampling Technique And Adaboost Principle

Posted on: 2019-11-02
Degree: Master
Type: Thesis
Country: China
Candidate: X Zhang
Full Text: PDF
GTID: 2428330566468205
Subject: Software engineering
Abstract/Summary:
Imbalanced data refers to data in which one class (hereinafter the majority class) has many more samples than the other classes (the minority classes). Traditional classification algorithms assume that the number of samples in each class is roughly balanced, but the samples in real data sets may not be. When classifying imbalanced data sets, traditional algorithms focus on overall classification accuracy, whereas practical applications sometimes care more about the classification accuracy of the minority class. This study analyzes imbalanced data sets from two aspects, sample distribution and sample features. It proposes an undersampling method based on clustering the sample distribution and an oversampling method based on the effect of features on sample categories, both aimed at balancing the number of samples in each class. The study then combines these methods with the AdaBoost algorithm and proposes an ensemble learning classification algorithm for imbalanced data sets.

First, by analyzing the characteristics of the sample distribution, an undersampling method based on clustering is proposed. The method clusters the majority-class samples of an imbalanced data set, uses the size of each cluster to estimate the amount of information that cluster carries, and applies different sampling strategies to clusters carrying different amounts of information. This removes outliers from the majority class and reduces the number of edge samples while keeping samples from within the class, and at the same time reduces the imbalance of the data set. The method is verified experimentally on UCI data sets, with AdaBoost and SVM as classifiers, and is compared with random undersampling.

Second, the study proposes an oversampling method for imbalanced data sets based on the effect of features on categories. In a data set, sample features affect each category differently, so each feature has a different degree of importance for each category. The features can therefore be grouped, and the minority class is oversampled according to the result of this feature grouping to balance the data. Experiments comparing the method with random oversampling and SMOTE show that it improves the recognition accuracy of minority-class samples.

Finally, building on the data-level mixed sampling methods above and the AdaBoost algorithm, an ensemble learning algorithm for imbalanced data sets is proposed. The method rebalances the data set and modifies how AdaBoost treats misclassified samples, further improving the recognition accuracy of the minority class.
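The abstract does not state the exact sampling rules, so the following Python sketch only illustrates the general idea of clustering-based undersampling of the majority class. Everything specific in it is an assumption made for illustration and not the thesis's method: the use of k-means, the farthest-point seeding, treating clusters smaller than `min_cluster_size` as outliers, giving each remaining cluster a quota proportional to its size, and keeping the samples nearest each cluster center to trim edge samples. All function and parameter names are hypothetical.

```python
import math


def farthest_point_init(points, k):
    """Deterministic seeding (an assumption of this sketch): start from the
    first point, then repeatedly add the point farthest from all centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means. Returns (labels, centers)."""
    centers = farthest_point_init(points, k)
    labels = []
    for _ in range(iters):
        # Assignment step: index of the nearest center for every point.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centers


def cluster_undersample(majority, n_keep, k=3, min_cluster_size=2):
    """Reduce the majority class to roughly n_keep samples, cluster by cluster."""
    labels, centers = kmeans(majority, k)
    clusters = {}
    for p, lab in zip(majority, labels):
        clusters.setdefault(lab, []).append(p)
    # Assumption: very small clusters are outliers and are dropped entirely.
    kept = {lab: m for lab, m in clusters.items() if len(m) >= min_cluster_size}
    total = sum(len(m) for m in kept.values())
    sampled = []
    for lab, members in kept.items():
        # Assumption: larger clusters get a proportionally larger share of n_keep.
        quota = min(len(members), max(1, round(n_keep * len(members) / total)))
        # Assumption: keep the samples closest to the center, trimming edge samples.
        members = sorted(members, key=lambda p: math.dist(p, centers[lab]))
        sampled.extend(members[:quota])
    return sampled


# Toy majority class: two dense groups plus one far-away outlier.
blob_a = [(i % 6 / 6, i // 6 / 5) for i in range(30)]
blob_b = [(x + 5, y + 5) for x, y in blob_a]
outlier = (100.0, 100.0)
majority = blob_a + blob_b + [outlier]

reduced = cluster_undersample(majority, n_keep=20)
print(len(majority), "->", len(reduced))
```

On this toy data the outlier ends up in a cluster of its own and is discarded, while each dense group contributes its ten most central samples. The reduced majority set would then be combined with the minority class before training a classifier such as AdaBoost or an SVM.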
Keywords/Search Tags: imbalanced data classification learning, classification learning, mixed sampling method, AdaBoost algorithm