Research And Application Of Unbalanced Data Classification Algorithm Based On Resampling

Posted on: 2022-09-07
Degree: Master
Type: Thesis
Country: China
Candidate: Y B Feng
GTID: 2518306341964069
Subject: Computer technology
Abstract:
In today's era of big data, data volume is growing exponentially. Data sit at the core of the Internet, and their practical value in daily life keeps increasing. How to extract valuable information from these data accurately and quickly has become an important research topic. Machine learning offers one answer: it automatically mines the regularities inherent in data and uses them to make predictions on new data. After years of research many mature models have been proposed, but these learning methods were designed for evenly distributed samples. In practice, many data sets are unevenly distributed, with very different numbers of samples in each class; these are unbalanced data sets. Examples include credit card fraud detection, fault diagnosis, medical diagnosis, and spam filtering.

Unbalanced data distributions pose a greater challenge to data mining, because traditional classification algorithms were designed and tested on evenly distributed data sets. To preserve overall classifier performance, some minority samples are misclassified, which lowers the recognition rate of the minority class. Yet the classification accuracy of minority samples usually matters most, because the information they carry is more valuable and is the actual goal of data mining. To address unbalanced data in classification, existing work follows two directions: data sampling and algorithm improvement. Data sampling applies a sampling strategy so that the majority and minority classes contain roughly equal numbers of samples. Algorithm improvement optimizes the algorithm by introducing a penalty mechanism (for example cost-sensitive learning, ensemble learning, or fuzzy support vector machines) to raise the recognition rate of the minority class.

Because classification results on unbalanced data sets are biased towards the majority class, this thesis improves the classification accuracy of the minority class at both the data sampling level and the algorithm level. The specific research content is as follows.

(1) At the data level, the thesis first analyzes the characteristics of unbalanced data sets, takes the balance between classes into account, and combines the SVM algorithm with the SMOTE sampling algorithm to propose an SVM-based oversampling method for unbalanced data, SVMOM (oversampling method based on SVM). The algorithm first obtains the classification hyperplane through an SVM. It then assigns each minority sample a distance weight according to its distance to the hyperplane, derives a density weight from the distribution of the distance weights, and combines the distance weight and density weight into a selection weight. Finally, it uses SMOTE to synthesize new samples according to the selection weights, balancing the data set. Comparative experiments verify the effectiveness of the algorithm.

(2) At the algorithm level, to exploit the advantages of ensemble methods on unbalanced data sets, the SVMOM oversampling algorithm is combined with AdaBoost into a new ensemble algorithm, the oversampling-based ensemble classification algorithm for unbalanced data (SVMOMboost). The algorithm uses decision trees as base classifiers. At the beginning of each iteration, SVMOM is applied to oversample the minority samples and the sample weights are updated at the same time, so that the data set reaches a certain degree of balance. Experimental comparison shows that SVMOMboost performs better than the other algorithms tested.

(3) Finally, sand and dust storm data for parts of Gansu are extracted from "China's Strong Sandstorm Sequence and Its Supporting Dataset" and "China Surface Climate Data Daily Value Dataset" and preprocessed (data cleaning, data conversion, and feature selection). Using the proposed SVMOMboost algorithm, a classification model is built for the unbalanced sand and dust storm data of these areas of Gansu. Comparative experiments verify the effectiveness of SVMOMboost.
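
The oversampling procedure in item (1) can be illustrated with a minimal sketch. The abstract does not give the weight formulas, so everything concrete here is an assumption made for illustration: the hyperplane is taken to be linear and already fitted (parameters `w` and `b`), the distance weight is `1 / (1 + distance)`, the density weight is the mean distance to the `k` nearest minority neighbours, and the selection weight is their normalized product. The function names are hypothetical and are not taken from the thesis.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def normalize(values):
    total = sum(values)
    return [v / total for v in values]


def k_nearest(minority, i, k):
    """Indices of the (at most) k minority samples closest to sample i."""
    others = [j for j in range(len(minority)) if j != i]
    others.sort(key=lambda j: euclidean(minority[i], minority[j]))
    return others[:k]


def selection_weights(minority, w, b, k):
    """Combine a distance weight and a density weight per minority sample."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    # Distance of each sample to the hyperplane w.x + b = 0.
    dist = [abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w
            for x in minority]
    # Assumption: samples closer to the hyperplane get a larger weight.
    dist_w = normalize([1.0 / (1.0 + d) for d in dist])
    # Assumption: samples in sparser neighbourhoods get a larger weight.
    sparsity = []
    for i in range(len(minority)):
        nbrs = k_nearest(minority, i, k)
        mean_d = sum(euclidean(minority[i], minority[j]) for j in nbrs) / len(nbrs)
        sparsity.append(mean_d + 1e-12)  # epsilon avoids an all-zero vector
    dens_w = normalize(sparsity)
    # Assumption: the selection weight is the normalized product of the two.
    return normalize([dw * sw for dw, sw in zip(dist_w, dens_w)])


def weighted_smote(minority, w, b, n_new, k=3, seed=0):
    """SMOTE interpolation, with seed samples drawn by selection weight."""
    rng = random.Random(seed)
    weights = selection_weights(minority, w, b, k)
    synthetic = []
    for _ in range(n_new):
        i = rng.choices(range(len(minority)), weights=weights, k=1)[0]
        j = rng.choice(k_nearest(minority, i, k))
        gap = rng.random()  # position along the segment from sample i to j
        synthetic.append([xi + gap * (xj - xi)
                          for xi, xj in zip(minority[i], minority[j])])
    return synthetic


# Toy minority class and a hypothetical hyperplane x1 + x2 - 5 = 0.
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.2], [3.0, 3.0]]
new_points = weighted_smote(minority, w=[1.0, 1.0], b=-5.0, n_new=6)
```

Each synthetic point lies on a segment between two existing minority samples, as in standard SMOTE; only the choice of the seed sample is biased by the selection weights. In an ensemble such as the one in item (2), a routine like this would be called at the start of each boosting round, and it requires at least two minority samples.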
Keywords: Unbalanced Data, Oversampling, Ensemble Algorithm, Sample Weight, SVM