Research And Application Of Integrated Algorithms For Unbalanced Data Sets

Posted on: 2020-12-23
Degree: Master
Type: Thesis
Country: China
Candidate: D P Wu
Full Text: PDF
GTID: 2428330578956093
Subject: Computer software and theory
Abstract/Summary:
Classification is a core research problem in machine learning and data mining. Traditional classification algorithms assume that the class distribution is balanced. In many practical applications, however, such as credit card fraud detection, oil spill detection in satellite images, and network intrusion detection, the class distribution is unbalanced: one class has significantly fewer samples than another. The class with relatively many samples is called the majority class, and the class with relatively few samples is called the minority class. For this type of data, the minority class is usually more important than the majority class. In credit card fraud detection, for example, the cost of classifying a fraudulent transaction as legitimate is far higher than the cost of classifying a legitimate transaction as fraudulent. For unbalanced data sets it is therefore more important to improve classification performance on the minority class.

Research on classifying unbalanced data falls mainly into two groups. The first works at the data level and uses sampling techniques (such as random oversampling, random undersampling, and SMOTE oversampling) to rebalance the class distribution of the data set; a classifier is then trained on the balanced data. The second works at the algorithm level: it preserves the class distribution of the original data set (training directly on the original training set) and modifies the classification algorithm so that it can handle unbalanced data directly, using techniques such as cost-sensitive learning, decision thresholds, probability estimates, and ensemble learning. Ensemble learning is one of the more popular directions and has achieved good results on unbalanced data sets, where it has been widely used because of its superior classification performance. An ensemble learning algorithm improves classification performance mainly by improving the base classifiers and by increasing the diversity among them, and it has strong generalization ability.

Based on the above analysis, this thesis makes the following contributions. First, at the data level, it combines the strengths of SMOTE oversampling and repeated undersampling for handling unbalanced data sets and proposes a rotation forest ensemble classification method based on combined sampling. The algorithm first applies SMOTE oversampling to the original training set, then uses repeated undersampling on the resulting training set to extract multiple balanced training subsets, and finally trains a rotation forest ensemble on these subsets. Second, at the algorithm level, it introduces threshold moving into the Bagging ensemble algorithm and proposes an unbalanced data classification method based on a probability threshold Bagging ensemble. The algorithm selects, for each class, the decision threshold that maximizes a performance evaluation metric, which allows it to adapt to unbalanced data sets. Finally, the improved probability threshold Bagging ensemble algorithm is applied to the classification of unbalanced dust storm data sets from some areas of Gansu Province. The experimental results show that the proposed algorithm has good classification performance.
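
The threshold moving idea behind the probability threshold Bagging method can be illustrated with a minimal sketch. This is not the thesis's implementation. It assumes a binary problem with the minority class labelled 1, takes the ensemble probability to be the fraction of base classifiers voting for the minority class, and uses F1 as the evaluation metric; the abstract does not say which metric is maximized. The function names `ensemble_probability` and `choose_threshold` and the toy votes are hypothetical.

```python
from typing import List, Sequence, Tuple


def f1_score(labels: Sequence[int], preds: Sequence[int]) -> float:
    """F1 for the minority class (label 1); assumed metric, not from the thesis."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def ensemble_probability(votes: Sequence[Sequence[int]]) -> List[float]:
    """Average the 0/1 votes of the base classifiers for each sample."""
    n_models = len(votes)
    return [sum(column) / n_models for column in zip(*votes)]


def choose_threshold(
    probs: Sequence[float], labels: Sequence[int]
) -> Tuple[float, float]:
    """Try each observed probability as a threshold; keep the best F1."""
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(probs)):
        preds = [1 if p >= t else 0 for p in probs]
        score = f1_score(labels, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


# Toy validation data: 4 base classifiers (rows) voting on 6 samples (columns).
votes = [
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
labels = [0, 0, 0, 0, 1, 1]  # the last two samples are the minority class

probs = ensemble_probability(votes)
threshold, score = choose_threshold(probs, labels)
print(threshold, round(score, 3))
```

On this toy data the default threshold of 0.5 misses one of the two minority samples, while the selected threshold of 0.25 recovers both at the cost of one false positive, which raises F1 from about 0.667 to 0.8. In practice the threshold would be chosen on held-out or out-of-bag predictions rather than on the test set.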
Keywords/Search Tags: Unbalanced Data, Ensemble Learning, Bagging Algorithm, SMOTE Sampling Method, Sandstorm Data Classification