
Classification Of Imbalanced Data Based On Margin Distribution Boosting Algorithm

Posted on: 2022-08-16 | Degree: Master | Type: Thesis
Country: China | Candidate: Z L Zhang | Full Text: PDF
GTID: 2518306512961979 | Subject: Master of Engineering
Abstract/Summary:
In the field of machine learning, most classification algorithms are designed for balanced, uniformly distributed data and fail to achieve satisfactory results on imbalanced data. In practical applications such as medical diagnosis and risk management, correctly classifying minority-class instances is especially important, so the study of imbalanced-dataset classification is of great significance. This thesis studies the classification of two-class imbalanced datasets and is organized into the following three parts.

Firstly, a cost-sensitive AdaBoost_v algorithm is proposed. Building on the existing AdaBoost_v algorithm, it is further adapted to imbalanced classification. An adaptive cost-sensitive function is introduced into the sample weights so that the classifier pays more attention to minority-class instances. From the sample-weight formula and the upper bound on the classification error rate of AdaBoost_v with respect to the optimal margin, a new weighting strategy for the base classifiers is derived that fully accounts for both class imbalance and the optimal margin. To further address imbalance, the algorithm adopts an improved SVM model, which is solved with SVRG to improve the convergence speed.

Secondly, an undersampling-based AdaBoost_v algorithm is proposed. On the basis of the existing AdaBoost_v algorithm, two neighborhood-based undersampling methods are used to handle the class-overlap problem in imbalanced datasets. The first is a common-nearest-neighbor search undersampling method, applicable when the data density of the minority class (the positive class) is greater than or equal to that of the majority class (the negative class). Its main idea is to find the common negative nearest neighbors of any two positive samples and delete them as overlapping negative samples. The second is a recursive search undersampling method, which builds on the first to further remove majority-class samples and is suitable when the data density of the minority class is far less than that of the majority class. To further address imbalance, the algorithm again adopts the improved SVM optimization model, solved with SVRG.

Lastly, a cost-sensitive penalized AdaBoost algorithm is proposed. Building on the existing penalized AdaBoost algorithm based on margin distribution, it is further adapted to imbalanced classification. A new adaptive cost-sensitive function is introduced into the sample weights; this function takes into account the sample's class, its classification error rate, and the influence of noisy samples. To further address imbalance, the algorithm again adopts the improved SVM optimization model, solved with SVRG.
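The abstract does not reproduce the thesis's formulas, so the following is only a minimal sketch of the idea behind the first and third parts: folding an adaptive cost factor into AdaBoost-style sample-weight updates so that misclassified minority-class samples are up-weighted more strongly. The cost factor, the fixed cost ratio, and the standard AdaBoost base-classifier weight used here are assumptions for illustration; the thesis derives a different, margin-based weighting strategy.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def cost_sensitive_boost(X, y, T=50, minority_label=1, cost_ratio=3.0):
    """Illustrative cost-sensitive AdaBoost-style loop.

    Labels are assumed to be in {-1, +1}, with +1 the minority class.
    Misclassified minority samples receive an extra cost factor in the
    weight update; alpha uses the standard AdaBoost formula.
    """
    n = len(y)
    w = np.full(n, 1.0 / n)                        # initial sample weights
    cost = np.where(y == minority_label, cost_ratio, 1.0)
    learners, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        miss = (pred != y).astype(float)
        err = np.clip(np.dot(w, miss), 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)      # base-classifier weight
        # cost-sensitive update: errors on minority samples grow faster
        w *= np.exp(alpha * miss * cost)
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, np.array(alphas)

def boost_predict(learners, alphas, X):
    votes = sum(a * l.predict(X) for l, a in zip(learners, alphas))
    return np.sign(votes)
```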
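For the second part, a minimal sketch of the common-nearest-neighbor undersampling idea is given below, assuming labels in {-1, +1} with +1 the minority (positive) class: negative samples that appear among the k nearest neighbors of at least one pair of positive samples are treated as overlapping and removed. The neighbor count k and the pairwise search are illustrative choices, not the thesis's exact procedure.

```python
import numpy as np
from itertools import combinations
from sklearn.neighbors import NearestNeighbors

def common_nn_undersample(X, y, k=5, pos_label=1):
    """Remove negative samples that are common k-nearest neighbors of
    some pair of positive samples (treated as class overlap)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)                      # idx[i][0] is i itself
    pos_idx = np.where(y == pos_label)[0]
    neg_mask = (y != pos_label)

    # neighbor sets of each positive sample, restricted to negative samples
    neg_neighbors = {i: {j for j in idx[i][1:] if neg_mask[j]} for i in pos_idx}

    overlap = set()
    for i, j in combinations(pos_idx, 2):
        overlap |= neg_neighbors[i] & neg_neighbors[j]

    keep = np.array([i for i in range(len(y)) if i not in overlap])
    return X[keep], y[keep]
```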
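All three parts solve an improved SVM model with SVRG. Since the improved model itself is not given in the abstract, the sketch below applies SVRG to a plain L2-regularized squared-hinge objective as a stand-in; the loss, step size, and epoch settings are assumptions, not the thesis's formulation.

```python
import numpy as np

def svrg_linear_svm(X, y, lam=1e-3, lr=0.1, epochs=20, seed=0):
    """SVRG on an L2-regularized squared-hinge loss (smooth SVM surrogate).

    Each outer epoch computes a full gradient at a snapshot w_snap; each
    inner step uses the variance-reduced gradient
    g_i(w) - g_i(w_snap) + mu.  Labels are assumed to be in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)

    def grad_i(w, i):
        margin = y[i] * X[i].dot(w)
        return -2 * max(0.0, 1 - margin) * y[i] * X[i] + lam * w

    def full_grad(w):
        active = np.maximum(0.0, 1 - y * (X @ w))
        return (-2 * (active * y) @ X) / n + lam * w

    for _ in range(epochs):
        w_snap = w.copy()
        mu = full_grad(w_snap)                     # full gradient at snapshot
        for _ in range(n):                         # inner SVRG loop
            i = rng.integers(n)
            w -= lr * (grad_i(w, i) - grad_i(w_snap, i) + mu)
    return w
```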
Keywords/Search Tags:Imbalanced Datasets, AdaBoost, Undersampling, Cost Sensitivity, Margin Distribution