Research And Application Of Imbalanced Data Classification Algorithm Based On Ensemble Learning

Posted on: 2022-08-31  Degree: Master  Type: Thesis
Country: China  Candidate: C Liu
GTID: 2518306722498744  Subject: Bionic Equipment and Control Engineering
Abstract/Summary:
Classification is an important branch of machine learning. In practice, classification problems often involve imbalanced data, as in medical diagnosis, credit card fraud detection, and fault detection, so studying the classification of imbalanced data has both theoretical and practical value. Compared with balanced data, imbalanced data is characterized by unequal numbers of samples across classes and by a high cost of misclassifying the minority classes. Ensemble learning, a typical machine learning approach, can improve overall classification accuracy through group decision-making and is widely used for imbalanced data. Its classification performance on imbalanced data still needs improvement, however. The main reason is that ensemble learning algorithms aim to reduce the overall classification error rate and do not account for the different misclassification costs of different samples, so the key minority samples are recognized poorly. In addition, imbalanced data is often accompanied by high dimensionality, low value density, class overlap, and many missing values, all of which pose further challenges for current ensemble learning algorithms.

This thesis therefore focuses on the difficulties of imbalanced data classification. Building on existing methods, and working at both the data level and the algorithm level, it constructs an adaptive key feature mining algorithm and introduces cost-sensitive learning to improve the AdaBoost ensemble learning algorithm, so that the model can effectively identify minority-class samples while preserving overall classification accuracy. The main research contents are as follows:

(1) At the data level, the thesis analyzes key feature mining techniques for imbalanced data. Because the Pearson's Redundancy-Based Filter removes features blindly, it can easily cause the model to under-fit. An adaptive key feature mining algorithm, AKKPRBF (Adaptive KNN and Kernel Density Pearson's Redundancy-based Filter), is proposed on the basis of the Pearson redundancy filter. The algorithm combines kernel density estimation with the correlation coefficient to identify key features. It then uses KNN, based on nearest-neighbor distance, to adaptively fill missing values in those key features so that their specificity is preserved. Finally, it creates new features through dynamic polynomial combination to make the features more distinctive. AdaBoost is used as the classifier to build the AKKPRBF-AdaBoost classification model. Experiments show that AKKPRBF significantly improves the performance of the ensemble learning classification model.

(2) At the algorithm level, AdaBoost (Adaptive Boosting) takes optimal overall accuracy as its goal and therefore struggles with imbalanced misclassification costs, class imbalance, data overlap, and related problems. Cost-sensitive learning is introduced to improve the weight update rule of AdaBoost, yielding a cost-sensitive adaptive boosting algorithm, CsAdaBoost (Cost Sensitive Adaptive Boosting). On top of the original sample weight update, the algorithm further raises the weights of misclassified minority-class samples. At the same time, it moderately raises the weights of misclassified majority-class samples, so that excessive attention to the minority class does not increase the overall classification cost. The goal is the lowest overall classification cost.

(3) The data-level and algorithm-level work is integrated by combining the AKKPRBF key feature mining algorithm with the CsAdaBoost ensemble classification algorithm. The resulting ensemble learning algorithm, AKKPRBF-CsAdaBoost (Adaptive KNN and Kernel Density Pearson's Redundancy-based Filter-Cost Sensitive Adaptive Boosting), incorporates the improvements at both levels. To verify its classification performance, the AKKPRBF-CsAdaBoost model was applied to the classification and prediction of imbalanced data from different fields. Recall and G-mean were used as evaluation metrics, with ten-fold cross-validation and hundreds of experiments. The proposed algorithm was evaluated and compared along three dimensions: model stability, accuracy, and the recognition rate of minority-class samples. The results indicate that AKKPRBF-CsAdaBoost is applicable across a wide range of fields and has practical value.
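The cost-sensitive weight update in (2) and the G-mean metric in (3) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the abstract does not give the exact update formula, so the multiplicative cost factors (`cost_minority=2.0`, `cost_majority=1.2`), the function names, and the 0/1 label encoding are all assumptions chosen only to show the qualitative behaviour described above. G-mean is computed in its common two-class form, the geometric mean of the recall on each class.

```python
import math


def recall_for_class(y_true, y_pred, label):
    """Recall for one class: correctly predicted members / all members."""
    total = sum(1 for t in y_true if t == label)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    return hits / total if total else 0.0


def g_mean(y_true, y_pred, minority=1, majority=0):
    """Geometric mean of minority-class recall and majority-class recall."""
    return math.sqrt(
        recall_for_class(y_true, y_pred, minority)
        * recall_for_class(y_true, y_pred, majority)
    )


def cost_sensitive_update(weights, y_true, y_pred, alpha,
                          cost_minority=2.0, cost_majority=1.2, minority=1):
    """One boosting round of sample re-weighting (hypothetical cost factors).

    Correctly classified samples are scaled by exp(-alpha), as in plain
    AdaBoost. Misclassified samples are scaled by exp(alpha) times a
    class-dependent cost, larger for the minority class (assumed values).
    """
    updated = []
    for w, t, p in zip(weights, y_true, y_pred):
        if t == p:
            updated.append(w * math.exp(-alpha))
        else:
            cost = cost_minority if t == minority else cost_majority
            updated.append(w * cost * math.exp(alpha))
    norm = sum(updated)
    return [w / norm for w in updated]  # renormalize to a distribution


# Toy round: label 1 is the minority class; samples 0 and 5 are misclassified.
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [0, 1, 0, 0, 0, 1]
weights = cost_sensitive_update([1 / 6] * 6, y_true, y_pred, alpha=0.5)
score = g_mean(y_true, y_pred)
```

In this toy round the misclassified minority sample ends up with the largest weight, the misclassified majority sample with a smaller but still increased weight, and correctly classified samples with the smallest weights. The size of the gap is controlled entirely by the assumed cost factors.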
Keywords/Search Tags: imbalanced data classification, ensemble learning, cost-sensitive learning, key feature mining