
Research and Application of Imbalanced Classification Technology Based on gcForest

Posted on: 2020-05-25
Degree: Master
Type: Thesis
Country: China
Candidate: P Zhao
Full Text: PDF
GTID: 2428330590471776
Subject: Computer technology
Abstract/Summary:
With the rapid development of global information technology, machine learning has become an important means of solving practical problems in many fields, and imbalanced classification is an important research topic within it. Imbalanced data is widespread in real life: medical diagnosis, churn prediction, and spam detection, for example, all involve large amounts of imbalanced data, so classifying such data effectively has significant research value. Many researchers have achieved useful results by applying ensemble learning to imbalanced classification, but most of this work relies on traditional ensemble algorithms, which have known limitations and have already been studied thoroughly; relatively few studies apply state-of-the-art ensemble algorithms to the problem. The gcForest algorithm is a recent ensemble learning algorithm proposed by Professor Zhou Zhihua; it offers high classification accuracy, strong generalization ability, and simple parameter tuning. However, gcForest does not take imbalanced data into account in its design, and its classification performance on imbalanced data is therefore not superior. This thesis optimizes and improves gcForest for imbalanced classification at two points: the cascade forest and the data level.

First, in the cascade forest, the classification accuracy of each layer's forests is computed separately for the minority and majority classes, and these per-class accuracies are used as class weights. The voting of each layer's forests is then adjusted according to these weights, strengthening the algorithm's ability to identify minority-class samples. In addition, the XGBoost algorithm replaces the original base classifier of the cascade forest, further strengthening the whole forest's ability to classify imbalanced data.

Second, at the data level, after the multi-grained scanning step, one of two optimization strategies is applied depending on the imbalance ratio of the data. By introducing an oversampling algorithm and adopting the idea of the EasyEnsemble algorithm, the minority samples and the majority samples are combined into multiple balanced datasets. These balanced datasets are then passed to the cascade forest for learning, improving the algorithm's handling of imbalanced data from the data side.

Finally, experiments are carried out on multiple UCI and KEEL public datasets and on a churn dataset from a provincial telecom operator, with comparisons against mainstream ensemble learning algorithms. The results show that the proposed improvements effectively raise the classification performance of gcForest on imbalanced data.
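The abstract above is prose only; the two short Python sketches below are added as a rough illustration of the ideas it describes, not the author's implementation. All function names, signatures, and data layouts in them are assumptions. The first sketch shows per-class accuracy being used as a class weight when combining the probability outputs of one cascade layer's forests, so that forests which recognize the minority class well contribute more to its votes.

import numpy as np

def per_class_accuracy(y_true, y_pred, n_classes):
    # Accuracy of one forest computed separately for each class
    # (minority and majority), used later as that forest's class weights.
    acc = np.zeros(n_classes)
    for c in range(n_classes):
        mask = (y_true == c)
        acc[c] = (y_pred[mask] == c).mean() if mask.any() else 0.0
    return acc

def weighted_layer_vote(prob_list, class_weights):
    # prob_list     : list of (n_samples, n_classes) probability arrays,
    #                 one per forest in the current cascade layer
    # class_weights : list of (n_classes,) per-class accuracy arrays,
    #                 one per forest
    # Each forest's class columns are scaled by its per-class accuracy
    # before the layer's outputs are summed and renormalized per sample.
    weighted = [p * w for p, w in zip(prob_list, class_weights)]
    combined = np.sum(weighted, axis=0)
    return combined / combined.sum(axis=1, keepdims=True)

The second sketch, equally hypothetical, covers the data-level strategy: following the EasyEnsemble idea, the majority class is split into several random parts and each part is paired with all minority samples; when the number of parts is close to the imbalance ratio, each resulting dataset is roughly balanced and can be passed to the cascade forest for learning.

def balanced_subsets(X, y, minority_label, n_subsets, seed=0):
    # Split the majority indices into n_subsets random parts and pair each
    # part with all minority samples, yielding several (X, y) datasets.
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = rng.permutation(np.flatnonzero(y != minority_label))
    for part in np.array_split(maj_idx, n_subsets):
        idx = np.concatenate([min_idx, part])
        yield X[idx], y[idx]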
Keywords/Search Tags:ensemble learning, gcForest, weight, EasyEnsemble, oversampling