
Research On Ensemble Learning Algorithm For Imbalanced Data

Posted on: 2020-08-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z M Zhang
Full Text: PDF
GTID: 1368330578471856
Subject: Computer software and theory
Abstract/Summary:
Imbalanced data sets are widespread in production activities and daily life, arising either from the nature of the data itself or from human factors in the sampling process. In such data sets, the minority samples are often the ones most closely tied to abnormal and important situations. In many practical applications, however, traditional methods have difficulty classifying and identifying these minority samples effectively. Ensemble learning, an important research branch of data mining, has received wide attention from researchers. By combining multiple sub-learners, ensemble learning can significantly improve the generalization ability of a learning system, which gives it an advantage over a traditional single data mining algorithm.

This dissertation studies the classification and clustering of imbalanced data, using ensemble learning as the tool, and proposes several algorithms to improve classification and clustering performance on imbalanced data sets. At the data level, work in this area mainly focuses on how to adjust the sample distribution reasonably and effectively. At the algorithm level, it mainly focuses on how to optimize and improve the parameters of existing algorithms. The main research contents of this dissertation are as follows:

(1) K-AdaBoost clustering ensemble algorithm based on under-sampling technique

A K-AdaBoost algorithm is proposed that combines the AdaBoost algorithm with K-means clustering to handle imbalanced data sets. First, the algorithm applies under-sampling based on K-means clustering to reduce the number of majority samples, balancing the data set without destroying its structure. Second, K-means is applied again to the newly obtained training set to obtain multiple clusters. The weights of the base learners for each test sample are then computed from the distances between the test sample and the cluster centers. Finally, the base learners are combined into a strong learner according to these weights, and the strong learner predicts the test samples.

(2) R-AdaBoost classification ensemble algorithm based on ADASYN

An ensemble classification algorithm, R-AdaBoost, based on ADASYN is proposed for imbalanced data sets. First, the algorithm generates m synthetic samples with ADASYN to balance the original data set. Second, base learners classify the resulting data set, yielding the classification result of each base classifier. When the sample weights are updated, the idea of the Focal Loss function is introduced to increase the weight of samples that are difficult to classify. Finally, the test samples are classified by the AdaBoost algorithm to obtain the final classification result.

(3) EOS-Bagging ensemble learning algorithm based on evolutionary over-sampling

The EOS-Bagging (Evolutionary Over-sampling) algorithm is proposed for imbalanced data sets, based on an improved SMOTE sampling technique. First, the minority samples are randomly over-sampled. Second, building on the SMOTE algorithm and the genetic algorithm, selection, crossover, and mutation operations are performed on the minority samples of the new data set. Finally, at the algorithm level, base learners within the Bagging ensemble learning framework are trained on the data set with the synthetic samples to obtain predictions for the test samples.

The experiments show that the algorithms proposed in the dissertation achieve some improvement in classification and clustering performance on imbalanced data sets.
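The over-sampling step that EOS-Bagging builds on can be illustrated with plain SMOTE-style interpolation, in which a synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. The following is a minimal sketch under stated assumptions, not the dissertation's algorithm: the genetic selection, crossover, and mutation steps are omitted, and the function name `smote_oversample`, its parameters, and the toy data are hypothetical.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points interpolated between minority neighbours.

    Hypothetical helper for illustration only; plain SMOTE interpolation,
    without the evolutionary operators described in the abstract.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])  # one of the k nearest neighbours
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy data (assumed): 4 minority samples against 10 majority samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority_count = 10

# Generate enough synthetic samples to match the majority class size.
new_points = smote_oversample(minority, majority_count - len(minority))
balanced_minority = minority + new_points
```

Because each synthetic point is a convex combination of two existing minority samples, it always falls inside the region already spanned by the minority class. An evolutionary variant such as the one described above would presumably go on to select, recombine, and mutate such candidates.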
Keywords/Search Tags: Imbalanced data set, Ensemble learning, Classification, Clustering, Machine learning