Research On Ensemble Classifying Algorithm Of Imbalanced Data Set Based On Oversampling

Posted on: 2021-01-23
Degree: Master
Type: Thesis
Country: China
Candidate: X H Zhao
Full Text: PDF
GTID: 2428330614458284
Subject: Electronic and communication engineering
Abstract/Summary:
Imbalanced data classification problems arise in many fields, and the inherent complexity of imbalanced data distributions significantly reduces classifier performance. Improving classifier performance on imbalanced data therefore remains an important research problem. Traditional classification algorithms struggle to achieve good performance on imbalanced data.

At the data processing level, the Synthetic Minority Oversampling Technique (SMOTE) is an effective resampling method, but in some cases it synthesizes new samples blindly, without regard to the distribution of the samples, which can seriously reduce classifier performance. This thesis therefore modifies the SMOTE oversampling algorithm and proposes an oversampling algorithm based on clustering.

At the classification algorithm level, ensemble classification is an effective way to improve classifier performance. The diversity of the base classifiers and the ensemble strategy are the key factors affecting an ensemble method. Building on the proposed oversampling algorithm, this thesis combines the Adaptive Boosting (AdaBoost) algorithm with the Support Vector Machine (SVM) algorithm, modifies the base classifier and the ensemble strategy respectively, and proposes an asymmetric cost-sensitive ensemble classifying algorithm.

1. Oversampling algorithm based on clustering. The algorithm clusters the minority samples to obtain minority clusters of different sizes and densities. It synthesizes more samples in sparse clusters and relatively fewer in dense clusters. The algorithm takes into account inter-class imbalance, intra-class imbalance, noise, and class overlap in imbalanced data, and provides a new oversampling strategy for imbalanced data classification at the data processing level. The results indicate that the oversampling mechanism of this algorithm is more reasonable and can effectively improve classifier performance.

2. Asymmetric cost-sensitive ensemble classifying algorithm. The ensemble classifying algorithm uses the proposed oversampling algorithm to divide the training set into multiple training subsets. Within the AdaBoost framework, each training subset is trained using the modified SVM, yielding a series of strong classifiers. The weight of each strong classifier is then calculated from the similarity (distance) between each test sample and the center of each training subset. Finally, the strong classifiers are combined by weighted voting to form the final classification system. The results show that this ensemble classifying algorithm has better stability and performance than other similar algorithms.
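The cluster-wise oversampling idea in item 1 can be illustrated with a minimal sketch. The abstract does not specify the clustering method, the sparseness measure, or how neighbours are chosen, so everything concrete here is an assumption: the minority clusters are taken as already computed, sparseness is approximated by the mean pairwise distance within a cluster, and new points are interpolated between two random members of the same cluster (plain SMOTE would use k-nearest neighbours). The function names are hypothetical.

```python
import math
import random


def mean_pairwise_distance(points):
    """Average Euclidean distance over all pairs (assumed sparseness proxy)."""
    n = len(points)
    if n < 2:
        return 0.0  # a single point has no spread, so it receives no samples
    total = sum(
        math.dist(points[i], points[j])
        for i in range(n)
        for j in range(i + 1, n)
    )
    return total / (n * (n - 1) / 2)


def allocate_counts(clusters, n_new):
    """Split n_new synthetic samples across clusters in proportion to sparseness."""
    sparseness = [mean_pairwise_distance(c) for c in clusters]
    total = sum(sparseness)
    if total == 0:
        raise ValueError("no cluster has two distinct points to interpolate between")
    raw = [n_new * s / total for s in sparseness]
    counts = [int(r) for r in raw]
    # Give the leftover samples to the clusters with the largest fractional share.
    by_fraction = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in by_fraction[: n_new - sum(counts)]:
        counts[i] += 1
    return counts


def oversample(clusters, n_new, rng):
    """Create n_new points by interpolating inside each cluster (SMOTE-style)."""
    synthetic = []
    for cluster, k in zip(clusters, allocate_counts(clusters, n_new)):
        for _ in range(k):
            a, b = rng.sample(cluster, 2)  # two distinct members of one cluster
            t = rng.random()               # position along the segment a -> b
            synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


# Toy minority class: one dense cluster and one sparse cluster.
dense = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
sparse = [(5.0, 5.0), (7.0, 5.0), (5.0, 7.0)]
counts = allocate_counts([dense, sparse], 20)
new_points = oversample([dense, sparse], 20, random.Random(0))
```

Because interpolation never crosses cluster boundaries, synthetic points stay inside the region each cluster already occupies, which is one plausible way to limit the blind synthesis the abstract attributes to SMOTE.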
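The distance-based weighted voting in item 2 can be sketched in a similar way. The abstract only states that each strong classifier's weight depends on the similarity (distance) between the test sample and the center of the corresponding training subset; the inverse-distance weighting below is one assumed choice, not necessarily the thesis's formula. The classifiers are hypothetical callables returning +1 (minority) or -1 (majority), standing in for the AdaBoost-SVM models.

```python
import math


def subset_weights(x, centers, eps=1e-12):
    """Normalized inverse-distance weights: a closer subset center weighs more."""
    inverse = [1.0 / (math.dist(x, c) + eps) for c in centers]
    total = sum(inverse)
    return [v / total for v in inverse]


def weighted_vote(x, centers, classifiers):
    """Combine the +1/-1 outputs of the per-subset classifiers by weighted sum."""
    weights = subset_weights(x, centers)
    score = sum(w * clf(x) for w, clf in zip(weights, classifiers))
    return 1 if score >= 0 else -1


# Two stand-in classifiers, each "trained" on a subset centered at a different point.
centers = [(0.0, 0.0), (10.0, 10.0)]
classifiers = [lambda x: 1, lambda x: -1]

near_first = weighted_vote((1.0, 1.0), centers, classifiers)
near_second = weighted_vote((9.0, 9.0), centers, classifiers)
```

In this toy setup the classifier whose training subset lies closer to the test sample dominates the vote, so the two test points receive different labels even though the individual classifiers are constant.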
Keywords/Search Tags: imbalanced data set, oversampling, SVM, ensemble classifying algorithm, AdaBoost