
Research On Classification Method For Imbalanced Datasets

Posted on: 2022-08-06
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Zheng
GTID: 2518306569459104
Subject: Control Science and Engineering
Abstract/Summary:
Most traditional classification algorithms assume that every class contains roughly the same number of samples. In many real applications, however, such as medical diagnosis, fault prediction, and fraud detection, the dataset is imbalanced, and this imbalance reduces the generalization ability of the classifier. Improving classifier performance on imbalanced data is therefore an active topic in machine learning. Existing approaches include over-sampling, under-sampling, cost-sensitive learning, and ensemble learning. This thesis proposes three new algorithms, one each for over-sampling, under-sampling, and cost-sensitive learning. The specific research work is as follows:

First, based on an analysis of the shortcomings of the SMOTE algorithm, a new fuzzy over-sampling algorithm based on sample density (FSMOTE) is proposed. In an imbalanced dataset, the denser the majority samples and the sparser the minority samples around a given minority sample, the more likely that sample is to be misclassified. FSMOTE enlarges the sampling weights of these easily misclassified samples so that more synthetic samples are generated near them and the classifier pays more attention to them. In this way, FSMOTE improves the performance of the classifier.

Second, an under-sampling algorithm based on SVM (SVM-US) is proposed. SVM-US first removes the majority samples that may lower the classification accuracy on minority samples. It then uses K-means and SVM to find boundary samples of the majority class. Unlike most under-sampling algorithms, which remove majority samples directly from the dataset, SVM-US reconstructs the majority samples by interpolating between these boundary samples and their nearest neighbors. This preserves the distribution information of the majority class, which helps improve the performance of the classifier.

Third, after analyzing the disadvantages of SVM on imbalanced datasets, a new cost-sensitive SVM algorithm (WSVMCIL) is proposed. WSVMCIL first uses Kernel Density Estimation to estimate the probability density at every sample, and this density determines the sample's initial weight. It then applies the SVDD algorithm to each class and, according to the projection position of each sample between the minority class center and the majority class center, divides the samples into noise samples, normal samples, boundary samples, and overlap samples. A weight modification step then enlarges the weights of boundary samples and reduces the weights of noise samples and overlap samples. Simulation results show that WSVMCIL improves the performance of the SVM algorithm.
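The density-weighted over-sampling idea described for FSMOTE can be illustrated with a small sketch. This is not the thesis's algorithm: the abstract does not give the fuzzy weighting formula, so the sketch assumes a simple stand-in weight, the share of majority points among a minority point's k nearest neighbors (with +1 smoothing). The function names `sampling_weights` and `oversample` and the parameters `k` and `seed` are hypothetical. New points are generated by SMOTE-style interpolation between a weighted-random minority point and one of its nearest minority neighbors.

```python
import math
import random


def _dist(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sampling_weights(minority, majority, k=3):
    """Weight each minority point by how many of its k nearest neighbours are majority points.

    Assumption: this ratio is only a stand-in for the density-based weight in the
    abstract. The +1 smoothing keeps every minority point selectable.
    """
    weights = []
    for i, p in enumerate(minority):
        # Label 0 = minority neighbour, 1 = majority neighbour.
        neighbours = [(_dist(p, q), 0) for j, q in enumerate(minority) if j != i]
        neighbours += [(_dist(p, q), 1) for q in majority]
        neighbours.sort(key=lambda t: t[0])
        n_major = sum(label for _, label in neighbours[:k])
        weights.append((n_major + 1) / (k + 1))
    return weights


def oversample(minority, majority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority points, favouring hard-to-classify regions."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    weights = sampling_weights(minority, majority, k)
    synthetic = []
    for _ in range(n_new):
        # Pick a base point with probability proportional to its weight.
        i = rng.choices(range(len(minority)), weights=weights)[0]
        base = minority[i]
        # Choose a partner among the k nearest minority neighbours of the base point.
        others = sorted(
            (q for j, q in enumerate(minority) if j != i),
            key=lambda q: _dist(base, q),
        )
        partner = rng.choice(others[:k])
        # Interpolate at a random position on the segment from base to partner.
        u = rng.random()
        synthetic.append(tuple(b + u * (q - b) for b, q in zip(base, partner)))
    return synthetic
```

With three minority points, one of which sits inside a cluster of majority points, that point receives the largest weight and is chosen as the interpolation base more often than the other two.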
Keywords/Search Tags: imbalanced dataset, SVM, over-sampling, under-sampling, cost-sensitive learning