Research On Classification Method For Imbalanced Datasets

Posted on:2022-08-06

Degree:Master

Type:Thesis

Country:China

Candidate:H Y Zheng

GTID:2518306569459104

Subject:Control Science and Engineering

Abstract/Summary:

Most traditional classification algorithms assume that every class has roughly the same number of samples. In practice, however, datasets are often imbalanced, for example in medical diagnosis, fault prediction and fraud detection, and this imbalance reduces the generalization ability of the classifier. How to improve classifier performance on imbalanced data is therefore an active topic in machine learning. Existing approaches include over-sampling, under-sampling, cost-sensitive learning and ensemble learning. This thesis proposes three new algorithms, one each for over-sampling, under-sampling and cost-sensitive learning. The specific research work is as follows.

First, after analyzing the shortcomings of the SMOTE algorithm, a new fuzzy over-sampling algorithm based on sample density (FSMOTE) is proposed. In an imbalanced dataset, the denser the majority samples and the sparser the minority samples around a given minority sample, the more likely that minority sample is to be misclassified. FSMOTE enlarges the sampling weights of these easily misclassified samples so that more synthetic samples are generated near them and the classifier pays more attention to them, which improves classifier performance.

Second, an under-sampling algorithm based on SVM (SVM-US) is proposed. SVM-US first removes the majority samples that may decrease the classification accuracy of minority samples. It then uses K-means and SVM to find boundary samples of the majority class. Unlike most under-sampling algorithms, which remove majority samples directly from the dataset, SVM-US reconstructs the majority samples by interpolating between these boundary samples and their nearest neighbors. This preserves the distribution information of the majority class, which helps improve classifier performance.

Third, after analyzing the disadvantages of SVM on imbalanced datasets, a new cost-sensitive SVM algorithm (WSVMCIL) is proposed. WSVMCIL first uses Kernel Density Estimation to evaluate the probability density of every sample, which determines the initial sample weights. SVDD is then applied to each class and, according to where each sample projects between the minority-class center and the majority-class center, the samples are divided into noise samples, normal samples, boundary samples and overlap samples. A weight modification step then enlarges the weights of boundary samples and reduces the weights of noise samples and overlap samples. Simulation results show that WSVMCIL improves the performance of the SVM algorithm.
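
The density-weighted over-sampling idea described for FSMOTE can be illustrated with a minimal sketch. This is not the thesis's algorithm: the abstract does not give the fuzzy membership function, so the sketch assumes a simple stand-in weight (the fraction of majority samples among each minority sample's k nearest neighbors). The function name density_weighted_oversample and its parameters are hypothetical, and only NumPy is used.

```python
import numpy as np

def density_weighted_oversample(X_min, X_maj, n_new, k=5, seed=0):
    """SMOTE-style interpolation where minority points surrounded by many
    majority points are chosen as seeds more often.

    Assumption: the weight is the majority fraction among the k nearest
    neighbors, a stand-in for the (unspecified) FSMOTE density weight.
    Requires at least two minority samples.
    """
    rng = np.random.default_rng(seed)
    n_min = len(X_min)
    X_all = np.vstack([X_min, X_maj])
    is_maj = np.r_[np.zeros(n_min, dtype=bool), np.ones(len(X_maj), dtype=bool)]

    # Distances from every minority point to every point; exclude self-matches.
    d_all = np.linalg.norm(X_min[:, None, :] - X_all[None, :, :], axis=2)
    d_all[np.arange(n_min), np.arange(n_min)] = np.inf

    # Weight: how crowded by the majority class each minority point is.
    nn_all = np.argsort(d_all, axis=1)[:, :k]
    maj_frac = is_maj[nn_all].mean(axis=1)
    weights = maj_frac + 1e-3          # keep every point selectable
    weights /= weights.sum()

    # Interpolation partners come from minority-only nearest neighbors.
    k_min = min(k, n_min - 1)
    nn_min = np.argsort(d_all[:, :n_min], axis=1)[:, :k_min]

    seeds = rng.choice(n_min, size=n_new, p=weights)
    partners = nn_min[seeds, rng.integers(0, k_min, size=n_new)]
    gap = rng.random((n_new, 1))       # position along the seed-partner segment
    return X_min[seeds] + gap * (X_min[partners] - X_min[seeds])

# Usage on toy data (values are illustrative only).
toy_rng = np.random.default_rng(1)
X_maj_toy = toy_rng.normal(0.0, 1.0, size=(50, 2))
X_min_toy = toy_rng.normal(2.0, 0.5, size=(8, 2))
synthetic = density_weighted_oversample(X_min_toy, X_maj_toy, n_new=20)
```

Each synthetic point lies on a segment between two existing minority samples, as in plain SMOTE; only the choice of seed sample is biased toward regions where the majority class is dense.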

Keywords/Search Tags:

imbalanced dataset, SVM, over-sampling, under-sampling, cost-sensitive learning

Related items

1	Imbalanced Data Classification And Its Application In The Prediction Of The Mobile Phone Replacement
2	Application Of Cost-sensitive Learning Based On Re-sampling In Online Loan Users
3	Research On Unbalanced Learning Based On Sampling Method
4	Research On Imbalanced Dataset Classification Algorithm Based On Sampling
5	Research On Imbalanced Data Classification Algorithms Based On Weight Analysis Of Loss Function
6	Research Of Sampling Strategy In Active Learning Algorithms
7	Hybrid Ensemble Learning For Imbalanced Data
8	The Research Of Imbalanced Data Classification
9	Research On The Re-sampling Technology Of Data Mining For High-dimensional Imbalanced Dataset
10	Comprehensive Oversampling And Undersampling Study Of Imbalanced Data Sets