
Application And Research Of Optimization Method For Imbalanced Data

Posted on: 2021-11-21
Degree: Master
Type: Thesis
Country: China
Candidate: S X Liu
GTID: 2518306563486304
Subject: Computer Science and Technology

Abstract/Summary:
An imbalanced training set is likely to bias the classification model toward the majority class, thereby reducing the recognition accuracy of minority samples. Solutions to this problem operate at the data level (i.e., oversampling and undersampling methods) and at the algorithm level (i.e., ensemble learning).

Existing oversampling algorithms generate samples only in limited regions and ignore the problem of within-class imbalance. Therefore, this thesis proposes an oversampling algorithm based on the Gaussian mixture model and JS divergence, called GJRSMOTE. First, it uses a Gaussian mixture model to cluster the minority samples. New samples are then generated inside hyperspheres around the clustered samples. Finally, the number of generated samples is controlled by the JS divergence. Comparison with other oversampling algorithms on the UCI data sets and a seismic data set shows that GJRSMOTE effectively enhances the classification performance of traditional classifiers.

Existing undersampling algorithms do not consider the global and local distributions of the samples at the same time. To solve this problem, an undersampling algorithm based on the Gaussian mixture model and the sample distribution is proposed, called GD-US. First, it uses the density of each cluster to allocate the sampling ratio. GD-US then exploits the global and local distributions of the samples to determine the probability that each sample is deleted. Comparison with other undersampling algorithms verifies the effectiveness of GD-US.

Random forest and its variants construct training subsets by Bootstrap sampling, which easily produces repeated samples and leads to overfitting of the base learners. Therefore, an ensemble learning algorithm based on cluster combination is proposed, called CC-RF. The algorithm clusters the samples of each class separately, and the clusters are then combined pairwise to obtain several training subsets. Finally, the classification result is obtained by an improved weighted voting strategy. Compared with other ensemble learning algorithms on the UCI data sets, the experimental results show that CC-RF achieves better classification ability than the alternatives.
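The GJRSMOTE steps summarized above (cluster the minority class, generate points inside hyperspheres, control the sample count with the JS divergence) can be sketched roughly as follows. This is a simplified, assumption-laden illustration in plain Python, not the thesis's actual algorithm: the Gaussian mixture clustering is replaced by picking random minority samples as hypersphere centers, and the radius parameter is hypothetical, since the abstract does not give these details.

```python
import math
import random

def js_divergence(p, q):
    """Jensen-Shannon divergence (base-2) between two discrete
    distributions p and q; GJRSMOTE uses a JS-based criterion to
    control how many synthetic samples are kept."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def oversample_in_hypersphere(minority, n_new, radius, rng):
    """Generate n_new synthetic points uniformly inside hyperspheres
    of the given radius around randomly chosen minority samples
    (a stand-in for the GMM cluster regions in the thesis)."""
    dim = len(minority[0])
    synthetic = []
    for _ in range(n_new):
        center = rng.choice(minority)
        # Random direction: normalize a Gaussian vector.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction)) or 1.0
        # Radius drawn so points are uniform over the ball's volume.
        r = radius * rng.random() ** (1.0 / dim)
        synthetic.append([c + r * d / norm for c, d in zip(center, direction)])
    return synthetic

# Demo: oversample a tiny 2-D minority class.
rng = random.Random(0)
minority = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
new_points = oversample_in_hypersphere(minority, 20, 0.5, rng)
```

In practice one would fit the mixture with `sklearn.mixture.GaussianMixture` and could compute the divergence with `scipy.spatial.distance.jensenshannon` instead of hand-rolled stdlib code; the version above only illustrates the shape of the procedure.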
Keywords/Search Tags:Imbalance Data, Oversampling, Undersampling, Ensemble Learning