
Application And Research Of Optimization Method For Imbalanced Data

Posted on: 2021-11-21
Degree: Master
Type: Thesis
Country: China
Candidate: S X Liu
GTID: 2518306563486304
Subject: Computer Science and Technology

Abstract/Summary:
An imbalanced training set is likely to bias the classification model toward the majority class, thereby reducing the recognition accuracy of minority samples. Solutions to this problem operate at the data level (i.e., oversampling and undersampling methods) and at the algorithm level (i.e., ensemble learning).

Existing oversampling algorithms generate samples only in limited regions and ignore the problem of within-class imbalance. Therefore, this thesis proposes an oversampling algorithm based on the Gaussian mixture model and JS divergence, called GJRSMOTE. First, it uses a Gaussian mixture model to cluster the minority samples. New samples are then generated inside hyperspheres around the clustered samples. Finally, the number of generated samples is controlled by the JS divergence. Comparison with other oversampling algorithms on the UCI data sets and a seismic data set shows that GJRSMOTE effectively enhances the classification performance of traditional classifiers.

Existing undersampling algorithms do not consider the global and local distributions of the samples at the same time. To solve this problem, an undersampling algorithm based on the Gaussian mixture model and the sample distribution is proposed, called GD-US. First, it uses the density of each cluster to allocate the sampling ratio. GD-US then exploits the global and local distributions of the samples to determine the probability that each sample is deleted. Comparison with other undersampling algorithms verifies the effectiveness of GD-US.

Random forest and its variants construct training subsets by Bootstrap sampling, which easily produces repeated samples and leads to overfitting of the base learners. Therefore, an ensemble learning algorithm based on cluster combination is proposed, called CC-RF. The algorithm clusters the samples of each class separately, and the clusters are then combined pairwise to obtain several training subsets. Finally, the classification result is obtained by an improved weighted voting strategy. Compared with other ensemble learning algorithms on the UCI data sets, the experimental results show that CC-RF achieves better classification ability than the alternatives.
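The GJRSMOTE steps summarized above (cluster the minority class, generate points inside hyperspheres, control the sample count with the JS divergence) can be sketched roughly as follows. This is a simplified, assumption-laden illustration in plain Python, not the thesis's actual algorithm: the Gaussian mixture clustering is replaced by picking random minority samples as hypersphere centers, and the radius parameter is hypothetical, since the abstract does not give these details.

```python
import math
import random

def js_divergence(p, q):
    """Jensen-Shannon divergence (base-2) between two discrete
    distributions p and q; GJRSMOTE uses a JS-based criterion to
    control how many synthetic samples are kept."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def oversample_in_hypersphere(minority, n_new, radius, rng):
    """Generate n_new synthetic points uniformly inside hyperspheres
    of the given radius around randomly chosen minority samples
    (a stand-in for the GMM cluster regions in the thesis)."""
    dim = len(minority[0])
    synthetic = []
    for _ in range(n_new):
        center = rng.choice(minority)
        # Random direction: normalize a Gaussian vector.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction)) or 1.0
        # Radius drawn so points are uniform over the ball's volume.
        r = radius * rng.random() ** (1.0 / dim)
        synthetic.append([c + r * d / norm for c, d in zip(center, direction)])
    return synthetic

# Demo: oversample a tiny 2-D minority class.
rng = random.Random(0)
minority = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
new_points = oversample_in_hypersphere(minority, 20, 0.5, rng)
```

In practice one would fit the mixture with `sklearn.mixture.GaussianMixture` and could compute the divergence with `scipy.spatial.distance.jensenshannon` instead of hand-rolled stdlib code; the version above only illustrates the shape of the procedure.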
Keywords/Search Tags:Imbalance Data, Oversampling, Undersampling, Ensemble Learning