
Research On Classification Algorithm Based On Hybrid Sampling For Imbalanced Data

Posted on: 2020-04-21  Degree: Master  Type: Thesis
Country: China  Candidate: Y F Wu  Full Text: PDF
GTID: 2428330578973737  Subject: Computer software and theory
Abstract/Summary:
Classification is one of the most important research topics in machine learning and data mining. Its purpose is to construct a classification model that describes data classes and predicts future data trends. However, traditional classification algorithms do not account for class imbalance, and imbalanced classification still poses challenges that need to be overcome. For example, in medical diagnosis, fraudulent telephone call detection, and similar problems, the events of interest account for only a small proportion of the whole data set, and misclassifying them can incur incalculable costs. Correctly classifying the minority class in an imbalanced data set is often more valuable than correctly classifying the majority class. How to classify imbalanced data sets correctly and improve the classification accuracy of the minority class has therefore become a major challenge in classification.

At present, the problem of imbalanced data classification receives great attention in both theory and practice, and many classification algorithms for imbalanced data have been proposed from different perspectives. Research methods for imbalanced data classification mainly fall into algorithm improvement and data set reconstruction. Over-sampling and under-sampling are the common data-level methods, but the former tends to cause over-fitting and the latter tends to lose important samples. Based on this observation, this thesis studies imbalanced data classification algorithms based on hybrid sampling in depth, covering the following two aspects:

(1) A hybrid sampling algorithm based on the classification hyperplane is proposed to address the tendency of the SVM classification hyperplane to shift toward the minority class. We first use the SVM algorithm to obtain the classification hyperplane, then delete some majority-class samples that are far from the hyperplane and generate new minority-class samples near the real boundary with the SMOTE algorithm, iterating so that the classification hyperplane gradually moves closer to the real boundary. Experimental results show that the proposed algorithm improves the F-value and G-mean compared with other resampling methods.

(2) A hybrid sampling algorithm based on the nearest neighbor distribution is proposed, which balances the number of minority and majority samples by changing the distribution of samples. When the Borderline-SMOTE algorithm is used to construct new samples, the importance of each of the k nearest neighbors of a boundary sample is judged, and the neighbors best suited to generating new samples are selected first, so that minority-class samples can be generated more accurately. At the same time, a distance-based under-sampling method is used to delete the majority-class samples that contribute less, and a new balanced data set is constructed. Experimental results show that the proposed algorithm achieves higher F-value and G-mean than other resampling algorithms.

From the data-level perspective, this thesis proposes two imbalanced data classification algorithms based on hybrid sampling to address the shortcomings of using a single sampling algorithm. They improve classification accuracy to a certain extent and provide technical support for imbalanced data analysis.
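The hyperplane-based hybrid sampling idea in aspect (1) and the two reported metrics can be illustrated with a minimal pure-Python sketch. It is not the thesis implementation. It assumes a linear hyperplane `w.x + b = 0` is already given (in the thesis it would come from an SVM), that there are at least two minority samples, and that the F-value is the balanced F1 score. All function names and the parameters `n_drop`, `n_new`, and `k` are hypothetical.

```python
import math
import random


def distance_to_hyperplane(x, w, b):
    """Unsigned distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def smote_point(x, neighbor, rng):
    """SMOTE-style interpolation: a random point on the segment x -> neighbor."""
    gap = rng.random()
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]


def hybrid_resample_once(majority, minority, w, b, n_drop, n_new, k=3, seed=0):
    """One round: drop far majority points, add minority points near the boundary."""
    rng = random.Random(seed)
    # Under-sampling: keep the majority samples closest to the hyperplane,
    # i.e. delete the n_drop samples that lie farthest from it.
    kept = sorted(majority, key=lambda x: distance_to_hyperplane(x, w, b))
    kept = kept[: max(len(majority) - n_drop, 0)]
    # Over-sampling: start from the k minority samples closest to the hyperplane.
    seeds = sorted(minority, key=lambda x: distance_to_hyperplane(x, w, b))[:k]
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(seeds)
        # Interpolate toward one of x's k nearest minority neighbors.
        neighbors = sorted((m for m in minority if m is not x),
                           key=lambda m: math.dist(x, m))[:k]
        synthetic.append(smote_point(x, rng.choice(neighbors), rng))
    return kept, minority + synthetic


def f_value_and_g_mean(tp, fn, fp, tn):
    """F-value (assumed F1) and G-mean, with the minority class as positive."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # minority-class accuracy (sensitivity)
    specificity = tn / (tn + fp)     # majority-class accuracy
    f_value = 2 * precision * recall / (precision + recall)
    g_mean = math.sqrt(recall * specificity)
    return f_value, g_mean
```

In an iterative version, one would refit the classifier on the resampled data, recompute `w` and `b`, and call `hybrid_resample_once` again until the class sizes are roughly balanced. G-mean rewards accuracy on both classes at once, which is why it is commonly preferred over plain accuracy for imbalanced data.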
Keywords/Search Tags: Hybrid Sampling, Imbalance, Classification Hyperplane, Nearest Neighbor Distribution