
Research on Oversampling Methods for Imbalanced Data

Posted on: 2021-01-21    Degree: Master    Type: Thesis
Country: China    Candidate: X M Zhou    Full Text: PDF
GTID: 2428330626955176    Subject: Computer application technology
Abstract/Summary:
Imbalanced data are common in real life, and the minority class is often the more valuable one. Traditional classifiers, however, aim to maximize overall classification accuracy, so they cannot classify the minority class effectively. Resampling is an important approach to classifying imbalanced data. On small data sets, undersampling may discard important information, so oversampling is the focus of work on imbalanced classification. Although existing oversampling methods effectively address the imbalance between classes, they do not take into account the distribution of samples within the minority class. Oversampling all minority samples without distinction may cause samples within the class to overlap, may amplify the impact of noise when noise is present, and may fail to expand the minority class region effectively. Each of these effects reduces the classification accuracy of the minority class. We therefore propose improvements that target these problems in existing oversampling methods. The main contributions are as follows.

(1) Existing oversampling methods may make dense areas of the minority class even denser and may even cause samples to overlap. In addition, when the minority class contains noise, existing methods may generate new samples around the noise points, which makes the distribution of the minority class more confused. To address these problems, we propose a bidirectional oversampling method based on sample stratification. We first divide the minority samples into a dense area and a sparse area based on the highest-density point and the intra-class average distance, and then perform bidirectional oversampling in the boundary region of the dense area and in the sparse area.

(2) Existing oversampling methods synthesize all samples in a single pass. They use only a small part of the information carried by the original minority samples, and the synthesized samples are too concentrated. To expand the minority region gradually and make the synthesized minority samples more uniform and effective, we propose an incremental deletion oversampling method. We first use neighbor features to delete noise points. The SMOTE algorithm is then used to double-synthesize the minority samples, the relatively dense synthetic samples are deleted, and the remaining synthetic samples are added to the original minority samples to form the seed samples. This process is iterated until the seed samples and the majority samples are balanced in number.

In summary, we propose two new oversampling algorithms for the classification of imbalanced data and verify them on real data sets. The experimental results show that the proposed algorithms have certain advantages on imbalanced classification tasks, effectively improve the classification accuracy of minority class samples, and provide new ideas and methods for classifying imbalanced data in real life.
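The incremental deletion procedure in (2) can be illustrated with a minimal NumPy sketch. It is not the thesis implementation: the abstract does not give the exact rules, so several details here are assumptions. A minority point is treated as noise when none of its k nearest neighbors is a minority point, "double-synthesize" is read as generating twice as many synthetic samples as there are current seeds, and "relatively dense" synthetic samples are taken to be those closest to an existing seed. All function names and the parameters `k` and `seed` are hypothetical.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distance matrix between the rows of a and the rows of b."""
    return np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2))

def remove_noise(minority, majority, k=5):
    """Assumed noise rule: drop minority points with no minority point
    among their k nearest neighbours."""
    data = np.vstack([minority, majority])
    is_min = np.arange(len(data)) < len(minority)
    d = pairwise_dist(minority, data)
    idx = np.arange(len(minority))
    d[idx, idx] = np.inf  # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]
    return minority[is_min[nn].any(axis=1)]

def smote(seeds, n_new, k, rng):
    """Plain SMOTE: interpolate between a seed and one of its k nearest seeds."""
    d = pairwise_dist(seeds, seeds)
    np.fill_diagonal(d, np.inf)
    k = min(k, len(seeds) - 1)
    nn = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, len(seeds), size=n_new)
    neigh = nn[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return seeds[base] + gap * (seeds[neigh] - seeds[base])

def incremental_deletion_oversample(minority, majority, k=5, seed=0):
    """Grow the seed set round by round until it matches the majority size."""
    rng = np.random.default_rng(seed)
    seeds = remove_noise(minority, majority, k)
    if len(seeds) < 2:
        raise ValueError("need at least two non-noise minority samples")
    target = len(majority)
    while len(seeds) < target:
        # Assumed reading of "double-synthesize": 2x the current seed count.
        synth = smote(seeds, 2 * len(seeds), k, rng)
        # Assumed density proxy: distance to the nearest current seed.
        nearest = pairwise_dist(synth, seeds).min(axis=1)
        # Delete the denser synthetic samples; keep the sparsest ones,
        # capped so the seed set never exceeds the majority size.
        n_keep = min(len(seeds), target - len(seeds))
        keep = np.argsort(nearest)[::-1][:n_keep]
        seeds = np.vstack([seeds, synth[keep]])
    return seeds
```

Calling `incremental_deletion_oversample(X_min, X_maj)` on two feature arrays returns the final seed set, with the original non-noise minority samples in the first rows and the retained synthetic samples after them. Because each round synthesizes from the enlarged seed set, later samples can fill gaps that a single SMOTE pass over the original minority samples would not reach.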
Keywords/Search Tags: Imbalanced data, Classification, Sample stratification, Incremental, Deletion