Research On Sampling Method For Unbalanced Data

Posted on: 2022-12-15
Degree: Master
Type: Thesis
Country: China
Candidate: C Jiang
Full Text: PDF
GTID: 2518306614459014
Subject: Computer Software and Application of Computer
Abstract/Summary:
In recent years, the classification of unbalanced data has remained an active research topic, with applications in many fields such as medical treatment, finance, and information security. Unlike traditional data classification, unbalanced data contain a category (the minority class) whose number of samples is much smaller than that of the other classes, so a classifier cannot produce accurate results for the minority class. In many cases, however, correct identification of the minority class is what users care about most. Effectively identifying the minority class is therefore the key to solving the unbalanced data classification problem. This thesis studies and improves sampling methods for unbalanced data. The main work is as follows.

First, traditional oversampling algorithms ignore the importance of minority samples near the boundary, tend to synthesize overlapping samples, and do not handle outlier noise points. To address these problems, an oversampling method for unbalanced data based on boundaries and clusters is proposed. The density peak clustering algorithm is first used to cluster the minority samples, so that outlier noise points among them can be identified and handled. The sampling weight of the minority boundary region is then determined adaptively, and the sampling proportion is set according to the sparsity of each sub-cluster. New minority samples are synthesized in the boundary region and within the clusters, and the synthesized samples are combined with the original data to obtain a balanced data set. Comparative experiments are carried out with different classifiers. The results show that the proposed algorithm can effectively alleviate sample imbalance and improves, to a certain extent, the accuracy of classifiers in recognizing the minority class.

Second, the sampling problem in unbalanced data is studied further. Undersampling easily loses important majority class sample information, and oversampling may fail to synthesize the minority class samples that are most useful for the classification decision. To address these problems, a density-based nearest neighbor optimization hybrid sampling method is proposed. A density coefficient is first introduced so that samples near the boundary have a larger density. A clustering method is then used to reduce the majority class, selecting samples that represent the overall majority class distribution. In the oversampling step, sampling weights are allocated according to the density coefficient, so that more minority samples are synthesized in the area close to the boundary, which further strengthens decision support at the minority class boundary. A balanced data set is obtained after this hybrid sampling. Experiments show that, compared with other hybrid sampling algorithms, the proposed algorithm has clear advantages in dealing with unbalanced data.
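The idea of synthesizing more minority samples near the class boundary can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: the abstract does not define the density coefficient or the clustering steps, so the sketch assumes an ADASYN-style stand-in weight (the fraction of majority samples among each minority sample's k nearest neighbors) and SMOTE-style linear interpolation between minority neighbors. The function name `boundary_weighted_oversample` and its parameters are hypothetical.

```python
import numpy as np

def boundary_weighted_oversample(X_min, X_maj, n_new, k=5, seed=0):
    """Synthesize n_new minority samples, favoring points near the boundary.

    Assumption: "near the boundary" is approximated by the share of majority
    samples among the k nearest neighbors (an ADASYN-style weight), not by
    the density coefficient used in the thesis. Needs at least 2 minority rows.
    """
    rng = np.random.default_rng(seed)
    n_min = len(X_min)
    X_all = np.vstack([X_min, X_maj])
    is_maj = np.r_[np.zeros(n_min, dtype=bool), np.ones(len(X_maj), dtype=bool)]

    # Distances from every minority sample to every sample; mask out self.
    dist = np.linalg.norm(X_min[:, None, :] - X_all[None, :, :], axis=2)
    dist[np.arange(n_min), np.arange(n_min)] = np.inf

    # Boundary weight: fraction of majority points among the k nearest neighbors.
    nn_all = np.argsort(dist, axis=1)[:, :k]
    weights = is_maj[nn_all].mean(axis=1)
    if weights.sum() == 0:  # no minority sample touches the majority class
        weights = np.ones(n_min)
    probs = weights / weights.sum()

    # Minority-only neighbors, used as interpolation partners.
    k_min = min(k, n_min - 1)
    nn_min = np.argsort(dist[:, :n_min], axis=1)[:, :k_min]

    # Pick base points by boundary weight, then interpolate toward a neighbor.
    base = rng.choice(n_min, size=n_new, p=probs)
    partner = nn_min[base, rng.integers(0, k_min, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[partner] - X_min[base])
```

Because each synthetic point is a convex combination of two existing minority samples, it stays inside the region the minority class already occupies. Outlier handling and majority-class reduction, which the abstract also describes, are omitted here.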
Keywords/Search Tags: cluster, unbalanced data, sampling, boundary density