
Research On Undersampling Method Based On Sample's Distribution Shape And Weight

Posted on: 2022-04-29   Degree: Master   Type: Thesis
Country: China   Candidate: L X Chen   Full Text: PDF
GTID: 2480306521481404   Subject: Statistics
Abstract/Summary:
Unbalanced data classification arises in many real-world applications, such as identifying risky behaviors, assessing the severity of cancer, and recognizing emotion in text. Research on it can guide the relevant decision making and therefore has practical significance. When a traditional classifier is applied directly to unbalanced data, the model tends to assign samples to the majority class, which lowers the classification accuracy on the minority class. Working at the data level, this thesis balances the data set by undersampling the majority class in order to improve classification accuracy. It identifies two problems with the NearMiss method: noise samples are not considered, and computation is slow. To address them, an undersampling method based on shape and weight, USSW, is proposed. Compared with resampling methods based on the K-nearest-neighbor idea, USSW reduces sampling time and makes better use of the shape information of the samples in the data set.

The main improvements in USSW are as follows: 1. Instead of calculating the distances between pairs of samples from the two classes, as the K-nearest-neighbor idea does, USSW first calculates the center of the minority class and then calculates the distance between every sample and that center, which reduces computation time. 2. It introduces the "shape" of the data into unbalanced data classification. Most sampling methods use only the distances between samples and do not fully exploit the distribution information of the data, although the distribution of sample distances can reflect, to some extent, the distribution of the original data set.

Experiments were run on 8 real-world unbalanced data sets, with three comprehensive indicators, the F1 value, the G-mean value, and the AUC value, as the evaluation criteria for classification performance. The following conclusions were drawn: 1. The quality of the sample data can be preliminarily judged from the shape information of the data. 2. Compared with the NearMiss method, USSW improves the classification performance on unbalanced data and is more stable. 3. USSW samples faster than NearMiss-2. In conclusion, USSW improves both classification performance and sampling speed compared with the NearMiss method.
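
The center-distance step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's USSW implementation: the abstract does not say how the shape and weight information drive the selection, so the rule used here (keep the majority samples nearest to the minority-class center, as many as there are minority samples) is an assumption made only for illustration. The function names `center_distances` and `undersample_by_center` are hypothetical. The sketch mainly shows why the step is cheap: it needs one distance per majority sample rather than one per majority-minority pair.

```python
import numpy as np


def center_distances(X_majority, X_minority):
    """Euclidean distance from each majority sample to the minority-class center."""
    center = np.asarray(X_minority, dtype=float).mean(axis=0)
    return np.linalg.norm(np.asarray(X_majority, dtype=float) - center, axis=1)


def undersample_by_center(X_majority, X_minority, n_keep=None):
    """Return sorted indices of the majority samples that are kept.

    ASSUMPTION: keeping the n_keep samples nearest to the center is an
    illustrative rule, not the shape-and-weight rule of USSW.
    """
    dist = center_distances(X_majority, X_minority)
    if n_keep is None:
        n_keep = len(X_minority)        # balance the two classes
    n_keep = min(n_keep, len(dist))     # cannot keep more than exist
    nearest = np.argsort(dist, kind="stable")[:n_keep]
    return np.sort(nearest)


# Toy data: the minority-class center is (1, 0).
X_min = np.array([[0.0, 0.0], [2.0, 0.0]])
X_maj = np.array([[1.0, 0.0], [10.0, 0.0], [1.0, 1.0], [5.0, 5.0]])

kept = undersample_by_center(X_maj, X_min)
print(kept)  # [0 2]
```

With n_maj majority samples and n_min minority samples, this step computes n_maj distances, whereas a pairwise nearest-neighbor approach computes on the order of n_maj × n_min. The array `dist` returned by `center_distances` is also the kind of distance distribution whose shape the abstract refers to.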
Keywords/Search Tags: Unbalanced data, Undersampling, Logistic regression, Shape, Weight