Research On Undersampling Method Based On Sample's Distribution Shape And Weight

Posted on:2022-04-29

Degree:Master

Type:Thesis

Country:China

Candidate:L X Chen

GTID:2480306521481404

Subject:Statistics

Abstract/Summary:

Unbalanced data classification problems are common in real life, for example in identifying risky behaviors, assessing the severity of cancer, and recognizing emotion in text. Research on classifying unbalanced data can guide related decision making and has practical significance. When a traditional classifier is applied directly to unbalanced data, the model tends to predict the majority class, which lowers classification accuracy on the minority class. Working at the data level, this paper balances the data set by undersampling the majority class in order to improve classification accuracy.

The paper addresses two problems of the Nearmiss method: noise samples are not considered, and computation is slow. It proposes USSW, an undersampling method based on shape and weight. Compared with resampling methods based on the K-nearest-neighbor idea, USSW reduces sampling time and makes better use of the shape information of the samples in the data set. The main improvements of USSW are as follows:

1. Instead of calculating the distance between every pair of samples from the two classes, as K-nearest-neighbor methods do, USSW first calculates the center of the minority class and then calculates the distance from every sample to that center, which reduces computation time.
2. The "shape" of the data is introduced into the unbalanced classification problem. Most sampling methods use only the distances between samples and do not make full use of the distribution information of the data. The distribution of sample distances can, to some extent, reflect the distribution of the original data set.

Experiments on 8 real-world unbalanced data sets, with three composite indicators (F1 value, G-means value, and AUC value) as the evaluation criteria for classification performance, lead to the following conclusions:

1. The quality of the sample data can be preliminarily judged from the shape information of the data.
2. Compared with the Nearmiss method, USSW effectively improves classification performance on unbalanced data and is more stable.
3. USSW samples faster than Nearmiss-2.

In conclusion, USSW improves on the Nearmiss method in both classification performance and sampling speed.
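
The distance-to-center step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's USSW implementation: the abstract does not say how the shape and weight information enter the sampling, so the selection rule below (keep the majority samples nearest the minority center until the classes are balanced) is a placeholder assumption. The function names, the Euclidean distance, and the use of the arithmetic mean as the "center" are also assumptions made for illustration.

```python
import numpy as np


def distances_to_minority_center(X, y, minority_label=1):
    """Return the minority-class mean and each sample's Euclidean distance to it.

    This needs one distance per sample, rather than one distance per
    (majority, minority) pair as in neighbor-based methods.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    center = X[y == minority_label].mean(axis=0)
    dists = np.linalg.norm(X - center, axis=1)
    return center, dists


def undersample_by_center_distance(X, y, minority_label=1):
    """Balance the classes with a placeholder rule (an assumption, not USSW):
    keep all minority samples plus the majority samples closest to the
    minority center, as many as there are minority samples.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    _, dists = distances_to_minority_center(X, y, minority_label)

    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    n_keep = min(len(minority_idx), len(majority_idx))

    # Rank majority samples by distance to the minority center.
    order = np.argsort(dists[majority_idx], kind="stable")
    kept_majority = majority_idx[order[:n_keep]]

    # Preserve the original sample order in the output.
    keep = np.sort(np.concatenate([minority_idx, kept_majority]))
    return X[keep], y[keep]


# Toy data: two minority samples (label 1) and four majority samples (label 0).
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0],
                   [3.0, 0.0], [4.0, 0.0], [10.0, 0.0]])
y_demo = np.array([1, 1, 0, 0, 0, 0])
X_bal, y_bal = undersample_by_center_distance(X_demo, y_demo)
print(y_bal)
```

With n_maj majority and n_min minority samples, a pairwise approach computes on the order of n_maj × n_min distances, while the center-based step computes n_maj + n_min, which is consistent with the speed advantage the abstract reports.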

Keywords/Search Tags:

Unbalanced data, Undersampling, Logistic regression, Shape, Weight

Related items

1	Customer Default Prediction In Lendingclub Data Based On Classification Integration
2	Logistic regression with ridge penalty applying to genetic expression data (Spanish text)
3	Classification Variables Of Logistic Regression Model And Its Application Research
4	Statistical Analysis Of Massive Imbalanced Data With Multiclass Logistic Regression
5	Improved Logistic Regression Model Under High Dimensional Data And Its Application
6	Logistic regression and item response theory: Estimation item and ability parameters by using logistic regression in IR
7	Statistical Analysis Of Personal Credit Default Risk Prediction
8	Theory And Application For The Logistic Regression Models Based On Case-Control Data
9	Theory And Application For The Logistic Regression Models Based On Case-control Data
10	Improved Ridge Regression Estimators For The Logistic Regression Model