Research On Hybrid Sampling Algorithm Under Denoising In Imbalanced Classification

Posted on: 2021-01-13   Degree: Master   Type: Thesis
Country: China   Candidate: M H Shi   Full Text: PDF
GTID: 2428330611466801   Subject: Computational Mathematics
Abstract/Summary:
In data mining, imbalanced classification has long been an active research topic. General-purpose classification algorithms perform well on balanced data sets but struggle on imbalanced ones. For example, in medical diagnosis, credit card fraud detection, and mechanical failure detection, the class of interest accounts for only a small proportion of the data set, yet the cost of misclassifying it is difficult to estimate, so improving classification accuracy on the minority class is very important. In addition, some classification algorithms are sensitive to noise points, and when the samples of a class are extremely scarce it is difficult to distinguish genuine samples from noise. This is another reason why it is hard to achieve the desired results on imbalanced classification problems.

Research on classifying imbalanced data focuses on the data level and the algorithm level. Data-level methods adapt the data to traditional classifiers by reducing the imbalance ratio of the data set; algorithm-level methods improve recognition by increasing the misclassification cost of minority class samples. This thesis studies noise and imbalanced classification at the data level. The specific contents are as follows:

(1) A hybrid sampling under denoising based on density (HSDBD) algorithm is proposed. The algorithm first uses the Borderline-SMOTE algorithm to divide the minority class samples into three parts: a noise sample set, a boundary sample set, and a safe sample set. The noise sample set is eliminated, the boundary samples are weighted according to their density distribution, and new minority samples are generated in a more reasonable way. At the same time, the majority class samples are screened using an improved undersampling algorithm for imbalanced data. The algorithm contributes significantly to denoising while retaining the majority class samples with high information content. Experiments show that the HSDBD algorithm can effectively address the imbalanced classification problem.

(2) A hybrid sampling under denoising based on clustering (HSDBC) algorithm is proposed to balance the data after noise has been removed. The algorithm first uses a K-means-based outlier detection algorithm to eliminate outliers and divides the training set into several clusters, each with a different imbalance ratio. A different sampling method is then applied to each cluster according to its imbalance ratio. Evaluation with AUC, F1, and G-mean shows that the HSDBC algorithm improves classification performance on minority class samples.
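
The three-way split of the minority class described in (1) can be illustrated with a minimal sketch. It assumes the standard Borderline-SMOTE neighbor-count rule: a minority sample whose m nearest neighbors are all majority samples is treated as noise, one with at least half majority neighbors as boundary, and the rest as safe. The function name `partition_minority`, the parameter `m`, and the toy data are hypothetical and chosen only for illustration. The abstract does not specify the thesis's density weighting or undersampling steps, so they are not reproduced here.

```python
import math


def partition_minority(X, y, minority_label=1, m=5):
    """Tag each minority sample as 'noise', 'boundary', or 'safe'.

    Assumed rule (standard Borderline-SMOTE, not the thesis's exact code):
    count the majority samples among the m nearest neighbors.
    """
    tags = {}
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority_label:
            continue
        # Euclidean distances from sample i to every other training sample.
        dists = sorted(
            (math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i
        )
        neighbors = [j for _, j in dists[:m]]
        k = len(neighbors)
        n_major = sum(1 for j in neighbors if y[j] != minority_label)
        if n_major == k:
            tags[i] = "noise"      # every neighbor belongs to the majority
        elif 2 * n_major >= k:
            tags[i] = "boundary"   # at least half of the neighbors are majority
        else:
            tags[i] = "safe"       # mostly surrounded by minority samples
    return tags


# Toy data (hypothetical): label 1 is the minority class, label 0 the majority.
X = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),  # 0-3: tight minority cluster
    (0.9, 0.9),                                      # 4: minority near majority points
    (5.05, 5.05),                                    # 5: minority inside majority cluster
    (0.6, 0.6), (1.2, 1.2),                          # 6-7: majority
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),  # 8-11: majority cluster
]
y = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

tags = partition_minority(X, y, minority_label=1, m=3)
print(tags)
```

On this toy data with m = 3, samples 0 to 3 are tagged safe, sample 4 boundary, and sample 5 noise. In a pipeline of the kind the abstract describes, the noise samples would be dropped and only the boundary samples would seed the synthesis of new minority samples.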
Keywords/Search Tags: Imbalanced data, Hybrid sampling, Denoising, Density, Clustering