
Unbalanced Data Sampling Based On Sample Prior Distribution Information

Posted on: 2020-11-03    Degree: Master    Type: Thesis
Country: China    Candidate: T Li    Full Text: PDF
GTID: 2428330590951023    Subject: Computer Science and Technology
Abstract/Summary:
The class imbalance problem is one of the main concerns in the fields of machine learning and data mining. In recent years especially, with the growing popularity of the Internet and the arrival of the big-data era, class imbalance has become a hot research topic. In layman's terms, the class imbalance problem arises when, within the same data set, the number of samples in one or more classes is much larger or much smaller than in the other classes. The emergence of class imbalance has had a great impact on traditional classification models. Through the tireless efforts of researchers, a variety of effective algorithms have been proposed. Although these algorithms solve the class imbalance problem to some extent, they either ignore the effect of noise points or cannot reasonably handle samples according to their distribution.

In view of the above problems, this thesis addresses the class imbalance problem from the perspective of sample sampling. The work focuses on two questions: how to reasonably remove the noise points among the samples, and how to divide all samples into different regions according to their distribution information so that different sampling strategies can be applied adaptively. The specific research content is summarized in the following two aspects:

1) When solving the class imbalance problem with instance-sampling techniques, both oversampling and undersampling face an important problem: how to remove the noise points in the samples. Since this thesis uses oversampling, removing noise points becomes even more important. Once a noise point cannot be identified and removed, it has a great impact on the sampling process: the distribution of the newly synthesized samples is influenced by the prior distribution information of the noise point, so the noise spreads further and the performance of the classification model degrades. In this thesis, a Gaussian mixture model is used to fit the probability density of the samples, and a sample is judged to be a noise point according to the relative ratio of its probability density under its own class to its probability density under the other classes; if it is, the sample is removed. The experimental results show that the proposed de-noising algorithm effectively removes the noise points in the samples.

2) After the de-noising process, adaptive sampling is further considered. First, a Gaussian mixture model is fitted to the cleaned minority-class samples and the probability density of each sample is obtained. After sorting in descending order, the samples with relatively large probability density are named "safe samples"; the remaining samples are then ranked against the majority class, and among them those with large probability density are named "boundary samples" while the rest are named "outlier samples". The three subsets are disjoint, and the corresponding thresholds are set in the experiments. Finally, according to the distribution characteristics of these three regions, different parameters are allocated to execute the sampling algorithm.
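The noise-removal idea in aspect 1) can be sketched with scikit-learn's `GaussianMixture`. This is a minimal illustration, not the thesis's exact procedure: the per-class single-Gaussian fit, the comparison in log space, and the threshold `log_ratio_threshold=0.0` are all assumptions introduced here for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def remove_noise(X, y, n_components=1, log_ratio_threshold=0.0):
    """Remove samples whose density under their own class's GMM is too
    small relative to their density under another class's GMM."""
    classes = np.unique(y)
    gmms = {c: GaussianMixture(n_components=n_components,
                               random_state=0).fit(X[y == c])
            for c in classes}
    # log-density of every sample under every per-class GMM
    log_dens = {c: gmms[c].score_samples(X) for c in classes}
    keep = np.ones(len(y), dtype=bool)
    for i, c in enumerate(y):
        own = log_dens[c][i]                                # log p(x | own class)
        other = max(log_dens[k][i] for k in classes if k != c)
        # "relative ratio" of densities, taken in log space: discard the
        # sample when another class's density dominates its own class's
        keep[i] = (own - other) > log_ratio_threshold
    return X[keep], y[keep]
```

A sample labeled class 0 that sits deep inside the class-1 region has a far higher density under the class-1 model, so the log-ratio test flags it as a noise point and drops it before any synthetic samples are generated from it.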
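The three-way partition in aspect 2) can be sketched the same way. The fractions `safe_frac` and `border_frac` below are hypothetical stand-ins for the thresholds that the thesis sets experimentally; a single-component GMM is again an illustrative choice.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def partition_minority(X_min, X_maj, safe_frac=0.4, border_frac=0.3,
                       n_components=1):
    """Split the (already de-noised) minority samples into disjoint
    'safe', 'boundary' and 'outlier' index sets."""
    # Step 1: rank minority samples by density under the minority-class
    # GMM; the densest safe_frac of them become the "safe samples".
    gmm_min = GaussianMixture(n_components=n_components,
                              random_state=0).fit(X_min)
    order = np.argsort(-gmm_min.score_samples(X_min))   # descending density
    n_safe = int(safe_frac * len(X_min))
    safe, rest = order[:n_safe], order[n_safe:]
    # Step 2: rank the remaining samples by density under the
    # majority-class GMM; a high density there means the sample lies
    # near the class boundary, a low one marks it as an outlier.
    gmm_maj = GaussianMixture(n_components=n_components,
                              random_state=0).fit(X_maj)
    order_rest = rest[np.argsort(-gmm_maj.score_samples(X_min[rest]))]
    n_border = int(border_frac * len(X_min))
    boundary, outlier = order_rest[:n_border], order_rest[n_border:]
    return safe, boundary, outlier
```

The three index sets can then drive the final step of the abstract: run the sampling algorithm (e.g. SMOTE) with different parameters per region, such as synthesizing more new samples around boundary points than around safe points and few or none around outliers.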
Keywords/Search Tags: class imbalance, Gaussian mixture model, probability density estimation, SMOTE, adaptive sampling