
Unbalanced Data Sampling Based On Sample Prior Distribution Information

Posted on: 2020-11-03    Degree: Master    Type: Thesis
Country: China    Candidate: T Li    Full Text: PDF
GTID: 2428330590951023    Subject: Computer Science and Technology
Abstract/Summary:
The class imbalance problem is one of the main concerns in the fields of machine learning and data mining. In recent years especially, with the growing popularity of the Internet and the arrival of the big-data era, class imbalance has become a hot research topic. In layman's terms, the class imbalance problem arises when, within the same data set, the number of samples in one or more classes is much larger or much smaller than in the other classes. The emergence of class imbalance has had a great impact on traditional classification models. Through the tireless efforts of researchers, a variety of effective algorithms have been proposed. Although these algorithms solve the class imbalance problem to some extent, they either ignore the effect of noise points or cannot reasonably handle samples according to their distribution.

In view of the above problems, this thesis addresses the class imbalance problem from the perspective of sample sampling. The work focuses on two questions: how to reasonably remove the noise points among the samples, and how to divide all samples into different regions according to their distribution information so that different sampling strategies can be applied adaptively. The specific research content is summarized in the following two aspects:

1) When solving the class imbalance problem with instance-sampling techniques, both oversampling and undersampling face an important problem: how to remove the noise points in the samples. Since this thesis uses oversampling, removing noise points becomes even more important. Once a noise point cannot be identified and removed, it has a great impact on the sampling process: the distribution of the newly synthesized samples is influenced by the prior distribution information of the noise point, so the noise spreads further and the performance of the classification model degrades. In this thesis, a Gaussian mixture model is used to fit the probability density of the samples, and a sample is judged to be a noise point according to the relative ratio of its probability density under its own class to its probability density under the other classes; if it is, the sample is removed. The experimental results show that the proposed de-noising algorithm effectively removes the noise points in the samples.

2) After the de-noising process, adaptive sampling is further considered. First, a Gaussian mixture model is fitted to the cleaned minority-class samples and the probability density of each sample is obtained. After sorting in descending order, the samples with relatively large probability density are named "safe samples"; the remaining samples are then ranked against the majority class, and among them those with large probability density are named "boundary samples" while the rest are named "outlier samples". The three subsets are disjoint, and the corresponding thresholds are set in the experiments. Finally, according to the distribution characteristics of these three regions, different parameters are allocated to execute the sampling algorithm.
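The noise-removal idea in aspect 1) can be sketched with scikit-learn's `GaussianMixture`. This is a minimal illustration, not the thesis's exact procedure: the per-class single-Gaussian fit, the comparison in log space, and the threshold `log_ratio_threshold=0.0` are all assumptions introduced here for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def remove_noise(X, y, n_components=1, log_ratio_threshold=0.0):
    """Remove samples whose density under their own class's GMM is too
    small relative to their density under another class's GMM."""
    classes = np.unique(y)
    gmms = {c: GaussianMixture(n_components=n_components,
                               random_state=0).fit(X[y == c])
            for c in classes}
    # log-density of every sample under every per-class GMM
    log_dens = {c: gmms[c].score_samples(X) for c in classes}
    keep = np.ones(len(y), dtype=bool)
    for i, c in enumerate(y):
        own = log_dens[c][i]                                # log p(x | own class)
        other = max(log_dens[k][i] for k in classes if k != c)
        # "relative ratio" of densities, taken in log space: discard the
        # sample when another class's density dominates its own class's
        keep[i] = (own - other) > log_ratio_threshold
    return X[keep], y[keep]
```

A sample labeled class 0 that sits deep inside the class-1 region has a far higher density under the class-1 model, so the log-ratio test flags it as a noise point and drops it before any synthetic samples are generated from it.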
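The three-way partition in aspect 2) can be sketched the same way. The fractions `safe_frac` and `border_frac` below are hypothetical stand-ins for the thresholds that the thesis sets experimentally; a single-component GMM is again an illustrative choice.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def partition_minority(X_min, X_maj, safe_frac=0.4, border_frac=0.3,
                       n_components=1):
    """Split the (already de-noised) minority samples into disjoint
    'safe', 'boundary' and 'outlier' index sets."""
    # Step 1: rank minority samples by density under the minority-class
    # GMM; the densest safe_frac of them become the "safe samples".
    gmm_min = GaussianMixture(n_components=n_components,
                              random_state=0).fit(X_min)
    order = np.argsort(-gmm_min.score_samples(X_min))   # descending density
    n_safe = int(safe_frac * len(X_min))
    safe, rest = order[:n_safe], order[n_safe:]
    # Step 2: rank the remaining samples by density under the
    # majority-class GMM; a high density there means the sample lies
    # near the class boundary, a low one marks it as an outlier.
    gmm_maj = GaussianMixture(n_components=n_components,
                              random_state=0).fit(X_maj)
    order_rest = rest[np.argsort(-gmm_maj.score_samples(X_min[rest]))]
    n_border = int(border_frac * len(X_min))
    boundary, outlier = order_rest[:n_border], order_rest[n_border:]
    return safe, boundary, outlier
```

The three index sets can then drive the final step of the abstract: run the sampling algorithm (e.g. SMOTE) with different parameters per region, such as synthesizing more new samples around boundary points than around safe points and few or none around outliers.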
Keywords/Search Tags: class imbalance, Gaussian mixture model, probability density estimation, SMOTE, adaptive sampling