
Research On Resampling Methods In Unbalanced Learning

Posted on: 2019-08-22    Degree: Master    Type: Thesis
Country: China    Candidate: J W Zhou    Full Text: PDF
GTID: 2438330551460869    Subject: Intelligent computing and systems
Abstract/Summary:
Imbalanced data is common in real-world applications. Most machine-learning classification algorithms assume that the dataset is balanced, or that all misclassifications carry the same cost. Faced with a complex imbalanced dataset, such algorithms cannot effectively capture its distributional characteristics, which degrades classifier performance. In practice, a classifier trained on an imbalanced dataset tends to favor the majority classes, so its predictions for minority-class samples are poor. Yet minority-class samples usually carry important information and incur a higher misclassification cost, so they deserve the most attention. Approaches to imbalanced classification fall into three levels: the data level, the algorithm level, and the ensemble level. In this thesis, after surveying existing strategies for imbalanced learning, we propose three re-sampling techniques at the data level.

First, classifier performance suffers when the characteristics of the minority-class data are under-represented. To address this, we propose an over-sampling method based on the max-min distance. In each dimension there is a minimum distance between the minority-class data and the majority-class data; among these minimum distances we find the maximum one and its corresponding dimension, and in that dimension we apply a controlled perturbation to the minority-class data, thereby generating new minority-class samples.

Second, the key to under-sampling is to reduce the number of majority-class samples while preserving the distribution of the majority class as much as possible. Since a Gaussian mixture model can effectively represent a data distribution of any shape, we propose an under-sampling method based on the Gaussian mixture model: we fit a Gaussian mixture model to the negative (majority-class) data and then under-sample proportionally according to the probability intervals, i.e., the distribution of the data over the Gaussian components.

Third, after analyzing the characteristics of clustering algorithms and under-sampling techniques, we propose an under-sampling method based on a double-layer clustering algorithm. In the outer layer, the K-means algorithm preserves the overall distribution of the majority-class samples; in the inner layer, the K-medoids algorithm performs a clustering analysis within each outer-layer cluster, and the medoids it returns are kept as the final majority-class samples, completing the under-sampling.

A series of experiments validates the effectiveness of the three algorithms. The results on UCI imbalanced datasets show that the proposed re-sampling methods effectively improve classification performance.
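The max-min-distance over-sampling idea can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function name, the `scale` parameter, and the uniform perturbation are assumptions; the thesis does not specify the exact form of the data perturbation.

```python
import numpy as np

def max_min_distance_oversample(X_min, X_maj, n_new, scale=0.5, rng=None):
    """Sketch: over-sampling along the max-min-distance dimension.

    For each feature dimension, compute the minimum absolute distance
    between any minority sample and any majority sample; pick the
    dimension where that minimum is largest (the "safest" direction),
    then perturb randomly chosen minority samples along that dimension
    to create synthetic minority points.
    """
    rng = np.random.default_rng(rng)
    # Pairwise per-dimension distances: shape (n_min, n_maj, d).
    diffs = np.abs(X_min[:, None, :] - X_maj[None, :, :])
    min_dist_per_dim = diffs.min(axis=(0, 1))   # minimum over all pairs, per dim
    dim = int(np.argmax(min_dist_per_dim))      # dimension with the largest margin
    # New samples: copies of random minority points, perturbed in that dimension.
    base = X_min[rng.integers(0, len(X_min), size=n_new)].copy()
    noise = rng.uniform(-1.0, 1.0, size=n_new) * scale * min_dist_per_dim[dim]
    base[:, dim] += noise
    return base
```

Because the perturbation is confined to the dimension with the largest minority-majority margin, the synthetic points are unlikely to cross into majority-class territory.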
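The proportional step of the GMM-based under-sampling might look like the sketch below. The function name is illustrative, and the mixture fit itself is assumed to come from any EM implementation: `resp` stands for the responsibility matrix of an already-fitted Gaussian mixture.

```python
import numpy as np

def proportional_undersample(X_maj, resp, n_keep, rng=None):
    """Sketch: under-sample the majority class in proportion to
    Gaussian-mixture components.

    `resp` is the (n_samples, n_components) responsibility matrix of a
    fitted Gaussian mixture over the majority class. Each sample is
    assigned to its most responsible component, and samples are kept
    from every component in proportion to its share of the data, so the
    thinned majority class preserves the mixture's overall shape.
    """
    rng = np.random.default_rng(rng)
    labels = resp.argmax(axis=1)                # hard component assignment
    kept = []
    for k in range(resp.shape[1]):
        idx = np.flatnonzero(labels == k)
        if len(idx) == 0:
            continue
        # Quota proportional to the component's share of the samples.
        quota = max(1, round(n_keep * len(idx) / len(X_maj)))
        kept.append(rng.choice(idx, size=min(quota, len(idx)), replace=False))
    return X_maj[np.concatenate(kept)]
```

Sampling per component, rather than uniformly at random, keeps low-density components represented after under-sampling.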
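The double-layer clustering idea can be sketched without external dependencies as below. The outer layer is a plain Lloyd's k-means loop; as a simplification of the inner K-medoids step, the sketch keeps the `n_inner` most central samples of each outer cluster (the points minimizing total intra-cluster distance) rather than running full K-medoids iterations. All names and parameters are illustrative.

```python
import numpy as np

def double_layer_undersample(X_maj, n_outer=4, n_inner=2, n_iter=20, seed=0):
    """Sketch: two-layer clustering under-sampling.

    Outer layer: k-means partitions the majority class to preserve its
    overall distribution. Inner layer: within each outer cluster, keep
    the n_inner medoid-like points (actual samples with the smallest
    total distance to the rest of the cluster) as the retained samples.
    """
    rng = np.random.default_rng(seed)
    # --- outer layer: minimal k-means (Lloyd's algorithm) ---
    centers = X_maj[rng.choice(len(X_maj), size=n_outer, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X_maj[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(n_outer):
            if np.any(labels == k):
                centers[k] = X_maj[labels == k].mean(axis=0)
    # --- inner layer: keep the most central points of each cluster ---
    kept = []
    for k in range(n_outer):
        cluster = X_maj[labels == k]
        if len(cluster) == 0:
            continue
        pd = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=2)
        order = pd.sum(axis=1).argsort()        # most central points first
        kept.append(cluster[order[: min(n_inner, len(cluster))]])
    return np.vstack(kept)
```

Keeping medoids rather than k-means centroids guarantees that every retained point is a real majority-class sample, which is the reason the inner layer uses K-medoids in the thesis.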
Keywords/Search Tags: Imbalanced learning, Re-sampling, Gaussian mixture model, Machine learning, Clustering algorithm