
Research On Rebalance Algorithm For Imbalanced Data Based On Probability Graph

Posted on: 2021-02-17
Degree: Master
Type: Thesis
Country: China
Candidate: T Y Li
GTID: 2428330602989026
Subject: Applied Mathematics
Abstract/Summary:
The classification of imbalanced data is an important research direction in machine learning and data mining. Minority-class samples are fewer than majority-class samples but often carry more value. On imbalanced data, classifiers tend to predict the majority class, which reduces classification accuracy. Many scholars have proposed oversampling methods that consider the geometric characteristics and spatial distribution of the original samples, but these methods ignore the statistical characteristics of the data, so they generate poor-quality samples and reduce classification accuracy. To address this, this thesis proposes a probability graph rebalance algorithm for imbalanced data based on the Gaussian mixture model and the EM algorithm (GMM-EM). First, the probability density functions of the minority-class and majority-class data are estimated with the GMM and the EM algorithm. Next, according to the probability graph (?-? graph) of the dataset, each minority-class sample is assigned a security level, and a weight is given to it according to that level. Oversampling is then performed with the newly proposed algorithm, which not only considers the direction of data generation but also keeps the probability distribution of the data consistent before and after balancing. Finally, the balanced data are classified with a decision tree classifier. Experimental results show that the proposed algorithm is more efficient than other existing algorithms. Further study shows that the imbalance rate R of the original data affects the performance of the new algorithm: R should be neither too large nor too small, and the best classification results are obtained when 1.48 < R ≤ 5.14. The generating weights of the minority-class data also affect the classification results; the optimal generating weights, indexed 1 through 4, are 0.2, 0.3, 0.5, and 1. Tests on the minority-class data before and after generation show that the selected data satisfy or approximately satisfy the Gaussian distribution and that the balanced data are closer to it, which further verifies the effectiveness of the new algorithm.
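The pipeline described in the abstract can be illustrated with a minimal, standard-library-only Python sketch on one-dimensional data. It is not the thesis's algorithm. The abstract does not say how security levels are defined, which level receives which weight, or how the generating direction is chosen, so the following are all assumptions made for illustration: the four-way bucketing of a density ratio, the mapping of the weights 0.2, 0.3, 0.5 and 1 onto levels, the choice to draw each new sample from the minority mixture component nearest the seed, and every function name. The decision tree classification step is omitted.

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_em(data, k=2, iters=50):
    """Fit a 1-D Gaussian mixture by EM; returns a list of (weight, mean, var)."""
    srt, n = sorted(data), len(data)
    mean_all = sum(srt) / n
    var_all = max(sum((x - mean_all) ** 2 for x in srt) / n, 1e-6)
    # Initialise means at evenly spaced quantiles with a shared variance.
    comps = [(1.0 / k, srt[(2 * j + 1) * n // (2 * k)], var_all) for j in range(k)]
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v) for w, m, v in comps]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate mixing weight, mean and variance per component.
        new_comps = []
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-12
            m = sum(r[j] * x for r, x in zip(resp, data)) / nj
            v = sum(r[j] * (x - m) ** 2 for r, x in zip(resp, data)) / nj
            new_comps.append((nj / n, m, max(v, 1e-6)))
        comps = new_comps
    return comps

def mixture_pdf(x, comps):
    return sum(w * normal_pdf(x, m, v) for w, m, v in comps)

# Weight values come from the abstract; assigning them to levels
# 0 (safest) .. 3 (least safe) in this order is an assumption.
LEVEL_WEIGHTS = [0.2, 0.3, 0.5, 1.0]

def security_level(x, min_comps, maj_comps):
    """Hypothetical rule: bucket the minority share of the density into 4 levels."""
    p_min, p_maj = mixture_pdf(x, min_comps), mixture_pdf(x, maj_comps)
    share = p_min / (p_min + p_maj) if p_min + p_maj > 0 else 0.0
    return min(3, int((1.0 - share) * 4))

def oversample(minority, majority, rng):
    """Add synthetic minority points until both classes have the same size."""
    min_comps = fit_gmm_em(minority)
    maj_comps = fit_gmm_em(majority)
    weights = [LEVEL_WEIGHTS[security_level(x, min_comps, maj_comps)] for x in minority]
    new_points = []
    for _ in range(len(majority) - len(minority)):
        # Pick a seed point in proportion to its weight.
        seed = rng.choices(minority, weights=weights, k=1)[0]
        # Assumed generation rule: sample from the minority component most
        # responsible for the seed, so new points follow the fitted density.
        _, m, v = max(min_comps, key=lambda c: c[0] * normal_pdf(seed, c[1], c[2]))
        new_points.append(rng.gauss(m, math.sqrt(v)))
    return minority + new_points

rng = random.Random(0)
majority = [rng.gauss(0.0, 1.0) for _ in range(200)]
minority = [rng.gauss(3.0, 0.5) for _ in range(40)]
balanced_minority = oversample(minority, majority, rng)
print(len(minority), "->", len(balanced_minority))
```

Drawing new points from the fitted mixture, rather than interpolating between neighbours, is one simple way to keep the minority distribution roughly unchanged after balancing, which is the property the abstract emphasises.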
Keywords/Search Tags: Imbalanced Datasets, GMM-EM, Security, Generating Direction, Probability Graph