
A Probability Algorithm Research For Imbalanced Datasets Based On GMM-EM

Posted on: 2020-01-26
Degree: Master
Type: Thesis
Country: China
Candidate: Z J Wu
Full Text: PDF
GTID: 2428330602958088
Subject: Mathematics
Abstract/Summary:
Classifying imbalanced datasets is an important research topic in machine learning. In general, minority samples are fewer than majority samples and are asymmetrically distributed in the feature space, yet they carry higher information value. Existing classification algorithms, however, treat imbalanced datasets as if they were balanced, on the presumption of a well-distributed sample space, which leads to a high misclassification rate on minority samples. Handling the classification of imbalanced datasets efficiently has therefore become a hot topic in the information era. Although existing algorithms take the spatial distribution of a dataset into account when balancing the classes, they ignore its statistical features. Moreover, they create new samples at random, which yields low-quality samples and lowers precision on the minority class. To address these two issues, this thesis studies imbalanced datasets from two aspects: the statistical features of the data and the generation of high-quality new samples. Two algorithms, Enhancing Probability and Mean Inverted, are proposed and their validity is verified. The main work is as follows:

(1) Enhancing Probability algorithm: First, the algorithm uses a Gaussian mixture model (GMM) fitted with the expectation-maximization (EM) algorithm to obtain the Gaussian components of the minority class and their probability density functions (PDF). Second, following basic properties of probability theory, original samples with high probability density are given priority in generating new instances until the dataset is balanced. A new calculation method is designed to avoid overlap or confusion among the synthetic samples. Finally, a C4.5 decision tree is trained on the balanced dataset and the trained model is used for classification. The algorithm is compared experimentally with the "SMOTE family" and ADASYN algorithms on datasets from UCI and KEEL. The experimental results show that the proposed algorithm is effective in classification.

(2) Mean Inverted algorithm: First, the algorithm likewise obtains the PDF of the minority class using GMM-EM. Second, the minority class is divided into left-missing data and right-missing data according to the asymmetry of the distribution about the PDF mean. New minority samples are then created by symmetry about the PDF mean, and the 3σ rule is used to select the better new samples. If the dataset is still imbalanced, the Enhancing Probability algorithm can be applied to create further samples until the classes are balanced. Finally, a C4.5 decision tree is trained on the balanced dataset and the trained model is used for classification. The algorithm is compared experimentally with the "SMOTE family" and ADASYN algorithms on datasets from UCI and KEEL. The experimental results show that the proposed algorithm is effective in classification.
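The first two steps of the Enhancing Probability algorithm (fitting a GMM with EM, then letting high-density minority samples generate new instances first) can be illustrated with a minimal Python sketch. It is not the thesis implementation. The function names, the one-dimensional data, the quantile-based initialisation and the SMOTE-style interpolation used to create each new point are all assumptions made for illustration, and the thesis's own calculation method for avoiding overlap among synthetic samples is not reproduced because the abstract does not describe it.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def fit_gmm_em(data, k=2, iters=100, min_std=1e-3):
    """Fit a 1-D Gaussian mixture with plain EM; returns (weights, means, stds)."""
    ordered = sorted(data)
    n = len(ordered)
    # Assumed initialisation: place the means at evenly spaced quantiles.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    centre = sum(data) / n
    spread = math.sqrt(sum((x - centre) ** 2 for x in data) / n)
    stds = [max(spread, min_std)] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
            total = sum(p) or 1e-300
            resp.append([v / total for v in p])
        # M-step: re-estimate weights, means and standard deviations.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            stds[j] = max(math.sqrt(var), min_std)
    return weights, means, stds


def mixture_pdf(x, params):
    """Density of the fitted mixture at x."""
    weights, means, stds = params
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds))


def density_weighted_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, favouring seeds in high-density regions."""
    rng = random.Random(seed)
    params = fit_gmm_em(minority, k=k)
    density = [mixture_pdf(x, params) for x in minority]
    synthetic = []
    for _ in range(n_new):
        # Two seeds are drawn with probability proportional to their density.
        a, b = rng.choices(minority, weights=density, k=2)
        # Assumed generation rule: SMOTE-style interpolation between the seeds.
        synthetic.append(a + rng.random() * (b - a))
    return synthetic
```

Calling `density_weighted_oversample(minority_values, n_new)` with `n_new` set to the gap between the majority and minority class sizes would balance the classes in this one-dimensional setting; real datasets would need a multivariate mixture model.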
Keywords/Search Tags:Imbalanced Datasets, GMM-EM, Enhancing Probability Algorithm, Mean Inverted Algorithm