
Research And Application Of Equalization Method For Imbalanced Dataset

Posted on: 2019-02-13
Degree: Master
Type: Thesis
Country: China
Candidate: T Zhang
Full Text: PDF
GTID: 2428330599963852
Subject: Computer Science and Technology
Abstract/Summary:
In machine learning, a model built from a training set can be used to predict and explain data, but its effectiveness and accuracy are affected by class imbalance in that data. A model trained on an imbalanced training set often exhibits classification bias, which further degrades its classification performance. To address these problems, two sampling algorithms based on Isolation Forest (iForest) are proposed: the iForest-RM undersampling algorithm and the iForest-SMOTE oversampling algorithm.

iForest-RM balances the data set by undersampling the majority (negative) class. First, iForest is used to estimate a statistic for each sample that characterizes the sample within the sample space, and a probability distribution is formed from these estimates. Second, roulette wheel selection is applied to this probability distribution to select negative samples. Finally, all negative samples are clustered into a limited number of clusters with K-means, and each cluster center is taken as a negative sample, so that the positive and negative classes are balanced.

iForest-SMOTE balances the data set by oversampling the minority (positive) class. First, iForest is used to estimate a statistic for each sample that characterizes the sample within the sample space. Positive samples that have negative samples among their nearest neighbors are removed. In each SMOTE step, a positive sample P is selected at random together with its K nearest positive neighbors, and a sample Q is chosen from those neighbors according to their probability distribution. Finally, new samples are interpolated within the M-dimensional sphere formed by P and Q until enough positive samples have been generated to balance the positive and negative classes.

The proposed methods are compared with other sampling algorithms using the AdaBoost ensemble learning model on UCI data sets and a seismic data set. The results show that the proposed methods balance the data more effectively and can be applied effectively to lithology identification.
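
The roulette-based selection step of iForest-RM can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the function name `roulette_select`, the example scores, and the choice to make the selection probability directly proportional to each sample's score are all assumptions, since the abstract does not say how the iForest statistic is mapped to a probability. The scores are assumed to come from any Isolation Forest implementation and are simply passed in as a list.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def roulette_select(weights, k, rng=None):
    """Pick k distinct indices by roulette wheel selection.

    Each draw chooses one remaining index with probability proportional
    to its weight, then removes it (sampling without replacement).
    Hypothetical helper for illustration only.
    """
    rng = rng or random.Random()
    if k > len(weights):
        raise ValueError("k cannot exceed the number of samples")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")

    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(k):
        # Cumulative weights of the samples still on the wheel.
        cumulative = list(accumulate(weights[i] for i in remaining))
        total = cumulative[-1]
        if total <= 0:
            raise ValueError("no remaining sample has positive weight")
        # Spin the wheel: a uniform point in [0, total).
        spin = rng.random() * total
        pos = min(bisect_right(cumulative, spin), len(remaining) - 1)
        chosen.append(remaining.pop(pos))
    return chosen


# Hypothetical per-sample scores for five negative samples
# (assumed to be produced by an Isolation Forest beforehand).
scores = [0.62, 0.48, 0.55, 0.71, 0.50]
# Assumption: normalise the scores into a probability distribution.
probs = [s / sum(scores) for s in scores]
picked = roulette_select(probs, k=3, rng=random.Random(0))
print(picked)
```

Under these assumptions, samples with larger scores are more likely to be kept. If the intended behaviour is the opposite, the weights can be inverted before calling the function. The K-means step described in the abstract would then operate on the negative samples and is not shown here.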
Keywords/Search Tags: Imbalanced data, Isolation Forest, Undersampling, Oversampling