
Improved Random Forests for Imbalanced Data Classification

Posted on: 2018-12-30    Degree: Master    Type: Thesis
Country: China    Candidate: Z T Wei    Full Text: PDF
GTID: 2348330521950290    Subject: Applied Mathematics
Abstract/Summary:
The random forest algorithm, a widely used classification method, is essentially an ensemble classifier. Single classifiers often hit performance bottlenecks on classification problems, but combining them through ensemble ideas yields good results. The core of the algorithm is to randomly draw training sample sets with the Bootstrap resampling method, build a collection of tree classifiers on those samples, and classify by voting over the trees. When the data are imbalanced, however, Bootstrap sampling can produce invalid training sample sets, which does not help with the imbalanced-data problem. Moreover, in the standard random forest every decision tree has equal status in the vote. Both issues distort the final voting result and reduce the classification performance of the algorithm.

This thesis first presents an improved Bootstrap resampling method to address the sampling problem. The quality of each training sample set is guaranteed by a threshold based on the non-equilibrium (imbalance) coefficient. This yields a better set of decision trees, makes the voting result more accurate, and lets the random forest handle imbalanced classification problems better.

Second, because of the randomness of Bootstrap sampling, different decision trees have different classification performance. As an ensemble classifier, the random forest combines decision trees through a voting rule, but the standard voting rule does not account for the differences between base classifiers, which leads to poor classification results. We therefore weight each decision tree by the non-equilibrium coefficient and obtain several weighted random forest algorithms based on these coefficients, which further improve classification performance.

The experiments use twelve imbalanced binary classification datasets from the KEEL dataset repository, with imbalance ratios ranging from 1.25 to 42. The results show that both improvements enhance, to some extent, the quality of random forest classification on imbalanced data, and that applying the second improvement on top of the first improves the algorithm further.
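For a concrete picture of the two ideas described above, the following Python sketch combines a threshold-filtered Bootstrap step with imbalance-weighted voting. It is not the thesis's exact algorithm: the abstract does not give the threshold rule or the weight formula, so the versions below (reject a bootstrap sample whose majority/minority ratio exceeds a threshold; weight each tree's vote by the inverse of that ratio) are illustrative assumptions only, as are the class and parameter names.

```python
# Illustrative sketch only: (1) bootstrap sample sets whose class ratio is too
# skewed are redrawn, using a threshold on the imbalance coefficient, and
# (2) each tree's vote is weighted by how balanced its training set was.
# The threshold rule and weight formula are assumptions for demonstration.
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def imbalance_coefficient(y):
    """Majority-class count divided by minority-class count (>= 1)."""
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / counts.min()


class ThresholdWeightedRandomForest:
    def __init__(self, n_trees=100, max_imbalance=5.0, max_retries=20, random_state=None):
        self.n_trees = n_trees
        self.max_imbalance = max_imbalance      # threshold on the imbalance coefficient
        self.max_retries = max_retries
        self.rng = np.random.default_rng(random_state)
        self.trees, self.weights = [], []

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        n = len(y)
        for _ in range(self.n_trees):
            # Redraw the bootstrap sample until its imbalance coefficient falls
            # below the threshold (or retries run out), discarding sample sets
            # that contain too few minority-class examples.
            for _ in range(self.max_retries):
                idx = self.rng.integers(0, n, size=n)
                if len(np.unique(y[idx])) > 1 and \
                        imbalance_coefficient(y[idx]) <= self.max_imbalance:
                    break
            tree = DecisionTreeClassifier(max_features="sqrt",
                                          random_state=int(self.rng.integers(1 << 31)))
            tree.fit(X[idx], y[idx])
            self.trees.append(tree)
            # Assumed weighting: trees trained on more balanced samples vote
            # with larger weight (weight = 1 / imbalance coefficient).
            self.weights.append(1.0 / imbalance_coefficient(y[idx]))
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        X = np.asarray(X)
        votes = np.zeros((len(X), len(self.classes_)))
        for tree, w in zip(self.trees, self.weights):
            pred = tree.predict(X)
            for k, c in enumerate(self.classes_):
                votes[:, k] += w * (pred == c)   # weighted vote instead of plain majority
        return self.classes_[np.argmax(votes, axis=1)]
```

Such a classifier would be used like an ordinary ensemble (`fit(X, y)` then `predict(X_test)`); in the thesis the two improvements are evaluated both separately and in combination on the KEEL datasets.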
Keywords/Search Tags: imbalanced data sets, Random forest, new Bootstrap sampling, Weighted decision tree