Research On Imbalanced Data Classification Method Based On Random Forest Algorithm

Posted on: 2014-08-21
Degree: Master
Type: Thesis
Country: China
Candidate: J Xiao
Full Text: PDF
GTID: 2298330422990426
Subject: Computer Science and Technology
Abstract/Summary:
Random Forest is an ensemble learning method from the field of machine learning that combines the classification results of multiple decision tree classifiers into a single overall result. Compared with other classification algorithms, Random Forest has many advantages: high classification accuracy, small generalization error, and the ability to handle high-dimensional data. Its training process is also fast and easy to parallelize. Because of these advantages, Random Forest has been widely used and has become one of the preferred choices when selecting a classification algorithm. However, when the class distribution is uneven, that is, when the number of samples in one class is far smaller than the number in the other classes, Random Forest classifies poorly, its generalization error grows, and a series of other problems may arise.

So far, little research has addressed imbalanced data classification based on the Random Forest algorithm, and no direct, effective method exists. The usual approach merely combines the algorithm with data-level treatments, such as sampling techniques or cost-sensitive methods. Improving imbalanced data classification at the level of the Random Forest algorithm's own structure is therefore a meaningful research area. This thesis starts from that problem: it analyzes in depth the key steps of Random Forest that affect classification results, in order to design a better solution for imbalanced data classification.

By studying imbalanced data classification methods and the Random Forest algorithm, this thesis proposes an improved Random Forest algorithm for imbalanced data classification. The improvements focus on two aspects: random subspace selection and model selection.
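The effect of class imbalance on evaluation can be illustrated with a small, hedged sketch. It is not taken from the thesis: the 95-to-5 class ratio, the function name `imbalance_report`, and the trivial "always predict the majority class" classifier are assumptions chosen only to show why plain accuracy can look good while the minority (positive) class is missed entirely.

```python
def imbalance_report(y_true, y_pred, positive=1):
    """Return accuracy, minority-class recall, F1 and Cohen's kappa
    for a binary problem. Illustrative helper, not the thesis code."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    n = tp + fp + fn + tn

    accuracy = (tp + tn) / n
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = accuracy
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (p_observed - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0

    return {"accuracy": accuracy, "recall": recall, "f1": f1, "kappa": kappa}


# Assumed toy data: 5 positive (minority) and 95 negative samples.
y_true = [1] * 5 + [0] * 95
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100

print(imbalance_report(y_true, y_pred))
```

On this toy data the accuracy is 0.95 even though no positive sample is detected, so recall, F1 and kappa are all 0. This is one reason imbalance-aware indicators are needed alongside accuracy.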
The main work includes:
(1) Proposing a new ensemble feature selection method based on the idea of bagging. Built on a fast filter-based feature selection algorithm, the method raises the selection probability of features that favor the classification of positive-class samples, without excessively excluding features that are useful for negative-class samples.
(2) Adopting a subspace selection algorithm based on stratified sampling. The feature subsets generated by the ensemble feature selection method are sampled in a way that preserves both the importance of the selected features and the diversity of the generated models.
(3) Proposing a new tree model filtering method that takes the characteristics of imbalanced data into account, evaluating and reorganizing the set of tree models in order to optimize the model.

In addition, the thesis runs targeted experiments that combine the algorithm with data-level balancing by sampling. Finally, the improved Random Forest algorithm is evaluated on public imbalanced data sets. Compared with the original Random Forest algorithm, most indicators (cross-validation accuracy, AUC, Kappa coefficient, and F1-measure) improve significantly, which also indicates that subspace selection and model optimization are very important to the Random Forest algorithm.

The research in this thesis has academic significance and practical value as guidance for imbalanced data classification, and can be applied to spam detection, anomaly detection, medical diagnosis, DNA sequence recognition, and other fields.
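The idea behind item (2), stratified sampling of a feature subspace, can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm from the thesis: the abstract does not give the scoring function, the stratum boundary, or the allocation rule. Here the feature scores are made-up numbers, the boundary is the mean score, and the allocation is proportional to stratum size; the name `stratified_subspace` is hypothetical.

```python
import random


def stratified_subspace(scores, subspace_size, rng):
    """Draw a feature subspace from two strata (strong / weak features).

    scores: dict mapping feature name -> relevance score (assumed given).
    The mean-score threshold and proportional allocation are assumptions.
    """
    if not 0 < subspace_size <= len(scores):
        raise ValueError("subspace_size must be in 1..number of features")

    threshold = sum(scores.values()) / len(scores)
    strong = sorted(f for f, s in scores.items() if s >= threshold)
    weak = sorted(f for f, s in scores.items() if s < threshold)

    # Allocate draws to the strong stratum in proportion to its size.
    n_strong = round(subspace_size * len(strong) / len(scores))
    # Keep at least one feature from each stratum when there is room,
    # so every tree sees informative features and trees still differ.
    if weak and subspace_size >= 2:
        n_strong = min(max(n_strong, 1), subspace_size - 1)
    # Never ask a stratum for more features than it contains.
    n_strong = min(max(n_strong, subspace_size - len(weak)), len(strong))
    n_weak = subspace_size - n_strong

    return rng.sample(strong, n_strong) + rng.sample(weak, n_weak)


# Hypothetical relevance scores for six features.
scores = {"f0": 0.9, "f1": 0.8, "f2": 0.1, "f3": 0.05, "f4": 0.02, "f5": 0.01}
rng = random.Random(0)
for tree_index in range(3):
    print(tree_index, stratified_subspace(scores, 3, rng))
```

Each tree would then be trained only on its own subspace. Drawing from every stratum keeps at least one strongly relevant feature per tree, while the random draw within each stratum keeps the trees different from one another.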
Keywords/Search Tags: imbalanced data classification, random forest, feature subspace selection, model selection