Research On Imbalanced Data Classification Method Based On Random Forest Algorithm

Posted on:2014-08-21

Degree:Master

Type:Thesis

Country:China

Candidate:J Xiao

Full Text:PDF

GTID:2298330422990426

Subject:Computer Science and Technology

Abstract/Summary:

Random Forest is an ensemble learning method from the machine learning field that combines the classification results of multiple decision tree classifiers into a single overall result. Compared with other classification algorithms, Random Forest has many advantages: it achieves high classification accuracy and small generalization error, and it can handle high-dimensional data. Its training process is also fast and easy to parallelize. Because of these advantages in both classification performance and training, Random Forest has been widely used and has become one of the preferred choices when selecting a classification algorithm. However, when the class distribution of the data is uneven, that is, when the number of samples in one class is far smaller than in the other classes, Random Forest classifies poorly, its generalization error grows, and a series of other problems may arise.

So far there has been little research on imbalanced data classification based on Random Forest, and no direct, effective method exists. The usual approach is simply to combine the algorithm with data-level processing, such as sampling techniques, or with cost-sensitive methods. Improving imbalanced data classification from the structure of the Random Forest algorithm itself is therefore a meaningful research area. This thesis starts from that problem, analyzes in depth the key steps of Random Forest that affect the classification results, and designs a better solution for imbalanced data classification.

In this thesis, by studying imbalanced data classification methods and the Random Forest algorithm, an improved Random Forest algorithm for imbalanced data classification is proposed. The improvements focus on two aspects: random subspace selection and model selection.
The main work includes:
(1) Proposing a new ensemble feature selection method based on the idea of bagging. Built on a fast filter-based feature selection algorithm, it increases the selection probability of features that favor the classification of positive-class samples, without overly excluding features that are useful for negative-class samples.
(2) Applying a stratified-sampling-based subspace selection algorithm to sample the feature subsets generated by the ensemble feature selection method, ensuring both the importance of the selected features and the diversity of the generated models.
(3) Proposing a new tree model filtering method that takes the characteristics of imbalanced data into account, assessing and reorganizing the set of tree models in order to optimize the model.

In addition, the thesis runs targeted experiments that combine the algorithm with data-level balancing by sampling. Finally, the improved Random Forest algorithm is verified on public imbalanced data sets. Compared with the original Random Forest algorithm, most indicators (cross-validation accuracy, AUC, Kappa coefficient, and F1-measure) improve significantly, which also indicates that subspace selection and model optimization are very important to the Random Forest algorithm.

The research in this thesis has important academic significance and practical value as guidance for imbalanced data classification, and can be applied to spam detection, anomaly detection, medical diagnosis, DNA sequence recognition, and other fields.
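To illustrate why the evaluation reports the Kappa coefficient and F1-measure alongside accuracy, the following minimal sketch computes all three from binary confusion counts. It is not taken from the thesis: the function name `imbalance_metrics` and the sample counts (1000 samples, 50 of them positive) are hypothetical and chosen only for illustration.

```python
def imbalance_metrics(tp, fp, fn, tn):
    """Return accuracy, positive-class F1, and Cohen's kappa
    from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total

    # Precision and recall for the positive (minority) class.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Agreement expected by chance, from the predicted and actual marginals.
    p_pos = ((tp + fp) / total) * ((tp + fn) / total)
    p_neg = ((tn + fn) / total) * ((tn + fp) / total)
    p_e = p_pos + p_neg
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 0.0

    return {"accuracy": accuracy, "f1": f1, "kappa": kappa}


# Hypothetical data set: 1000 samples, 50 positive (minority) samples.
# A classifier that always predicts the majority class:
majority_only = imbalance_metrics(tp=0, fp=0, fn=50, tn=950)

# A classifier that finds 40 of the 50 positives at the cost of 30 false alarms:
improved = imbalance_metrics(tp=40, fp=30, fn=10, tn=920)

print(majority_only)
print(improved)
```

In this example the majority-only classifier still reaches 95% accuracy while its F1 and Kappa are both zero, which is why accuracy alone says little on imbalanced data.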

Keywords/Search Tags:

imbalanced data classification, random forest, feature subspace selection, model selection

Related items

1	Research On Random Forest Algorithm Based On Feature Selection And Diversity
2	Research And Application On CFS-HDRF Classification Algorithm For Imbalanced Data Set
3	Research Of Ensemble Learning For High-dimensional And Imbalanced Data Classification
4	Research For Imbalanced Big Data Classification Algorithm On Random Forest
5	Research On Feature Selection And Classification Method Based On Random Forest For Medical Datasets
6	Improvement And Application Of Random Forest Algorithm In Recommender Systems
7	Research On Imbalanced Data Classification Algorithm Based On Random Forest And Its Parallelization
8	Class-Imbalanced Data Stream Classification Method Based On Adaptive Random Forest
9	Selection And Classification Of Unbalanced Data Based On Semi-Supervised And Integrated Learning
10	Research On The Method Of Solving Imbalanced Classification Problems Based On Random Forest Algorithm