Model Selection Of Random Forest And Its Parallelization

Posted on:2014-12-05

Degree:Master

Type:Thesis

Country:China

Candidate:L L Cai

Full Text:PDF

GTID:2268330392969043

Subject:Computer Science and Technology

Abstract/Summary:

Traditional classification algorithms perform well on low-dimensional data sets, but their performance degrades on high-dimensional data sets. High-dimensional data has a complicated structure and contains more uninformative features and noise. The Random Forest algorithm uses feature subspaces to build its models, so many of the models are inevitably built on noisy features, and using these noisy models to classify and predict reduces classification performance. The focus of this thesis is therefore how to choose suitable models from the many available, so that the Random Forest algorithm performs well on both low-dimensional and high-dimensional data sets. In addition, as the number of models increases, the amount of computation grows exponentially, so how to speed up model construction and prediction is the second research problem of this thesis.

This thesis studies in depth the problem of Random Forest model selection and the problem of parallelizing model construction and prediction at scale. The main research contents and results are as follows:

Firstly, after a theoretical study of the Random Forest algorithm, some frequently used methods for Random Forest model selection are summarized and elaborated. The process and robustness of each model selection algorithm are analyzed in detail. At the same time, the distributed parallel method based on the MapReduce framework is introduced.

Secondly, this thesis proposes a dynamic model selection method for Random Forest based on Markov Chains. The method uses a lazy dynamic selection pattern, incorporates the idea of the Markov Chain random walk, and divides the models, training samples, and test samples into three layers. It calculates each classifier's strength, the correlation among classifiers, and the similarity between each test sample and the training sample set, and combines these with weighted voting and model selection. Dynamic model selection is then carried out by an iterative loop that moves from layer to layer: from the upper layer to the middle layer (or middle to upper) and from the lower layer to the middle layer (or middle to lower). A comparison with several common model selection methods on a range of low-dimensional and high-dimensional test sets demonstrates the advantage of the method in terms of OOB error, strength, average correlation, the upper bound on the generalization error, and classification accuracy.

Thirdly, this thesis puts forward a Random Forest parallelization method based on the MapReduce framework. The operating efficiency of the Random Forest algorithm is improved by improving the parallelization of model building and voting.

Finally, based on the theoretical research above, this thesis designs and implements a parallel system for Random Forest model selection based on Markov Chain dynamic integration. The system includes four modules: data input, parameter configuration, model selection, and a parallel scheduling interface. These modules cover all the operational processes of the Random Forest model selection and parallelization method. The model selection module has also been successfully applied in an enterprise data mining platform.
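
The evaluation criteria named in the abstract (strength, average correlation, and the upper bound on the generalization error) can be illustrated with a short Python sketch. It follows the standard Random Forest definitions, in which the bound is rho * (1 - s^2) / s^2 for strength s and mean correlation rho. Whether the thesis computes these quantities in exactly this way is an assumption, and the function name `forest_metrics` and the toy vote matrix are hypothetical. As a simplification, every tree votes on every sample, whereas an OOB estimate would use only the trees that did not train on a given sample.

```python
from collections import Counter
from statistics import mean, pstdev, pvariance

def forest_metrics(votes, labels):
    """Return (strength, mean correlation, generalization-error bound).

    votes[t][i] is the class tree t predicts for sample i;
    labels[i] is the true class of sample i.
    """
    n_trees = len(votes)
    raw = [[] for _ in range(n_trees)]  # raw margin of each tree on each sample
    margins = []                        # forest margin of each sample
    for i, y in enumerate(labels):
        column = [votes[t][i] for t in range(n_trees)]
        # Most frequently predicted wrong class for this sample, if any.
        wrong = Counter(c for c in column if c != y)
        j_hat = wrong.most_common(1)[0][0] if wrong else None
        for t, c in enumerate(column):
            raw[t].append(int(c == y) - int(c == j_hat))
        margins.append(mean(raw[t][i] for t in range(n_trees)))
    strength = mean(margins)
    mean_sd = mean(pstdev(row) for row in raw)
    if strength <= 0 or mean_sd == 0:
        raise ValueError("bound is undefined for this vote matrix")
    correlation = pvariance(margins) / mean_sd ** 2
    bound = correlation * (1 - strength ** 2) / strength ** 2
    return strength, correlation, bound

# Toy example (made-up votes): three trees, four samples, two classes.
labels = [0, 0, 1, 1]
votes = [
    [0, 0, 1, 1],  # tree 0 is always right
    [0, 1, 1, 1],  # tree 1 misclassifies sample 1
    [0, 0, 1, 0],  # tree 2 misclassifies sample 3
]
strength, correlation, bound = forest_metrics(votes, labels)
```

A higher strength and a lower correlation both tighten the bound, which is why a model selection method would aim to keep strong trees while discarding redundant ones.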
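
The MapReduce-style voting mentioned in the abstract can be sketched in plain Python. This is a minimal single-process illustration of the map and reduce roles, not the system described in the thesis: the function names are hypothetical, and the "trees" are stand-in threshold rules rather than trained decision trees.

```python
from collections import Counter, defaultdict

def map_predict(tree, samples):
    """Map step: one tree emits a (sample_id, predicted_label) pair per sample."""
    return [(sample_id, tree(features)) for sample_id, features in samples]

def reduce_votes(pairs):
    """Reduce step: group votes by sample_id and keep the majority label."""
    votes = defaultdict(Counter)
    for sample_id, label in pairs:
        votes[sample_id][label] += 1
    return {sid: counts.most_common(1)[0][0] for sid, counts in votes.items()}

def forest_predict(trees, samples):
    """Run the map step for every tree, then reduce all emitted votes."""
    pairs = [pair for tree in trees for pair in map_predict(tree, samples)]
    return reduce_votes(pairs)

# Hypothetical stand-in trees: simple threshold rules on two features.
trees = [
    lambda x: int(x[0] > 0.5),
    lambda x: int(x[1] > 0.5),
    lambda x: int(x[0] + x[1] > 1.0),
]
samples = [("a", (0.9, 0.8)), ("b", (0.1, 0.2)), ("c", (0.9, 0.2))]
predictions = forest_predict(trees, samples)
```

Because each map call depends only on one tree and the input samples, the map calls can run on separate workers, and only the small (sample_id, label) pairs need to be shuffled to the reducers.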

Keywords/Search Tags:

Random Forest, Model Selection, Markov Chain, Dynamic, MapReduce

Related items

1	Research On Random Forest Algorithm Based On Feature Selection And Diversity
2	Research On The Random Forest Based Detection Of Malicious Mobile Applications At Runtime
3	Dynamic Textures Segmentation Based On Markov Random Field And Non-sampling Wavelet Transform
4	Rice Origin Verification Platform Based On Parallel Random Forest Algorithms
5	Research On Adaptive Feature Selection And Parameter Optimization Algorithm For Random Forest
6	Zigbee Channel Selection Algorithm Research Based On Markov Chain
7	Construction And Application Of Multi Factor Stock Selection Model Based On Random Forest
8	Research On Video Outlier Mining Based On Markov Random Field
9	Prediction Of Deleterious Synonymous Mutations Based On Random Forest
10	Research On Feature Selection And Classification Method Based On Random Forest For Medical Datasets