
Model Selection Of Random Forest And Its Parallelization

Posted on: 2014-12-05
Degree: Master
Type: Thesis
Country: China
Candidate: L L Cai
Full Text: PDF
GTID: 2268330392969043
Subject: Computer Science and Technology
Abstract/Summary:
Traditional classification algorithms achieve good results on low-dimensional data sets, but their performance degrades on high-dimensional data sets. High-dimensional data has a complicated structure and contains more uninformative features and noise. The Random Forest algorithm builds its models on feature subspaces, so many of the models inevitably incorporate noise, and using these noisy models for classification and prediction reduces accuracy. How to select suitable models from the many available, so that Random Forest performs well on both low-dimensional and high-dimensional data sets, is therefore the focus of this thesis. In addition, as the number of models increases, the amount of computation grows exponentially, so how to speed up model construction and prediction is the second research problem addressed here.

This thesis analyzes in depth the problem of Random Forest model selection and the problem of building models and making predictions with large-scale parallelization. The main research contents and results are as follows.

Firstly, following a theoretical study of the Random Forest algorithm, several frequently used methods for Random Forest model selection are summarized and elaborated. The process and robustness of each method with respect to model selection are analyzed in detail. The distributed parallel approach based on the MapReduce framework is also introduced.

Secondly, this thesis proposes a dynamic model selection method for Random Forest based on a Markov chain. The method uses a lazy dynamic selection pattern, incorporates the idea of a Markov chain random walk, and divides the models, training samples, and test samples into three layers. It computes the strength of each classifier, the correlation among classifiers, and the similarity between each test sample and the training sample set, and then applies weighted voting and model selection. Dynamic model selection is carried out by an iterative loop that moves from layer to layer: from upper to middle (or middle to upper) and from lower to middle (or middle to lower). A comparison with several common model selection methods on a range of low-dimensional and high-dimensional test sets shows the advantage of the proposed method in terms of OOB error, strength, average correlation, the upper bound on the generalization error, and classification accuracy.

Thirdly, this thesis puts forward a Random Forest parallelization method based on the MapReduce framework. It improves the running efficiency of the Random Forest algorithm by improving how model building and voting are parallelized.

Finally, based on the theoretical research above, this thesis designs and implements a parallel Random Forest model selection system based on Markov chain dynamic integration and its parallelization. The system comprises four modules: data input, parameter configuration, model selection, and a parallel scheduling interface. Together these modules cover the full operational process of Random Forest model selection and its parallelization. The model selection module has also been successfully applied to an enterprise data mining platform.
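To make the idea of parallel weighted voting more concrete, the following is a minimal Python sketch of a map/reduce-style vote over partitions of trees. It is an illustration under stated assumptions, not the system described in the thesis: the names `make_stump`, `map_votes`, `reduce_votes`, and `forest_predict` are hypothetical, the "trees" are one-feature threshold rules, the weights stand in for whatever per-model score the selection step would assign, and the map step runs sequentially here where a real MapReduce job would run it on separate workers.

```python
from collections import Counter
from functools import reduce


def make_stump(feature, threshold, weight):
    """Hypothetical stand-in for a trained tree: a one-feature threshold rule
    paired with a positive vote weight (for example, a strength estimate)."""
    def predict(row):
        return 1 if row[feature] > threshold else 0
    return predict, weight


def map_votes(tree_partition, rows):
    """Map step: score every row with one partition of trees and emit partial
    weighted vote totals keyed by (row index, predicted label)."""
    partial = Counter()
    for predict, weight in tree_partition:
        for i, row in enumerate(rows):
            partial[(i, predict(row))] += weight
    return partial


def reduce_votes(left, right):
    """Reduce step: merge two partial vote tables by summing weights per key.
    Counter addition keeps only positive totals, so weights must be positive."""
    return left + right


def forest_predict(partitions, rows):
    """Combine all partitions and return the heaviest label for each row."""
    totals = reduce(reduce_votes, (map_votes(p, rows) for p in partitions), Counter())
    predictions = []
    for i in range(len(rows)):
        labels = {label: w for (j, label), w in totals.items() if j == i}
        predictions.append(max(labels, key=labels.get))
    return predictions


# Two partitions of trees, as if held by two workers (illustrative values only).
partitions = [
    [make_stump(0, 0.5, 1.0), make_stump(1, 0.5, 1.0)],
    [make_stump(0, 0.2, 2.0)],
]
rows = [(0.1, 0.9), (0.9, 0.1), (0.3, 0.9)]

print(forest_predict(partitions, rows))  # prints [0, 1, 1]
```

Because the reduce step is a plain sum over keys, the partial tables can be merged in any order, which is the property that lets the voting stage be spread across workers.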
Keywords/Search Tags: Random Forest, Model Selection, Markov Chain, Dynamic, MapReduce