
Research On Random Forest Classification Algorithm Based On Spark Distributed Platform

Posted on: 2018-07-13
Degree: Master
Type: Thesis
Country: China
Candidate: Z H Niu
GTID: 2348330533460156
Subject: Information and Communication Engineering
Abstract/Summary:
The rapid development of information technology and networks has produced large volumes of high-dimensional, complex data. Classifying these data effectively to extract valuable information is a problem of great significance. Random forest is an important classification algorithm that tolerates noise and outliers well and lends itself to parallelization. The original random forest classification algorithm and most of its improved variants run on a single machine; when they face large volumes of high-dimensional, complex data, their running time and memory use cannot meet practical needs. Spark is an efficient distributed computing framework that balances performance against time cost in parallel computing, and thus offers an effective way to solve this problem. In high-dimensional data, moreover, a large portion of the features is often uninformative about the class of the objects, which lowers the classification accuracy of the original random forest algorithm. This thesis therefore improves the random forest algorithm on the Spark platform to classify high-dimensional data more effectively in the big data era.

Firstly, the random forest algorithm does not distinguish among its decision trees when combining them, so weak decision trees have the same influence on the classification decision as strong ones, which degrades classification performance. To address this problem, a random forest algorithm using weighted trees is proposed and implemented on Spark. The weighted-tree ensemble strategy strengthens the influence of trees with strong classification ability and weakens that of trees with weak classification ability, thereby improving the classification ability of the random forest. Experimental results show that the proposed algorithm achieves better classification accuracy than the original random forest algorithm and scales well.

Secondly, the random forest algorithm generates feature subspaces by random sampling and consequently selects many subspaces that contain few informative features, which also degrades classification performance. To address this problem, an improved method of implementing stratified subspaces is proposed, and a random forest algorithm using stratified subspaces is built on it and implemented on Spark. The improved implementation method not only ensures that the feature stratification result is correct but also reduces the computational cost, which makes it suitable for high-dimensional data. Experimental results verify that the proposed algorithm classifies high-dimensional data effectively. Compared with the original random forest algorithm, it has better classification accuracy and generalization ability, and it also scales well.

Finally, the proposed algorithms are applied to flight delay prediction. Based on a detailed analysis of the features, the experimental data were preprocessed by normalization and by grading the delays. The experimental results verify that the proposed algorithms can effectively classify and predict flight delay grades.
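
The two ideas described in the abstract, weighted voting over trees and stratified feature subspaces, can be illustrated with a minimal single-machine Python sketch. It is not the thesis's Spark implementation. The function names, the use of a per-tree accuracy-like score as the weight, the two-strata split into "strong" and "weak" features, and the proportional allocation between strata are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import defaultdict
import random


def weighted_vote(predictions, weights):
    """Return the class with the largest total tree weight.

    predictions: one predicted label per tree.
    weights: one non-negative weight per tree (assumed here to be a score
             such as each tree's validation accuracy; hypothetical choice).
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per tree")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # Pick the label with the highest accumulated weight;
    # ties go to the label that was seen first.
    return max(totals, key=totals.get)


def stratified_subspace(strong, weak, size, rng):
    """Draw a feature subspace that contains features from both strata.

    strong / weak: lists of feature indices, already split by some
    informativeness score (the split criterion is an assumption).
    The strong stratum gets a share proportional to its size, but at
    least one slot whenever it is non-empty.
    """
    share = round(size * len(strong) / (len(strong) + len(weak)))
    n_strong = min(max(1, share), len(strong), size)
    n_weak = min(size - n_strong, len(weak))
    return rng.sample(strong, n_strong) + rng.sample(weak, n_weak)


# Usage: one strong tree outvotes two weak trees.
label = weighted_vote(["delayed", "on_time", "on_time"], [0.9, 0.4, 0.4])

# Usage: 3 informative features among 100; every subspace keeps at least one.
subspace = stratified_subspace([0, 1, 2], list(range(3, 100)), 10, random.Random(0))
```

With plain majority voting the first example would return "on_time"; with weights, the single strong tree decides the outcome. In the second example, purely random sampling of 10 out of 100 features would often miss all three informative features, whereas the stratified draw always includes at least one.
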
Keywords/Search Tags: High-Dimensional Data, Classification, Random Forest Algorithm, Spark, Ensemble Strategy, Stratified Subspace, Flight Delay