Research On Efficient Parallelization Of Improved Random Forest Algorithm Based On Spark Platform

Posted on:2023-04-24

Degree:Master

Type:Thesis

Country:China

Candidate:K Gong

Full Text:PDF

GTID:2568306752477684

Subject:Computer application technology

Abstract/Summary:

Since society entered the information age, big data has penetrated all walks of life. Effective data mining can extract valuable knowledge from big data and bring large economic benefits, so data mining in big data environments has become a research hotspot. As a typical ensemble learning algorithm, the random forest (RF) algorithm is often used in data mining because it has few hyperparameters, excellent classification performance, and is easy to parallelize. In a big data environment, however, the classification time of the random forest algorithm becomes too long because of the excessive number of decision trees, which has become a major obstacle to applying the algorithm in that setting. This thesis therefore proposes using ensemble pruning to optimize the random forest algorithm, which can significantly reduce classification time while preserving the accuracy of the random forest model. At the same time, the improved random forest algorithm is parallelized on a distributed parallel computing framework to improve both the model training speed and the classification speed of the algorithm in a big data environment. The research consists of the following two parts.

The first part addresses the slow classification speed and repeated voting of the traditional random forest algorithm in a big data environment by proposing an improved random forest algorithm based on similarity. By eliminating the decision trees with low classification accuracy and those prone to repeated voting from the original random forest model, an improved random forest model with faster classification speed and higher classification accuracy is constructed. To better cope with the big data environment, the improved random forest algorithm is parallelized on the Spark platform. First, an original random forest model is obtained by training multiple decision trees in parallel. Second, the decision trees with low classification accuracy in the original random forest model are filtered out. Third, all path information of the retained decision trees is obtained in parallel. Fourth, a decision tree similarity matrix is constructed in parallel to eliminate the decision trees that are prone to repeated voting. Finally, an improved random forest model that classifies quickly and effectively is obtained and applied to rolling bearing fault diagnosis. The experimental results show that the algorithm not only achieves good fault diagnosis accuracy, but also has fast model training speed and fault diagnosis speed on large-scale rolling bearing datasets.

The second part aims to find the optimal sub-forest in the random forest model and further improve its classification speed. To this end, an improved random forest algorithm based on multi-objective teaching-learning-based optimization (MO-TLBO) is proposed (MO-TLBO-RF). The MO-TLBO algorithm aims to maximize classification accuracy and minimize classification time, and it can find a sub-forest with higher classification accuracy and faster classification speed. In addition, because ensemble pruning of a random forest via the MO-TLBO algorithm is very time-consuming in a big data environment, a vote set strategy is constructed to improve the fitness evaluation process. On the Spark platform, the MO-TLBO-RF algorithm is parallelized based on data parallelism. A Shuffle optimization strategy is proposed to reduce the number of Shuffle operations performed by the parallel MO-TLBO-RF algorithm during model training. The effectiveness of the MO-TLBO-RF algorithm is verified on a rolling bearing dataset and 28 UCI datasets. The experimental results show that the algorithm obtains a random forest model with high fault diagnosis accuracy and fast fault diagnosis speed on large-scale rolling bearing fault data, and that it achieves good classification results on UCI datasets covering multiple scenarios. The results also show that the vote set strategy and the Shuffle optimization strategy greatly reduce the ensemble pruning time.
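
As a rough illustration of the pruning ideas summarized above, the following single-machine Python sketch filters out low-accuracy trees, drops trees that are too similar to an already retained tree, and then scores a sub-forest from cached votes. It is not the thesis's implementation and makes several assumptions: the abstract builds its similarity matrix from decision tree path information, whereas this sketch substitutes prediction agreement on a validation set; the cached `vote_table` is only one possible reading of the "vote set strategy"; sub-forest size stands in for classification time; and the function names and the thresholds `min_accuracy` and `max_similarity` are hypothetical. Spark parallelization is omitted.

```python
from collections import Counter


def tree_accuracy(votes, labels):
    """Fraction of validation samples one tree classifies correctly."""
    return sum(v == y for v, y in zip(votes, labels)) / len(labels)


def agreement(votes_a, votes_b):
    """Assumed similarity measure: fraction of samples where two trees agree."""
    return sum(a == b for a, b in zip(votes_a, votes_b)) / len(votes_a)


def prune_forest(vote_table, labels, min_accuracy, max_similarity):
    """Return the indices of the trees kept after both pruning steps.

    vote_table[t][i] is the cached vote of tree t on validation sample i.
    """
    acc = {t: tree_accuracy(v, labels) for t, v in enumerate(vote_table)}
    # Step 1: discard trees whose accuracy is below the threshold,
    # then visit the survivors from most to least accurate.
    candidates = sorted(
        (t for t in acc if acc[t] >= min_accuracy),
        key=lambda t: (-acc[t], t),
    )
    # Step 2: keep a tree only if it is not too similar to any kept tree,
    # so near-duplicate trees do not vote repeatedly.
    kept = []
    for t in candidates:
        if all(agreement(vote_table[t], vote_table[k]) <= max_similarity
               for k in kept):
            kept.append(t)
    return sorted(kept)


def subforest_fitness(vote_table, labels, subset):
    """Score a sub-forest from cached votes, without re-running any tree.

    Returns (majority-vote accuracy, number of trees); the tree count is
    an assumed proxy for classification time.
    """
    correct = 0
    for i, y in enumerate(labels):
        counts = Counter(vote_table[t][i] for t in subset)
        if counts.most_common(1)[0][0] == y:
            correct += 1
    return correct / len(labels), len(subset)


# Toy validation set: 6 samples, 5 trees (made-up values for illustration).
labels = [0, 1, 1, 0, 1, 0]
vote_table = [
    [0, 1, 1, 0, 1, 0],  # tree 0: all correct
    [0, 1, 1, 0, 1, 1],  # tree 1: one error
    [1, 0, 0, 1, 1, 0],  # tree 2: mostly wrong
    [0, 1, 0, 0, 1, 0],  # tree 3: one error
    [0, 1, 1, 0, 1, 0],  # tree 4: duplicate of tree 0
]

kept = prune_forest(vote_table, labels, min_accuracy=0.6, max_similarity=0.9)
print(kept)
print(subforest_fitness(vote_table, labels, kept))
```

In this toy run, tree 2 is removed by the accuracy filter and tree 4 is removed as a duplicate of tree 0. Because every tree's votes are computed once and reused, evaluating a candidate sub-forest reduces to counting cached votes, which is the kind of saving a repeated fitness evaluation would benefit from.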

Keywords/Search Tags:

random forest algorithm, ensemble pruning, distributed parallelization, Spark, big data, teaching-learning-based optimization algorithm

Related items

1	Research On Random Forest Classification Algorithm Based On Spark Distributed Platform
2	Research On Parallelization And Optimization Of Random Forest Classification Algorithm Based On Spark
3	Optimization Of Distributed Random Forest Algorithm Based On Hierarchical Subspace
4	Research On Imbalanced Data Classification Algorithm Based On Random Forest And Its Parallelization
5	Research On A Semi-supervised Random Forest Classification Algorithm And Its Parallelization
6	The Optimization Research Of Spark Load Balancing And Random Forest Algorithm
7	Research On Parallel Text Categorization Of Random Forest
8	Thermal Power Plant Energy Saving Analysis Based On Spark Big Data Platform
9	Application Research Of Distributed Whale Optimization Algorithm
10	The Parallelization And Optimization Of K-means Algorithm Based On Spark