
Optimization Of Distributed Random Forest Algorithm Based On Hierarchical Subspace

Posted on: 2021-03-23
Degree: Master
Type: Thesis
Country: China
Candidate: J Jing
Full Text: PDF
GTID: 2438330605463859
Subject: Computer application technology
Abstract/Summary:
Mining valuable information from data efficiently is a central problem of the big data era. With the emergence of cloud computing, the Internet of Things, and other new technologies, data volume continues to double every two years. Data has become more valuable than ever across many fields, yet the defining feature of big data is that the volume is large while the information density is very low, so inefficient mining seriously wastes manpower and resources. Addressing this problem calls for research on improving machine learning algorithms. Random forest, one of the most important machine learning algorithms today, offers both high prediction accuracy and broad applicability: it can be applied to most large data sets and has been widely used in many fields. One of its advantages is that it can run in parallel, which is an important way to improve the performance of machine learning algorithms in the era of big data.

This thesis studies the random forest algorithm on the distributed platform Spark. A Spark cluster environment is first set up on HDFS (Hadoop Distributed File System) and YARN (Yet Another Resource Negotiator), and the optimization and improvement of the algorithm are then studied in this cluster environment. The main research contents are as follows:

(1) Weighted stratified subspace forest algorithm. By studying the classification ability of different feature spaces, this thesis proposes a weighted stratified subspace forest algorithm. The feature space is weighted using random forest feature assessment, some noisy data is filtered out, and stratified sampling is then carried out according to the weight ratio. Experimental results show that the improved algorithm effectively improves the accuracy of the model.

(2) Random forest algorithm based on factor analysis. By studying the correlation between features, this thesis proposes a random forest algorithm based on factor analysis. Factor analysis is applied to the feature space to measure the correlation between features, and the stratified subspaces are divided according to that correlation. Experiments show that this method effectively enhances the classification strength of the decision trees and improves the generalization performance of the model.

(3) Combinatorial random forest algorithm. By further studying the feature subspace and the voting mechanism, this thesis proposes a combinatorial random forest algorithm. The weighted stratified subspace random forest algorithm is combined with the random forest algorithm based on factor analysis to give a new feature stratification method, which is then combined with the weighted trees random forest algorithm to form the combinatorial random forest algorithm. Analysis of the experimental results shows that the algorithm achieves good prediction accuracy and effectively improves the generalization performance of the model.

In summary, the improved stratified subspace random forest algorithm has higher prediction accuracy than the traditional random forest algorithm, its learning model is more stable, and it has a smaller generalization error than the original random forest algorithm.
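
The weighting, noise filtering, and stratified sampling steps described in item (1) can be illustrated with a small sketch. This is not the thesis implementation: the abstract does not say how strata are formed or how many features each contributes, so the rank-based strata, the `noise_threshold` parameter, the rounding rule, and the function names below are all assumptions made for illustration. The importance scores are taken as given, for example from a previously trained random forest.

```python
import random


def stratify_features(importances, noise_threshold=0.0, n_strata=2):
    """Drop low-importance features, then split the rest into strata by rank.

    `importances` maps feature name -> non-negative importance score.
    Features scoring at or below `noise_threshold` are treated as noise
    (an assumed filtering rule, not one stated in the abstract).
    """
    kept = [(f, w) for f, w in importances.items() if w > noise_threshold]
    if not kept:
        return []
    kept.sort(key=lambda fw: fw[1], reverse=True)
    size = -(-len(kept) // n_strata)  # ceiling division
    return [kept[i:i + size] for i in range(0, len(kept), size)]


def sample_subspace(strata, subspace_size, rng):
    """Draw features from each stratum in proportion to its total weight.

    Each stratum contributes at least one feature and never more than it
    holds, so the result can differ slightly from `subspace_size`.
    Assumes every kept feature has a positive weight.
    """
    total = sum(w for stratum in strata for _, w in stratum)
    chosen = []
    for stratum in strata:
        share = sum(w for _, w in stratum) / total
        k = min(len(stratum), max(1, round(share * subspace_size)))
        chosen.extend(f for f, _ in rng.sample(stratum, k))
    return chosen


# Hypothetical importance scores for six features.
importances = {"f0": 0.30, "f1": 0.25, "f2": 0.20,
               "f3": 0.15, "f4": 0.09, "f5": 0.01}
strata = stratify_features(importances, noise_threshold=0.05)
subspace = sample_subspace(strata, subspace_size=3, rng=random.Random(0))
```

With these scores, `f5` is filtered out as noise, the strong stratum (`f0`, `f1`, `f2`) contributes two features and the weak stratum (`f3`, `f4`) contributes one. In a forest, each tree would draw its own subspace in this way before training.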
Keywords/Search Tags:Random Forest Algorithm, Parallelism and Distribution, Spark, Big Data, Feature Selection