The Optimization Research Of Spark Load Balancing And Random Forest Algorithm

Posted on: 2021-03-04
Degree: Master
Type: Thesis
Country: China
Candidate: Z F Zhang
GTID: 2428330620963592
Subject: Computer application technology
Abstract/Summary:
With the rapid spread of information technology, large amounts of data have been generated and accumulated across industries. How to process this data efficiently and extract valuable information from it has therefore become an urgent problem. In recent years, from the platform perspective, Spark, an efficient big data processing platform based on in-memory computing, has proven well suited to big data mining, analysis, and processing, and has become a research hotspot in academia and industry. From the algorithm perspective, optimizing data mining algorithms on the Spark platform is also a research hotspot. The random forest algorithm is a typical data classification method and is widely used because of its good classification performance, so research on a Spark-based random forest algorithm has both theoretical significance and practical value. This thesis studies the Spark platform and the random forest classification algorithm on Spark, and covers the following two aspects:

(1) Optimization of Spark load balancing: Spark is an efficient big data processing platform based on in-memory computing, and the load balance of the cluster has an important influence on its computing efficiency. However, the default task scheduling policy of a Spark cluster considers neither the available resources of the nodes nor their current load, so task scheduling may leave the nodes unevenly loaded and reduce the cluster's task processing efficiency. To address this load imbalance, this thesis proposes an adaptive task scheduling policy for Spark clusters. Based on the computing resources and actual load of each node, the policy uses a heuristic algorithm that fuses the ant colony algorithm with simulated annealing to optimize task scheduling. By distributing tasks reasonably, it balances the load and thereby improves the cluster's task processing efficiency. Finally, experiments verify the effectiveness of the proposed load balancing optimization.

(2) Optimization of the random forest algorithm on Spark: Data under analysis often contain redundant features. The random forest algorithm forms its feature subspaces by selecting features at random and cannot distinguish these redundant features when generating a subspace, which harms its classification accuracy. To solve this problem, this thesis optimizes the random forest algorithm on the Spark platform. The optimized algorithm distinguishes strongly and weakly correlated features by computing feature importance, and then forms feature subspaces through hierarchical feature extraction, so as to improve overall classification accuracy. The thesis then parallelizes the optimized algorithm on Spark and verifies the classification accuracy of the improved algorithm. Finally, the optimized algorithm is applied to a credit evaluation data set, and the results show that it effectively improves the accuracy of credit evaluation.
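
The feature-subspace step described in (2) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the importance scores are made-up inputs, the threshold that separates strong from weak features and the share of strong features per subspace are hypothetical parameters, and "hierarchical feature extraction" is read here as drawing features from each group separately. All function names are illustrative.

```python
import random


def split_by_importance(importance, threshold):
    """Partition feature indices into strong and weak groups.

    `importance` is a list of per-feature scores; how they are computed
    is not specified in the abstract, so they are treated as given.
    `threshold` is a hypothetical cut-off between the two groups.
    """
    strong = [i for i, score in enumerate(importance) if score >= threshold]
    weak = [i for i, score in enumerate(importance) if score < threshold]
    return strong, weak


def stratified_subspace(strong, weak, size, strong_share, rng):
    """Draw one feature subspace with features taken from both groups.

    About `strong_share` of the `size` features come from the strong
    group and the rest from the weak group (both values are assumptions).
    If the weak group is too small, the shortfall is drawn from the
    strong group instead.
    """
    n_strong = min(len(strong), round(size * strong_share))
    n_weak = min(len(weak), size - n_strong)
    n_strong = min(len(strong), size - n_weak)  # top up if weak ran short
    picked = rng.sample(strong, n_strong) + rng.sample(weak, n_weak)
    return sorted(picked)


if __name__ == "__main__":
    # Made-up importance scores for eight features.
    scores = [0.30, 0.02, 0.25, 0.01, 0.03, 0.20, 0.04, 0.15]
    strong, weak = split_by_importance(scores, threshold=0.10)
    rng = random.Random(0)
    # One subspace per tree; three trees shown for illustration.
    for tree in range(3):
        print(tree, stratified_subspace(strong, weak, 4, 0.5, rng))
```

Compared with purely random selection, every subspace drawn this way contains some high-importance features, which is the property the abstract attributes to the optimized algorithm. In a Spark setting, each tree's subspace could be drawn independently in this manner before the trees are trained in parallel.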
Keywords/Search Tags: Spark, Load balancing, Heuristic algorithm, Random forest algorithm, Feature subspace