Research On Parallelization And Optimization Of Random Forest Classification Algorithm Based On Spark

Posted on: 2020-07-16  Degree: Master  Type: Thesis
Country: China  Candidate: T Y Hu  Full Text: PDF
GTID: 2428330575987994  Subject: Computer application technology
Abstract/Summary:
With the rapid development of Internet technology and the maturing of application software and sensor technology, organizations in many fields can acquire or accumulate massive amounts of data. Big data has gradually entered daily life and is valued in all walks of life. However, because of the characteristics of big data, valuable knowledge cannot be extracted from it directly, so mining such knowledge has become a hotspot of current research. Data mining techniques are well suited to extracting valuable information from data. Among the many big data platforms now available, Spark is widely used because of its fast iterative computation. Classification is an important branch of data mining and remains significant in the era of big data. Random forest is a classification algorithm that is widely used because of its good classification performance, but it performs less well on high-dimensional data and on imbalanced data. This thesis proposes one optimized algorithm for each of these two problems.

For feature selection, a random forest algorithm based on the maximal information coefficient (MIC) is proposed. The main idea is as follows: first, MIC is used to score the features; next, the features are sorted by score from high to low; then all high-scoring features and a randomly selected subset of the medium-scoring features are chosen to build the random forest. Finally, the optimized algorithm is parallelized on Spark and evaluated experimentally. The results show that the proposed method addresses the problems the traditional random forest algorithm encounters with high-dimensional data and improves the accuracy and stability of the algorithm.

For imbalanced classification, a random forest algorithm based on a GAN model is proposed. The main idea is as follows: first, the GAN generator is used to generate minority-class samples; then the generated samples are merged with the original data set, and multiple balanced data subsets are constructed to build the random forest. Finally, the optimized algorithm is parallelized on Spark. The experimental results show that the proposed method addresses the problems the traditional random forest algorithm encounters in imbalanced classification and achieves good AUC and F1 values.

Finally, the two improved methods are applied to intrusion detection, where they solve some problems in that field and achieve good detection accuracy and speed.
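The two data-preparation steps described in the abstract can be illustrated with a minimal single-machine sketch. It is not the thesis implementation: the function names, the `high` and `low` score thresholds, and `medium_fraction` are hypothetical; the feature scores are assumed to be precomputed (MIC itself is not calculated here); the synthetic minority samples are assumed to come from some generator (no GAN is trained here); balancing by undersampling the majority class is an assumed strategy; and the Spark parallelization is omitted.

```python
import random
from typing import Dict, List, Tuple

Sample = Tuple[int, int]  # (hypothetical sample id, class label)


def select_features(scores: Dict[str, float], high: float = 0.6,
                    low: float = 0.3, medium_fraction: float = 0.5,
                    seed: int = 0) -> List[str]:
    """Keep every high-scoring feature plus a random share of medium ones.

    `scores` maps feature name to a precomputed relevance score
    (MIC in the thesis; any score in [0, 1] works for this sketch).
    The thresholds are illustrative assumptions, not values from the thesis.
    """
    rng = random.Random(seed)
    # Sort features by score, highest first.
    ranked = sorted(scores, key=scores.get, reverse=True)
    high_group = [f for f in ranked if scores[f] >= high]
    medium_group = [f for f in ranked if low <= scores[f] < high]
    # Randomly keep a fixed fraction of the medium-scoring features.
    k = int(len(medium_group) * medium_fraction)
    sampled = rng.sample(medium_group, k)
    return high_group + sorted(sampled, key=scores.get, reverse=True)


def balanced_subsets(majority: List[Sample], minority: List[Sample],
                     n_subsets: int, seed: int = 0) -> List[List[Sample]]:
    """Build several class-balanced training subsets.

    `minority` is expected to already contain the original minority
    samples merged with the generated ones. Each subset pairs all of
    them with an equally sized random draw from the majority class
    (an assumed balancing strategy).
    """
    rng = random.Random(seed)
    size = min(len(minority), len(majority))
    subsets = []
    for _ in range(n_subsets):
        drawn = rng.sample(majority, size)
        subsets.append(drawn + list(minority))
    return subsets
```

In this sketch, each subset returned by `balanced_subsets` would train one tree or one group of trees, restricted to the columns returned by `select_features`. On Spark, the per-subset training is the part that would be distributed across workers.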
Keywords/Search Tags: Spark, Random Forest, MIC, GAN, Intrusion detection