The random forest algorithm is an ensemble learning classification algorithm built on the idea of Bagging. It obtains the final class by aggregating the predictions of multiple decision trees, and it offers good classification performance, strong robustness, and fast execution. In recent years it has been widely applied in landslide prediction, fault detection, and other fields. With the arrival of the big data era, however, exploding data volumes and increasingly complex data structures confront the traditional random forest algorithm with large sample sizes and high feature dimensionality. How to optimize the random forest algorithm for big data has therefore become an important research direction.

The proposal and wide adoption of the MapReduce distributed framework provide a way to break through this dilemma. By combining MapReduce with parallel computing, parallel random forest algorithms have alleviated, to some extent, the low learning efficiency and poor classification accuracy of random forests on big data. Owing to the complexity of big data sets and the limitations of the random forest itself, however, MapReduce-based parallel random forests still face three problems: (1) how to effectively improve the classification performance of the parallel random forest algorithm in a big data environment; (2) how to effectively improve its construction efficiency in a big data environment; (3) how to effectively improve its parallelization performance. To solve these problems, this paper builds on the random forest algorithm, the MapReduce model, information theory, and related work to propose two parallel random forest algorithms.

(1) To address the sample-distribution boundary problem and the strong correlation of subspaces that arise when a parallel random forest processes imbalanced data sets in a big data environment, this paper proposes PRF-JPOSSW, a parallel random forest optimization algorithm that combines probability-based oversampling with weakly correlated subspaces. First, an oversampling strategy OSJG, which combines the joint probability distribution with a Gibbs sampler, oversamples the minority class from an approximate probability distribution of the minority samples, effectively avoiding the sample-distribution boundary problem. Second, a weakly correlated subspace selection method SSMI, based on mutual information, resolves the strong-correlation problem by weakening the correlation between candidate feature subspaces and already-selected feature subspaces. Finally, a load-balancing strategy NDS distributes key-value pairs evenly during parallelization and improves the efficiency of the parallel algorithm (minimal sketches of OSJG, SSMI, and NDS follow below). Experimental results show that on large imbalanced data sets, PRF-JPOSSW achieves a relative improvement in classification accuracy of 8.47%, with lower time complexity and higher parallel efficiency.
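The abstract does not spell out how OSJG approximates the minority-class joint distribution, but the Gibbs-sampling step it relies on can be sketched. The Python below assumes a multivariate Gaussian approximation of the minority class, so each full conditional has a closed form; the function and parameter names (`gibbs_oversample`, `n_burn`) are illustrative, not from the paper.

```python
import numpy as np

def gibbs_oversample(X_min, n_new, n_burn=100, rng=None):
    """Draw synthetic minority samples by Gibbs sampling.

    Sketch only: the minority-class joint distribution is approximated
    by a multivariate Gaussian fitted to X_min, so every full
    conditional is available in closed form. OSJG as described in the
    text estimates the joint distribution itself; this stand-in just
    illustrates the coordinate-wise Gibbs updates.
    """
    rng = np.random.default_rng(rng)
    mu = X_min.mean(axis=0)
    cov = np.cov(X_min, rowvar=False) + 1e-6 * np.eye(X_min.shape[1])
    d = len(mu)
    x = mu.copy()                        # start the chain at the mean
    samples = []
    for t in range(n_burn + n_new):
        for j in range(d):               # resample one coordinate at a time
            idx = [k for k in range(d) if k != j]
            S_oo = cov[np.ix_(idx, idx)]
            S_jo = cov[j, idx]
            w = np.linalg.solve(S_oo, x[idx] - mu[idx])
            cond_mean = mu[j] + S_jo @ w
            cond_var = cov[j, j] - S_jo @ np.linalg.solve(S_oo, S_jo)
            x[j] = rng.normal(cond_mean, np.sqrt(max(cond_var, 1e-12)))
        if t >= n_burn:                  # keep draws only after burn-in
            samples.append(x.copy())
    return np.asarray(samples)
```

Because each coordinate is resampled from its conditional given the rest, the chain settles onto the bulk of the estimated joint distribution, which is how this kind of oversampling avoids generating points on the boundary of the minority region.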
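SSMI's exact criterion is likewise not given in this summary. A common way to weaken the correlation between a candidate feature and an already-selected subspace is an mRMR-style trade-off between relevance MI(f; y) and redundancy MI(f; s); the equal-weight trade-off, the histogram binning, and the helper names below are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def weak_corr_subspace(X, y, k, n_bins=10):
    """Greedy feature-subspace selection that penalises redundancy.

    Illustrative stand-in for SSMI: at each step, keep the candidate
    feature with high relevance MI(f; y) but low average MI against
    the features already chosen.
    """
    def disc(col):  # discretise a continuous column for MI estimation
        edges = np.histogram_bin_edges(col, bins=n_bins)
        return np.digitize(col, edges[1:-1])

    Xd = np.apply_along_axis(disc, 0, X)
    chosen, remaining = [], list(range(X.shape[1]))
    while len(chosen) < k and remaining:
        scores = []
        for f in remaining:
            relevance = mutual_info_score(Xd[:, f], y)
            redundancy = (np.mean([mutual_info_score(Xd[:, f], Xd[:, s])
                                   for s in chosen]) if chosen else 0.0)
            scores.append(relevance - redundancy)  # mRMR-style trade-off
        best = remaining[int(np.argmax(scores))]
        chosen.append(best)
        remaining.remove(best)
    return chosen
```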
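For NDS, the abstract states only that key-value pairs are distributed evenly during parallelization. A minimal sketch of that goal, assuming a greedy longest-processing-time rule in place of the paper's actual strategy (the function name is illustrative):

```python
import heapq

def balanced_partition(key_counts, n_reducers):
    """Greedy load-balanced assignment of keys to reducers.

    Instead of hashing keys (which can skew load), assign each key,
    heaviest first, to the currently lightest reducer.
    """
    heap = [(0, r) for r in range(n_reducers)]   # (load, reducer id)
    heapq.heapify(heap)
    assignment = {}
    for key, count in sorted(key_counts.items(), key=lambda kv: -kv[1]):
        load, r = heapq.heappop(heap)
        assignment[key] = r
        heapq.heappush(heap, (load + count, r))
    return assignment

# e.g. balanced_partition({"a": 900, "b": 500, "c": 450, "d": 100}, 2)
# -> {"a": 0, "b": 1, "c": 1, "d": 0}; reducer loads 1000 vs. 950
```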
(2) To address the excess of redundant and irrelevant features, the insufficient information content of feature subspaces, and the low parallelization efficiency that arise when the random forest algorithm processes high-dimensional data sets in a big data environment, this paper proposes PRFGRSAE, a parallel random forest algorithm that combines the gain ratio with a stacked autoencoder. First, a dimensionality reduction strategy DRNGRSE, which combines a nonlinear normalized gain ratio with a stacked autoencoder, filters redundant and irrelevant features out of the feature set and uses the stacked autoencoder to extract features, effectively reducing the number of redundant and irrelevant features. Second, a subspace selection strategy SSLF, which combines Latin hypercube sampling with a normalized correlation degree, forms feature subspaces with high information content through multi-layer stratified sampling of the feature set, effectively guaranteeing the information content of each subspace. Finally, a Reducer allocation strategy DSVLA, based on variable action-set learning automata, distributes the data clusters evenly across Reducers for processing and effectively improves parallelization efficiency (sketches of these three components follow below). Experimental results show that on high-dimensional data sets the PRFGRSAE algorithm markedly improves both speedup and accuracy; in particular, on data sets with more features, the parallel speedup improves by 0.47 while accuracy remains higher.
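DRNGRSE's two stages can be sketched as a gain-ratio filter followed by autoencoder compression. The sketch below uses the plain gain ratio in place of the paper's nonlinear normalized variant, trains the stacked autoencoder end-to-end rather than layer-wise for brevity, and uses PyTorch; all names (`reduce_dimensions`, `keep_ratio`, `d_code`) are illustrative assumptions.

```python
import numpy as np
import torch
import torch.nn as nn

def gain_ratio(x_disc, y):
    """Information gain of discretised feature x over y, normalised by H(x)."""
    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log2(p)).sum()
    h_y_given_x = sum((x_disc == v).mean() * entropy(y[x_disc == v])
                      for v in np.unique(x_disc))
    split_info = entropy(x_disc)
    return (entropy(y) - h_y_given_x) / split_info if split_info > 0 else 0.0

class StackedAE(nn.Module):
    """Two-layer autoencoder; the bottleneck is the reduced feature set."""
    def __init__(self, d_in, d_hidden, d_code):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                                     nn.Linear(d_hidden, d_code))
        self.decoder = nn.Sequential(nn.Linear(d_code, d_hidden), nn.ReLU(),
                                     nn.Linear(d_hidden, d_in))
    def forward(self, x):
        return self.decoder(self.encoder(x))

def reduce_dimensions(X, y, keep_ratio=0.5, d_code=16, epochs=200):
    # Stage 1: drop features whose gain ratio is low (redundant/irrelevant).
    bins = np.quantile(X, np.linspace(0, 1, 11)[1:-1], axis=0)
    scores = np.array([gain_ratio(np.digitize(X[:, j], bins[:, j]), y)
                       for j in range(X.shape[1])])
    Xf = torch.tensor(X[:, scores >= np.quantile(scores, 1 - keep_ratio)],
                      dtype=torch.float32)
    # Stage 2: compress the surviving features with the autoencoder.
    model = StackedAE(Xf.shape[1], 2 * d_code, d_code)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(Xf), Xf)
        loss.backward()
        opt.step()
    return model.encoder(Xf).detach().numpy()
```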
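The stratified idea behind SSLF can also be sketched briefly: rank features by a per-feature relevance score (the paper uses a normalized correlation degree; any score serves for illustration), split the ranking into layers, and draw one feature per layer, so each subspace mixes strong, medium, and weak features instead of being dominated by one region of the ranking. The names below are illustrative.

```python
import numpy as np

def lhs_subspace(scores, k, rng=None):
    """Latin-hypercube-style feature subspace draw over a relevance ranking."""
    rng = np.random.default_rng(rng)
    order = np.argsort(scores)[::-1]       # feature indices, best first
    strata = np.array_split(order, k)      # k contiguous ranking layers
    return np.array([rng.choice(layer) for layer in strata])

# e.g. with 12 features and k = 4, each draw takes one feature from the
# top 3, one from ranks 4-6, one from 7-9, and one from ranks 10-12.
```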
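Finally, a variable action learning automaton can be sketched as a probability vector over Reducers that is updated by reward and penalty. The linear reward-penalty update and the load-based reward signal below are assumed stand-ins for DSVLA's actual scheme, which the abstract does not detail.

```python
import numpy as np

class ReducerAutomaton:
    """Linear reward-penalty learning automaton over reducer choices."""
    def __init__(self, n_reducers, a=0.1, b=0.05, rng=None):
        self.p = np.full(n_reducers, 1.0 / n_reducers)
        self.a, self.b = a, b                 # reward / penalty rates
        self.rng = np.random.default_rng(rng)

    def choose(self):
        return self.rng.choice(len(self.p), p=self.p)

    def update(self, action, rewarded):
        n = len(self.p)                       # assumes n >= 2
        if rewarded:   # shift probability mass toward the chosen reducer
            self.p = (1 - self.a) * self.p
            self.p[action] += self.a
        else:          # spread probability mass away from it
            self.p = (1 - self.b) * self.p + self.b / (n - 1)
            self.p[action] -= self.b / (n - 1)

def allocate(cluster_sizes, n_reducers, rng=None):
    """Assign data clusters to reducers, reinforcing load-evening choices."""
    auto = ReducerAutomaton(n_reducers, rng=rng)
    loads = np.zeros(n_reducers)
    plan = []
    for size in sorted(cluster_sizes, reverse=True):
        r = auto.choose()
        # reward the action if the chosen reducer was at or below average load
        auto.update(r, rewarded=loads[r] <= loads.mean())
        loads[r] += size
        plan.append(r)
    return plan, loads
```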