
Research On Optimization And Improvement Of Random Forests Algorithm And Its Parallelization

Posted on: 2020-04-20
Degree: Master
Type: Thesis
Country: China
Candidate: G J Wang
Full Text: PDF
GTID: 2428330572468769
Subject: Computer Science and Technology
Abstract/Summary:
The random forests algorithm is an ensemble learning algorithm that classifies data by combining multiple decision trees. It has a wide range of applications, is not prone to over-fitting, and is widely used in medicine, bioinformatics, economics and other fields. However, when the random forests algorithm is applied to the classification of unbalanced data, both the classification performance and the size of the decision trees are reduced. In addition, on high-dimensional data the algorithm suffers from low classification accuracy and large generalization error. At the same time, with the advent of the big data era, the random forests algorithm should be capable of processing large-scale data. To address these deficiencies, this thesis studies and improves the random forests algorithm at both the data level and the algorithm level. The specific work is as follows:

(1) Combining SMOTE with the random forests algorithm leads to a dataset marginal distribution problem when dealing with imbalanced datasets. This research proposes an AFCM-SMOTE combined random forests algorithm. First, the minority-class samples are clustered with the AFCM algorithm and the cluster centers are found. Second, following a new interpolation formula, "artificial" samples are synthesized on the line connecting each cluster center to the data in its cluster. Finally, the "artificial" samples are added to the original dataset, which is then classified by random forests. Extensive experiments on five imbalanced datasets show that the F-value, G-mean, AUC and OOB error of the improved algorithm are better.

(2) To address the shortcomings of the random forests algorithm on high-dimensional data, this research proposes feature selection and parameter optimization for random forests based on intelligent algorithms. The random forests algorithm is combined with the immune algorithm and the bat algorithm. Using binary encoding, the number of trees, the number of attributes and the feature subset are searched simultaneously, with the minimum out-of-bag (OOB) data error as the objective function. Extensive experiments on five high-dimensional datasets show that the accuracy and OOB error of the improved algorithm proposed in this thesis are better.

(3) This thesis studies the parallel random forests algorithm by building a Hadoop distributed computing platform. The experimental results show that the parallelized random forests algorithm performs better on large-scale datasets and that the running efficiency of the algorithm is improved.
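
The oversampling step in (1) can be illustrated with a minimal Python sketch. It is not the thesis's method: the abstract does not give the new interpolation formula, so plain linear interpolation between the center and a cluster member is assumed here, and a simple mean (the hypothetical helper `centroid`) stands in for an AFCM cluster center. The function name `synthesize` and the sample data are also illustrative assumptions.

```python
import random


def centroid(points):
    """Mean of equal-length vectors (assumed stand-in for an AFCM cluster center)."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def synthesize(cluster, n_new, rng=None):
    """Create n_new synthetic samples on segments joining the center to cluster members.

    Assumption: linear interpolation x_new = c + t * (x - c) with t in [0, 1);
    the thesis's actual interpolation formula is not stated in the abstract.
    """
    rng = rng or random.Random()
    center = centroid(cluster)
    samples = []
    for _ in range(n_new):
        x = rng.choice(cluster)   # pick one minority sample in the cluster
        t = rng.random()          # position along the center-to-sample segment
        samples.append([c + t * (xi - c) for c, xi in zip(center, x)])
    return samples


# Hypothetical minority-class cluster with two features per sample.
minority = [[1.0, 2.0], [2.0, 3.0], [3.0, 2.5]]
new_points = synthesize(minority, n_new=4, rng=random.Random(0))
```

Because every synthetic point lies between the center and an existing sample, the new points stay inside the cluster rather than drifting toward the class boundary, which is the behavior the abstract's description of the marginal distribution problem suggests is intended.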
Keywords/Search Tags:random forests, unbalanced data, SMOTE, high-dimensional data, intelligent algorithm, feature selection, parameters optimization, MapReduce