Research On Optimization And Improvement Of Random Forests Algorithm And Its Parallelization

Posted on:2020-04-20

Degree:Master

Type:Thesis

Country:China

Candidate:G J Wang

Full Text:PDF

GTID:2428330572468769

Subject:Computer Science and Technology

Abstract/Summary:

The random forests algorithm is an ensemble learning algorithm that classifies data by combining multiple decision trees. It has a wide range of applications, is not prone to over-fitting, and is widely used in medicine, bioinformatics, economics and other fields. However, when the random forests algorithm is applied to the classification of unbalanced data, both classification performance and decision tree size are reduced. In addition, when applied to high-dimensional data, the algorithm suffers from low classification accuracy and large generalization error. At the same time, with the advent of the big data era, the random forests algorithm should be capable of processing large-scale data. To address these deficiencies, this thesis studies and improves the random forests algorithm at both the data level and the algorithm level. The specific work is as follows:

(1) Combining SMOTE with the random forests algorithm on imbalanced datasets leads to a marginal distribution problem in the dataset. This thesis proposes an AFCM-SMOTE combined random forests algorithm. First, the minority-class samples are clustered by the AFCM algorithm and the cluster centers are found. Second, according to a new interpolation formula, "artificial" samples are synthesized on the line connecting each cluster center and the data in its cluster. Finally, the "artificial" samples are added to the original dataset, which is then classified by random forests. Extensive experiments on five imbalanced datasets show that the F-value, G-mean, AUC and OOB error of the improved algorithm are better.

(2) In view of the shortcomings of the random forests algorithm on high-dimensional data, this thesis proposes feature selection and parameter optimization of random forests based on intelligent algorithms. The random forests algorithm is combined with the immune algorithm and the bat algorithm. Using binary encoding, the number of trees, the number of attributes and the selected features are searched simultaneously, with the minimum out-of-bag (OOB) error as the objective function. Extensive experiments on five high-dimensional datasets show that the accuracy and OOB error of the improved algorithm proposed in this thesis are better.

(3) This thesis studies the parallel random forests algorithm by building a distributed computing platform on Hadoop. The experimental results show that the parallelized random forests algorithm performs better on large-scale datasets and that the running efficiency of the algorithm is improved.
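The interpolation step in contribution (1) can be illustrated with a minimal sketch. The abstract does not give the thesis's new interpolation formula, so the code below assumes plain SMOTE-style linear interpolation between a cluster center and a sample in that cluster. The function name `synthesize_samples`, the toy data, and the use of a simple centroid in place of an AFCM cluster center are all hypothetical.

```python
import random


def synthesize_samples(center, cluster_points, n_new, rng=None):
    """Create n_new synthetic points on the segments joining `center`
    to randomly chosen members of `cluster_points`.

    Assumption: linear interpolation p = c + lam * (x - c), lam in [0, 1).
    This is a stand-in for the thesis's interpolation formula, which the
    abstract does not state.
    """
    rng = rng or random.Random()
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(cluster_points)  # a minority-class sample in the cluster
        lam = rng.random()              # interpolation weight in [0, 1)
        synthetic.append([c + lam * (xi - c) for c, xi in zip(center, x)])
    return synthetic


# Hypothetical toy cluster of minority-class samples (not data from the thesis).
cluster = [[1.0, 2.0], [2.0, 3.0], [3.0, 1.0]]

# Plain centroid used as a stand-in for an AFCM cluster center.
center = [sum(col) / len(cluster) for col in zip(*cluster)]

new_points = synthesize_samples(center, cluster, n_new=5, rng=random.Random(0))
```

Because every synthetic point lies between the cluster center and an existing minority sample, the new samples stay inside the cluster rather than drifting toward the class boundary. In a full pipeline they would be appended to the training set before the random forest is fitted.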

Keywords/Search Tags:

random forests, unbalanced data, SMOTE, high-dimensional data, intelligent algorithm, feature selection, parameter optimization, MapReduce

Related items

1	High-dimensional Unbalanced Data Set Classification Algorithm Based On Support Vector Machine And Its Application
2	Research On The Expansion And Classification Of Several Imbalanced Data Sets Based On C-SMOTE Algorithm
3	Research On Application And Optimization Method Of Random Forests Algorithm
4	Research On Optimization And Improvement Of Random Forests Algorithm
5	Research On High-dimensional Unbalanced Data Classification Algorithm Based On Feature Selection And Ensemble Learning
6	The Improvement And Application Of Smote Algorithm For Unbalanced Data Sampling
7	Classification Of Non-equilibrium High-Dimensional Small Sample Data Based On RF And LSSVM Models
8	The Research Of Web Pages Filtering Based On Random Forests Algorithms
9	Research And Application Of High Dimensional Imbalanced Data Classification Based On Random Forest
10	The Research On Random Forest And Its Parallelization Oriented To Unbalanced High-dimensional Data