
Research on Improving the Random Forests Algorithm Based on Spark

Posted on: 2018-11-30   Degree: Master   Type: Thesis
Country: China   Candidate: R S Wang   Full Text: PDF
GTID: 2348330536466314   Subject: Computer technology
Abstract/Summary:
Random forests is a machine learning algorithm with excellent classification performance. It handles large-scale datasets and datasets with thousands of attributes well, requires little parameter tuning, and is resistant to overfitting. As a result, random forests have been widely applied in many fields and have attracted a large number of researchers, whose work on improving the algorithm has been fruitful. However, the decision trees generated when building a random forest model differ in classification performance and are correlated with one another. Trees with poor classification performance and trees that are strongly correlated with each other degrade the overall classification performance of the random forest model.

To address this, this paper proposes an improved random forests algorithm based on classification accuracy and similarity. The algorithm evaluates the classification performance of each decision tree in the random forest model by its AUC value and selects the trees whose performance is above a threshold. It then computes the similarity among the selected trees to obtain a similarity matrix. Because high similarity between two decision trees implies high correlation between them, the similarity matrix is used to cluster the selected trees. Finally, the tree with the highest AUC value in each cluster is chosen as the representative of that cluster, and these representatives make up the new random forest model.

The improved algorithm is first implemented on the MATLAB platform and compared experimentally with the traditional random forests algorithm on four UCI datasets: heart disease, breast cancer, Pima Indian diabetes, and Indian liver disease. The results show that the improved algorithm achieves better classification accuracy than the traditional algorithm. However, because it adds two optimization steps to the traditional algorithm, its classification speed is lower, and processing large datasets and iterating are slow on the MATLAB platform. The improved algorithm is therefore finally implemented on the Spark platform, which greatly improves its classification speed.
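
The selection procedure described in the abstract (filter trees by AUC, build a similarity matrix, cluster, keep the best tree per cluster) can be illustrated with a minimal Python sketch. This is not the thesis implementation, which uses MATLAB and Spark. The abstract does not say how similarity is measured or which clustering method is used, so the sketch assumes that similarity is the fraction of validation samples on which two trees predict the same class, and that clusters are groups of trees linked by similarity at or above a cutoff. The names `prune_forest`, `auc_threshold`, and `sim_threshold` and all threshold values are hypothetical.

```python
from itertools import combinations


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("labels must contain both classes")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def similarity(a, b, cutoff=0.5):
    """Assumed measure: share of samples on which two trees predict the same class."""
    agree = sum((x >= cutoff) == (y >= cutoff) for x, y in zip(a, b))
    return agree / len(a)


def prune_forest(tree_scores, labels, auc_threshold=0.7, sim_threshold=0.8):
    """Return the indices of the trees kept in the pruned forest.

    tree_scores[i] holds tree i's positive-class scores on a validation set.
    """
    # Step 1: keep only trees whose AUC is above the threshold.
    aucs = {i: auc(s, labels) for i, s in enumerate(tree_scores)}
    kept = [i for i in aucs if aucs[i] > auc_threshold]

    # Step 2: pairwise similarity matrix over the kept trees.
    sim = {(i, j): similarity(tree_scores[i], tree_scores[j])
           for i, j in combinations(kept, 2)}

    # Step 3 (assumed clustering): merge trees linked by high similarity,
    # using a small union-find structure.
    parent = {i: i for i in kept}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, j), s in sim.items():
        if s >= sim_threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in kept:
        clusters.setdefault(find(i), []).append(i)

    # Step 4: the highest-AUC tree represents each cluster.
    return sorted(max(members, key=lambda i: aucs[i]) for members in clusters.values())


# Toy validation set: four samples, four trees (made-up scores).
labels = [1, 1, 0, 0]
tree_scores = [
    [0.9, 0.8, 0.2, 0.1],  # AUC 1.0
    [0.9, 0.3, 0.4, 0.1],  # AUC 0.75, agrees with tree 0 on 3 of 4 samples
    [0.2, 0.9, 0.8, 0.1],  # AUC 0.75, dissimilar to trees 0 and 1
    [0.1, 0.2, 0.8, 0.9],  # AUC 0.0, removed by the AUC filter
]
print(prune_forest(tree_scores, labels, auc_threshold=0.7, sim_threshold=0.7))  # [0, 2]
```

In this toy run, tree 3 is removed by the AUC filter, trees 0 and 1 fall into one cluster that tree 0 represents, and tree 2 forms its own cluster. The pairwise similarity computation is quadratic in the number of kept trees.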
Keywords/Search Tags: random forests, classification accuracy, correlation, similarity matrix, Spark platform