Research On Optimization Of Random Forest Algorithm And Its Application In Text Parallel Classification

Posted on: 2019-04-14
Degree: Master
Type: Thesis
Country: China
Candidate: X Zhang
Full Text: PDF
GTID: 2428330566999260
Subject: Electronic and communication engineering
Abstract/Summary:
Random forest is a typical ensemble classifier. By introducing randomness, it constructs a set of decision trees, which overcomes the over-fitting and local convergence problems of individual decision trees. Because the random forest algorithm removes the performance bottleneck of a single classifier, it has gradually become widely used. However, the algorithm still has shortcomings, and some aspects need improvement. In this thesis, we optimize feature selection and the processing of imbalanced data sets, and implement parallel classification with the random forest algorithm on the Hadoop platform. The concrete work includes:

(1) In terms of feature selection, this thesis improves on the feature selection method built into random forest and proposes a new feature selection algorithm. The algorithm builds a random forest using MapReduce on a distributed platform. It then alters each feature column of the external data in turn to obtain the corresponding feature importance measurement, and computes a weight for each decision tree that depends on the prediction consistency between that individual tree and the random forest as a whole. The feature importance list is determined by the weighted sum of the two. Finally, a degree of randomness is introduced on top of the feature importance ranking, which preserves the strength of each tree while reducing the correlation between trees. The experimental results show that the algorithm achieves better classification accuracy and efficiency than random forest feature selection in the traditional single-machine mode.

(2) In terms of data preprocessing, the research status of class imbalance in data sets and several excellent imbalanced data processing algorithms are fully studied. Building on the typical SMOTE algorithm, a new M3C-SMOTE algorithm is proposed. The method uses the K-means clustering algorithm to find three cluster centers of the sample set, then calculates the center of gravity of the three centers and generates new samples centered on it, which offers a good solution to the blindness and marginalization problems of the SMOTE algorithm. Experimental results show that preprocessing the data set with the proposed method improves the classification performance of the random forest algorithm.

(3) Text preprocessing, text feature selection, text vectorization, training, classification and related steps require a large amount of statistics and computation. These processes are parallelized and implemented in detail using the MapReduce distributed computing framework. Speedup comparison experiments verify the efficiency of massively parallel text classification in distributed mode. Finally, the random forest feature selection algorithm is introduced into text classification, which further improves classification accuracy.
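The oversampling idea in item (2) can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract only states that three K-means cluster centers are found, that their center of gravity is computed, and that new samples are generated around it. The interpolation rule below (moving a randomly chosen minority sample a random fraction of the way toward the center of gravity) is an assumption borrowed from the way standard SMOTE interpolates between neighbors, and the function names `kmeans`, `centroid` and `m3c_smote_sketch` are hypothetical.

```python
import random


def kmeans(points, k=3, iters=20, seed=0):
    """Plain Lloyd's k-means on a list of equal-length float tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers are k distinct samples
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center (squared Euclidean distance).
            j = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            clusters[j].append(p)
        # Recompute each center as its cluster mean; keep the old center
        # if a cluster happens to be empty.
        centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def m3c_smote_sketch(minority, n_new, seed=0):
    """Generate n_new synthetic minority samples around the center of
    gravity of three k-means cluster centers.

    ASSUMPTION: the interpolation toward the center of gravity is an
    illustrative choice, not a rule taken from the thesis.
    """
    rng = random.Random(seed)
    centers = kmeans(minority, k=3, seed=seed)
    gravity = centroid(centers)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + t * (gi - xi) for xi, gi in zip(x, gravity)))
    return gravity, synthetic


# Usage on a toy two-dimensional minority class.
minority_samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (0.2, 0.8)]
gravity_point, new_samples = m3c_smote_sketch(minority_samples, n_new=10, seed=1)
```

Because every synthetic point lies on a segment between an existing minority sample and the center of gravity, the new samples are pulled toward the interior of the minority class rather than toward its boundary, which is one plausible reading of how the method addresses the marginalization problem mentioned above.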
Keywords/Search Tags: random forest, feature selection, unbalanced data set, Hadoop, text classification