Research On Parallel Text Categorization Of Random Forest

Posted on: 2019-12-04
Degree: Master
Type: Thesis
Country: China
Candidate: Z Peng
GTID: 2428330548478305
Subject: Electronics and Communications Engineering
Abstract/Summary:
The development of the Internet has produced a large amount of digital media information. Apart from some multimedia content, most of it consists of text files. Since most documents are unstructured, ordinary computing techniques have difficulty handling them effectively, and classification is an important technique for processing them. This thesis studies text classification based on the random forest algorithm. Random forest is an ensemble algorithm built from decision trees. It offers high classification performance, good robustness, and resistance to overfitting. However, the traditional random forest algorithm also has shortcomings. First, it performs poorly on imbalanced data: accuracy on the minority classes is significantly lower than on the majority classes. Second, all decision trees in the forest have the same voting weight, so well-performing trees are not given greater influence and the impact of poorly performing trees is not reduced. Third, the algorithm must build multiple classifiers during training, so its running time is relatively long, typically more than double that of other algorithms. In response to these deficiencies, this thesis improves the random forest algorithm in three ways:

(1) An improved random forest algorithm for imbalanced data is proposed. The majority-class training samples are undersampled and the minority-class samples are sampled with replacement, so that the class sizes are balanced. This improves classification of the minority classes without affecting accuracy on the majority classes. The experimental results show that the algorithm works well on imbalanced text classification data, and that classification accuracy on the minority classes is significantly improved.

(2) A leaf-node-weighted random forest algorithm is proposed, in which each decision tree's voting weight is used together with its classification result. Experiments show that the improved random forest algorithm achieves higher precision, recall, and F value than the ordinary random forest, naive Bayes, and k-nearest neighbor algorithms, which indicates that the improvement raises performance.

(3) The Spark distributed framework is used to parallelize the text classification process. Spark is a memory-based cluster computing framework for processing and analyzing big data. Its main features are ease of use, speed, generality, scalability, and fault tolerance. The experimental results show that parallelizing the text classification process on the Spark platform is more efficient than running it on a single machine.
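The abstract gives no implementation details, so the following is only a minimal sketch of two of the ideas it describes: balancing classes by undersampling and sampling with replacement, and combining tree predictions with unequal voting weights. The function names, the choice of the mean class size as the balancing target, and the use of one weight per tree (rather than per leaf node, as in the thesis) are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def balance_by_resampling(samples, labels, seed=0):
    """Return a class-balanced copy of the data.

    Classes larger than the target are undersampled without replacement;
    smaller classes keep all samples and are topped up by sampling with
    replacement. The target size (mean class size) is an assumption.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    target = round(len(samples) / len(by_class))
    out_x, out_y = [], []
    for y, xs in by_class.items():
        if len(xs) >= target:
            chosen = rng.sample(xs, target)  # undersample
        else:
            chosen = xs + rng.choices(xs, k=target - len(xs))  # oversample
        out_x.extend(chosen)
        out_y.extend([y] * target)
    return out_x, out_y


def weighted_vote(predictions, weights):
    """Pick the label with the largest total weight.

    `predictions[i]` is the label voted by tree i and `weights[i]` is that
    tree's voting weight (hypothetical, e.g. its held-out accuracy).
    """
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # 90 majority-class and 10 minority-class documents (toy identifiers).
    docs = list(range(100))
    classes = ["majority"] * 90 + ["minority"] * 10
    balanced_docs, balanced_classes = balance_by_resampling(docs, classes)
    print(Counter(balanced_classes))

    # Two weak trees vote "b", one strong tree votes "a".
    print(weighted_vote(["a", "b", "b"], [0.9, 0.3, 0.4]))
```

With equal weights the second example would be decided by simple majority. Unequal weights let a single reliable tree outweigh several unreliable ones, which is the shortcoming of uniform voting that the abstract points out.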
Keywords/Search Tags:Text Classification, Random Forest, Spark, Parallelization, Imbalanced Data