Research On Parallel Text Categorization Based On Random Forest

Posted on: 2019-12-04

Degree: Master

Type: Thesis

Country: China

Candidate: Z Peng

GTID: 2428330548478305

Subject: Electronics and Communications Engineering

Abstract/Summary:

The development of the Internet has produced a large amount of digital media information. Apart from multimedia content, most of it consists of text files. Because most documents are unstructured, ordinary computing techniques struggle to handle them effectively, and classification is an important technique for processing them. This thesis studies text classification based on the random forest algorithm, an ensemble method built from decision trees that offers high classification performance, good robustness, and resistance to overfitting.

However, the traditional random forest algorithm has several shortcomings. First, it handles imbalanced data poorly: accuracy on the minority classes is significantly lower than on the majority classes. Second, every decision tree in the forest has the same voting weight, so well-performing trees cannot contribute more and poorly performing trees are not held back. Third, training requires building many classifiers, so it is relatively slow, with running time generally more than double that of other algorithms.

In response to these deficiencies, this thesis improves the random forest algorithm in three ways:

(1) A random forest algorithm improved for imbalanced data is proposed. The majority-class training samples are undersampled and the minority-class samples are sampled with replacement so that the class sizes are balanced, which improves classification of the minority classes without affecting the accuracy of the majority classes. The experimental results show that the algorithm works well on imbalanced text classification data sets and that classification accuracy on the minority classes is significantly improved.

(2) A leaf-node-weighted random forest algorithm is proposed, which uses each decision tree's voting weight together with its classification result. Experiments show that the improved algorithm achieves higher accuracy, recall, and F-measure than the ordinary random forest, naive Bayes, and k-nearest neighbor algorithms, indicating that the improvement is effective.

(3) The Spark distributed framework is used to parallelize the text classification process. Spark is an in-memory cluster computing framework for processing and analyzing big data; its main features are ease of use, speed, generality, extensibility, and fault tolerance. The experimental results show that parallelizing the text classification process on the Spark platform is more efficient than running it on a single machine.
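
The resampling idea in (1) and the unequal tree votes that motivate (2) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the thesis: the functions `balance_classes` and `weighted_vote`, the default target size (the mean class size), and the use of one weight per tree. The abstract does not say how the leaf-node weights are computed, so the voting function only shows the general mechanism of weighted majority voting.

```python
import random
from collections import Counter, defaultdict


def balance_classes(samples, labels, target=None, seed=0):
    """Resample so that every class ends up with `target` examples.

    Classes with at least `target` examples are undersampled without
    replacement; smaller classes keep their originals and are topped up
    by drawing with replacement. `target` defaults to the mean class
    size, which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    if target is None:
        target = round(sum(len(v) for v in by_class.values()) / len(by_class))

    out_x, out_y = [], []
    for y, xs in by_class.items():
        if len(xs) >= target:
            picked = rng.sample(xs, target)  # undersample a majority class
        else:
            # oversample a minority class by drawing with replacement
            picked = xs + rng.choices(xs, k=target - len(xs))
        out_x.extend(picked)
        out_y.extend([y] * target)
    return out_x, out_y


def weighted_vote(predictions, weights):
    """Return the label whose supporting trees have the largest total weight."""
    totals = defaultdict(float)
    for label, w in zip(predictions, weights):
        totals[label] += w
    return max(totals, key=totals.get)


# Hypothetical data: 90 documents in one class, 10 in another.
docs = [f"doc{i}" for i in range(100)]
labels = ["sports"] * 90 + ["finance"] * 10
balanced_docs, balanced_labels = balance_classes(docs, labels)
print(Counter(balanced_labels))

# Three hypothetical trees; the single high-weight tree outvotes the other two.
print(weighted_vote(["finance", "sports", "sports"], [0.9, 0.3, 0.4]))
```

With equal weights the second call would fall back to an ordinary majority vote, which is the behavior the abstract identifies as a shortcoming of the traditional algorithm.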

Keywords/Search Tags:

Text Classification, Random Forest, Spark, Parallelization, Imbalanced Data

Related items

1	Research On Imbalanced Data Classification Algorithm Based On Random Forest And Its Parallelization
2	Research On Parallelization And Optimization Of Random Forest Classification Algorithm Based On Spark
3	Research On Parallel Text Classification Algorithm Based On Random Forest And Spark
4	Research On Imbalanced Data Classification Method Based On Random Forest Algorithm
5	Research For Imbalanced Big Data Classification Algorithm On Random Forest
6	Research On Random Forest Classification Algorithm Based On Spark Distributed Platform
7	Class-Imbalanced Data Stream Classification Method Based On Adaptive Random Forest
8	Research On Parallel Random Forest And Fuzzy C-Means Algorithm For Imbalanced Data
9	Research On The Method Of Solving Imbalanced Classification Problems Based On Random Forest Algorithm
10	The Improved Random Forests Based On The Imbalanced Data Classification