Research On Parallel Text Classification Algorithm Base On Random Forest And Spark

Posted on:2017-03-04

Degree:Master

Type:Thesis

Country:China

Candidate:Y S Luo

Full Text:PDF

GTID:2308330485988805

Subject:Computer technology

Abstract/Summary:

Text classification is widely used in applications such as search engines and information retrieval. In the era of information technology, classifying text in big data effectively is one of the important research topics in data mining. In this paper, we study the application of the random forest algorithm to the classification of massive text collections. Random forest is an ensemble algorithm that can deal effectively with massive data. By exploiting randomness, random forest can achieve better classification performance while also addressing the overfitting problem of decision trees. Furthermore, random forest can be applied to big data processing by using random subspaces. However, poor random subspaces may be generated in the sampling stage, which impairs the classification ability of the corresponding decision trees. In this paper, rough set theory is applied to random forest with the aim of improving its classification ability. Moreover, weighted voting is adopted. The experimental results show that the improved random forest algorithm outperforms the Naive Bayes and decision tree algorithms in classification performance on most data sets.

MapReduce is currently the most widely used parallel computing framework for big data, and parallel text classification algorithms under the MapReduce framework have attracted much attention from researchers. However, a disadvantage of MapReduce is that intermediate results can only be stored on HDFS, which results in a large amount of I/O overhead. Spark is a parallel framework based on in-memory computing that does not directly write intermediate results to disk during execution (although part of the data is cached to disk when memory is insufficient). Therefore, the execution efficiency of the Spark framework is comparatively better. In this paper, a parallel text classification algorithm using random forest is investigated under the Spark framework and compared with a text classification algorithm under MapReduce. Experiments show that parallel text classification on the Spark framework has good parallel performance, better than that of MapReduce. Finally, to make the Spark cluster more convenient for users, a parallel text classification system based on a B/S (browser/server) structure is designed, which supports remote task submission, cluster monitoring, data download, and so on.
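
The weighted voting mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the weights are derived, so the example assumes each tree carries a precomputed weight (for instance, its accuracy on held-out data). The function name `weighted_vote` and the sample labels are hypothetical and are not taken from the thesis.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label whose supporting trees have the largest total weight.

    predictions: one predicted label per tree.
    weights: one non-negative weight per tree (assumed here to be a
             precomputed quality score; the thesis may define it differently).
    """
    if len(predictions) != len(weights):
        raise ValueError("predictions and weights must have the same length")
    if not predictions:
        raise ValueError("at least one prediction is required")

    # Sum the weights of all trees that voted for each label.
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight

    # Ties go to the label that was seen first.
    return max(totals, key=totals.get)


# One strong tree outvotes two weaker trees that agree with each other.
tree_predictions = ["sports", "politics", "politics"]
tree_weights = [0.9, 0.4, 0.3]
winner = weighted_vote(tree_predictions, tree_weights)
```

With equal weights this reduces to ordinary majority voting, so the same function covers the unweighted random forest as a special case.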

Keywords/Search Tags:

Parallel text classification, Rough set theory, Spark, Random forest, Classification system

Related items

1	Research On Parallel Text Categorization Of Random Forest
2	Research On Optimization Of Random Forest Algorithm And Its Application In Text Parallel Classification
3	Research On Random Forest Classification Algorithm Based On Spark Distributed Platform
4	Research On Parallelization And Optimization Of Random Forest Classification Algorithm Based On Spark
5	Research On Large-scale Traffic Classification Technology Based On Spark Performance Optimization
6	Random Forest In Application Of Text Classification
7	The Research And Implementation Of Parallel Algorithm For Bayesian Text Classification Based Spark Computing Environment
8	Parallel Bayesian Spam Classification System Based On Spark
9	Research And Application In Text Classification Based On Random Forest
10	Study On The Application Of Random Forests In Text Classification