
Research On Parallel Text Classification Algorithm Base On Random Forest And Spark

Posted on: 2017-03-04 | Degree: Master | Type: Thesis
Country: China | Candidate: Y S Luo | Full Text: PDF
GTID: 2308330485988805 | Subject: Computer technology
Abstract/Summary:
Text classification is widely used in applications such as search engines and information retrieval. In the era of information technology in particular, classifying text in big data effectively is one of the important research topics in data mining. In this paper, we study the application of the random forest algorithm to the classification of massive text collections. Random forest is an ensemble algorithm that can deal effectively with massive data. By exploiting randomness, it can achieve better classification results while also addressing the overfitting problem of decision trees. Furthermore, random forest can be applied to big data processing by using random subspaces. However, a poor random subspace may be generated in the sampling stage, which impairs the classification ability of the corresponding decision tree. In this paper, rough set theory is applied to random forest in order to improve its classification ability. Moreover, weighted voting is adopted. The experimental results show that the improved random forest algorithm outperforms the Naive Bayes and decision tree algorithms in classification performance on most data sets.
MapReduce is currently the most widely used parallel computing framework for big data, and parallel text classification algorithms under MapReduce have attracted much attention from researchers. The disadvantage of MapReduce is that intermediate results can only be stored on HDFS, which results in a large amount of I/O overhead. Spark is a parallel framework based on in-memory computing: it keeps intermediate results in memory during execution rather than writing them to storage (although part of the data is cached to disk when memory is insufficient). Therefore, the Spark framework executes more efficiently.
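The abstract mentions weighted voting but does not say how the weights are derived. As a minimal sketch, assuming each tree carries a single non-negative weight (for example its accuracy on held-out data, which is an assumption here and not a detail from the thesis), the vote could be combined as follows. The function name `weighted_vote` and the sample labels are hypothetical.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Return the label with the largest total weight.

    predictions: one predicted label per tree
    weights: one non-negative weight per tree (hypothetical, e.g. each
             tree's accuracy on held-out data)
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per tree")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # Iterate labels in sorted order so that ties break deterministically.
    return max(sorted(totals), key=totals.get)

# Two weak trees say "sports", one strong tree says "finance".
tree_predictions = ["sports", "sports", "finance"]
tree_weights = [0.4, 0.4, 0.9]
winner = weighted_vote(tree_predictions, tree_weights)
```

With these illustrative weights, the single strong tree outweighs the two weak ones, so the result differs from a plain majority vote.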
In this paper, a parallel text classification algorithm based on random forest is investigated under the Spark framework and compared with a text classification algorithm under MapReduce. Experiments show that parallel text classification on Spark has good parallel performance and outperforms the MapReduce implementation. Finally, to make the Spark cluster more convenient to use, a parallel text classification system based on a B/S (browser/server) architecture is designed, which supports remote task submission, cluster monitoring, data download, etc.
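The thesis's Spark implementation is not shown in this abstract. The following plain-Python sketch only illustrates the general map-then-reduce shape that a parallel forest prediction might take: the trees are split into partitions, each partition tallies its own votes (map), and the tallies are summed (reduce). It does not use the Spark API; threads stand in for cluster workers, and `make_keyword_tree` is a hypothetical stub in place of a trained decision tree.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def make_keyword_tree(keyword, label, default):
    """Hypothetical stand-in for a trained decision tree."""
    def tree(document):
        return label if keyword in document.lower().split() else default
    return tree

def partial_votes(trees, document):
    """Map step: tally the votes of one partition of trees."""
    return Counter(tree(document) for tree in trees)

def classify(partitions, document):
    """Run the map step per partition, then reduce by summing the tallies."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        tallies = list(
            pool.map(lambda trees: partial_votes(trees, document), partitions)
        )
    total = reduce(lambda a, b: a + b, tallies, Counter())
    return total.most_common(1)[0][0]

# Three stub trees spread over two partitions.
partitions = [
    [make_keyword_tree("goal", "sports", "finance"),
     make_keyword_tree("match", "sports", "finance")],
    [make_keyword_tree("stock", "finance", "sports")],
]
label = classify(partitions, "Late goal decides the match")
```

Because summing vote tallies is associative and commutative, the partial results can be combined in any order, which is what makes this step easy to distribute.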
Keywords/Search Tags: Parallel text classification, Rough set theory, Spark, Random forest, Classification system