Research On The Text Categorization Based On Spark

Posted on: 2017-04-07  Degree: Master  Type: Thesis
Country: China  Candidate: S L Guang  Full Text: PDF
GTID: 2308330503979776  Subject: Computer Science and Technology
Abstract/Summary:
With the development of the Internet and information technology, the volume of unstructured data in text form keeps growing. Text classification organizes and processes such data effectively and is widely used in many fields. The classification pipeline includes preprocessing, feature selection, vectorization, and other stages, each of which is costly in time and memory, so conventional techniques cannot keep up with large volumes of text. Big data technology offers an effective solution for large-scale data processing. The parallel programming model MapReduce has limitations: it is disk-based and cannot reuse intermediate results efficiently. Spark, by contrast, is memory-based, reuses intermediate results efficiently, and processes data quickly. This thesis uses Spark to improve the efficiency of text classification. First, we analyze the key technologies of text categorization and of Spark, then design a Spark-based parallel program for each stage of text categorization. For feature selection, we use the χ2 statistic and improve it with three parameters (TF, DF, and CF). To build the classifier, we use the Naive Bayes algorithm. Our analysis of this method shows that its most critical step is computing the probability of a term given the category, and we compute it with an improved TF-IDF algorithm.
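As an illustration of the χ2 feature selection step mentioned above, the following is a minimal sketch of the standard χ2 statistic for a term and a category, computed from document counts. It does not reproduce the improved variant with TF, DF, and CF, whose definition the abstract does not give. The function names `chi_square` and `chi_square_scores` and the toy corpus are assumptions made for this example only.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Standard chi-square statistic for one term and one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate table carries no information about the term.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def chi_square_scores(docs, category):
    """Score every term against `category`.

    docs is a list of (label, tokens) pairs (a hypothetical input format).
    """
    n_docs = len(docs)
    n_in = sum(1 for label, _ in docs if label == category)
    df_in = Counter()   # document frequency inside the category
    df_all = Counter()  # document frequency over the whole corpus
    for label, tokens in docs:
        for term in set(tokens):  # count each term once per document
            df_all[term] += 1
            if label == category:
                df_in[term] += 1
    scores = {}
    for term, df in df_all.items():
        a = df_in[term]
        b = df - a
        c = n_in - a
        d = n_docs - n_in - b
        scores[term] = chi_square(a, b, c, d)
    return scores


# Toy corpus, invented for illustration.
docs = [
    ("sports", ["ball", "goal"]),
    ("sports", ["ball", "team"]),
    ("tech", ["code", "team"]),
    ("tech", ["code", "spark"]),
]
scores = chi_square_scores(docs, "sports")
```

Terms with the highest scores for a category would be kept as features. A term that appears equally often inside and outside the category, such as "team" in the toy corpus, scores zero.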
Finally, we validate the improved methods experimentally. The results indicate that they effectively improve text categorization performance. We also verify that the Spark-based parallel program for each stage improves efficiency: the results show that it reduces classification time and scales well. The Spark-based distributed, parallel text categorization method designed in this thesis improves the efficiency of preprocessing, feature selection, vectorization, and classifier construction, and it can classify large-scale text collections in a distributed, parallel fashion.
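The parallel design described above can be pictured as a per-partition aggregation followed by a merge, the same pattern that Spark's map and reduceByKey operations express. The sketch below imitates that pattern in plain Python without Spark, counting per-category document frequencies. The function names and the sample partitions are hypothetical, and this is not the thesis's actual program.

```python
from collections import Counter
from functools import reduce


def map_partition(docs):
    """Count (label, term) document frequencies within one partition."""
    counts = Counter()
    for label, tokens in docs:
        for term in set(tokens):  # each term counts once per document
            counts[(label, term)] += 1
    return counts


def merge(left, right):
    """Combine two partial counts, as a reduce step would."""
    merged = Counter(left)
    merged.update(right)
    return merged


def document_frequencies(partitions):
    """Aggregate each partition independently, then merge the results."""
    return reduce(merge, map(map_partition, partitions), Counter())


# Two partitions of a toy corpus, invented for illustration.
partitions = [
    [("sports", ["ball", "ball", "goal"])],
    [("sports", ["ball"]), ("tech", ["code"])],
]
df = document_frequencies(partitions)
```

Because each partition is processed independently and the merge is associative, the per-partition work could run on separate workers. That independence is the property that lets a framework such as Spark distribute the computation.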
Keywords/Search Tags: Text Categorization, Feature Selection, Spark, Parallel, Naive Bayes