Research On The Text Categorization Based On Spark

Posted on:2017-04-07

Degree:Master

Type:Thesis

Country:China

Candidate:S L Guang

GTID:2308330503979776

Subject:Computer Science and Technology

Abstract/Summary:

With the development of the Internet and information technology, the volume of unstructured data in text form keeps growing. Text classification organizes and processes such text effectively and is widely used in many fields. The classification process includes preprocessing, feature selection, vectorization, and other stages; each stage is costly in time and memory, so conventional techniques cannot meet the demand when faced with large amounts of text. Big data technology offers an effective solution for large-scale data processing. The parallel programming model MapReduce has limitations: it is disk-based and cannot make efficient use of intermediate results. Spark, by contrast, is memory-based, reuses intermediate results efficiently, and achieves high processing speed. This thesis uses Spark to improve the efficiency of text classification. First, we analyze the key technologies of text categorization and of Spark, and then design a parallel program on Spark for each stage of text categorization. For feature selection, we use the χ² statistic and improve it with three parameters (TF, DF, and CF). To build the classifier, we use the Naive Bayes algorithm. Our analysis shows that the most critical step is computing the probability of a term given its category, and we compute it with an improved TF-IDF method.
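The χ² feature-selection step mentioned above can be illustrated with a minimal sketch. It computes only the standard χ² statistic from a 2×2 term/category contingency table; the abstract does not specify how TF, DF, and CF enter the improved version, so that improvement is not reproduced here. The function names `chi_square` and `chi_square_scores` and the sample documents are hypothetical.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Standard chi-square statistic for one term and one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that do not contain the term
    d: documents outside the category that do not contain the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A marginal total is zero, so the term carries no usable signal.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def chi_square_scores(docs, category):
    """Score every term against one category.

    docs is a list of (label, tokens) pairs (an assumed input layout).
    """
    n_docs = len(docs)
    n_in = sum(1 for label, _ in docs if label == category)
    df_all = Counter()  # document frequency over the whole corpus
    df_in = Counter()   # document frequency inside the category
    for label, tokens in docs:
        for term in set(tokens):
            df_all[term] += 1
            if label == category:
                df_in[term] += 1
    scores = {}
    for term, df in df_all.items():
        a = df_in[term]
        b = df - a
        c = n_in - a
        d = n_docs - n_in - b
        scores[term] = chi_square(a, b, c, d)
    return scores


# Hypothetical toy corpus, for illustration only.
docs = [
    ("sports", ["ball", "game"]),
    ("sports", ["ball", "team"]),
    ("tech", ["code", "game"]),
    ("tech", ["code", "chip"]),
]
scores = chi_square_scores(docs, "sports")
```

In this toy corpus, "ball" appears only in the sports documents and receives the highest score, while "game" is spread evenly across both categories and scores zero. Terms would then be ranked by score and the top-ranked ones kept as features.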
Finally, we verify the improved methods experimentally; the results indicate that they effectively improve text categorization performance. We also verify that the Spark-based parallel program for each stage improves efficiency: the results show that it reduces classification time and is scalable. The Spark-based distributed, parallel text categorization method designed in this thesis improves the efficiency of preprocessing, feature selection, vectorization, and classifier construction, and it can classify large-scale text collections in a distributed, parallel manner.
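The idea of parallelizing each stage can be sketched with the map-and-merge pattern that such a design relies on. The example below is plain Python rather than Spark code and is not the thesis's implementation: each partition of documents is counted independently, and the partial counts are then merged. In Spark the same pattern would typically be expressed with RDD operations such as `flatMap` and `reduceByKey`. The function names and the sample partitions are hypothetical.

```python
from collections import Counter
from functools import reduce


def count_partition(partition):
    """Map side: per-category document frequencies for one partition.

    partition is a list of (label, tokens) pairs (an assumed input layout).
    """
    counts = Counter()
    for label, tokens in partition:
        for term in set(tokens):
            counts[(label, term)] += 1
    return counts


def merge_counts(left, right):
    """Reduce side: sum the partial counts key by key."""
    return left + right


def document_frequencies(partitions):
    """Count each partition independently, then merge the partial results."""
    return reduce(merge_counts, map(count_partition, partitions), Counter())


# Hypothetical partitions, standing in for data spread across workers.
partitions = [
    [("sports", ["ball", "game"])],
    [("sports", ["ball"]), ("tech", ["game"])],
]
df = document_frequencies(partitions)
```

Because the merge step is associative and commutative, partitions can be processed in any order and on any number of workers, which is the property that lets the counting stages scale out.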

Keywords/Search Tags:

Text Categorization, feature selection, Spark, parallel, Naive Bayes

Related items

1	A Study On Text Categorization Based On Machine Learning
2	The Study Of Chinese Text Categorization Based On Naive Bayes
3	Design And Implementation Of Text Classification System Based On K-neighborhood And Naive Bayesian
4	Chinese Text Data Classification
5	Text Categorization Based On Naive Bayes Method
6	The Study Of Naive Bayes Text Classification System Based On Artificial Intelligence
7	The Study Of Text Categorization Methods Based On Sense Group
8	The Research And Application Of Text Categorization Arithmetic In Spam Filtering
9	Design And Realization Of Text Categorization System
10	Data Mining Systems And Their Applications - Improve The Performance Of The Naive Bayes Text Classifier, Associated Characteristics