The Research And Application Of Text Classification Based On Cloud Computing

Posted on:2017-01-11

Degree:Master

Type:Thesis

Country:China

Candidate:J M Yan

GTID:2308330482480634

Subject:Computer technology

Abstract/Summary:

The rapid development of the Internet has generated a large amount of valuable information, much of it in the form of text. How to extract useful information from vast amounts of text data is therefore a significant research question. Text classification is an important research direction within text mining, and the implementation of the classification algorithm is a central part of it, since it affects both the classification results and the classification performance. Running a classification algorithm on a conventional computer is time-consuming and can't meet the growing demand for data processing. In this context, cloud computing platforms, which address the growing demand for massive data processing, have become increasingly popular.

Based on the current state of text classification research and the development trend of cloud computing applications, this thesis studies text classification techniques on cloud computing platforms. The main work covers the following three aspects:

(1) The theory of the Naive Bayes text classification algorithm is analyzed in depth, with particular attention to the conditional independence assumption on attributes and to how attribute-weighted Naive Bayes affects that assumption. Building on attribute weighting, the thesis proposes a cosine-similarity-weighted Bayesian classification algorithm, which uses cosine similarity to optimize the attribute weights.

(2) The parallelization of the Naive Bayes algorithm on cloud computing platforms is studied. On Hadoop, the parallel Naive Bayes algorithm is designed and implemented according to the MapReduce programming model; on Spark, it is analyzed and designed according to the memory-based computing model. Experiments then compare the performance improvement of the algorithm on the two platforms.

(3) For merchandise category classification in e-commerce, and building on the study of the Spark platform and related text classification techniques, the thesis analyzes and designs a parallel text classification process on Spark and analyzes the role of each node after job submission and task assignment. It then studies the parallel implementation of the improved Naive Bayes algorithm on Spark and gives a detailed parallelized implementation process.

Experimental results show that the improved algorithm has some advantages over the traditional method, and that memory-based computing on Spark executes the algorithm more efficiently than the MapReduce-based model on Hadoop. The improved algorithm can be ported effectively to Spark, and performing merchandise category classification on Spark effectively improves classification performance.
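
The abstract does not state the formula behind the cosine-similarity attribute weights, so the following Python sketch is only one plausible reading of the idea, not the thesis's method. It assumes that each term's weight is the largest cosine similarity between the term's document-presence vector and any class-indicator vector, and that the weight scales the term's log-likelihood in a multinomial Naive Bayes classifier with Laplace smoothing. The function names `train` and `predict` and the toy data are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def train(docs, labels):
    """Fit a weighted multinomial Naive Bayes model.

    docs is a list of token lists; labels is the parallel list of class names.
    """
    classes = sorted(set(labels))
    vocab = sorted({t for d in docs for t in d})
    term_counts = {c: Counter() for c in classes}
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc)

    # ASSUMPTION (not from the thesis): weight of a term = max over classes of
    # cosine(term presence across documents, class membership across documents).
    indicators = {c: [1.0 if y == c else 0.0 for y in labels] for c in classes}
    weights = {}
    for term in vocab:
        presence = [1.0 if term in doc else 0.0 for doc in docs]
        weights[term] = max(cosine(presence, indicators[c]) for c in classes)

    return {
        "classes": classes,
        "vocab_size": len(vocab),
        "class_docs": Counter(labels),
        "n_docs": len(docs),
        "term_counts": term_counts,
        "weights": weights,
    }


def predict(model, doc):
    """Return the class with the highest weighted log-posterior for a token list."""
    scores = {}
    for c in model["classes"]:
        total = sum(model["term_counts"][c].values())
        score = math.log(model["class_docs"][c] / model["n_docs"])
        for term in doc:
            if term not in model["weights"]:
                continue  # terms unseen in training are ignored
            # Laplace-smoothed conditional probability P(term | class)
            p = (model["term_counts"][c][term] + 1) / (total + model["vocab_size"])
            score += model["weights"][term] * math.log(p)
        scores[c] = score
    return max(scores, key=scores.get)


if __name__ == "__main__":
    docs = [["phone", "case"], ["phone", "charger"],
            ["shirt", "cotton"], ["shirt", "jeans"]]
    labels = ["electronics", "electronics", "clothing", "clothing"]
    model = train(docs, labels)
    print(predict(model, ["phone"]))
```

With all weights fixed at 1 this reduces to ordinary multinomial Naive Bayes. The weights only change how strongly each term's evidence counts, which is the sense in which attribute weighting softens the independence assumption.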

Keywords/Search Tags:

Cloud computing, Text classification, Naive Bayes, Hadoop, Spark

Related items

1	Research And Application On Naive Bayes Classification Algorithm
2	Text Categorization Based On Naive Bayes Method
3	Research On Algorithms For Naive Bayes Classification And Its Tools Based On Hadoop
4	Study Of Parallelized Text Mining Algorithm Based On Cloud Computing Framework
5	Parallel Bayesian Spam Classification System Based On Spark
6	Research On Text Classification Algorithm Based On Naive Bayes Method
7	Research And Implement On Data Mining Algorithm Parallel Based On Hadoop
8	Research And Implementation On Feature Extraction And Classification Of Chinese Text Based On SPARK
9	A Text Classifier About High Blood Pressure Based On Naive Bayes
10	The Research And Implementation Of Parallel Algorithm For Bayesian Text Classification Based Spark Computing Environment