The Research And Application Of Text Classification Based On Cloud Computing

Posted on: 2017-01-11  Degree: Master  Type: Thesis
Country: China  Candidate: J M Yan  Full Text: PDF
GTID: 2308330482480634  Subject: Computer technology
Abstract/Summary:
The rapid development of the Internet has produced a large amount of valuable information, much of it in the form of text, and extracting useful information from vast amounts of text data is an important problem. Text classification is a major research direction in text mining, and the classification algorithm is its core component, since it determines both the classification results and the classification performance. Running classification algorithms on a conventional computer is time-consuming and cannot keep up with the growing demand for data processing. Cloud computing platforms, which are built to process data at this scale, have become increasingly popular in this context.

Based on the current state of research on text classification and on trends in the application of cloud computing platforms, this thesis studies text classification techniques on cloud computing platforms. The main work covers the following three aspects:

(1) The theory of the Naive Bayes text classification algorithm is analyzed in depth, with particular attention to its conditional attribute independence assumption and to the effect that attribute-weighted Naive Bayes has on that assumption. Building on attribute weighting, the thesis proposes a cosine-similarity-weighted Bayesian classification algorithm, which improves the algorithm by using cosine similarity to optimize the attribute weights.

(2) The parallelization of the Naive Bayes algorithm on cloud computing platforms is studied. On Hadoop, the thesis designs and implements a parallel Naive Bayes algorithm according to the MapReduce programming model. On Spark, it analyzes and designs a parallel Naive Bayes algorithm according to the memory-based computing model. Experiments then compare the performance improvement of the algorithm on the two platforms.

(3) For merchandise category classification in e-commerce, and drawing on the study of the Spark platform and related text classification technologies, the thesis analyzes and designs a parallel text classification process on Spark and examines the role of each node after a job is submitted and its tasks are assigned. It then studies the parallel realization of the improved Naive Bayes algorithm on Spark and gives a detailed parallel implementation process.

Experimental results show that the improved algorithm has some advantages over the traditional method, and that memory-based computing on Spark executes the algorithm more efficiently than the MapReduce-based model on Hadoop. The improved algorithm can be ported effectively to Spark, and performing merchandise category classification on Spark effectively improves classification performance.
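The combination of MapReduce-style counting and attribute-weighted Naive Bayes described in the abstract can be illustrated with a small single-machine sketch. This is not the thesis implementation: it uses plain Python rather than Hadoop or Spark, the function names (`map_doc`, `reduce_counts`, `train`, `classify`) are hypothetical, and the abstract does not say how the cosine-similarity weights are computed, so the weights are simply passed in as a dictionary that defaults to 1.0 (ordinary multinomial Naive Bayes with add-one smoothing).

```python
import math
from collections import Counter
from functools import reduce


def map_doc(doc):
    """Map step: turn one (label, tokens) record into partial counts."""
    label, tokens = doc
    counts = Counter()
    counts[("doc", label)] += 1              # one document seen for this class
    for tok in tokens:
        counts[("term", label, tok)] += 1    # term frequency within the class
    return counts


def reduce_counts(a, b):
    """Reduce step: merge two partial count tables by summing per key."""
    merged = Counter(a)
    merged.update(b)
    return merged


def train(docs):
    """Run the map and reduce steps sequentially over all documents."""
    return reduce(reduce_counts, map(map_doc, docs), Counter())


def classify(counts, tokens, weights=None):
    """Pick the class maximizing log P(c) + sum_i w_i * log P(t_i | c).

    `weights` maps a term to its attribute weight. How the thesis derives
    these weights from cosine similarity is not given in the abstract, so
    this is an assumed interface; missing terms get weight 1.0.
    """
    weights = weights or {}
    labels = sorted(k[1] for k in counts if k[0] == "doc")
    vocab = {k[2] for k in counts if k[0] == "term"}
    n_docs = sum(counts[("doc", c)] for c in labels)
    best_label, best_score = None, -math.inf
    for c in labels:
        total = sum(v for k, v in counts.items()
                    if k[0] == "term" and k[1] == c)
        score = math.log(counts[("doc", c)] / n_docs)
        for tok in tokens:
            if tok not in vocab:
                continue                     # ignore terms never seen in training
            # Add-one (Laplace) smoothed conditional probability.
            p = (counts[("term", c, tok)] + 1) / (total + len(vocab))
            score += weights.get(tok, 1.0) * math.log(p)
        if score > best_score:
            best_label, best_score = c, score
    return best_label


# Toy merchandise data, invented for illustration only.
docs = [
    ("phone", ["smartphone", "battery", "screen"]),
    ("phone", ["smartphone", "charger"]),
    ("book", ["novel", "paperback"]),
    ("book", ["novel", "author", "screen"]),
]
counts = train(docs)
print(classify(counts, ["smartphone", "battery"]))
```

Because the reduce step only sums counts per key, the partial tables can be merged in any order, which is the property that lets the counting phase be distributed across nodes on a platform such as Hadoop or Spark.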
Keywords/Search Tags:Cloud computing, Text classification, Naive Bayes, Hadoop, Spark