The Research And Implementation Of Parallel Algorithm For Bayesian Text Classification Based Spark Computing Environment

Posted on:2020-09-18

Degree:Master

Type:Thesis

Country:China

Candidate:W Xiao

Full Text:PDF

GTID:2428330620451117

Subject:Computer Science and Technology

Abstract/Summary:

With the rapid development of the information society, the Internet has come into wide use and is now the most important source of information. In particular, with the emergence of cloud computing and the big data era, the data generated on the Internet are growing exponentially. These data are large in volume, high in dimension, complex in structure, and noisy, yet they have broad application prospects. Furthermore, most of the information stored on the Internet is text. Organizing, managing, and using these text data is a great challenge for currently limited computing power, especially when Internet applications must let users search large amounts of information effectively, quickly, and accurately.

The Naive Bayesian algorithm is one of the ten classic data mining algorithms and is widely used as the theoretical basis for text classification. As the Internet and information systems keep producing huge amounts of data, the traditional Bayesian classification algorithm runs into problems when it processes massive data sets, especially without a parallel computing framework. This paper proposes INBPCS, an improved Bayesian algorithm for text classification in the Spark computing environment, which improves the Naive Bayesian algorithm based on the multinomial model. For data preprocessing, this paper first proposes a parallel noise elimination algorithm and then a parallel dimension reduction algorithm based on Information Gain computation in the Spark environment. On the preprocessed data, an improved parallel method is proposed for calculating the conditional probability, which comprehensively considers the effects of the feature items in each document, in each class, and in the training set. In addition, this paper proposes a hybrid prediction algorithm based on multiple machine learning algorithms to improve the accuracy of Spark's memory prediction in the Shuffle phase. Finally, experiments on several widely used corpora on the Spark platform show that INBPCS achieves higher accuracy and efficiency than some currently popular text classification algorithms.
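
For orientation, the two standard building blocks named in the abstract, Information Gain feature selection and Naive Bayesian classification, can be sketched on a single machine as below. This is not the INBPCS algorithm and it does not use Spark: it is a minimal textbook baseline (multinomial Naive Bayes with Laplace smoothing, Information Gain computed on term presence). The toy corpus and all function names (`information_gain`, `select_features`, `train_multinomial_nb`, `predict`) are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def entropy(counts):
    """Shannon entropy (base 2) of a Counter of class frequencies."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)


def information_gain(docs, labels, term):
    """IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)], using term presence."""
    with_term = Counter(y for d, y in zip(docs, labels) if term in d)
    without_term = Counter(y for d, y in zip(docs, labels) if term not in d)
    n = len(docs)
    n_with = sum(with_term.values())
    conditional = (n_with / n) * entropy(with_term) \
        + ((n - n_with) / n) * entropy(without_term)
    return entropy(Counter(labels)) - conditional


def select_features(docs, labels, k):
    """Keep the k terms with the highest Information Gain (dimension reduction)."""
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t), reverse=True)
    return set(ranked[:k])


def train_multinomial_nb(docs, labels, features, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log conditional probabilities."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        term_counts[y].update(t for t in d if t in features)
    log_prior = {c: math.log(class_docs[c] / len(docs)) for c in class_docs}
    log_cond = {}
    for c in class_docs:
        total = sum(term_counts[c].values()) + alpha * len(features)
        log_cond[c] = {t: math.log((term_counts[c][t] + alpha) / total) for t in features}
    return log_prior, log_cond


def predict(model, doc):
    """Return the class with the highest posterior log score; unknown terms are skipped."""
    log_prior, log_cond = model
    scores = {
        c: log_prior[c] + sum(log_cond[c][t] for t in doc if t in log_cond[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["goal", "match", "team"],
    ["team", "win", "match"],
    ["stock", "market", "price"],
    ["market", "price", "bank"],
]
labels = ["sports", "sports", "finance", "finance"]

features = select_features(docs, labels, k=4)
model = train_multinomial_nb(docs, labels, features)
print(sorted(features))
print(predict(model, ["team", "match", "goal"]))
```

In a Spark setting, the per-class term counts and document frequencies would typically be accumulated with distributed aggregations and not with in-memory counters. How INBPCS weights feature items across document, class, and training set is described only in the full thesis and is not reproduced here.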

Keywords/Search Tags:

Big Data, Naive Bayesian Classification, Parallel Computing, Spark, Text Classification

Related items

1	Parallel Bayesian Spam Classification System Based On Spark
2	Research Of Chinese Text Classification Based On Naive Bayesian Method And Application Of Microblogging Data Classification
3	The Research And Application Of Text Classification Based On Cloud Computing
4	The Research And Implementation Of Bayesian Classification Algorithm In The Text Based On Spark Platform
5	Research Of Sentiment Analysis In Text Based On Spark
6	Research On Chinese Text Sentiment Polarity Classification Based On Naive Bayesian
7	Parallelized Text Classification Algorithm Research
8	Design And Implementation Of Text Classification System Based On K-neighborhood And Naive Bayesian
9	Text Classification Method Based On Unsupervised Clustering And Naive Bayesian Classifier
10	Research On Parallel Text Classification Algorithm Base On Random Forest And Spark