
The Research and Implementation of a Parallel Algorithm for Bayesian Text Classification Based on the Spark Computing Environment

Posted on: 2020-09-18
Degree: Master
Type: Thesis
Country: China
Candidate: W Xiao
Full Text: PDF
GTID: 2428330620451117
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the information society, the Internet has come into wide use and has become the most important source of information. In particular, with the emergence of cloud computing and the big data era, the volume of data generated on the Internet is growing exponentially. These data are large in volume, high in dimension, complex in structure, and noisy, yet they have broad application prospects. Furthermore, most of the information stored on the Internet is text. Organizing, managing, and utilizing this text is a great challenge given currently limited computing power, especially when Internet applications require users to search a large amount of information effectively, quickly, and accurately. The Naive Bayesian algorithm is one of the ten classic algorithms in data mining and is widely used as a basis for text classification. As the Internet and information systems develop at high speed, huge amounts of data are produced all the time, and problems are certain to arise when the traditional Bayesian classification algorithm processes data at this scale, especially without a parallel computing framework. This paper proposes INBPCS, an improved Bayesian algorithm for text classification in the Spark computing environment, which improves the Naive Bayesian algorithm based on the multinomial model. For data preprocessing, the paper first proposes a parallel noise elimination algorithm and then a parallel dimension reduction algorithm based on Information Gain computation in the Spark environment. On the preprocessed data, an improved parallel method is proposed for calculating the conditional probability; it comprehensively considers the effect of the feature items in each document, in each class, and in the training set. In addition, the paper proposes a hybrid prediction algorithm based on multiple machine learning algorithms to improve the accuracy of Spark's memory prediction in the Shuffle phase. Finally, experiments on several widely used corpora on the Spark platform show that INBPCS achieves higher accuracy and efficiency than some currently popular text classification algorithms.
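As a rough illustration of the two standard building blocks the abstract names, Information Gain feature selection and a multinomial Naive Bayes classifier, the following is a minimal single-machine sketch. It is not INBPCS: the abstract does not give the improved conditional probability formula or the Spark parallelization, so the code uses the textbook Laplace-smoothed estimate, and every function name and the toy corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """IG(term) = H(C) - H(C | term present or absent in the document)."""
    present = Counter(l for d, l in zip(docs, labels) if term in d)
    absent = Counter(l for d, l in zip(docs, labels) if term not in d)
    gain = entropy(list(Counter(labels).values()))
    for part in (present, absent):
        size = sum(part.values())
        if size:
            gain -= size / len(docs) * entropy(list(part.values()))
    return gain


def train_multinomial_nb(docs, labels, vocab):
    """Textbook multinomial NB with Laplace smoothing (not the thesis's variant)."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(t for t in doc if t in vocab)
    log_prior = {c: math.log(k / len(docs)) for c, k in class_docs.items()}
    log_like = {}
    for c in class_docs:
        total = sum(term_counts[c].values())
        log_like[c] = {
            t: math.log((term_counts[c][t] + 1) / (total + len(vocab)))
            for t in vocab
        }
    return log_prior, log_like


def predict(doc, log_prior, log_like):
    """Pick the class with the highest log posterior; unknown terms are skipped."""
    scores = {
        c: log_prior[c] + sum(log_like[c][t] for t in doc if t in log_like[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["spark", "cluster", "shuffle"],
    ["spark", "memory", "cluster"],
    ["goal", "match", "team"],
    ["team", "goal", "score"],
]
labels = ["tech", "tech", "sport", "sport"]

# Dimension reduction: keep the 4 terms with the highest Information Gain.
all_terms = {t for d in docs for t in d}
ranked = sorted(all_terms, key=lambda t: information_gain(docs, labels, t), reverse=True)
vocab = set(ranked[:4])

log_prior, log_like = train_multinomial_nb(docs, labels, vocab)
print(predict(["spark", "team", "cluster"], log_prior, log_like))
```

In a Spark setting, the per-class term counts and the per-term class counts are natural candidates for distributed aggregation, but how the thesis actually partitions this work is not stated in the abstract.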
Keywords/Search Tags: Big Data, Naive Bayesian Classification, Parallel Computing, Spark, Text Classification