
The Research and Implementation of a Bayesian Text Classification Algorithm Based on the Spark Platform

Posted on: 2017-06-23
Degree: Master
Type: Thesis
Country: China
Candidate: B Lu
Full Text: PDF
GTID: 2428330488979906
Subject: Computer technology
Abstract/Summary:
The rapid development of Internet and Internet of Things technology has promoted the advent of the big data era. All kinds of information are now growing exponentially, so how to manage and utilize these data quickly and efficiently has become a focus of both academia and industry. Big data has many characteristics, such as large scale, irregular structure, variety, high dimensionality and a large amount of noise. Mining valuable information from big data quickly therefore requires powerful analysis and processing capabilities, and traditional serial processing can no longer meet the time requirements of massive data processing. The rapid development of cloud computing provides favorable conditions for big data mining. Hadoop and Spark are currently the most popular parallel computing frameworks for data storage and parallel computation. Both rely on the Hadoop Distributed File System (HDFS), which offers high throughput and fault tolerance and is therefore well suited to big data mining.

This paper selects Spark as the data processing platform. Spark retains the advantages of Hadoop MapReduce while being an in-memory computing framework: Hadoop stores intermediate results in HDFS, which forces subsequent tasks to repeatedly read from and write to disk, whereas Spark avoids this heavy I/O cost. This makes Spark well suited to data mining and machine learning algorithms that require many iterations. We studied and optimized the Bayesian algorithm, one of the top ten classic data mining algorithms, established our own classification model, and implemented it in parallel on Spark. The main research includes the following aspects:

(1) Preprocessing of big data. The traditional Bayesian algorithm takes more time to process big data and delivers lower classification accuracy. Based on the characteristics of big data, this paper establishes INBCS, an improved classification model. We first denoise the original data set and then use Information Gain to reduce the dimensionality of English documents. Because Chinese differs from English in syntactic structure, semantic expression and organization, and Information Gain reduces dimensionality by treating a single word as a feature item and measuring its information entropy, that method is not applicable to Chinese documents; we instead use TextRank to reduce dimensionality by extracting keywords from Chinese documents. Finally, we eliminate data skew from the processed data set.

(2) Improvement of the posterior probability computation in the Bayesian algorithm. In standard Naive Bayes, the posterior probability of a feature item only considers a local factor, namely the proportion of the feature item within its class. It ignores global factors: the proportion of the feature item among all feature items, and the proportion of documents in the class containing the feature item among all documents in the data set containing it. This paper introduces a comprehensive coefficient that takes both local and global factors into account, as sketched below.
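The abstract does not give the exact formula for this comprehensive coefficient, so the following Python sketch only illustrates the general idea: blending the local within-class term frequency with two global factors (the term's share across all classes, and the share of term-containing documents that belong to the class). The weights `w_local`, `w_gterm`, `w_gdoc` and the smoothing constant `alpha` are illustrative assumptions, not values taken from the thesis.

```python
import math
from collections import Counter, defaultdict

def train_counts(docs):
    """docs: list of (label, [tokens]). Builds the count tables the sketch needs."""
    term_in_class = defaultdict(Counter)   # term frequency per class
    tokens_per_class = Counter()           # total tokens per class
    term_total = Counter()                 # term frequency over the whole corpus
    docs_with_term_in_class = defaultdict(Counter)  # doc frequency of term per class
    docs_with_term = Counter()             # doc frequency of term over the corpus
    class_docs = Counter()                 # number of documents per class
    for label, tokens in docs:
        class_docs[label] += 1
        term_in_class[label].update(tokens)
        tokens_per_class[label] += len(tokens)
        term_total.update(tokens)
        for t in set(tokens):
            docs_with_term_in_class[label][t] += 1
            docs_with_term[t] += 1
    return (term_in_class, tokens_per_class, term_total,
            docs_with_term_in_class, docs_with_term, class_docs)

def log_posterior(tokens, label, counts, weights=(0.6, 0.2, 0.2), alpha=1.0):
    """Illustrative blend of one local and two global factors (weights are made up)."""
    term_in_class, tokens_per_class, term_total, dfc, df, class_docs = counts
    n_classes = len(class_docs)
    vocab = len(term_total)
    w_local, w_gterm, w_gdoc = weights
    score = math.log(class_docs[label] / sum(class_docs.values()))  # class prior
    for t in tokens:
        # Local factor: the term's share of all tokens in this class.
        local = (term_in_class[label][t] + alpha) / (tokens_per_class[label] + alpha * vocab)
        # Global factor 1: the share of this term's occurrences that fall in this class.
        g_term = (term_in_class[label][t] + alpha) / (term_total[t] + alpha * n_classes)
        # Global factor 2: the share of term-containing documents that belong to this class.
        g_doc = (dfc[label][t] + alpha) / (df[t] + alpha * n_classes)
        score += math.log(w_local * local + w_gterm * g_term + w_gdoc * g_doc)
    return score
```

A document would then be assigned to the class with the highest `log_posterior` score, exactly as in standard Naive Bayes; only the per-term factor changes.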
(3) Implementation of the improved INBCS model on Spark. A Spark cluster is used to parallelize the improved model, and we evaluate several performance indicators of the model on Spark, including the comprehensive coefficient, precision, recall, F1 value, time cost and speedup. The results show that the improved model outperforms the other classification algorithms compared, and that Spark has an obvious advantage when dealing with large-scale data.
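For context, the sketch below shows how a standard (unimproved) multinomial Naive Bayes text classifier is trained in parallel with Spark's built-in MLlib pipeline API. It is a baseline illustration under assumed column names ("text", "label") and a placeholder input path, not the INBCS model described above.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("nb-text-baseline").getOrCreate()

# Assumed input: JSON documents with a string column "text" and a numeric
# column "label"; the path and schema are placeholders, not from the thesis.
data = spark.read.json("hdfs:///data/docs.json")

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),               # whitespace tokenization
    HashingTF(inputCol="words", outputCol="tf", numFeatures=1 << 18),
    IDF(inputCol="tf", outputCol="features"),                    # TF-IDF weighting
    NaiveBayes(smoothing=1.0, modelType="multinomial",
               featuresCol="features", labelCol="label"),
])

train, test = data.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)            # distributed training on the Spark cluster
predictions = model.transform(test)

f1 = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="f1"
).evaluate(predictions)
print(f"F1 on held-out split: {f1:.4f}")

spark.stop()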
Keywords/Search Tags: Cloud Computing, Big Data, Hadoop, Spark, Data Mining, Bayesian Algorithm