
Research on the Improvement of Tree-Augmented Naïve Bayes and Its Parallelization

Posted on: 2019-05-11  Degree: Master  Type: Thesis
Country: China  Candidate: K Zhang  Full Text: PDF
GTID: 2428330572495087  Subject: Communication and Information System
Abstract/Summary:
In the Internet era, data is both the oil and the portrait of every industry. Classification algorithms are an effective means of mining valuable information from massive data efficiently. Research on them focuses mainly on two aspects: optimizing the performance of the algorithm itself, and studying its scalability on big data processing platforms. Naive Bayes (NB) is limited by its strong conditional independence assumption, whereas Tree-Augmented Naive Bayes (TAN) is a simple and effective Bayesian network classifier that often achieves better accuracy than NB while keeping a simple structure. Based on the structural characteristics of TAN, this thesis studies the network structure learning of TAN and proposes a parallel design scheme on the Spark platform.

(1) The traditional TAN only initializes the network structure on the set of all attributes and does not take into account the differences in correlation between individual attributes and the class, which lowers classification accuracy. Based on an analysis of Bayesian network structure learning, we propose a learning method that constructs a TAN classifier with an improved BIC scoring function. The experimental results show that this method effectively extends the TAN structure and eliminates redundant attributes. The learned model, SETAN, has the same time complexity as TAN. Compared with NB and TAN, the average classification accuracy of SETAN increases by 3.5% and 5.7%, respectively.

(2) We implement the construction of the SETAN model on the Spark platform. Based on the characteristics of SETAN, we describe in detail a parallel construction scheme for SETAN on Spark and provide a corresponding resource optimization scheme. The experimental results show that the parallel SETAN scheme scales well and can effectively handle large-scale data.
Keywords/Search Tags: Tree-Augmented Naïve Bayes, Scoring function, Spark, Bayesian classifier
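
For orientation, the classical TAN structure-learning step that the abstract takes as its baseline can be sketched roughly as follows. This is not the SETAN method or the improved BIC scoring function of the thesis, whose details the abstract does not give. It is a minimal illustration of the standard procedure: weight each attribute pair by the conditional mutual information given the class, build a maximum spanning tree, and direct it away from a root attribute. The function names and the choice of attribute 0 as root are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log


def conditional_mutual_information(rows, labels, i, j):
    """Empirical I(X_i; X_j | C) for discrete attributes i and j."""
    n = len(rows)
    n_xyc = Counter((r[i], r[j], c) for r, c in zip(rows, labels))
    n_xc = Counter((r[i], c) for r, c in zip(rows, labels))
    n_yc = Counter((r[j], c) for r, c in zip(rows, labels))
    n_c = Counter(labels)
    cmi = 0.0
    for (x, y, c), count in n_xyc.items():
        # P(x,y,c) * log( P(x,y,c) P(c) / (P(x,c) P(y,c)) );
        # the 1/n factors cancel inside the logarithm.
        cmi += (count / n) * log(count * n_c[c] / (n_xc[(x, c)] * n_yc[(y, c)]))
    return cmi


def learn_tan_parents(rows, labels, root=0):
    """Return {attribute: attribute parent or None}.

    The class node is an implicit extra parent of every attribute.
    """
    d = len(rows[0])
    weight = {}
    for i, j in combinations(range(d), 2):
        w = conditional_mutual_information(rows, labels, i, j)
        weight[(i, j)] = weight[(j, i)] = w

    # Prim's algorithm for a maximum spanning tree, grown from the root,
    # so each new node's parent is the tree node it was attached to.
    parents = {root: None}
    while len(parents) < d:
        u, v = max(
            ((u, v) for u in parents for v in range(d) if v not in parents),
            key=lambda edge: weight[edge],
        )
        parents[v] = u
    return parents
```

Calling `learn_tan_parents(rows, labels)` on a list of discrete attribute tuples and a parallel list of class labels returns the attribute-to-attribute edges of the tree. In a sketch like this, the pairwise counting in `conditional_mutual_information` is the part that would most naturally be distributed across partitions on a platform such as Spark.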