
Research And Implementation Of Hadoop Platform Spam Filtering Algorithm

Posted on: 2019-10-05  Degree: Master  Type: Thesis
Country: China  Candidate: F Chong  Full Text: PDF
GTID: 2428330545970704  Subject: Electronic and communication engineering
Abstract/Summary:
In the information age, e-mail has become one of the most important means of communication in daily life, and its growth has been accompanied by a growing volume of spam. Traditional spam filtering techniques, such as black-and-white lists and keyword-based filtering, can filter spam to a degree. However, the huge number of users generates so much mail every minute, and the types of mail have become so varied, that these traditional methods can no longer cope. Cloud data mining combines data mining technology with cloud computing technology. Running data mining on a cloud platform is a good way to overcome the bottlenecks that big data brings. Cloud data mining not only improves the flexibility and efficiency of spam filtering, but also makes it possible to filter massive volumes of mail data. This thesis applies cloud data mining technology to spam filtering and takes the Bayesian mail filtering algorithm as its object of study. After an in-depth study of the core technologies the Hadoop platform provides for massive data processing, the Bayesian mail filtering algorithm is optimized to address two drawbacks of the traditional distributed Bayesian algorithm: low efficiency and a resource-intensive early training stage. Decision rules are derived from the result set of the mail to be filtered and generated through a decision table; mail is then filtered according to the corresponding rules together with the Bayesian algorithm, which greatly reduces the false positive rate. The optimized algorithm is then implemented on the MapReduce model of the open-source Hadoop cloud architecture, parallelizing the processing of large volumes of mail so that filtering accuracy is improved while filtering efficiency also improves. Experimental results show that the MapReduce-based Bayesian mail filtering model maintains good recall, precision, and true positive rate while improving the efficiency of the filter.
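To make the idea of Bayesian filtering under a MapReduce model more concrete, the following is a minimal sketch in plain Python. It is not the optimized algorithm of the thesis and does not use Hadoop or the decision-table rules: it assumes a standard naive Bayes classifier with add-one smoothing, and it simulates the map and reduce phases in a single process. All names (`map_mail`, `reduce_counts`, `train`, `classify`, the `__DOC__` marker) are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict
from itertools import chain

DOC_MARKER = "__DOC__"  # hypothetical pseudo-word used to count mails per label


def map_mail(label, text):
    """Map phase: emit ((label, word), 1) for each word, plus one mail count."""
    yield (label, DOC_MARKER), 1
    for word in text.lower().split():
        yield (label, word), 1


def reduce_counts(pairs):
    """Reduce phase: sum the values that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def train(mails):
    """Apply the map step to every (label, text) mail, then reduce all pairs."""
    return reduce_counts(
        chain.from_iterable(map_mail(label, text) for label, text in mails)
    )


def classify(counts, text, labels=("spam", "ham")):
    """Return the label with the highest naive Bayes log-probability."""
    vocab = {word for (_, word) in counts if word != DOC_MARKER}
    total_mails = sum(counts.get((label, DOC_MARKER), 0) for label in labels)
    best_label, best_score = None, -math.inf
    for label in labels:
        mails_in_label = counts.get((label, DOC_MARKER), 0)
        if mails_in_label == 0:
            continue  # no training data for this label
        words_in_label = sum(
            c for (l, w), c in counts.items() if l == label and w != DOC_MARKER
        )
        score = math.log(mails_in_label / total_mails)  # log prior
        for word in text.lower().split():
            c = counts.get((label, word), 0)
            # Add-one (Laplace) smoothing avoids log(0) for unseen words.
            score += math.log((c + 1) / (words_in_label + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

In this sketch, training reduces to summing counts per key, so the map step can run independently on each mail and the partial sums can be merged in any order. That independence is the property that lets the counting stage be spread across many nodes in a MapReduce job.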
Keywords/Search Tags:Data mining, Hadoop, MapReduce model, Bayesian algorithm, spam filter