
Spam Filtering Techniques, Based On Data Mining

Posted on: 2010-12-14
Degree: Master
Type: Thesis
Country: China
Candidate: Z B Wang
GTID: 2208360278979051
Subject: Computer software and theory
Abstract/Summary:
Convenient, instant, and low-cost, e-mail has become one of the most popular means of communication and information exchange. However, spam has become a serious nuisance for users. Spam messages, which typically carry commercial advertising, malicious programs, or unhealthy content, consume network bandwidth, waste users' time and money, and affect their mood. Spam filtering is therefore a pressing problem, and filtering technologies are continuously being developed and improved.

Among the currently popular spam filtering technologies, Naive Bayes (NB) is a simple statistics-based machine learning method. It learns by extracting the features of spam and ham (normal, legitimate e-mails) from a training set and then constructing a statistical model (the classifier). For each newly arriving e-mail, the classifier estimates the probability that it is spam and the probability that it is ham, and assigns the message to the class with the higher probability. NB is widely used in spam filtering because it is fast and simple to implement.

The key problems in Naive Bayes spam filtering (NBF) models are the formal representation of mails, feature selection, and the probability computation model. Building on an analysis of the classic method, this thesis studies these problems in depth and introduces the clustering methods of text mining, a branch of data mining, into the traditional filter model, thereby effectively improving it.

In spam filtering, users would rather receive some additional junk e-mail than have a normal e-mail wrongly classified as spam. White list technology has a shortcoming in this respect: normal e-mails from senders who are not on the white list may be blocked.
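The Naive Bayes classification step described in this abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the whitespace tokenizer, the Laplace smoothing parameter `alpha`, the class name `NaiveBayesFilter`, and the sample messages are all assumptions made for illustration. The sketch learns word counts from a labelled training set and assigns a new message to the class with the higher (log) probability.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokens stand in for real feature extraction.
    return text.lower().split()


class NaiveBayesFilter:
    """Toy multinomial Naive Bayes classifier with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant (assumed, not from the thesis)
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}
        self.vocab = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def log_score(self, text, label):
        # log P(label) + sum of log P(token | label); requires at least one
        # training message per class so that the prior is non-zero.
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        denom = total_words + self.alpha * len(self.vocab)
        for tok in tokenize(text):
            if tok in self.vocab:  # tokens never seen in training are skipped
                count = self.word_counts[label][tok]
                score += math.log((count + self.alpha) / denom)
        return score

    def classify(self, text):
        # "ham" is listed first, so an exact tie is resolved in favour of ham.
        return max(("ham", "spam"), key=lambda c: self.log_score(text, c))


nb = NaiveBayesFilter()
nb.train("cheap pills buy now", "spam")
nb.train("win money now", "spam")
nb.train("meeting agenda for monday", "ham")
nb.train("lunch on monday", "ham")
print(nb.classify("buy cheap pills"))
print(nb.classify("agenda for monday"))
```

Working in log space avoids numerical underflow when many per-word probabilities are multiplied together. Resolving ties in favour of ham reflects the preference stated above for letting some spam through over blocking a normal e-mail.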
This thesis studies the white list problem thoroughly and proposes a two-tier spam filtering model based on white list technology and Bayesian classification.

The thesis first outlines the state of spam filtering research, including the definition of spam, its hazards, its characteristics, and popular filtering techniques. It then analyses the key issues of the traditional NBF model in depth. Finally, to overcome the shortcomings of white list technology, reduce the number of features in the Bayesian filtering model, and improve on the traditional NBF model, it proposes a two-tier spam filtering model that applies feature clustering to reduce the dimensionality of the feature space.
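One plausible reading of the two-tier idea can be sketched as follows. The abstract does not specify how the tiers interact, so the ordering here (white list first, content classifier second) is an assumption, as are the function names `two_tier_filter` and `toy_classify` and the sample addresses. The keyword-based `toy_classify` is only a stand-in for the Bayesian classifier.

```python
def toy_classify(body):
    # Hypothetical stand-in for the Bayesian tier: flags a few spam-like words.
    spam_words = {"win", "money", "pills"}
    tokens = set(body.lower().split())
    return "spam" if tokens & spam_words else "ham"


def two_tier_filter(sender, body, whitelist, classify):
    """Return "ham" or "spam" for one message.

    Tier 1: senders on the white list are accepted without content analysis.
    Tier 2: all other messages are passed to the content classifier instead
            of being blocked outright, so unknown senders are not rejected
            merely for being absent from the white list.
    """
    if sender.strip().lower() in whitelist:
        return "ham"
    return classify(body)


whitelist = {"alice@example.com"}
print(two_tier_filter("Alice@Example.com", "win money now", whitelist, toy_classify))
print(two_tier_filter("bob@example.org", "lunch on monday", whitelist, toy_classify))
print(two_tier_filter("eve@example.net", "win money now", whitelist, toy_classify))
```

Under these assumptions, the second tier addresses the white list shortcoming noted in the abstract: a normal e-mail from an unlisted sender is judged on its content and is not blocked by default.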
Keywords/Search Tags: Spam Filtering, Naive Bayes, Feature Selection, Feature Clustering, Similarity